Serverless AI API
The nature of AI and LLM workloads on already trained models lends itself naturally to a serverless-style architecture. As a framework for building and deploying serverless applications, Spin provides an interface for performing AI inference from within Spin applications.
Using Serverless AI From Applications
Configuration
By default, a given component of a Spin application will not have access to any Serverless AI models. Access must be granted explicitly via the Spin application's manifest (the `spin.toml` file). For example, an individual component in a Spin application could be given access to the `codellama-instruct` model by adding the following `ai_models` configuration inside the specific `[component.(name)]` section:
```toml
# -- snip --
[component.please-send-the-codes]
ai_models = ["codellama-instruct"]
# -- snip --
```
Spin supports models of the Llama architecture for inferencing, and the `all-minilm-l6-v2` model for generating embeddings.
File Structure
By default, the Spin framework expects any already trained model files (configured as per the previous section) to be downloaded by the user and made available inside a `.spin/ai-models/` directory of the given application.
Within the `.spin/ai-models` directory, models of the same architecture (e.g. `llama`) must be grouped under a directory named after that architecture. Within an architecture directory, each individual model (e.g. `llama2-chat`, `codellama-instruct`) must be placed under a folder named after that model. So, for any given model, the files for the model are placed in the directory `.spin/ai-models/<architecture>/<model>`. For example:

```
code-generator-rs/.spin/ai-models/llama/codellama-instruct/safetensors
code-generator-rs/.spin/ai-models/llama/codellama-instruct/config.json
code-generator-rs/.spin/ai-models/llama/codellama-instruct/tokenizer.json
```
See the Serverless AI Tutorial documentation for more concrete examples of implementing the Fermyon Serverless AI API in your favorite language.
For embeddings models, both a `tokenizer.json` and a `model.safetensors` file are expected to be located in the directory named after the model. For example, for the `foo-bar-baz` model, Spin will look in the `.spin/ai-models/foo-bar-baz` directory for `tokenizer.json` and `model.safetensors`.
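For instance, for the `all-minilm-l6-v2` embeddings model mentioned above, the expected layout (reusing the hypothetical application directory from the earlier example) would be:

```
code-generator-rs/.spin/ai-models/all-minilm-l6-v2/tokenizer.json
code-generator-rs/.spin/ai-models/all-minilm-l6-v2/model.safetensors
```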
Serverless AI Interface
The Spin SDK surfaces the Serverless AI interface to a variety of different languages. See the Language Support Overview to see if your specific language is supported.
The set of operations is common across all supported language SDKs:
| Operation | Parameters | Returns | Behavior |
| --- | --- | --- | --- |
| `infer` | model: `string`<br>prompt: `string` | `string` | The `infer` operation is performed on a specific model. The name of the model is the first parameter (e.g. `llama2-chat`, `codellama-instruct`, or other), passed in as a `string`. The second parameter is a prompt, passed in as a `string`. |
| `infer_with_options` | model: `string`<br>prompt: `string`<br>params: `list` | `string` | The `infer_with_options` operation is performed on a specific model. The name of the model is the first parameter (e.g. `llama2-chat`, `codellama-instruct`, or other), passed in as a `string`. The second parameter is a prompt, passed in as a `string`. The third parameter is a mix of float and unsigned integer inferencing parameters, in this order:<br>- `max-tokens` (unsigned 32-bit integer). Note: the backing implementation may return fewer tokens. Default is 100.<br>- `repeat-penalty` (32-bit float): the amount the model should avoid repeating tokens. Default is 1.1.<br>- `repeat-penalty-last-n-token-count` (unsigned 32-bit integer): the number of tokens the model should apply the repeat penalty to. Default is 64.<br>- `temperature` (32-bit float): the randomness with which the next token is selected. Default is 0.8.<br>- `top-k` (unsigned 32-bit integer): the number of possible next tokens the model will choose from. Default is 40.<br>- `top-p` (32-bit float): the probability total of next tokens the model will choose from. Default is 0.9.<br>The result from `infer_with_options` is a `string`. |
| `generate-embeddings` | model: `string`<br>prompt: `list<string>` | `list<list<float32>>` | The `generate-embeddings` operation is performed on a specific model. The name of the model is the first parameter (e.g. `all-minilm-l6-v2`), passed in as a `string`. The second parameter is a prompt, passed in as a `list` of `string`s. The result from `generate-embeddings` is a two-dimensional array containing `float32` values only. |
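As a quick illustration of the simplest of these operations, here is a short sketch of calling `infer` with only a model name and a prompt, so the default parameters listed above apply. It uses the Rust SDK's `llm` module (shown in full in the Rust example further down); the prompt text and the surrounding function are illustrative only, so check the reference documentation for your language for the exact signature.

```rust
use spin_sdk::llm;

// A minimal sketch, not a complete component: run an inference against the
// built-in llama2-chat model using the default parameters from the table above.
// The prompt text is purely illustrative.
fn tell_a_joke() -> anyhow::Result<String> {
    let result = llm::infer(
        llm::InferencingModel::Llama2Chat,
        "Tell me a one-line joke about serverless computing.",
    )?;
    // The inferencing result record contains the generated text.
    Ok(result.text)
}
```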
The exact details of calling these operations from your application depend on your language:
Want to go straight to the reference documentation? Find it here.
To use Serverless AI functions, the `llm` module from the Spin SDK provides the methods. The following snippet is from the Rust code generation example:
```rust
use spin_sdk::{
    http::{IntoResponse, Request, Response},
    llm,
};

// -- snip --

fn handle_code(req: Request) -> anyhow::Result<impl IntoResponse> {
    // -- snip --
    let result = llm::infer_with_options(
        llm::InferencingModel::CodellamaInstruct,
        &prompt,
        llm::InferencingParams {
            max_tokens: 400,
            repeat_penalty: 1.1,
            repeat_penalty_last_n_token_count: 64,
            temperature: 0.8,
            top_k: 40,
            top_p: 0.9,
        },
    )?;
    // -- snip --
}
```
General Notes
The `infer_with_options` operation:
- The above example takes the model name `llm::InferencingModel::CodellamaInstruct` as input. From an interface point of view, the model name is technically an alias for a string (to maximize future compatibility as users want to support more and different types of models).
- The second parameter is a prompt (string) from whoever/whatever is making the request to the `handle_code()` function.
- A third, optional, parameter, which is an interface, allows you to specify parameters such as `max_tokens`, `repeat_penalty`, `repeat_penalty_last_n_token_count`, `temperature`, `top_k` and `top_p`.
- The return value (the `inferencing-result` record) contains a text field of type `string`. Ideally, this would be a `stream` that would allow streaming inferencing results back to the user, but streaming support is not yet ready for use, so we leave that as a possible future backward-incompatible change.
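The Rust example above only covers inferencing. The sketch below shows the corresponding `generate-embeddings` operation from the same `llm` module; it assumes the Rust SDK exposes this as `llm::generate_embeddings` with an `EmbeddingModel` enum and a result whose `embeddings` field is a `Vec<Vec<f32>>`, so check the Rust reference documentation for the exact names.

```rust
use spin_sdk::llm;

// A sketch under the assumptions stated above: generate embeddings for two
// pieces of text with the built-in all-minilm-l6-v2 model.
fn embed_sentences() -> anyhow::Result<Vec<Vec<f32>>> {
    let sentences = vec![
        "A comedian writes jokes".to_string(),
        "A developer writes code".to_string(),
    ];
    let result = llm::generate_embeddings(llm::EmbeddingModel::AllMiniLmL6V2, &sentences)?;
    // One vector of 32-bit floats per input string.
    Ok(result.embeddings)
}
```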
Want to go straight to the reference documentation? Find it here.
To use Serverless AI functions, the `Llm` module from the Spin SDK provides two methods: `infer` and `generateEmbeddings`. For example:
```typescript
import { ResponseBuilder, Llm } from "@fermyon/spin-sdk";

export async function handler(req: Request, res: ResponseBuilder) {
  // Example prompt, hard-coded so the snippet is self-contained
  const prompt = "You are a stand up comedy writer. Tell me a joke.";
  let embeddings = Llm.generateEmbeddings(Llm.EmbeddingModels.AllMiniLmL6V2, ["someString"]);
  console.log(embeddings.embeddings);
  let result = Llm.infer(Llm.InferencingModels.Llama2Chat, prompt);
  res.set({ "content-type": "text/plain" });
  res.send(result.text);
}
```
General Notes
`infer` operation:
- It takes in the following arguments: model name, prompt, and an optional third parameter for inferencing options.
- The model name is a string. There are enums for the inbuilt models (llama2-chat and codellama) in `InferencingModels`.
- The optional third parameter, which is an `InferencingOptions` interface, allows you to specify parameters such as `maxTokens`, `repeatPenalty`, `repeatPenaltyLastNTokenCount`, `temperature`, `topK`, `topP`.
- The return value is an `InferenceResult`.

`generateEmbeddings` operation:
- It takes two arguments: model name and the list of strings to generate the embeddings for.
- The model name is a string. There are enums for the inbuilt models (AllMiniLmL6V2) in `EmbeddingModels`.
- The return value is an `EmbeddingResult`.
Want to go straight to the reference documentation? Find it here.
```python
from spin_sdk import http
from spin_sdk.http import Request, Response
from spin_sdk import llm

class IncomingHandler(http.IncomingHandler):
    def handle_request(self, request: Request) -> Response:
        prompt = "You are a stand up comedy writer. Tell me a joke."
        result = llm.infer("llama2-chat", prompt)
        return Response(200,
                        {"content-type": "application/json"},
                        bytes(result.text, "utf-8"))
```
General Notes
- The model name is passed in as a string (as shown above; `"llama2-chat"`).

`infer_with_options` operation:
- It takes in a model name, prompt text, and optionally a parameter object to control the inferencing.
Want to go straight to the reference documentation? Find it here.
Serverless AI functions are available in the `github.com/fermyon/spin/sdk/go/v2/llm` package. See Go Packages for reference documentation. For example:
```go
package main

import (
	"fmt"
	"net/http"

	spinhttp "github.com/fermyon/spin/sdk/go/v2/http"
	"github.com/fermyon/spin/sdk/go/v2/llm"
)

func init() {
	spinhttp.Handle(func(w http.ResponseWriter, r *http.Request) {
		result, err := llm.Infer("llama2-chat", "What is a good prompt?", nil)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Printf("Prompt tokens: %d\n", result.Usage.PromptTokenCount)
		fmt.Printf("Generated tokens: %d\n", result.Usage.GeneratedTokenCount)
		fmt.Fprintf(w, "%s\n", result.Text)

		embeddings, err := llm.GenerateEmbeddings("all-minilm-l6-v2", []string{"Hello world"})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Printf("Prompt Tokens: %d\n", embeddings.Usage.PromptTokenCount)
		fmt.Printf("%v\n", embeddings.Embeddings)
	})
}

// main must be present for the TinyGo build; the handler is registered in init.
func main() {}
```
General Notes
`Infer` operation:
- It takes in the following arguments: model name, prompt, and an optional third parameter for inferencing options (pass `nil` if you don't want to specify it).
- The model name is a string.
- The params allow you to specify `MaxTokens`, `RepeatPenalty`, `RepeatPenaltyLastNTokenCount`, `Temperature`, `TopK`, `TopP`.
- It returns a result struct with a `Text` field that contains the answer and a `Usage` field that contains metadata about the operation.

`GenerateEmbeddings` operation:
- It takes two arguments: model name and the list of strings to generate the embeddings for.
- The model name is a string: `all-minilm-l6-v2`.
- It returns a result struct with an `Embeddings` field that contains the `[][]float32` embeddings and a `Usage` field that contains metadata about the operation.
Troubleshooting
Error “Local LLM operations are not supported in this version of Spin”
If you see “Local LLM operations are not supported in this version of Spin”, then your copy of Spin has been built without local LLM support.
The term “version” in the error message refers to how the Spin runtime you are using was built, not to its numeric version.
Most Spin builds support local LLMs as described above. However, the local LLM support built into Spin does not build on some combinations of platform (for example, there are known problems with the aarch64/musl combination). This may cause some environments that embed Spin to disable the local LLM feature altogether. (For example, some versions of the `containerd-spin-shim` did this.) In such cases, you will see the error above.
If this happens, you can:
- See if there is another Spin build available for your platform. All current builds from the Spin GitHub repository or the Fermyon Spin installer support local LLMs.
- Use the `cloud-gpu` plugin and runtime config option to have LLM inferencing serviced in Fermyon Cloud instead of locally.