OpenAI Compatibility
The vLLM Worker is fully compatible with OpenAI's API: the same client code you use with OpenAI works against the vLLM Worker, making migration and integration seamless.
Conventions
Completions Endpoint
The completions endpoint provides the completion for a single prompt and takes a single string as input.
Chat Completions
The chat completions endpoint provides responses for a given dialog and requires the input in a specific format corresponding to the message history.
Choose the convention that best suits your use case.
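As a quick illustration, the two conventions expect differently shaped request bodies. The payloads below are a minimal sketch; the model name and prompts are placeholders:

```python
# Sketch of the two request shapes; "MODEL_NAME_VALUE" and the prompts are placeholders.

# Completions: a single prompt string to be continued.
completion_request = {
    "model": "MODEL_NAME_VALUE",
    "prompt": "Write a haiku about GPUs.",
    "max_tokens": 64,
}

# Chat Completions: a list of role/content messages representing the dialog history.
chat_request = {
    "model": "MODEL_NAME_VALUE",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs."},
    ],
}
```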
Model Names
The MODEL_NAME environment variable is required for all requests. Use this value when making requests to the vLLM Worker.
Examples of model names include openchat/openchat-3.5-0106, mistral:latest, and llama2:70b.
To generate a response for a given prompt with a provided model, use the streaming endpoint. This will produce a series of responses, with the final response object including statistics and additional data from the request.
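For illustration, a request body might reference the deployed model like this; the model name below is an example only:

```python
# The "model" field must match the MODEL_NAME environment variable configured
# on your endpoint; "openchat/openchat-3.5-0106" is only an example.
MODEL_NAME = "openchat/openchat-3.5-0106"

request = {
    "model": MODEL_NAME,
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": True,  # stream a series of responses for the prompt
}
```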
Parameters
Chat Completions
When using the chat completion feature of the vLLM Serverless Endpoint Worker, you can customize your requests with the following parameters:
Supported Chat Completions Inputs and Descriptions
| Parameter | Type | Default | Description |
|---|---|---|---|
| messages | Union[str, List[Dict[str, str]]] |  | List of messages, where each message is a dictionary with a role and content. The model's chat template is applied to the messages automatically, so the model must have one, or you must provide one via the CUSTOM_CHAT_TEMPLATE environment variable. |
| model | str |  | The model repo that you've deployed on your SubModel Serverless Endpoint. If you are unsure of the name, or you are baking the model into the image, list the available models as shown in the Examples: Using your SubModel endpoint with OpenAI section. |
| temperature | Optional[float] | 0.7 | Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make it more random. Zero means greedy sampling. |
| top_p | Optional[float] | 1.0 | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| n | Optional[int] | 1 | Number of output sequences to return for the given prompt. |
| max_tokens | Optional[int] | None | Maximum number of tokens to generate per output sequence. |
| seed | Optional[int] | None | Random seed to use for the generation. |
| stop | Optional[Union[str, List[str]]] | list | List of strings that stop the generation when they are generated. The returned output will not contain the stop strings. |
| stream | Optional[bool] | False | Whether to stream the output. |
| presence_penalty | Optional[float] | 0.0 | Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. |
| frequency_penalty | Optional[float] | 0.0 | Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. |
| logit_bias | Optional[Dict[str, float]] | None | Unsupported by vLLM. |
| user | Optional[str] | None | Unsupported by vLLM. |
Additional Parameters Supported by vLLM
| Parameter | Type | Default | Description |
|---|---|---|---|
| best_of | Optional[int] | None | Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width when use_beam_search is True. By default, best_of is set to n. |
| top_k | Optional[int] | -1 | Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens. |
| ignore_eos | Optional[bool] | False | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. |
| use_beam_search | Optional[bool] | False | Whether to use beam search instead of sampling. |
| stop_token_ids | Optional[List[int]] | list | List of token IDs that stop the generation when they are generated. The returned output will contain the stop tokens unless they are special tokens. |
| skip_special_tokens | Optional[bool] | True | Whether to skip special tokens in the output. |
| spaces_between_special_tokens | Optional[bool] | True | Whether to add spaces between special tokens in the output. |
| echo | Optional[bool] | False | Whether to echo back the prompt in addition to the completion. |
| repetition_penalty | Optional[float] | 1.0 | Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. |
| min_p | Optional[float] | 0.0 | Float that represents the minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. |
| length_penalty | Optional[float] | 1.0 | Float that penalizes sequences based on their length. Used in beam search. |
| include_stop_str_in_output | Optional[bool] | False | Whether to include the stop strings in the output text. |
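For example, here is a hedged sketch of a chat completion request that sets several of the parameters above. It assumes an OpenAI client already configured for your SubModel endpoint (see Initialize Your Project below) and passes the vLLM-specific options through the client's extra_body field, on the assumption that the worker accepts them as additional JSON fields:

```python
# Sketch only: "client" and MODEL_NAME are set up as in "Initialize Your Project" below.
response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    temperature=0.7,
    max_tokens=200,
    stop=["\n\n"],
    presence_penalty=0.5,
    # vLLM-specific parameters are not part of the OpenAI client signature,
    # so they are sent as extra JSON fields (assumed to be accepted by the worker).
    extra_body={"top_k": 40, "min_p": 0.05, "repetition_penalty": 1.1},
)
print(response.choices[0].message.content)
```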
Completions
When using the completions feature, you can customize your requests with the following parameters:
Supported Completions Inputs and Descriptions
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str |  | The model repo that you've deployed on your SubModel Serverless Endpoint. If you are unsure of the name, or you are baking the model into the image, list the available models as shown in the Examples: Using your SubModel endpoint with OpenAI section. |
| prompt | Union[List[int], List[List[int]], str, List[str]] |  | A string, array of strings, array of tokens, or array of token arrays to be used as the input for the model. |
| suffix | Optional[str] | None | A string to be appended to the end of the generated text. |
| max_tokens | Optional[int] | 16 | Maximum number of tokens to generate per output sequence. |
| temperature | Optional[float] | 1.0 | Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make it more random. Zero means greedy sampling. |
| top_p | Optional[float] | 1.0 | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| n | Optional[int] | 1 | Number of output sequences to return for the given prompt. |
| stream | Optional[bool] | False | Whether to stream the output. |
| logprobs | Optional[int] | None | Number of log probabilities to return per output token. |
| echo | Optional[bool] | False | Whether to echo back the prompt in addition to the completion. |
| stop | Optional[Union[str, List[str]]] | list | List of strings that stop the generation when they are generated. The returned output will not contain the stop strings. |
| seed | Optional[int] | None | Random seed to use for the generation. |
| presence_penalty | Optional[float] | 0.0 | Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. |
| frequency_penalty | Optional[float] | 0.0 | Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. |
| best_of | Optional[int] | None | Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This parameter influences the diversity of the output. |
| logit_bias | Optional[Dict[str, float]] | None | Dictionary of token IDs to biases. |
| user | Optional[str] | None | User identifier for personalizing responses. (Unsupported by vLLM) |
Additional Parameters Supported by vLLM
| Parameter | Type | Default | Description |
|---|---|---|---|
| top_k | Optional[int] | -1 | Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens. |
| ignore_eos | Optional[bool] | False | Whether to ignore the EOS (end-of-sequence) token and continue generating tokens after it is generated. |
| use_beam_search | Optional[bool] | False | Whether to use beam search instead of sampling. |
| stop_token_ids | Optional[List[int]] | list | List of token IDs that stop the generation when they are generated. The returned output will contain the stop tokens unless they are special tokens. |
| skip_special_tokens | Optional[bool] | True | Whether to skip special tokens in the output. |
| spaces_between_special_tokens | Optional[bool] | True | Whether to add spaces between special tokens in the output. |
| repetition_penalty | Optional[float] | 1.0 | Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. |
| min_p | Optional[float] | 0.0 | Float that represents the minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. |
| length_penalty | Optional[float] | 1.0 | Float that penalizes sequences based on their length. Used in beam search. |
| include_stop_str_in_output | Optional[bool] | False | Whether to include the stop strings in the output text. |
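Similarly, here is a hedged sketch of a completions request using a few of these parameters, with the same client and MODEL_NAME assumptions as above and the vLLM-specific options again passed via extra_body:

```python
# Sketch only: "client" and MODEL_NAME are set up as in "Initialize Your Project" below.
response = client.completions.create(
    model=MODEL_NAME,
    prompt="The three most important properties of a good API are",
    max_tokens=128,
    temperature=0.8,
    stop=["\n\n"],
    # vLLM-specific sampling options sent as extra JSON fields.
    extra_body={"top_k": 40, "repetition_penalty": 1.1},
)
print(response.choices[0].text)
```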
Initialize Your Project
Begin by setting up the OpenAI Client with your SubModel API Key and Endpoint URL.
With the client now initialized, you're ready to start sending requests to your SubModel Serverless Endpoint.
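With the openai Python package (version 1.x), initialization might look like the sketch below. The environment variable name and the base URL are placeholders; substitute the API key and endpoint URL from your SubModel console.

```python
import os
from openai import OpenAI

# Placeholders: supply the API key and endpoint URL shown in your SubModel console.
client = OpenAI(
    api_key=os.environ["SUBMODEL_API_KEY"],
    base_url="https://<your-submodel-endpoint-url>/v1",
)

# Must match the MODEL_NAME environment variable set on the endpoint.
MODEL_NAME = "openchat/openchat-3.5-0106"
```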
Generating a Request
You can leverage LLMs for instruction-following and chat capabilities. This is suitable for a variety of open source chat and instruct models such as:
meta-llama/Llama-2-7b-chat-hf, mistralai/Mixtral-8x7B-Instruct-v0.1, and more.
Models not inherently designed for chat and instruct tasks can be adapted using a custom chat template specified by the CUSTOM_CHAT_TEMPLATE environment variable.
For more information see the OpenAI documentation.
Streaming Responses
For real-time interaction with the model, create a chat completion stream. This method is ideal for applications that require immediate, incremental feedback as output is generated.
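A minimal streaming sketch with the OpenAI client (the prompt and parameters are illustrative):

```python
# Stream tokens to the console as they are produced.
stream = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Explain what a serverless endpoint is."}],
    temperature=0.7,
    max_tokens=300,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```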
Non-Streaming Responses
You can also return a synchronous, non-streaming response for batch processing or when a single, consolidated response is sufficient.
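For example (illustrative prompt, standard OpenAI client call):

```python
# Wait for the full response, then read the single consolidated message.
response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Suggest three names for a robotics project."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```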
Generating a Completion
This method is tailored for models and tasks built around raw text completion: you provide a prompt and the model returns a continuation of it, rather than a reply in the interactive chat format.
Streaming Responses
Enable streaming for continuous, real-time output. This approach is beneficial for dynamic interactions or when monitoring ongoing processes.
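A sketch of a streaming text completion (the prompt is illustrative):

```python
# Stream the continuation of a raw text prompt.
stream = client.completions.create(
    model=MODEL_NAME,
    prompt="A short poem about inference servers:\n",
    max_tokens=150,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
```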
Non-Streaming Responses
Choose a non-streaming method when a single, consolidated response meets your needs.
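For example:

```python
# Request the full completion in one call.
response = client.completions.create(
    model=MODEL_NAME,
    prompt="A short poem about inference servers:\n",
    max_tokens=150,
)
print(response.choices[0].text)
```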
Get a List of Available Models
You can list the available models.
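A sketch using the OpenAI client, which is also a convenient way to confirm the exact model name to pass in requests:

```python
# Query the endpoint for the models it serves.
models = client.models.list()
for model in models.data:
    print(model.id)
```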