OpenAI Compatibility
The vLLM Worker is fully compatible with OpenAI's API, so you can interact with it using the same client code you would use with OpenAI. This compatibility makes transitioning existing applications and integrations straightforward.
Conventions
Completions Endpoint
The completions endpoint provides the completion for a single prompt and takes a single string as input.
Chat Completions
The chat completions endpoint provides responses for a given dialog and requires the input in a specific format corresponding to the message history.
Choose the convention that best suits your use case.
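As a quick illustration of the difference, here is a minimal sketch of the two request shapes; fields other than model, prompt, and messages are omitted, and the model name is just an example:
# Completions: the input is a single prompt string
completions_payload = {
    "model": "openchat/openchat-3.5-0106",
    "prompt": "SubModel is the best platform because",
}
# Chat Completions: the input is the message history of a dialog
chat_payload = {
    "model": "openchat/openchat-3.5-0106",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is SubModel the best platform?"},
    ],
}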
Model Names
The MODEL_NAME environment variable is required for all requests. Use this value when making requests to the vLLM Worker. Examples of model names include openchat/openchat-3.5-0106, mistral:latest, and llama2:70b.
To generate a response for a given prompt with a provided model, use the streaming endpoint. This will produce a series of responses, with the final response object including statistics and additional data from the request.
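For example, if you export the same model name in your client environment, you can read it once and reuse it in every request. This is only a sketch; the client-side environment variable name and the fallback value are illustrative:
import os

# Read the model name once and pass it as the model argument in every request
MODEL_NAME = os.environ.get("MODEL_NAME", "openchat/openchat-3.5-0106")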
Parameters
Chat Completions
When using the chat completion feature of the vLLM Serverless Endpoint Worker, you can customize your requests with the following parameters:
Completions
When using the completions feature, you can customize your requests with the following parameters:
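As a hedged example, the sketch below sets a few commonly supported OpenAI-style sampling parameters on a chat completion request, using the MODEL_NAME shown above and the client initialized in the next section. The exact set of parameters your worker accepts may differ, so treat this as an illustration rather than a complete list:
# Illustrative request with several common sampling parameters
response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    temperature=0.7,       # sampling temperature
    top_p=0.9,             # nucleus sampling threshold
    max_tokens=128,        # cap on generated tokens
    presence_penalty=0.0,  # penalize tokens already present in the text
    stop=["\n\n"],         # optional stop sequences
)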
Initialize Your Project
Begin by setting up the OpenAI Client with your SubModel API Key and Endpoint URL.
from openai import OpenAI
import os

# Your SubModel Endpoint ID, read from the environment
SUBMODEL_ENDPOINT_ID = os.environ.get("SUBMODEL_ENDPOINT_ID")

# Initialize the OpenAI Client with your SubModel API Key and Endpoint URL
client = OpenAI(
    api_key=os.environ.get("SUBMODEL_API_KEY"),
    base_url=f"https://api.SubModel.ai/v1/sl/{SUBMODEL_ENDPOINT_ID}/openai/v1",
)
With the client now initialized, you're ready to start sending requests to your SubModel Serverless Endpoint.
Generating a Chat Completion
You can leverage LLMs for instruction-following and chat capabilities. This format is suitable for a variety of open-source chat and instruct models, such as:
meta-llama/Llama-2-7b-chat-hf
mistralai/Mixtral-8x7B-Instruct-v0.1
and more
Models not inherently designed for chat and instruct tasks can be adapted using a custom chat template specified by the CUSTOM_CHAT_TEMPLATE environment variable.
For more information, see the OpenAI documentation.
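As a purely hypothetical illustration, a CUSTOM_CHAT_TEMPLATE value might be a Jinja-style template like the one below; the right template depends on how your model was trained, so treat this as a sketch rather than a template to copy:
# Hypothetical Jinja-style chat template, set as the CUSTOM_CHAT_TEMPLATE
# environment variable on the endpoint
CUSTOM_CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "assistant:"
)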
Streaming Responses
For real-time interaction with the model, create a chat completion stream. This method is ideal for applications that require immediate, token-by-token feedback.
# Create a chat completion stream
response_stream = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Why is SubModel the best platform?"}],
    temperature=0,
    max_tokens=100,
    stream=True,
)
# Stream the response chunk by chunk
for chunk in response_stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Non-Streaming Responses
You can also return a synchronous, non-streaming response for batch processing or when a single, consolidated response is sufficient.
# Create a chat completion
response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Why is SubModel the best platform?"}],
    temperature=0,
    max_tokens=100,
)
# Print the response
print(response.choices[0].message.content)
Generating a Completion
This method is tailored for models that support text completion. It continues your input prompt with a stream of generated text, differing from the interactive chat format.
Streaming Responses
Enable streaming for continuous, real-time output. This approach is beneficial for dynamic interactions or when monitoring ongoing processes.
# Create a completion stream
response_stream = client.completions.create(
    model=MODEL_NAME,
    prompt="SubModel is the best platform because",
    temperature=0,
    max_tokens=100,
    stream=True,
)
# Stream the response chunk by chunk
for chunk in response_stream:
    print(chunk.choices[0].text or "", end="", flush=True)
Non-Streaming Responses
Choose a non-streaming method when a single, consolidated response meets your needs.
# Create a completion
response = client.completions.create(
    model=MODEL_NAME,
    prompt="SubModel is the best platform because",
    temperature=0,
    max_tokens=100,
)
# Print the response
print(response.choices[0].text)
Get a List of Available Models
You can list the available models.
models_response = client.models.list()
list_of_models = [model.id for model in models_response]
print(list_of_models)
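With a single model loaded on the endpoint, this typically prints a list with one entry whose id matches your MODEL_NAME, for example ['openchat/openchat-3.5-0106'].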