References
The following are configurable settings within an Endpoint.
Endpoint Name
Create a name you'd like to use for the Endpoint configuration. The resulting Endpoint is assigned a random ID to be used for making calls.
The name is only visible to you.
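For context, requests to a deployed Endpoint typically reference this assigned ID rather than the name. The sketch below is a hypothetical illustration only; the URL pattern, header, and payload shape are assumptions, not the documented SubModel API — check your dashboard for the real values:

```python
# Hypothetical sketch: calling an Endpoint by its assigned ID.
# The URL pattern, header name, and payload shape are assumptions,
# not the documented SubModel API.
import requests

ENDPOINT_ID = "abc123xyz"   # random ID assigned at creation (placeholder)
API_KEY = "your-api-key"    # placeholder credential

response = requests.post(
    f"https://api.submodel.example/v2/{ENDPOINT_ID}/run",  # assumed URL pattern
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello, world"}},
    timeout=30,
)
print(response.json())
```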
GPU Selection
Select one or more GPUs that you want your Endpoint to run on. SubModel matches you with GPUs in the order that you select them, so the first GPU type that you select is prioritized, then the second, and so on. Selecting multiple GPU types can help you get a worker more quickly, especially if your first selection is an in-demand GPU.
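Conceptually, the matching behaves like a first-fit search over your ordered selections. A minimal sketch, with GPU names and availability data invented for illustration:

```python
# Minimal sketch of priority-ordered GPU matching as described above.
# The `availability` data and GPU names are illustrative assumptions.
def match_gpu(selected_order: list[str], availability: dict[str, int]) -> str | None:
    """Return the first selected GPU type with available capacity."""
    for gpu_type in selected_order:           # your first selection is tried first
        if availability.get(gpu_type, 0) > 0:
            return gpu_type
    return None                               # no capacity yet; the request waits

# Selecting multiple types raises the chance of an immediate match:
print(match_gpu(["H100", "A100", "A6000"], {"H100": 0, "A100": 3}))  # -> "A100"
```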
Active (Min) Workers
Setting the active workers to 1 or more ensures you have “always on” workers, ready to respond to job requests without cold start delays.
Default: 0
note
Active workers incur charges as soon as you enable them (set to >0), but they come with a discount of up to 30% off the regular price.
Max Workers
Max workers sets an upper limit on the number of workers your endpoint can run simultaneously. If this limit is set too low, your workers may be throttled. To prevent this, consider raising the max workers to 5 or more if you see frequent throttling.
Default: 3
GPUs / Worker
The number of GPUs to assign to each worker.
note
Currently only available for 48 GB GPUs.
Idle Timeout
The amount of time a worker remains running after completing its current request. During this period, the worker stays active, continuously checking the queue for new jobs, and continues to incur charges. If no new requests arrive within this time, the worker will go to sleep.
Default: 5 seconds
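As a rough mental model of this lifecycle, the sketch below shows a worker polling for jobs and going to sleep once the idle window elapses. The queue interface is an assumption for illustration, not the SubModel worker API:

```python
# Illustrative sketch of the idle-timeout behavior described above.
# `queue.poll()` and `job.run()` are stand-in interfaces, not a real API.
import time

IDLE_TIMEOUT_SECONDS = 5  # default idle timeout

def worker_loop(queue) -> None:
    idle_since = time.monotonic()
    while True:
        job = queue.poll()                    # continuously check for new jobs
        if job is not None:
            job.run()
            idle_since = time.monotonic()     # timer resets after each completed request
        elif time.monotonic() - idle_since >= IDLE_TIMEOUT_SECONDS:
            break                             # no work within the window: worker sleeps
        else:
            time.sleep(0.1)                   # still active (and billed) while waiting
```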
Advanced
Additional settings that control where your endpoint is deployed and how it responds to incoming requests.
Data Centers
Control which data centers can deploy and cache your workers. Allowing multiple data centers can help you get a worker more quickly.
Default: all data centers
Scale Type
The Queue Delay scaling strategy adjusts worker numbers based on request wait times. With zero workers initially, the first request adds one worker. Subsequent requests add workers only after waiting in the queue for the defined number of delay seconds.
The Request Count scaling strategy adjusts worker numbers according to the total number of requests in the queue and in progress. It automatically adds workers as the number of requests increases, ensuring tasks are handled efficiently.
_Total Workers Formula: Math.ceil((requestsInQueue + requestsInProgress) / <set request count>)_
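As a worked illustration of the formula above, here is a minimal Python sketch of the Request Count calculation (parameter names are descriptive stand-ins; the Max Workers cap still applies on top of the result):

```python
import math

def target_workers(requests_in_queue: int, requests_in_progress: int,
                   request_count: int) -> int:
    """Request Count scaling: one worker per `request_count` outstanding requests."""
    return math.ceil((requests_in_queue + requests_in_progress) / request_count)

# With a request count of 4, 10 outstanding requests scale to 3 workers
# (subject to the Max Workers cap):
print(target_workers(requests_in_queue=7, requests_in_progress=3, request_count=4))  # 3
```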
GPU Types
Within the selected GPU size category, you can further choose which GPU models your endpoint workers run on.
Default: 4090 | A4000 | A4500
CUDA Version Selection
You can select the allowed CUDA versions for your workloads. The CUDA version selection determines which GPU types are compatible and used to execute your serverless tasks.
Specifically, the CUDA version selection works as follows:
You can choose one or more CUDA versions that your workload is compatible with or requires.
SubModel will then match your workload to available GPU instances that have the selected CUDA versions installed.
This ensures that your serverless tasks run on GPU hardware that meets the CUDA version requirements.
For example, if you select CUDA 11.6, your serverless tasks will be scheduled to run on GPU instances that have CUDA 11.6 or a compatible version installed. This allows you to target specific CUDA versions based on your workload's dependencies or performance requirements.
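To make the matching concrete, here is a simplified sketch of version-based filtering (instance data is invented, and exact matching is shown for brevity; real scheduling also accepts compatible versions, as noted above):

```python
# Illustrative sketch of CUDA-version matching: a task is only scheduled
# onto instances whose installed CUDA version is in the allowed set.
# The fleet data is made up; compatible-version logic is omitted.
def eligible_instances(allowed_versions: set[str],
                       instances: list[dict]) -> list[dict]:
    return [i for i in instances if i["cuda_version"] in allowed_versions]

fleet = [
    {"id": "gpu-1", "cuda_version": "11.6"},
    {"id": "gpu-2", "cuda_version": "12.1"},
]
# Selecting CUDA 11.6 restricts scheduling to matching hardware:
print(eligible_instances({"11.6"}, fleet))  # -> [{'id': 'gpu-1', 'cuda_version': '11.6'}]
```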