
Streaming

The ModelGates API allows streaming responses from any model. This is useful for building chat interfaces or other applications where the UI should update as the model generates the response.

To enable streaming, set the stream parameter to true in your request. The model then streams the response to the client in chunks instead of returning the entire response at once.

The following example streams a response and processes each chunk as it arrives:

typescript

import { ModelGates } from '@modelgates/sdk';

const modelgates = new ModelGates({
  apiKey: '<API_KEY>', // replace with your API key
});

const question = 'How would you build the tallest building ever?';

const stream = await modelgates.chat.send({
  model: '<MODEL>', // replace with a model ID
  messages: [{ role: 'user', content: question }],
  stream: true,
});

for await (const chunk of stream) {
  // Each chunk carries an incremental piece of the response text
  const content = chunk.choices?.[0]?.delta?.content;
  if (content) {
    console.log(content);
  }

  // Final chunk includes usage stats
  if (chunk.usage) {
    console.log('Usage:', chunk.usage);
  }
}

python

import json

import requests

API_KEY = "<API_KEY>"  # replace with your API key
MODEL = "<MODEL>"  # replace with a model ID

question = "How would you build the tallest building ever?"

url = "https://modelgates.ai/api/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": question}],
    "stream": True,
}

buffer = ""
done = False
with requests.post(url, headers=headers, json=payload, stream=True) as r:
    r.encoding = "utf-8"  # ensure iter_content yields str rather than bytes
    for chunk in r.iter_content(chunk_size=1024, decode_unicode=True):
        buffer += chunk
        # Consume every complete SSE line currently in the buffer
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.strip()
            if not line.startswith("data: "):
                continue  # skip blank lines and SSE comments
            data = line[6:]
            if data == "[DONE]":
                done = True
                break
            try:
                data_obj = json.loads(data)
            except json.JSONDecodeError:
                continue
            choices = data_obj.get("choices") or [{}]
            content = choices[0].get("delta", {}).get("content")
            if content:
                print(content, end="", flush=True)
        if done:
            break  # stop reading once the [DONE] sentinel arrives

Additional Information

For SSE (Server-Sent Events) streams, ModelGates occasionally sends comments to prevent connection timeouts. These comments look like:

text

: MODELGATES PROCESSING

Comment payloads can be safely ignored per the SSE specification. You can also use them to improve UX, for example by showing a dynamic loading indicator while they arrive.
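As a minimal sketch of telling keep-alive comments apart from data events, the helper below classifies a single SSE line. The function name classify_sse_line and its return labels are illustrative, not part of the ModelGates API; it relies only on the SSE rule that a line starting with a colon is a comment.

```python
import json
from typing import Any, Optional, Tuple


def classify_sse_line(line: str) -> Tuple[str, Optional[Any]]:
    """Classify one SSE line as 'comment', 'done', 'data', or 'other'.

    Hypothetical helper for illustration; the labels are not ModelGates terms.
    """
    line = line.strip()
    if line.startswith(":"):
        # Per the SSE spec, a line beginning with a colon is a comment.
        return "comment", line[1:].strip()
    if line.startswith("data:"):
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return "done", None
        # Only data payloads are JSON; comments are never passed to json.loads.
        return "data", json.loads(payload)
    return "other", None
```

A caller could show a loading indicator whenever the returned kind is "comment" and render text only for "data" results.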

The generation ID is returned in the X-Generation-Id response header for all endpoints (chat completions, completions, responses, and messages), which can be useful for debugging and correlating requests.
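A small sketch of reading that header for logging is shown below. The helper name get_generation_id is hypothetical; it accepts any mapping of header names to values and matches the name case-insensitively, since HTTP header names are not case-sensitive.

```python
from typing import Mapping, Optional


def get_generation_id(headers: Mapping[str, str]) -> Optional[str]:
    """Return the X-Generation-Id value from response headers, if present.

    Hypothetical helper; the header name comes from the text above.
    """
    for name, value in headers.items():
        if name.lower() == "x-generation-id":
            return value
    return None
```

With the requests library, this could be called as get_generation_id(response.headers) right after the request returns, before the stream body is consumed.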

Some SSE client implementations might not parse the payload according to spec, which leads to an uncaught error when you call JSON.parse on the non-JSON comment payloads. We recommend the following clients:

Stream Cancellation

Streaming requests can be cancelled by aborting the connection. For supported providers, this immediately stops model processing and billing.

Supported

OpenAI, Azure, Anthropic
Fireworks, Mancer, Recursal
AnyScale, Lepton, OctoAI
Novita, DeepInfra, Together
Cohere, Hyperbolic, Infermatic
Avian, XAI, Cloudflare
SFCompute, Nineteen, Liquid
Friendli, Chutes, DeepSeek

Not Currently Supported

AWS Bedrock, Groq, Modal
Google, Google AI Studio, Minimax
HuggingFace, Replicate, Perplexity
Mistral, AI21, Featherless
Lynn, Lambda, Reflection
SambaNova, Inflection, ZeroOneAI
AionLabs, Alibaba, Nebius
Kluster, Targon, InferenceNet

To implement stream cancellation:

typescript

import { ModelGates } from '@modelgates/sdk';

const modelgates = new ModelGates({
  apiKey: '<API_KEY>', // replace with your API key
});

const controller = new AbortController();

async function streamStory() {
  try {
    const stream = await modelgates.chat.send({
      model: '<MODEL>', // replace with a model ID
      messages: [{ role: 'user', content: 'Write a story' }],
      stream: true,
    }, {
      signal: controller.signal,
    });

    for await (const chunk of stream) {
      const content = chunk.choices?.[0]?.delta?.content;
      if (content) {
        console.log(content);
      }
    }
  } catch (error) {
    if (error instanceof Error && error.name === 'AbortError') {
      console.log('Stream cancelled');
    } else {
      throw error;
    }
  }
}

// Start streaming without awaiting, so it can be cancelled while in flight
const pending = streamStory();

// To cancel the stream:
controller.abort();

// Wait for the streaming function to settle
await pending;

python

from threading import Event, Thread

import requests

API_KEY = "<API_KEY>"  # replace with your API key
MODEL = "<MODEL>"  # replace with a model ID


def stream_with_cancellation(prompt: str, cancel_event: Event) -> None:
    with requests.Session() as session:
        response = session.post(
            "https://modelgates.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "stream": True,
            },
            stream=True,
        )

        try:
            for line in response.iter_lines():
                # Stop reading as soon as cancellation is requested
                if cancel_event.is_set():
                    return
                if line:
                    print(line.decode(), flush=True)  # raw SSE line
        finally:
            # Closing the response aborts the connection
            response.close()


# Example usage:
cancel_event = Event()
stream_thread = Thread(
    target=stream_with_cancellation,
    args=("Write a story", cancel_event),
)
stream_thread.start()

# To cancel the stream:
cancel_event.set()

Cancellation only works for streaming requests with supported providers. For non-streaming requests or unsupported providers, the model will continue processing and you will be billed for the complete response.

Handling Errors During Streaming

ModelGates handles errors differently depending on when they occur during the streaming process:

Errors Before Any Tokens Are Sent

If an error occurs before any tokens have been streamed to the client, ModelGates returns a standard JSON error response with the appropriate HTTP status code. This follows the standard error format:

json

{
  "error": {
    "code": 400,
    "message": "Invalid model specified"
  }
}

Common HTTP status codes include:

400: Bad Request (invalid parameters)
401: Unauthorized (invalid API key)
402: Payment Required (insufficient credits)
429: Too Many Requests (rate limited)
502: Bad Gateway (provider error)
503: Service Unavailable (no available providers)
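One way to act on these codes is to separate errors that may succeed on a later attempt from those that will not. The sketch below is illustrative only: treating 429, 502, and 503 as retryable is an assumption rather than documented ModelGates behavior, and the names STATUS_DESCRIPTIONS, RETRYABLE_STATUSES, and describe_error are hypothetical.

```python
# Descriptions taken from the list above.
STATUS_DESCRIPTIONS = {
    400: "Bad Request (invalid parameters)",
    401: "Unauthorized (invalid API key)",
    402: "Payment Required (insufficient credits)",
    429: "Too Many Requests (rate limited)",
    502: "Bad Gateway (provider error)",
    503: "Service Unavailable (no available providers)",
}

# Assumption: these statuses are transient and worth retrying with backoff.
RETRYABLE_STATUSES = frozenset({429, 502, 503})


def describe_error(status_code: int, body: dict) -> str:
    """Build a readable message from a pre-stream JSON error response."""
    label = STATUS_DESCRIPTIONS.get(status_code, "Unexpected status")
    message = body.get("error", {}).get("message", "no message provided")
    if status_code in RETRYABLE_STATUSES:
        action = "retry later"
    else:
        action = "do not retry unchanged"
    return f"{status_code} {label}: {message} ({action})"
```

Passing the status code and the parsed JSON body of a failed request yields a single line suitable for logs or a user-facing notice.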

Errors After Tokens Have Been Sent (Mid-Stream)

If an error occurs after some tokens have already been streamed to the client, ModelGates cannot change the HTTP status code (which is already 200 OK). Instead, the error is sent as a Server-Sent Event (SSE) with a unified structure:

text

data: {"id":"cmpl-abc123","object":"chat.completion.chunk","created":1234567890,"model":"openai/gpt-4o","provider":"openai","error":{"code":"server_error","message":"Provider disconnected unexpectedly"},"choices":[{"index":0,"delta":{"content":""},"finish_reason":"error"}]}

Key characteristics of mid-stream errors:

The error appears at the top level alongside standard response fields (id, object, created, etc.)
A choices array is included with finish_reason: "error" to properly terminate the stream
The HTTP status remains 200 OK since headers were already sent
The stream is terminated after this unified error event
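To make the shape of this event concrete, the sketch below parses the sample event shown above and pulls out the top-level error. The helper name extract_stream_error is hypothetical, and the code assumes only the fields visible in that sample.

```python
import json
from typing import Optional, Tuple

# The sample mid-stream error event from above, without the "data: " prefix.
SAMPLE_EVENT = (
    '{"id":"cmpl-abc123","object":"chat.completion.chunk","created":1234567890,'
    '"model":"openai/gpt-4o","provider":"openai",'
    '"error":{"code":"server_error","message":"Provider disconnected unexpectedly"},'
    '"choices":[{"index":0,"delta":{"content":""},"finish_reason":"error"}]}'
)


def extract_stream_error(chunk: dict) -> Optional[Tuple[str, str]]:
    """Return (code, message) if the chunk carries a top-level error, else None."""
    error = chunk.get("error")
    if not error:
        return None
    return str(error.get("code")), error.get("message", "")


chunk = json.loads(SAMPLE_EVENT)
print(extract_stream_error(chunk))
print(chunk["choices"][0]["finish_reason"])
```

Checking the top-level error field first, rather than only finish_reason, keeps the error code and message available for logging.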

Code Examples

Here's how to properly handle both types of errors in your streaming implementation:

typescript

import { ModelGates } from '@modelgates/sdk';

const modelgates = new ModelGates({
  apiKey: '<API_KEY>', // replace with your API key
});

async function streamWithErrorHandling(prompt: string) {
  try {
    const stream = await modelgates.chat.send({
      model: '<MODEL>', // replace with a model ID
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    });

    for await (const chunk of stream) {
      // Check for errors in chunk
      if ('error' in chunk) {
        console.error(`Stream error: ${chunk.error.message}`);
        if (chunk.choices?.[0]?.finish_reason === 'error') {
          console.log('Stream terminated due to error');
        }
        return;
      }

      // Process normal content
      const content = chunk.choices?.[0]?.delta?.content;
      if (content) {
        console.log(content);
      }
    }
  } catch (error) {
    // Handle pre-stream errors
    const message = error instanceof Error ? error.message : String(error);
    console.error(`Error: ${message}`);
  }
}

python

import json

import requests

API_KEY = '<API_KEY>'  # replace with your API key
MODEL = '<MODEL>'  # replace with a model ID


def stream_with_error_handling(prompt):
    response = requests.post(
        'https://modelgates.ai/api/v1/chat/completions',
        headers={'Authorization': f'Bearer {API_KEY}'},
        json={
            'model': MODEL,
            'messages': [{'role': 'user', 'content': prompt}],
            'stream': True,
        },
        stream=True,
    )

    # Check initial HTTP status for pre-stream errors
    if response.status_code != 200:
        error_data = response.json()
        print(f"Error: {error_data['error']['message']}")
        return

    # Process stream and handle mid-stream errors
    for line in response.iter_lines():
        if not line:
            continue
        line_text = line.decode('utf-8')
        if not line_text.startswith('data: '):
            continue  # skip SSE comments and other non-data lines
        data = line_text[6:]
        if data == '[DONE]':
            break

        try:
            parsed = json.loads(data)
        except json.JSONDecodeError:
            continue

        choices = parsed.get('choices') or [{}]

        # Check for mid-stream error
        if 'error' in parsed:
            print(f"Stream error: {parsed['error']['message']}")
            # Check finish_reason if needed
            if choices[0].get('finish_reason') == 'error':
                print("Stream terminated due to error")
            break

        # Process normal content
        content = choices[0].get('delta', {}).get('content')
        if content:
            print(content, end='', flush=True)

API-Specific Behavior

Different API endpoints may handle streaming errors slightly differently:

OpenAI Chat Completions API: Returns ErrorResponse directly if no chunks were processed, or includes error information in the response if some chunks were processed
OpenAI Responses API: May transform certain error codes (like context_length_exceeded) into a successful response with finish_reason: "length" instead of treating them as errors
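
Because a context-length failure may surface as finish_reason: "length" rather than as an error, client code may want to treat truncation as its own outcome. The following is a rough sketch; the function name classify_finish and its labels are hypothetical, and treating every other finish_reason value as a normal completion is an assumption.

```python
from typing import Optional


def classify_finish(finish_reason: Optional[str]) -> str:
    """Map a finish_reason value to a coarse outcome label (labels are illustrative)."""
    if finish_reason is None:
        return "in_progress"  # no finish_reason yet: the stream is still running
    if finish_reason == "error":
        return "failed"  # mid-stream error event
    if finish_reason == "length":
        return "truncated"  # may stand in for a context-length error on some endpoints
    return "completed"  # assumption: any other value is a normal finish
```

A caller might warn the user on "truncated" even though no error event was received.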