Integrating Cycles with Groq
This guide shows how to add budget governance to Groq API calls. Groq provides an OpenAI-compatible API, so you use the standard OpenAI SDK with a different base_url. All Cycles patterns from the OpenAI integration apply directly.
Prerequisites
pip install runcycles openai
export CYCLES_BASE_URL="http://localhost:7878"
export CYCLES_API_KEY="your-api-key" # create via Admin Server — see note below
export CYCLES_TENANT="acme"
export GROQ_API_KEY="gsk_..."
Need a Cycles API key? Create one via the Admin Server; see Deploy the Full Stack or API Key Management.
60-Second Quick Start
from openai import OpenAI
from runcycles import CyclesClient, CyclesConfig, cycles, set_default_client
set_default_client(CyclesClient(CyclesConfig.from_env()))
groq = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="gsk_...")
@cycles(estimate=50_000, action_kind="llm.completion", action_name="llama-4-scout")
def ask(prompt: str) -> str:
    return groq.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

print(ask("What is budget authority?"))
Same OpenAI SDK, same @cycles decorator, just a different base_url. Notice that the estimate is much lower than it would be for GPT-4o, because Llama 4 Scout on Groq costs roughly 23x less per input token and 29x less per output token.
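To see where a flat estimate such as 50_000 might come from, the sketch below sizes a worst-case call from the per-token prices listed later in this guide (11 and 34 microcents for Llama 4 Scout). It assumes the estimate is expressed in USD microcents and that a response is capped at 1,024 output tokens; the helper name flat_estimate and the 1,000-token input allowance are illustrative and not part of the Cycles SDK.

```python
# Hypothetical helper: size a flat, worst-case estimate in USD microcents.
INPUT_MICROCENTS_PER_TOKEN = 11   # Llama 4 Scout input, $0.11 / 1M tokens
OUTPUT_MICROCENTS_PER_TOKEN = 34  # Llama 4 Scout output, $0.34 / 1M tokens


def flat_estimate(max_input_tokens: int, max_output_tokens: int) -> int:
    """Return the worst-case cost of one call, in microcents."""
    return (
        max_input_tokens * INPUT_MICROCENTS_PER_TOKEN
        + max_output_tokens * OUTPUT_MICROCENTS_PER_TOKEN
    )


# Assumed limits for illustration: 1,000 input tokens, 1,024 output tokens.
worst_case = flat_estimate(max_input_tokens=1_000, max_output_tokens=1_024)
print(worst_case)  # 45816
```

Under these assumptions the worst case is 45,816 microcents, so a flat estimate of 50_000 leaves a small margin. Longer prompts or a higher output cap would call for a larger value.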
Basic pattern
import os
from openai import OpenAI
from runcycles import (
    CyclesConfig, CyclesClient, CyclesMetrics,
    cycles, get_cycles_context, set_default_client,
)
set_default_client(CyclesClient(CyclesConfig.from_env()))
groq = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)
# Llama 4 Scout on Groq
PRICE_PER_INPUT_TOKEN = 11 # $0.11 / 1M tokens
PRICE_PER_OUTPUT_TOKEN = 34 # $0.34 / 1M tokens
@cycles(
    estimate=lambda prompt, **kw: len(prompt.split()) * 2 * PRICE_PER_INPUT_TOKEN
        + kw.get("max_tokens", 1024) * PRICE_PER_OUTPUT_TOKEN,
    actual=lambda result: (
        result["usage"]["prompt_tokens"] * PRICE_PER_INPUT_TOKEN
        + result["usage"]["completion_tokens"] * PRICE_PER_OUTPUT_TOKEN
    ),
    action_kind="llm.completion",
    action_name="meta-llama/llama-4-scout-17b-16e-instruct",
    unit="USD_MICROCENTS",
)
def chat(prompt: str, max_tokens: int = 1024) -> dict:
    ctx = get_cycles_context()
    if ctx and ctx.has_caps() and ctx.caps.max_tokens:
        max_tokens = min(max_tokens, ctx.caps.max_tokens)
    response = groq.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    if ctx:
        ctx.metrics = CyclesMetrics(
            tokens_input=response.usage.prompt_tokens,
            tokens_output=response.usage.completion_tokens,
            model_version=response.model,
        )
    return {
        "content": response.choices[0].message.content,
        "usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
        },
    }
TypeScript
import OpenAI from "openai";
import { CyclesClient, CyclesConfig, withCycles, getCyclesContext } from "runcycles";
const cycles = new CyclesClient(CyclesConfig.fromEnv());
const groq = new OpenAI({
  baseURL: "https://api.groq.com/openai/v1",
  apiKey: process.env.GROQ_API_KEY,
});
const INPUT_PRICE = 11;
const OUTPUT_PRICE = 34;
const chat = withCycles(
  {
    client: cycles,
    actionKind: "llm.completion",
    actionName: "meta-llama/llama-4-scout-17b-16e-instruct",
    estimate: (prompt: string) => {
      const inputTokens = Math.ceil(prompt.length / 4);
      return inputTokens * INPUT_PRICE + 1024 * OUTPUT_PRICE;
    },
    actual: (r: OpenAI.ChatCompletion) =>
      (r.usage?.prompt_tokens ?? 0) * INPUT_PRICE +
      (r.usage?.completion_tokens ?? 0) * OUTPUT_PRICE,
  },
  async (prompt: string) => {
    const ctx = getCyclesContext();
    let maxTokens = 1024;
    if (ctx?.caps?.maxTokens) {
      maxTokens = Math.min(maxTokens, ctx.caps.maxTokens);
    }
    return groq.chat.completions.create({
      model: "meta-llama/llama-4-scout-17b-16e-instruct",
      max_tokens: maxTokens,
      messages: [{ role: "user", content: prompt }],
    });
  },
);
Groq pricing reference
Groq hosts open-source models on custom LPU hardware. Pricing is significantly lower than proprietary models:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Input (microcents/token) | Output (microcents/token) |
|---|---|---|---|---|
| Llama 4 Scout 17B | $0.11 | $0.34 | 11 | 34 |
| Llama 3.3 70B | $0.59 | $0.79 | 59 | 79 |
| Llama 3.1 8B | $0.05 | $0.08 | 5 | 8 |
For comparison, GPT-4o costs 250 microcents per input token and 1,000 microcents per output token ($2.50/$10 per 1M tokens). That is about 23x more than Llama 4 Scout on Groq for input and 29x more for output.
Note
Groq pricing changes. Check groq.com/pricing for current rates.
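The microcents columns follow from the dollar prices by a fixed conversion: $1 is 100,000,000 microcents, so a price of $X per 1M tokens works out to X × 100 microcents per token. A minimal sketch of that arithmetic, using Decimal to avoid floating-point drift; the function name is illustrative, and the prices are the ones quoted in this section, which may be out of date.

```python
from decimal import Decimal

MICROCENTS_PER_USD = Decimal(100_000_000)  # 100 cents * 1,000,000 microcents
TOKENS_PER_MILLION = Decimal(1_000_000)


def microcents_per_token(usd_per_million_tokens: str) -> Decimal:
    """Convert a '$ per 1M tokens' price into microcents per token."""
    usd = Decimal(usd_per_million_tokens)
    return usd * MICROCENTS_PER_USD / TOKENS_PER_MILLION


# Prices as quoted in this guide; check the provider's page before relying on them.
scout_in = microcents_per_token("0.11")
scout_out = microcents_per_token("0.34")
gpt4o_in = microcents_per_token("2.50")
gpt4o_out = microcents_per_token("10")

# Rounded price ratios, GPT-4o relative to Llama 4 Scout on Groq.
print(round(gpt4o_in / scout_in), round(gpt4o_out / scout_out))  # 23 29
```

The same helper can be reused to refresh the PRICE_PER_INPUT_TOKEN and PRICE_PER_OUTPUT_TOKEN constants whenever rates change.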
Model-downgrade degradation pattern
The most powerful Cycles + Groq pattern: when your primary model's budget runs low, automatically downgrade to a cheaper Groq model instead of denying the request entirely.
import os
from openai import OpenAI
from runcycles import BudgetExceededError, cycles

# Assumes set_default_client(...) has already been called, as in the basic pattern.

# Primary: GPT-4o (expensive, high quality)
primary_client = OpenAI()

# Fallback: Llama 4 Scout on Groq (cheap, good quality)
fallback_client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

@cycles(
    estimate=1_500_000,
    action_kind="llm.completion",
    action_name="gpt-4o",
)
def primary_chat(prompt: str) -> dict:
    response = primary_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return {"content": response.choices[0].message.content, "model": "gpt-4o"}

@cycles(
    estimate=50_000,
    action_kind="llm.completion",
    action_name="llama-4-scout",
)
def fallback_chat(prompt: str) -> dict:
    response = fallback_client.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    return {"content": response.choices[0].message.content, "model": "llama-4-scout"}

def chat_with_downgrade(prompt: str) -> dict:
    """Try GPT-4o first; fall back to Groq if budget is exhausted."""
    try:
        return primary_chat(prompt)
    except BudgetExceededError:
        # The primary reservation was denied, so retry on the cheaper model.
        return fallback_chat(prompt)
This pattern gives you:
- Full quality when budget allows (GPT-4o at $2.50/$10 per 1M tokens)
- Continued service when budget is low (Llama 4 Scout at $0.11/$0.34 per 1M tokens)
- Per-model observability: Cycles tracks spend separately for each action_name
See Degradation Paths for more strategies.
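The two-tier fallback extends naturally to more than two models. A minimal sketch, assuming each tier is a callable that raises a budget exception when its budget is exhausted. BudgetExceeded here is a local stand-in for the SDK's BudgetExceededError so the example runs without a Cycles server, and call_with_fallbacks is a hypothetical helper, not an SDK function.

```python
from typing import Callable, Optional, Sequence


class BudgetExceeded(Exception):
    """Stand-in for runcycles.BudgetExceededError (assumption for this sketch)."""


def call_with_fallbacks(
    tiers: Sequence[Callable[[str], dict]], prompt: str
) -> dict:
    """Try each tier in order; move on only when a tier is out of budget."""
    last_error: Optional[BudgetExceeded] = None
    for tier in tiers:
        try:
            return tier(prompt)
        except BudgetExceeded as exc:
            last_error = exc  # remember the denial and try the next tier
    if last_error is None:
        raise ValueError("no tiers configured")
    raise last_error  # every tier was denied; surface the last denial


# Fake tiers standing in for the decorated primary and fallback functions.
def exhausted_primary(prompt: str) -> dict:
    raise BudgetExceeded("primary budget exhausted")


def cheap_fallback(prompt: str) -> dict:
    return {"content": f"echo: {prompt}", "model": "llama-4-scout"}


result = call_with_fallbacks([exhausted_primary, cheap_fallback], "hi")
print(result["model"])  # llama-4-scout
```

Only the budget exception triggers a downgrade in this sketch. Other errors, such as network failures, propagate unchanged, so a provider outage is not mistaken for an exhausted budget.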
Key points
- Same SDK, different base_url. Groq exposes an OpenAI-compatible API, so there is no new SDK to learn.
- Much lower estimates. Groq models are far cheaper than GPT-4o (about 23x on input and 29x on output for Llama 4 Scout). Adjust your estimate values accordingly.
- Model downgrade is the killer pattern. Use Groq as a budget-aware fallback when your primary model's budget runs low.
- All OpenAI patterns apply. Everything from the OpenAI integration guide works with Groq — decorators, streaming, caps, metrics.
Next steps
- Integrating with OpenAI — full OpenAI patterns (all apply to Groq)
- Integrating with OpenAI (TypeScript) — TypeScript streaming patterns
- Degradation Paths — model downgrade and other strategies
- Cost Estimation Cheat Sheet — pricing reference for all providers
- Integrating with Ollama — self-hosted open-source models