Model Routing
Intelligent routing between local LLMs and cloud APIs
Overview
Model Routing intelligently directs inference requests to the optimal model based on task complexity, hardware availability, and cost efficiency. It seamlessly switches between local Ollama models and cloud APIs.
Routing Strategy
Complexity-Based
Local Fallback
Always Available
Cloud Providers
4 Supported
Auto-Scaling
Dynamic
Routing Architecture
Task Analysis
Classify complexity
Hardware Check
VRAM availability
Router Decision
Select optimal model
Model Execution
Local or Cloud
Response
Return result
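The five stages above can be sketched as a chain of small functions. Everything here is illustrative: the function names, the stubbed VRAM probe, and the cutoff values are our own assumptions, not SingularCore's actual internals.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    provider: str  # "local" or "cloud"
    model: str

def analyze_task(prompt: str) -> float:
    """Classify complexity as a score in [0, 1] (toy heuristic: prompt length)."""
    return min(len(prompt) / 2000, 1.0)

def vram_available_gb() -> float:
    """Stubbed hardware check; a real router would query the GPU driver."""
    return 8.0

def route(prompt: str) -> Decision:
    score = analyze_task(prompt)
    # Cloud-preferred above 0.8, local otherwise -- provided the model fits.
    if score >= 0.8 or vram_available_gb() < 4.0:
        return Decision("cloud", "gemini-2.5-flash")
    return Decision("local", "llama3.2")

print(route("What is 2 + 2?"))  # short prompt -> routed to the local model
```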
Complexity Classification
Trivial
Local Only
Simple Q&A
Low
Local Preferred
Basic tasks
Medium
Balanced
Code review
High
Cloud Preferred
Complex analysis
Critical
Cloud Only
Multi-step reasoning
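The five tiers correspond to the numeric boundaries in the `thresholds` block of config/models.yaml shown later on this page. A minimal classifier over those boundaries (the function name is ours, not part of the product) looks like:

```python
# Map a complexity score in [0, 1] to a routing tier, using the same
# boundaries as the `thresholds` block in config/models.yaml.
TIERS = [
    (0.2, "trivial"),   # local only
    (0.4, "low"),       # local preferred
    (0.6, "medium"),    # balanced
    (0.8, "high"),      # cloud preferred
    (1.0, "critical"),  # cloud only
]

def classify(score: float) -> str:
    """Return the first tier whose upper bound contains the score."""
    for upper, tier in TIERS:
        if score <= upper:
            return tier
    return "critical"

print(classify(0.15))  # trivial
print(classify(0.75))  # high
```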
Local LLM Setup (Ollama)
Auto-Managed
Ollama is automatically installed and managed by SingularCore. No manual setup required. Models are downloaded on-demand through the Model Hub interface.
Configure Local Models
Edit config/models.yaml:
local_models:
enabled: true
ollama_url: "http://localhost:11434"
default_model: "llama3.2"
auto_manage: true # Ollama is auto-installed
models:
- name: "llama3.2"
display_name: "Llama 3.2"
size: "2GB"
quant: "Q4_K_M"
context: 128000
- name: "deepseek-r1:14b"
display_name: "DeepSeek R1 14B"
size: "9GB"
quant: "Q4_K_M"
context: 128000
Download Models via Model Hub
- Navigate to /models in the dashboard
- Go to the Local Models tab
- Search for models like "llama3.2" or "deepseek-r1"
- Click Pull Model to download
- Monitor download progress in real-time
Verify Models
Check available models in the dashboard:
- Open Model Hub at /models
- View Installed Models section
- Verify models show as "Available"
- Test in chat interface
Cloud API Setup
Google Gemini
Get API key from Google AI Studio
Set environment variable:
export GOOGLE_API_KEY="your-api-key-here"
Anthropic Claude
Get API key from Anthropic Console
Set environment variable:
export ANTHROPIC_API_KEY="sk-ant-..."
OpenAI
Get API key from the OpenAI Platform
Set environment variable:
export OPENAI_API_KEY="sk-..."
OpenRouter
Get API key from OpenRouter
Set environment variable:
export OPENROUTER_API_KEY="sk-or-..."
Complete Configuration
Create or edit config/models.yaml:
# Model Routing Configuration
model_routing:
enabled: true
strategy: "complexity_based"
fallback_to_local: true
# Complexity thresholds
thresholds:
trivial: 0.2 # Local only
low: 0.4 # Local preferred
medium: 0.6 # Balanced
high: 0.8 # Cloud preferred
critical: 1.0 # Cloud only
# Provider priority
providers:
local:
priority: 1
enabled: true
google:
priority: 2
enabled: true
models:
- "gemini-2.5-flash"
- "gemini-2.5-pro"
openai:
priority: 3
enabled: true
models:
- "gpt-4o"
- "gpt-4o-mini"
anthropic:
priority: 4
enabled: true
models:
- "claude-3-5-sonnet-20241022"
openrouter:
priority: 5
enabled: true
# Local Ollama Configuration
local_models:
enabled: true
ollama_url: "http://localhost:11434"
default_model: "llama3.2"
timeout: 300
models:
- name: "llama3.2"
display_name: "Llama 3.2"
size: "2GB"
quant: "Q4_K_M"
context: 128000
max_tokens: 8192
- name: "deepseek-r1:14b"
display_name: "DeepSeek R1 14B"
size: "9GB"
quant: "Q4_K_M"
context: 128000
max_tokens: 8192
Environment Variables
Create .env file in project root:
# Google Gemini
GOOGLE_API_KEY=your-google-api-key

# OpenAI
OPENAI_API_KEY=sk-your-openai-key

# Anthropic
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key

# OpenRouter
OPENROUTER_API_KEY=sk-or-your-openrouter-key

# Local Ollama (optional, defaults to localhost:11434)
OLLAMA_URL=http://localhost:11434
Security Warning
Never commit API keys to version control. Add .env to your .gitignore file.
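If your process does not inherit these variables from the shell, a small stdlib-only loader can read the .env file at startup. This is a sketch of the general technique (libraries such as python-dotenv do this more robustly); the function name is ours:

```python
import os

def load_env(path: str = ".env") -> None:
    """Parse KEY=VALUE lines from a .env file into os.environ.

    Skips blank lines and comments; existing variables are not overwritten.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Call `load_env()` once at startup, then read keys as usual, e.g. `os.environ["GOOGLE_API_KEY"]`.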
Testing Your Setup
Test Local Model
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Hello"}'
Test Cloud API
curl "https://generativelanguage.googleapis.com/v1beta/models?key=$GOOGLE_API_KEY"
Verify in Dashboard
- Start SingularCore: python core/brain/main.py
- Open http://localhost:3000/models
- Check that configured models appear
- Test inference in chat interface
Troubleshooting
Ollama Connection Refused
- Ensure SingularCore has auto-started Ollama (check logs)
- Check port 11434 is not blocked
- Verify OLLAMA_URL in config
API Key Invalid
- Verify key is copied correctly (no extra spaces)
- Check key has not expired
- Ensure billing is enabled (for paid tiers)
Out of VRAM
- Use lower quantization (Q4 instead of Q8)
- Switch to cloud API for large models
- Close other GPU applications
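A rough rule of thumb for whether a model fits: the weights alone take about (parameters × bits-per-weight) / 8 bytes, before KV cache and activation overhead. This back-of-the-envelope helper is our own, not a SingularCore utility, and treats Q4/Q8 as exactly 4 and 8 bits per weight:

```python
def approx_weight_gb(params_b: float, bits: float) -> float:
    """Approximate VRAM for model weights alone (no KV cache or activations).

    params_b: parameter count in billions; bits: bits per weight.
    """
    return params_b * bits / 8  # billions of params * bytes per param = GB

# A 14B model: Q8 needs ~14 GB for weights; Q4 roughly halves that.
print(f"14B @ Q8: ~{approx_weight_gb(14, 8):.1f} GB")
print(f"14B @ Q4: ~{approx_weight_gb(14, 4):.1f} GB")
```

This is why dropping from Q8 to Q4 can make a model that overflows your GPU fit comfortably, at some cost in output quality.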
Next Steps
Now that your models are configured, explore the Model Hub to download and manage models.