Intelligent Region Management with Azure OpenAI Service: Implementing Automatic Region Switching Upon Quota Exceedance



In the ever-evolving landscape of cloud services and artificial intelligence, efficient resource management stands as a pivotal aspect of ensuring seamless operation and scalability. Particularly in the realm of Azure OpenAI Service, understanding and navigating through regional quotas becomes crucial for maintaining uninterrupted service. Recently, I encountered a common yet significant challenge: the exhaustion of regional quotas in Azure’s OpenAI Service. This led to a temporary halt in service, a scenario that many developers and businesses might find familiar. To address this, I embarked on a journey to develop a solution that not only overcomes these limitations but also optimizes the usage of available resources across different regions. In this blog post, I’ll share my experience and insights on implementing an automatic region-switching mechanism in Azure OpenAI Service, ensuring that your AI-powered applications remain up and running, even when faced with rate limit errors.

Join me as we delve into the intricacies of managing Azure OpenAI Service across multiple regions, and explore how a little bit of coding can lead to a robust and reliable application infrastructure.

Azure OpenAI Service offers different models for deployment in each region, and the usage limits (quotas) also vary by region. Both can be checked at the following link.

While it’s possible to request Azure to increase usage limits, my experience has shown that these requests often don’t result in a significant increase. Therefore, I decided to deploy the same models across multiple regions and implemented a system where, if a usage limit is reached in one region, another region is automatically called. Furthermore, in cases where a RateLimitError occurs in all regions, I have set up a fallback to directly call OpenAI, bypassing Azure.

First, save the following details in a .env file: the deployment information for each Azure OpenAI Service region and your OpenAI API key.

I will assume that the names of the models deployed in each region are the same as the model names in the OpenAI service. The OpenAI model names can be checked at the link below.

OPENAI_API_KEY="sk-...."

# Azure OpenAI API
# 1. East US
AZURE_OPENAI_API_KEY_1="..."
AZURE_OPENAI_ENDPOINT_1="https://....openai.azure.com/"

# 2. South Central US
AZURE_OPENAI_API_KEY_2="..."
AZURE_OPENAI_ENDPOINT_2="https://....openai.azure.com/"

# 3. North Central US
AZURE_OPENAI_API_KEY_3="..."
AZURE_OPENAI_ENDPOINT_3="https://....openai.azure.com/"

# 4. Canada East
AZURE_OPENAI_API_KEY_4="..."
AZURE_OPENAI_ENDPOINT_4="https://....openai.azure.com/"

# 5. West US
AZURE_OPENAI_API_KEY_5="..."
AZURE_OPENAI_ENDPOINT_5="https://....openai.azure.com/"
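
Before running anything, it can help to confirm that every credential pair actually loads. The short check below is just a convenience sketch, assuming the variable names from the .env file above; it is not part of the main implementation.

from dotenv import load_dotenv
from os import getenv

load_dotenv()

# Confirm that every region's key/endpoint pair is present in the environment.
for i in range(1, 6):
    assert getenv(f"AZURE_OPENAI_API_KEY_{i}"), f"missing AZURE_OPENAI_API_KEY_{i}"
    assert getenv(f"AZURE_OPENAI_ENDPOINT_{i}"), f"missing AZURE_OPENAI_ENDPOINT_{i}"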

Synchronous implementation

from dotenv import load_dotenv
load_dotenv()

from os import getenv
from openai import AzureOpenAI, OpenAI, RateLimitError

AZURE_OPENAI_API_KEYS = [
    {
        "api_key": getenv("AZURE_OPENAI_API_KEY"),
        "azure_endpoint_url": getenv("AZURE_OPENAI_ENDPOINT"),
    },
    {
        "api_key": getenv("AZURE_OPENAI_API_KEY_2"),
        "azure_endpoint_url": getenv("AZURE_OPENAI_ENDPOINT_2"),
    },
    {
        "api_key": getenv("AZURE_OPENAI_API_KEY_3"),
        "azure_endpoint_url": getenv("AZURE_OPENAI_ENDPOINT_3"),
    },
    {
        "api_key": getenv("AZURE_OPENAI_API_KEY_4"),
        "azure_endpoint_url": getenv("AZURE_OPENAI_ENDPOINT_4"),
    },
    {
        "api_key": getenv("AZURE_OPENAI_API_KEY_5"),
        "azure_endpoint_url": getenv("AZURE_OPENAI_ENDPOINT_5"),
    },
]

def _call_completion(client, messages, model, temperature=0.4):
    # Send a chat-completions request and return the assistant's reply as a string.
    completion = client.chat.completions.create(model=model, messages=messages, temperature=temperature)
    return str(completion.choices[0].message.content)


def _call_azure_openai_api(credential, messages, model="gpt-3.5-turbo-1106", temperature=0.4):
    # Create a client for one Azure region; return None if that region is rate-limited.
    client = AzureOpenAI(
        api_key=credential["api_key"],
        api_version="2023-05-15",
        azure_endpoint=credential["azure_endpoint_url"],
    )

    try:
        return _call_completion(client, messages, model, temperature)
    except RateLimitError:
        return None


def call_completion(messages, model="gpt-3.5-turbo-1106", temperature=0.4):
    # Try each Azure region in order; fall back to OpenAI if every region is rate-limited.
    for credential in AZURE_OPENAI_API_KEYS:
        message = _call_azure_openai_api(credential, messages, model, temperature)
        if message is not None:
            return message

    # Every Azure region hit its quota, so call OpenAI directly.
    client = OpenAI(api_key=getenv("OPENAI_API_KEY"))
    return _call_completion(client, messages, model, temperature)
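
With the .env file in place, the function can be used like any other chat-completion helper. A minimal usage sketch (the messages payload follows the standard chat-completions format):

if __name__ == "__main__":
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ]
    # Fails over across regions (and finally to OpenAI) transparently to the caller.
    print(call_completion(messages))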

The code above is synchronous; an asynchronous version looks like the following.

Asynchronous implementation

from dotenv import load_dotenv
load_dotenv()

from os import getenv
from openai import AsyncAzureOpenAI, AsyncOpenAI, RateLimitError

AZURE_OPENAI_API_KEYS = [
    {
        "api_key": getenv("AZURE_OPENAI_API_KEY"),
        "azure_endpoint_url": getenv("AZURE_OPENAI_ENDPOINT"),
    },
    {
        "api_key": getenv("AZURE_OPENAI_API_KEY_2"),
        "azure_endpoint_url": getenv("AZURE_OPENAI_ENDPOINT_2"),
    },
    {
        "api_key": getenv("AZURE_OPENAI_API_KEY_3"),
        "azure_endpoint_url": getenv("AZURE_OPENAI_ENDPOINT_3"),
    },
    {
        "api_key": getenv("AZURE_OPENAI_API_KEY_4"),
        "azure_endpoint_url": getenv("AZURE_OPENAI_ENDPOINT_4"),
    },
    {
        "api_key": getenv("AZURE_OPENAI_API_KEY_5"),
        "azure_endpoint_url": getenv("AZURE_OPENAI_ENDPOINT_5"),
    },
]

async def _call_completion_async(client, messages, model, temperature=0.4):
    # Send a chat-completions request and return the assistant's reply as a string.
    completion = await client.chat.completions.create(model=model, messages=messages, temperature=temperature)
    return str(completion.choices[0].message.content)


async def _call_azure_openai_api_async(credential, messages, model="gpt-3.5-turbo-1106", temperature=0.4):
    # Create an async client for one Azure region; return None if that region is rate-limited.
    client = AsyncAzureOpenAI(
        api_key=credential["api_key"],
        api_version="2023-05-15",
        azure_endpoint=credential["azure_endpoint_url"],
    )

    try:
        return await _call_completion_async(client, messages, model, temperature)
    except RateLimitError:
        return None


async def call_completion_async(messages, model="gpt-3.5-turbo-1106", temperature=0.4):
    # Try each Azure region in order; fall back to OpenAI if every region is rate-limited.
    for credential in AZURE_OPENAI_API_KEYS:
        message = await _call_azure_openai_api_async(credential, messages, model, temperature)
        if message is not None:
            return message

    # Every Azure region hit its quota, so call OpenAI directly.
    client = AsyncOpenAI(api_key=getenv("OPENAI_API_KEY"))
    return await _call_completion_async(client, messages, model, temperature)
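
A minimal sketch of how the async version might be invoked, assuming it runs as its own script rather than inside an existing event loop:

import asyncio

async def main():
    messages = [{"role": "user", "content": "Say hello in one sentence."}]
    # Same failover behavior as the synchronous version, but awaitable.
    print(await call_completion_async(messages))

if __name__ == "__main__":
    asyncio.run(main())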

If you found this post helpful, please give it a like and click an ad :)