
Introduction - Aibro Inference

Aibro Version: 1.1.5 (alpha)

Last Documentation Update: May 20, 2022

Definition: API-embedded Python library

Aibro is a serverless MLOps tool that helps data scientists train & deploy AI models on cloud platforms in 2 minutes.

The first step in the process is training a machine learning model; this document focuses on the second step, cloud-based machine learning inference. If you are also interested in training a model, please refer to the training document.

Why Aibro Inference?

Authentication

Authentication is required when Aibro APIs are called for the first time.

Aibro allows one of the following two ways to authenticate:

  1. Sign in with your Aipaca account email and password when prompted.
  2. Pass an access_token to the Aibro API call (see the access_token parameter of the methods below).

ps: "humming" is literally how an alpaca "greets" people😉

Start The First Inference Job on Aibro

Step 1: Install

pip install aibro

The first step is to install the Aibro Python library using pip.

During the installation, if the message OSError: protocol not found appears, it indicates that the file /etc/protocols is missing, which can be easily resolved by entering the following command.

sudo apt-get -o Dpkg::Options::="--force-confmiss" install --reinstall netbase

Step 2: Prepare repository

The second step is to prepare a formatted inference model repository. The following instructions and source code can be found on our GitHub page, Aibro-examples.

The repo should be structured using the following format:

repo
    |__ predict.py
    |__ model
    |__ data
    |__ requirement.txt
    |__ other artifacts

predict.py

This is the entry point that will be called by Aibro. It should contain two methods, load_model() and run(...).

load_model():

import tensorflow as tf


def load_model():
    # Load the SavedModel from the "model" folder.
    # Example: a transformer-based Portuguese-to-English translator.
    translator = tf.saved_model.load("model")
    return translator

This method is required to load and return your machine learning model from the model folder. As an example, a transformer-based Portuguese to English translator is used.

run():

This method accepts the model as input. It loads the data from the data folder, generates predictions, and returns the results of the inference.

import json


def run(model):
    # Load the inference input from the "data" folder.
    with open("./data/data.json", "r") as fp:
        data = json.load(fp)
    sentence = data["data"]
    # Run the model and return a JSON-serializable result.
    result = {"data": model(sentence).numpy().decode("utf-8")}
    return result

Test tip: predict.py should be able to return an inference result by running:

run(load_model())

'model' and 'data' folders

There are no format restrictions on these two folders, as long as the input and output of load_model() and run(...) from predict.py are correct.
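
For reference, a minimal data folder matching the run() example above would contain a single data.json file whose "data" key holds the input sentence; a short sketch (assuming this layout) that creates it:

import json
import os

# Recreate the example data folder used by run(): a single data.json file with
# a "data" key holding the input sentence ("Olá", as in the curl example below).
os.makedirs("./data", exist_ok=True)
with open("./data/data.json", "w") as fp:
    json.dump({"data": "Olá"}, fp, ensure_ascii=False)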

requirement.txt

Prior to starting model deployment, packages from requirement.txt are installed as part of setting up the environment.

NOTE: If your requirement.txt contains local file paths (such as pandas @ file:///...), this is caused by a known issue with pip freeze in version 20.1. As a workaround, generate the file with the following command instead:

pip list --format=freeze > requirement.txt

Other Artifacts

This refers to any other files and folders in the repository that your model or predict.py depends on.

Step 3: Test the Repo using Dryrun

from aibro.inference import Inference
api_url = Inference.deploy(
    artifacts_path = "./aibro_repo",
    dryrun = True,
)

Dryrun locally validates the repo structure and tests to ensure that an inference result can be successfully returned.

Step 4: Create an inference API with one-line code

from aibro.inference import Inference
api_url = Inference.deploy(
    model_name = "my_fancy_transformer",
    machine_id_config = "c5.large.od",
    artifacts_path = "./aibro_repo",
)

Assume the formatted model repo is saved at path "./aibro_repo". It can now be used to create an inference job. The model name must be unique among all currently active inference jobs under your profile.

In this example, we deploy a public custom model from "./aibro_repo" called "my_fancy_transformer" on the on-demand machine "c5.large.od"; the request is authenticated with your access token or account credentials.

Once the deployment is complete, an API URL is returned with the syntax:

http://api.aipaca.ai/v1/{username}/{client_id}/{model_name}/predict

{client_id}: if your inference job is public, {client_id} is simply "public". Otherwise, {client_id} indicates one of your clients' IDs.

In this tutorial, the API URL is:

http://api.aipaca.ai/v1/{username}/public/my_fancy_transformer/predict

Step 5: Test an Aibro API with curl:

curl -X POST "http://api.aipaca.ai/v1/{username}/public/my_fancy_transformer/predict" -d '{"data": "Olá"}'
# replace {username} with your own username

In this example, we demonstrate the use of the Aibro API using the curl utility. However, feel free to use whatever API tool you feel comfortable with.

Note: The syntax when using curl depends on the file type in the data folder. In this tutorial, we use a JSON file.

File Type  Syntax
json       curl -X POST {{api url}} -d '{"your": "data"}'
           curl -X POST {{api url}} -F file=@'path/to/json/file'
txt        curl -X POST {{api url}} -d 'your data'
           curl -X POST {{api url}} -F file=@'path/to/txt/file'
csv        curl -X POST {{api url}} -F file=@'path/to/csv/file'
others     curl -X POST {{api url}} -F file=@'path/to/zip/file'

You may have observed some patterns from the syntax lookup table above. The rules are summarized as follows:

  1. json and txt data can be posted either inline with -d or as a file with -F file=@.
  2. csv data must be posted as a file with -F file=@.
  3. Any other file type must be zipped and posted with -F file=@.

Important! The posted data will replace everything in the data folder. Therefore, the data that you post should have the same format as what was originally there.

Tip: If your inference time is more than one minute, we recommend either reducing the data size or increasing the --keepalive-time value when using curl.
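
If you prefer calling the API from Python rather than curl, a minimal sketch using the requests library (assuming the JSON data format from this tutorial; the endpoint's expected headers are not specified here) would be:

import requests

# Replace {username} with your own username.
api_url = "http://api.aipaca.ai/v1/{username}/public/my_fancy_transformer/predict"

# Mirrors: curl -X POST {{api url}} -d '{"data": "Olá"}'
response = requests.post(api_url, data='{"data": "Olá"}')
print(response.text)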

Step 6: Limit API Access to Specific Clients (Optional)

As the API owner, you have the option of restricting access to specific clients by assigning them a unique ID. This mitigates the risk of receiving an overwhelming number of API requests from different clients. The client ID is included with API endpoints, as shown in step 4. If no client ID is added, this inference job will be public.

from aibro.inference import Inference
Inference.update_clients(
    job_id = job_id,
    add_client_ids = ["client_1", "client_2"]
)

Step 7: Complete Job

Once the inference job is no longer required, to avoid unnecessary costs, please remember to close it with the Inference.complete() method.

from aibro.inference import Inference
Inference.complete(job_id)

aibro.Inference Methods

deploy()

def deploy(
    artifacts_path: Union[str, list],
    model_name: str = None,
    machine_id_config: Union[str, dict] = None,
    dryrun: bool = False,
    cool_down_period_s: int = 600,
    client_ids: List[str] = [],
    access_token: str = None,
    description: str = "",
    wait_request_s: int = 30,
) -> str:

The deploy() method starts an inference job, deploying models on cloud instances.

Parameters

artifacts_path: Union[str, list]
The path to the formatted repository. The input can be either a string or a list:

Type  Syntax                              Meaning
str   "path/to/repo"                      The whole repo is uploaded and deployed on the cloud.
list  ["./path/to/model",                 Select important artifacts one by one. Aibro combines and
       "./path/to/predict.py",            recreates a model repository called aibro_repo under the
       "./path/to/data",                  root path of the aibro library.
       "./path/to/requirement.txt",
       "./path/to/other_artifacts"]

model_name: str = None
This parameter specifies the model name used in the deployed inference job. It can be set to None if, and only if, dryrun is set to True. Within the scope of the user profile, the model name must be unique among the active inference jobs.

machine_id_config: Union[str, dict]
This parameter specifies the machine configuration to be used to deploy the model. In the configuration, machines are categorized as either Standby or Cooling.

The machine_id_config can be set to None if and only if dryrun is set to True. The input type can be either a string or a dictionary:

Type  Syntax                                                     Meaning
str   "c5.large.od"                                              Use the on-demand "c5.large" instance as the standby instance.
dict  {"standby": "c5.large.od"}                                 Use the on-demand "c5.large" instance as the standby instance.
dict  {"standby": "c5.large.od", "cooling": "g4dn.4xlarge.od"}   Use the on-demand "c5.large" instance as the standby instance and the on-demand "g4dn.4xlarge" instance as the cooling instance.

Important:

  1. The machine ID has to be on-demand (ending with .od).
  2. One standby instance is mandatory, whereas a cooling instance is optional (e.g. a syntax such as {"cooling": "g4dn.4xlarge.od"} is invalid).

For more details about the usage of standby and cooling instances, check out the section below.

dryrun: bool = False
The dryrun option is used to perform a local test of the model. When set to True, the deploy() method will validate the structure of the repository and test whether the inference result can be successfully returned.

cool_down_period_s: int = 600
This parameter specifies the cool-down period of a cooling instance. By default, the cooling instance will stop after there have been no new inference requests for 600 seconds (10 minutes).

client_ids: List[str] = []
This argument restricts access to your inference API to specific clients. Client IDs can be any strings you choose, provided there are no duplicates. If no client IDs are specified, the inference job is public.

The Inference.update_clients() method can be used to add/remove client IDs. As mentioned in the tutorial, the client ID is used as a part of the API URL.

The table below defines each role.

Role        Description
API owner   The person who created the inference job
API client  Individuals who have access to the inference job

access_token: str = None
The access token is used to authenticate the API request. If its value is None, the client’s email and password are required for the request to be accepted.

description: str = ""
The description is used to briefly explain the inference job, allowing users to better distinguish between jobs.

wait_request_s: int = 30
This parameter specifies the number of seconds that Aibro will wait for instance requests to be fulfilled.
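
Putting these parameters together, a hedged end-to-end example (the model name, machine IDs, client IDs, and description are illustrative placeholders) might look like:

from aibro.inference import Inference

# Illustrative deployment; replace the values with your own.
api_url = Inference.deploy(
    artifacts_path = "./aibro_repo",
    model_name = "my_fancy_transformer",
    machine_id_config = {"standby": "c5.large.od", "cooling": "g4dn.4xlarge.od"},
    cool_down_period_s = 600,               # cooling instance stops after 10 idle minutes
    client_ids = ["client_1", "client_2"],  # omit to keep the API public
    description = "Portuguese to English translator",
    wait_request_s = 30,
)
print(api_url)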

complete()

def complete(
    model_name: str = None,
    job_id: str = None,
    access_token: str = None,
):

The complete() method signals the end of an inference, shutting down all of its API services.

Parameters

model_name: str = None
This parameter is used to identify the inference job to be stopped. Since a model name is required to be unique, it can be used instead of the job_id to locate it. If values for both model_name and job_id are specified then the job_id will be used for the search.

job_id: str = None
This identifies the deployed inference job using the job ID. Specifying an ID will cause the model_name parameter to be ignored.

access_token: str = None
The access token is used to authenticate the API request. If its value is None, the client’s email and password are required for the request to be accepted.
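
For example, since model names are unique among active jobs, the tutorial's job could equally be completed by model name instead of job ID (a sketch reusing the tutorial's model name):

from aibro.inference import Inference

# Close the inference job by its (unique) model name instead of its job ID.
Inference.complete(model_name = "my_fancy_transformer")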

update_clients()

def update_clients(
    add_client_ids: Union[str, List[str]] = [],
    remove_client_ids: Union[str, List[str]] = [],
    be_public: bool = False,
    model_name: str = None,
    job_id: str = None,
    access_token: str = None,
) -> List[str]:

The update_clients() method is used to modify the list of authorized client IDs for the inference job.

Parameters

add_client_ids: Union[str, List[str]] = []
This parameter adds a single client ID or a list of client IDs. If duplicate IDs are found within the list, or an ID has already been authorized, the update is canceled.

remove_client_ids: Union[str, List[str]] = []
This parameter will remove a single client ID or a list of client IDs from the set of authorized clients. IDs that are not found will be ignored.

be_public: bool = False
Setting the be_public parameter to True removes all of the client IDs from the access list, leaving the inference job accessible to all.

model_name: str = None
This parameter is used to identify the model within the inference job. Since a model name is required to be unique, it can be used instead of the job_id to locate it. If values for both model_name and job_id are specified then the job_id will be used for the search.

job_id: str = None
This identifies the deployed inference job using the job ID. Specifying an ID will cause the model_name parameter to be ignored.

access_token: str = None
The access token is used to authenticate the API request. If its value is None, the client’s email and password are required for the request to be accepted.
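
As a hedged example (the client IDs are placeholders), several changes can be made in one call:

from aibro.inference import Inference

# Authorize a new client and revoke an existing one on the tutorial's job.
client_ids = Inference.update_clients(
    add_client_ids = ["client_3"],
    remove_client_ids = ["client_1"],
    model_name = "my_fancy_transformer",
)
print(client_ids)  # the updated list of authorized client IDs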

list_clients()

def list_clients(
    model_name: str = None,
    job_id: str = None,
    access_token: str = None,
) -> List[str]:

The list_clients() method returns a list of client IDs for a specific inference job.

Parameters

model_name: str = None
This parameter is used to identify the model within the inference job. Since a model name is required to be unique, it can be used instead of the job_id to locate it. If values for both model_name and job_id are specified then the job_id will be used for the search.

job_id: str = None
This identifies the deployed inference job using the job ID. Specifying an ID will cause the model_name parameter to be ignored.

access_token: str = None
The access token is used to authenticate the API request. If its value is None, the client’s email and password are required for the request to be accepted.
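
For instance, the clients authorized on the tutorial's job can be inspected with a one-liner:

from aibro.inference import Inference

# Returns the client IDs currently authorized on the inference job.
clients = Inference.list_clients(model_name = "my_fancy_transformer")
print(clients)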

Standby and Cooling instances

Definitions:

Instances are classified in one of two ways: Standby or Cooling.

  1. Standby instance: stays on for the entire life of the inference job. It handles inference requests while the cooling instance is off and invokes the cooling instance when new requests arrive.
  2. Cooling instance: an optional second instance that handles inference requests while it is on and stops after the cool-down period (cool_down_period_s) passes without a new request.

If both the standby and cooling instances are turned on, the cooling instance has higher priority for handling incoming inference requests.

Configure for non-uniform traffic

Dealing with non-uniform traffic and being able to scale are important capabilities, and it helps to understand how sporadic, non-uniform traffic should shape your configuration. In short, a properly configured combination of standby and cooling instances leads you toward optimal pricing for your usage.

Consider an inference API for a web application whose call frequency is proportional to the site traffic. If the traffic is typically non-uniform, such as the case where there is more traffic during the day and relatively little at night, then the configuration should be set accordingly. To take advantage of known traffic patterns, you might set a CPU or cheaper GPU instance as the standby instance and dedicate a powerful GPU to the cooling instance. With this approach, the powerful cooling instance efficiently handles intensive traffic during the day, while the standby instance saves costs because it is the only instance running at night. Importantly, this balance maintains near real-time performance.
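
A hedged sketch of such a configuration (the machine IDs below are illustrative; any valid on-demand CPU and GPU machine IDs can be substituted):

from aibro.inference import Inference

# Cheap CPU standby covers overnight traffic; a powerful GPU cooling instance
# spins up for daytime load and stops after 10 idle minutes.
api_url = Inference.deploy(
    artifacts_path = "./aibro_repo",
    model_name = "my_fancy_transformer",
    machine_id_config = {"standby": "c5.large.od", "cooling": "g4dn.12xlarge.od"},
    cool_down_period_s = 600,
)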

Configure for high traffic

On the other hand, some sites have uniform traffic. Constant traffic is common, for example, in applications depended on by clients operating in many different time zones. If your site traffic is more uniform and people never stop using it, a standby instance with a powerful GPU is the recommended configuration.

The reason for this becomes clear when you consider the previous configuration, which uses a weak standby processor and a powerful cooling one. With constant traffic, the cooling instance would never shut down because the timeout period would never expire. Consequently, the lesser-powered standby processor would not be utilized and thus wasted.
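
For that case, a minimal sketch simply places the powerful GPU on standby, with no cooling instance (the machine ID is taken from the case study below):

from aibro.inference import Inference

# Constant traffic: keep a powerful GPU on standby and skip the cooling instance.
api_url = Inference.deploy(
    artifacts_path = "./aibro_repo",
    model_name = "my_fancy_transformer",
    machine_id_config = {"standby": "g4dn.12xlarge.od"},
)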

Case study: calculate the saving

Graph explanation: This graph describes an inference job with configuration machine_id_config = {"standby": "g4dn.4xlarge", "cooling": "g4dn.12xlarge.od"}. The job operated between 8:00 am and 11:59 pm, and most of the inference requests (IRs) were received during the morning and afternoon.

Every time an IR was received by the cooling instance, the cooldown period reset. If the cooldown period passed by without receiving an IR, the instance stopped, and the standby instance handled the next IR. At that time, the standby invoked the previously stopped cooling instance. During the period it took the cooling instance to fully come online, the standby instance processed the IRs.

Over the course of the 16 hours, cooling instances were turned on for 6 hours and the standby instance was never turned off. The relevant machine pricing is shown below:

Machine id        Pricing
g4dn.12xlarge.od  $3.91/hr
g4dn.4xlarge.od   $1.20/hr
g4dn.12xlarge     $1.61/hr
g4dn.4xlarge      $0.36/hr

Let's compare the savings with and without the hybrid configuration.

Configuration              machine_id_config                                            Cost                           Spot Cost
Without standby & cooling  {"standby": "g4dn.12xlarge.od"}                              3.91 * 16 = $62.56             $25.76
With standby & cooling     {"standby": "g4dn.4xlarge", "cooling": "g4dn.12xlarge.od"}   1.20 * 16 + 3.91 * 6 = $42.66  $15.42

In this scenario, the savings from using the hybrid standby and cooling configuration was more than 30%. Furthermore, if spot instances were available over the period, Aibro would further cut the cost from $62.56 to $15.42. This is a savings of more than 75%!
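
As a quick sanity check of the arithmetic above (prices and hours taken from the tables in this section):

# On-demand cost over the 16-hour window.
without_hybrid = 3.91 * 16           # $62.56, g4dn.12xlarge.od standby only
with_hybrid = 1.20 * 16 + 3.91 * 6   # $42.66, g4dn.4xlarge.od standby + 6 h of cooling

print(f"hybrid saving: {1 - with_hybrid / without_hybrid:.0%}")        # ~32%

# If spot instances were available over the whole period.
spot_hybrid = 0.36 * 16 + 1.61 * 6   # $15.42
print(f"spot hybrid saving: {1 - spot_hybrid / without_hybrid:.0%}")   # ~75%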

Inference Job & Instance Status

Once a job starts, its states and substates are updated on the Jobs page within the Aibro Console.

Job Status  Description
QUEUING     Waiting to be deployed
DEPLOYED    Inference API is ready to be used
CANCELED    Canceled due to errors
COMPLETED   Job was completed

Job Substatus      Description
REQUESTING SERVER  Requesting an instance to deploy models
CONNECTING SERVER  Connecting to an initializing instance
GEARING UP ENV     Gearing up TensorFlow and mounting GPUs
DEPLOYING MODEL    Deploying the model
DEPLOYED MODEL     Model was deployed
CANCELED           Canceled due to errors
COMPLETED          Completed the job

Instance Status  Description
LAUNCHING        Setting up the instance for inference
EXECUTING        Jobs being processed and ready to receive requests
COOLING          Within the cooling period (must be a cooling instance)
CLOSING          Stopping/terminating the instance
CLOSED           Instance has been stopped/terminated

Instance Substatus      Description
STOPPING/STOPPED        Shut down the instance but retain the root volume
TERMINATING/TERMINATED  Completely delete the instance

The following table is a status-substatus map of jobs and instances.

Job Status  Job Substatus      Instance Status         Instance Substatus
QUEUING     REQUESTING SERVER
QUEUING     CONNECTING SERVER  LAUNCHING
QUEUING     GEARING UP ENV     LAUNCHING
QUEUING     DEPLOYING MODEL    LAUNCHING
DEPLOYED    DEPLOYED MODEL     EXECUTING/COOLING
CANCELED    CANCELED           COOLING/CLOSING/CLOSED  COOLING/(STOPPING, TERMINATING)/(STOPPED, TERMINATED)
COMPLETED   COMPLETED          COOLING/CLOSING/CLOSED  COOLING/(STOPPING, TERMINATING)/(STOPPED, TERMINATED)

Cloud Instance

The words "machine", "server", and “instance” are used interchangeably in the following content.

Spot Vs On-demand Instance

Machine id

By simply adding ".od" after the machine ID, a spot instance is converted to an on-demand instance (e.g. p2.xlarge is a spot instance and p2.xlarge.od is on-demand).

Pricing

Spot instances are usually 70% cheaper than their corresponding on-demand instances.

Availability

As a tradeoff, spot instance requests are not always fulfilled. We define "availability" as the success probability of an instance request.

The availability of on-demand instances is effectively 100%: provided there is sufficient capacity in the marketplace, an on-demand instance is guaranteed. There is a small chance that AWS will reach capacity and create a bottleneck, although this cannot be detected until the request encounters an error during a job.

The availability of spot instances varies with instance type and request time. We have found that, in general, the more powerful instance types have lower availability (e.g. p3.2xlarge is less available than p2.xlarge). Not surprisingly, we have also found that spot instances are more often available during non-business hours.

Reliability

When choosing a configuration, it is relevant that spot instances can be interrupted by AWS, whereas on-demand instances are always stable.

Contact Us

If you have comments, questions, or concerns then please reach out to us in one of the following ways:

  1. Recommended: reach us through the Discord Community, where you can direct message our support team.
  2. Use the "Contact Us" button on our website.
  3. Send us a message using the aibro.comm.Comm.send_message() method.
  4. Email us at hello@aipaca.ai

Data Privacy

Each Aibro inference stores only the metadata from inference requests. This is done for the purpose of service improvement.