Introduction - Aibro Training

Aibro Version: 1.1.5 (alpha)

Last Documentation Update: May 20, 2022

Definition: API embedded python library

Aibro is a serverless MLOps tool that helps data scientists train AI models and run inference on cloud platforms in 2 minutes.

This document focuses on cloud training. If you are also interested in inference, see the inference documentation.


Aibro uses an email and password to grant access to the API. Accounts are registered on the Aibro website.

Authentication is required the first time an API from the Aibro Python library is called.

Support Environment

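Framework     Version
----------    --------
Tensorflow    <= 2.5.1

Cloud Platform    Spot Instance    On-demand Instance
--------------    -------------    ------------------
AWS               Yes              Yes

Training Data Type    Maximum Data Size
------------------    -----------------
NumPy                 2 GB

Limit                    Amount
---------------------    ------
Max Active Jobs          5
Max Active Instances     5
Max Stopped Instances    4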

If more environment support is required, please feel free to Contact Us.

We are working hard to support a wider variety of environments soon. Thank you for your patience.
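The limits in the Support Environment tables can be checked locally before a job is submitted. The following is a minimal sketch and not part of the Aibro library: the function names are hypothetical, "2 GB" is assumed to mean 2 GiB, and the size limit is assumed to apply to the combined training arrays.

```python
import re
from typing import List, Tuple

MAX_TF_VERSION = (2, 5, 1)        # from the Framework table
MAX_DATA_BYTES = 2 * 1024 ** 3    # assumption: "2 GB" read as 2 GiB


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '2.5.1' into (2, 5, 1); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_environment(tf_version: str, data_bytes: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if version_tuple(tf_version) > MAX_TF_VERSION:
        problems.append(f"Tensorflow {tf_version} is newer than 2.5.1")
    if data_bytes > MAX_DATA_BYTES:
        problems.append(f"training data is {data_bytes} bytes, over the 2 GB limit")
    return problems
```

With NumPy arrays, data_bytes could be computed as train_X.nbytes + train_Y.nbytes, and tf_version read from tf.__version__.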

Start The First Training Job on Aibro

Step 1: Install

pip install aibro

Install the aibro Python library with pip.

If "OSError: protocol not found" appears, the /etc/protocols file is missing. The following command should resolve the error:

sudo apt-get -o Dpkg::Options::="--force-confmiss" install --reinstall netbase

Step 2: Prepare model & data

import tensorflow as tf
from tensorflow import keras
def get_mnist_data():
    num_val_samples = 100

    # Load the MNIST dataset as NumPy arrays.
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

    # Preprocess the data (these are Numpy arrays)
    x_train = x_train.reshape(-1, 784).astype("float32") / 255
    x_test = x_test.reshape(-1, 784).astype("float32") / 255
    y_train = y_train.astype("float32")
    y_test = y_test.astype("float32")

    # Reserve num_val_samples samples for validation
    x_val = x_train[-num_val_samples:]
    y_val = y_train[-num_val_samples:]
    x_train = x_train[:-num_val_samples]
    y_train = y_train[:-num_val_samples]
    return x_train, y_train, x_val, y_val

def get_compiled_FFNN_model():
    # Make a simple 2-layer densely-connected neural network.
    inputs = keras.Input(shape=(784,))
    x = keras.layers.Dense(256, activation="relu")(inputs)
    x = keras.layers.Dense(256, activation="relu")(x)
    outputs = keras.layers.Dense(10)(x)
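    model = keras.Model(inputs, outputs)
    # Compile the model so it matches the function name and is ready to fit.
    # The optimizer, loss, and metric below are illustrative choices.
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    return model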

train_X, train_Y, validation_X, validation_Y = get_mnist_data()
model = get_compiled_FFNN_model()

As an example, we used a custom feed-forward neural network (FFNN) as the model and MNIST as the dataset. You can plug in your own model; just remember to confirm that it is compatible with the Support Environment.

Step 3: Cloud training with one-line code

from aibro import Training

job_id, result_model, history = Training.online_fit(
    model=model,
    train_X=train_X,
    train_Y=train_Y,
    validation_data=(validation_X, validation_Y),
    description="my first training job",
)

Now it is time to train the model on the cloud with one line of code: Training.online_fit().

Once the job starts, job status updates are displayed in real time on the Jobs page of the Aibro Console.

This example used a basic online_fit(). To explore more features, please check out the aibro.Training section.



def online_fit(
    model: tensorflow.keras.models.Model,
    train_X: numpy.ndarray,
    train_Y: numpy.ndarray,
    machine_ids: List[str] = None,
    batch_size: int = 1,
    epochs: int = 1,
    validation_data: Tuple[np.array, np.array] = None,
    description: str = "",
    cool_down_period_s: int = 0,
    fit_kwargs: Dict[str, Any] = {},
    directory_to_save_ckpt: str = None,
    directory_to_save_log: str = None,
    wait_request_s: int = 30,
    wait_new_job_create_s: int = -1,
    wait_new_job_create_interval: int = 10,
    record: bool = False,
) -> Tuple[Optional[str], Optional[TF_Model], Optional[History]]

Online fit is essentially a synchronous fit: once it is called, model fitting progress is shown in real time. We use the word "online" because the internet connection must stay up while training runs (offline_fit is coming soon).

Every call to online_fit creates a new training job unless the user has reached the maximum number of active jobs or instances.

The parameters were designed to stay as close to the TensorFlow style as possible.


model: tensorflow.keras.models.Model
The machine learning model to be trained.

train_X: numpy.ndarray
The input training data fed to the model.

train_Y: numpy.ndarray
The output (label) training data fed to the model.

machine_ids: List[str] = None
The cloud machines used to train the model. The machines are requested in the order of list; for example, in the case of ["p2.xlarge", "g4dn.4xlarge"], the API will try to request p2.xlarge first. If p2.xlarge is not available or has no capacity within wait_request_s seconds, g4dn.4xlarge will be requested next.

If machine_ids is None, a select-action message will pop up.

batch_size: int = 1
The training batch size. It is recommended to set the value to a multiple of the number of GPUs (see the Distributed Training section).

epochs: int = 1
The number of training epochs.

validation_data: Tuple[np.array, np.array] = None
The input and output validation data fed to the model, in the order (validation_X, validation_Y).

description: str = ""
A description that helps you tell training jobs apart. We highly recommend giving every job a unique description for easy lookup.

cool_down_period_s: int = 0
The time in seconds that an idle instance will be held before termination. This is an important concept. Please check out the Cooling Period section for more details.

fit_kwargs: Dict[str, Any] = {}
Extra keyword arguments passed into the model's fit call.

directory_to_save_ckpt: str = None
The directory used to save checkpoints. Checkpoints are saved once per epoch. If None, model checkpoints won't be saved on your local machine.

directory_to_save_log: str = None
The directory used to save the TensorBoard log. If None, the log file won't be saved on your local machine.

wait_request_s: int = 10000
The time in seconds to wait for an instance request to be fulfilled. A long wait_request_s helps when requesting a spot instance with low availability.

wait_new_job_create_s: int = -1
The time in seconds to wait for a new job to be created. This parameter is helpful when the maximum number of active jobs or instances has been reached: it lets the new job wait until one of the active jobs or instances finishes. If the value is non-positive, the new job will wait 99999999 seconds (basically forever).

wait_new_job_create_interval: int = 10
The interval in seconds between checks for whether a slot has opened up for the new job.

record: bool = False
Turn on record mode to report issues. With record mode turned on, our support team can easily reproduce issues. Before using this feature, we recommend reading the Report Issue section and its Privacy Items.
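The fallback order described for machine_ids can be pictured with a small, self-contained sketch. It is not Aibro's implementation: try_request is a hypothetical stand-in for an instance request that either succeeds within wait_request_s seconds or fails, and the availability state is made up for illustration.

```python
from typing import Callable, List, Optional


def first_fulfilled(
    machine_ids: List[str],
    try_request: Callable[[str], bool],
) -> Optional[str]:
    """Request machines in list order and return the first one that succeeds."""
    for machine_id in machine_ids:
        if try_request(machine_id):
            return machine_id
    return None  # no machine in the list could be obtained


# Pretend p2.xlarge has no capacity right now (made-up state).
unavailable = {"p2.xlarge"}
chosen = first_fulfilled(
    ["p2.xlarge", "g4dn.4xlarge"],
    lambda machine_id: machine_id not in unavailable,
)
print(chosen)
```

Listing a cheaper spot machine first and an on-demand machine last is one way to use this ordering.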

Cooling Period

The time in seconds that an idle instance will be held before termination. This is an important concept.

To use the cooling period, set cool_down_period_s to a non-zero value and request the same machine_id in your next job; Aibro will then automatically reuse the same instance for that job. Such instances are called cooling instances.

The main benefit of a cooling period is that a job reusing a cooling instance skips the gear-up time (see Setup Speed).

However, we don't encourage long cooling periods for spot instances: spot instances are not always available, so holding one may reduce the capacity left for other users. Therefore, if a spot instance is held longer than a BASELINE period, its price increases by UNIT_PERCENTAGE per minute. After 1/UNIT_PERCENTAGE minutes, the price reaches its maximum, which is the same as the on-demand price.

In this version, the variables are set as the following:

Variable Value
BASELINE 10 minutes
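One way to read the pricing rule is as a linear ramp from the spot price to the on-demand price once the BASELINE has passed. The sketch below works under that assumption and is not Aibro's billing code: the function name is hypothetical, and because no value for UNIT_PERCENTAGE is listed here, it is passed in as a fraction (0.05 meaning 5% per minute) purely for illustration.

```python
def cooling_spot_price(
    spot_price: float,
    on_demand_price: float,
    held_minutes: float,
    unit_percentage: float,
    baseline_minutes: float = 10.0,  # BASELINE from the table
) -> float:
    """Hourly price of a held spot instance under the assumed linear ramp."""
    # Minutes held beyond the free BASELINE period.
    extra_minutes = max(0.0, held_minutes - baseline_minutes)
    # Fraction of the way from spot to on-demand price, capped at 100%.
    fraction = min(1.0, extra_minutes * unit_percentage)
    return spot_price + (on_demand_price - spot_price) * fraction


# Illustrative numbers only: spot 1.00, on-demand 3.00, 5% per minute.
for minutes in (5, 20, 30, 60):
    print(minutes, round(cooling_spot_price(1.0, 3.0, minutes, 0.05), 2))
```

Under these assumptions the price stays at the spot level for the first 10 minutes and stops rising once it equals the on-demand price.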

Distributed Training

If a multi-GPU machine is selected (e.g. p3.8xlarge), Aibro automatically trains the model on all visible GPUs. We use tf.distribute.MirroredStrategy to implement synchronous training.

MirroredStrategy shards each batch evenly across the GPUs. To increase GPU utilization, the batch size should be a multiple of the number of GPUs. For example, p3.8xlarge has 4 V100 GPUs, so batch_size should be a multiple of 4, such as 4, 8, or 16.
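The batch-size recommendation can be applied with a small helper. This is a minimal sketch with a hypothetical function name; rounding down (rather than up) is an arbitrary choice made here so the result never exceeds the requested size, except when the request is smaller than the GPU count.

```python
def round_batch_size(batch_size: int, num_gpus: int) -> int:
    """Round batch_size down to a multiple of num_gpus, with num_gpus as the floor."""
    if batch_size < 1 or num_gpus < 1:
        raise ValueError("batch_size and num_gpus must be positive")
    return max(num_gpus, (batch_size // num_gpus) * num_gpus)


# p3.8xlarge has 4 GPUs, so a requested batch size of 30 becomes 28.
print(round_batch_size(30, 4))
```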



def get_tensorboard_logs(
    job_id: str,
    directory: str = ".",
)

This method downloads TensorBoard logs by training job ID. Use a TensorBoard command such as tensorboard --logdir logs to open the board.


job_id: str
Training job's ID.

directory: str = "."
Directory path to save the decoded Tensorboard log. The log path would be {directory}/logs/{job_id}/.
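The log location follows directly from the {directory}/logs/{job_id}/ pattern. The helper below is a hypothetical convenience, not an Aibro function, that builds the path so it can be handed to tensorboard --logdir; the job ID is made up.

```python
from pathlib import Path


def tensorboard_log_dir(job_id: str, directory: str = ".") -> Path:
    """Build the path where the logs for job_id are expected to be saved."""
    return Path(directory) / "logs" / job_id


# "abc123" is a made-up job ID for illustration.
log_dir = tensorboard_log_dir("abc123", directory="runs")
print(f"tensorboard --logdir {log_dir.as_posix()}")
```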


def plot_timeline(job_id: str, char_type: str = None)

This method breaks down the time spent in a training job from end to end.


job_id: str
The target Job's id.

char_type: str = None
The type of chart. If its value is "pie", a pie chart would be plotted. Otherwise, it is a funnel chart.


This method can be called at any time, even if the job has not ended.

The timeline gives a little more insight than the job status.

From the beginning to the end, the following time periods are shown on timeline plots:

Timeline Sample


def replay_job(
    job_id: str,
    description: str = "",
    directory_to_save_ckpt: str = ".",
    directory_to_save_log: str = ".",
    wait_request_s: int = 10000,
)

Once a recorded job is submitted, the Aibro team uses the replay_job() method to reproduce the issue. You may also use it to check whether the reported issue is reproducible.

Record your job

We would appreciate it if you turned on the record parameter in online_fit() before reporting an issue. In recorded jobs, Aibro stores the model & data to replay the issue.

Note: Although AIpaca uses the data only for service improvement, the "record" privacy items should be double-checked so that recording won't violate your or your agencies' IP privacy. If extra privacy protection is needed, please don't hesitate to contact us through one of the ways listed in the Contact Us section. The AIpaca team is always here to help you out.


job_id: str
The recorded job's id.

description: str = ""
The issue description. More detail makes the issue easier for us to diagnose.

directory_to_save_ckpt: str = None
The directory used to save checkpoints. Checkpoints are saved once per epoch. If None, model checkpoints won't be saved on your local machine.

directory_to_save_log: str = None
The directory used to save the TensorBoard log. If None, the log file won't be saved on your local machine.

wait_request_s: int = 10000
The time in seconds to wait for an instance request to be fulfilled. A long enough wait_request_s helps when requesting a spot instance with low availability.

Training Job & Instance Status

Once a job starts, its states and substates are updated on the Jobs page of Aibro Console.

Job Status    Description
----------    -----------------------------------------------------
QUEUING       Waiting for training
TRAINING      During training process or returning training results
CANCELED      Canceled due to some errors
COMPLETED     Completed the job

Job Substatus           Description
--------------------    -----------------------------------------------
REQUESTING SERVER       Requesting an instance to train models
CONNECTING SERVER       Connecting an initializing instance
GEARING UP ENV          Gearing up tensorflow and mounting GPUs
SENDING MODEL & DATA    Sending model and training data to the instance
TRAINING                Training model
RETURNING               Returning trained model
CANCELED                Canceled due to some errors
COMPLETED               Completed the job

Instance Status    Description
---------------    ------------------------------------
LAUNCHING          Setting up instance for training
EXECUTING          Having jobs in training process
COOLING            Within Cooling Period
CLOSING            Stopping/terminating instance
CLOSED             Instance has been stopped/terminated

Instance Substatus        Description
----------------------    -----------------------------------------
STOPPING/STOPPED          Shut down instance but retain root volume
TERMINATING/TERMINATED    Completely delete the instance
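A script that tracks jobs might model the job statuses as an enumeration. The sketch below is hypothetical and not part of the Aibro library; treating CANCELED and COMPLETED as the only final states is an assumption read off the status descriptions.

```python
from enum import Enum


class JobStatus(Enum):
    QUEUING = "QUEUING"
    TRAINING = "TRAINING"
    CANCELED = "CANCELED"
    COMPLETED = "COMPLETED"


# Assumption: a job in one of these states will not change status again.
FINAL_STATUSES = {JobStatus.CANCELED, JobStatus.COMPLETED}


def is_finished(status: str) -> bool:
    """Return True if the status string names a final job state."""
    return JobStatus(status) in FINAL_STATUSES


print(is_finished("TRAINING"), is_finished("COMPLETED"))
```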

In Aibro's use case, setup speed is the main advantage of a stopped instance over a terminated instance.

The following table is a status-substatus map of jobs and instances.

Job Status Job Substatus Instance Status Instance Substatus
---------- ------------ ---------------------- --------------------------
---------- ------------ ---------------------- --------------------------



def available_machines()

# Sample Output:
# Machine Id: g4dn.12xlarge  GPU Type: 4xT4     num_vCPU: 48    cost: 1.43      capacity: 2 availability: 32.0%
# Machine Id: g4dn.12xlarge.od GPU Type: 4xT4     num_vCPU: 48    cost: 3.91      capacity: 3 availability: 100%
# Machine Id: g4dn.16xlarge  GPU Type: 1xT4     num_vCPU: 64    cost: 1.31      capacity: 2 availability: 13.0%
# Machine Id: g4dn.16xlarge.od GPU Type: 1xT4     num_vCPU: 64    cost: 4.35      capacity: 2 availability: 100%
# Machine Id: g4dn.4xlarge   GPU Type: 1xT4     num_vCPU: 16    cost: 0.36      capacity: 8 availability: 86.0%

This method is used to grab machine information from the Aibro Marketplace.
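If the machine list is available as text in the sample output format, it can be turned into structured records for filtering or sorting. The following is a sketch under that assumption; parse_machines is a hypothetical helper, and the regular expression only covers the fields visible in the sample.

```python
import re
from typing import Dict, List

LINE_RE = re.compile(
    r"Machine Id:\s*(?P<machine_id>\S+)\s+"
    r"GPU Type:\s*(?P<gpu_type>\S+)\s+"
    r"num_vCPU:\s*(?P<num_vcpu>\d+)\s+"
    r"cost:\s*(?P<cost>[\d.]+)\s+"
    r"capacity:\s*(?P<capacity>\d+)\s+"
    r"availability:\s*(?P<availability>[\d.]+)%"
)


def parse_machines(text: str) -> List[Dict[str, object]]:
    """Parse lines in the sample output format; lines that don't match are skipped."""
    machines = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue
        machines.append({
            "machine_id": match["machine_id"],
            "gpu_type": match["gpu_type"],
            "num_vcpu": int(match["num_vcpu"]),
            "cost": float(match["cost"]),
            "capacity": int(match["capacity"]),
            "availability": float(match["availability"]),  # percent
        })
    return machines


sample = """\
Machine Id: g4dn.12xlarge  GPU Type: 4xT4     num_vCPU: 48    cost: 1.43      capacity: 2 availability: 32.0%
Machine Id: g4dn.12xlarge.od GPU Type: 4xT4     num_vCPU: 48    cost: 3.91      capacity: 3 availability: 100%
"""
machines = parse_machines(sample)
# Sort by cost to find the cheapest option in the list.
print(min(machines, key=lambda m: m["cost"])["machine_id"])
```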

Two concepts in marketplace:


def send_message(
    email: str,
    feedback_message: str,
    category: str = "random",
)

This method sends feedback to Aibro support directly.


email: str
Registered email address.

feedback_message: str
Anything you want to say to us.

category: str = "random"
Category of the message. The category should be one of ['random', 'feature_request', 'bug_report']
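Since category accepts only three values, a caller may want to validate it before sending. The check below is a hypothetical local helper, not part of aibro.Comm; whether the library itself rejects other values is not stated here.

```python
ALLOWED_CATEGORIES = ("random", "feature_request", "bug_report")


def validate_category(category: str = "random") -> str:
    """Return category unchanged if it is allowed, otherwise raise ValueError."""
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(
            f"category must be one of {ALLOWED_CATEGORIES}, got {category!r}"
        )
    return category


print(validate_category("bug_report"))
```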

Cloud Instance

We will use the words "machine" and "server" interchangeably with "instance" in the following content.

Spot Vs On-demand Instance

Machine id

Adding ".od" after a machine id converts the instance type from spot to on-demand (e.g. p2.xlarge is spot and p2.xlarge.od is on-demand).
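The ".od" naming convention is easy to handle with a few string helpers. These functions are hypothetical conveniences written for illustration, not part of the Aibro library.

```python
OD_SUFFIX = ".od"


def is_on_demand(machine_id: str) -> bool:
    """True if the machine id names an on-demand instance."""
    return machine_id.endswith(OD_SUFFIX)


def to_on_demand(machine_id: str) -> str:
    """Return the on-demand id for a spot or on-demand machine id."""
    return machine_id if is_on_demand(machine_id) else machine_id + OD_SUFFIX


def to_spot(machine_id: str) -> str:
    """Return the spot id for a spot or on-demand machine id."""
    return machine_id[: -len(OD_SUFFIX)] if is_on_demand(machine_id) else machine_id


print(to_on_demand("p2.xlarge"), to_spot("p2.xlarge.od"))
```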


Spot instances are usually 70% cheaper than their corresponding on-demand instances.


As a tradeoff, spot instance requests are not always fulfilled. We defined the term "availability" as the success probability of instance request.

Clearly, the availability of on-demand instances is always 100%. Therefore, you are guaranteed to get an on-demand instance as long as there is enough capacity in the Aibro marketplace. In rare cases AWS itself can run out of capacity, but this is not detectable until the request error occurs in a job.

The availability of spot instances varies by instance type and request time. In general, we found that more powerful instance types have lower availability (e.g. p3.2xlarge is less available than p2.xlarge). Meanwhile, spot instances are usually more available during non-working hours.


Spot instances have a chance to be interrupted by AWS. On-demand instances are always stable.

Setup Speed

The stop feature allows on-demand instances to be set up faster than spot instances. The first time a new instance is requested, tensorflow always needs around 5-7 minutes to gear up before training starts. Unlike spot instances, which can only be terminated, on-demand instances can be stopped, which reduces their gear-up time on the second use to 0.5-1.5 minutes. Both instance types can set a Cooling Period, in which case there is no gear-up time at all.

In addition to the longer gear-up time, a spot instance needs extra time for its request to be fulfilled.

The following table compares the job timelines (refer to plot_timeline()) of new, stopped, and cooling instances when training the same model & data on the same instance.

Type       Gear-up time (minutes)    Timeline
-------    ----------------------    --------
new        5-7                       Timeline
stopped    0.5-1.5                   Timeline
cooling    0                         Timeline

Contact Us

You could reach out to us in one of the following ways:

  1. Most recommended: direct message our support team in the Discord Community
  2. Use the "Contact Us" button on our website
  3. Send a message with aibro.Comm.send_message()
  4. Email us at

Data Privacy

Each Aibro feature needs a different level of data access. Usually, the more access a feature has, the better the experience it provides (e.g. more time and money saved). Of course, whether to use those features is entirely your decision. By default, online_fit won't have any access. The following table gives an overview of the privacy items for each feature.

Feature           Server Access    Model Access    Data Access
--------------    -------------    ------------    -----------
Cooling server    Yes              No              No
Stopped server    Yes              No              No
Report job        No               Yes             Yes

Reason
Cooling server: Aibro needs to retain server access to reuse cooling instances. The access is permanently deleted once the instance is terminated.
Stopped server: Aibro needs to retain server access to restart stopped instances. The access is permanently deleted once the instance is terminated.
Report job: The support team needs the model and data to replay the job and diagnose the reported issues.