Introduction - Aibro Training
Aibro Version: 1.1.5 (alpha)
Last Documentation Update: May 20, 2022
Definition: API embedded python library
Aibro is a serverless MLOps tool that helps data scientists train AI models and run inference on cloud platforms in 2 minutes.
This document focuses on cloud training. If you are also interested in inference, here is the inference document link.
Authentication
Aibro uses email & password to allow access to the API. Accounts are registered at the Aibro website.
Authentication is required the first time an API from the Aibro Python library is called.
Support Environment
Framework | Version |
---|---|
Tensorflow | <= 2.5.1 |
Cloud Platform | Spot Instance | On-demand Instance |
---|---|---|
AWS | Yes | Yes |
Training Data Type | Maximum Data Size |
---|---|
NumPy | 2 GB |
Limit | Amount |
---|---|
Max Active Jobs | 5 |
Max Active Instances | 5 |
Max Stopped Instances | 4 |
If more environment support is required, please feel free to Contact Us.
We are working hard to support more varieties of environments shortly. Thank you for your patience.
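The limits in the tables above can be checked locally before a job is submitted. The following is a minimal sketch, not part of the Aibro library: the helper names are hypothetical, and it assumes that "2 GB" means 2 * 1024**3 bytes and applies to the combined size of all training arrays, which the tables do not spell out.

```python
import re

# Limits copied from the Support Environment tables.
MAX_TF_VERSION = (2, 5, 1)
# Assumption: "2 GB" is read as 2 * 1024**3 bytes and covers all arrays together.
MAX_DATA_BYTES = 2 * 1024**3


def tf_version_supported(version: str) -> bool:
    """Return True if a version string such as '2.5.0' is <= 2.5.1."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups()) <= MAX_TF_VERSION


def data_within_limit(*arrays) -> bool:
    """Return True if the combined size of the arrays fits the limit.

    Works with any object exposing `nbytes`, such as numpy.ndarray.
    """
    return sum(a.nbytes for a in arrays) <= MAX_DATA_BYTES


print(tf_version_supported("2.5.1"))  # True
print(tf_version_supported("2.6.0"))  # False
```

In a training script this could be called as `tf_version_supported(tf.__version__)` and `data_within_limit(train_X, train_Y)`.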
Start The First Training Job on Aibro
Step 1: Install
pip install aibro
Install the aibro Python library with pip.
If OSError: protocol not found shows up, it is caused by a missing /etc/protocols file. This command should resolve the error: sudo apt-get -o Dpkg::Options::="--force-confmiss" install --reinstall netbase
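To confirm whether the protocols file is the culprit before reinstalling netbase, a quick check along these lines may help. This is a sketch for Linux systems; `protocols_db_ok` is a hypothetical helper, and it assumes the error surfaces through the standard library call `socket.getprotobyname`, which raises OSError when a protocol name cannot be resolved.

```python
import os
import socket


def protocols_db_ok(path: str = "/etc/protocols") -> bool:
    """Return True if the protocols file exists and 'tcp' can be resolved."""
    if not os.path.isfile(path):
        return False
    try:
        socket.getprotobyname("tcp")
    except OSError:
        # Raised as "protocol not found" when the lookup fails.
        return False
    return True


if not protocols_db_ok():
    print("Protocol lookup failed; the netbase package may need reinstalling.")
```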
Step 2: Prepare model & data
import tensorflow as tf
from tensorflow import keras

def get_mnist_data():
    num_val_samples = 100
    # Load the MNIST dataset as NumPy arrays.
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    # Preprocess the data (these are NumPy arrays)
    x_train = x_train.reshape(-1, 784).astype("float32") / 255
    x_test = x_test.reshape(-1, 784).astype("float32") / 255
    y_train = y_train.astype("float32")
    y_test = y_test.astype("float32")
    # Reserve num_val_samples samples for validation
    x_val = x_train[-num_val_samples:]
    y_val = y_train[-num_val_samples:]
    x_train = x_train[:-num_val_samples]
    y_train = y_train[:-num_val_samples]
    return x_train, y_train, x_val, y_val

def get_compiled_FFNN_model():
    # Make a simple 2-layer densely-connected neural network.
    inputs = keras.Input(shape=(784,))
    x = keras.layers.Dense(256, activation="relu")(inputs)
    x = keras.layers.Dense(256, activation="relu")(x)
    outputs = keras.layers.Dense(10)(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    return model

train_X, train_Y, validation_X, validation_Y = get_mnist_data()
model = get_compiled_FFNN_model()
As an example, we use a custom feed-forward neural network (FFNN) as the model and MNIST as the dataset. You can plug in your own model; just remember to confirm the model's compatibility with the Support Environment section.
Step 3: Cloud training with one-line code
from aibro.training import Training

job_id, result_model, history = Training.online_fit(
    model=model,
    train_X=train_X,
    train_Y=train_Y,
    validation_data=(validation_X, validation_Y),
    machine_ids=["p2.xlarge.od"],
    batch_size=8,
    epochs=15,
    description="my first training job",
)
Now it is time to train the model on the cloud with one line of code: Training.online_fit().
Once the job starts, job status updates are displayed on the Jobs page of the Aibro Console in real time.
This example used a basic online_fit(). To explore more features, please check out the aibro.Training section.
aibro.Training
online_fit()
def online_fit(
    model: tensorflow.keras.models.Model,
    train_X: numpy.ndarray,
    train_Y: numpy.ndarray,
    machine_ids: List[str] = None,
    batch_size: int = 1,
    epochs: int = 1,
    validation_data: Tuple[np.array, np.array] = None,
    description: str = "",
    cool_down_period_s: int = 0,
    fit_kwargs: Dict[str, Any] = {},
    directory_to_save_ckpt: str = None,
    directory_to_save_log: str = None,
    wait_request_s: int = 30,
    wait_new_job_create_s: int = -1,
    wait_new_job_create_interval: int = 10,
    record: bool = False,
) -> Tuple[Optional[str], Optional[TF_Model], Optional[History]]
Online fit is essentially a synchronous fit. Once it is called, model fitting progress is shown in real time. We use the word "online" because it requires the internet connection to stay up while training (offline_fit is coming soon).
Every call of online fit creates a new training job unless the maximum limit of active jobs or instances per user is reached.
When designing the parameters, we tried to stay as close to the tensorflow style as possible.
Parameters
model: tensorflow.keras.models.Model
The machine learning model to be trained.
train_X: numpy.ndarray
The input training data fed to the model.
train_Y: numpy.ndarray
The output training data fed to the model.
machine_ids: List[str] = None
The cloud machines used to train the model. The machines are requested in list order; for example, in the case of ["p2.xlarge", "g4dn.4xlarge"], the API will try to request p2.xlarge first. If p2.xlarge is not available or has no capacity within wait_request_s seconds, g4dn.4xlarge will be requested next.
If machine_ids is None, a select-action message will pop up.
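The fallback order described for machine_ids can be pictured with a small simulation. This sketch does not call Aibro: `try_request` is a hypothetical stand-in for the library's internal instance request, which is described as waiting up to wait_request_s seconds per machine.

```python
from typing import Callable, List, Optional


def first_available(
    machine_ids: List[str],
    try_request: Callable[[str], bool],
) -> Optional[str]:
    """Return the first machine id whose request succeeds, else None."""
    for machine_id in machine_ids:
        # Machines are tried strictly in list order.
        if try_request(machine_id):
            return machine_id
    return None


# Pretend that only g4dn.4xlarge can be fulfilled right now.
chosen = first_available(
    ["p2.xlarge", "g4dn.4xlarge"],
    lambda machine_id: machine_id == "g4dn.4xlarge",
)
print(chosen)  # g4dn.4xlarge
```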
batch_size: int = 1
The training batch size. It is recommended to set the value to a multiple of the number of GPUs (more details).
epochs: int = 1
The training epochs.
validation_data: Tuple[np.array, np.array] = None
The input and output validation data feeding to the model. The order is (validation_X, validation_Y).
description: str = ""
A description that helps you tell training jobs apart. We highly recommend giving every job a unique description for easy lookup.
cool_down_period_s: int = 0
The time in seconds that an idle instance will be held before termination. This is an important concept.
Please check out the Cooling Period section for more details.
fit_kwargs: Dict[str, Any] = {}
The extra keyword arguments passed to model.fit().
directory_to_save_ckpt: str = None
The directory used to save checkpoints. Checkpoints are stored per epoch. If None, the model checkpoints won't be saved on your local machine.
directory_to_save_log: str = None
The directory used to save the tensorboard log. If None, the log file won't be saved on your local machine.
wait_request_s: int = 10000
The time in seconds to wait for an instance request to be fulfilled. A long wait_request_s is helpful when requesting a spot instance with low availability.
wait_new_job_create_s: int = -1
The time in seconds to wait for a new job to be created. This parameter is helpful when the maximum number of active jobs or instances is reached. It allows the new job to wait until one of the active jobs or instances is finished. If the value is non-positive, the new job will wait 99999999 seconds (basically forever).
wait_new_job_create_interval: int = 10
The time in seconds used to check whether there is a spot for the new job to be created.
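The interplay between wait_new_job_create_s and wait_new_job_create_interval can be sketched as a simple polling loop. This is an illustration of the described behavior, not Aibro's implementation: `slot_free` is a hypothetical callback that reports whether a job slot has opened up, and the `sleep` and `clock` parameters exist only to make the loop easy to test.

```python
import time
from typing import Callable

FOREVER_S = 99999999  # value quoted above for non-positive wait times


def wait_for_slot(
    slot_free: Callable[[], bool],
    wait_new_job_create_s: int = -1,
    wait_new_job_create_interval: int = 10,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `slot_free` until it returns True or the wait budget runs out."""
    budget = wait_new_job_create_s if wait_new_job_create_s > 0 else FOREVER_S
    deadline = clock() + budget
    while True:
        if slot_free():
            return True
        if clock() >= deadline:
            return False
        # Wait one interval before checking for a free slot again.
        sleep(wait_new_job_create_interval)


print(wait_for_slot(lambda: True))  # True, a slot is free immediately
```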
record: bool = False
Turn on record mode to report issues. With record mode turned on, our support team can easily reproduce issues. Before using the feature, we recommend reading more details in the Report Issue section and its Privacy Items.
Cooling Period
The cooling period is the time in seconds that an idle instance is held before termination.
To use a cooling period, set cool_down_period_s to a non-zero value and request the same machine_id in your next job; Aibro will then automatically pick up the same instance for that job. Instances held this way are called cooling instances.
The cooling period has the following benefits:
- Avoids environment gear up time (around 5-6 mins). When a new instance is requested, it takes over 5 mins to mount GPUs and gear up tensorflow modules. Cooling instances, on the other hand, have no environment gear up time.
- Saves money. Environment gear up time costs extra money.
However, we don't encourage users to set a long cooling period for spot instances: spot instances are not always available, and a long cooling period may reduce the capacity available to other users. Therefore, if a spot instance is held longer than a BASELINE period, its price increases by UNIT_PERCENTAGE per minute. After 1/UNIT_PERCENTAGE minutes, its price reaches a maximum, which is the same as its on-demand price.
In this version, the variables are set as the following:
Variable | Value |
---|---|
BASELINE | 10 minutes |
UNIT_PERCENTAGE | 1% |
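One way to read the pricing rule above is as a linear ramp from the spot price to the on-demand price. The sketch below encodes that reading under a stated assumption: each minute past BASELINE closes UNIT_PERCENTAGE of the gap between the two prices, so the cap is reached 1/UNIT_PERCENTAGE minutes after the baseline. The function name is hypothetical and the actual billing formula may differ.

```python
BASELINE_MIN = 10       # minutes, from the table above
UNIT_PERCENTAGE = 0.01  # 1% per minute, from the table above


def held_spot_price(spot_price: float, on_demand_price: float,
                    held_minutes: float) -> float:
    """Price of a cooling spot instance after being held for `held_minutes`.

    Assumption: the increment is a share of the gap between the spot and
    on-demand prices, capped at the on-demand price.
    """
    minutes_over = max(0.0, held_minutes - BASELINE_MIN)
    fraction = min(1.0, minutes_over * UNIT_PERCENTAGE)
    return spot_price + (on_demand_price - spot_price) * fraction


# Within the baseline the spot price is unchanged.
print(held_spot_price(1.0, 3.0, held_minutes=5))    # 1.0
# Far beyond the ramp the price is capped at the on-demand price.
print(held_spot_price(1.0, 3.0, held_minutes=500))  # 3.0
```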
Distributed Training
If a multi-GPU machine is selected (e.g. p3.8xlarge), Aibro automatically trains the model with all visible GPUs. We use tf.distribute.MirroredStrategy (reference) to implement synchronous training.
MirroredStrategy evenly shards each batch across the GPUs. To increase GPU utilization, the batch size should be a multiple of the number of GPUs. For an instance such as p3.8xlarge, batch_size should be one of 4, 8, 12, 16, ... because it has 4 V100 GPUs.
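A small helper can round a desired batch size up to the nearest multiple of the GPU count. This is a sketch with a hypothetical function name; the GPU count must be supplied by the caller, for example 4 for p3.8xlarge as stated above.

```python
def round_up_to_multiple(batch_size: int, num_gpus: int) -> int:
    """Round `batch_size` up to the nearest multiple of `num_gpus`."""
    if batch_size < 1 or num_gpus < 1:
        raise ValueError("batch_size and num_gpus must be positive")
    # Ceiling division, then scale back up to a multiple of num_gpus.
    return -(-batch_size // num_gpus) * num_gpus


print(round_up_to_multiple(10, 4))  # 12
print(round_up_to_multiple(8, 4))   # 8
```

With a batch size of 12 on 4 GPUs, each GPU receives 3 samples per step.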
aibro.Job
get_tensorboard_logs()
def get_tensorboard_logs(
job_id: str,
directory: str="."
)
This method downloads Tensorboard logs by training job id. Use a Tensorboard command such as tensorboard --logdir logs to open the board.
Parameters
job_id: str
Training job's ID.
directory: str = "."
Directory path to save the decoded Tensorboard log. The log path would be {directory}/logs/{job_id}/.
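The log location described above can be computed ahead of time, for instance to build the tensorboard command for a single job. A minimal sketch follows; the helper name and the job id are hypothetical placeholders.

```python
from pathlib import Path


def tensorboard_log_dir(job_id: str, directory: str = ".") -> Path:
    """Build the path {directory}/logs/{job_id} described above."""
    return Path(directory) / "logs" / job_id


# "my_job_id" is a placeholder, not a real job id.
log_dir = tensorboard_log_dir("my_job_id", directory="downloads")
print(f"tensorboard --logdir {log_dir.as_posix()}")
```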
plot_timeline()
def plot_timeline(job_id: str, char_type: str = None)
This method breaks down the time spent by a training job from end to end.
Parameters
job_id: str
The target Job's id.
char_type: str = None
The type of chart. If its value is "pie", a pie chart would be plotted. Otherwise, it is a funnel chart.
Timeline
This method can be called at any time, even if the job has not ended.
The timeline gives a little more insight than the job status.
From the beginning to the end, the following time periods are shown on timeline plots:
- Job Create: time taken from code execution to job creation.
- Request Launch: time taken for a spot request to be fulfilled; this period only applies to spot instances.
- Instance Connect: time taken from request fulfillment to a successfully established instance connection.
- API Transfer & Server Setup: time taken to set up Aibro API infrastructure in the instance.
- M&D Serialization: time taken to serialize model and training data.
- Env Gear up: time taken to gear up tensorflow.
- M&D Transfer: time taken to transfer model and training data from your local machine to the instance.
- M&D Deserialization: time taken to deserialize model and training data.
- Model Training: time taken to train the model.
- Result Serialization: time taken to serialize the trained model and other relevant objects.
- Result Transfer: time taken to transfer the trained model and other relevant objects.
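If the durations of these periods are available as numbers, the share of time spent on actual training versus setup overhead is easy to compute. The sketch below is an assumption-laden illustration: plot_timeline() is only documented as drawing a chart, so the dictionary of durations is a hypothetical input that you would have to fill in yourself.

```python
# Period names copied from the list above.
TIMELINE_PERIODS = [
    "Job Create",
    "Request Launch",
    "Instance Connect",
    "API Transfer & Server Setup",
    "M&D Serialization",
    "Env Gear up",
    "M&D Transfer",
    "M&D Deserialization",
    "Model Training",
    "Result Serialization",
    "Result Transfer",
]


def training_share(durations_s: dict) -> float:
    """Fraction of the total timeline spent in the Model Training period.

    `durations_s` maps period names to seconds; missing periods count as 0.
    """
    total = sum(durations_s.get(name, 0.0) for name in TIMELINE_PERIODS)
    if total == 0:
        return 0.0
    return durations_s.get("Model Training", 0.0) / total
```

A low share suggests that most of the job time went into setup, which is where a cooling or stopped instance helps.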
replay_job()
def replay_job(
job_id: str,
description: str = "",
directory_to_save_ckpt: str = ".",
directory_to_save_log: str = ".",
wait_request_s: int = 10000,
)
Once a recorded job is submitted, the Aibro team uses the replay_job() method to reproduce the issue. You may also use it to check whether the reported issue is reproducible.
Record your job
We would appreciate it if you turned on the record parameter in online_fit() before reporting an issue. In recorded jobs, Aibro stores the model & data to replay the issue.
Note: Although AIpaca uses the data for service improvement purposes only, please double-check the "record" privacy items so that recording won't violate your or your agency's IP privacy. If extra privacy protection is needed, please don't hesitate to contact us through one of the ways listed in the Contact Us section. The AIpaca team is always here to help you out.
Parameters
job_id: str
The recorded job's id.
description: str = ""
The issue description. More details help us diagnose the issue more easily.
directory_to_save_ckpt: str = None
The directory used to save checkpoints. Checkpoints are stored per epoch. If None, the model checkpoints won't be saved on your local machine.
directory_to_save_log: str = None
The directory used to save the tensorboard log. If None, the log file won't be saved on your local machine.
wait_request_s: int = 10000
The time in seconds to wait for an instance request to be fulfilled. A long enough wait_request_s is helpful when requesting a spot instance with low availability.
Training Job & Instance Status
Once a job starts, its status and substatus are updated on the Jobs page of the Aibro Console.
Job Status | Description |
---|---|
QUEUING | Waiting for training |
TRAINING | During training process or returning training results |
CANCELED | Canceled due to some errors |
COMPLETED | Completed the job |
Job Substatus | Description |
---|---|
REQUESTING SERVER | Requesting an instance to train models |
CONNECTING SERVER | Connecting to an initializing instance |
GEARING UP ENV | Gearing up tensorflow and mounting GPUs |
SENDING MODEL & DATA | Sending model and training data to the instance |
TRAINING | Training model |
RETURNING | Returning trained model |
CANCELED | Canceled due to some errors |
COMPLETED | Completed the job |
Instance Status | Description |
---|---|
LAUNCHING | Setting up instance for training |
EXECUTING | Having jobs in training process |
COOLING | Within Cooling Period |
CLOSING | Stopping/terminating instance |
CLOSED | Instance has been stopped/terminated |
Instance Substatus | Description |
---|---|
STOPPING/STOPPED | Shut down instance but retain root volume Reference |
TERMINATING/TERMINATED | Completely delete the instance Reference |
In Aibro's use case, setup speed is the main advantage of a stopped instance over a terminated instance.
The following table is a status-substatus map of jobs and instances.
Job Status | Job Substatus | Instance Status | Instance Substatus |
---|---|---|---|
QUEUING | REQUESTING SERVER | | |
QUEUING | CONNECTING SERVER | LAUNCHING | |
QUEUING | GEARING UP ENV | LAUNCHING | |
QUEUING | SENDING MODEL & DATA | LAUNCHING | |
TRAINING | TRAINING | EXECUTING | |
TRAINING | RETURNING | EXECUTING | |
CANCELED | CANCELED | COOLING/CLOSING/CLOSED | COOLING/(STOPPING, TERMINATING)/(STOPPED, TERMINATED) |
COMPLETED | COMPLETED | COOLING/CLOSING/CLOSED | COOLING/(STOPPING, TERMINATING)/(STOPPED, TERMINATED) |
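The status-substatus map can also be written as lookup tables, which is handy when interpreting a substatus seen on the Jobs page. This sketch simply transcribes the table; the dictionary and function names are hypothetical and are not part of the Aibro library.

```python
# Job substatus -> job status, transcribed from the table.
JOB_STATUS_BY_SUBSTATUS = {
    "REQUESTING SERVER": "QUEUING",
    "CONNECTING SERVER": "QUEUING",
    "GEARING UP ENV": "QUEUING",
    "SENDING MODEL & DATA": "QUEUING",
    "TRAINING": "TRAINING",
    "RETURNING": "TRAINING",
    "CANCELED": "CANCELED",
    "COMPLETED": "COMPLETED",
}

# Job substatus -> possible instance statuses, transcribed from the table.
INSTANCE_STATUS_BY_JOB_SUBSTATUS = {
    "REQUESTING SERVER": (),  # no instance has been assigned yet
    "CONNECTING SERVER": ("LAUNCHING",),
    "GEARING UP ENV": ("LAUNCHING",),
    "SENDING MODEL & DATA": ("LAUNCHING",),
    "TRAINING": ("EXECUTING",),
    "RETURNING": ("EXECUTING",),
    "CANCELED": ("COOLING", "CLOSING", "CLOSED"),
    "COMPLETED": ("COOLING", "CLOSING", "CLOSED"),
}


def is_job_finished(job_substatus: str) -> bool:
    """Return True once the job has reached CANCELED or COMPLETED."""
    return JOB_STATUS_BY_SUBSTATUS[job_substatus] in ("CANCELED", "COMPLETED")


print(JOB_STATUS_BY_SUBSTATUS["GEARING UP ENV"])  # QUEUING
print(is_job_finished("RETURNING"))               # False
```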
aibro.Comm
available_machines()
def available_machines()
# Sample Output:
# Machine Id: g4dn.12xlarge GPU Type: 4xT4 num_vCPU: 48 cost: 1.43 capacity: 2 availability: 32.0%
# Machine Id: g4dn.12xlarge.od GPU Type: 4xT4 num_vCPU: 48 cost: 3.91 capacity: 3 availability: 100%
# Machine Id: g4dn.16xlarge GPU Type: 1xT4 num_vCPU: 64 cost: 1.31 capacity: 2 availability: 13.0%
# Machine Id: g4dn.16xlarge.od GPU Type: 1xT4 num_vCPU: 64 cost: 4.35 capacity: 2 availability: 100%
# Machine Id: g4dn.4xlarge GPU Type: 1xT4 num_vCPU: 16 cost: 0.36 capacity: 8 availability: 86.0%
This method is used to grab machine information from the Aibro Marketplace.
Two concepts in the marketplace:
- Capacity: the number of instances that are requestable.
- Availability: the success probability of an instance request.
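If the machine list is needed in a script, lines in the format of the sample output can be parsed with a regular expression. This sketch assumes the text format shown in the sample output above; whether available_machines() also returns structured data is not documented here, and `parse_machine_line` is a hypothetical helper.

```python
import re

LINE_RE = re.compile(
    r"Machine Id:\s*(?P<machine_id>\S+)\s+"
    r"GPU Type:\s*(?P<gpu_type>\S+)\s+"
    r"num_vCPU:\s*(?P<num_vcpu>\d+)\s+"
    r"cost:\s*(?P<cost>[\d.]+)\s+"
    r"capacity:\s*(?P<capacity>\d+)\s+"
    r"availability:\s*(?P<availability>[\d.]+)%"
)


def parse_machine_line(line: str):
    """Parse one sample-output line into a dict, or return None on mismatch."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "machine_id": match["machine_id"],
        "gpu_type": match["gpu_type"],
        "num_vcpu": int(match["num_vcpu"]),
        "cost": float(match["cost"]),
        "capacity": int(match["capacity"]),
        # Stored as a probability between 0 and 1.
        "availability": float(match["availability"]) / 100,
    }


sample = ("Machine Id: g4dn.4xlarge GPU Type: 1xT4 num_vCPU: 16 "
          "cost: 0.36 capacity: 8 availability: 86.0%")
info = parse_machine_line(sample)
print(info["machine_id"], info["capacity"])  # g4dn.4xlarge 8
```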
send_message()
def send_message(
email: str,
feedback_message: str,
category: str = "random"
)
This method sends feedback to Aibro support directly.
Parameters
email: str
Registered email address.
feedback_message: str
Anything you want to say to us.
category: str = "random"
Category of the message. The category should be one of ['random', 'feature_request', 'bug_report']
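Since the category must be one of three values, a client-side check before calling send_message() can catch typos early. This is a sketch with a hypothetical helper; how the library itself reacts to an unknown category is not documented here.

```python
ALLOWED_CATEGORIES = ("random", "feature_request", "bug_report")


def validate_category(category: str = "random") -> str:
    """Return `category` if it is allowed, otherwise raise ValueError."""
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(
            f"category must be one of {list(ALLOWED_CATEGORIES)}, got {category!r}"
        )
    return category


print(validate_category("bug_report"))  # bug_report
```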
Cloud Instance
We will use the words "machine" and "server" interchangeably with "instance" in the following content.
Spot Vs On-demand Instance
Machine id
Simply add ".od" after a machine id to convert the instance type from spot to on-demand (e.g. p2.xlarge is spot and p2.xlarge.od is on-demand).
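The ".od" naming rule is easy to wrap in helpers when building machine_ids lists programmatically. A minimal sketch follows; the function names are hypothetical.

```python
OD_SUFFIX = ".od"


def is_on_demand(machine_id: str) -> bool:
    """Return True if the machine id refers to an on-demand instance."""
    return machine_id.endswith(OD_SUFFIX)


def to_on_demand(machine_id: str) -> str:
    """Convert a spot machine id to its on-demand counterpart."""
    return machine_id if is_on_demand(machine_id) else machine_id + OD_SUFFIX


def to_spot(machine_id: str) -> str:
    """Convert an on-demand machine id to its spot counterpart."""
    if is_on_demand(machine_id):
        return machine_id[: -len(OD_SUFFIX)]
    return machine_id


print(to_on_demand("p2.xlarge"))  # p2.xlarge.od
print(to_spot("p2.xlarge.od"))    # p2.xlarge
```

For example, `["p2.xlarge", to_on_demand("p2.xlarge")]` asks for the spot instance first and falls back to the on-demand one.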
Pricing
Spot instances are usually 70% cheaper than their corresponding on-demand instances.
Availability
As a tradeoff, spot instance requests are not always fulfilled. We define the term "availability" as the success probability of an instance request.
The availability of on-demand instances is always 100%. Therefore, you are guaranteed to get an on-demand instance as long as there is enough capacity in the Aibro marketplace. With a small probability, AWS itself can run out of capacity, but this is not detectable until the request error occurs in a job.
The availability of spot instances varies by instance type and request time. In general, we found that more powerful instance types have lower availability (e.g. p3.2xlarge is less available than p2.xlarge). Meanwhile, spot instances are usually more available during non-working hours.
Reliability
Spot instances have a chance to be interrupted by AWS. On-demand instances are always stable.
Setup Speed
The stop feature allows on-demand instances to be set up faster than spot instances. The first time a new instance is requested, tensorflow always needs around 5-7 minutes to gear up before training starts. Unlike spot instances, which can only be terminated, on-demand instances can be stopped, which reduces their second-time gear up time to 0.5-1.5 minutes. Both instance types can set a Cooling Period, in which case there is no gear up time at all.
Besides a longer gear up time, a spot instance needs extra time to fulfill its request.
The following table compares the job timelines (refer to plot_timeline()) of new, stopped, and cooling instances when training the same model & data on the same instance.
Type | gear up time (minutes) | timeline |
---|---|---|
new | 5-7 | Timeline |
stopped | 0.5-1.5 | Timeline |
cooling | 0 | Timeline |
Contact Us
You could reach out to us in one of the following ways:
- Most recommended: Discord Community to direct message our support team
- Use the "Contact Us" button from our website
- Send a message with aibro.Comm.send_message()
- Email us at hello@aipaca.ai
Data Privacy
Each Aibro feature needs a different level of data access. Usually, the more access a feature needs, the better the experience it provides (e.g. more time and money saved). Of course, whether to use those features is entirely your decision. By default, online_fit won't have any access. The following table gives an overview of the privacy items for each feature.
Feature | Server access | Model Access | Data Access | Reason |
---|---|---|---|---|
Cooling server | Yes | No | No | Aibro needs to retain server access to reuse cooling instances. The access is permanently deleted once the instance is terminated |
Stopped server | Yes | No | No | Aibro needs to retain server access to restart stopped instances. The access is permanently deleted once the instance is terminated |
Report job | No | Yes | Yes | Support team needs the Model & data to replay the job and diagnose the reported issues |