PGPYML
Python is a popular programming language for developing machine learning solutions. You may use it to do some exploratory data analysis and write a model that makes predictions over new data. When you are satisfied with the results produced by your model you will want to deploy it, but this task can be time-consuming. With pgpyml you can easily use your model as an in-database solution and take advantage of this approach.
Data Scientists may not know how to deploy models, or may not know the tooling needed to do it. On the other hand, Software Engineers may not have the knowledge to build machine learning models. The database is where Software Engineers and Data Scientists meet, and pgpyml is an extension that allows you to use your models written in Python inside PostgreSQL. You can make predictions directly on the data being inserted into your database, and you can also create triggers that reject insertions based on those predictions. With this extension, Data Scientists can focus on building models while Software Engineers build the MVP.
This extension is under development; it is designed to be compatible with sklearn and is not yet ready for production environments.
Installation
This section explains how to install the pgpyml extension. First follow the Prerequisites section, then choose either the pgxnclient instructions or the Git installation. Once the installation is done you can create the extension in your database.
Alternatively you can use the Vagrant machine to test it.
Prerequisites
First of all, you need to install the Python libraries used to write and run your machine learning models:
pip3 install numpy scikit-learn pandas
This command installs the numpy, scikit-learn and pandas libraries. The pgpyml extension expects your models to be built with sklearn.
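To confirm the libraries are importable by the Python interpreter you installed them for, you can run a quick check like the one below (the printed version numbers will depend on your environment):

```python
# Sanity check: confirm the libraries pgpyml relies on can be imported.
import numpy
import pandas
import sklearn

for module in (numpy, pandas, sklearn):
    print(module.__name__, module.__version__)
```

If any import fails, re-run the pip3 command above for the interpreter that PostgreSQL will use.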
You will also need to install the Python language extension (PL/Python) for PostgreSQL. You can do this with the command:
# Replace the <version> with your PostgreSQL version
apt -y install postgresql-plpython3-<version>
Replace <version> in the command with the PostgreSQL version you are using. For example, to install the Python extension for PostgreSQL 14:
apt -y install postgresql-plpython3-14
Install with PGXNClient
You can install the extension using pgxnclient; install it first if you don't already have it:
apt install pgxnclient
And then you can install the pgpyml extension with:
pgxn install pgpyml
Install with Git
You can install pgpyml with Git by cloning the repository and running the installation script. To do this, first clone the extension's repository:
git clone -b v0.3.0 https://github.com/Minoro/pgpyml.git
Change v0.3.0 to the desired version. Then, inside the downloaded folder, run the make targets:
make install
make clean
Create the Extension
Once the extension is installed you can create it in your PostgreSQL database:
-- Create a new schema to install the extension
CREATE SCHEMA pgpyml;
-- Create the extension in the new schema
CREATE EXTENSION pgpyml SCHEMA pgpyml CASCADE;
Using the Vagrant Machine
The pgpyml repository includes a Vagrantfile that you can use to test this extension; the repository also includes an example of how to train and save a model. The example uses the UCI Iris Dataset, which is also included in the repository.
To use the Vagrant machine, navigate to the vagrant folder and run:
vagrant up # Initiate the machine
vagrant ssh # Access the machine
After that you will be able to connect to PostgreSQL at localhost:5555 from your host machine.
Although using the Vagrantfile from the extension repository is not required, the examples below use the paths of the virtual machine.
Train Your Model
You can train your model as you are used to. The sklearn documentation has many examples that you can use. In this section we will build a minimal example using Python and sklearn and deploy it on PostgreSQL.
We will use the Iris Dataset as an example: we will build a model to predict the class of new instances and then use it in PostgreSQL.
Let's create a file named train.py with the code:
# Script: train.py
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
from joblib import dump

# Load the Iris dataset bundled with sklearn
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Train a decision tree on the full dataset
model = DecisionTreeClassifier()
model.fit(X, y)

# Persist the trained model so PostgreSQL can load it later
dump(model, '/home/vagrant/models/iris_decision_tree.joblib')
After running this script you will find a file named iris_decision_tree.joblib at the path passed to dump. This is the model we will deploy on PostgreSQL. You can read more about persisting your models in the sklearn documentation.
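Before deploying, you may want to verify that the saved model can be loaded back and still predicts correctly, since this round trip is exactly what happens on the server side. A minimal sketch, using a temporary path instead of the Vagrant paths above:

```python
import os
import tempfile

from joblib import dump, load
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

# Train the same kind of model as in train.py
iris = datasets.load_iris()
model = DecisionTreeClassifier().fit(iris.data, iris.target)

# Persist and reload the model, as pgpyml will do when predicting
path = os.path.join(tempfile.mkdtemp(), "iris_decision_tree.joblib")
dump(model, path)
restored = load(path)

# The restored model classifies new rows just like the original
print(restored.predict([[5.2, 3.5, 1.5, 0.2], [7.7, 2.8, 6.7, 2.0]]))
```

If the reloaded model fails here, it will also fail inside PostgreSQL, so this is a cheap check before copying the .joblib file to the server.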
Deploy on PostgreSQL
Now you have everything you need to deploy your machine learning model on PostgreSQL. Run the following command to make predictions on new data:
SELECT * FROM pgpyml.predict(
'/home/vagrant/examples/iris/models/iris_decision_tree.joblib',
ARRAY[[5.2,3.5,1.5,0.2], [7.7,2.8,6.7,2.0]]
);
This command loads your model and uses it to classify the given rows. The first parameter is the path of the model you want to use; the second is an array with the data you want to predict.
Next Steps
This extension has more functions that you can use to run your models. You can read more about loading and unloading models on the IO Page, and more about predictions and triggers on the Prediction Page.