Simple machine learning with PostgreSQL
PostgresML is a PostgreSQL extension that enables you to perform ML training and inference on text and tabular data using SQL queries. With PostgresML, you can seamlessly integrate machine learning models into your PostgreSQL database and harness the power of cutting-edge algorithms to process text and tabular data efficiently.
- Perform natural language processing (NLP) tasks such as sentiment analysis, question answering, translation, summarization, and text generation
- Access thousands of state-of-the-art language models such as GPT-2, GPT-J, and GPT-Neo from the 🤗 Hugging Face model hub
- Fine-tune large language models (LLMs) on your own text data for different tasks
Translation
SQL query
SELECT pgml.transform(
'translation_en_to_fr',
inputs => ARRAY[
'Welcome to the future!',
'Where have you been all this time?'
]
) AS french;
Result
french
------------------------------------------------------------
[
{"translation_text": "Bienvenue à l'avenir!"},
{"translation_text": "Où êtes-vous allé tout ce temps?"}
]
Sentiment Analysis
SQL query
SELECT pgml.transform(
task => 'text-classification',
inputs => ARRAY[
'I love how amazingly simple ML has become!',
'I hate doing mundane and thankless tasks. ☹️'
]
) AS positivity;
Result
positivity
------------------------------------------------------
[
{"label": "POSITIVE", "score": 0.9995759129524232},
{"label": "NEGATIVE", "score": 0.9903519749641418}
]
- 47+ classification and regression algorithms
- 8 - 40X faster inference than HTTP based model serving
- Millions of transactions per second
- Horizontal scalability
Training a classification model
Training
SELECT * FROM pgml.train(
'Handwritten Digit Image Classifier',
'classification',
'pgml.digits',
'target',
-- PostgreSQL requires named arguments to come after positional ones
algorithm => 'xgboost'
);
Inference
SELECT pgml.predict(
'My Classification Project',
ARRAY[0.1, 2.0, 5.0]
) AS prediction;
A PostgresML installation consists of three parts: a PostgreSQL database, a Postgres extension for machine learning, and a dashboard app. The extension provides all the machine learning functionality and can be used on its own from any SQL IDE. The dashboard app provides an easy-to-use interface for writing SQL notebooks and for running and tracking ML experiments and models.
Step 1: Clone this repository
git clone git@github.com:postgresml/postgresml.git
Step 2: Start dockerized services. PostgresML will run on port 5433, just in case you already have Postgres running. You can find Docker installation instructions here
cd postgresml
docker-compose up
Step 3: Connect to the PostgreSQL database with PostgresML enabled, using a SQL IDE or psql
postgres://postgres@localhost:5433/pgml_development
If you want to check out the functionality without the hassle of Docker, you can start PostgresML by signing up for a free account here. We will provide 5 GiB of disk space on a shared tenant.
- On a local installation, go to the dashboard app at http://localhost:8000/ to use SQL notebooks.
- On the free tier, click the Dashboard button to use SQL notebooks.
- Use any of these popular tools to connect to PostgresML and write SQL queries
PostgresML integrates 🤗 Hugging Face Transformers to bring state-of-the-art NLP models into the data layer. There are tens of thousands of pre-trained models with pipelines to turn raw text in your database into useful results. Many state of the art deep learning architectures have been published and made available from Hugging Face model hub.
You can run different NLP tasks, and customize them, using the following SQL query.
SELECT pgml.transform(
task => TEXT OR JSONB, -- Pipeline initializer arguments
inputs => TEXT[] OR BYTEA[], -- inputs for inference
args => JSONB -- (optional) arguments to the pipeline.
)
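As a hedged illustration of the optional args parameter, the sketch below passes one generation setting through to the pipeline. It assumes the gpt2 model can be downloaded from the Hugging Face hub and that max_length is forwarded to the text-generation pipeline. The prompt is made up for illustration.

```sql
-- Sketch: text generation with an explicit model and pipeline arguments.
SELECT pgml.transform(
    task   => '{"task": "text-generation", "model": "gpt2"}'::JSONB,
    inputs => ARRAY['PostgresML makes machine learning in the database'],
    args   => '{"max_length": 50}'::JSONB  -- assumption: forwarded to the pipeline
) AS generated;
```

The named-argument form keeps the call readable when args is omitted for tasks that do not need it.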
Text classification involves assigning a label or category to a given text. Common use cases include sentiment analysis, natural language inference, and the assessment of grammatical correctness.
Basic SQL query
SELECT pgml.transform(
task => 'text-classification',
inputs => ARRAY[
'I love how amazingly simple ML has become!',
'I hate doing mundane and thankless tasks. ☹️'
]
) AS positivity;
Result
positivity
------------------------------------------------------
[
{"label": "POSITIVE", "score": 0.9995759129524232},
{"label": "NEGATIVE", "score": 0.9903519749641418}
]
A fine-tuned checkpoint of DistilBERT-base-uncased, tuned on the Stanford Sentiment Treebank (SST-2), is used as the default model for text classification.
SQL query using specific model
To use one of the over 19,000 models available on Hugging Face, include the name of the desired model and its associated task as a JSONB object in the SQL query. For example, if you want to use a RoBERTa model trained on around 40,000 English tweets and that has POS (positive), NEG (negative), and NEU (neutral) labels for its classes, include this information in the JSONB object when making your query.
SELECT pgml.transform(
inputs => ARRAY[
'I love how amazingly simple ML has become!',
'I hate doing mundane and thankless tasks. ☹️'
],
task => '{"task": "text-classification",
"model": "finiteautomata/bertweet-base-sentiment-analysis"
}'::JSONB
) AS positivity;
Result
positivity
-----------------------------------------------
[
{"label": "POS", "score": 0.992932200431826},
{"label": "NEG", "score": 0.975599765777588}
]
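The result is a single JSON array, which is often easier to work with as rows. The sketch below uses only built-in PostgreSQL JSONB functions, and it assumes the value returned by pgml.transform is JSONB. The literal in the CTE is a stand-in for the query above, so the example runs without downloading a model.

```sql
WITH result AS (
    -- Stand-in for the pgml.transform(...) call; replace with the real query.
    SELECT '[
        {"label": "POS", "score": 0.992932200431826},
        {"label": "NEG", "score": 0.975599765777588}
    ]'::JSONB AS positivity
)
SELECT
    item.ordinality                  AS input_index,  -- position of the input text
    item.value ->> 'label'           AS label,
    (item.value ->> 'score')::float8 AS score
FROM result,
     jsonb_array_elements(result.positivity) WITH ORDINALITY AS item(value, ordinality);
```

This yields one row per input text, in the same order as the inputs array.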
SQL query using models from specific industry
By selecting a model that has been specifically designed for a particular industry, you can achieve more accurate and relevant text classification. An example of such a model is FinBERT, a pre-trained NLP model that has been optimized for analyzing sentiment in financial text. FinBERT was created by training the BERT language model on a large financial corpus, and fine-tuning it to specifically classify financial sentiment. When using FinBERT, the model will provide softmax outputs for three different labels: positive, negative, or neutral.
SELECT pgml.transform(
inputs => ARRAY[
'Stocks rallied and the British pound gained.',
'Stocks making the biggest moves midday: Nvidia, Palantir and more'
],
task => '{"task": "text-classification",
"model": "ProsusAI/finbert"
}'::JSONB
) AS market_sentiment;
Result
market_sentiment
------------------------------------------------------
[
{"label": "positive", "score": 0.8983612656593323},
{"label": "neutral", "score": 0.8062630891799927}
]
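The same call can be applied to text already stored in a table. The following is a minimal sketch: the news_headlines table and its columns are hypothetical, and the -> and ->> operators assume pgml.transform returns a JSONB array with one element per input.

```sql
-- Hypothetical table holding the text to classify.
CREATE TEMP TABLE news_headlines (
    id       serial PRIMARY KEY,
    headline text NOT NULL
);

INSERT INTO news_headlines (headline) VALUES
    ('Stocks rallied and the British pound gained.'),
    ('Stocks making the biggest moves midday: Nvidia, Palantir and more');

-- Classify each headline and keep only the predicted label.
SELECT
    id,
    headline,
    pgml.transform(
        task   => '{"task": "text-classification", "model": "ProsusAI/finbert"}'::JSONB,
        inputs => ARRAY[headline]
    ) -> 0 ->> 'label' AS sentiment
FROM news_headlines;
```

Calling the function once per row is the simplest form. For large tables, passing many texts in one inputs array is likely to be more efficient.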
- Token Classification
- Table Question Answering
- Question Answering
- Zero-Shot Classification
- Translation
- Summarization
- Conversational
- Text Generation
- Text2Text Generation
- Fill-Mask
- Sentence Similarity
- AI writing partner
- Chatbot for customer support
- Social media post analysis
- Fintech
- Healthcare
- Insurance
- Fraud detection
- Recommendation
- Access to Hugging Face models - a little more about open source language models
- Ease of fine tuning and why
- Rust based extension and its benefits
- Problems with HTTP serving and how PML enables microsecond latency
- Pgcat for horizontal scaling
- Database
- Extension
- ML on text data
- Transform operation
- Fine tune operation
- ML on tabular data
- Train operation
- Deploy operation
- Predict operation
- Docker images
- CPU
- GPU
- Data persistence on local/EC2/EKS
- Deployment on AWS using docker images
See the documentation for a complete list of functionality.
Whether you need a simple linear regression or extreme gradient boosting, we've included support for all classification and regression algorithms in scikit-learn and XGBoost with no extra configuration.
Models can be periodically retrained and automatically promoted to production depending on their key metric. Rollback capability is provided to ensure that you're always able to serve the highest quality predictions, along with historical logs of all deployments for long term study.
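Promotion and rollback can be sketched with pgml.deploy. The strategy names below ('best_score', 'rollback') are given as they are understood from the PostgresML documentation and should be treated as assumptions to check against your installed version. The project name reuses the training example above.

```sql
-- Promote the model with the best key metric for this project.
SELECT * FROM pgml.deploy(
    'Handwritten Digit Image Classifier',
    strategy => 'best_score'
);

-- Revert to the previously deployed model if the new one misbehaves.
SELECT * FROM pgml.deploy(
    'Handwritten Digit Image Classifier',
    strategy => 'rollback'
);
```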
Predictions are served via a standard Postgres connection to ensure that your core apps can always access both your data and your models in real time. Pure SQL workflows also enable batch predictions to cache results in native Postgres tables for lookup.
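A batch workflow might look like the sketch below, which caches predictions in an ordinary table for later lookup. It assumes the pgml.digits table has an image feature column alongside the target column used in training. The digit_predictions table name is hypothetical.

```sql
-- Cache predictions for every row so they can be looked up without re-running the model.
CREATE TABLE digit_predictions AS
SELECT
    target,                                   -- the known label
    pgml.predict(
        'Handwritten Digit Image Classifier',
        image                                 -- assumption: feature column name
    ) AS prediction
FROM pgml.digits;

-- Look up cached results with plain SQL, for example to count mismatches.
SELECT count(*) AS misclassified
FROM digit_predictions
WHERE target <> prediction;
```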
Run standard analyses on your datasets to detect outliers, bimodal distributions, and feature correlation, and to produce other common data visualizations. Everything is cataloged in the dashboard for easy reference.
Use either grid or random searches with cross validation on your training set to discover the most important knobs to tweak on your favorite algorithm.
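A grid search can be sketched as an extension of the training call. The search and search_params argument names are given as they are understood from the PostgresML documentation, and the hyperparameter values are arbitrary examples, not recommendations.

```sql
-- Sketch: grid search over two XGBoost hyperparameters.
SELECT * FROM pgml.train(
    'Handwritten Digit Image Classifier',
    'classification',
    'pgml.digits',
    'target',
    algorithm     => 'xgboost',
    search        => 'grid',                  -- assumption: 'random' is the alternative
    search_params => '{
        "max_depth": [1, 2, 3, 4, 5, 6],
        "n_estimators": [20, 40, 80, 160]
    }'::JSONB
);
```

Every combination in the grid is trained, so the run time grows with the product of the list lengths.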
Vector operations make working with learned embeddings a snap, for things like nearest neighbor searches or other similarity comparisons.
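To make the nearest-neighbor idea concrete, here is a minimal sketch in plain PostgreSQL that ranks stored embeddings by cosine similarity to a query vector. It deliberately avoids any extension-specific vector functions. The items data and the three-dimensional embeddings are made up for illustration.

```sql
WITH items(id, embedding) AS (
    VALUES
        (1, ARRAY[1.0, 0.0, 0.0]::float8[]),
        (2, ARRAY[0.9, 0.1, 0.0]::float8[]),
        (3, ARRAY[0.0, 1.0, 0.0]::float8[])
),
query(embedding) AS (
    VALUES (ARRAY[1.0, 0.0, 0.0]::float8[])
)
SELECT
    i.id,
    (
        -- cosine similarity = dot(a, b) / (|a| * |b|)
        SELECT sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
        FROM unnest(i.embedding, q.embedding) AS t(a, b)
    ) AS cosine_similarity
FROM items AS i
CROSS JOIN query AS q
ORDER BY cosine_similarity DESC
LIMIT 2;
```

Item 1 is identical to the query vector, so it ranks first, followed by item 2. For real workloads a dedicated vector function or index would replace the per-row subquery.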
Since your data never leaves the database, you retain the speed, reliability and security you expect in your foundational stateful services. Leverage your existing infrastructure and expertise to deliver new capabilities.
We're building on the shoulders of giants. These machine learning libraries and Postgres have received extensive academic and industry use, and we'll continue their tradition to build with the community. Licensed under MIT.