Commit e6a1619: pgml sdk examples (#669)
1 parent: 9949cde

12 files changed: +324 -253 lines

pgml-sdks/python/pgml/README.md (+15 -2)

@@ -1,10 +1,14 @@
-# Table of Contents
+# Open Source Alternative for Building End-to-End Vector Search Applications without OpenAI & Pinecone
+
+## Table of Contents
 
 - [Overview](#overview)
 - [Quickstart](#quickstart)
 - [Usage](#usage)
+- [Examples](./examples/README.md)
 - [Developer setup](#developer-setup)
 - [API Reference](#api-reference)
+- [Roadmap](#roadmap)
 
 ## Overview
 Python SDK is designed to facilitate the development of scalable vector search applications on PostgreSQL databases. With this SDK, you can seamlessly manage various database tables related to documents, text chunks, text splitters, LLM (Language Model) models, and embeddings. By leveraging the SDK's capabilities, you can efficiently index LLM embeddings using PgVector for fast and accurate queries.
@@ -274,4 +278,13 @@ LOGLEVEL=INFO python -m unittest tests/test_collection.py
 ### API Reference
 
 - [Database](./docs/pgml/database.md)
-- [Collection](./docs/pgml/collection.md)
+- [Collection](./docs/pgml/collection.md)
+
+### Roadmap
+
+- Enable filters on document metadata in `vector_search`. [Issue](https://github.com/postgresml/postgresml/issues/663)
+- `text_search` functionality on documents using Postgres text search. [Issue](https://github.com/postgresml/postgresml/issues/664)
+- `hybrid_search` functionality that does a combination of `vector_search` and `text_search` in an order specified by the user. [Issue](https://github.com/postgresml/postgresml/issues/665)
+- Ability to call and manage OpenAI embeddings for comparison purposes. [Issue](https://github.com/postgresml/postgresml/issues/666)
+- Save `vector_search` history for downstream monitoring of model performance. [Issue](https://github.com/postgresml/postgresml/issues/667)
+- Perform chunking on the DB with multiple langchain splitters. [Issue](https://github.com/postgresml/postgresml/issues/668)
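
For orientation, every example added in this commit follows the same collection workflow against a running PostgresML instance. A condensed sketch (not itself part of the commit; the collection name and document are stand-ins):

```python
import os

from pgml import Database

# Connect: PGML_CONNECTION overrides the local default used by the examples.
local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"
db = Database(os.environ.get("PGML_CONNECTION", local_pgml))

# Create (or reuse) a collection, load documents, chunk, embed, and search.
collection = db.create_or_get_collection("demo_collection")
collection.upsert_documents([{"text": "PostgresML brings machine learning to Postgres."}])
collection.generate_chunks()
collection.generate_embeddings()  # defaults to the intfloat/e5-small model
results = collection.vector_search("What is PostgresML?", top_k=3)

db.archive_collection("demo_collection")
```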
pgml-sdks/python/pgml/examples/README.md (new file, +19)

## Examples

### [Semantic Search](./semantic_search.py)
This is a basic example of semantic search over a collection of documents. It loads the Quora dataset, creates a collection in a PostgreSQL database, upserts documents, generates chunks and embeddings, and then performs a vector search on a query. Embeddings are created with the `intfloat/e5-small` model, and the results are documents semantically similar to the query. Finally, the collection is archived.

### [Question Answering](./question_answering.py)
This example finds documents relevant to a question in a collection of documents. It loads the Stanford Question Answering Dataset (SQuAD) into the database and generates chunks and embeddings. The query is passed to vector search to retrieve the documents that match most closely in the embedding space; a score is returned with each search result.

### [Question Answering using Instructor Model](./question_answering_instructor.py)
In this example, we use the `hkunlp/instructor-base` model to build text embeddings instead of the default `intfloat/e5-small` model. We show how to use the `register_model` method and pass the returned `model_id` to build and query embeddings.

### [Extractive Question Answering](./extractive_question_answering.py)
In this example, we show how to use a `vector_search` result as the `context` for a Hugging Face question-answering model, using `pgml.transform` to run the model on the database.

### [Table Question Answering](./table_question_answering.py)
In this example, we use the [Open Table-and-Text Question Answering (OTT-QA)](https://github.com/wenhuchen/OTT-QA) dataset to run queries on tables, with the `deepset/all-mpnet-base-v2-table` model, which is trained to embed tabular data for retrieval tasks.
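
Every script below shares the same connection bootstrap: `load_dotenv()` reads a `.env` file, and `PGML_CONNECTION` overrides the local default. A minimal `.env` sketch, assuming PostgresML is listening on the port the examples use:

```
PGML_CONNECTION=postgres://postgres@127.0.0.1:5433/pgml_development
```

With that in place, any example should run directly, e.g. `python semantic_search.py`.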
pgml-sdks/python/pgml/examples/extractive_question_answering.py (new file, +69)

from pgml import Database
import os
import json
from datasets import load_dataset
from time import time
from dotenv import load_dotenv
from rich.console import Console
from psycopg import sql
from pgml.dbutils import run_select_statement

load_dotenv()
console = Console()

local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"

conninfo = os.environ.get("PGML_CONNECTION", local_pgml)
db = Database(conninfo)

collection_name = "squad_collection"
collection = db.create_or_get_collection(collection_name)


data = load_dataset("squad", split="train")
data = data.to_pandas()
data = data.drop_duplicates(subset=["context"])

documents = [
    {"id": r["id"], "text": r["context"], "title": r["title"]}
    for r in data.to_dict(orient="records")
]

collection.upsert_documents(documents[:200])
collection.generate_chunks()
collection.generate_embeddings()

start = time()
query = "Who won more than 20 grammy awards?"
results = collection.vector_search(query, top_k=5)
_end = time()
console.print("\nResults for '%s'" % (query), style="bold")
console.print(results)
console.print("Query time = %0.3f" % (_end - start))

# Get the context passage and use pgml.transform to get short answer to the question


conn = db.pool.getconn()
context = " ".join(results[0]["chunk"].strip().split())
context = context.replace('"', '\\"').replace("'", "''")

select_statement = """SELECT pgml.transform(
    'question-answering',
    inputs => ARRAY[
        '{
            \"question\": \"%s\",
            \"context\": \"%s\"
        }'
    ]
) AS answer;""" % (
    query,
    context,
)

results = run_select_statement(conn, select_statement)
db.pool.putconn(conn)

console.print("\nResults for query '%s'" % query)
console.print(results)
db.archive_collection(collection_name)
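
The script above escapes quotes by hand before interpolating the question and context into the SQL string. A hypothetical alternative (not in the commit) is to let psycopg bind the JSON payload as a parameter, which sidesteps the manual escaping; it assumes the same `conn` and `query` as above and uses the raw, unescaped chunk text:

```python
import json

# Hypothetical variant: bind the question-answering payload as a parameter
# instead of splicing it into the SQL string by hand.
payload = json.dumps({"question": query, "context": results[0]["chunk"]})
with conn.cursor() as cur:
    cur.execute(
        "SELECT pgml.transform('question-answering', inputs => ARRAY[%s]) AS answer;",
        (payload,),
    )
    answer = cur.fetchone()[0]
```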

pgml-sdks/python/pgml/examples/vector_search.py renamed to pgml-sdks/python/pgml/examples/question_answering.py (+14 -6)

@@ -3,14 +3,18 @@
 import json
 from datasets import load_dataset
 from time import time
-from rich import print as rprint
+from dotenv import load_dotenv
+from rich.console import Console
+
+load_dotenv()
+console = Console()
 
 local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"
 
 conninfo = os.environ.get("PGML_CONNECTION", local_pgml)
 db = Database(conninfo)
 
-collection_name = "test_pgml_sdk_1"
+collection_name = "squad_collection"
 collection = db.create_or_get_collection(collection_name)
 
 
@@ -19,7 +23,7 @@
 data = data.drop_duplicates(subset=["context"])
 
 documents = [
-    {'id': r['id'], "text": r["context"], "title": r["title"]}
+    {"id": r["id"], "text": r["context"], "title": r["title"]}
     for r in data.to_dict(orient="records")
 ]
 
@@ -28,7 +32,11 @@
 collection.generate_embeddings()
 
 start = time()
-results = collection.vector_search("Who won 20 grammy awards?", top_k=2)
-rprint("Query time %0.3f"%(time()-start))
-rprint(json.dumps(results, indent=2))
+query = "Who won 20 grammy awards?"
+results = collection.vector_search(query, top_k=5)
+_end = time()
+console.print("\nResults for '%s'" % (query), style="bold")
+console.print(results)
+console.print("Query time = %0.3f" % (_end - start))
+
 db.archive_collection(collection_name)
pgml-sdks/python/pgml/examples/question_answering_instructor.py (new file, +55)

from pgml import Database
import os
import json
from datasets import load_dataset
from time import time
from dotenv import load_dotenv
from rich.console import Console

load_dotenv()
console = Console()

local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"

conninfo = os.environ.get("PGML_CONNECTION", local_pgml)
db = Database(conninfo)

collection_name = "squad_collection"
collection = db.create_or_get_collection(collection_name)


data = load_dataset("squad", split="train")
data = data.to_pandas()
data = data.drop_duplicates(subset=["context"])

documents = [
    {"id": r["id"], "text": r["context"], "title": r["title"]}
    for r in data.to_dict(orient="records")
]

collection.upsert_documents(documents[:200])
collection.generate_chunks()

# register instructor model
model_id = collection.register_model(
    model_name="hkunlp/instructor-base",
    model_params={"instruction": "Represent the Wikipedia document for retrieval: "},
)
collection.generate_embeddings(model_id=model_id)

start = time()
query = "Who won 20 grammy awards?"
results = collection.vector_search(
    query,
    top_k=5,
    model_id=model_id,
    query_parameters={
        "instruction": "Represent the Wikipedia question for retrieving supporting documents: "
    },
)
_end = time()
console.print("\nResults for '%s'" % (query), style="bold")
console.print(results)
console.print("Query time = %0.3f" % (_end - start))

db.archive_collection(collection_name)
pgml-sdks/python/pgml/examples/semantic_search.py (new file, +50)

from datasets import load_dataset
from pgml import Database
import os
from rich import print as rprint
from dotenv import load_dotenv
from time import time
from rich.console import Console

load_dotenv()
console = Console()

# Prepare Data
dataset = load_dataset("quora", split="train")
questions = []

for record in dataset["questions"]:
    questions.extend(record["text"])

# remove duplicates
documents = []
for question in list(set(questions)):
    if question:
        documents.append({"text": question})


# Get Database connection
local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"
conninfo = os.environ.get("PGML_CONNECTION", local_pgml)
db = Database(conninfo, min_connections=4)

# Create or get collection
collection_name = "quora_collection"
collection = db.create_or_get_collection(collection_name)

# Upsert documents, chunk text, and generate embeddings
collection.upsert_documents(documents[:200])
collection.generate_chunks()
collection.generate_embeddings()

# Query vector embeddings
start = time()
query = "What is a good mobile os?"
result = collection.vector_search(query)
_end = time()

console.print("\nResults for '%s'" % (query), style="bold")
console.print(result)
console.print("Query time = %0.3f" % (_end - start))

db.archive_collection(collection_name)
pgml-sdks/python/pgml/examples/table_question_answering.py (new file, +56)

from pgml import Database
import os
import json
from datasets import load_dataset
from time import time
from dotenv import load_dotenv
from rich.console import Console
from rich.progress import track
from psycopg import sql
from pgml.dbutils import run_select_statement
import pandas as pd

load_dotenv()
console = Console()

local_pgml = "postgres://postgres@127.0.0.1:5433/pgml_development"

conninfo = os.environ.get("PGML_CONNECTION", local_pgml)
db = Database(conninfo)

collection_name = "ott_qa_20k_collection"
collection = db.create_or_get_collection(collection_name)


data = load_dataset("ashraq/ott-qa-20k", split="train")
documents = []

# loop through the dataset and convert tabular data to pandas dataframes
for doc in track(data):
    table = pd.DataFrame(doc["data"], columns=doc["header"])
    processed_table = "\n".join([table.to_csv(index=False)])
    documents.append(
        {
            "text": processed_table,
            "title": doc["title"],
            "url": doc["url"],
            "uid": doc["uid"],
        }
    )

collection.upsert_documents(documents)
collection.generate_chunks()

# SentenceTransformer model trained specifically for embedding tabular data for retrieval tasks
model_id = collection.register_model(model_name="deepset/all-mpnet-base-v2-table")
collection.generate_embeddings(model_id=model_id)

start = time()
query = "which country has the highest GDP in 2020?"
results = collection.vector_search(query, top_k=5, model_id=model_id)
_end = time()
console.print("\nResults for '%s'" % (query), style="bold")
console.print(results)
console.print("Query time = %0.3f" % (_end - start))

db.archive_collection(collection_name)
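
The loop above flattens each OTT-QA table into plain CSV text before it is upserted and embedded. A tiny illustration with a made-up record shaped like the dataset rows (the `header` and `data` keys match the fields the script reads):

```python
import pandas as pd

# Hypothetical record in the shape the script consumes.
doc = {"header": ["Country", "GDP (2020)"], "data": [["A", "1.0T"], ["B", "2.3T"]]}
table = pd.DataFrame(doc["data"], columns=doc["header"])
print(table.to_csv(index=False))
# Country,GDP (2020)
# A,1.0T
# B,2.3T
```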
