Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Commit d7d8635

Browse filesBrowse files
committed
use vector type
1 parent 36c1c9a commit d7d8635
Copy full SHA for d7d8635

7 files changed

+39
-47
lines changed

‎README.md

Copy file name to clipboardExpand all lines: README.md
+4-25Lines changed: 4 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,7 @@ The **native option** is to use the new Vector Functions, recently introduced in
2424
> [!NOTE]
2525
> Vector Functions are in Early Adopter Preview. Get access to the preview via https://aka.ms/azuresql-vector-eap-announcement
2626
27-
![](_assets/azure-sql-cosine-similarity-native.gif)
27+
![](_assets/azure-sql-cosine-similarity-vector-type.gif)
2828

2929
The **classic option** is to use the classic T-SQL to perform vector operations, with the support for columnstore indexes for getting good performances.
3030

@@ -50,16 +50,16 @@ Run each section (each section starts with a comment) separately. At the end of
5050

5151
## Add embeddings columns to table
5252

53-
In the imported data, vectors are stored as JSON arrays. To take advtange of vector processing, the arrays must be saved into more compact and optimzed binary format index. Thanks to `JSON_ARRAY_TO_VECTOR`, turning a vector into a set of values that can be saved into a column is very easy:
53+
In the imported data, vectors are stored as JSON arrays. To take advtange of vector processing, the arrays must be saved into more compact and optimzed binary format index. Thanks to the new `VECTOR` type, turning a vector into a set of values that can be saved into a column is very easy:
5454

5555
```sql
5656
alter table wikipedia_articles_embeddings
57-
add title_vector_native varbinary(8000);
57+
add title_vector_ada2 vector(1536);
5858

5959
update
6060
wikipedia_articles_embeddings
6161
set
62-
title_vector_native = json_array_to_vector(title_vector),
62+
title_vector_ada2 = cast(title_vector as vector(1536)),
6363
```
6464

6565
The script `./vector-embeddings/02-use-native-vectors.sql` does exactly that. It takes the existing columns with vectors stored in JSON arrays and turns them into vectors saved in binary format.
@@ -106,27 +106,6 @@ The described process can be wrapped into stored procedures to make it easy to r
106106

107107
The script `05-find-similar-articles.sql` uses the created stored procedure and the process explained above to find similar articles to the provided text.
108108

109-
## Encapsulating logic to do similarity search
110-
111-
To make it even easier to use, the script `06-sample-function.sql` shows a sample function that can be used to find similar articles by just providing the text, as demonstrated in script `07-sample-function-usage` with the following example:
112-
113-
```sql
114-
declare @embedding varbinary(8000);
115-
declare @text nvarchar(max) = N'the foundation series by isaac asimov';
116-
117-
exec dbo.get_embedding 'embeddings', @text, embedding output;
118-
119-
select top(10)
120-
a.id,
121-
a.title,
122-
a.url,
123-
vector_distance('cosine', @embedding, title_vector) cosine_distance
124-
from
125-
dbo.wikipedia_articles_embeddings a
126-
order by
127-
cosine_distance;
128-
```
129-
130109
## Alternative sample with Python and a local embedding model
131110

132111
If you don't want or can't use OpenAI to generate embeddings, you can use a local model like `https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1` to generate embeddings. The Python script `./python/hybrid_search.py` shows how to
-304 KB
Binary file not shown.
Loading

‎python/.env.sample

Copy file name to clipboard
+1-1Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
MSSQL='Driver={ODBC Driver 18 for SQL Server};Server=tcp:<server>.database.windows.net,1433;Database=<database>;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30'
1+
MSSQL='Driver={ODBC Driver 18 for SQL Server};Server=tcp:<server>.database.windows.net,1433;Database=<database>;LongAsMax=yes;Connection Timeout=30'

‎python/00-setup-database.sql

Copy file name to clipboard
+6-6Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,11 @@
1-
drop table if exists dbo.sample_documents
1+
drop table if exists dbo.hybrid_search_sample
22
go
33

4-
create table dbo.sample_documents
4+
create table dbo.hybrid_search_sample
55
(
6-
id int constraint pk__documents primary key,
6+
id int constraint pk__hybrid_search_sample primary key,
77
content nvarchar(max),
8-
embedding varbinary(8000)
8+
embedding vector(384)
99
)
1010

1111
if not exists(select * from sys.fulltext_catalogs where [name] = 'main_ft_catalog')
@@ -14,8 +14,8 @@ begin
1414
end
1515
go
1616

17-
create fulltext index on dbo.sample_documents (content) key index pk__documents;
17+
create fulltext index on dbo.hybrid_search_sample (content) key index pk__hybrid_search_sample;
1818
go
1919

20-
alter fulltext index on dbo.sample_documents enable;
20+
alter fulltext index on dbo.hybrid_search_sample enable;
2121
go

‎python/README.md

Copy file name to clipboardExpand all lines: python/README.md
+1-1Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
This sample shows how to combine Fulltext search in Azure SQL database with BM25 ranking and cosine similarity ranking to do hybrid search.
44

5-
In this sample the local model [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) to generate embeddings. The Python script `./python/hybrid_search.py` shows how to
5+
In this sample the local model [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) to generate embeddings. The Python script `./hybrid_search.py` shows how to
66

77
- use Python to generate the embeddings
88
- do similarity search in Azure SQL database

‎python/hybrid_search.py

Copy file name to clipboardExpand all lines: python/hybrid_search.py
+27-14Lines changed: 27 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
import os
2+
import time
23
import pyodbc
34
import logging
45
import json
@@ -10,22 +11,29 @@
1011

1112
if __name__ == '__main__':
1213
print('Initializing sample...')
14+
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1', tokenizer_kwargs={'clean_up_tokenization_spaces': True})
15+
1316
print('Getting embeddings...')
1417
sentences = [
1518
'The dog is barking',
1619
'The cat is purring',
1720
'The bear is growling',
18-
'A bear growling to a cat'
21+
'A bear growling to a cat',
22+
'A cat purring to a dog',
23+
'A dog barking to a bear',
24+
'A bear growling to a dog',
25+
'A cat purring to a bear',
26+
'A wolf howling to a bear',
27+
'A bear growling to a wolf'
1928
]
20-
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
2129
embeddings = model.encode(sentences)
2230

2331
conn = get_mssql_connection()
2432

2533
print('Cleaning up the database...')
2634
try:
2735
cursor = conn.cursor()
28-
cursor.execute("DELETE FROM dbo.sample_documents;")
36+
cursor.execute("DELETE FROM dbo.hybrid_search_sample;")
2937
cursor.commit();
3038
finally:
3139
cursor.close()
@@ -34,36 +42,39 @@
3442
try:
3543
cursor = conn.cursor()
3644

37-
for id, (content, embedding) in enumerate(zip(sentences, embeddings)):
45+
for id, (sentence, embedding) in enumerate(zip(sentences, embeddings)):
3846
cursor.execute(f"""
3947
DECLARE @id INT = ?;
4048
DECLARE @content NVARCHAR(MAX) = ?;
41-
DECLARE @embedding NVARCHAR(MAX) = ?;
42-
INSERT INTO dbo.sample_documents (id, content, embedding) VALUES (@id, @content, JSON_ARRAY_TO_VECTOR(@embedding));
49+
DECLARE @embedding VECTOR(384) = CAST(? AS VECTOR(384));
50+
INSERT INTO dbo.hybrid_search_sample (id, content, embedding) VALUES (@id, @content, @embedding);
4351
""",
4452
id,
45-
content,
53+
sentence,
4654
json.dumps(embedding.tolist())
4755
)
4856

4957
cursor.commit()
5058
finally:
5159
cursor.close()
5260

61+
print('Waiting a few seconds to let fulltext index sync...')
62+
time.sleep(3)
63+
5364
print('Searching for similar documents...')
5465
print('Getting embeddings...')
5566
query = 'a growling bear'
5667
embedding = model.encode(query)
5768

58-
print(f'Querying database for "{query}"...')
5969
k = 5
70+
print(f'Querying database for {k} similar sentenct to "{query}"...')
6071
try:
6172
cursor = conn.cursor()
6273

6374
results = cursor.execute(f"""
6475
DECLARE @k INT = ?;
6576
DECLARE @q NVARCHAR(4000) = ?;
66-
DECLARE @e VARBINARY(8000) = JSON_ARRAY_TO_VECTOR(CAST(? AS NVARCHAR(MAX)));
77+
DECLARE @e VECTOR(384) = CAST(? AS VECTOR(384));
6778
WITH keyword_search AS (
6879
SELECT TOP(@k)
6980
id,
@@ -76,9 +87,9 @@
7687
ftt.[RANK] AS rank,
7788
sd.content
7889
FROM
79-
dbo.sample_documents AS sd
90+
dbo.hybrid_search_sample AS sd
8091
INNER JOIN
81-
FREETEXTTABLE(dbo.sample_documents, *, @q) AS ftt ON sd.id = ftt.[KEY]
92+
FREETEXTTABLE(dbo.hybrid_search_sample, *, @q) AS ftt ON sd.id = ftt.[KEY]
8293
) AS t
8394
ORDER BY
8495
rank
@@ -96,7 +107,7 @@
96107
VECTOR_DISTANCE('cosine', embedding, @e) AS distance,
97108
content
98109
FROM
99-
dbo.sample_documents
110+
dbo.hybrid_search_sample
100111
ORDER BY
101112
distance
102113
) AS t
@@ -122,8 +133,10 @@
122133
json.dumps(embedding.tolist()),
123134
)
124135

125-
for row in results:
126-
print(f'Document: "{row[2]}", Id: {row[0]} -> RRF score: {row[1]:0.4} (Semantic Rank: {row[3]}, Keyword Rank: {row[4]})')
136+
for (pos, row) in enumerate(results):
137+
print(f'[{pos}] RRF score: {row[1]:0.4} (Semantic Rank: {row[3]}, Keyword Rank: {row[4]})\tDocument: "{row[2]}", Id: {row[0]}')
127138

128139
finally:
129140
cursor.close()
141+
142+
print("Done.")

0 commit comments

Comments
0 (0)
Morty Proxy This is a proxified and sanitized view of the page, visit original site.