Bubby2015
diff --git a/README.md b/README.md
Lines changed: 4 additions & 25 deletions
diff --git a/_assets/azure-sql-cosine-similarity-native.gif b/_assets/azure-sql-cosine-similarity-native.gif
-304 KB
diff --git a/_assets/azure-sql-cosine-similarity-vector-type.gif b/_assets/azure-sql-cosine-similarity-vector-type.gif
157 KB
diff --git a/python/.env.sample b/python/.env.sample
Lines changed: 1 addition & 1 deletion
diff --git a/python/00-setup-database.sql b/python/00-setup-database.sql
Lines changed: 6 additions & 6 deletions
diff --git a/python/README.md b/python/README.md
Lines changed: 1 addition & 1 deletion
diff --git a/python/hybrid_search.py b/python/hybrid_search.py
Lines changed: 27 additions & 14 deletions
@@ -24,7 +24,7 @@ The **native option** is to use the new Vector Functions, recently introduced in
 > [!NOTE]  
 > Vector Functions are in Early Adopter Preview. Get access to the preview via https://aka.ms/azuresql-vector-eap-announcement
 
-![](_assets/azure-sql-cosine-similarity-native.gif)
+![](_assets/azure-sql-cosine-similarity-vector-type.gif)
 
 The **classic option** is to use the classic T-SQL to perform vector operations, with the support for columnstore indexes for getting good performances.
 
@@ -50,16 +50,16 @@ Run each section (each section starts with a comment) separately. At the end of
 
 ## Add embeddings columns to table
 
-In the imported data, vectors are stored as JSON arrays. To take advtange of vector processing, the arrays must be saved into more compact and optimzed binary format index. Thanks to `JSON_ARRAY_TO_VECTOR`, turning a vector into a set of values that can be saved into a column is very easy:
+In the imported data, vectors are stored as JSON arrays. To take advantage of vector processing, the arrays must be saved in a more compact and optimized binary format. Thanks to the new `VECTOR` type, converting a JSON array into a value that can be stored in a column is straightforward:
 
 ```sql
 alter table wikipedia_articles_embeddings
-add title_vector_native varbinary(8000);
+add title_vector_ada2 vector(1536);
 
 update 
     wikipedia_articles_embeddings
 set 
-    title_vector_native = json_array_to_vector(title_vector),
+    title_vector_ada2 = cast(title_vector as vector(1536));
 ```
 
 The script `./vector-embeddings/02-use-native-vectors.sql` does exactly that. It takes the existing columns with vectors stored in JSON arrays and turns them into vectors saved in binary format.
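
Review note: the cast above relies on every JSON array having exactly the declared number of dimensions (1536 here, 384 in the Python sample). A minimal client-side check, assuming the embedding arrives as a plain sequence of numbers, could look like the sketch below. The helper name `to_vector_json` is hypothetical and is not part of this repository.

```python
import json


def to_vector_json(values, dimensions):
    """Serialize an embedding to a JSON array after checking its length.

    Hypothetical helper: it only prepares the text that would be passed
    to CAST(? AS VECTOR(n)); it does not talk to the database.
    """
    floats = [float(v) for v in values]
    if len(floats) != dimensions:
        raise ValueError(
            f"expected {dimensions} dimensions, got {len(floats)}"
        )
    return json.dumps(floats)


if __name__ == "__main__":
    # A toy 2-dimensional vector; real embeddings would use 384 or 1536.
    print(to_vector_json([1, 2.5], 2))
```

Failing early on the client gives a clearer error message than a failed cast in the middle of a batch update.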
@@ -106,27 +106,6 @@ The described process can be wrapped into stored procedures to make it easy to r
 
 The script `05-find-similar-articles.sql` uses the created stored procedure and the process explained above to find similar articles to the provided text. 
 
-## Encapsulating logic to do similarity search
-
-To make it even easier to use, the script `06-sample-function.sql` shows a sample function that can be used to find similar articles by just providing the text, as demonstrated in script `07-sample-function-usage` with the following example:
-
-```sql
-declare @embedding varbinary(8000);
-declare @text nvarchar(max) = N'the foundation series by isaac asimov';
-
-exec dbo.get_embedding 'embeddings', @text, embedding output;
-
-select top(10)
-    a.id,
-    a.title,
-    a.url,
-    vector_distance('cosine', @embedding, title_vector) cosine_distance
-from
-    dbo.wikipedia_articles_embeddings a
-order by
-    cosine_distance;
-```
-
 ## Alternative sample with Python and a local embedding model
 
 If you don't want or can't use OpenAI to generate embeddings, you can use a local model like `https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1` to generate embeddings. The Python script `./python/hybrid_search.py` shows how to 
 
@@ -1 +1 @@
-MSSQL='Driver={ODBC Driver 18 for SQL Server};Server=tcp:<server>.database.windows.net,1433;Database=<database>;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30'
+MSSQL='Driver={ODBC Driver 18 for SQL Server};Server=tcp:<server>.database.windows.net,1433;Database=<database>;LongAsMax=yes;Connection Timeout=30'
@@ -1,11 +1,11 @@
-drop table if exists dbo.sample_documents
+drop table if exists dbo.hybrid_search_sample
 go
 
-create table dbo.sample_documents
+create table dbo.hybrid_search_sample
 (
-    id int constraint pk__documents primary key,
+    id int constraint pk__hybrid_search_sample primary key,
     content nvarchar(max),
-    embedding varbinary(8000)
+    embedding vector(384)
 )
 
 if not exists(select * from sys.fulltext_catalogs where [name] = 'main_ft_catalog')
@@ -14,8 +14,8 @@ begin
 end
 go
 
-create fulltext index on dbo.sample_documents (content) key index pk__documents;
+create fulltext index on dbo.hybrid_search_sample (content) key index pk__hybrid_search_sample;
 go
 
-alter fulltext index on dbo.sample_documents enable; 
+alter fulltext index on dbo.hybrid_search_sample enable; 
 go
@@ -2,7 +2,7 @@
 
 This sample shows how to combine Fulltext search in Azure SQL database with BM25 ranking and cosine similarity ranking to do hybrid search.
 
-In this sample the local model [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) to generate embeddings. The Python script `./python/hybrid_search.py` shows how to 
+In this sample the local model [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) is used to generate embeddings. The Python script `./hybrid_search.py` shows how to 
 
 - use Python to generate the embeddings 
 - do similarity search in Azure SQL database
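
Review note: for readers who want to see what the similarity search ranks by, here is a small pure-Python sketch of cosine distance, defined as one minus cosine similarity. It is only an illustration of the metric named in `VECTOR_DISTANCE('cosine', ...)`; the function name `cosine_distance` is chosen for this example, and the database computes the value itself.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Same direction gives a distance near 0, orthogonal vectors give 1.
    print(cosine_distance([1.0, 2.0], [2.0, 4.0]))
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Smaller distances mean more similar vectors, which is why the query orders by `distance` ascending.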
 
@@ -1,4 +1,5 @@
 import os
+import time
 import pyodbc
 import logging
 import json
@@ -10,22 +11,29 @@
 
 if __name__ == '__main__':
     print('Initializing sample...')
+    model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1', tokenizer_kwargs={'clean_up_tokenization_spaces': True})
+
     print('Getting embeddings...')    
     sentences = [
         'The dog is barking',
         'The cat is purring',
         'The bear is growling',
-        'A bear growling to a cat'
+        'A bear growling to a cat',
+        'A cat purring to a dog',
+        'A dog barking to a bear',
+        'A bear growling to a dog',
+        'A cat purring to a bear',
+        'A wolf howling to a bear',
+        'A bear growling to a wolf'
     ]
-    model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
     embeddings = model.encode(sentences)
 
     conn = get_mssql_connection()
 
     print('Cleaning up the database...')
     try:
         cursor = conn.cursor()    
-        cursor.execute("DELETE FROM dbo.sample_documents;")
+        cursor.execute("DELETE FROM dbo.hybrid_search_sample;")
         cursor.commit();        
     finally:
         cursor.close()
@@ -34,36 +42,39 @@
     try:
         cursor = conn.cursor()  
 
-        for id, (content, embedding) in enumerate(zip(sentences, embeddings)):
+        for id, (sentence, embedding) in enumerate(zip(sentences, embeddings)):
             cursor.execute(f"""
                 DECLARE @id INT = ?;
                 DECLARE @content NVARCHAR(MAX) = ?;
-                DECLARE @embedding NVARCHAR(MAX) = ?;
-                INSERT INTO dbo.sample_documents (id, content, embedding) VALUES (@id, @content, JSON_ARRAY_TO_VECTOR(@embedding));
+                DECLARE @embedding VECTOR(384) = CAST(? AS VECTOR(384));
+                INSERT INTO dbo.hybrid_search_sample (id, content, embedding) VALUES (@id, @content, @embedding);
             """,
             id,
-            content, 
+            sentence, 
             json.dumps(embedding.tolist())
             )
 
         cursor.commit()
     finally:
         cursor.close()
 
+    print('Waiting a few seconds to let fulltext index sync...')    
+    time.sleep(3)
+
     print('Searching for similar documents...')
     print('Getting embeddings...')    
     query = 'a growling bear'
     embedding = model.encode(query)    
 
-    print(f'Querying database for "{query}"...')  
     k = 5  
+    print(f'Querying database for {k} similar sentences to "{query}"...')  
     try:
         cursor = conn.cursor()  
 
         results  = cursor.execute(f"""
             DECLARE @k INT = ?;
             DECLARE @q NVARCHAR(4000) = ?;
-            DECLARE @e VARBINARY(8000) = JSON_ARRAY_TO_VECTOR(CAST(? AS NVARCHAR(MAX)));
+            DECLARE @e VECTOR(384) = CAST(? AS VECTOR(384));
             WITH keyword_search AS (
                 SELECT TOP(@k)
                     id,
@@ -76,9 +87,9 @@
                             ftt.[RANK] AS rank,
                             sd.content
                         FROM 
-                            dbo.sample_documents AS sd
+                            dbo.hybrid_search_sample AS sd
                         INNER JOIN 
-                            FREETEXTTABLE(dbo.sample_documents, *, @q) AS ftt ON sd.id = ftt.[KEY]
+                            FREETEXTTABLE(dbo.hybrid_search_sample, *, @q) AS ftt ON sd.id = ftt.[KEY]
                     ) AS t
                 ORDER BY
                     rank
@@ -96,7 +107,7 @@
                             VECTOR_DISTANCE('cosine', embedding, @e) AS distance,
                             content
                         FROM 
-                            dbo.sample_documents
+                            dbo.hybrid_search_sample
                         ORDER BY
                             distance
                     ) AS t
@@ -122,8 +133,10 @@
             json.dumps(embedding.tolist()),        
         )
 
-        for row in results:
-            print(f'Document: "{row[2]}", Id: {row[0]} -> RRF score: {row[1]:0.4} (Semantic Rank: {row[3]}, Keyword Rank: {row[4]})')
+        for (pos, row) in enumerate(results):
+            print(f'[{pos}] RRF score: {row[1]:0.4} (Semantic Rank: {row[3]}, Keyword Rank: {row[4]})\tDocument: "{row[2]}", Id: {row[0]}')
 
     finally:
         cursor.close()
+    
+    print("Done.")
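
Review note: the part of the query that fuses the semantic and keyword rankings is not visible in these hunks, so the sketch below is not the script's SQL. It shows reciprocal rank fusion in its commonly used form, where each document scores the sum of `1 / (k + rank)` over the rankings it appears in. The constant `k=60` and the function name `reciprocal_rank_fusion` are assumptions made for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one scored list.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (60 is a common default, assumed here).
    Returns (doc_id, score) pairs sorted by descending score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    semantic = [2, 3, 5]  # ids ordered by ascending cosine distance
    keyword = [2, 9, 3]   # ids ordered by full-text relevance
    for doc_id, score in reciprocal_rank_fusion([semantic, keyword]):
        print(doc_id, round(score, 4))
```

A document that appears in only one of the two lists still receives a score, which is what lets keyword-only and semantic-only matches show up in the combined result.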