Commit fa0c458

Minor improvements in blog posts (#724)
1 parent 9947e76 commit fa0c458

3 files changed: 37 additions & 35 deletions

‎pgml-dashboard/content/blog/generating-llm-embeddings-with-open-source-models-in-postgresml.md

Lines changed: 7 additions & 5 deletions
@@ -127,8 +127,10 @@ Since our corpus of documents (movie reviews) are all relatively short and simil
127127

128128
It takes a couple of minutes to download and cache the `intfloat/e5-small` model to generate the first embedding. After that, it's pretty fast.
129129

130+
Note how we prefix the text we want to embed with either `passage: ` or `query: `. The e5 model requires the `passage: ` prefix when we're generating embeddings for our corpus, and the `query: ` prefix when we want to find semantically similar content.
131+
130132
```postgresql
131-
SELECT pgml.embed('intfloat/e5-small', 'hi mom');
133+
SELECT pgml.embed('intfloat/e5-small', 'passage: hi mom');
132134
```
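
The prefix convention described in the note above can also be applied on the client before text reaches `pgml.embed`. The following is a minimal Python sketch, not part of PostgresML: the helper names `as_passage` and `as_query` are hypothetical, and the code only builds the strings that would be passed as the second argument to `pgml.embed`.

```python
# Hypothetical helpers illustrating the e5 prefix convention.
# These names are assumptions for illustration, not a PostgresML API.
PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "


def as_passage(text: str) -> str:
    """Prefix a corpus document before embedding it with an e5 model."""
    return PASSAGE_PREFIX + text


def as_query(text: str) -> str:
    """Prefix a search request before embedding it with an e5 model."""
    return QUERY_PREFIX + text


if __name__ == "__main__":
    print(as_passage("hi mom"))
    print(as_query("Best 1980's scifi movie"))
```

Keeping the two prefixes in named helpers makes it harder to embed a search request with the corpus prefix by accident, which is the mistake this commit corrects in the SQL examples.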
133135

134136
This is a pretty powerful function, because we can pass any arbitrary text to any open source model, and it will generate an embedding for us. We can benchmark how long it takes to generate an embedding for a single review, using client-side timings in Postgres:
@@ -147,7 +149,7 @@ Aside from using this function with strings passed from a client, we can use it
147149
```postgresql
148150
SELECT
149151
review_body,
150-
pgml.embed('intfloat/e5-small', review_body)
152+
pgml.embed('intfloat/e5-small', 'passage: ' || review_body)
151153
FROM pgml.amazon_us_reviews
152154
LIMIT 1;
153155
```
@@ -171,7 +173,7 @@ Time to generate an embedding increases with the length of the input text, and v
171173
```postgresql
172174
SELECT
173175
review_body,
174-
pgml.embed('intfloat/e5-small', review_body) AS embedding
176+
pgml.embed('intfloat/e5-small', 'passage: ' || review_body) AS embedding
175177
FROM pgml.amazon_us_reviews
176178
LIMIT 1000;
177179
```
@@ -191,7 +193,7 @@ SELECT
191193
review_body,
192194
pgml.embed(
193195
'intfloat/e5-small',
194-
review_body,
196+
'passage: ' || review_body,
195197
'{"device": "cpu"}'
196198
) AS embedding
197199
FROM pgml.amazon_us_reviews
@@ -328,7 +330,7 @@ BEGIN
328330
UPDATE pgml.amazon_us_reviews
329331
SET review_embedding_e5_large = pgml.embed(
330332
'intfloat/e5-large',
331-
review_body
333+
'passage: ' || review_body
332334
)
333335
WHERE id BETWEEN i AND i + 10
334336
AND review_embedding_e5_large IS NULL;

‎pgml-dashboard/content/blog/personalize-embedding-vector-search-results-with-huggingface-and-pgvector.md

Lines changed: 14 additions & 14 deletions
@@ -137,7 +137,7 @@ We can find a customer that our embeddings model feels is close to the sentiment
137137
WITH request AS (
138138
SELECT pgml.embed(
139139
'intfloat/e5-large',
140-
'I love all Star Wars, but Empire Strikes Back is particularly amazing'
140+
'query: I love all Star Wars, but Empire Strikes Back is particularly amazing'
141141
)::vector(1024) AS embedding
142142
)
143143
@@ -147,17 +147,17 @@ SELECT
147147
star_rating_avg,
148148
1 - (
149149
movie_embedding_e5_large <=> (SELECT embedding FROM request)
150-
) AS cosine_similiarity
150+
) AS cosine_similarity
151151
FROM customers
152-
ORDER BY cosine_similiarity DESC
152+
ORDER BY cosine_similarity DESC
153153
LIMIT 1;
154154
```
155155

156156
!!!
157157

158158
!!! results
159159

160-
| id | total_reviews | star_rating_avg | cosine_similiarity |
160+
| id | total_reviews | star_rating_avg | cosine_similarity |
161161
|----------|---------------|--------------------|--------------------|
162162
| 44366773 | 1 | 2.0000000000000000 | 0.8831349398621555 |
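
The `cosine_similarity` column in the query above is computed as `1 - (a <=> b)`, where `<=>` is pgvector's cosine distance operator. As a rough illustration of that relationship outside the database, here is a small pure-Python sketch; the function names are illustrative and the vectors in the usage lines are toy values, not real embeddings.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors: 1 - cos(angle)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cosine_similarity(a, b):
    """Mirror of the SQL expression `1 - (a <=> b)`."""
    return 1.0 - cosine_distance(a, b)


if __name__ == "__main__":
    # Toy 2-dimensional vectors, chosen only to show the range of values.
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
    print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because similarity is one minus distance, ordering by ascending distance and ordering by descending similarity produce the same ranking.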
163163

@@ -215,7 +215,7 @@ Now we can write our personalized SQL query. It's nearly the same as our query f
215215
WITH request AS (
216216
SELECT pgml.embed(
217217
'intfloat/e5-large',
218-
'Best 1980''s scifi movie'
218+
'query: Best 1980''s scifi movie'
219219
)::vector(1024) AS embedding
220220
),
221221
@@ -226,18 +226,18 @@ customer AS (
226226
WHERE id = '44366773'
227227
),
228228
229-
-- vector similarity search for movies and calculate a customer_cosine_similiarity at the same time
229+
-- vector similarity search for movies and calculate a customer_cosine_similarity at the same time
230230
first_pass AS (
231231
SELECT
232232
title,
233233
total_reviews,
234234
star_rating_avg,
235235
1 - (
236236
review_embedding_e5_large <=> (SELECT embedding FROM request)
237-
) AS request_cosine_similiarity,
237+
) AS request_cosine_similarity,
238238
(1 - (
239239
review_embedding_e5_large <=> (SELECT embedding FROM customer)
240-
) - 0.9) * 10 AS customer_cosine_similiarity,
240+
) - 0.9) * 10 AS customer_cosine_similarity,
241241
star_rating_avg / 5 AS star_rating_score
242242
FROM movies
243243
WHERE total_reviews > 10
@@ -251,9 +251,9 @@ SELECT
251251
total_reviews,
252252
round(star_rating_avg, 2) as star_rating_avg,
253253
star_rating_score,
254-
request_cosine_similiarity,
255-
customer_cosine_similiarity,
256-
request_cosine_similiarity + customer_cosine_similiarity + star_rating_score AS final_score
254+
request_cosine_similarity,
255+
customer_cosine_similarity,
256+
request_cosine_similarity + customer_cosine_similarity + star_rating_score AS final_score
257257
FROM first_pass
258258
ORDER BY final_score DESC
259259
LIMIT 10;
@@ -263,7 +263,7 @@ LIMIT 10;
263263

264264
!!! results
265265

266-
| title | total_reviews | star_rating_avg | star_rating_score | request_cosine_similiarity | customer_cosine_similiarity | final_score |
266+
| title | total_reviews | star_rating_avg | star_rating_score | request_cosine_similarity | customer_cosine_similarity | final_score |
267267
|----------------------------------------------------------------------|---------------|-----------------|------------------------|----------------------------|-----------------------------|--------------------|
268268
| Star Wars, Episode V: The Empire Strikes Back (Widescreen Edition) | 78 | 4.44 | 0.88717948717948718000 | 0.8295302273865711 | 0.9999999999999998 | 2.716709714566058 |
269269
| Star Wars, Episode IV: A New Hope (Widescreen Edition) | 80 | 4.36 | 0.87250000000000000000 | 0.8339361274771777 | 0.9336656923446551 | 2.640101819821833 |
@@ -280,15 +280,15 @@ LIMIT 10;
280280

281281
!!!
282282

283-
Bingo. Now we're boosting movies by `(customer_cosine_similiarity - 0.9) * 10`, and we've kept our previous boost for movies with a high average star rating. Not only does Episode V top the list as expected, Episode IV is a close second. This query has gotten fairly complex! But the results are perfect for me, I mean our hypothetical customer who is searching for "Best 1980's scifi movie" but has already revealed to us with their one movie review that they think like the comment "I love all Star Wars, but Empire Strikes Back is particularly amazing". I promise I'm not just doing all of this to find a new movie to watch tonight.
283+
Bingo. Now we're boosting movies by `(customer_cosine_similarity - 0.9) * 10`, and we've kept our previous boost for movies with a high average star rating. Not only does Episode V top the list as expected, Episode IV is a close second. This query has gotten fairly complex! But the results are perfect for me, I mean our hypothetical customer who is searching for "Best 1980's scifi movie" but has already revealed to us with their one movie review that they think like the comment "I love all Star Wars, but Empire Strikes Back is particularly amazing". I promise I'm not just doing all of this to find a new movie to watch tonight.
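
To make the scoring arithmetic concrete, here is a minimal Python sketch of how the three terms in the query combine. The function names are hypothetical and the inputs in the usage line are made-up values rather than rows from the results table; only the formulas `(similarity - 0.9) * 10` and `star_rating_avg / 5` come from the SQL above.

```python
def customer_boost(raw_customer_similarity: float) -> float:
    """Rescale the customer similarity as in the query: (similarity - 0.9) * 10."""
    return (raw_customer_similarity - 0.9) * 10


def star_rating_score(star_rating_avg: float) -> float:
    """Normalize a 0-5 star average as in the query: star_rating_avg / 5."""
    return star_rating_avg / 5


def final_score(request_similarity: float,
                raw_customer_similarity: float,
                star_rating_avg: float) -> float:
    """Sum of the request similarity, the customer boost, and the star rating score."""
    return (request_similarity
            + customer_boost(raw_customer_similarity)
            + star_rating_score(star_rating_avg))


if __name__ == "__main__":
    # Made-up inputs for illustration only.
    print(final_score(0.83, 0.95, 4.5))
```

With this rescaling, a raw customer similarity of 0.9 contributes nothing, 1.0 contributes a full point, and anything below 0.9 subtracts from the score, so the customer term only helps movies that are very close to the customer's embedding.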
284284

285285
You can compare this to our non-personalized results from the previous article for reference. Forbidden Planet used to be the top result, but now it's #3.
286286

287287
!!! code_block time="124.119 ms"
288288

289289
!!! results
290290

291-
| title | total_reviews | star_rating_avg | final_score | star_rating_score | cosine_similiarity |
291+
| title | total_reviews | star_rating_avg | final_score | star_rating_score | cosine_similarity |
292292
|:-----------------------------------------------------|--------------:|----------------:|-------------------:|-----------------------:|-------------------:|
293293
| Forbidden Planet (Two-Disc 50th Anniversary Edition) | 255 | 4.82 | 1.8216832158805154 | 0.96392156862745098000 | 0.8577616472530644 |
294294
| Back to the Future | 31 | 4.94 | 1.82090702765472 | 0.98709677419354838000 | 0.8338102534611714 |

‎pgml-dashboard/content/blog/tuning-vector-recall-while-generating-query-embeddings-in-the-database.md

Lines changed: 16 additions & 16 deletions
@@ -129,7 +129,7 @@ We'll start with semantic search. Given a user query, e.g. "Best 1980's scifi mo
129129
WITH request AS (
130130
SELECT pgml.embed(
131131
'intfloat/e5-large',
132-
'Best 1980''s scifi movie'
132+
'query: Best 1980''s scifi movie'
133133
)::vector(1024) AS embedding
134134
)
135135
@@ -142,17 +142,17 @@ SELECT
142142
review_embedding_e5_large <=> (
143143
SELECT embedding FROM request
144144
)
145-
) AS cosine_similiarity
145+
) AS cosine_similarity
146146
FROM pgml.amazon_us_reviews
147-
ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
147+
ORDER BY cosine_similarity DESC
148148
LIMIT 5;
149149
```
150150

151151
!!!
152152

153153
!!! results
154154

155-
| review_body | product_title | star_rating | total_votes | cosine_similiarity |
155+
| review_body | product_title | star_rating | total_votes | cosine_similarity |
156156
|-----------------------------------------------------|---------------------------------------------------------------|-------------|-------------|--------------------|
157157
| best 80s SciFi movie ever | The Adventures of Buckaroo Banzai Across the Eighth Dimension | 5 | 1 | 0.956207707312679 |
158158
| One of the best 80's sci-fi movies, beyond a doubt! | Close Encounters of the Third Kind [Blu-ray] | 5 | 1 | 0.9298004258989776 |
@@ -270,7 +270,7 @@ SELECT
270270
title,
271271
1 - (
272272
review_embedding_e5_large <=> (SELECT embedding FROM request)
273-
) AS cosine_similiarity
273+
) AS cosine_similarity
274274
FROM movies
275275
ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
276276
LIMIT 10;
@@ -280,7 +280,7 @@ LIMIT 10;
280280

281281
!!! results
282282

283-
| title | cosine_similiarity |
283+
| title | cosine_similarity |
284284
|--------------------------------------------------------------------|--------------------|
285285
| THX 1138 (The George Lucas Director's Cut Special Edition/ 2-Disc) | 0.8652007733744973 |
286286
| 2010: The Year We Make Contact | 0.8621574666546908 |
@@ -328,7 +328,7 @@ SELECT
328328
title,
329329
1 - (
330330
review_embedding_e5_large <=> (SELECT embedding FROM request)
331-
) AS cosine_similiarity
331+
) AS cosine_similarity
332332
FROM movies
333333
ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
334334
LIMIT 10;
@@ -338,7 +338,7 @@ LIMIT 10;
338338

339339
!!! results
340340

341-
| title | cosine_similiarity |
341+
| title | cosine_similarity |
342342
|--------------------------------------------------------------------|--------------------|
343343
| THX 1138 (The George Lucas Director's Cut Special Edition/ 2-Disc) | 0.8652007733744973 |
344344
| Big Trouble in Little China [UMD for PSP] | 0.8649691870870362 |
@@ -411,7 +411,7 @@ SET ivfflat.probes = 1;
411411
WITH request AS (
412412
SELECT pgml.embed(
413413
'intfloat/e5-large',
414-
'Best 1980''s scifi movie'
414+
'query: Best 1980''s scifi movie'
415415
)::vector(1024) AS embedding
416416
)
417417
@@ -420,7 +420,7 @@ SELECT
420420
total_reviews,
421421
1 - (
422422
review_embedding_e5_large <=> (SELECT embedding FROM request)
423-
) AS cosine_similiarity
423+
) AS cosine_similarity
424424
FROM movies
425425
WHERE total_reviews > 10
426426
ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
@@ -431,7 +431,7 @@ LIMIT 10;
431431

432432
!!! results
433433

434-
| title | total_reviews | cosine_similiarity |
434+
| title | total_reviews | cosine_similarity |
435435
|------------------------------------------------------|---------------|--------------------|
436436
| 2010: The Year We Make Contact | 29 | 0.8621574666546908 |
437437
| Forbidden Planet | 202 | 0.861032948199611 |
@@ -467,7 +467,7 @@ SQL is a very expressive language that can handle a lot of complexity. To keep t
467467
WITH request AS (
468468
SELECT pgml.embed(
469469
'intfloat/e5-large',
470-
'Best 1980''s scifi movie'
470+
'query: Best 1980''s scifi movie'
471471
)::vector(1024) AS embedding
472472
),
473473
@@ -479,7 +479,7 @@ first_pass AS (
479479
star_rating_avg,
480480
1 - (
481481
review_embedding_e5_large <=> (SELECT embedding FROM request)
482-
) AS cosine_similiarity,
482+
) AS cosine_similarity,
483483
star_rating_avg / 5 AS star_rating_score
484484
FROM movies
485485
WHERE total_reviews > 10
@@ -493,8 +493,8 @@ SELECT
493493
total_reviews,
494494
round(star_rating_avg, 2) as star_rating_avg,
495495
star_rating_score,
496-
cosine_similiarity,
497-
cosine_similiarity + star_rating_score AS final_score
496+
cosine_similarity,
497+
cosine_similarity + star_rating_score AS final_score
498498
FROM first_pass
499499
ORDER BY final_score DESC
500500
LIMIT 10;
@@ -504,7 +504,7 @@ LIMIT 10;
504504

505505
!!! results
506506

507-
| title | total_reviews | star_rating_avg | final_score | star_rating_score | cosine_similiarity |
507+
| title | total_reviews | star_rating_avg | final_score | star_rating_score | cosine_similarity |
508508
|:-----------------------------------------------------|--------------:|----------------:|-------------------:|-----------------------:|-------------------:|
509509
| Forbidden Planet (Two-Disc 50th Anniversary Edition) | 255 | 4.82 | 1.8216832158805154 | 0.96392156862745098000 | 0.8577616472530644 |
510510
| Back to the Future | 31 | 4.94 | 1.82090702765472 | 0.98709677419354838000 | 0.8338102534611714 |
