Snowflake Arctic Embed M V2.0 Embeddings for MSMARCO V2.1 for TREC-RAG

This dataset contains the embeddings for the MSMARCO-V2.1 dataset, which is used as the corpus for TREC RAG. All embeddings are created using Snowflake's Arctic Embed M v2.0 and are intended to serve as a simple baseline for dense retrieval-based methods. Note that the embeddings are not normalized, so you will need to normalize them before use.
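As a minimal sketch of that normalization step, the snippet below scales each row of a matrix to unit L2 norm with NumPy. The helper name `l2_normalize` and the toy 2-D vectors are illustrative assumptions, not part of the dataset; the real embeddings have 768 dimensions.

```python
import numpy as np


def l2_normalize(embeddings, eps=1e-12):
    """Scale each row to unit L2 norm; eps guards against all-zero rows."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)


# Toy 2-D vectors standing in for the 768-dimensional embeddings.
raw = np.array([[3.0, 4.0], [0.5, 0.0]])
unit = l2_normalize(raw)
print(np.linalg.norm(unit, axis=1))  # each row norm should be close to 1
```

Once both query and document vectors have unit norm, their dot product equals their cosine similarity.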

Retrieval Performance

Retrieval performance for TREC DL21-23, MSMARCOV2-Dev, and Raggy Queries can be found below, with BM25 as a baseline. For both systems, retrieval is at the segment level, and each document's score is the maximum score over its passages: Doc Score = Max (passage score). Retrieval is done via a dot product computed in BF16.
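The max-passage aggregation can be sketched as follows. This is an illustration under assumptions, not the evaluation code behind the numbers below: the helper name `doc_scores_from_segments` is hypothetical, and the segment IDs are assumed to be a document ID followed by `#` and a segment suffix.

```python
def doc_scores_from_segments(segment_scores):
    """Collapse (segment_id, score) pairs into per-document max scores.

    Assumes segment IDs look like "<doc_id>#<suffix>" (an assumption made
    for this sketch; check the ID format of the data you actually use).
    """
    best = {}
    for seg_id, score in segment_scores:
        doc_id = seg_id.split("#", 1)[0]
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    # Highest-scoring documents first.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


# Made-up segment hits for two hypothetical documents.
hits = [("doc_A#0", 0.41), ("doc_B#2", 0.57), ("doc_A#3", 0.63), ("doc_B#0", 0.12)]
ranked = doc_scores_from_segments(hits)
print(ranked)
```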

NDCG@10

Dataset	BM25	Arctic-M-V2.0 (768 Dimensions)
Deep Learning 2021	0.5778
Deep Learning 2022	0.3576
Deep Learning 2023	0.3356
msmarcov2-dev	N/A
msmarcov2-dev2	N/A
Raggy Queries	0.4227
RAG 2024

Recall@100

Dataset	BM25	Arctic-M-V2.0 (768 Dimensions)
Deep Learning 2021	0.3811
Deep Learning 2022	0.233
Deep Learning 2023	0.3049
msmarcov2-dev	0.6683
msmarcov2-dev2	0.6771
Raggy Queries	0.2807
RAG 2024

Recall@1000

Dataset	BM25	Arctic-M-V2.0 (768 Dimensions)
Deep Learning 2021	0.7115
Deep Learning 2022	0.479
Deep Learning 2023	0.5852
msmarcov2-dev	0.8528
msmarcov2-dev2	0.8577
Raggy Queries	0.5745
RAG 2024
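
For readers who want to compute the same kinds of metrics on their own runs, here is a minimal sketch of Recall@k and a linear-gain NDCG@k. It is not the tooling used to produce the tables above, and official evaluation tools may differ in details such as the relevance threshold used for recall. The function names and the toy judgments are assumptions for illustration.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with linear gain; `gains` maps doc id -> graded relevance."""
    dcg = sum(
        gains.get(d, 0) / math.log2(rank + 2)
        for rank, d in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy judgments and a toy ranking (made-up IDs).
gains = {"d1": 3, "d2": 1, "d5": 2}
ranking = ["d3", "d1", "d7", "d2"]
print(recall_at_k(ranking, set(gains), 3))
print(ndcg_at_k(ranking, gains, 10))
```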

Loading the dataset

Loading the document embeddings

You can either load the dataset like this:

from datasets import load_dataset
docs = load_dataset("Snowflake/msmarco-v2.1-snowflake-arctic-embed-m-v2.0", split="train")

Or you can stream it without downloading it first:

from datasets import load_dataset
docs = load_dataset("Snowflake/msmarco-v2.1-snowflake-arctic-embed-m-v2.0", split="train", streaming=True)
for doc in docs:
    doc_id = doc['docid']
    url = doc['url']
    text = doc['text']
    emb = doc['embedding']

Note: the full dataset is about 620GB, so it will take a while to download and may not fit on some devices.
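
Given that size, it can help to inspect only a small sample of a stream. The sketch below uses `itertools.islice` to materialize the first few records of any lazy iterable. The `fake_stream` generator and its field names are hypothetical stand-ins so the example runs without a download; a dataset loaded with `streaming=True` is iterable and can be passed in the same way.

```python
from itertools import islice


def take_first(stream, n):
    """Materialize only the first n records of a (possibly huge) iterable."""
    return list(islice(stream, n))


def fake_stream():
    # Hypothetical stand-in for a streaming dataset: yields records lazily.
    i = 0
    while True:
        yield {"docid": f"doc_{i}", "embedding": [float(i), 1.0]}
        i += 1


sample = take_first(fake_stream(), 5)
print(len(sample), sample[-1]["docid"])
```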

Search

A full search example (on the first 1,000 paragraphs):

from datasets import load_dataset
import torch
from transformers import AutoModel, AutoTokenizer
import numpy as np


top_k = 100
docs_stream = load_dataset("Snowflake/msmarco-v2.1-snowflake-arctic-embed-m-v2.0",split="train", streaming=True)

docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['embedding'])
    if len(docs) >= 1000:  # stop after the first 1,000 segments
        break

doc_embeddings = np.asarray(doc_embeddings)

tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-embed-m-v2.0')
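# trust_remote_code=True is assumed to be required here because this checkpoint ships custom model code
model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-embed-m-v2.0', add_pooling_layer=False, trust_remote_code=True)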
model.eval()

query_prefix = 'Represent this sentence for searching relevant passages: '
queries  = ['how do you clean smoke off walls']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Compute token embeddings
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]


# Normalize embeddings (the stored document embeddings are not normalized)
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
# Convert the NumPy array to a float32 tensor so it can be normalized with torch
doc_embeddings = torch.nn.functional.normalize(torch.as_tensor(doc_embeddings, dtype=torch.float32), p=2, dim=1)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.matmul(query_embeddings, doc_embeddings.T)[0].numpy()
top_k_hits = np.argpartition(dot_scores, -top_k)[-top_k:].tolist()

# Sort top_k_hits by dot score
top_k_hits.sort(key=lambda x: dot_scores[x], reverse=True)

# Print results
print("Query:", queries[0])
for doc_id in top_k_hits:
    print(docs[doc_id]['doc_id'])
    print(docs[doc_id]['text'])
    print(docs[doc_id]['url'], "\n")
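
The example above holds all candidate embeddings in memory at once. For a larger slice of the corpus, one option is to score the stream in chunks and keep only a running top-k. The following is a minimal sketch of that idea, assuming the query and the document matrices are already L2-normalized; the helper name `streaming_top_k` and the toy data are illustrative assumptions.

```python
import heapq

import numpy as np


def streaming_top_k(query, chunks, k):
    """Return the k best (score, doc_id) pairs over an iterable of chunks.

    query:  1-D vector, assumed already normalized.
    chunks: iterable of (ids, matrix) pairs, one matrix row per id.
    """
    heap = []  # min-heap; the weakest kept hit sits at heap[0]
    for ids, matrix in chunks:
        scores = np.asarray(matrix, dtype=np.float32) @ query
        for doc_id, score in zip(ids, scores.tolist()):
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)


# Toy 2-D data in two chunks (made-up IDs and vectors).
query = np.array([1.0, 0.0], dtype=np.float32)
chunks = [
    (["a", "b"], [[0.9, 0.1], [0.2, 0.8]]),
    (["c", "d"], [[0.5, 0.5], [1.0, 0.0]]),
]
top = streaming_top_k(query, chunks, k=2)
print(top)
```

Memory use then depends on the chunk size and `k` rather than on the number of documents scored.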