Snowflake Arctic Embed M V2.0 Embeddings for MSMARCO V2.1 for TREC-RAG
This dataset contains embeddings for the MSMARCO-V2.1 dataset, which is used as the corpus for TREC RAG. All embeddings were created using Snowflake's Arctic Embed M v2.0 and are intended to serve as a simple baseline for dense retrieval-based methods. Note that the embeddings are not normalized, so you will need to normalize them before use.
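Because the stored vectors are not unit length, a dot product over the raw embeddings is not a cosine similarity. A minimal sketch of L2 normalization with NumPy follows; `l2_normalize` is a hypothetical helper, and the toy 2-D vectors stand in for the real 768-dimensional embeddings.

```python
import numpy as np

def l2_normalize(embeddings, eps=1e-12):
    """Scale each row to unit L2 norm (hypothetical helper, not part of the dataset)."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # eps guards against division by zero for an all-zero row
    return embeddings / np.maximum(norms, eps)

# Toy 2-D vectors standing in for real embeddings (made-up values)
raw = [[3.0, 4.0], [0.0, 2.0]]
unit = l2_normalize(raw)
```

After normalization, the dot product between two rows equals their cosine similarity.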
Retrieval Performance
Retrieval performance on TREC DL21-23, MSMARCOV2-Dev, and Raggy Queries is reported below, with BM25 as a baseline. For both systems, retrieval is at the segment level and a document's score is the maximum score over its passages (Doc Score = Max(passage score)). Retrieval uses a dot product computed in BF16.
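The Doc Score = Max(passage score) rule can be sketched as below. This is an illustration only: `doc_scores_from_segments` is a hypothetical helper, the segment IDs and scores are made up, and it assumes the document ID is the part of a segment ID before the `#` character.

```python
def doc_scores_from_segments(segment_scores):
    """Collapse segment-level scores to document level by taking the max."""
    doc_scores = {}
    for segment_id, score in segment_scores.items():
        # Assumption: the document ID is the part of the segment ID before '#'
        doc_id = segment_id.split("#", 1)[0]
        if doc_id not in doc_scores or score > doc_scores[doc_id]:
            doc_scores[doc_id] = score
    return doc_scores

# Made-up segment scores for two hypothetical documents
segment_scores = {
    "doc_a#0": 0.41,
    "doc_a#1": 0.73,
    "doc_b#0": 0.58,
}
doc_scores = doc_scores_from_segments(segment_scores)
# Rank documents by their best-scoring segment
ranking = sorted(doc_scores, key=doc_scores.get, reverse=True)
```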
NDCG@10
Dataset | BM25 | Arctic-M-V2.0 (768 Dimensions) |
---|---|---|
Deep Learning 2021 | 0.5778 | |
Deep Learning 2022 | 0.3576 | |
Deep Learning 2023 | 0.3356 | |
msmarcov2-dev | N/A | |
msmarcov2-dev2 | N/A | |
Raggy Queries | 0.4227 | |
RAG 2024 | | |
Recall@100
Dataset | BM25 | Arctic-M-V2.0 (768 Dimensions) |
---|---|---|
Deep Learning 2021 | 0.3811 | |
Deep Learning 2022 | 0.233 | |
Deep Learning 2023 | 0.3049 | |
msmarcov2-dev | 0.6683 | |
msmarcov2-dev2 | 0.6771 | |
Raggy Queries | 0.2807 | |
RAG 2024 | | |
Recall@1000
Dataset | BM25 | Arctic-M-V2.0 (768 Dimensions) |
---|---|---|
Deep Learning 2021 | 0.7115 | |
Deep Learning 2022 | 0.479 | |
Deep Learning 2023 | 0.5852 | |
msmarcov2-dev | 0.8528 | |
msmarcov2-dev2 | 0.8577 | |
Raggy Queries | 0.5745 | |
RAG 2024 | | |
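For readers unfamiliar with the metrics in these tables, here is a rough sketch of how NDCG@k and Recall@k can be computed for a single query. It assumes linear gain with a log2 discount and treats any positive relevance grade as relevant; official evaluation tools may use different gain functions or relevance thresholds, so this will not necessarily reproduce the numbers above. The function names, `qrels`, and `run` are illustrative.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k with linear gain and log2 discount (one common variant)."""
    dcg = sum(relevance.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevance, k=100):
    """Fraction of relevant documents (grade > 0, an assumption) in the top k."""
    relevant = {doc_id for doc_id, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

# Made-up relevance judgments and ranking for one query
qrels = {"d1": 3, "d2": 1, "d3": 0}
run = ["d1", "d3", "d2", "d4"]
ndcg = ndcg_at_k(run, qrels, k=10)
recall = recall_at_k(run, qrels, k=100)
```

The reported figures would be averages of such per-query values over each query set.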
Loading the dataset
Loading the document embeddings
You can either load the dataset like this:
from datasets import load_dataset
docs = load_dataset("Snowflake/msmarco-v2.1-snowflake-arctic-embed-m-v2.0", split="train")
Or you can stream it without downloading it first:
from datasets import load_dataset
docs = load_dataset("Snowflake/msmarco-v2.1-snowflake-arctic-embed-m-v2.0", split="train", streaming=True)
for doc in docs:
    doc_id = doc['docid']
    url = doc['url']
    text = doc['text']
    emb = doc['embedding']
Note: the full dataset is ~620GB, so it will take a while to download and may not fit on some devices.
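Given the corpus size, it can help to process a streamed dataset in fixed-size batches so that only one batch is held in memory at a time. A minimal sketch, where `iter_batches` is a hypothetical helper that works on any iterable of rows and `fake_stream` is a made-up stand-in for the streamed dataset:

```python
from itertools import islice

def iter_batches(rows, batch_size):
    """Yield lists of up to batch_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Stand-in for a streamed dataset: a generator of rows with made-up values
fake_stream = ({"docid": f"doc_{i}", "embedding": [float(i), 1.0]} for i in range(5))
batch_sizes = [len(batch) for batch in iter_batches(fake_stream, batch_size=2)]
```

In practice the `rows` argument would be the streaming dataset object from the snippet above.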
Search
A full search example (on the first 1,000 paragraphs):
from datasets import load_dataset
import torch
from transformers import AutoModel, AutoTokenizer
import numpy as np

num_docs = 1000  # number of paragraphs to load from the stream
top_k = 100      # number of hits to return

docs_stream = load_dataset("Snowflake/msmarco-v2.1-snowflake-arctic-embed-m-v2.0", split="train", streaming=True)
docs = []
doc_embeddings = []
for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['embedding'])
    if len(docs) >= num_docs:
        break
# Convert to a float32 tensor so the documents can be normalized with torch
doc_embeddings = torch.from_numpy(np.asarray(doc_embeddings, dtype=np.float32))

tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-embed-m-v2.0')
# Assumption: trust_remote_code=True is likely needed if the model ships custom modeling code
model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-embed-m-v2.0', add_pooling_layer=False, trust_remote_code=True)
model.eval()

# Query prefix as given in this card; confirm it against the model card
query_prefix = 'Represent this sentence for searching relevant passages: '
queries = ['how do you clean smoke off walls']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Compute query embeddings from the first ([CLS]) token
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]

# Normalize embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = torch.nn.functional.normalize(doc_embeddings, p=2, dim=1)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.matmul(query_embeddings, doc_embeddings.T)[0].numpy()
top_k_hits = np.argpartition(dot_scores, -top_k)[-top_k:].tolist()

# Sort top_k_hits by dot score
top_k_hits.sort(key=lambda x: dot_scores[x], reverse=True)

# Print results
# Field names are as written in this card; the loading example uses 'docid',
# so check docs[0].keys() if a key is missing
print("Query:", queries[0])
for idx in top_k_hits:
    print(docs[idx]['doc_id'])
    print(docs[idx]['text'])
    print(docs[idx]['url'], "\n")
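The example above holds every loaded document embedding in memory at once. For a larger slice of the corpus, one option is to score batch by batch and keep only a running top-k. The sketch below is one way to do that, not the method used for the reported results: `streaming_top_k` is a hypothetical helper, the IDs and 2-D vectors are made up, and it assumes the query and document vectors are already normalized.

```python
import heapq
import numpy as np

def streaming_top_k(query, batches, top_k):
    """Keep the top_k highest dot-product scores seen across all batches.

    batches yields (ids, embeddings) pairs; vectors are assumed normalized.
    """
    heap = []  # min-heap of (score, doc_id); heap[0] is the weakest kept hit
    for ids, embeddings in batches:
        scores = np.asarray(embeddings, dtype=np.float32) @ query
        for doc_id, score in zip(ids, scores.tolist()):
            if len(heap) < top_k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    # Highest score first
    return sorted(heap, reverse=True)

# Made-up unit vectors in two batches
query = np.array([1.0, 0.0], dtype=np.float32)
batches = [
    (["a", "b"], [[1.0, 0.0], [0.0, 1.0]]),
    (["c", "d"], [[0.6, 0.8], [-1.0, 0.0]]),
]
hits = streaming_top_k(query, batches, top_k=2)
```

Memory use then scales with the batch size and `top_k` rather than with the number of documents scanned.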