This directory holds the benchmarks we constructed, covering four categories, four scenarios, and the corresponding imbalanced versions. The packaged images can be downloaded from here or using the download script. Since the whole corpus is too large (~30 GB), please contact us if you need it.
For more information, please refer to the original paper and the following sections.
For each category, we have a training set, validation set and eight test sets. The test sets include four scenarios (Vanilla, Record Linking, Cluster-focused Matching, and Open Matching) and their corresponding imbalanced versions.
The files are named as follows:
.
├── datasets
│ ├── all/clothing/shoes/accessories
│ │ ├── train.parquet <- Training Set
│ │ ├── val.parquet <- Validation Set
│ │ ├── test.parquet <- Vanilla Test Set
│ │ ├── test_rl.parquet <- Record Linking Test Set
│ │ ├── test_cfm.parquet <- Cluster-focused Matching Test Set
│ │ ├── test_om.parquet <- Open Matching Test Set
│ │ ├── test_i.parquet <- Imbalanced Version
│ │ ├── test_irl.parquet <- Same as above
│ │ ├── test_icfm.parquet <- Same as above
│ │ └── test_iom.parquet <- Same as above
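The layout above can be turned into a small path helper. This is a minimal sketch, assuming that `all/clothing/shoes/accessories` stands for four sibling directories named `all`, `clothing`, `shoes`, and `accessories` under `datasets`; the names `split_path` and `load_split` are hypothetical, and `load_split` additionally assumes pandas with a parquet engine such as pyarrow is installed.

```python
from pathlib import Path

# Assumed directory names, one per category (inferred from the tree above).
CATEGORIES = ("all", "clothing", "shoes", "accessories")

# File stems as listed in the tree: train, val, four test scenarios,
# and the imbalanced version of each test scenario.
SPLITS = (
    "train", "val",
    "test", "test_rl", "test_cfm", "test_om",
    "test_i", "test_irl", "test_icfm", "test_iom",
)


def split_path(root, category, split):
    """Build the path of one parquet file, e.g. datasets/shoes/test_rl.parquet."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return Path(root) / category / f"{split}.parquet"


def load_split(root, category, split):
    """Read one split into a DataFrame (needs pandas and a parquet engine)."""
    import pandas as pd  # imported lazily so the path helper works without pandas

    return pd.read_parquet(split_path(root, category, split))
```

For example, `load_split("datasets", "shoes", "test_rl")` would read the Record Linking test set for shoes, provided the files are in place.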
Each record consists of seven fields, as explained below:
id: Record ID
title: Product Title
pict_url: URL of Product Image
cate_level_name: Coarse Category
cate_name: Fine-grained Category
pv_pairs: Product Attributes, with pairs separated by #;# and each name separated from its value by #:#, for example "color#:#red#;#year#:#2022"
cluster_id: Cluster ID; records in the same cluster are considered to refer to the same product.
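The `pv_pairs` encoding can be unpacked with two string splits. The sketch below is illustrative only: the function name `parse_pv_pairs` is hypothetical, and it assumes that attribute names are unique within a record and that fragments without a `#:#` separator can be skipped.

```python
def parse_pv_pairs(raw):
    """Turn 'color#:#red#;#year#:#2022' into {'color': 'red', 'year': '2022'}."""
    attrs = {}
    if not raw:
        return attrs
    for pair in raw.split("#;#"):
        name, sep, value = pair.partition("#:#")
        if not sep:
            # Assumption: fragments without a name/value separator are ignored.
            continue
        attrs[name] = value
    return attrs


print(parse_pv_pairs("color#:#red#;#year#:#2022"))
# {'color': 'red', 'year': '2022'}
```

Values are kept as strings; converting `year` to an integer, for instance, is left to the caller.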
Each instance (record pair) consists of two records, distinguished by the suffixes “_left” and “_right”, and the “label” field.
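To work with the two records of an instance separately, the suffixed columns can be split back into a left and a right record. This is a rough sketch that assumes column names of the form `<field>_left` and `<field>_right` (for example `title_left`); the helper `split_pair` and the sample values are made up for illustration.

```python
def split_pair(row):
    """Split one flat instance into (left record, right record, label)."""
    left, right = {}, {}
    for key, value in row.items():
        if key.endswith("_left"):
            left[key[: -len("_left")]] = value
        elif key.endswith("_right"):
            right[key[: -len("_right")]] = value
    return left, right, row["label"]


# Illustrative instance with only two of the seven fields filled in.
example = {
    "id_left": 1, "title_left": "red running shoes",
    "id_right": 2, "title_right": "running shoes, red",
    "label": 1,
}
left, right, label = split_pair(example)
print(left)   # {'id': 1, 'title': 'red running shoes'}
print(right)  # {'id': 2, 'title': 'running shoes, red'}
```

With a pandas DataFrame, the same function can be applied per row via `row.to_dict()`.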
The statistics of the benchmarks are as follows:
Category | Split | #Positive | #Negative | #Total
---|---|---|---|---
All | Training | 1718 | 5282 | 7000
All | Validation | 256 | 744 | 1000
All | Vanilla Test | 526 | 1474 | 2000
Shoes | Training | 1732 | 5268 | 7000
Shoes | Validation | 248 | 752 | 1000
Shoes | Vanilla Test | 520 | 1480 | 2000
Clothing | Training | 1737 | 5263 | 7000
Clothing | Validation | 259 | 741 | 1000
Clothing | Vanilla Test | 504 | 1496 | 2000
Accessories | Training | 1721 | 5279 | 7000
Accessories | Validation | 250 | 750 | 1000
Accessories | Vanilla Test | 529 | 1471 | 2000
Test Set | #Positive | #Negative | #Total
---|---|---|---
Test (RL, CFM, OM) | 500 | 1500 | 2000
Imbalanced Version | 500 | 49500 | 50000
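The difference between the two rows is easiest to see as a positive rate. The short calculation below uses only the counts from the table; the function name `positive_rate` is hypothetical.

```python
def positive_rate(n_pos, n_neg):
    """Fraction of positive pairs among all pairs."""
    return n_pos / (n_pos + n_neg)


standard = positive_rate(500, 1500)     # Test (RL, CFM, OM)
imbalanced = positive_rate(500, 49500)  # Imbalanced Version

print(f"{standard:.2%}")    # 25.00%
print(f"{imbalanced:.2%}")  # 1.00%
```

The imbalanced versions keep the same 500 positives but add negatives until only 1 in 100 pairs is a match.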