ember

Constructed Benchmarks

This directory holds the benchmarks we constructed, covering four categories, four scenarios, and their corresponding imbalanced versions. The packaged images can be downloaded from here or using the download script. Since the whole corpus is too large (~30 GB), please contact us if you need it.

For more information, please refer to the original paper and the following sections.

Filename Explanation

For each category, we have a training set, validation set and eight test sets. The test sets include four scenarios (Vanilla, Record Linking, Cluster-focused Matching, and Open Matching) and their corresponding imbalanced versions.


The files are organized as follows:

.
├── datasets
│   ├── all/clothing/shoes/accessories
│   │   ├── train.parquet         <- Training Set
│   │   ├── val.parquet           <- Validation Set
│   │   ├── test.parquet          <- Vanilla Test Set
│   │   ├── test_rl.parquet       <- Record Linking Test Set
│   │   ├── test_cfm.parquet      <- Cluster-focused Matching Test Set
│   │   ├── test_om.parquet       <- Open Matching Test Set
│   │   ├── test_i.parquet        <- Imbalanced Vanilla Test Set
│   │   ├── test_irl.parquet      <- Imbalanced Record Linking Test Set
│   │   ├── test_icfm.parquet     <- Imbalanced Cluster-focused Matching Test Set
│   │   └── test_iom.parquet      <- Imbalanced Open Matching Test Set
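
The naming scheme in the tree can be sketched in a few lines of Python. This is a minimal illustration, not part of the released tooling: the helper names `scenario_filename` and `benchmark_files` are hypothetical, and it assumes that `all/clothing/shoes/accessories` stands for four sibling directories under `datasets`.

```python
from pathlib import Path

# Assumption: the tree entry "all/clothing/shoes/accessories" denotes four
# sibling directories, one per category.
CATEGORIES = ("all", "clothing", "shoes", "accessories")

# Scenario name -> filename infix, as listed in the tree.
SCENARIOS = {"vanilla": "", "rl": "rl", "cfm": "cfm", "om": "om"}


def scenario_filename(scenario: str, imbalanced: bool = False) -> str:
    """Return the test-set filename for a scenario (hypothetical helper)."""
    infix = ("i" if imbalanced else "") + SCENARIOS[scenario]
    return f"test_{infix}.parquet" if infix else "test.parquet"


def benchmark_files(root, category: str) -> dict:
    """Map split names to the expected paths for one category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    base = Path(root) / category
    files = {"train": base / "train.parquet", "val": base / "val.parquet"}
    for scenario in SCENARIOS:
        for imbalanced in (False, True):
            key = f"test_{scenario}" + ("_imbalanced" if imbalanced else "")
            files[key] = base / scenario_filename(scenario, imbalanced)
    return files


if __name__ == "__main__":
    for name, path in benchmark_files("datasets", "shoes").items():
        print(f"{name:24s} {path}")
```

Each returned path can then be handed to any Parquet reader, for example `pandas.read_parquet`.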

Field Description

Each record consists of seven fields, as explained below:

id: Record ID
title: Product Title
pict_url: URL of Product Image
cate_level_name: Coarse Category
cate_name: Fine-grained Category
pv_pairs: Product Attributes, given as name#:#value pairs joined by #;#, for example "color#:#red#;#year#:#2022"
cluster_id: Cluster ID; records in the same cluster are considered to refer to the same product.
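
As an illustration of the pv_pairs format, the following sketch splits the string into a dictionary. The function name `parse_pv_pairs` is hypothetical, and skipping empty or malformed pairs is an assumption about how such cases should be handled, not documented behavior.

```python
from typing import Dict


def parse_pv_pairs(pv_pairs: str) -> Dict[str, str]:
    """Parse 'name#:#value#;#name#:#value' into a dict (hypothetical helper)."""
    attributes: Dict[str, str] = {}
    if not pv_pairs:
        return attributes
    for pair in pv_pairs.split("#;#"):
        name, separator, value = pair.partition("#:#")
        if not separator:
            # Assumption: silently skip entries without a '#:#' separator.
            continue
        attributes[name] = value
    return attributes


if __name__ == "__main__":
    print(parse_pv_pairs("color#:#red#;#year#:#2022"))
    # {'color': 'red', 'year': '2022'}
```

Values stay as strings (for example `'2022'`), since the field description does not specify attribute types.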

Each instance (record pair) consists of two records, distinguished by the suffixes “_left” and “_right”, and the “label” field.
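
A record pair can be split back into its two records as sketched below. This assumes that each column name is the field name followed directly by the suffix (for example `title_left`); the helper `split_pair` and the sample values are purely illustrative.

```python
# The seven record fields listed in the field description.
FIELDS = (
    "id", "title", "pict_url", "cate_level_name",
    "cate_name", "pv_pairs", "cluster_id",
)


def split_pair(row: dict):
    """Split one flat instance into (left record, right record, label)."""
    # Assumption: columns are named '<field>_left' and '<field>_right'.
    left = {field: row[f"{field}_left"] for field in FIELDS}
    right = {field: row[f"{field}_right"] for field in FIELDS}
    return left, right, row["label"]


if __name__ == "__main__":
    # Made-up values, for illustration only.
    row = {f"{field}_left": f"L-{field}" for field in FIELDS}
    row.update({f"{field}_right": f"R-{field}" for field in FIELDS})
    row["label"] = 1
    left, right, label = split_pair(row)
    print(left["title"], right["title"], label)
```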

Statistics

The statistics of the benchmarks are as follows:

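| Category    | Split        | #Positive | #Negative | #Total |
|-------------|--------------|-----------|-----------|--------|
| All         | Training     | 1718      | 5282      | 7000   |
| All         | Validation   | 256       | 744       | 1000   |
| All         | Vanilla Test | 526       | 1474      | 2000   |
| Shoes       | Training     | 1732      | 5268      | 7000   |
| Shoes       | Validation   | 248       | 752       | 1000   |
| Shoes       | Vanilla Test | 520       | 1480      | 2000   |
| Clothing    | Training     | 1737      | 5263      | 7000   |
| Clothing    | Validation   | 259       | 741       | 1000   |
| Clothing    | Vanilla Test | 504       | 1496      | 2000   |
| Accessories | Training     | 1721      | 5279      | 7000   |
| Accessories | Validation   | 250       | 750       | 1000   |
| Accessories | Vanilla Test | 529       | 1471      | 2000   |

| Test Set           | #Positive | #Negative | #Total |
|--------------------|-----------|-----------|--------|
| Test (RL, CFM, OM) | 500       | 1500      | 2000   |
| Imbalanced Version | 500       | 49500     | 50000  |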
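
To make the difference between the balanced and imbalanced test sets concrete, the short sketch below computes the positive rate and the number of negatives per positive from the counts listed above. The helper name `summarize` is hypothetical.

```python
# Counts copied from the statistics above: (positive, negative, total).
TEST_SETS = {
    "Test (RL, CFM, OM)": (500, 1500, 2000),
    "Imbalanced Version": (500, 49500, 50000),
}


def summarize(positive: int, negative: int, total: int):
    """Return (positive rate, negatives per positive) after a sanity check."""
    if positive + negative != total:
        raise ValueError("positive and negative counts do not sum to total")
    return positive / total, negative / positive


if __name__ == "__main__":
    for name, counts in TEST_SETS.items():
        rate, neg_per_pos = summarize(*counts)
        print(f"{name}: {rate:.1%} positive, {neg_per_pos:.0f} negatives per positive")
    # Test (RL, CFM, OM): 25.0% positive, 3 negatives per positive
    # Imbalanced Version: 1.0% positive, 99 negatives per positive
```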