ember

Constructed Benchmarks

This directory holds the benchmarks we constructed, covering four categories, four scenarios, and the corresponding imbalanced versions. The packaged images can be downloaded from here or with the download script. Because the whole corpus is too large (~30 GB), please contact us if you need it.

For more information, please refer to the original paper and the following sections.

Filename Explanation

For each category, we have a training set, validation set and eight test sets. The test sets include four scenarios (Vanilla, Record Linking, Cluster-focused Matching, and Open Matching) and their corresponding imbalanced versions.


The files are organized as follows:

.
├── datasets
│   ├── all/clothing/shoes/accessories   <- One directory per category
│   │   ├── train.parquet         <- Training Set
│   │   ├── val.parquet           <- Validation Set
│   │   ├── test.parquet          <- Vanilla Test Set
│   │   ├── test_rl.parquet       <- Record Linking Test Set
│   │   ├── test_cfm.parquet      <- Cluster-focused Matching Test Set
│   │   ├── test_om.parquet       <- Open Matching Test Set
│   │   ├── test_i.parquet        <- Imbalanced Vanilla Test Set
│   │   ├── test_irl.parquet      <- Imbalanced Record Linking Test Set
│   │   ├── test_icfm.parquet     <- Imbalanced Cluster-focused Matching Test Set
│   │   └── test_iom.parquet      <- Imbalanced Open Matching Test Set
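
As a minimal sketch of how the naming scheme in the tree above could be used programmatically, the helper below builds the path to one file. It assumes that each category has its own subdirectory under datasets and that the "i" in a test filename marks the imbalanced version of the matching scenario. The names benchmark_path, CATEGORIES, and SCENARIO_SUFFIX are hypothetical and not part of the released code.

```python
from pathlib import Path

# Category directories, as listed in the tree above (assumed to be one directory each).
CATEGORIES = ("all", "clothing", "shoes", "accessories")

# Filename suffix for each test scenario; the Vanilla test set has no suffix.
SCENARIO_SUFFIX = {"vanilla": "", "rl": "rl", "cfm": "cfm", "om": "om"}


def benchmark_path(root, category, split, imbalanced=False):
    """Return the parquet path for one split of one category.

    split is "train", "val", or a key of SCENARIO_SUFFIX.
    """
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if split in ("train", "val"):
        if imbalanced:
            raise ValueError("only test sets have imbalanced versions")
        stem = split
    elif split in SCENARIO_SUFFIX:
        suffix = ("i" if imbalanced else "") + SCENARIO_SUFFIX[split]
        # "test" for the balanced Vanilla set, otherwise "test_<suffix>".
        stem = f"test_{suffix}" if suffix else "test"
    else:
        raise ValueError(f"unknown split: {split!r}")
    return Path(root) / "datasets" / category / f"{stem}.parquet"


print(benchmark_path("data", "shoes", "rl", imbalanced=True).as_posix())
# data/datasets/shoes/test_irl.parquet
```

The resulting path can then be passed to any parquet reader, for example pandas.read_parquet.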

Field Description

Each record consists of seven fields, as explained below:

id: Record ID
title: Product Title
pict_url: URL of Product Image
cate_level_name: Coarse Category
cate_name: Fine-grained Category
pv_pairs: Product Attributes, where pairs are separated by #;# and each key is separated from its value by #:#, for example "color#:#red#;#year#:#2022"
cluster_id: Cluster ID; records in the same cluster are considered to refer to the same product.
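
The pv_pairs format described above can be decoded with plain string splitting. The following is a minimal sketch; the function name parse_pv_pairs is hypothetical, and skipping empty or malformed pairs is an assumption rather than documented behavior.

```python
def parse_pv_pairs(pv_pairs):
    """Split a pv_pairs string into a dict of attribute name -> value."""
    attributes = {}
    if not pv_pairs:
        return attributes
    for pair in pv_pairs.split("#;#"):
        key, sep, value = pair.partition("#:#")
        if not sep:
            # Assumption: ignore fragments that lack the key/value delimiter.
            continue
        attributes[key] = value
    return attributes


print(parse_pv_pairs("color#:#red#;#year#:#2022"))
# {'color': 'red', 'year': '2022'}
```

Values are kept as strings, and if an attribute name repeats, the last occurrence wins.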

Each instance (record pair) consists of two records, distinguished by the suffixes “_left” and “_right”, and the “label” field.
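
To illustrate the pair layout, the sketch below separates one instance into its left record, right record, and label. It assumes that the column names are formed by appending the suffix directly to each field name (for example title_left), which is inferred from the description above; the helper split_pair and the sample values are hypothetical.

```python
# The seven record fields listed under Field Description.
RECORD_FIELDS = (
    "id", "title", "pict_url", "cate_level_name",
    "cate_name", "pv_pairs", "cluster_id",
)


def split_pair(row):
    """Split one instance (a mapping of column -> value) into (left, right, label)."""
    left = {field: row[f"{field}_left"] for field in RECORD_FIELDS}
    right = {field: row[f"{field}_right"] for field in RECORD_FIELDS}
    return left, right, row["label"]


# Hypothetical instance with placeholder values, for illustration only.
row = {f"{field}_left": f"L-{field}" for field in RECORD_FIELDS}
row.update({f"{field}_right": f"R-{field}" for field in RECORD_FIELDS})
row["label"] = 1

left, right, label = split_pair(row)
print(left["title"], right["title"], label)
# L-title R-title 1
```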

Statistics

The statistics of the benchmarks are as follows:

		#Positive	#Negative	#Total
All	Training	1718	5282	7000
	Validation	256	744	1000
	Vanilla Test	526	1474	2000
Shoes	Training	1732	5268	7000
	Validation	248	752	1000
	Vanilla Test	520	1480	2000
Clothing	Training	1737	5263	7000
	Validation	259	741	1000
	Vanilla Test	504	1496	2000
Accessories	Training	1721	5279	7000
	Validation	250	750	1000
	Vanilla Test	529	1471	2000

	#Positive	#Negative	#Total
Test (RL, CFM, OM)	500	1500	2000
Imbalanced Version	500	49500	50000
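
As a quick worked check of the class balance implied by the last table, the sketch below computes the share of positive pairs and the number of negatives per positive. The function names are hypothetical; the counts are taken from the table above.

```python
def positive_rate(positives, total):
    """Fraction of instances that are positive pairs."""
    return positives / total


def negatives_per_positive(positives, negatives):
    """How many negative pairs there are for each positive pair."""
    return negatives / positives


# Test (RL, CFM, OM): 500 positives out of 2000.
print(f"{positive_rate(500, 2000):.1%}", negatives_per_positive(500, 1500))
# 25.0% 3.0

# Imbalanced Version: 500 positives out of 50000.
print(f"{positive_rate(500, 50000):.1%}", negatives_per_positive(500, 49500))
# 1.0% 99.0
```

The imbalanced test sets keep the same 500 positive pairs' worth of signal but dilute it from a 1:3 to a 1:99 positive-to-negative ratio.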