Datasets for matching identities



KDD15 Dataset


Due to privacy constraints, we can share only part of the data used in our KDD15 paper. Specifically, we can share the matching identities that users willingly provide on Google+. If you are interested in using this data, please read the terms and conditions carefully.

Google+ sitemap (contains all the users in Google+ in 2010): profile information for 475,257 random Google+ identities. Each line corresponds to a user and contains the links to the matching identities of the user on other social networks.

Profile information on other social networks of the matching identities of the users in the above Google+ dataset. Each line corresponds to a (Facebook/Linkedin/Twitter/Flickr/Myspace) identity. The linked_username element corresponds to the ID of the Google+ identity of the user (i.e., the key that links to the google_profiles.json.tar.gz dataset).

For 1037 Twitter identities, the list of identities on Facebook that have the same or similar names. Each file corresponds to a Twitter identity and contains a list of identities on Facebook that have the same name as the Twitter identity or a similar one. The ID of the Twitter identity is in the name of the file.
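As a rough illustration of how the linked_username key could be used, the sketch below groups identities from other networks by the Google+ ID they point to. It assumes that each line of a profile file is a standalone JSON object. The function name group_by_google_id, the network and name fields, and the sample records are hypothetical and are not taken from the dataset; check the actual files before relying on this layout.

```python
import json
from collections import defaultdict


def group_by_google_id(lines):
    """Group identity records by the Google+ ID stored in `linked_username`.

    Assumes one JSON object per line; this layout is an assumption,
    not a documented property of the dataset.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        google_id = record.get("linked_username")
        if google_id is not None:
            groups[google_id].append(record)
    return dict(groups)


# Made-up sample records; the "network" and "name" fields are hypothetical.
sample = [
    '{"linked_username": "g1", "network": "twitter", "name": "alice"}',
    '{"linked_username": "g1", "network": "flickr", "name": "alice_f"}',
    '{"linked_username": "g2", "network": "facebook", "name": "bob"}',
]
matches = group_by_google_id(sample)
for google_id, records in sorted(matches.items()):
    print(google_id, [r["network"] for r in records])
```

With real data, the lines would come from an extracted profile file opened for reading instead of the in-memory sample list.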

WWW13 Dataset


List of matching identities: the first column is the Twitter ID, the second column the Flickr ID, and the third column the Yelp ID. A missing ID in a row means that we did not find the matching identity.
Profile information of users: each line corresponds to a different profile.
User posts and their metadata: each line corresponds to a tweet and contains the ID of the user who generated the tweet.
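The three-column layout of the matching identities list can be read with a few lines of Python. This is a minimal sketch that assumes tab-separated columns and an empty field for a missing ID. The delimiter, the function name read_matches, and the sample IDs are assumptions for illustration, so adjust them to the actual file.

```python
import csv
import io


def read_matches(handle, delimiter="\t"):
    """Read rows of (twitter id, flickr id, yelp id).

    Empty fields become None, meaning no matching identity was found.
    The tab delimiter is an assumption about the file format.
    """
    rows = []
    for fields in csv.reader(handle, delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        fields = [f.strip() or None for f in fields]
        fields += [None] * (3 - len(fields))  # pad short rows
        twitter_id, flickr_id, yelp_id = fields[:3]
        rows.append({"twitter": twitter_id, "flickr": flickr_id, "yelp": yelp_id})
    return rows


# Made-up sample rows; real IDs will look different.
sample = io.StringIO("111\tf_aaa\ty_bbb\n222\t\ty_ccc\n333\tf_ddd\t\n")
matches = read_matches(sample)

# Keep only users matched on all three sites.
complete = [m for m in matches if all(m.values())]
print(len(matches), len(complete))
```

Representing a missing ID as None keeps partially matched users in the list while making it easy to filter for users found on all three sites.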

Terms and Conditions

  1. You will use the data solely for the purpose of non-profit research or non-profit education.

  2. You will respect the privacy of end users and organizations that may be identified in the data. You will not attempt to reverse engineer, decrypt, de-anonymize, derive or otherwise re-identify anonymized information.

  3. You will not distribute the data beyond your immediate research group.

  4. If you create a publication using our datasets, please cite our papers as follows.

    • If you use the WWW13 Dataset, please cite our WWW13 paper.

    • If you use the KDD15 Dataset, please cite our KDD15 paper.

@inproceedings{Goga:2013:EIA:2488388.2488428,
	author = {Goga, Oana and Lei, Howard and Parthasarathi, Sree Hari Krishnan and Friedland, Gerald and Sommer, Robin and Teixeira, Renata},
	title = {Exploiting Innocuous Activity for Correlating Users Across Sites},
	booktitle = {Proceedings of the 22nd International Conference on World Wide Web},
	series = {WWW '13},
	year = {2013},
	isbn = {978-1-4503-2035-1},
	location = {Rio de Janeiro, Brazil},
	pages = {447--458},
	numpages = {12},
	url = {http://doi.acm.org/10.1145/2488388.2488428},
	doi = {10.1145/2488388.2488428},
	acmid = {2488428},
	publisher = {ACM},
	address = {New York, NY, USA},
	keywords = {account correlation, geotags, language, location, online social networks, privacy, user profiles},
} 

@inproceedings{Goga:2015:RPM:2783258.2788601,
	author = {Goga, Oana and Loiseau, Patrick and Sommer, Robin and Teixeira, Renata and Gummadi, Krishna P.},
	title = {On the Reliability of Profile Matching Across Large Online Social Networks},
	booktitle = {Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
	series = {KDD '15},
	year = {2015},
	isbn = {978-1-4503-3664-2},
	location = {Sydney, NSW, Australia},
	pages = {1799--1808},
	numpages = {10},
	url = {http://doi.acm.org/10.1145/2783258.2788601},
	doi = {10.1145/2783258.2788601},
	acmid = {2788601},
	publisher = {ACM},
	address = {New York, NY, USA},
	keywords = {matching accounts, online social networks, reliability},
} 


Download

If you agree to these terms and conditions, you can download the datasets from the following links:

WWW13 dataset: https://calendar.mpi-sws.org/index.php/s/d6e678df3127b1d92ffbedb1d27a01ae

KDD15 dataset: https://calendar.mpi-sws.org/index.php/s/7b1f307a79a4e728698ce6a4dac97198

The password is data05.