Private entity resolution for big data on Apache Spark using multiple phonetic codes

Book: Big Data Recommender Systems - Volume 1: Algorithms, Architectures, Big Data, Security and Trust

Recommender systems are experiencing a significant boost from the availability of big data, which supply an abundance of user data such as past purchases and browsing history. The benefits increase when a recommender system can combine data from multiple sites to form a more complete picture of the user's preferences and interests. However, such data often originate from dispersed, heterogeneous sources and must be integrated or linked before they can be processed and analyzed. Linking such data means identifying records that refer to the same real-world entity across the heterogeneous sources, a problem known as record linkage or entity resolution. Because these data also concern human activities, privacy issues arise when linking data across different sources; this variant of the problem is known as privacy preserving record linkage (PPRL). In this chapter, we propose a parallel protocol for PPRL based on phonetic encodings that exploits novel big data processing engines to provide high-quality results efficiently. Our phonetic encoding scheme extends the work of Karakasidis and Verykios, which is based on the Soundex phonetic algorithm. This protocol also features noise generation to prevent frequency attacks and encryption of both actual and fake data to enable processing by an untrusted party. However, to cope with the low recall that a large percentage of dirty data may cause, we propose combining Soundex with another popular phonetic algorithm, namely NYSIIS. By combining two phonetic encodings, our protocol introduces redundancy and so becomes more robust and more tolerant to errors in the matching fields. Furthermore, as Soundex is particularly vulnerable to errors that occur at the beginning of the encoded text, our protocol applies a further optimization: it encodes the reverse of the original text with the second phonetic algorithm.
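The effect of encoding the reversed text can be illustrated with a minimal Python sketch. It is not the chapter's protocol: it omits noise generation, encryption and Spark, and it makes two assumptions that are labeled in the code. First, the second encoder is a stand-in: Soundex is reused on the reversed string in place of NYSIIS, which is longer to implement. Second, the rule that two names match when either code agrees is a hypothetical choice made for illustration. The function names `soundex`, `codes` and `is_match` are likewise illustrative.

```python
# Standard American Soundex digit table.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = CODES.get(ch, "")        # vowels map to "" and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]         # pad or truncate to 4 characters


def codes(name):
    """Code of the name plus a second code computed on the reversed name.

    Assumption: Soundex stands in for NYSIIS as the second encoder.
    """
    return soundex(name), soundex(name[::-1])


def is_match(a, b):
    """Hypothetical rule: match when at least one of the two codes agrees."""
    return any(x == y for x, y in zip(codes(a), codes(b)))


print(soundex("Smith"), soundex("Zmith"))   # forward codes differ
print(codes("Smith"), codes("Zmith"))       # reversed codes agree
print(is_match("Smith", "Zmith"), is_match("Smith", "Jones"))
```

In this sketch a typo in the first letter ("Zmith" for "Smith") changes the forward Soundex code, because Soundex keeps the first letter verbatim, while the code computed on the reversed string is unaffected, since the error now falls at the end of the encoded text.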

Chapter Contents:

  • 13.1 Introduction
  • 13.2 Related work
  • 13.3 Problem formulation and background
  • 13.3.1 Problem formulation and notation used
  • 13.3.2 Phonetic algorithms for privacy preserving matching
  • 13.3.3 The Soundex algorithm
  • 13.3.4 The NYSIIS algorithm
  • 13.3.5 Apache Spark
  • 13.4 A parallel privacy preserving phonetics matching protocol
  • 13.4.1 Multiple algorithms for phonetic matching
  • 13.4.2 Protocol operation
  • 13.4.3 Privacy discussion
  • 13.5 Empirical evaluation
  • 13.5.1 Experimental setup
  • 13.5.2 Experimental results
  • Algorithm combination selection
  • Matching accuracy
  • Time performance
  • 13.6 Conclusions and future work
  • References

Inspec keywords: Big Data; Linked Data; data privacy; recommender systems; cluster computing; parallel algorithms

Other keywords: noise generation; parallel protocol; Apache Spark; privacy preserving record linkage; recommender systems; NYSIIS phonetic algorithm; phonetic encodings; Soundex phonetic algorithm; multiple phonetic codes; PPRL; private entity resolution; big data

Subjects: Parallel software; Information networks; Data security; Database management systems (DBMS); Search engines
