Privacy preserving big data publishing: a scalable k-anonymization approach using MapReduce

Brijesh B. Mehta; Udai Pratap Rao

Author(s): Brijesh B. Mehta¹ and Udai Pratap Rao¹
- Affiliations: 1: Computer Engineering Department, Sardar Vallabhbhai National Institute of Technology, Surat, India
Source: Volume 11, Issue 5, October 2017, p. 271 – 276
DOI: 10.1049/iet-sen.2016.0264, Print ISSN 1751-8806, Online ISSN 1751-8814

Received 17/10/2016, Accepted 17/07/2017, Revised 28/06/2017, Published 31/07/2017

Big data is collected and processed using many different sources and tools, which leads to privacy issues. Privacy preserving data publishing techniques such as k-anonymity, l-diversity, and t-closeness are used to de-identify the data; however, a risk of re-identification always remains because the data is collected from multiple sources. Owing to the large volume of data, less generalisation or suppression is required to achieve the same level of privacy, which is known as the ‘large crowd effect’, although handling such a large volume of data for anonymization is challenging. MapReduce handles large volumes of data by distributing the data in smaller chunks across multiple nodes; consequently, the full advantage of the large volume is not realised, because each node sees only part of the data. Therefore, the scalability of privacy preserving techniques is a challenging area of research. The authors explore this area and propose an algorithm named scalable k-anonymization (SKA), which uses MapReduce for privacy preserving big data publishing. The authors also compare the approach with existing approaches; the comparison shows a remarkable improvement in data utility and significantly better performance in terms of running time.
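
The ‘large crowd effect’ and the cost of splitting data into chunks can be illustrated with a minimal sketch. This is not the SKA algorithm and not MapReduce code; it is a toy example under stated assumptions. The function names, the candidate range widths, and the sample ages are all hypothetical, and a single numeric quasi-identifier (age) is generalised into fixed-width ranges.

```python
from collections import Counter


def generalise_age(age, width):
    """Map an age to a fixed-width range label, e.g. 37 with width 10 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(values, k):
    """True if every distinct (generalised) value occurs at least k times."""
    counts = Counter(values)
    return all(count >= k for count in counts.values())


def min_width_for_k(ages, k, widths=(1, 5, 10, 20, 50, 100)):
    """Return the smallest candidate width that makes the ages k-anonymous.

    The candidate widths are an assumption made for this illustration.
    Returns None if no candidate width is sufficient.
    """
    for width in widths:
        if is_k_anonymous([generalise_age(a, width) for a in ages], k):
            return width
    return None


# Hypothetical toy data: eight ages, target k = 3.
ages = [21, 22, 23, 24, 26, 27, 28, 29]
k = 3

# Anonymizing the whole dataset at once.
width_whole = min_width_for_k(ages, k)

# Anonymizing two chunks independently, as if each lived on a separate node.
chunk_a, chunk_b = ages[0::2], ages[1::2]
width_a = min_width_for_k(chunk_a, k)
width_b = min_width_for_k(chunk_b, k)

print("whole dataset:", width_whole)
print("chunk A:", width_a, "chunk B:", width_b)
```

In this toy case the full dataset reaches 3-anonymity with 5-year ranges, while each half on its own needs 10-year ranges, so chunk-local anonymization loses more information for the same k.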
