Synthetic Data – Substitute for real-world data?

6 October 2019

The statement “Data is the new oil”, coined by Clive Humby in 2006, captured the importance of data for modern technology and society early on. Just as the use of oil poses problems for the environment, the use of data carries risks. These risks mainly concern privacy protection, the question of who owns the data, and whether it is used in the way the user intended. Governments and political unions such as the EU try to protect personal data, which companies often collect when people use one of their services.

However, neither stricter regulations nor the anonymization of personal data guarantees protection against misuse if the data gets into the wrong hands. One way to handle these risks and to conform to legal regulations is the use of synthetic data.

“Synthetic data is information that is artificially manufactured rather than generated by real-world events” (Myers, 2019). According to Garg (2019), it can be produced either by using a model that describes a real-world behavior or by sampling from a distribution observed in the real world. Thus, synthetic data is not tied to real persons and can be used without the privacy risks described above. Such realistic, person-independent data widens the range of possibilities because it simplifies the processing, analysis and exchange of large amounts of data. Compared to existing anonymization methods, synthetic data preserves the full depth of detail and therefore delivers higher value. Generating data has further advantages: it can be tailored to specific needs or conditions, and it can even be created for events that have not yet occurred, e.g. for clinical or scientific trials.
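To make the second approach more concrete, here is a minimal sketch in Python of generating synthetic values from a distribution estimated on real data. It is only an illustration under simple assumptions: the `real_ages` list is made up, the function names are hypothetical, and a single normal distribution is assumed to describe the data, which will rarely hold for real datasets.

```python
import random
import statistics


def fit_normal(real_values):
    """Estimate mean and sample standard deviation from the real data."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def generate_synthetic(real_values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to real_values."""
    mu, sigma = fit_normal(real_values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" measurements, invented for this example.
real_ages = [23, 35, 41, 29, 52, 38, 44, 31, 27, 60]

# None of the generated values is a record of an actual person;
# they only follow the estimated distribution.
synthetic_ages = generate_synthetic(real_ages, n=1000)
```

Only the fitted mean and standard deviation enter the generator here, not the individual records. Practical tools model many columns and their relationships at once, which this sketch does not attempt.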

The need for big data, and the effort of handling it, is growing significantly in times of machine learning; as a result, the trade in data, including third-party data, is rising. Synthetic data could decrease the trade in third-party data if companies are able to generate data sets themselves. According to the “Gartner Hype Cycle for Emerging Technologies, 2019”, synthetic data is a rising trend. With its benefits, it will open new paths for data-driven technologies and, as a substitute for personal data, may also protect privacy in the future. This can be an opportunity for new developments and breakthroughs.

However, synthetic data generation is still at an early stage and needs further research and technological development. In particular, synthetic data currently still needs to be checked against real-world data from time to time. Furthermore, if it is used in areas such as AI or clinical studies in healthcare, the data must be validated and traceable, as must the origin of the datasets used. In summary, there are still hurdles to overcome before synthetic data can replace personal data in most of its applications. To encourage the necessary improvements, the National Institute of Standards and Technology, a non-regulatory agency of the United States Department of Commerce, launched a challenge in 2018/2019 to improve synthetic data generation tools (total prize purse: $150,000).
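One simple way to evaluate synthetic data against real-world data is to compare the two empirical distributions. The sketch below computes the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical cumulative distribution functions, using only the Python standard library. The sample values and the function name are assumptions made for illustration; this is one possible check among many, not a complete validation procedure.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples.

    0.0 means the empirical distributions coincide at every observed point,
    1.0 means the two samples do not overlap at all.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    # The largest gap is always attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical values, invented for this example.
real_values = [23, 35, 41, 29, 52, 38, 44, 31, 27, 60]
synthetic_values = [25, 33, 40, 30, 50, 39, 46, 28, 36, 58]

gap = ks_statistic(real_values, synthetic_values)
```

A small gap suggests that the synthetic values follow the real distribution closely for this one variable. It says nothing about relationships between variables or about privacy, which would need separate checks.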

Do you think that generated data will be able to substitute for real-world data in the future?

References:
AI Multiple (2019). Synthetic Data: An Introduction & 10 Tools. [Online] Available at: https://blog.aimultiple.com/synthetic-data/ [Accessed: 02.10.2019].

Myers, A. (2019). Deepfakes: What’s real with synthetic data? [Online] Available at: https://medium.com/memory-leak/deepfakes-whats-real-with-synthetic-data-5c8348b041d2 [Accessed: 02.10.2019].

Garg, A. (2018). The Power and Challenges of Synthetic Data – 3 Principles. [Online] Available at: https://medium.com/@amitgarg/the-power-and-challenges-of-synthetic-data-3-principles-c254e25fc6d5 [Accessed: 02.10.2019].

1 thought on “Synthetic Data – Substitute for real-world data?”

  1. Very thought-provoking read!
    This could lead to quite the thought experiment and a whole series of new related research.
    With advances in AI, there is definitely the possibility of computers generating a dataset for test purposes. We know that test data for analysis purposes can already be created with Python.
    It would be interesting to see this on a much larger scale. I would also be curious to know what the implications would be in the fields of finance and healthcare.
