Spark Plug #19: Democratizing Data

Stuff worth sharing for research and learning technologists

Nov 01, 2021

person with assorted-color paint on face — Photo by @hayerlein

Synthetic data startups are taking over this month. A huge problem for many companies, especially banks with all their Chinese walls, is that getting access to data has become a long-winded, bureaucratic permissioning process 🤯.

This isn’t new for social scientists or medical and clinical researchers, who are probably a bit more tolerant of the wait for ethics approvals. What’s more, once access is obtained, datasets have to be anonymised. Anonymisation won’t always work: it can be relatively easy to reidentify people when data is aggregated, linked or brought together in any other way. It’s even harder to anonymise text data.
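To make the linking risk concrete, here is a minimal sketch of a linkage attack. Everything in it is hypothetical: the two toy datasets, the names, the postcodes and the choice of quasi-identifiers are invented for illustration. The "anonymised" table has no names, but joining it to a second table on postcode, birth year and sex is enough to put names back on some rows.

```python
# Hypothetical toy data: all names, postcodes and diagnoses are invented.
anonymised_health = [
    {"postcode": "ZZ1 1AA", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "ZZ1 1AA", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "ZZ2 2BB", "birth_year": 1984, "sex": "F", "diagnosis": "migraine"},
]
public_register = [
    {"name": "Alice Example", "postcode": "ZZ1 1AA", "birth_year": 1984, "sex": "F"},
    {"name": "Bob Example", "postcode": "ZZ1 1AA", "birth_year": 1990, "sex": "M"},
    {"name": "Carol Example", "postcode": "ZZ2 2BB", "birth_year": 1984, "sex": "F"},
    {"name": "Dana Example", "postcode": "ZZ2 2BB", "birth_year": 1984, "sex": "F"},
]

# Assumption: these three fields appear in both datasets.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def key(record):
    """Build the join key from the quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(anonymised, register):
    """Return {name: diagnosis} for rows whose key matches exactly one person."""
    names_by_key = {}
    for person in register:
        names_by_key.setdefault(key(person), []).append(person["name"])

    reidentified = {}
    for row in anonymised:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # a unique match means the row is reidentified
            reidentified[names[0]] = row["diagnosis"]
    return reidentified


reidentified = link(anonymised_health, public_register)
print(reidentified)
# {'Alice Example': 'asthma', 'Bob Example': 'diabetes'}
```

The third health record stays hidden only because two people in the register share its postcode, birth year and sex. The more columns you link on, the fewer of those lucky collisions remain.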

Data anonymisation is traditionally done by, for example, swapping personal information between records or adding noise. But this can damage the data. Hence the recent rise of companies that generate synthetic data: demand is high (even the UK gov acknowledges the need for it) and the technology is already good enough, or even better than traditional ways of anonymisation. 🚩That said, some researchers who carried out a first empirical study comparing the two broad methods would disagree.
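Here is a rough sketch of what swapping and noise do to a dataset, and why they can damage it. The age and salary table is made up, and the helper names (`swap_column`, `add_noise`, `age_salary_correlation`) are mine, not from any particular anonymisation tool. The point is only that both tricks keep each column looking plausible while weakening the relationships between columns.

```python
import random


def swap_column(rows, column, rng):
    """Shuffle one column across rows so values leave their original record."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def add_noise(rows, column, scale, rng):
    """Add zero-mean Gaussian noise to a numeric column."""
    return [{**row, column: row[column] + rng.gauss(0.0, scale)} for row in rows]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def age_salary_correlation(rows):
    return pearson([r["age"] for r in rows], [r["salary"] for r in rows])


# Invented data: salary rises with age, plus some individual variation.
rng = random.Random(42)
original = []
for _ in range(200):
    age = rng.randint(20, 65)
    original.append({"age": age, "salary": 15000 + 600 * age + rng.gauss(0, 2000)})

swapped = swap_column(original, "salary", rng)
noisy = add_noise(original, "salary", 5000, rng)

for label, rows in [("original", original), ("swapped", swapped), ("noisy", noisy)]:
    print(label, round(age_salary_correlation(rows), 2))
```

Swapping keeps exactly the same set of salaries but scatters them across people, so the link between age and salary mostly disappears. Noise keeps the link but blurs it. Either way, an analysis run on the protected data can give a different answer from the same analysis on the real data.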

The term synthetic data generally refers to data that has been artificially generated rather than collected via direct measurement, although sometimes it can be a combination of the two.

At the Research Methods eFestival last week, Paul Calcraft (see project) from the Behavioural Insights Team and Dora Kokosi (see project with Katie Harron) from UCL discussed the definition of synthetic data and how it can be used as an alternative to the actual data, or as a stand-in while waiting for it to be released.

They described different ways to generate synthetic data: you can use process-driven methods like Monte Carlo simulation (very common in finance), or data-driven methods like GANs and other deep learning algorithms. The method you use determines the validity or value of the data for your research. The UK’s ONS drafted a spectrum of analytical value and disclosure risk, with datasets that are likely to replicate the original one (and hence potentially include entire records unchanged) at the higher-value, higher-risk end.
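For a feel of the process-driven end, here is a minimal Monte Carlo sketch. It is not what the speakers presented: the account model, the parameter values and the function names are all assumptions made up for illustration. No real records go in; the data comes entirely from a hand-written model of the process plus random draws.

```python
import random


def simulate_account(months, rng, salary=2500.0, mean_spend=2300.0, spend_sd=400.0):
    """Simulate one account: fixed monthly salary in, random spending out.

    All parameter defaults are invented, not estimated from real data.
    """
    balance = 1000.0  # assumed starting balance
    history = []
    for month in range(1, months + 1):
        spend = max(0.0, rng.gauss(mean_spend, spend_sd))  # spending cannot be negative
        balance += salary - spend
        history.append(
            {"month": month, "spend": round(spend, 2), "balance": round(balance, 2)}
        )
    return history


def generate_dataset(n_accounts, months, seed=0):
    """Generate a fully synthetic dataset; the same seed gives the same data."""
    rng = random.Random(seed)
    return [
        {"account_id": f"SYN-{i:04d}", "history": simulate_account(months, rng)}
        for i in range(n_accounts)
    ]


dataset = generate_dataset(n_accounts=3, months=12, seed=1)
for account in dataset:
    print(account["account_id"], account["history"][-1]["balance"])
```

Because nothing here is derived from real people, disclosure risk is minimal, but the data is only as realistic as the assumptions baked into the model. Data-driven methods such as GANs sit at the other end: they learn from the real dataset, which buys realism at the cost of a higher chance of echoing real records.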

Between September and October 2021, five companies either launched or raised funding, and at least three others did so last year. Will that democratize access to data, or just mean that those who’ve got 💰money will get it?

SmartNoise is a differential-privacy-focused company that creates synthetic data; it was developed by Harvard’s quantitative social science and engineering departments together with Microsoft.
Tonic.ai creates mock data that preserves the characteristics of the original datasets; it raised $35 million in September 2021. This is similar to Hazy, which raised $5.8M with its last round closed in 2020, and mostly.ai, which raised $6.1M in 2020.
Rendered.ai also generates synthetic data, and raised $6 million in seed funding in October 2021.
Private AI is a Canadian company that develops privacy-preserving machine learning and natural language processing tools [aka algorithms that detect and redact private data, working in 7 different languages]; it raised $3.15 million in September 2021. Its investors include Ajay Agrawal, the founder of Creative Destruction Lab and one of the authors of 📚Prediction Machines, and M12, Microsoft's VC.
Gretel, an American startup that anonymises datasets [focuses on developers as the end user], raised $52 million in October 2021, bringing its total to $67.7M. The company was cofounded by former Google executive Laszlo Bock.
Privitar deidentifies data, offering multiple forms of anonymisation; raised $80M in 2020.
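
Several of these tools lean on differential privacy, so here is a rough sketch of the basic idea using the textbook Laplace mechanism on a simple count. This is not SmartNoise's API or any vendor's implementation: the function names and the toy records are hypothetical, and a production system would need a more careful noise sampler than plain floating-point arithmetic.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records matching `predicate`.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Invented toy records.
records = [{"age": 34}, {"age": 51}, {"age": 67}, {"age": 72}, {"age": 29}]
rng = random.Random(0)

# Smaller epsilon means more noise and stronger privacy.
for epsilon in (0.1, 1.0, 10.0):
    noisy = dp_count(records, lambda r: r["age"] >= 50, epsilon, rng)
    print(epsilon, round(noisy, 2))
```

The true answer is 3, and each run returns something near it. How near depends on epsilon, which is the dial between privacy and accuracy.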

📝More big news:

SurveyMonkey (newly known as Momentive), a survey tool used and loved in academic circles, is being acquired by Zendesk for $4.13 billion 💰. This comes as no surprise: most surveying tools, including the very fabulous Qualtrics, are driven by demand from the corporate world for continuous market research and user experience work. Investors didn’t think this was a good deal: Zendesk shares fell more than 10%.
Teachmint, a startup with at least 10 million users in India🇮🇳, has already raised more than $90 million this year. The company kicked off during the pandemic and developed tools that were 🤳mobile and 🎥video first to help teachers create the online classroom experience. One of the biggest challenges in proper online learning (for both students and instructors) is access to the right technology.
Also more on the school (rather than HE) side, Amplify raised $215 million for ‘strategic acquisitions’. Impressive how K12 #edtechs go for these huge amounts.

See you next week!! 😽

Danger! Women at Work
