Data democratization, the process of allowing as many people as possible to have access to data without any bottlenecks or gatekeepers, can happen both within and between organizations. Within an organization, data democratization might mean that the IT department makes data easily and readily accessible to all other departments.
In contrast, when data democratization is applied between organizations, one organization might make its data freely available to serve the interests of a broader community.
Just as democracy empowers the common individual over the elite, data democratization empowers individuals and organizations to access proprietary information and data in order to achieve the same insights previously available only to gatekeepers.
And in a world where the tools of data science, data analysis and machine learning are growing ever more powerful, data is increasingly becoming the key to making organizational decisions and to technological progress in general. Data democratization helps to ensure that the benefits of the future of technology are shared widely across organizations rather than concentrated among a select few, thereby furthering the greater good.
Data democracy and ML applications
Machine learning and data science competitions provide excellent examples of data democratization. For instance, Kaggle, a machine learning competition company owned by Google, offers several different free-to-enter machine learning competitions hosted by various customers, each providing prizes to the winners.
By freely giving high-quality datasets to the public, Kaggle helps to train the next generation of data scientists and machine learning engineers, all while providing a valuable service to its customers. Kaggle’s customers, by democratizing their data to all of Kaggle’s users, gain much greater understanding of and insight into their data than they could have achieved had they chosen to keep it siloed within the company.
For companies with an exclusive or monopolistic ability to make use of a given dataset, there is little cost to democratizing their data, as the barriers to entry into the industry are too high for most to overcome. However, there are also some downsides to data democratization within this context.
For problems like image recognition by neural networks, where large amounts of data and compute power are often required to achieve adequate results, data democratization may be less effective, as only large organizations with gargantuan resources will be able to make adequate use of the data.
Another testament to the power of data democratization is provided by the ImageNet competition, which is widely recognized for having produced the first high-performing example of an end-to-end neural network trained on image data, a historic breakthrough. The winning team, which used a convolutional neural network (CNN), beat by a large margin the other teams, which used classical machine learning methods, and the results of that competition established CNNs as the standard for image recognition tasks.
This result was considered by many to be among the most important advances in machine learning of the past 10 years and has led to a host of actual and potential business applications, from improved internet search results to self-driving cars to facial recognition for law enforcement purposes. By democratizing its data, the ImageNet competition achieved results superior to what could have been achieved by simply handing its data over to large machine learning gatekeeper organizations like Google and Microsoft Research.
As another example, Airbnb, a platform for finding vacation rentals online that was founded in 2008, has made heavy use of data science and machine learning in its technology. As of 2017 its data science team consisted of nearly 100 people. Airbnb is an example of data democratization within an organization.
One of Airbnb’s beliefs was that every employee, not only engineers, should be empowered to involve data in their decisions. They separated their efforts in data democratization into three categories: data education, data access, and data tools. Their data tools included Dataportal, which made much of their internal data easily searchable by nontechnical employees.
Their data education efforts included Data University, consisting of over 30 internal college-style courses.
Roughly 10 percent of Airbnb employees had participated in at least one Data University course as of 2017. Airbnb's example illustrates that data democratization does not only involve making datasets publicly available; it also involves equipping more people with the tools necessary to manipulate data to achieve their goals and make more informed decisions.
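To make the idea of an internal data catalog searchable by nontechnical employees more concrete, here is a minimal Python sketch. It is not Airbnb's Dataportal, whose internals are not described here; the `DatasetEntry` class, the `search_catalog` function, and the sample entries are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """Metadata for one dataset in a hypothetical internal catalog."""
    name: str
    owner: str
    description: str


def search_catalog(catalog, query):
    """Return entries whose name or description contains every query word."""
    words = query.lower().split()
    results = []
    for entry in catalog:
        text = f"{entry.name} {entry.description}".lower()
        if all(word in text for word in words):
            results.append(entry)
    return results


# Illustrative sample data (made up for this sketch).
catalog = [
    DatasetEntry("bookings_daily", "analytics", "Daily booking counts by city"),
    DatasetEntry("host_reviews", "trust", "Guest reviews of hosts"),
    DatasetEntry("search_logs", "search", "Queries typed into site search by city"),
]

for entry in search_catalog(catalog, "city booking"):
    print(entry.name, "owned by", entry.owner)
```

The point of the sketch is that a plain keyword search over dataset names and descriptions already lets someone find data without knowing where it is stored or whom to ask. A real catalog would likely add ranking, access controls, and richer metadata.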
Data democratization is not everywhere
It’s important to note that data democratization is not suitable in all cases or in all types of problems. For example, government databases can involve classified information in a way that makes it illegal to share them publicly. Similarly, companies often have proprietary datasets that they cannot release publicly for fear that doing so will drive competition.
While it may often be in the interests of the greater good to share such data publicly and allow the company with the best model to win out, companies have a financial incentive to maximize profits and may not be interested in serving the greater good. Furthermore, securing such proprietary datasets can be so expensive that the company or organization feels it must recoup the effort by keeping its dataset secret.
High-quality self-driving car datasets, for instance, can be expensive to obtain, as they often need to be acquired from a real car driving on a real road. The self-driving car company Cruise, for example, has gone so far as to manually drive Cruise-owned cars around the streets of San Francisco to collect a self-driving dataset. Given the expense of obtaining such a dataset and the fierce competition that characterizes the self-driving industry, it is understandable why Cruise is, at least for now, keeping its data private.
Data democracy and data governance
In data analysis, analytics and machine learning, data governance is the management of data, including determining and ensuring its level of availability, its usability for its intended purpose, and its integrity or freedom from error.
Data democratization can play a critical role in data governance. By democratizing data and getting it in front of as many eyeballs as possible, one can catch errors in the data and gain tips for improving its usability. Just as open-source code is often more reliable than, and preferable to, closed-source code, the open data yielded by data democratization will, in general, be of higher quality than closed data held by gatekeepers.
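As a rough illustration of the integrity side of data governance, the following Python sketch shows the kind of simple check that anyone with access to a dataset could run to catch errors. The `audit_records` function, the field names, and the sample records are hypothetical and made up for this example. It covers only missing values and exact duplicate rows, a small slice of what governance involves.

```python
def audit_records(records, required_fields):
    """Report simple integrity problems in a list of dict records."""
    missing = []      # (row index, field) pairs where a required value is absent
    duplicates = []   # indices of rows that exactly repeat an earlier row
    seen = set()
    for i, row in enumerate(records):
        for field in required_fields:
            if row.get(field) in (None, ""):
                missing.append((i, field))
        # Sort items so two rows with the same content produce the same key.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates.append(i)
        seen.add(key)
    return {"missing": missing, "duplicates": duplicates}


# Illustrative sample data (made up for this sketch).
records = [
    {"id": 1, "city": "Paris", "price": 120},
    {"id": 2, "city": "", "price": 95},
    {"id": 1, "city": "Paris", "price": 120},
    {"id": 3, "city": "Lima", "price": None},
]

report = audit_records(records, ["id", "city", "price"])
print(report)
```

On the sample data, the report flags the empty `city` in the second row, the missing `price` in the fourth row, and the third row as a duplicate of the first. The more people who can run checks like this, the more likely such problems are to be noticed and fixed.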