Machine Learning and Data Science (ML) is starting to take the place in industry that "Information Technology" had in the late 1990s: businesses of all sizes and in all sectors are recognizing how necessary it has become to develop predictive capabilities to keep their core competencies profitable. To be effective, ML algorithms rely on high-quality training data – and not just any data, but data that is specific to the business problem to which ML is applied. Obtaining relevant training data can be very difficult for firms to do themselves, especially those early in their path towards incorporating ML into their operations. The problem is exacerbated as businesses increasingly need to solve these prediction problems in real time (e.g. a ride-share company setting prices, or retailers and restaurants sending targeted coupons to clear inventory), which means that data gets “stale” quickly. It is therefore imperative that real-time market structures exist for buying and selling training data for ML. Further, it is insufficient to view ML performance metrics (e.g. RMSE) in isolation from real-world applications; for example, a 10% increase in prediction accuracy means very different things for a hedge fund maximizing profits, a retailer decreasing inventory costs, and a hospital trying to save lives. Hence the value of a dataset must account for more than the prediction accuracy it provides. Domain knowledge will be just as essential, if not more so, if we aim to view data as an asset and create a rigorous method to define its value.
In this work, we aim to create a data marketplace – a robust matching mechanism to efficiently buy and sell data while optimizing social welfare and maximizing revenue. While the monetization of data and pre-trained models is a major focus for many industries and vendors today, there does not exist a market mechanism that can price data and match buyers to vendors while still addressing the (computational and other) complexity associated with creating a market platform. The challenge in creating such a marketplace stems from the very nature of data as an asset: (i) it can be replicated at zero marginal cost; (ii) its value to a firm is inherently combinatorial (i.e. the value of a particular dataset depends on which other, potentially correlated, datasets are available); (iii) its value to a firm depends on which other firms get access to the same data; (iv) prediction tasks, and the value of an increase in prediction accuracy, vary widely between firms, so it is not obvious how to set prices for a collection of datasets with correlated signals; and (v) the authenticity and truthfulness of data are difficult to verify a priori without first applying the data to a prediction task. Our proposed marketplace takes a holistic view of this problem and provides an algorithmic solution combining concepts from statistical machine learning, the economics of data with respect to various application domains, algorithmic market design, and mathematical optimization under uncertainty. We will discuss some examples motivating this work.
This is joint work with Anish Agarwal, Tuhin Sarkar, and Devavrat Shah.