Quantification (machine learning)

In machine learning and data mining, quantification (variously called learning to quantify, supervised prevalence estimation, or class prior estimation) is the task of using supervised learning to train models (quantifiers) that estimate the relative frequencies (also known as prevalence values) of the classes of interest in a sample of unlabelled data items.[1][2] For instance, in a sample of 100,000 unlabelled tweets known to express opinions about a certain political candidate, a quantifier may be used to estimate the percentage of these 100,000 tweets that belong to class 'Positive' (i.e., that manifest a positive stance towards this candidate), and to do the same for classes 'Neutral' and 'Negative'.[3]

Quantification may also be viewed as the task of training predictors that estimate a (discrete) probability distribution, i.e., that generate a predicted distribution that approximates the unknown true distribution of the items across the classes of interest. Quantification is different from classification, since the goal of classification is to predict the class labels of individual data items, while the goal of quantification is to predict the class prevalence values of sets of data items. Quantification is also different from regression, since in regression the training data items have real-valued labels, while in quantification the training data items have class labels.

It has been shown in multiple research works[4][5][6][7][8] that performing quantification by classifying all unlabelled instances and then counting the instances that have been attributed to each class (the 'classify and count' method) usually leads to suboptimal quantification accuracy. This suboptimality may be seen as a direct consequence of 'Vapnik's principle', which states:

If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem.[9]

In our case, the problem to be solved directly is quantification, while the more general intermediate problem is classification. As a result of the suboptimality of the 'classify and count' method, quantification has evolved as a task in its own right, different (in goals, methods, techniques, and evaluation measures) from classification.
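
The 'classify and count' method discussed above can be illustrated with a minimal Python sketch. The function names, the hand-written list of predicted labels, and the classifier error rates used here are hypothetical and chosen only for illustration; they are not taken from the cited works. The second function covers only the binary case and computes the fraction of items that an imperfect classifier is expected to label as positive, which is the value that 'classify and count' would report as the prevalence estimate.

```python
from collections import Counter


def classify_and_count(predicted_labels, classes):
    """Estimate class prevalences as the fraction of items assigned to each class.

    `predicted_labels` is assumed to be the output of some already-trained
    classifier applied to every item of the unlabelled sample.
    """
    n = len(predicted_labels)
    if n == 0:
        raise ValueError("cannot estimate prevalence on an empty sample")
    counts = Counter(predicted_labels)
    return {c: counts.get(c, 0) / n for c in classes}


def expected_positive_fraction(true_prevalence, tpr, fpr):
    """Expected fraction of items labelled positive by a binary classifier.

    True positives contribute true_prevalence * tpr, and false positives
    contribute (1 - true_prevalence) * fpr.
    """
    return true_prevalence * tpr + (1.0 - true_prevalence) * fpr


# Hypothetical classifier output for a sample of ten tweets.
predictions = ["Positive"] * 5 + ["Neutral"] * 3 + ["Negative"] * 2
estimate = classify_and_count(predictions, ["Positive", "Neutral", "Negative"])

# Hypothetical binary scenario: 10% true prevalence, TPR 0.9, FPR 0.2.
biased = expected_positive_fraction(0.1, tpr=0.9, fpr=0.2)
```

In the hypothetical binary scenario, the expected 'classify and count' estimate is 0.1 × 0.9 + 0.9 × 0.2 = 0.27, although the true prevalence is 0.1. This shows how a classifier with reasonable per-item accuracy can still yield a poor prevalence estimate when the class of interest is rare.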

  1. Pablo González; Alberto Castaño; Nitesh Chawla; Juan José del Coz (2017). "A review on quantification learning". ACM Computing Surveys. 50 (5): 74:1–74:40. doi:10.1145/3117807. hdl:10651/45313. S2CID 38185871.
  2. Andrea Esuli; Alessandro Fabris; Alejandro Moreo; Fabrizio Sebastiani (2023). Learning to Quantify. The Information Retrieval Series. Vol. 47. Cham, CH: Springer Nature. doi:10.1007/978-3-031-20467-8. ISBN 978-3-031-20466-1. S2CID 257560090.
  3. Hopkins, Daniel J.; King, Gary (2010). "A Method of Automated Nonparametric Content Analysis for Social Science". American Journal of Political Science. 54 (1): 229–247. doi:10.1111/j.1540-5907.2009.00428.x. ISSN 0092-5853. JSTOR 20647981. S2CID 1177676.
  4. George Forman (2008). "Quantifying counts and costs via classification". Data Mining and Knowledge Discovery. 17 (2): 164–206. doi:10.1007/s10618-008-0097-y. S2CID 1435935.
  5. Antonio Bella; Cèsar Ferri; José Hernández-Orallo; María José Ramírez-Quintana (2010). "Quantification via Probability Estimators". 2010 IEEE International Conference on Data Mining. pp. 737–742. doi:10.1109/icdm.2010.75. ISBN 978-1-4244-9131-5. S2CID 9670485.
  6. José Barranquero; Jorge Díez; Juan José del Coz (2015). "Quantification-oriented learning based on reliable classifiers". Pattern Recognition. 48 (2): 591–604. Bibcode:2015PatRe..48..591B. doi:10.1016/j.patcog.2014.07.032. hdl:10651/30611.
  7. Andrea Esuli; Fabrizio Sebastiani (2015). "Optimizing text quantifiers for multivariate loss functions". ACM Transactions on Knowledge Discovery from Data. 9 (4): Article 27. arXiv:1502.05491. doi:10.1145/2700406. S2CID 16824608.
  8. Wei Gao; Fabrizio Sebastiani (2016). "From classification to quantification in tweet sentiment analysis". Social Network Analysis and Mining. 6 (19): 1–22. doi:10.1007/s13278-016-0327-z. S2CID 15631612.
  9. Vladimir Vapnik (1998). Statistical learning theory. New York, US: Wiley.