Hypothetical case study

To illustrate the value of machine learning, consider the following fictional case. Client Foods Inc. (CFI) manufactures products sold in grocery stores. Plaintiffs allege that CFI marketed its products containing numerous unhealthy ingredients in a way that gave reasonable consumers the impression that the products were healthy. CFI must show that its marketing materials did not misrepresent the nutritional value of its products to consumers.

A simple approach

Prior to the mainstream adoption of machine learning approaches and the wide availability of third-party marketing materials through social media, an analysis of marketing content might have been limited to a keyword search. One keyword analysis might compare the number of materials that include references to terms like “healthy,” “low-fat,” or “nourish” to the number of materials without such terms.
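A minimal sketch of such a keyword comparison follows. The lexicon holds only the three terms named above, the sample snippets are made up for illustration, and the function names are hypothetical.

```python
# Lexicon limited to the three terms named in the text; a real list would be longer.
HEALTH_TERMS = ("healthy", "low-fat", "nourish")

def mentions_health(text, terms=HEALTH_TERMS):
    """Return True if the text contains any lexicon term (case-insensitive).

    Plain substring matching is used, so "nourish" also matches "nourishing",
    but "healthy" would also match "unhealthy".
    """
    lowered = text.lower()
    return any(term in lowered for term in terms)

def keyword_share(materials, terms=HEALTH_TERMS):
    """Count materials with and without a health term, plus the share with one."""
    with_term = sum(mentions_health(m, terms) for m in materials)
    without_term = len(materials) - with_term
    share = with_term / len(materials) if materials else 0.0
    return with_term, without_term, share

# Made-up marketing snippets, for illustration only.
sample = [
    "A healthy start to your morning",
    "Now with 20% more crunch",
    "Nourishing whole grains in every bite",
    "Try our new low-fat recipe",
]
print(keyword_share(sample))  # (3, 1, 0.75)
```

The substring caveat in the docstring is one small instance of how lexicon design choices shape the result.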

While this approach would successfully provide a quantitative measure of the degree to which CFI emphasized the healthfulness of its products in its marketing materials, it suffers from several shortcomings. First, compiling a complete list of health-related terms may be impossible. Second, constructing this lexicon may reflect the researcher’s subjectivity in determining which terms truly pertain to the type of health concerns that resonate with consumers. Third, regardless of the completeness and objectivity of the lexicon, a keyword search necessarily limits the analysis to text, ignoring other marketing media (e.g., images).

A machine learning approach

In these situations, machine learning can provide an objective, compelling alternative. When a machine learning model digests entire collections of marketing materials, the model itself determines the extent to which each feature conveys a sense of healthfulness. This outsourcing of the assessment also limits the extent to which researchers may intentionally or unintentionally impart their own biases. Moreover, because models are now widely available to analyze images, audio, and video, the analysis may consider the full complement of marketing materials, including multimedia social media posts, online banner ads, television and radio spots, and traditional print advertising.

Supervised machine learning methods—those that generate predictions based on the labels or outcomes associated with input data—require a training dataset. Training datasets link inputs (in this case, marketing materials) to outputs (in this case, the healthfulness of the corresponding marketing material).

Unfortunately, a high-quality public training dataset is rarely available for highly specific applications. In these situations, researchers can compile a training dataset by collecting marketing materials and examining each item for the output of interest. While effective, this manual compilation is time-consuming, expensive, and subjective.

Leveraging exemplars

When a robust training dataset is unavailable, and manual compilation is costly, infeasible, or inappropriate, exemplar-based training dataset curation provides an efficient alternative. In this approach, the researcher identifies a series of exemplars (other entities that represent the characteristics of concern).

For the analysis of CFI, this process might involve identifying food brands that offer healthy products and brands that offer non-healthy products. Exemplars of healthy food might include brands that market whole grain products with few additives and preservatives. Exemplars of non-healthy food might include brands that market sugary snacks or products with a high proportion of processed ingredients. Selecting the exemplar brands from a list published by an authoritative source, like a government agency or academic study, lends an extra degree of objectivity to the process.

Once the exemplars are selected, the data collection process begins. To analyze CFI’s social media posts, for example, the company’s social media posts are collected along with those of the exemplars. In the exemplar data, each post is supplemented with a label indicating whether it was authored by a healthy exemplar or a non-healthy exemplar.
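As a rough sketch of this labeling step, the snippet below tags each exemplar post by its author and sets aside the unlabeled posts of the company under analysis. The brand names, posts, and the dictionary layout are hypothetical, chosen only for illustration.

```python
# Hypothetical exemplar lists; in practice these would come from an
# authoritative external source.
HEALTHY_EXEMPLARS = {"GrainCo"}
NON_HEALTHY_EXEMPLARS = {"SugarSnax"}

def label_posts(posts):
    """Split posts into labeled exemplar posts and unlabeled target posts.

    Exemplar posts receive label 1 (healthy exemplar) or 0 (non-healthy
    exemplar). Posts from any other brand (here, CFI) are left unlabeled.
    """
    labeled, unlabeled = [], []
    for post in posts:
        if post["brand"] in HEALTHY_EXEMPLARS:
            labeled.append({**post, "label": 1})
        elif post["brand"] in NON_HEALTHY_EXEMPLARS:
            labeled.append({**post, "label": 0})
        else:
            unlabeled.append(post)
    return labeled, unlabeled

# Made-up posts, for illustration only.
posts = [
    {"brand": "GrainCo", "text": "Whole grain goodness in every bowl"},
    {"brand": "SugarSnax", "text": "Double the frosting, double the fun"},
    {"brand": "CFI", "text": "Snack time just got better"},
]
labeled, unlabeled = label_posts(posts)
```

Note that the label describes the authoring brand, not a human reading of the individual post, which is what removes post-by-post manual review.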

Model training

With the data collected, exemplar brands are randomly assigned to either the training dataset or the validation dataset. Social media posts from brands in the training dataset will be used to fit (or, colloquially, “teach”) the machine learning model. Those in the validation dataset will be omitted from training and instead be used to demonstrate model performance.
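The brand-level assignment can be sketched as follows. Splitting by brand, rather than by post, keeps every post from a validation brand out of training. Drawing the validation brands separately within each label group is an added assumption (so that both groups appear in both datasets), and the brand names, the 25% fraction, and the seed are hypothetical.

```python
import random

def split_brands(brand_labels, validation_fraction=0.25, seed=0):
    """Randomly assign whole brands to the training or validation dataset.

    brand_labels maps brand name -> 1 (healthy) or 0 (non-healthy).
    Brands are shuffled within each label group so that both groups are
    represented in both datasets (an assumption of this sketch).
    """
    rng = random.Random(seed)
    train, validation = set(), set()
    for label in sorted(set(brand_labels.values())):
        brands = sorted(b for b, l in brand_labels.items() if l == label)
        if len(brands) < 2:
            raise ValueError("need at least two brands per label group")
        rng.shuffle(brands)
        # At least one validation brand, but always leave one for training.
        n_val = max(1, round(len(brands) * validation_fraction))
        n_val = min(n_val, len(brands) - 1)
        validation.update(brands[:n_val])
        train.update(brands[n_val:])
    return train, validation

# Hypothetical exemplar brands, for illustration only.
brand_labels = {
    "HealthyA": 1, "HealthyB": 1, "HealthyC": 1, "HealthyD": 1,
    "SnackA": 0, "SnackB": 0, "SnackC": 0, "SnackD": 0,
}
train_brands, validation_brands = split_brands(brand_labels)
```

Posts are then routed to a dataset according to the brand that authored them.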

The feature representation and model architecture can take any form appropriate for supervised learning on the data at hand, meaning that the technical implementation can be tailored to the dataset size, researcher familiarity, and available analytical infrastructure. For example, content text may be converted to numeric data as a binary bag of words or by transformer-generated embeddings, and the classification model may take the form of a simple logistic regression or a deep neural network. To further reduce researcher influence, the choice of feature representation, model architecture, and hyperparameter specification may be made through cross-validation.
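To make the simplest of these options concrete, the sketch below pairs a binary bag of words with a logistic regression fit by plain stochastic gradient descent, using only the Python standard library. It is an illustration under stated assumptions, not a recommended implementation: the training texts are made up, the tokenizer, learning rate, and epoch count are arbitrary, and no regularization or cross-validation is included.

```python
import math
import re

def tokenize(text):
    """Return the set of lowercase word tokens (binary: presence only)."""
    return set(re.findall(r"[a-z']+", text.lower()))

def build_vocabulary(texts):
    """Map each distinct training word to a column index."""
    words = sorted(set().union(*(tokenize(t) for t in texts)))
    return {word: i for i, word in enumerate(words)}

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def active_columns(text, vocab):
    """Indices of vocabulary words present in the text."""
    return [vocab[w] for w in tokenize(text) if w in vocab]

def fit_logistic(texts, labels, vocab, epochs=200, lr=0.5):
    """Fit weights and bias with per-post gradient steps on the log loss."""
    weights, bias = [0.0] * len(vocab), 0.0
    rows = [active_columns(t, vocab) for t in texts]
    for _ in range(epochs):
        for active, y in zip(rows, labels):
            error = sigmoid(bias + sum(weights[i] for i in active)) - y
            bias -= lr * error
            for i in active:
                weights[i] -= lr * error
    return weights, bias

def score(text, vocab, weights, bias):
    """Predicted probability that the text comes from a healthy exemplar."""
    return sigmoid(bias + sum(weights[i] for i in active_columns(text, vocab)))

# Made-up training posts: label 1 = healthy exemplar, 0 = non-healthy.
train_texts = [
    "whole grain oats with no added sugar",
    "fresh vegetables and whole grain bread",
    "sugary frosted candy cereal",
    "chocolate candy bars loaded with sugar",
]
train_labels = [1, 1, 0, 0]

vocab = build_vocabulary(train_texts)
weights, bias = fit_logistic(train_texts, train_labels, vocab)
print(score("whole grain bread", vocab, weights, bias))
print(score("frosted candy", vocab, weights, bias))
```

Words unseen in training are ignored at scoring time, which is one reason a richer representation such as embeddings may be preferred on real data.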

The final model is fit to the training dataset to predict whether each social media post stems from a healthy or a non-healthy brand. These predictions will generally fall along a continuum from 0 to 1, which the researcher can interpret as a measure of healthfulness. The model is then applied to the validation dataset to confirm that the healthy validation brands generally receive higher healthfulness scores than the non-healthy validation brands. Finally, the model is applied to CFI’s marketing materials to generate a healthfulness score for each post.
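One way to quantify whether healthy validation posts "generally receive higher" scores is the share of healthy and non-healthy post pairs in which the healthy post scores higher (equivalent to the area under the ROC curve). The text does not prescribe a specific metric, so this choice is an assumption, and the scores below are made up.

```python
def pairwise_auc(healthy_scores, non_healthy_scores):
    """Share of (healthy, non-healthy) pairs ranked correctly.

    A value of 1.0 means every healthy post outscores every non-healthy
    post; 0.5 is what random scores would produce. Ties count as half.
    """
    if not healthy_scores or not non_healthy_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for h in healthy_scores:
        for n in non_healthy_scores:
            if h > n:
                wins += 1.0
            elif h == n:
                wins += 0.5
    return wins / (len(healthy_scores) * len(non_healthy_scores))

# Made-up validation scores, for illustration only.
healthy_validation = [0.9, 0.7, 0.4]
non_healthy_validation = [0.1, 0.3, 0.5]
auc = pairwise_auc(healthy_validation, non_healthy_validation)
print(round(auc, 3))  # 0.889 (8 of 9 pairs ranked correctly)
```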

Results

The CFI results can be presented alongside those of the validation brands to serve two purposes. First, comparing the validation brands with one another demonstrates that the model correctly assigns higher healthfulness scores to healthy validation exemplars than to non-healthy ones. Second, in this hypothetical, comparing CFI with the validation brands demonstrates that CFI’s marketing content is more consistent with that of the non-healthy exemplars than the healthy exemplars, thereby refuting plaintiffs’ allegations that CFI marketed its products in a manner consistent with a health food brand.

Because even the most dissimilar brands occasionally engage in similar marketing activities (e.g., promotions, seasonal themes), some marketing items from healthy exemplars will receive lower healthfulness scores and vice versa. For simplicity, this variation can be aggregated into a single metric for each brand. Here, the median healthfulness score is displayed for CFI and each validation brand.
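The brand-level aggregation can be sketched with the standard library as follows. The brand names and scores are made up for illustration and do not represent results.

```python
from collections import defaultdict
from statistics import median

def median_score_by_brand(scored_posts):
    """Collapse post-level healthfulness scores to one median per brand.

    scored_posts is an iterable of (brand, score) pairs.
    """
    scores_by_brand = defaultdict(list)
    for brand, post_score in scored_posts:
        scores_by_brand[brand].append(post_score)
    return {brand: median(scores) for brand, scores in scores_by_brand.items()}

# Made-up post-level scores, for illustration only.
scored_posts = [
    ("GrainCo", 0.9), ("GrainCo", 0.7), ("GrainCo", 0.2),
    ("SugarSnax", 0.1), ("SugarSnax", 0.3), ("SugarSnax", 0.2),
    ("CFI", 0.4), ("CFI", 0.2), ("CFI", 0.3),
]
print(median_score_by_brand(scored_posts))
# {'GrainCo': 0.7, 'SugarSnax': 0.2, 'CFI': 0.3}
```

The median is less sensitive than the mean to a handful of atypical posts, such as the low-scoring GrainCo post in this made-up sample.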

Conclusion

As shown, an exemplar-based machine learning approach requires little human input to generate an effective measure of a characteristic of interest. The most significant decision involves the selection of the exemplars, while extensive cross-validation procedures can point to the most accurate feature representation and model selection. This approach limits subjectivity and facilitates the efficient analysis of voluminous and complex data, making it a compelling option in content analyses.