A Machine Learning–based Analysis of Alleged Marketing Misrepresentations

Litigation involving alleged marketing misrepresentations requires an examination of the at-issue content. The most robust analyses of this content are systematic and objective. One state-of-the-art approach that satisfies these criteria combines machine learning models with exemplars: reference entities that typify the characteristic at issue.

Hypothetical case study

To illustrate the value of machine learning, consider the following fictional case. Client Foods Inc. (CFI) manufactures products sold in grocery stores. Plaintiffs allege that CFI marketed its products containing numerous unhealthy ingredients in a way that gave reasonable consumers the impression that the products were healthy. CFI must show that its marketing materials did not misrepresent the nutritional value of its products to consumers.

A simple approach

Prior to the mainstream adoption of machine learning approaches and the wide availability of third-party marketing materials through social media, an analysis of marketing content might have been limited to a keyword search. One keyword analysis might compare the number of materials that include terms like “healthy,” “low-fat,” or “nourish” to the number of materials without such terms.

While this approach would successfully provide a quantitative measure of the degree to which CFI emphasized the healthfulness of its products in its marketing materials, it suffers from several shortcomings. First, compiling a complete list of health-related terms may be impossible. Furthermore, constructing this lexicon may reflect the researcher’s subjectivity in determining which terms truly pertain to the type of health concerns that resonate with consumers. And regardless of the completeness and objectivity of the lexicon, a keyword search necessarily limits the analysis to text, ignoring other marketing media (e.g., images).
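The keyword comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended method: the lexicon holds only the three example terms from the text, the sample materials are invented, and the function name `keyword_share` is hypothetical.

```python
import re

# Illustrative lexicon: the three example terms from the text above.
HEALTH_TERMS = ["healthy", "low-fat", "nourish"]

# Match a term at the start of a word, case-insensitively, so that
# "nourishing" counts but "unhealthy" does not.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in HEALTH_TERMS) + r")",
    re.IGNORECASE,
)

def keyword_share(materials):
    """Return (materials with a term, materials without, share with a term)."""
    with_term = sum(1 for text in materials if _PATTERN.search(text))
    without_term = len(materials) - with_term
    share = with_term / len(materials) if materials else 0.0
    return with_term, without_term, share

# Invented sample materials, for illustration only.
sample = [
    "A healthy start to your morning",
    "Now 20% off all snack packs",
    "Nourishing whole grains in every bite",
    "Try our new low-fat recipe",
]
print(keyword_share(sample))  # (3, 1, 0.75)
```

The sketch also makes the shortcomings concrete: every term must be listed by hand, and nothing in it can read an image.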

A machine learning approach

In these situations, machine learning can provide an objective, compelling alternative. By allowing a machine learning model to digest entire collections of marketing materials, the model determines the extent to which each feature conveys a sense of healthfulness. This outsourcing of the assessment also limits the extent to which researchers may intentionally or unintentionally impart their own biases. Moreover, because models are now widely available to analyze images, audio, and video, the analysis may consider the full complement of marketing materials, including multimedia social media posts, online banner ads, television and radio spots, and traditional print advertising.

Supervised machine learning methods—those that generate predictions based on the labels or outcomes associated with input data—require a training dataset. Training datasets link inputs (in this case, marketing materials) to outputs (in this case, the healthfulness of the corresponding marketing material).

Unfortunately, a high-quality public training dataset is rarely available for highly specific applications. In these situations, researchers can compile a training dataset by collecting marketing materials and examining each item for the output of interest. While effective, this manual compilation is time-consuming, expensive, and subjective.

Leveraging exemplars

When a robust training dataset is unavailable, and manual compilation is costly, infeasible, or inappropriate, exemplar-based training dataset curation provides an efficient alternative. In this approach, the researcher identifies a series of exemplars (other entities that represent the characteristics of concern).

For the analysis of CFI, this process might involve identifying food brands that offer healthy products and brands that offer unhealthy products. Exemplars of healthy food might include brands that market whole grain products with limited chemicals and preservatives. Exemplars of non-healthy food might include brands that market sugary snacks or products with a high proportion of processed ingredients. Selecting the exemplar brands from a list published by an authoritative source, like a government agency or an academic study, lends an extra degree of objectivity to the process.

Once the exemplars are selected, the data collection process begins. To analyze CFI’s social media activity, for example, the company’s posts are collected along with those of the exemplars. In the exemplar data, each post is supplemented with a label indicating whether it was authored by a healthy exemplar or a non-healthy exemplar.
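As a rough sketch of this labeling step, the snippet below attaches a brand-level label to each collected post. The brand names, the `Post` structure, and the 1/0 coding are all hypothetical choices made for illustration. The key point is that the label comes from the brand, not from anyone reading the post.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical exemplar brands: 1 = healthy exemplar, 0 = non-healthy exemplar.
EXEMPLAR_LABELS = {
    "WholeGrainCo": 1,
    "GreenHarvest": 1,
    "SugarRush": 0,
    "CrunchyTreats": 0,
}

@dataclass
class Post:
    brand: str
    text: str
    label: Optional[int]  # None for brands that are not exemplars (e.g., CFI)

def label_posts(raw_posts):
    """Attach the brand-level exemplar label to each (brand, text) pair."""
    return [
        Post(brand, text, EXEMPLAR_LABELS.get(brand))
        for brand, text in raw_posts
    ]

posts = label_posts([
    ("WholeGrainCo", "Stone-ground oats, nothing else"),
    ("SugarRush", "Double the frosting, double the fun"),
    ("CFI", "New look, same great taste"),
])
```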

Model training

With the data collected, exemplar brands are randomly assigned to either the training or the validation datasets. Social media posts from brands in the training dataset will be used to fit (or, colloquially, “teach”) the machine learning model. Those in the validation dataset will be omitted from training and instead be used to demonstrate model performance.
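A brand-level split might look like the following sketch. Whole brands, rather than individual posts, are assigned to one side or the other, so no validation brand contributes to training. Shuffling within each label group, so that both groups appear in the validation set, is an added assumption rather than something the text specifies. The brand names and the function name `split_brands` are hypothetical.

```python
import random

def split_brands(brand_labels, validation_fraction=0.5, seed=0):
    """Randomly assign whole brands to training or validation.

    Brands are shuffled separately within each label group (an assumption
    of this sketch) so both groups are represented in validation. A group
    with a single brand places that brand in validation.
    """
    rng = random.Random(seed)
    train, validation = set(), set()
    for label in sorted(set(brand_labels.values())):
        brands = sorted(b for b, l in brand_labels.items() if l == label)
        rng.shuffle(brands)
        n_validation = max(1, round(len(brands) * validation_fraction))
        validation.update(brands[:n_validation])
        train.update(brands[n_validation:])
    return train, validation

# Hypothetical exemplar brands: 1 = healthy, 0 = non-healthy.
brand_labels = {
    "WholeGrainCo": 1, "GreenHarvest": 1, "PureFields": 1, "OatWorks": 1,
    "SugarRush": 0, "CrunchyTreats": 0, "FizzPop": 0, "CandyLane": 0,
}
train_brands, validation_brands = split_brands(brand_labels)
```

Fixing the seed keeps the assignment reproducible, which matters when the analysis may need to be replicated by an opposing expert.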

The feature representation and model architecture can take any form suited to supervised learning on the data at hand, so the technical implementation can be tailored to the dataset size, researcher familiarity, and available analytical infrastructure. For example, content text may be converted to numeric data as a binary bag of words or as transformer-generated embeddings, and the classification model may be a simple logistic regression or a deep neural network. To further reduce researcher influence, the choice of feature representation, model architecture, and hyperparameters may be made through cross-validation.
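To make the simplest of these options concrete, here is a minimal sketch of a binary bag-of-words representation using only the Python standard library. The tokenizer, the `min_count` parameter, and the function names are illustrative assumptions. In practice, the vocabulary would be built from the training posts only.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(texts, min_count=1):
    """Map each token that appears in at least min_count texts to a column index."""
    counts = {}
    for text in texts:
        for token in set(tokenize(text)):
            counts[token] = counts.get(token, 0) + 1
    kept = sorted(token for token, count in counts.items() if count >= min_count)
    return {token: index for index, token in enumerate(kept)}

def binary_bag_of_words(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary tokens occur in the text."""
    vector = [0] * len(vocabulary)
    for token in set(tokenize(text)):
        index = vocabulary.get(token)
        if index is not None:
            vector[index] = 1
    return vector

vocabulary = build_vocabulary(["whole grain oats", "sugary oats"])
features = binary_bag_of_words("Oats oats WHOLE candy", vocabulary)
```

Tokens that are absent from the vocabulary, such as "candy" here, are ignored, and repeated tokens still produce a 1 rather than a count.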

The final model is fit to the training dataset to predict whether each social media post stems from a healthy or a non-healthy brand. These predictions will generally fall along a continuum from 0 to 1, which the researcher can interpret as a measure of healthfulness. The model is then applied to the validation dataset to confirm that the healthy validation brands generally receive higher healthfulness scores than the non-healthy validation brands. Finally, the model is applied to CFI’s marketing materials to generate a healthfulness score for each post.
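The fit-then-score sequence can be illustrated with a small logistic regression written from scratch. This is a sketch under stated assumptions, not the method any particular analysis would use: the toy feature vectors, the learning rate, the number of epochs, and the L2 penalty are all invented. A real analysis would more likely rely on an established library.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function, returning a value in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, learning_rate=0.5, epochs=500, l2=0.01):
    """Fit weights and bias by batch gradient descent on the L2-penalized log loss."""
    n_rows, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, target in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = sigmoid(z) - target
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= learning_rate * grad_b / n_rows
        weights = [
            w - learning_rate * (g / n_rows + l2 * w)
            for w, g in zip(weights, grad_w)
        ]
    return weights, bias

def healthfulness_score(row, weights, bias):
    """Predicted probability that a post stems from a healthy exemplar."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))

# Toy binary features, columns: [whole, grain, sugary, candy].
X_train = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
y_train = [1, 1, 0, 0]  # 1 = healthy exemplar, 0 = non-healthy exemplar
weights, bias = fit_logistic(X_train, y_train)

# Held-out posts: one from a healthy brand, one from a non-healthy brand.
score_healthy = healthfulness_score([0, 1, 0, 0], weights, bias)
score_non_healthy = healthfulness_score([0, 0, 0, 1], weights, bias)
```

The same `healthfulness_score` call would then be applied to each of CFI's posts after converting them with the same feature representation.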

Results

The CFI results can be presented with those of the validation brands to serve two purposes. First, examining the validation brands relative to one another demonstrates that the model correctly assigns higher healthfulness to healthy validation exemplars than to non-healthy ones. Second, in this hypothetical, the comparison demonstrates that CFI’s marketing content is more consistent with that of the non-healthy exemplars than the healthy exemplars, thereby refuting plaintiffs’ allegations that CFI marketed its products in a manner consistent with a health food brand.
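One way to quantify the first purpose is to estimate how often a randomly chosen post from a healthy validation brand outscores a randomly chosen post from a non-healthy one. This pairwise measure, equivalent to the area under the ROC curve, is a choice made here for illustration and is not specified in the text. The scores below are invented.

```python
def pairwise_auc(healthy_scores, non_healthy_scores):
    """Share of (healthy, non-healthy) pairs where the healthy post scores higher.

    Ties count as half a win. A value of 0.5 indicates no separation,
    and 1.0 indicates perfect separation.
    """
    wins = 0.0
    for healthy in healthy_scores:
        for non_healthy in non_healthy_scores:
            if healthy > non_healthy:
                wins += 1.0
            elif healthy == non_healthy:
                wins += 0.5
    return wins / (len(healthy_scores) * len(non_healthy_scores))

# Invented validation scores, for illustration only.
auc = pairwise_auc([0.9, 0.6, 0.4], [0.5, 0.2])
```

Here five of the six pairs favor the healthy post, so the measure is 5/6.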

Because even the most dissimilar brands occasionally engage in similar marketing activities (e.g., promotions, seasonal themes), some marketing items from healthy exemplars will receive lower healthfulness scores and vice versa. For simplicity, this variation can be aggregated into a single metric for each brand. Here, the median healthfulness score is displayed for CFI and each validation brand.
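The brand-level aggregation can be sketched as follows, assuming each scored post is available as a (brand, score) pair. The brand names and scores are invented for illustration.

```python
from collections import defaultdict
from statistics import median

def median_score_by_brand(scored_posts):
    """Collapse post-level healthfulness scores into one median per brand."""
    grouped = defaultdict(list)
    for brand, score in scored_posts:
        grouped[brand].append(score)
    return {brand: median(scores) for brand, scores in grouped.items()}

# Invented post-level scores for a healthy brand, a non-healthy brand, and CFI.
scored_posts = [
    ("GreenHarvest", 0.9), ("GreenHarvest", 0.2), ("GreenHarvest", 0.8),
    ("SugarRush", 0.25), ("SugarRush", 0.75), ("SugarRush", 0.1), ("SugarRush", 0.3),
    ("CFI", 0.25), ("CFI", 0.75),
]
medians = median_score_by_brand(scored_posts)
```

The median is less sensitive than the mean to the occasional off-pattern post, such as the 0.2 score for the healthy brand above.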

Conclusion

As shown, an exemplar-based machine learning approach requires little human input to generate an effective measure of a characteristic of interest. The most significant decision involves the selection of the exemplars, while extensive cross-validation procedures can point to the most accurate feature representation and model selection. This approach limits subjectivity and facilitates the efficient analysis of voluminous and complex data, making it a compelling option in content analyses.
