5 Questions is a periodic feature produced by Cornerstone Research, which asks our professionals, senior advisors, or affiliated experts to answer five questions.
We interview Mike DeCesaris, vice president of Cornerstone Research’s Data Science Center, about the challenges of working with unstructured data and how his team has developed custom processes to turn it into valuable information clients can use.
What is unstructured data, and how does it relate to litigation?
Traditional data analytics typically involves the analysis of structured data, such as spreadsheets or relational databases. Unstructured data, on the other hand, is essentially any information not stored according to a predefined structure. Examples of unstructured data include text documents, emails, Adobe PDFs, and image files. Some estimate that unstructured data accounts for 80 percent or more of all data, and unstructured datasets are growing fast.
The information contained in unstructured documents can be crucial to supporting expert analyses, but locating and extracting the relevant information can be challenging when there are large volumes of unstructured data. Whereas structured data can be processed and analyzed using traditional database tools and data analysis programs, analyzing unstructured data requires either numerous hours of manual work or a significantly higher level of technical expertise and sophistication.
Given the large amount of unstructured data in our work, how is Cornerstone Research responding?
We’ve developed sophisticated tools that can be used in concert to create tailored approaches to turning unstructured data into structured data. The result can be leveraged for quantitative analysis. This can eliminate large-scale manual review, significantly reducing processing time and cost. Perhaps more importantly, this can unlock new analysis possibilities that would otherwise have been impossible.
For example, we have:
- developed a parallelized data processing pipeline to convert hundreds of thousands of pages (hundreds of gigabytes) of daily reports across multiple distinct text report formats into tables and extract key information to enable cost-efficient analyses in multiple joint defense matters;
- digitized a large set of image-based account statements with various counterparties and automated the creation of machine-readable transaction datasets;
- identified, within a document dump of 250,000 pages, the PDFs of emails that contained relevant trade tables, then programmatically extracted and aggregated the data into a database; and
- extracted and structured entries from consumer complaint forms into a comprehensive database.
In litigation, we often deal with sensitive client information. That is why Cornerstone Research has invested heavily in secure infrastructure, including high-performance and high-throughput analytical servers and storage clusters. Our analytical infrastructure is on-premises, meaning client data is never exposed to the web. We have also invested in a number of software tools and programming languages to add high-quality text layers to documents, quickly extract tabular data, and develop tailored approaches to extracting key information. Finally, we have invested in people—we have exceptional data scientists and practitioners with many years of experience across a large number of different clients and projects.
What are some of the challenges of working with this kind of data?
Extracting meaningful information from unstructured data is nuanced for a number of reasons. We can use documents stored in PDF file format (.pdf) as an example. Many PDF files, particularly scanned documents, store each page as an image. Some PDF files also contain a layer of text that can be combined with the image to render a searchable PDF document, but not all do. For those that do not, an interpreted text layer based on the underlying images must be added, typically through optical character recognition (OCR), before any text extraction can begin.
The number of documents and the size of each document also pose processing time issues. Clients can easily provide thousands, if not millions, of PDF documents that are each thousands of pages long. Without the proper hardware, software, and coding capabilities, these documents would have to be processed manually, which would take years of work and be prohibitively expensive.
Finally, the content of the documents may vary widely. One document alone may contain information in several types of formats. This means any attempt to extract meaningful data from the files requires extremely high precision in distinguishing different reports from each other, but at the same time must have the flexibility to capture key information expressed in different formats.
Can you walk us through an overview of how Cornerstone Research typically approaches working with unstructured data?
We can use our example of PDF documents to show how we transform unstructured information into a structured format that can be used in analyses. The first stage in any text extraction exercise is to review a sample of the documents and determine the key pieces of information essential to analysis. This step is fundamental to understanding the structure of the contents.
The next step when we are preprocessing PDF files is to ensure that they contain what is commonly referred to as a “text layer.” The text layer of each document is then separated from its original PDF and stored as a plain text file (file extension .txt), which lends itself to highly efficient and flexible methods of processing.
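As a rough illustration of this preprocessing check, the sketch below assumes the text of each page has already been pulled out by some PDF tool (not shown here, and not named in this interview). The function names and the `min_chars` threshold are hypothetical choices for the example, not a description of our production tooling. It flags pages that appear to lack a usable text layer and writes the rest of the text to a plain text file.

```python
from pathlib import Path


def has_usable_text(page_text, min_chars=25):
    """Heuristic: treat a page as having a text layer if enough
    letters or digits were extracted from it. The threshold is an
    arbitrary illustrative value."""
    return sum(ch.isalnum() for ch in page_text) >= min_chars


def pages_needing_ocr(page_texts, min_chars=25):
    """Return 1-based page numbers whose extracted text is too sparse."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not has_usable_text(text, min_chars)
    ]


def write_plain_text(page_texts, out_path):
    """Store all pages in one .txt file, separated by form feeds so
    that page boundaries can be recovered later."""
    Path(out_path).write_text("\f".join(page_texts), encoding="utf-8")
```

Pages returned by `pages_needing_ocr` would be sent through OCR before the document is written out as text.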
Once documents are stored as plain text, we run them through proprietary software programs. Employing complex conditional logic and a text matching language, the programs discern relevant information including different report types and sections, metadata such as dates and client identifiers, and tables containing records of interest.
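To give a flavor of this step, here is a minimal sketch that uses Python regular expressions as the text matching language. The report layout, field names, and patterns are all invented for illustration; real reports are far messier, and our actual programs are proprietary and not shown here.

```python
import re

# Hypothetical report text, invented for this example.
SAMPLE_REPORT = """\
DAILY TRADE REPORT
Report Date: 2020-03-16
Client ID: AB-1234

Symbol  Quantity  Price
XYZ     100       12.50
ABC     -250      7.25

End of Report
"""

# One pattern per metadata field; each must match a whole line.
HEADER_PATTERNS = {
    "report_date": re.compile(
        r"^Report Date:[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$", re.MULTILINE
    ),
    "client_id": re.compile(
        r"^Client ID:[ \t]*([A-Z]{2}-\d+)[ \t]*$", re.MULTILINE
    ),
}

# A table row: upper-case symbol, signed integer, price with two decimals.
ROW_PATTERN = re.compile(
    r"^(?P<symbol>[A-Z]+)[ \t]+(?P<quantity>-?\d+)[ \t]+(?P<price>\d+\.\d{2})[ \t]*$",
    re.MULTILINE,
)


def classify_report(text):
    """Conditional logic that picks a report type from the title line."""
    lines = text.strip().splitlines()
    title = lines[0] if lines else ""
    if "TRADE REPORT" in title:
        return "trade"
    if "POSITION REPORT" in title:
        return "position"
    return "unknown"


def parse_report(text):
    """Extract the report type, metadata fields, and table rows."""
    metadata = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = pattern.search(text)
        metadata[name] = match.group(1) if match else None
    rows = [
        {
            "symbol": m["symbol"],
            "quantity": int(m["quantity"]),
            "price": float(m["price"]),
        }
        for m in ROW_PATTERN.finditer(text)
    ]
    return {"report_type": classify_report(text), **metadata, "rows": rows}
```

Calling `parse_report(SAMPLE_REPORT)` returns a dictionary with the report type, the date and client identifier, and one record per table row. Anchoring each pattern to a full line is one way to keep matching precise while still tolerating variable spacing.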
To turn the extracted information into a format that can be analyzed, we load the now-structured text into a database. We take advantage of parallel processing to load multiple intermediate files at once, and data from all records are loaded to a single table or multiple tables.
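The loading step might look something like the sketch below. It assumes the intermediate files are CSVs with `symbol`, `quantity`, and `price` columns and uses SQLite from the Python standard library as a stand-in database; the file layout, table name, and function names are assumptions made for the example, since the interview does not specify them.

```python
import csv
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_intermediate_file(path):
    """Parse one intermediate CSV file into a list of row tuples."""
    with open(path, newline="") as handle:
        return [
            (
                Path(path).name,
                row["symbol"],
                int(row["quantity"]),
                float(row["price"]),
            )
            for row in csv.DictReader(handle)
        ]


def load_files(paths, conn, max_workers=4):
    """Read many files concurrently, then insert every record into a
    single table. Returns the number of rows loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trades ("
        "source_file TEXT, symbol TEXT, quantity INTEGER, price REAL)"
    )
    total = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Worker threads parse files in parallel; only this thread
        # writes to the database connection.
        for rows in pool.map(read_intermediate_file, paths):
            conn.executemany("INSERT INTO trades VALUES (?, ?, ?, ?)", rows)
            total += len(rows)
    conn.commit()
    return total
```

Threads suit this sketch because reading files is mostly waiting on disk. For parsing work that is heavy on computation, a process pool would be the more natural choice.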
The final step is to validate the extracted data’s quality. Our QA processes include independently replicated text extraction to verify results; calculating coverage statistics to ensure there are no gaps in information; and frequent collaboration with subject matter experts to control the quality of the product.
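Two of these checks can be sketched in a few lines. The example below assumes, purely for illustration, that one report is expected per weekday and that each extraction run produces a dictionary of values keyed by record; the function names are hypothetical.

```python
from datetime import date, timedelta


def expected_report_dates(start, end):
    """All weekdays from start to end inclusive (assumes one report
    per weekday, which is an illustrative simplification)."""
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days.append(current)
        current += timedelta(days=1)
    return days


def coverage(expected, extracted):
    """Return the share of expected keys that were extracted and the
    sorted list of keys that are missing."""
    expected_set = set(expected)
    found = expected_set & set(extracted)
    missing = sorted(expected_set - found)
    rate = len(found) / len(expected_set) if expected_set else 1.0
    return rate, missing


def replication_mismatches(primary, replica):
    """Keys whose values differ between two independent extractions,
    including keys present in only one of them."""
    keys = set(primary) | set(replica)
    return sorted(k for k in keys if primary.get(k) != replica.get(k))
```

A coverage rate below 1.0 points to dates or documents that still need to be located, and any mismatch between the two extractions is a prompt for manual review with subject matter experts.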
Briefly, what are some other examples of how Cornerstone Research works with unstructured data?
By far, the most common type of unstructured data processing in our work resembles the example above, where we extract and organize unstructured data that is visually tabular in nature. Increasingly, however, we deal with more complex extractions from and characterizations of text, image, and even audio and video documents. This work sometimes focuses on extracting concrete information from documents, like critical references in free-form text, text transcriptions from video clips, and logo and product detection in images.
In other instances, we aim to quantify more abstract concepts, like the sentiment associated with social media posts, topic composition of public press articles, and the characterization of multimedia marketing materials. This work typically utilizes AI, machine learning, and text analytics techniques to analyze unstructured data. We hope to cover these topics in more depth in future installments of this series.
Unstructured data can provide windows into every facet of an organization and its processes, and the growth of unstructured data is expected to accelerate as machine-generated data and machine learning initiatives become more widely used. Our extraction process is repeatable and reliable, and the resulting data can be effectively leveraged to support expert analyses in litigation and regulatory settings.