A financial technology company compared Primer’s natural language processing (NLP) models against multiple competitors. Out of the box, and without any retraining, Primer outscored them across all of the key performance metrics.
Recently, a financial technology company evaluated multiple intelligence solutions on the market that feature artificial intelligence (AI) and NLP technologies, including Primer. Some of the solutions came from the largest and most established cloud companies in the space, while others came from smaller companies and open-source NLP libraries. Before choosing a solution, the company wanted to determine which technology would best fit its need to accurately identify company names in financial filings.
To conduct the evaluation, the financial company internally created a custom, hand-labeled named entity recognition (NER) dataset to evaluate the NER predictions from each provider's solution. NER is the machine learning task of identifying and categorizing entities, such as people, locations, and organizations, in unstructured text. It is a foundational yet complex NLP technology, which is why many companies choose NER as their testing and evaluation framework: if a provider performs well at NER, it will likely be able to build more complex models on top of that foundation. For example, an NER model can be trained to automatically distinguish Apple, the company, from apple, the fruit, and to associate Tim Cook, the person, with an Apple computer.
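The shape of an NER model's output can be illustrated with a toy example. The hand-written lookup below is purely illustrative (it is not Primer's engine or any trained model); a real NER model learns these associations from labeled data:

```python
# Toy illustration of NER output: (text, type, start, end) spans.
# A real model learns entities from data; this hand-written lookup is
# only meant to show the shape of the task.
KNOWN_ENTITIES = {
    "Apple": "ORG",      # the company, not the fruit
    "Tim Cook": "PERSON",
}

def toy_ner(text):
    """Return (entity_text, entity_type, start, end) spans found in text."""
    spans = []
    for entity, ent_type in KNOWN_ENTITIES.items():
        start = text.find(entity)
        if start != -1:
            spans.append((entity, ent_type, start, start + len(entity)))
    return sorted(spans, key=lambda s: s[2])  # order by start offset

print(toy_ner("Tim Cook unveiled the new Apple laptop."))
# → [('Tim Cook', 'PERSON', 0, 8), ('Apple', 'ORG', 26, 31)]
```

The character offsets (the entity's boundaries) matter as much as the type label; the evaluation measures discussed later in this post score both.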
Organizations can put NER models to work in many ways. Financial firms might deploy NER on press releases to identify newly announced executives of a company of interest, discover new competitors named explicitly in financial filings, or understand a company's exposure to specific regions based on their mentions in an earnings transcript. Likewise, an NER model can be trained and deployed to correctly differentiate between Amazon the company and Amazon the river. This distinction is important for trading models derived from this information. If we are looking at sentiment, for example, the quantitative trading model must be reacting to sentiment about Amazon the company, not the river.
A financial firm might also look to NER to categorize key information and identify specific details in SEC filings for a group of companies. For example, in the SEC filing for Coinbase Global, Inc., an NER model can identify key entities of interest.
In these sentences, Primer’s NER Engine can quickly identify several entities: Mr. Armstrong, Airbnb, Inc., Universitytutor.com, and Johnson Educational Technologies LLC. It can also categorize them correctly: Airbnb, Inc., Universitytutor.com, and Johnson Educational Technologies LLC are organizations, and Mr. Armstrong is a person. Analysts can then search on relationships between specific entities and locations, as well as co-mentions.
A big part of a financial analyst's job is gathering data from various documents and then organizing it, cleaning it up, and getting it into a format that can be analyzed. Primer’s NER Engine can quickly and accurately identify, structure, and extract that data for analysts. This is a true paradigm shift: instead of performing manual qualitative analysis, analysts can automate the structuring of text to feed directly into their quantitative trading models. That frees analysts to spend more time on what they do best: analyzing the information and uncovering hidden trends.
In the test, the customer found that Primer scored better on all the industry-standard measures and performed even better on the financial company’s top-priority measures. Primer scored an average of 21% better than the leading competitor and an average of 42% better than competitors overall across all metrics. As a result, the financial firm became one of our newest customers.
The following table illustrates how Primer stood above the competitors in nearly every category.
It is important to note that the financial company was evaluating Primer's out-of-the-box NER Engine, which has been trained on general text data and not specifically on the types of financial data the customer was testing against.
To get to this level of algorithm precision with our NER Engine, Primer engineers injected diversity into the data used to train our engines on a range of writing styles, subject matters, and entities. We also curated a highly diverse group of documents, including entities from the financial, defense-related, and scientific worlds. This diversity was carefully curated over iterative testing and training cycles, and it is what enables the Primer NLP Engines to outperform our competitors.
Companies can further improve model performance on these types of tasks by using Primer Automate, our no-code end-to-end machine intelligence platform that enables users to quickly and easily retrain the NER model using domain-specific data. The Primer NER Engine is one of our 18 best-in-class NLP Engines that customers can access via API or from within Automate to structure and conduct advanced workflows on their documents.
Using Primer’s NER Engine, the financial firm tested output quality using industry-standard metrics: precision, recall, and a combination of the two (the "F1" score*). Among NLP machine learning models, there is typically a trade-off between precision and recall. For a high-frequency trading algorithm, a high-precision model is of paramount importance for automated trading workflows. After all, you don’t want to mistakenly trade on negative sentiment about Amazon (the company) when a financial analyst’s note is discussing the negative consequences of the Amazon (the rainforest) being cut down for agriculture. A misclassification of entity types in trading models like this can rapidly become incredibly costly.
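The trade-off can be made concrete with the standard metric definitions. The counts below are invented purely for illustration; they are not from the customer's evaluation:

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard definitions; F1 is the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts: this model rarely fires incorrectly (5 false positives),
# so precision is high, but it misses 20 real entities, so recall is lower.
p, r, f1 = precision_recall_f1(true_positives=80,
                               false_positives=5,
                               false_negatives=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# → precision=0.94 recall=0.80 f1=0.86
```

A high-precision model keeps false positives low (few mistaken trades); a high-recall model keeps false negatives low (few missed mentions).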
Conversely, there are cases where the goal of the model is a high recall score. For example, a business analyst looking to mitigate the risk of forced labor in her company’s global supply chain cannot afford to miss any related violations across volumes of audits. This analyst will want a high-recall model that casts a large and inclusive net to catch all relevant mentions, with manual review an expected part of her workflow. Compared to other commercial solutions, Primer's NER Engine outperforms the industry benchmarks in both precision and recall, as our customers have proven out in the test scores detailed in the chart above.
The financial company not only used precision, recall, and F1 scores to determine the quality of output from Primer and our competitors; they also took the scoring a step further and broke down the measurements into five categories of interest. These measures represent different ways of assigning scores to each NER model's predictions. The category the customer cared about most was “Type Penalty.” Under this measure, a prediction receives a perfect score if both the named entity's type and its boundaries are correct, and a partial score if the type is correct but the boundaries are not an exact match. In the example below, the prediction would receive a positive score because it correctly identifies part of the boundary of Primer Technologies and its type prediction is correct. A perfect score would be assigned only to the prediction that identifies the full span "Primer Technologies" as an ORG.
The chart below further illustrates how Primer scored 21% better than the leading competitor on this metric.
Primer also performed better on the four other measures that the financial company evaluated:
Boundary exactness: The NER categorization receives a perfect score if the named entity's boundaries, or where the sequence of words starts and stops, are identical to the gold label data, regardless of whether the type is correct or incorrect. The prediction in the example below would receive a perfect score, even though the category type "PERSON" is incorrect.
Partial boundary: The prediction receives a perfect score if the named entity's boundaries are identical to the gold label data. It receives a partial score if the boundaries are imperfectly matched, regardless of whether the type is correct or incorrect. The prediction in the example below would receive a "partial" or fractional score because it correctly predicts part of the ORG Primer Technologies boundary.
Strict: The prediction receives a positive score only if both boundary and type are correct. Only the prediction in the example below would receive a positive score for this metric.
Type: The prediction receives a perfect score if the named entity's type is correct, regardless of whether the exact boundaries are matched. The prediction in the example below would receive a perfect score for this metric, as would the predictions that identified “Primer Technologies” or “Technologies” as ORGs. It scores perfectly even though the boundary of Primer Technologies is incorrectly predicted.
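These measures resemble the widely used SemEval-style NER evaluation schemes. The sketch below scores a single prediction against a single gold entity under each measure; the 0.5 value for partial credit is an assumption for illustration, since the post does not specify the exact fraction:

```python
def overlaps(pred, gold):
    """True if two (start, end) character spans share at least one character."""
    return pred[0] < gold[1] and gold[0] < pred[1]

def score_prediction(pred_span, pred_type, gold_span, gold_type):
    """Score one prediction against one gold entity under each measure.

    Returns 1.0 (perfect), 0.5 (assumed partial credit), or 0.0 per measure.
    """
    exact = pred_span == gold_span
    partial = overlaps(pred_span, gold_span)
    type_ok = pred_type == gold_type

    return {
        # Boundary exactness: boundaries must match exactly, type ignored.
        "boundary_exact": 1.0 if exact else 0.0,
        # Partial boundary: full credit for an exact match, partial credit
        # for any overlap, type ignored.
        "partial_boundary": 1.0 if exact else (0.5 if partial else 0.0),
        # Strict: both boundary and type must be correct.
        "strict": 1.0 if exact and type_ok else 0.0,
        # Type: type must be correct; any overlapping boundary is accepted.
        "type": 1.0 if type_ok and partial else 0.0,
    }

# Gold entity: "Primer Technologies" (chars 0-19), labeled ORG.
# Prediction: only "Technologies" (chars 7-19), correctly typed ORG.
print(score_prediction((7, 19), "ORG", (0, 19), "ORG"))
# → {'boundary_exact': 0.0, 'partial_boundary': 0.5, 'strict': 0.0, 'type': 1.0}
```

Note how the same prediction fails the strict and boundary-exactness measures but earns full credit on type, matching the behavior described above.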
In conclusion, not all NLP models are created equal. When determining which NLP solution is right for your enterprise, you need to test the models you are buying across key performance metrics, including F1 scores, precision, recall, boundary exactness, strictness, and entity-type match. When you have mission-critical tasks, you want to be confident that you have the best-performing models to drive business outcomes.
To learn more about Primer technology, explore our Intelligence Engines or download our AI Technology report, which details how AI can make us think smarter. You can also contact us to discuss your specific needs or schedule a demo.
Matt Lubrano Solutions Architect, Primer.ai
Sarah Morningstar, Ph.D Content Manager, Primer.ai
*The F1 score is the harmonic mean of precision and recall. This score takes both false positives and false negatives into account.