
The Magazine
Management. Expertise. Inspiration.

Date: 23/09/2020

Title: KYC: Solving a Major Data Challenge Through Natural Language Processing




Know-your-customer (KYC) and anti-money laundering (AML) are common processes across most Financial Services (FS) institutions. The goal of these processes is to identify and flag data anomalies. Because the data can originate from an individual or an institution, the ability to ingest and assess large amounts of it is an essential aspect of KYC and AML processes.

In addition to being labelled a data challenge, KYC is also seen as an operational challenge by several FS institutions. Although vital, KYC is a lengthy, often largely manual, and non-revenue-generating procedure.

Nevertheless, KYC is not only an integral process that gives firms a better understanding of their clients' needs; it is also a legal requirement for most FS firms. Failure to comply could result in hefty fines from regulators.

Authors: Murilo Silvestre | Blair Cowan

Automation in KYC

The UK regulator, the Financial Conduct Authority (FCA), reported over £390M in AML fines between January 2019 and January 2020 [1]. KYC is one part of the overall AML process, and the FCA has highlighted how important KYC is to banks in controlling the UK financial system.


To adhere to the recommendations and regulations imposed by the FCA, firms operating in the UK must take the necessary precautions and put effective risk-based AML control frameworks in place. Such frameworks can be achieved by integrating Machine Learning (ML) and Natural Language Processing (NLP) into their operations.

This article investigates the benefits of automation, focusing on the use of NLP to enhance how specific KYC checks are conducted.

Introduction: Know Your Customer

Know Your Customer, or KYC, is a process whereby a business verifies the identity of its clients and assesses their suitability, along with any potential risks that their custom may bring, such as illegal intentions towards the business relationship. The process lends itself to a technological solution: it usually involves manipulating large quantities of data stored in various locations, and it is governed by compliance guidelines, which limits efficiency. Because KYC is so crucial to the safety of a business, it must be carried out as effectively as possible while keeping the error rate as low as possible.

In general, the process is very repetitive and involves manipulating large quantities of data; more importantly, any human error in the process can have a high impact. KYC is of medium complexity, but this is increased by the large volumes and by the compliance guidelines that must be adhered to at all times. Based on previous projects, market research and Synpulse’s expertise, we believe that process efficiency can be improved by up to 80% if KYC is handled by a robot. Most importantly, trends in cases can be continuously monitored and studied in order to draw attention to any fraudulent activity automatically and instantly.

Most, if not all, KYC processes involve ingesting and processing different types and formats of data, from free-text and unstructured content to multimedia (images and sometimes even videos).

A common, yet very time-consuming part of a KYC process is what is referred to by the business as a standard «Google Check». A Google Check aims to establish that a potential new client holds no malicious intent towards the business.

The check involves entering a potential client's first and last name into the search engine. The results are then cross-checked against negative keywords that could indicate an issue, such as «prison», «theft» or «fraud». If a result is flagged, the collected information is passed on for further analysis of the nature of the malicious intent.
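The keyword cross-check can be illustrated with a minimal Python sketch. The keyword set, the `SearchResult` structure and the sample results are all hypothetical; a real check would take its results from a search engine and its keywords from the firm's compliance policy.

```python
import re
from dataclasses import dataclass

# Hypothetical negative keywords; a real list would come from compliance policy.
NEGATIVE_KEYWORDS = {"prison", "theft", "fraud"}


@dataclass
class SearchResult:
    title: str
    snippet: str


def flag_results(results, keywords=NEGATIVE_KEYWORDS):
    """Return (result, matched keywords) pairs for results containing any keyword."""
    flagged = []
    for result in results:
        # Lower-case and split into words so that "Fraud." still matches "fraud".
        text = f"{result.title} {result.snippet}".lower()
        words = set(re.findall(r"[a-z]+", text))
        hits = sorted(words & set(keywords))
        if hits:
            flagged.append((result, hits))
    return flagged


# Invented sample data, for illustration only.
sample = [
    SearchResult("Jane Doe joins charity board", "Local entrepreneur appointed trustee."),
    SearchResult("Jane Doe trial", "Businesswoman charged with fraud and theft."),
]

flagged = flag_results(sample)
for result, hits in flagged:
    print(result.title, "->", hits)
```

Exact word matching of this kind misses variants such as «fraudulent» or «imprisoned», which is one reason a plain keyword search benefits from the language processing techniques discussed in this article.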

Firms that conduct such checks without a robust and well-defined technology framework in place are bound to be error-prone and to have much longer processing times. NLP, when integrated with technologies such as robotic process automation (RPA), can tremendously improve how «KYC Google Checks» are performed and handled. For example, Synpulse created a number of processes at a leading Dutch private bank with the aim of automating its KYC, AML and Google Checks of new clients. The bank's client due diligence checks are now fully automated and run without assistance. Most importantly, we saw an efficiency gain of 80%: four out of five FTEs are no longer needed to complete the process and can be engaged in less repetitive and more intelligently productive tasks elsewhere.

Natural Language Processing

Natural language processing is an area within Artificial Intelligence that has been around for many decades; however, it has only become commercially accessible in recent years. From a technological standpoint, it focusses on how a computer or device is able to process and manipulate human language in all its different forms and idiosyncrasies. Among the best-known real-world examples of NLP are popular virtual assistants such as the Google Assistant, Alexa or Siri, which listen out for a prompt (“Hey Google”), take a query in natural language as input, process it, determine the answer and provide an output, also in natural language.

Although the time between a query and an output seems reasonably short and the exchange straightforward, the amount of computation that goes on behind the scenes is vast.

When a query is received, the software must first establish what is being asked of it. Compare this with a human: if a person were asked “Should I wear this coat today?”, there are several possibilities as to what is actually being queried:

“Is it cold enough outside to need a coat?”

“Is it raining right now?”

“Is it going to rain later?”

“Is this coat the correct one to wear or would you recommend a more stylish one?”

When software is asked the same question, it must also establish the semantics behind the query to determine the most suitable response.

For the software to compute the above query, it must complete several steps. These include sentence and word segmentation, where the software establishes how words are combined to form complete sentences, as well as text lemmatisation and stemming, among others.
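As a rough illustration of these steps, the Python sketch below performs naive sentence segmentation, word segmentation and suffix-stripping stemming using only the standard library. The rules are deliberately simplistic assumptions made for this example; production NLP libraries use far richer rules and language models, and proper lemmatisation additionally requires a dictionary.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Naive word segmentation: lower-case runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def stem(word):
    """Very crude suffix stripping; real stemmers apply many more rules."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least four characters so short words such as "this" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


text = "Is it raining right now? Should I wear this coat today?"
sentences = split_sentences(text)
tokens = [[stem(word) for word in split_words(s)] for s in sentences]

print(sentences)
print(tokens)
```

Here «raining» is reduced to «rain», so a later step can treat «raining», «rained» and «rain» as the same concept when working out what is being asked.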

To better demonstrate how NLP works, the example below uses an extract from a BBC News article [2] from 1966, when England won the football World Cup. The article was analysed and several queries were put to different NLP algorithms [3] in order to show how different the outcomes can be if questions are not worded effectively. The results are shown below:


Table 1: NLP Analysis

The above table of results is helpful in highlighting the advantages of NLP, but also its frailties. The different algorithms do not always provide the same output, because of the way in which each one processes a query. For example, in question two the third algorithm provides an incorrect number because it does not stop searching once it reaches the correct number. Instead, because the question included the phrase «How many», it continues through the passage until it has found all the numbers and incorrectly assumes that the result is their sum.
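The failure described for question two can be reproduced with a toy example in Python. This sketch is not one of the algorithms behind Table 1; the passage and both answering strategies are invented purely to illustrate the pitfall.

```python
import re

# Invented passage, not the article analysed in Table 1.
passage = "The home side scored 3 goals in the final, watched by 500 fans at 2 venues."


def numbers_in(text):
    """Extract every integer that appears in the text, in order."""
    return [int(n) for n in re.findall(r"\d+", text)]


def answer_sum_all(text):
    """Flawed strategy: treat «How many» as a request to add up every number found."""
    return sum(numbers_in(text))


def answer_nearest(text, keyword):
    """Slightly better strategy: return the number closest to the keyword."""
    position = text.find(keyword)
    matches = list(re.finditer(r"\d+", text))
    if position < 0 or not matches:
        return None
    closest = min(matches, key=lambda m: abs(m.start() - position))
    return int(closest.group())


# Question: "How many goals did the home side score?"
print(answer_sum_all(passage))           # adds 3 + 500 + 2
print(answer_nearest(passage, "goals"))  # picks the number next to "goals"
```

The first strategy returns a number that appears nowhere in the passage, while the second only works if the question is reduced to a suitable keyword first, which again shows how much the outcome depends on how the input is prepared.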

Most interesting, however, are the results of questions three and four. From a human perspective the two queries are the same, only worded slightly differently; yet in question three the algorithms are unable to determine the correct answer. This is particularly significant from a business perspective. It shows that we cannot throw questions at the algorithms without first manipulating the data in some form so that the solution can correctly determine the output. In short, poor input will generate poor output.

NLP applied in KYC

Using the «Google Check» example highlighted in the section above, it is possible to showcase how a robust and holistic risk-based technology framework can address the data and operational problems of AML and KYC.

Turning first to the data challenge mentioned earlier, KYC requires firms to ingest and assess large quantities of data, often in multiple formats. By using a combination of NLP and ML, a software application can learn to identify and flag data anomalies. While NLP enables the solution to understand what is being ingested, ML enables it to continue to learn and adapt from every query. This ensures the solution continues to evolve as checks are processed.

The operational efficiency challenges in KYC can be mitigated by applying RPA to the process. While NLP and ML focus on analysing and interpreting the data, RPA carries out any pre-programmed manual steps in the process, reducing cost and significantly increasing the speed of a KYC process. This is yet to be widely applied within KYC practices; however, a Tractica report [4] on the NLP market has estimated that the total NLP software, hardware and services market could exceed £16.7 billion by 2025. It also forecasts that the market for NLP software solutions that utilise AI will grow from £101 million in 2016 to £4 billion by 2025.


Figure 1: Future NLP Revenue Prediction [4]
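To put the software forecast into perspective, the two figures quoted (£101 million in 2016 and £4 billion in 2025) imply a compound annual growth rate. That rate is not stated in the report as cited here; the short Python calculation below derives it on the assumption of steady year-on-year growth over the nine years.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Constant yearly growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


start_gbp_millions = 101    # 2016 figure quoted above
end_gbp_millions = 4000     # 2025 forecast quoted above (£4 billion)
years = 2025 - 2016

cagr = compound_annual_growth_rate(start_gbp_millions, end_gbp_millions, years)
print(f"Implied growth rate: {cagr:.1%} per year")
```

Under that assumption the forecast corresponds to the market growing by roughly half every year.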

How can Synpulse help?

Synpulse has a wealth of experience in automating a multitude of processes across different sectors and understands in depth the importance of ensuring that the data involved is of the highest possible quality. NLP, whilst a crucial aspect of intelligent automation, has its limitations and is not simply a quick fix to be thrown at a process. It should ideally be combined with additional technologies such as ML or RPA to obtain the most accurate results. Synpulse has a substantial pool of professionals with extensive implementation experience and robust judgement on how best to utilise the different technologies within the intelligent automation domain, so that a process is automated not only correctly but also efficiently, with little human interaction required. When working on a process with our clients, we take every measure to ensure that the solution takes a holistic approach and meets the needs of the client as well as possible.



Marouane Bakhtar
