Natural Language Processing (NLP) is a branch of computer science aimed at teaching computers to perceive and generate human language directly, without first translating it into formal, structured data. We described the basic concepts and algorithms of NLP, and its possible uses in business, in our recent article.
NLP opens the door for the development of the media industry. Media deals with human language every hour and every second, so a computer's ability to work with human language can completely change media processes all over the world. Computer intelligence will automate searching for necessary information, parsing relevant news, and analyzing and systematizing the news according to predefined criteria.
Text perception in NLP
NLP is a set of complex algorithms by which a computer is trained to understand human language. Basically, it involves splitting the text into smaller units and analyzing the connections between them. Please refer to this article for more information.
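The "splitting into smaller units" step can be sketched with a deliberately simple, rule-based tokenizer (real NLP pipelines use trained tokenizers that handle abbreviations, quotes, and edge cases; the function below is only an illustration):

```python
import re

def tokenize(text):
    """Split raw text into sentences, then each sentence into word tokens.

    A naive, rule-based sketch: sentences end at ., ! or ? followed by
    whitespace; words are runs of letters and apostrophes.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"[A-Za-z']+", s) for s in sentences]

tokens = tokenize("NLP splits text into units. Then it analyzes them.")
# tokens[0] -> ['NLP', 'splits', 'text', 'into', 'units']
```

Once the text is broken down like this, the algorithm can analyze relations between the resulting units rather than raw strings.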
However, news app development requires more than a mere understanding of the text. To successfully implement NLP in this area, it's necessary to develop advanced algorithms that enable the computer to perform the following actions:
- systematizing text content
- summarizing information
- parsing only the relevant information
- grouping the news according to user-defined criteria
This is possible due to two levels of text perception: macro-understanding and micro-understanding.
Macro-understanding implies the general understanding of the entire content and includes the following aspects:
- classifying and systematizing text according to the defined criteria
- matching different types of records (e.g. job descriptions and CVs)
- general analysis of sentiment and semantics
- extraction of topics, keywords, and key phrases
- detection of duplicates and near-duplicates
Typically, macro-understanding is performed with the help of the Apache Spark and Spark MLlib frameworks. The process of macro-understanding can be represented in the following way:
In this example, users download content from the Internet and pass it on via the Kafka stream processing system. The Spark machine learning framework processes the information with NLP algorithms, systematizes it, and stores it in a database. The structured information is sent to the application's internal search engine and output to the end user's interface.
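The flow of that pipeline (stream ingestion, NLP classification, structured storage, search) can be approximated in-memory. This is only a sketch of the stages, not the real Kafka or Spark APIs; all names here are illustrative:

```python
from collections import deque

# In-memory stand-ins for the pipeline stages described above:
# a queue plays the role of the Kafka stream, a classify step the
# Spark ML NLP stage, and a dict the database behind the search engine.
stream = deque()   # "Kafka": raw documents arrive here
database = {}      # "database": structured, classified records

def classify(text):
    """Toy stand-in for the Spark ML NLP stage: tag a topic by keyword."""
    return "finance" if "market" in text.lower() else "general"

def process_stream():
    """Drain the stream, classify each document, and store the result."""
    while stream:
        doc_id, text = stream.popleft()
        database[doc_id] = {"text": text, "topic": classify(text)}

def search(topic):
    """The app's internal search engine queries the structured records."""
    return [rec["text"] for rec in database.values() if rec["topic"] == topic]

stream.extend([(1, "Stock market rallies"), (2, "City opens new park")])
process_stream()
search("finance")  # ['Stock market rallies']
```

In production, each stand-in is replaced by the real component: the deque by a Kafka topic, `classify` by a trained Spark MLlib model, and the dict by the database feeding the search engine.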
Micro-understanding refers to the ability to perceive the meaning of each separate phrase and sentence in order to understand the smallest details of the text. This is a more complex task and includes the following aspects:
- extracting abbreviations and their definitions
- extracting entities (such as people, companies, products, amounts of money, locations, etc.)
- extracting references to other documents
- extracting emotionally colored sentiments (positive/negative news and references)
- extracting quotes with references to their authors
- extracting contract conditions
Micro-understanding implies the syntactic analysis of the text, including the analysis of word order and usage.
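One of the micro-understanding tasks listed above, extracting abbreviations and their definitions, can be sketched with a rule-based pass over the text. This is a naive illustration (production systems use trained models); it only matches definitions whose capitalized words immediately precede a parenthesized abbreviation:

```python
import re

def extract_abbreviations(text):
    """Find patterns like 'Natural Language Processing (NLP)' and map
    the abbreviation to its spelled-out definition."""
    pairs = {}
    for match in re.finditer(r"((?:[A-Z][a-z]+\s+){1,6})\(([A-Z]{2,})\)", text):
        definition, abbr = match.group(1).strip(), match.group(2)
        # Keep the pair only if the initials line up with the abbreviation.
        initials = "".join(word[0] for word in definition.split())
        if initials == abbr:
            pairs[abbr] = definition
    return pairs

text = "Natural Language Processing (NLP) powers news apps."
extract_abbreviations(text)  # {'NLP': 'Natural Language Processing'}
```

The same pattern-plus-validation structure (find a candidate, check it against a constraint) underlies many of the other extraction tasks, such as pulling quotes with their attributed authors.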
Our experience in using NLP algorithms
Spam detection at CityFalcon
CityFalcon is a news aggregator app that analyzes the latest financial news and tweets on selected topics, rating them by relevance to help investors stay aware of the latest trends in the financial world. The full case study is available here.
The user selects the topics they want to read, and the system scans the latest financial data. The output is the 30 most highly rated, timely, and relevant pieces of news from trusted sources. To improve the quality of the output, the app implements algorithms that detect and remove irrelevant information from the newsfeed displayed to the user.
The main challenge
While developing the parsing algorithms, we had to address the concern that the output might contain spam, advertising, or other types of information useless to the end user.
In order to sort out the most relevant information, we decided to focus on developing the micro-understanding algorithms and to create a spam filter for sorting the news. By analyzing the keywords and symbols, the system selects the important details of the particular piece of news and decides if it is worth presenting to the users.
Technical implementation of the feature
The spam filter we created was based on the Naive Bayes classifier, a machine learning algorithm used to classify items according to selected features. For spam filtering, the Naive Bayes algorithm sorts content into two classes: spam and relevant information. The sorting is based on features of spam content learned by the program (the presence of particular words and their frequency of use) and is performed in the following way:
- Loading the data
Two folders (spam and ham), standing for spam and for useful or relevant information respectively, are created. They will contain lists of the features of each type of content.
- Pre-processing the data
To use the words in the lists as features, the system must convert all texts to lowercase and treat different forms of a word as the same entity, i.e. standardize the data. We implemented this by means of tokenization and lemmatization (please refer to the recent article for more information about these methods).
- Removing stop words
After pre-processing the text, we eliminated the words that cannot help determine whether the text is spam. These are called stop words and include articles, prepositions, conjunctions, etc. We implemented the stop-word filter as the set difference list_of_words – stop_words, where the first is the full list of words in the text and the second is a publicly available stop-word list.
- Extracting the features
After performing all the above-mentioned actions, we are left with the meaningful words used to determine whether the content is spam. To turn them into features, we can either:
1. Count how many times each word occurs in the text
2. Or simply record whether the word occurs in the document at all
- Training the classifier
Having brought the data into the correct format, we began teaching the algorithm to distinguish the content. We created 5 classifiers and uploaded about 100 thousand records, so that each classifier received a different portion of the content. Since the classifiers were trained on different data, this ensemble helped us increase the precision of the sorting.
- Testing the performance
Having completed all the aforementioned steps, we evaluated the algorithm's performance. The accuracy at the training stage showed how proficient the classifier is at learning from the data, and the accuracy at the testing stage revealed how well the machine applies that knowledge to new content.
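The steps above (pre-processing, stop-word removal, feature extraction, training) can be sketched end-to-end with a minimal multinomial Naive Bayes classifier. This is an illustrative toy, not the production CityFalcon code; the stop-word list and training snippets are invented for the example:

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for"}

def features(text):
    """Lowercase, tokenize, and drop stop words (the pre-processing steps)."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]

class NaiveBayesFilter:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(features(text))
        self.vocab = set(self.counts["spam"]) | set(self.counts["ham"])

    def predict(self, text):
        scores = {}
        for label in ("spam", "ham"):
            total = sum(self.counts[label].values())
            # Log prior plus a sum of smoothed log likelihoods per word.
            score = math.log(self.docs[label] / sum(self.docs.values()))
            for w in features(text):
                score += math.log((self.counts[label][w] + 1) / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

nb = NaiveBayesFilter()
nb.fit(
    ["win free money now", "free prize click now",
     "quarterly earnings report", "market analysis for investors"],
    ["spam", "spam", "ham", "ham"],
)
nb.predict("free money prize")       # -> 'spam'
nb.predict("earnings report today")  # -> 'ham'
```

A real deployment would train on the two labeled folders described above and, as in our project, combine several classifiers trained on different slices of the data.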
We have described quite a simple, yet effective NLP-based algorithm capable of filtering spam content. It can be used not only in news app development, but also introduced as a spam filter to a mailbox or server.
News clusterization at SwipeNews
SwipeNews is a news aggregator application that not only provides a customized newsfeed, but also compares its coverage across different types of media and systematizes the information so that the user can make up their own mind regarding a particular topic. Please refer to the full case study here.
Thanks to news aggregators, you no longer need to waste your time browsing different websites full of duplicated or irrelevant information. The application shows you the unique articles that match your interests.
The main challenge
Yet, one problem still arises: the most important events in the world are covered by numerous media outlets, so you would have to read and compare many pieces of information, systematize them, and pick out the main details in order to form your own opinion.
To simplify the experience of working with news content, we implemented an algorithm that searches and groups the news according to the selected topics.
Technical implementation of the feature
The clusterization algorithm was implemented with a TF-IDF vectorizer. TF-IDF is a text mining technique that estimates the importance of a word in a document by weighing how often it occurs in that document (term frequency) against how many documents in the collection contain it (inverse document frequency). In order to clusterize the articles, it's necessary to perform some preparatory steps:
1. Setting a character limit in order to filter out overly long articles
2. Selecting the center of the cluster (the base article for the algorithm to find similar ones). The algorithm calculates the degree of similarity between articles and filters out the ones with the lowest rating. Similarity is evaluated by analyzing the frequency of use of the following lexical items:
- separate words (not case-sensitive)
- abbreviations (proper names are treated as abbreviations)
Abbreviations are considered more important than ordinary words and have more influence on the final rating.
3. Defining the trendy words: having computed the frequency of use of each word and abbreviation, the algorithm determines which ones are trending. This serves as the base data for further analysis.
4. Grouping the words: the trendy lexical items are combined into word combinations. From then on, the algorithm analyzes word combinations instead of separate words.
5. Defining the trendy word combinations.
Upon completing the analysis, each article receives its rating, and the program forms a general index of similarity between the articles. We set a minimum similarity threshold, and the content with the lowest rating is filtered out.
The remaining data is represented as a graph showing the degree of similarity between the analyzed articles.
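The core of the approach, TF-IDF vectors compared by similarity, can be sketched in a few lines. This is a simplified illustration of the technique (the real SwipeNews implementation also weights abbreviations and word combinations, which the sketch omits); the sample articles are invented:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Turn each document into a {word: tf-idf weight} map.

    tf counts how often a word appears in one article; idf discounts
    words that appear in many articles, so rare shared terms dominate
    the similarity score.
    """
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    return [
        {w: tf * math.log(n / df[w]) for w, tf in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(a, b):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = [
    "central bank raises interest rates",
    "bank raises rates again today",
    "local team wins football final",
]
vecs = tf_idf_vectors(docs)
# Articles 0 and 1 share rare terms, so they land in the same cluster.
cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2])  # True
```

Pairwise similarities like these form the edges of the graph mentioned above, and applying the minimum similarity threshold prunes the weak edges before clustering.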
The clusterization of text information is an advanced algorithm that can also be used by robots to analyze existing articles and generate new content based on that analysis. This technique has already been adopted by the largest media companies – read on to learn more.
Using NLP in the news industry as a trend
By organizing effective human-computer communication, NLP can automate different areas of business. The media industry deals with human language every hour of every day and has enormous potential to implement NLP algorithms in its daily routine.
NLP-robots as journalists
As previously mentioned, the ability of the machine to understand the human language eliminates the necessity for humans to read and structure the huge amount of data. For journalists, it means that the computer algorithm can process all the information on the topic and output the key facts, figures, and statistics. By delegating the research part of the job to a robot, a human media specialist will be able to pay more attention to such aspects as analytics and creativity.
Advanced NLP algorithms are capable not only of processing information, but also of generating articles for news aggregators or analytics resources. Crawler bots can scan information on the internet, sort out the relevant pieces, and create a press release or a news article. Robot journalists can generate content for sports, finance, business, and crime news, weather forecasts, etc. – i.e. the type of content built around figures, statistics, and a formal style.
Automated news generation is actively utilized by the information agency The Associated Press. The system was rolled out in the mid-2010s and automated thousands of earnings stories per quarter, and the underlying platform has been claimed to be capable of producing up to 2,000 articles per second. The Associated Press is not the only media agency using journalism robots – the technology has also been adopted by media companies such as The New York Times, The Guardian, Forbes, the Los Angeles Times, and the BBC.
The main challenges of automating the news industry
Although NLP-based software is becoming a trend in the media industry, information agencies may face several challenges while automating their workflow.
- Human journalists vs computers
As the news industry automates, human journalists will find that the machine can do parts of their work faster and more efficiently. This may cause the usual problem of labor automation – the substitution of human employees by machines. Yet, not all aspects of a journalist's work can be performed by software: analytical, research, and creative work, as well as investigative journalism, all require human intelligence and are unlikely to be handed over to a computer. Thus, human journalists will be freed from boring routine work and able to dedicate their time and effort to the more challenging aspects of the job.
- Freedom of speech
The main principles of modern journalism are transparency and freedom of speech. Since the machine lacks the human element of critical thinking, it will be a challenge to train it to distinguish trustworthy information from fake information. Additionally, in order to follow the principles of transparency and freedom of speech, media companies will probably have to open the source code of their NLP systems.
- Ethical issues
During this period of global development of AI technologies, the ethical aspect of robotic work remains unclear. In computer journalism, a problem of media ethics arises: how do we prevent machine-generated content from spreading propaganda and antisocial messages? This is a critical question that needs to be answered in the very near future.
Now you are aware of the latest trends in the media industry and their tight integration with NLP technology. If you still have questions about how to use NLP in the news industry, do not hesitate to contact our team. Our specialists will provide extensive consultation on both the technical and the business development aspects, and our expert developers will create the NLP algorithms that will bring the future to your project right now.