How We Made a News Processing App that Works with Big Data

The Internet is a huge network, and an enormous amount of new information is generated on it every minute. This is called big data, and it takes the form of text files, figures, images, audio tracks, and videos. Big data is highly unstructured and is spread across many different storage systems. This information has huge potential for the development of modern business and science.

Yet, even though companies have access to huge amounts of information, the majority of them simply cannot analyze it. Big data is hard to process with traditional software and hard to represent as a traditional database. Companies often lack the tools needed to find the connections in the data and to draw meaningful conclusions. This is what drives the demand for big data analytics applications.

Big data analysis is one of the most popular and important branches of computer science. Automating information processing makes far more of this information usable, and the results can be applied in science, business, and other industries.

In this article, you will learn how big data processing is applied in the media industry. We will also show how we developed a news processing app, using one of our projects as an example.

Big data in the modern world

The term became widely known in 2008, when Clifford Lynch published the article “Big data: How do your data grow?” in the journal Nature. In it, he discussed the concept of big data, referring to the increasing volumes of information available. The article was aimed at scientists, yet a year later the term was widely used in the business environment and even among the general public.

The phenomenon of big data had a strong impact on the development of computer science. Notions such as big data analytics and big data analysis grew out of the term. The branch of computer science dealing with big data problems is called data science. Let’s see what these terms mean:

Big data analytics is the logical analysis of information using mathematical and statistical methods. Its main aim is to find patterns and correlations in the datasets.

Big data analysis means inspecting the datasets in depth in order to extract useful information and to reach meaningful conclusions and decisions.

Data science is an interdisciplinary science combining statistics, mathematics, and programming. It applies the scientific approach to data processing and is in charge of developing the tools and the architecture of big data applications.

The algorithms of big data processing

Data science means developing techniques for processing huge amounts of unstructured data. Algorithms are one of the most important aspects of this area. Some of them are purely mathematical, while others are inspired by biological systems. Let’s learn more about them.

Stochastic Algorithms

These are derived from probability theory. Stochastic algorithms use random variables to approach a problem in several different ways, any of which can lead to the desired result. In big data processing, they are used to speed up processing by trying several approaches to a task instead of a single fixed one.
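
To make the idea concrete, here is a minimal Ruby sketch of one common stochastic technique, random sampling: instead of scanning a whole dataset, we estimate its mean from a random sample. The method name `sampled_mean` and the data are our own illustration, not part of any particular project.

```ruby
# Estimate the mean of a large collection from a random sample
# (drawn with replacement) instead of scanning every element.
def sampled_mean(values, sample_size, rng: Random.new)
  sample = Array.new(sample_size) { values[rng.rand(values.length)] }
  sample.sum.to_f / sample_size
end

values = (1..100_000).to_a                 # stand-in for a large dataset
exact = values.sum.to_f / values.length
estimate = sampled_mean(values, 2_000, rng: Random.new(42))
puts "exact mean: #{exact}, sampled estimate: #{estimate.round(1)}"
```

The estimate is not exact, but it touches only 2% of the data. Passing a seeded `Random` makes the run reproducible.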

Evolutionary Algorithms

They are inspired by Darwin’s evolutionary theory and the process of biological evolution. The algorithm produces a set of candidate solutions, and the best of them are kept as the starting point for the next round. This process repeats, and the quality of the results tends to improve with each generation.
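
The loop described above can be sketched in a few lines of Ruby. This is a deliberately simplified toy: it uses selection and mutation only (no crossover), and the task is the classic "OneMax" problem of evolving a bit string that contains as many ones as possible. All names and parameters are our own assumptions.

```ruby
# Fitness: number of 1-bits in the candidate (the "OneMax" toy problem).
def fitness(bits)
  bits.sum
end

def evolve(length: 20, population_size: 30, generations: 40,
           mutation_rate: 0.05, rng: Random.new)
  population = Array.new(population_size) { Array.new(length) { rng.rand(2) } }
  generations.times do
    # Selection: the fitter half survives unchanged (elitism).
    survivors = population.max_by(population_size / 2) { |c| fitness(c) }
    # Reproduction: refill the population with mutated copies of the survivors.
    children = survivors.map do |parent|
      parent.map { |bit| rng.rand < mutation_rate ? 1 - bit : bit }
    end
    population = survivors + children
  end
  population.max_by { |c| fitness(c) }
end

best = evolve(rng: Random.new(1))
puts "best fitness: #{fitness(best)} out of 20"
```

Because the survivors are carried over unchanged, the best fitness can never get worse from one generation to the next.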

Physical Algorithms

They are inspired by physical processes. A well-known example is simulated annealing, which borrows its idea from the gradual cooling of metals. Such algorithms are typically used for optimization problems whose search space is too large to explore exhaustively, which makes them a good choice for the volumes of data found in data science.

Probabilistic Algorithms

These are also based on probability theory. They estimate the probability of an event. Based on these estimates, the system makes predictions and outputs general statistics.

Swarm Intelligence Algorithms

These algorithms are inspired by collective behavior in nature, such as colonies of ants and flocks of birds. This is decentralized, self-organizing intelligence: many simple agents, each with its own experience and knowledge, analyze information from different sources and share what they find. Combining their results allows more precise predictions to be made.

Immune System Algorithms

These algorithms are inspired by the functioning of the human immune system. They are capable of classifying and clustering data, detecting anomalies, and solving search and optimization problems.

Neural Algorithms

This is a system of artificial neurons: simple computational units that imitate the work of the human nervous system. Like biological neurons, the units accept signals and pass them on to other units. The more data the network is trained on, the better its results tend to become. A trained neural network is capable of solving difficult tasks that are beyond the power of ordinary algorithms.
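
As an illustration, here is a single artificial neuron (a perceptron) written in plain Ruby and trained to reproduce the logical AND function. It is a minimal sketch of the principle, not a production network; the class name and parameters are our own.

```ruby
# A single artificial neuron: weighted sum of inputs followed by a step function.
class Perceptron
  def initialize(input_count, learning_rate: 0.1)
    @weights = Array.new(input_count, 0.0)
    @bias = 0.0
    @learning_rate = learning_rate
  end

  def predict(inputs)
    sum = @bias + inputs.zip(@weights).sum { |x, w| x * w }
    sum > 0 ? 1 : 0
  end

  # Perceptron learning rule: nudge the weights in proportion to the error.
  def train(samples, epochs: 20)
    epochs.times do
      samples.each do |inputs, target|
        error = target - predict(inputs)
        @weights = @weights.each_with_index.map do |w, i|
          w + @learning_rate * error * inputs[i]
        end
        @bias += @learning_rate * error
      end
    end
    self
  end
end

AND_SAMPLES = [[[0, 0], 0], [[0, 1], 0], [[1, 0], 0], [[1, 1], 1]].freeze
neuron = Perceptron.new(2).train(AND_SAMPLES)
AND_SAMPLES.each { |inputs, _| puts "#{inputs.inspect} -> #{neuron.predict(inputs)}" }
```

The neuron starts with zero weights and only learns the correct answers by seeing the samples repeatedly, which is the "more data, better results" effect in miniature.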

The use of text processing in media

Text processing in the news industry is a major big data use case. As mentioned above, big data comes as text, figures, images, audio, and video. Since text carries the largest share of relevant information presented in context, text processing has developed into a separate branch of big data analysis.

Why does the news industry use big data analytics applications so intensively? Human journalists cannot analyze such volumes of information quickly, while a machine can cope with them far more efficiently. Modern algorithms can not only extract data from a text, but also understand the content, determine the author’s attitude toward an event, and group content according to preselected criteria. To learn more about using artificial intelligence in media, please refer to this article.

Big data algorithms can analyze text, extract the relevant information, draw general conclusions, and more. Let’s look at the main ways big data is processed in the media industry:

Sentiment Analysis

This determines the author’s attitude towards the subject of their content. In media, sentiment analysis helps to identify the opinions of certain groups of people, generalize them, and spot trends.
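
The simplest form of sentiment analysis is lexicon-based: each known word carries a positive or negative weight, and the weights are summed. The sketch below uses a tiny word list that we made up for illustration. Real systems rely on much larger curated lexicons or trained models.

```ruby
# Tiny illustrative lexicon (an assumption for this example only).
SENTIMENT_LEXICON = {
  "gain" => 1, "growth" => 1, "strong" => 1, "record" => 1,
  "loss" => -1, "weak" => -1, "fall" => -1, "crisis" => -1
}.freeze

# Sum the weights of all known words; unknown words count as 0.
def sentiment_score(text)
  words = text.downcase.scan(/[a-z']+/)
  words.sum { |word| SENTIMENT_LEXICON.fetch(word, 0) }
end

def sentiment_label(text)
  score = sentiment_score(text)
  return "positive" if score > 0
  return "negative" if score < 0
  "neutral"
end

puts sentiment_label("Strong growth lifts shares to a record high") # positive
puts sentiment_label("Shares fall as crisis deepens")               # negative
```

This approach ignores negation and context ("not strong" still counts as positive), which is one reason why production systems use more advanced models.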

Topic Modeling

This analyzes the words used in a set of documents in order to determine the topic of each one. With this information, the algorithm can find the most popular topics, identify correlations (e.g. with the author or location), and group the documents according to the necessary criteria.

Term Frequency – Inverse Document Frequency

This measures how important a particular term is to a document. First, the system counts how often the term occurs in each separate document (term frequency), and then it scales that value down for terms that occur in many documents of the collection (inverse document frequency). This helps the system to figure out the importance of the term in a particular text, and to classify and rank the content according to this criterion. In one of our projects, we implemented a feature that groups the news into clusters using a TF/IDF vectorizer. You are welcome to find out more in our recent article.
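
Here is a minimal sketch of the calculation in plain Ruby, using the common textbook formula: term frequency multiplied by the natural logarithm of (number of documents / number of documents containing the term). This is an illustration of the technique, not the vectorizer used in our project, and libraries often apply slightly different smoothing.

```ruby
# Split text into lowercase word tokens.
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Returns one Hash per document, mapping term => TF-IDF weight.
def tf_idf(documents)
  tokenized = documents.map { |doc| tokenize(doc) }

  # Document frequency: in how many documents each term appears.
  doc_freq = Hash.new(0)
  tokenized.each { |tokens| tokens.uniq.each { |term| doc_freq[term] += 1 } }

  tokenized.map do |tokens|
    tokens.tally.to_h do |term, count|
      tf = count.to_f / tokens.length
      idf = Math.log(documents.length.to_f / doc_freq[term])
      [term, tf * idf]
    end
  end
end

headlines = [
  "Bank raises interest rates",
  "Bank reports record profit",
  "Oil prices fall"
]
weights = tf_idf(headlines)
puts weights[0].sort_by { |_, w| -w }.map(&:first).inspect
```

In the first headline, "bank" gets a lower weight than "rates" because it also appears in the second headline. A term that appears in every document gets a weight of zero.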

Named Entity Recognition

This means recognizing proper nouns, and it can be used for extracting the names of people, organizations, and locations. While analyzing the text, NER algorithms take into account the context in which a word is used, its capitalization, and the surrounding punctuation. By analyzing capitalization, NER can distinguish abbreviations from proper nouns and replace the abbreviations with the full terms.
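
To show why capitalization is such a useful signal, here is a deliberately naive Ruby sketch that treats every run of capitalized words as an entity candidate. It is only a heuristic under our own assumptions: a real NER system also uses context and trained models, and this version wrongly picks up ordinary words at the start of a sentence.

```ruby
# Naive heuristic: a run of capitalized words is treated as one entity candidate.
ENTITY_PATTERN = /\b[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*\b/

def entity_candidates(text)
  text.scan(ENTITY_PATTERN).uniq
end

sentence = "Shares of Apple rose after Tim Cook met officials in New York."
puts entity_candidates(sentence).inspect
# ["Shares", "Apple", "Tim Cook", "New York"]
```

The false positive "Shares" illustrates the limitation: capitalization alone cannot tell a sentence-initial word from a name, which is why context matters.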

Event Extraction

This is a more advanced technique than NER. Event extraction not only identifies the entities in a text, but also determines the logical relations between them, which makes it possible to draw meaningful conclusions.

Our experience of text processing with Ruby

CityFALCON

CityFALCON is a big data analysis application aimed at people in the financial sector: business people, professional and amateur investors, traders, etc. The service provides a personalized news feed on the selected topic based on the user’s interests, search history, and preferences. The app is powered by machine-learning algorithms that provide more relevant information each time the service is used.

The goal of the project is to save investors’ time and simplify the process of decision-making.

The process of development

When we joined the project, it already had a mature MVP. To do our work well, we first had to explore the client’s goals and understand the business logic of the application. Our work with CityFALCON can be divided into four stages.

1. Initial stage

Analysis of client’s business

We deeply explored the fintech sector and studied the latest trends of this business. This helped us to fully adapt to the client’s business and to act not only as a development team but also as business consultants.

Architecture planning

This stage included designing the logic of data interactions. We classified the sources into different types and defined the features common to all of them. This decision helped us to aggregate the different data sources and provide the same client-side interface for any type of data.
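
One common way to get the same interface for any type of data is the adapter pattern. The sketch below is hypothetical: the class names, field names, and source types are our own illustration and do not reflect the actual CityFALCON code.

```ruby
# Common shape that every source adapter produces (field names are illustrative).
Article = Struct.new(:title, :body, :source, :published_at, keyword_init: true)

# Hypothetical adapter for RSS-like items represented as Hashes.
class RssAdapter
  def initialize(items)
    @items = items
  end

  def articles
    @items.map do |item|
      Article.new(title: item["title"], body: item["description"],
                  source: "rss", published_at: item["pubDate"])
    end
  end
end

# Hypothetical adapter for short social posts, which have no separate title.
class SocialPostAdapter
  def initialize(posts)
    @posts = posts
  end

  def articles
    @posts.map do |post|
      Article.new(title: post[:text][0, 40], body: post[:text],
                  source: "social", published_at: post[:created_at])
    end
  end
end

# Consumers depend only on #articles, not on the type of the source.
def aggregate(adapters)
  adapters.flat_map(&:articles)
end

feed = aggregate([
  RssAdapter.new([{ "title" => "Rates rise", "description" => "Details", "pubDate" => "2017-05-01" }]),
  SocialPostAdapter.new([{ text: "Markets are up today", created_at: "2017-05-02" }])
])
feed.each { |article| puts "#{article.source}: #{article.title}" }
```

Adding a new type of source then means writing one more adapter, while the code that consumes the articles stays unchanged.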

2. Improved MVP

Building a scalable architecture

While working on improving the MVP performance, we paid attention to the scalability of the application. In the future, this will help us to avoid problems with expanding the architecture and adding new features.

News processing

We developed the app’s handling of text data: identifying the author of an article and its date of publication, and understanding its content. The articles are formatted and filtered according to the selected criteria.

Basic Scoring algorithm

For the application to output the most relevant information, we developed an algorithm that defines the rank of the articles and sorts them according to their rating and relevance to the topic.
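
The article does not describe the actual formula, so the following Ruby sketch is purely hypothetical. It assumes that a score combines two factors: relevance (the share of topic keywords found in the text) and recency (an exponential decay with an assumed half-life of 24 hours).

```ruby
# Relevance: share of the topic's keywords that occur in the article text.
def relevance(text, keywords)
  return 0.0 if keywords.empty?
  words = text.downcase.scan(/[a-z]+/)
  keywords.count { |k| words.include?(k.downcase) }.to_f / keywords.length
end

# Recency: exponential decay with a hypothetical half-life in hours.
def recency(age_hours, half_life_hours: 24.0)
  0.5**(age_hours / half_life_hours)
end

def score(article, keywords)
  relevance(article[:text], keywords) * recency(article[:age_hours])
end

# Highest score first.
def rank(articles, keywords)
  articles.sort_by { |article| -score(article, keywords) }
end

articles = [
  { title: "A", text: "Oil prices rise", age_hours: 48 },
  { title: "B", text: "Oil and gas prices", age_hours: 2 },
  { title: "C", text: "Football results", age_hours: 1 }
]
puts rank(articles, %w[oil prices]).map { |a| a[:title] }.inspect
# ["B", "A", "C"]
```

Article B wins because it is both relevant and fresh, A is relevant but two days old, and C is fresh but off-topic.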

3. Private Beta

Enterprise API

We integrated an API allowing third-party services to download the data in bulk in a way that is convenient for the user. At the same time, the bulk loading does not affect the performance of the entire application.

Scoring algorithm improvement

More advanced techniques for scoring and ranking the articles were implemented. This improved the quality of the content presented in the user’s news feed.

4. Public Beta

Building a scalable infrastructure

This is to ensure that performance stays stable regardless of the user’s location or the increasing load on the application. If more capacity is needed, new servers are switched on in the necessary location. This practice also helped us lower the response time of the application.

Introducing voice devices

In order to distribute the information more widely, we integrated the application with the most popular voice devices, such as Microsoft Cortana, Amazon Alexa, and Google Phone. The app became capable of understanding voice commands and exporting the information to the user’s device.

Increasing the quantity of topic coverage

The application is capable of processing more information from different sources and covering a greater number of topics.

The results of the project

The development and consulting services offered to CityFALCON by our company helped them focus on promoting their product to the target audience. Their team managed to win the Twitter Hatch competition and Twitter’s Global Start-Up Competition in 2015.

A year later, the CityFALCON team launched a crowdfunding campaign, and managed to collect £150,000 from about 120 investors. Later, they became a finalist in the Amazon Growing Business Awards, and were nominated for “Digital Business of the Year”.

After launching support for voice devices in 2017, they took part in VivaTechnology, a Paris conference for startups handled by BNP Paribas bank. In the same year, they took part in Kickstart Accelerator in Zurich.

Now the CityFALCON API is used by several international banks.

Conclusion

After reading this article, you have learned a lot about the tools and methods of data science and their practical implementation in the news industry. If you are enthusiastic about using big data analytics in media, let’s bring your dream to life together!