A large share of the data we can access today has been generated over the years on a daily, and presumably hourly, basis. We generate many millions of bytes of data across multiple platforms: social media, transactional records in banks, high-quality digital images, motion pictures and much more. It is therefore fitting that companies develop more sophisticated big data solutions that provide a robust, affordable, scalable and flexible way of capturing, processing and storing the data they generate. Organizations need to move from simple, traditional ways of capturing data to approaches that can handle the increasingly complex and unstructured data that has emerged with time and technological advancement. It is also important to know which architecture to use and which resources to employ alongside it, because data today is no longer limited to figures stored in relations; photos, voice notes, documents in their various forms and videos also require storage.
EXPLANATION OF BIG DATA ARCHITECTURE AND ITS TECHNOLOGIES
DEFINITION OF BIG DATA
Big data refers to data quantities so massive that they cannot be stored and processed within a given time frame using a simple, traditional database management system and its approach. It typically refers to data measured in petabytes or more (e.g. petabytes, exabytes, zettabytes), a scale that creates difficulties in storing, analysing and visualising the data. Its volume outweighs the resources used to store or process it. This type of data is largely not transactional; it is either user generated or generated by machines, including those driven by artificial intelligence.
BIG DATA ARCHITECTURE
Big data architecture, the basis for big data analytics, is the outcome of the interaction of big data application resources. These resources, or database technologies, are put together to achieve high performance, high fault tolerance and scalability. The architecture depends on the resources the organization has and on its data environment.
A big data architecture is designed to handle the ingestion, processing and analysis of data that is too large or complex for traditional database systems. Solutions normally involve batch processing of big data sources at rest, real-time processing of big data in motion, interactive exploration of this data, and predictive analytics and machine learning.
Most big data architectures include some or all of the following components:
Data source: A solution may draw on a single data source or on many, depending on the amount of data the organisation creates. These range from application data stores such as databases to files produced by applications, like web server log files.
Data storage: Data for batch processing operations is typically written to a distributed file store that can hold immense quantities of data in various formats, commonly referred to as a data lake.
Batch processing: The solution must process data using reliable batch jobs that filter, aggregate and otherwise prepare it for analysis. These jobs involve reading source files, processing them and writing the output to new files.
Real-time message recording: If the solution involves real-time sources, the architecture should include a way to capture and store real-time messages for stream processing.
Analytical data store: The solution should prepare data for analysis and then serve the processed data in a structured format that can be queried easily with analytical tools.
Orchestration: Orchestration technology can be used to coordinate and automate solutions built from repeated data processing operations that ingest and transform data, load it into a data store and assemble the output into a report.
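The batch processing component described above (read source files, process them, write the output to new files) can be illustrated with a minimal Python sketch. It is not tied to any particular big data product. The function name run_batch_job and the three-field log layout ("<ip> <path> <status>") are assumptions made purely for illustration.

```python
import csv
from collections import Counter
from pathlib import Path


def run_batch_job(source_dir: Path, output_file: Path) -> Counter:
    """Read every *.log file in source_dir, count requests per HTTP status,
    and write the aggregated result to output_file as CSV."""
    status_counts: Counter = Counter()
    for log_path in sorted(source_dir.glob("*.log")):
        with log_path.open(encoding="utf-8") as handle:
            for line in handle:
                fields = line.split()
                # Assumed layout: "<ip> <path> <status>"; skip malformed lines.
                if len(fields) != 3 or not fields[2].isdigit():
                    continue
                status_counts[fields[2]] += 1

    # Write the prepared data to a new file, leaving the sources untouched.
    with output_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["status", "requests"])
        for status, count in sorted(status_counts.items()):
            writer.writerow([status, count])
    return status_counts
```

In a real architecture the source directory would be a location in the distributed file store and the job would be scheduled by the orchestration component; here both are ordinary local paths.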
EXAMPLES OF BIG DATA ARCHITECTURE
Internet of Things (IoT) architecture: There is no single, universally agreed architecture for the Internet of Things, so multiple architectures have been proposed. These include three-layer and five-layer architectures, cloud-based and fog-based architectures, social IoT, representative architecture and many others. A basic IoT architecture has sensor/device, edge, data intelligence and application layers stacked one over the other, each carrying out its own tasks and containing sub-layers.
Lambda architecture: A data processing architecture designed to handle large quantities of data by making use of both batch and stream processing methods. This approach attempts to balance latency, throughput and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The rise of lambda architecture corresponds with the growth of big data and real-time analytics, and with the drive to mitigate the latencies of MapReduce.
Lambda architecture depends on a data model with an append-only, immutable data source that serves as a system of record. It is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. It has three layers:
1. The Batch Layer manages the master data and precomputes the batch views.
2. The Speed Layer serves recent data only and increments the real-time views.
3. The Serving Layer is responsible for indexing and exposing the views so that they can be queried.
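The interplay of the three layers listed above can be sketched in a few lines of Python. This is a single-process illustration under simplifying assumptions, not a production design: the class name LambdaSketch, the Event type and the page-view counting use case are all hypothetical, and a real system would run each layer on separate infrastructure.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """An immutable, timestamped event (hypothetical page-view record)."""
    timestamp: int
    page: str


@dataclass
class LambdaSketch:
    master_log: list = field(default_factory=list)         # append-only system of record
    batch_view: Counter = field(default_factory=Counter)   # precomputed by the batch layer
    realtime_view: Counter = field(default_factory=Counter)  # maintained by the speed layer

    def ingest(self, event: Event) -> None:
        # New events are appended; existing events are never overwritten.
        self.master_log.append(event)
        # Speed layer: incrementally update the real-time view.
        self.realtime_view[event.page] += 1

    def run_batch(self) -> None:
        # Batch layer: recompute the view from the complete master data.
        self.batch_view = Counter(e.page for e in self.master_log)
        # Every logged event is now covered by the batch view, so the
        # real-time view only needs to track events that arrive afterwards.
        self.realtime_view = Counter()

    def query(self, page: str) -> int:
        # Serving layer: merge the batch view with the real-time view.
        return self.batch_view[page] + self.realtime_view[page]
```

A query always combines both views, so results stay up to date between batch runs even though the batch view itself is only refreshed periodically.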
The three layers are outlined in the below diagram along with a sample choice of technology stacks:
Hadoop architecture: Working with Hadoop requires considerable knowledge of every layer in the Hadoop stack, from understanding the various components of the Hadoop architecture and designing a Hadoop cluster, to tuning its performance and setting up the chain responsible for processing data.
It follows a master-slave architecture design for storing data and processing distributed data, using HDFS and MapReduce respectively. The NameNode is the master node for data storage in HDFS, and the JobTracker is the master node for parallel processing of data using Hadoop’s MapReduce. The slave nodes are the other machines in the Hadoop cluster, which store data and carry out complex computations. Each slave node runs a TaskTracker daemon and a DataNode, which synchronize their processes with the JobTracker and NameNode respectively. In a Hadoop implementation, the master and slave systems can be set up in the cloud or on-premises.
Image Credit : OpenSource.com
BIG DATA TECHNOLOGIES
Big data technologies are the means through which challenges in data analytics, visualization and storage are tackled. The problems brought about by big data’s volume, variety and velocity call for new technology solutions. The most prominent and widely used big data technology is Hadoop, an open source project of the Apache Software Foundation. This open source library was created with a focus on scalable, reliable, distributed and flexible computing systems that can handle big data. Hadoop is made up of two core components that work hand in hand.
The first is the Hadoop Distributed File System (HDFS), which provides the high-bandwidth storage necessary for big data computing.
The second component is a data processing framework known as MapReduce. It distributes huge data sets, such as those handled by search engines (e.g. Google's search technology), across many servers, each of which processes the part of the data set it receives and creates a summary before more traditional analysis tools are used. The distribution and the summary creation are the “map” and “reduce” steps respectively.
Hadoop and various other big data tools have evolved to solve the challenges faced in the big data environment. These tools can be classified into the following categories:
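The map and reduce steps can be illustrated with the classic word-count example. The sketch below runs in a single Python process and does not use any Hadoop API; the function names are chosen for illustration only. In a real cluster, each input split would be mapped on a different server and the grouping step would move data across the network.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: list):
    """Reduce: summarise all values for one key into a single result."""
    return word, sum(counts)


def map_reduce(documents):
    """Run map, shuffle and reduce over a collection of input splits."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())
```

For example, map_reduce(["Big data", "big deal"]) returns {"big": 2, "data": 1, "deal": 1}: the summary is far smaller than the input, which is what makes later analysis with traditional tools practical.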
1. Data Storage and Management
Examples include NoSQL databases (MongoDB, CouchDB, Cassandra, HBase, Neo4j), Talend, Apache Hadoop, Apache ZooKeeper etc.
2. Data Cleaning
Examples include MS Excel, OpenRefine etc.
3. Data Mining
The process of discovering insights in a database. Examples include RapidMiner, Teradata etc.
NOSQL
NoSQL is a collection of concepts that enable efficient, effective and rapid processing of data sets, and is characterised by reliability, scalability, flexibility, agility and performance. The name ‘NoSQL’, short for “not SQL” or rather “not only SQL”, does not mean that SQL is excluded. NoSQL systems can use SQL as well as other query languages.
NoSQL is an evolution in databases that marks a shift away from the popular relational database management systems (RDBMS). To explain NoSQL, it helps to first explain SQL, the Structured Query Language employed by RDBMS. These databases depend on relations (tables), rows, columns and schemas to organize and retrieve data.
In comparison, NoSQL does not rely on the latter. It uses more flexible data models instead. Because relational database management systems have largely been unable to meet the flexibility, performance and scalability needs of data-intensive next-generation applications, NoSQL databases have been embraced by many mainstream organizations to address the shortcomings of RDBMS.
NoSQL is specifically used to store data that is unstructured, grows much more rapidly than structured data and does not fit into RDBMS tables. Common examples of unstructured data include:
Massive objects such as videos and images, chat/messaging and log-based data, user-entered and session-generated data and, lastly, time series (real-time) data such as IoT and device data.
TYPES OF NOSQL DATABASES
Several distinct types of NoSQL databases have been created to support particular needs and use cases. These fall into four main categories:
Key-value data stores: Key-value NoSQL databases emphasize simplicity and are very useful for accelerating an application to support high-speed read and write processing of non-transactional data. Stored values can be any type of binary object (text, video, JSON document, etc.) and are accessed via a key. The application has complete control over what is stored in the value, making this the most flexible NoSQL model. Data is partitioned and replicated across a cluster to achieve scalability and availability. For this reason, key-value stores often do not support transactions. However, they are highly effective at scaling applications that deal with high-velocity, non-transactional data.
Document stores: Document databases typically store self-describing JSON, XML and BSON documents. They are similar to key-value stores, but in this case a value is a single document that stores all data related to a specific key. Popular fields in the document can be indexed to provide fast retrieval without knowing the key. Each document can have the same or a different structure.
Wide-column stores: Wide-column NoSQL databases store data in tables with rows and columns like an RDBMS, but the names and formats of columns can vary from row to row across the table. Wide-column databases group columns of related data together. A query can retrieve related data in a single operation because only the columns associated with the query are retrieved. In an RDBMS, the data would be in different rows stored in different places on disk, requiring multiple disk operations for retrieval.
Graph stores: A graph database uses graph structures to store, map and query relationships. Graph stores provide index-free adjacency, so adjacent elements are linked together without using an index.
Multi-model databases use a combination of the four types described above and can therefore support a wider range of applications.
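To make the four categories concrete, the sketch below models each one with plain Python data structures. It does not use any real database API. The sample records, the column names such as "profile:name" and the helper functions build_index, read_columns and connect are all invented for illustration.

```python
import json
from collections import defaultdict

# 1. Key-value: the value is an opaque binary object, reachable only by key.
key_value = {"user:1": json.dumps({"name": "Ada"}).encode("utf-8")}

# 2. Document store: self-describing documents whose structure may differ.
documents = {
    "user:1": {"name": "Ada", "city": "London"},
    "user:2": {"name": "Linus", "city": "Helsinki", "languages": ["C"]},
}


def build_index(docs: dict, field_name: str) -> dict:
    """Index one field so documents can be found without knowing the key."""
    index = defaultdict(list)
    for key, doc in docs.items():
        if field_name in doc:
            index[doc[field_name]].append(key)
    return dict(index)


# 3. Wide-column: rows share a table, but each row has its own columns.
wide_rows = {
    "user:1": {"profile:name": "Ada", "profile:city": "London"},
    "user:2": {"profile:name": "Linus", "stats:logins": 7},
}


def read_columns(rows: dict, row_key: str, wanted: list) -> dict:
    """Retrieve only the requested columns of one row."""
    row = rows[row_key]
    return {column: row[column] for column in wanted if column in row}


# 4. Graph: each node holds direct references to its neighbours, so a
#    traversal follows references instead of consulting an index.
class Node:
    def __init__(self, name: str):
        self.name = name
        self.neighbours = []


def connect(a: Node, b: Node) -> None:
    """Link two nodes in both directions."""
    a.neighbours.append(b)
    b.neighbours.append(a)
```

Note how the key-value entry must be decoded by the application before any field can be read, whereas the document store can index the "city" field directly. That difference is the practical meaning of "opaque value" versus "self-describing document".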
BENEFITS OF NOSQL
NoSQL characteristics can address the challenges of big data in the following ways:
Scalability: NoSQL databases use a horizontal scale-out methodology that makes it easy to add or reduce capacity quickly and non-disruptively with commodity hardware. This eliminates the tremendous cost and complexity of manual sharding that is necessary when attempting to scale RDBMS.
Performance: By simply adding commodity resources, enterprises can increase performance with NoSQL databases. This enables organizations to continue to deliver reliably fast user experiences with a predictable return on investment for adding resources, again without the overhead associated with manual sharding.
High Availability: NoSQL databases are generally designed to ensure high availability and avoid the complexity that comes with a typical RDBMS architecture that relies on primary and secondary nodes. Some “distributed” NoSQL databases use a masterless architecture that automatically distributes data equally among multiple resources so that the application remains available for both read and write operations even when one node fails.
Global Availability: By automatically replicating data across multiple servers, data centers, or cloud resources, distributed NoSQL databases can minimize latency and ensure a consistent application experience wherever users are located. An added benefit is a significantly reduced database management burden from manual RDBMS configuration, freeing operations teams to focus on other business priorities.
Flexible Data Modeling: NoSQL offers the ability to implement flexible and fluid data models. Application developers can leverage the data types and query options that are the most natural fit to the specific application use case rather than those that fit the database schema. The result is a simpler interaction between the application and the database and faster, more agile development.
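The availability claim above, that a masterless cluster keeps serving reads and writes when one node fails, can be demonstrated with a small in-memory sketch. The class MasterlessCluster and its placement rule (hash the key, then store copies on consecutive nodes) are assumptions made for illustration and do not describe the placement strategy of any specific NoSQL product. The sketch also omits repairing a node's missed writes once it comes back online.

```python
import hashlib


class MasterlessCluster:
    """Toy cluster in which every node can serve reads and writes."""

    def __init__(self, node_names, replicas: int = 2):
        self.node_names = list(node_names)
        if not 1 <= replicas <= len(self.node_names):
            raise ValueError("replicas must be between 1 and the node count")
        self.replicas = replicas
        self.storage = {name: {} for name in self.node_names}
        self.down = set()  # names of nodes that are currently unavailable

    def _owners(self, key: str) -> list:
        # Hash the key to pick a starting node, then take the following
        # nodes as additional replicas (no node is special, i.e. masterless).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.node_names)
        return [
            self.node_names[(start + offset) % len(self.node_names)]
            for offset in range(self.replicas)
        ]

    def put(self, key: str, value) -> int:
        """Write to every live replica; return how many copies were written."""
        written = 0
        for name in self._owners(key):
            if name not in self.down:
                self.storage[name][key] = value
                written += 1
        return written

    def get(self, key: str):
        """Read from the first live replica that holds the key."""
        for name in self._owners(key):
            if name not in self.down and key in self.storage[name]:
                return self.storage[name][key]
        raise KeyError(key)
```

With two replicas per key, marking any single node as down still leaves one live copy of every key, so both get and put continue to succeed. Adding capacity in this toy version would remap many keys; real systems use more careful placement schemes to avoid that.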