What is Hive? Architecture & Modes
Hive is an Apache ETL and data warehousing tool built on top of the Hadoop Distributed File System (HDFS). Hive simplifies tasks such as:
Analysis of very large datasets
Important characteristics of Hive
The process of creating tables and databases in Hive comes first, followed by the loading of data into the tables that have been constructed.
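The create-then-load sequence described here can be sketched in HQL as follows. This is a minimal illustration, not a prescribed setup: the database name `sales_db`, the table `orders`, its columns, and the file path `/tmp/orders.csv` are all assumptions made for the example.

```sql
-- Hypothetical database and table names, chosen only for illustration.
CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;

-- Define the table first; the schema is recorded before any data arrives.
CREATE TABLE IF NOT EXISTS orders (
  order_id INT,
  customer STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Then load data into the table that was just created.
-- The path is an assumed local CSV file with matching columns.
LOAD DATA LOCAL INPATH '/tmp/orders.csv' INTO TABLE orders;
```

`LOAD DATA LOCAL` reads from the local file system of the client; without `LOCAL`, the path is interpreted as a location in HDFS.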
Hive is a data warehouse that was created specifically for the purpose of organising and querying solely structured data that is stored in tables.
In the context of working with structured data, MapReduce lacks optimization and usability features like UDFs, but the Hive framework possesses these capabilities. The term “query optimization” refers to a method of query execution that is efficient with regard to performance.
Hive's SQL-inspired language shields the user from the complexity of MapReduce programming. It makes learning easier by reusing familiar concepts from the relational database world, such as tables, rows, columns, and schema.
Hadoop programs work on flat files. Hive can therefore use directory structures to "partition" data and improve the performance of certain queries.
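As a rough sketch of how partitioning maps to directories, the HQL below declares a partitioned table and queries one partition. The table `page_views`, its columns, and the date value are hypothetical; the directory shown in the comment assumes the usual `column=value` layout under the table's storage location.

```sql
-- Hypothetical table; view_date is the partition column, not a data column.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING)
STORED AS TEXTFILE;

-- Each partition becomes its own subdirectory, for example
-- <table location>/view_date=2024-01-15/
ALTER TABLE page_views ADD PARTITION (view_date = '2024-01-15');

-- A filter on the partition column lets Hive read only that directory
-- instead of scanning every file of the table.
SELECT url
FROM page_views
WHERE view_date = '2024-01-15';
```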
An important component of Hive is the Metastore, which stores schema information. The Metastore typically resides in a relational database. We can interact with Hive through several channels, such as the Web GUI and the Java Database Connectivity (JDBC) interface.
Most interactions tend to take place over a command line interface (CLI). Hive provides a CLI for writing queries in Hive Query Language (HQL).
In general, HQL syntax is similar to the SQL syntax that most data analysts already know. The sample query below displays all the records present in the named table.
Sample query: SELECT * FROM <TableName>;
Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (Record Columnar File).
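The file format is chosen per table with the `STORED AS` clause. The sketch below uses hypothetical table names (`events_text`, `events_orc`) to show the same schema stored in two of the formats, and a copy from one to the other.

```sql
-- Same hypothetical schema, two different storage formats.
CREATE TABLE events_text (id INT, payload STRING) STORED AS TEXTFILE;
CREATE TABLE events_orc  (id INT, payload STRING) STORED AS ORC;

-- Copying rows rewrites them in the target table's format.
INSERT INTO TABLE events_orc
SELECT id, payload FROM events_text;
```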
For single-user metadata storage, Hive uses the Derby database. For multi-user or shared metadata, Hive uses MySQL.
Check out the tutorial on “Installation and Configuration of HIVE with MYSQL” for instructions on how to configure MySQL as a database and save meta-data information.
Some of the most important aspects of Hive are as follows:
The major difference between HQL and SQL is that a Hive query executes on Hadoop's infrastructure rather than on a traditional database.
A Hive query runs as a series of automatically generated MapReduce jobs.
Hive supports partitions and buckets, which make it easier to retrieve data when a client executes a query.
Hive supports custom UDFs (User Defined Functions) for data cleansing, filtering, and similar tasks. Hive UDFs can be defined according to the requirements of the programmers.
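To make the last two points concrete, here is a small HQL sketch of a bucketed table and of registering a custom UDF. Everything named here is an assumption for illustration: the table `users_bucketed`, the JAR path `/tmp/my-hive-udfs.jar`, the Java class `com.example.hive.NormalizeEmail`, and the function name `normalize_email`. The UDF class itself would have to be written and packaged separately.

```sql
-- Bucketing: rows are hashed on user_id into a fixed number of files.
CREATE TABLE users_bucketed (
  user_id BIGINT,
  email   STRING
)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC;

-- Registering a custom UDF from a (hypothetical) JAR for this session.
ADD JAR /tmp/my-hive-udfs.jar;
CREATE TEMPORARY FUNCTION normalize_email AS 'com.example.hive.NormalizeEmail';

-- Once registered, the UDF is called like a built-in function.
SELECT normalize_email(email)
FROM users_bucketed
LIMIT 10;
```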
Hive vs Relational Databases:
With Hive, we can perform some distinctive operations that are not achievable with relational databases. When the data runs to petabytes, being able to query it and get results quickly is important, and Hive handles this efficiently by processing queries in parallel across the cluster.
Now let’s have a look at what makes Hive’s speed so impressive.
The following are some of the most important distinctions that can be made between relational databases and Hive:
Relational databases follow a "schema on read and schema on write" model: a table is created first, and data is then inserted into that table. Insertions, updates, and modifications can all be performed on relational database tables.
Hive is "schema on read" only. Operations such as update and modification therefore do not work, because a Hive query in a typical cluster runs on multiple Data Nodes, and updating or modifying data spread across multiple nodes is not possible (in Hive versions earlier than 0.13).
Hive also supports the "read many, write once" pattern: data is written once and then read many times. In the latest Hive versions, a table can also be updated after data has been inserted.
NOTE: Newer versions of Hive come with additional features. Update and delete options were introduced as new features in Hive 0.14.
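As a hedged sketch of what update and delete look like in Hive 0.14 and later: these statements work only on transactional tables, which in those releases means a bucketed ORC table with the `transactional` property set. The table `employees_acid` and its values are hypothetical, and a real cluster may need further transaction-related settings (for example on the Metastore side) beyond the two shown.

```sql
-- Session settings commonly required for transactional tables.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical transactional table: bucketed, ORC, transactional=true.
CREATE TABLE employees_acid (
  id     INT,
  name   STRING,
  salary DOUBLE
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level changes, available from Hive 0.14 onward.
UPDATE employees_acid SET salary = 55000.0 WHERE id = 101;
DELETE FROM employees_acid WHERE id = 102;
```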
Hive Architecture
The figure above shows the components of the Apache Hive architecture in detail.
Hive consists mainly of three core parts:
Hive Clients
Hive Services
Hive Storage and Computing
Hive Clients:
Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication.
For Java-related applications, it provides JDBC drivers. For other types of applications, it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.
Hive Services:
Clients interact with Hive through Hive Services. To perform any query-related operation in Hive, a client has to communicate through Hive Services.
The command line interface (CLI) acts as the Hive service for Data Definition Language (DDL) operations. As the architecture diagram above shows, all drivers communicate with the Hive server and with the main driver in the Hive services.
The driver in the Hive services is the main driver, and it communicates with all types of JDBC, ODBC, and other client-specific applications. The driver processes the requests from these applications and forwards them to the Metastore and file systems for further processing.
Hive Storage and Computing:
Hive services such as the Metastore, File system, and Job Client in turn communicate with Hive storage and perform the following actions:
Metadata for the tables created in Hive is stored in the Hive Metastore database.
Query results and the data loaded into the tables are stored in the Hadoop cluster on HDFS.
Job execution flow:
The figure above illustrates the job execution flow in Hive with Hadoop.
The data flow in Hive follows this pattern:
Executing the query from the UI (User Interface).
The driver interacts with the compiler to get the plan. (Here, "plan" refers to the query execution process and the gathering of its related metadata.)
The compiler creates the plan for the job to be executed. To do so, the compiler sends a metadata request to the Metastore.
The Metastore sends the metadata back to the compiler.
The compiler returns the proposed plan for executing the query to the driver.
The driver sends the execution plan to the execution engine.
The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query. For DFS operations:
The EE first contacts the Name Node and then the Data Nodes to get the values stored in the tables.
The EE fetches the desired records from the Data Nodes. The actual table data resides only on the Data Nodes; from the Name Node, the EE fetches only the metadata for the query.
It collects the actual data related to the query from the Data Nodes.
The EE communicates bidirectionally with the Metastore in Hive to perform DDL (Data Definition Language) operations, such as creating, dropping, and altering tables and databases. The Metastore stores only information such as database names, table names, and column names, and the EE fetches the metadata relevant to the query.
The EE in turn communicates with Hadoop daemons such as the Name Node, Data Nodes, and Job Tracker to execute the query on top of the Hadoop file system.
Fetching results from the driver.
Sending results to the execution engine. Once the results have been fetched from the Data Nodes to the EE, they are sent back to the driver and then to the UI (front end). Through the execution engine, Hive stays in continuous contact with the Hadoop file system and its daemons. The dotted arrow in the job flow diagram represents the communication between the execution engine and the Hadoop daemons.
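The plan that the compiler produces in the flow above can be inspected with Hive's `EXPLAIN` statement, which prints the stages Hive intends to run without executing the query. The sketch below assumes a hypothetical table `orders` with `customer` and `amount` columns.

```sql
-- Show the execution plan (stages and operators) instead of running the query.
-- The table and column names are assumptions for illustration.
EXPLAIN
SELECT customer, SUM(amount) AS total_amount
FROM orders
GROUP BY customer;
```

The exact shape of the output depends on the Hive version and the execution engine in use, so it is not reproduced here.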
Different modes of Hive
Hive can operate in two modes, depending on the number of data nodes in Hadoop.
These modes are:
Local mode
Map reduce mode
When to use Local mode:
If Hadoop is installed in pseudo-distributed mode with a single data node, we use Hive in this mode.
If the data is small enough to fit on a single local machine, we can use this mode.
Processing is very fast on smaller data sets that are present on the local machine.
When to use Map reduce mode:
If Hadoop has multiple data nodes and the data is distributed across them, we use Hive in this mode.
It works on large data sets, and queries execute in parallel.
This mode makes it possible to process large data sets with better performance.
In Hive, a property setting determines which mode Hive works in. By default, it works in Map Reduce mode. To make Hive work in local mode, set:
SET mapred.job.tracker=local;
Since version 0.7, Hive has also supported a mode that runs map-reduce jobs in local mode automatically.
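A minimal sketch of enabling the automatic local mode is shown below. `hive.exec.mode.local.auto` is the switch itself; the size threshold is shown with what is understood to be its default value (128 MB), so treat the number as an assumption to check against the installed Hive version.

```sql
-- Let Hive decide per query whether to run the job locally.
SET hive.exec.mode.local.auto=true;

-- Upper bound on total input size for a query to qualify for local mode
-- (134217728 bytes = 128 MB; assumed default, adjust as needed).
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
```

With these settings, small queries run on the local machine while larger ones still go to the cluster.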
What is Hive Server2 (HS2)?
HiveServer2 (HS2) is a server interface that performs the following functions:
Enables remote clients to execute queries against Hive
Retrieves the results of those queries
The latest version adds some advanced features based on Thrift RPC, such as:
Multi-client concurrency
Authentication