Saturday, January 24, 2015

Apache Hive : Indexes

Indexes are special lookup structures that the database search engine can use to speed up data retrieval; an index is essentially a pointer to data in a table. Indexes are implemented using various data structures, e.g. B-Tree, bitmap, etc. Hive has limited indexing capabilities, but you can speed up some data retrieval operations using indexes; the index data itself is stored in another table in Hive.

Other ways to speed up your queries besides indexing are partitioning and bucketing, which we will cover in coming posts.
Indexing is also a good alternative to partitioning when the logical partitions would be too numerous and too small to be practical. Indexes in Hive, like those in relational databases, need to be evaluated carefully.

Maintaining an index requires extra disk space and building an index has a processing cost.

Creating an Index
hive> create index customer_index on table customers(custid) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
> with deferred rebuild
> in table customer_index_table
> COMMENT 'Customer Index on custid';
OK
Time taken: 5.713 seconds

Creating a Bitmap Index
hive> create index customer_index_bitmap on table customers(custid)
> as 'BITMAP'
> with deferred rebuild
> COMMENT 'Index on customer table on custid using bitmap';
OK
Time taken: 0.603 seconds

Viewing an Index
hive> show formatted index on customers;                       
OK
idx_name tab_name col_names idx_tab_name idx_type comment


customer_index customers custid customer_index_table compact Customer Index on custid
customer_index_bitmap customers custid orderdb__customers_customer_index_bitmap__ bitmap Index on customer table on custid using bitmap
Time taken: 0.988 seconds, Fetched: 5 row(s)

Rebuilding an Index
hive> alter index customer_index on customers rebuild;      
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201501241609_0022, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0022
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201501241609_0022
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-01-24 18:32:39,926 Stage-1 map = 0%, reduce = 0%
2015-01-24 18:32:41,975 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:42,978 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:43,982 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:45,011 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:46,071 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:47,078 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:48,093 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:49,159 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:50,245 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 0.81 sec
2015-01-24 18:32:51,305 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.24 sec
2015-01-24 18:32:52,309 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.24 sec
2015-01-24 18:32:53,385 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.24 sec
MapReduce Total cumulative CPU time: 2 seconds 240 msec
Ended Job = job_201501241609_0022
Loading data to table orderdb.customer_index_table
Deleted hdfs://RKS-PC:54310/user/hive/warehouse/orderdb.db/customer_index_table
Table orderdb.customer_index_table stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 8567, raw_data_size: 0]
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 2.24 sec HDFS Read: 8569 HDFS Write: 8567 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 240 msec
OK
Time taken: 42.379 seconds
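
Once rebuilt, you can let Hive's optimizer use the index automatically when a query filters on custid. A minimal sketch, assuming a Hive version (such as 0.11 used here) where the hive.optimize.index.filter settings are available:

hive> set hive.optimize.index.filter=true;
hive> set hive.optimize.index.filter.compact.minsize=0;
hive> select * from customers where custid = 1009;

With these settings, a query with a predicate on the indexed column can read the compact index table instead of scanning the whole customers table.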

Dropping an Index
hive> drop index if exists customer_index on customers;
OK
Time taken: 3.267 seconds




Apache Hive : HiveQL View

A view is a virtual table based on the result set of a SQL statement. A view allows a query to be saved and treated like a table.

A view contains rows and columns, just like a real table. The fields in a view are fields from one or more real tables in the database.
You can add SQL functions, WHERE clauses, and JOINs to a view and present the data as if it were coming from one single table. Currently Hive does not support materialized views.

Let's create a view on the join of the customers and orders tables.

hive> create view customer_order as select c.custid,c.custname,o.orderid,o.tracking_id from customers c inner join orders o on c.custid=o.fk_cust_id;
OK
Time taken: 1.352 seconds
hive> select * from customer_order;
Total MapReduce jobs = 1
setting HADOOP_USER_NAME rks
Execution log at: /tmp/rks/.log
2015-01-24 05:40:55 Starting to launch local task to process map join; maximum memory = 932118528
2015-01-24 05:40:56 Processing rows: 101 Hashtable size: 101 Memory usage: 7985904 rate: 0.009
2015-01-24 05:40:56 Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-40-47_808_6049446203223532344/-local-10002/HashTable-Stage-3/MapJoin-mapfile31--.hashtable
2015-01-24 05:40:56 Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-40-47_808_6049446203223532344/-local-10002/HashTable-Stage-3/MapJoin-mapfile31--.hashtable File size: 6249
2015-01-24 05:40:56 End of local task; Time Taken: 0.895 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201501241609_0021, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0021
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201501241609_0021
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2015-01-24 17:41:02,863 Stage-3 map = 0%, reduce = 0%
2015-01-24 17:41:05,872 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
2015-01-24 17:41:06,875 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
2015-01-24 17:41:07,883 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 0.94 sec
MapReduce Total cumulative CPU time: 940 msec
Ended Job = job_201501241609_0021
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.94 sec HDFS Read: 8569 HDFS Write: 425 SUCCESS
Total MapReduce CPU Time Spent: 940 msec
OK
1009 Declan Hooper 745651 ULQ37MGX7MW
1018 Nathan Mcintyre 745652 NJJ84QEM7GO
1027 Troy Griffith 745653 UHX76SFB1EP
1036 Clark Frazier 745654 ZZH74UDJ6IC
1045 Tad Cross 745655 TFV46VBX1ZI
1054 Gannon Bradshaw 745656 EHJ68BHA6UU
1063 Walter Figueroa 745657 BNU38NFJ6FO
1072 Brady Mcclure 745658 NBK17XMP9XC
1081 Porter Bowers 745659 XHB61DLY6IK
1090 Jakeem Knight 745660 WNN67FXM2NC
1099 Macaulay Armstrong 745661 VXI39DIZ3HU
Time taken: 20.208 seconds, Fetched: 11 row(s)
hive>
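
Views are managed with the usual DDL statements. A quick sketch of inspecting and removing the view created above:

hive> describe formatted customer_order;
hive> drop view if exists customer_order;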


Apache Hive : Joining datasets

Hive supports SQL joins, but only equi-joins. A SQL join combines two or more tables based on some criteria; the most commonly used join is the inner join, which returns all rows from both datasets where the join condition is met.
Let's see how an inner join works in Hive. To demonstrate it we have two datasets; the first one is the customers dataset, which holds information about customers.
customers.csv
cust_id,cust_name,ship_address,phone,email
1001,Sawyer Thompson,"-16.44456  115.92975",1-808-310-6814,faucibus@lacinia.net
1002,Xenos Campos,"5.69702  -164.57551",1-872-151-8210,dolor.Fusce@Nunc.com
1003,Brandon Castro,"-25.12774  -151.86179",1-283-827-7635,parturient@aliquameu.org
1004,Evan Gordon,"-20.12089  -85.3661",1-801-885-3833,Fusce.fermentum@Integereu.ca
1005,Macon Hopper,"22.30371  134.10815",1-354-448-6576,est.congue@acturpisegestas.net
1006,Christian Tucker,"73.86819  114.71156",1-106-627-3799,at.egestas.a@Fuscealiquam.net
The other dataset is the orders dataset, which holds information about orders placed by the customers.
orders.csv
orderid,cust_id,item,order_dt,track_id
745651,1009,Cephalexin,08/09/2013,ULQ37MGX7MW
745652,1018,Hydrochlorothiazide,01/01/2015,NJJ84QEM7GO
745653,1027,Sertraline HCl,07/13/2014,UHX76SFB1EP
745654,1036,Simvastatin,01/05/2014,ZZH74UDJ6IC
745655,1045,Lisinopril,04/22/2014,TFV46VBX1ZI
745656,1054,Ibuprofen (Rx),08/22/2015,EHJ68BHA6UU
745657,1063,Suboxone,12/10/2014,BNU38NFJ6FO

We have already created the two tables in Hive, named customers and orders, which hold the data for the customers and the orders.
customers table
hive> describe customers;
OK
custid              	int                 	customer id         
custname            	string              	customer name       
address             	string              	customer Address    
phone               	string              	customer phone      
email               	string              	customer email      
Time taken: 0.524 seconds, Fetched: 5 row(s)
orders table
hive> describe orders; 
OK
orderid             	int                 	Order ID            
fk_cust_id          	int                 	Cust ID reffering to customers
item                	string              	Order Item          
order_dt            	string              	Order Date          
tracking_id         	string              	Tracking ID for Order
Time taken: 0.732 seconds, Fetched: 5 row(s)
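
For reference, the two tables could have been created and loaded roughly like this (a sketch reconstructed from the describe output above; the local file paths are assumptions):

hive> create table customers (custid int, custname string, address string, phone string, email string)
    > row format delimited fields terminated by ',';
hive> load data local inpath '/home/rks/customers.csv' into table customers;
hive> create table orders (orderid int, fk_cust_id int, item string, order_dt string, tracking_id string)
    > row format delimited fields terminated by ',';
hive> load data local inpath '/home/rks/orders.csv' into table orders;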

INNER JOIN
The INNER JOIN keyword selects all rows from both tables as long as there is a match between the columns in both tables.
hive> select c.custid,c.custname,o.orderid,o.tracking_id from customers c inner join orders o on c.custid=o.fk_cust_id
> ;
Total MapReduce jobs = 1
setting HADOOP_USER_NAME	rks
Execution log at: /tmp/rks/.log
2015-01-24 05:03:25	Starting to launch local task to process map join;	maximum memory = 932118528
2015-01-24 05:03:25	Processing rows:	101	Hashtable size:	101	Memory usage:	8029040	rate:	0.009
2015-01-24 05:03:25	Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-03-19_651_4336759746543942005/-local-10002/HashTable-Stage-3/MapJoin-mapfile01--.hashtable
2015-01-24 05:03:25	Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-03-19_651_4336759746543942005/-local-10002/HashTable-Stage-3/MapJoin-mapfile01--.hashtable File size: 6249
2015-01-24 05:03:25	End of local task; Time Taken: 0.751 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201501241609_0017, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0017
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job  -kill job_201501241609_0017
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2015-01-24 17:03:31,889 Stage-3 map = 0%,  reduce = 0%
2015-01-24 17:03:34,902 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.92 sec
2015-01-24 17:03:35,906 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.92 sec
2015-01-24 17:03:36,916 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 0.92 sec
MapReduce Total cumulative CPU time: 920 msec
Ended Job = job_201501241609_0017
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 0.92 sec   HDFS Read: 8569 HDFS Write: 425 SUCCESS
Total MapReduce CPU Time Spent: 920 msec
OK
1009	Declan Hooper	745651	ULQ37MGX7MW
1018	Nathan Mcintyre	745652	NJJ84QEM7GO
1027	Troy Griffith	745653	UHX76SFB1EP
1036	Clark Frazier	745654	ZZH74UDJ6IC
1045	Tad Cross	745655	TFV46VBX1ZI
1054	Gannon Bradshaw	745656	EHJ68BHA6UU
1063	Walter Figueroa	745657	BNU38NFJ6FO
1072	Brady Mcclure	745658	NBK17XMP9XC
1081	Porter Bowers	745659	XHB61DLY6IK
1090	Jakeem Knight	745660	WNN67FXM2NC
1099	Macaulay Armstrong	745661	VXI39DIZ3HU
Time taken: 17.391 seconds, Fetched: 11 row(s)
hive>

LEFT OUTER JOIN
The LEFT JOIN keyword returns all rows from the left table, with the matching rows from the right table. The result is NULL on the right side when there is no match. (In the output below, the first row, with cust_name in the second column, is the header line of customers.csv, which was loaded into the table as data.)
hive> select c.custid,c.custname,o.orderid,o.tracking_id from customers c left outer join orders o on c.custid=o.fk_cust_id;
Total MapReduce jobs = 1
setting HADOOP_USER_NAME	rks
Execution log at: /tmp/rks/.log
2015-01-24 05:08:40	Starting to launch local task to process map join;	maximum memory = 932118528
2015-01-24 05:08:41	Processing rows:	101	Hashtable size:	101	Memory usage:	8133752	rate:	0.009
2015-01-24 05:08:41	Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-08-34_361_1900203016678725125/-local-10002/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2015-01-24 05:08:41	Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-08-34_361_1900203016678725125/-local-10002/HashTable-Stage-3/MapJoin-mapfile11--.hashtable File size: 6249
2015-01-24 05:08:41	End of local task; Time Taken: 0.908 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201501241609_0018, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0018
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job  -kill job_201501241609_0018
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2015-01-24 17:08:48,387 Stage-3 map = 0%,  reduce = 0%
2015-01-24 17:08:51,396 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.88 sec
2015-01-24 17:08:52,400 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.88 sec
2015-01-24 17:08:53,408 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 0.88 sec
MapReduce Total cumulative CPU time: 880 msec
Ended Job = job_201501241609_0018
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 0.88 sec   HDFS Read: 8569 HDFS Write: 2629 SUCCESS
Total MapReduce CPU Time Spent: 880 msec
OK
NULL	cust_name	NULL	NULL
1001	Sawyer Thompson	NULL	NULL
1002	Xenos Campos	NULL	NULL
1003	Brandon Castro	NULL	NULL
1004	Evan Gordon	NULL	NULL
1005	Macon Hopper	NULL	NULL
1006	Christian Tucker	NULL	NULL
1007	Rafael Erickson	NULL	NULL
1008	Brent Roth	NULL	NULL
1009	Declan Hooper	745651	ULQ37MGX7MW
1010	Neil Leon	NULL	NULL

RIGHT OUTER JOIN
The RIGHT JOIN keyword returns all rows from the right table, with the matching rows from the left table. The result is NULL on the left side when there is no match. (Likewise, the header line of orders.csv appears below as the row with track_id in the last column.)
hive> select c.custid,c.custname,o.orderid,o.tracking_id from customers c right outer join orders o on c.custid=o.fk_cust_id;
Total MapReduce jobs = 1
setting HADOOP_USER_NAME	rks
Execution log at: /tmp/rks/.log
2015-01-24 05:10:50	Starting to launch local task to process map join;	maximum memory = 932118528
2015-01-24 05:10:51	Processing rows:	101	Hashtable size:	101	Memory usage:	7971568	rate:	0.009
2015-01-24 05:10:51	Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-10-44_697_521683568687053567/-local-10002/HashTable-Stage-3/MapJoin-mapfile20--.hashtable
2015-01-24 05:10:51	Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-10-44_697_521683568687053567/-local-10002/HashTable-Stage-3/MapJoin-mapfile20--.hashtable File size: 6317
2015-01-24 05:10:51	End of local task; Time Taken: 0.712 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201501241609_0019, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0019
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job  -kill job_201501241609_0019
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2015-01-24 17:10:58,019 Stage-3 map = 0%,  reduce = 0%
2015-01-24 17:11:01,064 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.91 sec
2015-01-24 17:11:02,067 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.91 sec
2015-01-24 17:11:03,073 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 0.91 sec
MapReduce Total cumulative CPU time: 910 msec
Ended Job = job_201501241609_0019
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 0.91 sec   HDFS Read: 5205 HDFS Write: 2668 SUCCESS
Total MapReduce CPU Time Spent: 910 msec
OK
NULL	NULL	NULL	track_id
1009	Declan Hooper	745651	ULQ37MGX7MW
1018	Nathan Mcintyre	745652	NJJ84QEM7GO
1027	Troy Griffith	745653	UHX76SFB1EP
1036	Clark Frazier	745654	ZZH74UDJ6IC
1045	Tad Cross	745655	TFV46VBX1ZI
1054	Gannon Bradshaw	745656	EHJ68BHA6UU
1063	Walter Figueroa	745657	BNU38NFJ6FO
1072	Brady Mcclure	745658	NBK17XMP9XC
1081	Porter Bowers	745659	XHB61DLY6IK
1090	Jakeem Knight	745660	WNN67FXM2NC
1099	Macaulay Armstrong	745661	VXI39DIZ3HU
NULL	NULL	745662	DKP00ZCS6FU
NULL	NULL	745663	YSJ42ZXP5ZG
NULL	NULL	745664	OBT90SWM3FN
NULL	NULL	745665	YVJ22BYO5DT
NULL	NULL	745666	DXY85QAL1BE
NULL	NULL	745667	THJ12NCF3KR

FULL OUTER JOIN
The FULL OUTER JOIN returns all rows from the left table and the right table; it combines the results of both LEFT and RIGHT outer joins. Both CSV header rows appear in the output below.
hive> 
> select c.custid,c.custname,o.orderid,o.tracking_id from customers c full outer join orders o on c.custid=o.fk_cust_id; 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201501241609_0020, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0020
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job  -kill job_201501241609_0020
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2015-01-24 17:12:46,443 Stage-1 map = 0%,  reduce = 0%
2015-01-24 17:12:50,465 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.08 sec
2015-01-24 17:12:51,470 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.08 sec
2015-01-24 17:12:52,478 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:53,488 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:54,498 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:55,504 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:56,512 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:57,521 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:58,531 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 2.89 sec
2015-01-24 17:12:59,538 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 2.89 sec
2015-01-24 17:13:00,545 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.85 sec
2015-01-24 17:13:01,551 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.85 sec
2015-01-24 17:13:02,560 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.85 sec
MapReduce Total cumulative CPU time: 3 seconds 850 msec
Ended Job = job_201501241609_0020
MapReduce Jobs Launched: 
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 3.85 sec   HDFS Read: 13774 HDFS Write: 4872 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 850 msec
OK
NULL	cust_name	NULL	NULL
NULL	NULL	NULL	track_id
1001	Sawyer Thompson	NULL	NULL
1002	Xenos Campos	NULL	NULL
1003	Brandon Castro	NULL	NULL
1004	Evan Gordon	NULL	NULL
1005	Macon Hopper	NULL	NULL
1006	Christian Tucker	NULL	NULL
1007	Rafael Erickson	NULL	NULL
1008	Brent Roth	NULL	NULL
1009	Declan Hooper	745651	ULQ37MGX7MW
1010	Neil Leon	NULL	NULL
1011	Lionel Vaughan	NULL	NULL
1012	Dillon Johns	NULL	NULL
1013	Davis Fisher	NULL	NULL
1014	Isaac Fields	NULL	NULL
1015	Micah Figueroa	NULL	NULL
1016	Burke Merrill	NULL	NULL
1017	Felix Ward	NULL	NULL
1018	Nathan Mcintyre	745652	NJJ84QEM7GO
1019	Perry Bullock	NULL	NULL
1020	Ali Kramer	NULL	NULL
1021	Timothy Avila	NULL	NULL
1022	Jason Wolfe	NULL	NULL

Monday, January 12, 2015

Apache Hive : HiveQL loading data into a table and querying it

Hive has no row-level insert, update, or delete operations; the only way to put data into a table is to use one of the "bulk" load operations, or to just write files into the correct directories by other means. In this example we will see how you can load data into a Hive table.

First create a managed table, defining ROW FORMAT DELIMITED and FIELDS TERMINATED BY ',' so the data is parsed correctly when loaded into the table, then load the data using LOAD DATA LOCAL INPATH.
Running a query with a WHERE predicate will launch a MapReduce job and give you the results, as in the sketch below.
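
A minimal end-to-end sketch of those steps (the table name, layout, and file path are assumptions for illustration):

hive> create table stocks (symbol string, price float, trade_dt string)
    > row format delimited fields terminated by ',';
hive> load data local inpath '/home/rks/stocks.csv' into table stocks;
hive> select symbol, price from stocks where price > 100.0;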

Sunday, January 11, 2015

Apache Hive : Getting started with HiveQL

HiveQL is the Hive query language. It does not fully conform to ANSI SQL the way other databases' query languages do. Hive does not support row-level insert, update, and delete, and does not support transactions. HiveQL supports creation and alteration of databases, and supports the DROP DDL too. CREATE, ALTER, and DROP DDL can be applied to the other Hive database objects as well, e.g. tables, views, indexes, and functions.
In this post we will try to run some of the DDL statement on the HIVE Database.

Create Database:
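
Since the original screenshot is not reproduced here, a minimal sketch of the corresponding DDL (the database name is an assumption, matching the orderdb used elsewhere on this blog):

hive> create database if not exists orderdb comment 'database for order data';
hive> show databases;
hive> use orderdb;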



Managed Tables vs External Tables

Hive controls the life cycle of a managed table's data; by default Hive stores the data of a managed table under the directory /user/hive/warehouse. As soon as we drop the managed table, Hive deletes the data inside it. In effect, Hive has ownership of the data in the case of a managed table.
Hive also gives the user the flexibility to define an external table that points to the data but does not take ownership of it. This is a handy way to share data among various tools that do analytics over it. An external table is defined using the EXTERNAL keyword and a LOCATION clause to locate the table's data.

Create Table : Managed Table
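
A sketch of managed-table DDL (columns reconstructed from the customers table described in the joins post above):

hive> create table if not exists customers (
    > custid int, custname string, address string, phone string, email string)
    > row format delimited fields terminated by ',';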




Create Table : External Table
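
A sketch of external-table DDL (the table name and HDFS location are assumptions):

hive> create external table if not exists orders_ext (
    > orderid int, fk_cust_id int, item string, order_dt string, tracking_id string)
    > row format delimited fields terminated by ','
    > location '/data/orders/';

Dropping orders_ext would remove only its metadata; the files under /data/orders/ stay in place.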



Create Table : Partitioned Table
To speed up queries, Hive uses the concept of partitioning, in which a table is divided horizontally into multiple parts. Partitions are essentially horizontal slices of data that allow larger sets of data to be separated into more manageable chunks, so that a query with a suitable predicate needs to look only into the target partition. A sketch of the DDL follows.
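
A sketch of partitioned-table DDL and a partition-wise load (table name, partition column, and path are assumptions):

hive> create table orders_part (orderid int, fk_cust_id int, item string, tracking_id string)
    > partitioned by (order_year string)
    > row format delimited fields terminated by ',';
hive> load data local inpath '/home/rks/orders_2014.csv'
    > into table orders_part partition (order_year='2014');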

Apache Hive : Configuring Hive MetaStore using JDBC

Hive requires a metastore that stores metadata (e.g. table schemas, partitions, SerDe information) so that users can run DDL and DML commands on Hive.
Hive also has an embedded metastore in the form of a Derby database, but it is not suitable for concurrent access or heavy usage.
Hive gives you the flexibility to configure any JDBC-compliant database as the metastore, such as MySQL, Postgres, etc. As MySQL is the most popular database used as a Hive metastore, in this post I will demonstrate how to configure MySQL as the metastore.

To configure MySQL as a Hive metastore, first install and start MySQL server, here is my first step to install MySQL server on Ubuntu

sudo apt-get install mysql-server
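
Then create a MySQL user for Hive and grant it privileges on the metastore database (hivedb, username, and password here are placeholders matching the hive-site.xml below):

mysql> create user 'username'@'%' identified by 'password';
mysql> grant all privileges on hivedb.* to 'username'@'%';
mysql> flush privileges;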

Download the MySQL JDBC connector jar and place it in $HIVE_HOME/lib.

Now edit conf/hive-site.xml as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hostname/hivedb?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>username</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
  </property>
</configuration>

If your metastore is running on a remote machine, also add the following property to conf/hive-site.xml:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://remotehostname:9083</value>
</property>

Now you are good to go with the MySQL metastore. To start an external Hive metastore service, use the command:
hive --service metastore &


Saturday, January 10, 2015

Apache Hive : Installation and configuration of Hive


To install Hive you can follow my old post on building Hive from source if you want to do some customization of the source code. If you are a newbie to Hive, then I recommend downloading a stable release of Hive from the Apache Hive website and extracting the tarball to your preferred location.

I have downloaded the hive-0.11.0 tarball from the Apache Hive website and extracted it into the folder /usr/local/hive/hive-0.11.0. That is all you need to install Hive on your machine; just make sure that Hadoop is already installed. Hive uses the environment variable HADOOP_HOME to locate the Hadoop jars and configuration files.

$cd /usr/local
$mkdir hive
$cd hive
$curl -O http://archive.apache.org/dist/hive/hive-0.11.0/hive-0.11.0-bin.tar.gz
$tar -xzf hive-0.11.0-bin.tar.gz

Now set the environment variables HADOOP_HOME and HIVE_HOME. Open .bashrc in an editor, e.g. vi, and add these lines to the file:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HIVE_HOME=/usr/local/hive/hive-0.11.0
export PATH=$PATH:$HIVE_HOME/bin
The $HIVE_HOME/bin directory contains executable scripts that launch various Hive services, including the hive command-line interface.
The Hive binaries also contain the Thrift service, which provides access to other agents or processes; on top of Thrift, Hive provides access via JDBC and ODBC.
One of the most important parts of a Hive installation is the metastore, which Hive uses to store table schemas and other metadata. Hive has a built-in Derby database for the metadata, but you are also free to configure an RDBMS of your choice to provide the metastore service; typical Hive installations use MySQL or Postgres.

Now you can check your installation by starting the Hive CLI.
$cd $HIVE_HOME
$hive
Logging initialized using configuration in jar:file:/usr/local/hive/hive-0.11.0/lib/hive-common-0.11.0.jar!/hive-log4j.properties
Hive history file=/tmp/rks/hive_job_log_rks_2867@RKS-PC_201501101811_648770952.txt
hive>