Saturday, January 24, 2015

Apache Hive : Indexes

Indexes are special lookup tables that the database search engine can use to speed up data retrieval.an index is a pointer to data in a table.the Indexes are implemented by using various data structure e.g. BTree, BitMap etc.Hive has limited indexing capabilities. you can speed up some of the data retrival operation using the indexes, the index data is stored in the another table in the Hive

The other way of speed up your query other than the Indexing are partitioning and bucketing which we will cover in coming posts.
Indexing is also a good alternative to partitioning when the logical partitions would.Indexes in Hive, like those in relational databases, need to be evaluated carefully.

Maintaining an index requires extra disk space and building an index has a processing cost.

Creating an Index
hive> create index customer_index on table customers(custid) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
> with deferred rebuild
> in table customer_index_table
> COMMENT 'Customer Index on custid';
OK
Time taken: 5.713 seconds

Creating an bitmap Index
hive> create index customer_index_bitmap on table customers(custid)
> as 'BITMAP'
> with deferred rebuild
> COMMENT 'Index on customer table on custid using bitmap';
OK
Time taken: 0.603 seconds

View an Index
hive> show formatted index on customers;                       
OK
idx_name tab_name col_names idx_tab_name idx_type comment


customer_index customers custid customer_index_table compact Customer Index on custid
customer_index_bitmap customers custid orderdb__customers_customer_index_bitmap__ bitmap Index on customer table on custid using bitmap
Time taken: 0.988 seconds, Fetched: 5 row(s)

Rebuild an Index
hive> alter index customer_index on customers rebuild;      
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201501241609_0022, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0022
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201501241609_0022
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-01-24 18:32:39,926 Stage-1 map = 0%, reduce = 0%
2015-01-24 18:32:41,975 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:42,978 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:43,982 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:45,011 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:46,071 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:47,078 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:48,093 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:49,159 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2015-01-24 18:32:50,245 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 0.81 sec
2015-01-24 18:32:51,305 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.24 sec
2015-01-24 18:32:52,309 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.24 sec
2015-01-24 18:32:53,385 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.24 sec
MapReduce Total cumulative CPU time: 2 seconds 240 msec
Ended Job = job_201501241609_0022
Loading data to table orderdb.customer_index_table
Deleted hdfs://RKS-PC:54310/user/hive/warehouse/orderdb.db/customer_index_table
Table orderdb.customer_index_table stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 8567, raw_data_size: 0]
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 2.24 sec HDFS Read: 8569 HDFS Write: 8567 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 240 msec
OK
Time taken: 42.379 seconds

Drop an Index
hive> drop index if exists customer_index on customers;
OK
Time taken: 3.267 seconds




No comments: