转载自:http://kernel-panik.blogspot.com/2013/05/force-udf-execution-to-happen-in-hive.html
Doing quick and dirty URL fetch from hive, I wanted for URL"s to be ditributed among 5 jobs. Input is small it's very hard to tune up on mapper side things to heppen on 5 mappers say.
Regular:
insert overwrite table url_raw_contant partition(dt = 20130606) select full_url, priority, regexp_replace(curl_url(full_url), '\n|\r', ' ') as raw_html from url_queue_table_sharded_temp;
Forced UDF execution to Reducer (5 reducers):
set mapred.reduce.tasks=5; insert overwrite table url_raw_contant_table partition(dt = 20130606) select full_url, priority, regexp_replace(curl_url(full_url), '\n|\r', ' ') as raw_html from ( select full_url, priority from url_queue_table_sharded_temp distribute by md5(full_url) % 5 sort by md5(full_url) % 5, priority desc ) d distribute by md5(full_url) % 5;