hive collect sethive sql 数组长度有长度限制吗

风水堪舆学 | 网络营销 | 住宅风水 | 英文歌曲 | Adobe After Effects | 电脑配置 | 书籍改编电影 | 下载 | Legion | 网络推广 | 动画制作 | 赛事 | PLC | 小说创作 | 虚拟专用服务器 | 成语 | 家庭 | 单反相机 | 电视节目 | 投影机 | 面相 | 香港购物 | 配音 | 文具 | 二次元 | 影视 | 固态硬盘ssd | 虚拟机 | 跆拳道 | r（编程语言） | 秦时明月之天行九歌 | 使命召唤 | 网盘 | 地图 | 琅琊榜（电视剧） | 手机内存 | 角色扮演 | 华硕 | 百度输入法 | 盗墓笔记（小说） | 营销策划 | 化妆品 | Windows | ip地址 | 装修设计 | 齐内丁·齐达内 | 动画电影 | 中国中央电视台 | 罗兰 | 网站优化 | 斗鱼直播 | 冷知识 | 张帅 | 任天堂 | 摄影师 | 三菱商事 | 迅雷（软件） | 计算机病毒 | amd | 屏幕 | 微单相机 | 电学 | qq浏览器 | MacOS | 联赛 | snh48 | 芯片（集成电路） | 后宫·甄嬛传（书籍） | 植物辨识 | 运动 | 大一 | 美容 | 双色球 | 蓝牙音箱 | 楼盘 | 电脑电源 | 采暖 | 显卡驱动 | 体育赛事 | thinkpad | 离婚 | 武侠小说 | 索尼笔记本 | 中国足球协会超级联赛（csl） | youtube | 王力宏（人物） | 外星人 | 努比亚（手机品牌） | 海贼王 | 移动电源 | 完美世界（游戏） | 摩托车 | 编辑器 | 低音炮 | 收益 | 海关 | 徐波 | akb48 | 互联网创业 | 张璐 | 男性 | 性价比 | MacBook Air | 新疆维吾尔自治区 | 插座 | 外汇平台 | 华为Mate30 | 羽毛球技术 | 腾讯 QQ | 蓝屏 | 字幕 | 免费软件 | 电脑故障 | 女生 | 周星驰（人物） | 足球欧洲杯 | pdf | macbook | 直播 | 生活经历 | 骁龙处理器 | 主题曲 | 户外运动 | CPU | 娱乐圈 | 初恋 | 家居 | 流氓软件 | 名言 | 中国足球 | 近视眼 | acg | 一级方程式赛车（f1） | 小品 | 网站运营 | 英格兰足球超级联赛 | 一体机 | 人肉搜索 | 日本电影 | 系统软件 | 人生 | 流星花园 | 电钢琴 | 分辨率 | 迅雷 | 机械设计 | 古典音乐 | 液晶电视 | 睡眠 | 大片 | 资产 | Html/Css | ansys | 天蝎座 | 对联 | 大二 | 吉他学习 | 实习 | uc浏览器 | 计算机科学 | 新华社 | 脱毛 | 视力 | 乐视超级电视 | 大学生活 | 开关电源 | 平面设计 | 音乐版权 | iPhone 11 Pro | 面膜 | 鞠婧祎 | 胡歌（演员） | 郭富城 | 语言 | 赵丽颖（演员） | 意大利 | 电路设计 | 情侣 | NBA篮球 | 蔡徐坤 | 豆瓣电影 | 社交软件 | 微信开发 | 足球彩票 | 电工 | 手机摄像头 | 用户界面设计师 | 华语流行音乐 | 网卡 | 易烊千玺 | 笛子 | 日语学习 | 日语歌曲 | 歌手 | 张子枫 | 搏击项目 | 谭松韵 | 快捷键 | O2O | 移民 |

你的位置：网站首页 >> 频道首页 >>编程语言 >>hive collect sethive sql 数组长度有长度限制吗

hive collect sethive sql 数组长度有长度限制吗

来源：蜘蛛抓取(WebSpider) 时间：2016-09-26 03:30 标签： hive sql 数组长度

遇到这么一个需求，输入数据为一个ID对应多个name，要求输出数据为ID是唯一的，name随便取一个就可以。
执行以下hive ql语句：
SELECT &&sid,&&class_id FROM&&table2 GROUP BY sid ;
FAILED: Error in semantic analysis: Line 1:18 Expression not in GROUP BY key 'class_id'
<div class="crayon-num" data-line="crayon-57cd7e
FAILED: Error in semantic analysis: Line 1:18 Expression not in GROUP BY key 'class_id'
查了一下，HIVE有这么一个函数collect_set，类似于mysql的group_concat函数，把每个分组的其他字段，按照逗号进行拼接，得到一个最终字符串：
collect_set(col)
返回类型：array
解释：返回一个去重后的对象集合
<div class="crayon-num" data-line="crayon-57cd7e<div class="crayon-num crayon-striped-num" data-line="crayon-57cd7e<div class="crayon-num" data-line="crayon-57cd7e
collect_set(col)返回类型：array解释：返回一个去重后的对象集合
将上述的QL语句改一下：
select sid,collect_set(class_id) from table2
select sid,collect_set(class_id) from table2 group by sid;
结果是这样的：
1 [11,12,13]
<div class="crayon-num" data-line="crayon-57cd7e<div class="crayon-num crayon-striped-num" data-line="crayon-57cd7e<div class="crayon-num" data-line="crayon-57cd7e<div class="crayon-num crayon-striped-num" data-line="crayon-57cd7e<div class="crayon-num" data-line="crayon-57cd7e<div class="crayon-num crayon-striped-num" data-line="crayon-57cd7e
1 [11,12,13]2 [11,14]3 [12,15]4 [12,13]5 [16,14]7 [13,15]
这个时候，我们就可以针对第二列做一些计数、求和操作，分别对应到Hive的聚合函数count、sum。
对应到本文的目的，直接从数组获取第一个元素就达到目的了，这样做：
select sid,collect_set(class_id)[0] from table2
select sid,collect_set(class_id)[0] from table2 group by sid;
结果如下：
<div class="crayon-num" data-line="crayon-57cd7e<div class="crayon-num crayon-striped-num" data-line="crayon-57cd7e<div class="crayon-num" data-line="crayon-57cd7e<div class="crayon-num crayon-striped-num" data-line="crayon-57cd7e<div class="crayon-num" data-line="crayon-57cd7e<div class="crayon-num crayon-striped-num" data-line="crayon-57cd7e
1 112 113 124 125 167 13
Hive不允许直接访问非group by字段；
对于非group by字段，可以用Hive的collect_set函数收集这些字段，返回一个数组；
使用数字下标，可以直接访问数组中的元素；
参考文章：http://wangjunle23./blog/static//
本文地址：http://www.crazyant.net/1600.htmlhive使用总结 - 推酷
hive使用总结
1.hive 数据类型
/sharpxiajun/archive//3114560.html
2.substr('abc',0,2) = ab
mysql 是从1开始的 select * from online_server where substr(count_time,1,10)=''
(server_name='iphone_onlines' or server_name='symbian_onlines' or server_name='gtalk_onlines' or server_name='wm_onlines');
php 是从0开始的
java 也是从0开始的
3.hadoop job -list 查看当前任务
4.split(memberid,':')[2] 分成的数组从0开始（awk split($1,arr,&:&)分成的数组是从1开始的）,好像sql 中偏移量都是从0开始的
split(string str, string pat)& Split str around pat (pat is a regular expression)&
& select dt,case when split(req_url,'[/?]')[3] is null then 'main' when split(req_url,'[/?]')[3] = '' then 'main' else split(req_url,'[/?]')[3] end,count(1) pv,count(distinct sinaglobal) uip
from f_suda_log where dt='' and channel='other' and req_url like '%'
group by dt,case when split(req_url,'[/?]')[3] is null then 'main' when split(req_url,'[/?]')[3] = '' then 'main' else split(req_url,'[/?]')[3] end
order by pv desc limit 10;
split(req_url,'[/?]')[3]是指不管碰到/还是？都要分割字符串，并且都放入到结果数组中
是按正则拆分，特殊字符需要两个反斜线转移
5.not req_url like '%v..cn/widget%' group by sinaglobal
6.hive查询结果分隔符转换 cat $tarpath/channel1UIP/* | awk '{gsub(/\001/,&\t&,$0);gsub(/\\N/,&&,$0);print $0}' &&
7.CASE WHEN THEN& END 用法示例
总结：case 条件分支语句可以来搜索数据，跟子查询一样在复杂查询和列行转换上有很大作用，这样看来case 的分支是不能有重复的？hive 跟mysql 实验证明：
如果有重复，证明只进入第一个分支，第二个没进，说明不能有重复！
1).在hive中测试
when channel='sports' then 'sports_flow'
when channel='sports' and channel2='sports_nba' then 'sports_nba_flow'
from f_suda_log
where dt=''
and channel='sports'
when channel='sports' then 'sports_flow'
when channel='sports' and channel2='sports_nba' then 'sports_nba_flow'
结果：sports_flow&&&&
证明只进入第一个分支，第二个没进，说明不能有重复
在mysql里测试
SELECT CASE WHEN `score` !=0
THEN '有分数记录的分数'
WHEN `cost` != ''
THEN '有花钱记录的分数'
END , sum( `score` ) AS sum_socre
FROM `state_score`
GROUP BY CASE WHEN `score` !=0
THEN '有分数记录的分数'
WHEN `cost` !=0
THEN '有花钱记录的分数'
END HAVING sum_socre !=0
有分数记录的分数 1947
也证明了上面的结论。
select dt,
when req_url like '/s/%' then 'a'
when req_url like& '/coupon/detail/%' then 'b'
when req_url like&& '/coupon/coupon/%' then 'c'
when req_url like& '/foodmenu/menu/%' then 'd'
from f_suda_log
and (req_url like '/s/%' or
req_url like& '/coupon/detail/%' or
req_url like&& '/coupon/coupon/%' or
req_url like& '/foodmenu/menu/%')
group by dt,
when req_url like '/s/%' then 'a'
when req_url like& '/coupon/detail/%' then 'b'
when req_url like&& '/coupon/coupon/%' then 'c'
when req_url like& '/foodmenu/menu/%' then 'd'
select dt,sum(
when result='succ' then 1 else 0&
as& 'succNum'
when result='succ' or result='fail' then 1 else 0&
as& 'allNum'
from f_unipro_chksso
case when then 还有另一种写法（可用于行转列）
select convert(varchar(10),datestr,23)datestr,
max(case chname when 'NBA专栏PV' then cast(num as int)/1000.0 else 0 end) as 'NBA专栏PV',
max(case chname when 'NBA专栏UV' then cast(num as int)/1000.0 else 0 end) as 'NBA专栏UV',
max(case chname when '访问频次' then cast(num as float) else 0 end) as '访问频次',
max(case chname when '总访问次数' then cast(num as int)/1000.0 else 0 end) as '总访问次数',
max(case chname when '停留时间' then num else 0 end) as '停留时间',
max(case chname when '最高同时在线人数' then cast(num as int)/1000.0 else 0 end) as '最高在线人数',
max(case chname when '直播页面PV' then cast(num as int)/1000.0 else 0 end) as '直播页面PV',
max(case chname when '直播页面UV' then cast(num as int)/1000.0 else 0 end) as '直播页面UV',
max(case chname when '直播累计播放量' then cast(num as int)/1000.0 else 0 end) as '直播累计播放量',
max(case chname when '点播页面UV' then cast(num as int)/1000.0 else 0 end) as '点播页面UV',
max(case chname when '点播累计播放量' then cast(num as int)/1000.0 else 0 end) as '点播累计播放量',
max(case chname when '其他新浪频道流入' then cast(num as int)/1000.0 else 0 end) as '其他新浪频道流入',
max(case chname when '导航网站流入' then cast(num as int)/1000.0 else 0 end) as '导航网站流入',
max(case chname when '搜索引擎流入' then cast(num as int)/1000.0 else 0 end) as '搜索引擎流入',
max(case chname when '专题PV' then cast(num as int)/1000.0 else 0 end) as '专题PV',
max(case chname when '专题UV' then cast(num as int)/1000.0 else 0 end) as '专题UV',
max(case chname when '原创微博数' then cast(num as int) else 0 end) as '原创微博数',
max(case chname when '被转发次数' then cast(num as int) else 0 end) as '被转发次数',
max(case chname when '被评论次数' then cast(num as int) else 0 end) as '被评论次数'
(select datestr,chname,num
from t_nba_data a join t_nba_tpname b
on a.datatype = b.enname
group by datestr
3).hive 中case when 中不能再包含聚合函数，例如如下方式是不行的：
when sum(case when substr(a.logstatus,1,1)&&5 then 1 else 0 end)=0 and a.time &0 then a.email
when sum(case when substr(a.logstatus,1,1)&&5 then 1 else 0 end)&0 and a.time &max(case when substr(c.logstatus,1,1)&&5 then a.time) then a.email
8.left join 示例
select dt,count(distinct client_ip) as allNum,b.succNum from f_unipro_chksso a
(select dt,count(distinct client_ip) as succNum from f_unipro_chksso where dt ='' and result='succ') b
on a.dt=b.dt
where dt ='' and (result='succ' or result='fail')
select dt,sum(
when result='succ'
as& uip_all
when result='succ' or result='fail'
as& uip_succ
from f_unipro_chksso
9.子查询用的很多
10.sql 语句的效率差距
select count(distinct uid) from f_tblog_behavior where dt&='' and dt&='' and behavior='' 这样很低
select count(1) from ( select distinct uid from f_tblog_behavior where dt&='' and dt&='' and behavior='')& 这样很高
10.hive 正则中 .是不是像php preg_match 的正则表示除回车符之外的任意字符呢？实验一下，hive中正则和其他正则有细微的差别
.* 也可以表示除换行符以为的任意字符 .不用转义就可以表示.
select regexp_extract(parse_url(refurl,'HOST'),'([0-9a-zA-Z]+.sina*)') from f_suda_path_site_nba where dt='' and (parse_url(refurl,'HOST') rlike '$' or parse_url(refurl,'HOST') rlike '.cn$' or parse_url(refurl,'HOST') rlike '$') limit 20;
11.regexp_extract 的用法
有时不太准，换了字段位置就准了
select regexp_extract(parse_url(refurl,'HOST'),'([0-9a-zA-Z]+.sina\.)') from f_suda_path_site_nba where dt='' and (parse_url(refurl,'HOST') rlike '$' or parse_url(refurl,'HOST') rlike '.cn$' or parse_url(refurl,'HOST') rlike '$') limit 20;
12.find_in_set 的用法
select a.dt,count(1) from
select dt,regexp_extract(req_url, '/([0-9a-zA-Z]+)/?[^/.]*$', 1) domain,split(memberid,':')[2] memberid
from f_suda_log
where& dt='' and channel='tblog' and req_url rlike '^/[0-9a-zA-Z]+/?[^/.]*$'&
and find_in_set( regexp_extract(req_url, '/([0-9a-zA-Z]+)/?[^/.]*$', 1),'app,fav,favsearch,messages,reportspam,tool,comments,atme,new,comment,k,at,pub,f,zt,systemnotice,sorry,message,settings,findfriends,verify,u')=0
) a join f_tblog_user_data b on a.memberid=b.uid and b.domain=a.domain group by a.dt
13.distribute by group by order by sort by 有什么区别和联系？
distribute by 要与sort by 对应，每个reduce中分组排序
group by 要去 order by对应，所有reduce完了，文件合起来后再分组排序
14.用先union再group by 的方式优化hive job，我理解这就是multi-group by
15.hive中添加set io.sort.mb=256;防止map机器内存溢出 /2010/12/hadoop-job-tuning.html
16.hive中split中分号，逗号需不需要转义？
试一下就知道
select zt from f_suda_log where dt='' and zt rlike '^[\\d]+\\,[\\d]+\\,[\\d]+' and zt rlike '\\;' limit 10;
select split(zt,'\\;')[0] from f_suda_log where dt='' and zt rlike '^[\\d]+\\,[\\d]+\\,[\\d]+' and zt rlike '\\;' limit 10;
select concat(split(split(zt,'\\\;')[0],',')[0],',',split(split(zt,'\\\;')[0],',')[1],',',split(split(zt,'\\\;')[0],',')[2],',') from f_suda_log where dt='' and zt rlike '^[\\d]+\\,[\\d]+\\,[\\d]+' and zt rlike '\\;' limit 10;
17.hive 调优参数学习
18.查看表分区语句 describe extended f_tblog_enterprise_user partition(dt='');
19.MPP数据库是指大规模并行处理数据库
20.查看表分区show partitions day_hour_
21.bucket的使用
提高hadoop效率，很重要的方法是多个map和多个reduce
通过指定CLUSTERED的字段，将文件通过hash打散成多个小文件。
create table sunwg_test11(id int,name string)
CLUSTERED BY(id) SORTED BY(name) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’;
执行insert前不要忘记设置
set hive.enforce.bucketing =
强制采用多个reduce进行输出
22.添加多个分区，还是一个一个添加好，多个一块添加会有问题
alter table f_selfLog_hour add partition(dt='',hour='00') location '/input/selfLog//00' partition(dt='',hour='01') location '/input/selfLog//01' partition(dt='',hour='02')
23.可以用LIKE复制表结构，而不复制数据
24.hive create table 的时候SORTED BY 可以指定几列，放到一个桶bucket中
25.LOAD DATA ，如果表包含分区，必须指定每一个分区的分区名
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
26.SELECT 语句可以使用正则表达式做列选择，下面的语句查询除了 ds 和 hr 之外的所有列：
SELECT `(ds|hr)?+.+` FROM sales
select `(requestargv|requestip)?+.+` from f_selfLog_hour where dt='' and hour='12' limit 5;
27.REGEXP与rlike功能相同
28.// 多表插入
from source
insert overwrite table records_by_year select year, count(1) group by year
insert overwrite table good_records_by_year select year, count(1) where quality ==0
29.Add File /path/to/is_good_quality.&& // 将过滤程序加载到分布式缓存中
30.select * from things left semi join sales on (sales.id = things.id);
// 类似于in子查询：select * from things where things.id in (select id from sales);
// 写left semi join查询时必须遵循一个限制，右表只能在on子句中出现。
31.Map连接：如果有一个连接表小到足以放入内存，Hive就可以把较小的表放入每个mapper的内存来执行连接操作。如果要指定map连接，需要在SQL中使用C语言风格的注释：
Join 操作在 Map 阶段完成，不再需要Reduce，前提条件是需要的数据在 Map 的过程中可以访问到。
select /*+ MAPJOIN(things) */ sales.*, things.* from sales join things on (sales.id = things.id);
这个语句不会产生reduce。
MAPJOIN(things)里可以是多个表，一块读入内存。
Map连接可以利用分桶的表，因为作用于桶的mapper加载右侧表中对应的桶即可执行连接。启动优化选项： set hive.optimize.bucketmapjoin =
32.内部表也可以建分区。示例如下：
create table test_4
requestIp string
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
from f_selfLog_hour
insert overwrite table test_4 partition (dt='')
select requestip& where dt='' group by requestip
33.hive 添加列
&& ALTER TABLE f_selfPcApacheLog_deserialize_hour ADD COLUMNS (url string);
34.hive有时会有一种假null，'\N'
35.使用外部脚本文件
hive& add FILE weekday_mapper.
将数据按周进行分割
INSERT OVERWRITE TABLE u_data_new
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
SELECT weekday, COUNT(1)
FROM u_data_new
我写的例子：
create table f_jyzlog_urldecode_day
(oneline string)
PARTITIONED BY (dt string);
add file /download/jyzStat/bin/transtorm_urldecode.
INSERT OVERWRITE TABLE f_jyzlog_urldecode_day partition(dt='')
TRANSFORM(oneline)
USING 'python transtorm_urldecode.py'
AS (oneline)
from f_jyzlog_initial
where dt=''
缓存中没有了文件会导致异常。
36.mapreduce 压缩最佳实践：
gz文件不能split（大文件也是一个map），得用lzo
map输出的中间数据使用 LzoCodec，reduce输出使用 LzopCodec
&& .cn/s/blog_01b3yy.html
37.hive表创建索引：
/end/archive//2870352.html
创建索引：
create index uuid_index on table kysttdlurl(uuid) AS& 'org.apache.hadoop.hive.pactIndexHandler' WITH DEFERRED REBUILD ;
索引生成：
alter index uuid_index o
查看表的索引：show index on index_
38.LEFT SEMI JOIN
/content/11/9_.shtml
hive中现在没有IN/EXSITS，可以用LEFT SEMI JOIN 代替
LEFT SEMI JOIN 是 IN/EXISTS 子查询的一种更高效的实现。
Hive 当前没有实现 IN/EXISTS 子查询，所以你可以用 LEFT SEMI JOIN 重写你的子查询语句。
LEFT SEMI JOIN 的限制是， JOIN 子句中右边的表只能在 ON 子句中设置过滤条件，
在 WHERE 子句、SELECT 子句或其他地方过滤都不行。
39。要会用动态分区和桶。http://blog.csdn.net/awayyao/article/details/7630000
set hive.exec.dynamic.partition=
set hive.exec.dynamic.partition.mode=
静态分区和动态分区的区别在于导入数据时，是手动输入分区名称，还是通过数据来判断数据分区.
然后用hive的insert命令进行插入操作。注意，除了所有列外，需要将分区的动态字段跟在后面。
INSERT OVERWRITE TABLE target PARTITION (dt)
SELECT id,user_id,app_id,time,ip,substr(time,0,10) FROM origin
set hive.exec.dynamic.partition=
INSERT OVERWRITE TABLE temp_app_devicemac PARTITION(dt,appid)
SELECT devicemac,dt,appid
FROM f_selfpcapachelog_deserialize_hour
where dt=''
distribute by
加大动态分区数即可，下面是在当前session下设置。
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
40.hive中有时得精确，比如rlike '^2.0' 跟rlike '^2\.0'（直接在hive里执行斜线两个跟三个都是一样的效果）有时会导致错误，
还有有时的过滤null 或者空的情况。总之要规范准确。
41.hive日期函数
date_sub('',10) 10天前
/blog/1826307
42.join无论多少表都是一个job，情况应该是这样的：
如果 Join 的 key 相同，不管有多少个表，都会则会合并为一个 Map-Reduce
如果 Join 的条件不相同，Map-Reduce 的任务数目和 Join 操作的数目是对应的
43.Join 操作的查询语句时有一条原则：应该将条目少的表/子查询放在 Join 操作符的左边。
原因是在 Join 操作的 Reduce 阶段，位于 Join 操作符左边的表的内容会被加载进内存，将条目少的表放在左边，
可以有效减少发生 OOM 错误的几率。
44. hive.map.aggr = true 是否在 Map 端进行聚合，默认为 True
45.有数据倾斜的时候进行负载均衡
&& hive.groupby.skewindata = false
&& 当选项设定为 true，生成的查询计划会有两个 MR Job。第一个 MR Job 中，Map 的输出结果集合会随机分布到 Reduce 中，每个 Reduce 做部分聚合操作，并输出结果，这样处理的结果是相同的 Group By Key 有可能被分发到不同的 Reduce 中，从而达到负载均衡的目的；第二个 MR Job 再根据预处理的数据结果按照 Group By Key 分布到 Reduce 中
&& （这个过程可以保证相同的 Group By Key 被分布到同一个 Reduce 中），最后完成最终的聚合操作。
hive使用map/reduce脚本。
SELECT TRANSFORM(user_id, page_url, unix_time)
USING 'page_url_to_id.py'
AS (user_id, page_id, unix_time)
& FROM mylog
DISTRIBUTE BY user_id
SORT BY user_id, unix_time)
SELECT TRANSFORM(user_id, page_id, unix_time)
USING 'my_python_session_cutter.py' AS (user_id, session_info);
UDF跟这个是有区别的，
add jar build/contrib/hive_contrib.
& CREATE TEMPORARY FUNCTION example_add
AS 'org.apache.hadoop.hive.contrib.udf.example.UDFExampleAdd';
& SELECT example_add(1, 2) FROM
也可以在hive中使用MAP和REDUCE关键字：
使用MAP关键字代替TRANSFORM关键字，
from records2
MAP year,temperature,quality
USING 'is_good_quality.py'
AS year,temperature) a
REDUCE year,temperature
USING 'max_temperature_reduce.py' AS year,temperature
* 用 java 写 UDF 很容易。
&&& * Hadoop 的 Writables/Text 具有较高性能。
&&& * UDF 可以被重载。
&&& * Hive 支持隐式类型转换。
&&& * UDF 支持变长的参数。
&&& * genericUDF 提供了较好的性能（避免了反射）。
&&& http://blog.csdn.net/cajeep2001/article/details/7824642
先下载net.sf.fjep.fatjar_0.0.31.jar插件包
http://blog.csdn.net/ckl_soft/article/details/8183304
add jar /home/hadoop/jar/kuaiyong_udf_test_TestUDF.
add jar /home/hadoop/jar/kuaiyong_udf_test_ip2byte.
add jar /home/hadoop/jar/kuaiyong_udf.
CREATE TEMPORARY FUNCTION ip2byte AS 'udf.ip2byte';
SELECT test_ip2byte(requestip) FROM f_selfPcApacheLog_deserialize_hour where dt='' limit 10;
mkdir -p com/kuaiyong/hive/udf
vi com/kuaiyong/hive/udf/ip2byte.java
javac -classpath /usr/local/hadoop-1.1.2/hadoop-core-1.1.2.jar:/usr/local/hive/lib/hive-exec-0.10.0.jar /home/hadoop/com/kuaiyong/hive/udf/ip2byte.java
jar -cvf ip2byte.jar /home/hadoop/com/kuaiyong/hive/udf/ip2byte.class
echo $CLASSPATH
mkdir /usr/java/jdk1.7.0_17/lib/org
chown -R hadoop:hadoop /usr/java/jdk1.7.0_17/lib/org/
su - hadoop -c &chmod -R 755 /usr/java/jdk1.7.0_17/lib/org/&
报错记录：
java.lang.IllegalArgumentException: Can not create a Path from an empty string
是因为add jar 没有成功、
&&& * 编写 UDAF 和 UDF 类似
&&& * UDAF 可以重载
&&& * UDAF 可以返回复杂类
&&& * 在使用 UDAF 的时候可以禁止部分聚合功能
尽量使用UDF而不是transfrom……
UDTF（用户自定义表生成函数）介绍
/2011/03/hive%E4%B8%ADudtf%E7%BC%96%E5%86%99%E5%92%8C%E4%BD%BF%E7%94%A8
UDTF(User-Defined Table-Generating Functions)& 用来解决输入一行输出多行(On-to-many maping) 的需求。
UDTF函数在某些应用场景下可以大大提高hql语句的性能，如需要多次解析json或者xml数据的应用场景。
add jar /home/hadoop/jar/kuaiyong_UDTF_test.
CREATE TEMPORARY FUNCTION test_UDTF AS 'com.kuaiyong.UDF.GetAllLinesUDTF';
select T.col1 from f_ip_dist_new_china lateral view serial(5) T as col1;
47.使用了Multi-group by特性连续group by了2次数据，使用不同的group by key。这一特性可以减少一次MapReduce操作。
48.Multi-distinct是淘宝开发的另一个multi-xxx特性，使用Multi-distinct可以在同一查询/子查询中使用多个distinct，
这同样减少了多次MapReduce操作。
49.if(net is null,'0',net)
50.mkfs.ext2 -c /dev/hda1检查磁盘坏道
51.正则表达式替换函数：regexp_replace
语法: regexp_replace(string A, string B, string C)
返回值: string
说明：将字符串A中的符合java正则表达式B的部分替换为C。注意，在有些情况下要使用转义字符
hive& select regexp_replace(‘foobar’, ‘oo|ar’, ”)
52.hive正则中都需要用两个斜线转义。
53.HAProxy+Hive构建高可用数据挖掘集群
http://slaytanic./626
54.各种join性能对比
/ggjucheng/archive//2842821.html
多个insert时，有时会出问题，不好用。
56.php连接hive执行sql
http://blog.csdn.net/jiedushi/article/details/6579285
http://blog.csdn.net/woailuoyaya/article/details/8508276
tar zxf thrift-0.9.0.tar.gz
cd thrift-0.9.0
./configure --without-ruby
make && make install
用hadoop权限执行
/usr/local/hive/bin/hive --service hiveserver 10001 &/dev/null& &
57.hadoop之java.net.NoRouteToHostException: No route to host
是因为防火墙没有关闭。
58.出现异常：
Failed to connect to :50010, add to deadNodes and continue java.io.IOException
/ly_89/blog/static//
查看 Hadoop out_file 的时候出现如下的错误：
出现这个问题是因为datanode rpc连接都被占用，导致客户端请求超过等待时间所致。两方面处理这个问题：
1. 增加datanode RPC连接数，默认值为3，我们设置为5。假如我们集群30台服务器，300台客户端。建议该值在3-8之间，太多会影响datanode的内存和CPU
2. 客户端conf设置属性“dfs.socket.timeout”，我这里值为&180000&
59.hive中table可以拆分成partition，table和partition可以通过‘CLUSTERED BY ’进一步分bucket，
bucket中的数据可以通过‘SORT BY’排序。
set hive.enforce.bucketing =
create external table f_appid_pcmac_day(appid string,pcmac string)
partitioned by(dt STRING)
clustered by(appid) into 50 BUCKETS
row format delimited fields terminated by '\t';
60.使用mapjoin实现非等值连接
http://my.oschina.net/leejun2005/blog/98244
61.设置hadoop 中job的优先级
当前Hadoop将作业分为了5个等级，分别是VERY_HIGH、HIGH、NORMAL、LOW、VERY_LOW，
用户提交作业时可通过参数“mapred.job.priority”设置作业优先级。
-jobconf | -D NAME=VALUE：指定作业参数，NAME是参数名，VALUE是参数值，
可以指定的参数参考hadoop-default.xml。特别建议用-jobconf mapred.job.name=’My Job Name’设置作业名，
使用-jobconf mapred.job.priority=VERY_HIGH | HIGH | NORMAL | LOW | VERY_LOW设置作业优先级，
使用-jobconf mapred.job.map.capacity=M设置同时最多运行M个map任务，
使用-jobconf mapred.job.reduce.capacity=N设置同时最多运行N个reduce任务。
设置优先级：
set mapred.job.priority=VERY_HIGH;
62.hive 内部表转成外部表
ALTER TABLE f_selfpcapachelog_deserialize_hour SET TBLPROPERTIES ('EXTERNAL'='TRUE');
63.用mapjoin解决非等值连接问题
set hive.smalltable.filesize=100M;
set hive.join.emit.interval=;
set hive.mapjoin.check.memory.rows=1000000;
set hive.mapjoin.cache.numrows=1000000;
/*+ MAPJOIN(f_ip_dist_new) */
b.ret,b.lerr,b.requesttime,b.uuid,b.pcmac,b.devicemac,b.requestip,b.clientversion,b.channel,b.product,b.report,b.appversion,b.appid,b.error,b.result,b.imode,b.wififlag,b.duid,b.device,b.iosver,b.iosdbver,b.jailbreak,b.jip,b.connectstatus,b.serverip,b.connectstatus_2,b.length,b.beginlength,b.endlength,b.tick,b.btick,b.etick,b.downloadspeed,b.dlfrom,b.resourcelog,b.keyword,b.referer,b.requestua,b.j,b.sn,b.netret,b.tep,b.lasterror,b.url,b.dla,b.def,b.type,b.step,b.cmd,b.dlat,b.dlate,b.test,a.nettype,a.province,b.dt,b.hour
from f_ip_dist_new a
right outer join
f_selfPcApacheLog_deserialize_hour b
where b.dt=''
and ip2byte(b.requestip)&=a.fromint
and ip2byte(b.requestip)&=a.toint
64.深入了解hive index
/end/archive//2871147.html
65.hive参数学习
/forum/blogPost/list/6225.html
set hive.smalltable.filesize=100M;
set hive.join.emit.interval=;
set hive.mapjoin.check.memory.rows=1000000;
set hive.mapjoin.cache.numrows=1000000;
How many rows
in the right-most join operand Hive should buffer before emitting the join result.
66.合并小文件
set hive.merge.mapfiles=
67.表关联时有主表和驱动表的概念。
68.调整map/reduce数的方法
.cn/s/blog_9fdua.html
调整hive.exec.reducers.bytes.per.reducer参数的值；
set hive.exec.reducers.bytes.per.reducer=; （500M）
hive.exec.reducers.bytes.per.reducer（每个reduce任务处理的数据量，默认为G）
hive.exec.reducers.max（每个任务最大的reduce数，默认为999）
计算reducer数的公式很简单N=min(参数2，总输入数据量/参数1)
set hive.exec.reducers.bytes.per.reducer=;
即，如果reduce的输入（map的输出）总大小不超过1G,那么只会有一个reduce任务；
69.hive函数总结
/end/archive//2553682.html
array collect_set(column_1) 返回无重复记录的数组
explode 函数能把array的元素分成多行。
70.hiveserver ESTABLISHED 与CLOSE_WAIT报错解决
/articles/MRFBRv
hadoop jar /usr/local/hadoop-1.1.2/lib/hadoop-lzo-*.jar pression.lzo.DistributedLzoIndexer /input/selfLog//*lzo*
72.一个hive多个job时
在之前加set mapred.reduce.tasks=4;每个job 的reduce数都不会超过4个。
73.桶可以用于mapjoin。
74.hive支持CTAS，即CREATE TABLE AS SELECT,并且CTAS是原子操作，如果SELECT操作因为某种原因失败，是不会创建新表的。
75.用LIKE关键字产生结构一样的新表，CREATE TABLE new_table like exists_
76.DISTRIBUTE BY 和SORT BY一块用是每个reduce聚合排序。
如果DISTRIBUTE BY 和SORT BY的字段一样，可以一个CLUSTER BY 代替。
为什么有时CLUSTER BY 不好用？我感觉是因为只有一个reduce的原因。
77.用bucket执行mapjoin需要设置优化参数：
set hive.optimize.bucketmapjoin=
78.hive对子查询的支持很有限，只允许出现在SELECT语句的FROM子句中，其他数据允许子查询出现在几乎任何表达式可以出现的地方。
很多使用子查询的地方都可以重写成连接操作。
79.全连接，半连接，内连接，外连接。
left outer join 的on上的陷阱
右边表的where除了判断null的条件都写到on里或者子查询里，左边表的条件不能写到on里，写了就会出错。
left outer join 有时在子查询过滤好才管用，直接在left outer join后边where中过滤有问题。
/*+MAPJOIN(d_pchome_appid)*/
b.appid,b.appid
from d_pchome_appid b
left outer join
f_selfpcapachelog_deserialize_hour
and dlfrom='webdl%23downloadpchomenet'
and report='dlurl'
on(regexp_replace(b.appid,'[.-]','_')=a.appid)
81.没有group by的时候会出现一个reduce
82.hive中rownumber
add jar /home/hadoop/jar/kuaiyong_UDF_rowNumber.
create temporary function row_number as 'com.kuaiyong.UDF.RowNumber';
select a.dt,a.t
select dt,appid,count(1) cnt
from f_app_hour
where dt&=''
and dt&=''
and appid!=''
and report='install'
group by dt,appid&
order by dt asc,cnt desc
where row_number(dt) &= 100;
83.lateral view
84.hive中left outer join时
on上的条件要小心，不然会加倍数据，出错
from f_jyzlog_hour a
left outer join f_jyz_station_mac b
on (a.jyz_mac=b.station_mac)
where b.dt='${yesterday_formate}'
需要改成如下方式：
from f_jyzlog_hour a
left outer join f_jyz_station_mac b
on a.jyz_mac=b.station_mac and b.dt='${yesterday_formate}'
看来left outer join 右边的表，要不得子查询查好，要不在on中得限制好数据行。
85.show tables '*channel*';
86.union all前后的表可以用相同的alias表名称
87.修改表名
alter table f_appid_devicemac_day_active rename to f_appid_devicemac_day_active_
88.hive中可以按聚合后的字段排序，没有经过聚合的字段不能排序。如下可以排序；
sum(case when date_str&'' and date_str&='' then 1 else 0 end) f_cnt,
sum(case when date_str&='' and date_str&='' then 1 else 0 end) s_cnt
from tmp_chaobo_uuid_stat__new
where date_str&=''
and date_str&=''
and report='dlurl'
group by uuid
order by f_cnt desc
89.一定要小心hive中的where陷阱(写hive中的left outer join 一定要按hive的特性写)
左边表的过滤条件一定要写到on字段里边，
右边表的过滤条件一定不要写到on字段里,
不管是左边表还是右边表都一定要先在子查询中做好过滤再做关联。
insert overwrite local directory '/download/hadoop/tmp//uuid_stat_all_3'
c.uuid,c.userType,c.date_str,c.time,c.ip,c.ip_dist,c.report
from tmp_chaobo_uuid_stat__new c
left outer join
sum(case when date_str&'' and date_str&='' then 1 else 0 end) f_cnt,
sum(case when date_str&='' and date_str&='' then 1 else 0 end) s_cnt
from tmp_chaobo_uuid_stat__new
where date_str&=''
and date_str&=''
and report='dlurl'
group by uuid
on c.uuid=b.uuid and b.f_cnt&1000 and b.s_cnt&100
and b.uuid is not null
90.add file的时候在/home/chaobo/python里不行，移动到/download/hadoop/resultDir中就可以了。
91.hive 优化原则
join需要将大表放在靠右位置；
尽量使用UDF，而不是transform
92.运行hive还有运行自己写的mapreduce，要配置HADOOP_CLASSPATH
(1.hadoop-env.sh中export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/lib
（2./etc/profile 中export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/lib
93.如果使用hive接口时，有NotFoundClass的报错，可能是CLASSPATH的问题。
94.对查询结果order by 会增加一个job。
95.hive 0.10暂时不支持select a.col1,b.col2 from a,b where a.col1=b.col1;
96.hive中有几个group by就至少有几个job。
Multi-group by
97.case when 后的字段用别名group by会有问题。
98.hive默认只支持单字符的分隔符。可以扩展。默认分隔符是\001（ctrl+A）
99.创建与已知表相同结构的表。
create table test_1 like test_2;
100.hive 在查询时的优化
用列剪裁和分区剪裁，设置hive.optimize.cp=hive.optimize.pruner=
列裁剪(column pruning)，只读取需要的列。
分区裁剪(partition pruning)
101.hive防止数据倾斜，join时连接字段数据类型得相同。
102.hive 定制inputFormat,outputFormat
/likai198981/archive//3068586.html
把jar包kuaiyong_util.jar放在hive/lib下。
create temporary function UDF_deDesCipher as 'com.kuaiyong.UDF.UDFDeDESCipher';
create temporary function UDF_urlDecode as 'com.kuaiyong.UDF.UDFURLDecode';
create external table f_gameCenter_user_login_out_day_test_2
sub_guid string,
guid string,
channel string,
report string,
app_joc_aid string,
time string,
requestip string
partitioned by (dt string)
stored as INPUTFORMAT 'com.kuaiyong.util.HiveInputFormatDoubleVertical'&&
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
alter table f_gameCenter_user_login_out_day_test_2 add partition(dt='') location '/input/gameCenter/game_login_out_log/';
查的时候得加add file /usr/local/hive/lib/kuaiyong_util.否则会包class not found 的错误。
103.hive配置默认加载类包路径
conf/hive-env.sh 中修改export HIVE_AUX_JARS_PATH=/usr/local/hive/lib即可。
104.hive修改表名
alter table f_360lite_hour rename to f_ky_crypt_
一次读表，多次写入，好用。
from f_selflog_hour
insert overwrite local directory '/download/hadoop/resultDir/tmp/_dlm2_austate_not0'
select * where dt=''
and regexp_extract(requestargv,'report=([^ &]+)',1) = 'dlurl'
and regexp_extract(requestargv,'ret=([^ &]+)',1) = '0'
and regexp_extract(requestargv,'dlm=([0-9]+)',1)='2'
and regexp_extract(requestargv,'austate=([^ &]+)',1)!='0'
and regexp_extract(requestargv,'austate=([^ &]+)',1) != ''
insert overwrite local directory '/download/hadoop/resultDir/tmp/_dlm104'
select * where dt=''
and regexp_extract(requestargv,'report=([^ &]+)',1) = 'dlurl'
and regexp_extract(requestargv,'ret=([^ &]+)',1) = '0'
and regexp_extract(requestargv,'dlm=([0-9]+)',1)='104'
and regexp_extract(requestargv,'austate=([^ &]+)',1)!=''
106.hive中!= 和is not null 不同。
select appname,appdigitalid from t_ipa_base_infor where dt='' and appname!='' and appdigitalid!='' limit 10;
select app_unique_id,ca_id from t_business_company_app where dt='' and ca_id is not null and app_unique_id is not null limit 10;
107.小文件合并
load data local inpath &/data1/result/tblog/hive/api/hqltest/test1.txt& overwrite into table f_
109.hive中union和连接有时会合并job。
已发表评论数()
请填写推刊名
描述不能大于100个字符!
权限设置：公开
仅自己可见
正文不准确
标题不准确
排版有问题
主题不准确
没有分页内容
图片无法显示
视频无法显示
与原文不一致

hive collect sethive sql 数组长度有长度限制吗

我要回帖

更多关于 hive sql 数组长度的文章

随机推荐

hive collect sethive sql 数组长度有长度限制吗

我要回帖

更多关于 hive sql 数组长度 的文章

随机推荐

更多关于 hive sql 数组长度的文章