HQL练习某视频网站的常规TopN指标分析

点击关注上方“ 知了小巷 ”,

设为“置顶或星标”,第一时间送达干货。

某视频网站的TopN指标需求

1.统计视频观看数Top10

2.统计视频类别热度Top10

3.统计视频观看数Top20所属类别

4.统计视频观看数Top50所关联视频的所属类别Rank排名

5.统计每个类别中的视频热度Top10

6.统计每个类别中视频流量Top10

7.统计上传视频最多的用户Top10以及他们上传的视频

8.统计每个类别视频观看数Top10

指标分析所涉及的核心数据结构

1.视频表

字段 备注 详细描述
video_id 视频唯一id 11位字符串
uploader 视频上传者 上传视频的用户名String
age 视频年龄 视频在平台上的整数天
category 视频类别 上传视频指定的视频分类
length 视频长度 整形数字标识的视频长度
views 观看次数 视频被浏览的次数
rate 视频评分 满分5分
ratings 流量 视频的流量,整型数字
coments 评论数 一个视频的整数评论数
related_id 相关视频id 相关视频的id,最多20个

2.用户表

字段 备注 字段类型
uploader 上传者用户名 string
videos 上传视频数 int
friends 朋友数量 int

建表和数据准备

创建表(load原始数据)

ods_video_ori

ods_video_user_ori

创建表(默认TEXTFILE转到ORC格式)

ods_video_orc

ods_video_user_orc

ods_video_ori

create table ods_video_ori (

video_id string,

uploader string,

age int,

category array<string>,

length int,

views int,

rate float,

ratings int,

comments int,

related_id array<string>

) row format delimited

fields terminated by "\t"

collection items terminated by "&"

stored as textfile;

ods_video_user_ori

create table ods_video_user_ori (

uploader string,

videos int,

friends int

) row format delimited

fields terminated by "\t"

stored as textfile;

然后把原始数据插入到ORC表中

ods_video_orc

create table ods_video_orc (

video_id string,

uploader string,

age int,

category array<string>,

length int,

views int,

rate float,

ratings int,

comments int,

related_id array<string>

) row format delimited fields terminated by "\t"

collection items terminated by "&"

stored as orc;

ods_video_user_orc

create table ods_video_user_orc (

uploader string,

videos int,

friends int

) row format delimited

fields terminated by "\t"

stored as orc;

导入数据

ods_video_ori(一个文件夹下多个文件)

hive> load data local inpath "/Users/xxx/Development/logs/video/" into table ods_video_ori;

Loading data to table default.ods_video_ori

OK

hive> select * from ods_video_ori limit 5;

OK

LKh7zAJ4nwo TheReceptionist 653 ["Entertainment"] 424 13021 4.34 1305 744 ["DjdA-5oKYFQ"]

7D0Mf4Kn4Xk periurban 583 ["Music"] 201 6508 4.19 687 312 ["e2k0h6tPvGc"]

n1cEq1C8oqQ Pipistrello 525 ["Comedy"] 125 1687 4.01 363 141 ["eprHhmurMHg"]

OHkEzL4Unck ichannel 638 ["Comedy"] 299 8043 4.4 518 371 ["eyUSTmEUQRg"]

-boOvAGNKUc mrpitifulband 639 ["Music"] 287 7548 4.48 606 386 ["fmUwUURgsX0"]

Time taken: 0.293 seconds, Fetched: 5 row(s)

ods_video_user_ori

hive> load data local inpath "/Users/xxx/Development/logs/user.txt" into table ods_video_user_ori;

hive> select * from ods_video_user_ori limit 5;

OK

barelypolitical 151 5106

bonk65 89 144

camelcars 26 674

cubskickass34 13 126

boydism08 32 50

Time taken: 2.855 seconds, Fetched: 5 row(s)

向ORC表插入数据

ods_video_orc

insert into table ods_video_orc select * from ods_video_ori;

ods_video_user_orc

insert into table ods_video_user_orc select * from ods_video_user_ori;

业务分析

统计视频观看数Top10

思路:使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。

【全局排序】

最终sql:

select

video_id,

uploader,

age,

category,

length,

views,

rate,

ratings,

comments

from ods_video_orc

order by views desc limit 10;

需要注意内存不够的话会报错:

Ended Job = job_local1559464187_0005 with errors

Error during job, obtaining debugging information...

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

统计视频类别热度Top10

思路:某类别下的视频越多则热度越高

1.即统计每个类别有多少个视频,显示出包含视频最多的前10个类别。

2.我们需要按照类别group by聚合,然后count组内的video_id个数即可。

3.因为当前表结构为:一个视频对应一个或多个类别。所以如果要group by类别,需要先将类别进行列转行(展开),然后再进行count即可。

4.最后按照热度排序,显示前10条。

【lateral view explode的使用】

最终sql:

select

category_name as category,

count(t1.video_id) as hot

from (

select

video_id,

category_name

from ods_video_orc lateral view explode(category) t_catetory as category_name

) t1

group by t1.category_name

order by hot desc limit 10;

...

MapReduce Jobs Launched:

Stage-Stage-1: HDFS Read: 13812458 HDFS Write: 0 SUCCESS

Stage-Stage-2: HDFS Read: 13812458 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Music 179049

Entertainment 127674

Comedy 87818

Film 73293

Animation 73293

Sports 67329

Games 59817

Gadgets 59817

Blogs 48890

People 48890

...

统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

思路:

1.先找到观看数最高的20个视频所属条目的所有信息,降序排列

2.把这20条信息中的category分裂出来(列转行)

3.最后查询视频分类名称和该分类下有多少个Top20的视频

最终sql:

select

category_name as category,

count(t2.video_id) as hot_with_views

from (

select

video_id,

category_name

from (

select

*

from

ods_video_orc

order by

views

desc limit

20

) t1 lateral view explode(category) t_catetory as category_name

) t2

group by category_name

order by hot_with_views desc;

...

MapReduce Jobs Launched:

Stage-Stage-1: HDFS Read: 30471366 HDFS Write: 0 SUCCESS

Stage-Stage-2: HDFS Read: 30471366 HDFS Write: 0 SUCCESS

Stage-Stage-3: HDFS Read: 30471366 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Entertainment 6

Comedy 6

Music 5

People 2

Blogs 2

UNA 1

...

统计视频观看数Top50所关联视频的所属类别排序

思路:

1.查询出观看数最多的前50个视频的所有信息(当然包含了每个视频对应的关联视频),记为临时表t1

t1:观看数前50的视频

select

*

from ods_video_orc

order by views desc limit 50;

2.将找到的50条视频信息的相关视频related_id列转行,记为临时表t2

t2:将相关视频的id进行列转换操作

select

explode(related_id) as videoId

from t1;

3.将相关视频的id和ods_video_orc表进行inner join操作

t5:得到两列数据,一列是category,一列是之前查询出来的相关视频id

(

select

distinct(t2.video_id),

t3.category

from t2 inner join ods_video_orc t3 on t2.videoId = t3.videoId

) t4 lateral view explode(category) t_catetory as category_name;

4.按照视频类别进行分组,统计每组视频个数,然后排行

最终sql:

select

category_name as category,

count(t5.video_id) as hot

from (

select

video_id,

category_name

from (

select

distinct(t2.video_id),

t3.category

from (

select

explode(related_id) as video_id

from (

select

*

from ods_video_orc order by views desc limit 50

) t1

) t2 inner join ods_video_orc t3 on t2.video_id = t3.video_id

) t4 lateral view explode(category) t_catetory as category_name

) t5

group by category_name order by hot desc;

统计每个类别中的视频热度Top10,以Music为例

思路:

1.要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开,所以可以创建一张表用于存放category_id展开的数据。

2.向category展开的表中插入数据。

3.统计对应类别(Music)中的视频热度。

最终sql:

创建表类别表:

create table ods_video_category(

video_id string,

uploader string,

age int,

category_id string,

length int,

views int,

rate float,

ratings int,

comments int,

related_id array<string>)

row format delimited

fields terminated by "\t"

collection items terminated by "&"

stored as orc;

向类别表中插入数据:

insert into table ods_video_category

select

video_id,

uploader,

age,

category_id,

length,

views,

rate,

ratings,

comments,

related_id

from ods_video_orc lateral view explode(category) catetory as category_id;

hive> select count(1) from ods_video_category;

OK

1019206

统计Music类别的Top10(也可以统计其他)

select

video_id,

views

from ods_video_category

where category_id = "Music"

order by views desc limit 10;

统计每个类别中视频流量Top10,以Music为例

思路:

1.创建视频类别展开表(category_id列转行后的表)

2.按照ratings排序即可

最终sql:

select

video_id,

views,

ratings

from ods_video_category

where category_id = "Music"

order by ratings desc limit 10;

统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频

思路:

1.先找到上传视频最多的10个用户的用户信息

select

*

from ods_video_user_orc

order by videos desc limit 10;

2.通过uploader字段与ods_video_orc表进行join,得到的信息按照views观看次数进行排序即可。

最终sql:

select

t2.video_id,

t2.views,

t2.ratings,

t1.videos,

t1.friends

from (

select

*

from

ods_video_user_orc

order by videos desc

limit 10

) t1 join ods_video_orc t2 on t1.uploader = t2.uploader

order by views desc limit 20;

统计每个类别视频观看数Top10

思路:

1.先得到category_id展开的表数据

2.子查询按照category_id进行分区,然后分区内排序,并生成递增数字,该递增数字这一列起名为rank列

3.通过子查询产生的临时表,查询rank值小于等于10的数据行即可。

最终sql:

select

t1.*

from (

select

video_id,

category_id,

views,

row_number() over(partition by category_id order by views desc

) rank

from ods_video_category

) t1

where rank <= 10;

推荐阅读:

Hive企业级调优

HiveQL查询连续三天有销售记录的店铺

HiveQL实战蚂蚁森林低碳用户排名分析:解法一

HiveQL实战蚂蚁森林低碳用户排名分析:解法二

HiveQL实战蚂蚁森林植物申领统计分析

Hive-函数

Hive-查询

Hive-DML(Data Manipulation Language)数据操作语言

Hive-DDL(Data Definition Language)数据定义

Hive优化(整理版)

Spark Core之Shuffle解析

数据仓库开发规范

欢迎关注我的公众号“ 知了小巷 ”,如果喜欢,可以点一下“ 在看 ”~ 谢谢~~

我来评几句
登录后评论

已发表评论数()

相关站点

+订阅
热门文章