Big Data Development Engineer - Week 12 Capstone Project: E-commerce Data Warehouse - User Behavior Warehouse (Part 4)


The current layout of the user behavior data directories on HDFS (see the screenshots):


User Behavior Data Warehouse Development

The data warehouse has four layers: the ODS, DWD, DWS, and APP layers.
We start by building the first layer: the ODS layer.

Notes:
1. Since we will create several databases while building the warehouse, always prefix table names with their database name when creating and querying tables; otherwise you can run into avoidable problems, such as accidentally creating an ODS-layer table in the DWD database.
2. To allow SQL statements to be re-run safely, add if not exists to the DDL.
3. Hive can represent dates and times as string, date, or timestamp: date uses the form yyyy-MM-dd, timestamp uses yyyy-MM-dd HH:mm:ss, and string can represent both forms.
The three types can be converted into one another, but it is best to standardize on one format. Since string covers both of the other forms and also supports date comparison, we use string for all time fields here.
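To illustrate the last point, zero-padded yyyy-MM-dd strings compare correctly even with plain lexicographic comparison. A minimal shell sketch, independent of Hive:

```shell
# Lexicographic order matches chronological order for fixed-width,
# zero-padded date strings, which is why comparing date strings works.
d1="2026-02-01"
d2="2026-02-10"
if [ "$d1" \< "$d2" ]; then
  echo "$d1 is earlier than $d2"
fi
```

The same property is what makes `where dt >= '20260125'`-style predicates on string partitions behave as expected.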

ODS Layer

Create the ODS-layer tables.
The DDL statement for each table is listed below.

Note: when using these DDL statements, make sure the date directories in them match the date directories actually generated in your HDFS.
-- always use database.table so we never have to switch the current database later
create external table if not exists ods_mall.ods_user_active(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';

-- relative partition path: the table location already ends with '/', so the path here must not start with '/';
-- a relative path is resolved against the table location, while an absolute HDFS path always works
alter table ods_mall.ods_user_active add if not exists partition(dt='20260101') location '20260101/1';

create external table if not exists ods_mall.ods_click_good(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';

alter table ods_mall.ods_click_good add if not exists partition(dt='20260101') location '20260101/2';

create external table if not exists ods_mall.ods_good_item(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';

alter table ods_mall.ods_good_item add if not exists partition(dt='20260101') location '20260101/3';

create external table if not exists ods_mall.ods_good_list(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';

alter table ods_mall.ods_good_list add if not exists partition(dt='20260101') location '20260101/4';

create external table if not exists ods_mall.ods_app_close(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';

alter table ods_mall.ods_app_close add if not exists partition(dt='20260101') location '20260101/5';

ODS-Layer Scripts

Scripts for the ODS layer:
1. Table initialization script (run once at setup):
ods_mall_init_table.sh
2. Partition-adding script (run once per day):
add_partition.sh: a generic script that can be reused anywhere a partition needs to be added.
ods_mall_add_partition.sh
ods_mall_init_table.sh
#!/bin/bash
# Initializes the ODS-layer database and tables; only needs to run once.

hive -e "
create database if not exists ods_mall;

create external table if not exists ods_mall.ods_user_active(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';


create external table if not exists ods_mall.ods_click_good(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';


create external table if not exists ods_mall.ods_good_item(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';


create external table if not exists ods_mall.ods_good_list(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';


create external table if not exists ods_mall.ods_app_close(
log string
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/ods/user_action/';

"
add_partition.sh
#!/bin/bash
# Adds a partition to an external partitioned table.
# Takes three arguments:
# 1: table name
# 2: value of the partition field dt, in the form 20260101
# 3: partition path (relative or absolute)

if [ $# != 3 ]
then
echo "invalid arguments: add_partition.sh <table_name> <dt> <path>"
exit 100
fi

table_name=$1
dt=$2
path=$3

hive -e "
alter table ${table_name} add if not exists partition(dt='${dt}') location '${path}';
"
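The argument check at the top of the script can be exercised on its own. A minimal sketch (the function name check_args is made up for illustration; the real script runs hive afterwards):

```shell
# Sketch of add_partition.sh's argument validation as a shell function.
check_args() {
  if [ $# != 3 ]; then
    echo "invalid arguments: add_partition.sh <table_name> <dt> <path>"
    return 100
  fi
  echo "table=$1 dt=$2 path=$3"
}

# A well-formed call, mirroring the ODS examples above:
check_args ods_mall.ods_user_active 20260101 20260101/1
# prints: table=ods_mall.ods_user_active dt=20260101 path=20260101/1
```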
ods_mall_add_partition.sh
#!/bin/bash
# Adds the daily partitions to the ODS-layer tables; runs once per day.
# Every day shortly after midnight, add yesterday's partitions, then run the downstream computation scripts.

# Default to yesterday's date; a specific date can also be passed as an argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

#alter table ods_mall.ods_user_active add if not exists partition(dt='20260101') location '20260101/1';
#alter table ods_mall.ods_click_good add if not exists partition(dt='20260101') location '20260101/2';
#alter table ods_mall.ods_good_item add if not exists partition(dt='20260101') location '20260101/3';
#alter table ods_mall.ods_good_list add if not exists partition(dt='20260101') location '20260101/4';
#alter table ods_mall.ods_app_close add if not exists partition(dt='20260101') location '20260101/5';
sh add_partition.sh ods_mall.ods_user_active ${dt} ${dt}/1
sh add_partition.sh ods_mall.ods_click_good ${dt} ${dt}/2
sh add_partition.sh ods_mall.ods_good_item ${dt} ${dt}/3
sh add_partition.sh ods_mall.ods_good_list ${dt} ${dt}/4
sh add_partition.sh ods_mall.ods_app_close ${dt} ${dt}/5
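The dt-defaulting logic above relies on GNU date's --date option. The same mechanism also handles fixed offsets such as "seven days before a given day", which the churn requirement later needs. A small sketch (assumes GNU coreutils date, as the scripts already do):

```shell
# Yesterday relative to now (what the script uses when no argument is given):
date +%Y%m%d --date="1 days ago"

# A fixed offset from a fixed date, as needed later for the 7-day churn window:
date +%Y%m%d --date="2026-02-01 -7 days"
# prints: 20260125
```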

DWD Layer

Clean the data in the ODS-layer tables according to the data-cleaning rules, adapting them to the actual data.
Note: if a cleaning rule can be implemented in SQL, implement it in SQL. If implementing it in SQL would be very cumbersome, or outright impossible, consider cleaning the data with MapReduce or Spark code instead.
The data collected here is fairly well-structured and SQL is sufficient, so we implement the cleaning directly in SQL.
Create the DWD-layer tables.
Notes:
1. In the raw JSON the user-id field is named uid, while in the product-order data it is named user_id. Watch out for this: in real work, individual field names often differ between client-side and server-side data, so it is best to unify them for convenience later. Here we extract the field by uid and alias it to user_id.
2. Hive's timestamp can only parse data in yyyy-MM-dd HH:mm:ss form, so we use bigint for the acttime field.
3. To allow SQL re-runs, prefer insert overwrite table (replace) over insert into table (append); otherwise re-running the SQL would write duplicate data.
create external table if not exists dwd_mall.dwd_user_active(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
ad_status tinyint,
loading_time bigint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/user_active/';


insert overwrite table dwd_mall.dwd_user_active partition(dt='20260101') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.ad_status') as ad_status,
get_json_object(log,'$.loading_time') as loading_time
from
(
select log from ods_mall.ods_user_active where dt = '20260101' group by log -- group by deduplicates the rows (Flume may deliver duplicates); distinct would also work but is less efficient
) as tmp
where get_json_object(log,'$.xaid') !='';


create external table if not exists dwd_mall.dwd_click_good(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
goods_id bigint,
location tinyint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/click_good/';


insert overwrite table dwd_mall.dwd_click_good partition(dt='20260101') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.goods_id') as goods_id,
get_json_object(log,'$.location') as location
from
(
select log from ods_mall.ods_click_good where dt = '20260101' group by log
) as tmp
where get_json_object(log,'$.xaid') !=''; -- an empty xaid means the record came from the PC side; we drop it and keep only app-side data (PC traffic is rare in practice anyway)



create external table if not exists dwd_mall.dwd_good_item(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
goods_id bigint,
stay_time bigint,
loading_time bigint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/good_item/';


insert overwrite table dwd_mall.dwd_good_item partition(dt='20260101') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.goods_id') as goods_id,
get_json_object(log,'$.stay_time') as stay_time,
get_json_object(log,'$.loading_time') as loading_time
from
(
select log from ods_mall.ods_good_item where dt = '20260101' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';


create external table if not exists dwd_mall.dwd_good_list(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
loading_time bigint,
loading_type tinyint,
goods_num tinyint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/good_list/';



insert overwrite table dwd_mall.dwd_good_list partition(dt='20260101') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.loading_time') as loading_time,
get_json_object(log,'$.loading_type') as loading_type,
get_json_object(log,'$.goods_num') as goods_num
from
(
select log from ods_mall.ods_good_list where dt = '20260101' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';


create external table if not exists dwd_mall.dwd_app_close(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/app_close/';


insert overwrite table dwd_mall.dwd_app_close partition(dt='20260101') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime
from
(
select log from ods_mall.ods_app_close where dt = '20260101' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';

DWD-Layer Scripts

1. Table initialization script (run once at setup):
dwd_mall_init_table.sh
2. Partition-adding script (run once per day):
dwd_mall_add_partition.sh
dwd_mall_init_table.sh
#!/bin/bash
# Initializes the DWD-layer database and tables; only needs to run once.

hive -e "
create database if not exists dwd_mall;

create external table if not exists dwd_mall.dwd_user_active(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
ad_status tinyint,
loading_time bigint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/user_active/';




create external table if not exists dwd_mall.dwd_click_good(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
goods_id bigint,
location tinyint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/click_good/';



create external table if not exists dwd_mall.dwd_good_item(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
goods_id bigint,
stay_time bigint,
loading_time bigint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/good_item/';



create external table if not exists dwd_mall.dwd_good_list(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint,
loading_time bigint,
loading_type tinyint,
goods_num tinyint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/good_list/';


create external table if not exists dwd_mall.dwd_app_close(
user_id bigint,
xaid string,
platform tinyint,
ver string,
vercode string,
net bigint,
brand string,
model string,
display string,
osver string,
acttime bigint
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/app_close/';
"
dwd_mall_add_partition.sh
#!/bin/bash
# Cleans the ODS-layer tables and writes the results into the matching partitions of the DWD-layer tables.
# Runs once per day, shortly after midnight.

# Default to yesterday's date; a specific date can also be passed as an argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

hive -e "
insert overwrite table dwd_mall.dwd_user_active partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.ad_status') as ad_status,
get_json_object(log,'$.loading_time') as loading_time
from
(
select log from ods_mall.ods_user_active where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';



insert overwrite table dwd_mall.dwd_click_good partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.goods_id') as goods_id,
get_json_object(log,'$.location') as location
from
(
select log from ods_mall.ods_click_good where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';


insert overwrite table dwd_mall.dwd_good_item partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.goods_id') as goods_id,
get_json_object(log,'$.stay_time') as stay_time,
get_json_object(log,'$.loading_time') as loading_time
from
(
select log from ods_mall.ods_good_item where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';



insert overwrite table dwd_mall.dwd_good_list partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.loading_time') as loading_time,
get_json_object(log,'$.loading_type') as loading_type,
get_json_object(log,'$.goods_num') as goods_num
from
(
select log from ods_mall.ods_good_list where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';


insert overwrite table dwd_mall.dwd_app_close partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime
from
(
select log from ods_mall.ods_app_close where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';
"

Requirements Analysis

The tables in the first two layers are largely independent of specific requirements: they simply ingest the existing data and clean it.
The DWS and APP layers, however, are tied to the business. When building their tables we therefore need to analyze some typical business scenarios, and design the tables to be as generic as possible so they can also serve similar requirements later.
In other words, when building tables around a business requirement, don't try to solve everything in a single SQL statement. Split complex SQL along useful dimensions into intermediate tables, then compute the final results from those intermediate tables. The data in the intermediate tables can then be reused for similar requirements later on.
Requirement 1: daily new-user metrics
Requirement 2: daily active-user metrics (DAU)
Requirement 3: 7-day churned-user push reminders
Requirement 4: daily app-launch-count metrics
Requirement 5: active-user metrics by operating system
Requirement 6: app-crash metrics

When computing these requirements, to make sure the results you get while practicing match mine, the test data from here on is no longer generated randomly but fixed: one month of data, from 2026-02-01 to 2026-02-28.

Then run the ODS-layer and DWD-layer scripts to reload and recompute the data for February 2026.

tmp_load_ods_data.sh

1. Run the ODS-layer script.
Write a temporary script containing a for loop that loads the data day by day.
#!/bin/bash
# Load the ODS-layer data

for((i=1;i<=28;i++))
do
if [ $i -lt 10 ]
then
dt="2026020"$i
else
dt="202602"$i
fi

echo "ods_mall_add_partition.sh" ${dt}
sh ods_mall_add_partition.sh ${dt}
done
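The if/else zero-padding in the loop above can also be written with printf's %02d format; a minimal equivalent sketch:

```shell
# printf zero-pads the day-of-month, avoiding the explicit if/else branch.
for ((i = 1; i <= 28; i++)); do
  dt=$(printf '202602%02d' "$i")
  echo "$dt"
done
# prints 20260201 through 20260228, one per line
```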

tmp_load_dwd_data.sh

#!/bin/bash
# Load the DWD-layer data

for((i=1;i<=28;i++))
do
if [ $i -lt 10 ]
then
dt="2026020"$i
else
dt="202602"$i
fi

echo "dwd_mall_add_partition.sh" ${dt}
sh dwd_mall_add_partition.sh ${dt}
done


Requirement 1: Daily New-User Metrics

When counting new users, a user is identified by the device ID (the xaid field): every device has a unique device code. Because users may not be logged in, and several people may share one account, counting by user id would be inaccurate.
A new user is someone who installs and uses the app for the first time; a user who uninstalls and later comes back does not count as new again.
"New users" could equally well be called "new devices": one device corresponds to one user.
1. Daily new-user count
2. Day-over-day and week-over-week change of the daily new-user count
Let's analyze metric 1 in detail first: the daily new-user count.
In real work this metric measures how fast the product's user base is growing. If the daily new-user count keeps rising, the product has momentum; if growth slows or drops over some period, it is time to think about how to acquire new users.
As analyzed above, a new user is someone who installs and opens the APP for the first time, and we have a tracking event that reports the "user opened the APP" action, so we compute the new-user count from that data.
ODS-layer table: ods_user_active
DWD-layer table: dwd_user_active
Implementation:
1. From the cleaned open-app data, build a history table containing the xaid field, with each day's data deduplicated by xaid.
2. To compute the new users for 2026-02-01, take the open-app records reported that day and left join them against the history table on xaid; the rows that fail to match are the new users.
Daily new-user count
For example:
(1) Step 1 produces a history table, dws_user_active_history, with an xaid column:
dws_user_active_history
xaid
a1
b1
c1
d1
(2) Step 2 produces a temporary table holding the open-app records reported that day:
dws_user_active_20260201_tmp
xaid
a1
b1
x1
y1
z1
(3) Left join the two tables:
dws_user_active_20260201_tmp    dws_user_active_history
xaid                            xaid
a1                              a1
b1                              b1
x1                              null
y1                              null
z1                              null
The rows where dws_user_active_history.xaid is null are that day's new users.

3. Save the computed daily new users into the table dws_user_new_item, partitioned by day, so that other requirements can use it later.

4. Aggregate dws_user_new_item and save the resulting new-user counts into the result table app_user_new_count.

Note: after all this processing, the data in the temporary table dws_user_active_20260201_tmp still has to be inserted into the history table dws_user_active_history.
Finally, drop the temporary table.
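The left-join-where-null pattern above is an anti-join; the same idea can be sketched at the shell level with sorted files and comm (the file names and contents are made up, taken from the worked example above):

```shell
# History of known devices vs. today's active devices (comm needs sorted input).
printf 'a1\nb1\nc1\nd1\n'     > /tmp/xaid_history.txt
printf 'a1\nb1\nx1\ny1\nz1\n' > /tmp/xaid_today.txt

# comm -23 keeps the lines unique to the first file: today's devices that are
# NOT in the history, i.e. the day's new users.
comm -23 /tmp/xaid_today.txt /tmp/xaid_history.txt
# prints: x1, y1, z1 (one per line)
```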
Day-over-day and week-over-week change of the daily new-user count
Now for metric 2: the day-over-day and week-over-week change of the daily new-user count.
Year-over-year comparison usually sets this period's figure against the same period of a previous year, e.g. February 2026 vs. February 2025; the period can also be a month or a week.

Period-over-period comparison usually sets this period's figure against the immediately preceding period, e.g. February 2026 vs. January 2026; the period can also be a week or a day.

In real work these ratios measure how fast a metric is changing, and they feed into product managers' decisions.
day-over-day ratio = (this period's figure - previous period's figure) / previous period's figure, with a period of one day
week-over-week ratio = (this period's figure - previous period's figure) / previous period's figure, with a period of one week (7 days)

Implementation:
Compute directly from app_user_new_count; from it we can derive any given day's day-over-day and week-over-week ratios.
Store them in a new table, app_user_new_count_ratio, containing the date, the new-user count, the day-over-day ratio, and the week-over-week ratio.
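For instance, if the previous day's new-user count was 100 and today's is 120, the day-over-day ratio is (120-100)/100 = 0.20. A minimal sketch of the formula (the numbers are made up):

```shell
# day-over-day ratio = (current - previous) / previous
awk 'BEGIN { num = 120; num_1 = 100; printf "%.2f\n", (num - num_1) / num_1 }'
# prints: 0.20
```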
Summary:
Based on the requirements analysis above, we need three tables in the DWS layer:
dws_user_active_20260201_tmp

dws_user_new_item

dws_user_active_history

and two tables in the APP layer:
app_user_new_count

app_user_new_count_ratio
Next we create these tables in the data warehouse.
DWS Layer
create table if not exists dws_mall.dws_user_active_20260201_tmp(
xaid string,
times int
);

insert overwrite table dws_mall.dws_user_active_20260201_tmp select
xaid,
count(*) as times
from dwd_mall.dwd_user_active
where dt = '20260201'
group by xaid;


create external table if not exists dws_mall.dws_user_active_history(
xaid string,
times int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_active_history';


create external table if not exists dws_mall.dws_user_new_item(
xaid string
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_new_item';


insert overwrite table dws_mall.dws_user_new_item partition(dt='20260201') select
duat.xaid
from dws_mall.dws_user_active_20260201_tmp duat
left join (select xaid from dws_mall.dws_user_active_history group by xaid) duah
on duat.xaid = duah.xaid
where duah.xaid is null;

insert overwrite table dws_mall.dws_user_active_history partition(dt='20260201') select
xaid,
times
from dws_mall.dws_user_active_20260201_tmp;

create external table if not exists dws_mall.dws_user_lost_item(
xaid string
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_lost_item';

insert overwrite table dws_mall.dws_user_lost_item partition(dt='20260201') select
xaid
from dws_mall.dws_user_active_history
where dt >= regexp_replace(date_add('2026-02-01',-7),'-','')
group by xaid
having max(dt) = regexp_replace(date_add('2026-02-01',-7),'-','');



create external table if not exists dws_mall.dws_user_platform_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_platform_distrib';

insert overwrite table dws_mall.dws_user_platform_distrib partition(dt='20260201') select
case platform
when 1 then 'android'
when 2 then 'ios'
end ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '20260201' and platform in (1,2)
group by platform;



create external table if not exists dws_mall.dws_user_android_osver_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_android_osver_distrib';

insert overwrite table dws_mall.dws_user_android_osver_distrib partition(dt='20260201') select
osver as ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '20260201' and platform = 1
group by osver;


create external table if not exists dws_mall.dws_user_ios_osver_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_ios_osver_distrib';

insert overwrite table dws_mall.dws_user_ios_osver_distrib partition(dt='20260201') select
osver as ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '20260201' and platform = 2
group by osver;


create external table if not exists dws_mall.dws_user_brand_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_brand_distrib';

insert overwrite table dws_mall.dws_user_brand_distrib partition(dt='20260201') select
brand as ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '20260201'
group by brand;



create external table if not exists dws_mall.dws_user_model_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_model_distrib';

insert overwrite table dws_mall.dws_user_model_distrib partition(dt='20260201') select
model as ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '20260201'
group by model;


create external table if not exists dws_mall.dws_user_net_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_net_distrib';

insert overwrite table dws_mall.dws_user_net_distrib partition(dt='20260201') select
case net
when 0 then '未知'
when 1 then 'WIFI'
when 2 then '2G'
when 3 then '3G'
when 4 then '4G'
when 5 then '5G'
end ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '20260201'
group by net;


create external table if not exists dws_mall.dws_app_close_platform_vercode(
platform string,
vercode string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/app_close_platform_vercode';

insert overwrite table dws_mall.dws_app_close_platform_vercode partition(dt='20260201') select
platform,
vercode,
count(*) as num
from dwd_mall.dwd_app_close
where dt = '20260201' and platform in (1,2)
group by platform,vercode;
Scripts for the DWS layer:
1. Table initialization script (run once at setup):
dws_mall_init_table_1.sh
2. Partition-adding script (run once per day):
dws_mall_add_partition_1.sh
dws_mall_init_table_1.sh
#!/bin/bash
# Requirement 1: daily new-user metrics
# Initializes the DWS-layer database and tables; only needs to run once.


# This temporary table is created fresh each day and dropped after use, so its DDL lives in the partition-adding script instead:
#create table if not exists dws_mall.dws_user_active_20260201_tmp(
# xaid string,
# times int
#);

hive -e "
create database if not exists dws_mall;

create external table if not exists dws_mall.dws_user_active_history(
xaid string,
times int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_active_history';


create external table if not exists dws_mall.dws_user_new_item(
xaid string
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_new_item';
"
dws_mall_add_partition_1.sh
#!/bin/bash
# Requirement 1: daily new-user metrics
# Runs once per day, shortly after midnight.

# Default to yesterday's date; a specific date can also be passed as an argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

hive -e "
create table if not exists dws_mall.dws_user_active_${dt}_tmp(
xaid string,
times int
);

insert overwrite table dws_mall.dws_user_active_${dt}_tmp select
xaid,
count(*) as times
from dwd_mall.dwd_user_active
where dt = '${dt}'
group by xaid;

-- Note: to allow this script to be re-run, drop the given day's partition of dws_user_active_history first,
-- because computing the daily new users requires joining against dws_user_active_history.
alter table dws_mall.dws_user_active_history drop if exists partition(dt='${dt}');


insert overwrite table dws_mall.dws_user_new_item partition(dt='${dt}') select
duat.xaid
from dws_mall.dws_user_active_${dt}_tmp duat
left join (select xaid from dws_mall.dws_user_active_history group by xaid) duah
on duat.xaid = duah.xaid
where duah.xaid is null;

insert overwrite table dws_mall.dws_user_active_history partition(dt='${dt}') select
xaid,
times
from dws_mall.dws_user_active_${dt}_tmp;

-- as described in the requirement analysis, the per-day temporary table is dropped once it has been used
drop table if exists dws_mall.dws_user_active_${dt}_tmp;
"
APP Layer


create external table if not exists app_mall.app_user_new_count(
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_new_count';


insert overwrite table app_mall.app_user_new_count partition(dt='20260201') select
count(*) as num
from dws_mall.dws_user_new_item
where dt = '20260201';


create external table if not exists app_mall.app_user_new_count_ratio(
num int,
day_ratio double,
week_ratio double
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_new_count_ratio';

-- day-over-day and week-over-week ratios
insert overwrite table app_mall.app_user_new_count_ratio partition(dt='20260201') select
num,
(num-num_1)/num_1 as day_ratio,
(num-num_7)/num_7 as week_ratio
from(
select
dt,
num,
lead(num,1) over(order by dt desc) as num_1,
lead(num,7) over(order by dt desc) as num_7
from app_mall.app_user_new_count
where dt >=regexp_replace(date_add('2026-02-01',-7),'-','')
) as t
where dt = '20260201';



create external table if not exists app_mall.app_user_active_count(
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_active_count';

insert overwrite table app_mall.app_user_active_count partition(dt='20260201') select
count(*) as num
from dws_mall.dws_user_active_history
where dt = '20260201';



create external table if not exists app_mall.app_user_active_count_ratio(
num int,
day_ratio double,
week_ratio double
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_active_count_ratio';


insert overwrite table app_mall.app_user_active_count_ratio partition(dt='20260201') select
num,
(num-num_1)/num_1 as day_ratio,
(num-num_7)/num_7 as week_ratio
from(
select
dt,
num,
lead(num,1) over(order by dt desc) as num_1,
lead(num,7) over(order by dt desc) as num_7
from app_mall.app_user_active_count
where dt >=regexp_replace(date_add('2026-02-01',-7),'-','')
) as t
where dt = '20260201';


create external table if not exists app_mall.app_user_lost_count(
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_lost_count';

insert overwrite table app_mall.app_user_lost_count partition(dt='20260201') select
count(*) as num
from dws_mall.dws_user_lost_item
where dt = '20260201';



create external table if not exists app_mall.app_user_open_app_count(
pv int,
uv int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_open_app_count';

insert overwrite table app_mall.app_user_open_app_count partition(dt='20260201') select
sum(times) as pv,
count(*) as uv
from dws_mall.dws_user_active_history
where dt = '20260201';



create external table if not exists app_mall.app_user_open_app_distrib(
ts_1 int,
ts_2 int,
ts_3_m int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_open_app_distrib';

insert overwrite table app_mall.app_user_open_app_distrib partition(dt='20260201') select
sum( case when times = 1 then 1 else 0 end) ts_1,
sum( case when times = 2 then 1 else 0 end) ts_2,
sum( case when times >= 3 then 1 else 0 end) ts_3_m
from dws_mall.dws_user_active_history
where dt = '20260201';





create external table if not exists app_mall.app_user_platform_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_platform_distrib';

insert overwrite table app_mall.app_user_platform_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_platform_distrib
group by ty;


create external table if not exists app_mall.app_user_android_osver_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_android_osver_distrib';

insert overwrite table app_mall.app_user_android_osver_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_android_osver_distrib
group by ty;


create external table if not exists app_mall.app_user_ios_osver_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_ios_osver_distrib';

insert overwrite table app_mall.app_user_ios_osver_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_ios_osver_distrib
group by ty;




create external table if not exists app_mall.app_user_brand_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_brand_distrib';

insert overwrite table app_mall.app_user_brand_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_brand_distrib
group by ty;



create external table if not exists app_mall.app_user_model_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_model_distrib';

insert overwrite table app_mall.app_user_model_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_model_distrib
group by ty;


create external table if not exists app_mall.app_user_net_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_net_distrib';

insert overwrite table app_mall.app_user_net_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_net_distrib
group by ty;


create external table if not exists app_mall.app_app_close_platform_all(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/app_close_platform_all';

insert overwrite table app_mall.app_app_close_platform_all partition(dt='20260201') select
case platform
when 1 then 'android'
when 2 then 'ios'
end ty,
sum(num) as num
from dws_mall.dws_app_close_platform_vercode
where dt = '20260201'
group by platform;




create external table if not exists app_mall.app_app_close_android_vercode(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/app_close_android_vercode';

insert overwrite table app_mall.app_app_close_android_vercode partition(dt='20260201') select
vercode as ty,
sum(num) as num
from dws_mall.dws_app_close_platform_vercode
where dt = '20260201' and platform = 1
group by vercode;

create external table if not exists app_mall.app_app_close_ios_vercode(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/app_close_ios_vercode';

insert overwrite table app_mall.app_app_close_ios_vercode partition(dt='20260201') select
vercode as ty,
sum(num) as num
from dws_mall.dws_app_close_platform_vercode
where dt = '20260201' and platform = 2
group by vercode;
app_mall_init_table_1.sh
#!/bin/bash
# Requirement 1: daily new-user metrics
# Initializes the app-layer database and tables; run once only

hive -e "
create database if not exists app_mall;

create external table if not exists app_mall.app_user_new_count(
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_new_count';


create external table if not exists app_mall.app_user_new_count_ratio(
num int,
day_ratio double,
week_ratio double
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_new_count_ratio';

"
app_mall_add_partition_1.sh
#!/bin/bash
# Requirement 1: daily new-user metrics
# Run once a day, shortly after midnight

# Default to yesterday's date; a date can also be passed as the first argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

# Convert yyyyMMdd (${dt}) to yyyy-MM-dd (${dt_new}) for Hive's date_add()
dt_new=`date +%Y-%m-%d --date="${dt}"`

hive -e "
insert overwrite table app_mall.app_user_new_count partition(dt='${dt}') select
count(*) as num
from dws_mall.dws_user_new_item
where dt = '${dt}';

insert overwrite table app_mall.app_user_new_count_ratio partition(dt='${dt}') select
num,
(num-num_1)/num_1 as day_ratio,
(num-num_7)/num_7 as week_ratio
from(
select
dt,
num,
lead(num,1) over(order by dt desc) as num_1,
lead(num,7) over(order by dt desc) as num_7
from app_mall.app_user_new_count
where dt >=regexp_replace(date_add('${dt_new}',-7),'-','')
) as t
where dt = '${dt}';
"
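The date handling at the top of these partition scripts can be verified on its own. Below is a minimal sketch (the function names `resolve_dt` and `to_dashed` are mine, not from the course scripts) of the default-to-yesterday logic and the yyyyMMdd to yyyy-MM-dd conversion, assuming GNU `date`:

```shell
#!/bin/bash
# Resolve the working date: use $1 if given, otherwise yesterday (yyyyMMdd),
# mirroring the if [ "z$1" = "z" ] check in the partition scripts.
resolve_dt() {
  if [ "z$1" = "z" ]
  then
    date +%Y%m%d --date="1 days ago"
  else
    echo "$1"
  fi
}

# Convert yyyyMMdd to yyyy-MM-dd, the format Hive's date_add() expects;
# GNU date parses the compact form directly.
to_dashed() {
  date +%Y-%m-%d --date="$1"
}

resolve_dt 20260201   # prints 20260201 (the argument wins)
to_dashed 20260201    # prints 2026-02-01
```

Note that `--date="20260201"` relies on GNU `date`; on macOS/BSD you would need something like `date -j -f %Y%m%d` instead.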

image-20230405162216940

image-20230405162233516

Requirement 2: daily active user (DAU) metrics

An active user is anyone who used the APP that day. Because opening the APP is a user-initiated action, this kind of activity is also called proactive activity, i.e. daily active users.
The metrics for this requirement mirror the new-user metrics:
1: daily active user count
2: day-over-day and week-over-week ratios of the daily active user count

First metric: daily active user count.
Since an active user is simply anyone who used the app that day, we can count directly from the dws-layer table dws_user_active_history
to get the day's active user count, and save the result to the app-layer table
app_user_active_count.

Second metric: day-over-day and week-over-week ratios of the daily active count.
This is computed directly from the daily active count table (app_user_active_count), and the final result is saved to the app-layer table app_user_active_count_ratio.
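The ratio SQL relies on `lead(num, k) over (order by dt desc)`: with rows sorted newest-first, `lead(num,1)` is yesterday's count and `lead(num,7)` is the count from the same weekday last week. A quick awk hand-trace (the daily numbers are made up for illustration):

```shell
#!/bin/bash
# Hand-trace of the lead() window used in app_user_active_count_ratio:
# input rows are "dt num" sorted by dt desc, so row 2 plays the role of
# lead(num,1) and row 8 plays the role of lead(num,7).
ratios() {
  printf '%s\n' \
    '20260201 120' \
    '20260131 100' \
    '20260130 90' \
    '20260129 95' \
    '20260128 105' \
    '20260127 98' \
    '20260126 110' \
    '20260125 100' |
  awk 'NR==1 { num = $2 }   # stats day
       NR==2 { n1  = $2 }   # one day earlier
       NR==8 { n7  = $2 }   # seven days earlier
       END   { printf "day_ratio=%.2f week_ratio=%.2f\n", (num-n1)/n1, (num-n7)/n7 }'
}
ratios   # prints: day_ratio=0.20 week_ratio=0.20
```

Here (120-100)/100 = 0.20 for both ratios, matching the `(num-num_1)/num_1` and `(num-num_7)/num_7` expressions in the SQL.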
Extraction scripts for the app layer:
1: Table initialization script (run once during setup)
app_mall_init_table_2.sh
2: Partition-loading script (run once a day)
app_mall_add_partition_2.sh
app_mall_init_table_2.sh
#!/bin/bash
# Requirement 2: daily active user (DAU) metrics
# Initializes the app-layer database and tables; run once only

hive -e "
create database if not exists app_mall;

create external table if not exists app_mall.app_user_active_count(
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_active_count';

create external table if not exists app_mall.app_user_active_count_ratio(
num int,
day_ratio double,
week_ratio double
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_active_count_ratio';
"
app_mall_add_partition_2.sh
#!/bin/bash
# Requirement 2: daily active user (DAU) metrics
# Run once a day, shortly after midnight

# Default to yesterday's date; a date can also be passed as the first argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

# Convert yyyyMMdd (${dt}) to yyyy-MM-dd (${dt_new}) for Hive's date_add()
dt_new=`date +%Y-%m-%d --date="${dt}"`

hive -e "
insert overwrite table app_mall.app_user_active_count partition(dt='${dt}') select
count(*) as num
from dws_mall.dws_user_active_history
where dt = '${dt}';

insert overwrite table app_mall.app_user_active_count_ratio partition(dt='${dt}') select
num,
(num-num_1)/num_1 as day_ratio,
(num-num_7)/num_7 as week_ratio
from(
select
dt,
num,
lead(num,1) over(order by dt desc) as num_1,
lead(num,7) over(order by dt desc) as num_7
from app_mall.app_user_active_count
where dt >=regexp_replace(date_add('${dt_new}',-7),'-','')
) as t
where dt = '${dt}';
"

image-20230405164201393

image-20230405164229164

image-20230405164252247

Extending the requirement
How do we count weekly and monthly active users?
Week: by calendar week; compute the previous week's active users early every Monday.
Month: by calendar month; compute the previous month's active users on the 1st of each month.
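For the weekly case, the previous calendar week's date range can be derived with GNU `date` (the helper names below are mine; this is a sketch, not part of the course scripts):

```shell
#!/bin/bash
# Given a run date (yyyyMMdd), print the Monday of the PREVIOUS calendar week.
# %u is the ISO weekday (Mon=1 .. Sun=7), so stepping back (u-1)+7 days
# always lands on the prior Monday. Requires GNU date.
prev_week_start() {
  local u=$(date -d "$1" +%u)
  date -d "$1 -$(( u - 1 + 7 )) days" +%Y%m%d
}

# Sunday of the previous calendar week: its Monday plus 6 days.
prev_week_end() {
  date -d "$(prev_week_start "$1") +6 days" +%Y%m%d
}

prev_week_start 20260202   # prints 20260126 (2026-02-02 is a Monday)
prev_week_end 20260202     # prints 20260201
```

With this range, the weekly count is just a `where dt between start and end` over the dws-layer daily tables.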

Requirement 3: push reminders for users churned within 7 days

What is churn?
Suppose a user was a new user on 2026-02-02. If they do not use the app again within the following 7 days, i.e. by 2026-02-09, we consider them churned. Exactly how many days of inactivity count as churn is up to the product manager and depends on the product; 7 days is the most common choice in the industry.
What does push mean?
You have probably been bombarded by APP notifications. I disable push for most apps, otherwise my phone fills up with all kinds of notifications every day, which is quite annoying. Those notifications are exactly what the software pushes to you.
Implementation:
1: From dws_user_active_history, take the most recent 8 days of data and group by xaid; this gives each xaid together with the dates (dt) on which it was active.
2: Then filter each xaid's dates: take the maximum dt and check whether it equals (current date - 7). If it does, this user has not used the app in the last 7 days, so we treat them as a 7-day churned user.
For example, suppose dws_user_active_history contains the following rows:
xaid dt
a1 2026-02-01
a1 2026-02-05
b1 2026-02-01
b1 2026-02-02
c1 2026-02-03
Say we want to compute the 7-day churn count on 02-09.
That means counting users who used the APP on 02-02 but never again in the 7 days after that, up to 02-09.
Group by xaid and take the maximum date per user (their most recent use of the app):
a1 2026-02-01,2026-02-05
b1 2026-02-01,2026-02-02
c1 2026-02-03
Check whether that date equals 02-02; if so, the user has not used the app in the 7 days before 02-09. Here b1 qualifies, so b1 is a 7-day churned user.
In the same way we can compute 14-day and 21-day churn, and apply different push strategies depending on how long a user has been gone: tell them an item they follow has dropped in price, or recommend items similar to those they often browsed, to win back activity and ultimately drive orders.
3: Save the qualifying xaid values to the dws-layer table dws_user_lost_item
4: Aggregate dws_user_lost_item to get the 7-day churn count and save it to the app-layer table app_user_lost_count
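The group by / having max(dt) logic above can be traced on the sample rows with a short awk sketch (the data and cutoff are the toy values from the example, not real warehouse data):

```shell
#!/bin/bash
# For each xaid keep the maximum dt, then keep only users whose last
# activity equals the cutoff day (2026-02-02 = stats day 02-09 minus 7):
# the same group by xaid / having max(dt) = ... logic as the Hive query.
churned() {
  printf '%s\n' \
    'a1 2026-02-01' \
    'a1 2026-02-05' \
    'b1 2026-02-01' \
    'b1 2026-02-02' \
    'c1 2026-02-03' |
  awk -v cutoff='2026-02-02' '
    { if ($2 > max[$1]) max[$1] = $2 }   # per-xaid max(dt); ISO dates sort lexically
    END { for (x in max) if (max[x] == cutoff) print x }'
}
churned   # prints: b1
```

As in the worked example, only b1 comes out: a1 was last seen on 02-05 and c1 on 02-03, both after the cutoff.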
dws
create external table if not exists dws_mall.dws_user_lost_item(
xaid string
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_lost_item';

insert overwrite table dws_mall.dws_user_lost_item partition(dt='20260201') select
xaid
from dws_mall.dws_user_active_history
where dt >= regexp_replace(date_add('2026-02-01',-7),'-','')
group by xaid
having max(dt) = regexp_replace(date_add('2026-02-01',-7),'-','');
Extraction scripts
Extraction scripts for the dws layer:
1: Table initialization script (run once during setup)
dws_mall_init_table_3.sh
2: Partition-loading script (run once a day)
dws_mall_add_partition_3.sh
dws_mall_init_table_3.sh
#!/bin/bash
# Requirement 3: 7-day churn push reminders
# Initializes the dws-layer database and tables; run once only

hive -e "
create database if not exists dws_mall;

create external table if not exists dws_mall.dws_user_lost_item(
xaid string
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_lost_item';
"
dws_mall_add_partition_3.sh
#!/bin/bash
# Requirement 3: 7-day churn push reminders
# Run once a day, shortly after midnight

# Default to yesterday's date; a date can also be passed as the first argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

# Convert yyyyMMdd (${dt}) to yyyy-MM-dd (${dt_new}) for Hive's date_add()
dt_new=`date +%Y-%m-%d --date="${dt}"`

hive -e "
insert overwrite table dws_mall.dws_user_lost_item partition(dt='${dt}') select
xaid
from dws_mall.dws_user_active_history
where dt >= regexp_replace(date_add('${dt_new}',-7),'-','')
group by xaid
having max(dt) = regexp_replace(date_add('${dt_new}',-7),'-','');
"
app
create external table if not exists app_mall.app_user_lost_count(
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_lost_count';

insert overwrite table app_mall.app_user_lost_count partition(dt='20260201') select
count(*) as num
from dws_mall.dws_user_lost_item
where dt = '20260201';
Extraction scripts
Extraction scripts for the app layer:
1: Table initialization script (run once during setup)
app_mall_init_table_3.sh
2: Partition-loading script (run once a day)
app_mall_add_partition_3.sh
app_mall_init_table_3.sh
#!/bin/bash
# Requirement 3: 7-day churn push reminders
# Initializes the app-layer database and tables; run once only

hive -e "
create database if not exists app_mall;

create external table if not exists app_mall.app_user_lost_count(
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_lost_count';
"
app_mall_add_partition_3.sh
#!/bin/bash
# Requirement 3: 7-day churn push reminders
# Run once a day, shortly after midnight

# Default to yesterday's date; a date can also be passed as the first argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

hive -e "
insert overwrite table app_mall.app_user_lost_count partition(dt='${dt}') select
count(*) as num
from dws_mall.dws_user_lost_item
where dt = '${dt}';
"

image-20230405172638690

image-20230405173019710

image-20230405173138361

Requirement 4: daily APP launch count metrics

This requirement aggregates the data reported each day when the APP is opened.
Two metrics are needed:
1: average APP launches per user per day
2: distribution of daily APP launch counts (1, 2, 3 or more)

First metric: average APP launches per user per day.
average launches per user = total APP launches that day / number of users that day
For this kind of metric it is best to store the numerator and denominator, because that data may be reused later; if we only store the final ratio, the data cannot be reused.

Implementation:
1: From dws_user_active_history, take the day's data and compute pv and uv from the times column
2: Save the result to the app-layer table app_user_open_app_count


Second metric: distribution of daily APP launch counts (1, 2, 3 or more).
This metric is also based on dws_user_active_history.

Implementation:
Count the rows with times = 1, times = 2, and times >= 3, and save the result to the app-layer table app_user_open_app_distrib
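Both aggregations can be traced on a handful of toy rows ("xaid times", one row per user; the values are made up). The awk below mirrors the `sum(times)`/`count(*)` pair and the three `case when` buckets of the SQL:

```shell
#!/bin/bash
# Toy rows from dws_user_active_history for one day: "xaid times".
sample='u1 1
u2 3
u3 2
u4 1
u5 5'

# pv = sum(times), uv = row count: the numerator and denominator of
# launches-per-user, stored separately so they stay reusable.
open_count() {
  echo "$sample" | awk '{pv += $2; uv++} END {print "pv=" pv " uv=" uv}'
}

# Launch distribution: times = 1, = 2, >= 3, same buckets as the SQL.
open_distrib() {
  echo "$sample" | awk '
    $2 == 1 {t1++}  $2 == 2 {t2++}  $2 >= 3 {t3++}
    END {print "ts_1=" t1 " ts_2=" t2 " ts_3_m=" t3}'
}

open_count    # prints: pv=12 uv=5
open_distrib  # prints: ts_1=2 ts_2=1 ts_3_m=2
```

So the average launches per user for this sample would be 12 / 5 = 2.4, computed downstream from the stored pv and uv.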
app
create external table if not exists app_mall.app_user_open_app_count(
pv int,
uv int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_open_app_count';

insert overwrite table app_mall.app_user_open_app_count partition(dt='20260201') select
sum(times) as pv,
count(*) as uv
from dws_mall.dws_user_active_history
where dt = '20260201';

create external table if not exists app_mall.app_user_open_app_distrib(
ts_1 int,
ts_2 int,
ts_3_m int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_open_app_distrib';

insert overwrite table app_mall.app_user_open_app_distrib partition(dt='20260201') select
sum( case when times = 1 then 1 else 0 end) ts_1,
sum( case when times = 2 then 1 else 0 end) ts_2,
sum( case when times >= 3 then 1 else 0 end) ts_3_m
from dws_mall.dws_user_active_history
where dt = '20260201';
Development scripts
Extraction scripts for the app layer:
1: Table initialization script (run once during setup)
app_mall_init_table_4.sh
2: Partition-loading script (run once a day)
app_mall_add_partition_4.sh
app_mall_init_table_4.sh
#!/bin/bash
# Requirement 4: daily APP launch count metrics
# Initializes the app-layer database and tables; run once only

hive -e "
create database if not exists app_mall;

create external table if not exists app_mall.app_user_open_app_count(
pv int,
uv int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_open_app_count';

create external table if not exists app_mall.app_user_open_app_distrib(
ts_1 int,
ts_2 int,
ts_3_m int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_open_app_distrib';
"
app_mall_add_partition_4.sh
#!/bin/bash
# Requirement 4: daily APP launch count metrics
# Run once a day, shortly after midnight

# Default to yesterday's date; a date can also be passed as the first argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

hive -e "
insert overwrite table app_mall.app_user_open_app_count partition(dt='${dt}') select
sum(times) as pv,
count(*) as uv
from dws_mall.dws_user_active_history
where dt = '${dt}';

insert overwrite table app_mall.app_user_open_app_distrib partition(dt='${dt}') select
sum( case when times = 1 then 1 else 0 end) ts_1,
sum( case when times = 2 then 1 else 0 end) ts_2,
sum( case when times >= 3 then 1 else 0 end) ts_3_m
from dws_mall.dws_user_active_history
where dt = '${dt}';
"

image-20230405175545120

image-20230405175646747

Requirement 5: active-user metrics by operating system

This requirement looks at which operating systems our main user base runs,
since our product ships both an Android APP and an iOS APP.
If more than 80% of users are on Android, we should clearly put more optimization effort into the Android APP to protect the experience of the majority.

We also collect the phone model and resolution so the UI can be adapted better.
The metrics to compute:
1: active users by platform (Android, iOS)
2: active users by Android OS version
3: active users by iOS OS version
4: active users by device brand
5: active users by device model
6: active users by network type
Looking at these metrics together, they are clearly similar: all of them are based on the data reported when users open the app.
In essence they are all group-by aggregations over the corresponding dimension columns of dwd_user_active.
Implementation:
1: Apply the dimensional-modeling ideas covered earlier: use a star schema, with dwd_user_active as the fact table and the dimension tables built around it.
2: In the dws layer, create one aggregate table per dimension above, partitioned by day.
The table names are:
dws_user_platform_distrib
dws_user_android_osver_distrib
dws_user_ios_osver_distrib
dws_user_brand_distrib
dws_user_model_distrib
dws_user_net_distrib
3: Do a global aggregation on top of the dws-layer lightly aggregated data. These metrics only make sense over all the data, not over a single day, so the final results go into plain external tables in the app layer, without a date partition; the data in those tables is regenerated in full every day.
Note that because we keep the per-day aggregates, a later requirement to compute these metrics over some time range can be answered directly from the dws-layer tables; this is one of the benefits of layering the warehouse.

The corresponding app-layer tables are:
active users by platform (Android, iOS): app_user_platform_distrib
active users by Android OS version: app_user_android_osver_distrib
active users by iOS OS version: app_user_ios_osver_distrib
active users by device brand: app_user_brand_distrib
active users by device model: app_user_model_distrib
active users by network type: app_user_net_distrib
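The dws-to-app step is just `sum(num) group by ty` across all daily partitions. A sketch with made-up "ty num" rows from two days of dws_user_platform_distrib-style data:

```shell
#!/bin/bash
# Two days of "ty num" rows; the app-layer table is sum(num) grouped by ty
# over all partitions (values are toy numbers, not real data).
platform_total() {
  printf '%s\n' \
    'android 80' 'ios 20' \
    'android 90' 'ios 30' |
  awk '{sum[$1] += $2} END {for (t in sum) print t, sum[t]}' | sort
}
platform_total   # prints "android 170" then "ios 50"
```

Because the dws rows are already aggregated per day, this final pass only touches a few rows per partition instead of rescanning dwd_user_active.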
dws

image-20230405211242014

create external table if not exists dws_mall.dws_user_platform_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_platform_distrib';

insert overwrite table dws_mall.dws_user_platform_distrib partition(dt='20260201') select
case platform
when 1 then 'android'
when 2 then 'ios'
end ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '20260201' and platform in (1,2)
group by platform;



create external table if not exists dws_mall.dws_user_android_osver_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_android_osver_distrib';

insert overwrite table dws_mall.dws_user_android_osver_distrib partition(dt='20260201') select
osver as ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '20260201' and platform = 1
group by osver;


create external table if not exists dws_mall.dws_user_ios_osver_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_ios_osver_distrib';

insert overwrite table dws_mall.dws_user_ios_osver_distrib partition(dt='20260201') select
osver as ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '20260201' and platform = 2
group by osver;


create external table if not exists dws_mall.dws_user_brand_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_brand_distrib';

insert overwrite table dws_mall.dws_user_brand_distrib partition(dt='20260201') select
brand as ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '20260201'
group by brand;



create external table if not exists dws_mall.dws_user_model_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_model_distrib';

insert overwrite table dws_mall.dws_user_model_distrib partition(dt='20260201') select
model as ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '20260201'
group by model;


create external table if not exists dws_mall.dws_user_net_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_net_distrib';

insert overwrite table dws_mall.dws_user_net_distrib partition(dt='20260201') select
case net
when 0 then '未知'
when 1 then 'WIFI'
when 2 then '2G'
when 3 then '3G'
when 4 then '4G'
when 5 then '5G'
end ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '20260201'
group by net;
Development scripts
dws_mall_init_table_5.sh
#!/bin/bash
# Requirement 5: active-user metrics by operating system
# Initializes the dws-layer database and tables; run once only

hive -e "
create database if not exists dws_mall;

create external table if not exists dws_mall.dws_user_platform_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_platform_distrib';


create external table if not exists dws_mall.dws_user_android_osver_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_android_osver_distrib';


create external table if not exists dws_mall.dws_user_ios_osver_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_ios_osver_distrib';


create external table if not exists dws_mall.dws_user_brand_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_brand_distrib';



create external table if not exists dws_mall.dws_user_model_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_model_distrib';



create external table if not exists dws_mall.dws_user_net_distrib(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/user_net_distrib';

"
dws_mall_add_partition_5.sh
#!/bin/bash
# Requirement 5: active-user metrics by operating system
# Run once a day, shortly after midnight

# Default to yesterday's date; a date can also be passed as the first argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

hive -e "
insert overwrite table dws_mall.dws_user_platform_distrib partition(dt='${dt}') select
case platform
when 1 then 'android'
when 2 then 'ios'
end ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '${dt}' and platform in (1,2)
group by platform;



insert overwrite table dws_mall.dws_user_android_osver_distrib partition(dt='${dt}') select
osver as ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '${dt}' and platform = 1
group by osver;


insert overwrite table dws_mall.dws_user_ios_osver_distrib partition(dt='${dt}') select
osver as ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '${dt}' and platform = 2
group by osver;


insert overwrite table dws_mall.dws_user_brand_distrib partition(dt='${dt}') select
brand as ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '${dt}'
group by brand;


insert overwrite table dws_mall.dws_user_model_distrib partition(dt='${dt}') select
model as ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '${dt}'
group by model;



insert overwrite table dws_mall.dws_user_net_distrib partition(dt='${dt}') select
case net
when 0 then '未知'
when 1 then 'WIFI'
when 2 then '2G'
when 3 then '3G'
when 4 then '4G'
when 5 then '5G'
end ty,
count(*) as num
from dwd_mall.dwd_user_active
where dt = '${dt}'
group by net;
"
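The date-defaulting block at the top of these daily scripts can also be written more compactly with shell parameter expansion; a minimal sketch, assuming GNU date is available:

```shell
#!/bin/bash
# Equivalent to the if/else block above: use $1 when given,
# otherwise default to yesterday in yyyyMMdd form (GNU date)
dt=${1:-$(date +%Y%m%d --date="1 days ago")}
echo "dt=${dt}"
```

Both forms behave the same; the longer if/else version used in the scripts is arguably easier to read for beginners.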


app layer
create external table if not exists app_mall.app_user_platform_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_platform_distrib';

insert overwrite table app_mall.app_user_platform_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_platform_distrib
group by ty;


create external table if not exists app_mall.app_user_android_osver_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_android_osver_distrib';

insert overwrite table app_mall.app_user_android_osver_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_android_osver_distrib
group by ty;


create external table if not exists app_mall.app_user_ios_osver_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_ios_osver_distrib';

insert overwrite table app_mall.app_user_ios_osver_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_ios_osver_distrib
group by ty;




create external table if not exists app_mall.app_user_brand_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_brand_distrib';

insert overwrite table app_mall.app_user_brand_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_brand_distrib
group by ty;



create external table if not exists app_mall.app_user_model_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_model_distrib';

insert overwrite table app_mall.app_user_model_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_model_distrib
group by ty;


create external table if not exists app_mall.app_user_net_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_net_distrib';

insert overwrite table app_mall.app_user_net_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_net_distrib
group by ty;
Script development
Extract the app-layer work into scripts:
1: table initialization script (run once, during setup)
app_mall_init_table_5.sh
2: daily data load script (run once a day)
app_mall_add_partition_5.sh
app_mall_init_table_5.sh
#!/bin/bash
# Requirement 5: active-user metrics by operating system
# app-layer database and table initialization script; run only once

hive -e "
create database if not exists app_mall;

create external table if not exists app_mall.app_user_platform_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_platform_distrib';


create external table if not exists app_mall.app_user_android_osver_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_android_osver_distrib';


create external table if not exists app_mall.app_user_ios_osver_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_ios_osver_distrib';


create external table if not exists app_mall.app_user_brand_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_brand_distrib';

create external table if not exists app_mall.app_user_model_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_model_distrib';



create external table if not exists app_mall.app_user_net_distrib(
ty string,
num int
)row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/user_net_distrib';

"
app_mall_add_partition_5.sh
#!/bin/bash
# Requirement 5: active-user metrics by operating system
# Runs once a day, in the early morning

# Defaults to yesterday's date; a specific date can also be passed as the first argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

hive -e "
insert overwrite table app_mall.app_user_platform_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_platform_distrib
group by ty;

insert overwrite table app_mall.app_user_android_osver_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_android_osver_distrib
group by ty;



insert overwrite table app_mall.app_user_ios_osver_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_ios_osver_distrib
group by ty;



insert overwrite table app_mall.app_user_brand_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_brand_distrib
group by ty;



insert overwrite table app_mall.app_user_model_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_model_distrib
group by ty;


insert overwrite table app_mall.app_user_net_distrib select
ty,
sum(num) as num
from dws_mall.dws_user_net_distrib
group by ty;
"


Requirement 6: APP crash metrics

This requirement tracks APP crashes across platforms and versions. These statistics make it easier to locate and troubleshoot problems: if a particular APP version crashes frequently, the bug should be fixed promptly and a new version pushed, improving the user experience.
The requirement covers the following metrics:
1: Daily crash totals per operating system (Android, iOS)
2: Daily Android crashes, broken down by APP version
3: Daily iOS crashes, broken down by APP version

The three metrics are related: the first is the overall total, while the second and third break it down along a further dimension.
Implementation approach:
For the first metric, the data in the dwd_app_close table can simply be grouped and counted by platform.

Note, however, that the second and third metrics also need to be computed per platform, but with the APP version (vercode) as an additional dimension. If we grouped only by platform, as just described, the latter two metrics would each require recomputing intermediate results, which would forfeit the benefit of a layered data warehouse.

So instead we can do the following:
Group the data in dwd_app_close by platform and vercode, perform a light aggregation, and store the result in the dws-layer table dws_app_close_platform_vercode.
All three metrics can then be computed from dws_app_close_platform_vercode.
Their results are stored in the following app-layer tables:
Daily crash totals per operating system (Android, iOS): app_app_close_platform_all
Daily Android crashes per APP version: app_app_close_android_vercode
Daily iOS crashes per APP version: app_app_close_ios_vercode
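The roll-up from the light aggregation to all three metrics can be illustrated outside Hive with plain shell. A minimal sketch over tab-separated (platform, vercode, num) rows shaped like dws_app_close_platform_vercode (the sample numbers are made up for illustration):

```shell
#!/bin/bash
# Sample rows shaped like dws_app_close_platform_vercode: platform \t vercode \t num
# (platform 1 = android, 2 = ios; the numbers are invented)
data=$'1\t200\t5\n1\t201\t3\n2\t300\t7\n2\t301\t2'

# Metric 1: daily crash total per operating system (sum over all vercodes per platform)
echo "$data" | awk -F'\t' '{sum[$1]+=$3}
  END {for (p in sum) printf "%s\t%d\n", (p==1?"android":"ios"), sum[p]}'

# Metric 2: android crashes per APP version (filter platform = 1, group by vercode)
echo "$data" | awk -F'\t' '$1==1 {sum[$2]+=$3}
  END {for (v in sum) printf "%s\t%d\n", v, sum[v]}'

# Metric 3: ios crashes per APP version (filter platform = 2, group by vercode)
echo "$data" | awk -F'\t' '$1==2 {sum[$2]+=$3}
  END {for (v in sum) printf "%s\t%d\n", v, sum[v]}'
```

Each awk pipeline mirrors one of the three app-layer insert statements: the same filter-and-group logic, just expressed outside Hive.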
dws layer
create external table if not exists dws_mall.dws_app_close_platform_vercode(
platform string,
vercode string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/app_close_platform_vercode';

insert overwrite table dws_mall.dws_app_close_platform_vercode partition(dt='20260201') select
platform,
vercode,
count(*) as num
from dwd_mall.dwd_app_close
where dt = '20260201' and platform in (1,2)
group by platform,vercode;
Script development
Extract the dws-layer work into scripts:
1: table initialization script (run once, during setup)
dws_mall_init_table_6.sh
2: add-partition data script (run once a day)
dws_mall_add_partition_6.sh
dws_mall_init_table_6.sh
#!/bin/bash
# Requirement 6: APP crash metrics
# dws-layer database and table initialization script; run only once

hive -e "
create database if not exists dws_mall;

create external table if not exists dws_mall.dws_app_close_platform_vercode(
platform string,
vercode string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dws/app_close_platform_vercode';
"
dws_mall_add_partition_6.sh
#!/bin/bash
# Requirement 6: APP crash metrics
# Runs once a day, in the early morning

# Defaults to yesterday's date; a specific date can also be passed as the first argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

hive -e "
insert overwrite table dws_mall.dws_app_close_platform_vercode partition(dt='${dt}') select
platform,
vercode,
count(*) as num
from dwd_mall.dwd_app_close
where dt = '${dt}' and platform in (1,2)
group by platform,vercode;
"


app layer
create external table if not exists app_mall.app_app_close_platform_all(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/app_close_platform_all';

insert overwrite table app_mall.app_app_close_platform_all partition(dt='20260201') select
case platform
when 1 then 'android'
when 2 then 'ios'
end ty,
sum(num) as num
from dws_mall.dws_app_close_platform_vercode
where dt = '20260201'
group by platform;




create external table if not exists app_mall.app_app_close_android_vercode(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/app_close_android_vercode';

insert overwrite table app_mall.app_app_close_android_vercode partition(dt='20260201') select
vercode as ty,
sum(num) as num
from dws_mall.dws_app_close_platform_vercode
where dt = '20260201' and platform = 1
group by vercode;

create external table if not exists app_mall.app_app_close_ios_vercode(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/app_close_ios_vercode';

insert overwrite table app_mall.app_app_close_ios_vercode partition(dt='20260201') select
vercode as ty,
sum(num) as num
from dws_mall.dws_app_close_platform_vercode
where dt = '20260201' and platform = 2
group by vercode;
Script development
app_mall_init_table_6.sh
#!/bin/bash
# Requirement 6: APP crash metrics
# app-layer database and table initialization script; run only once

hive -e "
create database if not exists app_mall;

create external table if not exists app_mall.app_app_close_platform_all(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/app_close_platform_all';


create external table if not exists app_mall.app_app_close_android_vercode(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/app_close_android_vercode';


create external table if not exists app_mall.app_app_close_ios_vercode(
ty string,
num int
)partitioned by(dt string)
row format delimited
fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/app/app_close_ios_vercode';
"
app_mall_add_partition_6.sh
#!/bin/bash
# Requirement 6: APP crash metrics
# Runs once a day, in the early morning

# Defaults to yesterday's date; a specific date can also be passed as the first argument
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

hive -e "
insert overwrite table app_mall.app_app_close_platform_all partition(dt='${dt}') select
case platform
when 1 then 'android'
when 2 then 'ios'
end ty,
sum(num) as num
from dws_mall.dws_app_close_platform_vercode
where dt = '${dt}'
group by platform;

insert overwrite table app_mall.app_app_close_android_vercode partition(dt='${dt}') select
vercode as ty,
sum(num) as num
from dws_mall.dws_app_close_platform_vercode
where dt = '${dt}' and platform = 1
group by vercode;

insert overwrite table app_mall.app_app_close_ios_vercode partition(dt='${dt}') select
vercode as ty,
sum(num) as num
from dws_mall.dws_app_close_platform_vercode
where dt = '${dt}' and platform = 2
group by vercode;
"
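The daily scripts for requirements 5 and 6 are meant to run once per day in the early morning; a crontab sketch (the install path, run times, and log locations below are assumptions, so adjust them to your environment):

```shell
# Hypothetical install path /data/soft/warehouse_job; schedule after the dwd jobs finish.
# dws jobs run first, then the app jobs that read from them.
0 1 * * * /data/soft/warehouse_job/dws_mall_add_partition_5.sh >> /data/soft/warehouse_job/logs/dws_5.log 2>&1
0 1 * * * /data/soft/warehouse_job/dws_mall_add_partition_6.sh >> /data/soft/warehouse_job/logs/dws_6.log 2>&1
0 2 * * * /data/soft/warehouse_job/app_mall_add_partition_5.sh >> /data/soft/warehouse_job/logs/app_5.log 2>&1
0 2 * * * /data/soft/warehouse_job/app_mall_add_partition_6.sh >> /data/soft/warehouse_job/logs/app_6.log 2>&1
```

Since no date argument is passed, each script falls back to its default of yesterday's date.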


User behavior data warehouse: summary

Databases and tables overview


Job scripts overview



Title: Big Data Engineer, Week 12 Project: E-commerce Data Warehouse, User Behavior Warehouse 4

Author: TTYONG

Published: 2023-04-04 16:04

Last updated: 2023-06-18 17:06

Original link: http://tianyong.fun/%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%BC%80%E5%8F%91%E5%B7%A5%E7%A8%8B%E5%B8%88-%E7%AC%AC%E5%8D%81%E4%BA%8C%E5%91%A8-%E7%BB%BC%E5%90%88%E9%A1%B9%E7%9B%AE-%E7%94%B5%E5%95%86%E6%95%B0%E6%8D%AE%E4%BB%93%E5%BA%93%E4%B9%8B%E7%94%A8%E6%88%B7%E8%A1%8C%E4%B8%BA%E6%95%B0%E4%BB%934.html

License: please keep the original link and author attribution when reposting.
