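1.1. Preface
Here we use MRJob, a Python MapReduce (M/R) framework, to count page views (PV) per hour of the day from a web server access log.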
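1.2. M/R steps
Mapper: parse each log line into the form key=hh, value=1, where hh is the two-digit hour of the request.
Shuffle: the shuffle phase groups the mapper output by key and sorts it by key, producing one iterator of values per key.
For example: 09 [1, 1, 1 ... 1, 1]
Reduce: here we compute the number of visits for hour 09 by summing its values.
Output, for example: 09 sum([1, 1, 1 ... 1, 1])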
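The three steps above can be imitated in plain Python without any framework. The following is only a minimal sketch: the sample lines and the assumption that each line starts with an HH:MM:SS timestamp are made up for illustration, and the function names `mapper`, `shuffle`, `reducer` and `run` are hypothetical, not part of MRJob.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (hour, 1) for one line; assumes the line starts with HH:MM:SS."""
    hour = line.split(":", 1)[0]
    yield hour, 1


def shuffle(pairs):
    """Sort the pairs by key and collect the values of each key together."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        # A list is used here for simplicity; a real framework hands the
        # reducer a lazy iterator instead.
        yield key, [value for _, value in group]


def reducer(key, values):
    """Sum the 1s of one hour to get its page views."""
    yield key, sum(values)


def run(lines):
    mapped = [pair for line in lines for pair in mapper(line)]
    return dict(
        result
        for key, values in shuffle(mapped)
        for result in reducer(key, values)
    )


# Made-up sample input, not taken from the real access log.
sample_lines = ["09:01:10 GET /", "09:15:42 GET /a", "10:00:03 GET /b"]
hourly_pv = run(sample_lines)
print(hourly_pv)  # {'09': 2, '10': 1}
```

In a real M/R run the shuffle is done by the framework, so only the mapper and the reducer have to be written by hand.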
1.3. Code
Contents of mr_pv_hour.py (as shown by cat mr_pv_hour.py):

```python
# -*- coding: utf-8 -*-

from mrjob.job import MRJob
from ng_line_parser import NgLineParser


class MRPVHour(MRJob):

    # One parser instance, reused for every input line
    ng_line_parser = NgLineParser()

    def mapper(self, _, line):
        self.ng_line_parser.parse(line)
        # str(access_time) is split into a date part and a time part
        dy, tm = str(self.ng_line_parser.access_time).split()
        h, m, s = tm.split(':')

        yield h, 1  # per-hour count
        yield 'total', 1  # overall count

    def reducer(self, key, values):
        # Sum the 1s emitted for each key
        yield key, sum(values)


def main():
    MRPVHour.run()


if __name__ == '__main__':
    main()
```
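The job imports NgLineParser from ng_line_parser, a module that is not shown in this post. To make the time handling in the mapper easier to follow, here is a rough stand-in. It is only a sketch under assumptions: the class name `SimpleLineParser`, the regular expression and the sample log line are all hypothetical, and it assumes the log carries a bracketed timestamp such as `[24/Sep/2016:09:15:42 +0800]`. The real NgLineParser may work differently.

```python
import re
from datetime import datetime


class SimpleLineParser:
    """Hypothetical stand-in for NgLineParser; only extracts the access time."""

    # Assumed timestamp layout: [24/Sep/2016:09:15:42 +0800]
    TIME_RE = re.compile(
        r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]"
    )

    def __init__(self):
        self.access_time = None

    def parse(self, line):
        match = self.TIME_RE.search(line)
        if match is None:
            raise ValueError("no timestamp found in line")
        # The time zone offset is ignored in this sketch.
        self.access_time = datetime.strptime(
            match.group(1), "%d/%b/%Y:%H:%M:%S"
        )


parser = SimpleLineParser()
# Made-up log line for illustration only.
parser.parse('127.0.0.1 - - [24/Sep/2016:09:15:42 +0800] "GET / HTTP/1.1" 200 612')

# Same steps as in the mapper: str() gives "2016-09-24 09:15:42"
day, time_part = str(parser.access_time).split()
hour, minute, second = time_part.split(":")
print(day, hour)  # 2016-09-24 09
```

If access_time is a datetime object, as assumed here, `access_time.strftime('%H')` would give the same two-digit hour without the string splitting.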
Running the job and its output
```
python mr_pv_hour.py < www.ttmark.com.access.log
No configs found; falling back on auto-configuration
Creating temp directory /tmp/mr_pv_hour.root.20160924.130542.359063
Running step 1 of 1...
reading from STDIN
Streaming final output from /tmp/mr_pv_hour.root.20160924.130542.359063/output...
"00"	31539
"01"	34824
"02"	27895
"03"	29669
"04"	27742
"05"	26797
"06"	29384
"07"	31102
"08"	38257
"09"	43060
"10"	48064
"11"	57923
"12"	56413
"13"	57971
"14"	47260
"15"	46364
"16"	45721
"17"	48884
"18"	49318
"19"	49162
"20"	43641
"21"	42525
"22"	40371
"23"	34953
"total"	988839
Removing temp directory /tmp/mr_pv_hour.root.20160924.130542.359063...
```
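The result lines can be checked with a few lines of Python. The sketch below copies the key/value pairs from the run above into a string, verifies that the 24 hourly counts add up to the "total" key, and picks the busiest hour. The helper name `parse_output` is hypothetical, and the parsing assumes each result line is a JSON-encoded key and value separated by whitespace.

```python
import json


def parse_output(lines):
    """Turn lines such as '"09"<TAB>43060' into a dict; skip anything else."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # log messages, blank lines
        try:
            counts[json.loads(parts[0])] = int(json.loads(parts[1]))
        except ValueError:
            continue  # not a JSON key/value pair
    return counts


# Result lines copied from the run shown above.
output_text = """\
"00"\t31539
"01"\t34824
"02"\t27895
"03"\t29669
"04"\t27742
"05"\t26797
"06"\t29384
"07"\t31102
"08"\t38257
"09"\t43060
"10"\t48064
"11"\t57923
"12"\t56413
"13"\t57971
"14"\t47260
"15"\t46364
"16"\t45721
"17"\t48884
"18"\t49318
"19"\t49162
"20"\t43641
"21"\t42525
"22"\t40371
"23"\t34953
"total"\t988839
"""

counts = parse_output(output_text.splitlines())
total = counts.pop("total")
hourly_sum = sum(counts.values())
peak_hour = max(counts, key=counts.get)

print(hourly_sum == total)           # True
print(peak_hour, counts[peak_hour])  # 13 57971
```

The sum matches because the mapper emits one hourly pair and one 'total' pair for every log line.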
Author: HH
Please credit the source when reposting: 成长的对话 » Hourly PV - MRJob - Python Data Analysis (4)