1.1. Introduction
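We compute the browser ranking in four steps:
Mapper: parse each log line into a pair of the form key=browser, value=1
Shuffle: after the shuffle, the pairs are sorted by key and the values of each key are grouped into an iterator
Result example: browser [1, 1, 1 ... 1, 1]
Reduce 1: sum the values to get the number of visits for each browser. The output key is None so that every pair reaches the same reducer in the next step
Output example: None [sum([1, 1, 1 ... 1, 1]), key]
Reduce 2: sort by sum([1, 1, 1 ... 1, 1]) and output the TOP 100
Output example: 104533 "Googlebot/2.1; +http://www.google.com/bot.html)"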
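The four steps can be traced on a single machine without mrjob. The following is a minimal pure-Python sketch, assuming the browser names have already been extracted from the log lines. The function names and the sample data are made up for illustration and are not part of the original job.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def map_step(browsers):
    """Mapper: emit (browser, 1) for every record."""
    for browser in browsers:
        yield browser, 1


def shuffle_step(pairs):
    """Shuffle: sort by key and group the values of each key together."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_sum(grouped):
    """Reduce 1: one [count, browser] pair per browser."""
    for key, values in grouped:
        yield [sum(values), key]


def reduce_top(pairs, n=100):
    """Reduce 2: the n pairs with the largest counts, in descending order."""
    return heapq.nlargest(n, pairs)


# Hypothetical sample data, not taken from the real access log.
sample = ["Chrome", "Googlebot", "Chrome", "bingbot", "Googlebot", "Chrome"]
ranking = reduce_top(reduce_sum(shuffle_step(map_step(sample))), n=2)
print(ranking)  # [[3, 'Chrome'], [2, 'Googlebot']]
```

Putting the count first in each pair matters: Python compares lists element by element, so heapq.nlargest ranks by count and only falls back to the browser name to break ties.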
1.2. Code
cat mr_browser.py
# -*- coding: utf-8 -*-

from mrjob.job import MRJob
from mrjob.step import MRStep
from ng_line_parser import NgLineParser

import heapq


class MRBrowser(MRJob):

    # One parser instance, reused for every input line
    ng_line_parser = NgLineParser()

    def mapper(self, _, line):
        # Emit (browser, 1) for each log line
        self.ng_line_parser.parse(line)
        yield self.ng_line_parser.browser, 1

    def reducer_sum(self, key, values):
        """Count the visits for each browser."""
        # The key is None so that all pairs reach the same reducer in step 2
        yield None, [sum(values), key]

    def reducer_top100(self, _, values):
        """Yield the 100 largest visit counts in descending order."""
        for cnt, browser in heapq.nlargest(100, values):
            yield cnt, browser

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer_sum),
            MRStep(reducer=self.reducer_top100)
        ]


def main():
    MRBrowser.run()


if __name__ == '__main__':
    main()
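The job imports NgLineParser from ng_line_parser, which this section does not list. For readers who want to experiment on their own logs, the following is a hypothetical stand-in, not the author's parser. It assumes the nginx "combined" log format and uses the whole User-Agent string as the browser, whereas the output in this article shows shorter fragments such as "Googlebot/2.1; +http://www.google.com/bot.html)". The class name SimpleLineParser, the regular expression, and the sample log line are all assumptions.

```python
import re

# Assumed layout: nginx "combined" log format, where the User-Agent
# is the last double-quoted field on the line.
_COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


class SimpleLineParser:
    """Hypothetical stand-in exposing parse() and a browser attribute."""

    def __init__(self):
        self.browser = None

    def parse(self, line):
        match = _COMBINED.match(line)
        # Lines that do not match fall into a single "unknown" bucket.
        self.browser = match.group("agent") if match else "unknown"


# Made-up log line for illustration only.
parser = SimpleLineParser()
parser.parse(
    '1.2.3.4 - - [24/Sep/2016:15:58:14 +0800] "GET / HTTP/1.1" 200 612 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
print(parser.browser)
```

Because it offers the same parse() method and browser attribute that the mapper uses, a class like this could be swapped in for experiments, though the resulting keys would be full User-Agent strings rather than the fragments shown in the results.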
Running the job and viewing the output
python mr_browser.py < www.ttmark.com.access.log
No configs found; falling back on auto-configuration
Creating temp directory /tmp/mr_browser.root.20160924.155814.200619
Running step 1 of 2...
reading from STDIN
Running step 2 of 2...
Streaming final output from /tmp/mr_browser.root.20160924.155814.200619/output...
104533 "Googlebot/2.1; +http://www.google.com/bot.html)"
101013 "Chrome/47.0.2526.106 larbin2.6.3@unspecified.mail"
57072 "bingbot/2.0; +http://www.bing.com/bingbot.htm)"
......
613 "P1 4.1.2)"
610 "Safari/534.30 OppoBrowser/3.9.2"
601 "9.3.2; zh_CN)"
Removing temp directory /tmp/mr_browser.root.20160924.155814.200619...
Nickname: HH
QQ: 275258836
When reposting, please credit: 成长的对话 » Browser Ranking - MRJob - Python Data Analysis (10)