Contents
  1. The Code
  2. References

My First Python Spider

The 51VOA site has a vocabulary feature called Learn A Word that teaches words and phrases through listening. It's quite good, with more than 2,300 entries. I had previously started writing a crawler for it in C, but the workload was heavy and debugging ate up a lot of time. Today I came across a question on Zhihu, "What cool, fun, and useful things can you do with web crawlers?", where the asker said he had learned Python and written a crawler to scrape Douban book lists. Looking at the code he posted, I found it was barely 60 lines! Won over by Python's conciseness and power, I promptly abandoned the half-finished C version (its inter-thread communication still wasn't debugged). At noon I hurried back from the study room to grab my laptop, bought some bread at Meiyijia, and settled into the library, where I spent the afternoon learning Python (which I had only heard of, never touched) and rewriting the crawler.

The Code

Here is the spider code (written as I learned, so it's rough around the edges):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import os
import sys
import requests
import time, threading
reload(sys)
sys.setdefaultencoding('utf8')
from bs4 import BeautifulSoup

# download one mp3 file
def down(url, purl):
    i = 0
    flag = False  # do not download by default

    new_url = 'http://'
    for token in url.split('/'):
        if i == 0:
            i += 1  # skip the scheme token ('http:')
        elif token != '':
            new_url += token + '/'
            filename = './down/' + token  # last token is the file name
    new_url = new_url[:-1]  # cut the trailing '/'

    # download if the file is missing, or re-download if it
    # exists but looks truncated (< 150 KB)
    if os.path.exists(filename):
        size = os.path.getsize(filename)
        size /= 1024  # KB
        if size < 150:
            flag = True
    else:
        flag = True

    if flag:
        try:
            #print 'start downloading %s...' % filename
            fp = open(filename, 'wb')
            fcontent = requests.get(new_url).content
            fp.write(fcontent)
            fp.close()
            #print '%s done.\n' % filename
        except IOError, e:
            print 'Url: %s' % url
            print 'New url: ' + new_url
            print 'Parent url: ' + purl
            print e
    return 1

# parse a word-list page, then each Learn A Word page it links to
def sub_spider(host, uri):
    file_count = 0
    url = 'http://' + host + uri
    html_source = requests.get(url).text
    soup = BeautifulSoup(html_source, 'html.parser')

    for link in soup.find(id='list'):
        if link != None:
            for href in link.find_all('a'):
                uri = href.get('href')
                html_src = requests.get('http://' + host + uri).text
                sp = BeautifulSoup(html_src, 'html.parser')
                mp3href = sp.find(id='mp3')
                if mp3href != None:
                    file_count += down(mp3href.get('href'), 'http://' + host + uri)
                else:
                    print 'Can not find mp3 link in page: http://' + host + uri
        else:
            print 'Failed: ' + url
    return file_count

# multi-thread spider worker: crawl list pages [a, b)
def spider_thread(a, b):
    tid = int(a) / 6
    counter = 0
    print 'Thread <%d> starting...\n' % tid
    for i in range(int(a), int(b)):
        counter += sub_spider('www.51voa.com', '/Learn_A_Word_%d.html' % (i + 1))
    print 'Thread %d done, %d files downloaded!' % (tid, counter)

# the spider main function
def spider():
    i = 0
    j = 0
    tids = {}
    while i < 48:  # 48 list pages, 6 pages per thread -> 8 threads
        t = threading.Thread(target=spider_thread, args=(i, i + 6))
        t.start()
        tids[j] = t
        i += 6
        j += 1

    print 'Waiting for threads to quit...'
    for i in range(j):
        tids[i].join()

# start the spider
spider()
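Incidentally, the manual token loop in down() that rebuilds the URL and derives the local file name can be done with the standard library instead. A minimal sketch (Python 3 syntax; the URLs here are made up for illustration):

```python
# Derive (absolute_url, local_path) from an mp3 href using urllib.parse,
# instead of splitting on '/' by hand as down() does.
from urllib.parse import urlsplit

def mp3_target(href, host='www.51voa.com'):
    """Return (absolute_url, local_path) for an mp3 link."""
    parts = urlsplit(href, scheme='http')
    # Relative hrefs like '/sound/x.mp3' have no host; fall back to the site host.
    netloc = parts.netloc or host
    url = 'http://%s%s' % (netloc, parts.path)
    filename = './down/' + parts.path.rsplit('/', 1)[-1]
    return url, filename

print(mp3_target('/sound/example.mp3'))
# → ('http://www.51voa.com/sound/example.mp3', './down/example.mp3')
```

This also handles relative links cleanly, which the token loop silently assumes away.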

Python really is excellent: easy to learn, with rich libraries and high-level abstractions, so a few lines of code can implement rich, complex functionality. I went with 2.7.x because people online said some third-party libraries, such as requests, did not yet support 3.x.
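As one example of how much the standard library gives you, the page partitioning in spider() (48 list pages split into ranges of 6, one thread per range) can be sketched on its own. Here it is in Python 3 syntax, with a stub worker that just records its page URIs instead of downloading anything:

```python
# Sketch of spider()'s work partitioning: 48 pages, 6 per thread.
import threading

def partition(total, per_thread):
    """Yield (start, end) half-open page ranges, one per thread."""
    for start in range(0, total, per_thread):
        yield start, min(start + per_thread, total)

results = {}
lock = threading.Lock()

def worker(start, end):
    uris = ['/Learn_A_Word_%d.html' % (i + 1) for i in range(start, end)]
    with lock:  # several threads write to the shared dict
        results[start] = uris

threads = [threading.Thread(target=worker, args=r) for r in partition(48, 6)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(threads), sum(len(u) for u in results.values()))  # → 8 48
```

Swapping the stub for sub_spider() recovers the real spider; for anything bigger, concurrent.futures.ThreadPoolExecutor (Python 3) would save the bookkeeping entirely.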

References

  1. Python 2.7.10 documentation
  2. The Python Tutorial
  3. The Python Language Reference
  4. The Python Standard Library
  5. Python Requests
  6. BeautifulSoup 4