Getting Started with Python Web Crawlers

Posted on 2016-10-26, filed under python

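Recently I stumbled onto a blog of front-end tutorials, where one post mentioned that the blogger had set up a junk site.

> You know what kind of content a junk site has. The server is overseas anyway, so it skirts the line a bit. Please be sure to pick one of the domains above and leave me a comment.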

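When I opened the site I found it rather interesting, so I wrote a script (code further down) to download the images on it. The script can serve as an introductory Python crawler. It comes in two versions, with and without login, because some images are only visible after logging in. A few things to note and possible improvements:

1. Switch between the logged-in and not-logged-in versions by commenting or uncommenting the corresponding line of code. (The login username is jimca and the password is datoujimca.)
2. The pages that hold the images have URLs of the form http://ooxxma.com/****.html, where **** is a number with no obvious ordering, so the numbers have to be collected beforehand. The method: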

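   1. Use the Xenu program to scan all pages and save the result as a txt file
   2. Keep the lines that contain html, then take the first column; most results look like http://ooxxma.com/1293.html
   3. Remove duplicate lines, then extract the numbers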

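3. The pages to crawl are stored in a list, so when new pages appear later they only need to be appended to it.
4. Some of the jpg files found are thumbnails, with file names of the form ****-55x55.jpg. Any file whose name contains 55x55 is skipped.
5. The script does not support multithreading yet; it should be added when there is a chance.
6. Modules such as pyspider are also worth trying at some point.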
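The three preprocessing steps in note 2 can be sketched roughly as follows (Python 3 syntax). This is only an illustration: the function name `extract_page_ids` and the sample lines are made up, and it assumes the Xenu export is tab-separated with the URL in the first column, which may not match a real export exactly.

```python
import re

# Matches page URLs such as http://ooxxma.com/1293.html and captures the number.
PAGE_RE = re.compile(r'^http://ooxxma\.com/(\d+)\.html$')

def extract_page_ids(lines):
    """Return the sorted, de-duplicated page numbers found in the export lines.

    Assumption: columns are tab-separated and the URL is the first column.
    """
    ids = set()
    for line in lines:
        if 'html' not in line:            # keep only lines that mention html
            continue
        first_column = line.split('\t')[0].strip()
        match = PAGE_RE.match(first_column)
        if match:                         # pull out the number
            ids.add(int(match.group(1)))  # the set drops duplicates
    return sorted(ids)

# Hypothetical sample lines, not real Xenu output.
sample = [
    "http://ooxxma.com/1293.html\tok\ttext/html",
    "http://ooxxma.com/1293.html\tok\ttext/html",
    "http://ooxxma.com/896.html\tok\ttext/html",
    "http://ooxxma.com/style.css\tok\ttext/css",
    "http://ooxxma.com/about.html\tok\ttext/html",
]
print(extract_page_ids(sample))  # prints [896, 1293]
```

The resulting list can be pasted into the script as the list of pages to crawl.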

The script is as follows:

#coding=utf-8
'''
Use:    use this script to download pictures from ooxxma.com with or without logging in. Keep in mind that you will get more pictures when logged in.
Author: jim
Email:  jimcateufl@gmail.com 
'''
# Python 2 script: urllib.urlopen and urllib.urlretrieve live in urllib.request in Python 3
import urllib
import requests
import re

def getHtmlLogIn(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11'}
    data = {"log":"jimca","pwd":"datoujimca"}                                 # username and password
    s = requests.session()
    afterURL = url                                                            # page to fetch after logging in
    loginURL = "http://ooxxma.com/wp-login.php"                               # URL the login form is POSTed to
    login = s.post(loginURL, data = data, headers = headers)                  # send the credentials; the response carries the login cookies
    response = s.get(afterURL, cookies = login.cookies, headers = headers)    # fetch the page using the cookies from the login response
    return response.content

def getHtml(url):
    page = urllib.urlopen(url)            
    html = page.read()                   
    return html

def getImg(html):
    reg = r'src="(.+?\.jpg)"'             
    imgre = re.compile(reg)              
    imglist = re.findall(imgre,html)      

    for imgurl in imglist:
        if "55x55" not in imgurl.split("/")[-1]:                             # file names containing "55x55" are thumbnails, so skip them
            urllib.urlretrieve(imgurl,imgurl.split("/")[-1])
            print "\tFile: "+imgurl.split("/")[-1]+" is downloading"

list_all=[1008, 1031, 1066, 1071, 1080, 1099, 1107, 1110, 1117, 1126, 1177, 1185, 1196, 1207, 1210, 1231, 1234, 1237, 1245, 1253, 1260, 1268, 1272, 1280, 1283, 1286, 1293, 1298, 1317, 1323, 1328, 1331, 1334, 1337, 1342, 1345, 1348, 1360, 1366, 1369, 1374, 1382, 1391, 1398, 1407, 1417, 1421, 1424, 1427, 1432, 1436, 1441, 1453, 1508, 1511, 1514, 1518, 1530, 1536, 1539, 1546, 1556, 1569, 1577, 1581, 1585, 1590, 1600, 1608, 1615, 1635, 1639, 1648, 1650, 1657, 1664, 1675, 1681, 1684, 1687, 1690, 1693, 1699, 1702, 1724, 1735, 1739, 1748, 1754, 1757, 1761, 1764, 1776, 1780, 1789, 1795, 1804, 1807, 1814, 1822, 1833, 1837, 1843, 896]            
for i in list_all:
    print "%d%% http://ooxxma.com/%s.html is downloading..." %((list_all.index(i)+1)/float(len(list_all))*100, str(i))
    # Uncomment exactly one of the next two lines; with both commented out, html is undefined.
    #html = getHtml("http://ooxxma.com/"+str(i)+".html")                    # download the pictures without logging in
    #html = getHtmlLogIn("http://ooxxma.com/"+str(i)+".html")               # download the pictures after logging in
    getImg(html)
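
Notes 4 and 5 (skipping thumbnails and adding multithreading) could be combined along these lines. This is a sketch in Python 3 syntax, not a drop-in change to the script above: `find_full_size_images`, `download_all` and the `fetch` callback are hypothetical names, and the actual download is left to whatever function is passed in as `fetch`.

```python
import re
from concurrent.futures import ThreadPoolExecutor

IMG_RE = re.compile(r'src="(.+?\.jpg)"')   # same pattern as in getImg

def find_full_size_images(html):
    """Return the .jpg URLs found in the page, skipping 55x55 thumbnails."""
    urls = IMG_RE.findall(html)
    return [u for u in urls if "55x55" not in u.split("/")[-1]]

def download_all(urls, fetch, workers=4):
    """Call fetch(url) for every URL using a small thread pool.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

# Hypothetical page fragment, for illustration only.
page = ('<img src="http://ooxxma.com/a/pic1.jpg">'
        '<img src="http://ooxxma.com/a/pic1-55x55.jpg">'
        '<img src="http://ooxxma.com/b/pic2.jpg">')

images = find_full_size_images(page)
# Stand-in for a real downloader: it only reports the file name it would save.
saved = download_all(images, lambda url: url.split("/")[-1])
print(saved)  # prints ['pic1.jpg', 'pic2.jpg']
```

In real use, `fetch` would be a function that saves the file to disk. Threads suit this job because the work is dominated by waiting on the network.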

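And of course, to finish, here is one image from the site as a keepsake. It will be taken down on request if it infringes anyone's rights...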