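This article introduces two ways to forge HTTP request headers with urllib2 in Python, for readers who need to send custom header information.
When scraping web pages, a script often has to send forged headers to run successfully, since some sites reject requests that do not look like they come from a browser.
Below, we set custom headers on a urllib2 request in order to fetch the page.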
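To see why a forged header is often needed, it helps to look at how urllib identifies itself by default. The following is a minimal sketch, assuming Python 3, where urllib2 became urllib.request; it makes no network request, and `default_user_agent` is a hypothetical helper name, not part of the library.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent value that a fresh urllib opener would send."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples that the opener applies
    # to every request that does not already set the same header.
    return dict(opener.addheaders).get("User-agent")


# Prints a string of the form "Python-urllib/<major>.<minor>"
print(default_user_agent())
```

A server that filters on User-Agent can easily recognize this value as a script, which is the reason for replacing it with a browser-like string.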
Method 1: pass a headers dictionary to Request
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Filename: urllib2-header.py
import urllib2
import sys
# Fetch the page content, sending custom headers (method 1)
url = "http://www.xxx.net"
send_headers = {
    'Host': 'www.xxx.net',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Connection': 'keep-alive'
}
req = urllib2.Request(url,headers=send_headers)
r = urllib2.urlopen(req)
html = r.read()  # the page content
receive_header = r.info()  # the response headers
# Re-encode for the local encoding to avoid garbled output
html = html.decode('utf-8', 'replace').encode(sys.getfilesystemencoding())
print receive_header
# print '####################################'
print html
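For readers on Python 3, where urllib2 no longer exists, the dictionary approach can be sketched with urllib.request. This is an illustrative sketch rather than the original script: `build_request` is a hypothetical helper, and the request is only constructed and inspected, never sent.

```python
import urllib.request


def build_request(url, headers):
    """Create a Request carrying the given headers without sending it."""
    return urllib.request.Request(url, headers=headers)


send_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Connection": "keep-alive",
}
req = build_request("http://www.xxx.net", send_headers)

# Request stores header names via str.capitalize(), so "User-Agent"
# must be looked up as "User-agent".
print(req.get_header("User-agent"))
print(sorted(req.header_items()))
```

Passing `req` to `urllib.request.urlopen` would send it. The Host header is left out of this sketch because urllib fills it in from the URL when the request is sent. Also note that, as far as I know, urllib's HTTP handler sets `Connection: close` itself when sending, so a `keep-alive` value set here is unlikely to reach the server.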
Method 2: add headers one at a time with add_header
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Filename: urllib2-header.py
import urllib2
import sys
url = 'http://www.xxx.net'
req = urllib2.Request(url)
req.add_header('Referer','http://www.xxx.net/')
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0')
r = urllib2.urlopen(req)
html = r.read()
receive_header = r.info()
# 'replace' avoids a UnicodeDecodeError on pages that are not valid UTF-8
html = html.decode('utf-8', 'replace').encode(sys.getfilesystemencoding())
print receive_header
print '#####################################'
print html
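The add_header approach carries over to Python 3's urllib.request in the same way. The sketch below is an illustration under that assumption; the value `first-agent` is a made-up placeholder used only to show that a second call with the same header name overwrites the first. Nothing is sent over the network.

```python
import urllib.request

req = urllib.request.Request("http://www.xxx.net")
req.add_header("Referer", "http://www.xxx.net/")
req.add_header("User-Agent", "first-agent")  # placeholder value
# A second add_header with the same name overwrites the earlier value.
req.add_header(
    "User-Agent",
    "Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0",
)

# header_items() returns (name, value) tuples; names are stored capitalized.
headers = dict(req.header_items())
print(headers)
```

Compared with the dictionary form, add_header is convenient when headers are decided step by step, for example adding a Referer only for certain pages.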