Crawling usable proxies with Scrapy

Analyzing the structure of the free-proxy site

  • I scrape three fields: IP, port, and type (the selectors can be sanity-checked in scrapy shell first; see the sketch just below).
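  • Before writing any spider code, the column positions are easy to verify interactively. A minimal sketch using scrapy shell; the nth-child positions are what the page showed at the time and may need adjusting if the site changes its layout:

    $ scrapy shell 'http://www.xicidaili.com/'
    >>> response.css('tr td:nth-child(2)::text').extract()[:3]   # IP column
    >>> response.css('tr td:nth-child(3)::text').extract()[:3]   # port column
    >>> response.css('tr td:nth-child(6)::text').extract()[:3]   # type column (HTTP/HTTPS)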

Analyzing the data to scrape and writing items.py

  • Accordingly, define the corresponding fields in items.py:
    import scrapy


    class IproxyItem(scrapy.Item):
        # the three fields scraped from the proxy list
        ip = scrapy.Field()
        type = scrapy.Field()
        port = scrapy.Field()
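  • For reference, a scrapy.Item behaves like a dict restricted to its declared fields; a quick sketch (hypothetical values):

    item = IproxyItem()
    item['ip'] = '1.2.3.4'
    item['port'] = '8080'
    print(dict(item))        # {'ip': '1.2.3.4', 'port': '8080'}
    # item['host'] = '...'   # would raise KeyError: field not declared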

Crawling all the free IPs

  • Create IpSpider.py in the spiders directory:
    import scrapy

    from Iproxy.items import IproxyItem


    class IpSpider(scrapy.Spider):
        name = 'IpSpider'
        allowed_domains = ['xicidaili.com']
        start_urls = ['http://www.xicidaili.com/']

        def parse(self, response):
            # columns 2, 3 and 6 of the listing table hold the IP,
            # port and type (HTTP/HTTPS) respectively
            item = IproxyItem()
            item['ip'] = response.css('tr td:nth-child(2)::text').extract()
            item['port'] = response.css('tr td:nth-child(3)::text').extract()
            item['type'] = response.css('tr td:nth-child(6)::text').extract()
            yield item
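  • Note that parse() yields a single item holding three parallel lists, which the pipeline below then has to index in lockstep. An alternative sketch (same selectors and layout assumptions) yields one item per proxy instead; with it, the pipeline would handle a single proxy per item rather than looping:

    def parse(self, response):
        # one item per table row; rows without the expected cells are skipped
        for row in response.css('tr'):
            ip = row.css('td:nth-child(2)::text').extract_first()
            port = row.css('td:nth-child(3)::text').extract_first()
            ptype = row.css('td:nth-child(6)::text').extract_first()
            if ip and port:
                item = IproxyItem()
                item['ip'] = ip
                item['port'] = port
                item['type'] = ptype
                yield item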

Checking availability and saving usable proxies to the database

  • Since these are free IPs, it's worth checking whether each one actually works: usable ones go into the database, the rest are discarded.
  • The checking and storage logic lives in pipelines.py.
  • The check itself is simple: request Baidu through the proxy; if the request succeeds, the proxy is usable.
    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import pymysql
    import requests


    class IproxyPipeline(object):
        def process_item(self, item, spider):
            # item['ip'] and item['port'] are parallel lists scraped by the
            # spider; the loop starts at 1, skipping the first extracted entry
            db = pymysql.connect("localhost", "root", "168168", "spider")
            cursor = db.cursor()
            for i in range(1, len(item['ip'])):
                ip = item['ip'][i] + ':' + item['port'][i]
                try:
                    if self.proxyIpCheck(ip) is False:
                        print('IP ' + ip + ' is not usable')
                        continue
                    else:
                        print('IP ' + ip + ' works, saving it to the database')
                        sql = 'insert into proxyIp values ("%s")' % ip
                        cursor.execute(sql)
                        db.commit()
                except Exception:
                    db.rollback()
            db.close()
            return item

        def proxyIpCheck(self, ip):
            # fetch Baidu through the proxy; if the request succeeds,
            # the proxy is considered usable
            proxies = {'http': 'http://' + ip, 'https': 'https://' + ip}
            try:
                r = requests.get('https://www.baidu.com/', proxies=proxies, timeout=1)
                return r.status_code == 200
            except requests.RequestException:
                return False
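  • The pipeline only runs once it is registered in ITEM_PIPELINES (as the boilerplate comment above says), and the INSERT assumes a proxyIp table already exists. A minimal sketch of both; the module path follows the project name Iproxy, and the one-column schema is my assumption based on the INSERT statement:

    # settings.py -- enable the pipeline
    ITEM_PIPELINES = {
        'Iproxy.pipelines.IproxyPipeline': 300,
    }

    # one-off table setup (hypothetical single-column schema)
    import pymysql

    db = pymysql.connect("localhost", "root", "168168", "spider")
    with db.cursor() as cursor:
        cursor.execute("CREATE TABLE IF NOT EXISTS proxyIp (ip VARCHAR(64))")
    db.commit()
    db.close()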

Results

  • As the log shows, quite a few of the IPs turn out to be unusable.
  • The usable ones are saved in the database (a quick way to inspect them is sketched below).
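  • To spot-check what ended up in the table, a quick query sketch (same connection parameters as the pipeline, schema as assumed above):

    import pymysql

    db = pymysql.connect("localhost", "root", "168168", "spider")
    with db.cursor() as cursor:
        cursor.execute("SELECT ip FROM proxyIp")
        for (ip,) in cursor.fetchall():
            print(ip)
    db.close()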