How to Crawl All 1,200 Python Books on the Web

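I previously wrote an article on how to approach crawling every Python book on the market, a small hands-on project in our data analysis lecture series. The code was unfinished last time, so over the weekend I completed it and stored the results in a database. Today I will walk through, step by step, how I crawled the data, cleaned it, and worked around anti-crawler measures, along with a few notes I took along the way.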

1

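Analyzing the target site: crawling the main listing page

1). Essentially every Python book on the market can be found on JD.com, Taobao, and Douban, so I chose Douban as the site to crawl.

2). The site structure is fairly simple. There is a main listing page that links to every Python book, 1,388 in total (more than 100 of which are actually duplicates), and the pagination at the bottom shows 93 pages.

3). The page is static and its URLs follow a regular pattern, so it is easy to construct the URLs of all the listing pages.

4). Crawl every Python book and its URL from each listing page. For example, the first page includes the "Hard Way" book (笨办法); we only need to extract the title and the corresponding URL.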
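Steps 3) and 4) could be sketched roughly as follows. This is not the project's actual code: the article does not give the exact URL, so `BASE_URL`, the `start` query parameter, and the `/subject/` marker used to recognize book links are all assumptions, and the sketch uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Assumed values: the article states 93 pages of about 15 books each,
# but the URL and its query parameter are placeholders for illustration.
BASE_URL = "https://book.douban.com/tag/Python"
PAGE_COUNT = 93
BOOKS_PER_PAGE = 15


def build_page_urls(base_url=BASE_URL, pages=PAGE_COUNT, per_page=BOOKS_PER_PAGE):
    """Return one listing-page URL per page, offset by `per_page` books."""
    urls = []
    for page in range(pages):
        query = urlencode({"start": page * per_page})
        urls.append(base_url + "?" + query)
    return urls


class BookLinkParser(HTMLParser):
    """Collect (title, url) pairs from <a href=... title=...> tags."""

    def __init__(self, url_marker="/subject/"):
        super().__init__()
        self.url_marker = url_marker  # assumed marker for book detail links
        self.books = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href, title = attr.get("href"), attr.get("title")
        # Keep only links that carry a title and look like book pages.
        if href and title and self.url_marker in href:
            self.books.append((title.strip(), href))


def extract_books(html):
    """Parse one listing page and return its (title, url) pairs."""
    parser = BookLinkParser()
    parser.feed(html)
    return parser.books


# Hypothetical fragment of a listing page, used only to exercise the parser.
sample = (
    '<li><a href="https://book.douban.com/subject/1/" title="Book A">Book A</a></li>'
    '<a href="/about">About</a>'
)
books = extract_books(sample)
```

Each URL from `build_page_urls()` would be downloaded and passed to `extract_books`; the navigation link in the sample is skipped because it has no title and does not match the marker.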

2

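Analyzing and crawling a single book page

1). We have now extracted every Python book and its URL from the 93 listing pages, roughly 93*15, or a bit over 1,300 books. First remove the duplicates; then either keep the result in memory in a dictionary or write it to a CSV file. (Some readers may wonder why I would write it to a file when a dictionary is more convenient. I will hold off on the answer and reveal it at the end.)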
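A minimal sketch of the dedupe-then-save step, assuming the books are held as `(title, url)` pairs and that a duplicate means a repeated URL. The column names `title` and `url` are illustrative choices, not taken from the project.

```python
import csv


def dedupe_books(books):
    """Drop repeated URLs, keeping the first title seen for each URL."""
    unique = {}
    for title, url in books:
        unique.setdefault(url, title)
    return [(title, url) for url, title in unique.items()]


def save_links(books, path):
    """Write (title, url) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url"])
        writer.writerows(books)


def load_links(path):
    """Read the (title, url) pairs back from the CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return [(row["title"], row["url"]) for row in csv.DictReader(f)]
```

Calling `save_links(dedupe_books(books), "all_books_link.csv")` would produce the link file described later in the code overview.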

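2). Next, we analyze the characteristics of each book's page:

As mentioned in the previous article, the fields we need are:

Author / publisher / translator / publication year / page count / list price / ISBN / rating / number of ratings

Looking at the page source, the main information sits in div id="info" and div class="rating_self clearfix".

3). Cleaning this part of the data is fairly tedious, because not every book has reviews and a rating, and not every book lists an author, page count, or price. Be sure to handle missing fields and exceptions properly when extracting.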
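One way to make the extraction tolerant of missing fields is to start every record with all fields set to `None` and fill in only what the page provides. The sketch below assumes the text of the info block has already been pulled out as "label: value" lines and that the labels are the Chinese ones listed in `FIELD_LABELS`; both points are assumptions about the page, and the function and field names are hypothetical.

```python
import re

# Assumed mapping from the labels in the info block to record field names.
FIELD_LABELS = {
    "作者": "author",
    "出版社": "publisher",
    "译者": "translator",
    "出版年": "pub_date",
    "页数": "pages",
    "定价": "price",
    "ISBN": "isbn",
}

# Matches "label: value" with either an ASCII or a full-width colon.
LINE_PATTERN = re.compile(r"\s*([^::]+?)\s*[::]\s*(.+?)\s*$")


def parse_info_text(text):
    """Turn 'label: value' lines into a dict; absent fields stay None."""
    record = {name: None for name in FIELD_LABELS.values()}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if not match:
            continue
        label, value = match.groups()
        field = FIELD_LABELS.get(label)
        if field:
            record[field] = value
    return record


def parse_rating(score_text, votes_text):
    """Return (rating, votes); books without a rating yield (None, 0)."""
    try:
        rating = float(score_text.strip())
    except (AttributeError, ValueError):
        rating = None  # missing or empty score element
    digits = re.sub(r"\D", "", votes_text or "")
    return rating, int(digits) if digits else 0
```

A book page with no translator or rating then produces a complete record with `None` in those slots rather than raising midway through the crawl.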

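The raw data collected contains many inconsistencies:

Publication dates come in all sorts of formats:

Some books give the date as: 'September 2007', 'October 22, 2007', '2017-9', '2017-8-25'

Prices use inconsistent currency units: US dollars, Japanese yen, euros, and Chinese yuan

For example: CNY 49.00, 135, 19 €, JPY 4320, $ 176.00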
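The date and price samples above can be normalized along these lines. This is a sketch rather than the project's cleaning code: the target format `YYYY-MM`, the list of date patterns, and the currency markers are assumptions that cover only the examples shown, and a bare number such as `135` is returned with no currency instead of guessing one.

```python
import re
from datetime import datetime

# Patterns for the four sample formats. %B matches English month names
# only when the process runs under an English or C locale.
DATE_FORMATS = ("%Y-%m-%d", "%Y-%m", "%B %d, %Y", "%B %Y")


def normalize_date(raw):
    """Return the date as 'YYYY-MM', or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        return f"{parsed.year:04d}-{parsed.month:02d}"
    return None


# Assumed markers; extend this table as new currencies show up in the data.
CURRENCY_MARKERS = {
    "CNY": "CNY",
    "JPY": "JPY",
    "USD": "USD",
    "EUR": "EUR",
    "$": "USD",
    "€": "EUR",
}


def parse_price(raw):
    """Split a raw price into (currency, amount); currency is None if unmarked."""
    number = re.search(r"\d+(?:\.\d+)?", raw)
    if not number:
        return None, None
    upper = raw.upper()
    currency = next(
        (code for marker, code in CURRENCY_MARKERS.items() if marker in upper),
        None,
    )
    return currency, float(number.group())
```

Converting everything to one currency would additionally need exchange rates, which are left out here on purpose.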

3

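Multithreaded crawling

1). Some readers messaged me to ask whether I used the Scrapy framework or wrote the crawler myself. I wrote this project by hand. Scrapy is an excellent framework, and if I had to crawl hundreds of thousands of records I would certainly reach for that heavy weapon.

2). I used multithreaded crawling: put all the URLs into a queue, then start a few threads that keep pulling URLs from the queue and crawling them, looping until every URL in the queue has been processed.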
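The queue-plus-workers pattern described in 2) might look like this with the `threading` and `queue` modules. The function name `crawl_all`, the `fetch` callback, and the default of four workers are illustrative assumptions; the real crawler's structure may differ.

```python
import queue
import threading


def crawl_all(urls, fetch, worker_count=4):
    """Run `fetch(url)` for every URL using a shared queue and worker threads."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, so this thread can exit
            try:
                data = fetch(url)
            except Exception as exc:
                # Record the failure instead of letting one bad page kill the worker.
                data = exc
            with results_lock:
                results[url] = data
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because all URLs are queued before the workers start, an empty queue reliably means the work is finished; a crawler that discovers new URLs while running would need a different stop condition.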

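3). For data storage there are two approaches:

One is to store the crawled data directly in a SQL database. Each time a new URL comes in, query the database first: if the URL is already there, skip it; otherwise crawl and process it.

The other is to write to a CSV file. Because several threads write to it, the writes must be protected, otherwise threads writing the same file at once will cause problems. A CSV file can also be imported into a database later, and it has another advantage: it loads straight into pandas, which makes processing and analysis very convenient.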
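Both storage approaches can be sketched with the standard library. The article only says "SQL database", so the choice of SQLite, the `books` table layout, and the helper names below are assumptions made for illustration.

```python
import csv
import os
import sqlite3
import threading


def open_db(path=":memory:"):
    """Open a SQLite database with a table keyed by book URL."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS books (url TEXT PRIMARY KEY, title TEXT)")
    return conn


def already_crawled(conn, url):
    """Approach 1: ask the database whether this URL was stored before."""
    row = conn.execute("SELECT 1 FROM books WHERE url = ?", (url,)).fetchone()
    return row is not None


def save_book(conn, url, title):
    """Insert a book; the connection context manager commits on success."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO books (url, title) VALUES (?, ?)", (url, title))


class LockedCsvWriter:
    """Approach 2: a CSV writer that lets only one thread write at a time."""

    def __init__(self, path, fieldnames):
        is_new = not os.path.exists(path) or os.path.getsize(path) == 0
        self._lock = threading.Lock()
        self._file = open(path, "a", newline="", encoding="utf-8")
        self._writer = csv.DictWriter(self._file, fieldnames=fieldnames)
        if is_new:
            self._writer.writeheader()

    def write(self, row):
        with self._lock:
            self._writer.writerow(row)
            self._file.flush()  # keep finished rows on disk if the crawl dies

    def close(self):
        self._file.close()
```

Note that a `sqlite3` connection is, by default, usable only from the thread that created it, so with this approach each worker would need its own connection or a single dedicated writer thread.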

4

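Anti-crawler strategies

1). Large sites generally have anti-crawler measures. Even though we are only crawling about 1,000 books this time, we still run into them.

2). There are many ways to get around anti-crawler measures: adding delays (especially when crawling with multiple threads), sending cookies, or using proxies. Large-scale crawling definitely needs a proxy pool. Here I used cookies plus delays, which is a fairly crude method.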
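The cookie-plus-delay approach could be sketched as below. The project itself uses requests; this sketch swaps in `urllib.request` so it runs without third-party packages. The class name, the delay range, the User-Agent string, and the cookie value are all placeholders.

```python
import random
import threading
import time
import urllib.request


class PoliteFetcher:
    """Send a fixed cookie with each request and pause before fetching."""

    def __init__(self, cookie, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
        self.cookie = cookie
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._sleep = sleep  # injectable so the delay can be stubbed out
        self._lock = threading.Lock()

    def build_request(self, url):
        headers = {
            "Cookie": self.cookie,
            "User-Agent": "Mozilla/5.0 (compatible; book-crawler-sketch)",
        }
        return urllib.request.Request(url, headers=headers)

    def wait(self):
        """Sleep for a random interval; the lock spaces out requests across threads."""
        delay = random.uniform(self.min_delay, self.max_delay)
        with self._lock:
            self._sleep(delay)
        return delay

    def fetch(self, url):
        self.wait()
        with urllib.request.urlopen(self.build_request(url), timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
```

Holding the lock while sleeping means the threads take turns waiting, so the overall request rate stays bounded no matter how many workers are running.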

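3). Resumable crawling. My data set is not large, on the order of a thousand records, but I still recommend adding resume support, because you never know what will go wrong in the middle of a crawl. You could retry recursively, but if the program crashes after 800-plus records and nothing has been saved yet, the next run has to start over from scratch, which is painful. (Sharp readers will have guessed that this is the reason behind the hint I left in step 1 of section 2 above.)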
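Resume support can be as simple as reading back the results file and skipping every URL that already has a row. A minimal sketch, assuming the results CSV has a `url` column (the column name is an assumption):

```python
import csv
import os


def load_done_urls(result_path, url_field="url"):
    """Return the set of URLs that already have a row in the results file."""
    if not os.path.exists(result_path):
        return set()  # first run: nothing has been crawled yet
    with open(result_path, newline="", encoding="utf-8") as f:
        return {row[url_field] for row in csv.DictReader(f) if row.get(url_field)}


def pending_urls(all_urls, result_path):
    """Keep only the URLs that still need crawling, preserving their order."""
    done = load_done_urls(result_path)
    return [url for url in all_urls if url not in done]
```

On restart, the crawler would feed `pending_urls(...)` into its queue instead of the full list, so a crash after 800 books costs only the remaining ones.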

5

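Code overview

1). I have not fully optimized the overall code structure yet. There are currently 6 .py files, and I will refine and encapsulate them further later.

spider_main: crawls the links and titles of all books across the 93 listing pages, using multiple threads

book_html_parser: crawls the details of each individual book

url_manager: manages all the URLs

db_manager: handles database storage and queries

util: holds some global variables

verify: a small program I use internally to test the code

2). Where the main crawl results are stored

all_books_link.csv: stores the URLs and titles of the 1,200-plus books

python_books.csv: stores the detailed information for each book

3). Libraries used

Crawling: requests and BeautifulSoup

Data cleaning: heavy use of regular expressions and the collections module, plus the datetime and calendar modules for publication dates

Multithreading: the threading and queue modules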

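Conclusion:

That wraps up the crawler installment of our analysis of every Python book on the web. We have covered essentially all the technical points of the project. Crawling is fun, but there is a lot to learn before you become an expert: writing a crawler that is fast, robust, and able to get past anti-crawler systems is not easy. If you are interested, try writing one yourself. I will put the source code on GitHub once the data analysis installment is finished, and if you have any questions, feel free to leave a comment.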

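I recommend Full Stack Python, which gathers all kinds of Python resources, from beginner basics through web application development and deployment with various frameworks, up to advanced topics such as ORMs and Docker. Below are some of the tutorials summarized on Full Stack Python. I have roughly translated them and made some adjustments (reordering and removing some items):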

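1. No development experience, new to Python

If you do not know any other language and Python is your first:

A Byte of Python (a Chinese edition is available as 简明 Python 教程) is a very good introductory tutorial.

Learn Python the Hard Way (a free tutorial by Zed Shaw; I strongly recommend it)

Python, Django and Flask tutorial: Real Python (paid; purchase required)

This short 5 minute video explains why your starting point should be a project you want to finish or a problem you want to solve, rather than learning a language for its own sake.

Dive into Python 3 is an open source Python book, available in HTML and PDF.

Code Academy has a Python track for complete beginners.

Introduction to Programming with Python covers basic syntax, control structures, and more, with plenty of code examples.

The O'Reilly book Think Python: How to Think Like a Computer Scientist is a very good introductory text.

Python Practice Book is a book of Python exercises that helps you master the basic syntax.

Want to learn programming by building real projects? Take a look at this list of 5 programming project for Python beginners.

One of the creators of Reddit made a tutorial on how to use Python to build a blog. It is a very good introduction to web programming.

The author of Full Stack Python wrote an article on how to learn Python: learning Python.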

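2. Have development experience, new to Python

Learn Python in y minutes gets you up to speed in a few minutes and gives you a rough overview.

Python for you and me covers Python syntax and the main language constructs, and also includes a tutorial on building a Flask web app.

The Hitchhiker’s Guide to Python

How to Develop Quality Python Code, on how to develop high-quality Python code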

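3. Beyond the basics

The Python Ecosystem: An Introduction covers the Python ecosystem: virtual machines, the Python package manager pip, virtual environments with virtualenv, and many other advanced topics.

The Python Subreddit is the Python community on Reddit (roughly comparable to Tieba in China), an active community where you can discuss, ask questions, and get problems solved.

Good to Great Python Reads collects intermediate and advanced Python articles covering many nuances and details of the language itself.

The blog Free Python Tips has many articles on Python and the Python ecosystem.

Python Books has some free books on Python, Django, data analysis, and related topics.

Python IAQ: Infrequently Asked Questions is a list of rarely asked questions about unusual Python features.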

4. Videos, screencasts, and presentations

Recordings from conferences and meetups: best Python videos

5. Python packages

awesome-python collects all sorts of useful and cool Python packages. It really is awesome, and the author wishes he had found it sooner (I wish I had this page when I was just getting started).

easy-python

6. Podcasts

Talk Python to Me focuses on the people and organizations using Python; each episode invites developers to talk about their work.

Podcast.__init__ is about Python and the people who make it great.

7. Newsletters (available by subscription)

Python Weekly: the latest Python articles, videos, projects, and news.

PyCoder's Weekly is similar to Python Weekly.

Import Python

The quoted original follows:

New to programming

If you're learning your first programming language these books were written with you in mind. Developers learning Python as a second or later language should skip down to the next section for "experienced developers".

To get an introduction to Python, Django and Flask at the same time, consider purchasing the Real Python course by Fletcher, Michael and Jeremy.

This short 5 minute video explains why it's better to think of projects you'd like to build and problems you want to solve with programming. Start working on those projects and problems rather than jumping into a specific language that's recommended to you by a friend.

CS for All is an open book by professors at Harvey Mudd College which teaches the fundamentals of computer science using Python. It's an accessible read and perfect for programming beginners.

If you've never programmed before check out the Getting Started page on Learn To Code with Me by Laurence Bradford. She's done an incredible job of breaking down the steps beginners should take when they're uncertain about where to begin.

Learn Python the Hard Way is a free book by Zed Shaw.

Dive into Python 3 is an open source book provided under the Creative Commons license and available in HTML or PDF form.

While not Python-specific, Mozilla put together a Learning the Web tutorial for beginners and intermediate web users who want to build websites. It's worth a look from a general web development perspective.

A Byte of Python is a beginner's tutorial for the Python language.

Code Academy has a Python track for people completely new to programming.

Introduction to Programming with Python goes over the basic syntax and control structures in Python. The free book has numerous code examples to go along with each topic.

Google put together a great compilation of materials and subjects you should read and learn from if you want to be a professional programmer. Those resources are useful not only for Python beginners but any developer who wants to have a strong professional career in software.

The O'Reilly book Think Python: How to Think Like a Computer Scientist is available in HTML form for free on the web.

Python Practice Book is a book of Python exercises to help you learn the basic language syntax.

Looking for ideas about what projects to use to learn to code? Check out this list of 5 programming project for Python beginners.

There's a Udacity course by one of the creators of Reddit that shows how to use Python to build a blog. It's a great introduction to web development concepts through coding.

I wrote a quick blog post on learning Python that non-technical folks trying to learn to program may find useful.

Experienced developers new to Python

Learn Python in y minutes provides a whirlwind tour of the Python language. The guide is especially useful if you're coming in with previous software development experience and want to quickly grasp how the language is structured.

Python for you and me is an approachable book with sections for Python syntax and the major language constructs. The book also contains a short guide at the end to get programmers to write their first Flask web application.

Kenneth Reitz's The Hitchhiker’s Guide to Python contains a wealth of information both on the Python programming language and the community.

How to Develop Quality Python Code is a good read to begin learning about development environments, application dependencies and project structure.

Beyond the basics

The Python Ecosystem: An Introduction provides context for virtual machines, Python packaging, pip, virtualenv and many other topics after learning the basic Python syntax.

The Python Subreddit rolls up great Python links and has an active community ready to answer questions from beginners and advanced Python developers alike.

Good to Great Python Reads is a collection of intermediate and advanced Python articles around the web focused on nuances and details of the Python language itself.

The blog Free Python Tips provides posts on Python topics as well as news for the Python ecosystem.

Python Books is a collection of freely available books on Python, Django, and data analysis.

Python IAQ: Infrequently Asked Questions is a list of quirky queries on rare Python features and why certain syntax was or was not built into the language.

Videos, screencasts and presentations

Videos from conferences and meetups along with screencasts are listed on the best Python videos page.

Curated Python packages lists

awesome-python is an incredible list of Python frameworks, libraries and software. I wish I had this page when I was just getting started.

easy-python is like awesome-python although instead of just a Git repository this site is in the Read the Docs format.

Podcasts

Talk Python to Me focuses on the people and organizations coding on Python. Each episode features a different guest interviewee to talk about his or her work.

Podcast.__init__ is another podcast on "about Python and the people who make it great".

Newsletters

Python Weekly is a free weekly roundup of the latest Python articles, videos, projects and upcoming events.

PyCoder's Weekly is another great free weekly email newsletter similar to Python Weekly. The best resources are generally covered in both newsletters but they often cover different articles and projects from around the web.

Import Python is a newer newsletter than Python Weekly and PyCoder's Weekly. So far I've found this newsletter often pulls from different sources than the other two. It's well worth subscribing to all three so you don't miss anything.

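xlwings can open Excel in two ways.

The first is the xw.books.add() call you wrote, but it is incomplete, which is why it raises an error. It should be:

import xlwings as xw

app = xw.App(visible=True, add_book=False)

book = app.books.add()

The second is the xw.Book() call you also wrote.

The difference between the two: when you process files repeatedly, opening them through App.books.open keeps everything in a single Excel window, whereas xw.Book opens multiple windows.

I recommend the first approach. Thanks for reading.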