方法/步骤fromurllib.requestimporturlopen用于打开
网页fromurllib.errorimportHTTPError用于处理链接异常frombs4importBeautifulSoup用于处理html文档importre用正则表达式匹配目标字符串例子用关于抓取百度新闻网页的某些图片链接fromurllib.requestimporturlopenfromurllib.errorimportHTTPErrorfrombs4importBeautifulSoupimportreurl="/"try:html=urlopen(url)exceptHTTPErrorase:print(e)try:bsObj=BeautifulSoup(html.read())images=bsObj.findAll("img",{"src":re.compile(".*")})forimageinimages:print(image["src"])exceptAttributeErrorase:print(e)importjava.io.BufferedReaderimportjava.io.IOExceptionimportjava.io.InputStreamReaderimportjava.net.HttpURLConnectionimportjava.net.MalformedURLExceptionimportjava.net.URLpublicclassCapture{publicstaticvoidmain(String[]args)throwsMalformedURLException,IOException{StringstrUrl="/"URLurl=newURL(strUrl)HttpURLConnectionhttpConnection=(HttpURLConnection)url.openConnection()InputStreamReaderinput=newInputStreamReader(httpConnection.getInputStream(),"utf-8")BufferedReaderbufferedReader=newBufferedReader(input)Stringline=""StringBuilderstringBuilder=newStringBuilder()while((line=bufferedReader.readLine())!=null){stringBuilder.append(line)}Stringstring=stringBuilder.toString()intbegin=string.indexOf("")intend=string.indexOf("")System.out.println("IPaddress:"+string.substring(begin,end))}Python 用requests + BeautifulSoup
很方便。
【Step1】获取html:
import requests
r = requests.get(‘’)
html = r.text#这样3行代码就把网页的html取出来了
【Step2】解析:
html用你喜欢的方式解析就可以了,牛逼的话可以直接正则。
from bs4 import BeautifulSoup
soup = BeautifulSoup(html) #这样2行就可以很方便的操作soup解析了
或者专业点的用scrapy爬虫框架,默认用xpath解析。