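1. How to make a Python crawler crawl a second level
A common approach is to traverse each user's following and follower lists: when you reach a user on the next level, you traverse that user's following and follower lists in turn. Applied recursively, and with already-visited IDs recorded so that cycles do not repeat forever, this reaches most users' information.
In my code, the main idea is to first put all user IDs into a list, then iterate over that list and collect each user's information separately.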
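The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the answerer's actual code: it uses an explicit queue instead of recursion (to avoid hitting Python's recursion limit), and `collect_user_ids`, `get_neighbors`, and the in-memory `FAKE_GRAPH` are hypothetical names standing in for real requests to a site's following/follower pages.

```python
from collections import deque


def collect_user_ids(seed_id, get_neighbors, max_users=1000):
    """Walk following/follower links breadth-first and return IDs in visit order."""
    seen = {seed_id}          # IDs already queued, so cycles are not revisited
    order = []                # the final list of user IDs
    queue = deque([seed_id])
    while queue and len(order) < max_users:
        user_id = queue.popleft()
        order.append(user_id)
        for neighbor in get_neighbors(user_id):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical data: each user maps to the users they follow or are followed by.
FAKE_GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice"],
    "dave": ["bob", "erin"],
    "erin": [],
}


def fake_neighbors(user_id):
    """Stand-in for a function that fetches a user's following and follower lists."""
    return FAKE_GRAPH.get(user_id, [])


user_ids = collect_user_ids("alice", fake_neighbors)
print(user_ids)
```

Once the ID list is complete, a second loop over `user_ids` can fetch each profile, which matches the two-phase idea in the answer.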
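2. How to find the most frequent words with a Python crawler
This is entirely doable.
To learn the basics first, you can look up the video "python爬虫联想词视频" (a Python crawler video on suggested search terms).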
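As a rough illustration of the counting step, here is a sketch that uses only the standard library. It assumes the page has already been downloaded into a string; `TextExtractor`, `top_words`, and `SAMPLE_HTML` are hypothetical names, and the simple regular expression only handles space-separated Latin text.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def top_words(html, n=3):
    """Return the n most common words in the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z']+", " ".join(parser.parts).lower())
    return Counter(words).most_common(n)


# Made-up page content for demonstration.
SAMPLE_HTML = (
    "<html><body><p>spider spider crawl</p>"
    "<script>var spider;</script>"
    "<p>crawl spider page</p></body></html>"
)
top = top_words(SAMPLE_HTML, n=2)
print(top)
```

For Chinese text the regular expression would need to be replaced by a word segmenter (jieba is one commonly used library), since words are not separated by spaces.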
3. How to prevent duplicate crawling with a Java jsoup crawler
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTest {
    static String url = "http://www.sogou.com/web?sut=1374&lkt=1%2C1386588673481%2C1386588673481&ie=utf8&sst0=1386588674552&p=40040100&dp=1&w=01019900&dr=1&_asf=www.sogou.com&_ast=1386589056&query=java网页爬虫&page=1";
    // Upper bound on retries so an unreachable host cannot cause endless recursion.
    static final int MAX_RETRIES = 3;

    public static void main(String[] args) {
        Document doc = readUrlFist(url, MAX_RETRIES);
        if (doc != null) {
            write(doc);
        }
    }

    // Save the fetched page to a local file, encoded as UTF-8.
    public static void write(Document doc) {
        // try-with-resources closes the writer and the streams it wraps.
        try (BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("C:\\Documents and Settings\\Administrator\\桌面\\a.html"),
                StandardCharsets.UTF_8))) {
            bw.write(doc.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Fetch the page; retry on DNS failure or timeout while retries remain.
    public static Document readUrlFist(String url, int retriesLeft) {
        Document doc = null;
        Connection conn = Jsoup.connect(url);
        conn.header(
                "User-Agent",
                "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2 Googlebot/2.1");
        try {
            doc = conn.timeout(200 * 1000).get();
        } catch (IOException e) {
            e.printStackTrace();
            if (retriesLeft > 0
                    && ((e instanceof UnknownHostException)
                    || (e instanceof SocketTimeoutException))) {
                doc = readUrlFist(url, retriesLeft - 1);
            }
        }
        return doc;
    }
}
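The listing above fetches and saves a single page, so it has no deduplication step of its own. A common way to avoid crawling the same page twice is to keep a set of URLs already seen and to normalize each URL before checking it. The sketch below shows that idea in Python; the same structure carries over to Java with a `HashSet<String>`. `normalize`, `SeenUrls`, and the example URLs are hypothetical, and the normalization rules are one reasonable choice, not a standard.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url):
    """Drop the fragment and lowercase scheme and host so equivalent URLs match."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


class SeenUrls:
    """Remember which URLs have already been scheduled for fetching."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Return True the first time a URL is offered, False on repeats."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


seen = SeenUrls()
candidates = [
    "http://Example.com/page?id=1",
    "http://example.com/page?id=1#comments",  # same page, different fragment
    "http://example.com/page?id=2",
    "http://example.com",
    "http://example.com/",                    # same as the previous entry
]
to_fetch = [u for u in candidates if seen.add_if_new(u)]
print(to_fetch)
```

An in-memory set is enough for a small crawl; a very large crawl usually needs a persistent store or a Bloom filter instead.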
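4. How to crawl JSON with Python under the Scrapy framework
The Request is created exactly as for an ordinary web page. After the Request is submitted, Scrapy downloads the resource and builds a Response; all that remains is to parse response.body as JSON and extract the data. A code example follows (using JD.com as the example; parse_phone_price and parse_commnets extract their data from JSON, and some code is omitted):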
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider, CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from jdcom.items import JdPhoneCommentItem, JdPhoneItem
from scrapy import Request
from datetime import datetime
import json
import logging
import re

logger = logging.getLogger(__name__)


class JdPhoneSpider(CrawlSpider):
    name = "jdPhoneSpider"
    start_urls = ["http://list.jd.com/list.html?cat=9987,653,655"]
    rules = (
        Rule(
            LinkExtractor(allow=r"list\.html\?cat\=9987,653,655\&page\=\d+\&trans\=1\&JL\=6_0_0"),
            callback="parse_phone_url",
            follow=True,
        ),
    )

    def parse_phone_url(self, response):
        # Every product link on a listing page leads to one request for its first comments page.
        hrefs = response.xpath("//div[@id='plist']/ul/li/div/div[@class='p-name']/a/@href").extract()
        for href in hrefs:
            # The slice assumes hrefs shaped like "//item.jd.com/<id>.html".
            phoneID = href[14:-5]
            commentsUrl = "http://sclub.jd.com/productpage/p-%s-s-0-t-3-p-0.html" % phoneID
            yield Request(commentsUrl, callback=self.parse_commnets)

    def parse_phone_price(self, response):
        # The price endpoint returns a JSON array; the price is in the "p" field.
        phoneID = response.meta['phoneID']
        meta = response.meta
        priceStr = response.body.decode("gbk", "ignore")
        priceJson = json.loads(priceStr)
        price = float(priceJson[0]["p"])
        meta['price'] = price
        phoneUrl = "http://item.jd.com/%s.html" % phoneID
        yield Request(phoneUrl, callback=self.parse_phone_info, meta=meta)

    def parse_phone_info(self, response):
        pass

    def parse_commnets(self, response):
        commentsStr = response.body.decode("gbk", "ignore")
        commentsJson = json.loads(commentsStr)
        comments = commentsJson['comments']
        for comment in comments:
            # Build a fresh item per comment so yielded items do not share state.
            commentsItem = JdPhoneCommentItem()
            commentsItem['commentId'] = comment['id']
            commentsItem['guid'] = comment['guid']
            commentsItem['content'] = comment['content']
            commentsItem['referenceId'] = comment['referenceId']
            # referenceTime looks like "2016-09-19 13:52:49".
            commentsItem['referenceTime'] = datetime.strptime(comment['referenceTime'], "%Y-%m-%d %H:%M:%S")
            commentsItem['referenceName'] = comment['referenceName']
            commentsItem['userProvince'] = comment['userProvince']
            # userRegisterTime and productSize may be missing, hence .get().
            commentsItem['userRegisterTime'] = comment.get('userRegisterTime')
            commentsItem['nickname'] = comment['nickname']
            commentsItem['userLevelName'] = comment['userLevelName']
            commentsItem['userClientShow'] = comment['userClientShow']
            commentsItem['productColor'] = comment['productColor']
            commentsItem['productSize'] = comment.get("productSize")
            commentsItem['afterDays'] = int(comment['days'])
            images = comment.get("images")
            images_urls = ""
            if images:
                for image in images:
                    # Join all image URLs into one ";"-separated string.
                    images_urls += image["imgUrl"] + ";"
            commentsItem['imagesUrl'] = images_urls
            yield commentsItem

        summary = commentsJson["productCommentSummary"]
        commentCount = summary["commentCount"]
        phoneID = summary["productId"]
        priceUrl = "http://p.3.cn/prices/mgets?skuIds=J_%s" % phoneID
        meta = {
            "phoneID": phoneID,
            "commentCount": commentCount,
            "goodCommentsCount": summary["goodCount"],
            "goodCommentsRate": summary["goodRate"],
            "generalCommentsCount": summary["generalCount"],
            "generalCommentsRate": summary["generalRate"],
            "poorCommentsCount": summary["poorCount"],
            "poorCommentsRate": summary["poorRate"],
        }
        yield Request(priceUrl, callback=self.parse_phone_price, meta=meta)

        # Ten comments per page; "//" keeps pageNum an int so range() accepts it on Python 3.
        pageNum = commentCount // 10 + 1
        for i in range(pageNum):
            commentsUrl = "http://sclub.jd.com/productpage/p-%s-s-0-t-3-p-%d.html" % (phoneID, i)
            yield Request(commentsUrl, callback=self.parse_commnets)
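The JSON handling in the spider can be tried out without Scrapy. The sketch below decodes a response body and works out how many comment pages to request. `parse_comments_body` and `SAMPLE_BODY` are hypothetical; the payload is made up and only mirrors the keys the spider reads, and the page size of 10 is taken from the spider's own arithmetic.

```python
import json


def parse_comments_body(body, page_size=10):
    """Decode a GBK-encoded JSON body; return (comment contents, number of pages)."""
    data = json.loads(body.decode("gbk", "ignore"))
    contents = [comment["content"] for comment in data["comments"]]
    total = data["productCommentSummary"]["commentCount"]
    pages = -(-total // page_size)  # ceiling division without floats
    return contents, pages


# Made-up payload that mirrors the keys used in the spider above.
SAMPLE_BODY = json.dumps({
    "comments": [{"id": 1, "content": "good"}, {"id": 2, "content": "ok"}],
    "productCommentSummary": {"commentCount": 25, "productId": 1000},
}).encode("gbk")

contents, pages = parse_comments_body(SAMPLE_BODY)
print(contents, pages)
```

Ceiling division requests exactly as many pages as are needed, whereas dividing by 10 and adding 1 asks for one extra page whenever the comment count is an exact multiple of 10.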
5. How to crawl JSON with Python under the Scrapy framework
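import json
# text holds the raw response; drop everything before the first "(" and after the last ")".
text = text[(text.find('(') + 1):text.rfind(')')]
data = json.loads(text)
comments = data['comments']
# Then just loop over comments with a for loop.
If you are using Scrapy, see the code below:
def parse(self, response):
    # response.text is the decoded body (older Scrapy versions used response.body_as_unicode()).
    jsonresponse = json.loads(response.text)
    item = MyItem()
    item["firstName"] = jsonresponse["firstName"]
    return item
The value of callback= in the request URL is the function name that appears at the start of the returned content. Cut off that function name and the surrounding parentheses, and what remains is JSON.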
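The wrapper-stripping step can be packaged as a small helper. This is a sketch under assumptions: `unwrap_jsonp` is a hypothetical name, the callback name `fetchJSON_comment` in the sample is made up, and the regular expression only covers the simple `name(...)` or `name(...);` form.

```python
import json
import re

# Matches "callbackName( ... )" with an optional trailing semicolon.
_JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Parse a JSONP response; fall back to plain JSON if there is no wrapper."""
    match = _JSONP_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up response text for demonstration.
SAMPLE = 'fetchJSON_comment({"comments": [{"id": 1, "content": "good (really)"}]});'
data = unwrap_jsonp(SAMPLE)
ids = [comment["id"] for comment in data["comments"]]
print(ids)
```

Falling back to plain JSON means the same helper still works if the server stops wrapping the payload, for example when the callback= parameter is left out of the request.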
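6. How to crawl JSON with Python under the Scrapy framework
You do not need to learn Python very deeply: once you have covered loops, functions, and classes, you can start on deep learning. As a beginner, run some examples with a deep learning library first. When you want to go further and modify things, you will find that the bottleneck is the math, not Python.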
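7. How to write a small crawler in Python to crawl Lagou (拉勾网)
1. First, open Lagou and search for "java". The job listings that appear are our target.
2. Next, we need to work out how to extract that information.
View the page source. The job information cannot be found in it, which shows that Lagou loads its job listings asynchronously, a very common technique.
For asynchronously loaded information, we need the developer tools in the Chrome browser to analyze the requests.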
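Once the Network panel in the developer tools shows which request returns the job data, the crawler can call that request directly and parse the JSON. The sketch below shows the general shape only. Everything specific in it is a placeholder: `AJAX_URL`, the form fields `keyword` and `page`, and the response keys `results`, `title`, and `company` are hypothetical and must be replaced with what the developer tools actually show.

```python
import json
from urllib import parse, request

# Hypothetical endpoint: substitute the XHR URL observed in the Network panel.
AJAX_URL = "https://example.com/jobs/search.json"


def build_request(keyword, page):
    """Build a POST request that imitates the browser's asynchronous call."""
    form = parse.urlencode({"keyword": keyword, "page": page}).encode("utf-8")
    headers = {
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",
    }
    return request.Request(AJAX_URL, data=form, headers=headers)


def extract_jobs(payload_text):
    """Pull (title, company) pairs out of the JSON text (hypothetical keys)."""
    payload = json.loads(payload_text)
    return [(job["title"], job["company"]) for job in payload["results"]]


def fetch_jobs(keyword, page=1):
    """Send the request and parse the reply; needs network access and a real endpoint."""
    with request.urlopen(build_request(keyword, page), timeout=10) as resp:
        return extract_jobs(resp.read().decode("utf-8"))


# Offline demonstration with a made-up payload.
SAMPLE_PAYLOAD = '{"results": [{"title": "Java Engineer", "company": "Acme"}]}'
jobs = extract_jobs(SAMPLE_PAYLOAD)
req = build_request("java", 1)
print(jobs, req.get_method())
```

Keeping request building and parsing in separate functions lets the parsing be checked against a saved response before any live requests are sent.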
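8. Python web crawlers and focused crawlers: how to crawl jokes with a crawler
Focused crawlers address the following limitations of general-purpose search engines:
(1) Users from different fields and backgrounds often have different search goals and needs, and the results returned by a general-purpose search engine include a large number of pages the user does not care about.
(2) A general-purpose search engine aims for the widest possible web coverage, so the conflict between limited search engine server resources and unlimited web data resources will only deepen.
(3) As data formats on the World Wide Web grow richer and web technology keeps developing, large amounts of different data such as images, databases, audio, and video multimedia are appearing. General-purpose search engines often cope poorly with this information-dense, structured data and cannot discover or retrieve it well.
(4) Most general-purpose search engines offer keyword-based retrieval and struggle to support queries posed in terms of semantic information.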