A PHP Script for Scraping Rosi Images

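Reposted from someone else's blog. The images themselves are secondary; the point is to see how the regular expression is written. (Single-threaded PHP is not an obvious handicap for image scraping, but it is still on the slow side, so the script is best left running on a VPS at a leisurely pace. Crawling too fast or too often can get you blocked, which is best avoided.) Here is the script: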

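<?php
    // Let the script run until the crawl finishes (no execution time limit).
    set_time_limit(0);

    // Download $url and save it as $filename; returns the filename, or false on failure.
    function grabImage($url, $filename = ''){
        $img_data = @file_get_contents($url);
        if($img_data === false || $filename === ''){
            return false;
        }
        // Overwrite instead of appending, so a re-run never corrupts an existing file.
        if(file_put_contents($filename, $img_data) === false){
            return false;
        }
        return $filename;
    }

    $base = "http://missmm.com/";

    // Left-pad the page number with zeros to three digits (7 -> "007").
    function a0($s){
        return str_pad((string)$s, 3, "0", STR_PAD_LEFT);
    }

    // Walk the pages from the newest (763) down to 1.
    for($i=763;$i>0;$i--){
        $html = @file_get_contents($base.a0($i).".html");
        if($html === false){
            continue; // skip pages that fail to load
        }
        // Lazily match every URL that contains dulei.si and ends in .jpg.
        preg_match_all('/http.*?dulei\.si.*?\.jpg/i', $html, $arr);
        foreach($arr[0] as $v){
            $data = pathinfo($v);
            grabImage($v,$data["basename"]);
            sleep(1); // pause between downloads so the server is not hammered
        }
    }
?>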
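
Since the regular expression is the main point of the script, here is a small offline sketch for experimenting with it. It is not part of the original post: the sample HTML is made up, and `padPage()` and `extractImageUrls()` are hypothetical helper names. The pattern is a slightly stricter variant of the one above, assuming that image URLs never contain quotes, whitespace, or angle brackets, so a match cannot run across attribute boundaries.

```php
<?php
// Hypothetical sample markup standing in for one downloaded page (not from the real site).
$html = '<a href="/about.html">about</a>'
      . '<img src="http://img.dulei.si/rosi/001/a.jpg">'
      . '<img src="http://img.dulei.si/rosi/001/b.JPG">'
      . '<img src="http://other.example/c.jpg">';

// Zero-pad a page number to three digits, like a0() in the script.
function padPage(int $n): string {
    return sprintf('%03d', $n);
}

// Return the unique .jpg URLs in $html whose address contains dulei.si.
function extractImageUrls(string $html): array {
    // The character class keeps each match inside a single attribute value.
    preg_match_all('/https?:\/\/[^"\'\s<>]*dulei\.si[^"\'\s<>]*\.jpg/i', $html, $m);
    return array_values(array_unique($m[0]));
}

$urls = extractImageUrls($html);
foreach ($urls as $u) {
    // Print the padded page number next to the file name that would be saved.
    echo padPage(7), ' ', basename($u), PHP_EOL;
}
```

Running this against a saved copy of a real page is a cheap way to check the pattern before starting a long crawl.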

Original post: http://fuck.0day5.com/?p=767

June 27, 2014

admin

Programming, Tools

PHP, preg_match_all

9 comments on "A PHP Script for Scraping Rosi Images"

a-z says:

2017-10-20 11:38

icrawler: a powerful yet simple image crawler library
https://mp.weixin.qq.com/s/c7BSRHiOVYvG2AypX1O-9g
http://icrawler.readthedocs.io/en/latest/usage.html
https://github.com/hellock/icrawler

a-z says:

2017-12-11 10:34

Scraping the UC Toutiao site with scrapy+selenium
http://kekefund.com/2017/12/06/scrapy-and-selenium/

Scraping dynamic (JS) sites with scrapy+splash
http://kekefund.com/2017/05/25/scrapy-splash/

a-z says:

2018-01-07 07:41

Scraping the dark web with puppeteer and chrome-headless
http://arganzheng.life/getting-started-with-puppeteer-and-chrome-headless-for-web-scrapping.html

a-z says:

2018-01-15 13:36

Dynamic configurable crawl (a dynamically configurable crawler)
http://www.anycrawl.info
https://github.com/facert/scrapy_helper

A collection of crawlers
https://github.com/facert/awesome-spider

A multithreaded Tumblr crawler in Python
https://github.com/facert/tumblr_spider

Designing and implementing a crawler framework
https://blog.biezhi.me/2018/01/design-and-implement-a-crawler-framework.html

hi says:

2018-06-13 13:31

Crawler attack and defense: a brief analysis of front-end strategies
http://coolcao.com/2018/06/09/tips-of-anti-spider-in-fe/
`
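1. Custom fonts
1.1 Maoyan Movies
1.2 Qunar mobile site
1.3 Qidian (起点中文网)
1.4 Summary
2. Overlaying elements through positioning
3. Piecing content together from background images
4. Substituting pseudo-elements
5. Inserting decoy characters and hiding them
6. Conclusion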
`

hi says:

2018-11-27 19:28

Setting up a Selenium cluster
https://www.03sec.com/3233.shtml
https://paper.tuisec.win/detail/421973d82e43af4

hi says:

2018-12-04 11:03

Practical roundup: a collection of crawler scripts with over 100 stars on GitHub
https://mp.weixin.qq.com/s/mjOkK3pLubolJ_mK2GTniA

hi says:

2019-04-13 16:26

What are some excellent Python crawler projects on GitHub? – answer by 龙鹏-言有三 – Zhihu
https://www.zhihu.com/question/58151047/answer/640461600
https://zhuanlan.zhihu.com/p/61289585
`
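1 Overview projects and learning resources
First, here are some excellent overview and learning projects that make it easy to look up the resources you need.
1.1 awesome-spider
URL: https://github.com/facert/awesome-spider
This project collects crawlers for almost every Chinese site that can be scraped, from Zhihu and Douban to CNKI, from Douyin and Weibo to QQ, plus plenty of NSFW sites.

1.2 Nyspider
URL: https://github.com/Nyloner/Nyspider

1.3 awesome-python-login-model
URL: https://github.com/CriseLYJ/awesome-python-login-model
Maintained by the user CriseLYJ (occupation unknown), this project simulates logging in to various websites and also includes a few simple crawlers; 6000+ stars.
Starting from this project to study how the major sites handle login is very useful: get to know your opponent before making a move.

1.4 python-spider
URL: https://github.com/Jack-Cherish/python-spider
Learning material for Python crawlers compiled by the user Jack-Cherish, a student at Northeastern University (东北大学); 6000+ stars. It includes quite a few hands-on projects and suits anyone who wants to learn.

There are other projects as well, which are not covered one by one here:
https://github.com/jhao104/proxy_pool
https://github.com/Ehco1996/Python-crawler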

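2 Excellent image/video projects

The author mostly works with images and video, so below is one powerful, easy-to-use crawler for each.
The author has tested both tools and found that they keep working over the long term, which saves a lot of time otherwise spent hunting for crawler tools.

2.1 Image crawler for the three major search engines: Google, Baidu, and Bing
URL: https://github.com/sczhengyabin/Image-Downloader

Maintained by the user sczhengyabin, this crawler downloads images from Baidu, Bing, and Google according to your criteria. I have used it for several years, and it provides a very user-friendly GUI. Usage is as follows:
Run python image_downloader_gui.py to open the GUI, then set the parameters (keywords, path, number of images, and so on). Keywords can be typed in directly or loaded from a txt file.
The number of samples to download is configurable; one run here fetched 2000 images in about 3 minutes.
This crawler is enough to build the initial dataset for a small project (a few thousand high-quality images is no problem), and the results are named neatly and consistently. Its biggest advantage is stability: it does not stop working every few days.

2.2 Crawler for the major video sites
URL: https://github.com/iawia002/annie
Maintained by the user iawia002, Annie is a video download tool written in Go. It is easy to use and supports downloading videos and images from many sites, including YouTube, Tencent Video, and Douyin, so it covers pretty much everything you would expect.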
`

hi says:

2019-05-10 11:39

Using Python + Selenium to screenshot a specific page element (including elements tall enough to need a long screenshot)
https://paper.tuisec.win/detail/2626fa1cb08d012
https://www.jianshu.com/p/7ed519854be7
