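I picked up the idea from an expert's blog (http://www.leesec.com/archives/416). In my version I use sed for the processing and name each HTML file after the article title. I did not convert the result to PDF, because I find HTML files more comfortable to read. The resulting HTML file looks like this:
The Bash script is as follows: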
#!/bin/bash
echo '<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="zh-CN"><head profile="http://gmpg.org/xfn/11">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>CRLF Injection漏洞的利用与实例分析 | WooYun知识库</title>
<link rel="stylesheet" type="text/css" href="http://drops.wooyun.org/wp-content/themes/GZai/style.css">
<link rel="alternate" type="application/rss+xml" href="http://drops.wooyun.org/feed" title="WooYun知识库 RSS Feed">
<link rel="pingback" href="http://drops.wooyun.org/xmlrpc.php">
<meta name="keywords" content="">
<link rel="stylesheet" id="wp-easyarchives-css" href="http://drops.wooyun.org/wp-content/plugins/wp-easyarchives/css/wp-easyarchives.css?ver=3.1" type="text/css" media="screen">
<link rel="stylesheet" id="wp-recentcomments-css" href="http://drops.wooyun.org/wp-content/plugins/wp-recentcomments/css/wp-recentcomments.css?ver=2.2.5" type="text/css" media="screen">
<link rel="stylesheet" id="wp-pagenavi-css" href="http://drops.wooyun.org/wp-content/plugins/wp-pagenavi/pagenavi-css.css?ver=2.70" type="text/css" media="all">
<script type="text/javascript" src="http://drops.wooyun.org/wp-includes/js/jquery/jquery.js?ver=1.7.2"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-includes/js/comment-reply.js?ver=3.4.1"></script>
<link rel="prev" title="用Burpsuite 来处理csrf token" href="http://drops.wooyun.org/tips/2460">
<link rel="next" title="Shodan搜索引擎介绍" href="http://drops.wooyun.org/tips/2469">
<link rel="canonical" href="http://drops.wooyun.org/papers/2466">
<link rel="shortlink" href="http://drops.wooyun.org/?p=2466">
<script type="text/javascript">
window._wp_rp_static_base_url = 'http://dtmvdvtzf8rz0.cloudfront.net/static/';
window._wp_rp_wp_ajax_url = "http://drops.wooyun.org/wp-admin/admin-ajax.php";
window._wp_rp_plugin_version = '2.7';
window._wp_rp_post_id = '2466';
window._wp_rp_num_rel_posts = '6';
</script>
<link rel="stylesheet" href="http://dtmvdvtzf8rz0.cloudfront.net/static/wp-rp-css/plain.css?version=2.7">
<style type="text/css">
.related_post_title {
}
ul.related_post {
}
ul.related_post li {
}
ul.related_post li a {
}
ul.related_post li img {
}</style>
<style type="text/css" media="all">
/* <![CDATA[ */
@import url("http://drops.wooyun.org/wp-content/plugins/wp-table-reloaded/css/plugin.css?ver=1.9.4");
@import url("http://drops.wooyun.org/wp-content/plugins/wp-table-reloaded/css/datatables.css?ver=1.9.4");
/* ]]> */
</style>
<!-- Clean Archives Reloaded v3.2.0 | http://www.viper007bond.com/wordpress-plugins/clean-archives-reloaded/ -->
<style type="text/css">.car-collapse .car-yearmonth { cursor: s-resize; } </style>
<script type="text/javascript">
/* <![CDATA[ */
jQuery(document).ready(function() {
jQuery('.car-collapse').find('.car-monthlisting').hide();
jQuery('.car-collapse').find('.car-monthlisting:first').show();
jQuery('.car-collapse').find('.car-yearmonth').click(function() {
jQuery(this).next('ul').slideToggle('fast');
});
jQuery('.car-collapse').find('.car-toggler').click(function() {
if ( '展开所有月份' == jQuery(this).text() ) {
jQuery(this).parent('.car-container').find('.car-monthlisting').show();
jQuery(this).text('折叠所有月份');
}
else {
jQuery(this).parent('.car-container').find('.car-monthlisting').hide();
jQuery(this).text('展开所有月份');
}
return false;
});
});
/* ]]> */
</script>
<link rel="stylesheet" type="text/css" href="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shCore.css?ver=3.0.83c"><link rel="stylesheet" type="text/css" href="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shThemeDefault.css?ver=3.0.83c"><style type="text/css" id="syntaxhighlighteranchor"></style>
<script src="http://bdimg.share.baidu.com/static/js/logger.js?cdnversion=390049"></script><link href="http://bdimg.share.baidu.com/static/css/bdsstyle.css?cdnversion=20131219" rel="stylesheet" type="text/css"></head>
<body>
<div id="wrapper">
<div id="header">
<div id="header-top">
<h1 id="blog-title">
<a href="http://drops.wooyun.org/" title="WooYun知识库" rel="home" style="float:left;">WooYun知识库</a>
</h1>
</div><!-- #header-top -->
<div id="wp-search-box">
<form method="get" id="searchform" action="http://drops.wooyun.org/">
<input type="text" class="textfield" size="25" name="s" id="s" placeholder="输入关键字进行搜索">
<input type="submit" class="button" name="submit" id="searchbutton" value="搜索">
</form> </div>
<div id="blog-description">像一朵乌云一样成长 </div>
<div class="fixed"></div>
</div><!-- #header -->
<div id="menubar">
<ul class="menu">
<li class=""><a href="http://drops.wooyun.org/" title="WooYun知识库">首页</a>
</li>
<li class="page_item"><a href="http://www.wooyun.org">WooYun</a></li>
<li class="page_item"><a href="http://zone.wooyun.org">Zone</a></li>
<li class="page_item page-item-7"><a href="http://drops.wooyun.org/newsend">投稿</a></li>
</ul>
</div><!-- #menubar -->
<div id="container">'>/tmp/top.html 2>/dev/null
echo ' <div class="desc-share">
<p style="float:left">版权声明:<span style="color:red">未经授权禁止转载</span> <a href="http://drops.wooyun.org/author/爱小狐狸的小螃蟹" title="由 爱小狐狸的小螃蟹 发布" rel="author">爱小狐狸的小螃蟹</a>@<a href="http://drops.wooyun.org">乌云知识库</a></p>
</div>
<div id="nav-below">
<div class="nav-previous">上一篇:<a href="http://drops.wooyun.org/tips/2443" rel="prev">Mimikatz ON Metasploit</a></div>
<div class="nav-next">下一篇:<a href="http://drops.wooyun.org/tips/2451" rel="next">逆向基础(八)</a></div>
</div>
</div><!-- container -->
</div><!-- #wrapper -->
<script>
/* <![CDATA[ */
var rcGlobal = {
serverUrl :"http://drops.wooyun.org",
infoTemp :"%REVIEWER% 在 %POST%",
loadingText :"正在加载",
noCommentsText :"没有任何评论",
newestText :"« 最新的",
newerText :"« 上一页",
olderText :"下一页 »",
showContent :"1",
external :"1",
avatarSize :"32",
avatarPosition :"left",
anonymous :"匿名"
};
/* ]]> */
</script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shCore.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushAS3.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushBash.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushColdFusion.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushClojure.js?ver=20090602"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushCpp.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushCSharp.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushCss.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushDelphi.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushDiff.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushErlang.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushFSharp.js?ver=20091003"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushGroovy.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushJava.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushJavaFX.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushJScript.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushLatex.js?ver=20090613"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushMatlabKey.js?ver=20091209"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushObjC.js?ver=20091207"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPerl.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPhp.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPlain.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPowerShell.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPython.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushR.js?ver=20100919"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushRuby.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushScala.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushSql.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushVb.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushXml.js?ver=3.0.83c"></script>
<script type="text/javascript">
(function(){
var corecss = document.createElement("link");
var themecss = document.createElement("link");
var corecssurl = "http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shCore.css?ver=3.0.83c";
if ( corecss.setAttribute ) {
corecss.setAttribute( "rel", "stylesheet" );
corecss.setAttribute( "type", "text/css" );
corecss.setAttribute( "href", corecssurl );
} else {
corecss.rel = "stylesheet";
corecss.href = corecssurl;
}
document.getElementsByTagName("head")[0].insertBefore( corecss, document.getElementById("syntaxhighlighteranchor") );
var themecssurl = "http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shThemeDefault.css?ver=3.0.83c";
if ( themecss.setAttribute ) {
themecss.setAttribute( "rel", "stylesheet" );
themecss.setAttribute( "type", "text/css" );
themecss.setAttribute( "href", themecssurl );
} else {
themecss.rel = "stylesheet";
themecss.href = themecssurl;
}
//document.getElementById("syntaxhighlighteranchor").appendChild(themecss);
document.getElementsByTagName("head")[0].insertBefore( themecss, document.getElementById("syntaxhighlighteranchor") );
})();
SyntaxHighlighter.config.strings.expandSource = "+ expand source";
SyntaxHighlighter.config.strings.help = "帮助";
SyntaxHighlighter.config.strings.alert = "SyntaxHighlighter\n\n";
SyntaxHighlighter.config.strings.noBrush = "无法找到Brush:";
SyntaxHighlighter.config.strings.brushNotHtmlScript = "Brush不能设置 html-script选项";
SyntaxHighlighter.defaults["auto-links"] = false;
SyntaxHighlighter.defaults["pad-line-numbers"] = false;
SyntaxHighlighter.defaults["toolbar"] = false;
SyntaxHighlighter.all();
</script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/wp-recentcomments/js/wp-recentcomments.js?ver=2.2.5"></script>
</div>
</body>
</html>'>/tmp/bottom.html 2>/dev/null
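# -f: do not fail when no list from an earlier run exists
rm -f /tmp/drops.txt
# Category slugs to crawl; the last entry is the URL-encoded slug of a Chinese category name
dir=(papers tips tools news web pentesting database binary '%e8%bf%90%e7%bb%b4%e5%ae%89%e5%85%a8')
dir_num=${#dir[@]}
for ((i=0; i<dir_num; i++))
do
    # Walk the first 15 index pages of each category and collect "URL title" lines
    for j in $(seq 1 15)
    do
        if [ "$j" -eq 1 ]
        then
            curl "http://drops.wooyun.org/category/${dir[i]}" | grep entry-title | perl -pe 's/.*href..(.+?)".*>(.*)..a>.*/$1 $2/g' >>/tmp/drops.txt
        else
            curl "http://drops.wooyun.org/category/${dir[i]}/page/$j" | grep entry-title | perl -pe 's/.*href..(.+?)".*>(.*)..a>.*/$1 $2/g' >>/tmp/drops.txt
        fi
    done
done
# Fetch every article, keep only the content block, and wrap it in the saved header and footer
while read -r url title
do
    # Replace space, "<" and "/" so the title is usable as a file name
    title=$(printf '%s\n' "$title" | tr ' </' '___')
    curl -s "$url" | sed -n '/<div id="content">/,/entry-tags/p' >/tmp/tmp.html
    cat /tmp/top.html /tmp/tmp.html /tmp/bottom.html >"/tmp/$title.html"
done </tmp/drops.txt
rm -f /tmp/tmp.html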
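To see what the text-processing stages of the script do without touching the network, the sketch below runs them on inline samples. The sample markup is only an assumption about what the category and article pages looked like, and the function names (`extract_link`, `extract_body`, `safe_title`) are made up for illustration; the perl, sed and tr expressions are the ones used in the script.

```shell
#!/bin/bash
# Stage 1: turn an index-page line into "URL title".
extract_link() {
    grep entry-title | perl -pe 's/.*href..(.+?)".*>(.*)..a>.*/$1 $2/g'
}

# Stage 2: keep only the lines from the content div through the tags marker.
extract_body() {
    sed -n '/<div id="content">/,/entry-tags/p'
}

# Stage 3: make a title usable as a file name (space, "<" and "/" become "_").
safe_title() {
    printf '%s\n' "$1" | tr ' </' '___'
}

# Hypothetical index-page line; the real markup may have differed.
sample_index='<h2 class="entry-title"><a href="http://drops.wooyun.org/papers/2466" rel="bookmark">CRLF Injection</a></h2>'

# Hypothetical article page with a header and footer around the content block.
sample_page='<div id="header">site header</div>
<div id="content">
<p>article text</p>
<div class="entry-tags">tags</div>
<div id="footer">site footer</div>'

printf '%s\n' "$sample_index" | extract_link
printf '%s\n' "$sample_page" | extract_body
safe_title 'CRLF Injection a/b'
```

On these samples the first stage prints the URL followed by the title, the second prints only the three lines from `<div id="content">` to the `entry-tags` line, and the third prints `CRLF_Injection_a_b`.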
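A few things to note. The file names need to be converted from UTF-8 to GBK, because they show up garbled on Windows even though they are fine on Linux. The convmv command handles this: put all the HTML files in a newly created directory and a single convmv command is enough, with no for loop needed;
To get PDF files, the wkhtmltopdf tool can convert the downloaded HTML files to PDF (just add a for loop at the end of the shell script);
If you like, you can also add an archive command at the end of the script to pack the HTML files and send the archive to your own mailbox.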
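The three follow-up steps above could be sketched as small helper functions. This is only an illustration under stated assumptions: the function names are hypothetical, the HTML files are assumed to have been moved into a directory of their own, and convmv and wkhtmltopdf must be installed for the two functions that call them.

```shell
#!/bin/bash
# Hypothetical helpers; none of these names appear in the original script.

# Rename every file in directory $1 from UTF-8 to GBK file names.
# --notest makes convmv actually rename instead of only previewing.
rename_for_windows() {
    convmv -f utf8 -t gbk --notest "$1"/*
}

# Map an HTML path to a PDF path next to it: a/b.html -> a/b.pdf
pdf_name() {
    printf '%s\n' "${1%.html}.pdf"
}

# Convert every HTML file in directory $1 to PDF with wkhtmltopdf.
convert_all_to_pdf() {
    local f
    for f in "$1"/*.html; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        wkhtmltopdf "$f" "$(pdf_name "$f")"
    done
}

# Pack directory $1 into the gzip-compressed tar archive $2.
pack_dir() {
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}
```

A possible order of use, with `/tmp/drops_html` as an assumed directory name: `convert_all_to_pdf /tmp/drops_html`, then `rename_for_windows /tmp/drops_html`, then `pack_dir /tmp/drops_html /tmp/drops_html.tar.gz`. Mailing the archive is left out because the mail command and its attachment option differ between systems.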

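11 comments on "Batch download script for WooYun knowledge base articles"
It only grabs the title and the layout, there is no content.
Thanks for the feedback! The scrape probably fails because the page structure of the knowledge base has changed.
Adapting this code should not be too complicated, though. I was too busy over the last two days to notice; I will find some time this weekend to update the script.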
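One way to catch this kind of failure early is to check whether the sed range matched anything before writing the file. A minimal sketch, assuming the extraction is factored into a helper; `extract_body` and `check_body` are hypothetical names that are not part of the script.

```shell
#!/bin/bash
# Same sed range as in the script, wrapped in a function.
extract_body() {
    sed -n '/<div id="content">/,/entry-tags/p'
}

# Read a page from stdin; print its content block and return 0,
# or warn on stderr and return 1 when the start marker was not found.
check_body() {
    local body
    body=$(extract_body)
    if [ -z "$body" ]; then
        echo "warning: no content block found for $1" >&2
        return 1
    fi
    printf '%s\n' "$body"
}

# A page whose layout no longer contains the expected start marker.
printf '%s\n' '<div id="main">new layout</div>' | check_body "sample page" || echo "extraction failed"
```

When the warning appears, the two patterns in the sed range are the place to adjust for the new page layout.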
WooYun knowledge base mirror
https://drops.secquan.org/
Saving a web page as an image or PDF
http://www.03sec.com/3166.shtml
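`
First locate the PhantomJS installation path, then change into the examples directory under it. Running the following command is all it takes:
phantomjs rasterize.js <URL> <output file name>
If the output file name ends in pdf, the page is saved as a PDF document; if it ends in png, the page is saved as a screenshot.
`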
wooyunallbugs: wooyun_all_bugs historical archive data and images
https://github.com/m0l1ce/wooyunallbugs
http://www.loner.fm/bugs/
Turning a PhantomJS crawler into a service
http://jiayi.space/post/phantomjspa-chong-fu-wu-hua
A first look at the Chrome headless browser
https://lightless.me/archives/first-glance-at-chrome-headless-browser.html
https://developers.google.com/web/updates/2017/04/headless-chrome
A first look at Chrome headless, part 2
https://lightless.me/archives/chrome-headless-second.html
https://github.com/wilson9x1/ChromeHeadlessInterface
https://github.com/minektur/chrome_remote_shell
https://duo.com/blog/driving-headless-chrome-with-python
Headless mode supported in Chrome 59, Firefox 56
https://www.bleepingcomputer.com/news/security/chrome-and-firefox-headless-modes-may-spur-new-adware-and-clickfraud-tactics/
Rendertron: a Dockerized, headless version of Chrome. Rendertron runs as a standalone HTTP server and aims to support the new form of web app proposed by Google, PWA (Progressive Web Apps)
https://github.com/GoogleChrome/rendertron
A concise, elegant, extensible PHP scraping tool (crawler) based on phpQuery
https://github.com/jae-jae/QueryList
https://querylist.cc
Wu Di (邬迪): WooYun has fulfilled its mission | WooYun Memoirs (1)
https://mp.weixin.qq.com/s/GmRXqg4ay_Q5nr2x-0E41g