Batch download script for WooYun knowledge base articles


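I picked up the idea from an expert's blog (http://www.leesec.com/archives/416) and reworked it to use sed for the processing, naming each HTML file after its article title. I did not convert the pages to PDF, because I find HTML files more comfortable to read. The resulting HTML files look like this: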

[Image: drops_effect]

The Bash script is as follows:

#!/bin/bash

echo '<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="zh-CN"><head profile="http://gmpg.org/xfn/11">
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
    <title>CRLF Injection漏洞的利用与实例分析 | WooYun知识库</title>
    <link rel="stylesheet" type="text/css" href="http://drops.wooyun.org/wp-content/themes/GZai/style.css">
    <link rel="alternate" type="application/rss+xml" href="http://drops.wooyun.org/feed" title="WooYun知识库 RSS Feed">
    <link rel="pingback" href="http://drops.wooyun.org/xmlrpc.php">
    <meta name="keywords" content="">
<link rel="stylesheet" id="wp-easyarchives-css" href="http://drops.wooyun.org/wp-content/plugins/wp-easyarchives/css/wp-easyarchives.css?ver=3.1" type="text/css" media="screen">
<link rel="stylesheet" id="wp-recentcomments-css" href="http://drops.wooyun.org/wp-content/plugins/wp-recentcomments/css/wp-recentcomments.css?ver=2.2.5" type="text/css" media="screen">
<link rel="stylesheet" id="wp-pagenavi-css" href="http://drops.wooyun.org/wp-content/plugins/wp-pagenavi/pagenavi-css.css?ver=2.70" type="text/css" media="all">
<script type="text/javascript" src="http://drops.wooyun.org/wp-includes/js/jquery/jquery.js?ver=1.7.2"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-includes/js/comment-reply.js?ver=3.4.1"></script>
<link rel="prev" title="用Burpsuite 来处理csrf token" href="http://drops.wooyun.org/tips/2460">
<link rel="next" title="Shodan搜索引擎介绍" href="http://drops.wooyun.org/tips/2469">
<link rel="canonical" href="http://drops.wooyun.org/papers/2466">
<link rel="shortlink" href="http://drops.wooyun.org/?p=2466">
<script type="text/javascript">
	window._wp_rp_static_base_url = 'http://dtmvdvtzf8rz0.cloudfront.net/static/';
	window._wp_rp_wp_ajax_url = "http://drops.wooyun.org/wp-admin/admin-ajax.php";
	window._wp_rp_plugin_version = '2.7';
	window._wp_rp_post_id = '2466';
	window._wp_rp_num_rel_posts = '6';
</script>
<link rel="stylesheet" href="http://dtmvdvtzf8rz0.cloudfront.net/static/wp-rp-css/plain.css?version=2.7">
<style type="text/css">
.related_post_title {
}
ul.related_post {
}
ul.related_post li {
}
ul.related_post li a {
}
ul.related_post li img {
}</style>
<style type="text/css" media="all">
/* <![CDATA[ */
@import url("http://drops.wooyun.org/wp-content/plugins/wp-table-reloaded/css/plugin.css?ver=1.9.4");
@import url("http://drops.wooyun.org/wp-content/plugins/wp-table-reloaded/css/datatables.css?ver=1.9.4");
/* ]]> */
</style>
	<!-- Clean Archives Reloaded v3.2.0 | http://www.viper007bond.com/wordpress-plugins/clean-archives-reloaded/ -->
	<style type="text/css">.car-collapse .car-yearmonth { cursor: s-resize; } </style>
	<script type="text/javascript">
		/* <![CDATA[ */
			jQuery(document).ready(function() {
				jQuery('.car-collapse').find('.car-monthlisting').hide();
				jQuery('.car-collapse').find('.car-monthlisting:first').show();
				jQuery('.car-collapse').find('.car-yearmonth').click(function() {
					jQuery(this).next('ul').slideToggle('fast');
				});
				jQuery('.car-collapse').find('.car-toggler').click(function() {
					if ( '展开所有月份' == jQuery(this).text() ) {
						jQuery(this).parent('.car-container').find('.car-monthlisting').show();
						jQuery(this).text('折叠所有月份');
					}
					else {
						jQuery(this).parent('.car-container').find('.car-monthlisting').hide();
						jQuery(this).text('展开所有月份');
					}
					return false;
				});
			});
		/* ]]> */
	</script>

<link rel="stylesheet" type="text/css" href="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shCore.css?ver=3.0.83c"><link rel="stylesheet" type="text/css" href="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shThemeDefault.css?ver=3.0.83c"><style type="text/css" id="syntaxhighlighteranchor"></style>
<script src="http://bdimg.share.baidu.com/static/js/logger.js?cdnversion=390049"></script><link href="http://bdimg.share.baidu.com/static/css/bdsstyle.css?cdnversion=20131219" rel="stylesheet" type="text/css"></head>
<body>
<div id="wrapper">
    <div id="header">
        <div id="header-top">
        <h1 id="blog-title">
        <a href="http://drops.wooyun.org/" title="WooYun知识库" rel="home" style="float:left;">WooYun知识库</a>
        </h1>
        </div><!-- #header-top  -->
    <div id="wp-search-box">
<form method="get" id="searchform" action="http://drops.wooyun.org/">
    <input type="text" class="textfield" size="25" name="s" id="s" placeholder="输入关键字进行搜索">
    <input type="submit" class="button" name="submit" id="searchbutton" value="搜索">
</form>    </div>
        <div id="blog-description">像一朵乌云一样成长        </div>
        <div class="fixed"></div>
    </div><!--  #header -->

    <div id="menubar">
    <ul class="menu">
    <li class=""><a href="http://drops.wooyun.org/" title="WooYun知识库">首页</a>
    </li>
    <li class="page_item"><a href="http://www.wooyun.org">WooYun</a></li>
    <li class="page_item"><a href="http://zone.wooyun.org">Zone</a></li>
    <li class="page_item page-item-7"><a href="http://drops.wooyun.org/newsend">投稿</a></li>
    </ul>
    </div><!--  #menubar -->
<div id="container">'>/tmp/top.html 2>/dev/null

echo '            <div class="desc-share">
	            <p style="float:left">版权声明:<span style="color:red">未经授权禁止转载</span> <a href="http://drops.wooyun.org/author/爱小狐狸的小螃蟹" title="由 爱小狐狸的小螃蟹 发布" rel="author">爱小狐狸的小螃蟹</a>@<a href="http://drops.wooyun.org">乌云知识库</a></p>
				 </div>

            <div id="nav-below">
                <div class="nav-previous">上一篇:<a href="http://drops.wooyun.org/tips/2443" rel="prev">Mimikatz ON Metasploit</a></div>
                <div class="nav-next">下一篇:<a href="http://drops.wooyun.org/tips/2451" rel="next">逆向基础(八)</a></div>
            </div>
</div><!-- container -->
</div><!-- #wrapper -->
<script>
/* <![CDATA[ */
var rcGlobal = {
	serverUrl		:"http://drops.wooyun.org",
	infoTemp		:"%REVIEWER% 在 %POST%",
	loadingText		:"正在加载",
	noCommentsText	:"没有任何评论",
	newestText		:"&laquo; 最新的",
	newerText		:"&laquo; 上一页",
	olderText		:"下一页 &raquo;",
	showContent		:"1",
	external		:"1",
	avatarSize		:"32",
	avatarPosition	:"left",
	anonymous		:"匿名"
};
/* ]]> */
</script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shCore.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushAS3.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushBash.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushColdFusion.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushClojure.js?ver=20090602"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushCpp.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushCSharp.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushCss.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushDelphi.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushDiff.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushErlang.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushFSharp.js?ver=20091003"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushGroovy.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushJava.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushJavaFX.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushJScript.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushLatex.js?ver=20090613"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushMatlabKey.js?ver=20091209"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushObjC.js?ver=20091207"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPerl.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPhp.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPlain.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPowerShell.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPython.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushR.js?ver=20100919"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushRuby.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushScala.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushSql.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushVb.js?ver=3.0.83c"></script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushXml.js?ver=3.0.83c"></script>
<script type="text/javascript">
	(function(){
		var corecss = document.createElement("link");
		var themecss = document.createElement("link");
		var corecssurl = "http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shCore.css?ver=3.0.83c";
		if ( corecss.setAttribute ) {
				corecss.setAttribute( "rel", "stylesheet" );
				corecss.setAttribute( "type", "text/css" );
				corecss.setAttribute( "href", corecssurl );
		} else {
				corecss.rel = "stylesheet";
				corecss.href = corecssurl;
		}
		document.getElementsByTagName("head")[0].insertBefore( corecss, document.getElementById("syntaxhighlighteranchor") );
		var themecssurl = "http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shThemeDefault.css?ver=3.0.83c";
		if ( themecss.setAttribute ) {
				themecss.setAttribute( "rel", "stylesheet" );
				themecss.setAttribute( "type", "text/css" );
				themecss.setAttribute( "href", themecssurl );
		} else {
				themecss.rel = "stylesheet";
				themecss.href = themecssurl;
		}
		//document.getElementById("syntaxhighlighteranchor").appendChild(themecss);
		document.getElementsByTagName("head")[0].insertBefore( themecss, document.getElementById("syntaxhighlighteranchor") );
	})();
	SyntaxHighlighter.config.strings.expandSource = "+ expand source";
	SyntaxHighlighter.config.strings.help = "帮助";
	SyntaxHighlighter.config.strings.alert = "SyntaxHighlighternn";
	SyntaxHighlighter.config.strings.noBrush = "无法找到Brush:";
	SyntaxHighlighter.config.strings.brushNotHtmlScript = "Brush不能设置 html-script选项";
	SyntaxHighlighter.defaults["auto-links"] = false;
	SyntaxHighlighter.defaults["pad-line-numbers"] = false;
	SyntaxHighlighter.defaults["toolbar"] = false;
	SyntaxHighlighter.all();
</script>
<script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/wp-recentcomments/js/wp-recentcomments.js?ver=2.2.5"></script>
</div>
</body>
</html>'>/tmp/bottom.html 2>/dev/null

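# Start from a clean link list; -f avoids an error if the file does not exist yet
rm -f /tmp/drops.txt
# Category slugs; the last entry is the URL-encoded form of 运维安全
dir=(papers tips tools news web pentesting database binary '%e8%bf%90%e7%bb%b4%e5%ae%89%e5%85%a8')
dir_num=${#dir[@]}
for((i=0;i<dir_num;i++))
do
	# Walk up to 15 listing pages per category; page 1 has no /page/N suffix
	for j in $(seq 1 15)
	do
	if [ "$j" -eq 1 ]
	then
		curl "http://drops.wooyun.org/category/${dir[i]}" | grep entry-title | perl -pe 's/.*href..(.+?)".*>(.*)..a>.*/$1 $2/g' >>/tmp/drops.txt
	else
		curl "http://drops.wooyun.org/category/${dir[i]}/page/$j" | grep entry-title | perl -pe 's/.*href..(.+?)".*>(.*)..a>.*/$1 $2/g' >>/tmp/drops.txt
	fi
	done
done

# Each line of drops.txt is "<url> <title>"
while read -r url title
do
	# Replace spaces, "<" and "/" so the title can be used as a file name
	title=$(echo "$title" | tr ' </' '___')
	# Keep only the article body, from the content div down to the tag list
	curl -s "$url" | sed -n '/<div id="content">/,/entry-tags/p' >/tmp/tmp.html
	cat /tmp/top.html /tmp/tmp.html /tmp/bottom.html >"/tmp/$title.html"
done </tmp/drops.txt

rm -f /tmp/tmp.html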

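A few things to note: on Windows the file names have to be converted from UTF-8 to GBK, otherwise they show up garbled (on Linux they display fine). The convmv command handles this: put all the HTML files in a newly created directory and a single convmv call converts them, with no for loop needed.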
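
A minimal sketch of that conversion, assuming convmv is installed and the pages have already been moved into their own directory. The function name and the /tmp/drops_html path are illustrative and not part of the original script.

```shell
#!/bin/bash
# Sketch only: convert_names and the directory argument are hypothetical names.
convert_names() {
	local dir=$1
	# Without --notest, convmv only prints the renames it would perform
	convmv -f utf8 -t gbk -r "$dir"
	# With --notest, the renames are actually carried out
	convmv -f utf8 -t gbk -r --notest "$dir"
}

# Example call, assuming the pages were moved to /tmp/drops_html:
# convert_names /tmp/drops_html
```

Running the dry run first is a cautious choice: it lets you check the planned names before anything on disk changes.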

To get PDF files instead, the wkhtmltopdf tool can convert the downloaded HTML files to PDF; adding a for loop at the end of the shell script is enough.
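
Such a loop might look like the sketch below. It assumes wkhtmltopdf is installed and that the pages sit in a directory of their own, so the top.html and bottom.html templates are not converted; the function name and path are illustrative.

```shell
#!/bin/bash
# Sketch only: html_to_pdf and the directory argument are hypothetical names.
html_to_pdf() {
	local dir=$1 f
	for f in "$dir"/*.html; do
		# Skip the literal pattern when the directory holds no HTML files
		[ -e "$f" ] || continue
		# wkhtmltopdf takes the input page first and the output PDF second
		wkhtmltopdf "$f" "${f%.html}.pdf"
	done
}

# Example call, assuming the pages were moved to /tmp/drops_html:
# html_to_pdf /tmp/drops_html
```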

If needed, you can also add an archive command at the end of the script to pack up the HTML files and send the archive to your own mailbox.
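
One possible way to do this is sketched below. The original post does not name a mail client; this version assumes mutt is installed and already configured to send mail, and all names and paths are illustrative.

```shell
#!/bin/bash
# Sketch only: pack_and_mail and its arguments are hypothetical names.
pack_and_mail() {
	local dir=$1 archive=$2 addr=$3
	# -C switches into the directory so the archive holds relative paths;
	# keep the archive itself outside that directory
	tar -czf "$archive" -C "$dir" . || return 1
	# Assumption: mutt is the mail client; -a attaches the file and "--" ends the attachment list
	echo "WooYun drops archive attached" | mutt -s "drops archive" -a "$archive" -- "$addr"
}

# Example call with placeholder values:
# pack_and_mail /tmp/drops_html /tmp/drops.tar.gz you@example.com
```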


11 comments on "Batch download script for WooYun knowledge base articles"

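    • Thanks for the feedback! The scraping probably fails because the structure of the knowledge base pages has changed.
      Adapting this code should not be too complicated, though. I was too busy to notice over the past couple of days; I will find time this weekend to update the script.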

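  1. Saving a web page as an image or PDF
    http://www.03sec.com/3166.shtml
    `
    First find the PhantomJS installation path and go to the examples directory under it. Then run this command:

    phantomjs rasterize.js URL output-name

    If the output name ends in pdf, the page is saved as a PDF document; if it ends in png, the page is saved as a screenshot.
    `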
