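I picked up this idea from an expert's blog (http://www.leesec.com/archives/416). I reworked it to use sed for the processing (and to name each HTML file after its article title), and I skipped the PDF conversion because I find HTML files more comfortable to read. The resulting HTML files look like this:
The Bash script is as follows: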
#!/bin/bash
# Write the shared page header (everything above the article body) to /tmp/top.html.
echo '<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="zh-CN"><head profile="http://gmpg.org/xfn/11"> <meta http-equiv="content-type" content="text/html; charset=UTF-8"> <title>CRLF Injection漏洞的利用与实例分析 | WooYun知识库</title> <link rel="stylesheet" type="text/css" href="http://drops.wooyun.org/wp-content/themes/GZai/style.css"> <link rel="alternate" type="application/rss+xml" href="http://drops.wooyun.org/feed" title="WooYun知识库 RSS Feed"> <link rel="pingback" href="http://drops.wooyun.org/xmlrpc.php"> <meta name="keywords" content=""> <link rel="stylesheet" id="wp-easyarchives-css" href="http://drops.wooyun.org/wp-content/plugins/wp-easyarchives/css/wp-easyarchives.css?ver=3.1" type="text/css" media="screen"> <link rel="stylesheet" id="wp-recentcomments-css" href="http://drops.wooyun.org/wp-content/plugins/wp-recentcomments/css/wp-recentcomments.css?ver=2.2.5" type="text/css" media="screen"> <link rel="stylesheet" id="wp-pagenavi-css" href="http://drops.wooyun.org/wp-content/plugins/wp-pagenavi/pagenavi-css.css?ver=2.70" type="text/css" media="all"> <script type="text/javascript" src="http://drops.wooyun.org/wp-includes/js/jquery/jquery.js?ver=1.7.2"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-includes/js/comment-reply.js?ver=3.4.1"></script> <link rel="prev" title="用Burpsuite 来处理csrf token" href="http://drops.wooyun.org/tips/2460"> <link rel="next" title="Shodan搜索引擎介绍" href="http://drops.wooyun.org/tips/2469"> <link rel="canonical" href="http://drops.wooyun.org/papers/2466"> <link rel="shortlink" href="http://drops.wooyun.org/?p=2466"> <script type="text/javascript"> window._wp_rp_static_base_url = 'http://dtmvdvtzf8rz0.cloudfront.net/static/'; window._wp_rp_wp_ajax_url = "http://drops.wooyun.org/wp-admin/admin-ajax.php"; window._wp_rp_plugin_version = '2.7'; window._wp_rp_post_id = '2466'; window._wp_rp_num_rel_posts = '6'; </script> <link rel="stylesheet"
href="http://dtmvdvtzf8rz0.cloudfront.net/static/wp-rp-css/plain.css?version=2.7"> <style type="text/css"> .related_post_title { } ul.related_post { } ul.related_post li { } ul.related_post li a { } ul.related_post li img { }</style> <style type="text/css" media="all"> /* <![CDATA[ */ @import url("http://drops.wooyun.org/wp-content/plugins/wp-table-reloaded/css/plugin.css?ver=1.9.4"); @import url("http://drops.wooyun.org/wp-content/plugins/wp-table-reloaded/css/datatables.css?ver=1.9.4"); /* ]]> */ </style> <!-- Clean Archives Reloaded v3.2.0 | http://www.viper007bond.com/wordpress-plugins/clean-archives-reloaded/ --> <style type="text/css">.car-collapse .car-yearmonth { cursor: s-resize; } </style> <script type="text/javascript"> /* <![CDATA[ */ jQuery(document).ready(function() { jQuery('.car-collapse').find('.car-monthlisting').hide(); jQuery('.car-collapse').find('.car-monthlisting:first').show(); jQuery('.car-collapse').find('.car-yearmonth').click(function() { jQuery(this).next('ul').slideToggle('fast'); }); jQuery('.car-collapse').find('.car-toggler').click(function() { if ( '展开所有月份' == jQuery(this).text() ) { jQuery(this).parent('.car-container').find('.car-monthlisting').show(); jQuery(this).text('折叠所有月份'); } else { jQuery(this).parent('.car-container').find('.car-monthlisting').hide(); jQuery(this).text('展开所有月份'); } return false; }); }); /* ]]> */ </script> <link rel="stylesheet" type="text/css" href="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shCore.css?ver=3.0.83c"><link rel="stylesheet" type="text/css" href="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shThemeDefault.css?ver=3.0.83c"><style type="text/css" id="syntaxhighlighteranchor"></style> <script src="http://bdimg.share.baidu.com/static/js/logger.js?cdnversion=390049"></script><link href="http://bdimg.share.baidu.com/static/css/bdsstyle.css?cdnversion=20131219" rel="stylesheet" type="text/css"></head> <body> <div 
id="wrapper"> <div id="header"> <div id="header-top"> <h1 id="blog-title"> <a href="http://drops.wooyun.org/" title="WooYun知识库" rel="home" style="float:left;">WooYun知识库</a> </h1> </div><!-- #header-top --> <div id="wp-search-box"> <form method="get" id="searchform" action="http://drops.wooyun.org/"> <input type="text" class="textfield" size="25" name="s" id="s" placeholder="输入关键字进行搜索"> <input type="submit" class="button" name="submit" id="searchbutton" value="搜索"> </form> </div> <div id="blog-description">像一朵乌云一样成长 </div> <div class="fixed"></div> </div><!-- #header --> <div id="menubar"> <ul class="menu"> <li class=""><a href="http://drops.wooyun.org/" title="WooYun知识库">首页</a> </li> <li class="page_item"><a href="http://www.wooyun.org">WooYun</a></li> <li class="page_item"><a href="http://zone.wooyun.org">Zone</a></li> <li class="page_item page-item-7"><a href="http://drops.wooyun.org/newsend">投稿</a></li> </ul> </div><!-- #menubar --> <div id="container">' >/tmp/top.html 2>/dev/null

# Write the shared page footer (everything below the article body) to /tmp/bottom.html.
echo ' <div class="desc-share"> <p style="float:left">版权声明:<span style="color:red">未经授权禁止转载</span> <a href="http://drops.wooyun.org/author/爱小狐狸的小螃蟹" title="由 爱小狐狸的小螃蟹 发布" rel="author">爱小狐狸的小螃蟹</a>@<a href="http://drops.wooyun.org">乌云知识库</a></p> </div> <div id="nav-below"> <div class="nav-previous">上一篇:<a href="http://drops.wooyun.org/tips/2443" rel="prev">Mimikatz ON Metasploit</a></div> <div class="nav-next">下一篇:<a href="http://drops.wooyun.org/tips/2451" rel="next">逆向基础(八)</a></div> </div> </div><!-- container --> </div><!-- #wrapper --> <script> /* <![CDATA[ */ var rcGlobal = { serverUrl :"http://drops.wooyun.org", infoTemp :"%REVIEWER% 在 %POST%", loadingText :"正在加载", noCommentsText :"没有任何评论", newestText :"« 最新的", newerText :"« 上一页", olderText :"下一页 »", showContent :"1", external :"1", avatarSize :"32", avatarPosition :"left", anonymous :"匿名" }; /* ]]> */ </script> <script type="text/javascript"
src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shCore.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushAS3.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushBash.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushColdFusion.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushClojure.js?ver=20090602"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushCpp.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushCSharp.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushCss.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushDelphi.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushDiff.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushErlang.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushFSharp.js?ver=20091003"></script> <script type="text/javascript" 
src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushGroovy.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushJava.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushJavaFX.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushJScript.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushLatex.js?ver=20090613"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushMatlabKey.js?ver=20091209"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushObjC.js?ver=20091207"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPerl.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPhp.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPlain.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPowerShell.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPython.js?ver=3.0.83c"></script> <script type="text/javascript" 
src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushR.js?ver=20100919"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushRuby.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushScala.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushSql.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushVb.js?ver=3.0.83c"></script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushXml.js?ver=3.0.83c"></script> <script type="text/javascript"> (function(){ var corecss = document.createElement("link"); var themecss = document.createElement("link"); var corecssurl = "http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shCore.css?ver=3.0.83c"; if ( corecss.setAttribute ) { corecss.setAttribute( "rel", "stylesheet" ); corecss.setAttribute( "type", "text/css" ); corecss.setAttribute( "href", corecssurl ); } else { corecss.rel = "stylesheet"; corecss.href = corecssurl; } document.getElementsByTagName("head")[0].insertBefore( corecss, document.getElementById("syntaxhighlighteranchor") ); var themecssurl = "http://drops.wooyun.org/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shThemeDefault.css?ver=3.0.83c"; if ( themecss.setAttribute ) { themecss.setAttribute( "rel", "stylesheet" ); themecss.setAttribute( "type", "text/css" ); themecss.setAttribute( "href", themecssurl ); } else { themecss.rel = "stylesheet"; themecss.href = themecssurl; } //document.getElementById("syntaxhighlighteranchor").appendChild(themecss); 
document.getElementsByTagName("head")[0].insertBefore( themecss, document.getElementById("syntaxhighlighteranchor") ); })(); SyntaxHighlighter.config.strings.expandSource = "+ expand source"; SyntaxHighlighter.config.strings.help = "帮助"; SyntaxHighlighter.config.strings.alert = "SyntaxHighlighternn"; SyntaxHighlighter.config.strings.noBrush = "无法找到Brush:"; SyntaxHighlighter.config.strings.brushNotHtmlScript = "Brush不能设置 html-script选项"; SyntaxHighlighter.defaults["auto-links"] = false; SyntaxHighlighter.defaults["pad-line-numbers"] = false; SyntaxHighlighter.defaults["toolbar"] = false; SyntaxHighlighter.all(); </script> <script type="text/javascript" src="http://drops.wooyun.org/wp-content/plugins/wp-recentcomments/js/wp-recentcomments.js?ver=2.2.5"></script> </div> </body> </html>' >/tmp/bottom.html 2>/dev/null

# Start from an empty URL list (-f: no error if the file does not exist yet).
rm -f /tmp/drops.txt

# Category slugs; the last one is the URL-encoded form of 运维安全.
dir=(papers tips tools news web pentesting database binary '%e8%bf%90%e7%bb%b4%e5%ae%89%e5%85%a8')
dir_num=${#dir[@]}

# Collect "url title" pairs from the first 15 listing pages of each category.
for ((i = 0; i < dir_num; i++)); do
    for j in $(seq 1 15); do
        if [ "$j" -eq 1 ]; then
            page="http://drops.wooyun.org/category/${dir[i]}"
        else
            page="http://drops.wooyun.org/category/${dir[i]}/page/$j"
        fi
        curl "$page" | grep entry-title | perl -pe 's/.*href..(.+?)".*>(.*)..a>.*/$1 $2/g' >>/tmp/drops.txt
    done
done

# Download each article, keep only its content block, and wrap it in the
# shared header and footer. The file is named after the article title.
while read -r url title; do
    title=$(echo "$title" | tr ' </' '___')
    curl -s "$url" | sed -n '/<div id="content">/,/entry-tags/p' >/tmp/tmp.html
    cat /tmp/top.html /tmp/tmp.html /tmp/bottom.html >"/tmp/$title.html"
done </tmp/drops.txt
rm -f /tmp/tmp.html
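The two text-processing steps at the heart of the script can be tried offline. The following is a minimal sketch: the function names `extract_content` and `sanitize_title` and the sample page are made up for illustration, while the sed range and the tr character set are taken from the script.

```shell
#!/bin/bash
# Offline demo; the sample HTML below is invented and much simpler than a real page.

# Print everything from the line containing <div id="content"> through the
# first following line that mentions entry-tags (both lines included).
extract_content() {
    sed -n '/<div id="content">/,/entry-tags/p'
}

# Replace spaces, '<' and '/' with underscores so that a title can be used
# as a file name.
sanitize_title() {
    printf '%s\n' "$1" | tr ' </' '___'
}

sample='<div id="header">site header</div>
<div id="content">
<p>article body</p>
<span class="entry-tags">tags</span>
<div id="sidebar">sidebar</div>'

# Only the three lines from the content div through entry-tags survive.
printf '%s\n' "$sample" | extract_content

sanitize_title 'CRLF Injection a/b'
```

Because sed works line by line, this approach depends on the content div and the entry-tags marker sitting on separate lines of the downloaded page.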
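A few things to note. The file names need to be converted from UTF-8 to GBK: they are fine on Linux but show up garbled on Windows. The convmv command handles this (put all the HTML files in a newly created directory and a single convmv call is enough, no for loop needed).
If you want PDF files, the wkhtmltopdf tool can convert the downloaded HTML files to PDF (just add a for loop at the end of the shell script).
Depending on your needs, you can also add an archive command at the end of the script to pack the HTML files and send the archive to your own mailbox.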
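The three follow-up steps above could look roughly like this. It is a sketch under assumptions, not part of the original script: the directory `/tmp/drops_html`, the archive path, and all function names are placeholders, and it assumes the HTML files were moved into that directory first.

```shell
#!/bin/bash
# Post-processing sketch. The paths below are hypothetical placeholders.
html_dir=/tmp/drops_html
archive=/tmp/drops_html.tar.gz

# Convert UTF-8 file names to GBK with a single convmv call.
# Without --notest, convmv only reports what it would rename.
rename_to_gbk() {
    convmv -f utf8 -t gbk --notest "$1"/*
}

# Map page.html to page.pdf.
pdf_name() {
    printf '%s\n' "${1%.html}.pdf"
}

# Convert every HTML file in a directory to PDF with wkhtmltopdf.
html_to_pdf() {
    local f
    for f in "$1"/*.html; do
        [ -e "$f" ] || continue      # directory contains no HTML files
        wkhtmltopdf "$f" "$(pdf_name "$f")"
    done
}

# Pack a directory into a gzip-compressed tarball.
pack_html() {
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

if [ -d "$html_dir" ]; then
    if command -v wkhtmltopdf >/dev/null 2>&1; then
        html_to_pdf "$html_dir"
    fi
    # Rename before packing so the archive carries the GBK names.
    if command -v convmv >/dev/null 2>&1; then
        rename_to_gbk "$html_dir"
    fi
    pack_html "$html_dir" "$archive"
fi
```

Mailing the archive depends on which mail client is installed. With mutt, for example, an attachment is passed with `-a "$archive"` followed by `--` and the recipient address.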
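11 comments on "Batch download script for WooYun Knowledge Base articles"
It only grabs the titles and the layout; there is no content.
Thanks for the feedback! The download probably fails because the structure of the Knowledge Base pages has changed.
Adapting the code above should not be too complicated, though. I was too busy over the last couple of days to notice; I'll find some time this weekend to update the script and see.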
WooYun Knowledge Base mirror
https://drops.secquan.org/
Saving a web page as an image or PDF
http://www.03sec.com/3166.shtml
`
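First locate the PhantomJS installation path, then change into the examples directory under it. Running the following command is all it takes:
phantomjs rasterize.js <URL> <output file name>
If the output file name ends in .pdf, the page is saved as a PDF document; if it ends in .png, the page is saved as a screenshot.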
`
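Combined with the URL list produced by the download script, the rasterize.js command above could be run in a loop. This is only a sketch: it assumes `/tmp/drops.txt` still holds one "url title" pair per line, that the script is started from the PhantomJS examples directory, and that `/tmp/drops_pdf` (a made-up path) is where the PDFs should go. `output_name` is a hypothetical helper.

```shell
#!/bin/bash
# Render every page listed in /tmp/drops.txt to PDF with PhantomJS.
list=/tmp/drops.txt
out_dir=/tmp/drops_pdf        # hypothetical output directory
rasterize=rasterize.js        # assumed to be in the current directory

# Build the output path from a title; the .pdf suffix selects PDF output.
output_name() {
    printf '%s/%s.pdf\n' "$out_dir" "$(printf '%s' "$1" | tr ' </' '___')"
}

if command -v phantomjs >/dev/null 2>&1 && [ -f "$rasterize" ] && [ -f "$list" ]; then
    mkdir -p "$out_dir"
    while read -r url title; do
        # </dev/null keeps phantomjs from consuming the list on stdin.
        phantomjs "$rasterize" "$url" "$(output_name "$title")" </dev/null
    done <"$list"
fi
```

Changing the suffix in `output_name` to `.png` would produce screenshots instead.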
wooyunallbugs: wooyun_all_bugs historical archive data and images
https://github.com/m0l1ce/wooyunallbugs
http://www.loner.fm/bugs/
Turning a PhantomJS crawler into a service
http://jiayi.space/post/phantomjspa-chong-fu-wu-hua
A first look at Chrome Headless Browser
https://lightless.me/archives/first-glance-at-chrome-headless-browser.html
https://developers.google.com/web/updates/2017/04/headless-chrome
A first look at Chrome Headless, part two
https://lightless.me/archives/chrome-headless-second.html
https://github.com/wilson9x1/ChromeHeadlessInterface
https://github.com/minektur/chrome_remote_shell
https://duo.com/blog/driving-headless-chrome-with-python
Headless mode supported in Chrome 59, Firefox 56
https://www.bleepingcomputer.com/news/security/chrome-and-firefox-headless-modes-may-spur-new-adware-and-clickfraud-tactics/
Rendertron: a Dockerized, headless version of Chrome. Rendertron runs as a standalone HTTP server, and its goal is to support the new form of web app proposed by Google: PWA (Progressive Web Apps)
https://github.com/GoogleChrome/rendertron
A concise, elegant, and extensible PHP scraping tool (crawler), based on phpQuery
https://github.com/jae-jae/QueryList
https://querylist.cc
Wu Di (邬迪): WooYun has completed its mission | WooYun Memoirs (1)
https://mp.weixin.qq.com/s/GmRXqg4ay_Q5nr2x-0E41g