Scraping the Vendor List from WooYun

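A while ago I read on someone's blog about scraping the vendor list from WooYun and keeping it as a vendor database. Whenever a new vulnerability breaks out, that database (which you need to preprocess ahead of time, e.g. classifying sites by web container, by back-end database, and so on) comes into play: targeted crawling, targeted scanning and analysis, targeted exploitation, and so on.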

Today I'm only posting the Python code that scrapes the vendor list. It's fairly simple, so treat this as a note and a backup:

Method 1: using sgmllib.SGMLParser

#!/usr/bin/env python
#coding=utf-8
import urllib2
import sgmllib
class LinksParser(sgmllib.SGMLParser):
	urls = []
	def do_a(self, attrs):	# the function's name do_a can't be changed
		for name, value in attrs:
			if name == 'href' and value not in self.urls:
				if value.startswith('http'):
					self.urls.append(value)
					print value
					fp.write(value + '\n')
			else:
				continue
			return

def get_url(link):
	lParser = LinksParser()
	value = (urllib2.urlopen(link)).read()
	lParser.feed(value)
	lParser.close()

if __name__ == "__main__":
	fp = open("URL.list",'a')
	for x in xrange(1, 29):
		get_url('http://wooyun.org/corps/page/' + str(x))
	fp.close()
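
The script above only runs on Python 2: sgmllib was removed in Python 3 and urllib2 was folded into urllib.request. Here is a minimal sketch of the same idea for Python 3, assuming html.parser as the replacement parser. The names LinkCollector, extract_links and crawl are my own for illustration; the URL and the page range 1 to 28 are taken from the script above.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect unique absolute links from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []  # per-instance list, kept in first-seen order

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and value.startswith("http"):
                if value not in self.urls:
                    self.urls.append(value)


def extract_links(html, collector=None):
    """Feed one HTML document to a collector and return all links seen so far."""
    if collector is None:
        collector = LinkCollector()
    collector.feed(html)
    return collector.urls


def crawl(base="http://wooyun.org/corps/page/", first=1, last=28, out="URL.list"):
    """Fetch pages first..last and append every newly seen link to the file."""
    collector = LinkCollector()
    with open(out, "a", encoding="utf-8") as fp:
        for page in range(first, last + 1):
            already = len(collector.urls)
            raw = urlopen(base + str(page)).read()
            extract_links(raw.decode("utf-8", errors="replace"), collector)
            for url in collector.urls[already:]:
                fp.write(url + "\n")
```

Calling crawl() does the same job as the main block above. Keeping the parsing in extract_links means it can be tried on a local HTML string without touching the network.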

I have to say that Python code is short, sharp, and fully functional, which PHP can't match (or maybe I just can't write short, sharp PHP o(╯□╰)o).

Method 2: using HTMLParser

#!/usr/bin/env python
# coding=utf-8
import sys, urllib2, HTMLParser

class myparser(HTMLParser.HTMLParser):
	urls = []
	def __init__(self):
		HTMLParser.HTMLParser.__init__(self)
	def handle_starttag(self, tag, attrs):	# the name--"handle_starttag" can't be changed
		if (tag == 'a'):
			for name,value in attrs:
				if (name == 'href' and value.startswith('http') and value not in self.urls):
					self.urls.append(value)
					print value
					fp.write(value + '\n')

if len(sys.argv) >= 3 and sys.argv[1] == '-u':	# need both -u and the URL
	content = (urllib2.urlopen(sys.argv[2])).read()
	fp = open("URL.list",'a')
	con = myparser()
	con.feed(content)
	fp.close()
else:
	print 'Usage: %s -u http://domain.com' % sys.argv[0]

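This HTMLParser version only extracts the links from a single page (add one or two more checks and it could even serve as a hidden-link detector; just tweak it yourself). It can also be rewritten in the batch style above, which crawls several pages and saves the results to a file automatically.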
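
As a rough illustration of the "one or two more checks" idea, here is a sketch in Python 3. It rests on my own assumptions, not on anything in the scripts above: a link counts as suspicious only if it points to a different host than the page and the <a> tag itself carries an inline style that hides it. The names HIDDEN_MARKERS, SuspiciousLinkParser and find_suspicious_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumption: only these inline styles are treated as "hidden". Real pages can
# hide links in many other ways (external CSS, hidden parent elements, ...).
HIDDEN_MARKERS = ("display:none", "visibility:hidden")


class SuspiciousLinkParser(HTMLParser):
    """Flag links that point off-site and carry a hiding inline style."""

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host.lower()
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        # Normalise "display: none" and "DISPLAY:NONE" to "display:none"
        style = (attrs.get("style") or "").replace(" ", "").lower()
        host = urlsplit(href).hostname or ""  # empty for relative links
        external = bool(host) and host != self.page_host
        hidden = any(marker in style for marker in HIDDEN_MARKERS)
        if external and hidden and href not in self.suspicious:
            self.suspicious.append(href)


def find_suspicious_links(html, page_host):
    """Return the off-site links in html that are hidden by an inline style."""
    parser = SuspiciousLinkParser(page_host)
    parser.feed(html)
    return parser.suspicious
```

This is only a heuristic: it misses links hidden through a stylesheet or a hidden parent element, so it is a starting point rather than a complete detector.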

Then a shell script extracts the domain names from the list so that other tools can use them (for example, gathering information with theHarvester):

#!/bin/bash

for i in `cat URL.list`;do
	site=${i/www./}
	site=${site##http://}
	site=${site%%/*}
	echo "$site"
	mkdir -p "$site"	# -p: do not fail if the directory already exists
	( cd "$site" && python /path/to/theHarvester.py -d "$site" -l 500 -b all -f "$site.html" )
	sleep 30
done

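Only 2 of the 500-odd scraped URLs use https, so the shell script above doesn't handle that case; I simply changed those https URLs to http (for a domain name, HTTP and HTTPS make no difference, so I'm not going to fuss over it for now).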
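
If you would rather not edit the https entries, the domain extraction can also be done in Python, where the scheme doesn't matter. This is a minimal sketch using urllib.parse.urlsplit from the Python 3 standard library; the function names site_from_url and sites_from_lines are hypothetical. Note that it behaves slightly differently from the shell script: it only strips a leading "www.", it drops any port number, and it skips duplicate sites.

```python
from urllib.parse import urlsplit


def site_from_url(url):
    """Return the host part of a URL without a leading "www."."""
    host = urlsplit(url.strip()).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def sites_from_lines(lines):
    """Yield each distinct site once, in first-seen order; skip unparsable lines."""
    seen = set()
    for line in lines:
        site = site_from_url(line)
        if site and site not in seen:
            seen.add(site)
            yield site
```

Passing the open URL.list file object to sites_from_lines yields one site per iteration, which can then be handed to theHarvester the same way the shell loop does.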

3 comments on "Scraping the Vendor List from WooYun"

a-z said:

2016-12-07 14:23

Bug bounties
https://github.com/ngalongc/bug-bounty-reference
https://github.com/djadmin/awesome-bug-bounty

Security lists
https://github.com/zbetcheckin/Security_list

hi said:

2018-07-23 13:33

Photon: a lightweight web crawler that extracts URLs, files, endpoints, and more from websites
https://github.com/s0md3v/Photon

cc.py: a script that extracts the URLs of a given website from commoncrawl.org
https://github.com/si9int/cc.py

hi said:

2018-07-31 20:01

A collected list of bug bounty writeups from 2012 to 2018
https://pentester.land/list-of-bug-bounty-writeups.html
