=Start=
Original link: http://www.ar5ch.com/programming/527.arc
Keyword search script:
<?php
define ('DB_SOURCE', 'D:\\DATA'); // data directory; every subdirectory under it is traversed
define ('CACHE_LIMIT', 67108863);
define ('RESULT_LIMIT', 1000);
define ('TIME_LIMIT', 600);
$begin = microtime(true);
set_time_limit(TIME_LIMIT + 100);
ini_set("memory_limit","-1");  // no memory limit
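if (ob_get_level() > 0) ob_end_flush();  // only flush if an output buffer is active
// The keyword comes from the command line: php search.php <keyword>
$keyword = isset($argv[1]) ? $argv[1] : '';
if ($keyword === '') exit("Usage: php search.php <keyword>\n");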
$filelist = array();
get_file_list(DB_SOURCE . '\*');
$count = 0;
echo 'Search ' . $keyword . ' in ' . count($filelist) . " leak databases ...\n";
flush();
foreach ($filelist as $filepath) {
    $fp = fopen($filepath, 'r');
    if (!$fp) continue;
    $basename = basename($filepath);
    $fp_start_pos = 0;
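    echo 'Searching ' . $filepath . "\n";
    while (!feof($fp)) {
        fseek($fp, $fp_start_pos);
        $content = fread($fp, CACHE_LIMIT);
        if ($content === false || $content === '') break;
        $content_length = strlen($content);
        $last_newline = strrpos($content, "\n");
        // Unless this is the final chunk, cut back to the last newline so no line is split
        if (!feof($fp) && $last_newline !== false) {
            $content_length = $last_newline + 1;
            $content = substr($content, 0, $content_length);
        }
        $fp_start_pos += $content_length;
        $keyword_pos = 0;
        while (($keyword_pos = strpos($content, $keyword, $keyword_pos)) !== false)
        {
            // Locate the newlines on either side of the match to extract the whole line
            $start_pos = strrpos($content, "\n", -$content_length + $keyword_pos);
            if ($start_pos === false) $start_pos = 0;
            $end_pos = strpos($content, "\n", $keyword_pos);
            if ($end_pos === false) $end_pos = $content_length;
            echo trim(substr($content, $start_pos, $end_pos - $start_pos)) . "\n";
            flush();
            $keyword_pos = $end_pos;
            $count++;
            if ($count >= RESULT_LIMIT) break;
        }
        if ($count >= RESULT_LIMIT) break;
    }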
    fclose($fp);
    if ($count >= RESULT_LIMIT) break;
    if ((microtime(true) - $begin) >= TIME_LIMIT) break;
}
if ($count >= RESULT_LIMIT)
    echo "Too many results, give up\n";
if ((microtime(true) - $begin) >= TIME_LIMIT)
    echo "Search time out, give up\n";
echo 'Search complete, get ' . $count . ' results, cost ' . (microtime(true) - $begin) . " seconds\n";
flush();
function get_file_list($dbsource) {
    global $filelist;
    $current_file_list = glob($dbsource);
    foreach ($current_file_list as $each) {
        if (strpos($each, 'search.php') !== false)  // strpos() never returns true; skip the script itself
            continue;
        if (is_file($each))
            $filelist[] = $each;
        if (is_dir($each))
            get_file_list($each . '\*');
    }
}
?>
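The core idea of the script above is to read a large file in big chunks and cut each chunk back to the last newline, so that no line is split between two reads. A minimal sketch of just that step, as a generator, might look like the following. The function name `read_line_aligned_chunks` and its parameters are hypothetical and not part of the original script; instead of seeking back as the original does, this version carries the unfinished line over to the next chunk.

```php
<?php
// Hypothetical helper (not from the original script): yields successive
// chunks of a file, each ending on a line boundary.
function read_line_aligned_chunks($path, $chunk_size)
{
    $fp = fopen($path, 'rb');
    if (!$fp) {
        return;
    }
    $carry = '';  // unfinished line left over from the previous read
    while (!feof($fp)) {
        $data = fread($fp, $chunk_size);
        if ($data === false || $data === '') {
            break;
        }
        $buffer = $carry . $data;
        $cut = strrpos($buffer, "\n");
        if ($cut === false) {
            // No newline in the buffer yet: keep accumulating.
            $carry = $buffer;
            continue;
        }
        yield substr($buffer, 0, $cut + 1);
        $carry = (string) substr($buffer, $cut + 1);
    }
    fclose($fp);
    if ($carry !== '') {
        yield $carry;  // final line without a trailing newline
    }
}
```

Each yielded chunk can then be scanned with `strpos()` exactly as in the script, without worrying that a matching line is cut in half.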
A slightly modified version, meant to be accessed from a web page:
<?php
define ('DB_SOURCE', 'd:\data');
define ('CACHE_LIMIT', 4194304);
define ('RESULT_LIMIT', 1000);
define ('TIME_LIMIT', 600);
$begin = microtime(true);
set_time_limit(TIME_LIMIT + 100);
ob_end_flush();
echo <<< EOF
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=GBK" />
<title>Full text search</title>
</head><body>
<form method="get" action="">
<input type="text" name="keyword" />
<input type="submit" />
</form>
EOF;
flush();
$keyword = isset($_REQUEST['keyword']) ? trim($_REQUEST['keyword']) : '';
if ($keyword === '') exit('</body></html>');  // empty() would also reject the keyword "0"
$filelist = array();
get_file_list(DB_SOURCE . '\*');
$count = 0;
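// Escape the keyword before echoing it back; byte-wise escaping is safe for GBK
// because the HTML special characters never occur as GBK trail bytes
echo 'Search ' . htmlspecialchars($keyword, ENT_QUOTES, 'ISO-8859-1') . ' in ' . count($filelist) . " leak databases ...<br />\r\n";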
flush();
foreach ($filelist as $filepath) {
	$fp = fopen($filepath, 'r');
	if (!$fp) continue;
	$basename = basename($filepath);
	$filesize = filesize($filepath);
	$fp_start_pos = 0;
	while($fp_start_pos !== $filesize) {
		fseek($fp, $fp_start_pos);
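		$content = fread($fp, CACHE_LIMIT);
		if ($content === false || $content === '') break;
		$content_length = strlen($content);
		if ($fp_start_pos + $content_length !== $filesize) {
			// Not the last chunk: cut back to the last newline so no line is split
			$last_newline = strrpos($content, "\n");
			if ($last_newline !== false) {
				$content_length = $last_newline + 1;
				$content = substr($content, 0, $content_length);
			}
		}
		$fp_start_pos += $content_length;
		$keyword_pos = 0;
		while (($keyword_pos = strpos($content, $keyword, $keyword_pos)) !== false)
		{
			$start_pos = strrpos($content, "\n", -$content_length + $keyword_pos);
			if ($start_pos === false) $start_pos = 0;
			$end_pos = strpos($content, "\n", $keyword_pos);
			if ($end_pos === FALSE) $end_pos = $content_length;
			// Escape the matched line so data containing markup cannot inject HTML
			echo htmlspecialchars($basename . ' | ' . trim(substr($content, $start_pos, $end_pos - $start_pos)), ENT_QUOTES, 'ISO-8859-1') . "<br />\r\n";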
			flush();
			$keyword_pos = $end_pos;
			$count++;
			if ($count >= RESULT_LIMIT) break;
		}
		if ($count >= RESULT_LIMIT) break;
	}
	fclose($fp);
	if ($count >= RESULT_LIMIT) break;
	if ((microtime(true) - $begin) >= TIME_LIMIT) break;
}
if ($count >= RESULT_LIMIT)
	echo "Too many results, give up<br />\r\n";
if ((microtime(true) - $begin) >= TIME_LIMIT)
	echo "Search time out, give up<br />\r\n";
echo 'Search complete, get ' . $count . ' results, cost ' . (microtime(true) - $begin) . " seconds<br />\r\n";
echo '</body></html>';
flush();
function get_file_list($dbsource) {
	global $filelist;
	$current_file_list = glob($dbsource);
	foreach ($current_file_list as $each) {
		if (strpos($each, 'search.php') !== false)  // strpos() never returns true; skip the script itself
			continue;
		if (is_file($each))
			$filelist[] = $each;
		if (is_dir($each))
			get_file_list($each . '\*');
	}
}
?>
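Both versions above turn a match position into a full line by looking for the nearest newline on each side of the match: `strrpos()` with a negative offset for the left side, `strpos()` for the right side. A minimal sketch of that step in isolation follows. The helper name `line_containing` is hypothetical, and it assumes `$pos` is a valid offset inside `$content` that does not point at a newline.

```php
<?php
// Hypothetical helper: returns the whole line of $content that contains
// byte offset $pos (assumes 0 <= $pos < strlen($content) and that
// $content[$pos] is not a newline).
function line_containing($content, $pos)
{
    $len = strlen($content);
    if ($len === 0) {
        return '';
    }
    // A negative offset makes strrpos() ignore everything to the right of
    // $pos, so this finds the last "\n" at or before $pos.
    $start = strrpos($content, "\n", $pos - $len);
    $start = ($start === false) ? 0 : $start + 1;
    // First "\n" at or after $pos; if there is none, the line runs to the end.
    $end = strpos($content, "\n", $pos);
    if ($end === false) {
        $end = $len;
    }
    return substr($content, $start, $end - $start);
}
```

Handling the `false` results explicitly covers matches on the first line (no newline to the left) and on a last line that has no trailing newline.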
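The script itself is fairly simple, but it works well when searching large files or many files. I previously had a Python script for this, but it was not thought through carefully enough: when the keyword appeared across a line break, the search failed. There is still room for improvement, though, and I will rework it when I have time o(╯□╰)o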
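Missing a match that straddles a boundary is a general hazard whenever a file is scanned piece by piece. One common way around it, shown here only as a sketch and not as what either script does, is to keep the last `strlen($keyword) - 1` bytes of each block and prepend them to the next one. The function name `count_matches_with_overlap` and its parameters are hypothetical.

```php
<?php
// Hypothetical helper: counts occurrences of $keyword in a file that is read
// in fixed-size blocks. The tail of each block is carried over so that a
// match split across two blocks is still seen. The tail is shorter than the
// keyword, so it can never contain a complete match that was already counted.
// substr_count() counts non-overlapping occurrences, so for a keyword that
// overlaps itself (e.g. "aa") the total may differ from a whole-file count.
function count_matches_with_overlap($path, $keyword, $block_size)
{
    $klen = strlen($keyword);
    if ($klen === 0) {
        return 0;
    }
    $fp = fopen($path, 'rb');
    if (!$fp) {
        return 0;
    }
    $count = 0;
    $tail = '';
    while (!feof($fp)) {
        $data = fread($fp, $block_size);
        if ($data === false || $data === '') {
            break;
        }
        $buffer = $tail . $data;
        $count += substr_count($buffer, $keyword);
        $tail = ($klen > 1) ? (string) substr($buffer, -($klen - 1)) : '';
    }
    fclose($fp);
    return $count;
}
```

This works on raw bytes and ignores line boundaries entirely, so it suits "does the keyword occur, and how often" more than "print the matching line".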
There is also a more straightforward approach:
<?php
@ini_set('memory_limit', '-1');
$start=microtime(true);
$files=getDirFiles("D:/data/");
for($i=0; $i<count($files); $i++) {
	loadfile($files[$i]);
}
echo (microtime(true)-$start) . "\n";
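function loadfile($file) {
	$fp=fopen($file,"r");
	if(!$fp) {
		return;
	}
	// Read in 10 MB blocks; fgets() then completes the line cut off at the block boundary
	for($i=0; $i<9999; $i++) {
		$temp=fread($fp,1024*1024*10);
		if($temp===false || strlen($temp)==0) {
			break;
		}
		$temp2=fgets($fp);
		if($temp2!==false) {
			$temp.=$temp2;
		}
		// "keyword" is a placeholder for the search term; the result is unused here (timing only)
		$index=strpos($temp,"keyword");
	}
	fclose($fp);
}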
function getDirFiles($path,$subDir=false,$addDir=false) {
	$mydir=dir($path);
	$all=array();
	while( ($file=$mydir->read())!==false){
		if($file=="." || $file==".."){
			continue;
		}
		if ( is_dir( $path.$file ."/") ) {
			if($addDir) {
				$all[]=$path.$file ."/";  // append; plain assignment would overwrite the array
			}
			if($subDir) {
				$temp=getDirFiles( $path.$file ."/" ,$subDir ,$addDir );
				$all=array_merge($all,$temp);
			}
		} else {
			$all[]= $path.$file ;
		}
	}
	$mydir->close();
	return $all;
}
?>
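The script above only times the reads and discards the `strpos()` result. To show how the same "fread a block, then fgets the rest of the line" trick could be used to collect matching lines, here is a minimal sketch. The function name `find_lines_in_file` and its parameters are hypothetical, and splitting each block with `explode()` is chosen for clarity rather than speed.

```php
<?php
// Hypothetical helper: reads $block_size bytes, lets fgets() finish the line
// that the block cut off, and collects every line containing $keyword.
function find_lines_in_file($path, $keyword, $block_size)
{
    $matches = array();
    if ($keyword === '') {
        return $matches;
    }
    $fp = fopen($path, 'rb');
    if (!$fp) {
        return $matches;
    }
    while (!feof($fp)) {
        $block = fread($fp, $block_size);
        if ($block === false || $block === '') {
            break;
        }
        // fgets() reads up to and including the next newline, so the block
        // always ends on a line boundary (or at end of file).
        $rest = fgets($fp);
        if ($rest !== false) {
            $block .= $rest;
        }
        foreach (explode("\n", $block) as $line) {
            if (strpos($line, $keyword) !== false) {
                $matches[] = rtrim($line, "\r");
            }
        }
    }
    fclose($fp);
    return $matches;
}
```

Because every block is extended to the end of its last line, a line is never split between two blocks, which is what makes the per-line `strpos()` check safe.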
=END=
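3 comments on “A PHP script for keyword search”
A minimal PHP multi-process control library (single file). It does not depend on any extensions or other libraries, and makes it easy to use multiple CPUs for asynchronous tasks. It wraps the communication between the main process and child processes, as well as logging and error handling.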
https://github.com/SegmentFault/SimpleFork
hawkeye – a tool that searches the file system for sensitive files (implemented in Golang)
https://github.com/Ice3man543/hawkeye
Keyword search implemented in Python
https://ixyzero.com/blog/archives/344.html
Implementing the path wildcards *, ** and ? in Java
https://jdkleo.iteye.com/blog/2392642
Finding which jar contains a given class
https://lihong11.iteye.com/blog/1936694