=Start=
Original link: http://www.ar5ch.com/programming/527.arc
Keyword search script:
<?php
define('DB_SOURCE', 'D:\\DATA');   // Data directory; all subdirectories under it are traversed
define('CACHE_LIMIT', 67108863);   // Maximum number of bytes read per chunk
define('RESULT_LIMIT', 1000);      // Stop after this many matches
define('TIME_LIMIT', 600);         // Stop after this many seconds
$begin = microtime(true);
set_time_limit(TIME_LIMIT + 100);
ini_set("memory_limit", "-1");     // No memory limit
if (ob_get_level() > 0) ob_end_flush();   // Only flush an output buffer that actually exists
if (!isset($argv[1]) || $argv[1] === '') exit("Usage: php search.php <keyword>\n");
$keyword = $argv[1];
$filelist = array();
get_file_list(DB_SOURCE . '\\*');
$count = 0;
echo 'Search ' . $keyword . ' in ' . count($filelist) . " leak databases ...\n";
flush();
foreach ($filelist as $filepath) {
    $fp = fopen($filepath, 'r');
    if (!$fp) continue;
    $filesize = filesize($filepath);
    $fp_start_pos = 0;
    echo 'Searching ' . $filepath . "\n";
    while ($fp_start_pos < $filesize) {
        fseek($fp, $fp_start_pos);
        $content = fread($fp, CACHE_LIMIT);
        if ($content === false || $content === '') break;
        $content_length = strlen($content);
        // Unless this is the last chunk, cut it back to the last complete line
        $last_newline = strrpos($content, "\n");
        if ($fp_start_pos + $content_length < $filesize && $last_newline !== false) {
            $content_length = $last_newline + 1;
            $content = substr($content, 0, $content_length);
        }
        $fp_start_pos += $content_length;
        $keyword_pos = 0;
        while (($keyword_pos = strpos($content, $keyword, $keyword_pos)) !== false) {
            // Locate the boundaries of the line that contains the match
            $start_pos = strrpos($content, "\n", $keyword_pos - $content_length);
            $start_pos = ($start_pos === false) ? 0 : $start_pos + 1;
            $end_pos = strpos($content, "\n", $keyword_pos);
            if ($end_pos === false) $end_pos = $content_length;
            echo trim(substr($content, $start_pos, $end_pos - $start_pos)) . "\n";
            flush();
            $keyword_pos = $end_pos;   // Continue searching from the next line
            $count++;
            if ($count >= RESULT_LIMIT) break;
        }
        if ($count >= RESULT_LIMIT) break;
    }
    fclose($fp);
    if ($count >= RESULT_LIMIT) break;
    if ((microtime(true) - $begin) >= TIME_LIMIT) break;
}
if ($count >= RESULT_LIMIT)
    echo "Too many results, give up\n";
if ((microtime(true) - $begin) >= TIME_LIMIT)
    echo "Search time out, give up\n";
echo 'Search complete, get ' . $count . ' results, cost ' . (microtime(true) - $begin) . " seconds\n";
flush();

// Recursively collect every file matched by the glob pattern into $filelist
function get_file_list($dbsource) {
    global $filelist;
    $current_file_list = glob($dbsource);
    if ($current_file_list === false) return;
    foreach ($current_file_list as $each) {
        if (strpos($each, 'search.php') !== false)   // Skip this script itself
            continue;
        if (is_file($each))
            $filelist[] = $each;
        if (is_dir($each))
            get_file_list($each . '\\*');
    }
}
?>
After some modification, the version accessed from a web page is as follows:
<?php
define('DB_SOURCE', 'd:\\data');   // Data directory; all subdirectories under it are traversed
define('CACHE_LIMIT', 4194304);    // Maximum number of bytes read per chunk (4 MB)
define('RESULT_LIMIT', 1000);      // Stop after this many matches
define('TIME_LIMIT', 600);         // Stop after this many seconds
$begin = microtime(true);
set_time_limit(TIME_LIMIT + 100);
if (ob_get_level() > 0) ob_end_flush();   // Only flush an output buffer that actually exists
echo <<<EOF
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=GBK" />
<title>Full text search</title>
</head><body>
<form method="get" action="">
<input type="text" name="keyword" />
<input type="submit" />
</form>
EOF;
flush();
$keyword = (isset($_REQUEST['keyword']) && is_string($_REQUEST['keyword'])) ? trim($_REQUEST['keyword']) : '';
if ($keyword === '') exit('</body></html>');
$filelist = array();
get_file_list(DB_SOURCE . '\\*');
$count = 0;
echo 'Search ' . h($keyword) . ' in ' . count($filelist) . " leak databases ...<br />\r\n";
flush();
foreach ($filelist as $filepath) {
    $fp = fopen($filepath, 'r');
    if (!$fp) continue;
    $basename = basename($filepath);
    $filesize = filesize($filepath);
    $fp_start_pos = 0;
    while ($fp_start_pos < $filesize) {
        fseek($fp, $fp_start_pos);
        $content = fread($fp, CACHE_LIMIT);
        if ($content === false || $content === '') break;
        $content_length = strlen($content);
        // Unless this is the last chunk, cut it back to the last complete line
        $last_newline = strrpos($content, "\n");
        if ($fp_start_pos + $content_length < $filesize && $last_newline !== false) {
            $content_length = $last_newline + 1;
            $content = substr($content, 0, $content_length);
        }
        $fp_start_pos += $content_length;
        $keyword_pos = 0;
        while (($keyword_pos = strpos($content, $keyword, $keyword_pos)) !== false) {
            // Locate the boundaries of the line that contains the match
            $start_pos = strrpos($content, "\n", $keyword_pos - $content_length);
            $start_pos = ($start_pos === false) ? 0 : $start_pos + 1;
            $end_pos = strpos($content, "\n", $keyword_pos);
            if ($end_pos === false) $end_pos = $content_length;
            echo h($basename) . ' | ' . h(trim(substr($content, $start_pos, $end_pos - $start_pos))) . "<br />\r\n";
            flush();
            $keyword_pos = $end_pos;   // Continue searching from the next line
            $count++;
            if ($count >= RESULT_LIMIT) break;
        }
        if ($count >= RESULT_LIMIT) break;
    }
    fclose($fp);
    if ($count >= RESULT_LIMIT) break;
    if ((microtime(true) - $begin) >= TIME_LIMIT) break;
}
if ($count >= RESULT_LIMIT)
    echo "Too many results, give up<br />\r\n";
if ((microtime(true) - $begin) >= TIME_LIMIT)
    echo "Search time out, give up<br />\r\n";
echo 'Search complete, get ' . $count . ' results, cost ' . (microtime(true) - $begin) . " seconds<br />\r\n";
echo '</body></html>';
flush();

// Escape HTML special characters byte by byte, so GBK text passes through unchanged
function h($text) {
    return htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');
}

// Recursively collect every file matched by the glob pattern into $filelist
function get_file_list($dbsource) {
    global $filelist;
    $current_file_list = glob($dbsource);
    if ($current_file_list === false) return;
    foreach ($current_file_list as $each) {
        if (strpos($each, 'search.php') !== false)   // Skip this script itself
            continue;
        if (is_file($each))
            $filelist[] = $each;
        if (is_dir($each))
            get_file_list($each . '\\*');
    }
}
?>
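The script itself is fairly simple, but it works well when searching large files and many files. I previously had a Python search script, but it was not thought through carefully: when the keyword is split across a line break, it cannot complete the search. There is still room for improvement, and I will revise it when I have time o(╯□╰)o
There is also a more straightforward approach: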
<?php
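@ini_set('memory_limit', '-1');
$keyword = 'keyword';   // Placeholder: replace with the string to search for
$start = microtime(true);
$files = getDirFiles("D:/data/");
foreach ($files as $file) {
    loadfile($file, $keyword);
}
echo (microtime(true) - $start) . "\n";

// Read the file in 10 MB chunks, extend each chunk to the end of the current
// line with fgets(), and print the file name once the keyword is found.
function loadfile($file, $keyword) {
    $fp = fopen($file, "r");
    if (!$fp) return;
    while (!feof($fp)) {
        $temp = fread($fp, 1024 * 1024 * 10);
        if ($temp === false || strlen($temp) == 0) {
            break;
        }
        $temp2 = fgets($fp);   // Rest of the line that the chunk ended in
        if ($temp2 !== false) {
            $temp .= $temp2;
        }
        if (strpos($temp, $keyword) !== false) {
            echo $file . "\n";
            break;
        }
    }
    fclose($fp);
}

// List the files under $path; recurse into subdirectories when $subDir is true,
// and include the directory paths themselves when $addDir is true.
function getDirFiles($path, $subDir = false, $addDir = false) {
    $all = array();
    $mydir = dir($path);
    if (!$mydir) return $all;
    while (($file = $mydir->read()) !== false) {
        if ($file == "." || $file == "..") {
            continue;
        }
        if (is_dir($path . $file . "/")) {
            if ($addDir) {
                $all[] = $path . $file . "/";
            }
            if ($subDir) {
                $temp = getDirFiles($path . $file . "/", $subDir, $addDir);
                $all = array_merge($all, $temp);
            }
        } else {
            $all[] = $path . $file;
        }
    }
    $mydir->close();
    return $all;
}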
?>
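The scripts above all read a file in large chunks and make sure a chunk never ends in the middle of a line. As a rough sketch of that same idea, the generator below keeps the unfinished tail of each chunk and prepends it to the next one instead of seeking back. The name search_lines() and the $chunkSize parameter are made up for this illustration and do not appear in the original scripts. It matches line by line, so a keyword that spans a line break is still not found.

```php
<?php
// Hypothetical helper: yield every line of $path that contains $keyword.
// The file is read in chunks of $chunkSize bytes; the incomplete last line of
// each chunk is carried over and completed by the next chunk.
function search_lines(string $path, string $keyword, int $chunkSize = 4194304): Generator
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        return;
    }
    $carry = '';   // Incomplete last line of the previous chunk
    while (!feof($fp)) {
        $chunk = fread($fp, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $buffer = $carry . $chunk;
        $cut = strrpos($buffer, "\n");
        if ($cut === false) {
            $carry = $buffer;   // No complete line yet, keep reading
            continue;
        }
        $carry = substr($buffer, $cut + 1);
        foreach (explode("\n", substr($buffer, 0, $cut)) as $line) {
            if (strpos($line, $keyword) !== false) {
                yield rtrim($line, "\r");
            }
        }
    }
    fclose($fp);
    // The last line may have no trailing newline
    if ($carry !== '' && strpos($carry, $keyword) !== false) {
        yield rtrim($carry, "\r");
    }
}
```

It could be used as `foreach (search_lines($file, $keyword) as $line) { echo $line, "\n"; }`, where $file and $keyword are supplied by the caller. Carrying the tail forward avoids re-reading bytes after a seek, at the cost of holding one extra partial line in memory.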
=END=
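3 comments on "A PHP Script for Keyword Search"
A minimal PHP multi-process control library (single file). It does not depend on any extensions or other libraries, and lets you conveniently use the system's multiple CPUs to run asynchronous tasks. It wraps the communication between the main process and child processes, as well as log output and error handling.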
https://github.com/SegmentFault/SimpleFork
hawkeye: a tool for searching the file system for sensitive files (implemented in Golang)
https://github.com/Ice3man543/hawkeye
Keyword search implemented in Python
https://ixyzero.com/blog/archives/344.html
Implementing the path wildcards *, **, and ? in Java
https://jdkleo.iteye.com/blog/2392642
Finding which jar package contains a given class
https://lihong11.iteye.com/blog/1936694