=Start=
Original link: http://www.ar5ch.com/programming/527.arc
Keyword search script:
<?php
define('DB_SOURCE', 'D:\DATA');   // data directory; every subdirectory under it is traversed
define('CACHE_LIMIT', 67108863);  // maximum number of bytes read per block
define('RESULT_LIMIT', 1000);
define('TIME_LIMIT', 600);

$begin = microtime(true);
set_time_limit(TIME_LIMIT + 100);
ini_set("memory_limit", "-1");    // no memory limit
if (ob_get_level() > 0) ob_end_flush();
flush();

if ($argc < 2) exit("Usage: php search.php <keyword>\n");
$keyword = $argv[1];
$filelist = array();
get_file_list(DB_SOURCE . '\*');
$count = 0;
echo 'Search ' . $keyword . ' in ' . count($filelist) . " leak databases ...\n";
flush();

foreach ($filelist as $filepath) {
    $fp = fopen($filepath, 'r');
    if (!$fp) continue;
    $basename = basename($filepath);
    $fp_start_pos = 0;
    echo 'Searching ' . $filepath . " \n";
    while (!feof($fp)) {
        fseek($fp, $fp_start_pos);
        $content = fread($fp, CACHE_LIMIT);
        $content_length = strlen($content);
        if (!feof($fp)) {
            // Not the last block: cut it back to the last complete line.
            $last_newline = strrpos($content, "\n");
            if ($last_newline !== false) {
                $content_length = $last_newline + 1;
                $content = substr($content, 0, $content_length);
            }
        }
        $fp_start_pos += $content_length;
        $keyword_pos = 0;
        while (($keyword_pos = strpos($content, $keyword, $keyword_pos)) !== false) {
            // Find the boundaries of the line that contains the match.
            $start_pos = strrpos($content, "\n", -$content_length + $keyword_pos);
            if ($start_pos === false) $start_pos = 0;
            $end_pos = strpos($content, "\n", $keyword_pos);
            if ($end_pos === false) $end_pos = $content_length;
            echo trim(substr($content, $start_pos, $end_pos - $start_pos)) . "\n";
            flush();
            $keyword_pos = $end_pos;
            $count++;
            if ($count >= RESULT_LIMIT) break;
        }
        if ($count >= RESULT_LIMIT) break;
    }
    fclose($fp);
    if ($count >= RESULT_LIMIT) break;
    if ((microtime(true) - $begin) >= TIME_LIMIT) break;
}

if ($count >= RESULT_LIMIT) echo "Too many results, give up\n";
if ((microtime(true) - $begin) >= TIME_LIMIT) echo "Search time out, give up\n";
echo 'Search complete, get ' . $count . ' results, cost ' . (microtime(true) - $begin) . " seconds\n";
flush();

function get_file_list($dbsource)
{
    global $filelist;
    $current_file_list = glob($dbsource) ?: array();
    foreach ($current_file_list as $each) {
        // strpos() returns an int or false, never true, so compare against false.
        if (strpos($each, 'search.php') !== false) continue;
        if (is_file($each)) $filelist[] = $each;
        if (is_dir($each)) get_file_list($each . '\*');
    }
}
?>
A slightly modified version, meant to be accessed from a web page:
<?php
define('DB_SOURCE', 'd:\data');
define('CACHE_LIMIT', 4194304);   // maximum number of bytes read per block
define('RESULT_LIMIT', 1000);
define('TIME_LIMIT', 600);

$begin = microtime(true);
set_time_limit(TIME_LIMIT + 100);
if (ob_get_level() > 0) ob_end_flush();
echo <<<EOF
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=GBK" />
<title>Full text search</title>
</head><body>
<form method="get" action="">
<input type="text" name="keyword" />
<input type="submit" />
</form>
EOF;
flush();

$keyword = isset($_REQUEST['keyword']) ? trim($_REQUEST['keyword']) : '';
if (empty($keyword)) exit('</body></html>');
$filelist = array();
get_file_list(DB_SOURCE . '\*');
$count = 0;
echo 'Search ' . $keyword . ' in ' . count($filelist) . " leak databases ...<br />\r\n";
flush();

foreach ($filelist as $filepath) {
    $fp = fopen($filepath, 'r');
    if (!$fp) continue;
    $basename = basename($filepath);
    $filesize = filesize($filepath);
    $fp_start_pos = 0;
    while ($fp_start_pos !== $filesize) {
        fseek($fp, $fp_start_pos);
        $content = fread($fp, CACHE_LIMIT);
        $content_length = strlen($content);
        if ($content_length === 0) break;   // nothing left to read
        if ($fp_start_pos + $content_length !== $filesize) {
            // Not the last block: cut it back to the last complete line.
            $last_newline = strrpos($content, "\n");
            if ($last_newline !== false) {
                $content_length = $last_newline + 1;
                $content = substr($content, 0, $content_length);
            }
        }
        $fp_start_pos += $content_length;
        $keyword_pos = 0;
        while (($keyword_pos = strpos($content, $keyword, $keyword_pos)) !== false) {
            // Find the boundaries of the line that contains the match.
            $start_pos = strrpos($content, "\n", -$content_length + $keyword_pos);
            if ($start_pos === false) $start_pos = 0;
            $end_pos = strpos($content, "\n", $keyword_pos);
            if ($end_pos === false) $end_pos = $content_length;
            echo $basename . ' | ' . trim(substr($content, $start_pos, $end_pos - $start_pos)) . "<br />\r\n";
            flush();
            $keyword_pos = $end_pos;
            $count++;
            if ($count >= RESULT_LIMIT) break;
        }
        if ($count >= RESULT_LIMIT) break;
    }
    fclose($fp);
    if ($count >= RESULT_LIMIT) break;
    if ((microtime(true) - $begin) >= TIME_LIMIT) break;
}

if ($count >= RESULT_LIMIT) echo "Too many results, give up<br />\r\n";
if ((microtime(true) - $begin) >= TIME_LIMIT) echo "Search time out, give up<br />\r\n";
echo 'Search complete, get ' . $count . ' results, cost ' . (microtime(true) - $begin) . " seconds<br />\r\n";
echo '</body></html>';
flush();

function get_file_list($dbsource)
{
    global $filelist;
    $current_file_list = glob($dbsource) ?: array();
    foreach ($current_file_list as $each) {
        // strpos() returns an int or false, never true, so compare against false.
        if (strpos($each, 'search.php') !== false) continue;
        if (is_file($each)) $filelist[] = $each;
        if (is_dir($each)) get_file_list($each . '\*');
    }
}
?>
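The script itself is fairly simple, but it works well when searching large files or many files. I previously had a Python search script, but it was not thought through carefully: when the keyword falls across a line break, the search fails. There is still room for improvement, though, and I will rework it when I have time o(╯□╰)o
There is also a more straightforward approach: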
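The two scripts above read each file in fixed-size blocks and cut every block back to its last newline so that no line is split in the middle. As a minimal sketch of the same idea, the helper below keeps the unfinished tail of each block in a carry-over buffer instead of seeking back. The function name `search_file_chunked` and its parameters are hypothetical, it assumes a non-empty keyword, and it matches line by line for clarity rather than speed.

```php
<?php
/**
 * Return every line of $path that contains $keyword.
 * The file is read in blocks of $chunkSize bytes; the incomplete last line
 * of a block is carried over and prepended to the next block.
 * Hypothetical helper, not part of the original scripts.
 */
function search_file_chunked(string $path, string $keyword, int $chunkSize = 4194304): array
{
    $matches = [];
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        return $matches;
    }
    $carry = '';
    while (true) {
        $chunk = fread($fp, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break; // end of file (or read error)
        }
        $buffer = $carry . $chunk;
        $lastNewline = strrpos($buffer, "\n");
        if ($lastNewline === false) {
            // No complete line yet: keep accumulating.
            $carry = $buffer;
            continue;
        }
        // Everything after the last newline is an unfinished line.
        $carry = substr($buffer, $lastNewline + 1);
        foreach (explode("\n", substr($buffer, 0, $lastNewline)) as $line) {
            if (strpos($line, $keyword) !== false) {
                $matches[] = trim($line);
            }
        }
    }
    fclose($fp);
    // The final line may have no trailing newline.
    if ($carry !== '' && strpos($carry, $keyword) !== false) {
        $matches[] = trim($carry);
    }
    return $matches;
}
```

Called as `search_file_chunked($path, $keyword)`, it returns the matching lines as an array, which a caller could print in the same way the scripts above echo their results.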
<?php
@ini_set('memory_limit', '-1');
$start = microtime(true);
$files = getDirFiles("D:/data/");
for ($i = 0; $i < count($files); $i++) {
    loadfile($files[$i]);
}
echo (microtime(true) - $start) . "\n";

function loadfile($file)
{
    $fp = fopen($file, "r");
    if (!$fp) return;
    for ($i = 0; $i < 9999; $i++) {
        $temp = fread($fp, 1024 * 1024 * 10);   // read a 10 MB block
        if (strlen($temp) == 0) {
            break;
        }
        $temp2 = fgets($fp);                     // finish the line cut off at the block boundary
        if ($temp2 !== false) {
            $temp .= $temp2;
        }
        $index = strpos($temp, "keyword");       // offset of the first match, or false
    }
    fclose($fp);
}

function getDirFiles($path, $subDir = false, $addDir = false)
{
    $mydir = dir($path);
    $all = array();
    while (($file = $mydir->read()) !== false) {
        if ($file == "." || $file == "..") {
            continue;
        }
        if (is_dir($path . $file . "/")) {
            if ($addDir) {
                $all[] = $path . $file . "/";    // append the directory instead of overwriting $all
            }
            if ($subDir) {
                $temp = getDirFiles($path . $file . "/", $subDir, $addDir);
                $all = array_merge($all, $temp);
            }
        } else {
            $all[] = $path . $file;
        }
    }
    $mydir->close();
    return $all;
}
?>
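Both `get_file_list()` and `getDirFiles()` above walk the directory tree by hand. As a sketch of an alternative, PHP's SPL iterators can produce the same recursive file list. The function name `list_files_recursive` is hypothetical and not taken from the original scripts.

```php
<?php
/**
 * Return the paths of all regular files under $root, including subdirectories.
 * Hypothetical helper built on PHP's SPL directory iterators.
 */
function list_files_recursive(string $root): array
{
    $files = [];
    $iterator = new RecursiveIteratorIterator(
        // SKIP_DOTS leaves out the "." and ".." entries.
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $info) {
        if ($info->isFile()) {
            $files[] = $info->getPathname();
        }
    }
    sort($files); // stable, predictable order
    return $files;
}
```

The returned array could be passed to the search loop in place of `$filelist` or `$files`.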
=END=
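3 comments on “PHP scripts for keyword search”
A minimal PHP multi-process control library (single file). It does not depend on any extensions or other libraries, and it lets you easily use the system's multiple CPUs to run asynchronous tasks. We have wrapped the communication between the main process and the child processes, as well as logging and error handling.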
https://github.com/SegmentFault/SimpleFork
hawkeye – a tool for searching the filesystem for sensitive files (implemented in Golang)
https://github.com/Ice3man543/hawkeye
Keyword search implemented in Python
https://ixyzero.com/blog/archives/344.html
Implementing the path wildcards *, ** and ? in Java
https://jdkleo.iteye.com/blog/2392642
Finding the jar file that contains a given class
https://lihong11.iteye.com/blog/1936694