1. Extracting the text between two markers (multi-line)
Given the following text:
    fdsjhgjhg
    fdshkjhk
    Start
    Good Morning
    Hello World
    End
    dashjkhjk
    dsfjkhk
I need to use Python to get the content between "Start" and "End" and write it to a result file.
Solution 1:

    with open('/path/to/input') as infile, open('/path/to/output', 'w') as outfile:
        copy = False  # True while we are between the two markers
        for line in infile:
            if line.strip() == "Start":
                copy = True
            elif line.strip() == "End":
                copy = False
            elif copy:
                outfile.write(line)
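The flag-based loop in Solution 1 can also be tried without any files. The following is a minimal sketch, assuming the sample text has one item per line; `lines_between` is a hypothetical helper name, not something from the original answer.

```python
def lines_between(lines, start="Start", end="End"):
    """Yield the lines that sit strictly between a start line and an end line."""
    copying = False  # True while we are inside a Start ... End block
    for line in lines:
        marker = line.strip()
        if marker == start:
            copying = True
        elif marker == end:
            copying = False
        elif copying:
            yield line


sample = """fdsjhgjhg
fdshkjhk
Start
Good Morning
Hello World
End
dashjkhjk
dsfjkhk
"""

# keepends=True keeps the trailing newline, as iterating over a file would.
result = list(lines_between(sample.splitlines(keepends=True)))
print(result)  # ['Good Morning\n', 'Hello World\n']
```

Because the helper is a generator, its output can be passed directly to `outfile.writelines(...)` when working with real files.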
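Solution 2:

    import re

    with open('input.txt') as myfile:
        content = myfile.read()

    # re.DOTALL lets '.' match newlines; the non-greedy group stops at the first End.
    # group(1) leaves out the Start and End markers themselves.
    match = re.search(r'Start\n(.*?)End', content, re.DOTALL)
    text = match.group(1) if match else ''

    with open('output.txt', 'w') as myfile2:
        myfile2.write(text)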
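`re.search` returns only the first match, so Solution 2 picks up a single block. If the markers may occur several times, one option is `re.findall` with the markers anchored to whole lines. This is a sketch under that assumption; `blocks_between` is a hypothetical name.

```python
import re


def blocks_between(content, start="Start", end="End"):
    """Return the text of every block enclosed by a start line and an end line."""
    pattern = re.compile(
        r'^[ \t]*' + re.escape(start) + r'[ \t]*\n'   # a line holding only the start marker
        r'(.*?)'                                      # block body, non-greedy
        r'^[ \t]*' + re.escape(end) + r'[ \t]*$',     # a line holding only the end marker
        re.DOTALL | re.MULTILINE,
    )
    return pattern.findall(content)


content = "x\nStart\na\nb\nEnd\ny\nStart\nc\nEnd\n"
blocks = blocks_between(content)
print(blocks)  # ['a\nb\n', 'c\n']
```

Anchoring with `^` and `$` under `re.MULTILINE` keeps a line such as `Endless` from being mistaken for the end marker.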
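Solution 3:

    import itertools

    with open('input.txt', 'r') as f, open('output.txt', 'w') as fout:
        while True:
            # Skip lines until the next Start marker (or the end of the file).
            it = itertools.dropwhile(lambda line: line.strip() != 'Start', f)
            if next(it, None) is None:
                break  # no more Start markers
            # Copy lines up to, but not including, the End marker.
            fout.writelines(itertools.takewhile(lambda line: line.strip() != 'End', it))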
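2. Extracting the content between two strings (single line)
Solution (string slicing):

    def getBetween(text, str1, str2):
        """Get the content between str1 and str2 in text ('' if either is missing)."""
        start = text.find(str1)
        # Look for str2 only after str1, so an earlier str2 is not picked up.
        end = text.find(str2, start + len(str1))
        if start == -1 or end == -1:
            return ''
        return text[start + len(str1):end]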
Reference link:
https://github.com/bfishadow/SBB
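String slicing returns a single piece of text. When the same pair of strings may appear several times on one line, a regular expression is one alternative. The sketch below is an assumption-labelled illustration; `find_all_between` is a hypothetical name and is not part of the referenced project.

```python
import re


def find_all_between(text, str1, str2):
    """Return every shortest piece of text enclosed by str1 and str2."""
    # re.escape makes the delimiters literal; '.' does not cross newlines here,
    # so each match stays on a single line.
    pattern = re.escape(str1) + r'(.*?)' + re.escape(str2)
    return re.findall(pattern, text)


line = "fdshkjhk Start Good Morning End dashjkhjk Start Hello World End"
pieces = find_all_between(line, "Start ", " End")
print(pieces)  # ['Good Morning', 'Hello World']
```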
3. Other implementations
    sed -n '/Start/,/End/p' input.txt | grep -Ev '(Start|End)'
    sed -e '1,/Start/d' -e '/End/,$d' input.txt
    awk '/Start/,/End/' input.txt | grep -Ev '(Start|End)'
    awk '/Start/{flag=1;next} /End/{flag=0} flag{ print }' input.txt
    awk '/End/{flag=0} flag; /Start/{flag=1}' input.txt
    perl -lne 'print if((/Start/../End/) && !(/Start/||/End/))' input.txt
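Note that patterns such as `/Start/` and `/End/` match anywhere in a line, whereas the Python solutions in section 1 compare the whole stripped line. To make the difference concrete, here is a rough Python rendering of the `awk '/End/{flag=0} flag; /Start/{flag=1}'` idiom; `awk_style_between` is a hypothetical name and the sample lines are made up for illustration.

```python
import re


def awk_style_between(lines, start=r'Start', end=r'End'):
    """Mimic: awk '/End/{flag=0} flag; /Start/{flag=1}' (unanchored regex matching)."""
    flag = False
    out = []
    for line in lines:
        if re.search(end, line):    # /End/{flag=0}
            flag = False
        if flag:                    # flag  (print while the flag is set)
            out.append(line)
        if re.search(start, line):  # /Start/{flag=1}
            flag = True
    return out


lines = ["intro", "Start", "Good Morning", "The End is near", "Hello World", "End", "outro"]

loose = awk_style_between(lines)
print(loose)   # ['Good Morning']

strict = awk_style_between(lines, start=r'^Start$', end=r'^End$')
print(strict)  # ['Good Morning', 'The End is near', 'Hello World']
```

With unanchored patterns the block is cut short by the line that merely contains "End"; anchoring the patterns, which works the same way in sed and awk (`/^Start$/`, `/^End$/`), restricts the match to marker-only lines.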
Search keywords:
- awk print line between
Reference links:
- http://www.unix.com/shell-programming-and-scripting/48676-how-print-only-lines-between-two-strings-using-awk.html
- https://nixtip.wordpress.com/2010/10/12/print-lines-between-two-patterns-the-awk-way/
- http://www.shellhacks.com/en/Using-SED-and-AWK-to-Print-Lines-Between-Two-Patterns
- http://stackoverflow.com/questions/17988756/how-to-select-lines-between-two-marker-patterns-which-may-occur-multiple-times-w
=EOF=