I. Extracting the text between two markers (multi-line)
Given the following text:
fdsjhgjhg
fdshkjhk
Start
Good Morning
Hello World
End
dashjkhjk
dsfjkhk
I need to use Python to extract the content between "Start" and "End" and write it to a result file.
Solution 1:
with open('/path/to/input') as infile, open('/path/to/output', 'w') as outfile:
    copy = False  # True while we are between a "Start" line and an "End" line
    for line in infile:
        if line.strip() == "Start":
            copy = True
        elif line.strip() == "End":
            copy = False
        elif copy:
            # The marker lines themselves are never written
            outfile.write(line)
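The flag-based loop above can be wrapped in a small generator so that it can be tried on in-memory lines without touching any files. This is only a sketch: the function name `lines_between` and its parameters are made up for illustration, and, like Solution 1, it assumes each marker sits alone on its own line.

```python
def lines_between(lines, start="Start", end="End"):
    """Yield the lines found between a start-marker line and an end-marker line.

    Hypothetical helper: a marker must be the only non-whitespace text on its
    line, and the marker lines themselves are not yielded.
    """
    copying = False
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            copying = True
        elif stripped == end:
            copying = False
        elif copying:
            yield line


sample = [
    "fdsjhgjhg\n", "fdshkjhk\n", "Start\n", "Good Morning\n",
    "Hello World\n", "End\n", "dashjkhjk\n", "dsfjkhk\n",
]
extracted = list(lines_between(sample))
print("".join(extracted), end="")
```

Because the helper accepts any iterable of lines, an open file object can be passed in directly, for example `outfile.writelines(lines_between(infile))`.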
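Solution 2:
import re

with open('input.txt') as myfile:
    content = myfile.read()
# re.DOTALL lets '.' match newlines; the group captures only the text
# between the markers, not the "Start" and "End" lines themselves.
match = re.search(r'Start\n(.*?)End', content, re.DOTALL)
if match:
    with open("output.txt", "w") as myfile2:
        myfile2.write(match.group(1))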
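If the markers can occur more than once, `re.findall` with a capture group returns every block in a single pass. A minimal sketch, assuming the whole file fits in memory and each marker occupies a line of its own with no surrounding whitespace; the name `blocks_between` is hypothetical.

```python
import re


def blocks_between(content, start="Start", end="End"):
    """Return the text of every block enclosed by start/end marker lines."""
    # re.MULTILINE anchors ^ and $ at line boundaries; re.DOTALL lets '.'
    # span newlines; the non-greedy '.*?' stops at the first end marker.
    pattern = r"^{}\n(.*?)^{}$".format(re.escape(start), re.escape(end))
    return re.findall(pattern, content, re.DOTALL | re.MULTILINE)


content = "junk\nStart\nGood Morning\nHello World\nEnd\nmore junk\nStart\nBye\nEnd\n"
blocks = blocks_between(content)
print(blocks)
```

Each list element keeps its trailing newline, so `"".join(blocks)` can be written straight to the output file.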
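Solution 3:
import itertools

with open('input.txt', 'r') as f, open('output.txt', 'w') as fout:
    while True:
        # Skip everything up to the next "Start" line
        it = itertools.dropwhile(lambda line: line.strip() != 'Start', f)
        # Consume the "Start" line itself; stop when the file is exhausted
        if next(it, None) is None:
            break
        # Write lines until the "End" line, which is consumed but not written
        fout.writelines(itertools.takewhile(lambda line: line.strip() != 'End', it))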
II. Extracting the content between two strings (single line)
Solution (string slicing):
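def getBetween(text, str1, str2):
    '''Get the content between str1 and str2 in text ('' if either is missing).'''
    start = text.find(str1)
    if start == -1:
        return ''
    start += len(str1)
    # Look for str2 only after str1, so an earlier occurrence is ignored
    end = text.find(str2, start)
    return text[start:end] if end != -1 else ''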
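The same single-line extraction can also be written with `str.partition`, which splits on the first occurrence of a separator and reports whether it was found. This is a sketch only: `get_between` is a hypothetical name, and returning `None` when a marker is missing is an assumption about the desired behaviour, not something the original snippet specifies.

```python
def get_between(text, left, right):
    """Return the text between the first `left` and the next `right`, or None."""
    _, found_left, rest = text.partition(left)
    if not found_left:
        return None  # left marker absent
    middle, found_right, _ = rest.partition(right)
    if not found_right:
        return None  # right marker absent after the left one
    return middle


line = "fdsjhgjhg Start Good Morning Hello World End dashjkhjk"
result = get_between(line, "Start", "End")
print(repr(result))
```

The surrounding spaces are kept as-is; call `.strip()` on the result if they are not wanted.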
Reference link:
https://github.com/bfishadow/SBB
III. Other implementations
sed -n '/Start/,/End/p' input.txt | grep -Ev '(Start|End)'
sed -e '1,/Start/d' -e '/End/,$d' input.txt
awk '/Start/,/End/' input.txt | grep -Ev '(Start|End)'
awk '/Start/{flag=1;next} /End/{flag=0} flag{ print }' input.txt
awk '/End/{flag=0} flag; /Start/{flag=1}' input.txt
perl -lne 'print if((/Start/../End/) && !(/Start/||/End/))' input.txt
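Note that these one-liners match `Start` and `End` anywhere in a line, whereas the Python solutions earlier in this post compare the whole stripped line. For comparison, here is a rough Python counterpart of the `awk '/Start/{flag=1;next} /End/{flag=0} flag{ print }'` idiom. The function name `awk_style_between` is hypothetical, and the sketch is not meant as an exact reimplementation of awk.

```python
import re


def awk_style_between(lines, start=r"Start", end=r"End"):
    """Yield lines between lines matching `start` and `end` (regex search)."""
    start_re, end_re = re.compile(start), re.compile(end)
    flag = False
    for line in lines:
        if start_re.search(line):   # /Start/{flag=1;next}
            flag = True
            continue
        if end_re.search(line):     # /End/{flag=0}
            flag = False
        if flag:                    # flag{ print }
            yield line


text = "noise\n== Start here ==\nGood Morning\nHello World\n== End here ==\nnoise\n"
picked = list(awk_style_between(text.splitlines(keepends=True)))
print("".join(picked), end="")
```

As with the shell versions, a content line that merely contains the word "End" would close the block early, so tighter patterns such as `^Start$` and `^End$` may be preferable.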
Search keywords:
- awk print line between
Reference links:
- http://www.unix.com/shell-programming-and-scripting/48676-how-print-only-lines-between-two-strings-using-awk.html
- https://nixtip.wordpress.com/2010/10/12/print-lines-between-two-patterns-the-awk-way/
- http://www.shellhacks.com/en/Using-SED-and-AWK-to-Print-Lines-Between-Two-Patterns
- http://stackoverflow.com/questions/17988756/how-to-select-lines-between-two-marker-patterns-which-may-occur-multiple-times-w
=EOF=