In everyday work we sometimes need to split Chinese text into sentences at punctuation marks. The effect is shown in the following figure.
How can a sentence be split at punctuation? As shown in the figure above, to get the two sentences "Python Basic Tutorial" and "Python Introduction Tutorial (very detailed)", the whole sentence has to be split at ",". How can this be implemented in Python?
Idea: replace every punctuation mark you want to split on with one fixed marker, such as "-", and then take the text before and after each "-" as separate sentences. This splits the text at the punctuation marks.
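The idea can be sketched on an in-memory string before any file handling is involved. The helper name `split_by_marks` and the sample text are assumptions made for illustration; they are not part of the original program.

```python
def split_by_marks(text, marks, marker='-'):
    """Replace each mark in `marks` with `marker`, then split on `marker`."""
    for m in marks:
        text = text.replace(m, marker)
    # Drop empty pieces produced by adjacent or trailing markers
    return [piece for piece in text.split(marker) if piece]


# Hypothetical sample text, modelled on the example sentence above
sample = 'Python Basic Tutorial,Python Introduction Tutorial (very detailed)'
sentences = split_by_marks(sample, [',', '?', '|'])
print(sentences)
# ['Python Basic Tutorial', 'Python Introduction Tutorial (very detailed)']
```

Funnelling every mark into one marker keeps the splitting step simple: only a single character has to be handled afterwards, however many marks are in the list.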
Knowledge points: the use of Python's re module, regular expressions, and file operations.
The code is as follows:
from pathlib import Path
import re  # Import required modules

p1 = Path('1.txt')  # Path of the original file, relative to the directory the program runs from
with p1.open('r') as file:  # Open the original file
    article = file.read()  # Read the full text of the original file
mark = ['?', ',', '-', '|', '_', '–', '\n']  # The punctuation marks you want to split on are stored here
for m in mark:  # Replace every punctuation mark you want to split on with "-"
    article = article.replace(m, '-')
regex = re.compile(r'[^-]+(?=-|$)')  # Match each run of text that ends at a "-" or at the end of the text
r = regex.findall(article)
t = set(r)  # Store the sentences in a set to remove duplicates (the original order is not kept)
with open('2.txt', 'w') as newfile:  # Create a new txt file to store the sentences
    for x in t:
        newfile.write(x + '\n')  # Write one sentence per line
The above code can be used to break sentences with punctuation marks. The specific effect is shown in the following figure.
Improvements: Sometimes short sentences such as "Baidu Encyclopedia", "Rookie Tutorial", or "Zhihu" are not what we want. To filter them out, set a minimum length for the matched sentences in the regular expression, so that sentences with too few characters are skipped.
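One way to sketch this improvement is to swap the `+` quantifier for a bounded one such as `{6,}`, so that only runs of at least six characters between "-" markers match. The threshold `MIN_LENGTH` and the sample text are assumptions chosen for illustration; a suitable value depends on your data.

```python
import re

MIN_LENGTH = 6  # Hypothetical threshold: minimum number of characters per sentence

# Match only runs of at least MIN_LENGTH characters that contain no "-"
regex = re.compile(r'[^-]{%d,}' % MIN_LENGTH)

# Hypothetical text in which the punctuation has already been replaced with "-"
article = 'Zhihu-Python Basic Tutorial-Baidu-Python Introduction Tutorial'
long_sentences = regex.findall(article)
print(long_sentences)
# ['Python Basic Tutorial', 'Python Introduction Tutorial']
```

The quantifier counts characters, not words, so for Chinese text the threshold is the minimum number of characters in a sentence.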