Spliting text file with conditional operators by python - python

I'm having a huge file, which consists excessive length of transcribed speach for about two days straight. Over 100,000 words I guess.
During transcription, I have separated speaker and sessions by "<-- Name -->" mark into different blocks. My problem is, is it possible to automatically process them into files in a naming convention of name_speach.txt ?
THANKS!!!!
Test cases:
Test case
<--测试0-->
这个是一段测试内容,a quick fox jumps over a lazy dog.
<——测试1——>
,a quick fox just over 啊 辣子 dog!!?是吗?
<——测试2——>
这是一段测试用的text,嗯!
<--Test case 3-->
/* sound track lost #153:12.236 -- 153.18.222 */
…
A quick fox jumps over a {lazy|lame} dog.

So you want to search every pattern "<-- Name -->" in a text file (100000 words is not very huge for computer memory, I think).
You can use Regular expression for search tags.
In Python, It's something like:
import re
NAMETAG = r'\<\-\- (?P<name>.*?) \-\-\>'
# find all nametags in your string
matches = re.findall(NAMETAG, yourtext)
offset_start_list = []
offset_end_list = []
name_list = []
for m in matches:
name = m.groups()['name']
name_list.append(name)
# find content offset after name tag
offset_start_list.append(m.end() + 1)
# the last content's end
offset_end_list.append(m.start())
offset_end_list.pop(0)
offset_end_list.append(len(yourtext))
for name, start, end in zip(name_list, offset_start_list, offset_end_list):
# save your files here

Related

Python to Find-replace a string and Create Two Paragraphs Before String in Words Document

I have a VBA Macro. In that, I have
.Find Text = 'Pollution'
.Replacement Text = '^p^pChemical'
Here, '^p^pChemical' means Replace the Word Pollution with Chemical and create two empty paragraphs before the word sea.
Before:
After:
Have you noticed that The Word Pollution has been replaced With Chemical and two empty paragraphs preceds it ? This is how I want in Python.
My Code so far:
import docx
from docx import Document
document = Document('Example.docx')
for Paragraph in document.paragraphs:
if 'Pollution' in paragraph:
replace(Pollution, Chemical)
document.add_paragraph(before('Chemical'))
document.add_paragraph(before('Chemical'))
I want to open a word document to find the word, replace it with another word, and create two empty paragraphs before the replaced word.
You can search through each paragraph to find the word of interest, and call insert_paragraph_before to add the new elements:
def replace(doc, target, replacement):
for par in list(document.paragraphs):
text = par.text
while (index := text.find(target)) != -1:
par.insert_paragraph_before(text[:index].rstrip())
par.insert_paragraph_before('')
par.text = replacement + text[index + len(target)]
list(doc.paragraphs) makes a copy of the list, so that the iteration is not thrown off when you insert elements.
Call this function as many times as you need to replace whatever words you have.
This will take the text from the your document, replace the instances of the word pollution with chemical and add paragraphs in between, but it doesn't change the first document, instead it creates a copy. This is probably the safer route to go anyway.
import re
from docx import Document
ref = {"Pollution":"Chemicals", "Ocean":"Sea", "Speaker":"Magnet"}
def get_old_text():
doc1 = Document('demo.docx')
fullText = []
for para in doc1.paragraphs:
fullText.append(para.text)
text = '\n'.join(fullText)
return text
def create_new_document(ref, text):
doc2 = Document()
lines = text.split('\n')
for line in lines:
for k in ref:
if k.lower() in line.lower():
parts = re.split(f'{k}', line, flags=re.I)
doc2.add_paragraph(parts[0])
for part in parts[1:]:
doc2.add_paragraph('')
doc2.add_paragraph('')
doc2.add_paragraph(ref[k] + " " + part)
doc2.save('demo.docx')
text = get_old_text()
create_new_document(ref, text)
You need to use \n for new line. Using re should work like so:
import re
before = "The term Pollution means the manifestation of any unsolicited foregin substance in something. When we talk about pollution on earth, we refer to the contamination that is happening of the natural resources by various pollutants"
pattern = re.compile("pollution", re.IGNORECASE)
after = pattern.sub("\n\nChemical", before)
print(after)
Which will output:
The term
Chemical means the manifestation of any unsolicited foregin substance in something. When we talk about
Chemical on earth, we refer to the contamination that is happening of the natural resources by various pollutants

Regex not specific enough

So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (it's usually information about the book title, author, page number, etc.). I thought it was functional but sometimes there would random be periods (.) on certain lines of the output. At first I thought the program was just buggy but then I realized that the regex I'm using to match the books title and author was also matching any sentence that ended in brackets.
This is the code for the regex that I'm using to detect the books title and author
titleRegex = re.compile('(.+)\((.+)\)')
Example
Desired book title and author match: Book title (Author name)
What would also get matched: *I like apples because they are green (they are sometimes red as well). *
In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal because it deletes the text I highlighted
Here is the unformatted text file that goes into my program
The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.
Would there be any ways to make my title regex more specific so that it only picks up author titles and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?
I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)
import re
titleRegex = re.compile('(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')
regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]
string = open("/Users/devinnagami/myclippings.txt").read()
for x in range(len(regexList)):
newString = re.sub(regexList[x], ' ', string)
string = newString
finalText = newString.split(' ')
with open('booknotes.txt', 'w') as f:
for item in finalText:
f.write('%s\n' % item)
There isn't enough information to tell if "Book title (Book Author)" is different than something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression to encode that context.
For instance:
quoteInfoRegex = re.compile(
r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n" +
r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n" +
r"\n" +
r"(?P<quote>.*?)\n", flags=re.MULTILINE)
for m in quoteInfoRegex.finditer(data):
print(m.groupdict())
This will pull out each line of the text, and parse it, knowing that the book title is the first line after the equals, and the quote itself is below that.

How to iterate over multiple regex matches and replace them?

Replace each sentence containing the word fear, with the same sentence, wrapped in a b tag with class="fear".
Trying to wrap each (of 2 total) matches for this pattern in html tags.
import re
with open('chicken.txt', 'r') as file:
pattern = re.compile(r'[^\.]+fear[^\.]+')
text = file.read()
matches = pattern.finditer(text)
tagstart = '<b class="fear">'
tagend = '</b>'
replacement = [text.replace(match[0], tagstart + match[0] + tagend) for match in matches]
with open('chick.html', 'w') as htmlfile:
htmlfile.write(replacement[0])
chick.html output looks like this:
If you've spent much time with chickens, you may doubt their ability to process a thought as complex as
"Chicken A will beat me anyway, so why bother to fight?" Your doubt is well placed.
Pecking orders are yet another case where the "thinking" has been done by natural selection,
and so needn't be done by the organism.<b class="fear"> The organism must be able to tell its neighbors apart,
and to feel a healthy fear of the ones that have brutalized it, but it needn't grasp the logic behind the fear</b>.
Any genes endowing a chicken with this selective fear, reducing the time spent in futile and costly combat, should flourish.
The final sentence is the second item in the replacement variable, and in isn't wrapped in that b tag.
You could iterate over each match from findinter using replace but performing the substitution over the whole text each time.
import re
pattern = re.compile(r'[^\.]+fear[^\.]+')
tagstart = '<b class="fear">'
tagend = '</b>'
with open('chicken.txt', 'r') as file:
text = file.read()
matches = pattern.finditer(text)
for match in matches:
text = text.replace(match[0], tagstart + match[0] + tagend)
with open('chick.html', 'w') as htmlfile:
htmlfile.write(text)
File chick.html
If you've spent much time with chickens, you may doubt their ability to process a
thought as complex as "Chicken A will beat me anyway, so why bother to fight?" Your doubt
is well placed. Pecking orders are yet another case where the "thinking" has been done by natural
selection, and so needn't be done by the organism.<b class="fear"> The organism must be able to tell
its neighbors apart, and to feel a healthy fear of the ones that have brutalized it, but
it needn't grasp the logic behind the fear</b>.<b class="fear"> Any genes endowing a chicken with
this selective fear, reducing the time spent in futile and costly combat, should flourish</b>.

How to use text.split() and retain blank (empty) lines

New to python, need some help with my program. I have a code which takes in an unformatted text document, does some formatting (sets the pagewidth and the margins), and outputs a new text document. My entire code works fine except for this function which produces the final output.
Here is the segment of the problem code:
def process(document, pagewidth, margins, formats):
res = []
onlypw = []
pwmarg = []
count = 0
marg = 0
for segment in margins:
for i in range(count, segment[0]):
res.append(document[i])
text = ''
foundmargin = -1
for i in range(segment[0], segment[1]+1):
marg = segment[2]
text = text + '\n' + document[i].strip(' ')
words = text.split()
Note: segment [0] means the beginning of the document, and segment[1] just means to the end of the document if you are wondering about the range. My problem is when I copy text to words (in words=text.split() ) it does not retain my blank lines. The output I should be getting is:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me.
There now is your insular city of the Manhattoes, belted
round by wharves as Indian isles by coral reefs--commerce
surrounds it with her surf.
And what my current output looks like:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me. There now is your insular city of
the Manhattoes, belted round by wharves as Indian isles by
coral reefs--commerce surrounds it with her surf.
I know the problem happens when I copy text to words, since it doesn't keep the blank lines. How can I make sure it copies the blank lines plus the words?
Please let me know if I should add more code or more detail!
First split on at least 2 newlines, then split on words:
import re
paragraphs = re.split('\n\n+', text)
words = [paragraph.split() for paragraph in paragraphs]
You now have a list of lists, one per paragraph; process these per paragraph, after which you can rejoin the whole thing into new text with double newlines inserted back in.
I've used re.split() to support paragraphs being delimited by more than 2 newlines; you could use a simple text.split('\n\n') if there are ever only going to be exactly 2 newlines between paragraphs.
use a regexp to find the words and the blank lines rather than split
m = re.compile('(\S+|\n\n)')
words=m.findall(text)

How can I make this code work faster ? (searching large corpus of text for large words)

in Python, I have created a text generator that acts on certain parameters but my code is -at most of the time- slow and performs below my expectations. I expect one sentence per every 3-4 minutes but it fails to comply if the database it works on is large -I use the project Gutenberg's 18-book corpus and I will create my custom corpus and add further books so performance is vital.- The algorithm and the implementation is below:
ALGORITHM
1- Enter the trigger sentence -only once, at the beginning of the program-
2- Get the longest word in the trigger sentence
3- Find all the sentences of the corpus that contain the word at step2
4- Randomly select one of those sentences
5- Get the sentence (named sentA to resolve the ambiguity in description) that follows the sentence picked at step4 -so long as sentA is longer than 40 characters-
6- Go to step 2, now the trigger sentence is the sentA of step5
IMPLEMENTATION
from nltk.corpus import gutenberg
from random import choice
triggerSentence = raw_input("Please enter the trigger sentence:")#get input sentence from user
previousLongestWord = ""
listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = [] #all sentences in the related corpus
sentenceAppender = ""
longestWord = ""
#this function is not mine, code courtesy of Dave Kirby, found on the internet about sorting list without duplication speed tricks
def arraySorter(seq):
seen = set()
return [x for x in seq if x not in seen and not seen.add(x)]
def findLongestWord(longestWord):
if(listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
longestWord = sortedSetOfValidWords[-2]
if(listOfWords.count(longestWord) == 1):
longestWord = sortedSetOfValidWords[-3]
doappend = corpusSentences.append
def appending():
for mysentence in listOfSents: #sentences are organized into array so they can actually be read word by word.
sentenceAppender = " ".join(mysentence)
doappend(sentenceAppender)
appending()
sentencesContainingLongestWord = []
def getSentence(longestWord, sentencesContainingLongestWord):
for sentence in corpusSentences:
if sentence.count(longestWord):#if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
sentencesContainingLongestWord.append(sentence)
def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):
while(len(corpusSentences[sentenceIndex + 1]) < 40):#in case the next sentence is shorter than 40 characters, pick another trigger sentence
sentencesContainingLongestWord.remove(triggerSentence)
triggerSentence = choice(sentencesContainingLongestWord)
sentenceIndex = corpusSentences.index(triggerSentence)
while len(triggerSentence) > 0: #run the loop as long as you get a trigger sentence
sentencesContainingLongestWord = []#all the sentences that include the longest word are to be inserted into this set
setOfValidWords = [] #set for words in a sentence that exists in a corpus
split_str = triggerSentence.split()#split the sentence into words
setOfValidWords = [word for word in split_str if listOfWords.count(word)]
sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key = len))
longestWord = sortedSetOfValidWords[-1]
findLongestWord(longestWord)
previousLongestWord = longestWord
getSentence(longestWord, sentencesContainingLongestWord)
triggerSentence = choice(sentencesContainingLongestWord)
sentenceIndex = corpusSentences.index(triggerSentence)
lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)
triggerSentence = corpusSentences[sentenceIndex + 1]#get the sentence that is next to the previous trigger sentence
print triggerSentence
print "\n"
corpusSentences.remove(triggerSentence)#in order to view the sentence index numbers, you can remove this one so index numbers are concurrent with actual gutenberg numbers
print "End of session, please rerun the program"
#initiated once the while loop exits, so that the program ends without errors
The computer I run the code on is a bit old, dual-core CPU was bought in Feb. 2006 and 2x512 RAM was bought in Sept. 2004 so I'm not sure if my implementation is bad or the hardware is a reason for the slow runtime. Any ideas on how I can rescue this from its hazardous form ? Thanks in advance.
I think my first advice must be: Think carefully about what your routines do, and make sure the name describes that. Currently you have things like:
arraySorter which neither deals with arrays nor sorts (it's an implementation of nub)
findLongestWord which counts things or selects words by criteria not present in the algorithm description, yet ends up doing nothing at all because longestWord is a local variable (argument, as it were)
getSentence which appends an arbitrary number of sentences onto a list
appending which sounds like it might be a state checker, but operates only through side effects
considerable confusion between local and global variables, for instance the global variable sentenceAppender is never used, nor is it an actor (for instance, a function) like the name suggests
For the task itself, what you really need are indices. It might be overkill to index every word - technically you should only need index entries for words that occur as the longest word of a sentence. Dictionaries are your primary tool here, and the second tool is lists. Once you have those indices, looking up a random sentence containing any given word takes only a dictionary lookup, a random.choice, and a list lookup. Perhaps a few list lookups, given the sentence length restriction.
This example should prove a good object lesson that modern hardware or optimizers like Psyco do not solve algorithmic problems.
Maybe Psyco speeds up the execution?

Categories

Resources