Split text based on words using Python code

I have a long text like the one below. I need to split it on certain words, say ("In", "On", "These").
Below is sample data:
On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammelled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains.
Can this problem be solved with code, as I have 1000 rows in a CSV file?

As per my comment, I think a good option would be to use a regular expression with the pattern:
re.split(r'(?<!^)\b(?=(?:On|In|These)\b)', YourStringVariable)
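A quick sketch of what that pattern produces on a short made-up sentence (the `(?<!^)` lookbehind keeps it from emitting an empty first piece; note that zero-width splits like this require Python 3.7+):

```python
import re

text = "On Monday it rained. In fact, These things happen."
# split immediately before each whole-word marker, except at the very start
parts = re.split(r'(?<!^)\b(?=(?:On|In|These)\b)', text)
# → ['On Monday it rained. ', 'In fact, ', 'These things happen.']
```

Because the lookahead matches whole words only, a lowercase "in" inside a sentence (or "In" inside a longer word) does not trigger a split.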

Yes, this can be done in Python. You can load the text into a variable and use the built-in split method of strings. For example:
with open(filename, 'r') as file:
    lines = file.read()
lines = lines.split('These')
# lines is now a list of strings, split wherever the string 'These' was encountered

To find whole words that are not part of larger words, I like using the regular expression:
[^\w]word[^\w]
Sample python code, assuming the text is in a variable named text:
import re
exp = re.compile(r'[^\w]in[^\w]', flags=re.IGNORECASE)
all_occurrences = list(exp.finditer(text))

Related

highlighting words in a docx file using python-docx gives incorrect results

I would like to highlight specific words in an MS Word document (here given as negativeList) and leave the rest of the document as it was before. I have tried to adapt from this one, but I cannot get it running as it should:
from docx.enum.text import WD_COLOR_INDEX
from docx import Document
import copy
import re

doc = Document(docxFileName)
negativeList = ["king", "children", "lived", "fire"]  # some examples
for paragraph in doc.paragraphs:
    for target in negativeList:
        if target in paragraph.text:  # it is worth checking in detail ...
            currRuns = copy.copy(paragraph.runs)  # copy, as we delete/clear the originals
            paragraph.runs.clear()
            for run in currRuns:
                if target in run.text:
                    # split into words in order to be able to color only one
                    words = re.split(r'(\W)', run.text)
                    for word in words:
                        if word == target:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = WD_COLOR_INDEX.PINK
                        else:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = None
                else:  # our target is not in it, so we add it unchanged
                    paragraph.runs.append(run)
doc.save('output.docx')
As example I am using this text (in a word docx file):
CHAPTER 1
Centuries ago there lived --
"A king!" my little readers will say immediately.
No, children, you are mistaken. Once upon a time there was a piece of
wood. It was not an expensive piece of wood. Far from it. Just a
common block of firewood, one of those thick, solid logs that are put
on the fire in winter to make cold rooms cozy and warm.
There are multiple problems with my code:
1) The first sentence works, but the second sentence appears twice. Why?
2) The formatting somehow gets lost in the part where I highlight. I would probably need to copy the properties of the original run into the newly created ones, but how do I do this?
3) I lose the terminal "--".
4) In the highlighted last paragraph, "cozy and warm" is missing ...
What I would need is either a fix for these problems, or maybe I am overthinking it and there is a much easier way to do the highlighting? (Something like doc.highlight({"king": "pink"}), but I haven't found anything in the documentation.)
You're not overthinking it; this is a challenging problem. It is a form of the search-and-replace problem.
The target text can be located fairly easily by searching Paragraph.text, but replacing it (or in your case adding formatting) while retaining other formatting requires access at the Run level, both of which you've discovered.
There are some complications though, which is what makes it challenging:
There is no guarantee that your "find" target string is located entirely in a single run. So you will need to find the run containing the start of your target string and the run containing the end of your target string, as well as any in-between.
This might be aided by using character offsets: for example, "king" appears at character offset 3 in '"A king!" ...' and has a length of 4, so you would identify which run contains character 3 and which contains character 3+4.
Related to the first complication, there is no guarantee that all the runs in which the target string partly appears are formatted the same. For example, if your target string was "a bold word", the updated version (after adding highlighting) would require at least three runs, one for "a ", one for "bold", and one for " word" (btw, which run each of the two space characters appear in won't change how they appear).
If you accept the simplification that the target string will always be a single word, you can consider the simplification of giving the replacement run the formatting of the first character (first run) of the found target runs, which is probably the usual approach.
So I suppose there are a few possible approaches, but one would be to "normalize" the runs of each paragraph containing the target string, such that the target string appeared within a distinct run. Then you could just apply highlighting to that run and you'd get the result you wanted.
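A minimal sketch of that normalization step, assuming the single-word simplification above. `split_on_target` is a made-up helper name; wiring it into python-docx would mean clearing the paragraph, re-adding one run per piece, and setting `highlight_color` on the pieces that equal the target:

```python
import re

def split_on_target(text, target):
    # Split so every whole-word occurrence of `target` becomes its own piece;
    # the pieces equal to `target` are the ones to put in highlighted runs.
    pattern = r'(\b{}\b)'.format(re.escape(target))
    return [piece for piece in re.split(pattern, text) if piece]
```

For example, `split_on_target('"A king!" my readers', 'king')` yields `['"A ', 'king', '!" my readers']`, so only the middle piece would get the pink highlight.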
To be of more help, you'll need to narrow down the problem areas and provide specific inputs and outputs. I'd start with the first one (perhaps losing the "--") (in a separate question, perhaps linked from here) and then proceed one by one until it all works. It's asking too much for a respondent to produce their own test case :)
Then you'd have a question like: "I run the string: 'Centuries ago ... --' through this code and the trailing "--" disappears ...", which is a lot easier for folks to reason through.
Another good next step might be to print out the text of each run, just so you get a sense of how they're broken up. That may give you insight into where it's not working.
I know it's not the same library, but using the win32com library you can highlight all the instances of the word in a specific range at once.
The code below will highlight all hits.
import win32com.client as win32

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True
doc = word.Documents.Open("test.docx")
strage = doc.Range(Start=0, End=0)  # change this range to shorten the replace
strage.Find.Replacement.Highlight = True
strage.Find.Execute(FindText="the", Replace=2, Format=True)
I faced a similar issue where I was supposed to highlight a set of words in a document. I modified certain parts of the OP's code, and now I am able to highlight the selected words correctly.
As the OP said in the comments: paragraph.runs.clear() was changed to paragraph.clear().
And I added a few lines to the following part of the code:
else:
    paragraph.runs.append(run)
to get this:
else:
    oldRun = paragraph.add_run(run.text)
    if oldRun.text in spell_errors:
        oldRun.font.highlight_color = WD_COLOR_INDEX.YELLOW
While iterating over currRuns, we extract each run's text and add it back to the paragraph, so words that were previously highlighted need to be highlighted again.

Change string inside curly brackets (options separated by |)

I'm trying to change the text between the curly brackets from the following string:
s = "As soon as {female_character:Aurelia|Aurelius} turned around the corner, {female_character:she|he} remembered that it was the wrong way and would eventually end in a cul-de-sac. Spinning around, {female_character:Aurelia|Aurelius} tried to run back out, but the way was already blocked by the vendor. In this dark alley way, nobody would see or care what happened to some poor beggar turned thief. Should {female_character:Aurelia|Aurelius} put up a fight in hopes of lasting long enough to escape or give up now and trust to the mercy of the vendor?"
My question is: how do I apply Python logic to the text in that string, so that {female_character:Aurelia|Aurelius} in the text applies the logic of:
if (whatever is on the left side of the colon) == True:
    (replace {female_character:Aurelia|Aurelius} with the option on the left side of the |)
else:
    (replace {female_character:Aurelia|Aurelius} with the option on the right side of the |)
A couple of other points to note: the string is getting pulled from a JSON file, and there will be many similar texts. Additionally, some of the braces will have braces within braces, like so: {strong_character:is big for his age|{small_character:although small for his age, is a very quick warrior|although average size, is a skilled warrior}}
As I'm sure anyone can tell, I'm still new to coding and am trying to learn Python. So I apologize in advance for any ignorance on my part.
You can use a regular expression to locate the variables and their text replacements. Regular expressions support grouping, so you can grab both True and False in separate groups, and then, depending on the current value of the found variable, replace the entire match with the correct group.
With nested expressions it gets a bit harder, though. Best is to construct the regex in such way that it will not match the outer level of nesting. The first time around, the inner braced expressions will be replaced by plain text, and then a second loop will match and change the rest.
So it may take more than one replacement loop, but how many then? That depends on the number of nesting braces. You could set the loop to a 'surely large enough' number such as 10, but this has several disadvantages. For instance, you need to be sure you don't accidentally nest more than 10 times; and if you have a sentence with only one level of braces and no nesting, it will still loop 9 times more, doing nothing at all.
One way to counter this is by counting the number of nested braces. I think my findall regex does this correctly, but I could be wrong there.
import re

def replaceVars(vars, text):
    for loop in range(len(re.findall(r'\{[^{}]*(?=\{)', text)) + 1):
        for var in vars:
            if vars[var]:
                text = re.sub(r'\{' + var + r':([^|{}]+)\|([^|{}]+?)\}', r'\1', text)
            else:
                text = re.sub(r'\{' + var + r':([^|{}]+)\|([^|{}]+?)\}', r'\2', text)
    return text

s = "As soon as {female_character:Aurelia|Aurelius} turned around the corner, {female_character:she|he} remembered that it was the wrong way and would eventually end in a cul-de-sac. Spinning around, {female_character:Aurelia|Aurelius} tried to run back out, but the way was already blocked by the vendor. In this dark alley way, nobody would see or care what happened to some poor beggar turned thief. Should {female_character:Aurelia|Aurelius} put up a fight in hopes of lasting long enough to escape or give up now and trust to the mercy of the vendor? Puppy {strong_character:is big for his age|{small_character:although small for his age, is a very quick warrior|although average size, {female_character:she|he} is a skilled warrior}}"
variables = {"female_character": True, "strong_character": False, "small_character": False}
t = replaceVars(variables, s)
print(t)
results in
As soon as Aurelia turned around the corner, she remembered that it was the wrong way and would eventually end in a cul-de-sac. Spinning around, Aurelia tried to run back out, but the way was already blocked by the vendor. In this dark alley way, nobody would see or care what happened to some poor beggar turned thief. Should Aurelia put up a fight in hopes of lasting long enough to escape or give up now and trust to the mercy of the vendor? Puppy although average size, she is a skilled warrior

Think you know Python RE? Here's a challenge

Here's the skinny: how do you make a character set match NOT a previously captured character?
r'(.)[^\1]' # doesn't work
Here's the uh... fat? It's part of a (simple) cryptography program. Suppose "hobo" got coded to "fxgx". The program only gets the encoded text and has to figure what it could be, so it generates the pattern:
r'(.)(.)(.)\2' # 1st and 3rd letters *should* be different!
Now it (correctly) matches "hobo", but also matches "hoho" (think about it!). I've tried stuff like:
r'(.)([^\1])([^\1\2])\2' # also doesn't work
and MANY variations but alas! Alack...
Please help!
P.S. The work-around (which I had to implement) is to just retrieve the "hobo"s as well the "hoho"s, and then just filter the results (discarding the "hoho"s), if you catch my drift ;)
P.P.S Now I want a hoho
VVVVV THE ANSWER VVVVV
Yes, I re-re-read the documentation and it does say:
Inside the '[' and ']' of a character class, all numeric escapes are
treated as characters.
As well as:
Special characters lose their special meaning inside sets.
Which pretty much means (I think) NO, you can't do anything like:
re.compile(r'(.)[\1]') # Well you can, but it kills the back-reference!
Thanks for the help!
1st and 3rd letters should be different!
This cannot be detected using a regular expression (not just Python's implementation). More specifically, it can't be detected by automata without memory; you'd need a different kind of automaton.
The kind of grammar you're trying to discover (reduplication) is not regular. Moreover, it is not context-free.
Automata are the mechanism that allows regular expression matching to be so efficient.
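The post-filtering workaround described in the question's P.S. can be sketched like this (`pattern_for` and `matches` are made-up names for illustration): repeated cipher letters become backreferences, and the filter then discards candidates where two distinct cipher letters decode to the same plaintext letter, which is exactly the "hoho" case:

```python
import re

def pattern_for(coded):
    # First occurrence of each cipher letter becomes (.);
    # repeats become backreferences to that group.
    groups, parts = {}, []
    for ch in coded:
        if ch in groups:
            parts.append('\\%d' % groups[ch])
        else:
            groups[ch] = len(groups) + 1
            parts.append('(.)')
    return re.compile(''.join(parts))

def matches(coded, candidates):
    pat = pattern_for(coded)
    result = []
    for word in candidates:
        m = pat.fullmatch(word)
        # distinct cipher letters must decode to distinct plaintext letters
        if m and len(set(m.groups())) == len(m.groups()):
            result.append(word)
    return result
```

So `matches('fxgx', ['hobo', 'hoho'])` keeps "hobo" ("fxgx" becomes `(.)(.)(.)\2`, and h/o/b are all distinct) but filters out "hoho", where the cipher letters f and g would both map to h.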

re.findall regex hangs or very slow

My input file is a large txt file with concatenated texts I got from an open text library. I am now trying to extract only the content of the book itself and filter out other stuff such as disclaimers etc. So I have around 100 documents in my large text file (around 50 mb).
I then have identified the start and end markers of the contents themselves, and decided to use a Python regex to find me everything between the start and end marker. To sum it up, the regex should look for the start marker, then match everything after it, and stop looking once the end marker is reached, then repeat these steps until the end of the file is reached.
The following code works flawlessly when I feed a small, 100kb sized file into it:
import codecs
import re

outfile = codecs.open("outfile.txt", "w", "utf-8-sig")
inputfile = codecs.open("infile.txt", "r", "utf-8-sig")
filecontents = inputfile.read()
for result in re.findall(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK.*?\n(.*?)END\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK', filecontents, re.DOTALL):
    outfile.write(result)
outfile.close()
When I use this regex operation on my larger file however, it will not do anything, the program just hangs. I tested it overnight to see if it was just slow and even after around 8 hours the program was still stuck.
I am very sure that the source of the problem is the
(.*?)
part of the regex, in combination with re.DOTALL.
When I use a similar regex on smaller distances, the script will run fine and fast.
My question now is: why is this just freezing up everything? I know the texts between the delimiters are not small, but a 50mb file shouldn't be too much to handle, right?
Am I maybe missing a more efficient solution?
Thanks in advance.
You are correct in thinking that using the sequence .*, which appears more than once, is causing problems. The issue is that the solver is trying many possible combinations of .*, leading to a result known as catastrophic backtracking.
The usual solution is to replace the . with a character class that is much more specific, usually the production that you are trying to terminate the first .* with. Something like:
`[^\n]*(.*)`
so that the capturing group can only match from the first newline to the end. Another option is to recognize that a regular expression solution may not be the best approach, and to use either a context-free parser (such as pyparsing), or to first break up the input into smaller, easier-to-digest chunks (for example, with corpus.split('\n')).
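The chunking suggestion can be sketched with no backtracking at all, assuming the Gutenberg markers each sit on their own line (`extract_bodies` is an illustrative name, not an existing API):

```python
import re

START = re.compile(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK')
END = re.compile(r'END\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK')

def extract_bodies(lines):
    # Stream line by line, toggling a flag at the start/end markers,
    # so no regex ever runs across the whole 50 MB string.
    inside, body = False, []
    for line in lines:
        if not inside and START.search(line):
            inside = True
        elif inside and END.search(line):
            inside = False
            yield ''.join(body)
            body = []
        elif inside:
            body.append(line)
```

Since file objects iterate lazily line by line, you can pass the open file directly to `extract_bodies` and write each yielded body out as it is found.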
Another workaround to this issue is adding a sane limit to the number of matched characters.
So instead of something like this:
[abc]*.*[def]*
You can limit it to 1-100 instances per character group.
[abc]{1,100}.{1,100}[def]{1,100}
This won't work for every situation, but in some cases it's an acceptable quickfix.

algorithm for testing multiple substrings in multiple strings

I have several million strings, X, each with fewer than 20 or so words. I also have a list of several thousand candidate substrings C. For each x in X, I want to see if there are any strings in C that are contained in x. Right now I am using a naive double for loop, but it's been a while and it hasn't finished yet... Any suggestions? I'm using Python, if anyone knows of a nice implementation, but links for any language or general algorithms would be nice too.
Encode one of your sets of strings as a trie (I recommend the bigger set). Lookup time should be faster than an imperfect hash and you will save some memory too.
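A dict-of-dicts sketch of the trie idea, with the candidate substrings as the encoded set (`Trie` and `contains_any` are illustrative names, not a library API):

```python
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker

    def contains_any(self, text):
        # Try every start position; walk the trie until a mismatch,
        # reporting a hit as soon as an end-of-word marker is reached.
        for i in range(len(text)):
            node = self.root
            for ch in text[i:]:
                if ch not in node:
                    break
                node = node[ch]
                if '$' in node:
                    return True
        return False

trie = Trie()
for candidate in ['bla', 'fubar']:
    trie.insert(candidate)
```

This is still O(len(x) * longest candidate) per string in the worst case; Aho-Corasick (mentioned below) removes the restart-at-every-position cost, but the trie alone already skips the per-candidate loop.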
It's gonna be a long while. You have to check every one of those several million strings against every one of those several thousand candidate substrings, meaning that you will be doing (several million * several thousand) string comparisons. Yeah, that will take a while.
If this is something that you're only going to do once or infrequently, I would suggest using fgrep. If this is something that you're going to do often, then you want to look into implementing something like the Aho-Corasick string matching algorithm.
If your x in X only contains words, and you only want to match whole words, you could do the following:
Insert your keywords into a set, which makes each lookup effectively constant time, and then check for every word in x whether it is contained in that set.
like:
keywords = set(['bla', 'fubar'])
for x in X:
    for w in x.split(' '):
        if w in keywords:
            pass  # do what you need to do
A good alternative would be to use Google's RE2 library, which uses automata theory to produce efficient matchers. (http://code.google.com/p/re2/)
EDIT: Be sure you use proper buffering and something in a compiled language; that makes it a lot faster. If it's less than a couple of gigabytes, it should work with Python too.
You could try to use a regex:
import re

subs = re.compile('|'.join(re.escape(c) for c in C))  # escape, in case candidates contain regex metacharacters
for x in X:
    if subs.search(x):
        print('found')
Have a look at http://en.wikipedia.org/wiki/Aho-Corasick. You can build a pattern-matcher for a set of fixed strings in time linear in the total size of the strings, then search in text, or multiple sections of text, in time linear in the length of the text + the number of matches found.
Another fast exact pattern matcher is http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
