python/regex copy paragraphs in order to another txt document - python

I'm working on what I initially thought would be a pretty simple program. Essentially, it should find key words then copy that paragraph to another document. What I want to do is take content from document 1 (both are .txt files) and re-order the paragraphs into a desired order.
I think I've written my Python part correctly, as it works with other snippets (or seems to), but the regex part (admittedly I'm very new to this) for some reason does not work.
I've tried a number of things and searched all through Stack Overflow. What I have currently "catches" almost the entire txt file instead of just the paragraph. Besides catching most of the document, it also catches paragraphs that do not contain the target term (in this case, "discussing").
I appreciate all help in advance.
import re
def write_function():
    with open('minnar.txt','r') as rf, open('regexoutput.txt', 'a') as wf:
        content = rf.read()
        matches = target.findall(content)
        print(matches)
        for match in matches:
            wf.write(match + '\n \n')
target = re.compile('([^\']*(?=discussing)[^\']*)')
write_function()

If by "paragraph" you mean the text between single quotes, then the regex should be as follows:
\'([^']+)\'
https://pythex.org/?regex=%5C%27(%5B%5E%27%5D%2B)%5C%27&test_string=This%20is%20%27the%20thing%27%20that%20I%20talked%20about.%20And%20I%20think%20this%20%27should%20be%20the%20one%20that%20they%20expected%27&ignorecase=0&multiline=0&dotall=1&verbose=0
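As a rough sketch of how that pattern could be combined with the keyword check from the question: the helper name quoted_paragraphs_with and the sample text are made up for illustration, and the approach assumes that single quotes only ever delimit paragraphs (an apostrophe inside a paragraph would break the matching).

```python
import re

# The pattern from the answer: capture everything between a pair of single quotes.
quoted = re.compile(r"'([^']+)'")

def quoted_paragraphs_with(text, keyword):
    """Return the quoted spans in `text` that contain `keyword`, in document order."""
    return [span for span in quoted.findall(text) if keyword in span]

# Hypothetical sample input, not taken from minnar.txt.
sample = "'We are discussing the plan.' Filler. 'Nothing relevant here.' 'Still discussing it.'"
matches = quoted_paragraphs_with(sample, "discussing")
print(matches)
```

Filtering in Python after findall() keeps the regex simple; each selected span can then be written to the output file in whatever order is needed.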

Related

Rewrite a specific portion of a text file in python

(1) I am using Python and would like to create a function that rewrites a portion of a text file. Referencing the sample example below, I would like to be able to delete everything from [Variables] onwards and write new content from that position. I can't figure out how to achieve this using any of seek(), truncate() and/or tell().
I'm thinking I may have to read and store the file's contents up to [Variables] and write that back in before appending the new content. Is there a better way to go about this?
(2) Bonus question: How would I do this if there was content beyond the variables section that I wanted to remain unchanged? This is currently not required, but it would be helpful to know for the future.
Sample Text File:
"[Log]
This happened
That happened
etc
[Variables]
Animals: [Dog, Cat]
Number: 4"
You can try to use regex:
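import re
string = text  # text holds the contents of the file
word = '[Variables]'
# The regex pattern to match all characters on and after '[Variables]';
# re.escape() stops the square brackets from being read as a character class
pattern = re.escape(word) + ".*"
# Remove '[Variables]' and everything after it; re.DOTALL lets '.' match newlines too
string = re.sub(pattern, '', string, flags=re.DOTALL)
print(string)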
If text is the sample shown in your question, the output of the code will be:
"[Log]
This happened
That happened
etc"
In order to add new text at the end you just need to concatenate a new string to the existing one like:
string += "Some Text"
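For the bonus question (keeping content that follows the section), one possible sketch is shown here. It assumes every section starts with a header line beginning with "[" and that the header line ends with a newline; replace_section and the "[Footer]" section are hypothetical names used only for illustration.

```python
import re

def replace_section(text, header, new_body):
    """Replace the body under `header`, up to the next [Section] header or the end of text."""
    pattern = re.compile(
        r'(^' + re.escape(header) + r'[^\n]*\n)'  # group 1: the header line itself
        r'.*?'                                    # the old body, matched lazily
        r'(?=^\[|\Z)',                            # stop before the next header line or at the end
        flags=re.DOTALL | re.MULTILINE,
    )
    # A function is used as the replacement so backslashes in new_body are taken literally.
    return pattern.sub(lambda m: m.group(1) + new_body, text, count=1)

sample = (
    "[Log]\nThis happened\nThat happened\netc\n"
    "[Variables]\nAnimals: [Dog, Cat]\nNumber: 4\n"
    "[Footer]\nkeep me\n"
)
updated = replace_section(sample, "[Variables]", "Animals: [Bird]\nNumber: 7\n")
print(updated)
```

To apply this to a file, read the whole file into a string, call the function, and write the result back; for small configuration-style files this is usually simpler than juggling seek() and truncate().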

highlighting words in an docx file using python-docx gives incorrect results

I would like to highlight specific words in an MS Word document (here given as negativList) and leave the rest of the document as it was before. I have tried to adapt the code from this one, but I cannot get it to run as it should:
from docx.enum.text import WD_COLOR_INDEX
from docx import Document
import pandas as pd
import copy
import re
doc = Document(docxFileName)
negativList = ["king", "children", "lived", "fire"] # some examples
for paragraph in doc.paragraphs:
    for target in negativList:
        if target in paragraph.text: # it is worth checking in detail ...
            currRuns = copy.copy(paragraph.runs) # deep copy as we delete/clear the object
            paragraph.runs.clear()
            for run in currRuns:
                if target in run.text:
                    words = re.split('(\W)', run.text) # split into words in order to be able to color only one
                    for word in words:
                        if word == target:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = WD_COLOR_INDEX.PINK
                        else:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = None
                else: # our target is not in it so we add it unchanged
                    paragraph.runs.append(run)
doc.save('output.docx')
As an example, I am using this text (in a Word .docx file):
CHAPTER 1
Centuries ago there lived --
"A king!" my little readers will say immediately.
No, children, you are mistaken. Once upon a time there was a piece of
wood. It was not an expensive piece of wood. Far from it. Just a
common block of firewood, one of those thick, solid logs that are put
on the fire in winter to make cold rooms cozy and warm.
There are multiple problems with my code:
1) The first sentence works, but the second sentence appears twice. Why?
2) The formatting somehow gets lost in the part where I highlight. I would probably need to copy the properties of the original run into the newly created ones, but how do I do this?
3) I lose the terminal "--".
4) In the highlighted last paragraph, the "cozy and warm" is missing ...
What I need is either a fix for these problems, or maybe I am overthinking it and there is a much easier way to do the highlighting (something like doc.highlight({"king": "pink"}), but I haven't found anything in the documentation)?
You're not overthinking it, this is a challenging problem; it is a form of the search-and-replace problem.
The target text can be located fairly easily by searching Paragraph.text, but replacing it (or in your case adding formatting) while retaining other formatting requires access at the Run level, both of which you've discovered.
There are some complications though, which is what makes it challenging:
There is no guarantee that your "find" target string is located entirely in a single run. So you will need to find the run containing the start of your target string and the run containing the end of your target string, as well as any in-between.
This might be aided by using character offsets, like "King" appears at character offset 3 in '"A king!" ...', and has a length of 4, then identifying which run contains character 3 and which contains character (3+4).
Related to the first complication, there is no guarantee that all the runs in which the target string partly appears are formatted the same. For example, if your target string was "a bold word", the updated version (after adding highlighting) would require at least three runs, one for "a ", one for "bold", and one for " word" (btw, which run each of the two space characters appear in won't change how they appear).
If you accept the simplification that the target string will always be a single word, you can consider the simplification of giving the replacement run the formatting of the first character (first run) of the found target runs, which is probably the usual approach.
So I suppose there are a few possible approaches, but one would be to "normalize" the runs of each paragraph containing the target string, such that the target string appeared within a distinct run. Then you could just apply highlighting to that run and you'd get the result you wanted.
To be of more help, you'll need to narrow down the problem areas and provide specific inputs and outputs. I'd start with the first one (perhaps losing the "--") (in a separate question, perhaps linked from here) and then proceed one by one until it all works. It's asking too much for a respondent to produce their own test case :)
Then you'd have a question like: "I run the string: 'Centuries ago ... --' through this code and the trailing "--" disappears ...", which is a lot easier for folks to reason through.
Another good next step might be to print out the text of each run, just so you get a sense of how they're broken up. That may give you insight into where it's not working.
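The character-offset idea mentioned above could be sketched roughly as follows. This is plain Python that operates on a list of run texts (for example [run.text for run in paragraph.runs]); it assumes the paragraph text is simply the concatenation of its run texts, and runs_spanning is a hypothetical helper name, not part of python-docx.

```python
def runs_spanning(run_texts, target):
    """Locate the first occurrence of `target` across a list of run texts.

    Returns (first_run_index, last_run_index, offset_in_first_run),
    or None if the target does not occur.
    """
    full_text = "".join(run_texts)
    start = full_text.find(target)
    if start == -1:
        return None
    end = start + len(target)  # exclusive end offset
    first = last = offset = None
    position = 0
    for index, text in enumerate(run_texts):
        run_start, run_end = position, position + len(text)
        if first is None and run_start <= start < run_end:
            first, offset = index, start - run_start
        if run_start < end <= run_end:
            last = index
        position = run_end
    return first, last, offset

# Hypothetical run split: Word has broken "king" across two runs.
run_texts = ['"A ki', 'ng!" my little ', 'readers will say immediately.']
location = runs_spanning(run_texts, "king")
print(location)
```

Once the first and last run are known, the runs in between can be split or merged so that the target ends up in a run of its own, which can then be highlighted.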
I know it's not the same library, but using the win32com library you can highlight all instances of a word in a specific range at once.
The code below will highlight all hits.
import win32com.client as win32
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True
word = word.Documents.Open("test.docx")
strage = word.Range(Start=0, End=0) #change this range to shorten the replace
strage.Find.Replacement.Highlight = True
strage.Find.Execute(FindText="the",Replace=2,Format=True)
I faced a similar issue where I was supposed to highlight a set of words in a document. I modified certain parts of the OP's code and now I am able to highlight the selected words correctly.
As OP said in the comments: paragraph.runs.clear() was changed to paragraph.clear().
And I added a few lines to the following part of the code:
    else:
        paragraph.runs.append(run)
to get this:
    else:
        oldRun = paragraph.add_run(run.text)
        if oldRun.text in spell_errors:
            oldRun.font.highlight_color = WD_COLOR_INDEX.YELLOW
While iterating over the currRuns, we extract the text content of the run and add it to the paragraph, so we need to highlight those words again.

Identify a dot in an aiml pattern in Python

In a project of mine, I am trying to identify file names in a given sentence. For example, "Could you please open abc.txt", so I need to fetch the keywords "open" in order to know the kind of action that is expected and I also need to identify the file name, for obvious reasons. A simple AIML tag for this is:
<aiml>
  <category>
    <pattern>* OPEN *</pattern>
    <template>open <star index="2"/></template>
  </category>
</aiml>
Here, in the template tag, I am just passing along the operation to be performed and the file name. My Python code, on the other hand, takes care of performing the required action.
Now the problem is the '.' character. Using that character divides the sentence into 2 parts, (in case of the example I mentioned above, the 2 sentences would be "Could you please open abc" and "txt") which are individually mapped to any of the aiml tags defined. But, in my case I don't want the '.' character to act as a delimiter. Basically, I want to identify file names that may or may not include an extension. Could anyone please help me out with this?
Thanks in advance!
By default AIML allows multi sentence input. This means full stops, exclamation marks and question marks are treated as separators between sentences. For example if you asked:
Good morning. My name is George. How are you today?
this is interpreted as 3 separate inputs. Normally this is a good thing as it means the AIML interpreter can re-use existing patterns for GOOD MORNING, MY NAME IS *, HOW ARE YOU *.
But in your case that's not helping as the full-stop before the extension is causing unwanted splitting. Depending on your AIML interpreter, sentence splitting is done in a pre-processing stage before sending the input to the interpreter. Some AIML interpreters have a configuration file that lets you define the sentence splitting characters, so you may simply be able to remove the full stop from the list of separators.
A better approach may be to pre-process the filenames and replace the full stop with the word DOT. You can then detect this in your pattern * OPEN *.
As a final comment, * OPEN * is a very wide ranging pattern, it will also be invoked if someone says WHAT TIME IS THE SHOP OPEN TODAY, or any other input with the word OPEN in it surrounded by text.
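A minimal sketch of that pre-processing step in Python follows. It assumes a file name looks like a run of word characters, a dot, and an extension of one to five alphanumeric characters (so it would also rewrite things like "3.14"); protect_filenames and restore_filename are hypothetical helper names, not part of any AIML library.

```python
import re

# Assumed shape of a file name: name, dot, short alphanumeric extension.
FILENAME = re.compile(r'\b([\w-]+)\.([A-Za-z0-9]{1,5})\b')

def protect_filenames(sentence):
    """Replace the dot inside file names with the word DOT before sentence splitting."""
    return FILENAME.sub(r'\1 DOT \2', sentence)

def restore_filename(star_text):
    """Turn 'abc DOT txt' (as captured by the wildcard) back into 'abc.txt'."""
    return re.sub(r'\s+DOT\s+', '.', star_text, flags=re.IGNORECASE)

protected = protect_filenames("Could you please open abc.txt")
print(protected)
print(restore_filename("abc DOT txt"))
```

The protected sentence is what gets sent to the AIML interpreter; the Python side then calls restore_filename() on the text returned for the second wildcard before acting on it.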

Need help finding the correct regex pattern for my string pattern

I'm terrible with RegEx patterns, and I'm writing a simple python program that requires splitting lines of a file into a 'content' part and a 'tags' part, and then further splitting the tags parts into individual tags. Here's a simple example of what one line of my file might look like:
The Beatles <music,rock,60s,70s>
I've opened my file and begun reading lines like this:
def Load(self, filename):
    file = open(filename, 'r')
    for line in file:
        # Ignore comments and empty lines..
        if not line.startswith('#') and line.strip():
            #...
Forgive my likely terrible Python, it's my first few days with the language. Anyway, next I was thinking it would be useful to use a regex to break my string into sections - with a variable to store the 'content' (for example, "The Beatles"), and a list/set to store each of the tags. As such, I need a regex (or two?) that can:
Split the raw part from the <> part.
And split the tags part into a list based on the commas.
Finally, I want to make sure that the content part retains its capitalization and inner spacing. But I want to make sure the tags are all lower-case and without white space.
I'm wondering if any of the regex experts out there can help me find the correct pattern(s) to achieve my goals here?
This is a solution that gets around the problem without using regex, by relying on multiple splits.
# This separates the line into the content and the remainder;
# strip() first removes the trailing newline left over from reading the file
content, tagStr = line.strip().split('<')
# This splits the tagStr into individual tags. [:-1] is used to remove the trailing '>'
tags = tagStr[:-1].split(',')
print(content)
print(tags)
The problem with this is that it leaves trailing whitespace after the content.
You can remove this with:
content = content.rstrip()
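If a regex is preferred, a sketch along these lines would also cover the lower-casing and whitespace requirements from the question. It assumes each line has exactly one <...> group at the end, and it reads "without white space" as removing all whitespace inside a tag; parse_line is a hypothetical helper name.

```python
import re

# Content is everything before '<'; tags are everything between '<' and '>'.
LINE = re.compile(r'^(?P<content>[^<]*)<(?P<tags>[^>]*)>\s*$')

def parse_line(line):
    """Return (content, tags) for a line like 'The Beatles <music,rock>', else None."""
    match = LINE.match(line)
    if match is None:
        return None
    # Keep capitalization and inner spacing of the content, trim only the edges.
    content = match.group('content').strip()
    # Lower-case each tag and remove all whitespace from it.
    tags = ["".join(tag.split()).lower() for tag in match.group('tags').split(',')]
    tags = [tag for tag in tags if tag]  # drop empty entries such as 'a,,b'
    return content, tags

parsed = parse_line("The Beatles <Music, rock ,60s,70s>\n")
print(parsed)
```

Lines that do not match (comments, blank lines, lines without tags) come back as None, so the caller can skip them.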

python and pyPdf - how to extract text from the pages so that there are spaces between lines

Currently, if I make a page object of a PDF page with pyPdf and call extractText(), the lines are concatenated together. For example, if line 1 of the page says "hello" and line 2 says "world", the text returned from extractText() is "helloworld" instead of "hello world". Does anyone know how to fix this, or have suggestions for a workaround? I really need the text to have spaces between the lines because I'm doing text mining on this PDF text, and not having spaces between lines kills it....
This is a common problem with PDF parsing. You can also expect trailing dashes that you will have to fix in some cases. I came up with a workaround for one of my projects, which I will describe briefly here:
I used pdfminer to extract XML from PDF and also found concatenated words in the XML. I extracted the same PDF as HTML and the HTML can be described by lines of the following regex:
<span style="position:absolute; writing-mode:lr-tb; left:[0-9]+px; top:([0-9]+)px; font-size:[0-9]+px;">([^<]*)</span>
The spans are positioned absolutely and have a top-style that you can use to determine if a line break happened. If a line break happened and the last word on the last line does not have a trailing dash you can separate the last word on the last line and the first word on the current line. It can be tricky in the details, but you might be able to fix almost all text parsing errors.
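As a rough illustration of that idea, the sketch below walks over spans in the format quoted above, joins their text with spaces, and glues a word back together when the previous line ended in a dash. It assumes the HTML matches that span format exactly; join_spans, make_span and the sample spans are made up for illustration, and genuinely hyphenated words at a line end would lose their hyphen.

```python
import re

# The span pattern quoted above; group 1 is the top offset, group 2 the text.
SPAN = re.compile(
    r'<span style="position:absolute; writing-mode:lr-tb; left:[0-9]+px; '
    r'top:([0-9]+)px; font-size:[0-9]+px;">([^<]*)</span>'
)

def join_spans(html):
    """Join span texts with spaces, re-gluing words hyphenated across a line break."""
    pieces = []
    previous_top = None
    for top, text in SPAN.findall(html):
        text = text.strip()
        if not text:
            continue
        line_changed = previous_top is not None and top != previous_top
        if line_changed and pieces[-1].endswith('-'):
            # Trailing dash before a line break: drop the dash and glue the halves.
            pieces[-1] = pieces[-1][:-1] + text
        else:
            pieces.append(text)
        previous_top = top
    return ' '.join(pieces)

def make_span(top, text):
    """Build a span in the assumed format (used only for the sample below)."""
    return ('<span style="position:absolute; writing-mode:lr-tb; left:10px; '
            'top:%dpx; font-size:12px;">%s</span>' % (top, text))

sample = make_span(100, 'hello') + make_span(115, 'world of text min-') + make_span(130, 'ing')
joined = join_spans(sample)
print(joined)
```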
Additionally, you might want to run a dictionary library like enchant over your text and find errors. If the fix suggested by the dictionary is the error word with a space inserted somewhere, the error word is likely a parsing error and can be fixed with the dictionary's suggestion.
Parsing PDF sucks and if you find a better source, use it.
