I'm trying to change the text between the curly brackets from the following string:
s = "As soon as {female_character:Aurelia|Aurelius} turned around the corner, {female_character:she|he} remembered that it was the wrong way and would eventually end in a cul-de-sac. Spinning around, {female_character:Aurelia|Aurelius} tried to run back out, but the way was already blocked by the vendor. In this dark alley way, nobody would see or care what happened to some poor beggar turned thief. Should {female_character:Aurelia|Aurelius} put up a fight in hopes of lasting long enough to escape or give up now and trust to the mercy of the vendor?"
My question is this, how do I apply Python logic to the text in that string so that {female_character:Aurelia|Aurelius} in the text applies the logic of:
if (whatever is on the left side of the colon) == True:
(replace {female_character:Aurelia|Aurelius} with the option on the left side of the |)
else:
(replace {female_character:Aurelia|Aurelius} with the option on the right side of the |)
A couple of other points to note: the string is pulled from a JSON file and there will be many similar texts. Additionally, some of the braces will have braces within braces, like so: {strong_character:is big for his age|{small_character:although small for his age, is a very quick warrior|although average size, is a skilled warrior}}
As I'm sure anyone can tell, I'm still new to coding and am trying to learn Python. So I apologize in advance for any ignorance on my part.
You can use a regular expression to locate the variables and their text replacements. Regular expressions support grouping, so you can grab both True and False in separate groups, and then, depending on the current value of the found variable, replace the entire match with the correct group.
With nested expressions it gets a bit harder, though. The best approach is to construct the regex in such a way that it will not match the outer level of nesting. The first time around, the inner braced expressions will be replaced by plain text, and then a second loop will match and change the rest.
So it may take more than one replacement loop, but how many then? That depends on the number of nesting braces. You could set the loop to a 'surely large enough' number such as 10, but this has several disadvantages. For instance, you need to be sure you don't accidentally nest more than 10 times; and if you have a sentence with only one level of braces and no nesting, it will still loop 9 times more, doing nothing at all.
One way to counter this is by counting the number of nested braces. I think my findall regex does this correctly, but I could be wrong there.
import re

def replaceVars(vars, text):
    # One pass per nesting level: each pass resolves the innermost braces.
    passes = len(re.findall(r'\{[^{}]*(?=\{)', text)) + 1
    for _ in range(passes):
        for var in vars:
            # Group 1 is the text used when the variable is True, group 2 when it is False.
            pattern = r'\{' + re.escape(var) + r':([^|{}]+)\|([^|{}]+?)\}'
            text = re.sub(pattern, r'\1' if vars[var] else r'\2', text)
    return text
s = "As soon as {female_character:Aurelia|Aurelius} turned around the corner, {female_character:she|he} remembered that it was the wrong way and would eventually end in a cul-de-sac. Spinning around, {female_character:Aurelia|Aurelius} tried to run back out, but the way was already blocked by the vendor. In this dark alley way, nobody would see or care what happened to some poor beggar turned thief. Should {female_character:Aurelia|Aurelius} put up a fight in hopes of lasting long enough to escape or give up now and trust to the mercy of the vendor? Puppy {strong_character:is big for his age|{small_character:although small for his age, is a very quick warrior|although average size, {female_character:she|he} is a skilled warrior}}"
variables = {"female_character":True, "strong_character":False, "small_character":False}
t = replaceVars(variables,s)
print (t)
results in
As soon as Aurelia turned around the corner, she remembered that it was the wrong way and would eventually end in a cul-de-sac. Spinning around, Aurelia tried to run back out, but the way was already blocked by the vendor. In this dark alley way, nobody would see or care what happened to some poor beggar turned thief. Should Aurelia put up a fight in hopes of lasting long enough to escape or give up now and trust to the mercy of the vendor? Puppy although average size, she is a skilled warrior
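The question of how many passes to run can also be sidestepped by repeating the substitution until the text stops changing. The following is only a minimal sketch of that idea, not the code above: the names INNERMOST, replace_until_stable and flags are made up for this illustration, variable names are assumed to consist of word characters, and a variable missing from the dictionary raises KeyError.

```python
import re

# Matches an innermost {name:left|right}: neither option may contain braces.
INNERMOST = re.compile(r'\{(\w+):([^|{}]*)\|([^|{}]*)\}')

def replace_until_stable(flags, text):
    """Resolve {name:left|right} choices, innermost first, until none remain."""
    def pick(match):
        name, left, right = match.groups()
        return left if flags[name] else right

    while True:
        new_text = INNERMOST.sub(pick, text)
        if new_text == text:      # nothing left to replace
            return text
        text = new_text

flags = {"female_character": True, "strong_character": False}
print(replace_until_stable(
    flags, "{strong_character:strong|weak, but {female_character:she|he} is quick}"))
```

Each pass only ever touches expressions that contain no further braces, so the loop runs once per nesting level plus one final pass that detects that nothing changed.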
I was trying to parse together a script for a movie into a dataset containing two columns 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem but parsing the script is the problem.
The schema of the script goes like this:
Now, this, if copied and pasted into a .txt file is something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue and not what is happening in actions like the last line: "slides his journal back". The dialogues often exceed more than two lines so please do not provide hard-coded solutions for 2 lines only. I think I am thinking about this problem in just one direction, some other method to filter can also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, I am getting no specific endpoints to end the search at. It would be a great help if the answer you give has 2 groups, one with name, the other with the dialogue. Like in the case of colon-separated, my regex was:
pattern = r'(^[a-zA-Z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.
Use https://www.onlineocr.net/ to convert the PDF to text.
It immediately shows the outcome, where names are on the same line as the dialogs,
which could allow for simple processing:
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
I'm not sure whether it will work for longer dialogs.
Another solution is to extract data from the text file that you can download by clicking the "download output file" link. That file is formatted differently. In that file,
10 leading spaces indicate the name and 5 leading spaces the dialog, at least for your sample screenshot.
The regex is
r" {10}(.+)(\n( {5}[^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
It puts whatever starts with ten spaces into group 1 and whatever starts with exactly five spaces into group 2:
the subexpression " {5}[^ ].+\n" denotes a line where the first five symbols are spaces, the sixth symbol is anything but a space, and the remaining symbols up to the end of the line are arbitrary. Since dialogs tend to be multiline, that expression is followed by another plus.
You will have to delete extra white space from the dialogue with additional code and/or regex.
If the number of spaces varies a bit (say 4-6 and 7-14 respectively) but the two sections remain distinct, the regex needs to be adjusted by using the variable repetition operator (curly braces, e.g. {4,6}) or optional spaces (" ?").
r" {7,14}(.+)(\n( {4,6}[^ ].+\n)+)"
The last idea is to use a preexisting list of the names in the play to match them, e.g. (SOCIAL WORKER|JOHN|MARY|ARTHUR). The https://www.onlineocr.net/ website could still be used to help spot and delete actions.
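The indentation idea can be sketched end to end as follows. This is only an illustration under the assumption stated in the regex explanation above, namely that speaker names are indented by exactly ten spaces and dialogue lines by exactly five; the sample text and the names script, pattern and rows are made up, and a real export may use different widths.

```python
import re

# Hypothetical sample: names indented 10 spaces, dialogue 5, actions not indented.
script = (
    "          ARTHUR\n"
    "     No. I just,-- some of it's\n"
    "     personal. You know?\n"
    "\n"
    "          SOCIAL WORKER\n"
    "     I understand. I just want to make\n"
    "     sure you're keeping up with it.\n"
    "\n"
    "She slides his journal back to him. He holds it in his lap.\n"
)

# Group 1: the name line. Group 2: one or more consecutive dialogue lines.
pattern = re.compile(r"^ {10}(\S.*)\n((?: {5}\S.*\n)+)", re.MULTILINE)

rows = [
    (name.strip(), " ".join(line.strip() for line in speech.splitlines()))
    for name, speech in pattern.findall(script)
]
for speaker_name, line_spoken in rows:
    print(speaker_name, "|", line_spoken)
```

Action lines are skipped simply because they match neither indentation, and the join step removes the extra white space from multi-line dialogue.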
In Python, you can use DOTALL:
import re
# mystr is assumed to hold the full text of the script
re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (this means this regex won't recognize speaker names with less than three characters. I did this to avoid false positives in special cases but you might wanna change it. Also check what kind of characters might occur in speaker name (dots, minus, lower case...))
(?=\n) is a lookahead ensuring the speaker name is directly followed by a new line (avoids a false positive if a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead, i.e. a non-capturing expression ensuring the following part is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.
I have a long text like the one below. I need to split based on some words say ("In","On","These")
Below is sample data:
On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammelled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains.
Can this problem be solved with code? I have 1000 rows in a CSV file.
As per my comment, I think a good option would be to use regular expression with the pattern:
re.split(r'(?<!^)\b(?=(?:On|In|These)\b)', YourStringVariable)
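A short sketch of how that pattern could be applied to many rows. The pattern splits at an empty position, which re.split supports from Python 3.7 onward; the names SPLIT_WORDS, BOUNDARY, split_on_words and rows are made up for this illustration, and the single row stands in for the values read from the CSV file.

```python
import re

SPLIT_WORDS = ("On", "In", "These")

# Zero-width split point: a word boundary directly before one of the words,
# but never at the very start of the string.
BOUNDARY = re.compile(
    r'(?<!^)\b(?=(?:' + '|'.join(map(re.escape, SPLIT_WORDS)) + r')\b)'
)

def split_on_words(text):
    """Split text in front of each whole-word occurrence of a split word."""
    return [part.strip() for part in BOUNDARY.split(text)]

rows = [
    "On the other hand, we denounce. These cases are simple. "
    "In a free hour, every pleasure is welcome.",
]
split_rows = [split_on_words(row) for row in rows]
print(split_rows[0])
```

The match is case-sensitive, so a lowercase "in" or "on" inside a sentence does not trigger a split.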
Yes, this can be done in Python. You can load the text into a variable and use the built-in split method of strings. For example:
with open(filename, 'r') as file:
    lines = file.read()
    lines = lines.split('These')
    # lines is now a list of strings split wherever the string 'These' was encountered
To find whole words that are not part of larger words, I like using the regular expression:
[^\w]word[^\w]
Sample python code, assuming the text is in a variable named text:
import re
exp = re.compile(r'[^\w]in[^\w]', flags=re.IGNORECASE)
all_occurrences = list(exp.finditer(text))
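One thing to be aware of with [^\w]word[^\w] is that it needs an actual character on both sides, so it misses the word at the very start or end of the text and includes the neighbouring characters in the match. The word-boundary anchor \b is zero-width and avoids both. A small comparison, using a made-up sample sentence:

```python
import re

text = "In the end, pain is not in vain"

padded = re.compile(r'[^\w]in[^\w]', flags=re.IGNORECASE)
bounded = re.compile(r'\bin\b', flags=re.IGNORECASE)

# The padded form needs a non-word character before and after the word.
padded_hits = [m.group() for m in padded.finditer(text)]
# The \b form matches at string edges and returns only the word itself.
bounded_hits = [m.group() for m in bounded.finditer(text)]

print(padded_hits)
print(bounded_hits)
```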
I'm trying to account for the possibility of any USD denomination and I came up with this:
\$\d+\.?\,?\d+\.?\d+
This works for entries like $10,500.23, $1,050.23, $105.23, $105, $10, $1
But won't work for things like $.23
I tried using \d+? instead of just \d? but that doesn't seem to work either (maybe there is a special way of handling this that I'm unaware of?)
The + symbol means one or more, whereas the ? symbol means 0 or 1. If you add an OR statement (|) you can have it work with either, making the completed (and fully functional) statement \$\d+|\d?\.?\,?\d+\.?\d+
One potential issue with this is that the regex doesn't catch amounts with only one digit after the decimal point; I'm not sure whether that matters for your implementation.
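Note that | has the lowest precedence in a regex, so in the pattern above everything after the | forms a second branch that does not require the $ sign. As a possible alternative, here is a sketch of a pattern that makes the whole-dollar part optional and allows any number of decimal digits. The name AMOUNT and the sample strings are made up for this illustration, and the pattern assumes thousands are grouped with commas.

```python
import re

AMOUNT = re.compile(
    r'\$(?=\.?\d)'                  # "$" must be followed by a digit or ".digit"
    r'(?:\d{1,3}(?:,\d{3})+|\d+)?'  # optional dollars, with or without commas
    r'(?:\.\d+)?'                   # optional cents
)

samples = ["$10,500.23", "$1,050.23", "$105.23", "$105", "$10", "$1", "$.23", "$1.5"]
for sample in samples:
    print(sample, bool(AMOUNT.fullmatch(sample)))

print(AMOUNT.findall("Paid $.23 and $1,050.23 today"))
```

The lookahead is what prevents a bare "$" from matching now that both the dollar and the cent parts are optional.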
Here's the skinny: how do you make a character set match NOT a previously captured character?
r'(.)[^\1]' # doesn't work
Here's the uh... fat? It's part of a (simple) cryptography program. Suppose "hobo" got coded to "fxgx". The program only gets the encoded text and has to figure what it could be, so it generates the pattern:
r'(.)(.)(.)\2' # 1st and 3rd letters *should* be different!
Now it (correctly) matches "hobo", but also matches "hoho" (think about it!). I've tried stuff like:
r'(.)([^\1])([^\1\2])\2' # also doesn't work
and MANY variations but alas! Alack...
Please help!
P.S. The work-around (which I had to implement) is to just retrieve the "hobo"s as well as the "hoho"s, and then just filter the results (discarding the "hoho"s), if you catch my drift ;)
P.P.S Now I want a hoho
VVVVV THE ANSWER VVVVV
Yes, I re-re-read the documentation and it does say:
Inside the '[' and ']' of a character class, all numeric escapes are
treated as characters.
As well as:
Special characters lose their special meaning inside sets.
Which pretty much means (I think) NO, you can't do anything like:
re.compile(r'(.)[\1]') # Well you can, but it kills the back-reference!
Thanks for the help!
1st and 3rd letters should be different!
This cannot be detected using a regular expression in the formal sense (this is not specific to Python's implementation). More specifically, it can't be detected using an automaton without memory. You'll have to use a different kind of automaton.
The kind of grammar you're trying to discover (reduplication) is not regular. Moreover, it is not context-free.
Automata are the mechanism that allows regular-expression matching to be so efficient.
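In practice, Python's re module goes beyond formal regular expressions: it supports backreferences, and a backreference may appear inside a negative lookahead even though it cannot appear inside a character class. A sketch for the "hobo" pattern, where the name DISTINCT is made up for this illustration:

```python
import re

# (?!\1) rejects a 2nd letter equal to the 1st;
# (?!\1|\2) rejects a 3rd letter equal to the 1st or the 2nd.
DISTINCT = re.compile(r'(.)(?!\1)(.)(?!\1|\2)(.)\2')

for word in ("hobo", "hoho", "oooo"):
    print(word, bool(DISTINCT.fullmatch(word)))
```

Each lookahead is placed directly before the group it constrains, so a pattern generator can emit one lookahead per new letter, listing all previously captured groups.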
Background
I am using the handlebars templating language to create documents. In order to create the JSON-file to populate a certain handlebars file with values, I want to extract all handlebars expressions in that file, using a Python script. The Python parse module seems right for the job, since it has a findall function, which finds all occurrences of a certain pattern.
Problem
Parse uses braces to enclose the pattern to be searched for, e.g.
parse("One {} three", "One two three")
yields
<Result ('two',) {}>
However, the patterns I am looking for are triple-stashed, like so
{{{<some expression>}}}
which would force me to escape the braces. I tried using backslash, i.e.
parse("One \{\{\{{}\}\}\} three", "One {{{some_number}}} three")
in order to extract
some_number
but this does not work. Is there another way?
Honestly, this is probably not too useful, but I have a terrible hack that you might be able to use. I discovered one day that I could do a kind of partial application (typically inside a for-loop) by calling format on strings with nested braces. Here's an example:
In [1]: url = "http://{hostname}:{{port}}/{{{resource}}}"
In [2]: url.format(hostname="localhost", resource="new_var_name")
Out[2]: 'http://localhost:{port}/{new_var_name}'
The triple braces have a set of escaped braces inside them whereas the double braces just get the outer set peeled away. I never used triple braces because why would you want to rename your variable unless you were doing some dirty dynamic programming stuff, but I digress...
Anyway, this gets real bizarre when you go to four braces (but I usually just keep it at ones and twos):
In [3]: url = "http://{hostname}:{{port}}/{{{endpoint}}}/{{{{resource}}}}"
In [4]: url.format(hostname="localhost", endpoint="wut").format(port=1, wut="hey")
Out[4]: 'http://localhost:1/hey/{resource}'
In [5]: url.format(hostname="localhost", endpoint="wut").format(port=1, wut="hey").format(
...: resource="thatsright")
Out[5]: 'http://localhost:1/hey/thatsright'
Anyway, it occurs to me that if you could drop the 3-braces down to twos, then you'd probably be in business and get what you need by just calling format on the whole resulting string (no idea how performant that would be).
I used this hack to write some dirty code that you had to squint real hard at to understand what it was doing because it was a string that was like {}{}{{}}{{}} and then I'd interpolate the singles and save the latter ones for passes inside a for-loop. I felt guilty about that code. I still do, but I also still feel good about stumbling upon the technique.
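If the goal is only to list the triple-stash expressions in a template, a plain regular expression may be enough. This is a swapped-in technique that uses the standard re module rather than parse; the names HANDLEBARS and extract_expressions and the sample template are made up for this illustration, and the sketch assumes expressions do not themselves contain "}}}".

```python
import re

# Three literal opening braces, the expression (lazy), three closing braces.
HANDLEBARS = re.compile(r'\{\{\{\s*(.*?)\s*\}\}\}')

def extract_expressions(template):
    """Return the contents of every {{{...}}} in the template, in order."""
    return HANDLEBARS.findall(template)

template = "One {{{some_number}}} three {{{ other.value }}}"
print(extract_expressions(template))
```

The resulting names could then serve as the keys of the JSON file that populates the template.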