Regex fetch text align-left

I'm new to the Regex world and I've browsed many sites without finding what I'm looking for.
I have a file from which I need to fetch the address. The address is left-aligned on the page (there is other text on the same line to its right).
Some information on multiple line (6)
that I don't need and can't paste because
it contains some personal information.
So imagine a lot of text here...
So imagine a lot of text here...
So imagine a lot of text here...
Sold To                  Bill To
Some Cie                 Some Other Cie
1111 chemin some-road    2222 chemin some-other-road
City-Here QC J0Q 1W0     Other City-Here QC J0Q 1W0
Canada                   Canada
I need to fetch the text in the 'Sold To' side.
I tried to use the \r but it returns nothing!
I don't know how to fetch the text from the start of the line until there's a bunch of spaces.
Ex: Some Cie (if there is more than one space, go to the next line)
Then I have: Sold\sTo(?=\s{2,100}) but it won't work, while (?=\s{2, 100}) returns everything!
I saw this: ^((?:\S+\s+){2}\S+).*, which is very close to what I want, but I don't understand the whole thing. I would like to match from 2 to 5 words.
Then I have this: ^([A-Za-z0-9-]*)(?=\s{2,100}) which I thought would match from the beginning of the line until there are 2 or more spaces.
What am I getting wrong?
I need to do this in pure Regex (no flags allowed).
I'm completely lost. Some guidance would be much appreciated.

You're pretty close on your last attempt. Here's what I came up with:
^.+?(?=[^\S\n]{2,})
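Explanation:
^ - Start of the line
.+ - One or more characters
? - Non-greedy, to give the next part priority, i.e. avoid matching a bunch of spaces
(?=...) - Positive lookahead: what follows must match, but it is not included in the result
[^\S\n] - Any whitespace character except newline (this is like \s minus \n)
{2,} - Two or more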
Matches from the example:
Sold To
Some Cie
1111 chemin some-road
City-Here QC J0Q 1W0
Canada
Try it on Regex101
Simple example in Python:
import re
pattern = re.compile(r'^.+?(?=[^\S\n]{2,})')
with open(filename) as f:
    for line in f:
        m = pattern.match(line)
        if m:
            print(m.group())
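For trying the pattern without a file, here is a minimal self-contained sketch. The sample string and the width of the gap between the two columns are assumptions (the real file may use a different number of spaces or tabs), and left_column is only an illustrative helper name.

```python
import re

# Assumed sample: two columns separated by a run of several spaces.
sample = (
    "Sold To                  Bill To\n"
    "Some Cie                 Some Other Cie\n"
    "1111 chemin some-road    2222 chemin some-other-road\n"
    "City-Here QC J0Q 1W0     Other City-Here QC J0Q 1W0\n"
    "Canada                   Canada\n"
)

# Same pattern as in the answer: lazily match from the start of the line
# up to the first run of two or more non-newline whitespace characters.
pattern = re.compile(r'^.+?(?=[^\S\n]{2,})')

def left_column(text):
    """Return the left-hand cell of every line that has a column gap."""
    cells = []
    for line in text.splitlines():
        m = pattern.match(line)
        if m:
            cells.append(m.group())
    return cells

left = left_column(sample)
print(left)
```

Lines without a gap of at least two whitespace characters are skipped, so single-column lines elsewhere in the file would not show up in the result.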

Related

Python regex fullmatch doesn't work as expected

I have a text file that contains some sentences. I'm checking whether they are valid sentences based on some rules and writing "valid" or "not valid" to a separate text file. My main problem is that when I press Ctrl+F and enter my regex in the search bar, it matches the strings I want to match, but in code it behaves differently. Here is my code:
import re
pattern = re.compile('(([A-Z])[a-z\s,]*)((: ["‘][a-z,!?\.\s]*["’][.,!?])|(; [a-zA-Z\s]*[!.?])|(\s["‘][a-z,.;!?\s]*["’])|([\.?!]))')
text=open('validSentences',"w+")
with open('sentences.txt',encoding='utf8') as file:
    lines = file.readlines()
    for line in lines:
        matches = pattern.fullmatch(line)
        if(matches==None):
            text.write("not valid"+"\n")
        else:
            text.write("valid"+"\n")
    file.close()
The documentation says that fullmatch only matches when the whole string matches, and that's what I'm trying to do, but this code writes "not valid" for every sentence I have. The text file that I have:
How can you say that to me?
As he looked at his reflection in the mirror, he took a deep breath.
He nodded at himself and, feeling braver, he stepped outside the bathroom. He bumped straight into the
extremely tall man, who was waiting by the door.
David said ‘Oh, sorry!’.
The happy pair discussed their future life 2gether and shared sweet words of admiration.
We will not stop you; I promise!
Come here ASAP!
He pushed his chair back and went to the kitchen at 2 pM.
I do not know...
The main character in the movie said: "Play hard. Work harder."
When I enter my regex in VS Code with Ctrl+F, the whole first, second, fourth, seventh and eighth lines are highlighted, so according to the fullmatch() function they should be written as "valid", but they aren't. I need help with this issue.

First, you can remove lines = file.readlines(): it reads the whole file into a list up front (leaving the file handle at the end of the file stream), and you can iterate over the file object directly instead. Then, keep in mind that when using for line in file:, the line variable has a trailing newline, so
Either use line=line.rstrip() to remove the trailing whitespace before running the regex, or
Ensure your pattern ends in \n? (an optional newline), or even \s* (zero or more whitespace characters).
So, a possible solution looks like
with open('sentences.txt',encoding='utf8') as file:
    for line in file:
        matches = pattern.fullmatch(line.rstrip('\n'))
        ...
Or,
pattern = re.compile(r'([A-Z][a-z\s,]*)(?:: ["‘][a-z,!?\.\s]*["’][.,!?]|; [a-zA-Z\s]*[!.?]|\s["‘][a-z,.;!?\s]*["’]|[.?!])\s*')
#...
with open('sentences.txt',encoding='utf8') as file:
    for line in file:
        ....
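The effect of the trailing newline can be seen in isolation with a small sketch. The pattern below is a deliberately simplified stand-in (an assumption, not the asker's full pattern), chosen only so the behaviour of fullmatch() is easy to follow.

```python
import re

# Simplified stand-in pattern (assumption): a capital letter, then lowercase
# letters, whitespace or commas, then one closing punctuation mark.
pattern = re.compile(r'[A-Z][a-z\s,]*[.?!]')

# A line as it comes out of "for line in file": note the trailing newline.
line = "How can you say that to me?\n"

with_newline = pattern.fullmatch(line)            # newline is left over
stripped = pattern.fullmatch(line.rstrip('\n'))   # whole string is consumed

print(with_newline)      # None
print(stripped.group())  # How can you say that to me?
```

An editor search box looks for the pattern anywhere in the line, which is why the same sentence is highlighted there even though fullmatch() rejects it in code.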

Regex for multiple lines separated with "return" and multiple unnecessary spaces

I was trying to parse together a script for a movie into a dataset containing two columns 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem but parsing the script is the problem.
The schema of the script goes like this:
Now, this, if copied and pasted into a .txt file is something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue and not what is happening in actions like the last line: "slides his journal back". The dialogues often exceed more than two lines so please do not provide hard-coded solutions for 2 lines only. I think I am thinking about this problem in just one direction, some other method to filter can also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, I am getting no specific endpoints to end the search at. It would be a great help if the answer you give has 2 groups, one with name, the other with the dialogue. Like in the case of colon-separated, my regex was:
pattern = r'(^[a-zA-z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.

Use https://www.onlineocr.net/ to convert the PDF to text.
It immediately shows the outcome, where the names are on the same line as the dialog,
which could allow for simple processing:
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
Not sure whether it will work for longer dialogs.
Another solution is to extract data from the text file that you can download by clicking the "download output file" link. That file is formatted differently. In that file,
10 leading spaces indicate the name and 5 leading spaces the dialog, at least for your sample screenshot.
The regex is
r" {10}(.+)(\n( {5}[^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
It puts whatever starts with ten spaces into group 1 and whatever starts with exactly five spaces into group 2:
the subexpression " {5}[^ ].+\n" denotes a line where the first five symbols are spaces, the sixth symbol is anything but a space, and the rest of the symbols until the end of the line are arbitrary. Since dialogs tend to be multiline, that expression is followed by another plus.
You will have to delete the extra white space from the dialogue with additional code and/or regex.
If the number of spaces varies a bit (say 7-14 for names and 4-6 for dialog) but each section stays distinct, the regex needs to be adjusted by using the variable repetition operator (curly braces, e.g. {4,6}, with no space after the comma) or optional spaces ( ?).
r" {7,14}(.+)(\n( {4,6}[^ ].+\n)+)"
The last idea is to use a preexisting list of the names in the play to match them, e.g. (SOCIAL WORKER|JOHN|MARY|ARTHUR). The https://www.onlineocr.net/ website could still be used to help spot and delete actions.
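The indentation-based idea can be sketched as follows. The sample text is an assumption about what the downloaded file looks like (names indented by ten spaces, dialog by five, actions not indented); the real output may differ, so the counts would need checking against the actual file.

```python
import re

# Assumed layout: 10 spaces before a name, 5 before dialog, none before actions.
text = (
    "          ARTHUR\n"
    "     Yeah. I mean, that's just--\n"
    "          SOCIAL WORKER\n"
    "     Does my reading it upset you?\n"
    "He leans in.\n"
    "          ARTHUR\n"
    "     No. I just,-- some of it's\n"
    "     personal. You know?\n"
)

# Group 1: the name line; group 2: one or more following dialog lines.
pattern = re.compile(r" {10}(.+)(\n( {5}[^ ].+\n)+)")

pairs = []
for m in pattern.finditer(text):
    name = m.group(1).strip()
    # Join the dialog lines and drop the indentation.
    dialog = " ".join(part.strip() for part in m.group(2).splitlines() if part.strip())
    pairs.append((name, dialog))

for name, dialog in pairs:
    print(name, "->", dialog)
```

Because action lines such as "He leans in." carry no indentation in this assumed layout, they end a dialog block and are never captured.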

In Python, you can use DOTALL:
re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (this means this regex won't recognize speaker names with less than three characters. I did this to avoid false positives in special cases but you might wanna change it. Also check what kind of characters might occur in speaker name (dots, minus, lower case...))
(?=\n) is a lookahead ensuring the speaker name is directly followed by a new line (avoids false positives if a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead, i.e. a non-capturing expression ensuring the following part is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.

python regex matching paragraph(s) starting with labels

I'm trying to match a paragraph or paragraphs which are led by letters. I have tried DOTALL, lookaheads, multiline, etc. and I can't seem to get any of them to work. The string I'm trying to match looks like this:
A-B: Object, procedure:
- Somethings.
- More things, might run over several lines like this where the sentence just keeps on going and going and going and sometimes isn't even a sentence.
- Another line, sometimes not ending with period
- Variable amount of white space at the beginning of new lines
Comment (A-B): sometimes, there are comments which are separated by two \n\n characters like this.*
C. Second object, other procedure:
- More lines.
- Can have various leads (including no ' - ' leading.
- Variable number of lines.
The closest I've come to a match was using '(.+?\n\n|.+?$)' and dotALL (which I realize is sloppy), but even this didn't work because it misses comments or paragraphs separated by more lines but still under the header ([A-Z]?-?[A-Z]).
Ideally I'd like to capture the header or title (A-B:) or (C.) in match.group(1) and the rest of the paragraphs(s) before the next title in match.group(2), but I'd just be happy to capture everything. I tried lookaheads to catch everything between titles, but that misses the last instance which won't have a title at the end.
I'm a newb and I apologize if this has already been answered or if I'm not clear (but I have been looking for the past 2 days without success). Thanks!

so here is my proposed solution for you :)
import re
header = []
nonheader = []
with open('./samplestring.txt') as f:
    content = f.read()
for line in content.splitlines():
    # Header lines look like "A-B:" or "C." once leading whitespace is removed
    if re.match(r'(^[A-Z]?-?[A-Z]:)|(^[A-Z]\.)', line.lstrip()):
        header.append(line)
    else:
        nonheader.append(line)
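Going one step further than two separate lists, the same header test can be used to attach every following line (including comments) to the most recent header. This is a minimal sketch, not the answerer's code; group_sections and the shortened sample are hypothetical.

```python
import re

# Same header test as above: "A-B:" style or "C." style at the line start.
HEADER = re.compile(r'(^[A-Z]?-?[A-Z]:)|(^[A-Z]\.)')

def group_sections(text):
    """Return a list of (header_line, body_text) pairs."""
    sections = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines between paragraphs
        if HEADER.match(stripped):
            sections.append((stripped, []))
        elif sections:
            sections[-1][1].append(stripped)  # belongs to the latest header
    return [(header, "\n".join(body)) for header, body in sections]

# Shortened, made-up sample in the shape described in the question.
sample = """A-B: Object, procedure:
- Somethings.
  - More things.

Comment (A-B): a comment separated by a blank line.
C. Second object, other procedure:
- More lines.
"""

sections = group_sections(sample)
for header, body in sections:
    print(repr(header), repr(body))
```

Working line by line sidesteps the "last section has no following title" problem, since a section simply ends when the input does.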

I ended up giving up on capturing comments and everything after them. I used the following code to capture the letter for each header (group(1)), the text for the header (group(2)), and the text in the paragraph excluding comments (group(3)).
([A-Z]{1,2}|[A-Z]-[A-Z])(?::|.) +(\w.+)\n+((\s*(- *.+))+)
([A-Z]{1,2}|[A-Z]-[A-Z])(?::|.) +
captures the letter (group 1), the colon or period, and the space(s) after that
(\w.+)\n+
captures the text of the header, and the next line(s)
((\s*(- *.+))+)
captures multiple lines starting variably with a space, dash, space, and text
I appreciate all your help with this! :)

You can use
(^[^\n]+)(?:\n *-.+(?:\n.+)*|\n\n.+\n)+
(^[^\n]+) - Match the header line, then repeatedly alternate between
\n *-.+(?:\n.+)* - A non-comment line: starts with whitespace, followed by -, optionally running across multiple lines
\n\n.+\n - Or, match a comment line
(no dotall flag)
https://regex101.com/r/6kle0u/2
This depends on the comment lines always having \n\n before them.

Using regex to split text content into dictionary

I have a text file that follows this format.
LESTER HOLT (00:00:01): Breaking News Tonight: A deadly mass shooting
at the airport. A gunman opens fire at baggage claim in Fort
Lauderdale, witnesses describing scenes of sheer horror. A silent
killer shooting people in the head as they tried to run and hide.
Tonight, a storm of questions. Why did he do it? The suspect, a
passenger with a firearm in his checked bag. New concerns about
airport security before the checkpoint.
(00:00:25): Also breaking tonight the new report from U.S.
intelligence: Vladimir Putin himself ordered the effort to influence
the election, aimed at hurting Clinton and helping Trump win. What the
President-elect is saying after his top-secret briefing.
(00:00:39): And States of Emergency: Millions from coast to coast
paralyzed by a massive winter storm.
(00:00:45): NIGHTLY NEWS begins right now.
I am trying to parse this information into a Python dictionary: a dictionary of speakers, each of which is a dictionary with timecode keys and the content as values. I can't consistently split because of potential information before the timecode (i.e. the speaker name in the first quote), and because the split character : is also part of the timecode itself (00:00:00).
Trying to split according to the regex.
for line in msg.get_payload().split('\n'):
    regex = r'\d{2}:\d{2}:\d{2}'
    test = re.split(regex, line)
    print(test)
    sleep(1)
This appears to split it properly, but it causes me to lose the value I am splitting on (the timecode), which I intend to use as a key. How can I properly split the above content, get the speaker, and then get the timecode as a key and the content as a value? The same speaker may appear again later in the text, and those entries should be appended to that speaker's timecodes.
The output format I am targeting is something along the lines of
{speakers:{'Lester Holt': {'00:00:01':content..., '00:00:25': content...},
'speaker2':{etc,etc,etc} }}
I've tried using the split as mentioned above, but it removes my timecode variable.
Any thoughts and guidance is appreciated.

Don't bother with split. You're trying to get 2-3 pieces of information out of each line, so try the following:
for line in msg.get_payload().split('\n'):
    match = re.search(r'^\s*([^(]*?)\s*\((\d{2}:\d{2}:\d{2})\):\s*(.*)$', line)
    if match:
        (speaker, time, message) = match.groups()
Speaker will be an empty string if none was present on that line.
Regex explanation:
^                      # Start of line
\s*                    # Drop leading whitespace
([^(]*?)               # Capture the speaker if present (non-paren characters)
\s*                    # Drop whitespace between name and time
\(                     # Drop literal open paren
(\d{2}:\d{2}:\d{2})    # Capture time
\):\s*                 # Drop literal close paren, colon and whitespace
(.*)                   # Capture the rest of the line
$                      # End of line
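To get from per-line matches to the nested dictionary the question asks for, something like the sketch below could work. It is an assumption-laden illustration: parse_transcript is a hypothetical helper, lines without a timecode are treated as continuations of the previous entry, and an empty speaker means "same speaker as before".

```python
import re

LINE = re.compile(r'^\s*([^(]*?)\s*\((\d{2}:\d{2}:\d{2})\):\s*(.*)$')

def parse_transcript(text):
    """Build {speaker: {timecode: content}} from transcript text."""
    speakers = {}
    speaker = None
    timecode = None
    for line in text.split('\n'):
        match = LINE.search(line)
        if match:
            name, timecode, message = match.groups()
            if name:              # empty name: keep the previous speaker
                speaker = name
            speakers.setdefault(speaker, {})[timecode] = message
        elif speaker is not None and timecode is not None and line.strip():
            # Continuation line: append to the current timecode's content.
            speakers[speaker][timecode] += ' ' + line.strip()
    return speakers

# Shortened, made-up excerpt in the format from the question.
sample = (
    "LESTER HOLT (00:00:01): Breaking News Tonight: A deadly mass shooting\n"
    "at the airport.\n"
    "(00:00:25): Also breaking tonight the new report from U.S.\n"
    "intelligence.\n"
)

result = parse_transcript(sample)
print(result)
```

If the text opens with a timecode that has no speaker in front of it, the entry lands under the key None, which may or may not be what you want.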

Splitting the message into lines when you need to split it into time-stamped paragraphs is wasteful. re.split keeps the tokens it splits on when the pattern contains a capturing group (see the documentation). Here's my solution:
toks = re.split(r"\((\d\d:\d\d:\d\d)\):", msg.get_payload())[1:]
answer = dict(zip(toks[::2], toks[1::2]))
This creates a dictionary of timestamps and paragraphs. Feel free to use the same approach to split by speaker as well.
Result:
{
'00:00:01': ' Breaking News Tonight: A .....',
'00:00:25': ' Also breaking tonight ......', ....
}
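A self-contained sketch of the same re.split idea, run on a shortened, made-up excerpt instead of msg.get_payload(). The names text and by_time are illustrative only.

```python
import re

# Made-up excerpt; the real input would come from msg.get_payload().
text = (
    "LESTER HOLT (00:00:01): Breaking News Tonight.\n"
    "(00:00:25): Also breaking tonight.\n"
    "(00:00:45): NIGHTLY NEWS begins right now."
)

# The capturing group around the timecode makes re.split keep it in the result.
toks = re.split(r"\((\d\d:\d\d:\d\d)\):", text)

# toks[0] is whatever precedes the first timecode (here the speaker name);
# after that, timecodes and paragraphs alternate.
by_time = {stamp: para.strip() for stamp, para in zip(toks[1::2], toks[2::2])}

print(toks[0].strip())
print(by_time)
```

Stripping each paragraph removes the leading space and trailing newline that the split leaves behind.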

identify new line in regex

I would like to perform some regex on the text from Macbeth.
My text is as follows:
Scena Secunda.
Alarum within. Enter King Malcome, Donalbaine, Lenox, with
attendants,
meeting a bleeding Captaine.
King. What bloody man is that? he can report,
As seemeth by his plight, of the Reuolt
The newest state
My intention is to get the text from Enter to the full-stop.
I am trying this regular expression Enter(.?)*\.
But it is showing no matches. Can anybody fix my regexp?
I am trying it out in this link

Since @Tushar has not explained the issue you had with your regex, I decided to explain it.
Your regex - Enter(.?)*\. - matches a word Enter (literally), then optionally matches any character except a newline 0 or more times, as many as possible, up to the last period.
The problem is that your string contains a newline between the Enter and the period. You'd need a regex pattern to match newlines, too. To force . to match newline symbols, you may use DOTALL mode. However, it won't get you the expected result as the * quantifier is greedy (will return the longest possible substring).
So, to get the substring from Enter till the closest period, you can use
Enter([^.]*)
See this regex demo. If you need no capture group, remove it.
And an IDEONE demo:
import re
p = re.compile(r'Enter([^.]*)')
test_str = "Scena Secunda.\n\nAlarum within. Enter King Malcome, Donalbaine, Lenox, with\nattendants,\nmeeting a bleeding Captaine.\n\n King. What bloody man is that? he can report,\nAs seemeth by his plight, of the Reuolt\nThe newest state"
print(p.findall(test_str)) # if you need the capture group text, or
# print(p.search(test_str).group()) # to get the whole first match, or
# print(re.findall(r'Enter[^.]*', test_str)) # to return all substrings from Enter till the next period
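The difference between the greedy DOTALL approach and the negated character class can be compared side by side. This is a small sketch on a shortened version of the text; the variable names are illustrative.

```python
import re

text = (
    "Alarum within. Enter King Malcome, Donalbaine, Lenox, with\n"
    "attendants,\n"
    "meeting a bleeding Captaine.\n\n"
    "King. What bloody man is that?"
)

# Greedy .* with DOTALL runs to the LAST period in the text.
greedy = re.search(r'Enter.*\.', text, re.DOTALL).group()

# Lazy .*? with DOTALL stops at the first period after "Enter".
lazy = re.search(r'Enter.*?\.', text, re.DOTALL).group()

# A negated class needs no flag: [^.] matches newlines as well.
negated = re.search(r'Enter[^.]*\.', text).group()

print(repr(greedy))
print(repr(lazy))
print(lazy == negated)
```

The negated class is often the simpler choice here because it does not depend on any flag and cannot run past the first period.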
