python regex matching paragraph(s) starting with labels - python

I'm trying to match a paragraph or paragraphs that are led by letters. I have tried DOTALL, lookaheads, MULTILINE, etc., and I can't seem to get one to work. The string I'm trying to match looks like this:
A-B: Object, procedure:
- Somethings.
- More things, might run over several lines like this where the sentence just keeps on going and going and going and sometimes isn't even a sentence.
- Another line, sometimes not ending with period
- Variable amount of white space at the beginning of new lines
Comment (A-B): sometimes, there are comments which are separated by two \n\n characters like this.*
C. Second object, other procedure:
- More lines.
- Can have various leads (including no ' - ' leading.
- Variable number of lines.
The closest I've come to a match was using '(.+?\n\n|.+?$)' and dotALL (which I realize is sloppy), but even this didn't work because it misses comments or paragraphs separated by more lines but still under the header ([A-Z]?-?[A-Z]).
Ideally I'd like to capture the header or title (A-B:) or (C.) in match.group(1) and the rest of the paragraphs(s) before the next title in match.group(2), but I'd just be happy to capture everything. I tried lookaheads to catch everything between titles, but that misses the last instance which won't have a title at the end.
I'm a newb and I apologize if this has already been answered or if I'm not clear (but I have been looking for the past 2 days without success). Thanks!

So here is my proposed solution for you :)
import re

header = []
nonheader = []
with open('./samplestring.txt') as f:
    content = f.read()
for line in content.splitlines():
    # A header starts with a label such as "A-B:" or "C."
    if re.match(r'(^[A-Z]?-?[A-Z]:)|(^[A-Z]\.)', line.lstrip()):
        header.append(line)
    else:
        nonheader.append(line)
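The same line-by-line idea can also keep each header together with the lines that follow it, which is closer to the group(1)/group(2) split asked for in the question. This is only a sketch: the HEADER pattern and the function name group_by_header are assumptions based on the sample text, not part of the answer above.

```python
import re

# Assumed header shape, taken from the sample: "A-B: ..." or "C. ..."
HEADER = re.compile(r'[A-Z](?:-[A-Z])?[:.](?:\s|$)')

def group_by_header(text):
    """Return (header_line, body) pairs; body holds every line up to the next header."""
    sections = []
    for line in text.splitlines():
        stripped = line.strip()
        if HEADER.match(stripped):
            sections.append((stripped, []))
        elif sections and stripped:
            # Non-empty line (bullet, comment, plain text) belongs to the last header
            sections[-1][1].append(stripped)
    return [(head, "\n".join(body)) for head, body in sections]

sample = (
    "A-B: Object, procedure:\n"
    "- Somethings.\n"
    "  - More things.\n"
    "\n"
    "Comment (A-B): a comment.\n"
    "C. Second object, other procedure:\n"
    "- More lines.\n"
    "No dash here.\n"
)
sections = group_by_header(sample)
for head, body in sections:
    print(head)
    print(body)
```

Because the last section simply runs to the end of the text, this avoids the problem of a lookahead that needs a following title.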

I ended up giving up on capturing comments and everything after them. I used the following code to capture the letter for each header (group(1)), the text for the header (group(2)), and the text in the paragraph excluding comments (group(3)).
([A-Z]{1,2}|[A-Z]-[A-Z])(?::|\.) +(\w.+)\n+((\s*(- *.+))+)
([A-Z]{1,2}|[A-Z]-[A-Z])(?::|\.) +
captures the letter (group 1), the colon or period, and the space(s) after that
(\w.+)\n+
captures the text of the header, and the next line(s)
((\s*(- *.+))+)
captures multiple lines starting variably with a space, dash, space, and text
I appreciate all your help with this! :)

You can use
(^[^\n]+)(?:\n *-.+(?:\n.+)*|\n\n.+\n)+
(^[^\n]+) - Match the header line, then repeatedly alternate between
\n *-.+(?:\n.+)* - A non-comment line: starts with whitespace, followed by -, optionally running across multiple lines
\n\n.+\n - Or, match a comment line
(no dotall flag)
https://regex101.com/r/6kle0u/2
This depends on the comment lines always having \n\n before them.
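A minimal way to try this pattern from Python is sketched below. The sample string is made up for illustration, and it assumes re.MULTILINE is needed so that ^ can anchor at each header line. The sample keeps a blank line before each new header, because (?:\n.+)* would otherwise run on into a header that directly follows the bullet lines.

```python
import re

# Pattern from the answer; MULTILINE lets ^ match at the start of every line.
pattern = re.compile(r'(^[^\n]+)(?:\n *-.+(?:\n.+)*|\n\n.+\n)+', re.MULTILINE)

# Hypothetical sample: bullets, a comment after a blank line, then a new header.
sample = (
    "A-B: Object, procedure:\n"
    "- Somethings.\n"
    "  - More things.\n"
    "\n"
    "Comment (A-B): a comment.\n"
    "\n"
    "C. Second object:\n"
    "- More lines.\n"
)

matches = list(pattern.finditer(sample))
headers = [m.group(1) for m in matches]   # header line of each block
blocks = [m.group(0) for m in matches]    # header plus everything under it
print(headers)
```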

Related

Regex for multiple lines separated with "return" and multiple unnecessary spaces

I was trying to parse together a script for a movie into a dataset containing two columns 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem but parsing the script is the problem.
The schema of the script goes like this:
Now, this, if copied and pasted into a .txt file is something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue and not what is happening in actions like the last line: "slides his journal back". The dialogues often exceed more than two lines so please do not provide hard-coded solutions for 2 lines only. I think I am thinking about this problem in just one direction, some other method to filter can also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, I am getting no specific endpoints to end the search at. It would be a great help if the answer you give has 2 groups, one with name, the other with the dialogue. Like in the case of colon-separated, my regex was:
pattern = r'(^[a-zA-Z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.
Use https://www.onlineocr.net/ to convert the PDF to text.
It immediately shows the outcome, where names are on the same line as the dialog,
which could allow for simple processing:
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
I'm not sure whether it will work for longer dialogs.
Another solution is to extract data from the text file that you can download by clicking the "download output file" link. That file is formatted differently. In that file,
10 leading spaces indicate the name and 5 leading spaces the dialog, at least for your sample screenshot.
The regex is
r"          (.+)(\n(     [^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
It puts the line that starts with ten spaces into group 1 and the lines that start with exactly five spaces into group 2.
The subexpression "     [^ ].+\n" denotes a line where the first five symbols are spaces, the sixth symbol is anything but a space, and the remaining symbols up to the end of the line are arbitrary. Since dialogs tend to be multiline, that expression is followed by another plus.
You will have to delete extra white space from dialogue with additional code and/or regex.
If the number of spaces varies a bit (say 4-6 and 7-14 respectively) but the two ranges stay distinct, the regex needs to be adjusted by using the variable repetition operator (curly braces, e.g. {4,6}, with no space after the comma) or optional spaces (?).
r" {7,14}(.+)(\n( {4,6}[^ ].+\n)+)"
The last idea is to use a preexisting list of the names in the play to match them, e.g. (SOCIAL WORKER|JOHN|MARY|ARTHUR). The https://www.onlineocr.net/ website could still be used to help spot and delete actions.
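For the indentation-based approach described above, a rough sketch might look like the following. It assumes names are indented by 7-14 spaces, dialogue by 4-6 spaces, and action lines are not indented at all. The sample text and the names BLOCK and rows are made up for illustration, and the pattern is a slightly anchored variant of the regex above rather than the exact one.

```python
import re

# Assumed layout: name line (7-14 spaces), then one or more dialogue lines (4-6 spaces).
BLOCK = re.compile(r'^ {7,14}(\S.*)\n((?: {4,6}\S.*\n?)+)', re.MULTILINE)

script = (
    "          ARTHUR\n"
    "     Yeah. I mean, that's just--\n"
    "          SOCIAL WORKER\n"
    "     Does my reading it upset you?\n"
    "He leans in.\n"
    "          ARTHUR\n"
    "     No. I just,-- some of it's\n"
    "     personal. You know?\n"
)

rows = []
for match in BLOCK.finditer(script):
    name = match.group(1).strip()
    # Join multi-line dialogue and drop the leading indentation of each line
    dialogue = " ".join(line.strip() for line in match.group(2).splitlines())
    rows.append((name, dialogue))
print(rows)
```

The resulting list of (name, dialogue) pairs can be passed straight to a two-column dataset; the action line is skipped because it has no indentation.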
In Python, you can use DOTALL:
re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (this means this regex won't recognize speaker names with less than three characters. I did this to avoid false positives in special cases but you might wanna change it. Also check what kind of characters might occur in speaker name (dots, minus, lower case...))
(?=\n) is a lookahead ensuring the speaker name is directly followed by a new line (avoids false positive if a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead, i.e. a non-capturing expression ensuring the following part is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.

Python regex to find string ending in double new line without a period

I have a long string like this:
Page Content
Director, Research Center.
Director of Research, Professor
Researcher
Lines end in a double newline. Some end with a period, some don't. I want each line that ends in a double newline to end with a single period and a single newline instead, like this:
Page Content.
Director, Research Center.
Director of Research, Professor.
Researcher.
There are also lines which end with a period and a single newline and they should stay the way they are. What I've tried:
re.sub('(?!\.)\n\n', '.\n', text)
What I'm trying to do is a negative on the period followed by two newlines, or find every single double new line that doesn't have a period right before and replace it with a period and a single newline.
I've tried some other variations, but I always end up with either double period or no changes.
You could use a negative lookbehind instead to assert what is on the left is not a dot. Escape the dot \. to match it literally.
(?<!\.)\n\n
Or to match an optional \r you could use a quantifier to repeat a non capturing group:
(?<!\.)(?:\r?\n){2}
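As a sketch of how the lookbehind could be used with re.sub: on its own it only touches double newlines that are not preceded by a period, so lines that already end in a period keep their double newline. The second substitution below is an added assumption to collapse those as well, and the sample text is taken from the question.

```python
import re

text = (
    "Page Content\n\n"
    "Director, Research Center.\n\n"
    "Director of Research, Professor\n\n"
    "Researcher\n\n"
)

# Step 1: add a period where the double newline is not already preceded by one.
with_periods = re.sub(r'(?<!\.)(?:\r?\n){2}', '.\n', text)

# Step 2 (assumption): collapse the double newline after lines that already had a period.
result = re.sub(r'\.(?:\r?\n){2}', '.\n', with_periods)
print(result)
```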
Not very elegant, but it works:
text = text.replace('.\n\n', '\n\n').replace('\n\n', '.\n')
If you insist on using re.sub:
text = re.sub(r'([^.])\.?\n\n', r'\1.\n', text)
This is downright ugly, but works too.

Capture string between \n [string] \n

I'm trying to parse YouTube descriptions of songs to compile into a .csv
Currently I can isolate timecodes, though isolating the song and artist is proving trickier.
First, I catch the whitespace
# catches whitespace
pattern = re.compile(r'\s+')
Second, the timecodes (to make the string simpler to deal with)
# catches timecodes
pattern1 = re.compile(r'[\d\.-]+:[\d.-]+:[\d\.-]+')
then I sub and remove.
I then try to capture all strings between \n, as this is how the tracklist is formatted
songBeforeDash = re.search(r'^([\\n][a-zA-Z0-9]*-[a-zA-Z0-9]*[\\n]*)+$', description)
The format follows \n[string]-[string]\n
Using this excellent visualiser , I've been able to tweak it so it catches the first result, however any subsequent results don't match.
Is this a case of stopping at the first result and not catching the others?
Here's a sample of what I'm trying to catch
\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\nYasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\nsadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n
You can do that with split()
t = '\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\nYasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\nsadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n'
liste = t.split('\n')
liste = liste[1:-1:]
print(liste)
re.search only returns the first match in the string.
What you want is to use re.findall which returns all matches.
EDIT: Each match in your pattern consumes the newline on both sides, so consecutive matches would have to overlap, and regex matches cannot overlap. I would suggest editing the regex so that each match only captures up to the next newline. Consider changing the regex to this:
r'^([\\n][a-zA-Z0-9]*-[a-zA-Z0-9]*)+$'
If what you want is for them to overlap (meaning you want to capture the newlines too), I suggest looking here to see how to capture overlapping regex patterns.
Also, as suggested by #ChatterOne, using the str.split(separator) method would work well here, assuming no other type of information is present.
description.split('\n')
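To get the artist and song as separate groups with re.findall, one option is sketched below. It assumes the description contains real newline characters and that the first hyphen on each line separates artist from song; the pattern and the name tracks are illustrative, not taken from the answers above.

```python
import re

description = (
    '\nmiddleschoolxAso-Cypress\nShopan-Woodnot\nchromonicci-Memories.\n'
    'Yasper-MoveTogether\nFenickxDelayde-Longwayhome\nauv-Rockaway5pm\n'
    'sadtoi-Aires\nGMillsxKyleMcEvoy-Haze\nRuckP-CoffeeBreak\n'
)

# With MULTILINE, ^ and $ anchor at every line, so no newline is consumed
# and consecutive tracks never need to overlap.
tracks = re.findall(r'^([^\n-]+)-(.+)$', description, re.MULTILINE)
for artist, song in tracks:
    print(artist, song)
```

Each tuple can then be written as one row of the .csv.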

Python. How to print a certain part of a line after it had been "re.searched" from a file

Could you tell me how to print this part of the line only '\w+.226.\w.+' ?
Code
VSP = input("VSP number (four digits): ")
a = re.compile(r'\w+.226.\w.+'+VSP)
b=re.search(a, open('Sample.txt').read())
print (b.group())
VSP number (four digits): 1020
10.226.27.60 1020
After I have found the intended line associated with my variable "VSP" in the txt file, how can I exclude it from the output, printing only "10.226.27.60"?
You will need to modify your regex slightly to separate the trailing characters in the IP and the spaces that separate it from VSP. Adding a capture group will let you select the portion with just the IP address. The updated regex looks like this:
r'(\d+\.226\.\S+)\s+' + VSP
\S (uppercase S) matches any non-whitespace, while \s (lowercase s) matches all whitespace. I replaced the first \w with the more specific \d (digits), and . (any character at all) with \. (actual period). The second \w is now \S, but you could use \d+\.\d+ if you wanted to be more specific.
Using the first capture group will give you the IP address:
print(b.group(1))
If you are looking for a single IP address once, not compiling your regex is fine. Also, reading in a small file in its entirety is OK as long as the file is small. If either is not the case, I would recommend compiling the regex and going through the file line by line. That will allow you to discard most lines much faster than using a regex would do.
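A sketch of the line-by-line variant could look like this. The function name find_ip is hypothetical, and the use of re.escape and the trailing \b are added assumptions to keep the user input from being read as regex syntax or matching a longer number.

```python
import re

def find_ip(lines, vsp):
    """Return the IP on the first line that ends with the given VSP number, or None."""
    pattern = re.compile(r'(\d+\.226\.\S+)\s+' + re.escape(vsp) + r'\b')
    for line in lines:
        match = pattern.search(line)
        if match:
            return match.group(1)
    return None

# Hypothetical file contents; an open file object can be passed the same way.
sample_lines = ["10.226.27.59 1019\n", "10.226.27.60 1020\n"]
found = find_ip(sample_lines, "1020")
print(found)
```

With a real file, the call would be wrapped as `with open('Sample.txt') as f: find_ip(f, VSP)`, which reads one line at a time instead of loading the whole file.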
I see you already have an answer. You can also try this regex if you want to separate the two groups by the whitespace:
import re
a = re.compile(r'(.+?)\s+(.+)') # edit: added ? to avoid
# greedy behaviour of first .+
# otherwise multiple spaces after the
# address will be caught into
# b.group(1), as per #Mad comment
b=re.search(a, '10.226.27.60 1020')
print (b.group(0))
print (b.group(1))
print (b.group(2))
or customize the first group regexp to your needs.
Edit:
This was not meant to be a proper answer but more of a comment which I didn't think was readable as such; I am only trying to show group separation using regex, which it seems OP didn't know about or didn't use.
That is why I am not matching .226., because OP can do that. I also removed the file read part, which isn't needed for demonstration. Please read #Mad answer because it's quite complete and in fact also shows how to use groups.

Python REGEX ignore case at the beginning of the sentence and take the rest

I have this kind of results:
ª!è[008:58:049]HTTP_CLI:0 - Line written in...
And I want to ignore all the beginning characters like ª!è and get only: HTTP_CLI:0 - Line written in... but in a simple regex line.
I tried this: ^[\W0-9]* but it matches the extended ASCII characters plus the time instead of ignoring them, which is the opposite of what I want...
Any help?
Thanks!
If you want to get everything after the closing square bracket, no matter what, and skip everything before that you can go with a match like this:
import re
s = "ª!è[008:58:049]HTTP_CLI:0 - Line written in..."
m = re.match(r'^.*?]([\S\s]*)', s)
print(m.group(1))
Prints 'HTTP_CLI:0 - Line written in...'
This expression looks through an arbitrary number of characters before the closing bracket and matches everything after that. The matched group is available with m.group(1)
