Python - Regex - match characters between certain characters

I have a text file and I want to match/findall/parse all characters that are between certain characters ([\n" text to match "\n]). The texts can differ a lot from each other in structure and in the characters they contain (they can contain every possible character).
I posted this question before (sorry for the duplicate) but so far the problem couldn't be solved, so now I am trying to be even more precise about the problem.
The text in the file is built up like this:
test ="""
[
"this is a text and its supposed to contain every possible char."
],
[
"like *.;#]§< and many "" more."
],
[
"plus there are even
newlines
in it."
]"""
My desired output is a list (for example) with each text between the separators as an element, like the following:
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']
I tried to solve it with regex and came up with two solutions, shown here with their output:
my_list = re.findall(r'(?<=\[\n {8}\").*(?=\"\n {8}\])', test)
print (my_list)
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.']
Well, this one was close. It lists the first two elements as it is supposed to, but unfortunately not the third one, as it has newlines within.
my_list = re.findall(r'(?<=\[\n {8}\")[\s\S]*(?=\"\n {8}\])', test)
print (my_list)
['this is a text and its supposed to contain every possible char."\n ], \n [\n "like *.;#]§< and many "" more."\n ], \n [\n "plus there are even\nnewlines\n \n in it.']
Okay, this time every element is included, but the list has only one element in it and the lookahead doesn't seem to be working as I thought it would.
So what's the right regex to use to get my desired output?
Why does the second approach not seem to honor the lookahead?
Or is there an even cleaner, faster way to get what I want (BeautifulSoup or other methods)?
I am very thankful for any help and hints.
I am using Python 3.6.

You should use the DOTALL flag so that . also matches newlines, together with a lazy quantifier and a capturing group:
print(re.findall(r'\[\n\s+"(.*?)"\n\s+\]', test, re.DOTALL))
Output
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even\nnewlines\n\nin it.']
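On the "why" question: the lookahead in the second attempt does run, but [\s\S]* is greedy, so it first consumes everything up to the end of the string and then backtracks only as far as the last closing delimiter. A lazy quantifier stops at the first closing delimiter instead. Here is a minimal sketch of the difference, using a made-up sample string (the four-space indentation and the variable names are assumptions, not the asker's real file):

```python
import re

# Hypothetical sample: two bracketed, quoted texts, the second spanning lines.
sample = '[\n    "first."\n],\n[\n    "second ]\n    with newline."\n]'

# Greedy: .* runs to the end and backtracks to the LAST closing "\n]".
greedy = re.findall(r'\[\s*"(.*)"\s*\]', sample, re.DOTALL)

# Lazy: .*? stops at the FIRST closing "\n]" after each opening [\n".
lazy = re.findall(r'\[\s*"(.*?)"\s*\]', sample, re.DOTALL)

# Collapse newlines and indentation inside each text to a single space.
normalized = [re.sub(r'\s+', ' ', text) for text in lazy]

print(len(greedy))   # 1
print(normalized)    # ['first.', 'second ] with newline.']
```

Note that a text which itself contains a quote directly followed by whitespace and ] would still end the lazy match early; no pattern can tell that apart from a real delimiter without more knowledge of the file format.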

You can use the pattern
(?s)\[[^"]*"(.*?)"[^]"]*\]
to capture every element within the "s inside the brackets:
https://regex101.com/r/SguEAU/1
Then, you can use a list comprehension with re.sub to replace whitespace characters (including newlines) in every captured substring with a single normal space:
test ="""
[
"this is a text and its supposed to contain every possible char."
],
[
"like *.;#]§< and many "" more."
],
[
"plus there are even
newlines
in it."
]"""
output = [re.sub(r'\s+', ' ', m.group(1)) for m in re.finditer(r'(?s)\[[^"]*"(.*?)"[^]"]*\]', test)]
Result:
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']

Related

Python ReGex Pattern Finder

I am trying to get better with ReGex in Python, and I am trying to figure out how I would isolate a specific substring in some text. I have some text, and that text could look like any of the following:
possible_strings = [
"some text (and words) more text (and words)",
"textrighthere(with some more)",
"little trickier (this time) with (all of (the)(values))"
]
With each string, despite the fact that I don't know what's in them, I know it always ends with some information in parentheses. To include examples like #3, where the final pair of parentheses have parentheses in them.
How could I go about using re/ReGex to isolate the text only inside of the last pair of parentheses? So in the previous example, I would want the output to be:
output = [
"and words",
"with some more",
"all of (the)(values)"
]
Any tips or help would be much appreciated!
In Python you can use the third-party regex module, as it supports recursion:
import regex
pat = r'(\((?:[^()]|(?1))*\))$'
regex.findall(pat, '\n'.join(possible_strings), regex.M)
['(and words)', '(with some more)', '(all of (the)(values))']
The regex might be quite complicated for a beginner.
A bit of explanation:
( # opens the 1st capturing group
\( # matches the character (
(?: # opens a non-capturing group
[^()] # 1st alternative: matches a single character that is not ( or )
| # or
(?1) # 2nd alternative: matches the expression defined in the 1st capturing group recursively
) # closes the non-capturing group
* # repeats the non-capturing group zero or more times
\) # matches the character )
) # closes the 1st capturing group
$ # asserts position at the end of a line
For the first two, start matching at an opening bracket, which could be either of these:
"some text (and words) more text (and words)"
           ^                     ^
followed by anything which isn't an opening bracket:
"some text (and words) more text (and words)"
           ^^^^^^^^^^^^^^^^^^^^^^X^^^^^^^^^^^
                                 |- starting at the first (, the match hits
                                    another ( which isn't allowed.
followed by end of line. Only the last () fits "no more ( until end of line".
>>> import re
>>> re.findall(r'\([^(]+\)$', "some text (and words) more text (and words)")
['(and words)']
RegEx is not a good fit for your third example; there's no easy way to pair up the parens, you may have to install and use a different regex engine to get nested structure support. See also
Matching Nested Structures With Regular Expressions in Python
Python: How to match nested parentheses with regex?
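If installing another regex engine is not an option, the last balanced group can also be found without regex by scanning backwards from the final ) and counting nesting depth. A minimal sketch, assuming the parentheses are balanced; last_paren_group is a hypothetical helper name, not something from the answers above:

```python
def last_paren_group(text):
    """Return the contents of the final balanced (...) group, or None."""
    end = text.rfind(')')
    if end == -1:
        return None  # no closing parenthesis at all
    depth = 0
    # Walk backwards from the last ')', tracking how deeply nested we are.
    for i in range(end, -1, -1):
        if text[i] == ')':
            depth += 1
        elif text[i] == '(':
            depth -= 1
            if depth == 0:
                # This '(' pairs with the final ')'.
                return text[i + 1:end]
    return None  # unbalanced input


possible_strings = [
    "some text (and words) more text (and words)",
    "textrighthere(with some more)",
    "little trickier (this time) with (all of (the)(values))",
]
output = [last_paren_group(s) for s in possible_strings]
print(output)  # ['and words', 'with some more', 'all of (the)(values)']
```

Unlike the regex answers, this returns the text without the surrounding parentheses, which is what the question asked for.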

Regex to split text file in python

I am trying to find a way to parse a string of a transcript into speaker segments (as a list). Speaker labels are denoted by the upper-casing of the speaker's name followed by a colon. The problem I am having is some names have a number of non upper-case characters. Examples might include the following:
OBAMA: said something
O'MALLEY: said something else
GOV. HICKENLOOPER: said something else entirely'
I have written the following regex, but I am struggling to get it to work:
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)', mystring)
What I think I have written (and ideally what I want to do) is a command to split the string based on:
1. Find a newline
2. Use positive look-ahead for one or more uppercase characters
3. If upper-case characters are found look for optional characters from the list of periods, apostrophes, single spaces, and digits
4. If these optional characters are found, look for additional uppercase characters.
5. Crucially, find a colon symbol at the end of this sequence.
EDIT: In many cases, the content of the speech will have newline characters contained within it, and possibly colon symbols. As such, the only thing separating the speaker label from the content of speech is the sequence mentioned above.
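Just change (\ |\.|\'|\d) to [\ \.\'\d] or (?:\ |\.|\'|\d). re.split includes the text matched by any capturing group in its result, so the group has to become a character class or a non-capturing group: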
import re
mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"
parse_turns = re.split(r'\n(?=[A-Z]+[\ \.\'\d]*[A-Z]*:)', mystring)
print(parse_turns)
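To see why the capturing group matters, here is a small side-by-side sketch of both patterns on the question's sample string (the variable names with_group and without_group are only for illustration):

```python
import re

mystring = ("OBAMA: said something \nO'MALLEY: said something else \n"
            "GOV. HICKENLOOPER: said something else entirely")

# Original pattern: the (...) inside the lookahead is a capturing group,
# so re.split inserts its last captured text between the pieces.
with_group = re.split(r'\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)', mystring)

# Same logic with a character class: nothing is captured, nothing is inserted.
without_group = re.split(r"\n(?=[A-Z]+[ .'\d]*[A-Z]*:)", mystring)

print(with_group)
# ['OBAMA: said something ', "'", "O'MALLEY: said something else ", ' ',
#  'GOV. HICKENLOOPER: said something else entirely']
print(without_group)
# ['OBAMA: said something ', "O'MALLEY: said something else ",
#  'GOV. HICKENLOOPER: said something else entirely']
```

Because the split happens only at a newline that is followed by a speaker label, newlines and colons inside the speech itself are left alone, unless a line of speech happens to start with something that looks like a label.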
If it's true that the speaker's name and what they said are separated by a colon, then it might be simpler to move away from regex to do your splitting.
list_of_things = []
mystring = "OBAMA: Hi\nO'MALLEY: True Dat\nHUCK FINN: Sure thing\n"
lines = mystring.split("\n")  # first split the string into lines on the \n character
for line in lines:
    if not line:
        continue  # skip the empty string left by the trailing \n
    colon_pos = line.find(":")  # position of the first colon in the line
    speaker, utterance = line[:colon_pos].strip(), line[colon_pos + 1:].strip()
    list_of_things.append((speaker, utterance))
At the end, you should have a neat list of tuples containing speakers, and the things they said.

python regex - characters between certain characters

Edit: I should add that the string in the test is supposed to contain every possible character (i.e. * + $ § € / etc.), so I thought a regex would work best.
I am using regex to find all characters between certain characters ([" and "]). My example goes like this:
test = """["this is a text and its supposed to contain every possible char."],
["another one after a newline."],
["and another one even with
newlines
in it."]"""
The expected output should look like this:
['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']
My code including the regex looks like this:
import re
my_list = re.findall(r'(?<=\[").*(?="\])*[^ ,\n]', test)
print (my_list)
And my outcome is the following:
['this is a text and its supposed to contain every possible char."]', 'another one after a newline."]', 'and another one even with']
So there are two problems:
1) It's not removing "] at the end of a text, as I wanted it to do with (?="\])
2) It's not capturing the third text in brackets, I guess because of the newlines. So far I wasn't able to capture those; when I try .*\n it gives me back an empty string.
I am thankful for any help or hints with this issue. Thank you in advance.
By the way, I am using Python 3.6 on Anaconda/Spyder and the newest regex (2018).
EDIT 2: One Alteration to the test:
test = """[
"this is a text and its supposed to contain every possible char."
],
[
"another one after a newline."
],
[
"and another one even with
newlines
in it."
]"""
Once again I have trouble removing the newlines from it. I guessed the whitespace could be matched with \s, so I thought a regex like this could solve it:
my_list = re.findall(r'(?<=\[\S\s\")[\w\W]*(?=\"\S\s\])', test)
print (my_list)
But that returns only an empty list. How do I get the expected output above from that input?
In case you would also accept a non-regex solution, you can try
result = []
for l in eval(' '.join(test.split())):
    result.extend(l)
print(result)
# ['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']
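Since eval executes whatever the file contains, a safer variation of the same idea is ast.literal_eval from the standard library, which only accepts Python literals. This is a sketch under the same assumption as the answer above, namely that once the whitespace is collapsed the text is valid Python list syntax; texts containing unescaped quotes or backslashes will not parse this way.

```python
import ast

test = '''["this is a text."],
["another one after a newline."],
["and another one even with
newlines
in it."]'''

# Collapse every run of whitespace (including newlines) to one space so
# that the multi-line string literal becomes a valid single-line literal.
collapsed = ' '.join(test.split())

# The collapsed text parses as a tuple of one-element lists; flatten it.
result = [text for sublist in ast.literal_eval(collapsed) for text in sublist]
print(result)
# ['this is a text.', 'another one after a newline.',
#  'and another one even with newlines in it.']
```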
You can try this:
(?<=\[\")[\w\s.]+(?=\"\])
What you missed in your regex is that .* will not match a newline.
P.S. This pattern does not match special characters, but that can be added easily.
This one matches special characters too:
(?<=\[\")[\w\W]+?(?=\"\])
So here's what I came up with:
test = """["this is a text and its supposed to contain every possible char."],
["another one after a newline."],
["and another one even with
newlines
in it."]"""
for i in test.replace('\n', '').replace(' ', ' ').split(','):
    print(i.lstrip(r' ["').rstrip(r'"]'))
Which results in the following being printed to the screen
this is a text and its supposed to contain every possible char.
another one after a newline.
and another one even with newlines in it.
If you want a list of those exact strings, we could modify it to:
newList = []
for i in test.replace('\n', '').replace(' ', ' ').split(','):
    newList.append(i.lstrip(r' ["').rstrip(r'"]'))

Removing phrase/sentence that \r\n is located randomly (R / Python)

How can I remove a phrase or sentence in which \r\n can appear at any position?
For example, I want to remove a sentence like this:
If you are having trouble viewing this message or would like to share
it on a social network, you can view the message online.
But there are many different variations of this sentence like:
If
you are having trouble viewing this message or would like to share
it on a social network, you can view the message online.
or
If you are having trouble
viewing this message or would like to share
it on a social network, you can view the message online.
I tried to specify every variation in regular expressions, but that is only feasible when the sentence or phrase is short.
For example, if I want to remove Please contact me immediately
I can specify Please\r\ncontact me immediately, Please contact\r\nme immediately, Please contact me\r\n immediately and Please contact me\r\nimmediately to remove this sentence. But if I want to remove a longer sentence like my first example, I cannot write every possible variation.
In summary, how can I remove phrases and sentences that have the same words but have \r\n in all different places?
Give this a try.
>>> import re
>>> remove_text = lambda x, y: re.sub(r'\s?\r?\n?'.join(x.split()), "", y)
>>> remove_text("Please contact me immediately", "Hello Please contact\r\nme immediately World")
'Hello  World'
You can also remove extra spaces later.
>>> re.sub(r'\s+', ' ', remove_text("Please contact me immediately", "Hello Please contact\r\nme immediately World"))
'Hello World'
This method has its limitations: because the whitespace between words is optional in the pattern, text such as Pleasecontact meimmediately is treated as a match too.
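A slightly stricter variation is sketched below. It requires at least one whitespace character between words (\s+ covers spaces, \r and \n), which avoids the run-together false match, and it escapes each word so punctuation in the phrase is taken literally. The function name remove_phrase is a hypothetical one chosen for this example.

```python
import re


def remove_phrase(phrase, text):
    """Remove `phrase` from `text`, allowing any whitespace between words."""
    # re.escape keeps characters such as '.' or '?' in the phrase literal.
    pattern = r'\s+'.join(re.escape(word) for word in phrase.split())
    return re.sub(pattern, '', text)


raw = "Hello Please contact\r\nme immediately World"
removed = remove_phrase("Please contact me immediately", raw)

# Collapse the doubled space left behind where the phrase used to be.
cleaned = re.sub(r'\s+', ' ', removed)
print(cleaned)  # Hello World

# Words that are run together are no longer treated as the phrase.
untouched = remove_phrase("Please contact me immediately",
                          "Pleasecontact meimmediately")
print(untouched)  # Pleasecontact meimmediately
```

The same function handles the long sentence from the question, since the pattern is built from the words and does not care where the line breaks fall.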
This regex pattern will find all paragraphs ( as opposed to sentences ):
((?:[^\n\r]+[\n\r])+(?:[^\n\r]+[\n\r])(?=[\n\r]))
Explanation:
Find ( [ 1 or more non-newline characters ] followed by a [ newline character ] ) on 1 or more lines
(?:[^\n\r]+[\n\r])+
Find an additional line which matches the above pattern
(?:[^\n\r]+[\n\r])
Find an additional [ newline character ]
IE: the blank line in between two groups of text
(?=[\n\r])
The 2nd & 3rd groups combined equate to the final line of the paragraph.

Separating words with comma python

I want to grab each word from a text file using a comma as the end-of-word indicator in Python, and also remove the extra quotation marks and whitespace. I also want to make every letter of the word lowercase, then loop to the next word in the text file.
For Example:
Text File:
"Test Word", "The Test", "Word Two", "Word Four", "Alpha", "Bravo", "Charlie"
I am willing to make further clarifications, any help will be appreciated. Thank you
Since you don't have any code to reference, I'll give a high level explanation of what I would do:
Use str.split() with a comma as your delimiter to break up the string into an array of strings.
Since you need to remove both whitespace and quotes, I would use regular expressions via a replace function, re.sub, to adapt these new strings. It would look something like: '\"|\s', replace with "". You can use str.lower() to convert all characters to lower case. Hope that helps.
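The steps above can be sketched roughly as follows. One deliberate difference from the description: instead of re.sub with '\"|\s', which would also delete the space inside "Test Word", this version only strips whitespace and quotes from the ends of each piece. The variable names and the inline sample line are assumptions; in practice the line would be read from the text file.

```python
# Sample line standing in for the contents of the text file.
line = '"Test Word", "The Test", "Word Two", "Word Four", "Alpha", "Bravo", "Charlie"'

words = []
for part in line.split(','):
    # Trim surrounding whitespace, then the surrounding quotes, then lowercase.
    word = part.strip().strip('"').lower()
    words.append(word)

print(words)
# ['test word', 'the test', 'word two', 'word four', 'alpha', 'bravo', 'charlie']
```

If the quoted entries may themselves contain commas, splitting on every comma is no longer safe, and the standard library's csv module is probably a better fit.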
