Removing a phrase/sentence in which \r\n is located randomly (R / Python)

How can I remove a phrase or sentence when \r\n can appear at different places inside it?
For example, I want to remove a sentence like this:
If you are having trouble viewing this message or would like to share
it on a social network, you can view the message online.
But there are many different variations of this sentence like:
If
you are having trouble viewing this message or would like to share
it on a social network, you can view the message online.
or
If you are having trouble
viewing this message or would like to share
it on a social network, you can view the message online.
I tried to specify every variation in a regular expression, but that is only feasible when the sentence or phrase is short.
For example, if I want to remove Please contact me immediately
I can specify Please\r\ncontact me immediately, Please contact\r\nme immediately, Please contact me\r\n immediately and Please contact me\r\nimmediately to remove this sentence. But if I want to remove a longer sentence like my first example, I cannot write out every possible variation.
In summary, how can I remove phrases and sentences that have the same words but have \r\n in different places?

Give this a try.
$ import re
$ remove_text = lambda x, y: re.sub(r'\s?\r?\n?'.join(x.split()), "", y)
$ remove_text("Please contact me immediately", "Hello Please contact\r\nme immediately World")
> 'Hello  World'
You can also remove extra spaces later.
$ re.sub(r'\s+', ' ', remove_text("Please contact me immediately", "Hello Please contact\r\nme immediately World"))
> 'Hello World'
This method has its limitations: because every separator is optional, text such as Pleasecontact meimmediately is also treated as a match. The words are also inserted into the pattern unescaped, so regex metacharacters in the phrase (such as . or ?) are not matched literally.
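A slightly stricter variant of the same idea, as a minimal sketch: each word is escaped with re.escape and at least one whitespace character is required between words, so run-together text no longer matches. The function name remove_phrase is hypothetical, and collapsing whitespace at the end is an assumption about the desired output (it also flattens line breaks elsewhere in the text).

```python
import re

def remove_phrase(phrase, text):
    # Escape every word so punctuation such as "." or "," is matched literally,
    # then allow any run of whitespace (spaces, \r, \n) between the words.
    pattern = r"\s+".join(re.escape(word) for word in phrase.split())
    cleaned = re.sub(pattern, "", text)
    # Collapse the whitespace left behind by the removal.
    return re.sub(r"\s+", " ", cleaned).strip()

sentence = ("If you are having trouble viewing this message or would like to share "
            "it on a social network, you can view the message online.")
body = ("Intro. If\r\nyou are having trouble\r\nviewing this message or would like "
        "to share\r\nit on a social network, you can view the message online. Bye")
result = remove_phrase(sentence, body)
```

Because the pattern is built from the phrase itself, one call covers every line-break variation without listing them by hand.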

This regex pattern will find all paragraphs ( as opposed to sentences ):
((?:[^\n\r]+[\n\r])+(?:[^\n\r]+[\n\r])(?=[\n\r]))
Explanation:
Find ( [ 1 or more non-newline characters ] followed by a [ newline character ] ) on 1 or more lines
(?:[^\n\r]+[\n\r])+
Find an additional line which matches the above pattern
(?:[^\n\r]+[\n\r])
Find an additional [ newline character ]
IE: the blank line in between two groups of text
(?=[\n\r])
The 2nd & 3rd groups combined equate to the final line of the paragraph.
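To see what the paragraph pattern above returns, here is a small sketch. It assumes \n-only line endings and paragraphs of at least two lines, each followed by a blank line; the sample text is made up for illustration.

```python
import re

# Pattern copied from the answer above.
paragraph_re = re.compile(r"((?:[^\n\r]+[\n\r])+(?:[^\n\r]+[\n\r])(?=[\n\r]))")

sample = "line one\nline two\n\nline three\nline four\n\n"
paragraphs = paragraph_re.findall(sample)
for paragraph in paragraphs:
    print(repr(paragraph))
```

Note that with \r\n line endings each line ends in two newline characters, so the pattern as written would need adjusting before it matches.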

Related

Regex for multiple lines separated with "return" and multiple unnecessary spaces

I was trying to parse together a script for a movie into a dataset containing two columns 'speaker_name' and 'line_spoken'. I don't have any issue with the Python part of the problem but parsing the script is the problem.
The schema of the script goes like this:
Now, this, if copied and pasted into a .txt file is something like this:
ARTHUR
Yeah. I mean, that's just--
SOCIAL WORKER
Does my reading it upset you?
He leans in.
ARTHUR
No. I just,-- some of it's
personal. You know?
SOCIAL WORKER
I understand. I just want to make
sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
In the above case, the regex filtering should return the speaker name and the dialogue and not what is happening in actions like the last line: "slides his journal back". The dialogues often exceed more than two lines so please do not provide hard-coded solutions for 2 lines only. I think I am thinking about this problem in just one direction, some other method to filter can also work.
I have worked with scripts that are colon-separated and I don't have any problem parsing those. But in this case, I am getting no specific endpoints to end the search at. It would be a great help if the answer you give has 2 groups, one with name, the other with the dialogue. Like in the case of colon-separated, my regex was:
pattern = r'(^[a-zA-Z]+):(.+)'
Also, if possible, please try and explain why you used that certain regex. It will be a learning experience for me.
Use https://www.onlineocr.net/ to convert the PDF to text.
It shows the result immediately, with the names on the same line as the dialogue,
which allows for simple processing:
ARTHUR Yeah. I mean, that's just--
SOCIAL WORKER Does my reading it upset you?
He leans in.
ARTHUR No. I just,-- some of its personal. You know me ?
SOCIAL WORKER I understand. I just want to make sure you're keeping up with it.
She slides his journal back to him. He holds it in his lap.
I am not sure whether it will work for longer dialogues.
Another solution is to extract the data from the text file that you can download by clicking the "download output file" link. That file is formatted differently. In that file,
10 leading spaces indicate the name and 5 leading spaces the dialogue, at least for your sample screenshot.
The regex is
r"          (.+)(\n(     [^ ].+\n)+)"
https://regex101.com/r/FQk8uH/1
It puts the line that starts with ten spaces (the name) in group 1 and the lines that start with exactly five spaces (the dialogue) in group 2:
the subexpression "     [^ ].+\n" denotes a line where the first five symbols are spaces, the sixth symbol is anything but a space, and the rest of the symbols until the end of the line are arbitrary. Since dialogue tends to be multiline, that expression is followed by another plus.
You will have to delete extra white space from dialogue with additional code and/or regex.
If the number of spaces varies a bit (say 4-6 and 7-14 respectively) but the two ranges stay distinct, the regex needs to be adjusted by using the variable repetition operator (curly braces, e.g. {4,6}, written without a space) or optional spaces ( ?).
r" {7,14}(.+)(\n( {4,6}[^ ].+\n)+)"
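The indentation idea can be sketched in Python as follows. The sample text and its exact indentation (10 spaces before a name, 5 before dialogue, none before actions) are assumptions made for illustration; the real OCR output may differ, so the ranges may need tuning.

```python
import re

# Hypothetical OCR output: names indented 10 spaces, dialogue 5, actions 0.
script = (
    "          ARTHUR\n"
    "     No. I just,-- some of it's\n"
    "     personal. You know?\n"
    "          SOCIAL WORKER\n"
    "     I understand. I just want to make\n"
    "     sure you're keeping up with it.\n"
    "She slides his journal back to him.\n"
)

# Group 1: a line indented 7-14 spaces (the name).
# Group 2: one or more following lines indented 4-6 spaces (the dialogue).
pattern = re.compile(r"^ {7,14}(\S.*)\n((?: {4,6}\S.*\n)+)", re.MULTILINE)

rows = []
for name, dialogue in pattern.findall(script):
    # Join the dialogue lines and drop the extra indentation.
    rows.append((name.strip(), " ".join(dialogue.split())))
```

Unindented action lines match neither part of the pattern, so they are skipped without extra code.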
The last idea is to use a preexisting list of the names in the play to match them, e.g. (SOCIAL WORKER|JOHN|MARY|ARTHUR). The https://www.onlineocr.net/ website could still be used to help spot and delete actions.
In Python, you can use DOTALL:
re_pattern = re.compile(r'(\b[A-Z ]{3,}(?=\n))\n*(.*?)\n*(?=\b[A-Z ]{3,}\n|$)', re.DOTALL)
print(re.findall(re_pattern, mystr))
\b[A-Z ]{3,}(?=\n) matches speaker name.
\b matches a word boundary
[A-Z ]{3,} matches three or more upper case letters or spaces. (This means the regex won't recognize speaker names with fewer than three characters. I did this to avoid false positives in special cases, but you might want to change it. Also check what kind of characters might occur in speaker names (dots, minus, lower case...).)
(?=\n) is a lookahead ensuring the speaker name is directly followed by a new line (avoids a false positive if a similar expression appears in a spoken line)
\n* matches newlines
(.*?) matches everything (including new lines thanks to DOTALL) until the next part of the expression (? makes it lazy instead of greedy)
\n* matches newlines
(?=\b[A-Z ]{3,}\n|$) is a lookahead, i.e. a non-consuming expression ensuring the following part is either a speaker name or the end of your string
Output:
[('ARTHUR', "Yeah. I mean, that's just--"), ('SOCIAL WORKER', 'Does my reading it upset you?\n\nHe leans in.'), ('ARTHUR', "No. I just,-- some of it's\n\npersonal. You know?"), ('SOCIAL WORKER', "I understand. I just want to make\n\nsure you're keeping up with it.\n\nShe slides his journal back to him. He holds it in his lap.")]
You'll have to adjust formatting if you want to remove actions from the result though.

Why do multi-line strings lead to different pattern matches from single line strings when using python regex?

I am trying to create a Discord Bot that reads users messages and detects when an Amazon link(s) is/are present in their message.
If I use a multi-line string I capture different results from when the message is used on a single line.
Here is the code I am using:
import re
AMAZON_REGEX = re.compile("(http[s]?://[a-zA-Z0-9.-]*(?:amazon|amzn).["
                          "a-zA-Z]+(?:.+?(?:ref=[^?]+)|.+(?= )|[^?]+))")
def extract_url(message):
    foo = AMAZON_REGEX.findall(message)
    return foo
user_message = """https://www.amazon.co.uk/dp/B07RLWTXKG blah blah
hello
https://www.amazon.co.uk/dp/B07RLWToop foobar"""
print(extract_url(user_message))
The result of the above code is: ['https://www.amazon.co.uk/dp/B07RLWTXKG blah blah', 'https://www.amazon.co.uk/dp/B07RLWToop']
However, if I change user_message from a multiline string to a single line one then I get the following result: ['https://www.amazon.co.uk/dp/B07RLWTXKG blah blah hello https://www.amazon.co.uk/dp/B07RLWToop']
Why is this the case? Also, how do I capture just the URL without the rest of the users' messages?
It seems like you're having an issue with the exact regex you're using.
Why does the newline change the output?
After matching the link, your regex goes on to capture the words that follow it, separated by spaces, but the newline character stops the regex from continuing. The newline between "blah" and "hello" in the first case is what keeps "hello" from being captured in the multi-line case. By default, . matches any character except the newline character (\n), so .+ cannot run past the end of a line.
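A minimal sketch of that behaviour, using a made-up message rather than the real regex: without flags, .+ stops at the newline, while re.DOTALL lets it continue.

```python
import re

message = "link blah blah\nhello"

# Default: "." does not match "\n", so the match ends at the line break.
default_match = re.findall(r"link.+", message)

# With re.DOTALL, "." also matches "\n" and the match runs to the end.
dotall_match = re.findall(r"link.+", message, re.DOTALL)

print(default_match)
print(dotall_match)
```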
Only capturing the link
I'm not quite sure what format the amazon link would come in, so it's difficult to say how it should look. However, you know that the link will not contain a space, so stopping the matching when you see a space character would be optimal.
(http[s]?:\/\/[a-zA-Z0-9.-]*(?:amazon|amzn).[a-zA-Z]+(?:.+?(?:ref=[^?]+)|.+(?= )|[^?]+))
(http[s]?:\/\/[a-zA-Z0-9.-]*(?:amazon|amzn).[a-zA-Z]+(?:.+?(?:ref=[^?]+)|[^ ]+(?= )|[^?]+))
In the example above, I turned one of your last . (basically "match all characters") into [^ ] (basically "match all except for a space"). This means you won't start matching the words following the spaces after the word.
Good luck with the Discord bot!
So the reason you're getting a different result between your two different input sources is because you're not doing any checks for the presence of new lines in your regex. This answer goes into a little more detail about how your regex might need to be modified to detect a newline string.
But - if what you really want is just to get a list of links without the rest of the text, you're better off using a different regex string designed to capture just the URL. This post has several different regex strategies for matching just a single URL.
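One way to capture only the URL, as a rough sketch: stop at the first whitespace character by using \S* for the path. The pattern below is an illustrative assumption, not a complete validator for Amazon links, and the name AMAZON_URL is hypothetical.

```python
import re

# Scheme, optional subdomains, "amazon" or "amzn", a dot, the TLD part,
# then everything up to the next whitespace character.
AMAZON_URL = re.compile(r"https?://[a-zA-Z0-9.-]*(?:amazon|amzn)\.[a-zA-Z.]+\S*")

multi_line = """https://www.amazon.co.uk/dp/B07RLWTXKG blah blah
hello
https://www.amazon.co.uk/dp/B07RLWToop foobar"""
single_line = multi_line.replace("\n", " ")

urls_multi = AMAZON_URL.findall(multi_line)
urls_single = AMAZON_URL.findall(single_line)
```

Since \S never matches a space or a newline, the multi-line and single-line messages give the same list.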

I wish to take the middle pattern of the sentence in chinese character using regex

I tried to take the middle words based on my pattern. Below is my code:
text = "東京都田中区9-7−4"
import re
#Sorry due to the edit problem and stackoverflow doesnt allow me to include long sentences here, please check my comment below for the compile function of re.
city = re.findall(r,text)
print("getCity: {}".format(city))
My current output:
getCity: ['都田中区']
My expected output:
getCity: ['田中区']
I do not want to take the [都道府県] so I use "?!" in my first beginning pattern as (?!...??[都道府県]). However, when I run my program, it shows that "都" is inside as well like I show on my current output. Could anyone please direct me on this?
The problem with your regex is that it is too permissive.
If you look at this visualisation here (I have removed all the hardcoded city (市) names because they are irrelevant):
you can see a lot of "any character" repeated x times, or just "not 市" and "not 町" repeated x times. These are what matches the 都道府県 in your string. Therefore, these are the places where you should disallow 都道府県:
The corresponding regex would be:
(?:余市|高市|[^都道府県市]{2,3}?)郡(?:玉村|大町|[^都道府県]{1,5}?)[町村]|(?:[^都道府県]{1,4}市)?[^都道府県町]{1,4}?区|[^都道府県]{1,7}?[市町村]
Remember to add the hardcoded cities when you put this in your code!
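For reference, this is how the regex above behaves on the string from the question. It is only a sketch: the hardcoded city names are left out here as well, so results on other addresses may differ.

```python
import re

text = "東京都田中区9-7−4"

# Regex from the answer above, split over three lines for readability.
pattern = re.compile(
    r"(?:余市|高市|[^都道府県市]{2,3}?)郡(?:玉村|大町|[^都道府県]{1,5}?)[町村]"
    r"|(?:[^都道府県]{1,4}市)?[^都道府県町]{1,4}?区"
    r"|[^都道府県]{1,7}?[市町村]"
)

city = pattern.findall(text)
print("getCity: {}".format(city))
```

Because 都 is excluded from every character class, a match can no longer start on it or run across it.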

Python - Regex - match characters between certain characters

I have a text file and I want to match/findall/parse all characters that are between certain characters ([\n" text to match "\n]). The texts can differ a lot from each other in structure and in the characters they contain (they can contain every possible char there is).
I posted this question before (sorry for the duplicate), but so far the problem couldn't be solved, so now I am trying to be even more precise about it.
The text in the file is build up like this:
test ="""
[
"this is a text and its supposed to contain every possible char."
],
[
"like *.;#]§< and many "" more."
],
[
"plus there are even
newlines
in it."
]"""
My desired output should be a list (for example) with each text in between the seperators as an element, like the following:
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']
I tried to solve it with Regex and two solutions with the according output i came up with:
my_list = re.findall(r'(?<=\[\n {8}\").*(?=\"\n {8}\])', test)
print (my_list)
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.']
well this one was close. Its listing the first two elements as its supposed to but unfortunately not the third one as it has newlines within.
my_list = re.findall(r'(?<=\[\n {8}\")[\s\S]*(?=\"\n {8}\])', test)
print (my_list)
['this is a text and its supposed to contain every possible char."\n ], \n [\n "like *.;#]§< and many "" more."\n ], \n [\n "plus there are even\nnewlines\n \n in it.']
okay this time every element is included but the list has only one element in it and the lookahead doesnt seem to be working as i thought it would.
So whats the right Regex to use to get my desired output?
Why does the second approach not include the lookahead?
Or is there even a cleaner, faster way to get what i want (beautifulsoup or other methods?)?
I am very thankful for any help and hints.
i am using python 3.6.
You should use the DOTALL flag so that . also matches newlines, together with a lazy quantifier (.*?):
print(re.findall(r'\[\n\s+"(.*?)"\n\s+\]', test, re.DOTALL))
Output
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even\nnewlines\n\nin it.']
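On the question of why the [\s\S]* attempt returned a single element: the lookahead does work, but * is greedy, so the match extends to the last position where the lookahead can still succeed. A minimal sketch with a simplified, made-up sample string:

```python
import re

sample = '["a"], ["b"]'

# Greedy: [\s\S]* grabs as much as possible, so the match runs from the
# first opening quote to the LAST closing '"]'.
greedy = re.findall(r'(?<=\[")[\s\S]*(?="\])', sample)

# Lazy: [\s\S]*? stops at the FIRST closing '"]' after each opening.
lazy = re.findall(r'(?<=\[")[\s\S]*?(?="\])', sample)

print(greedy)
print(lazy)
```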
You can use the pattern
(?s)\[[^"]*"(.*?)"[^]"]*\]
to capture every element within the "s inside the brackets:
https://regex101.com/r/SguEAU/1
Then, you can use a list comprehension with re.sub to replace whitespace characters (including newlines) in every captured substring with a single normal space:
test ="""
[
"this is a text and its supposed to contain every possible char."
],
[
"like *.;#]§< and many "" more."
],
[
"plus there are even
newlines
in it."
]"""
output = [re.sub(r'\s+', ' ', m.group(1)) for m in re.finditer(r'(?s)\[[^"]*"(.*?)"[^]"]*\]', test)]
Result:
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']

python regex - characters between certain characters

Edit: I should add that the string in the test is supposed to contain every possible char (i.e. * + $ § € / etc.). So I thought a regexp would help best.
I am using regex to find all characters between certain characters ([" and "]). My example goes like this:
test = """["this is a text and its supposed to contain every possible char."],
["another one after a newline."],
["and another one even with
newlines
in it."]"""
The supposed output should be like this:
['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']
My code including the regex looks like this:
import re
my_list = re.findall(r'(?<=\[").*(?="\])*[^ ,\n]', test)
print (my_list)
And my outcome is the following:
['this is a text and its supposed to contain every possible char."]', 'another one after a newline."]', 'and another one even with']
so there are two problems:
1) its not removing "] at the end of a text as i want it to do with (?="\])
2) its not capturing the third text in brackets, guess because of the newlines. But so far i wasnt able to capture those when i try .*\n it gives me back an empty string.
I am thankful for any help or hints with this issue. Thank you in advance.
Btw, I am using Python 3.6 on Anaconda/Spyder and the newest regex (2018).
EDIT 2: One Alteration to the test:
test = """[
"this is a text and its supposed to contain every possible char."
],
[
"another one after a newline."
],
[
"and another one even with
newlines
in it."
]"""
Once again i have trouble to remove the newlines from it, guess the whitespaces could be removed with \s, so an regexp like this could solve it, i thought.
my_list = re.findall(r'(?<=\[\S\s\")[\w\W]*(?=\"\S\s\])', test)
print (my_list)
But that returns only an empty list. How to get the supposed output above from that input?
In case you might also accept not regex solution, you can try
result = []
for l in eval(' '.join(test.split())):
    result.extend(l)
print(result)
# ['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']
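The same idea can be written with ast.literal_eval, which only accepts Python literals and therefore avoids running arbitrary code from the file. This is a sketch under the assumption that the text, once its whitespace is normalized, is a valid Python literal; strings that contain unescaped double quotes or backslashes would still fail to parse.

```python
import ast

test = """["this is a text and its supposed to contain every possible char."],
["another one after a newline."],
["and another one even with
newlines
in it."]"""

# Collapse all whitespace (including newlines) to single spaces so that the
# multi-line string literal becomes a valid single-line literal.
normalized = " ".join(test.split())

result = []
for inner_list in ast.literal_eval(normalized):
    result.extend(inner_list)
print(result)
```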
You can try this, mate.
(?<=\[\")[\w\s.]+(?=\"\])
What you missed in your regex is that .* will not match a newline.
P.S. This one does not match special characters, but that can be achieved very easily.
This one matches special characters too:
(?<=\[\")[\w\W]+?(?=\"\])
So here's what I came up:
test = """["this is a text and its supposed to contain every possible char."],
["another one after a newline."],
["and another one even with
newlines
in it."]"""
for i in test.replace('\n', ' ').split(','):
    print(i.lstrip(r' ["').rstrip(r'"]'))
Which results in the following being printed to the screen
this is a text and its supposed to contain every possible char.
another one after a newline.
and another one even with newlines in it.
If you want a list of those -exact- strings, we could modify it to-
newList = []
for i in test.replace('\n', ' ').split(','):
    newList.append(i.lstrip(r' ["').rstrip(r'"]'))
