Regex works in Sublime, not in Python (Jupyter) - python

I am creating a Jupyter notebook to clean a large number of novels, using regular expressions that I first test in Sublime.
A lot of my texts contain the phrase 'Digitized by Google' because Google is where I got the PDFs that I ran through Optical Character Recognition.
I want to remove all lines that contain the phrase 'Digitized', or rather 'igitized', since the first letter isn't always transcribed correctly.
When I use this pattern in Sublime's replace function, I get exactly the results I want:
^.*igitized.*$
However, when I use re.sub in my Jupyter notebook, which works for some other phrases, the 'Digitized by Google' lines are not matched and so are not replaced with an empty string.
text = re.sub(r'^.*igitized.*$', '', text)
What am I missing?

By default, '^' matches only at the beginning of the string, and '$' matches only at the end of the string or immediately before a newline at the end of the string. Add the re.MULTILINE flag so that '^' and '$' also match at the beginning and end of each line.
text = re.sub(r'^.*igitized.*$', '', text, flags=re.MULTILINE)
See also: Using ^ to match beginning of line in Python regex
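To make the effect of the flag concrete, here is a minimal sketch. The sample text is made up for illustration (the asker's novels are not shown), and the last variant, which also consumes the trailing newline so that no blank line is left behind, is an optional extra rather than part of the answer above.

```python
import re

# Hypothetical OCR output; the real texts are not shown in the question.
text = (
    "Chapter one\n"
    "Digitized by Google\n"
    "It was a dark night.\n"
    "Digitized by Google\n"
    "The end."
)

pattern = r'^.*igitized.*$'

# Without re.MULTILINE, ^ only anchors at the very start of the string,
# and . does not cross newlines, so nothing is replaced.
unchanged = re.sub(pattern, '', text)

# With re.MULTILINE, ^ and $ anchor at every line, so each matching line is emptied.
cleaned = re.sub(pattern, '', text, flags=re.MULTILINE)

# Optional: also consume the newline so no empty line remains.
cleaned_no_blank = re.sub(r'^.*igitized.*\n?', '', text, flags=re.MULTILINE)

print(repr(unchanged))
print(repr(cleaned))
print(repr(cleaned_no_blank))
```

Sublime treats ^ and $ as line anchors by default, which is why the same pattern behaves differently there.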

Related

How to use re.sub to match and substitute any numbers after - extension?

I want to change 'datasets/4/image-3.jpg' to 'datasets/4/image-1.jpg'. Is there a way to do this with re.sub, or should I try something else like .split("/")[-1]? I tried the code below but ended up with 'datasets/1/image-1.jpg'; I want to keep the /4/ instead of getting /1/.
My Code
import re
employee_name = 'datasets/4/image-3.jpg'
label = re.sub('[0-9]', "1", employee_name)
print(label)
Output
datasets/1/image-1.jpg
Expected Input
datasets/4/image-3.jpg
Expected Output
datasets/4/image-1.jpg
You can use
re.sub(r'-\d+\.jpg', '-1.jpg', text)
Note: If the match is always at the end of string, append the $ anchor at the end of the regex. If there can be other text after and you need to make sure there are no word chars after jpg, add a word boundary, \b. If you want to match other extensions, use a group, e.g. (?:jpe?g|png).
Regex details
-\d+ - a hyphen followed by one or more digits
\.jpg - .jpg string.
See the regex demo (code generator link).
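A short runnable sketch of the suggestion, using the path from the question. The second call, with a .png path, is an invented input to illustrate the extension group mentioned in the note; it is not part of the original answer.

```python
import re

path = 'datasets/4/image-3.jpg'

# Anchoring on the hyphen leaves the digit in '/4/' alone;
# $ pins the match to the end of the string.
label = re.sub(r'-\d+\.jpg$', '-1.jpg', path)
print(label)

# Variant for several extensions: capture the extension and put it back
# with \g<1> (hypothetical input path).
other = re.sub(r'-\d+(\.(?:jpe?g|png))$', r'-1\g<1>', 'datasets/4/image-12.png')
print(other)
```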

What would be the easiest way to extract specific information from ()'s?

I have lines I am iterating through that look like this:
random text and A08524SDD here (00-04) more random text
lame text (junk data) more text (08-12) more text 4000 5553
random text and numbers 44553349 (2008)
random text (2005) junk text (junk)
nothing important (13-15) not important (not important)
I am trying to figure out how to pull ONLY the dates (range or single year) from the parentheses without pulling the other random junk that also appears in parentheses.
Currently using this, but it is returning the random text as well:
date = re.findall('\(([^)]+)', line)
Edit: I am iterating over the text one line at a time; it is not one single string. I have a for loop that searches each line and tries to extract the date range from it. Also, there are random numbers included in the random text, so I cannot just search for ##-## or #### in the entire string. The match has to be enclosed in ()'s.
Edit2: @CarySwoveland has answered my original question. As a bonus, I do have a few lines that look like this, and it would be nice if they could also be included.
random text and numbers 44553349 (2008 important text)
random text (2005 important text) junk text (junk) 55555555 (08-09 important text)
nothing important (13-15) not important (not important)(2008 important text)
In the lines with more than 1 () that both start with a ##-## or a #### I need to grab both of them WITH the text. Out of about 35,000 lines of text, only about ~50 or so have these random issues, and I do not mind doing them by hand. But if a solution exists, it would be nice to implement.
THANK YOU TO ALL WHO HAVE POSTED! THIS HAS HELPED ME OUT GREATLY!!!!
As per both your question and your added comments I would suggest the following pattern:
(?<=\()\d\d-?\d\d.*?(?=\))
This would cater for all patterns of interest, like: (####), (##-##) and (##-## text) and possibly (#### text). Here is an online demo
From left to right:
(?<=\() - Positive lookbehind for an opening parenthesis
\d\d-?\d\d - Two digits followed by an optional hyphen and again two more digits
.*? - Match any character except newlines but non-greedy
(?=\)) - A positive lookahead to check for a closing parenthesis.
If you want to be very explicit about what may sit between the fourth digit and the closing parenthesis, you could extend the pattern. For example (?<=\()\d\d-?\d\d(?:\s\w+)*(?=\)), where the non-capturing group (?:...) matches a whitespace character \s followed by one or more word characters \w+. The group may repeat zero or more times, ...)*, so any number of trailing words is allowed, and it is followed by the same positive lookahead as above.
Don't forget, using these patterns in Python, you'd need to make sure to use them as raw strings.
Note: I escaped both the opening and closing parentheses in the lookarounds with a backslash, i.e. \( and \), to use them as literals. Not doing so will prematurely open/close another (non-)capturing group!
A Python example:
import re
lines = ['random text and A08524SDD here (00-04) more random text',
'lame text (junk data) more text (08-12) more text 4000 5553',
'random text and numbers 44553349 (2008)',
'random text (2005) junk text (junk)',
'nothing important (13-15) not important (not important)',
'random text and numbers 44553349 (2008 important text)',
'random text (2005 important text) junk text (junk) 55555555 (08-09 important text)',
'nothing important (13-15) not important (not important)(2008 important text)']
for line in lines:
    print(re.findall(r'(?<=\()\d\d-?\d\d.*?(?=\))', line))
Returns:
['00-04']
['08-12']
['2008']
['2005']
['13-15']
['2008 important text']
['2005 important text', '08-09 important text']
['13-15', '2008 important text']
You can use the following regular expression.
(?m)(?<=\()(?:\d{4}|\d{2}-\d{2})(?=\))
Regex Demo <¯\_(ツ)_/¯> Python demo
Python's regex engine performs the following operations.
(?m) multiline mode
(?<=\() match is preceded by '(' (positive lookbehind)
(?: begin non-capture group
\d{4} match 4 digits
| or
\d{2}-\d{2} match 2 digits, a hyphen, 2 digits
) end non-capture group
(?=\)) match is followed by ')' (positive lookahead)
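The breakdown above can be tried out with a short sketch. It uses the five sample lines from the question; the (?m) flag is left out here because the pattern contains no ^ or $ anchors, so the flag does not change the result.

```python
import re

lines = [
    'random text and A08524SDD here (00-04) more random text',
    'lame text (junk data) more text (08-12) more text 4000 5553',
    'random text and numbers 44553349 (2008)',
    'random text (2005) junk text (junk)',
    'nothing important (13-15) not important (not important)',
]

# Four digits, or two digits, a hyphen and two digits, with a literal '('
# directly before and a literal ')' directly after.
pattern = re.compile(r'(?<=\()(?:\d{4}|\d{2}-\d{2})(?=\))')

results = [pattern.findall(line) for line in lines]
for found in results:
    print(found)
```

Because the lookarounds require the parentheses to touch the date directly, this pattern does not pick up the "bonus" lines such as (2008 important text).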
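Does something like this work for you?
This assumes strings is a list of your lines and that every line contains a date in parentheses (otherwise re.search returns None and .group() raises AttributeError).
import re
def getter(string):
    # Return the first parenthesized year or range, e.g. '(2008)' or '(00-04)'
    return re.search(r'(\(\d{4}\)|\(\d{2}-\d{2}\))', string).group()
list(map(getter, strings))
Output:
['(00-04)', '(08-12)', '(2008)', '(2005)', '(13-15)']
As per your edit: if you are looping, just apply the function to each line inside the loop.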
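Since re.search returns None when a line has no date, a slightly more defensive variant may be useful. This is only a sketch: get_dates and DATE_IN_PARENS are illustrative names, not part of the answer above, and it uses re.findall so that lines with zero or several dates are handled the same way.

```python
import re

# Capture only the date inside the parentheses, not the parentheses themselves.
DATE_IN_PARENS = re.compile(r'\((\d{4}|\d{2}-\d{2})\)')

def get_dates(line):
    """Return every (####) or (##-##) date in the line; empty list if there is none."""
    return DATE_IN_PARENS.findall(line)

print(get_dates('random text (2005) junk text (junk)'))
print(get_dates('no dates here (junk) 4000 5553'))
print(get_dates('first (13-15) then (2008)'))
```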

Python regex to identify capitalised single word lines in a text abstract

I am looking for a way to extract words from text if they match the following conditions:
1) are capitalised
and
2) appear on a new line on their own (i.e. no other text on the same line).
I am able to extract all capitalised words with this code:
caps=re.findall(r"\b[A-Z]+\b", mytext)
but can't figure out how to implement the second condition. Any help will be greatly appreciated.
You can use the re.MULTILINE flag to make ^ and $ match the beginning and the end of a line, rather than the beginning and the end of a string:
re.findall(r"^[A-Z]+$", mytext, flags=re.MULTILINE)
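A small sketch of this on made-up text (the asker's abstract is not shown). The second pattern, which tolerates spaces or tabs around the word, is an assumption about what real input might look like and goes beyond the original answer.

```python
import re

# Hypothetical abstract text for illustration.
mytext = (
    "INTRODUCTION\n"
    "This study uses DNA samples.\n"
    "METHODS\n"
    "Samples were collected in 2010.\n"
    "RESULTS  \n"
    "All tests passed.\n"
)

# Strict: the line must consist of capital letters only.
caps = re.findall(r"^[A-Z]+$", mytext, flags=re.MULTILINE)

# Tolerant: allow spaces/tabs before or after the word and capture just the word.
caps_tolerant = re.findall(r"^[ \t]*([A-Z]+)[ \t]*$", mytext, flags=re.MULTILINE)

print(caps)
print(caps_tolerant)
```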
Please try adding \r\n at the beginning of your regular expression.

Python regex capture issue

I have a regular expression that captures text from a text file. The regex can be viewed at the following URL:
https://regex101.com/r/wwEjTF/1
In my Python code I would like to extract only the text matched by the regex from all the other text in the file.
I have the following Python code for matching the regex and storing the result in a variable.
match = re.findall(r'test\s.+\n\sdescription\s\"(.+)\"', text, re.S)
I am expecting all the matches to be returned as a list in the match variable.
But when I do print(match) I get an empty list. I do not understand why the list is empty. How do I capture the matched part of the regex in the variable?
Thanks for your help.
In case there is an issue with the above URL, here are the regex and the sample text:
test\s.+\n\sdescription\s\"(.+)\"
some random text
test 111.333.555.666
description "text10"
some random text
some random text
test 22.44.55.66
description "text12"
some random text
some random text
test 77.77.88.99
description "text13"
some random text
some random text
test 14.22.55.99
description "text16"
some random text
some random text
test 13.33.55.66
description "text17"
some random text
As I said in my comment, try omitting re.S, because it makes the '.' special character match any character at all, including a newline.
Also, \n\s is not appropriate, since \s already matches a newline. Use \s+ instead.
Your regex will be:
match = re.findall(r'test\s.+\s+description\s\"(.+)\"', text)
Ideone Demo
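The example on regex101 uses a single tab to indent the description line, so a single \s works there. If your actual text is indented with several whitespace characters, replace \s with \s+. Also drop re.S: with it, the greedy .+ runs across lines and only the last description is captured.
match = re.findall(r'test\s.+\n\s+description\s\"(.+)\"', text)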
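The following sketch compares the variants on a shortened version of the sample text. It assumes the description lines are indented with four spaces, which is a guess: the original indentation was lost when the question was posted.

```python
import re

# Shortened sample; four-space indentation is an assumption.
text = (
    "some random text\n"
    "test 111.333.555.666\n"
    "    description \"text10\"\n"
    "some random text\n"
    "test 22.44.55.66\n"
    "    description \"text12\"\n"
    "some random text\n"
)

# Original pattern: a single \s cannot cover four spaces, so nothing matches.
original = re.findall(r'test\s.+\n\sdescription\s"(.+)"', text, re.S)

# \s+ with re.S: the greedy .+ spans lines and swallows all but the last record.
with_dotall = re.findall(r'test\s.+\n\s+description\s"(.+)"', text, re.S)

# \s+ without re.S: . stops at each newline, so every record is captured.
without_dotall = re.findall(r'test\s.+\n\s+description\s"(.+)"', text)

print(original)
print(with_dotall)
print(without_dotall)
```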

dealing with \n characters at end of multiline string in python

I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regex's cover multiple lines using the re.DOTALL option. If the last character in the compiled regex is a \n the above command will substitute all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other no doubt related problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.
If I understood you correctly, and all you need is the text without the newline at the end of each line so that you can then search it for a required word, you can try the following:
data = [line for line in text.split('\n') if line.strip()]  # all non-empty lines, without the '\n' at the end
Now you can search or replace any text you need using list slicing or regex functionality.
Or you can use replace to replace every '\n' with whatever you want:
text.replace('\n', '')
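For illustration, here is how both suggestions behave on a small made-up string. Note that str.replace returns a new string, so the result has to be assigned.

```python
# Hypothetical input with an empty line and a whitespace-only line.
text = "first line\n\n  \nsecond line\n"

# Keep only lines that contain something other than whitespace.
non_empty = [line for line in text.split('\n') if line.strip()]

# Remove every newline; strings are immutable, so keep the return value.
joined = text.replace('\n', '')

print(non_empty)
print(repr(joined))
```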
My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print(content)
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'
I can't get a good handle on what is going on from your explanation, but you may be able to fix it by replacing all runs of multiple newlines with a single newline as you read in the file. Another option might be to drop the \n at the end of the regex unless you need it for something.
Is the question mark there to prevent the regex from matching more than one line at a time? If so, you probably want the MULTILINE flag instead of the DOTALL flag. With MULTILINE, ^ matches at the beginning of the string and just after each newline, and $ matches at the end of the string and just before each newline.
eg.
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves you with the problem of empty lines. You could run one additional regex at the end that removes blank lines:
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)
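Put together, the two steps might look like the sketch below. The sample content is invented, and the single-pass alternative at the end (consuming either the newline or the end of the string with \Z) is an assumption about what the asker wants, not something stated in the question.

```python
import re

# Hypothetical file content whose last line has no trailing newline.
content = 'keep 1\nTOTAL: 5\nkeep 2\nTOTAL: 7'

# Step 1: empty every line that starts with TOTAL:.
without_totals = re.sub(r'^TOTAL:.*$', '', content, flags=re.MULTILINE)

# Step 2: collapse the runs of newlines left behind.
collapsed = re.sub(r'\n{2,}', '\n', without_totals)

# Alternative: remove the line together with its newline (or the end of the string).
single_pass = re.sub(r'^TOTAL:.*(?:\n|\Z)', '', content, flags=re.MULTILINE)

print(repr(without_totals))
print(repr(collapsed))
print(repr(single_pass))
```

The two-step version also collapses blank lines that were already in the file, which may or may not be wanted; the single-pass version leaves them alone.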
