I am using python regex to read documents.
I have the following line in many documents:
Dated: February 4, 2011 THE REAL COMPANY, INC
I can use python text search to easily find the lines that have "dated," but I want to pull THE REAL COMPANY, INC from the text without getting the "February 4, 2011" text.
I have tried the following:
[A-Z\s]{3,}.*INC
My understanding of this regex is that it should get me all capital letters and spaces before INC, but instead it pulls the full line.
This suggests to me I'm fundamentally missing something about how regex works with capital letters. Is there an easy and obvious explanation I'm missing?
What about using:
>>> import re
>>> txt
'Dated: February 4, 2011 THE REAL COMPANY, INC'
>>> re.findall('([A-Z][A-Z]+)', txt)
['THE', 'REAL', 'COMPANY', 'INC']
Another approach, suggested by #davedwards:
>>> re.findall('[A-Z\s]{3,}.*', txt)
[' THE REAL COMPANY, INC']
Explanation:
[A-Z\s]{3,}.*
Match a single character present in the list below [A-Z\s]{3,}
{3,} Quantifier — Matches between 3 and unlimited times, as many times as possible, giving back as needed (greedy)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
You could use
^Dated:.*?\s([A-Z ,]{3,})
And make use of the first capturing group, see a demo on regex101.com.
Your regex [A-Z\s]{3,}.*INC matches 3 or more uppercase or whitespace characters, followed by any character 0+ times, and then INC. On the example line that matches: THE REAL COMPANY, INC
What you could also do is match Dated: from the start of the string followed by a date like format and then capture what comes after in a group. Your value will be in the first capturing group:
^Dated:\s+\S+\s+\d{1,2},\s+\d{4}\s+(.*)$
Explanation
^Dated:\s+ Match Dated: at the start of the string, followed by 1+ times a whitespace character
\S+\s+ Match 1+ times not a whitespace character followed by 1+ times a whitespace character, which will match February in this case
\d{1,2}, Match 1-2 times a digit followed by a comma
\s+\d{4}\s+ Match 1+ times a whitespace character, 4 digits, followed by 1+ times a whitespace character
(.*) Capture in a group 0+ times any character
$ Assert the end of the string
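As a minimal sketch, the date-anchored pattern above could be wrapped like this. The names DATED_RE and company_after_date are hypothetical, chosen only for illustration, and the sketch assumes every line follows the "Dated: Month D, YYYY NAME" layout shown in the question.

```python
import re

# Pattern from the answer above: skip "Dated:", a month word, the day,
# a comma and a 4-digit year, then capture the rest of the line.
DATED_RE = re.compile(r'^Dated:\s+\S+\s+\d{1,2},\s+\d{4}\s+(.*)$')

def company_after_date(line):
    """Return the text after the date, or None if the line does not fit."""
    m = DATED_RE.search(line)
    return m.group(1) if m else None

print(company_after_date('Dated: February 4, 2011 THE REAL COMPANY, INC'))
# THE REAL COMPANY, INC
```

Lines that do not start with "Dated:" or that use another date format return None, so the caller can decide how to handle them.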
I have a string where I'm trying to match a city and state with a regular expression in Python. Some of the strings have a final country code that is preceded by a space. I'm having trouble writing a regular expression that matches all the cases, and captures the city in the first capture group, and the state in the second capture g
[^.*]?Born:.*in[^.](.*),[^.*](.*)
This is the regular expression that I have so far, and these are some example strings that I'm trying to match.
Born: November 8, 1961 in Chicago, Illinois
Born: February 19, 1995 in Sombor, Serbia rs
Born: May 19, 1976 in Greenville, South Carolina us
Based on my current regular expression this is my current output:
(Chicago) (Illinois)
(Sombor) (Serbia rs )
(Greenville) (South Carolina us)
Expected outputs would be
(Chicago) (Illinois)
(Sombor) (Serbia)
(Greenville) (South Carolina)
How can I account for this trailing string of a space and two characters? Any help would be greatly spp
Use
Born:.*in\s+([^,]*),\s+(.*?)(?=(?:\s[A-Za-z]{2})?$)
See regex proof.
EXPLANATION
Born: - matches the characters Born: literally (case sensitive)
.* - matches any character (except for line terminators), between zero and unlimited times, as many times as possible, giving back as needed (greedy)
in - matches the characters in literally (case sensitive)
\s+ - matches any whitespace character (equivalent to [\r\n\t\f\v ]) between one and unlimited times, as many times as possible, giving back as needed (greedy)
1st Capturing Group ([^,]*)
Match a single character not present in the list below [^,]* between zero and unlimited times, as many times as possible, giving back as needed (greedy)
, - matches the character , with index 44 in decimal (2C in hexadecimal or 54 in octal) literally (case sensitive)
\s+ - matches any whitespace character (equivalent to [\r\n\t\f\v ]) between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Capturing Group (.*?)
.*? - matches any character (except for line terminators) between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Positive Lookahead (?=(?:\s[A-Za-z]{2})?$)
Assert that the Regex below matches
Non-capturing group (?:\s[A-Za-z]{2})?
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
Match a single character present in the list below [A-Za-z]
{2} matches the previous token exactly 2 times
A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
$ asserts position at the end of a line
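A short sketch of how this pattern might be applied to the three example strings from the question. BORN_RE and city_and_state are hypothetical names used only for illustration.

```python
import re

# Pattern from the answer above: group 1 is the city, group 2 the state or
# country, and the lookahead leaves an optional trailing " xx" code unmatched.
BORN_RE = re.compile(r'Born:.*in\s+([^,]*),\s+(.*?)(?=(?:\s[A-Za-z]{2})?$)')

def city_and_state(line):
    """Return (city, state) or None when the line does not match."""
    m = BORN_RE.search(line)
    return m.groups() if m else None

samples = [
    'Born: November 8, 1961 in Chicago, Illinois',
    'Born: February 19, 1995 in Sombor, Serbia rs',
    'Born: May 19, 1976 in Greenville, South Carolina us',
]
for line in samples:
    print(city_and_state(line))
# ('Chicago', 'Illinois')
# ('Sombor', 'Serbia')
# ('Greenville', 'South Carolina')
```

One caveat of this approach: a place name whose final word is itself exactly two letters would lose that word, since the lookahead cannot tell it apart from a country code.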
I have a field that looks like this :
------->Total cash dispensed: 40000 MGA
I want to get only the "MGA" using regex but without using split
regex = r"Total cash dispensed:\s*([^ 0-9]*)"
The code I used to get anything that's not a number or white space does not work. How do I fix this?
You might use a capture group:
\bTotal cash dispensed:\s*\d+\s+([A-Z]+)\b
\bTotal cash dispensed:\s* Match the text starting with a word boundary and followed by : and optional whitespace chars
\d+\s+ Match 1+ digits and 1+ whitespace chars
([A-Z]+) Capture group 1, match 1+ chars A-Z
\b A word boundary to prevent a partial match
import re
pattern = r"\bTotal cash dispensed:\s*\d+\s+([A-Z]+)\b"
s = "------>Total cash dispensed: 40000 MGA"
matches = re.search(pattern, s)
if matches:
    print(matches.group(1))
Output
MGA
You can match anything except whitespace and digits with the following expression:
[^\s\d]+
"\s" is whitespace, including spaces, tabs, and newlines
"\d" is digits, same as [0-9] plus numeric characters from other scripts
Example:
re.search(r'[^\s\d]+', '123 \t')
# -> No match
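To tie this back to the question, here is a hedged sketch of why the original pattern returns nothing useful and how the negated class could be placed after the amount. Putting \d+ before the group is an assumption about the field layout (label, amount, then currency code).

```python
import re

s = "------->Total cash dispensed: 40000 MGA"

# The original attempt: after \s* the next character is the digit "4",
# so [^ 0-9]* can only match zero characters and the group is empty.
asker_group = re.search(r"Total cash dispensed:\s*([^ 0-9]*)", s).group(1)

# Consuming the amount first lets the negated class reach the currency code.
m = re.search(r"Total cash dispensed:\s*\d+\s*([^\s\d]+)", s)
currency = m.group(1) if m else None

print(repr(asker_group))  # ''
print(currency)           # MGA
```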
I have to parse a PDF document and I'm using PyPDF2 with re(regex).
The file includes several lines like the one below:
18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40
I need to extract from this line the text between the time and the amount:
PEDMILANO OVEST- BINASCOA
The following code works, but sometimes it doesn't find anything because there can be a digit among these characters, for example, 18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40.
regex = re.compile(r'\d\d-\d\d-\d\d\d\d\d\d:\d\d:\d\d\D+\d+,\d\d')
Is there a way to include a number in this regular expression?
The following should simplify the current regex:
import re
s = '18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40'
re.search(r'\:\d+([A-Z].*?)(?=\d+\,\d+$)', s).group(1)
# 'PEDMILANO OVE3ST- BINASCOA'
\:\d+([A-Z].*?)(?=\d+\,\d+$)
\: matches the character : literally (case sensitive)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
1st Capturing Group ([A-Z].*?)
Match a single character present in the list below [A-Z]
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
.*? matches any character (except for line terminators)
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Positive Lookahead (?=\d+\,\d+$)
Assert that the Regex below matches
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\, matches the character , literally (case sensitive)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
I suggest using
import re
text = "18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40"
print( re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', r'\1', text) )
It can also be written as
re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}|\d+(?:,\d+)?$', '', text)
Or, if you prefer matching and capturing:
m = re.search(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', text)
if m:
    print( m.group(1) )
See an online Python demo. With this solution, your data may start with any char, and will contain any char (excluding line break chars, since your data is on single lines).
Regex details
^ - start of string
\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2} - datetime string: two digits, -, two digits, -, five or six digits, :, two digits, : two digits
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
\d+(?:,\d+)? - an int/float value pattern: 1+ digits followed with an optional sequence of , and 1+ digits
$ - end of string.
See the regex demo.
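As a quick check, the matching-and-capturing variant can be run against both lines from the question, with and without a digit inside the text. LINE_RE and extract_label are hypothetical names for this sketch.

```python
import re

# Same pattern as above: date and time prefix, lazy group 1, trailing amount.
LINE_RE = re.compile(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$')

def extract_label(line):
    """Return the text between the time and the amount, or None."""
    m = LINE_RE.search(line)
    return m.group(1) if m else None

print(extract_label('18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40'))
# PEDMILANO OVEST- BINASCOA
print(extract_label('18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40'))
# PEDMILANO OVE3ST- BINASCOA
```

The lazy group stops only where the rest of the line is a complete amount, so a digit in the middle of the text does not end the capture early.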
I am trying to search for a keyword A in a group of lines with the Python re library. The number of lines in a group ranges from 3 to 5. Each line is enclosed by "<BR>" and "</BR>". The keyword A may or may not appear in the group. If it doesn't, I want to get None back. A sample of the text looks like:
<BR>GROUP #1</BR>
<BR>arbitrary characters 1</BR>
<BR>arbitrary characters 2</BR>
<BR>arbitrary characters 3</BR>
<BR>GROUP #2</BR>
<BR>arbitrary characters 4</BR>
<BR>arbitrary characters 5</BR>
<BR>KEYWORD_A_2</BR>
<BR>Group #3</BR>
<BR>arbitrary characters 6</BR>
<BR>arbitrary characters 7</BR>
<BR>arbitrary characters 8</BR>
<BR>KEYWORD_A_3</BR>
....
(Note: the uppercase characters may be keywords and should appear exactly as in the original text.)
My first attempt, '<BR>Group #(\d+)</BR>.*?<BR>Keyword_A_(\d+)</BR>' obviously may cross the border of the groups and get a match of (1, 2), instead of (1, None) as I wished.
My next attempt is '<BR>Group #(\d+)</BR>(?:<BR>.*?</BR>){,3}<BR>Keyword_A_(\d+)</BR>', to limit the <BR>..</BR> pairs to at most 3. But that will be a greedy match, so 'KEYWORD_A_3' is matched and (1, 3) is returned.
So, in summary, I am trying to have the regex find 'KEYWORD_A_(\d+)' within a maximum of 5 lines after a match of 'GROUP #(\d+)'. If there is no match within 5 lines, it should stop searching, return None, and set the regex's current position at the end of the match of 'GROUP #(\d+)', so I can start searching in the next group.
Is that possible with the re library of Python? Thanks for any help.
You may use
re.findall(r'<BR>Group\s+#(\d+)</BR>((?:(?!<BR>Group\s+#\d).)*?)<BR>Keyword_A_(\d+)</BR>', text, re.DOTALL)
See the regex demo
Details
<BR>Group - a literal <BR>Group string
\s+ - 1+ whitespaces
# - a # char
(\d+) - Capturing group 1: one or more digits
</BR> - a substring
((?:(?!<BR>Group\s+#\d).)*?) - Capturing group 2: any char, 0 or more but as few as possible occurrences that does not start a <BR>Group\s+#\d pattern
<BR>Keyword_A_ - a literal substring
(\d+) - Capturing group 3: one or more digits
</BR> - a substring
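If a None is wanted for groups that have no keyword, one alternative is to locate the group headers first and then search for the keyword only inside each group's span. This is a sketch under assumptions, not the answer's method: the function name is hypothetical, re.IGNORECASE is assumed because the sample mixes "GROUP" and "Group", and the search is bounded by the next header rather than by a fixed count of 5 lines.

```python
import re

GROUP_RE = re.compile(r'<BR>GROUP #(\d+)</BR>', re.IGNORECASE)
KEYWORD_RE = re.compile(r'<BR>KEYWORD_A_(\d+)</BR>', re.IGNORECASE)

def keywords_by_group(text):
    """Return (group number, keyword number or None) for every group."""
    headers = list(GROUP_RE.finditer(text))
    results = []
    for i, header in enumerate(headers):
        # A group ends where the next header starts, or at the end of the text.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        kw = KEYWORD_RE.search(text, header.end(), end)
        results.append((header.group(1), kw.group(1) if kw else None))
    return results

sample = """<BR>GROUP #1</BR>
<BR>arbitrary characters 1</BR>
<BR>GROUP #2</BR>
<BR>arbitrary characters 4</BR>
<BR>KEYWORD_A_2</BR>
<BR>Group #3</BR>
<BR>arbitrary characters 6</BR>
<BR>KEYWORD_A_3</BR>
"""
print(keywords_by_group(sample))
# [('1', None), ('2', '2'), ('3', '3')]
```

Passing pos and endpos to the compiled pattern's search method keeps each lookup inside one group, so a keyword from a later group can never be attributed to an earlier one.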
I would like to replace [1-2] with 1, [3-4] with 3, [7-8] with 7, [2] with 2, and so on.
For example, I would like to use the following strings:
db[1-2].abc.xyz.pqr.abc.abc.com
db[3-4].abc.xyz.pqr.abc.abc.com
db[1].abc.xyz.pqr.abc.abc.com
xyz-db[1-2].abc.xyz.pqr.abc.abc.com
and convert them to
db1.abc.xyz.pqr.abc.abc.com
db3.abc.xyz.pqr.abc.abc.com
db1.abc.xyz.pqr.abc.abc.com
xyz-db1.abc.xyz.pqr.abc.abc.com
You could use a regex like:
^(.*)\[([0-9]+).*?\](.*)$
and replace it with:
$1$2$3
Here's what the regex does:
^ matches the beginning of the string
(.*) matches any character any amount of times, and is also the first capture group
\[ matches the character [ literally
([0-9]+) matches any number 1 or more times, and is also the second capture group
.*? matches any character any amount of times, but tries to find the smallest match
\] matches the character ] literally
(.*) matches any characters any amount of times
$ matches the end of the string
By replacing it with $1$2$3, you are replacing it with the text in the first capture group, followed by the text in the second capture group, followed by the text in the third capture group.
Here's a live preview on regex101.com
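In Python the $1$2$3 replacement is written with backreferences such as \1\2\3. A minimal sketch of the same substitution using re.sub follows; BRACKET_RE and collapse_range are hypothetical names for illustration.

```python
import re

# Same pattern as above: text before "[", the first number inside the
# brackets, and the text after "]".
BRACKET_RE = re.compile(r'^(.*)\[([0-9]+).*?\](.*)$')

def collapse_range(host):
    """Replace a bracketed range such as [1-2] with its first number."""
    return BRACKET_RE.sub(r'\1\2\3', host)

for host in ['db[1-2].abc.xyz.pqr.abc.abc.com',
             'db[3-4].abc.xyz.pqr.abc.abc.com',
             'db[1].abc.xyz.pqr.abc.abc.com',
             'xyz-db[1-2].abc.xyz.pqr.abc.abc.com']:
    print(collapse_range(host))
# db1.abc.xyz.pqr.abc.abc.com
# db3.abc.xyz.pqr.abc.abc.com
# db1.abc.xyz.pqr.abc.abc.com
# xyz-db1.abc.xyz.pqr.abc.abc.com
```

Because re.sub leaves non-matching input untouched, a string without brackets is returned unchanged.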
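import re
def fixString(strToFix):
    # Capture the text before "[", the leading digits inside the brackets, and the text after "]".
    match = re.match(r"(.*)\[(\d*).*\](.*)", strToFix)
    if match is None:
        return strToFix  # no bracketed part: return the input unchanged
    groups = match.groups()
    return "%s%s%s" % (groups[0], groups[1], groups[2])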