Python RegEx: Not capturing all the data (python3.6, scrapy)

I was trying to scrape length information from a website using the following simple code:
list = re.findall('(?<=Length:\s\s)[:\d]+', response.text)
if len(list) > 0:
    data['Length'] = list[0]
else:
    data['Length'] = '00:00'
However, it only gets the information if the length is less than one hour. For example, it gets 51:00 but not 01:08:47. I checked the page source for lengths both shorter and longer than one hour. It seems that for lengths of more than one hour, there is one less whitespace character after Length:. So I tried the following, but this time list only returns whitespace:
list = re.findall('(?<=Length:)[\s:\d]+', response.text)
if len(list) > 0:
    data['Length'] = list[0]
else:
    data['Length'] = '00:00'
Does anybody know how to get both the short and the long lengths? Thank you very much!

You need '(?<=Length:)\s*(\d\d[\s*:\s*\d\d]+)'.

Try this Regex and extract whatever is present in group 1:
Length\s*:\s*(\d+\s*(?::\s*\d+\s*){1,2})
Explanation:
Length\s*: - matches Length literally followed by 0+ occurrences of a white-space, as many as possible
:\s* - matches a : followed by 0+ white-spaces
\d+\s* - matches 1+ occurrences of a digit followed by 0+ white-spaces. We start capturing the text from here in Group 1. We capture until the end of the match.
(?::\s*\d+\s*){1,2} - matches either 1 or 2 occurrences of the pattern (?::\s*\d+\s*)
(?:) - indicates a non-capturing group
:\s* - matches a : followed by 0+ occurrences of a white-space
\d+ - matches 1+ occurrences of a digit
\s* - matches 0+ occurrences of a white-space
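Alternative regex (without any capturing group). Note that Python's re module only supports fixed-width lookbehinds, so this variant matches only when exactly two whitespace characters follow Length: and it misses values preceded by a single space:
(?<=Length:\s\s)\d+\s*(?::\s*\d+\s*){1,2}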
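As a minimal sketch of how the capturing-group pattern above might be used, assuming the page text is available as a plain string (the helper name extract_length and the sample strings are illustrative and not from the original post):

```python
import re

# Pattern from the answer above; group 1 holds the duration.
LENGTH_RE = re.compile(r'Length\s*:\s*(\d+\s*(?::\s*\d+\s*){1,2})')

def extract_length(text, default='00:00'):
    """Return the first duration after 'Length:' or a default (hypothetical helper)."""
    match = LENGTH_RE.search(text)
    # The pattern allows trailing whitespace inside group 1, so strip it.
    return match.group(1).strip() if match else default

print(extract_length('Length:  51:00'))    # two spaces, under one hour
print(extract_length('Length: 01:08:47'))  # one space, over one hour
print(extract_length('no duration here'))  # falls back to the default
```

Because \s* accepts any amount of whitespace after the colon, the same pattern covers both the one-space and the two-space layout.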

Related

Parse the number of shares in a string

Let's say I have a string like "ABC will issue 1,600,000 shares next week.". Problem statement: I need to extract the number of shares from a string - the number of shares can be identified by the fact that it's followed by the word "shares". Is it possible to do that?
I've tried using the regex '^(?=.)(\d{1,3}(,\d{3})*)?(\.\d+)?$'. The code is re.search(<regex>, <string>) but this only works when the string is just the number and nothing else. As soon <string> = "1,000,000 shares", it returns None. Would appreciate any help!
There is another problem: If I remove the ^ and $ anchors, then the regex pattern starts matching '' as well, so a string like "common shares" may return "common".

The following regex extracts the number of shares accurately from the string:
import re
string = 'ABC will issue 1,600,000 shares next week 219,123,123 apples'
pattern = r'\b([\d,]+) shares\b'
print(re.findall(pattern,string))
Output
['1,600,000']
This essentially says:
\b - start at a word boundary
([\d,]+) - capture one or more characters that are digits or commas
 shares\b - followed by a space and the word shares, ending at a word boundary
The capturing group () is there so that findall returns only the number and not the number followed by the word shares.
Also:
import re
string = '1,000,000 shares'
pattern = r'\b([\d,]+) shares\b'
print(re.findall(pattern,string))
Output
['1,000,000']
However, this approach assumes that all the numbers before the word shares are valid because it will also recognise numbers like 1,,,,000,000 or 1,0,00,,00,0 which are obviously not valid.

This should work for you:
keyword = 'shares'
test_string = 'ABC will issue 1,600,000 shares next week'
split_string = test_string.split()
print(split_string)
if keyword in split_string:
    index = split_string.index(keyword)
    assert index > 0
    shares_string = split_string[index-1]
    shares_string = shares_string.replace(',', '')
    nbr_shares = int(shares_string)
    print(nbr_shares)

You can use
(?<!\d)(?<!\d[.,])(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d+)?(?=\s*shares?\b)
Details:
(?<!\d) - no digit immediately to the left of the current location is allowed
(?<!\d[.,]) - no digit and a . or , char immediately to the left of the current location is allowed
(?: - start of a non-capturing group
\d{1,3}(?:,\d{3})* - one to three digits followed with zero or more occurrences of a comma and three digits
| - or
\d+ - one or more digits
) - end of the non-capturing group
(?:\.\d+)? - an optional occurrence of a . and one or more digits
(?=\s*shares?\b) - immediately on the right, there must be zero or more whitespaces and shares or share with a word boundary after.
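A short sketch of how this pattern could be applied in Python. The helper name find_share_numbers is hypothetical, and the conversion to int assumes the matched number has no decimal part:

```python
import re

# Pattern from the answer above: a well-formed number directly before "share(s)".
SHARES_RE = re.compile(
    r'(?<!\d)(?<!\d[.,])(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d+)?(?=\s*shares?\b)'
)

def find_share_numbers(text):
    """Return the matched numbers as strings (hypothetical helper)."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return SHARES_RE.findall(text)

text = 'ABC will issue 1,600,000 shares next week.'
numbers = find_share_numbers(text)
print(numbers)
# Remove the thousands separators before converting to int.
print([int(n.replace(',', '')) for n in numbers])
print(find_share_numbers('common shares'))
```

Since the number must start with a digit, a string such as "common shares" yields no match at all.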

Not able get desired output after string parsing through regex

input =
6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh
6:/BENM/Tabuler Trading/REM//IMP/2020-341
original_regex = 6:[A-Za-z0-9 \/\.\-:] - but this takes the full string 6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh
modified_regex_pattern = 6:[A-Za-z0-9 \/\.\-:]{1,}[\/-:]
For the first string I want the output up to
6:/BENM/Gravity Exports/REM//INV: 3267/FEB20
but it matches up to :65:
Can anyone suggest a better way to write this?
An example is here:
https://regex101.com/r/pAduvy/1

You could for example use a capturing group with an optional part at the end to match the :digits:a-z part.
(6:[A-Za-z0-9 \/.:-]+?)(?::\d+:[a-z]+)?$
( Capture group 1
6:[A-Za-z0-9 \/.:-]+? Match 6: followed by any of the characters listed in the character class, as few as possible
) Close group 1
(?::\d+:[a-z]+)? Optionally match the part at the end that you don't want to include
$ End of string
Note: not sure if intended, but the last part of your pattern, [\/-:], denotes a range from ASCII 47 (/) to 58 (:), so it also matches the digits 0-9.
Or a more precise pattern to get the match only
6:/\w+/\w+ \w+/[A-Z]+//[A-Z]+(?:: \d+)?/[A-Z]*\d+(?:-\d+)?
6:/\w+/\w+  Match 6:, then twice a / followed by 1+ word chars, and a space
\w+/[A-Z]+//[A-Z]+ Match 1+ word chars, / and uppercase chars, // and again uppercase chars
(?:: \d+)? Optionally match :, a space and 1+ digits
/[A-Z]*\d+ Match /, optional uppercase chars and 1+ digits
(?:-\d+)? Optionally match - and 1+ digits
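Both patterns can be tried on the two sample lines from the question. This is only a sketch: the function names are made up for illustration, and each line is assumed to be processed on its own.

```python
import re

# First pattern: group 1 is the part to keep; the optional tail is discarded.
TRAILING_RE = re.compile(r'(6:[A-Za-z0-9 \/.:-]+?)(?::\d+:[a-z]+)?$')
# Second, more precise pattern: the whole match is the part to keep.
PRECISE_RE = re.compile(r'6:/\w+/\w+ \w+/[A-Z]+//[A-Z]+(?:: \d+)?/[A-Z]*\d+(?:-\d+)?')

def keep_without_tail(line):
    """Return group 1 of the first pattern, or None (hypothetical helper)."""
    match = TRAILING_RE.search(line)
    return match.group(1) if match else None

def keep_precise(line):
    """Return the full match of the second pattern, or None (hypothetical helper)."""
    match = PRECISE_RE.search(line)
    return match.group(0) if match else None

lines = [
    '6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh',
    '6:/BENM/Tabuler Trading/REM//IMP/2020-341',
]
for line in lines:
    print(keep_without_tail(line), '|', keep_precise(line))
```

For these two inputs both helpers return the same text; the precise pattern is stricter and will reject lines that deviate from the sample layout.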

Fetching respective group values in a regex expression

I have an example string like below:
Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00
I can have another example string that can be like:
Unpacking/Unremoval fee Zero Rated 100.00
I am trying to access the first set of words and the last number values.
So I want the dict to be
{'Handling - Uncrating of 3 crates - USD600 each':1800.00}
or
{'Unpacking/Unremoval fee':100.00}
There might be strings where none of the above patterns (Zero Rated or something with %) present and I would skip those strings.
To do that, I was regexing the following pattern
pattern = re.search(r'(.*)Zero.*Rated\s*(\S*)',line.strip())
and then
pattern.group(1)
gives the keys for dict and
pattern.group(2)
gives the value of 1800.00. This works for lines where Zero Rated is present.
However if I want to also check for pattern where Zero Rated is not present but % is present as in first example above, I was trying to use | but it didn't work.
pattern = re.search(r'(.*)Zero.*Rated|%\s*(\S*)',line.strip())
But this time I am not getting the right groups.

Sites like regex101.com can help debug regexes.
In this case, the problem is with operator precedence; the | operates over the whole of the rest of the regex. You can group parts of the regex without creating additional groups with (?: )
Try: r'(.*)(?:Zero.*Rated|%)\s*(\S*)'
Definitely give regex101.com a go, though, it'll show you what's going on in the regex.
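The precedence issue can be seen directly by comparing the groups of the two patterns on the Zero Rated sample line (a small sketch; the variable names are illustrative):

```python
import re

line = 'Unpacking/Unremoval fee Zero Rated 100.00'

# Without grouping, | splits the WHOLE pattern into two alternatives:
#   (.*)Zero.*Rated     or     %\s*(\S*)
ungrouped = re.search(r'(.*)Zero.*Rated|%\s*(\S*)', line)
print(ungrouped.groups())  # ('Unpacking/Unremoval fee ', None)

# With (?:...), the alternation only covers "Zero.*Rated" or "%".
grouped = re.search(r'(.*)(?:Zero.*Rated|%)\s*(\S*)', line)
print(grouped.groups())    # ('Unpacking/Unremoval fee ', '100.00')
```

Note that group 1 keeps its trailing space, so a .strip() is still needed. On the % sample line the grouped pattern captures the token right after the % sign ('=126.00') and not the final amount, so that case needs a stricter pattern.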

You might use
^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})
The pattern matches
^ Start of string
(.+?) Capture group 1, match any char except a newline, as few as possible
\s* Match 0+ whitespace chars
(?: Non capture group
Zero Rated Match literally
| Or
\d+%= Match 1+ digits and %=
\d{1,3}(?:\,\d{3})*\.\d{2} Match a digit format of 1-3 digits, optionally repeated by a comma and 3 digits followed by a dot and 2 digits
) Close non capture group
\s* Match 0+ whitespace chars
(\d{1,3}(?:,\d{3})*\.\d{2}) Capture group 2, match the digit format
For example
import re
regex = r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})"
test_str = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
"Unpacking/Unremoval fee Zero Rated 100.00\n"
"Delivery Cartage - IT Equipment, up to 1000kgs - 7%=210.00 3,000.00")
print(dict(re.findall(regex, test_str, re.MULTILINE)))
Output
{'Handling - Uncrating of 3 crates - USD600 each': '1,800.00', 'Unpacking/Unremoval fee': '100.00', 'Delivery Cartage - IT Equipment, up to 1000kgs -': '3,000.00'}
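The question asked for numeric values such as 1800.00 and not strings. One possible way to get there, sketched with a hypothetical helper parse_charges, is to drop the thousands separators before calling float:

```python
import re

# Same pattern as in the answer above, compiled once with MULTILINE.
CHARGE_RE = re.compile(
    r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})",
    re.MULTILINE,
)

def parse_charges(text):
    """Map each description to its amount as a float (hypothetical helper)."""
    return {
        description: float(amount.replace(',', ''))
        for description, amount in CHARGE_RE.findall(text)
    }

sample = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
          "Unpacking/Unremoval fee Zero Rated 100.00\n"
          "A line without either marker")
print(parse_charges(sample))
```

Lines that contain neither Zero Rated nor a percentage part simply produce no match and are skipped.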

Finding exact values associated with given word using regex in python

I am trying to find values associated with a particular word using regex but am not getting the expected results.
I wrote a pattern that works for one standard input format only, and I want to do the same for all sorts of inputs.
What I have now:
string = r'''results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6, Hgb=11.3gm/dL'''
Pattern which I wrote:
re.findall(r'{}=(.*)'.format(detected_word), search_query)[0].split(',')[0]
detected_word is a variable holding the left-hand side of the equals sign (WBC, RBC, ...), which I detect using another technique.
In the case above it works fine, but if I change the sentence pattern as shown below, I am unable to find a generic pattern.
string = r'''results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6 and Hgb=11.3gm/dL'''
string = r'''results for WBC, RBC and Hgb are 8.110*3, 3.3010*6 and 11.3gm/dL'''
Whatever the string format, I am able to detect the words WBC, RBC, and Hgb, but detecting the value for each associated word is the problem.
Could anyone please help me with this?
Thanks in advance

Here is an idea: use two separate patterns for the strings you provided as sample input, the first one will extract values coming after expected word= and the other will extract them from clauses of expected word1 + optional expected word2 + optional expected word3 + "to be" verb + value1, optional value2 and optional value3.
Pattern 1:
\b(WBC|RBC|Hgb)=(\S*)\b
\b(WBC|RBC|Hgb) - a whole word WBC, RBC or Hgb
= - a = char
(\S*)\b - Group 2: 0 or more non-whitespaces, that stops at last word boundary position
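A minimal sketch of Pattern 1 in use, with the word list hard-coded as in the answer (the helper name values_from_assignments is an assumption for illustration):

```python
import re

PATTERN_1 = re.compile(r'\b(WBC|RBC|Hgb)=(\S*)\b')

def values_from_assignments(text):
    """Return {word: value} for every word=value pair found (hypothetical helper)."""
    # findall yields (word, value) tuples, which dict() accepts directly.
    return dict(PATTERN_1.findall(text))

s1 = 'results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6, Hgb=11.3gm/dL'
s2 = 'results on 12/28/2012: WBC=8.110*3, RBC=3.3010*6 and Hgb=11.3gm/dL'
print(values_from_assignments(s1))
print(values_from_assignments(s2))
```

The trailing \b makes the value stop at its last word character, so the comma after 8.110*3 is not included.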
Pattern 2:
\b(WBC|RBC|Hgb)(?:(?:\s+and)?(?:\s*,)?\s+(WBC|RBC|Hgb))?(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))?\s+(?:is|(?:a|we)re|was|will\s+be)(?:\s*,)?\s*(\d\S*)\b(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?
\b(WBC|RBC|Hgb) - Group 1 capturing the searched word
(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))? - an optional pattern:
(?:\s+and)? - an optional sequence of 1+ whitespaces and then and
(?:\s*,)? - an optional sequence of 0+ whitespaces and then a comma
\s*(WBC|RBC|Hgb) - 0+ whitespaces and Group 2 capturing the searched word
(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))? - same as above, captures the 3rd optional searched word into Group 3
\s+ - 1+ whitespaces
(?:is|(?:a|we)re|was|will\s+be) - a VERB, you may add more if you expect them to be at this position, or plainly try a \S+ or \w+ pattern instead
(?:\s*,)?\s* - an optional 0+ whitespaces and a comma sequence, then 0+ whitespaces
(\d\S*)\b - Group 4 (pair it with Group 1 value): a digit and then 0+ non-whitespace chars limited by a word boundary
(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)? - an optional group matching
(?:\s+and)? - an optional sequence of 1+ whitespaces and and
(?:\s*,)?\s* - an optional 0+ whitespaces and a comma, then 0+ whitespaces
(\d\S*)\b - Group 5 (pair it with Group 2 value): a digit and then 0+ non-whitespace chars limited by a word boundary
(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)? - same as above, with a capture group 6 that must be paired with Group 3.
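The group pairing described above (1 with 4, 2 with 5, 3 with 6) might be implemented along these lines. The pattern is the one from the answer, split across adjacent string literals for readability; the helper name values_from_clause is hypothetical:

```python
import re

PATTERN_2 = re.compile(
    r'\b(WBC|RBC|Hgb)'
    r'(?:(?:\s+and)?(?:\s*,)?\s+(WBC|RBC|Hgb))?'
    r'(?:(?:\s+and)?(?:\s*,)?\s*(WBC|RBC|Hgb))?'
    r'\s+(?:is|(?:a|we)re|was|will\s+be)'
    r'(?:\s*,)?\s*(\d\S*)\b'
    r'(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?'
    r'(?:(?:\s+and)?(?:\s*,)?\s*(\d\S*)\b)?'
)

def values_from_clause(text):
    """Pair groups 1-3 (words) with groups 4-6 (values); hypothetical helper."""
    match = PATTERN_2.search(text)
    if not match:
        return {}
    groups = match.groups()
    words, values = groups[:3], groups[3:]
    # Optional groups that did not take part in the match are None; skip them.
    return {w: v for w, v in zip(words, values) if w and v}

s3 = 'results for WBC, RBC and Hgb are 8.110*3, 3.3010*6 and 11.3gm/dL'
print(values_from_clause(s3))
```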

Key error when using regex quantifier python

I am trying to capture words following specified stocks in a pandas df. I have several stocks in the format $IBM and am setting a python regex pattern to search each tweet for 3-5 words following the stock if found.
My df called stock_news looks as such:
Word Count
0 $IBM 10
1 $GOOGL 8
etc
pattern = ''
for word in stock_news.Word:
    pattern += '{} (\w+\s*\S*){3,5}|'.format(re.escape(word))
My understanding is that {3,5} should be a quantifier, in my case matching between 3 and 5 times. However, I receive the following KeyError:
KeyError: '3,5'
I have also tried using rawstrings with r'{} (\w+\s*\S*){3,5}|' but to no avail. I also tried using this pattern on regex101 and it seems to work there but not in my Pycharm IDE. Any help would be appreciated.
Code for finding:
pat = re.compile(pattern, re.I)
for i in tweet_df.Tweets:
    for x in pat.findall(i):
        print(x)

When you build your pattern, an empty alternative is left at the end (after the last |), so the pattern can also match the empty string at every position where nothing else matches.
You need to build the pattern like
(?:\$IBM|\$GOOGLE)\s+(\w+(?:\s+\S+){3,5})
You may use
pattern = r'(?:{})\s+(\w+(?:\s+\S+){{3,5}})'.format(
"|".join(map(re.escape, stock_news['Word'])))
Mind that the literal curly braces inside an f-string or a format string must be doubled.
Regex details
(?:\$IBM|\$GOOGLE) - a non-capturing group matching either $IBM or $GOOGLE
\s+ - 1+ whitespaces
(\w+(?:\s+\S+){3,5}) - Capturing group 1 (when using str.findall, only this part will be returned):
\w+ - 1+ word chars
(?:\s+\S+){3,5} - a non-capturing group matching three, four or five occurrences of 1+ whitespaces followed by 1+ non-whitespace characters
Note that non-capturing groups are meant to group some patterns, or quantify them, without actually allocating any memory buffer for the values they match, so that you could capture only what you need to return/keep.
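To illustrate the brace doubling without depending on pandas, here is a small sketch that takes the ticker symbols from a plain list; the helper name build_pattern and the sample tweets are assumptions, not part of the original post:

```python
import re

def build_pattern(words):
    """Compile one alternation over all escaped words (hypothetical helper)."""
    alternation = "|".join(map(re.escape, words))
    # {{3,5}} is rendered by str.format() as the literal quantifier {3,5};
    # the single {} is the only replacement field.
    return re.compile(
        r'(?:{})\s+(\w+(?:\s+\S+){{3,5}})'.format(alternation), re.I
    )

pat = build_pattern(['$IBM', '$GOOGL'])
print(pat.pattern)
print(pat.findall('Buying $IBM today because earnings look very strong this quarter'))
print(pat.findall('$IBM up today'))  # too few words follow, so no match
```

Joining the escaped words with | avoids the trailing empty alternative that the original loop produced.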
