Extract salaries from a list of strings

I'm trying to extract salaries from a list of strings.
I'm using the regex findall() function but it's returning many empty strings as well as the salaries and this is causing me problems later in my code.
import re
sal = '41 000€ à 63 000€ / an'  # this is a sample string for which I have errors
regex = ' ?([0-9]* ?[0-9]?[0-9]?[0-9]?)'  # this is my regex
re.findall(regex,sal)[0]
#returns '41 000' as expected but:
re.findall(regex,sal)[1]
#returns: ''
#Desired result : '63 000'
#the whole list of matches is like this:
['41 000',
'',
'',
'',
'',
'',
'',
'63 000',
'',
'',
'',
'',
'',
'',
'',
'',
'']
# I would prefer ['41 000','63 000']
Can anyone help?
Thanks

When a pattern contains capturing groups, re.findall returns the contents of those groups. In your pattern almost everything inside the group is optional, so the group can match the empty string at nearly every position, and those empty matches are the empty strings in the result.
Your pattern uses [0-9]*, which matches a digit zero or more times. If there is no limit on the number of leading digits, use [0-9]+ instead so that at least one digit is required.
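To see where the empty strings come from, here is a minimal sketch using the sample string from the question (the variable names are illustrative only). A group in which everything is optional can succeed without consuming anything, whereas requiring at least one digit leaves only the numbers:

```python
import re

sal = '41 000€ à 63 000€ / an'

# Everything inside the group is optional, so the group can match nothing at all.
optional_everything = re.findall(r' ?([0-9]* ?[0-9]?[0-9]?[0-9]?)', sal)

# Requiring at least one leading digit rules out the empty matches.
at_least_one_digit = re.findall(r'([0-9]+(?: [0-9]{1,3})?)', sal)

print('' in optional_everything)  # True
print(at_least_one_digit)         # ['41 000', '63 000']
```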
You might use this pattern with a capturing group:
(?<!\S)([0-9]+(?: [0-9]{1,3})?)€(?!\S)
Explanation
(?<!\S) Assert that what is directly to the left is not a non-whitespace character
( Capture group 1
[0-9]+(?: [0-9]{1,3})? Match 1+ digits followed by an optional part that matches a space and 1-3 digits
) Close the capture group
€ Match the € sign literally
(?!\S) Assert that what is directly to the right is not a non-whitespace character
Your code might look like:
import re
sal= '41 000€ à 63 000€ / an' #this is a sample string for which i have errors
regex = r'(?<!\S)([0-9]+(?: [0-9]{1,3})?)€(?!\S)'  # raw string, so \S is not treated as a string escape
print(re.findall(regex,sal)) # ['41 000', '63 000']
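Since the question mentions a list of strings and later processing, the pattern could be wrapped in a small helper. This is only a sketch: the name extract_salaries, the conversion to int, and the second and third sample strings are assumptions added for illustration and are not part of the original question.

```python
import re

# Same pattern as above, compiled once for reuse.
SALARY_RE = re.compile(r'(?<!\S)([0-9]+(?: [0-9]{1,3})?)€(?!\S)')

def extract_salaries(text):
    """Return every salary found in text as an int (hypothetical helper)."""
    # Drop the space used as a thousands separator before converting.
    return [int(match.replace(' ', '')) for match in SALARY_RE.findall(text)]

# Only the first string comes from the question; the others are made up.
offers = ['41 000€ à 63 000€ / an', '2 500€ / mois', 'salaire non précisé']
print([extract_salaries(offer) for offer in offers])
# [[41000, 63000], [2500], []]
```

Strings without a salary simply produce an empty list, so no empty strings leak into later code.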

Related

How to select for the longest match from regex

I have a long string that I need to split into separate strings. I have written up a regex pattern shown below. The problem right now is that my long string is split into the smaller strings but with duplicates.
That is what my code looks like:
import re
teststring = '''#first-error-type: cjjyr901-d374-jfh73kf8k,
#second-err, #some-other-error : jksdf89-123r-e3-1345r, 99f7yr901-374-jfh73kf8k
#third-errortype cjjyr901-d374-jfh73kf8k, ksdf89-123r-e3-1345r, 99f7yr901-374-jfh73kf8k'
'''
new = re.sub(r'\((.+)\)', '', teststring)
#remove parentheses
new = re.sub(':', '', new)
#removing stray colons
new = re.sub(r'([#][\w]*(-[\w]*)*[,]*)', r'\1:', new)
#adding colons
new = re.split(r'(([#][\w]*)*(-[\w]*)*[, :]([\w]*[-, ]*)*)', new)
Due to the inconsistencies in the text, I have to do some preliminary cleaning at the beginning to remove stray colons and then add them back later.
Currently, this is my output:
['', '#first-error-type: cjjyr901-d374-jfh73kf8k, ', '#first', '-type', '', '\n', '#second-err,', '#second', '-err', '', '', ': ', None, None, '', '', '#some-other-error: jksdf89-123r-e3-1345r , 99f7yr901-374-jfh73kf8k ', '#some', '-error', '', '\n', '#third-errortype: cjjyr901-d374-jfh73kf8k, ksdf89-123r-e3-1345r, 99f7yr901-374-jfh73kf8k', '#third', '-errortype', '', "'\n"]
The output I am expecting is:
['#first-error-type: cjjyr901-d374-jfh73kf8k',
'#second-err, #some-other-error: jksdf89-123r-e3-1345r (some kind of note), 99f7yr901-374-jfh73kf8k',
'#third-errortype: cjjyr901-d374-jfh73kf8k, ksdf89-123r-e3-1345r, 99f7yr901-374-jfh73kf8k']
It seems I have made a few mistakes in the pattern, so two hashtag comments are not grouped together to receive just one colon. Also, the output is split into duplicated segments of varying length.

You can use
import re
teststring = '''#first-error-type: cjjyr901-d374-jfh73kf8k,
#second-err, #some-other-error : jksdf89-123r-e3-1345r (some kind of note), 99f7yr901-374-jfh73kf8k
#third-errortype cjjyr901-d374-jfh73kf8k, ksdf89-123r-e3-1345r, 99f7yr901-374-jfh73kf8k'
'''
new = re.sub(
r'(#\w+(?:-\w+)*)(?=[:\s]+\w)[^\S\n]*(?::[^\S\n]*)?|,[^\S\n]*(?=\n)',
lambda x: f'{x.group(1)}: ' if x.group(1) else '',
teststring)
print( list(map(lambda x: x.strip(), re.findall(r'#\w+(?:-\w+)*[,: ][^"\'\r\n]*', new))) )
Output:
['#first-error-type: cjjyr901-d374-jfh73kf8k', '#second-err, #some-other-error: jksdf89-123r-e3-1345r (some kind of note), 99f7yr901-374-jfh73kf8k', '#third-errortype: cjjyr901-d374-jfh73kf8k, ksdf89-123r-e3-1345r, 99f7yr901-374-jfh73kf8k']
The (#\w+(?:-\w+)*)(?=[:\s]+\w)[^\S\n]*(?::[^\S\n]*)?|,[^\S\n]*(?=\n) regex matches
(#\w+(?:-\w+)*)(?=[:\s]+\w)[^\S\n]*(?::[^\S\n]*)? -
(#\w+(?:-\w+)*) - Group 1: a # char followed with one or more word chars followed with zero or more repetitions of a - and again one or more word chars
(?=[:\s]+\w) - a positive lookahead that requires one or more whitespace or : chars followed by a word char immediately to the right of the current location
[^\S\n]* - zero or more whitespace chars other than an LF char
(?::[^\S\n]*)? - an optional occurrence of a : followed by zero or more whitespace chars other than an LF char
| - or
,[^\S\n]*(?=\n) - a comma and zero or more whitespace chars other than an LF char, followed by an LF char (which is not consumed)
The match is removed if Group 1 did not participate in the match; otherwise, it is replaced with Group 1 plus a colon and a space.
The #\w+(?:-\w+)*[,: ][^"\'\r\n]* regex used in re.findall matches
#\w+(?:-\w+)* - a #, one or more word chars, and zero or more occurrences of a - and one or more word chars
[,: ] - a space, comma or colon
[^"\'\r\n]* - zero or more chars other than ", ', CR and LF.
The map(lambda x: x.strip(), ...) call is used to strip whitespace, mainly from the end of the matches, since the [^"\'\r\n]* negated character class can match whitespace freely.
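As a side note on the duplicated segments and None values in the question's output: re.split also returns the text of every capturing group in the pattern, and groups that did not take part in a match are reported as None. A minimal sketch on a made-up string (the input 'a-b,c' and the variable names are illustrative only):

```python
import re

text = 'a-b,c'

# No capturing group: only the pieces between separators are returned.
plain = re.split(r'[-,]', text)

# One capturing group: the separator text is inserted into the result too.
captured = re.split(r'([-,])', text)

# Two groups in an alternation: the group that did not match shows up as None.
alternated = re.split(r'(-)|(,)', text)

# A non-capturing group keeps the grouping without the extra items.
non_capturing = re.split(r'(?:-|,)', text)

print(plain)          # ['a', 'b', 'c']
print(captured)       # ['a', '-', 'b', ',', 'c']
print(alternated)     # ['a', '-', None, 'b', None, ',', 'c']
print(non_capturing)  # ['a', 'b', 'c']
```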

Using regex to extract characters either side of a match

I have a string:
test=' 40 virtual asset service providers law, 2020e section 1 c law 14 of 2020 page 5 cayman islands'
I want to match all occurrences of a digit, then print not just the digit but the three characters either side of the digit.
At the moment, using re I have matched the digits:
print(re.findall(r'\d+', test))
['40', '2020', '1', '14', '2020', '5']
I want it to return:
[' 40 v', 'w, 2020e s', 'aw 14 of', 'of 2020 ', 'ge 5 c']

Use . to match any character and {0,3} to match up to 3 characters on each side:
print(re.findall(r'.{0,3}\d+.{0,3}', test))

re.findall(r".{0,3}\d+.{0,3}", test)
The greedy {0,3} quantifier matches at most 3 characters.

Here you go:
re.findall('[^0-9]{0,3}[0-9]+[^0-9]{0,3}', test)
[EDIT]
Breaking the pattern down:
'[^0-9]{0,3}' matches up to 3 non-digit characters
'[0-9]+' matches one or more digits
The final pattern '[^0-9]{0,3}[0-9]+[^0-9]{0,3}' matches one or more digits surrounded by up to 3 non-digits on either side.
To reduce confusion, I favor '[^0-9]{0,3}' over '.{0,3}' (as mentioned in other answers) in the pattern, because it states explicitly that non-digits need to be matched. '.' could be confusing because it matches any character except a newline (including digits).
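The difference between the two patterns only shows when numbers sit close together. The sketch below compares them; the string 'a1b2c' and the variable names are made up for illustration, while test is the string from the question.

```python
import re

test = ' 40 virtual asset service providers law, 2020e section 1 c law 14 of 2020 page 5 cayman islands'

dot_pattern = re.compile(r'.{0,3}\d+.{0,3}')
non_digit_pattern = re.compile(r'[^0-9]{0,3}[0-9]+[^0-9]{0,3}')

# In the question's string the numbers are far enough apart that both agree.
print(dot_pattern.findall(test) == non_digit_pattern.findall(test))  # True

# Hypothetical input where the numbers are close together:
close = 'a1b2c'
print(dot_pattern.findall(close))        # ['a1b2c']  ('.' swallowed the digit 1 as context)
print(non_digit_pattern.findall(close))  # ['a1b', '2c']
```

Note that neither pattern lets two matches overlap, so the context of one number is never reused for the next.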

Regular expression finding '\n'

I'm in the process of making a program to pattern match phone numbers in text.
I'm loading this text:
(01111-222222)fdf
01111222222
(01111)222222
01111 222222
01111.222222
Into a variable, and using "findall" it's returning this:
('(01111-222222)', '(01111', '-', '222222)')
('\n011112', '', '\n', '011112')
('(01111)222222', '(01111)', '', '222222')
('01111 222222', '01111', ' ', '222222')
('01111.222222', '01111', '.', '222222')
This is my expression:
ex = re.compile(r"""(
(\(?0\d{4}\)?)? # Area code
(\s*\-*\.*)? # separator
(\(?\d{6}\)?) # Local number
)""", re.VERBOSE)
I don't understand why the '\n' is being caught.
If the * in '\\.*' is replaced by '+', the expression works as I want. It also works if I simply remove the * (accepting that the two sets of numbers can then be separated by only a single period).

The \s matches both horizontal and vertical whitespace characters, including the newline. With re.VERBOSE enabled, you can match a regular space with an escaped space (a backslash followed by a space). Or, you may exclude \r and \n from \s with [^\S\r\n] to match only horizontal whitespace.
Use
ex = re.compile(r"""(
(\(?0\d{4}\)?)? # Area code
([^\S\r\n]*-*\.*)? # separator: horizontal whitespace only (changed here)
(\(?\d{6}\)?) # Local number
)""", re.VERBOSE)
Also, the - outside a character class does not require escaping.
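To check the effect of the change, the sketch below runs both versions of the expression over the sample text from the question. The variable names are illustrative only, and only the outermost group of each match is kept.

```python
import re

text = "(01111-222222)fdf\n01111222222\n(01111)222222\n01111 222222\n01111.222222"

original = re.compile(r"""(
    (\(?0\d{4}\)?)?       # area code
    (\s*-*\.*)?           # separator: \s also matches newlines
    (\(?\d{6}\)?)         # local number
)""", re.VERBOSE)

fixed = re.compile(r"""(
    (\(?0\d{4}\)?)?       # area code
    ([^\S\r\n]*-*\.*)?    # separator: horizontal whitespace only
    (\(?\d{6}\)?)         # local number
)""", re.VERBOSE)

# findall returns one tuple per match; index 0 is the outermost group.
original_matches = [groups[0] for groups in original.findall(text)]
fixed_matches = [groups[0] for groups in fixed.findall(text)]

print(any('\n' in m for m in original_matches))  # True
print(fixed_matches)
# ['(01111-222222)', '01111222222', '(01111)222222', '01111 222222', '01111.222222']
```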

Regex one-liner for matching only what comes after a certain word?

I want to extract song names from a list like this: 'some text here, songs: song1, song2, song3, fro: othenkl' and get ['song1', 'song2', 'song3']. So I try to do it in one regex:
result = re.findall(r'[Ss]ongs?:?.*', 'songs: songname1, songname2,')
print(re.findall(r'(?:(\w+),)*', result[0]))
This matches perfectly: ['', '', '', '', '', '', '', 'songname1', '', 'songname2', ''] (except for the empty strings, but that is no big deal).
But I want to do it in one line, so I do the following:
print(re.findall(r'[Ss]ongs?:?(?:(\w+),)*', 'songs: songname1, songname2,'))
But I do not understand why this is unable to capture the same as the two regexes above:
['', 'name1', 'name2']
Is there a way to accomplish this in one line? It would be useful to be concise here. thanks.

You don't need re.findall in this case. It is better to use re.search to find the sequence of songs and then split the result on the comma. You also don't need the character class [Ss] to match the capital letter; you can use the ignore-case flag (re.I) instead:
>>> s ='some text here, songs: song1, song2, song3, fro: othenkl'
>>> re.search(r'(?<=songs:)(.+),', s,flags=re.I).group(1).split(',')
[' song1', ' song2', ' song3']
(?<=songs:) is a positive lookbehind that makes the regex engine match only text preceded by songs:, and (.+), matches the longest string after songs: that is followed by a comma, which is your sequence of songs.
As a more general approach, instead of requiring a comma at the end of the regex, you can rely on the fact that the song names are followed by the pattern \s\w+:.
>>> re.search(r'(?<=songs:)(.+)(?=\s\w+:)', s).group(1).split(',')
[' song1', ' song2', ' song3', '']
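Both results above still carry leading spaces, and the second one ends with an empty string. One possible variation, offered only as a sketch (the pattern and variable names below are not from the answer), captures the comma-terminated names after songs: and then pulls out the words:

```python
import re

s = 'some text here, songs: song1, song2, song3, fro: othenkl'

# Capture the run of "word," items that follows "song:" or "songs:".
# 'fro' is not captured because it is followed by ':' rather than ','.
m = re.search(r'songs?:\s*((?:\w+,\s*)+)', s, flags=re.I)

# Extract the individual names; fall back to an empty list if nothing matched.
songs = re.findall(r'\w+', m.group(1)) if m else []
print(songs)  # ['song1', 'song2', 'song3']
```

This assumes that song names consist of word characters only; names containing spaces would need a different inner pattern.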

No, you can't do it in one pattern with the re module.
What you can do is use the third-party regex module instead, with this pattern:
regex.findall(r'(?:\G(?!\A), |\msongs: )(\w++)(?!:)', s)
Here \G is the position after the previous match, \A is the start of the string, \m matches at the start of a word, and ++ is a possessive quantifier.

Confusing with the usage of regex in Python

I'm confused by the following three patterns. Would someone explain them in more detail?
## IPython with Python 2.7.3
In [62]: re.findall(r'[a-z]*',"f233op")
Out[62]: ['f', '', '', '', 'op', ''] ## why does the last '' come out?
In [63]: re.findall(r'([a-z])*',"f233op")
Out[63]: ['f', '', '', '', 'p', ''] ## why does the character 'o' get lost?
In [64]: re.findall(r'([a-z]*)',"f233op")
Out[64]: ['f', '', '', '', 'op', ''] ## how is this different from line 63 above?

Example 1
re.findall(r'[a-z]*',"f233op")
This pattern matches zero or more lowercase letters. The ZERO-or-more part is key here, since a match of nothing, starting from any index position in the string, is just as valid as a match of f or op. The last empty string returned is the match starting at the end of the string (the position between p and $, the end of the string).
Example 2
re.findall(r'([a-z])*',"f233op")
Now you are matching character groups, each consisting of a single lowercase letter. The o is no longer returned because the quantifier is greedy and only the last text captured by the repeated group is returned. So if you changed the string to f233op12fre, the final e would be returned, but not the preceding f or r. Likewise, if you take the p out of your string, you will see that o is returned as a valid match.
Conversely, if you tried to make this regex non-greedy by adding a ? (e.g. ([a-z])*?), the returned matches would all be empty strings, since a valid match of nothing takes precedence over a valid match of something.
Example 3
re.findall(r'([a-z]*)',"f233op")
Nothing is different in the matched characters, but now you are returning the contents of a capturing group instead of the raw matches. The output of this pattern is the same as in your first example, but if you add a second capturing group, the results of each match are suddenly grouped into tuples:
IN : re.findall(r'([a-z]*)([0-9]*)',"f233op")
OUT: [('f', '233'), ('op', ''), ('', '')]
Contrast this with the same pattern, minus the parentheses (groups), and you'll see why they are important:
IN : re.findall(r'[a-z]*[0-9]*',"f233op")
OUT: ['f233', 'op', '']
Also...
It can be useful to plug regex patterns like these into regex diagram generators like Regexplained to see how the pattern matching logic works. For example, to see why your regex always returns empty string matches, take a look at the difference between the patterns [a-z]* and [a-z]+.
Don't forget to check the Python docs for the re library if you get stuck; they give a pretty stellar explanation of the standard regex syntax.

You get the final '' because [a-z]* is matching the empty string at the end.
The character 'o' is missing because you have told re.findall to match groups, and each group has a single character. Put another way, you’re doing the equivalent of
m = re.match(r'([a-z])*', 'op')
m.group(1)
which will return 'p', because that’s the last thing captured by the parens (capture group 1).
Again, you’re matching groups, but this time multi-character groups.
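To make the difference between the whole match and the captured group visible, here is a small sketch using re.finditer on the string from the question (the variable name pairs is illustrative only). re.findall reports a group that did not participate as '', while the match object reports it as None:

```python
import re

# For each match, record the whole match (group 0) and the last capture of group 1.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r'([a-z])*', 'f233op')]
print(pairs)
# [('f', 'f'), ('', None), ('', None), ('', None), ('op', 'p'), ('', None)]
```

The fifth match covers 'op' as a whole, but the repeated group was overwritten on each iteration and ends up holding only 'p'.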

Your surprising results are related to the Regular Expression Quantifier *.
Consider:
[a-z]*
Vs:
[a-z]+
Consider another example that I think better illustrates what you are seeing:
>>> re.findall(r'[a-z]*', '123456789')
['', '', '', '', '', '', '', '', '', '']
There are no characters in the set [a-z] in the string 123456789. Yet, since * means 'zero or more', all character positions 'match' by not matching any characters at that position.
For example, assume you just wanted to test if there were any letters in a string, and you use a regex like so:
>>> re.search(r'[a-z]*', '1234')
<_sre.SRE_Match object at 0x1069b6988> # a 'match' is returned, but this is
# probably not what was intended
Now consider:
>>> re.findall(r'[a-z]*', '123abc789')
['', '', '', 'abc', '', '', '', '']
Vs:
>>> re.findall(r'([a-z])*', '123abc789')
['', '', '', 'c', '', '', '', '']
The first pattern is [a-z]*. The part [a-z] is a character class that matches a single character in the set a-z. Adding the * quantifier makes it greedily match as many such characters as possible, hence the match of 'abc', but it also allows zero characters to count as a match, so a position followed by a character outside the set still matches with an empty string.
Adding a capturing group, as in ([a-z])*, effectively reduces what is returned for the quantified set back to a single character: the last character matched by the group.
If you want the effect of grouping (say, in a more complex pattern), use a non-capturing group:
>>> re.findall(r'(?:[a-z])*', '123abc789')
['', '', '', 'abc', '', '', '', '']
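If the empty matches are not wanted at all, the + quantifier discussed above requires at least one character. A minimal sketch (the helper name has_lowercase_letter is made up for illustration):

```python
import re

def has_lowercase_letter(text):
    """Return True only if text contains at least one character in a-z."""
    return re.search(r'[a-z]+', text) is not None

print(has_lowercase_letter('1234'))        # False
print(has_lowercase_letter('123abc789'))   # True
print(re.findall(r'[a-z]+', '123abc789'))  # ['abc']
print(re.findall(r'[a-z]+', 'f233op'))     # ['f', 'op']
```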

In line 63 you're finding all instances of a group (indicated with the parens) of characters of length 1. The * isn't doing much for you here (just causing you to match zero length groups).
In the other examples having the * next to the [a-z], you match adjacent characters of any length.
