I have a lot of file names with the pattern SURNAME__notalwaysmiddlename_firstnames_1230123Abc123-16x_notalways.pdf, e.g.:
SMITH_John_001322Cde444-16v_HA.pdf
FLORRICK-DOILE_Debora_Alicia_321333Gef213-16p.pdf
ROBINSON-SMITH_Maria-Louise_321333Gef213-16p_GH.pdf
My old regex was ([\w]*)_([\w-\w]+)\.\w+ but after switching to Python and getting the first double-barrelled surnames (and even in the first names) I'm unable to get it running.
With the old regex I got two groups:
SMITH_John
001322Cde444-16v_HA
But now I have no clue how to achieve this with re and even include the occasional double-barrelled names in group 1 and the ID in group 2.
([A-Z-]+)(?:_([A-Za-z-]+))?_([A-Za-z-]+)_(\d.*)\.
This pattern will return the surname, potential middle name, first name, and final string.
([A-Z-]+) returns an upper-case word that can also contain -
(?:_([A-Za-z-]+))? returns 0 or 1 matches of a word preceded by an _. The (?:...) wrapper is non-capturing, so the _ is matched but not captured
([A-Za-z-]+) returns a word that can also contain -
(\d.*) returns a string that starts with a digit
\. finds the escaped period right before the file type
Note the class is [A-Za-z-] rather than [A-z-]: the range A-z also covers the ASCII characters between Z and a, including _, so it would match across the separators.
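As a rough sketch of how the pattern above could be used to get the two values the question asks for (name and ID), assuming the [A-Za-z-] form of the character class; the helper name split_filename is made up for illustration:

```python
import re

# Assumption: [A-Za-z-] is used instead of [A-z-], which would also match "_".
FILENAME_RE = re.compile(r"([A-Z-]+)(?:_([A-Za-z-]+))?_([A-Za-z-]+)_(\d.*)\.")

def split_filename(name):
    """Return (full_name, ident) for a file name, or None if it does not match."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    surname, middle, first, ident = m.groups()
    # The optional group is None when absent, so drop empty parts before joining.
    full_name = "_".join(part for part in (surname, middle, first) if part)
    return full_name, ident

for name in ("SMITH_John_001322Cde444-16v_HA.pdf",
             "FLORRICK-DOILE_Debora_Alicia_321333Gef213-16p.pdf",
             "ROBINSON-SMITH_Maria-Louise_321333Gef213-16p_GH.pdf"):
    print(split_filename(name))
```

Joining the name parts afterwards keeps the regex simple while still giving a single name value and a single ID value.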
I want to generate a username from an email address, made of:
the first letter of the first name
the first 7 letters of the last name
e.g.:
getUsername("my-firstname.my-lastname#email.com")
mmylastn
Here is getUsername's code :
def getUsername(email):
    return re.match(r"(.){1}[a-z]+.([a-z]{7})", email.replace('-', '')).group()
email.replace('-','') gets rid of the - symbol
the regex captures the 2 groups I described above
If I do .group(1,2) I can see the captured groups are m and mylastn, so it's all good.
But .group() doesn't return just the capturing groups; it also returns everything between them: myfirstname.mylastn
Can someone explain this behavior to me?
First of all, a . in a pattern is a metacharacter that matches any char except line break chars. To match a literal dot, you need to escape the . in the regex pattern.
Also, the {1} limiting quantifier is always redundant; you may safely remove it from any regex you have.
Next, if you need to get a mmylastn string as a result, you cannot use match.group(), because .group() fetches the overall match value, not the concatenated capturing group values.
So, in your case,
Check if there is a match first; trying to access None.groups() will throw an exception (AttributeError)
Then join the match.groups()
You can use
import re

def getUsername(email):
    m = re.match(r"(.)[a-z]+\.([a-z]{7})", email.replace('-', ''))
    if m:
        return "".join(m.groups())
    return email

print(getUsername("my-firstname.my-lastname#email.com"))
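To make the difference between the accessors concrete, here is a small sketch comparing .group(), .group(1, 2) and .groups() on the same match object. It reuses the sample address from the question; the variable names are only for illustration.

```python
import re

email = "my-firstname.my-lastname#email.com"
m = re.match(r"(.)[a-z]+\.([a-z]{7})", email.replace("-", ""))

whole = m.group()             # same as m.group(0): the entire matched text
pair = m.group(1, 2)          # tuple with only the two captured groups
joined = "".join(m.groups())  # captured groups concatenated

print(whole)   # myfirstname.mylastn
print(pair)    # ('m', 'mylastn')
print(joined)  # mmylastn
```

Everything the pattern matches ends up in group 0, whether or not it sits inside parentheses, which is why the text between the two groups shows up in .group().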
I am trying to capture words following specified stocks in a pandas df. I have several stocks in the format $IBM and am setting a python regex pattern to search each tweet for 3-5 words following the stock if found.
My df called stock_news looks as such:
     Word  Count
0    $IBM     10
1  $GOOGL      8
etc
pattern = ''
for word in stock_news.Word:
    pattern += '{} (\w+\s*\S*){3,5}|'.format(re.escape(word))
However, my understanding is that the curly braces should act as a quantifier, in my case matching between 3 and 5 times. Instead, I receive the following KeyError:
KeyError: '3,5'
I have also tried using raw strings with r'{} (\w+\s*\S*){3,5}|' but to no avail. I also tried this pattern on regex101 and it seems to work there, but not in my PyCharm IDE. Any help would be appreciated.
Code for finding:
pat = re.compile(pattern, re.I)
for i in tweet_df.Tweets:
    for x in pat.findall(i):
        print(x)
When you build your pattern, there is an empty alternative left at the end (after the trailing |), so the pattern effectively matches any string: it matches the empty string at every position where none of the stock alternatives match.
You need to build the pattern like
(?:\$IBM|\$GOOGL)\s+(\w+(?:\s+\S+){3,5})
You may use
pattern = r'(?:{})\s+(\w+(?:\s+\S+){{3,5}})'.format(
    "|".join(map(re.escape, stock_news['Word'])))
Mind that literal curly braces inside an f-string or a format string must be doubled. Otherwise, str.format treats {3,5} as a replacement field named 3,5, which is what raises the KeyError.
Regex details
(?:\$IBM|\$GOOGL) - a non-capturing group matching either $IBM or $GOOGL
\s+ - 1+ whitespaces
(\w+(?:\s+\S+){3,5}) - Capturing group 1 (when using str.findall, only this part will be returned):
    \w+ - 1+ word chars
    (?:\s+\S+){3,5} - a non-capturing* group matching three, four or five occurrences of 1+ whitespaces followed by 1+ non-whitespace characters
* Note that non-capturing groups are meant to group some patterns, or quantify them, without actually allocating any memory buffer for the values they match, so that you can capture only what you need to return/keep.
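A minimal, pandas-free sketch of the same construction. The list words stands in for stock_news['Word'] and the tweet text is invented for illustration; neither comes from the original data.

```python
import re

# Hypothetical stand-in for stock_news['Word'].
words = ["$IBM", "$GOOGL"]

# Doubled braces {{3,5}} survive str.format as the literal quantifier {3,5}.
pattern = r'(?:{})\s+(\w+(?:\s+\S+){{3,5}})'.format(
    "|".join(map(re.escape, words)))
pat = re.compile(pattern, re.I)

# Made-up tweet text, only to exercise the pattern.
tweet = "Analysts say $IBM shares could rise sharply next quarter after earnings"
found = pat.findall(tweet)

print(pattern)  # (?:\$IBM|\$GOOGL)\s+(\w+(?:\s+\S+){3,5})
print(found)    # ['shares could rise sharply next quarter']
```

Because the alternatives are joined with | instead of appended with a trailing |, there is no empty alternative, and findall only returns the text captured after a stock symbol.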
I have strings of as and bs. I want to extract all overlapping subsequences, where a subsequence is a single a surrounded by any number of bs. This is the regex I wrote:
import re
pattern = """(?=      # inside lookahead for overlapping results
  (?:a|^)             # match at beginning of str or after a
  (b* (?:a) b*)       # one a between any number of bs
  (?:a|$))            # at end of str or before next a
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
It seems to work as expected, except when the very first character in the string is an a, in which case this subsequence is missed:
a_between_bs.findall("bbabbba")
# ['bbabbb', 'bbba']
a_between_bs.findall("abbabb")
# ['bbabb']
I don't understand what is happening. If I change the order of how a potential match could start, the results also change:
pattern = """(?=
  (?:^|a)             # a and ^ swapped
  (b* (?:a) b*)
  (?:a|$))
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
a_between_bs.findall("abbabb")
# ['abb']
I would have expected this to be symmetric, so that strings ending in an a might also be missed, but this doesn't appear to be the case. What is going on?
Edit:
I assumed that solutions to the toy example above would translate to my full problem, but that doesn't seem to be the case, so I'm elaborating now (sorry about that). I am trying to extract "syllables" from transcribed words. A "syllable" is a vowel or a diphthong, preceded and followed by any number of consonants. This is my regular expression to extract them:
vowels = 'æɑəɛiɪɔuʊʌ'
diphtongues = "|".join(('aj', 'aw', 'ej', 'oj', 'ow'))
consonants = 'θwlmvhpɡŋszbkʃɹdnʒjtðf'
pattern = f"""(?=
  (?:[{vowels}]|^|{diphtongues})
  ([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
  (?:[{vowels}]|$|{diphtongues})
)
"""
syllables = re.compile(pattern, re.VERBOSE)
The tricky bit is that the diphthongs end in consonants (j or w), which I don't want to be included in the next syllable. So replacing the first non-capturing group by a double negative (?<![{consonants}]) doesn't work. I tried to instead replace that group by a positive lookbehind (?<=[{vowels}]|^|{diphtongues}), but re won't accept alternatives of different lengths in a lookbehind (even removing the diphthongs doesn't work; apparently ^ is of a different length).
So this is the problematic case with the pattern above:
syllables.findall('æbə')
# ['bə']
# should be: ['æb', 'bə']
Edit 2:
I've switched to using the third-party regex module, which allows variable-width lookbehinds, and that solves the problem. To my surprise, it even appears to be faster than the re module in the standard library. I'd still like to know how to get this working with the re module, though. (:
I suggest fixing this with a double negation:
(?= # inside lookahead for overlapping results
(?<![^a]) # match at beginning of str or after a
(b*ab*) # one a between any number of bs
(?![^a]) # at end of str or before next a
)
Note I replaced the grouping constructs with lookarounds: (?:a|^) with (?<![^a]) and (?:a|$) with (?![^a]). The latter is not really important, but the former is very important here.
The (?:a|^) at the beginning of the outer lookahead pattern matches a or the start of the string, whichever alternative succeeds first. If a is at the start, it is matched and consumed, so when the input is abbabb, you get bbabb, since it matches the capturing group pattern and there is an end of string position right after. The next iteration starts after the first a and cannot find any match, since after the only a left in the string there are bs but no further a to match.
Note that the order of alternatives matters. If you change to (?:^|a), the match starts at the start of the string, b* matches an empty string, ab* grabs the first abb in abbabb, and since there is an a right after, you get abb as a match. There is no way to match anything after the first a.
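The lookaround version can be checked against both sample strings from the question with a short sketch like the following (same pattern as suggested above, compiled with re.VERBOSE):

```python
import re

pattern = r"""(?=     # lookahead, so results may overlap
    (?<![^a])         # start of string, or right after an a
    (b*ab*)           # one a between any number of bs
    (?![^a])          # end of string, or right before an a
)"""
a_between_bs = re.compile(pattern, re.VERBOSE)

first = a_between_bs.findall("bbabbba")
second = a_between_bs.findall("abbabb")

print(first)   # ['bbabbb', 'bbba']
print(second)  # ['abb', 'bbabb']
```

Since the lookbehind (?<![^a]) is zero-width, the leading a is never consumed, so the subsequence that begins with the very first character is no longer lost.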
Remember that the regex engine "short-circuits" an alternation: if it matches "^", it's not going to continue looking to see if it matches "a" too. Matching "a" "consumes" that character, so in cases where the first group matches "a", that "a" is not available for the next group to match, and because the (?:) syntax is non-capturing, that "a" is "lost" and cannot be captured by the next group (b*(?:a)b*). When "^" is matched by the first group instead, nothing is consumed (^ is zero-width), so that first "a" can still match in the second group.
I have a csv file with lots of twitter data from which I want to extract nodes and edges to create a network file for Social Network Analysis.
The data I want are located in row[11] and row[12]. I add them together, inserting four & just to make an easy delimiter.
for row in reader:
    interactions = row[11] + "&&&&" + row[12]
    print(interactions)  # for debugging only
    edgeST = re.findall(r'^(.*)&&&&.*#([A-Za-z0-9([A-Za-z0-9_]+)', interactions, flags=re.MULTILINE)
    print(edgeST)
The output from both prints looks like this (the first line prints the entire interactions string, the second line the result of the re.findall):
GaryStokesKSPS&&&&RT #PBS: .#SciGirls Season 3 is now on #YouTube! Watch now: http://t.co/YHH23ADDq9 #SciGirls #STEM #CitizenScience
[('GaryStokesKSPS', 'YouTube')]
In this case, my first parenthesis matches the source node username ('GaryStokesKSPS'), which is fine. But then I get a match for 'YouTube', but not for #PBS or #SciGirls. The last match is returned, but not the previous ones. This pattern occurs throughout my entire dataset.
How can I get A) All matches and/or B) Only the first one?
Use a non-greedy match: change
^(.*)&&&&.*#([A-Za-z0-9([A-Za-z0-9_]+)
into
^(.*)&&&&.*?#([A-Za-z0-9([A-Za-z0-9_]+)
The former matches as much text as possible, so your text is getting split up as
GaryStokesKSPS&&&&RT #PBS: .#SciGirls Season 3 is now on #YouTube! Watch now: ...
\________________/\_____________________________________/#\_____/
     (.*)&&&&                 greedy match .*              group
First of all, I’m not sure about this part:
([A-Za-z0-9([A-Za-z0-9_]+)
There is a character class ([]) with the following characters and ranges: A-Z, a-z, 0-9, (, [, A-Z again, a-z again, 0-9 again, and _.
You probably made a mistake there as it’s the same as ([A-Za-z0-9([_]+). And the brackets aren’t actually allowed in Twitter names. You probably meant to match for something like this: ([A-Za-z0-9_]+).
That being said, to fix your problem, you need a non-greedy match before the # character. Change it to this:
^(.*?)&&&&.*?#([A-Za-z0-9_]+)
    ↑       ↑
the question marks make this non-greedy
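To see both behaviours side by side, here is a small sketch on a shortened copy of the sample line (the truncation after "Watch now" is an assumption made for the example). The last part, which splits on the delimiter and collects every name, is one possible way to get all matches rather than something taken from the answers above.

```python
import re

# Shortened, hypothetical copy of the sample line from the question.
interactions = "GaryStokesKSPS&&&&RT #PBS: .#SciGirls Season 3 is now on #YouTube! Watch now"

# Greedy .* runs to the last # it can reach; lazy .*? stops at the first one.
greedy = re.findall(r'^(.*)&&&&.*#([A-Za-z0-9_]+)', interactions, flags=re.MULTILINE)
lazy = re.findall(r'^(.*?)&&&&.*?#([A-Za-z0-9_]+)', interactions, flags=re.MULTILINE)

# For all matches: split off the source first, then collect every name in the text.
source, _, text = interactions.partition("&&&&")
edges = [(source, target) for target in re.findall(r'#([A-Za-z0-9_]+)', text)]

print(greedy)  # [('GaryStokesKSPS', 'YouTube')]
print(lazy)    # [('GaryStokesKSPS', 'PBS')]
print(edges)
```

A single ^-anchored pattern can only produce one result per line, so collecting every target needs a second, unanchored findall (or a similar loop) over the text part.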
Given the regex and the word below, I want to match the part after the - (which can also be a _ or a space) only if the part after the delimiter is a digit and nothing comes after it (I basically want it to be a number and a number only). I am using group statements but it just doesn't seem to work right. It keeps matching the 3 at the beginning (or the 1 at the end if I modify it a bit). How do I achieve this (by using grouping)?
Target word: BR0227-3G1
Regex: ([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*)
It should not match 3G1, G1 , 1G
It should match only pure numbers like 3,10, 2 etc.
Here is also a helper web site for evaluating the regex: http://www.pythonregex.com/
More examples:
It should match:
BR0227-3
BR0227 3
BR0227_3
into groups (BR0227) (3)
It should only match (BR0227) for
BR0227-3G1
BR0227-CS
BR0227
BR0227-
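I would use
re.findall(r'^([A-Z]*\s?[0-9]*)[\s_-]*([1-9][0-9]*$)?', s)
Each string starts with the first group and ends with the last group, so the ^ and $ anchors can assist in the capture. The $ inside the second group requires the digits to run to the end of the string, but that group is optional, so the first group can still be captured on its own. Here s is the input string (naming it str would shadow the built-in), and the second group uses [1-9][0-9]* rather than [1-9][1-9]* so that numbers such as 10 also match.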
Since you want the start and (possible) end of the word in groups, then do this:
r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b'
This will put the first part of the word in the first group, and optionally the remainder in the second group. The second group will be None if it didn't match.
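As a quick check, the \b-based pattern above can be run over the examples listed in the question. This is only a sketch; the names word_re, samples and results are made up for the example.

```python
import re

word_re = re.compile(r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b')

samples = ["BR0227-3", "BR0227 3", "BR0227_3",
           "BR0227-3G1", "BR0227-CS", "BR0227", "BR0227-"]

# search() finds the first match; groups() shows None for the unmatched optional group.
results = {s: word_re.search(s).groups() for s in samples}

for s, groups in results.items():
    print(s, groups)
```

For BR0227-3G1, the trailing \b cannot match between 3 and G, so the engine backtracks and drops the optional group, leaving only BR0227 in group 1.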
This should match anything followed by '-', ' ', or '_' with only digits after it. The $ anchor is what rules out trailing non-digits such as the G1 in BR0227-3G1.
(.*)[- _](\d+)$