I have a csv file with lots of twitter data from which I want to extract nodes and edges to create a network file for Social Network Analysis.
The data I want are located in row[11] and row[12]. I add them together, inserting four & just to make an easy delimiter.
for row in reader:
    interactions = row[11] + "&&&&" + row[12]
    print(interactions)  # for debugging only
    edgeST = re.findall(r'^(.*)&&&&.*#([A-Za-z0-9([A-Za-z0-9_]+)', interactions, flags=re.MULTILINE)
    print(edgeST)
The output from both prints looks like this (the first line is the entire interactions string, the second line is the result of re.findall):
GaryStokesKSPS&&&&RT #PBS: .#SciGirls Season 3 is now on #YouTube! Watch now: http://t.co/YHH23ADDq9 #SciGirls #STEM #CitizenScience
[('GaryStokesKSPS', 'YouTube')]
In this case, my first parenthesis matches the source node username ('GaryStokesKSPS'), which is fine. But then I get a match for 'YouTube', and none for #PBS or #SciGirls. Only the last match is returned, not the previous ones. This pattern occurs throughout my entire dataset.
How can I get A) All matches and/or B) Only the first one?
Use a non-greedy match: change
^(.*)&&&&.*#([A-Za-z0-9([A-Za-z0-9_]+)
into
^(.*)&&&&.*?#([A-Za-z0-9([A-Za-z0-9_]+)
The former matches as much text as possible, so your text is getting split up as
GaryStokesKSPS&&&&RT #PBS: .#SciGirls Season 3 is now on #YouTube! Watch now: ...
\________________/\_____________________________________/#\_____/
     (.*)&&&&                 greedy match .*              group
First of all, I’m not sure about this part:
([A-Za-z0-9([A-Za-z0-9_]+)
There is a character class ([]) with the following characters and ranges: A-Z, a-z, 0-9, (, [, A-Z again, a-z again, 0-9 again, and _.
You probably made a mistake there as it’s the same as ([A-Za-z0-9([_]+). And the brackets aren’t actually allowed in Twitter names. You probably meant to match for something like this: ([A-Za-z0-9_]+).
That being said, to fix your problem, you need a non-greedy match before the # character. Change it to this:
^(.*?)&&&&.*?#([A-Za-z0-9_]+)
    ↑       ↑
the question marks make this non-greedy
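As a rough sketch of both options from the question, using the sample tweet above: the non-greedy pattern returns only the first hashtag (option B), while splitting on the delimiter and scanning the tweet text separately returns every hashtag (option A). The names `first`, `source`, `text` and `edges` are illustrative and not from the original code.

```python
import re

interactions = ("GaryStokesKSPS&&&&RT #PBS: .#SciGirls Season 3 is now on "
                "#YouTube! Watch now: http://t.co/YHH23ADDq9 "
                "#SciGirls #STEM #CitizenScience")

# B) Only the first hashtag: the lazy .*? stops at the first '#'.
first = re.findall(r'^(.*?)&&&&.*?#([A-Za-z0-9_]+)', interactions,
                   flags=re.MULTILINE)
print(first)  # [('GaryStokesKSPS', 'PBS')]

# A) All hashtags: split off the username, then scan the tweet text alone.
source, _, text = interactions.partition("&&&&")
edges = [(source, tag) for tag in re.findall(r'#([A-Za-z0-9_]+)', text)]
print(edges)
```

Because the anchored pattern can match only once per line, collecting every hashtag needs a second, unanchored findall like the one used for `edges`.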
I have strings of as and bs. I want to extract all overlapping subsequences, where a subsequence is a single a surrounded by any number of bs. This is the regex I wrote:
import re
pattern = """(?= # inside lookahead for overlapping results
(?:a|^) # match at beginning of str or after a
(b* (?:a) b*) # one a between any number of bs
(?:a|$)) # at end of str or before next a
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
It seems to work as expected, except when the very first character in the string is an a, in which case this subsequence is missed:
a_between_bs.findall("bbabbba")
# ['bbabbb', 'bbba']
a_between_bs.findall("abbabb")
# ['bbabb']
I don't understand what is happening. If I change the order of how a potential match could start, the results also change:
pattern = """(?=
(?:^|a) # a and ^ swapped
(b* (?:a) b*)
(?:a|$))
"""
a_between_bs = re.compile(pattern, re.VERBOSE)
a_between_bs.findall("abbabb")
# ['abb']
I would have expected this to be symmetric, so that strings ending in an a might also be missed, but this doesn't appear to be the case. What is going on?
Edit:
I assumed that solutions to the toy example above would translate to my full problem, but that doesn't seem to be the case, so I'm elaborating now (sorry about that). I am trying to extract "syllables" from transcribed words. A "syllable" is a vowel or a diphthong, preceded and followed by any number of consonants. This is my regular expression to extract them:
vowels = 'æɑəɛiɪɔuʊʌ'
diphtongues = "|".join(('aj', 'aw', 'ej', 'oj', 'ow'))
consonants = 'θwlmvhpɡŋszbkʃɹdnʒjtðf'
pattern = f"""(?=
(?:[{vowels}]|^|{diphtongues})
([{consonants}]* (?:[{vowels}]|{diphtongues}) [{consonants}]*)
(?:[{vowels}]|$|{diphtongues})
)
"""
syllables = re.compile(pattern, re.VERBOSE)
The tricky bit is that the diphthongs end in consonants (j or w), which I don't want to be included in the next syllable. So replacing the first non-capturing group by a double negative (?<![{consonants}]) doesn't work. I tried instead to replace that group by a positive lookbehind (?<=[{vowels}]|^|{diphtongues}), but re won't accept alternatives of different lengths (even removing the diphthongs doesn't work; apparently ^, being zero-width, counts as a different length).
So this is the problematic case with the pattern above:
syllables.findall('æbə')
# ['bə']
# should be: ['æb', 'bə']
Edit 2:
I've switched to using the third-party regex module, which allows variable-width lookbehinds, and that solves the problem. To my surprise, it even appears to be faster than the re module in the standard library. I'd still like to know how to get this working with the re module, though. (:
I suggest fixing this with a double negation:
(?= # inside lookahead for overlapping results
(?<![^a]) # match at beginning of str or after a
(b*ab*) # one a between any number of bs
(?![^a]) # at end of str or before next a
)
Note I replaced the grouping constructs with lookarounds: (?:a|^) with (?<![^a]) and (?:a|$) with (?![^a]). The latter is not really important, but the first is very important here.
The (?:a|^) at the beginning of the outer lookahead pattern matches a or start of the string, whatever comes first. If a is at the start, it is matched and when the input is abbabb, you get bbabb since it matches the capturing group pattern and there is an end of string position right after. The next iteration starts after the first a, and cannot find any match since the only a left in the string has no a after bs.
Note that the order of the alternatives matters. If you change to (?:^|a), the match starts at the start of the string, b* matches the empty string, ab* grabs the first abb in abbabb, and since there is an a right after, you get abb as a match. There is no way to match anything after the first a.
Remember that Python's regex engine "short-circuits" an alternation: if it matches "^", it's not going to go on and check whether "a" matches too. When the group matches "a", that "a" is consumed and is no longer available to the next group, and because the (?:) syntax is non-capturing, that "a" is "lost" rather than captured by the next group (b*(?:a)b*). When "^" matches instead, nothing is consumed, so the first "a" can still match in the second group.
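For reference, here is a small runnable check of the double-negation pattern suggested above, using only the standard re module. It applies to the toy a/b problem only; the diphthong case from the edit is not covered by this sketch.

```python
import re

a_between_bs = re.compile(r"""
    (?=               # lookahead, so that overlapping results are found
        (?<![^a])     # start of string, or right after an a
        (b*ab*)       # one a between any number of bs
        (?![^a])      # end of string, or right before an a
    )
""", re.VERBOSE)

print(a_between_bs.findall("bbabbba"))  # ['bbabbb', 'bbba']
print(a_between_bs.findall("abbabb"))   # ['abb', 'bbabb']
```

Both lookarounds are zero-width, so the a that delimits one subsequence is never consumed and stays available as the core of the next one.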
I have the following lines:
12(3)/FO.2-3;1-2
153/G6S.3-H;2-3;1-2
1/G13S.2-3
22/FO.2-3;1-2
12(3)2S/FO.2-3;1-2
153/SH/G6S.3-H;2-3;1-2
45/3/H/GDP6;2-3;1-2
I want to get a match if the line begins with two or three digits, but not with just one. Also, if the field contains the expressions FO, SH, GDP or LDP anywhere, it should not count as an occurrence. That means, from the previous lines, only 153/G6S.3-H;2-3;1-2 should match, because the others either contain FO, SH or GDP, or have just one digit at the beginning.
I tried using
^[1-9][1-9]((?!FO|SH|GDP).)*$
I am getting the correct result, but I am not sure the pattern is correct; I am not an expert in regular expressions.
You need to add any other characters that might be between your starting digits and the things you want to exclude:
Simplified regex: ^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
will only match 153/G6S.3-H;2-3;1-2 from your given data.
Explanation:
^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
-----------                             2 to 3 digits (1-9) at start of line
^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
           -----------------------      any characters + not matching (FO|SH|GDP|LDP)
^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
                                  ---   match till end of line
A (?!...) negative lookahead is tested only at the position where it appears. Other characters can sit between your leading digits and the strings you do not want to see, so the lookahead needs a leading .* to scan the rest of the line.
See https://regex101.com/r/j4SRoQ/1 for more explanations (uses {2,}).
Full code example:
import re
regex = r"^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$"
test_str = r"""12(3)/FO.2-3;1-2
153/G6S.3-H;2-3;1-2
1/G13S.2-3
22/FO.2-3;1-2
12(3)2S/FO.2-3;1-2
153/SH/G6S.3-H;2-3;1-2
45/3/H/GDP6;2-3;1-2"""
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    print(match.group())
Output:
153/G6S.3-H;2-3;1-2
For any phone number which allows () in the area code and any space between area code and the 4th number, I want to create a tuple of the 3 sets of numbers.
For example: (301) 556-9018 or (301)556-9018 would return ('301','556','9018').
I will raise a ValueError exception if the input is in anything other than the original format.
How do I avoid () characters and include either \s or none between the area code and the next values?
This is my foundation so far:
phonenum=re.compile('''([\d)]+)\s([\d]+) - ([\d]+)$''',re.VERBOSE).match('(123) 324244-123').groups()
print(phonenum)
Do I need to write an if/then statement to ignore the () for the first tuple element, or is there a regular expression that does that more efficiently?
In addition, the \s between the first 2 tuple elements doesn't work if the input is (301)556-9018.
Any hints on how to approach this?
When specifying a regular expression, you should use raw-string mode:
`r'abc'` instead of `'abc'`
That said, right now you are capturing three sets of numbers in groups. To allow parens, you will need to match parens. (The parens you currently have are for the capturing groups.)
You can match parens by escaping them: \( and \)
You can find various solutions to "what is a regex for XXX" by searching one of the many "regex library" web sites. I was able to find this one via DuckDuckGo: http://www.regexlib.com/Search.aspx?k=phone
To make a part of your pattern optional, you can make the individual pieces optional, or you can provide alternatives with the piece present or absent.
Since the parens have to be present or absent together - that is, you don't want to allow an opening paren but no closing paren - you probably want to provide alternatives:
# number, no parens: 800 555-1212
noparens = r'\d{3}\s+\d{3}-\d{4}'
# number with parens: (800) 555-1212
yesparens = r'\(\d{3}\)\s*\d{3}-\d{4}'
You can match the three pieces by inserting "grouping parens":
noparens_grouped = r'(\d{3})\s+(\d{3})-(\d{4})'
yesparens_grouped = r'\((\d{3})\)\s*(\d{3})-(\d{4})'
Note that the quoted parens go outside of the grouping parens, so that the parens do not become part of the captured group.
You can join the alternatives together with the | operator:
yes_or_no_parens_groups = noparens_grouped + '|' + yesparens_grouped
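One thing to watch with the joined pattern: each alternative has its own three groups, so a match reports six groups, three of which are None. A minimal sketch of one way to handle that follows; the helper name `phone_parts` is hypothetical, and re.fullmatch is used on the assumption that the whole input should be a phone number.

```python
import re

noparens_grouped = r'(\d{3})\s+(\d{3})-(\d{4})'
yesparens_grouped = r'\((\d{3})\)\s*(\d{3})-(\d{4})'
yes_or_no_parens_groups = noparens_grouped + '|' + yesparens_grouped

def phone_parts(text):
    """Return the three digit groups, or None if text is not a phone number."""
    m = re.fullmatch(yes_or_no_parens_groups, text)
    if m is None:
        return None
    # Only the alternative that matched has non-None groups.
    return tuple(g for g in m.groups() if g is not None)

print(phone_parts('(301)556-9018'))  # ('301', '556', '9018')
print(phone_parts('800 555-1212'))   # ('800', '555', '1212')
print(phone_parts('(301 556-9018'))  # None
```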
In regular expressions you can use special characters to specify some behavior of some part of the expression.
From the Python re documentation:
'*' =
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
'+' =
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
'?' =
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
So to solve the blank space problem you can use either '?' if you know there will be at most one space, or '*' if there can be more than one. ('+' would require at least one space, so it would reject (301)556-9018.)
To group the pieces together and return them as a tuple, put each part of your expression inside parentheses and then call the groups() method of the match object.
The result would be:
import re

results = re.search(r'\((\d{3})\)\s?(\d{3})-(\d{4})', '(301) 556-9018')
if results:
    print(results.groups())
else:
    print('Invalid phone number')
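Since the question mentions raising a ValueError for anything in another format, the same pattern could be wrapped roughly like this. `PHONE_RE` and `parse_phone` are hypothetical names, and fullmatch is an assumption here: it rejects trailing text that re.search would silently accept.

```python
import re

PHONE_RE = re.compile(r'\((\d{3})\)\s?(\d{3})-(\d{4})')

def parse_phone(text):
    """Return (area, prefix, line) or raise ValueError on any other format."""
    match = PHONE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f'Invalid phone number: {text!r}')
    return match.groups()

print(parse_phone('(301) 556-9018'))  # ('301', '556', '9018')
print(parse_phone('(301)556-9018'))   # ('301', '556', '9018')
```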
I have a lot of file names with the pattern SURENAME__notalwaysmiddlename_firstnames_1230123Abc123-16x_notalways.pdf, e.g.:
SMITH_John_001322Cde444-16v_HA.pdf
FLORRICK-DOILE_Debora_Alicia_321333Gef213-16p.pdf
ROBINSON-SMITH_Maria-Louise_321333Gef213-16p_GH.pdf
My old regex was ([\w]*)_([\w-\w]+)\.\w+ but after switching to Python and getting the first double-barrelled surnames (and even in the first names) I'm unable to get it running.
With the old regex I got two groups:
SMITH_John
001322Cde444-16v_HA
But now I have no clue how to achieve this with re and even include the occasional double-barrelled names in group 1 and the ID in group 2.
([A-Z-]+)(?:_([A-Za-z-]+))?_([A-Za-z-]+)_(\d.*)\.
This pattern will return the surname, potential middle name, first name, and final string.
([A-Z-]+) returns an upper-case word that can also contain -
(?:_([A-Za-z-]+))? returns 0 or 1 matches of a word preceded by an _. The (?: makes the _ non-capturing
([A-Za-z-]+) returns a word that can also contain -. [A-Za-z] is used rather than [A-z] because the range A-z also covers [, \, ], ^, _ and `
(\d.*) returns a string that starts with a digit
\. finds the escaped period right before the file type
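A quick sketch of how this pattern behaves on the three sample file names, assuming the letter classes are written as [A-Za-z-]. `FILENAME_RE` and `parsed` are illustrative names.

```python
import re

FILENAME_RE = re.compile(r'([A-Z-]+)(?:_([A-Za-z-]+))?_([A-Za-z-]+)_(\d.*)\.')

names = [
    'SMITH_John_001322Cde444-16v_HA.pdf',
    'FLORRICK-DOILE_Debora_Alicia_321333Gef213-16p.pdf',
    'ROBINSON-SMITH_Maria-Louise_321333Gef213-16p_GH.pdf',
]

# Groups: surname, optional extra name (None if absent), name, ID string.
parsed = [FILENAME_RE.match(name).groups() for name in names]
for groups in parsed:
    print(groups)
# ('SMITH', None, 'John', '001322Cde444-16v_HA')
# ('FLORRICK-DOILE', 'Debora', 'Alicia', '321333Gef213-16p')
# ('ROBINSON-SMITH', None, 'Maria-Louise', '321333Gef213-16p_GH')
```

When there is only one given name, the optional second group comes back as None and the name lands in the third group.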
Given the regex and the word below, I want to match the part after the - (which can also be a _ or a space) only if the part after the delimiter is a digit and nothing comes after it (I basically want it to be a number and a number only). I am using groups, but it just doesn't seem to work right. It keeps matching the 3 at the beginning (or the 1 at the end if I modify it a bit). How do I achieve this (by using grouping)?
Target word: BR0227-3G1
Regex: ([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*)
It should not match 3G1, G1 , 1G
It should match only pure numbers like 3,10, 2 etc.
Here is also a helper web site for evaluating the regex: http://www.pythonregex.com/
More examples:
It should match:
BR0227-3
BR0227 3
BR0227_3
into groups (BR0227) (3)
It should only match (BR0227) for
BR0227-3G1
BR0227-CS
BR0227
BR0227-
I would use
re.findall(r'^([A-Z]*\s?[0-9]*)[\s_-]*([1-9][0-9]*$)?', word)
Each string starts with the first group and ends with the last group, so the ^ and $ anchors can assist in capture. The $ at the end requires the number to run to the end of the string, but the group is optional, so the first group can still be captured on its own. [1-9][0-9]* also accepts numbers that contain a zero, such as 10.
Since you want the start and (possible) end of the word in groups, then do this:
r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b'
This will put the first part of the word in the first group, and optionally the remainder in the second group. The second group will be None if it didn't match.
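To illustrate, here is that pattern run against the examples from the question with re.match, which only looks at the start of the string. `WORD_RE`, `samples` and `results` are illustrative names.

```python
import re

WORD_RE = re.compile(r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b')

samples = ['BR0227-3', 'BR0227 3', 'BR0227_3',
           'BR0227-3G1', 'BR0227-CS', 'BR0227', 'BR0227-']

results = {s: WORD_RE.match(s).groups() for s in samples}
for sample, groups in results.items():
    print(sample, groups)
```

The first three samples give ('BR0227', '3'); the remaining four give ('BR0227', None), because the trailing \b cannot sit between 3 and G in BR0227-3G1.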
This should match anything followed by '-', ' ', or '_' with only digits after it, up to the end of the string.
(.*)[- _](\d+)$