I am trying to match the strings below using a regular expression.
String:
These are my variables -abc $def -geh $ijk for case1
These are my variables -lmn $opq -rst $uvw for case2
Pattern:
These\s+are\s+my\s+variables(?:\s*-(\w+)\s+\$(\w+))*\s+for\s+(case\d)
The pattern matches the strings above, but I cannot capture the groups the way I intend. My attempts give these results:
geh, ijk, case1
rst, uvw, case2
I want the groups to come out like this:
abc, def, geh, ijk, case1
lmn, opq, rst, uvw, case2
How should I approach this?
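With the built-in re module, a capturing group inside a repeated group keeps only its last match, which is why you get geh and ijk. The PyPI regex module records every capture, so you can use the same pattern as is: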
import regex
s = 'These are my variables -abc $def -geh $ijk for case1'
rx = regex.compile(r'These\s+are\s+my\s+variables(?:\s*-(\w+)\s+\$(\w+))*\s+for\s+(case\d)')
print([x.captures(1) for x in rx.finditer(s)])
# => [['abc', 'geh']]
print([x.captures(2) for x in rx.finditer(s)])
# => [['def', 'ijk']]
Otherwise, with the built-in re module, first capture the whole run of options in a single group:
These\s+are\s+my\s+variables((?:\s*-\w+\s+\$\w+)*)\s+for\s+(case\d)
and then extract the individual values from that group in a second step:
import re
r = r"These\s+are\s+my\s+variables((?:\s*-\w+\s+\$\w+)*)\s+for\s+(case\d)"
s = "These are my variables -abc $def -geh $ijk for case1"
m = re.search(r, s)
if m:
    print(re.findall(r'-(\w+)', m.group(1)))
    print(re.findall(r'\$(\w+)', m.group(1)))
    print(m.group(2))
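To get the flat output asked for in the question (abc, def, geh, ijk, case1) for both lines at once, the two steps can be combined roughly as follows. This is only a sketch with the built-in re module; the names line_rx, pair_rx and rows are made up for illustration.

```python
import re

text = """These are my variables -abc $def -geh $ijk for case1
These are my variables -lmn $opq -rst $uvw for case2"""

# Step 1: one group for the whole run of "-name $name" options, one for the case.
line_rx = re.compile(
    r"These\s+are\s+my\s+variables((?:\s*-\w+\s+\$\w+)*)\s+for\s+(case\d)"
)
# Step 2: pull each "-name $name" pair out of the captured run.
pair_rx = re.compile(r"-(\w+)\s+\$(\w+)")

rows = []
for m in line_rx.finditer(text):
    # Flatten [('abc', 'def'), ('geh', 'ijk')] into ['abc', 'def', 'geh', 'ijk'].
    row = [name for pair in pair_rx.findall(m.group(1)) for name in pair]
    row.append(m.group(2))
    rows.append(row)

print(rows)
```

Each inner list holds the option names in order, followed by the case label.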
Consider the following alternative approach using the str.lstrip and str.split methods (it returns a list of parameter sets, one per line):
s = '''These are my variables -abc $def -geh $ijk for case1
These are my variables -lmn $opq -rst $uvw for case2'''
params = [[p.lstrip('$-') for p in l.split()[4:] if p != 'for'] for l in s.split('\n') if l]
print(params)
The output:
[['abc', 'def', 'geh', 'ijk', 'case1'], ['lmn', 'opq', 'rst', 'uvw', 'case2']]
I have text like this
Example:
"visa code: ab c master number: efg discover: i j k"
Output should be like this:
abc, efg, ijk
Is there a way to use a Grok pattern or a regex to get the 3 characters after each ":" (ignoring spaces)?
You can start with this:
>>> import re
>>> p = re.compile(r"\b((?:\w\s*){2}\w)\b")
>>> re.findall(p, "visa code: ab c master number: efg discover: i j k")
['ab c', 'efg', 'i j k']
But you have more work to do. For example, nobody can guess what you mean - exactly - by "characters".
Beyond that, pattern matching systems match strings, but do not convert them. You'll have to remove spaces you don't want via some other means (which should be easy).
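A minimal sketch of that second step, assuming "characters" means word characters (\w) and that each value follows a colon. Anchoring on the colon is my own choice here, not part of the pattern above, and the variable names are illustrative.

```python
import re

text = "visa code: ab c master number: efg discover: i j k"

# After each colon, capture three word characters that may be separated by spaces.
pattern = re.compile(r":\s*((?:\w\s*){2}\w)")

# The regex only finds the pieces; the spaces are removed afterwards.
codes = ["".join(piece.split()) for piece in pattern.findall(text)]

print(", ".join(codes))
```

Anchoring on the colon means a three-letter label such as "pin" is not picked up by accident, which the \b-based pattern would do.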
I am a Python programmer and I want to use regular expressions in R. I want the functionality of Python's finditer rather than findall, so that I can work with each match in turn.
Suppose I have a file that contains:
<LayerDepth Units="mm" Count="4" value1="141" value2="241" value3="1104" value4="1492" value444="898" LastModified="6/11/2012"
Now if I use this piece of code:
import re
pattern='(value\d.+?)"(\d.+?)"'
with open("file1.txt",'r') as f:
    match=re.finditer(pattern,f.read())
    for i in match:
        print(i.group())
the output will be:
value1="141"
value2="241"
value3="1104"
value4="1492"
value444="898"
I want the same functionality in R. How can I achieve this?
We can use gregexpr with the following pattern:
(value\d+="\d+")
Then, use regmatches with the output of gregexpr to obtain the actual matches from the input string.
x <- c("<LayerDepth Units=\"mm\" Count=\"4\" value1=\"141\" value2=\"241\" value3=\"1104\" value4=\"1492\" value444=\"898\" LastModified=\"6/11/2012\" Now")
m <- gregexpr("(value\\d+=\"\\d+\")", x)
regmatches(x, m)
[[1]]
[1] "value1=\"141\"" "value2=\"241\"" "value3=\"1104\"" "value4=\"1492\""
[5] "value444=\"898\""
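For comparison on the Python side, here is a rough sketch of finditer with the same tightened pattern, split into two groups so that each match gives a name and a number. The group split and the variable names are my own assumptions, not something from the question.

```python
import re

line = ('<LayerDepth Units="mm" Count="4" value1="141" value2="241" '
        'value3="1104" value4="1492" value444="898" LastModified="6/11/2012"')

# Group 1 is the attribute name, group 2 the digits inside the quotes.
pattern = re.compile(r'(value\d+)="(\d+)"')

pairs = [(m.group(1), int(m.group(2))) for m in pattern.finditer(line)]
print(pairs)
```

Requiring \d+ on both sides of =" keeps attributes such as Count="4" and LastModified out of the result.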
What regular expression can I use to match the genes in the gene list string:
GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8
I tried GENE_List:((( \w+).(\w+));)+* but it only captures the last gene.
Given:
>>> s="GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
You can use Python string methods to do:
>>> s.split(': ')[1].split('; ')
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
For a regex:
(?<=[:;]\s)([^\s;]+)
Or, in Python:
>>> re.findall(r'(?<=[:;]\s)([^\s;]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
You can use the following:
\s([^;\s]+)
The pattern matches a whitespace character (\s) followed by the captured group, ([^;\s]+), which contains the desired substring.
>>> s = 'GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8'
>>> re.findall(r'\s([^;\s]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
UPDATE
It's in fact much simpler:
[^\s;]+
However, first take a substring containing only the part you need (the genes, without the GENE_LIST: prefix).
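A short sketch of that idea in Python, assuming the prefix always ends at the first colon. The use of str.partition and the variable names are my own choices for illustration.

```python
import re

s = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"

# Keep only what follows the first colon, i.e. drop the "GENE_LIST" label.
_, _, gene_part = s.partition(":")

# Every run of characters that is neither whitespace nor ';' is one gene.
genes = re.findall(r"[^\s;]+", gene_part)
print(genes)
```

Without the partition step the same pattern would also return 'GENE_LIST:' as its first item.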
import re
string = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
re.findall(r"([^;\s]+)(?:;|$)", string)
The output is:
['F59A7.7',
'T25D3.3',
'F13B12.4',
'cysl-1',
'cysl-2',
'cysl-3',
'cysl-4',
'F01D4.8']
Here's hoping somebody can shed some light on this question because it has me stumped. I have a string that looks like this:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
I want this result:
abcdef ghijk lmnop qrs tuv wxyz 0123456789
Having reviewed numerous questions and answers here, the closest I have come to a solution is:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
s = re.sub('\[\[.*?\|', '', s)
s = re.sub('[\]\]]', '', s)
--> abcdef ghijk lmnop wxyz 0123456789
Since not every substring within double brackets contains a pipe, the re.sub removes everything from '[[' to next '|' instead of checking within each set of double brackets.
Any assistance would be most appreciated.
What about this:
In [187]: re.sub(r'([\[|\]])|((?<=\[)\w+\s+\w+(?=|))', '', s)
Out[187]: 'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
I propose the opposite approach: instead of removing what you don't want, capture the patterns you do want. I think this makes the intent of the code clearer.
There are two patterns you want to catch:
Case: words outside [[...]]
Pattern: any word that is either preceded by ']] ' or followed by ' [['.
Regex: (?<=\]\]\s)\w+|\w+(?=\s\[\[)
Case: words inside [[...]]
Pattern: any word that is followed by ']]'.
Regex: \w+(?=\]\])
Example code
#!/usr/bin/env python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789 "

# Words outside the brackets, or the last word before a closing ']]'.
p = re.compile(r'(?<=\]\]\s)\w+|\w+(?=\s\[\[)|\w+(?=\]\])')
print(p.findall(s))
Result:
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
>>> import re
>>> s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
>>> re.sub(r'(\[\[[^]]+?\|)|([\[\]])', '', s)
'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
This searches for and removes the following two items:
Two opening brackets followed by a bunch of stuff that isn't a closing bracket followed by a pipe.
Opening or closing brackets.
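The same result can also be had with a single substitution that keeps the label of each [[...]] block instead of deleting the pieces around it. This is a sketch of an alternative, not the pattern used above, and it assumes a label never contains a closing bracket; link_rx and cleaned are illustrative names.

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

# \[\[              opening double bracket
# (?:[^\]|]*\|)?    optional "prefix|" part, dropped from the result
# ([^\]]*)          the label to keep (group 1)
# \]\]              closing double bracket
link_rx = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")

# Replace each bracketed block with its label only.
cleaned = link_rx.sub(r"\1", s)
print(cleaned)
```

Because the prefix part is matched inside one pair of brackets at a time, a block without a pipe, such as [[qrs]], cannot be swallowed by a neighbouring block.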
As a general solution with the built-in re module, you can use the following regex, which relies on look-arounds:
(?<!\[\[)\b([\w]+)\b(?!\|)|\[\[([^|]*)\]\]
You can use re.finditer to get the desired result:
>>> g=re.finditer(r'(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])',s)
>>> [j.group() for j in g]
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
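The regex has two parts. The first is:
(?<!\[\[)\b([\w]+)\b(?!\|)
which matches any run of word characters that is not preceded by [[ and not followed by |.
The second part is:
\[\[([^|]*)\]\]
which matches anything between the double brackets that contains no |. In the re.finditer call above it is written with look-arounds, (?<=\[\[)[^|]*(?=\]\]), so that group() returns only the text inside the brackets.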
I have started learning the re module. First, here is the original code:
import re
cheesetext = u'''<tag>I love cheese.</tag>
<tag>Yeah, cheese is all I need.</tag>
<tag>But let me explain one thing.</tag>
<tag>Cheese is REALLY I need.</tag>
<tag>And the last thing I'd like to say...</tag>
<tag>Everyone can like cheese.</tag>
<tag>It's a question of the time, I think.</tag>'''
def action1(source):
    # Return the text inside every <tag>...</tag> pair.
    regex = u'<tag>(.*?)</tag>'
    pattern = re.compile(regex, re.UNICODE | re.DOTALL | re.IGNORECASE)
    result = pattern.findall(source)
    return result
def action2(match, source):
    # True if the pattern `match` occurs anywhere in `source`.
    pattern = re.compile(match, re.UNICODE | re.DOTALL | re.IGNORECASE)
    result = bool(pattern.findall(source))
    return result
result = action1(cheesetext)
result = [item for item in result if action2(u'cheese', item)]
print(result)
>>> [u'I love cheese.', u'Yeah, cheese is all I need.', u'Cheese is REALLY I need.', u'Everyone can like cheese.']
Now, what I need is to do the same thing with a single regex. This is only an example; I have to process much more information than these cheesy texts. :-) Is it possible to combine these two actions into one regex? In other words, how can I use conditions in a regex?
>>> p = u'<tag>((?:(?!</tag>).)*cheese.*?)</tag>'
>>> patt = re.compile(p, re.UNICODE | re.DOTALL | re.IGNORECASE)
>>> patt.findall(cheesetext)
[u'I love cheese.', u'Yeah, cheese is all I need.', u'Cheese is REALLY I need.', u'Everyone can like cheese.']
This uses a negative-lookahead assertion. A good explanation of this is given by Tim Pietzcker in this question.
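To make the role of the lookahead easier to see, here is a sketch of the same idea written in verbose mode. The layout, the comments and the shortened sample text are mine; the piece that does the work is (?:(?!</tag>).)*, which consumes one character at a time only while no </tag> starts at that position.

```python
import re

cheesetext = """<tag>I love cheese.</tag>
<tag>But let me explain one thing.</tag>
<tag>Cheese is REALLY I need.</tag>
<tag>It's a question of the time, I think.</tag>"""

pattern = re.compile(r"""
    <tag>
    (
        (?:(?!</tag>).)*    # anything, but never step over a closing tag
        cheese              # the word the tag body must contain
        (?:(?!</tag>).)*    # the rest of the body, again within this tag
    )
    </tag>
""", re.VERBOSE | re.DOTALL | re.IGNORECASE)

matches = pattern.findall(cheesetext)
print(matches)
```

Without the lookahead, a tag that lacks the word could still match by running on into the next tag that has it, since DOTALL lets . cross line breaks.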
You can use |.
>>> import re
>>> m = re.compile("(Hello|Goodbye) World")
>>> m.match("Hello World")
<_sre.SRE_Match object at 0x01ECF960>
>>> m.match("Goodbye World")
<_sre.SRE_Match object at 0x01ECF9E0>
>>> m.match("foobar")
>>> m.match("Hello World").groups()
('Hello',)
In addition, if you need actual conditions, Python's re offers lookahead assertions, (?=...) and (?!...), backreferences such as (?P=name), and a true conditional on a previously matched group, (?(id/name)yes-pattern|no-pattern). See Python's re module docs.
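As a small illustration of a real conditional in Python's re, the construct (?(1)...) below requires a closing parenthesis only if group 1 matched an opening one. The pattern and test strings are made up for this sketch and have nothing to do with the cheese example.

```python
import re

# (\()?        optionally match an opening parenthesis (group 1)
# \w+          the word itself
# (?(1)\))     if group 1 matched, a closing parenthesis is required
pattern = re.compile(r"^(\()?\w+(?(1)\))$")

results = {text: bool(pattern.match(text))
           for text in ["abc", "(abc)", "(abc", "abc)"]}
print(results)
```

Only the balanced forms, abc and (abc), are accepted.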
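I propose using a negative lookahead to make sure no </tag> ends up inside the match. The flags are needed so that "Cheese" with a capital C is matched too, as in the question:
re.findall(r'<tag>((?:(?!</tag>).)*?cheese(?:(?!</tag>).)*?)</tag>', cheesetext, flags=re.IGNORECASE | re.DOTALL)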