Python finditer functionality in R?

I am a Python programmer and I want to use regular expressions in R. I want the functionality of finditer rather than findall, so that I can work with each match individually.
Suppose I have a file that contains:
<LayerDepth Units="mm" Count="4" value1="141" value2="241" value3="1104" value4="1492" value444="898" LastModified="6/11/2012"
Now if I use this piece of code:
import re
pattern = r'(value\d.+?)"(\d.+?)"'
with open("file1.txt", 'r') as f:
    match = re.finditer(pattern, f.read())
    for i in match:
        print(i.group())
The output will be:
value1="141"
value2="241"
value3="1104"
value4="1492"
value444="898"
I want the same functionality in R. How can I achieve this?

We can use gregexpr with the following pattern:
(value\d+="\d+")
Then, use regmatches with the output of gregexpr to obtain the actual matches from the input string.
x <- c("<LayerDepth Units=\"mm\" Count=\"4\" value1=\"141\" value2=\"241\" value3=\"1104\" value4=\"1492\" value444=\"898\" LastModified=\"6/11/2012\" Now")
m <- gregexpr("(value\\d+=\"\\d+\")", x)
regmatches(x, m)
[[1]]
[1] "value1=\"141\"" "value2=\"241\"" "value3=\"1104\"" "value4=\"1492\""
[5] "value444=\"898\""
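For comparison, here is a minimal Python sketch of the same tightened pattern used with re.finditer. The sample text is copied from the question; holding it in a string instead of reading file1.txt is an assumption made to keep the example self-contained.

```python
import re

# Sample line from the question, held in a string instead of a file.
text = ('<LayerDepth Units="mm" Count="4" value1="141" value2="241" '
        'value3="1104" value4="1492" value444="898" LastModified="6/11/2012"')

# Same idea as the gregexpr pattern: "value", digits, then ="digits"
pattern = re.compile(r'value\d+="\d+"')

# finditer yields one match object at a time; group() is the matched text.
matches = [m.group() for m in pattern.finditer(text)]
for item in matches:
    print(item)
```

Both gregexpr/regmatches in R and finditer in Python return the matches in the order they appear in the input.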

Related

Attempting to use re.sub but need to maintain some of the regex

I have a string, "this-is-a-big-tool", and I want to swap out THIS and TOOL for different words while keeping BIG.
import re
test = "this-is-a-big-tool"
s = [("a","b"), ("a","d"), ("c","d")]
for a,b in s:
    result = re.sub(r"this-[\w]+-[\w]+-[big|giant]-tool", "%s-moves-big-%s" % (a,b), test)
    print(result)
The only parts I care about are THIS, BIG, and TOOL: I want to swap THIS and TOOL but keep BIG, and I don't care about the other words.
So my goal is to produce something like:
a-is-a-big-b
a-is-a-giant-d
c-is-a-giant-d
I have figured out the regex, but how do I pass BIG or GIANT into the replacement portion of the code?
result = re.sub("this-[\w]+-[\w]+-[big|giant]-tool", "%s-moves-big-%s" % (a,b), test)
How do I pass the word matched by [big|giant] in the pattern into the replacement string, in place of the hard-coded big?
You can try this:
import re
test = "this-is-a-big-tool"
s = [("a","b"), ("a","d"), ("c","d")]
new_results = [re.sub('this|tool', '{}', test).format(*i) for i in s]
Output:
['a-is-a-big-b', 'a-is-a-big-d', 'c-is-a-big-d']
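If the matched word itself has to be carried into the replacement, capture groups are one way to do it. The sketch below assumes the intended pattern is the group (big|giant); the question's [big|giant] is a character class that matches a single character. The helper name swap_ends and the second input string are hypothetical and exist only for illustration.

```python
import re

# Group 1 keeps the two middle words, group 2 keeps "big" or "giant".
PATTERN = re.compile(r"this-(\w+-\w+)-(big|giant)-tool")

def swap_ends(text, first, last):
    """Replace 'this' and 'tool' while keeping the captured middle part."""
    return PATTERN.sub(
        lambda m: "%s-%s-%s-%s" % (first, m.group(1), m.group(2), last),
        text,
    )

pairs = [("a", "b"), ("a", "d"), ("c", "d")]
results = [swap_ends("this-is-a-big-tool", a, b) for a, b in pairs]
giant = swap_ends("this-is-a-giant-tool", "a", "d")
print(results)
print(giant)
```

A function is used as the replacement so that the substituted words are inserted literally, even if they contain backslashes.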

Regular expression with pattern repetition within pattern

I am trying to match the strings below using a regular expression.
Strings:
These are my variables -abc $def -geh $ijk for case1
These are my variables -lmn $opq -rst $uvw for case2
Pattern:
These\s+are\s+my\s+variables(?:\s*-(\w+)\s+\$(\w+))*\s+for\s+(case\d)
My pattern matches the strings above, but I cannot capture the groups the way I intend. My attempts give me the following results:
geh, ijk, case1
rst, uvw, case2
I want the groups to come out as below:
abc, def, geh, ijk, case1
lmn, opq, rst, uvw, case2
How should I approach this?
Use the PyPI regex module, which keeps every capture of a repeated group, with the same regex you are already using:
import regex
s = 'These are my variables -abc $def -geh $ijk for case1'
rx = regex.compile(r'These\s+are\s+my\s+variables(?:\s*-(\w+)\s+\$(\w+))*\s+for\s+(case\d)')
print([x.captures(1) for x in rx.finditer(s)])
# => [['abc', 'geh']]
print([x.captures(2) for x in rx.finditer(s)])
# => [['def', 'ijk']]
Otherwise, with the standard re module, capture the whole run of options in a single group with
These\s+are\s+my\s+variables((?:\s*-\w+\s+\$\w+)*)\s+for\s+(case\d)
and extract the separate values in a second step:
import re
r = r"These\s+are\s+my\s+variables((?:\s*-\w+\s+\$\w+)*)\s+for\s+(case\d)"
s = "These are my variables -abc $def -geh $ijk for case1"
m = re.search(r, s)
if m:
    print(re.findall(r'-(\w+)', m.group(1)))
    print(re.findall(r'\$(\w+)', m.group(1)))
    print(m.group(2))
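To get the flat output the question asks for (abc, def, geh, ijk, case1), the two-step idea can be wrapped in a small function. This is only a sketch using the standard re module; the names LINE_RX, PAIR_RX and parse_line are hypothetical.

```python
import re

# Step 1: capture the whole run of "-name $var" pairs, plus the case label.
LINE_RX = re.compile(
    r"These\s+are\s+my\s+variables((?:\s*-\w+\s+\$\w+)*)\s+for\s+(case\d)"
)
# Step 2: pull each (name, var) pair out of the captured run.
PAIR_RX = re.compile(r"-(\w+)\s+\$(\w+)")

def parse_line(line):
    """Return [name1, var1, name2, var2, ..., case], or None if no match."""
    m = LINE_RX.search(line)
    if m is None:
        return None
    values = []
    for name, var in PAIR_RX.findall(m.group(1)):
        values.extend([name, var])
    values.append(m.group(2))
    return values

lines = [
    "These are my variables -abc $def -geh $ijk for case1",
    "These are my variables -lmn $opq -rst $uvw for case2",
]
parsed = [parse_line(line) for line in lines]
for row in parsed:
    print(", ".join(row))
```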
Consider the following alternative approach using the str.lstrip and str.split functions (it returns a list of parameter sets, one for each line):
s = '''These are my variables -abc $def -geh $ijk for case1
These are my variables -lmn $opq -rst $uvw for case2'''
params = [[p.lstrip('$-') for p in l.split()[4:] if p != 'for'] for l in s.split('\n') if l]
print(params)
The output:
[['abc', 'def', 'geh', 'ijk', 'case1'], ['lmn', 'opq', 'rst', 'uvw', 'case2']]

Python usage of regular expressions

How can I extract string1#string2 from the line below?
<![CDATA[<html><body><p style="margin:0;">string1#string2</p></body></html>]]>
The # character and the structure of the line are always the same.
Simple, buggy, not reliable:
line.replace('<![CDATA[<html><body><p style="margin:0;">', "").replace('</p></body></html>]]>', "").split("#")
re.search(r'[^>]+#[^<]+',s).group()
In short, a regex is not the appropriate tool for this job. Have you tried an XML parser instead?
EDIT:
import xml.etree.ElementTree as ET
a = "<html><body><p style=\"margin:0;\">string1#string2</p></body></html>"
root = ET.fromstring(a)
c = root[0][0].text
OUT:
c
'string1#string2'
d = c.replace('#', ' ').split()
Out:
d
['string1', 'string2']
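The parser example above starts from the inner markup, while the line in the question is still wrapped in a CDATA marker. A minimal sketch that strips the wrapper first, assuming the payload is always well-formed XML with a body/p element; the helper name extract_parts is hypothetical.

```python
import xml.etree.ElementTree as ET

line = '<![CDATA[<html><body><p style="margin:0;">string1#string2</p></body></html>]]>'

PREFIX, SUFFIX = "<![CDATA[", "]]>"

def extract_parts(raw):
    """Strip the CDATA wrapper, parse the markup, and split the <p> text on '#'."""
    if raw.startswith(PREFIX) and raw.endswith(SUFFIX):
        raw = raw[len(PREFIX):-len(SUFFIX)]
    root = ET.fromstring(raw)            # root is the <html> element
    paragraph = root.find("./body/p")    # assumes the html/body/p structure
    if paragraph is None or paragraph.text is None:
        return []
    return paragraph.text.split("#")

parts = extract_parts(line)
print(parts)
```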
If you wish to use a regex:
>>> re.search(r"<p.*?>(.+?)</p>", txt).group(1)
'string1#string2'

Using Regular expressions to match a portion of the string?(python)

What regular expression can I use to match the genes (the semicolon-separated entries after GENE_LIST:) in the gene list string:
GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8
I tried: GENE_List:((( \w+).(\w+));)+* but it only captures the last gene.
Given:
>>> s="GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
You can use Python string methods to do:
>>> s.split(': ')[1].split('; ')
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
For a regex:
(?<=[:;]\s)([^\s;]+)
Or, in Python:
>>> re.findall(r'(?<=[:;]\s)([^\s;]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
You can use the following:
\s([^;\s]+)
The captured group, ([^;\s]+), will contain the desired substrings, each of which is preceded by whitespace (\s).
>>> s = 'GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8'
>>> re.findall(r'\s([^;\s]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
UPDATE
It's in fact much simpler:
[^\s;]+
However, first take a substring so that you keep only the part you need (the genes, without the GENE_LIST: prefix).
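A short sketch of that two-step idea, assuming the prefix always ends at the first colon:

```python
import re

s = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"

# Step 1: keep only the text after the first colon.
_, _, gene_part = s.partition(":")

# Step 2: every run of characters that is neither whitespace nor ';' is a gene.
genes = re.findall(r"[^\s;]+", gene_part)
print(genes)
```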
string = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
re.findall(r"([^;\s]+)(?:;|$)", string)
The output is:
['F59A7.7',
'T25D3.3',
'F13B12.4',
'cysl-1',
'cysl-2',
'cysl-3',
'cysl-4',
'F01D4.8']

strip sides of a string in python

I have a list like this:
Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]
I want to strip the unwanted characters using Python so the list would look like:
Tomato
Populus trichocarpa
I can do the following for the first one:
name = ">Tomato4439"
name = name.strip(">1234567890")
print(name)
Tomato
However, I am not sure what to do with the second one. Any suggestion would be appreciated.
given:
s='Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]'
this:
s = s.split()
[s[0].strip('0123456789,'), s[-2].replace('[',''), s[-1].replace(']','')]
will give you
['Tomato', 'Populus', 'trichocarpa']
It might be worth investigating regular expressions if you are going to do this frequently and the rules are not that static, since regular expressions are much more flexible in that case. For the sample you present, though, this will work.
import re
a = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
re.sub(r"^([A-Za-z]+).+\[([^]]+)\]$", r"\1 \2", a)
This gives
'Tomato Populus trichocarpa'
If the strings you're trying to parse are semantically consistent, then your best option might be to classify the different "types" of strings you have and then create regular expressions to parse them using Python's re module.
>>> import re
>>> line = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
>>> match = re.match(r"^([a-zA-Z]+).*\[([a-zA-Z ]+)\].*", line)
>>> match.groups()
('Tomato', 'Populus trichocarpa')
Edited so that the second group does not include the []. This should work for anything that matches the pattern of your query (i.e. starts with a name and ends with something in []). It would also match, for example:
"Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa apples]"
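When many entries have to be processed, it may help to wrap the match in a function that returns None for lines that do not fit the pattern. This is a sketch only; the names ENTRY_RX and extract_names are hypothetical, and the pattern assumes each entry starts with a name (optionally preceded by ">") and ends with a bracketed species.

```python
import re

# Leading letters are the name; the last [...] on the line is the species.
ENTRY_RX = re.compile(r"^>?([A-Za-z]+).*\[([^\]]+)\]\s*$")

def extract_names(line):
    """Return (name, species), or None when the line does not fit the pattern."""
    m = ENTRY_RX.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2)

entry = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
names = extract_names(entry)
print(names)
```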
Previous answers were simpler than mine, but:
Here is one way to print the stuff that you don't want.
tag = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
import re
find = re.search(r'>(.+?) \[', tag).group(1)
print(find)
Gives you
gi|224089052|ref|XP_002308615.1| predicted protein
Then you can use the replace method to remove that from the original string, and the translate method to remove the remaining unwanted characters.
tag2 = tag.replace(find, "")
tag3 = tag2.translate(str.maketrans("", "", ">[],"))
print(tag3)
Gives you
Tomato4439 Populus trichocarpa
