Extract variables using python regex - python

Input file contains following lines:
a=b*c;
d=a+2;
c=0;
b=a;
Now, for each line, I want to extract the variables that have been used. For example, for line 1 the output should be [a,b,c]. Currently I am doing it as follows:
var=['a','b','c','d'] # list of variables
for line in file_ptr :
    if '=' in line :
        temp=line.split('=')
        ans=list(temp[0])
        if '+' in temp[1] :
            pass # do something
        elif '*' in temp[1] :
            pass # do something
        else :
            pass # single variable as line 4 OR constant as line 3
Is it possible to do this using regex?
EDIT:
Expected output for above file :
[a,b,c]
[d,a]
[c]
[a,b]

I would use re.findall() with whatever pattern matches variable names in the example's programming language. Assuming a typical language, this might work for you:
import re
lines = '''a=b*c;
d=a+2;
c=0;
b=a;'''
for line in lines.splitlines():
    print(re.findall('[_a-z][_a-z0-9]*', line, re.I))

I'd use some shorter pattern for matching variable names:
import re
strs = ['a=b*c;', 'd=a+2;', 'c=0;', 'b=a;']
print([re.findall(r'[_a-z]\w*', x, re.I) for x in strs])
See the Python demo
Pattern matches:
[_a-z] - a _ or an ASCII letter (any upper or lowercase due to the case insensitive modifier use re.I)
\w* - 0 or more alphanumeric or underscore characters.
See the regex demo
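To get one list per input line, the pattern can be wrapped in a small helper. This is only a minimal sketch: the name `extract_variables` is hypothetical, and it assumes that identifiers start with a letter or underscore and that a repeated name should be reported once, in order of first appearance (so the order may differ from the one written in the question).

```python
import re

# Assumed identifier syntax: a letter or underscore, then word characters.
IDENTIFIER = re.compile(r'[_a-z]\w*', re.I)

def extract_variables(line):
    """Return the distinct identifiers in line, in order of first appearance."""
    seen = []
    for name in IDENTIFIER.findall(line):
        if name not in seen:
            seen.append(name)
    return seen

lines = ['a=b*c;', 'd=a+2;', 'c=0;', 'b=a;']
for line in lines:
    print(extract_variables(line))
```

Numeric constants such as `2` and `0` are skipped because the pattern requires a leading letter or underscore.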

If you want just the variables, then do this:
answer = []
for line in file_ptr :
    temp = []
    for char in line:
        if char.isalpha():
            temp.append(char)
    answer.append(temp)
A word of caution though: this works only for variables that are exactly one character long. More details about isalpha() can be found in the Python documentation for str.isalpha().

I'm not entirely sure what you're after, but you can do something like this:
re.split(r'[^\w]', line)
to give a list of the word tokens in the line (note the trailing empty string left by the final ';'):
>>> re.split(r'[^\w]', 'a=b*c;')
['a', 'b', 'c', '']

This is how I did it:
l=re.split(r'[^A-Za-z]', 'a=b*2;')
l=list(filter(None,l))

Related

How can I replace part of a string with a pattern

For example, if the string is "abbacdeffel" and the pattern "xyyx" is to be replaced with "1234",
the result would go from "abbacdeffel" to "1234cd1234l".
I have tried to think this through but I couldn't come up with anything. At first I thought a dictionary might help, but still nothing came to mind.
What you're looking to do can be accomplished with regular expressions (regex for short). Regular expressions let you extract exactly the part of a string that you want.
In your case, you want to match the pattern abba, so use the following regex:
(\w)(\w)\2\1
https://regex101.com/r/hP8lA3/1
This captures two word characters and uses backreferences to require that the second character repeats and is then followed by the first one again.
So implementing this in python code looks like this:
First, import the regex module in python
import re
Then, declare your variable
text = "abbacdeffel"
The re.finditer returns an iterable so you can iterate through all the groups
matches = re.finditer(r"(\w)(\w)\2\1", text)
Go through all the matches that the regexp found and replace the pattern with "1234"
for match in matches:
    text = text.replace(match.group(0), "1234")
For debugging:
print(text)
Complete Code:
import re
text = "abbacdeffel"
matches = re.finditer(r"(\w)(\w)\2\1", text)
for match in matches:
    text = text.replace(match.group(0), "1234")
print(text)
You can learn more about Regular Expressions here: https://regexone.com/references/python
New version of code (there was a bug):
def replace_with_pattern(pattern, line, replace):
    from collections import OrderedDict
    set_of_chars_in_pattern = set(pattern)
    indice_start_pattern = 0
    output_line = ""
    while indice_start_pattern < len(line):
        potential_end_pattern = indice_start_pattern + len(pattern)
        subline = line[indice_start_pattern:potential_end_pattern]
        print(subline)
        set_of_chars_in_subline = set(subline)
        if len(set_of_chars_in_subline) != len(set_of_chars_in_pattern):
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
            continue
        map_of_chars = OrderedDict()
        liste_of_chars_in_pattern = []
        for char in pattern:
            if char not in liste_of_chars_in_pattern:
                liste_of_chars_in_pattern.append(char)
        print(liste_of_chars_in_pattern)
        for subline_char in subline:
            if subline_char not in map_of_chars.values():
                map_of_chars[liste_of_chars_in_pattern.pop(0)] = subline_char
        print(map_of_chars)
        wanted_subline = ""
        for char_of_pattern in pattern:
            wanted_subline += map_of_chars[char_of_pattern]
        print("wanted_subline =" + wanted_subline)
        if subline == wanted_subline:
            output_line += replace
            indice_start_pattern += len(pattern)
        else:
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
    return output_line
Some tests:
test1 = replace_with_pattern("xyyx", "abbacdeffel", "1234")
test2 = replace_with_pattern("abbacdeffel", "abbacdeffel", "1234")
print(test1, test2)
=> 1234cd1234l 1234
Here goes my attempt:
([a-zA-Z])(?!\1)([a-zA-Z])\2\1
Assuming you want to match letters only (for other ranges, change both [a-zA-Z] as appropriate), we have:
([a-zA-Z])
Find the first character, and note it so we can later refer to it with \1.
(?!\1)
Check to see if the next character is not the same as the first, but without advancing the search pointer. This is to prevent aaaa being accepted. If aaaa is OK, just remove this subexpression.
([a-zA-Z])
Find the second character, and note it so we can later refer to it with \2.
\2\1
Now find the second again, then the first again, so we match the full abba pattern.
And finally, to do a replace operation, the full command would be:
import re
re.sub(r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1',
'1234',
'abbacdeffelzzzz')
The r at the start of the regex pattern is to prevent Python processing the backslashes. Without it, you would need to do:
import re
re.sub('([a-zA-Z])(?!\\1)([a-zA-Z])\\2\\1',
'1234',
'abbacdeffelzzzz')
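As a quick check of the points above, the sketch below runs the substitution and confirms that the raw and the escaped spellings describe the same pattern string. The variable names are chosen only for illustration.

```python
import re

raw = r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1'
escaped = '([a-zA-Z])(?!\\1)([a-zA-Z])\\2\\1'

# Both literals produce the same characters, so re sees the same pattern.
same_pattern = (raw == escaped)

# 'abba' and 'effe' are replaced; 'zzzz' is rejected by the (?!\1) lookahead.
result = re.sub(raw, '1234', 'abbacdeffelzzzz')

print(same_pattern)
print(result)
```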
Now, I see the spec has expanded to a user-defined pattern; here is some code that will build that pattern:
import re
def make_re(pattern, charset):
    result = ''
    seen = []
    for c in pattern:
        # Is this a letter we've seen before?
        if c in seen:
            # Yes, so we want to match the captured pattern
            result += '\\' + str(seen.index(c)+1)
        else:
            # No, so match a new character from the charset,
            # but first exclude already matched characters
            for i in range(len(seen)):
                result += '(?!\\' + str(i + 1) + ')'
            result += '(' + charset + ')'
            # Note we have seen this letter
            seen.append(c)
    return result
print(re.sub(make_re('xzzx', '\\d'), 'abba', 'abba1221b99999889'))
print(re.sub(make_re('xyzxyz', '[a-z]'), '123123', 'abcabc zyxzyyx zyzzyz'))
Outputs:
abbaabbab9999abba
123123 zyxzyyx zyzzyz

How to parse values appear after the same string in python?

I have an input text like this (the actual text file also contains tons of garbage characters surrounding these two strings):
(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)
I am trying to parse the text to store something like this:
value1="xxx" and value2="yyy".
I wrote python code as follows:
value1_start = content.find('value')
value1_end = content.find(';', value1_start)
value2_start = content.find('value')
value2_end = content.find(';', value2_start)
print "%s" %(content[value1_start:value1_end])
print "%s" %(content[value2_start:value2_end])
But it always returns:
value=xxx
value=xxx
Could anyone tell me how can I parse the text so that the output is:
value=xxx
value=yyy
Use a regex approach:
re.findall(r'\bvalue=[^;]*', s)
Or - if value can be any 1+ word (letter/digit/underscore) chars:
re.findall(r'\b\w+=[^;]*', s)
See the regex demo
Details:
\b - word boundary
value= - a literal char sequence value=
[^;]* - zero or more chars other than ;.
See the Python demo:
import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)
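If only the part after `value=` is needed (the question mentions storing "xxx" and "yyy"), a capturing group makes re.findall return just that part. A minimal sketch, assuming exactly two occurrences so that the result can be unpacked into two names:

```python
import re

s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"

# With one capturing group, findall returns only the captured text.
values = re.findall(r"\bvalue=([^;]*)", s)

# Unpacking assumes exactly two matches; it raises ValueError otherwise.
value1, value2 = values
print(value1, value2)
```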
Use regex to filter the data you want from the "junk characters":
>>> import re
>>> _input = '#4#5%value=xxx38u952035983049;3^&^*(^%$3value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx', 'value=yyy']
>>> for match in matches:
...     print(match)
...
value=xxx
value=yyy
>>>
Summary of the regular expression:
[a-zA-Z0-9]+: One or more alphanumeric characters
=: literal equal sign
[a-zA-Z0-9]+: One or more alphanumeric characters
For this input:
content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'
use a simple regex and manually strip off the first and last two characters:
import re
values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
    print(value)
Output:
value=xxx
value=yyy
Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.
You already have good answers based on the re module. That would certainly be the simplest way.
If for any reason (performance?) you prefer to use str methods, that is indeed possible. But you must start the search for the second string past the end of the first one:
value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)
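The same idea extends to any number of occurrences by restarting each search where the previous one ended. The following is a sketch only; the helper name `find_all_fields` and its parameters are hypothetical, and the sample string is made up for illustration.

```python
def find_all_fields(content, key='value', terminator=';'):
    """Collect every 'key...terminator' span using only str.find."""
    fields = []
    start = content.find(key)
    while start != -1:
        end = content.find(terminator, start)
        if end == -1:
            # No terminator left: take the rest of the string.
            end = len(content)
        fields.append(content[start:end])
        # Resume the search after the span just collected.
        start = content.find(key, end)
    return fields

sample = "#@!value=xxx;%^&value=yyy;$$"
print(find_all_fields(sample))
```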

Find all occurrences of a regex pattern and replace with eval output

I have a txt file which contains a lot of strings such as
Chr(101)
Chr(97)
Chr(104)
...
I am using the below code to find all occurrences of such strings using regex. What I'd like to do is to replace each occurrence with its evaluated output. So in this case I'd replace the above with:
e
a
h
The code I have is as follows:
with open(oFile, "r") as f:
    for line in f:
        # find all occurrences of Chr(\d+\) and put in a list
        chrList = [str(s) for s in re.findall(r'Chr\(\d+\)', line)]
        # print chrList
        for c in chrList:
            # print eval(c.lower())
            out = re.sub(c, eval(c.lower()), line)
If I print the eval(c.lower()) line then it outputs as expected. However the re.sub line fails with the following error:
raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
Not sure where I'm going wrong here.
You don't have to use distinct search and replace functions. You can invoke eval using the functional form of re.sub:
for line in f:
    out = re.sub(r'Chr\(\d+\)', lambda c: eval(c.group(0).lower()), line)
    print(out)
You're going to want to escape your search pattern because parentheses are special characters in regular expressions. You can easily do this using re.escape.
out = re.sub(re.escape(c), eval(c.lower()), line)
And as an example:
strings = ['Chr(100)', 'Chr(101)', 'Chr(102)']
values = [re.sub(re.escape(c), eval(c.lower()), c) for c in strings]
# ['d', 'e', 'f']
That being said, why not just use replace()?
out = line.replace(c, eval(c.lower()))
Same thing but without eval() or imports:
strings = ['Chr(100)', 'Chr(101)', 'Chr(102)']
values = [chr(x) for x in (int(c.replace("Chr(", "").replace(")","")) for c in strings)]
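The two ideas above (a replacement function and avoiding eval()) can be combined. A minimal sketch, assuming every `Chr(N)` in the input holds a valid character code; the names `CHR_CALL` and `decode_chr_calls` are hypothetical:

```python
import re

# Capture the digits so the callback can convert them directly.
CHR_CALL = re.compile(r'Chr\((\d+)\)')

def decode_chr_calls(line):
    """Replace each Chr(N) with the character whose code is N, without eval()."""
    return CHR_CALL.sub(lambda m: chr(int(m.group(1))), line)

print(decode_chr_calls('Chr(101) Chr(97) Chr(104)'))
```

Because the callback only ever calls chr(int(...)), nothing from the file is executed as code.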

python regex with variable input

I've a text file (say test.txt) e.g.
a ......
aa ......
a+a .....
aa+ .....
a+ .....
aaa .....
.........
Now I would like to find the line number of a particular string, e.g. 'a', 'aa+' etc. I've tried to find an exact match of the input string using regex.
name='a'
import re
p = re.compile(r'\b'+re.escape(name)+ r'\b')
i=0
with open('test.txt') as inpfile:
    for num, line in enumerate(inpfile):
        if p.search(line):
            print num
The program should print "0" only, but it's printing 0, 2 and 4.
My expected output is
name='a'
output: 0
name='aa'
output: 1
name='aa+'
output: 3 and so on...
I understand that the regular expression I used above is not correct. It would be helpful if you could share comments or suggestions on how to build the regular expression so that it gives the desired output for all the patterns.
Thanks.
Why would it not print 2 and 4? a+a and a+ both contain a surrounded by word boundaries, exactly as you have specified with \b. Perhaps you want to match the start and end of the line instead? E.g.
name='a'
import re
p = re.compile('^'+re.escape(name)+ '$')
with open('test.txt') as inpfile:
    for num, line in enumerate(inpfile):
        if p.search(line.rstrip('\n')):
            print(num)
But if you're looking to match the line exactly, why go to the trouble of using a regular expression?
name='a'
with open('test.txt') as inpfile:
    for num, line in enumerate(inpfile):
        if name == line.rstrip('\n'):
            print(num)
The problem lies in what your regular expression actually means. In lay terms, you are matching the expression:
"word border" followed by an 'a' followed by another "word border"
and that is why it is matching lines 0 (a), 2 (a+a) and so on.
Here, spaces, the start and end of the line, and '+' all mark the end of a word.
You should not use \b. It will match a+a and a+. I think you may want ^a$.
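If the searched name is only the first field of each line and more text follows it, anchoring on both ends of the line would be too strict. One possible sketch, under the assumption (not stated in the question) that the name is the first whitespace-delimited field; the helper name `line_numbers_for` is hypothetical:

```python
import re

def line_numbers_for(name, lines):
    """Return indexes of lines whose first field is exactly name."""
    # re.escape keeps '+' literal; the lookahead requires whitespace
    # or the end of the line right after the name.
    pattern = re.compile(re.escape(name) + r'(?=\s|$)')
    # pattern.match only matches at the start of the line.
    return [num for num, line in enumerate(lines) if pattern.match(line)]

lines = ['a ...', 'aa ...', 'a+a ...', 'aa+ ...', 'a+ ...', 'aaa ...']
for name in ('a', 'aa', 'aa+'):
    print(name, line_numbers_for(name, lines))
```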

Python: Regex to find but not include an alphanumeric

Is there a regular expression to find, for example, ">ab" but not include ">" in the result?
I want to replace some strings using re.sub, and I want to find strings starting with ">" without removing the ">".
You want a positive lookbehind assertion. See the docs.
r'(?<=>)ab'
It needs to be a fixed length expression, it can't be a variable number of characters. Basically, do
r'(?<=stringiwanttobebeforethematch)stringiwanttomatch'
So, an example:
import re
# replace 'ab' with 'e' if it has '>' before it
# here we've got '>ab' so we'll get '>ecd'
print(re.sub(r'(?<=>)ab', 'e', '>abcd'))
# here we've got 'ab' but no '>' so we'll get 'abcd'
print(re.sub(r'(?<=>)ab', 'e', 'abcd'))
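The fixed-length restriction can be seen directly. The sketch below shows that Python's re module rejects a variable-width lookbehind, and that capturing the prefix and re-inserting it is one way around this; the variable names are illustrative only.

```python
import re

# A lookbehind containing '+' has no fixed width, so re refuses to compile it.
try:
    re.compile(r'(?<=>+)ab')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Alternative for a variable-length prefix: capture it and put it back.
result = re.sub(r'(>+)ab', r'\1e', '>>abcd')

print(lookbehind_ok, result)
```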
You can use a back reference in sub:
import re
test = """
>word
>word2
don't replace
"""
print re.sub('(>).*', r'\1replace!', test)
Outputs:
>replace!
>replace!
don't replace
I believe this accomplishes what you actually want: replacing strings that start with '>' without removing the '>' itself.
If you want to avoid using the re module, you can also use the startswith() string method.
>>> foo = [ '>12', '>54', '34' ]
>>> for line in foo:
...     if line.startswith('>'):
...         line = line.strip('>')
...     print line
...
12
54
34
>>>
