Python regex with variable input

I have a text file (say test.txt), e.g.
a ......
aa ......
a+a .....
aa+ .....
a+ .....
aaa .....
.........
Now I would like to find the line number of a particular string, e.g. 'a', 'aa+' etc. I've tried to find an exact match of the input string using a regex.
import re

name = 'a'
p = re.compile(r'\b' + re.escape(name) + r'\b')
with open('test.txt') as inpfile:
    for num, line in enumerate(inpfile):
        if p.search(line):
            print(num)
The program should print "0" only, but it prints 0, 2, 4.
My expected output is
name='a'
output: 0
name='aa'
output: 1
name='aa+'
output: 3 and so on...
I understand that the regular expression I used above is not correct. It would be helpful if you could share comments or suggestions on how to build the regular expression so that it gives the desired output for all of these patterns.
Thanks.

Why would it not print 2 and 4? a+a and a+ both contain a surrounded by word boundaries, exactly as you have specified with \b. Perhaps you want to match the start and end of the line instead? E.g.
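import re

name = 'a'
p = re.compile('^' + re.escape(name) + '$')
with open('test.txt') as inpfile:
    for num, line in enumerate(inpfile):
        # rstrip('\n') drops the trailing newline; a raw r'\n' would strip backslashes and n's instead
        if p.search(line.rstrip('\n')):
            print(num)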
But if you're looking to match the line exactly, why go to the trouble of using a regular expression?
name = 'a'
with open('test.txt') as inpfile:
    for num, line in enumerate(inpfile):
        # Compare against the line without its trailing newline
        if name == line.rstrip('\n'):
            print(num)

The problem lies in what your regular expression actually says. In lay terms, you are matching the expression:
"word boundary" followed by an 'a' followed by another "word boundary"
and that is why it matches lines 0 (a), 2 (a+a) and 4 (a+).
A word boundary (\b) sits between a word character (letter, digit or underscore) and anything else, so whitespace, the start and end of the line, and '+' all mark the end of a word.

You should not use \b here: it also matches the a in a+a and a+. I think you may want ^a$.
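If the rest of each line is separated from the leading string by whitespace, as the sample file suggests, one option is to require whitespace (or the line edge) on both sides of the name instead of a word boundary. The following is only a sketch under that assumption; the helper name matching_line_numbers and the in-memory sample list are made up for illustration and stand in for reading test.txt.

```python
import re

def matching_line_numbers(lines, name):
    """Return 0-based indexes of lines where `name` appears as a whole
    whitespace-delimited token (hypothetical helper, not from the thread)."""
    # (?<!\S): no non-space character directly before the match
    # (?!\S):  no non-space character directly after the match
    pattern = re.compile(r'(?<!\S)' + re.escape(name) + r'(?!\S)')
    return [num for num, line in enumerate(lines) if pattern.search(line)]

# Assumed stand-in for the contents of test.txt
sample = ["a ......", "aa ......", "a+a .....", "aa+ .....",
          "a+ .....", "aaa .....", "........."]

for name in ("a", "aa", "aa+"):
    print(name, matching_line_numbers(sample, name))
```

Unlike \b, the lookarounds treat '+' as part of the token, so 'a' no longer matches inside a+a or a+.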

Related

My Python Regex code isn't finding consecutive sets of characters

I'm trying to write a program that finds whether there are three consecutive sets of double letters in a .txt file (e.g. bookkeeper). So far I have:
import re

text = open(r'C:\Users\Jimbo.Wimbo\Desktop\List.txt')
for line in text:
    x = re.finditer(r'((\w)\2)+', line)
    if True:
        print("Yes")
    else:
        print("No")
List.txt has 5 words. There is one word with three consecutive sets of double letters right at the end, but it prints 5 "Yes"'s. What can I do to fix it using re and os?
You don't need re.finditer(), you can just use re.search().
Your regexp is wrong, it will match at least 1 set of duplicate characters, not 3.
if True: doesn't do anything useful. It doesn't mean "if the last assignment was a truthy value". You need to test the result of the regexp search.
Use any() to test if the condition matches any line in the file. Your code will print Yes or No for each line in the file.
if any(re.search(r'((\w)\2){3}', line) for line in text):
    print('Yes')
else:
    print('No')
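To see how the {3} quantifier behaves, here is a small sketch that applies the same pattern to a few words. The word list is hypothetical and stands in for List.txt, and the function name has_triple_double is made up for this example.

```python
import re

# ((\w)\2) is one doubled character; {3} requires three of them back to back.
TRIPLE_DOUBLE = re.compile(r'((\w)\2){3}')

def has_triple_double(word):
    """True if `word` contains three consecutive pairs of doubled characters."""
    return TRIPLE_DOUBLE.search(word) is not None

# Hypothetical word list standing in for List.txt
words = ["bookkeeper", "committee", "balloon", "hello", "Mississippi"]

for word in words:
    print(word, has_triple_double(word))

# One overall answer for the whole list, as with any() above
print("Yes" if any(has_triple_double(w) for w in words) else "No")
```

Words such as committee contain three doubled letters, but they are not adjacent, so the pattern does not match them.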
I think your regex is incorrect.
A good way to check your regex is to use an online regex checker, and you can test your regex against any number of strings you provide.
Here is one possible solution to your query:
import re

text = open(r'C:\Users\Jimbo.Wimbo\Desktop\List.txt')
for line in text:
    # Counts doubled letters anywhere in the line, not only adjacent pairs
    x = len(re.findall(r'(.)\1', line))
    if x == 3:
        print(f"Found a word with 3 duplicate letters : {line}")
    else:
        print(f"Word: {line}, Duplicate letters : {x}")
Hope this helps.

Python regular expression to check start string is present or not

I am trying to write a regular expression that checks how a line starts and counts certain characters in the line.
Example:
File.txt
# Compute
[ checking
a = b
a
a=b>c=d
I iterate over this file and ignore every line that meets the condition below.
My condition:
(line.startswith("[") or line.startswith("#") or line.count("=") > 1 or '=' not in line)
I need to rewrite the above condition as a regex.
I am trying the following:
re.search("^#",line)
re.search("^/[",line)
How can I write a regex that checks whether the line starts with "#" or "[" and also covers the other conditions?
If you actually wish to use a single regular expression, you can use the following pattern:
^[^#\[][^=]*?=[^=]*?$
It matches exactly the lines that do not meet the conditions in your question, so iterating over the matches skips every line you wanted to ignore. A single pattern also saves you from mixing Python logic with regular expressions, which may be more consistent.
Explanation:
^ anchors to the start of the string
[^#\[] Makes sure there is not a [ or a # at the start of the line
[^=]*? lazily match any number of anything except an =
= match exactly one =
[^=]*? lazily match any number of anything except an =
$ end of string anchor.
You could use this pattern with grep, for example, if you're running bash, to extract the matching lines (and so skip the lines you want ignored), or use a simple Python script as follows:
import re

pattern = re.compile(r'^[^#\[][^=]*?=[^=]*?$')

# For loop solution
with open('test.txt') as f:
    for line in f:
        if pattern.match(line):
            print(line.rstrip('\n'))

# Alternative one-line generator expression
with open('test.txt') as f:
    print('\n'.join(line.rstrip('\n') for line in f if pattern.match(line)))
For your given input file, both will print:
a = b
For the first set of startswith conditions you can use re.match:
if re.match(r'[\[#]', text):
    ...
For the remaining conditions, you can use re.findall (if you want the count):
if len(re.findall('=', text)) != 1:
    ...
The original condition joins its tests with or, so combine the two above with an or, like this:
if re.match(r'[\[#]', text) or len(re.findall('=', text)) != 1:
    ...
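As a quick check of how a single "keep" pattern relates to the original condition, the sketch below runs both over the sample lines from the question. The pattern here is a variation on the one suggested above, not the answerer's exact pattern: it uses a negative lookahead for the first character, so a line that begins with '=' is judged the same way as by the plain Python condition. The names KEEP and should_ignore are made up for this example.

```python
import re

# Lines to KEEP: must not start with '#' or '[', and must hold exactly one '='.
# (?![#\[]) is a lookahead, so it checks the first character without consuming it.
KEEP = re.compile(r'^(?![#\[])[^=]*=[^=]*$')

def should_ignore(line):
    """The original condition from the question, written without regex."""
    return (line.startswith("[") or line.startswith("#")
            or line.count("=") > 1 or "=" not in line)

# Sample lines from the question, plus one edge case starting with '='
sample = ["# Compute", "[ checking", "a = b", "a", "a=b>c=d", "=b"]

for line in sample:
    print(repr(line), should_ignore(line), KEEP.match(line) is None)

kept = [line for line in sample if KEEP.match(line)]
print(kept)
```

For every sample line the two columns agree, which is the behaviour the question asks for.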

Extract variables using python regex

Input file contains following lines:
a=b*c;
d=a+2;
c=0;
b=a;
Now for each line I want to extract the variables that have been used. For example, for line 1, the output should be [a,b,c]. Currently I am doing the following:
var = ['a', 'b', 'c', 'd']  # list of variables
for line in file_ptr:
    if '=' in line:
        temp = line.split('=')
        ans = list(temp[0])
        if '+' in temp[1]:
            pass  # do something
        elif '*' in temp[1]:
            pass  # do something
        else:
            pass  # single variable as in line 4 OR constant as in line 3
Is it possible to do this using regex?
EDIT:
Expected output for above file :
[a,b,c]
[d,a]
[c]
[a,b]
I would use re.findall() with whatever pattern matches variable names in the example's programming language. Assuming a typical language, this might work for you:
import re

lines = '''a=b*c;
d=a+2;
c=0;
b=a;'''
for line in lines.splitlines():
    print(re.findall('[_a-z][_a-z0-9]*', line, re.I))
I'd use some shorter pattern for matching variable names:
import re
strs = ['a=b*c;', 'd=a+2;', 'c=0;', 'b=a;']
print([re.findall(r'[_a-z]\w*', x, re.I) for x in strs])
The pattern matches:
[_a-z] - a _ or an ASCII letter (upper or lower case, because of the case-insensitive flag re.I)
\w* - 0 or more word characters (letters, digits or underscores).
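Building on the findall approach, here is a sketch that returns each variable once per line, in order of first appearance. It assumes identifiers start with a letter or underscore, which the question does not actually specify; the helper name variables_used is made up for this example.

```python
import re

# Assumed identifier syntax: a letter or underscore, then word characters.
IDENTIFIER = re.compile(r'[A-Za-z_]\w*')

def variables_used(line):
    """Return the distinct identifiers in `line`, in order of first appearance."""
    seen = []
    for name in IDENTIFIER.findall(line):
        if name not in seen:
            seen.append(name)
    return seen

for line in ["a=b*c;", "d=a+2;", "c=0;", "b=a;", "a=a+a;"]:
    print(line, variables_used(line))
```

Numeric constants such as 2 and 0 are skipped because the pattern cannot start on a digit.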
If you want just the variables, then do this:
answer = []
for line in file_ptr:
    temp = []
    for char in line:
        if char.isalpha():
            temp.append(char)
    answer.append(temp)
A word of caution though: this works only with variables that are exactly 1 character long.
I'm not entirely sure what you're after, but you can do something like this:
re.split(r'[^\w]', line)
to give a list of the word tokens in the line (note the trailing empty string):
>>> re.split(r'[^\w]', 'a=b*c;')
['a', 'b', 'c', '']
This is how I did it:
l = re.split(r'[^A-Za-z]', 'a=b*2;')
l = list(filter(None, l))  # drop the empty strings

Regular expression, trimming after a particular sign and neglecting the list terms which do not have that sign

import re

file = open('SMSm.txt', 'r')
file2 = open('SMSw.txt', 'w')
debited = []
for line in file.readlines():
    if 'debited with' in line:
        a = re.findall(r'[INR]\S*', line)
        debited.append(a)
        file2.write(line)
print(re.findall(r'^(.*?)(=)?$', debited))
My output is [['INR 2,000=2E00'], ['INR 12,000=2E400', 'NFS*Cash'], ['INR 2,000=2E0d0']]
I only want the digits after INR. For example ['INR 2,000','INR 12000','INR 2000']. What changes shall I make in the regular expression?
I have tried using str(debited) but it didn't work out.
You can use a simple regex that matches INR, then optional whitespace, then digits with , as the separator:
import re
s = "[['INR 2,000=2E00']['INR 12,000=2E400', 'NFS*Cash']['INR 2,000=2E0d0']]"
t = re.findall(r"INR\s*(\d+(?:,\d+)*)", s)
print(t)
# Result: ['2,000', '12,000', '2,000']
With findall, all captured texts are returned as a list.
If you want INR as part of the output, just remove the capturing parentheses from the pattern: r"INR\s*\d+(?:,\d+)*".
UPDATE
Just tried out a non-regex approach (a bit error prone if there are entries with no =), here it is:
t = [x[0:x.find("=")].strip("'") for x in s.strip("[]").replace("][", "?").split("?")]
print(t)
Given the code you already have, the simplest solution is to make the extracted string start with the literal INR and end just before the equals sign. Note that [INR] is a character class that matches a single I, N or R, which is why NFS*Cash was captured as well. Just replace this line
a= re.findall(r'[INR]\S*', line)
with this:
a = re.findall(r'INR\s*[^\s=]*', line)
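Putting the pieces together, the following sketch pulls the INR amount out of each line that mentions 'debited with'. The sample messages are hypothetical, since the real contents of SMSm.txt are not shown, and the helper name debited_amounts is made up for this example.

```python
import re

# Literal INR, optional whitespace, then digits with optional comma groups.
AMOUNT = re.compile(r'INR\s*\d+(?:,\d+)*')

def debited_amounts(lines):
    """Collect the first INR amount from each line mentioning 'debited with'."""
    amounts = []
    for line in lines:
        if 'debited with' in line:
            match = AMOUNT.search(line)
            if match:
                amounts.append(match.group())
    return amounts

# Hypothetical SMS lines; the real file contents are an assumption here.
sample = [
    "Your a/c is debited with INR 2,000=2E00 on 01-Jan",
    "Your a/c is credited with INR 500=2E00 on 02-Jan",
    "Your a/c is debited with INR 12,000=2E400 NFS*Cash",
]

print(debited_amounts(sample))
```

Because the pattern spells out INR literally, tokens such as NFS*Cash are never picked up.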

Simple Filter Python script for Text

I am trying to create what must be a simple filter function which runs a regex against a text file and returns all words containing that particular regex.
For example, if I wanted to find all words that contain "abc" and I had the list abcde, bce, xyz and zyxabc, the script would return abcde and zyxabc.
I have the script below, but I am not sure whether it is just the regex I am getting wrong. It returns abc twice rather than the full words. Thanks.
import re

text = open("test.txt", "r")
regex = re.compile(r'(abc)')
for line in text:
    target = regex.findall(line)
    for word in target:
        print(word)
I think you don't need a regex for such a task. You can simply split each line into a list of words, loop over the words and use the in operator:
with open("test.txt") as f:
    for line in f:
        for w in line.split():
            if 'abc' in w:
                print(w)
Your methodology is correct; however, you can change your regex to r'.*abc.*', that is:
regex = re.compile(r'.*abc.*')
This will match every line with abc in it. The wildcards .* match the remaining characters on the line.
A small demo with that particular line changed would print
abcde
zyxabc
Note: as Kasra mentions, it is better to use the in operator in such cases.
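If several words can share one line, .*abc.* returns the whole line rather than the individual words. One possible variation, sketched here under the assumption that words are separated by whitespace, is to surround the fragment with \S* instead. The helper name words_containing is made up for this example.

```python
import re

def words_containing(text, fragment):
    """Return whole whitespace-delimited words that contain `fragment`."""
    # \S* on both sides extends the match to the full word around the fragment.
    pattern = re.compile(r'\S*' + re.escape(fragment) + r'\S*')
    return pattern.findall(text)

print(words_containing("abcde bce xyz zyxabc", "abc"))
```

The pattern has no capturing group, so findall returns the full matched words rather than just abc.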
