Matching exact strings in python - python

How can I match exact strings in Python in a way that handles the following case?
For example, I want to get the value for value2 (i.e. output "Y") from a file that is formatted like this:
...
--value1 X
---value2 Y
----value3 Z
....
How can I search for the exact match "value2" while ignoring the preceding "---"? These leading characters prevent exact string matching with the "==" operator when I check each line.

You could strip the leading dashes, then split the result to get the first word without the dashes.
Let's say you iterate over the lines:
for line in lines:
    words = line.lstrip("-").split()
    # split() returns a list, so compare its first element (if there is one)
    if words and words[0] == "value2":
        print("found")
A regex can help too, with a word boundary on the right:
import re
if re.match(r"^-*value2\b", line):
    print("found")

You can remove the extra characters at the start of a string s using s.lstrip('-') before using an exact match. There are other ways to handle this, but this is the fastest and strictest way without using regular expressions.
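As a minimal sketch of the strip-then-compare idea, the helper below strips the leading dashes, compares the first word exactly, and returns the word that follows it. The name find_value and the sample lines are assumptions made for illustration; they are not part of the question.

```python
def find_value(lines, key):
    """Return the word following `key` on the first matching line, else None."""
    for line in lines:
        words = line.lstrip("-").split()
        # Exact comparison on the first word, so "value2" does not match "value20"
        if len(words) >= 2 and words[0] == key:
            return words[1]
    return None

sample = ["--value1 X", "---value2 Y", "----value3 Z", "--value20 W"]
print(find_value(sample, "value2"))  # Y
```

Because the comparison uses ==, a key that is only a prefix of another key is not matched by accident.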

Can you guarantee that all of the valid words will have a dash before them and a space afterward? If so, you could write that like:
for line in lines:
    if '-value2 ' in line:
        print(line.split()[1])

The simplest way that I know is:
for line in lines:
    if 'value2' in line:
        ...
Another way (if you need to know the position):
for line in lines:
    pos = line.find('value2')
    if pos >= 0:
        ...
More complex things can be done as well, such as a regular expression, but that depends on what validation you need. The two ways above are, I feel, the simplest.
UPDATE (addressing comment):
(Trying to keep it simple, this requires a space after the number)
for line in lines:
    for token in line.split():
        if 'value2' in token:
            ...

Related

In python, find tokens in line

A long time ago I wrote a tool that parses text files line by line and does some work depending on commands and conditions in the file.
I used regex for this; however, I was never good at regex.
A line holding a condition looks like this:
[type==STRING]
And the regex I use is:
re.compile(r'^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
This regex gives me the keyword "type" and the value "STRING".
However, now I need to update my tool to have more conditions in one line, e.g.
[type==STRING][amount==0]
I need to update my regex to get me two pairs of results, one pair type/STRING and one pair amount/0.
But I'm lost on this. My regex above gets me zero results with this line.
Any ideas how to do this?
You could either match a second pair of groups:
^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*(?:\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*)?$
Or you can omit the anchors and the [^\[\]]* part to get the group1 and group 2 values multiple times:
\[([^\]\[=]*)==([^\]\[=]*)\]
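A sketch of how the unanchored pattern could be used with re.findall, which returns one (keyword, value) tuple per bracketed condition. The name CONDITION and the sample line are illustrative assumptions.

```python
import re

# Same pattern as above: one "[key==value]" condition, no anchors
CONDITION = re.compile(r'\[([^\]\[=]*)==([^\]\[=]*)\]')

line = "[type==STRING][amount==0]"
pairs = CONDITION.findall(line)
print(pairs)  # [('type', 'STRING'), ('amount', '0')]
```

A line with a single condition yields a one-element list, so the same code path handles both the old and the new file format.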
Is it a requirement that you use regex? You can alternatively accomplish this pretty easily using the split function twice and stripping the first opening and last closing bracket.
line_to_parse = "[type==STRING]"
# omit the first and last char before splitting
pairs = line_to_parse[1:-1].split("][")
for pair in pairs:
    x, y = pair.split("==")
Rather depends on the precise "rules" that describe your data. However, for your given data why not:
import re
text = '[type==STRING][amount==0]'
words = re.findall(r'\w+', text)  # raw string, so \w is not treated as a string escape
lst = []
for i in range(0, len(words), 2):
    lst.append((words[i], words[i+1]))
print(lst)
Output:
[('type', 'STRING'), ('amount', '0')]

How would I print all the instances when the "$" shows up?

I have this string and I'm basically trying to get the numbers after the "$" shows up. For example, I would want an output like:
>>> 100, 654, 123, 111.654
The variable and string:
file = """| $100 on the first line
| $654 on the second line
| $123 on the third line
| $111.654 on the fourth line"""
And as of right now, I have this bit of code that I think helps me separate the numbers. But I can't figure out why it's only separating the fourth line. It only prints out 111.654
txt = io.StringIO(file).getvalue()
idx = txt.rfind('$')
print(txt[idx+1:].split()[0])
Is there an easier way to do this or am I just forgetting something?
Your code finds only the last $ because that's exactly what you programmed it to do.
You take the entire input, find the last $, and then split the rest of the string. This specifically ignores any other $ in the input.
You cite "line" as if it's a unit of your program, but you've done nothing to iterate through lines. I recommend that you quit fiddling with io and simply use standard file operations. You find this in any tutorial on Python files.
In the meantime, here's how you handle the input you have:
by_lines = txt.split('\n') # Split on newline characters
for line in by_lines:
    idx = line.rfind('$')
    print(line[idx+1:].split()[0])
Output:
100
654
123
111.654
Does that get you moving?
Regular expressions yay:
import re
matches = re.findall(r'\$(\d+\.\d+|\d+)', file)
Finds all integer and float amounts, ensures trailing '.' fullstops are not incorrectly captured.
This should do it! For every character in txt: if it is '$' then continue until you find a space.
print(*[txt[i+1: i + txt[i:].find(' ')] for i in range(0, len(txt)) if txt[i]=='$'])
Output:
100 654 123 111.654
Your whole sequence appears to be a single string. Try using the split function to break it into separate lines. Then, I believe you need to iterate through the entire list, searching for $ at each iteration.
I'm not the most fluent in python, but maybe something like this:
for line in txt.split('\n'):
    # search within the current line, not the whole text
    idx = line.rfind('$')
    print(line[idx+1:].split()[0])
How about this?
re.findall(r'\$(\d+\.?\d*)', file)
# ['100', '654', '123', '111.654']
The regex looks for the dollar sign \$ then grabs the maximum sized group available () containing one or more digits \d+ and zero or one decimal points \.? and zero or more digits \d* after that.
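The two regex answers in this thread differ in one detail: how they treat a full stop that directly follows an amount. The sketch below runs both patterns on a made-up sentence (the sample text is an assumption, not the asker's data) to show the difference.

```python
import re

text = "Paid $100. Then $5.50 more."

# Optional decimal point: also captures a sentence-ending full stop
loose = re.findall(r'\$(\d+\.?\d*)', text)
# Decimal point only when digits follow it
strict = re.findall(r'\$(\d+\.\d+|\d+)', text)

print(loose)   # ['100.', '5.50']
print(strict)  # ['100', '5.50']
```

On the asker's four lines both patterns give the same result, since no amount is followed by a full stop there.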

I want to split a string by a character on its first occurrence, which belongs to a list of characters. How to do this in python?

Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something on the lines of:
def find_char(string):
    if string.find("some_char"):
        #do xyz with some_char
    elif string.find("another_char"):
        #do xyz with another_char
    else:
        return False
and so on. The way I think of doing it is:
def find_char_split(string):
    char_list = [",","*",";","/"]
    for my_char in char_list:
        if string.find(my_char) != -1:
            my_strings = string.split(my_char)
            break
        else:
            my_strings = False
    return my_strings
Is there a more pythonic way of doing this? Or the above procedure would be fine? Please help, I'm not very proficient in python.
(EDIT): I want it to split on the first occurrence of the character, which is encountered first. That is to say, if the string contains multiple commas, and multiple stars, then I want it to split by the first occurrence of the comma. Please note, if the star comes first, then it will be broken by the star.
I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
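A minimal sketch of wrapping that call in a function. The name split_on_first is hypothetical, and returning None when no delimiter is present is an assumption (the question used False for that case).

```python
import re

def split_on_first(text, delimiters=",*;/"):
    """Split `text` once, at whichever delimiter character appears first."""
    # re.escape keeps characters such as '*' or ']' literal inside the class
    pattern = "[" + re.escape(delimiters) + "]"
    parts = re.split(pattern, text, maxsplit=1)
    # re.split returns a single-element list when nothing matched
    return parts if len(parts) == 2 else None

print(split_on_first("a*b,c"))  # ['a', 'b,c']
print(split_on_first("x,y*z"))  # ['x', 'y*z']
print(split_on_first("plain"))  # None
```

The split position depends on where the delimiters sit in the string, not on their order in the delimiter list, which is the behavior described in the question's edit.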
If you are doing the same split many times, you can compile the regex once and reuse it, which can be a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string, maxsplit=1)
If this speed-up is important (it probably isn't), you can wrap the compiled version in a function instead of recompiling it every time. The function below compiles the expression once and returns a splitter that reuses it:
def split_chars(chars, maxsplit=0, flags=0, string=None):
    # see note about the + symbol below
    c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
    def f(string, maxsplit=maxsplit):
        return c.split(string, maxsplit=maxsplit)
    return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
But also:
result = split_chars(',*;/', string=my_string, maxsplit=1)
The purpose of the + character is to treat multiple adjacent delimiters as one, if that is desired (thank you Jon Clements). If this is not desired, you can just use re.compile('[{}]'.format(''.join(chars))) above. Note that this matters even with maxsplit=1: 'a,,b' splits into ['a', 'b'] with the + and into ['a', ',b'] without it.
Finally, have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information-packed journey.

Python regular expression to check start string is present or not

I am trying to write a regular expression which should check start string of the line and count some strings present in the line.
Example:
File.txt
# Compute
[ checking
a = b
a
a=b>c=d
Iterate this file and ignore the line with below condition
My Condition:
(line.startswith("[") or line.startswith("#") or line.count("=") > 1 or '=' not in line)
I need to re write the above condition in regex.
Trying the below,
re.search("^#",line)
re.search("^/[",line)
How to write this regex checking line starts with "#" or "[" and other conditions
If you actually wish to use a single regular expression, you can use the following pattern:
^[^#\[][^=]*?=[^=]*?$
This matches only the lines that meet none of the conditions you specified, so it extracts the lines you want to keep and ignores all the others. Using a single pattern saves you from mixing Python logic with regular expressions, which may be more consistent.
Explanation:
^ anchors to the start of the string
[^#\[] Makes sure there is not a [ or a # at the start of the line
[^=]*? lazily match any number of anything except an =
= match exactly one =
[^=]*? lazily match any number of anything except an =
$ end of string anchor.
You could use this with grep, for example, if you're running bash, to extract all the matching lines and so ignore the unwanted ones, or use a simple Python script as follows:
import re
pattern = re.compile(r'^[^#\[][^=]*?=[^=]*?$')
# For loop solution
with open('test.txt') as f:
    for line in f:
        if pattern.match(line):
            # each line already ends with a newline
            print(line, end='')
# Alternative one-line generator expression
with open('test.txt') as f:
    print(''.join(line for line in f if pattern.match(line)))
For your given input file, both will print out:
a = b
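To check the pattern without a file, the sketch below runs it against the five sample lines from the question and compares the result with the original condition. The helper name skip and the in-memory list are assumptions for illustration.

```python
import re

pattern = re.compile(r'^[^#\[][^=]*?=[^=]*?$')

def skip(line):
    # The condition from the question, unchanged
    return (line.startswith("[") or line.startswith("#")
            or line.count("=") > 1 or "=" not in line)

lines = ["# Compute", "[ checking", "a = b", "a", "a=b>c=d"]
kept_by_regex = [l for l in lines if pattern.match(l)]
kept_by_condition = [l for l in lines if not skip(l)]
print(kept_by_regex)      # ['a = b']
print(kept_by_condition)  # ['a = b']
```

The two agree on the sample data but are not equivalent for every input: a line such as "=b" is kept by the condition yet not matched by the regex, because the pattern expects at least one character before the = sign.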
For the first set of startswith conditions you can use re.match:
if re.match(r'[\[#]', text):
    ...
For the second condition, you can use re.findall (if you want the count). Note that != 1 covers both "more than one =" and "no = at all":
if len(re.findall('=', text)) != 1:
    ...
You can combine the two with an or, since the original condition skips a line when either test is true:
if re.match(r'[\[#]', text) or len(re.findall('=', text)) != 1:
    ...

Elegant way test in python if string contains nothing except 0-9,e,+,-,spaces,tabs

I would like to find the most efficient and simple way to test in python if a string passes the following criteria:
contains nothing except:
digits (the numbers 0-9)
decimal points: '.'
the letter 'e'
the sign '+' or '-'
spaces (any number of them)
tabs (any number of them)
I can do this easily with nested 'if' loops, etc., but i'm wondering if there's a more convenient way...
For example, I would want the string:
0.0009017041601 5.13623e-05 0.00137531 0.00124203
to be 'true' and all the following to be 'false':
# File generated at 10:45am Tuesday, July 8th
# Velocity: 82.568
# Ambient Pressure: 150000.0
Time(seconds) Force_x Force_y Force_z
That's trivial for a regex, using a character class:
import re
if re.match(r"[0-9e \t+.-]*$", subject):
    print("Match!")
However, that will (according to the rules) also match eeeee or +-e-+ etc...
If what you actually want to do is check whether a given string is a valid number, you could simply use
try:
    num = float(subject)
except ValueError:
    print("Illegal value")
This will handle strings like "+34" or "-4e-50" or " 3.456e7 ".
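The example line in the question holds several numbers, so float() has to be applied per token rather than to the whole line. A minimal sketch, assuming whitespace-separated columns; the name is_numeric_row is hypothetical.

```python
def is_numeric_row(line):
    """True if the line has at least one token and every token parses as a float."""
    tokens = line.split()  # splits on spaces and tabs alike
    if not tokens:
        return False
    try:
        for token in tokens:
            float(token)
    except ValueError:
        return False
    return True

print(is_numeric_row("0.0009017041601 5.13623e-05 0.00137531 0.00124203"))  # True
print(is_numeric_row("# Velocity: 82.568"))                                 # False
print(is_numeric_row("Time(seconds) Force_x Force_y Force_z"))              # False
```

Note that float() is both stricter and looser than the character rule: it rejects strings like "+-e-+", but it also accepts tokens such as "nan" or "inf".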
import re
# the character class includes '.', since decimal points are allowed
if re.match(r"^[0-9\te+. -]+$", x):
    print("yes")
else:
    print("no")
You can try this. If there is a match, it's a pass; otherwise it's a fail. Here x is your string.
An easy way to check whether the string has only the required characters is the str.translate method: delete every allowed character and see whether anything is left.
num = "1234e+5"
# Python 3: str.maketrans("", "", chars) builds a table that deletes chars
if not num.translate(str.maketrans("", "", "0123456789e+-. \t")):
    print("pass")
else:
    print("Wrong character present!!!")
You can add any other characters you want to allow to the third argument of str.maketrans.
You don't need to use regular expressions; just use a test_list and the all() function:
>>> from string import digits
>>> test_list=list(digits)+['+','-',' ','\t','e','.']
>>> all(i in test_list for i in s)
Demo:
>>> s ='+4534e '
>>> all(i in test_list for i in s)
True
>>> s='+9328a '
>>> all(i in test_list for i in s)
False
>>> s="0.0009017041601 5.13623e-05 0.00137531 0.00124203"
>>> all(i in test_list for i in s)
True
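A variation on the same idea, sketched with a set instead of a list so that each membership test is a hash lookup. The names ALLOWED and only_allowed are assumptions for illustration.

```python
from string import digits

# Same characters as test_list above, stored in a set
ALLOWED = set(digits) | set("+-. \te")

def only_allowed(s):
    # set(s) <= ALLOWED is True when every character of s is in ALLOWED
    return set(s) <= ALLOWED

print(only_allowed("+4534e "))  # True
print(only_allowed("+9328a "))  # False
```

Like the all() version, this returns True for an empty string, so blank lines may need a separate check.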
Performance wise, running a regular expression check is costly, depending on the expression. Also running a regex check for each valid line (i.e. lines which the value should be "True") will be costly, especially because you'll end up parsing each line with a regex and parse the same line again to get the numbers.
You did not say what you wanted to do with the data so I will empirically assume a few things.
First off in a case like this I would make sure the data source is always formatted the same way. Using your example as a template I would then define the following convention:
any line whose first non-blank character is a hash sign is ignored
any blank line is ignored
any line that contains only spaces is ignored
This kind of convention makes parsing much easier, since you only need one regular expression to fit rules 1 to 3: ^\s*(#|$), i.e. any amount of whitespace followed by either a hash sign or the end of the line. On the performance side, this expression scans an entire line only when the line consists solely of whitespace, which should not happen very often. In other cases it stops at the first non-space character, so comments are detected quickly: the scan ends as soon as the hash is encountered, at position 0 most of the time.
If you can also enforce the following convention:
the first non blank line of the remaining lines is the header with column names
there are no blank lines between samples
there are no comments in samples
Your code would then do the following:
read lines into line for as long as re.match(r'^\s*(#|$)', line) evaluates to True;
continue, reading headers from the next line into line: headers = line.split() and you have headers in a list.
You can use a namedtuple for your line layout — which I assume is constant throughout the same data table:
from collections import namedtuple

class WindSample(namedtuple('WindSample', 'time, force_x, force_y, force_z')):
    def __new__(cls, time, force_x, force_y, force_z):
        # convert every field to float; float() raises ValueError on bad input
        return super(WindSample, cls).__new__(
            cls,
            float(time),
            float(force_x),
            float(force_y),
            float(force_z)
        )
Parsing valid lines would then consist of the following, for each line:
try:
    data = WindSample(*line.split())
except ValueError as e:
    print(e)
Variable data would hold something such as:
>>> print(data)
WindSample(time=0.0009017041601, force_x=5.13623e-05, force_y=0.00137531, force_z=0.00124203)
The advantage is twofold:
you run costly regular expressions only for the smallest set of lines (i.e. blank lines and comments);
your code parses floats, raising an exception whenever parsing would yield something invalid.
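Putting the conventions above together, a rough end-to-end sketch could look like the following. It assumes the four-column layout from the question; the names SKIP, Sample and parse_table are hypothetical, and for brevity it filters comment and blank lines everywhere rather than only before the header.

```python
import re
from collections import namedtuple

SKIP = re.compile(r'^\s*(#|$)')  # comment lines and blank lines
Sample = namedtuple("Sample", "time force_x force_y force_z")

def parse_table(lines):
    """Return (headers, samples) from an iterable of text lines."""
    rows = [line.split() for line in lines if not SKIP.match(line)]
    if not rows:
        return [], []
    # First remaining line is the header, the rest are samples
    headers, data_rows = rows[0], rows[1:]
    # float() raises ValueError on a malformed number
    samples = [Sample(*map(float, fields)) for fields in data_rows]
    return headers, samples

text = """# File generated at 10:45am Tuesday, July 8th
# Velocity: 82.568

Time(seconds) Force_x Force_y Force_z
0.0009017041601 5.13623e-05 0.00137531 0.00124203
"""
headers, samples = parse_table(text.splitlines())
print(headers)  # ['Time(seconds)', 'Force_x', 'Force_y', 'Force_z']
print(samples[0].force_x)
```

A row with the wrong number of columns raises TypeError from the namedtuple constructor rather than ValueError, so a caller that wants to report bad rows may need to catch both.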
