I am trying to write a regular expression which should check the string at the start of a line and count certain substrings present in the line.
Example:
File.txt
# Compute
[ checking
a = b
a
a=b>c=d
Iterate over this file and ignore any line matching the condition below.
My Condition:
(line.startswith("[") or line.startswith("#") or line.count("=") > 1 or '=' not in line)
I need to rewrite the above condition as a regex.
I am trying the below:
re.search("^#",line)
re.search("^/[",line)
How do I write this regex checking that a line starts with "#" or "[", along with the other conditions?
If you actually wish to use a single regular expression, you can use the following pattern:
^[^#\[][^=]*?=[^=]*?$
This matches every line that does not fit the logic you specified in your question, and so ignores all lines meeting the conditions you listed. A single pattern saves you mixing Python logic with regular expressions, which may be more consistent.
Explanation:
^ anchors to the start of the string
[^#\[] Makes sure there is not a [ or a # at the start of the line
[^=]*? lazily match any number of anything except an =
= match exactly one =
[^=]*? lazily match any number of anything except an =
$ end of string anchor.
You could use this pattern with grep, for example, if you're running bash, to extract all the matching lines (and so skip every line you want ignored), or use a simple Python script as follows:
import re

pattern = re.compile(r'^[^#\[][^=]*?=[^=]*?$')

# For loop solution
with open('test.txt') as f:
    for line in f:
        if pattern.match(line):
            print(line)

# Alternative one-line generator expression
with open('test.txt') as f:
    print('\n'.join(line for line in f if pattern.match(line)))
For your given input file, both will print out:
a = b
For the first set of startswith conditions you can use re.match:
if re.match(r'[\[#]', text):
    ...
For the second condition, you can use re.findall (if you want the count):
if len(re.findall('=', text)) != 1:
    ...
You can combine the two above with an or (either condition alone marks a line to ignore), like this:
if re.match(r'[\[#]', text) or len(re.findall('=', text)) != 1:
    ...
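Putting the combined check to work on the question's test.txt (an assumed filename), a minimal sketch:
import re

with open('test.txt') as f:
    for line in f:
        # Skip lines starting with '[' or '#', or lacking exactly one '='
        if re.match(r'[\[#]', line) or len(re.findall('=', line)) != 1:
            continue
        print(line.rstrip())  # prints: a = b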
A long time ago I wrote a tool that parses text files line by line and does some stuff, depending on commands and conditions in the file.
I used regex for this; however, I was never good at regex.
A line holding a condition looks like this:
[type==STRING]
And the regex I use is:
re.compile(r'^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
This regex gets me the keyword "type" and the value "STRING".
However, now I need to update my tool to have more conditions in one line, e.g.
[type==STRING][amount==0]
I need to update my regex to get me two pairs of results, one pair type/STRING and one pair amount/0.
But I'm lost on this. My regex above gets me zero results with this line.
Any ideas how to do this?
You could either match a second pair of groups:
^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*(?:\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*)?$
Or you can omit the anchors and the [^\[\]]* part to get the group 1 and group 2 values multiple times:
\[([^\]\[=]*)==([^\]\[=]*)\]
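With re.findall, for instance, the second pattern returns all the key/value pairs at once; a quick sketch using the line from the question:
import re

line = '[type==STRING][amount==0]'
pairs = re.findall(r'\[([^\]\[=]*)==([^\]\[=]*)\]', line)
print(pairs)  # [('type', 'STRING'), ('amount', '0')]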
Is it a requirement that you use regex? You can alternatively accomplish this pretty easily using the split function twice and stripping the first opening and last closing bracket.
line_to_parse = "[type==STRING]"
# omit the first and last char before splitting
pairs = line_to_parse[1:-1].split("][")
for pair in pairs:
    x, y = pair.split("==")
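For the two-pair line from the question, printing the unpacked values shows what this yields:
line_to_parse = "[type==STRING][amount==0]"
for pair in line_to_parse[1:-1].split("]["):
    x, y = pair.split("==")
    print(x, y)
# type STRING
# amount 0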
It rather depends on the precise "rules" that describe your data. However, for your given data, why not:
import re

text = '[type==STRING][amount==0]'
words = re.findall(r'\w+', text)

lst = []
for i in range(0, len(words), 2):
    lst.append((words[i], words[i+1]))

print(lst)
Output:
[('type', 'STRING'), ('amount', '0')]
For a data file formatted like this:
("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")
0 0.55432343242 0.34323443432242 0.00001
I can use pd.read_table(filename, sep=' ', header=0) and it will parse everything correctly except for the very first header, "Time Step".
Is there a way to specify a regex string for read_table() to use to parse out the header names?
I know a way to solve the issue is to just use regex to create a list of names for the read_table() function to use, but I figured there might/should be a way to directly express that in the import itself.
Edit: Here's what it returns as headers:
['("Time', 'Step"', 'courantnumber_max', 'courantnumber_avg', 'flow-time']
So it doesn't appear to be possible to do this inside the pandas.read_table() function itself. Below is the actual solution I ended up using to fix the problem:
import re

def get_headers(file, headerline, regexstring, exclude):
    # Get string of selected headerline
    with file.open() as f:
        for i, line in enumerate(f):
            if i == headerline - 1:
                headerstring = line
            elif i > headerline - 1:
                break

    # Parse headerstring
    reglist = re.split(regexstring, headerstring)

    # Filter out blank strings
    filteredlist = list(filter(None, reglist))

    # Filter out items in the exclude list
    headerslist = [entry for entry in filteredlist if entry not in exclude]

    return headerslist
get_headers(filename, 3, r'(?:" ")|["\)\(]', ['\n'])
Code explanation:
get_headers():
Arguments: file is a file object that contains the header. headerline is the line number (starting at 1) on which the header names appear. regexstring is the pattern that will be fed into re.split(); it's highly recommended that you prepend an r to the pattern. exclude is a list of miscellaneous strings that you want removed from the header list.
The regex pattern I used:
First up we have the pipe (|) symbol. This separates the "normal" split delimiter (which is the " ") from the other stuff that needs to be gotten rid of (namely the parentheses).
Starting with the first alternative: (?:" "). We have the (...) since we want to match those characters in order. The " " is what we want to split around. The ?: says not to capture the contents of the group; this is important, as otherwise re.split() keeps any captured groups as separate items. See re.split() in the documentation.
The second alternative is simply a character class of the other characters. Without it, the first and last items would be '("Time Step' and 'flow-time)\n'. Note that this causes \n to be treated as a separate entry in the list, which is why we use the exclude argument to clean that up after the fact.
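As a quick sanity check, here is what the split and blank-filter steps produce for the question's header line (a sketch; the literal '\n' stands in for the newline a real file line carries):
import re

headerstring = '("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")\n'
reglist = re.split(r'(?:" ")|["\)\(]', headerstring)
filteredlist = list(filter(None, reglist))
print(filteredlist)
# ['Time Step', 'courantnumber_max', 'courantnumber_avg', 'flow-time', '\n']
The leftover '\n' is exactly what the exclude argument then strips out.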
I need to match strings like: '2017-08-09,08:59:20.445 INFO {peers_peak_parameters_grid} [eval_peers_peak] Evaluating batch 0 out of 2158'
I have tried different regular expressions such as: comp = re.compile("Evaluating batch ^[-+]?[0-9]+$ out of ^[-+]?[0-9]+$")
and this is an example usage:
import re

import numpy as np

def get_batch_process_time(log):
    loglines = log.splitlines()
    comp = re.compile("Evaluating batch ^[-+]?[0-9]+$ out of ^[-+]?[0-9]+$")
    times = []
    matches = []
    for i, line in enumerate(loglines):
        if comp.search(line):
            # string2datetime is my own helper for parsing the timestamp
            time = string2datetime(line.split(' ')[0])
            times.append(time)
            matches.append(line)
    return np.array(times), matches
Unfortunately none of the lines seems to match the given pattern. I assume that I'm using the wrong regular expression.
What is the right regular expression?
Am I using re correctly? (should I use match rather than search?)
^[-+]?[0-9]+$ alone would match a whole string consisting of an optional plus or minus sign followed by a non-empty sequence of digits.
When I say a whole string, it's because ^ and $ are "anchors" that will match respectively the start and end of the string, which is why your regex doesn't work.
I suppose you could also remove the optional sign part, i.e. [-+]?.
You could have found that out by yourself by testing your regex in regex101 (check the explanation panel on the top right) or a similar utility.
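Dropping the anchors (and the optional sign, if negative counts never occur in these logs) gives a pattern that should match the sample line; a sketch:
import re

comp = re.compile(r"Evaluating batch [0-9]+ out of [0-9]+")
line = '2017-08-09,08:59:20.445 INFO {peers_peak_parameters_grid} [eval_peers_peak] Evaluating batch 0 out of 2158'
print(bool(comp.search(line)))  # True
search is the right call here, since the phrase sits in the middle of the line; match only anchors at the start of the string.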
I would like to find the most efficient and simple way to test in python if a string passes the following criteria:
contains nothing except:
digits (the numbers 0-9)
decimal points: '.'
the letter 'e'
the sign '+' or '-'
spaces (any number of them)
tabs (any number of them)
I can do this easily with nested 'if' statements, etc., but I'm wondering if there's a more convenient way...
For example, I would want the string:
0.0009017041601 5.13623e-05 0.00137531 0.00124203
to be 'true' and all the following to be 'false':
# File generated at 10:45am Tuesday, July 8th
# Velocity: 82.568
# Ambient Pressure: 150000.0
Time(seconds) Force_x Force_y Force_z
That's trivial for a regex, using a character class:
import re

if re.match(r"[0-9e \t+.-]*$", subject):
    # Match!
However, that will (according to the rules) also match eeeee or +-e-+ etc...
If what you actually want to do is check whether a given string is a valid number, you could simply use
try:
    num = float(subject)
except ValueError:
    print("Illegal value")
This will handle strings like "+34" or "-4e-50" or " 3.456e7 ".
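Since the valid lines hold several whitespace-separated numbers, one way to combine both ideas is to try float() on every token; a sketch, not part of the original answer:
def is_number_line(subject):
    # Every whitespace-separated token must parse as a float
    tokens = subject.split()
    if not tokens:
        return False
    try:
        for token in tokens:
            float(token)
    except ValueError:
        return False
    return True

print(is_number_line("0.0009017041601 5.13623e-05 0.00137531 0.00124203"))  # True
print(is_number_line("# Velocity: 82.568"))  # False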
import re

if re.match(r"^[0-9\te+. -]+$", x):
    print "yes"
else:
    print "no"
You can try this. If there is a match, it's a pass; otherwise it's a fail. Here x will be your string.
The easiest way to check whether the string has only the required characters is by using the string.translate method.
num = "1234e+5"
# translate deletes the allowed characters; an empty result means
# the string contained nothing else
if not num.translate(None, "0123456789e+-. \t"):
    print "pass"
else:
    print "Wrong character present!!!"
You can add any other characters you need to the deletion string passed as the second parameter of the translate method.
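Note that the None-plus-deletechars form is Python 2 only; assuming you are on Python 3, the equivalent check would go through str.maketrans:
num = "1234e+5"
# str.maketrans('', '', chars) builds a table that deletes those chars
if not num.translate(str.maketrans('', '', "0123456789e+-. \t")):
    print("pass")
else:
    print("Wrong character present!!!")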
You don't need to use regular expressions; just use a test_list and the all operation:
>>> from string import digits
>>> test_list=list(digits)+['+','-',' ','\t','e','.']
>>> all(i in test_list for i in s)
Demo:
>>> s ='+4534e '
>>> all(i in test_list for i in s)
True
>>> s='+9328a '
>>> all(i in test_list for i in s)
False
>>> s="0.0009017041601 5.13623e-05 0.00137531 0.00124203"
>>> all(i in test_list for i in s)
True
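Membership tests against a list scan it linearly; if performance matters, a set gives constant-time lookups with the same result (a minor variation on the above):
>>> from string import digits
>>> test_set = set(digits) | set('+- \te.')
>>> all(i in test_set for i in "0.0009017041601 5.13623e-05")
True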
Performance-wise, running a regular expression check is costly, depending on the expression. Also, running a regex check for each valid line (i.e. each line for which the result should be True) will be costly, especially because you'll end up parsing each line with a regex and then parsing the same line again to get the numbers.
You did not say what you wanted to do with the data so I will empirically assume a few things.
First off in a case like this I would make sure the data source is always formatted the same way. Using your example as a template I would then define the following convention:
any line whose first non-blank character is a hash sign is ignored
any blank line is ignored
any line that contains only spaces is ignored
This kind of convention makes parsing much easier since you only need one regular expression to cover rules 1. to 3.: ^\s*(#|$), i.e. any number of spaces followed by either a hash sign or an end of line. On the performance side, this expression scans an entire line only when it consists of spaces and just spaces, which should not happen very often. In other cases the expression scans a line and stops at the first non-space character, which means comments are detected quickly, since scanning stops as soon as the hash is encountered, at position 0 most of the time.
If you can also enforce the following convention:
the first non blank line of the remaining lines is the header with column names
there are no blank lines between samples
there are no comments in samples
Your code would then do the following:
read lines into line for as long as re.match(r'^\s*(#|$)', line) evaluates to True;
then read the header names from the next line: headers = line.split() gives you the headers in a list, as sketched below.
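A minimal sketch of that skip-and-read loop (data.txt is an assumed filename):
import re

comment_or_blank = re.compile(r'^\s*(#|$)')

with open('data.txt') as f:
    for line in f:
        if comment_or_blank.match(line):
            continue
        # First non-comment, non-blank line holds the column names
        headers = line.split()
        break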
You can use a namedtuple for your line layout — which I assume is constant throughout the same data table:
from collections import namedtuple

class WindSample(namedtuple('WindSample', 'time, force_x, force_y, force_z')):
    def __new__(cls, time, force_x, force_y, force_z):
        return super(WindSample, cls).__new__(
            cls,
            float(time),
            float(force_x),
            float(force_y),
            float(force_z)
        )
Parsing valid lines would then consist of the following, for each line:
try:
    data = WindSample(*line.split())
except ValueError, e:
    print e
Variable data would hold something such as:
>>> print data
WindSample(time=0.0009017041601, force_x=5.13623e-05, force_y=0.00137531, force_z=0.00124203)
The advantage is twofold:
you run costly regular expressions only for the smallest set of lines (i.e. blank lines and comments);
your code parses floats, raising an exception whenever parsing would yield something invalid.
I'm trying to do a simple VB6 to C translator to help me port an open source game to the C language.
I want to be able to get "Npclist[NpcIndex]" from "With Npclist[NpcIndex]" using regex, and to insert it everywhere it needs to go. (In VB6, "With" acts like a macro that supplies Npclist[NpcIndex] wherever needed until it finds "End With".)
Example:
With Npclist[NpcIndex]
.goTo(245) <-- it should be replaced with Npclist[NpcIndex].goTo(245)
End With
Is it possible to use regex to do the job?
I've tried using a function to perform another regex replace between the "With" and the "End With", but I can't know what text the "With" is supplying (Npclist[NpcIndex]).
Thanks in advance
I personally wouldn't trust any single-regex solution to get it right the first time, nor would I feel like debugging it. Instead, I would parse the code line by line and cache any With expression, then use it to replace any . directly preceded by whitespace or by any type of opening bracket (add use-cases as needed):
(?<=[\s[({])\. - a positive lookbehind for any character from the set, followed by an escaped literal dot
(?:(?<=[\s[({])|^)\. - use this non-capturing list of alternatives if a to-be-replaced . can occur at the beginning of a line
import re

def convert_vb_to_c(vb_code_lines):
    c_code = []
    current_with = ""
    for line in vb_code_lines:
        if re.search(r'^\s*With', line) is not None:
            # Cache the expression following "With ", plus the dot
            current_with = line.strip()[5:] + "."
            continue
        elif re.search(r'^\s*End With', line) is not None:
            current_with = "{error_outside_with_replacement}"
            continue
        line = re.sub(r'(?<=[\s[({])\.', current_with, line)
        c_code.append(line)
    return "\n".join(c_code)
example = """
With Npclist[NpcIndex]
    .goTo(245)
End With
With hatla
    .matla.tatla[.matla.other] = .matla.other2
    dont.mind.me(.do.mind.me)
    .next()
End With
"""

# use file_object.readlines() in real life
print(convert_vb_to_c(example.split("\n")))
You can pass a function to the sub method:
import re

# just to give the idea of the regex
regex = re.compile(r'''With (.+)
((?:the-regex-for-the-VB-expression)+?)
End With''')

def repl(match):
    beginning = match.group(1)  # Npclist[NpcIndex] in your example
    return ''.join(beginning + line for line in match.group(2).splitlines())

re.sub(regex, repl, the_string)
In repl you can obtain all the information about the matching from the match object, build whichever string you want and return it. The matched string will be replaced by the string you return.
Note that you must be really careful writing the regex above. In particular, using (.+) as I did matches the whole line up to, but excluding, the newline, which may or may not be what you want (but I don't know VB, and I have no idea which regex could go there instead to catch only what you want).
The same goes for the (the-regex-for-the-VB-expression)+? part. I have no idea what code could be on those lines, hence I leave the details of implementing it to you. Maybe taking the whole line is okay, but I wouldn't trust something this simple (expressions can probably span multiple lines, right?).
Also, doing it all in one big regular expression is, in general, error-prone and slow.
I'd strongly consider regexes only to find With and End With and use something else to do the replacements.
This may do what you need in Python 2.7. I'm assuming you want to strip out the With and End With, right? You don't need those in C.
>>> import re
>>> search_text = """
... With Np1clist[Npc1Index]
... .comeFrom(543)
... End With
...
... With Npc2list[Npc2Index]
... .goTo(245)
... End With"""
>>>
>>> def f(m):
... return '{0}{1}({2})'.format(m.group(1), m.group(2), m.group(3))
...
>>> regex = r'With\s+([^\s]*)\s*(\.[^(]+)\(([^)]+)\)[^\n]*\nEnd With'
>>> print re.sub(regex, f, search_text)
Np1clist[Npc1Index].comeFrom(543)
Npc2list[Npc2Index].goTo(245)