Python RegEx String Parsing with inconsistent data - python

I have a string that I need to extract values out of. The problem is the string is inconsistent. Here's an example of the script that has the string within it.
import re
RAW_Data = "Name Multiple Words Zero Row* (78.59/0) Name Multiple Words2* (96/24.56) Name Multiple Words3* (0/32.45) Name Multiple Words4* (96/12.58) Name Multiple Words5* (96/0) Name Multiple Words Zero Row6* (0) Name Multiple Words7* (96/95.57) Name Multiple Words Zero Row8* (0) Name Multiple Words9*"
First_Num = re.findall(r'\((.*?)\/*', RAW_Data)
Seg_Length = re.findall(r'\/(.*?)\)', RAW_Data)
#WithinParenthesis = re.findall(r'\((.*?)\)', RAW_Data) #This works correctly
print First_Num
print Seg_Length
del RAW_Data
What I need to get out of the string are all the values within the parentheses. However, I need some logic that will handle the absence of the "/" between the numbers: if the "/" doesn't exist, both First_Num and Seg_Length should be set to "0". I hope this makes sense.

Use a simple regex and add some programming logic:
import re
rx = r'\(([^)]+)\)'
string = """Name Multiple Words Zero Row* (78.59/0) Name Multiple Words2* (96/24.56) Name Multiple Words3* (0/32.45) Name Multiple Words4* (96/12.58) Name Multiple Words5* (96/0) Name Multiple Words Zero Row6* (0) Name Multiple Words7* (96/95.57) Name Multiple Words Zero Row8* (0) Name Multiple Words9*"""
for match in re.finditer(rx, string):
    parts = match.group(1).split('/')
    First_Num = parts[0]
    try:
        Seg_Length = parts[1]
    except IndexError:
        Seg_Length = None
    print "First_Num, Seg_Length: ", First_Num, Seg_Length
You might get by with a regex-only solution (e.g. with a conditional regex), but this approach is more likely to still be understandable in three months. See a demo on ideone.com.

You are trying to match values on each side of a '/' that you know may not exist. Fall back to the condition that always holds for your initial search: use a regular expression with findall to grab everything within the parentheses, then process each result depending on whether it contains a '/'.
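The approach described above can be sketched as follows. This is only a minimal illustration in Python 3 syntax: the function name parse_pairs and the shortened sample string are assumptions made for the example, and the "0"/"0" fallback follows what the question asks for rather than the None used in the other answer.

```python
import re

# Shortened sample of the data from the question.
raw_data = ("Name Multiple Words Zero Row* (78.59/0) Name Multiple Words2* (96/24.56) "
            "Name Multiple Words Zero Row6* (0) Name Multiple Words7* (96/95.57)")

def parse_pairs(text):
    """Return one (first_num, seg_length) pair of strings per parenthesised group."""
    pairs = []
    # Always-present condition: grab whatever sits between "(" and ")".
    for inner in re.findall(r'\(([^)]*)\)', text):
        if '/' in inner:
            first_num, seg_length = inner.split('/', 1)
        else:
            # No slash: the question wants both values to be "0".
            first_num, seg_length = "0", "0"
        pairs.append((first_num, seg_length))
    return pairs

pairs = parse_pairs(raw_data)
print(pairs)
```

The values are left as strings; converting them with float() would be a natural next step if numbers are needed.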

Related

pythonic method for extracting numeric digits from string

I am developing a program to read through a CSV file and create a dictionary of information from it. Each line in the CSV is essentially a new dictionary entry with the delimited objects being the values.
As one subpart of the task, I need to extract an unknown number of numeric digits from within a string. I have a working version, but it does not seem very pythonic.
An example string looks like this:
variable = Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]
variable is the string's name in the Python code, and represents the variable name within a MODBUS. I want to extract just the digits before the .WORD_type[0], which relate to the number of bytes the string is packed into.
Here is my working code. Note that it is nested within a for statement iterating through the lines in the CSV. var_length and var_type are some of the keys, i.e. {"var_length": var_length}
if re.search(".+_ST[0-9]{1,2}\\.WORD_type.+", variable):
    var_type = "string"
    temp = re.split("\\.", variable)
    temp = re.split("_", temp[2])
    temp = temp[-1]
    var_length = int(str.lstrip(temp, "ST")) / 2
You could try using a capturing group, like so:
import re
variable = "Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]"
matches = re.match(r".+_ST(\d+)\.WORD_type.+", variable)
if matches:
    print(matches[1])
matches[0] has the full match and matches[1] contains the captured group (the same as matches.group(1); indexing a match object requires Python 3.6 or later).
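Putting the capturing-group idea together with the original goal of filling var_type and var_length, a minimal sketch might look like this. The helper name parse_string_variable and the named group length are hypothetical, and the pattern assumes the "_ST<digits>.WORD_type" layout shown in the question. Integer division (//) is used on the assumption that the original "/ 2" was meant to yield a whole number.

```python
import re

# Assumed layout: "..._ST<digits>.WORD_type..." as in the question's example.
ST_PATTERN = re.compile(r"_ST(?P<length>\d+)\.WORD_type")

def parse_string_variable(variable):
    """Return the dictionary entries for a packed string variable, or None if no match."""
    match = ST_PATTERN.search(variable)
    if match is None:
        return None
    return {"var_type": "string", "var_length": int(match.group("length")) // 2}

info = parse_string_variable("Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]")
print(info)
```

Using search instead of match avoids the leading ".+", and the named group keeps the extraction readable without any split calls.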

pandas read_table with regex header definition

For a data file formatted like this:
("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")
0 0.55432343242 0.34323443432242 0.00001
I can use pd.read_table(filename, sep=' ', header=0) and it will get everything correct except for the very first header, "Time Step".
Is there a way to specify a regex string for read_table() to use to parse out the header names?
I know a way to solve the issue is to just use regex to create a list of names for the read_table() function to use, but I figured there might/should be a way to directly express that in the import itself.
Edit: Here's what it returns as headers:
['("Time', 'Step"', 'courantnumber_max', 'courantnumber_avg', 'flow-time']
So it doesn't appear to be possible to do this inside the pandas.read_table() function. Below is the solution I ended up using to fix the problem:
import re
def get_headers(file, headerline, regexstring, exclude):
    # Get string of selected headerline
    with file.open() as f:
        for i, line in enumerate(f):
            if i == headerline-1:
                headerstring = line
            elif i > headerline-1:
                break
    # Parse headerstring
    reglist = re.split(regexstring, headerstring)
    # Filter entries in reglist
    # filter out blank strs
    filteredlist = list(filter(None, reglist))
    # filter out items in exclude list (an empty or None exclude keeps everything)
    exclude = exclude or []
    headerslist = [entry for entry in filteredlist if entry not in exclude]
    return headerslist
get_headers(filename, 3, r'(?:" ")|["\)\(]', ['\n'])
Code explanation:
get_headers():
Arguments: file is an object with an open() method (for example a pathlib.Path) pointing to the file that contains the header. headerline is the line number (starting at 1) on which the header names appear. regexstring is the pattern that will be fed into re.split(); prefixing the pattern with r (a raw string) is highly recommended. exclude is a list of miscellaneous strings that you want removed from the header list.
The regex pattern I used:
First up we have the pipe (|) symbol. It separates the "normal" split delimiter (which is the " ") from the other characters that need to be removed (namely the parentheses and the remaining quotes).
Starting with the first group: (?:" "). We have the (...) since we want to match those characters in order. The " " is what we want to split around. The ?: says not to capture the contents of the group. This matters because otherwise re.split() would keep every captured group as a separate item in the result. See re.split() in the documentation.
The second group is simply the other characters. Without them, the first and last items would be '("Time Step' and 'flow-time")\n'. Note that this causes \n to be treated as a separate entry in the list, which is why the exclude argument is used to clean that up after the fact.
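To see what the pattern does on its own, here is a small sketch that applies it directly to the header line from the question. No file is involved, and the variable names are chosen just for this illustration.

```python
import re

header_line = '("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")\n'
pattern = r'(?:" ")|["\)\(]'

# Splitting leaves empty strings wherever two delimiters touch,
# plus the trailing newline as its own entry.
raw_pieces = re.split(pattern, header_line)

# Drop the empty strings and the newline, as filter(None, ...) and exclude do above.
headers = [piece for piece in raw_pieces if piece and piece != '\n']

print(raw_pieces)
print(headers)
```

The resulting list could then be handed to read_table() through its names parameter, as mentioned in the question, while skipping the original header row.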

python's regular expression that repeats

I have a list of lines. I'm writing a typical text modifying function that runs through each line in the list and modifies it when a pattern is detected.
I realized later, while writing this type of function, that a pattern may repeat multiple times in a line.
For example, this is one of the functions I wrote:
def change_eq(string):
    #inputs a string and outputs the modified string
    #replaces (X####=#) to (X####==#)
    #set pattern
    pat_eq=r"""(.*) #stuff before
    ([\(\|][A-Z]+[0-9]*) #either ( or | followed by the variable name
    (=) #single equal sign we want to change
    ([0-9]*[\)\|]) #numeric value of the variable followed by ) or |
    (.*)""" #stuff after
    p= re.compile(pat_eq, re.X)
    p1=p.match(string)
    if bool(p1)==1:
        # if pattern in pat_eq is detected, replace that portion of the string with a modified version
        original=p1.group(0)
        fixed=p1.group(1)+p1.group(2)+"=="+p1.group(4)+p1.group(5)
        string_c=string.replace(original,fixed)
        return string_c
    else:
        # returns the original string
        return string
But for an input string such as
'IF (X2727!=78|FLAG781=0) THEN PURPILN2=(X2727!=78|FLAG781=0)*X2727'
group() only works on the last pattern detected in the string, so it changes it to
'IF (X2727!=78|FLAG781=0) THEN PURPILN2=(X2727!=78|FLAG781==0)*X2727'
ignoring the first case detected. I understand that's the product of my function using the group method.
How would I address this issue? I know there is {m,n}, but does it work with match?
Thank you in advance.
Different languages handle "global" matches in different ways. You'll want to use Python's re.finditer and a for loop to iterate through the resulting match objects.
Example with some of your code:
p = re.compile(pat_eq, re.X)
string_c = string
for match_obj in p.finditer(string):
    original = match_obj.group(0)
    fixed = match_obj.group(1) + match_obj.group(2) + '==' + match_obj.group(4) + match_obj.group(5)
    string_c = string_c.replace(original, fixed)
return string_c
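One caveat: because the pattern in the question starts and ends with a greedy (.*), a single match swallows the whole line, so finditer would still yield at most one match per line. As an alternative (a different technique from the loop above, not the answerer's method), here is a minimal sketch that drops the surrounding groups and lets re.sub replace every occurrence. The lookahead for the closing ) or | is an assumption added so that adjacent conditions such as (A=1|B=2) are both rewritten.

```python
import re

# "(" or "|" followed by the variable name, a single "=", and (as a lookahead)
# the numeric value followed by ")" or "|".
EQ_PATTERN = re.compile(r'([(|][A-Z]+[0-9]*)=(?=[0-9]*[)|])')

def change_eq(line):
    """Replace every (X####=#) style comparison with (X####==#)."""
    return EQ_PATTERN.sub(r'\1==', line)

line = 'IF (X2727!=78|FLAG781=0) THEN PURPILN2=(X2727!=78|FLAG781=0)*X2727'
fixed_line = change_eq(line)
print(fixed_line)
```

Existing != and == comparisons, as well as the assignment after PURPILN2, are left alone because they do not fit the pattern.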

Python test if string matches a template value

I am trying to iterate through a list of strings, keeping only those that match a naming template I have specified. I want to accept any list entry that matches the template exactly, other than having an integer in a variable <SCENARIO> field.
The check needs to be general. Specifically, the string structure could change such that there is no guarantee <SCENARIO> always shows up at character X (to use list comprehensions, for example).
The code below shows an approach that works using split, but there must be a better way to make this string comparison. Could I use regular expressions here?
template = 'name_is_here_<SCENARIO>_20131204.txt'
testList = ['name_is_here_100_20131204.txt', # should accept
            'name_is_here_100_20131204.txt.NEW', # should reject
            'other_name.txt'] # should reject
acceptList = []
for name in testList:
    print name
    acceptFlag = True
    splitTemplate = template.split('_')
    splitName = name.split('_')
    # if lengths do not match, name cannot possibly match template
    if len(splitTemplate) == len(splitName):
        print zip(splitTemplate, splitName)
        # compare records in the split
        for t, n in zip(splitTemplate, splitName):
            if t!=n and not t=='<SCENARIO>':
                #reject if any of the "other" fields are not identical
                #(would also check that '<SCENARIO>' field is numeric - not shown here)
                print 'reject: ' + name
                acceptFlag = False
    else:
        acceptFlag = False
    # keep name if it passed checks
    if acceptFlag == True:
        acceptList.append(name)
print acceptList
# correctly prints --> ['name_is_here_100_20131204.txt']
Try with the re module for regular expressions in Python:
import re
template = re.compile(r'^name_is_here_(\d+)_20131204\.txt$')  # dot escaped so it only matches a literal "."
testList = ['name_is_here_100_20131204.txt', #accepted
            'name_is_here_100_20131204.txt.NEW', #rejected!
            'name_is_here_aabs2352_20131204.txt', #rejected!
            'other_name.txt'] #rejected!
acceptList = [item for item in testList if template.match(item)]
This should do. I understand that name_is_here is just a placeholder for alphanumeric characters?
import re
testList = ['name_is_here_100_20131204.txt', # should accept
            'name_is_here_100_20131204.txt.NEW', # should reject
            'other_name.txt',
            'name_is_44ere_100_20131204.txt',
            'name_is_here_100_2013120499.txt',
            'name_is_here_100_something_2013120499.txt',
            'name_is_here_100_something_20131204.txt']
def find(scenario):
    begin = r'[a-z_]+100_' # any combination of chars and underscores followed by 100
    end = r'_[0-9]{8}\.txt$' # exactly eight digits followed by .txt at the end
    pattern = re.compile("".join([begin,scenario,end]))
    result = []
    for word in testList:
        if pattern.match(word):
            result.append(word)
    return result
find('something') # returns ['name_is_here_100_something_20131204.txt']
EDIT: scenario is now in a separate variable; the regex only matches characters followed by 100, then the scenario, then eight digits followed by .txt.
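Since the question asks for a check that stays general when the template changes, one option is to build the regex from the template string itself. The sketch below is only an illustration under stated assumptions: compile_template is a hypothetical helper, the placeholder is assumed to stand for one or more digits, and re.fullmatch (Python 3.4+) is used so that trailing text such as .NEW is rejected.

```python
import re

def compile_template(template, placeholder='<SCENARIO>'):
    """Turn a template into a regex: literal text is escaped, the placeholder becomes digits."""
    literal_parts = [re.escape(part) for part in template.split(placeholder)]
    return re.compile(r'(\d+)'.join(literal_parts))

template = 'name_is_here_<SCENARIO>_20131204.txt'
test_list = ['name_is_here_100_20131204.txt',
             'name_is_here_100_20131204.txt.NEW',
             'name_is_here_aabs2352_20131204.txt',
             'other_name.txt']

pattern = compile_template(template)
accept_list = [name for name in test_list if pattern.fullmatch(name)]
print(accept_list)
```

Because the literal parts go through re.escape, characters such as the dot in .txt are matched literally no matter where the placeholder sits in the template.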

Dynamically Read the Format of a String, Python

I have a lookup table of Scientific Names for plants. I want to use this lookup table to validate other tables where I have a data entry person entering the data. Sometimes they get the formatting of these scientific names wrong, so I am writing a script to try to flag the errors.
There's a very specific way to format each name. For example 'Sonchus arvensis L.' specifically needs to have the S in Sonchus capitalized as well as the L at the end. I have about 1000 different plants and each one is formatted differently. Here's a few more examples:
Linaria dalmatica (L.) Mill.
Knautia arvensis (L.) Coult.
Alliaria petiolata (M. Bieb.) Cavara & Grande
Berteroa incana (L.) DC.
Aegilops cylindrica Host
As you can see, all of these strings are formatted very differently (i.e. some letters are capitalized, some aren't, and there are sometimes brackets, ampersands, periods, etc.).
My question is: is there any way to dynamically read the formatting of each string in the lookup table, so that I can compare it to the value the data entry person entered and make sure it is formatted properly? In the script below, I test (first elif) whether the value is in the lookup table by capitalizing all values, so that the match works regardless of formatting. In the next test (second elif) I can sort of test formatting by comparing against the lookup table value for value. This returns the records that fail to match because of formatting, but it doesn't tell you specifically why a record failed to match.
What I would like to do is read in the string values in the lookup table and somehow dynamically read the formatting of each string, so that I can specifically identify the error (i.e. a letter that should have been capitalized but wasn't).
So far my code snippet looks like this:
# Determine if the field heading is in a list I built earlier
if "SCIENTIFIC_NAME" in fieldnames:
    # First, Test to see if record is empty
    if not row.SCIENTIFIC_NAME:
        weedPLineErrors.append("SCIENTIFIC_NAME record is empty")
    # Second, Test to see if value is in the lookup table, regardless of formatting.
    elif row.SCIENTIFIC_NAME.upper() not in [x.upper() for x in weedScientificTableList]:
        weedPLineErrors.append("COMMON_NAME (" + row.SCIENTIFIC_NAME + ")" + " is not in the domain table")
    # Third, if the second test is satisfied, we know the value is in the lookup table. We can then test the lookup table again, without capitalizing everything to see if there is an exact match to account for formatting.
    elif row.SCIENTIFIC_NAME not in weedScientificTableList:
        weedPLineErrors.append("COMMON_NAME (" + row.SCIENTIFIC_NAME + ")" + " is not formatted properly")
    else:
        pass
I hope my question is clear enough. I looked at string templates, but I don't think it does what I want to do...at least not dynamically. If anyone can point me in a better direction, I am all eyes...but maybe I am way out to lunch on this one.
Thanks,
Mike
To get around the punctuation problem, you can use regular expressions.
>>> import re
>>> def tokenize(s):
...     return re.split('[^A-Za-z]+', s) # Split by anything that isn't a letter
...
>>> tokens = tokenize('Alliaria petiolata (M. Bieb.) Cavara & Grande')
>>> tokens
['Alliaria', 'petiolata', 'M', 'Bieb', 'Cavara', 'Grande']
To get around the capitalization problem, you can use
>>> tokens = [s.lower() for s in tokens]
From there, you could rewrite the entry in a standardized format, such as
>>> ## I'm not sure exactly what format you're looking for
>>> first, second, third = [s.capitalize() for s in tokens[:3]]
>>> "%s %s (%s)" % (first, second, third)
'Alliaria Petiolata (M)'
This probably isn't the exact formatting that you want, but maybe that will get you headed in the right direction.
You can build a dictionary of the names from the lookup table. Assuming that you have the names stored in a list (call it correctList), you can write a function that removes all formatting and normalizes the case, then store the results in a dictionary keyed by the stripped-down name. For example, the following sample code builds the dictionary:
def removeFormatting(name):
    name = name.replace("(", "").replace(")", "")
    name = name.replace(".", "")
    ...
    return name.lower()
formattingDict = dict([(removeFormatting(i), i) for i in correctList])
Now you can compare the strings input by the data entry person. Let's say they are in a list called inputList.
for name in inputList:
    unformattedName = removeFormatting(name)
    lookedUpName = formattingDict.get(unformattedName, "")
    if not lookedUpName:
        print "Spelling mistake:", name
    elif lookedUpName != name:
        print "Formatting error"
        print differences(name, lookedUpName)
The differences function could be filled with rules for brackets, periods, etc.
def differences(inputName, lookedUpName):
    mismatches = []
    # Check for brackets
    if "(" in lookedUpName:
        if "(" not in inputName:
            mismatches.append("Bracket missing")
    ...
    # Add more rules
    return mismatches
Does that answer your question a bit?
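As a rough, self-contained illustration of the dictionary idea in Python 3, the sketch below fills in the elided parts with assumptions of its own: remove_formatting keeps only lowercase letters, and differences applies just two example rules (symbol counts and per-position capitalization). The function names, the rule set and the two sample names are chosen for this example only.

```python
import re

def remove_formatting(name):
    """Reduce a name to lowercase letters only, so formatting differences are ignored."""
    return re.sub(r'[^a-z]+', '', name.lower())

def differences(entered, canonical):
    """List human-readable reasons why entered differs from canonical."""
    mismatches = []
    # Punctuation rule (assumed): compare how often each symbol occurs.
    for symbol in "().&":
        if entered.count(symbol) != canonical.count(symbol):
            mismatches.append("count of %r differs" % symbol)
    # Capitalization rule (assumed): only checked when the text lines up character by character.
    if entered.lower() == canonical.lower():
        for index, (got, expected) in enumerate(zip(entered, canonical)):
            if got != expected:
                mismatches.append("position %d: expected %r, got %r" % (index, expected, got))
    return mismatches

def check_name(entered, lookup):
    """Return a list of problems; an empty list means the name is formatted correctly."""
    canonical = lookup.get(remove_formatting(entered))
    if canonical is None:
        return ["not in the lookup table"]
    return differences(entered, canonical)

correct_list = ['Sonchus arvensis L.', 'Linaria dalmatica (L.) Mill.']
lookup = {remove_formatting(name): name for name in correct_list}

print(check_name('sonchus arvensis L.', lookup))
print(check_name('Linaria dalmatica L. Mill.', lookup))
```

More rules (spacing, ampersands written as "and", and so on) could be added to differences in the same style.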
