pythonic method for extracting numeric digits from string

pythonic method for extracting numeric digits from string - python

I am developing a program to read through a CSV file and create a dictionary of information from it. Each line in the CSV is essentially a new dictionary entry with the delimited objects being the values.
As one subpart of task, I need to extract an unknown number of numeric digits from within a string. I have a working version, but it does not seem very pythonic.
An example string looks like this:
variable = Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]
variable is string's name in the python code, and represents the variable name within a MODBUS. I want to extract just the digits prior to the .WORD_type[0] which relate to the number of bytes the string is packed into.
Here is my working code, note this is nested within a for statement iterating through the lines in the CSV. var_length and var_type are some of the keys, i.e. {"var_length": var_length}
if re.search(".+_ST[0-9]{1,2}\\.WORD_type.+", variable):
var_type = "string"
temp = re.split("\\.", variable)
temp = re.split("_", temp[2])
temp = temp[-1]
var_length = int(str.lstrip(temp, "ST")) / 2

You could maybe try using matching groups like so:
import re
variable = "Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]"
matches = re.match(r".+_ST(\d+)\.WORD_type.+", variable)
if matches:
print(matches[1])
matches[0] has the full match and matches[1] contains the matched group.

Related

Python Dictionary Comprehension with Regular Expression Pattern

I'm just practicing basic web scraping using Python and Regex
I want to write a function that takes a string object as input and returns a dictionary where each key is a date as string like '2017-01-23' (without the quotes tho); and each value corresponding is the approval rating, stored as a floating numbers.
Here is what the input object(data) looks like:
As you can see, each record(per day) is denoted by {}, and each key:value pattern followed by ','
{"date":"2017-01-23","future":false,"subgroup":"All polls","approve_estimate":"45.46693",
"approve_hi":"50.88971","approve_lo":"40.04416","disapprove_estimate":"41.26452",
"disapprove_hi":"46.68729","disapprove_lo":"35.84175"},
{"date":"2017-01-24","future":false,"subgroup":"All polls"
...................
Here's a regex pattern for the dates:
date_pattern = r'\d{4}-\d{2}-\d{2}'
Using this,
date_pattern = r'\d{4}-\d{2}-\d{2}'
date_matcher = re.compile(date_pattern)
date_matches = matcher.findall(long_string) #list of all dates in string
But for the actual approval rating value, this wouldn't work because I'm not looking for a match, but the number that comes after this, which is 45.46693 in this example.
approve_pattern = r'approve_estimate\":'
#float(re.sub('[aZ]','',re.sub('["]','',re.split(approve_pattern, data) [1])))
The problem with the approve_pattern is that I can only fetch one value at a time. So how can I do this for the entire data and store the approve rating values as float?
Also, I want to only keep records for which "future":false to discard predicted values, and only keep the values with "future":true.
Please assume all encountered dates have valid approval estimates.
Here's the desired output
date_matches=['2018-01-01','2018-01-02','2018-01-03'] # "future":true filtered out
approve_matches=[47.1,47.2,47.9]
final_dict = {k:v for k,v in zip(date_matches,approve_matches)}
final_dict #Desired Output {'2018-01-01': 47.1, '2018-01-02': 47.2, '2018-01-03': 47.9}

Your data looks very much like JSON, except that it must be enclosed in brackets to form an array. You should use a JSON parser (e.g., json.loads) to read it.
Let's say s is your original string. Then the following expression results in your dictionary:
final_dict = {record['date']: record['approve_estimate']
for record in json.loads("[" + s + "]")
if record['future']}
# Empty in your case

python the list has automatically unwanted special characters \n+

I am reading some data from a dataframe column and I do some manipulation on each value if the value contains a "-". These manipulations include spliting based on the "-". However I do not understand why each value in the list has an "\n*" as for instance
['2010\n1', '200\n2 450\n3', ..., '1239\n1000']
here is a sample of my code:
splited = []
wantedList = []
val = str(x) # x represents the value in the value read from the dataframe column
print val # the val variable does not does not contain those special characters
if val.find('-') != -1:
splited = val.split('-')
wantedList.append(splited[0])
print splited # splited list contains those special characters
print wantedList # wantedList contains those special characters
I guess this has to do with the way I created the list or the way I am appending to it.
Does anyone knows why something like this does happen

There isn't nothing in your code that could possibly automagically add a new line character at some random position within your strings. I'd say the characters are already in the string but print isn't showing as \n but as a new line.
You can confirm that by printing the representation of the string:
print repr(val)
If you want them out of your strings, you can with a simple str.replace for all \n.

Python RegEx String Parsing with inconsistent data

I have a string that I need to extract values out of. The problem is the string is inconsistent. Here's an example of the script that has the string within it.
import re
RAW_Data = "Name Multiple Words Zero Row* (78.59/0) Name Multiple Words2* (96/24.56) Name Multiple Words3* (0/32.45) Name Multiple Words4* (96/12.58) Name Multiple Words5* (96/0) Name Multiple Words Zero Row6* (0) Name Multiple Words7* (96/95.57) Name Multiple Words Zero Row8* (0) Name Multiple Words9*"
First_Num = re.findall(r'\((.*?)\/*', RAW_Data)
Seg_Length = re.findall(r'\/(.*?)\)', RAW_Data)
#WithinParenthesis = re.findall(r'\((.*?)\)', RAW_Data) #This works correctly
print First_Num
print Seg_Length
del RAW_Data
What I need to get out of the string are all values within the parenthesis. However, I need some logic that will handle the absence of the "/" between the numbers. Basically if the "/" doesn't exist make both values for First_Num and Seg_Length equal to "0". I hope this makes sense.

Use a simple regex and add some programming logic:
import re
rx = r'\(([^)]+)\)'
string = """Name Multiple Words Zero Row* (78.59/0) Name Multiple Words2* (96/24.56) Name Multiple Words3* (0/32.45) Name Multiple Words4* (96/12.58) Name Multiple Words5* (96/0) Name Multiple Words Zero Row6* (0) Name Multiple Words7* (96/95.57) Name Multiple Words Zero Row8* (0) Name Multiple Words9*"""
for match in re.finditer(rx, string):
parts = match.group(1).split('/')
First_Num = parts[0]
try:
Seg_Length = parts[1]
except IndexError:
Seg_Length = None
print "First_Num, Seg_Length: ", First_Num, Seg_Length
You might get along with a regex alone solution (e.g. with conditional regex), but this approach is likely to be still understood in three months. See a demo on ideone.com.

You are attempting to find values on each side of '/' that you know may not exist. Pull back to the always known condition for your initial search. Use a Regular Expression to findall of data within parenthesis. Then process these based on if '/' is in the value.

Dynamically Read the Format of a String, Python

I have a lookup table of Scientific Names for plants. I want to use this lookup table to validate other tables where I have a data entry person entering the data. Sometimes they get the formatting of these scientific names wrong, so I am writing a script to try to flag the errors.
There's a very specific way to format each name. For example 'Sonchus arvensis L.' specifically needs to have the S in Sonchus capitalized as well as the L at the end. I have about 1000 different plants and each one is formatted differently. Here's a few more examples:
Linaria dalmatica (L.) Mill.
Knautia arvensis (L.) Coult.
Alliaria petiolata (M. Bieb.) Cavara & Grande
Berteroa incana (L.) DC.
Aegilops cylindrica Host
As you can see, all of these strings are formatted very differently (i.e some letters are capitalized, some aren't, there are brackets sometimes, ampersands, periods, etc)
My question is, is there any way to dynamically read the formatting of each string in the lookup table so that I can compare that to the value the data entry person entered to make sure it is formatted properly? In the script below, I test (first elif) to see if the value is in the lookup table by capitalizing all values in order to make the match work, regardless of formatting. In the next test (second elif) I can sort of test formatting by comparing against the lookup table value for value. This will return unmatched records based on formatting, but it doesn't specifically tell you why the unmatched record returned.
What I perceive to do is, read in the string values in the look up table and somehow dynamically read the formatting of each string, so that I can specifically identify the error (i.e. a letter should be capitalized, where it wasn't)
So far my code snippet looks like this:
# Determine if the field heaidng is in a list I built earlier
if "SCIENTIFIC_NAME" in fieldnames:
# First, Test to see if record is empty
if not row.SCIENTIFIC_NAME:
weedPLineErrors.append("SCIENTIFIC_NAME record is empty")
# Second, Test to see if value is in the lookup table, regardless of formatting.
elif row.SCIENTIFIC_NAME.upper() not in [x.upper() for x in weedScientificTableList]:
weedPLineErrors.append("COMMON_NAME (" + row.SCIENTIFIC_NAME + ")" + " is not in the domain table")
# Third, if the second test is satisfied, we know the value is in the lookup table. We can then test the lookup table again, without capitalizing everything to see if there is an exact match to account for formatting.
elif row.SCIENTIFIC_NAME not in weedScientificTableList:
weedPLineErrors.append("COMMON_NAME (" + row.SCIENTIFIC_NAME + ")" + " is not formatted properly")
else:
pass
I hope my question is clear enough. I looked at string templates, but I don't think it does what I want to do...at least not dynamically. If anyone can point me in a better direction, I am all eyes...but maybe I am way out to lunch on this one.
Thanks,
Mike

To get around the punctuation problem, you can use regular expressions.
>>> import re
>>> def tokenize(s):
... return re.split('[^A-Za-z]+', s) # Split by anything that isn't a letter
...
>>> tokens = tokenize('Alliaria petiolata (M. Bieb.) Cavara & Grande')
>>> tokens
['Alliaria', 'petiolata', 'M', 'Bieb', 'Cavara', 'Grande']
To get around the capitalization problem, you can use
>>> tokens = [s.lower() for s in tokens]
From there, you could rewrite the entry in a standardized format, such as
>>> import string
>>> ## I'm not sure exactly what format you're looking for
>>> first, second, third = [string.capitalize(s) for s in tokens[:3]]
>>> "%s %s (%s)" % (first, second, third)
'Alliaria Petiolata (M)'
This probably isn't the exact formatting that you want, but maybe that will get you headed in the right direction.

You can build a dictionary of the names from the lookup table. Assuming that you have the names stored in a list (call it correctList), you can write a function which removes all formatting and maybe lowers or uppers the case and store the result in a dictionary. For example following is a sample code to build the dictionary
def removeFormatting(name):
name = name.replace("(", "").replace(")", "")
name = name.replace(".", "")
...
return name.lower()
formattingDict = dict([(removeFormatting(i), i) for i in correctList])
Now you can compare the strings input by the data entry person. Lets say it is in a list called inputList.
for name in inputList:
unformattedName = removeFormatting(name)
lookedUpName = formattingDict.get(unformattedName, "")
if not lookedUpName:
print "Spelling mistake:", name
elif lookedUpName != name:
print "Formatting error"
print differences(name, lookedUpName)
The differences function could be stuffed with some rules like brackets, "."s etc
def differences(inputName, lookedUpName):
mismatches = []
# Check for brackets
if "(" in lookedUpName:
if "(" not in inputName:
mismatches.append("Bracket missing")
...
# Add more rules
return mismatches
Does that answer your question a bit?

Regular Expression Parsing Key Value Pairs in Namelist Input File

I have an input file which is in a Fortran "namelist" format which I would like to parse with python regular expressions. Easiest way to demonstrate is with a ficticious example:
$VEHICLES
CARS= 1,
TRUCKS = 0,
PLAINS= 0, TRAINS = 0,
LIB='AUTO.DAT',
C This is a comment
C Data variable spans multiple lines
DATA=1.2,2.34,3.12,
4.56E-2,6.78,
$END
$PLOTTING
PLOT=T,
PLOT(2)=12,
$END
So the keys can contain regular variable-name characters as well as parenthesis and numbers. The values can be strings, boolean (T, F, .T., .F., TRUE, FALSE, .TRUE., .FALSE. are all possible), integers, floating-point numbers, or comma-separated lists of numbers. Keys are connected to their values with equal signs. Key-Value pairs are separated by commas, but can share a line. Values can span multiple lines for long lists of numbers. Comments are any line beginning with a C. There is generally inconsistent spacing before and after '=' and ','.
I have come up with a working regular expression for parsing the keys and values and getting them into an Ordered Dictionary (need to preserve order of inputs).
Here's my code so far. I've included everything from reading the file to saving to a dictionary for thoroughness.
import re
from collections import OrderedDict
f=open('file.dat','r')
file_str=f.read()
#Compile regex pattern for requested namelist
name='Vehicles'
p_namelist = re.compile(r"\$"+name.upper()+"(.*?)\$END",flags=re.DOTALL|re.MULTILINE)
#Execute regex on file string and get a list of captured tokens
m_namelist = p_namelist.findall(file_str)
#Check for a valid result
if m_namelist:
#The text of the desired namelist is the first captured token
namelist=m_namelist[0]
#Split into lines
lines=namelist.splitlines()
#List comprehension which returns the list of lines that do not start with "C"
#Effectively remove comment lines
lines = [item for item in lines if not item.startswith("C")]
#Re-combine now that comment lines are removed
namelist='\n'.join(lines)
#Create key-value parsing regex
p_item = re.compile(r"([^\s,\=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,\=]+\s*\=|$)",flags=re.DOTALL|re.MULTILINE)
#Execute regex
items = p_item.findall(namelist)
#Initialize namelist ordered dictionary
n = OrderedDict()
#Remove undesired characters from value
for item in items:
n[item[0]] = item[1].strip(',\r\n ')
My question is whether I'm going about this correctly. I realize there is a ConfigParser library, which I have not yet attempted. My focus here is the regular expression:
([^\s,\=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,\=]+\s*\=|$)
but I went ahead and included the other code for thoroughness and to demonstrate what I'm doing with it. For my Regular Expression, because the values can contain commas, and the key-value pairs are also separated by commas, there is no simple way to isolate the pairs. I chose to use a forward look-ahead to find the next key and "=". This allows everything between the "=" and the next key to be the value. Finally, because this doesn't work for the last pair, I threw in "|$" into the forward look-ahead meaning that if another "VALUE=" isn't found, look for the end of the string. I figured matching the value with [^=]+ followed by a look-ahead was better than trying to match all possible value types.
While writing this question I came up with an alternative Regular Expresson that takes advantage of the fact that numbers are the only value that can be in lists:
([^\s,\=]+?)\s*=\s*((?:\s*\d[\d\.\E\+\-]*\s*,){2,}|[^=,]+)
This one matches either a list of 2 or more numbers with (?:\s*\d[\d\.\E\+\-]*\s*,){2,} or anything before the next comma with [^=,].
Are these somewhat messy Regular Expressions the best way to parse a file like this?

I would suggest to develop little more sophisticated parser.
I stumble upon the project on google code hosting that implements very similar parser functionality: Fortran Namelist parser for Python prog/scripts but it was build for little different format.
I played with it a little and updated it to support structure of the format in your example:
Please see my version on gist:
Updated Fortran Namelist parser for python https://gist.github.com/4506282
I hope this parser will help you with your project.
Here is example output produced by the script after parsing FORTRAN code example:
{'PLOTTING':
{'par':
[OrderedDict([('PLOT', ['T']), ('PLOT(2) =', ['12'])])],
'raw': ['PLOT=T', 'PLOT(2)=12']},
'VEHICLES':
{'par':
[OrderedDict([('TRUCKS', ['0']), ('PLAINS', ['0']), ('TRAINS', ['0']), ('LIB', ['AUTO.DAT']), ('DATA', ['1.2', '2.34', '3.12', '4.56E-2', '6.78'])])],
'raw':
['TRUCKS = 0',
'PLAINS= 0, TRAINS = 0',
"LIB='AUTO.DAT'",
'DATA=1.2,2.34,3.12',
'4.56E-2,6.78']}}

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.