I'm just practicing basic web scraping using Python and Regex
I want to write a function that takes a string as input and returns a dictionary where each key is a date string like '2017-01-23' (without the quotes) and each corresponding value is the approval rating, stored as a float.
Here is what the input object (data) looks like:
As you can see, each record (one per day) is enclosed in {}, and each key:value pair is followed by ','
{"date":"2017-01-23","future":false,"subgroup":"All polls","approve_estimate":"45.46693",
"approve_hi":"50.88971","approve_lo":"40.04416","disapprove_estimate":"41.26452",
"disapprove_hi":"46.68729","disapprove_lo":"35.84175"},
{"date":"2017-01-24","future":false,"subgroup":"All polls"
...................
Here's a regex pattern for the dates:
date_pattern = r'\d{4}-\d{2}-\d{2}'
Using this,
date_pattern = r'\d{4}-\d{2}-\d{2}'
date_matcher = re.compile(date_pattern)
date_matches = date_matcher.findall(long_string)  # list of all dates in the string
But this approach won't work for the actual approval rating, because I'm not looking for the pattern itself but for the number that comes after it, which is 45.46693 in this example.
approve_pattern = r'approve_estimate\":'
#float(re.sub('[aZ]','',re.sub('["]','',re.split(approve_pattern, data) [1])))
The problem with approve_pattern is that I can only fetch one value at a time. So how can I do this for the entire data set and store the approval rating values as floats?
Also, I want to keep only the records where "future":false, discarding the predicted values where "future":true.
Please assume all encountered dates have valid approval estimates.
Here's the desired output:
date_matches=['2018-01-01','2018-01-02','2018-01-03'] # "future":true filtered out
approve_matches=[47.1,47.2,47.9]
final_dict = {k:v for k,v in zip(date_matches,approve_matches)}
final_dict #Desired Output {'2018-01-01': 47.1, '2018-01-02': 47.2, '2018-01-03': 47.9}
Your data looks very much like JSON, except that it must be enclosed in brackets to form an array. You should use a JSON parser (e.g., json.loads) to read it.
Let's say s is your original string. Then the following expression results in your dictionary:
final_dict = {record['date']: float(record['approve_estimate'])
              for record in json.loads("[" + s + "]")
              if not record['future']}
# Keeps only the non-predicted records and stores the estimates as floats
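For a concrete check, here is the same approach run end-to-end on a two-record snippet shaped like the question's data (the values are made up for illustration), filtering out the predicted records as the desired output shows:

```python
import json

# Two records in the question's format: one real, one predicted
s = ('{"date":"2017-01-23","future":false,"approve_estimate":"45.46693"},'
     '{"date":"2017-01-24","future":true,"approve_estimate":"46.5"}')

# Wrap in brackets so the records form a JSON array, then parse
records = json.loads("[" + s + "]")

# Keep only the non-predicted records; convert each estimate to float
final_dict = {r['date']: float(r['approve_estimate'])
              for r in records
              if not r['future']}

print(final_dict)  # {'2017-01-23': 45.46693}
```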
I am developing a program to read through a CSV file and create a dictionary of information from it. Each line in the CSV is essentially a new dictionary entry with the delimited objects being the values.
As one subpart of the task, I need to extract an unknown number of numeric digits from within a string. I have a working version, but it does not seem very pythonic.
An example string looks like this:
variable = Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]
variable is the string's name in the Python code, and it represents the variable name within a MODBUS. I want to extract just the digits prior to .WORD_type[0], which relate to the number of bytes the string is packed into.
Here is my working code, note this is nested within a for statement iterating through the lines in the CSV. var_length and var_type are some of the keys, i.e. {"var_length": var_length}
if re.search(r".+_ST[0-9]{1,2}\.WORD_type.+", variable):
    var_type = "string"
    temp = re.split(r"\.", variable)
    temp = re.split("_", temp[2])
    temp = temp[-1]
    var_length = int(temp.lstrip("ST")) // 2
You could maybe try using matching groups like so:
import re
variable = "Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]"
matches = re.match(r".+_ST(\d+)\.WORD_type.+", variable)
if matches:
    print(matches[1])
matches[0] has the full match and matches[1] contains the matched group (equivalently matches.group(0) and matches.group(1); subscripting match objects requires Python 3.6+).
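Folding that back into the question's length calculation might look like this (the halving by 2 is taken from the original code; the variable names are illustrative):

```python
import re

variable = "Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]"

# One regex pass: capture the digits between "_ST" and ".WORD_type"
match = re.match(r".+_ST(\d+)\.WORD_type.+", variable)
if match:
    var_type = "string"
    var_length = int(match.group(1)) // 2  # halved, as in the original code
    print(var_type, var_length)  # string 6
```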
I receive an input string having values expressed in two possible formats. E.g.:
#short format
data = '"interval":19'
>>> "interval":19
#extended format
data = '"interval":{"t0":19,"tf":19}'
>>> "interval":{"t0":19,"tf":19}
I would like to check whether a short format is used and, in case, make it extended.
Considering that the string could be composed of multiple values, i.e.
data = '"interval":19,"interval2":{"t0":10,"tf":15}'
>>> "interval":19,"interval2":{"t0":10,"tf":15}
I cannot just say:
if ":{" not in data:
#then short format is used
I would like to code something like:
if ":$(a general int/float/double number)" in data:
#extract the number
#replace ":{number}" with the extended format
I know how to code the replacing part.
I need help implementing the if condition: in my mind, I model it as a variable substring, in which the variable part is the number, while the rigid part is the $(value name) + ":" format.
"some_value":19
^            ^
rigid format variable part
EDIT - WHY NOT PARSE IT?
I know the string is "JSON-friendly" and I can convert it into a dictionary, easily accessing then the values.
Indeed, I already have this solution in my code. But I don't like it, since the input string could be multilevel and I would need to iterate over the leaf values of the resulting dictionary independently of the nesting depth, which is not a simple thing to do.
So I was wondering whether a way to act directly on the string exists.
If you replace the value of every key except t0 and tf when it is followed by a bare number, it should work.
Here is an example on a multilevel string (the code could probably be put in better shape):
import re
s = '"interval": 19,"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval":23}}'
gex = r'("(?!(t0|tf)")\w+":)\s*(\d+)'
new_s = re.sub(gex, r'\1 {"t0": \3, "tf": \3}', s)
print(new_s)
"interval": {"t0": 19, "tf": 19},"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval": {"t0": 23, "tf": 23}}}
You could use a regular expression. ("interval":)(\d+) will look for the string '"interval":' followed by any number of digits.
Let's test this
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'xxx', data)
print(result)
# -> xxx,"interval2":{"t0":10,"tf":15},xxx
We see that we found the correct places. Now we're going to create your target format. Here the matched groups come in handy. In the regular expression ("interval":) is group 1, (\d+) is group 2.
Now we use the content of those groups to create your wanted result.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"interval":{"t0":25,"tf":25}
If there are floating point values involved you'll have to change (\d+) to ([.\d]+).
If you want any Unicode standard word characters and not only interval you can use the special sequence \w and because it could be multiple characters the expression will be \w+.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'("\w+":)([.\d]+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":{"t0":10,"tf":10},"tf":{"t0":15,"tf":15}},"Monty":{"t0":25.4,"tf":25.4}
Dang! Yes, we found "Monty" but now the values from the second part are found too. We'll have to fix this somehow. Let's see. We don't want ("\w+") if it's preceded by { so we're going to use a negative lookbehind assertion: (?<!{)("\w+"). And after the number part (\d+) we don't want a } or another digit, so we're using a negative lookahead assertion here: ([.\d]+)(?!})(?!\d).
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'(?<!{)("\w+":)([.\d]+)(?!})(?!\d)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}
Hooray, it works!
Regular expressions are powerful and fun, but if you start to add more constraints this might become unmanageable.
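For comparison, here is a sketch of the parse-then-walk alternative the question set aside: after json.loads, a small recursive helper can extend every bare numeric value regardless of nesting depth (this assumes the string is valid JSON once wrapped in braces):

```python
import json

def extend(node):
    """Recursively replace bare numbers with {"t0": n, "tf": n}."""
    if isinstance(node, dict):
        # Leave already-extended {"t0": ..., "tf": ...} dicts untouched
        return {k: (v if isinstance(v, dict) and set(v) == {"t0", "tf"}
                    else extend(v))
                for k, v in node.items()}
    if isinstance(node, (int, float)):
        return {"t0": node, "tf": node}
    return node

data = '"interval":19,"interval2":{"t0":10,"tf":15}'
result = extend(json.loads("{" + data + "}"))
print(result)  # {'interval': {'t0': 19, 'tf': 19}, 'interval2': {'t0': 10, 'tf': 15}}
```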
I want to remove the special character '-' from a date in Python. I have retrieved the maximum date from a database column.
Here is my small code:
def max_date():
    Max_Date = hive_select('SELECT MAX (t_date) FROM ovisto.ovisto_log')
    value = Max_Date[0]
    print(value)
Here is output:
{0: '2017-02-21', '_c0': '2017-02-21'}
I want only the numbers from the output, without the special character '-'.
So I am expecting this answer: '20170221'
I have tried different ways but could not get a proper answer.
How can I do this in a simple way? Thanks for your time.
Just rebuild a new dictionary using a dict comprehension, iterating over the original dictionary and stripping the unwanted characters from the values with str.replace:
d = {0: '2017-02-21', '_c0': '2017-02-21'}
new_d = {k:v.replace("-","") for k,v in d.items()}
print(new_d)
result:
{0: '20170221', '_c0': '20170221'}
if you only want to keep the values and drop the duplicates (and the order too :), use a set comprehension with the values instead:
s = {v.replace("-", "") for v in d.values()}  # -> {'20170221'}
You can try strptime, parsing the date in its original format and re-formatting it without the dashes:
import datetime
value = Max_Date[0]['_c0']  # the date string, e.g. '2017-02-21'
new_val = datetime.datetime.strptime(value, '%Y-%m-%d').strftime('%Y%m%d')  # -> '20170221'
I found it here: How to convert a date string to different format
I have a lookup table of Scientific Names for plants. I want to use this lookup table to validate other tables where I have a data entry person entering the data. Sometimes they get the formatting of these scientific names wrong, so I am writing a script to try to flag the errors.
There's a very specific way to format each name. For example 'Sonchus arvensis L.' specifically needs to have the S in Sonchus capitalized as well as the L at the end. I have about 1000 different plants and each one is formatted differently. Here's a few more examples:
Linaria dalmatica (L.) Mill.
Knautia arvensis (L.) Coult.
Alliaria petiolata (M. Bieb.) Cavara & Grande
Berteroa incana (L.) DC.
Aegilops cylindrica Host
As you can see, all of these strings are formatted very differently (i.e some letters are capitalized, some aren't, there are brackets sometimes, ampersands, periods, etc)
My question is, is there any way to dynamically read the formatting of each string in the lookup table so that I can compare that to the value the data entry person entered to make sure it is formatted properly? In the script below, I test (first elif) to see if the value is in the lookup table by capitalizing all values in order to make the match work, regardless of formatting. In the next test (second elif) I can sort of test formatting by comparing against the lookup table value for value. This will return unmatched records based on formatting, but it doesn't specifically tell you why the unmatched record returned.
What I'd like to do is read in the string values in the lookup table and somehow dynamically read the formatting of each string, so that I can specifically identify the error (i.e. a letter that should be capitalized but wasn't).
So far my code snippet looks like this:
# Determine if the field heading is in a list I built earlier
if "SCIENTIFIC_NAME" in fieldnames:
    # First, test to see if the record is empty
    if not row.SCIENTIFIC_NAME:
        weedPLineErrors.append("SCIENTIFIC_NAME record is empty")
    # Second, test to see if the value is in the lookup table, regardless of formatting.
    elif row.SCIENTIFIC_NAME.upper() not in [x.upper() for x in weedScientificTableList]:
        weedPLineErrors.append("SCIENTIFIC_NAME (" + row.SCIENTIFIC_NAME + ")" + " is not in the domain table")
    # Third, if the second test is satisfied, we know the value is in the lookup table. We can then test the lookup table again, without capitalizing everything, to see if there is an exact match that accounts for formatting.
    elif row.SCIENTIFIC_NAME not in weedScientificTableList:
        weedPLineErrors.append("SCIENTIFIC_NAME (" + row.SCIENTIFIC_NAME + ")" + " is not formatted properly")
    else:
        pass
I hope my question is clear enough. I looked at string templates, but I don't think it does what I want to do...at least not dynamically. If anyone can point me in a better direction, I am all eyes...but maybe I am way out to lunch on this one.
Thanks,
Mike
To get around the punctuation problem, you can use regular expressions.
>>> import re
>>> def tokenize(s):
... return re.split('[^A-Za-z]+', s) # Split by anything that isn't a letter
...
>>> tokens = tokenize('Alliaria petiolata (M. Bieb.) Cavara & Grande')
>>> tokens
['Alliaria', 'petiolata', 'M', 'Bieb', 'Cavara', 'Grande']
To get around the capitalization problem, you can use
>>> tokens = [s.lower() for s in tokens]
From there, you could rewrite the entry in a standardized format, such as
>>> ## I'm not sure exactly what format you're looking for
>>> first, second, third = [s.capitalize() for s in tokens[:3]]
>>> "%s %s (%s)" % (first, second, third)
'Alliaria Petiolata (M)'
This probably isn't the exact formatting that you want, but maybe that will get you headed in the right direction.
You can build a dictionary of the names from the lookup table. Assuming that you have the names stored in a list (call it correctList), you can write a function which removes all formatting and maybe lowers or uppers the case and store the result in a dictionary. For example following is a sample code to build the dictionary
def removeFormatting(name):
    name = name.replace("(", "").replace(")", "")
    name = name.replace(".", "")
    ...
    return name.lower()

formattingDict = dict([(removeFormatting(i), i) for i in correctList])
Now you can compare the strings input by the data entry person. Let's say they are in a list called inputList.
for name in inputList:
    unformattedName = removeFormatting(name)
    lookedUpName = formattingDict.get(unformattedName, "")
    if not lookedUpName:
        print("Spelling mistake:", name)
    elif lookedUpName != name:
        print("Formatting error")
        print(differences(name, lookedUpName))
The differences function could be stuffed with some rules like brackets, "."s etc
def differences(inputName, lookedUpName):
    mismatches = []
    # Check for brackets
    if "(" in lookedUpName:
        if "(" not in inputName:
            mismatches.append("Bracket missing")
    ...
    # Add more rules
    return mismatches
Does that answer your question a bit?
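A minimal end-to-end run of this idea, using two of the question's example names (the stripping rules here are just parentheses, periods, and ampersands; extend as needed):

```python
def remove_formatting(name):
    # Strip punctuation and case so only the letters and spaces are compared
    for ch in "().&":
        name = name.replace(ch, "")
    return name.lower()

correct_list = ["Linaria dalmatica (L.) Mill.", "Aegilops cylindrica Host"]
formatting_dict = {remove_formatting(n): n for n in correct_list}

entered = "Linaria Dalmatica (L.) Mill."  # wrongly capitalized 'D'
looked_up = formatting_dict.get(remove_formatting(entered), "")
if not looked_up:
    print("Spelling mistake:", entered)
elif looked_up != entered:
    print("Formatting error; expected:", looked_up)
```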
I have an input file in a Fortran "namelist" format which I would like to parse with Python regular expressions. The easiest way to demonstrate is with a fictitious example:
$VEHICLES
CARS= 1,
TRUCKS = 0,
PLAINS= 0, TRAINS = 0,
LIB='AUTO.DAT',
C This is a comment
C Data variable spans multiple lines
DATA=1.2,2.34,3.12,
4.56E-2,6.78,
$END
$PLOTTING
PLOT=T,
PLOT(2)=12,
$END
So the keys can contain regular variable-name characters as well as parenthesis and numbers. The values can be strings, boolean (T, F, .T., .F., TRUE, FALSE, .TRUE., .FALSE. are all possible), integers, floating-point numbers, or comma-separated lists of numbers. Keys are connected to their values with equal signs. Key-Value pairs are separated by commas, but can share a line. Values can span multiple lines for long lists of numbers. Comments are any line beginning with a C. There is generally inconsistent spacing before and after '=' and ','.
I have come up with a working regular expression for parsing the keys and values and getting them into an Ordered Dictionary (need to preserve order of inputs).
Here's my code so far. I've included everything from reading the file to saving to a dictionary for thoroughness.
import re
from collections import OrderedDict

f = open('file.dat', 'r')
file_str = f.read()

# Compile regex pattern for requested namelist
name = 'Vehicles'
p_namelist = re.compile(r"\$" + name.upper() + r"(.*?)\$END", flags=re.DOTALL | re.MULTILINE)

# Execute regex on file string and get a list of captured tokens
m_namelist = p_namelist.findall(file_str)

# Check for a valid result
if m_namelist:
    # The text of the desired namelist is the first captured token
    namelist = m_namelist[0]
    # Split into lines
    lines = namelist.splitlines()
    # List comprehension which returns the list of lines that do not start with "C"
    # (effectively removing comment lines)
    lines = [item for item in lines if not item.startswith("C")]
    # Re-combine now that comment lines are removed
    namelist = '\n'.join(lines)
    # Create key-value parsing regex
    p_item = re.compile(r"([^\s,\=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,\=]+\s*\=|$)", flags=re.DOTALL | re.MULTILINE)
    # Execute regex
    items = p_item.findall(namelist)
    # Initialize namelist ordered dictionary
    n = OrderedDict()
    # Remove undesired characters from each value
    for item in items:
        n[item[0]] = item[1].strip(',\r\n ')
My question is whether I'm going about this correctly. I realize there is a ConfigParser library, which I have not yet attempted. My focus here is the regular expression:
([^\s,\=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,\=]+\s*\=|$)
but I went ahead and included the other code for thoroughness and to demonstrate what I'm doing with it. For my Regular Expression, because the values can contain commas, and the key-value pairs are also separated by commas, there is no simple way to isolate the pairs. I chose to use a forward look-ahead to find the next key and "=". This allows everything between the "=" and the next key to be the value. Finally, because this doesn't work for the last pair, I threw in "|$" into the forward look-ahead meaning that if another "VALUE=" isn't found, look for the end of the string. I figured matching the value with [^=]+ followed by a look-ahead was better than trying to match all possible value types.
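To sanity-check that look-ahead, here is the pattern applied to a three-pair fragment of the example namelist:

```python
import re

p_item = re.compile(r"([^\s,\=]+?)\s*=\s*([^=]+)(?=[\s,][^\s,\=]+\s*\=|$)",
                    flags=re.DOTALL | re.MULTILINE)

fragment = "CARS= 1,\nTRUCKS = 0,\nLIB='AUTO.DAT',"
pairs = [(k, v.strip(',\r\n ')) for k, v in p_item.findall(fragment)]
print(pairs)  # [('CARS', '1'), ('TRUCKS', '0'), ('LIB', "'AUTO.DAT'")]
```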
While writing this question I came up with an alternative Regular Expresson that takes advantage of the fact that numbers are the only value that can be in lists:
([^\s,\=]+?)\s*=\s*((?:\s*\d[\d\.\E\+\-]*\s*,){2,}|[^=,]+)
This one matches either a list of 2 or more numbers with (?:\s*\d[\d\.\E\+\-]*\s*,){2,} or anything before the next comma with [^=,]+.
Are these somewhat messy Regular Expressions the best way to parse a file like this?
I would suggest developing a slightly more sophisticated parser.
I stumbled upon a project on Google Code hosting that implements very similar parser functionality: Fortran Namelist parser for Python prog/scripts, but it was built for a slightly different format.
I played with it a little and updated it to support the structure of the format in your example.
Please see my version on gist:
Updated Fortran Namelist parser for python https://gist.github.com/4506282
I hope this parser will help you with your project.
Here is the example output produced by the script after parsing the FORTRAN example above:
{'PLOTTING':
{'par':
[OrderedDict([('PLOT', ['T']), ('PLOT(2) =', ['12'])])],
'raw': ['PLOT=T', 'PLOT(2)=12']},
'VEHICLES':
{'par':
[OrderedDict([('TRUCKS', ['0']), ('PLAINS', ['0']), ('TRAINS', ['0']), ('LIB', ['AUTO.DAT']), ('DATA', ['1.2', '2.34', '3.12', '4.56E-2', '6.78'])])],
'raw':
['TRUCKS = 0',
'PLAINS= 0, TRAINS = 0',
"LIB='AUTO.DAT'",
'DATA=1.2,2.34,3.12',
'4.56E-2,6.78']}}