Selective merge of two or more data files - python

I have an executable whose input is contained in an ASCII file with format:
$ GENERAL INPUTS
$ PARAM1 = 123.456
PARAM2=456,789,101112
PARAM3(1)=123,456,789
PARAM4 =
1234,5678,91011E2
PARAM5(1,2)='STRING','STRING2'
$ NEW INSTANCE
NEW(1)=.TRUE.
PAR1=123
[More data here]
$ NEW INSTANCE
NEW(2)=.TRUE.
[etcetera]
In other words, some general inputs, and some parameter values for a number of new instances. The declaration of parameters is irregular; some numbers are separated by commas, others are in scientific notation, others are inside quotes, the spacing is not constant, etc.
The evaluation of some scenarios requires that I take the input of one "master" data file and copy the parameter data of, say, instances 2 through 6 to another data file which may already contain data for said instances (in which case data should be overwritten) and possibly others (data which should be left unchanged).
I have written a Flex lexer and a Bison parser; together they can eat a data file and store the parameters in memory. If I use them to open both files (master and "scenario"), it should not be too hard to selectively write to a third, new file the desired parameters (as in "general input from 'scenario'; instances 1 through 5 from 'master'; instances 6 through 9 from 'scenario'; ..."), save it, and delete the original scenario file.
Other information: (1) the files are highly sensitive, it is very important that the user is completely shielded from altering the master file; (2) the files are of manageable size (from 500K to 10M).
I have learned that what I can do in ten lines of code, some fellow here can do in two. How would you approach this problem? A Pythonic answer would make me cry. Seriously.

If you're already able to parse this format (I'd have tried it with pyParsing, but if you already have a working flex/bison solution, that will be just fine), and the parsed data fit well in memory, then you're basically there. You can represent what you read from each file as a simple object with a dict for "general input" and a list of dicts, one per instance (or probably better a dict of instances, with the keys being the instance numbers, which may give you a bit more flexibility). Then, as you mentioned, you just selectively "update" (add or overwrite) some of the instances copied from the master into the scenario, write the new scenario file, and replace the old one with it.
To use the flex/bison code with Python you have several options -- make it into a DLL/so and access it with ctypes, or call it from a cython-coded extension, a SWIG wrapper, a Python C-API extension, or SIP, Boost, etc etc.
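For the ctypes route, a minimal sketch, assuming the parser has been built into a hypothetical libparser.so exposing a parse_file() that returns one tab-separated name/value pair per line (the library name, function and output format are all made up for illustration):
import ctypes

lib = ctypes.CDLL('./libparser.so')        # hypothetical shared library
lib.parse_file.argtypes = [ctypes.c_char_p]
lib.parse_file.restype = ctypes.c_char_p   # e.g. "NAME\tVALUE\n" per parameter

def parse(filename):
    raw = lib.parse_file(filename.encode())
    tuples = []
    for line in raw.decode().splitlines():
        name, _, value = line.partition('\t')
        tuples.append((name, value))
    return tuples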
Suppose that, one way or another, you have a parser primitive that (e.g.) accepts an input filename, reads and parses that file, and returns a list of 2-string tuples, each of which is either of the following:
(paramname, paramvalue)
('$$$$', 'General Inputs')
('$$$$', 'New Instance')
just using '$$$$' as a kind of arbitrary marker. Then for the object representing all that you've read from a file you might have:
import re
instidre = re.compile(r'NEW\((\d+)\)')
class Afile(object):
    def __init__(self, filename):
        self.filename = filename
        self.geninput = dict()
        self.instances = dict()

    def feed_data(self, listoftuples):
        it = iter(listoftuples)
        assert next(it) == ('$$$$', 'General Inputs')
        for name, value in it:
            if name == '$$$$': break
            self.geninput[name] = value
        else:  # no instances at all!
            return
        currinst = dict()
        for name, value in it:
            if name == '$$$$':
                self.finish_inst(currinst)
                currinst = dict()
                continue
            mo = instidre.match(name)
            if mo:
                assert value == '.TRUE.'
                name = '$$$INSTID$$$'
                value = mo.group(1)
            currinst[name] = value
        self.finish_inst(currinst)

    def finish_inst(self, adict):
        instid = adict.pop('$$$INSTID$$$')
        assert instid not in self.instances
        self.instances[instid] = adict
Sanity checking might be improved a bit, diagnosing anomalies more precisely, but net of error cases I think this is roughly what you want.
The merging just requires doing foo.instances[instid] = bar.instances[instid] for the required values of instid, where foo is the Afile instance for the scenario file and bar is the one for the master file -- that will overwrite or add as required.
I'm assuming that to write out the newly changed scenario file you don't need to repeat all the formatting quirks the specific inputs might have (if you do, then those quirks will need to be recorded during parsing together with names and values), so simply looping on sorted(foo.instances) and writing each out also in sorted order (after writing the general stuff also in sorted order, and with appropriate $ this and that marker lines, and with proper translation of the '$$$INSTID$$$' entry, etc) should suffice.
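A rough sketch of that merge-and-rewrite step, reusing the Afile objects above (the output formatting here is simplified, and the exact header lines are assumptions based on the sample input):
def merge_and_write(master, scenario, instids, outname):
    # add or overwrite the requested instances in the scenario
    for instid in instids:
        scenario.instances[instid] = master.instances[instid]
    out = open(outname, 'w')
    out.write('$ GENERAL INPUTS\n')
    for name in sorted(scenario.geninput):
        out.write('%s=%s\n' % (name, scenario.geninput[name]))
    for instid in sorted(scenario.instances, key=int):
        out.write('$ NEW INSTANCE\n')
        out.write('NEW(%s)=.TRUE.\n' % instid)   # translate the '$$$INSTID$$$' entry back
        inst = scenario.instances[instid]
        for name in sorted(inst):
            out.write('%s=%s\n' % (name, inst[name]))
    out.close()

# e.g. instances 2 through 6 from the master (ids are stored as strings):
# merge_and_write(master, scenario, [str(i) for i in range(2, 7)], 'newscenario.dat')
# then replace the old scenario file, e.g. with os.rename('newscenario.dat', scenario.filename)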

Related

Divide and Conquer Lists in Python (to read sav files using pyreadstat)

I am trying to read sav files using pyreadstat in Python, but for some rare scenarios I get a UnicodeDecodeError because a string variable contains special characters.
To handle this, instead of loading the entire variable set I plan to load only the variables that do not produce this error.
Below is the pseudo-code I have so far. It is not very efficient, since I check each item of the list for errors using try/except.
import pyreadstat

# Read only the metadata to get information about the variables
df, meta = pyreadstat.read_sav('Test.sav', metadataonly=True)
column_list = meta.column_names  # all variables are stored in this list
result = []
for var in column_list:
    print(var)
    try:
        df, meta = pyreadstat.read_sav('Test.sav', usecols=[str(var)])
        # If there is no error we can keep this variable
        result.append(var)
    except Exception:
        pass
# This finally loads the sav for the non-error variables only
df, meta = pyreadstat.read_sav('Test.sav', usecols=result)
For a sav file with 1000+ variables this takes a very long time.
I was wondering if there is a way to use a divide-and-conquer approach to do it faster. Below is my suggested approach, but I am not very good at implementing recursive algorithms. Could someone help me with pseudo-code? It would be very helpful.
1. Take the list and try to read the sav file.
2. In case of no error, the output can be stored in result and then we read the sav file.
3. In case of error, split the list into 2 parts and run these again...
4. Step 3 needs to run again until we have lists that do not give any error.
Using this second approach, 90% of my sav files would be loaded on the first pass itself, hence I think recursion is a good method (see the sketch below).
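A minimal sketch of that recursive split, which simply bisects the column list whenever pyreadstat.read_sav raises (untested against a real problem file):
import pyreadstat

def readable_columns(filename, columns):
    """Return the subset of columns that can be read without errors,
    bisecting the list whenever a read fails."""
    if not columns:
        return []
    try:
        pyreadstat.read_sav(filename, usecols=list(columns))
        return list(columns)
    except Exception:
        if len(columns) == 1:
            return []           # a single bad column: drop it
        mid = len(columns) // 2
        return (readable_columns(filename, columns[:mid]) +
                readable_columns(filename, columns[mid:]))

# result = readable_columns('Test.sav', meta.column_names)
# df, meta = pyreadstat.read_sav('Test.sav', usecols=result)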
You can try to reproduce the issue for sav file here
For this specific case I would suggest a different approach: you can give an argument "encoding" to pyreadstat.read_sav to manually set the encoding. If you don't know which one it is, what you can do is iterate over the list of encodings here: https://gist.github.com/hakre/4188459 to find out which one makes sense. For example:
import pyreadstat as p

# here codes is a list with all the encodings in the link mentioned before
for c in codes:
    try:
        df, meta = p.read_sav("Test.sav", encoding=c)
        print(c)
        print(df.head())
    except Exception:
        pass
I did, and there were a few that might make sense, assuming the string is in a non-latin alphabet. However, the most promising one is not in that list: encoding="UTF8" (the list contains UTF-8, with a dash, and that one fails). Using UTF8 (no dash) I get this:
నేను గతంలో వాడిన బ
which according to google translate means "I used to come b" in Telugu. Not sure if that fully makes sense, but it's a way.
The advantage of this approach is that if you find the right encoding, you will not be losing data, and reading the data will be fast. The disadvantage is that you may not find the right encoding.
In case you do not find the right encoding, you will at least be reading the problematic columns very fast, and you can discard them later in pandas by inspecting which character columns do not contain latin characters. This will be much faster than the algorithm you were suggesting.
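For that last step, a possible sketch for flagging character columns that contain non-latin text (the latin-1 range used here is just one reasonable cut-off):
import pandas as pd

def nonlatin_columns(df):
    """Return the names of string (object) columns containing
    characters outside the latin-1 range."""
    bad = []
    for col in df.select_dtypes(include='object').columns:
        values = df[col].dropna().astype(str)
        if values.str.contains(r'[^\x00-\xff]', regex=True).any():
            bad.append(col)
    return bad

# df = df.drop(columns=nonlatin_columns(df))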

Dynamically update complex OrderedDict (based on yaml file)

I'm trying to build a piece of software that will rely on very dynamic configuration (or "ruleset", really). I've tried to capture it in the pseudocode:
"""
---
config:
item1:
thething: ${stages.current.variables.item1}
stages:
dev:
variables:
item1: stuff
prod:
variables:
item1: stuf2
"""
config_obj = yaml.load(config)
current_stage = 'dev'
#Insert artificial "current stage" to allow var matching
stages['current'] = stages[current_stage]
updated_config_obj = replace_vars(config_obj)
The goal is to have updated_config_obj replace all variable references with the actual value, so for this example it should replace ${stages.current.variables.item1} with stuff. The current part is easily solved by copying whatever the current stage is into a current item in the same OrderedDict, but I'm still stumped by how to actually perform the replace. The config yaml can be quite large and is totally dependent on a plugin system, so it must be dynamic.
Right now I'm looking at "walking" the entire object, checking for the existence of a $ on each "leaf" (indicating a variable) and performing a lookup back to the current object to "resolve" the variable, but somehow that seems overly complex. Another alternative is (I guess) to use Jinja2 templating on the "config string", with the parsed object as a lookup. Certainly doable, but it somehow feels a little dirty.
I have the feeling that there should be a more elegant solution which can be done solely on the parsed object (without interacting with the string), but it escapes me.
Any pointers appreciated!
First, my two cents: try to avoid using any form of interpolation in your configuration file. This creates another layer of dependencies - one dependency for your program (the configuration file) and another dependency for your configuration file.
It's a slick solution at the moment, but consider that five years down the road some lowly developer might be staring down ${stages.current.variables.item1} for a month trying to figure out what this is, not understanding that this implicitly maps onto stages.dev. And then worse yet, some other developer comes along, and seeing that the floodgates of interpolation have been opened, they start using {{stages_dev}}, to mean that some value should interpolated from the system's environmental variables. And then some other developer starts using their own convention like {{!stagesdev!}}, which means that the value uses its own custom runtime interpolation, invoked in some obscure, downstream back-alley.
And then some consultant is hired to reverse-engineer the whole thing and now they are sailing the seas of spaghetti.
If you still want to do this, I'd recommend opening/parsing the configuration file into a dictionary (presumably using yaml.load()), then iterating through the whole thing, line-by-line, using regex to find instances of \$\{(.*)\}.
For each captured group, create an ordered list like:
# ["stages", "current", "variables", item1"]
yaml_references = ".".split("stages.current.variables.item1")
Then, you could do something like:
yaml_config_dict = yaml.load(config)  # the parsed configuration file
interpolated_reference = yaml_config_dict
for y in yaml_references:
    interpolated_reference = interpolated_reference[y]
Now interpolated_reference should hold whatever ${stages.current.variables.item1} was pointing to in the context of the .yaml file, and you should be able to do a string replace.
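Going a step further, a minimal sketch of a full walk-and-replace over the parsed object, assuming every reference has the whole-string form ${a.b.c} (the replace_vars name just mirrors the question's pseudocode):
import re

VAR_RE = re.compile(r'\$\{([^}]+)\}')

def lookup(config, dotted_path):
    """Follow a dotted path like 'stages.current.variables.item1'."""
    node = config
    for key in dotted_path.split('.'):
        node = node[key]
    return node

def replace_vars(node, config):
    """Recursively replace ${a.b.c} references in dicts, lists and strings."""
    if isinstance(node, dict):  # also covers OrderedDict; type(node) preserves it
        return type(node)((k, replace_vars(v, config)) for k, v in node.items())
    if isinstance(node, list):
        return [replace_vars(v, config) for v in node]
    if isinstance(node, str):
        return VAR_RE.sub(lambda m: str(lookup(config, m.group(1))), node)
    return node

# stages['current'] must already be set, as in the pseudocode above:
# updated_config_obj = replace_vars(config_obj, config_obj)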

What is the fastest performance tuple for large data sets in python?

Right now, I'm basically running through an excel sheet.
I have about 20 names and then I have 50k total values that match to one of those 20 names, so the excel sheet is 50k rows long, column B showing any random value, and column A showing one of the 20 names.
I'm trying to get a string for each of the names that show all of the values.
Name A: 123,244,123,523,123,5523,12505,142... etc etc.
Name B: 123,244,123,523,123,5523,12505,142... etc etc.
Right now, I build a dictionary while running through the excel sheet: I check if the name is already in the dictionary, and if it is, I do a
strA = strA + "," + foundValue
Then it inserts strA back into the dictionary for that particular name. If the name doesn't exist, it creates that dictionary key and then adds that value to it.
Now, this was working well at first, but it's been about 15 or 20 minutes and only 5k values have been added to the dictionary so far, and it seems to get slower the longer it runs.
I wonder if there is a better or faster way to do this. I was thinking of building a new dictionary every 1k values and then combining them all at the end, but that would be 50 dictionaries total and it sounds complicated... although maybe not. I'm not sure; maybe it could work better that way, but the current approach does not seem to.
I DO need the string that shows each value with a comma between each value. That is why I am doing the string thing right now.
There are a number of things that are likely causing your program to run slowly.
String concatenation in python can be extremely inefficient when used with large strings.
Strings in Python are immutable. This fact frequently sneaks up and bites novice Python programmers on the rump. Immutability confers some advantages and disadvantages. In the plus column, strings can be used as keys in dictionaries and individual copies can be shared among multiple variable bindings. (Python automatically shares one- and two-character strings.) In the minus column, you can't say something like, "change all the 'a's to 'b's" in any given string. Instead, you have to create a new string with the desired properties. This continual copying can lead to significant inefficiencies in Python programs.
Considering each string in your example could contain thousands of characters, each time you do a concatenation, python has to copy that giant string into memory to create a new object.
This would be much more efficient:
strings = []
strings.append('string')
strings.append('other_string')
...
','.join(strings)
In your case, instead of each dictionary key storing a massive string, it should store a list, and you would just append each match to the list, and only at the very end would you do a string concatenation using str.join.
In addition, printing to stdout is also notoriously slow. If you're printing to stdout on each iteration of your massive 50,000 item loop, each iteration is being held up by the unbuffered write to stdout. Consider only printing every nth iteration, or perhaps writing to a file instead (file writes are normally buffered) and then tailing the file from another terminal.
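For example, a minimal way to throttle the progress output (rows and process() are placeholders for your actual loop):
for i, row in enumerate(rows):
    process(row)                       # whatever work you do per row
    if i % 1000 == 0:                  # print progress only every 1000th row
        print('processed %d rows' % i)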
This answer is based on the OP's answer to my comment. I asked what he would do with the dict, suggesting that maybe he doesn't need to build it in the first place. @simon replies:
i add it to an excel sheet, so I take the KEY, which is the name, and put it in A1, then I take the VALUE, which is 1345,345,135,346,3451,35.. etc etc, and put that into A2. then I do the rest of my programming with that information...... but i need those values seperated by commas and acessible inside that excel sheet like that!
So it looks like the dict doesn't have to be built after all. Here is an alternative: for each name, create a file, and store those files in a dict:
files = {}
name = 'John' # let's say
if name not in files:
    files[name] = open(name, 'w')
Then when you loop over the 50k-row excel, you do something like this (pseudo-code):
for row in rows_50k:                   # rows_50k: the 50k rows from the excel sheet
    name, value_string = row.split()   # or whatever
    file = files[name]
    file.write(value_string + ',')     # if it already ends with ',', no need to add
Since your value_string is already comma separated, your file will be csv-like without any further tweaking on your part (except maybe you want to strip the last trailing comma after you're done). Then when you need the values, say, of John, just value = open('John').read().
Now I've never worked with 50k-row excels, but I'd be very surprised if this is not quite a bit faster than what you currently have. Having persistent data is also (well, maybe) a plus.
EDIT:
The above is a memory-oriented solution: it trades speed for lower memory use. Writing to files is much slower than appending to lists (but probably still faster than recreating many large strings). But if the lists are huge (which seems likely) and you run into a memory problem (not saying you will), you can try the file approach.
An alternative, similar to lists in performance (at least for the toy test I tried) is to use StringIO:
from io import StringIO  # Python 2: from StringIO import StringIO
string_ios = {'John': StringIO()}  # a dict to store StringIO objects
for value in ['ab', 'cd', 'ef']:
    string_ios['John'].write(value + ',')
print(string_ios['John'].getvalue())
This will output 'ab,cd,ef,'
Instead of building a string that looks like a list, use an actual list and make the string representation you want out of it when you are done.
The proper way is to collect the pieces in lists and join at the end, but if for some reason you want to stick with strings, you can speed up the in-place string extension: pop the string out of the dict so that there's only one reference to it, which lets CPython's in-place concatenation optimization kick in.
Demo:
>>> timeit('s = d.pop(k); s = s + "y"; d[k] = s', 'k = "x"; d = {k: ""}')
0.8417842664330237
>>> timeit('s = d[k]; s = s + "y"; d[k] = s', 'k = "x"; d = {k: ""}')
294.2475278390723
It depends on how you have read the excel file, but let's say that lines are read as delimiter-separated tuples or something like that:
d = {}
for name, foundValue in line_tuples:
    try:
        d[name].append(foundValue)
    except KeyError:
        d[name] = [foundValue]
d = {k: ",".join(v) for k, v in d.items()}
Alternatively using pandas:
import pandas as pd
df = pd.read_excel("some_excel_file.xlsx")
d = df.groupby("A")["B"].apply(lambda x: ",".join(x)).to_dict()

strange output from yaml.dump

I've just started using yaml and I love it. However, the other day I came across a case that seemed really odd and I am not sure what is causing it. I have a list of file path locations and another list of file path destinations. I create a dictionary out of them and then use yaml to dump it out to read later (I work with artists and use yaml so that it is human readable as well).
sorry for the long lists:
source = ['/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_diff.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskTapeFloor.1051.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskBurnt.1031.exr']
dest = ['/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_diff_diffuse_v0006.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskTapeFloor_diffuse_v0006.1051.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskBurnt_diffuse_v0006.1031.exr']
dictionary = dict(zip(source, dest))
print yaml.dump(dictionary)
this is the output that I get:
{/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_diff.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhaw
k_diff_diffuse_v0006.exr,
/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskBurnt.1031.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v00
06/blackhawk_maskBurnt_diffuse_v0006.1031.exr,
? /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskTapeFloor.1051.exr
: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskTapeFloor_diffuse_v0006.1051.exr}
It loads back in fine with yaml.load, but this is not useful for artists who may need to edit it.
This is the first question in the FAQ.
By default, PyYAML chooses the style of a collection depending on whether it has nested collections. If a collection has nested collections, it will be assigned the block style. Otherwise it will have the flow style.
If you want collections to be always serialized in the block style, set the parameter default_flow_style of dump() to False.
So:
>>> print yaml.dump(dictionary, default_flow_style=False)
/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_diff.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_diff_diffuse_v0006.exr
/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskBurnt.1031.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskBurnt_diffuse_v0006.1031.exr
? /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskTapeFloor.1051.exr
: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskTapeFloor_diffuse_v0006.1051.exr
Still not exactly beautiful, but when you have strings longer than 80 characters as keys, it's about as good as you can reasonably expect.
If you model (part of) the filesystem hierarchy in your object hierarchy, or create aliases (or dynamic aliasers) for parts of the tree, etc., the YAML will look a lot nicer. But that's something you have to actually do at the object-model level; as far as YAML is concerned, those long paths full of repeated prefixes are just strings.
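For instance, a rough sketch of that idea, splitting the source paths into nested dicts before dumping (reusing the dictionary from the question):
import yaml

def to_tree(mapping):
    """Turn {'/a/b/c.exr': dest, ...} into nested dicts keyed by path components."""
    tree = {}
    for src, dst in mapping.items():
        parts = src.strip('/').split('/')
        node = tree
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = dst
    return tree

print(yaml.dump(to_tree(dictionary), default_flow_style=False))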

Python, faster way to read fixed length fields from a file into dictionary

I have a file of names and addresses as follows (example line)
OSCAR ,CANNONS ,8 ,STIEGLITZ CIRCUIT
And I want to read it into a dictionary of name and value. Here self.field_list is a list of the name, length and start point of the fixed fields in the file. What ways are there to speed up this method? (python 2.6)
def line_to_dictionary(self, file_line, rec_num):
    file_line = file_line.lower()  # Make it all lowercase
    return_rec = {}  # Return record as a dictionary
    for (field_start, field_length, field_name) in self.field_list:
        field_data = file_line[field_start:field_start+field_length]
        if self.strip_fields == True:  # Strip off white spaces first
            field_data = field_data.strip()
        if field_data != '':  # Only add non-empty fields to dictionary
            return_rec[field_name] = field_data
    # Set hidden fields
    #
    return_rec['_rec_num_'] = rec_num
    return_rec['_dataset_name_'] = self.name
    return return_rec
struct.unpack() combined with 's' specifiers (each given the field length) will tear the string apart faster than slicing.
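For illustration, a small sketch with made-up field widths of 10, 10, 4 and 20 characters, skipping the commas as pad bytes; the real format string would be built from self.field_list (and in Python 3 the record would have to be bytes):
import struct

record = 'OSCAR     ,CANNONS   ,8   ,STIEGLITZ CIRCUIT   '
# '10s' reads 10 characters, '1x' skips the comma between fields
unpacker = struct.Struct('10s 1x 10s 1x 4s 1x 20s')
fields = [f.strip() for f in unpacker.unpack(record)]
print(fields)   # ['OSCAR', 'CANNONS', '8', 'STIEGLITZ CIRCUIT']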
Edit: Just saw your remark below about commas. The approach below is fast when it comes to file reading, but it is delimiter-based, and would fail in your case. It's useful in other cases, though.
If you want to read the file really fast, you can use a dedicated module, such as the almost standard Numpy:
data = numpy.loadtxt('file_name.txt', dtype=('S10', 'S8'), delimiter=',') # dtype must be adapted to your column sizes
loadtxt() also allows you to process fields on the fly (with the converters argument). Numpy also allows you to give names to columns (see the doc), so that you can do:
data['name'][42] # Name # 42
The structure obtained is like an Excel array; it is quite memory efficient, compared to a dictionary.
If you really need to use a dictionary, you can use a dedicated loop over the data array read quickly by Numpy, in a way similar to what you have done.
If you want to get some speed up, you can also store field_start+field_length directly in self.field_list, instead of storing field_length.
Also, if field_data != '' can more simply be written as if field_data (if this gives any speed up, it is marginal, though).
I would say that your method is quite fast, compared to what standard Python can do (i.e., without using non-standard, dedicated modules).
If your lines include commas like the example, you can use line.split(',') instead of several slices. This may prove to be faster.
You'll want to use the csv module.
It handles not only CSV, but any CSV-like format, which yours seems to be.
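A rough sketch of that, assuming the commas really are field delimiters and the field names come from self.field_list in column order (Python 2 style, matching the question):
import csv

def file_to_dicts(filename, field_names):
    """Yield one dict per line, keyed by field name, with values stripped."""
    with open(filename, 'rb') as f:          # Python 3: open(filename, newline='')
        for rec_num, row in enumerate(csv.reader(f)):
            rec = dict((name, value.strip())
                       for name, value in zip(field_names, row)
                       if value.strip())
            rec['_rec_num_'] = rec_num
            yield rec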
