Python Generic Data Engine

I have been working with Python for about 1.5 years and am looking for some direction. This is the first time I can't find what I need after doing a lot of searching, so I must be missing something, most likely searching the wrong terms.
Problem: I am working on an app that has many processes (could be hundreds or even thousands). Each process may have a unique input and output data format: multi-line strings, comma-separated strings, Excel or CSV with or without varying headers, and many others. I need something that will format the input correctly and handle the output based on the process, and new processes also need to be easy to add/define. I am open to whatever the best approach is, but my thought is to use a database that stores the template/data definition and use that to know the format for a given process. However, I'm struggling to come up with exactly how to do that, or whether this is really the best approach, and it needs to be a scalable solution. Any direction would be appreciated. Thank you.
A couple of simple examples of data:
Process 1 example data (multi-line string with header)
Input of
['ABC123', 'XYZ453', 'CDE987']
and the resulting data input below would be created:
Barcode
ABC123
XYZ453
CDE987
The code below works, but it is not reusable for example 2.
barcodes = ['ABC123', 'XYZ453', 'CDE987']
data = "Barcode\r\n"
for barcode in barcodes:
    data = data + barcode + '\r\n'
Process 2 example input template (comma-separated with header):
Barcode,Location,Param1,Param2
Item1,L1,11,A
Item1,L1,22,B
Item2,L1,33,C
Item2,L2,44,F
Item3,L2,55,B
Item3,L2,66,P
Process 2 example resulting input data (comma-separated with header):
Input of
{'Barcode':['ABC123', 'XYZ453', 'CDE987', 'FGH487', 'YTR123'], 'Location':['Shelf1', 'Shelf2']}
and using the template to create the input data below:
Barcode,Location,Param1,Param2
ABC123,Shelf1,11,A
ABC123,Shelf1,22,B
XYZ453,Shelf1,33,C
XYZ453,Shelf2,44,F
CDE987,Shelf2,55,B
CDE987,Shelf2,66,P
FGH487,Shelf1,11,A
FGH487,Shelf1,22,B
YTR123,Shelf1,33,C
YTR123,Shelf2,44,F
I know how to handle each process with a hardcoded loop/dataframe merge, etc., and I've done some abstraction in other cases with dicts. However, how to define/store formats that vary so much, and how to create reusable abstracted code, is where I am stuck.

Maybe you can have each function return a ("datatype", "output") pair, where "output" holds the actual output.
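As a rough illustration of the template/registry idea from the question, here is a minimal sketch, assuming each process is registered under a name together with a formatter callable. The names FORMATTERS, register, build_input and the simplified process 2 handling are made up for this example; the registry contents could equally live in a database table.

import csv
import io
from itertools import cycle

# Hypothetical registry mapping a process name to a formatter callable.
FORMATTERS = {}

def register(name):
    # Register a formatter function under a process name.
    def wrap(func):
        FORMATTERS[name] = func
        return func
    return wrap

@register('process1')
def format_barcode_list(data):
    # Process 1: multi-line string with a "Barcode" header.
    return "Barcode\r\n" + "".join(b + "\r\n" for b in data['Barcode'])

@register('process2')
def fill_csv_template(data, template):
    # Process 2 (simplified): substitute barcodes and locations into the
    # template rows; here one template row is emitted per barcode.
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(template['header'])
    locations = cycle(data['Location'])
    for barcode, params in zip(data['Barcode'], cycle(template['rows'])):
        writer.writerow([barcode, next(locations), *params])
    return out.getvalue()

def build_input(process_name, data, **kwargs):
    # Look up the formatter for a process and apply it to the input data.
    return FORMATTERS[process_name](data, **kwargs)

print(build_input('process1', {'Barcode': ['ABC123', 'XYZ453', 'CDE987']}))

With this layout, adding a new process only means writing one new formatter (or one new row in a template table), without touching the calling code.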

Related

Divide and Conquer Lists in Python (to read sav files using pyreadstat)

I am trying to read sav files using pyreadstat in Python, but in some rare scenarios I get a UnicodeDecodeError because a string variable has special characters.
To handle this, I think that instead of loading the entire variable set I will load only the variables that do not produce this error.
Below is the pseudo-code I have. It is not very efficient, since I check each item of the list for errors using try/except.
import pyreadstat

# Read only the metadata to get information about the variables
df, meta = pyreadstat.read_sav('Test.sav', metadataonly=True)
variables = meta.column_names  # all variable names
result = []
for var in variables:
    print(var)
    try:
        df, meta = pyreadstat.read_sav('Test.sav', usecols=[str(var)])
        # If there is no error we can keep this variable
        result.append(var)
    except Exception:
        pass
# Finally load the sav file with only the non-erroring variables
df, meta = pyreadstat.read_sav('Test.sav', usecols=result)
For a sav file with 1000+ variables this takes a long time to process.
I was wondering if there is a way to use a divide-and-conquer approach and do it faster. Below is my suggested approach, but I am not very good at implementing recursive algorithms. If someone could help me with pseudo-code it would be very helpful.
1. Take the full list of variables and try to read the sav file.
2. If there is no error, the output can be stored in result and then we read the sav file.
3. If there is an error, split the list into 2 parts and run these again.
4. Step 3 needs to run again until we have lists that do not give any error.
Using this second approach 90% of my sav files will get loaded on the first pass itself, hence I think recursion is a good method.
You can try to reproduce the issue for sav file here
For this specific case I would suggest a different approach: you can pass an "encoding" argument to pyreadstat.read_sav to set the encoding manually. If you don't know which one it is, you can iterate over the list of encodings here: https://gist.github.com/hakre/4188459 to find out which one makes sense. For example:
# here codes is a list with all the encodings in the link mentioned before
import pyreadstat as p

for c in codes:
    try:
        df, meta = p.read_sav("Test.sav", encoding=c)
        print(c)
        print(df.head())
    except Exception:
        pass
I did this, and there were a few that may potentially make sense, assuming the string is in a non-Latin alphabet. However, the most promising one is not in the list: encoding="UTF8" (the list contains UTF-8, with a dash, and that fails). Using UTF8 (no dash) I get this:
నేను గతంలో వాడిన బ
which according to Google Translate means "I used to come b" in Telugu. Not sure if that fully makes sense, but it's a start.
The advantage of this approach is that if you find the right encoding you will not be losing data, and reading the data will be fast. The disadvantage is that you may not find the right encoding.
Even if you don't find the right encoding, you would still be reading the problematic columns very quickly, and you can discard them later in pandas by inspecting which string columns do not contain Latin characters. This will be much faster than the algorithm you were suggesting.
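If you still want the divide-and-conquer fallback described in the question, a minimal untested sketch could look like this; it assumes the failures show up as UnicodeDecodeError, so widen the except clause if they do not.

import pyreadstat

def readable_columns(path, columns):
    # Recursively find the columns that can be read without a decode error.
    if not columns:
        return []
    try:
        pyreadstat.read_sav(path, usecols=list(columns))
        return list(columns)  # the whole chunk reads fine
    except UnicodeDecodeError:
        if len(columns) == 1:
            return []         # a single bad column: drop it
        mid = len(columns) // 2
        return (readable_columns(path, columns[:mid]) +
                readable_columns(path, columns[mid:]))

df, meta = pyreadstat.read_sav('Test.sav', metadataonly=True)
good = readable_columns('Test.sav', meta.column_names)
df, meta = pyreadstat.read_sav('Test.sav', usecols=good)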

Loop over BIG json file without loading the file in advance - Python [duplicate]

I would like to load the items of my json file one by one. The file could be up to 3 GB, so loading it in advance and looping over it is not an option.
My json file is basically a dictionary of key and value pairs (hundreds of pairs), and there is nothing I want to discard (ijson).
I just want to load one pair at a time to work with it. Is there any way to do that?
So basically I found out in this answer how to do it in a much simpler way:
https://stackoverflow.com/a/17326199/2933485
Using ijson, it looks like you can loop over the file without loading it, by opening the file and using ijson's parse function over it. This is the example I found:
import ijson

with open(json_file_name, 'rb') as json_file:
    for prefix, the_type, value in ijson.parse(json_file):
        print(prefix, the_type, value)
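Building on that, here is a small untested sketch that reconstructs top-level key/value pairs from the ijson event stream, assuming the file is a single JSON object whose values are scalars:

import ijson

def iter_pairs(path):
    # Yield (key, value) pairs one at a time from a top-level JSON object.
    with open(path, 'rb') as f:
        key = None
        for prefix, event, value in ijson.parse(f):
            if prefix == '' and event == 'map_key':
                key = value                      # remember the current key
            elif (key is not None and prefix == key
                  and event not in ('start_map', 'start_array')):
                yield key, value                 # scalar value for that key
                key = None

for key, value in iter_pairs('big.json'):
    print(key, value)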
Why don't you populate a sqlite table with the data once and query the data using the record PK? See https://docs.python.org/3.7/library/sqlite3.html
OK, so json is a nested format, which means each repeating block (dict or list object) is surrounded by start and end characters. Normally you read the entire file, and in doing so can confirm that each object is well formed, structured, and "closed"; in other words, it's verifiable that all objects are legally structured. When you load a json file into memory using the json library, part of that process is the validation.
If you want to do that for an extra large file - you have to forgo the normal library and roll your own, loading in a line (or chunk) at a time, and processing that under the assumption that validation will retrospectively succeed.
That's achievable (assuming you're able to put your faith in such an assumption) but it's probably something you'll have to write yourself.
One strategy might be to read a line at a time, splitting on the colon : character, with commas as record delimiters, which is a crude approximation of how key-value pairs are coded within json. Following this method, you're going to be able to process all but the first and final key-value pairs cleanly in sequence.
That just leaves you to write some special conditions for properly parsing the first and final records, which will come through garbled using this strategy.
Crudely then, call something like this (referencing the csv library) and treat the json like a massive, unusually formatted csv file.
import csv

with open('big.json', newline='') as csv_json_franken_file:
    jsonreader = csv.reader(csv_json_franken_file, delimiter=':', quotechar='"')
    for row in jsonreader:  # This bit reads in a "row" at a time, until finished
        print(', '.join(row))
Then do some edge-case treatment of the first and last rows (more or less depending on the structure of your json) to repair the garbling caused by what is a fairly blatant hack. It's not clean, and it's not robust to changes in the content - but sometimes, you just have to play the hand you've been dealt.
To be honest, generating json files of 3GB in size is a little irresponsible, so if anyone comes asking, you've got that in your corner.

Read data from CSV and write data to CSV - String to integer

I have a CSV file with 100,000 rows.
Each row in column A is a sentence comprised of both chars and integers.
I want column B to contain only integers.
I want the new columns to be in the same CSV file.
How can I accomplish this?
If I'm understanding your question correctly, I would use .isdigit() to parse the data in column A. I'm frankly not sure what the format of column A is, so I don't know exactly what you would do with this (if you gave more information I could give a more specific answer). Your solution will likely come in a similar form to this:
def find(lines):
    B = []
    for line in lines:
        numbers = [c for c in line if c.isdigit()]
        current = int(''.join(numbers))
        # current is the concatenation of all
        # integers found in column A from left to right
        B.append(current)
    return B
Let me know if this makes sense or is even in the right track for your solution. Once again, without knowing what you're trying to do, and what A looks like, I'm not sure what your actual goals are.
EDIT
I'm not going to explain the csv stuff for you, mainly because there is a fantastic resource and library for it included in python here. If you have specific questions related to writing csv, definitely post them.
It sounds like you essentially want to pull int values out of column A and then add them to a new column B. There are definitely many ways to solve this, but the general form of the problem is: for each row, filter out the int, then add the filtered int into the new column. I'll list a couple (a small sketch of the regex option follows below):
Regex: You could use a pattern such as [0-9]+ to pull the string out of A, then use int() on that output to cast it to int, then store those values in B. I'm a sucker for a good regular expression and this one is fairly straightforward. Regexr is a great resource to learn about this and test your pattern.
Use an algorithm similar to the above: The above algorithm worked before, but I've updated it slightly. Now that it's been updated it'll return an array of numbers corresponding to the numbers in A from left to right. This is relatively sound, but it doesn't necessarily guarantee you have the right integer, since if the title has an int in it, it'll mess some things up. It is likely one of the clearer ways of doing this, though.
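A small sketch of the regex option, assuming the sentence is in the first column and the new integer column is written next to it. The file names input.csv and output.csv are placeholders; if both columns must end up in the same file, write to a temporary file and rename it over the original.

import csv
import re

with open('input.csv', newline='') as src, \
        open('output.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        sentence = row[0]                          # column A
        digits = re.findall(r'[0-9]+', sentence)   # all integer runs in A
        # Column B: here the first integer found, or blank if there is none.
        column_b = int(digits[0]) if digits else ''
        writer.writerow([sentence, column_b])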

Big data File: Read and Create structured file

I have a 20+GB dataset that is structured as follows:
1 3
1 2
2 3
1 4
2 1
3 4
4 2
(Note: the repetition is intentional and there is no inherent order in either column.)
I want to construct a file in the following format:
1: 2, 3, 4
2: 3, 1
3: 4
4: 2
Here is my problem: I have tried writing scripts in both Python and C++ to load the file, create long strings, and write them to a file line by line. It seems, however, that neither approach has been able to handle the task at hand. Does anyone have any suggestions on how to tackle this problem? Specifically, is there a particular method/program that is optimal for this? Any help or guided directions would be greatly appreciated.
You can try this using Hadoop by running a stand-alone MapReduce program. The mapper outputs the first column as the key and the second column as the value. All outputs with the same key go to one reducer, so you end up with a key and the list of values seen with that key. You can run through the values list and output the (key, valueString) pair, which is the final output you desire. You can start with a simple Hadoop tutorial and write the mapper and reducer as suggested; a rough streaming-style sketch is below. However, I've not tried to scale 20 GB of data on a stand-alone Hadoop system, so you may have to experiment. Hope this helps.
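A rough, untested sketch of what the two streaming scripts could look like; the use of Hadoop Streaming with Python, the file names, and the way the reducer joins values are my assumptions rather than part of the answer.

# mapper.py: emit "key<TAB>value" for each input line
import sys

for line in sys.stdin:
    parts = line.split()
    if len(parts) == 2:
        key, value = parts
        print(f"{key}\t{value}")

# reducer.py: input arrives sorted by key, so equal keys are adjacent
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    values = [v for _, v in group]   # order follows the input order
    print(f"{key}: {', '.join(values)}")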
Have you tried using a std::vector of std::vector?
The outer vector represents each row. Each slot in the outer vector is a vector containing all the possible values for each row. This assumes that the row # can be used as an index into the vector.
Otherwise, you can try std::map<unsigned int, std::vector<unsigned int> >, where the key is the row number and the vector contains all values for the row.
A std::list of vectors would also work.
Does your program run out of memory?
Edit 1: Handling large data files
You can handle your issue by treating it like a merge sort:
1. Open an output file for each row number (key).
2. While iterating over the lines of the source file, append each 2nd-column value to the file for its key.
3. After all data is read, close all the files.
4. Open each file, read its values, and print them out comma separated, joining everything into the final output.
A rough sketch of this is below.
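A rough Python sketch of that per-key file approach (untested; the file names are placeholders, the keys are assumed to be integers as in the example, and if there are very many distinct keys you would need to cap the number of simultaneously open handles):

# Pass 1: append each value to a bucket file named after its key.
handles = {}
with open('input.txt') as src:
    for line in src:
        key, value = line.split()
        if key not in handles:
            handles[key] = open(f'bucket_{key}.txt', 'w')
        handles[key].write(value + '\n')
for handle in handles.values():
    handle.close()

# Pass 2: read each bucket back and emit one "key: v1, v2, ..." line.
with open('output.txt', 'w') as out:
    for key in sorted(handles, key=int):
        with open(f'bucket_{key}.txt') as bucket:
            values = [v.strip() for v in bucket]
        out.write(f"{key}: {', '.join(values)}\n")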
An interesting thought also found on Stack Overflow: if you want to persist a large dictionary, you are basically looking at a database.
As recommended there, use Python's sqlite3 module to write to a table where the primary key is auto-incremented, with a field called "key" (or "left") and a field called "value" (or "right").
Then SELECT the MIN(key) and MAX(key) from the table, and with that information you can SELECT all rows that have the same "key" (or "left") value, in sorted order, and print that information to an output file (if the database itself is not a good output format for you).
I have written this approach on the assumption that you call this problem "big data" because the number of keys does not fit well into memory (otherwise, a simple Python dictionary would be enough). However, IMHO this question is not correctly tagged as "big data": in order to require distributed computation on Hadoop or similar, your input data should be much more than what you can hold on a single hard drive, or your computations should be much more costly than a simple hash table lookup and insertion.
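A hedged sketch of that sqlite3 route; the file and table names are made up, and the MIN(key)/MAX(key) scan is simplified here to a DISTINCT query.

import sqlite3

conn = sqlite3.connect('pairs.db')
conn.execute('CREATE TABLE IF NOT EXISTS pairs ('
             'id INTEGER PRIMARY KEY AUTOINCREMENT, key INTEGER, value INTEGER)')

# Load the big file in batches so memory use stays flat.
with open('input.txt') as src:
    batch = []
    for line in src:
        left, right = line.split()
        batch.append((int(left), int(right)))
        if len(batch) >= 100000:
            conn.executemany('INSERT INTO pairs (key, value) VALUES (?, ?)', batch)
            conn.commit()
            batch = []
    if batch:
        conn.executemany('INSERT INTO pairs (key, value) VALUES (?, ?)', batch)
        conn.commit()

# Emit one "key: v1, v2, ..." line per key, with keys in sorted order.
with open('output.txt', 'w') as out:
    for (key,) in conn.execute('SELECT DISTINCT key FROM pairs ORDER BY key'):
        values = [str(v) for (v,) in conn.execute(
            'SELECT value FROM pairs WHERE key = ? ORDER BY id', (key,))]
        out.write(f"{key}: {', '.join(values)}\n")
conn.close()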

How to 'flatten' lines from text file if they meet certain criteria using Python?

To start, I am a complete newcomer to Python and to programming anything other than web languages.
I have developed a script using Python as an interface between a piece of software called Spendmap and an online app called Freeagent. This script works perfectly: it imports and parses the text file and pushes it through the API to the web app.
What I am struggling with is that Spendmap exports multiple lines per order, whereas Freeagent wants one line per order. So I need to add the cost values from any orders spread across multiple lines and then 'flatten' the lines into one so it can be sent through the API. The 'key' field is the 'PO' field, so if the script sees any matching PO numbers, I want it to flatten them as described above.
This is a 'dummy' example of the text file produced by Spendmap:
5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
The above has been formatted for easier reading; normally it is just one line after the next with no text formatting.
The 'key' or PO field is the fourth field (P000001, P000002, ...) and the sixth field is the cost to be totalled. So if this example were passed through the script, I'd expect the first row to be left alone, the second and third row costs to be added (as they're both from the same PO number), and the fourth line to be left alone.
Expected result:
5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,401.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
Any help with this would be greatly appreciated and if you need any further details just say.
Thanks in advance for looking!
I won't give you the solution. But you should:
Write and test a regular expression that breaks the line down into its parts, or use the CSV library.
Parse the numbers out so they're decimal numbers rather than strings
Collect the lines up by ID. Perhaps you could use a dict that maps IDs to lists of orders?
When all the input is finished, iterate over that dict and add up all the orders stored in each list.
Make a string format function that outputs the line in the expected format.
Maybe feed the output back into the input to test that you get the same result. Second time round there should be no changes, if I understood the problem.
Good luck!
I would use a dictionary to compile the lines, using get(key,0.0) to sum values if they exist already, or start with zero if not:
InputData = """5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP COMMENT,002143"""
OutD = {}
ValueD = {}
for Line in InputData.split('\n'):
    # commas in comments won't matter because we are joining after anyway
    Fields = Line.split(',')
    PO = Fields[3]
    Value = float(Fields[5])
    # set up the output string with a placeholder for .format()
    OutD[PO] = ",".join(Fields[:5] + ["{0:.3f}"] + Fields[6:])
    # add the value to the old value or to zero if it is not found
    ValueD[PO] = ValueD.get(PO, 0.0) + Value
# the output is unsorted by default, but you could sort or preserve original order
for POKey in ValueD:
    print(OutD[POKey].format(ValueD[POKey]))
P.S. Yes, I know Capitals are for Classes, but this makes it easier to tell what variables I have defined...
