I need to write a two-dimensional array (say array[100][20]) to a file. This is just an example of my code:
for i in range(100):
    print >> file, "%15.9e%24.15e%3d..." % (array[i][0], array[i][1], array[i][2], ... )
Is it possible to shorten the part after the "%" symbol with slicing or a similar notation?
You could do:
print >> file, format_string % tuple(array[i])
That said, I would probably use numpy here to save the data in a binary (non-ASCII) format for efficiency, both in time spent doing IO and in disk space.
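The tuple approach above could look like this in Python 3. This is only a minimal sketch: `format_row` and `write_rows` are hypothetical helper names, and the three-column format string merely stands in for the elided one in the question.

```python
import io

# Placeholder format for three columns; a real format string would need one
# conversion specifier per column of the array.
FORMAT_STRING = "%15.9e%24.15e%3d"


def format_row(row, format_string=FORMAT_STRING):
    """Format one row; tuple(row) hands every column to the % operator."""
    return format_string % tuple(row)


def write_rows(rows, out, format_string=FORMAT_STRING):
    """Write each row of a 2-D sequence to an open text file, one per line."""
    for row in rows:
        out.write(format_row(row, format_string) + "\n")


if __name__ == "__main__":
    array = [[1.5, 2.25, 3], [4.0, 5.5, 6]]
    buffer = io.StringIO()  # stands in for a file opened with open(..., "w")
    write_rows(array, buffer)
    print(buffer.getvalue(), end="")
```

Iterating over the rows directly also avoids having to hard-code the row count in `range`.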
At the risk of being down-voted, are you sure you're asking the right question? I've always found that if something is ugly or hard, it's probably not the best approach, especially in Python.
You might check out the csv module---what format do you want to write, exactly?
There's also the texttable module, that makes pretty tables for you, and will even dump them to a text file.
Finally, if you should be so inclined, there are the Python Excel Utilities, which will dump things into Excel spreadsheets. I have written some wrappers around the excel read and excel write interfaces in that module that I'll post here if you want---they're nothing special, but they may save you some time.
I am trying to read sav files using pyreadstat in Python, but in some rare cases I get a UnicodeDecodeError because a string variable contains special characters.
To handle this, I think that instead of loading the entire variable set I will load only the variables that do not raise this error.
Below is the pseudo-code I have. It is not very efficient, since I check each item of the list for the error using try and except.
import pyreadstat

# Read only the metadata to get information about the variables
df, meta = pyreadstat.read_sav('Test.sav', metadataonly=True)
columns = meta.column_names  # all variable names
result = []
for var in columns:
    print(var)
    try:
        df, meta = pyreadstat.read_sav('Test.sav', usecols=[str(var)])
        # No error means this variable can be kept in result
        result.append(var)
    except UnicodeDecodeError:
        pass
# Finally, load the sav file with only the variables that read without error
df, meta = pyreadstat.read_sav('Test.sav', usecols=result)
For a sav file with 1000+ variables this takes a long time to process.
I was wondering whether a divide-and-conquer approach would be faster. Below is my suggested approach, but I am not very good at implementing recursive algorithms. Could someone please help me with pseudo-code?
1. Take the list and try to read the sav file.
2. If there is no error, the list can be stored in result and we read the sav file.
3. If there is an error, split the list into 2 parts and run these steps again on each part.
4. Step 3 needs to run again until we have a list that does not give any error.
With the second approach, 90% of my sav files will load on the first pass, hence I think recursion is a good method.
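The recursive steps above can be sketched as follows. This is a sketch under assumptions, not a tested pyreadstat recipe: `find_loadable` and `can_load` are hypothetical names, and it assumes each bad variable fails on its own regardless of which other variables are read with it.

```python
def find_loadable(columns, can_load):
    """Return the columns that load without error, preserving their order.

    can_load(cols) is a caller-supplied function that returns True when the
    given list of columns can be read together without raising.
    """
    if not columns:
        return []
    if can_load(columns):
        return list(columns)  # the whole list is fine, keep all of it
    if len(columns) == 1:
        return []  # a single failing column is discarded
    mid = len(columns) // 2
    # Split in two and repeat the same procedure on each half
    return find_loadable(columns[:mid], can_load) + find_loadable(columns[mid:], can_load)


if __name__ == "__main__":
    bad = {"v3", "v7"}  # pretend these two variables cannot be decoded
    names = ["v%d" % i for i in range(10)]
    print(find_loadable(names, lambda cols: not bad.intersection(cols)))
```

With pyreadstat, `can_load` would presumably wrap a call such as `pyreadstat.read_sav('Test.sav', usecols=cols)` in a try/except and return False on UnicodeDecodeError. When the whole file is clean, only one read is attempted before the final load.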
You can try to reproduce the issue for sav file here
For this specific case I would suggest a different approach: you can give an argument "encoding" to pyreadstat.read_sav to manually set the encoding. If you don't know which one it is, what you can do is iterate over the list of encodings here: https://gist.github.com/hakre/4188459 to find out which one makes sense. For example:
import pyreadstat as p

# here codes is a list with all the encodings in the link mentioned before
for c in codes:
    try:
        df, meta = p.read_sav("Test.sav", encoding=c)
        print(c)  # the encoding that was just tried
        print(df.head())
    except Exception:
        pass
I did and there were a few that may potentially make sense, assuming that the string is in a non-latin alphabet. However the most promising one is not in the list: encoding="UTF8" (the list contains UTF-8, with dash and that fails). Using UTF8 (no dash) I get this:
నేను గతంలో వాడిన బ
which according to google translate means "I used to come b" in Telugu. Not sure if that fully makes sense, but it's a way.
The advantage of this approach is that if you find the right encoding, you will not be losing data, and reading the data will be fast. The disadvantage is that you may not find the right encoding.
If you do not find the right encoding, you would still be reading the problematic columns very fast, and you can discard them later in pandas by inspecting which character columns do not contain Latin characters. This will be much faster than the algorithm you were suggesting.
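One possible way to spot such columns is sketched below. The helper name `looks_latin`, the 0.9 threshold, and the choice of code points below U+0250 as "Latin" are all assumptions made for illustration.

```python
def looks_latin(values, threshold=0.9):
    """Return True if most characters in the string values are Latin.

    "Latin" here means code points below U+0250 (Basic Latin, Latin-1
    Supplement, Latin Extended-A and Latin Extended-B). Non-string values
    are ignored, and an input without any characters counts as Latin.
    """
    chars = [ch for value in values if isinstance(value, str) for ch in value]
    if not chars:
        return True
    latin = sum(1 for ch in chars if ord(ch) < 0x250)
    return latin / len(chars) >= threshold


if __name__ == "__main__":
    print(looks_latin(["hello", "café"]))
    print(looks_latin(["నేను గతంలో"]))
```

With a DataFrame `df`, something like `[c for c in df.select_dtypes(include="object") if not looks_latin(df[c].dropna())]` would then list the candidate columns to drop.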
I have been working on Python for about 1.5yrs and looking for some direction. This is the first time I can't find what I need after doing a lot of searching and must be missing something- most likely searching the wrong terms.
Problem: I am working on an app that has many processes (Could be hundreds or even thousands). Each process may have a unique input and output data format - could be multiline strings, comma separated strings, excel or csv with or without varying headers and many others. I need something that will format the input correctly and handle the output based upon the process. New processes also need to be easily added/defined. I am open to whatever is the best approach, but my thoughts are to use a database that stores the template/data definition and use that to know the format given a process. However, I'm struggling to come up with exactly how, if this is really the best approach, but it needs to be a solution that is scalable. Any direction would be appreciated. Thank you.
A couple simple examples of data
Process 1 example data (multi line string with Header)
Input of
[ABC123, XYZ453, CDE987]
and the resulting data input below would be created:
Barcode
ABC123
XYZ453
CDE987
The code below works, but is not reusable for example 2.
barcodes = ['ABC123', 'XYZ453', 'CDE987']
data = 'Barcode\r\n'
for barcode in barcodes:
    data = data + barcode + '\r\n'
Process 2 example input template (comma separated with Header):
Barcode,Location,Param1,Param2
Item1,L1,11,A
Item1,L1,22,B
Item2,L1,33,C
Item2,L2,44,F
Item3,L2,55,B
Item3,L2,66,P
Process 2 example resulting input data (comma separated with Header):
Input of
{'Barcode':['ABC123', 'XYZ453', 'CDE987', 'FGH487', 'YTR123'], 'Location':['Shelf1', 'Shelf2']}
and using the template to create the input data below:
Barcode,Location,Param1,Param2
ABC123,Shelf1,11,A
ABC123,Shelf1,22,B
XYZ453,Shelf1,33,C
XYZ453,Shelf2,44,F
CDE987,Shelf2,55,B
CDE987,Shelf2,66,P
FGH487,Shelf1,11,A
FGH487,Shelf1,22,B
YTR123,Shelf1,33,C
YTR123,Shelf2,44,F
I know how to handle each process with a hardcoded loop, dataframe merge, etc. I've done some abstraction in other cases with dicts. However, how to define and store formats that vary so much, and how to create reusable abstracted code around them, is where I am stuck.
Maybe you could have each function return a tuple with the fields "datatype" and "output", the latter holding the actual output.
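One way to keep the per-process formats as data rather than code is a small registry of formatter functions plus one stored definition per process. The sketch below is only an illustration: every name in it (`FORMATTERS`, `PROCESS_DEFINITIONS`, `header_lines`, `build_input`) is hypothetical, the dict stands in for a database table, and only the simple Process 1 format is covered.

```python
FORMATTERS = {}


def formatter(kind):
    """Register a formatting function under a short name."""
    def decorator(func):
        FORMATTERS[kind] = func
        return func
    return decorator


@formatter("header_lines")
def header_lines(data, header, newline="\r\n"):
    """Multi-line string: a header line followed by one value per line."""
    return newline.join([header, *data]) + newline


# Stand-in for rows in a database table: one definition per process.
PROCESS_DEFINITIONS = {
    "process_1": {"kind": "header_lines", "options": {"header": "Barcode"}},
}


def build_input(process_name, data, definitions=PROCESS_DEFINITIONS):
    """Look up the process definition and apply the matching formatter."""
    definition = definitions[process_name]
    return FORMATTERS[definition["kind"]](data, **definition["options"])


if __name__ == "__main__":
    print(repr(build_input("process_1", ["ABC123", "XYZ453", "CDE987"])))
```

Adding a process that reuses an existing kind then only needs a new definition entry; a new kind of format needs one more registered function.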
I would like to load the items of my JSON file one by one. The file could be up to 3 GB, so loading it whole and then looping over it is not an option.
My JSON file is basically a dictionary of key and value pairs (hundreds of pairs), and there is nothing I want to discard (ijson).
I just want to load one pair at a time to work with it. Is there any way to do that?
So basically I found out in this answer how to do it in a much simpler way:
https://stackoverflow.com/a/17326199/2933485
Using ijson, it looks like you can loop over the file without loading it all: open the file and run ijson's parse function over it. This is the example I found:
import ijson

with open(json_file_name, 'rb') as f:
    for prefix, the_type, value in ijson.parse(f):
        print(prefix, the_type, value)
Why don't you populate a sqlite table with the data once and query the data using the record PK? See https://docs.python.org/3.7/library/sqlite3.html
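A minimal sketch of that idea with the standard library follows. The table layout and the helper names `store_pairs` and `load_value` are assumptions; the pairs are taken from any iterable, so they could come from a streaming parser rather than from a fully loaded file.

```python
import json
import sqlite3


def store_pairs(conn, pairs):
    """Insert (key, value) pairs, storing each value as JSON text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, value TEXT)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO items (key, value) VALUES (?, ?)",
            ((key, json.dumps(value)) for key, value in pairs),
        )


def load_value(conn, key):
    """Fetch one value by its primary key, or None if the key is absent."""
    row = conn.execute("SELECT value FROM items WHERE key = ?", (key,)).fetchone()
    return None if row is None else json.loads(row[0])


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path to persist the table
    store_pairs(conn, {"a": [1, 2], "b": {"x": 1}}.items())
    print(load_value(conn, "b"))
```

After the one-time population, each lookup reads only the requested record instead of the whole 3 GB document.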
OK, so JSON is a nested format, which means each repeating block (dict or list object) is surrounded by start and end characters. Normally, you read the entire file, and in doing so can confirm the well-formed structure and "closedness" of each object. In other words, it's verifiable that all objects are legally structured. When you load a JSON file into memory using the json library, part of that process is the validation.
If you want to do that for an extra large file - you have to forgo the normal library and roll your own, loading in a line (or chunk) at a time, and processing that under the assumption that validation will retrospectively succeed.
That's achievable (assuming you're able to put your faith in such an assumption) but it's probably something you'll have to write yourself.
One strategy might be to read a line at a time, splitting on the colon : character, with commas as record delimiters, which is a crude approximation of how key-value pairs are coded within json. Following this method, you're going to be able to process all but the first and final key-value pairs cleanly in sequence.
That just leaves you to write some special conditions for properly parsing the first and final records, which will come through garbled using this strategy.
Crudely then, call something like this (referencing the csv library) and treat the json like a massive, unusually formatted csv file.
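import csv

# open() accepts only None, '', '\n', '\r' or '\r\n' for newline, so newline=','
# would raise ValueError; this version assumes one key-value pair per line.
with open('big.json', newline='') as csv_json_franken_file:
    jsonreader = csv.reader(csv_json_franken_file, delimiter=':', quotechar='"')
    for row in jsonreader:  # reads in one "row" at a time, until finished
        print(', '.join(row))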
Then do some edge-case treatment of the first and last rows (more or less depending on the structure of your json) to repair the garbling caused by what is a fairly blatant hack. It's not clean, and it's not robust to changes in the content - but sometimes, you just have to play the hand you've been dealt.
To be honest, generating json files of 3GB in size is a little irresponsible, so if anyone comes asking, you've got that in your corner.
I need to print or save a list with big data completely, but I can only print part of the data. The specific situation is shown in the picture.
[Image: the situation of python print]
When I try to save this data, I can only save part of it too. But now I need to save or print all the data. I could save the data as .mat, but I cannot read that in Java. Please help me.
Sorry, I didn't post my code fragment.
[Image: the code]
Use np.savetxt:
import numpy as np

b = np.zeros((10000, 10000))
np.savetxt('output.txt', b)
So that you can get the exact format that you want, savetxt has many options. You can read about them here. One particularly useful option is to set a format string to control how the numbers are written. For example:
np.savetxt('output2.txt', b, fmt='%5.2f')
Let's say I need to save a matrix (each line corresponds to one row) that could be loaded from Fortran later. Which method should I prefer? Is converting everything to strings the only approach?
You can save them in binary format as well. Please see the documentation on the struct standard module; it has a pack function for converting Python objects into binary data.
For example:
import struct

value = 3.141592654
data = struct.pack('d', value)  # 'd' is a C double (8 bytes)
with open('file.ext', 'wb') as f:
    f.write(data)
You can convert each element of your matrix and write to a file. Fortran should be able to load that binary data. You can speed up the process by converting a row as a whole, like this:
row_data = struct.pack('d' * len(matrix_row), *matrix_row)
Please note, that 'd' * len(matrix_row) is a constant for your matrix size, so you need to calculate that format string only once.
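Putting the row-wise packing together for a whole matrix might look like the sketch below. The function names are hypothetical, and the explicit little-endian prefix `<` is an assumption about the target machine. A Fortran reader would likely need to open the file with stream access and know the matrix dimensions in advance; that side is not shown here.

```python
import struct


def write_matrix(path, matrix):
    """Write a rectangular matrix of floats as raw little-endian doubles."""
    n_cols = len(matrix[0])
    # Build the row format once, e.g. "<3d" for three doubles per row
    row_struct = struct.Struct("<%dd" % n_cols)
    with open(path, "wb") as f:
        for row in matrix:
            f.write(row_struct.pack(*row))


def read_matrix(path, n_cols):
    """Read the file back in Python, mainly to check the round trip."""
    row_struct = struct.Struct("<%dd" % n_cols)
    with open(path, "rb") as f:
        data = f.read()
    return [list(row) for row in row_struct.iter_unpack(data)]


if __name__ == "__main__":
    matrix = [[2.3452452435, 3.34134], [4.5, 7.9]]
    write_matrix("matrix.bin", matrix)
    print(read_matrix("matrix.bin", 2))
```

The file holds nothing but the values in row-major order, so its size is rows × columns × 8 bytes.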
I don't know Fortran, so it's hard to tell what is easy for you to parse on that side.
It sounds like your options are saving the doubles either in plaintext (meaning, 'converting' them to strings) or in binary (using struct and the like). Which one is better depends on your situation.
I would go with the plaintext solution, as it means the files will be easily readable, and you won't have to deal with low-level details (endianness, default double sizes).
But there are cases where binary is better (for example, if you have a really big list of doubles and space is important, or if it is easier for you to parse and you need the optimization), though this is likely not your case.
You can use JSON:
import json

matrix = [[2.3452452435, 3.34134], [4.5, 7.9]]
with open('file.ext', 'w') as f:  # text mode: json writes str, not bytes
    json.dump(matrix, f)
The file content will look like:
[[2.3452452435, 3.34134], [4.5, 7.9]]
If legibility and ease of access is important (and file size is reasonable), Fortran can easily parse a simple array of numbers, at least if it knows the size of the matrix beforehand (with something like READ(FILE_ID, '2(F)'), I think):
1.234 5.6789e4
3.1415 9.265358978
42 ...
Two nested for loops in your Python code can easily write your matrix in this form.
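A sketch of that plain-text writer follows, with the inner loop written as a generator expression. The function name and the default `%.15e` format are assumptions; the format should be matched to whatever the Fortran READ statement expects.

```python
import io


def write_matrix_text(matrix, out, fmt="%.15e"):
    """Write one matrix row per line, values separated by single spaces."""
    for row in matrix:  # outer loop: rows
        # inner loop: format every value of the row
        out.write(" ".join(fmt % value for value in row) + "\n")


if __name__ == "__main__":
    matrix = [[1.234, 5.6789e4], [3.1415, 42]]
    buffer = io.StringIO()  # stands in for open("matrix.txt", "w")
    write_matrix_text(matrix, buffer, fmt="%.4e")
    print(buffer.getvalue(), end="")
```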