I am fairly new to Python (and handling files). I am using pandas and storing a DataFrame in a text file.
My program makes constant changes to the DataFrame, which in turn need to be reflected in the text file.
Writing the whole DataFrame over and over again would not be efficient (I guess, given that I may want to update only a single cell), and appending data would mean adding the whole DataFrame again (which is not what I want).
And then there is the option of a binary file: should I store the DataFrame that way, open it, edit it as a normal Python object, and have the changes reflected back in the file?
How do I achieve this?
Setting aside the discussion of whether or not you should use a database, it seems what you need is a quick way to save the DataFrame and read it back again.
You can do it with pickle. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
import pickle

# Save the DataFrame
with open("dataFrame.p", "wb") as f:
    pickle.dump(df, f)

# Load the DataFrame
with open("dataFrame.p", "rb") as f:
    df_read = pickle.load(f)
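pandas also wraps this for you, so a minimal equivalent sketch (same example file name, df being your DataFrame) would be:
import pandas as pd

# Save the DataFrame (pandas pickles it internally)
df.to_pickle("dataFrame.p")

# Load the DataFrame back
df_read = pd.read_pickle("dataFrame.p")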
Context
I have a pandas dataframe which I need to save to disk and re-load later. Because the file saved on disk needs to be human-readable, I'm currently saving the dataframe as a CSV. The data includes values that are integers, booleans, null/None, timestamps, and strings.
Problem
Some of the string values are phone numbers, formatted as "+12025550140", but these are being converted to integers by the round trip (dataframe -> CSV -> dataframe). I need them to stay as strings.
I've changed the CSV writing portion to use quoting=csv.QUOTE_NONNUMERIC, which preserves the format of the phone numbers into the CSV, but when they are read back into a dataframe they are converted to integers. If I tell the CSV reading portion to also use quoting=csv.QUOTE_NONNUMERIC, then they are converted to floats.
How do I enforce that quoted fields are loaded as strings? Or, is there any other way to enforce that the full process is type-safe?
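For reference, here is a minimal in-memory reproduction of the round trip described above, assuming a single hypothetical phone-number column:
import csv
import io
import pandas as pd

df = pd.DataFrame({"phone": ["+12025550140"]})

# Write with non-numeric quoting, as in the writing code below
buffer = io.StringIO()
df.to_csv(buffer, index=False, quoting=csv.QUOTE_NONNUMERIC)

# Read it back: the quoted phone number comes back as a number, not a string
df_back = pd.read_csv(io.StringIO(buffer.getvalue()))
print(df_back["phone"].dtype)  # a numeric dtype instead of object/str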
Constraints and non-constraints
The file saved to disk must be easy to manually edit, preferably with a plain text editor. I have full control and ownership over the code which generates the CSV file. A different file format can be used if it is easy to apply manual edits.
Code
Writing to disk:
import csv
df = get_df() # real function has been replaced
df.to_csv(query_file_path, index=False, quoting=csv.QUOTE_NONNUMERIC)
Reading from disk:
import pandas as pd

# Start from pandas' default NA strings, but don't treat "" as NA while parsing
CSV_NA_VALS = pd._libs.parsers.STR_NA_VALUES
CSV_NA_VALS.remove("")

df = pd.read_csv(query_file_path, na_values=CSV_NA_VALS)

# Convert empty strings to None after reading
df = df.replace([""], [None])
Versions
Python 3.9.5
pandas==1.4.0
I want to detect the different dataframes in an Excel file, give each detected dataframe an ID, and store each dataframe as an object/BLOB in an Oracle database.
So the DB table would look like:
DF_ID | DF_BLOB
1     | /blob string for df 1/
2     | /blob string for df 2/
I know how to store an entire Excel file as a BLOB in Oracle (basically just store excelfile.read() directly),
but I cannot directly read() or open() a pandas df. So how can I store this df object as a BLOB?
The go-to library for storing Python objects in a binary format is pickle.
To get a byte string instead of writing to a file, use pickle.dumps():
pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
Return the pickled representation of the object obj as a bytes object, instead of writing it to a file.
Arguments protocol, fix_imports and buffer_callback have the same meaning as in the Pickler constructor.
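A minimal sketch (the table and column names are just placeholders, and the cursor would come from whatever Oracle driver you use, e.g. cx_Oracle):
import pickle
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Serialize the DataFrame to a bytes object that can be bound to a BLOB column
df_blob = pickle.dumps(df)

# e.g. with a DB-API cursor (placeholder table/column names):
# cursor.execute("INSERT INTO df_store (df_id, df_blob) VALUES (:1, :2)", [1, df_blob])

# After reading the BLOB back from the database:
df_restored = pickle.loads(df_blob)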
I am attempting to read MapInfo .dat files into .csv files using Python. So far, I have found the easiest way to do this is through xlwings and pandas.
When I do this (code below) I get a mostly correct .csv file. The only issue is that some columns appear as symbols/gibberish instead of their real values. I know this because I also have the correct data on hand, exported from MapInfo.
import xlwings as xw
import pandas as pd

# Open the .dat file in a hidden Excel instance
app = xw.App(visible=False)
tracker = app.books.open('./cable.dat')

# Grab the used range and read it into a DataFrame
last_row = xw.Range('A1').current_region.last_cell.row
data = xw.Range("A1:AE" + str(last_row))
test_dataframe = data.options(pd.DataFrame, header=True).value

# `schema` holds the column names (defined elsewhere in my code)
test_dataframe.columns = list(schema)
test_dataframe.to_csv('./output.csv')
When I compare to the real data, I can see that the symbols do actually map to the correct numbers (meaning that 1 = Â?, 2 = #, 3 = #, etc.).
Below is the first part of the 'dictionary' as to how they map:
My question is this:
Is there an encoding that I can use to turn these series of symbols into their correct representation? The float columns aren't the only ones affected by this, but they are the most important to my data.
Any help is appreciated.
import pandas as pd
from simpledbf import Dbf5
dbf = Dbf5('path/filename.dat')
df = dbf.to_dataframe()
.dat files are dBase files underneath (https://www.loc.gov/preservation/digital/formats/fdd/fdd000324.shtml), so just use that method.
Then just output the data:
df.to_csv('outpath/filename.csv')
EDIT
If I understand correctly, you are using xlwings to load the .dat file into Excel and then reading it into a pandas dataframe to export it to a CSV file.
Somewhere along the way it seems that some binary data is not (or incorrectly) interpreted and then written as text to your CSV file.
Directly read the dBase file
My first suggestion would be to try to read the input file directly into Python without the use of an Excel instance.
According to Wikipedia, MapInfo .dat files are actually dBase III files. You can parse these in Python using a library like dbfread.
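A minimal sketch with dbfread (the file name and encoding are assumptions; MapInfo tables often use a Windows code page):
import pandas as pd
from dbfread import DBF

# Parse the dBase table directly; each record comes back as a dict
table = DBF('./cable.dat', encoding='cp1252')
df = pd.DataFrame(iter(table))

df.to_csv('./output.csv', index=False)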
Inspect the data before writing to CSV
Secondly, I would inspect the 'corrupted' columns in Python instead of immediately writing them to disk.
Either something is going wrong in the Excel import and the data in these columns gets imported as text instead of some binary number format,
or this data is correctly read into memory as a byte array (instead of a float), and when you write it to CSV it just gets dumped byte-wise to disk instead of being interpreted as a number and turned into a text representation.
Note
A small remark about your initial question regarding mapping text to numbers:
It will probably not be possible to create a straightforward map of characters to numbers:
The numbers could be stored in any encoding and might not be stored as decimal text values, as you now seem to assume.
The text representations you see are just a decoding of the raw bytes using some character encoding (UTF-8, UTF-16, ...). For UTF-8, for example, several bytes may map to one character, and the question marks or squares you see might indicate that one or more characters could not be decoded.
In any case you will lose information if you start from the text; you must start from the binary data and decode that.
I have a custom Python library function that takes a CSV flat file as input, read using data = open('file.csv', 'r').read(). But currently I have the data in Python as a pandas DataFrame. How can I pass this DataFrame to my custom library function as flat-file content?
As a workaround I'm writing the DataFrame to disk and reading it back with the read function, which adds a second or two to each iteration. I want to avoid this process.
If you call the to_csv method of a pandas DataFrame without a path argument, the CSV output is returned as a string. So you can use to_csv on your DataFrame to produce exactly the same content you currently get by writing the DataFrame to disk and reading it back.
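A minimal sketch (my_library_function is a hypothetical stand-in for your custom function that expects the file contents as a string):
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# No path argument: to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)

# Pass the string straight to your function instead of open('file.csv', 'r').read():
# result = my_library_function(csv_text)  # hypothetical name for your custom function
print(csv_text)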
I have certain computations performed on a dataset and I need the result to be stored in an external file.
If I exported it to CSV, then to process it further I would have to convert it back to a DataFrame/SFrame, which again increases the lines of code.
Here's the snippet:
train_data = graphlab.SFrame(ratings_base)
Clearly, it is an SFrame and can be converted to a DataFrame using
df_train = train_data.to_dataframe()
Now that it is a DataFrame, I need it exported to a file without changing its structure, since the exported file will be used as an argument to another Python script. That script must accept a DataFrame and not a CSV.
I have already checked out place1, place2, place3, place4 and place5.
P.S. I'm still digging into Python serialization; if anyone can simplify it in this context, that would be helpful.
I'd use the HDF5 format, as it's supported by pandas and by graphlab.SFrame, and besides that HDF5 is very fast.
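A minimal sketch of the HDF5 round trip (the paths and key name are just examples; it requires the tables package to be installed):
import pandas as pd

df_train = pd.DataFrame({"user_id": [1, 2], "rating": [4.0, 5.0]})

# Write the DataFrame to an HDF5 file under a named key
df_train.to_hdf(r'/path/to/train_data.h5', key='train', mode='w')

# Read it back from another script with structure and dtypes preserved
df_back = pd.read_hdf(r'/path/to/train_data.h5', key='train')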
Alternatively you can export the pandas DataFrame to a pickle file and read it from another script:
sf.to_dataframe().to_pickle(r'/path/to/pd_frame.pickle')
To read it back (from the same or from another script):
df = pd.read_pickle(r'/path/to/pd_frame.pickle')