When I save a data frame and then read it back in, the data is not the same. Why, and how do I make the save/read round trip return exactly the same thing?
import pandas as pd
from pandas.testing import assert_frame_equal
import myproject.io.db as db
old = db.get_dataframe()
old.to_csv(r'mypath\myfile.csv') # Save
new = pd.read_csv(r'mypath\myfile.csv', index_col=0) # Read the new save.
assert_frame_equal(old, new) # Assertion error, not identical
I tried forcing the datatype for a test:
new = pd.read_csv(r'mypath\myfile.csv', index_col=0, dtype=old.dtypes.to_dict())
The issue occurs because the data contains None, "NaN", "nan" and "NULL", but I'm working with some legacy code, and to "standardize" all of this I would have to create unit tests first (or I might break everything!).
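For illustration, here is a small, self-contained example of what I believe is happening (the column name is made up):
import pandas as pd
from io import StringIO
demo = pd.DataFrame({'col': [None, 'NaN', 'nan', 'NULL']})  # object dtype, mixed missing-value markers
buf = StringIO()
demo.to_csv(buf)
buf.seek(0)
# All four values are read back as NaN and the column becomes float64,
# so assert_frame_equal fails on both values and dtypes.
print(pd.read_csv(buf, index_col=0))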
The problem is that saving to a generic text file (like csv) may result in loss of some metadata (like types).
If you want to be sure the dataframe remains identical, then I would suggest pickling it instead:
old = db.get_dataframe()
old.to_pickle(r'mypath\myfile.pkl')
new = pd.read_pickle(r'mypath\myfile.pkl')
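With this, the same assertion from the question should pass, since pickle preserves dtypes and the distinct None/"NaN"/"nan"/"NULL" values (a quick sanity check rather than a guarantee for every exotic dtype):
from pandas.testing import assert_frame_equal
assert_frame_equal(old, new)  # should now pass: the pickle round trip preserves the frame exactly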
I have been trying to rename a column in a CSV file that I am working on in Google Colab. The same line of code works for one column name but does not work for the other.
import pandas as pd
import numpy as np
data = pd.read_csv("Daily Bike Sharing.csv",
index_col="dteday",
parse_dates=True)
dataset = data.loc[:,["cnt","holiday","workingday","weathersit",
"temp","atemp","hum","windspeed"]]
dataset = dataset.rename(columns={'cnt' : 'y'})
dataset = dataset.rename(columns={"dteday" : 'ds'})
dataset.head(1)
The image below is the dataframe called data.
The image below is dataset.
This image is the final output I get when I try to rename the dataframe.
The column name "dteday" is not getting renamed, but "cnt" is getting replaced with "y" by the same code. Can someone help me out? I have been racking my brain on this for some time now.
That's because you're setting dteday as your index upon reading in the CSV, whereas cnt is simply a column. Avoid the index_col argument in read_csv and instead call dataset = dataset.set_index('ds') after renaming.
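A minimal sketch of that suggestion (with the column list shortened here):
data = pd.read_csv("Daily Bike Sharing.csv", parse_dates=["dteday"])
dataset = data.loc[:, ["dteday", "cnt", "holiday", "workingday"]]
dataset = dataset.rename(columns={"cnt": "y", "dteday": "ds"})
dataset = dataset.set_index("ds")
dataset.head(1)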
An alternative in which only your penultimate line (trying to rename the index) would need to be changed:
dataset.index.names = ['ds']
You can remove the index_col argument in the read statement, include 'dteday' in your dataset, and then change the column name. You can make that column the index later using df.set_index.
My task is to take an output from a machine, and convert that data to json. I am using python, but the issue is the structure of the output.
From my research online, csv usually has the first row with the keys and the values in the same order underneath. Example: https://e.nodegoat.net/CMS/upload/guide-import_person_csv_notepad.png
However, the output from my machine doesn't look like this.
Mine looks like:
Date:,10/10/2015
Name:,"Company name"
Location:,"Company location"
Serial num:,"Serial number"
So the machine I'm working with outputs each result to a new .dat file instead of appending to a single CSV with a header row of keys. Technically, yes, the data is comma-separated, but I'm not sure how to work with this.
How should I go about turning this kind of data into JSON? Should I look into restructuring the data into the standard CSV layout? Or is there a way I can work with it as-is, without any cleanup before converting? In either case, any direction is appreciated.
You can try transposing it with pandas:
import pandas as pd
from io import StringIO
data = '''\
Date:,10/10/2015
Name:,"Company name"
Location:,"Company location"
Serial num:,"Serial number"
'''
f = StringIO(data)
# header=None keeps the "Date:" row as data instead of treating it as the header row
df = pd.read_csv(f, header=None)
# use the key column as the index, then transpose so the keys become columns
t = df.set_index(df.columns[0]).T
print(t['Location:'].iloc[0])
print(t['Serial num:'].iloc[0])
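From there, producing JSON is straightforward, for instance (a sketch assuming one record per .dat file):
import json
record = t.iloc[0].to_dict()  # e.g. {'Name:': 'Company name', 'Location:': 'Company location', ...}
print(json.dumps(record, indent=2))  # or t.to_json(orient='records') for the whole frame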
I just started to use Python.
I'm setting up a new methodology to read patent data, which should then be analyzed with TextRazor. I'm interested in extracting the topics and saving them in a term-document matrix. I can already save the output topics, but only as one very long vector in a single cell. How can I split this long vector so that the topics are saved in separate cells in an Excel file?
If you have any ideas regarding this problem, I would be thankful for your answer. Also, feel free to recommend or help me with my code.
data = open('Patentdaten1.csv')
content = data.read()
table = []
row = content.split('\n')
for i in range(len(row)):
    column = row[i].split(';')
    table.append(column)
patent1 = table[1][1]
import textrazor
textrazor.api_key ="b033067632dba8a710c57f088115ad4eeff22142629bb1c07c780a10"
client = textrazor.TextRazor(extractors= ["entities", "categories", "topics"])
client.set_classifiers(['textrazor_newscodes'])
response = client.analyze(content)
topics= response.topics()
import pandas as pd
df = pd.DataFrame({'topic' : [topics]})
df.to_csv('test.csv')
It's a bit difficult to see exactly what the problem is without example input and/or output, but saving data to Excel via pandas removes any need for intermediate processing:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
For instance:
import pandas
data = pandas.DataFrame.from_dict({"patents": ["p0", "p1"], "authors": ["a0", "a1"]})
data.to_excel("D:\\test.xlsx")
Python newbie here.
I am trying to save a large data frame into HDF file with lz4 compression using to_hdf.
I use Windows 10, Python 3, pandas 0.20.2.
I get the error “OverflowError: Python int too large to convert to C long”.
None of the machine resources are close to their limits (RAM, CPU, SWAP usage)
Previous posts discuss the dtype, but the following example shows that there is some other problem, potentially related to the size?
import numpy as np
import pandas as pd
# sample dataframe to be saved, pardon my French
n = 500*1000*1000
df = pd.DataFrame({'col1': [999999999999999999]*n,
                   'col2': ['aaaaaaaaaaaaaaaaa']*n,
                   'col3': [999999999999999999]*n,
                   'col4': ['aaaaaaaaaaaaaaaaa']*n,
                   'col5': [999999999999999999]*n,
                   'col6': ['aaaaaaaaaaaaaaaaa']*n})
# works fine
lim=200*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
# works fine
lim=300*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
# Error
lim=400*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
....
OverflowError: Python int too large to convert to C long
I experienced the same issue and it seems that it is indeed connected to the size of the data frame rather than to dtype (I had all the columns stored as strings and was able to store them to .h5 separately).
The solution that worked for me is to save the data frame in chunks using mode='a'.
As suggested in the pandas documentation for mode {'a', 'w', 'r+'} (default 'a'): 'a' means append; an existing file is opened for reading and writing, and if the file does not exist it is created.
So the sample code would look something like:
batch_size = 1000
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5', 'table', complib='blosc:lz4', mode='a')
As #Giovanni Maria Strampelli pointed out, the answer of #Artem Snorkovenko only saves the last batch. The pandas documentation states the following:
In order to add another DataFrame or Series to an existing HDF file, please use append mode and a different key.
Here is a possible workaround to save all batches (adjusted from the answer of #Artem Snorkovenko):
for i in range(len(df)):
    sr = df.loc[i]  # pandas Series object for the given index
    sr.to_hdf('df.h5', key='table_%i' % i, complib='blosc:lz4', mode='a')
This code saves each Pandas Series object with a different key. Each key is indexed by i.
To load the existing .h5 file after saving, one can do the following:
i = 0
dfdone = False  # if True, all keys in the .h5 file are successfully loaded.
srl = []  # list of the loaded Series objects
while dfdone == False:
    # print(i)  # this is to see if the code is working properly.
    try:  # check whether the current i value exists in the keys of the .h5 file
        sdfr = pd.read_hdf('df.h5', key='table_%i' % i)  # current Series object
        srl.append(sdfr)  # append each Series to a list to create the dataframe in the end
        i += 1  # increment i by 1 after loading the Series object
    except:  # if an error occurs, the current i value exceeds the number of keys, so all keys are loaded
        dfdone = True  # terminate the while loop
df = pd.DataFrame(srl)  # generate the dataframe from the list of Series objects
I used a while loop, assuming we do not know the exact length of the dataframe in the .h5 file. If the length is known, a for loop can be used as well.
Note that I am not saving dataframes in chunks here, so the loading procedure in its current form is not suitable for saving in chunks, where the data type would be DataFrame for each chunk. In my implementation, each saved object is a Series, and the DataFrame is generated from a list of Series. The code I provided can be adjusted to work for saving in chunks and generating a DataFrame from a list of DataFrame objects (a nice starting point can be found in this Stack Overflow entry), as sketched below.
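For completeness, a possible adaptation along those lines (a rough sketch; the key names are made up and it is untested at the original data size):
batch_size = 1000
n_chunks = 0
for start in range(0, len(df), batch_size):
    df.iloc[start:start + batch_size].to_hdf('df.h5', key='chunk_%i' % (start // batch_size),
                                             complib='blosc:lz4', mode='a')
    n_chunks += 1
# read the chunks back and stitch them into a single DataFrame
df = pd.concat(pd.read_hdf('df.h5', key='chunk_%i' % i) for i in range(n_chunks))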
In addition to #tetrisforjeff's answer:
If the df contains object dtypes, the reading could lead to an error. I would suggest pd.concat(srl) instead of pd.DataFrame(srl).
I have got an .xls file with images in cells, like so:
When I loaded this file in pandas
>>> import pandas as pd
>>> df = pd.read_excel('myfile.xls') # same behaviour with *.xlsx
>>> df.dtypes
The dtype in all columns appeared as object
After some manipulations I saved the df back to Excel; however, the images disappeared.
Please note that, in Excel, I was able to sort the rows together with the images, and when resizing cells the images scaled accordingly, so it looks like they really were contained in the cells.
Why did they disappear after saving the df back to Excel, or did they never load into the df in the first place?
I'm not sure if this will be helpful, but I had the problem where I needed to load a data frame with images, so I wrote the following code. I hope this helps.
import base64
from io import BytesIO
import openpyxl
import pandas as pd
from openpyxl_image_loader import SheetImageLoader
def load_dataframe(dataframe_file_path: str, dataframe_sheet_name: str) -> pd.DataFrame:
    # By default, it appears that pandas does not read images, as it uses only openpyxl to read
    # the file. As a result we need to load the dataframe into memory, explicitly load in
    # the images, and then convert all of this to HTML and put it back into the normal
    # dataframe, ready for use.
    pxl_doc = openpyxl.load_workbook(dataframe_file_path)
    pxl_sheet = pxl_doc[dataframe_sheet_name]
    pxl_image_loader = SheetImageLoader(pxl_sheet)

    pd_df = pd.read_excel(dataframe_file_path, sheet_name=dataframe_sheet_name)

    for pd_row_idx, pd_row_data in pd_df.iterrows():
        for pd_column_idx, _pd_cell_data in enumerate(pd_row_data):
            # Offset as openpyxl sheets index by one, and also offset the row index by one
            # more to account for the header row
            pxl_cell_coord_str = pxl_sheet.cell(pd_row_idx + 2, pd_column_idx + 1).coordinate
            if pxl_image_loader.image_in(pxl_cell_coord_str):
                # Now that we have a cell that contains an image, we want to convert it to
                # base64 and make it nice HTML, so that it loads in a front end
                pxl_pil_img = pxl_image_loader.get(pxl_cell_coord_str)
                with BytesIO() as pxl_pil_buffered:
                    pxl_pil_img.save(pxl_pil_buffered, format="PNG")
                    pxl_pil_img_b64_str = base64.b64encode(pxl_pil_buffered.getvalue())
                pd_df.iat[pd_row_idx, pd_column_idx] = '<img src="data:image/png;base64,' + \
                    pxl_pil_img_b64_str.decode('utf-8') + \
                    f'" alt="{pxl_cell_coord_str}" />'
    return pd_df
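For example, the returned frame can then be rendered with the images embedded (the file and sheet names here are hypothetical):
df = load_dataframe('myfile.xlsx', 'Sheet1')
html = df.to_html(escape=False)  # escape=False keeps the <img> tags intact so the images actually render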
NOTE: For some reason, SheetImageLoader's loading of images is persistent globally. This means that when I run this function twice, the second time I run it, openpyxl will append the images from the second run into the SheetImageLoader object of the first.
For example, if I read one file that has 25 images in it, pxl_sheet._images and pxl_image_loader._images both have 25 images in them. However, if I read another file which has 5 images in it, pxl_sheet._images has length 5, but pxl_image_loader._images now has length 30, so it has just appended the new images to the old object, despite being a completely different function call.
I tried deleting the object from memory, but this did not work. I eventually solved it by adding code where, after I construct the SheetImageLoader object, I manually reset pxl_image_loader's _images attribute (using logic similar to that in SheetImageLoader's __init__ method). I'm unsure if this is a bug in openpyxl or something to do with scoping in Python.
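Purely as an illustration of that kind of workaround (this assumes the loader keeps its images in a private, class-shared _images dict, which is what the behaviour above suggests; attribute names may differ between versions):
# Hypothetical reset: start the loader with a fresh image map so that images from
# previously loaded sheets cannot leak into this one (relies on the private
# _images attribute, so treat it as a workaround rather than a supported API).
SheetImageLoader._images = {}
pxl_image_loader = SheetImageLoader(pxl_sheet)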