pandas: write dataframe to excel file *object* (not file)?

I have a dataframe that I want to convert to an Excel file and return over HTTP. The DataFrame's to_excel method accepts either a path or an ExcelWriter, which, in turn, refers to a path.
Is there any way to convert the dataframe to a file object without writing it to disk?

This can be done using the BytesIO object from the standard library:
import pandas
from io import BytesIO

# Create random data for the example
cols = ["col1", "col2"]
df = pandas.DataFrame.from_records([{k: 0.0 for k in cols} for _ in range(25)])

# Create an in-memory binary file object and write the dataframe to it
in_memory_fp = BytesIO()
df.to_excel(in_memory_fp)

# Write the file out to disk to demonstrate that it worked
in_memory_fp.seek(0)
with open("my_file.xlsx", "wb") as f:
    f.write(in_memory_fp.read())
In the above example, I wrote the object out to a file so you can verify that it works. If you want to just return the raw binary data in memory, all you need is:
in_memory_fp.seek(0)
binary_xl = in_memory_fp.read()
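Since the original goal was to return the file over HTTP, here is a minimal sketch of serving the in-memory workbook, assuming a Flask app (the question does not name a web framework, so the route and names here are just examples):
from io import BytesIO

import pandas as pd
from flask import Flask, send_file

app = Flask(__name__)

@app.route("/report.xlsx")
def report():
    df = pd.DataFrame({"col1": [0.0], "col2": [0.0]})
    in_memory_fp = BytesIO()
    df.to_excel(in_memory_fp)
    in_memory_fp.seek(0)
    # Stream the buffer back with the spreadsheet MIME type
    return send_file(
        in_memory_fp,
        mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        download_name="report.xlsx",  # use attachment_filename= on Flask < 2.0
    )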

Related

How to save the new result extracted from a CSV file in Python?

I have a file (M.csv) that contains many columns.
I extract 2 columns from M.csv using DictReader. Now I want to save these 2 columns in a new CSV file. I use data.to_csv(...), but it does not work.
This is the code:
import pandas as pd
import csv

with open('M.csv', newline='') as f:
    data = csv.DictReader(f)
    print("I_CaIIHK", ",", 'Swm')
    print("---------------------------------")
    for row in data:
        print(row['I_CaIIHK_err'], ",", row['Swm'])
    data.to_csv('New.csv', index=False)
The code runs and I get the 2 columns printed, but they are not saved to the new CSV file.
This is the error:
AttributeError: 'DictReader' object has no attribute 'to_csv'
It looks like you are trying to call the pandas.DataFrame.to_csv method on a csv.DictReader object (which is not a DataFrame).
You should be able to read the CSV file into a DataFrame with the pandas.read_csv function, specifying only the columns you want with the usecols argument, then save the data with the pandas.DataFrame.to_csv method.
Then you don't need the csv library at all :).
Something like:
import pandas as pd
COLUMNS_TO_USE = ["I_CaIIHK_err", "Swm"]
PATH_TO_FILE = "./M.csv"
OUTPUT_FILE = "New.csv"
df = pd.read_csv(PATH_TO_FILE, usecols=COLUMNS_TO_USE)  # read only the wanted columns
df.to_csv(OUTPUT_FILE, index=False)

Pandas - save multiple CSV in a zip archive

I need to save multiple dataframes in CSV, all in a same zip file.
Is it possible without making temporary files?
I tried using zipfile:
with zipfile.ZipFile("archive.zip", "w") as zf:
    with zf.open("file1.csv", "w") as buffer:
        data_frame.to_csv(buffer, mode="wb")
This works with to_excel but fails with to_csv, as zipfile expects binary data and to_csv writes a string, despite the mode="wb" parameter:
.../lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 283, in _save_header
writer.writerow(encoded_labels)
.../lib/python3.8/zipfile.py", line 1137, in write
TypeError: a bytes-like object is required, not 'str'
On the other hand, I tried using the compression parameter of to_csv, but the archive is overwritten, and only the last dataframe remains in the final archive.
If no other way, I'll use temporary files, but I was wondering if someone have an idea to allow to_csv and zipfile work together.
Thanks in advance!
I would approach this the following way:
import io
import pandas as pd
df = pd.DataFrame({"x":[1,2,3]})
string_io = io.StringIO()
df.to_csv(string_io)
string_io.seek(0)
df_bytes = string_io.read().encode('utf-8')
As df_bytes is bytes, it should now work with zipfile. Edit: after looking into the to_csv help, I found a simpler way to get the bytes, namely:
import pandas as pd
df = pd.DataFrame({"x":[1,2,3]})
df_bytes = df.to_csv().encode('utf-8')
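Putting the two together, here is a minimal sketch of writing several dataframes as CSVs into one archive without temporary files (the file names and frames are just examples):
import zipfile

import pandas as pd

frames = {
    "file1.csv": pd.DataFrame({"x": [1, 2, 3]}),
    "file2.csv": pd.DataFrame({"y": [4, 5, 6]}),
}

with zipfile.ZipFile("archive.zip", "w") as zf:
    for name, frame in frames.items():
        # to_csv() with no path returns a str; encode it into the
        # bytes that zipfile expects
        zf.writestr(name, frame.to_csv().encode('utf-8'))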
For saving multiple Excel files from dataframes in a zip file:
import zipfile

import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]})
df2 = pd.DataFrame({"y": [4, 5, 6]})
df3 = pd.DataFrame({"z": [7, 8, 9]})

with zipfile.ZipFile("rishabh.zip", "w") as zf:
    with zf.open("check1.xlsx", "w") as buffer:
        df1.to_excel(buffer, index=False)
    with zf.open("check2.xlsx", "w") as buffer:
        df2.to_excel(buffer, index=False)
    with zf.open("check3.xlsx", "w") as buffer:
        df3.to_excel(buffer, index=False)
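Note that opening an archive member for writing with zf.open(name, "w"), as used above, requires Python 3.6 or newer.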

Python: Parsing multiple csv files and skip files without a keyword

I am trying to read some .csv field data in Python for post-processing. I typically just use something like:
for flist in glob('*.csv'):
    df = pd.read_csv(flist, delimiter=',')
However, I need to filter out the bad files, which contain "Run_Terminated" somewhere in the file, and skip each such file entirely. I'm still new to Python, so I'm not familiar with all of its functionality; any input would be appreciated. Thank you.
What you could do is first read the file fully into memory (using an io.StringIO file-like object) and look for the Run_Terminated string anywhere in the file (dirty, but should be OK).
Then pass the handle to read_csv (since you can pass a handle OR a filename), so you don't have to read the file again from disk.
import glob
import io

import pandas as pd

for flist in glob.glob('*.csv'):
    with open(flist) as f:
        data = io.StringIO()
        data.write(f.read())
    if "Run_Terminated" not in data.getvalue():
        data.seek(0)  # rewind or it won't read anything
        df = pd.read_csv(data, delimiter=',')

How to convert Pandas Dataframe to csv reader directly in python?

I have a csv file with millions of rows. I used to create a dictionary out of the csv file like this:
with open('us_db.csv', 'rb') as f:
    data = csv.reader(f)
    for row in data:
        # Create dictionary based on a column
        ...
Now, to filter the rows based on some conditions, I use a pandas DataFrame, as it is super fast for these operations. I load the csv as a pandas DataFrame, do some filtering, and then want to continue doing the above. I thought of using pandas df.iterrows() or df.itertuples(), but it is really slow.
Is there a way to convert the pandas dataframe to csv.reader() directly so that I can continue to use the above code? If I use csv_rows = df.to_csv(), it gives a long string. Of course, I can write out a csv and then read from it again, but I want to know if there is a way to skip the extra write to and read from a file.
You could do something like this:
import numpy as np
import pandas as pd
from io import StringIO
import csv

# Random dataframe
df = pd.DataFrame(np.random.randn(3, 4))

buffer = StringIO()  # create an empty buffer
df.to_csv(buffer)    # fill that buffer
buffer.seek(0)       # set to the start of the stream
for row in csv.reader(buffer):
    pass  # do stuff with each row
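For example, if the goal is to rebuild a dictionary keyed on one of the columns, the loop body could look something like this (the column layout here is hypothetical):
buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)  # skip the header row that to_csv wrote
lookup = {}
for row in reader:
    # row[0] is the index that to_csv wrote; key on the first data column
    lookup[row[1]] = row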
Why don't you apply the Create Dictionary function to the target column?
Something like:
df['column_name'] = df['column_name'].apply(create_dictionary)
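A small, self-contained illustration of that suggestion, with a hypothetical create_dictionary function standing in for the asker's logic:
import pandas as pd

def create_dictionary(value):
    # Hypothetical stand-in for the original "Create Dictionary" logic
    return {"value": value}

df = pd.DataFrame({"column_name": ["a", "b", "c"]})
df["column_name"] = df["column_name"].apply(create_dictionary)
print(df)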

What happens with images from excel file after creating dataframe in Pandas

I have an .xls file with images in the cells.
When I loaded this file in pandas
>>> import pandas as pd
>>> df = pd.read_excel('myfile.xls') # same behaviour with *.xlsx
>>> df.dtypes
The dtype in all columns appeared as object.
After some manipulations, I saved the df back to Excel; however, the images disappeared.
Please note that, in Excel, I was able to sort the rows together with the images, and by resizing cells the images scaled accordingly, so it looks like they were really contained in the cells.
Why did they disappear after saving the df back to Excel, or did they never load into the df in the first place?
I'm not sure if this will be helpful, but I had a problem where I needed to load a dataframe with images, so I wrote the following code.
import base64
from io import BytesIO

import openpyxl
import pandas as pd
from openpyxl_image_loader import SheetImageLoader

def load_dataframe(dataframe_file_path: str, dataframe_sheet_name: str) -> pd.DataFrame:
    # By default, it appears that pandas does not read images, as it uses
    # openpyxl only to read the cell values. As a result we need to load the
    # workbook, explicitly load in the images, convert them to HTML, and put
    # the result back into the normal dataframe, ready for use.
    pxl_doc = openpyxl.load_workbook(dataframe_file_path)
    pxl_sheet = pxl_doc[dataframe_sheet_name]
    pxl_image_loader = SheetImageLoader(pxl_sheet)

    pd_df = pd.read_excel(dataframe_file_path, sheet_name=dataframe_sheet_name)

    for pd_row_idx, pd_row_data in pd_df.iterrows():
        for pd_column_idx, _pd_cell_data in enumerate(pd_row_data):
            # Offset by one as openpyxl sheets index from one, and offset the
            # row index by one more to account for the header row
            pxl_cell_coord_str = pxl_sheet.cell(pd_row_idx + 2, pd_column_idx + 1).coordinate
            if pxl_image_loader.image_in(pxl_cell_coord_str):
                # Now that we have a cell that contains an image, convert it to
                # base64 and make it nice and HTML, so that it loads in a front end
                pxl_pil_img = pxl_image_loader.get(pxl_cell_coord_str)
                with BytesIO() as pxl_pil_buffered:
                    pxl_pil_img.save(pxl_pil_buffered, format="PNG")
                    pxl_pil_img_b64_str = base64.b64encode(pxl_pil_buffered.getvalue())
                pd_df.iat[pd_row_idx, pd_column_idx] = '<img src="data:image/png;base64,' + \
                    pxl_pil_img_b64_str.decode('utf-8') + \
                    f'" alt="{pxl_cell_coord_str}" />'
    return pd_df
NOTE: For some reason, SheetImageLoader's loading of images is persistent globally. This means that when I run this function twice, the second time I run it, openpyxl will append the images from the second run into the SheetImageLoader object of the first.
For example, if I read one file that has 25 images in it, pxl_sheet._images and pxl_image_loader._images both have 25 images in them. However, if I read another file which has 5 images in it, pxl_sheet._images has length 5, but pxl_image_loader._images now has length 30, so it has just appended the new images to the old object, despite being a completely different function call.
I tried deleting the object from memory, but this did not work. I eventually solved it by adding some code where, after constructing the SheetImageLoader object, I manually reset pxl_image_loader's _images attribute (using logic similar to that in SheetImageLoader's __init__ method). I'm unsure if this is a bug in openpyxl or something to do with scoping in Python.
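A minimal sketch of that workaround, assuming the mapping lives in a private _images attribute as it does in current openpyxl_image_loader (private API, so this may break in other versions):
pxl_image_loader = SheetImageLoader(pxl_sheet)
# _images is shared at class level, so give this instance its own empty dict
# and re-run the loader's own __init__ to repopulate it from just this sheet.
# This relies on private attributes and is a workaround only.
pxl_image_loader._images = {}
pxl_image_loader.__init__(pxl_sheet)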
