What happens to images from an Excel file after creating a DataFrame in pandas - Python

I have an .xls file with images in cells, like so:
When I loaded this file in pandas
>>> import pandas as pd
>>> df = pd.read_excel('myfile.xls') # same behaviour with *.xlsx
>>> df.dtypes
The dtype in all columns appeared as object.
After some manipulations I saved the df back to Excel, however the images disappeared.
Please note that, in Excel, I was able to sort the rows together with the images, and when resizing cells the images scaled accordingly, so it looks like they were really contained in the cells.
Why did they disappear after saving the df back to Excel, or did they not load into the df in the first place?

I'm not sure if this will be helpful, but I had a problem where I needed to load a dataframe with images, so I wrote the following code. I hope this helps.
import base64
from io import BytesIO
import openpyxl
import pandas as pd
from openpyxl_image_loader import SheetImageLoader
def load_dataframe(dataframe_file_path: str, dataframe_sheet_name: str) -> pd.DataFrame:
    # By default pandas does not read images, as it only uses openpyxl to read
    # the cell values. So we load the workbook and its images explicitly, convert
    # each image to an HTML <img> tag, and write that back into the corresponding
    # cell of the dataframe, ready for use.
    pxl_doc = openpyxl.load_workbook(dataframe_file_path)
    pxl_sheet = pxl_doc[dataframe_sheet_name]
    pxl_image_loader = SheetImageLoader(pxl_sheet)

    pd_df = pd.read_excel(dataframe_file_path, sheet_name=dataframe_sheet_name)

    for pd_row_idx, pd_row_data in pd_df.iterrows():
        for pd_column_idx, _pd_cell_data in enumerate(pd_row_data):
            # Offset by one because openpyxl sheets are 1-indexed, and by one more
            # on the row to account for the header row
            pxl_cell_coord_str = pxl_sheet.cell(pd_row_idx + 2, pd_column_idx + 1).coordinate
            if pxl_image_loader.image_in(pxl_cell_coord_str):
                # The cell contains an image: convert it to base64 and wrap it in
                # an HTML <img> tag so that it loads in a front end
                pxl_pil_img = pxl_image_loader.get(pxl_cell_coord_str)
                with BytesIO() as pxl_pil_buffered:
                    pxl_pil_img.save(pxl_pil_buffered, format="PNG")
                    pxl_pil_img_b64_str = base64.b64encode(pxl_pil_buffered.getvalue())
                pd_df.iat[pd_row_idx, pd_column_idx] = '<img src="data:image/png;base64,' + \
                    pxl_pil_img_b64_str.decode('utf-8') + \
                    f'" alt="{pxl_cell_coord_str}" />'
    return pd_df
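A hypothetical usage example (the file and sheet names are placeholders): since the returned dataframe holds raw HTML <img> tags, rendering it with escape=False keeps the embedded images visible.
df = load_dataframe('myfile.xlsx', 'Sheet1')  # hypothetical file/sheet
html_table = df.to_html(escape=False)  # escape=False so the <img> tags are rendered, not escaped
with open('preview.html', 'w') as f:
    f.write(html_table)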
NOTE: For some reason, SheetImageLoader's loading of images persists globally. This means that when I run this function twice, on the second run openpyxl appends the images from the second run to the SheetImageLoader object of the first.
For example, if I read one file that has 25 images in it, pxl_sheet._images and pxl_image_loader._images both contain 25 images. However, if I then read another file with 5 images in it, pxl_sheet._images has length 5, but pxl_image_loader._images now has length 30, so it has simply appended the new images to the old object, despite this being a completely different function call.
I tried deleting the object from memory, but this did not work. I eventually solved it by adding some code where, after constructing the SheetImageLoader object, I manually reset its _images attribute (using logic similar to that in SheetImageLoader's __init__ method). I'm unsure whether this is a bug in openpyxl or something to do with scoping in Python.
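For reference, a minimal sketch of that reset, assuming (as in the version I used) that the loader caches images in a private _images dict built from the sheet's _images list; these are private attributes, so this may break with other versions:
from openpyxl.utils import get_column_letter

pxl_image_loader = SheetImageLoader(pxl_sheet)
# Clear the (apparently class-level) cache so images from earlier calls don't leak in,
# then repopulate it from this sheet only, mirroring SheetImageLoader.__init__.
pxl_image_loader._images = {}
for image in pxl_sheet._images:
    row = image.anchor._from.row + 1
    col = get_column_letter(image.anchor._from.col + 1)
    pxl_image_loader._images[f'{col}{row}'] = image._data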

Related

What is the fastest way to retrieve header names from Excel files using pandas?

I have some large Excel files and I'm collecting their column names into a unique list.
The code below works, but it takes ~9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd
import os
get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl').columns)
print(get_col)
Using pandas to extract just the column names of a large excel file is very inefficient.
You can use openpyxl for this:
from openpyxl import load_workbook
wb = load_workbook(r"E:\DATA\dbo.xlsx", read_only=True)
columns = {}
for sheet in wb.worksheets:
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        columns = value
Assuming you only have one sheet, you will get a tuple of column names here.
If you want faster reading, I suggest you use a different file type. Excel files, while convenient, are binary files, so for pandas to read and parse them correctly it must use the full file. Using nrows or skipfooter to work with less data only happens after the full data has been loaded, and therefore shouldn't really affect the waiting time. In contrast, when working with a .csv file, given its type and that there is no significant metadata, you can extract just the first rows as an iterable using the chunksize parameter in pd.read_csv().
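For example, a minimal sketch, assuming the data has been exported once to a .csv next to the original file (the path is hypothetical):
import pandas as pd

# Reading only the first chunk streams just a handful of rows; the header comes with it.
first_chunk = next(pd.read_csv(r"E:\DATA\dbo.csv", chunksize=5))
get_col = list(first_chunk.columns)
print(get_col)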
Other than that, calling list() on a dataframe already returns a list of its column names, so my only suggestion for the code you use is:
get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl'))
The stronger suggestion, if you specifically want to address this issue, is to change the file type.

How to save and read the EXACT same data with pandas?

When saving a data frame and then reading it again, the data is not the same. Why? How can I make it save/read exactly the same thing?
import pandas as pd
from pandas.testing import assert_frame_equal
import myproject.io.db as db
old = db.get_dataframe()
old.to_csv(r'mypath\myfile.csv') # Save
new = pd.read_csv(r'mypath\myfile.csv', index_col=0) # Read the new save.
assert_frame_equal(old, new) # Assertion error, not identical
I tried forcing the datatype for a test:
new = pd.read_csv(r'mypath\myfile.csv', index_col=0, dtype=old.dtypes.to_dict())
The issue occurs because the data contains None, "NaN", "nan" and "NULL", but I'm working with some legacy code, and to "standardize" all of this I would have to create unit tests first (or I might break everything!)
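A small illustration of why this matters (made-up data): by default, read_csv maps the strings "NaN", "nan" and "NULL", as well as the empty field that None is written out as, all to NaN, so those distinctions cannot survive a CSV round trip.
import io
import pandas as pd

old = pd.DataFrame({'col': [None, 'NaN', 'nan', 'NULL']})
buf = io.StringIO()
old.to_csv(buf)
buf.seek(0)
new = pd.read_csv(buf, index_col=0)
print(new['col'].tolist())  # [nan, nan, nan, nan] - every variant collapsed to NaN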
The problem is that saving to a generic text file (like CSV) may result in the loss of some metadata (like dtypes).
If you want to be sure the dataframe remains identical, then I would suggest pickling:
old = db.get_dataframe()
old.to_pickle(r'mypath\myfile.pkl')
new = pd.read_pickle(r'mypath\myfile.pkl')
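Pickle preserves dtypes, the index, and the None/"NaN"/"nan"/"NULL" distinctions exactly, so the original check should now pass:
from pandas.testing import assert_frame_equal
assert_frame_equal(old, new)  # no AssertionError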

Numpy CSV fromfile()

I'm probably trying to reinvent the wheel here, but numpy has a fromfile() function that can read - I imagine - CSV files.
It appears to be incredibly fast - even compared to Pandas read_csv(), but I'm unclear on how it works.
Here's some test code:
import pandas as pd
import numpy as np
# Create the file here, two columns, one million rows of random numbers.
filename = 'my_file.csv'
df = pd.DataFrame({'a':np.random.randint(100,10000,1000000), 'b':np.random.randint(100,10000,1000000)})
df.to_csv(filename, index = False)
# Now read the file into memory.
arr = np.fromfile(filename)
print(len(arr))
I included the len() at the end there to make sure it wasn't reading just a single line. But curiously, the length for me (will vary based on your random number generation) was 1,352,244. Huh?
The docs show an optional sep parameter. But when that is used:
arr = np.fromfile(filename, sep = ',')
...we get a length of 0?!
Ideally I'd be able to load a 2D array of arrays from this CSV file, but I'd settle for a single array from this CSV.
What am I missing here?
numpy.fromfile is not made to read .csv files; instead, it is made for reading data written with the numpy.ndarray.tofile method.
From the docs:
A highly efficient way of reading binary data with a known data-type, as well as parsing simply formatted text files. Data written using the tofile method can be read using this function.
By using it without a sep parameter, numpy assumes you are reading a binary file, hence the different lengths. When you specify a separator, I guess the function just breaks.
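To see what fromfile is actually meant for, here is a quick binary round trip (the file name is just an example):
import numpy as np

a = np.arange(10, dtype=np.int64)
a.tofile('data.bin')                         # raw bytes: no header, no shape, no dtype stored
b = np.fromfile('data.bin', dtype=np.int64)  # you must supply the dtype yourself
print((a == b).all())  # True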
To read a .csv file using numpy, I think you can use numpy.genfromtxt or numpy.loadtxt (from this question).
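For the file generated in the question, a sketch with loadtxt could look like this (skipping the 'a,b' header row):
import numpy as np

arr = np.loadtxt('my_file.csv', delimiter=',', skiprows=1, dtype=int)
print(arr.shape)  # (1000000, 2)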

Convert image to numpy array, save it into Excel and reverse it all

I tried my best to solve this issue on my own, and I think I've hit a roadblock as it stands so far. I'm new to Python, and so far so good on a web scraping project I'm trying to complete. What I'm currently trying to accomplish is to take an image, convert it into something readable by pandas like a text string, store that in a single Excel cell, and later convert it back from text into the image for a finished product in Excel.
I tried a few different methods like base64, which works for conversions to and from the image but exceeds my single-Excel-cell requirement. I'm currently pursuing an approach where I store the image as a NumPy array in the pandas dataframe and write that to Excel, which seems to work as it retains the numbers and structure, but I'm having issues reimporting it back into NumPy (I'm pretty sure it's an issue of going from integers to strings and trying to go back without really knowing how).
The initial dtype of the image array, upon conversion from image to array, is uint8.
The stored Excel text string of the array, when brought back into NumPy, is U786. I've tried reconverting the string in NumPy, but I can't figure out how to do it.
A few potential roadblocks:
The image is a screenshot from Selenium that I am saving from my web scraping
I'm writing all my scraped content, including the image screenshot converted to an array, from a pandas dataframe to Excel in one go.
I'm using XlsxWriter as my Excel writer to write the dataframe and would like to continue doing so if possible.
Below is an example of the code I'm using for this project. I'm open to any and all potential approaches that would fix this issue.
import numpy as np
from PIL import Image
import openpyxl as Workbook
import pandas as pd
import matplotlib.pyplot as plt
#----Opens Image of interest, adds text, and appends to dataframe
MyDataTable = [] #Datatable to write to Excel
ExampleTextString = "Text for Example"  # Only used because without it pandas gives a "not passing a 2D array" error when saving to Excel
MyDataTable.append(ExampleTextString)
img = Image.open('example.png') # uses PIL library to open image in memory
imgtoarray = np.array(img) # imgtoarray.shape: height x width x channel
MyDataTable.append(imgtoarray) #adds my array to dataframe
#----Check my array image with matplotlib
plt.imshow(imgtoarray)
#----Pandas & Excelwriter to Excel
df = pd.DataFrame(MyDataTable)
df.to_excel('ExampleSpreadsheet.xlsx', engine="xlsxwriter", header=False, index=False)
#------Open Array Test Data and where NumPy Array is Saved-----
wb = Workbook.load_workbook(filename='ExampleSpreadsheet.xlsx')
sheet_ranges = wb['Sheet1']
testarraytoimg = sheet_ranges['A2'].value
print (testarraytoimg)

OverflowError with Pandas to_hdf

Python newbie here.
I am trying to save a large data frame into an HDF file with lz4 compression using to_hdf.
I use Windows 10, Python 3, Pandas 20.2.
I get the error "OverflowError: Python int too large to convert to C long".
None of the machine resources are close to their limits (RAM, CPU, swap usage).
Previous posts discuss the dtype, but the following example shows that there is some other problem, potentially related to the size?
import numpy as np
import pandas as pd
# sample dataframe to be saved, pardon my French
n = 500*1000*1000
df = pd.DataFrame({'col1': [999999999999999999]*n,
                   'col2': ['aaaaaaaaaaaaaaaaa']*n,
                   'col3': [999999999999999999]*n,
                   'col4': ['aaaaaaaaaaaaaaaaa']*n,
                   'col5': [999999999999999999]*n,
                   'col6': ['aaaaaaaaaaaaaaaaa']*n})
# works fine
lim=200*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
# works fine
lim=300*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
# Error
lim=400*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
....
OverflowError: Python int too large to convert to C long
I experienced the same issue and it seems that it is indeed connected to the size of the data frame rather than to dtype (I had all the columns stored as strings and was able to store them to .h5 separately).
The solution that worked for me is to save the data frame in chunks using mode='a'.
As suggested in the pandas documentation, for mode {'a', 'w', 'r+'}, default 'a': 'a' means append; an existing file is opened for reading and writing, and if the file does not exist it is created.
So the sample code would look something like:
batch_size = 1000
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5', 'table', complib='blosc:lz4', mode='a')
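Note that with the default fixed format, each call with mode='a' still overwrites the existing 'table' key, so only the last chunk survives. A sketch that genuinely appends every chunk to the same key (table format is required for appending, and this assumes the columns are append-compatible):
batch_size = 1000
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5', 'table', complib='blosc:lz4', mode='a',
                    format='table', append=True)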
As #Giovanni Maria Strampelli pointed out, the answer of #Artem Snorkovenko only saves the last batch. The pandas documentation states the following:
In order to add another DataFrame or Series to an existing HDF file please use append mode and a different key.
Here is a possible workaround to save all batches (adjusted from the answer of #Artem Snorkovenko):
for i in range(len(df)):
    sr = df.loc[i]  # pandas Series object for the given index
    sr.to_hdf('df.h5', key='table_%i' % i, complib='blosc:lz4', mode='a')
This code saves each Pandas Series object with a different key. Each key is indexed by i.
To load the existing .h5 file after saving, one can do the following:
i = 0
dfdone = False  # if True, all keys in the .h5 file are successfully loaded
srl = []  # list of the loaded Series objects
while dfdone == False:
    # print(i)  # this is to see if the code is working properly
    try:  # check whether the current i value exists in the keys of the .h5 file
        sdfr = pd.read_hdf('df.h5', key='table_%i' % i)  # current Series object
        srl.append(sdfr)  # append each Series to a list to create the dataframe in the end
        i += 1  # increment i by 1 after loading the Series object
    except:  # if an error occurs, the current i value exceeds the number of keys; all keys are loaded
        dfdone = True  # terminate the while loop
df = pd.DataFrame(srl)  # generate the dataframe from the list of Series objects
I used a while loop, assuming we do not know the exact length of the dataframe in the .h5 file. If the length is known, a for loop can also be used.
Note that I am not saving dataframes in chunks here. Thus, the loading procedure in its current form is not suitable for saving in chunks, where the data type would be DataFrame for each chunk. In my implementation, each saved object is a Series, and the DataFrame is generated from a list of Series. The code I provided can be adjusted to work for saving in chunks and generating a DataFrame from a list of DataFrame objects (a nice starting point can be found in this Stack Overflow entry).
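A minimal sketch of that adjustment (the chunk size is arbitrary, and it assumes that reading a missing key raises KeyError):
chunk_size = 100000
for i, chunk in df.groupby(np.arange(len(df)) // chunk_size):
    chunk.to_hdf('df.h5', key='chunk_%i' % i, complib='blosc:lz4', mode='a')

# Loading: read keys until one is missing, then concatenate.
chunks, i = [], 0
while True:
    try:
        chunks.append(pd.read_hdf('df.h5', key='chunk_%i' % i))
        i += 1
    except KeyError:  # no more chunk keys in the file
        break
df_loaded = pd.concat(chunks)  # reassemble the full dataframe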
In addition to #tetrisforjeff's answer:
If the df contains object dtypes, the reading could lead to errors. I would suggest using pd.concat(srl) instead of pd.DataFrame(srl).
