I am trying to get data from an Excel file using xlwings (I am new to Python) and load it into a multidimensional array (or rather, a table) that I could then loop through later, row by row.
What I would like to do:
import xlwings as xw

db = []
wdb = xw.Book(r'C:\temp\xlpython\db.xlsx')
db.append(wdb.sheets[0].range('A2:K2').expand('down'))
So this would load the data into my table 'db', and I could later loop through it using:
for i in range(len(db)):
    print(db[i][1])
if, for instance, I wanted to retrieve the data originally in column B.
But instead of this, it loads the data in a single dimension, so if I run the code:
print(range(len(db)))
I get range(0, 1) instead of the range(0, 146) I would expect with 146 rows of data in the Excel file.
Is there a way to do this, other than loading the table line by line?
Thanks
Have a look at the documentation here on converting the range to a numpy array or specifying the dimensions.
import numpy as np
import xlwings as xw

db = []
wdb = xw.Book(r'C:\temp\xlpython\db.xlsx')
db.append(wdb.sheets[0].range('A2:K2').options(np.array, expand='down').value)
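The documentation's mention of specifying the dimensions presumably refers to the ndim option, which keeps the values as a list of rows (a list of lists) without needing numpy at all; a minimal, untested sketch:
import xlwings as xw

wdb = xw.Book(r'C:\temp\xlpython\db.xlsx')
# ndim=2 forces a two-dimensional result (list of lists), even for a single row
db = wdb.sheets[0].range('A2:K2').options(ndim=2, expand='down').value
print(len(db))    # number of rows
print(db[0][1])   # value that came from column B of the first row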
After looking at numpy arrays as suggested by Rawson, it seems they have the same behaviour as Python lists when appending a whole range, meaning the result is a flat array that does not preserve the rows of the Excel range; at least I couldn't get it to work that way.
So finally I looked into the pandas DataFrame, and it seems to do exactly the job needed; you can even import column titles, which is a plus.
import pandas as pd
import xlwings as xw

wdb = xw.Book(r'C:\temp\xlpython\db.xlsx')
db = pd.DataFrame(wdb.sheets[0].range('A2:K2').expand('down').value)
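With the DataFrame in place, the original row-by-row loop works as intended; since the range starts at column A, the data from column B sits at positional index 1:
# iterate row by row; iloc[i, 1] holds the value that came from column B
for i in range(len(db)):
    print(db.iloc[i, 1])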
I have a large Excel file, and I am organizing its column names into a unique list.
The code below works, but it takes ~9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd
import os
get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl').columns)
print(get_col)
Using pandas to extract just the column names of a large excel file is very inefficient.
You can use openpyxl for this:
from openpyxl import load_workbook
wb = load_workbook(r"E:\DATA\dbo.xlsx", read_only=True)
columns = {}
for sheet in wb.worksheets:
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        columns = value
Assuming you only have one sheet, you will get a tuple of column names here.
If you want faster reading, then I suggest you use another file type. Excel files, while convenient and fast to work with, are binary files, so for pandas to read and correctly parse them it must load the full file. Using nrows or skipfooter to work with less data only takes effect after the full file has been loaded, and therefore should not really affect the waiting time. With a .csv file, by contrast, given its type and the lack of significant metadata, you can extract just the first rows as an iterable using the chunksize parameter in pd.read_csv().
Other than that, calling list() on a dataframe already returns a list of its columns. So my only suggestion for the code you use is:
get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl'))
The stronger suggestion is to change the file type if you specifically want to address this issue.
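For example, if the same data were exported to a .csv, grabbing only the header line becomes cheap; a minimal sketch (the .csv path is hypothetical, since the file in the question is .xlsx):
import csv

# read and parse only the first line of the file as the header row
with open(r"E:\DATA\dbo.csv", newline='') as f:
    get_col = next(csv.reader(f))
print(get_col)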
I am trying to export a .csv file from a .mat file, which was generated with OpenModelica. The following code seems to work quite well:
from scipy.io import loadmat
import numpy
x = loadmat('results.mat')
traj=x['data_2'][0]
numpy.savetxt("results.csv", traj, delimiter=",")
However, there is an issue that I cannot solve. The line traj=x['data_2'][0] is taking an array with the values (over time) of the first variable in the file (index is 0). The problem is that I cannot make a correspondence between the variable I am looking for and its index. Let's say that I want to print the values of a variable called "My_model.T". How do I know the index of this variable?
The file format is described here: https://www.openmodelica.org/doc/OpenModelicaUsersGuide/1.17/technical_details.html#the-matv4-result-file-format
So you need to look up the variable's name in the name matrix, then look in the dataInfo matrix to see whether the variable is stored in data_1 or data_2 and which index it has in that matrix.
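A rough sketch of that lookup with scipy, following the format description linked above; note this is untested, and the name and dataInfo matrices may come out transposed depending on how the file was written, so you may need to adjust the indexing or add a .T:
from scipy.io import loadmat
import numpy

x = loadmat('results.mat')

# 'name' is a character matrix with one variable name per entry,
# padded with blanks/NULs; strip the padding before comparing
names = [str(n).rstrip('\x00 ') for n in x['name']]
idx = names.index('My_model.T')

# dataInfo for variable idx: [which data matrix (1 or 2), 1-based index in it, ...]
# (transpose dataInfo if needed); a negative index means the trajectory is a negated alias
matrix_no = int(x['dataInfo'][idx][0])
pos = int(x['dataInfo'][idx][1])

traj = x['data_%d' % matrix_no][abs(pos) - 1]
if pos < 0:
    traj = -traj

numpy.savetxt('My_model.T.csv', traj, delimiter=',')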
Edit: And since the title asked how to create a CSV from a MAT-file... You can do this from an OpenModelica .mos-script:
filterSimulationResults("M_res.mat", "M_res.csv", readSimulationResultVars("M_res.mat"))
I am trying to populate an empty dataframe by using the csv module to iterate over a large tab-delimited file, and replacing each row in the dataframe with these values. (Before you ask, yes I have tried all the normal read_csv methods, and nothing has worked because of dtype issues and how large the file is).
I first made an empty numpy array using np.empty, using the dimensions of my data. I then converted this to a pandas DataFrame. Then, I did the following:
import csv

with open(input_file) as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t')
    row_num = 0
    for row in reader:
        for key, value in row.items():
            df.loc[row_num, key] = value
        row_num += 1
This is working great, except that my file has 900,000 columns, so it is unbelievably slow. This also feels like something that pandas could do more efficiently, but I have been unable to find out how. The dictionary for each row given by DictReader looks like:
{'columnName1':<value>,'columnName2':<value> ...}
Where the values are what I want to put in the dataframe in those columns for that row.
Thanks!
What you could do in this case is build smaller chunks of your big CSV data file. I had the same issue with a 32 GB CSV file, so I had to build chunks. After reading them in, you can work with them.
import pandas as pd

# read the large csv file with the specified chunksize
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
chunksize=1000000 sets how many rows are read in at once.
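A minimal sketch of how the chunks might then be consumed, assuming the goal is simply to rebuild the full dataframe in memory (the path is the placeholder from above; sep='\t' and dtype=str are assumptions matching the tab-delimited file and the dtype issues mentioned in the question):
import pandas as pd

chunks = []
# read_csv with chunksize returns an iterator of DataFrames
for chunk in pd.read_csv(r'../input/data.csv', sep='\t', chunksize=1000000, dtype=str):
    # process or filter each chunk here if needed, then keep it
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)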
Helpful website:
https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c
I have a dictionary as follows:
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
I want to save this dictionary in Databricks so that I do not have to obtain it again every time I want to start working with it. Furthermore, I would like to know how to retrieve it and have it in its original form again.
I have tried doing the following:
from itertools import zip_longest
column_names, data = zip(*my_dict.items())
spark.createDataFrame(zip_longest(*data), column_names).show()
and
column_names, data = zip(*dict_brands.items())
spark.createDataFrame(zip(*data), column_names).show()
However, I get the following error:
zip_longest argument #10342 must support iteration
I also do not know how to reload it or upload it. I tried with a sample dataframe (not the same one), as follows:
df.write.format("tfrecords").mode("overwrite").save('/data/tmp/my_df')
And the error is:
Attribute name "my_column" contains invalid character(s)
among " ,;{}()\n\t=". Please use alias to rename it.
Finally, in order to obtain it, I thought about:
my_df = spark.table("my_df") # Get table
df = my_df.toPandas() # Make pd dataframe
and then convert it into a dictionary, but maybe there is an easier way than making it a dataframe, retrieving it as a dataframe, and converting it back into a dictionary again.
I would also like to know the computational cost for the solutions, since the actual dataset is very large.
Here is my sample code to do what you need, step by step.
Convert a dictionary to a Pandas dataframe
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
import pandas as pd
pdf = pd.DataFrame(my_dict)
Convert a Pandas dataframe to a PySpark dataframe
df = spark.createDataFrame(pdf)
To save a PySpark dataframe to a file using the parquet format (the tfrecords format is not supported here).
df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')
To load the saved file above as a PySpark dataframe.
df2 = spark.read.format("parquet").load('/data/tmp/my_df')
To convert a PySpark dataframe to a dictionary.
my_dict2 = df2.toPandas().to_dict()
The computational cost of the code above depends on the memory usage of your actual dataset.
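One caveat worth adding: to_dict() defaults to orient='dict', which returns nested {column: {index: value}} mappings rather than the original list-valued dictionary, and Spark may not preserve row order across a write/read round trip. To get back the original shape:
# orient='list' gives back {'a': [...], 'b': [...], 'c': [...]} like the original my_dict
my_dict2 = df2.toPandas().to_dict('list')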
Python newbie here.
I am trying to save a large data frame into an HDF file with lz4 compression using to_hdf.
I use Windows 10, Python 3, and pandas 0.20.2.
I get the error “OverflowError: Python int too large to convert to C long”.
None of the machine resources are close to their limits (RAM, CPU, SWAP usage)
Previous posts discuss dtype issues, but the following example shows that there is some other problem, potentially related to the size:
import numpy as np
import pandas as pd
# sample dataframe to be saved, pardon my French
n = 500*1000*1000
df = pd.DataFrame({'col1':[999999999999999999]*n,
'col2':['aaaaaaaaaaaaaaaaa']*n,
'col3':[999999999999999999]*n,
'col4':['aaaaaaaaaaaaaaaaa']*n,
'col5':[999999999999999999]*n,
'col6':['aaaaaaaaaaaaaaaaa']*n})
# works fine
lim=200*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
# works fine
lim=300*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
# Error
lim=400*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
....
OverflowError: Python int too large to convert to C long
I experienced the same issue and it seems that it is indeed connected to the size of the data frame rather than to dtype (I had all the columns stored as strings and was able to store them to .h5 separately).
The solution that worked for me is to save the data frame in chunks using mode='a'.
As stated in the pandas documentation for mode {'a', 'w', 'r+'} (default 'a'): 'a' means append; an existing file is opened for reading and writing, and if the file does not exist it is created.
So the sample code would look something like:
batch_size = 1000
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5', 'table', complib='blosc:lz4', mode='a')
As #Giovanni Maria Strampelli pointed out, the answer of #Artem Snorkovenko only saves the last batch. Pandas documentation states the following:
In order to add another DataFrame or Series to an existing HDF file please use append mode and a different key.
Here is a possible workaround to save all batches (adjusted from the answer of #Artem Snorkovenko):
for i in range(len(df)):
    sr = df.loc[i]  # pandas Series object for the given index
    sr.to_hdf('df.h5', key='table_%i' % i, complib='blosc:lz4', mode='a')
This code saves each Pandas Series object with a different key. Each key is indexed by i.
To load the existing .h5 file after saving, one can do the following:
i = 0
dfdone = False  # if True, all keys in the .h5 file are successfully loaded
srl = []  # list of Series objects
while dfdone == False:
    # print(i)  # this is to see if the code is working properly
    try:  # check whether the current i value exists among the keys of the .h5 file
        sdfr = pd.read_hdf('df.h5', key='table_%i' % i)  # current Series object
        srl.append(sdfr)  # append each Series to a list to create the dataframe at the end
        i += 1  # increment i by 1 after loading the Series object
    except:  # if an error occurs, the current i value exceeds the number of keys; all keys are loaded
        dfdone = True  # terminate the while loop
df = pd.DataFrame(srl)  # generate the dataframe from the list of Series objects
I used a while loop, assuming we do not know the exact length of the dataframe in the .h5 file. If the length is known, a for loop can also be used.
Note that I am not saving dataframes in chunks here, so the loading procedure in its current form is not suitable for chunked saving, where the data type of each saved object would be DataFrame. In my implementation, each saved object is a Series, and the DataFrame is generated from a list of Series. The code I provided can be adjusted to save in chunks and to generate a DataFrame from a list of DataFrame objects (a nice starting point can be found in this Stack Overflow entry).
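To illustrate that adjustment, a hedged sketch of chunked saving and loading (the key names and chunk size are arbitrary choices, not tested at the scale in the question):
import pandas as pd

chunk_size = 1000000

# save the dataframe in row chunks, each under its own key
for i, start in enumerate(range(0, len(df), chunk_size)):
    df.iloc[start:start + chunk_size].to_hdf(
        'df.h5', key='chunk_%i' % i, complib='blosc:lz4', mode='a')

# load the chunks back and rebuild the dataframe
chunks = []
i = 0
while True:
    try:
        chunks.append(pd.read_hdf('df.h5', key='chunk_%i' % i))
        i += 1
    except KeyError:  # no more keys in the file
        break
df_loaded = pd.concat(chunks, ignore_index=True)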
In addition to #tetrisforjeff's answer:
If the df contains object dtypes, the reading could lead to an error. I would suggest pd.concat(srl) instead of pd.DataFrame(srl).