chunksize keyword of read_excel is not implemented - python

In version 0.16.1 the chunksize argument was available.
See: http://pandas.pydata.org/pandas-docs/version/0.16.1/generated/pandas.ExcelFile.parse.html
But in the latest version it's not available.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.ExcelFile.parse.html
What was the reason that it was removed?
Also, how should I process an Excel file in chunks in the latest version?
I used to do the following:
import pandas as pd

excel = pd.ExcelFile("test.xlsx")
for sheet in excel.sheet_names:
    reader = excel.parse(sheet, chunksize=1000)
    for chunk in reader:
        # process chunk
        ...

As EdChum explained in the comments, this feature was removed in 0.17.0. Chris gave the following reason in a comment:
there's no super-compelling reason; the main idea was to match up with
api of to_excel, i.e. the "ExcelFileWrapper" (ExcelFile, ExcelWriter)
doesn't have any pandas-specific functionality, instead you pass it
into the io functions (read_excel, to_excel).
I did update the docs to cover that specific example. edit: although
it may be hard to see in the diff - rendered below.
Source: https://github.com/pandas-dev/pandas/pull/11198
I still wonder if there's an alternative way to read Excel files in chunks.
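One workaround I've seen (a hedged sketch, not a built-in pandas feature) is to emulate chunksize with read_excel's skiprows and nrows parameters. It re-parses the file for every chunk, so it trades speed for bounded memory; the sheet iteration mirrors the original snippet above:

import pandas as pd

def read_excel_in_chunks(path, sheet_name, chunksize=1000):
    # Emulate chunked reading: skip the rows already consumed, read the next
    # `chunksize` rows, and stop once an empty frame comes back.
    first_data_row = 1  # row 0 holds the header
    while True:
        chunk = pd.read_excel(path, sheet_name=sheet_name, header=0,
                              skiprows=range(1, first_data_row),
                              nrows=chunksize)
        if chunk.empty:
            break
        yield chunk
        first_data_row += chunksize

excel = pd.ExcelFile("test.xlsx")
for sheet in excel.sheet_names:
    for chunk in read_excel_in_chunks("test.xlsx", sheet):
        # process chunk
        ...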

Convert huge csv to hdf5 format

I downloaded IBM's Airline Reporting Carrier On-Time Performance Dataset; the uncompressed CSV is 84 GB. I want to run an analysis, similar to Flying high with Vaex, with the vaex library.
I tried to convert the CSV to an HDF5 file to make it readable for the vaex library:
import time
import vaex

start = time.time()
df = vaex.from_csv(r"D:\airline.csv", convert=True, chunk_size=1000000)
end = time.time()
print("Time:", (end - start), "Seconds")
I always get an error when running the code:
RuntimeError: Dirty entry flush destroy failed (file write failed: time = Fri Sep 30 17:58:55 2022
, filename = 'D:\airline.csv_chunk_8.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 0000021EA8C6B128, total write size = 2040, bytes this sub-write = 2040, bytes actually written = 18446744073709551615, offset = 221133661).
Second run, I get this error:
RuntimeError: Unable to flush file's cached information (file write failed: time = Fri Sep 30 20:18:19 2022
, filename = 'D:\airline.csv_chunk_18.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 000002504659B828, total write size = 2048, bytes this sub-write = 2048, bytes actually written = 18446744073709551615, offset = 348515307)
Is there an alternative way to convert the CSV to hdf5 without Python? For example, a downloadable software which can do this job?
I'm not familiar with vaex, so can't help with usage and functions. However, I can read error messages. :-)
It reports "bytes written" with a huge number (18_446_744_073_709_551_615), much larger than the 84GB CSV. Some possible explanations:
you ran out of disk
you ran out of memory, or
had some other error
To diagnose, try testing with a small csv file and see if vaex.from_csv() works as expected. I suggest the lax_to_jfk.csv file.
Regarding your question "is there an alternative way to convert a CSV to HDF5?": why not use Python?
Are you more comfortable with other languages? If so, you can install HDF5 and write your code with their C or Fortran API.
OTOH, if you are familiar with Python, there are other packages you can use to read the CSV file and create the HDF5 file.
Python packages to read the CSV
Personally, I like NumPy's genfromtxt() to read the CSV. (You can also use loadtxt() to read the CSV, if you don't have missing values and don't need the field names.) However, I think you will run into memory problems reading an 84 GB file. That said, you can use the skip_header and max_rows parameters with genfromtxt() to read and load a subset of lines. Alternately you can use csv.DictReader(). It reads a line at a time, so you avoid memory issues, but it could be very slow loading the HDF5 file.
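For illustration, a small hedged sketch of that slicing idea (assuming the CSV is in the working directory and named airline.csv): max_rows caps how much is read in one call, and skip_header lets a later call continue where this one stopped.

import numpy as np

# Read only the first 1,000,000 data rows; the header row supplies the names.
rec_arr = np.genfromtxt('airline.csv', delimiter=',', dtype=None,
                        names=True, encoding='bytes', max_rows=1_000_000)
print(rec_arr.shape, rec_arr.dtype.names[:5])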
Python packages to create the HDF5 file
I have used both h5py and pytables (aka tables) to create and read HDF5 files. Once you load the CSV data to a NumPy array, it's a snap to create the HDF5 dataset.
Here is a very simple example that reads the lax_to_jfk.csv data and loads to a HDF5 file.
import numpy as np
import h5py

csv_name = 'lax_to_jfk'
rec_arr = np.genfromtxt(csv_name + '.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes')

with h5py.File(csv_name + '.h5', 'w') as h5f:
    h5f.create_dataset(csv_name, data=rec_arr)
Update:
After posting this example, I decided to test with a larger file (airline_2m.csv). It's 861 MB and has 2M rows. I discovered the code above doesn't work. However, it's not because of the number of rows; the problem is the columns (field names). It turns out the data isn't as clean: there are 109 field names on row 1, but some rows have 111 columns of data. As a result, the auto-generated dtype doesn't have a matching field for every column. While investigating this, I also discovered many rows only have values for the first 56 fields. In other words, fields 57-111 are not very useful. One solution is to add the usecols=() parameter. The code below reflects this modification and works with this test file. (I have not tried testing with your large file airline.csv. Given its size you will likely need to read and load incrementally, as sketched after the code.)
import numpy as np
import h5py

csv_name = 'airline_2m'
rec_arr = np.genfromtxt(csv_name + '.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes',
                        usecols=(i for i in range(56)))

with h5py.File(csv_name + '.h5', 'w') as h5f:
    h5f.create_dataset(csv_name, data=rec_arr)
I tried reproducing your example. I believe the problem you are facing is quite common when dealing with CSVs: the schema is not known.
Sometimes there are "mixed types" and pandas (used underneath vaex's read_csv or from_csv) casts those columns as dtype object.
Vaex does not really support such mixed dtypes and requires each column to be of a single uniform type (kind of like a database).
So how to get around this? Well, the best way I can think of is to use the dtype argument to explicitly specify the types of all columns (or those that you suspect or know to have mixed types). I know this file has 100+ columns and that's annoying, but that is also kind of the price to pay when using a format such as CSV.
Another thing I noticed is the encoding: using pure pandas.read_csv failed at some point because of the encoding and required adding encoding="ISO-8859-1". This is also supported by vaex.open (since the args are just passed down to pandas).
In fact, if you want to do manually what vaex.open does automatically for you (given that this CSV file might not be as clean as one would hope), do something like this (this is pseudo code, but I hope close to the real thing):
import pandas as pd
import vaex

# Iterate over the file in chunks
for i, df_tmp in enumerate(pd.read_csv(file, chunksize=11_000_000,
                                        encoding="ISO-8859-1", dtype=dtype)):
    # Assert or check or do whatever needs doing to ensure column types are as they should be

    # Pass the data to vaex (this does not take extra RAM):
    df_vaex = vaex.from_pandas(df_tmp)

    # Export this chunk into HDF5:
    df_vaex.export_hdf5(f'chunk_{i}.hdf5')

# When the above loop finishes, concat and export the data to a single file
# if needed (gives some performance benefit):
df = vaex.open('chunk*.hdf5')
df.export_hdf5('converted.hdf5', progress='rich')
I've seen a potentially much better/faster way of doing this with vaex, but it is not released yet (I saw it in the code repo on GitHub), so I will not go into it; but if you can install from source and want me to elaborate further, feel free to drop a comment.
Hope this at least gives some ideas on how to move forward.
EDIT:
In the last couple of versions of vaex-core, vaex.open() opens all CSV files lazily, so you can just export to hdf5/arrow directly and it will do it in one go. Check the docs for more details: https://vaex.io/docs/guides/io.html#Text-based-file-formats
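A minimal sketch of that newer approach (assuming a recent vaex version and the paths from the question):

import vaex

# open() exposes the CSV lazily; the export then walks it chunk by chunk.
df = vaex.open(r"D:\airline.csv")
df.export_hdf5(r"D:\airline.hdf5", progress='rich')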

How to deal with the warning: "Workbook contains no default style, apply openpyxl's default"

I have the current latest versions of pandas, openpyxl, and xlrd:
openpyxl: 3.0.6
pandas: 1.2.2
xlrd: 2.0.1
I have a generated Excel .xlsx file (an export from a web application).
I read it with pandas:
myexcelfile = pd.read_excel(easy_payfile, engine="openpyxl")
Everything goes ok, I can successfully read the file.
But I do get a warning:
/Users/*******/projects/environments/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:214: UserWarning: Workbook contains no default style, apply openpyxl's default
warn("Workbook contains no default style, apply openpyxl's default")
The documentation doesn't shed too much light on it.
Is there any way I can add an option to avoid this warning?
I prefer not to suppress it.
I don't think the library offers you a way to disable this, so you will need to use the warnings package directly.
A simple, localized solution to the problem would be:
import warnings

with warnings.catch_warnings(record=True):
    warnings.simplefilter("always")
    myexcelfile = pd.read_excel(easy_payfile, engine="openpyxl")
Passing the engine parameter got rid of the warning for me: df = pd.read_excel("my.xlsx", engine="openpyxl"). The default is None, so I think it is just warning you that it is using openpyxl's default style.
I had the same warning. I just changed the sheet name of my Excel file from "sheet_1" to "Sheet1" and the warning disappeared (very similar to Yoan's case). I think pandas should fix this warning later.
@ruhanbidart's solution is better because it turns off warnings just for the call to read_excel, but if you have dozens of calls to pd.read_excel, you can simply disable all warnings:
import warnings

warnings.simplefilter("ignore")
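If silencing everything is too broad, a narrower variant (assuming the message text matches the warning shown above) is to filter on the message itself:

import warnings

# Ignore only this specific openpyxl message; other warnings still show.
warnings.filterwarnings("ignore", message="Workbook contains no default style")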
I had the exact same warning and was unable to read the file. In my case the problem was coming from the sheet name in the Excel file.
The initial name contained a . (e.g. MDM.TARGET). I simply replaced the . with _ and everything was fine.
In my situation some column names had a dollar sign ($) in them. Replacing '$' with '_' solved the issue.

How to convert Pandas DataFrame to RDF (Resource Description Framework)?

I'm looking for a recipe for converting Pandas DataFrames to RDF data in Python. I'm aware of the following Python modules (I know how to Google!), but they do not work for me:
rdfpandas
pandasrdf
Neither seems mature. I have problems with both. In the case of rdfpandas, I'm unable to install it, and there are no examples and insufficient documentation. In the case of pandasrdf, the example doesn't work and crashes. I can fix it, but the RDF file has zero triples, so the result is useless. I'd rather not have to write the data out to some intermediate file that I have to ingest later. Pandas -> numpy -> RDF would be OK, I guess. Does anybody have a working example of converting a Pandas DataFrame to RDF in one of the common serialisation formats that does not involve an artisanal black-magic package installation?
A newer version of RdfPandas is out, so you can try it and see if it covers your use case: https://rdfpandas.readthedocs.io/en/latest (thanks to Carmoreno for the prompt to fix the link).
Example based on https://github.com/cadmiumkitty/capability-models/blob/master/notebooks/investment_management_capabilities.csv is below
import pandas as pd
import rdfpandas

df = pd.read_csv('investment_management_capabilities.csv', index_col='#id', keep_default_na=True)
g = rdfpandas.to_graph(df)
ttl = g.serialize(format='turtle')

with open('investment_management_capabilities.ttl', 'wb') as file:
    file.write(ttl)
The code that does the conversion is pretty minimal (just look at the to_graph method): https://github.com/cadmiumkitty/rdfpandas/blob/master/rdfpandas/graph.py, so you can use it directly as inspiration to create your own conversion logic.
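If you end up rolling your own, a minimal hedged sketch with plain rdflib (not rdfpandas itself) might look like the following; the example.org namespace and the column-to-predicate mapping are made up for illustration:

import pandas as pd
from rdflib import Graph, Literal, URIRef

def df_to_graph(df, base='http://example.org/'):
    # One triple per cell: the DataFrame index becomes the subject,
    # each column name becomes a predicate in the (made-up) base namespace.
    g = Graph()
    for idx, row in df.iterrows():
        subject = URIRef(base + str(idx))
        for col, value in row.items():
            if pd.notna(value):
                g.add((subject, URIRef(base + str(col)), Literal(value)))
    return g

df = pd.DataFrame({'name': ['Alice', 'Bob']}, index=['person1', 'person2'])
print(df_to_graph(df).serialize(format='turtle'))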

How to read a comma-separated value file with a .txt extension into Python as an array?

I am a biologist and very, very new to Python; before this, I learnt a bit of R.
So I have a very big text file (3 GB, too big to handle in R). All values are comma separated, but the extension is .txt (I don't know if that is necessary information). What I want to do is:
read it into Python as an object which is the equivalent of a dataframe in R,
get rid of columns in the middle,
reduce the size of the object,
write it as a txt file,
take the rest to R.
If you can help me I would be very happy.
Thank you.
There is no real need to go into Python first. Your question looks a lot like this question. The answer marked as correct iteratively reads the large file and creates a new, smaller file. Other good alternatives are using sqlite with the sqldf package, or using the ff package. This last approach works particularly well if the number of columns is small compared to the number of rows.
This will take minimal memory as it does not load the whole file at once.
import csv

with open('in.txt', 'rb') as f_in, open('out.csv', 'wb') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        # keep first two columns and last three columns
        writer.writerow(row[:2] + row[-3:])
Note: If using Python 3, change the file modes to 'r' and 'w', respectively, as shown below.
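For reference, the same snippet written for Python 3 (text mode, plus newline='' so the csv module controls the line endings itself):

import csv

with open('in.txt', 'r', newline='') as f_in, open('out.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        # keep first two columns and last three columns
        writer.writerow(row[:2] + row[-3:])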
I am not familiar with R dataframes, but pandas provides helpers to read a CSV into a pandas DataFrame:
from pandas import read_csv

df = read_csv('yourfile.txt')
print(df)
print(df['Line'])
If that is not what you need, you can use the csv module to iterate through each line of your CSV as a Python list and put it into whatever data structure you want.
If you insist on using a preprocessing step, the Linux command-line tools are a really good and fast option. If you use Linux, these tools are already installed; under Windows you'll need to first install MinGW or Cygwin. This SO question already provides some nice pointers. In essence, you use the awk tool to iteratively process the text file, creating an output text file as you go. Copying from the accepted answer of the SO question I linked:
awk -F "," '{ split ($8,array," "); sub ("\"","",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt
This reads the file, grabs the eighth column, and dumps it to a file. See the answer for more details.
Per CRAN (new features and bug fixes re: development), the new development build 3.0.0 should allow R to use the pagefile/swap. On Windows you will need to set R_MAX_MEM_SIZE to a suitably large value.

Save .dta files in Python

I'm wondering if anyone knows a Python package that allows you to save numpy arrays/recarrays in the .dta format of the statistical data analysis software Stata. This would really speed up a few steps in a system I have.
The scikits.statsmodels package includes a reader for Stata data files, which relies in part on PyDTA as pointed out by @Sven. In particular, genfromdta() will return an ndarray, e.g.
from Python 2.7/statsmodels 0.3.1:
>>> import scikits.statsmodels.api as sm
>>> arr = sm.iolib.genfromdta('/Applications/Stata12/auto.dta')
>>> type(arr)
<type 'numpy.ndarray'>
The savetxt() function can be used in turn to save an array as a text file, which can be imported in Stata. For example, we can export the above as
>>> sm.iolib.savetxt('auto.txt', arr, fmt='%2s', delimiter=",")
and read it in Stata without a dictionary file as follows:
. insheet using auto.txt, clear
I believe a *.dta writer should be added in the near future.
The only Python library for Stata interoperability I could find merely provides read-only access to .dta files. The R foreign library, however, provides a function write.dta, and RPy provides a Python interface to R. Maybe the combination of these tools can help you.
pandas DataFrame objects now have a "to_stata" method. So you can do, for instance:
import pandas as pd
df = pd.read_stata('my_data_in.dta')
df.to_stata('my_data_out.dta')
DISCLAIMER: the first step is quite slow (in my test, around 1 minute for reading a 51 MB .dta file; also see this question), and the second produces a file which can be way larger than the original (in my test, the size goes from 51 MB to 111 MB). This answer may look less elegant, but it is probably more efficient.
