I am trying to read a large CSV file and process it, using chunksize to read it in pieces.
import pandas as pd

file = "./data.csv"
df = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)
print(len(df.index))
I get the following error in the code:
AttributeError: 'TextFileReader' object has no attribute 'index'
How can I resolve this?
The error stems from the fact that this pd.read_csv call does not return a DataFrame object. Because the iterator parameter is set to True (and a chunksize is given), it returns a TextFileReader object: an iterator that yields DataFrame objects, each holding at most chunksize rows (1000000 in this case).
In your case, you can't call df.index because an iterator simply has no index attribute. That does not mean the DataFrames inside it are out of reach; it means you either have to loop through the iterator and work with one DataFrame at a time, or concatenate all of those DataFrames into one large one.
If you only need to work with one DataFrame at a time, this is how you would print the index of each chunk:
import pandas as pd

file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)

for df in dfs:
    print(df.index)
    # do something with the chunk, then append it to the output file
    df.to_csv('output_file.csv', mode='a', index=False)
This will save each chunk to a file named output_file.csv. With the mode parameter set to 'a', every write appends to the file, so nothing is overwritten (note that each chunk also writes its own header row; pass header=False on subsequent writes if you want a single header).
However, if your goal is to concatenate all of the chunks into one large DataFrame, the following would be the more direct path:
import pandas as pd

file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)
giant_df = pd.concat(dfs)
print(giant_df.index)
Since you are already using the iterator parameter, I assume you are concerned about memory. If so, the first strategy is the better one: it holds only one chunk in memory at a time, which is exactly the benefit iterators offer for large datasets.
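And if, as in the original snippet, all you actually need is the total number of rows, that too can be computed chunk by chunk; here is a minimal sketch using the same file and separator as above:
import pandas as pd

file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)

# Sum the lengths of the chunks instead of asking the iterator for an index
total_rows = sum(len(chunk.index) for chunk in dfs)
print(total_rows)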
I hope this proves useful.
Related
I have an h5 data file which includes the key rawreport.
I can read rawreport and save it as a dataframe using read_hdf(filename, "rawreport") without any problems. But the data has 17 million rows and I'd like to use chunking.
When I ran this code
chunksize = 10**6
someval = 100
df = pd.DataFrame()
for chunk in pd.read_hdf(filename, 'rawreport', chunksize=chunksize, where='datetime < someval'):
    df = pd.concat([df, chunk], ignore_index=True)
I get "TypeError: can only use an iterator or chunksize on a table"
What does it mean that the rawreport isn't a table and how could I overcome this issue? I'm not the person who created the h5 file.
Chunking is only possible if your file was written in a Table format using PyTables. This must be specified when your file was first written:
df.to_hdf(filename, key='rawreport', format='table')
If this wasn't specified when the file was written, pandas defaults to the fixed format. A fixed-format file is quick to write and to read back, but it has to be read into memory in its entirety, so chunking and the other read_hdf options for selecting particular rows or columns can't be used here.
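One possible workaround, if you can afford to read the file whole a single time, is to rewrite it yourself in table format and chunk from the rewritten copy. A rough sketch (the output filename and the process function are placeholders):
import pandas as pd

# One-off conversion: read the fixed-format key whole (this does need enough
# memory once), then rewrite it in table format so chunked reads work later
df = pd.read_hdf(filename, 'rawreport')
df.to_hdf('rawreport_table.h5', key='rawreport', format='table')  # illustrative output name

# Chunked reading is now possible on the rewritten file
for chunk in pd.read_hdf('rawreport_table.h5', 'rawreport', chunksize=10**6):
    process(chunk)  # placeholder for your per-chunk logic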
I have a folder with many .feather files, and I would like to load all of them into dask in Python.
So far, I have tried the following sourced from a similar question on GitHub https://github.com/dask/dask/issues/1277
import dask
import dask.dataframe as dd
import feather

files = [...]
dfs = [dask.delayed(feather.read_dataframe)(f) for f in files]
df = dd.concat(dfs)
Unfortunately, this gives me the error TypeError: Truth of Delayed objects is not supported which is mentioned there, but a workaround is not clear.
Is it possible to do the above in dask?
Instead of concat, which operates on dataframes, you want to use from_delayed, which turns a list of delayed objects, each of which represents a dataframe, into a single logical dask dataframe:
dfs = [dask.delayed(feather.read_dataframe)(f) for f in files]
df = dd.from_delayed(dfs)
If possible, you should also supply the meta= (a zero-length dataframe, describing the columns, index and dtypes) and divisions= (the boundary values of the index along the partitions) kwargs.
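For instance, if you already know the schema of the feather files, a zero-length frame can serve as meta (the column names and dtypes below are purely illustrative):
import dask
import dask.dataframe as dd
import pandas as pd
import feather

files = [...]  # your list of .feather paths

# Zero-length frame describing the expected columns and dtypes (illustrative names only)
meta = pd.DataFrame({'a': pd.Series(dtype='int64'),
                     'b': pd.Series(dtype='float64')})

dfs = [dask.delayed(feather.read_dataframe)(f) for f in files]
df = dd.from_delayed(dfs, meta=meta)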
I have a file file1.json whose contents are like this (each dict in a separate line):
{"a":1,"b":2}
{"c":3,"d":4}
{"e":9,"f":6}
.
.
.
{"u":31,"v":23}
{"w":87,"x":46}
{"y":98,"z":68}
I want to load this file into a pandas dataframe, so this is what I did:
df = pd.read_json('../Dataset/file1.json', orient='columns', lines=True, chunksize=10)
But this instead of returning a dataframe returns a JSONReader.
[IN]: df
[OUT]: <pandas.io.json.json.JsonReader at 0x7f873465bd30>
Is this normal, or am I doing something wrong? And if this is how read_json() is supposed to behave when a single JSON file contains multiple dictionaries (not comma-separated, each on its own line), how can I best load them into a dataframe?
EDIT:
If I remove the chunksize parameter from read_json(), this is what I get:
[IN]: df = pd.read_json('../Dataset/file1.json', orient='columns', lines=True)
[OUT]: ValueError: Expected object or value
As the docs explain, this is exactly the point of the chunksize parameter:
chunksize: integer, default None
Return JsonReader object for iteration. See the line-delimited json docs for more information on chunksize. This can only be passed if lines=True. If this is None, the file will be read into memory all at once.
The linked docs say:
For line-delimited json files, pandas can also return an iterator which reads in chunksize lines at a time. This can be useful for large files or to read from a stream.
… and then give an example of how to use it.
If you don't want that, why are you passing chunksize? Just leave it out.
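That said, if you do want to keep chunksize (for example because the file is large), the JsonReader it returns can simply be iterated; a short sketch of both routes, reusing the path from the question:
import pandas as pd

# Option 1: drop chunksize and get a DataFrame directly
df = pd.read_json('../Dataset/file1.json', orient='columns', lines=True)

# Option 2: keep chunksize, iterate over the JsonReader yourself,
# and concatenate the chunks at the end if they fit in memory
reader = pd.read_json('../Dataset/file1.json', orient='columns', lines=True, chunksize=10)
df = pd.concat(list(reader), ignore_index=True)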
New to pandas here.
I have a pandas DataFrame called all_invoice with only one column called 'whole_line'.
Each row in all_invoice is a fixed-width string. I need to build a new DataFrame from all_invoice using read_fwf.
I have a working solution that looks like this:
from io import StringIO
import pandas as pd

invoice = pd.DataFrame()
for i, r in all_invoice['whole_line'].iteritems():
    temp_df = pd.read_fwf(StringIO(r), colspecs=in_specs,
                          names=in_cols, converters=in_convert)
    invoice = invoice.append(temp_df, ignore_index=True)
in_specs, in_cols, and in_convert have been defined earlier in my script.
So this solution works but is very slow. For 18K rows with 85 columns, it takes about 6 minutes for this part of the code to execute. I'm hoping for a more elegant solution that doesn't involve iterating over the rows in the DataFrame or Series and that will use the apply function to call read_fwf to make this go faster. So I tried:
invoice = all_invoice['whole_line'].apply(pd.read_fwf, colspecs=in_specs, names=in_cols, converters=in_convert)
The tail end of my traceback looks like:
OSError: [Errno 36] File name too long:
Following that colon is the string that is passed to the read_fwf method. I suspect that this is happening because read_fwf needs a file path or buffer. In my working (but slow) code, I'm able to call StringIO() on the string to make it a buffer but I cannot do that with the apply function. Any help with getting the apply working or another way to make use of the read_fwf on the entire series/df at once to avoid iterating over the rows is appreciated. Thanks.
Have you tried just doing:
invoice = pd.read_fwf(filename, colspecs=in_specs,
names=in_cols, converters=in_convert)
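If reading the original file directly isn't an option and the strings only exist inside all_invoice, a possible alternative (a sketch, assuming the joined text fits comfortably in memory) is to combine the rows into a single in-memory buffer so that read_fwf is called only once:
from io import StringIO
import pandas as pd

# Join every fixed-width row into one buffer and parse it in a single call
buffer = StringIO('\n'.join(all_invoice['whole_line']))
invoice = pd.read_fwf(buffer, colspecs=in_specs, names=in_cols, converters=in_convert)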
I am using pandas to read a csv file and convert it into a numpy array. Earlier I was loading the whole file and was getting memory error. So I went through this link and tried to read the file in chunks.
But now I am getting a different error which says:
AssertionError: first argument must be a list-like of pandas objects, you passed an object of type "TextFileReader"
This is the code I am using:
>>> X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
>>> X = pd.concat(X_chunks, ignore_index=True)
The API reference for read_csv says that it returns either a DataFrame or a TextParser. The problem is that the concat function works fine if X_chunks is a DataFrame, but here its type is TextParser.
Is there any way to force the return type of read_csv, or any workaround to load the whole file as a numpy array?
Since iterator=False is the default, and chunksize forces a TextFileReader object, may I suggest:
X_chunks = pd.read_csv('train_v2.csv')
But if you can't materialize the whole dataset in memory at once, a final suggestion:
X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
for chunk in X_chunks:
    analyze(chunk)
Where analyze is whatever process you've broken up to analyze the chunks piece by piece, since you apparently can't load the entire dataset into memory.
You can't use concat the way you're trying to: it demands that the data be fully materialized, which makes sense, since you can't concatenate something that isn't there yet.
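For instance, if the end goal is a numpy array but the full dataset won't fit in memory, analyze could convert each chunk on the fly (the function body below is only a placeholder for your own computation):
import pandas as pd

def analyze(chunk):
    # Convert just this chunk to a numpy array and do the per-chunk work on it
    arr = chunk.to_numpy()  # on older pandas versions, use chunk.values
    ...  # placeholder: whatever computation you need on arr

X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
for chunk in X_chunks:
    analyze(chunk)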