How to concatenate large dataset into dataframe pandas - python

So I am working with a fairly substantial CSV dataset that is a couple hundred megabytes. I have managed to read in the data in chunks (~100 rows each).
How do I then elegantly convert those chunks into a dataframe and apply the describe function to it?
Thank you

It seems you need to concat the chunks from the TextFileReader object (the output of read_csv when the chunksize parameter is used) and then call describe:
df = pd.concat([x for x in pd.read_csv('filename', chunksize=1000)], ignore_index=True)
df = df.describe()
print(df)
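As a small variation on the same idea, the reader returned by read_csv can also be passed straight to concat without building the list first; a minimal sketch:
import pandas as pd

# The TextFileReader returned when chunksize is set is itself iterable,
# so pd.concat can consume it directly.
reader = pd.read_csv('filename', chunksize=1000)
df = pd.concat(reader, ignore_index=True)
print(df.describe())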

Related

Resampling many timeseries files with pandas/dask

I have many .csv files with two columns, one with timestamps and the other with values. The data is sampled every second. What I would like to do is:
read all files
set index on time column
resample on hours
save to new files (parquet, hdf,...)
1) Only dask
I tried to use dask's read_csv.
import dask.dataframe as dd
import pandas as pd
df = dd.read_csv(
    "../data_*.csv",
    parse_dates=[0],
    date_parser=lambda x: pd.to_datetime(float(x)),
)
So far that's fine. The problem is that I cannot call df.resample("min").mean() directly, because the index of the dask dataframe is not properly set.
After calling df.reset_index().set_index("timestamp") it works - BUT I cannot afford to do this because it is expensive.
2) Workaround with pandas and hdf files
Another approach was to save all csv files to hdf files using pandas. In this case the pandas dataframes were already indexed by time.
df = dd.read_hdf("/data_01.hdf", key="data")
# This doesn't work directly
# df = df.resample("min").mean()
# Error: "Can only resample dataframes with known divisions"
df = df.reset_index().set_index("timestamp") # expensive! :-(
df = df.resample("min").mean() # works!
Of course this works but it would be extremely expensive on dd.read_hdf("/data_*.hdf", key="data").
How can I directly read timeseries data in dask so that it is properly partitioned and indexed?
Do you have any tips or suggestions?
Example data:
import dask
df = dask.datasets.timeseries()
df.to_hdf("dask.hdf", "data")
# Doesn't work!
# dd.read_hdf("dask.hdf", key="data").resample("min").mean()
# Works!
dd.read_hdf("dask.hdf", key="data").reset_index().set_index("timestamp").resample(
"min"
).mean()
Can you try something like:
pd.read_csv('data.csv', index_col='timestamp', parse_dates=['timestamp']) \
    .resample('T').mean().to_csv('data_1min.csv')  # or to_hdf(...)
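Extending that suggestion to all files, with the hourly resampling and parquet output the question asks for, a minimal per-file sketch in plain pandas (the glob pattern and column names are assumptions, and the timestamp parsing mirrors the question's date_parser):
import glob
import pandas as pd

for path in glob.glob("../data_*.csv"):
    # Column names are assumed; the parsing mirrors the question's date_parser.
    df = pd.read_csv(path, names=["timestamp", "value"])
    df["timestamp"] = pd.to_datetime(df["timestamp"].astype(float))
    df = df.set_index("timestamp").resample("1h").mean()
    df.to_parquet(path.replace(".csv", "_hourly.parquet"))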

How to pivot existing dataframe in chunks?

I have a 10 million row, 60 column dataframe that I read in from a parquet file.
I have the code (below) that pivots my dataframe, plus three other lines of manipulation, exactly how I need it. However, this code only works on smaller datasets, not the larger dataset:
pivoted_df = pd.pivot_table(df.fillna('missing'), index=cols, columns='Field', values='Value', aggfunc='first').reset_index().replace('missing', np.nan)
pivoted_df = pivoted_df.drop(['FieldId', 'FieldType'], axis=1)
pivoted_df = pivoted_df.replace('nan', np.nan)
pivoted_df = pivoted_df.groupby('Id', as_index=False).last()
Is there any way I can chunk the data from df, pivoting the chunks individually, then cleaning and joining the pivoted data back together later?
The kernel keeps crashing both in Spyder and in the terminal.
I'm open to using any other tools to do this.
I broke up the dataframe into evenly sized pieces using:
import numpy as np
z = np.array_split(df, 5)
then iterated over the list:
for i in z:
    # (rest of code)
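Putting those pieces together, a minimal sketch of pivoting each split separately and recombining afterwards (df, cols, 'Field', 'Value', 'FieldId', 'FieldType' and 'Id' are taken from the question; the number of splits is arbitrary):
import numpy as np
import pandas as pd

pivoted_parts = []
for part in np.array_split(df, 5):
    # Same pivot and cleanup as in the question, applied to one piece at a time.
    p = pd.pivot_table(part.fillna('missing'), index=cols, columns='Field',
                       values='Value', aggfunc='first').reset_index().replace('missing', np.nan)
    p = p.drop(['FieldId', 'FieldType'], axis=1)
    p = p.replace('nan', np.nan)
    pivoted_parts.append(p)

# Recombine the pivoted pieces; the final groupby merges Ids that
# landed in different splits, as in the original last() step.
pivoted_df = pd.concat(pivoted_parts, ignore_index=True)
pivoted_df = pivoted_df.groupby('Id', as_index=False).last()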

Pandas row to json

I have a dataframe in pandas and my goal is to write each row of the dataframe as a new json file.
I'm a bit stuck right now. My intuition was to iterate over the rows of the dataframe (using df.iterrows) and use json.dumps to dump the file but to no avail.
Any thoughts?
Looping over indices is very inefficient.
A faster technique:
df['json'] = df.apply(lambda x: x.to_json(), axis=1)
Pandas DataFrames have a to_json method that will do it for you:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
If you want each row in its own file you can iterate over the index (and use the index to help name them):
for i in df.index:
    df.loc[i].to_json("row{}.json".format(i))
Extending the answer of @MrE, if you're looking to convert multiple columns from a single row into another column with the content in json format (and not separate json files as output), I've had speed issues while using:
df['json'] = df.apply(lambda x: x.to_json(), axis=1)
I've achieved significant speed improvements on a dataset of 175K records and 5 columns using this line of code:
df['json'] = df.to_json(orient='records', lines=True).splitlines()
Speed went from >1 min to 350 ms.
Using apply, this can be done as
import json

def writejson(row):
    with open(row["filename"] + '.json', "w") as outfile:
        json.dump(row["json"], outfile, indent=2)

in_df.apply(writejson, axis=1)
This assumes the dataframe has a column named "filename" with a filename for each json row.
Here's a simple solution:
Transform the dataframe to json per record, one json per line, then simply split the lines:
list_of_jsons = df.to_json(orient='records', lines=True).splitlines()
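If, as in the original question, each row should end up in its own file, those per-record strings can then be written out one by one; a minimal sketch (naming files by row position is an assumption, and df is the dataframe from the question):
# One JSON file per row, named by positional index.
for i, line in enumerate(df.to_json(orient='records', lines=True).splitlines()):
    with open("row{}.json".format(i), "w") as outfile:
        outfile.write(line)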

How do I export TextFileReader object to txt

I read a census ACS file into an IPython Notebook in chunks using:
pusb = pd.read_csv('ss14pusb.csv', low_memory=False, chunksize=25000)
Then I selected some columns I want to keep and use for analysis. Now I want to export pusb to a txt or csv file, but pusb.to_csv(etc...) didn't work. How do I do this? Is there a way to concatenate the chunks once I read them so that they're one data frame?
Thanks in advance!
You can try the concat function:
pusb = pd.read_csv('ss14pusb.csv', low_memory=False, chunksize=25000)
print(pusb)
#<pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(pusb, ignore_index=True)
I think it is necessary to add the ignore_index parameter to concat to avoid duplicate index values.
Let me try to explain it better:
pusb = pd.read_csv('ss14pusb.csv', low_memory=False, chunksize=25000)
read_csv with chunksize reads the file in chunks (see the docs), and the output is a TextFileReader, not a DataFrame.
You can check this iterable object by:
for chunk in pusb:
    print(chunk)
And then you need to concat the chunks into one big DataFrame - see Concatenating objects in the docs.
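Alternatively, if you only need certain columns in the output and never need the full DataFrame in memory, each chunk can be trimmed and appended to one CSV as it is read; a minimal sketch (the selected column names and the output filename are placeholders, not from the question):
import pandas as pd

pusb = pd.read_csv('ss14pusb.csv', low_memory=False, chunksize=25000)
for i, chunk in enumerate(pusb):
    chunk = chunk[['SERIALNO', 'AGEP', 'PINCP']]  # placeholder column names
    # Write the header only with the first chunk, then append the rest.
    chunk.to_csv('ss14pusb_selected.csv', mode='a', header=(i == 0), index=False)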

python - Using pandas structures with large csv (iterate and chunksize)

I have a large csv file, about 600 MB with 11 million rows, and I want to create statistical data like pivots, histograms, graphs etc. Obviously, trying to just read it normally:
df = pd.read_csv('Check400_900.csv', sep='\t')
Doesn't work, so I found iterator and chunksize in a similar post and used:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
All good, I can for example print df.get_chunk(5) and search the whole file with just:
for chunk in df:
    print(chunk)
My problem is I don't know how to use stuff like these below for the whole df and not for just one chunk.
plt.plot()
print(df.head())
print(df.describe())
print(df.dtypes)
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()
Solution, if you need to create one big DataFrame and process all data at once (which is possible, but not recommended):
Use concat to combine all the chunks into df, because the output type of:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
isn't a DataFrame, but pandas.io.parsers.TextFileReader (source).
tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
#<pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)
I think it is necessary to add the ignore_index parameter to concat to avoid duplicate index values.
EDIT:
But if you want to work with large data, e.g. for aggregations, it is much better to use dask, because it provides advanced parallelism.
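A minimal sketch of that dask route, reusing the file and the UserID grouping from the question:
import dask.dataframe as dd

# dask reads the CSV in partitions and evaluates the groupby in parallel,
# so the whole file never has to be held in memory at once.
ddf = dd.read_csv('Check1_900.csv', sep='\t')
print(ddf.groupby('UserID').size().compute())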
You do not need concat here. It's exactly like writing sum(map(list, grouper(tup, 1000))) instead of list(tup). The only thing iterator and chunksize=1000 does is to give you a reader object that iterates 1000-row DataFrames instead of reading the whole thing. If you want the whole thing at once, just don't use those parameters.
But if reading the whole file into memory at once is too expensive (e.g., it takes so much memory that you get a MemoryError, or slows your system to a crawl by throwing it into swap hell), that's exactly what chunksize is for.
The problem is that you named the resulting iterator df, and then tried to use it as a DataFrame. It's not a DataFrame; it's an iterator that gives you 1000-row DataFrames one by one.
When you say this:
My problem is I don't know how to use stuff like these below for the whole df and not for just one chunk
The answer is that you can't. If you can't load the whole thing into one giant DataFrame, you can't use one giant DataFrame. You have to rewrite your code around chunks.
Instead of this:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(df.dtypes)
customer_group3 = df.groupby('UserID')
… you have to do things like this:
for df in pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000):
    print(df.dtypes)
    customer_group3 = df.groupby('UserID')
Often, what you need to do is aggregate some data—reduce each chunk down to something much smaller with only the parts you need. For example, if you want to sum the entire file by groups, you can groupby each chunk, then sum the chunk by groups, and store a series/array/list/dict of running totals for each group.
Of course it's slightly more complicated than just summing a giant series all at once, but there's no way around that. (Except to buy more RAM and/or switch to 64 bits.) That's how iterator and chunksize solve the problem: by allowing you to make this tradeoff when you need to.
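For instance, a minimal sketch of that pattern for the per-UserID group sizes from the question, keeping only a running total in memory:
import pandas as pd

# Accumulate row counts per UserID one chunk at a time.
group_sizes = pd.Series(dtype='int64')
for chunk in pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000):
    group_sizes = group_sizes.add(chunk.groupby('UserID').size(), fill_value=0)

print(group_sizes.sort_values(ascending=False).head())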
You need to concatenate the chunks. For example:
df2 = pd.concat([chunk for chunk in df])
And then run your commands on df2
This might not answer the question directly, but when you have to load a big dataset it is good practice to convert the dtypes of your columns while reading the dataset. Also, if you know which columns you need, use the usecols argument to load only those.
df = pd.read_csv("data.csv",
usecols=['A', 'B', 'C', 'Date'],
dtype={'A':'uint32',
'B':'uint8',
'C':'uint8'
},
parse_dates=['Date'], # convert to datetime64
sep='\t'
)
