I have a huge (1258355, 14) PySpark dataframe that has to be converted to a pandas dataframe. There is probably a memory issue (modifying the config file did not help).
pdf = df.toPandas() fails, but pdf1 = df.limit(1000) works. How can I iterate through the whole dataframe, convert the slices to pandas dataframes and join them at the end?
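One way to do that (a sketch, not something I have tested on data of this size) is to split the Spark dataframe into smaller pieces with randomSplit, convert each piece on its own and concatenate the resulting pandas frames at the end. The chunk count below is an arbitrary choice:
import pandas as pd

n_chunks = 20                                                   # assumption: tune to your memory budget
chunks = df.randomSplit([1.0 / n_chunks] * n_chunks, seed=42)   # list of smaller Spark dataframes

pdf = pd.concat(
    (chunk.toPandas() for chunk in chunks),                     # convert one slice at a time
    ignore_index=True,
)
Note that this does not preserve row order, and the final concatenated frame still has to fit in driver memory, so it only helps when the failure comes from converting everything in one shot rather than from the full pandas frame itself being too large.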
I have many .csv files with two columns, one with timestamps and the other with values. The data is sampled at one-second intervals. What I would like to do is:
read all files
set the index on the time column
resample to hours
save to new files (parquet, hdf, ...)
1) Dask only
I tried to use dask's read_csv.
import dask.dataframe as dd
import pandas as pd
df = dd.read_csv(
    "../data_*.csv",
    parse_dates=[0],
    date_parser=lambda x: pd.to_datetime(float(x)),
)
So far that's fine. The problem is that I cannot call df.resample("min").mean() directly, because the index of the dask dataframe is not properly set.
After calling df.reset_index().set_index("timestamp") it works, BUT I cannot afford to do this because it is expensive.
2) Workaround with pandas and hdf files
Another approach was to save all csv files to hdf files using pandas. In this case the pandas dataframes were already indexed by time.
df = dd.read_hdf("/data_01.hdf", key="data")
# This doesn't work directly
# df = df.resample("min").mean()
# Error: "Can only resample dataframes with known divisions"
df = df.reset_index().set_index("timestamp") # expensive! :-(
df = df.resample("min").mean() # works!
Of course this works but it would be extremely expensive on dd.read_hdf("/data_*.hdf", key="data").
How can I directly read time series data in dask so that it is properly partitioned and indexed?
Do you have any tips or suggestions?
Example data:
import dask
df = dask.datasets.timeseries()
df.to_hdf("dask.hdf", "data")
# Doesn't work!
# dd.read_hdf("dask.hdf", key="data").resample("min").mean()
# Works!
dd.read_hdf("dask.hdf", key="data").reset_index().set_index("timestamp").resample(
"min"
).mean()
Can you try something like:
pd.read_csv('data.csv', index_col='timestamp', parse_dates=['timestamp']) \
.resample('T').mean().to_csv('data_1min.csv') # or to_hdf(...)
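If you want to stay in dask, and your CSV files are already sorted by time across the whole ../data_*.csv pattern, a sketch like the following should avoid the full shuffle: set_index with sorted=True only computes the divisions instead of re-sorting the data (the output path and the "1h" rule are just examples):
import dask.dataframe as dd
import pandas as pd

df = dd.read_csv(
    "../data_*.csv",
    parse_dates=[0],
    date_parser=lambda x: pd.to_datetime(float(x)),
)

# Valid only if the timestamp column is globally sorted across the files
# as they are read; dask then just computes divisions, no shuffle.
df = df.set_index("timestamp", sorted=True)

hourly = df.resample("1h").mean()
hourly.to_parquet("data_hourly.parquet")   # needs pyarrow or fastparquet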
I have a simple dataset that I have sorted in a dataframe based on 'Category'.
The sorting went well, but now I'd like to export the sorted/adjusted dataset in .xlsx format, i.e. the dataset that has been sorted by category, not the dataset as it was read from Excel.
I have tried the following:
import pandas as pd
df = pd.read_excel("python_sorting_test.xlsx",index_col=[1])
df.head()
print(df.sort_index(level=['Category'], ascending=True))
df.to_excel(r'C:\Users\Laptop\PycharmProjects\untitled8\export_dataframe.xlsx', header=True)
The issue: it doesn't store the sorted/adjusted dataset.
Actually, you don't save the result of sort_index. You can add inplace=True:
df.sort_index(level=['Category'], ascending=True, inplace=True)
or save the result of df.sort_index:
df = df.sort_index(level=['Category'], ascending=True)
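Putting it together, a minimal corrected version of the script from the question:
import pandas as pd

df = pd.read_excel("python_sorting_test.xlsx", index_col=[1])

# keep the sorted result instead of discarding it
df = df.sort_index(level=['Category'], ascending=True)

df.to_excel(r'C:\Users\Laptop\PycharmProjects\untitled8\export_dataframe.xlsx', header=True)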
Converting a PySpark object to pandas is taking a very long time. How do I store it in a pandas df?
I have the code below (a sample). I am pulling data with PySpark, then pulling data from Teradata, and finally joining the 2 different dataframes in Python. However, converting pp_data2 to a pandas df takes around 2 hours.
pp_data2 = sqlContext.sql('''SELECT c1,c2,c3
FROM cstonedb3.pp_data
where prod in ('7QD','7RJ','7RK','7RL','7RM') ''')
pp_data2 = pp_data2.toPandas()
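One thing worth trying (an assumption on my side, since I don't know your Spark version): toPandas() is usually much faster when Arrow-based conversion is enabled, which requires pyarrow to be installed. The config key below is for Spark 3.x; on Spark 2.x it was spark.sql.execution.arrow.enabled:
# enable Arrow-based columnar data transfer for toPandas()
sqlContext.setConf("spark.sql.execution.arrow.pyspark.enabled", "true")

pp_data2 = sqlContext.sql('''SELECT c1,c2,c3
                             FROM cstonedb3.pp_data
                             where prod in ('7QD','7RJ','7RK','7RL','7RM') ''')

pp_data2 = pp_data2.toPandas()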
I want to drop a column from a Dask dataframe. This should work:
raw_data.drop('some_great_column', axis=1).compute()
But the column is not dropped. In pandas I use:
raw_data.drop(['some_great_column'], axis=1, inplace=True)
But inplace does not exist in Dask. Any ideas?
You can separate this into two operations:
# dask operation
raw_data = raw_data.drop('some_great_column', axis=1)
# conversion to pandas
df = raw_data.compute()
Then export the Pandas dataframe to a CSV file:
df.to_csv(r'out.csv', index=False)
I assume you want to keep "raw data" in a Dask DF. In that case the following will do the trick:
new_raw_df = raw_data.drop('some_great_column', axis=1).copy()
where type(new_raw_df) is dask.dataframe.core.DataFrame and you can delete the original DF.
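Either way, you can sanity-check the result without computing anything, because .columns is available on the lazy Dask dataframe (the column name here is just the one from the question):
assert 'some_great_column' in raw_data.columns
assert 'some_great_column' not in new_raw_df.columns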
So I am working with a fairly substantial CSV dataset that is a couple hundred megabytes. I have managed to read in the data in chunks (~100 rows).
How do I then elegantly convert those chunks into a dataframe and apply the describe function to it?
Thank you
It seems you need to concat the chunks from the TextFileReader object, which is what read_csv returns when the chunksize parameter is used, and then call describe:
df = pd.concat(pd.read_csv('filename', chunksize=1000), ignore_index=True)
df = df.describe()
print(df)
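If even the concatenated frame does not fit in memory, a dask-based sketch of the same idea (assuming the file is a plain CSV that dd.read_csv can parse directly):
import dask.dataframe as dd

df = dd.read_csv('filename')       # lazy, reads the file in partitions
print(df.describe().compute())     # aggregates the statistics across partitions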