Pyspark to pandas df taking a lot of time - python

Converting a PySpark object to pandas is taking an extremely long time. How can I store the result in a pandas df?
I have the code below (sample). I pull data from PySpark, then pull data from Teradata, and finally join the two DataFrames in Python. However, converting pp_data2 to a pandas df takes around 2 hours.
pp_data2 = sqlContext.sql('''SELECT c1,c2,c3
FROM cstonedb3.pp_data
where prod in ('7QD','7RJ','7RK','7RL','7RM') ''')
pp_data2 = pp_data2.toPandas()
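One common cause is that toPandas() collects and pickles rows one by one through the driver. Below is a minimal sketch of the usual remedy, Arrow-based conversion; it assumes pyarrow is installed, and the exact config key depends on the Spark version.
# Sketch only: turn on Arrow-based transfer before calling toPandas().
# sqlContext suggests Spark 2.x, where the key is "spark.sql.execution.arrow.enabled";
# Spark 3.x renames it to "spark.sql.execution.arrow.pyspark.enabled".
sqlContext.setConf("spark.sql.execution.arrow.enabled", "true")
pp_data2 = sqlContext.sql('''SELECT c1,c2,c3
    FROM cstonedb3.pp_data
    where prod in ('7QD','7RJ','7RK','7RL','7RM') ''')
pp_data2 = pp_data2.toPandas()  # columnar transfer instead of row-by-row pickling
If the result is still large, it is also worth checking whether the Teradata join can be pushed to the Spark side so that only the joined rows are pulled into pandas.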

Related

How do I copy pandas nested column to another DF?

We have some data in a Delta source which has nested structures. For this example we are focusing on a particular field from the Delta named status which has a number of sub-fields: commissionDate, decommissionDate, isDeactivated, isPreview, terminationDate.
In our transformation we currently read the Delta file in using PySpark, convert the DF to pandas using df.toPandas() and operate on this pandas DF using the pandas API. Once we have this pandas DF we would like to access its fields without using row iteration.
The data in Pandas looks like the following when queried using inventory_df["status"][0] (i.e. inventory_df["status"] is a list):
Row(commissionDate='2011-07-24T00:00:00+00:00', decommissionDate='2013-07-15T00:00:00+00:00', isDeactivated=True, isPreview=False, terminationDate=None)
We have found success using row iteration like:
unit_df["Active"] = [
not row["isDeactivated"] for row in inventory_df["status"]
]
but we have to use row iteration each time we want to access data from the inventory_df, which is more verbose and less efficient.
We would love to be able to do something like:
unit_df["Active"] = [
not inventory_df["status.isDeactivated"]
]
which is similar to the Spark destructuring approach, and allows accessing all of the rows at once but there doesn't seem to be equivalent pandas logic.
The data within PySpark has a format like status: struct<commissionDate:string,decommissionDate:string,isDeactivated:boolean,isPreview:boolean,terminationDate:string> and we can use the format mentioned above, selecting a subcolumn like df.select("status.isDeactivated").
How can this approach be done using pandas?
This may get you to where you want to be:
unit_df["Active"] = inventory_df["status"].apply(lambda x: pd.DataFrame([x.asDict()]))
From here I would do:
unit_df = pd.concat([pd.concat(unit_df["Active"], ignore_index=True), unit_df], axis=1)
Which would get you a single pd.DataFrame, now with columns for commissionDate, decommissionDate, etc.
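An alternative, sketched here under the assumption that spark_df is the DataFrame read from the Delta source, is to flatten the struct on the Spark side before calling toPandas(), so pandas never sees Row objects at all:
# Sketch: select the struct's sub-fields in Spark, then convert once.
flat = spark_df.select(
    "status.commissionDate",
    "status.decommissionDate",
    "status.isDeactivated",
    "status.isPreview",
    "status.terminationDate",
)
inventory_df = flat.toPandas()
unit_df["Active"] = ~inventory_df["isDeactivated"]  # plain boolean column, no row iteration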

Convert large spark DF to pandas DF

I have a huge (1258355, 14) PySpark dataframe that has to be converted to a pandas df. There is probably a memory issue (modifying the config file did not work).
pdf = df.toPandas() fails; pdf1 = df.limit(1000) works. How can I iterate through the whole df, convert the slices to pandas DataFrames, and join them at the end?
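One way to sketch such a loop (assuming the converted rows fit in driver memory once collected) is to stream rows with toLocalIterator() and build fixed-size pandas chunks:
import pandas as pd
# Sketch: stream rows to the driver, build 100k-row pandas chunks,
# and concatenate them at the end instead of one huge toPandas() call.
chunk_size = 100_000
chunks, buffer = [], []
for row in df.toLocalIterator():
    buffer.append(row.asDict())
    if len(buffer) == chunk_size:
        chunks.append(pd.DataFrame(buffer))
        buffer = []
if buffer:
    chunks.append(pd.DataFrame(buffer))
pdf = pd.concat(chunks, ignore_index=True)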

Resampling many timeseries files with pandas/dask

I have many .csv files with two columns: one with timestamps and the other with values. The data is sampled at second resolution. What I would like to do is
read all files
set index on time column
resample on hours
save to new files (parquet, hdf,...)
1) Only dask
I tried to use dask's read_csv.
import dask.dataframe as dd
import pandas as pd

df = dd.read_csv(
    "../data_*.csv",
    parse_dates=[0],
    date_parser=lambda x: pd.to_datetime(float(x)),
)
So far that's fine. The problem is that I cannot call df.resample("min").mean() directly, because the index of the dask data frame is not properly set.
After calling df.reset_index().set_index("timestamp") it works - BUT I cannot afford to do this because it is expensive.
2) Workaround with pandas and hdf files
Another approach was to save all csv files to hdf files using pandas. In this case the pandas dataframes were already indexed by time.
df = dd.read_hdf("/data_01.hdf", key="data")
# This doesn't work directly
# df = df.resample("min").mean()
# Error: "Can only resample dataframes with known divisions"
df = df.reset_index().set_index("timestamp") # expensive! :-(
df = df.resample("min").mean() # works!
Of course this works but it would be extremely expensive on dd.read_hdf("/data_*.hdf", key="data").
How can I directly read timeseries data in dask that it is properly partitioned and indexed?
Do you have any tips or suggestions?
Example data:
import dask
import dask.dataframe as dd

df = dask.datasets.timeseries()
df.to_hdf("dask.hdf", "data")

# Doesn't work!
# dd.read_hdf("dask.hdf", key="data").resample("min").mean()

# Works!
dd.read_hdf("dask.hdf", key="data").reset_index().set_index("timestamp").resample(
    "min"
).mean()
Can you try something like:
pd.read_csv('data.csv', index_col='timestamp', parse_dates=['timestamp']) \
.resample('T').mean().to_csv('data_1min.csv') # or to_hdf(...)
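If every file is already sorted by time, another option (only a sketch, assuming the time column is named timestamp and the files arrive in chronological order) is to tell dask that the index is already sorted, which computes the divisions without the expensive shuffle:
import dask.dataframe as dd
import pandas as pd

# Sketch: sorted=True only scans each partition for its min/max timestamp;
# it does not shuffle the data, so resample() can then use known divisions.
df = dd.read_csv(
    "../data_*.csv",
    parse_dates=[0],
    date_parser=lambda x: pd.to_datetime(float(x)),
)
df = df.set_index("timestamp", sorted=True)
df.resample("min").mean().to_parquet("resampled.parquet")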

How to convert spark dataframe to python dataframe using a loop

I have created a spark dataframe which has 500k rows. If I convert it to a python dataframe using pandas_df = spark_df.toPandas(), it takes a lot of time and disconnects. How can I create a loop that pulls 100k rows at a time from the spark dataframe, puts them into a python data frame, and iterates 5 times to create 5 dfs with 100k rows each?
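A sketch of such a loop, assuming the frame has no natural ordering column so the rows are numbered with a window (note that a window ordered over the whole frame funnels the data through a single partition, so this trades speed for memory):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
import pandas as pd

# Sketch: number the rows, pull 100k-row slices one at a time,
# and stitch the pieces together on the pandas side.
w = Window.orderBy(F.monotonically_increasing_id())
numbered = spark_df.withColumn("rn", F.row_number().over(w))

chunk_size = 100_000
total = numbered.count()
pieces = []
for start in range(1, total + 1, chunk_size):
    piece = (numbered
             .filter((F.col("rn") >= start) & (F.col("rn") < start + chunk_size))
             .drop("rn")
             .toPandas())
    pieces.append(piece)
pandas_df = pd.concat(pieces, ignore_index=True)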

Merging two data frames Pandas

Still not getting the hang of pandas, I am attempting to join two data frames in Pandas using merge. I have read in the CSVs into two data frames (named dropData and deosData in the code below). Both data frames have the column ‘Date_Time’, which is a parsed column of Date and Time information to create a unique id for each entry. The deosData file is an entire year’s worth of observations that I am trying to match up with corresponding entries in dropData.
CSV files:
deosData: https://www.dropbox.com/s/3rr7hf7jzrmxdke/inputDeos.csv?dl=0
dropData: https://www.dropbox.com/s/z9mv4xccjzlsyif/inputDrop.csv?dl=0
I have gone through the documentation for the merge function and have tried the following code in various iterations; so far I have only been able to produce either a blank data frame with the correct header row, or the two data frames merged on the default 0 to (N-1) integer index:
My code:
import pandas as pd
import numpy as np
import os
from matplotlib import pyplot as plt
#read in CSV to dataframe
dropData=pd.read_csv("inputDrop.csv", header=0, index_col=None)
deosData=pd.read_csv("inputDeos.csv", header=0, index_col=None)
#merging dataframes into single sf
merge=pd.merge(dropData,deosData, how='inner', on='Date_Time')
#comment out during debugging
#merge.to_csv('output.csv', sep=',', headers=True, index=False)
#check merge dataframe creation
print merge.head(1)
After searching on SE and the docs, I have tried resetting the index, ignoring the index columns, copying the ‘Date_Time’ column as a separate index and merging on the new column, and using ‘on=None’, ‘left_on’ and ‘right_on’ as permutations of ‘Date_Time’, all to no avail. I have checked the column data types: ‘Date_Time’ in both is dtype object. I do not know if this is the source of the error, since the only issues I could find while searching revolved around matching different dtypes to each other.
What I am looking to do is have the two data frames merge where the two 'Date_Time' columns intersect. For example:
Date_Time,Volume(Max),Volume(Sum),Volume(Min),Volume(Mean),Diameter(Count),Diameter(Max),Diameter(Sum),Diameter(Min),Diameter(Mean),Depth(Sum),Velocity(Max),Velocity(Sum),Velocity(Min),Velocity(Mean), Air Temperature (deg. C), Relative humidity (%), Wind Speed (m.s-1), Wind Direction (deg.), Wind Gust Speed (5) (m.s-1), Barometric Pressure (mbar), Gage Precipitation (5) (mm)
9/1/2014 0:00,2.266188524,2.989272461,0.052464219,0.332141385,9,1.629668,5.972978,0.464467,0.663664222,0.003736591,2.288401,16.889656,1.495487,1.876628444,22.5,99,0,216.1,0.4,1016.2,0
Any help would be greatly appreciated.
You need to pass parse_dates when reading the csv files, so that the Date_Time columns in both dataframes are pd.Timestamp objects instead of raw strings. (If you look at your csv files, one is in ISO format YYYY-MM-DD HH:MM:SS whereas the other is in MM/DD/YYYY HH:MM.) Try the following code:
#read in CSV to dataframe
dropData = pd.read_csv("inputDrop.csv", header=0, index_col=None, parse_dates=['Date_Time'])
deosData = pd.read_csv("inputDeos.csv", header=0, index_col=None, parse_dates=['Date_Time'])
and then do your merge.
You can use join, but you first need to set the index:
dropData=pd.read_csv('.../inputDrop.csv', header=0, index_col='Date_Time', parse_dates=True)
deosData=pd.read_csv('.../inputDeos.csv', header=0, index_col='Date_Time', parse_dates=True)
dropData.join(deosData)
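Note that join matches on the index and defaults to a left join; if you only want the timestamps present in both files, pass how='inner':
dropData.join(deosData, how='inner')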
