How to convert a Spark dataframe to a pandas dataframe using a loop - python

I have created a Spark dataframe which has 500k rows. If I convert it to a pandas dataframe using pandas_df = spark_df.toPandas(), it takes a lot of time and disconnects. How can I create a loop which pulls 100k rows at a time from the Spark dataframe into a pandas dataframe, iterating 5 times to create 5 dataframes with 100k rows each?
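A minimal sketch of one way to do this (an assumption, not from the original post): split the Spark dataframe into roughly equal parts with randomSplit and convert each part separately; spark_df stands for the existing 500k-row dataframe.
# Split into 5 roughly equal parts (Spark normalises the weights) and
# convert each part to pandas on its own.
pandas_chunks = []
for part in spark_df.randomSplit([1.0] * 5, seed=42):
    pandas_chunks.append(part.toPandas())  # each chunk is roughly 100k rows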

Related

How do I parallelize a for loop that appends to a Pandas dataframe?

I'm running a for loop that calls a function which returns a Pandas series. On each iteration of the for loop I'm appending that row to a final Dataframe output.
Inside the function I calculate some stuff and query a SQL database.
How can I run this on 4 or 5 parallel threads and still append to the same final dataframe?
df_final = pd.DataFrame()
for i in range(0, 10000):
    series = myFunction(A, B, C)
    df_final = df_final.append(series)
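A hedged sketch of one way to do this (not from the thread): run the calls in a thread pool and build the dataframe once at the end; myFunction, A, B and C are the placeholders from the question.
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

# Run the 10000 calls on 5 worker threads; each call returns a pandas Series.
with ThreadPoolExecutor(max_workers=5) as pool:
    rows = list(pool.map(lambda i: myFunction(A, B, C), range(10000)))

# Concatenating once avoids the cost of appending row by row.
df_final = pd.concat(rows, axis=1).T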

Merge multiple df in python and keep the same rows only one time

I am trying to merge multiple dataframes and create a new dataframe containing all the rows from each dataframe, but keeping rows that appear in more than one dataframe only once. For example:
The dataframes that I have as input:
[image: input dataframes]
The dataframe that I want to have as output:
[image: output dataframe]
Do you know if there is a way to do that? If you could help me, I would be more than thankful!
Thanks,
Eleni
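A minimal sketch, assuming "the same rows" means fully identical rows: concatenate the dataframes and drop duplicates so shared rows appear only once.
import pandas as pd

# Two small example frames; the row (2, 'y') appears in both.
df1 = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df2 = pd.DataFrame({'a': [2, 3], 'b': ['y', 'z']})

# Concatenate, then keep each identical row only once.
merged = pd.concat([df1, df2], ignore_index=True).drop_duplicates().reset_index(drop=True)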

Quickly Groupby Large DataFrame in Python

I have a DataFrame of 100M rows and 8 columns. I'm trying to groupby my DataFrame by 5 string columns and perform the following calculations as fast as possible.
df.groupby(['A','B','C','D','E'])['F'].transform('median')
df.groupby(['A','B','C','D','E']).agg({'F':'count', 'G':['mean','median','std'], 'H':['mean','std']})
I'm assuming doing it using numpy arrays would be the fastest, but I don't even know where to begin because it takes a couple of minutes just to convert a column to a numpy array.
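A hedged suggestion (an assumption, not from the question): casting the string key columns to the categorical dtype usually speeds up large groupbys and reduces memory, without dropping down to raw numpy. Column names follow the question's example; df is the 100M-row dataframe.
for col in ['A', 'B', 'C', 'D', 'E']:
    df[col] = df[col].astype('category')

# observed=True restricts the result to key combinations that actually occur.
stats = df.groupby(['A', 'B', 'C', 'D', 'E'], observed=True).agg(
    {'F': 'count', 'G': ['mean', 'median', 'std'], 'H': ['mean', 'std']})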

Merge multiple int columns/rows into one numpy array (pandas dataframe)

I have a pandas dataframe with a few columns and rows. I want to merge the columns into one, and then merge the rows based on id and date into one.
Currently I am doing so by:
df['matrix'] = df[[col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28,col29,col30,col31,col32,col33,col34,col35,col36,col37,col38,col39,col40,col41,col42,col43,col44,col45,col46,col47,col48]].values.tolist()
df = df.groupby(['id','date'])['matrix'].apply(list).reset_index(name='matrix')
This gives me the matrix in form of a list.
Later I convert it into numpy.ndarray using:
df['matrix'] = df['matrix'].apply(np.array)
This is a small segment of my dataset for reference:
id,date,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28,col29,col30,col31,col32,col33,col34,col35,col36,col37,col38,col39,col40,col41,col42,col43,col44,col45,col46,col47,col48
16,2014-06-22,0,0,0,10,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,2,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,3,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0
16,2014-06-22,4,0,0,0,0,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,22,0,0,0,0
The above code works fine for small datasets, but sometimes crashes for larger ones, specifically at the df['matrix'].apply(np.array) statement.
Is there a way to perform the merging so that it gives me a numpy.ndarray directly? This would save a lot of time.
There is no need to merge the columns first. Group the DataFrame with groupby and then flatten each group's values:
matrix = df.set_index(['id','date']).groupby(['id','date']).apply(lambda x: x.values.flatten())
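A small usage sketch (an assumption, not part of the answer): the grouped result is a Series of numpy arrays indexed by (id, date), so it can be turned back into a dataframe with an explicit matrix column.
# One numpy array per (id, date) pair, mirroring the original groupby output.
matrix_df = matrix.reset_index(name='matrix')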

Pyspark to pandas df taking a lot of time

Converting a PySpark object to pandas is taking an extremely long time. How can I store the result in a pandas DataFrame?
I have the below code (sample). I am pulling data with PySpark and also pulling data from Teradata, then finally joining the two different dataframes in Python. However, converting pp_data2 to a pandas dataframe takes around 2 hours.
pp_data2 = sqlContext.sql('''SELECT c1,c2,c3
FROM cstonedb3.pp_data
where prod in ('7QD','7RJ','7RK','7RL','7RM') ''')
pp_data2 = pp_data2.toPandas()
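A hedged suggestion (not from the thread): enabling Arrow-based conversion usually makes toPandas() considerably faster. The snippet assumes a SparkSession named spark is available; the config key below is the Spark 3.x name, while Spark 2.x uses spark.sql.execution.arrow.enabled (settable via sqlContext.setConf as well).
# Enable Arrow before the pp_data2.toPandas() call above, so the conversion
# avoids the slow row-by-row transfer to the driver.
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')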
