I have this code:
l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).toPandas()
Works fine, does what it needs to. Suppose though I only want to display the first n rows, and then call toPandas() to return a pandas dataframe. How do I do it? I can't call take(n) because that doesn't return a dataframe and thus I can't pass it to toPandas().
So to put it another way, how can I take the top n rows from a dataframe and call toPandas() on the resulting dataframe? I can't imagine this is difficult, but I can't figure it out.
I'm using Spark 1.6.0.
You can use the limit(n) function:
l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.limit(2).withColumn('age2', df.age + 2).toPandas()
Or:
l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).limit(2).toPandas()
You could get the first rows of the Spark DataFrame with head and then create a pandas DataFrame from them:
import pandas as pd

l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df_pandas = pd.DataFrame(df.head(3), columns=df.columns)
In [4]: df_pandas
Out[4]:
     name  age
0   Alice    1
1     Jim    2
2  Sandra    3
Try it:
def showDf(df, count=None, percent=None, maxColumns=0):
    if df is None:
        return
    import pandas
    from IPython.display import display
    pandas.set_option('display.encoding', 'UTF-8')
    # Pandas dataframe
    dfp = None
    # maxColumns param
    if maxColumns >= 0:
        if maxColumns == 0: maxColumns = len(df.columns)
        pandas.set_option('display.max_columns', maxColumns)
    # count param
    if count is None and percent is None: count = 10  # Default count
    if count is not None:
        count = int(count)
        if count == 0: count = df.count()
        pandas.set_option('display.max_rows', count)
        dfp = pandas.DataFrame(df.head(count), columns=df.columns)
        display(dfp)
    # percent param
    elif percent is not None:
        percent = float(percent)
        if 0.0 <= percent <= 1.0:
            import datetime
            now = datetime.datetime.now()
            seed = int(now.strftime("%H%M%S"))  # long() on Python 2; int() works on both
            dfs = df.sample(False, percent, seed)
            count = df.count()
            pandas.set_option('display.max_rows', count)
            dfp = dfs.toPandas()
            display(dfp)
Examples of usage:
# Shows the first ten rows of the Spark dataframe
showDf(df)
showDf(df, 10)
showDf(df, count=10)
# Shows a random sample which represents 15% of the Spark dataframe
showDf(df, percent=0.15)
Related
I am quite new to programming in Python.
I have a dataset which needs to be modified. I tried a few methods for the sum part but I don't get the exact results.
Dataset: My data table
Requirements:
To categorize the debit and credit values into the following ranges/bins:
a) 2000-4000
b) 5000-8000
c) 9000-20000
The sum of debit should be over a 20-day period, i.e. if a transaction happened on 2020-01-01, then the sum should cover 2020-01-01 to 2020-01-20.
I also want a record of occurrences, i.e. the number of times a value falls within the bin/category.
Required Result: (see the attached result table)
The code I tried for credit values:
EndDate = BM['transaction_date'] + pd.to_timedelta(20, unit='D')
StartDate = BM['transaction_date']
dfx = BM
dfx['EndDate'] = EndDate
dfx['StartDate'] = StartDate
dfx['Debit'] = dfx.apply(lambda x: BM.loc[(BM['transaction_date'] >= x.StartDate) &
                                          (BM['transaction_date'] <= x.EndDate), 'Debit'].sum(), axis=1)
I have created a lot of functions and broke the problem into smaller tasks. Hope the comments make this understandable.
def sum20Days(df, debitORCredit):
    """
    Calculates the sum of all amounts in the debitORCredit column of df, looking 20 days into the future within df
    df: pandas DataFrame. Should already be grouped by name
    debitORCredit: String. Takes either debit or credit. Column names in the dataframe
    Returns:
    df: Creates a column sum_debit_20days, adds the sum amount and returns the final dataframe
    """
    df = df.copy()
    temp_df = df[df[debitORCredit] > 0]
    dates = sorted(temp_df["transaction_date"].unique())
    curr_date = dates[0]
    date_20days = curr_date + pd.Timedelta(20, unit="D")
    i = 0
    while i < len(dates):
        date = dates[i]
        if date > date_20days:
            curr_date = date
            date_20days = curr_date + pd.Timedelta(20, unit="D")
        series = temp_df.loc[(df["transaction_date"] >= date) & (df["transaction_date"] <= date_20days), :]
        df.loc[max(df.loc[df["transaction_date"] == series["transaction_date"].max()].index), f"sum_{debitORCredit}_20days"] = sum(series[debitORCredit])
        new_i = series["transaction_date"].nunique()
        if new_i > 1:
            i = new_i + 1
        else:
            i += 1
    return df
def groupListUsingList(inp, groupby):
    """
    Groups inp by list groupby
    inp: List
    groupby: List
    Example: inp = [0, 1, 2, 3, 4, 5, 6, 7], groupby=[3, 6] then output = [[0, 1, 2, 3], [4, 5, 6], [7]]
    """
    groupby = sorted(groupby)
    inp = sorted(inp)
    lst = []
    arr = []
    for i in inp:
        if len(groupby) > 0:
            if i <= groupby[0]:
                arr.append(i)
            else:
                if len(arr) > 0:
                    lst.append(arr)
                arr = [i]
                groupby.pop(0)
        else:
            arr += inp[i:]
    if len(arr) > 0:
        lst.append(arr)
    return lst
def count_amounts_in_category(df, debitORCredit, category_info):
    """
    Based on the category assigned, finds the number of amounts belonging to that category
    Inputs-
    df: Pandas DataFrame. Grouped by name and only contains the transactions belonging to a single category calculation
    debitORCredit: String. Takes either credit/debit. Used to get column in df
    category_info: Dict. Contains the rules of categorization.
    Output-
    count: Float. Returns count
    """
    if debitORCredit.lower() == "debit":
        temp_df = df.loc[(df["debitorcredit"] == "D")]
    elif debitORCredit.lower() == "credit":
        temp_df = df.loc[(df["debitorcredit"] == "C")]
    if temp_df.shape[0] == 0:
        return np.nan
    category = temp_df.iloc[-1].loc[f"category_{debitORCredit}"]
    amount_range = category_info.get(category)
    count = temp_df[debitORCredit].apply(lambda x: 1 if amount_range[0] <= x <= amount_range[1] else 0).sum()
    return count
def assign_category(amount, category_info):
    """
    Assigns category based on amount and categorization rules
    Input -
    amount: Float/Int. The amount
    category_info: Dict. Contains the rules of categorization.
    Output -
    Returns the String category based on the categorization rules
    """
    if pd.isna(amount):
        return np.nan
    for k, v in category_info.items():
        if v[0] <= amount <= v[1]:
            return k
    return np.nan
category_info = {"A": (2000, 4000),
"B": (5000, 8000),
"C":(9000, 20000)}
debitORCredit = "debit"
new_df = pd.DataFrame()
#Groupby name, then for each date in a group, calculate the sum of debitORCredit amounts over the next 20 days
for group in df.groupby("name"):
temp_df = sum20Days(group[1], debitORCredit=debitORCredit)
new_df = pd.concat([new_df, temp_df])
new_df = new_df.reset_index(drop=True)
#Based on the 20 days sum, use the categorization rules to assign a category
new_df[f"category_{debitORCredit}"] = new_df[f"sum_{debitORCredit}_20days"].apply(lambda x: assign_category(x, category_info))
#After assigning a category, groupby name and later groupby each 20 day transaction to find the count of transaction that belong to category assigned to that group of transactions
for group in new_df.groupby("name"):
#to groupby every 20 day transaction, we identified the last row of every 20 day transaction (ones which have a sum_debit_20days value) and split the group(a group from name groupby) on the last value in the index
indices = groupListUsingList(inp=group[1].index, groupby=group[1][group[1][f"sum_{debitORCredit}_20days"].notna()].index)
for index in indices:
count = count_amounts_in_category(df=new_df.loc[index], debitORCredit=debitORCredit, category_info=category_info)
new_df.loc[index[-1], f"count_{debitORCredit}"] = count
new_df
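For intuition, here is a quick illustrative check of the categorization helper on its own (the amounts below are made up, not from your data):

category_info = {"A": (2000, 4000), "B": (5000, 8000), "C": (9000, 20000)}
assign_category(3500, category_info)   # falls in the 2000-4000 bin -> "A"
assign_category(8500, category_info)   # falls in no bin -> nan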
d = {'key': [1,2,3], 'a': [True,True, False], 'b': [False,False,True]}
df = pd.DataFrame(d)
Current melt function is:
df2 = df.melt(id_vars=['key'], var_name = 'letter', value_name = 'Bool')
df2 = df2.query('Bool == True')
Is there a way to incorporate that 'True' condition in the melt function? As I continue to add entries to my df, and I have hundreds of columns, I assume it's much less costly to pull only the values I need instead of melting the entire df and then filtering. Any ideas?
Use pd.melt, but first replace False with NaN, then drop the NaN rows with dropna() afterwards.
pd.melt(df.replace(False, np.nan), id_vars=['key'],var_name = 'letter', value_name = 'Bool').dropna()
  key letter  Bool
0   1      a  True
1   2      a  True
5   3      b  True
You can filter the non-key columns first, melt the results and concat the melted rows back. See the following:
import pandas as pd
import numpy as np
import time
d = {'key': [1,2,3], 'a': [True,True, False], 'b': [False,False,True]}
df = pd.DataFrame(d)
start_time = time.time()
key_column_name = 'key'
key_column_loc = list(df.columns).index(key_column_name)
filtered_frame = None
for letter in [s for s in list(df.columns) if s != key_column_name]:
    true_booleans = np.nonzero(df[letter].values)[0]
    melted_df = df.iloc[true_booleans][[key_column_name, letter]].reset_index(drop=True).melt(id_vars=[key_column_name], var_name='letter', value_name='Bool')
    if filtered_frame is None:
        filtered_frame = melted_df
    else:
        filtered_frame = pd.concat((filtered_frame, melted_df), axis=0)
end_time = time.time()
print(filtered_frame, '\n\n', end_time - start_time, 'seconds!')
Output
  key letter  Bool
0   1      a  True
1   2      a  True
0   3      b  True
0.011133432388305664 seconds!
Compared to your code it is slower here (your approach scored 0.008090734481811523 seconds!), but as the number of rows increases I would expect the approach above to become more efficient. Looking forward to the results.
Regarding the discussion on speed (Benchmarks)
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
# Benchmark Tests
d = {'key': [1,2,3], 'a': [True,True, False], 'b': [False,False,True]}
df_initial = pd.DataFrame(d)
data_size = [10, 100, 10000, 50000, 100000, 500000, 1000000, 5000000, 10000000, 50000000]
scores_current = []
scores_golden_lion = []
scores_sammywemmy = []
scores_wwnde = []
scores_slybot = []
for n_rows in data_size:
    df = df_initial.sample(n=n_rows, replace=True).reset_index(drop=True)

    ## #Current method
    start_time = time.time()
    df_current = df.melt(id_vars=['key'], var_name='letter', value_name='Bool')
    df_current = df_current.query('Bool == True')
    end_time = time.time()
    scores_current.append(end_time - start_time)

    ## #Golden Lion
    start_time = time.time()
    df_golden_lion = df.melt(id_vars=['key'], var_name='letter', value_name='Boolean')
    df_golden_lion = df_golden_lion.drop(df_golden_lion.index[df_golden_lion['Boolean'] == False])
    end_time = time.time()
    scores_golden_lion.append(end_time - start_time)

    ## #sammywemmy
    start_time = time.time()
    box = df.iloc[:, 1:]
    len_df = len(df)
    letters = np.tile(box.columns, (len_df, 1))[box]
    df_sammywemmy = pd.DataFrame({'key': df.key.array,
                                  'letter': letters,
                                  'Bool': [True] * len_df})
    end_time = time.time()
    scores_sammywemmy.append(end_time - start_time)

    ## #wwnde
    start_time = time.time()
    df_wwnde = pd.melt(df.replace(False, np.nan), id_vars=['key'], var_name='letter', value_name='Bool').dropna()
    end_time = time.time()
    scores_wwnde.append(end_time - start_time)

    ## #Slybot
    start_time = time.time()
    key_column_name = 'key'
    key_column_loc = list(df.columns).index(key_column_name)
    filtered_frame = None
    for letter in [s for s in list(df.columns) if s != key_column_name]:
        true_booleans = np.nonzero(df[letter].values)[0]
        melted_df = df.iloc[true_booleans][[key_column_name, letter]].melt(id_vars=[key_column_name], var_name='letter', value_name='Bool')
        if filtered_frame is None:
            filtered_frame = melted_df
        else:
            filtered_frame = pd.concat((filtered_frame, melted_df), axis=0)
    end_time = time.time()
    scores_slybot.append(end_time - start_time)
plt.plot(data_size, scores_current, label = "Current method")
plt.plot(data_size, scores_golden_lion, label = "Golden Lion")
plt.plot(data_size, scores_sammywemmy, label = "sammywemmy")
plt.plot(data_size, scores_wwnde, label = "wwnde")
plt.plot(data_size, scores_slybot, label = "Slybot")
plt.legend()
plt.show()
Interesting to see that none of the other answers can beat the originally suggested method with a dataset of 500,000 rows! Until 200,000 rows sammywemmy's method is a clear winner though.
The melt and filter step is efficient though; I'd probably stick with loc instead of query, especially if your data is not that large (<200_000 rows).
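For instance, the loc-based filter could look like this (a minimal sketch reusing the same melt call from the question):

df2 = df.melt(id_vars=['key'], var_name='letter', value_name='Bool')
df2 = df2.loc[df2['Bool']]   # boolean indexing instead of query('Bool == True')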
Another option is to skip melt, use numpy, and build a new dataframe:
box = df.iloc[:, 1:]
len_df = len(df)
letters = np.tile(box.columns, (len_df,1))[box]
pd.DataFrame({'key': df.key.array,
              'letter': letters,
              'Bool': [True] * len_df})
  key letter  Bool
0   1      a  True
1   2      a  True
2   3      b  True
melt moves column data and stacks it vertically, resulting in two columns: the variable (the name of the column being stacked) and its value.
d = {'key': [1,2,3], 'a': [True,True, False], 'b': [False,False,True],'c':['Batchelor','Masters','Doctorate']}
df = pd.DataFrame(d)
df2 = df.melt(id_vars=['key'], var_name = 'letter', value_name = 'Boolean')
df2=df2.drop(df2.index[df2['Boolean'] == False])
print(df2)
output
  key letter    Boolean
0   1      a       True
1   2      a       True
5   3      b       True
6   1      c  Batchelor
7   2      c    Masters
8   3      c  Doctorate
cols = Germandata.columns
percentage_list = [0.05,0.01,0.1]
for i in range(len(Germandata)):
    for percentage in percentage_list:
        columns_n = 3
        random_columns = np.random.choice(cols, columns_n, replace=False)
        local_data = Germandata.copy()
        remove_n = int(round(local_data.shape[0] * percentage, 0))
        for column_name in random_columns:
            drop_indices = np.random.choice(local_data.index, remove_n, replace=False)
            local_data.loc[drop_indices, column_name] = np.nan
The code here selects columns at random, deletes a certain percentage of observations from the data and replaces them with NaNs. The problem is that after running the loop I only get the dataframe for the last percentage in the percentage list, because it is overwritten on each iteration. How can I store the dataframe with NaNs after each iteration? Ideally I should get three dataframes, each with a different percentage of data deleted.
Try this
df_list = []
cols = Germandata.columns
percentage_list = [0.05,0.01,0.1]
for percentage in percentage_list:
    columns_n = 3
    random_columns = np.random.choice(cols, columns_n, replace=False)
    local_data = Germandata.copy()
    remove_n = int(round(local_data.shape[0] * percentage, 0))
    for column_name in random_columns:
        drop_indices = np.random.choice(local_data.index, remove_n, replace=False)
        local_data.loc[drop_indices, column_name] = np.nan
    local_data['percentage'] = percentage  # optional
    df_list.append(local_data)
df_05 = df_list[0]
df_01 = df_list[1]
df_1 = df_list[2]
Alternatively, you can use a dictionary
df_dict = {}
cols = Germandata.columns
percentage_list = [0.05,0.01,0.1]
for percentage in percentage_list:
    columns_n = 3
    random_columns = np.random.choice(cols, columns_n, replace=False)
    local_data = Germandata.copy()
    remove_n = int(round(local_data.shape[0] * percentage, 0))
    for column_name in random_columns:
        drop_indices = np.random.choice(local_data.index, remove_n, replace=False)
        local_data.loc[drop_indices, column_name] = np.nan
    local_data['percentage'] = percentage  # optional
    df_dict[str(percentage)] = local_data
df_05 = df_dict['0.05']
df_01 = df_dict['0.01']
df_1 = df_dict['0.1']
I have just started to learn Python and don't have much of a dev background. Here is the code I have written while learning.
I now want to make a function which does exactly what my "for" loop is doing, but it needs to calculate a different exp (exp, exp1, etc.) based on a different num (num, num1, etc.).
How can I do this?
import pandas as pd
index = [0,1]
s = pd.Series(['a','b'],index= index)
t = pd.Series([1,2],index= index)
t1 = pd.Series([3,4],index= index)
df = pd.DataFrame(s,columns = ["str"])
df["num"] =t
df['num1']=t1
print (df)
exp = []
for index, row in df.iterrows():
    if row['str'] == 'a':
        row['mul'] = -1 * row['num']
        exp.append(row['mul'])
    else:
        row['mul'] = 1 * row['num']
        exp.append(row['mul'])
df['exp'] = exp
print (df)
This is what I was trying to do, which gives wrong results:
import pandas as pd
index = [0,1]
s = pd.Series(['a','b'],index= index)
t = pd.Series([1,2],index= index)
t1 = pd.Series([3,4],index= index)
df = pd.DataFrame(s,columns = ["str"])
df["num"] =t
df['num1']=t1
def f(x):
    exp = []
    for index, row in df.iterrows():
        if row['str'] == 'a':
            row['mul'] = -1 * x
            exp.append(row['mul'])
        else:
            row['mul'] = 1 * x
            exp.append(row['mul'])
    return exp
df['exp'] = df['num'].apply(f)
df['exp1'] = df['num1'].apply(f)
df
Per suggestion below, I would do:
df['exp']=np.where(df.str=='a',df['num']*-1,df['num']*1)
df['exp1']=np.where(df.str=='a',df['num1']*-1,df['num1']*1)
I think you are looking for np.where
df['exp']=np.where(df.str=='a',df['num']*-1,df['num']*1)
df
Out[281]:
  str  num  num1  exp
0   a    1     3   -1
1   b    2     4    2
Normal dataframe operation:
df["exp"] = df.apply(lambda x: x["num"] * (1 if x["str"]=="a" else -1), axis=1)
Mathematical dataframe operation:
df["exp"] = ((df["str"] == 'a')-0.5) * 2 * df["num"]
I have a database with ~50 million rows. After reading the CSV into the database I only get 21,000 rows. What am I doing wrong? Thanks.
chunksize = 100000
csv_database = create_engine('sqlite:///csv_database.db', pool_pre_ping=True)
i=0
j=0
q=0
for df in pd.read_csv(filename, chunksize=chunksize, iterator=False):
    # df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j
    i += 1
    df.to_sql('table', csv_database, if_exists='append')
    j = df.index[-1] + 1
    q += 1
    print("q: " + repr(q))

columnx = df.iloc[:, 0]
columny = df.iloc[:, 1]
columnz = df.iloc[:, 2]
columnmass = df.iloc[:, 3]
out: [21739 rows x 1 columns] etc etc.
In [19]: len(df)
Out[19]: 21739
'df' doesn't contain the entire CSV file, since you set the chunk size to 100000; 21739 is just the number of rows read in the last iteration.
If you do a count(1) on the table itself, I bet you'll get the full row count rather than 21739.
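For example (a quick check, assuming the same csv_database engine and the 'table' name from your snippet), you can count the rows that actually landed in SQLite instead of inspecting the last chunk held in df:

import pandas as pd

row_count = pd.read_sql('SELECT COUNT(*) AS n FROM "table"', csv_database)
print(row_count)  # total rows inserted across all chunks, not just the last one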
The following code is working for me.
import numpy as np
import pandas as pd
import sqlite3
from sqlalchemy import create_engine
DIR = 'C:/Users/aslams/Desktop/checkpoint/'
FILE = 'SUBSCRIBER1.csv'
file = '{}{}'.format(DIR, FILE)
csv_database = create_engine('sqlite:///csv_database.db')
chunksize = 10000
i = 0
j = 0
for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += 3
    df.to_sql('data_use', csv_database, if_exists='append')
    j = df.index[-1] + 1
    print('| index: {}'.format(j))