Is it possible to use target row to initialize a Spark UDAF? - python

I have a problem that I am trying to solve in Spark by defining my own UDAF, attempting to mimic the recommendations given here and here. My eventual goal is to apply a series of complex bit-shifting and bit-wise boolean manipulations to a sequence of integers within a given window.
I am having issues since my use case is a fairly large dataset (~100 million rows, on which I need to perform 6 such bit-wise manipulations over groups ranging from 2 to 7 elements long), and I am therefore trying to implement this in Scala. The problem is that I'm brand new to Scala (my primary language being Python), and while Scala itself doesn't seem that difficult, the combination of a new language plus the specifics of the UDAF class as applied to windows is leaving me a little stumped.
Explaining the logic by example in python/pandas
To make the question more concrete, consider a pandas DataFrame:
import numpy as np
import pandas as pd

keep = list(range(30))
for num in (3, 5, 11, 16, 22, 24):
    keep.pop(num)

np.random.seed(100)
df = pd.DataFrame({
    'id': 'A',
    'date': pd.date_range('2018-06-01', '2018-06-30')[keep],
    'num': np.random.randint(low=1, high=100, size=30)[keep]
})
Which produces:
id date num
0 A 2018-06-01 9
1 A 2018-06-02 25
2 A 2018-06-03 68
3 A 2018-06-05 80
4 A 2018-06-06 49
5 A 2018-06-08 95
6 A 2018-06-09 53
7 A 2018-06-10 99
8 A 2018-06-11 54
9 A 2018-06-12 67
10 A 2018-06-13 99
11 A 2018-06-15 35
12 A 2018-06-16 25
13 A 2018-06-17 16
14 A 2018-06-18 61
15 A 2018-06-19 59
16 A 2018-06-21 10
17 A 2018-06-22 94
18 A 2018-06-23 87
19 A 2018-06-24 3
20 A 2018-06-25 28
21 A 2018-06-26 5
22 A 2018-06-28 2
23 A 2018-06-29 14
What I would like to be able to do is, relative to the current row, find the number of days to each other row, then perform some bit-wise manipulations based on that value. To demonstrate, staying in pandas (I have to do a full outer join and then filter to show the equivalent logic):
exp_df = (
    df[['id', 'date']]
    .merge(df, on='id')  # full outer join on 'id'
    .assign(days_diff=lambda df: (df['date_y'] - df['date_x']).dt.days)  # number of days since my date of interest
    .mask(lambda df: (df['days_diff'] > 3) | (df['days_diff'] < 0))  # nulls rows where days_diff isn't between 0 and 3
    .dropna()  # then filters the rows
    .drop('date_y', axis='columns')
    .rename({'date_x': 'date', 'num': 'nums'}, axis='columns')
    .reset_index(drop=True)
)
exp_df[['nums', 'days_diff']] = exp_df[['nums', 'days_diff']].astype('int')
Now I perform my bit-wise shifting and other logic:
# Extra values to add after bit-wise shifting (1 for shift of 1, 3 for shift of 2 ...)
additions = {val: sum(2**power for power in range(val)) for val in exp_df['days_diff'].unique()}
exp_df['shifted'] = (
    np.left_shift(exp_df['nums'].values, exp_df['days_diff'].values)
    + exp_df['days_diff'].apply(lambda val: additions[val])
)
After all this, exp_df looks like the following (first 10 rows):
id date nums days_diff shifted
0 A 2018-06-01 9 0 9
1 A 2018-06-01 25 1 51
2 A 2018-06-01 68 2 275
3 A 2018-06-02 25 0 25
4 A 2018-06-02 68 1 137
5 A 2018-06-02 80 3 647
6 A 2018-06-03 68 0 68
7 A 2018-06-03 80 2 323
8 A 2018-06-03 49 3 399
9 A 2018-06-05 80 0 80
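As a quick sanity check of the rule (left-shift num by days_diff, then add 2**days_diff - 1), take the third row above:
# row 2: num = 68, days_diff = 2, so the extra term is 2**2 - 1 = 3
num, days_diff = 68, 2
print((num << days_diff) + (2**days_diff - 1))  # 275, matching the 'shifted' column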
Now I can aggregate:
exp_df.groupby('date')['shifted'].agg(lambda group_vals: np.bitwise_and.reduce(group_vals.values))
And the final result looks like the following (if I join back to the original DataFrame):
id date num shifted
0 A 2018-06-01 9 1
1 A 2018-06-02 25 1
2 A 2018-06-03 68 0
3 A 2018-06-05 80 64
4 A 2018-06-06 49 33
5 A 2018-06-08 95 3
6 A 2018-06-09 53 1
7 A 2018-06-10 99 1
8 A 2018-06-11 54 6
9 A 2018-06-12 67 3
10 A 2018-06-13 99 3
11 A 2018-06-15 35 3
12 A 2018-06-16 25 1
13 A 2018-06-17 16 0
14 A 2018-06-18 61 21
15 A 2018-06-19 59 35
16 A 2018-06-21 10 8
17 A 2018-06-22 94 6
18 A 2018-06-23 87 3
19 A 2018-06-24 3 1
20 A 2018-06-25 28 0
21 A 2018-06-26 5 1
22 A 2018-06-28 2 0
23 A 2018-06-29 14 14
Back to the question
OK, now that I've demonstrated my logic, I realize that I could essentially do the same thing in Spark: perform a full outer join of the DataFrame on itself, then filter and aggregate.
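For concreteness, here is a rough, untested PySpark sketch of that join-filter-aggregate route (my own illustration: sdf stands for the Spark version of the DataFrame above, and the bit_and aggregate requires Spark 3.0+):
from pyspark.sql import functions as F

# `sdf` is assumed to be a Spark DataFrame with columns id, date, num
targets = sdf.select('id', F.col('date').alias('target_date'))

joined = (
    targets.join(sdf, on='id')                                     # self-join on 'id'
           .withColumn('days_diff', F.datediff('date', 'target_date'))
           .where(F.col('days_diff').between(0, 3))                # keep the 0-3 day window
           # shiftleft(num, d) + (2**d - 1), the same rule as the pandas version
           .withColumn('shifted',
                       F.expr('shiftleft(num, days_diff) + (shiftleft(1, days_diff) - 1)'))
)
# joined.cache()  # possibly worth it if several window sizes reuse this frame

result = (
    joined.groupBy('id', 'target_date')
          .agg(F.expr('bit_and(shifted)').alias('shifted'))        # bit_and is Spark 3.0+
)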
What I want to know is if I can avoid performing a full join, and instead create my own UDAF to perform this aggregation over a window function, using the target row as an input. Basically, I need to create the equivalent of the "days_diff" column in order to perform my required logic, which means comparing the target date to each of the other dates within my specified window. Is this even possible?
Also, am I even justified in worrying about using a self-join? I know that Spark does all of its processing lazily, so it's very possible that I wouldn't need to worry. Should I expect similar performance if I did all of this with a self-join versus my imaginary UDAF applied over a window? The logic is more sequential and easier to follow using the join-filter-aggregate method, which is a clear advantage.
One thing to know is that I will be performing this logic on multiple windows. In principle, I could cache the largest version of the filtered DataFrame after the join, then use that for subsequent calculations.

Related

Dividing one dataframe by another in python using pandas with float values

I have two separate data frames named df1 and df2 as shown below:
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 51 58 0.879310
1 1 16 20 95 115 0.826087
2 2 9 9 33 42 0.785714
3 2 12 86 51 137 0.372263
4 2 67 41 98 139 0.705036
5 3 8 0 0 0 0.000000
6 4 99 32 26 58 0.448276
7 4 101 100 24 124 0.193548
8 4 115 69 26 95 0.273684
9 5 6 40 57 97 0.587629
10 5 19 53 87 140 0.621429
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 64 71 0.901408
1 1 16 10 90 100 0.900000
2 2 9 79 86 165 0.521212
3 2 12 12 73 85 0.858824
4 2 67 54 96 150 0.640000
5 3 8 0 0 0 0.000000
6 4 99 86 28 114 0.245614
7 4 101 32 25 57 0.438596
8 4 115 97 16 113 0.141593
9 5 6 86 43 129 0.333333
10 5 19 59 27 86 0.313953
I have already found the sum values for df1 and df2 in Alt_Allele_Count and Coverage_Depth, but I need to divide the resulting Alt_Allele_Count by the Coverage_Depth to find the total allele frequency (AF). I tried dividing the two variables and got the error message:
TypeError: float() argument must be a string or a number, not 'DataFrame'
when I tried to convert them to floats, and this table when I left them as DataFrames:
Alt_Allele_Count Coverage_Depth
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
My code so far:
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1[['Alt_Allele_Count']] + df2[['Alt_Allele_Count']])
print(Alt_Allele_Count)
Coverage_Depth = (df1[['Coverage_Depth']] + df2[['Coverage_Depth']]).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
The error stems from the difference between a pandas Series and a DataFrame. A Series is a one-dimensional structure like a single column, while a DataFrame is a two-dimensional object like a table. Two Series combine element-wise (aligning on the index), while operations between DataFrames align on both the index and the column labels, so frames with different column names produce nothing but NaN.
Taking slices of a DataFrame can result in either a Series or a DataFrame object depending on how you do it:
df['column_name'] -> Series
df[['column_name', 'column_2']] -> DataFrame
So in the line:
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
df1[['Ref_Allele_Count']] is a single-column DataFrame rather than a Series.
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
This should return the correct result here, and the same goes for the rest of the columns you're adding together.
The fix is to use one set of brackets [] when referring to a column of a pandas DataFrame, rather than two.
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
# note that I changed your double brackets ([["col_name"]]) to single (["col_name"])
# this results in pd.Series objects instead of pd.DataFrame objects
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1['Alt_Allele_Count'] + df2['Alt_Allele_Count'])
print(Alt_Allele_Count)
Coverage_Depth = (df1['Coverage_Depth'] + df2['Coverage_Depth']).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
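As a side note, here is a tiny demonstration (my own, with illustrative numbers) of why the double-bracket version produced the all-NaN table: arithmetic between two DataFrames aligns on column names, and 'Alt_Allele_Count' never lines up with 'Coverage_Depth'.
import pandas as pd

a = pd.DataFrame({'Alt_Allele_Count': [95, 185]})
b = pd.DataFrame({'Coverage_Depth': [129, 215]})
print(a / b)                                         # all NaN: the column labels never match
print(a['Alt_Allele_Count'] / b['Coverage_Depth'])   # Series division aligns on the index and works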

How Count/Sum Values Based On Multiple Conditions in Multiple Columns

I have a shipping records table with approx. 100K rows, and I want to calculate, for each row and for each material, how many qtys were shipped in the last 30 days.
As you can see in the example below, the calculated qty depends on (material, shipping date).
I've tried to write some very basic code but couldn't find a way to apply it to all rows.
df[(df['malzeme']==material) & (df['cikistarihi'] < shippingDate) & (df['cikistarihi'] >= (shippingDate-30))]['qty'].sum()
material  shippingDate  qty  shipped qtys in last 30 days
A         23.01.2019      8  0
A         28.01.2019     41  8
A         31.01.2019     66  49 (8+41)
A         20.03.2019     67  0
B         17.02.2019     53  0
B         26.02.2019     35  53
B         11.03.2019      4  88 (53+35)
B         20.03.2019     67  106 (35+4+67)
You can use .groupby with .rolling:
# convert the shippingData to datetime:
df["shippingDate"] = pd.to_datetime(df["shippingDate"], dayfirst=True)
# sort the values (if they aren't already)
df = df.sort_values(["material", "shippingDate"])
df["shipped qtys in last 30 days"] = (
df.groupby("material")
.rolling("30D", on="shippingDate", closed="left")["qty"]
.sum()
.fillna(0)
.values
)
print(df)
Prints:
material shippingDate qty shipped qtys in last 30 days
0 A 2019-01-23 8 0.0
1 A 2019-01-28 41 8.0
2 A 2019-01-31 66 49.0
3 A 2019-03-20 67 0.0
4 B 2019-02-17 53 0.0
5 B 2019-02-26 35 53.0
6 B 2019-03-11 4 88.0
7 B 2019-03-20 67 39.0
EDIT: Add .sort_values() before groupby

Pandas Q-cut: Binning Data using an Expanding Window Approach

This question is somewhat similar to a 2018 question I have found on an identical topic.
I am hoping that if I ask it in a simpler way, someone will be able to figure out a simple fix to the issue that I am currently facing:
I have a timeseries dataframe named "df", which is roughly structured as follows:
V_1 V_2 V_3 V_4
1/1/2000 17 77 15 88
1/2/2000 85 78 6 59
1/3/2000 31 9 49 16
1/4/2000 81 55 28 33
1/5/2000 8 82 82 4
1/6/2000 89 87 57 62
1/7/2000 50 60 54 49
1/8/2000 65 84 29 26
1/9/2000 12 57 53 84
1/10/2000 6 27 70 56
1/11/2000 61 6 38 38
1/12/2000 22 8 82 58
1/13/2000 17 86 65 42
1/14/2000 9 27 42 86
1/15/2000 63 78 18 35
1/16/2000 73 13 51 61
1/17/2000 70 64 75 83
If I wanted to use all the columns to produce daily quantiles, I would follow this approach:
quantiles = df.apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
The output looks like this:
V_1 V_2 V_3 V_4
2000-01-01 1 3 0 4
2000-01-02 4 3 0 3
2000-01-03 2 0 2 0
2000-01-04 4 1 0 0
2000-01-05 0 4 4 0
2000-01-06 4 4 3 3
2000-01-07 2 2 3 2
2000-01-08 3 4 1 0
2000-01-09 0 2 2 4
2000-01-10 0 1 4 2
2000-01-11 2 0 1 1
2000-01-12 1 0 4 2
2000-01-13 1 4 3 1
2000-01-14 0 1 1 4
2000-01-15 3 3 0 1
2000-01-16 4 0 2 3
2000-01-17 3 2 4 4
What I want to do:
I would like to produce quantiles of the data in "df" using observations that occurred before and at a specific point in time. I do not want to include observations that occurred after the specific point in time.
For instance:
To calculate the bins for the 2nd of January 2000, I would like to just use observations from the 1st and 2nd of January 2000; and, nothing after the dates;
To calculate the bins for the 3rd of January 2000, I would like to just use observations from the 1st, 2nd and 3rd of January 2000; and, nothing after the dates;
To calculate the bins for the 4th of January 2000, I would like to just use observations from the 1st, 2nd, 3rd and 4th of January 2000; and, nothing after the dates;
To calculate the bins for the 5th of January 2000, I would like to just use observations from the 1st, 2nd, 3rd, 4th and 5th of January 2000; and, nothing after the dates;
Otherwise put, I would like to use this approach to calculate the bins for ALL the datapoints in "df". That is, to calculate bins from the 1st of January 2000 to the 17th of January 2000.
In short, what I want to do is to conduct an expanding window q-cut (if there is any such thing). It helps to avoid "look-ahead" bias when dealing with timeseries data.
This code block below is wrong, but it illustrates exactly what I am trying to accomplish:
quantiles = df.expanding().apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
Does anyone have any ideas of how to do this in a simpler fashion than this?
I am new, so take this with a grain of salt, but when broken down I believe your question is a duplicate, because it requires simple datetime index slicing, answered HERE.
lt_jan_5 = df.loc[:'2000-01-05'].apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
print(lt_jan_5)
V_1 V_2 V_3 V_4
2000-01-01 1 2 1 4
2000-01-02 4 3 0 3
2000-01-03 2 0 3 1
2000-01-04 3 1 2 2
2000-01-05 0 4 4 0
Hope this is helpful
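As an extension of that idea (my own sketch, not part of the answer above): to get expanding-window bins for every date, loop the same slice over each date, assuming a DatetimeIndex as in the answer, and keep only that date's row of labels.
import pandas as pd

rows = []
for end_date in df.index[4:]:            # start once 5 observations exist, so 5 bins are possible
    binned = df.loc[:end_date].apply(
        lambda col: pd.qcut(col, 5, duplicates='drop', labels=False), axis=0
    )
    rows.append(binned.iloc[-1])         # keep only the current date's bin labels

expanding_quantiles = pd.DataFrame(rows)
print(expanding_quantiles.head())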

Dask DataFrame calculate mean within multi-column groupings

I have a data frame as shown in the image; what I want to do is take the mean along the column 'trial'. That is, for every subject, condition and sample (e.g. where all three of these columns have value one), take the average of the data along the 'trial' column (100 rows).
What I have done in pandas is the following:
sub_erp_pd = pd.DataFrame()
for j in range(1, 4):
    sub_c = subp[subp['condition'] == j]
    for i in range(1, 3073):
        sub_erp_pd = sub_erp_pd.append(sub_c[sub_c['sample'] == i].mean(), ignore_index=True)
But this takes a lot of time, so I am thinking of using Dask instead of pandas.
In Dask, though, I am having an issue creating an empty data frame, like we create an empty data frame in pandas and append data to it.
image of data frame
As suggested by @edesz, I made changes to my approach.
EDIT
%%time
sub_erp = pd.DataFrame()
for subno in progressbar.progressbar(range(1, 82)):
    try:
        sub = pd.read_csv('../input/data/{}.csv'.format(subno, subno), header=None)
    except:
        sub = pd.read_csv('../input/data/{}.csv'.format(subno, subno), header=None)
    sub_erp = sub_erp.append(sub.groupby(['condition', 'sample'], as_index=False).mean())
Reading a file using pandas takes 13.6 seconds, while reading it using Dask takes 61.3 ms. But in Dask, I am having trouble with the appending.
NOTE - The original question was titled Create an empty dask dataframe and append values to it.
If I understand correctly, you need to:
- use groupby (read more here) to group on the subject, condition and sample columns; this gathers all rows that share the same value in each of these three columns into a single group
- take the average using .mean(); this gives you the mean within each group
Generate some dummy data
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)),
                  columns=['trial', 'condition', 'sample'])
df.insert(0, 'subject', [1]*10 + [2]*30 + [5]*60)
print(df.head())
subject trial condition sample
0 1 71 96 34
1 1 2 89 66
2 1 90 90 81
3 1 93 43 18
4 1 29 82 32
Pandas approach
Aggregate and take mean
df_grouped = df.groupby(['subject','condition','sample'], as_index=False)['trial'].mean()
print(df_grouped.head(15))
subject condition sample trial
0 1 18 24 89
1 1 43 18 93
2 1 67 47 81
3 1 82 32 29
4 1 85 28 97
5 1 88 13 48
6 1 89 59 23
7 1 89 66 2
8 1 90 81 90
9 1 96 34 71
10 2 0 81 19
11 2 2 39 58
12 2 2 59 94
13 2 5 42 13
14 2 9 42 4
Dask approach
Step 1. Imports
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
Step 2. Convert Pandas DataFrame to Dask DataFrame, using .from_pandas
ddf = dd.from_pandas(df, npartitions=2)
Step 3. Aggregate and take mean
ddf_grouped = (
    ddf.groupby(['subject','condition','sample'])['trial']
       .mean()
       .reset_index(drop=False)
)
with ProgressBar():
    df_grouped = ddf_grouped.compute()
[ ] | 0% Completed | 0.0s
[########################################] | 100% Completed | 0.1s
print(df_grouped.head(15))
subject condition sample trial
0 1 18 24 89
1 1 43 18 93
2 1 67 47 81
3 1 82 32 29
4 1 85 28 97
5 1 88 13 48
6 1 89 59 23
7 1 89 66 2
8 1 90 81 90
9 1 96 34 71
10 2 0 81 19
11 2 2 39 58
12 2 2 59 94
13 2 5 42 13
14 2 9 42 4
IMPORTANT NOTE: The approach in this answer does not create an empty Dask DataFrame and append values to it. Instead, it uses a group-by to reach the same end result: the mean of trial within each grouping of subject, condition and sample.

Find average of every column in a dataframe, grouped by column, excluding one value

I have a Dataframe like the one presented below:
CPU Memory Disk Label
0 21 28 29 0
1 46 53 55 1
2 48 45 49 2
3 48 52 50 3
4 51 54 55 4
5 45 50 56 5
6 50 83 44 -1
What I want is to group by and find the average for each label. So far I have this:
dataset.groupby('Label')['CPU', 'Memory', 'Disk'].mean(), which works just fine, and I get the results as follows:
Label CPU Memory Disk
-1 46.441176 53.882353 54.176471
0 48.500000 58.500000 60.750000
1 45.000000 51.000000 60.000000
2 54.000000 49.000000 56.000000
3 55.000000 71.500000 67.500000
4 53.000000 70.000000 71.000000
5 21.333333 30.000000 30.666667
The only thing I haven't yet found is how to exclude everything that is labeled as -1. Is there a way to do that?
You could filter the dataframe before grouping:
# Exclude rows with Label == -1
dataset = dataset.loc[dataset['Label'] != -1]
# Group by on the filtered result (note the double brackets:
# newer pandas requires a list of column names here)
dataset.groupby('Label')[['CPU', 'Memory', 'Disk']].mean()
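The same thing as a single expression, if you prefer not to reassign dataset:
dataset[dataset['Label'] != -1].groupby('Label')[['CPU', 'Memory', 'Disk']].mean()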
