Dask DataFrame calculate mean within multi-column groupings - python

I have a data frame as shown in the image. What I want to do is take the mean along the column 'trial': for every combination of subject, condition and sample (i.e. for each group of rows that share the same values in those three columns), average the data in the trial column (100 rows).
What I have done in pandas is the following:
sub_erp_pd = pd.DataFrame()
for j in range(1, 4):
    sub_c = subp[subp['condition'] == j]
    for i in range(1, 3073):
        sub_erp_pd = sub_erp_pd.append(sub_c[sub_c['sample'] == i].mean(), ignore_index=True)
But this takes a lot of time, so I am thinking of using Dask instead of pandas.
In Dask, however, I am having an issue creating an empty data frame, the way we create an empty data frame in pandas and append data to it.
image of data frame
EDIT
As suggested by @edesz, I made changes in my approach:
%%time
sub_erp = pd.DataFrame()
for subno in progressbar.progressbar(range(1, 82)):
    try:
        sub = pd.read_csv('../input/data/{}.csv'.format(subno, subno), header=None)
    except:
        sub = pd.read_csv('../input/data/{}.csv'.format(subno, subno), header=None)
    sub_erp = sub_erp.append(sub.groupby(['condition', 'sample'], as_index=False).mean())
Reading a file with pandas takes 13.6 seconds, while reading the same file with Dask takes 61.3 ms. But in Dask I am having trouble with the appending step.
NOTE - The original question was titled Create an empty dask dataframe and append values to it.

If I understand correctly, you need to:
- use groupby (read more here) to group by the subject, condition and sample columns; this gathers all rows that have the same value in each of these three columns into a single group
- take the average using .mean(), which gives you the mean within each group
Generate some dummy data
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)),
                  columns=['trial', 'condition', 'sample'])
df.insert(0, 'subject', [1]*10 + [2]*30 + [5]*60)
print(df.head())
subject trial condition sample
0 1 71 96 34
1 1 2 89 66
2 1 90 90 81
3 1 93 43 18
4 1 29 82 32
Pandas approach
Aggregate and take mean
df_grouped = df.groupby(['subject','condition','sample'], as_index=False)['trial'].mean()
print(df_grouped.head(15))
subject condition sample trial
0 1 18 24 89
1 1 43 18 93
2 1 67 47 81
3 1 82 32 29
4 1 85 28 97
5 1 88 13 48
6 1 89 59 23
7 1 89 66 2
8 1 90 81 90
9 1 96 34 71
10 2 0 81 19
11 2 2 39 58
12 2 2 59 94
13 2 5 42 13
14 2 9 42 4
Dask approach
Step 1. Imports
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
Step 2. Convert Pandas DataFrame to Dask DataFrame, using .from_pandas
ddf = dd.from_pandas(df, npartitions=2)
Step 3. Aggregate and take mean
ddf_grouped = (
    ddf.groupby(['subject', 'condition', 'sample'])['trial']
       .mean()
       .reset_index(drop=False)
)

with ProgressBar():
    df_grouped = ddf_grouped.compute()
[ ] | 0% Completed | 0.0s
[########################################] | 100% Completed | 0.1s
print(df_grouped.head(15))
subject condition sample trial
0 1 18 24 89
1 1 43 18 93
2 1 67 47 81
3 1 82 32 29
4 1 85 28 97
5 1 88 13 48
6 1 89 59 23
7 1 89 66 2
8 1 90 81 90
9 1 96 34 71
10 2 0 81 19
11 2 2 39 58
12 2 2 59 94
13 2 5 42 13
14 2 9 42 4
IMPORTANT NOTE: This answer does not create an empty Dask DataFrame and append values to it. Instead, it takes an alternative route (a group-by) to the desired end result: the mean of trial within each grouping of subject, condition and sample.
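For the actual data in the edited question, where there is one CSV file per subject, the same group-by can also be pushed down into Dask's CSV reader instead of appending in a loop. This is only a minimal sketch, not the question's code: it assumes the files under '../input/data/' contain named subject, condition, sample and trial columns (the header=None in the question suggests they may not, in which case names= would need to be passed to read_csv).
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

# Read all per-subject files at once (assumed paths and column names, see note above)
ddf = dd.read_csv('../input/data/*.csv')

ddf_grouped = (
    ddf.groupby(['subject', 'condition', 'sample'])['trial']
       .mean()
       .reset_index(drop=False)
)

with ProgressBar():
    sub_erp = ddf_grouped.compute()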

Related

Binning continuous variables in columns based on the time in the first column

I am trying to bin the values in each column as the average of 5 rows at a time (rows 1-5, 6-10 and so on) using Python.
My df dataset looks like this:
Unnamed: 0 C00_zscore C01_zscore C02_zscore
1 3 5 6
2 4 36 65
3 56 98 62
4 89 52 35
5 32 74 30
6 55 22 35
7 68 23 31
8 97 65 15
9 2 68 1
10 13 54 300
11
Ideally, the result should look like this:
bin C00_binned C01_binned C02_binned
1 36.8 53 39.6
2 47 46.4 76.4
Take the row number, subtract one, and floor-divide it by the bin size; the result is the row's bin. In your case the row numbers start at 1 and you want bins of size 5:
bin_num = (row_num - 1) // bin_size
Now that each row has a bin_num, group by it and then do the calculations:
df['bin_num'] = (df['Unnamed: 0'] - 1) // 5
df.groupby('bin_num').mean()
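Putting that together on the sample data from the question, a minimal sketch (assuming the first column is literally named 'Unnamed: 0', as the printout suggests, and ignoring the empty 11th row):
import pandas as pd

# Sample data from the question; 'Unnamed: 0' is the row-number column
df = pd.DataFrame({
    'Unnamed: 0': range(1, 11),
    'C00_zscore': [3, 4, 56, 89, 32, 55, 68, 97, 2, 13],
    'C01_zscore': [5, 36, 98, 52, 74, 22, 23, 65, 68, 54],
    'C02_zscore': [6, 65, 62, 35, 30, 35, 31, 15, 1, 300],
})

# Rows 1-5 go to bin 1, rows 6-10 to bin 2, and so on
df['bin_num'] = (df['Unnamed: 0'] - 1) // 5 + 1

print(df.groupby('bin_num')[['C00_zscore', 'C01_zscore', 'C02_zscore']].mean())
#          C00_zscore  C01_zscore  C02_zscore
# bin_num
# 1              36.8        53.0        39.6
# 2              47.0        46.4        76.4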

Find average of every column in a dataframe, grouped by a column, excluding one value

I have a Dataframe like the one presented below:
CPU Memory Disk Label
0 21 28 29 0
1 46 53 55 1
2 48 45 49 2
3 48 52 50 3
4 51 54 55 4
5 45 50 56 5
6 50 83 44 -1
What I want is to group by Label and find the average for each label. So far I have this:
dataset.groupby('Label')[['CPU', 'Memory', 'Disk']].mean()
which works just fine and gets the results as follows:
Label CPU Memory Disk
-1 46.441176 53.882353 54.176471
0 48.500000 58.500000 60.750000
1 45.000000 51.000000 60.000000
2 54.000000 49.000000 56.000000
3 55.000000 71.500000 67.500000
4 53.000000 70.000000 71.000000
5 21.333333 30.000000 30.666667
The only thing I haven't yet found is how to exclude everything that is labeled as -1. Is there a way to do that?
You could filter the dataframe before grouping:
# Exclude rows with Label=-1
dataset = dataset.loc[dataset['Label'] != -1]
# Group by on filtered result
dataset.groupby('Label')[['CPU', 'Memory', 'Disk']].mean()
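A minimal end-to-end sketch on the sample rows from the question (each label here has only a single row, so the group means simply reproduce those rows):
import pandas as pd

# Sample data from the question
dataset = pd.DataFrame({
    'CPU':    [21, 46, 48, 48, 51, 45, 50],
    'Memory': [28, 53, 45, 52, 54, 50, 83],
    'Disk':   [29, 55, 49, 50, 55, 56, 44],
    'Label':  [0, 1, 2, 3, 4, 5, -1],
})

# Drop everything labelled -1, then average each metric per label
filtered = dataset.loc[dataset['Label'] != -1]
print(filtered.groupby('Label')[['CPU', 'Memory', 'Disk']].mean())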

Correlation coefficient of two columns in pandas dataframe with .corr()

I would like to calculate the correlation coefficient between two columns of a pandas data frame after making one of the columns boolean. The original table had two columns: a Group column holding one of two treatment groups (now boolean) and an Age column. Those are the two columns I'm looking to correlate.
I tried the .corr() method, with:
table.corr(method='pearson')
but I could not make sense of the result it returned.
I have pasted the first rows of the boolean table below. I don't know if I'm missing parameters, or how to interpret this result. It's also strange that it's 1 as well. Thanks in advance!
Group Age
0 1 50
1 1 59
2 1 22
3 1 48
4 1 53
5 1 48
6 1 29
7 1 44
8 1 28
9 1 42
10 1 35
11 0 54
12 0 43
13 1 50
14 1 62
15 0 64
16 0 39
17 1 40
18 1 59
19 1 46
20 0 56
21 1 21
22 1 45
23 0 41
24 1 46
25 0 35
Calling .corr() on the entire DataFrame gives you a full correlation matrix:
>>> table.corr()
Group Age
Group 1.0000 -0.1533
Age -0.1533 1.0000
You can use the separate Series instead:
>>> table['Group'].corr(table['Age'])
-0.15330486289034567
This should be faster than computing the full matrix and then indexing into it (with df.corr().loc['Group', 'Age']). It should also work whether Group is of bool or int dtype.
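For example, a small sketch using the first dozen rows pasted in the question (the coefficient will differ from the full-table value above, since this is only a subset):
import pandas as pd

# First twelve rows of the pasted table
table = pd.DataFrame({
    'Group': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    'Age':   [50, 59, 22, 48, 53, 48, 29, 44, 28, 42, 35, 54],
})

# Pearson correlation between the boolean-like Group column and Age
print(table['Group'].corr(table['Age']))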
If your data frame consists of many columns, the correlation between any two of them is:
df.corr().loc['ColA', 'ColB']
while df.corr() on its own gives the matrix for all pairs of columns.

Why is Scipy's percentileofscore returning a different result than Excel's PERCENTRANK.INC?

I'm running into a strange issue with scipy's percentileofscore function.
In Excel, I have the following rows:
0
1
3
3
3
3
3
4
6
8
9
11
11
11
12
45
Next, I have a column that calculates PERCENTRANK.INC for each row:
=100 * (1-PERCENTRANK.INC($A:$A,A1))
The results are as follows:
100
94
87
87
87
87
87
54
47
40
34
27
27
27
7
0
I then take the same data, put it into an array, and calculate percentileofscore using SciPy:
100 - stats.percentileofscore(array, score, kind='strict')
However, my results are as follows:
100
94
88
88
88
88
88
56
50
44
38
31
31
31
13
7
Here are the results side by side to show the differences:
Data Excel Scipy
0 100 100
1 94 94
3 87 88
3 87 88
3 87 88
3 87 88
3 87 88
4 54 56
6 47 50
8 40 44
9 34 38
11 27 31
11 27 31
11 27 31
12 7 13
45 0 7
There are clearly some differences in the results; some of them are off by several points.
Any thoughts on how to mimic Excel's PERCENTRANK.INC function?
I'm using scipy 1.0.0, numpy 1.13.3, python 3.5.2, Excel 2016
Edit
If I do not include the max value of 45, the numbers line up. Could this be how PERCENTRANK.INC works?
The Excel function PERCENTRANK.INC excludes the max value (in my case 45), which is why it shows 0 for it, versus the 6.25 that scipy gives.
To rectify this, I modified my function to remove the max values of the array like so:
array_max = max(array)
array = [a for a in array if a != array_max]
return 100 - int(stats.percentileofscore(array, score, kind='strict'))
This gave me the correct results, and all my other tests passed.
Additional information, based on Brian Pendleton's comment: here is a link to the Excel documentation explaining PERCENTRANK.INC as well as other ranking functions. Thanks for this.
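Putting the fix together, a minimal sketch (the helper name excel_percentrank_inc is made up here, not from the post) that reproduces the Excel column from the question:
from scipy import stats

def excel_percentrank_inc(values, score):
    # Mimic 100 * (1 - PERCENTRANK.INC) by dropping the maximum value
    # before calling scipy's percentileofscore
    max_value = max(values)
    trimmed = [v for v in values if v != max_value]
    return 100 - int(stats.percentileofscore(trimmed, score, kind='strict'))

data = [0, 1, 3, 3, 3, 3, 3, 4, 6, 8, 9, 11, 11, 11, 12, 45]
print([excel_percentrank_inc(data, v) for v in data])
# [100, 94, 87, 87, 87, 87, 87, 54, 47, 40, 34, 27, 27, 27, 7, 0]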

Python Pandas add rows based on missing sequential values in a timeseries

I'm new to Python and struggling to manipulate data with the pandas library. I have a pandas DataFrame like this:
Year Value
0 91 1
1 93 4
2 94 7
3 95 10
4 98 13
And I want to complete the missing years by creating rows with empty values, like this:
Year Value
0 91 1
1 92 0
2 93 4
3 94 7
4 95 10
5 96 0
6 97 0
7 98 13
How do I do that in Python?
(I want to do this so I can plot Value without skipping years.)
I would create a new dataframe that has Year as its index and covers the entire range of years you need. Then you can simply assign the values across the two dataframes, and the index makes sure the correct rows are matched. (I've used fillna to set the missing years to zero; by default they would be set to NaN.)
import pandas as pd

df = pd.DataFrame({'Year': [91, 93, 94, 95, 98], 'Value': [1, 4, 7, 10, 13]})
df.index = df.Year

# New dataframe covering the full range of years, indexed by Year
df2 = pd.DataFrame({'Year': range(91, 99), 'Value': 0})
df2.index = df2.Year

# Align on the Year index; years missing from df become NaN, then fill with 0
df2.Value = df.Value
df2 = df2.fillna(0)
df2
Value Year
Year
91 1 91
92 0 92
93 4 93
94 7 94
95 10 95
96 0 96
97 0 97
98 13 98
Finally you can use reset_index if you don't want Year as your index:
df2.drop(columns='Year').reset_index()
Year Value
0 91 1
1 92 0
2 93 4
3 94 7
4 95 10
5 96 0
6 97 0
7 98 13
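An alternative sketch, using set_index plus reindex rather than a second dataframe (a different approach from the one above), fills the missing years in one step:
import pandas as pd

df = pd.DataFrame({'Year': [91, 93, 94, 95, 98], 'Value': [1, 4, 7, 10, 13]})

# Reindex onto the full range of years; missing years get Value 0
full = (
    df.set_index('Year')
      .reindex(range(91, 99), fill_value=0)
      .reset_index()
)
print(full)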
