Binning continuous variables in columns based on the time of the first column - python

I am trying to bin the values in each column as the average of every 5 rows (rows 1-5, 6-10, and so on) using Python.
My df dataset looks like this:
Unnamed: 0  C00_zscore  C01_zscore  C02_zscore
         1           3           5           6
         2           4          36          65
         3          56          98          62
         4          89          52          35
         5          32          74          30
         6          55          22          35
         7          68          23          31
         8          97          65          15
         9           2          68           1
        10          13          54         300
        11
Ideally, the result should look like this:
bin  C00_binned  C01_binned  C02_binned
  1        36.8          53        39.6
  2          47        46.4        76.4

Take the row number, subtract one, integer-divide by the bin size, and add one; the result is the row's bin (1 for rows 1-5, 2 for rows 6-10, and so on). In your case the row numbers start at 1 and you want bins of size 5.
bin_num = (row_num - 1) // bin_size + 1
Now that each row has a bin_num, group by it and take the mean:
df['bin_num'] = (df['Unnamed: 0'] - 1) // 5 + 1
df.groupby('bin_num').mean()
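A minimal end-to-end sketch of this approach on the sample data (the index column name Unnamed: 0 is taken from the question; the trailing empty row 11 is left out):
import pandas as pd

# Sample data from the question (rows 1-10).
df = pd.DataFrame({
    'Unnamed: 0': range(1, 11),
    'C00_zscore': [3, 4, 56, 89, 32, 55, 68, 97, 2, 13],
    'C01_zscore': [5, 36, 98, 52, 74, 22, 23, 65, 68, 54],
    'C02_zscore': [6, 65, 62, 35, 30, 35, 31, 15, 1, 300],
})

# Assign each row to a bin of 5 consecutive rows, then average within each bin.
df['bin'] = (df['Unnamed: 0'] - 1) // 5 + 1
binned = df.groupby('bin')[['C00_zscore', 'C01_zscore', 'C02_zscore']].mean()
print(binned)
#      C00_zscore  C01_zscore  C02_zscore
# bin
# 1          36.8        53.0        39.6
# 2          47.0        46.4        76.4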

Related

Pandas fill dataframe with count of values within a range from another dataframe

I currently have two dataframes, df_ages and df_count:
In [1]: df_ages
Out [1]:
Enrolled Age
1 Y 44
2 Y 35
3 N 37
4 Y 55
5 N 26
6 Y 19
7 N 18
8 N 49
9 Y 26
10 Y 25
11 Y 25
12 Y 32
13 Y 25
14 N 50
15 N 58
In [2]: df_count
Out [2]:
Min Max counts percentage
1 18 25
2 26 35
3 36 45
4 46 55
5 56 65
I am looking for code to populate the df_count ['counts'] column with the number of people whose age falls within the Min and Max range of that row.
The ['percentage'] column should be each count as a percentage of the total number of entries.
The desired resulting output is shown below:
In [2]: df_count
Out [2]:
Min Max counts percentage
1 18 25 5 33.3
2 26 35 4 26.7
3 36 45 2 13.3
4 46 55 3 20.0
5 56 65 1 6.7
You can try apply on rows with Series.between
df_count['counts'] = df_count.apply(lambda row: df_ages['Age'].between(row['Min'], row['Max']).sum(), axis=1)
df_count['percentage'] = df_count['counts'].div(len(df_ages)).mul(100).round(1)
print(df_count)
Min Max counts percentage
0 18 25 5 33.3
1 26 35 4 26.7
2 36 45 2 13.3
3 46 55 3 20.0
4 56 65 1 6.7
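An alternative sketch using pd.cut, which places every age into its interval in one pass. This assumes the Min/Max rows form contiguous, non-overlapping ranges as in the example (18-25, 26-35, ...), and that df_ages and df_count are the frames from the question.
import pandas as pd

# Build right-inclusive bin edges from the Min/Max columns: [17, 25, 35, 45, 55, 65]
edges = [df_count['Min'].iloc[0] - 1] + df_count['Max'].tolist()
binned = pd.cut(df_ages['Age'], bins=edges)

# value_counts(sort=False) keeps the counts in interval (category) order
df_count['counts'] = binned.value_counts(sort=False).to_numpy()
df_count['percentage'] = (df_count['counts'] / len(df_ages) * 100).round(1)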

Calculate mean from multiple columns

I have 12 columns filled with wages. I want to calculate the mean, but my output is 12 different means, one for each column; I want a single mean calculated over the whole dataset.
This is how my df looks:
Month 1 Month 2 Month 3 Month 4 ... Month 9 Month 10 Month 11 Month 12
0 1429.97 2816.61 2123.29 2123.29 ... 2816.61 2816.61 1429.97 1776.63
1 3499.53 3326.20 3499.53 2112.89 ... 1939.56 2806.21 2632.88 2459.55
2 2599.95 3119.94 3813.26 3466.60 ... 3466.60 3466.60 2946.61 2946.61
3 2599.95 2946.61 3466.60 2773.28 ... 2253.29 3119.94 1906.63 2773.28
I used this code to calculate the mean:
mean = df.mean()
Do I have to convert these 12 columns into one column, or how else can I calculate a single mean?
Just call the mean again to get the mean of those 12 values:
df.mean().mean()
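One caveat worth noting: df.mean().mean() averages the 12 per-column means, which equals the overall mean only when every column contributes the same number of non-null values (as in the wages frame above, where every column is fully populated). A small sketch with made-up data showing where the two can differ:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [10.0, np.nan, np.nan, np.nan]})

print(df.mean().mean())   # (2.5 + 10.0) / 2 = 6.25  (mean of the column means)
print(df.stack().mean())  # (1+2+3+4+10) / 5  = 4.0  (mean over all values)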
Use numpy.mean after converting the values to a 2d array:
mean = np.mean(df.to_numpy())
print (mean)
2914.254166666667
Or use DataFrame.melt:
mean = df.melt()['value'].mean()
print (mean)
2914.254166666666
You can also use stack:
df.stack().mean()
Suppose this dataframe:
>>> df
A B C D E F G H
0 60 1 59 25 8 27 34 43
1 81 48 32 30 60 3 90 22
2 66 15 21 5 23 36 83 46
3 56 42 14 86 41 64 89 56
4 28 53 89 89 52 13 12 39
5 64 7 2 16 91 46 74 35
6 81 81 27 67 26 80 19 35
7 56 8 17 39 63 6 34 26
8 56 25 26 39 37 14 41 27
9 41 56 68 38 57 23 36 8
>>> df.stack().mean()
41.6625

Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum, for each row, the values of only those columns whose name matches a string (in this case, all columns ending in _CAP), and store the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried to use the solution recommended in this question here:
Summing columns in Dataframe that have matching column headers
However, that solution doesn't work for me: it sums columns that have the exact same name, so a simple groupby can accomplish the result there, whereas I am trying to sum columns that match a specific string only.
Code to recreate above sample dataset:
data1 = [['1', 33,25,16,50], ['2', 34,22,11,66],
['3', 22,12,15,42],['4', 46,45,66,54],
['5',16,6,23,75], ['6', 21,42,433,50]]
df = pd.DataFrame(data1, columns = ['ID', 'Length','Width','Range_CAP','Capacity_CAP'])
Let us do filter
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other columns contain CAP elsewhere in the name (for example at the front), anchor the match at the end with a regex:
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask whose values are True if and only if the column name ends with _CAP. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
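For reference, a quick sketch of what that mask looks like for the example columns (output shown as comments):
mask = df.columns.str.endswith('_CAP')
print(mask)
# [False False False  True  True]
print(df.columns[mask])
# Index(['Range_CAP', 'Capacity_CAP'], dtype='object')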
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]
    else:
        pass

Dask DataFrame calculate mean within multi-column groupings

I have a data frame, as shown in the image below. What I want to do is take the mean along the column 'trial': for every subject, condition and sample (for example, when all three of these columns have the value one), average the data along the trial column (100 rows).
What I have done in pandas is the following:
sub_erp_pd = pd.DataFrame()
for j in range(1, 4):
    sub_c = subp[subp['condition'] == j]
    for i in range(1, 3073):
        sub_erp_pd = sub_erp_pd.append(sub_c[sub_c['sample'] == i].mean(), ignore_index=True)
But this takes a lot of time, so I am thinking of using Dask instead of pandas.
However, in Dask I am having an issue creating an empty data frame, the way we create an empty data frame in pandas and append data to it.
image of data frame
As suggested by #edesz, I made changes to my approach.
EDIT
%%time
sub_erp=pd.DataFrame()
for subno in progressbar.progressbar(range(1, 82)):
    try:
        sub = pd.read_csv('../input/data/{}.csv'.format(subno, subno), header=None)
    except:
        sub = pd.read_csv('../input/data/{}.csv'.format(subno, subno), header=None)
    sub_erp = sub_erp.append(sub.groupby(['condition', 'sample'], as_index=False).mean())
Reading a file using pandas takes 13.6 seconds, while reading the same file using Dask takes 61.3 ms. But in Dask I am having trouble with the appending.
NOTE - The original question was titled Create an empty dask dataframe and append values to it.
If I understand correctly, you need to
- use groupby (read more here) in order to group the subject, condition and sample columns
  - this will gather all rows which have the same value in each of these three columns into a single group
- take the average using .mean()
  - this will give you the mean within each group
Generate some dummy data
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)),
                  columns=['trial', 'condition', 'sample'])
df.insert(0,'subject',[1]*10 + [2]*30 + [5]*60)
print(df.head())
subject trial condition sample
0 1 71 96 34
1 1 2 89 66
2 1 90 90 81
3 1 93 43 18
4 1 29 82 32
Pandas approach
Aggregate and take mean
df_grouped = df.groupby(['subject','condition','sample'], as_index=False)['trial'].mean()
print(df_grouped.head(15))
subject condition sample trial
0 1 18 24 89
1 1 43 18 93
2 1 67 47 81
3 1 82 32 29
4 1 85 28 97
5 1 88 13 48
6 1 89 59 23
7 1 89 66 2
8 1 90 81 90
9 1 96 34 71
10 2 0 81 19
11 2 2 39 58
12 2 2 59 94
13 2 5 42 13
14 2 9 42 4
Dask approach
Step 1. Imports
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
Step 2. Convert Pandas DataFrame to Dask DataFrame, using .from_pandas
ddf = dd.from_pandas(df, npartitions=2)
Step 3. Aggregate and take mean
ddf_grouped = (
    ddf.groupby(['subject', 'condition', 'sample'])['trial']
    .mean()
    .reset_index(drop=False)
)
with ProgressBar():
    df_grouped = ddf_grouped.compute()
[ ] | 0% Completed | 0.0s
[########################################] | 100% Completed | 0.1s
print(df_grouped.head(15))
subject condition sample trial
0 1 18 24 89
1 1 43 18 93
2 1 67 47 81
3 1 82 32 29
4 1 85 28 97
5 1 88 13 48
6 1 89 59 23
7 1 89 66 2
8 1 90 81 90
9 1 96 34 71
10 2 0 81 19
11 2 2 39 58
12 2 2 59 94
13 2 5 42 13
14 2 9 42 4
IMPORTANT NOTE: This answer does not create an empty Dask DataFrame and append values to it in order to calculate the mean within groupings of subject, condition and sample. Instead, it provides an alternate approach (using groupby) to obtain the same end result.
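If the goal really is to build one Dask DataFrame from the per-subject CSV files rather than appending to an empty frame, a rough sketch of one way to do it is below (file paths follow the question; with header=None the columns would still need to be renamed before the groupby above):
import dask.dataframe as dd

# Read each subject's file lazily and combine the parts with dd.concat
# instead of appending to an empty dataframe.
parts = [dd.read_csv('../input/data/{}.csv'.format(subno), header=None)
         for subno in range(1, 82)]
ddf = dd.concat(parts)
dd.read_csv also accepts a glob pattern such as '../input/data/*.csv', which avoids the explicit loop.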

Find average of every column in a dataframe, grouped by column, excluding one value

I have a Dataframe like the one presented below:
CPU Memory Disk Label
0 21 28 29 0
1 46 53 55 1
2 48 45 49 2
3 48 52 50 3
4 51 54 55 4
5 45 50 56 5
6 50 83 44 -1
What I want is to group by Label and find the average for each label. So far I have
dataset.groupby('Label')['CPU', 'Memory', 'Disk'].mean()
which works just fine and gives the results as follows:
Label CPU Memory Disk
-1 46.441176 53.882353 54.176471
0 48.500000 58.500000 60.750000
1 45.000000 51.000000 60.000000
2 54.000000 49.000000 56.000000
3 55.000000 71.500000 67.500000
4 53.000000 70.000000 71.000000
5 21.333333 30.000000 30.666667
The only thing I haven't yet found is how to exclude everything that is labeled as -1. Is there a way to do that?
You could filter the dataframe before grouping:
# Exclude rows with Label=-1
dataset = dataset.loc[dataset['Label'] != -1]
# Group by on the filtered result
dataset.groupby('Label')[['CPU', 'Memory', 'Disk']].mean()
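Alternatively, if you want to keep the full dataset intact, a small sketch that aggregates first and then drops the -1 group from the result:
means = dataset.groupby('Label')[['CPU', 'Memory', 'Disk']].mean()
means = means.drop(index=-1, errors='ignore')  # drop the -1 group if present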
