How to average n adjacent columns together in a python pandas dataframe?

I have a dataframe that is a histogram with 2000 bins, with a column for each bin. I need to reduce it down to a quarter of the size - 500 bins.
Let's say we have the original dataframe:
A B C D E F G H
1 1 1 1 2 2 2 2
I want to reduce it to a new quarter width dataframe:
A B
1 2
where, in the new dataframe, A is the average of the original A, B, C, and D, i.e. (A+B+C+D)/4.
Feels like it should be easy, but can't work out how to do it! Cheers :)

Assuming you want to group the first 4 and last 4 columns (or any number of columns 4 by 4):
import numpy as np

# build group labels [0,0,0,0,1,1,1,1] by integer-dividing the column
# positions by 4, then average each block of 4 columns
out = df.groupby(np.arange(df.shape[1]) // 4, axis=1).mean()
output:
     0    1
0  1.0  2.0
If you further want to relabel the columns A/B:
out = (df.groupby(np.arange(df.shape[1]) // 4, axis=1).mean()
         .set_axis(['A', 'B'], axis=1)
       )
output:
     A    B
0  1.0  2.0
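Note that groupby(..., axis=1) is deprecated in recent pandas releases. As a minimal self-contained sketch (the one-row sample data and the A/B relabeling are assumptions for illustration), the same result can be had by grouping the transposed frame:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 2, 2, 2, 2]], columns=list('ABCDEFGH'))
# transpose so the original columns become rows, group them in blocks of 4,
# average, then transpose back; this avoids the deprecated axis=1 groupby
out = df.T.groupby(np.arange(df.shape[1]) // 4).mean().T
out.columns = ['A', 'B']  # optional relabeling
print(out)
#      A    B
# 0  1.0  2.0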

Related

Compare values across multiple columns in pandas and count the instances in which the value in the last column is lower than the others

I have a DataFrame that looks like this:
[image of the DataFrame in the original post]
What I would like to do is to compare the values in all four columns (A, B, C, and D) for every row, count the number of times D has a smaller value than A, B, or C in that row, and store the result in a 'Count' column. So for instance, 'Count' should be 1 for the second and third rows and 2 for the last row.
Thank you in advance!
You can vectorize the operation using the gt and sum methods along an axis:
df['Count'] = df[['A', 'B', 'C']].gt(df['D'], axis=0).sum(axis=1)
print(df)
# Output
   A  B  C  D  Count
0  1  2  3  4      0
1  4  3  2  1      3
2  2  1  4  3      1
In the future, please do not post data as an image.
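For completeness, here is a self-contained sketch of this approach; the sample data is an assumption reconstructed from the printed output above:

import pandas as pd

df = pd.DataFrame({'A': [1, 4, 2], 'B': [2, 3, 1],
                   'C': [3, 2, 4], 'D': [4, 1, 3]})
# compare each of A, B, C elementwise against D (row-aligned via axis=0),
# then count the True values in each row
df['Count'] = df[['A', 'B', 'C']].gt(df['D'], axis=0).sum(axis=1)
print(df)
#    A  B  C  D  Count
# 0  1  2  3  4      0
# 1  4  3  2  1      3
# 2  2  1  4  3      1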
Use a lambda function to compare D against all columns, then sum across the columns:
import pandas as pd

data = {'A': [1, 47, 4316, 8511],
        'B': [4, 1, 3, 4],
        'C': [2, 7, 9, 1],
        'D': [32, 17, 1, 0]}
df = pd.DataFrame(data)
# for each row, flag the columns whose value exceeds D (D < D is always
# False), then count the flags
df['Count'] = df.apply(lambda x: x['D'] < x, axis=1).sum(axis=1)
Output:
      A  B  C   D  Count
0     1  4  2  32      0
1    47  1  7  17      1
2  4316  3  9   1      3
3  8511  4  1   0      3
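As a side note, the row-wise apply can likely be replaced by the same vectorized comparison used in the first answer; since D > D is never True, including column D does not change the count:

# vectorized equivalent of the lambda above, assuming the same df
df['Count'] = df[['A', 'B', 'C', 'D']].gt(df['D'], axis=0).sum(axis=1)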

Pandas sample different fractions for each group after groupby

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})
grouped = df.groupby('b')
Now I want to sample from each group, e.g. 30% from the group where b = 1 and 20% from the group where b = 0. How should I do that?
Also, if I want 150% for some group, can I do that?
You can dynamically return a random sample with a different percentage per group. This works for percentages below 100% (see example 1) and, by passing replace=True, above 100% (see example 2).
Using np.select, create a new column c that holds, for each row, the fraction of its group to be sampled (20%, 40%, or whatever you set).
From there, you can sample rows per group based on those fractions. The expression grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])) creates a multiindex series of the output you are looking for, but it requires some cleanup; it is easier to grab its .index and filter the original dataframe with .loc, keeping columns 'a' and 'b', than to clean up the messy multiindex series.
import numpy as np

grouped = df.groupby('b', group_keys=False)
# fraction of each group to sample: 40% where b == 0, 20% where b == 1
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)], [0.4, 0.2])
df.loc[grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])).index, ['a', 'b']]
Out[1]:
   a  b
4  5  0
6  7  0
1  2  1
If you would like to return a larger random sample by duplicating existing values, simply pass replace=True, then do some cleanup to get the output.
grouped = df.groupby('b', group_keys=False)
v = df['b'].value_counts()
# sample's frac parameter cannot exceed 1 without replace=True, so calculate
# the integer number of rows per group explicitly:
# 120% of group 0, 200% of group 1
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)],
                    [int(v.loc[0] * 1.2), int(v.loc[1] * 2)])
(grouped.apply(lambda x: x['b'].sample(x['c'].iloc[0], replace=True))
 .reset_index()
 .rename({'index': 'a'}, axis=1))
Out[2]:
    a  b
0   5  0
1   3  0
2   6  0
3   3  0
4   1  1
5   0  1
6   2  1
7   1  1
8   0  1
9   2  1
(Here the 'a' column holds the original row indices of the sampled rows: int(4 * 1.2) = 4 draws from group 0 and int(3 * 2) = 6 draws from group 1, with duplicates allowed.)
You can get a DataFrame from the GroupBy object with, e.g., grouped.get_group(0). If you want to sample from that you can use the .sample method. For instance grouped.get_group(0).sample(frac=0.2) gives:
   a  b
5  6  0
For the example you give both samples will only give one element because the groups have 4 and 3 elements and 0.2*4 = 0.8 and 0.3*3 = 0.9 both round to 1.
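A more direct variant, sketched under the assumption that you only need the sampled rows back in a single frame, is to look up each group's fraction from a dict inside one apply (the fractions dict and variable names here are illustrative):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})
fractions = {0: 0.2, 1: 0.3}  # fraction to sample per value of b

# g.name is the group key, so each group draws its own fraction;
# fractions above 1.0 would additionally need replace=True
sampled = (df.groupby('b', group_keys=False)
             .apply(lambda g: g.sample(frac=fractions[g.name])))
print(sampled)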

Calculate and add columns to a data frame using multiple columns for sorting

I have a pretty simple data frame with columns A, B, C and I would like to add several more. I would like to create two cumulative-sum columns and store them in that same data frame. Currently I do it by creating two differently ordered data frames and plotting the results on the same graph, but I'm guessing there is a more efficient approach. The columns I'm trying to create are:
(1) Column D = the cumulative sum of Column C ordered by increasing values in Column A
(2) Column E = The cumulative sum of Column C ordered by decreasing values in column B
This should work:
import pandas as pd

# cumsum gives the running total; sorting the result afterwards puts the
# values in positional order for the new column
df = pd.read_csv('Sample.csv')
df.insert(3, 'D', df.sort_values(by=['A']).C.cumsum().sort_values().values)
df.insert(4, 'E', df.sort_values(by=['B'], ascending=False).C.cumsum().sort_values().values)
print(df)
   A    B  C  D  E
0  1  0.1  1  1  2
1  2  0.3  3  4  3
2  3  0.6  1  5  6
3  4  0.7  2  7  8
4  5  0.3  2  9  9
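One caveat: .sort_values() on the cumulative sums fills the new column in ascending order rather than aligning each cumulative value with the row that produced it. If row alignment is what you want, here is a sketch relying on pandas' index alignment during column assignment:

# assigning a Series realigns it to df's index, so each row keeps the
# cumulative value it had at its position in the sorted order
df['D'] = df.sort_values('A')['C'].cumsum()
df['E'] = df.sort_values('B', ascending=False)['C'].cumsum()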

Create a new dataframe by aggregating repeated origin and destination values by a separate count column in a pandas dataframe

I am having trouble analysing origin-destination values in a pandas dataframe which contains origin/destination columns and a count column with the frequency of each pair. I want to transform this into a dataframe counting how many are leaving and entering each place:
Initial:
Origin Destination Count
A B 7
A C 1
B A 1
B C 4
C A 3
C B 10
For example, this simplified dataframe has 7 leaving from A to B and 1 from A to C, so overall 'leaving' for place A would be 8, and 'entering' for place A would be 4 (B to A is 1, C to A is 3), etc. The new dataframe would look something like this.
Goal:
Place Entering Leaving
A 4 8
B 17 5
C 5 13
I have tried several techniques such as .groupby() but have not yet created my intended dataframe. How can I handle the repeated values in the origin/destination columns and assign them to a new dataframe with aggregated values of just the count of leaving and entering?
Thank you!
Use double groupby + concat:
# total entering a place = sum of Count where it appears as the Destination
a = df.groupby('Destination')['Count'].sum()
# total leaving a place = sum of Count where it appears as the Origin
b = df.groupby('Origin')['Count'].sum()
df = pd.concat([a, b], axis=1, keys=('Entering', 'Leaving')).rename_axis('Place').reset_index()
print (df)
  Place  Entering  Leaving
0     A         4        8
1     B        17        5
2     C         5       13
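One caveat worth noting: if a place ever appears only as an origin or only as a destination, concat leaves NaN in the missing column. A sketch of the same line with the gap filled:

# places missing from one side become NaN after concat; treat them as 0
df = (pd.concat([a, b], axis=1, keys=('Entering', 'Leaving'))
        .fillna(0)
        .rename_axis('Place')
        .reset_index())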
Alternatively, pivot_table then sum: in the Origin-by-Destination matrix, each column sum is the total entering that place and each row sum is the total leaving it.
df = pd.pivot_table(df, index='Origin', columns='Destination', values='Count', aggfunc='sum')
pd.concat([df.sum(axis=0), df.sum(axis=1)], axis=1)
Out[428]:
      0     1
A   4.0   8.0
B  17.0   5.0
C   5.0  13.0

Group by value of sum of columns with Pandas

I got lost in the Pandas docs and features trying to figure out a way to group a DataFrame's columns by their sums.
For instance, let's say I have the following data:
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
   a  b  c  d
0  1  0  1  2
1  0  1  0  3
2  0  0  0  4
I would like columns a, b and c to be grouped since they all have their sum equal to 1. The resulting DataFrame would have column labels equal to the sums of the columns it grouped, like this:
   1  9
0  2  2
1  1  3
2  0  4
Any idea to point me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
   1  9
0  2  2
1  1  3
2  0  4

[3 rows x 2 columns]
df.sum() is your grouper. It sums over axis 0 (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You want to group the columns (axis=1) and take the sum of each group.
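On recent pandas versions, where groupby(axis=1) is deprecated, a sketch of an equivalent using a double transpose:

# group the transposed rows (the original columns) by their sums,
# then transpose back; same result as the axis=1 groupby above
out = df.T.groupby(df.sum()).sum().T
print(out)
#    1  9
# 0  2  2
# 1  1  3
# 2  0  4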
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
import pandas as pd

dat = {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [1, 0, 0], 'd': [2, 3, 4]}
df = pd.DataFrame(dat)
df = df.transpose()            # original columns become rows
df['totals'] = df.sum(axis=1)  # each original column's sum
print(df.groupby('totals').sum().transpose())
# totals  1  9
# 0       2  2
# 1       1  3
# 2       0  4
