Plotting pandas groupby results - python

I have a very simple data set:
Customer  Amount
A           1.25
B           2
C           1
A           5
D           2
B          10
I would like to get the following result:
Customer  Amount  Number_of_transactions
A           6.25                       2
B          12                          2
C           1                          1
D           2                          1
The way I solved it was to add another column where all values are 1 and then use df.groupby('Customer').
Is there a more efficient way to do that?
I would need to plot the distribution of number_of_transactions and the distribution of amounts. Whenever I try to do that, I get a KeyError (I assume because of the groupby). Can someone point me in the right direction?

Try this:
>>> import numpy as np
>>> df['Number_of_transactions'] = 1
>>> df1 = df.pivot_table(index='Customer',
...                      values=['Amount', 'Number_of_transactions'],
...                      aggfunc=np.sum).reset_index()  # reset_index is optional
>>> df1
  Customer  Amount  Number_of_transactions
0        A    6.25                       2
1        B   12.00                       2
2        C    1.00                       1
3        D    2.00                       1
For the plots just do:
>>> df1.hist(bins=50)
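A helper column isn't strictly needed: assuming pandas >= 0.25, named aggregation computes the sum and the transaction count in a single pass. A minimal sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Customer': ['A', 'B', 'C', 'A', 'D', 'B'],
                   'Amount': [1.25, 2, 1, 5, 2, 10]})

# Named aggregation: one groupby computes both the sum of Amount and the
# number of rows per customer, so no all-ones helper column is required.
df1 = (df.groupby('Customer')
         .agg(Amount=('Amount', 'sum'),
              Number_of_transactions=('Amount', 'count'))
         .reset_index())
```

df1 then has one row per customer with both aggregates, ready for df1.hist().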

I'm not sure what you want as a plot, but for the first part you can do this:
new_df = pd.concat([df.groupby(df.Customer).Amount.sum(),
                    df.Customer.value_counts()], axis=1)
new_df.columns = ["Amounts", "Number_of_transactions"]
and then you can draw a bar plot with:
new_df.plot(kind="bar")
or, if you want a histogram:
new_df.hist()

Related

How to average n adjacent columns together in python pandas dataframe?

I have a dataframe that is a histogram with 2000 bins, with a column for each bin. I need to reduce it down to a quarter of the size - 500 bins.
Let's say we have the original dataframe:
A  B  C  D  E  F  G  H
1  1  1  1  2  2  2  2
I want to reduce it to a new quarter-width dataframe:
A  B
1  2
where, in the new dataframe, A is the average (A + B + C + D) / 4 of the first four columns of the original dataframe.
It feels like it should be easy, but I can't work out how to do it! Cheers :)
Assuming you want to group the first 4 and last 4 columns (or any number of columns 4 by 4):
out = df.groupby(np.arange(df.shape[1]) // 4, axis=1).mean()
output:
     0    1
0  1.0  2.0
If you further want to relabel the columns A/B:
out = (df.groupby(np.arange(df.shape[1]) // 4, axis=1).mean()
         .set_axis(['A', 'B'], axis=1)
       )
output:
     A    B
0  1.0  2.0
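Newer pandas versions deprecate groupby(..., axis=1), so the same block-averaging can be sketched by transposing first, grouping rows, and transposing back. A minimal example on the sample row, assuming blocks of 4 columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 2, 2, 2, 2]],
                  columns=list('ABCDEFGH'))

# Map each column to a block number (0 for A-D, 1 for E-H), group the
# transposed frame by it, average each block, and transpose back.
groups = np.arange(df.shape[1]) // 4
out = df.T.groupby(groups).mean().T.set_axis(['A', 'B'], axis=1)
```

out holds one column per 4-column block, each containing the block's row-wise average.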

How to add values one by one in a new column with pandas

I have a column with acceleration values, and I’m trying to integrate them in a new column.
Here’s what I want as output:
   A  B
0  a  b-a
1  b  c-b
2  c  d-c
3  d  …-d
…
I’m currently doing it like this:
l = []
for i in range(len(df) - 1):
    l.append(df.values[i + 1][0] - df.values[i][0])
l.append(None)  # no "next" value for the last row
df[1] = l
That’s very slow to process.
I have over a million lines, and this in 20 different csv files. Is there a way to do it faster?
IIUC, you can use diff:
df = pd.DataFrame({'A': [0, 2, 1, 10]})
df['B'] = -df['A'].diff(-1)
output:
    A    B
0   0  2.0
1   2 -1.0
2   1  9.0
3  10  NaN
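An equivalent sketch using shift instead of diff, which makes the "next minus current" intent explicit on the same sample frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 2, 1, 10]})

# shift(-1) moves each next row's value up one position, so B holds
# "next minus current"; the last row gets NaN since it has no successor.
df['B'] = df['A'].shift(-1) - df['A']
```

Both forms are vectorized, so they avoid the slow Python-level loop entirely.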

Rearrange rows based on condition alternating?

I have a bunch of rows which I want to rearrange one after the other based on a particular column.
df
  B/S
0   B
1   B
2   S
3   S
4   B
5   S
I have thought about doing a loc based on B and S and then concatenating them in a new dataframe, but that doesn't seem like good pandas practice.
Is there a pandas-centric approach to this?
Output required:
  B/S
0   B
2   S
1   B
3   S
4   B
5   S
We can achieve this by making smart use of reset_index:
m = df['B/S'].eq('B')
b = df[m].reset_index(drop=True)
s = df[~m].reset_index(drop=True)
out = pd.concat([b, s]).sort_index().reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0
  B/S
0   B
1   S
2   B
3   S
4   B
5   S
If you want to keep your index information, we can slightly adjust our approach:
m = df['B/S'].eq('B')
b = df[m].reset_index()
s = df[~m].reset_index()
out = pd.concat([b, s]).sort_index().set_index('index')
      B/S
index
0       B
2       S
1       B
3       S
4       B
5       S
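Another pandas-centric sketch: number each row within its B/S group with cumcount, then sort by that number so the n-th B is directly followed by the n-th S, keeping the original index without any splitting and re-joining. The `_n` helper column name is an arbitrary choice:

```python
import pandas as pd

df = pd.DataFrame({'B/S': list('BBSSBS')})

# cumcount gives each row its occurrence number within its group
# (1st B -> 0, 1st S -> 0, 2nd B -> 1, ...); sorting by that number,
# then by the label, interleaves the groups as B, S, B, S, ...
order = df.groupby('B/S').cumcount()
out = df.assign(_n=order).sort_values(['_n', 'B/S']).drop(columns='_n')
```

The resulting index order is 0, 2, 1, 3, 4, 5, matching the required output.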

Pandas Dataframe groupby: double groupby & apply function

I have a question regarding pandas dataframes:
I have a dataframe like the following,
df = pd.DataFrame([[1, 1, 10], [1, 1, 30], [1, 2, 40],
                   [2, 3, 50], [2, 3, 150], [2, 4, 100]],
                  columns=["a", "b", "c"])
   a  b    c
0  1  1   10
1  1  1   30
2  1  2   40
3  2  3   50
4  2  3  150
5  2  4  100
And I want to produce the following output:
   a  new col
0  1       30
1  2      100
where each value is calculated as follows:
1. group df by the first column "a",
2. then group each "a" group by "b",
3. calculate the mean of "c" for each b-group,
4. calculate the mean of all b-group means for one "a";
this is the final value stored in "new col" for that "a".
I can imagine that this is somewhat confusing, but I hope it is understandable nevertheless.
I achieved the desired result, but as I need it for a huge dataframe, my solution is probably much too slow:
pd.DataFrame([[a, adata.groupby("b").agg({"c": lambda x: x.mean()}).mean()[0]]
              for a, adata in df.groupby("a")],
             columns=["a", "new col"])
a new col
0 1 30.0
1 2 100.0
Therefore, what I would need is something like (?)
df.groupby("a").groupby("b")["c"].mean()
Thank you very much in advance!
Here's one way
In [101]: (df.groupby(['a', 'b'], as_index=False)['c'].mean()
              .groupby('a', as_index=False)['c'].mean()
              .rename(columns={'c': 'new col'}))
Out[101]:
   a  new col
0  1       30
1  2      100
In [57]: df.groupby(['a', 'b'])['c'].mean().groupby(level=0).mean().reset_index()  # mean(level=0) was removed in pandas 2.0
Out[57]:
   a    c
0  1   30
1  2  100
df.groupby(['a', 'b']).mean().reset_index().groupby('a').mean()
Out[117]:
     b      c
a
1  1.5   30.0
2  3.5  100.0
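A runnable sketch of the chained approach on the sample frame, written with groupby(level='a') so it also works on pandas 2.x, where mean(level=0) no longer exists:

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 10], [1, 1, 30], [1, 2, 40],
                   [2, 3, 50], [2, 3, 150], [2, 4, 100]],
                  columns=['a', 'b', 'c'])

# First mean: per (a, b) pair; second mean: over the b-group means
# within each 'a', which is exactly the two-level averaging asked for.
out = (df.groupby(['a', 'b'])['c'].mean()
         .groupby(level='a').mean()
         .reset_index(name='new col'))
```

This yields one row per "a" with new col values 30.0 and 100.0.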

Group by value of sum of columns with Pandas

I got lost in the Pandas docs and features trying to figure out a way to group a DataFrame's columns by the values of their sums.
For instance, let's say I have the following data:
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
   a  b  c  d
0  1  0  1  2
1  0  1  0  3
2  0  0  0  4
I would like columns a, b and c to be grouped since they all have a sum equal to 1. The resulting DataFrame would have column labels equal to the sums of the columns it grouped. Like this:
   1  9
0  2  2
1  1  3
2  0  4
Any idea to point me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
   1  9
0  2  2
1  1  3
2  0  4

[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You want to group the columns (axis=1) and take the sum of each group.
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [1, 0, 0], 'd': [2, 3, 4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print(df.groupby('totals').sum().transpose())
#totals  1  9
#0       2  2
#1       1  3
#2       0  4
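On pandas versions where groupby(..., axis=1) is deprecated, the same column grouping can be sketched as a double transpose, combining both answers above:

```python
import pandas as pd

dat = {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [1, 0, 0], 'd': [2, 3, 4]}
df = pd.DataFrame(dat)

# Transpose so columns become rows, group those rows by the original
# column sums (the Series aligns on the a/b/c/d labels), sum each
# group, and transpose back to get sum-labeled columns.
out = df.T.groupby(df.sum()).sum().T
```

out has columns labeled 1 and 9, matching the axis=1 groupby result.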
