I have a question regarding pandas dataframes:
I have a dataframe like the following,
df = pd.DataFrame([[1,1,10],[1,1,30],[1,2,40],[2,3,50],[2,3,150],[2,4,100]],columns=["a","b","c"])
a b c
0 1 1 10
1 1 1 30
2 1 2 40
3 2 3 50
4 2 3 150
5 2 4 100
And I want to produce the following output:
a "new col"
0 1 30
1 2 100
where each line is calculated as follows:
Group df by the first column "a",
then group each grouped object by "b",
calculate the mean of "c" for each b-group,
calculate the mean of all b-group means for one "a",
and this is the final value stored in "new col" for that "a".
I can imagine that this is somewhat confusing, but I hope it is understandable nevertheless.
I achieved the desired result, but since I need it for a huge dataframe, my solution is probably much too slow:
pd.DataFrame([ [a, adata.groupby("b").agg({"c": lambda x:x.mean()}).mean()[0]] for a,adata in df.groupby("a") ],columns=["a","new col"])
a new col
0 1 30.0
1 2 100.0
Therefore, what I would need is something like (?)
df.groupby("a").groupby("b")["c"].mean()
Thank you very much in advance!
Here's one way
In [101]: (df.groupby(['a', 'b'], as_index=False)['c'].mean()
.groupby('a', as_index=False)['c'].mean()
.rename(columns={'c': 'new col'}))
Out[101]:
a new col
0 1 30
1 2 100
In [57]: df.groupby(['a','b'])['c'].mean().mean(level=0).reset_index()
Out[57]:
a c
0 1 30
1 2 100
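Note that the level keyword of mean() is deprecated in newer pandas versions; a sketch of an equivalent using an explicit second groupby, which should give the same result, is:
# mean of c per (a, b), then the mean of those means per a
df.groupby(['a', 'b'])['c'].mean().groupby(level=0).mean().reset_index()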
df.groupby(['a','b']).mean().reset_index().groupby('a').mean()
Out[117]:
b c
a
1 1.5 30.0
2 3.5 100.0
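Note that this last variant also averages the b column; if only c is wanted, one could select it before the final mean, for example (a small untested tweak):
df.groupby(['a', 'b']).mean().reset_index().groupby('a')[['c']].mean()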
This might be quite an easy problem, but I can't deal with it properly and didn't find the exact answer here. So, let's say we have a pandas DataFrame as below:
df:
ID a b c d
0 1 3 4 9
1 2 8 8 3
2 1 3 10 12
3 0 1 3 0
I want to remove all the rows that contain repeating values in different columns. In other words, I am only interested in keeping rows with unique values. Referring to the above example, the desired output should be:
ID a b c d
0 1 3 4 9
2 1 3 10 12
(I deliberately kept the original ID values to make the comparison easier.) Please let me know if you have any ideas. Thanks!
You can compare the length of the set of each row's values with the number of columns:
lc = len(df.columns)
df1 = df[df.apply(lambda x: len(set(x)) == lc, axis=1)]
print (df1)
a b c d
ID
0 1 3 4 9
2 1 3 10 12
Or test by Series.duplicated and Series.any:
df1 = df[~df.apply(lambda x: x.duplicated().any(), axis=1)]
Or DataFrame.nunique:
df1 = df[df.nunique(axis=1).eq(lc)]
Or:
df1 = df[[len(set(x)) == lc for x in df.to_numpy()]]
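If speed matters on a large frame, a vectorized alternative (a sketch, assuming the values in a row are mutually comparable, e.g. all numeric) is to sort each row and check that no adjacent values are equal:
import numpy as np

arr = np.sort(df.to_numpy(), axis=1)             # sort values within each row
mask = (arr[:, 1:] != arr[:, :-1]).all(axis=1)   # rows with no repeated values
df1 = df[mask]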
I have two pandas DataFrames of unequal sizes. For example:
Df1
id value
a 2
b 3
c 22
d 5
Df2
id value
c 22
a 2
Now I want to extract from DF1 those rows which have the same id as in DF2. My first approach is to run two for loops, with something like:
x = []
for i in range(len(DF2)):
    for j in range(len(DF1)):
        if DF2['id'][i] == DF1['id'][j]:
            x.append(DF1.iloc[j])
This works, but with 400,000 lines in one file and 5,000 in the other, I need an efficient Pythonic + pandas way.
import pandas as pd
data1={'id':['a','b','c','d'],
'value':[2,3,22,5]}
data2={'id':['c','a'],
'value':[22,2]}
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
finaldf=pd.concat([df1,df2],ignore_index=True)
Output after concat
id value
0 a 2
1 b 3
2 c 22
3 d 5
4 c 22
5 a 2
Final Output
finaldf.drop_duplicates()
id value
0 a 2
1 b 3
2 c 22
3 d 5
You can concat the dataframes, then check whether the elements are duplicated, then drop_duplicates and keep just the first occurrence:
m = pd.concat((df1,df2))
m[m.duplicated('id',keep=False)].drop_duplicates()
id value
0 a 2
2 c 22
You can try this:
df = df1[df1.set_index(['id']).index.isin(df2.set_index(['id']).index)]
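A simpler variant of the same idea (a sketch; assumes the id values can be compared directly) is to test the id column itself, or to use an inner merge:
df = df1[df1['id'].isin(df2['id'])]
# or
df = df1.merge(df2[['id']], on='id', how='inner')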
In the following dataset, what's the best way to duplicate rows so that every Type whose groupby(['Type']) count is less than 3 is brought up to 3? df is the input, and df1 is my desired outcome. You can see that row 3 from df was duplicated twice at the end. This is only an example; the real data has approximately 20 million rows and 400K unique Types, so a method that does this efficiently is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I don't know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type', 'Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: sort=False for append is available in pandas >= 0.23.0; remove it if using a lower version.
EDIT: If the data contains multiple value columns, set all columns as the index except one, repeat that one, and then reset_index, as in:
df = df.append(df.set_index(['Type', 'Val_1', 'Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
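Also note that DataFrame.append was deprecated and later removed in pandas 2.0; a rough pd.concat equivalent of the first snippet (a sketch, untested here) would be:
extra = df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df, extra], ignore_index=True)[['Type', 'Val']]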
I have a very simple data set:
Customer Amount
A 1.25
B 2
C 1
A 5
D 2
B 10
I would like to get the following result:
Customer Amount Number_of_transactions
A 6.25 2
B 12 2
C 1 1
D 2 1
The way I solved it is to add another column where all values are 1 and then use df.groupby('Customer').
Is there a more efficient way to do that?
I also need to plot the distribution of Number_of_transactions and the distribution of Amount. Whenever I try to do that, I get a KeyError (I assume because of the groupby). Can someone point me in the right direction?
Try this:
>>> import numpy as np
>>> df['Number_of_transactions'] = 1
>>> df1 = df.pivot_table(index='Customer',
...                      values=['Amount', 'Number_of_transactions'],
...                      aggfunc=np.sum).reset_index()  # reset_index is optional
>>> df1
Out[21]:
Customer Amount Number_of_transactions
0 A 6.25 2
1 B 12.00 2
2 C 1.00 1
3 D 2.00 1
For the plots just do:
>>> df1.hist(bins=50)
I'm not sure what you want as a plot, but for the first part, you can do this:
new_df = pd.concat([df.groupby(df.Customer).Amount.sum(),
df.Customer.value_counts()], axis=1)
new_df.columns = ["Amounts","Number_of_transactions"]
and then you can have a bar plot with:
new_df.plot(kind="bar")
or if you want a histogram:
new_df.hist()
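For reference, a single groupby with named aggregation (a sketch, assuming pandas >= 0.25) produces both columns in one step:
new_df = (df.groupby('Customer', as_index=False)
            .agg(Amount=('Amount', 'sum'),
                 Number_of_transactions=('Amount', 'count')))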
I got lost in the pandas docs and features while trying to figure out a way to group a DataFrame's columns by the values of their sums.
For instance, let's say I have the following data:
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
I would like columns a, b and c to be grouped, since they all have a sum equal to 1. The resulting DataFrame would have column labels equal to the sums of the columns it grouped, like this:
1 9
0 2 2
1 1 3
2 0 4
Any idea to put me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
1 9
0 2 2
1 1 3
2 0 4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over axis 0 (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You want to group the columns (axis=1) and take the sum of each group.
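Note that groupby with axis=1 is deprecated in recent pandas releases; a rough equivalent via transposing (a sketch that should give the same result here) is:
df.T.groupby(df.sum()).sum().T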
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print(df.groupby('totals').sum().transpose())
#totals 1 9
#0 2 2
#1 1 3
#2 0 4