Here is an example DataFrame:
In [308]: df
Out[308]:
A B
0 1 1
1 1 2
2 2 3
3 2 4
4 3 5
5 3 6
I want to merge A and B while keeping order, indexing and duplicates in A intact. At the same time, I only want to get values from B that are not in A so the resulting DataFrame should look like this:
In [308]: df
Out[308]:
A B
0 1 1
1 1 2
2 2 3
3 2 4
4 3 5
5 3 6
6 4 NaN
7 5 NaN
8 6 NaN
Any pointers would be much appreciated. I tried doing a concat of the two columns and a groupby but that doesn't preserve column A values since duplicates are discarded.
I want to retain what is already there but also add values from B that are not in A.
To get those elements of B not in A, use the isin method with the ~ invert (not) operator:
In [11]: B_notin_A = df['B'][~df['B'].isin(df['A'])]
In [12]: B_notin_A
Out[12]:
3 4
4 5
5 6
Name: B, dtype: int64
And then you can concat these with A, sort them with sort_values (which returns the sorted result rather than operating in place) and reset_index:
In [13]: A_concat_B_notin_A = pd.concat([df['A'], B_notin_A]).sort_values().reset_index(drop=True)
In [14]: A_concat_B_notin_A
Out[14]:
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 5
8 6
dtype: int64
and then create a new DataFrame:
In [15]: pd.DataFrame({'A': A_concat_B_notin_A, 'B': df['B']})
Out[15]:
A B
0 1 1
1 1 2
2 2 3
3 2 4
4 3 5
5 3 6
6 4 NaN
7 5 NaN
8 6 NaN
FWIW, I'm not sure whether this is necessarily the correct data structure for you...
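For reference, here is the whole recipe assembled into one self-contained snippet you can paste and run (a sketch only, assuming a recent pandas where sort_values is the sorting method):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [1, 2, 3, 4, 5, 6]})

# values of B that do not already appear in A
B_notin_A = df['B'][~df['B'].isin(df['A'])]

# stack them under A, sort, and renumber the index 0..n-1
A_full = pd.concat([df['A'], B_notin_A]).sort_values().reset_index(drop=True)

# B is shorter than the new A, so the extra rows become NaN on index alignment
result = pd.DataFrame({'A': A_full, 'B': df['B']})
print(result)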
I want to group the DataFrame by column 'B' and get the nlargest values of column 'C', but the return value is a Series, not a DataFrame.
dftest = pd.DataFrame({'A': [1,2,3,4,5,6,7,8,9,10],
                       'B': ['A','B','A','B','A','B','A','B','B','B'],
                       'C': [0,0,1,1,2,2,3,3,4,4]})
dfn = dftest.groupby('B', group_keys=False)\
            .apply(lambda grp: grp['C'].nlargest(int(grp['C'].count()*0.8))).sort_index()
The result is a Series:
2 1
4 2
5 2
6 3
7 3
8 4
9 4
Name: C, dtype: int64
I hope the result is a DataFrame, like
A B C
2 3 A 1
4 5 A 2
5 6 B 2
6 7 A 3
7 8 B 3
8 9 B 4
9 10 B 4
Update:
Sorry, column 'A' does not actually contain sequential integers; dftest might be more like
dftest = pd.DataFrame({'A': ['Feb','Flow','Air','Flow','Feb','Beta','Cat','Feb','Beta','Air'],
                       'B': ['A','B','A','B','A','B','A','B','B','B'],
                       'C': [0,0,1,1,2,2,3,3,4,4]})
and the result should be
A B C
2 Air A 1
4 Feb A 2
5 Beta B 2
6 Cat A 3
7 Feb B 3
8 Beta B 4
9 Air B 4
It may be a bit clumsy, but it does what you asked:
dfn = dftest.groupby('B')\
            .apply(lambda grp: grp['C'].nlargest(int(grp['C'].count()*0.8)))\
            .reset_index().rename(columns={'level_1': 'A'})
dfn.A = dfn.A + 1
dfn = dfn[['A', 'B', 'C']].sort_values(by='A')
Thanks to my friends, the following code works for me.
dfn = dftest.groupby('B', group_keys=False)\
            .apply(lambda grp: grp.nlargest(n=int(grp['C'].count()*0.8), columns='C').sort_index())
The resulting dfn is:
In [8]: dfn
Out[8]:
A B C
2 3 A 1
4 5 A 2
6 7 A 3
5 6 B 2
7 8 B 3
8 9 B 4
9 10 B 4
My previous code dealt with a Series; the latter deals with a DataFrame.
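To make that contrast concrete, here is a minimal, self-contained sketch (using the updated dftest from the question): returning a column from the per-group function yields a Series overall, while returning a slice of the group DataFrame keeps all columns.

import pandas as pd

dftest = pd.DataFrame({'A': ['Feb','Flow','Air','Flow','Feb','Beta','Cat','Feb','Beta','Air'],
                       'B': ['A','B','A','B','A','B','A','B','B','B'],
                       'C': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]})

# Series in, Series out: only column 'C' survives.
s = dftest.groupby('B', group_keys=False)\
          .apply(lambda grp: grp['C'].nlargest(int(grp['C'].count() * 0.8)))

# DataFrame in, DataFrame out: every column, including 'A', is kept.
d = dftest.groupby('B', group_keys=False)\
          .apply(lambda grp: grp.nlargest(n=int(grp['C'].count() * 0.8), columns='C'))

print(type(s).__name__, type(d).__name__)  # Series DataFrame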
I have this table
A B C E
1 2 1 3
1 2 4 4
2 7 1 1
3 4 0 2
3 4 8 3
Now, I want to remove duplicates based on column A and B and at the same time sum up column C. For E, it should take the value where C shows the max value. The desirable result table should look like this:
A B C E
1 2 5 4
2 7 1 1
3 4 8 3
I tried this: df.groupby(['A', 'B']).sum()['C'], but my data frame does not change at all, and I am thinking that I didn't incorporate the E column part properly... Can somebody advise?
Thanks so much!
If the rows are duplicates in the first and second columns (A and B), we can group by those columns.
In [20]: df
Out[20]:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 3
In [21]: df.groupby(['A', 'B'])['C'].sum()
Out[21]:
A B
1 1 6
3 3 8
Name: C, dtype: int64
I tried this: df.groupby(['A', 'B']).sum()['C'] but my data frame does not change at all
Yes, that's because pandas doesn't overwrite the initial DataFrame; the groupby returns a new object.
In [22]: df
Out[22]:
A B C E
0 1 1 5 4
1 1 1 1 1
2 3 3 8 3
You have to overwrite it explicitly.
In [23]: df = df.groupby(['A', 'B'])['C'].sum()
In [24]: df
Out[24]:
A B
1 1 6
3 3 8
Name: C, dtype: int64
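The question also asked for column E to take its value from the row where C is largest within each (A, B) group. One way to cover both requirements at once, shown here only as a sketch (it assumes a pandas version with named aggregation, i.e. 0.25+), is to sort by C first and then aggregate:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8],
                   'E': [3, 4, 1, 2, 3]})

# Sorting by C means that, within each (A, B) group, the row with the
# largest C comes last, so 'last' on E picks E at the maximal C.
result = (df.sort_values('C')
            .groupby(['A', 'B'], as_index=False)
            .agg(C=('C', 'sum'), E=('E', 'last')))
print(result)
#    A  B  C  E
# 0  1  2  5  4
# 1  2  7  1  1
# 2  3  4  8  3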
I want to reverse a column's values in my dataframe, but only on an individual "groupby" level. Below you can find a minimal demonstration example, where I want to "flip" the values that belong to the same letter A, B or C:
df = pd.DataFrame({"group":["A","A","A","B","B","B","B","C","C"],
"value": [1,3,2,4,4,2,3,2,5]})
group value
0 A 1
1 A 3
2 A 2
3 B 4
4 B 4
5 B 2
6 B 3
7 C 2
8 C 5
My desired output looks like this (the column is added instead of replaced only for brevity):
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
As always, when I don't see a proper vectorized approach, I end up messing with loops just for the sake of the final output, but my current code hurts me very much:
for i in list(set(df["group"].values.tolist())):
    reversed_group = df.loc[df["group"] == i, "value"].values.tolist()[::-1]
    df.loc[df["group"] == i, "value_desired"] = reversed_group
Pandas gurus, please show me the way :)
You can use transform:
In [900]: df.groupby('group')['value'].transform(lambda x: x[::-1])
Out[900]:
0 2
1 3
2 1
3 3
4 2
5 4
6 4
7 5
8 2
Name: value, dtype: int64
Details
In [901]: df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x[::-1])
In [902]: df
Out[902]:
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
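If you want to make it explicit that the reversal is positional rather than index based, a small variation on the same idea (just a sketch, using the df from the question) is to hand transform a plain array:

import pandas as pd

df = pd.DataFrame({"group": ["A","A","A","B","B","B","B","C","C"],
                   "value": [1, 3, 2, 4, 4, 2, 3, 2, 5]})

# Returning a plain NumPy array makes the positional (not index-based)
# nature of the per-group reversal explicit.
df["value_desired"] = df.groupby("group")["value"].transform(lambda x: x.to_numpy()[::-1])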
When faced with a large number of groups, any graph you make is apt to be useless due to having too many lines and an unreadable legend. In these cases, it is very useful to be able to find the groups that contain the most and the least information. However, while x.size() tells you the group sizes (after having used groupby), I can find no way to re-sort the dataframe using this information so that I can then loop over only the first x groups when graphing.
You can use transform to get the counts and sort on that column:
df = pd.DataFrame({'A': list('aabababc'), 'B': np.arange(8)})
df
Out:
A B
0 a 0
1 a 1
2 b 2
3 a 3
4 b 4
5 a 5
6 b 6
7 c 7
df['counts'] = df.groupby('A').transform('count')
df
Out:
A B counts
0 a 0 4
1 a 1 4
2 b 2 3
3 a 3 4
4 b 4 3
5 a 5 4
6 b 6 3
7 c 7 1
Now you can sort by counts:
df.sort_values('counts')
Out:
A B counts
7 c 7 1
2 b 2 3
4 b 4 3
6 b 6 3
0 a 0 4
1 a 1 4
3 a 3 4
5 a 5 4
In one line:
df.assign(counts = df.groupby('A').transform('count')).sort_values('counts')
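And since the stated goal was to graph only the largest groups, one way to keep just the rows belonging to the n biggest groups (a sketch; n = 2 here) is:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('aabababc'), 'B': np.arange(8)})

# value_counts() sorts the groups by size; nlargest(2) keeps the two biggest,
# and isin() filters the frame down to just those groups' rows.
top = df['A'].value_counts().nlargest(2).index
df_top = df[df['A'].isin(top)]
print(df_top)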
Beginner's question:
I want to create a cumulative sum column on my dataframe, but I only want the column to add the values from the previous 4 rows (inclusive of the current row). I also need to start the count again with each new 'Type' in the frame.
This is what I'm going for:
Type Value Desired column
A 1 -
A 2 -
A 1 -
A 1 5
A 2 6
A 2 6
B 2 -
B 2 -
B 2 -
B 2 8
B 1 7
B 1 6
You can do this by applying a rolling sum after grouping by Type. For example:
>>> df["sum4"] = df.groupby("Type", group_keys=False)["Value"].apply(lambda x: x.rolling(4).sum())
>>> df
Type Value sum4
0 A 1 NaN
1 A 2 NaN
2 A 1 NaN
3 A 1 5
4 A 2 6
5 A 2 6
6 B 2 NaN
7 B 2 NaN
8 B 2 NaN
9 B 2 8
10 B 1 7
11 B 1 6
pandas uses NaN to represent missing data; if you really want '-' instead, you could do that too, using
df["sum4"] = df["sum4"].fillna('-')