Grouping rows in pandas - python

I have a dataframe that looks like this:
In [4]:
import pandas as pd
df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
df
Out[4]:
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6
I just want to group rows which have the same value in column a. The desired output is like this:
df
Out[4]:
   a  b
0  A  1
      2
1  B  5
      5
      4
2  C  6
EDIT:
I am sorry; actually the desired output should look like this:
df
Out[4]:
   b
A  1
   2
B  5
   5
   4
C  6

I think you are looking for set_index rather than groupby:
In [11]: df.set_index('a')
Out[11]:
   b
a
A  1
A  2
B  5
B  5
B  4
C  6
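Note that set_index keeps the repeated index labels. If you want them blanked out exactly as in your desired output, one option is a MultiIndex, since pandas sparsifies repeated outer-level labels when printing. A minimal sketch (the extra counter level n exists only to make the index unique):

import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'], 'b': [1, 2, 5, 5, 4, 6]})

# Add a per-group counter as a second index level; pandas then blanks the
# repeated 'a' labels when displaying the frame.
out = df.set_index(['a', df.groupby('a').cumcount().rename('n')])
print(out)
#      b
# a n
# A 0  1
#   1  2
# B 0  5
#   1  5
#   2  4
# C 0  6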

Related

How to make a one-to-one merge on pandas dataframes

Let's say df1 looks like:
id  x
a   1
b   2
b   3
c   4
and df2 looks like:
id  y
b   9
b   8
How do I merge them so that the output is:
id  x  y
b   2  9
b   3  8
I've tried pd.merge(df1, df2, on='id'), but it gives me:
id  x  y
b   2  9
b   2  8
b   3  9
b   3  8
which is not what I want.
IIUC, you can combine GroupBy.cumcount with merge. cumcount numbers the rows within each id group, so joining on ['id', 'count'] pairs the nth duplicate in df1 with the nth duplicate in df2:
new_df = (df1.assign(count=df1.groupby('id').cumcount())
             .merge(df2.assign(count=df2.groupby('id').cumcount()),
                    on=['id', 'count'], how='inner')
             .drop(columns='count'))
  id  x  y
0  b  2  9
1  b  3  8
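To see why this works, here is what the intermediate count column looks like (a sketch, with df1 and df2 reconstructed from the question):

import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'b', 'b', 'c'], 'x': [1, 2, 3, 4]})
df2 = pd.DataFrame({'id': ['b', 'b'], 'y': [9, 8]})

# cumcount labels each row with its position inside its 'id' group,
# making ('id', 'count') a unique key in both frames.
print(df1.assign(count=df1.groupby('id').cumcount()))
#   id  x  count
# 0  a  1      0
# 1  b  2      0
# 2  b  3      1
# 3  c  4      0
print(df2.assign(count=df2.groupby('id').cumcount()))
#   id  y  count
# 0  b  9      0
# 1  b  8      1

Merging on ['id', 'count'] then matches (b, 0) with (b, 0) and (b, 1) with (b, 1), which is exactly the one-to-one pairing asked for.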

Selecting rows of one dataframe using multiple columns of another dataframe in python, pandas

I want to pick only the rows from df1 where the values of columns A and B match the values of columns A and B in df2. For example, if df1 and df2 are as follows:
df1
A  B  C
1  2  3
4  5  6
6  7  8
df2
A  B  D  E
1  2  6  8
2  3  7  9
4  5  2  1
the result will be a subset of df1's rows; in this example, the result will look like:
df1
A  B  C
1  2  3
4  5  6
Use:
df = pd.merge(df1, df2[["A", "B"]], on=["A", "B"], how="inner")
print(df)
This prints:
   A  B  C
0  1  2  3
1  4  5  6
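One caveat: if df2 contained repeated (A, B) pairs, the inner merge would duplicate matching rows of df1. Deduplicating the key columns first avoids that (a sketch, with the frames reconstructed from the question):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 4, 6], 'B': [2, 5, 7], 'C': [3, 6, 8]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [2, 3, 5],
                    'D': [6, 7, 2], 'E': [8, 9, 1]})

# drop_duplicates makes the key frame unique, so each matching row of df1
# appears at most once in the result.
df = df1.merge(df2[['A', 'B']].drop_duplicates(), on=['A', 'B'], how='inner')
print(df)
#    A  B  C
# 0  1  2  3
# 1  4  5  6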

Aggregate data frame rows based on conditions

I have this table
A  B  C  E
1  2  1  3
1  2  4  4
2  7  1  1
3  4  0  2
3  4  8  3
Now, I want to remove duplicates based on columns A and B and at the same time sum up column C. For E, it should take the value from the row where C has its maximum. The desired result table should look like this:
A  B  C  E
1  2  5  4
2  7  1  1
3  4  8  3
I tried this: df.groupby(['A', 'B']).sum()['C'], but my data frame does not change at all. I am thinking that I didn't incorporate the E column part properly... Can somebody advise?
Thanks so much!
If rows count as duplicates whenever they share the values in the first and second columns, we can group by those columns.
In [20]: df
Out[20]:
   A  B  C  E
0  1  1  5  4
1  1  1  1  1
2  3  3  8  3
In [21]: df.groupby(['A', 'B'])['C'].sum()
Out[21]:
A  B
1  1    6
3  3    8
Name: C, dtype: int64
I tried this: df.groupby(['A', 'B']).sum()['C'] but my data frame does not change at all

Yes, that's because pandas returns a new object and does not overwrite the initial DataFrame:
In [22]: df
Out[22]:
   A  B  C  E
0  1  1  5  4
1  1  1  1  1
2  3  3  8  3
You have to overwrite it explicitly (note that the result is then a Series indexed by A and B, not a DataFrame):
In [23]: df = df.groupby(['A', 'B'])['C'].sum()
In [24]: df
Out[24]:
A  B
1  1    6
3  3    8
Name: C, dtype: int64
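The groupby above only aggregates C. If, per the question, E should also come from the row with the largest C in each (A, B) group, one way is to sort by C first and take the last E per group. A sketch using named aggregation (pandas 0.25+), with the question's data:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8],
                   'E': [3, 4, 1, 2, 3]})

# After sorting by C, 'last' within each group picks E from the row with
# the largest C; C itself is summed per (A, B) group.
out = (df.sort_values('C')
         .groupby(['A', 'B'], as_index=False)
         .agg(C=('C', 'sum'), E=('E', 'last')))
print(out)
#    A  B  C  E
# 0  1  2  5  4
# 1  2  7  1  1
# 2  3  4  8  3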

Pandas - how to replace specific values in a Series?

I have a dataframe with a column called product_type such as:
df1.product_type.unique()
>> ["prod_1", "prod_2", "prod_3"]
df1.product_type.dtype
>> dtype('O')
I am looking for the most efficient way to replace those with the numerical values [1, 2, 3].
Thanks
Use factorize to encode a new column:
In [2]:
df = pd.DataFrame({'a':list('abcdbcbccc')})
df
Out[2]:
   a
0  a
1  b
2  c
3  d
4  b
5  c
6  b
7  c
8  c
9  c
In [5]:
df['code'] = df['a'].factorize()[0] + 1
df
Out[5]:
   a  code
0  a     1
1  b     2
2  c     3
3  d     4
4  b     2
5  c     3
6  b     2
7  c     3
8  c     3
9  c     3
So in your case:
df1['product_type'] = df1['product_type'].factorize()[0] + 1
should work.
Alternatively, cast the column as a category and then get the codes. Note that the codes start at 0; add 1 if you need values starting at 1 as in the question.
df1 = pd.DataFrame({'product_type': ['prod_1'] * 3 + ['prod_2'] * 3 + ['prod_3'] * 3})
df1['product_type_code'] = df1.product_type.astype('category').cat.codes
>>> df1
  product_type  product_type_code
0       prod_1                  0
1       prod_1                  0
2       prod_1                  0
3       prod_2                  1
4       prod_2                  1
5       prod_2                  1
6       prod_3                  2
7       prod_3                  2
8       prod_3                  2
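Both approaches assign numbers automatically (factorize in order of first appearance, category codes in sorted order). If you need control over exactly which number each product gets, an explicit mapping is a simple alternative (a sketch; the particular mapping is an assumption):

# Map each product label to a chosen number; unmapped labels become NaN.
df1['product_type_code'] = df1['product_type'].map({'prod_1': 1, 'prod_2': 2, 'prod_3': 3})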

Drop observations from the data frame in python

How do I delete observations from a data frame in Python? For example, I have a data frame with variables a, b, c, and I want to delete an observation if variable a is missing or variable c is equal to zero.
You could build a boolean mask using isnull:
mask = (df['a'].isnull()) | (df['c'] == 0)
and then select the desired rows with:
df = df.loc[~mask]
~mask is the boolean inverse of mask, so df.loc[~mask] selects rows where a is not null and c is not 0.
For example,
import numpy as np
import pandas as pd
arr = np.arange(15, dtype='float').reshape(5,3) % 4
arr[arr > 2] = np.nan
df = pd.DataFrame(arr, columns=list('abc'))
#      a    b    c
# 0    0    1    2
# 1  NaN    0    1
# 2    2  NaN    0
# 3    1    2  NaN
# 4    0    1    2
mask = (df['a'].isnull()) | (df['c'] == 0)
df = df.loc[~mask]
yields
   a  b    c
0  0  1    2
3  1  2  NaN
4  0  1    2
Let's say your DataFrame looks like this:
In [2]: import numpy as np
   ...: data = pd.DataFrame({
   ...:     'a': [1, 2, 3, np.nan, 5],
   ...:     'b': [3, 4, np.nan, 5, 6],
   ...:     'c': [0, 1, 2, 3, 4],
   ...: })
In [3]: data
Out[3]:
     a    b  c
0    1    3  0
1    2    4  1
2    3  NaN  2
3  NaN    5  3
4    5    6  4
To delete rows with missing observations, use:
In [5]: data.dropna()
Out[5]:
   a  b  c
0  1  3  0
1  2  4  1
4  5  6  4
To delete rows where only column 'a' has missing observations, use:
In [6]: data.dropna(subset=['a'])
Out[6]:
     a    b  c
0    1    3  0
1    2    4  1
2    3  NaN  2
4    5    6  4
To delete rows that have either missing observations or zeros, use:
In [18]: data[data.all(axis=1)].dropna()
Out[18]:
   a  b  c
1  2  4  1
4  5  6  4
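Note that data[data.all(axis=1)] drops rows with a zero in any column, not just c. If you want exactly the rule from the question (drop a row when a is missing or c is zero), a more targeted sketch:

# Keep rows where 'a' is present, then drop those where 'c' is zero.
data = data.dropna(subset=['a']).query('c != 0')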
