Group by column value and set it as index in Pandas - python

I have a dataframe df1 that looks like this:
df1 = pd.DataFrame({'A': [0, 5, 4, 8, 9, 0, 7, 6],
                    'B': ['a', 's', 'd', 'f', 'g', 'h', 'j', 'k'],
                    'C': ['XX', 'XX', 'XX', 'YY', 'YY', 'WW', 'ZZ', 'ZZ']})
My goal is to group the elements according to the values contained in column C, so that rows having the same value share the same index (which must contain the value stored in C). Therefore the output should look like this:
     A  B
XX   0  a
     5  s
     4  d
YY   8  f
     9  g
WW   0  h
ZZ   7  j
     6  k
I tried to use the command df.groupby('C') but it returns the following object:
<pandas.core.groupby.DataFrameGroupBy object at 0x000000001A9D4860>
Can you suggest an elegant way to achieve my goal?
Note: I think my question is somehow related to multi-indexing

It seems you need DataFrame.set_index
df2 = df1.set_index('C')
print (df2)
A B
C
XX 0 a
XX 5 s
XX 4 d
YY 8 f
YY 9 g
WW 0 h
ZZ 7 j
ZZ 6 k
print (df2.loc['XX'])
A B
C
XX 0 a
XX 5 s
XX 4 d
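If the values in C were not already contiguous, sort_index would presumably keep equal keys together (a small addition, not part of the original answer):
df2 = df1.set_index('C').sort_index()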
If you need a MultiIndex from columns C and A:
df3 = df1.set_index(['C', 'A'])
print (df3)
      B
C  A
XX 0  a
   5  s
   4  d
YY 8  f
   9  g
WW 0  h
ZZ 7  j
   6  k
print (df3.loc['XX'])
B
A
0 a
5 s
4 d
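For completeness, the DataFrameGroupBy object from the question is not an error, it is just lazy; a minimal sketch of how you could inspect it, assuming the df1 defined above:
# iterate the groups, or pull a single group out explicitly
for key, group in df1.groupby('C'):
    print(key, group.shape)
print(df1.groupby('C').get_group('XX'))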

I think you are looking for pivot_table, i.e.:
pd.pivot_table(df1, values='A', index=['C','B'])
Output :
      A
C  B
WW h  0
XX a  0
   d  4
   s  5
YY f  8
   g  9
ZZ j  7
   k  6
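Keep in mind that pivot_table aggregates (the default is the mean), so this reproduces the raw values only because each (C, B) pair is unique here; a hedged variant that keeps one value per group even if duplicates appeared:
pd.pivot_table(df1, values='A', index=['C', 'B'], aggfunc='first')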

Create adjacency matrix from adjacency list

I have the following DataFrame with two columns:
A x
A y
A z
B x
B w
C x
C w
C i
I want to produce an adjacency matrix like this (counting the intersections):
A B C
A 0 1 2
B 1 0 2
C 2 2 0
I have the following code but it doesn't work:
import pandas as pd
df = pd.read_csv('lista.csv')
drugs = pd.read_csv('drugs.csv')
drugs = drugs['Drug'].tolist()
df = pd.crosstab(df.Drug, df.Gene)
df = df.reindex(index=drugs, columns=drugs)
How can I obtain the adjacency matrix?
Thanks
Try a self-merge on the second column and then crosstab:
s = df.merge(df,on='col2').query('col1_x != col1_y')
pd.crosstab(s['col1_x'], s['col1_y'])
Output:
col1_y A B C
col1_x
A 0 1 1
B 1 0 2
C 1 2 0
Input:
>>> drugs
Drug Gene
0 A x
1 A y
2 A z
3 B x
4 B w
5 C x
6 C w
7 C i
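With these column names, the first answer's self-merge and crosstab would presumably read:
s = drugs.merge(drugs, on='Gene').query('Drug_x != Drug_y')
pd.crosstab(s['Drug_x'], s['Drug_y'])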
Merge on Gene before crosstab, then fill the diagonal with zeros:
import numpy as np
import pandas as pd

df = pd.merge(drugs, drugs, on="Gene")
df = pd.crosstab(df["Drug_x"], df["Drug_y"])
np.fill_diagonal(df.values, 0)
Output:
>>> df
Drug_y A B C
Drug_x
A 0 1 1
B 1 0 2
C 1 2 0

How to replace row values in a particular column using index?

I have the following data frames,
data frame- 1 (named as df1)
index A B C
1 q a w
2 e d q
3 r f r
4 t g t
5 y j o
6 i k p
7 j w k
8 i o u
9 a p v
10 o l a
data frame- 2 (named as df2)
index C
3 a
7 b
9 c
10 d
I tried to replace the values at specific indexes in column "C" of data frame 1 using data frame 2, but I got the following result after using the code below:
df1['C'] = df2
Output:
index A B C
1 q a NaN
2 e d NaN
3 r f a
4 t g NaN
5 y j NaN
6 i k NaN
7 j w b
8 i o NaN
9 a p c
10 o l d
But I want something like this,
Expected output:
index A B C
1 q a w
2 e d q
3 r f a
4 t g t
5 y j o
6 i k p
7 j w b
8 i o u
9 a p c
10 o l d
So clearly I don't want NaN values in column "C"; instead I want the other values to remain as they are (i.e., only the values at those particular index positions should change).
Please let me know the solution.
Thanks in advance!
Assuming index is the actual index of both frames, we can use loc:
df1.loc[df2.index, 'C'] = df2['C']
Or even simpler with:
df1.update(df2)
Output:
A B C
index
1 q a w
2 e d q
3 r f a
4 t g t
5 y j o
6 i k p
7 j w b
8 i o u
9 a p c
10 o l d
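A fully self-contained sketch of the update approach, assuming index really is the index of both frames as in the question:
import pandas as pd

df1 = pd.DataFrame({'A': list('qertyijiao'),
                    'B': list('adfgjkwopl'),
                    'C': list('wqrtopkuva')},
                   index=range(1, 11))
df2 = pd.DataFrame({'C': list('abcd')}, index=[3, 7, 9, 10])

df1.update(df2)   # overwrites only where df2 has values at matching row/column labels
print(df1)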
Try this:
for idx, row in df2.iterrows():
    df1.at[idx, 'C'] = row['C']

Pandas - move values of a column under another column

So here is my problem.
I'm using pandas to parse csv file.
So my csv file looks like this :
A B C D
1 x 5 e
2 y 6 f
3 z 7 g
What I want to get is :
get all the values of column C
Place them under column A
Same with columns D and B
So it would get me this :
A B C D
1 x
2 y
3 z
5 e
6 f
7 g
However, all I've been able to get is a new column that "sums" column A with column C and column B with column D:
A B C D E F
1 x 5 e 15 xe
2 y 6 f 26 yf
3 z 7 g 37 zg
Any idea would be appreciated.
Thanks
Rename columns C and D and append them to the bottom of columns A and B:
result = df[['A', 'B']].append(df[['C','D']].set_axis(['A', 'B'], axis=1)).reset_index(drop=True)
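Note that DataFrame.append has since been deprecated (and removed in pandas 2.0), so the same idea written with pd.concat may age better; a sketch of the equivalent call:
result = pd.concat([df[['A', 'B']],
                    df[['C', 'D']].set_axis(['A', 'B'], axis=1)],
                   ignore_index=True)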

Create new column to existing column and add value at certain interval

I am trying to create a new column based on my first column. For example,
I have a list a = ["A", "B", "C"] and an existing dataframe:
Race Boy Girl
W 0 1
B 1 0
H 1 1
W 1 0
B 0 0
H 0 1
W 1 0
B 1 1
H 0 1
My goal is to create a new column and add values to it based on the W, B, H interval, so that the end result looks like:
Race Boy Girl New Column
W 0 1 A
B 1 0 A
H 1 1 A
W 1 0 B
B 0 0 B
H 0 1 B
W 1 0 C
B 1 1 C
H 0 1 C
The W, B, H interval is consistent, and I want to add a new value to the new column every time I see W. The data is longer than this.
I have tried several approaches but couldn't come up with working code. I would be glad if someone could help and also explain the process. Thanks.
Here is what you can do:
Use a loop to build a repetitive list for the new column:
a = ['A', 'B', 'C']
values = []
for i in range(len(dataframe['Race'])):
    values.append(a[i // 3])  # new label every 3 rows
Once you have that list you can add it as the new column:
dataframe['New Column'] = values
Maybe this works:
labels = ['A', 'B', 'C']  # ... extend for longer data
i = -1
for idx, row in dataframe.iterrows():
    if row['Race'] == 'W':
        i += 1
    dataframe.loc[idx, 'new column'] = labels[i]
Also, if the new column list is very long to type out, you can use a list comprehension:
labels = [x for x in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ']
If your W, B, H values are in this exact order and form complete intervals, you may use np.repeat. As in your comment, np.repeat alone would suffice.
import numpy as np
a = ["A", "B", "C"] #list
n = df.Race.nunique() # length of each interval
df['New Col'] = np.repeat(a, n)
In [20]: df
Out[20]:
Race Boy Girl New Col
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
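If the last interval happens to be incomplete, the repeated array will not match the frame's length; one defensive variant (an assumption, not part of the original answer) is to build the labels first and trim, provided they cover at least len(df) rows:
labels = np.repeat(a, n)           # ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
df['New Col'] = labels[:len(df)]   # trim in case the last interval is short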
Here is a way with pandas. It increments each time you see a new 'W' and handles missing values of Race.
# use original post's definition of df
df['New Col'] = (
    (df['Race'] == 'W')               # True (1) for W; False (0) otherwise
    .cumsum()                         # increments each time you hit True (1)
    .map({1: 'A', 2: 'B', 3: 'C'})    # 1 -> 'A', 2 -> 'B', ...
)
print(df)
Race Boy Girl New Col
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
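If the data runs past three intervals, the explicit map can presumably be swapped for an alphabet lookup (assuming at most 26 intervals):
import string

alpha = dict(enumerate(string.ascii_uppercase, start=1))   # 1 -> 'A', 2 -> 'B', ...
df['New Col'] = (df['Race'] == 'W').cumsum().map(alpha)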
There are multiple ways to solve this problem statement. You can iterate through the DataFrame and assign values to the new column at each interval.
Here's an approach I think will work.
#setting up the DataFrame you referred to in the example
import pandas as pd
df = pd.DataFrame({'Race': ['W','B','H','W','B','H','W','B','H'],
                   'Boy':  [0,1,1,1,0,0,1,1,0],
                   'Girl': [1,0,1,0,0,1,0,1,1]})
#if you have 3 values to assign, create a list say A, B, C
#By creating a list, you have to manage only the list and the frequency
a = ['A','B','C']
#iterate through the dataframe and assign the values in batches
for i, row in df.iterrows():        # the trick is to assign via loc[i]
    df.loc[i, 'New'] = a[i // 3]    # where i is the index; pick the value from list a
    # note: integer-dividing by 3 distributes the labels in equal batches of 3
print(df)
The output of this will be:
Race Boy Girl New
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
I see that you are trying to get a solution that works for 17 sets of records. Here's the code and it works correctly.
import pandas as pd
df = pd.DataFrame({'Race': ['W','B','H']*17,
                   'Boy':  [0,1,1]*17,
                   'Girl': [1,0,1]*17})
#in the DataFrame, you can define the Boy and Girl values
#I think the Race values just repeat, so I repeated them 17 times
#define a string with the letters A through Z
a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
for i, row in df.iterrows():
    df.loc[i, 'New'] = a[i // 3]   # still dividing into equal batches of 3
print(df)
I didn't print all 17 sets; only the first 7 are shown here. The pattern is the same.
Race Boy Girl New
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 0 1 B
4 B 1 0 B
5 H 1 1 B
6 W 0 1 C
7 B 1 0 C
8 H 1 1 C
9 W 0 1 D
10 B 1 0 D
11 H 1 1 D
12 W 0 1 E
13 B 1 0 E
14 H 1 1 E
15 W 0 1 F
16 B 1 0 F
17 H 1 1 F
18 W 0 1 G
19 B 1 0 G
20 H 1 1 G
The old Pythonic way: use a function!
In [18]: df
Out[18]:
Race Boy Girl
0 W 0 1
1 B 1 0
2 H 1 1
3 W 1 0
4 B 0 0
5 H 0 1
6 W 1 0
7 B 1 1
8 H 0 1
The function:
def make_new_col(race_col, abc):
    race_col = iter(race_col)
    abc = iter(abc)
    new_col = []
    while True:
        try:
            race = next(race_col)
        except StopIteration:
            break
        if race == 'W':
            abc_value = next(abc)
            new_col.append(abc_value)
        else:
            new_col.append(abc_value)
    return new_col
Then do:
abc = ['A', 'B', 'C']
df['New Column'] = make_new_col(df['Race'], abc)
You get:
In [20]: df
Out[20]:
Race Boy Girl New Column
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C

How to store values of selected columns in separate rows?

I have a DataFrame that looks as follows:
import pandas as pd
df = pd.DataFrame({
    'ids': range(4),
    'strc': ['some', 'thing', 'abc', 'foo'],
    'not_relevant': range(4),
    'strc2': list('abcd'),
    'strc3': list('lkjh')
})
ids strc not_relevant strc2 strc3
0 0 some 0 a l
1 1 thing 1 b k
2 2 abc 2 c j
3 3 foo 3 d h
For each value in ids I want to collect all values that are stored in the
columns that start with strc and put them in a separate column called strc_list, so I want:
ids strc not_relevant strc2 strc3 strc_list
0 0 some 0 a l some
0 0 some 0 a l a
0 0 some 0 a l l
1 1 thing 1 b k thing
1 1 thing 1 b k b
1 1 thing 1 b k k
2 2 abc 2 c j abc
2 2 abc 2 c j c
2 2 abc 2 c j j
3 3 foo 3 d h foo
3 3 foo 3 d h d
3 3 foo 3 d h h
I know that I can select all required columns using
df.filter(like='strc', axis=1)
but I don't know how to continue from here. How can I get my desired outcome?
After filter, you need stack, droplevel, rename, and then a join back to df:
df1 = df.join(df.filter(like='strc', axis=1).stack().droplevel(1).rename('strc_list'))
Out[135]:
ids strc not_relevant strc2 strc3 strc_list
0 0 some 0 a l some
0 0 some 0 a l a
0 0 some 0 a l l
1 1 thing 1 b k thing
1 1 thing 1 b k b
1 1 thing 1 b k k
2 2 abc 2 c j abc
2 2 abc 2 c j c
2 2 abc 2 c j j
3 3 foo 3 d h foo
3 3 foo 3 d h d
3 3 foo 3 d h h
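The same chain unrolled into steps, in case the one-liner is hard to follow (purely a restatement of the answer above):
stacked = df.filter(like='strc', axis=1).stack()   # long Series with a (row, column) MultiIndex
stacked = stacked.droplevel(1)                     # drop the column level, keep the original row index
stacked = stacked.rename('strc_list')              # the Series name becomes the new column name on join
df1 = df.join(stacked)                             # each row of df repeats once per stacked value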
You can first store the desired values in a list using apply:
df['strc_list'] = df.filter(like='strc', axis=1).apply(list, axis=1)
0 [some, a, l]
1 [thing, b, k]
2 [abc, c, j]
3 [foo, d, h]
Then use explode to distribute them over separate rows:
df = df.explode('strc_list')
A one-liner could then look like this:
df.assign(strc_list=df.filter(like='strc', axis=1).apply(list, axis=1)).explode('strc_list')
