How to store values of selected columns in separate rows? - python

I have a DataFrame that looks as follows:
import pandas as pd
df = pd.DataFrame({
'ids': range(4),
'strc': ['some', 'thing', 'abc', 'foo'],
'not_relevant': range(4),
'strc2': list('abcd'),
'strc3': list('lkjh')
})
ids strc not_relevant strc2 strc3
0 0 some 0 a l
1 1 thing 1 b k
2 2 abc 2 c j
3 3 foo 3 d h
For each value in ids I want to collect all the values stored in the
columns whose names start with strc and put them in a separate column called strc_list, so I want:
ids strc not_relevant strc2 strc3 strc_list
0 0 some 0 a l some
0 0 some 0 a l a
0 0 some 0 a l l
1 1 thing 1 b k thing
1 1 thing 1 b k b
1 1 thing 1 b k k
2 2 abc 2 c j abc
2 2 abc 2 c j c
2 2 abc 2 c j j
3 3 foo 3 d h foo
3 3 foo 3 d h d
3 3 foo 3 d h h
I know that I can select all required columns using
df.filter(like='strc', axis=1)
but I don't know how to continue from here. How can I get my desired outcome?

After filter, you need stack, droplevel and rename, then join back to df:
df1 = df.join(df.filter(like='strc', axis=1).stack().droplevel(1).rename('strc_list'))
Out[135]:
ids strc not_relevant strc2 strc3 strc_list
0 0 some 0 a l some
0 0 some 0 a l a
0 0 some 0 a l l
1 1 thing 1 b k thing
1 1 thing 1 b k b
1 1 thing 1 b k k
2 2 abc 2 c j abc
2 2 abc 2 c j c
2 2 abc 2 c j j
3 3 foo 3 d h foo
3 3 foo 3 d h d
3 3 foo 3 d h h
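To make the intermediate steps visible, here is a self-contained sketch (using the example frame from the question) that shows the Series produced by stack/droplevel/rename before it is joined back:

```python
import pandas as pd

df = pd.DataFrame({
    'ids': range(4),
    'strc': ['some', 'thing', 'abc', 'foo'],
    'not_relevant': range(4),
    'strc2': list('abcd'),
    'strc3': list('lkjh'),
})

# stack() pivots the strc* columns into a Series with a (row, column) MultiIndex;
# droplevel(1) discards the column level so the index aligns with df again,
# which is what makes the subsequent join duplicate each original row
s = df.filter(like='strc', axis=1).stack().droplevel(1).rename('strc_list')
print(s)
```

Each original row index now appears three times, once per strc* column, which is why the join produces the repeated rows shown above.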

You can first store the desired values in a list using apply:
df['strc_list'] = df.filter(like='strc', axis=1).apply(list, axis=1)
0 [some, a, l]
1 [thing, b, k]
2 [abc, c, j]
3 [foo, d, h]
Then use explode to distribute them over separate rows:
df = df.explode('strc_list')
A one-liner could then look like this:
df.assign(strc_list=df.filter(like='strc', axis=1).apply(list, axis=1)).explode('strc_list')
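Putting the apply/explode approach together as a runnable sketch (explode requires pandas >= 0.25):

```python
import pandas as pd

df = pd.DataFrame({
    'ids': range(4),
    'strc': ['some', 'thing', 'abc', 'foo'],
    'not_relevant': range(4),
    'strc2': list('abcd'),
    'strc3': list('lkjh'),
})

# collect the strc* values of each row into a list, then explode the lists
# into one row per list element (the other columns are repeated)
out = (df.assign(strc_list=df.filter(like='strc', axis=1).apply(list, axis=1))
         .explode('strc_list'))
# 4 original rows x 3 strc* columns -> 12 rows
```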

Related

How to convert binary columns with multiple occurrences into categorical data in Pandas

I have the following example data set
A B C D
foo 0 1 1
bar 0 0 1
baz 1 1 0
How could I extract the column names of each 1 occurrence in a row and put them into another column E, so that I get the following table:
A B C D E
foo 0 1 1 C, D
bar 0 0 1 D
baz 1 1 0 B, C
Note that there can be more than two 1s per row.
You can use DataFrame.dot.
df['E'] = df[['B', 'C', 'D']].dot(df.columns[1:] + ', ').str.rstrip(', ')
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C
Inspired by jezrael's answer in this post.
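A self-contained version of the DataFrame.dot trick, assuming the example data above. The reason it works: multiplying a string by 0 or 1 repeats it 0 or 1 times, and the row-wise sum of the matrix product concatenates the surviving names:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'baz'],
                   'B': [0, 0, 1],
                   'C': [1, 0, 1],
                   'D': [1, 1, 0]})

# 1 * 'C, ' == 'C, ' and 0 * 'B, ' == '', so the dot product builds
# 'C, D, ' for the first row; rstrip removes the trailing separator
df['E'] = df[['B', 'C', 'D']].dot(df.columns[1:] + ', ').str.rstrip(', ')
print(df)
```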
Another way is to convert each row to boolean and use it as a selection mask to filter the column names.
cols = pd.Index(['B', 'C', 'D'])
df['E'] = df[cols].astype('bool').apply(lambda row: ", ".join(cols[row]), axis=1)
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C

Create new column to existing column and add value at certain interval

I am trying to create a new column based on my first column. For example,
I have a list a = ["A", "B", "C"] and an existing dataframe:
Race Boy Girl
W 0 1
B 1 0
H 1 1
W 1 0
B 0 0
H 0 1
W 1 0
B 1 1
H 0 1
My goal is to create a new column and add values to it based on the W, B, H interval, so that the end result looks like:
Race Boy Girl New Column
W 0 1 A
B 1 0 A
H 1 1 A
W 1 0 B
B 0 0 B
H 0 1 B
W 1 0 C
B 1 1 C
H 0 1 C
The W, B, H interval is consistent, and I want to add a new value to the new column every time I see W. The data is longer than this.
I have tried all possible ways and couldn't come up with working code. I will be glad if someone can help and also explain the process. Thanks
Here is what you can do:
Use a loop to create a repetitive list for the column, then add it to the dataframe by using:
dataframe['New Column'] = new_col
Maybe this works:
labels = ['A', 'B', 'C', ...]
i = -1
new_col = []
for race in dataframe['Race']:
    if race == 'W':  # a new interval starts at every 'W'
        i += 1
    new_col.append(labels[i])
dataframe['New Column'] = new_col
Also, if the list of labels is very big to type, you can use a list comprehension:
labels = [x for x in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ']
If your W, B, H values are in this exact order with complete intervals, you may use np.repeat. As in your comment, np.repeat alone would suffice.
import numpy as np
a = ["A", "B", "C"] #list
n = df.Race.nunique() # length of each interval
df['New Col'] = np.repeat(a, n)
In [20]: df
Out[20]:
Race Boy Girl New Col
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
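If the total length of the frame is not an exact multiple of the interval, one hedged variant of the np.repeat idea (a sketch, not from the original answer) is to over-generate and trim:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 3 + ['W']})  # 10 rows, uneven
labels = ['A', 'B', 'C', 'D']
n = 3  # interval length

# repeat each label n times, then cut the result down to the frame's length
df['New Col'] = np.repeat(labels, n)[:len(df)]
```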
Here is a way with pandas. It increments each time you see a new 'W' and handles missing values of Race.
# use original post's definition of df
df['New Col'] = (
    (df['Race'] == 'W')             # True (1) for W; False (0) otherwise
    .cumsum()                       # increments each time you hit True (1)
    .map({1: 'A', 2: 'B', 3: 'C'})  # 1->A, 2->B, ...
)
print(df)
Race Boy Girl New Col
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
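If there may be more groups than you want to type into the map, the same cumsum idea can index into the alphabet instead (a sketch; assumes at most 26 groups):

```python
import string
import pandas as pd

df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 4})

# cumsum counts how many 'W's have been seen so far; subtract 1 for 0-based indexing
group = (df['Race'] == 'W').cumsum() - 1
df['New Col'] = group.map(lambda i: string.ascii_uppercase[i])
```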
There are multiple ways to solve this problem statement. You can iterate through the DataFrame and assign values to the new column at each interval.
Here's an approach I think will work.
#setting up the DataFrame you referred in the example
import pandas as pd
df = pd.DataFrame({'Race': ['W','B','H','W','B','H','W','B','H'],
                   'Boy':  [0,1,1,1,0,0,1,1,0],
                   'Girl': [1,0,1,0,0,1,0,1,1]})
#if you have 3 values to assign, create a list say A, B, C
#By creating a list, you have to manage only the list and the frequency
a = ['A','B','C']
#iterate thru the dataframe and assign the values in batches
for (i, row) in df.iterrows():   #the trick is to assign via loc[i]
    df.loc[i, 'New'] = a[i // 3] #where i is the index; integer-dividing by 3 distributes the values in equal batches
print(df)
The output of this will be:
Race Boy Girl New
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
I see that you are trying to get a solution that works for 17 sets of records. Here's the code, and it works correctly.
import pandas as pd
df = pd.DataFrame({'Race': ['W','B','H']*17,
                   'Boy':  [0,1,1]*17,
                   'Girl': [1,0,1]*17})
#in the DataFrame, you can define the Boy and Girl value
#I think Race values are repeating so I just repeated it 17 times
#define a variable from a thru z
a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
for (i, row) in df.iterrows():
    df.loc[i, 'New'] = a[i // 3] #still dividing into equal batches of 3
print(df)
I didn't print all 17 sets here, just the first 7; the result follows the same pattern.
Race Boy Girl New
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 0 1 B
4 B 1 0 B
5 H 1 1 B
6 W 0 1 C
7 B 1 0 C
8 H 1 1 C
9 W 0 1 D
10 B 1 0 D
11 H 1 1 D
12 W 0 1 E
13 B 1 0 E
14 H 1 1 E
15 W 0 1 F
16 B 1 0 F
17 H 1 1 F
18 W 0 1 G
19 B 1 0 G
20 H 1 1 G
The old Pythonic fashion: use a function!
In [18]: df
Out[18]:
Race Boy Girl
0 W 0 1
1 B 1 0
2 H 1 1
3 W 1 0
4 B 0 0
5 H 0 1
6 W 1 0
7 B 1 1
8 H 0 1
The function:
def make_new_col(race_col, abc):
    race_col = iter(race_col)
    abc = iter(abc)
    new_col = []
    while True:
        try:
            race = next(race_col)
        except StopIteration:
            break
        if race == 'W':
            abc_value = next(abc)
            new_col.append(abc_value)
        else:
            new_col.append(abc_value)
    return new_col
Then do:
abc = ['A', 'B', 'C']
df['New Column'] = make_new_col(df['Race'], abc)
You get:
In [20]: df
Out[20]:
Race Boy Girl New Column
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
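Another short pandas-only sketch of the same idea: write a label only at the 'W' rows and forward-fill the gaps (hedged; it assumes the frame starts with a 'W' and there is one label per 'W'):

```python
import pandas as pd

df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 3})
labels = ['A', 'B', 'C']  # one label per 'W' row

# set one label at each 'W' position (other rows become NaN), then
# forward-fill so every row inherits the label of its interval's 'W'
df.loc[df['Race'] == 'W', 'New Column'] = labels
df['New Column'] = df['New Column'].ffill()
```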

Rearranging table structure based on number of rows and columns in pandas

I have the following data frame table. The table has the columns Id, columns, rows, 1, 2, 3, 4, 5, 6, 7, 8, and 9.
Id columns rows 1 2 3 4 5 6 7 8 9
1 3 3 A B C D E F G H Z
2 3 2 I J K
By considering Id, the number of rows, and columns I would like to restructure the table as follows.
Id columns rows col_1 col_2 col_3
1 3 3 A B C
1 3 3 D E F
1 3 3 G H Z
2 3 2 I J K
2 3 2 - - -
Can anyone help to do this in Python Pandas?
Here's a solution using a MultiIndex and .iterrows():
df
Id columns rows 1 2 3 4 5 6 7 8 9
0 1 3 3 A B C D E F G H Z
1 2 3 2 I J K None None None None None None
You can set n to any length, in your case 3:
n = 3
df = df.set_index(['Id', 'columns', 'rows'])
new_index = []
new_rows = []
for index, row in df.iterrows():
    max_rows = index[-1] * (len(index)-1)  # read amount of rows
    for i in range(0, len(row), n):
        if i > max_rows:  # max rows reached, stop appending
            continue
        new_index.append(index)
        new_rows.append(row.values[i:i+n])
df2 = pd.DataFrame(new_rows, index=pd.MultiIndex.from_tuples(new_index))
df2
0 1 2
1 3 3 A B C
3 D E F
3 G H Z
2 3 2 I J K
2 None None None
And if you are keen on getting your old index and headers back:
new_headers = ['Id', 'columns', 'rows'] + list(range(1, n+1))
df2.reset_index().set_axis(new_headers, axis=1)
Id columns rows 1 2 3
0 1 3 3 A B C
1 1 3 3 D E F
2 1 3 3 G H Z
3 2 3 2 I J K
4 2 3 2 None None None
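The same reshape can be sketched with plain numpy per row. The frame below mirrors the question's table; the col_1..col_3 output names are an assumption that columns == 3 throughout:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 3, 3, 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Z'],
     [2, 3, 2, 'I', 'J', 'K', None, None, None, None, None, None]],
    columns=['Id', 'columns', 'rows'] + list(range(1, 10)),
)

out_rows = []
for _, r in df.iterrows():
    c, n = r['columns'], r['rows']
    vals = r.iloc[3:].to_numpy()[: c * n]  # only the cells that fit the c x n grid
    for chunk in vals.reshape(n, c):       # one output row per grid row
        out_rows.append([r['Id'], c, n, *chunk])

out = pd.DataFrame(out_rows,
                   columns=['Id', 'columns', 'rows', 'col_1', 'col_2', 'col_3'])
out = out.fillna('-')  # the question shows missing cells as '-'
```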
Using melt and str.split with floor division against your index to create groups of 3.
s = pd.melt(df,id_vars=['Id','columns','rows'])
s1 = (
    s.sort_values(["Id", "variable"])
     .assign(idx=s.index // 3)
     .fillna("-")
     .groupby(["idx", "Id"])
     .agg(columns=("columns", "first"),
          rows=("rows", "first"),
          value=("value", ",".join))
)
split_cols = s1["value"].str.split(",", expand=True)
s2 = split_cols.rename(columns={i: f"col_{i+1}" for i in split_cols.columns})
df1 = pd.concat([s1.drop("value", axis=1), s2], axis=1)
print(df1)
columns rows col_1 col_2 col_3
idx Id
0 1 3 3 A B C
1 1 3 3 D E F
2 1 3 3 G H Z
3 2 3 2 I J K
4 2 3 2 - - -
5 2 3 2 - - -
I modified unutbu's solution to create an array for each row with the expected length of the new rows and columns, then build a DataFrame per row in a list comprehension and join them together with concat:
import numpy as np

def f(x):
    c, r = x.name[1], x.name[2]
    #print (c, r)
    arr = np.empty(c * r, dtype='O')
    vals = x.iloc[:len(arr)]
    arr[:len(vals)] = vals
    idx = pd.MultiIndex.from_tuples([x.name] * r, names=df.columns[:3])
    cols = [f'col_{i+1}' for i in range(c)]
    return pd.DataFrame(arr.reshape((r, c)), index=idx, columns=cols).fillna('-')

df1 = (pd.concat([x for x in df.set_index(['Id', 'columns', 'rows'])
                             .apply(f, axis=1)])
         .reset_index())
print (df1)
Id columns rows col_1 col_2 col_3
0 1 3 3 A B C
1 1 3 3 D E F
2 1 3 3 G H Z
3 2 3 2 I J K
4 2 3 2 - - -

insert a list as row in a dataframe at a specific position

I have a list l = ['a', 'b', 'c']
and a dataframe with columns d, e, f whose values are all numbers.
How can I insert list l into my dataframe just below the columns?
Setup
df = pd.DataFrame(np.ones((2, 3), dtype=int), columns=list('def'))
l = list('abc')
df
d e f
0 1 1 1
1 1 1 1
Option 1
I'd accomplish this task by adding a level to the columns object
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns, l)))
df
d e f
a b c
0 1 1 1
1 1 1 1
Option 2
Use a dictionary comprehension passed to the dataframe constructor
pd.DataFrame({(i, j): df[i] for i, j in zip(df, l)})
d e f
a b c
0 1 1 1
1 1 1 1
But if you insist on putting it in the dataframe proper... (keep in mind, this turns the dataframe into dtype object and we lose significant computational efficiency.)
Alternative 1
pd.DataFrame([l], columns=df.columns).append(df, ignore_index=True)
d e f
0 a b c
1 1 1 1
2 1 1 1
Alternative 2
pd.DataFrame([l] + df.values.tolist(), columns=df.columns)
d e f
0 a b c
1 1 1 1
2 1 1 1
Use pd.concat
In [1112]: df
Out[1112]:
d e f
0 0.517243 0.731847 0.259034
1 0.318821 0.551298 0.773115
2 0.194192 0.707525 0.804102
3 0.945842 0.614033 0.757389
In [1113]: pd.concat([pd.DataFrame([l], columns=df.columns), df], ignore_index=True)
Out[1113]:
d e f
0 a b c
1 0.517243 0.731847 0.259034
2 0.318821 0.551298 0.773115
3 0.194192 0.707525 0.804102
4 0.945842 0.614033 0.757389
Are you looking for append? I.e.
df = pd.DataFrame([[1,2,3]],columns=list('def'))
I = ['a','b','c']
ndf = df.append(pd.Series(I,index=df.columns.tolist()),ignore_index=True)
Output:
d e f
0 1 2 3
1 a b c
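Note that DataFrame.append was removed in pandas 2.0, so on current pandas the same append reads as a concat; a minimal equivalent sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=list('def'))
l = ['a', 'b', 'c']

# wrap the list in a one-row frame and concatenate it below the existing rows
ndf = pd.concat([df, pd.DataFrame([l], columns=df.columns)], ignore_index=True)
```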
If you want to add the list to the columns as a MultiIndex:
df.columns = [df.columns, l]
print (df)
d e f
a b c
0 4 7 1
1 5 8 3
2 4 9 5
3 5 4 7
4 5 2 1
5 4 3 0
print (df.columns)
MultiIndex(levels=[['d', 'e', 'f'], ['a', 'b', 'c']],
labels=[[0, 1, 2], [0, 1, 2]])
If you want to add the list at a specific position pos:
pos = 0
df1 = pd.DataFrame([l], columns=df.columns)
print (df1)
d e f
0 a b c
df = pd.concat([df.iloc[:pos], df1, df.iloc[pos:]], ignore_index=True)
print (df)
d e f
0 a b c
1 4 7 1
2 5 8 3
3 4 9 5
4 5 4 7
5 5 2 1
6 4 3 0
But if you append this list to a numeric dataframe, you get mixed types - numeric with strings - so some pandas functions may fail.
Setup:
df = pd.DataFrame({'d': [4,5,4,5,5,4],
                   'e': [7,8,9,4,2,3],
                   'f': [1,3,5,7,1,0]})
print (df)

Convert Two column data frame to occurrence matrix in pandas

Hi all, I have a csv file which contains data in the format below:
A a
A b
B f
B g
B e
B h
C d
C e
C f
The first column contains items; the second column contains the available features from the feature vector [a,b,c,d,e,f,g,h].
I want to convert this to an occurrence matrix that looks like below:
a,b,c,d,e,f,g,h
A 1,1,0,0,0,0,0,0
B 0,0,0,0,1,1,1,1
C 0,0,0,1,1,1,0,0
Can anyone tell me how to do this using pandas?
Here is another way to do it using pd.get_dummies().
import pandas as pd
# your data
# =======================
df
col1 col2
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
# processing
# ===================================
pd.get_dummies(df.col2).groupby(df.col1).apply(max)
a b d e f g h
col1
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
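To get every feature of the full vector a-h as a column (including c, which never occurs in the data), crosstab can be reindexed with a fill value; a sketch assuming the col1/col2 names used above:

```python
import pandas as pd

df = pd.DataFrame({'col1': list('AABBBBCCC'),
                   'col2': list('abfgehdef')})

features = list('abcdefgh')
# crosstab counts the (item, feature) occurrences; reindex inserts the
# features that never appeared as all-zero columns
out = pd.crosstab(df['col1'], df['col2']).reindex(columns=features, fill_value=0)
```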
It's unclear whether your data has a typo or not, but you can use crosstab for this:
In [95]:
pd.crosstab(index=df['A'], columns = df['a'])
Out[95]:
a b d e f g h
A
A 1 0 0 0 0 0
B 0 0 1 1 1 1
C 0 1 1 1 0 0
In your sample data, your second column has the value a as the name of that column, but in your expected output it appears in the column as a value.
EDIT
OK I fixed your input data so it generates the correct result:
In [98]:
import pandas as pd
import io
t="""A a
A b
B f
B g
B e
B h
C d
C e
C f"""
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None, names=['A','a'])
df
Out[98]:
A a
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
In [99]:
ct = pd.crosstab(index=df['A'], columns = df['a'])
ct
Out[99]:
a a b d e f g h
A
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
This approach yields the same result, as a scipy sparse COO matrix, much faster:
from scipy import sparse
df['col1'] = df['col1'].astype("category")
df['col2'] = df['col2'].astype("category")
df['ones'] = 1
user_items = sparse.coo_matrix((df.ones.astype(float),
                                (df.col1.cat.codes,
                                 df.col2.cat.codes)))
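For reference, the COO matrix can be checked against the dense occurrence matrix. A self-contained sketch: rows and columns follow the sorted category codes, so the absent feature c simply produces no column:

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({'col1': list('AABBBBCCC'),
                   'col2': list('abfgehdef')})
df['col1'] = df['col1'].astype('category')
df['col2'] = df['col2'].astype('category')

# one entry per (item, feature) pair; duplicate pairs would be summed
mat = sparse.coo_matrix((np.ones(len(df)),
                         (df.col1.cat.codes, df.col2.cat.codes)))

# label the dense view with the category names to recover the matrix shown above
dense = pd.DataFrame(mat.toarray(),
                     index=df['col1'].cat.categories,
                     columns=df['col2'].cat.categories)
```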
