How to store values of selected columns in separate rows? - python

I have a DataFrame that looks as follows:
import pandas as pd
df = pd.DataFrame({
'ids': range(4),
'strc': ['some', 'thing', 'abc', 'foo'],
'not_relevant': range(4),
'strc2': list('abcd'),
'strc3': list('lkjh')
})
ids strc not_relevant strc2 strc3
0 0 some 0 a l
1 1 thing 1 b k
2 2 abc 2 c j
3 3 foo 3 d h
For each value in ids I want to collect all the values stored in the
columns whose names start with strc and put them in a separate column called strc_list, so I want:
ids strc not_relevant strc2 strc3 strc_list
0 0 some 0 a l some
0 0 some 0 a l a
0 0 some 0 a l l
1 1 thing 1 b k thing
1 1 thing 1 b k b
1 1 thing 1 b k k
2 2 abc 2 c j abc
2 2 abc 2 c j c
2 2 abc 2 c j j
3 3 foo 3 d h foo
3 3 foo 3 d h d
3 3 foo 3 d h h
I know that I can select all required columns using
df.filter(like='strc', axis=1)
but I don't know how to continue from here. How can I get my desired outcome?

After filter, you need stack, droplevel and rename, then join back to df:
df1 = df.join(df.filter(like='strc', axis=1).stack().droplevel(1).rename('strc_list'))
Out[135]:
ids strc not_relevant strc2 strc3 strc_list
0 0 some 0 a l some
0 0 some 0 a l a
0 0 some 0 a l l
1 1 thing 1 b k thing
1 1 thing 1 b k b
1 1 thing 1 b k k
2 2 abc 2 c j abc
2 2 abc 2 c j c
2 2 abc 2 c j j
3 3 foo 3 d h foo
3 3 foo 3 d h d
3 3 foo 3 d h h
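To make the intermediate steps visible, here is a self-contained sketch (using the example frame from the question) that shows the Series produced by stack/droplevel/rename before it is joined back:

```python
import pandas as pd

df = pd.DataFrame({
    'ids': range(4),
    'strc': ['some', 'thing', 'abc', 'foo'],
    'not_relevant': range(4),
    'strc2': list('abcd'),
    'strc3': list('lkjh'),
})

# stack() pivots the strc* columns into a Series with a (row, column) MultiIndex;
# droplevel(1) discards the column level so the index aligns with df again,
# which is what makes the subsequent join duplicate each original row
s = df.filter(like='strc', axis=1).stack().droplevel(1).rename('strc_list')
print(s)
```

Each original row index now appears three times, once per strc* column, which is why the join produces the repeated rows shown above.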

You can first store the desired values in a list using apply:
df['strc_list'] = df.filter(like='strc', axis=1).apply(list, axis=1)
0 [some, a, l]
1 [thing, b, k]
2 [abc, c, j]
3 [foo, d, h]
Then use explode to distribute them over separate rows:
df = df.explode('strc_list')
A one-liner could then look like this:
df.assign(strc_list=df.filter(like='strc', axis=1).apply(list, axis=1)).explode('strc_list')
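Putting the apply/explode approach together as a runnable sketch (explode requires pandas >= 0.25):

```python
import pandas as pd

df = pd.DataFrame({
    'ids': range(4),
    'strc': ['some', 'thing', 'abc', 'foo'],
    'not_relevant': range(4),
    'strc2': list('abcd'),
    'strc3': list('lkjh'),
})

# collect the strc* values of each row into a list, then explode the lists
# into one row per list element (the other columns are repeated)
out = (df.assign(strc_list=df.filter(like='strc', axis=1).apply(list, axis=1))
         .explode('strc_list'))
# 4 original rows x 3 strc* columns -> 12 rows
```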

Related

How to convert binary columns with multiple occurrences into categorical data in Pandas

I have the following example data set
A B C D
foo 0 1 1
bar 0 0 1
baz 1 1 0
How could I extract the column names of each 1 occurrence in a row and put them into another column E, so that I get the following table:
A B C D E
foo 0 1 1 C, D
bar 0 0 1 D
baz 1 1 0 B, C
Note that there can be more than two 1s per row.
You can use DataFrame.dot.
df['E'] = df[['B', 'C', 'D']].dot(df.columns[1:] + ', ').str.rstrip(', ')
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C
Inspired by jezrael's answer in this post.
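A self-contained version of the DataFrame.dot trick, assuming the example data above. The reason it works: multiplying a string by 0 or 1 repeats it 0 or 1 times, and the row-wise sum of the matrix product concatenates the surviving names:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'baz'],
                   'B': [0, 0, 1],
                   'C': [1, 0, 1],
                   'D': [1, 1, 0]})

# 1 * 'C, ' == 'C, ' and 0 * 'B, ' == '', so the dot product builds
# 'C, D, ' for the first row; rstrip removes the trailing separator
df['E'] = df[['B', 'C', 'D']].dot(df.columns[1:] + ', ').str.rstrip(', ')
print(df)
```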
Another way is to convert each row to boolean and use it as a selection mask to filter the column names.
cols = pd.Index(['B', 'C', 'D'])
df['E'] = df[cols].astype('bool').apply(lambda row: ", ".join(cols[row]), axis=1)
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C

Create new column to existing column and add value at certain interval

I am trying to create a new column based on my first column. For example,
I have a list a = ["A", "B", "C"] and an existing dataframe:
Race Boy Girl
W 0 1
B 1 0
H 1 1
W 1 0
B 0 0
H 0 1
W 1 0
B 1 1
H 0 1
My goal is to create a new column and add values to it based on the W, B, H interval, so that the end result looks like:
Race Boy Girl New Column
W 0 1 A
B 1 0 A
H 1 1 A
W 1 0 B
B 0 0 B
H 0 1 B
W 1 0 C
B 1 1 C
H 0 1 C
The W, B, H interval is consistent, and I want to add a new value to the new column every time I see W. The data is longer than this.
I have tried all possible ways and couldn't come up with working code. I will be glad if someone can help and also explain the process. Thanks
Here is what you can do:
Use a loop to create a repetitive list for the column, then add it to the dataframe by using:
dataframe['New Column'] = new_col
Maybe this works:
labels = ['A', 'B', 'C', ...]
i = -1
new_col = []
for race in dataframe['Race']:
    if race == 'W':  # a new interval starts at every 'W'
        i += 1
    new_col.append(labels[i])
dataframe['New Column'] = new_col
Also, if the list of labels is very big to type, you can use a list comprehension:
labels = [x for x in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ']
If your W, B, H values are in this exact order with complete intervals, you may use np.repeat. As in your comment, np.repeat alone would suffice.
import numpy as np
a = ["A", "B", "C"] #list
n = df.Race.nunique() # length of each interval
df['New Col'] = np.repeat(a, n)
In [20]: df
Out[20]:
Race Boy Girl New Col
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
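If the total length of the frame is not an exact multiple of the interval, one hedged variant of the np.repeat idea (a sketch, not from the original answer) is to over-generate and trim:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 3 + ['W']})  # 10 rows, uneven
labels = ['A', 'B', 'C', 'D']
n = 3  # interval length

# repeat each label n times, then cut the result down to the frame's length
df['New Col'] = np.repeat(labels, n)[:len(df)]
```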
Here is a way with pandas. It increments each time you see a new 'W' and handles missing values of Race.
# use original post's definition of df
df['New Col'] = (
    (df['Race'] == 'W')             # True (1) for W; False (0) otherwise
    .cumsum()                       # increments each time you hit True (1)
    .map({1: 'A', 2: 'B', 3: 'C'})  # 1->A, 2->B, ...
)
print(df)
Race Boy Girl New Col
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
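If there may be more groups than you want to type into the map, the same cumsum idea can index into the alphabet instead (a sketch; assumes at most 26 groups):

```python
import string
import pandas as pd

df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 4})

# cumsum counts how many 'W's have been seen so far; subtract 1 for 0-based indexing
group = (df['Race'] == 'W').cumsum() - 1
df['New Col'] = group.map(lambda i: string.ascii_uppercase[i])
```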
There are multiple ways to solve this problem statement. You can iterate through the DataFrame and assign values to the new column at each interval.
Here's an approach I think will work.
#setting up the DataFrame you referred in the example
import pandas as pd
df = pd.DataFrame({'Race': ['W','B','H','W','B','H','W','B','H'],
                   'Boy':  [0,1,1,1,0,0,1,1,0],
                   'Girl': [1,0,1,0,0,1,0,1,1]})
#if you have 3 values to assign, create a list say A, B, C
#By creating a list, you have to manage only the list and the frequency
a = ['A','B','C']
#iterate thru the dataframe and assign the values in batches
for (i, row) in df.iterrows():   #the trick is to assign via loc[i]
    df.loc[i, 'New'] = a[i // 3] #where i is the index; integer-dividing by 3 distributes the values in equal batches
print(df)
The output of this will be:
Race Boy Girl New
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
I see that you are trying to get a solution that works for 17 sets of records. Here's the code, and it works correctly.
import pandas as pd
df = pd.DataFrame({'Race': ['W','B','H']*17,
                   'Boy':  [0,1,1]*17,
                   'Girl': [1,0,1]*17})
#in the DataFrame, you can define the Boy and Girl value
#I think Race values are repeating so I just repeated it 17 times
#define a variable from a thru z
a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
for (i, row) in df.iterrows():
    df.loc[i, 'New'] = a[i // 3] #still dividing into equal batches of 3
print(df)
I didn't print all 17 sets here, just the first 7; the result follows the same pattern.
Race Boy Girl New
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 0 1 B
4 B 1 0 B
5 H 1 1 B
6 W 0 1 C
7 B 1 0 C
8 H 1 1 C
9 W 0 1 D
10 B 1 0 D
11 H 1 1 D
12 W 0 1 E
13 B 1 0 E
14 H 1 1 E
15 W 0 1 F
16 B 1 0 F
17 H 1 1 F
18 W 0 1 G
19 B 1 0 G
20 H 1 1 G
The old Pythonic fashion: use a function!
In [18]: df
Out[18]:
Race Boy Girl
0 W 0 1
1 B 1 0
2 H 1 1
3 W 1 0
4 B 0 0
5 H 0 1
6 W 1 0
7 B 1 1
8 H 0 1
The function:
def make_new_col(race_col, abc):
    race_col = iter(race_col)
    abc = iter(abc)
    new_col = []
    while True:
        try:
            race = next(race_col)
        except StopIteration:
            break
        if race == 'W':
            abc_value = next(abc)
            new_col.append(abc_value)
        else:
            new_col.append(abc_value)
    return new_col
Then do:
abc = ['A', 'B', 'C']
df['New Column'] = make_new_col(df['Race'], abc)
You get:
In [20]: df
Out[20]:
Race Boy Girl New Column
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
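Another short pandas-only sketch of the same idea: write a label only at the 'W' rows and forward-fill the gaps (hedged; it assumes the frame starts with a 'W' and there is one label per 'W'):

```python
import pandas as pd

df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 3})
labels = ['A', 'B', 'C']  # one label per 'W' row

# set one label at each 'W' position (other rows become NaN), then
# forward-fill so every row inherits the label of its interval's 'W'
df.loc[df['Race'] == 'W', 'New Column'] = labels
df['New Column'] = df['New Column'].ffill()
```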

Rearranging table structure based on number of rows and columns in pandas

I have the following data frame table. The table has the columns Id, columns, rows, 1, 2, 3, 4, 5, 6, 7, 8, and 9.
Id columns rows 1 2 3 4 5 6 7 8 9
1 3 3 A B C D E F G H Z
2 3 2 I J K
By considering Id, the number of rows, and columns I would like to restructure the table as follows.
Id columns rows col_1 col_2 col_3
1 3 3 A B C
1 3 3 D E F
1 3 3 G H Z
2 3 2 I J K
2 3 2 - - -
Can anyone help to do this in Python Pandas?
Here's a solution using a MultiIndex and .iterrows():
df
Id columns rows 1 2 3 4 5 6 7 8 9
0 1 3 3 A B C D E F G H Z
1 2 3 2 I J K None None None None None None
You can set n to any length, in your case 3:
n = 3
df = df.set_index(['Id', 'columns', 'rows'])
new_index = []
new_rows = []
for index, row in df.iterrows():
    max_rows = index[-1] * (len(index)-1)  # read amount of rows
    for i in range(0, len(row), n):
        if i > max_rows:  # max rows reached, stop appending
            continue
        new_index.append(index)
        new_rows.append(row.values[i:i+n])
df2 = pd.DataFrame(new_rows, index=pd.MultiIndex.from_tuples(new_index))
df2
0 1 2
1 3 3 A B C
3 D E F
3 G H Z
2 3 2 I J K
2 None None None
And if you are keen on getting your old index and headers back:
new_headers = ['Id', 'columns', 'rows'] + list(range(1, n+1))
df2.reset_index().set_axis(new_headers, axis=1)
Id columns rows 1 2 3
0 1 3 3 A B C
1 1 3 3 D E F
2 1 3 3 G H Z
3 2 3 2 I J K
4 2 3 2 None None None
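The same reshape can be sketched with plain numpy per row. The frame below mirrors the question's table; the col_1..col_3 output names are an assumption that columns == 3 throughout:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 3, 3, 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Z'],
     [2, 3, 2, 'I', 'J', 'K', None, None, None, None, None, None]],
    columns=['Id', 'columns', 'rows'] + list(range(1, 10)),
)

out_rows = []
for _, r in df.iterrows():
    c, n = r['columns'], r['rows']
    vals = r.iloc[3:].to_numpy()[: c * n]  # only the cells that fit the c x n grid
    for chunk in vals.reshape(n, c):       # one output row per grid row
        out_rows.append([r['Id'], c, n, *chunk])

out = pd.DataFrame(out_rows,
                   columns=['Id', 'columns', 'rows', 'col_1', 'col_2', 'col_3'])
out = out.fillna('-')  # the question shows missing cells as '-'
```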
Using melt and str.split with floor division against your index to create groups of 3.
s = pd.melt(df,id_vars=['Id','columns','rows'])
s1 = (
    s.sort_values(["Id", "variable"])
     .assign(idx=s.index // 3)
     .fillna("-")
     .groupby(["idx", "Id"])
     .agg(columns=("columns", "first"),
          rows=("rows", "first"),
          value=("value", ",".join))
)
split_cols = s1["value"].str.split(",", expand=True)
s2 = split_cols.rename(columns={i: f"col_{i+1}" for i in split_cols.columns})
df1 = pd.concat([s1.drop("value", axis=1), s2], axis=1)
print(df1)
columns rows col_1 col_2 col_3
idx Id
0 1 3 3 A B C
1 1 3 3 D E F
2 1 3 3 G H Z
3 2 3 2 I J K
4 2 3 2 - - -
5 2 3 2 - - -
I modified unutbu's solution to create an array for each row with the expected length of the new rows and columns, then build a DataFrame per row in a list comprehension and join them together with concat:
import numpy as np

def f(x):
    c, r = x.name[1], x.name[2]
    #print (c, r)
    arr = np.empty(c * r, dtype='O')
    vals = x.iloc[:len(arr)]
    arr[:len(vals)] = vals
    idx = pd.MultiIndex.from_tuples([x.name] * r, names=df.columns[:3])
    cols = [f'col_{i+1}' for i in range(c)]
    return pd.DataFrame(arr.reshape((r, c)), index=idx, columns=cols).fillna('-')

df1 = (pd.concat([x for x in df.set_index(['Id', 'columns', 'rows'])
                             .apply(f, axis=1)])
         .reset_index())
print (df1)
Id columns rows col_1 col_2 col_3
0 1 3 3 A B C
1 1 3 3 D E F
2 1 3 3 G H Z
3 2 3 2 I J K
4 2 3 2 - - -

insert a list as row in a dataframe at a specific position

I have a list l = ['a', 'b', 'c']
and a dataframe with columns d, e, f whose values are all numbers.
How can I insert list l into my dataframe just below the columns?
Setup
df = pd.DataFrame(np.ones((2, 3), dtype=int), columns=list('def'))
l = list('abc')
df
d e f
0 1 1 1
1 1 1 1
Option 1
I'd accomplish this task by adding a level to the columns object
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns, l)))
df
d e f
a b c
0 1 1 1
1 1 1 1
Option 2
Use a dictionary comprehension passed to the dataframe constructor
pd.DataFrame({(i, j): df[i] for i, j in zip(df, l)})
d e f
a b c
0 1 1 1
1 1 1 1
But if you insist on putting it in the dataframe proper... (keep in mind, this turns the dataframe into dtype object and we lose significant computational efficiency.)
Alternative 1
pd.DataFrame([l], columns=df.columns).append(df, ignore_index=True)
d e f
0 a b c
1 1 1 1
2 1 1 1
Alternative 2
pd.DataFrame([l] + df.values.tolist(), columns=df.columns)
d e f
0 a b c
1 1 1 1
2 1 1 1
Use pd.concat
In [1112]: df
Out[1112]:
d e f
0 0.517243 0.731847 0.259034
1 0.318821 0.551298 0.773115
2 0.194192 0.707525 0.804102
3 0.945842 0.614033 0.757389
In [1113]: pd.concat([pd.DataFrame([l], columns=df.columns), df], ignore_index=True)
Out[1113]:
d e f
0 a b c
1 0.517243 0.731847 0.259034
2 0.318821 0.551298 0.773115
3 0.194192 0.707525 0.804102
4 0.945842 0.614033 0.757389
Are you looking for append? I.e.
df = pd.DataFrame([[1,2,3]],columns=list('def'))
I = ['a','b','c']
ndf = df.append(pd.Series(I,index=df.columns.tolist()),ignore_index=True)
Output:
d e f
0 1 2 3
1 a b c
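Note that DataFrame.append was removed in pandas 2.0, so on current pandas the same append reads as a concat; a minimal equivalent sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=list('def'))
l = ['a', 'b', 'c']

# wrap the list in a one-row frame and concatenate it below the existing rows
ndf = pd.concat([df, pd.DataFrame([l], columns=df.columns)], ignore_index=True)
```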
If you want to add the list to the columns as a MultiIndex:
df.columns = [df.columns, l]
print (df)
d e f
a b c
0 4 7 1
1 5 8 3
2 4 9 5
3 5 4 7
4 5 2 1
5 4 3 0
print (df.columns)
MultiIndex(levels=[['d', 'e', 'f'], ['a', 'b', 'c']],
labels=[[0, 1, 2], [0, 1, 2]])
If you want to add the list at a specific position pos:
pos = 0
df1 = pd.DataFrame([l], columns=df.columns)
print (df1)
d e f
0 a b c
df = pd.concat([df.iloc[:pos], df1, df.iloc[pos:]], ignore_index=True)
print (df)
d e f
0 a b c
1 4 7 1
2 5 8 3
3 4 9 5
4 5 4 7
5 5 2 1
6 4 3 0
But if you append this list to a numeric dataframe, you get mixed types - numeric with strings - so some pandas functions may fail.
Setup:
df = pd.DataFrame({'d': [4,5,4,5,5,4],
                   'e': [7,8,9,4,2,3],
                   'f': [1,3,5,7,1,0]})
print (df)

Convert Two column data frame to occurrence matrix in pandas

Hi all, I have a csv file which contains data in the format below:
A a
A b
B f
B g
B e
B h
C d
C e
C f
The first column contains items; the second column contains the available features from the feature vector [a,b,c,d,e,f,g,h].
I want to convert this to an occurrence matrix that looks like below:
a,b,c,d,e,f,g,h
A 1,1,0,0,0,0,0,0
B 0,0,0,0,1,1,1,1
C 0,0,0,1,1,1,0,0
Can anyone tell me how to do this using pandas?
Here is another way to do it using pd.get_dummies().
import pandas as pd
# your data
# =======================
df
col1 col2
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
# processing
# ===================================
pd.get_dummies(df.col2).groupby(df.col1).apply(max)
a b d e f g h
col1
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
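To get every feature of the full vector a-h as a column (including c, which never occurs in the data), crosstab can be reindexed with a fill value; a sketch assuming the col1/col2 names used above:

```python
import pandas as pd

df = pd.DataFrame({'col1': list('AABBBBCCC'),
                   'col2': list('abfgehdef')})

features = list('abcdefgh')
# crosstab counts the (item, feature) occurrences; reindex inserts the
# features that never appeared as all-zero columns
out = pd.crosstab(df['col1'], df['col2']).reindex(columns=features, fill_value=0)
```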
It's unclear whether your data has a typo or not, but you can use crosstab for this:
In [95]:
pd.crosstab(index=df['A'], columns = df['a'])
Out[95]:
a b d e f g h
A
A 1 0 0 0 0 0
B 0 0 1 1 1 1
C 0 1 1 1 0 0
In your sample data, your second column has the value a as the name of that column, but in your expected output it appears in the column as a value.
EDIT
OK I fixed your input data so it generates the correct result:
In [98]:
import pandas as pd
import io
t="""A a
A b
B f
B g
B e
B h
C d
C e
C f"""
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None, names=['A','a'])
df
Out[98]:
A a
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
In [99]:
ct = pd.crosstab(index=df['A'], columns = df['a'])
ct
Out[99]:
a a b d e f g h
A
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
This approach yields the same result, as a scipy sparse COO matrix, much faster:
from scipy import sparse
df['col1'] = df['col1'].astype("category")
df['col2'] = df['col2'].astype("category")
df['ones'] = 1
user_items = sparse.coo_matrix((df.ones.astype(float),
                                (df.col1.cat.codes,
                                 df.col2.cat.codes)))
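For reference, the COO matrix can be checked against the dense occurrence matrix. A self-contained sketch: rows and columns follow the sorted category codes, so the absent feature c simply produces no column:

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({'col1': list('AABBBBCCC'),
                   'col2': list('abfgehdef')})
df['col1'] = df['col1'].astype('category')
df['col2'] = df['col2'].astype('category')

# one entry per (item, feature) pair; duplicate pairs would be summed
mat = sparse.coo_matrix((np.ones(len(df)),
                         (df.col1.cat.codes, df.col2.cat.codes)))

# label the dense view with the category names to recover the matrix shown above
dense = pd.DataFrame(mat.toarray(),
                     index=df['col1'].cat.categories,
                     columns=df['col2'].cat.categories)
```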
