Replace columns with multi-level columns based on a lookup DataFrame - python

How can I replace this single column heading:
foo bar
0 0 0
1 0 0
To get these multi-level columns:
A B
a b
0 0 0
1 0 0
Based on this mapping dataframe:
col1 col2 col3
0 foo a A
1 bar b B
2 baz c C
I am trying a list comprehension to create a new multi-level column index, but it doesn't seem to be working. I have a feeling there is a more pythonic way to achieve this nonetheless:
df1 = pd.DataFrame({'foo': [0, 0],
                    'bar': [0, 0]})
df2 = pd.DataFrame({'col1': ['foo', 'bar', 'baz'],
                    'col2': ['A', 'B', 'C'],
                    'col3': ['a', 'b', 'c']})
df1.columns = [(df2.loc[df2['col1']==i,'col2'], df2.loc[df2['col1']==i,'col3']) for i in df1.columns]

You can transform df2 to a Series of tuples and map it to the columns:
df1.columns = df1.columns.map(df2.set_index('col1').apply(tuple, axis=1))
output:
A B
a b
0 0 0
1 0 0
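For reference, a self-contained sketch of this approach; it wraps the mapped tuples in MultiIndex.from_tuples to guarantee a two-level header:

```python
import pandas as pd

df1 = pd.DataFrame({'foo': [0, 0], 'bar': [0, 0]})
df2 = pd.DataFrame({'col1': ['foo', 'bar', 'baz'],
                    'col2': ['A', 'B', 'C'],
                    'col3': ['a', 'b', 'c']})

# Map each existing column name to its (col2, col3) tuple from df2,
# then build a proper MultiIndex from those tuples.
mapping = df2.set_index('col1').apply(tuple, axis=1)
df1.columns = pd.MultiIndex.from_tuples(df1.columns.map(mapping))
print(df1)
```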


Filling rows based on column partitions & conditions in Python

My goal is to iterate through a list of possible B values, such that each ID (col A) will have new rows added with C = 0 where the possible B value did not previously exist in the DF.
I have a dataframe with:
A B C
0 id1 2 10
1 id1 3 20
2 id2 1 30
possible_B_values = [1, 2, 3]
Resulting in:
A B C
0 id1 1 0
1 id1 2 10
2 id1 3 20
3 id2 1 30
4 id2 2 0
5 id2 3 0
Thanks in advance!
Using some index trickery:
import pandas as pd
df = pd.read_clipboard() # Your df here
possible_B_values = [1, 2, 3]
extrapolate_columns = ["A", "B"]
index = pd.MultiIndex.from_product(
    [df["A"].unique(), possible_B_values],
    names=extrapolate_columns
)
out = df.set_index(extrapolate_columns).reindex(index, fill_value=0).reset_index()
out:
A B C
0 id1 1 0
1 id1 2 10
2 id1 3 20
3 id2 1 30
4 id2 2 0
5 id2 3 0
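The same steps, runnable without read_clipboard (constructing the example frame inline):

```python
import pandas as pd

df = pd.DataFrame({'A': ['id1', 'id1', 'id2'],
                   'B': [2, 3, 1],
                   'C': [10, 20, 30]})
possible_B_values = [1, 2, 3]
extrapolate_columns = ['A', 'B']

# Cartesian product of every ID with every possible B value
index = pd.MultiIndex.from_product(
    [df['A'].unique(), possible_B_values],
    names=extrapolate_columns,
)
# Reindex onto the full grid; missing (A, B) pairs get C = 0
out = df.set_index(extrapolate_columns).reindex(index, fill_value=0).reset_index()
print(out)
```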
Maybe you can create a dataframe from a list of tuples with the possible B values and merge it with the original one:
import pandas as pd
# Create a list of (A, B) tuples covering every ID / possible-B combination.
# Note: the helper frame must not carry its own C column, otherwise the
# merge produces C_x / C_y instead of a single C.
possible_b_values = [1, 2, 3]
possible_b_rows = [(id_, b) for id_ in df['A'].unique() for b in possible_b_values]
# Create a new DataFrame from the list of tuples
possible_b_df = pd.DataFrame(possible_b_rows, columns=['A', 'B'])
# Merge the new DataFrame with the original one, using the 'A' and 'B'
# columns as the keys; rows that only exist in possible_b_df get NaN in 'C'
df = df.merge(possible_b_df, on=['A', 'B'], how='outer')
# Fill any null values in the 'C' column with 0
df['C'] = df['C'].fillna(0)
print(df)
Here is a one-liner pure pandas way of solving this:
You set the index as B (this will help in re-indexing later).
Group by column A and then, for column C, apply a function that reindexes B.
The lambda function x.reindex(range(1, 4), fill_value=0) takes each group x (one per id), reindexes it over range(1, 4) = 1, 2, 3, and fills the NaN values with 0.
Finally, you reset_index to bring A and B back into the dataframe.
out = (df.set_index('B')                                         # set index as B
         .groupby('A')['C']                                      # group by A and use apply on column C
         .apply(lambda x: x.reindex(range(1, 4), fill_value=0))  # reindex B to range(1, 4) per group, fill 0
         .reset_index())                                         # reset index
print(out)
A B C
0 id1 1 0
1 id1 2 10
2 id1 3 20
3 id2 1 30
4 id2 2 0
5 id2 3 0

Remove duplicates in a row pandas

I have a df
Name Symbol Dummy
A (BO),(BO),(AD),(TR) 2
B (TV),(TV),(TV) 2
C (HY) 2
D (UI) 2
I need df as
Name Symbol Dummy
A (BO),(AD),(TR) 2
B (TV) 2
C (HY) 2
D (UI) 2
I tried drop_duplicates, but it is not working as expected.
Split the strings around the delimiter ,, then dedupe using dict.fromkeys, which also preserves the order of the strings; finally join around the delimiter ,:
df['Symbol'] = df['Symbol'].str.split(',').map(dict.fromkeys).str.join(',')
Name Symbol Dummy
0 A (BO),(AD),(TR) 2
1 B (TV) 2
2 C (HY) 2
3 D (UI) 2
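The trick relies on dict.fromkeys keeping only the first occurrence of each key, in insertion order (guaranteed since Python 3.7); a quick illustration:

```python
# dict.fromkeys drops repeats while preserving first-seen order
items = '(BO),(BO),(AD),(TR)'.split(',')
deduped = ','.join(dict.fromkeys(items))
print(deduped)  # (BO),(AD),(TR)
```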
Another method
#original DF
index  col1                 col2
0      (BO),(BO),(AD),(TR)  2
df.col1 = df.col1.str.split(',').apply(lambda x: sorted(set(x), key=x.index)).str.join(',')
df
#output
index  col1            col2
0      (BO),(AD),(TR)  2
If the order of the values is not important, you can simply do:
df.col1 = df.col1.str.split(',').apply(lambda x: set(x)).str.join(',')
df
#output
index  col1            col2
0      (AD),(BO),(TR)  2

Match columns pandas Dataframe

I want to match two pandas Dataframes by the name of their columns.
import pandas as pd
df1 = pd.DataFrame([[0,2,1],[1,3,0],[0,4,0]], columns=['A', 'B', 'C'])
A B C
0 0 2 1
1 1 3 0
2 0 4 0
df2 = pd.DataFrame([[0,0,1],[1,5,0],[0,7,0]], columns=['A', 'B', 'D'])
A B D
0 0 0 1
1 1 5 0
2 0 7 0
If the names match, do nothing. (Keep the column of df2)
If a column is in Dataframe 1 but not in Dataframe 2, add the column in Dataframe 2 as a vector of zeros.
If a column is in Dataframe 2 but not in Dataframe 1, drop it.
The output should look like this:
A B C
0 0 0 0
1 1 5 0
2 0 7 0
I know if I do:
df2 = df2[df1.columns]
I get:
KeyError: "['C'] not in index"
I could also add the vectors of zeros manually, but of course this is a toy example of a much longer dataset. Is there any smarter/pythonic way of doing this?
It appears that df2's columns should match df1's after this operation: columns that are in df1 but not df2 are added, while columns only in df2 are removed. We can simply reindex df2 to df1's columns with fill_value=0 (the safe equivalent of df2 = df2[df1.columns] that also adds the new columns with a fill value):
df2 = df2.reindex(columns=df1.columns, fill_value=0)
df2:
A B C
0 0 0 0
1 1 5 0
2 0 7 0

Create dummy variable of multiple columns with python

I am working with a dataframe containing two columns with ID numbers. For further research I want to make dummy variables of these ID numbers, combining both columns. My current code, however, keeps the dummies from the two columns separate. How can I combine the columns and create the dummy variables?
Dataframe
import pandas as pd
import numpy as np
d = {'ID1': [1,2,3], 'ID2': [2,3,4]}
df = pd.DataFrame(data=d)
Current code
pd.get_dummies(df, prefix = ['ID1', 'ID2'], columns=['ID1', 'ID2'])
Desired output
p = {'1': [1,0,0], '2': [1,1,0], '3': [0,1,1], '4': [0,0,1]}
df2 = pd.DataFrame(data=p)
df2
If you need indicators in the output use max; if you need counted values use sum. Both come after get_dummies with different parameters and after casting the values to strings:
df = pd.get_dummies(df.astype(str), prefix='', prefix_sep='').max(level=0, axis=1)
#count alternative
#df = pd.get_dummies(df.astype(str), prefix='', prefix_sep='').sum(level=0, axis=1)
print (df)
1 2 3 4
0 1 1 0 0
1 0 1 1 0
2 0 0 1 1
There are different ways of skinning a cat; here's how I'd do it, using an additional groupby:
# pd.get_dummies(df.astype(str)).groupby(lambda x: x.split('_')[1], axis=1).sum()
pd.get_dummies(df.astype(str)).groupby(lambda x: x.split('_')[1], axis=1).max()
1 2 3 4
0 1 1 0 0
1 0 1 1 0
2 0 0 1 1
Another option is stacking, if you like conciseness:
# pd.get_dummies(df.stack()).sum(level=0)
pd.get_dummies(df.stack()).max(level=0)
1 2 3 4
0 1 1 0 0
1 0 1 1 0
2 0 0 1 1
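Note that Series.max(level=...) and DataFrame.max(level=..., axis=1) were removed in pandas 2.0, so on recent versions the stacking variant can be sketched with an explicit groupby instead (get_dummies may return booleans there, hence the astype(int)):

```python
import pandas as pd

df = pd.DataFrame({'ID1': [1, 2, 3], 'ID2': [2, 3, 4]})

# Stack both ID columns into one Series, dummy-encode it, then collapse
# back to one row per original row by grouping on the first index level.
out = pd.get_dummies(df.stack()).groupby(level=0).max().astype(int)
print(out)
```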

Pandas: set the value of a column in a row to be the value stored in a different df at the index of its other rows

>>> df
0 1
0 0 0
1 1 1
2 2 1
>>> df1
0 1 2
0 A B C
1 D E F
>>> crazy_magic()
>>> df
0 1 3
0 0 0 A #df1[0][0]
1 1 1 E #df1[1][1]
2 2 1 F #df1[2][1]
Is there a way to achieve this without a for loop?
import pandas as pd
df = pd.DataFrame([[0,0],[1,1],[2,1]])
df1 = pd.DataFrame([['A', 'B', 'C'],['D', 'E', 'F']])
df2 = df1.reset_index(drop=False)
# index 0 1 2
# 0 0 A B C
# 1 1 D E F
df3 = pd.melt(df2, id_vars=['index'])
# index variable value
# 0 0 0 A
# 1 1 0 D
# 2 0 1 B
# 3 1 1 E
# 4 0 2 C
# 5 1 2 F
result = pd.merge(df, df3, left_on=[0,1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
0 1 value
0 0 0 A
1 1 1 E
2 2 1 F
My reasoning goes as follows:
We want to use two columns of df as coordinates. The word "coordinates" reminds me of pivot, since if you have two columns whose values represent "coordinates" and a third column representing values, and you want to convert that to a grid, then pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact, df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1. pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude that df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.
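An alternative worth noting (not part of the answer above): since the two columns of df are literally positions into df1, NumPy fancy indexing can do the lookup directly; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']])

# For each row, pick df1[row, col] where the row position comes from
# df's column 1 and the column position from df's column 0.
df['value'] = df1.to_numpy()[df[1], df[0]]
print(df)
```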
