PANDAS - Rename and combine like columns - python

I am trying to rename a column and combine that renamed column to others like it. The row indexes will not be the same (i.e. I am not combining 'City' and 'State' from two columns).
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
                   'Col_2': ['D', 'E', 'F'],
                   'Col_one': ['G', 'H', 'I']})
df.rename(columns={'Col_one': 'Col_1'}, inplace=True)
# Desired output:
({'Col_1': ['A', 'B', 'C', 'G', 'H', 'I'],
  'Col_2': ['D', 'E', 'F', '-', '-', '-']})
I've tried pd.concat and a few other things, but it fails to combine the columns in a way I'm expecting. Thank you!

This is a melt followed by a pivot, applied after you have renamed the column:
u = df.melt()
out = (u.assign(k=u.groupby("variable").cumcount())
        .pivot(index="k", columns="variable", values="value")
        .fillna('-'))
out = out.rename_axis(index=None, columns=None)
print(out)
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -

Using append without modifying the actual dataframe (DataFrame.append was removed in pandas 2.0, so this requires an older version):
result = (df[['Col_1', 'Col_2']]
          .append(df[['Col_one']]
                  .rename(columns={'Col_one': 'Col_1'}), ignore_index=True)
          .fillna('-'))
OUTPUT:
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -
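For readers on pandas 2.0 or later, where DataFrame.append no longer exists, the same idea can probably be written with pd.concat. This is only a sketch of an equivalent, not the answerer's own code; it assumes the original, un-renamed df from the question:

```python
import pandas as pd

# Original frame from the question (before any rename)
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
                   'Col_2': ['D', 'E', 'F'],
                   'Col_one': ['G', 'H', 'I']})

# Stack the renamed 'Col_one' values underneath 'Col_1';
# 'Col_2' has no values for the new rows, so they become NaN and are filled with '-'
result = pd.concat(
    [df[['Col_1', 'Col_2']],
     df[['Col_one']].rename(columns={'Col_one': 'Col_1'})],
    ignore_index=True,
).fillna('-')

print(result)
```

As with the append version, df itself is left unchanged.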

Might be a slightly longer method than the other answers, but the below delivers the required output.
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
                   'Col_2': ['D', 'E', 'F'],
                   'Col_one': ['G', 'H', 'I']})
# Keep the values we want to retain (a Series)
TempList = df['Col_one']
# Add those values to the existing dataframe as new rows under Col_1
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, pd.DataFrame({'Col_1': TempList})], ignore_index=True)
# Drop the redundant column
df.drop(columns=['Col_one'], inplace=True)
# Populate NaN with -
df.fillna('-', inplace=True)
Output is
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -

Using concat should work.
import pandas as pd
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
                   'Col_2': ['D', 'E', 'F'],
                   'Col_one': ['G', 'H', 'I']})
df2 = pd.DataFrame()
df2['Col_1'] = pd.concat([df['Col_1'], df['Col_one']], axis=0)
df2 = df2.reset_index(drop=True)
df2['Col_2'] = df['Col_2']
df2['Col_2'] = df2['Col_2'].fillna('-')
print(df2)
prints
  Col_1 Col_2
0     A     D
1     B     E
2     C     F
3     G     -
4     H     -
5     I     -

Related

Pandas - Creating new column based on dynamic conditions from lists

I have two lists to start with:
delta = ['1','5']
taxa = ['2','3','4']
My dataframe will look like :
data = { 'id': [101,102,103,104,105],
'1_srcA': ['a', 'b','c', 'd', 'g'] ,
'1_srcB': ['a', 'b','c', 'd', 'e'] ,
'2_srcA': ['g', 'b','f', 'd', 'e'] ,
'2_srcB': ['a', 'b','c', 'd', 'e'] ,
'3_srcA': ['a', 'b','c', 'd', 'e'] ,
'3_srcB': ['a', 'b','1', 'd', 'm'] ,
'4_srcA': ['a', 'b','c', 'd', 'e'] ,
'4_srcB': ['a', 'b','c', 'd', 'e'] ,
'5_srcA': ['a', 'b','c', 'd', 'e'] ,
'5_srcB': ['m', 'b','c', 'd', 'e'] }
df = pd.DataFrame(data)
df
I have to do two types of checks on this dataframe: Delta checks and Taxa checks.
For Delta checks, based on the list delta = ['1','5'], I have to compare 1_srcA vs 1_srcB and 5_srcA vs 5_srcB, since '1' appears in 1_srcA, 1_srcB and '5' appears in 5_srcA, 5_srcB. If the values differ, I have to populate 2. For Taxa checks (based on the values in the taxa list), it should be 1. If there is no difference, it is 0.
This comparison has to happen on all the rows. df is generated by merging two dataframes, so there will be only two columns which have '1' in their name, two columns which have '2', and so on.
Conditions I have to check:
1. Check whether the columns containing values from the delta list differ. If yes, populate 2.
2. Check whether the columns containing values from the taxa list differ. If yes, populate 1.
3. If condition 1 and condition 2 are both satisfied, populate 2.
4. If none of the conditions is satisfied, populate 0.
So, my output should look like:
The code I tried:
df_cols_ = df.columns.tolist()[1:]
conditions = []
res = {}
for i, col in enumerate(df_cols_):
    if (i == 0) or (i % 2 == 0):
        continue
    var = 'cond_' + str(i)
    for del_col in delta:
        if del_col in col:
            var = var + '_F'
            break
    print(var)
    cond = f"df.iloc[:, {i}] != df.iloc[:, {i+1}]"
    res[var] = cond
    conditions.append(cond)
The res dict will look like the below. But how can I use the conditions to populate the result?
Is there a better way to derive the resulting dataframe? Thanks.
Create a helper function that selects the matching columns with DataFrame.filter and compares them for inequality, then use np.logical_or.reduce to collapse each list of boolean masks into one mask, and pass the masks to numpy.select:
import numpy as np

delta = ['1','5']
taxa = ['2','3','4']

def f(x):
    # select the two columns whose names contain x and flag rows where they differ
    df1 = df.filter(like=x)
    return df1.iloc[:, 0].ne(df1.iloc[:, 1])

d = np.logical_or.reduce([f(x) for x in delta])
print (d)
[ True False False False  True]
t = np.logical_or.reduce([f(x) for x in taxa])
print (t)
[ True False  True False  True]
# d is listed first, so 2 wins when both conditions hold
df['res'] = np.select([d, t], [2, 1], default=0)
print (df)
    id 1_srcA 1_srcB 2_srcA 2_srcB 3_srcA 3_srcB 4_srcA 4_srcB 5_srcA 5_srcB  \
0  101      a      a      g      a      a      a      a      a      a      m
1  102      b      b      b      b      b      b      b      b      b      b
2  103      c      c      f      c      c      1      c      c      c      c
3  104      d      d      d      d      d      d      d      d      d      d
4  105      g      e      e      e      e      m      e      e      e      e

   res
0    2
1    0
2    1
3    0
4    2
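One caveat with filter(like=x): it matches substrings, so a key such as '1' would also pick up a column like '11_srcA' if one existed. Below is a hedged sketch that anchors the match with a regex instead, assuming column names always follow the <key>_src... pattern. The smaller frame, the '11' key and the helper name differs are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a two-digit key to show the substring pitfall
df = pd.DataFrame({
    'id': [101, 102, 103],
    '1_srcA': ['a', 'b', 'c'], '1_srcB': ['a', 'x', 'c'],
    '2_srcA': ['a', 'b', 'c'], '2_srcB': ['a', 'b', 'z'],
    '11_srcA': ['p', 'q', 'r'], '11_srcB': ['p', 'q', 'r'],
})
delta = ['1']
taxa = ['2', '11']

def differs(prefix):
    # '^1_' matches '1_srcA' and '1_srcB' but not '11_srcA'
    pair = df.filter(regex=f'^{prefix}_')
    return pair.iloc[:, 0].ne(pair.iloc[:, 1])

d = np.logical_or.reduce([differs(x) for x in delta])
t = np.logical_or.reduce([differs(x) for x in taxa])

# Delta mismatches take priority over taxa mismatches
df['res'] = np.select([d, t], [2, 1], default=0)
print(df[['id', 'res']])
```

With the data from the question the two variants should behave the same, since every key there is a single digit.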

How can I concatenate a dataframe and a series?

Code:
df_columns = ['A', 'B', 'C', 'D']
df_series = pd.Series([1,2,3,'N/A'],index = df_columns)
df = pd.DataFrame(df_series)
df
When I run the code above I receive the following output:
A 1
B 2
C 3
D 'N/A'
How can I write the code so that the output looks like this, with the values of df_columns as the column headers?
A B C D
1 2 3 'N/A'
So this would work; note the double brackets when loading in the data, which designate a single row.
import pandas as pd
df_columns = ['A', 'B', 'C', 'D']
df = pd.DataFrame([[1,2,3,'N/A']],columns= df_columns)
print(df)
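If the Series already exists and you would rather not rebuild the data by hand, converting it to a one-column frame and transposing it should give the same layout. A minimal sketch using the names from the question:

```python
import pandas as pd

df_columns = ['A', 'B', 'C', 'D']
df_series = pd.Series([1, 2, 3, 'N/A'], index=df_columns)

# to_frame() gives a single column indexed by A..D; .T turns it into a single row
df = df_series.to_frame().T
print(df)
```

The resulting row is labelled 0, and because the Series mixes numbers and a string, the columns will have object dtype.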

One-hot encode pandas dataframe using dictionary with column name and values

I currently have the below code to one-hot encode a pandas dataframe using a dictionary where keys are feature names and values are list of values for the feature.
def dummy_encode_dataframe(self, df, dummy_values_dict):
    for (feature, dummy_values) in sorted(dummy_values_dict.items()):
        for dummy_value in sorted(dummy_values):
            dummy_name = u'%s_%s' % (feature, dummy_value)
            df[dummy_name] = (df[feature] == dummy_value).astype(float)
        del df[feature]
    return df
The dummy_values_dict has the structure:
feature name (key)    list of possible values (strings)
------------------    ---------------------------------
F1                    ['A', 'B', 'C', 'MISSING']
F2                    ['D', 'E', 'F', 'MISSING']
F3                    ['G', 'H', 'I']
with sample input/output:
df (one row):
====
F1     F2     F3
---    ---    ---
'A'    'Q'    'H'
expected output:
df_output:
====
F1_A  F1_B  F1_C  F1_MISSING  F2_D  F2_E  F2_F  F2_MISSING  F3_G  F3_H  F3_I
----  ----  ----  ----------  ----  ----  ----  ----------  ----  ----  ----
1     0     0     0           0     0     0     0           0     1     0
The problem is that the for-loops take too long to run. Is there any way to optimize this?
UPDATE 1: From the comment about using OneHotEncoder in scikit-learn...
Can you elaborate on this piece of code to get the desired output?
import pandas as pd
df = pd.DataFrame(columns=['F1', 'F2', 'F3'])
df.loc[0] = ['A', 'Q', 'H']
dummy_values_dict = { 'F1': ['A', 'B', 'C', 'MISSING'], 'F2': ['D', 'E', 'F', 'MISSING'], 'F3': ['G', 'H', 'I'] }
# import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
categorical_cols = sorted(dummy_values_dict.keys())
# instantiate OneHotEncoder
# todo: encoding...
Maybe the question was not well phrased. I managed to find a more optimized implementation (there are probably better ones) using the code below:
import pandas as pd
import numpy as np

def dummy_encode_dataframe_optimized(df, dummy_values_dict):
    # Output column names such as 'F1_value_A', sorted by feature, then by value
    column_headers = np.array(
        [k + '_value_' + s
         for k in sorted(dummy_values_dict.keys())
         for s in sorted(dummy_values_dict[k])])
    # Names of the columns that should be set to 1 (uses only the first row of df)
    feature_values = [str(feature) + '_value_' + str(df[feature][0])
                      for feature in dummy_values_dict.keys()]
    one_hot_encode_vector = np.vectorize(
        lambda x: float(1) if x in feature_values else float(0))(column_headers)
    # Keep the columns that are not encoded (df.ix no longer exists in pandas)
    untouched_df = df.drop(columns=list(dummy_values_dict.keys()))
    hot_encoded_df = pd.concat(
        [
            untouched_df,
            pd.DataFrame(
                [one_hot_encode_vector],
                index=untouched_df.index,
                columns=column_headers
            )
        ], axis=1
    )
    return hot_encoded_df

df = pd.DataFrame(columns=['F1', 'F2', 'F3'])
df.loc[0] = ['A', 'Q', 'H']
dummy_values_dict = { 'F1': ['A', 'B', 'C', 'MISSING'], 'F2': ['D', 'E', 'F', 'MISSING'], 'F3': ['G', 'H', 'I'] }
result = dummy_encode_dataframe_optimized(df, dummy_values_dict)
pd.get_dummies should work in your case, but first we need to set every value that is not in the dictionary to NaN:
import numpy as np
import pandas as pd
df = pd.DataFrame({'F1': ['A', 'B', 'C', 'MISSING'],
                   'F2': ['Q', 'E', 'F', 'MISSING'],
                   'F3': ['G', 'H', 'I', 5]})
#         F1       F2 F3
# 0        A        Q  G
# 1        B        E  H
# 2        C        F  I
# 3  MISSING  MISSING  5
dummy_values_dict = {'F1': ['A', 'B', 'C', 'MISSING'],
                     'F2': ['D', 'E', 'F', 'MISSING'],
                     'F3': ['G', 'H', 'I']}
We can set all the other values to np.nan:
for col in df.columns:
    df.loc[~df[col].isin(dummy_values_dict[col]), col] = np.nan
print(df)
#         F1       F2   F3
# 0        A      NaN    G
# 1        B        E    H
# 2        C        F    I
# 3  MISSING  MISSING  NaN
Then we can use pd.get_dummies to do the job (dtype=int keeps the 0/1 output shown here, since recent pandas versions return booleans by default):
print(pd.get_dummies(df, dtype=int))
#    F1_A  F1_B  F1_C  F1_MISSING  F2_E  F2_F  F2_MISSING  F3_G  F3_H  F3_I
# 0     1     0     0           0     0     0           0     1     0     0
# 1     0     1     0           0     1     0           0     0     1     0
# 2     0     0     1           0     0     1           0     0     0     1
# 3     0     0     0           1     0     0           1     0     0     0
Be aware that if a value never occurs (for example 'D' in column 'F2'), there won't be an 'F2_D' column, but that can be fixed quite easily if you do need the column.
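One way to get the missing columns back, sketched here as a suggestion rather than as part of the answer above, is to turn each feature into a Categorical whose categories are the dictionary values. Values outside the categories (like 'Q') become NaN, and get_dummies emits a column for every category, including ones that never occur. The variable name encoded is just for illustration:

```python
import pandas as pd

# One-row input from the question
df = pd.DataFrame({'F1': ['A'], 'F2': ['Q'], 'F3': ['H']})
dummy_values_dict = {'F1': ['A', 'B', 'C', 'MISSING'],
                     'F2': ['D', 'E', 'F', 'MISSING'],
                     'F3': ['G', 'H', 'I']}

encoded = df.copy()
for col, values in dummy_values_dict.items():
    # Fix the full set of allowed values; anything else turns into NaN
    encoded[col] = pd.Categorical(encoded[col], categories=values)

# dtype=float mirrors the .astype(float) in the original loop
result = pd.get_dummies(encoded, dtype=float)
print(result)
```

The column order follows the order of the dictionary and of each value list, so sort those first if you need the sorted layout from the question.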

Sort or groupby dataframe in python using given string

I have the following dataframe
   Id          Direction  Load  Unit
1  CN05059815  LoadFWD    0,0   NaN
2  CN05059815  LoadBWD    0,0   NaN
4  ....
....
and the given list.
list =['CN05059830','CN05059946','CN05060010','CN05060064' ...]
I would like to sort or group the data according to the elements of the list.
For example,
the new data should follow exactly the same order as the list. The data would start with CN05059815, which doesn't belong to the list, followed by CN05059830, CN05059946, ..., which do belong to the list, with the remaining data kept alongside.
One way is to use Categorical Data. Here's a minimal example:
# sample dataframe
df = pd.DataFrame({'col': ['A', 'B', 'C', 'D', 'E', 'F']})
# required ordering
lst = ['D', 'E', 'A', 'B']
# convert to categorical
df['col'] = df['col'].astype('category')
# set order, placing values not in lst at the front (in their original order;
# a set difference would work too, but its ordering is not guaranteed)
order = [x for x in df['col'].unique() if x not in lst] + lst
# attach ordering information to categorical series
df['col'] = df['col'].cat.reorder_categories(order)
# apply ordering
df = df.sort_values('col')
print(df)
  col
2   C
5   F
3   D
4   E
0   A
1   B
Consider the approach and example below:
df = pd.DataFrame({
    'col': ['a', 'b', 'c', 'd', 'e']
})
list_ = ['d', 'b', 'a']
print(df)
Output:
  col
0   a
1   b
2   c
3   d
4   e
Then, in order to sort the df according to the list and its ordering:
df.reindex(df['col']
           .apply(lambda x: list_.index(x) if x in list_ else -1)
           .sort_values().index)
Output:
  col
2   c
4   e
3   d
1   b
0   a
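On pandas 1.1 or later, the same ordering can probably be expressed more directly with the key argument of sort_values. A rough sketch, where rank is a helper dictionary introduced only for this example and a stable sort keeps unlisted values in their original relative order:

```python
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'c', 'd', 'e']})
list_ = ['d', 'b', 'a']

# Position of each listed value; values not in the list get -1 and sort first
rank = {value: position for position, value in enumerate(list_)}

out = df.sort_values(
    'col',
    key=lambda s: s.map(rank).fillna(-1),
    kind='mergesort',  # stable, so 'c' stays ahead of 'e'
)
print(out)
```

A dictionary lookup also avoids calling list_.index for every row, which may matter for long lists.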

Python Pandas lookup and replace df1 value from df2

I have two dataframes
df column FOUR matches with df2 column LOOKUP COL
I need to match df column FOUR with df2 column LOOKUP COL and replace df column FOUR with the corresponding values from df2 column RETURN THIS
The resulting dataframe could overwrite df but I have it listed as result below.
NOTE: THE INDEX DOES NOT MATCH ON EACH OF THE DATAFRAMES
df = pd.DataFrame([['a', 'b', 'c', 'd'],
['e', 'f', 'g', 'h'],
['j', 'k', 'l', 'm'],
['x', 'y', 'z', 'w']])
df.columns = ['ONE', 'TWO', 'THREE', 'FOUR']
ONE TWO THREE FOUR
0 a b c d
1 e f g h
2 j k l m
3 x y z w
df2 = pd.DataFrame([['a', 'b', 'd', '1'],
['e', 'f', 'h', '2'],
['j', 'k', 'm', '3'],
['x', 'y', 'w', '4']])
df2.columns = ['X1', 'Y2', 'LOOKUP COL', 'RETURN THIS']
X1 Y2 LOOKUP COL RETURN THIS
0 a b d 1
1 e f h 2
2 j k m 3
3 x y w 4
RESULTING DF
ONE TWO THREE FOUR
0 a b c 1
1 e f g 2
2 j k l 3
3 x y z 4
You can use Series.map. You'll need to create a dictionary or a Series to use in map. A Series makes more sense here but the index should be LOOKUP COL:
df['FOUR'] = df['FOUR'].map(df2.set_index('LOOKUP COL')['RETURN THIS'])
df
Out:
ONE TWO THREE FOUR
0 a b c 1
1 e f g 2
2 j k l 3
3 x y z 4
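One thing to watch with map: any value in FOUR that has no entry in LOOKUP COL comes back as NaN. If such rows should keep their original value, a fillna on the result is one option. A small sketch with made-up data (the value 'q' has no match on purpose), assuming the values in LOOKUP COL are unique:

```python
import pandas as pd

df = pd.DataFrame({'ONE': ['a', 'e'], 'FOUR': ['d', 'q']})
df2 = pd.DataFrame({'LOOKUP COL': ['d', 'h'], 'RETURN THIS': ['1', '2']})

# Series indexed by the lookup key, holding the replacement values
lookup = df2.set_index('LOOKUP COL')['RETURN THIS']

# Unmatched keys map to NaN, so fall back to the original value for those rows
df['FOUR'] = df['FOUR'].map(lookup).fillna(df['FOUR'])
print(df)
```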
df['FOUR'] = [df2.loc[df2['LOOKUP COL'] == i, 'RETURN THIS'].iloc[0] for i in df['FOUR']]
Something like this should be sufficient to do the trick, although there's probably a more pandas-native way to do it.
Basically, it is a list comprehension: we build a new list of df2['RETURN THIS'] values by looking up each value i of df['FOUR'] in the lookup column. Note that .iloc[0] takes the first match and raises an error if a value has no match.
