I have two dataframes, df and df2.
Column FOUR of df matches column LOOKUP COL of df2.
I need to match df column FOUR with df2 column LOOKUP COL and replace df column FOUR with the corresponding values from df2 column RETURN THIS.
The resulting dataframe could overwrite df, but I have it listed as the result below.
NOTE: THE INDEXES DO NOT MATCH BETWEEN THE TWO DATAFRAMES
df = pd.DataFrame([['a', 'b', 'c', 'd'],
['e', 'f', 'g', 'h'],
['j', 'k', 'l', 'm'],
['x', 'y', 'z', 'w']])
df.columns = ['ONE', 'TWO', 'THREE', 'FOUR']
ONE TWO THREE FOUR
0 a b c d
1 e f g h
2 j k l m
3 x y z w
df2 = pd.DataFrame([['a', 'b', 'd', '1'],
['e', 'f', 'h', '2'],
['j', 'k', 'm', '3'],
['x', 'y', 'w', '4']])
df2.columns = ['X1', 'Y2', 'LOOKUP COL', 'RETURN THIS']
X1 Y2 LOOKUP COL RETURN THIS
0 a b d 1
1 e f h 2
2 j k m 3
3 x y w 4
RESULTING DF
ONE TWO THREE FOUR
0 a b c 1
1 e f g 2
2 j k l 3
3 x y z 4
You can use Series.map. You'll need to create a dictionary or a Series to use in map. A Series makes more sense here, but its index should be LOOKUP COL:
df['FOUR'] = df['FOUR'].map(df2.set_index('LOOKUP COL')['RETURN THIS'])
df
Out:
ONE TWO THREE FOUR
0 a b c 1
1 e f g 2
2 j k l 3
3 x y z 4
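One caveat worth noting (a sketch, not part of the original answer): map leaves NaN for values of FOUR that have no match in LOOKUP COL. A fillna fallback keeps the original values instead; the 'q' row below is hypothetical data added to show this:

```python
import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd'],
                   ['e', 'f', 'g', 'h'],
                   ['j', 'k', 'l', 'q']],  # 'q' has no match in df2
                  columns=['ONE', 'TWO', 'THREE', 'FOUR'])
df2 = pd.DataFrame({'LOOKUP COL': ['d', 'h'], 'RETURN THIS': ['1', '2']})

lookup = df2.set_index('LOOKUP COL')['RETURN THIS']
# Unmatched keys become NaN; fall back to the original value
df['FOUR'] = df['FOUR'].map(lookup).fillna(df['FOUR'])
print(df['FOUR'].tolist())  # ['1', '2', 'q']
```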
df['FOUR'] = [df2.loc[df2['LOOKUP COL'] == i, 'RETURN THIS'].iloc[0] for i in df['FOUR']]
Something like this should also do the trick, although there is probably a more pandas-native way.
Basically, a list comprehension: we build a new list of df2['RETURN THIS'] values by using the lookup column as we iterate over each i in df['FOUR'].
Related
I have two lists to start with:
delta = ['1','5']
taxa = ['2','3','4']
My dataframe will look like :
data = { 'id': [101,102,103,104,105],
'1_srcA': ['a', 'b','c', 'd', 'g'] ,
'1_srcB': ['a', 'b','c', 'd', 'e'] ,
'2_srcA': ['g', 'b','f', 'd', 'e'] ,
'2_srcB': ['a', 'b','c', 'd', 'e'] ,
'3_srcA': ['a', 'b','c', 'd', 'e'] ,
'3_srcB': ['a', 'b','1', 'd', 'm'] ,
'4_srcA': ['a', 'b','c', 'd', 'e'] ,
'4_srcB': ['a', 'b','c', 'd', 'e'] ,
'5_srcA': ['a', 'b','c', 'd', 'e'] ,
'5_srcB': ['m', 'b','c', 'd', 'e'] }
df = pd.DataFrame(data)
df
I have to do two types of checks on this dataframe: say, Delta checks and Taxa checks.
For Delta checks, based on the list delta = ['1','5'], I have to compare 1_srcA vs 1_srcB and 5_srcA vs 5_srcB, since '1' occurs in 1_srcA/1_srcB and '5' occurs in 5_srcA/5_srcB. If the values differ, I have to populate 2. For Taxa checks (based on values from the taxa list), it should be 1. If there is no difference, it is 0.
This comparison has to happen on all the rows. df is generated from a merge of two dataframes, so there will be exactly two columns containing '1', two columns containing '2', and so on.
Conditions I have to check:
I need to check if the columns containing values from the delta list differ. If yes, I populate 2.
I need to check if the columns containing values from the taxa list differ. If yes, I populate 1.
If both condition 1 and condition 2 are satisfied, populate 2.
If neither condition is satisfied, populate 0.
So, my output should look like:
The code I tried:
df_cols_ = df.columns.tolist()[1:]
conditions = []
res = {}
for i, col in enumerate(df_cols_):
    if (i == 0) or (i % 2 == 0):
        continue
    var = 'cond_' + str(i)
    for del_col in delta:
        if del_col in col:
            var = var + '_F'
            break
    print(var)
    cond = f"df.iloc[:, {i}] != df.iloc[:, {i+1}]"
    res[var] = cond
    conditions.append(cond)
The res dict will look like the below. But how can I use these conditions to populate the result column? Is there an optimal way to derive the resulting dataframe? Thanks.
Create a helper function that filters paired columns by DataFrame.filter and compares them for inequality, then use np.logical_or.reduce to collapse the list of boolean masks into a single mask, and pass the masks to numpy.select:
delta = ['1','5']
taxa = ['2','3','4']
def f(x):
    df1 = df.filter(like=x)
    return df1.iloc[:, 0].ne(df1.iloc[:, 1])
d = np.logical_or.reduce([f(x) for x in delta])
print (d)
[ True False False False True]
t = np.logical_or.reduce([f(x) for x in taxa])
print (t)
[ True False True False True]
df['res'] = np.select([d, t], [2, 1], default=0)
print (df)
id 1_srcA 1_srcB 2_srcA 2_srcB 3_srcA 3_srcB 4_srcA 4_srcB 5_srcA 5_srcB \
0 101 a a g a a a a a a m
1 102 b b b b b b b b b b
2 103 c c f c c 1 c c c c
3 104 d d d d d d d d d d
4 105 g e e e e m e e e e
res
0 2
1 0
2 1
3 0
4 2
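A note on why this handles the "both conditions satisfied" case automatically: np.select evaluates the condition list in order and the first match wins, so listing d before t gives the delta check priority. A minimal sketch with hypothetical masks:

```python
import numpy as np

# Hypothetical masks: row 0 fails both checks, row 1 is taxa-only,
# row 2 satisfies both the delta and taxa checks
d = np.array([False, False, True])   # delta check
t = np.array([False, True, True])    # taxa check

# The first matching condition wins, so [d, t] gives delta priority
res = np.select([d, t], [2, 1], default=0)
print(res.tolist())  # [0, 1, 2]
```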
I have the following list:
index_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
I would like to use it to define an index for the following DataFrame:
df = pd.DataFrame(index= [0,1,4,7])
The DataFrames index corresponds to an entry in the list. The end result should look like:
df_target = pd.DataFrame(index= ['a','b','e','h'])
I have tried a number of built in functions for pandas, but no luck.
Use Index.map with a dictionary created by enumerate:
target_df = pd.DataFrame(index=df.index.map(dict(enumerate(index_list))))
print (target_df)
Empty DataFrame
Columns: []
Index: [a, b, e, h]
If you need to change the index of an existing DataFrame:
df = pd.DataFrame({'a':range(4)}, index= [0,1,4,7])
print (df)
a
0 0
1 1
4 2
7 3
df = df.rename(dict(enumerate(index_list)))
print (df)
a
a 0
b 1
e 2
h 3
like this:
print(pd.DataFrame(index=[index_list[i] for i in [0,1,4,7]]))
Output:
Empty DataFrame
Columns: []
Index: [a, b, e, h]
You can pass a Series object to Index.map method.
df.index.map(pd.Series(index_list))
# Index(['a', 'b', 'e', 'h'], dtype='object')
Or rename the DataFrame index directly:
df.rename(index=pd.Series(index_list))
# Empty DataFrame
# Columns: []
# Index: [a, b, e, h]
I am trying to rename a column and combine that renamed column with others like it. The row indexes will not be the same (i.e. I am not combining 'City' and 'State' from two columns).
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
'Col_2': ['D', 'E', 'F'],
'Col_one':['G', 'H', 'I'],})
df.rename(columns={'Col_one' : 'Col_1'}, inplace=True)
# Desired output:
({'Col_1': ['A', 'B', 'C', 'G', 'H', 'I'],
'Col_2': ['D', 'E', 'F', '-', '-', '-'],})
I've tried pd.concat and a few other things, but it fails to combine the columns in a way I'm expecting. Thank you!
This is melt plus pivot after you have renamed:
u = df.melt()
out = (u.assign(k=u.groupby("variable").cumcount())
        .pivot(index="k", columns="variable", values="value").fillna('-'))
out = out.rename_axis(index=None,columns=None)
print(out)
Col_1 Col_2
0 A D
1 B E
2 C F
3 G -
4 H -
5 I -
Using concat without modifying the actual dataframe (DataFrame.append was removed in pandas 2.0):
result = pd.concat(
    [df[['Col_1', 'Col_2']],
     df[['Col_one']].rename(columns={'Col_one': 'Col_1'})],
    ignore_index=True).fillna('-')
OUTPUT:
Col_1 Col_2
0 A D
1 B E
2 C F
3 G -
4 H -
5 I -
This might be a slightly longer method than the other answers, but it delivers the required output.
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
'Col_2': ['D', 'E', 'F'],
'Col_one':['G', 'H', 'I'],})
# Create a Series of the values we want to retain
TempList = df['Col_one']
# Append the values to the existing dataframe (DataFrame.append was
# removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, pd.DataFrame({'Col_1': TempList})], ignore_index=True)
# Drop the redundant column
df.drop(columns=['Col_one'], inplace=True)
# Populate NaN with -
df.fillna('-', inplace=True)
Output is
Col_1 Col_2
0 A D
1 B E
2 C F
3 G -
4 H -
5 I -
Using concat should work.
import pandas as pd
df = pd.DataFrame({'Col_1': ['A', 'B', 'C'],
'Col_2': ['D', 'E', 'F'],
'Col_one':['G', 'H', 'I'],})
df2 = pd.DataFrame()
df2['Col_1'] = pd.concat([df['Col_1'], df['Col_one']], axis = 0)
df2 = df2.reset_index()
df2 = df2.drop('index', axis =1)
df2['Col_2'] = df['Col_2']
df2['Col_2'] = df2['Col_2'].fillna('-')
print(df2)
prints
Col_1 Col_2
0 A D
1 B E
2 C F
3 G -
4 H -
5 I -
I am starting out with pandas categorical dataframes.
Let's say I have (1):
A B C
-------------
3 Z M
O X T
4 A B
I filtered the dataframe like this: df[df['B'] != "X"]
So I would get as result (2):
A B C
-------------
3 Z M
4 A B
In (1) df['B'].cat.categories #would equal to ['Z', 'X', 'A']
In (2) df['B'].cat.categories #still equal to ['Z', 'X', 'A']
How can I update the categories of all columns of the DF after this kind of filtering operation?
BONUS: If you want to clean up the indexes after the filtering:
df.reset_index(drop=True)
Call remove_unused_categories on the columns after filtering.
As piRSquared points out you can do this succinctly given every column is a categorical dtype:
df = df.query('B != "X"').apply(lambda s: s.cat.remove_unused_categories())
This loops over the columns after filtering.
print(df)
# A B C
#0 3 Z M
#1 O X T
#2 4 A B
df['B'].cat.categories
#Index(['A', 'X', 'Z'], dtype='object')
df = df[ df['B'] != 'X']
# Update all category columns
for col in df.dtypes.loc[lambda x: x == 'category'].index:
df[col] = df[col].cat.remove_unused_categories()
df['B'].cat.categories
#Index(['A', 'Z'], dtype='object')
df['C'].cat.categories
#Index(['B', 'M'], dtype='object')
Pandas stores the categories separately and doesn't remove them when they are no longer used. If you want to do that, use this method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.remove_unused_categories.html#pandas.Series.cat.remove_unused_categories
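A minimal sketch of what remove_unused_categories does (hypothetical data, not from the question):

```python
import pandas as pd

s = pd.Series(['Z', 'X', 'A'], dtype='category')
s = s[s != 'X']                       # 'X' is gone from the values...
print(list(s.cat.categories))         # ['A', 'X', 'Z'] ...but not the categories

s = s.cat.remove_unused_categories()  # drop categories with no values left
print(list(s.cat.categories))         # ['A', 'Z']
```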
I have given dataframe
Id Direction Load Unit
1 CN05059815 LoadFWD 0,0 NaN
2 CN05059815 LoadBWD 0,0 NaN
4 ....
....
and the given list.
list =['CN05059830','CN05059946','CN05060010','CN05060064' ...]
I would like to sort or group the data by the given elements of the list.
For example, the new data should have exactly the same ordering as the list. The first column would start with CN05059815, which doesn't belong to the list; then come CN05059830, CN05059946, ..., which do belong to the list, while keeping the rest of the data.
One way is to use Categorical Data. Here's a minimal example:
# sample dataframe
df = pd.DataFrame({'col': ['A', 'B', 'C', 'D', 'E', 'F']})
# required ordering
lst = ['D', 'E', 'A', 'B']
# convert to categorical
df['col'] = df['col'].astype('category')
# set order, adding values not in lst to the front
order = list(set(df['col']) - set(lst)) + lst
# attach ordering information to categorical series
df['col'] = df['col'].cat.reorder_categories(order)
# apply ordering
df = df.sort_values('col')
print(df)
col
2 C
5 F
3 D
4 E
0 A
1 B
Consider the approach and example below:
df = pd.DataFrame({
'col': ['a', 'b', 'c', 'd', 'e']
})
list_ = ['d', 'b', 'a']
print(df)
Output:
col
0 a
1 b
2 c
3 d
4 e
Then, in order to sort the df by the list and its ordering:
df.reindex(df['col']
           .apply(lambda x: list_.index(x) if x in list_ else -1)
           .sort_values()
           .index)
Output:
col
2 c
4 e
3 d
1 b
0 a
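On pandas 1.1+, sort_values also accepts a key callable, which allows a shorter variant of the same ranking idea (a sketch, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'c', 'd', 'e']})
list_ = ['d', 'b', 'a']

# Rank each value by its position in list_; unknown values get -1,
# so they sort to the front, mirroring the reindex approach above
out = df.sort_values(
    'col',
    key=lambda s: s.map(lambda x: list_.index(x) if x in list_ else -1),
)
print(out['col'].tolist())  # ['c', 'e', 'd', 'b', 'a']
```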