Replace value of one column based on condition - python

I have two data frames named df and df_reference, which contain the following information:
df              df_reference
col1  col2      col1  col2
A     10        A     15
B     25        B     33
C     30        C     20
A     12
I want to compare the two data frames based on col1.
I want to replace the value of df.col2 with df_reference.col2 whenever the value in df_reference is greater than the value in df.col2.
The expected output is:
df
col1 col2
A 15
B 33
C 30
A 15
I have tried:
dict1 = {'a':'15'}
df.loc[df['col1'].isin(dict1.keys()), 'col2'] = sams['col1'].map(dict1)

Use Series.map with a Series created by DataFrame.set_index to look up the reference value for each row; rows with no match get NaN, which Series.fillna replaces with the original value:
s = df['col1'].map(df_reference.set_index('col1')['col2']).fillna(df['col2'])
df.loc[s > df['col2'], 'col2'] = s
print (df)
col1 col2
0 A 15
1 B 33
2 C 30
3 A 15
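For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the two frames from the question and applies the same lookup. The names lookup and candidate are mine, and np.maximum is swapped in for the boolean mask purely as an alternative way to keep the larger value; it is not what the answer above uses.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"col1": ["A", "B", "C", "A"], "col2": [10, 25, 30, 12]})
df_reference = pd.DataFrame({"col1": ["A", "B", "C"], "col2": [15, 33, 20]})

# Reference value for each row of df; keys missing from df_reference
# fall back to the row's own value, so they are left unchanged
lookup = df_reference.set_index("col1")["col2"]
candidate = df["col1"].map(lookup).fillna(df["col2"])

# Element-wise maximum keeps whichever of the two values is larger
df["col2"] = np.maximum(df["col2"], candidate)
print(df)
```

This assumes col1 is unique in df_reference; with duplicate keys there, map raises an error because the lookup index is no longer unique.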

I suggest first merging on 'col1', then applying a function that builds a new column holding the greater of the two 'col2' values, and finally dropping the columns you no longer need.
def greaterValue(row):
    # After the merge, col2_x comes from df and col2_y from df_reference
    if row['col2_x'] > row['col2_y']:
        return row['col2_x']
    else:
        return row['col2_y']
df = df.merge(df_reference, left_on='col1', right_on='col1')
df['col2'] = df.apply(greaterValue, axis=1)
# Keep only the original two columns
df = df.loc[:, ['col1', 'col2']]
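One thing to watch with the merge approach: merge defaults to an inner join, so any row of df whose col1 is absent from df_reference is dropped. A possible variant, sketched below, uses how='left' and a row-wise max instead of apply. The extra D row and the _ref suffix are my own additions for illustration and are not part of the question.

```python
import pandas as pd

# Data from the question, plus a hypothetical "D" row with no reference value
df = pd.DataFrame({"col1": ["A", "B", "C", "A", "D"], "col2": [10, 25, 30, 12, 7]})
df_reference = pd.DataFrame({"col1": ["A", "B", "C"], "col2": [15, 33, 20]})

# A left join keeps every row of df; unmatched rows get NaN in col2_ref
merged = df.merge(df_reference, on="col1", how="left", suffixes=("", "_ref"))

# max(axis=1) skips NaN by default, so unmatched rows keep their own value
merged["col2"] = merged[["col2", "col2_ref"]].max(axis=1).astype(int)
result = merged[["col1", "col2"]]
print(result)
```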

Related

Creating a New Column in a Pandas Dataframe in a more pythonic way

I am trying to find a better, more pythonic way of accomplishing the following:
I want to add a new column to business_df called 'dot_prod', which is the dot product of a fixed vector (fixed_vector) and a vector from another data frame (rating_df). The rows of both business_df and rating_df have the same index values (business_id).
I have this loop which appears to work, however I know it's super clumsy (and takes forever). Essentially it loops through once for every row, calculates the dot product, then dumps it into the business_df dataframe.
n=0
for i in range(business_df.shape[0]):
    dot_prod = np.dot(fixed_vector, rating_df.iloc[n])
    business_df['dot_prod'][n] = dot_prod
    n+=1
IIUC, you are looking for apply across axis=1 like:
business_df['dot_prod'] = rating_df.apply(lambda x: np.dot(fixed_vector, x), axis=1)
>>> fixed_vector = [1, 2, 3]
>>> df = pd.DataFrame({'col1' : [1,2], 'col2' : [3,4], 'col3' : [5,6]})
>>> df
col1 col2 col3
0 1 3 5
1 2 4 6
>>> df['col4'] = np.dot(fixed_vector, [df['col1'], df['col2'], df['col3']])
>>> df
col1 col2 col3 col4
0 1 3 5 22
1 2 4 6 28
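If every column of rating_df takes part in the product, DataFrame.dot may be the shortest route, since it returns a Series that keeps the row index and so lines up with business_df on assignment. The sketch below reuses the toy numbers from the example above; the index labels b1 and b2 are made up to stand in for business_id.

```python
import numpy as np
import pandas as pd

fixed_vector = np.array([1, 2, 3])

# Toy ratings; "b1" and "b2" are hypothetical business_id values
rating_df = pd.DataFrame(
    {"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]},
    index=["b1", "b2"],
)
business_df = pd.DataFrame(index=rating_df.index)

# One matrix-vector product for all rows; the result is indexed like rating_df
business_df["dot_prod"] = rating_df.dot(fixed_vector)
print(business_df)
```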

drop all duplicate values in python

Assume I have the following dataframe in Python:
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
B = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
This gives me the above dataframe: I want to remove the rows that have the same values for Col1 but different values for Col3. I have tried to use drop_duplicates command with different subset of columns but it does not give what I want. I can write for loop but that is not efficient at all (since you might have much more columns than this).
C= B.drop_duplicates(['Col1','Col3'],keep = False)
Is there a pandas command that can do this without a for loop?
The expected output contains only the row for B, since A and C are removed because they have the same Col1 but different Col3.
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
df = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
output = df.drop_duplicates('Col1', keep=False)
print(output)
Output:
Col1 Col2 Col3
2 B 8 9
This can do the job,
grouped_df = df.groupby("Col1")
groups = [grouped_df.get_group(key) for key in grouped_df.groups.keys() if len(grouped_df.get_group(key)["Col3"].unique()) == 1]
new_df = pd.concat(groups).reset_index(drop = True)
Output -
  Col1  Col2  Col3
0    B     8     9
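Note that drop_duplicates('Col1', keep=False) removes every Col1 value that appears more than once, even when all of its Col3 values agree. If such groups should be kept, one option is to count the distinct Col3 values per group with transform('nunique'). The sketch below adds two hypothetical D rows (same Col1, same Col3) to show the difference; they are not in the question's data.

```python
import pandas as pd

A = [['A', 2, 3], ['A', 5, 4], ['B', 8, 9], ['C', 8, 10], ['C', 9, 20], ['C', 10, 20],
     ['D', 1, 7], ['D', 2, 7]]  # the two D rows are hypothetical additions
df = pd.DataFrame(A, columns=['Col1', 'Col2', 'Col3'])

# Number of distinct Col3 values within each Col1 group, broadcast to every row
distinct = df.groupby('Col1')['Col3'].transform('nunique')

# Keep only groups whose Col3 never varies
output = df[distinct == 1]
print(output)
```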

Loop through Pandas Dataframe with unique column values

I have the following question and I need help to apply the for loop to iterate through dataframe columns with unique values. For ex I have the following df.
col1 col2 col3
aaa 10 1
bbb 15 2
aaa 12 1
bbb 16 3
ccc 20 3
ccc 50 1
ddd 18 2
I had to apply some manipulation to the dataset for each unique value of col3. Therefore, what I did is I sliced out the df with col3=1 by:
df1 = df[df['col3']==1]
#added all processing here in df1#
Now I need to do the same slicing for col3==2 ... col3==10, and I will be applying the same manipulation as I did in col3==1. For ex I have to do:
df2 = df[df['col3']==2]
#add the same processing here in df2#
df3 = df[df['col3']==3]
#add the same processing here in df3#
Then I will need to append them into a list and then combine them at the end.
I couldn't figure out how to run a for loop that will go through col3 column and look at the unique values so I don't have to create manually ten dfs.
I tried to groupby then apply the manipulation but it didn't work.
I appreciate help on this. Thanks
A simple solution: iterate over the unique values of the column and use loc to select the rows matching each value, like this:
dfs = []
for i in df["col3"].unique():
    df_i = df.loc[df["col3"] == i, :]
    dfs.append(df_i.copy())
This should do it, but it will be slow for large dataframes.
# DataFrame.append was removed in pandas 2.0, so collect the rows in lists
rows1, rows2, rows3 = [], [], []
for _, v in df.iterrows():
    if v['col3'] == 1:
        # add your code
        rows1.append(v)
    elif v['col3'] == 2:
        # add your code
        rows2.append(v)
    elif v['col3'] == 3:
        # add your code
        rows3.append(v)
df1 = pd.DataFrame(rows1, columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(rows2, columns=['col1', 'col2', 'col3'])
df3 = pd.DataFrame(rows3, columns=['col1', 'col2', 'col3'])
You can then use pd.concat() to rebuild to one df.
Output of df1
col1 col2 col3
0 aaa 10 1
2 aaa 12 1
5 ccc 50 1
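Since the question mentions that groupby did not work, here is a minimal sketch of how iterating over a groupby could look. The process function and its col2_share column are hypothetical placeholders for whatever manipulation is actually needed.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "col1": ["aaa", "bbb", "aaa", "bbb", "ccc", "ccc", "ddd"],
    "col2": [10, 15, 12, 16, 20, 50, 18],
    "col3": [1, 2, 1, 3, 3, 1, 2],
})

def process(group):
    # Hypothetical processing step: each row's share of its group's col2 total
    return group.assign(col2_share=group["col2"] / group["col2"].sum())

# groupby yields one (key, sub-frame) pair per unique col3 value
parts = [process(group) for _, group in df.groupby("col3")]

# Recombine the pieces and restore the original row order
result = pd.concat(parts).sort_index()
print(result)
```

This avoids creating one variable per value and works for any number of distinct col3 values.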

pandas dataframe from_dict - Set column name for key, and column name for key-value

I have a dictionary as follows:
my_keys = {'a':10, 'b':3, 'c':23}
I turn it into a Dataframe:
df = pd.DataFrame.from_dict(my_keys)
It outputs the df as below
a b c
0 10 3 23
How can I get it to look like below:
Col1 Col2
a 10
b 3
c 23
I've tried orient=index but I still can't get column names?
You can create list of tuples and pass to DataFrame constructor:
df = pd.DataFrame(list(my_keys.items()), columns=['col1','col2'])
Or convert keys and values to separate lists:
df = pd.DataFrame({'col1': list(my_keys.keys()),'col2':list(my_keys.values())})
print (df)
col1 col2
0 a 10
1 b 3
2 c 23
Your own approach also works with orient='index' and the columns parameter, but you then need DataFrame.rename_axis and DataFrame.reset_index to turn the index into a column:
df = (pd.DataFrame.from_dict(my_keys, orient='index', columns=['col2'])
        .rename_axis('col1')
        .reset_index())
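Another route that might read a little shorter is to go through a Series first: the dict keys become the index, rename_axis names that index, and reset_index(name=...) names the value column. This is only a sketch of an alternative, using the Col1 and Col2 names from the question.

```python
import pandas as pd

my_keys = {'a': 10, 'b': 3, 'c': 23}

# Keys become the index, values become the data
df = (pd.Series(my_keys)
        .rename_axis('Col1')          # name the index so it becomes a named column
        .reset_index(name='Col2'))    # move the index into a column, name the values
print(df)
```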

Find unique values for each column

I am looking to find the unique values for each column in my dataframe. (Values unique for the whole dataframe)
Col1 Col2 Col3
1 A A B
2 C A B
3 B B F
Col1 has C as a unique value, Col2 has none and Col3 has F.
Any genius ideas ? thank you !
You can use stack to get a Series, then drop_duplicates with keep=False to remove every value that occurs more than once, drop the first index level with reset_index, and finally reindex by the column names:
df = (df.stack()
        .drop_duplicates(keep=False)
        .reset_index(level=0, drop=True)
        .reindex(index=df.columns))
print (df)
Col1 C
Col2 NaN
Col3 F
dtype: object
The solution above works well only if there is at most one unique value per column.
Here is a more general solution:
print (df)
Col1 Col2 Col3
1 A A B
2 C A X
3 B B F
s = df.stack().drop_duplicates(keep=False).reset_index(level=0, drop=True)
print (s)
Col1 C
Col3 X
Col3 F
dtype: object
s = s.groupby(level=0).unique().reindex(index=df.columns)
print (s)
Col1 [C]
Col2 NaN
Col3 [X, F]
dtype: object
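If a plain dict of lists is easier to work with than a Series that mixes arrays and NaN, the same idea can be sketched with value_counts. The names stacked, counts, once and result are mine; the data is the second example frame above.

```python
import pandas as pd

df = pd.DataFrame(
    {"Col1": ["A", "C", "B"], "Col2": ["A", "A", "B"], "Col3": ["B", "X", "F"]},
    index=[1, 2, 3],
)

# One long Series indexed by (row label, column name)
stacked = df.stack()

# How often each value occurs anywhere in the frame
counts = stacked.value_counts()

# Keep only the values that occur exactly once overall
once = stacked[stacked.map(counts) == 1]

# Collect the surviving values per column; columns with none get an empty list
columns_level = once.index.get_level_values(1)
result = {col: once[columns_level == col].tolist() for col in df.columns}
print(result)
```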
I don't believe this is exactly what you want, but as useful information - you can find unique values for a DataFrame using numpy's .unique() like so:
>>> np.unique(df[['Col1', 'Col2', 'Col3']])
['A' 'B' 'C' 'F']
You can also get unique values of a specific column, e.g. Col3:
>>> df.Col3.unique()
['B' 'F']
