I have the following question and need help writing a for loop that iterates over the unique values of a dataframe column. For example, I have the following df.
col1 col2 col3
aaa 10 1
bbb 15 2
aaa 12 1
bbb 16 3
ccc 20 3
ccc 50 1
ddd 18 2
I need to apply some manipulation to the dataset for each unique value of col3. So far, I have sliced out the rows with col3 == 1 by:
df1 = df[df['col3']==1]
#added all processing here in df1#
Now I need to do the same slicing for col3 == 2 through col3 == 10, applying the same manipulation to each slice as I did for col3 == 1. For example, I would have to do:
df2 = df[df['col3']==2]
#add the same processing here in df2#
df3 = df[df['col3']==3]
#add the same processing here in df3#
Then I will need to append them into a list and then combine them at the end.
I couldn't figure out how to write a for loop that goes through the unique values of col3, so that I don't have to create ten dataframes manually.
I tried groupby followed by the manipulation, but couldn't get it to work.
I'd appreciate any help with this. Thanks.
Simple solution: just iterate over the unique values of the column and use .loc to select the rows with each value, like this:
dfs = []
for i in df["col3"].unique():
    df_i = df.loc[df["col3"] == i, :]
    dfs.append(df_i.copy())
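Filled out with the sample data, the whole pattern looks like this, including the final recombination the asker mentions. The doubling of col2 is a hypothetical placeholder for the unspecified per-slice manipulation:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['aaa', 'bbb', 'aaa', 'bbb', 'ccc', 'ccc', 'ddd'],
                   'col2': [10, 15, 12, 16, 20, 50, 18],
                   'col3': [1, 2, 1, 3, 3, 1, 2]})

dfs = []
for i in df['col3'].unique():
    df_i = df.loc[df['col3'] == i, :].copy()
    df_i['col2'] = df_i['col2'] * 2   # placeholder for the real manipulation
    dfs.append(df_i)

result = pd.concat(dfs).sort_index()  # combine them at the end
```

The sort_index() at the end restores the original row order after the per-slice pieces are stacked back together.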
This should do it, but it will be slow for large dataframes. Note that DataFrame.append was removed in pandas 2.0, so the rows are collected in lists and turned into dataframes at the end:
rows1, rows2, rows3 = [], [], []
for _, v in df.iterrows():
    if v['col3'] == 1:
        # add your code
        rows1.append(v)
    elif v['col3'] == 2:
        # add your code
        rows2.append(v)
    elif v['col3'] == 3:
        # add your code
        rows3.append(v)
df1 = pd.DataFrame(rows1)
df2 = pd.DataFrame(rows2)
df3 = pd.DataFrame(rows3)
You can then use pd.concat() to rebuild them into one df.
Output of df1
  col1  col2  col3
0  aaa    10     1
2  aaa    12     1
5  ccc    50     1
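As an aside, the groupby route the asker tried can also work if the manipulation is wrapped in a function; a sketch with the sample data, where the doubling of col2 is again a hypothetical placeholder:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['aaa', 'bbb', 'aaa', 'bbb', 'ccc', 'ccc', 'ddd'],
                   'col2': [10, 15, 12, 16, 20, 50, 18],
                   'col3': [1, 2, 1, 3, 3, 1, 2]})

def process(group):
    # stand-in for the per-slice manipulation
    group = group.copy()
    group['col2'] = group['col2'] * 2
    return group

# groupby yields (key, group) pairs; process each group and concat
result = pd.concat(process(g) for _, g in df.groupby('col3')).sort_index()
```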
Assume I have the following dataframe in Python:
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
B = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
This gives me the above dataframe. I want to remove the rows that have the same value in Col1 but different values in Col3. I have tried drop_duplicates with different subsets of columns, but it does not give what I want. I could write a for loop, but that is not efficient at all (there might be many more columns than this).
C = B.drop_duplicates(['Col1', 'Col3'], keep=False)
Can anyone help if there is any command in pandas that can do this without using a for loop?
The expected output would be just the B row, since the A and C rows are removed because they share a Col1 value but have different Col3 values.
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
df = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
output = df.drop_duplicates('Col1', keep=False)
print(output)
Output:
Col1 Col2 Col3
2 B 8 9
This can do the job,
grouped_df = df.groupby("Col1")
groups = [grouped_df.get_group(key) for key in grouped_df.groups
          if grouped_df.get_group(key)["Col3"].nunique() == 1]
new_df = pd.concat(groups).reset_index(drop = True)
Output -
  Col1  Col2  Col3
0    B     8     9
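The same condition can be expressed more directly with GroupBy.filter, which keeps only the groups for which the function returns True:

```python
import pandas as pd

A = [['A', 2, 3], ['A', 5, 4], ['B', 8, 9], ['C', 8, 10], ['C', 9, 20], ['C', 10, 20]]
df = pd.DataFrame(A, columns=['Col1', 'Col2', 'Col3'])

# keep only the Col1 groups whose Col3 values are all identical
new_df = (df.groupby('Col1')
            .filter(lambda g: g['Col3'].nunique() == 1)
            .reset_index(drop=True))
```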
I am trying to reassign multiple columns in DataFrame with modifications.
The below is a simplified example.
import pandas as pd
d = {'col1':[1,2], 'col2':[3,4]}
df = pd.DataFrame(d)
print(df)
col1 col2
0 1 3
1 2 4
I use assign() method to add 1 to both 'col1' and 'col2'.
However, the result is that 1 is added to 'col2' only, and that result is also copied into 'col1'.
df2 = df.assign(**{c: lambda x: x[c] + 1 for c in ['col1','col2']})
print(df2)
col1 col2
0 4 4
1 5 5
Can someone explain why this is happening, and also suggest a correct way to apply assign() to multiple columns?
The lambdas in the dict comprehension all close over the same variable c, and assign only calls them after the comprehension has finished, when c is 'col2' (Python's late binding). You don't need lambdas here at all, since df is already in scope:
df.assign(**{c: df[c] + 1 for c in ['col1','col2']})
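If you do want to keep the lambdas, binding c as a default argument freezes its current value inside each lambda and avoids the late-binding problem:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# c=c evaluates c at definition time, so each lambda keeps its own column name
df2 = df.assign(**{c: (lambda x, c=c: x[c] + 1) for c in ['col1', 'col2']})
```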
I have two dataframes. The FIRST one, shown below, has three columns.
Col_1 Col_2 Col_3
aaa dfd ccc
sdf jjj sge
rty fgh rtg
hji dfg hyt
lkj bgh dcf
In each row, there is one element that also appears in the SECOND dataframe, shown below (the elements in the second dataframe are not in any specific order, of course).
list
ccc
sge
fgh
dfg
dcf
My goal is to iterate through each row of the FIRST dataframe, find the element it has in common with the SECOND dataframe, and move that element to the beginning of the row. The expected result is as follows:
Expected result
Col_1 Col_2 Col_3
ccc aaa dfd
sge sdf jjj
fgh rty rtg
dfg hji hyt
dcf lkj bgh
Any help will be appreciated!
Using the .apply method of the pandas DataFrame, you can do it in one line; this will be faster than manually iterating over the rows.
It uses only pandas and works at row level: for each row it checks which elements are in ls, argsorts the boolean indicator (matches sort to the front of the row), reindexes the row into that order, and then broadcasts the result back onto the original row.
import pandas as pd
df = pd.DataFrame({'col1':['aaa','sdf','rty','hji','lkj'],
'col2':['dfd','jjj','fgh','dfg','bgh'],
'col3':['ccc','sge','rtg','hyt','dcf']})
ls = pd.Series(['ccc','sge','fgh','dfg','dcf'])
df = df.apply(lambda x: x[(~x.isin(ls)).argsort()],
axis=1,
result_type='broadcast')
Returns:
col1 col2 col3
0 ccc aaa dfd
1 sge sdf jjj
2 fgh rty rtg
3 dfg hji hyt
4 dcf lkj bgh
Why not try using apply, isin and tolist? Passing result_type='broadcast' keeps the dataframe shape (otherwise the lists come back as a single Series of lists):
print(df.apply(lambda x: x[x.isin(ls)].tolist() + x[~x.isin(ls)].tolist(),
               axis=1, result_type='broadcast'))
Output:
col1 col2 col3
0 ccc aaa dfd
1 sge sdf jjj
2 fgh rty rtg
3 dfg hji hyt
4 dcf lkj bgh
I simply take each row, get the value that is in ls, and make it the first element, appending the rest after it, using isin, tolist and +.
# turn the 2nd dataframe into a lookup list
lookup = df2['list'].tolist()

for index, row in df1.iterrows():
    # if Col_1 matches, do nothing
    # if Col_2 matches, swap Col_1 and Col_2 and ignore Col_3
    if row['Col_2'] in lookup:
        df1.loc[index, 'Col_1'] = row['Col_2']
        df1.loc[index, 'Col_2'] = row['Col_1']
    # if Col_3 matches, rotate all three values
    elif row['Col_3'] in lookup:
        df1.loc[index, 'Col_1'] = row['Col_3']
        df1.loc[index, 'Col_2'] = row['Col_1']
        df1.loc[index, 'Col_3'] = row['Col_2']
Here is a possible solution using a list comprehension (df1 and df2 are the two DataFrames):
import pandas as pd
lookup = set(df2['list'])
result = pd.DataFrame([(x, y, z) if x in lookup
                       else (y, x, z) if y in lookup
                       else (z, x, y)
                       for _, x, y, z in df1.itertuples()],
                      columns=df1.columns)
Here's another solution that should be fast:
df = pd.DataFrame({'col1':['aaa','sdf','rty','hji','lkj'],
'col2':['dfd','jjj','fgh','dfg','bgh'],
'col3':['ccc','sge','rtg','hyt','dcf']})
list2 = pd.DataFrame({'list':['ccc','sge','fgh','dfg','dcf']})
(list2.assign(**df)
      .unstack()
      .drop_duplicates()
      .groupby(level=1)
      .agg(list)
      .apply(pd.Series, index=[1, 2, 3])
      .add_prefix('col_'))
col_1 col_2 col_3
0 ccc aaa dfd
1 sge sdf jjj
2 fgh rty rtg
3 dfg hji hyt
4 dcf lkj bgh
I have two data frames named df and df_reference, which contain following information:
df              df_reference
col1  col2      col1  col2
A     10        A     15
B     25        B     33
C     30        C     20
A     12
I want to compare the two data frames based on col1 and replace the value of df.col2 with df_reference.col2 whenever the df_reference value is greater.
The expected output is:
df
col1 col2
A 15
B 33
C 30
A 15
I have tried:
dict1 = {'a':'15'}
df.loc[df['col1'].isin(dict1.keys()), 'col2'] = sams['col1'].map(dict1)
Use Series.map with a Series created by DataFrame.set_index; values that are not matched become NaN and are replaced by the original values with Series.fillna:
s = df['col1'].map(df_reference.set_index('col1')['col2']).fillna(df['col2'])
df.loc[s > df['col2'], 'col2'] = s
print (df)
col1 col2
0 A 15
1 B 33
2 C 30
3 A 15
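The two assignment lines can also be collapsed with Series.clip, which accepts a per-row lower bound; a sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C', 'A'], 'col2': [10, 25, 30, 12]})
df_reference = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [15, 33, 20]})

# mapped reference values act as a per-row lower bound on col2
s = df['col1'].map(df_reference.set_index('col1')['col2']).fillna(df['col2'])
df['col2'] = df['col2'].clip(lower=s)
```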
I suggest first doing a merge based on 'col1' and then applying a function that generates a new column with the greater of the two 'col2' values. Then just drop the useless columns!
def greaterValue(row):
    if row['col2_x'] > row['col2_y']:
        return row['col2_x']
    else:
        return row['col2_y']

df = df.merge(df_reference, left_on='col1', right_on='col1')
df['col2'] = df.apply(greaterValue, axis=1)
df = df.loc[:, ['col1', 'col2']]
I have a dictionary as follows:
my_keys = {'a':10, 'b':3, 'c':23}
I turn it into a Dataframe:
df = pd.DataFrame.from_dict(my_keys)
It outputs the df as below
a b c
0 10 3 23
How can I get it to look like below:
Col1 Col2
a 10
b 3
c 23
I've tried orient='index' but I still can't get the column names.
You can create list of tuples and pass to DataFrame constructor:
df = pd.DataFrame(list(my_keys.items()), columns=['col1','col2'])
Or convert keys and values to separate lists:
df = pd.DataFrame({'col1': list(my_keys.keys()),'col2':list(my_keys.values())})
print (df)
col1 col2
0 a 10
1 b 3
2 c 23
Your solution works if changed to use orient='index' with columns, but then it is necessary to add DataFrame.rename_axis and DataFrame.reset_index to turn the index into a column:
df = (pd.DataFrame.from_dict(my_keys, orient='index', columns=['col2'])
.rename_axis('col1')
.reset_index())
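For completeness, a runnable version of that suggestion on the dictionary from the question:

```python
import pandas as pd

my_keys = {'a': 10, 'b': 3, 'c': 23}

# keys become the index, values the single column; then move the index out
df = (pd.DataFrame.from_dict(my_keys, orient='index', columns=['col2'])
        .rename_axis('col1')
        .reset_index())
```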