Assume I have the following dataframe in Python:
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
B = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
This gives me the dataframe above. I want to remove the rows that have the same value for Col1 but different values for Col3. I have tried the drop_duplicates command with different subsets of columns, but it does not give what I want. I could write a for loop, but that is not efficient at all (especially since there might be many more columns than this).
C= B.drop_duplicates(['Col1','Col3'],keep = False)
Can anyone help if there is any command in Python can do this without using for loop?
The expected output would contain only the B row, since the A and C rows are removed because they share a Col1 value but have different Col3 values.
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
df = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
output = df.drop_duplicates('Col1', keep=False)
print(output)
Output:
  Col1  Col2  Col3
2    B     8     9
This can do the job,
grouped_df = df.groupby("Col1")
groups = [grouped_df.get_group(key) for key in grouped_df.groups.keys() if len(grouped_df.get_group(key)["Col3"].unique()) == 1]
new_df = pd.concat(groups).reset_index(drop = True)
Output -

  Col1  Col2  Col3
0    B     8     9
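Equivalently, a shorter sketch of the same idea using groupby().transform, which keeps only the rows whose Col1 group has a single distinct Col3 value (using the sample data from the question):

```python
import pandas as pd

A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
df = pd.DataFrame(A, columns=['Col1','Col2','Col3'])

# Keep only rows whose Col1 group has exactly one distinct Col3 value
new_df = df[df.groupby('Col1')['Col3'].transform('nunique') == 1].reset_index(drop=True)
print(new_df)
```

This avoids building the groups list explicitly and works for any number of columns.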
I am trying to find a better, more pythonic way of accomplishing the following:
I want to add a new column to business_df called 'dot_prod', which is the dot product of a fixed vector (fixed_vector) and a vector from another data frame (rating_df). The rows of both business_df and rating_df have the same index values (business_id).
I have this loop, which appears to work; however, I know it's super clumsy (and takes forever). Essentially it loops once per row, calculates the dot product, then dumps it into the business_df dataframe.
n = 0
for i in range(business_df.shape[0]):
    dot_prod = np.dot(fixed_vector, rating_df.iloc[n])
    business_df['dot_prod'][n] = dot_prod
    n += 1
IIUC, you are looking for apply across axis=1 like:
business_df['dot_prod'] = rating_df.apply(lambda x: np.dot(fixed_vector, x), axis=1)
>>> fixed_vector = [1, 2, 3]
>>> df = pd.DataFrame({'col1' : [1,2], 'col2' : [3,4], 'col3' : [5,6]})
>>> df
col1 col2 col3
0 1 3 5
1 2 4 6
>>> df['col4'] = np.dot(fixed_vector, [df['col1'], df['col2'], df['col3']])
>>> df
col1 col2 col3 col4
0 1 3 5 22
1 2 4 6 28
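For larger frames, note that the apply version still calls np.dot once per Python-level row. If all the columns are numeric, a single matrix-vector product over the whole frame is typically much faster; a sketch using the same toy data as above:

```python
import numpy as np
import pandas as pd

fixed_vector = np.array([1, 2, 3])
rating_df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})

# One matrix-vector product over all rows at once, no per-row Python calls
dot_prod = rating_df.to_numpy() @ fixed_vector
print(dot_prod)  # [22 28]
```

The result can be assigned directly, e.g. business_df['dot_prod'] = rating_df.to_numpy() @ fixed_vector, provided the two frames share the same row order.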
I have the following question and need help applying a for loop that iterates through a dataframe column's unique values. For example, I have the following df.
col1 col2 col3
aaa 10 1
bbb 15 2
aaa 12 1
bbb 16 3
ccc 20 3
ccc 50 1
ddd 18 2
I had to apply some manipulation to the dataset for each unique value of col3. Therefore, what I did is I sliced out the df with col3=1 by:
df1 = df[df['col3']==1]
#added all processing here in df1#
Now I need to do the same slicing for col3==2 ... col3==10, and I will be applying the same manipulation as I did in col3==1. For ex I have to do:
df2 = df[df['col3']==2]
#add the same processing here in df2#
df3 = df[df['col3']==3]
#add the same processing here in df3#
Then I will need to append them into a list and then combine them at the end.
I couldn't figure out how to run a for loop that will go through col3 column and look at the unique values so I don't have to create manually ten dfs.
I tried to groupby then apply the manipulation but it didn't work.
I appreciate help on this. Thanks
A simple solution: just iterate over the unique values of the column and use .loc to select the rows with that value, like this:
dfs = []
for i in df["col3"].unique():
    df_i = df.loc[df["col3"] == i, :]
    # apply your processing to df_i here
    dfs.append(df_i.copy())
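Since the question mentions having tried groupby, here is a sketch of that approach as well; process is a hypothetical placeholder for whatever per-slice manipulation you need:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['aaa','bbb','aaa','bbb','ccc','ccc','ddd'],
                   'col2': [10, 15, 12, 16, 20, 50, 18],
                   'col3': [1, 2, 1, 3, 3, 1, 2]})

def process(group):
    # placeholder: apply your per-slice manipulation here
    return group

# Process each col3 slice separately and combine the results into one frame
result = pd.concat(process(g) for _, g in df.groupby('col3'))
```

This handles any number of distinct col3 values without creating df1 ... df10 by hand.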
This should do it, but it will be slow for large dataframes. (Note that DataFrame.append was removed in pandas 2.0, so each row is appended with pd.concat instead.)
df1 = pd.DataFrame(columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(columns=['col1', 'col2', 'col3'])
df3 = pd.DataFrame(columns=['col1', 'col2', 'col3'])
for _, v in df.iterrows():
    if v['col3'] == 1:
        # add your code
        df1 = pd.concat([df1, v.to_frame().T])
    elif v['col3'] == 2:
        # add your code
        df2 = pd.concat([df2, v.to_frame().T])
    elif v['col3'] == 3:
        # add your code
        df3 = pd.concat([df3, v.to_frame().T])
You can then use pd.concat() to rebuild them into one df.
Output of df1
col1 col2 col3
0 aaa 10 1
2 aaa 12 1
5 ccc 50 1
I have such a dataframe df with two columns:
Col1 Col2
'abc-def-ghi' 1
'abc-opq-rst' 2
I created a new column Col3 like this:
import re
df['Col3'] = df['Col1'].str.findall('abc', flags=re.IGNORECASE)
And got such a dataframe afterwards:
Col1 Col2 Col3
'abc-def-ghi' 1 [abc]
'abc-opq-rst' 2 [abc]
What I want to do now is create a new column Col4 that contains a 1 if Col3 contains 'abc' and a 0 otherwise.
I tried to do this with a function:
def f(row):
    if row['Col3'] == '[abc]':
        val = 1
    else:
        val = 0
    return val
And applied this to my pandas dataframe:
df['Col4'] = df.apply(f, axis=1)
But I only get 0, even in rows that contain 'abc'. I think there is something wrong with my if-statement.
How can I solve this?
Just do
df['Col4'] = df.Col3.astype(bool).astype(int)
This works because Col3 holds lists and a non-empty list is truthy. Your comparison failed because row['Col3'] is the list ['abc'], not the string '[abc]'.
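As an aside, if the list column is only an intermediate step, Series.str.contains can produce the 0/1 flag directly from Col1 without findall; a sketch (the third row is made up to show the 0 case):

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['abc-def-ghi', 'abc-opq-rst', 'xyz-opq-rst'],
                   'Col2': [1, 2, 3]})

# Flag rows whose Col1 contains 'abc' (case-insensitive), no findall needed
df['Col4'] = df['Col1'].str.contains('abc', case=False).astype(int)
print(df['Col4'].tolist())  # [1, 1, 0]
```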
I have the following dataframes:
df1 = pd.DataFrame({'col1': ['A','M','C'],
                    'col2': ['B','N','O'],
                    # plus many more
                    })
df2 = pd.DataFrame({'col3': ['A','A','A','B','B','B'],
                    'col4': ['M','P','Q','J','P','M'],
                    # plus many more
                    })
Which look like these:
df1:
col1 col2
A B
M N
C O
#...plus many more
df2:
col3 col4
A M
A P
A Q
B J
B P
B M
#...plus many more
The objective is to create a dataframe containing all elements of col4 for each pair of col3 values that occurs in one row of df1. For example, let's look at row 1 of df1. We see that A is in col1 and B is in col2. Then we go to df2 and check what col4 is for df2[df2['col3'] == 'A'] and df2[df2['col3'] == 'B']. We get ['M','P','Q'] for A and ['J','P','M'] for B. The intersection of these is ['M', 'P'], so what I want is something like this:
col1 col2 col4
A B M
A B P
....(and so on for the other rows)
The naive way to go about this is to iterate over the rows and take the intersection, but I was wondering whether it's possible to solve this via merging techniques or other faster methods. So far, I haven't been able to think of one.
This should achieve what you want, using a combination of merge, groupby and set intersection:
# Getting tuple of all col1=col3 values in col4
df3 = pd.merge(df1, df2, left_on='col1', right_on='col3')
df3 = df3.groupby(['col1', 'col2'])['col4'].apply(tuple)
df3 = df3.reset_index()
# Getting tuple of all col2=col3 values in col4
df3 = pd.merge(df3, df2, left_on='col2', right_on='col3')
df3 = df3.groupby(['col1', 'col2', 'col4_x'])['col4_y'].apply(tuple)
df3 = df3.reset_index()
# Taking set intersection of our two tuples
df3['col4'] = df3.apply(lambda row: set(row['col4_x']) & set(row['col4_y']), axis=1)
# Dropping unnecessary columns
df3 = df3.drop(['col4_x', 'col4_y'], axis=1)
print(df3)
col1 col2 col4
0 A B {P, M}
If required, see this answer for examples of how to 'melt' col4.
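If you'd rather avoid the intermediate tuple columns, here is a merge-only sketch of the same idea (assuming col4 values are not repeated within a single col3 key; otherwise drop duplicates in df2 first):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'M', 'C'], 'col2': ['B', 'N', 'O']})
df2 = pd.DataFrame({'col3': ['A', 'A', 'A', 'B', 'B', 'B'],
                    'col4': ['M', 'P', 'Q', 'J', 'P', 'M']})

# Attach col4 candidates for col1 and for col2 separately, then inner-merge:
# only col4 values present for both keys survive the final merge
left = df1.merge(df2, left_on='col1', right_on='col3')
right = df1.merge(df2, left_on='col2', right_on='col3')
out = left.merge(right, on=['col1', 'col2', 'col4'])[['col1', 'col2', 'col4']]
print(out)
```

The inner merge on ['col1', 'col2', 'col4'] plays the role of the set intersection, one output row per shared col4 value.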
I am looking to find the unique values for each column in my dataframe. (Values unique for the whole dataframe)
Col1 Col2 Col3
1 A A B
2 C A B
3 B B F
Col1 has C as a unique value, Col2 has none and Col3 has F.
Any genius ideas? Thank you!
You can use stack to reshape to a Series, then drop_duplicates with keep=False to remove all duplicated values, reset_index to remove the first level, and finally reindex:
df = (df.stack()
        .drop_duplicates(keep=False)
        .reset_index(level=0, drop=True)
        .reindex(index=df.columns))
print (df)
Col1 C
Col2 NaN
Col3 F
dtype: object
The solution above works nicely only if there is at most one unique value per column.
Here is a more general solution:
print (df)
Col1 Col2 Col3
1 A A B
2 C A X
3 B B F
s = df.stack().drop_duplicates(keep=False).reset_index(level=0, drop=True)
print (s)
Col1 C
Col3 X
Col3 F
dtype: object
s = s.groupby(level=0).unique().reindex(index=df.columns)
print (s)
Col1 [C]
Col2 NaN
Col3 [X, F]
dtype: object
I don't believe this is exactly what you want, but as useful information - you can find unique values for a DataFrame using numpy's .unique() like so:
>>> np.unique(df[['Col1', 'Col2', 'Col3']])
['A' 'B' 'C' 'F']
You can also get unique values of a specific column, e.g. Col3:
>>> df.Col3.unique()
['B' 'F']
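A variant of the stack idea that yields empty lists instead of NaN for columns with no unique values, by counting each value once across the whole frame (a sketch using the example data from the question):

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'C', 'B'],
                   'Col2': ['A', 'A', 'B'],
                   'Col3': ['B', 'B', 'F']},
                  index=[1, 2, 3])

# Count every value across the whole frame, then keep per-column values seen once
counts = df.stack().value_counts()
unique_per_col = {col: [v for v in df[col] if counts[v] == 1] for col in df.columns}
print(unique_per_col)  # {'Col1': ['C'], 'Col2': [], 'Col3': ['F']}
```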