I have the following dataframes:
import pandas as pd

df1 = pd.DataFrame({'col1': ['A','M','C'],
'col2': ['B','N','O'],
# plus many more
})
df2 = pd.DataFrame({'col3': ['A','A','A','B','B','B'],
'col4': ['M','P','Q','J','P','M'],
# plus many more
})
Which look like these:
df1:
col1 col2
A B
M N
C O
#...plus many more
df2:
col3 col4
A M
A P
A Q
B J
B P
B M
#...plus many more
The objective is to create a dataframe containing, for each row of df1, all elements of col4 that occur in df2 for both of that row's values in col3. For example, let's look at row 1 of df1. We see that A is in col1 and B is in col2. Then we go to df2 and check what col4 is for df2[df2['col3'] == 'A'] and df2[df2['col3'] == 'B']. We get, for A: ['M','P','Q'], and for B: ['J','P','M']. The intersection of these is ['M', 'P'], so what I want is something like this:
col1 col2 col4
A B M
A B P
....(and so on for the other rows)
The naive way to go about this is to iterate over the rows and take the intersection for each one, but I was wondering whether it's possible to solve this with merging techniques or other faster methods. So far, I haven't been able to think of one.
This should achieve what you want, using a combination of merge, groupby and set intersection:
# Getting tuple of all col1=col3 values in col4
df3 = pd.merge(df1, df2, left_on='col1', right_on='col3')
df3 = df3.groupby(['col1', 'col2'])['col4'].apply(tuple)
df3 = df3.reset_index()
# Getting tuple of all col2=col3 values in col4
df3 = pd.merge(df3, df2, left_on='col2', right_on='col3')
df3 = df3.groupby(['col1', 'col2', 'col4_x'])['col4_y'].apply(tuple)
df3 = df3.reset_index()
# Taking set intersection of our two tuples
df3['col4'] = df3.apply(lambda row: set(row['col4_x']) & set(row['col4_y']), axis=1)
# Dropping unnecessary columns
df3 = df3.drop(['col4_x', 'col4_y'], axis=1)
print(df3)
col1 col2 col4
0 A B {P, M}
If required, see this answer for examples of how to 'melt' col4.
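For reference, on newer pandas versions (0.25+) DataFrame.explode is a convenient way to do that last step. A minimal sketch, assuming the df3 produced by the snippet above (col4 holds sets such as {'M', 'P'}):
# Hypothetical continuation of the snippet above: one row per element of col4.
# The sets are converted to sorted lists first so the row order is stable.
df3['col4'] = df3['col4'].apply(sorted)
out = df3.explode('col4').reset_index(drop=True)   # requires pandas >= 0.25
print(out)
#   col1 col2 col4
# 0    A    B    M
# 1    A    B    P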
Related
Assume I have the following dataframe in Python:
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
B = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
This gives me the dataframe above. I want to remove the rows that have the same values for Col1 but different values for Col3. I have tried to use the drop_duplicates command with different subsets of columns, but it does not give what I want. I could write a for loop, but that is not efficient at all (since there might be many more columns than this).
C = B.drop_duplicates(['Col1', 'Col3'], keep=False)
Can anyone tell me whether there is a command in Python that can do this without using a for loop?
The expected output would be just the B row, since the A and C rows are removed because they have the same Col1 but different Col3.
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
df = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
output = df.drop_duplicates('Col1', keep=False)
print(output)
Output:
Col1 Col2 Col3
2 B 8 9
This can do the job:
grouped_df = df.groupby("Col1")
groups = [grouped_df.get_group(key) for key in grouped_df.groups.keys() if len(grouped_df.get_group(key)["Col3"].unique()) == 1]
new_df = pd.concat(groups).reset_index(drop = True)
Output -
   Col1  Col2  Col3
0     B     8     9
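For what it's worth, a more idiomatic sketch of the same idea (assuming the sample data above) keeps only the Col1 groups that have a single distinct Col3 value, using groupby plus transform('nunique'):
import pandas as pd

A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
df = pd.DataFrame(A, columns=['Col1','Col2','Col3'])

# Keep rows whose Col1 group contains exactly one distinct Col3 value
new_df = df[df.groupby('Col1')['Col3'].transform('nunique') == 1].reset_index(drop=True)
print(new_df)
#   Col1  Col2  Col3
# 0    B     8     9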
I have such a dataframe df with two columns:
Col1 Col2
'abc-def-ghi' 1
'abc-opq-rst' 2
I created a new column Col3 like this:
df['Col3'] = df['Col1'].str.findall('abc', flags=re.IGNORECASE)
And got such a dataframe afterwards:
Col1 Col2 Col3
'abc-def-ghi' 1 [abc]
'abc-opq-rst' 2 [abc]
What I want to do now is to create a new column Col4 where I get a one if Col3 contains 'abc' and otherwise zero.
I tried to do this with a function:
def f(row):
if row['Col3'] == '[abc]':
val = 1
else:
val = 0
return val
And applied this to my pandas dataframe:
df['Col4'] = df.apply(f, axis=1)
But I only get 0, even in rows that contain 'abc'. I think there is something wrong with my if-statement.
How can I solve this?
Just do
df['Col4'] = df.Col3.astype(bool).astype(int)
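As a side note, if the intermediate Col3 isn't actually needed, a sketch using str.contains on Col1 directly should give the same 0/1 flag (this goes beyond the original answer):
import re
import pandas as pd

df = pd.DataFrame({'Col1': ['abc-def-ghi', 'abc-opq-rst'], 'Col2': [1, 2]})

# 1 where Col1 contains 'abc' (case-insensitive), 0 otherwise
df['Col4'] = df['Col1'].str.contains('abc', flags=re.IGNORECASE).astype(int)
print(df)  # both rows get 1 here, since both contain 'abc'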
Assume a dataframe df like the following:
col1 col2
0 a A
1 b A
2 c A
3 c B
4 a B
5 b B
6 a C
7 a C
8 c C
I would like to find those values of col2 for which there are duplicate a entries in col1. In this example the result should be ['C'], since for df['col2'] == 'C', col1 contains a twice.
I tried this approach
df[(df['col1'] == 'a') & (df['col2'].duplicated())]['col2'].to_list()
but this only works, if the a within a block of rows defined by col2 is at the beginning or end of the block, depending on how you define the keep keyword of duplicated(). In this example, it returns ['B', 'C'], which is not what I want.
Use Series.duplicated only for filtered rows:
df1 = df[df['col1'] == 'a']
out = df1.loc[df1['col2'].duplicated(keep=False), 'col2'].unique().tolist()
print (out)
['C']
Another idea is to use DataFrame.duplicated on both columns and chain it with a mask for rows that match only a:
out = df.loc[df.duplicated(subset=['col1', 'col2'], keep=False) &
(df['col1'] == 'a'), 'col2'].unique().tolist()
print (out)
['C']
You can group your col1 by col2 and count occurrences of 'a'
>>> s = df.col1.groupby(df.col2).sum().str.count('a').gt(1)
>>> s[s].index.values
array(['C'], dtype=object)
A more generalised solution using Groupby.count and index.get_level_values:
In [2632]: x = df.groupby(['col1', 'col2']).col2.count().to_frame()
In [2642]: res = x[x.col2 > 1].index.get_level_values(1).tolist()
In [2643]: res
Out[2643]: ['C']
I am looking to find the unique values for each column in my dataframe (values that occur only once in the whole dataframe).
Col1 Col2 Col3
1 A A B
2 C A B
3 B B F
Col1 has C as a unique value, Col2 has none and Col3 has F.
Any genius ideas? Thank you!
You can use stack to get a Series, then drop_duplicates with keep=False to remove all duplicated values, remove the first index level with reset_index, and finally reindex by the original columns:
df = (df.stack()
        .drop_duplicates(keep=False)
        .reset_index(level=0, drop=True)
        .reindex(index=df.columns))
print (df)
Col1 C
Col2 NaN
Col3 F
dtype: object
The solution above works nicely only if there is at most one unique value per column.
Here I try to create a more general solution:
print (df)
Col1 Col2 Col3
1 A A B
2 C A X
3 B B F
s = df.stack().drop_duplicates(keep=False).reset_index(level=0, drop=True)
print (s)
Col1 C
Col3 X
Col3 F
dtype: object
s = s.groupby(level=0).unique().reindex(index=df.columns)
print (s)
Col1 [C]
Col2 NaN
Col3 [X, F]
dtype: object
I don't believe this is exactly what you want, but for reference you can find the unique values across a DataFrame using numpy's .unique() like so:
>>> np.unique(df[['Col1', 'Col2', 'Col3']])
['A' 'B' 'C' 'F']
You can also get unique values of a specific column, e.g. Col3:
>>> df.Col3.unique()
['B' 'F']
I have 3 dataframes that I'd like to combine. They look like this:
df1         |df2         |df3
col1 col2   |col1 col2   |col1 col3
1    5      |2    9      |1    some
            |            |2    data
I'd like the first two df-s to be merged into the third df based on col1, so the desired output is
df3
col1 col3 col2
1 some 5
2 data 9
How can I achieve this? I'm trying:
df3['col2'] = df1[df1.col1 == df3.col1].col2 if df1[df1.col1 == df3.col1].col2 is not None else df2[df2.col1 == df3.col1].col2
For this I get ValueError: Series lengths must match to compare
It is guaranteed that df3's col1 values are present in either df1 or df2. What's the way to do this? PLEASE NOTE that a simple concat will not work, since there is other data in df3, not just col1.
If df1 and df2 don't have duplicates in col1, you can try this:
pd.concat([df1, df2]).merge(df3)
Data:
df1 = pd.DataFrame({'col1': [1], 'col2': [5]})
df2 = pd.DataFrame({'col1': [2], 'col2': [9]})
df3 = pd.DataFrame({'col1': [1,2], 'col3': ['some', 'data']})
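With that sample data, the suggestion above can be run end to end like this (a sketch; column order in the result follows the left frame first):
import pandas as pd

df1 = pd.DataFrame({'col1': [1], 'col2': [5]})
df2 = pd.DataFrame({'col1': [2], 'col2': [9]})
df3 = pd.DataFrame({'col1': [1, 2], 'col3': ['some', 'data']})

# Stack df1/df2 on top of each other, then merge onto df3 via col1
out = pd.concat([df1, df2]).merge(df3)
print(out)
#    col1  col2  col3
# 0     1     5  some
# 1     2     9  data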