Python - Combining pandas dataframes

I have 3 dataframes that I'd like to combine. They look like this:
df1         |df2         |df3
col1 col2   |col1 col2   |col1 col3
1    5      |2    9      |1    some
            |            |2    data
I'd like the first two df-s to be merged into the third df based on col1, so the desired output is
df3
col1 col3 col2
1 some 5
2 data 9
How can I achieve this? I'm trying:
df3['col2'] = df1[df1.col1 == df3.col1].col2 if df1[df1.col1 == df3.col1].col2 is not None else df2[df2.col1 == df3.col1].col2
For this I get ValueError: Series lengths must match to compare
It is guaranteed that each of df3's col1 values is present in either df1 or df2. What's the way to do this? PLEASE NOTE that a simple concat will not work, since there is other data in df3, not just col1.

If df1 and df2 don't have duplicates in col1, you can try this:
pd.concat([df1, df2]).merge(df3)
Data:
df1 = pd.DataFrame({'col1': [1], 'col2': [5]})
df2 = pd.DataFrame({'col1': [2], 'col2': [9]})
df3 = pd.DataFrame({'col1': [1,2], 'col3': ['some', 'data']})
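Running the sample data through that one-liner as a quick sanity check (my own verification; merge's default inner join on the shared col1 column keeps only matching rows):

pd.concat([df1, df2]).merge(df3)
#    col1  col2  col3
# 0     1     5  some
# 1     2     9  data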

drop all duplicate values in python

Assume I have the following dataframe in Python:
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
B = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
This gives me the dataframe above. I want to remove the rows that have the same value for Col1 but different values for Col3. I have tried to use the drop_duplicates command with different subsets of columns, but it does not give what I want. I could write a for loop, but that is not efficient at all (especially since you might have many more columns than this).
C = B.drop_duplicates(['Col1','Col3'], keep=False)
Can anyone tell me if there is a pandas command that can do this without using a for loop?
The expected output would be just the B row, since the A and C rows are removed because they share a Col1 value but have different Col3 values.
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
df = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
output = df.drop_duplicates('Col1', keep=False)
print(output)
Output:
Col1 Col2 Col3
2 B 8 9
This can do the job:
grouped_df = df.groupby("Col1")
groups = [grouped_df.get_group(key) for key in grouped_df.groups.keys() if len(grouped_df.get_group(key)["Col3"].unique()) == 1]
new_df = pd.concat(groups).reset_index(drop = True)
Output:
  Col1  Col2  Col3
0    B     8     9
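For what it's worth, the same filtering can be written more compactly with groupby plus filter (my own sketch, not part of the original answer):

# keep only the Col1 groups whose Col3 values are all identical
new_df = df.groupby('Col1').filter(lambda g: g['Col3'].nunique() == 1)

On the sample data this keeps only the B row, matching the output above.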

Pandas merge not working after using StringIO

I need to convert a string into a pandas DataFrame to further merge it with another DataFrame; unfortunately, the merge is not working.
str_data = StringIO("""col1;col2
one;apple
two;lemon""")
df = pd.read_csv(str_data, sep =";")
df2 = pd.DataFrame([['one', 10], ['two', 15]], columns = ['col1', 'col3'])
df=df.merge(df2, how='left', on='col1')
The resulting DataFrame has only NaNs in col3, not the integers from col3 in df2:
col1 col2 col3
0 one apple NaN
1 two lemon NaN
Thanks in advance for any recommendations!
This works for me:
from io import StringIO
str_data = StringIO("""col1;col2
one;apple
two;lemon""")
df = pd.read_csv(str_data, sep =";")
df2 = pd.DataFrame([['one', 10], ['two', 15]], columns = ['col1', 'col3'])
df=df.merge(df2, how='left', on='col1')
print (df)
col1 col2 col3
0 one apple 10
1 two lemon 15
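If the identical code still produces NaNs in your environment, one thing worth ruling out (my own guess, not confirmed anywhere in this thread) is stray whitespace around the key values, which silently prevents the merge keys from matching:

# hypothetical fix: strip whitespace from both key columns before merging
df['col1'] = df['col1'].str.strip()
df2['col1'] = df2['col1'].str.strip()
df = df.merge(df2, how='left', on='col1')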

pandas select rows by condition for all of dataframe columns

I have a dataframe
d = {'col1': [1, 2], 'col2': [3, 4], 'col3' : [5,6]}
df = pd.DataFrame(data=d)
df
col1 col2 col3
0 1 3 5
1 2 4 6
For example, I need to select all rows with value = 1, so my code is:
df[df['col1']==1]
col1 col2 col3
0 1 3 5
But how can I check not only 'col1' but all columns? I have tried this code:
for col in df.columns:
    print(df[df[col] == 1])
but the output is not a single DataFrame view:
col1 col2 col3
0 1 3 5
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
Can I go over all the columns and get a single DataFrame view?
You can use df.eq to check whether any value in the df is equal to 1, then df.any on axis=1, which returns True for every row where any column value is 1. Finally, use boolean indexing:
output = df[df.eq(1).any(axis=1)]
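With the sample frame above, this keeps only the first row, the only one containing a 1:

print(output)
#    col1  col2  col3
# 0     1     3     5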

How to find common elements in several dataframes

I have the following dataframes:
df1 = pd.DataFrame({'col1': ['A','M','C'],
                    'col2': ['B','N','O'],
                    # plus many more
                    })
df2 = pd.DataFrame({'col3': ['A','A','A','B','B','B'],
                    'col4': ['M','P','Q','J','P','M'],
                    # plus many more
                    })
Which look like these:
df1:
col1 col2
A B
M N
C O
#...plus many more
df2:
col3 col4
A M
A P
A Q
B J
B P
B M
#...plus many more
The objective is to create a dataframe containing all elements in col4 for each col3 that occurs in one row of df1. For example, let's look at row 1 of df1. We see that A is in col1 and B is in col2. Then, we go to df2 and check what col4 is for df2[df2['col3'] == 'A'] and df2[df2['col3'] == 'B']. We get, for A: ['M','P','Q'], and for B: ['J','P','M']. The intersection of these is ['M', 'P'], so what I want is something like this:
col1 col2 col4
A B M
A B P
....(and so on for the other rows)
The naive way to go about this is to iterate over rows and then get the intersection, but I was wondering if it's possible to solve this via merging techniques or other faster methods. So far, I can't think of any way how.
This should achieve what you want, using a combination of merge, groupby and set intersection:
# Getting tuple of all col1=col3 values in col4
df3 = pd.merge(df1, df2, left_on='col1', right_on='col3')
df3 = df3.groupby(['col1', 'col2'])['col4'].apply(tuple)
df3 = df3.reset_index()
# Getting tuple of all col2=col3 values in col4
df3 = pd.merge(df3, df2, left_on='col2', right_on='col3')
df3 = df3.groupby(['col1', 'col2', 'col4_x'])['col4_y'].apply(tuple)
df3 = df3.reset_index()
# Taking set intersection of our two tuples
df3['col4'] = df3.apply(lambda row: set(row['col4_x']) & set(row['col4_y']), axis=1)
# Dropping unnecessary columns
df3 = df3.drop(['col4_x', 'col4_y'], axis=1)
print(df3)
col1 col2 col4
0 A B {P, M}
If required, see this answer for examples of how to 'melt' col4.
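If you are on pandas 0.25 or newer, one way to melt those col4 sets into one row per element (my own sketch, assuming DataFrame.explode is available) is:

# convert each set to a list so explode can expand it row-wise
melted = df3.assign(col4=df3['col4'].map(list)).explode('col4')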

Count unique symbols per column in Pandas

I was wondering how to calculate the number of unique symbols that occur in a single column in a dataframe. For example:
df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''], 'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})
df
  col1     col2
0    a      ddd
1  bbb    eeeee
2   cc       ff
3         ggggggg
It should calculate that col1 contains 3 unique symbols, and col2 contains 4 unique symbols.
My code so far (but this might be wrong):
unique_symbols = [0] * 203
i = 0
for col in df.columns:
    observed_symbols = []
    df_temp = df[[col]]
    df_temp = df_temp.astype('str')
    # This part is where I am not so sure:
    # iterate over every character (symbol) in each cell of the column
    for index, row in df_temp.iterrows():
        for symbol in row[col]:
            if symbol not in observed_symbols:
                observed_symbols.append(symbol)
    unique_symbols[i] = len(observed_symbols)
    i += 1
Thanks in advance
Option 1
str.join + set inside a dict comprehension
For problems like this, I'd prefer falling back to Python, because it's so much faster.
{c : len(set(''.join(df[c]))) for c in df.columns}
{'col1': 3, 'col2': 4}
Option 2
agg
If you want to stay in pandas space.
df.agg(lambda x: set(''.join(x)), axis=0).str.len()
Or,
df.agg(lambda x: len(set(''.join(x))), axis=0)
col1 3
col2 4
dtype: int64
Here is one way:
df.apply(lambda x: len(set(''.join(x.astype(str)))))
col1 3
col2 4
Maybe
df.sum().apply(set).str.len()
Out[673]:
col1 3
col2 4
dtype: int64
One more option:
In [38]: df.applymap(lambda x: len(set(x))).sum()
Out[38]:
col1 3
col2 4
dtype: int64
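One caveat worth noting (my own observation, not part of the original answer): applymap counts unique symbols per cell and sum then adds those counts, so this only agrees with the other answers when no symbol repeats across cells of the same column, which happens to be true for the sample data:

# counter-example: 'b' appears in two different cells of col1,
# so the per-cell counts sum to 4 while the true unique count is 3
df_x = pd.DataFrame({'col1': ['ab', 'bc']})
df_x.applymap(lambda x: len(set(x))).sum()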
