Python - Combining pandas dataframes

I have 3 dataframes that I'd like to combine. They look like this:
df1         |df2         |df3
col1 col2   |col1 col2   |col1 col3
1    5      |2    9      |1    some
            |            |2    data
I'd like the first two df-s to be merged into the third df based on col1, so the desired output is
df3
col1 col3 col2
1 some 5
2 data 9
How can I achieve this? I'm trying:
df3['col2'] = df1[df1.col1 == df3.col1].col2 if df1[df1.col1 == df3.col1].col2 is not None else df2[df2.col1 == df3.col1].col2
For this I get ValueError: Series lengths must match to compare
It is guaranteed that each of df3's col1 values is present in either df1 or df2. What's the way to do this? PLEASE NOTE that a simple concat will not work, since there is other data in df3, not just col1.

If df1 and df2 don't have duplicates in col1, you can try this:
pd.concat([df1, df2]).merge(df3)
Data:
df1 = pd.DataFrame({'col1': [1], 'col2': [5]})
df2 = pd.DataFrame({'col1': [2], 'col2': [9]})
df3 = pd.DataFrame({'col1': [1,2], 'col3': ['some', 'data']})
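Running the sample data through that one-liner as a quick sanity check (my own verification; merge's default inner join on the shared col1 column keeps only matching rows):

pd.concat([df1, df2]).merge(df3)
#    col1  col2  col3
# 0     1     5  some
# 1     2     9  data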

drop all duplicate values in python

Assume I have the following dataframe in Python:
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
B = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
This gives me the dataframe above. I want to remove the rows that have the same value for Col1 but different values for Col3. I have tried to use the drop_duplicates command with different subsets of columns, but it does not give what I want. I could write a for loop, but that is not efficient at all (especially since you might have many more columns than this).
C = B.drop_duplicates(['Col1','Col3'], keep=False)
Can anyone tell me if there is a pandas command that can do this without using a for loop?
The expected output would be just the B row, since the A and C rows are removed because they share a Col1 value but have different Col3 values.
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
df = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
output = df.drop_duplicates('Col1', keep=False)
print(output)
Output:
Col1 Col2 Col3
2 B 8 9
This can do the job:
grouped_df = df.groupby("Col1")
groups = [grouped_df.get_group(key) for key in grouped_df.groups.keys() if len(grouped_df.get_group(key)["Col3"].unique()) == 1]
new_df = pd.concat(groups).reset_index(drop = True)
Output:
  Col1  Col2  Col3
0    B     8     9
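For what it's worth, the same filtering can be written more compactly with groupby plus filter (my own sketch, not part of the original answer):

# keep only the Col1 groups whose Col3 values are all identical
new_df = df.groupby('Col1').filter(lambda g: g['Col3'].nunique() == 1)

On the sample data this keeps only the B row, matching the output above.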

Pandas merge not working after using StringIO

I need to convert a string into a pandas DataFrame to further merge it with another DataFrame; unfortunately, the merge is not working.
str_data = StringIO("""col1;col2
one;apple
two;lemon""")
df = pd.read_csv(str_data, sep =";")
df2 = pd.DataFrame([['one', 10], ['two', 15]], columns = ['col1', 'col3'])
df=df.merge(df2, how='left', on='col1')
The resulting DataFrame has only NaNs in col3, not the integers from col3 in df2:
col1 col2 col3
0 one apple NaN
1 two lemon NaN
Thanks in advance for any recommendations!
This works for me:
from io import StringIO
str_data = StringIO("""col1;col2
one;apple
two;lemon""")
df = pd.read_csv(str_data, sep =";")
df2 = pd.DataFrame([['one', 10], ['two', 15]], columns = ['col1', 'col3'])
df=df.merge(df2, how='left', on='col1')
print (df)
col1 col2 col3
0 one apple 10
1 two lemon 15
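If the identical code still produces NaNs in your environment, one thing worth ruling out (my own guess, not confirmed anywhere in this thread) is stray whitespace around the key values, which silently prevents the merge keys from matching:

# hypothetical fix: strip whitespace from both key columns before merging
df['col1'] = df['col1'].str.strip()
df2['col1'] = df2['col1'].str.strip()
df = df.merge(df2, how='left', on='col1')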

pandas select rows by condition for all of dataframe columns

I have a dataframe
d = {'col1': [1, 2], 'col2': [3, 4], 'col3' : [5,6]}
df = pd.DataFrame(data=d)
df
col1 col2 col3
0 1 3 5
1 2 4 6
For example, I need to select all rows with value = 1, so my code is:
df[df['col1']==1]
col1 col2 col3
0 1 3 5
But how can I check not only 'col1' but all columns? I have tried this code:
for col in df.columns:
    print(df[df[col] == 1])
but the output is not a single DataFrame view:
col1 col2 col3
0 1 3 5
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
Can I go over all the columns and get a single DataFrame view?
You can use df.eq to check whether any value in the df is equal to 1, then df.any on axis=1, which returns True for every row where any column value is 1. Finally, use boolean indexing:
output = df[df.eq(1).any(axis=1)]
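With the sample frame above, this keeps only the first row, the only one containing a 1:

print(output)
#    col1  col2  col3
# 0     1     3     5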

How to find common elements in several dataframes

I have the following dataframes:
df1 = pd.DataFrame({'col1': ['A','M','C'],
                    'col2': ['B','N','O'],
                    # plus many more
                    })
df2 = pd.DataFrame({'col3': ['A','A','A','B','B','B'],
                    'col4': ['M','P','Q','J','P','M'],
                    # plus many more
                    })
Which look like these:
df1:
col1 col2
A B
M N
C O
#...plus many more
df2:
col3 col4
A M
A P
A Q
B J
B P
B M
#...plus many more
The objective is to create a dataframe containing all elements in col4 for each col3 that occurs in one row of df1. For example, let's look at row 1 of df1. We see that A is in col1 and B is in col2. Then, we go to df2 and check what col4 is for df2[df2['col3'] == 'A'] and df2[df2['col3'] == 'B']. We get, for A: ['M','P','Q'], and for B: ['J','P','M']. The intersection of these is ['M', 'P'], so what I want is something like this:
col1 col2 col4
A B M
A B P
....(and so on for the other rows)
The naive way to go about this is to iterate over rows and then get the intersection, but I was wondering if it's possible to solve this via merging techniques or other faster methods. So far, I can't think of any way how.
This should achieve what you want, using a combination of merge, groupby and set intersection:
# Getting tuple of all col1=col3 values in col4
df3 = pd.merge(df1, df2, left_on='col1', right_on='col3')
df3 = df3.groupby(['col1', 'col2'])['col4'].apply(tuple)
df3 = df3.reset_index()
# Getting tuple of all col2=col3 values in col4
df3 = pd.merge(df3, df2, left_on='col2', right_on='col3')
df3 = df3.groupby(['col1', 'col2', 'col4_x'])['col4_y'].apply(tuple)
df3 = df3.reset_index()
# Taking set intersection of our two tuples
df3['col4'] = df3.apply(lambda row: set(row['col4_x']) & set(row['col4_y']), axis=1)
# Dropping unnecessary columns
df3 = df3.drop(['col4_x', 'col4_y'], axis=1)
print(df3)
col1 col2 col4
0 A B {P, M}
If required, see this answer for examples of how to 'melt' col4.
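If you are on pandas 0.25 or newer, one way to melt those col4 sets into one row per element (my own sketch, assuming DataFrame.explode is available) is:

# convert each set to a list so explode can expand it row-wise
melted = df3.assign(col4=df3['col4'].map(list)).explode('col4')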

Count unique symbols per column in Pandas

I was wondering how to calculate the number of unique symbols that occur in a single column in a dataframe. For example:
df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''], 'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})
df
  col1     col2
0    a      ddd
1  bbb    eeeee
2   cc       ff
3         ggggggg
It should calculate that col1 contains 3 unique symbols, and col2 contains 4 unique symbols.
My code so far (but this might be wrong):
unique_symbols = [0] * 203
i = 0
for col in df.columns:
    observed_symbols = []
    df_temp = df[[col]]
    df_temp = df_temp.astype('str')
    # This part is where I am not so sure:
    # iterate over every character (symbol) in each cell of the column
    for index, row in df_temp.iterrows():
        for symbol in row[col]:
            if symbol not in observed_symbols:
                observed_symbols.append(symbol)
    unique_symbols[i] = len(observed_symbols)
    i += 1
Thanks in advance
Option 1
str.join + set inside a dict comprehension
For problems like this, I'd prefer falling back to Python, because it's so much faster.
{c : len(set(''.join(df[c]))) for c in df.columns}
{'col1': 3, 'col2': 4}
Option 2
agg
If you want to stay in pandas space.
df.agg(lambda x: set(''.join(x)), axis=0).str.len()
Or,
df.agg(lambda x: len(set(''.join(x))), axis=0)
col1 3
col2 4
dtype: int64
Here is one way:
df.apply(lambda x: len(set(''.join(x.astype(str)))))
col1 3
col2 4
Maybe
df.sum().apply(set).str.len()
Out[673]:
col1 3
col2 4
dtype: int64
One more option:
In [38]: df.applymap(lambda x: len(set(x))).sum()
Out[38]:
col1 3
col2 4
dtype: int64
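One caveat worth noting (my own observation, not part of the original answer): applymap counts unique symbols per cell and sum then adds those counts, so this only agrees with the other answers when no symbol repeats across cells of the same column, which happens to be true for the sample data:

# counter-example: 'b' appears in two different cells of col1,
# so the per-cell counts sum to 4 while the true unique count is 3
df_x = pd.DataFrame({'col1': ['ab', 'bc']})
df_x.applymap(lambda x: len(set(x))).sum()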
