I need to convert a string into a pandas DataFrame so that I can merge it with another DataFrame; unfortunately, the merge is not working.
str_data = StringIO("""col1;col2
one;apple
two;lemon""")
df = pd.read_csv(str_data, sep=";")
df2 = pd.DataFrame([['one', 10], ['two', 15]], columns=['col1', 'col3'])
df = df.merge(df2, how='left', on='col1')
The resulting DataFrame has only NaNs in col3, not the integers from col3 in df2:
col1 col2 col3
0 one apple NaN
1 two lemon NaN
Thanks in advance for any recommendations!
For me it is working well:
from io import StringIO
str_data = StringIO("""col1;col2
one;apple
two;lemon""")
df = pd.read_csv(str_data, sep=";")
df2 = pd.DataFrame([['one', 10], ['two', 15]], columns=['col1', 'col3'])
df = df.merge(df2, how='left', on='col1')
print(df)
col1 col2 col3
0 one apple 10
1 two lemon 15
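If the same code still produces NaNs on your side, one possible culprit (an assumption, since the posted data doesn't show it) is stray whitespace around the key values, which makes 'one ' never match 'one' during the merge. A minimal sketch of normalizing the join key first:
# Hedged fix: strip whitespace from the join key on both sides before
# merging, in case the source string contained padded values.
df['col1'] = df['col1'].str.strip()
df2['col1'] = df2['col1'].str.strip()
df = df.merge(df2, how='left', on='col1')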
Related
I have the following data:
data = [
{'col1': 11, 'col2': 111, 'col3': 1111},
{'col1': 22, 'col2': 222, 'col3': 2222},
{'col1': 33, 'col2': 333, 'col3': 3333},
{'col1': 44, 'col2': 444, 'col3': 4444}
]
and the following list:
lst = [(11, 111), (22, 222), (99, 999)]
I would like to keep only the rows of my data whose (col1, col2) pair does not exist in lst. The result for the above example would be:
[
{'col1': 33, 'col2': 333, 'col3': 3333},
{'col1': 44, 'col2': 444, 'col3': 4444}
]
How can I achieve that?
import pandas as pd
df = pd.DataFrame(data)
list_df = pd.DataFrame(lst)
# command like ??
# df.subtract(list_df)
If you need to test by pairs, you can compare a MultiIndex created from both columns using Index.isin, inverting the mask with ~ for boolean indexing:
df = df[~df.set_index(['col1','col2']).index.isin(lst)]
print(df)
col1 col2 col3
2 33 333 3333
3 44 444 4444
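Note that Index.isin on a MultiIndex accepts a plain list of tuples, so lst can be used directly without building a second DataFrame first.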
Or use a left join via merge with the indicator parameter:
mask = df.merge(list_df,
                left_on=['col1','col2'],
                right_on=[0,1],
                indicator=True,
                how='left')['_merge'].eq('left_only')
df = df[mask]
print(df)
col1 col2 col3
2 33 333 3333
3 44 444 4444
You can create a tuple out of your col1 and col2 columns and then check whether those tuples are in the lst list. Then drop the rows with True values.
df.drop(df.apply(lambda x: (x['col1'], x['col2']), axis=1)
          .isin(lst)
          .loc[lambda x: x == True]
          .index)
With this solution you don't even have to turn the second list into a DataFrame.
You can create tuples of col1 and col2 using .apply() with tuple, then test whether these tuples are in lst with .isin() (prepend ~ for the negated/opposite condition).
Finally, locate the rows with .loc, as follows:
df.loc[~df[['col1', 'col2']].apply(tuple, axis=1).isin(lst)]
Result:
col1 col2 col3
2 33 333 3333
3 44 444 4444
You can extract the lists of values using zip and slice using a mask generated with isin:
a, b = zip(*lst)
df[~(df['col1'].isin(a) | df['col2'].isin(b))]
output:
col1 col2 col3
2 33 333 3333
3 44 444 4444
Or, if you need both conditions to be true to drop:
df[~(df['col1'].isin(a) & df['col2'].isin(b))]
NB: if you have many columns, you can automate the process:
mask = sum(df[col].isin(v) for col, v in zip(df, zip(*lst))).eq(0)
df[mask]
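Here sum adds up, per row, how many columns contain one of their corresponding values from lst, and .eq(0) keeps only the rows that match in no column at all.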
Use Polars' anti join:
import polars as pl

pl.from_pandas(df).join(pl.from_pandas(pd.DataFrame(lst, columns=["col1", "col2"])),
                        on=["col1", "col2"], how="anti").to_pandas()
Result:
col1 col2 col3
2 33 333 3333
3 44 444 4444
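An anti join keeps exactly the rows of the left frame that have no match on the join keys, which is precisely the "not in lst" requirement here.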
My table looks like the following:
import pandas as pd
d = {'col1': ['a>b>c']}
df = pd.DataFrame(data=d)
print(df)
"""
col1
0 a>b>c
"""
and my desired output needs to look like this:
d1 = {'col1': ['a>b>c'],'col11': ['a'],'col12': ['b'],'col13': ['c']}
d1 = pd.DataFrame(data=d1)
print(d1)
"""
col1 col11 col12 col13
0 a>b>c a b c
"""
I know I have to use the .split('>') method, but I don't know how to go on from there. Any help?
You can simply split using str.split('>') and expand the DataFrame:
import pandas as pd
d = {'col1': ['a>b>c'],'col2':['a>b>c']}
df = pd.DataFrame(data=d)
print(df)
col='col1'
# temp = df[col].str.split('>', expand=True).add_prefix(col)
temp = df[col].str.split('>', expand=True).rename(columns=lambda x: col + str(int(x) + 1))
temp.merge(df, left_index=True, right_index=True, how='outer')
Out:
col1 col11 col12 col13
0 a>b>c a b c
In case you want to do it on multiple columns, you can also use:
for col in df.columns:
    temp = df[col].str.split('>', expand=True).rename(columns=lambda x: col + str(int(x) + 1))
    df = temp.merge(df, left_index=True, right_index=True, how='outer')
Out:
col21 col22 col23 col11 col12 col13 col1 col2
0 a b c a b c a>b>c a>b>c
Using split:
d = {'col1': ['a>b>c']}
df = pd.DataFrame(data=d)
df = pd.concat([df, df.col1.str.split('>', expand=True)], axis=1)
df.columns = ['col1', 'col11', 'col12', 'col13']
df
Output:
col1 col11 col12 col13
0 a>b>c a b c
I have the following dataframes:
df1 = pd.DataFrame({'col1': ['A','M','C'],
'col2': ['B','N','O'],
# plus many more
})
df2 = pd.DataFrame({'col3': ['A','A','A','B','B','B'],
'col4': ['M','P','Q','J','P','M'],
# plus many more
})
They look like this:
df1:
col1 col2
A B
M N
C O
#...plus many more
df2:
col3 col4
A M
A P
A Q
B J
B P
B M
#...plus many more
The objective is to create a dataframe containing all elements in col4 for each col3 value that occurs in a row of df1. For example, let's look at row 1 of df1. We see that A is in col1 and B is in col2. Then we go to df2 and check what col4 is for df2[df2['col3'] == 'A'] and df2[df2['col3'] == 'B']. We get, for A: ['M','P','Q'], and for B: ['J','P','M']. The intersection of these is ['M', 'P'], so what I want is something like this:
col1 col2 col4
A B M
A B P
....(and so on for the other rows)
The naive way to go about this is to iterate over the rows and then take the intersection, but I was wondering whether it's possible to solve this via merging techniques or other faster methods. So far, I can't think of a way.
This should achieve what you want, using a combination of merge, groupby and set intersection:
# Getting tuple of all col1=col3 values in col4
df3 = pd.merge(df1, df2, left_on='col1', right_on='col3')
df3 = df3.groupby(['col1', 'col2'])['col4'].apply(tuple)
df3 = df3.reset_index()
# Getting tuple of all col2=col3 values in col4
df3 = pd.merge(df3, df2, left_on='col2', right_on='col3')
df3 = df3.groupby(['col1', 'col2', 'col4_x'])['col4_y'].apply(tuple)
df3 = df3.reset_index()
# Taking set intersection of our two tuples
df3['col4'] = df3.apply(lambda row: set(row['col4_x']) & set(row['col4_y']), axis=1)
# Dropping unnecessary columns
df3 = df3.drop(['col4_x', 'col4_y'], axis=1)
print(df3)
col1 col2 col4
0 A B {P, M}
If required, see this answer for examples of how to 'melt' col4.
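A hedged sketch of one way to do that, assuming pandas >= 0.25 where DataFrame.explode is available (explode treats the set in col4 as list-like):
# Expand each set in col4 into one row per element;
# element order within a set is not guaranteed.
print(df3.explode('col4'))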
I have 3 dataframes that I'd like to combine. They look like this:
df1          | df2          | df3
col1  col2   | col1  col2   | col1  col3
1     5      | 2     9      | 1     some
             |              | 2     data
I'd like the first two df-s to be merged into the third df based on col1, so the desired output is
df3
col1 col3 col2
1 some 5
2 data 9
How can I achieve this? I'm trying:
df3['col2'] = df1[df1.col1 == df3.col1].col2 if df1[df1.col1 == df3.col1].col2 is not None else df2[df2.col1 == df3.col1].col2
For this I get ValueError: Series lengths must match to compare
It is guaranteed that df3's col1 values are present in either df1 or df2. What's the way to do this? PLEASE NOTE that a simple concat will not work, since there is other data in df3, not just col1.
If df1 and df2 don't have duplicates in col1, you can try this:
pd.concat([df1, df2]).merge(df3)
Data:
df1 = pd.DataFrame({'col1': [1], 'col2': [5]})
df2 = pd.DataFrame({'col1': [2], 'col2': [9]})
df3 = pd.DataFrame({'col1': [1,2], 'col3': ['some', 'data']})
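For reference, running this on the sample data gives the merged frame (the column order may differ from the desired output, but the content matches):
print(pd.concat([df1, df2]).merge(df3))
#    col1  col2  col3
# 0     1     5  some
# 1     2     9  data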
I've found a behavior in pandas DataFrames that I don't understand.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 10, (3, 3)), index=['one', 'one', 'two'], columns=['col1', 'col2', 'col3'])
new_data = pd.Series({'col1': 'new', 'col2': 'new', 'col3': 'new'})
df.iloc[0] = new_data
# resulting df looks like:
# col1 col2 col3
#one new new new
#one 9 6 1
#two 8 3 7
But if I try to add a dictionary instead, I get this:
new_data = {'col1': 'new', 'col2': 'new', 'col3': 'new'}
df.iloc[0] = new_data
#
# col1 col2 col3
#one col2 col3 col1
#one 2 1 7
#two 5 8 6
Why is this happening? In the process of writing up this question, I realized that most likely df.iloc is only taking the keys from new_data, which also explains why the values are out of order. But, again, why is this the case? If I try to create a DataFrame from a dictionary, it handles the keys as if they were columns:
pd.DataFrame([new_data])
# col1 col2 col3
#0 new new new
Why is that not the default behavior in df.iloc?
It's the difference between how a dictionary iterates and how a pandas series is treated.
A pandas Series matches its index to the columns when being assigned to a row, and matches to the index when being assigned to a column. After that, it assigns the value that corresponds to each matched index or column.
When an object is not a pandas object with a convenient index to match on, pandas will iterate through the object. A dictionary iterates through its keys, and that's why you see the dictionary keys in that row's slots. Dictionaries are not sorted (in Python versions before 3.7, key order was arbitrary), and that's why you see shuffled keys in that row.
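A minimal sketch of the Series alignment behavior, with distinct values so the matching is visible (the dict behavior shown in the question may vary across pandas versions):
import pandas as pd

df = pd.DataFrame({'col1': [1], 'col2': [2], 'col3': [3]})

# A Series aligns by label, so a scrambled key order still lands correctly:
df.loc[0] = pd.Series({'col3': 'c', 'col1': 'a', 'col2': 'b'})
print(df)  # col1='a', col2='b', col3='c' -- matched by label, not position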
just how to do it
This is a compact way to accomplish your task. I removed the index of your df, as "one" appeared twice, which prevents unique indexing.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(1, 10, (3, 3)), columns=['col1', 'col2', 'col3'])
>>> new_data = {'col1': 'new', 'col2': 'new', 'col3': 'new'}
>>>
>>> df
col1 col2 col3
0 1 6 1
1 4 2 3
2 6 2 3
>>> new_data
{'col1': 'new', 'col2': 'new', 'col3': 'new'}
>>>
>>> df.loc[0, new_data.keys()] = new_data.values()
>>> df
col1 col2 col3
0 new new new
1 4 2 3
2 6 2 3
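This works because dict.keys() and dict.values() are guaranteed to iterate in the same order, so each value lands under its own column label via .loc.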
a compact way
using an intermediate cast to pd.Series
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(1, 10, (3, 3)), columns=['col1', 'col2', 'col3'])
>>> new_data = {'col1': 'new1', 'col2': 'new2', 'col3': 'new3'}
>>>
>>> df
col1 col2 col3
0 5 7 9
1 8 7 8
2 5 3 3
>>> new_data
{'col1': 'new1', 'col2': 'new2', 'col3': 'new3'}
>>>
>>> df.loc[0] = pd.Series(new_data)
>>> df
col1 col2 col3
0 new1 new2 new3
1 8 7 8
2 5 3 3