I am using the code below to search a .csv file, match a column in both files, and grab a different column I want, adding it as a new column. However, I am trying to match on two columns instead of one. Is there a way to do this?
import pandas as pd

df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")

def lookup_prod(ip):
    for row in df2.itertuples():
        if ip in row[1]:
            return row[3]
    else:
        return '0'

df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
#df1.to_csv('file_name.csv')
The code above matches the shared column 'name' in both files and gets the column I request (row[3]) from df2. I want the code to match on both the 'name' column and another column, 'price', and take the value in row[3] only when both columns match between df1 and df2.
df 1 :
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df 2 :
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from df2, matching only on df1 name = df2 name), the produced result is:
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is if both df1 name = df2 name and df1 price = df2 price, then take the column df2 want, so the desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name','price'], how='left').fillna(0)
The method represents missing values as NaN, so the column's dtype changes to float64, but you can cast it back to int after filling the missing values with 0.
Also please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
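As a self-contained sketch (the sample frames below are reconstructed from the question's data), including the cast back to int after filling:

```python
import pandas as pd

# Sample frames reconstructed from the question
df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

# Left merge on both keys; rows without a match get NaN in 'want'
out = df1.merge(df2, on=['name', 'price'], how='left')

# fillna leaves the column as float64; cast back to int explicitly
out['want'] = out['want'].fillna(0).astype(int)
print(out)
```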
If you are matching the two dataframes on both name and price, you can use df.where and df.isin (note this relies on df1 and df2 sharing the same index, since both where and DataFrame.isin align on it):
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge in order to avoid duplication on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" rows of df1, to which you add a constant-valued column. It would look something like:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
df_anti = df1.loc[~pd.Series(zip(df1.name, df1.price)).isin(merged_pairs)].copy()  # .copy() avoids SettingWithCopyWarning
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti]) # perhaps ignore_index=True ?
It looks complicated, but it should be quite performant because it filters via a set. There might be a way to set name and price as the index, merge on the index, and then filter by index to avoid the zip-set shenanigans, but I'm no expert on MultiIndex handling.
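For what it's worth, the index-based variant speculated about in the last paragraph can be sketched roughly like this (an assumption, not benchmarked; sample frames reconstructed from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

# Put both key columns into a MultiIndex on each frame
left = df1.set_index(['name', 'price'])
right = df2.set_index(['name', 'price'])

# join aligns rows on the shared MultiIndex; unmatched rows get NaN
out = left.join(right, how='left').reset_index()
out['want'] = out['want'].fillna(0).astype(int)
print(out)
```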
Try this code; it will give you the expected results:

import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

new = pd.merge(df1, df2, how='left', on=['name', 'price'])
print(new.fillna(0))
I have two dataframes I need to match by row. Where a match occurs I need to increment the value of a field in df1 by 1. df2 has multiple matches to df1. I don't want to merge the dataframes, just update df1 based on matches to df2.
The basic logic in my head is: read the first row of df1, then try to match its TRANID to each row of df2. When a match occurs, add 1 to the NUMINSTS value. Then loop back and do the same for the next row of df1. I'm just not sure how to approach this in Python/pandas.
I'm an old COBOL programmer and am just learning Python/Pandas so any help is greatly appreciated.
Input Data
df1:
TRANID NUMINSTS
60000022 22
60000333 6
70000001 15
70000233 60
df2:
TRANID
60000333
70000233
70000233
Output
df3:
TRANID NUMINSTS
60000022 22
60000333 7 #incremented by 1
70000001 15
70000233 62 #incremented by 2
We can filter df1 by each value in df2 and increment the matching rows in place (note: the question's column is TRANID, used consistently below):

import pandas as pd

df1 = pd.DataFrame({"TRANID": ["60000022", "60000333", "70000001", "70000233"],
                    "NUMINSTS": [22, 6, 15, 60]})
df2 = pd.DataFrame({"TRANID": ["60000333", "70000233", "70000233"]})

def add_num(df1, df2):
    for id in list(df2["TRANID"]):
        df1.loc[df1["TRANID"] == id, "NUMINSTS"] += 1
    return df1

df3 = add_num(df1, df2)
print(df3)
You want two cases:
Tranid exists in df1
Tranid doesn't exist in df1.
Here is the code (using .loc to avoid chained assignment, and a set for the membership test, since "x in series" checks the index rather than the values):

import pandas as pd

df1 = pd.DataFrame({'tranid': [1, 2, 3], 'numinst': [2, 4, 6]})
df2 = pd.DataFrame({'tranid': [1, 2, 4]})

tranvalues = set(df1['tranid'])
for i in range(len(df2)):
    if df2['tranid'][i] in tranvalues:
        df1.loc[df1['tranid'] == df2['tranid'][i], 'numinst'] += 1
    else:
        df1.loc[len(df1.index)] = [df2['tranid'][i], 1]
You may try:
df1 = pd.DataFrame({'TRANID':[60000022, 60000333, 70000001, 70000233],
'NUMINSTS':[22,6,15,60]})
df1:
TRANID NUMINSTS
0 60000022 22
1 60000333 6
2 70000001 15
3 70000233 60
df2 = pd.DataFrame({'TRANID':[60000333, 70000233, 70000233]})
df2:
TRANID
0 60000333
1 70000233
2 70000233
Build a dictionary of counts of TRANID values from df2:
d = df2['TRANID'].value_counts().to_dict()
Copy df1 into df3 and update the NUMINSTS column: if the TRANID is in the dictionary above, increment by its count; otherwise keep the value the same:
df3 = df1.copy()
df3['NUMINSTS'] = df3.apply(
    lambda row: row['NUMINSTS'] + d[row['TRANID']] if row['TRANID'] in d else row['NUMINSTS'],
    axis=1)
If you don't want the rows that don't match, you can return None for them instead and then drop the rows with None values:
df3['NUMINSTS'] = df3.apply(
    lambda row: row['NUMINSTS'] + d[row['TRANID']] if row['TRANID'] in d else None,
    axis=1)
df3.dropna(subset=['NUMINSTS'], inplace=True)
df3['NUMINSTS'] = df3['NUMINSTS'].astype(int)
df3.reset_index(inplace=True, drop=True)
Output df3:
TRANID NUMINSTS
0 60000333 7
1 70000233 62
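The same dictionary-of-counts idea can also be written without apply, using Series.map (a sketch on the sample data above):

```python
import pandas as pd

df1 = pd.DataFrame({'TRANID': [60000022, 60000333, 70000001, 70000233],
                    'NUMINSTS': [22, 6, 15, 60]})
df2 = pd.DataFrame({'TRANID': [60000333, 70000233, 70000233]})

# Map each TRANID to its count in df2 (NaN where absent) and add it on
counts = df2['TRANID'].value_counts()
df3 = df1.copy()
df3['NUMINSTS'] += df3['TRANID'].map(counts).fillna(0).astype(int)
print(df3)
```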
I have a DataFrame that looks like this:
df:
amount info
12 {id:'1231232', type:'trade', amount:12}
14 {id:'4124124', info:{operation_type:'deposit'}}
What I want to achieve is this:
df:
amount type operation_type
12 trade Nan
14 Nan deposit
I have tried the df.explode('info') method but with no luck. Are there any other ways to do this?
We could do it in two steps: (i) build a DataFrame df from the data; (ii) use json_normalize on the 'info' column and join the result back to df:
df = pd.DataFrame(data)
out = df.join(pd.json_normalize(df['info'].tolist())[['type', 'info.operation_type']]).drop(columns='info')
out.columns = out.columns.map(lambda x: x.split('.')[-1])
Output:
amount type operation_type
0 12 trade NaN
1 14 NaN deposit
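A self-contained version for reference; the data list below is an assumption reconstructed from the question's sample:

```python
import pandas as pd

# Records reconstructed from the question (an assumption about the raw data)
data = [
    {'amount': 12, 'info': {'id': '1231232', 'type': 'trade', 'amount': 12}},
    {'amount': 14, 'info': {'id': '4124124', 'info': {'operation_type': 'deposit'}}},
]

df = pd.DataFrame(data)

# json_normalize flattens nested dicts; nested keys become dotted column names
flat = pd.json_normalize(df['info'].tolist())[['type', 'info.operation_type']]
out = df.join(flat).drop(columns='info')
out.columns = out.columns.map(lambda x: x.split('.')[-1])
print(out)
```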
I have the dataframe df1 with the columns type, Date and amount.
My goal is to create a DataFrame df2 with a subset of dates from df1, in which each type gets its own column holding that type's amounts for the respective dates.
Input Dataframe:
df1 =
,type,Date,amount
0,42,2017-02-01,4
1,42,2017-02-02,5
2,42,2017-02-03,7
3,42,2017-02-04,2
4,48,2017-02-01,6
5,48,2017-02-02,8
6,48,2017-02-03,3
7,48,2017-02-04,6
8,46,2017-02-01,3
9,46,2017-02-02,8
10,46,2017-02-03,3
11,46,2017-02-04,4
Desired Output, if the subset of Dates are 2017-02-02 and 2017-02-04:
df2 =
,Date,42,48,46
0,2017-02-02,5,8,8
1,2017-02-04,2,6,4
I tried it like this:
types = list(df1["type"].unique())
dates = ["2017-02-02","2017-02-04"]
df2 = pd.DataFrame()
df2["Date"]=dates
for t in types:
    df2[t] = df1[(df1["type"]==t)&(df1[df1["type"]==t][["Date"]]==df2["Date"])][["amount"]]
but with this solution I get a lot of NaNs; it seems my comparison condition is wrong.
This is the output I get:
,Date,42,48,46
0,2017-02-02,,,
1,2017-02-04,,,
You can use .pivot_table() and then filter data:
df2 = df1.pivot_table(
index="Date", columns="type", values="amount", aggfunc="sum"
)
dates = ["2017-02-02", "2017-02-04"]
print(df2.loc[dates].reset_index())
Prints:
type Date 42 46 48
0 2017-02-02 5 8 8
1 2017-02-04 2 4 6
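Note that pivot_table sorts the column labels (42, 46, 48). If the original order of appearance matters, the columns can be reindexed afterwards; a sketch on the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'type': [42, 42, 42, 42, 48, 48, 48, 48, 46, 46, 46, 46],
                    'Date': ['2017-02-01', '2017-02-02', '2017-02-03', '2017-02-04'] * 3,
                    'amount': [4, 5, 7, 2, 6, 8, 3, 6, 3, 8, 3, 4]})

pivoted = df1.pivot_table(index='Date', columns='type', values='amount', aggfunc='sum')
dates = ['2017-02-02', '2017-02-04']

# Select rows by date and order the columns as the types first appear in df1
df2 = pivoted.loc[dates, df1['type'].unique()].reset_index()
print(df2)
```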
I have two dataframes like the following:
df1
id name
-------------------------
0 43 c
1 23 t
2 38 j
3 9 s
df2
user id
--------------------------------------------------
0 222087 27,26
1 1343649 6,47,17
2 404134 18,12,23,22,27,43,38,20,35,1
3 1110200 9,23,2,20,26,47,37
I want to split all the ids in df2 into multiple rows and join the resultant dataframe to df1 on "id".
I do the following:
b = pd.DataFrame(df2['id'].str.split(',').tolist(), index=df2.user_id).stack()
b = b.reset_index()[[0, 'user_id']] # the split values are in a column currently labeled 0
b.columns = ['Item_id', 'user_id']
When I try to merge, I get NaNs in the resultant dataframe.
pd.merge(b, df1, on = "id", how="left")
id user name
-------------------------------------
0 27 222087 NaN
1 26 222087 NaN
2 6 1343649 NaN
3 47 1343649 NaN
4 17 1343649 NaN
So, I tried doing the following:
b['name']=np.nan
for i in range(0, len(df1)):
    b['name'][(b['id'] == df1['id'][i])] = df1['name'][i]
It still gives the same result as above. I am confused as to what could cause this because I am sure both of them should work!
Any help would be much appreciated!
I read similar posts on SO, but none seemed to have a concrete answer. I am also not sure whether this is related to the coding at all.
Thanks in advance!
The problem is that you need to convert column id in df2 to int, because the output of string functions is always string, even when the values are numeric.
df2.id = df2.id.astype(int)
Another solution is convert df1.id to string:
df1.id = df1.id.astype(str)
You get NaNs because there is no match: str values don't match int values.
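A minimal sketch of the mismatch and the fix (sample data trimmed from the question; the column names are assumptions based on the renaming in the question's code):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [43, 23, 38, 9], 'name': ['c', 't', 'j', 's']})
# After str.split the exploded ids are strings
b = pd.DataFrame({'Item_id': ['27', '26', '43'], 'user_id': [222087, 222087, 404134]})

# Cast the string ids to int so the merge keys have matching dtypes
b['Item_id'] = b['Item_id'].astype(int)
out = b.merge(df1, left_on='Item_id', right_on='id', how='left')
print(out)
```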