How can I compare two columns in a dataframe and create a new column based on the difference of those two columns efficiently?
I have a feature in my table with a lot of missing values, and I need to backfill that information using other tables in the database that contain the same feature. I have used np.select to compare the feature in my original table with the same feature in another table, but I feel like there should be an easier method.
E.g.: pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
I expect the new column to contain the values [1, 2, "different", 4, np.nan]. Any help would be appreciated!
pandas.Series.combine_first or pandas.DataFrame.combine_first could be useful here. These operate like a SQL COALESCE and combine the two columns by choosing the first non-null value if one exists.
df = pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
C = df.A.combine_first(df.B)
C looks like:
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
Then, to capture your requirement that two different non-null values should give "different" when combined, just find those indices and update the values.
mask = ~df.A.isna() & ~df.B.isna() & (df.A != df.B)
C[mask] = 'different'
C now looks like:
0 1.0
1 2.0
2 different
3 4.0
4 NaN
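For reference, since the question mentions np.select, the same result can be produced in one step along those lines. A sketch (the astype(object) casts matter: with a string choice next to numeric Series, np.select would otherwise promote everything to a string dtype and turn NaN into 'nan'):
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, np.nan], 'B': [1, np.nan, 30, 4, np.nan]})
both = df.A.notna() & df.B.notna()
df['C'] = np.select(
    [both & (df.A != df.B), df.A.notna()],  # conditions, checked in order
    ['different', df.A.astype(object)],     # corresponding choices
    default=df.B.astype(object))            # fall back to B when A is NaN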
Another way is to use pd.DataFrame.iterrows with nunique (concise, though row iteration is slow on large frames):
import pandas as pd
df['C'] = [s['A'] if s.nunique()<=1 else 'different' for _, s in df.iterrows()]
Output:
A B C
0 1.0 1.0 1
1 2.0 NaN 2
2 3.0 30.0 different
3 4.0 4.0 4
4 NaN NaN NaN
I want to merge/join two large dataframes, where the matching column of the dataframe on the right is assumed to contain substrings of the left 'id' column.
For illustration purposes:
import pandas as pd
import numpy as np
df1=pd.DataFrame({'id':['abc','adcfek','acefeasdq'],'numbers':[1,2,np.nan],'add_info':[3123,np.nan,312441]})
df2=pd.DataFrame({'matching':['adc','fek','acefeasdq','abcef','acce','dcf'],'needed_info':[1,2,3,4,5,6],'other_info':[22,33,11,44,55,66]})
This is df1:
id numbers add_info
0 abc 1.0 3123.0
1 adcfek 2.0 NaN
2 acefeasdq NaN 312441.0
And this is df2:
matching needed_info other_info
0 adc 1 22
1 fek 2 33
2 acefeasdq 3 11
3 abcef 4 44
4 acce 5 55
5 dcf 6 66
And this is the desired output:
id numbers add_info needed_info other_info
0 abc 1.0 3123.0 NaN NaN
1 adcfek 2.0 NaN 2.0 33.0
2 adcfek 2.0 NaN 6.0 66.0
3 acefeasdq NaN 312441.0 3.0 11.0
So, as described, I only want to merge the additional columns when the 'matching' value is a substring of the 'id' value. If it is the other way around, e.g. the id 'abc' is a substring of the matching value 'abcef', nothing should happen.
In my data, a lot of the matches between df1 and df2 are actually exact, like the 'acefeasdq' row. But there are cases where an 'id' contains multiple 'matching' values. For the moment it is okay-ish to ignore these cases, but I'd like to learn how to tackle this issue. Additionally, is it possible to mark out the rows that are merged based on substrings and the rows that are merged exactly?
You can use pd.merge(how='cross') to create a dataframe containing all combinations of the rows. And then filter the dataframe using a boolean series:
df = pd.merge(df1, df2, how="cross")
include_row = df.apply(lambda row: row.matching in row.id, axis=1)
filtered = df.loc[include_row]
print(filtered)
Docs:
pd.merge
Indexing and selecting data
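The question also asks how to mark rows that matched exactly versus by substring. Since each surviving row carries both values, a simple flag column on the filtered frame above is enough (a sketch):
# continuing from the snippet above: True for exact matches, False for substring-only matches
filtered = filtered.assign(exact=filtered['matching'] == filtered['id'])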
If your processing can handle a CROSS JOIN (problematic with large datasets), then you can cross join and then filter down to only the rows you want:
cross = pd.merge(df1, df2, how='cross')
mask = cross.apply(lambda x: str(x['matching']) in str(x['id']), axis=1)  # boolean mask of rows meeting the condition
final = cross[mask]  # keep only those rows
I have two sources of data: one is an old version and one is the current version of the same data.
I need to find the new, updated, and deleted rows between these two datasets.
Here is an example. 'Updated' means that the value of any column changed relative to the old data.
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({'id':[1,2,3,4],'b':[4,np.nan,6,12]})
>>> df2 = pd.DataFrame({'id':[2,1,3,5],'b':[np.nan,40,6,6]})
>>> df1
id b
0 1 4.0
1 2 NaN
2 3 6.0
3 4 12.0
>>> df2
id b
0 2 NaN
1 1 40.0
2 3 6.0
3 5 6.0
here id is primary key for table.
I can easily find new rows from comparing primary key.
>>> df2[~df2.id.isin(df1.id)]
id b
3 5 6.0
But I am having trouble finding the updated rows in the new data source.
I tried the following:
>>> tmp = df1.merge(df2)
>>> df2[(~df2.id.isin(tmp.id)) & (df2.id.isin(df1.id))]
id b
1 1 40.0
This works for the given case. But when I apply the same thing to my original data frame (shape (97000, 58), where two columns combined make the PK), it does not give the desired result: it returns rows that are not updated.
My question is: 'Is this the right way to achieve this?'
How can I improve it?
Get the intersection of the ids and simply compare using ==. This is only possible because you have identically-labeled data frames (i.e. same indexes - due to the intersection - and same columns).
ids = set(df1.id.unique()).intersection(df2.id)
d1 = df1[df1.id.isin(ids)].set_index('id').sort_index()
d2 = df2[df2.id.isin(ids)].set_index('id').sort_index()
comp = (d1 == d2) | (pd.isnull(d1) & pd.isnull(d2))
which gives a boolean data frame with True values wherever values are equal, and False values wherever they differ
b
id
1 False
2 True
3 True
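Continuing from comp above, a sketch for pulling out the actual rows (the all(axis=1) reduction assumes a row counts as updated if any of its columns differ; the deleted-rows line mirrors the isin check used for new rows):
updated = d2[~comp.all(axis=1)]      # ids present in both frames where some column changed
deleted = df1[~df1.id.isin(df2.id)]  # ids present in the old data but missing from the new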
I have two dataframes obtained with pandas and Python.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Index': ['0','1','2'], 'number':[3,'dd',1], 'people':[3,'s',3]})
df1 = df1.set_index('Index')
df2 = pd.DataFrame({'Index': ['0','1','2'], 'quantity':[3,2,'hi'], 'persons':[1,5,np.nan]})
I would like to sum the columns element-wise, aligned on Index. The columns do not have the same names and may contain strings (I in fact have 50 columns in each df). I want to consider NaN as 0. The result should look like:
df3
Index column 1 column 2
0 6 4
1 nan nan
2 nan nan
I was wondering how this could be done.
Note:
A double while or for loop would certainly do the trick, it's just not very elegant...
indices = 0
while indices < len(df1.index):
    columna = 0
    while columna < numbercolumns:
        df3.iloc[indices, columna] = df1.iloc[indices, columna] + df2.iloc[indices, columna]
        columna += 1
    indices += 1
Thank you.
You can try concatenating both dataframes and then summing within each index group. Two details matter: the column names must be aligned first so the frames stack, and non-numeric values must be coerced to NaN:
df2 = df2.set_index('Index')
df2.columns = df1.columns  # align column names with df1
df1n = df1.apply(pd.to_numeric, errors='coerce')  # strings become NaN
df2n = df2.apply(pd.to_numeric, errors='coerce')
pd.concat([df1n, df2n]).groupby(level='Index').sum()
Out:
number people
Index
0 6.0 4.0
1 2.0 5.0
2 1.0 3.0
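Note that sum() skips NaN by default, so a value missing in one frame still contributes the other frame's value. If you instead want NaN whenever either input is missing, which is what the desired df3 shows, GroupBy.sum's min_count argument can express that (a sketch):
pd.concat([df1n, df2n]).groupby(level='Index').sum(min_count=2)
This keeps 6.0 and 4.0 for Index 0 and yields NaN for the other rows.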
I have initialized an empty pandas dataframe that I am now trying to fill, but I keep running into the same error. This is the (simplified) code I am using:
import pandas as pd
cols = list("ABC")
df = pd.DataFrame(columns=cols)
# set the values for the first two rows
df.loc[0:2,:] = [[1,2],[3,4],[5,6]]
On running the above code I get the following error:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
I am not sure what's causing this. I tried the same thing one row at a time and it works (df.loc[0,:] = [1,2,3]). I thought this should be the logical expansion when I want to handle more than one row, but clearly I am wrong. What's the correct way to do this? I need to enter values for multiple rows and columns at once. I can do it using a loop, but that's not what I am looking for.
Any help would be great. Thanks
Since you have the columns from the empty dataframe, use them in the DataFrame constructor, i.e.
import numpy as np
import pandas as pd
cols = list("ABC")
df = pd.DataFrame(columns=cols)
df = pd.DataFrame(np.array([[1,2],[3,4],[5,6]]).T, columns=df.columns)
A B C
0 1 3 5
1 2 4 6
Well, if you want to use loc specifically, reindex the dataframe first and then assign, i.e.
arr = np.array([[1,2],[3,4],[5,6]]).T
df = df.reindex(np.arange(arr.shape[0]))
df.loc[0:arr.shape[0],:] = arr
A B C
0 1 3 5
1 2 4 6
How about adding data by index, as below? You can call this externally as and when you receive data.
def add_to_df(index, data):
    # data arrives column-wise; zip(*data) transposes it into rows
    for idx, row in zip(index, zip(*data)):
        df.loc[idx] = row
# Set values for the first two rows
data1 = [[1,2],[3,4],[5,6]]
index1 = [0,1]
add_to_df(index1, data1)
print(df)
print("")
# Set values for the next three rows
data2 = [[7,8,9],[10,11,12],[13,14,15]]
index2 = [2,3,4]
add_to_df(index2, data2)
print(df)
Result
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
2 7.0 10.0 13.0
3 8.0 11.0 14.0
4 9.0 12.0 15.0
Going through the documentation and some experiments, my guess is that loc only allows you to insert one key (row label) at a time. However, you can create the multiple keys first with reindex, as @Dark shows.
The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.
http://pandas-docs.github.io/pandas-docs-travis/indexing.html#setting-with-enlargement
Also, by using loc[0:2, :] you mean to select the first two rows, but there is nothing in the empty df to select: it has no rows while you are trying to insert three. Thus the message says:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
BTW, [[1,2],[3,4],[5,6]] will be 3 rows rather than 2.
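To illustrate the single-key enlargement (a minimal sketch; the asker wants to avoid a loop, so this only demonstrates the mechanism):
import pandas as pd
df = pd.DataFrame(columns=list("ABC"))
for i, row in enumerate([[1, 3, 5], [2, 4, 6]]):
    df.loc[i] = row  # setting-with-enlargement: one new label per assignment
print(df)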
Does this get the output you are looking for?
import pandas as pd
df=pd.DataFrame({'A':[1,2],'B':[3,4],'C':[5,6]})
Output :
A B C
0 1 3 5
1 2 4 6
What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know in advance how many atc_1 ... atc_7 ... atc_100 columns there will need to be. I just need to gather all associated atc_codes into one row for each visit_id.
This seems like a groupby and then a pivot, but I have tried many times and failed. I also tried to self-join à la SQL using pandas' merge(), but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be the "Y" or "labels" column of the dataset that I am trying to predict.
Thank you!
First use cumcount to number the values within each group; these per-group counters become the column keys for the pivot. Then add any missing columns with reindex, prefix the column names with add_prefix, and finally reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
# attach the counter as a helper column 'n' so DataFrame.pivot can use it
df = (df.assign(n=g)
        .pivot(index='visit_id', columns='n', values='atc_code')
        .reindex(columns=range(1, 8))  # ensure columns 1..7 exist
        .add_prefix('atc_')
        .reset_index())
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
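To then paste the atc columns together into the single composite label the question describes (a sketch; the atc_combined name and the '|' separator are illustrative choices, not anything fixed by the question):
atc_cols = [c for c in df.columns if c.startswith('atc_')]
df['atc_combined'] = df[atc_cols].apply(
    lambda row: '|'.join(row.dropna().astype(str)), axis=1)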