After merging two data frames I have some gaps in my data frame that can be filled in from neighboring rows (I have many more columns and rows in the DF, but I'm focusing on these three columns):
Example DF:
Unique ID | Type | Location
A         | 1    | Land
A         | NaN  | NaN
B         | 2    | sub
B         | NaN  | NaN
C         | 3    | Land
C         | 3    | Land
Ultimately I want the three columns to be filled in:
Unique ID | Type | Location
A         | 1    | Land
A         | 1    | Land
B         | 2    | sub
B         | 2    | sub
C         | 3    | Land
C         | 3    | Land
I've tried:
df.loc[df.Type.isnull(), 'Type'] = df.loc[df.Type.isnull(), 'Unique ID'].map(df.loc[df.Type.notnull()].set_index('Unique ID')['Type'])
but it throws:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
What am I missing here? - Thanks
Your example indicates that you want to forward-fill. (As an aside, the InvalidIndexError is raised because the Series you map over is indexed by 'Unique ID', which is not unique — 'C' appears twice among the non-null rows.) You can do it like this (complete code):
import pandas as pd
from io import StringIO
clientdata = '''ID N T
A 1 Land
A NaN NaN
B 2 sub
B NaN NaN
C 3 Land
C 3 Land'''
df = pd.read_csv(StringIO(clientdata), sep=r'\s+')
df["N"] = df["N"].ffill()  # fillna(method="ffill") is deprecated
df["T"] = df["T"].ffill()
print(df)
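Note that a plain forward-fill is only safe here because rows with the same ID happen to be adjacent. If they might not be, filling within each group keeps a value from leaking across IDs (a sketch using the same column names):
# forward-fill within each ID group so values never cross group boundaries
df[["N", "T"]] = df.groupby("ID")[["N", "T"]].ffill()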
The best solution is probably just to get rid of the NaN rows instead of overwriting them. pandas has a simple command for that:
df.dropna()
Here's the documentation for it: pandas.DataFrame.dropna
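For example, to drop only the rows where the relevant columns are missing, pass subset (a sketch using the question's column names):
df = df.dropna(subset=['Type', 'Location'])  # drop rows missing either column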
I have two sources of data: one is an old version and one is the current version of the same data.
I need to find the new, updated, and deleted rows between these two datasets.
Here is an example. "Updated" means the value of any column has changed relative to the old data.
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({'id':[1,2,3,4],'b':[4,np.nan,6,12]})
>>> df2 = pd.DataFrame({'id':[2,1,3,5],'b':[np.nan,40,6,6]})
>>> df1
id b
0 1 4.0
1 2 NaN
2 3 6.0
3 4 12.0
>>> df2
id b
0 2 NaN
1 1 40.0
2 3 6.0
3 5 6.0
Here id is the primary key for the table.
I can easily find new rows by comparing the primary keys.
>>> df2[~df2.id.isin(df1.id)]
id b
3 5 6.0
But I am having trouble finding the updated rows in the new data source.
I tried the following:
>>> tmp = df1.merge(df2)
>>> df2[(~df2.id.isin(tmp.id)) & (df2.id.isin(df1.id))]
id b
1 1 40.0
This works for the given case, but when I apply the same thing to my original data frame (shape (97000, 58), where two columns combined make the PK) it does not give the desired result: it returns rows that were not updated.
My question is: is this the right way to achieve this?
How can I improve it?
Get the intersection of the ids and simply compare using ==. This is only possible because you have identically-labeled data frames (i.e. same indexes - due to the intersection - and same columns).
ids = set(df1.id.unique()).intersection(df2.id)
d1 = df1[df1.id.isin(ids)].set_index('id').sort_index()
d2 = df2[df2.id.isin(ids)].set_index('id').sort_index()
comp = (d1 == d2) | (pd.isnull(d1) & pd.isnull(d2))
which gives a boolean data frame with True wherever the values are equal (two NaNs counting as equal) and False wherever they differ:
        b
id
1   False
2    True
3    True
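From comp it is then straightforward to pull out the updated ids, and the deleted rows mirror the new-row check (a sketch building on the frames above):
# ids where any column changed between the two versions
updated_rows = d2[~comp.all(axis=1)]

# rows present in the old data but gone from the new
deleted_rows = df1[~df1.id.isin(df2.id)]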
How can I compare two columns in a dataframe and create a new column based on the difference of those two columns efficiently?
I have a feature in my table that has a lot of missing values, and I need to back-fill that information using other tables in the database that contain the same feature. I have used np.select to compare the feature in my original table with the same feature in the other table, but I feel like there should be an easier method.
E.g.: pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
I expect the new column to contain values [1,2,"different",4,np.nan]. Any help will be appreciated!
pandas.Series.combine_first or pandas.DataFrame.combine_first could be useful here. These operate like a SQL COALESCE and combine the two columns by choosing the first non-null value if one exists.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
C = df.A.combine_first(df.B)
C looks like:
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
Then, to capture your requirement that two different non-null values should give "different" when combined, just find those indices and update the values.
mask = df.A.notna() & df.B.notna() & (df.A != df.B)
C = C.astype(object)  # avoid writing a string into a float Series
C[mask] = 'different'
C now looks like:
0 1
1 2
2 different
3 4
4 NaN
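If you prefer a single chained expression, Series.mask folds the two steps together (a sketch of the same logic):
C = (df.A.combine_first(df.B)
       .astype(object)  # keep the floats from being coerced to strings
       .mask(df.A.notna() & df.B.notna() & (df.A != df.B), 'different'))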
Another way is to use pd.DataFrame.iterrows with nunique:
import pandas as pd

# a row with at most one distinct non-null value keeps A; otherwise mark it
# 'different' (note this assumes the frame has only the columns A and B)
df['C'] = [s['A'] if s.nunique() <= 1 else 'different' for _, s in df.iterrows()]
Output:
A B C
0 1.0 1.0 1
1 2.0 NaN 2
2 3.0 30.0 different
3 4.0 4.0 4
4 NaN NaN NaN
I have two dataframes in pandas:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Index': ['0','1','2'], 'number':[3,'dd',1], 'people':[3,'s',3]})
df1 = df1.set_index('Index')
df2 = pd.DataFrame({'Index': ['0','1','2'], 'quantity':[3,2,'hi'], 'persons':[1,5,np.nan]})
I would like to sum the columns element-wise, aligned on Index. The columns do not have the same names and may contain strings (I in fact have 50 columns in each df). I want to treat NaN as 0. The result should look like:
df3
Index column 1 column 2
0 6 4
1 nan nan
2 nan nan
I was wondering how could this be done.
Note:
A nested loop would certainly do the trick, it is just not very elegant, something like:
df3 = df1.copy()
for indices in range(len(df1.index)):
    for columna in range(len(df1.columns)):
        # (ignoring the string/NaN handling for the moment)
        df3.iloc[indices, columna] = df1.iloc[indices, columna] + df2.iloc[indices, columna]
Thank you.
You can try concatenating both dataframes and then summing within each Index group. The question's frames first need aligning, since the column names differ and df2 still has Index as a regular column:
df2 = df2.set_index('Index')
df2.columns = df1.columns                        # give both frames the same column names
df1 = df1.apply(pd.to_numeric, errors='coerce')  # non-numeric values become NaN
df2 = df2.apply(pd.to_numeric, errors='coerce')
pd.concat([df1, df2]).groupby(level='Index').sum()
Out:
       number  people
Index
0         6.0     4.0
1         2.0     5.0
2         1.0     3.0
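Note that sum() skips NaN, so a row where one side failed to parse still gets a number. If you instead want NaN whenever either operand was non-numeric (as in the expected df3 above), a plain aligned add propagates NaN (a sketch on the coerced frames above):
df3 = df1.add(df2)  # NaN + anything -> NaN, matching the expected output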
What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know in advance how many atc_1 ... atc_7 ... atc_100 columns there will need to be. I just need to gather all the associated atc_codes into one row for each visit_id.
This seems like a groupby and then a pivot, but I have tried many times and failed. I also tried to self-join a la SQL using pandas' merge(), but that doesn't work either.
The end result is that I will paste together atc_1, ..., atc_100 to form one long atc_code. This composite atc_code will be the "Y" or "labels" column of the dataset that I am trying to predict.
Thank you!
First use cumcount to number the rows within each visit_id group; those counts become the column labels created by pivot. Then add any missing columns with reindex, prefix the column names with add_prefix, and finally reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
df = (df.assign(g=g)
        .pivot(index='visit_id', columns='g', values='atc_code')
        .reindex(columns=range(1, 8))   # reindex_axis is deprecated
        .add_prefix('atc_')
        .reset_index())
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN
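To then paste the atc columns together into the composite label the question describes, one option is to join the non-null values per row (a sketch; the column name and the '-' separator are assumptions):
# join all non-null atc_* values in each row into one label string
df['atc_combined'] = (df.filter(like='atc_')
                        .apply(lambda row: '-'.join(row.dropna().astype(str)), axis=1))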
I am new to stackoverflow and pandas for python. I found part of my answer in the post Looking to merge two Excel files by ID into one Excel file using Python 2.7
However, I also want to merge or combine columns from the two excel files with the same name. I thought the following post would have my answer but I guess it's not titled correctly: Merging Pandas DataFrames with the same column name
Right now I have the code:
import pandas as pd
file1 = pd.read_excel("file1.xlsx")
file2 = pd.read_excel("file2.xlsx")
file3 = file1.merge(file2, on="ID", how="outer")
file3.to_excel("merged.xlsx")
file1.xlsx
ID,JanSales,FebSales,test
1,100,200,cars
2,200,500,
3,300,400,boats
file2.xlsx
ID,CreditScore,EMMAScore,test
2,good,Watson,planes
3,okay,Thompson,
4,not-so-good,NA,
What I get is merged.xlsx:
ID,JanSales,FebSales,test_x,CreditScore,EMMAScore,test_y
1,100,200,cars,NaN,NaN,
2,200,500,,good,Watson,planes
3,300,400,boats,okay,Thompson,
4,NaN,NaN,,not-so-good,NaN,
What I want is merged.xlsx:
ID,JanSales,FebSales,CreditScore,EMMAScore,test
1,100,200,NaN,NaN,cars
2,200,500,good,Watson,planes
3,300,400,okay,Thompson,boats
4,NaN,NaN,not-so-good,NaN,NaN
In my real data, there are 200+ columns that correspond to the "test" column in my example. I want the program to find these columns with the same names in both file1.xlsx and file2.xlsx and combine them in the merged file.
OK, here is a more dynamic way. After merging we assume that clashes will occur, producing 'column_name_x' and 'column_name_y' suffixes (df1, df2, and df3 below correspond to file1, file2, and file3 above).
So first figure out the common column names and remove 'ID' from this list:
In [51]:
common_columns = list(set(df1.columns) & set(df2.columns))
common_columns.remove('ID')
common_columns
Out[51]:
['test']
Now we can iterate over this list to create each combined column, using where to pick whichever of the clashing values is not null:
In [59]:
for col in common_columns:
df3[col] = df3[col+'_x'].where(df3[col+'_x'].notnull(), df3[col+'_y'])
df3
Out[59]:
ID JanSales FebSales test_x CreditScore EMMAScore test_y test
0 1 100 200 cars NaN NaN NaN cars
1 2 200 500 NaN good Watson planes planes
2 3 300 400 boats okay Thompson NaN boats
3 4 NaN NaN NaN not-so-good NaN NaN NaN
[4 rows x 8 columns]
Then, to finish off, drop all the extra columns:
In [68]:
clash_names = [elt+suffix for elt in common_columns for suffix in ('_x','_y') ]
clash_names
df3.drop(labels=clash_names, axis=1,inplace=True)
df3
Out[68]:
ID JanSales FebSales CreditScore EMMAScore test
0 1 100 200 NaN NaN cars
1 2 200 500 good Watson planes
2 3 300 400 okay Thompson boats
3 4 NaN NaN not-so-good NaN NaN
[4 rows x 6 columns]
The snippet above is from this: Prepend prefix to list elements with list comprehension
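As an aside, with 200+ clashing columns it may be simpler to skip merge entirely and let combine_first resolve every clash at once (a sketch, assuming ID uniquely identifies rows in both files; column order may differ):
# align both frames on ID; take file1's value where present, else file2's,
# across the union of rows and columns
merged = file1.set_index('ID').combine_first(file2.set_index('ID')).reset_index()
merged.to_excel('merged.xlsx', index=False)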