I'd like to concatenate two dataframes A, B to a new one without duplicate rows (if rows in B already exist in A, don't add):
Dataframe A:
I II
0 1 2
1 3 1
Dataframe B:
I II
0 5 6
1 3 1
New Dataframe:
I II
0 1 2
1 3 1
2 5 6
How can I do this?
The simplest way is to just do the concatenation, and then drop duplicates.
>>> df1
A B
0 1 2
1 3 1
>>> df2
A B
0 5 6
1 3 1
>>> pandas.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
A B
0 1 2
1 3 1
2 5 6
The reset_index(drop=True) is to fix up the index after the concat() and drop_duplicates(). Without it you will have an index of [0,1,0] instead of [0,1,2]. This could cause problems for further operations on this dataframe down the road if it isn't reset right away.
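To see the problem concretely, here is a small sketch using the same data as above: after concat() and drop_duplicates(), the index is [0, 1, 0], so a label lookup like .loc[0] silently returns two rows.

import pandas as pd

df1 = pd.DataFrame({'A': [1, 3], 'B': [2, 1]})
df2 = pd.DataFrame({'A': [5, 3], 'B': [6, 1]})

combined = pd.concat([df1, df2]).drop_duplicates()
print(combined.index.tolist())  # [0, 1, 0] -- the duplicate label survives
print(combined.loc[0])          # returns two rows, one from each frame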
If DataFrame A already contains duplicate rows that you want to keep, concatenating and then dropping duplicate rows will remove rows from DataFrame A.
In that case, create a helper column with a cumulative count before dropping duplicates: drop_duplicates then sees the count, so two identical rows within A get counts 0 and 1 and both survive. Whether you need this depends on your use case, but it is common with time-series data.
Here is an example:
import pandas as pd

df_1 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 34},
])
df_2 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 14},
])

# number repeated (date, id, value) rows within each frame ...
df_1['count'] = df_1.groupby(['date', 'id', 'value']).cumcount()
df_2['count'] = df_2.groupby(['date', 'id', 'value']).cumcount()

# ... so genuine duplicates inside one frame survive drop_duplicates()
df_tot = pd.concat([df_1, df_2], ignore_index=False)
df_tot = df_tot.drop_duplicates()
df_tot = df_tot.drop(['count'], axis=1)
>>> df_tot
date id value
0 11/20/2015 4 24
1 11/20/2015 4 24
2 11/20/2015 6 34
1 11/20/2015 6 14
I'm surprised that pandas doesn't offer a native solution for this task.
I don't think it's efficient to just drop the duplicates if you work with large datasets (as Rian G suggested).
It is probably most efficient to use sets to find the non-overlapping indices, then use a list comprehension to translate those indices into a boolean row mask, which is what you need to select rows with iloc. Below is a function that performs the task. If you don't choose a specific column (col) to check for duplicates, the index is used, as you requested. If you do choose a specific column, be aware that existing duplicate entries in a will remain in the result.
import pandas as pd

def append_non_duplicates(a, b, col=None):
    if ((a is not None and not isinstance(a, pd.DataFrame))
            or (b is not None and not isinstance(b, pd.DataFrame))):
        raise ValueError('a and b must be of type pandas.DataFrame.')
    if a is None:
        return b
    if b is None:
        return a
    if col is not None:
        aind = a.iloc[:, col].values
        bind = b.iloc[:, col].values
    else:
        aind = a.index.values
        bind = b.index.values
    # boolean mask: True for rows of b whose key does not occur in a
    take_rows = list(set(bind) - set(aind))
    take_rows = [i in take_rows for i in bind]
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    return pd.concat([a, b.iloc[take_rows, :]])
# Usage
a = pd.DataFrame([[1,2,3],[1,5,6],[1,12,13]], index=[1000,2000,5000])
b = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=[1000,2000,3000])
append_non_duplicates(a,b)
# 0 1 2
# 1000 1 2 3 <- from a
# 2000 1 5 6 <- from a
# 5000 1 12 13 <- from a
# 3000 7 8 9 <- from b
append_non_duplicates(a,b,0)
# 0 1 2
# 1000 1 2 3 <- from a
# 2000 1 5 6 <- from a
# 5000 1 12 13 <- from a
# 2000 4 5 6 <- from b
# 3000 7 8 9 <- from b
Another option:
concatenation = pd.concat([
    dfA,
    dfB[~dfB['I'].isin(dfA['I'])],  # <-- keep the rows of dfB whose value in column 'I' does not appear in dfA
])
The object concatenation will be:
I II
0 1 2
1 3 1
2 5 6
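Note that the isin check above only compares column 'I'. If you need to compare entire rows instead, a hedged sketch of an anti-join via merge with indicator=True (using the question's column names 'I' and 'II'):

import pandas as pd

dfA = pd.DataFrame({'I': [1, 3], 'II': [2, 1]})
dfB = pd.DataFrame({'I': [5, 3], 'II': [6, 1]})

# mark each row of dfB by whether it also occurs in dfA (joins on all shared columns)
marked = dfB.merge(dfA.drop_duplicates(), how='left', indicator=True)
only_b = marked[marked['_merge'] == 'left_only'].drop(columns='_merge')

concatenation = pd.concat([dfA, only_b], ignore_index=True)
print(concatenation)
#    I  II
# 0  1   2
# 1  3   1
# 2  5   6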
Say I have a predefined data frame of size NxM (N rows and M columns), it is initialized with all NaNs.
Now say I have a series of size M, I want to fill all rows with that series. For example, if the series is of ints [3,4,5] where N=2, M=3, my resultant dataframe should be (I have included column and row indices):
   0  1  2
0  3  4  5
1  3  4  5
Alternatively, say I have a series of size N, I want to fill all columns with that series. For example, if the series is of ints [10, 11] where N=2, M=3, my resultant dataframe should be (I have included column and row indices):
    0   1   2
0  10  10  10
1  11  11  11
How do I do both of these operations in pure pandas without resorting to using something like np.repeat or np.tile?
You can use np.broadcast_to; for the second sample data, transpose the output if necessary:
import numpy as np
import pandas as pd

s = pd.Series([3,4,5])
N = 2
M = 3
df = pd.DataFrame(index=range(N), columns=range(M))
df[:] = np.broadcast_to(s.to_numpy(), (N, M))
print (df)
0 1 2
0 3 4 5
1 3 4 5
s = pd.Series([10, 11])
N = 2
M = 3
df = pd.DataFrame(index=range(N), columns=range(M))
df[:] = np.broadcast_to(s.to_numpy(), (M, N)).T
print (df)
0 1 2
0 10 10 10
1 11 11 11
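If you are building the DataFrame from scratch rather than filling a preallocated one, you can pass the broadcast array straight to the DataFrame constructor: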
s = pd.Series([3,4,5])
N = 2
M = 3
df = pd.DataFrame(np.broadcast_to(s.to_numpy(), (N, M)))
print (df)
0 1 2
0 3 4 5
1 3 4 5
s = pd.Series([10, 11])
N = 2
M = 3
df = pd.DataFrame(np.broadcast_to(s.to_numpy(), (M, N)).T)
print (df)
0 1 2
0 10 10 10
1 11 11 11
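Since the question asks for pure pandas, here is a hedged alternative sketch that avoids NumPy entirely: concatenate the Series the required number of times and transpose where needed.

import pandas as pd

N, M = 2, 3

s = pd.Series([3, 4, 5])
df_rows = pd.concat([s] * N, axis=1).T   # each of the N rows is a copy of s
print(df_rows)
#    0  1  2
# 0  3  4  5
# 1  3  4  5

s = pd.Series([10, 11])
df_cols = pd.concat([s] * M, axis=1)     # each of the M columns is a copy of s
print(df_cols)
#     0   1   2
# 0  10  10  10
# 1  11  11  11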
I have the dataframe below and I'm able to move a row with group_member_id=0 to the top with the script below:
import pandas as pd
data = {
    'row_id': [1,2,3,4,5,6,7,8,9,10,11,12],
    'group_id': [1,1,1,1,2,2,2,2,3,3,3,3],
    'group_member_id': [1,0,2,3,1,2,0,3,3,2,1,0]
}
trips_data = pd.DataFrame(data)
    row_id  group_id  group_member_id
11      12         3                0
0        1         1                1
1        2         1                0
2        3         1                2
3        4         1                3
4        5         2                1
5        6         2                2
6        7         2                0
7        8         2                3
8        9         3                3
9       10         3                2
10      11         3                1
11      12         3                0
#Function to move one row with group_member_id == 0 to the top
def shift_row_to_top(df):
    #Shift row, given by index_to_shift, to top of df
    idx = df.index.tolist()
    index_to_shift = trips_data[trips_data['group_member_id']==0].index.values.max()
    df = df.reindex([index_to_shift] + idx)
    return df
The above code successfully moves one row (where group_member_id=0) to the top, but what I really want is this: for each group_id (each group_id has four rows), move the row with group_member_id=0 to the top and leave the rest as they are.
I tried the code below but got the error "only integer scalar arrays can be converted to a scalar index" (index.values returns a NumPy array, while list.pop expects a single integer):
def shift_row_to_top2(trips):
    print(trips)
    #Identify row index where group_member_id == 0
    index_to_shift = trips[trips['group_member_id']==0].index.values
    idx = trips.index.tolist()
    idx.pop(index_to_shift)
    trips = trips.reindex([index_to_shift] + idx)

trips_data_all_shifted = trips_data.groupby(["group_id"]).apply(shift_row_to_top2)
Below is a sample of what I'm trying to achieve: the first row of every group_id starts with group_member_id=0.
# I was able to get to the solution
#Dataframe
data = {
    'row_id': [1,2,3,4,5,6,7,8,9,10,11,12],
    'group_id': [1,1,1,1,2,2,2,2,3,3,3,3],
    'group_member_id': [1,0,2,3,1,2,0,3,3,2,1,0]
}
df = pd.DataFrame(data)

#Function to move the row where group_member_id == 0
#to the top of each "group_id" group
def shift_row_to_top(df):
    index_to_shift = df[df['group_member_id']==0].index.values.max()
    idx = [i for i in df.index if i != index_to_shift]
    df2 = df.loc[[index_to_shift] + idx]
    return df2

df3 = df.groupby(["group_id"]).apply(shift_row_to_top)
print(df3)
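For comparison, a hedged alternative sketch that avoids groupby().apply() entirely: add a boolean sort key and rely on a stable sort, so within each group the group_member_id == 0 row floats to the top while the other rows keep their original order.

import pandas as pd

data = {
    'row_id': [1,2,3,4,5,6,7,8,9,10,11,12],
    'group_id': [1,1,1,1,2,2,2,2,3,3,3,3],
    'group_member_id': [1,0,2,3,1,2,0,3,3,2,1,0]
}
df = pd.DataFrame(data)

# False sorts before True, so the zero-member row leads each group;
# kind='stable' keeps the remaining rows in their original order
df = (df.assign(not_zero=df['group_member_id'].ne(0))
        .sort_values(['group_id', 'not_zero'], kind='stable')
        .drop(columns='not_zero'))
print(df)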
I have the df below:
Cost,Reve
0,3
4,0
0,0
10,10
4,8
len(df['Cost']) = 300
len(df['Reve']) = 300
I need to divide df['Cost'] by df['Reve'].
Below is my code:
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
I got the error ValueError: Columns must be same length as key
df['C/R'] = df[['Cost']].div(df['Reve'].values, axis=0)
I got the error ValueError: Wrong number of items passed 2, placement implies 1
The problem is duplicated column names; verify:
#generate duplicates
df = pd.concat([df, df], axis=1)
print (df)
Cost Reve Cost Reve
0 0 3 0 3
1 4 0 4 0
2 0 0 0 0
3 10 10 10 10
4 4 8 4 8
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
print (df)
# ValueError: Columns must be same length as key
You can find these column names with:
print (df.columns[df.columns.duplicated(keep=False)])
Index(['Cost', 'Reve', 'Cost', 'Reve'], dtype='object')
If the duplicated columns contain the same values, you can remove the duplicates with:
df = df.loc[:, ~df.columns.duplicated()]
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
#simplify division
df['C/R'] = df['Cost'].div(df['Reve'])
print (df)
Cost Reve C/R
0 0 3 0.0
1 4 0 inf
2 0 0 NaN
3 10 10 1.0
4 4 8 0.5
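As the output shows, division by zero yields inf (nonzero / 0) or NaN (0 / 0). If you would rather have NaN in both cases, one hedged sketch:

import numpy as np

# replace the infinities produced by nonzero / 0 with NaN
df['C/R'] = df['Cost'].div(df['Reve']).replace([np.inf, -np.inf], np.nan)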
The issue can also lie in the size of the data you are trying to assign to the columns. I had an issue with this:
df[['X1','X2', 'X3']] = pd.DataFrame(df.X.tolist(), index= df.index)
I was trying to assign the values of X to three columns X1, X2, X3, assuming that X had 3 values; but X had 4 values, so the revised code in my case was:
df[['X1','X2', 'X3','X4']] = pd.DataFrame(df.X.tolist(), index= df.index)
I had the same error, but it did not come from either of the two issues above. In my case the columns had the same length. What helped me was converting my Series to a DataFrame with pd.DataFrame() and then assigning its values to a new column of my existing df.
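A minimal sketch of that workaround, with hypothetical data (the original answer does not show its Series):

import pandas as pd

df = pd.DataFrame({'Cost': [0, 4, 10], 'Reve': [3, 0, 10]})
s = df['Cost'] / df['Reve']   # the result is a Series

tmp = pd.DataFrame(s)         # wrap the Series in a DataFrame (single column named 0)
df['C/R'] = tmp[0].values     # assign its values to a new column of the existing df
print(df)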