I'd like to concatenate two dataframes A, B to a new one without duplicate rows (if rows in B already exist in A, don't add):
Dataframe A:
I II
0 1 2
1 3 1
Dataframe B:
I II
0 5 6
1 3 1
New Dataframe:
I II
0 1 2
1 3 1
2 5 6
How can I do this?
The simplest way is to just do the concatenation, and then drop duplicates.
>>> df1
A B
0 1 2
1 3 1
>>> df2
A B
0 5 6
1 3 1
>>> pandas.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
A B
0 1 2
1 3 1
2 5 6
The reset_index(drop=True) is to fix up the index after the concat() and drop_duplicates(). Without it you will have an index of [0,1,0] instead of [0,1,2]. This could cause problems for further operations on this dataframe down the road if it isn't reset right away.
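A related knob worth knowing: if rows should count as duplicates based on only some columns, drop_duplicates accepts a subset parameter. A minimal sketch (the frames here are hypothetical, not the question's data):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3], 'B': [2, 1]})
df2 = pd.DataFrame({'A': [5, 3], 'B': [6, 9]})

# Rows are considered duplicates when column 'A' matches; the first
# occurrence (from df1) wins, so df2's (3, 9) row is dropped.
combined = (pd.concat([df1, df2])
            .drop_duplicates(subset=['A'])
            .reset_index(drop=True))
```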
If DataFrame A already contains duplicate rows that you want to keep, concatenating and then dropping duplicates will remove those rows as well. In that case you can add a cumulative-count column before deduplicating, so that identical rows within one frame stay distinct. Whether you want this depends on your use case, but it is common with time-series data.
Here is an example:
import pandas as pd

df_1 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 34},
])
df_2 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 14},
])

# Number duplicate rows within each frame so identical rows stay distinct.
df_1['count'] = df_1.groupby(['date', 'id', 'value']).cumcount()
df_2['count'] = df_2.groupby(['date', 'id', 'value']).cumcount()

df_tot = pd.concat([df_1, df_2], ignore_index=False)
df_tot = df_tot.drop_duplicates()
df_tot = df_tot.drop(['count'], axis=1)
>>> df_tot
date id value
0 11/20/2015 4 24
1 11/20/2015 4 24
2 11/20/2015 6 34
1 11/20/2015 6 14
I'm surprised that pandas doesn't offer a native solution for this task. I don't think it's efficient to just drop the duplicates if you work with large datasets (as Rian G suggested).
It is probably most efficient to use sets to find the non-overlapping indices, then use a list comprehension to translate from index to row location (a boolean mask), which you need in order to access rows with iloc. Below is a function that performs the task. If you don't choose a specific column (col) to check for duplicates, the indexes are used, as you requested. If you do choose a specific column, be aware that existing duplicate entries in 'a' will remain in the result.
import pandas as pd

def append_non_duplicates(a, b, col=None):
    if ((a is not None and not isinstance(a, pd.DataFrame)) or
            (b is not None and not isinstance(b, pd.DataFrame))):
        raise ValueError('a and b must be of type pandas.DataFrame.')
    if a is None:
        return b
    if b is None:
        return a
    if col is not None:
        aind = a.iloc[:, col].values
        bind = b.iloc[:, col].values
    else:
        aind = a.index.values
        bind = b.index.values
    take_rows = list(set(bind) - set(aind))
    take_rows = [i in take_rows for i in bind]
    # DataFrame.append was removed in pandas 2.0; concat does the same job.
    return pd.concat([a, b.iloc[take_rows, :]])
# Usage
a = pd.DataFrame([[1,2,3],[1,5,6],[1,12,13]], index=[1000,2000,5000])
b = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=[1000,2000,3000])
append_non_duplicates(a,b)
# 0 1 2
# 1000 1 2 3 <- from a
# 2000 1 5 6 <- from a
# 5000 1 12 13 <- from a
# 3000 7 8 9 <- from b
append_non_duplicates(a,b,0)
# 0 1 2
# 1000 1 2 3 <- from a
# 2000 1 5 6 <- from a
# 5000 1 12 13 <- from a
# 2000 4 5 6 <- from b
# 3000 7 8 9 <- from b
Another option:
concatenation = pd.concat([
    dfA,
    dfB[~dfB['I'].isin(dfA['I'])],  # <-- rows of dfB whose value in column 'I' does not appear in dfA
])
The object concatenation will be:
I II
0 1 2
1 3 1
2 5 6
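If you'd rather not rely on drop_duplicates at all, a merge with indicator=True can isolate the rows of dfB that are absent from dfA; a sketch using the question's data:

```python
import pandas as pd

dfA = pd.DataFrame({'I': [1, 3], 'II': [2, 1]})
dfB = pd.DataFrame({'I': [5, 3], 'II': [6, 1]})

# An outer merge on all shared columns tags each row with its origin
# ('left_only', 'right_only', or 'both') in a '_merge' column.
merged = dfA.merge(dfB, how='outer', indicator=True)
new_rows = merged[merged['_merge'] == 'right_only'].drop(columns='_merge')
result = pd.concat([dfA, new_rows], ignore_index=True)
```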
This might be a quite easy problem, but I can't deal with it properly and didn't find the exact answer here. So, let's say we have a pandas DataFrame as below:
df:
ID a b c d
0 1 3 4 9
1 2 8 8 3
2 1 3 10 12
3 0 1 3 0
I want to remove all the rows that contain repeating values in different columns. In other words, I am only interested in keeping rows with unique values. Referring to the above example, the desired output should be:
ID a b c d
0 1 3 4 9
2 1 3 10 12
(I didn't change the ID values on purpose to make the comparison easier). Please let me know if you have any ideas. Thanks!
You can compare the length of the set of each row's values with the number of columns:
lc = len(df.columns)
df1 = df[df.apply(lambda x: len(set(x)) == lc, axis=1)]
print (df1)
a b c d
ID
0 1 3 4 9
2 1 3 10 12
Or test by Series.duplicated and Series.any:
df1 = df[~df.apply(lambda x: x.duplicated().any(), axis=1)]
Or DataFrame.nunique:
df1 = df[df.nunique(axis=1).eq(lc)]
Or:
df1 = df[[len(set(x)) == lc for x in df.to_numpy()]]
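Put together with the sample data, the nunique variant runs end to end like this (a sketch; the 'ID' column is treated as the index):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 0],
                   'b': [3, 8, 3, 1],
                   'c': [4, 8, 10, 3],
                   'd': [9, 3, 12, 0]},
                  index=pd.Index([0, 1, 2, 3], name='ID'))

# A row survives only if the number of distinct values in it
# equals the number of columns, i.e. no value repeats.
df1 = df[df.nunique(axis=1).eq(len(df.columns))]
```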
I have two dataframes A and B with a common column 'label'. I would like to create a new column 'Map' in dataframe A which consists of the corresponding mapping from dataframe B.
Required conditions:
With every mapping, increment a count variable by 1 (to be compared against the 'Capacity' column in dataframe B).
The mapping of the 'label' column should be based on the higher value of the 'Num' column from dataframe B. If the count would exceed 'Capacity' for the next assignment, assign the second-best 'Num' mapping, and so on.
If there is no available mapping, or the 'Capacity' of every available mapping is exhausted, set 'Map' to None.
Dataframe A
Id label
1 1
2 1
3 1
4 2
5 2
6 3
7 3
Dataframe B
label Capacity Map Num
1 1 A 0.1
1 2 B 0.2
2 2 C 0.3
3 1 D 0.2
Expected Output Dataframe
Id label Map
1 1 B
2 1 B
3 1 A
4 2 C
5 2 C
6 3 D
7 3 None
Any pythonic way for this. I would appreciate some explanation on the code.
Assuming your initial dataframes are:
>>> dfa
Id label
0 1 1
1 2 1
2 3 1
3 4 2
4 5 2
5 6 3
6 7 3
>>> dfb
label Capacity Map Num
0 1 1 A 0.1
1 1 2 B 0.2
2 2 2 C 0.3
3 3 1 D 0.2
First, start by reshaping the dataframes a bit. We calculate a cumcount for dfa and a cumsum for dfb. This tells us how many rows can be allocated to each map entry under its cumulated capacity limit.
dfa['count'] = dfa.groupby('label').cumcount()+1
dfb.sort_values(by='Num', ascending=False, inplace=True)
dfb['count'] = dfb.groupby('label')['Capacity'].cumsum()
Then we define a custom function to do the mapping. The try/except block handles the case where no rows are available to map and the function will return None
def custom_map(s):
try:
return (dfb[dfb['label'].eq(s['label']) & # same label
dfb['count'].ge(s['count']) # within capacity
].iloc[0]['Map']) # take first element
except IndexError:
pass
Finally, we map the values using:
dfa['Map'] = dfa.apply(custom_map, axis=1)
dfa = dfa.drop('count', axis=1)
output:
Id label Map
0 1 1 B
1 2 1 B
2 3 1 A
3 4 2 C
4 5 2 C
5 6 3 D
6 7 3 None
I have tried to duplicate the mentioned data frames. My approach is to first sort dataframe "B" by "num" and then by "cap". Then, looping over dataframe "A", I select the correct "map" label and decrement the available capacity as I go.
import pandas as pd

dfA = pd.DataFrame()
dfA["Id"] = [1, 2, 3, 4, 5, 6, 7]
dfA["label"] = [1, 1, 1, 2, 2, 3, 3]

dfB = pd.DataFrame()
dfB["label"] = [1, 1, 2, 3]
dfB["cap"] = [1, 2, 2, 1]
dfB["map"] = ["A", "B", "C", "D"]
dfB["num"] = [0.1, 0.2, 0.3, 0.2]

test = dfB.copy()
test = test.sort_values(by=['num', 'cap'], ascending=[False, False], na_position='first')

map_list = []
for index, row in dfA.iterrows():
    currLabel = row["label"]
    x = test.loc[test['label'] == currLabel]
    if len(x):
        foundMap = False
        for i, r in x.iterrows():
            if r["cap"] > 0:
                test.at[i, "cap"] = r["cap"] - 1
                map_list.append(r["map"])
                foundMap = True
                break
        if not foundMap:
            map_list.append(None)
    else:
        map_list.append(None)

dfA["map"] = map_list
Instead of creating a copy of dfB, you could also add a new column to dfB that maintains the remaining capacity in real time.
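A hypothetical, more compact variant of the same idea keeps the remaining capacities in a plain dict instead of mutating a DataFrame copy (a sketch, not the code above):

```python
import pandas as pd

dfA = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6, 7],
                    'label': [1, 1, 1, 2, 2, 3, 3]})
dfB = pd.DataFrame({'label': [1, 1, 2, 3],
                    'cap': [1, 2, 2, 1],
                    'map': ['A', 'B', 'C', 'D'],
                    'num': [0.1, 0.2, 0.3, 0.2]})

# Rank candidate mappings by descending 'num'; track leftover capacity per row.
order = dfB.sort_values('num', ascending=False)
capacity = dict(zip(order.index, order['cap']))

def pick(label):
    # Take the best-ranked mapping for this label that still has capacity.
    for i, row in order[order['label'] == label].iterrows():
        if capacity[i] > 0:
            capacity[i] -= 1
            return row['map']
    return None  # no mapping left for this label

dfA['map'] = dfA['label'].apply(pick)
```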
In the following dataset, what's the best way to duplicate rows so that every groupby(['Type']) count below 3 is brought up to 3? df is the input and df1 is my desired outcome; you can see that row 3 from df was duplicated twice at the end. This is only an example deck: the real data has approximately 20 million lines and 400K unique Types, so an efficient method is needed.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Thought about using something like the following but do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: the sort=False argument to append requires pandas >= 0.23.0; remove it if you use a lower version.
EDIT: If the data contains multiple value columns, set every column except one as the index, repeat, and then reset_index:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
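DataFrame.append was removed in pandas 2.0, so on current versions the same repeat logic can be expressed with pd.concat and Index.repeat; a sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Type': list('aaabccc'), 'Val': [1, 2, 3, 1, 3, 2, 1]})

counts = df['Type'].value_counts()
# How many extra copies each under-represented Type needs to reach 3 rows.
repeat_map = 3 - counts[counts < 3]
repeats = df['Type'].map(repeat_map).fillna(0).astype(int)

# Repeat each row's index label the required number of times, then
# select those rows and append them to the original frame.
extra = df.loc[df.index.repeat(repeats)]
out = pd.concat([df, extra], ignore_index=True)
```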
First, I show the pandas dataframe to elucidate my problem.
import pandas as pd
mi = pd.MultiIndex.from_product([["A","B"],["c","d"]], names=['lv1', 'lv2'])
df1 = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]],columns=mi)
this python code creates dataframe(df1) like this:
#input dataframe
lv1 A B
lv2 c d c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
I want to create columns 'c*d' on lv2 by using df1's data. like this:
#output dataframe after calculation
lv1 A B
lv2 c d c*d c d c*d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
For this problem,I wrote some code like this:
for l1 in mi.levels[0]:
df1.loc[:, (l1, "c*d")] = df1.loc[:,(l1,"c")]*df1.loc[:,(l1,"d")]
df1.sort_index(1,inplace=True)
Although this code almost solved my problem, I really want to write it without a 'for' statement, like this:
df1.loc[:,(slice(None),"c*d")]=df1.loc[:,(slice(None),"c")]*df1.loc[:,(slice(None),"d")]
With this statement, I got a KeyError saying 'c*d' is missing.
Is there no syntax sugar for this calculation? Or can I achieve better performance by other code?
A bit improved your solution:
for l1 in mi.levels[0]:
df1.loc[:, (l1, "c*d")] = df1.loc[:,(l1,"c")]*df1.loc[:,(l1,"d")]
mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c','d','c*d']])
df1 = df1.reindex(columns=mux)
print (df1)
A B
c d c*d c d c*d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
Another solution with stack and unstack (note the product, not the sum, and the parentheses needed to continue the method chain across lines):
mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c','d','c_d']])
df1 = (df1.stack(0)
          .assign(c_d=lambda x: x.prod(axis=1))
          .unstack()
          .swaplevel(0, 1, 1)
          .reindex(columns=mux))
print (df1)
   A            B
   c   d  c_d   c   d  c_d
0  1   2    2   3   4   12
1  5   6   30   7   8   56
2  9  10   90  11  12  132
df2 = df1.xs("c", axis=1, level=1).mul(df1.xs("d", axis=1, level=1))
df2.columns = pd.MultiIndex.from_product([df2.columns, ['c*d']])
print (df2)
A B
c*d c*d
0 2 12
1 30 56
2 90 132
mux = pd.MultiIndex.from_product([df2.columns.levels[0], ['c','d','c*d']])
df = df1.join(df2).reindex(columns=mux)
print (df)
A B
c d c*d c d c*d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
Explanation of jezrael's answer using stack, which may be the most idiomatic way in pandas.
output = (df1
# "Stack" data, by moving the top level ('lv1') of the
# column MultiIndex into row index,
# now the rows are a MultiIndex and the columns
# are a regular Index.
.stack(0)
# Since we only have 2 columns now, 'lv2' ('c' & 'd')
# we can multiply them together along the row axis.
# The assign method takes key=value pairs mapping new column
# names to the function used to calculate them. Here we're
# wrapping them in a dictionary and unpacking them using **
.assign(**{'c*d': lambda x: x.product(axis=1)})
# Undos the stack operation, moving 'lv1', back to the
# column index, but now as the bottom level of the column index
.unstack()
# This sets the order of the column index MultiIndex levels.
# Since they are named we can use the names, you can also use
# their integer positions instead. Here axis=1 references
# the column index
.swaplevel('lv1', 'lv2', axis=1)
# Sort the values in both levels of the column MultiIndex.
# This will order them as c, c*d, d which is not what you
# specified above, however having a sorted MultiIndex is required
# for indexing via .loc[:, (...)] to work properly
.sort_index(axis=1)
)
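For comparison, a non-stacking sketch that rebuilds each top-level block with its product column and reassembles the MultiIndex via pd.concat keys (assuming the two-level columns from the question):

```python
import pandas as pd

mi = pd.MultiIndex.from_product([['A', 'B'], ['c', 'd']], names=['lv1', 'lv2'])
df1 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], columns=mi)

# Select each lv1 block (a plain two-column frame), add its product
# column, then let concat rebuild the top column level from the dict keys.
parts = {l1: df1[l1].assign(**{'c*d': lambda x: x['c'] * x['d']})
         for l1 in df1.columns.levels[0]}
out = pd.concat(parts, axis=1)
```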