Join dataframe in Python Pandas - python

I have two data frames as below:
Data Frame 1
Data Frame 2
I would like to merge these two data frames into something like below:
I tried to use pd.merge and join as below:
frames = pd.merge(df1, df2, how='outer', on=['apple_id','apple_wgt_colour', 'apple_wgt_no_colour'])
But the result is not what I want.
Can anyone help?

You can do it using concat() and groupby(). Because you want to sum the corresponding values from apple_wgt_colour and apple_wgt_no_colour, you should use agg() to sum at the end.
You should first concat the two dataframes, then use groupby to aggregate the two columns apple_wgt_colour and apple_wgt_no_colour.
import pandas as pd

# Generating the two dataframes from your example.
df1 = pd.DataFrame(
    {
        'apple_id': [1, 2, 3],
        'apple_wgt_1': [9, 16, 8],
        'apple_wgt_colour': [9, 6, 8],
        'apple_wgt_no_colour': [0, 10, 13],
    }
)
df2 = pd.DataFrame(
    {
        'apple_id': [1, 2, 3],
        'apple_wgt_2': [9, 16, 8],
        'apple_wgt_colour': [9, 6, 8],
        'apple_wgt_no_colour': [0, 10, 13],
    }
)
print(df1)
print(df2)
   apple_id  apple_wgt_1  apple_wgt_colour  apple_wgt_no_colour
0         1            9                 9                    0
1         2           16                 6                   10
2         3            8                 8                   13
   apple_id  apple_wgt_2  apple_wgt_colour  apple_wgt_no_colour
0         1            9                 9                    0
1         2           16                 6                   10
2         3            8                 8                   13
The following code will produce the result you want:
frames = pd.concat([df1, df2]).groupby('apple_id', as_index=False).agg('sum')
# to change column order as you want
frames = frames[['apple_id', 'apple_wgt_1', 'apple_wgt_2', 'apple_wgt_colour', 'apple_wgt_no_colour']]
print(frames)
   apple_id  apple_wgt_1  apple_wgt_2  apple_wgt_colour  apple_wgt_no_colour
0         1          9.0          9.0                18                    0
1         2         16.0         16.0                12                   20
2         3          8.0          8.0                16                   26
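As a side note on why the original pd.merge attempt does not give this result: using apple_wgt_colour and apple_wgt_no_colour as merge keys means those columns are only matched on, never summed. A merge-based sketch (an alternative, assuming both frames contain the same apple_id values) that arrives at the same summed result is:
merged = df1.merge(df2, on='apple_id', suffixes=('_1', '_2'))
for col in ['apple_wgt_colour', 'apple_wgt_no_colour']:
    # sum the per-frame columns and drop the suffixed versions
    merged[col] = merged[col + '_1'] + merged[col + '_2']
    merged = merged.drop(columns=[col + '_1', col + '_2'])
print(merged)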

Related

rolling unique value count in pandas across multiple columns

There are several answers about rolling counts in pandas:
Rolling unique value count in pandas
How to efficiently compute a rolling unique count in a pandas time series?
How do I count unique values across multiple columns?
For one column, I can do:
df[my_col]=df[my_col].rolling(300).apply(lambda x: len(np.unique(x)))
How do I extend this to multiple columns, counting unique values across all values in the rolling window?
Inside a list comprehension, iterate over the rolling windows and, for each window, flatten the values in the required columns, then use set to get the distinct elements:
cols = [...] # define your cols here
df['count'] = [len(set(w[cols].values.ravel())) for w in df.rolling(300)]
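As a quick check of that one-liner on a small frame (a sketch; it assumes pandas >= 1.1, where iterating over a Rolling object is supported; note that iteration yields the partial windows at the start, so the first window - 1 counts are computed over fewer rows instead of being NaN):
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3], 'b': [1, 4, 2, 3]})
cols = ['a', 'b']
# one count per (possibly partial) 3-row window, across both columns
df['count'] = [len(set(w[cols].values.ravel())) for w in df.rolling(3)]
print(df)   # counts: 1, 2, 3, 4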
I took a dataframe as an example (3-row rolling window taking all the columns into account at the same time).
Dataframe for visualization
col1 col2 col3
0 1 1 1
1 1 1 4
2 2 5 2
3 3 3 3
4 3 7 3
5 5 3 9
6 8 8 2
Proposed script for checking
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 3, 3, 5, 8],
                   'col2': [1, 1, 5, 3, 7, 3, 8],
                   'col3': [1, 4, 2, 3, 3, 9, 2]})

df['count'] = df.rolling(3).apply(lambda w: len(set(df.iloc[w.index].to_numpy().flatten())))['col1']
print(df)
Output
col1 col2 col3 count
0 1 1 1 NaN
1 1 1 4 NaN
2 2 5 2 4.0
3 3 3 3 5.0
4 3 7 3 4.0
5 5 3 9 4.0
6 8 8 2 6.0
Another method
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1, 1, 2, 3, 3, 5, 8],
                   'col2': [1, 1, 5, 3, 7, 3, 8],
                   'col3': [1, 4, 2, 3, 3, 9, 2]})

df = df.assign(
    count=df.rolling(3, method='table')
            .apply(lambda d: len(set(d.flatten())), raw=True, engine="numba")
            .iloc[:, -1:]
)
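Note that rolling(..., method='table') requires pandas 1.3 or newer, and engine="numba" requires the numba package to be installed.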

Pandas group by with list applied to multiple columns

I have a table:
ID  Component  Revenue
1   4          10
1   5          20
2   4          15
3   6          30
and I'd like to group by ID, creating a column with a dictionary or list as such:
ID  Grouped
1   [[4, 10], [5, 20]]
2   [4, 15]
3   [6, 30]
I know using
df.groupby(['ID']).Component.apply(list).reset_index()
will do so for one column, but I'm not sure how to do it for multiple columns.
You can use:
(df.groupby(['ID'])[['Component', 'Revenue']]
.apply(lambda d: d.to_numpy().tolist())
.reset_index(name='Grouped')
)
output:
ID Grouped
0 1 [[4, 10], [5, 20]]
1 2 [[4, 15]]
2 3 [[6, 30]]
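If you would rather end up with a dictionary per ID (the question mentions "a dictionary or list"), a similar sketch, assuming Component values are unique within each ID, is:
(df.groupby('ID')[['Component', 'Revenue']]
   .apply(lambda d: dict(zip(d['Component'], d['Revenue'])))
   .reset_index(name='Grouped')
)
which gives {4: 10, 5: 20} for ID 1, {4: 15} for ID 2, and {6: 30} for ID 3.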

Delete rows in apply() function or depending on apply() result

Here I have a working solution, but my question focuses on how to do this the Pandas way. I assume Pandas offers better solutions for this.
I use groupby() and then apply(axis=1) to compare the values in the rows of the groups. While doing this I decide which row to delete.
The rule itself doesn't matter! In this example the rule is that when values in column A differ only by 1 (the values are "near"), the second one is deleted. How the decision is made is not part of the question. There could also be a list of color names, and I would say that darkblue and marineblue are "near" and one of them should be deleted.
This is the initial data frame:
   X   A  B
0  A   9  0   <--- DELETE
1  A  14  1
2  A   8  2
3  A   1  3
4  A  18  4
5  B  10  5
6  B  20  6
7  B  11  7   <--- DELETE
8  B  30  8
Row index 0 should be deleted because its value 9 is near the value 8 in row index 2. The same with row index 7: its value 11 is "near" 10 in row index 5.
This is the code:
#!/usr/bin/env python3
import pandas as pd

df = pd.DataFrame(
    {
        'X': list('AAAAABBBB'),
        'A': [9, 14, 8, 1, 18, 10, 20, 11, 30],
        'B': range(9)
    }
)
print(df)

def mark_near_neighbors(group):
    # I snip the decision process here.
    # Delete 9 because it is "near" 8.
    default_result = pd.Series(
        data=[False] * len(group),
        index=['Delete'] * len(group)
    )
    if group.X.iloc[0] == 'A':
        # the 9
        default_result.iloc[0] = True
    else:
        # the 11
        default_result.iloc[2] = True
    return default_result

result = df.groupby('X').apply(mark_near_neighbors)
result = result.reset_index(drop=True)
print(result)

df = df.loc[~result]
print(df)
So in the end I use a "boolean indexing thing" to solve this
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
dtype: bool
But is there a better way to do this?
Initialize the dataframe
df = pd.DataFrame([
    ['A', 9, 0],
    ['A', 14, 1],
    ['A', 8, 2],
    ['A', 1, 3],
    ['B', 18, 4],
    ['B', 10, 5],
    ['B', 20, 6],
    ['B', 11, 7],
    ['B', 30, 8],
], columns=['X', 'A', 'B'])
Sort the dataframe based on A column
df = df.sort_values('A')
Find the difference between values
df["diff" ] =df.groupby('X')['A'].diff()
Select the rows where the difference is not 1
result = df[df["diff"] != 1.0]
Drop the extra column and sort by index to get the initial dataframe
result.drop("diff", axis=1, inplace=True)
result = result.sort_index()
Sample output
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 B 18 4
5 B 10 5
6 B 20 6
8 B 30 8
IIUC, you can use numpy broadcasting to compare all values within a group. Everything is kept with apply here, as that seems to be what you want:
import numpy as np

def mark_near_neighbors(group, thresh=1):
    a = group.to_numpy().astype(float)
    idx = np.argsort(a)          # sort order; inverted later to restore row order
    b = a[idx]
    d = abs(b - b[:, None])      # pairwise absolute differences via broadcasting
    # mask the diagonal/upper triangle: compare each value only with smaller ones
    d[np.triu_indices(d.shape[0])] = thresh + 1
    # keep a value only if it differs by more than `thresh` from every smaller value
    return pd.Series((d > thresh).all(1)[np.argsort(idx)], index=group.index)
out = df[df.groupby('X')['A'].apply(mark_near_neighbors)]
output:
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
8 B 30 8

using isin across multiple columns

I'm trying to use .isin with the ~ so I can get a list of unique rows back based on multiple columns in 2 data-sets.
So, I have 2 data-sets with 9 rows:
df1 is the bottom and df2 is the top (sorry but I couldn't get it to show both below, it showed 1 then a row of numbers)
Index Serial Count Churn
1 9 5 0
2 8 6 0
3 10 2 1
4 7 4 2
5 7 9 2
6 10 2 2
7 2 9 1
8 9 8 3
9 4 3 5
Index Serial Count Churn
1 10 2 1
2 10 2 1
3 9 3 0
4 8 6 0
5 9 8 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
I would like to get a list of rows from df1 which aren't in df2 based on more than 1 column.
For example, if I base my search on the columns Serial and Count, I wouldn't get Index 1 and 2 back from df1, as they appear in df2 at Index position 6. The same goes for Index position 4 in df1, which appears at Index position 2 in df2, and for Index position 5 in df1, which is at Index position 8 in df2.
The churn column doesn't really matter.
I can get it to work based on only 1 column, but not on more than 1 column.
df2[~df2.Serial.isin(df1.Serial.values)] kinda does what I want, but only on 1 column. I want it to be based on 2 or more.
Index Serial Count Churn
3 9 3 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
One solution is to merge with an indicator:
df1 = pd.DataFrame([[10, 2, 0], [9, 4, 1], [9, 8, 1], [8, 6, 1], [9, 8, 1], [1, 9, 1], [10, 3, 1], [6, 7, 1], [4, 8, 1]], columns=['Serial', 'Count', 'Churn'])
df2 = pd.DataFrame([[9, 5, 1], [8, 6, 1], [10, 2, 1], [7, 4, 1], [7, 9, 1], [10, 2, 1], [2, 9, 1], [9, 8, 1], [4, 3, 1]], columns=['Serial', 'Count', 'Churn'])
# merge with indicator on
df_temp = df1.merge(df2[['Serial', 'Count']].drop_duplicates(), on=['Serial', 'Count'], how='left', indicator=True)
res = df_temp.loc[df_temp['_merge'] == 'left_only'].drop('_merge', axis=1)
Output
Serial Count Churn
1 9 4 1
5 1 9 1
6 10 3 1
7 6 7 1
8 4 8 1
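A side note on the drop_duplicates() call: it matters because df2 contains duplicated (Serial, Count) pairs (the two 10/2 rows), and a left merge against duplicated keys would otherwise duplicate the matching rows of df1.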
I've had a similar issue to solve, and I found the easiest way to deal with it is to create a temporary column that consists of the merged identifier columns and to use isin on this newly created temporary column's values.
A simple function achieving this could be the following:
from functools import reduce

get_temp_col = lambda df, cols: reduce(lambda x, y: x + df[y].astype('str'), cols, "")

def subset_on_x_columns(df1, df2, cols):
    """
    Subsets the input dataframe `df1` based on the missing unique values of input columns
    `cols` of dataframe `df2`.

    :param df1: Pandas dataframe to be subsetted
    :param df2: Pandas dataframe which missing values are going to be
                used to subset `df1` by
    :param cols: List of column names
    """
    df1_temp_col = get_temp_col(df1, cols)
    df2_temp_col = get_temp_col(df2, cols)
    return df1[~df1_temp_col.isin(df2_temp_col.unique())]
Thus, for your case, all that is needed is to execute:
result_df = subset_on_x_columns(df1, df2, ['Serial', 'Count'])
which has the wanted rows:
Index Serial Count Churn
3 9 3 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
The nice thing about this solution is that it scales naturally with the number of columns: all that is needed is to specify, in the input parameter cols, which columns to use as identifiers.
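Another option worth mentioning (a sketch along the same lines, not a drop-in replacement): build an index from the identifier columns and use Index.isin, which avoids the string concatenation and also scales to any number of columns:
keys = ['Serial', 'Count']
# rows of df1 whose (Serial, Count) pair also appears in df2
mask = df1.set_index(keys).index.isin(df2.set_index(keys).index)
result_df = df1[~mask]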

Is it possible to add several columns at once to a pandas DataFrame?

If I want to create a new DataFrame with several columns, I can add all the columns at once -- for example, as follows:
data = {'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(data)
But now suppose farther down the road I want to add a set of additional columns to this DataFrame. Is there a way to add them all simultaneously, as in
additional_data = {'col_3': [8, 9, 10, 11],
                   'col_4': [12, 13, 14, 15]}

# Below is a made-up function of the kind I desire.
df.add_data(additional_data)
I'm aware I could do this:
for key, value in additional_data.items():
    df[key] = value
Or this:
df2 = pd.DataFrame(additional_data, index=df.index)
df = pd.merge(df, df2, on=df.index)
I was just hoping for something cleaner. If I'm stuck with these two options, which is preferred?
Pandas has had the assign method since 0.16.0. You can use it on dataframes like:
In [1506]: df1.assign(**df2)
Out[1506]:
col_1 col_2 col_3 col_4
0 0 4 8 12
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
or, you could directly use the dictionary like
In [1507]: df1.assign(**additional_data)
Out[1507]:
col_1 col_2 col_3 col_4
0 0 4 8 12
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
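Note that assign returns a new DataFrame rather than modifying df1 in place, so assign the result back (df1 = df1.assign(**additional_data)) if you want to keep the new columns.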
What you need is the join function:
df1.join(df2, how='outer')
#or
df1.join(df2) # this works also
Example:
data = {'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7]}
df1 = pd.DataFrame(data)

additional_data = {'col_3': [8, 9, 10, 11],
                   'col_4': [12, 13, 14, 15]}
df2 = pd.DataFrame(additional_data)
df1.join(df2, how='outer')
output:
col_1 col_2 col_3 col_4
0 0 4 8 12
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
If you don't want to create a new DataFrame from additional_data, you can use something like this:
>>> additional_data = [[8, 9, 10, 11], [12, 13, 14, 15]]
>>> df['col3'], df['col4'] = additional_data
>>> df
col_1 col_2 col3 col4
0 0 4 8 12
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
It's also possible to do something like this, but it creates a new DataFrame rather than modifying the existing one in place:
>>> additional_header = ['col_3', 'col_4']
>>> additional_data = [[8, 9, 10, 11], [12, 13, 14, 15]]
>>> df = pd.DataFrame(data=np.concatenate((df.values.T, additional_data)).T, columns=np.concatenate((df.columns, additional_header)))
>>> df
col_1 col_2 col_3 col_4
0 0 4 8 12
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
All you need to do is create the new columns with data from the additional dataframe.
data = {'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7]}
additional_data = {'col_3': [8, 9, 10, 11],
                   'col_4': [12, 13, 14, 15]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(additional_data)
df[df2.columns] = df2
df now looks like:
col_1 col_2 col_3 col_4
0 0 4 8 12
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
Indices from the original dataframe will be used as if you had performed an in-place left join. Data from the original dataframe in columns with a matching name in the additional dataframe will be overwritten.
For example:
data = {'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7]}
additional_data = {'col_2': [8, 9, 10, 11],
                   'col_3': [12, 13, 14, 15]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(additional_data, index=[0, 1, 2, 4])
df[df2.columns] = df2
df now looks like:
col_1 col_2 col_3
0 0 8 12
1 1 9 13
2 2 10 14
3 3 NaN NaN
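One more option is pd.concat along the columns axis; a minimal sketch (it assumes additional_data has the same number of rows as df and reuses df's index so the rows line up):
import pandas as pd

data = {'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7]}
additional_data = {'col_3': [8, 9, 10, 11],
                   'col_4': [12, 13, 14, 15]}

df = pd.DataFrame(data)
# build the new columns on df's index, then concatenate side by side
df = pd.concat([df, pd.DataFrame(additional_data, index=df.index)], axis=1)
print(df)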
