I'm trying to use .isin with the ~ so I can get a list of unique rows back based on multiple columns in 2 data-sets.
I have 2 data-sets with 9 rows each.
df2 is the first table below and df1 is the second (sorry, I couldn't get both to display properly):
Index Serial Count Churn
1 9 5 0
2 8 6 0
3 10 2 1
4 7 4 2
5 7 9 2
6 10 2 2
7 2 9 1
8 9 8 3
9 4 3 5
Index Serial Count Churn
1 10 2 1
2 10 2 1
3 9 3 0
4 8 6 0
5 9 8 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
I would like to get a list of rows from df1 which aren't in df2 based on more than 1 column.
For example, if I base my search on the columns Serial and Count, I wouldn't get Index 1 and 2 back from df1, as they appear in df2 at Index position 6. The same goes for Index position 4 in df1, as it appears at Index position 2 in df2, and for Index position 5 in df1, as it is at Index position 8 in df2.
The churn column doesn't really matter.
I can get it to work based on 1 column, but not on more than 1 column.
df2[~df2.Serial.isin(df1.Serial.values)] kinda does what I want, but only on 1 column. I want it to be based on 2 or more.
Index Serial Count Churn
3 9 3 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
One solution is to merge with indicators:
import pandas as pd
df1 = pd.DataFrame([[10, 2, 0], [9, 4, 1], [9, 8, 1], [8, 6, 1], [9, 8, 1], [1, 9, 1], [10, 3, 1], [6, 7, 1], [4, 8, 1]], columns=['Serial', 'Count', 'Churn'])
df2 = pd.DataFrame([[9, 5, 1], [8, 6, 1], [10, 2, 1], [7, 4, 1], [7, 9, 1], [10, 2, 1], [2, 9, 1], [9, 8, 1], [4, 3, 1]], columns=['Serial', 'Count', 'Churn'])
# merge with indicator on
df_temp = df1.merge(df2[['Serial', 'Count']].drop_duplicates(), on=['Serial', 'Count'], how='left', indicator=True)
res = df_temp.loc[df_temp['_merge'] == 'left_only'].drop('_merge', axis=1)
Output
Serial Count Churn
1 9 4 1
5 1 9 1
6 10 3 1
7 6 7 1
8 4 8 1
I had a similar issue to solve. I found the easiest way to deal with it is to create a temporary column that combines the identifier columns, and then use isin on the values of this temporary column.
A simple function achieving this could be the following:
from functools import reduce

# Build one string key per row from the identifier columns. The "|" separator keeps
# distinct value pairs such as (1, 23) and (12, 3) from collapsing into the same key.
get_temp_col = lambda df, cols: reduce(lambda x, y: x + "|" + df[y].astype('str'), cols, "")

def subset_on_x_columns(df1, df2, cols):
    """
    Subsets the input dataframe `df1`, keeping the rows whose values in columns
    `cols` do not occur in dataframe `df2`.
    :param df1: Pandas dataframe to be subsetted
    :param df2: Pandas dataframe whose values are used to exclude rows of `df1`
    :param cols: List of column names
    """
    df1_temp_col = get_temp_col(df1, cols)
    df2_temp_col = get_temp_col(df2, cols)
    return df1[~df1_temp_col.isin(df2_temp_col.unique())]
For your case, all that is needed is to run:
result_df = subset_on_x_columns(df1, df2, ['Serial', 'Count'])
which has the wanted rows:
Index Serial Count Churn
3 9 3 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
The nice thing about this solution is that it is naturally scalable in the number of columns to use, i.e. all that is needed is to specify in the input parameter list cols which columns to use as identifiers.
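One caveat with joining column values into strings is that different value pairs can produce the same key (for example 1 and 23 versus 12 and 3) unless a separator is used. A minimal sketch of an alternative, assuming the identifier columns can be turned into a MultiIndex, compares tuples of values instead. The helper name rows_not_in is hypothetical, and the data is retyped from the question with its 1-based Index:

```python
import pandas as pd

def rows_not_in(df1, df2, cols):
    """Return the rows of df1 whose values in `cols` do not occur together in df2."""
    keys1 = pd.MultiIndex.from_frame(df1[cols])
    keys2 = pd.MultiIndex.from_frame(df2[cols])
    # isin on a MultiIndex compares whole tuples, so no string key is needed
    return df1[~keys1.isin(keys2)]

# df1 is the second table in the question, df2 the first
df1 = pd.DataFrame({'Serial': [10, 10, 9, 8, 9, 1, 10, 6, 4],
                    'Count': [2, 2, 3, 6, 8, 9, 3, 7, 8],
                    'Churn': [1, 1, 0, 0, 0, 1, 1, 1, 0]},
                   index=range(1, 10))
df2 = pd.DataFrame({'Serial': [9, 8, 10, 7, 7, 10, 2, 9, 4],
                    'Count': [5, 6, 2, 4, 9, 2, 9, 8, 3],
                    'Churn': [0, 0, 1, 2, 2, 2, 1, 3, 5]},
                   index=range(1, 10))

result = rows_not_in(df1, df2, ['Serial', 'Count'])
print(result)
```

This keeps the original index of df1 and the original dtypes, since nothing is cast to strings.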
I have two data frames, as below:
Data Frame 1
Data Frame 2
I would like to merge these two data frames into something like the one below:
I tried to use pd.merge and join as below:
frames = pd.merge(df1, df2, how='outer', on=['apple_id','apple_wgt_colour', 'apple_wgt_no_colour'])
But the result looks like this:
Can anyone help?
You can do it by using concat() and groupby(). Because you want to sum the corresponding values from apple_wgt_colour and apple_wgt_no_colour, you should use agg() to sum at the end.
First concat the two dataframes, then use groupby to aggregate the two columns, apple_wgt_colour and apple_wgt_no_colour.
import pandas as pd
# Generate the two example dataframes.
df1 = pd.DataFrame(
{
'apple_id': [1, 2, 3],
'apple_wgt_1': [9, 16, 8],
'apple_wgt_colour': [9, 6, 8],
'apple_wgt_no_colour': [0, 10, 13],
}
)
df2 = pd.DataFrame(
{
'apple_id': [1, 2, 3],
'apple_wgt_2': [9, 16, 8],
'apple_wgt_colour': [9, 6, 8],
'apple_wgt_no_colour': [0, 10, 13],
}
)
print(df1)
print(df2)
apple_id apple_wgt_1 apple_wgt_colour apple_wgt_no_colour
0 1 9 9 0
1 2 16 6 10
2 3 8 8 13
apple_id apple_wgt_2 apple_wgt_colour apple_wgt_no_colour
0 1 9 9 0
1 2 16 6 10
2 3 8 8 13
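The following code produces the result you want:
# passing 'sum' as a string avoids the warning newer pandas versions emit for the builtin sum
frames = pd.concat([df1, df2]).groupby('apple_id', as_index=False).agg('sum')
# change the column order as you want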
frames = frames[['apple_id', 'apple_wgt_1', 'apple_wgt_2', 'apple_wgt_colour', 'apple_wgt_no_colour']]
print(frames)
apple_id apple_wgt_1 apple_wgt_2 apple_wgt_colour apple_wgt_no_colour
0 1 9.0 9.0 18 0
1 2 16.0 16.0 12 20
2 3 8.0 8.0 16 26
So I have a dataframe as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 3, 6], [7, 2, 9]]),
columns=['a', 'b', 'c'])
df
Output:
   a  b  c
0  1  2  3
1  4  3  6
2  7  2  9
I want to select or keep the two columns with the highest values in the last row. What is the best way to approach this?
So in fact I just want to select or keep column 'a' due to value 7 and column 'c' due to value 9.
Try:
df = df[df.iloc[-1].nlargest(2).index]
Output:
c a
0 3 1
1 6 4
2 9 7
If you want to keep original column sequence as well, you can use Index.intersection() together with .nlargest(), as follows:
df[df.columns.intersection(df.iloc[-1].nlargest(2).index, sort=False)]
Result:
a c
0 1 3
1 4 6
2 7 9
I have two dataframes
df_train_1 with shape (70652, 4)
and
df_test_1 with shape (24581, 4)
I am trying to concat them with df_train_1 on top.
I have tried the following two methods:
df_combined = df_train_1.append(df_test_1)
df_combined = pd.concat([df_train_1, df_test_1])
When I call df_combined.title[0], I get both [0] values from the original dataframes. Can someone point me in the direction of how to avoid this, please?
Look at the example in the pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df
A B
0 1 2
1 3 4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
A B
0 1 2
1 3 4
0 5 6
1 7 8
You will see that each dataframe keeps its own index, so the labels repeat.
As the comment suggested, use ignore_index=True to renumber the index from 0:
df.append(df2, ignore_index=True)
A B
0 1 2
1 3 4
2 5 6
3 7 8
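Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the same fix goes through pd.concat, which also accepts ignore_index=True. A minimal sketch, using small stand-in frames (the title values here are made up for illustration):

```python
import pandas as pd

# Stand-ins for df_train_1 and df_test_1; only the 'title' column is modelled
df_train_1 = pd.DataFrame({'title': ['train a', 'train b']})
df_test_1 = pd.DataFrame({'title': ['test a', 'test b']})

# Without ignore_index each frame keeps its own labels, so label 0 occurs twice
stacked = pd.concat([df_train_1, df_test_1])
print(stacked.title[0])        # a Series with two rows

# With ignore_index=True the result is relabelled 0..n-1, train rows first
df_combined = pd.concat([df_train_1, df_test_1], ignore_index=True)
print(df_combined.title[0])    # a single value
```

If the frames have already been stacked, calling reset_index(drop=True) on the result gives the same relabelling.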
After looking here and here and in the documentation, I still cannot find a way to select rows from a DataFrame according to all these criteria:
Return rows in an order given from a list of values from a given column
Return repeated rows (associated with repeated values in the list)
Preserve the original indices
Ignore values of the list not present in the DataFrame
As an example, let
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
and let
list_of_values = [3, 4, 6, 4, 3, 8]
Then I would like to get the following DataFrame:
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
How can I accomplish that? Zero's answer looks promising as it is the only one I found which preserves the original index, but it does not work with repetitions. Any ideas about how to modify/generalize it?
We have to preserve the index by assigning it as a column first, so we can set_index after the merging:
list_of_values = [3, 4, 6, 4, 3, 8]
df2 = pd.DataFrame({'A': list_of_values, 'order': range(len(list_of_values))})
dfn = (
df.assign(idx=df.index)
.merge(df2, on='A')
.sort_values('order')
.set_index('idx')
.drop('order', axis=1)
)
A B
idx
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
If you want to get rid of the index name (idx), use rename_axis:
dfn = dfn.rename_axis(None)
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
Here's a way to do that using merge:
list_df = pd.DataFrame({"A": list_of_values, "order": range(len(list_of_values))})
pd.merge(list_df, df, on="A").sort_values("order").drop("order", axis=1)
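The output is as follows. Note that the index labels come from the merged frame rather than from df, and may differ between pandas versions: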
A B
0 3 3
2 4 5
4 6 2
3 4 5
1 3 3
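For comparison, here is a minimal sketch that avoids merge altogether, assuming it is acceptable to scan column A once per list entry (fine for small inputs, slow for very long lists). It collects the matching index labels in list order and then selects them with .loc; the variable names are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
list_of_values = [3, 4, 6, 4, 3, 8]

# For each requested value, in order, collect the index labels of matching rows.
# Values that are absent from df (8 here) simply contribute no labels.
labels = [label
          for value in list_of_values
          for label in df.index[df['A'] == value]]

# .loc with a list of labels returns the rows in that order, repeats included
result = df.loc[labels]
print(result)
```

Because the rows are selected by label, the original index is preserved without a helper column.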
I am trying to create a new df which summarises my key information, by taking that information from 3 (say) other dataframes.
dfdate = {'x1': [2, 4, 7, 5, 6],
'x2': [2, 2, 2, 6, 7],
'y1': [3, 1, 4, 5, 9]}
dfdate = pd.DataFrame(dfdate, index=range(0, 5))
dfqty = {'x1': [1, 2, 6, 6, 8],
'x2': [3, 1, 1, 7, 5],
'y1': [2, 4, 3, 2, 8]}
dfqty = pd.DataFrame(dfqty, index=range(0, 5))
dfprices = {'x1': [0, 2, 2, 4, 4],
'x2': [2, 0, 0, 3, 4],
'y1': [1, 3, 2, 1, 3]}
dfprices = pd.DataFrame(dfprices, index=range(0, 5))
Let us say the above 3 dataframes are my data. Say, some dates, qty, and prices of goods. My new df is to be constructed from the above data:
rng = len(dfprices.columns)*len(dfprices.index) # This is the len of new df
dfnew = pd.DataFrame(np.nan, index=range(0, rng), columns=['Letter', 'Number', 'date', 'qty', 'price'])
Now, this is where I struggle to piece my stuff together. I am trying to take all the data in dfdate and put it into a single column in the new df, and the same with dfqty and dfprices (so each 5x3 matrix essentially becomes a vector of 15 values and is placed into the new df).
As well as that, I need a couple of columns in dfnew as identifiers, taken from the column names of the old dfs.
I've tried for loops but to no avail, and I don't know how to convert a df to a series. My desired output is:
dfnew:
'Lettercol','Numbercol', 'date', 'qty', 'price'
0 X 1 2 1 0
1 X 1 4 2 2
2 X 1 7 6 2
3 X 1 5 6 4
4 X 1 6 8 4
5 X 2 2 3 2
6 X 2 2 1 0
7 X 2 2 1 0
8 X 2 6 7 3
9 X 2 7 5 4
10 Y 1 3 2 1
11 Y 1 1 4 3
12 Y 1 4 3 2
13 Y 1 5 2 1
14 Y 1 9 8 3
where the numbers 0-14 are the index.
letter = letter from col header in DFs
number = number from col header in DFs
next 3 columns are data from the orig df's
(don't ask why the original data is in that funny format :)
thanks so much. my last Q wasn't well received so have tried to make this one better, thanks
Use:
#list of DataFrames
dfs = [dfdate, dfqty, dfprices]
#list comprehension with reshape
comb = [x.unstack() for x in dfs]
#join together
df = pd.concat(comb, axis=1, keys=['date', 'qty', 'price'])
#remove second level of MultiIndex and index to column
df = df.reset_index(level=1, drop=True).reset_index().rename(columns={'index':'col'})
#extract all values without first by indexing [1:] and first letter by [0]
df['Number'] = df['col'].str[1:]
df['Letter'] = df['col'].str[0]
cols = ['Letter', 'Number', 'date', 'qty', 'price']
#change order of columns
df = df.reindex(columns=cols)
print (df)
Letter Number date qty price
0 x 1 2 1 0
1 x 1 4 2 2
2 x 1 7 6 2
3 x 1 5 6 4
4 x 1 6 8 4
5 x 2 2 3 2
6 x 2 2 1 0
7 x 2 2 1 0
8 x 2 6 7 3
9 x 2 7 5 4
10 y 1 3 2 1
11 y 1 1 4 3
12 y 1 4 3 2
13 y 1 5 2 1
14 y 1 9 8 3
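An alternative sketch of the same reshape uses melt instead of unstack, assuming all three frames share the same shape and column order so that their melted rows line up by position. The names frames, long_parts and out are only illustrative, and the letter is upper-cased to match the desired output in the question:

```python
import pandas as pd

dfdate = pd.DataFrame({'x1': [2, 4, 7, 5, 6], 'x2': [2, 2, 2, 6, 7], 'y1': [3, 1, 4, 5, 9]})
dfqty = pd.DataFrame({'x1': [1, 2, 6, 6, 8], 'x2': [3, 1, 1, 7, 5], 'y1': [2, 4, 3, 2, 8]})
dfprices = pd.DataFrame({'x1': [0, 2, 2, 4, 4], 'x2': [2, 0, 0, 3, 4], 'y1': [1, 3, 2, 1, 3]})

frames = {'date': dfdate, 'qty': dfqty, 'price': dfprices}

# melt() stacks each frame column by column into a long (col, value) layout
long_parts = [frame.melt(var_name='col', value_name=name)
              for name, frame in frames.items()]

# Start from the column labels, then attach one value column per source frame
out = long_parts[0][['col']].copy()
for name, part in zip(frames, long_parts):
    out[name] = part[name]

# Split the old column header into its letter and its number
out.insert(0, 'Letter', out['col'].str[0].str.upper())
out.insert(1, 'Number', out['col'].str[1:].astype(int))
out = out.drop(columns='col')
print(out)
```

Here Number is stored as an integer, whereas slicing the header with .str[1:] alone leaves it as a string.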