pandas remove equal rows by comparing columns in two dataframes - python

df1 = pd.DataFrame([['tom', 10, 1.2], ['nick', 15, 1.3], ['juli', 14, 1.4]])
df2 = pd.DataFrame([['tom', 10, 1.2], ['nick', 15, 1.3], ['juli', 100, 1.4]])
When I try to compare them and remove the equal rows using the code below
diff = (df1.compare(df2, align_axis=1, keep_equal=True, keep_shape=True)
        .drop_duplicates(keep=False)
        .rename(index={'self': 'df1', 'other': 'df2'}, level=-1))
I am getting all the rows in the result. I want to keep only the rows that contain any unequal values and remove the rest, i.e. only the last row should be present in the output. Please suggest changes.

Assuming you want everything from df1 that does not match df2:
n_columns = len(df1.columns)
df1[(df1 == df2).apply(sum, axis=1).apply(lambda x: x != n_columns)]
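
A more idiomatic sketch of the same idea (not the answerer's exact code), assuming df1 and df2 have identical shape, index and columns:

# True for rows where at least one cell differs between the two frames
mask = (df1 != df2).any(axis=1)
print(df1[mask])  # only the 'juli' row remains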

Related

What is wrong here in colouring the Excel sheet?

Here I need to colour rows red when Age < 13 and green when Age >= 13, but the final 'Report.xlsx' isn't getting coloured. What is wrong here?
import pandas as pd
data = [['tom', 10], ['nick', 12], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df_styled = df.style.applymap(lambda x: 'background:red' if x < 13 else 'background:green', subset=['Age'])
df_styled.to_excel('Report.xlsx',engine='openpyxl',index=False)
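
A likely fix, assuming the problem is that pandas' CSS-to-Excel conversion recognises the explicit background-color property but not the background shorthand (an assumption, since no answer is quoted here):

import pandas as pd

data = [['tom', 10], ['nick', 12], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# spell out 'background-color' so the Styler-to-openpyxl conversion picks it up
df_styled = df.style.applymap(
    lambda x: 'background-color: red' if x < 13 else 'background-color: green',
    subset=['Age'])
df_styled.to_excel('Report.xlsx', engine='openpyxl', index=False)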

Losing column names converting back to dataframe from list

I have created a dataframe and I need to do two operations:
Converting it to a list
Converting the same list back to a dataframe with the original column names.
Issue: I am losing the column names when I first convert to a list, and when I convert back to a dataframe I do not get those column names back.
Please help!
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
#convert df to list
a=df.values.tolist()
#convert back to original dataframe
df1 = pd.DataFrame(a)
print(df1)
Current output: the columns come back with the default integer labels (0 and 1), so I am unable to get the column names.
You need to pass the column names via df.columns; if the index is not the default one, you need to pass it too:
df1 = pd.DataFrame(a, columns=df.columns, index=df.index)
If the original DataFrame has the default RangeIndex:
df1 = pd.DataFrame(a, columns=df.columns)
EDIT:
If you need a similar structure, use DataFrame.to_dict with orient='split', which converts the DataFrame to a dictionary of column names, index and data:
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
d = df.to_dict(orient='split')
print (d)
{'index': [0, 1, 2],
'columns': ['Name', 'Age'],
'data': [['tom', 10], ['nick', 15], ['juli', 14]]}
And to recreate the original DataFrame use:
df2 = pd.DataFrame(d['data'], index=d['index'], columns=d['columns'])
print (df2)
   Name  Age
0   tom   10
1  nick   15
2  juli   14
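
Putting the first suggestion together, a minimal round trip that keeps the column names might look like this (assuming the default RangeIndex):

import pandas as pd

data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

a = df.values.tolist()                     # plain nested list; the names are lost here
df1 = pd.DataFrame(a, columns=df.columns)  # restore the names from the original frame
print(df1)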

Remove first two valid data points from time series data in wide format

I have data where each row is a customer and the quantity they bought. There are 12 columns in the data, from Jan 2018 to Dec 2018 (each column is a month).
Let us say for customer X1 my data starts in June 2018, so the first 5 columns of this row are empty.
For customer X2 my data starts in Aug 2018, so the first 7 columns of this row are empty.
For customer X3 my data starts in Jan 2018, so all of the columns have data points.
For each row (i.e. every customer), I want to delete the first 2 valid data points and make them null.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Jan-18': [np.nan, np.nan, 15],
'Feb-18': [np.nan, np.nan, 20],
'Mar-18': [np.nan, np.nan, 15],
'Apr-18': [np.nan, np.nan, 20],
'May-18': [np.nan, np.nan, 15],
'Jun-18': [2, np.nan, 20],
'Jul-18': [5, np.nan, 15],
'Aug-18': [10, 10, 20],
'Sep-18': [15, np.nan, 15],
'Oct-18': [20, 15, 20],
'Nov-18': [25, 20, 15],
'Dec-18': [30, 20, 20]})
output_df = pd.DataFrame({'Jan-18': [np.nan, np.nan, 15],
'Feb-18': [np.nan, np.nan, 20],
'Mar-18': [np.nan, np.nan, 15],
'Apr-18': [np.nan, np.nan, 20],
'May-18': [np.nan, np.nan, 15],
'Jun-18': [np.nan, np.nan, 20],
'Jul-18': [np.nan, np.nan, 15],
'Aug-18': [10, np.nan, 20],
'Sep-18': [15, np.nan, 15],
'Oct-18': [20, np.nan, 20],
'Nov-18': [25, 20, 15],
'Dec-18': [30, 20, 20]})
So for X1, I delete June and July (both were valid data points, i.e. not null) and the data will start from August.
For X2, I delete August; there was no data for Sept, but there is data for Oct, so I have to delete both August and Oct.
For X3, since I don't know when exactly in the past it became my customer, I don't want to delete anything. [I can calculate the count for every row and filter out rows with count 12 so no deletion happens there.]
I have thought about using count and shape to find the number of null values in every row: df.shape[1] - df.count(axis=1).
But I am not sure how to delete the first 2 valid data points in every row. Any help is appreciated.
# script (after using the provided code to generate `df`):
x, y = np.nonzero(df.notnull().values)
loc = pd.DataFrame({'x': x, 'y': y}).groupby('x').head(2)
xnull, ynull = zip(*loc.groupby('x').filter(lambda p: list(p.y) != [0, 1]).values)
# assign cell by cell; df.iloc[list, list] would blank the whole row/column cross product
for i, j in zip(xnull, ynull):
    df.iat[i, j] = np.nan
First, the x & y index values are obtained for the cells of the dataframe that hold non-null values.
For each x coordinate, the first two y values are taken, forming a dataframe loc with 6 rows.
loc is filtered to remove rows whose y coords are 0 & 1, i.e. where the first non-null values sit in the first two columns of the original dataframe.
The filtered locations are split into the x & y coordinates that will be set to null.
Those cells are then overwritten with null, one (row, column) pair at a time. Note that this mutates the original dataframe, so if the original dataframe is required, ensure that a copy is made & modified instead.
Notes:
The filter would null out the case where a customer made a purchase only in the first or second month & never purchased again.
xnull & ynull come out of zip as tuples of positions; iterating over zip(xnull, ynull) assigns each cell individually. Also, list(p.y) aggregates the values in column y into a list for the comparison.
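
For comparison, a simpler (if less vectorised) sketch of the same rule, working row by row with apply on the original df from the question; this is an alternative illustration, not the poster's code:

def blank_first_two_valid(row):
    # leave fully-populated rows alone (customer predates the data window)
    if row.notna().all():
        return row
    row = row.copy()
    first_two = row.dropna().index[:2]   # column labels of the first two valid points
    row[first_two] = np.nan
    return row

output_df = df.apply(blank_first_two_valid, axis=1)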

Pandas get cell value by row NUMBER (NOT row index) and column NAME

import pandas as pd

data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'], index = [7,3,9])
display(df)
df.iat[0,0]
I'd like to return the Age in the first row, basically something like df.iat[0, 'Age']. Expected result = 10.
Thanks for your help!
df['Age'].iloc[0] works too, similar to what Chris had answered.
Use iloc and Index.get_loc:
df.iloc[0, df.columns.get_loc("Age")]
Output:
10
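
For completeness, both expressions select by position, so the non-default index [7, 3, 9] above is ignored; a quick check with the sample frame:

print(df.iloc[0, df.columns.get_loc("Age")])  # 10
print(df['Age'].iloc[0])                      # 10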

Check which rows of pandas exist in another

I have two Pandas Data Frames of different sizes (at least 500,000 rows in both of them). For simplicity, you can call them df1 and df2 . I'm interested in finding the rows of df1 which are not present in df2. It is not necessary that any of the data frames would be the subset of the other. Also, the order of the rows does not matter.
For example, ith observation in df1 may be jth observation in df2 and I need to consider it as being present (order won't matter). Another important thing is that both data frames may contain null values (so the operation has to work also for that).
A simple example of both data frame would be
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, numpy.nan, 50]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, numpy.nan, 13, 14, 50]})
in this case the solution would be
df3 = pandas.DataFrame(data = {'col1' : [1, 2 ], 'col2' : [10, 11]})
Please note that in reality, both data frames have 15 columns (exactly the same column names and exactly the same data types). Also, I'm using Python 2.7 in a Jupyter Notebook on Windows 7. I have used the Pandas built-in function df1.isin(df2), but it does not give the results I want.
Moreover, I have also seen this question,
but it assumes that one data frame is a subset of the other, which is not necessarily true in my case.
Here's one way:
import pandas as pd, numpy as np
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, np.nan, 50]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, np.nan, 13, 14, 50]})
x = set(map(tuple, df1.fillna(-1).values)) - set(map(tuple, df2.fillna(-1).values))
# {(1.0, 10.0), (2.0, 11.0)}
pd.DataFrame(list(x), columns=['col1', 'col2'])
If you have np.nan data in your result, it'll come through as -1, but you can easily convert it back. This assumes you won't have -1 in your underlying data [if so, replace it with some other impossible value].
The reason for the complication is that np.nan == np.nan is considered False.
Here is one solution
pd.concat([df1,df2.loc[df2.col1.isin(df1.col1)]],keys=[1,2]).drop_duplicates(keep=False).loc[1]
Out[892]:
   col1  col2
0     1  10.0
1     2  11.0
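
Another sketch for this kind of anti-join, not taken from the answers above, uses merge with an indicator column; pandas merge generally treats NaN join keys as equal to each other (unlike SQL), so the (3, NaN) row is matched, but verify this on your pandas version:

# rows of df1 that have no exact match in df2
merged = df1.merge(df2.drop_duplicates(), how='left', indicator=True)
df3 = merged[merged['_merge'] == 'left_only'].drop('_merge', axis=1)
print(df3)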
