Checking for missing rows in pandas dataframe based on subsetting columns - python

I have two dataframes from two sources that should be the same.
I would like to iterate over each combination of a subset of the columns and count how many rows differ between the two dataframes for that combination.
Right now I can do this manually, but I would like a function or script that automates it. Any ideas?
import pandas as pd

d = {'col1': [1, 2, 3], 'col2': [3, 4, 6], 'col3': [3, 4, 7]}
d1 = {'col1': [1, 2, 3], 'col2': [3, 4, 6], 'col3': [3, 4, 3]}
df = pd.DataFrame(data=d)
df1 = pd.DataFrame(data=d1)
Check all rows:
merged = df.merge(df1, indicator=True, how='outer')
rows_missing_from_df = merged[merged['_merge'] == 'right_only']
rows_missing_from_df.shape
(1, 4)
Check rows for just col1 and col2:
df_select = df[['col1', 'col2']]
df1_select = df1[['col1', 'col2']]
merged = df_select.merge(df1_select, indicator=True, how='outer')
rows_missing_from_df = merged[merged['_merge'] == 'right_only']
rows_missing_from_df.shape
(0, 3)
Check rows for just col2 and col3:
df_select_1 = df[['col2', 'col3']]
df1_select_1 = df1[['col2', 'col3']]
merged = df_select_1.merge(df1_select_1, indicator=True, how='outer')
rows_missing_from_df = merged[merged['_merge'] == 'right_only']
rows_missing_from_df.shape
(1, 3)
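One way to automate this is a small helper that loops over itertools.combinations and reuses the merge-indicator check from above. A minimal sketch (the function name and the pairwise default are just illustrative), reusing df and df1 as defined above:

from itertools import combinations

def count_missing_rows(df, df1, n_cols=2):
    # For every n_cols-sized column combination, count the rows of df1
    # that have no match in df (the 'right_only' rows of the outer merge)
    results = {}
    for cols in combinations(df.columns, n_cols):
        merged = df[list(cols)].merge(df1[list(cols)], indicator=True, how='outer')
        results[cols] = int((merged['_merge'] == 'right_only').sum())
    return results

print(count_missing_rows(df, df1))
# {('col1', 'col2'): 0, ('col1', 'col3'): 1, ('col2', 'col3'): 1}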

Related

How to fill column with condition in polars

I would like to add a new column whose value depends on a condition on another column.
In pandas, I do it like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df['c'] = df['a']
df.loc[df['b']==4, 'c'] = df['b']
The result is
a  b  c
1  3  1
2  4  4
Could you teach me how to do this with polars?
Use when/then/otherwise
import polars as pl

df = pl.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.with_columns(
    pl.when(pl.col("b") == 4).then(pl.col('b')).otherwise(pl.col('a')).alias("c")
)
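Note that with_columns returns a new DataFrame rather than modifying df in place, so assign the result back (df = df.with_columns(...)) if you want to keep the new column; with the sample data above, c comes out as [1, 4], matching the pandas result.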

Pandas Efficient Filtering: Same filter condition on multiple columns

Say I have the data below:
df = pd.DataFrame({'col1': [1, 2, 1],
                   'col2': [2, 4, 3],
                   'col3': [3, 6, 5],
                   'col4': [4, 8, 7]})
Is there a way to use list comprehensions to filter data efficiently? For example, if I wanted to find all cases where col2 was even OR col3 was even OR col4 was even, is there a simpler way than just writing this?
df[(df['col2'] % 2 == 0) | (df['col3'] % 2 == 0) | (df['col4'] % 2 == 0)]
It would be nice if I could pass in a list of columns and the condition to check.
df[(df[cols] % 2 == 0).any(axis=1)]
where cols is your list of columns
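For example, with the question's data and cols as a plain Python list:

cols = ['col2', 'col3', 'col4']

# keep rows where any of the selected columns holds an even value
print(df[(df[cols] % 2 == 0).any(axis=1)])
#    col1  col2  col3  col4
# 0     1     2     3     4
# 1     2     4     6     8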

Create Series by applying function over every df column

Given a DataFrame, I'd like to count the number of NaN values in each column, to show the proportion as a histogram.
I've come up with
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
nan_dict = {}
for col in df:
    nan_dict[col] = df[col].value_counts(dropna=False)[0]
and then build the histogram from the dict. This seems really cumbersome; also, it fails when there are no NaNs.
Is there a way I could apply value_counts along all columns so that I get back a Series with NaN values per column?
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
print(dict(zip(df.columns, df.isna().sum())))
Prints:
{'col1': 0, 'col2': 0}
For the dataframe:
   col1  col2
0     1   3.0
1     2   NaN
Prints:
{'col1': 0, 'col2': 1}
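Since the original ask was a Series (and then a histogram of the proportions), note that df.isna().sum() already is a Series indexed by column name, so you can plot it directly; a minimal sketch (plotting assumes matplotlib is installed):

nan_counts = df.isna().sum()          # Series: NaN count per column
nan_fraction = nan_counts / len(df)   # proportion of NaNs per column
nan_fraction.plot(kind='bar')         # quick bar chart of those proportions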

Add total row to dataframe with multi level index

Consider the follow dataframe with a multi level index:
arrays = [np.array(['bar', 'bar', 'baz']),
          np.array(['one', 'two', 'one'])]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'],
                  index=arrays)
All I'm trying to do is add a 'Totals' row to the bottom (12, 15, 18 would be the expected values here). It seems like I need to calculate the totals and then append them to the dataframe, but I just can't get it to work while preserving the multi-level index (which I want to do). Thanks in advance!
This does not preserve your multi-level index, but it does append a new row called "total" that contains column sums:
import pandas as pd
import numpy as np

arrays = [np.array(['bar', 'bar', 'baz']),
          np.array(['one', 'two', 'one'])]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'],
                  index=arrays)

# DataFrame.append was removed in pandas 2.0, so build the 'total' row with pd.concat
pd.concat([df, df.sum().rename('total').to_frame().T]).assign(total=lambda d: d.sum(1))
I figured it out. Thanks for the responses. Those plus a little more education about indices in Python got me to something that worked.
# Create df of totals
df2 = pd.DataFrame(df.sum())
# Transpose df
df2 = df2.T
# Reset index
df2 = df2.reset_index()
# Add additional column so the columns of df2 match the columns of df
df2['Index'] = "zTotal"
# Set indices to match df indices
df2 = df2.set_index(['index', 'Index'])
# Concat df and df2
df3 = pd.concat([df, df2])
# Sort in desired order
df3 = df3.sort_index(ascending=[False,True])
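For completeness, a shorter alternative that does keep a two-level index (the 'Total' label and the empty second-level value are arbitrary choices, not from the original post):

# build a one-row frame of the column sums, give it a matching two-level index,
# then concatenate so the MultiIndex is preserved
totals = df.sum().to_frame().T
totals.index = pd.MultiIndex.from_tuples([('Total', '')])
df_with_total = pd.concat([df, totals])
print(df_with_total)   # the Total row holds 12, 15, 18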

Can I use pd.drop in method chaining to drop specific rows?

I was wondering if I can use the pandas .drop method to drop rows when chaining methods to construct a data frame.
Dropping rows is straightforward once the data frame exists:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
print(df1)
# drop the entries that match "2"
df1 = df1[df1['A'] !=2]
print(df1)
However, I would like to do this while I am creating the data frame:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       # .drop(lambda x: x['A'] != 2)
)
print(df2)
The commented line does not work, but maybe there is a correct way of doing this. Grateful for any input.
Use DataFrame.loc with a callable:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .loc[lambda x: x['AA'] != 2]
)
Or DataFrame.query:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .query("AA != 2")
)
print(df2)
   AA  B
0   1  5
2   3  3
You can use DataFrame.apply with DataFrame.dropna (this one also needs numpy imported, and note that it converts the values to float):
import numpy as np

df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .apply(lambda x: x if x['AA'] != 2 else np.nan, axis=1)
       .dropna()
)
print(df2)
    AA    B
0  1.0  5.0
2  3.0  3.0
You could also add a .loc selection at the end of the chained definition.
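If you specifically want drop itself inside the chain (as the title asks), one option is to go through pipe, since drop needs the index labels of the rows to remove; a minimal sketch:

df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .pipe(lambda d: d.drop(index=d.index[d['AA'] == 2]))
)
print(df2)
#    AA  B
# 0   1  5
# 2   3  3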
