Why does df.isnull().sum() work the way it does? - python

When I do df.isnull().sum(), I get the count of null values in each column. But the default axis for .sum() is None, or 0, which should mean summing across the columns.
Why does .sum() calculate the sum down the columns, instead of across the rows, when the default says to sum along axis=0?
Thanks!

I'm seeing the opposite of the behavior you describe:
Sums across the columns
In [3309]: df1.isnull().sum(1)
Out[3309]:
0     0
1     1
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
dtype: int64
Sums down the columns
In [3310]: df1.isnull().sum()
Out[3310]:
date        0
variable    1
value       0
dtype: int64

Uh... this is not the behavior I'm seeing. Let's look at this small example.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[np.nan, np.nan, 3],'B':[1,1,3]}, index =[*'abc'])
print(df)
print(df.isnull().sum())
print(df.sum())
Note the columns are uppercase 'A' and 'B', and the row index labels are lowercase.
Output:
     A  B
a  NaN  1
b  NaN  1
c  3.0  3

A    2
B    0
dtype: int64

A    3.0
B    5.0
dtype: float64
Per the docs:
axis : {index (0), columns (1)}
    Axis for the function to be applied on.

The axis parameter specifies the axis that gets collapsed, not the axis whose labels appear in the result: with axis=0 the function is applied down the index, producing one value per column.
Unfortunately, the pandas documentation for sum doesn't currently make this clear, but the documentation for count does:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html
Parameters
axis : {0 or 'index', 1 or 'columns'}, default 0
If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.
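A minimal sketch of the distinction, using a small made-up frame (not taken from the question above):

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [np.nan, 1.0], 'B': [2.0, 3.0]})

# axis=0 (the default): collapse the index, one result per column
print(df.isnull().sum(axis=0))   # A -> 1, B -> 0

# axis=1: collapse the columns, one result per row
print(df.isnull().sum(axis=1))   # row 0 -> 1, row 1 -> 0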

Related

How to count uniques for each column in a pandas dataframe?

I have code below that creates a summary table of missing values in each column of my data frame. I would like to build a similar table that counts unique values, but DataFrame does not have a unique() method; only each column does.
def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    return mis_val_table_ren_columns
(source: https://stackoverflow.com/a/39734251/7044473)
How can I accomplish the same for unique values?
You can use the nunique() method to get the unique count for all columns:
df = pd.DataFrame(np.random.randint(0, 3, (4, 3)))
print(df)

   0  1  2
0  2  0  2
1  1  2  1
2  1  2  2
3  1  1  2

count = df.nunique()
print(count)

0    2
1    3
2    2
dtype: int64
You can create a series of unique value counts using the pd.unique function. For example:
>>> df = pd.DataFrame(np.random.randint(0, 3, (4, 3)))
>>> print(df)
   0  1  2
0  2  0  2
1  1  2  1
2  1  2  2
3  1  1  2
>>> pd.Series({col: len(pd.unique(df[col])) for col in df})
0    2
1    3
2    2
dtype: int64
If you actually want the number of times each value appears in each column, you can do a similar thing with pd.value_counts:
>>> pd.DataFrame({col: pd.value_counts(df[col]) for col in df}).fillna(0)
     0  1    2
0  0.0  1  0.0
1  3.0  1  1.0
2  1.0  2  3.0
This is not exactly what you asked for, but may be useful for your analysis.
def diversity_percentage(df, columns):
    """
    This function returns the number of different elements in each column as a
    percentage of the total elements in the group.
    A low value indicates there are many repeated elements.
    Example 1: a value of 0 indicates all values are the same.
    Example 2: a value of 100 indicates all values are different.
    """
    diversity = dict()
    for col in columns:
        diversity[col] = len(df[col].unique())
    diversity_series = pd.Series(diversity)
    return (100 * diversity_series / len(df)).sort_values()
>>> diversity_percentage(df, selected_columns)
operationdate 0.002803
payment 1.076414
description 16.933901
customer_id 17.536581
customer_name 48.895554
customer_email 62.129282
token 68.290632
id 100.000000
transactionid 100.000000
dtype: float64
However, you can always return diversity_series directly and obtain just the counts.
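A minimal sketch of that count-only variant (the helper name diversity_count is made up here, not part of the original answer):

def diversity_count(df, columns):
    # Number of distinct values per column, without the percentage conversion
    return pd.Series({col: df[col].nunique() for col in columns}).sort_values()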

Pandas MultiIndex Aggregation

I am trying to do some aggregation on a multi-index DataFrame based on a DatetimeIndex generated from pandas.date_range.
My DatetimeIndex looks like this:
DatetimeIndex(['2000-05-30', '2000-05-31', '2000-06-01' ... '2001-1-31'])
And my multi-index DataFrame looks like this:
               value
date       id
2000-05-31 1       0
           2       1
           3       1
2000-06-30 2       1
           3       0
           4       0
2000-07-30 2       1
           4       0
           1       0
2002-09-30 1       1
           3       1
The dates in the DatetimeIndex may or may not be in the date index.
I need to retrieve all the id such that the percentage of value==1 is greater than or equal to some decimal threshold e.g. 0.6 for all the rows where the date for that id is in the DatetimeIndex.
For example if the threshold is 0.5, then the output should be [2, 3] or some DataFrame containing 2 and 3.
1 does not meet the requirement because 2002-09-30 is not in the DatetimeIndex.
I have a solution with loops and dictionaries to keep track of how often value==1 for each id, but it runs very slowly.
How can I utilize pandas to perform this aggregation?
Thank you.
You can use:
# define the date range
rng = pd.date_range('2000-05-30', '2000-07-01')

# filter rows whose 'date' level falls inside the range
df = df[df.index.get_level_values('date').isin(rng)]

# the mean of 'value' per id is the fraction of rows where value == 1
s = df.groupby('id')['value'].mean()
print(s)

id
1    0.0
2    1.0
3    0.5
4    0.0
Name: value, dtype: float64

# keep the ids whose fraction meets the threshold
a = s.index[s >= 0.5].tolist()
print (a)
[2, 3]

Select last observation per group

Someone asked how to select the first observation per group in a pandas DataFrame. I am interested in both the first and the last, and I don't know an efficient way of doing it except writing a for loop.
I am going to modify his example to show what I am looking for.
Basically, there is a df like this:
group_id
1
1
1
2
2
2
3
3
3
I would like to have a variable that indicates the last observation in a group:
group_id indicator
1 0
1 0
1 1
2 0
2 0
2 1
3 0
3 0
3 1
Using pandas.shift, you can do something like:
df['group_indicator'] = df.group_id != df.group_id.shift(-1)
(or
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
if it's actually important for you to have it as an integer.)
Note:
for large datasets, this should be much faster than a list comprehension (not to mention loops).
As Alexander notes, this assumes the DataFrame is sorted as it is in the example.
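To see why the comparison flags the last row of each group, it can help to print the shifted column next to the original; a minimal sketch using the example data above:

import pandas as pd

df = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3]})

# shift(-1) pulls each value up one row, so every row is compared with the
# group_id of the row below it; the last row of each group (and the final
# row, where the shift yields NaN) differs and therefore gets True.
print(df.group_id.shift(-1))
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
print(df)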
First, we'll create a list of the index locations containing the last element of each group. You can see the elements of each group as follows:
>>> df.groupby('group_id').groups
{1: [0, 1, 2], 2: [3, 4, 5], 3: [6, 7, 8]}
We use a list comprehension to extract the last index location (idx[-1]) of each of these group index values.
We assign the indicator to the dataframe by using a list comprehension and a ternary operator (i.e. 1 if condition else 0), iterating across each element in the index and checking if it is in the idx_last_group list.
idx_last_group = [idx[-1] for idx in df.groupby('group_id').groups.values()]
df['indicator'] = [1 if idx in idx_last_group else 0 for idx in df.index]
>>> df
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Use the .tail method:
df = df.groupby('group_id').tail(1)
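Note that .tail(1) returns the last row of each group rather than an indicator column; a minimal sketch of turning its index into the requested 0/1 flag (using the example data above):

import pandas as pd

df = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3]})

# index labels of the last row in each group
last_idx = df.groupby('group_id').tail(1).index

# 1 for those rows, 0 everywhere else
df['indicator'] = df.index.isin(last_idx).astype(int)
print(df)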
You can group by 'group_id' and call nth(-1) to get the last entry for each group, then use its index to set 'indicator' to 1 for those rows, and fill the rest with 0 using fillna:
In [21]:
df.loc[df.groupby('group_id')['group_id'].nth(-1).index,'indicator'] = 1
df['indicator'].fillna(0, inplace=True)
df
Out[21]:
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Here is the output from the groupby:
In [22]:
df.groupby('group_id')['group_id'].nth(-1)
Out[22]:
2 1
5 2
8 3
Name: group_id, dtype: int64
One line:
data['indicator'] = (data.groupby('group_id').cumcount() == data.groupby('group_id')['any_other_column'].transform('size') - 1).astype(int)
What we do is check whether the cumulative count (which returns a vector the same size as the dataframe) is equal to the "size of the group - 1", which we calculate using transform so it also returns a vector the same size as the dataframe.
We need to use some other column for the transform because it won't let you transform the .groupby() column, but this can literally be any other column; it won't be affected, since it's only used to calculate the new indicator. Use .astype(int) to make it binary, and you're done.
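A minimal sketch of the intermediate pieces, using the example data above plus a dummy column to transform on (the column name 'other' is made up here):

import pandas as pd

data = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                     'other': range(9)})

cumcount = data.groupby('group_id').cumcount()                # 0, 1, 2 within each group
sizes = data.groupby('group_id')['other'].transform('size')   # 3 for every row here

data['indicator'] = (cumcount == sizes - 1).astype(int)
print(data)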

how to transpose multiple level pandas dataframe based only on outer index

Below is my dataframe with two-level indexing. I want only the outer index to be transposed into columns. My desired output would be a 2x2 dataframe instead of the 4x1 dataframe I have now. Can any of you please help?
        0
0 0   232
  1  3453
1 0   443
  1  3241
Given that you have the MultiIndex, you can use unstack() on level 0.
import pandas as pd
import numpy as np

index = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (1, 1)])
df = pd.DataFrame([[1], [2], [3], [4]], index=index, columns=[0])
print(df.unstack(level=0))

   0
   0  1
0  1  3
1  2  4
One way to do this would be to reset the index and then pivot the table, using level_1 of the index as the new index, level_0 as the columns, and 0 as the values. Example -
df.reset_index().pivot(index='level_1',columns='level_0',values=0)
Demo -
In [66]: index = pd.MultiIndex.from_tuples([(0,0),(0,1),(1,0),(1,1)])
In [67]: df = pd.DataFrame([[1],[2],[3],[4]] , index=index, columns=[0])
In [68]: df
Out[68]:
     0
0 0  1
  1  2
1 0  3
  1  4
In [69]: df.reset_index().pivot(index='level_1',columns='level_0',values=0)
Out[69]:
level_0  0  1
level_1
0        1  3
1        2  4
Later on, if you want, you can set the .name attribute of the index as well as the columns to an empty string (or whatever you want) if you don't want the level_* labels there.
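For example, a minimal sketch of clearing those labels on the pivoted result (storing it in a variable called out, a name chosen only for illustration):

out = df.reset_index().pivot(index='level_1', columns='level_0', values=0)

# drop the leftover 'level_1' / 'level_0' labels from the axes
out.index.name = None
out.columns.name = None
print(out)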

Group by value of sum of columns with Pandas

I got lost in the pandas docs and features trying to figure out a way to group a DataFrame by the values of the sums of its columns.
For instance, let's say I have the following data:
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
   a  b  c  d
0  1  0  1  2
1  0  1  0  3
2  0  0  0  4
I would like columns a, b and c to be grouped, since they all have a sum equal to 1. The resulting DataFrame would have column labels equal to the sums of the columns it combined. Like this:
   1  9
0  2  2
1  1  3
2  0  4
Any idea to put me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
   1  9
0  2  2
1  1  3
2  0  4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You want to group the columns (axis=1), and take the sum of each group.
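A minimal sketch of the intermediate grouper, plus a transpose-based equivalent for pandas versions where groupby no longer accepts axis=1 (that restriction is an assumption about your version):

import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [1, 0, 0], 'd': [2, 3, 4]})

print(df.sum())   # a -> 1, b -> 1, c -> 1, d -> 9: these sums become the group keys

# equivalent without axis=1: transpose, group the rows by those sums,
# sum within each group, then transpose back
print(df.T.groupby(df.sum()).sum().T)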
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [1, 0, 0], 'd': [2, 3, 4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print(df.groupby('totals').sum().transpose())
# totals  1  9
# 0       2  2
# 1       1  3
# 2       0  4
