I have the code below, which creates a summary table of the missing values in each column of my data frame. I would like to build a similar table to count unique values, but DataFrame has no unique() method; only individual columns do.
def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    return mis_val_table_ren_columns
(source: https://stackoverflow.com/a/39734251/7044473)
How can I accomplish the same for unique values?
You can use the nunique() method to get the count of unique values in every column:
df = pd.DataFrame(np.random.randint(0, 3, (4, 3)))
print(df)
   0  1  2
0  2  0  2
1  1  2  1
2  1  2  2
3  1  1  2
count = df.nunique()
print(count)
0    2
1    3
2    2
dtype: int64
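Applying this back to your missing-values helper, a minimal sketch of the analogous summary table could look like this (the function and column names are illustrative):
def unique_values_table(df):
    # nunique() counts distinct values per column (NaN excluded by default)
    uniq = df.nunique()
    uniq_percent = 100 * uniq / len(df)
    uniq_table = pd.concat([uniq, uniq_percent], axis=1)
    return uniq_table.rename(columns={0: 'Unique Values', 1: '% of Total Values'})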
You can create a series of unique value counts using the pd.unique function. For example:
>>> df = pd.DataFrame(np.random.randint(0, 3, (4, 3)))
>>> print(df)
   0  1  2
0  2  0  2
1  1  2  1
2  1  2  2
3  1  1  2
>>> pd.Series({col: len(pd.unique(df[col])) for col in df})
0    2
1    3
2    2
dtype: int64
If you actually want the number of times each value appears in each column, you can do a similar thing with pd.value_counts:
>>> pd.DataFrame({col: pd.value_counts(df[col]) for col in df}).fillna(0)
     0  1    2
0  0.0  1  0.0
1  3.0  1  1.0
2  1.0  2  3.0
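Note that the top-level pd.value_counts function is deprecated in recent pandas versions; the method form gives the same table:
>>> pd.DataFrame({col: df[col].value_counts() for col in df}).fillna(0)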
This is not exactly what you asked for, but may be useful for your analysis.
def diversity_percentage(df, columns):
    """
    This function returns the number of different elements in each column as a percentage of the total elements in the group.
    A low value indicates there are many repeated elements.
    Example 1: a value of 0 indicates all values are the same.
    Example 2: a value of 100 indicates all values are different.
    """
    diversity = dict()
    for col in columns:
        diversity[col] = len(df[col].unique())
    diversity_series = pd.Series(diversity)
    return (100 * diversity_series / len(df)).sort_values()
>>> diversity_percentage(df, selected_columns)
operationdate      0.002803
payment            1.076414
description       16.933901
customer_id       17.536581
customer_name     48.895554
customer_email    62.129282
token             68.290632
id               100.000000
transactionid    100.000000
dtype: float64
However, you can always return diversity_series directly to obtain just the counts.
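For reference, roughly the same series can be built in one line with DataFrame.nunique (a sketch; note that nunique skips NaN while unique() counts it, so the figures can differ slightly on columns with missing values):
>>> (100 * df[selected_columns].nunique() / len(df)).sort_values()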
Related
df = pd.DataFrame({
    'group': [1,1,1,2,2,2],
    'value': [None,None,'A',None,'B',None]
})
I would like to replace missing values with the next non-missing value within each group. The desired result is:
df = pd.DataFrame({
    'group': [1,1,1,2,2,2],
    'value': ['A','A','A','B','B',None]
})
You can try this:
df['value'] = df.groupby(by=['group'])['value'].backfill()
print(df)
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2   NaN
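Note: in recent pandas versions the backfill alias is deprecated in favour of bfill, so the equivalent call is:
df['value'] = df.groupby(by=['group'])['value'].bfill()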
The easiest way, as @Erfan mentioned, is to use the backfill method DataFrameGroupBy.bfill.
Solution 1)
>>> df['value'] = df.groupby('group')['value'].bfill()
>>> df
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2   NaN
Solution 2)
DataFrameGroupBy.bfill with the limit parameter works perfectly well here too.
The section Limit the amount of filling in the pandas documentation is worth reading; per the docs, if we only want consecutive gaps filled up to a certain number of data points, we can use the limit keyword.
>>> df['value'] = df.groupby('group')['value'].bfill(limit=2)
>>> df
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2   NaN
Solution 3)
With groupby() we can also use fillna() with method='bfill' and the limit parameter.
>>> df.groupby('group').fillna(method='bfill', limit=2)
  value
0     A
1     A
2     A
3     B
4     B
5  None
Solution 4)
Another way around is to use the DataFrame.transform function, applying Series.bfill to the value column within each group:
>>> df['value'] = df.groupby('group')['value'].transform(lambda v: v.bfill())
>>> df
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2  None
Solution 5)
You can use DataFrame.set_index to append the group column to the index, do a simple bfill() via groupby() on that level, and then use reset_index to restore the original shape.
>>> df.set_index('group', append=True).groupby(level=1).bfill().reset_index(level=1)
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2   NaN
Solution 6)
In case you are strictly not going for groupby(), then the below would be the easiest (but note it ignores group boundaries; see the sketch after the output):
>>> df['value'] = df['value'].bfill()
>>> df
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2  None
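To see why the grouped variants are usually safer, here is a small sketch on hypothetical data; the plain bfill in Solution 6 only matches the grouped result above because group 1 ends with a non-missing value:
df2 = pd.DataFrame({
    'group': [1, 1, 2, 2],
    'value': [None, None, 'B', None]
})
# Plain bfill leaks 'B' from group 2 back into group 1:
df2['value'].bfill()                   # ['B', 'B', 'B', None]
# Grouped bfill leaves the all-missing group 1 untouched:
df2.groupby('group')['value'].bfill()  # [None, None, 'B', None]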
I have the dataframe shown below:
df = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1],
    'val': [5,6.4,5.4,6,6,6]
})
I would like to drop the values from the val column that end in .[1-9], i.e. have a nonzero fractional part. Basically I would like to retain values like 5.0 and 6.0 and drop values like 5.4 and 6.4.
Though I tried the below, it isn't accurate:
df['val'] = df['val'].astype(int)
df.drop_duplicates()  # doesn't give the expected output
I expect the output to contain only the rows where val is a whole number.
A first idea is to compare the original values with the column cast to integer, and also assign the integers back for the expected output (integers in the column):
s = df['val']
df['val'] = df['val'].astype(int)
df = df[df['val'] == s]
print (df)
   subject_id  val
0           1    5
3           1    6
4           1    6
5           1    6
Another idea is to test float.is_integer:
mask = df['val'].apply(lambda x: x.is_integer())
df['val'] = df['val'].astype(int)
df = df[mask]
print (df)
   subject_id  val
0           1    5
3           1    6
4           1    6
5           1    6
If you need floats in the output, you can use:
df1 = df[ df['val'].astype(int) == df['val']]
print (df1)
   subject_id  val
0           1  5.0
3           1  6.0
4           1  6.0
5           1  6.0
Use mod 1 to determine the residual. If the residual is 0, the number is an int. Then use the result as a mask to select only those rows.
df.loc[df.val.mod(1).eq(0)].astype(int)
   subject_id  val
0           1    5
3           1    6
4           1    6
5           1    6
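The same residual idea also works directly on the underlying numpy array, if you prefer (a sketch, assuming numpy is imported as np):
mask = np.mod(df['val'].to_numpy(), 1) == 0  # True where there is no fractional part
print(df[mask].astype(int))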
The DataFrame named df is shown as follows.
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 3]})
Input:
   id
0   1
1   1
2   3
I want to count the number of each id, and take the result as a new column count.
Expected:
   id  count
0   1      2
1   1      2
2   3      1
pd.factorize and np.bincount
My favorite. factorize does not sort and has time complexity O(n). For big data sets, factorize should be preferred over np.unique.
# i: integer codes for each row; u: the unique ids in order of appearance
i, u = df.id.factorize()
# np.bincount(i) counts each code; indexing with i broadcasts the counts back to rows
df.assign(Count=np.bincount(i)[i])
   id  Count
0   1      2
1   1      2
2   3      1
np.unique and np.bincount
u, i = np.unique(df.id, return_inverse=True)
df.assign(Count=np.bincount(i)[i])
   id  Count
0   1      2
1   1      2
2   3      1
Assign the new count column to the dataframe by grouping on id and then transforming that column with count (or size).
>>> df.assign(count=df.groupby('id')['id'].transform('count'))
   id  count
0   1      2
1   1      2
2   3      1
Use Series.map with Series.value_counts:
df['count'] = df['id'].map(df['id'].value_counts())
#alternative
#from collections import Counter
#df['count'] = df['id'].map(Counter(df['id']))
Detail:
print (df['id'].value_counts())
1    2
3    1
Name: id, dtype: int64
Or use GroupBy.transform with GroupBy.size to return a Series the same length as the original DataFrame:
df['count'] = df.groupby('id')['id'].transform('size')
print (df)
   id  count
0   1      2
1   1      2
2   3      1
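A small note on the choice of 'size' here: as a transform, 'size' counts every row in the group including NaN, while 'count' skips NaN, so with no missing ids the two are interchangeable:
df['count'] = df.groupby('id')['id'].transform('count')  # same result here, since id has no NaN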
I have a list of ids that correspond to rows of a data frame. From that list of ids, I want to increment the value of another column where it intersects with each id's row.
What I was thinking was something like this:
ids = [1,2,3,4]
for id in ids:
    my_df.loc[my_df['id'] == id]['other_column'] += 1
But this doesn't work. How can I mutate the original df, my_df?
try this:
my_df.loc[my_df['id'].isin(ids), 'other_column'] += 1
Demo:
In [233]: ids = [0, 2]
In [234]: df = pd.DataFrame(np.random.randint(0, 3, (5, 3)), columns=list('abc'))
In [235]: df
Out[235]:
   a  b  c
0  2  2  1
1  1  0  2
2  2  2  0
3  0  2  1
4  0  1  2
In [236]: df.loc[df.a.isin(ids), 'c'] += 100
In [237]: df
Out[237]:
   a  b    c
0  2  2  101
1  1  0    2
2  2  2  100
3  0  2  101
4  0  1  102
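Applied to the frame from the question, a minimal sketch (my_df, id, and other_column are the names from the question; the data itself is made up):
my_df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                      'other_column': [0, 0, 0, 0, 0]})
ids = [1, 2, 3, 4]
# one vectorized, in-place update instead of a loop
my_df.loc[my_df['id'].isin(ids), 'other_column'] += 1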
The ids are unique, correct? If so, you can place the id directly into df.loc (this assumes the ids match the index labels):
ids = [1,2,3,4]
for id in ids:
    df.loc[id, 'column_name'] += 1
I got lost in the pandas docs trying to figure out how to group a DataFrame by the sums of its columns.
For instance, let's say I have the following data:
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
   a  b  c  d
0  1  0  1  2
1  0  1  0  3
2  0  0  0  4
I would like columns a, b and c to be grouped, since they all have a sum equal to 1. The resulting DataFrame's column labels would equal the sums of the columns that were combined. Like this:
   1  9
0  2  2
1  1  3
2  0  4
Any ideas to point me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
   1  9
0  2  2
1  1  3
2  0  4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over axis 0 (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You want to group the columns (axis=1) and take the sum of each group.
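Note that grouping with axis=1 is deprecated in recent pandas versions; a sketch of the same operation that transposes, groups the rows by the column sums, and transposes back:
df.T.groupby(df.sum()).sum().T  # same result as Out[57]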
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print(df.groupby('totals').sum().transpose())
# totals  1  9
# 0       2  2
# 1       1  3
# 2       0  4