I have a dataframe that's ordered by two columns: 'ID' and a date column.
There's a significant amount of missing values in that table, and what I'm interested in is understanding how the missing values are distributed: are they mainly concentrated in one 'ID', do all IDs have missing values at their start (date-wise), are the missing values unrelated, etc.
After a groupby on ID plus a count of missing values, I used the missingno package and it proved to be useful. This is the result I got (column names sanitized):
From the picture, it seems like there are specific batches of rows where most columns are missing.
If you look at the arrow for example, I can probably ballpark a value for indexes to search (~idx = 750000), but this wouldn't be practical since there are other instances where the same thing happens.
What I would like to have is a function batches_missing(cols, n_rows) that takes a list of columns and an int n_rows, and returns a list of tuples [(index_start_batch1, index_end_batch1), ...] of all batches where the given columns have at least n_rows consecutive rows of missing values.
With a mock example :
df = pd.DataFrame({'col1': [1, 2, np.nan, np.nan, np.nan, np.nan, 2, 2, np.nan, np.nan, np.nan],
                   'col2': [9, 7, np.nan, np.nan, np.nan, np.nan, 0, np.nan, np.nan, np.nan, np.nan],
                   'col3': [11, 12, 13, np.nan, 1, 2, 3, np.nan, 1, 2, 3]})
batches_missing(['col1', 'col2'], 3) would return [(2, 5), (8, 10)]
Can this be done efficiently, given that the actual data is pretty big (1 million rows)? I would also be very interested in hearing about other ways of analyzing missing data, so I would appreciate any reading materials / links!
Thanks everyone.
You can tally row-wise to see which rows are all NaN for the selected columns:
rowwise_tally = df[['col1','col2']].isna().apply(all,axis=1)
0 False
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 True
9 True
10 True
Now you can group these runs:
grp = rowwise_tally.diff().cumsum().fillna(0)
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 3.0
9 3.0
10 3.0
Then count the number of all-missing rows in each group and also get the start and end positions:
na_counts = rowwise_tally.groupby(grp).sum()
pos = pd.Series(np.arange(len(df))).groupby(grp).agg(['min', 'max'])
pos[na_counts>=3].to_numpy()
array([[ 2, 5],
[ 8, 10]])
There might be a better way to get the position instead of using pd.Series like I did. For now, wrap this into a function:
def fun(data, cols, minlen):
    # Rows where every selected column is missing
    rowwise_tally = data[cols].isna().apply(all, axis=1)
    # Label consecutive runs of identical True/False values
    grp = rowwise_tally.diff().cumsum().fillna(0)
    # Number of all-missing rows in each run
    na_counts = rowwise_tally.groupby(grp).sum()
    # Start and end positions of each run
    pos = pd.Series(np.arange(len(data))).groupby(grp).agg(['min', 'max'])
    # Keep only the runs that are long enough
    return pos[na_counts >= minlen].to_numpy()

fun(df, ['col1', 'col2'], 3)
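If the apply(all, axis=1) step turns out to be slow on a million rows, the same idea can be written with fully vectorized calls. The sketch below is my own variant, not part of the answer above; it is a batches_missing-style function that also takes the dataframe explicitly:

def batches_missing(df, cols, n_rows):
    # True where every one of the given columns is NaN for that row
    all_missing = df[cols].isna().all(axis=1)
    # Label consecutive runs of identical values
    grp = all_missing.ne(all_missing.shift()).cumsum()
    # Start/end positional index and length of each run
    runs = (pd.DataFrame({'missing': all_missing, 'pos': np.arange(len(df))})
              .groupby(grp)
              .agg(missing=('missing', 'first'),
                   start=('pos', 'min'),
                   end=('pos', 'max'),
                   length=('missing', 'size')))
    # Keep only all-missing runs that are long enough
    keep = runs[runs['missing'] & (runs['length'] >= n_rows)]
    return [(int(s), int(e)) for s, e in zip(keep['start'], keep['end'])]

batches_missing(df, ['col1', 'col2'], 3)  # [(2, 5), (8, 10)] on the mock example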
Related
Let's say I have the following pandas Series:
s = pd.Series([np.nan, np.nan, np.nan, 0, 1, 2, 3])
and I want to use the pandas nearest interpolation method on this data.
When I run s.interpolate(method='nearest'), it does not do the interpolation.
When I modify the series, say to s = pd.Series([np.nan, 1, np.nan, 0, 1, 2, 3]), then the same method works.
Do you know how to do the interpolation in the first case?
Thanks!
You need two surrounding values to be able to interpolate, else this would be extrapolation.
As you can see with:
s = pd.Series([np.nan, 1, np.nan, 0, 1, 2, 3])
s.interpolate(method='nearest')
only the intermediate NaNs are interpolated:
0 NaN # cannot interpolate
1 1.0
2 1.0 # interpolated
3 0.0
4 1.0
5 2.0
6 3.0
dtype: float64
As you want the nearest value, a workaround could be to bfill (or ffill):
s.interpolate(method='nearest').bfill()
output:
0 1.0
1 1.0
2 1.0
3 0.0
4 1.0
5 2.0
6 3.0
dtype: float64
follow-up
The only problem occurs when 1. s = pd.Series([np.nan, np.nan, np.nan, 0, np.nan, np.nan, np.nan]) and 2. s = pd.Series([np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]). In the first case, I want to have 0 everywhere. In the second case, I want to leave it as it is.
try:
    s2 = s.interpolate(method='nearest').bfill().ffill()
except ValueError:
    # fewer than two non-NaN values: 'nearest' cannot interpolate, so just fill
    s2 = s.bfill().ffill()
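For reference, a small wrapper around that fallback (the name nearest_fill is just illustrative) behaves as requested on both edge cases: with a single non-NaN value everything becomes that value, and an all-NaN series comes back unchanged since bfill/ffill have nothing to propagate:

def nearest_fill(s):
    # Use 'nearest' interpolation when possible, otherwise fall back to plain bfill/ffill
    try:
        return s.interpolate(method='nearest').bfill().ffill()
    except ValueError:
        return s.bfill().ffill()

nearest_fill(pd.Series([np.nan, np.nan, np.nan, 0, np.nan, np.nan, np.nan]))  # all 0.0
nearest_fill(pd.Series([np.nan] * 7))                                         # stays all NaN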
I have some numerical time series of varying lengths stored in a wide pandas dataframe. Each row corresponds to one series and each column to a measurement time point. Because of their varying lengths, the series can have missing-value (NA) tails on the left (first time points), on the right (last time points), or both. There is always a continuous stretch without NAs of a minimum length on each row.
I need to get a random subset of fixed length from each of these rows, without including any NA. Ideally, I wish to keep the original dataframe intact and to report the subsets in a new one.
I managed to obtain this output with a very inefficient for loop that goes through each row one by one, determines a start for the crop position such that NAs will not be included in the output, and copies the cropped result. This works but is extremely slow on large datasets. Here is the code:
import pandas as pd
import numpy as np
from copy import copy
def crop_random(df_in, output_length, ignore_na_tails=True):
    # Initialize new dataframe
    colnames = ['X_' + str(i) for i in range(output_length)]
    df_crop = pd.DataFrame(index=df_in.index, columns=colnames)
    # Go through all rows
    for irow in range(df_in.shape[0]):
        series = copy(df_in.iloc[irow, :])
        series = np.array(series).astype('float')
        length = len(series)
        if ignore_na_tails:
            pos_non_na = np.where(~np.isnan(series))
            # Range where the subset might start
            lo = pos_non_na[0][0]
            hi = pos_non_na[0][-1]
            left = np.random.randint(lo, hi - output_length + 2)
        else:
            left = np.random.randint(0, length - output_length)
        series = series[left:left + output_length]
        df_crop.iloc[irow, :] = series
    return df_crop
And a toy example:
df = pd.DataFrame.from_dict({'t0': [np.NaN, 1, np.NaN],
                             't1': [np.NaN, 2, np.NaN],
                             't2': [np.NaN, 3, np.NaN],
                             't3': [1, 4, 1],
                             't4': [2, 5, 2],
                             't5': [3, 6, 3],
                             't6': [4, 7, np.NaN],
                             't7': [5, 8, np.NaN],
                             't8': [6, 9, np.NaN]})
# t0 t1 t2 t3 t4 t5 t6 t7 t8
# 0 NaN NaN NaN 1 2 3 4 5 6
# 1 1 2 3 4 5 6 7 8 9
# 2 NaN NaN NaN 1 2 3 NaN NaN NaN
crop_random(df, 3)
# One possible output:
# X_0 X_1 X_2
# 0 2 3 4
# 1 7 8 9
# 2 1 2 3
How could I achieve the same results in a way suited to large dataframes?
Edit: Moved my improved solution to the answer section.
I managed to speed up things quite drastically with:
def crop_random(dataset, output_length, ignore_na_tails=True):
    # Get a random range to crop for each row
    def get_range_crop(series, output_length, ignore_na_tails):
        series = np.array(series).astype('float')
        if ignore_na_tails:
            pos_non_na = np.where(~np.isnan(series))
            start = pos_non_na[0][0]
            end = pos_non_na[0][-1]
            # +1 to include the last position in randint; +1 for the selection span
            left = np.random.randint(start, end - output_length + 2)
        else:
            length = len(series)
            left = np.random.randint(0, length - output_length)
        right = left + output_length
        return left, right

    # Crop the rows to random ranges; reset_index to concat without recreating new columns
    range_subset = dataset.apply(get_range_crop, args=(output_length, ignore_na_tails), axis=1)
    new_rows = [dataset.iloc[irow, range_subset[irow][0]:range_subset[irow][1]]
                for irow in range(dataset.shape[0])]
    for row in new_rows:
        row.reset_index(drop=True, inplace=True)

    # Concatenate all rows
    dataset_cropped = pd.concat(new_rows, axis=1).T
    return dataset_cropped
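For very large frames, the per-row apply can be avoided entirely by finding the first and last non-NaN positions of all rows at once and gathering the crops with integer indexing. This is only a sketch of that idea, not part of the answer above (the name crop_random_vectorized and the rng parameter are illustrative):

def crop_random_vectorized(df_in, output_length, rng=None):
    # Pick a random NA-free window of output_length from each row, all rows at once
    rng = np.random.default_rng() if rng is None else rng
    arr = df_in.to_numpy(dtype=float)
    n_rows, n_cols = arr.shape
    valid = ~np.isnan(arr)
    first = valid.argmax(axis=1)                        # first non-NaN column per row
    last = n_cols - 1 - valid[:, ::-1].argmax(axis=1)   # last non-NaN column per row
    # Random start in [first, last - output_length + 1] for each row
    start = first + rng.integers(0, last - first - output_length + 2)
    cols = start[:, None] + np.arange(output_length)    # column indices of each crop
    out = arr[np.arange(n_rows)[:, None], cols]
    return pd.DataFrame(out, index=df_in.index,
                        columns=['X_' + str(i) for i in range(output_length)])

crop_random_vectorized(df, 3)  # same kind of output as crop_random(df, 3)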
There is already an answer that deals with a relatively simple dataframe, given here.
However, the dataframe I have at hand has multiple columns and a large number of rows. One dataframe contains three dataframes attached along axis=0 (the bottom end of one is attached to the top of the next). They are separated by a row of NaN values.
How can I create three dataframes out of this one data by splitting it along the NaN rows?
Like in the answer you linked, you want to create a column which identifies the group number. Then you can apply the same solution.
To do so, you have to test whether all the values of a row are NaN. I don't know if there is such a test built into pandas for rows specifically, but pandas can test whether a Series is full of NaN. So what you want to do is perform that test row-wise, as if on the transpose of your dataframe, so that your "Series" is actually your row:
df["group_no"] = df.isnull().all(axis=1).cumsum()
At that point you can use the same technique from that answer to split the dataframes.
You might want to do a .dropna() at the end, because you will still have the NaN rows in your result.
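Putting it together, a minimal sketch of the split step, assuming the group_no column created above and that the all-NaN separator rows should be discarded:

# Split on the group label, drop the helper column, then drop the separator rows themselves
parts = [g.drop(columns="group_no").dropna(how="all")
         for _, g in df.groupby("group_no")]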
Ran into this same question in 2022. Here's what I did to split dataframes on rows with NaNs; the caveat is that this relies on pip install python-rle for run-length encoding:
import rle
def nanchucks(df):
    # It chucks NaNs outta dataframes
    # True if the row contains any NaN
    df_nans = pd.isnull(df).sum(axis="columns").astype(bool)
    values, counts = rle.encode(df_nans)
    df_nans = pd.DataFrame({"values": values, "counts": counts})
    df_nans["cum_counts"] = df_nans["counts"].cumsum()
    df_nans["start_idx"] = df_nans["cum_counts"].shift(1)
    df_nans.loc[0, "start_idx"] = 0
    df_nans["start_idx"] = df_nans["start_idx"].astype(int)  # np.nan makes it a float column
    df_nans["end_idx"] = df_nans["cum_counts"] - 1
    # Only keep the chunks of data w/o NaNs
    df_nans = df_nans[df_nans["values"] == False]
    indices = []
    for idx, row in df_nans.iterrows():
        indices.append((row["start_idx"], row["end_idx"]))
    return [df.loc[df.index[i[0]]: df.index[i[1]]] for i in indices]
Examples:
sample_df1 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
    "c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, 3, np.nan, 4],
    "c": [1, 2, np.nan, 3, 4],
})
print(nanchucks(sample_df1))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 3 3.0 3.0 3.0
# 4 4.0 4.0 4.0]
print(nanchucks(sample_df2))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 4 4.0 4.0 4.0]
For example I have a dataframe:
df = pd.DataFrame({'Value_Bucket': [5, 5, 5, 10, 10, 10],
                   'DayofWeek': [1, 1, 3, 2, 4, 2],
                   'Hour_Bucket': [1, 5, 7, 4, 3, 12],
                   'Values': [1, 1.5, 2, 3, 5, 3]})
The actual data set is rather large (5000+ rows). I'm looking to perform functions on 'Values' where "Value_Bucket" = 5, for each possible combination of "DayofWeek" and "Hour_Bucket".
Essentially the data will be grouped into a table of 24 rows (Hour_Bucket) and 7 columns (DayofWeek), with each cell filled with the result of a function (say the average). I can use a groupby function for one criterion; can someone explain how I can group by two criteria and tabulate the result in a table?
Query to subset, then groupby, then unstack:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack()
DayofWeek 1 3
Hour_Bucket
1 1.0 NaN
5 1.5 NaN
7 NaN 2.0
If you want to have zeros instead of NaN
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack(fill_value=0)
DayofWeek 1 3
Hour_Bucket
1 1.0 0.0
5 1.5 0.0
7 0.0 2.0
Pivot tables seem more natural to me than groupby paired with unstack, though they do the exact same thing.
pd.pivot_table(data=df.query('Value_Bucket == 5'),
               index='Hour_Bucket',
               columns='DayofWeek',
               values='Values',
               aggfunc='mean',
               fill_value=0)
Output
DayofWeek 1 3
Hour_Bucket
1 1.0 0
5 1.5 0
7 0.0 2
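If you need the full 24 x 7 grid the question describes, including day/hour combinations that never occur in the data, either result can be expanded with reindex. A sketch (the exact hour and weekday ranges here are assumptions about your bucketing):

full_grid = (df.query('Value_Bucket == 5')
               .groupby(['Hour_Bucket', 'DayofWeek']).Values.mean()
               .unstack(fill_value=0)
               .reindex(index=range(1, 25), columns=range(1, 8), fill_value=0))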
I have data which looks something like this:
   UserID  region1  region2  region3  Conditionid
0       0      NaN      NaN      NaN          NaN
1     693        2        1      NaN          NaN
2     709        1      NaN      NaN          100
3     730      NaN      NaN      NaN          NaN
4     840      NaN      NaN        5          100
Here numbers in the region columns represent the number of visits.
Now I want to calculate a metric A such that, among users who have visited a given region, what percentage have a particular condition with Conditionid equal to 100. This has to be done for each region column.
Simple logic for one region would be:
if region1 != NA and Conditionid == 100 then count = count + 1.
Once I have this count, I have to divide it by the number of visits for region1. So we would iterate down the first column (region1) row by row, then down the second column (region2), and so on for all the regions. Now the problem is how to iterate in this manner, and how do I store the metric A for each region? I think there is some built-in mechanism in pandas for this, but I'm not sure.
When using pandas, you should not iterate through rows. Nearly every problem you will face can be solved using boolean logic and/or vectorized operations.
d = {'UserID': [0, 693, 709, 730, 840],
     'Region1': [np.nan, 2, 1, np.nan, np.nan],
     'Region2': [np.nan, 1, np.nan, np.nan, np.nan],
     'Region3': [np.nan, np.nan, np.nan, np.nan, 5],
     'Conditionid': [np.nan, np.nan, 100, np.nan, 100]}
df = pd.DataFrame(d)
You can then apply some boolean logic to find the count which you're interested in:
df[~(df['Region1'].isnull()) & (df['Conditionid'] == 100)]['Region1'].count()
Note that ~ means NOT. So, in this case, it reads as "is NOT null".
If you want to iterate through specific columns, you can do something like this:
for i in range(1, 4):
    column = 'Region' + str(i)
    print(column)
    numerator = df[~(df[column].isnull()) & (df['Conditionid'] == 100)][column].count()
    denominator = df[~df[column].isnull()][column].count()
    print(numerator / denominator)
This will loop over 'Region1' to 'Region3' and compute what I think you're looking for. If not, it should at least give you a good starting point.
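Going one step further, the loop can be replaced by a single vectorized computation over all region columns at once. A sketch using the column names from the example dataframe above:

region_cols = ['Region1', 'Region2', 'Region3']
visited = df[region_cols].notna()           # True where a user visited that region
has_condition = df['Conditionid'] == 100    # users with the condition

# Metric A per region: share of visitors who have the condition
metric_A = visited[has_condition].sum() / visited.sum()
print(metric_A)  # Region1 0.5, Region2 0.0, Region3 1.0 for the example data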