Splitting a Dataframe at NaN row - python

There is already an answer here that deals with a relatively simple dataframe.
However, the dataframe I have at hand has multiple columns and a large number of rows. One dataframe contains three dataframes stacked along axis=0 (the bottom of one is attached to the top of the next), separated by rows of NaN values.
How can I create three dataframes out of this one data by splitting it along the NaN rows?

As in the answer you linked, you want to create a column that identifies the group number; then you can apply the same solution.
To do so, you need to test whether all the values in a row are NaN. pandas has a test for whether a Series is entirely NaN, so apply that test along axis=1 so that each "Series" is actually a row:
df["group_no"] = df.isnull().all(axis=1).cumsum()
At that point you can use the same technique from that answer to split the dataframes.
You might want to do a .dropna() at the end, because you will still have the NaN rows in your result.
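For completeness, here is a minimal sketch of the whole split under this approach (the column and variable names are just examples):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
})

mask = df.isnull().all(axis=1)   # True on the all-NaN separator rows
df["group_no"] = mask.cumsum()   # 0 before the first separator, 1 after it, ...
parts = [
    g.drop(columns="group_no")   # keep only the original columns
    for _, g in df[~mask].groupby("group_no")
]
# parts is a list of sub-dataframes with the separator rows already dropped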

Ran into this same question in 2022. Here's what I did to split dataframes on rows with NaNs; the caveat is that this relies on pip install python-rle for run-length encoding:
import numpy as np
import pandas as pd
import rle

def nanchucks(df):
    # It chucks NaNs outta dataframes
    # True if any value in the row is NaN
    df_nans = pd.isnull(df).sum(axis="columns").astype(bool)
    values, counts = rle.encode(df_nans)
    df_nans = pd.DataFrame({"values": values, "counts": counts})
    df_nans["cum_counts"] = df_nans["counts"].cumsum()
    df_nans["start_idx"] = df_nans["cum_counts"].shift(1)
    df_nans.loc[0, "start_idx"] = 0
    df_nans["start_idx"] = df_nans["start_idx"].astype(int)  # np.nan makes it a float column
    df_nans["end_idx"] = df_nans["cum_counts"] - 1
    # Only keep the chunks of data w/o NaNs
    df_nans = df_nans[df_nans["values"] == False]
    indices = []
    for idx, row in df_nans.iterrows():
        indices.append((row["start_idx"], row["end_idx"]))
    return [df.loc[df.index[i[0]]: df.index[i[1]]] for i in indices]
Examples:
sample_df1 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
    "c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, 3, np.nan, 4],
    "c": [1, 2, np.nan, 3, 4],
})
print(nanchucks(sample_df1))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 3 3.0 3.0 3.0
# 4 4.0 4.0 4.0]
print(nanchucks(sample_df2))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 4 4.0 4.0 4.0]


Convert a lambda function to a regular function

I'm trying to understand how I can convert a lambda function to a regular one. I have this lambda function that is supposed to fill the null values of each column with the column's mode:
def fill_nn(data):
    df = data.apply(lambda column: column.fillna(column.mode()[0]))
    return df
I tried this:
def fill_nn(df):
    for column in df:
        if df[column].isnull().any():
            return df[column].fillna(df[column].mode()[0])
Hi 👋 Hope you are doing well!
If I understood your question correctly, the best approach would be something similar to this:
import pandas as pd

def fill_missing_values(series: pd.Series) -> pd.Series:
    """Fill missing values in series/column."""
    value_to_use = series.mode()[0]
    return series.fillna(value=value_to_use)

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "B": [None, 2, 3, 4, None],
        "C": [None, None, 3, 4, None],
    }
)

df = df.apply(fill_missing_values)  # type: ignore
print(df)
# A B C
# 0 1 2.0 3.0
# 1 2 2.0 3.0
# 2 3 3.0 3.0
# 3 4 4.0 4.0
# 4 5 2.0 3.0
But personally, I would still use the lambda, as it requires less code and is easier to manage (especially for such a small task).
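For reference, the lambda version the answer refers to is the one-liner from the question:
df = df.apply(lambda column: column.fillna(column.mode()[0]))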

Get index of row where column value changes from previous row

I have a pandas dataframe with a column such as :
df1 = pd.DataFrame({ 'val': [997.95, 997.97, 989.17, 999.72, 984.66, 1902.15]})
I have 2 types of events that can be detected from this column, and I want to label them 1 and 2.
I need to get the indexes of each label, and to do so I need to find where the 'val' column has changed a lot (±7) from the previous row.
Expected output:
one = [0, 1, 3, 5]
two = [2, 4]
Use Series.diff with a mask to test for values less than 0, then use boolean indexing with the index:
m = df1.val.diff().lt(0)
# if you need to test for a drop of more than 7 instead
# m = df1.val.diff().lt(-7)
one = df1.index[~m]
two = df1.index[m]
print (one)
Int64Index([0, 1, 3, 5], dtype='int64')
print (two)
Int64Index([2, 4], dtype='int64')
If need lists:
one = df1.index[~m].tolist()
two = df1.index[m].tolist()
Details:
print (df1.val.diff())
0 NaN
1 0.02
2 -8.80
3 10.55
4 -15.06
5 917.49
Name: val, dtype: float64
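If you would rather store the 1/2 labels in a column than keep two index lists, here is a small sketch (the column name "label" is just an example):
import numpy as np

# label 2 where the value dropped from the previous row, otherwise label 1
df1["label"] = np.where(m, 2, 1)
print(df1)
#        val  label
# 0   997.95      1
# 1   997.97      1
# 2   989.17      2
# 3   999.72      1
# 4   984.66      2
# 5  1902.15      1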

Return indexes of longest batches of NaN

I have a dataframe that's ordered by two columns : 'ID' and a date column.
There's a significant amount of missing values in that table, and what I'm interested in is understanding how the missing values are distributed: are they mainly concentrated in one 'ID', do all IDs have missing values at their start (date-wise), are the missing values unrelated, etc.
After a groupby on ID plus a count of missing values, I used the missingno package and it proved useful; this is the result I got (column names sanitized):
From the picture, it seems like there are specific batches of rows where most columns are missing.
Looking at the arrow, for example, I can ballpark an index value to search (~idx = 750000), but that wouldn't be practical since the same thing happens in other places.
What I would like is a function batches_missing(cols, n_rows) that takes a list of columns and an int n_rows, and returns a list of tuples [(index_start_batch1, index_end_batch1), ...] of all batches where the given columns have at least n_rows consecutive rows of missing values.
With a mock example :
df = pd.DataFrame({'col1': [1, 2, np.nan, np.nan, np.nan, np.nan, 2, 2, np.nan, np.nan, np.nan],
                   'col2': [9, 7, np.nan, np.nan, np.nan, np.nan, 0, np.nan, np.nan, np.nan, np.nan],
                   'col3': [11, 12, 13, np.nan, 1, 2, 3, np.nan, 1, 2, 3]})
batches_missing(['col1', 'col2'], 3) would return [(2, 5), (8, 10)]
Can this be done efficiently given that the actual data is pretty big (1 million rows)? I would also be very interested in hearing about other ways of analyzing missing data, so I would appreciate any reading materials/links!
Thanks everyone.
First, tally row-wise to see which rows are all NaN in the selected columns:
rowwise_tally = df[['col1','col2']].isna().apply(all,axis=1)
0 False
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 True
9 True
10 True
Now you can group these runs:
grp = rowwise_tally.diff().cumsum().fillna(0)
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 3.0
9 3.0
10 3.0
Then count the number of NaNs in each group and also get the start and end of each run:
na_counts = rowwise_tally.groupby(grp).sum()
pos = pd.Series(np.arange(len(df))).groupby(grp).agg([np.min, np.max])
pos[na_counts>=3].to_numpy()
array([[ 2, 5],
[ 8, 10]])
There might be a better way to get the position instead of using pd.Series like I did. For now, wrap this into a function:
def fun(data, cols, minlen):
    rowwise_tally = data[cols].isna().apply(all, axis=1)
    grp = rowwise_tally.diff().cumsum().fillna(0)
    na_counts = rowwise_tally.groupby(grp).sum()
    pos = pd.Series(np.arange(len(data))).groupby(grp).agg([np.min, np.max])
    return pos[na_counts >= minlen].to_numpy()

fun(df, ['col1', 'col2'], 3)
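As a possible refinement of the position step (just a sketch, assuming the dataframe keeps its default RangeIndex), you could group the index itself instead of building a separate pd.Series:
pos = rowwise_tally.index.to_series().groupby(grp).agg(["min", "max"])
pos[na_counts >= 3].to_numpy()
# array([[ 2,  5],
#        [ 8, 10]])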

Manipulate pandas.DataFrame with multiple criteria

For example I have a dataframe:
df = pd.DataFrame({'Value_Bucket': [5, 5, 5, 10, 10, 10],
                   'DayofWeek': [1, 1, 3, 2, 4, 2],
                   'Hour_Bucket': [1, 5, 7, 4, 3, 12],
                   'Values': [1, 1.5, 2, 3, 5, 3]})
The actual data set is rather large (5000+ rows). I'm looking to perform functions on 'Values' where 'Value_Bucket' equals 5, for each possible combination of 'DayofWeek' and 'Hour_Bucket'.
Essentially the data will be grouped into a table of 24 rows (Hour_Bucket) and 7 columns (DayofWeek), and each cell is filled with the result of a function (say, the average). I can use groupby for one criterion; can someone explain how I can group by two criteria and tabulate the result in a table?
query to subset, then groupby, then unstack:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack()
DayofWeek 1 3
Hour_Bucket
1 1.0 NaN
5 1.5 NaN
7 NaN 2.0
If you want zeros instead of NaN:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack(fill_value=0)
DayofWeek 1 3
Hour_Bucket
1 1.0 0.0
5 1.5 0.0
7 0.0 2.0
Pivot tables seem more natural to me than groupby paired with unstack though they do the exact same thing.
pd.pivot_table(data=df.query('Value_Bucket == 5'),
               index='Hour_Bucket',
               columns='DayofWeek',
               values='Values',
               aggfunc='mean',
               fill_value=0)
Output
DayofWeek 1 3
Hour_Bucket
1 1.0 0
5 1.5 0
7 0.0 2
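If you want the full 24 x 7 grid even for hour/day combinations that never occur in the data, a hedged sketch using reindex (the 0-23 hour range and 1-7 day range are assumptions based on the question, not the sample data):
table = (df.query('Value_Bucket == 5')
           .pivot_table(index='Hour_Bucket', columns='DayofWeek',
                        values='Values', aggfunc='mean', fill_value=0)
           .reindex(index=range(24), columns=range(1, 8), fill_value=0))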

data munging in python-pandas

I have data which looks something like this:
   UserID  region1  region2  region3  Conditionid
0       0      NaN      NaN      NaN          NaN
1     693        2        1      NaN          NaN
2     709        1      NaN      NaN          100
3     730      NaN      NaN      NaN          NaN
4     840      NaN      NaN        5          100
Here, the numbers in the region columns represent the number of visits.
Now I want to calculate a metric A: among the users who have visited a given region, what percentage have a particular condition, i.e. Conditionid equal to 100? This has to be done for each region column.
A simple logic for one region would be:
if region1 != NA and Conditionid == 100 then count = count + 1.
Once I have this count, I have to divide it by the visits for region1. So first we iterate row-wise over the first column, then row-wise over the second column (region2), and so on for all the regions. The problem is how to iterate in this manner, and how do I store the metric A for each region? I think there is some built-in mechanism in pandas for this, but I'm not sure.
When using pandas, you should not iterate through rows. Nearly every problem you will face can be solved using boolean logic and/or vectorized operations.
d = {'UserID': [0, 693, 709, 730, 840],
     'Region1': [np.nan, 2, 1, np.nan, np.nan],
     'Region2': [np.nan, 1, np.nan, np.nan, np.nan],
     'Region3': [np.nan, np.nan, np.nan, np.nan, 5],
     'Conditionid': [np.nan, np.nan, 100, np.nan, 100]}
df = pd.DataFrame(d)
You can then apply some boolean logic to find the count which you're interested in:
df[~(df['Region1'].isnull()) & (df['Conditionid'] == 100)]['Region1'].count()
Note: the ~ means NOT. So, in this case, is NOT null.
If you want to iterate through specific columns, you can do something like this:
for i in range(1, 4):
    column = 'Region' + str(i)
    print(column)
    numerator = df[~(df[column].isnull()) & (df['Conditionid'] == 100)][column].count()
    denominator = df[~df[column].isnull()][column].count()
    print(numerator / denominator)
This will loop over 'Region1' through 'Region3' and compute what I think you're looking for. If not, it should at least give you a good starting point.
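If you want to avoid the explicit loop entirely, here is a vectorized sketch (the column names are assumed from the sample dataframe above):
region_cols = ['Region1', 'Region2', 'Region3']
visited = df[region_cols].notna()          # who visited each region
has_condition = df['Conditionid'] == 100   # who has the condition
metric_a = visited[has_condition].sum() / visited.sum()
print(metric_a)
# Region1    0.5
# Region2    0.0
# Region3    1.0
# dtype: float64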
