I have data which looks something like this:
   UserID  region1  region2  region3  Conditionid
0       0      NaN      NaN      NaN          NaN
1     693        2        1      NaN          NaN
2     709        1      NaN      NaN          100
3     730      NaN      NaN      NaN          NaN
4     840      NaN      NaN        5          100
Here numbers in the region columns represent the number of visits.
Now I want to calculate a metric A such that, among users who have visited that region, what percentage have a particular condition with Conditionid equal to 100. This has to be done for each column (region).
Simple logic for one region would be:
if region1 != NaN and Conditionid == 100, then count = count + 1.
Once I have this count, I then have to divide it by the visits for region1. So first we have to iterate over the first region column row-wise, then over the second column (region2) row-wise, and so on for all the regions. The problem is how to iterate in this manner, and how do I store metric A for each region? I think there is some built-in mechanism in pandas for this, but I'm not sure.
When using pandas, you should not iterate through rows. Nearly every problem you will face can be solved using boolean logic and/or vectorized operations.
import numpy as np
import pandas as pd

d = {'UserID': [0, 693, 709, 730, 840],
     'Region1': [np.nan, 2, 1, np.nan, np.nan],
     'Region2': [np.nan, 1, np.nan, np.nan, np.nan],
     'Region3': [np.nan, np.nan, np.nan, np.nan, 5],
     'Conditionid': [np.nan, np.nan, 100, np.nan, 100]}
df = pd.DataFrame(d)
You can then apply some boolean logic to find the count which you're interested in:
df[~(df['Region1'].isnull()) & (df['Conditionid'] == 100)]['Region1'].count()
Note, the ~ means NOT. So, in this case, it reads as "is NOT null".
If you want to iterate through specific columns, you can do something like this:
for i in range(1, 4):
    column = 'Region' + str(i)
    print(column)
    numerator = df[~(df[column].isnull()) & (df['Conditionid'] == 100)][column].count()
    denominator = df[~df[column].isnull()][column].count()
    print(numerator / denominator)
This will loop over 'Region1' to 'Region3' and compute the ratio I think you're looking for. If not, it should at least give you a good starting point.
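If you'd rather avoid the loop entirely, here is a minimal vectorized sketch (not part of the answer above) that computes the same ratio for every region column at once, using the df constructed above:

region_cols = ['Region1', 'Region2', 'Region3']

visited = df[region_cols].notna()            # True where a user visited that region
has_condition = df['Conditionid'] == 100     # True where the user has the condition

# per region: visitors with the condition / all visitors
metric_a = visited[has_condition].sum() / visited.sum()
print(metric_a)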
So I am trying to forward fill a column with the limit being the value in another column. This is the code I run and I get this error message.
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['NM'] = [0, 0, 1, np.nan, np.nan, np.nan, 0]
df['length'] = [0, 0, 2, 0, 0, 0, 0]
print(df)
    NM  length
0  0.0       0
1  0.0       0
2  1.0       2
3  NaN       0
4  NaN       0
5  NaN       0
6  0.0       0
df['NM'] = df['NM'].fillna(method='ffill', limit=df['length'])
print(df)
ValueError: Limit must be an integer
The dataframe I want looks like this:
    NM  length
0  0.0       0
1  0.0       0
2  1.0       2
3  1.0       0
4  1.0       0
5  NaN       0
6  0.0       0
Thanks in advance for any help you can provide!
I do not think you want to use ffill for this instance.
Rather, I would recommend filtering to the rows where length is greater than 0, then iterating through those rows and writing each row's NM value into the following length rows.
for row in df.loc[df.length.gt(0)].reset_index().to_dict(orient='records'):
    df.loc[row['index']+1:row['index']+row['length'], 'NM'] = row['NM']
To break this down:
Get the rows containing change information, and be sure to include the index:
df.loc[df.length.gt(0)].reset_index().to_dict(orient='records')
Iterate through them. I prefer to_dict for performance reasons on large datasets; it is a habit.
Set the following length rows of NM to the NM value of the row that defines the length.
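Putting it together on your sample data (a quick check of the output, using the same frame as in your question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'NM': [0, 0, 1, np.nan, np.nan, np.nan, 0],
                   'length': [0, 0, 2, 0, 0, 0, 0]})

for row in df.loc[df.length.gt(0)].reset_index().to_dict(orient='records'):
    df.loc[row['index']+1:row['index']+row['length'], 'NM'] = row['NM']

print(df)
#     NM  length
# 0  0.0       0
# 1  0.0       0
# 2  1.0       2
# 3  1.0       0
# 4  1.0       0
# 5  NaN       0
# 6  0.0       0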
You can first group the dataframe by the length column before filling. The only issue is that for the first group in your example the limit would be 0, which causes an error, so we make sure it is at least 1 with max. This might cause unexpected results if there are NaN values before the first non-zero value in length, but from the given data it's not clear whether that can happen.
# make groups
m = df.length.gt(0).cumsum()

# fill the column
df["NM"] = df.groupby(m).apply(
    lambda f: f.NM.fillna(
        method="ffill",
        limit=max(f.length.iloc[0], 1))
).values
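As a side note (not part of the answer above): on newer pandas versions, where fillna(method="ffill") is deprecated, the same idea can be sketched with Series.ffill; group_keys=False keeps the original index so the result aligns on assignment:

df["NM"] = df.groupby(m, group_keys=False).apply(
    lambda f: f.NM.ffill(limit=max(int(f.length.iloc[0]), 1))
)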
I have a relatively large dataframe (~24000 rows and 15 columns) which has 2D coordinate data of rat movements, outputted by a neural network (DeepLabCut).
As part of this output data, there is a p-value score that is a measure of how certain the neural network was when applying that label. I'm trying to filter low quality predictions by copying the previous row into its place, each time that a low p-value is encountered, which assumes that the rat remained still for that frame.
Here's my function thus far:
def checkPVals(DataFrame, CutOff):
    for Cols in DataFrame.columns.values:
        if Cols % 3 == 0:
            for Vals in DataFrame.index.values:
                if float(DataFrame[Cols][Vals]) < CutOff:
                    if (Vals != 0):
                        PreviousRow = DataFrame.loc[Vals - 1, Cols - 3:Cols]
                        DataFrame.loc[Vals, Cols - 3:Cols] = PreviousRow
    return(DataFrame)
Here is a sample of the input data frame:
pd.DataFrame(data={
"x":[1, 2, 3, 4],
"y":[5, 4, 3, 2],
"likelihood":[1, 1, 0.3, 1]
})
Here is a sample of the desired output:
   x  y  Pval
0  1  5   1.0
1  2  4   1.0
2  2  4   1.0
3  4  2   1.0
With the idea being that row index 2 is replaced with values from row index 1, such that when the inter-frame Euclidean distance between these coordinates is calculated, the distance is 0, implying the label (rat) has not moved.
Clearly, my current implementation is very inefficient. I was looking at iterrows(), but that converts my data into a Series and messes with it. My other thought was to convert the p-value columns into np.arrays, iterate through those, take the indices of the p-values below the threshold and then swap those rows for the previous one in an iterative manner. However, I feel like that'll take just as long.
Any help is very much appreciated. Thank you!
I'm pretty sure I understood what you are attempting to do. If you could update your question to include a sample output paired with your sample input, that would be greatly beneficial.
If I understood correctly, you should be using a vectorized approach instead of explicit looping (this will massively speed up your data wrangling). Essentially you can mask the rows of the dataframe depending on whether or not the "likelihood" column is above a certain value. Once you mask the low likelihoods away (i.e. replace those values with NaN), you can simply forward fill the entire dataframe to fill in the "bad" rows with the previous row's values.
df = pd.DataFrame(data={
"x":[1, 2, 3, 4],
"y":[5, 4, 3, 2],
"likelihood":[1, 1, 0.3, 1]
})
cutoff = 0.5
new_df = df.mask(df["likelihood"] < cutoff).ffill()
print(new_df)
     x    y  likelihood
0  1.0  5.0         1.0
1  2.0  4.0         1.0
2  2.0  4.0         1.0
3  4.0  2.0         1.0
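For the full-size frame you describe (15 columns, with a likelihood column for every (x, y) pair, as in your Cols % 3 check), a hypothetical extension of the same masking idea might look like the sketch below. The flat (x, y, likelihood) column layout is an assumption, since DeepLabCut output can also use multi-level column headers:

import numpy as np
import pandas as pd

def fill_low_confidence(df, cutoff=0.5):
    # Replace each (x, y, likelihood) triple with the previous frame's values
    # whenever that bodypart's likelihood falls below the cutoff.
    out = df.copy()
    for i in range(0, out.shape[1], 3):
        triple = out.columns[i:i + 3]      # x, y, likelihood for one bodypart
        low = out[triple[2]] < cutoff      # frames where the network was unsure
        out.loc[low, triple] = np.nan      # blank the whole triple
    return out.ffill()                     # carry the previous frame forward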
I'm trying to do the following.
I have some data with wrong values (x<=0 or x>=1100) inside a dataframe.
I am trying to change those values to values inside an acceptable range.
For the time being, this is what I do code-wise
def while_non_nan(A, k):
    init = k
    if k+1 >= len(A)-1:
        return A.iloc[k-1]
    while np.isnan(A[k+1]):
        k += 1
    # Calculate the value.
    n = k-init+1
    value = (n*A.iloc[init-1] + A.iloc[k])/(n+1)
    return value
evoli.loc[evoli['T1'] >= 1100, 'T1'] = np.nan
evoli.loc[evoli['T1'] <= 0, 'T1'] = np.nan
inds = np.where(np.isnan(evoli))
#Place column means in the indices. Align the arrays using take
for k in inds[0]:
    evoli['T1'].iloc[k] = while_non_nan(evoli['T1'], k)
I transform the outlier values into nan.
Afterwards, I get the position of those nan.
Finally, I replace each NaN with the mean of the previous value and the next one.
Since several NaN can be next to each other, while_non_nan searches for the next non-NaN value and computes the weighted mean.
Example of what I'm hoping to get:
Input :
[nan 0 1 2 nan 4 nan nan 7 nan ]
Output:
[0 0 1 2 3 4 5 6 7 7 ]
Hope it is clear enough. Thanks !
Pandas has built-in interpolation you could use after setting your limits to NaN:
from numpy import nan
import pandas as pd

df = pd.DataFrame({"T1": [1, 2, nan, 3, 5, nan, nan, 4, nan]})
df["T1"] = df["T1"].interpolate(method='linear', axis=0).ffill().bfill()
print(df)
interpolate is a DataFrame/Series method that fills NaN values using the specified interpolation method (linear in this case). Calling .bfill() for backward fill and .ffill() for forward fill ensures the first and last items are also replaced if needed, with the second and second-to-last items respectively. If you want some fancier strategy for the first and last items, you need to write it yourself.
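Applied to the example from your question (a quick sketch on a plain Series):

import pandas as pd
from numpy import nan

s = pd.Series([nan, 0, 1, 2, nan, 4, nan, nan, 7, nan])
print(s.interpolate(method='linear').ffill().bfill().tolist())
# [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 7.0]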
I have a dataframe that's ordered by two columns : 'ID' and a date column.
There's a significant amount of missing values in that table, and what I'm interested in is understanding how the missing values are distributed: are they mainly concentrated for one ID, do all IDs have missing values at their start (date-wise), are missing values unrelated, etc.
After a groupby on ID plus a count of missing values, I used the missingno package and it proved to be useful. This is the result I got (column names sanitized):
From the picture, it seems like there are specific batches of rows where most columns are missing.
If you look at the arrow, for example, I can probably ballpark a value for the indexes to search (~idx = 750000), but this wouldn't be practical since there are other instances of the same thing happening.
What I would like to have is a function batches_missing(cols, n_rows) that takes a list of columns and an int n_rows, and returns a list of tuples [(index_start_batch1, index_end_batch1), ...] of all batches where the given columns have more than n_rows consecutive rows of missing values.
With a mock example :
df = pd.DataFrame({'col1':[1, 2, np.nan, np.nan, np.nan, np.nan, 2, 2, np.nan, np.nan, np.nan],
'col2':[9, 7, np.nan, np.nan, np.nan, np.nan, 0, np.nan, np.nan, np.nan, np.nan],
'col3':[11, 12, 13, np.nan, 1, 2, 3, np.nan, 1, 2, 3]})
batches_missing(['col1','col2'] , 3) would return [(2,5),(8,10)]
Can this be done efficiently given that the actual data is pretty big (1 mil rows) ? I would also be very interested in hearing about other ways of analyzing missing data so would appreciate any reading materials / links !
Thanks everyone.
First, tally row-wise to see which rows are all NA in the selected columns:
rowwise_tally = df[['col1','col2']].isna().apply(all,axis=1)
0 False
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 True
9 True
10 True
Now you can group these runs:
grp = rowwise_tally.diff().cumsum().fillna(0)
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 3.0
9 3.0
10 3.0
Then count the number of NAs in each group and also get each group's start and end positions:
na_counts = rowwise_tally.groupby(grp).sum()
pos = pd.Series(np.arange(len(df))).groupby(grp).agg([np.min, np.max])
pos[na_counts>=3].to_numpy()
array([[ 2, 5],
[ 8, 10]])
There might be a better way to get the position instead of using pd.Series like I did. For now, wrap this into a function:
def fun(data, cols, minlen):
    rowwise_tally = data[cols].isna().apply(all, axis=1)
    grp = rowwise_tally.diff().cumsum().fillna(0)
    na_counts = rowwise_tally.groupby(grp).sum()
    pos = pd.Series(np.arange(len(data))).groupby(grp).agg([np.min, np.max])
    return pos[na_counts >= minlen].to_numpy()

fun(df, ['col1', 'col2'], 3)
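On the sample frame this returns the same result as before, matching the expected [(2, 5), (8, 10)]:

array([[ 2,  5],
       [ 8, 10]])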
There is already an answer that deals with a relatively simple dataframe that is given here.
However, the dataframe I have at hand has multiple columns and a large number of rows. One dataframe contains three dataframes stacked along axis=0 (the bottom of one is attached to the top of the next), separated by a row of NaN values.
How can I create three dataframes out of this one data by splitting it along the NaN rows?
Like in the answer you linked, you want to create a column which identifies the group number. Then you can apply the same solution.
To do so, you have to test whether all the values of a row are NaN. I don't know if there is such a row-wise test built into pandas, but pandas does have a test to check whether a Series is full of NaN, so you want to perform that along the rows (axis=1), so that each "Series" is actually a row of your dataframe:
df["group_no"] = df.isnull().all(axis=1).cumsum()
At that point you can use the same technique from that answer to split the dataframes.
You might want to do a .dropna() at the end, because you will still have the NaN rows in your result.
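For completeness, a minimal sketch of the split itself (not spelled out above), computing the same grouping as a standalone Series so the separator rows can be dropped with dropna:

group_no = df.isnull().all(axis=1).cumsum()
parts = [g.dropna(how="all") for _, g in df.groupby(group_no)]
# parts[0], parts[1], parts[2] are the three original dataframes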
Ran into this same question in 2022. Here's what I did to split dataframes on rows with NaNs, caveat is this relies on pip install python-rle for run-length encoding:
import numpy as np
import pandas as pd
import rle

def nanchucks(df):
    # It chucks NaNs outta dataframes
    # True if the row contains any NaN
    df_nans = pd.isnull(df).sum(axis="columns").astype(bool)
    values, counts = rle.encode(df_nans)
    df_nans = pd.DataFrame({"values": values, "counts": counts})
    df_nans["cum_counts"] = df_nans["counts"].cumsum()
    df_nans["start_idx"] = df_nans["cum_counts"].shift(1)
    df_nans.loc[0, "start_idx"] = 0
    df_nans["start_idx"] = df_nans["start_idx"].astype(int)  # np.nan makes it a float column
    df_nans["end_idx"] = df_nans["cum_counts"] - 1

    # Only keep the chunks of data w/o NaNs
    df_nans = df_nans[df_nans["values"] == False]

    indices = []
    for idx, row in df_nans.iterrows():
        indices.append((row["start_idx"], row["end_idx"]))

    return [df.loc[df.index[i[0]]: df.index[i[1]]] for i in indices]
Examples:
sample_df1 = pd.DataFrame({
"a": [1, 2, np.nan, 3, 4],
"b": [1, 2, np.nan, 3, 4],
"c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
"a": [1, 2, np.nan, 3, 4],
"b": [1, 2, 3, np.nan, 4],
"c": [1, 2, np.nan, 3, 4],
})
print(nanchucks(sample_df1))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 3 3.0 3.0 3.0
# 4 4.0 4.0 4.0]
print(nanchucks(sample_df2))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 4 4.0 4.0 4.0]