I have the following DataFrame df (a small extract is given):
time_diff avg_qty_per_day
1.450000 1.0
1.483333 1.0
1.500000 1.0
2.516667 1.0
2.533333 1.0
2.533333 1.5
3.633333 1.8
3.644567 5.0
How can I group it into bins in order to get the following result?
1 3
2 3.5
3 6.8
The size of a bin should be configurable.
I think you need pd.cut:
import numpy as np
import pandas as pd

bins = [-np.inf, 2, 3, np.inf]
labels = [1, 2, 3]
out = df['avg_qty_per_day'].groupby(pd.cut(df['time_diff'], bins=bins, labels=labels)).sum()
print (out)
time_diff
1 3.0
2 3.5
3 6.8
Name: avg_qty_per_day, dtype: float64
If you want to check the labels:
bins = [-np.inf, 2, 3, np.inf]
labels=[1,2,3]
df['label'] = pd.cut(df['time_diff'], bins=bins, labels=labels)
print (df)
time_diff avg_qty_per_day label
0 1.450000 1.0 1
1 1.483333 1.0 1
2 1.500000 1.0 1
3 2.516667 1.0 2
4 2.533333 1.0 2
5 2.533333 1.5 2
6 3.633333 1.8 3
7 3.644567 5.0 3
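A side note on the configurable bin size from the question: the expected labels are just time_diff rounded down to the bin width, so one simple variation (my own sketch, not part of the answer above; bin_size is a name I introduced) is:
bin_size = 1  # configurable bin width

# label each row by its bin number (floor of time_diff / bin_size)
bin_id = (df['time_diff'] // bin_size).astype(int)
print (df['avg_qty_per_day'].groupby(bin_id).sum())
# time_diff
# 1    3.0
# 2    3.5
# 3    6.8
# Name: avg_qty_per_day, dtype: float64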
I'm trying to get the time interval of the last two periods of non-zero demand. The final column should be as shown in nonzero_interval. TIA.
edit:
I've added a link to the paper that motivated this question.
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {'y': [34, 12, 2, 0, 0, 0, 23, 0, 10, 0],
     'nonzero_interval': [np.nan, np.nan, 1, 1, 1, 1, 1, 4, 4, 2]})
print(df)
The idea comes from Forecasting Intermittent Demand Patterns with Time Series and Machine Learning Methodologies.
One method using numpy:
n = 2
s = df[df.y.ne(0)].index                             # index positions of the non-zero demands
a = np.diag(s.values - s.values[:, None], k=n - 1)   # gap between each non-zero position and the one (n-1) positions later
df['New'] = pd.Series(a, index=s[n - 1:])            # attach each gap to the later non-zero position
df.New = df.New.shift(n - 1).ffill()                 # shift one more row, then fill the zero-demand rows forward
df
y nonzero_interval New
0 34 NaN NaN
1 12 NaN NaN
2 2 1.0 1.0
3 0 1.0 1.0
4 0 1.0 1.0
5 0 1.0 1.0
6 23 1.0 1.0
7 0 4.0 4.0
8 10 4.0 4.0
9 0 2.0 2.0
IIUC, you can do it with groupby.transform and count. The groups are created with cumsum on the positions where the value is not equal to 0; then the values at the rows equal to 0 are changed to NaN with where, followed by shift and ffill.
df['nonzero_interval'] = (df.groupby(df['y'].ne(0).cumsum().shift())['y']
                            .transform('count')
                            .where(df['y'].ne(0))
                            .shift().ffill()
                          )
print (df)
y nonzero_interval
0 34 NaN
1 12 NaN
2 2 1.0
3 0 1.0
4 0 1.0
5 0 1.0
6 23 1.0
7 0 4.0
8 10 4.0
9 0 2.0
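If the chained expression above is hard to follow, printing the intermediate pieces shows what each step contributes; a small inspection sketch (the helper names grp and cnt are mine):
grp = df['y'].ne(0).cumsum().shift()           # one group per non-zero demand, shifted one row down
cnt = df.groupby(grp)['y'].transform('count')  # group size; at the non-zero rows this is the gap to the previous non-zero demand
print (pd.DataFrame({'y': df['y'], 'grp': grp, 'cnt': cnt,
                     'masked': cnt.where(df['y'].ne(0))}))
The final .shift().ffill() then carries each masked value forward onto the following rows, which reproduces nonzero_interval.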
If I have a dataframe:
A B C
0.0285714285714285 4 0.11428571
0.107142857142857 4 0.42857143
0.007142857142857 6 0.04285714
1.2 4 5.5
1.5 3 3
Desired output is:
A*B C Difference
0.114285714285714 0.11428571 0.000000004285714
0.428571428571428 0.42857143 -0.000000001428572
0.042857142857142 0.04285714 0.000000002857142
4.8 5.5 -0.7
4.5 3 1.5
Count: 2
I want to ignore rows like the first three, because the difference is very small; only differences that show up in the first digit after the decimal point (i.e. at least 0.1 in absolute value) should be counted. Could you please help me with this?
EDIT:
Because the values in column A are objects (obviously strings), convert them to float first:
df['A'] = df['A'].astype(float)
If that does not work because of bad values (e.g. some strings), the bad values are replaced by NaNs:
df['A'] = pd.to_numeric(df['A'], errors='coerce')
Use Series.mask to set the new column by a condition built with Series.between:
# multiply the columns
df['A*B'] = df["A"]*df["B"]
# difference as a Series
diff = df['A*B'] - df['C']
# mask of differences that are too small to count
mask = diff.between(-0.1, 0.1)
df["difference"] = diff.mask(mask, 0)
print (df)
A B C A*B difference
0 0.028571 4 0.114286 0.114286 0.0
1 0.107143 4 0.428571 0.428571 0.0
2 0.007143 6 0.042857 0.042857 0.0
3 1.200000 4 5.500000 4.800000 -0.7
4 1.500000 3 3.000000 4.500000 1.5
print (f'Count: {(~mask).sum()}')
Count: 2
If column order is important, add DataFrame.insert with DataFrame.pop to extract the columns:
df.insert(0, 'A*B', df.pop("A")*df.pop("B"))
diff = df['A*B'] - df['C']
mask = diff.between(-0.1, 0.1)
df["difference"] = diff.mask(mask, 0)
print (df)
A*B C difference
0 0.114286 0.114286 0.0
1 0.428571 0.428571 0.0
2 0.042857 0.042857 0.0
3 4.800000 5.500000 -0.7
4 4.500000 3.000000 1.5
print (f'Count: {(~mask).sum()}')
Count: 2
Using np.where to check whether the result is significant enough:
df["difference"] = np.where((df["A"]*df["B"]-df["C"]>=0.1)|(df["A"]*df["B"]-df["C"]<=-0.1),df["A"]*df["B"]-df["C"],0)
print (df)
A B C difference
0 0.028571 4 0.114286 0.0
1 0.107143 4 0.428571 0.0
2 0.007143 6 0.042857 0.0
3 1.200000 4 5.500000 -0.7
4 1.500000 3 3.000000 1.5
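The two one-sided comparisons can also be collapsed with np.abs; a small variation of the same idea (mine, not from the answer above):
diff = df["A"]*df["B"] - df["C"]
df["difference"] = np.where(np.abs(diff) >= 0.1, diff, 0)
print (f'Count: {(np.abs(diff) >= 0.1).sum()}')   # Count: 2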
I have a very simple Pandas Series:
xx = pd.Series([1, 2, np.nan, np.nan, 3, 4, 5])
If I run this I get what I want:
>>> xx.rolling(3,1).mean()
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
But if I have to use .apply(), I cannot get it to ignore NaNs in the mean() operation:
>>> xx.rolling(3,1).apply(np.mean)
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
>>> xx.rolling(3,1).apply(lambda x : np.mean(x))
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
What should I do in order to both use .apply() and get the result shown in the first output? My actual problem is more complicated and requires .apply(), but it boils down to this issue.
You can use np.nanmean(): .apply() passes the raw window values (including the NaNs) to the function, whereas the built-in rolling mean() skips NaNs itself.
xx.rolling(3,1).apply(lambda x : np.nanmean(x))
Out[59]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
If you have to handle the NaNs explicitly, you can do:
xx.rolling(3,1).apply(lambda x : np.mean(x[~np.isnan(x)]))
Out[94]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
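Depending on your pandas version, rolling.apply also accepts raw=True, which hands each window to the function as a plain NumPy array instead of a Series and is usually faster; assuming a version that supports it:
xx.rolling(3,1).apply(np.nanmean, raw=True)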
What's the best way to sum the columns of df2 by the columns of df3 in the below?
df = pd.DataFrame(np.random.rand(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.random.rand(15).reshape((5,3)),index = ['A','B','C','D','E'])
df2 = pd.concat([df,df1],axis=1)
df3 = pd.DataFrame(np.random.rand(25).reshape((5,5)),columns = np.arange(5),index = ['A','B','C','D','E'])
The answer would be in the shape of df3.
EDIT for clarity:
df = pd.DataFrame(np.ones(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.ones(15).reshape((5,3))*2,index = ['A','B','C','D','E'],columns = [1,3,4])
df2 = pd.concat([df,df1],axis=1)
df3 = pd.DataFrame(np.empty((5,5)),columns = np.arange(5),index = ['A','B','C','D','E'])
print(df2)
0 1 2 3 4 1 3 4
A 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
B 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
C 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
D 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
E 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
The desired result would be:
0 1 2 3 4
A 1.0 3.0 1.0 3.0 3.0
B 1.0 3.0 1.0 3.0 3.0
C 1.0 3.0 1.0 3.0 3.0
D 1.0 3.0 1.0 3.0 3.0
E 1.0 3.0 1.0 3.0 3.0
you can group your DF by columns:
In [57]: df2.groupby(axis=1, by=df2.columns).sum()
Out[57]:
0 1 2 3 4
A 1.0 3.0 1.0 3.0 3.0
B 1.0 3.0 1.0 3.0 3.0
C 1.0 3.0 1.0 3.0 3.0
D 1.0 3.0 1.0 3.0 3.0
E 1.0 3.0 1.0 3.0 3.0
You can also specify the axis name explicitly:
In [58]: df2.groupby(axis='columns', by=df2.columns).sum()
Out[58]:
0 1 2 3 4
A 1.0 3.0 1.0 3.0 3.0
B 1.0 3.0 1.0 3.0 3.0
C 1.0 3.0 1.0 3.0 3.0
D 1.0 3.0 1.0 3.0 3.0
E 1.0 3.0 1.0 3.0 3.0
Or a short version from @piRSquared:
df2.groupby(df2.columns, 1).sum()
Let's use T (transpose), groupby, and sum:
df2.T.groupby(level=0).sum().T
Original df2:
          0         1         2         3         4         0         1         2
A  0.627278  0.008150  0.285077  0.931831  0.683035  0.691318  0.873139  0.763687
B  0.246861  0.108021  0.903743  0.030373  0.870753  0.143835  0.251623  0.735570
C  0.367309  0.551530  0.193623  0.704314  0.136061  0.102401  0.287334  0.405304
D  0.580771  0.592600  0.949666  0.806875  0.288331  0.794173  0.034380  0.446789
E  0.088984  0.838401  0.988919  0.636134  0.353484  0.584571  0.090235  0.542930
new_df2 = df2.T.groupby(level=0).sum().T
print(new_df2)
Output new df2:
0 1 2 3 4
A 1.318595 0.881289 1.048764 0.931831 0.683035
B 0.390697 0.359644 1.639314 0.030373 0.870753
C 0.469710 0.838864 0.598927 0.704314 0.136061
D 1.374944 0.626980 1.396455 0.806875 0.288331
E 0.673555 0.928636 1.531849 0.636134 0.353484
solution 1
numpy.dot + pandas.get_dummies
cols = df2.columns.values
pd.DataFrame(
    df2.values.dot(pd.get_dummies(cols).values),
    df2.index, pd.unique(df2.columns.values)
)
0 1 2 3 4
A 1 3 1 3 3
B 1 3 1 3 3
C 1 3 1 3 3
D 1 3 1 3 3
E 1 3 1 3 3
solution 2
numpy.einsum + pandas.get_dummies
cols = df2.columns.values
pd.DataFrame(
    np.einsum('ij,jk->ik', df2.values, pd.get_dummies(cols).values),
    df2.index, pd.unique(df2.columns.values)
)
0 1 2 3 4
A 1 3 1 3 3
B 1 3 1 3 3
C 1 3 1 3 3
D 1 3 1 3 3
E 1 3 1 3 3
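Both solutions rely on pd.get_dummies(cols) building a one-hot matrix that maps every (possibly duplicated) column label to its unique label, so the matrix product adds up the columns sharing a label. Printing it makes this visible (the astype(int) is only cosmetic, since newer pandas returns booleans):
cols = df2.columns.values
print (pd.get_dummies(cols).astype(int))
   0  1  2  3  4
0  1  0  0  0  0
1  0  1  0  0  0
2  0  0  1  0  0
3  0  0  0  1  0
4  0  0  0  0  1
5  0  1  0  0  0
6  0  0  0  1  0
7  0  0  0  0  1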
naive timing
setup
df2 = pd.DataFrame(
    [[1, 1, 1, 1, 1, 2, 2, 2]] * 5,
    list('ABCDE'),
    [0, 1, 2, 3, 4, 1, 3, 4]
)
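The timing numbers themselves are not included here; a rough sketch of how the two approaches could be compared on this setup (no results claimed; the transpose form of the groupby is used so it runs on current pandas):
from timeit import timeit

dot_way = lambda: pd.DataFrame(
    df2.values.dot(pd.get_dummies(df2.columns.values).values),
    df2.index, pd.unique(df2.columns.values))
gb_way = lambda: df2.T.groupby(level=0).sum().T

print ('dot/get_dummies:', timeit(dot_way, number=1000))
print ('groupby        :', timeit(gb_way, number=1000))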
Is this what you mean:
new_df = pd.DataFrame()
for c in df3.columns:
    try:
        # c is a duplicated label: df2[c] returns a DataFrame, so sum each row
        new_df[c] = [sum(x) for x in df2[c].values]
    except TypeError:
        # c occurs only once: df2[c] is a Series, keep it as is
        new_df[c] = df2[c].values
I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I am now looking for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is that the gap in A varies from dataset to dataset in position and length...
set_index and reset_index are your friends.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5], "B": [1, 4, 6, 2, 4, 3], "C": [3, 2, 1, 0, 5, 3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index; here the missing data is filled in with NaNs. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = pd.Index(np.arange(0, 5, 0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Using the answer by EdChum above, I created the following function
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    return df \
        .merge(how='right', on=field,
               right=pd.DataFrame({field: np.arange(range_from, range_to, range_step)})) \
        .sort_values(by=field).reset_index().fillna(fill_with).drop(['index'], axis=1)
Example usage (note that range_to is exclusive, as in np.arange, so go one step past the last value you want):
fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)
In this case I am generating a new dataframe that covers the full range of A, merging it with your original df, and then re-sorting it:
In [177]:
df.merge(how='right', on='A',
         right=pd.DataFrame({'A': np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)}))\
  .sort_values(by='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange call, which takes start and end values and a step; note that I added 0.5 (one step) to the end because arange ranges are half-open, i.e. the end value is excluded.
A more general method could be like this:
In [197]:
df = df.set_index(keys='A', drop=False).reindex(np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5))
df.reset_index(inplace=True)
# the original A column now holds NaN for the inserted rows; the filled values live in 'index'
df.drop(['A'], axis=1, inplace=True)
df.reset_index().drop(['level_0'], axis=1)
Out[197]:
index B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A but don't drop it and then reindex the df using the arange function.
This question was asked a long time ago, but I have a simple solution that's worth mentioning. You can simply use NumPy's NaN. For instance:
import numpy as np
df.loc[i, j] = np.nan
will do the trick.
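For example, to blank out a single cell of a small frame (the row label 1 and column "B" below are just placeholders):
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.0, 0.5, 1.0], "B": [1.0, 4.0, 6.0], "C": [3.0, 2.0, 1.0]})
df.loc[1, "B"] = np.nan   # set the cell at row label 1, column "B" to NaN
print (df)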