I have the following DataFrame df (a small extract is given):
time_diff avg_qty_per_day
1.450000 1.0
1.483333 1.0
1.500000 1.0
2.516667 1.0
2.533333 1.0
2.533333 1.5
3.633333 1.8
3.644567 5.0
How can I group it into bins in order to get the following result?
1 3
2 3.5
3 6.8
The size of a bin should be configurable.
I think you need pd.cut:
import numpy as np
import pandas as pd

bins = [-np.inf, 2, 3, np.inf]
labels = [1, 2, 3]
out = df['avg_qty_per_day'].groupby(pd.cut(df['time_diff'], bins=bins, labels=labels)).sum()
print (out)
time_diff
1 3.0
2 3.5
3 6.8
Name: avg_qty_per_day, dtype: float64
If you want to check the labels:
bins = [-np.inf, 2, 3, np.inf]
labels=[1,2,3]
df['label'] = pd.cut(df['time_diff'], bins=bins, labels=labels)
print (df)
time_diff avg_qty_per_day label
0 1.450000 1.0 1
1 1.483333 1.0 1
2 1.500000 1.0 1
3 2.516667 1.0 2
4 2.533333 1.0 2
5 2.533333 1.5 2
6 3.633333 1.8 3
7 3.644567 5.0 3
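A side note on the configurable bin size from the question: the expected labels are just time_diff rounded down to the bin width, so one simple variation (my own sketch, not part of the answer above; bin_size is a name I introduced) is:
bin_size = 1  # configurable bin width

# label each row by its bin number (floor of time_diff / bin_size)
bin_id = (df['time_diff'] // bin_size).astype(int)
print (df['avg_qty_per_day'].groupby(bin_id).sum())
# time_diff
# 1    3.0
# 2    3.5
# 3    6.8
# Name: avg_qty_per_day, dtype: float64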
I'm trying to get the time interval of the last two periods of non-zero demand. The final column should be as shown in nonzero_interval. TIA.
edit:
I've added a link to the paper that motivated this question.
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {'y': [34, 12, 2, 0, 0, 0, 23, 0, 10, 0],
     'nonzero_interval': [np.nan, np.nan, 1, 1, 1, 1, 1, 4, 4, 2]})
print(df)
The idea comes from Forecasting Intermittent Demand Patterns with Time Series and Machine Learning Methodologies.
One method using numpy:
n = 2
s = df[df.y.ne(0)].index                             # index positions of the non-zero demands
a = np.diag(s.values - s.values[:, None], k=n - 1)   # gap between each non-zero position and the one (n-1) positions later
df['New'] = pd.Series(a, index=s[n - 1:])            # attach each gap to the later non-zero position
df.New = df.New.shift(n - 1).ffill()                 # shift one more row, then fill the zero-demand rows forward
df
y nonzero_interval New
0 34 NaN NaN
1 12 NaN NaN
2 2 1.0 1.0
3 0 1.0 1.0
4 0 1.0 1.0
5 0 1.0 1.0
6 23 1.0 1.0
7 0 4.0 4.0
8 10 4.0 4.0
9 0 2.0 2.0
IIUC, you can do it with groupby.transform and count. The groups are created with cumsum on the positions where the value is not equal to 0; then the values at the rows equal to 0 are changed to NaN with where, followed by shift and ffill.
df['nonzero_interval'] = (df.groupby(df['y'].ne(0).cumsum().shift())['y']
                            .transform('count')
                            .where(df['y'].ne(0))
                            .shift().ffill()
                          )
print (df)
y nonzero_interval
0 34 NaN
1 12 NaN
2 2 1.0
3 0 1.0
4 0 1.0
5 0 1.0
6 23 1.0
7 0 4.0
8 10 4.0
9 0 2.0
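If the chained expression above is hard to follow, printing the intermediate pieces shows what each step contributes; a small inspection sketch (the helper names grp and cnt are mine):
grp = df['y'].ne(0).cumsum().shift()           # one group per non-zero demand, shifted one row down
cnt = df.groupby(grp)['y'].transform('count')  # group size; at the non-zero rows this is the gap to the previous non-zero demand
print (pd.DataFrame({'y': df['y'], 'grp': grp, 'cnt': cnt,
                     'masked': cnt.where(df['y'].ne(0))}))
The final .shift().ffill() then carries each masked value forward onto the following rows, which reproduces nonzero_interval.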
If I have a dataframe:
A B C
0.0285714285714285 4 0.11428571
0.107142857142857 4 0.42857143
0.007142857142857 6 0.04285714
1.2 4 5.5
1.5 3 3
Desired output is:
A*B C Difference
0.114285714285714 0.11428571 0.000000004285714
0.428571428571428 0.42857143 -0.000000001428572
0.042857142857142 0.04285714 0.000000002857142
4.8 5.5 -0.7
4.5 3 1.5
Count: 2
I want to ignore rows like the first three, because the difference is very small; only differences that show up in the first digit after the decimal point (i.e. at least 0.1 in absolute value) should be counted. Could you please help me with this?
EDIT:
Because the values in column A are objects (obviously strings), convert them to float first:
df['A'] = df['A'].astype(float)
If that does not work because of bad values (e.g. some strings), the bad values are replaced by NaNs:
df['A'] = pd.to_numeric(df['A'], errors='coerce')
Use Series.mask to set the new column by a condition built with Series.between:
# multiply the columns
df['A*B'] = df["A"]*df["B"]
# difference as a Series
diff = df['A*B'] - df['C']
# mask of differences that are too small to count
mask = diff.between(-0.1, 0.1)
df["difference"] = diff.mask(mask, 0)
print (df)
A B C A*B difference
0 0.028571 4 0.114286 0.114286 0.0
1 0.107143 4 0.428571 0.428571 0.0
2 0.007143 6 0.042857 0.042857 0.0
3 1.200000 4 5.500000 4.800000 -0.7
4 1.500000 3 3.000000 4.500000 1.5
print (f'Count: {(~mask).sum()}')
Count: 2
If column order is important, add DataFrame.insert with DataFrame.pop to extract the columns:
df.insert(0, 'A*B', df.pop("A")*df.pop("B"))
diff = df['A*B'] - df['C']
mask = diff.between(-0.1, 0.1)
df["difference"] = diff.mask(mask, 0)
print (df)
A*B C difference
0 0.114286 0.114286 0.0
1 0.428571 0.428571 0.0
2 0.042857 0.042857 0.0
3 4.800000 5.500000 -0.7
4 4.500000 3.000000 1.5
print (f'Count: {(~mask).sum()}')
Count: 2
Using np.where to check whether the result is significant enough:
df["difference"] = np.where((df["A"]*df["B"]-df["C"]>=0.1)|(df["A"]*df["B"]-df["C"]<=-0.1),df["A"]*df["B"]-df["C"],0)
print (df)
A B C difference
0 0.028571 4 0.114286 0.0
1 0.107143 4 0.428571 0.0
2 0.007143 6 0.042857 0.0
3 1.200000 4 5.500000 -0.7
4 1.500000 3 3.000000 1.5
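The two one-sided comparisons can also be collapsed with np.abs; a small variation of the same idea (mine, not from the answer above):
diff = df["A"]*df["B"] - df["C"]
df["difference"] = np.where(np.abs(diff) >= 0.1, diff, 0)
print (f'Count: {(np.abs(diff) >= 0.1).sum()}')   # Count: 2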
I have a very simple Pandas Series:
xx = pd.Series([1, 2, np.nan, np.nan, 3, 4, 5])
If I run this I get what I want:
>>> xx.rolling(3,1).mean()
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
But if I have to use .apply(), I cannot get it to ignore NaNs in the mean() operation:
>>> xx.rolling(3,1).apply(np.mean)
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
>>> xx.rolling(3,1).apply(lambda x : np.mean(x))
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
What should I do in order to both use .apply() and get the result shown in the first output? My actual problem is more complicated and requires .apply(), but it boils down to this issue.
You can use np.nanmean(): .apply() passes the raw window values (including the NaNs) to the function, whereas the built-in rolling mean() skips NaNs itself.
xx.rolling(3,1).apply(lambda x : np.nanmean(x))
Out[59]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
If you have to handle the NaNs explicitly, you can do:
xx.rolling(3,1).apply(lambda x : np.mean(x[~np.isnan(x)]))
Out[94]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
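Depending on your pandas version, rolling.apply also accepts raw=True, which hands each window to the function as a plain NumPy array instead of a Series and is usually faster; assuming a version that supports it:
xx.rolling(3,1).apply(np.nanmean, raw=True)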
What's the best way to sum the columns of df2 by the columns of df3 in the below?
df = pd.DataFrame(np.random.rand(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.random.rand(15).reshape((5,3)),index = ['A','B','C','D','E'])
df2 = pd.concat([df,df1],axis=1)
df3 = pd.DataFrame(np.random.rand(25).reshape((5,5)),columns = np.arange(5),index = ['A','B','C','D','E'])
The answer would be in the shape of df3.
EDIT for clarity:
df = pd.DataFrame(np.ones(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.ones(15).reshape((5,3))*2,index = ['A','B','C','D','E'],columns = [1,3,4])
df2 = pd.concat([df,df1],axis=1)
df3 = pd.DataFrame(np.empty((5,5)),columns = np.arange(5),index = ['A','B','C','D','E'])
print(df2)
0 1 2 3 4 1 3 4
A 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
B 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
C 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
D 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
E 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0
The desired result would be:
0 1 2 3 4
A 1.0 3.0 1.0 3.0 3.0
B 1.0 3.0 1.0 3.0 3.0
C 1.0 3.0 1.0 3.0 3.0
D 1.0 3.0 1.0 3.0 3.0
E 1.0 3.0 1.0 3.0 3.0
you can group your DF by columns:
In [57]: df2.groupby(axis=1, by=df2.columns).sum()
Out[57]:
0 1 2 3 4
A 1.0 3.0 1.0 3.0 3.0
B 1.0 3.0 1.0 3.0 3.0
C 1.0 3.0 1.0 3.0 3.0
D 1.0 3.0 1.0 3.0 3.0
E 1.0 3.0 1.0 3.0 3.0
You can also specify the axis name explicitly:
In [58]: df2.groupby(axis='columns', by=df2.columns).sum()
Out[58]:
0 1 2 3 4
A 1.0 3.0 1.0 3.0 3.0
B 1.0 3.0 1.0 3.0 3.0
C 1.0 3.0 1.0 3.0 3.0
D 1.0 3.0 1.0 3.0 3.0
E 1.0 3.0 1.0 3.0 3.0
Or a short version from @piRSquared:
df2.groupby(df2.columns, 1).sum()
Let's use T (transpose), groupby, and sum:
df2.T.groupby(level=0).sum().T
Original df2:
          0         1         2         3         4         0         1         2
A  0.627278  0.008150  0.285077  0.931831  0.683035  0.691318  0.873139  0.763687
B  0.246861  0.108021  0.903743  0.030373  0.870753  0.143835  0.251623  0.735570
C  0.367309  0.551530  0.193623  0.704314  0.136061  0.102401  0.287334  0.405304
D  0.580771  0.592600  0.949666  0.806875  0.288331  0.794173  0.034380  0.446789
E  0.088984  0.838401  0.988919  0.636134  0.353484  0.584571  0.090235  0.542930
new_df2 = df2.T.groupby(level=0).sum().T
print(new_df2)
Output new df2:
0 1 2 3 4
A 1.318595 0.881289 1.048764 0.931831 0.683035
B 0.390697 0.359644 1.639314 0.030373 0.870753
C 0.469710 0.838864 0.598927 0.704314 0.136061
D 1.374944 0.626980 1.396455 0.806875 0.288331
E 0.673555 0.928636 1.531849 0.636134 0.353484
solution 1
numpy.dot + pandas.get_dummies
cols = df2.columns.values
pd.DataFrame(
    df2.values.dot(pd.get_dummies(cols).values),
    df2.index, pd.unique(df2.columns.values)
)
0 1 2 3 4
A 1 3 1 3 3
B 1 3 1 3 3
C 1 3 1 3 3
D 1 3 1 3 3
E 1 3 1 3 3
solution 2
numpy.einsum + pandas.get_dummies
cols = df2.columns.values
pd.DataFrame(
    np.einsum('ij,jk->ik', df2.values, pd.get_dummies(cols).values),
    df2.index, pd.unique(df2.columns.values)
)
0 1 2 3 4
A 1 3 1 3 3
B 1 3 1 3 3
C 1 3 1 3 3
D 1 3 1 3 3
E 1 3 1 3 3
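Both solutions rely on pd.get_dummies(cols) building a one-hot matrix that maps every (possibly duplicated) column label to its unique label, so the matrix product adds up the columns sharing a label. Printing it makes this visible (the astype(int) is only cosmetic, since newer pandas returns booleans):
cols = df2.columns.values
print (pd.get_dummies(cols).astype(int))
   0  1  2  3  4
0  1  0  0  0  0
1  0  1  0  0  0
2  0  0  1  0  0
3  0  0  0  1  0
4  0  0  0  0  1
5  0  1  0  0  0
6  0  0  0  1  0
7  0  0  0  0  1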
naive timing
setup
df2 = pd.DataFrame(
    [[1, 1, 1, 1, 1, 2, 2, 2]] * 5,
    list('ABCDE'),
    [0, 1, 2, 3, 4, 1, 3, 4]
)
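The timing numbers themselves are not included here; a rough sketch of how the two approaches could be compared on this setup (no results claimed; the transpose form of the groupby is used so it runs on current pandas):
from timeit import timeit

dot_way = lambda: pd.DataFrame(
    df2.values.dot(pd.get_dummies(df2.columns.values).values),
    df2.index, pd.unique(df2.columns.values))
gb_way = lambda: df2.T.groupby(level=0).sum().T

print ('dot/get_dummies:', timeit(dot_way, number=1000))
print ('groupby        :', timeit(gb_way, number=1000))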
Is this what you mean:
new_df = pd.DataFrame()
for c in df3.columns:
    try:
        # c is a duplicated label: df2[c] returns a DataFrame, so sum each row
        new_df[c] = [sum(x) for x in df2[c].values]
    except TypeError:
        # c occurs only once: df2[c] is a Series, keep it as is
        new_df[c] = df2[c].values
I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I am now looking for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is that the gap in A varies from dataset to dataset in position and length...
set_index and reset_index are your friends.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5], "B": [1, 4, 6, 2, 4, 3], "C": [3, 2, 1, 0, 5, 3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index; here the missing data is filled in with NaNs. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = pd.Index(np.arange(0, 5, 0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Using the answer by EdChum above, I created the following function
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    return df \
        .merge(how='right', on=field,
               right=pd.DataFrame({field: np.arange(range_from, range_to, range_step)})) \
        .sort_values(by=field).reset_index().fillna(fill_with).drop(['index'], axis=1)
Example usage (note that range_to is exclusive, as in np.arange, so go one step past the last value you want):
fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)
In this case I am generating a new dataframe that covers the full range of A, merging it with your original df, and then re-sorting it:
In [177]:
df.merge(how='right', on='A',
         right=pd.DataFrame({'A': np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)}))\
  .sort_values(by='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange call, which takes start and end values and a step; note that I added 0.5 (one step) to the end because arange ranges are half-open, i.e. the end value is excluded.
A more general method could be like this:
In [197]:
df = df.set_index(keys='A', drop=False).reindex(np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5))
df.reset_index(inplace=True)
# the original A column now holds NaN for the inserted rows; the filled values live in 'index'
df.drop(['A'], axis=1, inplace=True)
df.reset_index().drop(['level_0'], axis=1)
Out[197]:
index B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A but don't drop it and then reindex the df using the arange function.
This question was asked a long time ago, but I have a simple solution that's worth mentioning. You can simply use NumPy's NaN. For instance:
import numpy as np
df.loc[i, j] = np.nan
will do the trick.
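For example, to blank out a single cell of a small frame (the row label 1 and column "B" below are just placeholders):
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.0, 0.5, 1.0], "B": [1.0, 4.0, 6.0], "C": [3.0, 2.0, 1.0]})
df.loc[1, "B"] = np.nan   # set the cell at row label 1, column "B" to NaN
print (df)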