DataFrame manipulation function/method - python

I have a df that looks like:
A B
1.2 1
1.3 1
1.1 1
1.0 0
1.0 0
1.5 1
1.6 1
0.7 1
1.1 0
Is there any function or method to calculate the cumsum piece by piece? I mean, for every run of consecutive B values of 1, calculate the cumsum and reset it at each 0. For the above example the result should be:
A B C
1.2 1 1.2
1.3 1 2.5
1.1 1 3.6
1.0 0 0
1.0 0 0
1.5 1 1.5
1.6 1 3.1
0.7 1 3.8
1.1 0 0
many thanks,

from io import StringIO
import pandas as pd
import numpy as np

text = """a b
1.2 1
1.3 1
1.1 1
1.0 0
1.0 0
1.5 1
1.6 1
0.7 1
1.1 0"""
df = pd.read_csv(StringIO(text), sep=r"\s+")  # sep=r"\s+" replaces the deprecated delim_whitespace=True

c = df["a"].cumsum()          # running total over the whole column
mask = ~df["b"].astype(bool)  # rows where b == 0, i.e. the reset points
s = pd.Series(np.nan, index=df.index)
s[mask] = c[mask]             # checkpoint the running total at each reset
c -= s.ffill().fillna(0)      # subtract the most recent checkpoint
print(c)
output:
0 1.2
1 2.5
2 3.6
3 0.0
4 0.0
5 1.5
6 3.1
7 3.8
8 0.0
dtype: float64

Another approach (which may be slightly more general) is to groupby the consecutive entries in B.
First we enumerate the groups:
In [11]: (df.B != df.B.shift())
Out[11]:
0 True
1 False
2 False
3 True
4 False
5 True
6 False
7 False
8 True
Name: B, dtype: bool
In [12]: enumerate_B_changes = (df.B != df.B.shift()).astype(int).cumsum()
In [13]: enumerate_B_changes
Out[13]:
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 3
8 4
dtype: int64
And then we can groupby this Series, and cumsum:
In [14]: df.groupby(enumerate_B_changes)['A'].cumsum()
Out[14]:
0 1.2
1 2.5
2 3.6
3 1.0
4 2.0
5 1.5
6 3.1
7 3.8
8 1.1
dtype: float64
However we have to multiply by df['B'] in this case to account for 0s in column B.
In [15]: df.groupby(enumerate_B_changes)['A'].cumsum() * df['B']
Out[15]:
0 1.2
1 2.5
2 3.6
3 0.0
4 0.0
5 1.5
6 3.1
7 3.8
8 0.0
dtype: float64
If we wanted a different operation for entries that are neither 0 nor 1, we could do something different here; the whole approach is condensed in the sketch below.
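A compact version of this answer, condensed to two lines (a sketch against the same df; cumsum on a boolean Series already yields the group numbers directly):
group_ids = (df.B != df.B.shift()).cumsum()  # label each run of consecutive B values
df['C'] = df.groupby(group_ids)['A'].cumsum() * df['B']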

I'm not super versed in numpy, but the code below should help.
It goes through the rows and, if b is 1, keeps adding to the cumulative sum; otherwise it resets it.
rows = [
    (1.2, 1),
    (1.3, 1),
    (1.1, 1),
    (1.0, 0),
    (1.0, 0),
    (1.5, 1),
    (1.6, 1),
    (0.7, 1),
    (1.1, 0),
]

c = []
cumsum = 0
for a, b in rows:
    if b == 1:
        cumsum += a       # extend the running sum while b stays 1
        c.append(cumsum)
    else:
        cumsum = 0        # reset on b == 0
        c.append(0)
print(c)
And it outputs (with floating-point representation noise in the plain repr, which numpy/pandas output formatting would round away):
[1.2, 2.5, 3.6000000000000001, 0, 0, 1.5, 3.1000000000000001, 3.7999999999999998, 0]

Related

Pandas dataframe range check using between and rolling

I have to consider the nth row and check rows n+1 to n+3: test whether each is in the range (nth row value) - 0.5 to (nth row value) + 0.5, then AND (&) the three results.
A result
0 1.1 1 # 1.2 1.3 and 1.5 are in range of 0.6 to 1.6, ( 1 & 1 & 1)
1 1.2 0 # 1.3 and 1.5 are in range of 0.7 to 1.7, but 2 is not, hence ( 1 & 1 & 0)
2 1.3 0 # 1.5 and 1 are in range of 0.8 to 1.8, but not 2 ( 1 & 0 & 1)
3 1.5
4 2.0
5 1.0
6 2.5
7 1.8
8 4.0
9 4.2
10 4.5
11 3.9
df = pd.DataFrame( {
'A': [1.1,1.2,1.3,1.9,2,1,2.5,1.8,4,4.2,4.5,3.9]
} )
I have done some research on the site, but couldn't find the exact syntax. I tried using the rolling function to take 3 rows, the between function to check the range, and then ANDing the results. Could you please help here?
s = pd.Series([1, 2, 3, 4])
s.rolling(2).between(s-1,s+1)
getting error :
AttributeError: 'Rolling' object has no attribute 'between'
You can also achieve the result without using rolling() while still using .between(), as follows:
df['result'] = (
(df['A'].shift(-1).between(df['A'] - 0.5, df['A'] + 0.5)) &
(df['A'].shift(-2).between(df['A'] - 0.5, df['A'] + 0.5)) &
(df['A'].shift(-3).between(df['A'] - 0.5, df['A'] + 0.5))
).astype(int)
Result:
print(df)
A result
0 1.1 1
1 1.2 0
2 1.3 0
3 1.5 0
4 2.0 0
5 1.0 0
6 2.5 0
7 1.8 0
8 4.0 1
9 4.2 0
10 4.5 0
11 3.9 0
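If the lookahead window were larger, the three repeated shift lines could be folded into a loop; a minimal sketch assuming the same df (the window size here is a hypothetical parameter):
import numpy as np
window = 3
checks = [df['A'].shift(-i).between(df['A'] - 0.5, df['A'] + 0.5)
          for i in range(1, window + 1)]
df['result'] = np.logical_and.reduce(checks).astype(int)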
Rolling windows tend to be quite slow in pandas. One quick solution can be to generate a dataframe with the values of the windows per row:
df_temp = pd.concat([df['A'].shift(i) for i in range(-1, 2)], axis=1)
df_temp
A A A
0 1.2 1.1 NaN
1 1.3 1.2 1.1
2 1.9 1.3 1.2
3 2.0 1.9 1.3
4 1.0 2.0 1.9
5 2.5 1.0 2.0
6 1.8 2.5 1.0
7 4.0 1.8 2.5
8 4.2 4.0 1.8
9 4.5 4.2 4.0
10 3.9 4.5 4.2
11 NaN 3.9 4.5
Then you can check per row if the value is in the desired range:
df['result'] = df_temp.apply(lambda x: (x - x.iloc[0]).between(-0.5, 0.5), axis=1).all(axis=1).astype(int)
A result
0 1.1 0
1 1.2 1
2 1.3 0
3 1.9 0
4 2.0 0
5 1.0 0
6 2.5 0
7 1.8 0
8 4.0 0
9 4.2 1
10 4.5 0
11 3.9 0
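Note that range(-1, 2) builds columns for shift(-1), shift(0) and shift(1) (next, current and previous value), which is why this output differs from the answer above. A hedged sketch closer to the question's spec, comparing the next three rows to the current one:
ahead = pd.concat([df['A'].shift(-i) for i in range(1, 4)], axis=1)
df['result'] = ahead.sub(df['A'], axis=0).abs().le(0.5).all(axis=1).astype(int)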

Pandas astype int not removing decimal points from values

I tried converting the values in some columns of a DataFrame of floats to integers by using round then astype. However, the values still contained decimal places. What is wrong with my code?
nums = np.arange(1, 11)
arr = np.array(nums)
arr = arr.reshape((2, 5))
df = pd.DataFrame(arr)
df += 0.1
df
Original df:
0 1 2 3 4
0 1.1 2.1 3.1 4.1 5.1
1 6.1 7.1 8.1 9.1 10.1
Rounding then to int code:
df.iloc[:, 2:] = df.iloc[:, 2:].round()
df.iloc[:, 2:] = df.iloc[:, 2:].astype(int)
df
Output:
0 1 2 3 4
0 1.1 2.1 3.0 4.0 5.0
1 6.1 7.1 8.0 9.0 10.0
Expected output:
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
The problem is that the .iloc assignment writes the values into the existing float columns without changing their dtype. Assign by column labels instead:
l = df.columns[2:]
df[l] = df[l].astype(int)
df
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
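An equivalent sketch passes astype a column-to-dtype mapping instead of assigning through a label list:
df = df.astype({col: int for col in df.columns[2:]})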
One way to solve that is to use .convert_dtypes()
df.iloc[:, 2:] = df.iloc[:, 2:].round()
df = df.convert_dtypes()
print(df)
output:
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
It coerces each column of your dataframe to the best-fitting (nullable) dtype.
I had the same issue and was able to resolve it by converting the numbers to str and applying a lambda to cut off the trailing zeros.
df['converted'] = df['floats'].astype(str)

def cut_zeros(row):
    # Drop the trailing '.0' left over from the float-to-string conversion
    if row[-2:] == '.0':
        row = row[:-2]
    return row

df['converted'] = df.apply(lambda row: cut_zeros(row['converted']), axis=1)
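A vectorized sketch of the same idea, avoiding the row-wise apply (the 'floats' column is the same hypothetical column as above):
df['converted'] = df['floats'].astype(str).str.replace(r'\.0$', '', regex=True)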

Add rows to each group in a dataframe to match a range and fill NA with previous value or zero

I need to add missing days (as integers) between rows for each group and then fill missing values in the value column.
df = pd.DataFrame({'days':[0, 2, 3, 1, 3], 'group':['A', 'A', 'A', 'B', 'B'], 'value': [1.2, 2.3, 3.4, 0.2, 0.3]})
Input:
days group value
0 A 1.2
2 A 2.3
3 A 3.4
1 B 0.2
3 B 0.3
I'm stuck on the first step: adding rows for any day missing from the 0-3 range.
So far I have tried joining a dataframe on a Series repeated for each group, and reindexing the dataframe:
df = df.set_index('days')
df.reindex(pd.Series(range(4)))
ValueError: cannot reindex from a duplicate axis
Expected output:
cons_days days group value
0 0 A 1.2
1 NaN A 1.2
2 2 A 2.3
3 3 A 3.4
0 NaN B 0.0
1 1 B 0.2
2 NaN B 0.2
3 3 B 0.3
You can do it with pivot, then reindex:
df.pivot(*df.columns).reindex(pd.Series(range(4))).reset_index().melt('index')
Out[222]:
index group value
0 0 A 1.2
1 1 A NaN
2 2 A 2.3
3 3 A 3.4
4 0 B NaN
5 1 B 0.2
6 2 B NaN
7 3 B 0.3
Update
df.pivot(*df.columns).reindex(pd.Series(range(4))).ffill().fillna(0).reset_index().melt('index')
Out[226]:
index group value
0 0 A 1.2
1 1 A 1.2
2 2 A 2.3
3 3 A 3.4
4 0 B 0.0
5 1 B 0.2
6 2 B 0.2
7 3 B 0.3
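Note that recent pandas versions make pivot() keyword-only, so the positional df.pivot(*df.columns) call would need to be spelled out; a sketch of the updated chain:
out = (df.pivot(index='days', columns='group', values='value')
         .reindex(range(4)).ffill().fillna(0)
         .rename_axis('days').reset_index().melt('days'))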
Here's a solution using groupby:
df = (df.set_index('days')
.groupby('group')['value']
.apply(lambda x: x.reindex(range(0, x.index.max() + 1)))
.reset_index()
)
group days value
0 A 0 1.2
1 A 1 NaN
2 A 2 2.3
3 A 3 3.4
4 B 0 NaN
5 B 1 0.2
6 B 2 NaN
7 B 3 0.3
Update using @WeNYoBen's fill method:
df = (df.set_index('days')
.groupby('group')['value']
.apply(lambda x: x.reindex(range(0, x.index.max() + 1)).ffill().fillna(0))
.reset_index()
)
group days value
0 A 0 1.2
1 A 1 1.2
2 A 2 2.3
3 A 3 3.4
4 B 0 0.0
5 B 1 0.2
6 B 2 0.2
7 B 3 0.3
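Yet another sketch, assuming the original df from the question: build the full (group, day) grid up front, left-merge, then fill per group:
full = pd.MultiIndex.from_product(
    [df['group'].unique(), range(4)], names=['group', 'days']
).to_frame(index=False)
out = full.merge(df, how='left')
out['value'] = out.groupby('group')['value'].ffill().fillna(0)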

How to group data and create bins?

I have the following DataFrame df (a small extract is given):
time_diff avg_qty_per_day
1.450000 1.0
1.483333 1.0
1.500000 1.0
2.516667 1.0
2.533333 1.0
2.533333 1.5
3.633333 1.8
3.644567 5.0
How can I group it into bins in order to get the following result?
1 3
2 3.5
3 6.8
The size of a bin should be configurable.
I think you need cut:
bins = [-np.inf, 2, 3, np.inf]
labels=[1,2,3]
df = df['avg_qty_per_day'].groupby(pd.cut(df['time_diff'], bins=bins, labels=labels)).sum()
print (df)
time_diff
1 3.0
2 3.5
3 6.8
Name: avg_qty_per_day, dtype: float64
If you want to check the labels:
bins = [-np.inf, 2, 3, np.inf]
labels=[1,2,3]
df['label'] = pd.cut(df['time_diff'], bins=bins, labels=labels)
print (df)
time_diff avg_qty_per_day label
0 1.450000 1.0 1
1 1.483333 1.0 1
2 1.500000 1.0 1
3 2.516667 1.0 2
4 2.533333 1.0 2
5 2.533333 1.5 2
6 3.633333 1.8 3
7 3.644567 5.0 3
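Since the question asks for a configurable bin size, the hard-coded edges could be derived from a width parameter instead; a minimal sketch, assuming the original df and a hypothetical width of 1.0:
width = 1.0
edges = np.arange(np.floor(df['time_diff'].min()),
                  np.ceil(df['time_diff'].max()) + width, width)
out = df.groupby(pd.cut(df['time_diff'], bins=edges))['avg_qty_per_day'].sum()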

how to clip pandas dataframe column-wise?

I have
In [67]: a
Out[67]:
0 1 2
0 1 2 3
1 4 5 6
when I run
In [69]: a.clip(lower=[1.5,2.5,3.5],axis=1)
I got
ValueError: other must be the same shape as self when an ndarray
Is that expected?
I was expecting to get something like:
Out[72]:
0 1 2
0 1.5 2.5 3.5
1 4.0 5.0 6.0
Instead of a numpy array, you can use a Series so the labels are aligned:
df
Out:
A B
0 1 4
1 2 5
2 3 6
df.clip(lower=pd.Series({'A': 2.5, 'B': 4.5}), axis=1)
Out:
A B
0 2.5 4.5
1 2.5 5.0
2 3.0 6.0
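Applied to the frame from the question (a sketch; a's columns are the integer labels 0, 1, 2):
a.clip(lower=pd.Series({0: 1.5, 1: 2.5, 2: 3.5}), axis=1)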
From the docs: lower : float or array_like, default None
According to the API reference, you're supposed to pass a same-shaped array.
import numpy as np
import pandas as pd
...
print(df.shape)
(2, 3)
print(df.clip(lower=np.array([[n + 1.5 for n in range(df.shape[1])]
                              for _ in range(df.shape[0])]), axis=1))
     0    1    2
0  1.5  2.5  3.5
1  4.0  5.0  6.0
