I have
In [67]: a
Out[67]:
0 1 2
0 1 2 3
1 4 5 6
when I run
In [69]: a.clip(lower=[1.5,2.5,3.5],axis=1)
I get:
ValueError: other must be the same shape as self when an ndarray
Is that expected?
I was expecting to get something like:
Out[72]:
0 1 2
0 1.5 2.5 3.5
1 4.0 5.0 6.0
Instead of a numpy array, you can use a Series so the labels are aligned:
df
Out:
A B
0 1 4
1 2 5
2 3 6
df.clip(lower=pd.Series({'A': 2.5, 'B': 4.5}), axis=1)
Out:
A B
0 2.5 4.5
1 2.5 5.0
2 3.0 6.0
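Applied to the frame from the question (integer column labels), the same idea would look roughly like this:
import pandas as pd

a = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
# Wrap the per-column lower bounds in a Series so they align with the column labels
a.clip(lower=pd.Series([1.5, 2.5, 3.5], index=a.columns), axis=1)
# -> matches the expected output shown in the question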
lower : float or array_like, default None
According to the API reference, you're supposed to pass an array with the same shape as the DataFrame.
import numpy as np
import pandas as pd
...
print df.shape
(2, 3)
print df.clip(lower=np.array([[n+1.5 for n in range(df.shape[1])] for _ in range(df.shape[0])]), axis=1)
0 1 2
0 1.5 2.5 3.5
1 4.0 5.0 6.0
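If you don't want to build the nested list comprehension by hand, a shorter way to get a same-shaped array of bounds is np.tile (a sketch, not part of the original answer; it assumes the same df as above):
lower = np.tile([1.5, 2.5, 3.5], (df.shape[0], 1))  # repeat the row of bounds for every row of df
print(df.clip(lower=lower, axis=1))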
I want to add column values vertically, from top to bottom.
def add(x,y):
    return x,y
df = pd.DataFrame({'A':[1,2,3,4,5]})
df['add'] = df.apply(lambda row : add(row['A'], axis = 1)
I tried using apply but it's not working.
The desired output is basically adding consecutive values of column A (1+2, 2+3, and so on):
A add
0 1 1
1 2 3
2 3 5
3 4 7
4 5 9
You can apply a rolling sum over a moving window of size 2:
df.A.rolling(2, min_periods=1).sum()
0 1.0
1 3.0
2 5.0
3 7.0
4 9.0
Name: A, dtype: float64
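To get the column shown in the desired output, you could assign this back and, since there are no missing values here, cast it to int (a small addition, not part of the original answer):
df['add'] = df.A.rolling(2, min_periods=1).sum().astype(int)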
Try this instead:
>>> df['add'] = (df + df.shift()).fillna(df)['A']
>>> df
A add
0 1 1.0
1 2 3.0
2 3 5.0
3 4 7.0
4 5 9.0
>>>
I tried converting the values in some columns of a DataFrame of floats to integers by using round then astype. However, the values still contained decimal places. What is wrong with my code?
nums = np.arange(1, 11)
arr = np.array(nums)
arr = arr.reshape((2, 5))
df = pd.DataFrame(arr)
df += 0.1
df
Original df:
0 1 2 3 4
0 1.1 2.1 3.1 4.1 5.1
1 6.1 7.1 8.1 9.1 10.1
Rounding and then converting to int:
df.iloc[:, 2:] = df.iloc[:, 2:].round()
df.iloc[:, 2:] = df.iloc[:, 2:].astype(int)
df
Output:
0 1 2 3 4
0 1.1 2.1 3.0 4.0 5.0
1 6.1 7.1 8.0 9.0 10.0
Expected output:
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
The problem is that assigning with .iloc sets the values but does not change the column dtype. Select the columns by label instead and reassign:
l = df.columns[2:]
df[l] = df[l].astype(int)
df
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
One way to solve that is to use .convert_dtypes()
df.iloc[:, 2:] = df.iloc[:, 2:].round()
df = df.convert_dtypes()
print(df)
output:
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
It coerces all columns of your DataFrame to the best-fitting dtypes.
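You can verify the effect with df.dtypes (a quick check, not part of the original answer; exact dtype names depend on the pandas version):
# Columns 2-4 should now be a nullable integer dtype (Int64);
# columns 0 and 1 remain floating point.
print(df.dtypes)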
I had the same issue and was able to resolve it by converting the numbers to str and applying a lambda to cut off the trailing '.0'.
df['converted'] = df['floats'].astype(str)
def cut_zeros(row):
    if row[-2:] == '.0':
        row = row[:-2]
    return row
df['converted'] = df.apply(lambda row: cut_zeros(row['converted']),axis=1)
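A vectorized alternative to the row-wise apply (my suggestion, not part of the answer above) does the same trimming with pandas string methods:
# Strip a trailing '.0' in one vectorized call
df['converted'] = df['floats'].astype(str).str.replace(r'\.0$', '', regex=True)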
I have this dataframe.
from pandas import DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['A','D','M','T','B','C','D','E','A','L'],
'id': [1,1,1,2,2,3,3,3,3,5],
'rate': [3.5,4.5,2.0,5.0,4.0,1.5,2.0,2.0,1.0,5.0]})
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 C 3 1.5
6 D 3 2.0
7 E 3 2.0
8 A 3 1.0
9 L 5 5.0
df = df.groupby('id')['rate'].mean()
What I want is this:
1) Find the mean rate of every 'id'.
2) Give the number of ids whose mean is >= 3.
3) Give back all rows of the DataFrame whose id has a mean >= 3.
Expected output:
Number of ids (length) where mean >= 3: 3
>> dataframe where (mean(id) >=3)
>>df
name id rate
0 A 1 3.0
1 D 1 4.0
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 L 5 5.0
Use GroupBy.transform to get per-group means broadcast to the same length as the original DataFrame, so you can filter with boolean indexing:
df = df[df.groupby('id')['rate'].transform('mean') >=3]
print (df)
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Detail:
print (df.groupby('id')['rate'].transform('mean'))
0 3.333333
1 3.333333
2 3.333333
3 4.500000
4 4.500000
5 1.625000
6 1.625000
7 1.625000
8 1.625000
9 5.000000
Name: rate, dtype: float64
Alternative solution with DataFrameGroupBy.filter:
df = df.groupby('id').filter(lambda x: x['rate'].mean() >=3)
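The question also asks for the number of ids whose mean is >= 3. Computed on the original DataFrame (before the filtering above), a small sketch:
means = df.groupby('id')['rate'].mean()
print('Number of ids (length) where mean >= 3:', (means >= 3).sum())  # 3 -> ids 1, 2 and 5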
After merging two data frames:
output = pd.merge(df1, df2, on='ID', how='outer')
I have a data frame like this:
index x y z
0 2 NaN 3
0 NaN 3 3
1 2 NaN 4
1 NaN 3 4
...
How to merge rows with the same index?
Expected output:
index x y z
0 2 3 3
1 2 3 4
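For reference, the frame above can be reproduced like this (assuming 'index' is an ordinary column, as the groupby calls in the answers suggest):
import numpy as np
import pandas as pd

output = pd.DataFrame({'index': [0, 0, 1, 1],
                       'x': [2, np.nan, 2, np.nan],
                       'y': [np.nan, 3, np.nan, 3],
                       'z': [3, 3, 4, 4]})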
Perhaps you could take the mean of them:
In [418]: output.groupby('index', as_index=False).mean()
Out[418]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
We can group the DataFrame by 'index' and then take the first values with .first(), the minimum with .min(), etc., depending on the case. What do you want to get if the values in z differ?
In [28]: gr = df.groupby('index', as_index=False)
In [29]: gr.first()
Out[29]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
In [30]: gr.max()
Out[30]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
In [31]: gr.min()
Out[31]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
In [32]: gr.mean()
Out[32]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
I have a df that looks like
A B
1.2 1
1.3 1
1.1 1
1.0 0
1.0 0
1.5 1
1.6 1
0.7 1
1.1 0
Is there any function or method to calculate the cumulative sum piece by piece? I mean, for every run of consecutive B values equal to 1, calculate the cumsum. In the above example it should be:
A B C
1.2 1 1.2
1.3 1 2.5
1.1 1 3.6
1.0 0 0
1.0 0 0
1.5 1 1.5
1.6 1 3.1
0.7 1 3.8
1.1 0 0
many thanks,
from io import StringIO
import pandas as pd
import numpy as np
text = """a b
1.2 1
1.3 1
1.1 1
1.0 0
1.0 0
1.5 1
1.6 1
0.7 1
1.1 0"""
df = pd.read_csv(StringIO(text), delim_whitespace=True)
c = df["a"].cumsum()
mask = ~df["b"].astype(bool)
s = pd.Series(np.nan, index=df.index)
s[mask] = c[mask]
c -= s.ffill().fillna(0)
print(c)
output:
0 1.2
1 2.5
2 3.6
3 0.0
4 0.0
5 1.5
6 3.1
7 3.8
8 0.0
dtype: float64
Another approach (which may be slightly more general) is to groupby the consecutive entries in B.
First we enumerate the groups:
In [11]: (df.B != df.B.shift())
Out[11]:
0 True
1 False
2 False
3 True
4 False
5 True
6 False
7 False
8 True
Name: B, dtype: bool
In [12]: enumerate_B_changes = (df.B != df.B.shift()).astype(int).cumsum()
In [13]: enumerate_B_changes
Out[13]:
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 3
8 4
dtype: int64
And then we can groupby this Series, and cumsum:
In [14]: df.groupby(enumerate_B_changes)['A'].cumsum()
Out[14]:
0 1.2
1 2.5
2 3.6
3 1.0
4 2.0
5 1.5
6 3.1
7 3.8
8 1.1
dtype: float64
However we have to multiply by df['B'] in this case to account for 0s in column B.
In [15]: df.groupby(enumerate_B_changes)['A'].cumsum() * df['B']
Out[15]:
0 1.2
1 2.5
2 3.6
3 0.0
4 0.0
5 1.5
6 3.1
7 3.8
8 0.0
dtype: float64
If we wanted a different operation for entries that are neither 0 nor 1, we could do something different here.
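Putting it together, the new column C from the question can be assigned in one step using the same grouping:
groups = (df.B != df.B.shift()).astype(int).cumsum()
df['C'] = df.groupby(groups)['A'].cumsum() * df['B']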
I'm not super versed in NumPy; however, the code below should help.
It goes through the rows and, if b is 1, keeps adding to the cumulative sum; otherwise it resets it.
df = [
(1.2, 1),
(1.3, 1),
(1.1, 1),
(1.0, 0),
(1.0, 0),
(1.5, 1),
(1.6, 1),
(0.7, 1),
(1.1, 0)]
c = []
cumsum = 0
for a, b in df:
    if b == 1:
        cumsum += a
        c.append(cumsum)
    else:
        cumsum = 0
        c.append(0)
print c
And it outputs (with floating-point representation noise that you wouldn't see in NumPy's or pandas' display):
[1.2, 2.5, 3.6000000000000001, 0, 0, 1.5, 3.1000000000000001, 3.7999999999999998, 0]