pandas dataframe assign doesn't update the dataframe - python

I made a pandas DataFrame of the Iris dataset and I want to add 4 extra columns to it: SepalRatio, PetalRatio, SepalMultiplied and PetalMultiplied. I used the assign() method of the DataFrame to add these four columns, but the DataFrame remains the same.
My code to add the columns is:
iris.assign(SepalRatio = iris['SepalLengthCm'] / iris['SepalWidthCm']).assign(PetalRatio = iris['PetalLengthCm'] / iris['PetalWidthCm']).assign(SepalMultiplied = iris['SepalLengthCm'] * iris['SepalWidthCm']).assign(PetalMultiplied = iris['PetalLengthCm'] * iris['PetalWidthCm'])
When I run this in a Jupyter notebook the correct table is shown, but if I use a print statement the four columns are not there.
Output in Jupyter notebook:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species SepalRatio PetalRatio SepalMultiplied PetalMultiplied
0 1 5.1 3.5 1.4 0.2 Iris-setosa 1.457143 7.000000 17.85 0.28
1 2 4.9 3.0 1.4 0.2 Iris-setosa 1.633333 7.000000 14.70 0.28
2 3 4.7 3.2 1.3 0.2 Iris-setosa 1.468750 6.500000 15.04 0.26
3 4 4.6 3.1 1.5 0.2 Iris-setosa 1.483871 7.500000 14.26 0.30
4 5 5.0 3.6 1.4 0.2 Iris-setosa 1.388889 7.000000 18.00 0.28
5 6 5.4 3.9 1.7 0.4 Iris-setosa 1.384615 4.250000 21.06 0.68
6 7 4.6 3.4 1.4 0.3 Iris-setosa 1.352941 4.666667 15.64 0.42
7 8 5.0 3.4 1.5 0.2 Iris-setosa 1.470588 7.500000 17.00 0.30
8 9 4.4 2.9 1.4 0.2 Iris-setosa 1.517241 7.000000 12.76 0.28
9 10 4.9 3.1 1.5 0.1 Iris-setosa 1.580645 15.000000 15.19 0.15
Output after printing the DataFrame:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
5 6 5.4 3.9 1.7 0.4
6 7 4.6 3.4 1.4 0.3
7 8 5.0 3.4 1.5 0.2
8 9 4.4 2.9 1.4 0.2
9 10 4.9 3.1 1.5 0.1
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
5 Iris-setosa
6 Iris-setosa
7 Iris-setosa
8 Iris-setosa
9 Iris-setosa

You need to assign the output back to a variable, because assign() returns a new DataFrame instead of modifying the original:
iris = iris.assign(SepalRatio = iris['SepalLengthCm'] / iris['SepalWidthCm']).assign(PetalRatio = iris['PetalLengthCm'] / iris['PetalWidthCm']).assign(SepalMultiplied = iris['SepalLengthCm'] * iris['SepalWidthCm']).assign(PetalMultiplied = iris['PetalLengthCm'] * iris['PetalWidthCm'])
It is better to use a single assign call:
iris = iris.assign(SepalRatio = iris['SepalLengthCm'] / iris['SepalWidthCm'],
                   PetalRatio = iris['PetalLengthCm'] / iris['PetalWidthCm'],
                   SepalMultiplied = iris['SepalLengthCm'] * iris['SepalWidthCm'],
                   PetalMultiplied = iris['PetalLengthCm'] * iris['PetalWidthCm'])
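The difference between the notebook display and print can be reproduced on a tiny frame. The sketch below uses a made-up two-row DataFrame (the name df and its values are assumptions for illustration, not the real Iris data). A notebook shows the return value of the last expression, which is the new frame, while the original object stays unchanged until the result is bound to a name:

```python
import pandas as pd

# Hypothetical stand-in for the Iris frame (two rows only).
df = pd.DataFrame({"SepalLengthCm": [5.1, 4.9], "SepalWidthCm": [3.5, 3.0]})

# assign() returns a new DataFrame; the result is discarded here, so df is unchanged.
df.assign(SepalRatio=df["SepalLengthCm"] / df["SepalWidthCm"])
unchanged_columns = list(df.columns)

# Binding the result to a name keeps the new column.
df = df.assign(SepalRatio=df["SepalLengthCm"] / df["SepalWidthCm"])
updated_columns = list(df.columns)

print(unchanged_columns)
print(updated_columns)
```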

Related

How to filter data and then select next 10 rows

I am currently filtering my dataset on certain conditions, like this:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
data1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                     columns=iris['feature_names'] + ['target'])
# filter dataset
data1[(data1['sepal length (cm)'] > 4) | (data1['sepal width (cm)'] > 3)]
I also want to get the 10 rows that follow each row matched by the filter, and I am not sure how to start. For example, when a row with a length greater than 4 is found, I want to return that row together with the next 10 rows.
Please let me know how I can do this.
The dataset that you loaded has a sequential index starting at 0.
To get the 10 rows following a match, you are looking for rows whose index lies between the index of the most recent matching row and that index + 10.
However, the filter that you provided, (data1['sepal length (cm)'] > 4) | (data1['sepal width (cm)'] > 3), matches every row: the minimum sepal length is 4.3, so the first condition is always true (the minimum width is 2). So for this illustration I'll use the filter sepal length (cm) == 4.6 and take the next 5 rows instead of 10.
filt = data1['sepal length (cm)'] == 4.6
data1.loc[filt, 'sentinel'] = data1.index[filt]
data1.sentinel = data1.sentinel.ffill()
data1[(data1.index >= data1.sentinel) & (data1.index <= data1.sentinel + 5)]
This selects the 21 rows below:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target sentinel
3 4.6 3.1 1.5 0.2 0.0 3.0
4 5.0 3.6 1.4 0.2 0.0 3.0
5 5.4 3.9 1.7 0.4 0.0 3.0
6 4.6 3.4 1.4 0.3 0.0 6.0
7 5.0 3.4 1.5 0.2 0.0 6.0
8 4.4 2.9 1.4 0.2 0.0 6.0
9 4.9 3.1 1.5 0.1 0.0 6.0
10 5.4 3.7 1.5 0.2 0.0 6.0
11 4.8 3.4 1.6 0.2 0.0 6.0
22 4.6 3.6 1.0 0.2 0.0 22.0
23 5.1 3.3 1.7 0.5 0.0 22.0
24 4.8 3.4 1.9 0.2 0.0 22.0
25 5.0 3.0 1.6 0.2 0.0 22.0
26 5.0 3.4 1.6 0.4 0.0 22.0
27 5.2 3.5 1.5 0.2 0.0 22.0
47 4.6 3.2 1.4 0.2 0.0 47.0
48 5.3 3.7 1.5 0.2 0.0 47.0
49 5.0 3.3 1.4 0.2 0.0 47.0
50 7.0 3.2 4.7 1.4 1.0 47.0
51 6.4 3.2 4.5 1.5 1.0 47.0
52 6.9 3.1 4.9 1.5 1.0 47.0
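The same forward-fill idea can be tried without modifying the original frame. This is a minimal sketch on made-up data (the column name value, the window size n_following and the helper names are assumptions for illustration); it relies on the frame having a sequential integer index, as noted above:

```python
import pandas as pd

# Hypothetical data: rows matching the filter (value == 1) sit at index 2 and 7.
data = pd.DataFrame({"value": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]})
n_following = 2  # assumed window size for this sketch

match = data["value"] == 1

# Row positions as a Series, so they can be masked and forward-filled.
position = pd.Series(data.index, index=data.index)

# Index of the most recent matching row; NaN before the first match.
sentinel = position.where(match).ffill()

# Keep each matching row plus the n_following rows after it.
# Comparisons against NaN are False, so rows before the first match drop out.
window = data[(position >= sentinel) & (position <= sentinel + n_following)]
selected = window.index.tolist()
print(selected)
```

When two matches lie closer together than the window size, the later match resets the sentinel, so the windows merge rather than duplicate rows.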

pandas how to add column of group by running range

I have a dataframe:
A B
0 0.1
0.1 0.3
0.35 0.48
1.3 1.5
1.5 1.9
2.2 2.9
3.1 3.4
5.1 5.5
And I want to add a column that will be the rank of B after grouping into bins of width 1.5, so it will be:
A B T
0 0.1 0
0.1 0.3 0
0.35 0.48 0
1.3 1.5 0
1.5 1.9 1
2.2 2.9 1
3.1 3.4 2
5.1 5.5 3
What is the best way to do so?
Use pd.cut with pd.factorize:
df['T'] = pd.factorize(pd.cut(df.B, bins=np.arange(0, df.B.max() + 1.5, 1.5)))[0]
print (df)
A B T
0 0.00 0.10 0
1 0.10 0.30 0
2 0.35 0.48 0
3 1.30 1.50 0
4 1.50 1.90 1
5 2.20 2.90 1
6 3.10 3.40 2
7 5.10 5.50 3
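For reference, here is the same approach as a self-contained sketch, with the frame from the question rebuilt by hand and the intermediate steps named (width, edges and binned are names introduced here for illustration):

```python
import numpy as np
import pandas as pd

# The frame from the question, typed in by hand.
df = pd.DataFrame({
    "A": [0, 0.1, 0.35, 1.3, 1.5, 2.2, 3.1, 5.1],
    "B": [0.1, 0.3, 0.48, 1.5, 1.9, 2.9, 3.4, 5.5],
})

width = 1.5
# Bin edges 0, 1.5, 3.0, ... up to the first edge at or above max(B).
edges = np.arange(0, df["B"].max() + width, width)

# pd.cut labels each value with its right-closed interval, e.g. (0.0, 1.5].
binned = pd.cut(df["B"], bins=edges)

# pd.factorize numbers the intervals in order of first appearance.
codes, uniques = pd.factorize(binned)
df["T"] = codes
print(df)
```

Two caveats: the intervals are right-closed by default, so a B value of exactly 0 would fall outside the first bin unless include_lowest=True is passed to pd.cut, and factorize numbers only the bins that actually occur, so an empty bin does not leave a gap in T.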

Is there short Pandas method chain for assigning grouped nth value?

I want to add the nth value of each group as a column, without aggregating the rows.
The goal is to create a feature that can be tracked using both window functions and aggregation functions.
R:
library(tidyverse)
iris %>% arrange(Species, Sepal.Length) %>% group_by(Species) %>%
mutate(cs = cumsum(Sepal.Length), cs4th = cumsum(Sepal.Length)[4]) %>%
slice(c(1:4))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species cs cs4th
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 4.3 3 1.1 0.1 setosa 4.3 17.5
2 4.4 2.9 1.4 0.2 setosa 8.7 17.5
3 4.4 3 1.3 0.2 setosa 13.1 17.5
4 4.4 3.2 1.3 0.2 setosa 17.5 17.5
5 4.9 2.4 3.3 1 versicolor 4.9 20
6 5 2 3.5 1 versicolor 9.9 20
7 5 2.3 3.3 1 versicolor 14.9 20
8 5.1 2.5 3 1.1 versicolor 20 20
9 4.9 2.5 4.5 1.7 virginica 4.9 22
10 5.6 2.8 4.9 2 virginica 10.5 22
11 5.7 2.5 5 2 virginica 16.2 22
12 5.8 2.7 5.1 1.9 virginica 22 22
Python: Too long and verbose!
import numpy as np
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
iris.sort_values(['species','sepal_length']).assign(
index_species=lambda x: x.groupby('species').cumcount(),
cs=lambda x: x.groupby('species').sepal_length.cumsum(),
tmp=lambda x: np.where(x.index_species==3, x.cs, 0),
cs4th=lambda x: x.groupby('species').tmp.transform(sum)
).iloc[list(range(0,4))+list(range(50,54))+list(range(100,104))]
sepal_length sepal_width petal_length ... cs tmp cs4th
13 4.3 3.0 1.1 ... 4.3 0.0 17.5
8 4.4 2.9 1.4 ... 8.7 0.0 17.5
38 4.4 3.0 1.3 ... 13.1 0.0 17.5
42 4.4 3.2 1.3 ... 17.5 17.5 17.5
57 4.9 2.4 3.3 ... 4.9 0.0 20.0
60 5.0 2.0 3.5 ... 9.9 0.0 20.0
93 5.0 2.3 3.3 ... 14.9 0.0 20.0
98 5.1 2.5 3.0 ... 20.0 20.0 20.0
106 4.9 2.5 4.5 ... 4.9 0.0 22.0
121 5.6 2.8 4.9 ... 10.5 0.0 22.0
113 5.7 2.5 5.0 ... 16.2 0.0 22.0
101 5.8 2.7 5.1 ... 22.0 22.0 22.0
Python: my improved solution (not elegant; the groupby part could still be improved):
iris.sort_values(['species','sepal_length']).assign(
cs=lambda x: x.groupby('species').sepal_length.transform('cumsum'),
cs4th=lambda x: x.merge(
x.groupby('species', as_index=False).nth(3).loc[:,['species','cs']],on='species')
.iloc[:,-1]
)
This does not work:
iris.groupby('species').transform('nth(3)')
Here is an updated solution, using Pandas, which is still longer than what you will get with dplyr:
import seaborn as sns
import pandas as pd
iris = sns.load_dataset('iris')
iris['cs'] = (iris
.sort_values(['species','sepal_length'])
.groupby('species')['sepal_length']
.transform('cumsum'))
M = (iris
.sort_values(['species','cs'])
.groupby('species')['cs'])
groupby has an nth method that returns one row per group: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.nth.html
iris = (iris
.sort_values(['species','cs'])
.reset_index(drop=True)
.merge(M.nth(3), how='left', on='species')
.rename(columns={'cs_x':'cs',
'cs_y':'cs4th'})
)
iris.head()
sepal_length sepal_width petal_length petal_width species cs cs4th
0 4.3 3.0 1.1 0.1 setosa 4.3 17.5
1 4.4 2.9 1.4 0.2 setosa 8.7 17.5
2 4.4 3.0 1.3 0.2 setosa 13.1 17.5
3 4.4 3.2 1.3 0.2 setosa 17.5 17.5
4 4.5 2.3 1.3 0.3 setosa 22.0 17.5
Update: 16/04/2021 ... Below is a better way to achieve the OP's goal:
(iris
.sort_values(['species', 'sepal_length'])
.assign(cs = lambda df: df.groupby('species')
.sepal_length
.transform('cumsum'),
cs4th = lambda df: df.groupby('species')
.cs
.transform('nth', 3)
)
.groupby('species')
.head(4)
)
sepal_length sepal_width petal_length petal_width species cs cs4th
13 4.3 3.0 1.1 0.1 setosa 4.3 17.5
8 4.4 2.9 1.4 0.2 setosa 8.7 17.5
38 4.4 3.0 1.3 0.2 setosa 13.1 17.5
42 4.4 3.2 1.3 0.2 setosa 17.5 17.5
57 4.9 2.4 3.3 1.0 versicolor 4.9 20.0
60 5.0 2.0 3.5 1.0 versicolor 9.9 20.0
93 5.0 2.3 3.3 1.0 versicolor 14.9 20.0
98 5.1 2.5 3.0 1.1 versicolor 20.0 20.0
106 4.9 2.5 4.5 1.7 virginica 4.9 22.0
121 5.6 2.8 4.9 2.0 virginica 10.5 22.0
113 5.7 2.5 5.0 2.0 virginica 16.2 22.0
101 5.8 2.7 5.1 1.9 virginica 22.0 22.0
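If passing 'nth' to transform is not available in your pandas version, one possible variant is to broadcast the fourth value of each group with a lambda. This is a sketch on made-up data (the species labels a and b and the lengths are assumptions for illustration), not the code from the answer above:

```python
import pandas as pd

# Hypothetical data standing in for the Iris frame: two groups of five rows.
df = pd.DataFrame({
    "species": ["a"] * 5 + ["b"] * 5,
    "sepal_length": [5.0, 4.0, 4.5, 6.0, 5.5, 7.0, 6.5, 6.0, 8.0, 7.5],
})

result = (
    df.sort_values(["species", "sepal_length"])
      .assign(
          # Running total of sepal_length within each species.
          cs=lambda d: d.groupby("species")["sepal_length"].cumsum(),
          # Fourth running total of each group, broadcast to every row of the group.
          cs4th=lambda d: d.groupby("species")["cs"].transform(lambda s: s.iloc[3]),
      )
      .groupby("species")
      .head(4)
)
print(result)
```

The lambda raises an IndexError for any group with fewer than four rows, so such groups would need separate handling.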
You can now do this in Python as concisely as in R, using datar:
>>> from datar.datasets import iris
>>> from datar.all import f, arrange, group_by, mutate, cumsum, slice
>>>
>>> (iris >>
... arrange(f.Species, f.Sepal_Length) >>
... group_by(f.Species) >>
... mutate(cs=cumsum(f.Sepal_Length), cs4th=cumsum(f.Sepal_Length)[3]) >>
... slice(f[1:4]))
Sepal_Length Sepal_Width Petal_Length Petal_Width Species cs cs4th
0 4.3 3.0 1.1 0.1 setosa 4.3 17.5
1 4.4 2.9 1.4 0.2 setosa 8.7 17.5
2 4.4 3.0 1.3 0.2 setosa 13.1 17.5
3 4.4 3.2 1.3 0.2 setosa 17.5 17.5
4 4.9 2.4 3.3 1.0 versicolor 4.9 20.0
5 5.0 2.0 3.5 1.0 versicolor 9.9 20.0
6 5.0 2.3 3.3 1.0 versicolor 14.9 20.0
7 5.1 2.5 3.0 1.1 versicolor 20.0 20.0
8 4.9 2.5 4.5 1.7 virginica 4.9 22.0
9 5.6 2.8 4.9 2.0 virginica 10.5 22.0
10 5.7 2.5 5.0 2.0 virginica 16.2 22.0
11 5.8 2.7 5.1 1.9 virginica 22.0 22.0
[Groups: ['Species'] (n=3)]
I am the author of the package. Feel free to submit issues if you have any questions.

Change values of one column in pandas dataframe

How can I change the values of column 4 to 1 and -1, so that Iris-setosa is replaced with 1 and Iris-virginica is replaced with -1?
0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
.. ... ... ... ... ...
120 6.9 3.2 5.7 2.3 Iris-virginica
121 5.6 2.8 4.9 2.0 Iris-virginica
122 7.7 2.8 6.7 2.0 Iris-virginica
123 6.3 2.7 4.9 1.8 Iris-virginica
124 6.7 3.3 5.7 2.1 Iris-virginica
125 7.2 3.2 6.0 1.8 Iris-virginica
126 6.2 2.8 4.8 1.8 Iris-virginica
I would appreciate the help.
You can use replace with a dictionary and assign the result back to the column:
d = {'Iris-setosa': 1, 'Iris-virginica': -1}
df['4'] = df['4'].replace(d)
0 1 2 3 4
0 5.1 3.5 1.4 0.2 1
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 1
3 4.6 3.1 1.5 0.2 1
4 5.0 3.6 1.4 0.2 1
5 5.4 3.9 1.7 0.4 1
6 4.6 3.4 1.4 0.3 1
.. ... ... ... ... ...
120 6.9 3.2 5.7 2.3 -1
121 5.6 2.8 4.9 2.0 -1
122 7.7 2.8 6.7 2.0 -1
123 6.3 2.7 4.9 1.8 -1
124 6.7 3.3 5.7 2.1 -1
125 7.2 3.2 6.0 1.8 -1
126 6.2 2.8 4.8 1.8 -1
df.loc[df["4"]=="Iris-setosa","4"]=1
df.loc[df["4"]=="Iris-virginica","4"]=-1
I would do something like this:
def encode_row(row):
    # Map Iris-setosa to 1 and every other species to -1.
    if row[4] == "Iris-setosa":
        return 1
    return -1
df_test[4] = df_test.apply(encode_row, axis=1)
assuming that df_test is your data frame.
Sounds like
df['4'] = np.where(df['4'] == 'Iris-setosa', 1, -1)
should do the job
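Another option along the same lines is Series.map with a dictionary. The sketch below assumes the column label is the integer 4 (as it would be if the file was read without a header) and uses a made-up three-row frame:

```python
import pandas as pd

# Hypothetical three-row frame; the species column is labelled with the integer 4.
df = pd.DataFrame({4: ["Iris-setosa", "Iris-virginica", "Iris-setosa"]})

mapping = {"Iris-setosa": 1, "Iris-virginica": -1}

# map() looks each value up in the dictionary; values missing from it become NaN.
df[4] = df[4].map(mapping)
encoded = df[4].tolist()
print(encoded)
```

Unlike the np.where version, which turns every non-setosa value into -1, map leaves NaN for any species missing from the dictionary, which makes unexpected labels easy to spot.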

python, h2o.import_file() return empty line

I have a problem reading files in h2o.
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
h2o.init()
train = h2o.import_file("("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"")
splits = train.split_frame(ratios=[0.75], seed=1234)
dl = H2ODeepLearningEstimator(distribution="quantile",quantile_alpha=0.8)
dl.train(x=range(0,2), y="petal_len", training_frame=splits[0])
print(dl.predict(splits[1]))
UPDATE_1: The fourth line actually has this form (sorry, I copied it incorrectly from the IDE):
train = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
I got H2OTypeError: Argument x should be a None | integer | string | list(string | integer) | set(integer | string), got range range(0, 2).
This is due to the fact that "train" is empty.
In [23]: train
Out[23]:
I thought the problem might be with reading from the link, so I downloaded the file manually:
train = h2o.import_file("iris_wheader.csv")
But I got the same result.
In [26]: train
Out[26]:
I then opened the .csv with pandas, which worked and gave me a pandas DataFrame. I then used
train = h2o.H2OFrame(train)
and got an empty train again.
In [29]: train
Out[29]:
How can I solve this problem?
UPDATE_2: When I go to 127.0.0.1:54321/flow/index.html, it shows that the data frame has been loaded into the cluster. But in Python I get an empty train. I use the Spyder IDE with the IPython console; could that influence the result?
There is a problem with this line:
train = h2o.import_file("("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"")
You have extra " and (, it should be:
train = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
Then you'll see that train and also print(train) give output:
In [6]: train
Out[6]: sepal_len sepal_wid petal_len petal_wid class
----------- ----------- ----------- ----------- -----------
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5 3.4 1.5 0.2 Iris-setosa
4.4 2.9 1.4 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
[150 rows x 5 columns]
In [7]: train.nrow
Out[7]: 150
In [8]: print(train)
sepal_len sepal_wid petal_len petal_wid class
----------- ----------- ----------- ----------- -----------
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5 3.4 1.5 0.2 Iris-setosa
4.4 2.9 1.4 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
[150 rows x 5 columns]
