pandas: how to add a column grouping by a running range - python

I have a dataframe:
A B
0 0.1
0.1 0.3
0.35 0.48
1.3 1.5
1.5 1.9
2.2 2.9
3.1 3.4
5.1 5.5
And I want to add a column that will be the rank of B after grouping into bins of width 1.5, so it will be:
A B T
0 0.1 0
0.1 0.3 0
0.35 0.48 0
1.3 1.5 0
1.5 1.9 1
2.2 2.9 1
3.1 3.4 2
5.1 5.5 3
What is the best way to do so?

Use cut with factorize:
import numpy as np
import pandas as pd

df['T'] = pd.factorize(pd.cut(df.B, bins=np.arange(0, df.B.max() + 1.5, 1.5)))[0]
print (df)
A B T
0 0.00 0.10 0
1 0.10 0.30 0
2 0.35 0.48 0
3 1.30 1.50 0
4 1.50 1.90 1
5 2.20 2.90 1
6 3.10 3.40 2
7 5.10 5.50 3
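A note on the two building blocks: cut assigns each value to a bin interval, and factorize numbers those intervals by order of first appearance. If the bin position itself should be the label, the categorical codes from cut can be used directly; a minimal sketch, assuming the same df as above:

import numpy as np
import pandas as pd

# cat.codes labels each row with the positional index of its bin, so empty
# bins keep their slot; factorize instead numbers bins by first appearance.
# On the sample data above the two agree.
bins = np.arange(0, df.B.max() + 1.5, 1.5)
df['T'] = pd.cut(df.B, bins=bins).cat.codes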

Related

dataframe dropping all nan index and cells keeping original index value

I have data like the example below, with NaN index values and NaN cells:
id no name percentage score result
0 0 0.30 pencils 0.84 0.974185 1
1 1 0.18 computer 1.14 1.0 1
2 2 0.27 laptop 1.32 1.0 1
nan 0 0.18 apples 1.59 0.999655 4
4 1 0.84 vegetables 1.770008 0.99992 4
id no name percentage score result
0 0 nan chicken 0.84 0.974185 1
1 1 0.18 fish 1.14 . 1
2 2 0.27 meat 1.32 1.0 1
I want to keep the original index and drop all rows that have a NaN index, NaN cells, or a special character like '.', and also drop the repeated header row, as below:
id no name percentage score result
0 0 0.30 pencils 0.84 0.974185 1
1 1 0.18 computer 1.14 1.0 1
2 2 0.27 laptop 1.32 1.0 1
4 1 0.84 vegetables 1.770008 0.99992 4
2 2 0.27 meat 1.32 1.0 1
I tried, but I cannot keep the original index.
You can do this:
In [394]: df
Out[394]:
id no name percentage score result
0 0 0.30 pencils 0.84 0.974185 1.0
1 1 0.18 computer 1.14 1.0 1.0
2 2 0.27 laptop 1.32 1.0 1.0
NaN 0 0.18 apples 1.59 0.999655 4.0
4 1 0.84 vegetables 1.770008 0.99992 4.0
id no name percentage score result NaN
0 0 NaN chicken 0.84 0.974185 1.0
1 1 0.18 fish 1.14 . 1.0
2 2 0.27 meat 1.32 1.0 1.0
In [393]: df[~df.eq('.').any(axis=1) & ~df.index.isin(df.columns) & df.index.notna()].dropna()
Out[393]:
id no name percentage score result
0 0 0.30 pencils 0.84 0.974185 1.0
1 1 0.18 computer 1.14 1.0 1.0
2 2 0.27 laptop 1.32 1.0 1.0
4 1 0.84 vegetables 1.770008 0.99992 4.0
2 2 0.27 meat 1.32 1.0 1.0
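The one-liner packs three conditions together; a commented restatement of the same logic, assuming df is the frame shown above:

# Rows whose index value repeats a header label (e.g. the literal string 'id')
not_header = ~df.index.isin(df.columns)
# Rows whose index is a real value, not NaN
has_index = df.index.notna()
# Rows that contain no '.' placeholder in any cell
no_dots = ~df.eq('.').any(axis=1)
# Combine the masks, then drop the remaining rows with NaN cells
result = df[no_dots & not_header & has_index].dropna()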

Pandas creating new dataframe from several group by operations

I have a pandas dataframe
test = pd.DataFrame({'d': [1, 1, 1, 2, 2, 3, 3],
                     'id': [1, 2, 3, 1, 2, 2, 3],
                     'v1': [10, 20, 15, 35, 5, 10, 30],
                     'v2': [3, 4, 1, 6, 0, 2, 0],
                     'w1': [0.1, 0.3, 0.2, 0.1, 0.4, 0.3, 0.2],
                     'w2': [0.8, 0.1, 0.2, 0.3, 0.1, 0.1, 0.0]})
d id v1 v2 w1 w2
0 1 1 10 3 0.10 0.80
1 1 2 20 4 0.30 0.10
2 1 3 15 1 0.20 0.20
3 2 1 35 6 0.10 0.30
4 2 2 5 0 0.40 0.10
5 3 2 10 2 0.30 0.10
6 3 3 30 0 0.20 0.00
and I would like to get some weighted values by group like
test['w1v1'] = test['w1'] * test['v1']
test['w1v2'] = test['w1'] * test['v2']
test['w2v1'] = test['w2'] * test['v1']
test['w2v2'] = test['w2'] * test['v2']
How can I get the result nicely into a DataFrame? Something that looks like
test.groupby('id').sum()['w1v1'] / test.groupby('id').sum()['w1']
id
1 22.50
2 11.00
3 22.50
but includes columns for each weighted value, so like
id w1v1 w1v2 w2v1 w2v2
1 22.50 ... ... ...
2 11.00 ... ... ...
3 22.50 ... ... ...
Any ideas how I can achieve this quickly and easily?
Use:
cols = ['w1v1','w2v1','w1v2','w2v2']
test1 = (test[['w1', 'w2', 'w1', 'w2']] * test[['v1', 'v1', 'v2', 'v2']].values)
test1.columns = cols
print (test1)
w1v1 w2v1 w1v2 w2v2
0 1.0 8.0 0.3 2.4
1 6.0 2.0 1.2 0.4
2 3.0 3.0 0.2 0.2
3 3.5 10.5 0.6 1.8
4 2.0 0.5 0.0 0.0
5 3.0 1.0 0.6 0.2
6 6.0 0.0 0.0 0.0
df = test.join(test1).groupby('id').sum()
df1 = df[cols] / df[['w1', 'w2', 'w1', 'w2']].values
print (df1)
w1v1 w2v1 w1v2 w2v2
id
1 22.5 16.818182 4.5 3.818182
2 11.0 11.666667 1.8 2.000000
3 22.5 15.000000 0.5 1.000000
Another more dynamic solution with MultiIndex DataFrames:
a = ['v1', 'v2']
b = ['w1', 'w2']
mux = pd.MultiIndex.from_product([a,b])
df1 = test.set_index('id').drop('d', axis=1)
v = df1.reindex(columns=mux, level=0)
w = df1.reindex(columns=mux, level=1)
print (v)
v1 v2
w1 w2 w1 w2
id
1 10 10 3 3
2 20 20 4 4
3 15 15 1 1
1 35 35 6 6
2 5 5 0 0
2 10 10 2 2
3 30 30 0 0
print (w)
v1 v2
w1 w2 w1 w2
id
1 0.1 0.8 0.1 0.8
2 0.3 0.1 0.3 0.1
3 0.2 0.2 0.2 0.2
1 0.1 0.3 0.1 0.3
2 0.4 0.1 0.4 0.1
2 0.3 0.1 0.3 0.1
3 0.2 0.0 0.2 0.0
df = w * v
print (df)
v1 v2
w1 w2 w1 w2
id
1 1.0 8.0 0.3 2.4
2 6.0 2.0 1.2 0.4
3 3.0 3.0 0.2 0.2
1 3.5 10.5 0.6 1.8
2 2.0 0.5 0.0 0.0
2 3.0 1.0 0.6 0.2
3 6.0 0.0 0.0 0.0
df1 = df.groupby('id').sum() / w.groupby('id').sum()
#flatten MultiIndex columns
df1.columns = ['{0[1]}{0[0]}'.format(x) for x in df1.columns]
print (df1)
w1v1 w2v1 w1v2 w2v2
id
1 22.5 16.818182 4.5 3.818182
2 11.0 11.666667 1.8 2.000000
3 22.5 15.000000 0.5 1.000000
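The trick in this solution is the level-based reindex: matching a flat set of columns against one level of a MultiIndex broadcasts each matching column across the other level. A toy illustration (hypothetical two-column frame):

import pandas as pd

toy = pd.DataFrame({'v1': [1, 2], 'w1': [0.5, 0.5]})
mux = pd.MultiIndex.from_product([['v1'], ['w1', 'w2']])
# 'v1' matches level 0 of mux, so its values are repeated under
# (v1, w1) and (v1, w2); 'w1' matches nothing at level 0 and is dropped.
print (toy.reindex(columns=mux, level=0))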
If you can take multi index columns, you can use groupby + dot:
test.groupby('id').apply(
    lambda g: g.filter(like='v').T.dot(g.filter(like='w') / g.filter(like='w').sum()).stack()
)
# v1 v2
# w1 w2 w1 w2
#id
#1 22.5 16.818182 4.5 3.818182
#2 11.0 11.666667 1.8 2.000000
#3 22.5 15.000000 0.5 1.000000
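If flat column names are preferred here as well, the MultiIndex columns of the groupby + dot result can be joined afterwards; a sketch, assuming res holds the output above:

res = test.groupby('id').apply(
    lambda g: g.filter(like='v').T.dot(g.filter(like='w') / g.filter(like='w').sum()).stack()
)
# res.columns is a MultiIndex of (value, weight) tuples such as ('v1', 'w1');
# join each pair as weight+value to match the w1v1 naming used above.
res.columns = ['{}{}'.format(w, v) for v, w in res.columns]
print (res)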

pandas dataframe assign doesn't update the dataframe

I made a pandas DataFrame of the Iris dataset and I want to put four extra columns in it: SepalRatio, PetalRatio, SepalMultiplied, and PetalMultiplied. I used the assign() function of the DataFrame to add these four columns, but the DataFrame remains the same.
My code to add the columns is:
iris.assign(SepalRatio = iris['SepalLengthCm'] / iris['SepalWidthCm']).assign(PetalRatio = iris['PetalLengthCm'] / iris['PetalWidthCm']).assign(SepalMultiplied = iris['SepalLengthCm'] * iris['SepalWidthCm']).assign(PetalMultiplied = iris['PetalLengthCm'] * iris['PetalWidthCm'])
When executing in a Jupyter notebook a correct table is shown, but if I use the print statement the four columns aren't added.
Output in Jupyter notebook :
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species SepalRatio PetalRatio SepalMultiplied PetalMultiplied
0 1 5.1 3.5 1.4 0.2 Iris-setosa 1.457143 7.000000 17.85 0.28
1 2 4.9 3.0 1.4 0.2 Iris-setosa 1.633333 7.000000 14.70 0.28
2 3 4.7 3.2 1.3 0.2 Iris-setosa 1.468750 6.500000 15.04 0.26
3 4 4.6 3.1 1.5 0.2 Iris-setosa 1.483871 7.500000 14.26 0.30
4 5 5.0 3.6 1.4 0.2 Iris-setosa 1.388889 7.000000 18.00 0.28
5 6 5.4 3.9 1.7 0.4 Iris-setosa 1.384615 4.250000 21.06 0.68
6 7 4.6 3.4 1.4 0.3 Iris-setosa 1.352941 4.666667 15.64 0.42
7 8 5.0 3.4 1.5 0.2 Iris-setosa 1.470588 7.500000 17.00 0.30
8 9 4.4 2.9 1.4 0.2 Iris-setosa 1.517241 7.000000 12.76 0.28
9 10 4.9 3.1 1.5 0.1 Iris-setosa 1.580645 15.000000 15.19 0.15
Output after printing the dataframe:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
5 6 5.4 3.9 1.7 0.4
6 7 4.6 3.4 1.4 0.3
7 8 5.0 3.4 1.5 0.2
8 9 4.4 2.9 1.4 0.2
9 10 4.9 3.1 1.5 0.1
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
5 Iris-setosa
6 Iris-setosa
7 Iris-setosa
8 Iris-setosa
9 Iris-setosa
assign returns a new DataFrame and does not modify iris in place; Jupyter just displays that returned value. You need to assign the output back to a variable, like:
iris = iris.assign(SepalRatio = iris['SepalLengthCm'] / iris['SepalWidthCm']).assign(PetalRatio = iris['PetalLengthCm'] / iris['PetalWidthCm']).assign(SepalMultiplied = iris['SepalLengthCm'] * iris['SepalWidthCm']).assign(PetalMultiplied = iris['PetalLengthCm'] * iris['PetalWidthCm'])
Better is to use only one assign call:
iris = iris.assign(SepalRatio = iris['SepalLengthCm'] / iris['SepalWidthCm'],
                   PetalRatio = iris['PetalLengthCm'] / iris['PetalWidthCm'],
                   SepalMultiplied = iris['SepalLengthCm'] * iris['SepalWidthCm'],
                   PetalMultiplied = iris['PetalLengthCm'] * iris['PetalWidthCm'])
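As a variant, assign also accepts callables; in recent pandas versions keyword arguments are evaluated in order, so later columns may even refer to columns created earlier in the same call. A sketch of the same four columns:

iris = iris.assign(
    SepalRatio=lambda d: d['SepalLengthCm'] / d['SepalWidthCm'],
    PetalRatio=lambda d: d['PetalLengthCm'] / d['PetalWidthCm'],
    SepalMultiplied=lambda d: d['SepalLengthCm'] * d['SepalWidthCm'],
    PetalMultiplied=lambda d: d['PetalLengthCm'] * d['PetalWidthCm'],
)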

Get max value from row of a dataframe in python [duplicate]

This question already has answers here:
Find the max of two or more columns with pandas
(3 answers)
How to select max and min value in a row for selected columns
(2 answers)
Closed 5 years ago.
This is my dataframe df
a b c
1.2 2 0.1
2.1 1.1 3.2
0.2 1.9 8.8
3.3 7.8 0.12
I'm trying to get the max value from each row of a dataframe, and I'm expecting output like this:
max_value
2
3.2
8.8
7.8
This is what I have tried:
df[len(df.columns)].argmax()
I'm not getting the proper output; any help would be much appreciated. Thanks.
Use max with axis=1:
df = df.max(axis=1)
print (df)
0 2.0
1 3.2
2 8.8
3 7.8
dtype: float64
And if need new column:
df['max_value'] = df.max(axis=1)
print (df)
a b c max_value
0 1.2 2.0 0.10 2.0
1 2.1 1.1 3.20 3.2
2 0.2 1.9 8.80 8.8
3 3.3 7.8 0.12 7.8
You could use numpy:
df.assign(max_value=df.values.max(1))
a b c max_value
0 1.2 2.0 0.10 2.0
1 2.1 1.1 3.20 3.2
2 0.2 1.9 8.80 8.8
3 3.3 7.8 0.12 7.8
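If you also need to know which column holds the row-wise maximum, idxmax works the same way along axis=1; a small sketch, assuming the original a/b/c columns:

# Name of the column containing each row's maximum
df['max_column'] = df[['a', 'b', 'c']].idxmax(axis=1)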

Why isn't this code to plot a histogram on a continuous value Pandas column working?

I am trying to create a histogram of the continuous-value column Trip_distance in a large 1.4M-row pandas dataframe. I wrote the following code:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(17,10))
trip_data.hist(column="Trip_distance")
plt.xlabel("Trip_distance",fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.xlim([0.0,100.0])
#plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
But I am not sure why all values give the same frequency plot, which shouldn't be the case. What's wrong with the code?
Test data:
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 0.00 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2.0
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 0.00 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2.0
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 0.59 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1.0
3 2 2015-09-01 00:02:36 2015-09-01 00:06:42 N 1 -73.921387 40.766678 -73.931427 40.771584 1 0.74 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
4 2 2015-09-01 00:00:14 2015-09-01 00:04:20 N 1 -73.955482 40.714046 -73.944412 40.714729 1 0.61 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
5 2 2015-09-01 00:00:39 2015-09-01 00:05:20 N 1 -73.945297 40.808186 -73.937668 40.821198 1 1.07 5.5 0.5 0.5 1.36 0.0 NaN 0.3 8.16 1 1.0
6 2 2015-09-01 00:00:52 2015-09-01 00:05:50 N 1 -73.890877 40.746426 -73.876923 40.756306 1 1.43 6.5 0.5 0.5 0.00 0.0 NaN 0.3 7.80 1 1.0
7 2 2015-09-01 00:02:15 2015-09-01 00:05:34 N 1 -73.946701 40.797321 -73.937645 40.804516 1 0.90 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
8 2 2015-09-01 00:02:36 2015-09-01 00:07:20 N 1 -73.963150 40.693829 -73.956787 40.680531 1 1.33 6.0 0.5 0.5 1.46 0.0 NaN 0.3 8.76 1 1.0
9 2 2015-09-01 00:02:13 2015-09-01 00:07:23 N 1 -73.896820 40.746128 -73.888626 40.752724 1 0.84 5.5 0.5 0.5 0.00 0.0 NaN 0.3 6.80 2 1.0
Trip_distance column
0 0.00
1 0.00
2 0.59
3 0.74
4 0.61
5 1.07
6 1.43
7 0.90
8 1.33
9 0.84
10 0.80
11 0.70
12 1.01
13 0.39
14 0.56
Name: Trip_distance, dtype: float64
After 100 bins: (plot omitted)
EDIT:
After your comments this actually makes perfect sense: you don't get a visibly different bar for each bucket because there are 1.4 million rows and only ten discrete buckets, so apparently each bucket holds roughly 10% of the data (to within what you can see in the plot).
A quick rerun of your data:
In [25]: df.hist(column='Trip_distance')
Prints out absolutely fine.
The df.hist function comes with an optional keyword argument bins=10, which buckets the data into discrete bins. With only 10 discrete bins and a more or less homogeneous distribution over hundreds of thousands of rows, you might not be able to see the differences between the ten bins in your low-resolution plot:
In [34]: df.hist(column='Trip_distance', bins=50)
Here's another way to plot the data. It involves turning the datetime into an index, which might also help you with future slicing:
import pandas as pd
import matplotlib.pyplot as plt

#convert the column to datetime
trip_data['lpep_pickup_datetime'] = pd.to_datetime(trip_data['lpep_pickup_datetime'])
#turn the datetime to an index
trip_data.index = trip_data['lpep_pickup_datetime']
#Plot
trip_data['Trip_distance'].plot(kind='hist')
plt.show()
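For a heavily skewed column like Trip_distance, increasing the bin count and clipping the x-range usually reveals the shape near zero; a sketch, where the (0, 20) cutoff is just an assumed example and both keywords are forwarded to matplotlib's hist:

import matplotlib.pyplot as plt

# More bins plus an explicit range make the mass near zero visible
trip_data['Trip_distance'].hist(bins=100, range=(0, 20))
plt.xlabel('Trip_distance')
plt.ylabel('Frequency')
plt.show()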
