pandas: how to add a column grouping by a running range - python

I have a dataframe:
A B
0 0.1
0.1 0.3
0.35 0.48
1.3 1.5
1.5 1.9
2.2 2.9
3.1 3.4
5.1 5.5
And I want to add a column that will be the rank of B after grouping into bins of width 1.5, so it will be:
A B T
0 0.1 0
0.1 0.3 0
0.35 0.48 0
1.3 1.5 0
1.5 1.9 1
2.2 2.9 1
3.1 3.4 2
5.1 5.5 3
What is the best way to do so?

Use cut with factorize:
import numpy as np
import pandas as pd

df['T'] = pd.factorize(pd.cut(df.B, bins=np.arange(0, df.B.max() + 1.5, 1.5)))[0]
print (df)
A B T
0 0.00 0.10 0
1 0.10 0.30 0
2 0.35 0.48 0
3 1.30 1.50 0
4 1.50 1.90 1
5 2.20 2.90 1
6 3.10 3.40 2
7 5.10 5.50 3
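A note on the two building blocks: cut assigns each value to a bin interval, and factorize numbers those intervals by order of first appearance. If the bin position itself should be the label, the categorical codes from cut can be used directly; a minimal sketch, assuming the same df as above:

import numpy as np
import pandas as pd

# cat.codes labels each row with the positional index of its bin, so empty
# bins keep their slot; factorize instead numbers bins by first appearance.
# On the sample data above the two agree.
bins = np.arange(0, df.B.max() + 1.5, 1.5)
df['T'] = pd.cut(df.B, bins=bins).cat.codes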

Related

dataframe dropping all nan index and cells keeping original index value

I have data like the example below, with NaN index values and NaN cells:
id no name percentage score result
0 0 0.30 pencils 0.84 0.974185 1
1 1 0.18 computer 1.14 1.0 1
2 2 0.27 laptop 1.32 1.0 1
nan 0 0.18 apples 1.59 0.999655 4
4 1 0.84 vegetables 1.770008 0.99992 4
id no name percentage score result
0 0 nan chicken 0.84 0.974185 1
1 1 0.18 fish 1.14 . 1
2 2 0.27 meat 1.32 1.0 1
I want to keep the original index and drop all rows that have a NaN index, NaN cells, or a special character like '.', and also drop the repeated header row, as below:
id no name percentage score result
0 0 0.30 pencils 0.84 0.974185 1
1 1 0.18 computer 1.14 1.0 1
2 2 0.27 laptop 1.32 1.0 1
4 1 0.84 vegetables 1.770008 0.99992 4
2 2 0.27 meat 1.32 1.0 1
I tried, but I cannot keep the original index.
You can do this:
In [394]: df
Out[394]:
id no name percentage score result
0 0 0.30 pencils 0.84 0.974185 1.0
1 1 0.18 computer 1.14 1.0 1.0
2 2 0.27 laptop 1.32 1.0 1.0
NaN 0 0.18 apples 1.59 0.999655 4.0
4 1 0.84 vegetables 1.770008 0.99992 4.0
id no name percentage score result NaN
0 0 NaN chicken 0.84 0.974185 1.0
1 1 0.18 fish 1.14 . 1.0
2 2 0.27 meat 1.32 1.0 1.0
In [393]: df[~df.eq('.').any(axis=1) & ~df.index.isin(df.columns) & df.index.notna()].dropna()
Out[393]:
id no name percentage score result
0 0 0.30 pencils 0.84 0.974185 1.0
1 1 0.18 computer 1.14 1.0 1.0
2 2 0.27 laptop 1.32 1.0 1.0
4 1 0.84 vegetables 1.770008 0.99992 4.0
2 2 0.27 meat 1.32 1.0 1.0
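The one-liner packs three conditions together; a commented restatement of the same logic, assuming df is the frame shown above:

# Rows whose index value repeats a header label (e.g. the literal string 'id')
not_header = ~df.index.isin(df.columns)
# Rows whose index is a real value, not NaN
has_index = df.index.notna()
# Rows that contain no '.' placeholder in any cell
no_dots = ~df.eq('.').any(axis=1)
# Combine the masks, then drop the remaining rows with NaN cells
result = df[no_dots & not_header & has_index].dropna()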

Pandas creating new dataframe from several group by operations

I have a pandas dataframe
test = pd.DataFrame({'d': [1, 1, 1, 2, 2, 3, 3],
                     'id': [1, 2, 3, 1, 2, 2, 3],
                     'v1': [10, 20, 15, 35, 5, 10, 30],
                     'v2': [3, 4, 1, 6, 0, 2, 0],
                     'w1': [0.1, 0.3, 0.2, 0.1, 0.4, 0.3, 0.2],
                     'w2': [0.8, 0.1, 0.2, 0.3, 0.1, 0.1, 0.0]})
d id v1 v2 w1 w2
0 1 1 10 3 0.10 0.80
1 1 2 20 4 0.30 0.10
2 1 3 15 1 0.20 0.20
3 2 1 35 6 0.10 0.30
4 2 2 5 0 0.40 0.10
5 3 2 10 2 0.30 0.10
6 3 3 30 0 0.20 0.00
and I would like to get some weighted values by group like
test['w1v1'] = test['w1'] * test['v1']
test['w1v2'] = test['w1'] * test['v2']
test['w2v1'] = test['w2'] * test['v1']
test['w2v2'] = test['w2'] * test['v2']
How can I get the result nicely into a DataFrame? Something that looks like
test.groupby('id').sum()['w1v1'] / test.groupby('id').sum()['w1']
id
1 22.50
2 11.00
3 22.50
but includes columns for each weighted value, so like
id w1v1 w1v2 w2v1 w2v2
1 22.50 ... ... ...
2 11.00 ... ... ...
3 22.50 ... ... ...
Any ideas how I can achieve this quickly and easily?
Use:
cols = ['w1v1','w2v1','w1v2','w2v2']
test1 = (test[['w1', 'w2', 'w1', 'w2']] * test[['v1', 'v1', 'v2', 'v2']].values)
test1.columns = cols
print (test1)
w1v1 w2v1 w1v2 w2v2
0 1.0 8.0 0.3 2.4
1 6.0 2.0 1.2 0.4
2 3.0 3.0 0.2 0.2
3 3.5 10.5 0.6 1.8
4 2.0 0.5 0.0 0.0
5 3.0 1.0 0.6 0.2
6 6.0 0.0 0.0 0.0
df = test.join(test1).groupby('id').sum()
df1 = df[cols] / df[['w1', 'w2', 'w1', 'w2']].values
print (df1)
w1v1 w2v1 w1v2 w2v2
id
1 22.5 16.818182 4.5 3.818182
2 11.0 11.666667 1.8 2.000000
3 22.5 15.000000 0.5 1.000000
Another more dynamic solution with MultiIndex DataFrames:
a = ['v1', 'v2']
b = ['w1', 'w2']
mux = pd.MultiIndex.from_product([a,b])
df1 = test.set_index('id').drop('d', axis=1)
v = df1.reindex(columns=mux, level=0)
w = df1.reindex(columns=mux, level=1)
print (v)
v1 v2
w1 w2 w1 w2
id
1 10 10 3 3
2 20 20 4 4
3 15 15 1 1
1 35 35 6 6
2 5 5 0 0
2 10 10 2 2
3 30 30 0 0
print (w)
v1 v2
w1 w2 w1 w2
id
1 0.1 0.8 0.1 0.8
2 0.3 0.1 0.3 0.1
3 0.2 0.2 0.2 0.2
1 0.1 0.3 0.1 0.3
2 0.4 0.1 0.4 0.1
2 0.3 0.1 0.3 0.1
3 0.2 0.0 0.2 0.0
df = w * v
print (df)
v1 v2
w1 w2 w1 w2
id
1 1.0 8.0 0.3 2.4
2 6.0 2.0 1.2 0.4
3 3.0 3.0 0.2 0.2
1 3.5 10.5 0.6 1.8
2 2.0 0.5 0.0 0.0
2 3.0 1.0 0.6 0.2
3 6.0 0.0 0.0 0.0
df1 = df.groupby('id').sum() / w.groupby('id').sum()
#flatten MultiIndex columns
df1.columns = ['{0[1]}{0[0]}'.format(x) for x in df1.columns]
print (df1)
w1v1 w2v1 w1v2 w2v2
id
1 22.5 16.818182 4.5 3.818182
2 11.0 11.666667 1.8 2.000000
3 22.5 15.000000 0.5 1.000000
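The trick in this solution is the level-based reindex: matching a flat set of columns against one level of a MultiIndex broadcasts each matching column across the other level. A toy illustration (hypothetical two-column frame):

import pandas as pd

toy = pd.DataFrame({'v1': [1, 2], 'w1': [0.5, 0.5]})
mux = pd.MultiIndex.from_product([['v1'], ['w1', 'w2']])
# 'v1' matches level 0 of mux, so its values are repeated under
# (v1, w1) and (v1, w2); 'w1' matches nothing at level 0 and is dropped.
print (toy.reindex(columns=mux, level=0))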
If you can take multi index columns, you can use groupby + dot:
test.groupby('id').apply(
    lambda g: g.filter(like='v').T.dot(g.filter(like='w') / g.filter(like='w').sum()).stack()
)
# v1 v2
# w1 w2 w1 w2
#id
#1 22.5 16.818182 4.5 3.818182
#2 11.0 11.666667 1.8 2.000000
#3 22.5 15.000000 0.5 1.000000
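If flat column names are preferred here as well, the MultiIndex columns of the groupby + dot result can be joined afterwards; a sketch, assuming res holds the output above:

res = test.groupby('id').apply(
    lambda g: g.filter(like='v').T.dot(g.filter(like='w') / g.filter(like='w').sum()).stack()
)
# res.columns is a MultiIndex of (value, weight) tuples such as ('v1', 'w1');
# join each pair as weight+value to match the w1v1 naming used above.
res.columns = ['{}{}'.format(w, v) for v, w in res.columns]
print (res)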

pandas dataframe assign doesn't update the dataframe

I made a pandas DataFrame of the Iris dataset and I want to put four extra columns in it: SepalRatio, PetalRatio, SepalMultiplied, and PetalMultiplied. I used the assign() function of the DataFrame to add these four columns, but the DataFrame remains the same.
My code to add the columns is:
iris.assign(SepalRatio = iris['SepalLengthCm'] / iris['SepalWidthCm']).assign(PetalRatio = iris['PetalLengthCm'] / iris['PetalWidthCm']).assign(SepalMultiplied = iris['SepalLengthCm'] * iris['SepalWidthCm']).assign(PetalMultiplied = iris['PetalLengthCm'] * iris['PetalWidthCm'])
When executing in a Jupyter notebook a correct table is shown, but if I use the print statement the four columns aren't added.
Output in Jupyter notebook :
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species SepalRatio PetalRatio SepalMultiplied PetalMultiplied
0 1 5.1 3.5 1.4 0.2 Iris-setosa 1.457143 7.000000 17.85 0.28
1 2 4.9 3.0 1.4 0.2 Iris-setosa 1.633333 7.000000 14.70 0.28
2 3 4.7 3.2 1.3 0.2 Iris-setosa 1.468750 6.500000 15.04 0.26
3 4 4.6 3.1 1.5 0.2 Iris-setosa 1.483871 7.500000 14.26 0.30
4 5 5.0 3.6 1.4 0.2 Iris-setosa 1.388889 7.000000 18.00 0.28
5 6 5.4 3.9 1.7 0.4 Iris-setosa 1.384615 4.250000 21.06 0.68
6 7 4.6 3.4 1.4 0.3 Iris-setosa 1.352941 4.666667 15.64 0.42
7 8 5.0 3.4 1.5 0.2 Iris-setosa 1.470588 7.500000 17.00 0.30
8 9 4.4 2.9 1.4 0.2 Iris-setosa 1.517241 7.000000 12.76 0.28
9 10 4.9 3.1 1.5 0.1 Iris-setosa 1.580645 15.000000 15.19 0.15
Output after printing the dataframe:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
5 6 5.4 3.9 1.7 0.4
6 7 4.6 3.4 1.4 0.3
7 8 5.0 3.4 1.5 0.2
8 9 4.4 2.9 1.4 0.2
9 10 4.9 3.1 1.5 0.1
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
5 Iris-setosa
6 Iris-setosa
7 Iris-setosa
8 Iris-setosa
9 Iris-setosa
assign returns a new DataFrame and does not modify iris in place; Jupyter just displays that returned value. You need to assign the output back to a variable, like:
iris = iris.assign(SepalRatio = iris['SepalLengthCm'] / iris['SepalWidthCm']).assign(PetalRatio = iris['PetalLengthCm'] / iris['PetalWidthCm']).assign(SepalMultiplied = iris['SepalLengthCm'] * iris['SepalWidthCm']).assign(PetalMultiplied = iris['PetalLengthCm'] * iris['PetalWidthCm'])
Better is to use only one assign call:
iris = iris.assign(SepalRatio = iris['SepalLengthCm'] / iris['SepalWidthCm'],
                   PetalRatio = iris['PetalLengthCm'] / iris['PetalWidthCm'],
                   SepalMultiplied = iris['SepalLengthCm'] * iris['SepalWidthCm'],
                   PetalMultiplied = iris['PetalLengthCm'] * iris['PetalWidthCm'])
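As a variant, assign also accepts callables; in recent pandas versions keyword arguments are evaluated in order, so later columns may even refer to columns created earlier in the same call. A sketch of the same four columns:

iris = iris.assign(
    SepalRatio=lambda d: d['SepalLengthCm'] / d['SepalWidthCm'],
    PetalRatio=lambda d: d['PetalLengthCm'] / d['PetalWidthCm'],
    SepalMultiplied=lambda d: d['SepalLengthCm'] * d['SepalWidthCm'],
    PetalMultiplied=lambda d: d['PetalLengthCm'] * d['PetalWidthCm'],
)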

Get max value from row of a dataframe in python [duplicate]

This question already has answers here:
Find the max of two or more columns with pandas
(3 answers)
How to select max and min value in a row for selected columns
(2 answers)
Closed 5 years ago.
This is my dataframe df
a b c
1.2 2 0.1
2.1 1.1 3.2
0.2 1.9 8.8
3.3 7.8 0.12
I'm trying to get the max value from each row of a dataframe, and I'm expecting output like this:
max_value
2
3.2
8.8
7.8
This is what I have tried:
df[len(df.columns)].argmax()
I'm not getting the proper output; any help would be much appreciated. Thanks.
Use max with axis=1:
df = df.max(axis=1)
print (df)
0 2.0
1 3.2
2 8.8
3 7.8
dtype: float64
And if need new column:
df['max_value'] = df.max(axis=1)
print (df)
a b c max_value
0 1.2 2.0 0.10 2.0
1 2.1 1.1 3.20 3.2
2 0.2 1.9 8.80 8.8
3 3.3 7.8 0.12 7.8
You could use numpy:
df.assign(max_value=df.values.max(1))
a b c max_value
0 1.2 2.0 0.10 2.0
1 2.1 1.1 3.20 3.2
2 0.2 1.9 8.80 8.8
3 3.3 7.8 0.12 7.8
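If you also need to know which column holds the row-wise maximum, idxmax works the same way along axis=1; a small sketch, assuming the original a/b/c columns:

# Name of the column containing each row's maximum
df['max_column'] = df[['a', 'b', 'c']].idxmax(axis=1)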

Why isn't this code to plot a histogram on a continuous value Pandas column working?

I am trying to create a histogram of the continuous-value column Trip_distance in a large 1.4M-row pandas dataframe. I wrote the following code:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(17,10))
trip_data.hist(column="Trip_distance")
plt.xlabel("Trip_distance",fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.xlim([0.0,100.0])
#plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
But I am not sure why all values give the same frequency plot, which shouldn't be the case. What's wrong with the code?
Test data:
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 0.00 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2.0
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 0.00 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2.0
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 0.59 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1.0
3 2 2015-09-01 00:02:36 2015-09-01 00:06:42 N 1 -73.921387 40.766678 -73.931427 40.771584 1 0.74 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
4 2 2015-09-01 00:00:14 2015-09-01 00:04:20 N 1 -73.955482 40.714046 -73.944412 40.714729 1 0.61 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
5 2 2015-09-01 00:00:39 2015-09-01 00:05:20 N 1 -73.945297 40.808186 -73.937668 40.821198 1 1.07 5.5 0.5 0.5 1.36 0.0 NaN 0.3 8.16 1 1.0
6 2 2015-09-01 00:00:52 2015-09-01 00:05:50 N 1 -73.890877 40.746426 -73.876923 40.756306 1 1.43 6.5 0.5 0.5 0.00 0.0 NaN 0.3 7.80 1 1.0
7 2 2015-09-01 00:02:15 2015-09-01 00:05:34 N 1 -73.946701 40.797321 -73.937645 40.804516 1 0.90 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
8 2 2015-09-01 00:02:36 2015-09-01 00:07:20 N 1 -73.963150 40.693829 -73.956787 40.680531 1 1.33 6.0 0.5 0.5 1.46 0.0 NaN 0.3 8.76 1 1.0
9 2 2015-09-01 00:02:13 2015-09-01 00:07:23 N 1 -73.896820 40.746128 -73.888626 40.752724 1 0.84 5.5 0.5 0.5 0.00 0.0 NaN 0.3 6.80 2 1.0
Trip_distance column
0 0.00
1 0.00
2 0.59
3 0.74
4 0.61
5 1.07
6 1.43
7 0.90
8 1.33
9 0.84
10 0.80
11 0.70
12 1.01
13 0.39
14 0.56
Name: Trip_distance, dtype: float64
After 100 bins: (plot omitted)
EDIT:
After your comments this actually makes perfect sense: you don't get a visibly different bar for each bucket because there are 1.4 million rows and only ten discrete buckets, so apparently each bucket holds roughly 10% of the data (to within what you can see in the plot).
A quick rerun of your data:
In [25]: df.hist(column='Trip_distance')
Prints out absolutely fine.
The df.hist function comes with an optional keyword argument bins=10, which buckets the data into discrete bins. With only 10 discrete bins and a more or less homogeneous distribution over hundreds of thousands of rows, you might not be able to see the differences between the ten bins in your low-resolution plot:
In [34]: df.hist(column='Trip_distance', bins=50)
Here's another way to plot the data. It involves turning the datetime into an index, which might also help you with future slicing:
import pandas as pd
import matplotlib.pyplot as plt

#convert the column to datetime
trip_data['lpep_pickup_datetime'] = pd.to_datetime(trip_data['lpep_pickup_datetime'])
#turn the datetime to an index
trip_data.index = trip_data['lpep_pickup_datetime']
#Plot
trip_data['Trip_distance'].plot(kind='hist')
plt.show()
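For a heavily skewed column like Trip_distance, increasing the bin count and clipping the x-range usually reveals the shape near zero; a sketch, where the (0, 20) cutoff is just an assumed example and both keywords are forwarded to matplotlib's hist:

import matplotlib.pyplot as plt

# More bins plus an explicit range make the mass near zero visible
trip_data['Trip_distance'].hist(bins=100, range=(0, 20))
plt.xlabel('Trip_distance')
plt.ylabel('Frequency')
plt.show()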
