I have a dataframe:
A B
0 0.1
0.1 0.3
0.35 0.48
1.3 1.5
1.5 1.9
2.2 2.9
3.1 3.4
5.1 5.5
And I want to add a column that will be the rank of B after grouping into bins of width 1.5, so it will be:
A B T
0 0.1 0
0.1 0.3 0
0.35 0.48 0
1.3 1.5 0
1.5 1.9 1
2.2 2.9 1
3.1 3.4 2
5.1 5.5 3
What is the best way to do so?
Use pd.cut with pd.factorize:
import numpy as np
import pandas as pd

# bin B into intervals of width 1.5, then give each observed bin an integer label
df['T'] = pd.factorize(pd.cut(df.B, bins=np.arange(0, df.B.max() + 1.5, 1.5)))[0]
print(df)
A B T
0 0.00 0.10 0
1 0.10 0.30 0
2 0.35 0.48 0
3 1.30 1.50 0
4 1.50 1.90 1
5 2.20 2.90 1
6 3.10 3.40 2
7 5.10 5.50 3
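If no bin in the range is empty, labels=False in pd.cut gives the same result directly; the difference is that factorize numbers only the bins that actually occur (in order of appearance), while labels=False keeps the positional bin index even when an intermediate bin is empty. A minimal sketch, assuming the same df as above:
import numpy as np
import pandas as pd

# positional bin index: an empty intermediate bin would still consume a label
df['T'] = pd.cut(df.B, bins=np.arange(0, df.B.max() + 1.5, 1.5), labels=False)
print(df)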
I have a file with a table (.csv file).
The table is composed of many sub-"areas", like this example:
As you can see, there are several blocks of data that can be grouped together (blue group, orange group, etc.).
The colors are only there to make the concept clear: in the .csv there is nothing that identifies a group, and the group sizes (number of rows) can change. There is no pattern to predict whether the next group has 1, 2, 3, 4 or more rows.
The problem is that I need to open the table and import it into a pandas dataframe. In my algorithm one group should be identified, copied to another dataframe and then saved.
How can I group data using pandas?
I was thinking of indexing the groups like the following table:
but in this case I cannot access the cells with the same index sequentially.
Any idea?
EDIT: here is the table from the .csv file:
,X,Y,Z,mm,ff,cc
1,1,2,3,0.2,0.4,0.3
,,,,0.1,0.3,0.4
2,1,2,3,0.1,1.2,-1.2
,,,,0.12,-1.234,303.4
,,,,1.2,43.2,44.3
,,,,7.4,88.3,34.4
3,2,4,2,1.13,4.1,55.1
,,,,80.3,34.1,4.01
,,,,43.12,12.3,98.4
You can create a group index and insert it in the first position to match your desired output. I have also used ffill() to get rid of the nulls, but that part is optional for you:
# without ffill(): a new group starts on every row where X, Y and Z are all non-null
df.insert(0, 'index', (df[['X', 'Y', 'Z']].notnull().sum(axis=1) == 3).cumsum())
# df = df.ffill()  # uncomment if you want ffill()
df
Out[1]:
index X Y Z mm ff cc
0 1 1.0 2.0 3.0 0.20 0.400 0.30
1 1 NaN NaN NaN 0.10 0.300 0.40
2 2 1.0 2.0 3.0 0.10 1.200 -1.20
3 2 NaN NaN NaN 0.12 -1.234 303.40
4 2 NaN NaN NaN 1.20 43.200 44.30
5 2 NaN NaN NaN 7.40 88.300 34.40
6 3 2.0 4.0 2.0 1.13 4.100 55.10
7 3 NaN NaN NaN 80.30 34.100 4.01
8 3 NaN NaN NaN 43.12 12.300 98.40
# with ffill
df = df.ffill()
df
Out[2]:
index X Y Z mm ff cc
0 1 1.0 2.0 3.0 0.20 0.400 0.30
1 1 1.0 2.0 3.0 0.10 0.300 0.40
2 2 1.0 2.0 3.0 0.10 1.200 -1.20
3 2 1.0 2.0 3.0 0.12 -1.234 303.40
4 2 1.0 2.0 3.0 1.20 43.200 44.30
5 2 1.0 2.0 3.0 7.40 88.300 34.40
6 3 2.0 4.0 2.0 1.13 4.100 55.10
7 3 2.0 4.0 2.0 80.30 34.100 4.01
8 3 2.0 4.0 2.0 43.12 12.300 98.40
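With the group index in place, pulling one group into its own dataframe (as the question describes) is a plain boolean filter. A minimal sketch, assuming the column is named 'index' as above; the output file name is just an example:
# copy the second group into its own dataframe and save it
group_2 = df[df['index'] == 2].copy()
group_2.to_csv('group_2.csv', index=False)  # hypothetical file name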
Try groupby:
groups = df[['X','Y','Z']].notna().all(axis=1).cumsum()
for k, d in df.groupby(groups):
    # do something with the groups
    print(f'Group {k}')
    print(d)
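If every group should end up in its own dataframe (and, say, its own file), you can collect the pieces from the same groupby. A sketch under that assumption; the file names are made up for illustration:
# one dataframe per group, keyed by the group number
group_frames = {k: d.copy() for k, d in df.groupby(groups)}

# or write each group straight to its own CSV
for k, d in df.groupby(groups):
    d.to_csv(f'group_{k}.csv', index=False)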
Objective
I have this df and compute the ratios below. I want to calculate these ratios for each id and datadate, and I believe the groupby function is the way to go, but I am not exactly sure how. Any help would be super!
df
id datadate dltt ceq ... pstk icapt dlc sale
1 001004 1975-02-28 3.0 193.0 ... 1.012793 1 0.20 7.367237
2 001004 1975-05-31 4.0 197.0 ... 1.249831 1 0.21 8.982741
3 001004 1975-08-31 5.0 174.0 ... 1.142086 2 0.24 8.115609
4 001004 1975-11-30 8.0 974.0 ... 1.400673 3 0.26 9.944990
5 001005 1975-02-28 3.0 191.0 ... 1.012793 4 0.25 7.367237
6 001005 1975-05-31 3.0 971.0 ... 1.249831 5 0.26 8.982741
7 001005 1975-08-31 2.0 975.0 ... 1.142086 6 0.27 8.115609
8 001005 1975-11-30 1.0 197.0 ... 1.400673 3 0.27 9.944990
9 001006 1975-02-28 3.0 974.0 ... 1.012793 2 0.28 7.367237
10 001006 1975-05-31 4.0 74.0 ... 1.249831 1 0.21 8.982741
11 001006 1975-08-31 5.0 75.0 ... 1.142086 3 0.23 8.115609
12 001006 1975-11-30 5.0 197.0 ... 1.400673 4 0.24 9.944990
Example of ratios
df['capital_ratioa'] = df['dltt']/(df['dltt']+df['ceq']+df['pstk'])
df['equity_invcapa'] = df['ceq']/df['icapt']
df['debt_invcapa'] = df['dltt']/df['icapt']
df['sale_invcapa'] = df['sale']/df['icapt']
df['totdebt_invcapa'] = (df['dltt']+df['dlc'])/df['icapt']
Is this what you're looking for?
df = df.groupby(by=['id'], as_index=False).sum()
df['capital_ratioa'] = df['dltt']/(df['dltt']+df['ceq']+df['pstk'])
df['equity_invcapa'] = df['ceq']/df['icapt']
df['debt_invcapa'] = df['dltt']/df['icapt']
df['sale_invcapa'] = df['sale']/df['icapt']
df['totdebt_invcapa'] = (df['dltt']+df['dlc'])/df['icapt']
print(df)
Output:
id dltt ceq pstk icapt dlc sale capital_ratioa equity_invcapa debt_invcapa sale_invcapa totdebt_invcapa
0 1004 20.0 1538.0 4.805383 7 0.91 34.410577 0.012797 219.714286 2.857143 4.915797 2.987143
1 1005 9.0 2334.0 4.805383 18 1.05 34.410577 0.003833 129.666667 0.500000 1.911699 0.558333
2 1006 17.0 1320.0 4.805383 10 0.96 34.410577 0.012669 132.000000 1.700000 3.441058 1.796000
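If you want the ratios per id and datadate rather than per id, you can group by both keys before computing them; with one row per (id, datadate) pair, as in the sample, this comes out the same as applying the formulas row by row. A sketch, assuming the same column names:
grouped = df.groupby(['id', 'datadate'], as_index=False).sum(numeric_only=True)
grouped['capital_ratioa'] = grouped['dltt']/(grouped['dltt']+grouped['ceq']+grouped['pstk'])
grouped['equity_invcapa'] = grouped['ceq']/grouped['icapt']
grouped['debt_invcapa'] = grouped['dltt']/grouped['icapt']
grouped['sale_invcapa'] = grouped['sale']/grouped['icapt']
grouped['totdebt_invcapa'] = (grouped['dltt']+grouped['dlc'])/grouped['icapt']
print(grouped)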
I tried using transpose and adding some twists to it, but it didn't work out.
Convert Upper:
Data :
0 1 2 3
0 5 NaN NaN NaN
1 1 NaN NaN NaN
2 0.21 0.31 0.41 0.51
3 0.32 0.42 0.52 NaN
4 0.43 0.53 NaN NaN
5 0.54 NaN NaN NaN
to:
Data :
0 1 2 3
0 5 NaN NaN NaN
1 1 NaN NaN NaN
2 0.21 NaN NaN NaN
3 0.31 0.32 NaN NaN
4 0.41 0.42 0.43 NaN
5 0.51 0.52 0.53 0.54
without affecting the first two rows
I believe you need justify together with np.sort, excluding the first 2 rows:
# push the valid values of rows 2 onwards to the bottom of each column,
# then sort each row so the values line up from the left
arr = justify(df.values[2:,:], invalid_val=np.nan, side='down', axis=0)
df.values[2:,:] = np.sort(arr, axis=1)
print(df)
0 1 2 3
0 5.00 NaN NaN NaN
1 1.00 NaN NaN NaN
2 0.21 NaN NaN NaN
3 0.31 0.32 NaN NaN
4 0.41 0.42 0.43 NaN
5 0.51 0.52 0.53 0.54
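Note that justify is not a pandas or NumPy built-in; the answers here rely on the NumPy-based justify helper that circulates on Stack Overflow. One common version, reproduced as a sketch (an assumption about which exact helper is meant):
import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # Push the valid (non-invalid) values of a 2D array to one side along an axis.
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # sort the boolean mask so True lands at the right/bottom, flip for left/up
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out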
IIUC you can first take the dataframe from row 2 onwards and swap it with its transpose, and then use justify so that all the NaNs end up at the top:
df.iloc[2:,:] = df.iloc[2:,:].T.values
pd.DataFrame(justify(df.values.astype(float), invalid_val=np.nan, side='down', axis=0))
0 1 2 3
0 5 NaN NaN NaN
1 1 NaN NaN NaN
2 0.21 NaN NaN NaN
3 0.31 0.32 NaN NaN
4 0.41 0.42 0.43 NaN
5 0.51 0.52 0.53 0.54
I am trying to create a histogram of the continuous column Trip_distance in a large 1.4M-row pandas dataframe. I wrote the following code:
fig = plt.figure(figsize=(17,10))
trip_data.hist(column="Trip_distance")
plt.xlabel("Trip_distance",fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.xlim([0.0,100.0])
#plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
But I am not sure why all the bins show the same frequency, which shouldn't be the case. What's wrong with the code?
Test data:
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 0.00 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2.0
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 0.00 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2.0
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 0.59 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1.0
3 2 2015-09-01 00:02:36 2015-09-01 00:06:42 N 1 -73.921387 40.766678 -73.931427 40.771584 1 0.74 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
4 2 2015-09-01 00:00:14 2015-09-01 00:04:20 N 1 -73.955482 40.714046 -73.944412 40.714729 1 0.61 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
5 2 2015-09-01 00:00:39 2015-09-01 00:05:20 N 1 -73.945297 40.808186 -73.937668 40.821198 1 1.07 5.5 0.5 0.5 1.36 0.0 NaN 0.3 8.16 1 1.0
6 2 2015-09-01 00:00:52 2015-09-01 00:05:50 N 1 -73.890877 40.746426 -73.876923 40.756306 1 1.43 6.5 0.5 0.5 0.00 0.0 NaN 0.3 7.80 1 1.0
7 2 2015-09-01 00:02:15 2015-09-01 00:05:34 N 1 -73.946701 40.797321 -73.937645 40.804516 1 0.90 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
8 2 2015-09-01 00:02:36 2015-09-01 00:07:20 N 1 -73.963150 40.693829 -73.956787 40.680531 1 1.33 6.0 0.5 0.5 1.46 0.0 NaN 0.3 8.76 1 1.0
9 2 2015-09-01 00:02:13 2015-09-01 00:07:23 N 1 -73.896820 40.746128 -73.888626 40.752724 1 0.84 5.5 0.5 0.5 0.00 0.0 NaN 0.3 6.80 2 1.0
Trip_distance column
0 0.00
1 0.00
2 0.59
3 0.74
4 0.61
5 1.07
6 1.43
7 0.90
8 1.33
9 0.84
10 0.80
11 0.70
12 1.01
13 0.39
14 0.56
Name: Trip_distance, dtype: float64
After 100 bins:
EDIT:
After your comments this actually makes perfect sense: you don't get a histogram of each different value because there are 1.4 million rows and only ten discrete buckets, so apparently each bucket holds roughly 10% of the data (to within what you can see in the plot).
A quick rerun of your data:
In [25]: df.hist(column='Trip_distance')
It plots absolutely fine.
The df.hist function comes with an optional keyword argument bins=10, which buckets the data into discrete bins. With only 10 bins and a more or less homogeneous distribution of hundreds of thousands of rows, you might not be able to see the difference between the ten bins in your low-resolution plot:
In [34]: df.hist(column='Trip_distance', bins=50)
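If you want finer control, you can also pass the bin count and the value range explicitly instead of clipping with xlim afterwards. A minimal sketch, assuming trip_data from the question:
import matplotlib.pyplot as plt

# 100 bins over trip distances from 0 to 100
ax = trip_data['Trip_distance'].hist(bins=100, range=(0, 100), figsize=(17, 10))
ax.set_xlabel('Trip_distance', fontsize=15)
ax.set_ylabel('Frequency', fontsize=15)
plt.show()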
Here's another way to plot the data. It involves turning the datetime column into an index, which might also help you with future slicing:
#convert column to datetime
trip_data['lpep_pickup_datetime'] = pd.to_datetime(trip_data['lpep_pickup_datetime'])
#turn the datetime to an index
trip_data.index = trip_data['lpep_pickup_datetime']
#Plot
trip_data['Trip_distance'].plot(kind='hist')
plt.show()
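As an example of the slicing that a DatetimeIndex makes possible (a sketch, assuming the conversion above and that the data contains 2015-09-01), you could restrict the histogram to a single pickup day:
# partial string indexing on the DatetimeIndex selects one day
one_day = trip_data.loc['2015-09-01']
one_day['Trip_distance'].plot(kind='hist', bins=50)
plt.show()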