Groupby id to calculate ratios - python

Objective
I have the DataFrame below and compute some ratios from it. I want to calculate these ratios for each id and datadate, and I believe groupby is the way to go, but I am not exactly sure how. Any help would be appreciated!
df
id datadate dltt ceq ... pstk icapt dlc sale
1 001004 1975-02-28 3.0 193.0 ... 1.012793 1 0.20 7.367237
2 001004 1975-05-31 4.0 197.0 ... 1.249831 1 0.21 8.982741
3 001004 1975-08-31 5.0 174.0 ... 1.142086 2 0.24 8.115609
4 001004 1975-11-30 8.0 974.0 ... 1.400673 3 0.26 9.944990
5 001005 1975-02-28 3.0 191.0 ... 1.012793 4 0.25 7.367237
6 001005 1975-05-31 3.0 971.0 ... 1.249831 5 0.26 8.982741
7 001005 1975-08-31 2.0 975.0 ... 1.142086 6 0.27 8.115609
8 001005 1975-11-30 1.0 197.0 ... 1.400673 3 0.27 9.944990
9 001006 1975-02-28 3.0 974.0 ... 1.012793 2 0.28 7.367237
10 001006 1975-05-31 4.0 74.0 ... 1.249831 1 0.21 8.982741
11 001006 1975-08-31 5.0 75.0 ... 1.142086 3 0.23 8.115609
12 001006 1975-11-30 5.0 197.0 ... 1.400673 4 0.24 9.944990
Example of ratios
df['capital_ratioa'] = df['dltt'] / (df['dltt'] + df['ceq'] + df['pstk'])
df['equity_invcapa'] = df['ceq'] / df['icapt']
df['debt_invcapa'] = df['dltt'] / df['icapt']
df['sale_invcapa'] = df['sale'] / df['icapt']
df['totdebt_invcapa'] = (df['dltt'] + df['dlc']) / df['icapt']

Is this what you're looking for?
df = df.groupby(by=['id'], as_index=False).sum()
df['capital_ratioa'] = df['dltt'] / (df['dltt'] + df['ceq'] + df['pstk'])
df['equity_invcapa'] = df['ceq'] / df['icapt']
df['debt_invcapa'] = df['dltt'] / df['icapt']
df['sale_invcapa'] = df['sale'] / df['icapt']
df['totdebt_invcapa'] = (df['dltt'] + df['dlc']) / df['icapt']
print(df)
Output:
id dltt ceq pstk icapt dlc sale capital_ratioa equity_invcapa debt_invcapa sale_invcapa totdebt_invcapa
0 1004 20.0 1538.0 4.805383 7 0.91 34.410577 0.012797 219.714286 2.857143 4.915797 2.987143
1 1005 9.0 2334.0 4.805383 18 1.05 34.410577 0.003833 129.666667 0.500000 1.911699 0.558333
2 1006 17.0 1320.0 4.805383 10 0.96 34.410577 0.012669 132.000000 1.700000 3.441058 1.796000
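For the question as literally stated (one set of ratios per id and datadate rather than per id), here is a minimal sketch, assuming each (id, datadate) pair occurs only once, so the row-wise formulas above already give per-date values; grouping by both keys just makes that explicit:
# sketch: ratios per (id, datadate); the sum is a no-op when each pair is unique
grouped = df.groupby(['id', 'datadate'], as_index=False).sum(numeric_only=True)
grouped['capital_ratioa'] = grouped['dltt'] / (grouped['dltt'] + grouped['ceq'] + grouped['pstk'])
grouped['equity_invcapa'] = grouped['ceq'] / grouped['icapt']
grouped['debt_invcapa'] = grouped['dltt'] / grouped['icapt']
grouped['sale_invcapa'] = grouped['sale'] / grouped['icapt']
grouped['totdebt_invcapa'] = (grouped['dltt'] + grouped['dlc']) / grouped['icapt']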

Related

groupby two columns, then reset_index fails due to having the same name

I made the following groupby with my pandas dataframe:
df.groupby([df.Hora.dt.hour, df.Hora.dt.minute]).describe()['Qtd']
After the groupby, the data looks like this:
count mean std min 25% 50% 75% max
Hora Hora
9 0 11.0 5.909091 2.022600 5.000 5.0 5.0 5.00 10.0
1 197.0 6.421320 4.010210 5.000 5.0 5.0 5.00 30.0
2 125.0 6.040000 4.679054 5.000 5.0 5.0 5.00 50.0
3 131.0 6.450382 5.700491 5.000 5.0 5.0 5.00 60.0
4 182.0 6.401099 5.212458 5.000 5.0 5.0 5.00 50.0
5 147.0 6.054422 5.402666 5.000 5.0 5.0 5.00 60.0
6 59.0 6.779661 6.416756 5.000 5.0 5.0 5.00 45.0
7 16.0 6.875000 5.123475 5.000 5.0 5.0 5.00 25.0
When I try to use reset_index(), I get an error because the index names are the same:
ValueError: cannot insert Hora, already exists
How do I reset_index and get the data as follows:
Hora Minute count
9 0 11.0
9 1 197.0
9 2 125.0
9 3 131.0
9 4 182.0
9 5 147.0
9 6 59.0
9 7 16.0
You can first rename and then reset_index:
out = (
    df.groupby([df.Hora.dt.hour, df.Hora.dt.minute])['Qtd'].describe()
      .rename_axis(index=['Hora', 'Minute'])   # give the duplicate index levels distinct names
      .reset_index()
      [['Hora', 'Minute', 'count']]
)
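For what it's worth, newer pandas versions (1.5+, if I remember correctly) also accept a names= argument in reset_index itself, which removes the need for the separate rename_axis call; worth verifying against your installed version:
out = (
    df.groupby([df.Hora.dt.hour, df.Hora.dt.minute])['Qtd'].describe()
      .reset_index(names=['Hora', 'Minute'])   # names= assumed available (pandas >= 1.5)
      [['Hora', 'Minute', 'count']]
)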

How to split a DataFrame on each different value in a column?

Below is an example DataFrame.
0 1 2 3 4
0 0.0 13.00 4.50 30.0 0.0,13.0
1 0.0 13.00 4.75 30.0 0.0,13.0
2 0.0 13.00 5.00 30.0 0.0,13.0
3 0.0 13.00 5.25 30.0 0.0,13.0
4 0.0 13.00 5.50 30.0 0.0,13.0
5 0.0 13.00 5.75 0.0 0.0,13.0
6 0.0 13.00 6.00 30.0 0.0,13.0
7 1.0 13.25 0.00 30.0 0.0,13.25
8 1.0 13.25 0.25 0.0 0.0,13.25
9 1.0 13.25 0.50 30.0 0.0,13.25
10 1.0 13.25 0.75 30.0 0.0,13.25
11 2.0 13.25 1.00 30.0 0.0,13.25
12 2.0 13.25 1.25 30.0 0.0,13.25
13 2.0 13.25 1.50 30.0 0.0,13.25
14 2.0 13.25 1.75 30.0 0.0,13.25
15 2.0 13.25 2.00 30.0 0.0,13.25
16 2.0 13.25 2.25 30.0 0.0,13.25
I want to split this into new DataFrames whenever the value in column 0 changes, i.e. into three separate frames:
0 1 2 3 4
0 0.0 13.00 4.50 30.0 0.0,13.0
1 0.0 13.00 4.75 30.0 0.0,13.0
2 0.0 13.00 5.00 30.0 0.0,13.0
3 0.0 13.00 5.25 30.0 0.0,13.0
4 0.0 13.00 5.50 30.0 0.0,13.0
5 0.0 13.00 5.75 0.0 0.0,13.0
6 0.0 13.00 6.00 30.0 0.0,13.0

0 1 2 3 4
7 1.0 13.25 0.00 30.0 0.0,13.25
8 1.0 13.25 0.25 0.0 0.0,13.25
9 1.0 13.25 0.50 30.0 0.0,13.25
10 1.0 13.25 0.75 30.0 0.0,13.25

0 1 2 3 4
11 2.0 13.25 1.00 30.0 0.0,13.25
12 2.0 13.25 1.25 30.0 0.0,13.25
13 2.0 13.25 1.50 30.0 0.0,13.25
14 2.0 13.25 1.75 30.0 0.0,13.25
15 2.0 13.25 2.00 30.0 0.0,13.25
16 2.0 13.25 2.25 30.0 0.0,13.25
I've tried adapting the following solutions without any luck so far: "Split array at value in numpy" and "Split a large pandas dataframe".
Looks like you want to group by the first column. You could create a dictionary from the groupby object, with the groupby keys as the dictionary keys:
out = dict(tuple(df.groupby(0)))
Or we could also build a list from the groupby object. This becomes more useful when we only want positional indexing rather than based on the grouping key:
out = [sub_df for _, sub_df in df.groupby(0)]
We could then index the dict based on the grouping key, or the list based on the group's position:
print(out[0])
0 1 2 3 4
0 0.0 13.0 4.50 30.0 0.0,13.0
1 0.0 13.0 4.75 30.0 0.0,13.0
2 0.0 13.0 5.00 30.0 0.0,13.0
3 0.0 13.0 5.25 30.0 0.0,13.0
4 0.0 13.0 5.50 30.0 0.0,13.0
5 0.0 13.0 5.75 0.0 0.0,13.0
6 0.0 13.0 6.00 30.0 0.0,13.0
Based on
I want to split this into new dataframes when the row in column 0 changes.
If you only want to start a new group when the value in column 0 changes, you can try:
d = dict([*df.groupby(df['0'].ne(df['0'].shift()).cumsum())])
print(d[1])
print(d[2])
0 1 2 3 4
0 0.0 13.0 4.50 30.0 0.0,13.0
1 0.0 13.0 4.75 30.0 0.0,13.0
2 0.0 13.0 5.00 30.0 0.0,13.0
3 0.0 13.0 5.25 30.0 0.0,13.0
4 0.0 13.0 5.50 30.0 0.0,13.0
5 0.0 13.0 5.75 0.0 0.0,13.0
6 0.0 13.0 6.00 30.0 0.0,13.0
0 1 2 3 4
7 1.0 13.25 0.00 30.0 0.0,13.25
8 1.0 13.25 0.25 0.0 0.0,13.25
9 1.0 13.25 0.50 30.0 0.0,13.25
10 1.0 13.25 0.75 30.0 0.0,13.25
I will use GroupBy.__iter__:
d = dict(df.groupby(df['0'].diff().ne(0).cumsum()).__iter__())
#d = dict(df.groupby(df[0].diff().ne(0).cumsum()).__iter__())   # if the column label is the integer 0 rather than the string '0'
Note that if there are repeated non-consecutive values, different groups will be created; if you just use groupby(0), they will all end up in the same group.
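A small self-contained sketch (synthetic data, not from the question) contrasting grouping on the value with grouping on consecutive runs:
import pandas as pd

df = pd.DataFrame({0: [1, 1, 2, 2, 1, 1], 1: list('abcdef')})
by_value = dict(tuple(df.groupby(0)))                               # 2 groups: keys 1 and 2
by_run = dict(tuple(df.groupby(df[0].ne(df[0].shift()).cumsum())))  # 3 groups: runs 1,1 / 2,2 / 1,1
print(len(by_value), len(by_run))  # 2 3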

Get the subarray with same numbers and consecutive index

I have a text file like this
0, 23.00, 78.00, 75.00, 105.00, 2,0.97
1, 371.00, 305.00, 38.00, 48.00, 0,0.85
1, 24.00, 78.00, 75.00, 116.00, 2,0.98
1, 372.00, 306.00, 37.00, 48.00, 0,0.84
2, 28.00, 87.00, 74.00, 101.00, 2,0.97
2, 372.00, 307.00, 35.00, 47.00, 0,0.80
3, 32.00, 86.00, 73.00, 98.00, 2,0.98
3, 363.00, 310.00, 34.00, 46.00, 0,0.83
4, 40.00, 77.00, 71.00, 98.00, 2,0.94
4, 370.00, 307.00, 38.00, 47.00, 0,0.84
4, 46.00, 78.00, 74.00, 116.00, 2,0.97
5, 372.00, 308.00, 34.00, 46.00, 0,0.57
5, 43.00, 66.00, 67.00, 110.00, 2,0.96
Code I tried:
import numpy as np

frames = []
x = []
y = []
labels = []
with open(file, 'r') as lb:
    for line in lb:
        line = line.replace(',', ' ')
        arr = line.split()
        frames.append(arr[0])
        x.append(arr[1])
        y.append(arr[2])
        labels.append(arr[5])
print(np.shape(frames))
for d, a in enumerate(frames):
    compare = []
    if a == frames[d+2]:
        compare.append(x[d])
        compare.append(x[d+1])
        compare.append(x[d+2])
        xm = np.argmin(compare)
        label = {0: int(labels[d]), 1: int(labels[d+1]), 2: int(labels[d+2])}.get(xm)
    elif a == frames[d+1]:
        compare.append(x[d])
        compare.append(x[d+1])
        xm = np.argmin(compare)
        label = {0: int(labels[d]), 1: int(labels[d+1])}.get(xm)
In the first line, because the first number (0) is unique, I can extract the sixth number (2) easily.
But after that there are many lines with the same first number, so I want to store all the lines sharing a first number, compare their second numbers, and then extract the sixth number from the line with the lowest second number.
Can someone suggest a Python solution? I tried readline() and next() but couldn't work it out.
You can read the file with pandas.read_csv instead, which makes this much easier:
import pandas as pd
df = pd.read_csv(file_path, header = None)
You'll read the file as a table
0 1 2 3 4 5 6
0 0 23.0 78.0 75.0 105.0 2 0.97
1 1 371.0 305.0 38.0 48.0 0 0.85
2 1 24.0 78.0 75.0 116.0 2 0.98
3 1 372.0 306.0 37.0 48.0 0 0.84
4 2 28.0 87.0 74.0 101.0 2 0.97
5 2 372.0 307.0 35.0 47.0 0 0.80
6 3 32.0 86.0 73.0 98.0 2 0.98
7 3 363.0 310.0 34.0 46.0 0 0.83
8 4 40.0 77.0 71.0 98.0 2 0.94
9 4 370.0 307.0 38.0 47.0 0 0.84
10 4 46.0 78.0 74.0 116.0 2 0.97
11 5 372.0 308.0 34.0 46.0 0 0.57
12 5 43.0 66.0 67.0 110.0 2 0.96
Then you can group into sub-tables based on one of the columns (in your case column 0):
for group, sub_df in df.groupby(0):
    row = sub_df[1].idxmin()  # index of the minimum value in column 1
    print(df.loc[row, 5])     # this is the number you are looking for
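To collect one value per group rather than just printing it, a possible follow-up with the same assumptions:
# dict mapping each value of column 0 to the column-5 entry of its lowest column-1 row
result = {group: df.loc[sub_df[1].idxmin(), 5] for group, sub_df in df.groupby(0)}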
I think this is what you need using pandas:
import pandas as pd
df = pd.read_table('./test.txt', sep=',', names = ('1','2','3','4','5','6','7'))
print(df)
# 1 2 3 4 5 6 7
# 0 0 23.0 78.0 75.0 105.0 2 0.97
# 1 1 371.0 305.0 38.0 48.0 0 0.85
# 2 1 24.0 78.0 75.0 116.0 2 0.98
# 3 1 372.0 306.0 37.0 48.0 0 0.84
# 4 2 28.0 87.0 74.0 101.0 2 0.97
# 5 2 372.0 307.0 35.0 47.0 0 0.80
# 6 3 32.0 86.0 73.0 98.0 2 0.98
# 7 3 363.0 310.0 34.0 46.0 0 0.83
# 8 4 40.0 77.0 71.0 98.0 2 0.94
# 9 4 370.0 307.0 38.0 47.0 0 0.84
# 10 4 46.0 78.0 74.0 116.0 2 0.97
# 11 5 372.0 308.0 34.0 46.0 0 0.57
# 12 5 43.0 66.0 67.0 110.0 2 0.96
df_new = df.loc[df.groupby("1")["2"].idxmin()]  # row with the lowest second number within each frame
print(df_new)
# 1 2 3 4 5 6 7
# 0 0 23.0 78.0 75.0 105.0 2 0.97
# 2 1 24.0 78.0 75.0 116.0 2 0.98
# 4 2 28.0 87.0 74.0 101.0 2 0.97
# 6 3 32.0 86.0 73.0 98.0 2 0.98
# 8 4 40.0 77.0 71.0 98.0 2 0.94
# 12 5 43.0 66.0 67.0 110.0 2 0.96
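If only the sixth number itself is needed (one value per frame id), a short follow-up using the same column names:
labels = df.loc[df.groupby("1")["2"].idxmin(), "6"]   # one sixth-number per frame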

Sum dataframes of different length, with overlapping indexes

I have many dataframes of equal length with the same Date values:
Date OPP
0 2008-01-04 0.0
1 2008-02-04 0.0
2 2008-03-04 0.0
3 2008-04-04 0.0
4 2008-05-04 0.0
5 2008-06-04 0.0
6 2008-07-04 393.75
7 2008-08-04 -168.75
8 2008-09-04 -656.25
9 2008-10-04 -1631.25
Date OPP
0 2008-01-04 750.0
1 2008-02-04 0.0
2 2008-03-04 150.0
3 2008-04-04 600.0
4 2008-05-04 0.0
5 2008-06-04 0.0
6 2008-07-04 0.0
7 2008-08-04 -250.0
8 2008-09-04 1000.0
9 2008-10-04 0.0
I need to create a single dataframe that sums the OPP columns of all these dataframes. This can easily be done like this:
df3 = df1["OPP"] + df2["OPP"]
df3["Date"] = df1["Date"]
This works as long as all the dataframes have the same length and the same Date values.
How can I make it work even if these conditions aren't met? What if I had another dataframe like this:
Date OPP
0 2008-07-04 393.75
1 2008-08-04 -168.75
2 2008-09-04 -656.25
3 2008-10-04 -1631.25
4 2008-11-04 -675.00
5 2008-12-04 0.00
I could do this manually: find the earliest and latest dates across all the dfs, then pad every df with the full date range and zeroes so that they all have equal length... and then proceed with a simple sum.
But, is there a way to do this automatically in Pandas?
Following this answer's method, we can use functools.reduce for this.
What's left is to sum over axis=1:
from functools import reduce
dfs = [df1, df2, df3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='Date', how='left'), dfs)
Which gives us:
Date OPP_x OPP_y OPP
0 2008-01-04 0.00 750.0 NaN
1 2008-02-04 0.00 0.0 NaN
2 2008-03-04 0.00 150.0 NaN
3 2008-04-04 0.00 600.0 NaN
4 2008-05-04 0.00 0.0 NaN
5 2008-06-04 0.00 0.0 NaN
6 2008-07-04 393.75 0.0 393.75
7 2008-08-04 -168.75 -250.0 -168.75
8 2008-09-04 -656.25 1000.0 -656.25
9 2008-10-04 -1631.25 0.0 -1631.25
Then we sum:
df_final.iloc[:, 1:].sum(axis=1)
0 750.0
1 0.0
2 150.0
3 600.0
4 0.0
5 0.0
6 787.5
7 -587.5
8 -312.5
9 -3262.5
dtype: float64
Or as new column:
df_final['sum'] = df_final.iloc[:, 1:].sum(axis=1)
Date OPP_x OPP_y OPP sum
0 2008-01-04 0.00 750.0 NaN 750.0
1 2008-02-04 0.00 0.0 NaN 0.0
2 2008-03-04 0.00 150.0 NaN 150.0
3 2008-04-04 0.00 600.0 NaN 600.0
4 2008-05-04 0.00 0.0 NaN 0.0
5 2008-06-04 0.00 0.0 NaN 0.0
6 2008-07-04 393.75 0.0 393.75 787.5
7 2008-08-04 -168.75 -250.0 -168.75 -587.5
8 2008-09-04 -656.25 1000.0 -656.25 -312.5
9 2008-10-04 -1631.25 0.0 -1631.25 -3262.5
Use a list comprehension to create Series indexed by Date, then join them together with concat and sum:
dfs = [df1, df2]
compr = [x.set_index('Date')['OPP'] for x in dfs]
df1 = pd.concat(compr, axis=1).sum(axis=1).reset_index(name='OPP')
print (df1)
Date OPP
0 2008-01-04 750.00
1 2008-02-04 0.00
2 2008-03-04 150.00
3 2008-04-04 600.00
4 2008-05-04 0.00
5 2008-06-04 0.00
6 2008-07-04 393.75
7 2008-08-04 -418.75
8 2008-09-04 343.75
9 2008-10-04 -1631.25
You can simply concat them and sum per date with groupby:
(pd.concat((df1,df2,df3))
.groupby('Date', as_index=False)
.sum()
)
Output for your three sample dataframes:
Date OPP
0 2008-01-04 750.0
1 2008-02-04 0.0
2 2008-03-04 150.0
3 2008-04-04 600.0
4 2008-05-04 0.0
5 2008-06-04 0.0
6 2008-07-04 787.5
7 2008-08-04 -587.5
8 2008-09-04 -312.5
9 2008-10-04 -3262.5
10 2008-11-04 -675.0
11 2008-12-04 0.0
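For two frames at a time, another pattern worth knowing (a sketch, not from the answers above) is Series.add with fill_value, which aligns on the Date index automatically and treats missing dates as zero:
# sketch: pairwise alignment with missing dates treated as 0
s1 = df1.set_index('Date')['OPP']
s3 = df3.set_index('Date')['OPP']
total = s1.add(s3, fill_value=0).reset_index(name='OPP')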

Why isn't this code to plot a histogram on a continuous value Pandas column working?

I am trying to create a histogram of a continuous-valued column, Trip_distance, in a large pandas dataframe with 1.4 million rows. I wrote the following code:
fig = plt.figure(figsize=(17,10))
trip_data.hist(column="Trip_distance")
plt.xlabel("Trip_distance",fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.xlim([0.0,100.0])
#plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
But I am not sure why all values appear with the same frequency in the plot, which shouldn't be the case. What's wrong with the code?
Test data:
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 0.00 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2.0
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 0.00 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2.0
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 0.59 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1.0
3 2 2015-09-01 00:02:36 2015-09-01 00:06:42 N 1 -73.921387 40.766678 -73.931427 40.771584 1 0.74 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
4 2 2015-09-01 00:00:14 2015-09-01 00:04:20 N 1 -73.955482 40.714046 -73.944412 40.714729 1 0.61 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
5 2 2015-09-01 00:00:39 2015-09-01 00:05:20 N 1 -73.945297 40.808186 -73.937668 40.821198 1 1.07 5.5 0.5 0.5 1.36 0.0 NaN 0.3 8.16 1 1.0
6 2 2015-09-01 00:00:52 2015-09-01 00:05:50 N 1 -73.890877 40.746426 -73.876923 40.756306 1 1.43 6.5 0.5 0.5 0.00 0.0 NaN 0.3 7.80 1 1.0
7 2 2015-09-01 00:02:15 2015-09-01 00:05:34 N 1 -73.946701 40.797321 -73.937645 40.804516 1 0.90 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
8 2 2015-09-01 00:02:36 2015-09-01 00:07:20 N 1 -73.963150 40.693829 -73.956787 40.680531 1 1.33 6.0 0.5 0.5 1.46 0.0 NaN 0.3 8.76 1 1.0
9 2 2015-09-01 00:02:13 2015-09-01 00:07:23 N 1 -73.896820 40.746128 -73.888626 40.752724 1 0.84 5.5 0.5 0.5 0.00 0.0 NaN 0.3 6.80 2 1.0
Trip_distance column
0 0.00
1 0.00
2 0.59
3 0.74
4 0.61
5 1.07
6 1.43
7 0.90
8 1.33
9 0.84
10 0.80
11 0.70
12 1.01
13 0.39
14 0.56
Name: Trip_distance, dtype: float64
After 100 bins: [histogram plot not shown]
EDIT:
After your comments, it actually makes perfect sense that you don't see a separate bar for each different value. There are 1.4 million rows and only ten discrete buckets, so apparently each bucket holds close to 10% of the data (to within what you can see in the plot).
A quick rerun of your data:
In [25]: df.hist(column='Trip_distance')
It plots absolutely fine.
The df.hist function has an optional keyword argument bins=10, which buckets the data into discrete bins. With only 10 bins and a more or less homogeneous distribution over hundreds of thousands of rows, you may not be able to see the differences between the ten bins in your low-resolution plot:
In [34]: df.hist(column='Trip_distance', bins=50)
Here's another way to plot the data; it involves turning the datetime into an index, which might help you with future slicing:
#convert column to datetime
trip_data['lpep_pickup_datetime'] = pd.to_datetime(trip_data['lpep_pickup_datetime'])
#turn the datetime to an index
trip_data.index = trip_data['lpep_pickup_datetime']
#Plot
trip_data['Trip_distance'].plot(kind='hist')
plt.show()
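A further sketch (assuming matplotlib is imported as plt and trip_data is loaded as above): draw the histogram on an explicit Axes with more bins and a limited range, so that figsize, labels and limits all apply to the same plot the data is drawn on:
fig, ax = plt.subplots(figsize=(17, 10))
ax.hist(trip_data['Trip_distance'], bins=100, range=(0, 100))  # more bins, clipped to 0-100 (assumed distance unit)
ax.set_xlabel('Trip_distance', fontsize=15)
ax.set_ylabel('Frequency', fontsize=15)
plt.show()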
