Calculating the share of each code, by ID - python

I have this data-frame:
ID code X X_total
A 456 40 40
A 789 0 40
B 123 75 100
B 987 25 100
C 789 13 91
C 987 0 91
C 123 35 91
C 456 43 91
I want to calculate the share of each code (from [123, 456, 789, 987]) by dividing X by X_total, for each ID.
Expected result:
ID share_123 share_456 share_789 share_987
A 0.00 1.00 0.00 0.00
B 0.75 0.00 0.00 0.25
C 0.38 0.47 0.14 0.00
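
For reference, a minimal sketch that builds the sample frame from the question, so the snippets below can be run as-is:
import pandas as pd

df = pd.DataFrame({
    'ID':      ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'code':    [456, 789, 123, 987, 789, 987, 123, 456],
    'X':       [40, 0, 75, 25, 13, 0, 35, 43],
    'X_total': [40, 40, 100, 100, 91, 91, 91, 91],
})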

Let us do crosstab; normalize='index' divides each row by its row sum, which here equals X_total because X sums to X_total within each ID:
s = pd.crosstab(df.ID, df.code, df.X, aggfunc='sum', normalize='index').add_prefix("share_")
Out[70]:
code share_123 share_456 share_789 share_987
ID
A 0.000000 1.000000 0.000000 0.00
B 0.750000 0.000000 0.000000 0.25
C 0.384615 0.472527 0.142857 0.00

Or with df.pivot, using your divide-then-reshape logic (keyword arguments, since pivot no longer accepts positional index/columns/values in recent pandas):
df.assign(k=df['X'].div(df['X_total'])).pivot(index="ID", columns="code", values="k").fillna(0)
code 123 456 789 987
ID
A 0.000000 1.000000 0.000000 0.00
B 0.750000 0.000000 0.000000 0.25
C 0.384615 0.472527 0.142857 0.00
Adding formatting:
(df.assign(k=df['X'].div(df['X_total']))
   .pivot(index="ID", columns="code", values="k").fillna(0)
   .add_prefix("share_").round(2).rename_axis(None, axis=1).reset_index())
ID share_123 share_456 share_789 share_987
0 A 0.00 1.00 0.00 0.00
1 B 0.75 0.00 0.00 0.25
2 C 0.38 0.47 0.14 0.00

Another approach, with groupby + unstack:
df['X'].div(df['X_total']).groupby([df['ID'], df['code']]).sum().unstack(fill_value=0)
code 123 456 789 987
ID
A 0.000000 1.000000 0.000000 0.00
B 0.750000 0.000000 0.000000 0.25
C 0.384615 0.472527 0.142857 0.00
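
A pivot_table variant is possible too; this sketch normalizes by row sums, which again relies on X summing to X_total within each ID:
(df.pivot_table(index='ID', columns='code', values='X', aggfunc='sum', fill_value=0)
   .pipe(lambda t: t.div(t.sum(axis=1), axis=0)))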

Unable to scrape 2nd table from Fbref.com

I would like to scrape the 2nd table in the page seen below from the link - https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard - on Google Colab, but pd.read_html("https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard") only gives me the first table.
Please help me understand where I am going wrong.
[Screenshot of the page]
This is one way to read that data. The site ships the extra tables inside HTML comments (a browser reveals them with JavaScript), so pd.read_html never sees them in the raw page; stripping the comment markers first exposes every table to the parser:
import pandas as pd
import requests

url = 'https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard'
# Fetch the raw HTML and strip the comment markers that hide the extra tables
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
# header=1 takes the second header row; index 2 is the player standard-stats table
df = pd.read_html(response, header=1)[2]
print(df)
Result in terminal:
Rk Player Nation Pos Squad Age Born MP Starts Min 90s Gls Ast G-PK PK PKatt CrdY CrdR Gls.1 Ast.1 G+A G-PK.1 G+A-PK Matches
0 1 Sahal Abdul Samad in IND MF Kerala Blasters 24 1997 20 19 1443 16.0 5 1 5 0 0 0 0 0.31 0.06 0.37 0.31 0.37 Matches
1 2 Ayush Adhikari in IND MF Kerala Blasters 21 2000 14 6 540 6.0 0 0 0 0 0 3 1 0.00 0.00 0.00 0.00 0.00 Matches
2 3 Gani Ahammed Nigam in IND FW NorthEast Utd 23 1998 6 0 66 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
3 4 Airam es ESP FW Goa 33 1987 13 8 751 8.3 6 1 5 1 2 0 0 0.72 0.12 0.84 0.60 0.72 Matches
4 5 Alex br BRA MF Jamshedpur 32 1988 20 12 1118 12.4 1 4 1 0 0 2 0 0.08 0.32 0.40 0.08 0.40 Matches
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
302 292 João Victor br BRA MF Hyderabad FC 32 1988 18 18 1590 17.7 5 1 3 2 2 3 0 0.28 0.06 0.34 0.17 0.23 Matches
303 293 David Williams au AUS FW Mohun Bagan 33 1988 15 6 602 6.7 4 1 4 0 1 2 0 0.60 0.15 0.75 0.60 0.75 Matches
304 294 Banana Yaya cm CMR DF Bengaluru 30 1991 5 2 229 2.5 0 1 0 0 0 1 0 0.00 0.39 0.39 0.00 0.39 Matches
305 295 Joe Zoherliana in IND DF NorthEast Utd 22 1999 9 6 677 7.5 0 1 0 0 0 0 0 0.00 0.13 0.13 0.00 0.13 Matches
306 296 Mark Zothanpuia in IND MF Hyderabad FC 19 2002 3 0 63 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
307 rows × 24 columns
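If you are not sure which list index holds the table you want, a quick sketch for inspecting everything read_html found:
tables = pd.read_html(response, header=1)
for i, t in enumerate(tables):
    # Print each table's position, shape and leading column names
    print(i, t.shape, list(t.columns[:5]))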
Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

How to get maximums of multiple groups based on grouping column?

I have an initial dataset data grouped by id:
id x y
1 0.21 1.00
1 0.34 0.66
1 0.35 0.33
1 0.94 0.00
2 0.11 1.00
2 0.90 0.66
2 0.31 0.33
2 0.33 0.00
3 0.12 1.00
3 0.34 0.71
3 0.64 0.43
3 0.89 0.14
4 0.32 1.00
4 0.33 0.66
4 0.45 0.33
4 0.76 0.00
I am trying to predict the maximum y based on variable x while considering the groups. First, I split the data with train_test_split based on the groups:
data_train
id x y
1 0.21 1.00
1 0.34 0.66
1 0.35 0.33
1 0.94 0.00
2 0.11 1.00
2 0.90 0.66
2 0.31 0.33
2 0.33 0.00
and
data_test
id x y
3 0.12 1.00
3 0.34 0.66
3 0.64 0.33
3 0.89 0.00
4 0.33 1.00
4 0.32 0.66
4 0.45 0.33
4 0.76 0.00
After training the model and applying the model on data_test, I get:
y_hat
0.65
0.33
0.13
0.00
0.33
0.34
0.21
0.08
I am trying to transform y_hat so that the maximum within each of the initial groups becomes 1.00 and every other value becomes 0.00:
y_hat_transform
1.00
0.00
0.00
0.00
0.00
1.00
0.00
0.00
How would I do that? Note that the groups can be of varying sizes.
Edit: To simplify the problem, I have id_test and y_hat, where
id_test
3
3
3
3
4
4
4
4
and I am trying to get y_hat_transform. Combined into one frame, id_y:
id y
0 3 0.65
1 3 0.33
2 3 0.13
3 3 0.00
4 4 0.33
5 4 0.34
6 4 0.21
7 4 0.08
# Flag the max row per group and cast the boolean mask to float
# (1.0/0.0 is treated as binary here, so casting does the job directly)
# transform('max') returns a column of the same length with each group's max repeated
id_y['y_transform'] = (id_y['y'] == id_y.groupby('id')['y'].transform('max')).astype(float)
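A minimal end-to-end sketch, building id_y from the id_test and y_hat given above:
import pandas as pd

id_test = [3, 3, 3, 3, 4, 4, 4, 4]
y_hat = [0.65, 0.33, 0.13, 0.00, 0.33, 0.34, 0.21, 0.08]
id_y = pd.DataFrame({'id': id_test, 'y': y_hat})
id_y['y_transform'] = (id_y['y'] == id_y.groupby('id')['y'].transform('max')).astype(float)
print(id_y['y_transform'].tolist())
# [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]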

Slicing a pandas dataframe based on a greater or equal criterion

I bet this question has been answered a number of times, but I am struggling to find a definitive solution.
I need to delete dataframe rows based on a greater-or-equal condition. Because of the float64 type I am not able to satisfy the "equal" part of the condition. Splitting the condition into two seems cumbersome and not very pandorable. Can someone help me find a solution?
Thanks.
Dataframe:
Sg Sw temp_S Krg Krw Pc
0 0.00 1.00 -5.263158e-02 0.000000 0.650000 0.000000
1 0.05 0.95 -4.382459e-17 0.000000 0.650000 0.000000
2 0.10 0.90 5.263158e-02 0.000000 0.593548 0.095790
3 0.15 0.85 1.052632e-01 0.000000 0.537097 0.107775
4 0.20 0.80 1.578947e-01 0.000000 0.480645 0.122121
5 0.25 0.75 2.105263e-01 0.000000 0.424194 0.139496
6 0.30 0.70 2.631579e-01 0.000000 0.367742 0.160837
7 0.35 0.65 3.157895e-01 0.000000 0.311290 0.187397
8 0.36 0.64 3.263158e-01 0.000000 0.300000 0.193483
9 0.40 0.60 3.684211e-01 0.014167 0.230400 0.221009
Slicing:
print(object.sc_df[object.sc_df['Sg'].values > 0.05])
Output:
Sg Sw temp_S Krg Krw Pc
2 0.10 0.90 0.052632 0.000000 0.593548 0.095790
3 0.15 0.85 0.105263 0.000000 0.537097 0.107775
4 0.20 0.80 0.157895 0.000000 0.480645 0.122121
5 0.25 0.75 0.210526 0.000000 0.424194 0.139496
6 0.30 0.70 0.263158 0.000000 0.367742 0.160837
7 0.35 0.65 0.315789 0.000000 0.311290 0.187397
8 0.36 0.64 0.326316 0.000000 0.300000 0.193483
9 0.40 0.60 0.368421 0.014167 0.230400 0.221009
As you can see, row 1 (Sg = 0.05) is missing, because the strict > comparison excludes the value that should count as equal. What would be the best way to satisfy the "equal" condition?
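One sketch of a solution: fold the tolerance into a single mask with numpy.isclose, keeping rows strictly above the threshold together with rows equal to it up to float noise (df stands in for object.sc_df here):
import numpy as np

threshold = 0.05
# Strictly greater, OR equal within floating-point tolerance
mask = (df['Sg'] > threshold) | np.isclose(df['Sg'], threshold)
print(df[mask])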

Pandas: find the n lowest values each m rows

I have a dataframe with a counter column that increases by 1 every 24 rows, and a value column, like below.
value counter
0 0.00 1
1 0.00 1
2 0.00 1
3 0.00 1
4 0.00 1
5 0.00 1
6 0.00 1
7 0.00 1
8 55.00 1
9 90.00 1
10 49.27 1
11 51.80 1
12 49.06 1
13 43.46 1
14 45.96 1
15 43.95 1
16 45.00 1
17 43.97 1
18 42.00 1
19 41.14 1
20 43.92 1
21 51.74 1
22 40.85 1
23 0.00 2
24 0.00 2
25 0.00 2
26 0.00 2
27 0.00 2
28 0.00 2
29 0.00 2
... ... ...
187 82.38 9
188 66.89 9
189 59.83 9
190 52.46 9
191 40.48 9
192 28.87 9
193 41.90 9
194 42.56 9
195 40.93 9
196 40.02 9
197 36.54 9
198 33.70 9
199 38.99 9
200 46.10 9
201 44.82 9
202 0.00 9
203 0.00 9
204 0.00 9
205 0.00 9
206 0.00 9
207 0.00 10
208 0.00 10
209 0.00 10
210 74.69 10
211 89.20 10
212 74.59 10
213 55.11 10
214 58.39 10
215 40.81 10
216 45.06 10
I would like to know if there is a way to create a third column holding the 4 lowest values within each group of rows that share the same counter value. See below an example for the first group, with counter=1:
value counter value 2
0 0.00 1 0.00
1 0.00 1 0.00
2 0.00 1 0.00
3 0.00 1 0.00
4 0.00 1 0.00
5 0.00 1 0.00
6 0.00 1 0.00
7 0.00 1 0.00
8 55.00 1 0.00
9 90.00 1 0.00
10 49.27 1 0.00
11 51.80 1 0.00
12 49.06 1 0.00
13 43.46 1 43.46
14 45.96 1 0.00
15 43.95 1 0.00
16 45.00 1 0.00
17 43.97 1 0.00
18 42.00 1 42.00
19 41.14 1 41.14
20 43.92 1 0.00
21 51.74 1 0.00
22 40.85 1 40.85
I know about functions like nsmallest(n, 'column'), but I don't know how to restrict them to the counter grouping.
Any idea? Thank you in advance!
I think you need to first filter out the rows where value is 0, sort with sort_values, take the first 4 values per group with GroupBy.head, and finally reindex to fill 0 for the unmatched rows:
df['value 2'] = (df[df['value'] != 0]
                 .sort_values('value')
                 .groupby('counter')['value'].head(4)
                 .reindex(df.index, fill_value=0))
print(df)
value counter value 2
0 0.00 1 0.00
1 0.00 1 0.00
2 0.00 1 0.00
3 0.00 1 0.00
4 0.00 1 0.00
5 0.00 1 0.00
6 0.00 1 0.00
7 0.00 1 0.00
8 55.00 1 0.00
9 90.00 1 0.00
10 49.27 1 0.00
11 51.80 1 0.00
12 49.06 1 0.00
13 43.46 1 43.46
14 45.96 1 0.00
15 43.95 1 0.00
16 45.00 1 0.00
17 43.97 1 0.00
18 42.00 1 42.00
19 41.14 1 41.14
20 43.92 1 0.00
21 51.74 1 0.00
22 40.85 1 40.85
23 0.00 2 0.00
24 0.00 2 0.00
25 0.00 2 0.00
26 0.00 2 0.00
27 0.00 2 0.00
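Since you mention nsmallest: a sketch of the same idea with SeriesGroupBy.nsmallest, which returns a (counter, original row) MultiIndex, so the group level is dropped before reindexing:
df['value 2'] = (df.loc[df['value'] != 0, 'value']
                 .groupby(df['counter']).nsmallest(4)
                 .droplevel(0)
                 .reindex(df.index, fill_value=0))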

How to create multiple spacing CSV from pandas?

I have a pandas dataframe and want to output a text file whose columns are separated by varying amounts of spacing, as input to another model. How can I do that?
The sample OUTPUT text file is as follows (each column in the text file corresponds to a column in df):
SO HOUREMIS 92 5 1 1 MC12 386.91 389.8 11.45
SO HOUREMIS 92 5 1 1 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED2 322.00 397.4 13.00
SO HOUREMIS 92 5 1 1 HL2 25.55 464.3 7.46
SO HOUREMIS 92 5 1 1 WC1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 WC2 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC12 405.35 389.3 11.54
SO HOUREMIS 92 5 1 2 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED2 319.90 396.3 13.00
After referring to this post, I found the solution:
import numpy as np

fmt = '%0s %+1s %+1s %+2s %+2s %+2s %+6s %+15s'
np.savetxt('test.txt', data.values[0:10], fmt=fmt)
With this I can format each column and specify its width and alignment.
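An alternative sketch without numpy, right-aligning each field with plain string formatting; the per-column widths here are assumed, since the frame's schema isn't shown:
# Assumed widths, one per column of the hypothetical frame df
widths = [2, 8, 2, 1, 1, 1, 4, 6, 5, 5]
with open('test.txt', 'w') as f:
    for row in df.itertuples(index=False):
        f.write(' '.join(str(v).rjust(w) for v, w in zip(row, widths)) + '\n')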
