Pandas: find the n lowest values each m rows - python

I have a dataframe with a Counter, increasing by 1 each 24 rows, and a value column, like below.
value counter
0 0.00 1
1 0.00 1
2 0.00 1
3 0.00 1
4 0.00 1
5 0.00 1
6 0.00 1
7 0.00 1
8 55.00 1
9 90.00 1
10 49.27 1
11 51.80 1
12 49.06 1
13 43.46 1
14 45.96 1
15 43.95 1
16 45.00 1
17 43.97 1
18 42.00 1
19 41.14 1
20 43.92 1
21 51.74 1
22 40.85 1
23 0.00 2
24 0.00 2
25 0.00 2
26 0.00 2
27 0.00 2
28 0.00 2
29 0.00 2
... ... ...
187 82.38 9
188 66.89 9
189 59.83 9
190 52.46 9
191 40.48 9
192 28.87 9
193 41.90 9
194 42.56 9
195 40.93 9
196 40.02 9
197 36.54 9
198 33.70 9
199 38.99 9
200 46.10 9
201 44.82 9
202 0.00 9
203 0.00 9
204 0.00 9
205 0.00 9
206 0.00 9
207 0.00 10
208 0.00 10
209 0.00 10
210 74.69 10
211 89.20 10
212 74.59 10
213 55.11 10
214 58.39 10
215 40.81 10
216 45.06 10
I would like to know if there is a way to create a third column holding the 4 lowest values in each group of rows that share the same counter value, with every other row set to 0. See below an example for the first group, with counter = 1:
value counter value 2
0 0.00 1 0.00
1 0.00 1 0.00
2 0.00 1 0.00
3 0.00 1 0.00
4 0.00 1 0.00
5 0.00 1 0.00
6 0.00 1 0.00
7 0.00 1 0.00
8 55.00 1 0.00
9 90.00 1 0.00
10 49.27 1 0.00
11 51.80 1 0.00
12 49.06 1 0.00
13 43.46 1 43.46
14 45.96 1 0.00
15 43.95 1 0.00
16 45.00 1 0.00
17 43.97 1 0.00
18 42.00 1 42.00
19 41.14 1 41.14
20 43.92 1 0.00
21 51.74 1 0.00
22 40.85 1 40.85
I know about functions like nsmallest(n, 'column'), but I don't know how to restrict them to each counter group.
Any ideas? Thank you in advance!

I think you first need to filter out the rows where value is 0, then sort the rest with sort_values and take the first 4 rows of each counter group with GroupBy.head (after sorting, these are the 4 lowest values). Finally, reindex against the original index so that the rows that were not selected are filled with 0:
df['value 2'] = (df[df['value'] != 0]
                   .sort_values('value')
                   .groupby('counter')['value'].head(4)
                   .reindex(df.index, fill_value=0))
print(df)
value counter value 2
0 0.00 1 0.00
1 0.00 1 0.00
2 0.00 1 0.00
3 0.00 1 0.00
4 0.00 1 0.00
5 0.00 1 0.00
6 0.00 1 0.00
7 0.00 1 0.00
8 55.00 1 0.00
9 90.00 1 0.00
10 49.27 1 0.00
11 51.80 1 0.00
12 49.06 1 0.00
13 43.46 1 43.46
14 45.96 1 0.00
15 43.95 1 0.00
16 45.00 1 0.00
17 43.97 1 0.00
18 42.00 1 42.00
19 41.14 1 41.14
20 43.92 1 0.00
21 51.74 1 0.00
22 40.85 1 40.85
23 0.00 2 0.00
24 0.00 2 0.00
25 0.00 2 0.00
26 0.00 2 0.00
27 0.00 2 0.00
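As a self-contained check of the same idea, here is a minimal sketch that uses a per-group rank instead of sort_values plus head. This is a swapped-in variant, not the answer's exact method, and the small frame below is made-up sample data in the shape of the question.

```python
import pandas as pd

# Hypothetical sample in the shape of the question's data
df = pd.DataFrame({
    "value": [0.0, 55.0, 43.46, 42.0, 41.14, 40.85, 51.74, 0.0, 74.69, 40.81, 45.06],
    "counter": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
})
n = 4

# Ignore zeros, then rank the remaining values inside each counter group.
# method="first" breaks ties by row order, so exactly n rows are kept at most.
mask = df["value"] != 0
rank = df.loc[mask].groupby("counter")["value"].rank(method="first")
keep = rank[rank <= n].index

# Copy the n lowest non-zero values; every other row stays 0
df["value 2"] = 0.0
df.loc[keep, "value 2"] = df.loc[keep, "value"]
print(df)
```

A group with fewer than n non-zero rows (counter 2 in this sample) simply keeps all of them.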

Related

Unable to scrape 2nd table from Fbref.com

I would like to scrape the 2nd table on the page at the link below, using Google Colab: https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard
However, pd.read_html("https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard") only gives me the first table.
Please help me understand where I am going wrong.
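This is one way to read that data. The replace calls strip the HTML comment markers from the page source, so tables that are wrapped in comments become visible to read_html:
import io
import pandas as pd
import requests
url = 'https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard'
# Remove the comment markers so the commented-out tables are parsed too
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
# Recent pandas versions expect a file-like object rather than a literal HTML string
df = pd.read_html(io.StringIO(response), header=1)[2]
print(df)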
Result in terminal:
Rk Player Nation Pos Squad Age Born MP Starts Min 90s Gls Ast G-PK PK PKatt CrdY CrdR Gls.1 Ast.1 G+A G-PK.1 G+A-PK Matches
0 1 Sahal Abdul Samad in IND MF Kerala Blasters 24 1997 20 19 1443 16.0 5 1 5 0 0 0 0 0.31 0.06 0.37 0.31 0.37 Matches
1 2 Ayush Adhikari in IND MF Kerala Blasters 21 2000 14 6 540 6.0 0 0 0 0 0 3 1 0.00 0.00 0.00 0.00 0.00 Matches
2 3 Gani Ahammed Nigam in IND FW NorthEast Utd 23 1998 6 0 66 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
3 4 Airam es ESP FW Goa 33 1987 13 8 751 8.3 6 1 5 1 2 0 0 0.72 0.12 0.84 0.60 0.72 Matches
4 5 Alex br BRA MF Jamshedpur 32 1988 20 12 1118 12.4 1 4 1 0 0 2 0 0.08 0.32 0.40 0.08 0.40 Matches
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
302 292 João Victor br BRA MF Hyderabad FC 32 1988 18 18 1590 17.7 5 1 3 2 2 3 0 0.28 0.06 0.34 0.17 0.23 Matches
303 293 David Williams au AUS FW Mohun Bagan 33 1988 15 6 602 6.7 4 1 4 0 1 2 0 0.60 0.15 0.75 0.60 0.75 Matches
304 294 Banana Yaya cm CMR DF Bengaluru 30 1991 5 2 229 2.5 0 1 0 0 0 1 0 0.00 0.39 0.39 0.00 0.39 Matches
305 295 Joe Zoherliana in IND DF NorthEast Utd 22 1999 9 6 677 7.5 0 1 0 0 0 0 0 0.00 0.13 0.13 0.00 0.13 Matches
306 296 Mark Zothanpuia in IND MF Hyderabad FC 19 2002 3 0 63 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
307 rows × 24 columns
Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
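To see why stripping the comment markers matters, here is a small offline sketch using only the standard library. It assumes, as the replace calls in the answer suggest, that the missing table sits inside an HTML comment; the page string and the TableCounter helper are made up for illustration.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts <table> start tags that the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def count_tables(html):
    counter = TableCounter()
    counter.feed(html)
    return counter.tables


# Hypothetical page: one normal table, one wrapped in a comment
page = """
<table id="first"><tr><td>1</td></tr></table>
<div><!--
<table id="second"><tr><td>2</td></tr></table>
--></div>
"""

visible_before = count_tables(page)
uncommented = page.replace("<!--", "").replace("-->", "")
visible_after = count_tables(uncommented)
print(visible_before, visible_after)
```

Markup inside a comment is not reported as tags by an HTML parser, which is consistent with read_html returning only the first table until the markers are removed.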

create pandas column with conditionals

The value of column A can be -1, 0 or 1. How can I create a new column with the following conditions: where column A is 0 and its value in the row below is different from 0 (-1 or 1), take the value of column X from that row below; if that value of column X equals 0, take the next value different from 0 instead. In all other rows of the new column, assign the value 0.
import pandas as pd
d = {'a': [-1,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,1,1,1,0,0,0,0,1,1,1,1,1],
'x': [0.00,-2.13,0.00,0.00,0.00,0.00,0.00,0.21,-0.63,0.00,0.29,-0.11,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.62,0.00,
-0.36,0.00,-0.03,0.00,0.00,0.00,0.22,0.05,0.00]}
df = pd.DataFrame(data=d)
df
desired result
a x r
0 -1 0.00 0.00
1 -1 -2.13 0.00
2 -1 0.00 0.00
3 0 0.00 0.00
4 0 0.00 0.00
5 -1 0.00 0.00
6 -1 0.00 0.00
7 -1 0.21 0.21
8 -1 -0.63 0.00
9 -1 0.00 0.00
10 -1 0.29 0.00
11 -1 -0.11 0.00
12 -1 0.00 0.00
13 0 0.00 0.00
14 0 0.00 0.00
15 0 0.00 0.00
16 0 0.00 0.00
17 0 0.00 0.00
18 1 0.00 0.00
19 1 0.62 0.62
20 1 0.00 0.00
21 0 -0.36 0.00
22 0 0.00 0.00
23 0 -0.03 0.00
24 0 0.00 0.00
25 1 0.00 0.00
26 1 0.00 0.00
27 1 0.22 0.22
28 1 0.05 0.00
29 1 0.00 0.00
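One possible reading of the rule, sketched below, is: for every run of non-zero a that starts right after a row with a == 0, copy the first non-zero x of that run into r and leave everything else at 0. This interpretation is an assumption derived from the desired result above (it also treats a direct switch from -1 to 1 as one run), and the helper names start, run_id and candidate are made up.

```python
import pandas as pd

d = {'a': [-1, -1, -1, 0, 0, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0,
           1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1],
     'x': [0.00, -2.13, 0.00, 0.00, 0.00, 0.00, 0.00, 0.21, -0.63, 0.00, 0.29,
           -0.11, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.62, 0.00, -0.36,
           0.00, -0.03, 0.00, 0.00, 0.00, 0.22, 0.05, 0.00]}
df = pd.DataFrame(data=d)

nonzero_a = df['a'].ne(0)
# A run starts where a is non-zero and the previous row had a == 0
start = nonzero_a & df['a'].shift().eq(0)
# Number the runs 1, 2, 3, ...; rows outside such a run get id 0
run_id = start.cumsum().where(nonzero_a, 0)

# Candidate rows: inside a numbered run and with a non-zero x
candidate = run_id.gt(0) & df['x'].ne(0)
# Keep only the first candidate of each run
first = candidate & candidate.astype(int).groupby(run_id).cumsum().eq(1)

df['r'] = df['x'].where(first, 0.0)
print(df)
```

The leading run (rows 0 to 2) gets id 0 because no zero row precedes it, which matches the desired result where -2.13 is not copied.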

get value from one column as variable for subtraction

I have a data frame with XYs and distances. What I am trying to do is store the distance as a variable wherever X or Y has a value greater than 0, and subtract it from the distances that follow.
Here is a sample df:
dist x y
0 12.93 99.23
200 0 0
400 0 0
600 0 0
800 0 0
1000 12.46 99.14
1200 0 0
1400 0 0
1600 0 0
1800 0 0
2000 12.01 99.07
and this is the new df:
dist x y
0 12.93 99.23
200 0 0
400 0 0
600 0 0
800 0 0
0 12.46 99.14
200 0 0
400 0 0
600 0 0
800 0 0
2000 12.01 99.07
The last value doesn't matter, but technically it would be 0.
The idea is that at every known XY the distance becomes 0, and that row's original distance is subtracted from every following row until the next known XY.
In the above example the distances are round numbers, but in reality they could be values like
132.05
19.999
1539.65
and so on
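You can do this with groupby and transform (note that this groups on x only, so it assumes x is non-zero at every known position):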
df.dist-=df.groupby(df.x.ne(0).cumsum())['dist'].transform('first')
df
Out[769]:
dist x y
0 0 12.93 99.23
1 200 0.00 0.00
2 400 0.00 0.00
3 600 0.00 0.00
4 800 0.00 0.00
5 0 12.46 99.14
6 200 0.00 0.00
7 400 0.00 0.00
8 600 0.00 0.00
9 800 0.00 0.00
10 0 12.01 99.07
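You can use groupby and apply, using a custom grouper calculated as follows (group_keys=False keeps the original index in the result on recent pandas versions):
grouper = (df['x'].ne(0) | df['y'].ne(0)).cumsum()
df['dist'].groupby(grouper, group_keys=False).apply(lambda x: x - x.values[0])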
0 0
1 200
2 400
3 600
4 800
5 0
6 200
7 400
8 600
9 800
10 0
Name: dist, dtype: int64
Where,
grouper
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
10 3
dtype: int64
The idea is to give every row the label of the group it belongs to, so that the first value of each group (the distance at the known XY) can be subtracted from all rows of that group.
With where + ffill
df['dist'] = df.dist - df.where(df.x.gt(0) | df.y.gt(0)).dist.ffill()
dist x y
0 0.0 12.93 99.23
1 200.0 0.00 0.00
2 400.0 0.00 0.00
3 600.0 0.00 0.00
4 800.0 0.00 0.00
5 0.0 12.46 99.14
6 200.0 0.00 0.00
7 400.0 0.00 0.00
8 600.0 0.00 0.00
9 800.0 0.00 0.00
10 0.0 12.01 99.07
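For reference, here is a self-contained sketch that rebuilds the sample frame from the question and applies the cumsum grouper together with transform('first'). It combines pieces of the answers above rather than adding a new technique.

```python
import pandas as pd

df = pd.DataFrame({
    'dist': [0, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
    'x': [12.93, 0, 0, 0, 0, 12.46, 0, 0, 0, 0, 12.01],
    'y': [99.23, 0, 0, 0, 0, 99.14, 0, 0, 0, 0, 99.07],
})

# A new group starts at every row with a known position (x or y non-zero)
grouper = (df['x'].ne(0) | df['y'].ne(0)).cumsum()

# Subtract each group's first distance from all rows of that group
df['dist'] = df['dist'] - df.groupby(grouper)['dist'].transform('first')
print(df)
```

Because the subtraction uses the actual first distance of each group, it also works when the distances are not round numbers.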

How to create multiple spacing CSV from pandas?

I have a pandas dataframe and want to write it to a text file whose columns are separated by differing amounts of whitespace, as input to another model. How can I do that?
The sample OUTPUT text file is as follows (each column in the text file corresponds to a column in df):
SO HOUREMIS 92 5 1 1 MC12 386.91 389.8 11.45
SO HOUREMIS 92 5 1 1 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED2 322.00 397.4 13.00
SO HOUREMIS 92 5 1 1 HL2 25.55 464.3 7.46
SO HOUREMIS 92 5 1 1 WC1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 WC2 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC12 405.35 389.3 11.54
SO HOUREMIS 92 5 1 2 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED2 319.90 396.3 13.00
After referring to this post, I found the solution:
import numpy as np
fmt = '%0s %+1s %+1s %+2s %+2s %+2s %+6s %+15s'
np.savetxt('test.txt', data.values[0:10], fmt=fmt)
This way I can format each column separately, specifying its width and alignment.
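A minimal sketch of how the width and alignment parts of such a format string behave, written to an in-memory buffer so the result can be inspected. The column names, the sample rows and the particular widths are assumptions for illustration, not the real layout required by the target model.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame with a few of the columns from the sample output
data = pd.DataFrame({
    "key": ["SO", "SO", "SO"],
    "kind": ["HOUREMIS", "HOUREMIS", "HOUREMIS"],
    "year": [92, 92, 92],
    "source": ["MC12", "MC3", "ED2"],
    "rate": [386.91, 0.0, 322.0],
})

# One specifier per column: "-" left-aligns, the number is the field width,
# and ".2f" fixes the number of decimals
fmt = "%-2s %-8s %2d %-5s %8.2f"

buffer = io.StringIO()
np.savetxt(buffer, data.values, fmt=fmt)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Since every field has a fixed width, all output lines have the same length and the columns line up regardless of the value lengths.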

Converting string/numerical data to categorical format in pandas

I have a very large csv file that I have converted to a Pandas dataframe, which has string and integer/float values. I would like to change this data to categorical format in order to try and save some memory. I am basing this idea off of the documentation here: https://pandas.pydata.org/pandas-docs/version/0.20/categorical.html
My dataframe looks like the following:
clean_data_measurements.head(20)
station date prcp tobs
0 USC00519397 1/1/2010 0.08 65
1 USC00519397 1/2/2010 0.00 63
2 USC00519397 1/3/2010 0.00 74
3 USC00519397 1/4/2010 0.00 76
5 USC00519397 1/7/2010 0.06 70
6 USC00519397 1/8/2010 0.00 64
7 USC00519397 1/9/2010 0.00 68
8 USC00519397 1/10/2010 0.00 73
9 USC00519397 1/11/2010 0.01 64
10 USC00519397 1/12/2010 0.00 61
11 USC00519397 1/14/2010 0.00 66
12 USC00519397 1/15/2010 0.00 65
13 USC00519397 1/16/2010 0.00 68
14 USC00519397 1/17/2010 0.00 64
15 USC00519397 1/18/2010 0.00 72
16 USC00519397 1/19/2010 0.00 66
17 USC00519397 1/20/2010 0.00 66
18 USC00519397 1/21/2010 0.00 69
19 USC00519397 1/22/2010 0.00 67
20 USC00519397 1/23/2010 0.00 67
It is precipitation data which goes on for another 2700 rows. Since every row belongs to the same category (station number), it should be convertible to categorical format, which should save memory and processing time. I am just unsure of how to write the code. Can anyone help? Thanks.
One way to convert the object columns is factorize, which replaces each distinct string with an integer code (the result is an integer column, not a category dtype):
objectdf = df.select_dtypes(include='object')
df[objectdf.columns] = objectdf.apply(lambda x: pd.factorize(x)[0])
df
Out[452]:
station date prcp tobs
0 0 0 0.08 65
1 0 1 0.00 63
2 0 2 0.00 74
3 0 3 0.00 76
5 0 4 0.06 70
6 0 5 0.00 64
7 0 6 0.00 68
8 0 7 0.00 73
9 0 8 0.01 64
10 0 9 0.00 61
11 0 10 0.00 66
12 0 11 0.00 65
13 0 12 0.00 68
14 0 13 0.00 64
15 0 14 0.00 72
16 0 15 0.00 66
17 0 16 0.00 66
18 0 17 0.00 69
19 0 18 0.00 67
20 0 19 0.00 67
You can try this as well.
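import numpy as np
for col, dtype in zip(df.columns, df.dtypes):
    if dtype == 'object':
        df[col] = pd.factorize(df[col])[0]
    elif dtype == 'int64':
        # int8 only holds -128..127, so larger values would overflow
        df[col] = df[col].astype(np.int8)
    else:
        df[col] = df[col].astype(np.float32)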
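Both snippets above produce integer codes rather than a real categorical column. If the goal is the category dtype described in the linked documentation, astype('category') is the direct route. The sketch below uses a made-up frame with one repeated station id to compare memory before and after.

```python
import pandas as pd

# Hypothetical frame: one station id repeated many times
df = pd.DataFrame({
    "station": ["USC00519397"] * 1000,
    "prcp": [0.0] * 1000,
})

before = df["station"].memory_usage(deep=True)

# Store each distinct string once and keep small integer codes per row
df["station"] = df["station"].astype("category")

after = df["station"].memory_usage(deep=True)
print(df["station"].dtype, before, after)
```

The saving depends on how often values repeat; a column such as date, where almost every value is distinct, gains little from this conversion.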
