Converting string/numerical data to categorical format in pandas - python

I have a very large csv file that I have converted to a pandas DataFrame, which has string and integer/float values. I would like to change this data to categorical format in order to try to save some memory. I am basing this idea on the documentation here: https://pandas.pydata.org/pandas-docs/version/0.20/categorical.html
My dataframe looks like the following:
clean_data_measurements.head(20)
station date prcp tobs
0 USC00519397 1/1/2010 0.08 65
1 USC00519397 1/2/2010 0.00 63
2 USC00519397 1/3/2010 0.00 74
3 USC00519397 1/4/2010 0.00 76
5 USC00519397 1/7/2010 0.06 70
6 USC00519397 1/8/2010 0.00 64
7 USC00519397 1/9/2010 0.00 68
8 USC00519397 1/10/2010 0.00 73
9 USC00519397 1/11/2010 0.01 64
10 USC00519397 1/12/2010 0.00 61
11 USC00519397 1/14/2010 0.00 66
12 USC00519397 1/15/2010 0.00 65
13 USC00519397 1/16/2010 0.00 68
14 USC00519397 1/17/2010 0.00 64
15 USC00519397 1/18/2010 0.00 72
16 USC00519397 1/19/2010 0.00 66
17 USC00519397 1/20/2010 0.00 66
18 USC00519397 1/21/2010 0.00 69
19 USC00519397 1/22/2010 0.00 67
20 USC00519397 1/23/2010 0.00 67
It is precipitation data which goes on for another 2700 rows. Since the station column contains many repeats of the same values, it should be convertible to categorical format, which should save memory. I am just unsure of how to write the code. Can anyone help? Thanks.
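
The conversion the question asks about can be done directly with astype('category'); here is a minimal sketch, assuming clean_data_measurements is the frame shown above:
import pandas as pd

df = clean_data_measurements  # the frame from the question

print(df.memory_usage(deep=True).sum())   # bytes before
# category dtype stores each unique station string once and keeps
# small integer codes per row, which is where the memory saving comes from
df['station'] = df['station'].astype('category')
print(df.memory_usage(deep=True).sum())   # bytes after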

I think we can convert the object columns to integer codes by using factorize (strictly speaking this produces int columns rather than category dtype, but it shrinks the memory all the same):
objectdf = df.select_dtypes(include='object')
df.loc[:, objectdf.columns] = objectdf.apply(lambda x: pd.factorize(x)[0])
df
Out[452]:
station date prcp tobs
0 0 0 0.08 65
1 0 1 0.00 63
2 0 2 0.00 74
3 0 3 0.00 76
5 0 4 0.06 70
6 0 5 0.00 64
7 0 6 0.00 68
8 0 7 0.00 73
9 0 8 0.01 64
10 0 9 0.00 61
11 0 10 0.00 66
12 0 11 0.00 65
13 0 12 0.00 68
14 0 13 0.00 64
15 0 14 0.00 72
16 0 15 0.00 66
17 0 16 0.00 66
18 0 17 0.00 69
19 0 18 0.00 67
20 0 19 0.00 67
You can try this as well.
import numpy as np

for y, x in zip(df.columns, df.dtypes):
    if x == 'object':
        df[y] = pd.factorize(df[y])[0]
    elif x == 'int64':
        df[y] = df[y].astype(np.int8)
    else:
        df[y] = df[y].astype(np.float32)
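One caveat: np.int8 only holds values from -128 to 127, so larger integers can overflow. A safer sketch is to let pandas pick the smallest dtype that actually fits the data:
import pandas as pd

# downcast chooses the narrowest integer/float type that can hold the values
for col in df.select_dtypes('int64').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')
for col in df.select_dtypes('float64').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')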

Related

Unable to scrape 2nd table from Fbref.com

I would like to scrape the 2nd table on the page (seen below) from this link - https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard - on Google Colab,
but pd.read_html("https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard") only gives me the first table.
Please help me understand where I am going wrong.
[Image: snippet of the page]
Fbref serves every table after the first inside an HTML comment, so pd.read_html on the raw URL never sees them; stripping the comment markers first exposes them. This is one way to read that data:
import pandas as pd
import requests

url = 'https://fbref.com/en/comps/82/stats/Indian-Super-League-Stats#all_stats_standard'
# strip the comment markers so the hidden tables become visible to read_html
response = requests.get(url).text.replace('<!--', '').replace('-->', '')
df = pd.read_html(response, header=1)[2]
print(df)
Result in terminal:
Rk Player Nation Pos Squad Age Born MP Starts Min 90s Gls Ast G-PK PK PKatt CrdY CrdR Gls.1 Ast.1 G+A G-PK.1 G+A-PK Matches
0 1 Sahal Abdul Samad in IND MF Kerala Blasters 24 1997 20 19 1443 16.0 5 1 5 0 0 0 0 0.31 0.06 0.37 0.31 0.37 Matches
1 2 Ayush Adhikari in IND MF Kerala Blasters 21 2000 14 6 540 6.0 0 0 0 0 0 3 1 0.00 0.00 0.00 0.00 0.00 Matches
2 3 Gani Ahammed Nigam in IND FW NorthEast Utd 23 1998 6 0 66 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
3 4 Airam es ESP FW Goa 33 1987 13 8 751 8.3 6 1 5 1 2 0 0 0.72 0.12 0.84 0.60 0.72 Matches
4 5 Alex br BRA MF Jamshedpur 32 1988 20 12 1118 12.4 1 4 1 0 0 2 0 0.08 0.32 0.40 0.08 0.40 Matches
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
302 292 João Victor br BRA MF Hyderabad FC 32 1988 18 18 1590 17.7 5 1 3 2 2 3 0 0.28 0.06 0.34 0.17 0.23 Matches
303 293 David Williams au AUS FW Mohun Bagan 33 1988 15 6 602 6.7 4 1 4 0 1 2 0 0.60 0.15 0.75 0.60 0.75 Matches
304 294 Banana Yaya cm CMR DF Bengaluru 30 1991 5 2 229 2.5 0 1 0 0 0 1 0 0.00 0.39 0.39 0.00 0.39 Matches
305 295 Joe Zoherliana in IND DF NorthEast Utd 22 1999 9 6 677 7.5 0 1 0 0 0 0 0 0.00 0.13 0.13 0.00 0.13 Matches
306 296 Mark Zothanpuia in IND MF Hyderabad FC 19 2002 3 0 63 0.7 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 Matches
307 rows × 24 columns
Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
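If the positional index [2] feels fragile, read_html can also pick a table by its HTML attributes. The id below is an assumption based on the page's #all_stats_standard anchor, so check it against the page source:
# select the player table by id rather than by position (id is assumed)
df = pd.read_html(response, header=1, attrs={'id': 'stats_standard'})[0]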

Expand time series data in pandas dataframe

I am attempting to interpolate between time points for all data in a pandas dataframe. My current data is in time increments of 0.04 seconds. I want it to be in increments of 0.01 seconds to match another data set. I realize I can use the DataFrame.interpolate() function to do this. However, I am stuck on how to insert 3 rows of NaN in-between every row of my dataframe in an efficient manner.
import pandas as pd
import numpy as np
df = pd.DataFrame(data={"Time": [0.0, 0.04, 0.08, 0.12],
                        "Pulse": [76, 74, 77, 80],
                        "O2": [99, 100, 99, 98]})
df_ins = pd.DataFrame(data={"Time": [np.nan, np.nan, np.nan],
                            "Pulse": [np.nan, np.nan, np.nan],
                            "O2": [np.nan, np.nan, np.nan]})
I want df to transform from this:
Time Pulse O2
0 0.00 76 99
1 0.04 74 100
2 0.08 77 99
3 0.12 80 98
To something like this:
Time Pulse O2
0 0.00 76 99
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 0.04 74 100
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 0.08 77 99
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 0.12 80 98
Which I can then call on
df = df.interpolate()
Which would yield something like this (I'm making up the numbers here):
Time Pulse O2
0 0.00 76 99
1 0.01 76 99
2 0.02 75 99
3 0.03 74 100
4 0.04 74 100
5 0.05 75 100
6 0.06 76 99
7 0.07 77 99
8 0.08 77 99
9 0.09 77 99
10 0.10 78 98
11 0.11 79 98
12 0.12 80 98
I attempted to use an iterrows technique by inserting the df_ins frame after every row. But my index was thrown off during the iteration. I also tried slicing df and concatenating the df slices and df_ins, but once again the indexes were thrown off by the loop.
Does anyone have any recommendations on how to do this efficiently?
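One way to build exactly the NaN-padded frame described above, without looping, is to stretch the index and reindex; a minimal sketch using the df defined in the question:
# spread the original rows to every 4th position, leaving NaN gaps
df.index = df.index * 4                    # 0, 4, 8, 12
df = df.reindex(range(df.index[-1] + 1))   # inserts 3 NaN rows between each row
df = df.interpolate()                      # linear fill; Time becomes 0.00-0.12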
Use resample here (replace ffill with your desired behavior; interpolate is shown further down):
df["Time"] = pd.to_timedelta(df["Time"], unit="S")
df.set_index("Time").resample("0.01S").ffill()
Pulse O2
Time
00:00:00 76 99
00:00:00.010000 76 99
00:00:00.020000 76 99
00:00:00.030000 76 99
00:00:00.040000 74 100
00:00:00.050000 74 100
00:00:00.060000 74 100
00:00:00.070000 74 100
00:00:00.080000 77 99
00:00:00.090000 77 99
00:00:00.100000 77 99
00:00:00.110000 77 99
00:00:00.120000 80 98
If you do want to interpolate:
df.set_index("Time").resample("0.01S").interpolate()
Pulse O2
Time
00:00:00 76.00 99.00
00:00:00.010000 75.50 99.25
00:00:00.020000 75.00 99.50
00:00:00.030000 74.50 99.75
00:00:00.040000 74.00 100.00
00:00:00.050000 74.75 99.75
00:00:00.060000 75.50 99.50
00:00:00.070000 76.25 99.25
00:00:00.080000 77.00 99.00
00:00:00.090000 77.75 98.75
00:00:00.100000 78.50 98.50
00:00:00.110000 79.25 98.25
00:00:00.120000 80.00 98.00
I believe using np.linspace and processing column-wise should be faster than interpolate (especially if your Time column is not in an actual datetime format):
import numpy as np
import pandas as pd

new_dict = {}
for c in df.columns:
    arr = df[c]
    ret = []
    for i in range(1, len(arr)):
        # 3 evenly spaced points strictly between each consecutive pair of values
        ret.append(np.linspace(arr[i-1], arr[i], 4, endpoint=False)[1:])
    new_dict[c] = np.concatenate(ret)
pd.concat([df, pd.DataFrame(new_dict)]).sort_values('Time').reset_index(drop=True)
Time Pulse O2
0 0.00 76.00 99.00
1 0.01 75.50 99.25
2 0.02 75.00 99.50
3 0.03 74.50 99.75
4 0.04 74.00 100.00
5 0.05 74.75 99.75
6 0.06 75.50 99.50
7 0.07 76.25 99.25
8 0.08 77.00 99.00
9 0.09 77.75 98.75
10 0.10 78.50 98.50
11 0.11 79.25 98.25
12 0.12 80.00 98.00

Pandas: find the n lowest values each m rows

I have a dataframe with a counter column, increasing by 1 every 24 rows, and a value column, like below.
value counter
0 0.00 1
1 0.00 1
2 0.00 1
3 0.00 1
4 0.00 1
5 0.00 1
6 0.00 1
7 0.00 1
8 55.00 1
9 90.00 1
10 49.27 1
11 51.80 1
12 49.06 1
13 43.46 1
14 45.96 1
15 43.95 1
16 45.00 1
17 43.97 1
18 42.00 1
19 41.14 1
20 43.92 1
21 51.74 1
22 40.85 1
23 0.00 2
24 0.00 2
25 0.00 2
26 0.00 2
27 0.00 2
28 0.00 2
29 0.00 2
... ... ...
187 82.38 9
188 66.89 9
189 59.83 9
190 52.46 9
191 40.48 9
192 28.87 9
193 41.90 9
194 42.56 9
195 40.93 9
196 40.02 9
197 36.54 9
198 33.70 9
199 38.99 9
200 46.10 9
201 44.82 9
202 0.00 9
203 0.00 9
204 0.00 9
205 0.00 9
206 0.00 9
207 0.00 10
208 0.00 10
209 0.00 10
210 74.69 10
211 89.20 10
212 74.59 10
213 55.11 10
214 58.39 10
215 40.81 10
216 45.06 10
I would like to know if there is a way to create a third column with the 4 lowest values in each group of rows sharing the same counter value. See below an example for the first group, with counter=1:
value counter value 2
0 0.00 1 0.00
1 0.00 1 0.00
2 0.00 1 0.00
3 0.00 1 0.00
4 0.00 1 0.00
5 0.00 1 0.00
6 0.00 1 0.00
7 0.00 1 0.00
8 55.00 1 0.00
9 90.00 1 0.00
10 49.27 1 0.00
11 51.80 1 0.00
12 49.06 1 0.00
13 43.46 1 43.46
14 45.96 1 0.00
15 43.95 1 0.00
16 45.00 1 0.00
17 43.97 1 0.00
18 42.00 1 42.00
19 41.14 1 41.14
20 43.92 1 0.00
21 51.74 1 0.00
22 40.85 1 40.85
I know about functions like nsmallest(n, 'column') but I don't know how to limit it to each counter group.
Any ideas? Thank you in advance!
I think you need to first filter out the rows where value is 0, sort with sort_values, take the 4 lowest values per group with GroupBy.head, and finally reindex to fill 0 for the non-matched rows:
df['value 2'] = (df[df['value'] != 0]
                   .sort_values('value')
                   .groupby('counter')['value']
                   .head(4)
                   .reindex(df.index, fill_value=0))
print(df)
value counter value 2
0 0.00 1 0.00
1 0.00 1 0.00
2 0.00 1 0.00
3 0.00 1 0.00
4 0.00 1 0.00
5 0.00 1 0.00
6 0.00 1 0.00
7 0.00 1 0.00
8 55.00 1 0.00
9 90.00 1 0.00
10 49.27 1 0.00
11 51.80 1 0.00
12 49.06 1 0.00
13 43.46 1 43.46
14 45.96 1 0.00
15 43.95 1 0.00
16 45.00 1 0.00
17 43.97 1 0.00
18 42.00 1 42.00
19 41.14 1 41.14
20 43.92 1 0.00
21 51.74 1 0.00
22 40.85 1 40.85
23 0.00 2 0.00
24 0.00 2 0.00
25 0.00 2 0.00
26 0.00 2 0.00
27 0.00 2 0.00
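Since the question mentions nsmallest, here is an equivalent sketch with GroupBy.nsmallest, assuming the same column names:
nonzero = df[df['value'] != 0]
# nsmallest returns a (counter, original index) MultiIndex; drop the group level
lowest = nonzero.groupby('counter')['value'].nsmallest(4)
df['value 2'] = lowest.droplevel(0).reindex(df.index, fill_value=0)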

How to create multiple spacing CSV from pandas?

I have a pandas dataframe and want to output a text file with columns separated by varying amounts of spacing, for input to another model. How can I do that?
The sample OUTPUT text file is as follows (each column in the text file corresponds to a column in df):
SO HOUREMIS 92 5 1 1 MC12 386.91 389.8 11.45
SO HOUREMIS 92 5 1 1 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED2 322.00 397.4 13.00
SO HOUREMIS 92 5 1 1 HL2 25.55 464.3 7.46
SO HOUREMIS 92 5 1 1 WC1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 WC2 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC12 405.35 389.3 11.54
SO HOUREMIS 92 5 1 2 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED2 319.90 396.3 13.00
After referring to this post, I found the solution:
fmt = '%0s %+1s %+1s %+2s %+2s %+2s %+6s %+15s'
np.savetxt('test.txt', data.values[0:10], fmt=fmt)
I can format each column and specify its spacing and alignment.
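For finer control, each %-specifier can carry an explicit width and type. A sketch with illustrative widths, assuming df has the ten columns shown in the sample output above:
import numpy as np

# negative widths left-align strings; d formats integers; .2f/.1f format floats
fmt = '%-2s %-8s %2d %d %d %d %-4s %7.2f %6.1f %5.2f'
np.savetxt('test.txt', df.values, fmt=fmt)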

Inner join on 2 columns for two dataframes python

I have 2 dataframes named geostat and geostat_query. I am trying to do an inner join on 2 columns. The code that I have written is giving me an empty result.
My dataframes are:
geostat:
STATE COUNT PERCENT state pool number STATE CODE
0 0.00 251 CA
1 0.00 252 CA
2 0.00 253 CA
3 0.00 787 CA
4 0.00 789 CA
5 0.00 4401 CA
6 0.00 4402 CA
7 0.00 4403 CA
8 0.00 4404 CA
9 0.00 4406 CA
10 0.00 4568 CA
11 0.00 4569 FL
12 0.00 4576 CA
13 0.00 4577 CA
14 0.00 4578 CA
15 0.00 4579 CA
16 0.00 4580 CA
17 0.00 4581 CA
18 0.00 4582 CA
19 0.00 4584 CA
20 0.00 4585 CA
21 0.00 4588 CA
22 0.00 4589 CA
23 0.00 4591 CA
24 0.00 4592 CA
25 0.00 4593 CA
26 0.00 4594 FL
27 0.00 4595 CA
28 0.00 4595 FL
29 0.00 6221 MS
30 0.00 817085 GA
31 0.03 817085 IL
32 0.03 817085 IN
33 0.03 817085 MA
34 0.03 817085 ME
35 0.07 817085 MI
36 0.07 817085 MO
37 0.03 817085 NE
38 0.07 817085 OH
39 0.03 817085 PA
40 0.03 817085 SC
41 0.03 817085 SD
42 0.03 817085 TX
43 0.07 817085 WI
44 0.08 817094 AL
45 0.09 817094 CA
geostat_query:
MaxOfState count percent state pool number
0 100 251
1 100 252
2 100 253
3 100 787
4 100 789
5 100 4401
6 100 4402
7 100 4403
8 100 4404
9 100 4406
10 100 4568
11 100 4569
12 100 4576
13 100 4577
14 100 4578
15 100 4579
16 100 4580
17 100 4581
18 100 4582
19 100 4584
20 100 4585
21 100 4588
22 100 4589
23 100 4591
24 100 4592
25 100 4593
26 100 4594
27 75 4595
28 100 6221
29 100 8194
The code I wrote is :
geomerge = geostat.merge(geostat_query,
                         left_on=['STATE COUNT PERCENT', 'state pool number'],
                         right_on=['MaxOfState count percent', 'state pool number'],
                         how='inner')
But this gives me an empty result. I don't understand where I am going wrong.
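A common cause of an empty inner merge is mismatched key dtypes between the two frames (for example, pool numbers stored as strings on one side and integers on the other); a quick diagnostic sketch, assuming the frames above:
# keys must have identical dtypes on both sides for merge to match them
print(geostat[['STATE COUNT PERCENT', 'state pool number']].dtypes)
print(geostat_query[['MaxOfState count percent', 'state pool number']].dtypes)

# if the pool numbers differ in type, align them before merging
geostat['state pool number'] = geostat['state pool number'].astype('int64')
geostat_query['state pool number'] = geostat_query['state pool number'].astype('int64')
Also note that in the rows shown, the percent columns are on different scales (0.00-0.09 versus 75-100), so even with matching dtypes an inner join on that pair of columns may genuinely have no matching rows.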
