Using pd.to_datetime to convert "object" column into %HH:MM:SS - python

I am doing some exploratory data analysis using finish-time data scraped from the 2018 KONA IRONMAN. I formatted the scraped data as JSON and used pandas to write it to CSV. The 'swim', 'bike', and 'run' columns should be formatted as %HH:MM:SS to be operable; however, I am receiving a ValueError: ('Unknown string format:', '--:--:--').
print(data.head(2))
print(kona.info())
print(kona.describe())
Name div_rank ... bike run
0 Avila, Anthony 2470 138 ... 05:27:59 04:31:56
1 Lindgren, Mikael 1050 151 ... 05:17:51 03:49:20
swim 2472 non-null object
bike 2472 non-null object
run 2472 non-null object
Name div_rank ... bike run
count 2472 2472 ... 2472 2472
unique 2472 288 ... 2030 2051
top Jara, Vicente 986 -- ... --:--:-- --:--:--
freq 1 165 ... 122 165
How should I use pd.to_datetime to properly format the 'bike', 'swim', and 'run' columns, and, for future use, sum these columns and append a 'Total Finish Time' column? Thanks!

The error occurs because pd.to_datetime can't parse the string '--:--:--' as a time. You could convert all of those to '00:00:00', but that would imply those athletes finished the event in zero time. The other option is to convert only the times that are present, leaving a null in the places that don't have a time. Converting to datetime also attaches a default date of 1900-01-01, so I appended .dt.time so that only the time displays.
timed_events = ['bike', 'swim', 'run']
for event in timed_events:
    result[event] = pd.to_datetime(result[result[event] != '--:--:--'][event], format="%H:%M:%S").dt.time
The problem with this, though, is that I remember you wanted to sum those times, which would require extra conversions, because datetime.time objects can't be added. So I'm suggesting .to_timedelta() instead. It works the same way, in that you'd still need to exclude the '--:--:--' rows, but then you can sum the times. I also added a column for the number of events completed, so that if you want to sort by best times you can filter out anyone who hasn't competed in all three events, as obviously they'd have better totals because they are missing entire events; the full code at the end shows both.
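As a minimal sketch of the difference (toy values, not your data): datetime.time objects can't be added together, while timedeltas sum naturally.
import pandas as pd

s = pd.Series(['01:01:26', '00:52:56', '--:--:--'])  # toy 'swim' column
td = pd.to_timedelta(s[s != '--:--:--'])             # parse only the real times
print(td.sum())                                      # 0 days 01:54:22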
I'll also add, regarding the comment of:
"You think providing all the code will be helpful but it does not. You
will get a quicker and more useful response if you keep the code
minimum that can replicate your issue.stackoverflow.com/help/mcve –
mad_ "
I'll give him the benefit of the doubt: seeing the whole code, he may not have realized that what you provided was already the minimal code needed to replicate your issue, since no one wants to write code to generate your data to work with. Sometimes you can state that explicitly in your question, i.e.:
Here's the code to generate my data:
CODE PART 1
import bs4
import pandas as pd
code...
But now that I have the data, here's where I'm having trouble:
df = pd.to_timedelta()...
...
Luckily I remembered helping you earlier on this, so I knew I could go back and get that code. So the code you originally had was fine.
But here's the full code I used, which stores the csv differently from what you originally had. You can change that part, but the end part is what you'll need:
from bs4 import BeautifulSoup, Comment
from collections import defaultdict
import requests
import pandas as pd

sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'
r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')
    for node in my_table.children:
        if isinstance(node, Comment):
            # Get content and strip comment "<!--" and "-->"
            # Wrap the rows in "table" tags as well.
            data = '<table>{}</table>'.format(node[4:-3])
            break
    table = BeautifulSoup(data, 'html.parser')
    for row in table.find_all('tr'):
        name, _, swim, bike, run, div_rank, gender_rank, overall_rank = [col.text.strip() for col in row.find_all('td')[1:]]
        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
        })
    return result

jsonObj = parse_table(soup)
result = pd.DataFrame()
for k, v in jsonObj.items():
    temp_df = pd.DataFrame.from_dict(v)
    temp_df['name'] = k
    result = result.append(temp_df)
result = result.reset_index(drop=True)
result.to_csv('C:/data.csv', index=False)
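(A side note, not part of the original answer: DataFrame.append was removed in pandas 2.0, so on current pandas the accumulation loop above needs pd.concat instead. A sketch:)
frames = []
for k, v in jsonObj.items():
    temp_df = pd.DataFrame.from_dict(v)
    temp_df['name'] = k
    frames.append(temp_df)
result = pd.concat(frames, ignore_index=True)  # one concat replaces the repeated .append calls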
# However you read in your csv/dataframe, use the code below on it to get those times
timed_events = ['bike', 'swim', 'run']
for event in timed_events:
    result[event] = pd.to_timedelta(result[result[event] != '--:--:--'][event])
result['total_events_participated'] = 3 - result.isnull().sum(axis=1)
result['total_times'] = result[timed_events].sum(axis=1)
Output:
print (result)
bike div_rank ... total_events_participated total_times
0 05:27:59 138 ... 3 11:20:06
1 05:17:51 151 ... 3 10:16:17
2 06:14:45 229 ... 3 14:48:28
3 05:13:56 162 ... 3 10:19:03
4 05:19:10 6 ... 3 09:51:48
5 04:32:26 25 ... 3 08:23:26
6 04:49:08 155 ... 3 10:16:16
7 04:50:10 216 ... 3 10:55:47
8 06:45:57 71 ... 3 13:50:28
9 05:24:33 178 ... 3 10:21:35
10 06:36:36 17 ... 3 14:36:59
11 NaT -- ... 0 00:00:00
12 04:55:29 100 ... 3 09:28:53
13 05:39:18 72 ... 3 11:44:40
14 04:40:41 -- ... 2 05:35:18
15 05:23:18 45 ... 3 10:55:27
16 05:15:10 3 ... 3 10:28:37
17 06:15:59 78 ... 3 11:47:24
18 NaT -- ... 0 00:00:00
19 07:11:19 69 ... 3 15:39:51
20 05:49:02 29 ... 3 10:32:36
21 06:45:48 4 ... 3 13:39:17
22 04:39:46 -- ... 2 05:48:38
23 06:03:01 3 ... 3 11:57:42
24 06:24:58 193 ... 3 13:52:57
25 05:07:42 116 ... 3 10:01:24
26 04:44:46 112 ... 3 09:29:22
27 04:46:06 55 ... 3 09:32:43
28 04:41:05 69 ... 3 09:31:32
29 05:27:55 68 ... 3 11:09:37
... ... ... ... ...
2442 NaT -- ... 0 00:00:00
2443 05:26:40 3 ... 3 11:28:53
2444 05:04:37 19 ... 3 10:27:13
2445 04:50:45 74 ... 3 09:15:14
2446 07:17:40 120 ... 3 14:46:05
2447 05:26:32 45 ... 3 10:50:48
2448 05:11:26 186 ... 3 10:26:00
2449 06:54:15 185 ... 3 14:05:16
2450 05:12:10 22 ... 3 11:21:37
2451 04:59:44 45 ... 3 09:29:43
2452 06:03:59 96 ... 3 12:12:35
2453 06:07:27 16 ... 3 12:47:11
2454 04:38:06 91 ... 3 09:52:27
2455 04:41:56 14 ... 3 08:58:46
2456 04:38:48 85 ... 3 09:18:31
2457 04:42:30 42 ... 3 09:07:29
2458 04:40:54 110 ... 3 09:32:34
2459 06:08:59 37 ... 3 12:15:23
2460 04:32:20 -- ... 2 05:31:05
2461 04:45:03 96 ... 3 09:30:06
2462 06:14:29 95 ... 3 13:38:54
2463 06:00:20 164 ... 3 12:10:03
2464 05:11:07 22 ... 3 10:32:35
2465 05:56:06 188 ... 3 13:32:48
2466 05:09:26 2 ... 3 09:54:55
2467 05:22:15 7 ... 3 10:26:14
2468 05:53:14 254 ... 3 12:34:21
2469 05:00:29 156 ... 3 10:18:29
2470 04:30:46 7 ... 3 08:38:23
2471 04:34:59 39 ... 3 09:04:13
[2472 rows x 9 columns]

Related

fastest way to access dataframe cell by column values?

I have the following dataframe :
time bk1_lvl0_id bk2_lvl0_id pr_ss order_upto_level initial_inventory leadtime1 leadtime2 adjusted_leadtime
0 2020 1000 3 16 18 17 3 0.100000 1
1 2020 10043 3 65 78 72 12 0.400000 1
2 2020 1005 3 0 1 1 9 0.300000 1
3 2020 1009 3 325 363 344 21 0.700000 1
4 2020 102 3 0 1 1 7 0.233333 1
I want a function to get the pr_ss for, for example, (bk1_lvl0_id=1000, bk2_lvl0_id=3).
This is the code I've tried, but it takes time:
def get_safety_stock(df, bk1, bk2):
    ## a function that returns the safety stock for any given (bk1, bk2)
    for index, row in df.iterrows():
        if (row["bk1_lvl0_id"] == bk1) and (row["bk2_lvl0_id"] == bk2):
            return int(row["pr_ss"])
If your dataframe has no duplicate values based on bk1_lvl0_id and bk2_lvl0_id, you can make the function as follows:
def get_safety_stock(df, bk1, bk2):
    return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'].iloc[0]
Note that this accesses the first value of the filtered Series positionally (.iloc[0] rather than [0], which would look up the row labelled 0 and can raise a KeyError), which shouldn't be an issue if there are no duplicates in the data. If you want all of them, just remove the .iloc[0] from the end and it will give you the whole Series. This can be called as follows:
get_safety_stock(df, 1000,3)
>>>16
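If you will do many lookups, a hedged alternative (my addition, assuming the (bk1_lvl0_id, bk2_lvl0_id) pairs are unique) is to build a MultiIndex once and use .at, which does a direct label lookup instead of scanning the whole column on every call:
indexed = df.set_index(['bk1_lvl0_id', 'bk2_lvl0_id'])

def get_safety_stock_fast(bk1, bk2):
    # direct label lookup on the prebuilt index
    return int(indexed.at[(bk1, bk2), 'pr_ss'])

get_safety_stock_fast(1000, 3)
>>>16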

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
I can't seem to use this as a regular dataframe, and the datetime is off: instead of labelling each monthly mean by month, it gives back the last day of the month. Also, the station name is only an index level, not a full column, and the mean value doesn't have a "column name" at all. This isn't a dataframe but a pandas.core.series.Series, and even after calling .to_frame() on it the index stays in this shape. I don't get this part.
I found that, in order to return a normal dataframe, you can use
as_index=False
in the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index=False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just add the month later, as not every station has an observation in every month.
I've tried other methods, such as
df.resample("M").mean()
But that doesn't seem to work per station; it returns the mean of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
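One caveat worth adding (an assumption about how the data was read in, not part of the original answer): .dt only works on a datetime64 column, so if Date came in as strings, convert it first:
df['Date'] = pd.to_datetime(df['Date'])  # required before df['Date'].dt.to_period('M')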

Column has dtype object, cannot use method 'nlargest' with this dtype

I'm using Google Colab and I want to analyze a file from Google Sheets using pandas. I imported it successfully and I can print it out as a pandas DataFrame:
data_tablet = gc.open_by_url(f'https://docs.google.com/spreadsheets/d/{sheet_id}/edit#gid={tablet_gid}')
tablet_var = data_tablet.worksheet('tablet')
tablet_data = tablet_var.get_all_records()
df_tablet = pd.DataFrame(tablet_data)
print(df_tablet)
name 1st quarter ... 4th quarter total
0 Albendazol 400 mg 18.0 ... 60.0 78
1 Alopurinol 100 mg 125.0 ... 821.0 946
2 Ambroksol 30 mg 437.0 ... 798.0 1,235.00
3 Aminofilin 200 mg 70.0 ... 522.0 592
4 Amitriptilin 25 mg 83.0 ... 178.0 261
.. ... ... ... ... ...
189 Levoflaksin 250 mg 611.0 ... 822.0 1,433.00
190 Linezolid 675.0 ... 315.0 990
191 Moxifloxacin 400 mg 964.0 ... 99.0 1,063.00
192 Pyrazinamide 500 mg 395.0 ... 189.0 584
193 Vitamin B 6 330.0 ... 825.0 1,155.00
[194 rows x 6 columns]
I want to select the top 10 of the 194 items by total, but it did not work.
Selecting the top 10 from total by running the command below, I get cannot use method 'nlargest' with this dtype:
# Ambil data 10 terbesar dari 194 item
df_tablet_top10 = df_tablet.nlargest(10, 'total')
print(df_tablet_top10)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-a7295330f7a9> in <module>()
1 # Ambil data 10 terbesar dari 194 item
----> 2 df_tablet_top10 = df_tablet.nlargest(10, 'total')
3 print(df_tablet_top10)
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/algorithms.py in compute(self, method)
1273 if not self.is_valid_dtype_n_method(dtype):
1274 raise TypeError(
-> 1275 f"Column {repr(column)} has dtype {dtype}, "
1276 f"cannot use method {repr(method)} with this dtype"
1277 )
TypeError: Column 'total' has dtype object, cannot use method 'nlargest' with this dtype
But when I select it from 1st quarter it works just fine
df_tablet_top10 = df_tablet.nlargest(10, '1st quarter')
print(df_tablet_top10)
nama 1st quarter ... 4th quarter total
154 Salbutamol 4 mg 981.0 ... 23.0 1,004.00
74 MDT FB dewasa (obat kusta) 978.0 ... 910.0 1,888.00
155 Paracetamol 500 mg Tablet 976.0 ... 503.0 1,479.00
33 Furosemid 40 mg 975.0 ... 524.0 1,499.00
23 Deksametason 0,5 mg 972.0 ... 793.0 1,765.00
21 Bisakodil (dulkolax) 5 mg 970.0 ... 798.0 1,768.00
191 Moxifloxacin 400 mg 964.0 ... 99.0 1,063.00
85 Metronidazol 250 mg 958.0 ... 879.0 1,837.00
96 Nistatin 500.000 IU 951.0 ... 425.0 1,376.00
37 Glimepirid 2 mg 947.0 ... 890.0 1,837.00
[10 rows x 6 columns]
Any idea what causes this to happen?
Also, I changed the format of the columns from 1st quarter through total to Number in Google Sheets, and it still did not work.
I found the solution. (The explanation, as far as I can tell: the column has dtype object because some of its values come in from the sheet as strings rather than numbers, and nlargest only works on numeric dtypes.) All I did was convert the total column to float:
df_tablet['total'] = df_tablet['total'].astype(float)
df_tablet_top10 = df_tablet.nlargest(10, 'total')
print(df_tablet_top10)
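One caveat (my addition): if total holds strings with thousands separators, as the printed values like '1,235.00' suggest, a plain astype(float) raises a ValueError. A sketch that strips the commas first:
df_tablet['total'] = (df_tablet['total'].astype(str)
                                        .str.replace(',', '', regex=False)
                                        .astype(float))
df_tablet_top10 = df_tablet.nlargest(10, 'total')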

How could I create sum of column itself dynamically in python

My raw data is:
def f_ST(ST,F,T):
    a=ST/F-1-np.log(ST/F)
    return 2*a/T

df=pd.DataFrame(range(50,140,5),columns=['K'])
df['f(K0)']=df.apply(lambda x: f_ST(x.K,100,0.25),axis=1)
df['f(K1)']=df['f(K0)'].shift(-1)
df['dK']=df['K'].diff(1)
The thing I want to do is: I have a function f(k)
f(k) = (k - 100)/100 - ln(k/100)
I want to calculate w, which goes through the following steps:
get the 1-period forward value of f(k), then calculate
tmpw(k) = (f1_f(k) - f(k)) / dk
w is calculated as
w[0] = tmpw[0]
w[n] = tmpw[n] - (w[0] + w[1] + ... + w[n-1])
And the result looks like
nbr date k f(k) f1_f(k) d_k tmpw w
10 2019-02-19 100 0.000000 0.009679 5.0 0.001936 0.001936
11 2019-02-19 105 0.009679 0.037519 5.0 0.005568 0.003632
12 2019-02-19 110 0.037519 0.081904 5.0 0.008877 0.003309
13 2019-02-19 115 0.081904 0.141428 5.0 ...
14 2019-02-19 120 0.141428 0.214852 5.0 ...
15 2019-02-19 125 0.214852 0.301086 5.0
16 2019-02-19 130 0.301086 0.399163 5.0
Question: could anyone help derive quick code (not a mathematical solution) without using a loop? Thanks a lot!
I don't fully understand your question; all that notation was a bit confusing to me.
If I got what you want right, for every row you want an accumulated value of all previous rows, and then the value of another column in that row is calculated based on this accumulated value.
In that case I would prefer to calculate an accumulated column first and use it later.
For example (note you need to call list(range()) instead of range, since your example throws an error otherwise):
import pandas as pd
import numpy as np

def f_ST(ST,F,T):
    a=ST/F-1-np.log(ST/F)
    return 2*a/T

df=pd.DataFrame(list(range(50,140,5)),columns=['K'])
df['f(K0)']=df.apply(lambda x: f_ST(x.K,100,0.25),axis=1)
df['f(K1)']=df['f(K0)'].shift(-1)
df['dK']=df['K'].diff(1)
df['accumulate'] = df['K'].shift(1).cumsum()
df['currentVal-accumulated'] = df['K'] - df['accumulate']
print(df)
prints:
K f(K0) ... accumulate currentVal-accumulated
0 50 1.545177 ... NaN NaN
1 55 1.182696 ... 50.0 5.0
2 60 0.886605 ... 105.0 -45.0
3 65 0.646263 ... 165.0 -100.0
4 70 0.453400 ... 230.0 -160.0
5 75 0.301457 ... 300.0 -225.0
6 80 0.185148 ... 375.0 -295.0
7 85 0.100151 ... 455.0 -370.0
8 90 0.042884 ... 540.0 -450.0
9 95 0.010346 ... 630.0 -535.0
10 100 0.000000 ... 725.0 -625.0
11 105 0.009679 ... 825.0 -720.0
12 110 0.037519 ... 930.0 -820.0
13 115 0.081904 ... 1040.0 -925.0
14 120 0.141428 ... 1155.0 -1035.0
15 125 0.214852 ... 1275.0 -1150.0
16 130 0.301086 ... 1400.0 -1270.0
17 135 0.399163 ... 1530.0 -1395.0
[18 rows x 6 columns]
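As for the actual recurrence in the question (my addition, derived from the rules as stated): since w[n] = tmpw[n] - (w[0] + ... + w[n-1]), the running sum w[0] + ... + w[n] equals tmpw[n] itself, so w is just the first difference of tmpw with w[0] = tmpw[0]. A loop-free sketch using the tmpw values from the question's table:
import pandas as pd

tmpw = pd.Series([0.001936, 0.005568, 0.008877])
w = tmpw.diff()           # w[n] = tmpw[n] - tmpw[n-1]
w.iloc[0] = tmpw.iloc[0]  # diff leaves NaN in the first slot
print(w)                  # 0.001936, 0.003632, 0.003309 - matches the expected w column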

How can I extract only numbers from this column?

Suppose you have a column in Excel with values like this. There are only 5500 numbers present, but it shows length 5602, which means 102 strings are present:
4 SELECTIO
6 N NO
14 37001
26 37002
38 37003
47 37004
60 37005
73 37006
82 37007
92 37008
105 37009
119 37010
132 37011
143 37012
157 37013
168 37014
184 37015
196 37016
207 37017
220 37018
236 37019
253 37020
267 37021
280 37022
287 Krishan
290 37023
300 37024
316 37025
337 37026
365 37027
...
74141 42471
74154 42472
74169 42473
74184 42474
74200 42475
74216 42476
74233 42477
74242 42478
74256 42479
74271 42480
74290 42481
74309 42482
74323 42483
74336 42484
74350 42485
74365 42486
74378 42487
74389 42488
74398 42489
74413 42490
74430 42491
74446 42492
74459 42493
74474 42494
74491 42495
74504 42496
74516 42497
74530 42498
74544 42499
74558 42500
Name: Selection No., Length: 5602, dtype: object
and I want to get only the numeric values, like this, in Python using pandas:
37001
37002
37003
37004
37005
How can I do this? I have attached my code in Python using pandas:
import re

def selection(sle):
    if sle in re.match('[3-4][0-9]{4}', sle):
        return 1
    else:
        return 0

select['status'] = select['Selection No.'].apply(selection)
and now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy with np.isreal to select only the numbers:
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
SELECTIO some_col
0 N NO 4
1 37002 6
2 37003 14
3 Krishan 26
4 37004 38
5 singh 47
6 37005 60
>>> df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Or just another approach, using the numbers module with a lambda:
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Note: there may be a problem with how you extract the column. You are using ['Selection No.'], but if the name actually has a trailing space it will be ['Selection No. '], which is why you would get a KeyError when executing it; try and see!
Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}', sle) tries to find the column value sle IN the match object, which "always have a boolean value of True", and re.match returns None when there's no match, which is where your "argument of type 'NoneType' is not iterable" error comes from.
I would suggest proceeding with the pd.Series.str.isnumeric function:
In [544]: df
Out[544]:
Selection No.
0 37001
1 37002
2 37003
3 asnsh
4 37004
5 singh
6 37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
Selection No. Status
0 37001 1
1 37002 1
2 37003 1
3 asnsh 0
4 37004 1
5 singh 0
6 37005 1
If a strict regex pattern is required, use the pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
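And since the title asks for the numbers themselves rather than a flag, a sketch (my addition, reusing the same pattern) that keeps only the matching rows and casts them to int:
mask = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True)
numbers = df.loc[mask, 'Selection No.'].astype(int)  # 37001, 37002, 37003, ...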
