How to read .xls files into TensorFlow (Python)

I have quite a big problem reading an .xls file into my machine learning project. The data I need to extract is saved in an .xls file, and I can't find any easy way to load it into a TensorFlow dataset. Can anyone help?
Link to the data:
"http://archive.ics.uci.edu/ml/machine-learning-databases/00192/BreastTissue.xls"

Try the pandas module:
import pandas as pd
In [24]: df = pd.read_excel(r'D:\download\BreastTissue.xls', sheet_name='Data')
In [25]: df
Out[25]:
Case # Class I0 PA500 HFS DA Area A/DA Max IP DR P
0 1 car 524.794072 0.187448 0.032114 228.800228 6843.598481 29.910803 60.204880 220.737212 556.828334
1 2 car 330.000000 0.226893 0.265290 121.154201 3163.239472 26.109202 69.717361 99.084964 400.225776
2 3 car 551.879287 0.232478 0.063530 264.804935 11888.391827 44.894903 77.793297 253.785300 656.769449
3 4 car 380.000000 0.240855 0.286234 137.640111 5402.171180 39.248524 88.758446 105.198568 493.701814
4 5 car 362.831266 0.200713 0.244346 124.912559 3290.462446 26.342127 69.389389 103.866552 424.796503
5 6 car 389.872978 0.150098 0.097738 118.625814 2475.557078 20.868620 49.757149 107.686164 429.385788
6 7 car 290.455141 0.144164 0.053058 74.635067 1189.545213 15.938154 35.703331 65.541324 330.267293
7 8 car 275.677393 0.153938 0.187797 91.527893 1756.234837 19.187974 39.305183 82.658682 331.588302
8 9 car 470.000000 0.213105 0.225497 184.590057 8185.360837 44.343455 84.482483 164.122511 603.315715
9 10 car 423.000000 0.219562 0.261799 172.371241 6108.106297 35.435762 79.056351 153.172903 558.274515
.. ... ... ... ... ... ... ... ... ... ... ...
96 97 adi 1650.000000 0.047647 0.043284 274.426177 5824.895192 21.225727 81.239571 262.125656 1603.070348
97 98 adi 2800.000000 0.083078 0.184307 583.259257 31388.652882 53.815953 298.582977 501.038494 2896.582483
98 99 adi 2329.840138 0.066148 0.353255 377.253368 25369.039925 67.246689 336.075165 171.387227 2686.435346
99 100 adi 2400.000000 0.084125 0.220610 596.041956 37939.255571 63.651988 261.348175 535.689409 2447.772353
100 101 adi 2000.000000 0.067195 0.124267 330.271646 15381.097687 46.571051 169.197983 283.639564 2063.073212
101 102 adi 2000.000000 0.106989 0.105418 520.222649 40087.920984 77.059161 204.090347 478.517223 2088.648870
102 103 adi 2600.000000 0.200538 0.208043 1063.441427 174480.476218 164.071543 418.687286 977.552367 2664.583623
103 104 adi 1600.000000 0.071908 -0.066323 436.943603 12655.342135 28.963331 103.732704 432.129749 1475.371534
104 105 adi 2300.000000 0.045029 0.136834 185.446044 5086.292497 27.427344 178.691742 49.593290 2480.592151
105 106 adi 2600.000000 0.069988 0.048869 745.474369 39845.773698 53.450226 154.122604 729.368395 2545.419744
[106 rows x 11 columns]
In [26]: df.dtypes
Out[26]:
Case # int64
Class object
I0 float64
PA500 float64
HFS float64
DA float64
Area float64
A/DA float64
Max IP float64
DR float64
P float64
dtype: object
In [27]: df.shape
Out[27]: (106, 11)
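
Once the data is in a DataFrame, handing it to TensorFlow is straightforward. A minimal sketch, assuming TensorFlow 2.x, treating the Class column as the label and the remaining numeric columns as features (the batch size of 16 is arbitrary, and reading .xls files may require the xlrd package):

import pandas as pd
import tensorflow as tf

df = pd.read_excel('BreastTissue.xls', sheet_name='Data')

# Encode the string class labels ('car', 'adi', ...) as integer codes.
labels = df.pop('Class').astype('category').cat.codes

# Drop the row identifier and keep the numeric features.
features = df.drop(columns=['Case #']).astype('float32')

dataset = tf.data.Dataset.from_tensor_slices((features.values, labels.values))
dataset = dataset.shuffle(len(features)).batch(16)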

Related

Column has dtype object, cannot use method 'nlargest' with this dtype

I'm using Google Colab and I want to analyze a file from Google Sheets using pandas. I imported the data successfully and I can print it with pd.DataFrame:
data_tablet = gc.open_by_url(f'https://docs.google.com/spreadsheets/d/{sheet_id}/edit#gid={tablet_gid}')
tablet_var = data_tablet.worksheet('tablet')
tablet_data = tablet_var.get_all_records()
df_tablet = pd.DataFrame(tablet_data)
print(df_tablet)
name 1st quarter ... 4th quarter total
0 Albendazol 400 mg 18.0 ... 60.0 78
1 Alopurinol 100 mg 125.0 ... 821.0 946
2 Ambroksol 30 mg 437.0 ... 798.0 1,235.00
3 Aminofilin 200 mg 70.0 ... 522.0 592
4 Amitriptilin 25 mg 83.0 ... 178.0 261
.. ... ... ... ... ...
189 Levoflaksin 250 mg 611.0 ... 822.0 1,433.00
190 Linezolid 675.0 ... 315.0 990
191 Moxifloxacin 400 mg 964.0 ... 99.0 1,063.00
192 Pyrazinamide 500 mg 395.0 ... 189.0 584
193 Vitamin B 6 330.0 ... 825.0 1,155.00
[194 rows x 6 columns]
I want to select the top 10 of the 194 items by the total column, but it did not work. Selecting the top 10 from total with the command below, I get cannot use method 'nlargest' with this dtype:
# Take the 10 largest of the 194 items
df_tablet_top10 = df_tablet.nlargest(10, 'total')
print(df_tablet_top10)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-a7295330f7a9> in <module>()
1 # Take the 10 largest of the 194 items
----> 2 df_tablet_top10 = df_tablet.nlargest(10, 'total')
3 print(df_tablet_top10)
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/algorithms.py in compute(self, method)
1273 if not self.is_valid_dtype_n_method(dtype):
1274 raise TypeError(
-> 1275 f"Column {repr(column)} has dtype {dtype}, "
1276 f"cannot use method {repr(method)} with this dtype"
1277 )
TypeError: Column 'total' has dtype object, cannot use method 'nlargest' with this dtype
But when I select by 1st quarter, it works just fine:
df_tablet_top10 = df_tablet.nlargest(10, '1st quarter')
print(df_tablet_top10)
name 1st quarter ... 4th quarter total
154 Salbutamol 4 mg 981.0 ... 23.0 1,004.00
74 MDT FB dewasa (obat kusta) 978.0 ... 910.0 1,888.00
155 Paracetamol 500 mg Tablet 976.0 ... 503.0 1,479.00
33 Furosemid 40 mg 975.0 ... 524.0 1,499.00
23 Deksametason 0,5 mg 972.0 ... 793.0 1,765.00
21 Bisakodil (dulkolax) 5 mg 970.0 ... 798.0 1,768.00
191 Moxifloxacin 400 mg 964.0 ... 99.0 1,063.00
85 Metronidazol 250 mg 958.0 ... 879.0 1,837.00
96 Nistatin 500.000 IU 951.0 ... 425.0 1,376.00
37 Glimepirid 2 mg 947.0 ... 890.0 1,837.00
[10 rows x 6 columns]
Any idea what causes this to happen?
Also, I changed the format of the 1st quarter through total columns to Number in Google Sheets, and it still did not work.
I found the solution, but not the explanation.
All I did was convert the total column to float with
df_tablet['total'] = df_tablet['total'].astype(float)
df_tablet_top10 = df_tablet.nlargest(10, 'total')
print(df_tablet_top10)
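The likely explanation: get_all_records() returns the formatted cell values, so totals such as '1,235.00' come back as strings and the column gets dtype object, and nlargest only supports numeric dtypes. As a hedged sketch, if a plain astype(float) fails because of the thousands separators, strip them first:

# Remove thousands separators before converting; a plain astype(float)
# raises ValueError on strings like '1,235.00'.
df_tablet['total'] = (
    df_tablet['total']
    .astype(str)
    .str.replace(',', '', regex=False)
    .astype(float)
)
df_tablet_top10 = df_tablet.nlargest(10, 'total')
print(df_tablet_top10)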

How could I create a sum of a column over itself dynamically in Python

My raw data is:
def f_ST(ST, F, T):
    a = ST/F - 1 - np.log(ST/F)
    return 2*a/T

df = pd.DataFrame(range(50, 140, 5), columns=['K'])
df['f(K0)'] = df.apply(lambda x: f_ST(x.K, 100, 0.25), axis=1)
df['f(K1)'] = df['f(K0)'].shift(-1)
df['dK'] = df['K'].diff(1)
The thing I want to do is: I have a function f(k)
f(k) = (k-100)/100 - ln(k/100)
I want to calculate w, which goes through the following steps:
get the 1-period forward value of f(k), then calculate
tmp(k) = (f1_f(k) - f(k)) / dk
w is calculated as
w[0] = tmpw[0]
w[n] = tmpw[n] - (w[0] + w[1] + ... + w[n-1])
And the result looks like:
nbr date k f(k) f1_f(k) d_k tmpw w
10 2019-02-19 100 0.000000 0.009679 5.0 0.001936 0.001936
11 2019-02-19 105 0.009679 0.037519 5.0 0.005568 0.003632
12 2019-02-19 110 0.037519 0.081904 5.0 0.008877 0.003309
13 2019-02-19 115 0.081904 0.141428 5.0 ...
14 2019-02-19 120 0.141428 0.214852 5.0 ...
15 2019-02-19 125 0.214852 0.301086 5.0
16 2019-02-19 130 0.301086 0.399163 5.0
Question: could anyone help derive quick code (not mathematically) without using a loop?
Thanks a lot!
I don't fully understand your question; all that notation was a bit confusing to me.
If I understood what you want correctly: for every row you want an accumulated value of all previous rows, and then the value of another column in that row is calculated based on this accumulated value.
In that case I would calculate the accumulated column first and use it later.
For example (note that you need to call list(range(...)) rather than passing range directly; your example as posted throws an error):
import pandas as pd
import numpy as np

def f_ST(ST, F, T):
    a = ST/F - 1 - np.log(ST/F)
    return 2*a/T

df = pd.DataFrame(list(range(50, 140, 5)), columns=['K'])
df['f(K0)'] = df.apply(lambda x: f_ST(x.K, 100, 0.25), axis=1)
df['f(K1)'] = df['f(K0)'].shift(-1)
df['dK'] = df['K'].diff(1)
df['accumulate'] = df['K'].shift(1).cumsum()
df['currentVal-accumulated'] = df['K'] - df['accumulate']
print(df)
prints:
K f(K0) ... accumulate currentVal-accumulated
0 50 1.545177 ... NaN NaN
1 55 1.182696 ... 50.0 5.0
2 60 0.886605 ... 105.0 -45.0
3 65 0.646263 ... 165.0 -100.0
4 70 0.453400 ... 230.0 -160.0
5 75 0.301457 ... 300.0 -225.0
6 80 0.185148 ... 375.0 -295.0
7 85 0.100151 ... 455.0 -370.0
8 90 0.042884 ... 540.0 -450.0
9 95 0.010346 ... 630.0 -535.0
10 100 0.000000 ... 725.0 -625.0
11 105 0.009679 ... 825.0 -720.0
12 110 0.037519 ... 930.0 -820.0
13 115 0.081904 ... 1040.0 -925.0
14 120 0.141428 ... 1155.0 -1035.0
15 125 0.214852 ... 1275.0 -1150.0
16 130 0.301086 ... 1400.0 -1270.0
17 135 0.399163 ... 1530.0 -1395.0
[18 rows x 6 columns]
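
For the w column itself, no loop (and no helper accumulator) is actually needed. The recursion w[n] = tmpw[n] - (w[0] + ... + w[n-1]) implies that the cumulative sum of w equals tmpw at every n, so w is simply the first difference of tmpw. A sketch, assuming tmpw = (f(K1) - f(K0)) / dK as defined in the question:

df['tmpw'] = (df['f(K1)'] - df['f(K0)']) / df['dK']

# cumsum(w)[n] = tmpw[n] for every n, so w is the first difference of
# tmpw, with the leading NaN filled by w[0] = tmpw[0].
df['w'] = df['tmpw'].diff().fillna(df['tmpw'])

This matches the sample result in the question: at K=105, w = 0.005568 - 0.001936 = 0.003632.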

Need to write a function to find percentage of total in a pivot table

For example this table:
list_1=[['A',10,42,12,64],
['B',24,11,62,95],
['C',14,78,20,112]]
labels=['Class','Amount_1','Amount_2','Amount_3','Total']
df=pd.DataFrame(list_1,columns=labels)
df
Class Amount_1 Amount_2 Amount_3 Total
0 A 10 42 12 64
1 B 24 11 62 95
2 C 14 78 20 112
I need to write a function to get this table (each amount as a rate of the total):
Class Amount_1 Amount_2 Amount_3
A 0.156250 0.656250 0.187500
B 0.252632 0.115789 0.652632
C 0.125000 0.696429 0.178571
Try:
from sklearn.preprocessing import normalize
df[["Amount_1", "Amount_2", "Amount_3"]]=normalize(df[["Amount_1", "Amount_2", "Amount_3"]], axis=1, norm="l1")
Outputs:
Class Amount_1 Amount_2 Amount_3 Total
0 A 0.156250 0.656250 0.187500 64
1 B 0.247423 0.113402 0.639175 95
2 C 0.125000 0.696429 0.178571 112
Note that l1 normalization divides by the row sum of the selected columns, so row B (24 + 11 + 62 = 97) differs slightly from the expected output, which divides by the Total column (95).
IIUC (if I understand correctly):
df.update(df.loc[:, df.columns.str.contains('Amount')].div(df.Total, axis=0))
df
Out[41]:
Class Amount_1 Amount_2 Amount_3 Total
0 A 0.156250 0.656250 0.187500 64
1 B 0.252632 0.115789 0.652632 95
2 C 0.125000 0.696429 0.178571 112
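
Since the question asks for a function, here is a small sketch wrapping the divide-by-Total approach above (the name rate_from_total is just illustrative):

def rate_from_total(df, total_col='Total'):
    # Divide every Amount_* column by that row's Total.
    amount_cols = df.columns[df.columns.str.contains('Amount')]
    return df[amount_cols].div(df[total_col], axis=0)

print(rate_from_total(df))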

Using pd.to_datetime to convert "object" column into %HH:MM:SS

I am doing some exploratory data analysis using finish-time data scraped from the 2018 KONA IRONMAN. I used JSON to format the data and pandas to read it into a CSV. The 'swim', 'bike', and 'run' columns should be formatted as HH:MM:SS to be operable; however, I am receiving a ValueError: ('Unknown string format:', '--:--:--').
print(data.head(2))
print(kona.info())
print(kona.describe())
Name div_rank ... bike run
0 Avila, Anthony 2470 138 ... 05:27:59 04:31:56
1 Lindgren, Mikael 1050 151 ... 05:17:51 03:49:20
swim 2472 non-null object
bike 2472 non-null object
run 2472 non-null object
Name div_rank ... bike run
count 2472 2472 ... 2472 2472
unique 2472 288 ... 2030 2051
top Jara, Vicente 986 -- ... --:--:-- --:--:--
freq 1 165 ... 122 165
How should I use pd.to_datetime to properly format the 'bike', 'swim', and 'run' columns, and, for future use, sum these columns to append a 'Total Finish Time' column? Thanks!
The reason for the error is that it can't parse a time from '--:--:--'. So you'd need to convert all of those to '00:00:00', but that would imply they did the event in zero time. The other option is to convert only the times that are present, leaving a null where there is no time. Converting to datetime will also include a date of 1900-01-01, so I put .dt.time on the end so that only the time displays.
timed_events = ['bike', 'swim', 'run']
for event in timed_events:
    result[event] = pd.to_datetime(result[result[event] != '--:--:--'][event], format="%H:%M:%S").dt.time
The problem with this, though, is that I remember you wanted to sum those times, which would require extra conversions. So I suggest using .to_timedelta() instead. It works the same way, in that you still need to exclude the --:--:-- values, but then you can sum the times. I also added a column for the number of events completed, so that if you want to sort by best times you can filter out anyone who hasn't competed in all three events, as they would otherwise show better totals simply because they are missing entire events.
I'll also add, regarding the comment:
"You think providing all the code will be helpful but it does not. You will get a quicker and more useful response if you keep the code to the minimum that can replicate your issue. stackoverflow.com/help/mcve – mad_"
I'll give him the benefit of the doubt: seeing the whole code, he may not have realized that the code you provided was the minimal code needed to replicate your issue, since no one wants to write code to generate your data. Sometimes you can state that explicitly in your question, i.e.:
Here's the code to generate my data:
CODE PART 1
import bs4
import pandas as pd
code...
But now that I have the data, here's where I'm having trouble:
df = pd.to_timedelta()...
...
Luckily I remembered helping you on this earlier, so I knew I could go back and get that code. So the code you originally had was fine.
But here's the full code I used, which stores the csv differently than you originally had. You can change that part, but the end is what you'll need:
from bs4 import BeautifulSoup, Comment
from collections import defaultdict
import requests
import pandas as pd

sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'
r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')
    for node in my_table.children:
        if isinstance(node, Comment):
            # Get content and strip comment "<!--" and "-->"
            # Wrap the rows in "table" tags as well.
            data = '<table>{}</table>'.format(node[4:-3])
            break
    table = BeautifulSoup(data, 'html.parser')
    for row in table.find_all('tr'):
        name, _, swim, bike, run, div_rank, gender_rank, overall_rank = [col.text.strip() for col in row.find_all('td')[1:]]
        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
        })
    return result

jsonObj = parse_table(soup)

result = pd.DataFrame()
for k, v in jsonObj.items():
    temp_df = pd.DataFrame.from_dict(v)
    temp_df['name'] = k
    result = result.append(temp_df)

result = result.reset_index(drop=True)
result.to_csv('C:/data.csv', index=False)

# However you read in your csv/dataframe, use the code below on it to get those times
timed_events = ['bike', 'swim', 'run']
for event in timed_events:
    result[event] = pd.to_timedelta(result[result[event] != '--:--:--'][event])

result['total_events_participated'] = 3 - result.isnull().sum(axis=1)
result['total_times'] = result[timed_events].sum(axis=1)
Output:
print (result)
bike div_rank ... total_events_participated total_times
0 05:27:59 138 ... 3 11:20:06
1 05:17:51 151 ... 3 10:16:17
2 06:14:45 229 ... 3 14:48:28
3 05:13:56 162 ... 3 10:19:03
4 05:19:10 6 ... 3 09:51:48
5 04:32:26 25 ... 3 08:23:26
6 04:49:08 155 ... 3 10:16:16
7 04:50:10 216 ... 3 10:55:47
8 06:45:57 71 ... 3 13:50:28
9 05:24:33 178 ... 3 10:21:35
10 06:36:36 17 ... 3 14:36:59
11 NaT -- ... 0 00:00:00
12 04:55:29 100 ... 3 09:28:53
13 05:39:18 72 ... 3 11:44:40
14 04:40:41 -- ... 2 05:35:18
15 05:23:18 45 ... 3 10:55:27
16 05:15:10 3 ... 3 10:28:37
17 06:15:59 78 ... 3 11:47:24
18 NaT -- ... 0 00:00:00
19 07:11:19 69 ... 3 15:39:51
20 05:49:02 29 ... 3 10:32:36
21 06:45:48 4 ... 3 13:39:17
22 04:39:46 -- ... 2 05:48:38
23 06:03:01 3 ... 3 11:57:42
24 06:24:58 193 ... 3 13:52:57
25 05:07:42 116 ... 3 10:01:24
26 04:44:46 112 ... 3 09:29:22
27 04:46:06 55 ... 3 09:32:43
28 04:41:05 69 ... 3 09:31:32
29 05:27:55 68 ... 3 11:09:37
... ... ... ... ...
2442 NaT -- ... 0 00:00:00
2443 05:26:40 3 ... 3 11:28:53
2444 05:04:37 19 ... 3 10:27:13
2445 04:50:45 74 ... 3 09:15:14
2446 07:17:40 120 ... 3 14:46:05
2447 05:26:32 45 ... 3 10:50:48
2448 05:11:26 186 ... 3 10:26:00
2449 06:54:15 185 ... 3 14:05:16
2450 05:12:10 22 ... 3 11:21:37
2451 04:59:44 45 ... 3 09:29:43
2452 06:03:59 96 ... 3 12:12:35
2453 06:07:27 16 ... 3 12:47:11
2454 04:38:06 91 ... 3 09:52:27
2455 04:41:56 14 ... 3 08:58:46
2456 04:38:48 85 ... 3 09:18:31
2457 04:42:30 42 ... 3 09:07:29
2458 04:40:54 110 ... 3 09:32:34
2459 06:08:59 37 ... 3 12:15:23
2460 04:32:20 -- ... 2 05:31:05
2461 04:45:03 96 ... 3 09:30:06
2462 06:14:29 95 ... 3 13:38:54
2463 06:00:20 164 ... 3 12:10:03
2464 05:11:07 22 ... 3 10:32:35
2465 05:56:06 188 ... 3 13:32:48
2466 05:09:26 2 ... 3 09:54:55
2467 05:22:15 7 ... 3 10:26:14
2468 05:53:14 254 ... 3 12:34:21
2469 05:00:29 156 ... 3 10:18:29
2470 04:30:46 7 ... 3 08:38:23
2471 04:34:59 39 ... 3 09:04:13
[2472 rows x 9 columns]
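
If you also want the 'Total Finish Time' column rendered as an HH:MM:SS string rather than a raw timedelta, one possible sketch using the total_times column built above:

# .dt.components splits each timedelta into days/hours/minutes/seconds;
# folding the days back into hours keeps long totals readable as HH:MM:SS.
comps = result['total_times'].dt.components
result['Total Finish Time'] = (
    (comps['days'] * 24 + comps['hours']).astype(str).str.zfill(2)
    + ':' + comps['minutes'].astype(str).str.zfill(2)
    + ':' + comps['seconds'].astype(str).str.zfill(2)
)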

Subtraction of two series from different parts of the dataframe

I have the following data frame:
SID AID START END
71 1 1 -11136 -11122
74 1 1 -11121 -11109
78 1 1 -11034 -11014
79 1 2 -11137 -11152
83 1 2 -11114 -11127
86 1 2 -11032 -11038
88 1 2 -11121 -11002
I want to do a subtraction of the START elements with AID==1 and AID==2, in order, such that the expected result would be:
-11136 - (-11137) = 1
-11121 - (-11114) =-7
-11034 - (-11032) =-2
NaN - (-11002) = NaN
So I extracted two groups:
values1 = group.loc[group['AID'] == 1]["START"]
values2 = group.loc[group['AID'] == 2]["START"]
with the following result:
71 -11136
74 -11121
78 -11034
Name: START, dtype: int64
79 -11137
83 -11114
86 -11032
88 -11002
Name: START, dtype: int64
and did a simple subtraction:
values1-values2
But I got all NaNs:
71 NaN
74 NaN
78 NaN
79 NaN
83 NaN
86 NaN
I noticed that if I use data from the same AID group (e.g. START-END), I get the right answer. I get the NaNs only when I "mix" AID groups. I'm just getting started with pandas, but I'm obviously missing something here. Any suggestions?
Let's try this:
df.set_index([df.groupby(['SID','AID']).cumcount(),'AID'])['START'].unstack().add_prefix('col_').eval('col_1 - col_2')
Output:
0 1.0
1 -7.0
2 -2.0
3 NaN
dtype: float64
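Unpacked, that one-liner does three things: number the rows within each (SID, AID) group, pivot START so each AID value becomes its own column, then subtract the columns. Roughly equivalent, step by step:

pos = df.groupby(['SID', 'AID']).cumcount()           # 0, 1, 2, ... within each group
wide = df.set_index([pos, 'AID'])['START'].unstack()  # one column per AID value
result = wide[1] - wide[2]                            # aligned by position; NaN where group sizes differ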
pandas does these operations based on labels. Since your labels ((71, 74, 78) and (79, 83, 86, 88)) don't match, it cannot find any value to subtract. One way to deal with this is to use a numpy array instead of a Series, so there is no label associated:
values1 - values2.values
Out:
71 1
74 -7
78 -2
Name: START, dtype: int64
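Note that in this example the two groups have different lengths (3 and 4), so the subtraction against the raw numpy array would actually raise a length-mismatch error; the output above assumes equal-sized groups. An alternative that aligns by position and keeps the trailing NaN is to reset both indices first:

values1.reset_index(drop=True) - values2.reset_index(drop=True)

which gives 1.0, -7.0, -2.0, NaN, matching the expected result in the question.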
Bizarre way to go about it
-np.diff([g.reset_index(drop=True) for n, g in df.groupby('AID').START])[0]
0 1.0
1 -7.0
2 -2.0
3 NaN
Name: START, dtype: float64
