Python: xlrd discerning dates from floats

I wanted to import a file containing text, numbers, and dates using xlrd in Python.
I tried something like:
if "/" in worksheet.cell_value:
do_this
else:
do_that
But that was of no use, as I later discovered dates are stored as floats, not strings. To convert them to the datetime type I did:
try:
    get_row = str(datetime.datetime(*xlrd.xldate_as_tuple(worksheet.cell_value(i, col - 1), workbook.datemode)))
except:
    get_row = unicode(worksheet.cell_value(i, col - 1))
I have an exception in place for when the cell contains text. Now I want to get the numbers as numbers and the dates as dates, because right now all numbers are converted to dates.
Any ideas?

I think you could make this much simpler by making more use of the tools available in xlrd:
cell_type = worksheet.cell_type(row - 1, i)
cell_value = worksheet.cell_value(row - 1, i)
if cell_type == xlrd.XL_CELL_DATE:
    # Returns a tuple.
    dt_tuple = xlrd.xldate_as_tuple(cell_value, workbook.datemode)
    # Create a datetime object from this tuple.
    get_col = datetime.datetime(
        dt_tuple[0], dt_tuple[1], dt_tuple[2],
        dt_tuple[3], dt_tuple[4], dt_tuple[5]
    )
elif cell_type == xlrd.XL_CELL_NUMBER:
    get_col = int(cell_value)
else:
    get_col = unicode(cell_value)
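Wrapped as a small helper, in case the dispatch is reused across a whole sheet (a sketch; the function name and the row/col calling convention are my own):

import datetime
import xlrd

def cell_to_python(worksheet, workbook, row, col):
    # dispatch on the stored cell type instead of guessing from the value
    cell_type = worksheet.cell_type(row, col)
    cell_value = worksheet.cell_value(row, col)
    if cell_type == xlrd.XL_CELL_DATE:
        return datetime.datetime(*xlrd.xldate_as_tuple(cell_value, workbook.datemode))
    elif cell_type == xlrd.XL_CELL_NUMBER:
        return int(cell_value)
    # text cells already come back from xlrd as unicode strings
    return cell_value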

Well, never mind, I found a solution and here it is!
try:
    cell = worksheet.cell(row - 1, i)
    if cell.ctype == xlrd.XL_CELL_DATE:
        date = datetime.datetime(1899, 12, 30)
        get_ = datetime.timedelta(int(worksheet.cell_value(row - 1, i)))
        get_col2 = str(date + get_)[:10]
        d = datetime.datetime.strptime(get_col2, '%Y-%m-%d')
        get_col = d.strftime('%d-%m-%Y')
    else:
        get_col = unicode(int(worksheet.cell_value(row - 1, i)))
except:
    get_col = unicode(worksheet.cell_value(row - 1, i))
A bit of explanation: it turns out that with xlrd you can actually check the type of a cell and tell whether it is a date or not. Also, Excel has a peculiar way of saving datetimes: it stores them as floats (the part left of the decimal point counts days, the part right of it hours), takes a specific epoch date (1899-12-30, which seems to work OK), and adds the days and hours from the float to reconstruct the date. So, to create the date that I wanted, I just added them and kept only the first 10 characters ([:10]) to get rid of the hours (00.00.00 or something...). I also changed the order to days-months-years because in Greece we use a different order. Finally, this code also checks whether it can convert a number to an integer (I don't want any floats showing in my program...), and if everything fails, it just uses the cell as it is (in case the cell contains a string...).
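To illustrate the mechanism (a minimal sketch; the serial number 44562.5 is just an example value):

import datetime

# Excel (1900 date system) stores datetimes as serial floats:
# the integer part counts days since the 1899-12-30 epoch,
# the fractional part is the time of day.
excel_serial = 44562.5
epoch = datetime.datetime(1899, 12, 30)
print(epoch + datetime.timedelta(days=excel_serial))  # 2022-01-01 12:00:00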
I hope you find this useful; I think there are other threads saying that this is impossible or something...

Related

extracting values matching timestamps by a new set of timestamps

sample table here
I am trying to look up corresponding commodity prices from the columns (CU00.SHF, AU00.SHF, SC00.SHF, I8888.DCE, C00.DCE) with a new set of timestamps, whose dates are 32 days later than the dates in the 'history_date' column.
I tried .loc and .at in a loop to extract the matching values with the functions below:
latest_day = data.iloc[data.shape[0] - 1, 0].date()

def next_trade_day(x):
    x = pd.to_datetime(x).date()  # the imported is_workday function requires a datetime type
    while True:
        if is_workday(x + timedelta(32)) != False:
            break
        if is_workday(x + timedelta(32)) == False:
            x = x + timedelta(1)
    return pd.Timestamp(x + timedelta(32))
def end_price(x):
    x = pd.Timestamp(x)
    if x <= latest_day:
        return data.at[x, 'CU00.SHF']
    if x > latest_day:
        return 'None'
    return data.at[x, 'CU00.SHF']
but it always gives:
KeyError: Timestamp('2023-02-03 00:00:00')
Any idea how I should achieve the target?
Thanks in advance!
If you want to work with datetimes: first convert the column to datetime, then check the conversion by using a filter:
pd.to_datetime(df['your column'], errors='ignore')
df.loc[df['your column'] > 'your-date']
If both work, then check your full code.
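A minimal runnable sketch of that suggestion (the frame and values here are made up for illustration; the answer used errors='ignore', while errors='coerce' is a variant that surfaces unparseable rows as NaT):

import pandas as pd

# hypothetical sample frame standing in for the asker's table
df = pd.DataFrame({'history_date': ['2023-01-02', '2023-02-03', '2023-03-01'],
                   'CU00.SHF': [65000.0, 68900.0, 69310.0]})

# 1) convert the column to datetime
df['history_date'] = pd.to_datetime(df['history_date'], errors='coerce')

# 2) check the conversion by filtering on a date
print(df.loc[df['history_date'] > '2023-01-31'])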

Python: converting timestamp to date time not working

I am requesting data from the api.etherscan.io website. For this, I require a free API key. I am getting information for the following wallet addresses 0xdafea492d9c6733ae3d56b7ed1adb60692c98bc5, 0xc508dbe4866528db024fb126e0eb97595668c288. Below is the code I am using:
wallet_addresses = ['0xdafea492d9c6733ae3d56b7ed1adb60692c98bc5', '0xc508dbe4866528db024fb126e0eb97595668c288']
page_number = 0
df_main = pd.DataFrame()
while True:
    for address in wallet_addresses:
        url = f'https://api.etherscan.io/api?module=account&action=txlist&address={address}&startblock=0&endblock=99999999&page={page_number}&offset=10&sort=asc&apikey={ether_api}'
        output = requests.get(url).text
        df_temp = pd.DataFrame(json.loads(output)['result'])
        df_temp['wallet_address'] = address
        df_main = df_main.append(df_temp)
        page_number += 1
    df_main['timeStamp'] = pd.to_datetime(df_main['timeStamp'], unit='s')
    if min(pd.to_datetime(df_main['timeStamp']).dt.date) < datetime.date(2022, 1, 1):
        pass
Note that you need your own (free) ether_api.
What I want to do is get data from today's date all the way back to 2022-01-01, which is what I am trying to achieve in the if statement.
However, the above gives me an error: ValueError: unit='s' not valid with non-numerical val='2022-09-19 18:14:47'
How can this be done? I've tried multiple methods to get pandas datetime to work, but all of them gave me errors.
Here you go, it's working without an error:
page_number = 0
df_main = pd.DataFrame()
while True:
    for address in wallet_addresses:
        url = f'https://api.etherscan.io/api?module=account&action=txlist&address={address}&startblock=0&endblock=99999999&page={page_number}&offset=10&sort=asc&apikey={ether_api}'
        output = requests.get(url).text
        df_temp = pd.DataFrame(json.loads(output)['result'])
        df_temp['wallet_address'] = address
        page_number += 1
        df_temp['timeStamp'] = pd.to_datetime(df_temp['timeStamp'], unit='s')
        df_main = df_main.append(df_temp)
    if min(pd.to_datetime(df_main['timeStamp']).dt.date) < datetime(2022, 1, 1).date():
        pass
Wrong append
So, what happened here? As suggested in the first comment under the question, we checked the type of the first record in df_main with type(df_main['timeStamp'].iloc[0]). With IPython or a Jupyter Notebook, one can inspect df_main right after receiving the error, with it being populated up to the last for-loop iteration that failed.
Otherwise, if one uses PyCharm or any other IDE with debugging support, the contents of df_main can be revealed via the debugger.
What we were missing is that df_main = df_main.append(df_temp) was placed in a slightly wrong spot. On the first iteration it works well: pd.to_datetime(df_main['timeStamp'], unit='s') gets str values holding Unix epochs and converts them to pandas._libs.tslibs.timestamps.Timestamp.
But on the next iteration df_main['timeStamp'] already has the Timestamp type, and it gets appended with str values, so we get a column of mixed type. E.g.:
type(df_main['timeStamp'].iloc[0]) == type(df_main['timeStamp'].iloc[-1])
This results in False. Hence, when the conversion runs again over the mixed column, one gets the error featured in the question.
To mitigate this we can place .append() below the conversion and do the conversion on df_temp instead of df_main; this way we only ever append Timestamps to the resulting DataFrame, and the code below with the if clause works fine.
As a side note
Another small change I made concerned datetime.date(2022, 1, 1). The change was not strictly needed, but the way one works with datetime depends on how the library was imported, so it's worth mentioning:
import datetime
datetime.date(2022, 1, 1)
datetime.datetime(2022, 1, 1).date()

from datetime import datetime
datetime(2022, 1, 1).date()
All of the above are legit and produce the same result. With the first import the module is imported; with the second, the type.
Alternative solution
Conversion to Timestamp takes time. If the API provides Unix epoch dates, why not use those for the comparison? Let's add this near where you define wallet_addresses:
reference_date = "01/01/2021"
reference_date = int(time.mktime(datetime.datetime.strptime(reference_date, "%d/%m/%Y").timetuple()))
This will result in 1609448400. There is another Stack Overflow question on this that can serve as a reference.
This integer can now be compared with the timestamps provided by the API. The only thing left is to cast str to int. We can leave your code intact, with some minor changes at the end:
<< Your code without changes >>
df_main['timeStamp'] = df_main['timeStamp'].astype(int)
if min(df_main['timeStamp']) < reference_date:
pass
To make a benchmark, I changed while True: to for _ in range(0, 4): to limit the infinite cycle. The results are as follows:
Initial solution took 11.6 s to complete
Alternative solution took 8.85 s to complete
That's 30% faster. Casting str to int takes less time than converting to Timestamps; I would call this the preferable solution.
Future warning
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
It makes sense to comply with this warning. df_main = df_main.append(df_temp) has to be changed to df_main = pd.concat([df_main, df_temp]).
As of the current 1.5.0 version it is already deprecated. Time to upgrade!
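A minimal sketch of the concat-friendly pattern (the loop and data below are placeholders for the real API paging): collecting the per-page frames in a list and concatenating once avoids the repeated copying that growing a DataFrame row-block by row-block causes.

import pandas as pd

frames = []
for page_number in range(3):  # placeholder for the real while/for paging logic
    df_temp = pd.DataFrame({'timeStamp': [1640995200 + page_number]})  # dummy epoch data
    df_temp['timeStamp'] = pd.to_datetime(df_temp['timeStamp'], unit='s')
    frames.append(df_temp)

df_main = pd.concat(frames, ignore_index=True)
print(df_main)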

How to count the duration of a field in a given value while having the field change history data?

I'm working with field change history data which has timestamps for when the field value was changed. In this example, I need to calculate the overall case duration in 'Termination in Progress' status.
The given case was changed from and to this status three times in total:
see screenshot
I need to add up all three durations in this case; in other cases there can be more or fewer than three.
Does anyone know how to calculate that in Python?
Welcome to Stack Overflow!
Based on the limited data you provided, here is a solution that should work, although the code makes some assumptions that could cause errors, so you will want to modify it to suit your needs. I avoided list comprehensions and array math to keep it clear, since you said you're new to Python.
Assumptions:
You're pulling this data into a pandas dataframe
All Old values of "Termination in Progress" have a matching new value for all Case Numbers
import datetime
import pandas as pd
import numpy as np

fp = r'<PATH TO FILE>\\'
f = '<FILENAME>.csv'
data = pd.read_csv(fp + f)

# convert ts to datetime for later use doing time delta calculations
data['Edit Date'] = pd.to_datetime(data['Edit Date'])

# sort by the same case number and date in opposing order to make sure values for old and new align properly
data.sort_values(by=['CaseNumber', 'Edit Date'], ascending=[True, False], inplace=True)

# find timestamps where Termination in progress occurs
old_val_ts = data.loc[data['Old Value'] == 'Termination in progress']['Edit Date'].to_list()
new_val_ts = data.loc[data['New Value'] == 'Termination in progress']['Edit Date'].to_list()

# loop over the timestamps and calc the time delta
ts_deltas = list()
for i in range(len(old_val_ts)):
    item = old_val_ts[i] - new_val_ts[i]
    ts_deltas.append(item)

# this loop could also be accomplished with list comprehension like this:
# ts_deltas = [old_ts - new_ts for (old_ts, new_ts) in zip(old_val_ts, new_val_ts)]

print('Deltas between groups')
print(ts_deltas)
print()

# sum the time deltas
total_ts_delta = sum(ts_deltas, datetime.timedelta())
print('Total Time Delta')
print(total_ts_delta)
Deltas between groups
[Timedelta('0 days 00:08:00'), Timedelta('0 days 00:06:00'), Timedelta('0 days 02:08:00')]
Total Time Delta
0 days 02:22:00
I've also attached a picture of the solution minus my file path for obvious reasons. Hope this helps. Please remember to mark as correct if this solution works for you. Otherwise let me know what issues you run into.
EDIT:
If you have multiple case numbers to look at, you could do it in various ways, but the simplest is to get a list of unique case numbers with data['CaseNumber'].unique(), then iterate over that array, filtering for each case number and appending the total time delta to a new list or dictionary (not necessarily the most efficient solution, but it will work):
cases_total_td = {}
unique_cases = data['CaseNumber'].unique()
for case in unique_cases:
    temp_data = data[data['CaseNumber'] == case]
    # find timestamps where Termination in progress occurs for this case
    # (note: filter temp_data here, not data, or every case gets the same totals)
    old_val_ts = temp_data.loc[temp_data['Old Value'] == 'Termination in progress']['Edit Date'].to_list()
    new_val_ts = temp_data.loc[temp_data['New Value'] == 'Termination in progress']['Edit Date'].to_list()
    # calc the time deltas with a list comprehension
    ts_deltas = [old_ts - new_ts for (old_ts, new_ts) in zip(old_val_ts, new_val_ts)]
    # sum the time deltas
    total_ts_delta = sum(ts_deltas, datetime.timedelta())
    cases_total_td[case] = total_ts_delta

print(cases_total_td)
{1005222: Timedelta('0 days 02:22:00')}
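For what it's worth, the same per-case totals can be computed with a groupby; a sketch under the same assumptions as above (the column names, and every 'Termination in progress' Old value having a matching New value):

import datetime

def case_duration(group):
    old_val_ts = group.loc[group['Old Value'] == 'Termination in progress', 'Edit Date']
    new_val_ts = group.loc[group['New Value'] == 'Termination in progress', 'Edit Date']
    # pair up the exit/entry timestamps, as in the explicit loop above
    return sum((o - n for o, n in zip(old_val_ts, new_val_ts)), datetime.timedelta())

cases_total_td = data.groupby('CaseNumber').apply(case_duration).to_dict()
print(cases_total_td)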

My program computes values as strings and not as floats even when I change the type

I have a problem with my program and I'm confused: I don't know why it won't change the type of the columns, or maybe it does change the type and still computes the columns as strings. When I change the type to float and multiply by 8, a value of 4, for example, gives me 44444444. Here is my code:
import pandas as pd
import re
import numpy as np

link = "excelfilett.txt"
file = open(link, "r")
frames = []
is_count_frames = False
for line in file:
    if "[Frames]" in line:
        is_count_frames = True
    if is_count_frames == True:
        frames.append(line)
    if "[EthernetRouting]" in line:
        break

number_of_rows = len(frames) - 3
header = re.split(r'\t', frames[1])
number_of_columns = len(header)
frame_array = np.full((number_of_rows, number_of_columns), 0)
df_frame_array = pd.DataFrame(frame_array)
df_frame_array.columns = header
for row in range(number_of_rows):
    frame_row = re.split(r'\t', frames[row + 2])
    for position in range(len(frame_row)):
        df_frame_array.iloc[row, position] = frame_row[position]

df_frame_array['[MinDistance (ms)]'].astype(float)
df_frame_array.loc[:, '[MinDistance (ms)]'] *= 8
print(df_frame_array['[MinDistance (ms)]'])
but it gives me the value repeated 8 times, like (100100...100100). I also tried putting the values in a list:
MinDistList = df_frame_array['[MinDistance (ms)]'].tolist()
product = []
for i in MinDistList:
    product.append(i * 8)
print(product)
but it still won't work. Any ideas?
df_frame_array['[MinDistance (ms)]'].astype(float) doesn't change the column in place, but returns a new one.
You had the right idea, so just store it back:
df_frame_array['[MinDistance (ms)]'] = df_frame_array['[MinDistance (ms)]'].astype(float)
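To see why the multiplication misbehaved, here is a small self-contained repro (the values are dummies; the real column comes from the parsed file). The column holds strings, and multiplying a string by an int repeats it:

import pandas as pd

df = pd.DataFrame({'[MinDistance (ms)]': ['100', '4']})

df['[MinDistance (ms)]'].astype(float)  # returns a converted copy; df itself is unchanged
print(df['[MinDistance (ms)]'] * 2)     # string repetition: '100100', '44'

df['[MinDistance (ms)]'] = df['[MinDistance (ms)]'].astype(float)  # store it back
print(df['[MinDistance (ms)]'] * 2)     # numeric: 200.0, 8.0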

Pandas df.loc comparing-floats-condition never works

df[['gc_lat', 'gc_lng']] = df[['gc_lat', 'gc_lng']].apply(pd.to_numeric, errors='ignore')
df_realty[['lat', 'lng']] = df_realty[['lat', 'lng']].apply(pd.to_numeric, errors='ignore')
for index, row in df.iterrows():
    gc_lat = float(df.get_value(index, 'gc_lat'))
    gc_lng = float(df.get_value(index, 'gc_lng'))
    latmax = gc_lat + 1/110.574*radius_km
    latmin = gc_lat - 1/110.574*radius_km
    longmax = gc_lng + 1/111.320*radius_km*cos(df.get_value(index, 'gc_lat'))
    longmin = gc_lng - 1/111.320*radius_km*cos(df.get_value(index, 'gc_lat'))
    print(latmax, latmin, longmax, longmin)
    print(gc_lat)
    print(gc_lng)
    print(df_realty.shape)
    subset = df_realty.loc[(df_realty['lat'] < latmax) & (df_realty['lat'] > latmin) &
                           (df_realty['lng'] > longmin) & (df_realty['lng'] < longmax)]
    print(subset.shape)
    print('subset selected!')
prints
59.12412758664786 59.03369041335215 37.88659685779323 37.960157142206775
59.078909
37.923377
(290584, 3)
(0, 3)
subset selected!
So I am trying to split the DataFrame into subsets, but the condition I put in df.loc never works!
The data in df_realty is OK; I've already tested it.
It seems like I need some explicit type casts, but I've already made one (pd.to_numeric).
Any suggestions?
Found a solution.
The problem was that longmax sometimes became smaller than longmin, because cos sometimes returns a negative float.
Putting abs() in front of the cosine solved the problem.
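A minimal sketch of the fixed bounds (the point and radius are taken from the printout above purely for illustration; abs() is the fix described here, while the radians() conversion is my own addition, since math.cos expects radians rather than degrees):

from math import cos, radians

gc_lat, gc_lng, radius_km = 59.078909, 37.923377, 5.0  # example values

latmax = gc_lat + 1/110.574*radius_km
latmin = gc_lat - 1/110.574*radius_km
# abs() guarantees longmax > longmin even where cos() is negative
longmax = gc_lng + 1/111.320*radius_km*abs(cos(radians(gc_lat)))
longmin = gc_lng - 1/111.320*radius_km*abs(cos(radians(gc_lat)))
print(latmin, latmax, longmin, longmax)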

Categories

Resources