Python: converting timestamp to date time not working

Python: converting timestamp to date time not working - python

I am requesting data from the api.etherscan.io website. For this, I require a free API key. I am getting information for the following wallet addresses 0xdafea492d9c6733ae3d56b7ed1adb60692c98bc5, 0xc508dbe4866528db024fb126e0eb97595668c288. Below is the code I am using:
wallet_addresses = ['0xdafea492d9c6733ae3d56b7ed1adb60692c98bc5', '0xc508dbe4866528db024fb126e0eb97595668c288']
page_number = 0
df_main = pd.DataFrame()
while True:
for address in wallet_addresses:
url=f'https://api.etherscan.io/api?module=account&action=txlist&address={address}&startblock=0&endblock=99999999&page={page_number}&offset=10&sort=asc&apikey={ether_api}'
output = requests.get(url).text
df_temp = pd.DataFrame(json.loads(output)['result'])
df_temp['wallet_address'] = address
df_main = df_main.append(df_temp)
page_number += 1
df_main['timeStamp'] = pd.to_datetime(df_main['timeStamp'], unit='s')
if min(pd.to_datetime(df_main['timeStamp']).dt.date) < datetime.date(2022, 1, 1):
pass
Note that you need your own (free) ether_api.
What I want to do is get data from today's date, all the way back to 2022-01-01 which is what I am trying to achieve in the if statement.
However, the above gives me an error: ValueError: unit='s' not valid with non-numerical val='2022-09-19 18:14:47'
How can this be done? I've tried multiple methods to get pandas datetime to work, but all of them gave me errors.

Here you go, it's working without an error:
page_number = 0
df_main = pd.DataFrame()
while True:
for address in wallet_addresses:
url=f'https://api.etherscan.io/api?module=account&action=txlist&address={address}&startblock=0&endblock=99999999&page={page_number}&offset=10&sort=asc&apikey={ether_api}'
output = requests.get(url).text
df_temp = pd.DataFrame(json.loads(output)['result'])
df_temp['wallet_address'] = address
page_number += 1
df_temp['timeStamp'] = pd.to_datetime(df_temp['timeStamp'], unit='s')
df_main = df_main.append(df_temp)
if min(pd.to_datetime(df_main['timeStamp']).dt.date) < datetime(2022, 1, 1).date():
pass
Wrong append
So, what has happened here. As suggested in the first comment under question we acknowledged the type of first record in df_main with type(df_main['timeStamp'].iloc[0]). With IPython and Jupyter-Notebook one can look what is happening with df_main just after receiving an error with it being populated on the last for loop iteration that failed.
Otherwise if one uses PyCharm or any other IDE with a possibility to debug, the contents of df_main can be revealed via debug.
What we were missing, is that df_main = df_main.append(df_temp) is placed in a slightly wrong place. On first iteration it works well, pd.to_datetime(df_main['timeStamp'], unit='s') gets an str type with Linux epoch and gets converted to pandas._libs.tslibs.timestamps.Timestamp.
But on next iteration df_main['timeStamp'] already has the Timestamp type and it gets appended with str type, so we get a column with mixed type. E.g.:
type(df_main['timeStamp'].iloc[0]) == type(df_main['timeStamp'].iloc[-1])
This results with False. Hence when trying to convert Timestamp to Timestamp one gets an error featured in question.
To mitigate this we can place .append() below the conversion and do this conversion on df_temp instead of df_main, this way we will only append Timestamps to the resulting DataFrame and the code below with if clause will work fine.
As a side note
Another small change I've made was datetime.date(2022, 1, 1). This change was not needed, but the way one works with datetime depends on how this library was imported, so it's worth mentioning:
import datetime
datetime.date(2022, 1, 1)
datetime.datetime(2022, 1, 1).date()
from datetime import datetime
datetime(2022, 1, 1).date()
All the above is legit and will produce the same. On the first import module gets imported, on the second one type gets imported.
Alternative solution
Conversion to Timestamp takes time. If the API provides Linux epoch dates, why not use this date for comparison? Let's add this somewhere where you define wallet_addresses:
reference_date = "01/01/2021"
reference_date = int(time.mktime(datetime.datetime.strptime(reference_date, "%d/%m/%Y").timetuple()))
This will result in 1609448400. Other stack overflow question as reference.
This integer can now be compared with timestamps provided by the API. The only thing left is to cast str to int. We can have your code left intact with some minor changes at the end:
<< Your code without changes >>
df_main['timeStamp'] = df_main['timeStamp'].astype(int)
if min(df_main['timeStamp']) < reference_date:
pass
To make a benchmark I've changed while True: to for _ in range(0,4): to limit the infinite cycle, results are as follows:
Initial solution took 11.6 s to complete
Alternative solution took 8.85 s to complete
It's 30% faster. Casting str to int takes less time than conversion to TimeStamps, I would call this a preferable solution.
Future warning
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
It makes sense to comply with this warning. df_main = df_main.append(df_temp) has to be changed to df_main = pd.concat([df_main, df_temp]).
As for current 1.5.0 version it's already deprecated. Time to upgrade!

Related

How to count the duration of a field in a given value while having the field change history data?

I'm working with field change history data which has timestamps for when the field value was changed. In this example, I need to calculate the overall case duration in 'Termination in Progress' status.
The given case was changed from and to this status three times in total:
see screenshot
I need to add up all three durations in this case and in other cases it can be more or less than three.
Does anyone know how to calculate that in Python?

Welcome to Stack Overflow!
Based on the limited data you provided, here is a solution that should work although the code makes some assumptions that could cause errors so you will want to modify it to suit your needs. I avoided using list comprehension or array math to make it more clear since you said you're new to Python.
Assumptions:
You're pulling this data into a pandas dataframe
All Old values of "Termination in Progress" have a matching new value for all Case Numbers
import datetime
import pandas as pd
import numpy as np
fp = r'<PATH TO FILE>\\'
f = '<FILENAME>.csv'
data = pd.read_csv(fp+f)
#convert ts to datetime for later use doing time delta calculations
data['Edit Date'] = pd.to_datetime(data['Edit Date'])
# sort by the same case number and date in opposing order to make sure values for old and new align properly
data.sort_values(by = ['CaseNumber','Edit Date'], ascending = [True,False],inplace = True)
#find timestamps where Termination in progress occurs
old_val_ts = data.loc[data['Old Value'] == 'Termination in progress']['Edit Date'].to_list()
new_val_ts = data.loc[data['New Value'] == 'Termination in progress']['Edit Date'].to_list()
#Loop over the timestamps and calc the time delta
ts_deltas = list()
for i in range(len(old_val_ts)):
item = old_val_ts[i] - new_val_ts[i]
ts_deltas.append(item)
# this loop could also be accomplished with list comprehension like this:
#ts_deltas = [old_ts - new_ts for (old_ts, new_ts) in zip(old_val_ts, new_val_ts)]
print('Deltas between groups')
print(ts_deltas)
print()
#Sum the time deltas
total_ts_delta = sum(ts_deltas,datetime.timedelta())
print('Total Time Delta')
print(total_ts_delta)
Deltas between groups
[Timedelta('0 days 00:08:00'), Timedelta('0 days 00:06:00'), Timedelta('0 days 02:08:00')]
Total Time Delta
0 days 02:22:00
I've also attached a picture of the solution minus my file path for obvious reasons. Hope this helps. Please remember to mark as correct if this solution works for you. Otherwise let me know what issues you run into.
EDIT:
If you have multiple case numbers you want to look at, you could do it in various ways, but the simplest would be to just get a list of unique case numbers with data['CaseNumber'].unique() then iterate over that array filtering for each case number and appending the total time delta to a new list or a dictionary (not necessarily the most efficient solution, but it will work).
cases_total_td = {}
unique_cases = data['CaseNumber'].unique()
for case in unique_cases:
temp_data = data[data['CaseNumber'] == case]
#find timestamps where Termination in progress occurs
old_val_ts = data.loc[data['Old Value'] == 'Termination in progress']['Edit Date'].to_list()
new_val_ts = data.loc[data['New Value'] == 'Termination in progress']['Edit Date'].to_list()
#Loop over the timestamps and calc the time delta
ts_deltas = list()
for i in range(len(old_val_ts)):
item = old_val_ts[i] - new_val_ts[i]
ts_deltas.append(item)
ts_deltas = [old_ts - new_ts for (old_ts, new_ts) in zip(old_val_ts, new_val_ts)]
#Sum the time deltas
total_ts_delta = sum(ts_deltas,datetime.timedelta())
cases_total_td[case] = total_ts_delta
print(cases_total_td)
{1005222: Timedelta('0 days 02:22:00')}

How do I convert from int64 back to timestamp or datetime'?

I am working on a project to look at how much a pitcher's different pitches break each game. I looked here for an earlier error which fixed my error but it gives me some weird numbers. What I mean is like when I print what I hope to be August 3rd,2020 I get 1.5964128e+18. Here's how I got there.
hughes2020=pd.read_csv(r"C:/Users/Stratus/Downloads/Hughes2020Test.csv",parse_dates=['game_date'])
game=hughes2020['game_date'].astype(np.int64)
#Skipping to next part to an example
elif name[i]=="Curveball":
if (c<curve)
xcurve[c]=totalx[i]
ycurve[c]=totaly[i]
cudate[c]=game[i]
c+=1
and when I print the cudate it gives me the large number and I am wondering how I can change it back.
And if I run it as
game=hughes2020['game_date'] #.astype(np.int64)
#Skipping to next part to an example
elif name[i]=="Curveball":
if (c<curve)
xcurve[c]=totalx[i]
ycurve[c]=totaly[i]
cudate[c]=game[i]
c+=1
It gives me an
TypeError: float() argument must be a string or a number, not 'Timestamp'

To convert int to datetime use pd.to_datetime():
df = pd.DataFrame(data=[1.5964128e+18], columns = ['t'])
df['t2'] = pd.to_datetime(df['t'])
t t2
0 1.596413e+18 2020-08-03
However a better solution would be to convert the dates at the time of csv reading (As #sinanspd correctly pointed out). Use parse_dates and other related options in pd.read_csv(). Function manual is here

Trouble obtaining counts using multiple datetime columns as conditionals

I am attempting to collect counts of occurrences of an id between two time periods in a dataframe. I have a moderately sized dataframe (about 400 unique ids and just short of 1m rows) containing a time of occurrence and an id for the account which caused the occurrence. I am attempting to get a count of occurrences for multiple time periods (1 hour, 6 hour, 1 day, etc.) prior a specific occurrence and have run into lots of difficulties.
I am using Python 3.7, and for this instance I only have the pandas package loaded. I have tried using for loops and while it likely would have worked (eventually), I am looking for something a bit more efficient time-wise. I have also tried using list comprehension and have run into some errors that I did not anticipate when dealing with datetimes columns. Examples of both are below.
## Sample data
data = {'id':[ 'EAED813857474821E1A61F588FABA345', 'D528C270B80F11E284931A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '7B9C7C02F19711E38C670EDFB82A24A9', '80B409D1EC3D4CC483239D15AAE39F2E', '314EB192F25F11E3B68A0EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', '156097CF030E4519DBDF84419B855E10', 'EE80E4C0B82B11E28C561A7D66640965', 'CA9F2DF6B82011E28C561A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '314EB192F25F11E3B68A0EDFB82A24A9', 'D528C270B80F11E284931A7D66640965', '3A024345C1E94CED8C7E0DA3A96BBDCA', '314EB192F25F11E3B68A0EDFB82A24A9', '47C18B6B38E540508561A9DD52FD0B79', 'B72F6EA5565B49BBEDE0E66B737A8E6B', '47C18B6B38E540508561A9DD52FD0B79', 'B92CB51EFA2611E2AEEF1A7D66640965', '136EDF0536F644E0ADE6F25BB293DD17', '7B9C7C02F19711E38C670EDFB82A24A9', 'C5FAF9ACB88D4B55AB8196DBFFE5B3C0', '1557D4ECEFA74B40C718A4E5425F3ACB', '68D30EE473FE11E49C060EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', 'CAF9D8CD627B422DFE1D587D25FC4035', 'C620D865AEE1412E9F3CA64CB86DC484', '47C18B6B38E540508561A9DD52FD0B79', 'CA9F2DF6B82011E28C561A7D66640965', '06E2501CB81811E290EF1A7D66640965', '68EEE17873FE11E4B5B90AFEF9534BE1', '47C18B6B38E540508561A9DD52FD0B79', '1BFE9CB25AD84B64CC2D04EF94237749', '7B20C2BEB82811E28C561A7D66640965', '261692EA8EE447AEF3804836E4404620', '74D7C3901F234993B4788EFA9E6BEE9E', 'CAF9D8CD627B422DFE1D587D25FC4035', '76AAF82EB8C511E2A76C1A7D66640965', '4BD38D6D44084681AFE13C146542A565', 'B8D27E80B82911E28C561A7D66640965' ], 'datetime':[ "24/06/2018 19:56", "24/05/2018 03:45", "12/01/2019 14:36", "18/08/2018 22:42", "19/11/2018 15:43", "08/07/2017 21:32", "15/05/2017 14:00", "25/03/2019 22:12", "27/02/2018 01:59", "26/05/2019 21:50", "11/02/2017 01:33", "19/11/2017 19:17", "04/04/2019 13:46", "08/05/2019 14:12", "11/02/2018 02:00", "07/04/2018 16:15", "29/10/2016 20:17", "17/11/2018 21:58", "12/05/2017 16:39", "28/01/2016 19:00", "24/02/2019 19:55", "13/06/2019 19:24", "30/09/2016 18:02", "14/07/2018 17:59", "06/04/2018 22:19", "25/08/2017 17:51", "07/04/2019 02:24", "26/05/2018 17:41", "27/08/2014 06:45", "15/07/2016 19:30", "30/10/2016 20:08", "15/09/2018 18:45", "29/01/2018 02:13", "10/09/2014 23:10", "11/05/2017 22:00", "31/05/2019 23:58", "19/02/2019 02:34", "02/02/2019 01:02", "27/04/2018 04:00", "29/11/2017 20:35"]}
df = pd.dataframe(data)
df = df.sort_values(['id', 'datetime'], ascending=True)
# for loop attempt
totalAccounts = df['id'].unique()
for account in totalAccounts:
oneHourCount=0
subset = df[df['id'] == account]
for i in range(len(subset)):
onehour = subset['datetime'].iloc[i] - timedelta(hours=1)
for j in range(len(subset)):
if (subset['datetime'].iloc[j] >= onehour) and (subset['datetime'].iloc[j] < sub):
oneHourCount+=1
#list comprehension attempt
df['onehour'] = df['datetime'] - timedelta(hours=1)
for account in totalAccounts:
onehour = sum([1 for x in subset['datetime'] if x >= subset['onehour'] and x < subset['datetime']])
I am getting either 1) incredibly long runtime with the for loop or 2) an ValueError regarding the truth of a series being ambiguous. I know the issue is dealing with the datetimes, and perhaps it is just going to be slow-going, but I want to check here first just to make sure.

So I was able to figure this out using bisection. If you have a similar question please PM me and I'd be more than happy to help.
Solution:
left = bisect_left(keys, subset['start_time'].iloc[i]) ## calculated time
right = bisect_right(keys, subset['datetime'].iloc[i]) ## actual time of occurrence
count=len(subset['datetime'][left:right]

How to correctly modify timestamps on streaming data to create unique indexes

The type of data we are streaming in is taken from our PI System which is outputting data in an irregular manner. This is not uncommon with time series data, so I have attempted to add 1 second or so to each time stamp to ensure the index is unique. However this has not worked as I hoped as I keep received a type error.
I have attempted to implement the solutions highlighted in (Modifying timestamps in pandas to make index unique) however without any success.
The error message I get is:
TypeError: ufunc add cannot use operands with types dtype('O') and dtype('<m8')
The code implementation is below:
values = Slugging_Sep.index.duplicated(keep=False).astype(float)
values[values==0] = np.NaN
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff
# print result
result = Slugging_Sep.index + np.cumsum(values).astype(np.timedelta64)
print(result)
What I have tried
Type Casting - I thought that the calculation was due to two
different types being added together but this hasn't resolved the
issue.
Using Time Delta in Pandas - This creates the same Type Error.
pd.to_timedelta(Slugging_Sep.groupby('Time').cumcount(), unit='ms'))
Slugging_Sep['Time'] = (str(Slugging_Sep['Time'] +
pd.to_timedelta(Slugging_Sep.groupby('Time').cumcount(), unit='ms')))
So I have two questions from this:
Could anyone provide some advice to me regarding how to solve this
for future time series issues?
What actually is dtype ('<m8')
Thank you.

Using Alex Zisman's suggestion, I reconverted the Slugging_Sep.index via the following line:
pd.to_datetime(Slugging_Sep['Time'])
Slugging_Sep.set_index('Time', inplace=True)
I then implemented the following code taken from the above SO link I mentioned:
#values = Slugging_Sep.index.duplicated(keep=False).astype(float)
#values[values==0] = np.NaN
#missings = np.isnan(values)
#cumsum = np.cumsum(~missings)
#diff = np.diff(np.concatenate(([0.], cumsum[missings])))
#values[missings] = -diff
# print result
#result = Slugging_Sep.index + np.cumsum(values).astype(np.timedelta64())
#Slugging_Sep.index = result
#print(Slugging_Sep.index)
This resolved the issue and added nanoseconds to each duplicate time stamp so it became a unique index.

How do I execute my function on an array of data in Python?

In MATLAB, if I write my own function I can pass an array to that function and it automagically just deals with it. I'm trying to do the same thing in Python, the other Data Science language, and it's not dealing with it.
Is there an easy way to do this without having to make loops every time I want to operate on all values in an array? That's a lot more work! It seems like the great minds working in Python would have had something like this need before.
I tried switching the data type to a list() since that's iterable, but that didn't seem to work. I'm still getting an error that basically says it doesn't want an array object.
Here's my code:
import scipy
from collections import deque
import numpy as np
import os
from datetime import date, timedelta
def GetExcelData(filename,rowNum,titleCol):
csv = np.genfromtxt(filename, delimiter= ",")
Dates = deque(csv[rowNum,:])
if titleCol == True:
Dates.popleft()
return list(Dates)
def from_excel_ordinal(ordinal, _epoch=date(1900, 1, 1)):
if ordinal > 59: #the function stops working here when I pass my array
ordinal -= 1 # Excel leap year bug, 1900 is not a leap year!
return _epoch + timedelta(days=ordinal - 1) # epoch is day 1
os.chdir("C:/Users/blahblahblah")
filename = "SamplePandL.csv"
Dates = GetExcelData(filename,1,1)
Dates = from_excel_ordinal(Dates) #this is the call with an issue
print(Dates)

Well you can use map function for that.
Dates = from_excel_ordinal(Dates)
Replace the above line from the code with the following code
Dates = list(map(from_excel_ordinal,Dates))
In the above code, for each value present in Dates, function from_excel_ordinal will be called. Finally it gets converted into list and stored into Dates.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: converting timestamp to date time not working - python

Related

How to count the duration of a field in a given value while having the field change history data?

How do I convert from int64 back to timestamp or datetime'?

Trouble obtaining counts using multiple datetime columns as conditionals

How to correctly modify timestamps on streaming data to create unique indexes

How do I execute my function on an array of data in Python?

Categories

Resources