I am working on a project to look at how much a pitcher's different pitches break each game. I followed an earlier answer here, which fixed my original error, but now it gives me some strange numbers: when I print what I expect to be August 3rd, 2020, I get 1.5964128e+18. Here's how I got there.
hughes2020=pd.read_csv(r"C:/Users/Stratus/Downloads/Hughes2020Test.csv",parse_dates=['game_date'])
game=hughes2020['game_date'].astype(np.int64)
# Skipping ahead to the relevant part as an example
elif name[i] == "Curveball":
    if c < curve:
        xcurve[c] = totalx[i]
        ycurve[c] = totaly[i]
        cudate[c] = game[i]
        c += 1
When I print cudate it gives me the large number, and I am wondering how I can change it back.
And if I run it as
game=hughes2020['game_date'] #.astype(np.int64)
# Skipping ahead to the relevant part as an example
elif name[i] == "Curveball":
    if c < curve:
        xcurve[c] = totalx[i]
        ycurve[c] = totaly[i]
        cudate[c] = game[i]
        c += 1
it gives me the following error:
TypeError: float() argument must be a string or a number, not 'Timestamp'
To convert the integer back to a datetime, use pd.to_datetime():
df = pd.DataFrame(data=[1.5964128e+18], columns = ['t'])
df['t2'] = pd.to_datetime(df['t'])
t t2
0 1.596413e+18 2020-08-03
However, a better solution is to convert the dates at the time of CSV reading (as @sinanspd correctly pointed out). Use parse_dates and the other related options in pd.read_csv(); see the function's documentation for details.
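For example, a minimal sketch (the file name is taken from the question, path shortened): keep the parsed column as datetimes rather than casting it to int64, and if you already have int64 nanosecond values, pd.to_datetime() converts them straight back:
import pandas as pd
# parse the date column while reading the CSV
hughes2020 = pd.read_csv("Hughes2020Test.csv", parse_dates=['game_date'])
# work with the datetime column directly instead of casting to int64 ...
game = hughes2020['game_date']
# ... or, if you already cast to int64 nanoseconds, convert back like this
dates_back = pd.to_datetime(hughes2020['game_date'].astype('int64'))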
Related
I am trying to split the time_taken attribute (e.g., 02h 10m) into numbers only, using the code below.
I have checked earlier posts, and this code seemed to work fine for others, but it is not working for me.
t=pd.to_timedelta(df3['time_taken'])
df3['hours']=t.dt.components['hours']
df3['minutes']=t.dt.components['minutes']
df3.head()
I am getting the following error:
ValueError: invalid unit abbreviation: hm
I am unable to understand the error. Can anyone help me split the column into hours and mins? It would be of great help. Thanks in advance.
You can try this code. Since you mentioned that your time_taken attribute looks like 02h 10m, I have written an example you can try out.
import pandas as pd
# initializing example time data
time_taken = ['1h 10m', '2h 20m', '3h 30m', '4h 40m', '5h 50m']
# inserting the time data into a pandas DataFrame
data = pd.DataFrame(time_taken, columns=['time_taken'])
# see what the data looks like
print(data)
# initializing "Hours" and "Minutes" columns
# and assigning the value 0 to both for now
data['Hours'] = 0
data['Minutes'] = 0
# make sure the elements of the time_taken column are strings
# before splitting (they may have been read in with a different dtype)
data['time_taken'] = data['time_taken'].apply(str)
# loop through the rows to split each value into hours and minutes
for i in range(len(data)):
    temp = data.iat[i, 0]
    hours, minutes = temp.split()  # split on the space between "1h" and "10m"
    data.iat[i, 1] = int(hours.translate({ord('h'): None}))
    data.iat[i, 2] = int(minutes.translate({ord('m'): None}))
# the split data is here
print(data)
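As a side note, a vectorized sketch that avoids the loop (it assumes every entry matches the "<hours>h <minutes>m" pattern from the question):
# pull the digits out directly with a regular expression
parts = data['time_taken'].str.extract(r'(\d+)h\s*(\d+)m').astype(int)
data['Hours'] = parts[0]
data['Minutes'] = parts[1]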
I am requesting data from the api.etherscan.io website. For this, I require a free API key. I am getting information for the following wallet addresses 0xdafea492d9c6733ae3d56b7ed1adb60692c98bc5, 0xc508dbe4866528db024fb126e0eb97595668c288. Below is the code I am using:
wallet_addresses = ['0xdafea492d9c6733ae3d56b7ed1adb60692c98bc5', '0xc508dbe4866528db024fb126e0eb97595668c288']
page_number = 0
df_main = pd.DataFrame()
while True:
    for address in wallet_addresses:
        url = f'https://api.etherscan.io/api?module=account&action=txlist&address={address}&startblock=0&endblock=99999999&page={page_number}&offset=10&sort=asc&apikey={ether_api}'
        output = requests.get(url).text
        df_temp = pd.DataFrame(json.loads(output)['result'])
        df_temp['wallet_address'] = address
        df_main = df_main.append(df_temp)
    page_number += 1
    df_main['timeStamp'] = pd.to_datetime(df_main['timeStamp'], unit='s')
    if min(pd.to_datetime(df_main['timeStamp']).dt.date) < datetime.date(2022, 1, 1):
        pass
Note that you need your own (free) ether_api.
What I want to do is get data from today's date, all the way back to 2022-01-01 which is what I am trying to achieve in the if statement.
However, the above gives me an error: ValueError: unit='s' not valid with non-numerical val='2022-09-19 18:14:47'
How can this be done? I've tried multiple methods to get pandas datetime to work, but all of them gave me errors.
Here you go, it's working without an error:
page_number = 0
df_main = pd.DataFrame()
while True:
    for address in wallet_addresses:
        url = f'https://api.etherscan.io/api?module=account&action=txlist&address={address}&startblock=0&endblock=99999999&page={page_number}&offset=10&sort=asc&apikey={ether_api}'
        output = requests.get(url).text
        df_temp = pd.DataFrame(json.loads(output)['result'])
        df_temp['wallet_address'] = address
        page_number += 1
        df_temp['timeStamp'] = pd.to_datetime(df_temp['timeStamp'], unit='s')
        df_main = df_main.append(df_temp)
    if min(pd.to_datetime(df_main['timeStamp']).dt.date) < datetime(2022, 1, 1).date():
        pass
Wrong append
So, what has happened here? As suggested in the first comment under the question, we checked the type of the first record in df_main with type(df_main['timeStamp'].iloc[0]). With IPython or a Jupyter notebook you can inspect df_main right after the error occurs, since it is still populated from the for loop iteration that failed.
Otherwise, if you use PyCharm or any other IDE with a debugger, the contents of df_main can be inspected there.
What we were missing is that df_main = df_main.append(df_temp) is placed in slightly the wrong spot. On the first iteration it works well: pd.to_datetime(df_main['timeStamp'], unit='s') receives str values holding Unix epochs and converts them to pandas._libs.tslibs.timestamps.Timestamp.
But on the next iteration df_main['timeStamp'] already holds Timestamp values, and str values get appended to it, so we end up with a column of mixed types. E.g.:
type(df_main['timeStamp'].iloc[0]) == type(df_main['timeStamp'].iloc[-1])
This evaluates to False. Hence, when pd.to_datetime(..., unit='s') hits a value that is already a Timestamp, we get the error featured in the question.
To mitigate this we can place .append() below the conversion and do the conversion on df_temp instead of df_main; this way we only ever append Timestamps to the resulting DataFrame, and the if clause below will work fine.
As a side note
Another small change I made was to datetime.date(2022, 1, 1). This change was not strictly needed, but the way you work with datetime depends on how the library was imported, so it's worth mentioning:
import datetime
datetime.date(2022, 1, 1)
datetime.datetime(2022, 1, 1).date()
from datetime import datetime
datetime(2022, 1, 1).date()
All of the above are valid and produce the same date. With the first import, the module is imported; with the second, the datetime class itself is imported.
Alternative solution
Conversion to Timestamp takes time. If the API already provides Unix epoch dates, why not use them directly for the comparison? Let's add this near where you define wallet_addresses:
reference_date = "01/01/2021"
reference_date = int(time.mktime(datetime.datetime.strptime(reference_date, "%d/%m/%Y").timetuple()))
This will result in 1609448400 (the exact value depends on your local timezone, since time.mktime interprets the struct_time as local time). See this other Stack Overflow question for reference.
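As a possible shorthand (a sketch, not part of the original answer; it assumes a UTC reference point is acceptable rather than local time), pandas itself can produce an equivalent epoch value:
import pandas as pd
# Unix epoch seconds for midnight 2021-01-01 UTC (1609459200)
reference_date = int(pd.Timestamp("2021-01-01").timestamp())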
This integer can now be compared with the timestamps provided by the API. The only thing left is to cast str to int. We can leave your code intact with some minor changes at the end:
<< Your code without changes >>
df_main['timeStamp'] = df_main['timeStamp'].astype(int)
if min(df_main['timeStamp']) < reference_date:
    pass
To make a benchmark, I changed while True: to for _ in range(0, 4): to limit the infinite loop; the results are as follows:
Initial solution took 11.6 s to complete
Alternative solution took 8.85 s to complete
It's roughly 30% faster. Casting str to int takes less time than conversion to Timestamps, so I would call this the preferable solution.
Future warning
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
It makes sense to comply with this warning. df_main = df_main.append(df_temp) has to be changed to df_main = pd.concat([df_main, df_temp]).
As of the current 1.5.0 release it is already deprecated. Time to upgrade!
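For completeness, a minimal sketch of the usual pattern around that warning: collect the pieces in a list and concatenate once, which also avoids repeatedly copying the growing frame. The two small frames below are hypothetical stand-ins for the per-address df_temp.
import pandas as pd
# hypothetical stand-ins for the df_temp frames built inside the loop
frames = [pd.DataFrame({'timeStamp': ['1663611287']}),
          pd.DataFrame({'timeStamp': ['1663611288']})]
# one concat at the end instead of append on every iteration
df_main = pd.concat(frames, ignore_index=True)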
I have a dataframe with two columns containing unformatted dates.
The data in these columns looks like this:
2011-06-10T00:00:00.000+02:00
I would like to get just the date and format it.
In a Jupyter notebook I do the following:
sections['produced'] = pd.to_datetime(sections['produced'])
sections['produced'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in sections['produced']]
sections['updated'] = pd.to_datetime(sections['updated'])
sections['updated'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in sections['updated']]
sections.info()
Then I print out the sections dataframe and indeed the dates are printed correctly.
BUT:
sections.info()
still tells me that those columns are non-null objects and not datetime.
Why?
Secondly, my approach does not seem to work under the hood, i.e. the column types are not actually dates.
What should I do?
And lastly, the code is quite verbose for something that should be a one-liner, or shouldn't it? (i.e. pandas is powerful but has its limits)
EDIT 1: Answering some of the commenters. I expect a datetime with just the day, e.g. 2008-02-02.
So when doing:
sections['updated'] = pd.to_datetime(sections['updated'])
the column is converted to a datetime type.
but when doing next:
sections['produced'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in sections['produced']]
the column seems to go back to plain strings. So the aim here is to a) convert to a datetime type, b) get the date in the format 2008-01-02 (I don't care about seconds), and c) have it printed in the Jupyter notebook as such, i.e. as a date.
Just pass the errors parameter to the to_datetime() method and set it to 'coerce':
sections['produced'] = pd.to_datetime(sections['produced'],errors='coerce')
sections['updated'] = pd.to_datetime(sections['updated'],errors='coerce')
This should work as a one-liner:
df[['produced','updated']] = df[['produced','updated']].apply(lambda x: pd.to_datetime(x,errors='coerce'))
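As a side note (not part of the original answer), the reason info() kept reporting object is that the strftime list comprehension converts the column back into plain strings. If you only need the day while keeping a real datetime dtype, one option is to normalize the timestamps; a sketch using the sections frame from the question (depending on your pandas version and the +02:00 offsets, you may also need utc=True):
sections['produced'] = pd.to_datetime(sections['produced'], errors='coerce')
# keeps dtype datetime64; the time component is set to midnight
sections['produced'] = sections['produced'].dt.normalize()
# alternatively, format as 'YYYY-MM-DD' strings (dtype becomes object again)
# sections['produced'] = sections['produced'].dt.strftime('%Y-%m-%d')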
I'm trying to fix a "logger print error: not enough arguments for format string" that crops up in a JupyterLab report, and have tried a few solutions with no joy.
my dataframe looks like this:
df_1 = pd.DataFrame(df, columns = ['col1','col2','col3','col4','col5','col6','col7', 'col8', 'col9', 'col10'])
# I'm applying a % format because I only need the last four columns as percentages:
df_1['col7'] = df_1['col7'].apply("{0:.0f}%".format)
df_1['col8'] = df_1['col8'].apply("{0:.0f}%".format)
df_1['col9'] = df_1['col9'].apply("{0:.0f}%".format)
df_1['col10'] = df_1['col10'].apply("{0:.0f}%".format)
I want to maintain the table format/structure, so I'm not doing print(df_1) but rather just:
df_1
The above works fine, but I can't seem to get past the "logger print error: not enough arguments for format string" error.
P.S. I've also tried formats like "{:.2%}" or "{0:.0%}", but those turn -3 into -300% (the % format multiplies by 100).
Here is what the columns look like without any format: [screenshot omitted]
Edit: fixed by removing the '%Y-%m-%d' format string from the dataframe's source query.
If you are using Python 3, f-string formatting should do it:
df_1['col7'] = df_1['col7'].apply(lambda x: f"{x:.0f}%")
df_1['col8'] = df_1['col8'].apply(lambda x: f"{x:.0f}%")
df_1['col9'] = df_1['col9'].apply(lambda x: f"{x:.0f}%")
df_1['col10'] = df_1['col10'].apply(lambda x: f"{x:.0f}%")
The data we are streaming in comes from our PI System, which outputs data at irregular intervals. This is not uncommon with time series data, so I have attempted to add a second or so to each duplicate timestamp to ensure the index is unique. However, this has not worked as I hoped, as I keep receiving a type error.
I have attempted to implement the solutions highlighted in (Modifying timestamps in pandas to make index unique), however without any success.
The error message I get is:
TypeError: ufunc add cannot use operands with types dtype('O') and dtype('<m8')
The code implementation is below:
values = Slugging_Sep.index.duplicated(keep=False).astype(float)
values[values==0] = np.NaN
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff
# print result
result = Slugging_Sep.index + np.cumsum(values).astype(np.timedelta64)
print(result)
What I have tried
Type casting - I thought the error was due to two different types being added together, but this hasn't resolved the issue.
Using a timedelta in pandas - this creates the same TypeError:
Slugging_Sep['Time'] = (str(Slugging_Sep['Time'] +
    pd.to_timedelta(Slugging_Sep.groupby('Time').cumcount(), unit='ms')))
So I have two questions from this:
Could anyone provide some advice on how to solve this for future time series issues?
What actually is dtype('<m8')?
Thank you.
Using Alex Zisman's suggestion, I reconverted the Slugging_Sep index via the following lines:
Slugging_Sep['Time'] = pd.to_datetime(Slugging_Sep['Time'])
Slugging_Sep.set_index('Time', inplace=True)
I then implemented the following code taken from the above SO link I mentioned:
values = Slugging_Sep.index.duplicated(keep=False).astype(float)
values[values == 0] = np.NaN
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff
# build the adjusted index and print the result
result = Slugging_Sep.index + np.cumsum(values).astype(np.timedelta64())
Slugging_Sep.index = result
print(Slugging_Sep.index)
This resolved the issue: it added nanoseconds to each duplicate timestamp, so the index became unique.
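For reference, a more compact sketch of the same idea (it assumes Slugging_Sep already has a DatetimeIndex, e.g. after the set_index call above): shift the n-th duplicate of each timestamp by n nanoseconds.
import pandas as pd
# how many times each index value has already occurred (0, 1, 2, ...)
occurrence = Slugging_Sep.groupby(level=0).cumcount()
# offset each duplicate by its occurrence count, in nanoseconds
Slugging_Sep.index = Slugging_Sep.index + pd.to_timedelta(occurrence, unit='ns').values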