Extract quarter information from numpy datetime64 obkect - python

I have below numpy datetime64 object
import numpy as np
date_time = np.datetime64('2012-05-01T01:00:00.000000+0100')
I would like to express this in YearQuarter i.e. '2012Q2'. Is there any method available to perform this? I tried with pandas Timestamp method but it generates error:
import pandas as pd
>>> pd.Timestamp(date_time).dt.quarter
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Timestamp' object has no attribute 'dt'
Any pointer will be very helpful

There are various ways that one can achieve that, depending on the desired output type.
If one wants the type pandas._libs.tslibs.period.Period, then one can use:
pandas.Period as follows
year_quarter = pd.Period(date_time, freq='Q')
[Out]: 2012Q2
pandas.Timestamp, as user7864386 mentioned, as follows
year_quarter = pd.Timestamp(date_time).to_period('Q')
[Out]: 2012Q2
Alternatively, if one wants the final output to be a string, one will have to pass pandas.Series.dt.strftime, more specifically .strftime('%YQ%q'), such as
year_quarter = pd.Period(date_time, freq='Q').strftime('%YQ%q')
# or
year_quarter = pd.Timestamp(date_time).to_period('Q').strftime('%YQ%q')
Notes:
date_time = np.datetime64('2012-05-01T01:00:00.000000+0100') gives a
DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future
To check the variable year_quarter type, one can do the following
print(type(year_quarter))

Related

python - Getting error while taking difference between two dates in columns

this is my code, I am trying to get business days between two dates
the number of days is saved in a new column 'nd'
import numpy as np
df1 = pd.DataFrame(pd.date_range('2020-01-01',periods=26,freq='D'),columns=['A'])
df2 = pd.DataFrame(pd.date_range('2020-02-01',periods=26,freq='D'),columns=['B'])
df = pd.concat([df1,df2],axis=1)
# Iterate over each row of the DataFrame
for index , row in df.iterrows():
bc = np.busday_count(row['A'],row['B'])
df['nd'] = bc
I am getting this error.
Traceback (most recent call last):
File "<input>", line 35, in <module>
File "<__array_function__ internals>", line 5, in busday_count
TypeError: Iterator operand 0 dtype could not be cast from dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'
Is there a way to fix it or another way to get the solution?
busday_count only accepts dates, not datetimes change
bc = np.busday_count(row['A'],row['B'])
to
np.busday_count(row['A'].date(), row['B'].date())

AttributeError: 'method_descriptor' object has no attribute 'df_new' in python when replacing string

I am writing a short piece of code to remove web browser version numbers from the name in a column of data in a pandas dataframe. i.e. replace a string containing alpha and numerical characters with just the alpha characters.
I have written:
df_new=(rename())
str.replace.df_new[new_browser]('[.*0-9]',"",regex=True)
I am getting this error message and I don't understand what it's telling me
AttributeError Traceback (most recent call last)
<ipython-input-4-d8c6f9119b9f> in <module>
3 df_new=(rename())
4
----> 5 str.replace.df_new[new_browser]('[.*0-9]',"",regex=True)
AttributeError: 'method_descriptor' object has no attribute 'df_new'
The code above is following this code/function in a Jupyter Notebook
import pandas as pd
import numpy as np
import re
#write a function to change column names using a dictionary
def rename():
dict_col = {'Browser':'new_browser', 'Page':'web_page', 'URL':'web_address', 'Visitor ID':'visitor_ID'}
df = pd.read_csv ('dummy_webchat_data.csv')
for y in df.columns:
if y in dict_col:
df_new=df.rename(columns={y:dict_col}[y])
return df_new
rename()
I've been having trouble with the dataframe updates not being recognised when I next call it. Usually in JN I just keep writing the amends to the df and it retains the updates. But even the code df_new.head(1) needs to be written like this to work after the first function is run for some reason (mentioning as it feels like a similar problem even though the error messages are different):
df_new=(rename())
df_new.head(1)
can anyone help me please?
Best
Miriam
The error tells you that you are not using the Series.str.replace() method correctly.
You have:
str.replace.df_new[new_browser]('[.*0-9]',"",regex=True)
when what you want is:
df_new[new_browser].str.replace('[.*0-9]',"",regex=True)
See this:
>>> import pandas as pd
>>>
>>> s = pd.Series(['a1', 'b2', 'c3', 'd4'])
>>> s.str.replace('[.*0-9]',"",regex=True)
0 a
1 b
2 c
3 d
dtype: object
and compare it with this (which is what you get, s here is your df_new[new_browser]):
>>> import pandas as pd
>>>
>>> s = pd.Series(['a1', 'b2', 'c3', 'd4'])
>>> str.replace.s('[.*0-9]',"",regex=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'method_descriptor' object has no attribute 's'

Why is this error occuring when I am using filter in pandas: TypeError: 'int' object is not iterable

When I want to remove some elements which satisfy a particular condition, python is throwing up the following error:
TypeError Traceback (most recent call last)
<ipython-input-25-93addf38c9f9> in <module>()
4
5 df = pd.read_csv('fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv;
----> 6 df = filter(df,~('-02-29' in df['Date']))
7 '''tmax = []; tmin = []
8 for dates in df['Date']:
TypeError: 'int' object is not iterable
The following is the code :
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv');
df = filter(df,~('-02-29' in df['Date']))
What wrong could I be doing?
Following is sample data
Sample Data
Use df.filter() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html)
Also please attach the csv so we can run it locally.
Another way to do this is to use one of pandas' string methods for Boolean indexing:
df = df[~ df['Date'].str.contains('-02-29')]
You will still have to make sure that all the dates are actually strings first.
Edit:
Seeing the picture of your data, maybe this is what you want (slashes instead of hyphens):
df = df[~ df['Date'].str.contains('/02/29')]

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced

I am trying to convert a csv into numpy array. In the numpy array, I am replacing few elements with NaN. Then, I wanted to find the indices of the NaN elements in the numpy array. The code is :
import pandas as pd
import matplotlib.pyplot as plyt
import numpy as np
filename = 'wether.csv'
df = pd.read_csv(filename,header = None )
list = df.values.tolist()
labels = list[0]
wether_list = list[1:]
year = []
month = []
day = []
max_temp = []
for i in wether_list:
year.append(i[1])
month.append(i[2])
day.append(i[3])
max_temp.append(i[5])
mid = len(max_temp) // 2
temps = np.array(max_temp[mid:])
temps[np.where(np.array(temps) == -99.9)] = np.nan
plyt.plot(temps,marker = '.',color = 'black',linestyle = 'none')
# plyt.show()
print(np.where(np.isnan(temps))[0])
# print(len(pd.isnull(np.array(temps))))
When I execute this, I am getting a warning and an error. The warning is :
wether.py:26: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
temps[np.where(np.array(temps) == -99.9)] = np.nan
The error is :
Traceback (most recent call last):
File "wether.py", line 30, in <module>
print(np.where(np.isnan(temps))[0])
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
This is a part of the dataset which I am using:
83168,2014,9,7,0.00000,89.00000,78.00000, 83.50000
83168,2014,9,22,1.62000,90.00000,72.00000, 81.00000
83168,2014,9,23,0.50000,87.00000,74.00000, 80.50000
83168,2014,9,24,0.35000,82.00000,73.00000, 77.50000
83168,2014,9,25,0.60000,85.00000,75.00000, 80.00000
83168,2014,9,26,0.76000,89.00000,77.00000, 83.00000
83168,2014,9,27,0.00000,89.00000,79.00000, 84.00000
83168,2014,9,28,0.00000,90.00000,81.00000, 85.50000
83168,2014,9,29,0.00000,90.00000,79.00000, 84.50000
83168,2014,9,30,0.50000,89.00000,75.00000, 82.00000
83168,2014,10,1,0.02000,91.00000,75.00000, 83.00000
83168,2014,10,2,0.03000,93.00000,77.00000, 85.00000
83168,2014,10,3,1.40000,93.00000,75.00000, 84.00000
83168,2014,10,4,0.06000,89.00000,75.00000, 82.00000
83168,2014,10,5,0.22000,91.00000,68.00000, 79.50000
83168,2014,10,6,0.00000,84.00000,68.00000, 76.00000
83168,2014,10,7,0.17000,85.00000,73.00000, 79.00000
83168,2014,10,8,0.06000,84.00000,73.00000, 78.50000
83168,2014,10,9,0.00000,87.00000,73.00000, 80.00000
83168,2014,10,10,0.00000,88.00000,80.00000, 84.00000
83168,2014,10,11,0.00000,87.00000,80.00000, 83.50000
83168,2014,10,12,0.00000,88.00000,80.00000, 84.00000
83168,2014,10,13,0.00000,88.00000,81.00000, 84.50000
83168,2014,10,14,0.04000,88.00000,77.00000, 82.50000
83168,2014,10,15,0.00000,88.00000,77.00000, 82.50000
83168,2014,10,16,0.09000,89.00000,72.00000, 80.50000
83168,2014,10,17,0.00000,85.00000,67.00000, 76.00000
83168,2014,10,18,0.00000,84.00000,65.00000, 74.50000
83168,2014,10,19,0.00000,84.00000,65.00000, 74.50000
83168,2014,10,20,0.00000,85.00000,69.00000, 77.00000
83168,2014,10,21,0.77000,87.00000,76.00000, 81.50000
83168,2014,10,22,0.69000,81.00000,71.00000, 76.00000
83168,2014,10,23,0.31000,82.00000,72.00000, 77.00000
83168,2014,10,24,0.71000,79.00000,73.00000, 76.00000
83168,2014,10,25,0.00000,81.00000,68.00000, 74.50000
83168,2014,10,26,0.00000,82.00000,67.00000, 74.50000
83168,2014,10,27,0.00000,83.00000,64.00000, 73.50000
83168,2014,10,28,0.00000,83.00000,66.00000, 74.50000
83168,2014,10,29,0.03000,86.00000,76.00000, 81.00000
83168,2014,10,30,0.00000,85.00000,69.00000, 77.00000
83168,2014,10,31,0.00000,85.00000,69.00000, 77.00000
83168,2014,11,1,0.00000,86.00000,59.00000, 72.50000
83168,2014,11,2,0.00000,77.00000,52.00000, 64.50000
83168,2014,11,3,0.00000,70.00000,52.00000, 61.00000
83168,2014,11,4,0.00000,77.00000,59.00000, 68.00000
83168,2014,11,5,0.02000,79.00000,73.00000, 76.00000
83168,2014,11,6,0.02000,82.00000,75.00000, 78.50000
83168,2014,11,7,0.00000,83.00000,66.00000, 74.50000
83168,2014,11,8,0.00000,84.00000,65.00000, 74.50000
83168,2014,11,9,0.00000,84.00000,65.00000, 74.50000
83168,2014,11,10,1.20000,72.00000,65.00000, 68.50000
83168,2014,11,11,0.08000,77.00000,61.00000, 69.00000
83168,2014,11,12,0.00000,80.00000,61.00000, 70.50000
83168,2014,11,13,0.00000,83.00000,63.00000, 73.00000
83168,2014,11,14,0.00000,83.00000,65.00000, 74.00000
83168,2014,11,15,0.00000,82.00000,64.00000, 73.00000
83168,2014,11,16,0.00000,83.00000,64.00000, 73.50000
83168,2014,11,17,0.07000,84.00000,64.00000, 74.00000
83168,2014,11,18,0.00000,86.00000,71.00000, 78.50000
83168,2014,11,19,0.57000,78.00000,55.00000, 66.50000
83168,2014,11,20,0.05000,72.00000,56.00000, 64.00000
83168,2014,11,21,0.05000,77.00000,63.00000, 70.00000
83168,2014,11,22,0.22000,77.00000,69.00000, 73.00000
83168,2014,11,23,0.06000,79.00000,76.00000, 77.50000
83168,2014,11,24,0.02000,84.00000,78.00000, 81.00000
83168,2014,11,25,0.00000,86.00000,78.00000, 82.00000
83168,2014,11,26,0.07000,85.00000,77.00000, 81.00000
83168,2014,11,27,0.21000,82.00000,55.00000, 68.50000
83168,2014,11,28,0.00000,73.00000,53.00000, 63.00000
83168,2015,1,8,0.00000,80.00000,57.00000,
83168,2015,1,9,0.05000,72.00000,56.00000,
83168,2015,1,10,0.00000,72.00000,57.00000,
83168,2015,1,11,0.00000,80.00000,57.00000,
83168,2015,1,12,0.05000,80.00000,59.00000,
83168,2015,1,13,0.85000,81.00000,69.00000,
83168,2015,1,14,0.05000,81.00000,68.00000,
83168,2015,1,15,0.00000,81.00000,64.00000,
83168,2015,1,16,0.00000,78.00000,63.00000,
83168,2015,1,17,0.00000,73.00000,55.00000,
83168,2015,1,18,0.00000,76.00000,55.00000,
83168,2015,1,19,0.00000,78.00000,55.00000,
83168,2015,1,20,0.00000,75.00000,56.00000,
83168,2015,1,21,0.02000,73.00000,65.00000,
83168,2015,1,22,0.00000,80.00000,64.00000,
83168,2015,1,23,0.00000,80.00000,71.00000,
83168,2015,1,24,0.00000,79.00000,72.00000,
83168,2015,1,25,0.00000,79.00000,49.00000,
83168,2015,1,26,0.00000,79.00000,49.00000,
83168,2015,1,27,0.10000,75.00000,53.00000,
83168,2015,1,28,0.00000,68.00000,53.00000,
83168,2015,1,29,0.00000,69.00000,53.00000,
83168,2015,1,30,0.00000,72.00000,60.00000,
83168,2015,1,31,0.00000,76.00000,58.00000,
83168,2015,2,1,0.00000,76.00000,58.00000,
83168,2015,2,2,0.05000,77.00000,58.00000,
83168,2015,2,3,0.00000,84.00000,56.00000,
83168,2015,2,4,0.00000,76.00000,56.00000,
I am unable to rectify the error. How to overcome the warning in the 26th line? How can one solve this error?
Update :
when I try the same thing in different way like reading dataset from file instead of converting to dataframes, I am not getting the error. What would be the reason for that? The code is :
weather_filename = 'wether.csv'
weather_file = open(weather_filename)
weather_data = weather_file.read()
weather_file.close()
# Break the weather records into lines
lines = weather_data.split('\n')
labels = lines[0]
values = lines[1:]
n_values = len(values)
# Break the list of comma-separated value strings
# into lists of values.
year = []
month = []
day = []
max_temp = []
j_year = 1
j_month = 2
j_day = 3
j_max_temp = 5
for i_row in range(n_values):
split_values = values[i_row].split(',')
if len(split_values) >= j_max_temp:
year.append(int(split_values[j_year]))
month.append(int(split_values[j_month]))
day.append(int(split_values[j_day]))
max_temp.append(float(split_values[j_max_temp]))
# Isolate the recent data.
i_mid = len(max_temp) // 2
temps = np.array(max_temp[i_mid:])
year = year[i_mid:]
month = month[i_mid:]
day = day[i_mid:]
temps[np.where(temps == -99.9)] = np.nan
# Remove all the nans.
# Trim both ends and fill nans in the middle.
# Find the first non-nan.
i_start = np.where(np.logical_not(np.isnan(temps)))[0][0]
temps = temps[i_start:]
year = year[i_start:]
month = month[i_start:]
day = day[i_start:]
i_nans = np.where(np.isnan(temps))[0]
print(i_nans)
What is wrong in the first code and why the second doesn't even give a warning?
Posting as it might help future users.
As correctly pointed out by others, np.isnan won't work for object or string dtypes. If you're using pandas, as mentioned here you can directly use pd.isnull, which should work in your case.
import pandas as pd
import numpy as np
var1 = ''
var2 = np.nan
>>> type(var1)
<class 'str'>
>>> type(var2)
<class 'float'>
>>> pd.isnull(var1)
False
>>> pd.isnull(var2)
True
Try replacing np.isnan with pd.isna. Pandas' isna supports category dtypes
What's the dtype of temps. I can reproduce your warning and error with a string dtype:
In [26]: temps = np.array([1,2,'string',0])
In [27]: temps
Out[27]: array(['1', '2', 'string', '0'], dtype='<U21')
In [28]: temps==-99.9
/usr/local/bin/ipython3:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
#!/usr/bin/python3
Out[28]: False
In [29]: np.isnan(temps)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-29-2ff7754ed926> in <module>()
----> 1 np.isnan(temps)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
First, comparing strings with the number gives this future warning.
Second, testing for nan produces the error.
Note that given the dtype, the nan assignment assigns a string value, not a float (np.nan is a float).
In [30]: temps[-1] = np.nan
In [31]: temps
Out[31]: array(['1', '2', 'string', 'nan'], dtype='<U21')
isnan(ndarray) fails on ndarray dtype of "object"
isnan(ndarray.astype(np.float)), but strings cannot be coerced to float.
This is likely a result of an unwanted float to string conversion. To repair it, just reverse it by adding string-to-float conversion (assuming data is convertible to a number) using float or np.float64:
np.isnan(float(str(np.nan)))
True
or
np.isnan(float(str("nan")))
True
rather than:
np.isnan(str(np.nan))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [164], line 1
----> 1 np.isnan(str(np.nan))
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Note that if your data is NOT convertible to numbers (floats), you need to use a string-compatible function such as pd.isna instead of np.isnan.
I came across this error when trying to transform my dataset using sklearn.preprocessing.OneHotEncoder. The error was thrown by _check_unknown function defined in sklearn.utils._encode.
This was caused by the fact that, at transform time, one of the columns to be transformed had a type float64 as opposed to object - in my case an entire column was NaN.
The solution was to cast the dataframe to object type before invoking transform:
ohe.transform(data.astype("O"))
Note: This answer is somewhat related to the title of the question because this error prompts when working with Decimal types.
I got the same error when considering Decimal type values. For some reason, one column of the dataframe I'm considering comes as decimal. For example, when calling .unique() on this column I got
[Decimal('0'), Decimal('95'), Decimal('38'), Decimal('25'),
Decimal('42'), Decimal('11'), Decimal('18'), Decimal('22'),
.....Decimal('220'), Decimal('724')]
As the traceback of the error showed me that it failed when calling some numpy function. I manage to reproduce the error by considering the min and maxvalues of the above array
from decimal import Decimal
xmin, xmax = Decimal('0'), Decimal('724')
np.isnan([xmin, xmax])
it will prompt the error
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The solution in this case was to cast all these values to int.
df.astype({col:int for col in desired_columns_to_convert})

VAR model with pandas + statsmodels in Python

I am an avid user of R, but recently switched to Python for a few different reasons. However, I am struggling a little to run the vector AR model in Python from statsmodels.
Q#1. I get an error when I run this, and I have a suspicion it has something to do with the type of my vector.
import numpy as np
import statsmodels.tsa.api
from statsmodels import datasets
import datetime as dt
import pandas as pd
from pandas import Series
from pandas import DataFrame
import os
df = pd.read_csv('myfile.csv')
speedonly = DataFrame(df['speed'])
results = statsmodels.tsa.api.VAR(speedonly)
Traceback (most recent call last):
File "<pyshell#14>", line 1, in <module>
results = statsmodels.tsa.api.VAR(speedonly)
File "C:\Python27\lib\site-packages\statsmodels\tsa\vector_ar\var_model.py", line 336, in __init__
super(VAR, self).__init__(endog, None, dates, freq)
File "C:\Python27\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 40, in __init__
self._init_dates(dates, freq)
File "C:\Python27\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 54, in _init_dates
raise ValueError("dates must be of type datetime")
ValueError: dates must be of type datetime
Now, interestingly, when I run the VAR example from here https://github.com/statsmodels/statsmodels/blob/master/docs/source/vector_ar.rst#id5, it works fine.
I try the VAR model with a third, shorter vector, ts, from Wes McKinney's "Python for Data Analysis," page 293 and it doesn't work.
Okay, so now I'm thinking it's because the vectors are different types:
>>> speedonly.head()
speed
0 559.984
1 559.984
2 559.984
3 559.984
4 559.984
>>> type(speedonly)
<class 'pandas.core.frame.DataFrame'> #DOESN'T WORK
>>> type(data)
<type 'numpy.ndarray'> #WORKS
>>> ts
2011-01-02 -0.682317
2011-01-05 1.121983
2011-01-07 0.507047
2011-01-08 -0.038240
2011-01-10 -0.890730
2011-01-12 -0.388685
>>> type(ts)
<class 'pandas.core.series.TimeSeries'> #DOESN'T WORK
So I convert speedonly to an ndarray... and it still doesn't work. But this time I get another error:
>>> nda_speedonly = np.array(speedonly)
>>> results = statsmodels.tsa.api.VAR(nda_speedonly)
Traceback (most recent call last):
File "<pyshell#47>", line 1, in <module>
results = statsmodels.tsa.api.VAR(nda_speedonly)
File "C:\Python27\lib\site-packages\statsmodels\tsa\vector_ar\var_model.py", line 345, in __init__
self.neqs = self.endog.shape[1]
IndexError: tuple index out of range
Any suggestions?
Q#2. I have exogenous feature variables in my data set that appear to be useful for predictions. Is the above model from statsmodels even the best one to use?
When you give a pandas object to a time-series model, it expects that the index is dates. The error message is improved in the current source (to be released soon).
ValueError: Given a pandas object and the index does not contain dates
In the second case, you're giving a single 1d series to a VAR. VARs are used when you have more than one series. That's why you have the shape error because it expects there to be a second dimension in your array. We could probably improve the error message here. For a single series AR model with exogenous variables, you probably want to use sm.tsa.ARMA. Note that there is a known bug in ARMA.predict for models with exogenous variables to fixed soon. If you could provide a test case for this it would be helpful.

Categories

Resources