I am trying to create a new column in a dataframe through a function based on the values in the date column, but I get an error saying "Timestamp object has no attribute dt." However, if I run the same logic outside of a function, the dt attribute works fine.
Any guidance would be appreciated.
This code runs with no issues:
sample = {'Date': ['2015-07-02 11:47:00', '2015-08-02 11:30:00']}
dftest = pd.DataFrame.from_dict(sample)
dftest['Date'] = pd.to_datetime(dftest['Date'])
display(dftest.info())
dftest['year'] = dftest['Date'].dt.year
dftest['month'] = dftest['Date'].dt.month
This code gives me the error message:
sample = {'Date': ['2015-07-02 11:47:00', '2015-08-02 11:30:00']}
dftest = pd.DataFrame.from_dict(sample)
dftest['Date'] = pd.to_datetime(dftest['Date'])
def CALLYMD(dftest):
    if dftest['Date'].dt.month > 9:
        return str(dftest['Date'].dt.year) + '1231'
    elif dftest['Date'].dt.month > 6:
        return str(dftest['Date'].dt.year) + '0930'
    elif dftest['Date'].dt.month > 3:
        return str(dftest['Date'].dt.year) + '0630'
    else:
        return str(dftest['Date'].dt.year) + '0331'
dftest['CALLYMD'] = dftest.apply(CALLYMD, axis=1)
Lastly, I'm open to any suggestions on how to make this code better as I'm still learning.
I'm guessing you should remove .dt in the second case. With apply(axis=1) the function receives one row at a time, so dftest['Date'] is a single Timestamp rather than a Series of dates. The .dt accessor is only needed on a Series; calling it on a single Timestamp raises:
AttributeError: 'Timestamp' object has no attribute 'dt'
reference: https://stackoverflow.com/a/48967889/13720936
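For example, a sketch of the corrected function (keeping the row-wise apply from the question; inside apply(axis=1) each row's 'Date' is a single Timestamp, so .year and .month are used directly):
def CALLYMD(row):
    # row['Date'] is a Timestamp here, not a Series, so no .dt accessor is needed
    if row['Date'].month > 9:
        return str(row['Date'].year) + '1231'
    elif row['Date'].month > 6:
        return str(row['Date'].year) + '0930'
    elif row['Date'].month > 3:
        return str(row['Date'].year) + '0630'
    else:
        return str(row['Date'].year) + '0331'

dftest['CALLYMD'] = dftest.apply(CALLYMD, axis=1)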
After looking at the Timestamp documentation, I found that removing the .dt and just using .year and .month works. However, I'm still confused as to why .dt works in the first code block but not in the second.
Here is how to create a YearMonth bucket using the year and month:
for key, item in df.iterrows():
    year = pd.to_datetime(item['Date']).year
    month = str(pd.to_datetime(item['Date']).month)
    df.loc[key, 'YearMonth'] = "{:.0f}{}".format(year, month.zfill(2))
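A vectorized alternative (a sketch, assuming Date is already, or can be, parsed as datetimes) avoids the iterrows loop entirely:
df['YearMonth'] = pd.to_datetime(df['Date']).dt.strftime('%Y%m')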
Related
I am writing a function that will serve as a filter for the rows that I want to use.
The sample data frame is as follows:
df = pd.DataFrame()
df['Xstart'] = [1, 2.5, 3, 4, 5]
df['Xend'] = [6, 8, 9, 10, 12]
df['Ystart'] = [0, 1, 2, 3, 4]
df['Yend'] = [6, 8, 9, 10, 12]
df['GW'] = [1, 1, 2, 3, 4]
def filter(data, Game_week):
    pass_data = data[(data['GW'] == Game_week)]
When I call the function filter as follows, I get an error.
df1 = filter(df,1)
The error message is
AttributeError: 'NoneType' object has no attribute 'head'
but when I filter manually, it works:
pass_data = df[(df['GW'] == [1])]
This is my first issue.
My second issue is that I want to filter the rows with multiple GW values (1, 2, 3, etc.).
For that I can do it manually as follows:
pass_data = df[(df['GW'] == [1]) | (df['GW'] == [2]) | (df['GW'] == [3])]
But if I want to use a list like [1, 2, 3] as the function input, how can I write the function so that I can pass a range of 1 to 3?
Could anyone please advise?
Thanks,
Zep
Use isin to pass a list of values instead of a scalar. Also, filter is an existing built-in function in Python, so it is better to change the function name:
def filter_vals(data, Game_week):
    return data[data['GW'].isin(Game_week)]

df1 = filter_vals(df, range(1, 4))
Because you don't return anything from the function, it returns None instead of the desired dataframe. So do this (note also that the extra parentheses inside data[...] are not needed):
def filter(data, Game_week):
    return data[data['GW'] == Game_week]
Also, isin may well be better:
def filter(data, Game_week):
    return data[data['GW'].isin(Game_week)]
For the first part, use return to return data from the function. For the second, use -
def filter(data, Game_week):
    return data[data['GW'].isin(Game_week)]
Now apply the filter function -
df1 = filter(df,[1,2])
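On the sample frame above, this should give roughly the following (output shown for illustration):
print(df1)
#    Xstart  Xend  Ystart  Yend  GW
# 0     1.0     6       0     6   1
# 1     2.5     8       1     8   1
# 2     3.0     9       2     9   2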
I want to convert all the items in the 'Time' column of my pandas dataframe from UTC to Eastern time. However, when following the answer in this Stack Overflow post, some of the keywords are not recognized in pandas 0.20.3. Overall, how should I do this task?
tweets_df = pd.read_csv('valid_tweets.csv')
tweets_df['Time'] = tweets_df.to_datetime(tweets_df['Time'])
tweets_df.set_index('Time', drop=False, inplace=True)
The error is:
tweets_df['Time'] = tweets_df.to_datetime(tweets_df['Time'])
File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 3081, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'to_datetime'
items from the Time column look like this:
2016-10-20 03:43:11+00:00
Update:
using
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'])
tweets_df.set_index('Time', drop=False, inplace=True)
tweets_df.index = tweets_df.index.tz_localize('UTC').tz_convert('US/Eastern')
did not convert the times. Any idea what should be fixed?
Update 2:
So the following code does not do an in-place conversion, meaning that when I print row['Time'] using iterrows() it shows the original values. Do you know how to do the in-place conversion?
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'])
for index, row in tweets_df.iterrows():
    row['Time'].tz_localize('UTC').tz_convert('US/Eastern')

for index, row in tweets_df.iterrows():
    print(row['Time'])
to_datetime is a function defined in pandas, not a method on a DataFrame. Try:
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'])
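For the timezone part of the question, a minimal sketch (assuming the Time strings carry a +00:00 offset as in the sample) is to parse with utc=True and convert the column itself, so the converted values persist in the frame rather than only on the index:
import pandas as pd

tweets_df = pd.read_csv('valid_tweets.csv')
# parse as tz-aware UTC timestamps, then convert the whole column to US/Eastern
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'], utc=True).dt.tz_convert('US/Eastern')
tweets_df.set_index('Time', drop=False, inplace=True)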
I've got some order data that I want to analyse.
What I am currently interested in is: how often has each SKU been bought in which month?
Here is a small example:
import datetime
import pandas as pd
import numpy as np
d = {'sku': ['RT-17']}
df_skus = pd.DataFrame(data=d)
print(df_skus)
d = {'date': ['2017/02/17', '2017/03/17', '2017/04/17', '2017/04/18', '2017/05/02'], 'item_sku': ['HT25', 'RT-17', 'HH30', 'RT-17', 'RT-19']}
df_orders = pd.DataFrame(data=d)
print(df_orders)
for i in df_orders.index:
    print("\n toll")
    df_orders.loc[i, 'date'] = pd.to_datetime(df_orders.loc[i, 'date'])
df_orders = df_orders[df_orders["item_sku"].isin(df_skus["sku"])]
monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date",freq="M")]).size()
monthly_sales = monthly_sales.unstack(0)
print(monthly_sales)
That works fine, but if I use my real order data (from a CSV), after some minutes I get:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
That problem comes from the line:
monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date",freq="M")]).size()
Is it possible to skip over the error?
I tried a try except block:
try:
    monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date", freq="M")]).size()
    monthly_sales = monthly_sales.unstack(0)
except:
    print("\n Here seems to be one issue")
Then print(monthly_sales) gives me:
Empty DataFrame
Columns: [txn_id, date, item_sku, quantity]
Index: []
So it seems something in my data empties or breaks the grouping?
How can I 'clean' my data?
Or I'd even be fine with losing the data of a sale here and there if I can just 'skip' over the error. Is this possible?
When reading your CSV, use the parse_dates argument -
df_order = pd.read_csv('file.csv', parse_dates=['date'])
This automatically converts date to datetime. If that doesn't work, you'll need to load it in as a string and then use the errors='coerce' argument with pd.to_datetime -
df_order['date'] = pd.to_datetime(df_order['date'], errors='coerce')
Note that you can pass Series objects (amongst other things) to pd.to_datetime.
Next, filter and group as you've been doing, and it should work.
df_orders[df_orders["item_sku"].isin(df_skus["sku"])]\
    .groupby(['item_sku', pd.Grouper(key='date', freq='M')]).size()
item_sku date
RT-17 2017-03-31 1
2017-04-30 1
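If errors='coerce' leaves NaT values for unparseable dates, you can drop those rows before grouping, which also matches the idea of skipping the odd broken sale (a sketch, assuming the column is named date):
df_order['date'] = pd.to_datetime(df_order['date'], errors='coerce')
df_order = df_order.dropna(subset=['date'])  # discard rows whose date could not be parsed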
The data we are streaming in comes from our PI System, which outputs data in an irregular manner. This is not uncommon with time series data, so I have attempted to add 1 second or so to each time stamp to ensure the index is unique. However, this has not worked as I hoped, as I keep receiving a type error.
I have attempted to implement the solutions highlighted in (Modifying timestamps in pandas to make index unique), however without any success.
The error message I get is:
TypeError: ufunc add cannot use operands with types dtype('O') and dtype('<m8')
The code implementation is below:
values = Slugging_Sep.index.duplicated(keep=False).astype(float)
values[values==0] = np.NaN
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff
# print result
result = Slugging_Sep.index + np.cumsum(values).astype(np.timedelta64)
print(result)
What I have tried:
Type casting - I thought the error was due to two different types being added together, but this hasn't resolved the issue.
Using Timedelta in pandas - this creates the same TypeError:
Slugging_Sep['Time'] = (str(Slugging_Sep['Time'] +
    pd.to_timedelta(Slugging_Sep.groupby('Time').cumcount(), unit='ms')))
So I have two questions from this:
Could anyone provide some advice on how to solve this for future time series issues?
What actually is dtype('<m8')?
Thank you.
Using Alex Zisman's suggestion, I reconverted the Slugging_Sep index via the following lines:
Slugging_Sep['Time'] = pd.to_datetime(Slugging_Sep['Time'])
Slugging_Sep.set_index('Time', inplace=True)
I then implemented the following code taken from the above SO link I mentioned:
values = Slugging_Sep.index.duplicated(keep=False).astype(float)
values[values == 0] = np.NaN
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff

# print result
result = Slugging_Sep.index + np.cumsum(values).astype(np.timedelta64())
Slugging_Sep.index = result
print(Slugging_Sep.index)
This resolved the issue and added nanoseconds to each duplicate time stamp so it became a unique index.
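For future reference, a more compact way to de-duplicate such an index is to offset each repeated timestamp by its occurrence count (a sketch, assuming the index is already a DatetimeIndex):
import pandas as pd

# 0 for the first occurrence of each timestamp, 1 for the second, 2 for the third, ...
dup_counts = Slugging_Sep.groupby(level=0).cumcount()
# shift each duplicate by that many milliseconds so every index value becomes unique
Slugging_Sep.index = Slugging_Sep.index + pd.to_timedelta(dup_counts, unit='ms').values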
I want to select only those rows whose timestamp falls within the last 36 hours. My PySpark DataFrame df has a column unix_timestamp that is a timestamp in seconds.
This is my current code, but it fails with the error AttributeError: 'DataFrame' object has no attribute 'timestamp'. I tried changing it to unix_timestamp, but it still fails.
import datetime
hours_36 = (datetime.datetime.now() - datetime.timedelta(hours = 36)).strftime("%Y-%m-%d %H:%M:%S")
df = df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp")).filter(df.timestamp > hours_36)
The timestamp column doesn't exist yet on df when you try to refer to it. You can either use pyspark.sql.functions.col to refer to the column dynamically, without specifying which DataFrame object it belongs to:
import pyspark.sql.functions as F
df = df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp")).filter(F.col("unix_timestamp") > hours_36)
Or without creating the intermediate column:
df.filter(df.unix_timestamp.cast("timestamp") > hours_36)
The API doc tells me that you can also use string notation for filtering:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.filter
import pyspark.sql.functions as F
df = df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp")) \
    .filter("unix_timestamp > '%s'" % hours_36)
Maybe it's not as efficient, though.
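Alternatively, since unix_timestamp already holds epoch seconds, here is a sketch that skips the cast and the string formatting entirely (assuming the column really is numeric seconds since the epoch):
import time
import pyspark.sql.functions as F

cutoff = time.time() - 36 * 3600  # epoch seconds, 36 hours ago
df_recent = df.filter(F.col("unix_timestamp") > F.lit(cutoff))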