I want to select only the rows whose timestamp falls within the last 36 hours. My PySpark DataFrame df has a column unix_timestamp, which is a timestamp in seconds.
This is my current code, but it fails with the error AttributeError: 'DataFrame' object has no attribute 'timestamp'. I tried changing it to unix_timestamp, but it still fails.
import datetime
hours_36 = (datetime.datetime.now() - datetime.timedelta(hours = 36)).strftime("%Y-%m-%d %H:%M:%S")
df = df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp")).filter(df.timestamp > hours_36)
The timestamp column doesn't exist yet at the point where you refer to it, because df inside the chained call still points to the original DataFrame. You can either use pyspark.sql.functions.col to refer to the column dynamically, without specifying which DataFrame object it belongs to:
import pyspark.sql.functions as F
df = df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp")).filter(F.col("unix_timestamp") > hours_36)
Or without creating the intermediate column:
df.filter(df.unix_timestamp.cast("timestamp") > hours_36)
The API doc shows that you can also use a string expression for filtering:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.filter
import pyspark.sql.functions as F
df = df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp")) \
       .filter("unix_timestamp > '%s'" % hours_36)
It may not be as efficient, though.
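As an aside, the string comparison only works because Spark implicitly casts the values; a small sketch that avoids both the cast and the string by comparing raw seconds, assuming unix_timestamp really does hold seconds since the epoch:

import time

# Cutoff in epoch seconds, 36 hours before now.
cutoff = time.time() - 36 * 3600

# Compare the seconds column directly; no casting or string formatting needed.
df_recent = df.filter(df.unix_timestamp > cutoff)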
I have a very simple dataset with just one column, and I would like to loop over each row of the DataFrame so that for each row it calculates the log of current_close_price / first_row_close_price. Whatever I do, it says:
TypeError: 'numpy.float64' object is not callable
import pandas as pd
import numpy as np
price.head()
                 Close
Date
2010-07-19  107.290001
2010-07-20  108.480003
2010-07-21  107.070000
2010-07-22  109.459999
2010-07-23  110.410004
for index, row in price.iterrows():
    first_row_price = price.iloc[0, 0]
    current_price = price.iloc[index, 0]
    log_rt = np.log(current_price / reference_price)
Assuming we have the table in the file a.csv, which has two columns, Date and Close, and writing first_row_price instead of reference_price in your code:
with open("a.csv", 'r') as a:
price = pd.read_csv(a, usecols=[1]) # which get data related to 'Close' column
for index, row in price.iterrows():
first_row_price = price.iloc[0, 0]
current_price = price.iloc[index, 0]
log_rt = np.log(current_price / first_row_price)
This code produces the following output:
0.0
0.011030393877764241
-0.002052631799009411
0.020023718610826604
0.02866528771045947
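For reference, the same log returns can be computed without an explicit Python loop, using vectorised NumPy operations on the whole Close column (a sketch, assuming price is the DataFrame read above):

# Divide every close by the first close, then take the natural log element-wise.
log_rt = np.log(price['Close'] / price['Close'].iloc[0])
print(log_rt)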
I have the following summary for a dataset, using PySpark on Databricks:
OrderMonthYear                SaleAmount
2012-11-01T00:00:00.000+0000  473760.5700000001
2010-04-01T00:00:00.000+0000  490967.0900000001
I'm getting a DataFrame error with this map call, which is meant to convert OrderMonthYear to an integer type:
results = summary.map(lambda r: (int(r.OrderMonthYear.replace('-','')), r.SaleAmount)).toDF(["OrderMonthYear","SaleAmount"])
Any ideas? The error is:
AttributeError: 'DataFrame' object has no attribute 'map'
I found a solution here: Pyspark date yyyy-mmm-dd conversion
from pyspark.sql.functions import col, unix_timestamp, from_unixtime, date_format
df = summary.withColumn('date', from_unixtime(unix_timestamp("OrderMonthYear", 'yyyy-MMM')))
df2 = df.withColumn("new_date_str", date_format(col("date"), "yyyyMMdd"))
display(df2)
Thank you @mck for the help! Cheers.
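For the record, the original AttributeError happens because in Spark 2.x a DataFrame no longer exposes .map; you would have to go through summary.rdd.map(...). A sketch of a DataFrame-only way to get the integer yyyyMMdd value the original map was aiming for, assuming OrderMonthYear is a string column formatted like the values shown above:

from pyspark.sql import functions as F

# Parse the ISO-style string (the 'Z' pattern letter matches the "+0000" offset),
# format it as yyyyMMdd, and cast the result to an integer.
results = summary.withColumn(
    "OrderMonthYear",
    F.date_format(
        F.to_timestamp("OrderMonthYear", "yyyy-MM-dd'T'HH:mm:ss.SSSZ"),
        "yyyyMMdd"
    ).cast("int")
).select("OrderMonthYear", "SaleAmount")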
I am trying to convert a new column in a dataframe through a function based on the values in the date column, but get an error indicating "Timestamp object has no attribute dt." However, if I run this outside of a function, the dt attribute works fine.
Any guidance would be appreciated.
This code runs with no issues:
import pandas as pd

sample = {'Date': ['2015-07-02 11:47:00', '2015-08-02 11:30:00']}
dftest = pd.DataFrame.from_dict(sample)
dftest['Date'] = pd.to_datetime(dftest['Date'])
display(dftest.info())
dftest['year'] = dftest['Date'].dt.year
dftest['month'] = dftest['Date'].dt.month
This code gives me the error message:
sample = {'Date': ['2015-07-02 11:47:00', '2015-08-02 11:30:00']}
dftest = pd.DataFrame.from_dict(sample)
dftest['Date'] = pd.to_datetime(dftest['Date'])
def CALLYMD(dftest):
    if dftest['Date'].dt.month > 9:
        return str(dftest['Date'].dt.year) + '1231'
    elif dftest['Date'].dt.month > 6:
        return str(dftest['Date'].dt.year) + '0930'
    elif dftest['Date'].dt.month > 3:
        return str(dftest['Date'].dt.year) + '0630'
    else:
        return str(dftest['Date'].dt.year) + '0331'

dftest['CALLYMD'] = dftest.apply(CALLYMD, axis=1)
Lastly, I'm open to any suggestions on how to make this code better as I'm still learning.
I'm guessing you should remove .dt in the second case. With apply(..., axis=1) the function is applied to one row at a time, so dftest['Date'] is a single Timestamp rather than a Series. The .dt accessor is only needed on a Series of datetimes; on a single Timestamp you access .year and .month directly, otherwise it raises:
AttributeError: 'Timestamp' object has no attribute 'dt'
reference: https://stackoverflow.com/a/48967889/13720936
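A minimal sketch of the corrected function, with .dt removed because apply(..., axis=1) hands the function one row at a time, so row['Date'] is a single Timestamp:

def CALLYMD(row):
    # row['Date'] is a single pandas Timestamp here, so .year and .month
    # are accessed directly, without the .dt accessor.
    if row['Date'].month > 9:
        return str(row['Date'].year) + '1231'
    elif row['Date'].month > 6:
        return str(row['Date'].year) + '0930'
    elif row['Date'].month > 3:
        return str(row['Date'].year) + '0630'
    else:
        return str(row['Date'].year) + '0331'

dftest['CALLYMD'] = dftest.apply(CALLYMD, axis=1)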
After looking at the Timestamp documentation, I found that removing the .dt and just using .year and .month works. However, I'm still confused as to why it works in the first snippet but not in the second.
Here is how to create a YearMonth bucket from the year and month:
for key, item in df.iterrows():
    year = pd.to_datetime(item['Date']).year
    month = str(pd.to_datetime(item['Date']).month)
    df.loc[key, 'YearMonth'] = "{:.0f}{}".format(year, month.zfill(2))
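The same YearMonth bucket can also be built without iterrows(), as a vectorised alternative (a sketch, assuming df['Date'] holds parseable dates):

# Convert the whole column once, then format each value as YYYYMM.
df['YearMonth'] = pd.to_datetime(df['Date']).dt.strftime('%Y%m')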
I want to convert all the items in the 'Time' column of my pandas DataFrame from UTC to Eastern time. However, when following the answer in this Stack Overflow post, some of the keyword arguments are not recognized in pandas 0.20.3. Overall, how should I do this task?
tweets_df = pd.read_csv('valid_tweets.csv')
tweets_df['Time'] = tweets_df.to_datetime(tweets_df['Time'])
tweets_df.set_index('Time', drop=False, inplace=True)
error is:
tweets_df['Time'] = tweets_df.to_datetime(tweets_df['Time'])
File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 3081, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'to_datetime'
items from the Time column look like this:
2016-10-20 03:43:11+00:00
Update:
using
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'])
tweets_df.set_index('Time', drop=False, inplace=True)
tweets_df.index = tweets_df.index.tz_localize('UTC').tz_convert('US/Eastern')
performed no time conversion. Any idea what needs to be fixed?
Update 2:
So the following code does not do the conversion in place, meaning that when I print row['Time'] using iterrows() it still shows the original values. Do you know how to do the conversion in place?
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'])

for index, row in tweets_df.iterrows():
    row['Time'].tz_localize('UTC').tz_convert('US/Eastern')

for index, row in tweets_df.iterrows():
    print(row['Time'])
to_datetime is a function defined in pandas, not a method on a DataFrame. Try:
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'])
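Regarding the updates: tz_localize and tz_convert return new objects rather than modifying anything in place, and assigning to row['Time'] inside iterrows() never writes back to the DataFrame. A sketch that converts the column itself and assigns it back, assuming the Time values look like 2016-10-20 03:43:11+00:00:

# Parse as UTC-aware datetimes, then convert the whole column and assign it back.
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'], utc=True)
tweets_df['Time'] = tweets_df['Time'].dt.tz_convert('US/Eastern')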
I've got some order data that I want to analyse.
What I currently want to know is: how often was each SKU bought in each month?
Here a small example:
import datetime
import pandas as pd
import numpy as np
d = {'sku': ['RT-17']}
df_skus = pd.DataFrame(data=d)
print(df_skus)
d = {'date': ['2017/02/17', '2017/03/17', '2017/04/17', '2017/04/18', '2017/05/02'], 'item_sku': ['HT25', 'RT-17', 'HH30', 'RT-17', 'RT-19']}
df_orders = pd.DataFrame(data=d)
print(df_orders)
for i in df_orders.index:
    print("\n toll")
    df_orders.loc[i, 'date'] = pd.to_datetime(df_orders.loc[i, 'date'])
df_orders = df_orders[df_orders["item_sku"].isin(df_skus["sku"])]
monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date",freq="M")]).size()
monthly_sales = monthly_sales.unstack(0)
print(monthly_sales)
That works fine, but if I use my real order data (from a CSV) I get, after a few minutes:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
That problem comes from the line:
monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date",freq="M")]).size()
Is it possible to skip over the error?
I tried a try except block:
try:
    monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date", freq="M")]).size()
    monthly_sales = monthly_sales.unstack(0)
except:
    print("\n Here seems to be one issue")
Then print(monthly_sales) gives:
Empty DataFrame
Columns: [txn_id, date, item_sku, quantity]
Index: []
So it seems something in my data empties or breaks the grouping?
How can I 'clean' my data?
Or I'd even be fine with losing a sale here and there if I can just 'skip' over the error; is that possible?
When reading your CSV, use the parse_dates argument -
df_order = pd.read_csv('file.csv', parse_dates=['date'])
Which automatically converts date to datetime. If that doesn't work, then you'll need to load it in as a string, and then use the errors='coerce' argument with pd.to_datetime -
df_order['date'] = pd.to_datetime(df_order['date'], errors='coerce')
Note that you can pass Series objects (amongst other things) to pd.to_datetime.
Next, filter and group as you've been doing, and it should work.
df_orders[df_orders["item_sku"].isin(df_skus["sku"])]\
.groupby(['item_sku', pd.Grouper(key='date', freq='M')]).size()
item_sku date
RT-17 2017-03-31 1
2017-04-30 1
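If some dates in the real CSV still fail to parse, errors='coerce' turns them into NaT; dropping those rows effectively 'skips over' the bad sales records, as asked. A sketch, assuming the real data lives in a hypothetical file.csv with the same columns:

import pandas as pd

df_orders = pd.read_csv('file.csv')
df_orders['date'] = pd.to_datetime(df_orders['date'], errors='coerce')

# Rows whose date could not be parsed are now NaT; drop them so the
# Grouper only sees real datetimes.
df_orders = df_orders.dropna(subset=['date'])

monthly_sales = (df_orders[df_orders['item_sku'].isin(df_skus['sku'])]
                 .groupby(['item_sku', pd.Grouper(key='date', freq='M')])
                 .size()
                 .unstack(0))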