I have the following summary for a dataset, using PySpark on Databricks:
OrderMonthYear                  SaleAmount
2012-11-01T00:00:00.000+0000    473760.5700000001
2010-04-01T00:00:00.000+0000    490967.0900000001
I'm getting a DataFrame error with this map call, which is meant to convert OrderMonthYear into an integer:
results = summary.map(lambda r: (int(r.OrderMonthYear.replace('-','')), r.SaleAmount)).toDF(["OrderMonthYear","SaleAmount"])
Any ideas? The error is:
AttributeError: 'DataFrame' object has no attribute 'map'
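For context, DataFrame.map was removed in Spark 2.x, so the lambda can only run against the underlying RDD. A minimal sketch of that route, assuming OrderMonthYear comes through as a string (or can be cast to one first):
results = summary.rdd \
    .map(lambda r: (int(str(r.OrderMonthYear)[:10].replace('-', '')), r.SaleAmount)) \
    .toDF(["OrderMonthYear", "SaleAmount"])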
Found a solution here: Pyspark date yyyy-mmm-dd conversion
from datetime import datetime
from pyspark.sql.functions import col, unix_timestamp, from_unixtime, date_format
from pyspark.sql.types import DateType

# parse OrderMonthYear into a timestamp; the 'yyyy-MMM' pattern comes from the
# linked answer, so adjust it to whatever format your column actually uses
df = summary.withColumn('date', from_unixtime(unix_timestamp("OrderMonthYear", 'yyyy-MMM')))
# render the parsed date as a compact yyyyMMdd string
df2 = df.withColumn("new_date_str", date_format(col("date"), "yyyyMMdd"))
display(df2)
Thank you @mck for the help!
cheers
Related
I'm trying to use the function to_csv to export a dataset, but the error "'str' object has no attribute 'columns'" was reported. This is my script:
import pandas as pd
# low_memory expects a boolean; the string "false" is truthy, so it behaved like True
data = pd.read_csv('Documents/Pos/ETLSIM/Dados/ETLSIM.DORES_MG_2019_t.csv', low_memory=False)
data2 = pd.read_csv('Documents/Pos/ETLSIM/ETLSIM.DORES_MG_2018_t.csv', low_memory=False)
df_concat = pd.concat([data, data2], sort=False)
df_concat.to_csv('concatenado.csv')
I'm having a hard time with my program. I'm trying to apply a UDF to a DataFrame and getting an error message as per my title. Here is my code:
import pandas as pd
import datetime as dt
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = pd.DataFrame({
    'ID': [1, 2, 2],
    'dt': [pd.Timestamp.now(), pd.Timestamp.now(), pd.Timestamp.now()]})
df.head()
def FlagUsers(df, ids, tm, gap):
    # sort by user and time, then flag a new session whenever the same user
    # reappears after more than `gap` minutes
    df = df.sort_values([ids, tm])
    df[ids] = df[ids].astype(str)
    df['timediff'] = df.groupby(ids)[tm].diff()
    df['prevtime'] = df.groupby(ids)[tm].shift()
    df['prevuser'] = df[ids].shift()
    df['prevuser'].fillna(0, inplace=True)
    df['timediff'] = df.timediff / pd.Timedelta('1 minute')
    df['timediff'].fillna(99, inplace=True)
    df['flagnew'] = np.where((df.timediff < gap) & (df['prevuser'] == df[ids]), 'existing', 'new')
    # number the sessions per user and build a user_sessionnumber key
    df.loc[df.flagnew == 'new', 'sessnum'] = df.groupby([ids, 'flagnew']).cumcount() + 1
    df['sessnum'] = df['sessnum'].fillna(method='ffill')
    df['session_key'] = df[ids].astype(str) + "_" + df['sessnum'].astype(str)
    df.drop(['prevtime', 'prevuser'], axis=1, inplace=True)
    arr = df['session_key'].values
    return arr
# Python Function works fine:
FlagUsers(df,'ID','dt',5)
s_df = spark.createDataFrame(df)
s_df.show()
spark.udf.register("FlagUsers", FlagUsers)
s_df = s_df.withColumn('session_key',FlagUsers(s_df,'ID','dt',5))
My function works fine in plain Python, but when I try to run it in Spark it does not work. I'm really sorry if this is a silly question! Thank you & best wishes
A PySpark UDF is not the same as a native Python function: it operates on column values, not on a whole DataFrame, and it needs a declared return type, so FlagUsers can't be registered as-is.
Please experiment with a pandas UDF instead.
It is several times faster and a better fit here:
https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html
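For example, one possible route here (just a sketch, assuming Spark 3.x and that FlagUsers keeps returning one session key per input row) is a grouped-map pandas UDF via applyInPandas, which hands each ID group to the function as an ordinary pandas DataFrame:
def flag_users_pdf(pdf):
    # pdf is a plain pandas DataFrame holding one ID group
    out = pdf.copy()
    out['session_key'] = FlagUsers(out, 'ID', 'dt', 5)
    return out[['ID', 'dt', 'session_key']]

result = s_df.groupBy('ID').applyInPandas(
    flag_users_pdf, schema='ID string, dt timestamp, session_key string')
result.show()
Note that FlagUsers casts the id column to string internally, which is why the output schema declares ID as string.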
I want to convert all the items in the 'Time' column of my pandas dataframe from UTC to Eastern time. However, following the answer in this stackoverflow post, some of the keywords are not known in pandas 0.20.3. Overall, how should I do this task?
tweets_df = pd.read_csv('valid_tweets.csv')
tweets_df['Time'] = tweets_df.to_datetime(tweets_df['Time'])
tweets_df.set_index('Time', drop=False, inplace=True)
error is:
tweets_df['Time'] = tweets_df.to_datetime(tweets_df['Time'])
File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 3081, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'to_datetime'
items from the Time column look like this:
2016-10-20 03:43:11+00:00
Update:
using
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'])
tweets_df.set_index('Time', drop=False, inplace=True)
tweets_df.index = tweets_df.index.tz_localize('UTC').tz_convert('US/Eastern')
did not actually convert the times. Any idea what needs to be fixed?
Update 2:
So the following code does not do the conversion in place, meaning that when I print row['Time'] using iterrows() it still shows the original values. Do you know how to do the conversion in place?
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'])
for index, row in tweets_df.iterrows():
    row['Time'].tz_localize('UTC').tz_convert('US/Eastern')
for index, row in tweets_df.iterrows():
    print(row['Time'])
to_datetime is a function defined in pandas, not a method on a DataFrame. Try:
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'])
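If you also need the timezone conversion from the updates: iterrows() hands back copies, so the tz_localize/tz_convert results are discarded unless you assign them somewhere. One way to do the whole thing on the column itself (a sketch, assuming the strings carry a +00:00 offset as shown):
tweets_df['Time'] = pd.to_datetime(tweets_df['Time'], utc=True)
tweets_df['Time'] = tweets_df['Time'].dt.tz_convert('US/Eastern')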
I want to select only those rows that have a timestamp that belongs to last 36 hours. My PySpark DataFrame df has a column unix_timestamp that is a timestamp in seconds.
This is my current code, but it fails with the error AttributeError: 'DataFrame' object has no attribute 'timestamp'. I tried to change it to unix_timestamp, but it fails all the time.
import datetime
hours_36 = (datetime.datetime.now() - datetime.timedelta(hours = 36)).strftime("%Y-%m-%d %H:%M:%S")
df = df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp")).filter(df.timestamp > hours_36)
The timestamp column doesn't exist on the DataFrame you are referring to: df still points to the frame from before withColumn, and it never had a column called timestamp anyway. You can use pyspark.sql.functions.col to refer to the column dynamically, without specifying which DataFrame object it belongs to:
import pyspark.sql.functions as F
df = df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp")).filter(F.col("unix_timestamp") > hours_36)
Or without creating the intermediate column:
df.filter(df.unix_timestamp.cast("timestamp") > hours_36)
The API doc shows that you can also use string notation for filtering:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.filter
df = (df.withColumn("unix_timestamp", df.unix_timestamp.cast("timestamp"))
        .filter("unix_timestamp > '%s'" % hours_36))
Maybe it's not as efficient, though.
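Alternatively, the cutoff can be computed entirely on the Spark side instead of formatting a string in Python first (a sketch, assuming unix_timestamp really holds epoch seconds):
from pyspark.sql import functions as F

recent = df.filter(
    F.col("unix_timestamp").cast("timestamp") >= F.current_timestamp() - F.expr("INTERVAL 36 HOURS")
)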
I'm attempting to convert a PipelinedRDD in PySpark to a DataFrame. This is the code snippet:
newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toDF()
When I run the code though, I receive this error:
'list' object has no attribute 'encode'
I've tried multiple other combinations, such as converting it to a Pandas dataframe using:
newRDD = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), )))
df = newRDD.toPandas()
But then I end up receiving this error:
AttributeError: 'PipelinedRDD' object has no attribute 'toPandas'
Any help would be greatly appreciated. Thank you for your time.
rdd.toDF() only works once a SparkSession is active, and toPandas() is a method on a DataFrame, not on an RDD, so you have to go through a DataFrame first.
To fix your code, try something like the below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # makes rdd.toDF() available
rdd = spark.sparkContext.textFile(...)       # supply your input path
newRDD = rdd.map(...)
df = newRDD.toDF()       # Spark DataFrame
pdf = df.toPandas()      # pandas DataFrame, obtained from the Spark DataFrame
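As a side note on the original 'list' object has no attribute 'encode' error: Row(...) is being handed a whole list as a single field name, so Spark later tries to encode that list as if it were a column name string. Unpacking the names and the values separately (a sketch, assuming tagScripts(row) returns one extra value per row) would look like:
newRDD = rdd.map(lambda row: Row(*(row.__fields__ + ["tag"]))(*(tuple(row) + (tagScripts(row),))))
df = newRDD.toDF()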