My situation is that I'm receiving transaction data from a vendor whose datetime values are in local time but carry no usable offset. For example, the ModifiedDate column may have a value of
'2020-05-16T15:04:55.7429192+00:00'
I can get the local timezone by pulling together some other data about the store in which the transaction occurs:
timezone_local = tz.timezone(tzDf[0]["COUNTRY"] + '/' + tzDf[0]["TIMEZONE"])
I then wrote a function to take those two values and give it the proper timezone:
from datetime import datetime
import dateutil.parser as parser
import pytz as tz
def convert_naive_to_aware(datetime_local_str, timezone_local):
    # Parse the string once instead of re-parsing it for every component
    parsed = parser.parse(datetime_local_str)
    yy = parsed.year
    mo = parsed.month   # month and minute previously both used the name `mm`
    dd = parsed.day
    hh = parsed.hour
    mi = parsed.minute
    ss = parsed.second
    # ms = parsed.microsecond
    # Note: with pytz, timezone_local.localize(naive_dt) is generally preferred
    # over passing the timezone as tzinfo directly.
    aware = datetime(yy, mo, dd, hh, mi, ss, 0, timezone_local)
    return aware
It works fine when I pass it the timestamp as a string in testing, but it balks when I try to apply it to a dataframe, I presume because I don't yet know the right way to pass the column value as a string. In this case, I'm trying to replace the current ModifiedTime value with the result of the call to the function.
from pyspark.sql import functions as F
.
.
.
ordersDf = ordersDf.withColumn("ModifiedTime", ( convert_naive_to_aware( F.substring( ordersDf.ModifiedTime, 1, 19 ), timezone_local)),)
Those of you more knowledgeable than I won't be surprised that I received the following error:
TypeError: 'Column' object is not callable
I admit, I'm a bit of a tyro at Python and dataframes, and I may well be taking the long way 'round. I've attempted a few other things, such as ordersDf.ModifiedTime.cast("String"), etc., but no luck. I'd be grateful for any suggestions.
We're using Azure Databricks, the cluster is Scala 2.11.
You need to convert the function into a UDF before you can apply it on a Spark dataframe:
from pyspark.sql import functions as F
# I assume `tzDf` is a pandas dataframe... This syntax wouldn't work with spark.
timezone_local = tz.timezone(tzDf[0]["COUNTRY"] + '/' + tzDf[0]["TIMEZONE"])
# Convert the function to a UDF
time_udf = F.udf(convert_naive_to_aware)
# Avoid overwriting dataframe variables. Here I appended `2` to the new variable name.
# Note: the timezone is passed into the UDF as a string literal, so inside
# `convert_naive_to_aware` it should be turned back into a timezone object
# with tz.timezone(...).
ordersDf2 = ordersDf.withColumn(
    "ModifiedTime",
    time_udf(
        F.substring(ordersDf.ModifiedTime, 1, 19), F.lit(str(timezone_local))
    )
)
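If all you ultimately need is the instant in UTC, a Python UDF may not be necessary at all. Here is a minimal alternative sketch using Spark built-ins, assuming timezone_local prints as an IANA zone name such as 'America/Chicago' (the column name ModifiedTimeUtc is just an example):
# to_utc_timestamp interprets the naive timestamp as local time in the given
# zone and converts it to UTC.
ordersDf3 = ordersDf.withColumn(
    "ModifiedTimeUtc",
    F.to_utc_timestamp(
        F.to_timestamp(F.substring(ordersDf.ModifiedTime, 1, 19), "yyyy-MM-dd'T'HH:mm:ss"),
        str(timezone_local)
    )
)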
I came across the below lambda code line in PySpark while browsing a long Python Jupyter notebook, and I am trying to understand it. Can you explain what it does in the best possible way?
parse = udf (lambda x: (datetime.datetime.utcnow() - timedelta(hours= x)).isoformat()[:-3] + 'Z', StringType())
Reformatted, the expression passed to udf is:
udf(
    lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z',
    StringType()
)
udf in PySpark wraps a Python function so that it is run for every row of a Spark dataframe.
Creates a user defined function (UDF).
New in version 1.3.0.
Parameters:
f : function
python function if used as a standalone function
returnType : pyspark.sql.types.DataType or str
the return type of the user-defined function. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.
The returnType here is StringType(), i.e. the UDF returns a string. Removing it, we get the function body we're interested in:
lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'
In order to find out what the given lambda function does, you can create a regular function from it. You may need to add imports too.
import datetime
from datetime import timedelta
def func(x):
    return (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'
To really see what's going on you can create variables out of every element and print them.
import datetime
from datetime import timedelta
def my_func(x):
    v1 = datetime.datetime.utcnow()
    v2 = timedelta(hours=x)
    v3 = v1 - v2
    v4 = v3.isoformat()
    v5 = v4[:-3]
    v6 = v5 + 'Z'
    for e in (v1, v2, v3, v4, v5):
        print(e)
    return v6
print(my_func(3))
# 2022-06-17 07:16:36.212566
# 3:00:00
# 2022-06-17 04:16:36.212566
# 2022-06-17T04:16:36.212566
# 2022-06-17T04:16:36.212
# 2022-06-17T04:16:36.212Z
This way you see how the result changes after every step. You can print whatever you want at any step you need, e.g. print(type(v4)).
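For completeness, this is roughly how the parse UDF would then be used on a Spark dataframe; the dataframe df and the integer column hours_ago are hypothetical:
from pyspark.sql import functions as F
# parse is the UDF defined above; it returns a formatted UTC string per row
df2 = df.withColumn("utc_ts", parse(F.col("hours_ago")))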
How can I build a function that creates these dataframes?
buy_orders_1h = pd.DataFrame(
    {'Date_buy': buy_orders_date_1h,
     'Name_buy': buy_orders_name_1h
    })

sell_orders_1h = pd.DataFrame(
    {'Date_sell': sell_orders_date_1h,
     'Name_sell': sell_orders_name_1h
    })
I have 10 dataframes like this that I create very manually, and every time I want to add a new column I would have to do it in all of them, which is time-consuming. If I can build a function, I would only have to do it once.
The difference between the two functions above is of course that one is for buy signals and the other is for sell signals.
I guess the inputs to the function should be:
_buy/_sell - for the Column name
buy_ / sell_ - for the Column input
I'm thinking input to the function could be something like:
def create_dfs(col, col_input, hour):
    df = pd.DataFrame(
        {'Date' + col: col_input + "_orders_date_" + hour,
         'Name' + col: col_input + "_orders_name_" + hour
        })
    return df
buy_orders_1h = create_dfs("_buy", "buy_", "1h")
sell_orders_1h = create_dfs("_sell", "sell_", "1h")
A dataframe needs an index, so either you can manually pass an index, or enter your row values in list form:
def create_dfs(col, col_input, hour):
    df = pd.DataFrame(
        {'Date' + col: [col_input + "_orders_date_" + hour],
         'Name' + col: [col_input + "_orders_name_" + hour]})
    return df
buy_orders_1h = create_dfs("_buy", "buy_", "1h")
sell_orders_1h = create_dfs("_sell", "sell_", "1h")
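Note that with this version the strings themselves become the cell values; a quick check (output shown as comments) makes that visible and motivates the edit below:
print(buy_orders_1h)
#              Date_buy             Name_buy
# 0  buy_orders_date_1h  buy_orders_name_1h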
Edit: Updated due to new information:
To look up a global variable whose name you build as a string, pass the string to globals() in the following manner:
'Date' + col: globals()[col_input + "_orders_date_" + hour]
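Put together, a minimal sketch of that lookup could look like this; the two lists are placeholders standing in for your real module-level data, and note that the pieces must concatenate to exactly the existing variable name (hence no leading underscore in the literal here):
import pandas as pd

buy_orders_date_1h = ['2021-01-01', '2021-01-02']   # placeholder data
buy_orders_name_1h = ['AAPL', 'MSFT']               # placeholder data

def create_dfs(col, col_input, hour):
    return pd.DataFrame(
        {'Date' + col: globals()[col_input + "orders_date_" + hour],
         'Name' + col: globals()[col_input + "orders_name_" + hour]})

buy_orders_1h = create_dfs("_buy", "buy_", "1h")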
Please check the output to see if this is what you want. You first create two dictionaries; then, depending on the buy=True condition, the function appends either to buying_df or to selling_df. I created two sample lists of dates and column names and iteratively appended to the desired dictionaries. Only after the dicts have been filled is the pandas.DataFrame created. You do not need to create it iteratively, but rather once at the end, when your dates and names have been collected into a dict.
from collections import defaultdict
import pandas as pd
buying_df=defaultdict(list)
selling_df=defaultdict(list)
def add_column_to_df(date, name, buy=True):
    if buy:
        buying_df["Date_buy"].append(date)
        buying_df["Name_buy"].append(name)
    else:
        selling_df["Date_sell"].append(date)
        selling_df["Name_sell"].append(name)

dates = ["1900","2000","2010"]
names = ["Col_name1","Col_name2","Col_name3"]

for date, name in zip(dates, names):
    add_column_to_df(date, name)
#print(buying_df)
df=pd.DataFrame(buying_df)
print(df)
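The sell side works the same way; a brief usage sketch:
# Pass buy=False so the row goes into selling_df instead
add_column_to_df("2020", "Col_name4", buy=False)
sell_df = pd.DataFrame(selling_df)
print(sell_df)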
The Excel file created from Python is extremely slow to open, even though the size of the file is only about 50 MB.
I have tried both pandas and openpyxl.
def to_file(list_report,list_sheet,strip_columns,Name):
    i = 0
    wb = ExcelWriter(path_output + '\\' + Name + dateformat + '.xlsx')
    while i <= len(list_report)-1:
        try:
            df = pd.DataFrame(pd.read_csv(path_input + '\\' + list_report[i] + reportdate + '.csv'))
            for column in strip_column:
                try:
                    df[column] = df[column].str.strip('=("")')
                except:
                    pass
            df = adjust_report(df,list_report[i])
            df = df.apply(pd.to_numeric, errors ='ignore', downcast = 'integer')
            df.to_excel(wb, sheet_name = list_sheet[i], index = False)
        except:
            print('Missing report: ' + list_report[i])
        i += 1
    wb.save()
Is there any way to speed it up?
idiom
Let us rename list_report to reports.
Then your while loop is usually expressed as simply: for i in range(len(reports)):
You access the i-th element several times. The loop could bind that for you, with: for i, report in enumerate(reports):.
But it turns out you never even need i. So most folks would write this as: for report in reports:
code organization
This bit of code is very nice:
for column in strip_column:
    try:
        df[column] = df[column].str.strip('=("")')
    except:
        pass
I recommend you bury it in a helper function, using def strip_punctuation.
(The list should be plural, I think? strip_columns?)
Then you would have a simple sequence of df assignments.
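A rough sketch of how those suggestions combine, keeping your other names (ExcelWriter, adjust_report, path_input, path_output, reportdate, dateformat) as they are; pairing each report with its sheet via zip is an assumption about how list_report and list_sheet line up:
def strip_punctuation(df, columns):
    # Strip the ="..." wrappers from the columns that have them
    for column in columns:
        try:
            df[column] = df[column].str.strip('=("")')
        except (KeyError, AttributeError):
            pass
    return df

def to_file(reports, sheets, strip_columns, name):
    wb = ExcelWriter(path_output + '\\' + name + dateformat + '.xlsx')
    for report, sheet in zip(reports, sheets):
        try:
            df = pd.read_csv(path_input + '\\' + report + reportdate + '.csv')
            df = strip_punctuation(df, strip_columns)
            df = adjust_report(df, report)
            df = df.apply(pd.to_numeric, errors='ignore', downcast='integer')
            df.to_excel(wb, sheet_name=sheet, index=False)
        except Exception:
            # Same broad catch as the original: a missing report is skipped
            print('Missing report: ' + report)
    wb.save()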
timing
Profile the elapsed time. Surround each df assignment with code like this:
from time import time

t0 = time()
df = ...
print(time() - t0)
That will show you which part of your processing pipeline takes the longest and therefore should receive the most effort for speeding it up.
I suspect adjust_report() uses the bulk of the time,
but without seeing it that's hard to say.
I am new to PySpark dataframes and used to work with RDDs before. I have a dataframe like this:
date path
2017-01-01 /A/B/C/D
2017-01-01 /X
2017-01-01 /X/Y
And want to convert to the following:
date path
2017-01-01 /A/B
2017-01-01 /X
2017-01-01 /X/Y
Basically, I want to get rid of everything after the third /, including it. So before, with an RDD, I used to have the following:
from urllib import quote_plus
path_levels = df['path'].split('/')
filtered_path_levels = []
for _level in range(min(df_size, 3)):
    # Take only the top 2 levels of path
    filtered_path_levels.append(quote_plus(path_levels[_level]))
df['path'] = '/'.join(map(str, filtered_path_levels))
Things with pyspark are more complicated I would say. Here is what I have got so far:
path_levels = split(results_df['path'], '/')
filtered_path_levels = []
for _level in range(size(df_size, 3)):
    # Take only the top 2 levels of path
    filtered_path_levels.append(quote_plus(path_levels[_level]))
df['path'] = '/'.join(map(str, filtered_path_levels))
which is giving me the following error:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Any help regarding this would be much appreciated. Let me know if this needs more information/explanation.
Use udf:
from urllib import quote_plus  # urllib.parse on Python 3
from pyspark.sql.functions import *

@udf
def quote_string_(path, size):
    if path:
        return "/".join(quote_plus(x) for x in path.split("/")[:size])

# size is 3 because split("/") yields a leading empty string,
# so three elements keep the top two path levels
df.withColumn("foo", quote_string_("path", lit(3)))
I resolved my problem using the following code:
from pyspark.sql.functions import split, col, lit, concat
split_col = split(df['path'], '/')
df = df.withColumn('l1_path', split_col.getItem(1))
df = df.withColumn('l2_path', split_col.getItem(2))
df = df.withColumn('path', concat(col('l1_path'), lit('/'), col('l2_path')))
df = df.drop('l1_path', 'l2_path')
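An alternative sketch that avoids the intermediate columns, using a regular expression to keep at most the first two path levels (like the solution above, it does not apply quote_plus):
from pyspark.sql.functions import regexp_extract

# Keep '/level1' or '/level1/level2'; anything deeper is dropped
df = df.withColumn("path", regexp_extract("path", r"^(/[^/]*(?:/[^/]*)?)", 1))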
I am iterating over the rows that are available, but it doesn't seem to be the most optimal way to do it -- it takes forever.
Is there a special way in Pandas to do it?
INIT_TIME = datetime.datetime.strptime(date + ' ' + time, "%Y-%B-%d %H:%M:%S")
#NEED TO ADD DATA FROM THAT COLUMN
df = pd.read_csv(dataset_path, delimiter=',',skiprows=range(0,1),names=['TCOUNT','CORE','COUNTER','EMPTY','NAME','TSTAMP','MULT','STAMPME'])
df = df.drop('MULT',1)
df = df.drop('EMPTY',1)
df = df.drop('TSTAMP', 1)
for index, row in df.iterrows():
    TMP_TIME = INIT_TIME + datetime.timedelta(seconds=row['TCOUNT'])
    df['STAMPME'] = TMP_TIME.strftime("%s")
In addition, the datetime I am adding is in the following format
2017-05-11 11:12:37.100192 1494493957
2017-05-11 11:12:37.200541 1494493957
and therefore the unix timestamp is the same (and it is correct), but is there a better way to represent it?
Assuming the datetimes are correctly reflecting what you're trying to do, with respect to Pandas you should be able to do:
df['STAMPME'] = df['TCOUNT'].apply(lambda x: (datetime.timedelta(seconds=x) + INIT_TIME).strftime("%s"))
As noted here you should not use iterrows() to modify the DF you are iterating over. If you need to iterate row by row (as opposed to using the apply method) you can use another data object, e.g. a list, to retain the values you're calculating, and then create a new column from that.
Also, for future reference, the itertuples() method is faster than iterrows(), although you access fields positionally (row[x]) or as attributes (row.TCOUNT) rather than with row['name'].
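If you did want to loop explicitly, a minimal itertuples() sketch along those lines, collecting into a list and assigning the column once:
stamps = []
for row in df.itertuples(index=False):
    # Attribute access on the namedtuple; row.TCOUNT is the seconds offset
    stamps.append((INIT_TIME + datetime.timedelta(seconds=row.TCOUNT)).strftime("%s"))
df['STAMPME'] = stamps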
I'd rewrite your code like this
INIT_TIME = datetime.datetime.strptime(date + ' ' + time, "%Y-%B-%d %H:%M:%S")
INIT_TIME = pd.to_datetime(INIT_TIME)
df = pd.read_csv(
dataset_path, delimiter=',',skiprows=range(0,1),
names=['TCOUNT','CORE','COUNTER','EMPTY','NAME','TSTAMP','MULT','STAMPME']
)
df = df.drop(['MULT', 'EMPTY', 'TSTAMP'], 1)
df['STAMPME'] = pd.to_timedelta(df['TCOUNT'], 's') + INIT_TIME
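If you still need the unix-timestamp string representation from the question rather than datetimes, one way (an assumption about the desired output) is to convert the resulting column afterwards:
# datetime64[ns] values to whole unix seconds
df['STAMPME'] = df['STAMPME'].astype('int64') // 10**9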