I have an SFrame with the columns Date1 and Date2.
I am trying to use .apply() to find the datediff between Date1 and Date2, but I can't figure out how to pass the second column as an argument.
Ideally something like
frame['new_col'] = frame['Date1'].apply(lambda x: datediff(x, frame['Date2']))
You can take the difference directly by subtracting frame['Date1'] from frame['Date2']. That returns the number of seconds between the two dates (only tested with Python's datetime objects), which you can convert into a number of days with simple arithmetic:
from sframe import SFrame
from datetime import datetime, timedelta
mydict = {'Date1': [datetime.now(), datetime.now() + timedelta(2)],
          'Date2': [datetime.now() + timedelta(10), datetime.now() + timedelta(17)]}
frame = SFrame(mydict)
frame['new_col'] = (frame['Date2'] - frame['Date1']).apply(lambda x: x//(60*60*24))
Output:
+----------------------------+----------------------------+---------+
| Date1 | Date2 | new_col |
+----------------------------+----------------------------+---------+
| 2016-10-02 21:12:14.712556 | 2016-10-12 21:12:14.712574 | 10.0 |
| 2016-10-04 21:12:14.712567 | 2016-10-19 21:12:14.712576 | 15.0 |
+----------------------------+----------------------------+---------+
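If you specifically want .apply(), you can also apply a function row-wise over the whole SFrame: each row is passed to the lambda as a dict, so both dates are available at once. A minimal sketch, assuming Date1 and Date2 hold datetime values:
# Row-wise apply: each row arrives as a dict, so the lambda sees both dates
frame['new_col'] = frame.apply(
    lambda row: (row['Date2'] - row['Date1']).total_seconds() // (60 * 60 * 24))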
I have a pyspark dataframe with a column datetime containing: 2022-06-01 13:59:58
I would like to transform that datetime value into: 2022-06-01 14:00:58
Is there a way to round the minute up into the next hour when the minute is 59?
You can accomplish this using expr or unix_timestamp, adding 1 minute (via when-otherwise) whenever the minute of your timestamp value is 59.
The unix_timestamp route is a bit more fiddly since it involves an extra conversion to epoch seconds and back, but either way the end result is the same.
Data Preparation
from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

sql = SparkSession.builder.getOrCreate()  # SparkSession used as `sql` below

s = StringIO("""
date_str
2022-03-01 13:59:50
2022-05-20 13:45:50
2022-06-21 16:59:50
2022-10-22 20:59:50
""")
df = pd.read_csv(s, delimiter=',')
sparkDF = sql.createDataFrame(df) \
    .withColumn('date_parsed', F.to_timestamp(F.col('date_str'), 'yyyy-MM-dd HH:mm:ss')) \
    .drop('date_str')
sparkDF.show()
+-------------------+
| date_parsed|
+-------------------+
|2022-03-01 13:59:50|
|2022-05-20 13:45:50|
|2022-06-21 16:59:50|
|2022-10-22 20:59:50|
+-------------------+
Extracting Minute & Addition
sparkDF = sparkDF.withColumn("date_minute", F.minute("date_parsed"))
sparkDF = sparkDF.withColumn(
    'date_parsed_updated_expr',
    F.when(F.col('date_minute') == 59, F.col('date_parsed') + F.expr('INTERVAL 1 MINUTE'))
     .otherwise(F.col('date_parsed'))
).withColumn(
    'date_parsed_updated_unix',
    F.when(F.col('date_minute') == 59, (F.unix_timestamp(F.col('date_parsed')) + 60).cast('timestamp'))
     .otherwise(F.col('date_parsed'))
)
sparkDF.show()
+-------------------+-----------+------------------------+------------------------+
| date_parsed|date_minute|date_parsed_updated_expr|date_parsed_updated_unix|
+-------------------+-----------+------------------------+------------------------+
|2022-03-01 13:59:50| 59| 2022-03-01 14:00:50| 2022-03-01 14:00:50|
|2022-05-20 13:45:50| 45| 2022-05-20 13:45:50| 2022-05-20 13:45:50|
|2022-06-21 16:59:50| 59| 2022-06-21 17:00:50| 2022-06-21 17:00:50|
|2022-10-22 20:59:50| 59| 2022-10-22 21:00:50| 2022-10-22 21:00:50|
+-------------------+-----------+------------------------+------------------------+
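If you'd rather not materialize the helper date_minute column, the same check can be written inline; a minimal sketch using the expr variant (date_parsed_rounded is just an illustrative column name):
# Inline variant: inspect the minute directly inside when(), no helper column needed
sparkDF = sparkDF.withColumn(
    'date_parsed_rounded',
    F.when(F.minute('date_parsed') == 59,
           F.col('date_parsed') + F.expr('INTERVAL 1 MINUTE'))
     .otherwise(F.col('date_parsed'))
)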
I have the following dataframe df:
time           | Score | weekday
01-01-21 12:00 | 1     | Friday
01-01-21 24:00 | 33    | Friday
02-01-21 12:00 | 12    | Saturday
02-01-21 24:00 | 9     | Saturday
03-01-21 12:00 | 11    | Sunday
03-01-21 24:00 | 8     | Sunday
I now want to get the correlation between columns Score and weekday.
I did the following to get it:
s_corr = df.weekday.str.get_dummies().corrwith(df['Score'])
print (s_corr)
I am now wondering if this is the correct way of doing it? Or would it be better to first create a new dataframe in which the rows are summed per day (using the time column) and then apply the code above to get the correlation between Score and weekday? Or are there other suggestions for improvement?
I have used numpy.corrcoef before for getting correlations between continuous and categorical variables. You can try it and see if it works for you:
I first created dummies for the categorical variables:
import pandas as pd

df_dummies = pd.get_dummies(df['weekday'], drop_first=True)
df_new = pd.concat([df['Score'], df_dummies], axis=1)
I then converted the DataFrame with the dummies to a numpy array and applied corrcoef on it as follows:
import numpy as np

df_arr = df_new.to_numpy()
corr_matrix = np.corrcoef(df_arr.T)
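If you prefer to stay in pandas, DataFrame.corr() should give the equivalent Pearson correlations without the explicit NumPy round-trip; a small sketch reusing df_new from above:
# Pearson correlation of Score with each weekday dummy, straight from pandas
corr_with_score = df_new.astype(float).corr()['Score'].drop('Score')
print(corr_with_score)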
I'm trying to convert some VBA scripts into a Python script, and I have been having trouble figuring a few things out, as the results seem different from what the Excel file gives.
So I have an example dataframe like this:
|Name     | A_Date     |
________________________
|RAHEAL   | 04/30/2020 |
|GIFTY    | 05/31/2020 |
|ERIC     | 03/16/2020 |
|PETER    | 05/01/2020 |
|EMMANUEL | 12/15/2019 |
|BABA     | 05/23/2020 |
and I would want to achieve this result (the VBA script result):
|Name     | A_Date     | Sold
______________________________
|RAHEAL   | 04/30/2020 | No
|GIFTY    | 05/31/2020 | Yes
|ERIC     | 03/16/2020 | No
|PETER    | 05/01/2020 | Yes
|EMMANUEL | 12/15/2019 | No
|BABA     | 05/23/2020 | Yes
By converting this VBA script:
Range("C2").Select
Selection.Value = _
"=IF(RC[-1]>=(INT(R2C2)-DAY(INT(R2C2))+1),""Yes"",""No"")"
Selection.AutoFill Destination:=Range("C2", "C" & Cells(Rows.Count, 1).End(xlUp).Row)
Range("C1").Value = "Sold"
ActiveSheet.Columns("C").Copy
ActiveSheet.Columns("C").PasteSpecial xlPasteValues
Simply: =IF(B2>=(INT($B$2)-DAY(INT($B$2))+1),"Yes","No")
To this Python script:
sales['Sold']=np.where(sales['A_Date']>=(sales['A_Date'] - pd.to_timedelta(sales.A_Date.dt.day, unit='d'))+ timedelta(days=1),'Yes','No')
But I keep getting "Yes" throughout... could anyone help me spot where I might have made a mistake?
import pandas as pd
df = pd.DataFrame({'Name':['RAHEAL','GIFTY','ERIC','PETER','EMMANUEL','BABA'],
'A_Date':['04/30/2020','05/31/2020','03/16/2020',
'05/01/2020','12/15/2019','05/23/2020']})
df['A_Date'] = pd.to_datetime(df['A_Date'])
print(df)
# "Sold" when the date is on/after the first day of the month of the first date (Excel's $B$2)
df['Sold'] = df['A_Date'] >= df['A_Date'].iloc[0].replace(day=1)
df['Sold'] = df['Sold'].map({True:'Yes', False:'No'})
print(df)
output:
Name A_Date
0 RAHEAL 2020-04-30
1 GIFTY 2020-05-31
2 ERIC 2020-03-16
3 PETER 2020-05-01
4 EMMANUEL 2019-12-15
5 BABA 2020-05-23
Name A_Date Sold
0 RAHEAL 2020-04-30 Yes
1 GIFTY 2020-05-31 Yes
2 ERIC 2020-03-16 No
3 PETER 2020-05-01 Yes
4 EMMANUEL 2019-12-15 No
5 BABA 2020-05-23 Yes
If I read the formula right, Sold is "Yes" when the A_Date value is >= 04/01/2020 (i.e. the first day of the month for the date in B2), so RAHEAL should be Yes too.
I don't know if you noticed (and if this is intended), but if an A_Date value has a fractional part (i.e. a time), there is room for error when you calculate the value for the 1st of the month. If the time in B2 is, say, 10:00 AM, the cut-off value will be 04/01/2020 10:00, and another value of, say, 04/01/2020 09:00 will then evaluate to False/No. This is how your Excel formula works as well.
EDIT (12 Jan 2021): Note, values in column A_Date are of type datetime.datetime or datetime.date. Presumably they are converted when reading the Excel file or explicitly afterwards.
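For completeness, the reason your np.where attempt returns "Yes" everywhere: sales['A_Date'] - pd.to_timedelta(sales.A_Date.dt.day, unit='d') + timedelta(days=1) computes the first day of each row's own month, and every date is of course on or after the first day of its own month. The Excel formula instead anchors everything to $B$2, i.e. the first row. A corrected sketch of your one-liner, assuming the sales frame from your question with A_Date already parsed as datetimes:
import numpy as np

# Anchor the cut-off to the first row's month start (Excel's $B$2), not each row's own month
cutoff = sales['A_Date'].iloc[0].replace(day=1)
sales['Sold'] = np.where(sales['A_Date'] >= cutoff, 'Yes', 'No')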
Very much embarrassed I didn't see the simple, elegant solution that buran gave. I did more of a literal translation.
first_date.toordinal() - 693594 is the integer (Excel-style serial) date value for your initial date, and current_date.toordinal() - 693594 is the same for the current iteration of the dates column. I apply your cell formula logic to each A_Date row value and output the result as the corresponding Sold column value.
import pandas as pd
from datetime import datetime
def is_sold(current_date: datetime, first_date: datetime, day_no: int) -> str:
    # use of toordinal idea from #rjha94 https://stackoverflow.com/a/47478659
    # literal translation of =IF(B2>=(INT($B$2)-DAY(INT($B$2))+1),"Yes","No")
    if current_date.toordinal() - 693594 >= first_date.toordinal() - 693594 - day_no + 1:
        return "Yes"
    else:
        return "No"

sales = pd.DataFrame({'Name': ['RAHEAL', 'GIFTY', 'ERIC', 'PETER', 'EMMANUEL', 'BABA'],
                      'A_Date': ['2020-04-30', '2020-05-31', '2020-03-16', '2020-05-01', '2019-12-15', '2020-05-23']})
sales['A_Date'] = pd.to_datetime(sales['A_Date'], errors='coerce')
# day_no comes from the anchor date (Excel's DAY(INT($B$2))), not from each row's own date
sales['Sold'] = sales['A_Date'].apply(lambda x: is_sold(x, sales['A_Date'][0], sales['A_Date'][0].day))
print(sales)
How can I use pandas to convert a dates column into a standard format, e.g. 12-08-1996? The data I have is:
I've tried some methods by searching online but haven't found one that detects the format and makes it standard.
Here is what I've coded:
df = pd.read_excel(r'date cleanup.xlsx')
df.head(10)
df.DOB = pd.to_datetime(df.DOB) #Error is in this line
The error I get is:
ValueError: ('Unknown string format:', '20\ \december\ \1992')
UPDATE:
Using
from dateutil import parser

for date in df.DOB:
    print(parser.parse(date))
This works great, but there is a value 20\\december \\1992 for which it gives the error above. I'm not familiar with all the formats that are in the data, which is why I was looking for a technique that can auto-detect the format and convert it to a standard one.
You could use the dateparser library:
import dateparser
import pandas as pd

df = pd.DataFrame(["12 aug 1996", "24th december 2006", "20\\ december \\2007"], columns=['DOB'])
df['date'] = df['DOB'].apply(lambda x: dateparser.parse(x))
Output
| | DOB | date |
|---|--------------------|------------|
| 0 | 12 aug 1996 | 1996-08-12 |
| 1 | 24th december 2006 | 2006-12-24 |
| 2 | 20\ december \2007 | 2020-12-07 |
EDIT
Note, there is a STRICT_PARSING setting which can be used to handle such exceptions:
You can also ignore parsing incomplete dates altogether by setting STRICT_PARSING
df['date'] = df['DOB'].apply(lambda x : dateparser.parse(x, settings={'STRICT_PARSING': True}) if len(str(x))>6 else None)
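If you then want the dates in the question's standard format (e.g. 12-08-1996), the parsed values can be rendered as day-month-year strings afterwards; a small sketch, assuming df['date'] holds the dateparser output from above (DOB_standard is just an illustrative column name):
import pandas as pd

# Render the parsed datetimes as dd-mm-yyyy strings
df['DOB_standard'] = pd.to_datetime(df['date']).dt.strftime('%d-%m-%Y')
print(df[['DOB', 'DOB_standard']])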
I have a data frame that contains the following columns:
ID   Scheduled Date
241  10/9/2018
423  9/25/2018
126  9/30/2018
123  8/13/2018
132  8/16/2018
143  10/6/2018
I want to count the total number of IDs by week. Specifically, I want the week to always start on Monday and always end on Sunday.
I achieved this in Jupyter Notebook already:
weekly_count_output = df.resample('W-Mon', on='Scheduled Date', label='left', closed='left').sum().query('count_row > 0')
weekly_count_output = weekly_count_output.reset_index()
weekly_count_output = weekly_count_output[['Scheduled Date', 'count_row']]
weekly_count_output = weekly_count_output.rename(columns = {'count_row': 'Total Count'})
But I don't know how to write the above code in PySpark syntax. I want my resulting output to look like this:
Scheduled Date Total Count
8/13/2018 2
9/24/2018 2
10/1/2018 1
10/8/2018 1
Please note the Scheduled Date is always a Monday (indicating beginning of week) and the total count goes from Monday to Sunday of that week.
Thanks to Get Last Monday in Spark for defining the function previous_day.
First, the imports:
from pyspark.sql.functions import *
from datetime import datetime
Assuming your input data is as in my df (DataFrame):
cols = ['id', 'scheduled_date']
vals = [
(241, '10/09/2018'),
(423, '09/25/2018'),
(126, '09/30/2018'),
(123, '08/13/2018'),
(132, '08/16/2018'),
(143, '10/06/2018')
]
df = spark.createDataFrame(vals, cols)
This is the function defined:
def previous_day(date, dayOfWeek):
    # Monday of the week a date falls in: jump to the next dayOfWeek, then step back 7 days
    return date_sub(next_day(date, dayOfWeek), 7)
# Converting the string column to a timestamp, then formatting it as yyyy-MM-dd.
df = df.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', 'MM/dd/yyyy') \
    .cast('timestamp'), 'yyyy-MM-dd'))
df.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-09|
|423| 2018-09-25|
|126| 2018-09-30|
|123| 2018-08-13|
|132| 2018-08-16|
|143| 2018-10-06|
+---+--------------+
# Returns the Monday of the week each date falls in
df_mon = df.withColumn("scheduled_date", previous_day('scheduled_date', 'monday'))
df_mon.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-08|
|423| 2018-09-24|
|126| 2018-09-24|
|123| 2018-08-13|
|132| 2018-08-13|
|143| 2018-10-01|
+---+--------------+
# You can groupBy and do agg count of 'id'.
df_mon_grp = df_mon.groupBy('scheduled_date').agg(count('id')).orderBy('scheduled_date')
# Reformatting to match your resulting output.
df_mon_grp = df_mon_grp.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', "yyyy-MM-dd") \
.cast('timestamp'), 'MM/dd/yyyy'))
df_mon_grp.show()
+--------------+---------+
|scheduled_date|count(id)|
+--------------+---------+
| 08/13/2018| 2|
| 09/24/2018| 2|
| 10/01/2018| 1|
| 10/08/2018| 1|
+--------------+---------+
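To match the column names in your desired output exactly, you can alias the aggregate and rename the date column in one go; a small sketch continuing from df_mon above (df_final is just an illustrative name):
# Alias count('id') so the result shows 'Total Count' instead of count(id)
df_final = df_mon.groupBy('scheduled_date') \
    .agg(count('id').alias('Total Count')) \
    .orderBy('scheduled_date') \
    .withColumnRenamed('scheduled_date', 'Scheduled Date')
df_final.show()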