Converting VBA script to Python Script - python

I'm trying to convert some VBA scripts into a Python script, and I'm having trouble figuring a few things out, as the results differ from what the Excel file gives.
So I have an example dataframe like this :
|Name | A_Date |
_______________________
|RAHEAL | 04/30/2020|
|GIFTY | 05/31/2020|
|ERIC | 03/16/2020|
|PETER | 05/01/2020|
|EMMANUEL| 12/15/2019|
|BABA | 05/23/2020|
and I would want to achieve this result(VBA script result) :
|Name | A_Date | Sold
__________________________________
|RAHEAL | 04/30/2020| No
|GIFTY | 05/31/2020| Yes
|ERIC | 03/16/2020| No
|PETER | 05/01/2020| Yes
|EMMANUEL| 12/15/2019| No
|BABA | 05/23/2020| Yes
By converting this VBA script :
Range("C2").Select
Selection.Value = _
"=IF(RC[-1]>=(INT(R2C2)-DAY(INT(R2C2))+1),""Yes"",""No"")"
Selection.AutoFill Destination:=Range("C2", "C" & Cells(Rows.Count, 1).End(xlUp).Row)
Range("C1").Value = "Sold"
ActiveSheet.Columns("C").Copy
ActiveSheet.Columns("C").PasteSpecial xlPasteValues
Simply :=IF(B2>=(INT($B$2)-DAY(INT($B$2))+1),"Yes","No")
To this Python script:
sales['Sold']=np.where(sales['A_Date']>=(sales['A_Date'] - pd.to_timedelta(sales.A_Date.dt.day, unit='d'))+ timedelta(days=1),'Yes','No')
But I keep getting "Yes" for every row. Could anyone help me spot where I went wrong?

import pandas as pd
df = pd.DataFrame({'Name':['RAHEAL','GIFTY','ERIC','PETER','EMMANUEL','BABA'],
'A_Date':['04/30/2020','05/31/2020','03/16/2020',
'05/01/2020','12/15/2019','05/23/2020']})
df['A_Date'] = pd.to_datetime(df['A_Date'])
print(df)
df['Sold'] = df['A_Date'] >= df['A_Date'].iloc[0].replace(day=1)
df['Sold'] = df['Sold'].map({True:'Yes', False:'No'})
print(df)
output:
Name A_Date
0 RAHEAL 2020-04-30
1 GIFTY 2020-05-31
2 ERIC 2020-03-16
3 PETER 2020-05-01
4 EMMANUEL 2019-12-15
5 BABA 2020-05-23
Name A_Date Sold
0 RAHEAL 2020-04-30 Yes
1 GIFTY 2020-05-31 Yes
2 ERIC 2020-03-16 No
3 PETER 2020-05-01 Yes
4 EMMANUEL 2019-12-15 No
5 BABA 2020-05-23 Yes
If I read the formula right, a row is Yes when its A_Date value is >= 04/01/2020 (i.e. the first day of the month of the date in B2), so RAHEAL should be Yes too.
I don't know if you noticed (or whether it is intended), but if the A_Date value has a fractional part (i.e. a time), the first-of-month value calculated above carries that time along, which leaves room for error. If the time in B2 is, say, 10:00 AM, the cut-off value will be 04/01/2020 10:00. Another value such as 04/01/2020 09:00 will then be evaluated as False/No. Note that the Excel formula wraps B2 in INT, which drops the time, so the same value would be Yes in Excel; calling normalize() on the timestamp before replace(day=1) gives the same behaviour in pandas.
EDIT (12 Jan 2021): Note, values in column A_Date are of type datetime.datetime or datetime.date. Presumably they are converted when reading the Excel file or explicitly afterwards.
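To make the difference concrete, here is a minimal sketch (not part of the original answer) that runs the original np.where attempt next to a cut-off taken from the first row only. The variable names own_month_start, always_yes and cutoff are made up for illustration, and the use of normalize() assumes you want to mirror Excel's INT by dropping any time of day.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "Name": ["RAHEAL", "GIFTY", "ERIC", "PETER", "EMMANUEL", "BABA"],
    "A_Date": pd.to_datetime(
        ["04/30/2020", "05/31/2020", "03/16/2020",
         "05/01/2020", "12/15/2019", "05/23/2020"],
        format="%m/%d/%Y",
    ),
})

# Original attempt: every row is compared with the first day of ITS OWN month,
# and a date is never earlier than the first day of its own month.
own_month_start = (
    sales["A_Date"]
    - pd.to_timedelta(sales["A_Date"].dt.day, unit="D")
    + pd.Timedelta(days=1)
)
always_yes = np.where(sales["A_Date"] >= own_month_start, "Yes", "No")

# Excel formula: the cut-off comes from the absolute cell $B$2 (first row) only.
first = sales["A_Date"].iloc[0]
cutoff = first.normalize().replace(day=1)  # normalize() drops the time, like INT()
sales["Sold"] = np.where(sales["A_Date"] >= cutoff, "Yes", "No")

print(list(always_yes))
print(sales)
```

The first comparison is true for every row by construction, which would explain the "Yes" throughout; the second depends only on the date in the first row.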

Very much embarrassed that I didn't see the simple, elegant solution that buran gave (+1). I did more of a literal translation.
first_date.toordinal() - 693594 is the integer date value for your initial date, current_date.toordinal() - 693594 is the integer date value for the current iteration of the dates column. I apply your cell formula logic to each A_Date row value and output as the corresponding Sold column value.
import pandas as pd
from datetime import datetime
def is_sold(current_date: datetime, first_date: datetime, day_no: int) -> str:
    # use of toordinal idea from #rjha94 https://stackoverflow.com/a/47478659
    if current_date.toordinal() - 693594 >= first_date.toordinal() - 693594 - day_no + 1:
        return "Yes"
    else:
        return "No"
sales = pd.DataFrame({'Name':['RAHEAL','GIFTY','ERIC','PETER','EMMANUEL','BABA'],
                      'A_Date':['2020-04-30','2020-05-31','2020-03-16','2020-05-01','2019-12-15','2020-05-23']})
sales['A_Date'] = pd.to_datetime(sales['A_Date'], errors='coerce')
sales['Sold'] = sales['A_Date'].apply(lambda x:is_sold(x, sales['A_Date'][0], x.day))
print(sales)
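For anyone wondering where 693594 comes from: it is the proleptic ordinal of 1899-12-30, the date that Excel's 1900 date system effectively counts from for dates on or after 1 March 1900. A small sketch, with excel_serial as a hypothetical helper name that is not part of the code above:

```python
from datetime import date

# Ordinal of Excel's effective "day zero" in the 1900 date system.
EXCEL_EPOCH_ORDINAL = date(1899, 12, 30).toordinal()  # 693594

def excel_serial(d):
    """Return the Excel serial number for a date (valid from 1900-03-01 on)."""
    return d.toordinal() - EXCEL_EPOCH_ORDINAL

first = date(2020, 4, 30)
serial = excel_serial(first)        # what INT($B$2) evaluates to
cutoff = serial - first.day + 1     # INT($B$2) - DAY(INT($B$2)) + 1

print(serial, cutoff)
```

Here cutoff is the serial number of the first day of the month of the first date, which is what the worksheet formula compares each row against.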

Related

Filtering, transposing and concatenating with Pandas

I'm trying something I've never done before and I need some help.
Basically, I need to filter sections of a pandas dataframe, transpose each filtered section and then concatenate the resulting sections together.
Here's a representation of my dataframe:
df:
id | text_field | text_value
1 Date 2021-06-23
1 Hour 10:50
2 Position City
2 Position Countryside
3 Date 2021-06-22
3 Hour 10:45
I can then use some filtering method to isolate parts of my data:
df.groupby('id').filter(lambda x: True)
test = df.query(' id == 1 ')
test = test[["text_field","text_value"]]
test_t = test.set_index("text_field").T
test_t:
text_field | Date | Hour
text_value | 2021-06-23 | 10:50
If I repeat the process for the rows with id == 3 and then concatenate the result with test_t, I'll have the following:
text_field | Date | Hour
text_value | 2021-06-23 | 10:50
text_value | 2021-06-22 | 10:45
I'm aware that performing this with rows where id == 2 will give me other columns, and that's alright too; it's what I want as well.
What I can't figure out is how to do this for every "id" in my dataframe. I wasn't able to create a function or for loop that works. Can somebody help me?
To summarize:
1 - I need to separate my dataframe into sections according to the values in the "id" column
2 - After that I need to remove the "id" column and transpose the result
3 - I need to concatenate every resulting dataframe into one big dataframe
You can use pivot_table:
df.pivot_table(
index='id', columns='text_field', values='text_value', aggfunc='first')
Output:
text_field Date Hour Position
id
1 2021-06-23 10:50 NaN
2 NaN NaN City
3 2021-06-22 10:45 NaN
It's not exactly clear how you want to deal with repeating values though; it would be great to have a description of that (id=2 would make a good example).
Update: If you want to ignore the ids and simply concatenate all the values:
pd.DataFrame(df.groupby('text_field')['text_value'].apply(list).to_dict())
Output:
Date Hour Position
0 2021-06-23 10:50 City
1 2021-06-22 10:45 Countryside
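If repeated fields within one id should each keep their own row (one possible reading of the id=2 case, and only an assumption about what is wanted), a sketch along these lines numbers the repeats first and then pivots. The column name occurrence is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "text_field": ["Date", "Hour", "Position", "Position", "Date", "Hour"],
    "text_value": ["2021-06-23", "10:50", "City", "Countryside",
                   "2021-06-22", "10:45"],
})

# Number repeated (id, text_field) pairs 0, 1, ... so each repeat gets its own row.
df["occurrence"] = df.groupby(["id", "text_field"]).cumcount()

wide = df.pivot_table(
    index=["id", "occurrence"],
    columns="text_field",
    values="text_value",
    aggfunc="first",
)

print(wide)
```

With the sample data this gives one row for id 1, two rows for id 2 (City and Countryside) and one row for id 3, with NaN wherever an id has no value for a field.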

How to get the correlation between two columns?

I have such a dataframe df:
time | Score| weekday
01-01-21 12:00 | 1 | Friday
01-01-21 24:00 | 33 | Friday
02-01-21 12:00 | 12 | Saturday
02-01-21 24:00 | 9 | Saturday
03-01-21 12:00 | 11 | Sunday
03-01-21 24:00 | 8 | Sunday
I now want to get the correlation between columns Score and weekday.
I did the following to get it:
s_corr = df.weekday.str.get_dummies().corrwith(df['Score'])
print (s_corr)
I am now wondering if this is the correct way of doing it? Or would it be better to create a new dataframe in which all the rows are first summed for each day by the time column and after this using the code above to get the correlation between Score and weekday? Or are there maybe other suggestions for improvement?
I have used numpy.corrcoef before for getting correlations between continuous and categorical variables. You can try it and see if it works for you:
I first created dummies for the categorical variables:
df_dummies = pd.get_dummies(df['weekday'], drop_first= True)
df_new = pd.concat([df['Score'], df_dummies], axis=1)
I then converted the DataFrame with the dummies to a numpy array and applied corrcoef to it:
import numpy as np
df_arr = df_new.to_numpy(dtype=float)  # cast so that boolean dummy columns become 0.0/1.0
corr_matrix = np.corrcoef(df_arr.T)
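Put together as a runnable sketch on the sample data from the question: unlike the snippet above, this version keeps all three weekday columns (no drop_first) so that every day gets a coefficient, and score_vs_day is a name introduced only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Score": [1, 33, 12, 9, 11, 8],
    "weekday": ["Friday", "Friday", "Saturday", "Saturday", "Sunday", "Sunday"],
})

# One 0/1 indicator column per weekday.
dummies = pd.get_dummies(df["weekday"], dtype=float)
df_new = pd.concat([df["Score"], dummies], axis=1)

# np.corrcoef expects variables in rows, hence the transpose.
corr_matrix = np.corrcoef(df_new.to_numpy(dtype=float).T)

# Row 0 is Score; the remaining entries are Score vs. each weekday indicator.
score_vs_day = pd.Series(corr_matrix[0, 1:], index=dummies.columns)
print(score_vs_day)
```

The values should match what the corrwith approach in the question produces, since both compute Pearson correlations against the same indicator columns.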

Make date column into standard format using pandas

How can I use pandas to convert a date column into a standard format, e.g. 12-08-1996? The data I have is:
I've tried some methods I found online, but I haven't found one that detects the format and makes it standard.
Here is what I've coded:
df = pd.read_excel(r'date cleanup.xlsx')
df.head(10)
df.DOB = pd.to_datetime(df.DOB) #Error is in this line
The error I get is:
ValueError: ('Unknown string format:', '20\ \december\ \1992')
UPDATE:
Using
for date in df.DOB:
print(parser.parse(date))
Works great but there is a value 20\\december \\1992 for which it gives the above highlighted error. So I'm not familiar with all the formats that are in the data this is why I was looking for a technique that can auto-detect it and convert it to standard format.
You could use dateparser library:
import dateparser
df = pd.DataFrame(["12 aug 1996", "24th december 2006", "20\\ december \\2007"], columns = ['DOB'])
df['date'] = df['DOB'].apply(lambda x :dateparser.parse(x))
Output
| | DOB | date |
|---|--------------------|------------|
| 0 | 12 aug 1996 | 1996-08-12 |
| 1 | 24th december 2006 | 2006-12-24 |
| 2 | 20\ december \2007 | 2020-12-07 |
EDIT
Note, there is a STRICT_PARSING setting which can be used to handle exceptions :
You can also ignore parsing incomplete dates altogether by setting STRICT_PARSING
df['date'] = df['DOB'].apply(lambda x : dateparser.parse(x, settings={'STRICT_PARSING': True}) if len(str(x))>6 else None)
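If installing another library is not an option, a rough alternative is to strip the stray backslashes first and then let dateutil (which pandas already depends on) parse each value. This is only a sketch: to_standard is a hypothetical helper, and the sample values are guesses at what the column looks like.

```python
import pandas as pd
from dateutil import parser

df = pd.DataFrame(
    {"DOB": ["12 aug 1996", "24th december 2006", "20\\ december \\1992"]}
)

def to_standard(text):
    """Return the date as DD-MM-YYYY, or None if it cannot be parsed."""
    # Replace backslashes with spaces and collapse repeated whitespace.
    cleaned = " ".join(str(text).replace("\\", " ").split())
    try:
        return parser.parse(cleaned, dayfirst=True).strftime("%d-%m-%Y")
    except (ValueError, OverflowError):
        return None

df["date"] = df["DOB"].apply(to_standard)
print(df)
```

Returning None for unparseable values keeps the loop from stopping at the first bad row, so the problem entries can be inspected afterwards.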

Pandas - Evaluate a condition for each row of a series

I have a dataset like this:
Policy | Customer | Employee | CoveragDate | LapseDate
123 | 1234 | 1234 | 2011-06-01 | 2015-12-31
124 | 1234 | 1234 | 2016-01-01 | ?
125 | 1234 | 1234 | 2011-06-01 | 2012-01-01
124 | 5678 | 5555 | 2014-01-01 | ?
I'm trying to iterate through each policy for each employee of each customer (a customer can have many employees, an employee can have multiple policies) and compare the covered date against the lapse date for a particular employee. If the covered date and lapse date are within 5 days, I'd like to add that policy to a results list.
So, expected output would be:
Policy | Customer | Employee
123 | 1234 | 1234
because policy 123's lapse date was within 5 days of policy 124's covered date.
So far, I've used this code:
import pandas
import datetime
#Pull in data from query
wd = pandas.read_csv('DATA')
wd=wd.set_index('Policy#')
wd = wd.rename(columns={'Policy#':'Policy'})
Resultlist=[]
for EMPID in wd.groupby(['EMPID', 'Customer']):
    for Policy in wd.groupby(['EMPID','Customer']):
        EffDate = pandas.to_datetime(wd['CoverageEffDate'])
        for Policy in wd.groupby(['EMPID','Customer']):
            check=wd['LapseDate'].astype(str)
            if check.any() =='?': #here lies the problem - it's evaluating if ANY of the items ='?'
                print(check)
                continue
            else:
                LapseDate = pandas.to_datetime(wd['LapseDate']) + datetime.timedelta(days=5)
                if EffDate < LapseDate:
                    Resultlist.append(wd['Policy','Customer'])
print(Resultlist)
I'm trying to use the pandas .any() function to evaluate if the current row is a '?' (which means null data, i.e. the policy hasn't lapsed). However, it appears that this statement just evaluates if there is a '?' row in the entire column, not the current row. I need to determine this because if I compare the '?' value against a date I get an error.
Is there a way to reference just the row I'm iterating on for a conditional check? To my knowledge, I can't use the pandas apply function technique because I need each employee's policy data compared against any other policies they hold.
Thank you!
check.str.contains('?', regex=False) would return a boolean array showing which entries had a '?' in them (regex=False is needed because a bare '?' is not a valid regular expression). Otherwise you might consider just iterating through, i.e.
check=wd['LapseDate'].astype(str)
for row in check:
    if row == '?':
        print(check)
but there's really no difference between checking for any match and returning if there's a match and iterating through all and returning if there's a match.
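A different route that avoids the row-by-row check altogether is to turn the '?' entries into NaT and compare every policy against the other policies of the same customer and employee with a self-merge. This is only a sketch under assumed column names (Policy, Customer, Employee, CoverageDate, LapseDate) taken loosely from the table in the question, not from the real file.

```python
import pandas as pd

wd = pd.DataFrame({
    "Policy": [123, 124, 125, 124],
    "Customer": [1234, 1234, 1234, 5678],
    "Employee": [1234, 1234, 1234, 5555],
    "CoverageDate": ["2011-06-01", "2016-01-01", "2011-06-01", "2014-01-01"],
    "LapseDate": ["2015-12-31", "?", "2012-01-01", "?"],
})

wd["CoverageDate"] = pd.to_datetime(wd["CoverageDate"])
# errors="coerce" turns the '?' placeholders into NaT instead of raising.
wd["LapseDate"] = pd.to_datetime(wd["LapseDate"], errors="coerce")

# Pair every policy with every policy of the same customer/employee.
pairs = wd.merge(wd, on=["Customer", "Employee"], suffixes=("", "_other"))
pairs = pairs[pairs["Policy"] != pairs["Policy_other"]]

# Gap between this policy's lapse and the other policy's coverage start.
gap = (pairs["CoverageDate_other"] - pairs["LapseDate"]).abs()

# Comparisons involving NaT are False, so unlapsed policies drop out.
result = pairs.loc[
    gap <= pd.Timedelta(days=5), ["Policy", "Customer", "Employee"]
].drop_duplicates()

print(result)
```

On the sample rows this leaves only policy 123, whose lapse date is one day before policy 124's coverage date for the same employee.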

How can I concatenate 3 columns into 1 using Excel and Python?

I am working on a project in GIS software where I need to have a column containing dates in the format YYYY-MM-DD. Currently, in Excel I have 3 columns: 1 with the year, 1 with the month and 1 with the day. Looks like this:
| A | B | C |
| 2012 | 1 | 1 |
| 2012 | 2 | 1 |
| 2012 | 3 | 1 |
...etc...
And I need it to look like this:
| A |
| 2012-01-01|
| 2012-02-01|
| 2012-03-01|
I have several workbooks that I need in the same format so I figured that perhaps python would be a useful tool so that I didn't have to manually concatenate everything in Excel.
So, my question is, is there a simple way to not only concatenate these three columns, but to also add a zero in front of the month and day numbers?
I have been experimenting a little bit with the python library openpyxl, but have not come up with anything useful so far. Any help would be appreciated, thanks.
If you're going to be staying in Excel, you may as well just use a worksheet formula. If your year, month, and day are in columns A, B and C, you can type this in column D to concatenate them, then format it as a date and adjust the padding.
=$A1 & "-" & $B1 & "-" & $C1
Try this:
def DateFromABC(self,ws):
    import datetime
    # Start reading from row 1
    for i,row in enumerate(ws.rows,1):
        sRow = str(i)
        # datetime goes into column 'D'
        Dcell = ws['D'+sRow]
        # Set datetime from 'ABC'
        Dcell.value = datetime.date(year=ws['A'+sRow].value,
                                    month=ws['B'+sRow].value,
                                    day=ws['C'+sRow].value
                                    )
        print('i=%s,type=%s, value=%s' % (i,str(type(Dcell.value)),Dcell.value ) )
    #end for
#end def DateFromABC
Tested with Python:3.4.2 - openpyxl:2.4.1 - LibreOffice: 4.3.3.2
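If the GIS software needs the value as text in YYYY-MM-DD form rather than as a date cell, the zero padding itself can be sketched with the standard library alone. The rows list below stands in for the (year, month, day) values read from columns A, B and C and is made up for the example.

```python
from datetime import date

# Stand-in for the (year, month, day) values read from columns A, B and C.
rows = [(2012, 1, 1), (2012, 2, 1), (2012, 3, 1)]

# date.isoformat() always zero-pads month and day: YYYY-MM-DD.
formatted = [date(y, m, d).isoformat() for y, m, d in rows]

print(formatted)
```

Going through date() also validates the values, so an impossible combination such as month 13 raises a ValueError instead of silently producing a bad string.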
