Make date column into standard format using pandas - python

How can I use pandas to convert a date column into a standard format, e.g. 12-08-1996? The data I have is:
I've tried several methods I found online, but none of them detects the format of each value and converts it to a standard one.
Here is what I've coded:
df = pd.read_excel(r'date cleanup.xlsx')
df.head(10)
df.DOB = pd.to_datetime(df.DOB) #Error is in this line
The error I get is:
ValueError: ('Unknown string format:', '20\ \december\ \1992')
UPDATE:
Using
from dateutil import parser
for date in df.DOB:
    print(parser.parse(date))
This works well, except for the value 20\\december \\1992, which raises the error shown above. I don't know all the formats present in the data, which is why I'm looking for a technique that can auto-detect each format and convert it to a standard one.

You could use the dateparser library:
import pandas as pd
import dateparser
df = pd.DataFrame(["12 aug 1996", "24th december 2006", "20\\ december \\2007"], columns=['DOB'])
df['date'] = df['DOB'].apply(lambda x: dateparser.parse(x))
Output
| | DOB | date |
|---|--------------------|------------|
| 0 | 12 aug 1996 | 1996-08-12 |
| 1 | 24th december 2006 | 2006-12-24 |
| 2 | 20\ december \2007 | 2020-12-07 |
EDIT
Note that dateparser has a STRICT_PARSING setting. With it enabled, incomplete dates (missing a day, month or year) are not parsed at all and parse() returns None for them:
df['date'] = df['DOB'].apply(lambda x: dateparser.parse(x, settings={'STRICT_PARSING': True}) if len(str(x)) > 6 else None)
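If the stray backslashes are the main problem, cleaning the strings before parsing may be enough. The following is a rough, dependency-free sketch rather than a replacement for dateparser: it assumes English month names, day-first values, and that the listed formats cover the data. The names `CANDIDATE_FORMATS` and `to_standard` are made up for illustration.

```python
import re
from datetime import datetime

# Hypothetical list of formats expected in the column; extend it as needed.
CANDIDATE_FORMATS = ("%d %b %Y", "%d %B %Y", "%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")

def to_standard(raw):
    """Return the date as DD-MM-YYYY, or None if no candidate format matches."""
    if not isinstance(raw, str):
        return None
    # Replace stray backslashes with spaces.
    text = raw.replace("\\", " ")
    # Drop ordinal suffixes such as the "th" in "24th".
    text = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text, flags=re.IGNORECASE)
    # Collapse repeated whitespace.
    text = " ".join(text.split())
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%d-%m-%Y")
        except ValueError:
            continue
    return None

samples = ["12 aug 1996", "24th december 2006", "20\\ december \\2007", "not a date"]
results = [to_standard(s) for s in samples]
print(results)
```

With a DataFrame, the same function could be applied as `df['DOB'].apply(to_standard)`; values that come back as None show which formats still need to be added to the list.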


Is there a difference between pd.Series.dt.weekday vs pd.Series.dt.dayofweek?

I am working with time in pandas and have seen that there are two ways of extracting a day-of-week integer from a timestamp: pd.Series.dt.weekday and pd.Series.dt.dayofweek.
The documentation says that both return: Series or Index containing integers indicating the day number where the day of the week with Monday=0, Sunday=6.
Am I missing something or are these two functions effectively the same?
In the "See Also" section, it does describe the other function as an Alias. Does this answer my question?
You answered your own question:
| dayofweek
| The day of the week with Monday=0, Sunday=6.
|
| Return the day of the week. It is assumed the week starts on
| Monday, which is denoted by 0 and ends on Sunday which is denoted
| by 6. This method is available on both Series with datetime
| values (using the `dt` accessor) or DatetimeIndex.
|
| Returns
| -------
| Series or Index
| Containing integers indicating the day number.
|
| See Also
| --------
| Series.dt.dayofweek : Alias.
| Series.dt.weekday : Alias. # <-- YES IT'S AN ALIAS
| Series.dt.day_name : Returns the name of the day of the week.
Source code:
dayofweek = day_of_week
weekday = dayofweek
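A quick way to confirm this on your own installation is to compare the two accessors on a small Series. This is only a sanity check; the sample dates are arbitrary (a Monday and a Sunday).

```python
import pandas as pd

# 2021-01-04 is a Monday, 2021-01-10 is a Sunday.
s = pd.Series(pd.to_datetime(["2021-01-04", "2021-01-10"]))

# Both accessors should yield the same integers (Monday=0, Sunday=6).
same = s.dt.weekday.equals(s.dt.dayofweek)
print(same)
print(s.dt.weekday.tolist())
```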

Converting VBA script to Python Script

I'm trying to convert some VBA scripts into a Python script, and I'm having trouble figuring a few things out, because the results differ from what the Excel file gives.
I have an example dataframe like this:
|Name    | A_Date    |
_______________________
|RAHEAL  | 04/30/2020|
|GIFTY   | 05/31/2020|
|ERIC    | 03/16/2020|
|PETER   | 05/01/2020|
|EMMANUEL| 12/15/2019|
|BABA    | 05/23/2020|
and I want to achieve this result (the VBA script's result):
|Name    | A_Date    | Sold
__________________________________
|RAHEAL  | 04/30/2020| No
|GIFTY   | 05/31/2020| Yes
|ERIC    | 03/16/2020| No
|PETER   | 05/01/2020| Yes
|EMMANUEL| 12/15/2019| No
|BABA    | 05/23/2020| Yes
By converting this VBA script:
Range("C2").Select
Selection.Value = _
"=IF(RC[-1]>=(INT(R2C2)-DAY(INT(R2C2))+1),""Yes"",""No"")"
Selection.AutoFill Destination:=Range("C2", "C" & Cells(Rows.Count, 1).End(xlUp).Row)
Range("C1").Value = "Sold"
ActiveSheet.Columns("C").Copy
ActiveSheet.Columns("C").PasteSpecial xlPasteValues
Simply: =IF(B2>=(INT($B$2)-DAY(INT($B$2))+1),"Yes","No")
To this Python script:
sales['Sold']=np.where(sales['A_Date']>=(sales['A_Date'] - pd.to_timedelta(sales.A_Date.dt.day, unit='d'))+ timedelta(days=1),'Yes','No')
But I keep getting "Yes" for every row. Could anyone help me spot where I made a mistake?
import pandas as pd
df = pd.DataFrame({'Name':['RAHEAL','GIFTY','ERIC','PETER','EMMANUEL','BABA'],
                   'A_Date':['04/30/2020','05/31/2020','03/16/2020',
                             '05/01/2020','12/15/2019','05/23/2020']})
df['A_Date'] = pd.to_datetime(df['A_Date'])
print(df)
df['Sold'] = df['A_Date'] >= df['A_Date'].iloc[0].replace(day=1)
df['Sold'] = df['Sold'].map({True:'Yes', False:'No'})
print(df)
output:
Name A_Date
0 RAHEAL 2020-04-30
1 GIFTY 2020-05-31
2 ERIC 2020-03-16
3 PETER 2020-05-01
4 EMMANUEL 2019-12-15
5 BABA 2020-05-23
Name A_Date Sold
0 RAHEAL 2020-04-30 Yes
1 GIFTY 2020-05-31 Yes
2 ERIC 2020-03-16 No
3 PETER 2020-05-01 Yes
4 EMMANUEL 2019-12-15 No
5 BABA 2020-05-23 Yes
If I read the formula right, the condition is A_Date >= 04/01/2020 (i.e. the first day of the month of the date in B2), so RAHEAL should be Yes too.
I don't know if you noticed (or if this is intended), but if an A_Date value has a fractional part (i.e. a time), there is room for error when you calculate the first of the month. If the time in the first row is, say, 10:00 AM, the cutoff computed with replace(day=1) will be 04/01/2020 10:00. Another value such as 04/01/2020 09:00 will then be evaluated as False/No. The Excel formula itself avoids this, because INT() drops the time part of B2; in pandas you can get the same effect by normalizing the cutoff to midnight.
EDIT (12 Jan 2021): Note, values in column A_Date are of type datetime.datetime or datetime.date. Presumably they are converted when reading the Excel file or explicitly afterwards.
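The two pitfalls discussed here can be reproduced with plain datetime objects. The sketch below uses made-up sample values (including hypothetical times of day) and a helper named `first_of_month` that is not part of the original code. It shows that comparing each row with its own first of the month is always true, which is consistent with the all-"Yes" result in the question, and that a cutoff which keeps the time of day can reject a value from the same date.

```python
from datetime import datetime

# Hypothetical sample values; the first row plays the role of cell B2.
dates = [
    datetime(2020, 4, 30, 10, 0),
    datetime(2020, 4, 1, 9, 0),
    datetime(2020, 3, 16),
]

def first_of_month(d):
    """First day of d's month at midnight (time part dropped)."""
    return d.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

# Comparing every row with its OWN first of the month is always true.
per_row = [d >= first_of_month(d) for d in dates]

# A fixed cutoff that keeps the 10:00 time rejects 04/01/2020 09:00.
raw_cutoff = dates[0].replace(day=1)
with_time = ["Yes" if d >= raw_cutoff else "No" for d in dates]

# Flooring the cutoff to midnight mirrors INT() in the Excel formula.
cutoff = first_of_month(dates[0])
floored = ["Yes" if d >= cutoff else "No" for d in dates]

print(per_row, with_time, floored)
```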
Very much embarrassed I didn't see the simple, elegant solution that buran gave +. I did more of a literal translation.
first_date.toordinal() - 693594 is the integer date value (the Excel serial number) for your initial date, and current_date.toordinal() - 693594 is the integer date value for the current row of the dates column. I apply your cell formula logic to each A_Date value and output the result as the corresponding Sold value.
import pandas as pd
from datetime import datetime
def is_sold(current_date: datetime, first_date: datetime, day_no: int) -> str:
    # use of toordinal idea from #rjha94 https://stackoverflow.com/a/47478659
    # 693594 is the ordinal of 1899-12-30, so each difference is an Excel serial date
    if current_date.toordinal() - 693594 >= first_date.toordinal() - 693594 - day_no + 1:
        return "Yes"
    else:
        return "No"
sales = pd.DataFrame({'Name':['RAHEAL','GIFTY','ERIC','PETER','EMMANUEL','BABA'],
                      'A_Date':['2020-04-30','2020-05-31','2020-03-16','2020-05-01','2019-12-15','2020-05-23']})
sales['A_Date'] = pd.to_datetime(sales['A_Date'], errors='coerce')
first_date = sales['A_Date'][0]
# day_no is the day of month of the first date, mirroring DAY(INT($B$2)) in the formula
sales['Sold'] = sales['A_Date'].apply(lambda x: is_sold(x, first_date, first_date.day))
print(sales)
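For readers wondering where 693594 comes from: it is the proleptic ordinal of 1899-12-30, the date that Excel's serial numbers effectively count from for dates after February 1900. The short check below works through the first row of the sample data; the variable names are illustrative only.

```python
from datetime import date

EXCEL_EPOCH_ORDINAL = date(1899, 12, 30).toordinal()

first = date(2020, 4, 30)
serial = first.toordinal() - EXCEL_EPOCH_ORDINAL

# Same arithmetic as INT($B$2) - DAY(INT($B$2)) + 1 in the worksheet formula.
cutoff_serial = serial - first.day + 1
cutoff_date = date.fromordinal(cutoff_serial + EXCEL_EPOCH_ORDINAL)

print(EXCEL_EPOCH_ORDINAL, serial, cutoff_serial, cutoff_date)
```

The cutoff lands on the first day of the month of the first row, which is the same value that `replace(day=1)` produces in the shorter answer.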

Python for loop returns first value

Background: the code is supposed to reference a ticker symbol and the time of a trade, and then pull in the subsequent closing prices after the time of the trade (next_price_1m and next_price_2m)
Problem: when a ticker repeats, the next_price_1m and next_price_2m repeat from the prior call for that ticker, even though the time of the trade has changed. My initial call to get barset is working, but only for the first instance of the ticker.
Example output:
symbol | transaction_time | next price 1m | next price 2m
-------|------------------|---------------|--------------
JPM    | 10:00 a.m.       | $90           | $91
SPY    | 10:25 a.m.       | $260          | $261
JPM    | 11:37 a.m.       | $90           | $91
AAPL   | 2:25 p.m.        | $330          | $335
JPM    | 3:02 p.m.        | $90           | $91
JPM should have different next_price_1m and next_price_2m on 2nd and 3rd calls.
Code:
trades_list = api.get_activities(date='2020-04-06')
data = []
for trade in trades_list:
    my_list_of_trade_order_ids = trade.order_id
    price = trade.price
    qty = trade.qty
    side = trade.side
    symbol = trade.symbol
    transaction_time = trade.transaction_time
    client_order_id = api.get_order(trade.order_id).client_order_id
    barset = api.get_barset(timeframe='minute',symbols=trade.symbol,limit=15,after=trade.transaction_time)
    df_bars = pd.DataFrame(barset)
    next_price_1m = df_bars.iat[0,0].c
    next_price_2m = df_bars.iat[1,0].c
    data.append({'price':price, 'qty':qty, 'side':side,'symbol':symbol,'transaction_time':transaction_time, 'client_order_id':client_order_id, 'next price 1m':next_price_1m,'next price 2m':next_price_2m})
df = pd.DataFrame(data)
df
Thanks for the responses.
The issue was that the 'after' parameter expects a timestamp in a different format from the one provided by trade.transaction_time.
Rather than raising an error, the API ignored the time parameter and just returned the latest data.
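The answer does not say which format the API expects, so the following is only a sketch of the general fix: convert the timestamp to an explicit string before passing it on. It assumes an ISO 8601 string with a UTC offset is wanted and that timestamps without a timezone are in UTC; both are assumptions, and `to_iso8601` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def to_iso8601(ts):
    """Format a datetime as an ISO 8601 string with an explicit offset."""
    if ts.tzinfo is None:
        # Assumption: timestamps without a timezone are treated as UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.isoformat(timespec="seconds")

trade_time = datetime(2020, 4, 6, 14, 0, 0)  # made-up sample value
after_param = to_iso8601(trade_time)
print(after_param)
```

Because the API silently ignored the malformed value, it may also be worth checking that the first bar returned is actually later than the trade time before using its closing price.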

How can I concatenate 3 columns into 1 using Excel and Python?

I am working on a project in GIS software where I need to have a column containing dates in the format YYYY-MM-DD. Currently, in Excel I have 3 columns: 1 with the year, 1 with the month and 1 with the day. Looks like this:
| A | B | C |
| 2012 | 1 | 1 |
| 2012 | 2 | 1 |
| 2012 | 3 | 1 |
...etc...
And I need it to look like this:
| A |
| 2012-01-01|
| 2012-02-01|
| 2012-03-01|
I have several workbooks that I need in the same format so I figured that perhaps python would be a useful tool so that I didn't have to manually concatenate everything in Excel.
So, my question is, is there a simple way to not only concatenate these three columns, but to also add a zero in front of the month and day numbers?
I have been experimenting a little bit with the python library openpyxl, but have not come up with anything useful so far. Any help would be appreciated, thanks.
If you're going to stay in Excel, you may as well use a worksheet formula. If your year, month, and day are in columns A, B and C, you can type this in column D to concatenate them, then format the result as a date and adjust the padding.
=$A1 & "-" & $B1 & "-" & $C1
Try this:
def DateFromABC(self, ws):
    import datetime
    # Start reading from row 1
    for i, row in enumerate(ws.rows, 1):
        sRow = str(i)
        # datetime goes into column 'D'
        Dcell = ws['D' + sRow]
        # Set datetime from 'ABC'
        Dcell.value = datetime.date(year=ws['A' + sRow].value,
                                    month=ws['B' + sRow].value,
                                    day=ws['C' + sRow].value
                                    )
        print('i=%s,type=%s, value=%s' % (i, str(type(Dcell.value)), Dcell.value))
    #end for
#end def DateFromABC
Tested with Python:3.4.2 - openpyxl:2.4.1 - LibreOffice: 4.3.3.2
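On the zero-padding part of the question: if the GIS software needs the text YYYY-MM-DD rather than a date cell, the padding can be produced in Python directly. This small sketch uses the three sample rows from the question as plain tuples instead of reading a workbook.

```python
from datetime import date

# (year, month, day) triples, as they would come from columns A, B and C.
rows = [(2012, 1, 1), (2012, 2, 1), (2012, 3, 1)]

# date.isoformat() always yields zero-padded YYYY-MM-DD and validates the values.
iso_dates = [date(y, m, d).isoformat() for y, m, d in rows]

# Equivalent string formatting without building date objects.
padded = [f"{y:04d}-{m:02d}-{d:02d}" for y, m, d in rows]

print(iso_dates)
```

Going through `date` has the side benefit of raising a ValueError for impossible combinations such as month 13.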

Using .apply() in SFrames to manipulate multiple columns of each row

I have an SFrame with the columns Date1 and Date2.
I am trying to use .apply() to find the datediff between Date1 and Date2, but I can't figure out how to use the other argument.
Ideally something like
frame['new_col'] = frame['Date1'].apply(lambda x: datediff(x,frame('Date2')))
You can take the difference between the dates in Date2 and those in Date1 directly by subtracting frame['Date1'] from frame['Date2']. For some reason this returns the number of seconds between the two dates (only tested with Python's datetime objects), which you can convert into a number of days with simple arithmetic:
from sframe import SFrame
from datetime import datetime, timedelta
mydict = {'Date1':[datetime.now(), datetime.now()+timedelta(2)],
          'Date2':[datetime.now()+timedelta(10), datetime.now()+timedelta(17)]}
frame = SFrame(mydict)
frame['new_col'] = (frame['Date2'] - frame['Date1']).apply(lambda x: x//(60*60*24))
Output:
+----------------------------+----------------------------+---------+
| Date1 | Date2 | new_col |
+----------------------------+----------------------------+---------+
| 2016-10-02 21:12:14.712556 | 2016-10-12 21:12:14.712574 | 10.0 |
| 2016-10-04 21:12:14.712567 | 2016-10-19 21:12:14.712576 | 15.0 |
+----------------------------+----------------------------+---------+
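The seconds-to-days arithmetic can be checked with the standard library alone. The values below are invented for illustration; the point is that floor division by 86400 (60*60*24) discards any partial day, which matches the whole-day results in the output above.

```python
from datetime import datetime, timedelta

start = datetime(2016, 10, 2, 21, 12, 14)
end = start + timedelta(days=10, hours=5)

# Difference in seconds, comparable to what the column subtraction yields.
seconds = (end - start).total_seconds()

# Floor division by the number of seconds in a day drops the extra 5 hours.
whole_days = seconds // (60 * 60 * 24)
print(seconds, whole_days)
```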
