.between method not working as expected - Python

I am encountering some issues when using the .between method in Python.
I have a simple dataset consisting of ~59000 records.
The dates are in DD/MM/YYYY format, and I would like to filter for the days in April 2014.
psi_df = pd.read_csv('thecsvfile.csv')
psi_west_df = psi_df[['24-hr_psi','west']]
april_records = psi_west_df[psi_west_df['24-hr_psi'].between('1/4/2014','31/4/2014')]
april_records.head(100)
In the output I received, the date suddenly jumps from 3/4/2014 (3rd April) to 10/4/2014 (10th April). This pattern recurs for every month and every year up to 2020 (the final year of this dataset), which is not what I intended: I only want the data for April 2014.
As I am still rather new to Python, I decided to attempt a fix in Excel instead: I separated the date and time into two columns and reran the code with the syntax updated accordingly.
psi_df = pd.read_csv('psi_new.csv')
psi_west_df = psi_df[['date','west']]
april_records = psi_west_df[psi_west_df['date'].between('1/4/2014','31/4/2014')]
april_records.head(100)
I still faced the same issue, and now I am totally stumped as to why this is occurring. Am I using the .between method wrongly? Any guidance or direction would be much appreciated. Many thanks, everyone.
The csv file that I am using can be obtained from this website:
https://data.gov.sg/dataset/historical-24-hr-psi

The first problem is that your date column isn't a date but an object (string) column, so .between compares the values lexicographically; that is why '4/4/2014' falls outside the string range '1/4/2014' to '31/4/2014' while '10/4/2014' does not.
Ensure your column really is a date by using the pandas to_datetime function:
psi_west_df['date'] = pd.to_datetime(psi_west_df['date'], format='%d/%m/%Y')
Once the column really is a datetime column, pass the between function two date objects rather than strings (note also that April only has 30 days, so '31/4/2014' is not a valid date), like this:
start_day = pd.to_datetime('1/4/2014', format='%d/%m/%Y')
end_day = pd.to_datetime('30/4/2014', format='%d/%m/%Y')
april_records = psi_west_df[psi_west_df['date'].between(start_day, end_day)]
So all together:
psi_df = pd.read_csv('psi_new.csv')
psi_west_df = psi_df[['date','west']]
psi_west_df['date'] = pd.to_datetime(psi_west_df['date'], format='%d/%m/%Y')
start_day = pd.to_datetime('1/4/2014', format='%d/%m/%Y')
end_day = pd.to_datetime('30/4/2014', format='%d/%m/%Y')
april_records = psi_west_df[psi_west_df['date'].between(start_day, end_day)]
april_records.head(100)
Note - this code assumes the data after you change it with Excel, i.e. that you have separate columns for date and time.
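An alternative that avoids choosing a month-end endpoint at all is to filter on year and month with the .dt accessor. A minimal sketch, with made-up sample data standing in for the CSV:

```python
import pandas as pd

# Hypothetical frame standing in for psi_new.csv
psi_west_df = pd.DataFrame({
    'date': ['31/3/2014', '1/4/2014', '15/4/2014', '30/4/2014', '1/5/2014'],
    'west': [55, 60, 62, 58, 57],
})
psi_west_df['date'] = pd.to_datetime(psi_west_df['date'], format='%d/%m/%Y')

# Select April 2014 via the .dt accessor instead of .between
april_records = psi_west_df[(psi_west_df['date'].dt.year == 2014) &
                            (psi_west_df['date'].dt.month == 4)]
print(len(april_records))  # 3
```

This sidesteps the question of how many days the month has, which is easy to get wrong (as the '31/4/2014' endpoint shows).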

Related

Processing data with incorrect dates like 30th of February

In trying to process a large number of bank account statements given in CSV format, I realized that some of the dates are invalid (the 30th of February, which does not exist).
So this snippet fails [1], telling me that some dates are incorrect:
df_from_csv = pd.read_csv( csv_filename
, encoding='cp1252'
, sep=";"
, thousands='.', decimal=","
, dayfirst=True
, parse_dates=['Buchungstag', 'Wertstellung']
)
I could of course pre-process those CSV files and replace the 30th of Feb with the 28th of Feb (or whatever day that February ended on in that year).
But is there a way to do this in Pandas, while importing? Like "If this column fails, set it to X"?
Sample row
775945;28.02.2018;30.02.2018;;901;"Zinsen"
As you can see, the date 30.02.2018 is not correct, because there ain't no 30th of Feb. But this seems to be a known problem in Germany. See [2].
[1] Here's the error message:
ValueError: day is out of range for month
[2] https://de.wikipedia.org/wiki/30._Februar
Here is how I solved it:
I added a custom date-parser:
import calendar
import glob
import pandas as pd
from datetime import datetime

def mydateparser(dat_str):
    """Given a date like `30.02.2020`, return the corrected date `29.02.2020`"""
    if dat_str.startswith("30.02"):
        (d, m, y) = [int(el) for el in dat_str.split(".")]
        # monthrange returns (weekday of the 1st, number of days in the month):
        (first, last) = calendar.monthrange(y, m)
        # Use the correct last day (`last`) in creating a new date string:
        dat_str = f"{last:02d}.{m:02d}.{y}"
    return datetime.strptime(dat_str, "%d.%m.%Y")
(Note that pd.datetime was removed in newer pandas versions, so datetime is imported directly here.)
# and used it in `read_csv`
for csv_filename in glob.glob(f"{path}/*.csv"):
# read csv into DataFrame
df_from_csv = pd.read_csv(csv_filename,
encoding='cp1252',
sep=";",
thousands='.', decimal=",",
dayfirst=True,
parse_dates=['Buchungstag', 'Wertstellung'],
date_parser=mydateparser
)
This allows me to fix those incorrect "30.02.XX" dates and allow pandas to convert those two date columns (['Buchungstag', 'Wertstellung']) into dates, instead of objects.
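On recent pandas versions, where pd.datetime is gone and date_parser is deprecated, an alternative sketch is to coerce the impossible dates to NaT and then snap them to the month's real last day (the column name is taken from the question's sample row):

```python
import pandas as pd

df = pd.DataFrame({'Wertstellung': ['28.02.2018', '30.02.2018', '01.03.2018']})

# errors='coerce' turns impossible dates like 30.02.2018 into NaT
parsed = pd.to_datetime(df['Wertstellung'], format='%d.%m.%Y', errors='coerce')

# Repair the NaT rows: parse the remaining 'MM.YYYY' part and snap
# forward to that month's last day with a MonthEnd(0) offset
bad = parsed.isna()
month_start = pd.to_datetime(df.loc[bad, 'Wertstellung'].str[3:], format='%m.%Y')
parsed.loc[bad] = month_start + pd.offsets.MonthEnd(0)
print(parsed.tolist())
```

The MonthEnd(0) offset rolls a date forward to the end of its month (and leaves it unchanged if it is already there), so February's leap years are handled for free.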
You could load it all up as text, then run it through a regex to identify non-legal dates, to which you could then apply some adjustment function.
A sample regex you might apply could be:
ok_date_pattern = re.compile(r"^(0[1-9]|[12][0-9]|3[01])[-](0[1-9]|1[012])[-](19|20)[0-9]{2}\b")
This finds dates in DD-MM-YYYY format where DD is constrained to 01 to 31 (i.e. a day of 42 would be considered illegal), MM is constrained to 01 to 12, and YYYY is constrained to the range 1900 to 2099.
There are other regexes that go into more depth - such as some of the inventive answers found in other questions on this topic.
What you then need is a working adjustment function - perhaps that parses the date as best it can and returns a nearest legal date. I'm not aware of anything that does that out of the box, but a function could be written to deal with the most common edge cases I guess.
Then it'd be a case of tagging legal and illegal dates using an appropriate regex, and assigning some date-conversion function to deal with these two classes of dates appropriately.
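As a sketch of that tagging step (the sample dates are made up; note that the pattern only bounds the numeric ranges, so a calendar-invalid 30-02 still passes and must be caught by the adjustment function):

```python
import re

# Day 01-31, month 01-12, year 1900-2099, in DD-MM-YYYY form
ok_date_pattern = re.compile(
    r"^(0[1-9]|[12][0-9]|3[01])[-](0[1-9]|1[012])[-](19|20)[0-9]{2}\b")

dates = ["28-02-2018", "42-02-2018", "15-13-2018", "30-02-2018"]
legal = [d for d in dates if ok_date_pattern.match(d)]
illegal = [d for d in dates if not ok_date_pattern.match(d)]
print(legal)    # ['28-02-2018', '30-02-2018'] -- 30-02 still passes the
                # range check even though it is not a real calendar date
print(illegal)  # ['42-02-2018', '15-13-2018']
```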

Making both day-first and month-first dates in a csv file day-first

I have a csv file that has a column of dates. The dates are in order of month - so January comes first, then Feb, and so on. The problem is some of the dates are in mm/dd/yyyy format and others in dd/mm/yyyy format. Here's what it looks like.
Date
01/08/2005
01/12/2005
15/01/2005
19/01/2005
22/01/2005
26/01/2005
29/01/2005
03/02/2005
05/02/2005
...
I would like to bring all of them to the same format (dd/mm/yyyy)
I am using Python and pandas to read and edit the csv file. I tried using Excel to manually change the date formats using the built-in formatting tools but it seems impossible with the large number of rows. I'm thinking of using regex but I'm not quite sure how to distinguish between month-first and day-first.
# here's what I have so far
date = df.loc[i, 'Date']
pattern = r'\d\d/\d\d/\d\d'
match = re.search(pattern, date)
if match:
    date_items = date.split('/')
    day = date_items[1]
    month = date_items[0]
    year = date_items[2]
    new_date = f'{day}/{month}/{year}'
    df.loc[i, 'Date'] = new_date
I want the csv to have a uniform date format in the end.
In short: you can't!
There's no way for you to know if 01/02/2019 is Jan 2nd or Feb 1st!
Same goes for other dates in your examples such as:
01/08/2005
01/12/2005
03/02/2005
05/02/2005
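What can be done mechanically is to flag which rows are even ambiguous: only dates whose leading field is 13 or more are provably day-first. A sketch using some of the dates from the question:

```python
import pandas as pd

dates = pd.Series(['01/08/2005', '15/01/2005', '03/02/2005', '29/01/2005'])

first = dates.str.slice(0, 2).astype(int)  # leading two digits
ambiguous = first <= 12                    # could be a day OR a month
print(dates[ambiguous].tolist())   # ['01/08/2005', '03/02/2005']
print(dates[~ambiguous].tolist())  # ['15/01/2005', '29/01/2005']
```

The unambiguous rows can be parsed safely with dayfirst=True; the ambiguous ones need outside information (for example the month ordering mentioned in the question) to resolve.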

Python, Getting stock data on particular days

I'm looking into getting data from the American stock exchanges for some Python code. Basically, what I need to do is take a stock name and a past date, and get all the data for the next 10 days the market is open. Is this possible?
market = input("Market:")
ticker = input("Ticker:")
ticker = ticker.upper()
ystartdate = (input("Start Date IN FORMAT yyyy-mm-dd:"))
day1=input("Day1 :")
day2=input("Day2 :")
day3=input("Day3 :")
day4=input("Day4 :")
day5=input("Day5 :")
day6=input("Day6 :")
day7=input("Day7 :")
day8=input("Day8 :")
day9=input("Day9 :")
day10=input("Day10:")
Currently I have to input all the data manually, and that is a pain to do. Basically, I would put in a stock and a date like 2012-10-15, and it would look at the stock on that date and for the next 10 days. If it's possible, it would be a life saver! Thanks
You should be working with a proper date/time type, not strings, for this.
You can use pandas, for example, with datetime64:
import pandas as pd
start_date = input("Starting Date: ")
dates = pd.date_range(start=start_date, periods=10)
There is also the datetime module, whose timedelta type may help you if you don't want to use pandas.
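The plain-datetime version alluded to above could look like this (calendar days, not trading days):

```python
from datetime import date, timedelta

start_date = date(2012, 10, 15)
# The 10 days following the start date
next_ten = [start_date + timedelta(days=i) for i in range(1, 11)]
print(next_ten[0])   # 2012-10-16
print(next_ten[-1])  # 2012-10-25
```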
I think what you need is included in pandas. In fact, you want to use either pandas.bdate_range or pandas.date_range with the freq argument set to 'B' (I think both are more or less the same). These create business days, that is, they do not include weekends. bdate_range also allows you to specify holidays, so I think it might be a little more flexible.
>>> import pandas as pd
>>> dates = pd.bdate_range(start='2018-10-25', periods=10) # Start date is a Thursday
>>> print(dates)
DatetimeIndex(['2018-10-25', '2018-10-26', '2018-10-29', '2018-10-30',
'2018-10-31', '2018-11-01', '2018-11-02', '2018-11-05',
'2018-11-06', '2018-11-07'],
dtype='datetime64[ns]', freq='B')
Note how this excludes the 27th (a Saturday) and the 28th (a Sunday). If you want to specify holidays, you need to specify freq='C'.
Having these dates in separate variables is kind of ugly, but if you really want to, you can then go and unpack them like this:
>>> day1, day2, day3, day4, day5, day6, day7, day8, day9, day10 = dates
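A sketch of the freq='C' variant, with a made-up holiday on 1 Nov 2018 being skipped:

```python
import pandas as pd

# 'C' = custom business day: holidays are dropped as well as weekends
dates = pd.bdate_range(start='2018-10-25', periods=10,
                       freq='C', holidays=['2018-11-01'])
print(dates[5])  # 2018-11-02 -- 1 Nov is skipped
```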

Python: Pick a specific date from a dataframe with Pandas

The following short script uses findatapy to collect data from the Dukascopy website. Note that this package uses pandas internally.
from findatapy.market import Market, MarketDataRequest, MarketDataGenerator
market = Market(market_data_generator=MarketDataGenerator())
md_request = MarketDataRequest(start_date='08 Feb 2017', finish_date='09 Feb 2017', category='fx', fields=['bid', 'ask'], freq='tick', data_source='dukascopy', tickers=['EURUSD'])
df = market.fetch_market(md_request)
#Group everything by an hourly frequency.
df=df.groupby(pd.TimeGrouper('1H')).head(1)
#Deleting the milliseconds from the DataFrame
df.index =df.index.map(lambda t: t.strftime('%Y-%m-%d %H:%M:%S'))
#Computing Average between columns 1 and 2, and storing it in a new one.
df['Avg'] = (df['EURUSD.bid'] + df['EURUSD.ask'])/2
The outcome (screenshot omitted) is a DataFrame with one row per hour: the EURUSD.bid, EURUSD.ask and new Avg columns.
Until this point, everything runs properly, but I need to extract a specific hour from this dataframe. I'd like to pick, say, all the values (bid, ask, avg... or just one of them) at a certain hour, 10:00:00 AM.
By seeing other posts, I thought I could do something like this:
match_timestamp = "10:00:00"
df.loc[(df.index.strftime("%H:%M:%S") == match_timestamp)]
But the outcome is an error message saying:
AttributeError: 'Index' object has no attribute 'strftime'
I can't even perform df.index.hour; it worked before the line where I remove the milliseconds (the dtype is datetime64[ns] up to that point), but after that the dtype is object. It looks like I need to reverse this formatting in order to use strftime.
Can you help me out?
You should take a look at resample:
df = df.resample('H').first() # resample for each hour and use first value of hour
then:
df.loc[df.index.hour == 10] # index is still a date object, play with it
If you don't like that, you can just set your index back to a datetime object like so:
df.index = pd.to_datetime(df.index)
Then your code should work as is.
Try resetting the index (note the .dt accessor, which is needed to call strftime on a datetime column):
match_timestamp = "10:00:00"
df = df.reset_index()
df = df.assign(Date=pd.to_datetime(df.Date))
df.loc[df.Date.dt.strftime("%H:%M:%S") == match_timestamp]
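Once the index is a real DatetimeIndex, pandas also offers DataFrame.between_time, which selects rows by time of day directly. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Avg': [1.06, 1.07, 1.08]},
                  index=pd.to_datetime(['2017-02-08 09:00:00',
                                        '2017-02-08 10:00:00',
                                        '2017-02-08 11:00:00']))
# Endpoints are inclusive, so equal start/end picks exactly 10:00:00 rows
ten_oclock = df.between_time('10:00', '10:00')
print(ten_oclock['Avg'].tolist())  # [1.07]
```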

What is an efficient way to trim a date in Python?

Currently I am trying to trim the current date into day, month and year with the following code.
#Code from my local machine
from datetime import datetime
from datetime import timedelta
five_days_ago = datetime.now()-timedelta(days=5)
# result: 2017-07-14 19:52:15.847476
get_date = str(five_days_ago).rpartition(' ')[0]
#result: 2017-07-14
#Extract the day
day = get_date.rpartition('-')[2]
# result: 14
#Extract the year
year = get_date.rpartition('-')[0]
# result: 2017-07
I am not a Python professional; I only picked up the language a couple of months ago. But I want to understand a few things here:
Why did I receive 2017-07, if str.rpartition() is supposed to split a string on the separator you declare (-, /, " ")? I was expecting to receive 2017...
Is there an efficient way to separate day, month and year? I do not want to repeat the same mistakes with my insecure code.
I tried my code in the following tech. setups:
local machine with Python 3.5.2 (x64), Python 3.6.1 (x64) and repl.it with Python 3.6.1
Try the following:
from datetime import date, timedelta
five_days_ago = date.today() - timedelta(days=5)
day = five_days_ago.day
year = five_days_ago.year
If what you want is a date (not a date and time), use date instead of datetime. Then, the day and year are simply properties on the date object.
As to your question regarding rpartition: it works by splitting on the rightmost separator (in your case, the hyphen between the month and the day) - that's what the r in rpartition stands for ("right"). So get_date.rpartition('-') returns the tuple ('2017-07', '-', '14').
If you want to persist with your approach, your year code would be made to work if you replace rpartition with partition, e.g.:
year = get_date.partition('-')[0]
# result: 2017
However, there's also a related (better) approach - use split:
parts = get_date.split('-')
year = parts[0]
month = parts[1]
day = parts[2]
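If zero-padded string forms of those parts are needed, strftime on the date object avoids string splitting entirely (the actual values depend on today's date):

```python
from datetime import date, timedelta

five_days_ago = date.today() - timedelta(days=5)
day_str = five_days_ago.strftime('%d')    # e.g. '14'
month_str = five_days_ago.strftime('%m')  # e.g. '07'
year_str = five_days_ago.strftime('%Y')   # e.g. '2017'
```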
