pandas.DataFrame.loc returning empty dataframes - python

Can someone shed some light on why I can't locate a row with the .loc operation when my search criterion is a date?
import yfinance as yf
import matplotlib.pyplot as plt
import pandas as pd
import os
import numpy as np
stockCode = 'AAPL'
data = yf.download(stockCode, '2014-10-20', '2015-01-27')
#dfClose = pd.DataFrame(data.Close.values)
dfOpen = pd.DataFrame(data.Open.values)
dflist = dfOpen.values
dfClose = pd.DataFrame({"open": data.Open.values,
                        "year": data.Close.index.year.values,
                        "month": data.Close.index.month.values,
                        "day": data.Close.index.day.values,
                        "date": data.Close.index.date})
dfClose[0:5]
open year month day date
0 98.320000 2014 10 20 2014-10-20
1 103.019997 2014 10 21 2014-10-21
2 102.839996 2014 10 22 2014-10-22
3 104.080002 2014 10 23 2014-10-23
4 105.180000 2014 10 24 2014-10-24
This returns an empty frame:
dfClose.loc[dfClose['date'] == "2014-10-21"]
open year month day date
I also tried a date range, with no luck:
dfClose.loc['2014-10-21':'2014-10-24']
open year month day date
Filtering does seem to work later on when I use a variable. Is this because the value is in a NumPy array?
floating_Max = np.amax(dflist)
print ("Max\n", dfClose.loc[dfClose['open'] == floating_Max])
Max
open year month day date
28 119.269997 2014 11 28 2014-11-28

I think the date column has object dtype in dfClose. Try the following to convert it to datetime64:
dfClose['date'] = pd.to_datetime(dfClose['date'], format='%Y-%m-%d')
dfClose.loc[dfClose['date'] == "2014-10-21"] should now return the matching row.
If you want to apply .loc over a date range, the date column should be set as the index after converting it to datetime64. To do that, try:
dfClose = dfClose.set_index('date')
The slice dfClose.loc['2014-10-21':'2014-10-24'] should then work.
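Putting the two steps together, a minimal sketch (using a small stand-in frame with the same column names as the question; the real dfClose is built from yfinance data):
import pandas as pd

# small stand-in for the question's dfClose
dfClose = pd.DataFrame({"open": [98.32, 103.02, 102.84, 104.08],
                        "date": ["2014-10-20", "2014-10-21", "2014-10-22", "2014-10-23"]})
# convert the object column to datetime64 so comparisons against date strings match
dfClose['date'] = pd.to_datetime(dfClose['date'], format='%Y-%m-%d')
print(dfClose.loc[dfClose['date'] == "2014-10-21"])   # one matching row
# for label-based slicing by date, move the dates into the index first
dfClose = dfClose.set_index('date')
print(dfClose.loc['2014-10-21':'2014-10-23'])          # rows in the date range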

+1 to @erip's thought. Here is what I do for filtering when the column is a pandas datetime (my guess at its type).
date_to_check = pd.Timestamp(2019, 3, 20)
filter_mask = df['Date'] > date_to_check
df_filtered=df[filter_mask]
If for whatever reason it is not already a datetime object, you can cast it as such:
df['Date'] = pd.to_datetime(df['Date'])

df.loc indexes on index labels, so when you write dfClose.loc[dfClose['date'] == "2014-10-21"] you are passing a Series of bools to .loc as a boolean mask.
What you could do is dfClose[dfClose['date'] == "2014-10-21"] to get the rows where date matches that value. Watch out for the types being compared, though: comparing a str to a datetime.date won't return what you expect.
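To illustrate that type warning, here is a small sketch (df and its contents are made up for the example): comparing a column of datetime.date objects against a string silently matches nothing, while comparing against the same type works.
import datetime
import pandas as pd

df = pd.DataFrame({"date": [datetime.date(2014, 10, 20), datetime.date(2014, 10, 21)]})
# date objects compared to a str are simply never equal, so the mask is all False
print(df[df['date'] == "2014-10-21"])                     # empty frame
# compare against the same type (or convert the column with pd.to_datetime first)
print(df[df['date'] == datetime.date(2014, 10, 21)])      # one row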

Related

Iterating through a range of dates in Python with missing dates

I have a pandas DataFrame with daily stock returns; the columns are the date and the return rate.
I only want to keep the last day of each week, but the data has some missing days. What can I do?
import pandas as pd
df = pd.read_csv('Daily_return.csv')
df.Date = pd.to_datetime(df.Date)
count = 300
for last_day in ('2017-01-01' + 7n for n in range(count)):
Actually my brain stops working at this point, with my limited imagination. Maybe the biggest problem is that this "+7n" kind of arithmetic is meaningless when some dates are missing.
I'll create a sample dataset with 40 dates and 40 sample returns, then sample 90 percent of that randomly to simulate the missing dates.
The key here is that you need to convert your date column into datetime if it isn't already, and make sure your df is sorted by the date.
Then you can groupby year/week and take the last value. If you run this repeatedly you'll see that the selected dates can change if the value dropped was the last day of the week.
Based on that
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['date'] = pd.date_range(start='04-18-2022',periods=40, freq='D')
df['return'] = np.random.uniform(size=40)
# Keep 90 percent of the records so we can see what happens when some days are missing
df = df.sample(frac=.9)
# In case your dates are actually strings
df['date'] = pd.to_datetime(df['date'])
# Make sure they are sorted from oldest to newest
df = df.sort_values(by='date')
df = df.groupby([df['date'].dt.isocalendar().year,
                 df['date'].dt.isocalendar().week], as_index=False).last()
print(df)
Output
date return
0 2022-04-24 0.299958
1 2022-05-01 0.248471
2 2022-05-08 0.506919
3 2022-05-15 0.541929
4 2022-05-22 0.588768
5 2022-05-27 0.504419

Changing timeseries column into a date

I have a time series with 2 columns; the first is hours since 1 Jan 1970. In this column a year is only 360 days long, with 12 months of 30 days. I need to convert this column into a usable date so that I can analyse the other column by month, year, etc. (e.g. 1997-Jan-1-1 being year-month-day-hour).
I think I need to use modulo arithmetic to convert each row of the hours column into hour_of_day, day_of_month, year and so on, so that the column becomes a year, month, day and hour, but I don't know how to do this. I appreciate it might be confusing; any help would be very welcome.
Input: 233280.5 (in hours)
Output: 1997-01-01-01 (year-month-day-hour)
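A direct modulo decomposition along the lines the question describes could look like the sketch below (my assumptions: exactly 360 days per year and 30 days per month, counted from 1970; split_360d_hours is just an illustrative helper). The leftover 0.5 falls within the first hour of 1 Jan 1997; the answer below instead maps the fractional year back onto the real calendar and rounds to the hour.
def split_360d_hours(hours):
    # whole days on the 360-day calendar, plus the leftover (possibly fractional) hour
    days, hour = divmod(hours, 24)
    year_index, day_of_year = divmod(int(days), 360)   # 360 days per year
    month = day_of_year // 30 + 1                      # 12 months of 30 days
    day = day_of_year % 30 + 1
    return 1970 + year_index, month, day, hour

print(split_360d_hours(233280.5))   # -> (1997, 1, 1, 0.5)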
You can calculate the number of years and add it to the reference date, e.g.:
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
refdate = pd.Timestamp('1970-01-01')
df = pd.DataFrame({'360d_year_hours': [233280.5]})
# we calculate the number of years and fractional years as helper Series
y_frac, y = np.modf(df['360d_year_hours'] / (24*360))
# now we can calculate the new date's year:
df['datetime'] = pd.Series(refdate + DateOffset(years=int(i)) for i in y)  # make sure years is an integer
# we need the days in the given year to be able to use y_frac
daysinyear = np.where(df['datetime'].dt.is_leap_year, 366, 365)
# ...so we can update the datetime and round to the hour:
df['datetime'] = (df['datetime'] + pd.to_timedelta(y_frac*daysinyear, unit='d')).dt.round('h')
# df['datetime']
# 0 1997-01-01 01:00:00
# Name: datetime, dtype: datetime64[ns]

Converting different date time formats to MM/DD/YYYY format in pandas dataframe

I have a date column in a pandas.DataFrame with various datetime formats, each value stored as a list object, like the following:
date
1 [May 23rd, 2011]
2 [January 1st, 2010]
...
99 [Apr. 15, 2008]
100 [07-11-2013]
...
256 [9/01/1995]
257 [04/15/2000]
258 [11/22/68]
...
360 [12/1997]
361 [08/2002]
...
463 [2014]
464 [2016]
For convenience, I want to convert them all to MM/DD/YYYY format. It doesn't seem possible to use a regex replace() for this, since that operation can't be run over list objects, and using strptime() on each cell would be too time-consuming.
What would be the easiest way to convert them all to the desired MM/DD/YYYY format? I find it very hard to do this on list objects within a DataFrame.
Note: for cell values of the form [YYYY] (e.g., [2014] and [2016]), I will assume they are the first day of that year (e.g., January 1, 2014), and for cell values such as [08/2002] (or [8/2002]), I will assume they are the first day of that month (i.e., August 1, 2002).
Given your sample data, with the addition of a NaT, this works:
Code:
df.date.apply(lambda x: pd.to_datetime(x).strftime('%m/%d/%Y')[0])
Test Code:
import pandas as pd
df = pd.DataFrame([
[['']],
[['May 23rd, 2011']],
[['January 1st, 2010']],
[['Apr. 15, 2008']],
[['07-11-2013']],
[['9/01/1995']],
[['04/15/2000']],
[['11/22/68']],
[['12/1997']],
[['08/2002']],
[['2014']],
[['2016']],
], columns=['date'])
df['clean_date'] = df.date.apply(
lambda x: pd.to_datetime(x).strftime('%m/%d/%Y')[0])
print(df)
Results:
date clean_date
0 [] NaT
1 [May 23rd, 2011] 05/23/2011
2 [January 1st, 2010] 01/01/2010
3 [Apr. 15, 2008] 04/15/2008
4 [07-11-2013] 07/11/2013
5 [9/01/1995] 09/01/1995
6 [04/15/2000] 04/15/2000
7 [11/22/68] 11/22/1968
8 [12/1997] 12/01/1997
9 [08/2002] 08/01/2002
10 [2014] 01/01/2014
11 [2016] 01/01/2016
It would be better to use the following; it parses the column to datetime, and then you can apply strftime to get the MM/DD/YYYY format:
df['Date_ColumnName'] = pd.to_datetime(df['Date_ColumnName'], dayfirst = False, yearfirst = False)
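For example, a short sketch of both steps (the column name and sample values are placeholders, not from the question):
import pandas as pd

df = pd.DataFrame({'Date_ColumnName': ['5/2/2009', '11/22/1968', '12/1/1997']})
# parse to datetime (month first), then format back to fixed MM/DD/YYYY strings
df['Date_ColumnName'] = pd.to_datetime(df['Date_ColumnName'], dayfirst=False, yearfirst=False)
df['Date_ColumnName'] = df['Date_ColumnName'].dt.strftime('%m/%d/%Y')
print(df)   # 05/02/2009, 11/22/1968, 12/01/1997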
The code provided below works for the following scenarios:
Change the date format from M/D/YYYY to MM/DD/YYYY (5/2/2009 to 05/02/2009)
Change from any format to MM/DD/YYYY
import pandas as pd
'''
* check whether the date format in the input file is already correct
* if it is, change the date format from M/D/YYYY to MM/DD/YYYY
* otherwise, parse whatever format the input file uses and
  convert it to MM/DD/YYYY
'''
input_file_name = 'C:/Users/Admin/Desktop/SarenderReddy/predictions.csv'
dest_file_name = 'C:/Users/Admin/Desktop/SarenderReddy/Enrich.csv'
#input_file_name = 'C:/Users/Admin/Desktop/SarenderReddy/enrichment.csv'
read_data = pd.read_csv(input_file_name)
print(pd.to_datetime(read_data['Date'], format='%m/%d/%Y', errors='coerce').notnull().all())
if pd.to_datetime(read_data['Date'], format='%m/%d/%Y', errors='coerce').notnull().all():
    print("Input file already uses the expected date format.")
    read_data['Date'] = pd.to_datetime(read_data['Date'], format='%m/%d/%Y')
    read_data['Date'] = read_data['Date'].dt.strftime('%m/%d/%Y')
    read_data.to_csv(dest_file_name, index=False)
    print(read_data['Date'])
else:
    print("Input file does NOT use the expected date format.")
    data_format = pd.read_csv(input_file_name, parse_dates=['Date'], dayfirst=True)
    #print(data_format['Date'])
    data_format['Date'] = pd.to_datetime(data_format['Date'], format='%m/%d/%Y')
    data_format['Date'] = data_format['Date'].dt.strftime('%m/%d/%Y')
    data_format.to_csv(dest_file_name, index=False)
    print(data_format['Date'])

Stripping and testing against Month component of a date

I have a dataset that looks like this:
import numpy as np
import pandas as pd
raw_data = {'Series_Date':['2017-03-10','2017-04-13','2017-05-14','2017-05-15','2017-06-01']}
df = pd.DataFrame(raw_data,columns=['Series_Date'])
print(df)
I would like to pass in a date parameter as a string as follows:
date = '2017-03-22'
I would now like to know if there are any dates in my DataFrame 'df' for which the month is 3 months after the month in the date parameter.
That is, if the month in the date parameter is March, it should check whether there are any dates in df from June. If there are, I would like to see those dates; if not, it should just output 'No date found'.
In this example, the output should be '2017-06-01', since it is a June date and my date parameter is from March.
Could anyone help me get started with this?
Convert your column to Timestamp:
df.Series_Date = pd.to_datetime(df.Series_Date)
date = pd.to_datetime('2017-03-01')
Then
df[
    (df.Series_Date.dt.year - date.year) * 12 +
    df.Series_Date.dt.month - date.month == 3
]
Series_Date
4 2017-06-01
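The question also asks to print 'No date found' when nothing matches; one way to add that on top of the filter above (my own sketch, not part of the original answer) is:
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-04-13', '2017-05-14', '2017-05-15', '2017-06-01']}
df = pd.DataFrame(raw_data, columns=['Series_Date'])
df.Series_Date = pd.to_datetime(df.Series_Date)
date = pd.to_datetime('2017-03-01')

# rows whose month is exactly 3 calendar months after the parameter's month
matches = df[
    (df.Series_Date.dt.year - date.year) * 12 +
    df.Series_Date.dt.month - date.month == 3
]
print(matches if not matches.empty else 'No date found')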

How to find the number of the day in a year based on the actual dates using Pandas?

My data frame data has a date variable dateOpen in the format date_format = "%Y-%m-%d %H:%M:%S.%f", and I would like a new column called openDay holding the day number within the year (1-365). I tried the following:
from datetime import datetime
data['dateOpen'] = [datetime.strptime(dt, date_format) for dt in data['dateOpen']]
data['openDay'] = [dt.day for dt in data['dateOpen']]
However, this gives the day of the month. For example, if the date is 2013-02-21 10:12:14.3, the code above returns 21, whereas I want it to return 52, which is the 31 days of January plus the 21 days of February.
Is there a simple way to do this in Pandas?
On recent versions of pandas you can use the datetime properties via the .dt accessor:
>>> ts = pd.Series(pd.to_datetime(['2013-02-21 10:12:14.3']))
>>> ts
0 2013-02-21 10:12:14.300000
dtype: datetime64[ns]
>>> ts.dt.dayofyear
0 52
dtype: int64
On older versions, you may be able to convert to a DatetimeIndex and then use .dayofyear property:
>>> pd.Index(ts).dayofyear # may work
array([52], dtype=int32)
Not sure if there's a pandas builtin, but in plain Python you can get the "Julian" day (day of year), e.g.:
data['openDay'] = [int(format(dt, '%j')) for dt in data['dateOpen']]
Example:
>>> from datetime import datetime
>>> int(format(datetime(2013,2,21), '%j'))
52
# To find the number of days elapsed in the current year so far
from datetime import date
today = date.today()
print("Today's date:", today)
print(int(format(today, '%j')))
Today's date: 2020-03-26
86
