I have a dataset that looks like this:
import numpy as np
import pandas as pd
raw_data = {'Series_Date':['2017-03-10','2017-04-13','2017-05-14','2017-05-15','2017-06-01']}
df = pd.DataFrame(raw_data,columns=['Series_Date'])
print(df)
I would like to pass in a date parameter as a string as follows:
date = '2017-03-22'
I would now like to know if there are any dates in my DataFrame 'df' for which the month is 3 months after the month in the date parameter.
That is, if the month in the date parameter is March, then it should check whether there are any dates in df from June. If there are, I would like to see those dates; if not, it should just output 'No date found'.
In this example, the output should be '2017-06-01', as it is a date from June and my date parameter is from March.
Could anyone help me get started with this?
Convert your column to Timestamp:
df.Series_Date = pd.to_datetime(df.Series_Date)
date = pd.to_datetime('2017-03-01')
Then filter on the difference in months:
df[
    (df.Series_Date.dt.year - date.year) * 12 +
    df.Series_Date.dt.month - date.month == 3
]
Series_Date
4 2017-06-01
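If you also want the 'No date found' behaviour from the question, a minimal sketch building on the filter above (the output format is an assumption):
matches = df[
    (df.Series_Date.dt.year - date.year) * 12 +
    df.Series_Date.dt.month - date.month == 3
]
if matches.empty:
    print('No date found')
else:
    # prints ['2017-06-01'] for the sample data
    print(matches['Series_Date'].dt.strftime('%Y-%m-%d').tolist())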
I have a time column with the format XXXHHMMSS where XXX is the Day of Year. I also have a year column. I want to merge both these columns into one date time object.
Previously I had split XXX out into a new column, but that was making things more complicated.
I've converted the two columns to strings
points['UTC_TIME'] = points['UTC_TIME'].astype(str)
points['YEAR_'] = points['YEAR_'].astype(str)
Then I have the following line:
points['Time'] = pd.to_datetime(points['YEAR_'] * 1000 + points['UTC_TIME'], format='%Y%j%H%M%S')
I'm getting this ValueError: time data '137084552' does not match format '%Y%j%H%M%S' (match)
Here is a photo of my columns and a link to the data
Works fine for me if you combine both columns as strings, e.g.:
import pandas as pd
df = pd.DataFrame({'YEAR_': [2002, 2002, 2002],
                   'UTC_TIME': [99082552, 135082552, 146221012]})
pd.to_datetime(df['YEAR_'].astype(str) + df['UTC_TIME'].astype(str).str.zfill(9),
               format="%Y%j%H%M%S")
# 0 2002-04-09 08:25:52
# 1 2002-05-15 08:25:52
# 2 2002-05-26 22:10:12
# dtype: datetime64[ns]
Note: since %j expects a zero-padded day of year, you might need to zero-fill; see the first row in the example above.
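Applied back to the columns from the question (assuming points['YEAR_'] and points['UTC_TIME'] hold plain integers), the same pattern would be:
points['Time'] = pd.to_datetime(points['YEAR_'].astype(str)
                                + points['UTC_TIME'].astype(str).str.zfill(9),
                                format='%Y%j%H%M%S')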
What I want to do is get 1 year of data.
I calculate the latest date from the date column as my end date, then use the end date minus 1 year as the start date. After that, I can filter the data between those start and end dates.
I did manage to get the end date, but can't figure out how to get the start date.
Below is the code I have used so far; the '- 1 year' part is what needs to be solved.
Suggestions on how to do the filtering in PySpark are also welcome.
from pyspark.sql.functions import min, max
import datetime
import pyspark.sql.functions as F
from pyspark.sql.functions import date_format, col
#convert string to date type
df = df.withColumn('risk_date', F.to_date(F.col('chosen_risk_prof_date'), 'dd.MM.yyyy'))
#filter only 1 year of data from big data set.
#calculate the start date and end date; latest_date = end date
latest_date = df.select((max("risk_date"))).show()
start_date = latest_date - *1 year*
new_df = df.date > start_date & df.date < end_date
Then, after this, get all the data between the start date and the end date.
You can use relativedelta as below:
from datetime import datetime
from dateutil.relativedelta import relativedelta
print(datetime.now() - relativedelta(years=1))
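Put together with the question's DataFrame, a minimal sketch (it assumes the risk_date column from your snippet, and collects the max back to the driver):
import pyspark.sql.functions as F
from dateutil.relativedelta import relativedelta

# latest date in the column, returned as a Python date
end_date = df.agg(F.max('risk_date')).collect()[0][0]
start_date = end_date - relativedelta(years=1)

# keep only the last year of data
new_df = df.filter((F.col('risk_date') > start_date) & (F.col('risk_date') <= end_date))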
A small snippet from my dataframe
I have separate columns for month and day. I need to parse just the month and day into a pandas datetime type (other datetime types would also help), so that I can plot a time-series line plot.
I tried this piece of code,
df['newdate'] = pd.to_datetime(df[['Days','Month']], format='%d%m')
but it threw an error:
KeyError: "['Days' 'Month'] not in index"
How should I approach this error?
An illustration of my comment: if you cast the columns to string, you can join and strptime them easily, as follows:
import pandas as pd
df = pd.DataFrame({'Month': [1,2,11,12], 'Days': [1,22,3,23]})
pd.to_datetime(df['Month'].astype(str)+' '+df['Days'].astype(str), format='%m %d')
# 0 1900-01-01
# 1 1900-02-22
# 2 1900-11-03
# 3 1900-12-23
# dtype: datetime64[ns]
You could also add a 'Year' column to your df with an arbitrary year number and use the method you originally intended:
df = pd.DataFrame({'Month': [1,2,11,12], 'Days': [1,22,3,23]})
df['Year'] = 2020
pd.to_datetime(df[['Year', 'Month', 'Days']])
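For reference, with the arbitrary year 2020 this assembles to:
# 0   2020-01-01
# 1   2020-02-22
# 2   2020-11-03
# 3   2020-12-23
# dtype: datetime64[ns]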
Can someone shed some light on why I can't locate a row with the .loc operation based on my search criteria, which is in date format?
import yfinance as yf
import matplotlib.pyplot as plt
import pandas as pd
import os
import numpy as np
stockCode = 'AAPL'
data = yf.download(stockCode, '2014-10-20', '2015-01-27')
#dfClose = pd.DataFrame(data.Close.values)
dfOpen = pd.DataFrame(data.Open.values)
dflist = dfOpen.values
dfClose = pd.DataFrame({"open": data.Open.values,
                        "year": data.Close.index.year.values,
                        "month": data.Close.index.month.values,
                        "day": data.Close.index.day.values,
                        "date": data.Close.index.date})
dfClose[0:5]
open year month day date
0 98.320000 2014 10 20 2014-10-20
1 103.019997 2014 10 21 2014-10-21
2 102.839996 2014 10 22 2014-10-22
3 104.080002 2014 10 23 2014-10-23
4 105.180000 2014 10 24 2014-10-24
This returns an empty frame:
dfClose.loc[dfClose['date'] == "2014-10-21"]
open year month day date
Also tried a date range, but no luck:
dfClose.loc['2014-10-21':'2014-10-24']
open year month day date
This seems to work when I use a variable later on. Is this because it's from a NumPy array?
floating_Max = np.amax(dflist)
print ("Max\n", dfClose.loc[dfClose['open'] == floating_Max])
Max
open year month day date
28 119.269997 2014 11 28 2014-11-28
I think the date column is of object dtype in dfClose. Try the following to convert it to datetime64:
dfClose['date'] = pd.to_datetime(dfClose['date'], format='%Y-%m-%d')
dfClose.loc[dfClose['date'] == "2014-10-21"] should work now.
If you want to apply .loc with a date range, the date column has to be set as the index after converting it to datetime64. To do that, try the following:
dfClose = dfClose.set_index('date')
The command dfClose.loc['2014-10-21':'2014-10-24'] should work then.
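Putting both suggestions together, a minimal sketch (column names as in the question):
dfClose['date'] = pd.to_datetime(dfClose['date'], format='%Y-%m-%d')
dfClose.loc[dfClose['date'] == '2014-10-21']                # single-date lookup
dfClose.set_index('date').loc['2014-10-21':'2014-10-24']    # range lookup via the index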
+1 to #erip's thought. Here is what I do for filtering when the type is a pandas datetime (my guess as to what type it is).
date_to_check = pd.Timestamp(2019, 3, 20)
filter_mask = df['Date'] > date_to_check
df_filtered=df[filter_mask]
If for whatever reason it is not already a datetime object you can cast it as such:
df['Date'] = pd.to_datetime(df['Date'])
df.loc does accept a boolean Series, so dfClose.loc[dfClose['date'] == "2014-10-21"] is not failing because of .loc itself; the mask it builds is simply all False, because the date column holds datetime.date objects and those never compare equal to the string "2014-10-21".
What you could do is keep the boolean indexing, dfClose[dfClose['date'] == ...], to get the rows where date matches, but watch out for the type of the values: comparing a str to a date won't return what you expect.
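So, without converting the column, a hedged one-liner is to compare against the type the column actually holds (datetime.date here):
dfClose[dfClose['date'] == pd.Timestamp('2014-10-21').date()]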
I have a CSV with a date column whose dates are listed as MM/DD/YY, but I want to change the years from 00, 02, 03 to 1900, 1902, 1903 so that they are instead listed as MM/DD/YYYY.
This is what works for me:
df2['Date'] = df2['Date'].str.replace(r'00', '1900')
but I'd have to do this for every year up until 68 (i.e., repeat this 68 times). I'm not sure how to write a loop that does the code above for every year in that range. I tried this:
ogyear=00
newyear=1900
while ogyear <= 68:
    df2['date']=df2['Date'].str.replace(r'ogyear','newyear')
    ogyear += 1
    newyear += 1
but this returns an empty data set. Is there another way to do this?
I can't use datetime because it assumes that 02 refers to 2002 instead of 1902, and when I try to edit that as a date I get an error from Python saying that dates are immutable and must be changed in the original data set. For this reason I need to keep the dates as strings. I also attached the csv here in case that's helpful.
I would do it like this:
# create a data frame
d = pd.DataFrame({'date': ['20/01/00','20/01/20','20/01/50']})
# create year column
d['year'] = d['date'].str.split('/').str[2].astype(int) + 1900
# add new year into old date by replacing old year
d['new_data'] = d['date'].str.replace('[0-9]*.$', '', regex=True) + d['year'].astype(str)
date year new_data
0 20/01/00 1900 20/01/1900
1 20/01/20 1920 20/01/1920
2 20/01/50 1950 20/01/1950
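The same idea collapses to one line on the question's own frame (assuming the column is named 'Date' and holds MM/DD/YY strings, as described):
df2['Date'] = df2['Date'].str[:-2] + (df2['Date'].str[-2:].astype(int) + 1900).astype(str)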
I'd do it the following way:
from datetime import datetime
# create a data frame with dates in format month/day/shortened year
d = pd.DataFrame({'dates': ['2/01/10','5/01/20','6/01/30']})
#loop through the dates in the dates column and add them
#to list in desired form using datetime library,
#then substitute the dataframe dates column with the new ordered list
new_dates = []
for date in list(d['dates']):
    dat = datetime.date(datetime.strptime(date, '%m/%d/%y'))
    dat = dat.strftime("%m/%d/%Y")
    new_dates.append(dat)
new_dates
d['dates'] = pd.Series(new_dates)
d
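For reference, %y uses the usual two-digit-year pivot (00-68 map to 2000-2068), so with these sample inputs d ends up as:
        dates
0  02/01/2010
1  05/01/2020
2  06/01/2030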