Pandas Frequency Conversion - python

I'm trying to find out whether it is possible to use data.asfreq(MonthEnd()) on data that was not created with date_range.
Here is what I'm trying to achieve. I load a CSV with the following code:
import numpy as np
import pandas as pd
data = pd.read_csv("https://www.quandl.com/api/v3/datasets/FRED/GDPC1.csv?api_key=", parse_dates=True)
data.columns = ["period", "integ"]
data['period'] = pd.to_datetime(data['period'], infer_datetime_format=True)
Then I want to assign frequency to my 'period' column by doing this:
tdelta = data.period[1] - data.period[0]
data.period.freq = tdelta
And some print commands:
print(data)
print(data.period.freq)
print(data.dtypes)
Returns:
..........
270 1948-07-01 2033.2
271 1948-04-01 2021.9
272 1948-01-01 1989.5
273 1947-10-01 1960.7
274 1947-07-01 1930.3
275 1947-04-01 1932.3
276 1947-01-01 1934.5
[277 rows x 2 columns]
-92 days +00:00:00
period datetime64[ns]
integ float64
dtype: object
I can also parse the original 'DATE' column by making it 'index':
data = pd.read_csv("https://www.quandl.com/api/v3/datasets/FRED/GDPC1.csv?api_key=", parse_dates=True, index_col='DATE')
What I want to do is just convert the quarterly data into monthly rows. For example:
270 1948-07-01 2033.2
271 1948-06-01 NaN
272 1948-05-01 NaN
273 1948-04-01 2021.9
274 1948-03-01 NaN
275 1948-02-01 NaN
276 1948-01-01 1989.5
......and so on.......
I'm trying to do this with ts.asfreq(MonthBegin()) and ts.asfreq(MonthBegin(), method='pad'), so far unsuccessfully. I get the following error:
NameError: name 'MonthBegin' is not defined
My question is: can I use asfreq if I don't use date_range to create the frame, by somehow passing my date column to the function? If that is not the solution, is there any other easy way to convert quarterly to monthly frequency?

Use a TimeGrouper (pd.TimeGrouper was removed in pandas 1.0; pd.Grouper(freq=...) is its replacement):
import pandas as pd
periods = ['1948-07-01', '1948-04-01', '1948-01-01', '1947-10-01',
           '1947-07-01', '1947-04-01', '1947-01-01']
integs = [2033.2, 2021.9, 1989.5, 1960.7, 1930.3, 1932.3, 1934.5]
df = pd.DataFrame({'period': pd.to_datetime(periods), 'integ': integs})
df = df.set_index('period')
# Group by month start ('MS'), then show the newest month first
df = df.groupby(pd.Grouper(freq='MS')).sum().sort_index(ascending=False)
EDIT: You can also use resample instead of a TimeGrouper:
df.resample('MS').sum().sort_index(ascending=False)
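The NameError in the question suggests that MonthBegin was never imported. A minimal sketch of the asfreq route, assuming the dates are first moved into a sorted DatetimeIndex (the three-row series is a shortened stand-in for the GDPC1 data, and the names ts, monthly and padded are illustrative):

```python
import pandas as pd
from pandas.tseries.offsets import MonthBegin  # the import missing in the question

periods = ['1948-07-01', '1948-04-01', '1948-01-01']
integs = [2033.2, 2021.9, 1989.5]

# asfreq works on the index, so the dates become a sorted DatetimeIndex
ts = pd.Series(integs, index=pd.to_datetime(periods), name='integ').sort_index()

# Insert the missing month starts; the new rows are NaN
monthly = ts.asfreq(MonthBegin())

# Same conversion, but forward-fill each quarter's value into the new rows
padded = ts.asfreq(MonthBegin(), method='pad')

# Newest month first, as in the question's desired layout
print(monthly.sort_index(ascending=False))
```

Unlike sum(), which reports 0 for months without observations in recent pandas versions, asfreq leaves the added months as NaN unless a fill method is given.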

Related

How to add day of the year column w.r.t Date in pandas

I have a date column and I want to add the day of year(1-365), day of half year(1-182), day of quarter(1-92) and day of half quarter(1-46) columns to my dataframe w.r.t date.
In R we can use
df$half_year = df$yearday %% 182
Can anyone help me with this?
A pandas.Timestamp has a dayofyear attribute; for a datetime column it is available through the .dt accessor, e.g. some_df.Date.dt.dayofyear:
import pandas as pd
# create dates
some_dates = pd.date_range(
start=pd.to_datetime('07-02-1990', format='%m-%d-%Y'),
end=pd.to_datetime('07-04-1990', format='%m-%d-%Y'),
)
# store in pandas df
some_df = pd.DataFrame()
some_df['Date'] = some_dates
# get day of years
some_df['day_of_year'] = some_df.Date.dt.dayofyear
some_df['day_of_half_year'] = some_df.Date.dt.dayofyear % 182
print(some_df)
>>> Date day_of_year day_of_half_year
0 1990-07-02 183 1
1 1990-07-03 184 2
2 1990-07-04 185 3
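Note that dayofyear % 182 yields 0 on days 182 and 364, so it does not stay in the 1-182 range the question asks for. A possible sketch of 1-based variants, assuming fixed-length 182-day blocks are acceptable for the half year and that the day of quarter should count from the start of the calendar quarter (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['1990-07-01', '1990-07-02', '1990-12-31'])})
doy = df['Date'].dt.dayofyear

# Shift to 0-based before the modulo, then back, so the result runs 1..182
df['day_of_half_year'] = (doy - 1) % 182 + 1

# Days elapsed since the first day of the calendar quarter, counted from 1
quarter_start = df['Date'].dt.to_period('Q').dt.start_time
df['day_of_quarter'] = (df['Date'] - quarter_start).dt.days + 1

print(df)
```

The same shift-and-modulo pattern with 46 instead of 182 would give a day-of-half-quarter column on fixed-length blocks.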

Why do I get the same date in every row when trying to fill dates into a dataset?

I have a dataset with 800 rows and I want to create a new column with dates, where the date in each row increases by one day.
import datetime
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(800):
    df['Date'] = date + datetime.timedelta(days=x)
In every row the date is equal to '2014-01-12'. As I understand it, the column is filled as if x were always equal to 799.
Each time through the loop you are updating the ENTIRE Date column. You see the results of the 800th update at the end.
You could use a date range:
dr = pd.date_range('5/11/2011', periods=800, freq='D')
df = pd.DataFrame({'Date': dr})
print(df)
Date
0 2011-05-11
1 2011-05-12
2 2011-05-13
3 2011-05-14
4 2011-05-15
.. ...
795 2013-07-14
796 2013-07-15
797 2013-07-16
798 2013-07-17
799 2013-07-18
Or:
df['Date'] = dr
pandas is a nice tool that can repeat calculations without an explicit for-loop.
When you use df['Date'] = ... you assign the same value to all cells in the column.
You have to use df.loc[x, 'Date'] = ... to assign to a single cell.
Minimal working example (with only 10 rows):
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(10):
    df.loc[x,'Date'] = date + datetime.timedelta(days=x)
print(df)
But you could also use pd.date_range() for this.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
df['Date'] = pd.date_range(date, periods=10)
print(df)
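Another loop-free option, sketched here under the assumption that the start date is day-first as in the question (5 November 2011), is to add a range of day offsets to the start timestamp. The 800-row frame below is a hypothetical stand-in for the asker's dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical 800-row dataset; only the row count comes from the question
df = pd.DataFrame({'value': np.zeros(800)})

# Parse the start date day-first, matching the question's '%d/%m/%Y' format
start = pd.to_datetime('5/11/2011', format='%d/%m/%Y')

# One offset per row: 0 days, 1 day, ..., 799 days
df['Date'] = start + pd.to_timedelta(np.arange(len(df)), unit='D')

print(df['Date'].iloc[[0, -1]])
```

The last row lands on 2014-01-12, the date the original loop wrote into every row.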

Problem with group by max period in dataframe pandas

I'm still a novice with Python and I'm having problems grouping some data to show the record that has the highest (maximum) date. The dataframe is as follows:
...
I am trying the following:
df_2 = df.max(axis = 0)
df_2 = df.periodo.max()
df_2 = df.loc[df.groupby('periodo').periodo.idxmax()]
And it gives me back:
Timestamp('2020-06-01 00:00:00')
periodo 2020-06-01 00:00:00
valor 3.49136
Although the value for 'periodo' is correct, the one for 'valor' is not: I need the complete corresponding record ('periodo' and 'valor'), not the maximum of each column. I have tried other ways but I can't get what I want ...
What do I need to do?
Thank you in advance, I will be attentive to your answers!
Regards!
# import packages we need, seed random number generator
import pandas as pd
import datetime
import random
random.seed(1)
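Create example dataframe
# start_date and day_count are not defined in the original answer;
# the values below are assumed from the output shown underneath
start_date = datetime.datetime(2020, 1, 1)
day_count = 10
dates = [start_date + datetime.timedelta(n) for n in range(day_count)]
values = [random.randint(1,1000) for _ in dates]
df = pd.DataFrame(zip(dates,values),columns=['dates','values'])
i.e. df will be: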
dates values
0 2020-01-01 389
1 2020-01-02 808
2 2020-01-03 215
3 2020-01-04 97
4 2020-01-05 500
5 2020-01-06 30
6 2020-01-07 915
7 2020-01-08 856
8 2020-01-09 400
9 2020-01-10 444
Select rows with highest entry in each column
You can do:
df[df['dates'] == df['dates'].max()]
(Or, if you want to use idxmax: df.loc[[df['dates'].idxmax()]])
Returning:
dates values
9 2020-01-10 444
i.e. this is the row with the latest date.
Similarly:
df[df['values'] == df['values'].max()]
(Or, using idxmax again: df.loc[[df['values'].idxmax()]], as in Scott Boston's answer.)
Returning:
dates values
6 2020-01-07 915
i.e. this is the row with the highest value in the values column.
Reference.
I think you need something like:
df.loc[[df['valor'].idxmax()]]
Where you use idxmax on the 'valor' column. Then use that index to select that row.
MVCE:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'periodo':pd.date_range('2018-07-01', periods = 600, freq='d'),
'valor':np.random.random(600)+3})
df.loc[[df['valor'].idxmax()]]
Output:
periodo valor
474 2019-10-18 3.998918
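Since the question asks for the complete record at the maximum date, the same idxmax pattern can be pointed at 'periodo' instead of 'valor'. A small sketch follows; the 'grupo' column and the sample values are hypothetical and only illustrate the per-group case. Grouping by 'periodo' itself, as in the question's attempt, puts each date in its own group, so the max within a group cannot narrow anything down.

```python
import pandas as pd

df = pd.DataFrame({
    'grupo': ['a', 'a', 'b', 'b'],  # hypothetical grouping column
    'periodo': pd.to_datetime(['2020-05-01', '2020-06-01',
                               '2020-04-01', '2020-03-01']),
    'valor': [1.5, 2.5, 3.5, 4.5],
})

# Complete row holding the overall latest periodo
latest = df.loc[[df['periodo'].idxmax()]]

# Complete row holding the latest periodo within each group
latest_per_group = df.loc[df.groupby('grupo')['periodo'].idxmax()]

print(latest)
print(latest_per_group)
```

In both cases 'valor' is whatever sits on the selected row, not the column-wide maximum.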

Trouble in plotting dates in PyPlot

I am trying to plot a simple time-series. Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
%matplotlib inline
df = pd.read_csv("sample.csv", parse_dates=['t'])
df[['sq', 'iq', 'rq']] = df[['sq', 'iq', 'rq']].apply(pd.to_numeric, errors='coerce')
df = df.fillna(0)
df.set_index('t')
This is part of the output:
df[['t','sq']].plot()
plt.show()
As you can see, the x-axis in the plot above is not the dates I intended it to show. When I change the plotting call as below, I get the following gibberish plot, although the x-axis is now correct.
df[['t','sq']].plot(x = 't')
plt.show()
Any tips on what I am doing wrong? Please comment and let me know if you need more information about the problem. Thanks in advance.
I think your problem is that although you have parsed the t column it is not of type date-time. Try the following:
# Set t to date-time and then to index
df['t'] = pd.to_datetime(df['t'])
df.set_index('t', inplace=True)
Reading your comment and the answer you have added, someone may conclude that this kind of problem can only be solved by specifying a parser in pd.read_csv(). So here is proof that my solution works in principle. Looking at what you have posted as a question, the other problem with your code is the way you have specified the plot command. Once t has become the index, you only need to select columns other than t for the plot command.
import pandas as pd
import matplotlib.pyplot as plt
# Read data from file
df = pd.read_csv('C:\\datetime.csv', parse_dates=['Date'])
# Convert Date to date-time and set as index
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df.plot(marker='D')
plt.xlabel('Date')
plt.ylabel('Number of Visitors')
plt.show()
df
Out[37]:
Date Adults Children Seniors
0 2018-01-05 309 240 296
1 2018-01-06 261 296 308
2 2018-01-07 273 249 338
3 2018-01-08 311 250 244
4 2018-01-08 272 234 307
df
Out[39]:
Adults Children Seniors
Date
2018-01-05 309 240 296
2018-01-06 261 296 308
2018-01-07 273 249 338
2018-01-08 311 250 244
2018-01-08 272 234 307
The issue turned out to be incorrect parsing of dates, as pointed out in an answer above. However, the solution for it was to pass a date_parser to the read_csv method call:
from datetime import datetime as dt
dtm = lambda x: dt.strptime(str(x), "%Y-%m-%d")
df = pd.read_csv("sample.csv", parse_dates=['t'], infer_datetime_format = True, date_parser= dtm)
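For readers on newer pandas, where date_parser is deprecated (from pandas 2.0), one alternative is to parse after loading with an explicit format. This is only a sketch: the inline CSV is a hypothetical stand-in for sample.csv, with just the column names t and sq taken from the question. It also assigns the result of set_index, which returns a new frame rather than modifying df in place.

```python
import io
import pandas as pd

# Hypothetical stand-in for sample.csv; only the column names come from the question
csv_text = "t,sq\n2018-01-05,3\n2018-01-06,5\n2018-01-07,4\n"

df = pd.read_csv(io.StringIO(csv_text))

# Parse with an explicit format so a mismatch raises instead of passing silently
df['t'] = pd.to_datetime(df['t'], format='%Y-%m-%d')

# set_index returns a new frame, so keep the result
df = df.set_index('t')

print(df.index.dtype)
# With a DatetimeIndex in place, df['sq'].plot() uses the dates on the x-axis
```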

Calculate day of year in dataframe from a datetime column to another column in Python

I have a dataframe with a datetime column in it, like 2014-01-01, 2016-06-05, etc. Now I want to add a column in the dataframe calculating the day of year (for that given year).
On this forum I did find some hints for sure, but I'm struggling with the types and dataframe stuff.
So this works fine
from datetime import datetime
day_to_calc = datetime.today()
day_of_year = day_to_calc.timetuple().tm_yday
day_of_year
But my day_to_calc is not today, but df['Date']. However, if I try this
df['DOY'] = df['Date'].timetuple().tm_yday
I get
AttributeError: 'Series' object has no attribute 'timetuple'
Ok, so I guess I need a map function perhaps?
So I'm trying something like ..
df['DOY'] = map (datetime.timetuple().tm_yday,df['Date'])
And surely you guys see how stupid that is ;-) (but I'm still learning Python)
TypeError: descriptor 'timetuple' of 'datetime.datetime' object needs an argument
So that makes sense sort of because I need to pass the date as parameter, sooo .. trying
df['DOY'] = datetime.timetuple(df['Date']).tm_yday
TypeError: descriptor 'timetuple' requires a 'datetime.datetime' object but received a 'Series'
There must be a simple way, but I just can't figure out the syntax :-(
Use the dt.dayofyear accessor:
import pandas as pd
# first convert date string to datetime with a proper format string
df = pd.DataFrame({'Date':pd.to_datetime(['2014-01-01', '2016-06-05'], format='%Y-%m-%d')})
# calculate day of year
df['DOY'] = df['Date'].dt.dayofyear
print(df)
Output:
Date DOY
0 2014-01-01 1
1 2016-06-05 157
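For comparison, the map idea from the question can be made to work by calling timetuple() on each element rather than on the Series or the datetime class. A short sketch, with DOY_map as an illustrative column name:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2014-01-01', '2016-06-05'])})

# Element-wise version of the question's attempt: each element is a Timestamp,
# which supports timetuple() like datetime.datetime does
df['DOY_map'] = df['Date'].map(lambda d: d.timetuple().tm_yday)

# Vectorized equivalent from the answer above
df['DOY'] = df['Date'].dt.dayofyear

print(df)
```

Both columns hold the same values; the .dt accessor avoids the per-row Python call.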
I noticed the above answer does not go into great detail, so I've provided a more explanatory answer below.
Try the following:
import pandas as pd
# Create a pandas datetime range for the year 2022
passed_2022 = pd.date_range('2022-01-01', '2022-12-31')
# Convert the datetime range to a list of strings in the format 'YYYY-MM-DD'
passed_2022_list = [i.strftime('%Y-%m-%d') for i in passed_2022]
# Create a DataFrame
data = pd.DataFrame({'datetime': passed_2022_list})
# Filter the data DataFrame to only include dates in the passed_2022 list
data = data[data['datetime'].isin(passed_2022_list)]
# Count the number of rows in the filtered DataFrame
num_days_passed = len(data)
# Create a new DataFrame with 'datetime' and 'DAYS OF YEAR' columns
result = pd.DataFrame({'datetime': passed_2022_list,
'DAYS OF YEAR': range(1, num_days_passed+1)})
# Print the result of the DataFrame
print(result)
Output:
datetime DAYS OF YEAR
0 2022-01-01 1
1 2022-01-02 2
2 2022-01-03 3
3 2022-01-04 4
4 2022-01-05 5
.. ... ...
360 2022-12-27 361
361 2022-12-28 362
362 2022-12-29 363
363 2022-12-30 364
364 2022-12-31 365
[365 rows x 2 columns]
