Just looking for the best approach, as someone who spends more time in data-analysis land than programming proper (hat tip to you all). This is a pretty straightforward, large ETL project, but I am hand-coding it in Python, which is a first. A fixed-width file is being read successfully into an initial pandas df.
I am trying to add a new column with a static, end-of-month date value (2014-01-31, for example) indicating the "Data Month" for further EDW processing. Ultimately, I am going to use datetime/timedelta functionality to generate this value automatically when I cron the job on the utility server.
My confusion is about which function to use (apply, applymap, etc.), whether I need to reference an index value in the original df in order to apply a completely unrelated value to it, and the most optimized, Pythonic way to accomplish this.
Currently referencing: "Python for Data Analysis" and the pandas docs. Thanks!
EDIT
Here is a small example of some fixed-width data:
5151022314
5113 22204
111 20018
Here is some code for reading it into a PANDAS df:
import pandas as pd
import numpy as np
path = r'C:\Users\Office\Desktop\example data.txt'  # raw string so the backslashes are not treated as escape sequences
widths = [2, 3, 5]
names = ['STATE_CD', 'CNTY_CD', 'ZIP_CD']
df = pd.read_fwf(path, names=names, widths=widths, header=None)  # the example file has no header row
This should return something like this as a df for the example data above:
STATE_CD,CNTY_CD,ZIP_CD
51,510,22314
51,1 ,22204
11,3 ,20018
What I am trying to do is add a column "DATA_MM" like this for all rows:
STATE_CD,CNTY_CD,ZIP_CD, DATA_MM
51,510,22314,2014-01-31
51,1 ,22204,2014-01-31
11,3 ,20018,2014-01-31
Ultimately, I am hoping to utilize something like this to generate the value that is applied automatically when this monthly job initiates:
import datetime
today = datetime.date.today()
first = datetime.date(day=1, month=today.month, year=today.year)
lastMonth = first - datetime.timedelta(days=1)
print(lastMonth.strftime("%Y-%m-%d"))
If you want to fill a column with a new value that doesn't depend on your original DataFrame, you don't need to make reference to the original indices. You can fill the new column by simply assigning the new value to it:
df["DATA_MM"] = date
You can get the last day of the month by using datetime and calendar:
import datetime
import calendar
today = datetime.date.today()
y = today.year
m = today.month
eom = datetime.date(y, m, calendar.monthrange(y, m)[1])
df["DATA_MM"] = eom
monthrange returns a tuple of (weekday of the first day of the month, number of days in the month), so [1] gives the last day of the month. You can also use @Alexander's method for finding the date of the last day, and assign it directly to the column instead of applying it row by row.
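Tying this back to the cron scenario in the question, a minimal sketch that computes last month's end-of-month date and broadcasts it to every row (it assumes df is the frame built with read_fwf above and that the job runs at the start of the new month):
import datetime
# first day of the current month, then step back one day to land on
# the last day of the previous month (the "Data Month" for a monthly job)
today = datetime.date.today()
last_month_end = today.replace(day=1) - datetime.timedelta(days=1)
# assigning a scalar fills the column for every row
df["DATA_MM"] = last_month_end.strftime("%Y-%m-%d")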
Let's say your DataFrame is named df and it has a date column of Timestamps for which you would like to get end-of-month (EOM) values:
df['EOM date'] = df.date.apply(lambda x: x.to_period('M').to_timestamp('M'))
This coerces the values to pandas Period objects and then back to end-of-month Timestamps, so it may not be the most efficient method.
Here is an alternative implementation with some performance stats:
dates = pd.date_range('2000-1-1', '2015-1-1')
df = pd.DataFrame(dates, columns=['date'])
%%timeit
df.date.apply(lambda x: x.to_period('M').to_timestamp('M'))
10 loops, best of 3: 161 ms per loop
%%timeit
df.date.apply(lambda x: x + pd.datetools.MonthEnd())
1 loops, best of 3: 177 ms per loop
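As a side note, pd.datetools has since been removed from pandas; the equivalent offset now lives under pd.offsets and can be added to the whole column at once instead of going through apply. A minimal sketch, assuming the same dates DataFrame as above:
import pandas as pd
dates = pd.date_range('2000-1-1', '2015-1-1')
df = pd.DataFrame({'date': dates})
# MonthEnd(0) rolls each date forward to its own month end;
# dates already at month end are left where they are
df['EOM date'] = df['date'] + pd.offsets.MonthEnd(0)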
Just getting a datetime.date (per the request below) for the end-of-month date from the current date can be achieved as follows:
pd.Timestamp(dt.datetime.now()).to_period('M').to_timestamp('M').date()
Related
I have a data set that contains dates as an index, and each column is the name of an item with a count as the value. I'm trying to figure out how to filter each column for stretches of more than 3 consecutive days where the count is zero. I was thinking of using a for loop; any help is appreciated. I'm using Python for this project.
I'm fairly new to Python. So far I have tried for loops, but did not get them to work in any way.
for i in a.index:
    if a.loc[i, 'name'] == 3 == df.loc[i + 1, 'name'] == df.loc[i + 2, 'name']:
        print(a.loc[i, "name"])
This fails with: Cannot add integral value to Timestamp without freq.
It would be better if you included a sample dataframe and desired output in your question; please do so next time. As it is, I have to guess what your data looks like and may not be answering your question. I assume the values are integers. Does your dataframe have a row for every day? I will assume it might not, so I will make sure every day in the last delta days has a row. I created a sample dataframe like this:
import pandas as pd
import numpy as np
import datetime
# Here I am just creating random data from your description
delta = 365
start_date = datetime.datetime.now() - datetime.timedelta(days=delta)
end_date = datetime.datetime.now()
datetimes = [end_date - diff for diff in [datetime.timedelta(days=i) for i in range(delta,0,-1)]]
# This is the list of dates we will have in our final dataframe (includes all days)
dates = pd.Series([date.strftime('%Y-%m-%d') for date in datetimes], name='Date', dtype='datetime64[ns]')
# random integer dataframe
df = pd.DataFrame(np.random.randint(0, 5, size=(delta,4)), columns=['item' + str(i) for i in range(4)])
df = pd.concat([df, dates], axis=1).set_index('Date')
# Create a missing day
df = df.drop(df.loc['2019-08-01'].name)
# Reindex so that index has all consecutive days
df = df.reindex(index=dates)
Now that we have a sample dataframe, the rest will be straightforward. I am going to check if a value in the dataframe is equal to 0 and then do a rolling sum with the window of 4 (>3). This way I can avoid for loops. The resulting dataframe has all the rows where at least one of the items had a value of 0 for 4 consecutive rows. If there is a 0 for more than window consecutive rows, it will show as two rows where the dates are just one day apart. I hope that makes sense.
# custom function, as I want "np.nan" returned if a value does not equal "test_value"
def equals(df_value, test_value=0):
    return 1 if df_value == test_value else np.nan

# apply the function to every value in the dataframe, then for each row
# sum the current row and the three rows before it (a rolling window of 4, i.e. >3)
df = df.applymap(equals).rolling(window=4).sum()

# if there was an np.nan anywhere in the window, the sum is np.nan, so it can be dropped;
# keep the rows where there is at least 1 value
df = df.dropna(thresh=1)

# drop all columns that don't have any values
df = df.dropna(thresh=1, axis=1)
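As a side note, the same check can be written with a boolean mask instead of the NaN trick, and without applymap (which newer pandas versions deprecate in favor of DataFrame.map). A minimal sketch, assuming df is the original sample frame from above, before it is overwritten:
# True where the current row and the three rows before it are all zero
zero_run = df.eq(0).rolling(window=4).sum().eq(4)
# rows where at least one item column just completed 4 consecutive zero days
hits = df[zero_run.any(axis=1)]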
I want to compute the week of the month for a specified date. For this, I currently use the user-defined function below.
Input data frame:
Output data frame:
Here is what I have tried:
from math import ceil

def week_of_month(dt):
    """
    Returns the week of the month for the specified date.
    """
    first_day = dt.replace(day=1)
    dom = dt.day
    adjusted_dom = dom + first_day.weekday()
    return int(ceil(adjusted_dom / 7.0))
After this,
import pandas as pd
import datetime

df = pd.read_csv("input_dataframe.csv")
df.date = pd.to_datetime(df.date)
df['year_of_date'] = df.date.dt.year
df['month_of_date'] = df.date.dt.month
df['day_of_date'] = df.date.dt.day

wom = pd.Series()

# worker function for creating the week-of-month series
def convert_date(t):
    global wom
    wom = wom.append(pd.Series(week_of_month(datetime.datetime(t[0], t[1], t[2]))), ignore_index=True)

# calling the worker function for each row of the dataframe
_ = df[['year_of_date', 'month_of_date', 'day_of_date']].apply(convert_date, axis=1)

# adding the new computed column to the dataframe
df['week_of_month'] = wom
# here this updated dataframe should look like the Output data frame above
What this does is compute the week of the month for each row of the data frame using the given function. It gets slower as the data frame grows, and I currently have more than 10M rows.
I am looking for a faster way of doing this. What changes can I make to this code to vectorize the operation across all rows?
Thanks in advance.
Edit: What worked for me after reading answers is below code,
first_day_of_month = pd.to_datetime(df.date.values.astype('datetime64[M]'))
df['week_of_month'] = np.ceil((df.date.dt.day + first_day_of_month.weekday) / 7.0).astype(int)
The week_of_month method can be vectorized. It could be beneficial to skip the conversion to datetime objects and use pandas-only methods instead:
first_day_of_month = df.date.dt.to_period("M").dt.to_timestamp()
df["week_of_month"] = np.ceil((df.date.dt.day + first_day_of_month.dt.weekday) / 7.0).astype(int)
Right off the bat, without even going into your code or mentioning X/Y problems, etc.: try working with the unique dates only. I'm sure that among the 10M rows more than one date is a duplicate.
Steps (a rough sketch follows the list):
create a 2nd df that contains only the columns you need and no duplicates (drop_duplicates)
run your function on the small dataframe
merge the large and small dfs
(optional) drop the small one
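A minimal sketch of those steps, assuming the date column is named date (as in the code above) and reusing the week_of_month function defined earlier:
# small frame: one row per distinct date
unique_dates = df[['date']].drop_duplicates()
# run the expensive function only on the unique dates
unique_dates['week_of_month'] = unique_dates['date'].apply(week_of_month)
# merge the result back onto the full 10M-row frame
df = df.merge(unique_dates, on='date', how='left')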
I am working with some financial data that is organized as a df with a MultiIndex containing the ticker and the date, and a column that contains the return. I am wondering whether one should convert the index to a PeriodIndex instead of a DatetimeIndex, since returns are really over a period rather than at an instant in time. Besides the philosophical argument, what practical functionality does PeriodIndex provide that may be useful in this particular use case vs DatetimeIndex?
There are some attributes available on a DatetimeIndex (such as is_month_start and is_quarter_end) that are not available on a PeriodIndex. I use a PeriodIndex when it is not possible to get the format I need with a DatetimeIndex. For example, if I need a monthly frequency in the format yyyy-mm, I use a PeriodIndex.
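Before the fuller example below, a minimal sketch of those two points:
import pandas as pd
dti = pd.date_range("2020-01-01", periods=3, freq="MS")
print(dti.is_month_start)   # available on a DatetimeIndex: [True, True, True]
print(dti.to_period("M"))   # PeriodIndex displayed as yyyy-mm: ['2020-01', '2020-02', '2020-03']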
Example:
Assume that df has an index as
df.index
DatetimeIndex([...,
               '2020-02-26 13:50:00', '2020-02-27 14:20:00',
               '2020-02-28 11:10:00', '2020-02-29 13:50:00'],
              dtype='datetime64[ns]', name='peak_time', length=1025, freq=None)
The minimum monthly data can be obtained via the following code
dfg = df.groupby([df.index.year, df.index.month]).min()
whose index is a MultiIndex
dfg.index
MultiIndex([(2017, 1),
...
(2020, 1),
(2020, 2)],
names=['peak_time', 'peak_time'])
Now I convert it to a PeriodIndex:
dfg["date"] = pd.PeriodIndex (dfg.index.map(lambda x: "{0}{1:02d}".format(*x)),freq="M")
For me, the advantage is that a PeriodIndex is automatically displayed as the corresponding month, quarter, or year when downsampling.
import pandas as pd
# https://github.com/jiahe224/bug_report/blob/main/resample_test.csv
temp = pd.read_csv('resample_test.csv',dtype={'stockcode':str, 'A股代码':str})
temp['date'] = pd.to_datetime(temp['date'])
temp = temp.set_index(['date'])
result = temp['北向占自由流通比'].resample('Q',closed='left').first()
result
result = temp['北向占自由流通比'].resample('Q',closed='left').first().to_period()
result
Off topic: there is a problem with resample that has not been fixed yet; see the bug report at https://github.com/pandas-dev/pandas/issues/45869
Behavior on partial periods:
date_range returns an empty index, while period_range returns an index of length 1, when the start and end you specify do not cover a whole period.
(Also, the timezone information is lost for monthly periods.)
date_range:
dates = pd.date_range("2022-12-01 0:00:00+00:00", "2022-12-05 14:26:00+00:00", inclusive="both", freq="1M")
dates
DatetimeIndex([], dtype='datetime64[ns, UTC]', freq='M')
period_range:
periods = pd.period_range("2022-12-01 0:00:00+00:00", "2022-12-05 14:26:00+00:00", freq="1M")
periods
PeriodIndex(['2022-12'], dtype='period[M]')
This question has two parts:
1) Is there a better way to do this?
2) If NO to #1, how can I fix my date issue?
I have a dataframe as follows
GROUP DATE VALUE DELTA
A 12/20/2015 2.5 ??
A 11/30/2015 25
A 1/31/2016 8.3
B etc etc
B etc etc
C etc etc
C etc etc
This is a representation; there are close to 100 rows for each group (each row representing a unique date).
For each letter in GROUP, I want to find the change in value between successive dates. So for example for GROUP A I want the change between 11/30/2015 and 12/20/2015, which is -22.5. Currently I am doing the following:
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
df.sort_values('DATE', ascending=True)

df_out = []
for GROUP in df.GROUP.unique():
    x = df[df.GROUP == GROUP]
    x['VALUESHIFT'] = x['VALUE'].shift(+1)
    x['DELTA'] = x['VALUE'].sub(x['VALUESHIFT'])
    df_out.append(x)

df_out = pd.concat(df_out)
The challenge I am running into is that the dates are not sorted correctly, so when the shift takes place and I calculate the delta, it is not really the delta between successive dates.
Is this the right approach? If so, how can I fix my date issue? I have reviewed/tried the following to no avail:
Applying datetime format in pandas for sorting
how to make a pandas dataframe column into a datetime object showing just the date to correctly sort
doing calculations in pandas dataframe based on trailing row
Pandas - Split dataframe into multiple dataframes based on dates?
Answering my own question. This works:
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)

df_out = []
for ID in df.GROUP.unique():
    x = df[df.GROUP == ID]
    x.sort_values('DATE', ascending=True, inplace=True)
    x['VALUESHIFT'] = x['VALUE'].shift(+1)
    x['DELTA'] = x['VALUE'].sub(x['VALUESHIFT'])
    df_out.append(x)

df_out = pd.concat(df_out)
1) Added inplace=True to sort_values.
2) Moved the sort inside the for loop.
3) Changed my loop variable from GROUP to ID, since GROUP is also the name of a column, which I imagine is considered sloppy?
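For what it's worth, the same result can be sketched without the explicit loop by sorting once and letting groupby handle the per-group differences (same column names as above; a sketch, not tested against the real data):
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
df = df.sort_values(['GROUP', 'DATE'])
# difference from the previous date's value within each group
df['DELTA'] = df.groupby('GROUP')['VALUE'].diff()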
How do I create a datetime index "foo" to use with raw data series?
(An example would be "as of" every 15 seconds for 'foo' and every 30 seconds for 'foo2'.) If raw series can be inserted into a 'base' dataframe, I would like to use 'foo' to recast the dataframe.
If I wanted to combine df "foo" and df "foo2", what would the memory hit be?
Would it be better to fill the foo index with the raw data series?
EDIT:
after importing pandas, datetime.timedelta stops working
It's very hard for me to understand what you're asking; an illustration of exactly what you're looking for, with example data, would help make things more clear.
I think what you should do is:
rng = DateRange(start, end, offset=datetools.Second(15))
to create the date range. To put data in a DataFrame indexed by that, you should add the columns and reindex them to the date range above using method='ffill':
df = DataFrame(index=rng)
df[colname] = series.reindex(df.index, method='ffill')
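For reference, with modern pandas the same idea looks roughly like this (a sketch only; DateRange and datetools are long gone, and start, end, and series are assumed to already exist):
import pandas as pd
rng = pd.date_range(start, end, freq="15s")             # "as of" every 15 seconds
df = pd.DataFrame(index=rng)
df["colname"] = series.reindex(df.index, method="ffill")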
Regarding datetime.timedelta: datetime.datetime is part of the pandas namespace, so if you did from pandas import *, then any import datetime you had done before that would be masked by the datetime.datetime reference inside the pandas namespace.
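A small sketch of that gotcha, assuming a pandas version of that era whose top-level namespace exposed a datetime name:
import datetime
print(datetime.timedelta(days=1))   # works: "datetime" is the stdlib module here

from pandas import *                # rebinds "datetime" to the datetime.datetime class pulled in by pandas
# datetime.timedelta(days=1)        # would now raise AttributeError; prefer "import pandas as pd"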
Since Wes's answer, I think pandas.DateRange is no longer present in pandas; I'm on pandas version 0.22.0.
I used pandas.DatetimeIndex instead, e.g.:
import datetime
import pandas as pd
start = datetime.datetime.now()
times = pd.DatetimeIndex(freq='2s', start=start, periods=10)
or alternatively
start = datetime.datetime.now()
end = start + datetime.timedelta(hours=1)
times = pd.DatetimeIndex(freq='2s', start=start, end=end)
As of version 0.24, creating a DatetimeIndex from start, periods, and end has been deprecated in favor of date_range().
Using date_range() is similar to using DatetimeIndex():
start = datetime.datetime.now()
end = start + datetime.timedelta(hours=1)
times = pd.date_range(freq='2s', start=start, end=end)
times is a DatetimeIndex with 1801 elements at an interval of 2 seconds