I have a time series of daily data from 2000 to 2015. What I want is another single time series which only contains, from each year, the data between April 15 and June 15 (because that is the period relevant for my analysis).
I have already written code to do this myself, which is given below:
import pandas as pd
df = pd.read_table(myfilename, delimiter=",", parse_dates=['Date'], na_values=-99)
dff = df[df['Date'].apply(lambda x: x.month>=4 and x.month<=6)]
dff = dff[dff['Date'].apply(lambda x: x.day>=15 if x.month==4 else True)]
dff = dff[dff['Date'].apply(lambda x: x.day<=15 if x.month==6 else True)]
I think this code is quite inefficient, as it has to filter the dataframe three times to get the desired subset.
I would like to know the following two things:
Is there a built-in pandas function to achieve this?
If not, is there a more efficient and better way to achieve this?
Let the data frame look like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Date': pd.date_range('2000-01-01', periods=365*10, freq='D'),
                   'Value': np.random.random(365*10)})
Create a series of dates with the year set to the same value:
from datetime import datetime
x = df.Date.apply(lambda d: datetime(2000, d.month, d.day))
Then filter using this series to select from the dataframe:
df[(x >= datetime(2000, 4, 15)) & (x <= datetime(2000, 6, 15))]
Try this:
index = pd.date_range("2000/01/01", "2016/01/01")
s = index.to_series()
# Encode each date as the integer MMDD (e.g. April 15 -> 415) and keep 415..615
s[(s.dt.month * 100 + s.dt.day).between(415, 615)]
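Applied to the question's dataframe, the same trick becomes a single boolean mask (a sketch, assuming the Date column has already been parsed to datetimes, as in the question's read call):
mmdd = df['Date'].dt.month * 100 + df['Date'].dt.day
dff = df[mmdd.between(415, 615)]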
I am trying to compute aggregation metrics with pandas on a dataset where each row has a start and an end date delimiting an interval of months. I need to do this efficiently because my dataset can have millions of rows.
My dataset is like this:
import pandas as pd
df = pd.DataFrame([["2020-01-01", "2020-05-01", 200],
                   ["2020-02-01", "2020-03-01", 100],
                   ["2020-03-01", "2020-04-01", 350],
                   ["2020-02-01", "2020-05-01", 500]],
                  columns=["start", "end", "value"])
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
And I want to get, for each month, aggregates (mean, sum, std) over the values of all rows whose interval covers that month.
I've tried two approaches. The first builds a month range from each row's start and end dates, explodes it, then groups by month:
df["months"] = df.apply(lambda x: pd.date_range(x["start"], x["end"], freq="MS"), axis=1)
df_explode = df.explode("months")
df_explode.groupby("months")["value"].agg(["mean", "sum", "std"])
The other iterates month by month, selects the rows whose interval contains that month, then aggregates them:
months = pd.date_range(df.start.min(), df.end.max(), freq="MS")
rows = []
for m in months:
    rows.append(df[(df.start <= m) & (m <= df.end)]["value"].agg(["mean", "sum", "std"]))
pd.DataFrame(rows, index=months)
The first approach is faster with smaller datasets and the second is best with bigger datasets, but I'd like to know if there is a better and faster approach.
Thank you very much
This is similar to your second approach, but vectorized. It assumes your start and end dates are month starts.
import numpy as np
# Month starts covered by the data; the last one is dropped because end dates are treated as exclusive
month_starts = pd.date_range(df.start.min(), df.end.max(), freq="MS")[:-1].to_numpy()
# contained[i, j] is True when month i falls inside row j's [start, end) interval
contained = np.logical_and(
    np.greater_equal.outer(month_starts, df["start"].to_numpy()),
    np.less.outer(month_starts, df["end"].to_numpy()),
)
# Replace values of non-contained cells with NaN so the aggregations ignore them
masked = np.where(contained, np.broadcast_to(df[["value"]].transpose(), contained.shape), np.nan)
pd.DataFrame(masked, index=month_starts).agg(["mean", "sum", "std"], axis=1)
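Note that contained is a months-by-rows boolean matrix, so its memory use grows with the product of the two sizes; with millions of rows it may be necessary to process month_starts in chunks, which is safe because each month's aggregates are independent of the others.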
I have a DataFrame in Python Pandas like below, with str values:
NR
--------
910517196
921122192
020612567
And I try to calculate the age based on the values in column "NR" using the code below:
import numpy as np
ABT_DATE = pd.Timestamp(year=2021, month=6, day=30)
df['age'] = (ABT_DATE - pd.to_datetime(df.NR.str[:6], format='%y%m%d')) / np.timedelta64(1, 'Y')
df["age"] = df.age.astype("int")
The logic of the above code is: take the first 6 digits from column "NR" and calculate the age based on them, because, for example, the first 6 digits of 910517196 give 1991-05-17.
Nevertheless, when I try to use my code I get the error below:
ValueError: unconverted data remains: 20
My DataFrame has over 400k rows, so it is difficult to check all of them, but I am sure I do not have NaNs, and the years, months, and days are within the correct ranges.
As you can see in the sample below, this code is correct and should work. Why does it work on a small sample but not on my 400k-row DataFrame?
df = pd.DataFrame({"NR" : ["95050611475", "00112575862"]})
df['age'] = (ABT_DATE - pd.to_datetime(df.NR.str[:6], format = '%y%m%d')) / np.timedelta64(1, 'Y')
df["age"] = df.age.astype("int")
df
How can I repair my big DataFrame to be able to use my code in Python Pandas?
You probably have some badly formatted rows. To find them, I suggest you use to_datetime with errors='coerce'. All unconvertible values are set to NaT, so you can use a boolean mask m to find the bad values.
df = pd.DataFrame({"NR" : ["95050611475", "00112575862", "badformat"]})
m = pd.to_datetime(df.NR.str[:6], format='%y%m%d', errors='coerce').isna()
print(df[m])
# Output:
NR
2 badformat
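To then repair the big frame, a minimal sketch (assuming rows that fail to parse should simply end up with a missing age) keeps errors='coerce' for the whole computation:
import numpy as np
import pandas as pd
ABT_DATE = pd.Timestamp(year=2021, month=6, day=30)
# NaT for unparseable rows instead of raising
birth = pd.to_datetime(df.NR.str[:6], format='%y%m%d', errors='coerce')
# Unparseable rows propagate to NaN rather than breaking the cast to int
df['age'] = np.floor((ABT_DATE - birth) / np.timedelta64(1, 'Y'))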
This question has two parts:
1) Is there a better way to do this?
2) If NO to #1, how can I fix my date issue?
I have a dataframe as follows:
GROUP  DATE        VALUE  DELTA
A      12/20/2015  2.5    ??
A      11/30/2015  25
A      1/31/2016   8.3
B      etc         etc
B      etc         etc
C      etc         etc
C      etc         etc
This is a representation, there are close to 100 rows for each group (each row representing a unique date).
For each letter in GROUP, I want to find the change in value between successive dates. So for example for GROUP A I want the change between 11/30/2015 and 12/20/2015, which is -22.5. Currently I am doing the following:
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
df.sort_values('DATE', ascending=True)
df_out = []
for GROUP in df.GROUP.unique():
    x = df[df.GROUP == GROUP]
    x['VALUESHIFT'] = x['VALUE'].shift(+1)
    x['DELTA'] = x['VALUE'].sub(x['VALUESHIFT'])
    df_out.append(x)
df_out = pd.concat(df_out)
The challenge I am running into is that the dates are not sorted correctly, so when the shift takes place and I calculate the delta, it is not really the delta between successive dates.
Is this the right approach? If so, how can I fix my date issue? I have reviewed/tried the following to no avail:
Applying datetime format in pandas for sorting
how to make a pandas dataframe column into a datetime object showing just the date to correctly sort
doing calculations in pandas dataframe based on trailing row
Pandas - Split dataframe into multiple dataframes based on dates?
Answering my own question. This works:
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
df_out = []
for ID in df.GROUP.unique():
    x = df[df.GROUP == ID]
    x.sort_values('DATE', ascending=True, inplace=True)
    x['VALUESHIFT'] = x['VALUE'].shift(+1)
    x['DELTA'] = x['VALUE'].sub(x['VALUESHIFT'])
    df_out.append(x)
df_out = pd.concat(df_out)
1) Added inplace=True to the sort.
2) Moved the sort inside the for loop.
3) Changed my loop variable from GROUP to ID, since GROUP is also the name of a column, which I imagine is considered sloppy?
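For what it's worth, the loop can also be replaced with a groupby, which sidesteps the chained-assignment warnings that sort_values(..., inplace=True) on a slice can trigger; a sketch of the same logic:
df['DATE'] = pd.to_datetime(df['DATE'], infer_datetime_format=True)
df = df.sort_values(['GROUP', 'DATE'], ascending=True)
# diff() is the change from the previous row within each group
df['DELTA'] = df.groupby('GROUP')['VALUE'].diff()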
I have a timeseries dataframe with a PeriodIndex. I would like to use its values as column names in another dataframe and add other columns, which are not Periods. The problem is that when I create the dataframe using only Periods as the column index, adding a column whose label is a string raises an error. However, if I create the dataframe with a column index that contains both Periods and strings, then I'm able to add columns with string labels.
import numpy as np
import pandas as pd
data = np.random.normal(size=(5, 2))
idx = pd.Index(pd.period_range(2011, 2012, freq='A'), name='year')
df = pd.DataFrame(data, columns=idx)
df['age'] = 0
This raises an error.
import numpy as np
import pandas as pd
data = np.random.normal(size=(5, 2))
idx = pd.Index(pd.period_range(2011, 2012, freq='A'), name='year')
df = pd.DataFrame(columns=idx.tolist() + ['age'])
df = df.iloc[:, :-1]
df[:] = data
df['age'] = 0
This does not raise an error and gives my desired outcome, but doing it this way I can't assign the data in a convenient way when I create the dataframe. I would like a more elegant way of achieving the result. I wonder if this is a bug in Pandas?
Not really sure what you are trying to achieve, but here is one way to get what I understood you wanted:
import pandas as pd
idx = pd.Index(pd.period_range(2011, 2015, freq='A'), name='year')
df = pd.DataFrame(index=idx)
df1 = pd.DataFrame({'age': ['age']})
df1 = df1.set_index('age')
# Concatenating merges the two indices; transposing turns them into columns
df = pd.concat([df, df1]).T
print(df)
Which gives:
Empty DataFrame
Columns: [2011, 2012, 2013, 2014, 2015, age]
Index: []
And it keeps your years as Periods:
df.columns[0]
Period('2011', 'A-DEC')
The same result most likely can be achieved using .merge.
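Alternatively, here is a sketch of the question's second approach that assigns the data at creation time: build the mixed column index up front as an object-dtype Index so the Periods and the string label are allowed side by side (the sizes mirror the question's example, and the zeros for the age column are a placeholder):
import numpy as np
import pandas as pd
data = np.random.normal(size=(5, 2))
periods = pd.period_range(2011, 2012, freq='A')
# An object-dtype index can hold Periods and strings together
cols = pd.Index(list(periods) + ['age'], dtype=object)
df = pd.DataFrame(np.column_stack([data, np.zeros(5)]), columns=cols)
df.columns[0]  # Period('2011', 'A-DEC')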
I am using pandas to deal with monthly data that have some missing values. I would like to be able to use the resample method to compute annual statistics, but only for years with no missing data.
Here is some code and output to demonstrate :
import pandas as pd
import numpy as np
dates = pd.date_range(start='1980-01', periods=24, freq='M')
df = pd.DataFrame([np.nan] * 10 + list(range(14)), index=dates)
Here is what I obtain if I resample:
In [18]: df.resample('A').mean()
Out[18]:
              0
1980-12-31  0.5
1981-12-31  7.5
I would like to have np.nan for the 1980-12-31 index, since that year does not have a monthly value for every month. I tried to play with the 'how' argument, but with no luck.
How can I accomplish this?
I'm sure there's a better way, but in this case you can use:
df.resample('A').agg(['mean', 'count', 'size'])
and then drop all rows where count != size.
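Putting that together, a minimal sketch (reusing the dataframe from the question, whose single column is labeled 0) that masks incomplete years with NaN rather than dropping them:
grouped = df[0].resample('A')
# count() ignores NaN while size() counts every row, so they differ exactly for incomplete years
annual = grouped.mean().where(grouped.count() == grouped.size())
# 1980-12-31    NaN
# 1981-12-31    7.5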