I'm trying to set the index of a df, except it doesn't work:
def save_to_csv(timestamps, values, lower, upper, query):
    df = pd.DataFrame({'Time': timestamps, f'Q50-{format_filename(query)}': values})
    df.set_index('Time')
    df['Time'] = df['Time'].dt.strftime('%Y-%m-%dT%H:%M:%S')
    print(df.tail())
    df.to_csv(f"predictions/pred-Q50-{format_filename(query)}.csv")
and here is the output:
Time query_value
149007 2023-05-15T15:55:00 0.301318
149008 2023-05-15T15:56:00 0.301318
149009 2023-05-15T15:57:00 0.301318
149010 2023-05-15T15:58:00 0.301318
149011 2023-05-15T15:59:00 0.301318
I still have the original index and not the Time column set as index.
Any fix for that?
Actually, when saving to CSV, passing index=False to to_csv fixed everything!
Pandas is designed for chaining: most operations return a modified copy of the dataframe rather than mutating it in place, so rather than reassigning at each line, the chaining format is used. Your call to set_index does return the dataframe you want, but you aren't reassigning it to df, so the result is discarded.
This is how it would look with chaining.
def save_to_csv(timestamps, values, lower, upper, query):
    df = (pd.DataFrame({'Time': timestamps,
                        f'Q50-{format_filename(query)}': values})
          .assign(Time=lambda x: x['Time'].dt.strftime('%Y-%m-%dT%H:%M:%S'))
          .set_index('Time'))
    print(df.tail())
    df.to_csv(f"predictions/pred-Q50-{format_filename(query)}.csv")

(Note the formatting step has to come before set_index, because set_index drops the 'Time' column by default.)
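The core point, that set_index returns a new frame rather than mutating the original, is easy to verify with toy data (no helper functions needed):

```python
import pandas as pd

df = pd.DataFrame({'Time': pd.date_range('2023-05-15 15:55', periods=3, freq='min'),
                   'value': [0.1, 0.2, 0.3]})

df.set_index('Time')       # returns a new frame; df itself is unchanged
print(df.index.name)       # None

df = df.set_index('Time')  # reassign to keep the result
print(df.index.name)       # Time
```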
Related
I have a MultiIndex dataframe where I need to compute the percentage change column-wise. I used apply in conjunction with pd.pct_change. This is working as long I don't take into consideration the outer level of the MultiIndex with groupby.
# Create pd.MultiIndex and include some NaNs
rng = pd.date_range(start='2018-12-20', periods=20, name='date')
date = np.concatenate([rng, rng])
perm = np.array([*np.repeat(1, 20), *np.repeat(2, 20)])
d = {'perm': perm,
'date': date,
'ser_1': np.random.randint(low=1, high=10, size=[40]),
'ser_2': np.random.randint(low=1, high=10, size=[40])}
df = pd.DataFrame(data=d)
df.iloc[5:8, 2:] = np.nan
df.iloc[11:13, 2] = np.nan
df.iloc[25:28, 2:] = np.nan
df.iloc[33:37, 3] = np.nan
df.set_index(['perm', 'date'], drop=True, inplace=True)
# Apply pd.pct_change to every column individually in order to take care of the
# NaNs at different positions. Also, use groupby for every 'perm'. This one is
# where I am struggling.
# This is working properly, but it doesn't take into account 'perm'. The first
# two rows of perm=2 (i.e. rows 20 and 21) must be NaN.
chg = df.apply(lambda x, periods:
                   x.dropna().pct_change(periods=2)
                    .reindex(df.index, method='ffill'),
               axis=0, periods=2)
# This one is causing an error:
# TypeError: <lambda>() got an unexpected keyword argument 'axis'
chg = df.groupby('perm').apply(lambda x, periods:
                                   x.dropna().pct_change(periods=2)
                                    .reindex(df.index, method='ffill'),
                               axis=0, periods=2)
The "unexpected keyword argument 'axis'" error comes from the fact that pandas.DataFrame.apply and pandas.core.groupby.GroupBy.apply are two different methods with similar but distinct parameters: they share a name because they perform very similar tasks, but they belong to two different classes.
If you check the documentation, you'll see that the first one accepts an axis parameter while the second one does not.
So, to get working code with groupby, just remove the axis parameter from GroupBy.apply. Since you want to work column by column because of dropna, you need to use DataFrame.apply inside GroupBy.apply:
chg = df.groupby('perm').apply(
    lambda x: x.apply(lambda y: y.dropna().pct_change(periods=2)
                                 .reindex(x.index, method='ffill'),
                      axis=0))
This produces what you want (first two rows of "perm 2" are NaN, other numbers are equal to the result you get by using apply without groupby).
Note that I've also edited the first argument of reindex: it's x.index, not df.index; otherwise you'd get a duplicated perm level in the final result.
Final note: there's no need to pass a periods argument to the lambda if you hardcode it in pct_change; it's redundant.
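As a sanity check, here's a smaller self-contained version of the fix (one value column, five dates per perm; group_keys=False is an addition for recent pandas versions, where groupby.apply would otherwise prepend a second perm level to the result):

```python
import numpy as np
import pandas as pd

rng = pd.date_range(start='2018-12-20', periods=5, name='date')
d = {'perm': np.repeat([1, 2], 5),
     'date': np.concatenate([rng, rng]),
     'ser_1': np.arange(1.0, 11.0)}
df = pd.DataFrame(d).set_index(['perm', 'date'])

# Column-wise pct_change inside each perm group.
chg = df.groupby('perm', group_keys=False).apply(
    lambda x: x.apply(lambda y: y.dropna().pct_change(periods=2)
                                 .reindex(x.index, method='ffill'),
                      axis=0))

# The first two rows of *each* perm group are NaN, as required.
print(chg['ser_1'].isna().groupby(level='perm').sum())
```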
I need to sort a pandas dataframe df by a datetime column my_date. Whenever I use .loc, sorting does not apply.
df = df.loc[(df.some_column == 'filter'),]
df.sort_values(by=['my_date'])
print(df)
# ...
# Not sorted!
# ...
df = df.loc[(df.some_column == 'filter'),].sort_values(by=['my_date'])
# ...
# sorting WORKS!
What is the difference of these two uses? What am I missing about dataframes?
In the first case, you didn't perform an operation in-place: you should have used either df = df.sort_values(by=['my_date']) or df.sort_values(by=['my_date'], inplace=True).
In the second case, the result of .sort_values() was saved to df, hence printing df shows sorted dataframe.
In your first snippet, df = df.loc[(df.some_column == 'filter'),] followed by df.sort_values(by=['my_date']) runs the sort as a separate statement and discards its result, so printing df shows the unsorted frame.
In the second snippet you chain the calls, df.loc[...].sort_values(...), which is the correct way; you don't need the df. prefix twice.
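The difference is easy to demonstrate with a throwaway frame (column names borrowed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'my_date': pd.to_datetime(['2023-03-01', '2023-01-01', '2023-02-01']),
    'some_column': ['filter', 'filter', 'filter'],
})

# sort_values returns a *new*, sorted frame; df itself is left untouched.
result = df.sort_values(by=['my_date'])
print(df['my_date'].is_monotonic_increasing)      # False
print(result['my_date'].is_monotonic_increasing)  # True

# Reassigning (or inplace=True) is what makes the order stick.
df = df.loc[df.some_column == 'filter'].sort_values(by=['my_date'])
print(df['my_date'].is_monotonic_increasing)      # True
```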
I have pandas DataFrame df with different types of columns, some values of df are NaN.
To test some assumption, I create copy of df, and transform copied df to (0, 1) with pandas.isnull():
df_copy = df
for column in df_copy:
    df_copy[column] = df_copy[column].isnull().astype(int)
but after that BOTH df and df_copy consist of 0s and 1s.
Why does this code transform df to 0s and 1s, and is there a way to prevent it?
You can prevent it by declaring:
df_copy = df.copy()
This creates a new object. Prior to that, you essentially had two references to the same object; note that DataFrames are mutable.
Btw, you could obtain the desired result simply by:
df_copy = df.isnull().astype(int)
Or, even better memory-wise:
for column in df:
    df[column + 'flag'] = df[column].isnull().astype(int)
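A quick check of the aliasing behaviour, on a hypothetical two-row frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan]})

alias = df        # a second name for the *same* object
copy = df.copy()  # an independent object

alias['a'] = alias['a'].isnull().astype(int)

print(alias is df)         # True
print(df['a'].tolist())    # [0, 1]  -- df was mutated through the alias
print(copy['a'].tolist())  # [1.0, nan]  -- the real copy is untouched
```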
I'm not sure how to reset index after dropna(). I have
df_all = df_all.dropna()
df_all.reset_index(drop=True)
but after running my code, row index skips steps. For example, it becomes 0,1,2,4,...
The code you've posted already does what you want, but does not do it "in place." Try adding inplace=True to reset_index() or else reassigning the result to df_all. Note that you can also use inplace=True with dropna(), so:
df_all.dropna(inplace=True)
df_all.reset_index(drop=True, inplace=True)
Does it all in place. Or,
df_all = df_all.dropna()
df_all = df_all.reset_index(drop=True)
to reassign df_all.
You can chain methods and write it as a one-liner:
df = df.dropna().reset_index(drop=True)
You can reset the index to default using set_axis() as well.
df.dropna(inplace=True)
df = df.set_axis(range(len(df)))
(set_axis no longer accepts inplace=True as of pandas 2.0, so reassign the result.)
set_axis() is especially useful if you want to reset the index to something other than the default, because as long as the lengths match, you can change the index to literally anything with it. For example, you can change it to 'first row', 'second row', etc.
df = df.dropna()
df = df.set_axis(['first row', 'second row'])
I'm practicing with using apply with Pandas dataframes.
So I have cooked up a simple dataframe with dates, and values:
dates = pd.date_range('2013', periods=10)
values = list(np.arange(1, 11, 1))
DF = pd.DataFrame({'date': dates, 'value': values})
I have a second dataframe, which is made up of 3 rows of the original dataframe:
DFa = DF.iloc[[1,2,4]]
So, I'd like to use the 2nd dataframe, DFa, and get the dates from each row (using apply), and then find and sum up any dates in the original dataframe, that came earlier:
def foo(DFa, DF=DF):
    cutoff_date = DFa['date']
    ans = DF[DF['date'] < cutoff_date]

DFa.apply(foo, axis=1)
Things work fine. My question is, since I've created 3 ans, how do I access these values?
Obviously I'm new to apply and I'm eager to get away from loops. I just don't understand how to return values from apply.
Your function needs to return a value. E.g.,
def foo(df1, df2):
    cutoff_date = df1.date
    ans = df2[df2.date < cutoff_date].value.sum()
    return ans

DFa.apply(lambda x: foo(x, DF), axis=1)
Also, note that your current function builds a DataFrame for each row in DFa, so if you returned ans as-is you would end up with a Series of DataFrames rather than the sums you want.
There's a bit of a mixup the way you're using apply. With axis=1, foo will be applied to each row (see the docs), and yet your code implies (by the parameter name) that its first parameter is a DataFrame.
Additionally, you state that you want to sum up the original DataFrame's values for those less than the date. So foo needs to do this, and return the values.
So the code needs to look something like this:
def foo(row, DF=DF):
    cutoff_date = row['date']
    return DF[DF['date'] < cutoff_date].value.sum()
Once you make these changes, foo returns a scalar, so apply will return a Series:
>>> DFa.apply(foo, axis=1)
1 1
2 3
4 10
dtype: int64
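For reference, assembling the corrected pieces into one runnable script reproduces exactly that Series:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2013', periods=10)
values = list(np.arange(1, 11, 1))
DF = pd.DataFrame({'date': dates, 'value': values})
DFa = DF.iloc[[1, 2, 4]]

def foo(row, DF=DF):
    cutoff_date = row['date']
    # Sum the values of all rows strictly earlier than this row's date.
    return DF[DF['date'] < cutoff_date].value.sum()

result = DFa.apply(foo, axis=1)
print(result)  # index 1 -> 1, index 2 -> 3, index 4 -> 10
```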