I have a DataFrame which looks like this:
1125400 5430095 1095751
2013-05-22 105.24 NaN 6507.58
2013-05-23 104.63 NaN 6393.86
2013-05-26 104.62 NaN 6521.54
2013-05-27 104.62 NaN 6609.31
2013-05-28 104.54 87.79 6640.24
2013-05-29 103.91 86.88 6577.39
2013-05-30 103.43 87.66 6516.55
2013-06-02 103.56 87.55 6559.43
I would like to compute the first non-NaN value in each column.
As "Locate first and last non-NaN values in a Pandas DataFrame" points out, first_valid_index can be used. Unfortunately, on a DataFrame it returns the first row in which at least one element is not NaN; it does not work per column.
You should use the apply method, which applies a function to either each column (the default) or each row:
>>> first_valid_indices = df.apply(lambda series: series.first_valid_index())
>>> first_valid_indices
1125400 2013-05-22 00:00:00
5430095 2013-05-28 00:00:00
1095751 2013-05-22 00:00:00
first_valid_indices will then be a Series containing the first_valid_index for each column.
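If what you ultimately want is the first non-NaN value in each column rather than its index, a small sketch building on the same idea (not part of the original answer, and assuming every column has at least one non-NaN value):
first_values = df.apply(lambda s: s.loc[s.first_valid_index()])
# or, equivalently, back-fill and take the first row:
first_values = df.bfill().iloc[0]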
You could also define the lambda function as a normal function outside:
def first_valid_index(series):
    return series.first_valid_index()
and then call apply like this:
df.apply(first_valid_index)
The built-in function DataFrame.groupby().column.first() returns the first non-null value in the column, while last() returns the last.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.first.html
If you don't actually have groups and just want the first non-null value in each column of the whole DataFrame, you can add a dummy column of 1s, then get the first non-null values using the groupby and first functions:
from pandas import DataFrame
df = DataFrame({'a':[None,1,None],'b':[None,2,None]})
df['dummy'] = 1
df.groupby('dummy').first()
df.groupby('dummy').last()
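On this toy frame, the expected output is a single row indexed by the dummy value (sketched by hand from the data above):
df.groupby('dummy').first()
#          a    b
# dummy
# 1      1.0  2.0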
By compute I assume you mean access?
The simplest way to do this is with the pd.Series.first_valid_index() method, probably inside a dict comprehension:
values = {col : DF.loc[DF[col].first_valid_index(), col] for col in DF.columns}
values
Just to be clear, each column in a pandas DataFrame is a Series. So the above is the same as doing:
values = {}
for column in DF.columns:
    First_Non_Null_Index = DF[column].first_valid_index()
    values[column] = DF.loc[First_Non_Null_Index, column]
So the operation in my one-line solution is done on a per-column basis, i.e. it is not going to create the kind of error you seem to be suggesting in the edit you made to the question. Let me know if it does not work as expected.
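For the sample frame at the top of the question, the dict comprehension above should give roughly the following (values read off the sample data, so treat this as a sketch):
{1125400: 105.24, 5430095: 87.79, 1095751: 6507.58}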
Related
I have a pandas dataframe (df) with a column ('ISSN'). Most of the values in that column are strings of 8 characters (e.g. "12345678"). Some of them are shorter (e.g. "983750") and I would like to add left padding of zeros in order to reach exactly 8 characters (in the previous example, thus obtaining "00983750").
I am using rjust as follows and it works as expected:
df['ISSN'] = df['ISSN'].apply(lambda x: str(x).rjust(8, '0'))
But since some of the values of that column are NaN, they get modified as well and I get 00000nan. How can I apply rjust() just to non-NaN values?
Use Pandas' .str.zfill, which handles NaN for you:
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame({"ISSN": [np.nan, '1234', '12345678']})
df['ISSN'] = df['ISSN'].str.zfill(8)
Output:
ISSN
0 NaN
1 00001234
2 12345678
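If you would rather keep the apply/rjust approach from the question, a sketch that simply skips missing values (not part of the answer above) would be:
df['ISSN'] = df['ISSN'].apply(lambda x: x if pd.isna(x) else str(x).rjust(8, '0'))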
I have a TimeSeries and I want to extract the first three elements and use them to create a row of a pandas DataFrame with three columns. I can do this easily using a dictionary, for example. The problem is that I would like the index of this one-row DataFrame to be the datetime index of the first element of the Series. Here I fail.
For a reproducible example:
CRM
Date
2018-08-30 0.000442
2018-08-29 0.005923
2018-08-28 0.004782
2018-08-27 0.003243
pd.DataFrame({'Reg_Coef_5_1' : ts1.iloc[0][0], 'Reg_Coef_5_2' : ts1.shift(-5).iloc[0][0], \
'Reg_Coef_5_3' : ts1.shift(-10).iloc[0][0]}, index = ts1.iloc[0].index )
I get:
Reg_Coef_5_1 Reg_Coef_5_2 Reg_Coef_5_3
CRM 0.000442 0.001041 -0.00035
Instead, I would like the index to be '2018-08-30', a datetime object.
If I understand you correctly, you would like the index to be a date object instead of "CRM" as it is in your example. Just set the index accordingly: index = [ts1.index[0]] instead of index = ts1.iloc[0].index.
df = pd.DataFrame({'Reg_Coef_5_1' : ts1.iloc[0][0], 'Reg_Coef_5_2' : ts1.shift(-5).iloc[0][0], \
'Reg_Coef_5_3' : ts1.shift(-10).iloc[0][0]}, index = [ts1.index[0]] )
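With that change, the one-row frame should look roughly like this (sketched from the values shown in the question):
            Reg_Coef_5_1  Reg_Coef_5_2  Reg_Coef_5_3
2018-08-30      0.000442      0.001041      -0.00035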
But as user10300706 has said, there might be a better way to do what you want, ultimately.
If you're simply trying to recover the index position then do:
index = ts1.index[0]
I would note that if you are shifting your dataframe up incrementally (5/10 respectively), the indexes won't align. I assume, however, that you're trying to build out some lagging indicator.
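To illustrate the alignment caveat with a small made-up series (values and dates are hypothetical), shift moves the values but keeps the original index:
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6],
              index=pd.date_range('2018-08-25', periods=6))
s.shift(-5)
# only the first label keeps a value; the remaining rows become NaN,
# so values taken from s and s.shift(-5) line up by label, not by position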
I need to group a Pandas dataframe by date, and then take a weighted average of given values. Here's how it's currently done using the margin value as an example (and it works perfectly until there are NaN values):
df = orders.copy()
# Create new columns as required
df['margin_WA'] = df['net_margin'].astype(float) # original data as str or Decimal
def group_wa():
    return lambda num: np.average(num, weights=df.loc[num.index, 'order_amount'])
agg_func = {
'margin_WA': group_wa(), # agg_func includes WAs for other elements
}
result = df.groupby('order_date').agg(agg_func)
result['margin_WA'] = result['margin_WA'].astype(str)
In the case where the 'net_margin' fields contain NaN values, the WA is set to NaN. I can't seem to dropna() or filter by pd.notnull when creating the new columns, and I don't know where to create a masked array to avoid passing NaN to the group_wa function (as suggested here). How do I ignore NaN in this case?
I think a simple solution is to drop the missing values before you groupby/aggregate like:
result = df.dropna(subset=['margin_WA']).groupby('order_date').agg(agg_func)
In this case, no indices containing missings are passed to your group_wa function.
Edit
Another approach is to move the dropna into your aggregating function like:
def group_wa(series):
    dropped = series.dropna()
    return np.average(dropped, weights=df.loc[dropped.index, 'order_amount'])
agg_func = {'margin_WA': group_wa}
result = df.groupby('order_date').agg(agg_func)
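A minimal, self-contained sketch of this second approach (column names taken from the question, data made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'order_date': ['2020-01-01', '2020-01-01', '2020-01-02'],
    'net_margin': [0.1, np.nan, 0.3],
    'order_amount': [100, 50, 200],
})
df['margin_WA'] = df['net_margin'].astype(float)

def group_wa(series):
    dropped = series.dropna()
    return np.average(dropped, weights=df.loc[dropped.index, 'order_amount'])

result = df.groupby('order_date').agg({'margin_WA': group_wa})
# the NaN row is ignored instead of turning the whole group's average into NaN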
I have two columns in a dataframe, and I am trying to enter a condition based on whether the second one is NaN while the first one has some value. I have tried, unsuccessfully:
if np.isfinite(train_bk['Product_Category_1']) and np.isnan(train_bk['Product_Category_2'])
and
if not (train_bk['Product_Category_2']).isnull() and (train_bk['Product_Category_3']).isnull()
I would use eval:
df = df.eval('ind = ((pc1 == pc1) & (pc2 != pc2)) * 2 + ((pc1 == pc1) & (pc2 == pc2)) * 3')
df = df.replace({'ind': {0: 1}})
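Assuming pc1 and pc2 are shorthand for Product_Category_1 and Product_Category_2, an alternative sketch (not from the answer above) that spells the same logic out with notna/isna and np.select:
import numpy as np

conditions = [
    train_bk['Product_Category_1'].notna() & train_bk['Product_Category_2'].isna(),
    train_bk['Product_Category_1'].notna() & train_bk['Product_Category_2'].notna(),
]
train_bk['ind'] = np.select(conditions, [2, 3], default=1)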
In some transformations, I seem to be forced to break from the Pandas dataframe grouped object, and I would like a way to return to that object.
Given a dataframe of time series data, if one groups by one of the values in the dataframe, we are given an underlying dictionary from key to dataframe.
If I build a Python dict from this, the structure cannot be converted back into a DataFrame using .from_dict(), because the values of the dict are themselves DataFrames.
The only way to go back to Pandas without some hacky column renaming is, to my knowledge, by converting it back to a grouped object.
Is there any way to do this?
If not, how would I convert a dictionary mapping keys to DataFrames back into a pandas data structure?
EDIT, ADDING SAMPLE:
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a': pd.Series(np.random.randn(len(rng)), index=rng),
                   'b': pd.Series(np.random.randn(len(rng)), index=rng)})
# now we have a dataframe with 'a's and 'b's in a time series

df_dict = {}
for k, v in df.groupby('a'):
    df_dict[k] = v
# now we apply some transformation that cannot be applied via aggregate, transform, or apply
# how do we get this back into a groupby object?
If I understand OP's question correctly, you want to group a dataframe by some key(s), do different operations on each group (possibly generating new columns, etc.) and then go back to the original dataframe.
Modifying your example (group by random integers instead of floats, which are usually unique):
np.random.seed(200)
rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a':pd.Series(np.random.randn(len(rng)), index=rng), 'b':pd.Series(np.random.randn(len(rng)), index=rng)})
df['group'] = np.random.randint(3,size=(len(df)))
Usually, if I need a single value for each column per group, I'll do this (for example, sum of 'a', mean of 'b'):
In [10]: df.groupby('group').aggregate({'a':np.sum, 'b':np.mean})
Out[10]:
a b
group
0 -0.214635 -0.319007
1 0.711879 0.213481
2 1.111395 1.042313
[3 rows x 2 columns]
However, if I need a series for each group,
In [19]: def func(sub_df):
   ....:     sub_df['c'] = sub_df['a'] * sub_df['b'].shift(1)
   ....:     return sub_df
   ....:
In [20]: df.groupby('group').apply(func)
Out[20]:
a b group c
2000-01-31 -1.450948 0.073249 0 NaN
2000-11-30 1.910953 1.303286 2 NaN
2001-09-30 0.711879 0.213481 1 NaN
2002-07-31 -0.247738 1.017349 2 -0.322874
2003-05-31 0.361466 1.911712 2 0.367737
2004-03-31 -0.032950 -0.529672 0 -0.002414
2005-01-31 -0.221347 1.842135 2 -0.423151
2005-11-30 0.477257 -1.057235 0 -0.252789
2006-09-30 -0.691939 -0.862916 2 -1.274646
2007-07-31 0.792006 0.237631 0 -0.837336
[10 rows x 4 columns]
I'm guessing you want something like the second example, but the original question wasn't very clear even with your example.
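As a complement (not part of the original answer): if you already have the dict of key to sub-DataFrame from the question, pd.concat will stitch it back into a single frame, with the keys as an extra outer index level, which you can then group again:
import pandas as pd

combined = pd.concat(df_dict)            # dict keys become the outer level of a MultiIndex
combined = combined.droplevel(0)         # drop that level to recover the original index
regrouped = combined.groupby('group')    # assuming the 'group' column from the example above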