I'm struggling to find a simple way to change the frequency of a pd.Series that is grouped on some level of a pd.MultiIndex (so it's a pd.core.groupby.generic.SeriesGroupBy).
Here's a simple example of how this could be done with a standard pd.Series:
import pandas as pd
dates = pd.date_range('1990','2000',freq='M')
values = range(len(dates))
vec = pd.Series(values,index=dates)
vec.asfreq('D')
Here's what I'd hope would work for a grouped series, but it doesn't:
idx = [x//12 for x in values]
midx = pd.MultiIndex.from_arrays([idx, dates], names=['level0', 'level1'])
vec = pd.Series(values,index=midx)
vec.groupby('level0').asfreq('D')
I know that I could reset the index to level1, change the frequency and group it again, but I wonder if there is a better way to do it.
Edit
Here's the approach I mentioned above (very crude, I know):
vec_new = vec.groupby('level0').obj\
    .reset_index()\
    .set_index('level1')\
    .asfreq('D', method='ffill')\
    .reset_index()
midx = pd.MultiIndex.from_arrays([vec_new['level0'], vec_new['level1']], names=['level0', 'level1'])
vec_new.index = midx
vec_new.drop(columns=['level1'], inplace=True)
vec_new.groupby('level0')
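A possibly cleaner alternative (just a sketch on the toy data above, not an established recipe for every case) is to let groupby().apply() do the reindexing per group and re-attach the group key afterwards:
# Sketch, assuming vec is the MultiIndex series built above: within each group,
# drop 'level0' so the index is a plain DatetimeIndex, change its frequency,
# and let group_keys=True add 'level0' back as the outer index level.
vec_daily = vec.groupby(level='level0', group_keys=True).apply(
    lambda s: s.droplevel('level0').asfreq('D', method='ffill')
)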
Related
For a pandas.Series, I know how to remove outliers with something like this:
import numpy as np
import pandas as pd

x = pd.Series(np.random.normal(size=1000))
iqr = x.quantile(.75) - x.quantile(.25)
y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]
I would like to do this over the different Series/columns of a DataFrame:
import string
import random
df = pd.DataFrame([])
df['A'] = pd.Series(np.random.normal(size=1000))
df['B'] = pd.Series(np.random.normal(size=1000, loc=-5, scale=1))
df['C'] = pd.Series(np.random.normal(size=1000, loc=10, scale=2))
df['index'] = pd.Series([random.choice(string.ascii_uppercase) for i in range(1000)])
df.set_index('index')
I usually do stuff like
df = df.groupby('index').mean()
However, in this case, it would also average the outliers, which I would like to ignore from averaging.
Notice that with the random data the outliers end up in different positions in each column, so an outlier should be ignored only in its own column/Series.
The result should be a DataFrame with 26 rows (one for each letter of the index) and 3 columns, with the values averaged without outliers.
I can loop over the columns of df and do the first block of code. But is there a nicer way?
Suggestions are welcome; any approach is accepted.
Use the following code.
def mean_without_outlier(x):  # x: series
    iqr = x.quantile(.75) - x.quantile(.25)
    y = x[x.between(x.quantile(.25) - 1.5*iqr, x.quantile(.75) + 1.5*iqr)]
    return y.mean()

df.groupby("index")[['A', 'B', 'C']].agg(mean_without_outlier)
I have data in a dataframe as follows:
import numpy
import pandas

ROWS = 1000
df = pandas.DataFrame()
df['DaT'] = pandas.date_range('2000-1-1', periods=ROWS, freq='H')
df['cat'] = numpy.random.choice(['a', 'b', 'c'], size=ROWS)
df['val'] = numpy.random.randint(2, size=ROWS)
df['r10'] = df.groupby(['cat'])['val'].apply(lambda x: x.rolling(10).mean())
I need to calculate a column that is grouped by category 'cat' and is a rolling (10-period) mean of the 'val' column, but the rolling mean for a given row cannot include values from the day that row occurs on.
The desired result ('wanted') can be generated as follows:
df['wanted'] = numpy.nan
for idx, row in df.iterrows():
    Rdate = row['DaT'].normalize()
    Rcat = row['cat']
    try:
        df.loc[idx, 'wanted'] = df[(df['DaT'] < Rdate) & (df['cat'] == Rcat)]['val'].rolling(10).mean().iloc[-1]
    except:
        df.loc[idx, 'wanted'] = numpy.nan
The above is an awful solution, but it gets the result. It is very slow for the 100000+ rows it needs to go through. Is there a more elegant solution?
I have tried using combinations of shift and even quantize to get a more efficient solution, but no success yet
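One possible vectorized approach (only a sketch, not verified against the loop above; the 'day', 'run' and 'wanted_fast' names are mine) is to compute the per-category running mean once, keep its value as of the last row of each day, and shift that by one observed day within each category:
# Sketch: assumes df is sorted by 'DaT' within each category, as built above.
df['day'] = df['DaT'].dt.normalize()

# rolling-10 mean of 'val' within each category, aligned with the original rows
df['run'] = df.groupby('cat')['val'].transform(lambda s: s.rolling(10).mean())

# value of that running mean as of the last row of each (cat, day)
eod = df.groupby(['cat', 'day'])['run'].last()

# shift by one observed day within each category, so a day only sees earlier days
prev = eod.groupby(level='cat').shift(1).rename('wanted_fast')

# attach back to the original rows
df = df.merge(prev, left_on=['cat', 'day'], right_index=True, how='left')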
I'm about to write a backtesting tool, and so for every row I'd like to have access to all of the dataframe up to that row. In the following example I'm doing it from a fixed index using a loop. I'm wondering if there is a better solution.
import numpy as np
import pandas as pd
N = 10  # any fixed length
df = pd.DataFrame({"a": np.arange(N)})
for i in range(3, N):
    print(df["a"][:i].values)
UPDATE (toy example)
I need to apply a custom function to all the previous values. Here as a toy example I will use the sum of the square of all previous values.
def toyFun(v):
    return np.sum(v**2)

res = np.empty(N)
res[:] = np.nan
for i in range(3, N):
    res[i] = toyFun(df["a"][:i].values)

df["res"] = res
If you are indexing rows for a particular column, say 'a', you can use the .iloc indexer (integer-location based indexing) to slice it by position.
df = pd.DataFrame({'a': [1,2,3,4]})
print(df.a.iloc[:2]) # get first two values
So, you can do:
for i in range(3, 10):
    print(df.a.iloc[:i])
The best way is to use a temporary column with the intermediate results; that way you are not re-calculating everything at each step.
df["a"].apply(lambda x: x**2).cumsum()
Then re-index as you wish:
res[3:] = df["a"].apply(lambda x: x**2).cumsum()[2:N-1].values
or directly to the dataframe.
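For instance, a sketch of the "directly to the dataframe" variant (assuming N and df from the question; the first three entries stay NaN, as in the loop version):
# cumsum of squares shifted by one row gives, at row i, the sum over rows 0..i-1
df["res"] = df["a"].apply(lambda x: x**2).cumsum().shift(1)
df.loc[:2, "res"] = np.nan  # match the loop, which only fills rows 3 and onward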
I have a time series of daily data from 2000 to 2015. What I want is another single time series which only contains the data from each year between April 15 and June 15 (because that is the period relevant for my analysis).
I have already written code to do this myself, which is given below:
import pandas as pd
df = pd.read_table(myfilename, delimiter=",", parse_dates=['Date'], na_values=-99)
dff = df[df['Date'].apply(lambda x: x.month>=4 and x.month<=6)]
dff = dff[dff['Date'].apply(lambda x: x.day>=15 if x.month==4 else True)]
dff = dff[dff['Date'].apply(lambda x: x.day<=15 if x.month==6 else True)]
I think this code is quite inefficient, as it has to operate on the dataframe three times to get the desired subset.
I would like to know the following two things:
Is there an inbuilt pandas function to achieve this?
If not, is there a more efficient and better way to achieve this?
Let the data frame look like this:
df = pd.DataFrame({'Date': pd.date_range('2000-01-01', periods=365*10, freq='D'),
                   'Value': np.random.random(365*10)})
Create a series of dates with the year set to the same value:
from datetime import datetime
x = df.Date.apply(lambda d: datetime(2000, d.month, d.day))
Filter using this series to select from the dataframe:
df[(x >= datetime(2000, 4, 15)) & (x <= datetime(2000, 6, 15))]
Try this:
index = pd.date_range("2000/01/01", "2016/01/01")
s = index.to_series()
s[(s.dt.month * 100 + s.dt.day).between(415, 615)]
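Applied to the dataframe from the question, the same month*100 + day trick would look roughly like this (a sketch, assuming the 'Date' column has been parsed to datetimes as in the question):
d = df['Date']
dff = df[(d.dt.month * 100 + d.dt.day).between(415, 615)]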
Please help. Seems easy, I just can't figure it out.
DataFrame (df) contains numbers. For each column:
* compute the mean and std
* compute a new value for each value in each row in each column
* replace that value with the new value
Method 1
import numpy as np
import pandas as pd
n = 1
while n < len(df.columns.values.tolist()):
    col = df.values[:, n]
    mean = sum(col)/len(col)
    std = np.std(col, axis=0)
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        df.set_value(x, df.columns.values[n], y)
    n = n + 1
Method 2
labels = df.columns.values.tolist()
df2 = df.ix[:,0]
n = 1
while n < len(df.columns.values.tolist()):
    col = df.values[:, n]
    mean = sum(col)/len(col)
    std = np.std(col, axis=0)
    ls = []
    for x in df[df.columns.values[n]]:
        y = (float(x) - float(mean)) / float(std)
        ls.append(y)
    df2 = pd.DataFrame({labels[n]: str(ls)})
    df1 = pd.concat([df1, df2], axis=1, ignore_index=True)
    n = n + 1
Error: ValueError: If using all scalar values, you must pass an index
I also tried the .apply method, but the new DataFrame doesn't change the values.
print(df.to_json()):
{"col1":{"subj1":4161.97,"subj2":5794.73,"subj3":4740.44,"subj4":4702.84,"subj5":3818.94},"col2":{"subj1":13974.62,"subj2":19635.32,"subj3":17087.721851,"subj4":13770.461021,"subj5":11546.157578},"col3":{"subj1":270.7,"subj2":322.607708,"subj3":293.422314,"subj4":208.644585,"subj5":210.619961},"col4":{"subj1":5400.16,"subj2":7714.080365,"subj3":6023.026011,"subj4":5880.187272,"subj5":4880.056292}}
You are standard-normalizing each column by removing the mean and scaling to unit variance. You can use scikit-learn's StandardScaler for this:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
new_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
Here is the documentation for the same
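As a quick sanity check (a sketch, assuming df from the question), the scaler's output should match the manual formula, keeping in mind that StandardScaler uses the population standard deviation (ddof=0) while pandas' .std() defaults to ddof=1:
manual = (df - df.mean()) / df.std(ddof=0)
print(np.allclose(new_df.values, manual.values))  # expected: True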
It looks like you're trying to do operations on DataFrame columns and values as though DataFrames were simple lists or arrays, rather than in the vectorized / column-at-a-time way more usual for NumPy and Pandas work.
A simple, first-pass improvement might be:
# import your data
import json
df = pd.DataFrame(json.loads(json_text))

# loop over only numeric columns
for col in df.select_dtypes([np.number]):
    # compute column mean and std
    col_mean = df[col].mean()
    col_std = df[col].std()
    # adjust column to normalized values
    df[col] = df[col].apply(lambda x: (x - col_mean) / col_std)
That is vectorized by column. It retains some explicit looping, but is straightforward and relatively beginner-friendly.
If you're comfortable with Pandas, it can be done more compactly:
numeric_cols = list(df.select_dtypes([np.number]))
df[numeric_cols] = df[numeric_cols].apply(lambda col: (col - col.mean()) / col.std(), axis=0)
In your revised DataFrame, there are no string columns. But the earlier DataFrame had string columns, causing problems when they were computed upon, so let's be careful. This is a generic way to select numeric columns. If it's too much, you can simplify at the cost of generality by listing them explicitly:
numeric_cols = ['col1', 'col2', 'col3', 'col4']