I have some stock market data in excel covering the past 20 years or so which contains gaps from holidays and weekends. I wish to interpolate over those missing dates to obtain the approximate stock index for those days.
I've read both columns into Python using pandas and assigned them to their respective variables. What would be the best method to go about detecting the gaps in the dates and interpolating across them?
Pandas has methods specifically for this type of situation:
df.interpolate() # will fill in based on the linear average of the before and after
df.fillna(method='ffill') # forward fill
df.fillna(method='bfill') # backward fill
Related
So I have this dataset of temperatures. Each line describe the temperature in celsius measured by hour in a day.
So, I need to compute a new variable called avg_temp_ar_mensal which representsthe average temperature of a city in a month. City in this dataset is represented as estacao and month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store in a new column the average of the temperature of the city and month. But it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It will works, but it is wrong. It will calculate for every city of the dataset and I don't want it because it will cause noise in my data. I need to separate each temperature based on month and city and then calculate the mean.
The dataframe after groupby is smaller than the initial dataframe, that is why your code run into error.
There is two ways to solve this problem. The first one is using transform as:
df.groupby(['mes', 'estacao'])['temp_ar'].transform(lambda g: g.mean())
The second is to create a new dfn from groupby then merge back to df
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left']
You are calling a groupby on a single column when you are doing df2['temp_ar'].groupby(...). This doesn't make much sense since in a single column, there's nothing to group by.
Instead, you have to perform the groupby on all the columns you need. Also, make sure that the final output is a series and not a dataframe
df['new_column'] = df[['city_column', 'month_column', 'temp_column']].groupby(['city_column', 'month_column']).mean()['temp_column']
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df
I have downloaded ten open datasets of air pollution in 2010-2019 (which has been transferred to Pandas DataFrame by 'read_csv') that have some missing values.
The rows are ordered by each day including several items (like PM2.5, SO2,...). Most of the data include 17 or 18 items. There are 27 columns which separately are Year, Station, Item, 00, 01, ..., 23.
In this case, I already used
df.fillna(np.nan).apply(lambda x: pd.to_numeric(x,errors='coerce')
and df.interpolate(axis=1,inplace=True)
But now if the data have missing values from '00' to anytime following, the interpolate function would not works. If I want to fill all these blanks, I need to merge the last day data which is not null and use interpolate again.
However, different days have different items numbers, which means there are still some rows that can't be filled.
In a nutshell, now I'm trying to contact all data by the key of items and use interpolate.
By the way, after data cleaning, I would like to apply to xgboost and linear regression to predict PM2.5. Is there any way recommended to deal with the data?
(Or any demo code online?)
For example, the data would be like:
one of the datasets
I used df.groupby('date').size() and got
size of different days
Or in other words, how to split different days and concat together?
Groupby(['date','items'])? and then how to merge?
Or, is that possible to interpolate from the last value of the last row?
I have a .cvs file in which data is stored for data ranges - from and to date columns. However, I would like to create a daily data frame with Python out of it.
The time can be ignored, as a gasday always starts at 6am and ends at 6am.
My idea was to have in the end a data frame index with a date (like from March 1st, 2019, ranging to December 31st, 2019 on a daily granularity.
I would create columns with the unique values of the identifier and as values place the respective values or nan in.
The latter one, I can easily do with pd.pivot_table, but still my problem with the time range exists...
Any ideas of how to cope with that?
time-ranged data frame
It should look like this, just with rows in a daily granularity, considering the to column as well. Maybe with range?
output should look similar to this, just with a different period
you can use pandas and groupby the column you want:
df=pd.read_csv("yourfile.csv")
groups=df.groupby("periodFrom")
group.get_group("2019-03-09 06:00")
I have a pandas.DataFrame indexed by time, as seen below. The other column contains data recorded from a device measuring current. I want to filter to the second column by a low pass filter with a frequency of 5Hz to eliminate high frequency noise. I want to return a dataframe, but I do not mind if it changes type for the application of the filter (numpy array, etc.).
In [18]: print df.head()
Time
1.48104E+12 1.1185
1.48104E+12 0.8168
1.48104E+12 0.8168
1.48104E+12 0.8168
1.48104E+12 0.8168
I am graphing this data by df.plot(legend=True, use_index=False, color='red') but would like to graph the filtered data instead.
I am using pandas 0.18.1 but I can change.
I have visited https://oceanpython.org/2013/03/11/signal-filtering-butterworth-filter/ and many other sources of similar approaches.
Perhaps I am over-simplifying this but you create a simple condition, create a new dataframe with the filter, and then create your graph from the new dataframe. Basically just reducing the dataframe to only the records that meet the condition. I admit I do not know what the exact number is for high frequency, but let's assume your second column name is "Frequency"
condition = df["Frequency"] < 1.0
low_pass_df = df[condition]
low_pass_df.plot(legend=True, use_index=False, color='red')
I'm reading in timeseries data that contains only the available times. This leads to a Series with no missing values, but an unequally spaced index. I'd like to convert this to a Series with an equally spaced index with missing values. Since I don't know a priori what the spacing will be, I'm currently using a function like
min_dt = np.diff(series.index.values).min()
new_spacing = pandas.DateOffset(days=min_dt.days, seconds=min_dt.seconds,
microseconds=min_dt.microseconds)
series = series.asfreq(new_spacing)
to compute what the spacing should be (note that this is using Pandas 0.7.3 - the 0.8 beta code looks slightly differently since I have to use series.index.to_pydatetime() for correct behavior with Numpy 1.6).
Is there an easier way to do this operation using the pandas library?
If you want NaN's in the places where there is no data, you can just use Minute() located in datetools (as of pandas 0.7.x)
from pandas.core.datetools import day, Minute
tseries.asfreq(Minute())
That should provide an evenly spaced time series with 1 minute differences with NaNs as the series values where there is no data.