I have a problem with filtering data from a column, so I have a question about it.
My df looks like this:
TempHigh TempLow City
Date
2017-01-01 25 15 A
2017-01-02 23 14 A
2017-01-03 29 10 A
2017-01-01 22 13 B
2017-01-02 21 12 B
2017-01-03 12 11 B
How can I make df.describe() show statistics only for City A (not just df['City'].describe())?
How can I make separate plots for City A and City B, and another plot comparing both cities in one figure with kind='line'?
Also, how can I make histogram subplots for City A and a plot for City B?
I tried the code below, but it gives me all columns and I want only one of them. And how can I put the City A and City B histograms in one plot?
df.groupby('City').hist()
Thanks in advance!
You need to read basic documentation about Indexing and Selecting Data.
>>> df[df['City']=='A'].describe()
TempHigh TempLow
count 3.000000 3.000000
mean 25.666667 13.000000
std 3.055050 2.645751
min 23.000000 10.000000
25% 24.000000 12.000000
50% 25.000000 14.000000
75% 27.000000 14.500000
max 29.000000 15.000000
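The describe part is shown above. For the plotting parts, here is a minimal sketch with pandas/matplotlib plotting, assuming the column names from your sample df (not tested against your real data):
import matplotlib.pyplot as plt

# separate line plot per city
for city, group in df.groupby('City'):
    group[['TempHigh', 'TempLow']].plot(kind='line', title=f'City {city}')

# both cities compared in one line plot (TempHigh as the example column)
df.pivot(columns='City', values='TempHigh').plot(kind='line')

# histogram subplots (one per numeric column) for a single city only
df[df['City'] == 'A'].hist()

# City A and City B histograms of one column overlaid in one figure
df[df['City'] == 'A']['TempHigh'].plot(kind='hist', alpha=0.5, label='A')
df[df['City'] == 'B']['TempHigh'].plot(kind='hist', alpha=0.5, label='B')
plt.legend()
plt.show()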
I have this dataframe, which contains average temps for all the summer days:
DATE TAVG
0 1955-06-01 NaN
1 1955-06-02 NaN
2 1955-06-03 NaN
3 1955-06-04 NaN
4 1955-06-05 NaN
... ... ...
5805 2020-08-27 2.067854
5806 2020-08-28 3.267854
5807 2020-08-29 3.067854
5808 2020-08-30 1.567854
5809 2020-08-31 4.167854
I want to calculate the yearly mean value so I can plot it. How could I do that?
If I understand correctly, can you try this?
df['DATE']=pd.to_datetime(df['DATE'])
df.groupby(df['DATE'].dt.year)['TAVG'].mean()
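Since the goal is to plot the yearly mean, a small follow-up sketch (assuming matplotlib is installed):
import matplotlib.pyplot as plt

df['DATE'] = pd.to_datetime(df['DATE'])
yearly_mean = df.groupby(df['DATE'].dt.year)['TAVG'].mean()

yearly_mean.plot(kind='line')   # one point per year
plt.xlabel('Year')
plt.ylabel('Mean TAVG')
plt.show()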
I am currently trying to find a way to merge specific rows of df2 to df1 based on their datetime indices in a way that avoids lookahead bias so that I can add external features (df2) to my main dataset (df1) for ML applications. The lengths of the dataframes are different, and the datetime indices aren't increasing at a constant rate. My current thought process is to do this by using nested loops and if statements, but this method would be too slow as the dataframes I am trying to do this on both have over 30000 rows each. Is there a faster way of doing this?
df1
index a b
2015-06-02 16:00:00 0 5
2015-06-05 16:00:00 1 6
2015-06-06 16:00:00 2 7
2015-06-11 16:00:00 3 8
2015-06-12 16:00:00 4 9
df2
index c d
2015-06-02 9:03:00 10 16
2015-06-02 15:12:00 11 17
2015-06-02 16:07:00 12 18
... ... ...
2015-06-12 15:29:00 13 19
2015-06-12 16:02:00 14 20
2015-06-12 17:33:00 15 21
df_combined
(Because the df2 rows near 06-05, 06-06, and 06-11 are hidden by the '...', I have just put NaN as those row values to make the example easier to interpret.)
index a b c d
2015-06-02 16:00:00 0 5 11 17
2015-06-05 16:00:00 1 NaN NaN NaN
2015-06-06 16:00:00 2 NaN NaN NaN
2015-06-11 16:00:00 3 NaN NaN NaN
2015-06-12 16:00:00 4 9 13 19
df_combined.loc[0, ['c', 'd']] and df_combined.loc[4, ['c', 'd']] are 11,17 and 13,19 respectively instead of 12,18 and 14,20 to avoid lookahead bias because in a live scenario, those values haven't been observed yet.
IIUC, you need merge_asof. Assuming your indices are ordered in time, use direction='backward'.
print(pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='backward'))
# a b c d
# 2015-06-02 16:00:00 0 5 11 17
# 2015-06-05 16:00:00 1 6 12 18
# 2015-06-06 16:00:00 2 7 12 18
# 2015-06-11 16:00:00 3 8 12 18
# 2015-06-12 16:00:00 4 9 13 19
Note that the dates 06-05, 06-06, and 06-11 are not NaN; they take the last value in df2 available before those dates (the row for 2015-06-02 16:07:00) in your given data.
Note: if your dates are actually a column named index rather than the index, then do:
print(pd.merge_asof(df1, df2, on='index', direction='backward'))
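If you also want to cap how stale a merged value may be (often useful in a lookahead-free setup), merge_asof accepts a tolerance; a small sketch with a hypothetical one-day cutoff:
print(pd.merge_asof(df1, df2,
                    left_index=True, right_index=True,
                    direction='backward',
                    tolerance=pd.Timedelta('1D')))  # matches older than 1 day become NaN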
I have a Pandas dataset with a monthly Date-time index and a column of outstanding orders (like below):
Date        orders
1991-01-01  nan
1991-02-01  nan
1991-03-01  24
1991-04-01  nan
1991-05-01  nan
1991-06-01  nan
1991-07-01  nan
1991-08-01  34
1991-09-01  nan
1991-10-01  nan
1991-11-01  22
1991-12-01  nan
I want to linearly interpolate the values to fill the NaNs. However, it has to be applied within 6-month blocks (non-rolling). For example, one 6-month block would be all the rows between 1991-01-01 and 1991-06-01, where we would do forward and backward linear imputation such that if a block starts or ends with NaNs, the interpolation descends towards a final value of 0. For the same dataset above, here is how I would like the end result to look:
Date        orders
1991-01-01  8
1991-02-01  16
1991-03-01  24
1991-04-01  18
1991-05-01  12
1991-06-01  6
1991-07-01  17
1991-08-01  34
1991-09-01  30
1991-10-01  26
1991-11-01  22
1991-12-01  11
I am lost on how to do this in Pandas however. Any ideas?
The idea is to group per 6 months, prepend and append a 0 value to each group, interpolate, and then remove the first and last 0 values per group:
df['Date'] = pd.to_datetime(df['Date'])
# pad each 6-month group with a 0 at both ends, interpolate, then drop the padding
f = lambda x: pd.Series([0] + x.tolist() + [0]).interpolate().iloc[1:-1]
df['orders'] = (df.groupby(pd.Grouper(freq='6MS', key='Date'))['orders']
                  .transform(f))
print(df)
Date orders
0 1991-01-01 8.0
1 1991-02-01 16.0
2 1991-03-01 24.0
3 1991-04-01 18.0
4 1991-05-01 12.0
5 1991-06-01 6.0
6 1991-07-01 17.0
7 1991-08-01 34.0
8 1991-09-01 30.0
9 1991-10-01 26.0
10 1991-11-01 22.0
11 1991-12-01 11.0
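To make the helper easier to follow, here is what it does when applied to the first 6-month block on its own (values taken from the sample data):
import pandas as pd

block = pd.Series([None, None, 24, None, None, None], dtype='float')
padded = pd.Series([0] + block.tolist() + [0])     # anchor both ends at 0
print(padded.interpolate().iloc[1:-1].tolist())
# [8.0, 16.0, 24.0, 18.0, 12.0, 6.0]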
I have a time series like the following:
date value
2017-08-27 564.285714
2017-09-03 28.857143
2017-09-10 NaN
2017-09-17 NaN
2017-09-24 NaN
2017-10-01 236.857143
... ...
2018-09-02 345.142857
2018-09-09 288.714286
2018-09-16 274.000000
2018-09-23 248.142857
2018-09-30 166.428571
It corresponds to data ranging from July 2017 to November 2019, resampled by week. However, there are some weeks where the values were 0. I replaced those with missing values, and now I would like to fill them based on the values from the homologous period of a different year. For example, I have a lot of data missing for September 2017; I would like to impute those values using the values from September 2018. However, I'm a newbie and I'm not quite sure how to do it based only on a selected period. I'm working in Python, btw.
If anyone has any idea on how to do this quickly, I'd very much appreciate it.
If you are OK with the pandas library, one option is to find the week number from the date and fill the NaN values.
# ISO week number as a zero-padded string
df['week'] = pd.to_datetime(df['date'], format='%Y-%m-%d').dt.strftime("%V")
# sort by week, back-fill from the same week of another year, restore date order
df2 = df.sort_values(['week']).fillna(method='bfill').sort_values(['date'])
df2
which will give you the following output.
date value week
0 2017-08-27 564.285714 34
1 2017-09-03 28.857143 35
2 2017-09-10 288.714286 36
3 2017-09-17 274.000000 37
4 2017-09-24 248.142857 38
5 2017-10-01 236.857143 39
6 2018-09-02 345.142857 35
7 2018-09-09 288.714286 36
8 2018-09-16 274.000000 37
9 2018-09-23 248.142857 38
10 2018-09-30 166.428571 39
In Pandas, assuming you have already built a value_last_year column holding the homologous value from another year:
df['value'] = df['value'].fillna(df['value_last_year'])
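A minimal sketch of one way to build such a value_last_year column, assuming date/value columns as in the question and that the same ISO week of the previous year is what you want (week and year are hypothetical helper columns):
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].dt.isocalendar().week
df['year'] = df['date'].dt.year

# take each row's value and shift its year forward by one,
# so it lines up with the same week of the following year
prev = df[['year', 'week', 'value']].copy()
prev['year'] += 1
prev = prev.rename(columns={'value': 'value_last_year'})

df = df.merge(prev, on=['year', 'week'], how='left')
df['value'] = df['value'].fillna(df['value_last_year'])
To fill from the following year instead (as in the September 2017 / September 2018 example), subtract 1 from prev['year'] rather than adding 1.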
I have a timeseries for different categories
cat date price
A 2000-01-01 100
A 2000-02-01 101
...
A 2010-12-01 140
B 2000-01-01 10
B 2000-02-01 10.4
...
B 2010-12-01 11.1
...
Z 2010-12-01 13.1
I need to compute returns on all assets, which is very quick using
df['ret'] = df['price'] / df['price'].shift(1) - 1
However, that also computes incorrect returns for the first element of each company (besides A) based on the last observation of the previous company. Therefore, I want to NaN the first observation in each category.
It is easy to get these observations using
df.groupby('cat')['ret'].first()
but I am a bit lost on how to set them.
df.groupby('cat')['ret'].first() = np.NaN
and
df.loc[df.groupby('cat')['ret'].first(), 'ret']=np.NaN
did not lead anywhere.
To set the first value per group to a missing value, use Series.duplicated:
df.loc[~df['cat'].duplicated(), 'ret']=np.NaN
But it seems you need DataFrame.sort_values with GroupBy.pct_change:
df = df.sort_values(['cat','date'])
df['ret1'] = df.groupby('cat')['price'].pct_change()
Your solution can be changed to use DataFrameGroupBy.shift:
df['ret2'] = df['price'] / df.groupby('cat')['price'].shift(1) - 1
print (df)
cat date price ret1 ret2
0 A 2000-01-01 100.0 NaN NaN
1 A 2000-02-01 101.0 0.010000 0.010000
2 A 2010-12-01 140.0 0.386139 0.386139
3 B 2000-01-01 10.0 NaN NaN
4 B 2000-02-01 10.4 0.040000 0.040000
5 B 2010-12-01 11.1 0.067308 0.067308
6 Z 2010-12-01 13.1 NaN NaN
Try this
df.sort_values('date').groupby('cat')['price'].pct_change()
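Note that the result keeps the original row index (alignment is by index), so it can be assigned straight back; a small sketch using the column names from the question:
df['ret'] = df.sort_values('date').groupby('cat')['price'].pct_change()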