I have a pandas dataframe that looks like this:
year week val1 val2
0 2017 45 10.1 20.2
0 2017 48 10.3 20.3
0 2017 49 10.4 20.4
0 2017 52 10.3 20.5
0 2018 1 10.1 20.2
0 2018 2 10.3 20.3
0 2018 5 10.4 20.4
0 2018 9 10.3 20.5
....
Notice that the weeks are not contiguous. What is the best way to fill in the missing rows, with val1 and val2 as NaN? E.g. so that my years run from 2017 to 2018 and my weeks run 45-52 and then 1-9.
Thanks so much.
You can groupby year and then reindex with the union of existing and missing values:
import numpy as np

(df.set_index("week")
   .groupby("year")
   .apply(lambda x: x.reindex(x.index.union(np.arange(x.index.min(), x.index.max() + 1))))
   .drop(columns="year")  # positional .drop("year", 1) was removed in pandas 2.0
   .reset_index()
   .rename(columns={"level_1": "week"}))
year week val1 val2
0 2017 45 10.1 20.2
1 2017 46 nan nan
2 2017 47 nan nan
3 2017 48 10.3 20.3
4 2017 49 10.4 20.4
5 2017 50 nan nan
6 2017 51 nan nan
7 2017 52 10.3 20.5
8 2018 1 10.1 20.2
9 2018 2 10.3 20.3
10 2018 3 nan nan
11 2018 4 nan nan
12 2018 5 10.4 20.4
13 2018 6 nan nan
14 2018 7 nan nan
15 2018 8 nan nan
16 2018 9 10.3 20.5
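If you'd rather avoid apply, here is a sketch of the same fill built from an explicit MultiIndex and reindex (assuming, as above, you want min-to-max weeks within each year):

full = pd.MultiIndex.from_tuples(
    [(y, w)
     for y, s in df.groupby('year')['week']
     for w in range(s.min(), s.max() + 1)],
    names=['year', 'week'])
out = df.set_index(['year', 'week']).reindex(full).reset_index()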
I'd create a reference dataframe and merge
ref = pd.DataFrame(
    [[y, w] for y, s in df.groupby('year').week
            for w in range(s.min(), s.max() + 1)],
    columns=['year', 'week']
)
ref.merge(df, how='left')
year week val1 val2
0 2017 45 10.1 20.2
1 2017 46 NaN NaN
2 2017 47 NaN NaN
3 2017 48 10.3 20.3
4 2017 49 10.4 20.4
5 2017 50 NaN NaN
6 2017 51 NaN NaN
7 2017 52 10.3 20.5
8 2018 1 10.1 20.2
9 2018 2 10.3 20.3
10 2018 3 NaN NaN
11 2018 4 NaN NaN
12 2018 5 10.4 20.4
13 2018 6 NaN NaN
14 2018 7 NaN NaN
15 2018 8 NaN NaN
16 2018 9 10.3 20.5
I'd make use of pandas' Time Series / Date functionality: combine the year and week columns into a datetime index, then resample the dataframe, with something like:
# "%Y %W %w" with day "3" pins each row to the Wednesday of its week
df.index = pd.to_datetime(
    df.year.map(str) + " " + df.week.map(str) + " 3",
    format="%Y %W %w"
)
df = df.resample("W").mean()
df["year"] = df.index.year
df["week"] = df.index.week
Note that your index is overwritten.
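On recent pandas (2.0+) the .week accessor has been removed; a sketch of the replacement (note ISO week numbers can differ slightly from the "%W" convention used above):

df["week"] = df.index.isocalendar().week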
Given this dataframe:
print(df)
0 1 2
0 354.7 April 4.0
1 55.4 August 8.0
2 176.5 December 12.0
3 95.5 February 2.0
4 85.6 January 1.0
5 152 July 7.0
6 238.7 June 6.0
7 104.8 March 3.0
8 283.5 May 5.0
9 278.8 November 11.0
10 249.6 October 10.0
11 212.7 September 9.0
If I sort by column 2 using df.sort_values('2'), I get:
0 1 2
4 85.6 January 1.0
3 95.5 February 2.0
7 104.8 March 3.0
0 354.7 April 4.0
8 283.5 May 5.0
6 238.7 June 6.0
5 152.0 July 7.0
1 55.4 August 8.0
11 212.7 September 9.0
10 249.6 October 10.0
9 278.8 November 11.0
2 176.5 December 12.0
Is there a smart way to re-define the index column (from 0 to 11) preserving the new order I got?
Use reset_index:
df.sort_values('2').reset_index(drop=True)
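With the frame above, this gives:

        0          1     2
0    85.6    January   1.0
1    95.5   February   2.0
2   104.8      March   3.0
3   354.7      April   4.0
4   283.5        May   5.0
5   238.7       June   6.0
6   152.0       July   7.0
7    55.4     August   8.0
8   212.7  September   9.0
9   249.6    October  10.0
10  278.8   November  11.0
11  176.5   December  12.0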
Alternatively, this will replace the contents of the original dataframe in place:
df[:] = df.sort_values('2').values
I have pivoted the Customer ID against their year of purchase, so that I know how many times each customer purchased in different years:
Customer ID 1996 1997 ... 2019 2020
100000000000001 7 7 ... NaN NaN
100000000000002 8 8 ... NaN NaN
100000000000003 7 4 ... NaN NaN
100000000000004 NaN NaN ... 21 24
100000000000005 17 11 ... 18 NaN
My desired result is to append columns holding the latest year of purchase and, from that, the number of years since the last purchase:
Customer ID 1996 1997 ... 2019 2020 Last Recency
100000000000001 7 7 ... NaN NaN 1997 23
100000000000002 8 8 ... NaN NaN 1997 23
100000000000003 7 4 ... NaN NaN 1997 23
100000000000004 NaN NaN ... 21 24 2020 0
100000000000005 17 11 ... 18 NaN 2019 1
Here is what I tried:
df_pivot["Last"] = 2020
k = 2020
while math.isnan(df_pivot2[k]):
df_pivot["Last"] = k-1
k = k-1
df_pivot["Recency"] = 2020 - df_pivot["Last"]
However, what I get is "TypeError: cannot convert the series to <class 'float'>".
Could anyone help me to get the result I need?
Thanks a lot!
Dennis
You can get the last year of purchase using notna + cumsum + idxmax along axis=1, then subtract it from the max year to compute Recency:
c = df.filter(regex=r'\d+').columns  # the year columns
# cumsum of notna peaks at the last non-NaN column; idxmax returns that label
df['Last'] = df[c].notna().cumsum(axis=1).idxmax(axis=1)
df['Recency'] = c.max() - df['Last']
Customer ID 1996 1997 2019 2020 Last Recency
0 100000000000001 7.0 7.0 NaN NaN 1997 23
1 100000000000002 8.0 8.0 NaN NaN 1997 23
2 100000000000003 7.0 4.0 NaN NaN 1997 23
3 100000000000004 NaN NaN 21.0 24.0 2020 0
4 100000000000005 17.0 11.0 18.0 NaN 2019 1
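If the pivoted year labels come out as strings rather than ints, cast before subtracting (a hedged tweak, assuming labels like '1996'):

df['Recency'] = c.astype(int).max() - df['Last'].astype(int)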
One idea is to apply applymap(float) to your DataFrame; see the pandas documentation for DataFrame.applymap.
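As a minimal sketch (applymap is element-wise, so this only guarantees float cells; it does not by itself make math.isnan accept a whole Series):

df_pivot = df_pivot.applymap(float)  # on pandas 2.1+ the same method is DataFrame.map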
Year Week_Number DC_Zip Asin_code
1 2016 1 84105 NaN
2 2016 1 85034 NaN
3 2016 1 93711 NaN
4 2016 1 98433 NaN
5 2016 2 12206 21.0
6 2016 2 29306 10.0
7 2016 2 33426 11.0
8 2016 2 37206 1.0
9 2017 1 12206 266.0
10 2017 1 29306 81.0
11 2017 1 33426 NaN
12 2017 1 37206 NaN
13 2017 1 45216 99.0
14 2017 1 60160 100.0
15 2017 1 76110 76.0
16 2018 1 12206 562.0
17 2018 1 29306 184.0
18 2018 1 33426 NaN
19 2018 1 37206 NaN
20 2018 1 45216 187.0
21 2018 1 60160 192.0
22 2018 1 76110 202.0
23 2019 1 12206 511.0
24 2019 1 29306 NaN
25 2019 1 33426 224.0
26 2019 1 37206 78.0
27 2019 1 45216 160.0
28 2019 1 60160 NaN
29 2019 1 76110 221.0
30 2020 6 93711 NaN
31 2020 6 98433 NaN
32 2020 7 12206 74.0
33 2020 7 29306 22.0
34 2020 7 33426 32.0
35 2020 7 37206 10.0
36 2020 7 45216 34.0
I want to fill the NaN values with the average of Asin_code for that particular year. I am able to fill the values for 2016 with this code:
df["Asin_code"] = df.Asin_code.fillna(df.Asin_code.loc[df.Year == 2016].mean(), axis=0)
But I am unable to do it for the whole dataframe.
Use groupby().transform() and fillna:
df['Asin_code'] = df['Asin_code'].fillna(df.groupby('Year').Asin_code.transform('mean'))
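Equivalently, you can do the fill inside the transform:

df['Asin_code'] = df.groupby('Year')['Asin_code'].transform(lambda s: s.fillna(s.mean()))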
I have a large dataset (here a link to a subset: https://drive.google.com/open?id=1o7dEsRUYZYZ2-L9pd_WFnIX1n10hSA-f) with a tstamp index (e.g. 2010-01-01 00:00:00) and the mm of rain. Measurements are taken every 5 minutes over many years:
mm
tstamp
2010-01-01 00:00:00 0.0
2010-01-01 00:05:00 0.0
2010-01-01 00:10:00 0.0
2010-01-01 00:15:00 0.0
2010-01-01 00:20:00 0.0
........
What I want to get is the count of rainy days for each month for each year. So ideally a dataframe like the following
tstamp rainy not rainy
2010-01 11 20
2010-02 20 8
......
2012-10 15 16
2012-11 30 0
What I'm able to obtain is a nested dict object like d = {year: {month: {'rainy': 10, 'not_rainy': 20}, ...}, ...}, made with this small code snippet:
from collections import defaultdict

d = defaultdict(lambda: defaultdict(dict))
for year in df.index.year.unique():
    try:
        for month in df.index.month.unique():
            a = df['{}-{}'.format(year, month)].resample('D').sum()
            d[year][month]['rainy'] = a[a['mm'] != 0].count()
            d[year][month]['not_rainy'] = a[a['mm'] == 0].count()
    except:
        pass
But I think I'm missing an easier and more straightforward solution. Any suggestion?
One way is to do two groupby operations:
# one boolean per calendar day: did any rain fall that day?
daily = df['mm'].gt(0).groupby(df.index.normalize()).any()
# count rainy / non-rainy days within each month
monthly = (daily.groupby(daily.index.to_period('M'))
                .value_counts()
                .unstack())
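The unstacked columns are the booleans False/True; to match the headers in the question, rename them (and fill months where one of the counts is absent):

monthly = monthly.rename(columns={True: 'rainy', False: 'not rainy'}).fillna(0)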
You can do this (note that I don't see any non-rainy months, so I summed the rainfall per month):
df = pd.read_csv('rain.csv')
df['tstamp'] = pd.to_datetime(df['tstamp'])
df['month'] = df['tstamp'].dt.month
df['year'] = df['tstamp'].dt.year
df = df.groupby(by=['year', 'month'], as_index=False).sum()
print(df)
Output:
year month mm
0 2010 1 1.0
1 2010 2 15.4
2 2010 3 21.8
3 2010 4 9.6
4 2010 5 118.4
5 2010 6 82.8
6 2010 7 96.0
7 2010 8 161.6
8 2010 9 109.2
9 2010 10 51.2
10 2010 11 52.4
11 2010 12 39.6
12 2011 1 5.6
13 2011 2 0.8
14 2011 3 13.4
15 2011 4 1.8
16 2011 5 97.6
17 2011 6 167.8
18 2011 7 128.8
19 2011 8 67.6
20 2011 9 155.8
21 2011 10 71.6
22 2011 11 0.4
23 2011 12 29.4
24 2012 1 17.6
25 2012 2 2.2
26 2012 3 13.0
27 2012 4 55.8
28 2012 5 36.8
29 2012 6 108.4
30 2012 7 182.4
31 2012 8 191.8
32 2012 9 89.0
33 2012 10 93.6
34 2012 11 161.2
35 2012 12 26.4
I have a dataframe where I need to do a burndown starting from the baseline and subtracting the values along the way; essentially I'm looking for the opposite of DataFrame.cumsum():
In Use
Baseline 3705.0
February 2018 0.0
March 2018 2.0
April 2018 15.0
May 2018 30.0
June 2018 14.0
July 2018 797.0
August 2018 1393.0
September 2018 86.0
October 2018 374.0
November 2018 21.0
December 2018 0.0
January 2019 0.0
February 2019 0.0
March 2019 0.0
April 2019 2.0
unknown 971.0
I cannot find a function to do so, or I'm not searching with the right tags / names.
How can this be achieved?
Use DataFrameGroupBy.diff with groups created by taking diff, comparing with lt (<), and taking the cumulative sum:
# start a new group each time the running value drops
g = df['Use'].diff().lt(0).cumsum()
# within each group, diff undoes the cumulative sum; the first row keeps its own value
df['new'] = df['Use'].groupby(g).diff().fillna(df['Use'])
print(df)
In Use new
0 Baseline 3705.0 3705.0
1 February 2018 0.0 0.0
2 March 2018 2.0 2.0
3 April 2018 15.0 13.0
4 May 2018 30.0 15.0
5 June 2018 14.0 14.0
6 July 2018 797.0 783.0
7 August 2018 1393.0 596.0
8 September 2018 86.0 86.0
9 October 2018 374.0 288.0
10 November 2018 21.0 21.0
11 December 2018 0.0 0.0
12 January 2019 0.0 0.0
13 February 2019 0.0 0.0
14 March 2019 0.0 0.0
15 April 2019 2.0 2.0
16 unknown 971.0 969.0
You can use pd.Series.diff with fillna. Here's a demo:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randint(0, 10, 5)})
df['B'] = df['A'].cumsum()
df['C'] = df['B'].diff().fillna(df['B']).astype(int)  # recovers A; exact values vary per run
print(df)
A B C
0 1 1 1
1 4 5 4
2 4 9 4
3 2 11 2
4 1 12 1