Filling missing rows in groups after groupby - python

I've got some SQL data that I'm grouping and performing some aggregation on. It works nicely:
grouped = df.groupby(['a', 'b'])
agged = grouped.aggregate({
    'c': [numpy.sum, numpy.mean, numpy.size],
    'd': [numpy.sum, numpy.mean, numpy.size]
})
and I get:
            c                         d
          sum      mean   size      sum          mean size
a  b
25 20   107.0  0.804511  133.0  5328000  40060.150376  133
   21   110.0  0.774648  142.0  6031000  42471.830986  142
   23   126.0  0.792453  159.0  8795000  55314.465409  159
   24    72.0  0.947368   76.0  2920000  38421.052632   76
   25    54.0  0.818182   66.0  2570000  38939.393939   66
26 23   126.0  0.792453  159.0  8795000  55314.465409  159
but I want to fill all of the rows that are in a=25 but not in a=26 with zeros. In other words, something like:
            c                         d
          sum      mean   size      sum          mean size
a  b
25 20   107.0  0.804511  133.0  5328000  40060.150376  133
   21   110.0  0.774648  142.0  6031000  42471.830986  142
   23   126.0  0.792453  159.0  8795000  55314.465409  159
   24    72.0  0.947368   76.0  2920000  38421.052632   76
   25    54.0  0.818182   66.0  2570000  38939.393939   66
26 20       0         0      0        0             0    0
   21       0         0      0        0             0    0
   23   126.0  0.792453  159.0  8795000  55314.465409  159
   24       0         0      0        0             0    0
   25       0         0      0        0             0    0
How can I do this?

Consider the dataframe df
df = pd.DataFrame(
    np.random.randint(10, size=(6, 6)),
    pd.MultiIndex.from_tuples(
        [(25, 20), (25, 21), (25, 23), (25, 24), (25, 25), (26, 23)],
        names=['a', 'b']
    ),
    pd.MultiIndex.from_product(
        [['c', 'd'], ['sum', 'mean', 'size']]
    )
)
        c              d
      sum mean size  sum mean size
a  b
25 20   8    3    5    5    0    2
   21   3    7    8    9    2    7
   23   2    1    3    2    5    4
   24   9    0    1    7    1    6
   25   1    9    3    5    8    8
26 23   8    8    4    8    0    5
You can quickly recover all missing rows from the Cartesian product with unstack(fill_value=0) followed by stack:
df.unstack(fill_value=0).stack()
         c              d
      mean size sum  mean size sum
a  b
25 20    3    5   8     0    2   5
   21    7    8   3     2    7   9
   23    1    3   2     5    4   2
   24    0    1   9     1    6   7
   25    9    3   1     8    8   5
26 20    0    0   0     0    0   0
   21    0    0   0     0    0   0
   23    8    4   8     0    5   8
   24    0    0   0     0    0   0
   25    0    0   0     0    0   0
Note: Using fill_value=0 preserves the dtype int. Without it, when unstacked, the gaps get filled with NaN and dtypes get converted to float
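As an alternative to the unstack/stack round trip, the same Cartesian-product fill can be sketched with reindex. This is a different technique from the one in the answer above, shown on a small made-up frame; the column name total and the sample values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: group a=26 lacks most of the b values seen under a=25.
df = pd.DataFrame(
    {"total": [8, 3, 2, 8]},
    index=pd.MultiIndex.from_tuples(
        [(25, 20), (25, 21), (25, 23), (26, 23)], names=["a", "b"]
    ),
)

# Build every (a, b) combination from the values present in each index level.
full_index = pd.MultiIndex.from_product(
    [
        df.index.get_level_values("a").unique(),
        df.index.get_level_values("b").unique(),
    ],
    names=["a", "b"],
)

# Rows that did not exist before are created and filled with 0.
filled = df.reindex(full_index, fill_value=0)
print(filled)
```

Passing fill_value=0 directly to reindex should avoid the detour through NaN, so the integer dtype is kept here as well.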

print(df)
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 23 126.0 0.792453 159.0 8795000 55314.465409 159
I like:
df = df.unstack().replace(np.nan,0).stack(-1)
print(df)
c d
mean size sum mean size sum
a b
25 20 0.804511 133.0 107.0 40060.150376 133.0 5328000.0
21 0.774648 142.0 110.0 42471.830986 142.0 6031000.0
23 0.792453 159.0 126.0 55314.465409 159.0 8795000.0
24 0.947368 76.0 72.0 38421.052632 76.0 2920000.0
25 0.818182 66.0 54.0 38939.393939 66.0 2570000.0
26 20 0.000000 0.0 0.0 0.000000 0.0 0.0
21 0.000000 0.0 0.0 0.000000 0.0 0.0
23 0.792453 159.0 126.0 55314.465409 159.0 8795000.0
24 0.000000 0.0 0.0 0.000000 0.0 0.0
25 0.000000 0.0 0.0 0.000000 0.0 0.0

Related

Python: Pandas merge three dataframes on date, keeping all dates [duplicate]

I have three dataframes
Dataframe df1:
date A
0 2022-04-11 1
1 2022-04-12 2
2 2022-04-14 26
3 2022-04-16 2
4 2022-04-17 1
5 2022-04-20 17
6 2022-04-21 14
7 2022-04-22 1
8 2022-04-23 9
9 2022-04-24 1
10 2022-04-25 5
11 2022-04-26 2
12 2022-04-27 21
13 2022-04-28 9
14 2022-04-29 17
15 2022-04-30 5
16 2022-05-01 8
17 2022-05-07 1241217
18 2022-05-08 211
19 2022-05-09 1002521
20 2022-05-10 488739
21 2022-05-11 12925
22 2022-05-12 57
23 2022-05-13 8515098
24 2022-05-14 1134576
Dataframe df2:
date B
0 2022-04-12 8
1 2022-04-14 7
2 2022-04-16 2
3 2022-04-19 2
4 2022-04-23 2
5 2022-05-07 2
6 2022-05-08 5
7 2022-05-09 2
8 2022-05-14 1
Dataframe df3:
date C
0 2022-04-12 6
1 2022-04-13 1
2 2022-04-14 2
3 2022-04-20 3
4 2022-04-21 9
5 2022-04-22 25
6 2022-04-23 56
7 2022-04-24 49
8 2022-04-25 68
9 2022-04-26 71
10 2022-04-27 40
11 2022-04-28 44
12 2022-04-29 27
13 2022-04-30 34
14 2022-05-01 28
15 2022-05-07 9
16 2022-05-08 20
17 2022-05-09 24
18 2022-05-10 21
19 2022-05-11 8
20 2022-05-12 8
21 2022-05-13 14
22 2022-05-14 25
23 2022-05-15 43
24 2022-05-16 36
25 2022-05-17 29
26 2022-05-18 28
27 2022-05-19 17
28 2022-05-20 6
I would like to merge df1, df2 and df3 into a single dataframe with columns date, A, B, C, such that date contains every date that appears in df1, df2 or df3 (without repetition). If a particular date is missing from one of the dataframes, the respective column should get the value 0.0. So, I would like to have something like this:
date A B C
0 2022-04-11 1.0 0.0 0.0
1 2022-04-12 2.0 8.0 6.0
2 2022-04-13 0.0 0.0 1.0
...
I tried to use this method
merge1 = pd.merge(df1, df2, how='outer')
sorted_merge1 = merge1.sort_values(by=['date'], ascending=False)
full_merge = pd.merge(sorted_merge1, df3, how='outer')
However, it seems it skips the dates which are not common for all three dataframes.
Try this:
print(pd.merge(df1, df2, on='date', how='outer').merge(df3, on='date', how='outer').fillna(0))
Output:
date A B C
0 2022-04-11 1.0 0.0 0.0
1 2022-04-12 2.0 8.0 6.0
2 2022-04-14 26.0 7.0 2.0
3 2022-04-16 2.0 2.0 0.0
4 2022-04-17 1.0 0.0 0.0
5 2022-04-20 17.0 0.0 3.0
6 2022-04-21 14.0 0.0 9.0
7 2022-04-22 1.0 0.0 25.0
8 2022-04-23 9.0 2.0 56.0
9 2022-04-24 1.0 0.0 49.0
10 2022-04-25 5.0 0.0 68.0
11 2022-04-26 2.0 0.0 71.0
12 2022-04-27 21.0 0.0 40.0
13 2022-04-28 9.0 0.0 44.0
14 2022-04-29 17.0 0.0 27.0
15 2022-04-30 5.0 0.0 34.0
16 2022-05-01 8.0 0.0 28.0
17 2022-05-07 1241217.0 2.0 9.0
18 2022-05-08 211.0 5.0 20.0
19 2022-05-09 1002521.0 2.0 24.0
20 2022-05-10 488739.0 0.0 21.0
21 2022-05-11 12925.0 0.0 8.0
22 2022-05-12 57.0 0.0 8.0
23 2022-05-13 8515098.0 0.0 14.0
24 2022-05-14 1134576.0 1.0 25.0
25 2022-04-19 0.0 2.0 0.0
26 2022-04-13 0.0 0.0 1.0
27 2022-05-15 0.0 0.0 43.0
28 2022-05-16 0.0 0.0 36.0
29 2022-05-17 0.0 0.0 29.0
30 2022-05-18 0.0 0.0 28.0
31 2022-05-19 0.0 0.0 17.0
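32 2022-05-20 0.0 0.0 6.0
This chains two outer merges on date and then fills the resulting NaN values with 0. Note that the rows added by the outer merges end up at the bottom, so the result is not sorted by date.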
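For more than a handful of frames, the merge chain can be written with functools.reduce. The sketch below uses three tiny made-up frames (not the data from the question) and also sorts by date at the end, which is an addition of mine rather than part of the answer above.

```python
from functools import reduce

import pandas as pd

# Hypothetical miniature versions of df1, df2 and df3.
df1 = pd.DataFrame({"date": ["2022-04-11", "2022-04-12"], "A": [1, 2]})
df2 = pd.DataFrame({"date": ["2022-04-12", "2022-04-19"], "B": [8, 2]})
df3 = pd.DataFrame({"date": ["2022-04-12", "2022-04-13"], "C": [6, 1]})

# Outer-merge the frames pairwise on "date", then replace the gaps with 0.
merged = (
    reduce(
        lambda left, right: pd.merge(left, right, on="date", how="outer"),
        [df1, df2, df3],
    )
    .fillna(0)
    .sort_values("date")
    .reset_index(drop=True)
)
print(merged)
```

Sorting ISO-formatted date strings works lexicographically; with real data it may be safer to convert the column with pd.to_datetime first.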

I can't figure out why my web scraping code isn't working

I am very new to coding and I am trying to build a web scraper for Excel so that I can transfer it to Google Sheets. Unfortunately, the code that I have written is working for other people, but not me.
This is the code I have written:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

URL = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
csv_name = 'nhl_season_stats.csv'

def get_nhl_stats(URL):
    headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
    pageTree = requests.get(URL, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    comments = pageSoup.find_all(string=lambda text: isinstance(text, Comment))
    tables = []
    for each in comments:
        if 'table' in each:
            try:
                tables.append(pd.read_html(each, header=1)[0])
            except:
                continue
    df = tables[0]
    df = df.rename(columns={'Unnamed: 1':'Team'})
    df.to_csv(csv_name, index = False)
    print(df)

get_nhl_stats(URL)
After running it, I receive this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 13, in get_nhl_stats
IndexError: list index out of range
Sorry for my bad jargon, as I am very new and very confused, but any help would be greatly appreciated!
This code works for me. Maybe the problem is in how the class "Comment" is used, or the server is not giving you the requested content:
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
csv_name = 'nhl_season_stats.csv'

def get_nhl_stats(URL):
    headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
    pageTree = requests.get(URL, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    comments = pageSoup.find_all(string=lambda text: isinstance(text, str))
    tables = []
    for each in comments:
        if 'table' in each:
            try:
                tables.append(pd.read_html(each, header=1)[0])
            except:
                continue
    df = tables[0]
    df = df.rename(columns={'Unnamed: 1':'Team'})
    df.to_csv(csv_name, index = False)
    print(df)

get_nhl_stats(URL)
output:
Rk Team AvAge GP W L OL PTS PTS% GF GA SOW SOL SRS SOS TG/G EVGF EVGA PP PPO PP% PPA PPOA PK% SH SHA PIM/G oPIM/G S S% SA SV% SO
0 1.0 Toronto Maple Leafs 29.0 6 4 2 0 8 0.667 19 17 0.0 0.0 0.33 -0.01 6.00 11 12 8 18 44.44 4 22 81.82 0 1 10.5 7.5 190 10.0 157 0.892 0
1 2.0 Montreal Canadiens 28.6 5 3 0 2 8 0.800 24 15 0.0 1.0 0.77 -0.83 7.80 14 8 6 20 30.00 6 25 76.00 4 1 11.4 10.6 180 13.3 140 0.893 0
2 3.0 Vegas Golden Knights 28.9 5 4 1 0 8 0.800 18 12 0.0 0.0 1.12 -0.08 6.00 15 8 2 18 11.11 3 18 83.33 1 1 7.2 7.2 150 12.0 125 0.904 0
3 4.0 Minnesota Wild 29.1 5 4 1 0 8 0.800 15 10 0.0 0.0 0.86 -0.14 5.00 13 9 1 23 4.35 1 16 93.75 1 0 7.6 10.4 166 9.0 147 0.932 0
4 5.0 Washington Capitals 30.1 5 3 0 2 8 0.800 18 16 1.0 1.0 0.10 -0.30 6.80 16 12 2 9 22.22 3 18 83.33 0 1 8.6 5.0 130 13.8 141 0.887 0
5 6.0 Philadelphia Flyers 27.0 5 3 1 1 7 0.700 19 15 0.0 1.0 0.36 -0.24 6.80 14 10 5 17 29.41 5 18 72.22 0 0 7.2 6.8 125 15.2 187 0.920 1
6 7.0 Colorado Avalanche 26.9 5 3 2 0 6 0.600 17 12 0.0 0.0 0.47 -0.53 5.80 7 9 10 25 40.00 3 19 84.21 0 0 8.0 10.4 147 11.6 143 0.916 1
7 8.0 Winnipeg Jets 27.9 4 3 1 0 6 0.750 13 10 0.0 0.0 1.10 0.35 5.75 11 6 2 20 10.00 4 12 66.67 0 0 10.3 14.3 119 10.9 134 0.925 0
8 9.0 New York Islanders 28.9 4 3 1 0 6 0.750 9 6 0.0 0.0 0.61 -0.14 3.75 5 5 4 20 20.00 1 15 93.33 0 0 11.5 11.0 108 8.3 114 0.947 2
9 10.0 Tampa Bay Lightning 27.7 3 3 0 0 6 1.000 13 5 0.0 0.0 1.70 -0.97 6.00 11 2 2 8 25.00 3 11 72.73 0 0 9.0 7.0 107 12.1 85 0.941 0
10 11.0 Pittsburgh Penguins 28.6 5 3 2 0 6 0.600 16 21 2.0 0.0 -0.43 0.17 7.40 10 16 5 18 27.78 5 19 73.68 1 0 7.6 7.2 152 10.5 130 0.838 0
11 12.0 New Jersey Devils 26.2 4 2 1 1 5 0.625 9 10 0.0 1.0 -0.35 0.15 4.75 8 3 1 11 9.09 6 16 62.50 0 1 9.8 7.3 112 8.0 150 0.933 0
12 13.0 St. Louis Blues 28.3 4 2 1 1 5 0.625 10 14 0.0 1.0 -1.66 -0.41 6.00 10 6 0 14 0.00 8 21 61.90 0 0 11.0 7.5 109 9.2 129 0.891 0
13 14.0 Boston Bruins 28.8 4 2 1 1 5 0.625 7 9 2.0 0.0 0.07 0.07 4.00 3 7 3 13 23.08 2 18 88.89 1 0 11.3 8.8 135 5.2 96 0.906 0
14 15.0 Arizona Coyotes 28.4 5 2 2 1 5 0.500 17 17 0.0 1.0 -0.04 0.16 6.80 11 11 5 22 22.73 5 24 79.17 1 1 10.4 9.6 144 11.8 157 0.892 0
15 16.0 Calgary Flames 28.1 3 2 0 1 5 0.833 11 6 0.0 0.0 1.14 -0.52 5.67 5 4 6 16 37.50 1 12 91.67 0 1 8.7 11.3 93 11.8 93 0.935 1
16 17.0 Edmonton Oilers 27.9 6 2 4 0 4 0.333 15 20 0.0 0.0 -0.91 -0.08 5.83 10 14 3 23 13.04 4 18 77.78 2 2 7.7 9.3 192 7.8 200 0.900 0
17 18.0 Vancouver Canucks 27.3 6 2 4 0 4 0.333 17 28 1.0 0.0 -1.34 0.33 7.50 12 17 4 26 15.38 9 31 70.97 1 2 13.3 10.7 179 9.5 222 0.874 0
18 19.0 Anaheim Ducks 28.6 5 1 2 2 4 0.400 8 13 0.0 0.0 -0.10 0.90 4.20 8 10 0 12 0.00 2 15 86.67 0 1 6.4 5.2 133 6.0 160 0.919 1
19 20.0 Columbus Blue Jackets 26.6 5 1 2 2 4 0.400 10 16 0.0 0.0 -1.19 0.01 5.20 9 15 1 11 9.09 1 10 90.00 0 0 9.0 9.4 152 6.6 169 0.905 0
20 21.0 Los Angeles Kings 28.3 4 1 1 2 4 0.500 12 13 0.0 0.0 0.43 0.68 6.25 8 10 4 17 23.53 3 21 85.71 0 0 11.0 9.0 119 10.1 121 0.893 0
21 22.0 Detroit Red Wings 29.3 5 2 3 0 4 0.400 10 14 0.0 0.0 -1.54 -0.74 4.80 9 9 1 12 8.33 4 16 75.00 0 1 11.4 9.8 130 7.7 155 0.910 0
22 23.0 San Jose Sharks 29.4 5 2 3 0 4 0.400 12 18 2.0 0.0 -1.32 -0.52 6.00 7 16 5 21 23.81 2 18 88.89 0 0 8.4 9.6 162 7.4 148 0.878 0
23 24.0 Carolina Hurricanes 27.0 3 2 1 0 4 0.667 9 6 0.0 0.0 0.26 -0.74 5.00 6 5 3 12 25.00 1 9 88.89 0 0 7.7 9.7 98 9.2 68 0.912 1
24 25.0 Florida Panthers 27.8 2 2 0 0 4 1.000 10 6 0.0 0.0 1.29 -0.71 8.00 7 3 3 8 37.50 3 5 40.00 0 0 5.0 8.0 66 15.2 66 0.909 0
25 26.0 Nashville Predators 28.7 4 2 2 0 4 0.500 10 14 0.0 0.0 0.01 1.01 6.00 9 7 1 16 6.25 6 16 62.50 0 1 8.0 8.0 135 7.4 126 0.889 0
26 27.0 Buffalo Sabres 27.2 5 1 3 1 3 0.300 14 15 0.0 1.0 -0.18 0.22 5.80 11 14 3 17 17.65 1 6 83.33 0 0 3.8 8.2 161 8.7 133 0.887 0
27 28.0 New York Rangers 25.6 4 1 2 1 3 0.375 11 11 0.0 1.0 -0.15 0.11 5.50 7 7 4 21 19.05 4 16 75.00 0 0 8.5 14.0 140 7.9 112 0.902 1
28 29.0 Chicago Blackhawks 26.9 5 1 3 1 3 0.300 13 21 0.0 0.0 -0.43 1.17 6.80 5 16 7 17 41.18 5 20 75.00 1 0 8.0 6.8 154 8.4 167 0.874 0
29 30.0 Ottawa Senators 27.0 4 1 2 1 3 0.375 11 14 0.0 0.0 -0.04 0.71 6.25 8 10 3 18 16.67 4 21 80.95 0 0 14.3 15.3 113 9.7 120 0.883 0
30 31.0 Dallas Stars 28.8 1 1 0 0 2 1.000 7 0 0.0 0.0 7.30 0.30 7.00 1 0 5 8 62.50 0 5 100.00 1 0 10.0 16.0 28 25.0 34 1.000 1
31 NaN League Average 28.0 4 2 2 1 5 0.574 13 13 NaN NaN NaN NaN 5.94 9 9 4 16 21.33 4 16 78.67 0 0 8.0 8.0 133 9.8 133 0.902 0
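The IndexError in the question comes from tables[0] when the tables list is empty, that is, when no comment containing a table was found in the response. A minimal offline sketch of how one might check for that case is below; the helper name commented_tables and the two HTML strings are hypothetical stand-ins, not the real page.

```python
from bs4 import BeautifulSoup, Comment


def commented_tables(html):
    """Return the raw text of every HTML comment that contains a <table> tag."""
    soup = BeautifulSoup(html, "html.parser")
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    return [str(c) for c in comments if "<table" in c]


# Hypothetical responses: one with a commented-out table, one without any.
page_with_table = "<div><!-- <table><tr><td>1</td></tr></table> --></div>"
blocked_page = "<html><body>Access denied</body></html>"

found = commented_tables(page_with_table)
missing = commented_tables(blocked_page)

if not missing:
    # Indexing missing[0] here would raise IndexError, as in the question.
    print("No commented-out tables found; inspect the status code and response body.")
```

If the list comes back empty for the real URL, printing pageTree.status_code and the first few hundred characters of pageTree.text is a reasonable next step to see what the server actually returned.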

shifting down rows of specific columns from a specific index in python

I am scraping multiple tables from multiple pages of a website. The issue is there is a row missing from the initial table. Basically, this is how the dataframe looks.
mar2018 feb2018 jan2018 dec2017 nov2017 oct2017 sep2017 aug2017
balls faced 345 561 295 0 645 balls faced 200 58 0
runs scored 156 281 183 0 389 runs scored 50 20 0
strike rate 52.3 42.6 61.1 0 52.2 strike rate 25 34 0
dot balls 223 387 173 0 476 dot balls 125 34 0
fours 8 12 19 0 22 sixes 2 0 0
doubles 20 38 16 0 36 fours 4 2 0
notout 2 0 0 0 4 doubles 2 0 0
notout 4 2 0
The row 'sixes' is missing on the first page and present on the subsequent pages. So, I am trying to move the rows from 'fours' to 'notout' down by one position and leave NaNs in row 4 for the first 5 columns, from mar2018 to nov2017.
I tried the following code but it isn't working. This is moving the values horizontally but not vertically downward.
df.iloc[4][0:6] = df.iloc[4][0:6].shift(1)
and also
df2 = pd.DataFrame(index = 4)
df = pd.concat([df.iloc[:], df2, df.iloc[4:]]).reset_index(drop=True)
did not work.
df['mar2018'] = df['mar2018'].shift(1)
But this moves all the values of that column down by 1 row.
So, I was wondering if it is possible to shift down rows of specific columns from a specific index?
I think you need to reindex both DataFrames by the union of all their index values, computed with numpy.union1d:
idx = np.union1d(df1.index, df2.index)
df1 = df1.reindex(idx)
df2 = df2.reindex(idx)
print (df1)
mar2018 feb2018 jan2018 dec2017 nov2017
balls faced 345.0 561.0 295.0 0.0 645.0
dot balls 223.0 387.0 173.0 0.0 476.0
doubles 20.0 38.0 16.0 0.0 36.0
fours 8.0 12.0 19.0 0.0 22.0
notout 2.0 0.0 0.0 0.0 4.0
runs scored 156.0 281.0 183.0 0.0 389.0
sixes NaN NaN NaN NaN NaN
strike rate 52.3 42.6 61.1 0.0 52.2
print (df2)
oct2017 sep2017 aug2017
balls faced 200 58 0
dot balls 125 34 0
doubles 2 0 0
fours 4 2 0
notout 4 2 0
runs scored 50 20 0
sixes 2 0 0
strike rate 25 34 0
If you have multiple DataFrames in a list, you can use reduce with a list comprehension:
from functools import reduce
dfs = [df1, df2]
idx = reduce(np.union1d, [x.index for x in dfs])
dfs1 = [df.reindex(idx) for df in dfs]
print (dfs1)
[ mar2018 feb2018 jan2018 dec2017 nov2017
balls faced 345.0 561.0 295.0 0.0 645.0
dot balls 223.0 387.0 173.0 0.0 476.0
doubles 20.0 38.0 16.0 0.0 36.0
fours 8.0 12.0 19.0 0.0 22.0
notout 2.0 0.0 0.0 0.0 4.0
runs scored 156.0 281.0 183.0 0.0 389.0
sixes NaN NaN NaN NaN NaN
strike rate 52.3 42.6 61.1 0.0 52.2, oct2017 sep2017 aug2017
balls faced 200 58 0
dot balls 125 34 0
doubles 2 0 0
fours 4 2 0
notout 4 2 0
runs scored 50 20 0
sixes 2 0 0
strike rate 25 34 0]
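numpy.union1d returns the labels sorted alphabetically. If the original row order matters, one option is to reindex the incomplete page to the index of a page that is known to be complete. The sketch below assumes such a complete page exists (df2 here) and uses shortened made-up data.

```python
import pandas as pd

# Hypothetical first page: the "sixes" row is missing.
df1 = pd.DataFrame(
    {"mar2018": [345, 8, 2]},
    index=["balls faced", "fours", "notout"],
)
# Hypothetical later page: all rows present, in the desired order.
df2 = pd.DataFrame(
    {"oct2017": [200, 2, 4, 4]},
    index=["balls faced", "sixes", "fours", "notout"],
)

# Align the first page to the complete row order; "sixes" becomes NaN.
aligned = df1.reindex(df2.index)

# With identical indexes, the pages can be placed side by side.
combined = pd.concat([aligned, df2], axis=1)
print(combined)
```

This shifts 'fours' and 'notout' down by one row in the first page's columns only, which is the effect the question asks for.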

Pandas: create dataframe using value_counts

I have data
age
32
16
39
39
23
36
29
26
43
34
35
50
29
29
31
42
53
I need to get something like this
I can get
df.age.value_counts()
and
100. * df.age.value_counts() / len(df.age)
But how can I combine these and give names to the columns?
You can use cut with agg:
# helper df with min and max ages; the extra category Total is needed for the final row
df1 = pd.DataFrame({'G':['14 yo and younger','15-19','20-24','25-29','30-34',
                         '35-39','40-44','45-49','50-54','55-59','60-64','65+','Total'],
                    'Min':[0, 15,20,25,30,35,40,45,50,55,60,65,np.nan],
                    'Max':[14,19,24,29,34,39,44,49,54,59,64,120, np.nan]})
print (df1)
G Max Min
0 14 yo and younger 14.0 0.0
1 15-19 19.0 15.0
2 20-24 24.0 20.0
3 25-29 29.0 25.0
4 30-34 34.0 30.0
5 35-39 39.0 35.0
6 40-44 44.0 40.0
7 45-49 49.0 45.0
8 50-54 54.0 50.0
9 55-59 59.0 55.0
10 60-64 64.0 60.0
11 65+ 120.0 65.0
12 Total NaN NaN
cutoff = np.hstack([np.array(df1.Min[0]), df1.Max.values])
labels = df1.G.values
df['Groups'] = pd.cut(df.age, bins=cutoff, labels=labels, right=True, include_lowest=True)
print (df)
age Groups
0 32 30-34
1 16 15-19
2 39 35-39
3 39 35-39
4 23 20-24
5 36 35-39
6 29 25-29
7 26 25-29
8 43 40-44
9 34 30-34
10 35 35-39
11 50 50-54
12 29 25-29
13 29 25-29
14 31 30-34
15 42 40-44
16 53 50-54
# note: the nested-dict form of agg used here was removed in pandas 1.0
df = (df.groupby('Groups')['Groups']
        .agg({'Total':[len, lambda x: len(x)/df.shape[0] * 100 ]})
        .rename(columns={'len':'N', '<lambda>':'%'}))
#last Total row
df.loc['Total'] = df.sum()
print (df)
Total
N %
Groups
14 yo and younger 0.0 0.000000
15-19 1.0 5.882353
20-24 1.0 5.882353
25-29 4.0 23.529412
30-34 3.0 17.647059
35-39 4.0 23.529412
40-44 2.0 11.764706
45-49 0.0 0.000000
50-54 2.0 11.764706
55-59 0.0 0.000000
60-64 0.0 0.000000
65+ 0.0 0.000000
Total 17.0 100.000000
EDIT1:
A solution with size scales better:
df1 = df.groupby('Groups').size().to_frame()
df1.columns = pd.MultiIndex.from_arrays([['Total'], ['N']])
df1.loc[:, ('Total','%')] = 100 * df1.loc[:, ('Total','N')] / df.shape[0]
df1.loc['Total'] = df1.sum()
print (df1)
Total
N %
Groups
14 yo and younger 0.0 0.000000
15-19 1.0 5.882353
20-24 1.0 5.882353
25-29 4.0 23.529412
30-34 3.0 17.647059
35-39 4.0 23.529412
40-44 2.0 11.764706
45-49 0.0 0.000000
50-54 2.0 11.764706
55-59 0.0 0.000000
60-64 0.0 0.000000
65+ 0.0 0.000000
Total 17.0 100.000000
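The answers above were written for an older pandas. On current versions, a roughly equivalent table might be built as in the sketch below, which uses flat column names N and % instead of the two-level header; that simplification is my choice, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [32, 16, 39, 39, 23, 36, 29, 26, 43, 34, 35, 50, 29, 29, 31, 42, 53]}
)

# Bin edges and labels mirror the groups used above.
bins = [0, 14, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 120]
labels = ["14 yo and younger", "15-19", "20-24", "25-29", "30-34", "35-39",
          "40-44", "45-49", "50-54", "55-59", "60-64", "65+"]

groups = pd.cut(df["age"], bins=bins, labels=labels, include_lowest=True)

# Count per label as plain strings, then restore every label (empty groups get 0).
counts = groups.astype(str).value_counts().reindex(labels, fill_value=0)

summary = pd.DataFrame({"N": counts, "%": 100 * counts / len(df)})
summary.loc["Total"] = summary.sum()
print(summary)
```

Converting the groups to strings before counting sidesteps the need to register Total as an extra category.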

pandas series bfill first half, ffill second half

Suppose I had:
import pandas as pd
import numpy as np
np.random.seed([3,1415])
s = pd.Series(np.random.choice((0, 1, 2, 3, 4, np.nan),
(50,), p=(.1, .1, .1, .1, .1, .5)))
I want to back fill in missing values for the first half of the series and forward fill for the second half of the series. Middle out, if you will.
Expected output
0 4.0
1 4.0
2 4.0
3 4.0
4 4.0
5 0.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1.0
11 1.0
12 1.0
13 1.0
14 1.0
15 1.0
16 1.0
17 1.0
18 4.0
19 1.0
20 2.0
21 0.0
22 0.0
23 NaN
24 NaN
25 NaN
26 NaN
27 3.0
28 2.0
29 4.0
30 4.0
31 4.0
32 4.0
33 0.0
34 0.0
35 0.0
36 0.0
37 2.0
38 2.0
39 2.0
40 2.0
41 1.0
42 1.0
43 0.0
44 2.0
45 2.0
46 2.0
47 2.0
48 2.0
49 2.0
dtype: float64
I just operate on the two halves independently here:
In [71]: pd.concat([s.loc[:len(s)//2].bfill(), s.loc[1+len(s)//2:].ffill()])
Out[71]:
0 4
1 4
2 4
3 4
4 4
5 0
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 4
19 1
20 2
21 0
22 0
23 NaN
24 NaN
25 NaN
26 NaN
27 3
28 2
29 4
30 4
31 4
32 4
33 0
34 0
35 0
36 0
37 2
38 2
39 2
40 2
41 1
42 1
43 0
44 2
45 2
46 2
47 2
48 2
49 2
dtype: float64
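The same middle-out idea on a short series makes the behaviour easier to check by eye. This sketch splits by position with iloc and uses made-up values; it is an illustration of the approach rather than the exact call from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical 8-element series with gaps on both sides of the midpoint.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 2.0, np.nan, np.nan])
half = len(s) // 2

# Back fill the first half, forward fill the second half, then glue them together.
result = pd.concat([s.iloc[:half].bfill(), s.iloc[half:].ffill()])
print(result)
```

Gaps that touch the midpoint stay NaN because neither half has a value to fill them from, which matches the NaN run in the middle of the expected output above.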
