After using value_counts and some other data cleaning, I have my data in the form:
year city category count_per_city
2005 NYC 1 145
2007 ATL 1 75
2005 NYC 2 55
2006 LA 3 40
I'd like to convert it to this:
year city 1 2 3 total
2005 NYC 145 55 0 200
2006 LA 0 0 40 40
2007 ATL 75 0 0 75
I feel like there is a relatively simple way to do this that I'm missing.
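For reference, the frame above can be rebuilt with (a minimal sketch, using the column names from the question):

import pandas as pd

df = pd.DataFrame({
    'year': [2005, 2007, 2005, 2006],
    'city': ['NYC', 'ATL', 'NYC', 'LA'],
    'category': [1, 1, 2, 3],
    'count_per_city': [145, 75, 55, 40],
})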
You can use pivot_table() with margins and fill_value:
out = df.pivot_table(
    index=['year', 'city'],
    columns='category',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='total'
).drop('total')
#           count_per_city
# category               1   2   3 total
# year city
# 2005 NYC             145  55   0   200
# 2006 LA                0   0  40    40
# 2007 ATL              75   0   0    75
If you want the exact output from the OP, you can do some cleanup (thanks to @HenryEcker):
out.droplevel(0, axis=1).rename_axis(columns=None).reset_index()
# year city 1 2 3 total
# 0 2005 NYC 145 55 0 200
# 1 2006 LA 0 0 40 40
# 2 2007 ATL 75 0 0 75
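As an aside (not from the original answer), pd.crosstab wraps the same pivot logic, so a sketch along these lines should give an equivalent result:

out = pd.crosstab(
    index=[df['year'], df['city']],
    columns=df['category'],
    values=df['count_per_city'],
    aggfunc='sum',
    margins=True,
    margins_name='total',
).fillna(0).astype(int).drop('total')  # drop the margin row, keep the margin column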
Another solution using unstack (note the fill_value=0, so missing combinations come out as 0 rather than NaN):
(
    df.set_index(['year', 'city', 'category'])
      .unstack(2, fill_value=0)
      .droplevel(0, axis=1)
      .assign(total=lambda x: x.sum(axis=1))
      .reset_index()
      .rename_axis(columns=None)
)
I have the following problem and do not know how to solve it in a performant way:
Input Pandas DataFrame:
timestep  article  volume
      35        1      20
      37        2       5
     123        2      12
     155        3      10
     178        2      23
     234        1      17
     478        1      28
Output Pandas DataFrame:
timestep  volume
      35      20
      37      25
     123      32
     155      42
     178      53
     234      50
     478      61
Calculation Example for timestep 478:
28 (last article 1 volume) + 23 (last article 2 volume) + 10 (last article 3 volume) = 61
What is the best way to do this in pandas?
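For reference, the input above can be rebuilt with (a minimal sketch):

import pandas as pd

df = pd.DataFrame({
    'timestep': [35, 37, 123, 155, 178, 234, 478],
    'article':  [1, 2, 2, 3, 2, 1, 1],
    'volume':   [20, 5, 12, 10, 23, 17, 28],
})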
Try with ffill:
# sort if needed
df = df.sort_values("timestep")
# carry each article's last known volume forward, then add them up
df["volume"] = (df["volume"].where(df["article"].eq(1)).ffill().fillna(0) +
                df["volume"].where(df["article"].eq(2)).ffill().fillna(0) +
                df["volume"].where(df["article"].eq(3)).ffill().fillna(0))
output = df.drop("article", axis=1)

>>> output
   timestep  volume
0        35    20.0
1        37    25.0
2       123    32.0
3       155    42.0
4       178    53.0
5       234    50.0
6       478    61.0
Group by article, take the last element of each group, and sum:
df.groupby(['article']).tail(1)["volume"].sum()
Note that this returns only the final total (61 for the sample data), not the running per-timestep series asked for.
You can assign a group number to each run of consecutive article values with .cumsum(). Then get each group's previous-group last value by .map() with GroupBy.last(). Finally, add this previous last value to volume, as follows:
# Get group number of consecutive `article`
g = df['article'].ne(df['article'].shift()).cumsum()
# Add `volume` to previous group last
df['volume'] += g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
Result:
print(df)
   timestep  article  volume
0        35        1      20
1        37        2      25
2       123        2      32
3       178        2      43
4       234        1      40
5       478        1      51
Breakdown of steps
Previous group last values:
g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
0     0
1    20
2    20
3    20
4    23
5    23
Name: article, dtype: int64
Try:
df["new_volume"] = (
df.loc[df["article"] != df["article"].shift(-1), "volume"]
.reindex(df.index, method='ffill')
.shift()
+ df["volume"]
).fillna(df["volume"])
df
Output:
   timestep  article  volume  new_volume
0        35        1      20        20.0
1        37        2       5        25.0
2       123        2      12        32.0
3       178        2      23        43.0
4       234        1      17        40.0
5       478        1      28        51.0
Explained:
Find the last record of each group by comparing 'article' with the next row, then reindex that series to align with the original dataframe, forward-fill, and shift so each row sees the previous group's last 'volume'. Add this to the current row's 'volume', and fill the first group's NaN values with the original 'volume'.
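Finally, a hedged sketch (not from the original answers) that generalizes the idea to any number of articles: pivot to one column per article, forward-fill each article's last known volume, and sum across the columns:

# one column per article, NaN where an article posts no new value at a timestep
wide = df.pivot(index='timestep', columns='article', values='volume')
# carry each article's last known volume forward, treat never-seen as 0, sum
out = wide.ffill().fillna(0).sum(axis=1).reset_index(name='volume')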
I've been working on this problem for a little bit and am really close. Essentially, I want to create a time series of event counts by type from an event database. Here's what I've done so far:
Starting with an abbreviated version of my dataframe:
   event_date  year  time_precision event_type  \
0  2020-10-24  2020               1    Battles
1  2020-10-24  2020               1      Riots
2  2020-10-24  2020               1      Riots
3  2020-10-24  2020               1    Battles
4  2020-10-24  2020               2    Battles
I want the time series to be by month and year, so first I convert the dates to datetime:
nga_df.event_date = pd.to_datetime(nga_df.event_date)
Then, I want to create a time series of events by type, so I one-hot encode them:
nga_df = pd.get_dummies(nga_df, columns=['event_type'], prefix='', prefix_sep='')
Next, I need to extract the month, so that I can create monthly counts:
nga_df['month'] = nga_df.event_date.dt.month
Finally (and I am so close here), I group my data by month and year and take the transpose:
conflict_series = nga_df.groupby(['year','month']).sum()
conflict_series.T
Which results in this lovely new dataframe:
year                        1997                    ...  2020
month                          1   2   3    4   5   6  ...     5     6    7
fatalities                    11  30  38  112  17  29  ...  1322  1015  619
Battles                        4   4   5   13   2   2  ...    77    99   74
Explosions/Remote violence     2   1   0    0   3   0  ...    38    28   17
Protests                       1   0   0    1   0   1  ...    31    83   50
Riots                          3   3   4    1   4   1  ...    27    14   18
Strategic developments         1   0   0    0   0   0  ...     7     2    7
Violence against civilians     3   5   7    3   2   1  ...   135   112   88
So, I guess what I need to do is combine my index (columns after transpose) so that they are a single index. How do I do this?
The end goal is to combine this data with economic indicators to see if there is a trend, so I need both datasets to be in the same form, where the columns are monthly counts of different values.
Here's how I did it:
Step 1: flatten index:
# convert the multi-index to a flat set of tuples: (YYYY, MM)
index = conflict_series.index.to_flat_index().to_series()
Step 2: Add an arbitrary but required day of month for conversion to a date (28 works since every month has one):
index = index.apply(lambda x: x + (28,))
Step 3: Convert the resulting (year, month, day) 3-tuple to a date:
import datetime

index = index.apply(lambda x: datetime.date(*x))
Step 4: Set the new DataFrame index:
conflict_series.set_index(index, inplace=True)
Results:
            fatalities  Battles  Explosions/Remote violence  Protests  Riots  \
1997-01-28          11        4                           2         1      3
1997-02-28          30        4                           1         0      3
1997-03-28          38        5                           0         0      4
1997-04-28         112       13                           0         1      1
1997-05-28          17        2                           3         0      4

            Strategic developments  Violence against civilians  total_events
1997-01-28                       1                           3            14
1997-02-28                       0                           5            13
1997-03-28                       0                           7            16
1997-04-28                       0                           3            18
1997-05-28                       0                           2            11
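As an aside, the four steps above can likely be collapsed into one by building a PeriodIndex straight from the (year, month) tuples (a sketch, not from the original post):

# assumes conflict_series still has its (year, month) MultiIndex
conflict_series.index = pd.PeriodIndex(
    [pd.Period(year=y, month=m, freq='M') for y, m in conflict_series.index],
    name='date',
)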
And now the plot I was looking for:
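(The plot image itself is not reproduced here; assuming matplotlib is available, something along these lines draws it:)

import matplotlib.pyplot as plt

conflict_series.plot(figsize=(12, 6))
plt.ylabel('monthly count')
plt.show()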
   Pick   Tm             Player Pos  Age    To  AP1  PB  St  CarAV  ...  Att  Yds  TD  Rec  Yds  TD  Tkl  Int   Sk    College/Univ
0     1  CLE      Myles Garrett  DE   21  2017    0   0   0      0  ...    0    0   0    0    0   0   13  NaN  5.0       Texas A&M
1     2  CHI     Mitch Trubisky  QB   23  2017    0   0   1      0  ...   29  194   0    0    0   0  NaN  NaN  NaN  North Carolina
2     3  SFO     Solomon Thomas  DE   21  2017    0   0   1      0  ...    0    0   0    0    0   0   25  NaN  2.0        Stanford
3     4  JAX  Leonard Fournette  RB   22  2017    0   0   1      0  ...  207  822   7   25  195   1  NaN  NaN  NaN             LSU
4     5  TEN        Corey Davis  WR   22  2017    0   0   1      0  ...    0    0   0   22  227   0  NaN  NaN  NaN  West. Michigan
Given this df, I want to count the number of players per College/Univ.
So, just in this particular df, all colleges will have the value of 1.
Given a df and a college name, how can I count the number of matching rows?
You can create a boolean mask and then count the True values with sum, since True is treated as 1:
(df['College/Univ'] == 'Texas A&M').sum()
I have the following dataframe in pandas:
df
DAY YEAR REGION VALUE
1 2000 A 12
2 2000 A 10
3 2000 A 13
6 2000 A 15
1 2001 A 3
2 2001 A 40
3 2001 A 83
4 2001 A 95
1 2000 B 124
3 2000 B 102
5 2000 B 131
8 2000 B 150
1 2001 B 30
5 2001 B 4
8 2001 B 8
9 2001 B 12
I would like to create a new data frame such that each row contains a distinct combination of YEAR and REGION. It also contains a column which sums up the VALUE for that YEAR, REGION combination and another column which provides the maximum VALUE for the YEAR, REGION combination. The result should look like:
YEAR REGION SUM_VALUE MAX_VALUE
2000 A 50 15
2001 A 221 95
2000 B 507 150
2001 B 54 30
Here is what I am doing:
new_df = pandas.DataFrame()
for yr in df.YEAR.unique():
    for reg in df.REGION.unique():
        new_df = new_df.append({'YEAR': yr}, ignore_index=True)
        new_df = new_df.append({'REGION': reg}, ignore_index=True)
However, this creates a new row each time, and is not very pythonic due to the extra for loops. Is there a better way to proceed?
Please note that this is a toy dataframe, the actual dataframe has several VALUE columns. The proposed solution should scale, without having to manually specify the names of the VALUE columns.
groupby on 'YEAR' and 'REGION' and pass a list of funcs to call using agg:
In [9]:
df.groupby(['YEAR','REGION'])['VALUE'].agg(['sum','max']).reset_index()
Out[9]:
YEAR REGION sum max
0 2000 A 50 15
1 2000 B 507 150
2 2001 A 221 95
3 2001 B 54 30
EDIT:
If you want to name the aggregated columns, use named aggregation (passing a dict of column names to a SeriesGroupBy agg was deprecated and has been removed in newer pandas versions):
In [18]:
df.groupby(['YEAR','REGION'])['VALUE'].agg(sum_VALUE='sum', max_VALUE='max').reset_index()
Out[18]:
   YEAR REGION  sum_VALUE  max_VALUE
0  2000      A         50         15
1  2000      B        507        150
2  2001      A        221         95
3  2001      B         54         30
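Since the question mentions several VALUE columns, here is a sketch (not from the original answer) that aggregates every value column at once and flattens the resulting column MultiIndex:

# keep only the key and value columns, then aggregate everything with both funcs
out = df.drop(columns='DAY').groupby(['YEAR', 'REGION']).agg(['sum', 'max'])
# columns are now a MultiIndex like ('VALUE', 'sum'); flatten them
out.columns = [f'{func}_{col}' for col, func in out.columns]
out = out.reset_index()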
In pandas I have a dataframe as follows (the first two lines form the column header; the rest are data rows)
    2012   2013  2012  2013
   women  women   men   men
0     14     43    24    45
1     34     54    35    65
and would like to get it like
        women  men
2012 0     14   24
2012 1     34   35
2013 0     43   45
2013 1     54   65
Using df.stack and df.unstack, I did not get anywhere. Any elegant solution?
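For reference, the question's frame can be rebuilt like this (a minimal sketch):

import pandas as pd

columns = pd.MultiIndex.from_arrays([
    [2012, 2013, 2012, 2013],
    ['women', 'women', 'men', 'men'],
])
df = pd.DataFrame([[14, 43, 24, 45], [34, 54, 35, 65]], columns=columns)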
In [5]: df
Out[5]:
   2012      2013
  women men women men
0     0   1     2   3
1     4   5     6   7
The idea is to first stack the first level of the columns into the index, and then swap the two index levels (pandas.DataFrame.swaplevel):
In [6]: df.stack(level=0).swaplevel(0,1,axis=0)
Out[6]:
        men  women
2012 0    1      0
2013 0    3      2
2012 1    5      4
2013 1    7      6
To get the row ordering of the desired output, append .sort_index().
df.stack is most likely what you want. See below; you do need to specify that you want the first level.
In [79]: df = pd.DataFrame(0., index=[0,1], columns=pd.MultiIndex.from_product([[2012,2013], ['women','men']]))
In [83]: df.stack(level=0)
Out[83]:
        men  women
0 2012    0      0
  2013    0      0
1 2012    0      0
  2013    0      0
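Applied to the question's actual frame (as built above), combining stack, swaplevel, and a final sort gives the requested layout:

df.stack(level=0).swaplevel(0, 1).sort_index()

        men  women
2012 0   24     14
     1   35     34
2013 0   45     43
     1   65     54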