Bar plot by grouping values in Python

I want to plot a bar chart comparing sales for two years with respect to Region and Tier.
I implemented the code below:
df.groupby(['Region','Tier'],sort=True).sum()[['Sales2015','Sales2016']].unstack().plot(kind="bar",width = .8)
But I want to show the sales of 2015 and 2016 side by side for each Tier,
e.g., on the x-axis a tick such as High should carry the 2015 and 2016 bars next to each other.
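(For reference, a minimal sketch of that layout, assuming the df built in the data-generation step below: grouping by Tier alone gives one tick per Tier with the two year bars side by side.)
import matplotlib.pyplot as plt
# one tick per Tier; the Sales2015 and Sales2016 bars sit next to each other
df.groupby('Tier')[['Sales2015', 'Sales2016']].sum().plot(kind='bar', width=.8)
plt.show()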

Data generation: I randomly generated your data using the code below:
import numpy as np
import pandas as pd
# Number of rows of demo data
demo_num = 20
# Regions
regions = ['central', 'east', 'west']
np.random.seed(9)
regions_r = np.random.choice(regions, demo_num)
# Tiers
tiers = ['hi', 'lo', 'mid']
np.random.seed(99)
tiers_r = np.random.choice(tiers, demo_num)
# Sales
sales2015 = np.array(range(demo_num)) * 100
sales2016 = np.array(range(demo_num)) * 200
# Dataframe `df` to store all above
df = pd.DataFrame({'Region': regions_r, 'Tier': tiers_r, 'Sales2015': sales2015, 'Sales2016': sales2016})
Data: the input data now looks like this:
Region Sales2015 Sales2016 Tier
0 west 0 0 lo
1 central 100 200 lo
2 west 200 400 hi
3 east 300 600 lo
4 west 400 800 hi
5 central 500 1000 mid
6 west 600 1200 hi
7 east 700 1400 lo
8 east 800 1600 hi
9 west 900 1800 lo
10 central 1000 2000 mid
11 central 1100 2200 lo
12 west 1200 2400 lo
13 east 1300 2600 hi
14 central 1400 2800 lo
15 east 1500 3000 mid
16 east 1600 3200 hi
17 east 1700 3400 mid
18 central 1800 3600 hi
19 central 1900 3800 hi
Code for visualization:
import matplotlib.pyplot as plt
import pandas as pd
# Summary statistics
df = df.groupby(['Tier', 'Region'], sort=True).sum()[['Sales2015', 'Sales2016']].reset_index(level=1, drop=False)
# Loop over Regions and visualize graphs side by side
regions = df.Region.unique().tolist()
fig, axes = plt.subplots(ncols=len(regions), nrows=1, figsize=(10, 5), sharex=False, sharey=True)
for region, ax in zip(regions, axes.ravel()):
    df.loc[df['Region'] == region].plot(ax=ax, kind='bar', title=region)
plt.tight_layout()
plt.show()
Result: the graphs now look like this; I haven't optimized font sizes etc.
Hope this helps.

Related

Drawing point to another point in R?

I have migration data from different countries to Mexico. I want to represent it as a Sankey diagram. I am new to R and am having a difficult time producing it. Can someone please help me achieve this? The destination is Mexico, so all flows converge on one point.
df
Country 2013 2014 2015 2016 Dest
UK 1200 1200 1207 1400 Mexico
China 630 700 800 940 Mexico
Canada 1000 1000 950 920 Mexico
Brazil 820 670 550 230 Mexico
France 400 200 700 700 Mexico
Australia 440 350 340 780 Mexico
Sankey diagram example:
Here is a different way to visualise your data. We use plotly to create a dynamic animation of the flow of migrants into Mexico over time.
# Need to reshape the data from wide to long and prepare it for plotly
library(dplyr)
library(tidyr)
data_long <- data %>%
  rename(source = Country, target = Dest) %>%
  pivot_longer(matches("\\d{4}"), names_to = "year") %>%
  pivot_longer(c(source, target), values_to = "country") %>%
  mutate(node_id = as.factor(country))
# Plotly Sankey diagrams need a link list that gives 0-based indices
# (instead of node names)
link_list <- data_long %>%
  select(-country) %>%
  mutate(node_id = as.integer(node_id) - 1) %>%
  pivot_wider(names_from = "name", values_from = "node_id") %>%
  arrange(year) %>%
  as.list()
# Now we're ready to plot
library(plotly)
plot_ly(
  type = "sankey",
  orientation = "h",
  node = list(
    label = levels(data_long$node_id),
    pad = 15,
    thickness = 20),
  link = link_list,
  frame = ~link_list$year) %>%
  animation_slider(currentvalue = list(prefix = "Year: "))
Sample data
data <- read.table(text = "Country 2013 2014 2015 2016 Dest
UK 1200 1200 1207 1400 Mexico
China 630 700 800 940 Mexico
Canada 1000 1000 950 920 Mexico
Brazil 820 670 550 230 Mexico
France 400 200 700 700 Mexico
Australia 440 350 340 780 Mexico", header = TRUE, check.names = FALSE)
A different option could be using ggalluvial to create an alluvial plot of the values per country over the years, with the stacked total visible per year. Here is a reproducible example:
library(ggalluvial)
library(dplyr)
library(ggplot2)
library(tidyr)
library(scales)
df %>%
  pivot_longer(cols = `2013`:`2016`) %>%
  mutate(value = as.numeric(gsub(",", "", value))) %>%
  ggplot(aes(x = name, y = value, alluvium = Country)) +
  geom_alluvium(aes(fill = Country, colour = Country),
                alpha = .75, decreasing = FALSE) +
  ggtitle("Migration to Mexico") +
  scale_y_continuous(breaks = pretty_breaks()) +
  theme_bw()
Created on 2022-09-04 with reprex v2.0.2

How to group ID based on 3 min intervals in pandas?

I have a dataframe that looks like this :
ID time city transport
0 1 10:20:00 London car
1 20 08:50:20 Berlin air plane
2 44 21:10:00 Paris train
3 32 10:24:00 Rome car
4 56 08:53:10 Berlin air plane
5 90 21:08:00 Paris train
.
.
.
1009 446 10:21:24 London car
I want to group these data so that rows with the same 'city' and 'transport' whose times are within ±3 minutes of each other get the same 'ID'.
I already tried pd.Grouper() like this, but it didn't work:
df['time'] = pd.to_datetime(df['time'])
df['ID'] = df.groupby([pd.Grouper(key= 'time',freq ='3min'),'city','transport'])['ID'].transform('first')
The output is the first dataframe, unchanged. One reason could be that pd.to_datetime attaches a date to 'time', and because my data is very big the dates differ, so the groupby doesn't match anything.
I couldn't figure out how to apply a ±3 minute interval in the groupby without a date being added to the 'time' column.
What I'm expecting is this :
ID time city transport
0 1 10:20:00 London car
1 20 08:50:20 Berlin air plane
2 44 21:10:00 Paris train
3 32 10:24:00 Rome car
4 20 08:53:10 Berlin air plane
5 44 21:08:00 Paris train
.
.
.
1009 1 10:21:24 London car
I have been struggling with this question for a while and I'd really appreciate any help.
Thanks in advance.
from datetime import datetime

def convert(seconds):
    """Split a number of seconds into (hours, minutes, seconds)."""
    seconds = seconds % (24 * 3600)
    hour = seconds // 3600
    seconds %= 3600
    minutes = seconds // 60
    seconds %= 60
    return hour, minutes, seconds

def get_sec(h, m, s):
    """Get seconds from time components; missing components count as 0."""
    h = 0 if h is None else h
    m = 0 if m is None else m
    s = 0 if s is None else s
    return int(h) * 3600 + int(m) * 60 + int(s)

df['time'] = df['time'].apply(lambda x: datetime.strptime(x, '%H:%M:%S') if isinstance(x, str) else x)
df = df.sort_values(by=["time"])
print(df)

prev_hour = prev_minute = prev_second = prev_id = None
for key, item in df.iterrows():
    curr_seconds = get_sec(item.time.hour, item.time.minute, item.time.second)
    prev_seconds = get_sec(prev_hour, prev_minute, prev_second)
    hour, minute, second = convert(curr_seconds - prev_seconds)
    # NB: compares consecutive rows by time only; city/transport are not checked
    if (hour == 0) and (minute <= 3) and prev_id is not None:
        df.loc[key, 'ID'] = prev_id
    prev_hour = item.time.hour
    prev_minute = item.time.minute
    prev_second = item.time.second
    prev_id = item.ID
print(df)
output:
ID time city transport
1 20 1900-01-01 08:50:20 Berlin air plane
4 20 1900-01-01 08:53:10 Berlin air plane
0 1 1900-01-01 10:20:00 London car
3 32 1900-01-01 10:24:00 Rome car
5 90 1900-01-01 21:08:00 Paris train
2 90 1900-01-01 21:10:00 Paris train
Exploring pd.Grouper()
I found it useful to insert a start time so that it's more obvious how the buckets are being generated.
Your requirement of ±3 min is most closely matched by a 6-minute bucket. That mostly meets the requirement, but ±3 min of what, exactly? (For a grouping that is literally "within 3 minutes of the previous row", see the sketch after the final output below.)
Below, I first show what has been grouped into which time bucket.
setup
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""ID  time  city  transport
0  1  10:20:00  London  car
1  20  08:50:20  Berlin  air plane
2  44  21:10:00  Paris  train
3  32  10:24:00  Rome  car
4  56  08:53:10  Berlin  air plane
5  90  21:08:00  Paris  train
6  33  05:08:22  Paris  train"""), sep=r"\s\s+", engine="python")
# force in an origin so the grouper generates a bucket every X mins from midnight, with no seconds...
df = pd.concat([pd.DataFrame({"time": [pd.Timedelta(0)], "dummy": [True]}), df]).assign(dummy=lambda dfa: dfa.dummy.fillna(False))
df = df.assign(td=pd.to_timedelta(df.time))
analysis
### DEBUGGER ### - see whats being grouped...
df.groupby([pd.Grouper(key="td", freq="6min"), "city","transport"]).agg(lambda x: list(x) if len(x)>0 else np.nan).dropna()
You can see that two of the time buckets group more than one ID:
                                                       time                      dummy           ID
(Timedelta('0 days 05:06:00'), 'Paris', 'train')       ['05:08:22']              [False]         [33.0]
(Timedelta('0 days 08:48:00'), 'Berlin', 'air plane')  ['08:50:20', '08:53:10']  [False, False]  [20.0, 56.0]
(Timedelta('0 days 10:18:00'), 'London', 'car')        ['10:20:00']              [False]         [1.0]
(Timedelta('0 days 10:24:00'), 'Rome', 'car')          ['10:24:00']              [False]         [32.0]
(Timedelta('0 days 21:06:00'), 'Paris', 'train')       ['21:10:00', '21:08:00']  [False, False]  [44.0, 90.0]
solution
# finally, group on double the window (6 min). NB: this is not truly +/- 3 min; it links rows that fall into the same bucket
(df.assign(ID=lambda dfa: dfa
.groupby([pd.Grouper(key= 'td',freq ='6min'),'city','transport'])['ID']
.transform('first'))
# cleanup... NB needs changing if dummy row is not inserted
.query("not dummy")
.drop(columns=["td","dummy"])
.assign(ID=lambda dfa: dfa.ID.astype(int))
)
     time  ID    city  transport
 10:20:00   1  London        car
 08:50:20  20  Berlin  air plane
 21:10:00  44   Paris      train
 10:24:00  32    Rome        car
 08:53:10  20  Berlin  air plane
 21:08:00  44   Paris      train
 05:08:22  33   Paris      train
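A fixed bucket can still split rows that are only 2 minutes apart across two buckets. If the requirement is literally "within 3 minutes of the previous row for the same city/transport pair", a sort/diff/cumsum sketch avoids buckets entirely (assuming the df from the setup above):
import pandas as pd

# starting from the setup df, drop the helper row/columns first
df = df.query("not dummy").drop(columns=["td", "dummy"])
df['td'] = pd.to_timedelta(df['time'])
df = df.sort_values(['city', 'transport', 'td'])
# a gap of more than 3 minutes to the previous row starts a new chain
gap = (df.groupby(['city', 'transport'])['td'].diff() > pd.Timedelta('3min')).astype(int)
chain = gap.groupby([df['city'], df['transport']]).cumsum()
# every row in a chain inherits the ID of the chain's first row
df['ID'] = df.groupby(['city', 'transport', chain])['ID'].transform('first')
df = df.drop(columns='td').sort_index()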

How to automate labeling of data in matplotlib?

I would like to find a shortcut to labeling data since I am working with a large data set.
Here's the data I'm charting from the large data set:
Nationality
Afghanistan 4
Albania 40
Algeria 60
Andorra 1
Angola 15
...
Uzbekistan 2
Venezuela 67
Wales 129
Zambia 9
Zimbabwe 13
Name: count, Length: 164, dtype: int64
And so far this is my code:
import pandas as pd
import matplotlib.pyplot as plt
the_data = pd.read_csv('fifa_data.csv')
plt.title('Percentage of Players from Each Country')
the_data['count'] = 1
Nations = the_data.groupby(['Nationality']).count()['count']
plt.pie(Nations)
plt.show()
Creating the pie chart this way is easy and quick, but I haven't figured out how to automatically label each country in the pie chart without labeling each data point one by one.
The pandas plot function will automatically label the data for you:
# count players per nationality
Nations = the_data.groupby('Nationality').size()
# plot; pandas uses the index (the country names) as wedge labels
Nations.plot.pie()
plt.title('Percentage of Players from Each Country')
plt.show()
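Since the chart title mentions percentages: matplotlib's autopct argument (which pandas passes through to the pie plot) prints a value on each wedge, and a callable lets you hide the tiny ones. A sketch, assuming the Nations series from above:
# percentage labels on each wedge; slices under 1% are left unlabeled
Nations.plot.pie(autopct=lambda pct: f'{pct:.1f}%' if pct >= 1 else '')
plt.ylabel('')  # drop the default ylabel pandas adds from the series name
plt.title('Percentage of Players from Each Country')
plt.show()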

Find average of row and column groups pandas

I want to find the states with the highest average total revenue, and be able to see the states with the 40th-45th highest average, the 35th-40th, etc., for all states from 1992-2016.
The data is organized in a dataframe as below. Ideally I could add another column like the following; I think this is what I am trying to do:
STATE // YEAR // TOTAL_REVENUE // AVG_TOTAL_REVENUE
ALABAMA // 1992 // 5000 // 6059
ALABAMA // 1993 // 4000 // 6059
ALASKA // 1992 // 3000 // 2059
ALABAMA // 1996 // 6019 // 6059
Is this possible to do? I am not sure I am stating what I want correctly, and I'm not sure what to search for to find a way forward.
Assuming your input looks like:
STATE YEAR TOTAL_REVENUE
Michigan 2001 1000
Michigan 2002 2000
California 2003 3000
California 2004 4000
Michigan 2005 5000
Then just do:
import numpy as np

df['AVG_TOTAL_REVENUE'] = np.nan
states = list(set(df['STATE'].tolist()))
for state in states:
    state_values = df[df['STATE'] == state]
    revenues = [float(x) for x in state_values['TOTAL_REVENUE'].tolist()]
    avg = sum(revenues) / len(revenues)
    # assign via .loc on the frame to avoid chained-assignment issues
    df.loc[state_values.index, 'AVG_TOTAL_REVENUE'] = avg
which gives you:
STATE YEAR TOTAL_REVENUE AVG_TOTAL_REVENUE
0 Michigan 2001 1000 2666.666667
1 Michigan 2002 2000 2666.666667
2 California 2003 3000 3500.000000
3 California 2004 4000 3500.000000
4 Michigan 2005 5000 2666.666667
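For reference, the same column can be produced without an explicit loop; groupby plus transform broadcasts each state's mean back onto its rows (a one-line equivalent of the loop above):
# per-state mean, aligned back to the original rows
df['AVG_TOTAL_REVENUE'] = df.groupby('STATE')['TOTAL_REVENUE'].transform('mean')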
If your data is stored in a pandas dataframe called df, you can set STATE as the index and group on it:
df.set_index("STATE",inplace=True)
avg_revenue = df.groupby(level=0)["TOTAL_REVENUE"].agg("mean")
df["AVG_TOTAL_REVENUE"] = avg_revenue.loc[df.index]
df = df.sort_values(by="AVG_TOTAL_REVENUE",ascending=False)
Regarding the "40-45th highest average", I'm not sure exactly what you're looking for. But you could do this for instance:
import numpy as np
bounds = (np.array([0.40, 0.45]) * len(df)).astype(int)
df.iloc[bounds[0]:bounds[1], :]
# Or with quantiles
min_q,max_q = (0.40, 0.45)
avg = df.AVG_TOTAL_REVENUE
df.loc[(avg >= avg.quantile(min_q)) & (avg <= avg.quantile(max_q)), :]
Or maybe you want to bin your data every 5 states, in order of AVG_TOTAL_REVENUE?
df_grouped = df.groupby("STATE")["AVG_TOTAL_REVENUE"].agg("first")
n_bins = int(df_grouped.shape[0] / 5)
bins = (pd.cut(df_grouped,bins=n_bins)
.reset_index()
.groupby("AVG_TOTAL_REVENUE")
.agg(list)
)

Apply groupby on a DataFrame to display cumulative stats

Let's say I have a DataFrame that looks like this:
Bank Name House This Wk
Barc Germany 100
Barc UK 300
Barc UK 500
JPM Japan 200
JPM NYC 100
BOA LA 900
BOA LA 50
BOA LA 50
DB Italy 45
I would like to group-by Bank Name, while outputting the largest House Value as well as the total value...
For example, using the example above would result in:
Bank Name Total House This Wk
Barc 900 UK 500
JPM 300 Japan 200
BOA 1000 LA 900
DB 45 Italy 45
Essentially, it is grouping the Total by Bank Name, but also outputting the largest contributor, House, to the total and the amount contributed is This Wk.
How can I go about doing this?
In [121]: df.groupby('Bank Name', group_keys=False) \
...: .apply(lambda x: x.nlargest(1, 'This Wk').assign(Total=x['This Wk'].sum())) \
...: [['Bank Name','Total','House','This Wk']]
...:
Out[121]:
Bank Name Total House This Wk
5 BOA 1000 LA 900
2 Barc 900 UK 500
8 DB 45 Italy 45
3 JPM 300 Japan 200
You can consider df.groupby with a list of DataFrameGroupBy.agg functions:
In [732]: out = df.groupby('Bank Name')['This Wk'].agg(['sum', 'idxmax', 'max'])\
.rename(columns={'sum' : 'Total', 'idxmax' : 'House', 'max' : 'This Wk'})\
.reset_index()
In [734]: out['House'] = df.loc[out['House'], 'House'].values; out
Out[734]:
Bank Name Total House This Wk
0 BOA 1000 LA 900
1 Barc 900 UK 500
2 DB 45 Italy 45
3 JPM 300 Japan 200
Another way, using apply, would be:
In [17]: (df.groupby('Bank Name', sort=False)
.apply(lambda x: pd.Series(
[x['This Wk'].sum(),
x.loc[x['This Wk'].idxmax(), 'House'],
x['This Wk'].max()],
index=['Total', 'House', 'This Wk']))
.reset_index())
Out[17]:
Bank Name Total House This Wk
0 Barc 900 UK 500
1 JPM 300 Japan 200
2 BOA 1000 LA 900
3 DB 45 Italy 45
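On pandas 0.25+ the same result can also be written with named aggregation, which spells out the output columns up front. A sketch equivalent to the agg/idxmax answer above:
# Total = sum per bank; idx = row label of the largest 'This Wk' entry
out = (df.groupby('Bank Name', sort=False)
         .agg(Total=('This Wk', 'sum'), idx=('This Wk', 'idxmax'))
         .reset_index())
# pull the House and amount belonging to that largest row
out['House'] = df.loc[out['idx'], 'House'].values
out['This Wk'] = df.loc[out['idx'], 'This Wk'].values
out = out.drop(columns='idx')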
