I have a dataframe like so:
switch_summary duration count
McDonalds -> Arbys -> McDonalds 0.067 1
Wendys -> Popeyes -> McDonalds -> KFC 0.293 1
Arbys -> Wendys -> Popeyes -> McDonalds 0.542 2
Arbys -> McDonalds -> KFC 1.075 1
KFC -> Arbys -> Wendys -> Popeyes 2.123 3
KFC -> Wendys -> Popeyes -> Arbys 2.297 1
I want to create a seaborn plot to visualize both the duration and the count.
I have the following code:
plt.figure(figsize = [15,7])
ax = (sns.barplot(x = 'duration',
y = 'switch_summary',
palette = sns.color_palette('winter', 10),
data = df))
for p in ax.patches:
    width = p.get_width()
    plt.text(0.1 + p.get_width(), p.get_y() + 0.55 * p.get_height(),
             '~{:1.2f}\n days'.format(width),
             ha = 'center', va = 'center', color = 'black', weight = 'bold')
ax = ax.set(title = 'Top 10 Fastest Trends',
xlabel = 'Total Duration in Trend',
ylabel = 'Restaurant Trend')
This code will display the duration but I also want to display the count.
How can I display the count in the plt.text() portion of the code?
One quick workaround is to enumerate the patches and look up the count by position. This assumes the bars are drawn in the same order as the dataframe rows.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
data = {'switch_summary': ['McDonalds -> Arbys -> McDonalds ', 'Wendys -> Popeyes -> McDonalds -> KFC', 'Arbys -> Wendys -> Popeyes -> McDonalds ', 'Arbys -> McDonalds -> KFC', 'KFC -> Arbys -> Wendys -> Popeyes', 'KFC -> Wendys -> Popeyes -> Arbys '],
'duration': [0.067, 0.293, 0.542, 1.075, 2.123, 2.297],
'count': [1, 1, 2,1,3,1]
}
df = pd.DataFrame(data)
plt.figure(figsize = [15,7])
ax = (sns.barplot(x = 'duration',
y = 'switch_summary',
palette = sns.color_palette('winter', 10),
data = df))
for i, p in enumerate(ax.patches):
    width = p.get_width()
    # .iloc looks the count up by position, so it works whatever the index is
    plt.text(0.1 + p.get_width(), p.get_y() + 0.55 * p.get_height(),
             '~{:1.2f}\n days \n count - {}'.format(width, df['count'].iloc[i]),
             ha = 'center', va = 'center', color = 'black', weight = 'bold')
ax = ax.set(title = 'Top 10 Fastest Trends',
xlabel = 'Total Duration in Trend',
ylabel = 'Restaurant Trend')
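The label text can also be built separately from the plotting loop, which makes it easy to check that each duration is paired with the right count before anything is drawn. This is a minimal sketch in plain Python; `make_labels` is a hypothetical helper, not part of seaborn or matplotlib, and it assumes the two lists are in the same order as the bars.

```python
# Hypothetical helper: build one label per bar from parallel lists of
# durations and counts (assumed to be in the same order as the bars).
def make_labels(durations, counts):
    if len(durations) != len(counts):
        raise ValueError("durations and counts must have the same length")
    return ['~{:1.2f}\n days \n count - {}'.format(d, c)
            for d, c in zip(durations, counts)]

durations = [0.067, 0.293, 0.542, 1.075, 2.123, 2.297]
counts = [1, 1, 2, 1, 3, 1]
labels = make_labels(durations, counts)
print(repr(labels[2]))  # label for the third bar
```

In the plotting loop, iterating with `zip(ax.patches, labels)` could then replace the inline `format` call, again assuming one patch per dataframe row.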
I have a dataframe that will be visualized. This is the code to obtain that dataframe:
zonasi = (df.groupby('kodya / kab')['customer'].nunique())
zonasi
this is the output from the code above:
kab bandung 1
kab bandung barat 4
kab banyumas 2
kab batang 1
kab bekasi 29
kab bogor 13
kab kudus 11
kab tangerang 15
kab tegal 2
kota adm jakarta barat 14
kota adm jakarta pusat 6
kota adm jakarta selatan 10
kota adm jakarta timur 23
kota adm jakarta utara 9
kota balikpapan 1
kota bandung 12
kota bekasi 12
kota semarang 11
kota surabaya 3
kota surakarta 2
kota tangerang 10
kota tasikmalaya 2
no data 44
I want to visualize the output as a pie chart, but because 'kodya / kab' has many unique values, the labels overlap. So I want to try using explode to draw the pie chart as a donut chart.
I tried using this code:
`#colors
colors = sns.color_palette('husl')
#explosion
explode = (0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05)
plt.pie(zonasi, colors = colors, autopct='%.2f%%', startangle = 90, pctdistance = 0.85, explode = explode)
#draw circle
centre_circle = plt.Circle((0, 0), 0.70,fc = 'white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
#Equal aspect ratio ensures that pie is drawn as a circle
ax.axis('equal')
plt.tight_layout()
plt.show()`
but it returns this error:
'explode' must be of length 'x'
The thing is, I want to reuse this visualization code on different dataframes, so the labels will differ from one to the next. How can I define the explode variable so that it adjusts to the number of labels automatically?
This is the example of what my output will look like:
Thank you in advance for the help.
You could do this:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
data = {'kodya / kab': ['kab bandung', 'kab bandung barat', 'kab banyumas', 'kab batang', 'kab bekasi', 'kab bogor', 'kab kudus', 'kab tangerang', 'kab tegal', 'kota adm jakarta barat', 'kota adm jakarta pusat', 'kota adm jakarta selatan', 'kota adm jakarta timur', 'kota adm jakarta utara', 'kota balikpapan', 'kota bandung', 'kota bekasi', 'kota semarang', 'kota surabaya', 'kota surakarta', 'kota tangerang', 'kota tasikmalaya', 'no data'],
'customer': [1, 4, 2, 1, 29, 13, 11, 15, 2, 14, 6, 10, 23, 9, 1, 12, 12, 11, 3, 2, 10, 2, 44]}
zonasi = pd.DataFrame(data)
zonasi.set_index('kodya / kab', inplace=True) # set the index to 'kodya / kab'
colors = sns.color_palette('husl')
explode = np.zeros(len(zonasi))
explode[1:5] = 0.1
zonasi.plot.pie(y='customer', colors=colors, autopct='%.2f%%', startangle=90, pctdistance=0.85, explode=explode,legend=False)
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.axis('equal')
plt.tight_layout()
plt.show()
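To address the original error directly, the explode sequence can be sized from the data instead of being typed out by hand. This is a minimal sketch in plain Python. `make_explode` and its parameters are hypothetical names, and `customers` stands in for whatever values are passed to the pie call.

```python
def make_explode(values, offset=0.05, highlight=None, highlight_offset=0.1):
    """Return one explode fraction per wedge.

    values: the sequence that will be passed to the pie call.
    highlight: optional iterable of wedge positions to pull out further.
    """
    explode = [offset] * len(values)
    for i in (highlight or []):
        explode[i] = highlight_offset
    return explode

customers = [1, 4, 2, 1, 29, 13, 11, 15, 2, 14, 6, 10,
             23, 9, 1, 12, 12, 11, 3, 2, 10, 2, 44]
explode = make_explode(customers, highlight=[4, 22])
print(len(explode) == len(customers))  # the lengths match by construction
```

The result can be passed as `explode=explode`. Because its length is derived from `values`, the same code should work for a dataframe with any number of labels.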
Say I have the following dataframes:
Earthquakes:
latitude longitude place year
0 36.087000 -106.168000 New Mexico 1973
1 33.917000 -90.775000 Mississippi 1973
2 37.160000 -104.594000 Colorado 1973
3 37.148000 -104.571000 Colorado 1973
4 36.500000 -100.693000 Oklahoma 1974
… … … … …
13941 36.373500 -96.818700 Oklahoma 2016
13942 36.412200 -96.882400 Oklahoma 2016
13943 37.277167 -98.072667 Kansas 2016
13944 36.939300 -97.896000 Oklahoma 2016
13945 36.940500 -97.906300 Oklahoma 2016
and Wells:
LAT LONG BBLS Year
0 36.900324 -98.218260 300.0 1977
1 36.896636 -98.177720 1000.0 2002
2 36.806113 -98.325840 1000.0 1988
3 36.888589 -98.318530 1000.0 1985
4 36.892128 -98.194620 2400.0 2002
… … … … …
11117 36.263285 -99.557631 1000.0 2007
11118 36.263220 -99.548647 1000.0 2007
11119 36.520160 -99.334183 19999.0 2016
11120 36.276728 -99.298563 19999.0 2016
11121 36.436857 -99.137391 60000.0 2012
How can I make a line graph showing the total BBLS per year (from Wells) and the number of earthquakes per year (from Earthquakes), where the x-axis shows the year from 1980 onward, the y1-axis shows the sum of BBLS per year, and the y2-axis shows the number of earthquakes?
I believe I need a groupby with count (for earthquakes) and sum (for BBLS) to make the plot, but I have tried many approaches and can't work out how to do it.
The only one that kinda worked was the line graph for earthquakes as follows:
Earthquakes.pivot_table(index=['year'],columns='type',aggfunc='size').plot(kind='line')
Still, nothing has worked for the BBLS line graph:
Wells.pivot_table(index=['Year'],columns='BBLS',aggfunc='count').plot(kind='line')
This one did not work either:
plt.plot(Wells['Year'].values, Wells['BBL'].values, label='Barrels Produced')
plt.legend() # Plot legends (the two labels)
plt.xlabel('Year') # Set x-axis text
plt.ylabel('Earthquakes') # Set y-axis text
plt.show() # Display plot
Nor did this one, taken from another thread:
fig, ax = plt.subplots(figsize=(10,8))
Earthquakes.plot(ax = ax, marker='v')
ax.title.set_text('Earthquakes and Injection Wells')
ax.set_ylabel('Earthquakes')
ax.set_xlabel('Year')
ax.set_xticks(Earthquakes['year'])
ax2=ax.twinx()
ax2.plot(Wells.Year, Wells.BBL, color='c',
linewidth=2.0, label='Number of Barrels', marker='o')
ax2.set_ylabel('Annual Number of Barrels')
lines_1, labels_1 = ax.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()
lines = lines_1 + lines_2
labels = labels_1 + labels_2
ax.legend(lines, labels, loc='upper center')
Input data:
>>> df2 # Earthquakes
year
0 2007
1 1974
2 1979
3 1992
4 2006
.. ...
495 2002
496 2011
497 1971
498 1977
499 1985
[500 rows x 1 columns]
>>> df1 # Wells
BBLS year
0 16655 1997
1 7740 1998
2 37277 2000
3 20195 2014
4 11882 2018
.. ... ...
495 30832 1981
496 24770 2018
497 14949 1980
498 24743 1975
499 46933 2019
[500 rows x 2 columns]
Prepare data to plot:
data1 = df2.value_counts("year").sort_index().rename("Earthquakes")  # earthquakes per year
data2 = df1.groupby("year")["BBLS"].sum()  # total BBLS per year
Simple plot:
ax1 = data1.plot(legend=True, color="blue")
ax2 = data2.plot(legend=True, color="red", ax=ax1.twinx())
Now you can customize the two axes as needed.
A more controlled chart:
import matplotlib.pyplot as plt
# Figure and axis
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
# Data
line1, = ax1.plot(data1.index, data1.values, label="Earthquakes", color="b")
line2, = ax2.plot(data2.index, data2.values / 10**6, label="Barrels", color="r")
# Legend
lines = [line1, line2]
ax1.legend(lines, [line.get_label() for line in lines])
# Titles
ax1.set_title("")
ax1.set_xlabel("Year")
ax1.set_ylabel("Earthquakes")
ax2.set_ylabel("Barrels Produced (MMbbl)")
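For readers unsure what the two aggregations compute, here is a small plain-Python sketch of the same idea: counting rows per year for earthquakes and summing BBLS per year for wells. The sample records are made up for illustration and are not taken from the question's data.

```python
from collections import Counter, defaultdict

# Made-up sample records: a year per earthquake, (year, BBLS) per well.
earthquake_years = [1980, 1980, 1981, 1983, 1983, 1983]
wells = [(1980, 300.0), (1980, 1000.0), (1981, 2400.0), (1983, 1000.0)]

# Number of rows per year, sorted by year (a count aggregation).
quakes_per_year = dict(sorted(Counter(earthquake_years).items()))

# Total barrels per year, sorted by year (a sum aggregation).
totals = defaultdict(float)
for year, bbls in wells:
    totals[year] += bbls
bbls_per_year = dict(sorted(totals.items()))

print(quakes_per_year)
print(bbls_per_year)
```

Each result maps a year to a single number, which is the shape needed to draw one line per y-axis.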
I want to plot a bar chart where I need to compare sales of two regions with respect to Region and Tier.
I implemented below code:
df.groupby(['Region','Tier'],sort=True).sum()[['Sales2015','Sales2016']].unstack().plot(kind="bar",width = .8)
But I want the 2015 and 2016 sales for each Tier to appear side by side,
e.g., the x-axis ticks should read "High Sales of 2015 and 2016", and so on.
Data generation: I randomly generated data like yours using the code below:
import numpy as np
import pandas as pd
# Number of demo rows
demo_num = 20
# Regions
regions = ['central', 'east', 'west']
np.random.seed(9)
regions_r = np.random.choice(regions, demo_num)
# Tiers
tiers = ['hi', 'lo', 'mid']
np.random.seed(99)
tiers_r = np.random.choice(tiers, demo_num)
# Sales
sales2015 = np.array(range(demo_num)) * 100
sales2016 = np.array(range(demo_num)) * 200
# Dataframe `df` to store all above
df = pd.DataFrame({'Region': regions_r, 'Tier': tiers_r, 'Sales2015': sales2015, 'Sales2016': sales2016})
Data: The input data now looks like this:
Region Sales2015 Sales2016 Tier
0 west 0 0 lo
1 central 100 200 lo
2 west 200 400 hi
3 east 300 600 lo
4 west 400 800 hi
5 central 500 1000 mid
6 west 600 1200 hi
7 east 700 1400 lo
8 east 800 1600 hi
9 west 900 1800 lo
10 central 1000 2000 mid
11 central 1100 2200 lo
12 west 1200 2400 lo
13 east 1300 2600 hi
14 central 1400 2800 lo
15 east 1500 3000 mid
16 east 1600 3200 hi
17 east 1700 3400 mid
18 central 1800 3600 hi
19 central 1900 3800 hi
Code for visualization:
import matplotlib.pyplot as plt
import pandas as pd
# Summary statistics
df = df.groupby(['Tier', 'Region'], sort=True).sum()[['Sales2015', 'Sales2016']].reset_index(level=1, drop=False)
# Loop over Regions and visualize graphs side by side
regions = df.Region.unique().tolist()
fig, axes = plt.subplots(ncols=len(regions), nrows=1, figsize=(10, 5), sharex=False, sharey=True)
for region, ax in zip(regions, axes.ravel()):
    df.loc[df['Region'] == region].plot(ax=ax, kind='bar', title=region)
plt.tight_layout()
plt.show()
Result: The graphs now look like this. I haven't optimized the font size etc.
Hope this helps.
Is there a way of using one's preferred colors (8 to 10 or more) for different clusters plotted by the following code:
import numpy as np
existing_df_2d.plot(
kind='scatter',
x='PC2',y='PC1',
c=existing_df_2d.cluster.astype(float),
figsize=(16,8))
The code is from here: https://www.codementor.io/python/tutorial/data-science-python-pandas-r-dimensionality-reduction
Thanks
I have tried the following without success:
LABEL_COLOR_MAP = {0 : 'red',
1 : 'blue',
2 : 'green',
3 : 'purple'}
label_color = [LABEL_COLOR_MAP[l] for l in range(len(np.unique(existing_df_2d.cluster)))]
existing_df_2d.plot(
kind='scatter',
x='PC2',y='PC1',
c=label_color,
figsize=(16,8))
You need to add a fifth color for cluster 4 and map the cluster column through the LABEL_COLOR_MAP dictionary:
LABEL_COLOR_MAP = {0 : 'red',
1 : 'blue',
2 : 'green',
3 : 'purple',
4 : 'yellow'}
existing_df_2d.plot(
kind='scatter',
x='PC2',y='PC1',
c=existing_df_2d.cluster.map(LABEL_COLOR_MAP),
figsize=(16,8))
because:
print(np.unique(existing_df_2d.cluster))
[0 1 2 3 4]
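Besides the missing key for cluster 4, the earlier attempt builds one color per distinct cluster rather than one color per row. Here is a short plain-Python sketch of the difference, using a made-up list of cluster labels:

```python
LABEL_COLOR_MAP = {0: 'red', 1: 'blue', 2: 'green', 3: 'purple', 4: 'yellow'}

# Made-up cluster labels, one per data point.
cluster = [2, 3, 3, 0, 4, 1, 3]

# Earlier approach: iterates over cluster ids, so the list only has as
# many colors as there are distinct clusters.
per_cluster = [LABEL_COLOR_MAP[l] for l in range(len(set(cluster)))]

# Mapping approach: look up the color of each point's own label, which is
# what mapping the column through the dictionary does.
per_point = [LABEL_COLOR_MAP[l] for l in cluster]

print(len(per_cluster), len(per_point))
print(per_point)
```

Only the second list has one entry per point, which is what a per-point color argument needs.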
All code:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
tb_existing_url_csv = 'https://docs.google.com/spreadsheets/d/1X5Jp7Q8pTs3KLJ5JBWKhncVACGsg5v4xu6badNs4C7I/pub?gid=0&output=csv'
existing_df = pd.read_csv(
tb_existing_url_csv,
index_col = 0,
thousands = ',')
existing_df.index.names = ['country']
existing_df.columns.names = ['year']
pca = PCA(n_components=2)
pca.fit(existing_df)
# Output: PCA(copy=True, n_components=2, whiten=False)
existing_2d = pca.transform(existing_df)
existing_df_2d = pd.DataFrame(existing_2d)
existing_df_2d.index = existing_df.index
existing_df_2d.columns = ['PC1','PC2']
existing_df_2d.head()
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit(existing_df)
existing_df_2d['cluster'] = pd.Series(clusters.labels_, index=existing_df_2d.index)
print(existing_df_2d.head())
PC1 PC2 cluster
country
Afghanistan -732.215864 203.381494 2
Albania 613.296510 4.715978 3
Algeria 569.303713 -36.837051 3
American Samoa 717.082766 5.464696 3
Andorra 661.802241 11.037736 3
LABEL_COLOR_MAP = {0 : 'red',
1 : 'blue',
2 : 'green',
3 : 'purple',
4 : 'yellow'}
existing_df_2d.plot(
kind='scatter',
x='PC2',y='PC1',
c=existing_df_2d.cluster.map(LABEL_COLOR_MAP),
figsize=(16,8))
Testing:
Top 10 rows by column PC2:
print(existing_df_2d.loc[existing_df_2d['PC2'].nlargest(10).index,:])
PC1 PC2 cluster
country
Kiribati -2234.809790 864.494075 2
Djibouti -3798.447446 578.975277 4
Bhutan -1742.709249 569.448954 2
Solomon Islands -809.277671 530.292939 1
Nepal -986.570652 525.624757 1
Korea, Dem. Rep. -2146.623299 438.945977 2
Timor-Leste -1618.364795 428.244340 2
Tuvalu -1075.316806 366.666171 1
Mongolia -686.839037 363.722971 1
India -1146.809345 363.270389 1
I have a dictionary such as this created using Python.
d = {'a': ['Adam', 'Book', 4], 'b': ['Bill', 'TV', 6, 'Jill', 'Sports', 1, 'Bill', 'Computer', 5], 'c': ['Bill', 'Sports', 3], 'd': ['Quin', 'Computer', 3, 'Adam', 'Computer', 3], 'e': ['Quin', 'TV', 2, 'Quin', 'Book', 5], 'f': ['Adam', 'Computer', 7]}
I want to print this out in a sideways tree format on the console. I've tried pretty print, but when the dictionary gets long it becomes difficult to read.
For example, with this dictionary, it would return:
a -> Book -> Adam -> 4
b -> TV -> Bill -> 6
  -> Sports -> Jill -> 1
  -> Computer -> Bill -> 5
c -> Sports -> Bill -> 3
d -> Computer -> Quin -> 3
              -> Adam -> 3
e -> TV -> Quin -> 2
     Book -> Quin -> 5
f -> Computer -> Adam -> 7
Essentially, the pretty print is organized by the Activity, or the item in second position in the list, then by name and then by the number.
The sample output above is just an example. I tried working from the "Pretty print a tree" question but was unable to figure out how to turn that into a sideways format.
You can have a look at the code of the ETE toolkit. The function _asciiArt produces nice representations of trees, even with internal node labels:
from ete2 import Tree
t = Tree("(((A,B), C), D);")
print t
#               /-A
#          /---|
#     /---|     \-B
#    |    |
#----|     \-C
#    |
#     \-D
Here's how I would do it. Since the tree is only two levels deep -- despite what your desired output format might seem to imply -- there's no need to use recursion to traverse its contents, as iteration works quite well. Probably this is nothing like the #f code you referenced, since I don't know the language, but it's a lot shorter and more readable -- at least to me.
def print_tree(tree):
    for key in sorted(tree):
        data = tree[key]
        previous = data[0], data[1], data[2]
        first = True
        for name, activity, value in zip(*[iter(data)]*3):  # groups of three
            # Blank out the activity when it repeats the previous one
            activity = activity if first or activity != previous[1] else ' '*len(activity)
            print('{} ->'.format(key) if first else '    ', end=' ')
            print('{} -> {} -> {}'.format(activity, name, value))
            previous = name, activity, value
            first = False
d = {'a': ['Adam', 'Book', 4],
'b': ['Bill', 'TV', 6, 'Jill', 'Sports', 1, 'Bill', 'Computer', 5],
'c': ['Bill', 'Sports', 3],
'd': ['Quin', 'Computer', 3, 'Adam', 'Computer', 3],
'e': ['Quin', 'TV', 2, 'Quin', 'Book', 5],
'f': ['Adam', 'Computer', 7]}
print_tree(d)
Output:
a -> Book -> Adam -> 4
b -> TV -> Bill -> 6
     Sports -> Jill -> 1
     Computer -> Bill -> 5
c -> Sports -> Bill -> 3
d -> Computer -> Quin -> 3
              -> Adam -> 3
e -> TV -> Quin -> 2
     Book -> Quin -> 5
f -> Computer -> Adam -> 7
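The "groups of three" idiom in the loop can be unclear at first. It hands the same iterator to the zip call three times, so each output tuple consumes three consecutive items. Here is a small sketch written for Python 3 with the built-in `zip`; `triples` is a hypothetical helper name.

```python
def triples(flat):
    """Split a flat list into consecutive (name, activity, value) tuples."""
    it = iter(flat)
    # The same iterator object appears three times, so each tuple pulls
    # three successive items; leftovers that do not fill a tuple are dropped.
    return list(zip(it, it, it))

data = ['Bill', 'TV', 6, 'Jill', 'Sports', 1, 'Bill', 'Computer', 5]
print(triples(data))
# [('Bill', 'TV', 6), ('Jill', 'Sports', 1), ('Bill', 'Computer', 5)]
```

Note that a list whose length is not a multiple of three silently loses its trailing items, so the input format needs to be trusted or checked beforehand.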
Update
To organize the output by name instead of activity you'd need to change three lines as indicated below:
def print_tree(tree):
    for key in sorted(tree):
        data = tree[key]
        previous = data[0], data[1], data[2]
        first = True
        for name, activity, value in sorted(zip(*[iter(data)]*3)):  # changed
            name = name if first or name != previous[0] else ' '*len(name)  # changed
            print('{} ->'.format(key) if first else '    ', end=' ')
            print('{} -> {} -> {}'.format(name, activity, value))  # changed
            previous = name, activity, value
            first = False
Output after modification:
a -> Adam -> Book -> 4
b -> Bill -> Computer -> 5
          -> TV -> 6
     Jill -> Sports -> 1
c -> Bill -> Sports -> 3
d -> Adam -> Computer -> 3
     Quin -> Computer -> 3
e -> Quin -> Book -> 5
          -> TV -> 2
f -> Adam -> Computer -> 7
def treePrint(tree):
    for key in tree:
        print(key, end=' ')  # end=' ' prevents a newline character
        treeElem = tree[key]  # multiple lookups is expensive, even amortized O(1)!
        for subElem in treeElem:
            print(" -> ", subElem, end=' ')
            if type(subElem) != str:  # OP wants indenting after digits
                print("\n ")  # newline and a space to match indenting
        print("")  # forces a newline