I have an imported excel file in python and want to create a bar chart.
In the bar chart, I want the bars to be separated by profit, 0-10, 10-20, 20-30...
How do I do this?
this is one of the things I have tried:
import NumPy as np
import matplotlib.pyplot as plt
%matplotlib inline
df.plot(kind="bar",x="profit", y="people")
df[df.profit<=10]
plt.show()
and:
df[df.profit range (10,20)]
It is a bit difficult to help you better without a sample of your data, but I constructed a dataset randomly that should have the shape of yours, so that this solution can hopefully be useful to you:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# For random data
import random
%matplotlib inline
df = pd.DataFrame({'profit':[random.choice([i for i in range(100)]) for x in range(100)], 'people':[random.choice([i for i in range(100)]) for x in range(100)]})
display(df)
out = pd.cut(df['profit'], bins=[x*10 for x in range(10)], include_lowest=True)
ax = out.value_counts(sort=False).plot.bar(rot=0, color="b", figsize=(14,4))
plt.xlabel("Profit")
plt.ylabel("People")
plt.show()
I had a look at another question on here (Pandas bar plot with binned range) and there they explained how this issue can be solved.
Hope it helps :)
Related
I'm trying to use a bar chart to visualize my csv data. The data looks like this:
question,count_1,count_2,count_3,count_4,count_5
Q1,0,0,6,0,0
Q2,6,0,0,0,0
Q3,3,2,1,0,0
Q4,0,0,6,0,0
Q5,6,0,0,0,0
Q6,0,6,0,0,0
Q7,6,0,0,0,0
Q8,0,0,0,5,1
Q9,1,4,0,0,1
Q10,0,0,1,5,0
Here is my code
import pandas as pd
import csv
import matplotlib.pyplot as plt
df = pd.read_csv('example.csv')
ax = df.set_index(['question']).plot.bar(stacked=True)
ax.legend(loc='best')
plt.show()
Which gives me:
What I'm trying to do is flip the x and y axes. I want the bars to be horizontal and y axis to be the questions. I tried to transpose my data frame using:
ax = df.set_index(['question']).T.plot.bar(stacked=True)
but that gives me:
which is not what I want. Can anyone help?
to get the bars horizontally (flip the x and y axis), you need to use barh (horizontal bar). More info here. So, the code would be...
import pandas as pd
import csv
import matplotlib.pyplot as plt
df = pd.read_csv('example.csv')
ax = df.set_index(['question']).plot.barh(stacked=True)
ax.legend(loc='best')
plt.show()
Output plot
I have seen many questions on changing the tick frequency on SO, and that did help when I am building a line chart, but I have been struggling when its a bar chart. So below are my codes
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame(np.random.randint(1,10,(90,1)),columns=['Values'])
df.plot(kind='bar')
plt.show()
and thats the output I see. How do I change the tick frequency ?
(To be more clearer frequency of 5 on x axis!)
Using Pandas plot function you can do:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,10,(90,1)),columns=['Values'])
df.plot(kind='bar', xticks=np.arange(0,90,5))
Or better:
df.plot(kind='bar', xticks=list(df.index[0::5]))
I folks,
Consider the following example
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, (ax1,ax2) = plt.subplots(2,1)
dates = pd.date_range("2018-01-01","2019-01-01",freq = "1d")
x = pd.DataFrame(index = dates, data = np.linspace(0,1,len(dates)) )
x.plot(ax=ax1)
y = np.random.random([len(dates),100]) * x.values
ax2.pcolormesh(range(len(x)), np.linspace(-1,1,100), y.T)
plt.show()
At this point, I would like the both axis (ax1,ax2) to share the x-axis, i.e. displaying proper pandas dates on the second axis. sharex=True does not seem to work. How can I achieve that? I tried different possibilities which did not work out.
Edit: Since the pandas date formatting is superior to the native matplotlib formatting, please provide me with a solution where pandas date formatting is used (for instance, zooming with an interactive environment works much better with pandas date formatting). Thanks You!
One way to do it would be to do all the plotting with matplotlib, this way there are no problems with the different time formats being used:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, (ax1,ax2) = plt.subplots(2,1, sharex='col')
dates = pd.date_range("2018-01-01","2019-01-01",freq = "1d")
x = pd.DataFrame(index = dates, data = np.linspace(0,1,len(dates)) )
#x.plot(ax=ax1)
ax1.plot(x.index, x.values)
y = np.random.random([len(dates),100]) * x.values
ax2.pcolormesh(x.index, np.linspace(-1,1,100), y.T)
fig.tight_layout()
plt.show()
This gives the following plot:
What seems to work fine is to first plot the same line into the axes that should host the image, then plot the image, then remove the line again. What this does is that it tells pandas to apply its locators and formatters to that axes; they will stay after removing the line.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, (ax1,ax2) = plt.subplots(2,1, sharex=True)
dates = pd.date_range("2018-01-01","2019-01-01",freq = "1d")
x = pd.DataFrame(index = dates, data = np.linspace(0,1,len(dates)) )
x.plot(ax=ax1)
y = np.random.random([len(dates),100]) * x.values
x.plot(ax=ax2, legend=False)
ax2.pcolormesh(dates, np.linspace(-1,1,100), y.T)
ax2.lines[0].remove()
plt.show()
Note that there may be caveats of this solution when zooming or panning. Consider it more like a hack and use it as long as it works, but don't blame anyone once it doesn't.
I'm currently using df.hist(alpha = .5), but all of the subplots are too close from each other, like this:
Histograms
Which way is better to change the space between them?
Or is better to plot each one in a separate .png file?
One simple way is to manipulate figsize and add pyplot.tight_layout. Below is the example.
Without adjustment:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(6400)
.reshape((100, 64)), columns=['col_{}'.format(i) for i in range(64)])
df.hist(alpha=0.5)
plt.show()
You will get this as you showed:
In contrast, if you add figsize (with arbitrary size) and pyplot.tight_layout like below:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(6400)
.reshape((100, 64)), columns=['col_{}'.format(i) for i in range(64)])
df.hist(alpha=0.5, figsize=(20, 10))
plt.tight_layout()
plt.show()
In this case you will get more aligned view:
Hope this helps.
I am running following code to draw histograms in 3 by 3 grid for 9 varaibles.However, it plots only one variable.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def draw_histograms(df, variables, n_rows, n_cols):
fig=plt.figure()
for i, var_name in enumerate(variables):
ax=fig.add_subplot(n_rows,n_cols,i+1)
df[var_name].hist(bins=10,ax=ax)
plt.title(var_name+"Distribution")
plt.show()
You're adding subplots correctly but you call plt.show for each added subplot which causes what has been drawn so far to be shown, i.e. one plot. If you're for instance plotting inline in IPython you will only see the last plot drawn.
Matplotlib provides some nice examples of how to use subplots.
Your problem is fixed like:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def draw_histograms(df, variables, n_rows, n_cols):
fig=plt.figure()
for i, var_name in enumerate(variables):
ax=fig.add_subplot(n_rows,n_cols,i+1)
df[var_name].hist(bins=10,ax=ax)
ax.set_title(var_name+" Distribution")
fig.tight_layout() # Improves appearance a bit.
plt.show()
test = pd.DataFrame(np.random.randn(30, 9), columns=map(str, range(9)))
draw_histograms(test, test.columns, 3, 3)
Which gives a plot like:
In case you don't really worry about titles, here's a one-liner
df = pd.DataFrame(np.random.randint(10, size=(100, 9)))
df.hist(color='k', alpha=0.5, bins=10)