I have an assignment that only allows matplotlib and basic python. I am unable to produce the bar chart required. Although anaconda has identified the problematic line, I am unable to understand it.
The data source is here: https://data.gov.sg/dataset/bookings-for-new-flats-annual?view_id=2cdedc08-ddf6-4e0b-b279-82fdc7e678ea&resource_id=666ed30a-8344-4213-9d2e-076eeafeddd3
I have copied a sample resource and adapted it:
import numpy as np
import matplotlib.pyplot as plt
fname = "C:\data/bookings-for-new-flats.csv"
data = np.genfromtxt('C:\data/bookings-for-new-flats.csv',
skip_header=1,
dtype=[('financial_year','U50'),('no_of_units','i8')], delimiter=",",
missing_values=['na','-'],filling_values=[0])
labels = list(set(data['financial_year']))
labels.sort()
bookings = np.arange(0,len(labels))
bookings_values = data[['financial_year','no_of_units']]
values = bookings_values['no_of_units']
units_values = {}
for i in labels:
    valuesforFY = values[bookings_values['financial_year']==i]
    print("No.of Units in FY: " + i + " is {}".format(valuesforFY))
    #the line below is critical
    units_values[i] = valuesforFY
barchart = plt.bar(list(units_values.keys()), list(units_values.values()), color='b')
plt.show()
I expected a bar chart but only got an empty one.
The system identified this line as problematic --->
barchart = plt.bar(list(units_values.keys()), list(units_values.values()), color='b')
The problem is in the y-data (the values of the dictionary): each value is a single number wrapped in an array, so you end up passing a list of arrays instead of a list of numbers.
The solution is to iterate over the values and keep only the first element of each array, using index [0]. For readability, I rewrite your code by first extracting the x-values into xdata and then the y-values into ydata.
xdata = list(units_values.keys())
ydata = [i[0] for i in units_values.values()]
barchart = plt.bar(xdata, ydata, color='b')
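To make the difference concrete, here is a minimal sketch with made-up numbers (the years and unit counts are illustrative and not taken from the dataset). It shows that boolean indexing hands back a one-element array per year, and that unwrapping it gives plt.bar plain numbers to work with.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data only; the real values come from the CSV file.
years = np.array(['2008', '2009', '2010'])
units = np.array([8500, 9100, 16000])

units_values = {}
for fy in years:
    # Boolean indexing always returns an array, here with one element.
    units_values[fy] = units[years == fy]

xdata = list(units_values.keys())
# Unwrap each one-element array into a plain Python int.
ydata = [int(v[0]) for v in units_values.values()]

fig, ax = plt.subplots()
bars = ax.bar(xdata, ydata, color='b')
# plt.show()  # uncomment to display the chart interactively
```

If a financial year could appear in more than one row, something like v.sum() would probably be a safer choice than v[0], since [0] silently drops the rest.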
I am trying to create a Manhattan plot that will be vertically highlighted at certain parts of the plot given a list of values corresponding to points in the scatter plot. I looked at several examples but I am not sure how to proceed. I think using axvspan or ax.fill_between should work but I am not sure how. The code below was lifted directly from
How to create a Manhattan plot with matplotlib in python?
from pandas import DataFrame
from scipy.stats import uniform
from scipy.stats import randint
import numpy as np
import matplotlib.pyplot as plt
# some sample data
df = DataFrame({'gene' : ['gene-%i' % i for i in np.arange(10000)],
'pvalue' : uniform.rvs(size=10000),
'chromosome' : ['ch-%i' % i for i in randint.rvs(0,12,size=10000)]})
# -log_10(pvalue)
df['minuslog10pvalue'] = -np.log10(df.pvalue)
df.chromosome = df.chromosome.astype('category')
df.chromosome = df.chromosome.cat.set_categories(['ch-%i' % i for i in range(12)], ordered=True)
df = df.sort_values('chromosome')
# How to plot gene vs. -log10(pvalue) and colour it by chromosome?
df['ind'] = range(len(df))
df_grouped = df.groupby(('chromosome'))
fig = plt.figure()
ax = fig.add_subplot(111)
colors = ['red','green','blue', 'yellow']
x_labels = []
x_labels_pos = []
for num, (name, group) in enumerate(df_grouped):
    group.plot(kind='scatter', x='ind', y='minuslog10pvalue',color=colors[num % len(colors)], ax=ax)
    x_labels.append(name)
    x_labels_pos.append((group['ind'].iloc[-1] - (group['ind'].iloc[-1] - group['ind'].iloc[0])/2))
ax.set_xticks(x_labels_pos)
ax.set_xticklabels(x_labels)
ax.set_xlim([0, len(df)])
ax.set_ylim([0, 3.5])
ax.set_xlabel('Chromosome')
Given a list of values for the points (p-values), e.g.
lst = [0.288686, 0.242591, 0.095959, 3.291343, 1.526353]
How do I highlight the region containing these points on the plot just as shown in green in the image below? Something similar to:
It would help if you included a sample of your dataframe for reference.
Assuming you want to match your lst values against the Y values, you need to go through each Y value you are plotting and check whether it is in lst.
for num, (name, group) in enumerate(df_grouped):
The group variable in your code is essentially a partial dataframe of your main dataframe, df. Hence you need another loop that looks through all Y values for matches with lst:
region_plot = []
for num, (name, group) in enumerate(df_grouped):
    group.plot(kind='scatter', x='ind', y='minuslog10pvalue',color=colors[num % len(colors)], ax=ax)
    #create a new df to get only rows that have matched values with lst
    temp_group = group[group['minuslog10pvalue'].isin(lst)]
    for x_group in temp_group['ind']:
        #If condition to make sure same region is not highlighted again
        if x_group not in region_plot:
            region_plot.append(x_group)
            ax.axvspan(x_group, x_group+1, alpha=0.5, color='green')
            #I put x_group+1 because I'm not sure how big of a highlight range you want
Hope this helps!
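One caveat with isin: it compares floats exactly, so values that were rounded when they were copied into lst may not match anything. Below is a rough sketch of a tolerance-based match using np.isclose instead. The data is randomly generated, the names xvals, yvals and hit_positions are mine, and the tolerance of 1e-6 simply assumes lst was rounded to six decimals.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
yvals = -np.log10(rng.uniform(size=200))   # stand-in for minuslog10pvalue
xvals = np.arange(len(yvals))              # stand-in for the 'ind' column

# Pretend these were copied from elsewhere, rounded to 6 decimals like lst.
lst = np.round(yvals[[10, 50, 120]], 6)

# Compare every y value against every lst entry with an absolute tolerance.
matches = np.isclose(yvals[:, None], lst[None, :], rtol=0, atol=1e-6).any(axis=1)
hit_positions = xvals[matches]

fig, ax = plt.subplots()
ax.scatter(xvals, yvals, s=8)
for pos in hit_positions:
    # Shade a one-unit-wide vertical band centred on each matched point.
    ax.axvspan(pos - 0.5, pos + 0.5, alpha=0.5, color='green')
# plt.show()  # uncomment to display
```

The width of the band (one unit here) is arbitrary; widen it if the highlight should cover a neighbourhood rather than a single point.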
I have a dataset containing 10 features and corresponding labels. I am using scatter plots of each distinct pair of features to see which pairs separate the labels best (so 45 plots in total). To do that, I used nested loops. The code raises no error and I obtained all the plots. However, something is clearly wrong, because each new scatter plot that gets created and saved also accumulates the points from the previous ones. I am attaching the complete code I used. How can I fix this problem? Below is the link to the raw dataset:
https://github.com/IITGuwahati-AI/Learning-Content/raw/master/Phase%203%20-%202020%20(Summer)/Week%201%20(Mar%2028%20-%20Apr%204)/assignment/data.txt
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
data_url ='https://raw.githubusercontent.com/diwakar1412/Learning-Content/master/DiwakarDas_184104503/datacsv.csv'
df = pd.read_csv(data_url)
df.head()
def transform_label(value):
    if value >= 2:
        return "BLUE"
    else:
        return "RED"
df["Label"] = df.Label.apply(transform_label)
df.head()
colors = {'RED':'r', 'BLUE':'b'}
fig, ax = plt.subplots()
for i in range(1,len(df.columns)):
    for j in range(i+1,len(df.columns)):
        for k in range(len(df[str(i)])):
            ax.scatter(df[str(i)][k], df[str(j)][k], color=colors[df['Label'][k]])
        ax.set_title('F%svsF%s' %(i,j))
        ax.set_xlabel('%s' %i)
        ax.set_ylabel('%s' %j)
        plt.savefig('F%svsF%s' %(i,j))
You have to create a new figure each time. Try to put
fig, ax = plt.subplots()
inside your loop:
for i in range(1,len(df.columns)):
    for j in range(i+1,len(df.columns)):
        fig, ax = plt.subplots() # <-------------- here
        for k in range(len(df[str(i)])):
            ax.scatter(df[str(i)][k], df[str(j)][k], color=colors[df['Label'][k]])
        ax.set_title('F%svsF%s' %(i,j))
        ax.set_xlabel('%s' %i)
        ax.set_ylabel('%s' %j)
        plt.savefig('/Users/Alessandro/Desktop/tmp/F%svsF%s' %(i,j))
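As a possible refinement, the innermost loop over rows is not needed, because scatter accepts whole columns plus a sequence of colors. The sketch below uses made-up data with 4 features instead of 10, writes to a temporary folder, and closes each figure after saving so that dozens of open figures do not pile up. The names features, point_colors and out_dir are placeholders of mine, not part of the original code.

```python
import itertools
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_features = 4                                   # the question has 10
features = rng.normal(size=(50, n_features))     # made-up data
raw_labels = rng.integers(1, 4, size=50)         # made-up labels 1..3
point_colors = np.where(raw_labels >= 2, 'b', 'r').tolist()

out_dir = tempfile.mkdtemp()                     # placeholder output folder
saved = []
for i, j in itertools.combinations(range(n_features), 2):
    fig, ax = plt.subplots()                     # fresh figure for each pair
    # One scatter call per plot; c takes one color per point.
    ax.scatter(features[:, i], features[:, j], c=point_colors)
    ax.set_title('F%svsF%s' % (i + 1, j + 1))
    ax.set_xlabel('%s' % (i + 1))
    ax.set_ylabel('%s' % (j + 1))
    path = os.path.join(out_dir, 'F%svsF%s.png' % (i + 1, j + 1))
    fig.savefig(path)
    plt.close(fig)                               # free the figure once saved
    saved.append(path)
```

With 4 features this writes 6 files; with the 10 features from the question the same loop would produce the 45 plots.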
(Using Python 3.0) In increments of 0.25, I want to calculate and plot PDFs for the given data across specified ranges for easy visualization.
Calculating the individual plot has been done thanks to the SO community, but I cannot quite get the algorithm right to iterate properly across the range of values.
Data: https://www.dropbox.com/s/y78pynq9onyw9iu/Data.csv?dl=0
What I have so far is normalized toy data that looks like a shotgun blast with one of the target areas isolated between the black lines with an increment of 0.25:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Data=pd.read_csv("Data.csv")
g = sns.jointplot(x="x", y="y", data=Data)
bottom_lim = 0
top_lim = 0.25
temp = Data.loc[(Data.y>=bottom_lim)&(Data.y<top_lim)]
g.ax_joint.axhline(top_lim, c='k', lw=2)
g.ax_joint.axhline(bottom_lim, c='k', lw=2)
# we have to create a secondary y-axis to the joint-plot, otherwise the kde
# might be very small compared to the scale of the original y-axis
ax_joint_2 = g.ax_joint.twinx()
sns.kdeplot(temp.x, shade=True, color='red', ax=ax_joint_2, legend=False)
ax_joint_2.spines['right'].set_visible(False)
ax_joint_2.spines['top'].set_visible(False)
ax_joint_2.yaxis.set_visible(False)
And now what I want to do is make a ridgeline/joyplot of this data across each 0.25 band of data.
I tried a few techniques from the various Seaborn examples out there, but nothing really accounts for the band or range of values as the y-axis. I'm struggling to translate my written algorithm into working code as a result.
I don't know if this is exactly what you are looking for, but hopefully this gets you in the ballpark. I also know very little about python, so here is some R:
library(tidyverse)
library(ggridges)
data = read_csv("https://www.dropbox.com/s/y78pynq9onyw9iu/Data.csv?dl=1")
data2 = data %>%
  mutate(breaks = cut(x, breaks = seq(-1,7,.5), labels = FALSE))
data2 %>%
  ggplot(aes(x=x,y=breaks)) +
  geom_density_ridges() +
  facet_grid(~breaks, scales = "free")
data2 %>%
  ggplot(aes(x=x,y=y)) +
  geom_point() +
  geom_density() +
  facet_grid(~breaks, scales = "free")
And please forgive the poorly formatted axis.
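For anyone who would rather stay in Python, here is a rough sketch of the same idea: cut y into 0.25-wide bands, estimate the density of x within each band, and stack the curves at the band's lower edge. It runs on synthetic data rather than Data.csv, gaussian_kde_1d is a small hand-written helper (not a library function), and the bandwidth of 0.3 is an arbitrary choice, so treat it as a starting point only.

```python
import numpy as np
import matplotlib.pyplot as plt

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a plain Gaussian kernel density estimate on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

# Synthetic stand-in for the x and y columns of Data.csv.
rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, size=2000)
y = rng.uniform(0.0, 1.0, size=2000)

step = 0.25
edges = np.arange(0.0, 1.0 + step, step)   # band edges 0, 0.25, ..., 1.0
band = np.digitize(y, edges) - 1           # band index of every point
grid = np.linspace(x.min(), x.max(), 200)

fig, ax = plt.subplots()
densities = []
for b in range(len(edges) - 1):
    xs = x[band == b]
    if len(xs) < 2:
        continue                           # skip bands with too few points
    dens = gaussian_kde_1d(xs, grid, bandwidth=0.3)
    densities.append(dens)
    offset = edges[b]
    # Rescale each curve so the ridge fits inside its own band.
    ridge = offset + dens / dens.max() * step * 0.9
    ax.fill_between(grid, offset, ridge, alpha=0.6)
    ax.plot(grid, ridge, color='k', lw=1)
ax.set_yticks(edges)
ax.set_xlabel('x')
ax.set_ylabel('y band (lower edge)')
# plt.show()  # uncomment to display
```

Because every ridge is rescaled to its own maximum, the heights are only comparable in shape, not in absolute density.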
I would like to replace part of my plot where the function dips down to '-1' with a dashed line carrying on from the previous point (see plots below).
Here's some code I've written, along with its output:
import numpy as np
import matplotlib.pyplot as plt
y = [5,6,8,3,5,7,3,6,-1,3,8,5]
plt.plot(np.linspace(1,12,12),y,'r-o')
plt.show()
for i in range(1,len(y)):
    if y[i]!=-1:
        plt.plot(np.linspace(i-1,i,2),y[i-1:i+1],'r-o')
    else:
        y[i]=y[i-1]
        plt.plot(np.linspace(i-1,i,2),y[i-1:i+1],'r--o')
plt.ylim(-1,9)
plt.show()
Here's the original plot
Modified plot:
The code I've written works (it produces the desired output), but it's inefficient and takes a long time when I actually run it on my (much larger) dataset. Is there a smarter way to go about doing this?
You can achieve something similar without the loops:
import pandas as pd
import matplotlib.pyplot as plt
# Create a data frame from the list
a = pd.DataFrame([5,6,-1,-1, 8,3,5,7,3,6,-1,3,8,5])
# Prepare a boolean mask
mask = a > 0
# New data frame with missing values filled with the last element of
# the previous segment. Use .bfill() instead to take the first element of
# the next segment.
a_masked = a[mask].ffill()
# Prepare the plot
fig, ax = plt.subplots()
line, = ax.plot(a_masked, ls = '--', lw = 1)
ax.plot(a[mask], color=line.get_color(), lw=1.5, marker = 'o')
plt.show()
You can also highlight the negative regions by choosing a different colour for the lines:
My answer is based on a great post from July, 2017. The latter also tackles the case when the first element is NaN or in your case a negative number:
Dotted lines instead of a missing value in matplotlib
I would use numpy functionality to cut your line into segments and then plot all solid and dashed lines separately. In the example below I added two additional -1s to your data to see that this works universally.
import numpy as np
import matplotlib.pyplot as plt
Y = np.array([5,6,-1,-1, 8,3,5,7,3,6,-1,3,8,5])
X = np.arange(len(Y))
idxs = np.where(Y==-1)[0]
sub_y = np.split(Y,idxs)
sub_x = np.split(X,idxs)
fig, ax = plt.subplots()
##replacing -1 values and plotting dotted lines
for i in range(1,len(sub_y)):
    val = sub_y[i-1][-1]
    sub_y[i][0] = val
    ax.plot([sub_x[i-1][-1], sub_x[i][0]], [val, val], 'r--')
##plotting rest
for x,y in zip(sub_x, sub_y):
    ax.plot(x, y, 'r-o')
plt.show()
The result looks like this:
Note, however, that this will fail if the first value is -1, as then your problem is not well defined (no previous value to copy from). Hope this helps.
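If even the loop over segments turns out to be too slow on a large dataset, a fully vectorised forward fill is possible with np.maximum.accumulate. The sketch below is one way to do it, under the same assumption that the first value is not -1. It draws a dashed line through the filled data and a solid line on top that breaks at the gaps, so the result is similar to, but not identical with, the plot in the question.

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([5, 6, -1, -1, 8, 3, 5, 7, 3, 6, -1, 3, 8, 5], dtype=float)
x = np.arange(len(y))
missing = y == -1

# For every position, find the index of the most recent valid sample.
# Assumes y[0] is valid; otherwise there is nothing to carry forward.
last_valid = np.maximum.accumulate(np.where(missing, 0, x))
filled = y[last_valid]

# Solid line only where data exists: NaN breaks the line at the gaps.
solid = np.where(missing, np.nan, y)

fig, ax = plt.subplots()
ax.plot(x, filled, 'r--')    # dashed line underneath, bridging the gaps
ax.plot(x, solid, 'r-o')     # solid line and markers on the real samples
ax.set_ylim(-1, 9)
# plt.show()  # uncomment to display
```

This needs only two plot calls regardless of how many gaps the data contains.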
Not too elegant, but here's something I came up with (based on the above answers) that works without loops. @KRKirov and @Thomas Kühn, thank you for your answers, I really appreciate them.
import pandas as pd
import matplotlib.pyplot as plt
# Create a data frame from the list
a = pd.DataFrame([5,6,-1,-1, 8,3,5,7,3,6,-1,3,8,5])
b=a.copy()
b[2]=b[0].shift(1,axis=0)
b[4]=(b[0]!=-1) & (b[2]==-1)
b[5]=b[4].shift(-1,axis=0)
b[0] = (b[5] | b[4])
c=b[0]
d=pd.DataFrame(c)
# Prepare a boolean mask
mask = a > 0
# New data frame with missing values filled with the last element of
# the previous segment. Use .bfill() instead to take the first element of
# the next segment.
a_masked = a[mask].ffill()
# Prepare the plot
fig, ax = plt.subplots()
line, = ax.plot(a_masked, 'b:o', lw = 1)
ax.plot(a[mask], color=line.get_color(), lw=1.5, marker = 'o')
ax.plot(a_masked[d], color=line.get_color(), lw=1.5, marker = 'o')
plt.show()
I'm trying to go away from matlab and use python + matplotlib instead. However, I haven't really figured out what the matplotlib equivalent of matlab 'handles' is. So here's some matlab code where I return the handles so that I can change certain properties. What is the exact equivalent of this code using matplotlib? I very often use the 'Tag' property of handles in matlab and use 'findobj' with it. Can this be done with matplotlib as well?
% create figure and return figure handle
h = figure();
% add a plot and tag it so we can find the handle later
plot(1:10, 1:10, 'Tag', 'dummy')
% add a legend
my_legend = legend('a line')
% change figure name
set(h, 'name', 'myfigure')
% find current axes
my_axis = gca();
% change xlimits
set(my_axis, 'XLim', [0 5])
% find the plot object generated above and modify YData
set(findobj('Tag', 'dummy'), 'YData', repmat(10, 1, 10))
There is a findobj method in matplotlib too:
import matplotlib.pyplot as plt
import numpy as np
h = plt.figure()
plt.plot(range(1,11), range(1,11), gid='dummy')
my_legend = plt.legend(['a line'])
plt.title('myfigure') # not sure if this is the same as set(h, 'name', 'myfigure')
my_axis = plt.gca()
my_axis.set_xlim(0,5)
for p in set(h.findobj(lambda x: x.get_gid()=='dummy')):
    p.set_ydata(np.ones(10)*10.0)
plt.show()
Note that the gid parameter in plt.plot is usually used by matplotlib (only) when the backend is set to 'svg'. It uses the gid as the id attribute of some grouping elements (like line2d, patch, text).
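For comparison, here is a slightly more defensive variant of the same idea, written against the object-oriented interface. It restricts the search to Line2D artists so that other artists sharing the gid are left alone, and it passes the name through the num argument of plt.figure, which I believe is the closest counterpart to MATLAB's figure name (it becomes the window title). Treat that mapping as my assumption, not as an exact equivalent.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# A string passed as num names the figure (shown as the window title).
fig = plt.figure(num='myfigure')
ax = fig.add_subplot(111)

# Keep the returned handle as well; it is the most direct way to get
# back to the line later, without searching.
line, = ax.plot(range(1, 11), range(1, 11), gid='dummy', label='a line')
ax.legend()
ax.set_xlim(0, 5)

# findobj walks the artist tree and keeps whatever the predicate accepts.
matches = fig.findobj(
    lambda artist: isinstance(artist, Line2D) and artist.get_gid() == 'dummy'
)
for artist in matches:
    artist.set_ydata(np.full(10, 10.0))
# plt.show()  # uncomment to display
```

In practice, holding on to the handle returned by plot (line above) is usually simpler than tagging and searching; findobj is mostly useful when the handle was not kept.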
I have not used matlab but I think this is what you want
import matplotlib
import matplotlib.pyplot as plt
x = [1,3,4,5,6]
y = [1,9,16,25,36]
fig = plt.figure()
ax = fig.add_subplot(111) # add a plot
ax.set_title('y = x^2')
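line1, = ax.plot(x, y, 'o-') # x and y are lists of equal size
y2 = [v + 1 for v in y] # hypothetical replacement data; y2 was not defined in the original
line1.set_ydata(y2) # use this to modify the y-data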
plt.show()
Of course, this is just a basic plot; there is more to it. Go through this to find the graph you want and view its source code.
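# the names used below (figure, plot, arange, ones, ...) assume pylab-style imports
from pylab import *
from matplotlib.lines import Line2D
# create figure and return figure handle
h = figure()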
# add a plot but tagging like matlab is not available here. But you can
# set one of the attributes to find it later. url seems harmless to modify.
# plot() returns a list of Line2D instances which you can store in a variable
p = plot(arange(1,11), arange(1,11), url='my_tag')
# add a legend
my_legend = legend(p,('a line',))
# you could also do
# p = plot(arange(1,11), arange(1,11), label='a line', url='my_tag')
# legend()
# or
# p[0].set_label('a line')
# legend()
# change figure name: not sure what this is for.
# set(h, 'name', 'myfigure')
# find current axes
my_axis = gca()
# change xlimits
my_axis.set_xlim(0, 5)
# You could compress the above two lines of code into:
# xlim(start, end)
# find the plot object generated above and modify YData
# findobj in matplotlib needs you to write a boolean function to
# match selection criteria.
# Here we use a lambda function to return only Line2D objects
# with the url property set to 'my_tag'
q = h.findobj(lambda x: isinstance(x, Line2D) and x.get_url() == 'my_tag')
# findobj returns duplicate objects in the list. We can take the first entry.
q[0].set_ydata(ones(10)*10.0)
# now refresh the figure
draw()