My current code reads a CSV file and lists the column headers so the user can pick which ones to plot.
import pandas as pd
from collections import OrderedDict
# DataFrame.from_csv was removed in pandas 1.0; read_csv is the replacement
df = pd.read_csv('log40a.csv')
# Number the column headers starting at 1
headings = OrderedDict(enumerate(df, 1))
for num, heading in headings.items():
    print("{}) {}".format(num, heading))
print('Select X-Axis')
xaxis = int(input())
print('Select Y-Axis')
yaxis = int(input())
df.plot(x=headings[xaxis], y=headings[yaxis])
My first question: how do I add a secondary Y axis? I know that with matplotlib I first create a figure, plot the first Y axis against the X axis, and then do the same for the second Y axis. However, I am not sure how it is done in pandas. Is it similar?
I tried using matplotlib to do it but it gave me an error:
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(211)
ax.plot(headings[xaxis], headings[yaxis], label='Alt(m)', color = 'r')
ax.plot(headings[xaxis], headings[yaxis1], label='AS_Cmd', color = 'blue')
Error:
ValueError: Unrecognized character a in format string
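You need to pass a list of the column names that you want plotted on the y axis. (The matplotlib attempt fails because ax.plot was given the column names themselves rather than the column data, so the second string was parsed as a format string.)
For example, if the user delimits the y columns with a ',':
df.plot(x=headings[xaxis], y=[headings[int(i)] for i in yaxis.split(",")], figsize=(15, 10))
For this to run you will need to change your input method so that yaxis is kept as a string (yaxis = input()) and split into a list, rather than converted to a single int.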
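A minimal sketch of the input handling this answer relies on, assuming the user types column numbers separated by commas (for example 2,3). The helper name parse_selection and the sample headings are hypothetical and not part of the original code.

```python
from collections import OrderedDict


def parse_selection(raw, headings):
    """Turn a string such as "2, 3" into the matching column names."""
    names = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate trailing commas or repeated separators
        names.append(headings[int(token)])  # KeyError if the number is not listed
    return names


# Hypothetical headings standing in for the CSV header row.
headings = OrderedDict(enumerate(["Time", "Alt(m)", "AS_Cmd"], 1))
y_columns = parse_selection("2, 3", headings)
print(y_columns)  # ['Alt(m)', 'AS_Cmd']
```

The resulting list can be passed as y=y_columns to df.plot. For the secondary axis asked about in the question, DataFrame.plot also accepts a secondary_y argument, e.g. secondary_y=[y_columns[-1]], which draws the listed columns against a second y axis on the right.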
I am trying to create a Manhattan plot in which certain parts are highlighted vertically, given a list of values corresponding to points in the scatter plot. I looked at several examples but I am not sure how to proceed. I think axvspan or ax.fill_between should work, but I am not sure how to use them. The code below was lifted directly from the question "How to create a Manhattan plot with matplotlib in python?":
from pandas import DataFrame
from scipy.stats import uniform
from scipy.stats import randint
import numpy as np
import matplotlib.pyplot as plt
# some sample data
df = DataFrame({'gene' : ['gene-%i' % i for i in np.arange(10000)],
                'pvalue' : uniform.rvs(size=10000),
                'chromosome' : ['ch-%i' % i for i in randint.rvs(0,12,size=10000)]})
# -log_10(pvalue)
df['minuslog10pvalue'] = -np.log10(df.pvalue)
df.chromosome = df.chromosome.astype('category')
df.chromosome = df.chromosome.cat.set_categories(['ch-%i' % i for i in range(12)], ordered=True)
df = df.sort_values('chromosome')
# How to plot gene vs. -log10(pvalue) and colour it by chromosome?
df['ind'] = range(len(df))
df_grouped = df.groupby(('chromosome'))
fig = plt.figure()
ax = fig.add_subplot(111)
colors = ['red','green','blue', 'yellow']
x_labels = []
x_labels_pos = []
for num, (name, group) in enumerate(df_grouped):
    group.plot(kind='scatter', x='ind', y='minuslog10pvalue',color=colors[num % len(colors)], ax=ax)
    x_labels.append(name)
    x_labels_pos.append((group['ind'].iloc[-1] - (group['ind'].iloc[-1] - group['ind'].iloc[0])/2))
ax.set_xticks(x_labels_pos)
ax.set_xticklabels(x_labels)
ax.set_xlim([0, len(df)])
ax.set_ylim([0, 3.5])
ax.set_xlabel('Chromosome')
Given a list of values for the points (pvalues), e.g.
lst = [0.288686, 0.242591, 0.095959, 3.291343, 1.526353]
how do I highlight the regions containing these points on the plot, as shown in green in the example image?
It would help if you shared a sample of your dataframe for reference.
Assuming you want to match your lst values against the Y values, you need to iterate through each Y value you're plotting and check whether it is in lst.
In your loop, for num, (name, group) in enumerate(df_grouped):, each group is essentially a partial dataframe of your main dataframe, df. Hence, you need to add another loop that looks through all the Y values for lst matches:
region_plot = []
for num, (name, group) in enumerate(df_grouped):
    group.plot(kind='scatter', x='ind', y='minuslog10pvalue', color=colors[num % len(colors)], ax=ax)
    # create a new df with only the rows whose values match lst
    # (isin compares floats exactly, so lst must hold the stored values)
    temp_group = group[group['minuslog10pvalue'].isin(lst)]
    for x_group in temp_group['ind']:
        # make sure the same region is not highlighted again
        if x_group not in region_plot:
            region_plot.append(x_group)
            # I used x_group+1 because I'm not sure how wide a highlight you want
            ax.axvspan(x_group, x_group+1, alpha=0.5, color='green')
Hope this helps!
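If several matched points sit next to each other, calling ax.axvspan once per point stacks many thin, semi-transparent bands. One possible refinement, sketched below under the assumption that the matched 'ind' values are integers, is to merge neighbouring indices into (start, end) ranges first. The function name merge_into_spans and the max_gap parameter are illustrative and do not come from the answer above.

```python
def merge_into_spans(indices, max_gap=1):
    """Group integer positions into inclusive (start, end) ranges.

    Positions whose distance is at most max_gap end up in the same range.
    """
    spans = []
    for i in sorted(set(indices)):
        if spans and i - spans[-1][1] <= max_gap:
            spans[-1] = (spans[-1][0], i)  # extend the current range
        else:
            spans.append((i, i))  # start a new range
    return spans


# Hypothetical 'ind' values of the rows that matched lst.
matched = [12, 13, 14, 40, 41, 97]
print(merge_into_spans(matched))  # [(12, 14), (40, 41), (97, 97)]
```

Each range can then be drawn with a single call such as ax.axvspan(start, end + 1, alpha=0.5, color='green'), and a larger max_gap widens what counts as one region.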
I'm getting the error:
TypeError: Image data of dtype object cannot be converted to float
when I try to run the heatmap function in the code below:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Read the data
df = pd.read_csv("gapminder-FiveYearData.csv")
print(df.head(10))
# Create an array of n-dimensional array of life expectancy changes for countries over the years.
year = ((np.asarray(df['year'])).reshape(12,142))
country = ((np.asarray(df['country'])).reshape(12,142))
print(year)
print(country)
# Create a pivot table
result = df.pivot(index='year',columns='country',values='lifeExp')
print(result)
# Create an array to annotate the heatmap
labels = (np.asarray(["{1:.2f} \n {0}".format(year,value)
for year, value in zip(year.flatten(),
country.flatten())])
).reshape(12,142)
# Define the plot
fig, ax = plt.subplots(figsize=(15,9))
# Add title to the Heat map
title = "GapMinder Heat Map"
# Set the font size and the distance of the title from the plot
plt.title(title,fontsize=18)
ttl = ax.title
ttl.set_position([0.5,1.05])
# Hide ticks for X & Y axis
ax.set_xticks([])
ax.set_yticks([])
# Remove the axes
ax.axis('off')
# Use the heatmap function from the seaborn package
hmap = sns.heatmap(result,annot=labels,fmt="",cmap='RdYlGn',linewidths=0.30,ax=ax)
# Display the Heatmap
plt.imshow(hmap)
Here is a link to the CSV file.
The data file is a dataset with 6 columns: country, year, pop, continent, lifeExp and gdpPercap.
The objective of the activity is to:
Create a pivot table dataframe with year along x-axes, country along y-axes and lifeExp filled within cells.
Plot a heatmap using seaborn for the pivot table that was just created.
Thanks for providing your data with this question. I believe your TypeError is related to the labels array your code creates for the annotation. Given the function's built-in annotation option (annot=True), I don't think you need this extra work, and it modifies your data in a way that errors out when plotting. Note also that sns.heatmap returns a matplotlib Axes rather than image data, so the final plt.imshow(hmap) call is another likely source of this error; plt.show() is the call that displays the figure.
I took a stab at rewriting your project to produce a heatmap of the pivot table of lifeExp by country and year. I'm also assuming that it is important for you to keep this number a float.
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
## ** UNCHANGED FROM ABOVE **
# Read in the data
df = pd.read_csv('https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv')
df.head()
## ** UNCHANGED FROM ABOVE **
# Create an array of n-dimensional array of life expectancy changes for countries over the years.
year = ((np.asarray(df['year'])).reshape(12,142))
country = ((np.asarray(df['country'])).reshape(12,142))
print('show year\n', year)
print('\nshow country\n', country)
# Create a pivot table
result = df.pivot(index='country',columns='year',values='lifeExp')
# Note: This index and columns order is reversed from your code.
# This will put the year on the X axis of our heatmap
result
I removed the labels code block.
Notes on the sb.heatmap function:
I used plt.get_cmap() with a second argument to restrict the number of colors in your mapping (plt.cm.get_cmap(), which older examples use, has been removed in recent matplotlib releases). If you want to use the entire colormap spectrum, just drop it and pass cmap='RdYlGn' as you had it originally.
fmt = "f": this is the float format for your lifeExp values.
cbar_kws: you can use this to play around with the size, label and orientation of your color bar.
# Define the plot - feel free to modify however you want
plt.figure(figsize=[20, 50])
# Set the font size of the title
title = 'GapMinder Heat Map'
plt.title(title, fontsize=24)
ax = sb.heatmap(result, annot=True, fmt='f', linewidths=.5,
                cmap=plt.get_cmap('RdYlGn', 7),
                cbar_kws={'label': 'Life Expectancy', 'shrink': 0.5})
# This sets the label of your color bar to size 20
ax.figure.axes[-1].yaxis.label.set_size(20)
plt.show()
[Screenshot: partial view of the heatmap, cropped because the plot is so large.]
[Screenshot: bottom of the plot showing the year axis, slightly zoomed in.]
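If custom annotations are still wanted, one option is to build them from the pivot table itself, so that the label array has the same shape as the data by construction. The sketch below uses a tiny made-up table in place of the gapminder file; the country names and values are invented for illustration.

```python
import pandas as pd

# Made-up stand-in for the gapminder data (values are illustrative only).
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1952, 1957, 1952, 1957],
    "lifeExp": [30.0, 32.5, 60.25, 61.0],
})

# One row per country, one column per year.
result = df.pivot(index="country", columns="year", values="lifeExp")

# Build one text label per cell; the shape matches result by construction.
labels = result.round(1).astype(str).to_numpy()

print(result.shape, labels.shape)  # (2, 2) (2, 2)
```

An array like labels could then be passed as sns.heatmap(result, annot=labels, fmt=''), where the empty fmt tells seaborn to use the strings as they are.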
I have a random list of 0 and 1 with a length > 300. I would like to plot the list with 1 as green and 0 as red as shown in the below pic. What is the best way to do this in matplotlib?
You can use a matplotlib table:
import matplotlib.pyplot as plt
data = [0,1,0,1,1,0] # Setup data list
fig, ax = plt.subplots(figsize=(len(data)*0.5, 0.5)) # Setup figure
ax.axis("off") # Just want table, no actual plot
# Create table, with our data array as the single row, and consuming the whole figure space
t = ax.table(cellText=[data], loc="center", cellLoc="center", bbox=[0,0,1,1])
# Iterate over cells to colour them based on value
for idx, cell in t.get_celld().items():
    # idx is (row, column); the column selects the matching data entry
    c = 'g' if data[idx[1]] == 1 else 'r'
    cell.set_edgecolor(c)
    cell.set_facecolor(c)
plt.show()
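For a list with several hundred entries, the table approach creates one cell object per value. An alternative sketch, which is not from the answer above, draws the list as a one-row image with a two-colour colormap. The Agg backend and the output file name strip.png are assumptions made so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

data = [0, 1, 0, 1, 1, 0] * 50  # 300 values standing in for the real list

fig, ax = plt.subplots(figsize=(12, 0.6))
# Treat the list as a one-row image: 0 maps to red, 1 maps to green.
cmap = ListedColormap(["red", "green"])
image = ax.imshow([data], cmap=cmap, vmin=0, vmax=1, aspect="auto")
ax.axis("off")
fig.savefig("strip.png", bbox_inches="tight")  # hypothetical output file
```

With an interactive backend, plt.show() can replace the savefig call. Unlike the table, this version does not print the 0/1 text inside each cell.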
I have the following lists:
input = ['"25', '"500', '"10000', '"200000', '"1000000']
inComp = ['0.000001', '0.0110633', '4.1396405', '2569.270532', '49085.86398']
quickrComp=['0.0000001', '0.0003665', '0.005637', '0.1209121', '0.807273']
quickComp = ['0.000001', '0.0010253', '0.0318653', '0.8851902', '5.554448']
mergeComp = ['0.000224', '0.004089', '0.079448', '1.973014', '13.034443']
I need to create a trend graph to demonstrate the growth of the values of inComp, quickrComp, quickComp, mergeComp as the input values grow (input is the x-axis). I am using matplotlib.pyplot, and the following code:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(input,quickrComp, label="QR")
ax.plot(input,mergeComp, label="merge")
ax.plot(input, quickComp, label="Quick")
ax.plot(input, inComp, label="Insertion")
ax.legend()
plt.show()
However, what is happening is this: the values of the y-axis are disordered; the values of quickrComp on the y-axis are first inserted; then all mergeComp values and so on. I need the y-axis values to start at 0 and end at the highest of the 4-row values. How can I do this?
Two things. First, your y-values are strings; you need to convert the data to a numeric (float) type. Second, the y-values in one of the lists are huge compared to the other three, so you will have to make the y-scale logarithmic to see the trend. You can, in principle, convert your x-values to numbers (integers) as well, but in your example you don't need to. If you want to do that, you will also have to remove the " from the front of each x-value.
A word of caution: don't give your variables the same names as built-in functions. In your case, you should rename input to something else, input1 for instance.
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
input1 = ['"25', '"500', '"10000', '"200000', '"1000000']
inComp = ['0.000001', '0.0110633', '4.1396405', '2569.270532', '49085.86398']
quickrComp=['0.0000001', '0.0003665', '0.005637', '0.1209121', '0.807273']
quickComp = ['0.000001', '0.0010253', '0.0318653', '0.8851902', '5.554448']
mergeComp = ['0.000224', '0.004089', '0.079448', '1.973014', '13.034443']
ax.plot(input1, list(map(float, quickrComp)), label="QR")
ax.plot(input1, list(map(float, mergeComp)), label="merge")
ax.plot(input1, list(map(float, quickComp)), label="Quick")
ax.plot(input1, list(map(float, inComp)), label="Insertion")
ax.set_yscale('log')
ax.legend()
plt.show()
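The optional x-value conversion mentioned above is not shown in the code. A small sketch of it follows, assuming the only stray character is the leading double quote; the names sizes and times are illustrative.

```python
input1 = ['"25', '"500', '"10000', '"200000', '"1000000']
inComp = ['0.000001', '0.0110633', '4.1396405', '2569.270532', '49085.86398']

# Strip the stray leading quote, then convert to numbers.
sizes = [int(s.lstrip('"')) for s in input1]
times = [float(t) for t in inComp]

print(sizes)       # [25, 500, 10000, 200000, 1000000]
print(max(times))  # 49085.86398
```

With numeric x-values, ax.plot(sizes, times) places the points at their true positions instead of at evenly spaced categories, and ax.set_xscale('log') can be added alongside ax.set_yscale('log') since the sizes span several orders of magnitude.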
I have an assignment that only allows matplotlib and basic python. I am unable to produce the bar chart required. Although anaconda has identified the problematic line, I am unable to understand it.
The data source is here: https://data.gov.sg/dataset/bookings-for-new-flats-annual?view_id=2cdedc08-ddf6-4e0b-b279-82fdc7e678ea&resource_id=666ed30a-8344-4213-9d2e-076eeafeddd3
I have copied a sample resource and replicated it.
import numpy as np
import matplotlib.pyplot as plt
# Raw string so the backslash is not treated as an escape character
fname = r"C:\data/bookings-for-new-flats.csv"
data = np.genfromtxt(fname,
                     skip_header=1,
                     dtype=[('financial_year', 'U50'), ('no_of_units', 'i8')], delimiter=",",
                     missing_values=['na', '-'], filling_values=[0])
labels = list(set(data['financial_year']))
labels.sort()
bookings = np.arange(0,len(labels))
bookings_values = data[['financial_year','no_of_units']]
values = bookings_values['no_of_units']
units_values = {}
for i in labels:
    valuesforFY = values[bookings_values['financial_year']==i]
    print("No.of Units in FY: " + i + " is {}".format(valuesforFY))
    #the line below is critical
    units_values[i] = valuesforFY
barchart = plt.bar(list(units_values.keys()), list(units_values.values()), color='b')
plt.show()
I expected a bar chart but only received an empty one.
The system identified this line as problematic --->
barchart = plt.bar(list(units_values.keys()), list(units_values.values()), color='b')
The problem was in the y-data (the values of the dictionary): each one is a single value wrapped in an array, so plt.bar was receiving a list of arrays rather than a list of numbers.
The solution is to iterate over the values and keep only the element at index 0 of each array. For readability, I rewrite your code by first extracting the x-values into xdata and then the y-values into ydata.
xdata = list(units_values.keys())
ydata = [i[0] for i in units_values.values()]
barchart = plt.bar(xdata, ydata, color='b')
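Taking element [0] silently drops any further rows if a financial year appears more than once in the file. Whether that can happen in this dataset is not stated, so the following is only a sketch of one defensive variant that sums each group instead. Plain lists stand in for the NumPy arrays, and the years and counts are made up.

```python
# Lists stand in for the small NumPy arrays produced by the boolean mask;
# the years and counts below are made up for illustration.
units_values = {
    "2008": [8000],
    "2009": [9500],
    "2010": [7000, 500],  # hypothetical year that appears in two rows
}

xdata = list(units_values.keys())
# sum() collapses each group to one number, even if a year has several rows.
ydata = [int(sum(v)) for v in units_values.values()]

print(xdata)  # ['2008', '2009', '2010']
print(ydata)  # [8000, 9500, 7500]
```

The same list comprehension works on 1-D NumPy arrays, so xdata and ydata can be passed to plt.bar exactly as in the answer above.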