Add Year Index to a list - python

I am really new to Python, but I need to use an already existing iPython notebook from my professor for analyzing a dataset (using Python 2). The data I have is in a .txt document and is a list of numbers with a "," as decimal separator. I managed to import this list and plot it; all good so far.
My problem now is:
I want an index (year) on the x-axis of my chart starting at 563 for the first value going till 1995 for the last value (there are 1,433 data points in total). How can I add this index to the list without touching the original data?
Here is the code I use:
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure(figsize=(15,4))
import os
D = open(os.path.expanduser("~/MY_FILE_DIRECTORY/Data.txt"))
Dat = D.read().replace(',','.')
Dat = [float(x) for x in Dat.split('\n')]
D.close()
plt.subplot(1, 1, 1)
plt.plot(Dat, 'b-')
cutmin = 0
cutmax = 1420
plt.axvline(cutmin, color = 'red')
plt.axvline(cutmax, color = 'red')
plt.grid()
Please help me! :-)

I suppose that by index you mean x-axis labels for your data, which are different from the x-coordinates of the data itself (which you do not want to modify). You also say that these indices are years from 563 to 1995. The xticks() function lets you change the locations and labels of the tick marks on your x-axis: its first argument gives the tick positions in data coordinates (0 to 1432 here) and its second gives the label to show at each position. So you can add these two lines to your code, which label every 100th point with its year:
index = np.arange(563, 1996, 1, dtype=np.int32)
plt.xticks(np.arange(0, len(Dat), 100), index[::100])
Hope this is what you wanted.
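Another option, sketched here under the assumption that the years run consecutively with one value per year, is to build a separate list of years and pass it to plot() as the x-values. The data list itself is never modified. The helper names year_axis and plot_by_year are made up for this illustration.

```python
def year_axis(n_points, first_year=563):
    """Return one consecutive year per data point, starting at first_year."""
    return list(range(first_year, first_year + n_points))


def plot_by_year(values, first_year=563):
    """Plot values against their years; `values` is left unchanged."""
    # Imported here so that year_axis() can be used without matplotlib.
    import matplotlib.pyplot as plt

    years = year_axis(len(values), first_year)
    fig, ax = plt.subplots(figsize=(15, 4))
    ax.plot(years, values, 'b-')
    ax.set_xlabel('Year')
    ax.grid()
    return fig, ax
```

With the 1,433 values from the question, year_axis(len(Dat)) runs from 563 to 1995, and vertical markers such as axvline would then take a year instead of a position.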

Related

Highlight part of scatter plot containing specific points in python

I am trying to create a Manhattan plot that will be vertically highlighted at certain parts of the plot given a list of values corresponding to points in the scatter plot. I looked at several examples but I am not sure how to proceed. I think using axvspan or ax.fill_between should work but I am not sure how. The code below was lifted directly from
How to create a Manhattan plot with matplotlib in python?
from pandas import DataFrame
from scipy.stats import uniform
from scipy.stats import randint
import numpy as np
import matplotlib.pyplot as plt
# some sample data
df = DataFrame({'gene' : ['gene-%i' % i for i in np.arange(10000)],
'pvalue' : uniform.rvs(size=10000),
'chromosome' : ['ch-%i' % i for i in randint.rvs(0,12,size=10000)]})
# -log_10(pvalue)
df['minuslog10pvalue'] = -np.log10(df.pvalue)
df.chromosome = df.chromosome.astype('category')
df.chromosome = df.chromosome.cat.set_categories(['ch-%i' % i for i in range(12)], ordered=True)
df = df.sort_values('chromosome')
# How to plot gene vs. -log10(pvalue) and colour it by chromosome?
df['ind'] = range(len(df))
df_grouped = df.groupby(('chromosome'))
fig = plt.figure()
ax = fig.add_subplot(111)
colors = ['red','green','blue', 'yellow']
x_labels = []
x_labels_pos = []
for num, (name, group) in enumerate(df_grouped):
    group.plot(kind='scatter', x='ind', y='minuslog10pvalue',color=colors[num % len(colors)], ax=ax)
    x_labels.append(name)
    x_labels_pos.append((group['ind'].iloc[-1] - (group['ind'].iloc[-1] - group['ind'].iloc[0])/2))
ax.set_xticks(x_labels_pos)
ax.set_xticklabels(x_labels)
ax.set_xlim([0, len(df)])
ax.set_ylim([0, 3.5])
ax.set_xlabel('Chromosome')
Given a list of values for the points (pvalues), e.g.
lst = [0.288686, 0.242591, 0.095959, 3.291343, 1.526353]
How do I highlight the region containing these points on the plot just as shown in green in the image below? Something similar to:
It would help if you provided a sample of your dataframe for reference.
Assuming you want to match your lst values with Y values, you need to iterate through each Y value you're plotting and check whether it is in lst.
for num, (name, group) in enumerate(df_grouped):
The group variable in this loop is a partial dataframe of your main dataframe, df. Hence, you need another loop to look through all of its Y values for matches with lst:
region_plot = []
# df_grouped is the groupby object from the question's code
for num, (name, group) in enumerate(df_grouped):
    group.plot(kind='scatter', x='ind', y='minuslog10pvalue',color=colors[num % len(colors)], ax=ax)
    # create a new df to get only rows that have matched values with lst
    temp_group = group[group['minuslog10pvalue'].isin(lst)]
    for x_group in temp_group['ind']:
        # if condition to make sure the same region is not highlighted again
        if x_group not in region_plot:
            region_plot.append(x_group)
            # I put x_group+1 because I'm not sure how big of a highlight range you want
            ax.axvspan(x_group, x_group+1, alpha=0.5, color='green')
Hope this helps!
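One caveat with isin() is that it compares floats exactly, so values typed in with six decimals may not match the stored ones. Here is a minimal sketch of a tolerant lookup, assuming a relative tolerance of 1e-6 is acceptable. The function name and the tolerance are illustrative choices, not part of the answer above.

```python
import math


def x_positions_to_highlight(xs, ys, targets, rel_tol=1e-6):
    """Return the x positions whose y value is close to any target value."""
    return [
        x for x, y in zip(xs, ys)
        if any(math.isclose(y, t, rel_tol=rel_tol) for t in targets)
    ]


# Example with made-up numbers: only the second point is near a target.
xs = [10, 11, 12]
ys = [0.1, 0.2886860001, 3.0]
lst = [0.288686, 3.291343]
hits = x_positions_to_highlight(xs, ys, lst)
```

With group['ind'] and group['minuslog10pvalue'] as xs and ys, each returned position can then be passed to ax.axvspan(x, x + 1, alpha=0.5, color='green').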

TypeError: Image data of dtype object cannot be converted to float - Issue with HeatMap Plot using Seaborn

I'm getting the error:
TypeError: Image data of dtype object cannot be converted to float
when I try to run the heatmap function in the code below:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Read the data
df = pd.read_csv("gapminder-FiveYearData.csv")
print(df.head(10))
# Create an array of n-dimensional array of life expectancy changes for countries over the years.
year = ((np.asarray(df['year'])).reshape(12,142))
country = ((np.asarray(df['country'])).reshape(12,142))
print(year)
print(country)
# Create a pivot table
result = df.pivot(index='year',columns='country',values='lifeExp')
print(result)
# Create an array to annotate the heatmap
labels = (np.asarray(["{1:.2f} \n {0}".format(year,value)
for year, value in zip(year.flatten(),
country.flatten())])
).reshape(12,142)
# Define the plot
fig, ax = plt.subplots(figsize=(15,9))
# Add title to the Heat map
title = "GapMinder Heat Map"
# Set the font size and the distance of the title from the plot
plt.title(title,fontsize=18)
ttl = ax.title
ttl.set_position([0.5,1.05])
# Hide ticks for X & Y axis
ax.set_xticks([])
ax.set_yticks([])
# Remove the axes
ax.axis('off')
# Use the heatmap function from the seaborn package
hmap = sns.heatmap(result,annot=labels,fmt="",cmap='RdYlGn',linewidths=0.30,ax=ax)
# Display the Heatmap
plt.imshow(hmap)
Here is a link to the CSV file.
The data file is a dataset with 6 columns: country, year, pop, continent, lifeExp and gdpPercap. The objective of the activity is to:
Create a pivot table dataframe with year along the x-axis, country along the y-axis and lifeExp filled within the cells.
Plot a heatmap using seaborn for the pivot table that was just created.
Thanks for providing your data with this question. I believe your TypeError comes from the labels array your code creates for the annotation. Given the function's built-in annotation support, I don't think you need this extra work, and it modifies your data in a way that errors out when plotting. Note also that sns.heatmap returns a matplotlib Axes rather than image data, so passing its result to plt.imshow() raises this kind of error as well; plt.show() is enough to display the figure.
I took a stab at rewriting your project to produce a heatmap of the lifeExp pivot table by country and year. I'm also assuming that it is important for you to keep this number a float.
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
## ** UNCHANGED FROM ABOVE **
# Read in the data
df = pd.read_csv('https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv')
df.head()
## ** UNCHANGED FROM ABOVE **
# Create an array of n-dimensional array of life expectancy changes for countries over the years.
year = ((np.asarray(df['year'])).reshape(12,142))
country = ((np.asarray(df['country'])).reshape(12,142))
print('show year\n', year)
print('\nshow country\n', country)
# Create a pivot table
result = df.pivot(index='country',columns='year',values='lifeExp')
# Note: This index and columns order is reversed from your code.
# This will put the year on the X axis of our heatmap
result
I removed the labels code block.
Notes on the sb.heatmap function:
I used plt.get_cmap() with a color count to restrict the number of colors in your mapping (the older spelling plt.cm.get_cmap() has been removed from recent matplotlib releases). If you want to use the entire colormap spectrum, just remove it and pass cmap as you had it originally.
fmt = "f": this is for float, your lifeExp values.
cbar_kws: you can use this to play around with the size, label and orientation of your color bar.
# Define the plot - feel free to modify however you want
plt.figure(figsize = [20, 50])
# Set the font size and the distance of the title from the plot
title = 'GapMinder Heat Map'
plt.title(title,fontsize=24)
ax = sb.heatmap(result, annot = True, fmt='f', linewidths = .5,
                cmap = plt.get_cmap('RdYlGn', 7), cbar_kws={
                    'label': 'Life Expectancy', 'shrink': 0.5})
# This sets a label, size 20 to your color bar
ax.figure.axes[-1].yaxis.label.set_size(20)
plt.show()
limited screenshot, only b/c the plot is so large
another of the bottom of the plot to show the year axis, slightly zoomed in on my browser.
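If custom cell annotations are still wanted, one possible sketch is to build a grid of strings with the same shape as the pivot table and hand it to the annot argument together with fmt="". The helper below is hypothetical and assumes each cell should show the value to two decimals above its column key (the year).

```python
def annotation_grid(values, col_keys):
    """Build one 'value\\ncolumn-key' string per cell, keeping the grid shape."""
    return [
        ["{:.2f}\n{}".format(value, key) for value, key in zip(row, col_keys)]
        for row in values
    ]


# Made-up numbers for two countries and two years.
labels = annotation_grid([[28.80, 30.33], [55.23, 59.28]], [1952, 1957])
```

With the pivot table from the answer this would be called as annotation_grid(result.values, result.columns). It is safest to wrap the result in np.asarray() before passing it as annot, since seaborn expects the annotation data to have the same shape as the plotted data.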

Trend graph with Matplotlib

I have the following lists:
input = ['"25', '"500', '"10000', '"200000', '"1000000']
inComp = ['0.000001', '0.0110633', '4.1396405', '2569.270532', '49085.86398']
quickrComp=['0.0000001', '0.0003665', '0.005637', '0.1209121', '0.807273']
quickComp = ['0.000001', '0.0010253', '0.0318653', '0.8851902', '5.554448']
mergeComp = ['0.000224', '0.004089', '0.079448', '1.973014', '13.034443']
I need to create a trend graph to demonstrate the growth of the values of inComp, quickrComp, quickComp, mergeComp as the input values grow (input is the x-axis). I am using matplotlib.pyplot, and the following code:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(input,quickrComp, label="QR")
ax.plot(input,mergeComp, label="merge")
ax.plot(input, quickComp, label="Quick")
ax.plot(input, inComp, label="Insrção")
ax.legend()
plt.show()
However, what is happening is this: the values of the y-axis are disordered; the values of quickrComp on the y-axis are first inserted; then all mergeComp values and so on. I need the y-axis values to start at 0 and end at the highest of the 4-row values. How can I do this?
Two things. First, your y-values are strings, which matplotlib treats as categorical labels and places in the order they appear; that is why the y-axis looks unordered. You need to convert the data to a numeric (float) type. Second, the y-values in one of the lists are huge compared to those in the other three, so you will have to make the y-scale logarithmic to see the trend. You can, in principle, convert your x-values to numbers as well, but in your example you don't need to. If you want to do that, you will also have to remove the " from the front of each x-value.
A word of caution: don't give your variables the same names as built-in functions. In your case, you should rename input to something else, input1 for instance.
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
input1 = ['"25', '"500', '"10000', '"200000', '"1000000']
inComp = ['0.000001', '0.0110633', '4.1396405', '2569.270532', '49085.86398']
quickrComp=['0.0000001', '0.0003665', '0.005637', '0.1209121', '0.807273']
quickComp = ['0.000001', '0.0010253', '0.0318653', '0.8851902', '5.554448']
mergeComp = ['0.000224', '0.004089', '0.079448', '1.973014', '13.034443']
ax.plot(input1, list(map(float, quickrComp)), label="QR")
ax.plot(input1, list(map(float, mergeComp)), label="merge")
ax.plot(input1, list(map(float, quickComp)), label="Quick")
ax.plot(input1, list(map(float, inComp)), label="Insrção")
ax.set_yscale('log')
ax.legend()
plt.show()
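If numeric x-values are wanted as well, the leading quote has to be stripped before conversion. Here is a minimal sketch, assuming every size string is a double quote followed by digits. parse_sizes and plot_trend are illustrative names, and the log-scaled x-axis is an extra choice not made in the answer above.

```python
def parse_sizes(raw_sizes):
    """Strip the stray leading double quote and convert each size to int."""
    return [int(s.strip().lstrip('"')) for s in raw_sizes]


def plot_trend(sizes, series):
    """Plot each named list of timing strings against numeric sizes."""
    # Imported here so that parse_sizes() can be used without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for label, raw_values in series.items():
        ax.plot(sizes, [float(v) for v in raw_values], marker='o', label=label)
    ax.set_xscale('log')  # sizes span 25 to 1,000,000
    ax.set_yscale('log')  # timings span several orders of magnitude
    ax.legend()
    return fig, ax


sizes = parse_sizes(['"25', '"500', '"10000', '"200000', '"1000000'])
```

plot_trend(sizes, {"QR": quickrComp, "merge": mergeComp}) would then draw the curves on numeric axes, so the spacing between input sizes reflects their actual values.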

Python plotting dictionary

I am VERY new to the world of python/pandas/matplotlib, but I have been using it recently to create box and whisker plots. I was curious how to create a box and whisker plot for each sheet using a specific column of data, i.e. I have 17 sheets, and I have column called HMB and DV on each sheet. I want to plot 17 data sets on a Box and Whisker for HMB and another 17 data sets on the DV plot. Below is what I have so far.
I can open the file, and get all the sheets into list_dfs, but then don't know where to go from there. I was going to try and manually slice each set (as I started below before coming here for help), but when I have more data in the future, I don't want to have to do that by hand. Any help would be greatly appreciated!
import pandas as pd
import numpy as np
import xlrd
import matplotlib.pyplot as plt
%matplotlib inline
from pandas import ExcelWriter
from pandas import ExcelFile
from pandas import DataFrame
excel_file = 'Project File Merger.xlsm'
list_dfs = []
xls = xlrd.open_workbook(excel_file,on_demand=True)
for sheet_name in xls.sheet_names():
    df = pd.read_excel(excel_file,sheet_name)
    list_dfs.append(df)
d_psppm = {}
for i, sheet_name in enumerate(xls.sheet_names()):
    df = pd.read_excel(excel_file,sheet_name)
    d_psppm["PSPPM" + str(i)] = df.loc[:,['PSPPM']]
values_list = list(d_psppm.values())
print(values_list[:])
A sample output looks like below, for 17 list entries, but with different number of rows for each.
PSPPM
0 0.246769
1 0.599589
2 0.082420
3 0.250000
4 0.205140
5 0.850000,
PSPPM
0 0.500887
1 0.475255
2 0.472711
3 0.412953
4 0.415883
5 0.703716,...
The next thing I want to do is create a box and whisker plot, 1 plot with 17 box and whiskers. I am not sure how to get the dictionary to plot with the values and indices as the name. I have tried to dig, and figure out how to convert the dictionary to a list and then plot each element in the list, but have had no luck.
Thanks for the help!
I agree with @Alex that forming your columns into a new DataFrame and then plotting from that would be a good approach. However, if you're going to use the dict, then it should look something like this. Depending on the version of Python you're using, the dictionary may be unordered, so if the ordering on the plot is important to you, you might want to create a list of dictionary keys in the order you want and iterate over that instead.
import matplotlib.pyplot as plt
import numpy as np
#colours = []#list of colours here, if you want
#markers = []#list of markers here, if you want
fig, ax = plt.subplots()
for idx, k in enumerate(d_psppm, 1):
    data = d_psppm[k]
    jitter = np.random.normal(0, 0.1, data.shape[0]) + idx
    ax.scatter(jitter,
               data,
               s=25,  # size of the marker
               c="r",  # colour, could be from colours
               alpha=0.35,  # opacity, 1 being solid
               marker="^",  # or ref. to markers, e.g. markers[idx]
               edgecolors="none"  # removes black border
               )
As per Alex's suggestion, you could use the data to create a seaborn boxplot and overlay a swarmplot to show the data (depends on how many rows each has whether this is practical).
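For the box-and-whisker plot the question asks for, a possible sketch is to pass one sequence of values per sheet to ax.boxplot(). It assumes the dictionary keys look like 'PSPPM0', 'PSPPM1', and so on, as in the question, and that each value has already been reduced to a plain sequence of numbers (for the DataFrames in the question, something like df['PSPPM'].dropna()). The helper names are hypothetical.

```python
def keys_in_sheet_order(data, prefix="PSPPM"):
    """Sort keys by their numeric suffix so 'PSPPM2' comes before 'PSPPM10'."""
    return sorted(data, key=lambda key: int(key[len(prefix):]))


def boxplot_per_sheet(data, prefix="PSPPM"):
    """Draw one box per dictionary entry, labelled with its key."""
    # Imported here so that keys_in_sheet_order() works without matplotlib.
    import matplotlib.pyplot as plt

    keys = keys_in_sheet_order(data, prefix)
    fig, ax = plt.subplots()
    ax.boxplot([list(data[key]) for key in keys])
    ax.set_xticklabels(keys, rotation=90)
    ax.set_ylabel(prefix)
    return fig, ax


# Made-up values; only the key order matters in this example.
example = {"PSPPM10": [0.5, 0.4], "PSPPM2": [0.2, 0.8], "PSPPM0": [0.3, 0.6]}
ordered = keys_in_sheet_order(example)
```

Sorting by the numeric suffix avoids the plain alphabetical order, which would place PSPPM10 before PSPPM2.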

Matplotlib xticks as days

So I do have a simple question. I have a program that simulates a week/month in the life of a shop. For now it takes care of cash desks (I don't know if I translated that one correctly from my language), as they can fail sometimes, and a specialist has to come to the shop and repair them. At the end of the simulation, the program plots a graph which looks like this:
The 1.0 state occurs when the cashdesk has gotten some error/broke, then it waits for a technician to repair it, and then it gets back to 0, working state.
I or rather my project guy would rather see something else than minutes on the x axis. How can I do it? I mean, I would like it to be like Day 1, then an interval, Day 2, etc.
I know about pyplot.xticks() method, but it assigns the labels to the ticks that are in the list in the first argument, so then I would have to make like 2000 labels, with minutes, and I want only 7, with days written on it.
You can use the set_ticks() and get_xticklabels() methods of the matplotlib axes, inspired by this and this question.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
minutes_in_day = 24 * 60
test = pd.Series(np.random.binomial(1, 0.002, 7 * minutes_in_day))
fig, ax = plt.subplots(1)
test.plot(ax = ax)
start, end = ax.get_xlim()
ax.xaxis.set_ticks(np.arange(start, end, minutes_in_day))
labels = ['Day\n %d'%(int(item.get_text())/minutes_in_day+ 1) for item in ax.get_xticklabels()]
ax.set_xticklabels(labels)
I get something like the picture below.
You're on the right track with plt.xticks(). Try this:
import matplotlib.pyplot as plt
# Generate dummy data
x_minutes = range(1, 2001)
y = [i*2 for i in x_minutes]
# Convert minutes to days
x_days = [i/1440.0 for i in x_minutes]
# Plot the data over the newly created days list
plt.plot(x_days, y)
# Create labels using some string formatting
labels = ['Day %d' % (item) for item in range(int(min(x_days)), int(max(x_days)+1))]
# Set the tick strings
plt.xticks(range(len(labels)), labels)
# Show the plot
plt.show()
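Both answers come down to placing one tick per day and labelling it. Here is a small sketch of that calculation, assuming the x-axis is in minutes starting at 0 and a day has 1,440 minutes. day_ticks is an illustrative name.

```python
MINUTES_PER_DAY = 24 * 60


def day_ticks(total_minutes):
    """Return tick positions (in minutes) and 'Day N' labels, one per started day."""
    positions = list(range(0, total_minutes, MINUTES_PER_DAY))
    labels = ["Day %d" % (n + 1) for n in range(len(positions))]
    return positions, labels


positions, labels = day_ticks(7 * MINUTES_PER_DAY)
```

The two lists can then be applied with ax.set_xticks(positions) followed by ax.set_xticklabels(labels), which gives seven labels for a week instead of one per minute.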
