I am very new to Python, pandas, and matplotlib, but I have recently been using them to create box-and-whisker plots. I would like to create a box-and-whisker plot from a specific column of data on each sheet of a workbook: I have 17 sheets, each with columns called HMB and DV. I want to plot the 17 HMB data sets on one box-and-whisker plot and the 17 DV data sets on another. Below is what I have so far.
I can open the file and get all the sheets into list_dfs, but I don't know where to go from there. I was going to slice each data set manually (as I started to do below before coming here for help), but when I have more data in the future, I don't want to have to do that by hand. Any help would be greatly appreciated!
import pandas as pd
import numpy as np
import xlrd
import matplotlib.pyplot as plt
%matplotlib inline
from pandas import ExcelWriter
from pandas import ExcelFile
from pandas import DataFrame
excel_file = 'Project File Merger.xlsm'
list_dfs = []
xls = xlrd.open_workbook(excel_file,on_demand=True)
for sheet_name in xls.sheet_names():
    df = pd.read_excel(excel_file,sheet_name)
    list_dfs.append(df)
d_psppm = {}
for i, sheet_name in enumerate(xls.sheet_names()):
    df = pd.read_excel(excel_file,sheet_name)
    d_psppm["PSPPM" + str(i)] = df.loc[:,['PSPPM']]
values_list = list(d_psppm.values())
print(values_list[:])
The output looks like the sample below, with 17 list entries that each have a different number of rows.
PSPPM
0 0.246769
1 0.599589
2 0.082420
3 0.250000
4 0.205140
5 0.850000,
PSPPM
0 0.500887
1 0.475255
2 0.472711
3 0.412953
4 0.415883
5 0.703716,...
The next thing I want to do is create a single box-and-whisker plot containing 17 boxes. I am not sure how to plot the dictionary so that the values are the data and the indices are the names. I have tried to work out how to convert the dictionary to a list and then plot each element in the list, but have had no luck.
Thanks for the help!
I agree with @Alex that forming your columns into a new DataFrame and then plotting from that would be a good approach. However, if you're going to use the dict, then it should look something like this. Depending on the version of Python you're using, the dictionary may be unordered (insertion order is only guaranteed from Python 3.7), so if the ordering on the plot is important to you, you might want to create a list of dictionary keys in the order you want and iterate over that instead:
import matplotlib.pyplot as plt
import numpy as np
#colours = []#list of colours here, if you want
#markers = []#list of markers here, if you want
fig, ax = plt.subplots()
for idx, k in enumerate(d_psppm, 1):
    data = d_psppm[k]
    jitter = np.random.normal(0, 0.1, data.shape[0]) + idx
    ax.scatter(jitter,
               data,
               s=25,#size of the marker
               c="r",#colour, could be from colours
               alpha=0.35,#opacity, 1 being solid
               marker="^",#or ref. to markers, e.g. markers[idx]
               edgecolors="none"#removes black border
               )
As per Alex's suggestion, you could use the data to create a seaborn boxplot and overlay a swarmplot to show the data (whether this is practical depends on how many rows each sheet has).
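The "new DataFrame" route mentioned above could be sketched roughly as follows. This is only a minimal sketch using plain matplotlib, not code from the question: the dictionary is filled with made-up numbers standing in for the per-sheet PSPPM columns, and the names `combined` and `box_data` are assumptions introduced for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for d_psppm: one single-column DataFrame per sheet, each with a
# different number of rows (made-up values, not the real data).
rng = np.random.default_rng(0)
d_psppm = {
    "PSPPM" + str(i): pd.DataFrame({"PSPPM": rng.random(5 + i)})
    for i in range(3)
}

# Put each sheet's column side by side; shorter sheets are padded with NaN.
combined = pd.concat(
    {name: frame["PSPPM"] for name, frame in d_psppm.items()}, axis=1
)

# Matplotlib's boxplot does not skip NaN, so drop it column by column.
box_data = [combined[name].dropna().to_numpy() for name in combined.columns]

fig, ax = plt.subplots()
ax.boxplot(box_data)  # boxes are drawn at x positions 1..N by default
ax.set_xticks(list(range(1, len(box_data) + 1)))
ax.set_xticklabels(list(combined.columns))
ax.set_ylabel("PSPPM")
```

With the real data, the stand-in dictionary would simply be replaced by the `d_psppm` built from the workbook, giving one box per sheet.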
I have a DataFrame with multi-index rows and I would like to create a heatmap without repeating the row labels, just as the index appears when pandas displays the DataFrame. Here is code to replicate my problem:
import pandas as pd
from matplotlib import pyplot as plt
import random
import seaborn as sns
%matplotlib inline
df = pd.DataFrame({'Occupation':['Economist','Economist','Economist','Engineer','Engineer','Engineer',
                                 'Data Scientist','Data Scientist','Data Scientist'],
                   'Sex':['Female','Male','Both']*3, 'UK':random.sample(range(-10,10),9),
                   'US':random.sample(range(-10,10),9),'Brazil':random.sample(range(-10,10),9)})
df = df.set_index(['Occupation','Sex'])
df
sns.heatmap(df, annot=True, fmt="",cmap="YlGnBu")
Besides eliminating the repetition, I would like to customize the y-labels a bit, since this raw form doesn't look good to me.
Is it possible?
AFAIK there's no quick and easy way to do that within seaborn, but hopefully someone corrects me. You can do it manually by resetting the y-tick labels to just the values from level 1 of your index. Then you can loop over level 0 of your index and add a text element to your visualization at the correct location:
from collections import OrderedDict
ax = sns.heatmap(df, annot=True, cmap="YlGnBu")
ylabel_mapping = OrderedDict()
for occupation, sex in df.index:
    ylabel_mapping.setdefault(occupation, [])
    ylabel_mapping[occupation].append(sex)
hline = []
new_ylabels = []
for occupation, sex_list in ylabel_mapping.items():
    sex_list[0] = "{} - {}".format(occupation, sex_list[0])
    new_ylabels.extend(sex_list)
    if hline:
        hline.append(len(sex_list) + hline[-1])
    else:
        hline.append(len(sex_list))
ax.hlines(hline, xmin=-1, xmax=4, color="white", linewidth=5)
ax.set_yticklabels(new_ylabels)
An alternative approach involves using DataFrame styling. This leads to a super simple syntax, but you do lose the colorbar. It keeps your index and column presentation the same as in the DataFrame. Note that you'll need to be working in a notebook or somewhere else that can render HTML to view the output:
df.style.background_gradient(cmap="YlGnBu", vmin=-10, vmax=10)
I wrote a Python script to read in a distance matrix that was provided via a CSV text file. This distance matrix shows the difference between different animal species, and I'm trying to sort them in different ways (diet, family, genus, etc.) using data from another CSV file that just has one row of ordering information. The code I used is here:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as mp
dietCols = pd.read_csv("label_diet.txt", header=None)
df = pd.read_csv("distance_matrix.txt", header=None)
ax = sns.heatmap(df)
fig = ax.get_figure()
fig.savefig("fig1.png")
mp.clf()
dfDiet = pd.read_csv("distance_matrix.txt", header=None, names=dietCols)
ax2 = sns.heatmap(dfDiet, linewidths=0)
fig2 = ax2.get_figure()
fig2.savefig("fig2.png")
mp.clf()
When plotting the distance matrix, the original graph looks like this:
However, when the additional naming information is read from the text file, the graph produced only has one column and looks like this:
You can see the matrix data is being used as row labels, and I'm not sure why that would be. Some of the rows provided have no values, so they're listed as "NaN", and I'm not sure whether that is causing a problem. Is there an easy way to order this distance matrix using an external file? Any help would be appreciated!
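One possible way to approach the reordering is sketched below. It rests on assumptions that are not confirmed by the question: that the label file holds exactly one label per matrix row, in the same order as the rows, and that the matrix is square. The file contents here are made up, and the NaN rows mentioned above are not handled.

```python
import io
import numpy as np
import pandas as pd

# Made-up stand-ins for label_diet.txt and distance_matrix.txt.
label_file = io.StringIO("herbivore,carnivore,herbivore,omnivore\n")
matrix_file = io.StringIO(
    "0,5,1,3\n"
    "5,0,6,4\n"
    "1,6,0,2\n"
    "3,4,2,0\n"
)

# The label file has a single row, so take that row as a flat list of strings.
labels = pd.read_csv(label_file, header=None).iloc[0].tolist()
dist = pd.read_csv(matrix_file, header=None)

# A stable sort of the positions by label keeps equal labels in file order.
order = np.argsort(labels, kind="stable")

# Apply the same order to rows and columns so the matrix stays symmetric.
sorted_dist = dist.iloc[order, order].copy()
sorted_labels = [labels[i] for i in order]
sorted_dist.index = sorted_labels
sorted_dist.columns = sorted_labels
```

Passing `sorted_dist` to `sns.heatmap` should then draw the matrix grouped by diet, with the labels on both axes.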
I have a dataframe like this, and I want to make scatter plots from it.
I want a scatter plot of Value1, but wherever Value2 drops below 0.6, I want the corresponding Value1 points marked in red; otherwise the default color is fine.
Any suggestions?
Add another column with color information:
import matplotlib.cm as cm
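df['color'] = [int(value < 0.6) for value in df.Value2]
# x must be a column name, so expose the index as a column first
# (reset_index names it 'index' when the index itself is unnamed)
df.reset_index().plot.scatter(x='index', y='Value1', c='color', cmap=cm.jet)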
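The same idea can also be sketched with matplotlib directly, passing one color per point instead of a colormap. The data below is made up to stand in for the question's dataframe, and `point_colors` is a name introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data standing in for the question's dataframe.
df = pd.DataFrame({
    "Value1": [1.0, 2.5, 1.8, 3.2, 2.9],
    "Value2": [0.9, 0.5, 0.7, 0.3, 0.6],
})

# Red where Value2 is below 0.6, the first default cycle color ("C0") otherwise.
point_colors = np.where(df["Value2"] < 0.6, "red", "C0").tolist()

fig, ax = plt.subplots()
ax.scatter(df.index.to_numpy(), df["Value1"].to_numpy(), c=point_colors)
ax.set_xlabel("index")
ax.set_ylabel("Value1")
```

Note that a Value2 of exactly 0.6 keeps the default color here, since the comparison is strict.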
I use seaborn's lmplot (an advanced scatterplot tool) for that.
You can add a new column named "Category" to your spreadsheet file. It's very easy to categorize variables in Excel or OpenOffice.
(The rule is something like this: if cell_value < 0.6 then low; if cell_value > 0.6 then high.)
So your test data should look like this:
Then you can import the data into Python (I use Anaconda 3.5 with Spyder: Python 3.6). I saved the file in .txt format, but any other format is possible (.csv etc.).
#Import libraries
import seaborn as sns
import pandas as pd
import numpy as np
import os
#Open data.txt which is stored in a repository
os.chdir(r'C:\Users\DarthVader\Desktop\Graph')
f = open('data.txt')
#Get data in a list splitting by semicolon
data = []
for l in f:
    v = l.strip().split(';')
    data.append(v)
f.close()
#Convert list as dataframe for plot purposes
df = pd.DataFrame(data, columns = ['ID', 'Value', 'Value2','Category'])
#pop out first row with header
df2 = df.iloc[1:].copy()  # copy so the assignment below does not act on a slice of df
#Change variables to be plotted as numeric types
df2[['Value','Value2']] = df2[['Value','Value2']].apply(pd.to_numeric)
#Make plot with red color with values below 0.6 and green color with values above 0.6
sns.lmplot( x="Value", y="Value2", data=df2, fit_reg=False, hue='Category', legend=False, palette=dict(high="#2ecc71", low="#e74c3c"))
Your output should look like this.
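As a possible shortcut, the manual file parsing and the spreadsheet step could be replaced by pandas itself. The sketch below assumes a semicolon-separated file with a header row and without the Category column; the file contents are made up, and treating a value of exactly 0.6 as "high" is an assumption, since the rule above leaves that case open.

```python
import io
import numpy as np
import pandas as pd

# Made-up stand-in for data.txt: semicolon-separated, with a header row.
raw = io.StringIO(
    "ID;Value;Value2\n"
    "1;10.0;0.45\n"
    "2;12.5;0.80\n"
    "3;9.1;0.59\n"
    "4;11.3;0.60\n"
)

# read_csv splits on ';', takes the first row as the header and parses numbers.
df2 = pd.read_csv(raw, sep=";")

# Derive the category in pandas instead of adding it by hand in a spreadsheet.
# Exactly 0.6 falls into "high" here (an assumption, see above).
df2["Category"] = np.where(df2["Value2"] < 0.6, "low", "high")
```

The resulting `df2` has the same column names as in the answer, so it could be passed to the `sns.lmplot` call unchanged.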
I am really new to Python, but I need to use an existing IPython notebook from my professor to analyze a dataset (using Python 2). My data is in a .txt document: a list of numbers with "," as the decimal separator. I managed to import this list and plot it, so all is good up to here.
My problem now is:
I want an index (year) on the x-axis of my chart starting at 563 for the first value going till 1995 for the last value (there are 1,433 data points in total). How can I add this index to the list without touching the original data?
Here is the code I use:
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure(figsize=(15,4))
import os
D = open(os.path.expanduser("~/MY_FILE_DIRECTORY/Data.txt"))
Dat = D.read().replace(',','.')
Dat = [float(x) for x in Dat.split('\n')]
D.close()
plt.subplot(1, 1, 1)
plt.plot(Dat, 'b-')
cutmin = 0
cutmax = 1420
plt.axvline(cutmin, color = 'red')
plt.axvline(cutmax, color = 'red')
plt.grid()
Please help me! :-)
I suppose that by index you mean x-axis labels for your data, which are different from the x-coordinates of your actual data (which you do not want to modify). You also say that these indices are years from 563 to 1995. The xticks() function allows you to change the locations and labels of the tick marks on your x-axis, so you can add these two lines to your code.
index = np.arange(563, 1996, 1, dtype=np.int32)
plt.xticks( index )
Hope this is what you wanted.
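Another option, sketched here as an alternative rather than as part of the answer above, is to pass the years as the x-coordinates when plotting, which leaves the list `Dat` itself untouched. The data is synthetic, `first_year` is a name introduced for illustration, and the sketch assumes a recent NumPy on Python 3 rather than the Python 2 notebook from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for Dat: 1,433 values, as in the question.
rng = np.random.default_rng(0)
Dat = rng.normal(size=1433).tolist()

# One year per data point, starting at 563; Dat itself is not modified.
first_year = 563
years = np.arange(first_year, first_year + len(Dat))

fig, ax = plt.subplots(figsize=(15, 4))
ax.plot(years, Dat, "b-")

# The cut positions are list positions, so shift them onto the year axis.
cutmin, cutmax = 0, 1420
ax.axvline(first_year + cutmin, color="red")
ax.axvline(first_year + cutmax, color="red")
ax.set_xlabel("year")
ax.grid()
```

Because 563 through 1995 inclusive is exactly 1,433 years, the last point lands on 1995 and matplotlib picks readable tick positions on its own.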
I am using pandas read_excel to create a histogram or line plot. I would like to read in the entire file. It is a large file and I only want to plot certain values from it. I know how to use skiprows and parse_cols in read_excel, but if I do this, it does not read a part of the file that I need to use for the axis labels. I also do not know how to tell it what to plot for the x-values and what to plot for the y-values. Here's what I have:
df=pd.read_excel('JanRain.xlsx',parse_cols="C:BD")
years=df[0]
precip=df[31:32]
df.plot.bar()
I want the x-axis to be row 1 of the Excel file (years) and I want each bar in the bar graph to be one of the values in row 31 of the Excel file. I'm not sure how to isolate this. Would it be easier to read with pandas and then plot with matplotlib?
Here is a sample of the Excel file. The first row is years and the second column is days of the month (this file is only for 1 month):
Here's how I would plot the data in row 31 of a large dataframe, setting row 0 as the x-axis. (updated answer)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Create a random array with 32 rows and 10 columns:
df = pd.DataFrame(np.random.rand(320).reshape(32,10), columns=range(64,74), index=range(1,33))
df.to_excel(r"D:\data\data.xlsx")
Read only the columns and rows that you want using "parse_cols" and "skiprows". (In newer pandas versions, "parse_cols" has been replaced by "usecols".) The first column in this example is the dataframe index.
# load desired columns and rows into a dataframe
# in this method, I first make a list of all skipped_rows
desired_cols = [0] + list(range(2,9))
skipped_rows = list(range(1,33))
skipped_rows.remove(31)
df = pd.read_excel(r"D:\data\data.xlsx", index_col=0, parse_cols=desired_cols, skiprows=skipped_rows)
This yields a dataframe with only one row:
65 66 67 68 69 70 71
31 0.310933 0.606858 0.12442 0.988441 0.821966 0.213625 0.254897
Isolate only the row that you want to plot, giving a pandas.Series with the original column headers as the index:
ser = df.loc[31, :]
Plot the series.
fig, ax = plt.subplots()
ser.plot(ax=ax)
ax.set_xlabel("year")
ax.set_ylabel("precipitation")
fig, ax = plt.subplots()
ser.plot(kind="bar", ax=ax)
ax.set_xlabel("year")
ax.set_ylabel("precipitation")
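On the question of plotting with matplotlib after reading with pandas: a minimal sketch of that variant is shown below. It skips the Excel round trip and uses a made-up dataframe whose columns are years and whose index is the day of the month; the year range and the name `day` are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the sheet: columns are years, index is day of the month.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((31, 5)), columns=range(1964, 1969), index=range(1, 32)
)

# Select one day across all years; the result is a Series indexed by year.
day = 31
ser = df.loc[day]

# Hand the years and the values to matplotlib explicitly.
fig, ax = plt.subplots()
ax.bar(ser.index.to_numpy(), ser.to_numpy())
ax.set_xlabel("year")
ax.set_ylabel("precipitation")
```

This draws the same kind of bar chart as `ser.plot(kind="bar", ax=ax)`, but makes the choice of x-values and y-values explicit.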