Mark points in a scatter plot when unexpected values appear - python

I have a dataframe like this, and I want to make a scatter plot of it.
I want to plot Value1, but whenever Value2 drops below 0.6, the corresponding Value1 points should be marked in red; otherwise the default color is fine.
Any suggestions?

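Add another column with color information:
import matplotlib.cm as cm
# 1 where Value2 is below 0.6, otherwise 0
df['color'] = [int(value < 0.6) for value in df.Value2]
# scatter() expects a column name for x, so turn the index into a column first
# (the new column is called 'index' when the index is unnamed)
df.reset_index().plot.scatter(x='index', y='Value1', c='color', cmap=cm.jet)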
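If you want literal red against the default color rather than a colormap, the same idea can be sketched with plain matplotlib. The sample numbers below are made up; only the column names Value1 and Value2 come from the question.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up sample data; only the column names come from the question
df = pd.DataFrame({
    "Value1": [1.0, 2.5, 1.8, 3.2, 2.9],
    "Value2": [0.9, 0.5, 0.7, 0.3, 0.6],
})

# "red" where Value2 < 0.6, matplotlib's first default color ("C0") elsewhere
colors = np.where(df["Value2"] < 0.6, "red", "C0").tolist()

fig, ax = plt.subplots()
ax.scatter(df.index, df["Value1"], c=colors)
ax.set_xlabel("index")
ax.set_ylabel("Value1")
# Call plt.show() or fig.savefig(...) to view the result
```

Note that a Value2 of exactly 0.6 keeps the default color, since the condition is strictly "below 0.6".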

I use seaborn's lmplot (an advanced scatter plot) for that.
You can add a new column named "Category" to your spreadsheet file. It's very easy to categorize variables in Excel or OpenOffice
(something like this: if cell_value < 0.6 then low, if cell_value > 0.6 then high).
So your test data should look like this:
Then you can import the data into Python (I use Anaconda 3.5 with Spyder, Python 3.6). I saved the file in .txt format, but any other format (.csv etc.) works too.
#Import libraries
import seaborn as sns
import pandas as pd
import numpy as np
import os
#Open data.txt which is stored in this directory
os.chdir(r'C:\Users\DarthVader\Desktop\Graph')
f = open('data.txt')
#Get data in a list splitting by semicolon
data = []
for l in f:
    v = l.strip().split(';')
    data.append(v)
f.close()
#Convert list as dataframe for plot purposes
df = pd.DataFrame(data, columns = ['ID', 'Value', 'Value2','Category'])
#pop out first row with header (copy so the assignment below does not hit a view)
df2 = df.iloc[1:].copy()
#Change variables to be plotted as numeric types
df2[['Value','Value2']] = df2[['Value','Value2']].apply(pd.to_numeric)
#Make plot with red color with values below 0.6 and green color with values above 0.6
sns.lmplot( x="Value", y="Value2", data=df2, fit_reg=False, hue='Category', legend=False, palette=dict(high="#2ecc71", low="#e74c3c"))
Your output should look like this.

Related

How to use two columns in x-axis

I'm using the code below to put Segment and Year on the x-axis and Final_Sales on the y-axis, but it throws an error.
CODE
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
order = pd.read_excel("Sample.xls", sheet_name = "Orders")
order["Year"] = pd.DatetimeIndex(order["Order Date"]).year
result = order.groupby(["Year", "Segment"]).agg(Final_Sales=("Sales", sum)).reset_index()
bar = plt.bar(x = result["Segment","Year"], height = result["Final_Sales"])
ERROR
Can someone help me correct my code so that the output looks like the one below?
Required Output
Try adding another pair of brackets: result[["Segment","Year"]].
What you wrote tries to retrieve a single column named ("Segment","Year"),
but what you actually want is to retrieve a list of columns, ["Segment","Year"].
There are several problems with your code:
When using several columns to index a dataframe, you need to pass a list of columns to [] (see the docs) as follows:
result[["Segment","Year"]]
From the figure you provided, it looks like you want to use year as hue. matplotlib's plt.bar doesn't have a hue argument, so you would have to build the grouping manually as described here. Instead, you can use the seaborn library, which you are already importing anyway (see https://seaborn.pydata.org/generated/seaborn.barplot.html):
sns.barplot(x = 'Segment', y = 'Final_Sales', hue = 'Year', data = result)
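If you would rather not use seaborn, one alternative is to pivot the aggregated frame and let pandas draw the grouped bars. This is only a sketch with made-up numbers; the column names follow the question's `result` frame, and the segment names are placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers shaped like the aggregated `result` frame from the question
result = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020],
    "Segment": ["Consumer", "Corporate", "Consumer", "Corporate"],
    "Final_Sales": [100.0, 80.0, 120.0, 90.0],
})

# One row per Segment, one column per Year
wide = result.pivot(index="Segment", columns="Year", values="Final_Sales")

# pandas draws one bar group per row and one bar per column
ax = wide.plot.bar(rot=0)
ax.set_ylabel("Final_Sales")
# Call plt.show() to display the figure
```

Each year becomes its own legend entry, which is roughly what the hue argument does in seaborn.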

Reordering heatmap from seaborn using column info from additional text file

I wrote a Python script to read in a distance matrix that was provided via a CSV text file. This distance matrix shows the difference between different animal species, and I'm trying to sort them in different ways (diet, family, genus, etc.) using data from another CSV file that just has one row of ordering information. The code used is here:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as mp
dietCols = pd.read_csv("label_diet.txt", header=None)
df = pd.read_csv("distance_matrix.txt", header=None)
ax = sns.heatmap(df)
fig = ax.get_figure()
fig.savefig("fig1.png")
mp.clf()
dfDiet = pd.read_csv("distance_matrix.txt", header=None, names=dietCols)
ax2 = sns.heatmap(dfDiet, linewidths=0)
fig2 = ax2.get_figure()
fig2.savefig("fig2.png")
mp.clf()
When plotting the distance matrix, the original graph looks like this:
However, when the additional naming information is read from the text file, the graph produced only has one column and looks like this:
You can see the matrix data is being used as row labeling, and I'm not sure why that would be. Some of the rows provided have no values so they're listed as "NaN", so I'm not sure if that would be causing a problem. Is there any easy way to order this distance matrix using an exterior file? Any help would be appreciated!
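One possible approach, sketched with made-up data, is to read the matrix without names, compute a sort order from the labels, and apply the same permutation to rows and columns. The 4x4 matrix and the diet labels below are invented for illustration; with real files they would come from the two `read_csv` calls.

```python
import numpy as np
import pandas as pd

# Made-up 4x4 symmetric distance matrix and one diet label per species
dist = pd.DataFrame([
    [0.0, 0.9, 0.2, 0.8],
    [0.9, 0.0, 0.7, 0.1],
    [0.2, 0.7, 0.0, 0.6],
    [0.8, 0.1, 0.6, 0.0],
])
diet = ["herbivore", "carnivore", "herbivore", "carnivore"]

# Stable sort of positions by label, so equal labels keep their file order
order = np.argsort(diet, kind="stable")

# Apply the same permutation to rows and columns, then label both axes
reordered = dist.iloc[order, order]
reordered.index = [diet[i] for i in order]
reordered.columns = [diet[i] for i in order]

# sns.heatmap(reordered) would then draw the matrix grouped by diet
```

Because rows and columns are permuted together, the matrix stays symmetric and species with the same label end up next to each other.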

Newbie Matplotlib and Pandas Plotting from CSV file

I haven't had much training with Matplotlib at all, and this really seems like a basic plotting application, but I'm getting nothing but errors.
Using Python 3, I'm simply trying to plot historical stock price data from a CSV file, using the date as the x axis and prices as the y. The data CSV looks like this:
(only just now noticing the big gap in times, but whatever)
import glob
import pandas as pd
import matplotlib.pyplot as plt
def plot_test():
    files = glob.glob('./data/test/*.csv')
    for file in files:
        df = pd.read_csv(file, header=1, delimiter=',', index_col=1)
        df['close'].plot()
        plt.show()
plot_test()
I'm using glob for now just to identify any CSV file in that folder, but I've also tried just designating one specific CSV filename and get the same error:
KeyError: 'close'
I've also tried just designating a specific column number to only plot one particular column instead, but I don't know what's going on.
Ideally, I would like to plot it just like real stock data, with everything on the same graph: volume at the bottom on its own axis, open/high/low/close on the y-axis, and date on the x-axis for every row in the file. I've tried a few different solutions but can't seem to figure it out. I know this has probably been asked before, and I've tried lots of solutions from SO and elsewhere, but mine keeps getting stuck. Thanks so much for the newbie help!
According to the pandas documentation, the header kwarg should be 0 for your CSV, as the first row contains the column names. What is happening is that the DataFrame you are building doesn't have a close column, because it is taking the headers from the second row. It will probably work fine if you remove the header kwarg or change it to header=0. The same goes for the other kwargs: there is no need to define them. A simple df = pd.read_csv(file) will do just fine.
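To see the effect of the header kwarg in isolation, here is a small sketch that reads the same made-up CSV text twice. The column names are assumed from the question and the values are invented.

```python
import io
import pandas as pd

# Made-up CSV text; column names assumed from the question, values invented
csv_text = (
    "timestamp,open,high,low,close,volume\n"
    "2018-01-02 09:30,10.0,10.5,9.8,10.2,1000\n"
    "2018-01-02 09:31,10.2,10.6,10.1,10.4,1500\n"
)

# Default (header=0): the first line supplies the column names
good = pd.read_csv(io.StringIO(csv_text))

# header=1: the second line is used as the header, so 'close' is not a column
bad = pd.read_csv(io.StringIO(csv_text), header=1)

print(list(good.columns))
print("close" in bad.columns)
```

With header=1 the first data row is consumed as column names, which is why df['close'] raises a KeyError.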
You can prettify this according to your needs
import pandas
import matplotlib.pyplot as plt
def plot_test(file):
    df = pandas.read_csv(file)
    # convert timestamp
    df['timestamp'] = pandas.to_datetime(df['timestamp'], format = '%Y-%m-%d %H:%M')
    # plot prices; plot() accepts datetime x-values directly
    # (plot_date is deprecated in recent matplotlib)
    ax1 = plt.subplot(211)
    ax1.plot(df['timestamp'], df['open'], '-', label = 'open')
    ax1.plot(df['timestamp'], df['close'], '-', label = 'close')
    ax1.plot(df['timestamp'], df['high'], '-', label = 'high')
    ax1.plot(df['timestamp'], df['low'], '-', label = 'low')
    ax1.legend()
    # plot volume
    ax2 = plt.subplot(212)
    # issue: https://github.com/matplotlib/matplotlib/issues/9610
    df.set_index('timestamp', inplace = True)
    # keep the converted dates; the conversion does not modify the index in place
    dates = df.index.to_pydatetime()
    ax2.bar(dates, df['volume'], width = 1e-3)
    ax2.xaxis_date()
    plt.show()

Python plotting dictionary

I am VERY new to the world of python/pandas/matplotlib, but I have been using it recently to create box and whisker plots. I was curious how to create a box and whisker plot for each sheet using a specific column of data, i.e. I have 17 sheets, and I have columns called HMB and DV on each sheet. I want to plot 17 data sets on one box and whisker plot for HMB and another 17 data sets on the DV plot. Below is what I have so far.
I can open the file, and get all the sheets into list_dfs, but then don't know where to go from there. I was going to try and manually slice each set (as I started below before coming here for help), but when I have more data in the future, I don't want to have to do that by hand. Any help would be greatly appreciated!
import pandas as pd
import numpy as np
import xlrd
import matplotlib.pyplot as plt
%matplotlib inline
from pandas import ExcelWriter
from pandas import ExcelFile
from pandas import DataFrame
excel_file = 'Project File Merger.xlsm'
list_dfs = []
xls = xlrd.open_workbook(excel_file,on_demand=True)
for sheet_name in xls.sheet_names():
    df = pd.read_excel(excel_file,sheet_name)
    list_dfs.append(df)
d_psppm = {}
for i, sheet_name in enumerate(xls.sheet_names()):
    df = pd.read_excel(excel_file,sheet_name)
    d_psppm["PSPPM" + str(i)] = df.loc[:,['PSPPM']]
values_list = list(d_psppm.values())
print(values_list[:])
A sample output looks like below, for 17 list entries, but with different number of rows for each.
PSPPM
0 0.246769
1 0.599589
2 0.082420
3 0.250000
4 0.205140
5 0.850000,
PSPPM
0 0.500887
1 0.475255
2 0.472711
3 0.412953
4 0.415883
5 0.703716,...
The next thing I want to do is create a box and whisker plot, 1 plot with 17 box and whiskers. I am not sure how to get the dictionary to plot with the values and indices as the name. I have tried to dig, and figure out how to convert the dictionary to a list and then plot each element in the list, but have had no luck.
Thanks for the help!
I agree with @Alex that forming your columns into a new DataFrame and then plotting from that would be a good approach. However, if you're going to use the dict, then it should look something like this. Depending on the version of Python you're using, the dictionary may be unordered (insertion order is only guaranteed from Python 3.7 on), so if the ordering on the plot is important to you, you might want to create a list of dictionary keys in the order you want and iterate over that instead:
import matplotlib.pyplot as plt
import numpy as np
#colours = []#list of colours here, if you want
#markers = []#list of markers here, if you want
fig, ax = plt.subplots()
for idx, k in enumerate(d_psppm, 1):
    data = d_psppm[k]
    jitter = np.random.normal(0, 0.1, data.shape[0]) + idx
    ax.scatter(jitter,
               data,
               s=25,#size of the marker
               c="r",#colour, could be from colours
               alpha=0.35,#opacity, 1 being solid
               marker="^",#or ref. to markers, e.g. markers[idx]
               edgecolors="none"#removes black border
               )
As per Alex's suggestion, you could use the data to create a seaborn boxplot and overlay a swarmplot to show the data (depends on how many rows each has whether this is practical).
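For the box and whisker plot itself, a plain-matplotlib sketch (swapped in here instead of seaborn) could look like the following. The dictionary is a made-up stand-in for d_psppm, with three entries instead of 17 and random values.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for d_psppm: one single-column DataFrame per sheet
rng = np.random.default_rng(0)
d_psppm = {
    f"PSPPM{i}": pd.DataFrame({"PSPPM": rng.random(5 + i)})
    for i in range(3)
}

keys = list(d_psppm)  # fixes the plotting order explicitly

# boxplot accepts a list of 1-D arrays, which may differ in length
series = [d_psppm[k]["PSPPM"].dropna().to_numpy() for k in keys]

fig, ax = plt.subplots()
box = ax.boxplot(series)  # one box per dictionary entry
ax.set_xticklabels(keys)  # label each box with its dictionary key
ax.set_ylabel("PSPPM")
# Call plt.show() to display the figure
```

Since the sheets have different numbers of rows, passing a list of arrays avoids having to pad them into one rectangular DataFrame first.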

query from a csv file

I want to draw a plot of people who are more than 0.5 years old.
When I enter the data directly in Python and build the DataFrame, my code works:
import pandas as pd
data = {'age': [0.62,0.84,0.78,0.80,0.70,0.25,0.32,0.86,0.75],
        'gender': [1,0,0,0,1,0,0,1,0],
        'LOS': [0.11,0.37,0.23,-0.02,0.19,0.27,0.37,0.31,0.21],
        'WBS': [9.42,4.40,6.80,9.30,5.30,5.90,3.10,4.10,12.07],
        'HB': [22.44,10.40,15.60,15.10,11.30,10.60,12.50,10.40,14.10],
        'Nothrophil': [70.43,88.40,76.50,87,82,87.59,15.40,77,88]}
df = pd.DataFrame(data, index=[0,1,2,3,4,5,6,7,8])
old = df.query('age > 0.5')
import matplotlib.pyplot as plt
plt.plot(old.age)
plt.show()
But when I use a CSV file to build my DataFrame, the code doesn't work:
import pandas as pd
df= pd.read_csv('F:\HCSE\sample_data1.csv',sep=';')
old = df.query('age > 0.5')
import matplotlib.pyplot as plt
plt.plot(old.age)
plt.show()
How can I do the same thing with a CSV file?
One more question: is it possible to draw a scatter plot with only one argument?
For example, I want to draw a scatter plot of people who are more than 0.5 years old (the y-axis is the age and the x-axis is the row number in the CSV file), with different colors for different genders. How can I do that?
Thanks a lot.
but when I use a csv file to form my data-frame, the code doesn't
work:
You might want to share the error message so that we know what is going on under the hood.
Is it possible to draw a scatter plot with only one argument?
As an example I want to draw a scatter plot of people who are more
than 0.5 years old (Y axis is the age and the X axis is the number of
rows in the csv file) and I want to use different
colors for different genders. how can I do it?
Yes. Please refer to the code below.
colors = ['b' if gender == 1 else 'r' for gender in df.loc[df['age'] >0.5].gender]
df.loc[df['age'] > 0.5].reset_index().plot.scatter('index', 'age', color=colors)
You can also do this very easily using seaborn's lmplot.
import seaborn as sns
sns.lmplot(x="index", y="age", data=df.loc[df['age'] > 0.5].reset_index(), hue="gender", fit_reg=False)
Notice that you can apply colors according to gender with the hue argument. Hope this helps with the visualization.
For the scatter plot, you could simply do:
colors = ['b' if gender == 1 else 'r' for gender in old.gender]
plt.scatter(range(len(old.age)), old.age, color = colors)
plt.show()
About the query, can you post your .csv file? It works with my data.
