Reading excel with Python Pandas and isolating columns/rows to plot

I am using pandas read_excel to create a histogram or line plot. The file is large and I only want to plot certain values from it, but I would still like to read in the entire file. I know how to use skiprows and parse_cols in read_excel, but if I do, the part of the file that I need for the axis labels is not read. I also do not know how to tell pandas what to plot as the x-values and what to plot as the y-values. Here's what I have:
df=pd.read_excel('JanRain.xlsx',parse_cols="C:BD")
years=df[0]
precip=df[31:32]
df.plot.bar()
I want the x-axis to be row 1 of the Excel file (years), and I want each bar in the bar graph to be a value from row 31 of the Excel file. I'm not sure how to isolate this. Would it be easier to read with pandas and then plot with matplotlib?
Here is a sample of the Excel file. The first row is years and the second column is days of the month (this file is only for one month):

Here's how I would plot the data in row 31 of a large dataframe, setting row 0 as the x-axis. (updated answer)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Create a random array with 32 rows and 10 columns, and save it as an Excel file:
df = pd.DataFrame(np.random.rand(320).reshape(32,10), columns=range(64,74), index=range(1,33))
df.to_excel(r"D:\data\data.xlsx")
Read only the columns and rows that you want using "usecols" (called "parse_cols" in older pandas versions, as in the question) and "skiprows". The first column in this example is the dataframe index.
# load desired columns and rows into a dataframe
# in this method, I first make a list of all skipped rows
desired_cols = [0] + list(range(2,9))
skipped_rows = list(range(1,33))
skipped_rows.remove(31)
# usecols replaces parse_cols, which was deprecated and later removed from pandas
df = pd.read_excel(r"D:\data\data.xlsx", index_col=0, usecols=desired_cols, skiprows=skipped_rows)
Currently this yields a dataframe with only one row.
65 66 67 68 69 70 71
31 0.310933 0.606858 0.12442 0.988441 0.821966 0.213625 0.254897
Isolate only the row that you want to plot, giving a pandas.Series with the original column headers as the index:
ser = df.loc[31, :]
Plot the series, first as a line plot and then as a bar chart.
fig, ax = plt.subplots()
ser.plot(ax=ax)
ax.set_xlabel("year")
ax.set_ylabel("precipitation")
fig, ax = plt.subplots()
ser.plot(kind="bar", ax=ax)
ax.set_xlabel("year")
ax.set_ylabel("precipitation")
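If the whole sheet is read without skiprows, the same row can also be selected afterwards by its label, which keeps the header row of years available for the axis. The following is only a minimal sketch, assuming the years end up as column labels and the days of the month as the index. The in-memory frame stands in for the result of read_excel, the values are random placeholders, and the names rain and day_31 are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for df = pd.read_excel(...): days 1..31 as the index, years as columns.
# The values are random placeholders, not real rainfall data.
rng = np.random.default_rng(0)
years = list(range(1960, 1970))
rain = pd.DataFrame(rng.random((31, len(years))), index=range(1, 32), columns=years)

# Select one row by its index label; the result is a Series indexed by year.
day_31 = rain.loc[31]

# The Series index (years) becomes the x-axis, the values become the bar heights.
fig, ax = plt.subplots()
day_31.plot(kind="bar", ax=ax)
ax.set_xlabel("year")
ax.set_ylabel("precipitation")
```

Because nothing is skipped at read time, any other row can be plotted the same way by changing the label passed to .loc.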

Related

How do I correctly plot two columns of a dataframe when the size of data is huge?

I have a dataframe of the format:
df = pd.DataFrame({
'TCTN':list('101','102','103',....,'STDEV')
'0':[855days,626days,....,5911days],
'1':[946days,485days,....,6040days],
'2':[1242days,1985days,....,5974days],
'3':[345days,1864days,....,6062days],
})
with 4997 rows × 229 columns, for which I tried to plot columns TNTC vs STDEV using:
df3.plot(x='STDEV' ,y='TNTC',figsize=(20,5),style='o')
which gives me a plot like this:
But what I actually need is the TNTC values on the y-axis. Is that not possible because there are 4997 rows? Should I change the style of the plot to better fit the data?

Use matplotlib instead of pandas to do the plotting:
import matplotlib.pyplot as plt
plt.scatter(df['STDEV'].values, df['TNTC'].values)

Python plotting dictionary

I am VERY new to the world of python/pandas/matplotlib, but I have been using it recently to create box and whisker plots. I was curious how to create a box and whisker plot for each sheet using a specific column of data, i.e. I have 17 sheets, and I have column called HMB and DV on each sheet. I want to plot 17 data sets on a Box and Whisker for HMB and another 17 data sets on the DV plot. Below is what I have so far.
I can open the file, and get all the sheets into list_dfs, but then don't know where to go from there. I was going to try and manually slice each set (as I started below before coming here for help), but when I have more data in the future, I don't want to have to do that by hand. Any help would be greatly appreciated!
import pandas as pd
import numpy as np
import xlrd
import matplotlib.pyplot as plt
%matplotlib inline
from pandas import ExcelWriter
from pandas import ExcelFile
from pandas import DataFrame
excel_file = 'Project File Merger.xlsm'
list_dfs = []
xls = xlrd.open_workbook(excel_file,on_demand=True)
for sheet_name in xls.sheet_names():
    df = pd.read_excel(excel_file,sheet_name)
    list_dfs.append(df)
d_psppm = {}
for i, sheet_name in enumerate(xls.sheet_names()):
    df = pd.read_excel(excel_file,sheet_name)
    d_psppm["PSPPM" + str(i)] = df.loc[:,['PSPPM']]
values_list = list(d_psppm.values())
print(values_list[:])
A sample output looks like below, for 17 list entries, but with different number of rows for each.
PSPPM
0 0.246769
1 0.599589
2 0.082420
3 0.250000
4 0.205140
5 0.850000,
PSPPM
0 0.500887
1 0.475255
2 0.472711
3 0.412953
4 0.415883
5 0.703716,...
The next thing I want to do is create a box and whisker plot, 1 plot with 17 box and whiskers. I am not sure how to get the dictionary to plot with the values and indices as the name. I have tried to dig, and figure out how to convert the dictionary to a list and then plot each element in the list, but have had no luck.
Thanks for the help!

I agree with @Alex that forming your columns into a new DataFrame and then plotting from that would be a good approach. However, if you're going to use the dict, it should look something like this. Depending on the version of Python you're using, the dictionary may be unordered (insertion order is only guaranteed from Python 3.7 onward), so if the ordering on the plot is important to you, you might want to create a list of dictionary keys in the order you want and iterate over that instead.
import matplotlib.pyplot as plt
import numpy as np
#colours = []#list of colours here, if you want
#markers = []#list of markers here, if you want
fig, ax = plt.subplots()
for idx, k in enumerate(d_psppm, 1):
    data = d_psppm[k]
    jitter = np.random.normal(0, 0.1, data.shape[0]) + idx
    ax.scatter(jitter,
               data,
               s=25,#size of the marker
               c="r",#colour, could be from colours
               alpha=0.35,#opacity, 1 being solid
               marker="^",#or ref. to markers, e.g. markers[idx]
               edgecolors="none"#removes black border
               )
As per Alex's suggestion, you could use the data to create a seaborn boxplot and overlay a swarmplot to show the data (depends on how many rows each has whether this is practical).
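A box-and-whisker version can also be sketched with matplotlib alone, since Axes.boxplot accepts a list of arrays of different lengths. The dictionary below is a small made-up stand-in for d_psppm (one single-column frame per sheet), and the explicit keys list is one way to fix the order of the boxes. Treat it as an illustration rather than the approach Alex had in mind.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for d_psppm: one single-column DataFrame per sheet,
# each with a different number of rows. The numbers are illustrative only.
d_psppm = {
    "PSPPM0": pd.DataFrame({"PSPPM": [0.25, 0.60, 0.08, 0.25, 0.21, 0.85]}),
    "PSPPM1": pd.DataFrame({"PSPPM": [0.50, 0.48, 0.47, 0.41]}),
    "PSPPM2": pd.DataFrame({"PSPPM": [0.30, 0.35, 0.70, 0.10, 0.55]}),
}

# Fix the plotting order explicitly, then turn each frame into a 1-D array
# with missing values removed.
keys = ["PSPPM0", "PSPPM1", "PSPPM2"]
samples = [d_psppm[k]["PSPPM"].dropna().to_numpy() for k in keys]

fig, ax = plt.subplots()
boxes = ax.boxplot(samples)             # one box per entry in samples
ax.set_xticks(range(1, len(keys) + 1))  # boxes sit at positions 1..n by default
ax.set_xticklabels(keys)                # label each box with its dictionary key
ax.set_ylabel("PSPPM")
```

The same loop works for any column name, so a second figure for another column only needs a different dictionary and label.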

Avoid plotting missing values in Seaborn

Problem: I have time series data spanning several days, and I use the sns.FacetGrid function of the seaborn library to plot this data in facet form. In several cases, I found that seaborn draws a continuous line across consecutive missing values (NaN values) between two readings, whereas matplotlib shows missing values as a gap, which makes sense. A demo example follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# create timeseries data for 3 days such that day two contains NaN values
time_duration1 = pd.date_range('1/1/2018', periods=24,freq='H')
data1 = np.random.randn(len(time_duration1))
ds1 = pd.Series(data=data1,index=time_duration1)
time_duration2 = pd.date_range('1/2/2018',periods=24,freq='H')
data2 = [float('nan')]*len(time_duration2)
ds2 = pd.Series(data=data2,index=time_duration2)
time_duration3 = pd.date_range('1/3/2018', periods=24,freq='H')
data3 = np.random.randn(len(time_duration3))
ds3 = pd.Series(data=data3,index=time_duration3)
# combine all three days series and then convert series into pandas dataframe
DS = pd.concat([ds1,ds2,ds3])
DF = DS.to_frame()
DF.plot()
This results in the following plot:
The matplotlib plot above shows missing values as a gap.
Now let us prepare the same data for the seaborn function:
DF['col'] = np.ones(DF.shape[0])# dummy column but required for facets
DF['timestamp'] = DF.index
DF.columns = ['data_val','col','timestamp']
# note: seaborn 0.9 and later call this parameter height instead of size
g = sns.FacetGrid(DF,col='col',col_wrap=1,height=2.5)
g.map_dataframe(plt.plot,'timestamp','data_val')
See how the seaborn plot shows missing data as a line. How can I force seaborn not to draw a line across NaN values?
Note: This is a dummy example, and I need facet grid in any case to plot my data.

FacetGrid by default removes NaN values from the data. The reason is that some functions inside seaborn would not work properly with NaNs (especially some of the statistical functions, I'd say).
In order to keep the nan values in the data, use the dropna=False argument to FacetGrid:
g = sns.FacetGrid(DF,... , dropna=False)
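The difference comes down to whether the NaN rows are still present when the line is drawn. The sketch below uses plain pandas and matplotlib only (no seaborn involved, and the series s is made up) to show both cases: with the NaN rows kept the line breaks, and with them dropped the neighbouring points are joined, which is what removing NaN rows before plotting amounts to.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Nine daily readings; the middle three are missing.
index = pd.date_range("2018-01-01", periods=9, freq="D")
values = [1.0, 2.0, 1.5, np.nan, np.nan, np.nan, 2.5, 1.0, 2.0]
s = pd.Series(values, index=index)

fig, (ax_keep, ax_drop) = plt.subplots(2, 1, sharex=True)

# NaN rows kept: matplotlib leaves a gap where the data is missing.
(line_keep,) = ax_keep.plot(s.index, s.to_numpy())
ax_keep.set_title("NaN kept: gap")

# NaN rows dropped first: the points on either side are joined by a straight line.
s_drop = s.dropna()
(line_drop,) = ax_drop.plot(s_drop.index, s_drop.to_numpy())
ax_drop.set_title("NaN dropped: continuous line")
```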

how do I automate the number of columns and rows while plotting in python (matplotlib)

When plotting with matplotlib, each time my file has a different number of columns and rows I have to edit my script. Below is the script, written for 5 columns. If I have a file with 7 columns and I want to plot the 1st column against the 7th, I have to edit my code again, for example: c0[7], c7=float(elements[7]), C7.append(c7), etc. Is there a way to automate this, so that I don't have to keep changing my code each time the number of rows and columns changes? Thank you.
As input parameters, the script should take the data file and the columns to plot (for example, the 1st column against the 6th one). The script should then handle the number of columns by itself.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib
infile = open("ewcd.txt","r")
data = infile.readlines()
C0=[]
C1=[]
C2=[]
C3=[]
C4=[]
for line in data:
    elements = line.split()
    try:
        c0=float(elements[0])
        c1 = float(elements[1])
        c2=float(elements[2])
        c3=float(elements[3])
        c4=float(elements[4])
        C0.append(c0)
        C1.append(c1)
        C2.append(c2)
        C3.append(c3)
        C4.append(c4)
    except IndexError:
        pass
fig, ax = plt.subplots()
plt.yscale('log')
plt.tick_params(axis='both', which='major', labelsize=13)
plt.plot(C0,C1,'b-')
plt.plot(C0,C2,'g-')
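One way to avoid hard-coding a list per column is to collect every row as a list of floats and transpose at the end, so that the number of columns is taken from the file itself. The sketch below assumes whitespace-separated numeric columns, as the script above does. read_columns and plot_columns are hypothetical helper names, and skipping non-numeric or short lines is an assumption about the file rather than something stated in the question.

```python
def read_columns(lines):
    """Parse whitespace-separated numeric lines into a list of columns."""
    rows = []
    for line in lines:
        try:
            row = [float(x) for x in line.split()]
        except ValueError:
            continue  # skip headers or other non-numeric lines
        if row:
            rows.append(row)
    if not rows:
        return []
    width = max(len(row) for row in rows)
    full_rows = [row for row in rows if len(row) == width]  # drop short lines
    # Transpose: one list per column, however many columns there are.
    return [list(col) for col in zip(*full_rows)]


def plot_columns(columns, x_col, y_cols):
    """Plot the chosen y columns against one x column (0-based indices)."""
    import matplotlib.pyplot as plt  # imported here so parsing works without matplotlib

    fig, ax = plt.subplots()
    for y_col in y_cols:
        ax.plot(columns[x_col], columns[y_col], label=f"column {y_col}")
    ax.set_yscale("log")
    ax.legend()
    return ax


# Small made-up input: a header line, a short line, and three complete rows.
sample = ["# x a b", "1 10 100", "2 20 200", "3 30", "4 40 400"]
columns = read_columns(sample)
```

With a real file, the lines can come from an open file object, and a call such as plot_columns(columns, 0, [1, 2]) then plots the 2nd and 3rd columns against the 1st. numpy.loadtxt is an alternative when every line is complete.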

Set x-axis intervals(ticks) for graph of Pandas DataFrame

I'm trying to set the ticks (time-steps) of the x-axis on my matplotlib graph of a Pandas DataFrame. My goal is to use the first column of the DataFrame to use as the ticks, but I haven't been successful so far.
My attempts so far have included:
Attempt 1:
#See 'xticks'
data_df[header_names[1]].plot(ax=ax, title="Roehrig Shock Data", style="-o", legend=True, xticks=data_df[header_names[0]])
Attempt 2:
ax.xaxis.set_ticks(data_df[header_names[0]])
header_names is just a list of the column header names and the dataframe is as follows:
Compression Velocity Compression Force
1 0.000213 6.810879
2 0.025055 140.693200
3 0.050146 158.401500
4 0.075816 171.050200
5 0.101011 178.639500
6 0.126681 186.228800
7 0.150925 191.288300
8 0.176597 198.877500
9 0.202269 203.937000
10 0.227466 208.996500
11 0.252663 214.056000
And here is the data in CSV format:
Compression Velocity,Compression Force
0.0002126891606,6.810879
0.025055073079999997,140.6932
0.050145696,158.4015
0.07581600279999999,171.0502
0.1010109232,178.6395
0.12668120459999999,186.2288
0.1509253776,191.2883
0.1765969798,198.8775
0.2022691662,203.937
0.2274659662,208.9965
0.2526627408,214.056
And here is an implementation of reading and plotting the graph:
data_df = pd.read_csv(file).astype(float)
fig = Figure()
ax = fig.add_subplot(111)
ax.set_xlabel("Velocity (m/sec)")
ax.set_ylabel("Force (N)")
data_df[header_names[1]].plot(ax=ax, title="Roehrig Shock Data", style="-o", legend=True)
The current graph looks like:
The x-axis is currently the number of rows in the dataframe (e.g. 12) rather than the actual values within the first column.
Is there a way to use the data from the first column in the dataframe to set as the ticks/intervals/time-steps of the x-axis?

This works for me:
data_df.plot(x='Compression Velocity', y='Compression Force', xticks=data_df['Compression Velocity'])
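For a self-contained check of this approach, the sketch below reads the first few rows of the CSV data from the question out of a string and places a tick at every velocity value. It uses plt.subplots rather than the Figure object from the question, which is an assumption made to keep the example short.

```python
import io
import pandas as pd
import matplotlib.pyplot as plt

# First rows of the CSV data from the question, read from a string instead of a file.
csv_text = """Compression Velocity,Compression Force
0.0002126891606,6.810879
0.025055073079999997,140.6932
0.050145696,158.4015
0.07581600279999999,171.0502
0.1010109232,178.6395
"""
data_df = pd.read_csv(io.StringIO(csv_text)).astype(float)

fig, ax = plt.subplots()
# x= selects the column used for the x-axis; xticks= places a tick at every x value.
data_df.plot(x="Compression Velocity", y="Compression Force",
             xticks=data_df["Compression Velocity"], style="-o", ax=ax)
# Set the labels after plotting, because pandas writes the column name as the x label.
ax.set_xlabel("Velocity (m/sec)")
ax.set_ylabel("Force (N)")
```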
