I currently have a set of Series in pandas, and each Series is composed of two data sets. I need to separate the two data sets into lists while retaining the series information, i.e. the time and intensity data for 58V.
My current code looks like:
import numpy as np
import pandas as pd
xl = pd.ExcelFile("TEST_ATD.xlsx")
df = xl.parse("Sheet1")
series = xl.parse("Sheet1")  # second copy of the same sheet, used in the update below
# collect the names of the columns that hold voltage data, e.g. "58V"
voltages = []
for item in df:
    if "V" in item:
        voltages.append(item)
data_list = []
for value in voltages:
    print(df[value])
How do I select a particular data set from a series and extract it into a list? Calling print(df[value]) returns my data sets, an example of which looks like:
Name: 58V, dtype: int64
0.000 0
0.180 1
0.360 1.2
0.540 1.5
0.720 1.2
..
35.277 0
35.457 0
35.637 0
NaN 0
Ultimately I plan to plot these data sets into a line graph with pyplot.
~~~ UPDATE ~~~
using
import matplotlib.pyplot as plt
for value in voltages:
    intensity = []
    for row in series[value].tolist():
        intensity.append(row)
    time = range(0, len(intensity))  # arbitrary x-axis: one step per sample
    pc_intensity = []
    for item in intensity:
        pc_intensity.append((100/max(intensity)*item))  # scale to percent of the maximum
    plt.plot(time, pc_intensity)
    axes = plt.gca()
    axes.set_ylim([0,100])
    plt.title(value)
    plt.ylabel('Intensity')
    plt.xlabel('Time')
    plt.savefig(value +'.png')
    plt.clf()
    print(value)
I am able to get plots of the first 8 data series (using an arbitrary x-axis); however, the plots for anything past the 8th series are empty. I have experimented and found this to be due to some of the series being different lengths. I'm confused as to why this would affect the plots, as the x-axis is derived directly from the length of the data set it is being plotted against.
I am not sure what you are trying to achieve, but I'll take a guess:
df = pd.DataFrame({'A': range(1, 10), 'B': range(1, 10), 'C': range(1, 10), 'D': range(1, 10), 'E': [1,1,1,2,2,2,2,3,4]})
for col in df.columns:
    print(df[col].values.tolist())
This prints every column of your dataframe as a list.
If you are just trying to plot something, why not just use
df.plot()
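If the columns have different lengths, pandas pads the shorter ones with NaN when the sheet is read, which may be what upsets the later plots. A minimal sketch, assuming that is the cause and assuming the time values sit in the index: drop the NaN padding from each column before turning it into lists and scaling it. The frame below is made-up stand-in data, not the contents of TEST_ATD.xlsx, and the 60V column is hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for the parsed sheet: "60V" is shorter, so it is padded with NaN.
df = pd.DataFrame(
    {"58V": [0.0, 1.0, 1.2, 1.5], "60V": [0.0, 2.0, np.nan, np.nan]},
    index=[0.000, 0.180, 0.360, 0.540],
)

curves = {}
for name in df.columns:
    s = df[name].dropna()                        # keep only the rows that hold data
    time = s.index.tolist()                      # x values taken from the index
    intensity = s.tolist()                       # raw intensities as a plain list
    pc_intensity = (100 * s / s.max()).tolist()  # percent of the column maximum
    curves[name] = (time, intensity, pc_intensity)
```

Each entry of curves can then be unpacked and passed to plt.plot(time, pc_intensity), so every series is plotted against its own length.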
Situation
I’m trying to create a boxplot with individual and nested/grouped data. The dataset I use represents information for a number of households, where there is a distinction between 1-phase and 3-phase systems (#)
#NOTE Where the id appears only once, the household is single phased (1-phase) and duplicates are 3-phase system. Due to the duplicates, reading the csv-file via pd.read_csv(..) will extend the duplicate's names (i.e. 1, 1.1 and 1.2).
Using the basic plot techniques delivers:
In [4]: VoltageProfileFile= pd.read_csv(dest + '/VoltageProfiles_' + str(PV_par['value_PV']) + '%PV.csv', dtype= 'float')
...: VoltageProfileFile.boxplot(figsize=(20,5), rot= 60)
...: plt.ylim(0.9, 1.1)
...: plt.show()
Out[4]:
The result is correct, but it would be cleaner to have only 1 tick representing 1, 1.1 and 1.2, or 5, 5.1 and 5.2, etc.
Question
I would like to clean this up by using a ‘categorical’ boxplot, where values from duplicates (3-phase systems) are grouped under the same id. I’m aware that seaborn lets users set the hue parameter, sns.boxplot(x='',hue='', y='', data=''), to create categorical plots (Plotting with categorical data). However, I can’t figure out how to format my dataset in order to achieve this. I tried the pd.melt(..) function (cfr. pandas.melt), but the resulting format changes the order in which the values appear (*)
(*) Every id is accompanied by a length up to a reference point, thus the order of appearance on the x-axis must remain.
What would be a good approach to tackle this problem?
Ideally, the boxplot would group 3-phase systems under one id and display different colours for 1ph vs. 3ph systems.
Kind regards,
Rémy
For seaborn plotting, data should be structured in long format, with distinct indicator columns such as household, phase and value, rather than in the wide format you have now.
So consider actually letting Pandas rename columns 1, 1.1, 1.2 and then run pd.melt into long format with adjustments of the generated household and phase columns using assign where you split on . and take the first and second parts respectively:
VoltageProfileFile_long = (pd.melt(VoltageProfileFile, var_name = 'phase')
                             .assign(household = lambda x: x['phase'].str.split("\\.").str[0].astype(int),
                                     phase = lambda x: pd.to_numeric(x['phase'].str.split("\\.").str[1]).fillna(0).astype(int).add(1))
                             .reindex(['household', 'phase', 'value'], axis='columns')
                          )
Below is a demo with random data
Data (dumped to csv then read back in for pandas renaming process)
import numpy as np
import pandas as pd
np.random.seed(111620)
VoltageProfileFile = pd.DataFrame([np.random.uniform(0.95, 1.05, 13) for i in range(50)],
                                  columns = [1, 1, 1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9])
VoltageProfileFile.to_csv('data.csv', index=False)
VoltageProfileFile = pd.read_csv('data.csv')
VoltageProfileFile.head(10)
# 1 1.1 1.2 2 3 ... 5.2 6 7 8 9
# 0 1.012732 1.042768 0.975577 0.965508 1.048544 ... 1.010898 1.008921 1.006769 1.019615 1.036926
# 1 1.013457 1.048378 1.025201 0.982988 0.995133 ... 1.024578 1.024362 0.985693 1.041609 0.995037
# 2 1.024739 1.008590 0.960278 0.956811 1.001739 ... 0.969436 0.953134 0.966851 1.031544 1.036572
# 3 1.037998 0.993246 0.970146 0.989196 0.959527 ... 1.015577 1.027020 1.038941 0.971666 1.040658
# 4 0.995877 0.955734 0.952497 1.040942 0.985759 ... 1.021805 1.044108 0.980657 1.034179 0.980722
# 5 0.994755 0.951557 0.986580 1.021583 0.959249 ... 1.046740 0.998429 1.027406 1.007391 0.989477
# 6 1.023979 1.043418 1.020745 1.006081 1.030413 ... 0.964579 1.035479 0.982969 0.953484 1.005889
# 7 1.018904 1.045440 1.003997 1.018295 0.954814 ... 0.955295 0.960958 0.999492 1.010163 0.985847
# 8 0.960913 0.982671 1.016659 1.030384 1.043750 ... 1.042720 0.972287 1.039235 0.969571 0.999418
# 9 1.017085 0.998049 0.989664 0.953420 1.018018 ... 0.953041 0.955883 1.004630 0.996443 1.017762
Plot (after same processing to generate VoltageProfileFile_long)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
fig, ax = plt.subplots(figsize=(8,4))
sns.boxplot(x='household', y='value', hue='phase', data=VoltageProfileFile_long, ax=ax)
plt.title('Boxplot of Values by Household and Phases')
plt.tight_layout()
plt.show()
plt.clf()
plt.close()
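To colour by 1ph vs. 3ph rather than by phase number, one option is to count the distinct phases per household in the long table and use that as the hue column. A small sketch with made-up numbers; the column name system and the labels 1ph and 3ph are assumptions chosen for illustration, and the splitting is done in separate steps rather than with assign.

```python
import pandas as pd

# Tiny wide frame with the column names pandas produces for duplicate ids.
wide = pd.DataFrame({"1": [1.00, 1.01], "1.1": [0.99, 1.02],
                     "1.2": [1.03, 0.98], "2": [1.00, 0.97]})

long_df = pd.melt(wide, var_name="phase")
parts = long_df["phase"].str.split(".")  # "1.2" -> ["1", "2"], "2" -> ["2"]
long_df["household"] = parts.str[0].astype(int)
long_df["phase"] = pd.to_numeric(parts.str[1]).fillna(0).astype(int) + 1

# Households with more than one distinct phase are treated as 3-phase systems.
n_phases = long_df.groupby("household")["phase"].transform("nunique")
long_df["system"] = n_phases.map(lambda n: "3ph" if n > 1 else "1ph")
```

Passing hue='system' instead of hue='phase' to the sns.boxplot call should then colour the boxes by system type.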
I'm interested in the first time a random process crosses a threshold. I am storing the results from observing the process in a dataframe, and have plotted how many times several realisations of that process cross 0.9 after I observe it at the end of 14 rounds.
This image was created with this code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
fin = pd.DataFrame(data=np.random.uniform(size=(100, 13))).T
pos = (fin>0.9).astype(float)
ax=fin.loc[:, pos.loc[12, :] != 1.0].plot(figsize=(12, 6), color='silver', legend=False)
fin.loc[:, pos.loc[12, :] == 1.0].plot(figsize=(12, 6), color='indianred', legend=False, ax=ax)
where fin contained the random numbers, and pos was 1 every time that process crossed 0.9.
I would like to now plot the first time the process in fin crosses 0.9 for each realisation (columns represent realisations, rows represent observation times)
I can find the first occurrence of a value above 0.9 with idxmax() but I'm stumped about how to remove everything in the dataframe after that in each column.
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.uniform(size=(100, 10)))
maxes = df.idxmax()
It's just that I'm having real difficulty thinking through this.
If I understand correctly, you can use
df = df[df.index < maxes[0]]
IIUC, we can use a boolean matrix with cumprod:
df.where((df < .9).cumprod().astype(bool)).plot()
Output:
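A small deterministic sketch of why the cumulative product works: (df < 0.9) is true until the first crossing, and the running product stays at 1 only while every entry up to and including the current one was below the threshold. The numbers are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [0.1, 0.95, 0.2, 0.99],
                   "b": [0.3, 0.4, 0.92, 0.1]})

below = (df < 0.9).astype(int)       # 1 while under the threshold, else 0
keep = below.cumprod().astype(bool)  # False from the first crossing onwards
truncated = df.where(keep)           # NaN from the first crossing onwards

# Row label of the first crossing in each column (idxmax returns the first maximum).
first_cross = (df > 0.9).astype(int).idxmax()
```

Note that the crossing value itself is masked as well. A column that never crosses would report the first row label from idxmax, so that case needs a separate check.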
I'm currently experimenting with pandas and matplotlib.
I have created a Pandas dataframe which stores data like this:
cmc|coloridentity
1 | G
1 | R
2 | G
3 | G
3 | B
4 | B
What I now want to do is to make a stacked bar plot where I can see how many entries per cmc exist. And I want to do that for all coloridentity and stack them above.
My thoughts so far:
# get all unique values of coloridentity
unique_values = df['coloridentity'].unique()
# Create two dictionaries. One for the number of entries per cost and one
# to store the different costs for each color
color_dict_values = {}
color_dict_index = {}
for u in unique_values:
    temp_df = df['cmc'].loc[df['coloridentity'] == u].value_counts()
    color_dict_values[u] = np.array(temp_df)
    color_dict_index[u] = temp_df.index.to_numpy()
width = 0.4
p1 = plt.bar(color_dict_index['G'], color_dict_values['G'], width, color='g')
p2 = plt.bar(color_dict_index['R'], color_dict_values['R'], width,
             bottom=color_dict_values['G'], color='r')
plt.show()
But this gives me an error, because on the line where I set the bottom of the second plot to the values of the first plot, the two NumPy arrays have different shapes.
Does anyone know a solution? I thought of adding 0 values so that the shapes are the same, but I don't know whether this is the best solution and, if so, what the best way to implement it would be.
Working with a fixed index (the range of cmc values) makes things easier. That way the color_dict_values of a color_id give a count for each of the possible cmc values (the count stays zero when there are none).
The color_dict_index isn't needed any more. To fill in the color_dict_values, we iterate through the temporary dataframe with the value_counts.
To plot the bars, the x-axis is now the range of possible cmc values. I added [1:] to each array to skip the zero at the beginning which would look ugly in the plot.
The bottom starts at zero, and gets incremented by the color_dict_values of the color that has just been plotted. (Thanks to numpy, the constant 0 added to an array will be that array.)
In the code I generated some random numbers similar to the format in the question.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
N = 50
df = pd.DataFrame({'cmc': np.random.randint(1, 10, N), 'coloridentity': np.random.choice(['R', 'G'], N)})
# get all unique values of coloridentity
unique_values = df['coloridentity'].unique()
# find the range of all cmc indices
max_cmc = df['cmc'].max()
cmc_range = range(max_cmc + 1)
# dictionary for each coloridentity: array of values of each possible cmc
color_dict_values = {}
for u in unique_values:
    value_counts_df = df['cmc'].loc[df['coloridentity'] == u].value_counts()
    color_dict_values[u] = np.zeros(max_cmc + 1, dtype=int)
    # Series.iteritems() was removed in pandas 2.0; items() is the equivalent
    for ind, cnt in value_counts_df.items():
        color_dict_values[u][ind] = cnt
width = 0.4
bottom = 0
for col_id, col in zip(['G', 'R'], ['limegreen', 'crimson']):
    plt.bar(cmc_range[1:], color_dict_values[col_id][1:], bottom=bottom, width=width, color=col)
    bottom += color_dict_values[col_id][1:]
plt.xticks(cmc_range[1:]) # make sure every cmc gets a tick label
plt.tick_params(axis='x', length=0) # hide the tick marks
plt.xlabel('cmc')
plt.ylabel('count')
plt.show()
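As an alternative sketch (not the approach used above), pd.crosstab can build the same table of counts directly, with zeros filled in for the missing combinations. It uses the six rows from the question.

```python
import pandas as pd

df = pd.DataFrame({"cmc": [1, 1, 2, 3, 3, 4],
                   "coloridentity": ["G", "R", "G", "G", "B", "B"]})

# Rows: cmc values; columns: colours; cells: number of entries (0 if none).
counts = pd.crosstab(df["cmc"], df["coloridentity"])
```

Calling counts.plot(kind='bar', stacked=True) then draws one stacked bar per cmc value, with no manual bookkeeping of bottom.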
Here is an issue when plotting a dataframe:
The dataframe looks like
i ii n b
0 1 0 3 0
1 4 1 5 0
2 4 0 1 5
3 4 1 2 6
4 6 0 3 0
5 6 1 4 3
(code to create below). I'd like to plot stacked bars for same values of i, and I want bars to belong to groups according to ii. When I select only certain rows of the dataframe, I have issues plotting, forcing me to explicitly convert the dataframe's columns (which are extracted as pandas Series) to lists. (Note that I cannot use pivot, as I have multiple rows for some (i, ii) combinations.)
Why can I not directly pass a Series to matplotlib.pyplot.bar() (code for figure 3)?
Why does using Series affect the width of bars, which cannot be overridden by an explicit argument width?
Is there a way to produce the desired plot in a better way?
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'i':[1,4,4,4,6,6], 'n':[3,5,1,2,3,4]})
df['ii'] = df.index % 2
df2 = df.set_index(['i', 'ii'])
df2["b"] = df2.groupby(level='i')['n'].cumsum() - df2.n
df2.reset_index(inplace=True)
# This produces expected outcome
plt.figure(1)
plt.clf()
ix = df2[df2.ii==0]
plt.bar(x=list(ix.i), height=ix.n, bottom=list(ix.b))
ix = df2[df2.ii==1]
plt.bar(x=list(ix.i), height=ix.n, bottom=list(ix.b))
plt.show()
plt.figure(2)
plt.clf()
ix = df2[df2.ii==0]
plt.bar(x=ix.i, height=ix.n, bottom=list(ix.b))
ix = df2[df2.ii==1]
# The following line will draw a bar with unexpected width of bar
plt.bar(x=ix.i, height=ix.n, bottom=list(ix.b))
plt.show()
plt.figure(3)
plt.clf()
plt.show()
ix = df2[df2.ii==0]
plt.bar(x=ix.i, height=ix.n, bottom=ix.b)
ix = df2[df2.ii==1]
# The following line will fail
plt.bar(x=ix.i, height=ix.n, bottom=ix.b)
# error:
# TypeError: only size-1 arrays can be converted to Python scalars
# apparently matplotlib tries to set line width
Desired output:
You typically get this kind of error when a function expects a single value but is passed an array. In many cases, np.vectorize can be used to apply a function that accepts a single element to every element of an array, but that does not seem to be the case here. Why not pass lists, as you did in the first plot?
plt.bar(x=list(ix.i), height=list(ix.n), bottom=list(ix.b))
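A sketch of the same idea using to_numpy() instead of list(), so that only positional values, without the pandas index, are handed over. Whether the index is what trips plt.bar in a given matplotlib version is an assumption here. bar_args is a hypothetical helper name, and the frame is rebuilt as in the question.

```python
import pandas as pd

df = pd.DataFrame({"i": [1, 4, 4, 4, 6, 6], "n": [3, 5, 1, 2, 3, 4]})
df["ii"] = df.index % 2
df2 = df.set_index(["i", "ii"])
df2["b"] = df2.groupby(level="i")["n"].cumsum() - df2["n"]
df2 = df2.reset_index()


def bar_args(frame, group):
    """Return x, height and bottom for one ii group as plain NumPy arrays."""
    ix = frame[frame["ii"] == group]
    return {"x": ix["i"].to_numpy(),
            "height": ix["n"].to_numpy(),
            "bottom": ix["b"].to_numpy()}


args0 = bar_args(df2, 0)
args1 = bar_args(df2, 1)
```

plt.bar(**args0) followed by plt.bar(**args1) then draws both groups on the same axes.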
I have a dataframe of shape (100, 3) that is filled with random float values.
Here is a sample of what the data frame looks like
A B C
4.394966 0.580573 2.293824
3.136197 2.227557 1.306508
4.010782 0.062342 3.629226
2.687100 1.050942 3.143727
1.280550 3.328417 2.247764
4.417837 3.236766 2.970697
1.036879 1.477697 4.029579
2.759076 4.753388 3.222587
1.989020 4.161404 1.073335
1.054660 1.427896 2.066219
0.301078 2.763342 4.166691
2.323838 0.791260 0.050898
3.544557 3.715050 4.196454
0.128322 3.803740 2.117179
0.549832 1.597547 4.288621
This is how I created it
df = pd.DataFrame(np.random.uniform(0,5,size=(100, 3)), columns=list('ABC'))
Note: pd is pandas
I want to plot a bar chart with three segments on the x-axis, where each segment has 2 bars: one showing the number of values less than 2 and the other the number greater than or equal to 2.
So on the x-axis there would be two adjacent bars for column A, one with the total number of values less than 2 and one with the number greater than or equal to 2, and the same for B and C.
Can anyone suggest anything?
I was thinking of using seaborn and setting the hue value to differentiate the two classes (less than 2, and greater than or equal to 2), but the hue attribute only works for categorical values, and I can only set one column as the x-axis attribute.
Any tips would be appreciated.
You need to filter the values and count them, then use plot(kind='bar'):
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(0,5,size=(100, 3)), columns=list('ABC'))
# 'minor' counts values below 2, 'major' counts values of 2 or more, as asked in the question
dfout = pd.DataFrame({'minor' : df[df < 2].count(),
                      'major' : df[df >= 2].count() })
dfout.plot(kind='bar')
plt.show()
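The counts can also be taken from boolean masks, since summing a mask counts its True entries. A deterministic sketch with made-up values; the column names below_2 and at_least_2 are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.5, 2.0, 3.1, 1.9],
                   "B": [2.5, 2.6, 0.1, 4.0],
                   "C": [1.0, 1.5, 0.2, 2.0]})

# One row per original column, one count per class.
counts = pd.DataFrame({"below_2": (df < 2).sum(),
                       "at_least_2": (df >= 2).sum()})
```

counts.plot(kind='bar') then gives one pair of bars for each of A, B and C.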