import seaborn as sns
import matplotlib.pyplot as plt
sns.tsplot(data=df_month, time='month', value='pm_local')
plt.show()
Using this code I get a blank plot, which I presume is because of the scale of the y-axis. I don't know why this happens. Here are the first 5 rows of my dataframe (which consists of 12 rows, one for each month):
How can I fix this?
I think the problem is related to the unit field. When data is passed as a DataFrame, the function expects a unit indicating which subject each observation belongs to. This behavior was not obvious to me either, but see this example.
# Test data
import pandas as pd
df = pd.DataFrame({'month': [1, 2, 3, 4, 5, 6],
                   'value': [11.5, 9.7, 12, 8, 4, 12.3]})
# Add a constant unit so every row belongs to the same subject
sns.tsplot(data=df, value='value', unit=[1]*len(df), time='month')
plt.show()
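For context: tsplot was built for repeated-measures data, so unit identifies the sampling unit each row belongs to, and a constant unit makes the whole column a single series. Note also that tsplot has since been removed from seaborn in favour of sns.lineplot, so a recent install may need the latter.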
You can also extract a Series and plot it directly:
sns.tsplot(data=df.set_index('month')['value'])
plt.show()
I had this same issue. In my case it was due to incomplete data, such that every time point had at least one missing value, causing the default estimator to return NaN for every time point.
Since you only show the first 5 records of your data, we can't tell whether yours has the same issue. See if the following fix works:
from scipy import stats
sns.tsplot(data=df_month, time='month',
           value='pm_local',
           estimator=stats.nanmean)
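If your SciPy version no longer ships stats.nanmean (it was removed in later releases), NumPy's np.nanmean is a drop-in replacement:

import numpy as np
sns.tsplot(data=df_month, time='month',
           value='pm_local',
           estimator=np.nanmean)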
I started using Python 6 months ago, so my question may be a naive one. I would like to visualize my data and ANOVA statistics. It is common to do this using a barplot with added lines indicating significant differences and interactions. How do you make a plot like this using Python?
Here is a simple dataframe with 3 columns (A, B, and the p_values already calculated with a t-test):
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
ar = np.array([[565.0, 81.0, 1.630947e-02],
               [1006.0, 311.0, 1.222740e-27],
               [2929.0, 1292.0, 5.559912e-12],
               [3365.0, 1979.0, 2.507474e-22],
               [2260.0, 1117.0, 1.540305e-01]])
df = pd.DataFrame(ar, columns=['A', 'B', 'p_value'])
ax = plt.subplot()
# Calculate the percentages
(df.iloc[:, 0:2] / df.iloc[:, 0:2].sum() * 100).plot.bar(ax=ax)
for container, p_val in zip(ax.containers, df['p_value']):
    labels = [f"{round(v, 1)}%" if p_val > 0.05 else f"(**)\n{round(v, 1)}%"
              for v in container.datavalues]
    ax.bar_label(container, labels=labels, fontsize=10, padding=8)
plt.show()
Initially I just wanted to add a "**" each time a significant difference is observed between the two columns A & B, but the initial code above is not really working.
Now I would prefer having the added lines indicating significant differences and interactions between the A & B columns, but I have no idea how to make that happen (see the sketch below).
Regards
JYK
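As a side note, one reason the starred-label attempt misbehaves: zip(ax.containers, df['p_value']) pairs each container (one per column, so only two of them) with the first two p-values, rather than pairing each group of bars with its own p-value. For the bracket-style annotation, below is a rough sketch that draws one bracket per significant row with ax.plot and ax.text; the ±0.125 bar offsets assume pandas' default bar width for a two-column bar plot and may need adjusting.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

ar = np.array([[565.0, 81.0, 1.630947e-02],
               [1006.0, 311.0, 1.222740e-27],
               [2929.0, 1292.0, 5.559912e-12],
               [3365.0, 1979.0, 2.507474e-22],
               [2260.0, 1117.0, 1.540305e-01]])
df = pd.DataFrame(ar, columns=['A', 'B', 'p_value'])

ax = plt.subplot()
pct = df.iloc[:, 0:2] / df.iloc[:, 0:2].sum() * 100
pct.plot.bar(ax=ax)

for i, p_val in enumerate(df['p_value']):
    if p_val < 0.05:
        x1, x2 = i - 0.125, i + 0.125   # assumed x-positions of group i's two bars
        y = pct.iloc[i].max() + 2       # start a little above the taller bar
        ax.plot([x1, x1, x2, x2], [y, y + 1, y + 1, y], color='black', linewidth=1)
        ax.text(i, y + 1.5, '**', ha='center')
plt.show()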
This question already has an answer here:
How to find the exact intersection of a curve (as np.array) with y==0?
(1 answer)
Closed last year.
I have two curves (supply and demand) and I want to find their intersection point (both x and y). I was not able to find a simple solution to this problem. In the end I want my code to print the value of x and the value of y at the intersection.
supply = final['0_y']
demand = final['0_x']
price = final[6]
plt.plot(supply, price)
plt.plot(demand, price)
The main problem is that every method I have tried returns an empty set/list, and even when I try to visualize the intersection I get an empty plot.
GRAPH:
As the implementation of the duplicate is not straightforward, I will show you how you can adapt it to your case. First, you use pandas series instead of numpy arrays, so we have to convert them. Then, your x- and y-axes are switched, so we have to change their order for the function call:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
final = pd.DataFrame({'0_y': [0, 0, 0, 10, 10, 30],
                      '0_x': [20, 11, 10, 4, 1, 0],
                      "6": [-200, 50, 100, 200, 600, 1000]})
supply = final['0_y'].to_numpy()
demand = final['0_x'].to_numpy()
price = final["6"].to_numpy()
plt.plot(supply, price)
plt.plot(demand, price)
def find_roots(x, y):
    # indices where y changes sign between consecutive samples
    s = np.abs(np.diff(np.sign(y))).astype(bool)
    # linearly interpolate the zero crossing inside each such interval
    return x[:-1][s] + np.diff(x)[s] / (np.abs(y[1:][s] / y[:-1][s]) + 1)
z = find_roots(price, supply-demand)
x4z = np.interp(z, price, supply)
plt.scatter(x4z, z, color="red", zorder=3)
plt.title(f"Price is {z[0]} at supply/demand of {x4z[0]}")
plt.show()
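One detail worth knowing: np.interp expects its second argument to be monotonically increasing, which price is here; that is what lets us map the interpolated root back to a supply value.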
Sample output:
Situation
I’m trying to create a boxplot with individual and nested/grouped data. The dataset I use represents information for a number of households, where there is a distinction between 1-phase and 3-phase systems (#).
#NOTE Where an id appears only once, the household is single-phase (1-phase); duplicates are 3-phase systems. Due to the duplicates, reading the csv file via pd.read_csv(..) will extend the duplicates' names (i.e. 1, 1.1 and 1.2).
Using the basic plot techniques delivers:
In [4]: VoltageProfileFile= pd.read_csv(dest + '/VoltageProfiles_' + str(PV_par['value_PV']) + '%PV.csv', dtype= 'float')
...: VoltageProfileFile.boxplot(figsize=(20,5), rot= 60)
...: plt.ylim(0.9, 1.1)
...: plt.show()
Out[4]:
The result is correct, but it would be cleaner to have only 1 tick representing 1, 1.1 and 1.2, or 5, 5.1, 5.2, etc.
Question
I would like to clean this up by using a ‘categorical’ boxplot, where values from duplicates (3-phase systems) are grouped under the same id. I’m aware that seaborn enables users to use the hue parameter: sns.boxplot(x='', hue='', y='', data='') to create categorical plots (Plotting with categorical data). However, I can’t figure out how to format my dataset to achieve this. I tried the pd.melt(..) function (cf. pandas.melt), but the resulting format changes the order in which the values appear (*)
(*) Every id is accompanied by a length up to a reference point, thus the order of appearance on the x-axis must remain.
What would be a good approach to tackle this problem?
Ideally, the boxplot would group 3-phase systems under one id and display different colours for 1ph vs. 3ph systems.
Kind regards,
Rémy
For seaborn plotting, data should be structured in long format, not the wide format you currently have, with distinct indicator columns such as household, phase, and value.
So consider letting Pandas rename the duplicate columns to 1, 1.1, 1.2, then run pd.melt to reshape into long format, adjusting the generated household and phase columns with assign: split on . and take the first and second parts respectively:
VoltageProfileFile_long = (pd.melt(VoltageProfileFile, var_name='phase')
                             .assign(household=lambda x: x['phase'].str.split("\\.").str[0].astype(int),
                                     phase=lambda x: pd.to_numeric(x['phase'].str.split("\\.").str[1]).fillna(0).astype(int).add(1))
                             .reindex(['household', 'phase', 'value'], axis='columns'))
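With this, a household that appears three times in the header (e.g. 1, 1.1, 1.2) becomes three sets of rows with household = 1 and phase = 1, 2, 3, while a single-phase household simply gets phase = 1.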
Below is a demo with random data
Data (dumped to csv then read back in for pandas renaming process)
np.random.seed(111620)
VoltageProfileFile = pd.DataFrame([np.random.uniform(0.95, 1.05, 13) for i in range(50)],
                                  columns=[1, 1, 1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9])
VoltageProfileFile.to_csv('data.csv', index=False)
VoltageProfileFile = pd.read_csv('data.csv')
VoltageProfileFile.head(10)
# 1 1.1 1.2 2 3 ... 5.2 6 7 8 9
# 0 1.012732 1.042768 0.975577 0.965508 1.048544 ... 1.010898 1.008921 1.006769 1.019615 1.036926
# 1 1.013457 1.048378 1.025201 0.982988 0.995133 ... 1.024578 1.024362 0.985693 1.041609 0.995037
# 2 1.024739 1.008590 0.960278 0.956811 1.001739 ... 0.969436 0.953134 0.966851 1.031544 1.036572
# 3 1.037998 0.993246 0.970146 0.989196 0.959527 ... 1.015577 1.027020 1.038941 0.971666 1.040658
# 4 0.995877 0.955734 0.952497 1.040942 0.985759 ... 1.021805 1.044108 0.980657 1.034179 0.980722
# 5 0.994755 0.951557 0.986580 1.021583 0.959249 ... 1.046740 0.998429 1.027406 1.007391 0.989477
# 6 1.023979 1.043418 1.020745 1.006081 1.030413 ... 0.964579 1.035479 0.982969 0.953484 1.005889
# 7 1.018904 1.045440 1.003997 1.018295 0.954814 ... 0.955295 0.960958 0.999492 1.010163 0.985847
# 8 0.960913 0.982671 1.016659 1.030384 1.043750 ... 1.042720 0.972287 1.039235 0.969571 0.999418
# 9 1.017085 0.998049 0.989664 0.953420 1.018018 ... 0.953041 0.955883 1.004630 0.996443 1.017762
Plot (after same processing to generate VoltageProfileFile_long)
sns.set()
fig, ax = plt.subplots(figsize=(8,4))
sns.boxplot(x='household', y='value', hue='phase', data=VoltageProfileFile_long, ax=ax)
plt.title('Boxplot of Values by Household and Phases')
plt.tight_layout()
plt.show()
plt.clf()
plt.close()
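With hue='phase', the three phases of a 3-phase household are grouped under a single x tick (one box per phase, each in its own colour), while single-phase households show just one box, which is the grouping and colouring the question asks for.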
I have a DataFrame in which one column contains different numerical values. I would like to find the most frequently occurring value specifically using the np.histogram() function.
I know that this task can be achieved using functions such as column.value_counts().nlargest(1); however, I am interested in how the np.histogram() function can be used to achieve this goal.
With this task I am hoping to get a better understanding of the function and the resulting values, as the description from the documentation (https://numpy.org/doc/1.18/reference/generated/numpy.histogram.html) is not so clear to me.
Below I am sharing an example Series of values to be used for this task:
data = pd.Series(np.random.randint(1,10,size=100))
This is one way to do it:
import numpy as np
import pandas as pd
# Make data
np.random.seed(0)
data = pd.Series(np.random.randint(1, 10, size=100))
# Make bin edges so each integer gets its own bin: min, min+1, ..., max+1
bins = np.arange(data.min(), data.max() + 2)
# Compute histogram
h, _ = np.histogram(data, bins)
# Find most frequent value
mode = bins[h.argmax()]
# Mode computed with Pandas
mode_pd = data.value_counts().nlargest(1).index[0]
# Check result
print(mode == mode_pd)
# True
You can also define bins as:
bins = np.unique(data)
bins = np.append(bins, bins[-1] + 1)
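The extra edge is needed because np.histogram treats the last bin as closed on both sides; appending max + 1 gives every distinct value its own half-open bin.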
Or, if your data contains only non-negative integers, you can use np.bincount directly:
mode = np.bincount(data).argmax()
Of course there is also scipy.stats.mode:
import scipy.stats
mode = scipy.stats.mode(data)[0][0]
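(Heads-up: in recent SciPy versions scipy.stats.mode returns scalars by default, so scipy.stats.mode(data).mode, without the double indexing, is what you would use there.)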
It can be done with:
hist, bin_edges = np.histogram(data, bins=np.arange(0.5,10.5))
result = np.argmax(hist)
You just need to read the documentation more carefully. It says that if bins is [1, 2, 3, 4] then the first bin is [1, 2), the second is [2, 3), and the last is [3, 4] (the final bin includes its right edge).
For your problem specifically, we count how many values fall into the bins [0.5, 1.5), [1.5, 2.5), ..., [8.5, 9.5] and take the index of the bin with the maximum count.
Just in case, it is worth using
np.unique(data)[np.argmax(hist)]
if you are not sure that your sorted data set np.unique(data) includes all the consecutive integers 0, 1, 2, 3, ...
I have converted a continuous dataset to categorical. I get NaN values whenever the value of the continuous data is 0.0 after conversion. Below is my code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins)
category = category.to_frame()
print (category)
How do I convert the values so that I don't get NaN values? I have attached two screenshots for a better understanding of how the actual data looks and what it becomes after conversion. This is the main dataset. This is what it becomes after using bins and pandas.cut(). How can those "0.00" values stay like the other values in the dataset?
When using pd.cut, you can specify the parameter include_lowest=True. This will make the first interval left-inclusive (it will include the 0 value, since your first interval starts at 0).
So in your case, you can adjust your code to be
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins,include_lowest=True)
category = category.to_frame()
print (category)
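To see the difference on a small example (made-up data, since the NSL-KDD file is not at hand):

import pandas as pd

s = pd.Series([0.0, 0.02, 0.5])
print(pd.cut(s, [0.0, 0.05, 0.5]))
# 0.0 falls outside the half-open interval (0.0, 0.05] and becomes NaN
print(pd.cut(s, [0.0, 0.05, 0.5], include_lowest=True))
# 0.0 now lands in the first interval, displayed as (-0.001, 0.05]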
Documentation Reference for pd.cut