I started to use python 6 months ago and may be my question is a naive one. I would like to visualize my data and ANOVA statistics. It is common to do this using a barplot with added lines indicating significant differences and interactions. How do you make plot like this using python ?
enter image description here
Here is a simple dataframe, with 3 columns (A,B and the p_values already calculated with a t-test)
mport pandas as pd
import matplotlib.pyplot as plt
import numpy as np
ar = np.array([ [565.0, 81.0, 1.630947e-02],
[1006.0, 311.0, 1.222740e-27],
[2929.0, 1292.0, 5.559912e-12],
[3365.0, 1979.0, 2.507474e-22],
[2260.0, 1117.0, 1.540305e-01]])
df = pd.DataFrame(ar,columns = ['A', 'B', 'p_value'])
ax = plt.subplot()
# I calculate the percentage
(df.iloc[:,0:2]/df.iloc[:,0:2].sum()*100).plot.bar(ax=ax)
for container, p_val in zip(ax.containers,df['p_value']):
labels = [f"{round(v,1)}%" if (p_val > 0.05) else f"(**)\n{round(v,1)}%" for v in container.datavalues]
ax.bar_label(container,labels=labels, fontsize=10,padding=8)
plt.show()
Initially I just wanted to add a "**" each time a significant difference is observed between the 2 columns A & B. But the initial code above is not really working.
Now I would prefer having the added lines indicating significant differences and interactions between the A&B columns. But I have no ideas how to make it happen.
Regards
JYK
I'm interested in the first time a random process crosses a threshold. I am storing the results from observing the process in a dataframe, and have plotted how many times several realisations of that process cross 0.9 after I observe it a the end of 14 rounds.
This image was created with this code
import matplotlib.pyplot as plt
plt.style.use('ggplot')
fin = pd.DataFrame(data=np.random.uniform(size=(100, 13))).T
pos = (fin>0.9).astype(float)
ax=fin.loc[:, pos.loc[12, :] != 1.0].plot(figsize=(12, 6), color='silver', legend=False)
fin.loc[:, pos.loc[12, :] == 1.0].plot(figsize=(12, 6), color='indianred', legend=False, ax=ax)
where fin contained the random numbers, and pos was 1 every time that process crossed 0.9.
I would like to now plot the first time the process in fin crosses 0.9 for each realisation (columns represent realisations, rows represent observation times)
I can find the first occurence of a value above 0.9 with idxmax() but I'm stumped about how to remove everything in the dataframe after that in each column.
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.uniform(size=(100, 10)))
maxes = df.idxmax()
It's just that I'm having real difficulty thinking through this.
If I understand correctly, you can use
df = df[df.index < maxes[0]]
IIUC, we can use a boolean matrix with cumprod:
df.where((df < .9).cumprod().astype(bool)).plot()
Output:
I have a ton of ranges. They all consist of numbers. The range has a maximum and a minimum which can not be exceeded, but given the example that you have two ranges and one max point of the range reaches above the min area of the other. That would mean that you have a small area that covers both of them. You can write one range that includes the others.
I want to see if some ranges overlap or if I can find some ranges that cover most of the other. The goal would be to see if I can simplify them by using one smaller range that fits inside the other. For example 7,8 - 9,6 and 7,9 - 9,6 can be covered with one range.
You can see my attempt to visualize them. But when I use my entire dataset consisting of hundreds of ranges my graph is not longer useful.
I know that I can detect recurrent ranges using python. But I don't want to know how often a range occurs. I want to know how many ranges lay in the same numerical boundaries.I want see if I can have a couple of ranges covering all of them. Finally my goal is to have the masterranges sorted in categories. Meaning that I have range 1 covering 50 other ranges. then range 2 covering 25 ranges and so on.
My current program shows the penetration of ranges but I also want that in a printed output with the exact digits.
It would be nice if you share some ideas to solve that program or if you have any suggestions on tools within python 3.7
import matplotlib.pyplot as plt
intervals = [[3.6,4.5],
[3.6,4.5],
[7.8,9.6],
[7.9,9.6],
[7.8,9.6],
[3.4,4.1],
[2.8,3.4],
[8.25,9.83],
[3.62,3.96],
[8.25,9.83],
[0.62,0.68],
[2.15,2.49],
[0.8,1.0],
[0.8,1.0],
[3.1,3.9],
[6.7,8.3],
[1,1.5],
[1,1.2],
[1.5,1.8],
[1.8,2.5],
[3,4.0],
[6.5,8.0],
[1.129,1.35],
[2.82,3.38],
[1.69,3.38],
[3.38,6.21],
[2.25,2.82],
[5.649,6.214],
[1.920,6.214]
]
for int in intervals:
plt.plot(int,[0,0], 'b', alpha = 0.2, linewidth = 100)
plt.show()
Here is an idea, You make a pandas data frame with the array. You substract the values in column2 - colum1 ( column 1 is x, and column 2 is y ). After that you create a histogram in which you take the range and the frecuency.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
intervals = [[3.6,4.5],
[3.6,4.5],
[7.8,9.6],
[7.9,9.6],
[7.8,9.6],
[3.4,4.1],
[2.8,3.4],
[8.25,9.83],
[3.62,3.96],
[8.25,9.83],
[0.62,0.68],
[2.15,2.49],
[0.8,1.0],
[0.8,1.0],
[3.1,3.9],
[6.7,8.3],
[1,1.5],
[1,1.2],
[1.5,1.8],
[1.8,2.5],
[3,4.0],
[6.5,8.0],
[1.129,1.35],
[2.82,3.38],
[1.69,3.38],
[3.38,6.21],
[2.25,2.82],
[5.649,6.214],
[1.920,6.214]]
intervals_ar = np.array(intervals)
df = pd.DataFrame({'Column1': intervals_ar[:, 0], 'Column2': intervals_ar[:, 1]})
df['Ranges'] = df['Column2'] - df ['Column1']
print(df)
frecuency_range = df['Ranges'].value_counts().sort_index()
print(frecuency_range)
df.Ranges.value_counts().sort_index().plot(kind = 'hist', bins = 5)
plt.title("Histogram Frecuency vs Range (column 2- column1)")
plt.show()
I have converted a continuous dataset to categorical. I am getting nan values when ever the value of the continuous data is 0.0 after conversion. Below is my code
import pandas as pd
import matplotlib as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins)
category = category.to_frame()
print (category)
How do I convert the values so that I dont get NaN values. I have attached two screenshots for better understanding how the actual data looks and how the convert data looks. This is the main dataset. This is the what it becomes after using bins and pandas.cut(). How can thos "0.00" stays like the other values in the dataset.
When using pd.cut, you can specify the parameter include_lowest = True. This will make the first internal left inclusive (it will include the 0 value as your first interval starts with 0).
So in your case, you can adjust your code to be
import pandas as pd
import matplotlib as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins,include_lowest=True)
category = category.to_frame()
print (category)
Documentation Reference for pd.cut
I have a pandas dataframe and I want to bin the data by values of a single column. E.G. 0-1, 1-2, etc, starting at 0 and ending at 1, with intervals of 0.1, taking the mean of each column within each bin.
I'm attempting to accomplish this using the .groupby functionality of pandas. See my code below:
import pandas as pd
import numpy as np
my_df = pd.DataFrame({"a": np.random.random(100),
"b": np.random.random(100),
"id": np.arange(100)})
bins = np.linspace(0, 1, 0.1)
groups = my_df.groupby(np.digitize(my_df.a, bins))
binned_data = groups.mean()
print binned_data
The print line then gives a single row with index "1", even though the data of column "a" should have a range of values for the bins specified.
I think its a problem with the creation of "bins" but I can't work out what.
I want 10 rows binned from 0 to 1 in 0.1 intervals. How can I accomplish this?
Many thanks.