Complex mask for dataframe - python

I have a dataframe with a time series in one single column. The data looks like this chart
I would like to create a mask that is TRUE each time the data is equal to or lower than -0.20. It should also be TRUE before and after reaching -0.20, for as long as the data stays negative.
This version of the chart is my manual attempt to show (in red) the values where the mask would be TRUE. I started creating the mask, but I could only make it TRUE while the data is less than -0.20: mask = (df['data'] < -0.2). I couldn't do any better; does anybody know how to achieve my goal?

One approach could be to group the segments that are entirely below zero, and then for each group check whether it contains any values below -0.2.
See below for a full reproducible example script:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(167)
df = pd.DataFrame(
    {"y": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(10 ** 5)])}
)
plt.plot(df)
below_zero = df["y"] < 0
regions = (below_zero != below_zero.shift()).cumsum()
# here's your DataFrame restricted to the regions matching the mask
df_interesting = df.groupby(regions).filter(lambda g: g["y"].min() < -0.2)
# plot individual regions
for i, grp in df.groupby(regions):
    if grp["y"].min() < -0.2:
        plt.plot(grp, color="tab:red", linewidth=5, alpha=0.6)
plt.axhline(0, linestyle="--", color="tab:gray")
plt.axhline(-0.2, linestyle="--", color="tab:gray")
plt.show()
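If you need the boolean mask itself rather than the filtered frame, a minimal sketch using transform on the same regions grouper:
# True for every row inside a below-zero region that dips under -0.2;
# non-negative regions have min >= 0 and therefore come out False
mask = df.groupby(regions)["y"].transform("min").lt(-0.2)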

Idea
Group by consecutive values of the same sign, then check whether the minimum of each such group is less than the defined threshold.
Implementation
First, we want to separate negative from positive values.
negative_mask = (df['data']<0)
We can then create classes (ordered with integers) for each consecutive positive or negative run. The class label increases by one each time the data changes sign.
consecutives = negative_mask.diff().ne(0).cumsum()
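On a toy series, the resulting group labels look like this (a minimal sketch):
import pandas as pd

s = pd.Series([0.1, -0.3, -0.1, 0.2, -0.5])
neg = s < 0
print(neg.diff().ne(0).cumsum().tolist())  # [1, 2, 2, 3, 4]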
We then keep only the data where the minimum of the group of consecutive elements is less than -0.2.
df.groupby(consecutives).filter(lambda g: g['data'].min() < -0.2)
Example with random data
We can try our example with random data:
import numpy as np
import pandas as pd

np.random.seed(42)
data = np.random.randint(-300, 300, size=1000) / 1000
df = pd.DataFrame(data, columns=["data"])

negative_mask = (df['data'] < 0)
consecutives = negative_mask.diff().ne(0).cumsum()
df.groupby(consecutives).filter(lambda g: g['data'].min() < -0.2)
Output
       data
2    -0.030
3    -0.194
4    -0.229
5    -0.280
6    -0.179
..      ...
991  -0.293
995  -0.247
996  -0.062
997  -0.072
999  -0.250
363 rows × 1 columns

Related

How to set a seaborn color map in an arbitrary range?

I am creating a heatmap for the correlations between items.
sns.heatmap(df_corr, fmt=".2g", cmap='vlag', cbar=True, annot=True)
I chose vlag because it has red for high values, blue for low values, and white in the middle.
Seaborn automatically sets red for the highest value and blue for the lowest value in the dataframe.
However, since I am tracking Pearson's correlation, the value range is between -1 and 1, so I would like 1 to be represented by red and -1 by blue, leaving 0 to be represented by white.
What the result looks like:
What it should look like*:
*(Of course this was generated by "cheating" - setting -1 as value(s) to force the range to be from -1 to 1; I want to set this range without warping my data)
Use vmin=-1 and vmax=1:
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
data = np.random.uniform(low=-0.5, high=0.5, size=(5,5))
hm = sn.heatmap(data=data, cmap='vlag', annot=True, vmin=-1, vmax=1)
plt.show()
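Seaborn's heatmap also accepts a center argument that anchors the midpoint of a diverging colormap, which combines naturally with vmin and vmax:
# center=0 pins white at zero on the diverging 'vlag' map
hm = sn.heatmap(data=data, cmap='vlag', annot=True, vmin=-1, vmax=1, center=0)
plt.show()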
Here is an unorthodox solution. You can "standardize" your data to the range [-1, 1]. Even though the theoretical range of the Pearson coefficient is [-1, 1], strong negative correlations are not present in your dataset.
So you can create another dataframe that contains the data rescaled so that its max is 1 and its min is -1. You can then plot this dataframe to get the desired effect. The advantage of this procedure is that the technique generalizes to pretty much any dataframe (not verified, though).
Here is the code:
import pandas as pd
import numpy as np

# Target scale for the data
scale_minimum = -1
scale_maximum = 1
scale_range = scale_maximum - scale_minimum

# Applying the scaling
df_minimum, df_maximum = df.min(), df.max()   # range of the current dataframe
df_range = df_maximum - df_minimum            # the range of the data
df = (df - df_minimum) / df_range             # scaling between 0 and 1
df_scaled = df * scale_range + scale_minimum  # scaling between -1 and 1
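A minimal usage sketch, assuming df_scaled is the rescaled correlation frame from above:
import matplotlib.pyplot as plt
import seaborn as sn

# Plot the rescaled frame; white now falls at the data's midpoint
hm = sn.heatmap(data=df_scaled, cmap='vlag', annot=True)
plt.show()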
Hope this solves your problem.

Pandas: removing everything in a column after first value above threshold

I'm interested in the first time a random process crosses a threshold. I am storing the results from observing the process in a dataframe, and have plotted how many times several realisations of that process cross 0.9 after I observe it at the end of 14 rounds.
This image was created with this code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.style.use('ggplot')
fin = pd.DataFrame(data=np.random.uniform(size=(100, 13))).T
pos = (fin > 0.9).astype(float)
ax = fin.loc[:, pos.loc[12, :] != 1.0].plot(figsize=(12, 6), color='silver', legend=False)
fin.loc[:, pos.loc[12, :] == 1.0].plot(figsize=(12, 6), color='indianred', legend=False, ax=ax)
where fin contained the random numbers, and pos was 1 every time that process crossed 0.9.
I would like to now plot the first time the process in fin crosses 0.9 for each realisation (columns represent realisations, rows represent observation times).
I can find the first occurrence of a value above 0.9 with idxmax(), but I'm stumped about how to remove everything in the dataframe after that in each column.
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.uniform(size=(100, 10)))
maxes = (df > 0.9).idxmax()  # index of the first value above 0.9 in each column
It's just that I'm having real difficulty thinking through this.
If I understand correctly, you can use
df = df[df.index < maxes[0]]
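That truncates every column at the first column's crossing; a per-column variant could look like this (a sketch, reusing maxes from the question):
# Keep, per column, only the rows before that column's first crossing
trimmed = df.apply(lambda col: col[col.index < maxes[col.name]])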
IIUC, we can use a boolean matrix with cumprod:
df.where((df < .9).cumprod().astype(bool)).plot()
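A minimal sketch of why this works: cumprod turns every value from the first False onward into 0, so where blanks out everything from the first crossing:
import pandas as pd

# Toy column: once a value >= 0.9 appears, cumprod stays 0 for the rest
s = pd.Series([0.2, 0.5, 0.95, 0.3])
keep = (s < 0.9).cumprod().astype(bool)  # True, True, False, False
print(s.where(keep))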

How to run function for multitude of arrays

So I need to analyse the peak number & width of a signal (in my case a calcium signal from epidermis cells) that I have stored in an Excel sheet. Each column holds all the values for one cell (600 values).
To analyse the peaks, which I will be doing with the scipy.signal.find_peaks() and scipy.signal.peak_widths() functions, I put each individual column into a 1D numpy array containing all the values from that column.
I did this by saving all the individual columns (Columns are named A, B, C, D, etc in Excelsheet) into their own dataframes (df_A, df_B) then putting them in an array :
import numpy as np
import pandas as pd
df = pd.read_excel('test.xlsx')
df_A = df.loc[:,'A']
df_B = df.loc[:,'B']
arrA = np.array(df_A)
arrB = np.array(df_B)
To calculate the peak number & width I used the following lines:
from scipy.signal import find_peaks, peak_widths

peaks_A, _ = find_peaks(arrA, height=7000, prominence=1)
results_peakwidth_A = peak_widths(arrA, peaks_A, rel_height=0.5)
Now, since I have not one but more than 100 cells/signals to analyse, is there a simple way to do this for all the cells/arrays? This exceeds my capabilities, so I would gladly welcome any help.
The proposal would be as follows. In essence, you first select the required columns (however many there are). Then you create a function that takes in a column (no need to turn it into an array; if scipy disagrees, add column = column.values at the top of the process function).
Afterwards use apply, which will loop through each column in the dataframe and pass it into the function you defined.
import pandas as pd
from scipy.signal import find_peaks, peak_widths

df = pd.read_excel('test.xlsx')
df = ...  # select all columns from A-Z into a single dataframe with the columns required.
# The shape here would be
#  A  B  C
#  1  4  4.1
#  2  3  4.0
#  ...

# define the function you want to apply to each column
def process(column):
    peaks, _ = find_peaks(column, height=7000, prominence=1)
    return peak_widths(column, peaks, rel_height=0.5)

new_columns = df.apply(process)
As I'm unsure what the actual output should look like, you might want to keep both the peaks and the widths. In that case you could alter the process function slightly:
def process(column):
    peaks, _ = find_peaks(column, height=7000, prominence=1)
    width = peak_widths(column, peaks, rel_height=0.5)
    return pd.Series({"width": width, "peaks": peaks})
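A short usage sketch for this variant (names are illustrative):
results = df.apply(process)          # index: 'width' and 'peaks'; one column per cell
peaks_A = results.loc['peaks', 'A']  # peak indices for column A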

Plotting multiple columns' values on the x-axis in python

I have a dataframe of size (100, 3) that is filled with some random float values.
Here is a sample of what the data frame looks like:
       A         B         C
4.394966  0.580573  2.293824
3.136197  2.227557  1.306508
4.010782  0.062342  3.629226
2.687100  1.050942  3.143727
1.280550  3.328417  2.247764
4.417837  3.236766  2.970697
1.036879  1.477697  4.029579
2.759076  4.753388  3.222587
1.989020  4.161404  1.073335
1.054660  1.427896  2.066219
0.301078  2.763342  4.166691
2.323838  0.791260  0.050898
3.544557  3.715050  4.196454
0.128322  3.803740  2.117179
0.549832  1.597547  4.288621
This is how I created it
df = pd.DataFrame(np.random.uniform(0,5,size=(100, 3)), columns=list('ABC'))
Note: pd is pandas
I want to plot a bar chart with three segments on the x-axis, where each segment has 2 bars: one showing the number of values less than 2, the other the number of values greater than or equal to 2.
So on the x-axis there would be two bars side by side for column A (one with the total number of values less than 2, one with the number greater than or equal to 2), and the same for B and C.
Can anyone suggest anything?
I was thinking of using seaborn and setting the hue value to differentiate the two classes (less than 2, and greater than or equal to 2), but the hue attribute only works for categorical values and I can only set one column as the x-axis attribute.
Any tips would be appreciated.
You can filter and then count the values, then use plot(kind='bar'):
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.uniform(0, 5, size=(100, 3)), columns=list('ABC'))
dfout = pd.DataFrame({'minor': df[df < 2].count(),    # values < 2 per column
                      'major': df[df >= 2].count()})  # values >= 2 per column
dfout.plot(kind='bar')
plt.show()
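If you would rather follow the seaborn/hue idea from the question, here is a sketch that melts the frame to long form and bins each value into a class first:
import seaborn as sns

long_df = df.melt(var_name='column', value_name='value')  # one row per value
long_df['class'] = np.where(long_df['value'] < 2, '< 2', '>= 2')
sns.countplot(data=long_df, x='column', hue='class')
plt.show()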

Separating out pandas series for pyplot

I currently have a set of series in pandas, and each series is composed of two data sets. I need to separate out the two data sets into lists while retaining the series information, i.e. the time and intensity data for 58V.
My current code looks like:
import numpy as np
import pandas as pd

xl = pd.ExcelFile("TEST_ATD.xlsx")
df = xl.parse("Sheet1")
series = xl.parse("Sheet1")
voltages = []
for item in df:
    if "V" in item:
        voltages.append(item)

data_list = []
for value in voltages:
    print(df[value])
How do I select a particular data set from the series and extract it into a list? If I ask it to print(df[value]), it returns my data sets, an example of which looks like:
0.000     0
0.180     1
0.360     1.2
0.540     1.5
0.720     1.2
          ..
35.277    0
35.457    0
35.637    0
NaN       0
Name: 58V, dtype: int64
Ultimately I plan to plot these data sets into a line graph with pyplot.
~~~ UPDATE ~~~
using
import matplotlib.pyplot as plt

for value in voltages:
    intensity = []
    for row in series[value].tolist():
        intensity.append(row)
    time = range(0, len(intensity))
    pc_intensity = []
    for item in intensity:
        pc_intensity.append(100 / max(intensity) * item)
    plt.plot(time, pc_intensity)
    axes = plt.gca()
    axes.set_ylim([0, 100])
    plt.title(value)
    plt.ylabel('Intensity')
    plt.xlabel('Time')
    plt.savefig(value + '.png')
    plt.clf()
    print(value)
I am able to get the plots of the first 8 data series (using an arbitrary x-axis); however, for anything past the 8th series my plots are empty. I have experimented and found this to be due to some of the series being different lengths. I'm confused as to why this would affect the plots, as the x-axis is directly related to the length of the data set it is being plotted against.
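One guess at the empty plots: if the shorter columns are padded with NaNs when the sheet is parsed, max(intensity) becomes NaN and the whole normalised line is blanked. A sketch that sidesteps this by dropping NaNs and plotting against the series' own index:
# Drop padding NaNs, then normalise against the real maximum
s = series[value].dropna()
plt.plot(s.index, 100 * s / s.max())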
I am not sure what you are trying to achieve, but I'll take a guess:
df = pd.DataFrame({'A': range(1, 10), 'B': range(1, 10), 'C': range(1, 10),
                   'D': range(1, 10), 'E': [1, 1, 1, 2, 2, 2, 2, 3, 4]})
for col in df.columns:
    print(df[col].values.tolist())
This would print every column of your dataframe as a list.
If you are just trying to plot something, why not just use
df.plot()
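To address the original extraction question directly, a minimal sketch that keeps the two data sets paired (assuming, as the printed output suggests, that time is the index of the 58V series):
time_58V = df['58V'].index.tolist()   # time axis from the index
intensity_58V = df['58V'].tolist()    # intensity values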
