Pandas: removing everything in a column after first value above threshold

I'm interested in the first time a random process crosses a threshold. I am storing the results from observing the process in a dataframe, and have plotted how many times several realisations of that process cross 0.9 after I observe it at the end of 14 rounds.
This image was created with this code
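import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
plt.style.use('ggplot')
fin = pd.DataFrame(data=np.random.uniform(size=(100, 13))).T
pos = (fin > 0.9).astype(float)
# realisations that are not above 0.9 at the last observation (row 12) in silver
ax = fin.loc[:, pos.loc[12, :] != 1.0].plot(figsize=(12, 6), color='silver', legend=False)
# realisations that are above 0.9 at the last observation in red, on the same axes
fin.loc[:, pos.loc[12, :] == 1.0].plot(figsize=(12, 6), color='indianred', legend=False, ax=ax)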
where fin contained the random numbers, and pos was 1 every time that process crossed 0.9.
I would like to now plot the first time the process in fin crosses 0.9 for each realisation (columns represent realisations, rows represent observation times)
I can find the first occurrence of a value above 0.9 with idxmax(), but I'm stumped about how to remove everything in the dataframe after that in each column.
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.uniform(size=(100, 10)))
maxes = df.idxmax()
It's just that I'm having real difficulty thinking through this.

If I understand correctly, you can use
df = df[df.index < maxes[0]]

IIUC, we can use a boolean matrix with cumprod:
df.where((df < .9).cumprod().astype(bool)).plot()
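To make the cumprod idea concrete, here is a minimal sketch on a small hand-made frame (the values, the column names a/b/c, and the names below, before_crossing, truncated and first_crossing are assumptions for illustration, not part of the question). The cumulative product of the "below threshold" mask stays 1 until the first crossing in a column and is 0 from then on, so it marks exactly the rows to keep.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows are observation times, columns are realisations.
df = pd.DataFrame({
    "a": [0.1, 0.95, 0.2, 0.99],   # crosses at row 1
    "b": [0.5, 0.6, 0.7, 0.8],     # never crosses
    "c": [0.92, 0.1, 0.1, 0.1],    # crosses at row 0
})
threshold = 0.9

below = df < threshold
# cumprod is 1 until the first False in each column, then 0 for every later row.
before_crossing = below.cumprod().astype(bool)
# Keep values strictly before the first crossing; everything else becomes NaN.
truncated = df.where(before_crossing)

# First crossing time per column; columns that never cross are set to NaN.
crossed = ~below
first_crossing = crossed.idxmax().where(crossed.any())

print(truncated)
print(first_crossing)
```

Note that this mask also blanks the crossing value itself; shifting the mask down by one row before taking the cumulative product would be one way to keep it.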

Related

Complex mask for dataframe

I have a dataframe with a time series in one single column. The data looks like this chart
I would like to create a mask that is TRUE each time that the data is equal or lower than -0.20. It should also be TRUE before reaching -0.20 while negative. It should also be true after reaching -0.20 while negative.
This version of the chart
is my manual attempt to show (in red) the values where the mask would be TRUE. I started creating the mask, but I could only make it TRUE while the data is less than -0.20: mask = (df['data'] < -0.2). I couldn't do any better. Does anybody know how to achieve my goal?

One approach could be to group segments that are entirely below zero, and then for each group verify whether or not there are any values below -0.2.
See below for a full reproducible example script:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(167)
df = pd.DataFrame(
{"y": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(10 ** 5)])}
)
plt.plot(df)
is_negative = df["y"] < 0
# label each run of consecutive same-sign values with its own integer
regions = (is_negative != is_negative.shift()).cumsum()
# here's your interesting DataFrame with the specified mask
df_interesting = df.groupby(regions).filter(lambda g: g["y"].min() < -0.2)
# plot individual regions
for i, grp in df.groupby(regions):
    if grp["y"].min() < -0.2:
        plt.plot(grp, color="tab:red", linewidth=5, alpha=0.6)
plt.axhline(0, linestyle="--", color="tab:gray")
plt.axhline(-0.2, linestyle="--", color="tab:gray")
plt.show()

Idea
Group by consecutive values of the same sign, and then check if the minimum of such a group is less than the defined threshold.
Implementation
First, we want to separate negative from positive values.
negative_mask = (df['data']<0)
We then can create classes (ordered with integers) for each consecutive positive or negative series. The class increases by one each time the data changes sign.
consecutives = negative_mask.diff().ne(0).cumsum()
We then select only the data where the minimum of the group of consecutive elements is less than -0.2.
df.groupby(consecutives).filter(lambda g: g['data'].min() < -0.2)
Example with random data
We can try our example with random data:
import numpy as np
import pandas as pd
np.random.seed(42)
data = np.random.randint(-300, 300, size=1000)/1000
df = pd.DataFrame(data, columns=["data"])
Output
      data
2   -0.030
3   -0.194
4   -0.229
5   -0.280
6   -0.179
..     ...
991 -0.293
995 -0.247
996 -0.062
997 -0.072
999 -0.250
363 rows × 1 columns
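Both answers return the filtered rows rather than a boolean mask. If a mask aligned with the original index is what is needed, a minimal sketch (assuming a column called data as in the question; the sample values and the names run_id and run_min are made up for illustration) is to broadcast each run's minimum back onto its rows with transform:

```python
import pandas as pd

# Hypothetical sample: one negative run dips below -0.2, the other does not.
df = pd.DataFrame({"data": [0.1, -0.05, -0.25, -0.1, 0.2, -0.1, -0.15, 0.05]})
threshold = -0.2

negative = df["data"] < 0
# Label each run of consecutive same-sign values with its own integer.
run_id = (negative != negative.shift()).cumsum()
# Minimum of the run each row belongs to, aligned with the original index.
run_min = df["data"].groupby(run_id).transform("min")
# True for every negative value whose run reaches the threshold at some point.
mask = negative & (run_min <= threshold)

print(mask.tolist())
```

Here df[mask] selects the same kind of rows as the filter call, while mask itself can be reused for plotting or further selection.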

How to set a seaborn color map in an arbitrary range?

I am creating a heatmap for the correlations between items.
sns.heatmap(df_corr, fmt=".2g", cmap='vlag', cbar=True, annot=True)
I choose vlag as it has red for high values, blue for low values, and white for the middle.
Seaborn automatically sets red for the highest value and blue for the lowest value in the dataframe.
However, as I am tracking Pearson's correlation, the value range is between -1 and 1, so I would like 1 to be represented by red and -1 by blue, leaving 0 to be represented by white.
What the result looks like:
How it should be*:
*(Of course this was generated by "cheating" - setting -1 as value(s) to force the range to be from -1 to 1; I want to set this range without warping my data)

Set vmin=-1 and vmax=1:
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
data = np.random.uniform(low=-0.5, high=0.5, size=(5,5))
hm = sn.heatmap(data = data, cmap= 'vlag', annot = True, vmin=-1, vmax=1)
plt.show()

Here is an unorthodox solution: you can rescale your data to the range [-1, 1]. Even though the theoretical range of the Pearson coefficient is [-1, 1], strong negative correlations are not present in your dataset.
So, you can create another dataframe containing the data rescaled so that its maximum is 1 and its minimum is -1, and then plot that dataframe to get the desired effect. This technique should generalize to pretty much any dataframe (not verified, though).
Here is the code -
import pandas as pd
import numpy as np
# Target scale for the data
scale_minimum = -1
scale_maximum = 1
scale_range = scale_maximum - scale_minimum
# Applying the scaling (df.min() and df.max() are computed per column)
df_minimum, df_maximum = df.min(), df.max()  # Current range of each column
df_range = df_maximum - df_minimum  # The range of the data
df_unit = (df - df_minimum) / df_range  # Scaling between 0 and 1
df_scaled = df_unit * scale_range + scale_minimum  # Scaling between -1 and 1
Hope this solves your problem.
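As a rough sketch of the rescaling idea, the helper below (rescale is a hypothetical name, and the tiny correlation matrix is made up) uses one global minimum and maximum for the whole frame instead of per-column values, which is an assumption about what is wanted for a correlation matrix:

```python
import pandas as pd

def rescale(frame, new_min=-1.0, new_max=1.0):
    """Linearly map the smallest value in frame to new_min and the largest to new_max."""
    lo = frame.min().min()   # global minimum over all columns
    hi = frame.max().max()   # global maximum over all columns
    return (frame - lo) / (hi - lo) * (new_max - new_min) + new_min

# Hypothetical 2x2 correlation matrix.
corr = pd.DataFrame([[1.0, 0.2], [0.2, 1.0]], index=["x", "y"], columns=["x", "y"])
scaled = rescale(corr)
print(scaled)
```

In this example the correlation of 0.2 is displayed as -1 after rescaling, so the plotted and annotated numbers are no longer the original coefficients; passing vmin and vmax to the heatmap leaves the data untouched.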

I want to detect ranges with the same numerical boundaries of a dataset using matplotlib or pandas in python 3.7

I have a large number of numeric ranges. Each range has a minimum and a maximum that cannot be exceeded. If the maximum of one range lies above the minimum of another, the two ranges overlap in a small area, and you can write one range that includes both.
I want to see if some ranges overlap or if I can find some ranges that cover most of the others. The goal is to simplify them by using one range in place of several that fit inside each other. For example, 7.8 - 9.6 and 7.9 - 9.6 can be covered with one range.
You can see my attempt to visualize them. But when I use my entire dataset, consisting of hundreds of ranges, my graph is no longer useful.
I know that I can detect recurring ranges using Python, but I don't want to know how often a range occurs. I want to know how many ranges lie within the same numerical boundaries, and whether a few ranges can cover all of them. My final goal is to have the master ranges sorted into categories: range 1 covering 50 other ranges, range 2 covering 25 ranges, and so on.
My current program shows the overlap of the ranges, but I also want it as printed output with the exact numbers.
It would be nice if you could share some ideas for solving this problem, or suggest tools within Python 3.7.
import matplotlib.pyplot as plt
intervals = [[3.6,4.5],
[3.6,4.5],
[7.8,9.6],
[7.9,9.6],
[7.8,9.6],
[3.4,4.1],
[2.8,3.4],
[8.25,9.83],
[3.62,3.96],
[8.25,9.83],
[0.62,0.68],
[2.15,2.49],
[0.8,1.0],
[0.8,1.0],
[3.1,3.9],
[6.7,8.3],
[1,1.5],
[1,1.2],
[1.5,1.8],
[1.8,2.5],
[3,4.0],
[6.5,8.0],
[1.129,1.35],
[2.82,3.38],
[1.69,3.38],
[3.38,6.21],
[2.25,2.82],
[5.649,6.214],
[1.920,6.214]
]
for interval in intervals:
    plt.plot(interval, [0, 0], 'b', alpha=0.2, linewidth=100)
plt.show()

Here is an idea: build a pandas DataFrame from the array, subtract column 1 from column 2 (column 1 is x, column 2 is y) to get the width of each range, and then create a histogram of the widths and their frequencies.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
intervals = [[3.6,4.5],
[3.6,4.5],
[7.8,9.6],
[7.9,9.6],
[7.8,9.6],
[3.4,4.1],
[2.8,3.4],
[8.25,9.83],
[3.62,3.96],
[8.25,9.83],
[0.62,0.68],
[2.15,2.49],
[0.8,1.0],
[0.8,1.0],
[3.1,3.9],
[6.7,8.3],
[1,1.5],
[1,1.2],
[1.5,1.8],
[1.8,2.5],
[3,4.0],
[6.5,8.0],
[1.129,1.35],
[2.82,3.38],
[1.69,3.38],
[3.38,6.21],
[2.25,2.82],
[5.649,6.214],
[1.920,6.214]]
intervals_ar = np.array(intervals)
df = pd.DataFrame({'Column1': intervals_ar[:, 0], 'Column2': intervals_ar[:, 1]})
df['Ranges'] = df['Column2'] - df['Column1']
print(df)
frequency_range = df['Ranges'].value_counts().sort_index()
print(frequency_range)
# Histogram of the range widths (the histogram itself does the counting)
df['Ranges'].plot(kind='hist', bins=5)
plt.title("Histogram: Frequency vs Range (Column2 - Column1)")
plt.show()
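The histogram above looks at range widths, not at overlap. For the stated goal of finding "master" ranges and how many original ranges each one covers, a different technique may be worth trying: sort the intervals by their lower bound and merge neighbours that overlap. The sketch below is one possible approach and not the answerer's method; merge_intervals and the short sample list are assumptions for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping [low, high] intervals.

    Returns a list of (low, high, count) tuples, where count is the number
    of original intervals absorbed into that merged range. Intervals that
    merely touch (low equals the previous high) are merged as well.
    """
    merged = []
    for low, high in sorted(intervals):
        if merged and low <= merged[-1][1]:
            prev_low, prev_high, count = merged[-1]
            merged[-1] = (prev_low, max(prev_high, high), count + 1)
        else:
            merged.append((low, high, 1))
    return merged

# Hypothetical subset of the intervals from the question.
sample = [[3.6, 4.5], [3.4, 4.1], [7.8, 9.6], [7.9, 9.6], [0.8, 1.0]]

# Sort the merged ranges so the one covering the most intervals comes first.
master_ranges = sorted(merge_intervals(sample), key=lambda r: r[2], reverse=True)
for low, high, count in master_ranges:
    print(f"{low} - {high}: covers {count} range(s)")
```

Because chains of overlapping intervals collapse into one merged range, a dense dataset may end up with a few very wide master ranges; requiring a minimum overlap before merging would be one way to keep them narrower.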

Python - plot numpy array with gaps in the data

I need to plot some spectral data as a 2D image, where each data point corresponds to a spectrum with a specific date/time. I need to plot all spectra as follows:
- x-axis - corresponds to the wavelength
- y-axis - corresponds to the date/time
- intensity - corresponds to the flux
If my data points were contiguous in time I would just use matplotlib's imshow. However, not only are the points not all contiguous in time, but there are also large time gaps between them.
Here is some simulated data that mimics what I have:
import numpy as np
import matplotlib.pyplot as mplt
sampleSize = 100
data = {}
# each key is an observation time, each value is one simulated spectrum
for time in np.arange(0, 5):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(14, 20):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(30, 40):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(25.5, 35.5):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(80, 120):
    data[time] = np.random.sample(sampleSize)
If I needed to plot only one of the subsets of data above, I would do:
mplt.imshow([data[time] for time in np.arange(0,5)], cmap ='Greys',aspect='auto',origin='lower',interpolation="none",extent=[-50,50,0,5])
mplt.show()
However, I have no idea how I can plot all the data in the same figure while showing the gaps and keeping the y-axis as the time. Any ideas?
thanks,
Jorge

Or you can use pandas to help you with sorting the keys, then reindex:
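import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(data).T
# Regular integer time grid: missing times become NaN rows; non-integer times such as 25.5 are dropped
full_index = np.arange(df.index.max() + 1)
plt.imshow(df.reindex(full_index),
           cmap='Greys',
           aspect='auto',
           origin='lower',
           interpolation="none",
           extent=[-50, 50, 0, df.index.max() + 1])
plt.show()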

In the end I used a different approach:
1) re-index the times in my data so that no two arrays have the same time and I avoid non-integer indexes
nTimes = 1
timeIndexes = [int(float(index)) for index in data.keys()]
# increase the multiplier until every time maps to a distinct integer
while len(timeIndexes) != len(set(timeIndexes)):
    nTimes += 1
    timeIndexes = [int(nTimes*float(index)) for index in data.keys()]
timeIndexesDict = {str(int(nTimes*float(index))): data[index] for index in data.keys()}
lenData2Plot = max([int(key) for key in timeIndexesDict.keys()])
2) create an array of zeros with the same number of columns as my data and a number of rows corresponding to my maximum re-indexed time
data2Plot = np.zeros((int(lenData2Plot)+1, sampleSize))
3) replace the rows in my array of zeros corresponding to my re-indexed times
for index in timeIndexesDict.keys():
    data2Plot[int(index)][:] = timeIndexesDict[str(index)]
4) plot as I normally would plot an array with no gaps
mplt.imshow(data2Plot,
cmap='Greys',aspect='auto',origin='lower',interpolation="none",
extent=[-50,50,0,120])
mplt.show()
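One limitation of filling the gaps with zeros is that a gap looks the same as a spectrum with zero flux. A possible variation, sketched below under the assumption that all times fall on a regular grid with a known step, fills unobserved rows with NaN instead. The helper to_regular_grid, the step value and the four sample times are hypothetical.

```python
import numpy as np

def to_regular_grid(spectra, step):
    """Place each spectrum on a regular time grid; unobserved times stay NaN."""
    times = np.array(sorted(spectra))
    t_min, t_max = times.min(), times.max()
    n_rows = int(round((t_max - t_min) / step)) + 1
    n_cols = len(next(iter(spectra.values())))
    grid = np.full((n_rows, n_cols), np.nan)
    for t, values in spectra.items():
        # Row index of this observation time on the regular grid.
        row = int(round((t - t_min) / step))
        grid[row] = values
    return grid, t_min, t_max

rng = np.random.default_rng(0)
# Hypothetical observation times, including a non-integer one and a large gap.
spectra = {t: rng.random(4) for t in [0, 1, 2.5, 6]}
grid, t_min, t_max = to_regular_grid(spectra, step=0.5)
print(grid.shape)
```

The resulting grid can be passed to imshow in the same way as data2Plot, with the time limits of extent taken from t_min and t_max; with matplotlib's default colormap settings the NaN rows should be left blank rather than drawn as the lowest intensity.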

plotting multiple columns value in x-axis in python

I have a dataframe of size (100, 3) that is filled with random float values.
Here is a sample of what the data frame looks like
A B C
4.394966 0.580573 2.293824
3.136197 2.227557 1.306508
4.010782 0.062342 3.629226
2.687100 1.050942 3.143727
1.280550 3.328417 2.247764
4.417837 3.236766 2.970697
1.036879 1.477697 4.029579
2.759076 4.753388 3.222587
1.989020 4.161404 1.073335
1.054660 1.427896 2.066219
0.301078 2.763342 4.166691
2.323838 0.791260 0.050898
3.544557 3.715050 4.196454
0.128322 3.803740 2.117179
0.549832 1.597547 4.288621
This is how I created it
df = pd.DataFrame(np.random.uniform(0,5,size=(100, 3)), columns=list('ABC'))
Note: pd is pandas
I want to plot a bar chart with three segments on the x-axis, where each segment has 2 bars: one showing the number of values less than 2 and the other the number of values greater than or equal to 2.
So on the x-axis there would be two adjacent bars for column A, one with the total number of values less than 2 and one with the number greater than or equal to 2, and the same for B and C.
Can anyone suggest anything?
I was thinking of using seaborn and setting the hue value to differentiate the two classes (less than 2, and greater than or equal to 2), but the hue attribute only works for categorical values and I can only set one column in the x-axis attribute.
Any tips would be appreciated.

You can filter the values, count them, and then use plot(kind='bar'):
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(0,5,size=(100, 3)), columns=list('ABC'))
# count, per column, the values below 2 and the values greater than or equal to 2
dfout = pd.DataFrame({'minor' : df[df < 2].count(),
                      'major' : df[df >= 2].count() })
dfout.plot(kind='bar')
plt.show()
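An equivalent way to get the two counts per column is to sum boolean comparisons, since True counts as 1. This is only a sketch; the seeded generator, the threshold variable and the column labels of counts are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 5, size=(100, 3)), columns=list("ABC"))
threshold = 2

# One row per original column (A, B, C), one column per class.
counts = pd.DataFrame({
    f"< {threshold}": (df < threshold).sum(),
    f">= {threshold}": (df >= threshold).sum(),
})
print(counts)
```

Calling counts.plot(kind='bar') then draws one group per original column with the two bars side by side.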
