Pandas Slice Columns and select subsets based on between condition - python

I have a dataframe as follows:
                     100  105  110
timestamp
2020-11-01 12:00:00  0.2  0.5  0.1
2020-11-01 12:01:00  0.3  0.8  0.2
2020-11-01 12:02:00  0.8  0.9  0.4
2020-11-01 12:03:00  1.0  0.0  0.4
2020-11-01 12:04:00  0.0  1.0  0.5
2020-11-01 12:05:00  0.5  1.0  0.2
I want to select the columns of the dataframe where the values are greater than or equal to 0.5 and less than or equal to 1, and I want the index/timestamp at which these occurrences happened. Each column could have multiple such occurrences. So column 100 can be between 0.5 and 1 from 12:00 to 12:03 and then again from 12:20 to 12:30. It needs to reset when it hits 0. The column names are variable.
I also want the time span during which the column value was between 0.5 and 1; from the above, that was 3 minutes and then 10 minutes.
The expected output would be the masked frame below, plus a dict of the ranges in which the indexes appeared:
                     100  105  110
timestamp
2020-11-01 12:00:00  NaN  0.5  NaN
2020-11-01 12:01:00  NaN  0.8  NaN
2020-11-01 12:02:00  0.8  0.9  NaN
2020-11-01 12:03:00  1.0  NaN  NaN
2020-11-01 12:04:00  NaN  1.0  0.5
2020-11-01 12:05:00  0.5  1.0  NaN
and probably a way to calculate the minutes, which could be in a dict / list of dicts:
{"105": [{"from": "2020-11-01 12:00:00", "to": "2020-11-01 12:02:00"},
         {"from": "2020-11-01 12:04:00", "to": "2020-11-01 12:05:00"}],
 ...
}
Essentially, the dicts at the end are what I want to evaluate.

Basically, it would be best if you got the ordered sequence of timestamps first; then you can manipulate it to get the differences. If the question is only about Pandas slicing and not about the timestamp arithmetic, the operation for a single column is the following (note that the timestamps are the index, not a column):
df[(df["100"] >= 0.5) & (df["100"] <= 1)].index
Pandas data frame comparison operations
For Pandas data frames, the normal comparison operations are overridden. If you do dataframe_instance >= 0.5, the result is a sequence of boolean values. An individual value in the sequence results from comparing an individual data frame value to 0.5.
Pandas data frame slicing
This sequence can be used to filter a subsequence from your data frame. It is possible because Pandas slicing is also overridden and implemented as a filtering operation over the boolean mask.
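Putting both pieces together, here is a minimal sketch of one way to get the masked frame and the per-column from/to ranges the question asks for. It rebuilds the sample data from the question (with string column labels; adjust if yours are ints), and the names masked, run_id and ranges are purely illustrative:
import pandas as pd

# rebuild the sample frame from the question
idx = pd.date_range("2020-11-01 12:00", periods=6, freq="min").rename("timestamp")
df = pd.DataFrame({"100": [0.2, 0.3, 0.8, 1.0, 0.0, 0.5],
                   "105": [0.5, 0.8, 0.9, 0.0, 1.0, 1.0],
                   "110": [0.1, 0.2, 0.4, 0.4, 0.5, 0.2]}, index=idx)

# keep only values in [0.5, 1]; everything else becomes NaN
masked = df.where((df >= 0.5) & (df <= 1))

# for every column, collect the from/to timestamps of each consecutive run
ranges = {}
for col in masked.columns:
    in_range = masked[col].notna()
    # a new run starts whenever the in-range flag flips
    run_id = (in_range != in_range.shift()).cumsum()
    runs = []
    for _, block in masked.loc[in_range, col].groupby(run_id[in_range]):
        start, end = block.index[0], block.index[-1]
        runs.append({"from": start, "to": end,
                     "minutes": (end - start).total_seconds() / 60})
    ranges[col] = runs

print(masked)
print(ranges["105"])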

Related

Filling dataframe with average of previous columns values

I have a dataframe with 5 columns that have missing values.
How do I fill the missing values by taking the average of the previous two columns' values?
Here is the sample code for the same.
import pandas as pd

coh0 = [0.5, 0.3, 0.1, 0.2, 0.2]
coh1 = [0.4, 0.3, 0.6, 0.5]
coh2 = [0.2, 0.2, 0.3]
coh3 = [0.8, 0.8]
coh4 = [0.5]
df = pd.DataFrame({'coh0': pd.Series(coh0), 'coh1': pd.Series(coh1),
                   'coh2': pd.Series(coh2), 'coh3': pd.Series(coh3),
                   'coh4': pd.Series(coh4)})
df
Here is the sample output
   coh0  coh1  coh2  coh3  coh4
0   0.5   0.4   0.2   0.8   0.5
1   0.3   0.3   0.2   0.8   NaN
2   0.1   0.6   0.3   NaN   NaN
3   0.2   0.5   NaN   NaN   NaN
4   0.2   NaN   NaN   NaN   NaN
Here is the desired result I am looking for.
The NaN value in each column should be replaced by the average of the previous two columns' values at the same position. However, for the first NaN value in the second column, it will take the last value of the first column as the default.
The sample desired output would be like below.
For the exception you named, the first NaN, you can do
df.iloc[1, -1] = df.iloc[0, -1]
though it doesn't make a difference in this case, as the mean of 0.2 and 0.8 is 0.5 anyway.
Either way, the rest is something like a rolling window calculation, except it has to be computed incrementally. Normally, you want to vectorize your operations and avoid iterating over the dataframe, but IMHO this is one of the rarer cases where it's actually appropriate to loop over the columns (cf. this excellent post), i.e.,
compute the row-wise (axis=1) mean of up to two columns left of the current one (df.iloc[:, max(0, i-2):i]),
and fill its NaN values from the resulting series.
for i in range(1, df.shape[1]):
    mean_df = df.iloc[:, max(0, i-2):i].mean(axis=1)
    df.iloc[:, i] = df.iloc[:, i].fillna(mean_df)
which results in
coh0 coh1 coh2 coh3 coh4
0 0.5 0.4 0.20 0.800 0.5000
1 0.3 0.3 0.20 0.800 0.5000
2 0.1 0.6 0.30 0.450 0.3750
3 0.2 0.5 0.35 0.425 0.3875
4 0.2 0.2 0.20 0.200 0.2000

calculate cosine similarity for all columns in a group by in a dataframe

I have a dataframe df where the APerc columns range from 0 to 60:
ID FID APerc0 ... APerc60
0 X 0.2 ... 0.5
1 Z 0.1 ... 0.3
2 Y 0.4 ... 0.9
3 X 0.2 ... 0.3
4 Z 0.9 ... 0.1
5 Z 0.1 ... 0.2
6 Y 0.8 ... 0.3
7 W 0.5 ... 0.4
8 X 0.6 ... 0.3
I want to calculate the cosine similarity of the values for all APerc columns between each row. So the result for the above should be:
ID CosSim
1 0,2,4 0.997
2 1,8,7 0.514
1 3,5,6 0.925
I know how to generate cosine similarity for the whole df:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)
But I want to find the similarity between each ID and group them together (or create a separate df). How can I do it fast for a big dataset?
One possible solution could be to get the particular rows you want to use for the cosine similarity computation and do the following (this uses PyTorch's nn.CosineSimilarity).
Here, combinations is basically the list of row-index pairs you want to consider for the computation.
import torch
import torch.nn as nn

cos = nn.CosineSimilarity(dim=0)
for i in range(len(combinations)):
    # take only the APerc columns (positions 2 onwards) and convert to tensors
    row1 = torch.tensor(df.iloc[combinations[i][0], 2:].to_numpy(dtype=float))
    row2 = torch.tensor(df.iloc[combinations[i][1], 2:].to_numpy(dtype=float))
    sim = cos(row1, row2)
    print(sim)
You can then use the result in whatever way you want.
Alternatively, create a function for the calculation and pass it to df.apply(cosine_similarity_function); it is said that using apply can perform hundreds of times faster than going row by row.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
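For the grouped version specifically, here is a minimal sketch under the assumption that the grouping key is the FID column and that the APerc columns are everything from the third column onwards, as in the loop above; the names group_sims and mean_cos_sim are made up for illustration:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

group_sims = {}
for fid, group in df.groupby("FID"):
    values = group.iloc[:, 2:].to_numpy(dtype=float)   # only the APerc columns
    sims = cosine_similarity(values)                    # pairwise matrix within this group
    mask = ~np.eye(len(sims), dtype=bool)               # drop the diagonal (self-similarity)
    group_sims[fid] = {"rows": list(group.index),
                       "mean_cos_sim": sims[mask].mean() if len(sims) > 1 else 1.0}

result = pd.DataFrame(group_sims).T
print(result)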

Only want to consider a dataframe up to the present point

I have a dataframe and I am trying to do something along the lines of
df['foo'] = np.where(myfunc(df) == 1, 10, 20)
but I only want to consider the dataframe up to the present, for example if my dataframe looked like
A B C
1 0.3 0.3 1.6
2 0.6 0.6 0.4
3 0.9 0.9 1.2
4 1.2 1.2 0.8
and I was generating the value of 'foo' for the third row, I would be looking at the dataframe's first through third rows, but not the fourth row. Is it possible to accomplish this?
It is certainly possible. The dataframe up to the present is given by
df.iloc[:present],
and you can do whatever you want with it, in particular, use where, as described here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html
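As a minimal sketch of that idea on the sample frame from the question: myfunc below is only a made-up placeholder condition (the real one is whatever you already have), and the list comprehension simply re-evaluates it on df.iloc[:present] for each row:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.3, 0.6, 0.9, 1.2],
                   "B": [0.3, 0.6, 0.9, 1.2],
                   "C": [1.6, 0.4, 1.2, 0.8]}, index=[1, 2, 3, 4])

def myfunc(partial):
    # placeholder: 1 if the running mean of A exceeds the most recent C
    return 1 if partial["A"].mean() > partial["C"].iloc[-1] else 0

# for each row, only the frame up to and including that row is visible
df["foo"] = [np.where(myfunc(df.iloc[:i + 1]) == 1, 10, 20).item()
             for i in range(len(df))]
print(df)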

finding duplicate rows in pandas based on approximate match or formula

I have a pandas data frame
import pandas as pd
df = pd.DataFrame({"x" : [1.,1.,2.,3.,3.01,4.,5.],"y":[10.,11.,12.,12.95,13.0,11.,10.],
"name":["0ndx","1ndx","2ndx","3ndx","4ndx","5ndx","6ndx"]})
print(df.duplicated(subset=["x","y"]))
x y name
0 1.00 10.00 0ndx
1 1.00 11.00 1ndx
2 2.00 12.00 2ndx
3 3.00 12.95 3ndx
4 3.01 13.00 4ndx
5 4.00 11.00 5ndx
6 5.00 10.00 6ndx
I would like to find duplicate rows (in this case rows 3 and 4) using a formula based on distance, with a tolerance of say 0.1. A row would be duplicated if it is within a distance of 0.1 of another row (or, equivalently, if both x and y are within a tolerance). As one commenter pointed out, this could lead to a cluster of values with more than 0.1 of spread, as 1.1 is close to 1.18 is close to 1.22. This might affect some of the things you can do, but I would still define any row that is within the tolerance of another as duplicated.
This is a toy problem; I have a modest-size problem now, but I foresee problems large enough (250,000 rows) that the outer product might be expensive to construct.
Is there a way to do this?
You can compare with pandas shift: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html.
Then, if you want to compare each row to the previous one and add a column marking where they are within some threshold of each other, say 0.1, it would follow:
eps = 0.1
df['duplicated'] = 0
df.sort_values(by=['x'],inplace=True)
df.loc[abs(df['x'] - df['x'].shift()) <= eps,'duplicated'] = 1
Then rows with a 1 would be those that are duplicated within your threshold.
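The snippet above only compares the x column, and only against the immediately preceding row after sorting. A hedged extension of the same shift idea that also requires y to be within the tolerance (still adjacent rows only, so it can miss near-duplicates that land far apart after sorting) could look like this:
eps = 0.1
df = df.sort_values(by=["x"])

# adjacent rows count as duplicates only when both coordinates are within eps
close_x = (df["x"] - df["x"].shift()).abs() <= eps
close_y = (df["y"] - df["y"].shift()).abs() <= eps
df["duplicated"] = (close_x & close_y).astype(int)
print(df)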

How to count instances following a condition in a dataframe

I have a dataset containing 18 unique IDs, each having one column of interest in which I want to count instances where the values are greater than or less than 0.25.
For those that are greater than 0.25, I want to subtract a value from them and then graph the resulting values in a column scatter plot. How would I go about counting those instances using pandas, and extracting the >0.25 values so they are available to put into the scatter plot?
Demo data
data = pd.DataFrame({"num":[0.1, 0.3, 0.1, 0.4]})
print(data)
num
0 0.1
1 0.3
2 0.1
3 0.4
Filter the values that are greater than 0.25:
greater_than = data[data.num > 0.25]
print(greater_than)
num
1 0.3
3 0.4
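To cover the counting, subtracting and plotting parts as well, here is a small sketch; the 0.25 offset is only an assumed example value, since the question does not say what should be subtracted:
import matplotlib.pyplot as plt

count = (data.num > 0.25).sum()      # how many values exceed 0.25
print(count)                         # 2 for the demo data

offset = 0.25                        # assumed value to subtract
adjusted = data.loc[data.num > 0.25, "num"] - offset

plt.scatter(adjusted.index, adjusted)
plt.xlabel("row")
plt.ylabel("num - offset")
plt.show()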
