I have a dataset containing 18 unique IDs, each with one column of interest for which I want to count the instances where its values are greater than or less than 0.25.
For the values greater than 0.25, I want to subtract a value from them and then graph the results in a column scatter plot. How would I go about counting those instances using pandas, and how can I extract the >0.25 values so they are available to put into the scatter plot?
Demo data
import pandas as pd

data = pd.DataFrame({"num": [0.1, 0.3, 0.1, 0.4]})
print(data)
num
0 0.1
1 0.3
2 0.1
3 0.4
Filter the values that are greater than 0.25:
greater_than = data[data.num > 0.25]
print(greater_than)
num
1 0.3
3 0.4
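A rough sketch of the counting and of extracting the >0.25 values for the plot, assuming a frame with an "id" and a "value" column (both names, and the subtracted offset, are made up for illustration):

import matplotlib.pyplot as plt
import pandas as pd

# hypothetical frame: one row per measurement; "id" and "value" are assumed column names
df = pd.DataFrame({
    "id":    ["A", "A", "B", "B", "C", "C"],
    "value": [0.1, 0.3, 0.4, 0.2, 0.5, 0.1],
})

above = df["value"] > 0.25
print("above 0.25:", above.sum(), "| below 0.25:", (df["value"] < 0.25).sum())
print(df.groupby("id")["value"].apply(lambda s: (s > 0.25).sum()))  # per-ID counts

# keep only the >0.25 rows and subtract an (assumed) offset before plotting
offset = 0.25
to_plot = df.loc[above].assign(adjusted=df["value"] - offset)
plt.scatter(to_plot["id"], to_plot["adjusted"])
plt.show()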
Related
I have a dataframe df, where the APerc columns range from APerc0 to APerc60:
ID FID APerc0 ... APerc60
0 X 0.2 ... 0.5
1 Z 0.1 ... 0.3
2 Y 0.4 ... 0.9
3 X 0.2 ... 0.3
4 Z 0.9 ... 0.1
5 Z 0.1 ... 0.2
6 Y 0.8 ... 0.3
7 W 0.5 ... 0.4
8 X 0.6 ... 0.3
I want to calculate the cosine similarity over all APerc columns between rows. So the result for the above should be:
ID CosSim
1 0,2,4 0.997
2 1,8,7 0.514
1 3,5,6 0.925
I know how to generate cosine similarity for the whole df:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)
But I want to find the similarity within each ID group and group the results together (or create separate dataframes). How can I do this fast for a big dataset?
One possible solution could be to get the particular rows you want to use for the cosine similarity computation and do the following (this uses PyTorch's nn.CosineSimilarity).
Here, combinations is the list of row-index pairs you want to consider for the computation.
import torch
import torch.nn as nn

cos = nn.CosineSimilarity(dim=0)
for i in range(len(combinations)):
    # select the APerc0..APerc60 columns of each row and convert them to tensors
    row1 = torch.tensor(df.loc[combinations[i][0], 'APerc0':'APerc60'].to_numpy(dtype=float))
    row2 = torch.tensor(df.loc[combinations[i][1], 'APerc0':'APerc60'].to_numpy(dtype=float))
    sim = cos(row1, row2)
    print(sim)
You can then use the result in whatever way you want.
Alternatively, create a function for the calculation and use df.apply(cosine_similarity_function); it is said that applying a function this way can perform hundreds of times faster than going row by row.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
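If the goal is the per-ID grouping from the question, one possible sketch (assuming the dataframe has an FID column plus APerc0 ... APerc60 value columns, as shown above) computes one pairwise similarity matrix per FID group with sklearn's cosine_similarity:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# assumed: df has an FID column plus numeric APerc0 ... APerc60 columns
aperc_cols = [c for c in df.columns if c.startswith("APerc")]

# one pairwise cosine-similarity matrix per FID group, keyed by FID
per_id_sims = {
    fid: pd.DataFrame(cosine_similarity(group[aperc_cols]),
                      index=group.index, columns=group.index)
    for fid, group in df.groupby("FID")
}

for fid, sims in per_id_sims.items():
    print(fid)
    print(sims)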
I'm trying to sample a DataFrame based on a given Minimum Sample Interval on the "timestamp" column. Each extracted row should be the first one whose timestamp is at least the Minimum Sample Interval larger than the timestamp of the last extracted row. So, for the table given below and Minimum Sample Interval = 0.2:
A timestamp
1 0.000000 0.1
2 3.162278 0.15
3 7.211103 0.45
4 7.071068 0.55
Here, we would extract these indexes:
1, since there is no last value yet
Not 2, because it is only 0.05 larger than the last extracted value
3, because it is 0.35 larger than the last extracted value
Not 4, because it is only 0.1 larger than the last extracted value.
I've found a way to do this with iterrows, but I would like to avoid iterating over it if possible.
The closest I can think of is integer-dividing the timestamp column with floordiv, using the interval as the divisor, and finding the rows where that quotient changes. But for a case like [0.01, 0.21, 0.55, 0.61, 0.75, 0.41], I would select 0.61 instead of 0.75, even though 0.61 is only 0.06 larger than 0.55 rather than at least 0.2.
You can use pandas.Series.diff to compute the difference between each value and the previous one:
sample = df[df['timestamp'].diff().fillna(1) > 0.2]
Output:
>>> sample
A timestamp
1 0.000000 0.10
3 7.211103 0.45
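Note that diff() compares each row with the row immediately before it, not with the last row that was kept, so it can over-select in cases like the [0.01, 0.21, 0.55, 0.61, 0.75, 0.41] example mentioned in the question. If the comparison really has to be against the last extracted value, a plain loop is one way to express that rule; this is just a sketch, not a vectorized solution:

import pandas as pd

def sample_by_interval(df, col="timestamp", min_interval=0.2):
    """Keep a row only if its timestamp is at least min_interval
    larger than the timestamp of the last row that was kept."""
    kept = []
    last = None
    for idx, ts in df[col].items():
        if last is None or ts - last >= min_interval:
            kept.append(idx)
            last = ts
    return df.loc[kept]

sample = sample_by_interval(df, min_interval=0.2)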
I have a dataframe as follows:
                     100  105  110
timestamp
2020-11-01 12:00:00  0.2  0.5  0.1
2020-11-01 12:01:00  0.3  0.8  0.2
2020-11-01 12:02:00  0.8  0.9  0.4
2020-11-01 12:03:00  1    0    0.4
2020-11-01 12:04:00  0    1    0.5
2020-11-01 12:05:00  0.5  1    0.2
I want to select the columns of the dataframe where the values are greater than or equal to 0.5 and less than or equal to 1, and I want the index/timestamp at which these occurrences happened. Each column could have multiple such occurrences. So, 100 can be between 0.5 and 1 from 12:00 to 12:03 and then again from 12:20 to 12:30. It needs to reset when it hits 0. The column names are variable.
I also want the time difference in which the column value was between 0.5 and 1, so from the above it was 3 minutes, and 10 minutes.
The expected output would be the masked dataframe below, along with a dict for the ranges in which the indexes appeared:
                     100  105  110
timestamp
2020-11-01 12:00:00  NaN  0.5  NaN
2020-11-01 12:01:00  NaN  0.8  NaN
2020-11-01 12:02:00  0.8  0.9  NaN
2020-11-01 12:03:00  1    NaN  NaN
2020-11-01 12:04:00  NaN  1    0.5
2020-11-01 12:05:00  0.5  1    NaN
and probably a way to calculate the minutes which could be in a dict/list of dicts:
["105":
[{"from": "2020-11-0112:00:00", "to":"2020-11-0112:02:00"},
{"from": "2020-11-0112:04:00", "to":"2020-11-0112:05:00"}]
...
]
Essentially, what I want to evaluate at the end are these dicts.
Basically, it would be best if you got the ordered sequence of timestamps; then, you can manipulate it to get the differences. If the question is only about Pandas slicing and not about timestamp operations, then you need to do the following operation:
df[df["100"] >= 0.5][df["100"] <= 1]["timestamp"].values
Pandas data frame comparison operations
For Pandas data frames, the normal comparison operations are overridden. If you do dataframe_instance >= 0.5, the result is a sequence of boolean values, where each individual value comes from comparing an individual data frame value to 0.5.
Pandas data frame slicing
This sequence can be used to filter a subsequence from your data frame. It is possible because Pandas slicing is overridden and implemented as a filtering operation.
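As a rough sketch of the masked output and the from/to ranges asked for above (assuming df holds only the numeric columns, with the timestamp as the index), you could mask out-of-range values with where and then collect the contiguous in-range runs per column:

import pandas as pd

# NaN marks "out of range", matching the expected output above
masked = df.where((df >= 0.5) & (df <= 1))

ranges = {}
for col in df.columns:
    in_range = masked[col].notna()
    # a new block starts whenever the in-range flag flips
    block_id = (in_range != in_range.shift()).cumsum()
    runs = []
    for _, block in in_range.groupby(block_id):
        if block.iloc[0]:  # keep only the in-range blocks
            runs.append({"from": block.index[0], "to": block.index[-1]})
    ranges[col] = runs

print(masked)
print(ranges)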
Is there any way to select values within 5 given ranges for a column and, for each dataframe, add the corresponding label in a new column?
I have a list of dataframes. All dataframes have 2 columns and share the same first column, but differ in the second (header and values). For example:
>> df1
GeneID A
1 0.3
2 0.0
3 143
4 9
5 0.6
>> df2
GeneID B
1 0.2
2 0.3
3 0.1
4 0.7
5 0.4
....
I would like to:
For each dataframe in the list, perform a calculation which gives the probability of each value occurring within one of the 5 ranges, and append a new column with those probabilities;
For each dataframe in the list, attach the respective range label in another new column.
Where the ranges are:
*Range_Values* -> *Range_Label*
**[0]** -> 'l1'
**]0,1]** -> 'l2'
**]1,10]** -> 'l3'
**]10,100]** -> 'l4'
**>100** -> 'l5'
This two-step approach would lead to something like:
>> list_dfs[df1]
GeneID A Prob_val Exp_prof
1 0.3 0.4 'l2'
2 0.0 0.2 'l1'
3 143 0.2 'l5'
4 9 0.2 'l3'
5 0.6 0.4 'l2'
You first have to define the bins and labels. pd.cut() needs one more bin edge than labels, so add -inf as the lowest edge to keep the singleton [0] in its own bin (assuming no negative values); the question's intervals are right-closed, so keep the default right=True -
bins = [float("-inf"), 0, 1, 10, 100, float("inf")]
labels = ['l1', 'l2', 'l3', 'l4', 'l5']
Then use pd.cut() -
pd.cut(df1['A'], bins)
There is a labels parameter in pd.cut() that you can use to get the labels -
pd.cut(df1['A'], bins, labels=labels)
You can use the generated bins to compute the probabilities; I leave that up to you.
You can do this for the rest of the dfs in a loop and finally assign them to a list -
list_dfs = [df1, df2, ...]
If you have a dynamic number of dfs, use a loop -
Framework
for df in dfs:
    df['bins'] = pd.cut(df['A'], bins)
    df['label'] = pd.cut(df['A'], bins, labels=labels)
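A minimal sketch of that loop, assuming list_dfs already holds the dataframes, each with GeneID as its first column and its value column second, and reusing the bins and labels defined above; the probability column is derived from the label frequencies, mirroring the approach in the answer below:

import pandas as pd

bins = [float("-inf"), 0, 1, 10, 100, float("inf")]
labels = ['l1', 'l2', 'l3', 'l4', 'l5']

for df in list_dfs:
    value_col = df.columns[1]  # the per-dataframe value column (A, B, ...)
    df['Exp_prof'] = pd.cut(df[value_col], bins, labels=labels)
    # empirical probability of each label within this dataframe
    probs = (df['Exp_prof'].value_counts() / len(df)).to_dict()
    df['Prob_val'] = df['Exp_prof'].map(probs)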
For the labels and bins, you can use pandas.cut. Note that you can't use a singleton as a bin in this function. Therefore you will have to create it afterwards. Here is how you can do this.
First I recreate one of your dataframes:
import io

import numpy as np
import pandas as pd
temp = u"""
GeneID A
1 0.3
2 0.0
3 143
4 9
5 0.6"""
foo = pd.read_csv(io.StringIO(temp), delim_whitespace=True)
Then I create the new column and fill the NaN values with the label l1 which corresponds to the singleton [0].
foo['Exp_prof'] = pd.cut(foo.A,bins = [0,1,10,100,np.inf],labels = ['l2','l3','l4','l5'])
foo['Exp_prof'] = foo['Exp_prof'].cat.add_categories(['l1'])
foo['Exp_prof'] = foo['Exp_prof'].fillna('l1')
And I use this new column to compute the probabilities:
foo['Prob_val'] = foo.Exp_prof.map((foo.Exp_prof.value_counts()/len(foo)).to_dict())
And the output is:
GeneID A Exp_prof Prob_val
0 1 0.3 l2 0.4
1 2 0.0 l1 0.2
2 3 143.0 l5 0.2
3 4 9.0 l3 0.2
4 5 0.6 l2 0.4
I have a list that I'm adding to a pandas dataframe; it contains a range of decimal values.
I want to divide it into 3 ranges, where each range maps to one label:
sents=[]
for sent in sentis:
if sent > 0:
if sent < 0.40:
sents.append('negative')
if (sent >= 0.40 and sent <= 0.60):
sents.append('neutral')
if sent > 0.60:
sents.append('positive')
My question is whether there is a more efficient way to do this in pandas, as I'm trying to apply it to a much bigger list. Thanks in advance.
You can use pd.cut to produce a categorical result with the appropriate labels.
In order to keep .4 and .6 inside the neutral category, I shift the bin edges by the smallest float epsilon:
import numpy as np
import pandas as pd

sentis = np.linspace(0, 1, 11)
eps = np.finfo(float).eps
pd.DataFrame(dict(
Value=sentis,
Sentiment=pd.cut(
sentis, [-np.inf, .4 - eps, .6 + eps, np.inf],
labels=['negative', 'neutral', 'positive']
),
))
Sentiment Value
0 negative 0.0
1 negative 0.1
2 negative 0.2
3 negative 0.3
4 neutral 0.4
5 neutral 0.5
6 neutral 0.6
7 positive 0.7
8 positive 0.8
9 positive 0.9
10 positive 1.0
List comprehension:
['negative' if x < 0.4 else 'positive' if x > 0.6 else 'neutral' for x in sentis]