I'm implementing a machine learning algorithm and I'm extracting features out of a dataframe. I obviously need overlapping windows. Suppose the DataFrame looks like this
x y z
12.1 11 0.5
12.2 10 0.3
12.4 11 0.5
12.8 12 0.4
13.1 13 0.4
14.7 14 0.5
15.2 14 0.6
15.3 13 0.5
17.3 14 0.5
18.2 15 0.4
16.1 16 0.2
15.0 17 0.1
But in reality it is a lot larger (thousands of samples). I now want a list of DataFrames where each DataFrame has length ws (here 150) and a step (stride) of 60.
This is what I got
r = np.arange(len(df))
s = r[::step]
return [df.iloc[k:k+ws] for k in s]
This works reasonably well, but there's still one problem: the last one, two or three frames might not have length ws. I also cannot just discard the last three, since sometimes only one has a length smaller than ws. The s variable currently keeps all the start indices; I'd need a way to keep only the start indices where start_index + ws <= len(df). Unless, of course, there are better and/or faster ways to do this (maybe a library). All existing documentation only talks about simple arrays.
You might only need to change s:
s = r[:len(df)-ws+1:step]
This way you only get the start indices of frames with the full length ws.
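Putting the fix together, a minimal self-contained sketch (the helper name and toy data are mine, not the asker's):

```python
import numpy as np
import pandas as pd

def sliding_windows(df, ws, step):
    """Split df into overlapping windows of exactly ws rows, advancing by step."""
    starts = np.arange(0, len(df) - ws + 1, step)
    return [df.iloc[k:k + ws] for k in starts]

# toy data: 10 rows, windows of 4 rows with a stride of 3 -> starts 0, 3, 6
df = pd.DataFrame({"x": range(10)})
windows = sliding_windows(df, ws=4, step=3)
```

Every returned frame has the full window length, so the trailing short windows never appear.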
I have a dataframe as follows:
100 105 110
timestamp
2020-11-01 12:00:00 0.2 0.5 0.1
2020-11-01 12:01:00 0.3 0.8 0.2
2020-11-01 12:02:00 0.8 0.9 0.4
2020-11-01 12:03:00 1 0 0.4
2020-11-01 12:04:00 0 1 0.5
2020-11-01 12:05:00 0.5 1 0.2
I want to select values in the dataframe that are greater than or equal to 0.5 and less than or equal to 1, and I want the index/timestamp at which these occurrences happened. Each column can have multiple such occurrences. So, column 100 can be between 0.5 and 1 from 12:00 to 12:03 and then again from 12:20 to 12:30. It needs to reset when it hits 0. The column names are variable.
I also want the time span during which the column value was between 0.5 and 1; from the above, that was 3 minutes and 10 minutes.
The expected output would be the frame below, with a dict for the ranges in which the values appeared:
100 105 110
timestamp
2020-11-01 12:00:00 NaN 0.5 NaN
2020-11-01 12:01:00 NaN 0.8 NaN
2020-11-01 12:02:00 0.8 0.9 NaN
2020-11-01 12:03:00 1 NaN NaN
2020-11-01 12:04:00 NaN 1 0.5
2020-11-01 12:05:00 0.5 1 NaN
and probably a way to calculate the minutes, which could be in a dict of lists of dicts:
{"105":
 [{"from": "2020-11-01 12:00:00", "to": "2020-11-01 12:02:00"},
  {"from": "2020-11-01 12:04:00", "to": "2020-11-01 12:05:00"}],
 ...
}
Essentially, I want to end up with those dicts so I can evaluate them.
Basically, it would be best if you got the ordered sequence of timestamps; then you can manipulate it to get the differences. If the question is only about Pandas slicing and not about timestamp operations, then you need an operation like the following (the timestamp is the index, and the two conditions are combined into one boolean mask):
df[(df["100"] >= 0.5) & (df["100"] <= 1)].index
Pandas data frame comparison operations
For Pandas data frames, the normal comparison operators are overridden. If you do dataframe_instance >= 0.5, the result is a sequence of boolean values; an individual value in the sequence results from comparing an individual data frame value to 0.5.
Pandas data frame slicing
This sequence can be used to filter a subsequence from your data frame. This is possible because Pandas slicing is overridden and implemented as a rich filtering mechanism.
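As a sketch of the mask-based approach (toy data mirroring the question; variable names are mine):

```python
import pandas as pd

idx = pd.to_datetime(["2020-11-01 12:00", "2020-11-01 12:01",
                      "2020-11-01 12:02", "2020-11-01 12:03"])
df = pd.DataFrame({"100": [0.2, 0.3, 0.8, 1.0],
                   "105": [0.5, 0.8, 0.9, 0.0]}, index=idx)

mask = (df >= 0.5) & (df <= 1)                    # elementwise boolean frame
kept = df.where(mask)                             # out-of-range values become NaN
times = {c: list(df.index[mask[c]]) for c in df}  # matching timestamps per column
```

kept reproduces the NaN layout shown in the expected output, and times gives the raw timestamps from which the "from"/"to" ranges can be built.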
I have a pandas data frame
import pandas as pd
df = pd.DataFrame({"x" : [1.,1.,2.,3.,3.01,4.,5.],"y":[10.,11.,12.,12.95,13.0,11.,10.],
"name":["0ndx","1ndx","2ndx","3ndx","4ndx","5ndx","6ndx"]})
print(df)
x y name
0 1.00 10.00 0ndx
1 1.00 11.00 1ndx
2 2.00 12.00 2ndx
3 3.00 12.95 3ndx
4 3.01 13.00 4ndx
5 4.00 11.00 5ndx
6 5.00 10.00 6ndx
I would like to find duplicate rows (in this case rows 3 and 4) using a formula based on distance, with a tolerance of say 0.1. A row would be a duplicate if it is within a distance of 0.1 of another row (or, equivalently, if both x and y are within a tolerance). As one commenter pointed out, this could lead to a cluster of values with more than 0.1 of spread, as 1.1 is close to 1.18 is close to 1.22. This might affect some of the things you can do, but I would still define any row that is within the tolerance of another as duplicated.
This is a toy problem. I have a modest-sized problem now, but foresee problems of large enough size (250,000 rows) that the outer product might be expensive to construct.
Is there a way to do this?
You can compare rows with pandas.DataFrame.shift (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html).
If you want to compare each row to the previous one and mark, in a new column, the rows that are within some threshold of each other (say 0.1), it would follow:
eps = 0.1
df['duplicated'] = 0
df.sort_values(by=['x'],inplace=True)
df.loc[abs(df['x'] - df['x'].shift()) <= eps,'duplicated'] = 1
Rows with a 1 are then those that are duplicated within your threshold.
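For the ~250,000-row case the asker anticipates, a spatial index avoids building the full pairwise distance matrix. This is my own alternative sketch using SciPy's cKDTree, not part of the answer above:

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

df = pd.DataFrame({"x": [1., 1., 2., 3., 3.01, 4., 5.],
                   "y": [10., 11., 12., 12.95, 13.0, 11., 10.]})

tree = cKDTree(df[["x", "y"]].to_numpy())
pairs = tree.query_pairs(r=0.1)        # index pairs within Euclidean distance 0.1
dup = np.zeros(len(df), dtype=bool)
for i, j in pairs:                     # mark both members of every close pair
    dup[i] = dup[j] = True
df["duplicated"] = dup.astype(int)
```

Note that, as the commenter warned, chains of nearby points are all flagged even when the cluster's overall spread exceeds the tolerance.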
This is my first question on Stack Overflow, please let me know how I can help you help me if my question is unclear.
Goal: use Python and Pandas to outer join (or merge) data sets containing different experimental trials, where the "x" axis of each trial is extremely similar but has some deviations. Most importantly, the "x" axis increases, hits a maximum and then decreases, often overlapping with previously collected "x" points.
Problem: When I go to join/merge the datasets on "x", the "x" column is sorted, messing up the order of the collected data and making it impossible to plot it correctly.
Here is a small example of what I am trying to do:
The site wouldn't let me add pictures because I am new, so here is the code to generate these example data sets.
Data Sets :
Import:
import numpy as np
import pandas as pd
import random as rand
Code :
T1 = {'x':np.array([1,1.5,2,2.5,3,3.5,4,5,2,1]),'y':np.array([10000,8500,7400,6450,5670,5100,4600,4500,8400,9000]),'z':np.array(rand.sample(range(0,10000),10))}
T2 = {'x':np.array([1,2,3,4,5,6,7,2,1.5,1]),'y':np.array([10500,7700,5500,4560,4300,3900,3800,5400,8400,8800]),'z':np.array(rand.sample(range(0,10000),10))}
Trial1 = pd.DataFrame(T1)
Trial2 = pd.DataFrame(T2)
Attempt to Merge/Join:
WomboCombo = Trial1.join(Trial2,how='outer',lsuffix=1,rsuffix=2, on='x')
WomboCombo2 = pd.merge(left=Trial1, right= Trial2, how = 'outer', left
Attempt to split into two parts, increasing and decreasing part (manually found row number where data "x" starts decreasing):
Trial1Inc = Trial1[0:8]
Trial2Inc = Trial2[0:7]
Result - Merge works well, join messes with the "x" column, not sure why:
Trial1Inc.merge(Trial2Inc,on='x',how='outer', suffixes=[1,2])
Incrementing section Merge Result
Trial1Inc.join(Trial2Inc,on='x',how='outer', lsuffix=1,rsuffix=2)
Incrementing section Join Result
Hopefully my example is clear: the "x" column in Trial 1 increases until 5, then decreases back towards 0. In Trial 2, I altered the test a bit because I noticed that I needed data at a slightly higher "x" value. Trial 2 increases until 7 and then quickly decreases back towards 0.
My end goal is to plot the average of all y values (where there is overlap between the trials) against the corresponding x values.
If there is overlap I can add error bars. Pandas is almost perfect for what I am trying to do because an Outer join adds null values where there is no overlap and is capable of horizontally concatenating the two trials when there is overlap.
All that's left now is to figure out how to join on the "x" column while maintaining its order of increasing and then decreasing values. The reason it is important to first increase "x" and then decrease it is that, looking at the "y" values, the initial "y" value at a given "x" is greater than the "y" value when "x" is decreasing (e.g. in trial 1, when x=1, y=10000; later in the trial, when we come back to x=1, y=9000). This trend is important. When Pandas sorts the column before merging, instead of a clean curve showing a decrease in "y" as "x" increases and then the reverse, there are vertical downward jumps at any point where the data was joined.
I would really appreciate any help with either:
A) a perfect solution that lets me join on "x" when "x" contains duplicates
B) an efficient way to split the data sets into increasing "x" and decreasing "x" so that I can merge the increasing and decreasing sections of each trial separately and then vertically concat them.
Hopefully I did an okay job explaining the problem I would like to solve. Please let me know if I can clarify anything.
Thanks for the help!
I think @xyzjayne's idea of splitting the dataframe is a great one.
Splitting Trial1 and Trial2:
# index of max x value in Trial2
t2_max_index = Trial2.index[Trial2['x'] == Trial2['x'].max()].tolist()
# split Trial2 by max value
trial2_high = Trial2.loc[:t2_max_index[0]].set_index('x')
trial2_low = Trial2.loc[t2_max_index[0]+1:].set_index('x')
# index of max x value in Trial1
t1_max_index = Trial1.index[Trial1['x'] == Trial1['x'].max()].tolist()
# split Trial1 by max value
trial1_high = Trial1.loc[:t1_max_index[0]].set_index('x')
trial1_low = Trial1.loc[t1_max_index[0]+1:].set_index('x')
Once we split the dataframes, we join the high halves together and the low halves together:
WomboCombo_high = trial1_high.join(trial2_high, how='outer', lsuffix='1', rsuffix='2', on='x').reset_index()
WomboCombo_low = trial1_low.join(trial2_low, how='outer', lsuffix='1', rsuffix='2', on='x').reset_index()
We now combine them together into one dataframe, WomboCombo (using pd.concat, since DataFrame.append is deprecated):
WomboCombo = pd.concat([WomboCombo_high, WomboCombo_low])
OUTPUT:
x y1 z1 y2 z2
0 1.0 10000.0 3425.0 10500.0 3061.0
1 1.5 8500.0 5059.0 NaN NaN
2 2.0 7400.0 2739.0 7700.0 7090.0
3 2.5 6450.0 9912.0 NaN NaN
4 3.0 5670.0 2099.0 5500.0 1140.0
5 3.5 5100.0 9637.0 NaN NaN
6 4.0 4600.0 7581.0 4560.0 9584.0
7 5.0 4500.0 8616.0 4300.0 3940.0
8 6.0 NaN NaN 3900.0 5896.0
9 7.0 NaN NaN 3800.0 6211.0
0 2.0 8400.0 3181.0 5400.0 9529.0
2 1.5 NaN NaN 8400.0 3260.0
1 1.0 9000.0 4280.0 8800.0 8303.0
One possible solution is to give your trial rows specific IDs and then merge on the IDs. That should keep the x values from being sorted.
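A minimal sketch of that idea (the toy frames and the id column name are mine; it merges on position, not on x):

```python
import numpy as np
import pandas as pd

Trial1 = pd.DataFrame({"x": [1, 1.5, 2, 2.5], "y": [10000, 8500, 7400, 6450]})
Trial2 = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10500, 7700, 5500, 4560]})

# positional IDs survive the merge, so x keeps its collection order
Trial1["id"] = np.arange(len(Trial1))
Trial2["id"] = np.arange(len(Trial2))
combo = pd.merge(Trial1, Trial2, on="id", how="outer", suffixes=("1", "2"))
```

Each trial keeps its own x column (x1, x2), so the increasing-then-decreasing shape is preserved side by side.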
Here's what I was trying out, but it doesn't address varying numbers of data points. I like gym-hh's answer, though it's not clear to me that you wanted two columns of y,z pairs. So you could combine his ideas and this code to get what you need.
Trial1['index1'] = Trial1.index
Trial2['index1'] = Trial2.index
WomboCombo = pd.concat([Trial1, Trial2])
WomboCombo.sort_values(by=['index1'],inplace=True)
WomboCombo
Output:
x y z index1
0 1.0 10000 7148 0
0 1.0 10500 2745 0
1 1.5 8500 248 1
1 2.0 7700 9505 1
2 2.0 7400 6380 2
2 3.0 5500 3401 2
3 2.5 6450 6183 3
3 4.0 4560 5281 3
4 3.0 5670 99 4
4 5.0 4300 8864 4
5 3.5 5100 5132 5
5 6.0 3900 7570 5
6 4.0 4600 9951 6
6 7.0 3800 7447 6
7 2.0 5400 3713 7
7 5.0 4500 3863 7
8 1.5 8400 8776 8
8 2.0 8400 1592 8
9 1.0 9000 2167 9
9 1.0 8800 782 9
Let's say I have a 4D array that is
(600, 1, 3, 3)
If you take the first 2 elements they may look like this:
0 1 0
1 1 1
0 1 0
2 2 2
3 3 3
1 1 1
etc
I have a list containing certain weights that I want to substitute for specific values in the array. My intention is to use the index of each list element to match the value in the array. Therefore, this list
[0.1, 1.1, 1.2, 1.3]
when applied against my array would give this result:
0.1 1.1 0.1
1.1 1.1 1.1
0.1 1.1 0.1
1.2 1.2 1.2
1.3 1.3 1.3
1.1 1.1 1.1
etc
This method would have to run through all 600 elements of the array.
I can do this in a clunky way using a for loop with array[array==x] = y or np.place, but I wanted to avoid a loop and use a method that replaces all values at once. Is there such an approach?
Quoting from @Divakar's solution in the comments, which solves the issue in a very efficient manner:
Simply index into the array version: np.asarray(vals)[idx], where vals is the list and idx is the array. Or use np.take(vals, idx) to do the array conversion under the hood.
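A small runnable sketch of that indexing trick, on a toy 2-D slice rather than the full (600, 1, 3, 3) array:

```python
import numpy as np

vals = [0.1, 1.1, 1.2, 1.3]        # vals[i] is the weight for integer code i
idx = np.array([[0, 1, 0],
                [1, 1, 1],
                [0, 1, 0]])        # integer codes, as in the question

out = np.asarray(vals)[idx]        # fancy indexing maps every code at once
out2 = np.take(vals, idx)          # same result, conversion done under the hood
```

Because fancy indexing broadcasts over the whole index array, the same line works unchanged on the full 4-D array of codes.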
I have a dataset containing 15 examples. It has 3 features and a target label. How do I load the values corresponding to the 3 features into an array in Python (Pandas)?
I want to train a decision tree classifier on the dataset. For this, I have to load the examples into arrays such that all the data points are in an array X and the corresponding labels are in another array Y. How should I proceed?
The dataset looks like following:
x1 x2 x3 z
0 5.5 0.5 4.5 2
1 7.4 1.1 3.6 0
2 5.9 0.2 3.4 2
3 9.9 0.1 0.8 0
4 6.9 -0.1 0.6 2
5 6.8 -0.3 5.1 2
6 4.1 0.3 5.1 1
7 1.3 -0.2 1.8 1
8 4.5 0.4 2.0 0
9 0.5 0.0 2.3 1
10 5.9 -0.1 4.4 0
11 9.3 -0.2 3.2 0
12 1.0 0.1 2.8 1
13 0.4 0.1 4.3 1
14 2.7 -0.5 4.2 1
I already have the dataset loaded into a dataframe :
import pandas as pd
df = pd.read_csv(r'C:\Users\Dell\Downloads\dataset.csv')  # raw string, so the backslashes are not treated as escapes
print(df.to_string())
I need to know how to load the values corresponding to the features x1, x2 and x3 into X (as training examples) and the values corresponding to the label z into Y (as the labels for the training examples).
Thanks.
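For the pandas side of the question, the split into X and Y is a plain column selection; a minimal sketch (re-creating a few rows of the dataset inline instead of reading the CSV):

```python
import pandas as pd

df = pd.DataFrame({"x1": [5.5, 7.4, 5.9], "x2": [0.5, 1.1, 0.2],
                   "x3": [4.5, 3.6, 3.4], "z": [2, 0, 2]})

X = df[["x1", "x2", "x3"]].to_numpy()  # feature matrix, shape (n_samples, 3)
Y = df["z"].to_numpy()                 # label vector
```

X and Y in this shape can be passed straight to e.g. scikit-learn's DecisionTreeClassifier.fit(X, Y).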
First, load the data into a data.frame.
Since the original file had very strange formatting, I changed it to a normal .csv to make this example easier to understand.
x1,x2,x3,z
5.5,0.5,4.5,2
7.4,1.1,3.6,0
5.9,0.2,3.4,2
9.9,0.1,0.8,0
6.9,-0.1,0.6,2
6.8,-0.3,5.1,2
4.1,0.3,5.1,1
1.3,-0.2,1.8,1
4.5,0.4,2.0,0
0.5,0.0,2.3,1
5.9,-0.1,4.4,0
9.3,-0.2,3.2,0
1.0,0.1,2.8,1
0.4,0.1,4.3,1
2.7,-0.5,4.2,1
If you have the data in a data.frame, half of the work is already done. Here is an example using the "caret" package with a linear regression model.
library("caret")
my.dataframe <- read.csv("myExample.csv", header = T, sep =",")
fit <- train(z ~ ., data = my.dataframe, method = "lm")
fit
Basically, you just have to replace the "lm" in method to train all kinds of other models.
Here is a list where you can choose from:
http://topepo.github.io/caret/available-models.html
For training a random forest model you would type
library("caret")
my.dataframe <- read.csv("myExample.csv", header = T, sep =",")
fit <- train(z ~ ., data = my.dataframe, method = "rf")
fit
But also be careful: you have very limited data, and not every model makes sense for just 15 data points.
A random forest model, for example, will give you this warning:
45: In randomForest.default(x, y, mtry = param$mtry, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?