How to sample a Pandas DataFrame with a minimum distance between samples - python

I'm trying to sample a DataFrame on its "timestamp" column, given a Minimum Sample Interval. Each extracted value should be the first value that is at least Minimum Sample Interval larger than the last extracted value. So, for the table given below and Minimum Sample Interval = 0.2:
   A         timestamp
1  0.000000  0.10
2  3.162278  0.15
3  7.211103  0.45
4  7.071068  0.55
Here, we would extract indexes:
Index 1: there is no last value yet, so why not.
Not index 2: it is only 0.05 larger than the last extracted value.
Index 3: it is 0.35 larger than the last extracted value.
Not index 4: it is only 0.10 larger than the last extracted value.
I've found a way to do this with iterrows, but I would like to avoid iterating over it if possible.
The closest I can think of is integer-dividing the timestamp column with floordiv, using the interval as the divisor, and finding the rows where the integer value changes. But for a case like [0.01, 0.21, 0.55, 0.61, 0.75, 0.41], I would be selecting 0.61 instead of 0.75, and 0.61 is only 0.06 larger than 0.55 instead of at least 0.2.

You can use pandas.Series.diff to compute the difference between each value and the previous one:
sample = df[df['timestamp'].diff().fillna(1) > 0.2]
Output:
>>> sample
   A         timestamp
1  0.000000  0.10
3  7.211103  0.45
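Note that diff only compares each row with the row immediately before it, not with the last kept row, so it matches this example but not the general rule from the question (e.g. the [0.01, 0.21, 0.55, 0.61, 0.75, 0.41] case). For the strict "at least the interval above the last extracted value" semantics, a minimal loop sketch could look like this (the helper name here is made up):
import pandas as pd

def sample_min_interval(ts, min_interval=0.2):
    # keep a row when its timestamp is at least `min_interval`
    # above the timestamp of the last *kept* row
    kept, last = [], None
    for idx, t in ts.items():
        if last is None or t - last >= min_interval:
            kept.append(idx)
            last = t
    return pd.Index(kept)

sample = df.loc[sample_min_interval(df['timestamp'])]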

Related

Filling a dataframe with the average of previous columns' values

I have a dataframe with 5 columns that have missing values.
How do I fill the missing values with the average of the previous two columns' values?
Here is the sample code:
import pandas as pd

coh0 = [0.5, 0.3, 0.1, 0.2, 0.2]
coh1 = [0.4, 0.3, 0.6, 0.5]
coh2 = [0.2, 0.2, 0.3]
coh3 = [0.8, 0.8]
coh4 = [0.5]
df = pd.DataFrame({'coh0': pd.Series(coh0), 'coh1': pd.Series(coh1), 'coh2': pd.Series(coh2),
                   'coh3': pd.Series(coh3), 'coh4': pd.Series(coh4)})
df
Here is the sample output
   coh0  coh1  coh2  coh3  coh4
0   0.5   0.4   0.2   0.8   0.5
1   0.3   0.3   0.2   0.8   NaN
2   0.1   0.6   0.3   NaN   NaN
3   0.2   0.5   NaN   NaN   NaN
4   0.2   NaN   NaN   NaN   NaN
Here is the desired result I am looking for: the NaN values in each column should be replaced by the average of the previous two columns' values at the same position. However, the first NaN value (row 1 of the last column) should simply take the value above it.
For the exception you named, the first NaN, you can do:
df.iloc[1, -1] = df.iloc[0, -1]
though it doesn't make a difference in this case, as the mean of 0.2 and 0.8 is 0.5 anyway.
Either way, the rest is something like a rolling-window calculation, except it has to be computed incrementally. Normally you want to vectorize your operations and avoid iterating over the dataframe, but IMHO this is one of the rarer cases where it's actually appropriate to loop over the columns (cf. this excellent post), i.e.:
compute the row-wise (axis=1) mean of up to two columns left of the current one (df.iloc[:, max(0, i-2):i]),
and fill the current column's NaN values from the resulting series.
for i in range(1, df.shape[1]):
    mean_df = df.iloc[:, max(0, i-2):i].mean(axis=1)
    df.iloc[:, i] = df.iloc[:, i].fillna(mean_df)
which results in
   coh0  coh1  coh2   coh3    coh4
0   0.5   0.4  0.20  0.800  0.5000
1   0.3   0.3  0.20  0.800  0.5000
2   0.1   0.6  0.30  0.450  0.3750
3   0.2   0.5  0.35  0.425  0.3875
4   0.2   0.2  0.20  0.200  0.2000

Get mean of numpy array using pandas groupby

I have a DataFrame where one column is a numpy array of numbers. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({
    'id': [1, 1, 2, 2, 3, 3, 3, 4, 4],
    'data': [np.array([0.43, 0.32, 0.19]),
             np.array([0.41, 0.11, 0.21]),
             np.array([0.94, 0.35, 0.14]),
             np.array([0.78, 0.92, 0.45]),
             np.array([0.32, 0.63, 0.48]),
             np.array([0.17, 0.12, 0.15]),
             np.array([0.54, 0.12, 0.16]),
             np.array([0.48, 0.16, 0.19]),
             np.array([0.14, 0.47, 0.01])]
})
I want to groupby the id column and aggregate by taking the element-wise average of the arrays. Splitting the array up first is not feasible, since it is length 300 and I have 200,000+ rows.
When I do df.groupby('id').mean(), I get the error "No numeric types to aggregate". I am able to get an element-wise mean of the lists using df['data'].mean(), so I think there should be a way to do a grouped mean.
To clarify, I want the output to be one array per value of id, where each element is the mean of the values at the corresponding position within each group. In the example, the result should be:
pd.DataFrame.from_dict({
    'id': [1, 2, 3, 4],
    'data': [np.array([0.42, 0.215, 0.2]),
             np.array([0.86, 0.635, 0.29500000000000004]),
             np.array([0.3433333333333333, 0.29, 0.26333333333333336]),
             np.array([0.31, 0.315, 0.1])]
})
Could someone suggest how I might do this? Thanks!
Mean it twice, once at array level and once at group level:
df['data'].map(np.mean).groupby(df['id']).mean().reset_index()
   id      data
0   1  0.278333
1   2  0.596667
2   3  0.298889
3   4  0.241667
Based on comment, you can do:
pd.DataFrame(df['data'].tolist(), index=df['id']).mean(level=0).agg(np.array, 1)
id
1 [0.42, 0.215, 0.2]
2 [0.86, 0.635, 0.29500000000000004]
3 [0.3433333333333333, 0.29, 0.26333333333333336]
4 [0.31, 0.315, 0.1]
dtype: object
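Note that DataFrame.mean(level=...) was deprecated and later removed from pandas, so on current versions the same idea needs an explicit groupby on the index level. A sketch of an equivalent that avoids both mean(level=0) and the row-wise agg:
# same result on modern pandas: group on the index level, then collect rows as arrays
wide = pd.DataFrame(df['data'].tolist(), index=df['id']).groupby(level=0).mean()
pd.Series(list(wide.to_numpy()), index=wide.index)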
Or:
df.groupby("id")['data'].apply(np.mean)
First, splitting up the array is feasible, because your current storage keeps a complex object holding all the values inside the DataFrame. That takes a lot more space than simply storing a flat 2D array:
# Your current memory usage
df.memory_usage(deep=True).sum()
#1352
# Create a new DataFrame (really just overwrite `df`, but kept separate for illustration)
df1 = pd.concat([df['id'], pd.DataFrame(df['data'].tolist())], axis=1)
# id 0 1 2
#0 1 0.43 0.32 0.19
#1 1 0.41 0.11 0.21
#2 2 0.94 0.35 0.14
#3 2 0.78 0.92 0.45
#4 3 0.32 0.63 0.48
#5 3 0.17 0.12 0.15
#6 3 0.54 0.12 0.16
#7 4 0.48 0.16 0.19
#8 4 0.14 0.47 0.01
Yes, this looks bigger, but in terms of memory it's actually smaller. The 3x factor here is a bit extreme; for larger DataFrames with long arrays, the flat version will probably be around 95% of the original's memory, but it will still be less.
df1.memory_usage(deep=True).sum()
#416
And now your aggregation is a normal groupby + mean, where the columns give the position in the array:
df1.groupby('id').mean()
# 0 1 2
#id
#1 0.420000 0.215 0.200000
#2 0.860000 0.635 0.295000
#3 0.343333 0.290 0.263333
#4 0.310000 0.315 0.100000
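If you need one array per id again after the wide aggregation, you can collapse the result back (a sketch, using the same row-as-array idea as the earlier answer):
# collapse the wide per-id means back into one array per id
wide = df1.groupby('id').mean()
pd.Series(list(wide.to_numpy()), index=wide.index)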
To group by mean where the output is an array of element-wise mean values:
df['data'].map(np.array).groupby(df['id']).mean().reset_index()
Output:
   id                                              data
0   1                                [0.42, 0.215, 0.2]
1   2                [0.86, 0.635, 0.29500000000000004]
2   3  [0.3433333333333333, 0.29, 0.26333333333333336]
3   4                                [0.31, 0.315, 0.1]
You can always .apply the numpy mean.
df.groupby('id')['data'].apply(np.mean).apply(np.mean)
# returns:
id
1    0.278333
2    0.596667
3    0.298889
4    0.241667
Name: data, dtype: float64

Efficient embedding computations for large DataFrame

Given the DataFrame:
   id         articleno  target
0   1          [607303]  607295
1   1          [607295]  607303
2   2  [243404, 617953]  590448
3   2  [590448, 617953]  243404
For each row, compute the average article embedding by looking up each item of the list in the dictionary:
embeddings = {"607303": np.array([0.19, 0.25, 0.45]),
              "607295": np.array([0.77, 0.76, 0.55]),
              "243404": np.array([0.35, 0.44, 0.32]),
              "617953": np.array([0.23, 0.78, 0.24]),
              "590448": np.array([0.67, 0.12, 0.10])}
So, for example and to clarify, for the third row (index 2), the article embeddings for 243404 and 617953 are [0.35, 0.44, 0.32] and [0.23, 0.78, 0.24], respectively. The average article embedding is computed as the element-wise sum, divided by the number of articles: ([0.35, 0.44, 0.32] + [0.23, 0.78, 0.24]) / 2 = [0.29, 0.61, 0.28].
Expected output:
   id  dim1  dim2  dim3  target
0   1  0.19  0.25  0.45  607295
1   1  0.77  0.76  0.55  607303
2   2  0.29  0.61  0.28  590448
3   2  0.45  0.45  0.17  243404
In reality, my DataFrame has millions of rows, and the lists in articleno can contain many more items. Because of this, iterating over the rows might be too slow, and a more efficient solution (perhaps vectorized) could be needed.
Moreover, the number of dimensions (the embedding size) is known beforehand but is a couple of hundred, so the number of columns dim1, dim2, dim3, ..., dimN should be dynamic, based on the dimensionality of the embedding (N).
In the previous question, you went the extra mile to separate the elements in the articleno list and then remove the target from it. Now, if you want to access the elements inside the articleno list, you need to go the extra mile again to separate them.
To illustrate what I mean, here's an approach that generates the outputs of both questions while adding minimal extra code:
# construct the embeddings dataframe:
embedding_df = pd.DataFrame(embeddings).T.add_prefix('dim')

# aggregation dictionary
agg_dict = {'countrycode': 'first', 'articleno': list}
# taking the mean over the embeddings
for i in embedding_df.columns:
    agg_dict[i] = 'mean'

new_df = df.explode('articleno')
(new_df.join(new_df['articleno'].rename('target'))
       .query('articleno != target')
       .merge(embedding_df, left_on='articleno', right_index=True)  # this line is extra compared to the previous question
       .groupby(['id', 'target'], as_index=False)
       .agg(agg_dict)
)
Output:
   id  target countrycode         articleno  dim0  dim1  dim2
0   2  590448          US  [617953, 617953]  0.23  0.78  0.24
1   2  617953          US  [590448, 590448]  0.67  0.12  0.10
Now, if you don't care about the articleno column in the final output, you can simplify the code even further while lowering memory usage and runtime, like this:
# assumes the embeddings have been merged onto the exploded frame, and that
# `g` is a groupby over it, e.g.:
# new_df = new_df.merge(embedding_df, left_on='articleno', right_index=True)
# g = new_df.groupby('id')
# transform keeps the totals aligned with new_df's rows
# (a plain .sum() would aggregate down to one row per id)
total_embeddings = g[embedding_df.columns].transform('sum')
article_counts = g['id'].transform('size')
# leave-one-out mean: remove each row's own embedding from the group total
new_df[embedding_df.columns] = (total_embeddings.sub(new_df[embedding_df.columns])
                                .div(article_counts - 1, axis=0)
                                )
and you would get the same output.

How to extract mean, max and min values of one column over ranges of another (continuous) column in python

I have a dataframe with two columns, Distance(m) and Height(m). I want to calculate the max, min and average Height values over intervals of 0.0439 m of Distance.
Distance is a continuous series from 0 to 0.81 m in steps of 0.00222 m, 403 values in total.
The aim is to extract 18 values (max, min, average) of Height from 18 intervals of 0.0439 m each over the continuous Distance series between 0 and 0.81 m.
Then, create a dataframe of each distance interval and its respective max, min and average Height values.
this is an example:
Interval distance  Height_max(m)  Height_min(m)  Height_average(m)
1                  0.35           0.15           0.25
2                  0.55           0.22           0.35
3                  0.25           0.10           0.15
I have only 2 columns in my dataframe:
Distance(m) = [0, 0.0022, 0.0044, .... 0.81 ]
Height(m) = [ 0, 0.1, 0.5, 0.4, 0.9, .... 0.1]
Does anyone have any suggestions that can help me?
Thanks!
I believe you need cut for binning the Distance column into intervals, and then to aggregate with GroupBy.agg and a list of aggregation functions:
# bin Distance into 18 equal-width intervals, then aggregate Height per bin
d = pd.cut(df['Distance'], bins=18)
df.groupby(d)['Height'].agg(['min', 'max', 'mean'])
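If you'd rather specify the interval width explicitly, a sketch assuming the 0.0439 m step from the question:
import numpy as np
# explicit bin edges every 0.0439 m, covering 0 to 0.81 m
edges = np.arange(0, 0.81 + 0.0439, 0.0439)
d = pd.cut(df['Distance'], bins=edges, include_lowest=True)
df.groupby(d)['Height'].agg(['min', 'max', 'mean']).reset_index()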

How to count instances following a condition in a dataframe

I have a dataset containing 18 unique IDs, each having one column of interest in which I want to count the instances where the values are greater than or less than 0.25.
For those that are greater than 0.25, I want to subtract a value from them and then graph the resulting values in a column scatter plot. How would I go about counting those instances using pandas, and how do I extract those >0.25 values so they are available to put into the scatter plot?
Demo data:
import pandas as pd

data = pd.DataFrame({"num": [0.1, 0.3, 0.1, 0.4]})
print(data)
   num
0  0.1
1  0.3
2  0.1
3  0.4
Filter the values that are greater than 0.25:
greater_than = data[data.num > 0.25]
print(greater_than)
   num
1  0.3
3  0.4
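To also get the count the question asks for, and to shift the matching values before plotting (a minimal sketch; the 0.1 offset is a made-up illustration value):
# number of values above the threshold (2 for the demo data)
count = (data.num > 0.25).sum()
# subtract a chosen constant from the matching values before plotting
adjusted = data.loc[data.num > 0.25, 'num'] - 0.1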
