Calculate aggregated variance for each group in Python

I have a data frame (df) with these columns: user, vector, and group.
df = pd.DataFrame({'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6'], 'vector': [[1, 0, 2, 0], [1, 8, 0, 2],[6, 2, 0, 0], [5, 0, 2, 2], [3, 8, 0, 0],[6, 0, 0, 2]], 'group': ['A', 'B', 'C', 'B', 'A', 'A']})
I want to calculate aggregated variance for each group.
I tried this code, but it returns an error:
aggregated_variance = (df.groupby('group', as_index=False)['vector'].agg(["var"]))
ValueError: no results

You can use .explode to clean up your data and then perform a .groupby operation:
out = (
    df.explode('vector')
      .groupby('group')['vector']
      .var(ddof=1)
)
print(out)
group
A 7.060606
B 7.428571
C 8.000000
Name: vector, dtype: float64
The trick here lies in the use of .explode:
>>> df.head()
     user        vector group
0  user_1  [1, 0, 2, 0]     A
1  user_2  [1, 8, 0, 2]     B
2  user_3  [6, 2, 0, 0]     C
3  user_4  [5, 0, 2, 2]     B
4  user_5  [3, 8, 0, 0]     A
>>> df.explode('vector').head()
     user vector group
0  user_1      1     A
0  user_1      0     A
0  user_1      2     A
0  user_1      0     A
1  user_2      1     B
...
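One caveat (an assumption about your pandas version): after .explode the vector column is still object dtype, and some newer pandas versions will refuse to compute a variance on it. Casting to a numeric dtype first avoids that:
out = (
    df.explode('vector')
      .astype({'vector': float})  # cast the exploded values to numeric
      .groupby('group')['vector']
      .var(ddof=1)
)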

If you take sum() after you group df, the lists in vector are concatenated, so you get one combined list of vector values per group. Then apply a lambda function that calculates the variance of each list.
import numpy as np

aggregated = df.groupby("group").sum()['vector']
aggregated_variance = aggregated.apply(lambda x: np.var(x)).reset_index()
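Note that np.var defaults to ddof=0 (population variance), while pandas' Series.var uses ddof=1, so this will not exactly match the exploded-groupby result above. To reproduce the sample variance from the other answer, pass ddof=1:
aggregated_variance = aggregated.apply(lambda x: np.var(x, ddof=1)).reset_index()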

Related

pandas: iterate over multiple columns with is_monotonic_increasing

I am trying to iterate over a time series with multiple columns and check whether the values within each column are monotonic_increasing or decreasing.
The underlying issue is that I don't know how to iterate over the dataframe columns and treat the values as a list to allow is_monotonic_increasing to work.
I have a dataset that looks like this:
Id          10000T  20000T
2020-04-30       0       7
2020-05-31       3       5
2020-06-30       5       6
and I have tried doing this:
trend_observation_period = new_df[-3:] #the dataset
trend = np.where((trend_observation_period.is_monotonic_increasing()==True), 'go', 'nogo')
which gives me the error:
AttributeError: 'DataFrame' object has no attribute 'is_monotonic_increasing'
I am confused because I thought that np.where would iterate over the columns and read them as np arrays. I have also tried this, which does not work either:
for i in trend_observation_period.iteritems():
    s = pd.Series(i)
    trend = np.where((s.is_monotonic_increasing()==True | s.is_monotonic_decreasing()==True),
                     'trending', 'not_trending')
It sounds like you're after something which will iterate columns and test if each column is monotonic. See if this puts you on the right track.
Per the pandas docs, .is_monotonic is the same as .is_monotonic_increasing.
Example:
# Sample dataset setup.
df = pd.DataFrame({'a': [1, 1, 1, 2],
                   'b': [3, 2, 1, 0],
                   'c': [0, 1, 1, 0],
                   'd': [2, 0, 1, 0]})

# Loop through each column in the DataFrame and output if monotonic.
for c in df:
    print(f'Column: {c} I', df[c].is_monotonic)
    print(f'Column: {c} D', df[c].is_monotonic_decreasing, end='\n\n')
Output:
Column: a I True
Column: a D False
Column: b I False
Column: b D True
Column: c I False
Column: c D False
Column: d I False
Column: d D False
You can use DataFrame.apply to apply a function to each of your columns. Since is_monotonic_increasing is a property of a Series and not a method of it, you'll need to wrap it in a function (you can use lambda for this):
df = pd.DataFrame({'a': [1, 1, 1, 1],
                   'b': [1, 1, 1, 0],
                   'c': [0, 1, 1, 0],
                   'd': [0, 0, 0, 0]})
increasing_cols = df.apply(lambda s: s.is_monotonic_increasing)
print(increasing_cols)
a True
b False
c False
d True
dtype: bool
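If you then want the 'trending'/'not_trending' labels from your np.where attempt, a minimal sketch building on this approach (assuming a column counts as trending when it is monotonic in either direction):
import numpy as np

increasing = df.apply(lambda s: s.is_monotonic_increasing)
decreasing = df.apply(lambda s: s.is_monotonic_decreasing)

# Label each column depending on whether it is monotonic in either direction.
trend = np.where(increasing | decreasing, 'trending', 'not_trending')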
Use .apply and is_monotonic.
Example:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [0, 1, 0, 1],
                   "C": [3, 5, 8, 9],
                   "D": [1, 2, 2, 1]})

df.apply(lambda x: x.is_monotonic)
A True
B False
C True
D False
dtype: bool
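Note: on pandas 2.0+, Series.is_monotonic has been removed, so there you would write the equivalent:
df.apply(lambda x: x.is_monotonic_increasing)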

Aggregating time series data

I am no data scientist. I do know Python, and I currently have to manage time series data that is coming in at a regular interval. Much of this data is all zeros or values that stay the same for a long time, and to save memory I'd like to filter them out. Is there some standard method for this (which I am obviously unaware of) or should I implement my own algorithm?
What I want to achieve is the following:
interval  value  result (summed)
1         0      0
2         0      # removed
3         0      0
4         1      1
5         2      2
6         2      # removed
7         2      # removed
8         2      2
9         0      0
10        0      0
Any help appreciated!
You could use pandas query on dataframes to achieve this:
import pandas as pd

matrix = [[1, 0, 0],
          [2, 0, 0],
          [3, 0, 0],
          [4, 1, 1],
          [5, 2, 2],
          [6, 2, 0],
          [7, 2, 0],
          [8, 2, 2],
          [9, 0, 0],
          [10, 0, 0]]

df = pd.DataFrame(matrix, columns=list('abc'))
print(df.query("c != 0"))
There is no quick function call to do what you need. The following is one way:
import pandas as pd

df = pd.DataFrame({'interval': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'value':    [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]})  # example dataframe

df['group'] = df['value'].ne(df['value'].shift()).cumsum()  # column that increments every time the value changes
df['key'] = 1                                                # create a column of ones
df['key'] = df.groupby('group')['key'].transform('cumsum')  # cumulative count within each group
df['key'] = df.groupby('group')['key'].transform(lambda x: x.isin([x.min(), x.max()]))  # flag the first and last row of each group
df = df[df['key']==True].drop(columns=['group', 'key'])      # keep only those rows
df
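A more compact variant of the same idea (a sketch of the same shift-based trick: keep a row whenever its value differs from either neighbour, so only the interior of each run of repeated values is dropped):
import pandas as pd

df = pd.DataFrame({'interval': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'value':    [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]})

keep = df['value'].ne(df['value'].shift()) | df['value'].ne(df['value'].shift(-1))
print(df[keep])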
Here is the code:
l = [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]
for (i, ll) in enumerate(l):
    if i != 0 and ll == l[i-1] and i < len(l)-1 and ll == l[i+1]:
        continue
    print(i+1, ll)
It produces what you want. You haven't specified the format of your input data, so I assumed it is in a list. The conditions ll == l[i-1] and ll == l[i+1] are key to skipping the repeated values.
Thanks all!
Looking at the answers, I guess I can conclude that I'll need to roll my own. I'll be using your input as inspiration.
Thanks again!

Appending a list based on the column that the data comes from

I'm attempting to append a binary numpy array to another numpy array to feed into a neural network. The binary list is dependent on the column that the array is coming from.
For example, an array that comes from the third column is [0 0 1 0 0 0 0 0 0].
Here is an example:
Data (list of arrays):
[[0, 1, 1, 1, 0], [0, 1, 0, 0, 1], [1, 0, 0, 0, 0]]
Let's say that the first two elements came from the first column of a dataframe and the third element came from the second column. After appending the binary array the data would look something like this:
[([0, 1, 1, 1, 0],
[1 0 0 0 0 0 0 0 0]),
([0, 1, 0, 0, 1],
[1 0 0 0 0 0 0 0 0]),
([1, 0, 0, 0, 0],
[0 1 0 0 0 0 0 0 0])]
For context, I was originally training on just a single column of a dataframe; however, I now want to be able to train over the entire dataframe.
Is there a way to automatically append this array to my data depending on the column the data is coming from so that the neural network can train on the whole data set rather than just going column by column?
Additionally, would this require two input layers or just one?
Maybe you could add a more concrete example to your question. But anyway, is this what you're expecting?
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'col1': [[0,0,1], [1,1,1]], 'col2': [[1,1,0],[0,0,0]]})

In [3]: df
Out[3]:
        col1       col2
0  [0, 0, 1]  [1, 1, 0]
1  [1, 1, 1]  [0, 0, 0]

In [4]: for col_index, col_name in enumerate(df.columns):
   ...:     array_to_append = [0] * len(df.columns)
   ...:     array_to_append[col_index] = 1
   ...:     df[col_name] = df[col_name].map(lambda x: (x, array_to_append))
   ...:

In [5]: df
Out[5]:
                  col1                 col2
0  ([0, 0, 1], [1, 0])  ([1, 1, 0], [0, 1])
1  ([1, 1, 1], [1, 0])  ([0, 0, 0], [0, 1])
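If the network should ultimately receive a single flat array per sample rather than a tuple, one option (an assumption about your setup, and it keeps a single input layer) is to concatenate each vector with its one-hot column tag:
import numpy as np

# Hypothetical sample: a data vector plus the one-hot tag of the column it came from.
vector = np.array([0, 1, 1, 1, 0])
column_tag = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0])

model_input = np.concatenate([vector, column_tag])  # one flat input for a single input layer
Using two separate input layers and merging them inside the model is also possible; that choice depends on the architecture you want.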

Pandas Merge DataFrame Columns With Same Name But Different Rows

I would like to merge two dataframes. Both have the same column names, but different numbers of rows.
The values from the smaller dataframe should then replace the values from the other dataframe.
So far I have tried using pd.merge:
pd.merge(df1, df2, how='left', on='NodeID')
But I do not know how to tell the merge command to use the values from the right dataframe for the columns 'X' and 'Y'.
df1 = pd.DataFrame(data={'NodeID': [1, 2, 3, 4, 5], 'X': [0, 0, 0, 0, 0], 'Y': [0, 0, 0, 0, 0]})
df2 = pd.DataFrame(data={'NodeID': [2, 4], 'X': [1, 1], 'Y': [1, 1]})
The result should then look like this:
df3 = pd.DataFrame(data={'NodeID': [1, 2, 3, 4, 5], 'X': [0, 1, 0, 1, 0], 'Y':[0, 1, 0, 1, 0]})
This can be done with concat and drop_duplicates:
pd.concat([df2, df1]).drop_duplicates('NodeID').sort_values('NodeID')
Out[763]:
   NodeID  X  Y
0       1  0  0
0       2  1  1
2       3  0  0
1       4  1  1
4       5  0  0
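An alternative sketch using DataFrame.update, which overwrites df1's values in place wherever df2 has a matching NodeID (be aware that update may upcast the affected columns to float):
df3 = df1.set_index('NodeID')
df3.update(df2.set_index('NodeID'))
df3 = df3.reset_index()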

Numpy assign an array value based on the values of another array with column selected based on a vector

I have a 2 dimensional array
X
array([[2, 3, 3, 3],
       [3, 2, 1, 3],
       [2, 3, 1, 2],
       [2, 2, 3, 1]])
and a 1 dimensional array
y
array([1, 0, 0, 1])
For each row of X, I want to find the column index where X has the lowest value and y has a value of 1, and set the corresponding row/column pair in a third matrix to 1.
For example, in the case of the first row of X, the column index corresponding to the minimum X value (for the first row only) with y = 1 is 0, so I want Z[0,0] = 1 and all other Z[0,i] = 0.
Similarly, for the second row, column index 0 or 3 gives the lowest X value with y = 1. Then I want either Z[1,0] or Z[1,3] = 1 (preferably Z[1,0] = 1 and all other Z[1,i] = 0, since column 0 is the first occurrence).
My final Z array will look like
Z
array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])
One way to do this is using masked arrays.
import numpy as np

X = np.array([[2, 3, 3, 3],
              [3, 2, 1, 3],
              [2, 3, 1, 2],
              [2, 2, 3, 1]])
y = np.array([1, 0, 0, 1])

# Get a mask in the shape of X. (True for places to ignore, i.e. where y == 0.)
y_mask = np.vstack([y == 0] * len(X))
X_masked = np.ma.masked_array(X, y_mask)

out = np.zeros_like(X)
# Row-wise argmin over the unmasked columns.
mins = np.argmin(X_masked, axis=1)
# Output: array([0, 0, 0, 3], dtype=int64)

# Now just set those indexes to 1, one per row.
out[np.arange(len(out)), mins] = 1
print(out)
[[1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [0 0 0 1]]
You can use numpy.argmin() to get the index of the minimum value in each row of X. For example:
import numpy as np

a = np.arange(6).reshape(2, 3) + 10
ids = np.argmin(a, axis=1)
Similarly, you can get the indexes where y is 1 with either numpy.nonzero or numpy.where.
Once you have the two index arrays, setting the values in the third array should be quite easy.
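For completeness, a minimal sketch of one way to put those pieces together (an illustration, not the only approach): push the columns where y is 0 to a huge value so the row-wise argmin only ever picks columns where y == 1, then scatter the ones into Z.
import numpy as np

X = np.array([[2, 3, 3, 3],
              [3, 2, 1, 3],
              [2, 3, 1, 2],
              [2, 2, 3, 1]])
y = np.array([1, 0, 0, 1])

# Replace disallowed columns (y == 0) with the dtype's max so argmin ignores them.
X_allowed = np.where(y == 1, X, np.iinfo(X.dtype).max)
cols = X_allowed.argmin(axis=1)  # first minimal allowed column per row -> [0, 0, 0, 3]

Z = np.zeros_like(X)
Z[np.arange(len(X)), cols] = 1
print(Z)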
