Correlation for two lists of data - Python

I have two lists containing data like this:
a = [1 2 1 3 1 2 1 1 1 2 1 1 2 1 4 1 ]
b = [ 3480. 7080. 10440. 13200. 16800. 20400. 23880. 27480. 30840. 38040. 41520. 44880. 48480. 52080. 55680. 59280.]
How can I find the correlation in Python by importing rpy2, i.e. by using R's cor function? The output has to lie between -1 and +1.

from rpy2.robjects.vectors import FloatVector
from rpy2.robjects.packages import importr
stats = importr('stats')
a=[1, 2, 1, 3, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 4, 1 ]
b=[ 3480, 7080, 10440, 13200, 16800, 20400, 23880,
27480, 30840, 38040, 41520, 44880, 48480, 52080, 55680, 59280]
result = stats.cor(FloatVector(a), FloatVector(b))
The documentation for rpy2 has many other examples about how to use it.
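As a cross-check that does not need R, here is a minimal sketch that computes the same Pearson correlation with NumPy. It assumes NumPy is installed; it is not the rpy2 route the question asks for, only a way to confirm that the value lies between -1 and +1.

```python
import numpy as np

a = [1, 2, 1, 3, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 4, 1]
b = [3480, 7080, 10440, 13200, 16800, 20400, 23880,
     27480, 30840, 38040, 41520, 44880, 48480, 52080, 55680, 59280]

# np.corrcoef returns the 2x2 correlation matrix of the two inputs;
# the off-diagonal entry is the Pearson correlation between a and b.
r = np.corrcoef(a, b)[0, 1]
print(r)
```

With rpy2, the value returned by stats.cor should be an R vector of length one, so result[0] would be the number to compare against r.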

Related

Python - issue with dimension of sequence

I want to create the following sequence of zeros and ones in Python:
{0, 1,1,1,1, 0,0, 1,1,1, 0,0,0, 1,1, 0,0,0,0, 1}
So first there is 1 zero and 4 ones, then 2 zeros and 3 ones, then 3 zeros and 2 ones, and finally 4 zeros and 1 one. The final array is supposed to have dimension 20x1, but my code gives me dimension 4x2. Does anyone know how I can fix this?
Here's my code:
import numpy as np
seq = [ (np.ones(n), np.zeros(5-n) ) for n in range(1,5)]
Many thanks in advance!
For each iteration you create a tuple of two things, hence the 4x2 result. You can bring it to the form you want by concatenating the array elements all together, but there is a pattern to your sequence; you can take advantage that it looks like a triangular matrix of 1s and 0s, which you can then flatten.
n = 5
ones = np.ones((n, n), dtype=int)
seq = np.triu(ones)[1:].flatten()
Output:
array([0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1])
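The concatenation route mentioned above can be sketched as follows. This is one possible way to do it, assuming the goal is n zeros followed by 5 - n ones for each n; the names pieces and seq are only illustrative.

```python
import numpy as np

# For each n, build n zeros followed by (5 - n) ones as one 1-D array.
pieces = [np.r_[np.zeros(n, dtype=int), np.ones(5 - n, dtype=int)]
          for n in range(1, 5)]

# Join the four length-5 pieces into a single flat array of 20 elements.
seq = np.concatenate(pieces)
print(seq.shape)
```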
You can use flatten:
import numpy as np
l = np.array([[0] * n + [1] * (5 - n) for n in range(1, 5)]).flatten()
print(l)
# >>> [0 1 1 1 1 0 0 1 1 1 0 0 0 1 1 0 0 0 0 1]

Calculate the average of sections of a column with condition met to create new dataframe

I have the below data table
A = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
B = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df = pd.DataFrame({'A':A, 'B':B})
I'd like to calculate the average of column A over each run of consecutive rows where column B equals 1. All rows where column B equals 0 are ignored, and the results go into a new dataframe like below:
Thanks for your help!
Keywords: groupby, shift, mean
Code:
df_result=df.groupby((df['B'].shift(1,fill_value=0)!= df['B']).cumsum()).mean()
df_result=df_result[df_result['B']!=0]
df_result
A B
1 2.0 1.0
3 3.0 1.0
As you might have noticed, you first need to identify the blocks of consecutive rows that share the same value.
One way to do so is to shift B down by one row and compare it with the original column.
df['B_shifted']=df['B'].shift(1,fill_value=0) # fill_value=0 fills the first row with 0 instead of NaN, so the column stays int
df['A']                      = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
df['B']                      = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df['B_shifted']              = [0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
(df['B_shifted'] != df['B']) = [F, T, F, F, T, F, T, F, F, T, F]
                                   ↑        ↑     ↑        ↑
Now we can use the groupby pandas method as follows:
df_grouped=df.groupby((df['B_shifted'] != df['B']).cumsum())
If we now loop over the DataFrameGroupBy object df_grouped, we see the following tuples:
(0, A B B_shifted
0 2 0 0)
(1, A B B_shifted
1 3 1 0
2 1 1 1
3 2 1 1)
(2, A B B_shifted
4 4 0 1
5 1 0 0)
(3, A B B_shifted
6 5 1 0
7 3 1 1
8 1 1 1)
(4, A B B_shifted
9 7 0 1
10 5 0 0)
We can now simply calculate the mean and filter out the groups where B is zero, as follows:
df_result=df_grouped.mean()
df_result=df_result[df_result['B']!=0][['A','B']]
Try:
m = (df.B != df.B.shift(1)).cumsum() * df.B
df_out = df.groupby(m[m > 0])["A"].mean().reset_index(drop=True).to_frame()
df_out["B"] = 1
print(df_out)
Prints:
A B
0 2 1
1 3 1
df1 = df.groupby((df['B'].shift() != df['B']).cumsum()).mean().reset_index(drop=True)
df1 = df1[df1['B'] == 1].astype(int).reset_index(drop=True)
df1
Output
A B
0 2 1
1 3 1
Explanation
We check whether each row's value of B differs from the previous row's value using shift(). The cumulative sum of those changes gives every run of equal values its own label, so we group by that label, calculate the mean of each group, and assign the result to the new dataframe df1.
Since this gives the mean for every run of consecutive 0s and 1s, we then keep only the rows where B == 1.
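The same idea can also be written so that the B == 0 rows are dropped before averaging. This is a minimal sketch under that assumption; run_id, mask and run_means are illustrative names, not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5],
                   "B": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]})

# A new run starts wherever B differs from the previous row,
# so the cumulative sum of those changes labels each run.
run_id = (df["B"] != df["B"].shift()).cumsum()

# Keep only the rows with B == 1, then average A within each run.
mask = df["B"] == 1
run_means = df.loc[mask, "A"].groupby(run_id[mask]).mean().reset_index(drop=True)
print(run_means)
```

Filtering first avoids computing means for the B == 0 runs that would be thrown away afterwards.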

How to get array of numbers using ones() in numpy?

Hi, I have some Matlab code that generates the following sequence.
[ones(1,6*2) 2 ones(1,6*2-1) 2 ones(1,6*2) 1]
ans =
Columns 1 through 18
1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
Columns 19 through 36
1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
Columns 37 through 38
1 1
I want to generate similar array of numbers in Python.
I have tried to generate as follows.
ConvStride = [np.ones((12,),dtype=int),2,np.ones((11,),dtype=int),2,np.ones((12,),dtype=int),1]
This gives a list of separate arrays and scalars:
[1 1 1 1 1... 1],2,[1 1 1 ... 1],2,[1 1 1 1....1],1
What I need is a single flat array:
[ 1 1 1 .....1 2 1 1 1 .....1 2 111....1 1]
Could you please let me know a workaround here?
Use np.r_:
np.r_[np.ones(12,int),2,np.ones(11,int),2,np.ones(12,int),1]
# array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#        1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
You can create a list using a similar python syntax and then convert it to a numpy array
import numpy as np
x = [1]*(6*2) + [2] + [1]*(6*2-1) + [2] + [1]*(6*2) + [1]
ans = np.array(x)
If you want to do it all with numpy you can use hstack.
np.hstack([np.ones(6*2, int), 2, np.ones(6*2-1, int), 2, np.ones(6*2, int), 1])
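To check that these approaches agree with the 38-element Matlab result, here is a small sketch that builds the array several ways and compares them. The via_fill variant is an extra alternative, not taken from the answers: it starts from all ones and overwrites the two positions that hold a 2.

```python
import numpy as np

via_r = np.r_[np.ones(12, int), 2, np.ones(11, int), 2, np.ones(12, int), 1]
via_list = np.array([1] * (6 * 2) + [2] + [1] * (6 * 2 - 1) + [2] + [1] * (6 * 2) + [1])
via_hstack = np.hstack([np.ones(6 * 2, int), 2, np.ones(6 * 2 - 1, int), 2,
                        np.ones(6 * 2, int), 1])

# Alternative: 38 ones, with zero-based positions 12 and 24 set to 2.
via_fill = np.ones(38, dtype=int)
via_fill[[12, 24]] = 2

print(via_r.shape)
print(np.array_equal(via_r, via_fill))
```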

Counting the number of consecutive values that meets a condition (Pandas Dataframe)

So I created this post regarding my problem 2 days ago and got an answer thankfully.
I have data made up of 20 rows and 2500 columns. Each column is a unique product and the rows are a time series of measurement results. So each product is measured 20 times and there are 2500 products.
This time I want to know for how many consecutive rows my measurement result stays above a specific threshold.
In other words, I want to count the number of consecutive values that are above some value, let's say 5.
A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
The values above 5 are 6, 8, 7 and 6, 10, so according to what I defined above I should get NumofConsFeature = 3 as the result (taking the max if more than one series meets the condition).
I thought of filtering with .gt, then getting the indexes and looping over them to detect consecutive index numbers, but I couldn't make it work.
In a second phase, I'd like to know the index of the first value of my consecutive series. For the above example, that would be 3.
But I have no idea how to do this one.
Thanks in advance.
Here's another answer using only Pandas functions:
import pandas as pd
A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
a = pd.DataFrame(A, columns = ['foo'])
a['is_large'] = (a.foo > 5)
a['crossing'] = (a.is_large != a.is_large.shift()).cumsum()
a['count'] = a.groupby(['is_large', 'crossing']).cumcount(ascending=False) + 1
a.loc[a.is_large == False, 'count'] = 0
which gives
foo is_large crossing count
0 1 False 1 0
1 2 False 1 0
2 6 True 2 3
3 8 True 2 2
4 7 True 2 1
5 3 False 3 0
6 2 False 3 0
7 3 False 3 0
8 6 True 4 2
9 10 True 4 1
10 2 False 5 0
11 1 False 5 0
12 0 False 5 0
13 2 False 5 0
From there on you can easily find the maximum and its index.
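That last step can be sketched as follows, assuming the count column built above. The names longest and start are illustrative. Note that the index is zero-based here, so the run 6, 8, 7 starts at index 2, which corresponds to position 3 when counting from 1 as in the question.

```python
import pandas as pd

A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
a = pd.DataFrame(A, columns=["foo"])
a["is_large"] = a.foo > 5
a["crossing"] = (a.is_large != a.is_large.shift()).cumsum()
a["count"] = a.groupby(["is_large", "crossing"]).cumcount(ascending=False) + 1
a.loc[~a.is_large, "count"] = 0

# Because the count runs downwards inside each run, the largest value sits
# on the first row of the longest run.
longest = a["count"].max()    # length of the longest run above 5
start = a["count"].idxmax()   # zero-based index where that run starts
print(longest, start)
```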
There is a simple way to do that.
Let's say your list is: A = [1, 2, 6, 8, 7, 6, 8, 3, 2, 3, 6, 10,6,7,8, 2, 1, 0, 2]
And you want to find how many runs of 5 consecutive values are all at least 6. Here the answer is 2: there are two runs of length 5 whose values are all 6 or more. In Python and pandas, with the list stored in a column named wanted_row, we do that like below:
df = pd.DataFrame({'wanted_row': A})
condition = (df.wanted_row >= 6) & \
            (df.wanted_row.shift(-1) >= 6) & \
            (df.wanted_row.shift(-2) >= 6) & \
            (df.wanted_row.shift(-3) >= 6) & \
            (df.wanted_row.shift(-4) >= 6)
consecutive_count = condition.sum() # number of rows that start such a run of 5
Here's one with maxisland_start_len_mask -
# https://stackoverflow.com/a/52718782/ #Divakar
import numpy as np

def maxisland_start_len_mask(a, fillna_index = -1, fillna_len = 0):
    # a is a boolean array
    pad = np.zeros(a.shape[1],dtype=bool)
    mask = np.vstack((pad, a, pad))
    mask_step = mask[1:] != mask[:-1]
    idx = np.flatnonzero(mask_step.T)
    island_starts = idx[::2]
    island_lens = idx[1::2] - idx[::2]
    n_islands_percol = mask_step.sum(0)//2
    bins = np.repeat(np.arange(a.shape[1]),n_islands_percol)
    scale = island_lens.max()+1
    scaled_idx = np.argsort(scale*bins + island_lens)
    grp_shift_idx = np.r_[0,n_islands_percol.cumsum()]
    max_island_starts = island_starts[scaled_idx[grp_shift_idx[1:]-1]]
    max_island_percol_start = max_island_starts%(a.shape[0]+1)
    valid = n_islands_percol!=0
    cut_idx = grp_shift_idx[:-1][valid]
    max_island_percol_len = np.maximum.reduceat(island_lens, cut_idx)
    out_len = np.full(a.shape[1], fillna_len, dtype=int)
    out_len[valid] = max_island_percol_len
    out_index = np.where(valid,max_island_percol_start,fillna_index)
    return out_index, out_len

def maxisland_start_len(a, trigger_val, comp_func=np.greater):
    # a is 2D array as the data
    mask = comp_func(a,trigger_val)
    return maxisland_start_len_mask(mask, fillna_index = -1, fillna_len = 0)
Sample run -
In [169]: a
Out[169]:
array([[ 1, 0, 3],
[ 2, 7, 3],
[ 6, 8, 4],
[ 8, 6, 8],
[ 7, 1, 6],
[ 3, 7, 8],
[ 2, 5, 8],
[ 3, 3, 0],
[ 6, 5, 0],
[10, 3, 8],
[ 2, 3, 3],
[ 1, 7, 0],
[ 0, 0, 4],
[ 2, 3, 2]])
# Per column results
In [170]: row_index, length = maxisland_start_len(a, 5)
In [172]: row_index
Out[172]: array([2, 1, 3])
In [173]: length
Out[173]: array([3, 3, 4])
You can apply diff() to your Series and then count the number of consecutive entries where the difference is 1 and the actual value is above your cutoff. The largest count is the maximum number of consecutive values. Note that this only detects runs that rise in steps of exactly 1, such as 6, 7, 8 in the data below.
First compute diff():
df = pd.DataFrame({"a":[1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})
df['b'] = df.a.diff()
df
a b
0 1 NaN
1 2 1.0
2 6 4.0
3 7 1.0
4 8 1.0
5 3 -5.0
6 2 -1.0
7 3 1.0
8 6 3.0
9 10 4.0
10 2 -8.0
11 1 -1.0
12 0 -1.0
13 2 2.0
Now count consecutive sequences:
above = 5
n_consec = 1
max_n_consec = 1
for a, b in df.values[1:]:
    if (a > above) & (b == 1):
        n_consec += 1
    else: # check for new max, then start again from 1
        max_n_consec = max(n_consec, max_n_consec)
        n_consec = 1
max_n_consec = max(n_consec, max_n_consec) # also covers a run that reaches the last row
max_n_consec
3
Here's how I did it using numpy:
import pandas as pd
import numpy as np
df = pd.DataFrame({"a":[1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})
consecutive_steps = 2
marginal_price = 5
assertions = [(df.loc[:, "a"].shift(-i) < marginal_price) for i in range(consecutive_steps)]
condition = np.all(assertions, axis=0)
consecutive_count = df.loc[condition, :].count()
print(consecutive_count)
which yields 6.
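For the wide layout described in the question (one column per product), a pandas-only version that returns both the start and the length of the longest run per column could look like the sketch below. The helper longest_run_above and the column names p0, p1, p2 are hypothetical, and the data is the sample array from the maxisland answer above; -1 and 0 are returned when no value exceeds the threshold, mirroring that answer's fill values.

```python
import pandas as pd

def longest_run_above(s, threshold):
    """Return (start_position, length) of the longest run of values > threshold."""
    above = s > threshold
    # Label each run of equal True/False values with its own id.
    run_id = (above != above.shift()).cumsum()
    # Running length inside each run; stays 0 wherever the condition fails.
    run_len = above.astype(int).groupby(run_id).cumsum()
    length = int(run_len.max())
    if length == 0:
        return -1, 0  # no value exceeds the threshold
    end = int(run_len.to_numpy().argmax())  # position where the longest run ends
    return end - length + 1, length

df = pd.DataFrame(
    [[1, 0, 3], [2, 7, 3], [6, 8, 4], [8, 6, 8], [7, 1, 6], [3, 7, 8], [2, 5, 8],
     [3, 3, 0], [6, 5, 0], [10, 3, 8], [2, 3, 3], [1, 7, 0], [0, 0, 4], [2, 3, 2]],
    columns=["p0", "p1", "p2"],
)

result = {col: longest_run_above(df[col], 5) for col in df.columns}
print(result)
```

Looping over columns is slower than the fully vectorized NumPy version, but for 2500 columns of 20 rows it should still be quick and is easier to follow.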

How to iterate a vectorized if/else statement over additional columns?

import pandas as pd, numpy as np
ltlist = [1, 2]
org = pd.DataFrame({'ID': [1, 3, 4, 5, 6, 7], 'ID2': [3, 4, 5, 6, 7, 2]})
ltlist_set = set(ltlist)
org['LT'] = np.where(org['ID'].isin(ltlist_set), org['ID'], 0)
I also need to check the ID2 column and write its ID into LT, unless LT already has an ID from the ID column.
Expected output:
ID ID2 LT
1 3 1
3 4 0
4 5 0
5 6 0
6 7 0
7 2 2
Thanks!
Option 1
You can nest numpy.where statements:
org['LT'] = np.where(org['ID'].isin(ltlist_set), 1,
np.where(org['ID2'].isin(ltlist_set), 2, 0))
Option 2
Alternatively, you can use pd.DataFrame.loc sequentially:
org['LT'] = 0 # default value
org.loc[org['ID2'].isin(ltlist_set), 'LT'] = 2
org.loc[org['ID'].isin(ltlist_set), 'LT'] = 1
Option 3
A third option is to use numpy.select:
conditions = [org['ID'].isin(ltlist_set), org['ID2'].isin(ltlist_set)]
values = [1, 2]
org['LT'] = np.select(conditions, values, 0) # 0 is default value
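The options above write the fixed labels 1 and 2, which happens to match the expected output for this data. If the intent is instead to write the matching ID value itself, numpy.select also accepts the columns as choices. A minimal sketch under that assumption, with org built as a DataFrame:

```python
import numpy as np
import pandas as pd

ltlist_set = {1, 2}
org = pd.DataFrame({"ID": [1, 3, 4, 5, 6, 7], "ID2": [3, 4, 5, 6, 7, 2]})

# Conditions are checked in order, so a match in ID takes priority over ID2.
conditions = [org["ID"].isin(ltlist_set), org["ID2"].isin(ltlist_set)]
choices = [org["ID"], org["ID2"]]  # write the matching ID itself

org["LT"] = np.select(conditions, choices, default=0)  # 0 where neither column matches
print(org)
```

To cover further columns, append one more condition and the matching column to the two lists.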
