How to get array of numbers using ones() in numpy? - python

Hi, I have code in MATLAB that generates the following sequence.
[ones(1,6*2) 2 ones(1,6*2-1) 2 ones(1,6*2) 1]
ans =
Columns 1 through 18
1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
Columns 19 through 36
1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
Columns 37 through 38
1 1
I want to generate a similar array of numbers in Python.
I have tried the following:
ConvStride = [np.ones((12,),dtype=int),2,np.ones((11,),dtype=int),2,np.ones((12,),dtype=int),1]
This gives a list of separate arrays and scalars:
[1 1 1 1 1... 1],2,[1 1 1 ... 1],2,[1 1 1 1....1],1
What I need is a single flat array:
[ 1 1 1 .....1 2 1 1 1 .....1 2 1 1 1....1 1]
Could you please suggest a workaround?

Use np.r_:
np.r_[np.ones(12,int),2,np.ones(11,int),2,np.ones(12,int),1]
# array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,
# 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

You can build a list with similar Python syntax and then convert it to a NumPy array:
import numpy as np
x = [1]*(6*2) + [2] + [1]*(6*2-1) + [2] + [1]*(6*2) + [1]
ans = np.array(x)
If you want to do it all with numpy you can use hstack.
np.hstack([np.ones(6*2, int), 2, np.ones(6*2-1, int), 2, np.ones(6*2, int), 1])
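As a quick sanity check, the constructions above can be compared against each other. This is only a sketch: np.concatenate is included as a further alternative that does not appear in the answers, and the variable names are illustrative.

```python
import numpy as np

n = 6 * 2  # block length used in the MATLAB expression

# Plain-list construction, then conversion to an array
from_list = np.array([1] * n + [2] + [1] * (n - 1) + [2] + [1] * n + [1])

# np.r_ and np.hstack both accept a mix of arrays and scalars
from_r = np.r_[np.ones(n, int), 2, np.ones(n - 1, int), 2, np.ones(n, int), 1]
from_hstack = np.hstack([np.ones(n, int), 2, np.ones(n - 1, int), 2, np.ones(n, int), 1])

# np.concatenate needs every piece to be at least 1-D, so scalars are wrapped in lists
from_concat = np.concatenate([np.ones(n, int), [2], np.ones(n - 1, int), [2], np.ones(n, int), [1]])

print(from_list.size)  # 38, the same length as the MATLAB output
```

All four arrays hold the same 38 values; which form to use is mostly a matter of taste.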

Related

Calculate the average of sections of a column with condition met to create new dataframe

I have the below data table
A = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
B = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df = pd.DataFrame({'A':A, 'B':B})
I'd like to calculate the average of column A over each run of consecutive rows where column B equals 1. Rows where column B equals 0 are ignored. The result should go into a new dataframe like below:
Thanks for your help!
Keywords: groupby, shift, mean
Code:
df_result=df.groupby((df['B'].shift(1,fill_value=0)!= df['B']).cumsum()).mean()
df_result=df_result[df_result['B']!=0]
df_result
A B
1 2.0 1.0
3 3.0 1.0
As you might have noticed, you first need to identify the blocks of consecutive rows that share the same value.
One way to do so is to shift B by one row and compare the result with B itself.
df['B_shifted']=df['B'].shift(1,fill_value=0) # fill_value=0 to return int and replace Nan with 0's
# A                 = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
# B                 = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
# B_shifted         = [0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
# B_shifted != B    = [F, T, F, F, T, F, T, F, F, T, F]
#                         ↑        ↑     ↑        ↑   (each T starts a new block)
Now we can use the groupby pandas method as follows:
df_grouped=df.groupby((df['B_shifted'] != df['B']).cumsum())
If we iterate over the DataFrameGroupBy object df_grouped,
we see the following tuples:
(0, A B B_shifted
0 2 0 0)
(1, A B B_shifted
1 3 1 0
2 1 1 1
3 2 1 1)
(2, A B B_shifted
4 4 0 1
5 1 0 0)
(3, A B B_shifted
6 5 1 0
7 3 1 1
8 1 1 1)
(4, A B B_shifted
9 7 0 1
10 5 0 0)
We can now simply calculate the mean and filter out the zero groups as follows:
df_result=df_grouped.mean()
df_result=df_result[df_result['B']!=0][['A','B']]
Try:
m = (df.B != df.B.shift(1)).cumsum() * df.B
df_out = df.groupby(m[m > 0])["A"].mean().reset_index(drop=True).to_frame()
df_out["B"] = 1
print(df_out)
Prints:
A B
0 2 1
1 3 1
df1 = df.groupby((df['B'].shift() != df['B']).cumsum()).mean().reset_index(drop=True)
df1 = df1[df1['B'] == 1].astype(int).reset_index(drop=True)
df1
Output
A B
0 2 1
1 3 1
Explanation
We check whether each row's value of B differs from the previous row's value using Series.shift; the cumulative sum of those changes labels each run of consecutive equal values. We then group by that label, take the mean of each group, and assign the result to the new dataframe df1.
Since this yields the mean for every run of consecutive 0s as well as every run of consecutive 1s, we then keep only the groups where B == 1.
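The run-labelling idea shared by these answers can be condensed into a small helper. This is a minimal sketch; the function name mean_of_runs and its parameters are made up for illustration and are not part of pandas.

```python
import pandas as pd

def mean_of_runs(df, value_col="A", flag_col="B"):
    """Mean of value_col over each run of consecutive rows where flag_col == 1."""
    # A new run starts wherever the flag differs from the previous row
    run_id = (df[flag_col] != df[flag_col].shift()).cumsum()
    # Keep only the rows inside runs of 1s, then average per run
    ones = df[df[flag_col] == 1]
    means = ones.groupby(run_id[ones.index])[value_col].mean()
    return means.reset_index(drop=True)

df = pd.DataFrame({"A": [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5],
                   "B": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]})
result = mean_of_runs(df)
print(result.tolist())  # [2.0, 3.0]
```

Filtering to B == 1 before grouping avoids computing means for the runs of 0s that would be discarded anyway.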

Count of all possible combinations between dataframe columns

I'm trying to get the count (where all row values are 1) of each possible combination between the eight columns of a dataframe. Basically I need to understand how many times different overlaps exist.
I've tried to use itertools.product to get all the combinations, but it doesn't seem to work.
import pandas as pd
import numpy as np
import itertools
df = pd.read_excel('filename.xlsx')
df.head(15)
a b c d e f g h
0 1 0 0 0 0 1 0 0
1 1 0 0 0 0 0 0 0
2 1 0 1 1 1 1 1 1
3 1 0 1 1 0 1 1 1
4 1 0 0 0 0 0 0 0
5 0 1 0 0 1 1 1 1
6 1 1 0 0 1 1 1 1
7 1 1 1 1 1 1 1 1
8 1 1 0 0 1 1 0 0
9 1 1 1 0 1 0 1 0
10 1 1 1 0 1 1 0 0
11 1 0 0 0 0 1 0 0
12 1 1 1 1 1 1 1 1
13 1 1 1 1 1 1 1 1
14 0 1 1 1 1 1 1 0
print(list(itertools.product(new_df.columns)))
The expected output would be a dataframe with the count (n) of rows for each of valid combinations (where values in the row are all 1).
For example:
a b
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 0 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
11 1 0
12 1 1
13 1 1
14 0 1
Would give
combination count
a 13
a_b 7
b 9
Note that the output would need to contain all the combinations possible between a and h, not just pairwise
Powerset Combinations
Use the powerset recipe from the itertools documentation:
s = pd.Series({
    '_'.join(c): df[c].min(axis=1).sum()
    for c in map(list, filter(None, powerset(df)))
})
a 13
b 9
c 8
d 6
e 10
f 12
g 9
h 7
a_b 7
...
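For a version that runs on its own, here is a sketch that includes the powerset recipe and applies it to the two-column excerpt (a and b) from the question. The small dataframe is rebuilt by hand from the rows shown there.

```python
from itertools import chain, combinations
import pandas as pd

def powerset(iterable):
    """powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"""
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# Columns a and b of the 15 rows shown in the question
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "b": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
})

# A row counts for a combination only if every column in it is 1,
# i.e. the row-wise minimum over those columns is 1
counts = pd.Series({
    "_".join(c): int(df[list(c)].min(axis=1).sum())
    for c in powerset(df.columns) if c  # skip the empty combination
})
print(counts)  # a -> 13, b -> 9, a_b -> 7
```

With eight columns the powerset has 255 non-empty combinations, so this stays practical, but the number of combinations doubles with every added column.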
Pairwise Combinations
This is a special case, and can be vectorized.
from itertools import combinations
u = df.T.dot(df)
pd.DataFrame({
'combination': [*map('_'.join, combinations(df, 2))],
# pandas < 0.24
# 'count': u.values[np.triu_indices_from(u, k=1)]
# pandas >= 0.24
'count': u.to_numpy()[np.triu_indices_from(u, k=1)]
})
The dot product gives the co-occurrence matrix, and its upper-triangular values are the pairwise counts:
combination count
0 a_b 7
1 a_c 7
2 a_d 5
3 a_e 8
4 a_f 10
5 a_g 7
6 a_h 6
7 b_c 6
8 b_d 4
9 b_e 9
As you happen to have 8 columns, np.packbits together with
np.bincount is rather convenient here:
import numpy as np
import pandas as pd
# make large example
ncol, nrow = 8, 1_000_000
df = pd.DataFrame(np.random.randint(0,2,(nrow,ncol)), columns=list("abcdefgh"))
from time import time
T = [time()]
# encode as binary numbers and count
counts = np.bincount(np.packbits(df.values.astype(np.uint8)),None,256)
# find sets in other sets
rng = np.arange(256, dtype=np.uint8)
contained = (rng & rng[:, None]) == rng[:, None]
# and sum
ccounts = (counts * contained).sum(1)
# if there are empty bins, remove them
nz = np.where(ccounts)[0].astype(np.uint8)
# helper to build bin labels
a2h = np.array(list("abcdefgh"))
# put labels to counts
result = pd.Series(ccounts[nz], index = ["_".join((*a2h[np.unpackbits(i).view(bool)],)) for i in nz])
from itertools import chain, combinations
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
T.append(time())
s = pd.Series({
'_'.join(c): df[c].min(axis=1).sum()
for c in map(list, filter(None, powerset(df)))
})
T.append(time())
print("packbits {:.3f} powerset {:.3f}".format(*np.diff(T)))
print("results equal", (result.sort_index()[1:]==s.sort_index()).all())
This gives the same result as the powerset approach but literally 1000x faster:
packbits 0.016 powerset 21.974
results equal True
If you have just values of 1 and 0, you could do:
df= pd.DataFrame({
'a': [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1],
'b': [1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0],
'c': [1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1],
'd': [1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],
})
(df.a * df.b).sum()
This results in 4.
To get all combinations you can use combinations from itertools:
from itertools import combinations
analyze=[(col,) for col in df.columns]
analyze.extend(combinations(df.columns, 2))
for cols in analyze:
    num_ser = None
    for col in cols:
        if num_ser is None:
            # copy so the in-place *= below does not overwrite the column in df
            num_ser = df[col].copy()
        else:
            num_ser *= df[col]
    num = num_ser.sum()
    print(f'{cols} contains {num}')
This results in:
('a',) contains 4
('b',) contains 7
('c',) contains 11
('d',) contains 23
('a', 'b') contains 4
('a', 'c') contains 4
('a', 'd') contains 4
('b', 'c') contains 7
('b', 'd') contains 7
('c', 'd') contains 11
A co-occurrence matrix is all you need.
Let's construct an example first:
import numpy as np
import pandas as pd
mat = np.zeros((5,5))
mat[0,0] = 1
mat[0,1] = 1
mat[1,0] = 1
mat[2,1] = 1
mat[3,3] = 1
mat[3,4] = 1
mat[2,4] = 1
cols = ['a','b','c','d','e']
df = pd.DataFrame(mat,columns=cols)
print(df)
a b c d e
0 1.0 1.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0 1.0
3 0.0 0.0 0.0 1.0 1.0
4 0.0 0.0 0.0 0.0 0.0
Now we construct the co-occurrence matrix:
# entry (x, y) counts the rows where columns x and y are both 1
co_df = df.T.dot(df)
print(co_df)
a b c d e
a 2.0 1.0 0.0 0.0 0.0
b 1.0 2.0 0.0 0.0 1.0
c 0.0 0.0 0.0 0.0 0.0
d 0.0 0.0 0.0 1.0 1.0
e 0.0 1.0 0.0 1.0 2.0
Finally the result you need:
result = {}
for c1 in cols:
    for c2 in cols:
        if c1 == c2:
            if c1 not in result:
                result[c1] = co_df[c1][c2]
        else:
            if '_'.join([c1,c2]) not in result:
                result['_'.join([c1,c2])] = co_df[c1][c2]
print(result)
{'a': 2.0, 'a_b': 1.0, 'a_c': 0.0, 'a_d': 0.0, 'a_e': 0.0, 'b_a': 1.0, 'b': 2.0, 'b_c': 0.0, 'b_d': 0.0, 'b_e': 1.0, 'c_a': 0.0, 'c_b': 0.0, 'c': 0.0, 'c_d': 0.0, 'c_e': 0.0, 'd_a': 0.0, 'd_b': 0.0, 'd_c': 0.0, 'd': 1.0, 'd_e': 1.0, 'e_a': 0.0, 'e_b': 1.0, 'e_c': 0.0, 'e_d': 1.0, 'e': 2.0}
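The dictionary above lists every pair twice (a_b and b_a). If each unordered pair should appear only once, itertools.combinations can drive the lookup instead of the nested loop. This is a sketch on the same 5x5 example, built with an integer dtype so that the counts come out as ints.

```python
from itertools import combinations
import numpy as np
import pandas as pd

# Same 5x5 example as above, built directly as integers
mat = np.zeros((5, 5), dtype=int)
for r, c in [(0, 0), (0, 1), (1, 0), (2, 1), (3, 3), (3, 4), (2, 4)]:
    mat[r, c] = 1
cols = ["a", "b", "c", "d", "e"]
df = pd.DataFrame(mat, columns=cols)

co_df = df.T.dot(df)  # co_df.loc[x, y] = number of rows where x and y are both 1

# Diagonal entries are the single-column counts
result = {c: int(co_df.loc[c, c]) for c in cols}
# Each unordered pair appears once, so 'a_b' is kept and 'b_a' is not
result.update({f"{c1}_{c2}": int(co_df.loc[c1, c2]) for c1, c2 in combinations(cols, 2)})
print(result["a"], result["a_b"], result["d_e"])  # 2 1 1
```

Note that a co-occurrence matrix only covers single columns and pairs; combinations of three or more columns need one of the other approaches.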

Match letter frequency within a word against 26 letters in R (or python)

Currently, I have a string "abdicator". I would like to find the frequency of each letter in this word, counted against all 26 letters of the English alphabet, with output in the following form.
Output:
a b c d e f g h i ... o ... r s t ... x y z
2 1 1 1 0 0 0 0 1 ... 1 ... 1 0 1 ... 0 0 0
This output can be a numeric vector (with names being the 26 letters). My initial attempt was to first use strsplit function to split the string into individual letters (using R):
strsplit("abdicator","") #split at every character
#[[1]]
#[1] "a" "b" "d" "i" "c" "a" "t" "o" "r"
However, I am a little stuck as to what to do for the next step. Can someone enlighten me on this please? Many thanks.
In R:
table(c(letters, strsplit("abdicator", "")[[1]]))-1
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# 2 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0
And extending that a bit to handle the possibility of multiple words and/or capital letters:
words <- c("abdicator", "Syzygy")
letterCount <- function(X) table(c(letters, strsplit(tolower(X), "")[[1]]))-1
t(sapply(words, letterCount))
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# abdicator 2 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0
# syzygy 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 3 1
In Python:
>>> from collections import Counter
>>> s = "abdicator"
>>> Counter(s)
Counter({'a': 2, 'b': 1, 'd': 1, 'i': 1, 'c': 1, 't': 1, 'o': 1, 'r': 1})
>>> list(map(Counter(s).__getitem__, map(chr, range(ord('a'), ord('z')+1))))
[2, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
Or:
>>> import string
>>> list(map(Counter(s).__getitem__, string.ascii_lowercase))
[2, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
Python:
import collections
import string
counts = collections.Counter('abdicator')
chars = string.ascii_lowercase
print(*chars, sep=' ')
print(*[counts[char] for char in chars], sep=' ')
In Python 2:
import string, collections
ctr = collections.Counter('abdicator')
for l in string.ascii_lowercase:
    print l,
print
for l in string.ascii_lowercase:
    print ctr[l],
print
In Python 3, only the syntax of print changes.
This produces exactly the output you requested. The core idea is that a collections.Counter, indexed with a missing key, humbly returns 0, with the obvious semantics "this key has been seen 0 times". That is fully aligned with the semantics it uses for keys that are present, where it returns their count, i.e., the number of times they have been seen.

correlation for two lists of data

These two lists contain data something like this:
a = [1 2 1 3 1 2 1 1 1 2 1 1 2 1 4 1 ]
b = [ 3480. 7080. 10440. 13200. 16800. 20400. 23880. 27480. 30840. 38040. 41520. 44880. 48480. 52080. 55680. 59280.]
How can I find the correlation between them in Python by calling R's cor function through rpy2? The output has to lie between -1 and +1.
from rpy2.robjects.vectors import FloatVector
from rpy2.robjects.packages import importr
stats = importr('stats')
a = [1, 2, 1, 3, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 4, 1]
b = [3480, 7080, 10440, 13200, 16800, 20400, 23880, 27480,
     30840, 38040, 41520, 44880, 48480, 52080, 55680, 59280]
result = stats.cor(FloatVector(a), FloatVector(b))
print(result[0])  # cor returns an R vector; its first element is the correlation, a value in [-1, 1]
The documentation for rpy2 has many other examples about how to use it.
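If R is not available, the same Pearson coefficient can be cross-checked with NumPy alone. Note that this swaps rpy2 for np.corrcoef, so it is a sanity check rather than an answer to the rpy2 question as asked.

```python
import numpy as np

a = [1, 2, 1, 3, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 4, 1]
b = [3480, 7080, 10440, 13200, 16800, 20400, 23880, 27480,
     30840, 38040, 41520, 44880, 48480, 52080, 55680, 59280]

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r(a, b)
r = np.corrcoef(a, b)[0, 1]
print(-1.0 <= r <= 1.0)  # True
```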

numpy/pandas: How to convert a series of strings of zeros and ones into a matrix

I have data that arrives in this format:
[
(1, "000010101001010101011101010101110101", "aaa", ... ),
(0, "111101010100101010101110101010111010", "bb", ... ),
(0, "100010110100010101001010101011101010", "ccc", ... ),
(1, "000010101001010101011101010101110101", "ddd", ... ),
(1, "110100010101001010101011101010111101", "eeee", ... ),
...
]
In tuple format, it looks like this:
(Y, X, other_info, ... )
At the end of the day, I need to train a classifier (e.g. sklearn.linear_model.logistic.LogisticRegression) using Y and X.
What's the most straightforward way to turn the string of ones and zeros into something like a np.array, so that I can run it through the classifier? Seems like there should be an easy answer here, but I haven't been able to think of/google one.
A few notes:
I'm already using numpy/pandas/sklearn, so anything in those libraries is fair game.
For a lot of what I'm doing, it's convenient to have the other_info columns together in a DataFrame
The strings are pretty long (~20,000 characters each, i.e. ~20,000 columns), but the total data frame is not very tall (~500 rows).
Since you asked primarily for a way to convert a string of ones and zeros into a numpy array, I'll offer my solution as follows:
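import numpy as np
d = '0101010000' * 2000 # create a 20,000 long string of 1s and 0s
# np.fromstring(d, 'int8') did this on older NumPy, but its binary mode is deprecated; np.frombuffer is the replacement
d_array = np.frombuffer(d.encode('ascii'), dtype='int8') - 48 # 48 is ASCII '0'; ASCII '1' is 49
This compares favourably to @DSM's solution in terms of speed: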
In [21]: timeit numpy.fromstring(d, dtype='int8') - 48
10000 loops, best of 3: 35.8 us per loop
In [22]: timeit numpy.fromiter(d, dtype='int', count=20000)
100 loops, best of 3: 8.57 ms per loop
How about something like this:
Make the dataframe:
In [82]: v = [
....: (1, "000010101001010101011101010101110101", "aaa"),
....: (0, "111101010100101010101110101010111010", "bb"),
....: (0, "100010110100010101001010101011101010", "ccc"),
....: (1, "000010101001010101011101010101110101", "ddd"),
....: (1, "110100010101001010101011101010111101", "eeee"),
....: ]
In [83]:
In [83]: df = pandas.DataFrame(v)
We can use fromiter or array to get an ndarray:
In [84]: d ="000010101001010101011101010101110101"
In [85]: np.fromiter(d, int) # better: np.fromiter(d, int, count=len(d))
Out[85]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
In [86]: np.array(list(d), int)
Out[86]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
There might be a slick vectorized way to do this, but I'd just apply the obvious per-entry function to the values and get on with my day:
In [87]: df[1]
Out[87]:
0 000010101001010101011101010101110101
1 111101010100101010101110101010111010
2 100010110100010101001010101011101010
3 000010101001010101011101010101110101
4 110100010101001010101011101010111101
Name: 1
In [88]: df[1] = df[1].apply(lambda x: np.fromiter(x, int)) # better with count=len(x)
In [89]: df
Out[89]:
0 1 2
0 1 [0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 aaa
1 0 [1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 bb
2 0 [1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 ccc
3 1 [0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 ddd
4 1 [1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 eeee
In [90]: df[1][0]
Out[90]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
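To get from the per-row arrays to something a classifier can consume, the rows can be stacked into one 2-D matrix. The following is a sketch under two assumptions: all bit strings have the same length, and the helper name bits_to_array is made up for illustration. np.frombuffer is used here instead of np.fromstring, whose binary mode is deprecated in recent NumPy.

```python
import numpy as np

# Rows in the (Y, X, other_info) layout from the question
rows = [
    (1, "000010101001010101011101010101110101", "aaa"),
    (0, "111101010100101010101110101010111010", "bb"),
    (0, "100010110100010101001010101011101010", "ccc"),
]

def bits_to_array(s):
    """Turn a string of '0'/'1' characters into an int8 array of 0s and 1s."""
    # ASCII '0' is 48 and '1' is 49, so subtracting 48 leaves 0 or 1
    return np.frombuffer(s.encode("ascii"), dtype=np.int8) - 48

y = np.array([label for label, _, _ in rows])
# One row per sample, one column per character; requires equal-length strings
X = np.vstack([bits_to_array(bits) for _, bits, _ in rows])
print(X.shape)  # (3, 36)
```

X and y can then be passed to an estimator's fit method, while the other_info columns stay in the DataFrame.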
