Co-occurrence matrix in Python - python

I have a huge data set (a sample is below), and I need to compute the co-occurrence matrix of the skill column. I read about co-occurrence matrices, and CountVectorizer from scikit-learn shed some light, so I wrote the code below, but I am confused about how to interpret the results. If anyone can help, please do; the sample data and the code I tried are below.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df1 = pd.DataFrame([["1000074", "6284 6295"], ["75634786", "4044 4714 5789 6076 6077 6079 6082 6168 6229"], ["75635714", "4092 4420 4430 4437 4651"]], columns=['people_id', 'skills_id'])
count_vect = CountVectorizer(ngram_range=(1, 1), lowercase=False)
X_counts = count_vect.fit_transform(df1['skills_id'])
Xc = (X_counts.T * X_counts)   # token-by-token co-occurrence counts
Xc.setdiag(0)                  # ignore co-occurrence of a token with itself
print(Xc.todense())
I am pretty new to this idea of a co-occurrence matrix over numbers; word-to-word co-occurrence I can understand, but how do I read and understand this result?

Well, you can think of it just like a word-to-word co-occurrence matrix. Here, assuming your skill column is that 2nd column of 4-digit numbers, it first looks at all unique possible values:
>>> count_vect.get_feature_names()
['4044',
'4092',
'4420',
'4430',
'4437',
'4651',
'4714',
'5789',
'6076',
'6077',
'6079',
'6082',
'6168',
'6229',
'6284',
'6295']
That's an array of size 16, which represents the 16 different words found in your skill column. Indeed, sklearn's CountVectorizer() (from sklearn.feature_extraction.text) finds the words by splitting your strings on the space delimiter.
The final matrix you see with print(Xc.todense()) is just the co-occurrence matrix for these 16 words. That's why it has shape (16, 16).
To make it clearer (please forgive the column alignment formatting), you could look at:
>>> pd.DataFrame(Xc.todense(),
...              columns=count_vect.get_feature_names(),
...              index=count_vect.get_feature_names())
4044 4092 4420 ...
4044 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0
4092 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0
4420 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0
4430 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0
4437 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0
4651 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
4714 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0
5789 1 0 0 0 0 0 1 0 1 1 1 1 1 1 0 0
6076 1 0 0 0 0 0 1 1 0 1 1 1 1 1 0 0
6077 1 0 0 0 0 0 1 1 1 0 1 1 1 1 0 0
6079 1 0 0 0 0 0 1 1 1 1 0 1 1 1 0 0
6082 1 0 0 0 0 0 1 1 1 1 1 0 1 1 0 0
6168 1 0 0 0 0 0 1 1 1 1 1 1 0 1 0 0
6229 1 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0
6284 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
6295 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
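For instance, with the labels in place you can read off a single pair directly (a small usage sketch reusing df1, count_vect and Xc from above):
cooc = pd.DataFrame(Xc.todense(),
                    index=count_vect.get_feature_names(),
                    columns=count_vect.get_feature_names())
print(cooc.loc["6076", "6082"])  # 1 -> one person lists both skills
print(cooc.loc["6284", "4044"])  # 0 -> these two skills never appear together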
tl;dr In that case, since you input strings, whether they are numbers (e.g. "23") or nouns (e.g. "cat") doesn't change anything. The co-occurrence matrix still shows, for each pair of tokens, whether (and in how many rows) they are found together; in this small example those values are all 0 or 1. The default tokenizer for CountVectorizer() just splits on spaces.
What exactly would you have expected to be different with numbers?

Related

Python pandas dataframe boolean Series/Column based on conditional next columns

I'm having trouble describing exactly what I want to achieve. I've tried looking here on Stack to find others with the same problem, but I'm unable to find any. So I will try to describe exactly what I want and give you a sample setup code.
I would like to have a function that gives me a new column/pd.Series. This new column has boolean TRUE values (or ints) that are based on a certain condition.
The condition is as follows. There are N columns (8 in the example), each with the same name but ending in a different number, i.e. column_1, column_2, etc. The function I need is twofold:
If N is given, look at each column's row and check whether it and the rows of the next N columns are also TRUE/1.
If N is NOT given, look at each column's row and check whether the rows of all following columns are also TRUE/1, using the numbers as IDs to identify the columns.
import numpy as np
import pandas as pd

def get_df_series(df: pd.DataFrame, columns_ids: list, n: int = 8) -> pd.DataFrame:
    for i in columns_ids:
        # missing code here .. I don't know if this would be the way to go
        pass
    return df

def create_dataframe(numbers: list) -> pd.DataFrame:
    df = pd.DataFrame()  # empty df
    # create a column for each number, with the number as ID and random boolean values as ints
    for i in numbers:
        df[f'column_{i}'] = np.random.randint(2, size=20)
    return df

if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5, 6, 7, 8]
    df = create_dataframe(numbers=numbers)
    df = get_df_series(df=df, columns_ids=numbers, n=3)
I have some experience with Pandas dataframes and know how to create IF/ELSE things with np.select for example.
(function) select(condlist: Sequence[ArrayLike], choicelist: Sequence[ArrayLike], default: ArrayLike = ...) -> NDArray
The problem I'm running into is that I don't know how to make a conditional statement if I don't know how many columns are ahead. For example, if I want to know for column_5 if the next 3 are also true, I can hardcode this, but I have columns up to id 20 and would love to not have to hardcode everything from column_1 to column_20 if I want to know if all conditions in all those columns are true.
Now the problem is that I don't know if this is even possible. So any feedback would be appreciated. Even just a hint on where to look for a way to do this would help.
EDIT: What I forgot to mention was that there will be random columns in between that obviously cannot be taken into account. For example, there will be main_column_1, main_column_2, main_column_3, side_column_1, side_column_2, right_column_1, main_column_3, main_column_4, etc...
The answer Corralien gave is correct, but I should've made my question clearer.
I need to be able to, say, look at main_column and for that one look ahead N columns of the same type: main_column.
Try:
n = 3
out = (df.rolling(n, min_periods=1, axis=1).sum()          # sum of each column and the previous n-1 columns
         .shift(-n+1, fill_value=0, axis=1)                 # realign so each column holds the sum of itself and the next n-1
         .eq(n).astype(int)                                 # 1 only if all n values were 1
         .rename(columns=lambda x: 'result_' + x.split('_')[1]))
Output:
>>> out
result_1 result_2 result_3 result_4 result_5 result_6 result_7 result_8
0 1 1 1 1 1 1 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 1 1 1 0 0 0 0
9 0 0 0 0 0 1 0 0
10 0 0 0 0 0 0 0 0
11 0 0 0 0 1 0 0 0
12 0 0 0 0 0 0 0 0
13 0 0 0 1 1 0 0 0
14 0 0 0 0 0 1 0 0
15 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0
17 0 0 1 0 0 0 0 0
18 0 0 1 0 0 0 0 0
19 0 0 0 0 0 0 0 0
Input:
>>> df
column_1 column_2 column_3 column_4 column_5 column_6 column_7 column_8
0 1 1 1 1 1 1 1 1
1 0 1 0 0 0 1 1 0
2 1 1 0 1 0 1 1 0
3 1 0 1 0 0 0 0 0
4 1 0 0 1 1 1 0 1
5 1 1 0 1 0 1 1 0
6 1 0 1 0 0 0 0 1
7 0 0 1 0 0 0 0 0
8 0 1 1 1 1 1 0 0
9 1 0 1 1 0 1 1 1
10 0 0 1 1 0 0 1 1
11 1 0 1 0 1 1 1 0
12 0 1 1 0 1 0 1 0
13 0 0 0 1 1 1 1 0
14 0 0 1 1 0 1 1 1
15 1 0 0 1 0 1 0 0
16 1 0 0 0 0 0 0 1
17 0 0 1 1 1 0 0 1
18 0 0 1 1 1 0 0 1
19 0 0 1 0 0 0 1 0
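Regarding the EDIT about mixed column families: one possible extension (a sketch assuming the main_column_ prefix from the example; otherwise it mirrors the code above) is to select that family first and apply the same rolling trick:
main_cols = [c for c in df.columns if c.startswith('main_column_')]
n = 3
out_main = (df[main_cols].rolling(n, min_periods=1, axis=1).sum()
              .shift(-n+1, fill_value=0, axis=1).eq(n).astype(int)
              .rename(columns=lambda x: 'result_' + x.split('_')[-1]))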

Python make newick format using dataframe with 0s and 1s

I have a dataframe like this
a b c d e f g h i j k l m
mut1 0 0 0 0 0 1 1 1 1 1 1 1 1
mut2 0 0 0 0 0 1 1 1 1 1 0 0 0
mut3 0 0 0 0 0 1 1 0 0 0 0 0 0
mut4 0 0 0 0 0 1 0 0 0 0 0 0 0
mut5 0 0 0 0 0 0 0 1 1 0 0 0 0
mut6 0 0 0 0 0 0 0 1 0 0 0 0 0
mut7 0 0 0 0 0 0 0 0 0 1 0 0 0
mut8 0 0 0 0 0 0 0 0 0 0 1 1 1
mut9 0 0 0 0 0 0 0 0 0 0 1 1 0
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0
and the original corresponding string
(a:0,b:0,c:0,d:0,e:0,f:0,g:0,h:0,i:0,j:0,k:0,l:0,m:0):0
The algorithm I thought of is like this.
In row mut1, we can see that f,g,h,i,j,k,l,m have the same features.
So the string can be modified into
(a:0,b:0,c:0,d:0,e:0,(f:0,g:0,h:0,i:0,j:0,k:0,l:0,m:0):0):0
In row mut2, we can see that f,g,h,i,j have the same features.
So the string can be modified into
(a:0,b:0,c:0,d:0,e:0,((f:0,g:0,h:0,i:0,j:0):0,k:0,l:0,m:0):0):0
Until mut10, it continues to cluster samples in f,g,h,i,j,k,l,m.
And the output will be
(a:0,b:0,c:0,d:0,e:0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0
(For a row with one "1", just skip the process)
From mut11, it starts to cluster samples a,b,c,d,e
and similarly, the final output will be
(((a:0,b:0):0,c:0):0,d:0,e:0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0
So the algorithm is
Cluster the samples with the same features.
After clustering, add ":0" behind the closing parenthesis.
Any suggestions on this process?
*p.s. I have uploaded a similar question,
Creating a newick format from dataframe with 0 and 1
but this one is more detailed.
Your question asks for a solution in Python, which I'm not familiar with. Hopefully, the following procedure in R will be helpful as well.
What your question describes is a matrix representation of a tree. Such a tree can be retrieved from the matrix with a maximum parsimony method using the phangorn package. To manipulate trees in R, the Newick format is useful. Newick differs from the tree representation in your question by ending with a semicolon.
First, prepare a starting tree in phylo format.
library(phangorn)
tree0 <- read.tree(text = "(a,b,c,d,e,f,g,h,i,j,k,l,m);")
Second, convert your data.frame to a phyDat object, where the rows represent samples and columns features. The phyDat object also requires what levels are present in the data, which is 0 and 1 in this case. Combining the starting tree with the data, we calculate the maximum parsimony tree.
dat0 = read.table(text = " a b c d e f g h i j k l m
mut1 0 0 0 0 0 1 1 1 1 1 1 1 1
mut2 0 0 0 0 0 1 1 1 1 1 0 0 0
mut3 0 0 0 0 0 1 1 0 0 0 0 0 0
mut4 0 0 0 0 0 1 0 0 0 0 0 0 0
mut5 0 0 0 0 0 0 0 1 1 0 0 0 0
mut6 0 0 0 0 0 0 0 1 0 0 0 0 0
mut7 0 0 0 0 0 0 0 0 0 1 0 0 0
mut8 0 0 0 0 0 0 0 0 0 0 1 1 1
mut9 0 0 0 0 0 0 0 0 0 0 1 1 0
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0")
dat1 <- phyDat(data = t(dat0),
               type = "USER",
               levels = c(0, 1))
tree1 <- optim.parsimony(tree = tree0, data = dat1)
plot(tree1)
The result is now a cladogram with no branch lengths. Class phylo is effectively a list, so the zero branch lengths can be added as an extra element.
tree2 <- tree1
tree2$edge.length <- rep(0, nrow(tree2$edge))
Last, we write the tree into a character vector in newick format and remove the semicolon at the end to match the requirement.
tree3 <- write.tree(tree2)
tree3 <- sub(";", "", tree3)
tree3
# [1] "((e:0,d:0):0,(c:0,(b:0,a:0):0):0,((m:0,(l:0,k:0):0):0,((i:0,h:0):0,j:0,(g:0,f:0):0):0):0)"
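If you want to stay in Python, a rough sketch of a related (but not identical) approach is below: it hierarchically clusters the samples on their 0/1 mutation profiles with SciPy and writes the result as Newick with zero branch lengths. The Hamming distance and complete linkage are assumptions, so the topology is not guaranteed to match the maximum parsimony tree above.
from io import StringIO
import pandas as pd
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import pdist

# 0/1 matrix from the question: rows = mutations, columns = samples a..m
table = """mut a b c d e f g h i j k l m
mut1 0 0 0 0 0 1 1 1 1 1 1 1 1
mut2 0 0 0 0 0 1 1 1 1 1 0 0 0
mut3 0 0 0 0 0 1 1 0 0 0 0 0 0
mut4 0 0 0 0 0 1 0 0 0 0 0 0 0
mut5 0 0 0 0 0 0 0 1 1 0 0 0 0
mut6 0 0 0 0 0 0 0 1 0 0 0 0 0
mut7 0 0 0 0 0 0 0 0 0 1 0 0 0
mut8 0 0 0 0 0 0 0 0 0 0 1 1 1
mut9 0 0 0 0 0 0 0 0 0 0 1 1 0
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0"""
df = pd.read_csv(StringIO(table), sep=r"\s+", index_col=0)

profiles = df.T                                   # rows = samples a..m, columns = mutations
Z = linkage(pdist(profiles.values, metric="hamming"), method="complete")

def to_newick(node, labels):
    # Recursively render a SciPy cluster-tree node as Newick with ":0" branch lengths.
    if node.is_leaf():
        return f"{labels[node.id]}:0"
    left = to_newick(node.get_left(), labels)
    right = to_newick(node.get_right(), labels)
    return f"({left},{right}):0"

print(to_newick(to_tree(Z), list(profiles.index)))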

tf.data.experimental.sample_from_datasets not sampling as expected

The documentation seems bare-bones, and the example given in the standard TF tutorial doesn't highlight the behavior I see. Let's say you have an imbalanced dataset of 1s and 0s (pos and neg), and you want to sample at weights [0.5, 0.5] so that you see the positives more frequently. You would do this:
import numpy as np
import tensorflow as tf

pos_ds = tf.data.Dataset.from_tensor_slices(np.ones(shape=(16, 1)))
neg_ds = tf.data.Dataset.from_tensor_slices(np.zeros(shape=(128, 1)))
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
And if I want to see how the pos and neg are distributed as I go through the dataset:
xs = []
for x in resampled_ds:
    xs.append(int(x.numpy()[0]))
xs = np.array(xs)
print(xs)
np.bincount(xs)
I see this:
[1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1
0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
array([128, 16])
There are 128 negatives and 16 positives. If I use this as my train_ds, it will be equivalent to no sampling at all, and worse, the negatives are no longer uniformly distributed across the steps/epoch. I am guessing that the 0.5 sampling happens at the beginning, and once it "runs out" of 1s, it just starts sampling only the zeros. It clearly doesn't sample the 1s with replacement. I think the 1s and 0s will only be 0.5/0.5 if you stop after all the 1s have been sampled.
It looks like this is the behavior, but it isn't the only sensible one. I want to sample the positives multiple times (i.e. sampling with replacement) in one epoch, with approximately equal numbers of pos and neg; is there any option in this API for that? Also, I have data augmentation, so the positives are actually not identical when trained.
You can do something like this for the replacement issue:
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds.repeat(128 // 16), neg_ds], weights=[0.5, 0.5])  # repeat the 16 positives 8 times so both datasets contain 128 examples
And the result is:
[1 1 1 0 0 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1
1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 0 0
0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1
1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 1 1
1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 1 1 0
0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Out[2]: array([128, 128], dtype=int64)
Actually, I also found that the solution is right there in the TF tutorial imbalanced_data.ipynb (I totally missed this one in my own notebook).
pos_ds = pos_ds.shuffle(BUFFER_SIZE).repeat()
neg_ds = neg_ds.shuffle(BUFFER_SIZE).repeat()
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
The tutorial further suggests a heuristic for setting the resampled_steps_per_epoch.
However, shuffle + repeat is still not equivalent to true sampling with replacement for the minority class. A repeat() followed by a shuffle() may do it. I can update this after trying both ways.
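For reference, the repeat-then-shuffle variant mentioned above could look something like this (a sketch; the buffer size of 1024 is an arbitrary assumption):
pos_rs = pos_ds.repeat().shuffle(1024)
neg_rs = neg_ds.repeat().shuffle(1024)
resampled_ds = tf.data.experimental.sample_from_datasets(
    [pos_rs, neg_rs], weights=[0.5, 0.5])
# Shuffling after repeat() lets examples from different passes over the small
# positive set mix inside the shuffle buffer, which behaves more like sampling
# with replacement than shuffle().repeat() within a single pass.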

*solved* How can I check if all rows in a numpy array equal 1 when the number of arrays differs?

I'm looking for a way to test (element-wise) which rows across N arrays are equal to 1. I know there are good ways to do this when the number of arrays is known, which I've found by searching around. However, in my case, I will not (in a time-efficient manner) be able to keep track of the number of arrays I want to test this on. Below is the solution and desired output when comparing two arrays.
I appreciate any help. Thank you!
A = np.array([1,2,3])
B = np.array([1,1,1])
C = np.logical_and(A==1,B==1)
array([ True, False, False])
I could also use np.where(A==1) if I have an array of floats with multiple arrays. However, this only gives me the occurrences where a single value is equal to 1.
Example array (note that it won't always be 3 arrays; it can be 5 or 15 as well):
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0 0
1 0 0
1 0 0
1 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0 0
1 0 0
1 0 0
1 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0 0
1 0 0
1 0 0
1 0 0
0 0 0
0 0 0
0 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 1
1 0 0
1 0 0
1 0 0
1 0 1
1 0 1
1 0 1
1 0 0
1 0 1
1 0 0
1 0 1
1 0 1
1 0 1
1 0 1
1 0 1
1 0 0
0 0 0
0 0 0
0 0 0
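For reference, a minimal sketch of one way to generalize the two-array check above to any number of arrays (the sample values here are made up; the idea is just np.logical_and.reduce, or stacking the arrays as columns):
import numpy as np

arrays = [np.array([1, 2, 3]), np.array([1, 1, 1]), np.array([1, 0, 1])]  # any number of arrays

mask = np.logical_and.reduce([a == 1 for a in arrays])
print(mask)  # [ True False False]

stacked = np.column_stack(arrays)   # same data as a 2D array (rows x arrays)
print((stacked == 1).all(axis=1))   # [ True False False]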

Setting all values within a subsection of a 2D array to another value using NumPy

I have a 2D grid of values. For example, they might look like this:
0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1 0
0 1 1 1 1 1 1 1 1 1 0
0 1 0 0 0 1 1 0 0 1 0
0 1 0 0 0 1 1 0 0 1 0
0 1 0 0 0 1 1 0 0 1 0
0 1 0 0 0 1 1 0 0 1 0
0 1 1 1 1 1 1 1 1 1 0
0 1 1 1 1 1 1 1 1 1 0
0 0 0 0 0 0 0 0 0 0 0
What I would like to do is set all the 0's that are contained within the 1's to a different value.
Essentially what I want to do is get the first instance of 1 and the last instance of 1 in a row or column, and then any 0's within this boundary I would like to set to another value.
I can brute-force it by getting the first and last instances, and then manually setting it, but is there a numpy-way of doing this more efficiently?
In general, skimage has more algorithms to offer if you do more image work, but this particular problem is covered nicely by scipy.ndimage.binary_fill_holes.
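A minimal usage sketch of that suggestion (the small grid and the fill value 2 are just illustrative assumptions):
import numpy as np
from scipy.ndimage import binary_fill_holes

grid = np.array([[0, 0, 0, 0, 0],
                 [0, 1, 1, 1, 0],
                 [0, 1, 0, 1, 0],
                 [0, 1, 1, 1, 0],
                 [0, 0, 0, 0, 0]])

filled = binary_fill_holes(grid == 1)             # True for the 1s and any 0s they enclose
result = np.where(filled & (grid == 0), 2, grid)  # set only the enclosed 0s to 2
print(result)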
