Transform Pandas dataframe into frequency matrix - python

I'm trying to transform a pandas dataframe with three columns (Date, Start, End) into a frequency matrix. My input dataframe looks like this:
Date, Start, End
2016-09-02 09:16:00 18 16
2016-09-02 16:14:10 16 1
2016-09-02 06:17:21 18 17
2016-09-02 05:51:07 23 17
2016-09-02 18:34:44 18 17
2016-09-02 05:44:44 20 4
2016-09-02 09:25:22 18 17
2016-09-02 22:27:44 18 17
2016-09-02 16:02:46 0 18
2016-09-02 15:35:07 17 17
2016-09-02 16:06:42 8 17
2016-09-02 14:47:04 16 23
2016-09-02 07:47:24 20 1
...
The values of 'Start' and 'End' are integers between 0 and 23 inclusive. The 'Date' is a datetime. The frequency matrix I'm trying to create is a 24 by 24 csv, where row i and column j is the number of times 'End'=i and 'Start'=j occurs in the input. For example, the above data would create:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0
2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0
5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
17, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 0, 0, 0, 0, 1
18, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0
For extra help, could this be done in a way that creates a separate matrix for every 15 minutes? That would be 672 matrices as this date range is one week.
I'm a self-taught beginner and I really can't think of how to solve this in a pythonic way. Any solutions or advice would be greatly appreciated.

Create your matrix with a simple count, then unstack one of the two grouping columns:
mat = df.groupby(['Start', 'End']).count().unstack(level=0)
Clean up the Date level:
mat.columns = mat.columns.droplevel(0)
Now reindex rows and columns to the full 0..23 range, fill the missing pairs with zeros and cast to integers:
mat = mat.reindex(index=range(24), columns=range(24)).fillna(0).astype(int)
Detailed explanations
First, you count the number of times each (Start, End) pair occurs. Grouping by these two columns returns a result indexed by a MultiIndex.
df.groupby(['Start', 'End']).count()
Out[134]:
           Date
Start End
0     18      1
8     17      1
16    1       1
      23      1
17    17      1
18    16      1
      17      4
20    1       1
      4       1
23    17      1
What we want from that result is to get the Start index in columns. unstack does this:
df.groupby(['Start', 'End']).count().unstack(level=0)
Out[135]:
       Date
Start    0    8   16   17   18   20   23
End
1      NaN  NaN  1.0  NaN  NaN  1.0  NaN
4      NaN  NaN  NaN  NaN  NaN  1.0  NaN
16     NaN  NaN  NaN  NaN  1.0  NaN  NaN
17     NaN  1.0  NaN  1.0  4.0  NaN  1.0
18     1.0  NaN  NaN  NaN  NaN  NaN  NaN
23     NaN  NaN  1.0  NaN  NaN  NaN  NaN
The result of unstack is that Start is moved into the columns as an additional index level, alongside the existing Date level (see below). That's why we drop level 0 afterwards. Another option, depending on your code, is to count a single column (for example by selecting Date right after the groupby), so that unstack produces only one column level.
_.columns
Out[136]:
MultiIndex(levels=[['Date'], [0, 8, 16, 17, 18, 20, 23]],
labels=[[0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6]],
names=[None, 'Start'])
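The steps above can be put together in a short sketch. The small inline frame is an assumption for illustration (a few rows copied from the question), and `size()` is swapped in for `count()` so that no extra Date level appears in the columns.

```python
import pandas as pd

# A few rows from the question, typed in by hand for illustration.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-09-02 09:16:00", "2016-09-02 16:14:10", "2016-09-02 06:17:21",
        "2016-09-02 05:51:07", "2016-09-02 18:34:44", "2016-09-02 05:44:44",
    ]),
    "Start": [18, 16, 18, 23, 18, 20],
    "End": [16, 1, 17, 17, 17, 4],
})

hours = range(24)

# Count each (End, Start) pair, then move Start into the columns.
mat = df.groupby(["End", "Start"]).size().unstack(level="Start")

# Force the full 24 x 24 grid, fill missing pairs with 0, cast to int.
mat = mat.reindex(index=hours, columns=hours).fillna(0).astype(int)

print(mat.loc[17, 18])  # pairs with End=17 and Start=18
```

Calling `mat.to_csv(...)` on the result would then write the 24 by 24 grid, with the hour labels as the first row and first column.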

Bit late but for anyone who's here:
There is a function explicitly for this called pd.crosstab()
https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
You will want to use it like:
output = pd.crosstab(df["Start"], df["End"])
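For the layout asked for in the question (rows are End, columns are Start), End would be passed first. The sketch below also shows one possible way to get a separate matrix per 15-minute window by flooring Date. The sample frame, the helper `freq_matrix` and the `matrices` dictionary are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-09-02 09:16:00", "2016-09-02 09:25:22",
        "2016-09-02 16:02:46", "2016-09-02 16:06:42",
    ]),
    "Start": [18, 18, 0, 8],
    "End": [16, 17, 18, 17],
})

hours = range(24)

def freq_matrix(frame):
    """24 x 24 counts with End as rows and Start as columns."""
    counts = pd.crosstab(frame["End"], frame["Start"])
    return counts.reindex(index=hours, columns=hours, fill_value=0)

overall = freq_matrix(df)

# One matrix per 15-minute window, keyed by the window's start time.
windows = df["Date"].dt.floor("15min")
matrices = {start: freq_matrix(group) for start, group in df.groupby(windows)}
```

Windows that contain no rows do not show up in `matrices`; if all 672 windows of the week are needed, the missing keys would have to be filled with all-zero matrices separately.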

Related

Indexing 2 dimensional array in python

I've been trying to change a single item in a 2-dimensional array in python using the syntax x[2][3]=1 but instead of just changing the item in the 2nd row 3rd column, it ends up changing the values of all of the 3rd column. My code is below:
population = [[0]*20]*5
population[2][3] = 1
for row in population:
    print(row)
This outputs
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
but I only want
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
How would I index the item such that it only changes the 2nd row and 3rd column?
I'm using python 3.7.4 on repl.it
Link here: https://repl.it/#ajqe/2d-array-test
Use:
population = [[0]*20 for _ in range(5)]
to generate the lists instead. The method you are using is referencing the same object 5 times, instead of creating 5 separate lists. To check this you can use the is operator:
>>> population = [[0]*20]*5
>>> population[0] is population[1]
True
>>> population = [[0]*20 for _ in range(5)]
>>> population[0] is population[1]
False
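As a quick check of the explanation above, this sketch repeats the assignment from the question on a list built with the comprehension. The names `rows` and `cols` are only introduced here for readability.

```python
rows, cols = 5, 20

# Each iteration builds a new inner list, so the rows are independent.
population = [[0] * cols for _ in range(rows)]
population[2][3] = 1

for row in population:
    print(row)
```

Only the row at index 2 now contains a 1, because no two rows share the same list object.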

Compute Cosine Similarity Within Groups

I have a dataframe that consists of rows like the following. My goal is to compute the cosine similarity of every row with every other row within the same category, so that I end up with a dataframe with 3 columns: category, vecs, and dist, where dist is an array of length n that contains the distance between that row and every row within the same category.
category vecs
0 a [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
1 a [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
2 b [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
3 b [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
The inefficient solution that I've thought of would be to loop through each row, check whether the category is equal, and if so compute the distance and add it to a list, otherwise continue the loop. This solution would be n^2 though, and I'm looking for something more efficient. I have 8115 rows in this dataframe and am looking for something that would possibly scale to even larger datasets.
The other possible solution I've looked at would be using sklearn pairwise distance (metric = cosine) and somehow only include computations with same categories, but I'm struggling to think about how to do this.
Would someone be willing to help or suggest a different efficient solution?
You need to do the (more or less) n(n-1)/2 computations.
This is irreducible, since the similarities have to be computed somehow if there is no hidden structure in the vectors.
You can use scipy's pdist to compute the pairwise distances. It returns only the flattened upper triangle, and the squareform function expands that into a regular symmetric matrix:
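import numpy as np
from scipy.spatial.distance import pdist, squareform

similarities = dict()
for cat, group in df.groupby("category"):
    # Stack this category's vectors into a 2-D array, one row per vector.
    a = tuple(row.vecs for _, row in group.iterrows())
    b = np.array(a)
    # pdist returns cosine distances, so 1 - distance is the similarity.
    # squareform leaves the diagonal (self-similarity) at 0.
    sim_mat = squareform(1 - pdist(b, metric='cosine'))
    similarities[cat] = sim_mat

for k, v in similarities.items():
    print(k, v, sep='\n')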
a
[[0. 1.]
[1. 0.]]
b
[[0. 0.70710678]
[0.70710678 0. ]]
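As an alternative sketch, the same per-category matrices can be computed with plain numpy by scaling each vector to unit length and taking dot products. Unlike the `squareform` approach above, this puts 1 on the diagonal. The tiny frame below is made up for illustration, and the code assumes that no vector is all zeros.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b"],
    "vecs": [[1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 1]],
})

similarities = {}
for cat, group in df.groupby("category"):
    X = np.array(group["vecs"].tolist(), dtype=float)
    # Scale each row to unit length; the dot products are then cosines.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    similarities[cat] = unit @ unit.T
```

The work is still quadratic within each category, but it is done in one matrix product per group rather than in a Python loop over pairs.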

Python: Given an adjacency matrix in form of an array of arrays - how can I get the connected components?

Given an adjacency matrix of an undirected graph in form of an array of arrays in python, how can I get the connected components in form of (row,col,class)?
I already used scipy.sparse.csgraph.connected_components(adjmx) - yet what I got was only a list of the connected component labels. How can I get their precise location (meaning: row and col and label)?
Here is an example: given a sparse matrix M which belongs to an undirected graph G:
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
I run SciPy's connected components algorithm on it:
ccres = sp.sparse.csgraph.connected_components(M, directed=False)
What it yields is:
a) the number of connected components: 7
b) an array:
[0 1 2 2 2 3 3 3 2 2 4 5 2 2 2 2 2 0 3 3 6 3 3 3 6]
What I need now is output in the form (row, col, cc label). What exactly does the array I get in b) mean?
EDIT: The solution provided in this post: Finding connected components in a pixel-array actually did the trick! It proposes also how to recover the rows and columns indices of a given label.
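On the meaning of the array in b): it has one entry per node, and entry i is the component label of node i, that is, of row i and column i of the matrix. A minimal sketch of turning that into (row, col, label) triples follows. It is not the code from the linked post; the 4 by 4 matrix and the name `triples` are assumptions for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

# Small symmetric adjacency matrix: nodes 0 and 3 are linked.
M = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
])

n_components, labels = connected_components(M, directed=False)

# Every nonzero cell (r, c) is an edge inside one component, so the
# label of its row node is also the label of its column node.
rows, cols = np.nonzero(M)
triples = [(int(r), int(c), int(labels[r])) for r, c in zip(rows, cols)]

for t in triples:
    print(t)
```

Because an edge always joins two nodes of the same component, looking up `labels[r]` or `labels[c]` gives the same value for each listed cell.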

Python error while using the LDA in Sklearn

I'm trying to use LinearDiscriminantAnalysis from sklearn. Here is what I've done so far:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import preprocessing
import numpy as np
import pandas as pd
# Reading csv file
training_file = 'Training.csv'
testing_file = 'Test.csv'
dataframe_train = pd.read_csv(training_file)
dataframe_test = pd.read_csv(testing_file)
dataframe_train['onehot_code'] = dataframe_train.apply(lambda x : onehot_processing(int(float(x['Onehot'])), numberOFclasses), axis=1)
dataframe_test['onehot_code'] = dataframe_test.apply(lambda x:onehot_processing(int(float(x['Onehot'])),numberOFclasses),axis=1)
stdsc = preprocessing.StandardScaler()
np_scaled_train = stdsc.fit_transform(dataframe_train.iloc[:,:-3])
np_scaled_test = stdsc.transform(dataframe_test.iloc[:,:-3])
lda = LinearDiscriminantAnalysis(n_components=2)
Training_Frame = lda.fit_transform(np_scaled_train,dataframe_train.iloc[:,-1]) # the script crashes here
Testing_Frame = lda.transform(np_scaled_test)
The error message that I get is:
ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.
The shapes of the dataframes are correct, so I don't see what I'm missing or what I should convert so that the function accepts the parameter. Or is the cause something else?
I'll be grateful for any hint!
Update
Here's what dataframe_train.iloc[:,-1] looks like:
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
8 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
10 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
11 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
12 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
13 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
14 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
15 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
16 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
17 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
18 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
19 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
20 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
21 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
22 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
23 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
24 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
25 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
26 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
27 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
28 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
29 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
...
2328 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2329 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2330 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2331 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2332 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2333 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2334 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2335 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2336 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2337 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2338 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2339 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2340 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2341 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2342 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2343 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2344 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2345 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2346 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2347 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2348 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2349 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2350 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2351 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2352 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2353 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2354 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2355 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2356 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2357 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: onehot_code, dtype: object
Each row is a vector of 20 elements.
2nd update
Running the following:
Training_Frame = lda.fit_transform(np_scaled_train,np.asarray(dataframe_train.iloc[:,-1]))
delivers this error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-a8adf693ad9e> in <module>()
----> 1 Training_Frame = lda.fit_transform(np_scaled_train,np.asarray(dataframe_train.iloc[:,-1]))
c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
495 else:
496 # fit method of arity 2 (supervised transformation)
--> 497 return self.fit(X, y, **fit_params).transform(X)
498
499
c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\discriminant_analysis.py in fit(self, X, y, store_covariance, tol)
441 self.tol = tol
442 X, y = check_X_y(X, y, ensure_min_samples=2, estimator=self)
--> 443 self.classes_ = unique_labels(y)
444
445 if self.priors is None: # estimate priors from sample
c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\utils\multiclass.py in unique_labels(*ys)
77 # Check that we don't mix label format
78
---> 79 ys_types = set(type_of_target(x) for x in ys)
80 if ys_types == set(["binary", "multiclass"]):
81 ys_types = set(["multiclass"])
c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\utils\multiclass.py in <genexpr>(.0)
77 # Check that we don't mix label format
78
---> 79 ys_types = set(type_of_target(x) for x in ys)
80 if ys_types == set(["binary", "multiclass"]):
81 ys_types = set(["multiclass"])
c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\utils\multiclass.py in type_of_target(y)
248 if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
249 and not isinstance(y[0], string_types)):
--> 250 raise ValueError('You appear to be using a legacy multi-label data'
251 ' representation. Sequence of sequences are no'
252 ' longer supported; use a binary array or sparse'
ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.
This is what worked for me when I tried to reproduce your example:
y_train = dataframe_train.iloc[:,-1]
y_test = dataframe_test.iloc[:,-1]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)
lda = LinearDiscriminantAnalysis(n_components=2)
Training_Frame = lda.fit_transform(np_scaled_train, y_train)
Testing_Frame = lda.transform(np_scaled_test)
The error is most probably due to how pandas handles lists stored in a column and how numpy interprets them. scikit-learn checks whether the supplied y is a numpy array of a supported dtype (int, float, string, etc.), but in your case df.iloc[:, -1] returns a pandas.Series of lists, which converts to a numpy array with dtype=object whose elements are sequences. Hence the error.
One more workaround is (without using any of the code above):
Training_Frame = lda.fit_transform(np_scaled_train,
                                   np.array([np.array(r) for r in dataframe_train.iloc[:,-1]]))
Hope it works for you.
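Another option worth considering, since LinearDiscriminantAnalysis documents y as a one-dimensional array of class labels, is to turn each one-hot vector back into a single class index. The sketch below only shows that conversion. The three-class `onehot` Series is a made-up stand-in for dataframe_train.iloc[:,-1].

```python
import numpy as np
import pandas as pd

# Stand-in for dataframe_train.iloc[:, -1]: a Series of one-hot lists.
onehot = pd.Series([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])

# Stack the lists into a 2-D array, then take the position of the 1.
y = np.array(onehot.tolist()).argmax(axis=1)

print(y)        # [0 1 2 1]
print(y.shape)  # (4,)
```

An array like `y` could then be passed as the second argument of `lda.fit_transform` in place of the column of lists.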

Performing Arithmetic operations on nested dataframe containing a list

I have a dataframe called dailyHistogram defined as follows:
dailyHistogram = pd.DataFrame({'NumVisits': [[0 for x in range(1440)]
                                             for y in range(180)],
                               'DoW': [0]*ReportingDateRange.size},
                              columns=['NumVisits', 'DoW'],
                              index=ReportingDateRange)
Where NumVisits is a two-dimensional array (1440 by 180) and holds a histogram of some activity in 180 days. DoW is simply a column which holds the day of the week.
The index in this dataframe is the dates on which the activities occurred.
My problem is in performing any operations on dailyHistogram["NumVisits"].
Here's what dailyHistogram["NumVisits"] looks like:
> dailyHistogram["NumVisits"]
> Out[193]:
> 2016-01-01 [5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
>
> 2016-01-02 [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
>
> 2016-01-03 [6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
>
> 2016-01-04 [8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
>
> 2016-06-26 [3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
>
> 2016-06-27 [4, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
>
> 2016-06-28 [7, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
>
> 2016-06-29 [7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
>
> 2016-06-30 [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
>
> Freq: D, Name: NumVisits, dtype: object
I'd like to sum up all vectors in "NumVisits" for a particular day of the week, but no arithmetic operations seem to be possible on dailyHistogram["NumVisits"]
That is because each entry of NumVisits is a Python list, and to perform arithmetic on the contents of a list you need to apply your functions explicitly (df below stands for dailyHistogram). For example, for the total of each row's list:
df['NumVisits'].apply(sum)
For a running (cumulative) sum within each row:
import numpy as np
df['NumVisits'].apply(np.cumsum)
For the element-wise sum across all rows (one total per position in the list):
np.array(dailyHistogram['NumVisits'].tolist()).sum(axis=0)
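For the per-weekday sum asked about in the question, one possible sketch is to expand the list column into a regular table and group it by DoW. The 14-day frame with 3 bins per day is made up for illustration (the real data has 180 days and 1440 bins), and `visits` and `per_dow` are names introduced here.

```python
import pandas as pd

dates = pd.date_range("2016-01-01", periods=14, freq="D")
dailyHistogram = pd.DataFrame(
    {
        "NumVisits": [[i, 1, 0] for i in range(14)],
        "DoW": list(dates.dayofweek),  # Monday=0 ... Sunday=6
    },
    index=dates,
)

# Expand the list column into a regular 2-D table: one column per bin.
visits = pd.DataFrame(
    dailyHistogram["NumVisits"].tolist(), index=dailyHistogram.index
)

# Sum the vectors of all days that share the same day of week.
per_dow = visits.groupby(dailyHistogram["DoW"]).sum()
print(per_dow)
```

Each row of `per_dow` is the element-wise sum of the NumVisits vectors for one day of the week, so `per_dow.loc[0]` would hold the combined histogram for Mondays.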
