This question already has answers here:
Accessing every 1st element of Pandas DataFrame column containing lists
(5 answers)
Closed 1 year ago.
I have a dataframe like this:
text emotion
0 working add oil [1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
1 you're welcome [0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
7 off to face my exam now [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, ...
12 no, i'm so not la! i want to sleeeeeeeeeeep. [0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, ...
151 i try to register on ebay. when i enter my hom... [1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, ...
18 Swam 6050 yards on just a yogurt for breakfast... [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, ...
19 Alright! [0, 0, 1, 1, 0, 0, 0, 0]
120 Visiting gma. It's getting cold [0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, ...
22 You are very missed [0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, ...
345 ...LOL! You mean Rhode Island...close enough [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, ...
How can I keep only the first number of each list in the emotion column, to get data like this?
text emotion
0 working add oil 1
1 you're welcome 0
7 off to face my exam now 0
12 no, i'm so not la! i want to sleeeeeeeeeeep. 0
151 i try to register on ebay. when i enter my hom... 1
18 Swam 6050 yards on just a yogurt for breakfast... 0
19 Alright! 0
120 Visiting gma. It's getting cold 0
22 You are very missed 0
345 ...LOL! You mean Rhode Island...close enough 0
If the "emotion" column holds lists and not strings:
df["emotion"] = df["emotion"].apply(lambda x: x[0])
print(df)
Prints:
text emotion
0 working add oil 1
1 you're welcome 0
2 off to face my exam now 0
3 no, i'm so not la! i want to sleeeeeeeeeeep. 0
4 i try to register on ebay. when i enter my hom... 1
5 Swam 6050 yards on just a yogurt for breakfast... 0
6 Alright! 0
7 Visiting gma. It's getting cold 0
8 You are very missed 0
9 ...LOL! You mean Rhode Island...close enough 0
If it holds strings, you can convert them to lists using ast.literal_eval:
from ast import literal_eval
df["emotion"] = df["emotion"].apply(literal_eval)
# and then:
df["emotion"] = df["emotion"].apply(lambda x: x[0])
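If you are not sure whether the column holds lists or strings, the two cases above can be folded into one helper. This is only a sketch: the helper name first_element, the new column first_emotion and the tiny sample frame are made up for illustration, and returning None for an empty list is an assumption about what you want there.

```python
from ast import literal_eval

import pandas as pd


def first_element(value):
    """Return the first item of a list, parsing it first if it is a string."""
    if isinstance(value, str):
        value = literal_eval(value)  # "[0, 0, 1]" -> [0, 0, 1]
    return value[0] if len(value) > 0 else None


# Hypothetical sample: one cell is a string, the others are real lists.
df = pd.DataFrame({
    "text": ["working add oil", "you're welcome", "Alright!"],
    "emotion": [[1, 0, 1], "[0, 0, 1]", [0, 0, 1, 1]],
})
df["first_emotion"] = df["emotion"].apply(first_element)
print(df["first_emotion"].tolist())

# When every cell is already a list, the .str accessor can index it directly.
lists_only = pd.Series([[1, 0], [0, 1]])
print(lists_only.str[0].tolist())
```

Writing to a new column keeps the original lists around, which is handy if you later need the other positions too.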
I have a series of lists of integers that I am trying to turn into an array. This is a small snippet of the list I am trying to convert into an array.
['[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]',
'[0, 0, 0, 0, 0, 0, 0, 0, 1, 0]',
'[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]',
'[0, 0, 0, 0, 0, 0, 0, 1, 0, 1]',
'[0, 0, 0, 0, 0, 0, 0, 1, 1, 1]']
I've tried to replace the quotes with .replace, but that hasn't worked out.
sequence = [i.replace(" '' ", ' ') for i in sequence]
You can use ast.literal_eval to turn each string into a list of ints, which gives you a list of lists:
from ast import literal_eval
sequence = [literal_eval(i) for i in sequence]
# [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]]
You can then convert it to a NumPy array:
import numpy as np
array = np.asarray(sequence)
print(array)
Output:
[[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 0 1]
[0 0 0 0 0 0 0 1 1 1]]
Or flatten it into a 1-D pandas array:
import pandas as pd
array = pd.array([item for items in sequence for item in items])
print(array)
Output:
<IntegerArray>
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
Length: 50, dtype: Int64
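As for why the .replace attempt had no effect: the quotes shown when the list is printed only delimit each string and are not characters inside it, so there is nothing for .replace to remove. Below is a minimal sketch of the parsing step with a length check added. The helper name parse_rows and the check itself are my own additions, not part of the answer above.

```python
from ast import literal_eval

import numpy as np

sequence = ['[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]',
            '[0, 0, 0, 0, 0, 0, 0, 0, 1, 0]',
            '[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]']


def parse_rows(rows):
    """Parse string-encoded lists and stack them into a 2-D integer array."""
    parsed = [literal_eval(row) for row in rows]
    lengths = {len(row) for row in parsed}
    if len(lengths) > 1:
        # Rows of unequal length cannot form a rectangular array.
        raise ValueError(f"rows have different lengths: {sorted(lengths)}")
    return np.array(parsed, dtype=int)


array = parse_rows(sequence)
print(array.shape)
```

The length check is optional, but it fails early with a readable message if some rows turn out to be shorter than others.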
I am trying to build a big matrix that contains copies of a smaller row matrix (of variable size) spread along the "diagonal" of the matrix. All the other values are 0. How do I create such a matrix?
I've tried np.put, np.append. Here's what I have so far:
t = [1,2,3]
n=3
m=4
A = np.zeros((2*m,m*n+m),dtype=int)
for i in range(m):
    A[i-1:i-1+t.shape[0], n*(i-1):n*(i-1)+t.shape[1]] += t
print("A= \n",np.matrix(A))
I want the following matrix (I'm sorry I don't know how to show matrix but if someone can help me with this too I would appreciate it a lot) :
A=
[[1 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 2 3 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 2 3 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 2 3 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
It causes the following error:
ValueError: operands could not be broadcast together with shapes (0,0) (1,3) (0,0)
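You can use careful reshaping like so:
import numpy as np
t = [1,2,3]
n=3
m=4
A = np.zeros((2*m,m*n+m),dtype=int)
# view the first m*(row_length+n) entries as m rows, so each row starts n columns further right
A.ravel()[:m*(m*n+m+n)].reshape(m,-1)[:,:len(t)] = t
A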
# array([[1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
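On the error itself: in the question's loop the first iteration has i = 0, so the row slice is -1:0 and the column slice is -3:0, both of which are empty. That is where the (0,0) shape in the message comes from. A plain loop that indexes with i directly might look like the sketch below, where I am assuming t is a 1-D NumPy array rather than a Python list.

```python
import numpy as np

t = np.array([1, 2, 3])  # assumed to be a 1-D array here
n = 3  # column shift between consecutive rows
m = 4  # number of rows that receive a copy of t

A = np.zeros((2 * m, m * n + m), dtype=int)
for i in range(m):
    # Row i gets t starting at column n*i; no "i - 1" offset is needed
    # because range(m) already starts at 0.
    A[i, n * i:n * i + len(t)] = t
print(A)
```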
Build a boolean mask for the 12 target positions and use it for the assignment:
idx = np.zeros(A.shape, dtype=bool)
for i in range(m):
    idx[i, i*n:i*n+len(t)] = True
A[idx] = t*m  # t is a list, so t*m repeats it m times
array([[1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
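If you would rather avoid the loop, integer (fancy) indexing can fill all rows in one assignment. The function below is a sketch: its name staggered_rows and its parameters are hypothetical, and it assumes the shifted copies fit inside the requested shape.

```python
import numpy as np


def staggered_rows(t, m, shape, step=None):
    """Place t in rows 0..m-1 of a zero matrix, shifting it by `step` columns per row."""
    t = np.asarray(t)
    step = len(t) if step is None else step
    A = np.zeros(shape, dtype=t.dtype)
    rows = np.arange(m)[:, None]                              # shape (m, 1)
    cols = step * np.arange(m)[:, None] + np.arange(len(t))   # shape (m, len(t))
    A[rows, cols] = t                                         # t broadcasts over the m rows
    return A


A = staggered_rows([1, 2, 3], m=4, shape=(8, 16))
print(A)
```

Passing a different step lets the copies overlap or leave gaps, which the reshape trick cannot do as easily.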
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I have a numpy array with the following shape (11617, 37). The data is multi class data, and to establish a baseline, I need to find which class (or classes) are the most common.
I have tried this formula and also this
import numpy as np
arr = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
                [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
                [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
axis = 0
u, indices = np.unique(arr, return_inverse=True)
answer = u[np.argmax(np.apply_along_axis(np.bincount, axis, indices.reshape(arr.shape),
                                         None, np.max(indices) + 1), axis=axis)]
I need to find the most frequent combination of the 37 classes in my array
Expected output:
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
To find the most frequent combination (whole rows, which means axis=0), you can try this:
import numpy as np
A = np.array([[1,0,0,0],
              [1,0,0,1],
              [1,0,0,0]])
unique_rows, counts = np.unique(A, return_counts=True, axis=0)
unique_rows[np.argmax(counts)]
FYI, If the array you mentioned in the question is your target variable, then it is an example of multi-label data.
This may be of use for you to understand multi-class and multi-label
You could try np.unique with return_counts parameter:
from operator import itemgetter
import numpy as np
A = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
uniques, counts = np.unique(A, axis=0, return_counts=True)
idxmax, _ = max(zip(range(len(counts)), counts), key=itemgetter(1))
print(uniques[idxmax])
Output
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
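Since the question asks for the most common class "or classes", note that argmax and max only report one winner when several rows are tied. A small sketch that returns every tied row is below; the function name most_frequent_rows and the sample array are made up for illustration.

```python
import numpy as np


def most_frequent_rows(arr):
    """Return all rows that share the highest count, plus that count."""
    uniques, counts = np.unique(np.asarray(arr), axis=0, return_counts=True)
    top = counts.max()
    return uniques[counts == top], int(top)


# Hypothetical data with a tie: two different rows appear twice each.
A = np.array([[1, 0, 0, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 0]])
rows, count = most_frequent_rows(A)
print(rows)
print(count)
```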
You can use collections.Counter.most_common if you convert each inner list to a tuple (lists are not hashable, so they cannot be counted directly):
from collections import Counter
A = [[0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]]
c = Counter(tuple(x) for x in A)
print(c.most_common()[0]) # ((0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0), 2)
This returns a tuple containing the most common list and the number of occurrences.
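If the data is a NumPy array as in the question, the same idea might look like the sketch below. The variable names and the small sample array are assumptions of mine. The last line turns the count into the share of rows that a "always predict the most common combination" baseline would get right.

```python
from collections import Counter

import numpy as np

# Hypothetical small array standing in for the (11617, 37) data.
A = np.array([[0, 1, 1],
              [0, 1, 1],
              [0, 1, 0]])

# tolist() gives plain Python lists; tuples make each row hashable.
row_counts = Counter(map(tuple, A.tolist()))
most_common_row, occurrences = row_counts.most_common(1)[0]

# Share of rows equal to the most common combination.
baseline_share = occurrences / len(A)
print(most_common_row, occurrences, baseline_share)
```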
A really quick and easy solution:
A = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
print(max(A, key=A.count))
Which prints:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
Because A.count scans the whole list once for every row, this takes quadratic time, so it is not the way to go if runtime matters or you want to optimize your code. However, if you just need a quick solution, it might help to keep this one-liner in mind.
(A.tolist() gets you a list from a np.ndarray if you need that first.)
from collections import Counter
A = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
most_common = [Counter(i).most_common(1).pop()[0] for i in A]
most_common
[0, 0, 0]
I'm trying to transform a pandas dataframe with three columns (Date, Start, End) into a frequency matrix. My input dataframe looks like this:
Date, Start, End
2016-09-02 09:16:00 18 16
2016-09-02 16:14:10 16 1
2016-09-02 06:17:21 18 17
2016-09-02 05:51:07 23 17
2016-09-02 18:34:44 18 17
2016-09-02 05:44:44 20 4
2016-09-02 09:25:22 18 17
2016-09-02 22:27:44 18 17
2016-09-02 16:02:46 0 18
2016-09-02 15:35:07 17 17
2016-09-02 16:06:42 8 17
2016-09-02 14:47:04 16 23
2016-09-02 07:47:24 20 1
...
The values of 'Start' and 'End' are integers between 0 and 23 inclusive. The 'Date' is a datetime. The frequency matrix I'm trying to create is a 24 by 24 csv, where row i and column j is the number of times 'End'=i and 'Start'=j occurs in the input. For example, the above data would create:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0
2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0
5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
17, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 0, 0, 0, 0, 1
18, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0
For extra help, could this be done in a way that creates a separate matrix for every 15 minutes? That would be 672 matrices as this date range is one week.
I'm a self taught beginner, and I really can't think of how to solve this in a pythonic way, any solutions or advice would be greatly appreciated.
Create your matrix with a simple count, then unstack one of the two index levels into columns:
mat = df.groupby(['Start', 'End']).count().unstack(level=0)
Clean up the Date level:
mat.columns = mat.columns.droplevel(0)
Now reindex rows and columns, fill the gaps with 0 and cast to integers:
mat = mat.reindex(index=range(24), columns=range(24)).fillna(0).astype(int)
Detailed explanations
First, you count how many times each (Start, End) pair occurs. Grouping by these two columns returns a result with a MultiIndex.
df.groupby(['Start', 'End']).count()
Out[134]:
Date
Start End
0 18 1
8 17 1
16 1 1
23 1
17 17 1
18 16 1
17 4
20 1 1
4 1
23 17 1
What we want from that result is to get the Start index in columns. unstack does this:
df.groupby(['Start', 'End']).count().unstack(level=0)
Out[135]:
Date
Start 0 8 16 17 18 20 23
End
1 NaN NaN 1.0 NaN NaN 1.0 NaN
4 NaN NaN NaN NaN NaN 1.0 NaN
16 NaN NaN NaN NaN 1.0 NaN NaN
17 NaN 1.0 NaN 1.0 4.0 NaN 1.0
18 1.0 NaN NaN NaN NaN NaN NaN
23 NaN NaN 1.0 NaN NaN NaN NaN
The result of unstack is that Start is moved into an additional column index level on top of the existing Date column level (see below). That's why we drop level 0 afterwards. Another way, depending on your current source code, could be to leave the Date column out upfront, so that unstack yields a single column level.
_.columns
Out[136]:
MultiIndex(levels=[['Date'], [0, 8, 16, 17, 18, 20, 23]],
labels=[[0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6]],
names=[None, 'Start'])
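Putting the steps together, a compact variant could look like the sketch below. It swaps count() for size(), which counts rows without needing the Date column, so no extra column level has to be dropped. The frame is retyped from the sample rows in the question, and the names hours, mat and csv_text are my own.

```python
import pandas as pd

# Start/End pairs retyped from the 13 sample rows in the question.
df = pd.DataFrame({
    "Start": [18, 16, 18, 23, 18, 20, 18, 18, 0, 17, 8, 16, 20],
    "End":   [16, 1, 17, 17, 17, 4, 17, 17, 18, 17, 17, 23, 1],
})

hours = range(24)
mat = (
    df.groupby(["End", "Start"]).size()        # one count per (End, Start) pair
      .unstack(level="Start", fill_value=0)    # Start becomes the columns
      .reindex(index=hours, columns=hours, fill_value=0)  # pad to 24 x 24
)

csv_text = mat.to_csv()  # pass a file path instead to write the csv to disk
print(mat.loc[17, 18])
```

Rows are End and columns are Start, matching the layout asked for in the question.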
Bit late but for anyone who's here:
There is a function explicitly for this called pd.crosstab()
https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
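You will want to use it like this (the first argument becomes the rows and the second the columns, so swap them if you want End on the rows as in the question):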
output = pd.crosstab(df["Start"], df["End"])
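For the follow-up about one matrix per 15 minutes, one possible approach is to floor the timestamps to 15-minute windows and build a crosstab per group. This is a sketch under assumptions: the helper frequency_matrix and the dict per_window are hypothetical names, the frame uses only five of the sample rows, and windows that contain no rows simply do not appear in the dict.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-09-02 09:16:00", "2016-09-02 16:14:10", "2016-09-02 09:25:22",
        "2016-09-02 16:02:46", "2016-09-02 16:06:42",
    ]),
    "Start": [18, 16, 18, 0, 8],
    "End": [16, 1, 17, 18, 17],
})
hours = range(24)


def frequency_matrix(frame):
    """24 x 24 counts with End on the rows and Start on the columns."""
    table = pd.crosstab(frame["End"], frame["Start"])
    return table.reindex(index=hours, columns=hours, fill_value=0)


full = frequency_matrix(df)

# One matrix per 15-minute window, keyed by the window's start time.
windows = df["Date"].dt.floor("15min")
per_window = {start: frequency_matrix(group) for start, group in df.groupby(windows)}
print(sorted(per_window))
```

Each value in per_window can be written out with to_csv, for example using the window start in the file name.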