Explode column into columns - python

I have a pandas training DataFrame which looks like this:
image
1 [[0, 0, 0], [1, 0, 1], [0, 1, 1]]
2 [[1, 1, 1], [0, 0, 1], [0, 0, 1]]
2 [[0, 0, 1], [0, 1, 1], [1, 1, 1]]
Is there any way to "explode" it, but into columns?
1 2 3 4 5 6 7 8 9
1 0, 0, 0, 1, 0, 1, 0, 1, 1
2 1, 1, 1, 0, 0, 1, 0, 0, 1
2 0, 0, 1, 0, 1, 1, 1, 1, 1

np.vstack the Series of lists of lists, then reshape:
pd.DataFrame(np.vstack(df['image']).reshape(len(df), -1))
0 1 2 3 4 5 6 7 8
0 0 0 0 1 0 1 0 1 1
1 1 1 1 0 0 1 0 0 1
2 0 0 1 0 1 1 1 1 1
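For reference, a self-contained version of this approach (the column name 'image' and the 3x3 lists are taken from the question):

import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({'image': [
    [[0, 0, 0], [1, 0, 1], [0, 1, 1]],
    [[1, 1, 1], [0, 0, 1], [0, 0, 1]],
    [[0, 0, 1], [0, 1, 1], [1, 1, 1]],
]})

# vstack turns the Series of 3x3 lists into a (9, 3) array;
# reshaping to (len(df), -1) then gives one 9-element row per image
out = pd.DataFrame(np.vstack(df['image']).reshape(len(df), -1))
print(out)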


Can I use a pivot table to create a heatmap table in pandas

I have this input dataframe and the desired result dataframe:
df = pd.DataFrame(
    {
        "I": ["I1", "I2", "I3", "I4", "I5", "I6", "I7"],
        "A": [1, 1, 0, 0, 0, 0, 0],
        "B": [0, 1, 1, 0, 0, 1, 1],
        "C": [0, 0, 0, 0, 0, 1, 1],
        "D": [1, 1, 1, 1, 1, 0, 1],
        "E": [1, 0, 0, 1, 1, 0, 1],
        "F": [0, 0, 0, 1, 1, 0, 0],
        "G": [0, 0, 0, 0, 1, 0, 0],
        "H": [1, 1, 0, 0, 0, 1, 1],
    })
result = pd.DataFrame(
    {
        "I": ["A", "B", "C", "D", "E", "F", "G", "H"],
        "A": [2, 1, 0, 2, 1, 0, 0, 2],
        "B": [1, 4, 2, 3, 1, 0, 0, 3],
        "C": [0, 2, 2, 1, 1, 0, 0, 2],
        "D": [2, 3, 1, 6, 4, 2, 1, 3],
        "E": [1, 1, 1, 4, 4, 2, 1, 2],
        "F": [0, 0, 0, 2, 2, 2, 1, 0],
        "G": [0, 0, 0, 1, 1, 1, 1, 0],
        "H": [2, 3, 2, 3, 2, 0, 0, 4],
    })
print('input dataframe')
print(df)
print('result dataframe')
print(result)
The result dataframe is square (the number of rows equals the number of columns), and the value in each cell is the number of rows that have a 1 in both of the corresponding columns.
For example, the cell at A:B is the number of rows with a 1 in column A and a 1 in column B; here the result is 1, since only row I2 has a 1 in both columns.
I can write nested for loops to calculate these values, but I am looking for a better way.
Can I use a pivot table for this?
My implementation, which doesn't use a pivot table, is as follows:
df = df.astype(bool)
r = pd.DataFrame(index=df.columns[1:], columns=df.columns[1:])
for c1 in df.columns[1:]:
    for c2 in df.columns[1:]:
        tmp = df[c1] & df[c2]
        r.loc[c1, c2] = tmp.sum()  # .loc[c1, c2] avoids chained indexing
print(r)
Running this code generates:
A B C D E F G H
A 2 1 0 2 1 0 0 2
B 1 4 2 3 1 0 0 3
C 0 2 2 1 1 0 0 2
D 2 3 1 6 4 2 1 3
E 1 1 1 4 4 2 1 2
F 0 0 0 2 2 2 1 0
G 0 0 0 1 1 1 1 0
H 2 3 2 3 2 0 0 4
Yes, but you'd be better off with matrix multiplication:
df.iloc[:, 1:].T @ df.iloc[:, 1:]
Output:
A B C D E F G H
A 2 1 0 2 1 0 0 2
B 1 4 2 3 1 0 0 3
C 0 2 2 1 1 0 0 2
D 2 3 1 6 4 2 1 3
E 1 1 1 4 4 2 1 2
F 0 0 0 2 2 2 1 0
G 0 0 0 1 1 1 1 0
H 2 3 2 3 2 0 0 4
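Why this works: for a 0/1 matrix X, entry (i, j) of X.T @ X counts the rows in which columns i and j are both 1, which is exactly the co-occurrence count asked for. A small sketch that also keeps the column labels (using the original integer df, before the astype(bool) cast in the question's own attempt):

# Move the ID column into the index so the result is labelled
X = df.set_index('I')  # 7x8 frame of 0/1 integers
co = X.T @ X           # 8x8 co-occurrence counts, same as the loop above
print(co)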

Scale/resize a square matrix into a larger size whilst retaining the grid structure/pattern (Python)

arr = [[1 0 0] # 3x3
[0 1 0]
[0 0 1]]
largeArr = [[1 1 0 0 0 0] # 6x6
[1 1 0 0 0 0]
[0 0 1 1 0 0]
[0 0 1 1 0 0]
[0 0 0 0 1 1]
[0 0 0 0 1 1]]
Like above, I want to retain the same 'grid' format whilst increasing the dimensions of the 2D array. How would I go about doing this? I assume the original matrix can only be scaled up by an integer n.
You can use numba if performance matters (similar post), JIT-compiled in no-python mode, and in parallel mode if needed (this code can be made faster still with some optimizations):
import numba as nb
import numpy as np

@nb.njit  # or: @nb.njit("int64[:, ::1](int64[:, ::1], int64)", parallel=True)
def numba_(arr, n):
    res = np.empty((arr.shape[0] * n, arr.shape[0] * n), dtype=np.int64)
    for i in range(arr.shape[0]):  # use nb.prange(...) in parallel mode
        for j in range(arr.shape[0]):
            res[n * i: n * (i + 1), n * j: n * (j + 1)] = arr[i, j]
    return res
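A quick usage sketch (int64 input assumed, to match the typed signature in the comment above):

arr = np.array([[1, 0, 0],
                [0, 1, 0],
                [0, 0, 1]], dtype=np.int64)
print(numba_(arr, 2))  # 6x6 block-scaled version of the 3x3 pattern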
So, as an example:
arr = [[0 0 0 1 1]
[0 1 1 1 1]
[1 1 0 0 1]
[0 0 1 0 1]
[0 1 1 0 1]]
res (n=3):
[[0 0 0 0 0 0 0 0 0 1 1 1 1 1 1]
[0 0 0 0 0 0 0 0 0 1 1 1 1 1 1]
[0 0 0 0 0 0 0 0 0 1 1 1 1 1 1]
[0 0 0 1 1 1 1 1 1 1 1 1 1 1 1]
[0 0 0 1 1 1 1 1 1 1 1 1 1 1 1]
[0 0 0 1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1]
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1]
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1]
[0 0 0 0 0 0 1 1 1 0 0 0 1 1 1]
[0 0 0 0 0 0 1 1 1 0 0 0 1 1 1]
[0 0 0 0 0 0 1 1 1 0 0 0 1 1 1]
[0 0 0 1 1 1 1 1 1 0 0 0 1 1 1]
[0 0 0 1 1 1 1 1 1 0 0 0 1 1 1]
[0 0 0 1 1 1 1 1 1 0 0 0 1 1 1]]
Performance (the perfplot benchmark figure is not reproduced here)
In my benchmarks, numba is the fastest (for large n, parallel mode does better); after that, BrokenBenchmark's answer is faster than scipy.ndimage.zoom. In the benchmarks, f is arr.shape[0] and n is the repeat count.
You can use repeat() twice:
arr.repeat(2, 0).repeat(2, 1)
This outputs:
[[1. 1. 0. 0. 0. 0.]
[1. 1. 0. 0. 0. 0.]
[0. 0. 1. 1. 0. 0.]
[0. 0. 1. 1. 0. 0.]
[0. 0. 0. 0. 1. 1.]
[0. 0. 0. 0. 1. 1.]]
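For an arbitrary integer factor n, the same idea generalizes directly; a small sketch (the helper name upscale is just for illustration):

import numpy as np

def upscale(arr, n):
    # Repeat each row n times, then each column n times
    return arr.repeat(n, axis=0).repeat(n, axis=1)

arr = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
print(upscale(arr, 2))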
You could use scipy.ndimage.zoom; order=0 selects nearest-neighbour interpolation, and grid_mode=True applies the zoom to the full pixel extent rather than to the distances between pixel centers:
In [1]: import numpy as np
In [2]: from scipy import ndimage
In [3]: arr = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
In [4]: ndimage.zoom(arr, 2, order=0, grid_mode=True, mode="nearest")
Out[4]:
array([[1, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0],
       [0, 0, 1, 1, 0, 0],
       [0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 1, 1]])
This can be done using Pillow (the fork of PIL) as follows:
from PIL import Image
import numpy as np

n = 3  # scale factor
# Pillow cannot handle 64-bit integer arrays, so cast to uint8 first
im = Image.fromarray(arr.astype(np.uint8))
up_im = im.resize((im.width * n, im.height * n), resample=Image.NEAREST)
up_arr = np.array(up_im)
Example:
arr = np.array(
[[0, 0, 0, 1, 1],
[0, 1, 1, 1, 1],
[1, 1, 0, 0, 1],
[0, 0, 1, 0, 1],
[0, 1, 1, 0, 1]])
res (n=3):
np.array(
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1]])
numba is by far the fastest. As the matrix size increases, PIL takes much longer.

How to load Pandas dataframe into Surprise dataset?

I am building a recommender system based on users' ratings of 11 different items.
I started with a dictionary (user_dict) of user ratings:
{'U1': [3, 4, 2, 5, 0, 4, 1, 3, 0, 0, 4],
'U2': [2, 3, 1, 0, 3, 0, 2, 0, 0, 3, 0],
'U3': [0, 4, 0, 5, 0, 4, 0, 3, 0, 2, 4],
'U4': [0, 0, 2, 1, 4, 3, 2, 0, 0, 2, 0],
'U5': [0, 0, 0, 5, 0, 4, 0, 3, 0, 0, 4],
'U6': [2, 3, 4, 0, 3, 0, 3, 0, 3, 4, 0],
'U7': [0, 4, 3, 5, 0, 5, 0, 0, 0, 0, 4],
'U8': [4, 3, 0, 3, 4, 2, 2, 0, 2, 3, 2],
'U9': [0, 2, 0, 3, 1, 0, 1, 0, 0, 2, 0],
'U10': [0, 3, 0, 4, 3, 3, 0, 3, 0, 4, 4],
'U11': [2, 2, 1, 2, 1, 0, 2, 0, 1, 0, 2],
'U12': [0, 4, 4, 5, 0, 0, 0, 3, 0, 4, 5],
'U13': [3, 3, 0, 2, 2, 3, 2, 0, 2, 0, 3],
'U14': [0, 3, 4, 5, 0, 5, 0, 0, 0, 4, 0],
'U15': [2, 0, 0, 3, 0, 2, 2, 3, 0, 0, 3],
'U16': [4, 4, 0, 4, 3, 4, 0, 3, 0, 3, 0],
'U17': [0, 2, 0, 3, 1, 0, 2, 0, 1, 0, 3],
'U18': [2, 3, 1, 0, 3, 2, 3, 2, 0, 2, 0],
'U19': [0, 5, 0, 4, 0, 3, 0, 4, 0, 0, 5],
'U20': [0, 0, 3, 0, 3, 0, 4, 0, 2, 0, 0],
'U21': [3, 0, 2, 4, 2, 3, 0, 4, 2, 3, 3],
'U22': [4, 4, 0, 5, 3, 5, 0, 4, 0, 3, 0],
'U23': [3, 0, 0, 0, 3, 0, 2, 0, 0, 4, 0],
'U24': [4, 0, 3, 0, 3, 0, 3, 0, 0, 2, 2],
'U25': [0, 5, 0, 3, 3, 4, 0, 3, 3, 4, 4]}
I then loaded the dictionary into a Pandas dataframe by using this code:
df= pd.DataFrame(user_dict)
userRatings_df = df.T
print(userRatings_df)
This prints the data like so:
0 1 2 3 4 5 6 7 8 9 10
U1 3 4 2 5 0 4 1 3 0 0 4
U2 2 3 1 0 3 0 2 0 0 3 0
U3 0 4 0 5 0 4 0 3 0 2 4
U4 0 0 2 1 4 3 2 0 0 2 0
U5 0 0 0 5 0 4 0 3 0 0 4
U6 2 3 4 0 3 0 3 0 3 4 0
U7 0 4 3 5 0 5 0 0 0 0 4
U8 4 3 0 3 4 2 2 0 2 3 2
U9 0 2 0 3 1 0 1 0 0 2 0
U10 0 3 0 4 3 3 0 3 0 4 4
U11 2 2 1 2 1 0 2 0 1 0 2
U12 0 4 4 5 0 0 0 3 0 4 5
U13 3 3 0 2 2 3 2 0 2 0 3
U14 0 3 4 5 0 5 0 0 0 4 0
U15 2 0 0 3 0 2 2 3 0 0 3
U16 4 4 0 4 3 4 0 3 0 3 0
U17 0 2 0 3 1 0 2 0 1 0 3
U18 2 3 1 0 3 2 3 2 0 2 0
U19 0 5 0 4 0 3 0 4 0 0 5
U20 0 0 3 0 3 0 4 0 2 0 0
U21 3 0 2 4 2 3 0 4 2 3 3
U22 4 4 0 5 3 5 0 4 0 3 0
U23 3 0 0 0 3 0 2 0 0 4 0
U24 4 0 3 0 3 0 3 0 0 2 2
U25 0 5 0 3 3 4 0 3 3 4 4
When I attempt to load it into a Surprise dataset, I run this code:
reader = Reader(rating_scale=(1, 5))
userRatings_data = Dataset.load_from_df(
    userRatings_df[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]], reader)
I get this error:
ValueError: too many values to unpack (expected 3)
Can anyone help me to fix this error?
The problem comes from the way you are converting your dictionary into a pandas dataframe. For Dataset to be able to process a pandas dataframe, it must have exactly three columns: the first is the user ID, the second is the item ID, and the third is the actual rating.
This is how I would build a dataframe that Dataset can work with:
DF = pd.DataFrame()
for key in user_dict.keys():
    df = pd.DataFrame(columns=['User', 'Item', 'Rating'])
    df['Rating'] = pd.Series(user_dict[key])
    df['Item'] = df.index  # the list position serves as the raw item ID
    df['User'] = key
    DF = pd.concat([DF, df], axis=0)
DF = DF.reset_index(drop=True)
Each key of the dictionary is essentially a user ID. For every key I build a temporary dataframe holding that user ID, the user's ratings, and the ratings' indices (which become the raw item IDs), and these temporary frames are stacked on top of each other into the final dataframe.
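With DF in that three-column shape, the Surprise loading step should then work along these lines (a sketch; dropping the 0 entries is an assumption on my part, since 0 appears to mean "unrated" and falls outside the (1, 5) rating scale):

from surprise import Dataset, Reader

rated = DF[DF['Rating'] > 0]  # assumption: 0 means "no rating given"
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(rated[['User', 'Item', 'Rating']], reader)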
Hopefully this helps.

Possible bug in scipy.ndimage.measurements.label?

I was trying to write a percolation program in Python, and I saw a tutorial recommending scipy.ndimage.measurements.label to identify the clusters. The problem is that I started to notice some odd behavior in the function: some elements that should belong to the same cluster receive different labels. Here is a code snippet that reproduces my problem.
import numpy as np
import scipy
from scipy.ndimage import measurements
grid = np.array([[0, 1, 1, 0, 1, 1, 0, 1, 0, 1],
                 [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
                 [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
                 [1, 0, 1, 0, 1, 1, 0, 1, 1, 1],
                 [0, 0, 1, 0, 1, 0, 0, 0, 0, 1],
                 [0, 1, 1, 1, 0, 0, 0, 0, 0, 1],
                 [0, 1, 0, 1, 1, 1, 0, 0, 1, 1],  # <- notice the last two elements
                 [1, 1, 0, 1, 1, 1, 1, 1, 1, 0],
                 [1, 0, 0, 0, 1, 1, 1, 1, 0, 1],
                 [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]])
labels, nlabels = measurements.label(grid)
print "Scipy Version: ", scipy.__version__
print
print labels
The output I get is:
Scipy Version: 0.13.0
[[0 1 1 0 2 2 0 3 0 4]
[0 1 0 0 0 0 0 0 5 0]
[1 1 1 1 1 1 0 5 5 5]
[1 0 1 0 1 1 0 5 5 5]
[0 0 1 0 1 0 0 0 0 5]
[0 1 1 1 0 0 0 0 0 5]
[0 1 0 1 1 1 0 0 1 5] #<- The last two elements
[1 1 0 1 1 1 1 1 1 0] # are set with different labels
[1 0 0 0 1 1 1 1 0 6]
[1 1 1 0 0 0 1 1 0 0]]
Am I missing something about the way this function is supposed to work, or is this a bug?
This is very important, because labeling the clusters correctly is crucial for getting the right percolation results.
Thanks for the help.
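For what it's worth, the two flagged cells are directly adjacent horizontally, so even the default 4-connectivity should merge them into one cluster; a minimal check against the arrays defined above:

# grid[6, 8] and grid[6, 9] are both 1 and side by side,
# yet label() returns different labels for them
print grid[6, 8], grid[6, 9]
print labels[6, 8], labels[6, 9]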

Displaying python 2d list without commas, brackets, etc. and newline after every row

I'm trying to display a Python 2D list without the commas, brackets, etc., and I'd like to start a new line after every 'row' of the list.
This is my attempt at doing so:
ogm = repr(ogm).replace(',', ' ')
ogm = repr(ogm).replace('[', ' ')
ogm = repr(ogm).replace("'", ' ')
ogm = repr(ogm).replace('"', ' ')
print repr(ogm).replace(']', ' ')
This is the input:
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 0, 1, 1, 1, 0], [0, 0, 0, 1, 1, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1, 1, 1], [0, 1, 0, 0, 0, 0, 0, 1, 1, 0], [0, 0, 0, 0, 0, 0, 1, 1, 0, 0], [1, 0, 1, 1, 1, 1, 0, 0, 0, 0]]
This is the output:
"' 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 0 1 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 0 1 1 1 1 0 0 0 0 '"
I'm encountering two problems:
There are stray " and ' characters which I can't get rid of.
I have no idea how to insert a newline.
Simple way:
for row in list2D:
    print " ".join(map(str, row))
Maybe join is appropriate for you:
print "\n".join(" ".join(str(el) for el in row) for row in ogm)
0 0 0 0 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0 0 0
0 1 1 0 0 0 0 0 1 1
0 0 0 0 0 0 1 1 1 0
0 0 0 1 1 0 1 1 1 1
0 0 1 1 0 0 1 1 1 1
0 1 0 0 0 0 0 1 1 0
0 0 0 0 0 0 1 1 0 0
1 0 1 1 1 1 0 0 0 0
print "\n".join(" ".join(map(str, line)) for line in ogm)
If you want the rows and columns transposed:
print "\n".join(" ".join(map(str, line)) for line in zip(*ogm))
In Python 3, you can unpack each row directly into print:
for row in list2D:
    print(*row)
To make the display even more readable you can use tabs or pad the cells with spaces to align the columns.
def printMatrix(matrix):
    for lst in matrix:
        for element in lst:
            print(element, end="\t")
        print("")
It will display
6       8       99
999     7       99
3       7       99
instead of
6 8 99
999 7 99
3 7 99
I've been learning Python for a week; you guys have given some excellent solutions. Here is how I did it, and this works too. :)
ogm = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 0, 1, 1, 1, 0], [0, 0, 0, 1, 1, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1, 1, 1], [0, 1, 0, 0, 0, 0, 0, 1, 1, 0], [0, 0, 0, 0, 0, 0, 1, 1, 0, 0], [1, 0, 1, 1, 1, 1, 0, 0, 0, 0]]
s1 = str(ogm)
s2 = s1.replace('], [', '\n')
s3 = s2.replace('[', '')
s4 = s3.replace(']', '')
s5 = s4.replace(',', '')
print s5
By the way, the stray " is actually two ' characters with no gap between them.
