I have turned a dataframe that has tuples of length 2 as its index
         1  2  -1
(0, 1)   0  1   0
(0, 2)   1  0   0
(0, -1)  0  0   0
(1, 1)   1  0   0
(1, 2)   0  1   0
(1, -1)  1  1   1
into a numpy 2D array, and I managed to split it into a 3D array (with respect to the first value of the tuple) using the split function:
arr = np.array(np.array_split(arr,2))
with the result:
[[[0 1 0]
  [1 0 0]
  [0 0 0]]

 [[1 0 0]
  [0 1 0]
  [1 1 1]]]
I want to make a function that does the split even further, for example, to create a 5D tensor from length-4 indices such as (0, 0, 0, 0).
Any idea on how to do this recursively?
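If every index level is a complete Cartesian product, as in the sample data below, the split generalizes to a single reshape. A minimal sketch, assuming the rows are sorted lexicographically by index (split_by_levels and level_sizes are hypothetical names, not from the original post):
import numpy as np

def split_by_levels(arr, level_sizes):
    # level_sizes holds the number of distinct values per index level, e.g.
    # (2, 3) turns the (6, 3) array above into shape (2, 3, 3), and
    # (2, 2, 2, 2) turns a (16, ncols) array into the desired 5D tensor
    return arr.reshape(*level_sizes, arr.shape[-1])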
Use the following code to generate sample data:
import pandas as pd
import numpy as np
import itertools
def create_fake_data_frame(nlevels=2, ncols=3):
    result = pd.DataFrame(
        index=itertools.product(*(nlevels * [[0, 1]])),
        data=np.arange(ncols * 2**nlevels).reshape(2**nlevels, ncols)
    )
    result = convert_index_of_tuples_to_multiindex(result)
    return result

def convert_index_of_tuples_to_multiindex(df):
    return df.set_index(pd.MultiIndex.from_tuples(df.index))
# Increase nlevels to get dataframes with more levels in their MultiIndex
df = create_fake_data_frame(nlevels=3)
print(df)
This is the result:
        0   1   2
0 0 0   0   1   2
    1   3   4   5
  1 0   6   7   8
    1   9  10  11
1 0 0  12  13  14
    1  15  16  17
  1 0  18  19  20
    1  21  22  23
Then, modify the dataframe in such a way that each row contains a single
column, whose value is a list of the values in the corresponding row of
the original dataframe:
def data_frame_with_single_column_of_lists(df):
    if len(df.columns) <= 1:
        return df
    result = df.apply(collapse_columns_into_lists, axis=1)
    return result

def collapse_columns_into_lists(s):
    result = s.copy()
    result['lists'] = result.values.tolist()
    result = result[['lists']]
    return result
df = data_frame_with_single_column_of_lists(df)
print(df)
The output will be like this:
              lists
0 0 0     [0, 1, 2]
    1     [3, 4, 5]
  1 0     [6, 7, 8]
    1   [9, 10, 11]
1 0 0  [12, 13, 14]
    1  [15, 16, 17]
  1 0  [18, 19, 20]
    1  [21, 22, 23]
Finally, use the following code to get a tensor:
def increase_list_nesting_by_removing_an_index_level(df):
    def list_of_lists(series):
        result = series.to_frame().set_index(series.index.droplevel(-1))
        result = result.apply(lambda x: x['lists'], axis=1).to_frame()
        result = [x[0] for x in result.values.tolist()]
        return result

    grouped = df.groupby(df.index.droplevel(-1))
    result = grouped.agg(list_of_lists)
    if isinstance(result.index[0], tuple):
        result = convert_index_of_tuples_to_multiindex(result)
    return result

def tensor_from_data_frame(df):
    if df.index.nlevels <= 1:
        return np.array([i[0] for i in df.values])
    result = increase_list_nesting_by_removing_an_index_level(df)
    result = tensor_from_data_frame(result)
    return result
tensor = tensor_from_data_frame(df)
print(tensor)
The result will be like this:
[[[[ 0  1  2]
   [ 3  4  5]]

  [[ 6  7  8]
   [ 9 10 11]]]


 [[[12 13 14]
   [15 16 17]]

  [[18 19 20]
   [21 22 23]]]]
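As a cross-check, a sketch that is not part of the recursive method above: when the MultiIndex is a complete, lexsorted product, the same tensor can come from a single reshape, with MultiIndex.levshape supplying the level sizes. This assumes the original dataframe from create_fake_data_frame, before its columns were collapsed into lists:
df = create_fake_data_frame(nlevels=3)
# levshape is the tuple of level sizes, here (2, 2, 2); -1 keeps the columns
tensor = df.to_numpy().reshape(*df.index.levshape, -1)
print(tensor.shape)  # (2, 2, 2, 3)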
Given the data set:
# Create two Series
s = pd.Series([[1, 2, 3], [1, 10, 11], [2, 11, 12]], ['buz', 'bas', 'bur'])
k = pd.Series(['y', 'n', 'o'], ['buz', 'bas', 'bur'])

# Create DataFrame df from the two Series
df = pd.DataFrame({'first': s, 'second': k})
I was able to create new columns based on all possible values of 'first':
def text_to_list(df, col):
    val = df[col].explode().unique()
    return val

unique = text_to_list(df, 'first')

for options in unique:
    df[options] = 0
Now I need to check off (turn the value to 1) each row and column where that value exists in the row's original list in 'first'. I'm pretty sure it's a combination of .isin and/or .apply, but I'm struggling.
The end result should be, for each row:
buz: cols 1, 2, 3 are 1
bas: cols 1, 10, 11 are 1
bur: cols 2, 11, 12 are 1
           first second  1  2  3  10  11  12
buz    [1, 2, 3]      y  1  1  1   0   0   0
bas  [1, 10, 11]      n  1  0  0   1   1   0
bur  [2, 11, 12]      o  0  1  0   0   1   1
Adding the solution provided by https://stackoverflow.com/users/3558077/ashutosh-porwal:
df1 = df.join(pd.get_dummies(df['first'].apply(pd.Series).stack()).sum(level=0))
print(df1)
Note: this solution did not require my hack of creating the columns beforehand by exploding column 'first'.
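Since sum(level=0) was deprecated and later removed in newer pandas versions, an equivalent form of the same join (a sketch, using explode instead of apply(pd.Series).stack()) would be:
df1 = df.join(pd.get_dummies(df['first'].explode()).groupby(level=0).sum())
print(df1)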
From your update it seems that what you need is simply:
for opt in unique:
    df[opt] = df['first'].apply(lambda x: int(opt in x))
Output:
           first second  1  2  3  10  11  12
buz    [1, 2, 3]      y  1  1  1   0   0   0
bas  [1, 10, 11]      n  1  0  0   1   1   0
bur  [2, 11, 12]      o  0  1  0   0   1   1
Data:
>>> import pandas as pd
>>> s = pd.Series([[1,2,3,],[1,10,11],[2,11,12]],['buz','bas','bur'])
>>> k = pd.Series(['y','n','o'],['buz','bas','bur'])
>>> df = pd.DataFrame({'first':s,'second':k})
>>> df
           first second
buz    [1, 2, 3]      y
bas  [1, 10, 11]      n
bur  [2, 11, 12]      o
Solution:
>>> df[df['first'].explode().to_list()] = 0
>>> df = df[['first', 'second']].join(df.apply(lambda x:x.loc[x['first']], axis=1).replace({0 : 1, np.nan : 0}).astype(int))
>>> df
           first second  1  2  3  10  11  12
buz    [1, 2, 3]      y  1  1  1   0   0   0
bas  [1, 10, 11]      n  1  0  0   1   1   0
bur  [2, 11, 12]      o  0  1  0   0   1   1
Use pd.merge and pivot_table:
out = df.reset_index().explode('first') \
        .pivot_table(values='index', index='second', columns='first',
                     aggfunc='any', fill_value=False, sort=False).astype(int)
out = df.merge(out, on='second')
Output:
>>> out
first second 1 2 3 10 11 12
0 [1, 2, 3] y 1 1 1 0 0 0
1 [1, 10, 11] n 1 0 0 1 1 0
2 [2, 11, 12] o 0 1 0 0 1 1
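If scikit-learn is acceptable, MultiLabelBinarizer performs this membership encoding directly; a sketch, assuming the df from the Data block above:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
dummies = pd.DataFrame(mlb.fit_transform(df['first']),
                       columns=mlb.classes_, index=df.index)
out = df.join(dummies)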
I have a column (binary) in a dataframe (df) of the form:
Vector
0
1
0
1
0
...
I am using this in a binary classification model. My objective is to take these 0's and 1's and move them into two separate lists, which then get translated into numpy arrays. As an example, I would like to move the first 5 items from Vector into X, then the 6th item into Y; then the next 5 items into X, and the following 6th item into Y, until the end of the df (currently 200k rows).
My first instinct is to write a for loop for this (but I know this is hugely inefficient):
X, Y = [], []
i_cnt = 1

for i in range(df.shape[0]):
    # as we iterate through the df, we use a step of 5
    if i_cnt > 5:
        # every 6th item goes into Y
        y = df['Vector'].iloc[i]
        Y.append(y)
        i_cnt = 1
    else:
        x = df['Vector'].iloc[i]
        X.append(x)
        i_cnt += 1
There is definitely a faster way to do this, and I'm hoping someone knows how.
Use modulo 6 on an array created from the length of the index, and compare it to build X and Y:
# sample data for easy verification
df = pd.DataFrame({'Vector': range(20)})

idx = np.arange(len(df)) % 6
X = df.loc[idx < 5, 'Vector']
print (X)
0      0
1      1
2      2
3      3
4      4
6      6
7      7
8      8
9      9
10    10
12    12
13    13
14    14
15    15
16    16
18    18
19    19
Y = df.loc[idx == 5, 'Vector']
print (Y)
5      5
11    11
17    17
If the expected output is different, with X as a 2D array, use reshape with -1 to infer the number of groups automatically with 6 columns, and then select columns by indexing:
df = pd.DataFrame({'Vector': range(18)})
arr = df['Vector'].to_numpy().reshape(-1, 6)
X = arr[:, :-1]
Y = arr[:, -1]
print (X)
[[ 0 1 2 3 4]
[ 6 7 8 9 10]
[12 13 14 15 16]]
print (Y)
[ 5 11 17]
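Note that reshape requires the length to be an exact multiple of 6, which 200k rows is not; a sketch that drops the incomplete trailing group first:
arr = df['Vector'].to_numpy()
n = len(arr) // 6 * 6            # largest multiple of 6 that fits
groups = arr[:n].reshape(-1, 6)  # drop the incomplete trailing group
X, Y = groups[:, :-1], groups[:, -1]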
For k = 5 + 1 = 6,
k = 6
n_rows = len(df.index)
n_samples = n_rows // k
X_and_y = df.Vector.to_numpy().reshape(n_samples, k)
X = X_and_y[:, :-1]
y = X_and_y[:, -1]
We reshape the column to an (n_samples, 5 + 1) array, where n_samples = n_rows / 6; then we take all but the last column into X and the last column into y.
e.g.
>>> df = pd.DataFrame(np.random.randint(0, 2, size=18), columns=["Vector"])
>>> df
    Vector
0        0
1        0
2        1
3        1
4        0
5        0
6        0
7        0
8        0
9        0
10       0
11       1
12       0
13       0
14       1
15       0
16       0
17       1
>>> # after
>>> X
array([[0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0]])
>>> y
array([0, 1, 1])
You can try:
X = list(df[df.index % 6 < 5]["Vector"])
y = list(df[df.index % 6 == 5]["Vector"])
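A plain-NumPy variant of the same masking idea (a sketch; the version above relies on df.index being the default RangeIndex):
v = df['Vector'].to_numpy()
mask = np.arange(len(v)) % 6 < 5   # True at positions 0-4 of every group of 6
X, y = v[mask], v[~mask]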
I have two dataframes of errors in 3 axes (x, y, z):
df1 = pd.DataFrame([[0, 1, 2], [-1, 0, 1], [-2, 0, 3]], columns = ['x', 'y', 'z'])
df2 = pd.DataFrame([[1, 1, 3], [1, 0, 2], [1, 0, 3]], columns = ['x', 'y', 'z'])
I'm looking for a fast way to find the Cartesian sum of the square of each row of the two dataframes.
EDIT: My current solution:
cartesian_sum = list(np.sum(list(tup), axis=0).tolist()
                     for tup in itertools.product((df1**2).to_numpy().tolist(),
                                                  (df2**2).to_numpy().tolist()))
cartesian_sum
>>>
[[1, 2, 13],
[1, 1, 8],
[1, 1, 13],
[2, 1, 10],
[2, 0, 5],
[2, 0, 10],
[5, 1, 18],
[5, 0, 13],
[5, 0, 18]]
is too slow (~2.4 ms, compared to the solutions based purely on Pandas, which run in ~8-10 ms).
This is similar to the related question (link here), but using itertools is very slow. Is there a faster way of doing this in Python?
I think you need a cross join first, then remove column a, square the values, convert the columns to a MultiIndex, and sum per first level:
df = df1.assign(a=1).merge(df2.assign(a=1), on='a').drop('a', axis=1) ** 2
df.columns = df.columns.str.split('_', expand=True)
df = df.sum(level=0, axis=1)
print (df)
   x  y   z
0  1  2  13
1  1  1   8
2  1  1  13
3  2  1  10
4  2  0   5
5  2  0  10
6  5  1  18
7  5  0  13
8  5  0  18
Details:
print (df1.assign(a=1).merge(df2.assign(a=1), on='a'))
   x_x  y_x  z_x  a  x_y  y_y  z_y
0    0    1    2  1    1    1    3
1    0    1    2  1    1    0    2
2    0    1    2  1    1    0    3
3   -1    0    1  1    1    1    3
4   -1    0    1  1    1    0    2
5   -1    0    1  1    1    0    3
6   -2    0    3  1    1    1    3
7   -2    0    3  1    1    0    2
8   -2    0    3  1    1    0    3
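For reference, newer pandas (1.2+) supports the cross join directly, which shortens the first step; a sketch (the _x/_y suffixes are the merge defaults for overlapping column names):
d = pd.merge(df1 ** 2, df2 ** 2, how='cross')
df = pd.DataFrame({c: d[f'{c}_x'] + d[f'{c}_y'] for c in df1.columns})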
One idea to improve performance:
# https://stackoverflow.com/a/53699013/2901002
def cartesian_product_simplified_changed(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    a = np.column_stack([left.values[ia2.ravel()] ** 2,
                         right.values[ib2.ravel()] ** 2])
    n = left.shape[1]   # split at the column count (len(left) only worked here because the frame is square)
    a = a[:, :n] + a[:, n:]
    return a
a = cartesian_product_simplified_changed(df1, df2)
print (a)
[[ 1  2 13]
 [ 1  1  8]
 [ 1  1 13]
 [ 2  1 10]
 [ 2  0  5]
 [ 2  0 10]
 [ 5  1 18]
 [ 5  0 13]
 [ 5  0 18]]
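The same outer sum can also be written with NumPy broadcasting, avoiding the index arrays entirely; a sketch, assuming both frames share the same three columns:
a = (df1.to_numpy() ** 2)[:, None, :]      # shape (len(df1), 1, 3)
b = (df2.to_numpy() ** 2)[None, :, :]      # shape (1, len(df2), 3)
out = (a + b).reshape(-1, df1.shape[1])    # one row per (df1 row, df2 row) pair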
I have a problem with an arrays column in a DataFrame.
For example, I have this data:
CustomerNumber  ArraysDate
1               [1, 4, 13]
2               [3]
3               [0]
4               [2, 60, 30, 40]
I want to calculate the sum of the elements in ArraysDate. I created a function:
def Caculator(n, x, value):
    v = 0
    for i in n - x:
        v = sum(value)
    return v
And call it like this:
s['Sum'] = Caculator(s['n'], 1, s['ArraysDate'])
n is the count of elements in the ArraysDate column, and I want to calculate
Sum = t1 + t2 + ... + t(n-x)
Expected result:
CustomerNumber  ArraysDate       Sum
1               [1, 4, 13]       5
2               [3]              0
3               [0]              0
4               [2, 60, 30, 40]  92
IIUC you can use:
df['Sum'] = df.ArraysDate.apply(lambda x: sum(x[:len(x) - 1]))
# or: df['Sum'] = df.ArraysDate.str[:-1].apply(sum)
print(df)
   CustomerNumber       ArraysDate  Sum
0               1       [1, 4, 13]    5
1               2              [3]    0
2               3              [0]    0
3               4  [2, 60, 30, 40]   92
DF: df = pd.DataFrame({'CustomerNumber': [1, 2, 3, 4], 'ArraysDate': [[1,4,13],[3],[0],[2,60,30,40]]})
Maybe something like:
def Caculator(x, arrayDates):
    vList = []
    for i in range(arrayDates.count()):
        v = 0
        for num in range(0, len(arrayDates[i]) - x):
            v = v + arrayDates[i][num]
        vList.append(v)
    return vList
for the DataFrame s:
data = [[1, [1, 4, 13]], [2, [3]], [3, [0]], [4, [2, 60, 30, 40]]]
s = pd.DataFrame(data, columns = ['CustomerNumber', 'ArraysDate'])
and call the function like this:
s['Sum'] = Caculator(1,s['ArraysDate'])
You can compute the sum over the ArraysDate column of a Pandas DataFrame like this:
import pandas as pd
import numpy as np

d = {'CustomerNumber': pd.Series([1, 2, 3, 4]),
     'ArraysDate': pd.Series([[1, 4, 13], [3], [0], [2, 60, 30, 40]])}
df = pd.DataFrame(d)

df['sum'] = [np.sum(i[0:(len(i) - 1)]) for i in df['ArraysDate']]
print(df)
Output:
   CustomerNumber       ArraysDate   sum
0               1       [1, 4, 13]   5.0
1               2              [3]   0.0
2               3              [0]   0.0
3               4  [2, 60, 30, 40]  92.0
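For reference, a vectorized sketch of the same all-but-the-last-element sum, using pandas list slicing via .str plus explode and groupby (it relies on the sum of an all-NaN group, produced by an emptied list, being 0):
df['Sum'] = df['ArraysDate'].str[:-1].explode().groupby(level=0).sum()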
I am simulating protein folding on a 2D grid where every angle is either ±90° or 0°, and have the following problem:
I have an n-by-n numpy array filled with zeros, except for certain places where the value is an integer from 1 to n. Every integer appears just once. Integer k is always a nearest neighbour of k-1 and k+1, except at the endpoints. The array is saved as an object in the Grid class, which I created for doing energy calculations and folding the protein. Example array, with n=5:
>>> from Grid import Grid
>>> a = Grid(5)
>>> a.show()
[[0 0 0 0 0]
 [0 0 0 0 0]
 [1 2 3 4 5]
 [0 0 0 0 0]
 [0 0 0 0 0]]
My goal is to find the longest consecutive line of non-zero elements without any bends. In the above case, the result should be 5.
My idea so far is something like this:
def getDiameter(self):
    indexes = np.zeros((self.n, 2))
    for i in range(1, self.n + 1):
        indexes[i - 1] = np.argwhere(self.array == i)[0]

    for i in range(self.n):
        j = 1
        currentDiameter = 1
        while indexes[0][i] == indexes[0][i + j] and i + j <= self.n:
            currentDiameter += 1
            j += 1
        while indexes[i][0] == indexes[i + j][0] and i + j <= self.n:
            currentDiameter += 1
            j += 1
        if currentDiameter > diameter:
            diameter = currentDiameter
    return diameter
This has two problems: (1) it doesn't work, and (2) it is horribly inefficient if I get it to work. I am wondering if anybody has a better way of doing this. If anything is unclear, please let me know.
Edit: a less trivial example:
[[ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0 10  0  0  0]
 [ 0  0  0  0  0  0  9  0  0  0]
 [ 0  0  0  0  0  0  8  0  0  0]
 [ 0  0  0  4  5  6  7  0  0  0]
 [ 0  0  0  3  0  0  0  0  0  0]
 [ 0  0  0  2  1  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]]
The correct answer here is 4 (both the longest column and the longest row have four non-zero elements).
What I understood from your question is that you need to find the length of the longest occurrence of consecutive elements in a numpy array (row by row).
So for the array below, the output should be 5:
[[ 1  2  3  4  0]
 [ 0  0  0  0  0]
 [10 11 12 13 14]
 [ 0  1  2  3  0]
 [ 1  0  0  0  0]]
Because [10 11 12 13 14] is a run of consecutive elements, and it has the longest length compared to any run of consecutive elements in any other row.
If this is what you are expecting, consider this:
import numpy as np
from itertools import groupby

a = np.array([[1, 2, 3, 4, 0],
              [0, 0, 0, 0, 0],
              [10, 11, 12, 13, 14],
              [0, 1, 2, 3, 0],
              [1, 0, 0, 0, 0]])

a = a.astype(float)
a[a == 0] = np.nan
b = np.diff(a)  # n-th discrete difference; consecutive numbers differ by 1

counter = []
for line in b:  # for each row
    if 1 in line:  # consecutive elements differ by 1
        # longest run of 1's in this row, plus 1 for the starting element
        counter.append(max(sum(1 for _ in g) for k, g in groupby(line) if k == 1) + 1)

print(max(counter))  # the max over the per-row longest lengths
# 5
For your particular example:
[[0 0 0 0 0]
 [0 0 0 0 0]
 [1 2 3 4 5]
 [0 0 0 0 0]
 [0 0 0 0 0]]
# 5
Start by finding the longest consecutive occurrence in a list:
def find_longest(l):
    counter = 0
    counters = []
    for i in l:
        if i == 0:
            counters.append(counter)
            counter = 0
        else:
            counter += 1
    counters.append(counter)
    return max(counters)
Now you can apply this function to each row and each column of the array, and find the maximum:
longest_occurrences = [find_longest(row) for row in a] + [find_longest(col) for col in a.T]
longest_occurrence = max(longest_occurrences)
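For example, on the simple 5x5 grid from the question (a quick check, assuming a holds the numpy array itself):
import numpy as np

a = np.array([[0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [1, 2, 3, 4, 5],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]])

longest_occurrences = [find_longest(row) for row in a] + [find_longest(col) for col in a.T]
print(max(longest_occurrences))  # 5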