How can I split a 2D array by a grouping variable and return a list of arrays? The order of the groups is important.
To show expected outcome, the equivalent in R can be done as
> (A = matrix(c("a", "b", "a", "c", "b", "d"), nr=3, byrow=TRUE)) # input
[,1] [,2]
[1,] "a" "b"
[2,] "a" "c"
[3,] "b" "d"
> (split.data.frame(A, A[,1])) # output
$a
[,1] [,2]
[1,] "a" "b"
[2,] "a" "c"
$b
[,1] [,2]
[1,] "b" "d"
EDIT: To clarify: I'd like to split the array/matrix A into a list of multiple arrays based on the unique values in the first column. That is, split A into one array where the first column has an a, and another where the first column has a b.
I have tried the approach from Python equivalent of R "split"-function, but it gives three arrays:
import numpy as np
import itertools
A = np.array([["a", "b"], ["a", "c"], ["b", "d"]])
b = A[:, 0]
def split(x, f):
    return list(itertools.compress(x, f)), list(itertools.compress(x, (not i for i in f)))
split(A, b)
([array(['a', 'b'], dtype='<U1'),
array(['a', 'c'], dtype='<U1'),
array(['b', 'd'], dtype='<U1')],
[])
I have also tried numpy.split, using np.split(A, b), but it needs integers. I thought I might be able to use How to convert strings into integers in Python? to convert the letters to integers, but even when I pass integers, it doesn't split as expected:
c = np.transpose(np.array([1,1,2]))
np.split(A, c) # returns 4 arrays
Can this be done? Thanks.
EDIT: Please note that this is a small example; the number of groups may be greater than two, and they may not be ordered.
You can use pandas:
import pandas as pd
import numpy as np
a = np.array([["a", "b"], ["a", "c"], ["b", "d"]])
listofdfs = {}
for n, g in pd.DataFrame(a).groupby(0):
    listofdfs[n] = g
listofdfs['a'].values
Output:
array([['a', 'b'],
['a', 'c']], dtype=object)
And,
listofdfs['b'].values
Output:
array([['b', 'd']], dtype=object)
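If you want the result as an ordered list of plain NumPy arrays (matching first-appearance order, per the EDIT in the question), a small variation of the same groupby idea works; `sort=False` is what keeps the groups in the order their keys first occur:

```python
import numpy as np
import pandas as pd

a = np.array([["a", "b"], ["a", "c"], ["b", "d"]])

# sort=False preserves the order in which the group keys first appear
arrays = [g.values for _, g in pd.DataFrame(a).groupby(0, sort=False)]

print(arrays[0].tolist())  # [['a', 'b'], ['a', 'c']]
print(arrays[1].tolist())  # [['b', 'd']]
```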
Or, you could use itertools.groupby (note that it only groups consecutive equal keys, so the rows must already be ordered by the grouping column):
import numpy as np
from itertools import groupby
l = [np.stack(list(g)) for k, g in groupby(a, lambda x: x[0])]
l[0]
Output:
array([['a', 'b'],
['a', 'c']], dtype='<U1')
And,
l[1]
Output:
array([['b', 'd']], dtype='<U1')
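Since itertools.groupby only groups consecutive equal keys, it can split one group into several pieces when the rows are not ordered by the grouping column. A pure-NumPy sketch that handles unordered groups while keeping first-appearance order, using np.unique's return_index:

```python
import numpy as np

a = np.array([["b", "d"], ["a", "b"], ["a", "c"]])  # groups not ordered
keys = a[:, 0]

# indices of the first occurrence of each unique key; sorting them
# recovers the order in which the keys first appear in the array
_, first_idx = np.unique(keys, return_index=True)
ordered_keys = keys[np.sort(first_idx)]

groups = [a[keys == k] for k in ordered_keys]
print([g.tolist() for g in groups])  # [[['b', 'd']], [['a', 'b'], ['a', 'c']]]
```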
If I understand your question correctly, you can use simple slicing, as in:
a = np.array([["a", "b"], ["a", "c"], ["b", "d"]])
x, y = a[:2, :], a[2, :]
x
array([['a', 'b'],
['a', 'c']], dtype='<U1')
y
array(['b', 'd'], dtype='<U1')
I need the code to be in R
Example
I have a list:
[[0,'A',50.1],
[1,'B',50.0],
[2,'C',50.2],
[3,'D',50.7],
[4,'E',50.3]]
I want to order it based on the 3rd element only, so I get a result like this:
[[1,'B',50.0],
[0,'A',50.1],
[2,'C',50.2],
[4,'E',50.3],
[3,'D',50.7]]
and then renumber the first-column indexes so the final result would be:
Final = [[0,'B',50.0],
[1,'A',50.1],
[2,'C',50.2],
[3,'E',50.3],
[4,'D',50.7]]
and then I have the indexes in some grouping
G = [[0,1],[1,3],[2,3,4]]
Based on G as indexes into Final, I want the grouping to be:
[['B','A'],['A','E'],['C','E','D']]
I already have the code in python, but I need the same code in R
L = [[i, *x[1:]] for i, x in enumerate(sorted(L, key=lambda x: x[2]))]
print (L)
[[0, 'B', 50.0], [1, 'A', 50.1], [2, 'C', 50.2], [3, 'E', 50.3], [4, 'D', 50.7]]
out = [[L[y][1] for y in x] for x in G]
print (out)
[['B', 'A'], ['A', 'E'], ['C', 'E', 'D']]
You can try:
LL <- L |>
  as.data.frame() |>
  arrange(x) |>
  mutate(id = sort(L$id))
lapply(G, \(x) LL$v1[LL$id %in% x])
[[1]]
[1] "B" "A"
[[2]]
[1] "A" "E"
[[3]]
[1] "C" "E" "D"
Data:
L <- list(id=0:4, v1=LETTERS[1:5], x = c(50.1, 50.0, 50.2, 50.7, 50.3))
G <- list(c(0,1), c(1,3), c(2,3,4))
Libraries:
library(dplyr)
This question already has answers here:
How can I "zip sort" parallel numpy arrays?
(5 answers)
Closed 3 years ago.
I have tried converting my array to a 2-D array and using np.sort and np.lexsort, but have not had any luck.
import numpy as np
# Here are the 2 arrays I would like to sort b using a.
a = np.array([6,5,3,4,1,2])
b = np.array(["x","y","z","a","b","c"])
Is it possible to sort b using a?
When printing b the output should be:
["b", "c", "z", "a", "y", "x"]
You can use NumPy fancy indexing together with np.argsort:
In [1]: import numpy as np
   ...:
   ...: # Here are the 2 arrays I would like to sort b using a.
   ...: a = np.array([6,5,3,4,1,2])
   ...: b = np.array(["x","y","z","a","b","c"])
In [2]: b[np.argsort(a)]
Out[2]: array(['b', 'c', 'z', 'a', 'y', 'x'], dtype='<U1')
np.argsort(a) returns the indices that would sort a, so indexing b with them reorders b to match, which gives exactly your desired output. (Note that b[a - 1] would instead apply the inverse permutation, giving c, b, z, a, x, y.)
I have 3 different lists of unequal length.
I want to append the shorter lists with "X" and make sizes equal to the length of the longest list.
A = [10,20,30,40,50]
B = ["A", "B", "C"]
C = ["X1", "X2"]
After appending "X" , it should be like the following:
A = [10,20,30,40,50]
B = ["A", "B", "C", "X","X"]
C = ["P1", "P2", "X", "X", "X"]
I have used the below code for achieving it:
for i, a in enumerate(A):
    if i < len(B):
        pass
    else:
        B.append('X')
How can I do it efficiently in Python?
Use the extend method:
B.extend(['X'] * (len(A)-len(B)))
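A minimal sketch applying the same idea to all three lists from the question (padding each shorter list in place up to the longest length):

```python
A = [10, 20, 30, 40, 50]
B = ["A", "B", "C"]
C = ["X1", "X2"]

target = max(len(A), len(B), len(C))
for lst in (B, C):
    # extend mutates the list in place; A is already the longest
    lst.extend(['X'] * (target - len(lst)))

print(B)  # ['A', 'B', 'C', 'X', 'X']
print(C)  # ['X1', 'X2', 'X', 'X', 'X']
```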
Calculate the max length and, for each list, append the difference. In Python, lists support the binary operator + to concatenate lists, as well as * to tile a list.
A = [10,20,30,40,50]
B = ["A", "B", "C"]
C = ["X1", "X2"]
max_length = max(len(A), len(B), len(C))
A += ['X'] * (max_length - len(A))
B += ['X'] * (max_length - len(B))
C += ['X'] * (max_length - len(C))
Then organize them in a container list, for less repeated code and better extensibility.
A = [10,20,30,40,50]
B = ["A", "B", "C"]
C = ["X1", "X2"]
arrays = [A, B, C]
max_length = 0
for array in arrays:
    max_length = max(max_length, len(array))
for array in arrays:
    array += ['X'] * (max_length - len(array))
Result:
print(A) # [10, 20, 30, 40, 50]
print(B) # ['A', 'B', 'C', 'X', 'X']
print(C) # ['X1', 'X2', 'X', 'X', 'X']
The Python itertools module has a lot of nifty functions that are good for cases like this. For example, using itertools.zip_longest (on Python 2 the equivalents were izip_longest and izip):
>>> from itertools import zip_longest
>>> A = [10, 20, 30, 40, 50]
>>> B = ["A", "B", "C"]
>>> C = ["X1", "X2"]
>>> A, B, C = (list(x) for x in zip(*zip_longest(A, B, C, fillvalue='X')))
>>> A
[10, 20, 30, 40, 50]
>>> B
['A', 'B', 'C', 'X', 'X']
>>> C
['X1', 'X2', 'X', 'X', 'X']
Write a function that does this for you:
A = [10, 20, 30, 40, 50]
B = ["A", "B", "C"]
C = ["X1", "X2"]
def extend_with_extra_elements(*some_lists):
    max_some_lists_length = max(map(len, some_lists))
    for some_list in some_lists:
        extra_elements_count = max_some_lists_length - len(some_list)
        extra_elements = ['X'] * extra_elements_count
        yield some_list + extra_elements
A, B, C = extend_with_extra_elements(A, B, C)
This is efficient enough.
Use max() to get the maximum length, then pad B and C up to that length. If you want to replace X with P, a list comprehension [i.replace('X','P') for i in C] gives ['P1','P2']:
>>> m=max(len(A),len(B),len(C))
>>> B+['X']*(m-len(B))
['A', 'B', 'C', 'X', 'X']
>>> [i.replace('X','P') for i in C]+['X']*(m-len(C))
['P1', 'P2', 'X', 'X', 'X']
So I have many pandas data frames with 3 columns of categorical variables:
D F False
T F False
D F False
T F False
The first and second columns can each take one of three values; the third is binary. So there are a grand total of 18 possible rows (not all combinations may be represented in each data frame).
I would like to assign a number 1-18 to each row, so that rows with the same combination of factors are assigned the same number and vice versa (no hash collisions).
What is the most efficient way to do this in pandas?
So, all_combination_df is a df with every possible combination of the factors. I am trying to turn a df such as big_df into a Series of such unique numbers:
import pandas, itertools
def expand_grid(data_dict):
    """Create a dataframe from every combination of given values."""
    rows = itertools.product(*data_dict.values())
    return pandas.DataFrame.from_records(rows, columns=data_dict.keys())
all_combination_df = expand_grid(
{'variable_1': ['D', 'A', 'T'],
'variable_2': ['C', 'A', 'B'],
'variable_3' : [True, False]})
big_df = pandas.concat([all_combination_df, all_combination_df, all_combination_df])
UPDATE: as @user189035 mentioned in the comments, it's much better to use the categorical dtype, as it'll save a lot of memory.
I would try to use factorize method:
In [112]: df['category'] = \
...: pd.Categorical(
...: pd.factorize((df.a + '~' + df.b + '~' + (df.c*1).astype(str)))[0])
...:
In [113]: df
Out[113]:
a b c category
0 A X True 0
1 B Y False 1
2 A X True 0
3 C Z False 2
4 A Z True 3
5 C Z True 4
6 B Y False 1
7 C Z False 2
In [114]: df.dtypes
Out[114]:
a object
b object
c bool
category category
dtype: object
Explanation: in this simple way we can glue all the columns into a single series:
In [115]: df.a + '~' + df.b + '~' + (df.c*1).astype(str)
Out[115]:
0 A~X~1
1 B~Y~0
2 A~X~1
3 C~Z~0
4 A~Z~1
5 C~Z~1
6 B~Y~0
7 C~Z~0
dtype: object
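If you are on a reasonably recent pandas, groupby(...).ngroup() produces the same kind of per-combination integer directly, without gluing the columns into strings first. A sketch on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'A', 'C'],
                   'b': ['X', 'Y', 'X', 'Z'],
                   'c': [True, False, True, False]})

# sort=False numbers the combinations in order of first appearance,
# so identical rows always get the same integer
df['category'] = df.groupby(['a', 'b', 'c'], sort=False).ngroup()
print(df['category'].tolist())  # [0, 1, 0, 2]
```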
Without taking into account issues of efficiency, this would find duplicate rows and give you a dictionary (similar to the question here).
import pandas as pd, numpy as np
# Define data
d = np.array([["D", "T", "D", "T", "U"],
["F", "F", "F", "J", "K"],
[False, False, False, False, True]])
df = pd.DataFrame(d.T)
# Find and remove duplicate rows
df_nodupe = df[~df.duplicated()]
# Make a dict keyed by the original row index
df_nodupe.T.to_dict('list')
{0: ['D', 'F', 'False'],
1: ['T', 'F', 'False'],
3: ['T', 'J', 'False'],
4: ['U', 'K', 'True']}
Otherwise, you could use map, like so:
import pandas as pd, numpy as np
# Define data
d = np.array([["D", "T", "D", "T", "U"],
["F", "F", "F", "J", "K"],
[False, False, False, False, True]])
df = pd.DataFrame(d.T)
df.columns = ['x', 'y', 'z']
# Define your dictionary of interest
dd = {('D', 'F', 'False'): 0,
('T', 'F', 'False'): 1,
('T', 'J', 'False'): 2,
('U', 'K', 'True'): 3}
# Create a tuple of the rows of interest
df['tupe'] = list(zip(df.x, df.y, df.z))
# Create a new column based on the row values
df['new_category'] = df.tupe.map(dd)
I have two arrays, and want to combine the ith elements in each one together:
import numpy as np
a = np.array(['a', 'b', 'c'])
b = np.array(['x', 'y', 'z'])
I want to return
array(['ax', 'by', 'cz'])
What is the function for this? Thanks!
>>> import numpy as np
>>> a = np.array(['a', 'b', 'c'])
>>> b = np.array(['x', 'y', 'z'])
>>> c = np.array([i+j for i, j in zip(a, b)])
>>> c
array(['ax', 'by', 'cz'],
dtype='|S2')
@DSM makes the point that if a and b had dtype=object, you could simply add the two arrays together:
>>> a = np.array(["a", "b", "c"], dtype=object)
>>> b = np.array(["x", "y", "z"], dtype=object)
>>> c = a + b
>>> c
array(['ax', 'by', 'cz'], dtype=object)
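For what it's worth, NumPy also ships a vectorized string-concatenation routine in np.char, which avoids the Python-level loop of the zip approach:

```python
import numpy as np

a = np.array(['a', 'b', 'c'])
b = np.array(['x', 'y', 'z'])

# element-wise string concatenation
c = np.char.add(a, b)
print(c.tolist())  # ['ax', 'by', 'cz']
```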