element-wise merge np.array values in multiple pandas columns - python

I have a pandas DataFrame in which several columns contain np.array values. I would like to merge these arrays into a single array, element-wise per row.
e.g.
col1 col2 col3
[2.1, 3] [4, 4] [2, 3]
[4, 5] [6, 7] [9, 9]
[7, 8] [8, 9] [5, 4]
... ... ...
expected result:
col_f
[2.1, 3, 4, 4, 2, 3]
[4, 5, 6, 7, 9, 9]
[7, 8, 8, 9, 5, 4]
........
I use a for loop to do it, but I am wondering whether there is a more elegant way.
Below is my for-loop code:
f_vector = []
for i in range(len(df.index)):
    vector = np.hstack((df['A0_vector'][i], df['A1_vector'][i], df['A2_vector'][i],
                        df['A3_vector'][i], df['A4_vector'][i], df['A5_vector'][i]))
    f_vector.append(vector)
X = np.array(f_vector)

You can use numpy.concatenate with apply along axis=1:
import numpy as np
df['col_f'] = df[['col1', 'col2', 'col3']].apply(np.concatenate, axis=1)
If those were lists instead of np.arrays, the + operator would have worked:
df['col_f'] = df['col1'] + df['col2'] + df['col3']
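For completeness, here is a runnable sketch of the apply-based approach on a small made-up DataFrame that mirrors the question's layout (the col1-col3 names come from the example above):

```python
import numpy as np
import pandas as pd

# Small example DataFrame in the shape shown in the question.
df = pd.DataFrame({
    'col1': [np.array([2.1, 3]), np.array([4, 5])],
    'col2': [np.array([4, 4]), np.array([6, 7])],
    'col3': [np.array([2, 3]), np.array([9, 9])],
})

# np.concatenate receives each row (a sequence of arrays) and joins them.
df['col_f'] = df[['col1', 'col2', 'col3']].apply(np.concatenate, axis=1)
print(df['col_f'].tolist())
```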


how to detect rows are subset of other rows and delete them in pandas series

I have a large pandas Series in which each row is a list of numbers.
I want to detect rows that are subsets of other rows and delete them from the Series.
My current solution uses two for loops, but it is very slow. Can anyone suggest a faster way?
For example, in the sample below we must delete rows 1 and 3 (0-based) because they are subsets of rows 0 and 2 respectively.
import pandas as pd
cycles = pd.Series([[1, 2, 3, 4], [3, 4], [5, 6, 9, 7], [5, 9]])
First, you could sort each list (the elements are numbers) and convert it to a string. Then, for every string, check whether it is a substring of any other row; if so, it is a subset. Since everything is sorted, the order of the numbers will not affect this step.
Finally, keep only the rows not identified as subsets.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'cycles': [[9, 5, 4, 3], [9, 5, 4], [2, 4, 3], [2, 3]],
    'members': [4, 3, 3, 2]
})
print(df)
cycles members
0 [9, 5, 4, 3] 4
1 [9, 5, 4] 3
2 [2, 4, 3] 3
3 [2, 3] 2
df['cycles'] = df['cycles'].map(np.sort)
df['cycles_str'] = [','.join(map(str, c)) for c in df['cycles']]
# Here we check if matches are >1, because it will match with itself once!
df['is_subset'] = [df['cycles_str'].str.contains(c_str).sum() > 1 for c_str in df['cycles_str']]
df = df.loc[df['is_subset'] == False]
df = df.drop(['cycles_str', 'is_subset'], axis=1)
cycles members
0 [3, 4, 5, 9] 4
2 [2, 3, 4] 3
Edit - The above doesn't work for [1, 2, 4] & [1, 2, 3, 4]
Rewrote the code. This uses 2 loops and set to check for subsets using list comprehension:
# check if >1 True, as it will match with itself once!
df['is_subset'] = [[set(y).issubset(set(x)) for x in df['cycles']].count(True)>1 for y in df['cycles']]
df = df.loc[df['is_subset'] == False]
df = df.drop('is_subset', axis=1)
print(df)
cycles members
0 [9, 5, 4, 3] 4
2 [2, 4, 3] 3
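Applied to the Series from the original question, the set-based check can be sketched end-to-end like this:

```python
import pandas as pd

cycles = pd.Series([[1, 2, 3, 4], [3, 4], [5, 6, 9, 7], [5, 9]])

# A row is a subset of another row when issubset() holds for more than
# one row -- it always matches itself exactly once.
is_subset = [sum(set(y).issubset(set(x)) for x in cycles) > 1 for y in cycles]
filtered = cycles[[not s for s in is_subset]]
print(filtered.tolist())
```

Note that two identical rows would each count as a subset of the other and both be dropped; deduplicate first if that matters for your data.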

extract elements of tuple from a pandas series

I have a pandas Series whose values are tuples of lists. Each tuple has length exactly 2, and there are a bunch of NaNs. I am trying to split each list in the tuple into its own column.
import pandas as pd
import numpy as np
df = pd.DataFrame({'val': [([1, 2, 3], [4, 5, 6]),
                           ([7, 8, 9], [10, 11, 12]),
                           np.nan]})
Expected Output:
If you know the tuples have length exactly 2, you can do:
df["x"] = df.val.str[0]
df["y"] = df.val.str[1]
print(df[["x", "y"]])
Prints:
x y
0 [1, 2, 3] [4, 5, 6]
1 [7, 8, 9] [10, 11, 12]
2 NaN NaN
You could also convert the column to a list and pass it to the DataFrame constructor (filling None with np.nan as well):
out = pd.DataFrame(df['val'].tolist(), columns=['x','y']).fillna(np.nan)
Output:
x y
0 [1, 2, 3] [4, 5, 6]
1 [7, 8, 9] [10, 11, 12]
2 NaN NaN
One way using pandas.Series.apply:
new_df = df["val"].apply(pd.Series)
print(new_df)
Output:
0 1
0 [1, 2, 3] [4, 5, 6]
1 [7, 8, 9] [10, 11, 12]
2 NaN NaN
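As a self-contained check, the .str accessor approach runs like this on the question's data (NaN rows simply stay NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [([1, 2, 3], [4, 5, 6]),
                           ([7, 8, 9], [10, 11, 12]),
                           np.nan]})

# .str indexes into each tuple positionally; missing values propagate.
df['x'] = df['val'].str[0]
df['y'] = df['val'].str[1]
print(df[['x', 'y']])
```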

Selecting a range of columns in Python without using numpy

I want to extract a range of columns. I know how to do that with numpy, but I don't want to use numpy's slicing operator.
import numpy as np
a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
arr = np.array(a)
k = 0
print(arr[k:, k+1]) # --> [2 7]
print([[a[r][n+1] for n in range(0,k+1)] for r in range(k,len(a))][0]) # --> [2]
What's wrong with the second statement?
You're overcomplicating it. Get the rows with a[k:], then get a cell with row[k+1].
>>> [row[k+1] for row in a[k:]]
[2, 7]
a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
k = 0
print(list(list(zip(*a[k:]))[k+1])) # [2, 7]
Is this what you're looking for?
cols = [1,2,3] # extract middle 3 columns
cols123 = [[l[col] for col in cols] for l in a]
# [[2, 3, 4], [7, 8, 9]]
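Putting the two working approaches side by side in one runnable snippet (the column range 1..3 is just an illustration):

```python
a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
k = 0

# Single column k+1, taken from rows k onward.
col = [row[k + 1] for row in a[k:]]
print(col)

# A range of columns via zip: transpose the rows, then slice columns 1..3.
cols = [list(c) for c in zip(*a)][1:4]
print(cols)
```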

Maximum of an array constituting a pandas dataframe cell

I have a pandas dataframe in which a column is formed by arrays. So every cell is an array.
Say there is a column A in dataframe df, such that
A = [ [1, 2, 3],
[4, 5, 6],
[7, 8, 9],
... ]
I want to operate in each array and get, e.g. the maximum of each array, and store it in another column.
In the example, I would like to obtain another column
B = [ 3,
6,
9,
...]
I have tried these approaches so far, none of which gives what I want:
df['B'] = np.max(df['A'])
df.applymap(lambda B: A.max())
df['B'] = df.applymap(lambda B: np.max(np.array(df['A'].tolist()), 0))
How should I proceed? And is this the best way to have my dataframe organized?
You can just apply(max). It doesn't matter whether the values are lists or np.arrays.
df = pd.DataFrame({'a': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
df['b'] = df['a'].apply(max)
print(df)
Outputs
a b
0 [1, 2, 3] 3
1 [4, 5, 6] 6
2 [7, 8, 9] 9
Here is one way without apply (this builds a 2-D array, so it requires all the lists to have the same length):
df['B'] = np.max(df['A'].values.tolist(), axis=1)
A B
0 [1, 2, 3] 3
1 [4, 5, 6] 6
2 [7, 8, 9] 9
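Both answers can be verified on the same toy data; note that only the apply variant tolerates lists of different lengths:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})

# Per-cell maximum; works whatever the per-row list lengths are.
df['B'] = df['A'].apply(max)

# Vectorized variant: builds a 2-D array, so rows must be equally long.
b_vec = np.max(np.array(df['A'].tolist()), axis=1)
print(df['B'].tolist(), b_vec.tolist())
```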

Extracting from lists of pandas series to another based on indexes

I have a pandas DataFrame with 2 Series, each of which contains 2-D lists.
a is the first Series; its sub-lists have varying lengths:
a:
0 [[1,2,3,4,5,6,7,7],[1,2,3,4,5],[5,9,3,2]]
1 [[1,2,3],[6,7],[8,9,10]]
b is the second Series; each of its sub-lists holds a single element:
b:
0 [[0],[2],[3]]
1 [[1],[0],[1]]
I want to extract elements of the a Series based on the indexes given in b.
The result for the above example should be:
0 [1,3,2]
1 [2, 6, 9]
Can anyone please help? Thanks a lot
Setup
a = pd.Series({0: [[1, 2, 3, 4, 5, 6, 7, 7], [1, 2, 3, 4, 5], [5, 9, 3, 2]],
               1: [[1, 2, 3], [6, 7], [8, 9, 10]]})
b = pd.Series({0: [[0], [2], [3]], 1: [[1], [0], [1]]})
Difficult to make this efficient since you have lists of varying sizes, but here's a solution using a list comprehension and zip:
out = pd.Series([[x[y] for x, [y] in zip(i, j)] for i, j in zip(a, b)])
0 [1, 3, 2]
1 [2, 6, 9]
dtype: object
You can use apply to index a with b:
df.apply(lambda row: [row.a[i][row.b[i][0]] for i in range(len(row.a))], axis=1)
0 [1, 3, 2]
1 [2, 6, 9]
dtype: object
Data:
data = {"a": [[[1, 2, 3, 4, 5, 6, 7, 7], [1, 2, 3, 4, 5], [5, 9, 3, 2]],
              [[1, 2, 3], [6, 7], [8, 9, 10]]],
        "b": [[[0], [2], [3]],
              [[1], [0], [1]]]}
df = pd.DataFrame(data)
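Collecting the setup and the zip-based one-liner into one self-contained snippet:

```python
import pandas as pd

a = pd.Series({0: [[1, 2, 3, 4, 5, 6, 7, 7], [1, 2, 3, 4, 5], [5, 9, 3, 2]],
               1: [[1, 2, 3], [6, 7], [8, 9, 10]]})
b = pd.Series({0: [[0], [2], [3]], 1: [[1], [0], [1]]})

# Pair each sub-list of a with the matching sub-list of b,
# then index a's sub-list by the single element of b's.
out = pd.Series([[x[y] for x, [y] in zip(i, j)] for i, j in zip(a, b)])
print(out.tolist())
```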
