Comparing list elements to sublist elements in Pandas - python

df
col1 col2
['aa', 'bb', 'cc', 'dd'] [['ee', 'ff', 'gg', 'hh'], ['qq', 'ww', 'ee', 'rr']]
['ss', 'dd', 'ff', 'gg'] [['mm', 'nn', 'vv', 'cc'], ['zz', 'aa', 'jj', 'kk']]
['ss', 'dd'] [['mm', 'nn', 'vv', 'cc'], ['zz', 'aa', 'jj', 'kk']]
I'd like to run a function that concatenates the first element of the list in col1 with the first element of each sublist in col2 (there are multiple sublists), the second element of col1 with the second element of each sublist, and so on.
The result would be a column like this:
results
[['aaee', 'bbff', 'ccgg', 'ddhh'],['aaqq', 'bbww', 'ccee', 'ddrr']]
[['ssmm', 'ddnn', 'ffvv', 'ggcc'],['sszz', 'ddaa', 'ffjj', 'ggkk']]
[['ssmm', 'ddnn'],['sszz', 'ddaa']]
I'm thinking it would involve looping through the elements in col1 and somehow matching them to the corresponding items in each sublist in col2 - how can I do this?
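For reference, here is a minimal construction of the sample frame above (a sketch using the column names col1 and col2 from the question), so the answers below can be run directly:
import pandas as pd
df = pd.DataFrame({
    'col1': [['aa', 'bb', 'cc', 'dd'],
             ['ss', 'dd', 'ff', 'gg'],
             ['ss', 'dd']],
    'col2': [[['ee', 'ff', 'gg', 'hh'], ['qq', 'ww', 'ee', 'rr']],
             [['mm', 'nn', 'vv', 'cc'], ['zz', 'aa', 'jj', 'kk']],
             [['mm', 'nn', 'vv', 'cc'], ['zz', 'aa', 'jj', 'kk']]],
})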
Converted code
[[[df1.agg(lambda x: get_top_matches(u,w), axis=1) for u,w in zip(x,v)]\
for v in y] for x,y in zip(df1['parent_org_name_list'], df1['children_org_name_sublists'])]
Results:

You can just use zip here:
[[[u+w for u,w in zip(x,v)] for v in y] for x,y in zip(df['col1'], df['col2'])]
Output:
[[['aaee', 'bbff', 'ccgg', 'ddhh'], ['aaqq', 'bbww', 'ccee', 'ddrr']],
[['ssmm', 'ddnn', 'ffvv', 'ggcc'], ['sszz', 'ddaa', 'ffjj', 'ggkk']],
[['ssmm', 'ddnn'], ['sszz', 'ddaa']]]
To assign back to your dataframe, you can do:
df['results'] = [[[u+w for u,w in zip(x,v)] for v in y]
for x,y in zip(df['col1'], df['col2'])]

Max, try this solution with a loop. It allows finer control over the transformations, including dealing with uneven lengths (see len_limit in the example):
import pandas as pd
df = pd.DataFrame({'c1': [['aa', 'bb', 'cc', 'dd'], ['ss', 'dd', 'ff', 'gg']],
                   'c2': [[['ee', 'ff', 'gg', 'hh'], ['qq', 'ww', 'ee', 'rr']],
                          [['mm', 'nn', 'vv', 'cc'], ['zz', 'aa', 'jj', 'kk']]]})
df['c3'] = 'empty'  # send a string to 'c3' so the column has object dtype
print(df)
c1 c2 c3
0 [aa, bb, cc, dd] [[ee, ff, gg, hh], [qq, ww, ee, rr]] empty
1 [ss, dd, ff, gg] [[mm, nn, vv, cc], [zz, aa, jj, kk]] empty
for i, row in df.iterrows():
    c3_list = []
    len_limit = len(row['c1'])  # cap each sublist at the length of the 'c1' list
    for c2_sublist in row['c2']:
        c3_list.append([j1 + j2 for j1, j2 in zip(row['c1'], c2_sublist[:len_limit])])
    df.at[i, 'c3'] = c3_list
print(df['c3'])
0 [[aaee, bbff, ccgg, ddhh], [aaqq, bbww, ccee, ...
1 [[ssmm, ddnn, ffvv, ggcc], [sszz, ddaa, ffjj, ...
Name: c3, dtype: object

Try:
df["results"] = df[["col1", "col2"]].apply(lambda x: [list(map(''.join, zip(x["col1"], el))) for el in x["col2"]], axis=1)
Outputs:
>>> df["results"]
0 [[aaee, bbff, ccgg, ddhh], [aaqq, bbww, ccee, ...
1 [[ssmm, ddnn, ffvv, ggcc], [sszz, ddaa, ffjj, ...
2 [[ssmm, ddnn], [sszz, ddaa]]

Related

Get the list of values/entries that are common to all the columns in a dataframe (or a csv file) in python

I have this setup of a pandas dataframe (or a csv file):
df = pd.DataFrame({
    'colA': ['aa', 'bb', 'cc', 'dd', 'ee'],
    'colB': ['aa', 'bb', 'dd', 'qq', 'ee'],
    'colC': ['aa', 'bb', 'cc', 'ee', 'dd'],
    'colD': ['aa', 'bb', 'ee', 'cc', 'dd']
})
The goal here is to get a list/column with the set of values that appear in all the columns - in other words, the entries that are common to all the columns.
Required output:
col
aa
bb
dd
ee
or an output with the common values as a list:
common_list = ['aa', 'bb', 'dd', 'ee']
I have a silly solution (but it doesn't seem to be correct, as I am not getting what I want when I apply it to my dataframe):
import pandas as pd
df = pd.read_csv('Bus Names Concat test.csv') #i/p csv file (pandas df converted into csv)
df = df.stack().value_counts()
core_list = df[df>2].index.tolist() #defining the common list as core list
print(len(core_list))
df_core = pd.DataFrame(core_list)
print(df_core)
Any help/suggestion/feedback to get the required output will be appreciated.
In your case, count how many distinct columns each value appears in (stacking with value_counts over-counts values that repeat within a single column):
s = df.melt().groupby('value')['variable'].nunique()
outlist = s[s==4].index.tolist()
Out[307]: ['aa', 'bb', 'dd', 'ee']
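If the number of columns is not fixed at four, a sketch of the same idea that compares against the column count instead of hard-coding 4:
s = df.melt().groupby('value')['variable'].nunique()
common_list = s[s == df.shape[1]].index.tolist()   # values present in every column
print(common_list)   # ['aa', 'bb', 'dd', 'ee']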
You can use .intersection() method of sets to find common values between sets of each column:
# wrapped in a list, take first column set and pass sets of other columns as arguments
common_list = list(set(df.colA).intersection(set(df.colB), set(df.colC), set(df.colD)))
sorted(common_list) # needs sorting in alphabetical order
Output:
['aa', 'bb', 'dd', 'ee']
Alternatively, for an unspecified number of columns and without sorting:
common_list = list(set(df[df.columns[0]])        # first column's set to compare the rest with
                   .intersection(
                       *(set(df[col])            # unpack a generator of sets
                         for col in df.columns[1:])))  # of the remaining columns
common_list
Output:
['dd', 'aa', 'ee', 'bb'] # the order differs because Python sets are unordered
Transpose the dataframe so columns are rows and rows are columns.
Convert the values to a list of lists.
Map each sublist as a Set and unpack the Intersection of the Sets to a List of unique values.
common_list = list(set.intersection(*map(set, df.values.transpose().tolist())))
print(common_list)
['aa', 'bb', 'dd', 'ee']

Pandas, adding multiple columns of list

I have a dataframe like this one
df = pd.DataFrame({'A' : [['a', 'b', 'c'], ['e', 'f', 'g','g']], 'B' : [['1', '4', 'a'], ['5', 'a']]})
I would like to create another column C that will be a column of lists like the others, but this one will be the "union" of the others.
Something like this:
df = pd.DataFrame({'A' : [['a', 'b', 'c'], ['e', 'f', 'g', 'g']], 'B' : [['1', '4', 'a'], ['5', 'a']], 'C' : [['a', 'b', 'c', '1', '4', 'a'], ['e', 'f', 'g', 'g', '5', 'a']]})
But I have hundreds of columns, and C will be the "union" of those hundreds of columns, so I don't want to spell each one out like this:
df['C'] = df['A'] + df['B']
And I don't want to write a for loop, because the dataframes I am manipulating are too big and I want something fast.
Thank you for helping
As you have lists, you cannot vectorize the operation.
A list comprehension might be the fastest:
from itertools import chain
df['out'] = [list(chain.from_iterable(x[1:])) for x in df.itertuples()]
Example:
A B C out
0 [a, b, c] [1, 4, a] [x, y] [a, b, c, 1, 4, a, x, y]
1 [e, f, g, g] [5, a] [z] [e, f, g, g, 5, a, z]
As an alternative to #mozway's answer, you could try something like this:
df = pd.DataFrame({'A': [['a', 'b', 'c'], ['e', 'f', 'g', 'g']], 'B': [['1', '4', 'a'], ['5', 'a']]})
df['C'] = df.sum(axis=1).astype(str)
Summing list columns concatenates them row-wise; drop .astype(str) if you want C to remain a column of lists rather than their string representation.
You can also use the apply method:
df['C'] = df.apply(lambda x: [' '.join(i) for i in list(x[df.columns.to_list()])], axis=1)

Python: how to remove key from list and keep value?

I have an array like this
myarr = [
[{'text':'da','id':'aa','info':'aaa'},{'text':'da','id':'aa','info':'aaa'},{'text':'da','id':'aa','info':'aaa'}],
[{'text':'da','id':'aa','info':'aaa'},{'text':'da','id':'aa','info':'aaa'},{'text':'da','id':'aa','info':'aaa'}]
]
I need this result:
myarr = [
[['da','aa','aaa'],['da','aa','aaa'],['da','aa','aaa']],
[['da','aa','aaa'],['da','aa','aaa'],['da','aa','aaa']]
]
How can I get the sample result? Please help me!
You can try a list comprehension -
# l will iterate over each inner list and
# e will iterate over dictionaries in each inner list
myarr = [[list(e.values()) for e in l] for l in myarr]
print(myarr)
Output:
[[['da', 'aa', 'aaa'], ['da', 'aa', 'aaa'], ['da', 'aa', 'aaa']], [['da', 'aa', 'aaa'], ['da', 'aa', 'aaa'], ['da', 'aa', 'aaa']]]
For some variety, you could also use:
myarr = [[*map(list, map(dict.values, x))] for x in myarr]
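If you want to be explicit about which fields are kept and in what order, a sketch applied to the original myarr, indexing the keys from the sample data directly:
myarr = [[[d['text'], d['id'], d['info']] for d in l] for l in myarr]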

Need to compare a portion of one list with another and see if they have the same order and, if not, which elements are in a different position

I'm trying to compare one list lst with another list lst2 and see whether the values of lst appear as a portion of lst2, in the same order as in lst; if not, I want to return the values that are not in the right position.
These are the examples:
lst = ['a', 'b', 'd', 'c', 'e']
lst2 = ['DD', 'OO', 'CC' ,'WW', 'GG', 'a', 'b', 'c', 'd', 'e', 'AA' 'QQ', 'EE', 'ZZ', 'XX', 'YY', 'UU', 'II', 'OO', 'HH']
Suppose that the lst values will change: it will have a different length, with other string values added in the near future, but the markers 'GG' and 'AA' in lst2 will not change; only the values 'a' to 'e' (i.e. lst) will change, and the process will stay the same.
Is it better to use pandas dataframes, or "\n".join() to build string columns, or just plain lists?
The question is not entirely clear, but as I understand it, you want to find the elements in lst2 that do not appear in lst.
For both approaches below I checked with the example inputs you provided:
lst = ['a', 'b', 'd', 'c', 'e']
lst2 = ['DD', 'OO', 'CC' ,'WW', 'GG', 'a', 'b', 'c', 'd', 'e', 'AA' 'QQ', 'EE', 'ZZ', 'XX', 'YY', 'UU', 'II', 'OO', 'HH']
If the order of appearance must be the same in both lists, you can use a tricky way of manipulating strings, like this:
lst_str = ' '.join(lst)
lst2_str = ' '.join(lst2)
index_found = lst2_str.find(lst_str)
lst4 = lst2_str.split()
if index_found != -1:
    lst4 = (lst2_str[0:index_found] + lst2_str[index_found + len(lst_str):]).split()
output:
['DD', 'OO', 'CC', 'WW', 'GG', 'a', 'b', 'c', 'd', 'e', 'AAQQ', 'EE', 'ZZ', 'XX', 'YY', 'UU', 'II', 'OO', 'HH']
This approach assumes that whitespace is not allowed inside the list elements (you can of course use another separator).
Otherwise, if the order does not matter, a simple list comprehension will do the job:
lst3 = [item for item in lst2 if item not in lst]
output:
['DD', 'OO', 'CC', 'WW', 'GG', 'AAQQ', 'EE', 'ZZ', 'XX', 'YY', 'UU', 'II', 'OO', 'HH']
The two approaches produce different outputs because the order of the elements in lst differs from their order in lst2.
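To also report which elements are out of position (the other part of the question), here is a minimal sketch assuming the block of interest always sits between the fixed markers 'GG' and 'AA'; note that 'AA' and 'QQ' are separate elements here, whereas the sample input joins them into 'AAQQ' because of a missing comma:
lst = ['a', 'b', 'd', 'c', 'e']
lst2 = ['DD', 'OO', 'CC', 'WW', 'GG', 'a', 'b', 'c', 'd', 'e', 'AA', 'QQ',
        'EE', 'ZZ', 'XX', 'YY', 'UU', 'II', 'OO', 'HH']
# slice out the block between the fixed markers 'GG' and 'AA'
block = lst2[lst2.index('GG') + 1:lst2.index('AA')]   # ['a', 'b', 'c', 'd', 'e']
# compare position by position and collect the mismatches as (index, expected, found)
misplaced = [(i, want, got) for i, (want, got) in enumerate(zip(lst, block)) if want != got]
print(misplaced)   # [(2, 'd', 'c'), (3, 'c', 'd')]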

Permute string of Kronecker products

A function I am writing will receive as input a matrix H = A x B x I x I, where each matrix is square of dimension d, the cross denotes the Kronecker product np.kron, and I is the identity np.eye(d). Thus
I = np.eye(d)
H = np.kron(A, B)
H = np.kron(H, I)
H = np.kron(H, I)
Given H in the above form, but without knowledge of A and B, I would like to construct G = I x A x I x B, i.e. the result of
G = np.kron(I, A)
G = np.kron(G, I)
G = np.kron(G, B)
It should be possible to do this by applying some permutation to H. How do I implement that permutation?
Transposing with (2,0,3,1,6,4,7,5) (after expanding to 8 axes) appears to do it:
>>> from functools import reduce
>>>
>>> A = np.random.randint(0,10,(10,10))
>>> B = np.random.randint(0,10,(10,10))
>>> I = np.identity(10, int)
>>> G = reduce(np.kron, (A,B,I,I))
>>> H = reduce(np.kron, (I,A,I,B))
>>>
>>>
>>> (G.reshape(*8*(10,)).transpose(2,0,3,1,6,4,7,5).reshape(10**4,10**4) == H).all()
True
Explanation: Let's look at a minimal example to understand how the Kronecker product relates to reshaping and axis shuffling.
Two 1D factors:
>>> A, B = np.arange(1,5), np.array(list("abcd"), dtype=object)
>>> np.kron(A, B)
array(['a', 'b', 'c', 'd', 'aa', 'bb', 'cc', 'dd', 'aaa', 'bbb', 'ccc',
'ddd', 'aaaa', 'bbbb', 'cccc', 'dddd'], dtype=object)
We can observe that the arrangement is row-major-ish, so if we reshape we actually get the outer product:
>>> np.kron(A, B).reshape(4, 4)
array([['a', 'b', 'c', 'd'],
['aa', 'bb', 'cc', 'dd'],
['aaa', 'bbb', 'ccc', 'ddd'],
['aaaa', 'bbbb', 'cccc', 'dddd']], dtype=object)
>>> np.outer(A, B)
array([['a', 'b', 'c', 'd'],
['aa', 'bb', 'cc', 'dd'],
['aaa', 'bbb', 'ccc', 'ddd'],
['aaaa', 'bbbb', 'cccc', 'dddd']], dtype=object)
If we do the same with factors swapped we get the transpose:
>>> np.kron(B, A).reshape(4, 4)
array([['a', 'aa', 'aaa', 'aaaa'],
['b', 'bb', 'bbb', 'bbbb'],
['c', 'cc', 'ccc', 'cccc'],
['d', 'dd', 'ddd', 'dddd']], dtype=object)
With 2D factors things are similar
>>> A2, B2 = A.reshape(2,2), B.reshape(2,2)
>>>
>>> np.kron(A2, B2)
array([['a', 'b', 'aa', 'bb'],
['c', 'd', 'cc', 'dd'],
['aaa', 'bbb', 'aaaa', 'bbbb'],
['ccc', 'ddd', 'cccc', 'dddd']], dtype=object)
>>> np.kron(A2, B2).reshape(2,2,2,2)
array([[[['a', 'b'],
['aa', 'bb']],
[['c', 'd'],
['cc', 'dd']]],
[[['aaa', 'bbb'],
['aaaa', 'bbbb']],
[['ccc', 'ddd'],
['cccc', 'dddd']]]], dtype=object)
But there is a minor complication in that the corresponding outer product has axes arranged differently:
>>> np.multiply.outer(A2, B2)
array([[[['a', 'b'],
['c', 'd']],
[['aa', 'bb'],
['cc', 'dd']]],
[[['aaa', 'bbb'],
['ccc', 'ddd']],
[['aaaa', 'bbbb'],
['cccc', 'dddd']]]], dtype=object)
We need to swap middle axes to get the same result.
>>> np.multiply.outer(A2, B2).swapaxes(1,2)
array([[[['a', 'b'],
['aa', 'bb']],
[['c', 'd'],
['cc', 'dd']]],
[[['aaa', 'bbb'],
['aaaa', 'bbbb']],
[['ccc', 'ddd'],
['cccc', 'dddd']]]], dtype=object)
So if we want to get to the swapped Kronecker product we can:
swap the middle axes to recover the outer product: (0,2,1,3)
swap the factors, which exchanges the first pair of axes with the second pair; combined with the previous step this gives (1,3,0,2)
go back to Kronecker by swapping the middle axes again
=> total axis permutation: (1,0,3,2)
>>> np.all(np.kron(A2, B2).reshape(2,2,2,2).transpose(1,0,3,2).reshape(4,4) == np.kron(B2, A2))
True
Using the same principles leads to the recipe for the four-factor original question.
This answer expands on Paul Panzer's correct answer to document how one would solve similar problems more generally.
Suppose we wish to map a matrix string reduce(kron, ABCD) into, for example, reduce(kron, CADB), where each matrix is square with d columns. Both strings are thus (d**4, d**4) matrices; alternatively, they are arrays of shape [d]*8.
The way np.kron arranges data means that the index ordering of ABCD corresponds to that of its constituents as follows: D_0 C_0 B_0 A_0 D_1 C_1 B_1 A_1, where for example D_0 (D_1) is the fastest (slowest) oscillating index in D. For CADB the index ordering is instead B_0 D_0 A_0 C_0 B_1 D_1 A_1 C_1; you just read the string backwards, once for the faster and once for the slower indices. The appropriate permutation in this case is thus (2,0,3,1,6,4,7,5).
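Following this recipe, here is a small helper (a sketch; the name permute_kron_factors and its signature are mine, not from either answer) that builds and applies the permutation for an arbitrary reordering of square d x d factors:
import numpy as np
from functools import reduce

def permute_kron_factors(M, d, src, dst):
    # M is the (d**n, d**n) matrix reduce(np.kron, factors) built in the order src;
    # src and dst are the factor labels in the current and in the desired order.
    n = len(src)
    # for each target slot, pick the source slot that supplies it
    # (with repeated labels, e.g. two identities, any consistent matching works)
    remaining = list(range(n))
    p = []
    for label in dst:
        k = next(i for i in remaining if src[i] == label)
        remaining.remove(k)
        p.append(k)
    axes = tuple(p) + tuple(q + n for q in p)   # row axes first, then column axes
    return M.reshape((d,) * (2 * n)).transpose(axes).reshape(d**n, d**n)

# check against the original question: A x B x I x I  ->  I x A x I x B
d = 5
A = np.random.randint(0, 10, (d, d))
B = np.random.randint(0, 10, (d, d))
I = np.eye(d, dtype=int)
H = reduce(np.kron, (A, B, I, I))
G = permute_kron_factors(H, d, ('A', 'B', 'I', 'I'), ('I', 'A', 'I', 'B'))
print((G == reduce(np.kron, (I, A, I, B))).all())   # True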
