I have a dataframe like this one:
df = pd.DataFrame({'A' : [['a', 'b', 'c'], ['e', 'f', 'g','g']], 'B' : [['1', '4', 'a'], ['5', 'a']]})
I would like to create another column, C, that is a column of lists like the others, but is the "union" (concatenation) of the others.
Something like this:
df = pd.DataFrame({'A' : [['a', 'b', 'c'], ['e', 'f', 'g', 'g']], 'B' : [['1', '4', 'a'], ['5', 'a']], 'C' : [['a', 'b', 'c', '1', '4', 'a'], ['e', 'f', 'g', 'g', '5', 'a']]})
But I have hundreds of columns, and C should be the "union" of all of them, so I don't want to index each column like this:
df['C'] = df['A'] + df['B']
And I don't want to use a for loop, because the dataframes I am manipulating are too big and I want something fast.
Thank you for helping.
As you have lists, you cannot vectorize the operation.
A list comprehension might be the fastest:
from itertools import chain
df['out'] = [list(chain.from_iterable(x[1:])) for x in df.itertuples()]
Example:
A B C out
0 [a, b, c] [1, 4, a] [x, y] [a, b, c, 1, 4, a, x, y]
1 [e, f, g, g] [5, a] [z] [e, f, g, g, 5, a, z]
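If you would rather not slice off the index on every row, itertuples(index=False) yields just the column values; an equivalent variation of the same idea (my sketch, not part of the original snippet):
out = [list(chain.from_iterable(row)) for row in df.itertuples(index=False)]
df['out'] = out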
As an alternative to @mozway's answer, you could try something like this:
df = pd.DataFrame({'A': [['a', 'b', 'c'], ['e', 'f', 'g', 'g']], 'B': [['1', '4', 'a'], ['5', 'a']]})
df['C'] = df.sum(axis=1)
Use astype beforehand as required to make the list contents consistent.
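This works because + concatenates Python lists, so the row-wise sum chains the per-column lists together (though it still loops in Python under the hood); a quick check:
assert ['a', 'b', 'c'] + ['1', '4', 'a'] == ['a', 'b', 'c', '1', '4', 'a']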
You can use the apply method:
df['C'] = df.apply(lambda row: [item for lst in row for item in lst], axis=1)
Note that apply with axis=1 is itself a row-wise Python loop, so it will not be faster than the list comprehension above.
I'm new to Python and I'm trying to merge three different lists into one list based on the index value, as shown in the example below:
All three lists are of same size.
A=['ABC', 'PQR', 'MNO']
B=['X', 'Y', 'Z']
C=['1','2','3']
The output that I want is:
P=[['ABC', 'X', '1'],['PQR', 'Y', '2'],['MNO', 'Z', '3']]
Thanks in advance.
I usually do it with numpy, as it is a simple transpose, and it works with as many lists as you throw at it:
import numpy as np
A = ['ABC', 'PQR', 'MNO']
B = ['X', 'Y', 'Z']
C = ['1', '2', '3']
lists = [A, B, C]
numpy_array = np.array(lists)
transpose = numpy_array.T
transpose_list = transpose.tolist()
print(transpose_list)
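The same thing as a one-liner, for reference (note that np.array requires all the lists to have equal length, which the question guarantees):
print(np.array([A, B, C]).T.tolist())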
Here is a solution using a for loop with the range() function:
A=['ABC', 'PQR', 'MNO']
B=['X', 'Y', 'Z']
C=['1','2','3']
list1=[]
for i in range(len(A)):
    list1.append([A[i], B[i], C[i]])
display(list1)
OUTPUT:
[['ABC', 'X', '1'], ['PQR', 'Y', '2'], ['MNO', 'Z', '3']]
Using for loop with the zip() function:
l=[]
for a, b, c in zip(A, B, C):
    l.append([a, b, c])
display(l)
OUTPUT:
[['ABC', 'X', '1'], ['PQR', 'Y', '2'], ['MNO', 'Z', '3']]
You don't want to use a for loop?
Then here is the map() function for you:
result = list(map(lambda a, b, c: [a, b, c], A, B, C))
display(result)
OUTPUT:
[['ABC', 'X', '1'], ['PQR', 'Y', '2'], ['MNO', 'Z', '3']]
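Equivalently, the lambda can be dropped, since each zipped tuple only needs converting to a list:
result = list(map(list, zip(A, B, C)))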
You can use a list comprehension to get the desired output:
a = [[x, y, z] for x, y, z in zip(A, B, C)]
print(a)
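One caveat worth noting (my addition): if the lists could ever differ in length, zip() silently truncates to the shortest; itertools.zip_longest pads with None instead (not needed here, since the question says all lists are the same size):
from itertools import zip_longest
a = [list(t) for t in zip_longest(A, B, C)]  # pads missing slots with None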
I am trying to complete a task as part of a larger project at my workplace, and I have a working solution for the problem, but due to its time complexity it takes an infeasibly long time to run (the dataframe is several million rows long). It is not a one-time task and has to be run daily.
Objective: Given a table with two columns, 'a' and 'b', where 'a' has single strings as values and 'b' has lists of strings as values, merge rows whenever an item in 'b' of one row matches an item in 'b' of another row, so that in the merged table both 'a' and 'b' are lists of items.
Example 1:
Input Table:
a b
0 1 [a, b, e]
1 2 [a, g]
2 3 [c, f]
3 4 [d]
4 5 [b]
Required Output:
a b
0 [1, 2, 5] [a, b, e, g]
1 [3] [c, f]
2 [4] [d]
Example 2:
Input Table:
a b
0 1 [a, b, e]
1 3 [a, g, f]
2 4 [c, f]
3 6 [d, h]
4 9 [b, g, h]
Required Output:
a b
0 [1, 3, 4, 6, 9] [a, b, c, d, e, f, g, h]
The working solution I have:
import pandas as pd

def merge_rows(df):
    df_merged = pd.DataFrame(columns=df.columns)
    matched = False
    while len(df) > 0:
        if not matched:
            # start a new merged row from the first remaining input row
            x = len(df_merged)
            df_merged.loc[x, 'a'] = list(df.iloc[0, 0])
            df_merged.loc[x, 'b'] = df.iloc[0, 1]
            df = df.iloc[1:, :]
        for rm in range(len(df_merged)):
            matched = False
            right_b_lists_of_lists = df.b.tolist()
            df.reset_index(drop=True, inplace=True)
            # indices of the remaining rows whose 'b' shares an item with this merged row
            match_index_list = [i for b_part in df_merged.loc[rm, 'b']
                                for (i, b_list) in enumerate(right_b_lists_of_lists)
                                if b_part in b_list]
            df_matches = df.loc[match_index_list]
            if len(df_matches) > 0:
                df_merged.loc[rm, 'a'] = list(set(df_merged.loc[rm, 'a'] + df_matches.a.tolist()))
                df_merged.loc[rm, 'b'] = list(set(df_merged.loc[rm, 'b'] + [item for sublist in df_matches.b.tolist() for item in sublist]))
                df = df.drop(df_matches.index)
                matched = True
                break
    return df_merged
df1 = pd.DataFrame({'a': ['1', '2', '3', '4', '5'], 'b': [['a', 'b', 'e'], ['a', 'g'], ['c', 'f'], ['d'], ['b']]})
df1_merged = merge_rows(df1)
print('Original DF:')
print(df1.to_string())
print('Merged DF:')
print(df1_merged.to_string())
df2 = pd.DataFrame({'a': ['1', '3', '4', '6', '9'], 'b': [['a', 'b', 'e'], ['a', 'g', 'f'], ['c', 'f'], ['d', 'h'], ['b', 'g', 'h']]})
df2_merged = merge_rows(df2)
print('Original DF:')
print(df2.to_string())
print('Merged DF:')
print(df2_merged.to_string())
The above code prints the following:
Original DF:
a b
0 1 [a, b, e]
1 2 [a, g]
2 3 [c, f]
3 4 [d]
4 5 [b]
Merged DF:
a b
0 [1, 2, 5] [e, b, a, g]
1 [3] [c, f]
2 [4] [d]
Original DF:
a b
0 1 [a, b, e]
1 3 [a, g, f]
2 4 [c, f]
3 6 [d, h]
4 9 [b, g, h]
Merged DF:
a b
0 [4, 3, 6, 9, 1] [e, h, c, g, f, d, b, a]
Note that the lists in 'a' and 'b' in the output from the above code are not sorted, but that is acceptable.
This solution is practically infeasible: its average-case time complexity is O(n^2), I cannot see a way to parallelise it, n is large, and it has to run daily on the machine I have available.
Any help with either a linearithmic solution or a parallelisable polynomial solution (or better!) would be greatly appreciated!
A solution in Python is preferred, but I would welcome a solution in R / C / C++ / Java / P.
Here is an implementation using the idea of the disjoint-set (union-find) structure.
Note that there are many ways to make it more efficient (and there could be bugs too).
At least it works on both test cases, and it runs about 10x faster than the original function from the question on my laptop.
import pandas as pd

def merge_rows2(df):
    parents = {}  # maps each element to its parent member
    for row in df.values:
        elems = row[1]
        if len(elems) < 1:
            continue  # edge case, empty letter list
        for elem in elems:
            if elem not in parents:  # new letter
                parents[elem] = elems[0]  # register the first element as the parent
            else:  # this letter has already been seen
                # find the root parent
                p = parents[elem]
                path = [elem]
                while True:
                    path.append(p)
                    if p == parents[p]:
                        break
                    p = parents[p]
                # map to the new parent, two sets merged
                parents[p] = elems[0]
                # path compression, for fast access next time
                for e in path:
                    parents[e] = elems[0]
    #print(parents)  # debug
    # make sure all elements map directly to the root
    for e, p in parents.items():
        if e == p:  # root node
            continue
        # find the root node
        path = [e]
        while True:
            path.append(p)
            if p == parents[p]:
                break
            p = parents[p]
        # path compression
        for e2 in path:
            parents[e2] = p
    #print(parents)  # debug
    groups = {}
    for e, p in parents.items():
        if p in groups:
            groups[p].append(e)
        else:
            groups[p] = [e]
    #print(groups)  # debug
    # collect the 'a' values for each group
    values = {g: [] for g in groups}
    for row in df.values:
        elems = row[1]
        if len(elems) < 1:
            continue
        p = parents[elems[0]]  # group identity
        values[p].append(row[0])
    # make the data frame
    rows = [{"a": values[g], "b": groups[g]} for g in groups]
    return pd.DataFrame(rows)
# test
df1 = pd.DataFrame({'a': ['1', '2', '3', '4', '5'], 'b': [['a', 'b', 'e'], ['a', 'g'], ['c', 'f'], ['d'], ['b']]})
print(merge_rows2(df1))
# a b
#0 [1, 2, 5] [a, b, e, g]
#1 [3] [c, f]
#2 [4] [d]
df2 = pd.DataFrame({'a': ['1', '3', '4', '6', '9'], 'b': [['a', 'b', 'e'], ['a', 'g', 'f'], ['c', 'f'], ['d', 'h'], ['b', 'g', 'h']]})
print(merge_rows2(df2))
# a b
#0 [1, 3, 4, 6, 9] [a, b, e, g, f, c, d, h]
%timeit merge_rows(df1)
%timeit merge_rows2(df1)
#7.47 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#365 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit merge_rows(df2)
%timeit merge_rows2(df2)
#4.1 ms ± 90.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#351 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
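Not part of the answer above, but worth noting for very large inputs: this is exactly a connected-components problem, so you could also lean on scipy's graph routines. A minimal sketch under that assumption (merge_rows_cc is my own name; it assumes scipy is installed and that every row has a non-empty 'b' list):
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def merge_rows_cc(df):
    # build a sparse row-by-item incidence matrix
    item_ids = {item: j for j, item in enumerate({x for lst in df['b'] for x in lst})}
    pairs = [(i, item_ids[x]) for i, lst in enumerate(df['b']) for x in lst]
    r, c = zip(*pairs)
    incidence = coo_matrix((np.ones(len(pairs)), (r, c)),
                           shape=(len(df), len(item_ids)))
    # two rows are connected iff they share at least one item
    _, labels = connected_components(incidence @ incidence.T, directed=False)
    merged = df.groupby(labels).agg(
        a=('a', list),
        b=('b', lambda s: sorted({x for lst in s for x in lst})))
    return merged.reset_index(drop=True)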
This uses pure Python rather than Pandas, but you might need a more representative example dataset to truly see which is faster, as it makes heavy use of dicts and sets, which have different time and memory use characteristics.
The consolidation function I copied from my Set consolidation task on Rosetta Code.
Code
# -*- coding: utf-8 -*-
"""
Answering:
"Efficient algorithm to merge rows of a table based on matching items from a list in a column"
https://stackoverflow.com/questions/62817492/efficient-algorithm-to-merge-rows-of-a-table-based-on-matching-items-from-a-list

Created on Fri Jul 10 04:49:26 2020

@author: Paddy3118
"""

#%%
from collections import defaultdict
from pprint import pprint as pp

def consolidate(sets):
    "Merge any sets that share at least one element."
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i + 1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

#%%
dat1 = {'a': ['1', '2', '3', '4', '5'],
        'b': [['a', 'b', 'e'], ['a', 'g'],
              ['c', 'f'], ['d'], ['b']]}
dat2 = {'a': ['1', '3', '4', '6', '9'],
        'b': [['a', 'b', 'e'], ['a', 'g', 'f'],
              ['c', 'f'], ['d', 'h'], ['b', 'g', 'h']]}
#data = dat2

def row_merge(data):
    data['a'] = [set(x) for x in data['a']]
    data['b'] = [set(x) for x in data['b']]
    # map each item in 'b' to the indices of the rows it appears in
    b_map = defaultdict(list)
    for i, b_list in enumerate(data['b']):
        for item in b_list:
            b_map[item].append(i)
    # consolidate the row-index sets into the merged groups
    index_merge = consolidate([set(v) for v in b_map.values()])
    a, b = [], []
    adata, bdata = data['a'], data['b']
    for merge in index_merge:
        arow, brow = set(), set()
        for row_index in merge:
            arow |= adata[row_index]
            brow |= bdata[row_index]
        a.append(sorted(arow))
        b.append(sorted(brow))
    return {'a': a, 'b': b}

answer = row_merge(dat1)
pp(answer)
answer = row_merge(dat2)
pp(answer)
Output
{'a': [['1', '2', '5'], ['3'], ['4']],
'b': [['a', 'b', 'e', 'g'], ['c', 'f'], ['d']]}
{'a': [['1', '3', '4', '6', '9']],
'b': [['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']]}
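For reference, here is how the consolidate helper behaves on its own (my addition, just to illustrate it):
print(consolidate([{1, 2}, {2, 3}, {4}]))  # -> [{1, 2, 3}, {4}]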
I have a 2D list:
l = [['A', '1', '2'], ['B', 'xx', 'A'], ['C', 'B', 's'], ['D', 'd', 'B']]
and the first element in each list can be treated as an ID string (in the example: A, B, C, D). Wherever one of these IDs occurs in a later position of any inner list, I would like to replace it with the content of that ID's list. Example: ['B', 'xx', 'A'] should become ['B', 'xx', ['A', '1', '2']], because 'A' is an ID (the first string of a list) and it occurs in the second list. The output should be:
n = [['A', '1', '2'], ['B', 'xx', ['A', '1', '2']], ['C', ['B', 'xx', ['A', '1', '2']], 's'],
['D', 'd', ['B', 'xx', ['A', '1', '2']]]]
The problem I am facing is that there can be longer lists and more branches, so it gets complicated. In the end I am trying to build a tree diagram. I was thinking of first calculating the deepest branching, but I don't have a solution in mind yet.
l = [['A', '1', '2'], ['B', 'xx', 'A'], ['C', 'B', 's'], ['D', 'd', 'B']]
dic = {i[0]: i for i in l}  # map each ID to its own list
for i in l:
    fv = i[0]
    for j, v in enumerate(i):
        if v in dic and j != 0:
            dic[fv][j] = dic[v]  # replace the ID with (a reference to) its list
res = [v for i, v in dic.items()]
print(res)
output
[['A', '1', '2'],
['B', 'xx', ['A', '1', '2']],
['C', ['B', 'xx', ['A', '1', '2']], 's'],
['D', 'd', ['B', 'xx', ['A', '1', '2']]]]
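One caveat worth adding (my note, not from the answer): the substituted entries are references to the same list objects, not copies; that is exactly why B's expanded form also shows up nested inside C and D. If you need the rows to be independent, deep-copy the result:
import copy
res_independent = copy.deepcopy(res)  # detach the shared nested lists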
Have you tried using a dictionary? If you have the IDs, you can refer to them, then loop through the array and change entries. Below is what I had:
l = [['A', '1', '2'], ['B', 'xx', 'A'], ['C', 'B', 's'], ['D', 'd', 'B'], ['E', 'C', 'b']]
dt = {}
for i in l:
    dt[i[0]] = i
for i in range(len(l)):
    for j in range(1, len(l[i])):
        if l[i][j] in dt:
            l[i][j] = dt.get(l[i][j])
print(l)
Another more succinct version:
d = {item[0]: item for item in l}
for item in l:
    item[1:] = [d.get(element, element) for element in item[1:]]
I have a list my_list = ['a', 'b', 'c', 'd'] and I need to create a dictionary which looks like
{ 'a': ['a', 'b', 'c', 'd'],
'b': ['b', 'a', 'c', 'd'],
'c': ['c', 'a', 'b', 'd'],
'd': ['d', 'a', 'b', 'c'] }
i.e. each key's value is the whole list, reordered so that the key itself comes first.
Here is my code:
my_list = ['1', '2', '3', '4']
my_dict = dict()
for x in my_list:
    n = my_list[:]   # work on a copy
    n.remove(x)
    n = [x] + n      # put the current element first
    my_dict[x] = n
print(my_dict)
which gives
{'1': ['1', '2', '3', '4'],
'3': ['3', '1', '2', '4'],
'2': ['2', '1', '3', '4'],
'4': ['4', '1', '2', '3']}
as required.
But I don't think that's the optimal way of doing it. Any help optimizing it would be appreciated.
>>> seq
['a', 'b', 'c', 'd']
>>> {e: [e]+[i for i in seq if i != e] for e in seq}
{'a': ['a', 'b', 'c', 'd'],
'b': ['b', 'a', 'c', 'd'],
'c': ['c', 'a', 'b', 'd'],
'd': ['d', 'a', 'b', 'c']}
A faster approach (than the accepted answer) for larger lists is
{e: [e] + seq[:i] + seq[i+1:] for i, e in enumerate(seq)}
Relative timings:
In [1]: seq = list(range(1000))
In [2]: %timeit {e: [e]+[i for i in seq if i != e] for e in seq}
10 loops, best of 3: 40.8 ms per loop
In [3]: %timeit {e: [e] + seq[:i] + seq[i+1:] for i, e in enumerate(seq)}
100 loops, best of 3: 6.03 ms per loop
You can get hacky with a dictionary comprehension:
my_dict = {elem: sorted(my_list, key=lambda x: x != elem) for elem in my_list}
This relies on the fact that sorted performs a stable sort and that False is less than True.
Edit: This method is less clear and probably slower; use with caution.
Let's say I had a list:
[a, b, c, d, e, f]
Given an index, say 3, what is a pythonic way to remove everything before that index from the front of the list and then add it to the back?
So if I was given index 3, I would want to reorder the list as
[d, e, f, a, b, c]
>>> l = ['a', 'b', 'c', 'd', 'e', 'f']
>>>
>>> l[3:] + l[:3]
['d', 'e', 'f', 'a', 'b', 'c']
>>>
or bring it into a function:
>>> def swap_at_index(l, i):
... return l[i:] + l[:i]
...
>>> the_list = ['a', 'b', 'c', 'd', 'e', 'f']
>>> swap_at_index(the_list, 3)
['d', 'e', 'f', 'a', 'b', 'c']
Use the slice operation, e.g.:
myList = ['a', 'b','c', 'd', 'e', 'f']
myList[3:] + myList[:3]
gives
['d', 'e', 'f', 'a', 'b', 'c']
def foo(myList, x):
    return myList[x:] + myList[:x]
Should do the trick.
Call it like this:
>>> aList = ['a', 'b' ,'c', 'd', 'e', 'f']
>>> print(foo(aList, 3))
['d', 'e', 'f', 'a', 'b', 'c']
EDIT Haha all answers are the same...
The pythonic way is what sdolan said; I can only add the inline version:
>>> f = lambda l, q: l[q:] + l[:q]
so, you can use like:
>>> f([1,2,3,4,5,6], 3)
[4, 5, 6, 1, 2, 3]
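For completeness (my addition, not from the answers above), collections.deque has a built-in rotate() that does this in place:
>>> from collections import deque
>>> d = deque(['a', 'b', 'c', 'd', 'e', 'f'])
>>> d.rotate(-3)  # negative n rotates left, matching l[3:] + l[:3]
>>> list(d)
['d', 'e', 'f', 'a', 'b', 'c']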