pandas groupby and join lists - python

I have a dataframe df with two columns. I want to group by one column and join the lists belonging to the same group, for example:
column_a, column_b
1, [1,2,3]
1, [2,5]
2, [5,6]
after the process:
column_a, column_b
1, [1,2,3,2,5]
2, [5,6]
I want to keep all the duplicates. I have the following questions:
1. The dtypes of the dataframe are object(s). convert_objects() doesn't convert column_b to a list automatically. How can I do this?
2. What does the function in df.groupby(...).apply(lambda x: ...) apply to? What is the form of x? A list?
3. What is the solution to my main problem?
Thanks in advance.

object dtype is a catch-all dtype that basically means not int, float, bool, datetime, or timedelta, so the column is already storing them as Python lists. convert_objects tries to convert a column to one of those dtypes, which is why it leaves column_b alone.
You want
In [63]: df
Out[63]:
a b c
0 1 [1, 2, 3] foo
1 1 [2, 5] bar
2 2 [5, 6] baz
In [64]: df.groupby('a').agg({'b': 'sum', 'c': lambda x: ' '.join(x)})
Out[64]:
c b
a
1 foo bar [1, 2, 3, 2, 5]
2 baz [5, 6]
This groups the data frame by the values in column a. Read more about groupby.
This is doing a regular list sum (concatenation) just like [1, 2, 3] + [2, 5] with the result [1, 2, 3, 2, 5]

df.groupby('column_a').agg(sum)
This works because of operator overloading: sum concatenates the lists together. The index of the resulting df will be the values from column_a.
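As a minimal, self-contained sketch of the same idea (column names taken from the question; the explicit lambda with Python's built-in sum avoids any ambiguity about pandas substituting its numeric sum):
import pandas as pd

df = pd.DataFrame({'column_a': [1, 1, 2],
                   'column_b': [[1, 2, 3], [2, 5], [5, 6]]})

# sum(s, []) starts from an empty list and concatenates each group's lists in order
out = df.groupby('column_a')['column_b'].agg(lambda s: sum(s, [])).reset_index()
print(out)
#    column_a         column_b
# 0         1  [1, 2, 3, 2, 5]
# 1         2           [5, 6]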

The approach proposed above using df.groupby('column_a').agg(sum) definitely works. However, you have to make sure that your lists only contain integers, otherwise the output will not be the same (you end up with lists of strings).
If you want to convert all of the list items into integers, you can use:
df['column_b'] = df['column_b'].apply(lambda x: list(map(int, x)))
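A quick illustration of the difference (hypothetical values): concatenation still works on lists of strings, but the elements stay strings:
sum([['1', '2', '3'], ['2', '5']], [])   # ['1', '2', '3', '2', '5']  (strings)
sum([[1, 2, 3], [2, 5]], [])             # [1, 2, 3, 2, 5]            (integers)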

The accepted answer suggests using groupby.sum, which works fine for a small number of lists; however, using sum to concatenate lists is quadratic, since each + builds a new list.
For a larger number of lists, a much faster option would be to use itertools.chain or a list comprehension:
df = pd.DataFrame({'column_a': ['1', '1', '2'],
                   'column_b': [['1', '2', '3'], ['2', '5'], ['5', '6']]})
itertools.chain:
from itertools import chain
out = (df.groupby('column_a', as_index=False)['column_b']
         .agg(lambda x: list(chain.from_iterable(x)))
      )
list comprehension:
out = (df.groupby('column_a', as_index=False, sort=False)['column_b']
         .agg(lambda x: [e for l in x for e in l])
      )
output:
column_a column_b
0 1 [1, 2, 3, 2, 5]
1 2 [5, 6]
Comparison of speed
Using n repeats of the example to show the impact of the number of lists to merge:
test_df = pd.concat([df]*n, ignore_index=True)
NB. also comparing the numpy approach (agg(lambda x: np.concatenate(x.to_numpy()).tolist())).
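As a rough sketch of how such a timing comparison might be reproduced with timeit (the repeat count is arbitrary and the absolute numbers depend on the pandas version and the data):
import timeit
import pandas as pd
from itertools import chain

df = pd.DataFrame({'column_a': ['1', '1', '2'],
                   'column_b': [['1', '2', '3'], ['2', '5'], ['5', '6']]})
n = 10_000                                  # arbitrary number of repeats
test_df = pd.concat([df] * n, ignore_index=True)

t_sum = timeit.timeit(
    lambda: test_df.groupby('column_a')['column_b'].agg(lambda x: sum(x, [])),
    number=3)
t_chain = timeit.timeit(
    lambda: test_df.groupby('column_a')['column_b'].agg(
        lambda x: list(chain.from_iterable(x))),
    number=3)
print(f'sum: {t_sum:.2f}s  chain: {t_chain:.2f}s')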

Use numpy and a simple "for" loop (or "map"):
import numpy as np
u_clm = np.unique(df.column_a.values)
all_lists = []
for clm in u_clm:
    df_process = df.query('column_a == @clm')
    list_ = np.concatenate(df_process.column_b.values)
    all_lists.append((clm, list_.tolist()))
df_sum_lists = pd.DataFrame(all_lists)
For huge datasets this is up to 350 times faster than a simple "groupby-agg-sum" approach.

Thanks, this helped me:
merge.fillna("", inplace=True)
new_merge = merge.groupby(['id']).agg({'q1': lambda x: ','.join(x),
                                       'q2': lambda x: ','.join(x),
                                       'q2_bookcode': lambda x: ','.join(x),
                                       'q1_bookcode': lambda x: ','.join(x)})

Related

How can I merge a list into a dataframe?

I have the following dataframe and list:
df = [[[1,2,3],'a'],[[4,5],'b'],[[6,7,8],'c']]
list = [[1,2,3],[4,5]]
And I want to do an inner merge between them, so I can keep the items in common. This would be my result:
df = [[[1,2,3],'a'],[[4,5],'b']]
I have been thinking of converting both to strings, but even if I convert my list to a string, I haven't been able to merge them, as the merge function requires the items to be Series or DataFrames (not strings). Any help would be great!!
Thanks
If I understand you correctly, you only want to keep the rows of the dataframe whose values (lists) are also present in the list:
lst = [[1, 2, 3], [4, 5]]
print(df[df["col1"].isin(lst)])
Prints:
col1 col2
0 [1, 2, 3] a
1 [4, 5] b
DataFrame used:
col1 col2
0 [1, 2, 3] a
1 [4, 5] b
2 [6, 7, 8] c
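If isin complains about unhashable list values (behaviour varies across pandas versions), a common workaround is to compare hashable tuples instead, along these lines:
lst = [(1, 2, 3), (4, 5)]
print(df[df["col1"].apply(tuple).isin(lst)])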
Thanks for your answer!
This is what worked for me:
Converted my list to a Series:
match = pd.Series(','.join(map(str, match)))
Converted the list column of my master DB into a string:
df_temp2['match_s'].loc[m] = ','.join(map(str, df_temp2['match'].loc[m]))
Applied an inner merge on both DBs:
df_temp3 = df_temp2.merge(match.rename('match'), how='inner',
                          left_on='match_s', right_on='match')
Hope it also works for somebody else :)


Explode multiple lists of same length in DataFrame

I have a Pandas DataFrame with several lists in columns that I would like to split. Each list has the same length and they have to be split at the same indices.
What I have now uses a suggestion from here but I cannot make it work:
import numpy as np
import pandas as pd
from itertools import chain
split_size = 2
def split_list(arr, keep_partial=False):
    arrs = []
    while len(arr) >= split_size:
        sub = arr[:split_size]
        arrs.append(sub)
        arr = arr[split_size:]
    if keep_partial:
        arrs.append(arr)
    return arrs

df = pd.DataFrame({'id': [1, 2, 3],
                   't': [[1, 2, 3, 4], [1, 2, 3, 4, 5, 6], [0, 2]],
                   'v': [[0, -1, 1, 0], [0, -1, 1, 0, 2, -2], [0, 0]]})

def chainer(lst):
    return list(chain.from_iterable(split_list(lst, split_size)))

def chain_col(col):
    return col.apply(lambda x: chainer(x))

lens = df.t.apply(lambda x: len(split_list(x)))
pd.DataFrame({'id': np.repeat(df.id, lens), 't': chain_col(df.t), 'v': chain_col(df.v)})
The problem is that it repeats each full list rather than splits it across lines. I think the issue is the usage of chain.from_iterable but without it I simply get the list of lists (i.e. split lists) repeated rather than each split to its own row in the DataFrame.
My data set is not very large (a few thousand rows), so if there is a better way I'd be happy to learn. I looked at explode but that seems to split the data set based on a single column and I want multiple columns to be split in the same way.
My desired output for id = 1 is:
1. a row with t = [1,2] and v = [0,-1]
2. another row with t = [3,4] and v = [1,0]
Ideally I'd add a sub-index to each 'id' (e.g. 1 -> 1.1 and 1.2, so I can distinguish them) but that's a cosmetic thing, not my main problem.
Using explode, pd.concat and GroupBy:
note: this answer uses the new explode method only available from pandas>=0.25.0
d1 = df.explode('t').drop(columns='v')
d2 = df.explode('v').drop(columns=['id', 't'])
df2 = pd.concat([d1,d2], axis=1)
df2
s = df2.groupby('id')['id'].cumcount()//2
final = df2.groupby(['id', s]).agg({'t': list,
                                    'v': list}).reset_index(level=0)
final['id'] = final['id'].astype(str).str.cat('.'+final.groupby('id').cumcount().add(1).astype(str))
Output
id t v
0 1.1 [1, 2] [0, -1]
1 1.2 [3, 4] [1, 0]
0 2.1 [1, 2] [0, -1]
1 2.2 [3, 4] [1, 0]
2 2.3 [5, 6] [2, -2]
0 3.1 [0, 2] [0, 0]
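If you are on pandas 1.3 or newer, DataFrame.explode accepts a list of columns, so a more compact variant of the same idea could look like this sketch:
# explode both list columns together, regroup every 2 rows, and re-aggregate to lists
df2 = df.explode(['t', 'v'])
s = df2.groupby('id').cumcount() // 2
final = (df2.groupby(['id', s])
            .agg({'t': list, 'v': list})
            .reset_index(level=0))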
IIUC, here is one way using a function which splits lists into chunks, then applymap to split each cell, followed by explode and concat:
def split_lists(l, n):
    """splits a list into chunks of size n"""
    for i in range(0, len(l), n):
        yield l[i:i + n]

def explode_multiple(x):
    """This will use the prev func to split each cell,
    explode each column and concat them to a dataframe"""
    m = x.applymap(lambda x: [*split_lists(x, 2)])
    m = pd.concat([m.explode(i).loc[:, i] for i in m.columns], axis=1).reset_index()
    return m

explode_multiple(df.set_index('id'))  # setting id as index since the other columns contain lists
id t v
0 1 [1, 2] [0, -1]
1 1 [3, 4] [1, 0]
2 2 [1, 2] [0, -1]
3 2 [3, 4] [1, 0]
4 2 [5, 6] [2, -2]
5 3 [0, 2] [0, 0]

Applying a function only works for one column instead of multiple?

x = [{'list1': '[1,6]', 'list2': '[1,1]'},
     {'list1': '[1,7]', 'list2': '[1,2]'}]
df = pd.DataFrame(x)
Now I'm going to transform it from string to list type:
df[['list1','list2']].apply(lambda x: ast.literal_eval(x.strip()))
>> ("'Series' object has no attribute 'strip'", 'occurred at index list1')
So I get an error, but if I single out only 1 column:
df['list1'].apply(lambda x: ast.literal_eval(x.strip()))
>> 0 [1, 6]
1 [1, 7]
Name: list1, dtype: object
Why is this happening? Why does it only allow one column instead of multiple?
It is important to understand how apply is supposed to work in order to understand why it isn't working for you. Each column (considering the default axis=0) is iteratively operated upon; you can see how by letting each series print itself:
df.apply(lambda x: print(x))
0 [1,6]
1 [1,7]
Name: list1, dtype: object
0 [1,1]
1 [1,2]
Name: list2, dtype: object
And when you try and call (series_object).strip(), the error makes more sense.
Since you want to apply your function to each cell individually, you can use applymap instead; it is also relatively fast in comparison.
df[['list1','list2']].applymap(ast.literal_eval)
Or,
df[['list1','list2']].applymap(pd.eval)
list1 list2
0 [1, 6] [1, 1]
1 [1, 7] [1, 2]
Other options also include:
df.apply(lambda x: x.map(ast.literal_eval))
list1 list2
0 [1, 6] [1, 1]
1 [1, 7] [1, 2]
Among others.
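Putting it together, a self-contained sketch (the ast import is assumed, since the snippets above rely on it):
import ast
import pandas as pd

x = [{'list1': '[1,6]', 'list2': '[1,1]'},
     {'list1': '[1,7]', 'list2': '[1,2]'}]
df = pd.DataFrame(x)

# applymap applies the function to every cell, parsing each string into a list
df[['list1', 'list2']] = df[['list1', 'list2']].applymap(ast.literal_eval)
print(df)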

python pandas filtering involving lists

I am currently using Pandas in python 2.7. My dataframe looks similar to this:
>>> df
0
1 [1, 2]
2 [2, 3]
3 [4, 5]
Is it possible to filter rows by values in column 1? For example, if my filter value is 2, the filter should return a dataframe containing the first two rows.
I have tried a couple of ways already. The best thing I can think of is a list comprehension that returns the indices of rows in which the value exists; then I could filter the dataframe with that list of indices. But this would be very slow if I want to filter multiple times with different values. Ideally, I would like something that uses the built-in Pandas functions in order to speed up the process.
You can use boolean indexing:
import pandas as pd
df = pd.DataFrame({'0':[[1, 2],[2, 3], [4, 5]]})
print (df)
0
0 [1, 2]
1 [2, 3]
2 [4, 5]
print (df['0'].apply(lambda x: 2 in x))
0 True
1 True
2 False
Name: 0, dtype: bool
print (df[df['0'].apply(lambda x: 2 in x)])
0
0 [1, 2]
1 [2, 3]
You can also use boolean indexing with a list comprehension:
>>> df[[2 in row for row in df['0']]]
0
0 [1, 2]
1 [2, 3]
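Regarding the concern about filtering repeatedly with different values, one possibility (a sketch, assuming pandas >= 0.25 for Series.explode; the set of wanted values is hypothetical) is to explode the column once and test membership on the flat result:
import pandas as pd

df = pd.DataFrame({'0': [[1, 2], [2, 3], [4, 5]]})

wanted = {2, 4}                      # hypothetical filter values
flat = df['0'].explode()             # one row per list element, original index kept
hit_idx = flat[flat.isin(wanted)].index.unique()
print(df.loc[hit_idx])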
