How to remove duplicate rows in pandas while concatenating column values? [duplicate]

I have a dataframe df with two columns. I want to group by one column and join the lists that belong to the same group. Example:
column_a, column_b
1, [1,2,3]
1, [2,5]
2, [5,6]
after the process:
column_a, column_b
1, [1,2,3,2,5]
2, [5,6]
I want to keep all the duplicates. I have the following questions:
The dtypes of the dataframe are object. convert_objects() doesn't convert column_b to a list automatically. How can I do this?
What does the function in df.groupby(...).apply(lambda x: ...) apply to? What is the form of x? A list?
What is the solution to my main problem?
Thanks in advance.

object dtype is a catch-all dtype that basically means not int, float, bool, datetime, or timedelta. So the column is already storing your values as lists. convert_objects only tries to convert a column to one of those dtypes.
You want
In [63]: df
Out[63]:
a b c
0 1 [1, 2, 3] foo
1 1 [2, 5] bar
2 2 [5, 6] baz
In [64]: df.groupby('a').agg({'b': 'sum', 'c': lambda x: ' '.join(x)})
Out[64]:
c b
a
1 foo bar [1, 2, 3, 2, 5]
2 baz [5, 6]
This groups the data frame by the values in column a. Read more about groupby.
This is doing a regular list sum (concatenation) just like [1, 2, 3] + [2, 5] with the result [1, 2, 3, 2, 5]
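To see what the function passed to a groupby actually receives, here is a minimal sketch (an illustration, not part of the original answer) that rebuilds the example frame and inspects x. With a single column selected, each call gets a pandas Series holding that group's values; without the column selection, apply would receive the group's sub-DataFrame instead. The explicit sum(x, []) is a stand-in for the 'sum' aggregation above.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2],
    "b": [[1, 2, 3], [2, 5], [5, 6]],
    "c": ["foo", "bar", "baz"],
})

# Each group's column "b" arrives as a Series of lists.
kinds = df.groupby("a")["b"].apply(lambda x: type(x).__name__)

# Built-in sum with an empty-list start value concatenates the lists,
# which is the same as writing [1, 2, 3] + [2, 5] by hand.
joined = df.groupby("a")["b"].agg(lambda x: sum(x, []))

print(kinds.tolist())   # ['Series', 'Series']
print(joined.tolist())  # [[1, 2, 3, 2, 5], [5, 6]]
```

The result is indexed by the group keys (1 and 2 here), so joined.loc[1] gives the concatenated list for group 1.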

df.groupby('column_a').agg(sum)
This works because of operator overloading: sum concatenates the lists together. The index of the resulting df will be the values from column_a.

The approach proposed above using df.groupby('column_a').agg(sum) definitely works. However, you have to make sure that your lists only contain integers, otherwise the output will not be the same.
If you want to convert all of the list items into integers, you can use:
df['column_b'] = df['column_b'].apply(lambda x: list(map(int, x)))

The accepted answer suggests using groupby.sum, which works fine with a small number of lists. However, using sum to concatenate lists is quadratic.
For a larger number of lists, a much faster option would be to use itertools.chain or a list comprehension:
df = pd.DataFrame({'column_a': ['1', '1', '2'],
'column_b': [['1', '2', '3'], ['2', '5'], ['5', '6']]})
itertools.chain:
from itertools import chain
out = (df.groupby('column_a', as_index=False)['column_b']
.agg(lambda x: list(chain.from_iterable(x)))
)
list comprehension:
out = (df.groupby('column_a', as_index=False, sort=False)['column_b']
.agg(lambda x: [e for l in x for e in l])
)
output:
column_a column_b
0 1 [1, 2, 3, 2, 5]
1 2 [5, 6]
Comparison of speed
Using n repeats of the example to show the impact of the number of lists to merge:
test_df = pd.concat([df]*n, ignore_index=True)
NB. also comparing the numpy approach (agg(lambda x: np.concatenate(x.to_numpy()).tolist())).
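The comparison can be reproduced roughly as follows. This is a sketch under stated assumptions rather than the original benchmark: the repeat count n and the helper names with_sum and with_chain are hypothetical, and sum(x, []) stands in for the groupby sum. Absolute timings depend on the machine and pandas version, so none are quoted here.

```python
from itertools import chain
from timeit import timeit

import pandas as pd

df = pd.DataFrame({'column_a': ['1', '1', '2'],
                   'column_b': [['1', '2', '3'], ['2', '5'], ['5', '6']]})

n = 500  # hypothetical repeat count; raise it to widen the gap
test_df = pd.concat([df] * n, ignore_index=True)
grouped = test_df.groupby('column_a')['column_b']


def with_sum():
    # Repeated list addition copies the growing result on every step.
    return grouped.agg(lambda x: sum(x, []))


def with_chain():
    # chain.from_iterable walks every element exactly once.
    return grouped.agg(lambda x: list(chain.from_iterable(x)))


t_sum = timeit(with_sum, number=3)
t_chain = timeit(with_chain, number=3)
print(f"sum: {t_sum:.4f}s, chain: {t_chain:.4f}s")
```

Both helpers return the same lists, so only the running time should differ as n grows.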

Use numpy and a simple "for" loop or "map":
import numpy as np
import pandas as pd
u_clm = np.unique(df.column_a.values)
all_lists = []
for clm in u_clm:
    # select the rows of the current group; @clm refers to the Python variable
    df_process = df.query('column_a == @clm')
    # concatenate only this group's lists
    list_ = np.concatenate(df_process.column_b.values)
    all_lists.append((clm, list_.tolist()))
df_sum_lists = pd.DataFrame(all_lists)
It is 350 times faster than a simple "groupby-agg-sum" approach for huge datasets.

Thanks, helped me
merge.fillna("", inplace=True)
new_merge = merge.groupby(['id']).agg({
    'q1': lambda x: ','.join(x),
    'q2': lambda x: ','.join(x),
    'q2_bookcode': lambda x: ','.join(x),
    'q1_bookcode': lambda x: ','.join(x),
})

Related

How can I merge a list into a dataframe?

I have the following dataframe and list:
df = [[[1,2,3],'a'],[[4,5],'b'],[[6,7,8],'c']]
list = [[1,2,3],[4,5]]
And I want to do an inner merge between them, so I can keep the items in common. This will be my result:
df = [[[1,2,3],'a'],[[4,5],'b']]
I have been thinking of converting both to strings, but even if I convert my list to a string, I haven't been able to merge them, as the merge function requires the items to be Series or DataFrames (not strings). Any help would be great!
Thanks
If I understand you correctly, you want to keep only the rows of the dataframe whose value (a list) in the column also appears in the list:
lst = [[1, 2, 3], [4, 5]]
print(df[df["col1"].isin(lst)])
Prints:
col1 col2
0 [1, 2, 3] a
1 [4, 5] b
DataFrame used:
col1 col2
0 [1, 2, 3] a
1 [4, 5] b
2 [6, 7, 8] c
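Lists are unhashable, so if isin raises an error on list-valued cells in your pandas version, one alternative is to compare tuples instead. The following is a minimal sketch assuming the col1/col2 frame shown above; the names wanted and mask are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [[1, 2, 3], [4, 5], [6, 7, 8]],
    "col2": ["a", "b", "c"],
})
lst = [[1, 2, 3], [4, 5]]

# Tuples are hashable, so they can be collected in a set for fast lookups.
wanted = {tuple(item) for item in lst}

# True for every row whose list also appears in lst.
mask = df["col1"].map(lambda cell: tuple(cell) in wanted)
result = df[mask]
print(result)
```

The cells in result are still the original lists, because the tuples are only used for the membership test.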
Thanks for your answer!
This is what worked for me:
Convert my list to a Series (using my DB):
match = pd.Series(','.join(map(str,match)))
Convert the list in my master DB into a string:
df_temp2['match_s'].loc[m] = ','.join(map(str,df_temp2['match'].loc[m]))
Apply an inner merge on both DBs:
df_temp3 = df_temp2.merge(match.rename('match'), how='inner',
                          left_on='match_s', right_on='match')
Hope it also works for somebody else :)

Applying a function only works for one column instead of multiple?

x = [{'list1':'[1,6]', 'list2':'[1,1]'},
{'list1':'[1,7]', 'list2':'[1,2]'}]
df = pd.DataFrame(x)
Now I'm going to transform it from string to list type:
df[['list1','list2']].apply(lambda x: ast.literal_eval(x.strip()))
>> ("'Series' object has no attribute 'strip'", 'occurred at index list1')
So I get an error, but if I single out only 1 column:
df['list1'].apply(lambda x: ast.literal_eval(x.strip()))
>> 0 [1, 6]
1 [1, 7]
Name: list1, dtype: object
Why is this happening? Why does it only allow one column instead of multiple?
It is important to understand how apply is supposed to work in order to understand why it isn't working for you. With the default axis=0, apply operates on each column in turn, passing it to your function as a Series. You can see this by letting each Series print itself:
df.apply(lambda x: print(x))
0 [1,6]
1 [1,7]
Name: list1, dtype: object
0 [1,1]
1 [1,2]
Name: list2, dtype: object
And when you try and call (series_object).strip(), the error makes more sense.
Since you want to apply your function to each cell individually, you can use applymap instead. It is also relatively faster in comparison.
df[['list1','list2']].applymap(ast.literal_eval)
Or,
df[['list1','list2']].applymap(pd.eval)
list1 list2
0 [1, 6] [1, 1]
1 [1, 7] [1, 2]
Other options also include:
df.apply(lambda x: x.map(ast.literal_eval))
list1 list2
0 [1, 6] [1, 1]
1 [1, 7] [1, 2]
Among others.
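For completeness, here is a self-contained sketch of the apply plus Series.map variant, including the ast import the snippets above rely on. It assumes the same two string columns from the question; the names cols and parsed are hypothetical. As far as I know, pandas 2.1 and later deprecate DataFrame.applymap in favour of DataFrame.map, and this variant avoids depending on either name.

```python
import ast

import pandas as pd

x = [{'list1': '[1,6]', 'list2': '[1,1]'},
     {'list1': '[1,7]', 'list2': '[1,2]'}]
df = pd.DataFrame(x)

cols = ['list1', 'list2']

# apply hands over each column as a Series;
# Series.map then calls literal_eval on every individual cell.
parsed = df[cols].apply(lambda s: s.map(ast.literal_eval))

print(parsed)
print(type(parsed.loc[0, 'list1']))
```

ast.literal_eval only parses Python literals, so it is a safer choice than eval for strings that come from outside your program.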

python pandas filtering involving lists

I am currently using Pandas in python 2.7. My dataframe looks similar to this:
>>> df
0
1 [1, 2]
2 [2, 3]
3 [4, 5]
Is it possible to filter rows by values in column 1? For example, if my filter value is 2, the filter should return a dataframe containing the first two rows.
I have tried a couple of ways already. The best thing I can think of is a list comprehension that returns the index of the rows in which the value exists. Then I could filter the dataframe with the list of indices. But this would be very slow if I want to filter multiple times with different values. Ideally, I would like something that uses the built-in Pandas functions in order to speed up the process.
You can use boolean indexing:
import pandas as pd
df = pd.DataFrame({'0':[[1, 2],[2, 3], [4, 5]]})
print (df)
0
0 [1, 2]
1 [2, 3]
2 [4, 5]
print (df['0'].apply(lambda x: 2 in x))
0 True
1 True
2 False
Name: 0, dtype: bool
print (df[df['0'].apply(lambda x: 2 in x)])
0
0 [1, 2]
1 [2, 3]
You can also use boolean indexing with a list comprehension:
>>> df[[2 in row for row in df['0']]]
0
0 [1, 2]
1 [2, 3]
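If you would rather stay within pandas methods, one possible sketch uses Series.explode, which needs pandas 0.25 or later and so is newer than the setup described in the question. The helper name rows_containing is hypothetical, and whether this beats the apply version will depend on your data.

```python
import pandas as pd

df = pd.DataFrame({'0': [[1, 2], [2, 3], [4, 5]]})


def rows_containing(frame, column, value):
    # One row per list element; the original row label is repeated.
    exploded = frame[column].explode()
    # Collapse back to a single boolean per original row.
    mask = exploded.eq(value).groupby(level=0).any()
    return frame[mask]


print(rows_containing(df, '0', 2))
```

Because the mask is rebuilt from the row labels, the filtered frame keeps the original index and the original list objects.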

Accessing the Kth group in Pandas

Is this the only way to do it? It's quite verbose.
k = 0
grouped = df.groupby('A')
df.ix[grouped.groups[list(grouped.groups)[k]]]
Also, wouldn't list(grouped.groups) return keys in a meaningless order? (unordered dictionary)
Aside from the ordering of groups, is there a more concise way to get a group? I don't necessarily need to get the Kth one, although it would be nice to get them in the order they appear in the dataframe.
If you know the key, the concise way is get_group:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: g = df.groupby('A')
In [13]: g.get_group(1)
Out[13]:
A B
0 1 2
1 1 4
As mentioned, the group keys are not necessarily ordered. You can access them (as is) with levels:
In [14]: g.grouper.levels
Out[14]: [Int64Index([1, 5], dtype='int64')]
If this is for just one column, you can use unique to get the keys in the order they appear, i.e. not sorted:
In [15]: df.A.unique()
Out[15]: array([1, 5])
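Putting unique and get_group together gives a short way to fetch the Kth group in order of first appearance. This is only a sketch: the helper name kth_group is hypothetical, and the sample frame is reordered from the one above so that appearance order differs from sorted order.

```python
import pandas as pd

df = pd.DataFrame([[5, 6], [1, 2], [5, 8], [1, 4]], columns=['A', 'B'])


def kth_group(frame, column, k):
    # unique() keeps keys in order of first appearance, not sorted.
    keys = frame[column].unique()
    return frame.groupby(column).get_group(keys[k])


print(kth_group(df, 'A', 0))
```

Here k=0 returns the rows where A is 5, since 5 is the first key seen, even though 1 would come first in sorted order.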

