I have the following dataframe and list:
df = [[[1,2,3],'a'],[[4,5],'b'],[[6,7,8],'c']]
list = [[1,2,3],[4,5]]
And I want to do an inner merge between them, so I can keep the items in common. This will be my result:
df = [[[1,2,3],'a'],[[4,5],'b']]
I have been thinking of converting both to strings, but even if I convert my list to a string, I haven't been able to merge them, as the merge function requires the items to be Series or DataFrames (not strings). Any help would be great!
Thanks
If I understand you correctly, you want to keep only the rows of the dataframe whose value in the column (a list) also appears in the list:
lst = [[1, 2, 3], [4, 5]]
print(df[df["col1"].isin(lst)])
Prints:
col1 col2
0 [1, 2, 3] a
1 [4, 5] b
DataFrame used:
col1 col2
0 [1, 2, 3] a
1 [4, 5] b
2 [6, 7, 8] c
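Because lists are not hashable, membership tests on a column of lists can behave differently across pandas versions. A minimal sketch of the same filter that sidesteps this by comparing tuples instead, assuming the column names `col1` and `col2` from the DataFrame shown above (the names `wanted` and `mask` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [[1, 2, 3], [4, 5], [6, 7, 8]],
    "col2": ["a", "b", "c"],
})
lst = [[1, 2, 3], [4, 5]]

# Tuples are hashable, so they can be stored in a set for fast lookups.
wanted = {tuple(item) for item in lst}

# True for every row whose list (as a tuple) is one of the wanted items.
mask = df["col1"].map(lambda value: tuple(value) in wanted)

result = df[mask]
print(result)
```

The original lists stay untouched in `col1`; only the comparison goes through tuples.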
Thanks for your answer!
This is what worked for me:
Convert my list to a Series (the DB I'm using):
match = pd.Series(','.join(map(str,match)))
Convert the list in my master DB into a string:
df_temp2.loc[m, 'match_s'] = ','.join(map(str, df_temp2['match'].loc[m]))
Apply an inner merge on both DBs:
df_temp3 = df_temp2.merge(match.rename('match'), how='inner',
                          left_on='match_s', right_on='match')
Hope it also works for somebody else :)
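The string conversion above can probably be avoided. A rough sketch of an actual inner merge that uses tuples as the merge key, assuming the example data from the question; the helper column name `key` and the frames `left` and `right` are hypothetical and not part of the original code:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [[1, 2, 3], [4, 5], [6, 7, 8]],
    "col2": ["a", "b", "c"],
})
lst = [[1, 2, 3], [4, 5]]

# Add a hashable copy of each list to the left frame.
left = df.assign(key=df["col1"].map(tuple))

# Turn the plain list into a one-column frame with the same kind of key.
right = pd.DataFrame({"key": [tuple(item) for item in lst]})

# Inner merge keeps only the rows whose key exists on both sides.
merged = left.merge(right, on="key", how="inner").drop(columns="key")
print(merged)
```

Unlike joined strings, tuples keep the element types, so `[1, 23]` and `[12, 3]` cannot collide through formatting.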
I have 2 dataframes. One (a) has a column of integers and one (b) has a column of a list of integers (array or list)
I'm trying to find a way to find all occurrences of b where b contains a.
I hoped something like this would work
df3 = a[a['cell'].isin(b['cells'])]
But I get an empty dataframe.
I tried using columns 'a' and 'b' in the same dataframe.
Here is the filter that worked for me on an example dataframe I made.
data=pd.DataFrame([[1, [1, 2, 3]], [5, [11, 12, 13]], [6, [7, 6, 10]]], columns=['a', 'b'])
mask=(pd.Series(data.index)).apply(lambda x:(data.loc[x, ['a']]).isin(data.loc[x, ['b']].values[0]))
data:
data.loc[mask['a'].values]:
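For the two-dataframe case in the question, `isin` compares each integer against whole entries of `b['cells']` (the lists themselves), not against the numbers inside them, which would explain the empty result. A minimal sketch of one alternative, assuming small example frames `a` and `b` that I made up with the column names from the question:

```python
import pandas as pd

# Hypothetical example data using the column names from the question.
a = pd.DataFrame({"cell": [1, 6, 20]})
b = pd.DataFrame({"cells": [[1, 2, 3], [11, 12, 13], [7, 6, 10]]})

# All integers we are looking for, as plain Python ints in a set.
wanted = set(a["cell"].tolist())

# A row of b matches when its list shares at least one value with `wanted`.
mask = b["cells"].map(lambda cells: not wanted.isdisjoint(cells))

matches = b[mask]
print(matches)
```

Here rows 0 and 2 of `b` are kept because they contain 1 and 6 respectively.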
I have a dataframe df with two columns. I want to group by one column and join the lists that belong to the same group, for example:
column_a, column_b
1, [1,2,3]
1, [2,5]
2, [5,6]
after the process:
column_a, column_b
1, [1,2,3,2,5]
2, [5,6]
I want to keep all the duplicates. I have the following questions:
The dtypes of the dataframe are object(s). convert_objects() doesn't convert column_b to list automatically. How can I do this?
What does the function in df.groupby(...).apply(lambda x: ...) apply to? What is the form of x? A list?
What is the solution to my main problem?
Thanks in advance.
object dtype is a catch-all dtype that basically means not int, float, bool, datetime, or timedelta. So it is storing them as a list. convert_objects tries to convert a column to one of those dtypes.
You want
In [63]: df
Out[63]:
a b c
0 1 [1, 2, 3] foo
1 1 [2, 5] bar
2 2 [5, 6] baz
In [64]: df.groupby('a').agg({'b': 'sum', 'c': lambda x: ' '.join(x)})
Out[64]:
c b
a
1 foo bar [1, 2, 3, 2, 5]
2 baz [5, 6]
This groups the data frame by the values in column a. Read more about groupby.
This is doing a regular list sum (concatenation) just like [1, 2, 3] + [2, 5] with the result [1, 2, 3, 2, 5]
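On the question of what `x` is inside the lambda, a small probe can help. This is only a sketch, assuming the example frame above; the helper names `record_apply` and `record_agg` are hypothetical. It records the type name of whatever pandas passes to the function in each case:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2],
    "b": [[1, 2, 3], [2, 5], [5, 6]],
    "c": ["foo", "bar", "baz"],
})

apply_types = set()
agg_types = set()

def record_apply(x):
    # Called once per group on a selection of columns.
    apply_types.add(type(x).__name__)
    return len(x)

def record_agg(x):
    # Called once per group on a single selected column.
    agg_types.add(type(x).__name__)
    return len(x)

df.groupby("a")[["b"]].apply(record_apply)
df.groupby("a")["b"].agg(record_agg)

print(apply_types)  # the sub-frame of each group
print(agg_types)    # the column values of each group
```

So with `apply` on a frame selection, `x` is a DataFrame holding the rows of one group, while aggregating a single column passes a Series. Neither is a plain list.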
df.groupby('column_a').agg(sum)
This works because of operator overloading: sum concatenates the lists together. The index of the resulting df will be the values from column_a.
The approach proposed above using df.groupby('column_a').agg(sum) definitely works. However, you have to make sure that your lists contain only integers, otherwise the output will not be the same.
If you want to convert all of the lists items into integers, you can use:
df['column_b'] = df['column_b'].apply(lambda x: list(map(int, x)))
The accepted answer suggests using groupby.sum, which works fine with a small number of lists. However, using sum to concatenate lists is quadratic, because every addition builds a new list.
For a larger number of lists, a much faster option is to use itertools.chain or a list comprehension:
df = pd.DataFrame({'column_a': ['1', '1', '2'],
'column_b': [['1', '2', '3'], ['2', '5'], ['5', '6']]})
itertools.chain:
from itertools import chain
out = (df.groupby('column_a', as_index=False)['column_b']
.agg(lambda x: list(chain.from_iterable(x)))
)
list comprehension:
out = (df.groupby('column_a', as_index=False, sort=False)['column_b']
.agg(lambda x: [e for l in x for e in l])
)
output:
column_a column_b
0 1 [1, 2, 3, 2, 5]
1 2 [5, 6]
Comparison of speed
Using n repeats of the example to show the impact of the number of lists to merge:
test_df = pd.concat([df]*n, ignore_index=True)
NB. also comparing the numpy approach (agg(lambda x: np.concatenate(x.to_numpy()).tolist())).
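The comparison can be reproduced with a small timing harness. This is a sketch only, assuming the example `df` above and an arbitrary `n`; the function names `with_sum` and `with_chain` are hypothetical, and the actual timings depend on the machine and the pandas version, so none are shown here:

```python
import timeit
from itertools import chain

import pandas as pd

df = pd.DataFrame({'column_a': ['1', '1', '2'],
                   'column_b': [['1', '2', '3'], ['2', '5'], ['5', '6']]})

n = 200  # number of repeats of the example; increase to see the gap grow
test_df = pd.concat([df] * n, ignore_index=True)

def with_sum(frame):
    # Repeated list addition: each step copies the list built so far.
    return frame.groupby('column_a')['column_b'].agg(lambda x: sum(x, []))

def with_chain(frame):
    # Single pass over all sub-lists.
    return frame.groupby('column_a')['column_b'].agg(
        lambda x: list(chain.from_iterable(x)))

# Both approaches should produce the same concatenated lists.
same = with_sum(test_df).tolist() == with_chain(test_df).tolist()

t_sum = timeit.timeit(lambda: with_sum(test_df), number=3)
t_chain = timeit.timeit(lambda: with_chain(test_df), number=3)
print(f"same result: {same}, sum: {t_sum:.4f}s, chain: {t_chain:.4f}s")
```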
Use numpy and a simple "for" loop or "map":
import numpy as np
u_clm = np.unique(df.column_a.values)
all_lists = []
for clm in u_clm:
    # select only the rows of the current group
    df_process = df.query('column_a == @clm')
    # concatenate the lists of this group only
    list_ = np.concatenate(df_process.column_b.values)
    all_lists.append((clm, list_.tolist()))
df_sum_lists = pd.DataFrame(all_lists)
It's 350 times faster than a simple "groupby-agg-sum" approach for huge datasets.
Thanks, helped me
merge.fillna("", inplace=True)
new_merge = merge.groupby(['id']).agg({
    'q1': lambda x: ','.join(x),
    'q2': lambda x: ','.join(x),
    'q2_bookcode': lambda x: ','.join(x),
    'q1_bookcode': lambda x: ','.join(x),
})
I have a list of lists:
a = [[1,2,3,4],[5,6,7,8],[4,5,6,7]]
I want to iterate through the contents and avoid a certain column. So my general for loop structure is as follows:
for i in range (0,len(a[0])):
Now as far as I know I need one more condition to avoid a certain column. How do I do that?
Another way this could be done is to delete the whole column and append it right to the end.
So after deleting the column:
a = [[1,2,4],[5,6,8],[4,5,7]]
After appending:
a = [[1,2,4,3],[5,6,8,7],[4,5,7,6]]
I could use numpy to do this. So I am deleting the desired column from numpy.
Here's the working code:
import numpy as np
a = [[1,2,3,4],[5,6,7,8],[4,5,6,7]]
a_np = np.array(a)
a_col_data = a_np[:,2]
a1 = np.delete(a_np, 2, axis=1)
Now my a_col_data is [3 7 6]
I cannot append in this format. I need it in this format [[3],[7],[6]]
Then I could use the following code to append it as a last column in a:
np.append(a_np, a_col_data, axis=1)
The issues I am facing with this approach:
How to convert a list [3 7 6] into [[3],[7],[6]]?
Considering I have a list of 150 columns and 3000 rows I can't do it manually.
Another issue is np.array converts the list into a pure matrix like structure removing the "," between elements. I want to get the list structure back.
For example:
a = [[1,2,4],[5,6,8],[4,5,7]]
a_np = [[1 2 4][5 6 8][4 5 7]]
How do I convert a_np to a again?
EDIT:
Ok I was just browsing stackoverflow and I got a solution. I could simply do the following to convert it back:
a_np.tolist()
There are many ways to do this.
You could add an extra axis to a_col_data:
>>> a_np2 = np.append(a1, a_col_data[:, np.newaxis], axis=1)
>>> a_np2
array([[1, 2, 4, 3],
[5, 6, 8, 7],
[4, 5, 7, 6]])
>>> a_np2.tolist()
[[1, 2, 4, 3], [5, 6, 8, 7], [4, 5, 7, 6]]
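Two other ways to get the column into the `[[3],[7],[6]]` shape are `reshape(-1, 1)` and indexing with a list of columns. A short sketch, assuming the array from the question; the names `col`, `col_2d`, `rest` and `moved` are only illustrative:

```python
import numpy as np

a = [[1, 2, 3, 4], [5, 6, 7, 8], [4, 5, 6, 7]]
a_np = np.array(a)
col = 2  # index of the column to move to the end

# reshape(-1, 1) turns the 1-D column into a 2-D column vector.
col_2d = a_np[:, col].reshape(-1, 1)

# Indexing with a list keeps the second axis, giving the same shape.
col_2d_alt = a_np[:, [col]]

# Remove the column, then stack it back on the right-hand side.
rest = np.delete(a_np, col, axis=1)
moved = np.hstack([rest, col_2d])

result = moved.tolist()  # back to a plain list of lists
print(result)
```

This scales to 150 columns and 3000 rows without any manual conversion, since nothing depends on the array size.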
Or you could avoid going the numpy route by using enumerate:
a = [[1,2,3,4],[5,6,7,8],[4,5,6,7]]
for i in a:
    for j, x in enumerate(i):
        if j == 2:
            continue
        ...
So you have a list of lists and want a list of lists in return, each missing one entry? Assuming you want to remove the item at index 2 of each sublist:
a = [[x for i, x in enumerate(sublist) if i != 2] for sublist in a]
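The other variant from the question, deleting the column and appending it at the end, can also be done without numpy by slicing each row. A minimal sketch, assuming the column index is 2 as above (the names `col`, `removed` and `moved` are illustrative):

```python
a = [[1, 2, 3, 4], [5, 6, 7, 8], [4, 5, 6, 7]]
col = 2  # index of the column to drop or move

# Drop the column: everything before it plus everything after it.
removed = [row[:col] + row[col + 1:] for row in a]

# Move the column to the end: same slices, then the dropped value.
moved = [row[:col] + row[col + 1:] + [row[col]] for row in a]

print(removed)
print(moved)
```

Slicing builds new rows, so the original `a` is left unchanged.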
If you want the columns 0, 1 and 2 from your array you could do e.g.
import numpy as np
a = np.array([[1,2,3,4],[5,6,7,8],[4,5,6,7]])
for col in a[:,[0,1,2]].T:
    print(col)
I am currently using Pandas in python 2.7. My dataframe looks similar to this:
>>> df
0
1 [1, 2]
2 [2, 3]
3 [4, 5]
Is it possible to filter rows by values in column 1? For example, if my filter value is 2, the filter should return a dataframe containing the first two rows.
I have tried a couple of ways already. The best thing I can think of is a list comprehension that returns the indices of the rows in which the value exists. Then I could filter the dataframe with the list of indices. But this would be very slow if I want to filter multiple times with different values. Ideally, I would like something that uses the built-in Pandas functions in order to speed up the process.
You can use boolean indexing:
import pandas as pd
df = pd.DataFrame({'0':[[1, 2],[2, 3], [4, 5]]})
print (df)
0
0 [1, 2]
1 [2, 3]
2 [4, 5]
print (df['0'].apply(lambda x: 2 in x))
0 True
1 True
2 False
Name: 0, dtype: bool
print (df[df['0'].apply(lambda x: 2 in x)])
0
0 [1, 2]
1 [2, 3]
You can also use boolean indexing with a list comprehension:
>>> df[[2 in row for row in df['0']]]
0
0 [1, 2]
1 [2, 3]
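Since the question mentions filtering many times with different values, one option is to flatten the lists once and reuse the result. This is a sketch under the assumption that `Series.explode` is available (it was added in pandas 0.25, so it would not apply to very old installs); the helper `rows_containing` is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'0': [[1, 2], [2, 3], [4, 5]]})

# One element per row; the original row label is repeated in the index.
exploded = df['0'].explode()

def rows_containing(value):
    # Row labels whose list contains `value`, without duplicates.
    hits = exploded.index[(exploded == value).to_numpy()].unique()
    return df.loc[hits]

result = rows_containing(2)
print(result)
print(rows_containing(5))
```

The flattening cost is paid once; each later lookup is a vectorized comparison rather than a Python-level loop over the lists.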