Combine lists from several columns into one nested list pandas - python

Here is my dataframe:
| col1      | col2      | col3      |
| --------- | --------- | --------- |
| [1,2,3,4] | [1,2,3,4] | [1,2,3,4] |
I also have this function:
def joiner(col1, col2, col3):
    snip = []
    snip.append(col1)
    snip.append(col2)
    snip.append(col3)
    return snip
I want to call this on each of the columns and assign it to a new column.
My end goal would be something like this:
| col1      | col2      | col3      | col4                            |
| --------- | --------- | --------- | ------------------------------- |
| [1,2,3,4] | [1,2,3,4] | [1,2,3,4] | [[1,2,3,4],[1,2,3,4],[1,2,3,4]] |

Just .apply list on axis=1; it creates a list for each row:
>>> df['col4'] = df.apply(list, axis=1)
OUTPUT:
col1 col2 col3 col4
0 [1, 2, 3, 4] [1, 2, 3, 4] [1, 2, 3, 4] [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]

You can also just do
df['col4'] = df.values.tolist()
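A minimal runnable sketch of both approaches on the example data (column and variable names assumed to mirror the question):

import pandas as pd

df = pd.DataFrame({'col1': [[1, 2, 3, 4]],
                   'col2': [[1, 2, 3, 4]],
                   'col3': [[1, 2, 3, 4]]})

# Option 1: apply list across each row to build the nested list
df['col4'] = df.apply(list, axis=1)

# Option 2: .values.tolist() returns one list of cell values per row,
# which yields the same nested structure without apply
df['col4'] = df[['col1', 'col2', 'col3']].values.tolist()

print(df['col4'][0])  # [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]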

Related

List from Column elements

I want to create a column 'List' from column 'Numbers' such that each row of 'List' contains all the values of 'Numbers' except the one in that row.
Table:
| Numbers | List |
| -------- | -------------- |
| 1 | [2,3,4,1] |
| 2 | [3,4,1,1] |
| 3 | [4,1,1,2] |
| 4 | [1,1,2,3] |
| 1 | [1,2,3,4] |
Can anyone help with this, please?
For a general solution that also works with duplicated values, first repeat the values with numpy.tile and then drop the diagonal to remove each row's own value:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Numbers': [1, 2, 3, 4, 1]})
A = np.tile(df['Numbers'], len(df)).reshape(-1, len(df))
# https://stackoverflow.com/a/46736275/2901002
df['new'] = A[~np.eye(A.shape[0], dtype=bool)].reshape(A.shape[0], -1).tolist()
print (df)
   Numbers           new
0        1  [2, 3, 4, 1]
1        2  [1, 3, 4, 1]
2        3  [1, 2, 4, 1]
3        4  [1, 2, 3, 1]
4        1  [1, 2, 3, 4]
Try this (note it removes every occurrence of a value, so it only matches the expected output when the numbers are unique):
df = pd.DataFrame({'numbers':range(1, 5)})
df['list'] = df['numbers'].apply(lambda x: [i for i in df['numbers'] if i != x])
df
import pandas as pd

df = pd.DataFrame({'Numbers': [1, 2, 3, 4, 5]})
df['List'] = df['Numbers'].apply(
    # for every cell with element x in Numbers return Numbers without the element
    lambda x: [y for y in df['Numbers'] if not y == x])
which results in:
df
Numbers List
0 1 [2, 3, 4, 5]
1 2 [1, 3, 4, 5]
2 3 [1, 2, 4, 5]
3 4 [1, 2, 3, 5]
4 5 [1, 2, 3, 4]
Abhinar Khandelwal: if the task is to get every other number besides the current row's value, then my answer can be fixed as follows:
import numpy as np
import pandas as pd

rng = np.random.default_rng()
numbers = rng.integers(5, size=7)
df = pd.DataFrame({'numbers': numbers})
df['list'] = df.reset_index()['index'].apply(lambda x: df[df.index != x].numbers.values)
df
But this way is much faster https://stackoverflow.com/a/73275614/18965699 :)

python pandas DataFrame - assign a list to multiple cells

I have a DataFrame like
name col1 col2
a aa 123
a bb 123
b aa 234
and a list
[1, 2, 3]
I want to replace col2 of every row where col1 == 'aa' with the list, like
name col1 col2
a aa [1, 2, 3]
a bb 123
b aa [1, 2, 3]
I tried something like
df.loc[df['col1'] == 'aa', 'col2'] = [1, 2, 3]
but it gives me the error:
ValueError: could not broadcast input array from shape (xx,) into shape (yy,)
How should I get around this?
Make it simple, np.where should do. Code below (note that str(lst) stores the list as a string, not a list object):
lst = [1, 2, 3]
df['col2'] = np.where(df['col1'] == 'aa', str(lst), df['col2'])
Alternatively, use pd.Series with the list wrapped in double brackets:
df['col2'] = np.where(df['col1'] == 'aa', pd.Series([lst]), df['col2'])
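For completeness, a runnable sketch of the pd.Series variant on the question's data (names follow the question; lst is assumed to be the list to insert):

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b"],
                   "col1": ["aa", "bb", "aa"],
                   "col2": [123, 123, 234]})
lst = [1, 2, 3]

# pd.Series([lst]) is a one-element object Series, so np.where broadcasts
# the single list object across every row where the condition is True;
# note that all matching rows then share that same list object
df["col2"] = np.where(df["col1"] == "aa", pd.Series([lst]), df["col2"])
print(df)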
import pandas as pd
df = pd.DataFrame({"name":["a","a","b"],"col1":["aa","bb","aa"],"col2":[123,123,234]})
l = [1,2,3]
df["col2"] = df.apply(lambda x: l if x.col1 == "aa" else x.col2, axis =1)
df
A list comprehension with an if/else should work
df['col2'] = [x['col2'] if x['col1'] != 'aa' else [1,2,3] for ind,x in df.iterrows()]
It is safe to do it with a for loop:
df.col2 = df.col2.astype(object)
for x in df.index:
    if df.at[x, 'col1'] == 'aa':
        df.at[x, 'col2'] = [1, 2, 3]
df
name col1 col2
0 a aa [1, 2, 3]
1 a bb 123
2 b aa [1, 2, 3]
You can also use:
data = {'aa':[1,2,3]}
df['col2'] = np.where(df['col1'] == 'aa', df['col1'].map(data), df['col2'])
Use this with care: every matching row references the same list object, so mutating the list changes it in both locations:
df['col2'].loc[0].append(5)
print(df)
#OUTPUT
name col1 col2
0 a aa [1, 2, 3, 5]
1 a bb 123
2 b aa [1, 2, 3, 5]
But this is fine:
df = df.loc[1:]
print(df)
#OUTPUT
name col1 col2
1 a bb 123
2 b aa [1, 2, 3]
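If each matching row should get an independent list rather than a shared reference, one hedged workaround (my sketch, not from the answers above) is to build a fresh copy per row with apply:

import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b"],
                   "col1": ["aa", "bb", "aa"],
                   "col2": [123, 123, 234]})
lst = [1, 2, 3]

# list(lst) creates a new list for every matching row, so appending to one
# row's list no longer affects the others
df["col2"] = df.apply(lambda row: list(lst) if row["col1"] == "aa" else row["col2"], axis=1)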

Combine two columns and create a new column using the pandas library

df = pd.read_csv("school_data.csv")
col1 col2
0 [1,2,3] [4,5,6]
1 [0,5,3] [6,2,5]
Desired output:
col1 col2 col3
0 [1,2,3] [4,5,6] [1,2,3,4,5,6]
1 [0,5,3] [6,2,5] [0,5,3,6,2,5]
col1 and col2 values are unique. I want to do this using pandas.
Simplest way would be to do this:
df['col3'] = df['col1'] + df['col2']
Example:
import pandas as pd
row1 = [[1,2,3], [4,5,6]]
row2 = [[0,5,3], [6,2,5]]
df = pd.DataFrame(data=[row1, row2], columns=['col1', 'col2'])
df['col3'] = df['col1'] + df['col2']
print(df)
Output:
col1 col2 col3
0 [1, 2, 3] [4, 5, 6] [1, 2, 3, 4, 5, 6]
1 [0, 5, 3] [6, 2, 5] [0, 5, 3, 6, 2, 5]
You can use the apply function on more than one column at once, like this:
def func(x):
    return x['col1'] + x['col2']

df['col3'] = df[['col1','col2']].apply(func, axis=1)
Why not do a simple df['col1'] + df['col2']?
Assume col1 holds lists stored as strings. In that case you can always modify func to:
def func(x):
    return x['col1'][1:-1].split(',') + x['col2']
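If the stringified lists hold numbers, slicing and splitting leaves you with strings; a hedged alternative (my suggestion, not part of the original answer) is ast.literal_eval, which parses each string back into a real list:

import ast
import pandas as pd

# col1 stored as strings, col2 as real lists (assumed shapes for illustration)
df = pd.DataFrame({'col1': ['[1,2,3]', '[0,5,3]'],
                   'col2': [[4, 5, 6], [6, 2, 5]]})

df['col3'] = df['col1'].apply(ast.literal_eval) + df['col2']
print(df['col3'].tolist())  # [[1, 2, 3, 4, 5, 6], [0, 5, 3, 6, 2, 5]]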

Merging multiple pandas dataframes into a single dataframe with contents concatenated as a list

I have a dictionary with an unknown number of pandas dataframes. Each dataframe contains a set of columns that are always present (user_ID) and a set of columns that may or may not be present. All dataframes have the same number and order of rows. The content of each cell is a list (for the columns I am interested in).
A simplified example:
import pandas as pd

df = {}
df['first'] = pd.DataFrame({'user_ID': [1, 2, 3],
                            'col1': [[1], [2, 3], [3]],
                            'col2': [[3], [3], [3, 1]],
                            'col3': [[], [1, 2, 3], [3, 1]]})
df['second'] = pd.DataFrame({'user_ID': [1, 2, 3],
                             'col1': [[1, 2], [3], [3]],
                             'col3': [[1], [2, 3], [3]],
                             'col4': [[3], [3], [3, 1]]})
df['last'] = pd.DataFrame({'user_ID': [1, 2, 3],
                           'col1': [[1], [2, 3], [3]],
                           'col2': [[3], [3], [3, 1]],
                           'col5': [[], [1, 2, 3], [3, 1]]})
They look like:
col1 col2 col3 user_ID
0 [1] [3] [] 1
1 [2, 3] [3] [1, 2, 3] 2
2 [3] [3, 1] [3, 1] 3
col1 col3 col4 user_ID
0 [1, 2] [1] [3] 1
1 [3] [2, 3] [3] 2
2 [3] [3] [3, 1] 3
col1 col2 col5 user_ID
0 [1] [3] [] 1
1 [2, 3] [3] [1, 2, 3] 2
2 [3] [3, 1] [3, 1] 3
How can I merge all these dataframes into a single dataframe where all columns that are not user_ID are merged so the contents are appended to the list?
Result should look like (order of elements in each list is irrelevant):
col1 col2 col3 col4 col5 user_ID
0 [1, 1, 2, 1] [3, 3] [1] [3] [] 1
1 [2, 3, 3, 2, 3] [3, 3] [1, 2, 3, 2, 3] [2] [1, 2, 3] 2
2 [3, 3, 3] [3, 1, 3, 1] [3, 1, 3] [3, 1] [3, 1] 3
I managed to concatenate the dataframes, but I still need to merge the resulting columns.
for dfName in ['first', 'second', 'last']:
    df[dfName] = df[dfName].drop(['user_ID'], axis=1)

merged = pd.concat(df, axis=1, keys=['first', 'second', 'last'])
print(merged)
outputs:
first second last \
col1 col2 col3 col1 col3 col4 col1 col2
0 [1] [3] [] [1, 2] [1] [3] [1] [3]
1 [2, 3] [3] [1, 2, 3] [3] [2, 3] [3] [2, 3] [3]
2 [3] [3, 1] [3, 1] [3] [3] [3, 1] [3] [3, 1]
col5
0 []
1 [1, 2, 3]
2 [3, 1]
Any ideas?
It's a little involved, but you will need df.groupby. First, use pd.concat and join them. Then replace NaNs using df.applymap, and finally the groupby and sum.
In [673]: pd.concat([df1, df2, df3], 0)\
              .applymap(lambda x: [] if x != x else x)\
              .groupby('user_ID', as_index=False).sum()
Out[673]:
user_ID col1 col2 col3 col4 col5
0 1 [1, 1, 2, 1] [3, 3] [1] [3] []
1 2 [2, 3, 3, 2, 3] [3, 3] [1, 2, 3, 2, 3] [3] [1, 2, 3]
2 3 [3, 3, 3] [3, 1, 3, 1] [3, 1, 3] [3, 1] [3, 1]
Slightly improved efficiency thanks to Maarten Fabré.
If you have an unknown number of dataframes, you can put them in a list or dict and pass that to pd.concat:
merged = pd.concat(df_list, 0). ...
You could use df.groupby('user_ID').sum() if it weren't for the NaN values, which cause all columns apart from col1 to drop.
To get around this you could use this rather ugly method
pd.concat((df0, df1, df2)).fillna(-1).applymap(lambda x: x if x != -1 else []).groupby('user_ID').sum()
I had to resort to the fillna(-1).applymap(...) because you can't seem to assign [] directly to an item. If someone has a better suggestion to do this, let me know
Edit: using @COLDSPEED's trick of comparing NaN to NaN,
pd.concat((df0, df1, df2)).applymap(lambda x: x if x == x else []).groupby('user_ID').sum()
works more easily.
If you want user_ID as a column instead of an index, just add .reset_index().
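Putting the pieces together on the question's dict of dataframes (a sketch based on the answers above; in recent pandas versions applymap is deprecated in favour of DataFrame.map, and summing lists concatenates them):

import pandas as pd

# df is the dict of dataframes defined in the question above
merged = (
    pd.concat(list(df.values()))                # stack all frames; missing columns become NaN
      .applymap(lambda x: x if x == x else [])  # NaN != NaN, so this swaps NaN for an empty list
      .groupby('user_ID', as_index=False)
      .sum()                                    # summing lists concatenates them per user_ID
)
print(merged)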

Find all duplicate rows in a pandas dataframe

I would like to be able to get the indices of all the instances of a duplicated row in a dataset without knowing the name and number of columns beforehand. So assume I have this:
   col
1    1
2    2
3    1
4    1
5    2
I'd like to be able to get [1, 3, 4] and [2, 5]. Is there any way to achieve this? It sounds really simple, but since I don't know the columns beforehand I can't do something like df[col == x...].
First filter all duplicated rows and then use groupby with apply, or convert the index to_series:
df = df[df.col.duplicated(keep=False)]
a = df.groupby('col').apply(lambda x: list(x.index))
print (a)
col
1 [1, 3, 4]
2 [2, 5]
dtype: object
a = df.index.to_series().groupby(df.col).apply(list)
print (a)
col
1 [1, 3, 4]
2 [2, 5]
dtype: object
And if need nested lists:
L = df.groupby('col').apply(lambda x: list(x.index)).tolist()
print (L)
[[1, 3, 4], [2, 5]]
If you need to use only the first column, it is possible to select it by position with iloc:
a = (df[df.iloc[:,0].duplicated(keep=False)]
       .groupby(df.iloc[:,0]).apply(lambda x: list(x.index)))
print (a)
col
1 [1, 3, 4]
2 [2, 5]
dtype: object
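Since the column names are not known in advance, a hedged generalisation (my sketch, not in the original answer) is to group by every column:

import pandas as pd

df = pd.DataFrame({'col': [1, 2, 1, 1, 2]}, index=[1, 2, 3, 4, 5])

# keep only rows that appear more than once, considering all columns
dups = df[df.duplicated(keep=False)]
# group by every column so this works whatever the columns are called
groups = dups.groupby(list(df.columns)).apply(lambda g: list(g.index)).tolist()
print(groups)  # [[1, 3, 4], [2, 5]]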
