df = pd.read_csv("school_data.csv")

      col1     col2
0  [1,2,3]  [4,5,6]
1  [0,5,3]  [6,2,5]

Desired output:

      col1     col2           col3
0  [1,2,3]  [4,5,6]  [1,2,3,4,5,6]
1  [0,5,3]  [6,2,5]  [0,5,3,6,2,5]

The values in col1 and col2 are unique. How can I do this using pandas?
Simplest way would be to do this:
df['col3'] = df['col1'] + df['col2']
Example:
import pandas as pd
row1 = [[1,2,3], [4,5,6]]
row2 = [[0,5,3], [6,2,5]]
df = pd.DataFrame(data=[row1, row2], columns=['col1', 'col2'])
df['col3'] = df['col1'] + df['col2']
print(df)
Output:
col1 col2 col3
0 [1, 2, 3] [4, 5, 6] [1, 2, 3, 4, 5, 6]
1 [0, 5, 3] [6, 2, 5] [0, 5, 3, 6, 2, 5]
You can also use the apply function on more than one column at once, like this:

def func(x):
    return x['col1'] + x['col2']

df['col3'] = df[['col1','col2']].apply(func, axis=1)
Why not just do df['col1'] + df['col2']? Suppose col1 holds lists stored as strings. In that case you can always modify func to:

def func(x):
    return x['col1'][1:-1].split(',') + x['col2']

(Note this splits the string into a list of strings, not integers.)
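If the strings need to come back as real lists of numbers, a sketch using ast.literal_eval instead of split (assuming col1 holds strings such as "[1,2,3]"):

```python
import ast

import pandas as pd

# Hypothetical case: col1 holds string representations of lists.
df = pd.DataFrame({'col1': ['[1,2,3]', '[0,5,3]'],
                   'col2': [[4, 5, 6], [6, 2, 5]]})

# ast.literal_eval parses each string back into a real list,
# so the elements keep their numeric types before concatenation.
df['col3'] = df['col1'].apply(ast.literal_eval) + df['col2']
print(df)
```

Unlike the split approach, this yields [1, 2, 3, 4, 5, 6] with integers throughout rather than a mix of strings and numbers.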
Related

I am trying something like this post: List append in pandas cell. But that post is old, and the approaches it shows are deprecated and should not be used anymore.
d = {'col1': ['TEST', 'TEST'], 'col2': [[1, 2], [1, 2]], 'col3': [35, 89]}
df = pd.DataFrame(data=d)
   col1    col2  col3
0  TEST  [1, 2]    35
1  TEST  [1, 2]    89
My DataFrame looks like this, where col2 is the column I am interested in. I need to append [0, 0] to the list in col2 for every row in the DataFrame. My real DataFrame has a dynamic shape, so I can't just set every cell individually.
End result should look like this:

   col1          col2  col3
0  TEST  [1, 2, 0, 0]    35
1  TEST  [1, 2, 0, 0]    89
I fooled around with df.apply and df.assign but I can't seem to get it to work.
I tried:
df['col2'] += [0, 0]
df = df.col2.apply(lambda x: x.append([0,0]))
which returns a Series that looks nothing like what I need, and:

df = df.assign(new_column = lambda x: x + list([0, 0]))
Not sure if this is the best way to go, but option 2 works with a little modification:
import pandas as pd
d = {'col1': ['TEST', 'TEST'], 'col2': [[1, 2], [1, 2]], 'col3': [35, 89]}
df = pd.DataFrame(data=d)
df["col2"] = df["col2"].apply(lambda x: x + [0,0])
print(df)
Firstly, if you want to add all members of an iterable to a list, use .extend instead of .append (.append([0, 0]) would nest the whole list as a single element). Secondly, your apply attempt fails because .append works in place and returns None, so every "col2" value becomes None; list concatenation avoids that. Finally, assign the modified column back to the original DataFrame rather than overwriting the whole DataFrame with the Series (that is why you got back a bare Series).
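The in-place behaviour is easy to see on plain lists:

```python
lst = [1, 2]

# append adds its argument as a single element and returns None,
# which is why the apply attempt filled the column with None.
result = lst.append([0, 0])
print(result)  # None
print(lst)     # [1, 2, [0, 0]] -- the list is nested, not flattened

# extend adds each element individually (also in place).
lst2 = [1, 2]
lst2.extend([0, 0])
print(lst2)    # [1, 2, 0, 0]
```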
One idea is to use a list comprehension:
df["col2"] = [x + [0,0] for x in df["col2"]]
print (df)
col1 col2 col3
0 TEST [1, 2, 0, 0] 35
1 TEST [1, 2, 0, 0] 89
Or mutate each list in place with a plain loop (extend, not append, so both zeros are added as separate elements):

for val in df['col2']:
    val.extend([0, 0])
Say I have the data below:
df = pd.DataFrame({'col1': [1, 2, 1],
'col2': [2, 4, 3],
'col3': [3, 6, 5],
'col4': [4, 8, 7]})
Is there a way to use list comprehensions to filter data efficiently? For example, if I wanted to find all cases where col2 was even OR col3 was even OR col4 was even, is there a simpler way than writing this?
df[(df['col2'] % 2 == 0) | (df['col3'] % 2 == 0) | (df['col4'] % 2 == 0)]
It would be nice if I could pass in a list of columns and the condition to check.
df[(df[cols] % 2 == 0).any(axis=1)]
where cols is your list of columns
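A quick check with the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1],
                   'col2': [2, 4, 3],
                   'col3': [3, 6, 5],
                   'col4': [4, 8, 7]})

cols = ['col2', 'col3', 'col4']
# (df[cols] % 2 == 0) is a boolean frame; .any(axis=1) keeps rows
# where at least one of the chosen columns is even.
result = df[(df[cols] % 2 == 0).any(axis=1)]
print(result)
```

Rows 0 and 1 survive; row 2 (3, 5, 7 are all odd) is dropped, matching the hand-written chain of | conditions.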
I have a DataFrame like
name col1 col2
a aa 123
a bb 123
b aa 234
and a list
[1, 2, 3]
I want to replace col2 of every row where col1 == 'aa' with the list, like:
name col1 col2
a aa [1, 2, 3]
a bb 123
b aa [1, 2, 3]
I tried something like
df.loc[df['col1'] == 'aa', 'col2'] = [1, 2, 3]
but it gives me the error:
ValueError: could not broadcast input array from shape (xx,) into shape (yy,)
How should I get around this?
Keep it simple; np.where should do it. Code below:

df['col2'] = np.where(df['col1'] == 'aa', str(lst), df['col2'])

(Note that this stores the string representation of the list.) Alternatively, use pd.Series with the list wrapped in double brackets:

df['col2'] = np.where(df['col1'] == 'aa', pd.Series([lst]), df['col2'])
import pandas as pd
df = pd.DataFrame({"name":["a","a","b"],"col1":["aa","bb","aa"],"col2":[123,123,234]})
l = [1,2,3]
df["col2"] = df.apply(lambda x: l if x.col1 == "aa" else x.col2, axis =1)
df
A list comprehension with an if/else should work
df['col2'] = [x['col2'] if x['col1'] != 'aa' else [1,2,3] for ind,x in df.iterrows()]
It is safer to do this with a for loop:

df.col2 = df.col2.astype(object)
for x in df.index:
    if df.at[x, 'col1'] == 'aa':
        df.at[x, 'col2'] = [1, 2, 3]
df
name col1 col2
0 a aa [1, 2, 3]
1 a bb 123
2 b aa [1, 2, 3]
You can also use:
data = {'aa':[1,2,3]}
df['col2'] = np.where(df['col1'] == 'aa', df['col1'].map(data), df['col2'])
Use this with care, though: every matching cell references the same list object, so mutating one mutates the others as well:
df['col2'].loc[0].append(5)
print(df)
#OUTPUT
name col1 col2
0 a aa [1, 2, 3, 5]
1 a bb 123
2 b aa [1, 2, 3, 5]
But this is fine:
df = df.loc[1:]
print(df)
#OUTPUT
name col1 col2
1 a bb 123
2 b aa [1, 2, 3]
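If independent lists per row are needed, one workaround (a sketch, not one of the original answers) is to build a fresh copy for each matching row:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b"],
                   "col1": ["aa", "bb", "aa"],
                   "col2": [123, 123, 234]})
lst = [1, 2, 3]

# list(lst) constructs a new list object for every matching row,
# so mutating one cell no longer affects the others.
df["col2"] = df.apply(lambda x: list(lst) if x.col1 == "aa" else x.col2,
                      axis=1)

df.loc[0, "col2"].append(5)
print(df)  # only row 0 shows [1, 2, 3, 5]; row 2 keeps [1, 2, 3]
```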
Here is my dataframe:
|   col1    |   col2    |   col3    |
|-----------|-----------|-----------|
| [1,2,3,4] | [1,2,3,4] | [1,2,3,4] |
I also have this function:
def joiner(col1, col2, col3):
    snip = []
    snip.append(col1)
    snip.append(col2)
    snip.append(col3)
    return snip
I want to call this on each of the columns and assign it to a new column.
My end goal would be something like this:
|   col1    |   col2    |   col3    |              col4               |
|-----------|-----------|-----------|---------------------------------|
| [1,2,3,4] | [1,2,3,4] | [1,2,3,4] | [[1,2,3,4],[1,2,3,4],[1,2,3,4]] |
Just apply list on axis=1; it'll create a list for each row:
>>> df['col4'] = df.apply(list, axis=1)
OUTPUT:
col1 col2 col3 col4
0 [1, 2, 3, 4] [1, 2, 3, 4] [1, 2, 3, 4] [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
You can just do
df['col'] = df.values.tolist()
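Both one-liners build the same nested lists; a quick sketch comparing them on a one-row frame:

```python
import pandas as pd

row = [1, 2, 3, 4]
df = pd.DataFrame({'col1': [row], 'col2': [row], 'col3': [row]})

via_apply = df.apply(list, axis=1)   # Series: one list of cell values per row
via_values = df.values.tolist()      # plain nested list with the same content

print(via_apply[0])
print(via_values[0])
```

The df.values.tolist() form skips the per-row Python call that apply makes, which can matter on large frames.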
I have two dataframes from two sources that should be the same.
I would like to iterate through each combination of a subset of columns and count the number of rows that differ between the two dataframes.
Right now I can do it manually, but I would like to write a function or script that can automate this for me. Any ideas?
d = {'col1': [1, 2, 3], 'col2': [3, 4, 6], 'col3': [3, 4, 7]}
d1 = {'col1': [1, 2, 3], 'col2': [3, 4, 6], 'col3': [3, 4, 3]}
df = pd.DataFrame(data=d)
df1 = pd.DataFrame(data=d1)
Check all rows:
merged = df.merge(df1, indicator=True, how='outer')
rows_missing_from_df = merged[merged['_merge'] == 'right_only']
rows_missing_from_df.shape
(1, 4)
Check rows for just col1 and col2
df_select = df[['col1', 'col2']]
df1_select = df1[['col1', 'col2']]
merged = df_select.merge(df1_select, indicator=True, how='outer')
rows_missing_from_df = merged[merged['_merge'] == 'right_only']
rows_missing_from_df.shape
(0, 3)
Check rows for just col2 and col3
df_select_1 = df[['col2', 'col3']]
df1_select_1 = df1[['col2', 'col3']]
merged = df_select_1.merge(df1_select_1, indicator=True, how='outer')
rows_missing_from_df = merged[merged['_merge'] == 'right_only']
rows_missing_from_df.shape
(1, 3)
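The manual steps above can be wrapped in a function that runs the same merge for every column combination; a sketch using itertools.combinations (the function names here are made up for illustration):

```python
import itertools

import pandas as pd

def count_right_only(df, df1, cols):
    """Count rows of df1[cols] that have no matching row in df[cols]."""
    merged = df[cols].merge(df1[cols], indicator=True, how='outer')
    return (merged['_merge'] == 'right_only').sum()

def compare_column_subsets(df, df1, size=2):
    """Run the count for every combination of `size` columns."""
    return {cols: count_right_only(df, df1, list(cols))
            for cols in itertools.combinations(df.columns, size)}

d = {'col1': [1, 2, 3], 'col2': [3, 4, 6], 'col3': [3, 4, 7]}
d1 = {'col1': [1, 2, 3], 'col2': [3, 4, 6], 'col3': [3, 4, 3]}
df, df1 = pd.DataFrame(data=d), pd.DataFrame(data=d1)

results = compare_column_subsets(df, df1)
print(results)
```

With the sample data this reports 0 differing rows for (col1, col2) and 1 each for (col1, col3) and (col2, col3), matching the manual checks above.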