Add a list to a dataframe in Python

I have a df:
   col1         col2
0     1    ONE AAKLD
1     2     TWO ERBB
2     3  THE COCCNUT
3     4     WOW AACE
and I have the following lists
list1 = ['a1', 'a2', 'a3']
list2 = ['b1', 'b2', 'b3']
list3 = ['c1', 'c2', 'c3']
I want to add the list values to different columns of particular rows in the df based on a condition, i.e., if col2 in the df contains AA, then I need list1 appended to that row.
Expected output:
   col1         col2   1   2   3
0     1    ONE AAKLD  a1  a2  a3
1     2     TWO ERBB  b1  b2  b3
2     3  THE COCCNUT  c1  c2  c3
3     4     WOW AACE  a1  a2  a3
Thanks

The code below provides the required output.
import pandas as pd

df_1 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['AA', 'BB', 'CC', 'AA']})
list1 = ['a1', 'a2', 'a3']
list2 = ['b1', 'b2', 'b3']
list3 = ['c1', 'c2', 'c3']

# One row per list, then derive the join key from each list's first character
df = pd.DataFrame([list1, list2, list3], columns=['1', '2', '3'])
df['col2'] = df['1'].str.split('', expand=True)[1].str.upper() * 2   # 'a1' -> 'AA'
pd.merge(df_1, df, left_on='col2', right_on='col2')
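A slightly more direct way to build the same match key is to take each list's first character with .str[0] (a sketch of the same idea):
df['col2'] = df['1'].str[0].str.upper() * 2   # 'a1' -> 'AA'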
Update as per OP's comments below:
Building on the earlier merge logic. You can easily change the matching criteria if the need arises.
import pandas as pd

## Data prep
df_1 = pd.DataFrame({'col1': [1, 2, 3, 4],
                     'col2': ['ONE AAKLD', 'TWO ERBB', 'THE COCCNUT', 'WOW AACE']})
df_1['join_col'] = 1
list1 = ['a1', 'a2', 'a3']
list2 = ['b1', 'b2', 'b3']
list3 = ['c1', 'c2', 'c3']

## Logic
df = pd.DataFrame([list1, list2, list3], columns=['1', '2', '3'])
df['match_col'] = df['1'].str.split('', expand=True)[1].str.upper() * 2   # 'a1' -> 'AA'
df['join_col'] = 1
# Cross join via the constant join_col, then keep rows whose match_col occurs in col2
df_2 = pd.merge(df_1, df, left_on='join_col', right_on='join_col')
df_2['Is_Match'] = df_2[['col2', 'match_col']].apply(lambda x: x['match_col'] in x['col2'], axis=1)
df_2[df_2['Is_Match']][['col1', 'col2', '1', '2', '3']]
Output: the expected table from the question, with list1 attached to the AA rows, list2 to the BB row, and list3 to the CC row.

Another option is using map:
d = {'AA': list1, 'BB': list2, 'CC': list3}
df_1[[1, 2, 3]] = df_1['col2'].map(d).apply(pd.Series)
print(df_1)
   col1 col2   1   2   3
0     1   AA  a1  a2  a3
1     2   BB  b1  b2  b3
2     3   CC  c1  c2  c3
3     4   AA  a1  a2  a3
Update:
It's not absolutely clear what your data is; anyway, you can try this (it won't work properly if a string contains more than one doubled character):
df_1[[1, 2, 3]] = df_1['col2'].str.extract(r'(([A-Z])\2)')[0].map(d).apply(pd.Series)
>>> df_1
   col1         col2   1   2   3
0     1    ONE AAKLD  a1  a2  a3
1     2     TWO ERBB  b1  b2  b3
2     3  THE COCCNUT  c1  c2  c3
3     4     WOW AACE  a1  a2  a3
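If the row-wise apply(pd.Series) expansion is slow on a larger frame, the same columns can be built directly from the mapped lists; a sketch of the same mapping, reusing d from above:
keys = df_1['col2'].str.extract(r'(([A-Z])\2)')[0]            # 'AA', 'BB', 'CC', 'AA'
df_1[[1, 2, 3]] = pd.DataFrame(keys.map(d).tolist(), index=df_1.index)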

You can also use pd.concat() with a pandas Series:
df = pd.DataFrame({"col1":[1,2,3],"col2":[4,5,6]})
lst = pd.Series([7,8,9])
pd.concat((df,lst),axis=1)
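concat aligns on the index, and the new column takes the Series' name (here 0, since lst is unnamed). Renaming the Series first gives the column a proper label, e.g.:
pd.concat((df, lst.rename('col3')), axis=1)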

Related

Optimal way of Reshaping Pandas Dataframe

I have a one-dimensional dataframe set up like this:
[A1,B1,C1,A2,B2,C2,A3,B3,C3,A4,B4,C4,A5,B5,C5,A6,B6,C6]
In my program, A1,...,C6 will be numbers read from a CSV.
I would like to reshape it into a 2d dataframe like this:
[A1,B1,C1]
[A2,B2,C2]
[A3,B3,C3]
[A4,B4,C4]
[A5,B5,C5]
[A6,B6,C6]
I could do this with loops, but it would slow the program down a lot, since I would be making this transformation many times. What is the optimal command for reshaping data this way? I looked through a bunch of the dataframe-reshaping questions but couldn't find anything specific to this. Thanks in advance.
Setup
import numpy as np
import pandas as pd

s = "A1,B1,C1,A2,B2,C2,A3,B3,C3,A4,B4,C4,A5,B5,C5,A6,B6,C6".split(',')
Using NumPy
pd.DataFrame(np.array(s).reshape(-1, 3))
    0   1   2
0  A1  B1  C1
1  A2  B2  C2
2  A3  B3  C3
3  A4  B4  C4
4  A5  B5  C5
5  A6  B6  C6
Iterator shenanigans
pd.DataFrame([*zip(*[iter(s)]*3)])
    0   1   2
0  A1  B1  C1
1  A2  B2  C2
2  A3  B3  C3
3  A4  B4  C4
4  A5  B5  C5
5  A6  B6  C6
Using a stride (step) when parsing the list, assuming the data is in the format you provided.
s = [A1,B1,C1,A2,B2,C2,A3,B3,C3,A4,B4,C4,A5,B5,C5,A6,B6,C6]
Note that if s is initially a dataframe with one row and 18 columns, you can convert it to a list via:
s = s.T.iloc[:, 0].tolist()
Then convert the result into a dataframe of your chosen dimension via:
df = pd.DataFrame({'A': s[::3], 'B': s[1::3], 'C': s[2::3]})
More generally:
s = range(18)
cols = 3
>>> pd.DataFrame([s[n:(n + cols)] for n in range(0, len(s), cols)])
    0   1   2
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17
Using list slicing in a comprehension
[s[x:x+3] for x in range(0, len(s),3)]
Out[1151]:
[['A1', 'B1', 'C1'],
['A2', 'B2', 'C2'],
['A3', 'B3', 'C3'],
['A4', 'B4', 'C4'],
['A5', 'B5', 'C5'],
['A6', 'B6', 'C6']]
#pd.DataFrame([s[x:x+3] for x in range(0, len(s),3)])
I would reshape the array. (For a freshly created 1-D array, order='A' behaves the same as the default row-major 'C' order, which fills each row left to right, as desired here.)
import numpy as np

mylist = np.array(['a1', 'b1', 'c1', 'a2', 'b2', 'c2', 'a3', 'b3', 'c3',
                   'a4', 'b4', 'c4', 'a5', 'b5', 'c5', 'a6', 'b6', 'c6'])
reshapedList = mylist.reshape((6, 3), order='A')
print(mylist)
['a1' 'b1' 'c1' 'a2' 'b2' 'c2' 'a3' 'b3' 'c3' 'a4' 'b4' 'c4' 'a5' 'b5' 'c5' 'a6' 'b6' 'c6']
print(reshapedList)
[['a1' 'b1' 'c1']
 ['a2' 'b2' 'c2']
 ['a3' 'b3' 'c3']
 ['a4' 'b4' 'c4']
 ['a5' 'b5' 'c5']
 ['a6' 'b6' 'c6']]
If you want a pandas dataframe, you can get it as follows.
df = pd.DataFrame(mylist.reshape((6, 3), order = 'A'), columns = list('ABC'))
>>> df
    A   B   C
0  a1  b1  c1
1  a2  b2  c2
2  a3  b3  c3
3  a4  b4  c4
4  a5  b5  c5
5  a6  b6  c6
Note:
It is important that you take some time to check the differences between a DataFrame and an array. Your question spoke of a dataframe, but what you really meant was an array.
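If the data really does start out as a one-row dataframe, you can also reshape its values directly; a small sketch, where df_wide is a hypothetical name for the 1-row, 18-column frame built from the list s in the Setup above:
df_wide = pd.DataFrame([s])                  # 1 row, 18 columns
df2 = pd.DataFrame(df_wide.to_numpy().reshape(-1, 3), columns=list('ABC'))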

How can I find the "set difference" of rows in two dataframes on a subset of columns in Pandas?

I have two dataframes, say df1 and df2, with the same column names.
Example:
df1
C1 | C2 | C3 | C4
A 1 2 AA
B 1 3 A
A 3 2 B
df2
C1 | C2 | C3 | C4
A 1 3 E
B 1 2 C
Q 4 1 Z
I would like to filter out rows in df1 based on common values in a fixed subset of columns between df1 and df2. In the above example, if the columns are C1 and C2, I would like the first two rows to be filtered out, as their values in both df1 and df2 for these columns are identical.
What would be a clean way to do this in Pandas?
So far, based on this answer, I have been able to find the common rows.
common_df = pandas.merge(df1, df2, how='inner', on=['C1','C2'])
This gives me a new dataframe with only those rows that have common values in the specified columns, i.e., the intersection.
I have also seen this thread, but the answers all seem to assume a difference on all the columns.
The expected result for the above example (rows common on specified columns removed):
C1 | C2 | C3 | C4
A 3 2 B
Maybe not the cleanest, but you could add a key column to df1 to check against.
Setting up the datasets
import pandas as pd

df1 = pd.DataFrame({'C1': ['A', 'B', 'A'],
                    'C2': [1, 1, 3],
                    'C3': [2, 3, 2],
                    'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({'C1': ['A', 'B', 'Q'],
                    'C2': [1, 1, 4],
                    'C3': [3, 2, 1],
                    'C4': ['E', 'C', 'Z']})
Adding a key, then using your code to find the common rows:
df1['key'] = range(1, len(df1) + 1)
common_df = pd.merge(df1, df2, how='inner', on=['C1', 'C2'])
# Keep only df1 rows whose key did not survive the inner join, then drop the helper column
df_filter = df1[~df1['key'].isin(common_df['key'])].drop('key', axis=1)
You can use an anti-join approach: do an outer join on the specified columns with indicator=True, which records which side of the join each row came from. The only downside is that you have to rename and drop the extra columns after the join.
>>> import pandas as pd
>>> df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
>>> df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
>>> df_merged = df1.merge(df2, on=['C1','C2'], indicator=True, how='outer')
>>> df_merged
C1 C2 C3_x C4_x C3_y C4_y _merge
0 A 1 2.0 AA 3.0 E both
1 B 1 3.0 A 2.0 C both
2 A 3 2.0 B NaN NaN left_only
3 Q 4 NaN NaN 1.0 Z right_only
>>> df1_setdiff = df_merged[df_merged['_merge'] == 'left_only'].rename(columns={'C3_x': 'C3', 'C4_x': 'C4'}).drop(['C3_y', 'C4_y', '_merge'], axis=1)
>>> df1_setdiff
C1 C2 C3 C4
2 A 3 2.0 B
>>> df2_setdiff = df_merged[df_merged['_merge'] == 'right_only'].rename(columns={'C3_y': 'C3', 'C4_y': 'C4'}).drop(['C3_x', 'C4_x', '_merge'], axis=1)
>>> df2_setdiff
C1 C2 C3 C4
3 Q 4 1.0 Z
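If you only need df1's side of the difference, the rename/drop step can be avoided by joining against just the key columns of df2; a sketch of the same anti-join idea:
only_in_df1 = df1.merge(df2[['C1', 'C2']], on=['C1', 'C2'], how='left', indicator=True)
only_in_df1 = only_in_df1[only_in_df1['_merge'] == 'left_only'].drop(columns='_merge')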
import pandas as pd

df1 = pd.DataFrame({'C1': ['A', 'B', 'A'], 'C2': [1, 1, 3], 'C3': [2, 3, 2], 'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({'C1': ['A', 'B', 'Q'], 'C2': [1, 1, 4], 'C3': [3, 2, 1], 'C4': ['E', 'C', 'Z']})
common = pd.merge(df1, df2, on=['C1', 'C2'])
# Caveat: isin tests each column independently, so this can over-filter when a C1 value
# and a C2 value both appear somewhere in `common`, just not in the same row.
R1 = df1[~((df1.C1.isin(common.C1)) & (df1.C2.isin(common.C2)))]
R2 = df2[~((df2.C1.isin(common.C1)) & (df2.C2.isin(common.C2)))]
df1:
C1 C2 C3 C4
0 A 1 2 AA
1 B 1 3 A
2 A 3 2 B
df2:
C1 C2 C3 C4
0 A 1 3 E
1 B 1 2 C
2 Q 4 1 Z
common:
C1 C2 C3_x C4_x C3_y C4_y
0 A 1 2 AA 3 E
1 B 1 3 A 2 C
R1:
C1 C2 C3 C4
2 A 3 2 B
R2:
C1 C2 C3 C4
2 Q 4 1 Z
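A more robust variant of the same idea compares the key columns as whole rows rather than column by column, using a MultiIndex; a sketch, assuming pandas >= 0.24 for MultiIndex.from_frame:
idx1 = pd.MultiIndex.from_frame(df1[['C1', 'C2']])
idx2 = pd.MultiIndex.from_frame(df2[['C1', 'C2']])
R1 = df1[~idx1.isin(idx2)]   # df1 rows whose (C1, C2) pair is absent from df2
R2 = df2[~idx2.isin(idx1)]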

Pandas: How to expand data frame rows containing a dictionary with varying keys in a column?

I'm a little stuck; can you please help me with this? I've simplified the problem I'm facing to the following:
Input:
    a   b                                            c
0  a0  b0                 {'c00': 'v00', 'c01': 'v01'}
1  a1  b1                               {'c10': 'v10'}
2  a2  b2  {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}
Desired output:
    a   b  key  val
0  a0  b0  c00  v00
1  a0  b0  c01  v01
2  a1  b1  c10  v10
3  a2  b2  c20  v20
4  a2  b2  c21  v21
5  a2  b2  c22  v22
I know how to handle the case where the dictionaries in column c have the same keys.
You can create a DataFrame with the constructor, reshape with stack, and finally join back to the original:
df1 = (pd.DataFrame(df.c.values.tolist())
         .stack()
         .reset_index(level=1)
         .rename(columns={0: 'val', 'level_1': 'key'}))
print (df1)
   key  val
0  c00  v00
0  c01  v01
1  c10  v10
2  c20  v20
2  c21  v21
2  c22  v22
df = df.drop(columns='c').join(df1).reset_index(drop=True)
print (df)
    a   b  key  val
0  a0  b0  c00  v00
1  a0  b0  c01  v01
2  a1  b1  c10  v10
3  a2  b2  c20  v20
4  a2  b2  c21  v21
5  a2  b2  c22  v22
Here is one way:
import numpy as np
import pandas as pd
from itertools import chain

df = pd.DataFrame([['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
                   ['a1', 'b1', {'c10': 'v10'}],
                   ['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}]],
                  columns=['a', 'b', 'c'])
# first convert 'c' to a list of (key, value) tuples
df['c'] = df['c'].apply(lambda x: list(x.items()))
lens = list(map(len, df['c']))
# create the dataframe, repeating 'a' and 'b' once per tuple
df_out = pd.DataFrame({'a': np.repeat(df['a'].values, lens),
                       'b': np.repeat(df['b'].values, lens),
                       'c': list(chain.from_iterable(df['c'].values))})
# unpack the tuples into separate columns
df_out = df_out.join(df_out['c'].apply(pd.Series))\
               .rename(columns={0: 'key', 1: 'val'}).drop(columns='c')
# a b key val
# 0 a0 b0 c00 v00
# 1 a0 b0 c01 v01
# 2 a1 b1 c10 v10
# 3 a2 b2 c20 v20
# 4 a2 b2 c21 v21
# 5 a2 b2 c22 v22
My solution is as follows:
import pandas as pd

t = pd.DataFrame([['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
                  ['a1', 'b1', {'c10': 'v10'}],
                  ['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}]],
                 columns=['a', 'b', 'c'])
l2 = []
for i in t.index:
    for j in t.loc[i, 'c']:
        l2 += [[t.loc[i, 'a'], t.loc[i, 'b'], j, t.loc[i, 'c'][j]]]
t2 = pd.DataFrame(l2, columns=['a', 'b', 'key', 'val'])
where t is your DataFrame, built however you obtain it.
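On newer pandas (0.25+) there is also a built-in route via explode, which turns a list-valued column into one row per element; a sketch, assuming df still holds the original dict column c:
out = (df.assign(c=df['c'].map(lambda d: list(d.items())))   # dict -> [(key, val), ...]
         .explode('c')
         .reset_index(drop=True))
out['key'] = out['c'].str[0]   # .str[i] indexes into the tuples
out['val'] = out['c'].str[1]
out = out.drop(columns='c')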

How to concatenate dataframes using 2 columns as key?

How do I concatenate in pandas using a column as a key, like we do in SQL?
df1
col1 col2
a1 b1
a2 b2
a3 b3
a4 b4
df2
col3 col4
a1 d1
a2 d2
a3 d3
I want to merge/concatenate them on col1 = col3 without getting rid of records that are not in col3 but are in col1, similar to a left join in SQL.
df
col1 col2 col4
a1 b1 d1
a2 b2 d2
a3 b3 d3
a4 b4 NA
Does the following work for you:
df1 = pd.DataFrame(
    [
        ['a1', 'b1'],
        ['a2', 'b2'],
        ['a3', 'b3'],
        ['a4', 'b4']
    ],
    columns=['col1', 'col2']
)
df2 = pd.DataFrame(
    [
        ['a1', 'd1'],
        ['a2', 'd2'],
        ['a3', 'd3']
    ],
    columns=['col3', 'col4']
)
df1 = df1.set_index('col1')
df2 = df2.set_index('col3')
# keep only df2 rows whose key exists in df1, then align the two frames on the index
dd = df2[df2.index.isin(df1.index)]
df = pd.concat([df1, dd], axis=1).reset_index().rename(columns={'index': 'col1'})
# Output
  col1 col2 col4
0   a1   b1   d1
1   a2   b2   d2
2   a3   b3   d3
3   a4   b4  NaN
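The more direct route for a SQL-style left join is pd.merge; a sketch doing the same thing in one step, with df1 and df2 as originally constructed (before set_index):
df = (df1.merge(df2, left_on='col1', right_on='col3', how='left')
         .drop(columns='col3'))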

Pandas multiIndex: add new indexes for each existing index

I have a dataframe that looks like this:
   ID Type
0  a1    y
1  a1    y
2  a2    y
3  a2    n
4  a3    n
I want to re-index it to look like this:
ID  Subindex Type
a1  1        y
    2        y
a2  1        y
    2        n
a3  1        n
Any command in pandas that could do this? Thank you so much!
To number the items in each group, use cumcount:
import pandas as pd
df = pd.DataFrame({'ID': ['a1', 'a1', 'a2', 'a2', 'a3'],
                   'Type': ['y', 'y', 'y', 'n', 'n']})
df['Subindex'] = df.groupby('ID').cumcount()+1
print(df)
yields
   ID Type  Subindex
0  a1    y         1
1  a1    y         2
2  a2    y         1
3  a2    n         2
4  a3    n         1
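To reproduce the exact display from the question, with ID and Subindex forming the index, move both columns into the index; a small sketch:
print(df.set_index(['ID', 'Subindex']))
which prints something like:
             Type
ID Subindex
a1 1            y
   2            y
a2 1            y
   2            n
a3 1            n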
