I have two dataframes and I want to join them on a column where, in one of them, the column holds a list.
The join should match if any value in the list matches.
df1 =
| index | col_1 |
| ----- | ----- |
| 1 | 'a' |
| 2 | 'b' |
df2 =
| index_2 | col_1 |
| ------- | ----- |
| A | ['a', 'c'] |
| B | ['a', 'd', 'e'] |
I am looking for something like
df1.join(df2, on='col_1', type_=any, type='left')
| index | col_1_x | index_2 | col_1_y         |
| ----- | ------- | ------- | --------------- |
| 1     | 'a'     | A       | ['a', 'c']      |
| 1     | 'a'     | B       | ['a', 'd', 'e'] |
You can use explode and then merge, like so:
import pandas as pd
# Create the input dataframes
df1 = pd.DataFrame({'index': [1, 2], 'col_1': ['a', 'b']})
df2 = pd.DataFrame({'index_2': ['A', 'B'], 'col_1': [['a', 'c'], ['a', 'd', 'e']]})
# Explode the list column in df2 to multiple rows
df2_exploded = df2.explode('col_1')
# Perform a regular join on the common column
result = df1.merge(df2_exploded, on='col_1', how='left')
# Get the "col_1" from un-exploded data
result = result.merge(df2, on='index_2', how='left').dropna()
df2_exploded looks like this:
index_2 col_1
0 A a
0 A c
1 B a
1 B d
1 B e
The final result looks like this:
index col_1_x index_2 col_1_y
0 1 a A [a, c]
1 1 a B [a, d, e]
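For reference, the two merges above can be chained into a single expression; this is just a condensed sketch of the same explode-then-merge idea, not a different method:

```python
import pandas as pd

df1 = pd.DataFrame({'index': [1, 2], 'col_1': ['a', 'b']})
df2 = pd.DataFrame({'index_2': ['A', 'B'], 'col_1': [['a', 'c'], ['a', 'd', 'e']]})

# Explode, match on the scalar values, then re-attach the original list column
result = (df1.merge(df2.explode('col_1'), on='col_1', how='left')
             .merge(df2, on='index_2', how='left')
             .dropna())
```

The dropna at the end discards df1 rows (like 'b') that matched nothing.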
You can do the following:
import pandas as pd
df1 = pd.DataFrame({'index': [1, 2], 'col_1': ['a', 'b']})
df2 = pd.DataFrame({'index_2': ['A', 'B'], 'col_1': [['a', 'c'], ['a', 'd', 'e']]})
# check for matches
def any_match(list1, list2):
    if list1 is None or list2 is None:
        return False
    return any(x in list2 for x in list1)
# join the dataframes based on matching values
result = pd.merge(df1, df2, how='cross')
result = result[result.apply(lambda x: any_match(x['col_1_x'], x['col_1_y']), axis=1)]
print(result[['index', 'col_1_x', 'index_2', 'col_1_y']])
which returns:
index col_1_x index_2 col_1_y
0 1 a A [a, c]
1 1 a B [a, d, e]
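Since col_1 in df1 holds scalar strings rather than lists here, the helper can be reduced to a plain membership test; a minimal sketch of the same cross-join idea:

```python
import pandas as pd

df1 = pd.DataFrame({'index': [1, 2], 'col_1': ['a', 'b']})
df2 = pd.DataFrame({'index_2': ['A', 'B'], 'col_1': [['a', 'c'], ['a', 'd', 'e']]})

# Build all pairs, then keep rows where the scalar appears in the list
result = pd.merge(df1, df2, how='cross')
result = result[result.apply(lambda x: x['col_1_x'] in x['col_1_y'], axis=1)]
```

Note the cross merge materializes len(df1) * len(df2) rows, so this scales poorly on large frames.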
Related
I have a pandas DataFrame df:
import pandas as pd
# Create a Pandas dataframe from some data.
df = pd.DataFrame({'Var1': ['d', 'a --> b', 'e', 'c --> d'],
'Var2': ['a', 'e', 'a --> b', 'd'],
'Var3': ['c', 'd', 'a --> b', 'e']})
Which looks like this when printed (for reference):
| | Var1 | Var2 | Var3 |
|---|---------|---------|---------|
| 0 | d | a | c |
| 1 | a --> b | e | d |
| 2 | e | a --> b | a --> b |
| 3 | c --> d | d | e |
I would like to keep just the rows 1, 2 and 3 that contain the value '-->'. In other words, I want to drop all rows in my dataframe that don't contain at least one column with the value '-->'.
I know how to filter just one column: df[df['Var1'].str.contains('-->', regex=False)] gives me rows 1 and 3.
But I don't know how to apply this to all columns. I read some similar cases here and here, but couldn't figure out how to adapt them to my case.
Can you suggest a way to select those rows?
Combine all columns into one and search for the substring:
df[df.sum(axis=1).str.contains('-->')]
#       Var1     Var2     Var3
# 1  a --> b        e        d
# 2        e  a --> b  a --> b
# 3  c --> d        d        e
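One caveat with the sum(axis=1) trick: it only works when every column already holds strings. A slightly more defensive variant casts first:

```python
import pandas as pd

df = pd.DataFrame({'Var1': ['d', 'a --> b', 'e', 'c --> d'],
                   'Var2': ['a', 'e', 'a --> b', 'd'],
                   'Var3': ['c', 'd', 'a --> b', 'e']})

# Cast everything to str so non-string columns don't break the concatenation
mask = df.astype(str).sum(axis=1).str.contains('-->', regex=False)
out = df[mask]
```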
You can filter them out using this.
df1 = df[df.apply(lambda x: any(x.str.contains('-->')), axis=1)]
print (df1)
The output of this will be:
Original DataFrame:
      Var1     Var2     Var3
0        d        a        c
1  a --> b        e        d
2        e  a --> b  a --> b
3  c --> d        d        e
DF1 contains only the rows with arrows:
      Var1     Var2     Var3
1  a --> b        e        d
2        e  a --> b  a --> b
3  c --> d        d        e
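A column-wise variant of the same idea avoids the per-row Python-level any; this is a sketch of an equivalent filter, not the answer's exact code:

```python
import pandas as pd

df = pd.DataFrame({'Var1': ['d', 'a --> b', 'e', 'c --> d'],
                   'Var2': ['a', 'e', 'a --> b', 'd'],
                   'Var3': ['c', 'd', 'a --> b', 'e']})

# str.contains per column yields a boolean frame; any(axis=1) flags matching rows
mask = df.apply(lambda col: col.str.contains('-->', regex=False)).any(axis=1)
out = df[mask]
```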
Try .stack() with a boolean index.
s = df.stack().str.contains('-->').reset_index(1,drop=True)
df.loc[s[s].index.unique()]
      Var1     Var2     Var3
1  a --> b        e        d
2        e  a --> b  a --> b
3  c --> d        d        e
I need to aggregate two columns of my dataframe, count the values of the second column, and then take only the row with the highest value in the "count" column. Let me show:
df =
col1|col2
---------
A | AX
A | AX
A | AY
A | AY
A | AY
B | BX
B | BX
B | BX
B | BY
B | BY
C | CX
C | CX
C | CX
C | CX
C | CX
------------
df1 = df.groupby(['col1', 'col2']).agg({'col2': 'count'})
df1.columns = ['count']
df1= df1.reset_index()
out:
col1 col2 count
A AX 2
A AY 3
B BX 3
B BY 2
C CX 5
so far so good, but now I need to get only the row of each 'col1' group that has the maximum 'count' value, but keeping the value in 'col2'.
expected output in the end:
col1 col2 count
A AY 3
B BX 3
C CX 5
I have no idea how to do that. My attempts so far of using the max() aggregation always left the 'col2' out.
From your original DataFrame you can use .value_counts, which returns counts in descending order within each group; given this sorting, drop_duplicates will keep the most frequent value within each group.
df1 = (df.groupby('col1')['col2'].value_counts()
.rename('counts').reset_index()
.drop_duplicates('col1'))
col1 col2 counts
0 A AY 3
2 B BX 3
4 C CX 5
Probably not ideal, but this works (applied to df1 before the reset_index, while it still has the (col1, col2) MultiIndex):
df1.loc[df1.groupby(level=0).idxmax()['count']]
col1 col2 count
A AY 3
B BX 3
C CX 5
This works because the groupby inside the loc returns the (col1, col2) index labels of the per-group maxima, which loc then looks up.
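To make the assumption explicit: the snippet operates on df1 while it still carries the (col1, col2) MultiIndex, before reset_index. A self-contained sketch, rebuilding the counted frame with size():

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
    'col2': ['AX', 'AX', 'AY', 'AY', 'AY',
             'BX', 'BX', 'BX', 'BY', 'BY',
             'CX', 'CX', 'CX', 'CX', 'CX'],
})
# Same counted frame as in the question, with its MultiIndex intact
df1 = df.groupby(['col1', 'col2']).size().to_frame('count')

# idxmax per first level yields the (col1, col2) label of each group's maximum;
# loc then pulls those rows up
result = df1.loc[list(df1.groupby(level=0).idxmax()['count'])]
```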
I guess you need this: df['qty'] = 1 and then df.groupby(['col1', 'col2'])['qty'].sum().reset_index()
Option 1: Include Ties
In case you have ties and want to show them.
Ties could be, for instance, both (B, BX) and (B, BY) occur 3 times.
# Prepare packages
import pandas as pd
# Create dummy data
df = pd.DataFrame({
'col1': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
'col2': ['AX', 'AX', 'AY', 'AY', 'AY', 'BX', 'BX', 'BX', 'BY', 'BY', 'BY', 'CX', 'CX', 'CX', 'CX', 'CX'],
})
# Get Max Value by Group with Ties
df_count = df.groupby('col1', as_index=False)['col2'].value_counts()
m = df_count.groupby(['col1'])['count'].transform('max') == df_count['count']
df1 = df_count[m]
col1 col2 count
0 A AY 3
2 B BX 3
3 B BY 3
4 C CX 5
Option 2: Short Code Ignoring Ties
df1 = (df
.groupby('col1')['col2']
.value_counts()
.groupby(level=0)
.head(1)
# .to_frame('count').reset_index() # Uncomment to get exact output requested
)
I have a dictionary - {'Car': ['a', 'b'], 'Bike': ['q', 'w', 'e']}
I want to generate a data frame like this -
S.no. | vehicle | model
1     | Car     | a
2     | Car     | b
3     | Bike    | q
4     | Bike    | w
5     | Bike    | e
I tried df = pd.DataFrame(vDict) but I get a "ValueError: arrays must all be same length". Help please?
Use:
pd.Series(dct, name='model').explode().rename_axis(index='vehicle').reset_index()
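A runnable version of that one-liner, for reference (the vehicle/model column names come from the question):

```python
import pandas as pd

dct = {'Car': ['a', 'b'], 'Bike': ['q', 'w', 'e']}

# Series of lists -> one row per list element -> named index -> columns
out = (pd.Series(dct, name='model')
         .explode()
         .rename_axis('vehicle')
         .reset_index())
```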
We can use pd.DataFrame.from_dict here, then use stack and finally clean up our index and column names:
dct = {'Car': ['a', 'b'], 'Bike': ['q', 'w', 'e']}
df = pd.DataFrame.from_dict(dct, orient='index').stack()
df = df.reset_index(level=0, name='model').rename(columns={'level_0':'vehicle'})
df = df.reset_index(drop=True)
vehicle model
0 Car a
1 Car b
2 Bike q
3 Bike w
4 Bike e
My column in dataframe contains indices of values in list. Like:
id | idx
A | 0
B | 0
C | 2
D | 1
list = ['a', 'b', 'c', 'd']
I want to replace each value in idx column greater than 0 by value in list of corresponding index, so that:
id | idx
A | 0
B | 0
C | c # list[2]
D | b # list[1]
I tried to do this with a loop, but it does nothing... although if I remove the ['idx'] it will replace all values in that row:
for index in df.idx.values:
    if index >= 1:
        df[df.idx == index]['idx'] = list[index]
Don't use list as a variable name, because it shadows the Python builtin.
Then use Series.map with enumerate inside Series.mask:
L = ['a', 'b', 'c', 'd']
df['idx'] = df['idx'].mask(df['idx'] >=1, df['idx'].map(dict(enumerate(L))))
print (df)
id idx
0 A 0
1 B 0
2 C c
3 D b
Similar idea is processing only matched rows by mask:
L = ['a', 'b', 'c', 'd']
m = df['idx'] >=1
df.loc[m,'idx'] = df.loc[m,'idx'].map(dict(enumerate(L)))
print (df)
id idx
0 A 0
1 B 0
2 C c
3 D b
Create a dictionary for items where the index is greater than 0, then use the mapping with replace to get your output :
mapping = dict((key,val) for key,val in enumerate(l) if key > 0)
print(mapping)
{1: 'b', 2: 'c', 3: 'd'}
df.replace(mapping)
id idx
0 A 0
1 B 0
2 C c
3 D b
Note : I changed the list variable name to l
I have a list:
my_list = ['a', 'b']
and a pandas dataframe:
d = {'a': [1, 2], 'b': [3, 4], 'c': [1, 2], 'd': [3, 4]}
df = pd.DataFrame(data=d)
What can I do to remove the columns in df based on the list my_list, in this case removing columns a and b?
This is very simple:
df = df.drop(columns=my_list)
drop removes columns by specifying a list of column names
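If some of the names in my_list might not exist in the frame, drop raises a KeyError by default; its errors='ignore' parameter skips the missing ones. A small sketch:

```python
import pandas as pd

my_list = ['a', 'b', 'z']          # 'z' is not a column of df
d = {'a': [1, 2], 'b': [3, 4], 'c': [1, 2], 'd': [3, 4]}
df = pd.DataFrame(data=d)

# Missing labels are silently skipped instead of raising KeyError
df = df.drop(columns=my_list, errors='ignore')
```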
This is a concise one-liner using a list comprehension: [df.pop(x) for x in my_list]
my_list = ['a', 'b']
d = {'a': [1, 2], 'b': [3, 4], 'c': [1, 2], 'd': [3, 4]}
df = pd.DataFrame(data=d)
print(df.to_markdown())
| | a | b | c | d |
|---:|----:|----:|----:|----:|
| 0 | 1 | 3 | 1 | 3 |
| 1 | 2 | 4 | 2 | 4 |
[df.pop(x) for x in my_list]
print(df.to_markdown())
| | c | d |
|---:|----:|----:|
| 0 | 1 | 3 |
| 1 | 2 | 4 |
You can select required columns as well:
cols_of_interest = ['c', 'd']
df = df[cols_of_interest]
If you have a range of columns to drop, for example positions 2 to 8, you can use:
df.drop(df.iloc[:, 2:8].columns, axis=1)
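Indexing df.columns by position directly is an equivalent, arguably clearer spelling of the same positional drop:

```python
import pandas as pd

# Ten dummy columns 'a'..'j' to illustrate dropping positions 2..7
df = pd.DataFrame({c: [0] for c in 'abcdefghij'})

out = df.drop(columns=df.columns[2:8])
```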