I have two dataframes, and I want to flag rows in the second one whose pattern is found in the first. A is very large (tens of thousands of rows):
date | items
20100605 | apple is red
20110606 | orange is orange
20120607 | apple is green
B is shorter, with a few hundred rows:
id | color
123 | is Red
234 | not orange
235 | is green
The result should flag each row in B whose pattern is found somewhere in A, ideally by adding a column to B like:
B:
id | color | found
123 | is Red | true
234 | not orange | false
235 | is green | true
I was thinking of something like dfB['found'] = dfB['color'].isin(dfA['items']), but I don't see any way to ignore case. Also, this approach would overwrite values that are already True, and I don't want to change those. Finally, I believe it's inefficient to loop over large dataframes more than once; running through A once and marking B would be better, but I'm not sure how to achieve that using isin(). Are there other ways, especially ones that ignore the pattern's case?
You can use something like this:
# df is A, df2 is B; casefold() makes the substring test case-insensitive
df2['check'] = df2['color'].apply(lambda x: any(x.casefold() in i.casefold() for i in df['items']))
or you can use str.contains:
# build a case-insensitive alternation from the second and third words of each item
df2['check'] = df2['color'].str.contains('|'.join(df['items'].str.split(' ').str[1] + ' ' + df['items'].str.split(' ').str[2]), case=False)
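The question also asks not to overwrite rows that are already flagged True. One way is to OR the new matches into the existing column instead of assigning over it; a minimal sketch, assuming dfA and dfB as named in the question and an existing boolean dfB['found']:

matches = dfB['color'].apply(lambda x: any(x.casefold() in i.casefold() for i in dfA['items']))
dfB['found'] = dfB['found'] | matches  # OR-ing keeps rows that were already True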
I have a pandas data frame of news titles; some rows contain non-English words, like this one:
**She’s the Hollywood Power Behind Those ...**
I want to remove all rows like this one, i.e. every row that contains at least one non-English character.
If using Python >= 3.7 (where str.isascii() was added):
df[df['col'].map(lambda x: x.isascii())]
where col is your target column.
Data:
import pandas as pd

df = pd.DataFrame({
'colA': ['**She’s the Hollywood Power Behind Those ...**',
'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})
print(df.to_markdown())
| | colA |
|---:|:------------------------------------------------------|
| 0 | **She’s the Hollywood Power Behind Those ...** |
| 1 | Hello, world! |
| 2 | Cainã |
| 3 | another value |
| 4 | test123* |
| 5 | âbc |
Identifying and filtering strings with non-English characters (see the list of ASCII printable characters):
df[df.colA.map(lambda x: x.isascii())]
Output:
colA
1 Hello, world!
3 another value
4 test123*
Original approach was to use a user-defined function like this:
def is_ascii(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
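The same filter can then be applied with map, e.g.:

df[df.colA.map(is_ascii)]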
You can use a regular expression to do that. The re module is part of the standard library, so no installation is needed:
import re
and use [^a-zA-Z] to match unwanted characters.
To break the pattern down:
^ (inside the brackets): negation, i.e. "not any of the following"
a-z: lowercase letters
A-Z: capital letters
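A minimal sketch of the filtering step. It uses a character class for anything outside the ASCII range rather than non-letters, so that spaces and punctuation survive; the column name colA follows the example above:

df[~df['colA'].str.contains(r'[^\x00-\x7F]')]  # keep only rows whose text is pure ASCII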
Is there a way to add a prefix when filling NAs with ffill in pandas? I have a dataframe containing taxonomic information, like so:
| Kingdom | Phylum | Class | Order | Family | Genus |
| Bacteria | Firmicutes | Bacilli | Lactobacillales | Lactobacillaceae | Lactobacillus |
| Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | | |
| Bacteria | Bacteroidetes | | | | |
Since not all of the taxa in my dataframe can be classified fully, I have some empty cells. Replacing the spaces with NA and using ffill I can fill these with the last valid string in each row but I would like to add a string to these (for example "Unknown_Bacteroidales") so I can identify which ones were carried forward.
So far I tried this taxa_formatted = "unknown_" + taxonomy.fillna(method='ffill', axis=1) but this of course adds the "unknown_" prefix to everything in the dataframe.
You can do this using boolean masking with df.isna:
import numpy as np

df = df.replace("", np.nan)  # skip this step if NaNs are already present
d = df.ffill()               # note: the default fills down each column (axis=0)
d[df.isna()] += "(Copy)"     # tag only the values that were filled in
d
Kingdom Phylum Class Order Family Genus
0 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
1 Bacteria Bacteroidetes Bacteroidia Bacteroidales Lactobacillaceae(Copy) Lactobacillus(Copy)
2 Bacteria Bacteroidetes Bacteroidia(Copy) Bacteroidales(Copy) Lactobacillaceae(Copy) Lactobacillus(Copy)
You can use df.add here.
d = df.ffill(axis=1)
df.add("unknown_" + d[df.isna()], fill_value='')
Kingdom Phylum Class Order Family Genus
0 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
1 Bacteria Bacteroidetes Bacteroidia Bacteroidales unknown_Bacteroidales unknown_Bacteroidales
2 Bacteria Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes
You need to use mask and update:
# make true NaNs first, if needed:
# df = df.replace('', np.nan)
s = df.isnull()                      # remember where the original gaps were
df = df.ffill(axis=1)
df.update('unknown_' + df.mask(~s))  # prefix only the filled-in cells
print(df)
Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
0 Bacteria Bacteroidetes Bacteroidia Bacteroidales unknown_Bacteroidales unknown_Bacteroidales
1 Bacteria Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes unknown_Bacteroidetes
I want to merge two pandas dataframes:
df 1
City | Attraction | X | Z | Y
Somewhere Rainbows 1 2 3
Somewhere Trees 4 4 4
Somewhere Unicorns
df 2
City | Other Column | Also another column
Somewhere Something Something else
Normally this would be done so:
df2.merge(df1[['City', 'Attraction']], left_on='City', right_on='City', how='left')
City | Other Column | Also another column | Attraction
Somewhere Something Something else Rainbows
Somewhere Something Something else Trees
Somewhere Something Something else Unicorns
However, I would like to group the results of the join into a comma-separated list (or whatever):
City | Other Column | Also another column | Attraction
Somewhere Something Something else Rainbows, Trees, Unicorns
Use groupby() and map:
df2['Attraction'] = df2['City'].map(df1.groupby('City').Attraction.agg(', '.join))
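To see it end to end, a minimal reproduction of the frames above (contents assumed from the question):

import pandas as pd

df1 = pd.DataFrame({'City': ['Somewhere'] * 3,
                    'Attraction': ['Rainbows', 'Trees', 'Unicorns']})
df2 = pd.DataFrame({'City': ['Somewhere'],
                    'Other Column': ['Something'],
                    'Also another column': ['Something else']})

# aggregate each city's attractions into one comma-separated string,
# then map that onto df2 by city
df2['Attraction'] = df2['City'].map(df1.groupby('City')['Attraction'].agg(', '.join))
print(df2)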
I have a column of sites: ['Canada', 'USA', 'China' ....]
Each site occurs many times in the SITE column and next to each instance is a true or false value.
INDEX | VALUE | SITE
0 | True | Canada
1 | False | Canada
2 | True | USA
3 | True | USA
And it goes on.
Goal 1: I want to find, for each site, what percent of the VALUE column is True.
Goal 2: I want to return a list of sites where % True in the VALUE column is greater than 10%.
How do I use groupby to achieve this? I only know how to use groupby to find the mean for each site, which won't help me here.
Something like this; since VALUE is boolean, the mean per group is exactly the fraction of True values:
In [13]: g = df.groupby('SITE')['VALUE'].mean()
In [14]: g[g > 0.1]
Out[14]:
SITE
Canada 0.5
USA 1.0
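For goal 2, if you want a plain Python list of the qualifying sites rather than a Series, take the filtered index; a small follow-up using g from above:

sites = g[g > 0.1].index.tolist()  # ['Canada', 'USA']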
I have a dataframe table:
Test Results | Make
P | BMW
F | VW
F | VW
P | VW
P | VW
P | VW
And I want to group by both make and test result to output a count something like this, including both original columns:
Test Results | Make | count
P | BMW | 1
F | VW | 2
P | VW | 3
I am currently doing:
pass_rates = df.groupby(['Test Results','Make']).size()
but it returns the group keys as a MultiIndex on a Series, when I need them to stay as regular columns in the original structure
You can add reset_index with parameter name:
name : object, default None
The name of the column corresponding to the Series values
pass_rates = df.groupby(['Test Results','Make']).size().reset_index(name='count')
print(pass_rates)
Test Results Make count
0 F VW 2
1 P BMW 1
2 P VW 3
If you want to disable sorting, add the parameter sort=False to groupby:
sort : boolean, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
pass_rates = df.groupby(['Test Results','Make'], sort=False).size().reset_index(name='count')
print(pass_rates)
Test Results Make count
0 P BMW 1
1 F VW 2
2 P VW 3
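On newer pandas (1.1+), DataFrame.value_counts offers a similar shorthand; note that it sorts by count descending by default, so the row order differs from the groupby version:

pass_rates = df.value_counts(subset=['Test Results', 'Make']).reset_index(name='count')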