I have a pandas data frame whose rows contain news titles. Most titles are in English, but some rows contain non-English words, like this one:
**She’s the Hollywood Power Behind Those ...**
I want to remove all rows like it, i.e. every row that contains at least one non-English character.
If using Python >= 3.7:
df[df['col'].map(lambda x: x.isascii())]
where col is your target column.
Data:
df = pd.DataFrame({
'colA': ['**She’s the Hollywood Power Behind Those ...**',
'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})
print(df.to_markdown())
| | colA |
|---:|:------------------------------------------------------|
| 0 | **She’s the Hollywood Power Behind Those ...** |
| 1 | Hello, world! |
| 2 | Cainã |
| 3 | another value |
| 4 | test123* |
| 5 | âbc |
Identifying and filtering out strings with non-English characters (i.e. anything outside the ASCII printable range):
df[df.colA.map(lambda x: x.isascii())]
Output:
colA
1 Hello, world!
3 another value
4 test123*
The original approach was a user-defined function like this:
def is_ascii(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
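Applied to the frame above, it gives the same result (a usage sketch):
df[df['colA'].map(is_ascii)]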
You can also use a regular expression. No installation is needed here: re is part of Python's standard library (pip install regex is only for the third-party regex package, which this answer doesn't use).
import re
Use the character class [^a-zA-Z] to match anything that is not an English letter. To break it down:
^: negation ("not"), when it is the first character inside the brackets
a-z: lowercase letters
A-Z: uppercase letters
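A minimal sketch of the regex route on the frame above; note it tests for characters outside the ASCII range (mirroring isascii()) rather than using [^a-zA-Z], since a letters-only class would also reject the spaces and punctuation in ordinary English titles:
import re

non_ascii = re.compile(r'[^\x00-\x7F]')  # any character outside ASCII
df[df['colA'].map(lambda x: non_ascii.search(x) is None)]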
I have two dataframes, and I want to flag rows in the second one when a matching pattern is found in the first. A has a very large number of rows (tens of thousands):
A:
date | items
20100605 | apple is red
20110606 | orange is orange
20120607 | apple is green
B is shorter, with a few hundred rows:
B:
id | color
123 | is Red
234 | not orange
235 | is green
The result would be to flag each row in B if its pattern is found in A, possibly by adding a column to B like:
B:
id | color | found
123 | is Red | true
234 | not orange | false
235 | is green | true
I was thinking of something like dfB['found'] = dfB['color'].isin(dfA['items']), but I don't see any way to ignore case. Also, that approach would overwrite the column, changing True back to False; I don't want to change rows that are already set to True. Finally, I believe it's inefficient to loop over large dataframes more than once: running through A once and marking B would be a better way, but I'm not sure how to achieve that using isin(). Any other ways, especially ones that ignore the pattern's case?
You can use something like this (any() already returns a boolean, so wrapping it in True if ... else False is unnecessary):
dfB['check'] = dfB['color'].apply(lambda x: any(x.casefold() in i.casefold() for i in dfA['items']))
or you can use str.contains:
# build the patterns from the second and third words of each item in A
dfB['check'] = dfB['color'].str.contains('|'.join(dfA['items'].str.split(" ").str[1] + ' ' + dfA['items'].str.split(" ").str[2]), case=False)
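To address the "don't flip rows already set to True" concern, compute the fresh matches once and OR them into the existing column; a sketch assuming dfB may already carry a boolean found column:
items_cf = [i.casefold() for i in dfA['items']]  # scan A only once
new_hits = dfB['color'].map(lambda c: any(c.casefold() in i for i in items_cf))
if 'found' in dfB.columns:
    dfB['found'] = dfB['found'] | new_hits  # never downgrades an existing True
else:
    dfB['found'] = new_hits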
I have a data frame with a column location which contains a lot of text and then an actual location code. I'm trying to extract the location code, and I figured out that if I split the string on spaces I can just grab the last item of each list. For example, in the df below the column location is the original column, and after applying my split I now have location_split.
| location | location_split
0 | Town (123) | ['Town', '(123)']
1 | Town Town (123AB) | ['Town', 'Town', '(123AB)']
2 | Town (40832) (123BC) | ['Town', '(40832)', '(123BC)']
3 | Town (987) | ['Town', '(987)']
But how do I pull out the last item of each list and make that the column's value? Something like df['location'] = df['location_split'][-1], ending up with the location column below. I did attempt regex, but since some rows have multiple parentheses containing numbers it couldn't differentiate; splitting and then grabbing the last item of the list seems the most foolproof.
| location
0 | (123)
1 | (123AB)
2 | (123BC)
3 | (987)
You can use the .str accessor:
df['location'] = df['location_split'].str[-1]
# or
df['location'] = df['location_split'].str.get(-1)
You can also use regex; anchoring the pattern to the end of the string picks out the last parenthesized group even when a row contains more than one:
df['location'] = df['location'].astype(str).str.extract(r'(\([^)]*\))$')
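A quick end-to-end check of both answers on the sample rows (a sketch; names as above):
import pandas as pd

df = pd.DataFrame({'location': ['Town (123)', 'Town Town (123AB)',
                                'Town (40832) (123BC)', 'Town (987)']})
df['location_split'] = df['location'].str.split()
df['from_list'] = df['location_split'].str[-1]
df['from_regex'] = df['location'].str.extract(r'(\([^)]*\))$')
# both new columns come out as (123), (123AB), (123BC), (987)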
I have a dataframe df with ~450000 rows and 4 columns; one of them, "HK", looks like this example:
df = pd.DataFrame(
{
"HK": [
"19000000-ac-;ghj-;qrs",
"19000000- abcd-",
"19000000 -abc;klm-",
"19000000 - abc-;",
"19000000 a-",
]
}
)
df.head()
| HK
| -------------
| 19000000-ac-;ghj-;qrs
| 19000000- abcd-
| 19000000 -abc;klm-
| 19000000 - abc-;
| 19000000 a-
I always have 8 digits followed by a value. The digits and the value are separated by different forms of "-" (no whitespace between digits and value, whitespace on the left, whitespace on the right, whitespace on both sides, or only a whitespace without any "-").
I would like to get a unified presentation, "$digits$ - $value$", so that my column looks like this:
| HK
| -------------
| 19000000 - ac-;ghj-;qrs
| 19000000 - abcd-
| 19000000 - abc;klm-
| 19000000 - abc-;
| 19000000 - a-
Using pd.Series.str.replace with a regular expression:
>>> df['HK'].str.replace(r'(?<=\d{8})[\s-]+(?=\w)', ' - ', regex=True)
0 19000000 - ac-;ghj-;qrs
1 19000000 - abcd-
2 19000000 - abc;klm-
3 19000000 - abc-;
4 19000000 - a-
Name: HK, dtype: object
Explaining the regular expression: there is a lookbehind (?<=\d{8}) requiring that eight digits appear immediately before the main section. The main section is [\s-]+, which matches one or more characters that are whitespace or hyphens. Then there is a lookahead (?=\w) requiring that a word character (in this case, something like a) immediately follows.
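To persist the change, assign the result back to the column (a one-line usage note):
df['HK'] = df['HK'].str.replace(r'(?<=\d{8})[\s-]+(?=\w)', ' - ', regex=True)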
Datatable is popular for R, but it also has a Python version. However, I don't see anything in the docs about applying a user-defined function over a datatable.
Here's a toy example (in pandas) where a user function is applied over a dataframe to look for po-box addresses:
df = pd.DataFrame({'customer':[101, 102, 103],
'address':['12 main st', '32 8th st, 7th fl', 'po box 123']})
customer | address
----------------------------
101 | 12 main st
102 | 32 8th st, 7th fl
103 | po box 123
# User-defined function:
import re

def is_pobox(s):
    rslt = re.search(r'^p(ost)?\.? *o(ffice)?\.? *box *\d+', s)
    if rslt:
        return True
    else:
        return False
# Using .apply() for this example (df['address'].apply(is_pobox) is equivalent and simpler):
df['is_pobox'] = df.apply(lambda x: is_pobox(x['address']), axis=1)
# Expected Output:
customer | address | is_pobox
----------------------------|----------
101 | 12 main st | False
102 | 32 8th st, 7th fl| False
103 | po box 123 | True
Is there a way to do this .apply operation in datatable? Would be nice, because datatable seems to be quite a bit faster than pandas for most operations.
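Py-datatable has no direct equivalent of .apply(). One common workaround (a sketch, not a native datatable API) is to pull the column out as a Python list, run the function over it, and attach the result as a new column:
import datatable as dt

# build the same frame in datatable (values taken from the example above)
DT = dt.Frame(customer=[101, 102, 103],
              address=['12 main st', '32 8th st, 7th fl', 'po box 123'])
# to_list() yields one Python list per column; [0] takes the address values
flags = [is_pobox(a) for a in DT['address'].to_list()[0]]
DT['is_pobox'] = dt.Frame(flags)
This round-trips through Python objects, so it forfeits datatable's speed for that step; for regex tests specifically, check whether your datatable version provides dt.re.match(), which runs natively.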
I want to merge two pandas dataframes:
df 1
City | Attraction | X | Z | Y
Somewhere Rainbows 1 2 3
Somewhere Trees 4 4 4
Somewhere Unicorns
df 2
City | Other Column | Also another column
Somewhere Something Something else
Normally this would be done like so:
df2.merge(df1[['City', 'Attraction']], left_on='City', right_on='City', how='left')
City | Other Column | Also another column | Attraction
Somewhere Something Something else Rainbows
Somewhere Something Something else Trees
Somewhere Something Something else Unicorns
However, I would like to group the results of the join into a comma separated list (or whatever):
City | Other Column | Also another column | Attraction
Somewhere Something Something else Rainbows, Trees, Unicorns
groupby() and map():
df2['Attraction'] = df2['City'].map(df1.groupby('City').Attraction.agg(', '.join))
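A minimal end-to-end sketch with the frames above (only the relevant columns reproduced; values assumed from the sample tables):
import pandas as pd

df1 = pd.DataFrame({'City': ['Somewhere'] * 3,
                    'Attraction': ['Rainbows', 'Trees', 'Unicorns']})
df2 = pd.DataFrame({'City': ['Somewhere'],
                    'Other Column': ['Something'],
                    'Also another column': ['Something else']})

# collapse df1 to one comma-separated string per city, then map onto df2
df2['Attraction'] = df2['City'].map(df1.groupby('City')['Attraction'].agg(', '.join))
print(df2)  # Attraction -> 'Rainbows, Trees, Unicorns'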