This question already has answers here:
Remove non-ASCII characters from pandas column
(8 answers)
Closed 1 year ago.
In my DataFrame there are values like الÙجيرة in different columns. How can I remove such values? I am reading the data from an Excel file, so if something could be done at read time, that would be great.
Also, I have some values like Battery ÁÁÁ that I want to become just Battery. How can I delete these non-English characters but keep the rest of the content?
You can use regex to remove designated characters from your strings:
import re
import pandas as pd
records = [{'name':'Foo الÙجيرة'}, {'name':'Battery ÁÁÁ'}]
df = pd.DataFrame.from_records(records)
# Allow alphanumeric characters and spaces (add additional characters as needed)
pattern = re.compile('[^A-Za-z0-9 ]+')

def clean_text(string):
    # sub() removes the unwanted characters; strip() drops any leftover edge spaces
    return pattern.sub('', string).strip()

# Apply to your df
df['clean_name'] = df['name'].apply(clean_text)
name clean_name
0 Foo الÙجيرة Foo
1 Battery ÁÁÁ Battery
For more solutions, you can read this SO Q: Python, remove all non-alphabet chars from string
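The question also mentions cleaning while reading from Excel. A minimal sketch of that, assuming a hypothetical file data.xlsx with a name column; the converters argument of pd.read_excel applies a function to each cell of a column as it is read:
import pandas as pd

def to_ascii(value):
    # Keep only ASCII characters; leave non-string cells (numbers, NaN) untouched
    if isinstance(value, str):
        return value.encode('ascii', errors='ignore').decode('ascii').strip()
    return value

# 'data.xlsx' and the column name 'name' are made up for this sketch
df = pd.read_excel('data.xlsx', converters={'name': to_ascii})

# Alternatively, read first and clean every string column afterwards:
# df = pd.read_excel('data.xlsx')
# for col in df.select_dtypes(include='object'):
#     df[col] = df[col].map(to_ascii)
Note this drops everything non-ASCII rather than only non-English letters, so accented characters like Á are removed as well.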
You can use the Python split method, or a lambda function:
df[column_name] = df[column_name].apply(lambda value: value[start:stop])
#df['location'] = df['location'].apply(lambda location: location[0:4])
Split method
# split('') would raise ValueError; splitting on a space keeps the first word
df[column_name] = df[column_name].apply(lambda value: value.split(' ')[0])
This question already has answers here:
How to merge with wildcard? - Pandas
(2 answers)
Closed 27 days ago.
I have two data frames, and one of them contains strings with an * (asterisk) wildcard. These need to be compared against full strings in the other frame, with the * matching any number of characters.
The same comparison works in Excel with VLOOKUP, using the formula below. I need the same done with Python/pandas; I tried pd.merge, but the * is treated as a literal string in Python instead of as a wildcard, so the merge does not work.
Left_dataframe   Right_dataframe         Formula in excel (working)   Output expected
Compare * data   Compare with the data   VLOOKUP(A2,B:B,1,0)          Compare with the data
It is difficult to propose a solution in the absence of a minimal reproducible example, but you might consider the following:
#import libraries
import pandas as pd
import re ##not used initially
#create dataframes
left_data = {'Left_dataframe': 'Compare * data'}
right_data = {'Right_dataframe': 'Compare with the data'}
df_left = pd.DataFrame(left_data, index=[0])
df_right = pd.DataFrame(right_data, index=[0])
#check df
df_left
df_right
#compare cell value | escape the * (the left column is the one holding the wildcard)
df_output_where = df_right.where(df_left.Left_dataframe.str.contains(r'\*'), df_left)
df_output_where
'''
[Disclaimer] The below might not be the most pythonic way.
This add-on checks for the '*' asterisk within the first df and, in addition,
checks whether the first 'word' in the two dfs matches.
If so, it returns the value in the second df.
'''
## pattern to use
str_pattern5 = r'(\w+) (.*) (\w+)'
## startswith first word in column (returned by re.match())
## assist from https://stackoverflow.com/a/56073916/20107918
## assist from https://stackoverflow.com/a/61713961/20107918
compare_left = df_left.Left_dataframe.apply(lambda x: x.startswith(re.match(str_pattern5, df_left.Left_dataframe.values[0]).group(1)))
compare_right = df_right.Right_dataframe.apply(lambda x: x.startswith(re.match(str_pattern5, df_right.Right_dataframe.values[0]).group(1)))
## return value from the right dataframe where
## the right df, compare_right == compare_left
df_output_compare02 = df_right.Right_dataframe if ((df_left.Left_dataframe.str.contains(r'\*')).all() and (compare_right == compare_left).all()) else None
df_output_compare02
#### NB:
#df_output_where_compare = df_right.where(((df_left.Left_dataframe.str.contains(r'\*')) and (compare_right == compare_left))) ##truth value
#df_output_where_compare = df_right.Right_dataframe.where((df_left.Left_dataframe.str.contains(r'\*')).all() and (compare_right == compare_left).all(), df_left) #ValueError: Array conditional must be same shape as self
#df_output_where_compare
PS: Below does not return the desired output
# perform pd.merge
df_output = df_left.merge(df_right, how='outer', left_on='Left_dataframe', right_on='Right_dataframe')
df_output
Further consideration: if performing a check on * as a pattern, do it as regex. You may import re for better regex handling; see the sketch below.
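This minimal sketch (my own addition, not part of the answer above; the column names are taken from the question's table) converts the Excel-style wildcard into a regex with fnmatch.translate and tests the right-hand values against it:
from fnmatch import translate
import pandas as pd

df_left = pd.DataFrame({'Left_dataframe': ['Compare * data']})
df_right = pd.DataFrame({'Right_dataframe': ['Compare with the data']})

# fnmatch.translate turns the wildcard into an anchored regex,
# e.g. 'Compare * data' -> '(?s:Compare\ .*\ data)\Z'
regex = translate(df_left['Left_dataframe'].iloc[0])

# Keep the right-hand rows that fully match the wildcard pattern
print(df_right[df_right['Right_dataframe'].str.match(regex)])
#          Right_dataframe
# 0  Compare with the data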
I have a small DataFrame with entries regarding motorsport balance of performance.
I am trying to get rid of the string after "#".
This is working fine with the code:
for col in df_engine.columns[1:]:
df_engine[col] = df_engine[col].str.rstrip(r"[\ \# \d.[0-9]+]")
but it is leaving the last column unchanged, and I do not understand why.
The Ferrari column also has a NaN entry as last position, just as additional info.
Can anyone provide some help?
Thank you in advance!
rstrip does not work with regex. As per the documentation,
to_strip : str or None, default None
    Specifying the set of characters to be removed. All combinations of this set of characters will be stripped. If None then whitespaces are removed.
>>> "1.76 # 0.88".rstrip("[\ \# \d.[0-9]+]")
'1.76 # 0.88'
>>> "1.76 # 0.88".rstrip("[\ \# \d.[0-8]+]") # It's not treated as regex, instead All combinations of characters(`[\ \# \d.[0-8]+]`) stripped
'1.76'
You could use the replace method instead.
for col in df.columns[1:]:
df[col] = df[col].str.replace(r"\s#\s[\d\.]+$", "", regex=True)
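As a quick check, a sketch with made-up data (the actual df_engine was never posted), including a NaN like the Ferrari column mentioned in the question; str.replace simply passes NaN through:
import numpy as np
import pandas as pd

# stand-in for the question's balance-of-performance table
df = pd.DataFrame({
    'Car': ['Ferrari', 'Porsche'],
    'Power': ['1.76 # 0.88', '1.81 # 0.95'],
    'Weight': [np.nan, '1030 # 15'],
})

for col in df.columns[1:]:
    df[col] = df[col].str.replace(r"\s#\s[\d\.]+$", "", regex=True)

print(df)
#        Car Power Weight
# 0  Ferrari  1.76    NaN
# 1  Porsche  1.81   1030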
What about str.split()?
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html#pandas.Series.str.split
The function splits a Series into DataFrame columns (when expand=True) using the separator provided.
The following example splits the Series df_engine[col] and produces a DataFrame. The first column of the new DataFrame contains the values preceding the first separator character '#' found in each value:
df_engine[col].str.split('#', expand=True)[0]
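A runnable sketch of this approach, again with made-up values since the original data was not shared (the trailing space left by the split is stripped afterwards):
import pandas as pd

s = pd.Series(['1.76 # 0.88', '1.81 # 0.95'])

# expand=True returns a DataFrame; column 0 holds the text before the first '#'
cleaned = s.str.split('#', expand=True)[0].str.strip()
print(cleaned)
# 0    1.76
# 1    1.81
# dtype: object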
This question already has answers here:
Faster method of extracting characters for multiple columns in dataframe
(2 answers)
How to extract part of a string in Pandas column and make a new column
(3 answers)
Reference - What does this regex mean?
(1 answer)
Closed 2 months ago.
I wish to keep everything before the hyphen in one column, and keep everything before the colon in another column using Pandas.
Data
ID Type Stat
AA - type2 AAB:AB33:77:000 Y
CC - type3 CCC:AB33:77:000 N
Desired
ID Type
AA AAB
CC CCC
Doing
separator = '-'
result_1 = my_str.split(separator, 1)[0]
Any suggestion is appreciated.
We can try using str.extract here:
df["ID"] = df["ID"].str.extract(r'(\w+)')
df["Type"] = df["Type"].str.extract(r'(\w+)')
I would say
func1 = lambda d: d['ID'].str.split(' - ').str[0]
func2 = lambda d: d['Type'].str.split(':').str[0]
data\
    .assign(ID=func1)\
    .assign(Type=func2)
References
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
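For completeness, a runnable sketch of the above with the sample rows from the question (the Stat column is left out here):
import pandas as pd

data = pd.DataFrame({
    'ID': ['AA - type2', 'CC - type3'],
    'Type': ['AAB:AB33:77:000', 'CCC:AB33:77:000'],
})

func1 = lambda d: d['ID'].str.split(' - ').str[0]
func2 = lambda d: d['Type'].str.split(':').str[0]

print(data.assign(ID=func1).assign(Type=func2))
#    ID Type
# 0  AA  AAB
# 1  CC  CCC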
I have a data frame with some text read in from a txt file the column names are FEATURE and SENTENCES.
Within the FEATURE col there is some text that starts with '[NA]', e.g. '[NA] not a feature'.
How can I remove those rows from my data frame?
So far I have tried:
df[~df.FEATURE.str.contains("[NA]")]
But this did nothing, no errors either.
I also tried:
df.drop(df['FEATURE'].str.startswith('[NA]'))
Again, there were no errors, but this didn't work.
Let's suppose you have the DataFrame below:
>>> df
FEATURE
0 this
1 is
2 string
3 [NA]
Then the below should simply suffice:
>>> df[~df['FEATURE'].str.startswith('[NA]')]
FEATURE
0 this
1 is
2 string
Another way, in case the data needs to be converted to string before operating on it:
df[~df['FEATURE'].astype(str).str.startswith('[NA]')]
OR using str.contains (note regex=False so '[NA]' is matched literally rather than as a character class):
>>> df[df.FEATURE.str.contains('[NA]', regex=False) == False]
# df[df['FEATURE'].str.contains('[NA]', regex=False) == False]
FEATURE
0 this
1 is
2 string
OR
df[df.FEATURE.str[0].ne('[')]
IIUC, use regex=False so the string is not parsed as a regex:
df[~df.FEATURE.str.contains("[NA]", regex=False)]
Or escape special regex chars []:
df[~df.FEATURE.str.contains("\[NA\]")]
Another possible problem is leading whitespace; then use:
df[~df['FEATURE'].str.strip().str.startswith('[NA]')]
df['data'].str.startswith('[NA]') or df['data'].str.contains('[NA]') will both return a boolean (True/False) list. drop doesn't work with booleans, and in this case it is easiest to use loc.
Here is one solution with some example data. Note that I add '==False' to get all the rows that DON'T have [NA] (and regex=False so the brackets are matched literally):
df = pd.DataFrame(['feature','feature2', 'feature3', '[NA] not feature', '[NA] not feature2'], columns=['data'])
mask = df['data'].str.contains('[NA]', regex=False)==False
df.loc[mask]
The simple code below should work:
df = df[~df['FEATURE'].str.startswith('[NA]')]
I have a column of integers (sample row: 123456789), and some of the values are interspersed with junk alphabetic characters, e.g. 1234y5678. I want to delete the letters appearing in such cells and retain the numbers. How do I go about it using Pandas?
Assume my dataframe is df and the column name is mobile.
Should I use np.where with conditions such as df[df['mobile'].str.contains('a-z')] and use string replace?
If your junk characters are not limited to letters, you should use this:
yourSeries.str.replace('[^0-9]', '', regex=True)
Use pd.Series.str.replace:
import pandas as pd
s = pd.Series(['125109a181', '1361q1j1', '85198m4'])
s.str.replace('[a-zA-Z]', '', regex=True).astype(int)
Output:
0 125109181
1 136111
2 851984
Use the regex character class \D (not a digit):
df['mobile'] = df['mobile'].str.replace(r'\D', '', regex=True).astype('int64')
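A quick usage sketch; note that .astype('int64') assumes every cell still holds at least one digit and that the column has no NaN, otherwise the cast will fail:
import pandas as pd

df = pd.DataFrame({'mobile': ['123456789', '1234y5678', '98x76z54']})
df['mobile'] = df['mobile'].str.replace(r'\D', '', regex=True).astype('int64')
print(df)
#       mobile
# 0  123456789
# 1   12345678
# 2     987654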