I have a dataframe where one of the columns, EPI_ID, contains special characters (like a square). I want to remove all rows that contain this special character. It isn't a standard character, and the similar issues I have found deal with plain strings rather than a dataframe. Nevertheless, I am having trouble identifying these rows. Any suggestions?
df
EPI_ID stuff
2342F randoM_words
FER43 predictive_words
u'\u25A1' blank
My attempt:
df[~df['EPI_ID'].apply(lambda x: x.encode('ascii') == True)]
My results are throwing False for every row.
Expected output:
EPI_ID stuff
2342F randoM_words
FER43 predictive_words
Edit: the square doesn't come up in the mock df above, but this is what it looks like: □ (u'\u25A1').
Assuming your DataFrame looks something like this:
>>> df = pd.DataFrame({'EPI_ID': ['2343F', 'FER43', 'DF' + u'\u25A1' + '123', 'PQRX74'], 'STUFF': ['abc', 'def', 'ghi', 'jkl']})
>>> df
EPI_ID STUFF
0 2343F abc
1 FER43 def
2 DF□123 ghi
3 PQRX74 jkl
You can use str.contains, which handles regex:
df.loc[df['EPI_ID'].str.contains(r'[^\x00-\x7F]+') == False]
EPI_ID STUFF
0 2343F abc
1 FER43 def
3 PQRX74 jkl
Regex courtesy of this answer: (grep) Regex to match non-ASCII characters?
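A side note on the original attempt: x.encode('ascii') returns a bytes object when it succeeds, and a bytes object is never == True, which is why that comparison comes back False for every row. A minimal sketch of the same filter written with the ~ negation instead of == False (same regex as above, assuming every EPI_ID value is a string):
import pandas as pd
df = pd.DataFrame({'EPI_ID': ['2343F', 'FER43', 'DF' + u'\u25A1' + '123', 'PQRX74'],
                   'STUFF': ['abc', 'def', 'ghi', 'jkl']})
# keep only the rows whose EPI_ID is pure ASCII
clean = df[~df['EPI_ID'].str.contains(r'[^\x00-\x7F]', na=False)]
print(clean)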
Related
I would like to know how to write an expression that identifies/displays records of string/object data type in a Pandas DataFrame that contain leading or trailing spaces.
The purpose for this is to get an audit on a Jupyter notebook of such records before applying any strip functions.
The goal is for the script to identify these records automatically without having to type the column names manually. The scope should be any column of str/object data type that contains a value with leading spaces, trailing spaces, or both.
Please note: I would like to see the resulting output in DataFrame format.
Thank you!
Link to sample dataframe data
You can use:
df['col'].str.startswith(' ')
df['col'].str.endswith(' ')
or with a regex:
df['col'].str.match(r'\s+')
df['col'].str.contains(r'\s+$')
Example:
df = pd.DataFrame({'col': [' abc', 'def', 'ghi ', ' jkl ']})
df['start'] = df['col'].str.startswith(' ')
df['end'] = df['col'].str.endswith(' ')
df['either'] = df['start'] | df['end']
col start end either
0 abc True False True
1 def False False False
2 ghi False True True
3 jkl True True True
However, this is likely not faster than directly stripping the spaces:
df['col'] = df['col'].str.strip()
col
0 abc
1 def
2 ghi
3 jkl
Updated answer
To detect the columns with leading/trailing spaces, you can use:
cols = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any()
cols[cols].index
Example on the data from the provided link:
Index(['First Name', 'Team'], dtype='object')
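If you also want the offending records themselves in DataFrame format (as asked), a minimal sketch along the same lines, assuming df is the frame loaded from the linked sample:
obj_cols = df.select_dtypes(include='object').columns
mask = df[obj_cols].apply(lambda c: c.str.contains(r'^\s+|\s+$', na=False)).any(axis=1)
df[mask]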
I have used re.search to get strings of uniqueID from larger strings.
ex:
import re
string= 'example string with this uniqueID: 300-350'
combination = '(\d+)[-](\d+)'
m = re.search(combination, string)
print (m.group(0))
Out: '300-350'
I have created a dataframe with the UniqueID and the Combination as columns.
uniqueID combinations
0 300-350 (\d+)[-](\d+)
1 off-250 (\w+)[-](\d+)
2 on-stab (\w+)[-](\w+)
And a dictionary meaning_combination relating the combination with the variable meaning it represents:
meaning_combination={'(\\d+)[-](\\d+)': 'A-B',
'(\\w+)[-](\\d+)': 'C-A',
'(\\w+)[-](\\w+)': 'C-D'}
I want to create new columns for each variable (A, B, C, D) and fill them with their corresponding values.
the final result should look like this:
uniqueID combinations A B C D
0 300-350 (\d+)[-](\d+) 300 350
1 off-250 (\w+)[-](\d+) 250 off
2 on-stab (\w+)[-](\w+) stab on
I would fix your regexes to:
meaning_combination={'(\d+-\d+)': 'A-B',
'([^0-9\W]+\-\d+)': 'C-A',
'([^0-9\W]+\-[^0-9\W]+)': 'C-D'}
This way the full match is captured as a single group instead of being spread across three captured pieces.
I.e. (300-350, 300, 350) --> (300-350)
You don't need the two extra capturing groups: once a specific pattern is satisfied, you know where the word or digit characters sit (based on how you defined the pattern), and you can split on - to access them individually.
I.e.:
s = 'example string with this uniqueID: 300-350'
values = re.findall(r'(\d+-\d+)', s)
>>>['300-350']
# first group of digits:
values[0].split('-')[0]
>>>'300'
With this approach you can loop over the dictionary keys and the list of strings and test whether each pattern is satisfied in the string (len(re.findall(pattern, string)) != 0). When it is, grab the corresponding dictionary value for the key, split both it and the match on '-', and build a new dictionary inside the loop with dictionary_value.split('-')[0] : match[0].split('-')[0] and dictionary_value.split('-')[1] : match[0].split('-')[1], plus the full match as the unique ID and the matched pattern as the combination. Then use pandas to make a DataFrame.
Altogether:
import re
import pandas as pd
stri= ['example string with this uniqueID: 300-350', 'example string with this uniqueID: off-250', 'example string with this uniqueID: on-stab']
meaning_combination={'(\d+-\d+)': 'A-B',
'([^0-9\W]+\-\d+)': 'C-A',
'([^0-9\W]+\-[^0-9\W]+)': 'C-D'}
values = [{'Unique ID': re.findall(x, st)[0],
           'Combination': x,
           y.split('-')[0]: re.findall(x, st)[0].split('-')[0],
           y.split('-')[1]: re.findall(x, st)[0].split('-')[1]}
          for st in stri
          for x, y in meaning_combination.items()
          if len(re.findall(x, st)) != 0]
df = pd.DataFrame.from_dict(values)
#just to sort it in order since default is alphabetical
col_val = ['Unique ID', 'Combination', 'A', 'B', 'C', 'D']
df = df.reindex(sorted(df.columns, key=lambda x: col_val.index(x) ), axis=1)
print(df)
output:
Unique ID Combination A B C D
0 300-350 (\d+-\d+) 300 350 NaN NaN
1 off-250 ([^0-9\W]+\-\d+) 250 NaN off NaN
2 on-stab ([^0-9\W]+\-[^0-9\W]+) NaN NaN on stab
Also note: I think you have a typo in your expected output, because you have:
'(\\w+)[-](\\d+)': 'C-A'
which would match off-250, but in your final result you have:
uniqueID combinations A B C D
1 off-250 (\w+)[-](\d+) 250 off
when, based on your key, these values should fall under C and A.
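As a rough alternative sketch to the loop above, str.extract with named capture groups can build the A/B/C/D columns directly; the patterns below are illustrative rewrites of the ones in the dictionary, not the originals:
import pandas as pd
from functools import reduce
df = pd.DataFrame({'uniqueID': ['300-350', 'off-250', 'on-stab']})
# illustrative named-group versions of the three combination patterns
patterns = [r'(?P<A>\d+)-(?P<B>\d+)',
            r'(?P<C>[^0-9\W]+)-(?P<A>\d+)',
            r'(?P<C>[^0-9\W]+)-(?P<D>[^0-9\W]+)']
# extract with each pattern, then merge the pieces, preferring earlier matches
extracted = [df['uniqueID'].str.extract(p) for p in patterns]
result = df.join(reduce(lambda a, b: a.combine_first(b), extracted))
print(result)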
My data frame has a column which contains digits and words. Before the digits and words there are sometimes special characters like ">*".
The values are mostly divided by , or /. Based on the separators, I want to section the column into new columns and then delete the original.
Here is a reproduction of my dataframe together with my code:
d = {'error': [
'test,121',
'123',
'test,test',
'>errrI1GB,213',
'*errrI1GB,213',
'*errrI1GB/213',
'*>errrI1GB/213',
'>*errrI1GB,213',
'>test, test',
'>>test, test',
'>>:test,test',
]}
df = pd.DataFrame(data=d)
df['error'] = df['error'].str.replace(' ', '')
df[['error1', 'error2']] = df['error'].str.extract('.*?(\w*)[,|/](\w*)')
df
So far my approach is first to remove the whitespaces with
df['error'] = df['error'].str.replace(' ', '')
Then I constructed my regex with this help:
https://regex101.com/r/UHzTOq/13
.*?(\w*)[,|/](\w*)
Afterwards I delete the messy column with:
df.drop(columns =["error"], inplace = True)
Single values in a row (with no separator) are not considered, therefore I get NaN as a result. How do I include them in my regex?
Solution is:
df[['error1', 'error2']] = df['error'].str.extract(r'^[>*:]*(.*?)(?:[,|/](.*))?$')
Assuming that we'd like to place values that contain only a test or a 123 in the error1 column, we might just slightly modify your original expression:
^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$
I'm pretty sure there should be other easier ways though.
Test
import pandas as pd
d = {'error': [
'test,121',
'123',
'test',
'test,test',
'>errrI1GB,213',
'*errrI1GB,213',
'*errrI1GB/213',
'*>errrI1GB/213',
'>*errrI1GB,213',
'>test, test',
'>>test, test',
'>>:test,test',
]}
df = pd.DataFrame(data=d)
df['error1'] = df['error'].str.replace(r'(?mi)^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$', r'\1', regex=True)
df['error2'] = df['error'].str.replace(r'(?mi)^.*?(\w*)\s*(?:[,|/]\s*(\w*))?\s*$', r'\2', regex=True)
print(df)
The expression is explained in the top right panel on regex101.com, if you wish to explore/simplify/modify it, and in this link you can watch how it matches against some sample inputs.
Output
error error1 error2
0 test,121 test 121
1 123 123
2 test test
3 test,test test test
4 >errrI1GB,213 errrI1GB 213
5 *errrI1GB,213 errrI1GB 213
6 *errrI1GB/213 errrI1GB 213
7 *>errrI1GB/213 errrI1GB 213
8 >*errrI1GB,213 errrI1GB 213
9 >test, test test test
10 >>test, test test test
11 >>:test,test test test
RegEx Circuit: jex.im visualizes regular expressions.
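If you would rather stay with str.extract as in your original attempt, a minimal single-pass sketch (assuming the leading junk is limited to >, * and :):
pattern = r'^[>*:]*(\w*)\s*(?:[,/]\s*(\w*))?$'
df[['error1', 'error2']] = df['error'].str.extract(pattern)
print(df)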
I have a dataframe which has some duplicate tags separated by commas in the "Tags" column. Is there a way to remove the duplicate strings from the series? I want the output in row 400 to have just Museum, Drinking, Shopping.
I can't simply split on a comma and drop repeated words, because some tags in the series share words, for example [Museum, Art Museum, Shopping]: splitting and dropping every 'Museum' string would affect the distinct 'Art Museum' tag.
Desired Output
You can split on the comma and convert to a set(), which removes duplicates, after stripping leading/trailing white space with str.strip(). Then you can apply() this to your column.
df['Tags']=df['Tags'].apply(lambda x: ', '.join(set([y.strip() for y in x.split(',')])))
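A quick demo of that line, assuming a Tags column like the one in the question (note that a set does not preserve the original tag order):
import pandas as pd
df = pd.DataFrame({'Tags': ['Museum, Art Museum, Shopping, Museum', 'Drinking, Drinking, Shopping']})
df['Tags'] = df['Tags'].apply(lambda x: ', '.join(set([y.strip() for y in x.split(',')])))
print(df)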
You can create a function that removes duplicates from a given string. Then apply this function to your column Tags.
def remove_dup(strng):
    '''
    Split the string on ', ' and keep only the first occurrence of each tag,
    preserving the original order.
    '''
    return ', '.join(list(dict.fromkeys(strng.split(', '))))
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
DEMO:
import pandas as pd
my_dict = {'Tags':["Museum, Art Museum, Shopping, Museum",'Drink, Drink','Shop','Visit'],'Country':['USA','USA','USA', 'USA']}
df = pd.DataFrame(my_dict)
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
df
Output:
Tags Country
0 Museum, Art Museum, Shopping USA
1 Drink USA
2 Shop USA
3 Visit USA
Without a code example from you, I've thrown together something that should work.
import pandas as pd
test = [['Museum', 'Art Museum', 'Shopping', "Museum"]]
df = pd.DataFrame()
df[0] = test
df[0] = df[0].apply(set)
Out[35]:
0
0 {Museum, Shopping, Art Museum}
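If you want a comma-separated string back rather than a set, a possible follow-up (the order coming out of a set is arbitrary):
df[0] = df[0].apply(', '.join)
print(df)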
One approach that avoids apply
# in your code just s = df['Tags']
s = pd.Series(['','', 'Tour',
'Outdoors, Beach, Sports',
'Museum, Drinking, Drinking, Shopping'])
(s.str.split(',\s+', expand=True)
.stack()
.reset_index()
.drop_duplicates(['level_0',0])
.groupby('level_0')[0]
.agg(','.join)
)
Output:
level_0
0
1
2 Tour
3 Outdoors,Beach,Sports
4 Museum,Drinking,Shopping
Name: 0, dtype: object
There may be much fancier ways of doing this kind of thing, but this will do the job.
Make it lower-case:
data['tags'] = data['tags'].str.lower()
Split every row in the tags column by comma; this returns a list of strings:
data['tags'] = data['tags'].str.split(',')
Map str.strip onto every element of the list (removing leading and trailing spaces), then apply set to get the set of current words, removing duplicates:
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip , x)))
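Putting the steps together on a small example (the column name tags and the sample rows are just illustrative):
import pandas as pd
data = pd.DataFrame({'tags': ['Museum, Art Museum, Shopping, Museum', 'Drink, Drink']})
data['tags'] = data['tags'].str.lower()
data['tags'] = data['tags'].str.split(',')
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip, x)))
print(data)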
Suppose I have a log file structured as follow for each line:
$date $machine $task_name $loggedstuff
I hope to read the whole thing with pd.read_csv('blah.log', sep=r'\s+'). The problem is that $loggedstuff has spaces in it; is there any way to limit the delimiter to operate exactly 3 times so that everything in $loggedstuff appears in the dataframe as a single column?
I've already tried using csv to parse it as list of list and then feed it into pandas, but that is slow, I'm wondering if there's a more direct way to do this. Thanks!
Setup
tmp.txt
a b c d
1 2 3 test1 test2 test3
1 2 3 test1 test2 test3 test4
Code
df = pd.read_csv('tmp.txt', sep='\n', header=None)
cols = df.loc[0].str.split(' ')[0]
df = df.drop(0)
def splitter(s):
vals = s.iloc[0].split(' ')
d = dict(zip(cols[:-1], vals))
d[cols[-1]] = ' '.join(vals[len(cols) - 1: ])
return pd.Series(d)
df.apply(splitter, axis=1)
returns
a b c d
1 1 2 3 test1 test2 test3
2 1 2 3 test1 test2 test3 test4
When using expand=True, the split elements will expand out into separate columns.
Parameter n can be used to limit the number of splits in the output.
Details can be found in the documentation for pandas.Series.str.split.
Pattern to use:
Series.str.split(pat=None, n=-1, expand=False)
expand : bool, default False
Expand the split strings into separate columns.
If True, return DataFrame/MultiIndex expanding dimensionality.
If False, return Series/Index, containing lists of strings
series.str.split(' ', n=3, expand=True)
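Applied to the log-file question, a minimal sketch (the column names are placeholders, and every line is assumed to contain all four space-separated fields):
import pandas as pd
# read each line of the log as one string
with open('blah.log') as f:
    raw = pd.Series(f.read().splitlines())
# split on spaces at most 3 times, so $loggedstuff stays in a single column
df = raw.str.split(' ', n=3, expand=True)
df.columns = ['date', 'machine', 'task_name', 'loggedstuff']
print(df)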
I think you can read each line of the csv file as a single string, then convert the resulting dataframe to 3 columns with a regular expression.
df = pd.read_csv('./test.csv', sep='#', squeeze=True)
df = df.str.extract(r'([^\s]+)\s+([^\s]+)\s+(.+)')
where you can change the separator to any character that does not appear in the document.
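To map this back to the four fields in the question, a slightly extended sketch; note that squeeze= has been removed in recent pandas versions, so the single column is selected explicitly here, and '#' is assumed not to occur in the log:
import pandas as pd
# read each line into a single column, then pull out the four fields
lines = pd.read_csv('blah.log', sep='#', header=None)[0]
df = lines.str.extract(r'(\S+)\s+(\S+)\s+(\S+)\s+(.+)')
df.columns = ['date', 'machine', 'task_name', 'loggedstuff']
print(df)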