How to delete junk strings appearing in an integer column - python

I have a column of integers (sample row: 123456789) and some of the values are interspersed with junk alphabetic characters, e.g. 1234y5678. I want to delete the letters appearing in such cells and retain the numbers. How do I go about it using pandas?
Assume my dataframe is df and the column name is mobile.
Should I use np.where with conditions such as df[df['mobile'].str.contains('a-z')] and use string replace?

If your junk characters are not limited to letters, you should use this (note that in pandas >= 2.0 str.replace matches literally by default, so regex=True is required):
yourSeries.str.replace('[^0-9]', '', regex=True)
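A minimal runnable sketch of that call, with invented sample values:

import pandas as pd

# Invented sample with mixed junk: letters, dashes, spaces
s = pd.Series(['1234y5678', '12-34 56x78'])
print(s.str.replace('[^0-9]', '', regex=True))
# 0    12345678
# 1    12345678
# dtype: object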

Use pd.Series.str.replace:
import pandas as pd
s = pd.Series(['125109a181', '1361q1j1', '85198m4'])
s.str.replace('[a-zA-Z]', '', regex=True).astype(int)
Output:
0    125109181
1       136111
2       851984
dtype: int64
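If a cell might contain nothing but junk, astype(int) will fail on the resulting empty string. A hedged alternative is pd.to_numeric:

import pandas as pd

s = pd.Series(['125109a181', 'xyz'])  # 'xyz' stands in for an all-junk cell
cleaned = s.str.replace('[a-zA-Z]', '', regex=True)
# errors='coerce' turns unconvertible (here: empty) strings into NaN instead of raising
print(pd.to_numeric(cleaned, errors='coerce'))
# 0    125109181.0
# 1            NaN
# dtype: float64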

Use the regex character class \D (not a digit):
df['mobile'] = df['mobile'].str.replace(r'\D', '', regex=True).astype('int64')
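A minimal sketch using the asker's column name (the values are invented):

import pandas as pd

df = pd.DataFrame({'mobile': ['123456789', '1234y5678']})
df['mobile'] = df['mobile'].str.replace(r'\D', '', regex=True).astype('int64')
print(df)
#       mobile
# 0  123456789
# 1   12345678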


Strip strings in pandas columns

I have a small dataframe with entries regarding motorsport balance of performance.
I am trying to get rid of the string after "#".
This is working fine with the code:
for col in df_engine.columns[1:]:
    df_engine[col] = df_engine[col].str.rstrip(r"[\ \# \d.[0-9]+]")
but it leaves the last column unchanged, and I do not understand why.
The Ferrari column also has a NaN entry as last position, just as additional info.
Can anyone provide some help?
Thank you in advance!
rstrip does not work with regex. As per the documentation,
to_strip : str or None, default None
    Specifying the set of characters to be removed. All combinations of this set of characters will be stripped. If None then whitespaces are removed.
>>> "1.76 # 0.88".rstrip("[\ \# \d.[0-9]+]")
'1.76 # 0.88'
>>> "1.76 # 0.88".rstrip("[\ \# \d.[0-8]+]") # It's not treated as regex, instead All combinations of characters(`[\ \# \d.[0-8]+]`) stripped
'1.76'
You could use the replace method instead.
for col in df.columns[1:]:
    df[col] = df[col].str.replace(r"\s#\s[\d\.]+$", "", regex=True)
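A runnable sketch of this, with an invented df_engine standing in for the original data:

import pandas as pd

# Invented stand-in for the balance-of-performance frame described above
df_engine = pd.DataFrame({
    'Car': ['A', 'B'],
    'Ferrari': ['1.76 # 0.88', None],
    'Porsche': ['1.80 # 0.92', '1.75 # 0.90'],
})

for col in df_engine.columns[1:]:
    df_engine[col] = df_engine[col].str.replace(r"\s#\s[\d\.]+$", "", regex=True)

print(df_engine)
#   Car Ferrari Porsche
# 0   A    1.76    1.80
# 1   B     NaN    1.75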
What about str.split() ?
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html#pandas.Series.str.split
The function splits a Series into DataFrame columns (when expand=True) using the separator provided.
The following example splits the Series df_engine[col] and produces a DataFrame. The first column of the new DataFrame contains the values preceding the first separator char '#' found in each value:
df_engine[col].str.split('#', expand=True)[0]
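A minimal sketch with invented values (split leaves the space before '#' attached, so a str.strip() is added here):

import pandas as pd

s = pd.Series(['1.76 # 0.88', '1.80 # 0.92'])
# Column 0 of the expanded frame holds everything before the first '#'
print(s.str.split('#', expand=True)[0].str.strip())
# 0    1.76
# 1    1.80
# dtype: object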

Adding leading zeros for only numeric character id

I know how to add leading zeros for all values in a pandas column. But my pandas column 'id' contains both numeric strings like '83948', '848439' and alphanumeric strings like 'dy348dn', '494rd7f'. What I want is to add zeros only to the numeric ids until they reach 10 characters. How can we do that?
I understand that you want to apply padding only to ids that are completely numeric. In this case, you can use isnumeric() on a string (for example, mystring.isnumeric()) to check whether the string contains only numbers. If the condition is satisfied, you can apply your padding rule, as sketched below.
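A minimal sketch of that idea using apply (the frame is invented from the question's sample ids):

import pandas as pd

df = pd.DataFrame({'id': ['83948', '848439', 'dy348dn', '494rd7f']})
# Pad with leading zeros to 10 characters only when the id is entirely numeric
df['id'] = df['id'].apply(lambda x: x.zfill(10) if x.isnumeric() else x)
print(df)
#            id
# 0  0000083948
# 1  0000848439
# 2     dy348dn
# 3     494rd7f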
You can use a mask with str.isdigit and boolean indexing with str.zfill:
mask = df['col'].str.isdigit()
df.loc[mask, 'col'] = df.loc[mask, 'col'].str.zfill(10)
Output:
col
0 0000083948
1 0000848439
2 dy348dn
3 494rd7f
Used input:
df = pd.DataFrame({'col': ['83948', '848439', 'dy348dn', '494rd7f']})

Extract substring from left to a specific character for each row in a pandas dataframe?

I have a dataframe that contains a collection of strings. These strings look something like this:
"oop9-hg78-op67_457y"
I need to cut everything from the underscore to the end in order to match this data with another set. My attempt looked something like this:
df['column'] = df['column'].str[0:'_']
I've tried toying around with .find() in this statement but nothing seems to work. Anybody have any ideas? Any and all help would be greatly appreciated!
You can try .str.split and then take the first element with .str[0], or use .str.extract:
df['column'] = df['column'].str.split('_').str[0]
# or
df['column'] = df['column'].str.extract('^([^_]*)_')
print(df)
column
0 oop9-hg78-op67
df['column'] = df['column'].str.extract('(.*)_', expand=False)
could also be used if another option is needed.
Adding to the solution provided above by @Ynjxsjmh:
You can use str.extract:
df['column2'] = df['column'].str.extract(r'(^[^_]+)')
Output (as a separate column for clarity):
                column         column2
0  oop9-hg78-op67_457y  oop9-hg78-op67
Regex:
(      # start capturing group
^      # match start of string
[^_]+  # one or more non-underscore characters
)      # end capturing group
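Putting it together as a self-contained sketch:

import pandas as pd

df = pd.DataFrame({'column': ['oop9-hg78-op67_457y']})
# expand=False returns a Series rather than a one-column DataFrame
df['column2'] = df['column'].str.extract(r'(^[^_]+)', expand=False)
print(df)
#                 column         column2
# 0  oop9-hg78-op67_457y  oop9-hg78-op67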

Remove a row in a pandas data frame if the data starts with a specific character

I have a data frame with some text read in from a txt file the column names are FEATURE and SENTENCES.
Within the FEATURE col there is some text that starts with '[NA]', e.g. '[NA] not a feature'.
How can I remove those rows from my data frame?
So far I have tried:
df[~df.FEATURE.str.contains("[NA]")]
But this did nothing, no errors either.
I also tried:
df.drop(df['FEATURE'].str.startswith('[NA]'))
Again, there were no errors, but this didn't work.
Let's suppose you have the DataFrame below:
>>> df
FEATURE
0 this
1 is
2 string
3 [NA]
Then the following should suffice:
>>> df[~df['FEATURE'].str.startswith('[NA]')]
FEATURE
0 this
1 is
2 string
Another way, in case the data needs to be cast to string before operating on it:
df[~df['FEATURE'].astype(str).str.startswith('[NA]')]
OR using str.contains:
>>> df[df.FEATURE.str.contains('[NA]') == False]
# df[df['FEATURE'].str.contains('[NA]') == False]
FEATURE
0 this
1 is
2 string
OR
df[df.FEATURE.str[0].ne('[')]
IIUC, use regex=False so the pattern is not parsed as a regex:
df[~df.FEATURE.str.contains("[NA]", regex=False)]
Or escape the special regex characters []:
df[~df.FEATURE.str.contains(r"\[NA\]")]
Another possible problem is surrounding whitespace; then use:
df[~df['FEATURE'].str.strip().str.startswith('[NA]')]
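To see why the escaping matters, a quick sketch on toy data:

import pandas as pd

df = pd.DataFrame({'FEATURE': ['this', 'is', 'string', '[NA] not a feature']})
# With regex=True (the default), '[NA]' would be a character class matching
# a single 'N' or 'A'; regex=False matches the literal text instead
print(df[~df.FEATURE.str.contains('[NA]', regex=False)])
#   FEATURE
# 0    this
# 1      is
# 2  string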
df['data'].str.startswith('[NA]') or df['data'].str.contains('[NA]') will both return a boolean (True/False) Series. drop doesn't work with booleans, and in this case it is easiest to use loc.
Here is one solution with some example data. Note that I add '== False' to get all the rows that DON'T have [NA] (regex=False makes '[NA]' match literally rather than as a character class):
df = pd.DataFrame(['feature', 'feature2', 'feature3', '[NA] not feature', '[NA] not feature2'], columns=['data'])
mask = df['data'].str.contains('[NA]', regex=False) == False
df.loc[mask]
The simple code below should work:
df = df[~df['FEATURE'].str.startswith('[NA]')]

Filter for a string followed by a random row of numbers

I have a row that I would like to filter for in a dataframe.
ch=b611067=football
My question is: I would like to just filter for the b611067 section.
I understand I can use str.startswith('b') to find the start of the ID, but what I am looking for is a way to say something like str.contains('<random 6-digit numerical value>').
Hope this makes sense.
I am not sure (yet) how to do this efficiently in pandas, but you can use regex for the match:
import re
pattern = r'(b\d{6})'
text = 'ch=b611067=football'
matches = re.findall(pattern=pattern, string=text)
for match in matches:
    print(match)  # do something with each match; here it prints 'b611067'
Edit: this answer explains how to use regex with pandas:
How to filter rows in pandas by regex
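Applied to a DataFrame column, the same pattern works with str.contains (a sketch with invented rows):

import pandas as pd

df = pd.DataFrame({'foo': ['ch=b611067=football', 'us=handball']})
# Keep rows containing 'b' followed by exactly six digits
print(df[df['foo'].str.contains(r'b\d{6}', regex=True)])
#                    foo
# 0  ch=b611067=football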
You can use the .str accessor to use string functions on string columns, including matching by regexp:
import pandas as pd
df = pd.DataFrame(data={"foo": ["us=b611068=handball", "ch=b611067=football", "de=b611069=hockey"]})
print(df.foo.str.match(r'.+=b611067=.+'))
Output:
0 False
1 True
2 False
Name: foo, dtype: bool
You can use this to index the dataframe, so for instance:
print(df[df.foo.str.match(r'.+=b611067=.+')])
Output:
foo
1 ch=b611067=football
If you want all rows that match the pattern b<6 numbers>, you can use the expression provided by tobias_k:
df.foo.str.match(r'.+=b[0-9]{6}=.+')
Note: this gives the same result as df.foo.str.contains(r'=b611067='), which doesn't require you to provide the wildcards and is the solution given in How to filter rows in pandas by regex; but as mentioned in the pandas docs, with match you can be stricter.
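If the goal is to pull out the b-number itself rather than only filter on it, str.extract is a possible follow-up (a sketch reusing the example frame from above):

import pandas as pd

df = pd.DataFrame(data={"foo": ["us=b611068=handball", "ch=b611067=football", "de=b611069=hockey"]})
# Capture 'b' plus exactly six digits between the '=' separators
print(df.foo.str.extract(r'=(b\d{6})='))
#          0
# 0  b611068
# 1  b611067
# 2  b611069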
