finding and replacing strings with numbers only within a pandas dataframe - python

I am trying to replace the strings that contain numbers with another string (an empty one in this case) within a pandas DataFrame.
I tried the .replace method with a regular expression:
import pandas as pd
# creating a dummy DataFrame
data = pd.DataFrame({'A': ['test' for _ in range(5)]})
# the value that should get replaced with ''
data.iloc[0] = 'test5'
data.replace(regex=r'\d', value='', inplace=True)
print(data)
A
0 test
1 test
2 test
3 test
4 test
As you can see, it only replaces the '5' within the string, not the whole string.
I also tried the .where method, but it doesn't seem to fit my needs, as I don't want to replace any of the strings that don't contain numbers.
This is what it should look like:
A
0
1 test
2 test
3 test
4 test

You can use Boolean indexing via pd.Series.str.contains with loc:
data.loc[data['A'].str.contains(r'\d'), 'A'] = ''
Similarly, with mask or np.where (the latter needs import numpy as np):
data['A'] = data['A'].mask(data['A'].str.contains(r'\d'), '')
data['A'] = np.where(data['A'].str.contains(r'\d'), '', data['A'])
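For completeness, here is a minimal runnable sketch of the loc approach on the dummy frame from the question (only names from the question itself are used):
import pandas as pd
# dummy frame from the question: the first value contains a digit
data = pd.DataFrame({'A': ['test' for _ in range(5)]})
data.iloc[0] = 'test5'
# blank out every string that contains at least one digit
data.loc[data['A'].str.contains(r'\d'), 'A'] = ''
print(data)
Output:
A
0
1 test
2 test
3 test
4 test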

Related

How to standardize column in pandas

I have a DataFrame which contains an id column with the following sample values:
16620625 5686
16310427-5502
16501010 4957
16110430 8679
16990624/4174
16230404.1177
16820221/3388
I want to standardise them to XXXXXXXX-XXXX (i.e. 8 and 4 digits separated by a dash). How can I achieve that using Python?
Here's my code:
df['id']
df.replace(" ", "-")
You can use the DataFrame.replace() function with a regular expression like this:
df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')
Here's example code with sample data.
import pandas as pd
df = pd.DataFrame({'id': [
    '16620625 5686',
    '16310427-5502',
    '16501010 4957',
    '16110430 8679',
    '16990624/4174',
    '16230404.1177',
    '16820221/3388']})
# normalize matching strings with 8-digits + delimiter + 4-digits
df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')
print(df)
Output:
id
0 16620625-5686
1 16310427-5502
2 16501010-4957
3 16110430-8679
4 16990624-4174
5 16230404-1177
6 16820221-3388
If a value does not match the regexp of the expected format, then its value will not be changed.
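To illustrate, a small sketch (the second id is hypothetical: it has only 7 leading digits, so the pattern does not match and the value is left as-is):
import pandas as pd
df = pd.DataFrame({'id': ['16620625 5686', '1662062 5686']})
df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')
print(df)
Output:
id
0 16620625-5686
1 1662062 5686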
Alternatively, inside a for loop (a sketch follows below):
convert your DataFrame entry to a string,
traverse this string up to the 7th index (the first 8 characters),
concatenate a '-' after the 7th index,
concatenate the remaining string to the end,
then move on to the next DataFrame entry.
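A literal sketch of those steps, assuming (as in the sample data) every entry is exactly 8 digits, one delimiter character, then 4 digits:
import pandas as pd
df = pd.DataFrame({'id': ['16620625 5686', '16990624/4174']})
fixed = []
for entry in df['id']:
    s = str(entry)                     # convert the entry to a string
    fixed.append(s[:8] + '-' + s[9:])  # first 8 characters, a dash, then the last 4
df['id'] = fixed
print(df)
Note that the vectorized replace above is more robust, since it leaves malformed rows untouched, while this loop assumes a fixed layout.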

Remove a row in a pandas data frame if the data starts with a specific character

I have a data frame with some text read in from a txt file; the column names are FEATURE and SENTENCES.
Within the FEATURE col there is some text that starts with '[NA]', e.g. '[NA] not a feature'.
How can I remove those rows from my data frame?
So far I have tried:
df[~df.FEATURE.str.contains("[NA]")]
But this did nothing, no errors either.
I also tried:
df.drop(df['FEATURE'].str.startswith('[NA]'))
Again, there were no errors, but this didn't work.
Let's suppose you have the DataFrame below:
>>> df
FEATURE
0 this
1 is
2 string
3 [NA]
Then the below should suffice:
>>> df[~df['FEATURE'].str.startswith('[NA]')]
FEATURE
0 this
1 is
2 string
Another way, in case the data needs to be converted to string before operating on it:
df[~df['FEATURE'].astype(str).str.startswith('[NA]')]
OR using str.contains (note that '[NA]' is parsed as a regex character class here; see the regex=False note below):
>>> df[df.FEATURE.str.contains('[NA]') == False]
# df[df['FEATURE'].str.contains('[NA]') == False]
FEATURE
0 this
1 is
2 string
OR
df[df.FEATURE.str[0].ne('[')]
IIUC, use regex=False so the string is not parsed as a regex:
df[~df.FEATURE.str.contains("[NA]", regex=False)]
Or escape special regex chars []:
df[~df.FEATURE.str.contains(r"\[NA\]")]
Another possible problem is leading whitespace; then use:
df[~df['FEATURE'].str.strip().str.startswith('[NA]')]
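A small sketch contrasting the literal and escaped patterns (the sample rows are hypothetical):
import pandas as pd
df = pd.DataFrame({'FEATURE': ['[NA] not a feature', 'Apple pie', 'sentence']})
# as a regex, '[NA]' is a character class matching any 'N' or 'A',
# so it would also (wrongly) drop 'Apple pie'
print(df[~df.FEATURE.str.contains('[NA]', regex=False)])  # literal match keeps 'Apple pie'
print(df[~df.FEATURE.str.contains(r'\[NA\]')])            # escaped regex, same result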
df['data'].str.startswith('[NA]') or df['data'].str.contains('[NA]') will both return a boolean (True/False) Series. drop doesn't work with booleans, and in this case it is easiest to use loc.
Here is one solution with some example data. Note that I add '==False' to get all the rows that DON'T have [NA]:
df = pd.DataFrame(['feature','feature2', 'feature3', '[NA] not feature', '[NA] not feature2'], columns=['data'])
mask = df['data'].str.contains('[NA]')==False
df.loc[mask]
The simple code below should work:
df = df[~df['FEATURE'].str.startswith('[NA]')]

Pandas - Remove leading Zeros from String but not from Integers

I currently have a column in my dataset that looks like the following:
Identifier
09325445
02242456
00MatBrown
0AntonioK
065824245
The column data type is object.
What I'd like to do is remove the leading zeros only from column rows where there is a string. I want to keep the leading zeros where the column rows are integers.
Result I'm looking to achieve:
Identifier
09325445
02242456
MatBrown
AntonioK
065824245
Code I am currently using (that isn't working):
def removeLeadingZeroFromString(row):
    if df['Identifier'] == str:
        return df['Identifier'].str.strip('0')
    else:
        return df['Identifier']

df['Identifier'] = df.apply(lambda row: removeLeadingZeroFromString(row), axis=1)
One approach would be to try to convert Identifier with to_numeric. Test where the converted values are NaN using isna, and use this mask to str.lstrip (strip leading zeros only) where the values could not be converted:
m = pd.to_numeric(df['Identifier'], errors='coerce').isna()
df.loc[m, 'Identifier'] = df.loc[m, 'Identifier'].str.lstrip('0')
df:
Identifier
0 09325445
1 02242456
2 MatBrown
3 AntonioK
4 065824245
Alternatively, a less robust approach, but one that will work with number-only strings, would be to test where not str.isnumeric:
m = ~df['Identifier'].str.isnumeric()
df.loc[m, 'Identifier'] = df.loc[m, 'Identifier'].str.lstrip('0')
NOTE: this fails easily; to_numeric is the much better approach if looking for all number types.
Sample Frame:
df = pd.DataFrame({
    'Identifier': ['0932544.5', '02242456']
})
Sample Results with isnumeric:
Identifier
0 932544.5 # 0 Stripped
1 02242456
DataFrame and imports:
import pandas as pd
df = pd.DataFrame({
    'Identifier': ['09325445', '02242456', '00MatBrown', '0AntonioK',
                   '065824245']
})
Use replace with regex and a positive lookahead:
>>> df['Identifier'].str.replace(r'^0+(?=[a-zA-Z])', '', regex=True)
0 09325445
1 02242456
2 MatBrown
3 AntonioK
4 065824245
Name: Identifier, dtype: object
Regex: replace one or more zeros (0+) at the start of the string (^) if there is a letter ([a-zA-Z]) after the zeros (the (?=...) lookahead).

Replacing a value in all cells of a DataFrame in Python

I have an example df:
df = pd.DataFrame({'A': ['100,100', '200,200'],
                   'B': ['200,100,100', '100']})
A B
0 100,100 200,100,100
1 200,200 100
and I want to replace the commas ',' with nothing (basically, remove them). You can probably guess the real-world application: a lot of data is written with thousands separators. Feel free to introduce me to a better method.
Now I read the documentation for pd.replace() here and I tried several versions of code - it raises no error, but does not modify my data frame.
df = df.replace(',','')
df = df.replace({',': ''})
df = df.replace([','],'')
df = df.replace([','],[''])
I can get it working when specifying the column names and using the .str.replace() method for Series, but imagine having 20 columns. I can also get this working by specifying columns in the df.replace() method, but there must be a more convenient way for such an easy task. I could write a custom function, but pandas is such an amazing library that it must be something I am missing.
This works:
df['A'] = df['A'].str.replace(',','')
Thank you!
df.replace has a regex parameter; set it to True for partial matches.
By default the regex param is False; when False, it replaces only exact, full matches.
From Pandas docs:
str: string exactly matching to_replace will be replaced with the value.
df.replace(',', '', regex=True)
A B
0 100100 200100100
1 200200 100
In pd.Series.str.replace, the regex param defaulted to True in older pandas versions (since pandas 2.0 the default is False).
From docs:
Equivalent to str.replace() or re.sub(), depending on the regex value.
Determines if the passed-in pattern is a regular expression:
If True, assumes the passed-in pattern is a regular expression.
If False, treats the pattern as a literal string
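If you prefer the Series-level str.replace but have many columns (the 20-column scenario from the question), one hedged sketch is to loop over the string columns; selecting them with select_dtypes is an assumption about the frame's dtypes:
import pandas as pd
df = pd.DataFrame({'A': ['100,100', '200,200'],
                   'B': ['200,100,100', '100']})
# remove the literal comma, column by column
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.replace(',', '', regex=False)
print(df)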
Though your immediate question has probably been answered, I wanted to mention that if you are reading this data in from a csv file, you can pass the thousands argument with a comma ',' so that these values are parsed as integers with the commas removed:
import io
import pandas as pd
csv_file = io.StringIO("""
A,B,C
"1,000","2,000","3,000"
1,2,3
"50,000",50,5
""")
df = pd.read_csv(csv_file, thousands=",")
print(df)
A B C
0 1000 2000 3000
1 1 2 3
2 50000 50 5
print(df.dtypes)
A int64
B int64
C int64
dtype: object

I'm using Pandas in Python and wanted to know how to split a value in a column and search that value in the column

Normally when splitting a value which is a string, one would simply do:
string = 'aabbcc'
small = string[0:2]
And that's simply it. I thought it would be the same thing for a dataframe by doing:
df = df['Column'][Range][Length of Desired value]
df = df['Date'][0:4][2:4]
Note: every string in the column has the same length, and all are integers written as a string data type.
If I use the code above, the program just throws away the Range and takes [2:4] as the range, which is weird.
When doing this individually it works:
df2 = df['Column'][index][2:4]
So right now I have made a loop that goes one by one and appends to a new DataFrame.
To do the operation element-wise, you can use apply:
df['Column'][0:4].apply(lambda x : x[2:4])
When you did df2 = df['Column'][0:4][2:4], you were doing the same as df2 = df['Column'][2:4]: you get the indexes 0 to 4 of df and then the indexes 2 to 4 of that result.
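A minimal runnable sketch of the apply approach (the 'Date' values here are hypothetical 8-character strings, per the question's note that every string has the same length):
import pandas as pd
df = pd.DataFrame({'Date': ['20200101', '20210315', '20220712', '20231130']})
# slice characters 2..3 of each of the first four entries, element-wise
print(df['Date'][0:4].apply(lambda x: x[2:4]))
As a side note, the vectorized equivalent over the whole column is df['Date'].str[2:4].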
