I have an example df:
df = pd.DataFrame({'A': ['100,100', '200,200'],
'B': ['200,100,100', '100']})
A B
0 100,100 200,100,100
1 200,200 100
and I want to replace the commas ',' with nothing (basically, remove them). You can probably guess the real-world application: a lot of data is written with thousands separators. Feel free to introduce me to a better method.
Now, I read the documentation for pd.DataFrame.replace() and tried several versions of code; none raises an error, but none modifies my data frame.
df = df.replace(',','')
df = df.replace({',': ''})
df = df.replace([','],'')
df = df.replace([','],[''])
I can get it working by specifying the column names and using the Series method .str.replace(), but imagine having 20 columns. I can also get it working by specifying columns in df.replace(), but there must be a more convenient way for such an easy task. I could write a custom function, but pandas is such an amazing library that I must be missing something.
This works:
df['A'] = df['A'].str.replace(',','')
Thank you!
df.replace has a regex parameter; set it to True for partial matches.
By default the regex param is False. When False, it replaces only exact (full) matches.
From Pandas docs:
str: string exactly matching to_replace will be replaced with the value.
df.replace(',', '', regex=True)
A B
0 100100 200100100
1 200200 100
In pd.Series.str.replace, by contrast, the regex param historically defaulted to True (from pandas 2.0 the default is False).
From docs:
Equivalent to str.replace() or re.sub(), depending on the regex value.
Determines if the passed-in pattern is a regular expression:
If True, assumes the passed-in pattern is a regular expression.
If False, treats the pattern as a literal string.
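Putting the pieces together: if the goal is numeric columns (as with thousands separators), you can strip the commas with regex=True and then convert. A minimal sketch, assuming every column should become numeric:
import pandas as pd

df = pd.DataFrame({'A': ['100,100', '200,200'],
                   'B': ['200,100,100', '100']})
# regex=True makes replace operate on substrings instead of whole cells
cleaned = df.replace(',', '', regex=True)
# convert each column from str to a numeric dtype
numeric = cleaned.apply(pd.to_numeric)
print(numeric.dtypes)  # A int64, B int64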
Though your immediate question has probably been answered, I wanted to mention that if you are reading this data in from a csv file, you can pass the thousands argument with a comma "," so that pandas treats the comma as a thousands separator, removes it, and parses the column as an integer:
import io
import pandas as pd
csv_file = io.StringIO("""
A,B,C
"1,000","2,000","3,000"
1,2,3
"50,000",50,5
""")
df = pd.read_csv(csv_file, thousands=",")
print(df)
A B C
0 1000 2000 3000
1 1 2 3
2 50000 50 5
print(df.dtypes)
A int64
B int64
C int64
dtype: object
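If only some columns should be parsed this way, read_csv also accepts per-column converters. A hedged sketch (the lambda is just an illustration) that strips commas from column A only:
import io
import pandas as pd

csv_file = io.StringIO('A,B,C\n"1,000","2,000","3,000"\n1,2,3\n"50,000",50,5\n')
# only column A is converted; B and C keep their embedded commas and stay object
df = pd.read_csv(csv_file, converters={"A": lambda s: int(s.replace(",", ""))})
print(df.dtypes)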
Related
I converted a CSV file to a pandas DataFrame, but found that all the content is str, with a pattern like ="content".
I tried using df.replace to remove the '=' and '"' characters. The code looks like:
df.replace("=","", inplace = True)
df.replace('"',"", inplace = True)
However, this code runs without any error messages, yet nothing is replaced in the DataFrame.
[screenshot: the DataFrame after df.replace, unchanged]
Strangely, it works when I use
df[column] = df[column].str.replace('=','')
df[column] = df[column].str.replace('"','')
Is there any possible way to replace/substitute the equals and double-quote signs using DataFrame methods? And I am curious why the df.replace method isn't working.
Sorry, I can only provide the pic, since the original data and code are in a notebook with internet and USB access locked.
Thanks for the help
Because .replace('=', '') requires the cell value to be exactly '=', which is obviously not true in your case.
You may instead use it with regex:
df = pd.DataFrame({'a': ['="abc"', '="bcd"'], 'b': ['="uef"', '="hdd"'], 'c':[1,3]})
df.replace([r'^="', r'"$'], '', regex=True, inplace=True)
print(df)
a b c
0 abc uef 1
1 bcd hdd 3
Two regular expressions are used here, with the first taking care of the head and the second the tail.
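A hedged alternative sketch: since df.replace with regex=True substitutes like re.sub, a single pattern with a capture group can handle both ends in one pass:
import pandas as pd

df = pd.DataFrame({'a': ['="abc"', '="bcd"'], 'b': ['="uef"', '="hdd"'], 'c': [1, 3]})
# capture the text between =" and " and keep only the capture;
# the non-string column c is left untouched
df = df.replace(r'^="(.*)"$', r'\1', regex=True)
print(df)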
I have a data frame with some text read in from a txt file; the column names are FEATURE and SENTENCES.
Within the FEATURE col there is some text that starts with '[NA]', e.g. '[NA] not a feature'.
How can I remove those rows from my data frame?
So far I have tried:
df[~df.FEATURE.str.contains("[NA]")]
But this did nothing, no errors either.
I also tried:
df.drop(df['FEATURE'].str.startswith('[NA]'))
Again, there were no errors, but this didn't work.
Let's suppose you have the DataFrame below:
>>> df
FEATURE
0 this
1 is
2 string
3 [NA]
Then the below should simply suffice:
>>> df[~df['FEATURE'].str.startswith('[NA]')]
FEATURE
0 this
1 is
2 string
Another way, in case the data needs to be converted to string before operating on it:
df[~df['FEATURE'].astype(str).str.startswith('[NA]')]
OR using str.contains (note that without regex=False, '[NA]' is parsed as a regex character class matching 'N' or 'A'; see the next answer):
>>> df[df.FEATURE.str.contains('[NA]') == False]
# df[df['FEATURE'].str.contains('[NA]') == False]
FEATURE
0 this
1 is
2 string
OR
df[df.FEATURE.str[0].ne('[')]
IIUC, use regex=False so the string is not parsed as a regex:
df[~df.FEATURE.str.contains("[NA]", regex=False)]
Or escape the special regex characters []:
df[~df.FEATURE.str.contains(r"\[NA\]")]
Another possible problem is surrounding whitespace; then use:
df[~df['FEATURE'].str.strip().str.startswith('[NA]')]
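To see the difference concretely, a small sketch contrasting the regex and literal interpretations of '[NA]':
import pandas as pd

s = pd.Series(['this', '[NA] not a feature', 'BANANA'])
print(s.str.contains('[NA]'))               # char class: True for any 'N' or 'A'
print(s.str.contains('[NA]', regex=False))  # literal: True only for the substring '[NA]'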
df['data'].str.startswith('[NA]') or df['data'].str.contains('[NA]') will both return a boolean (True/False) Series. drop doesn't work with booleans, and in this case it is easiest to use loc.
Here is one solution with some example data. Note that I add '== False' to get all the rows that DON'T contain [NA]:
df = pd.DataFrame(['feature','feature2', 'feature3', '[NA] not feature', '[NA] not feature2'], columns=['data'])
mask = df['data'].str.contains('[NA]', regex=False) == False
df.loc[mask]
The below simple code should also work:
df = df[~df['FEATURE'].str.startswith('[NA]')]
I would like to replace some values in my dataframe that were entered in the wrong format. For example, 850/07-498745 should be 07-498745. I used string split successfully to do so; however, it turns all previously correctly formatted strings into NaNs. I tried to base it on a condition, but I still have the same problem. How can I fix it?
Example Input:
mylist = ['850/07-498745', '850/07-148465', '07-499015']
df = pd.DataFrame(mylist)
df.rename(columns={ df.columns[0]: "mycolumn" }, inplace = True)
My Attempt:
df['mycolumn'] = df[df.mycolumn.str.contains('/') == True].mycolumn.str.split('/', 1).str[1]
df
Output:
    mycolumn
0  07-498745
1  07-148465
2        NaN
What I wanted:
    mycolumn
0  07-498745
1  07-148465
2  07-499015
You can use split with '/' and grab the last string from the resulting list:
df['mycolumn'].str.split('/').str[-1]
0 07-498745
1 07-148465
2 07-499015
Name: mycolumn, dtype: object
This would also work, and may help you understand why your original attempt did not:
mask = df.mycolumn.str.contains('/')
df.loc[mask, 'mycolumn'] = df.mycolumn[mask].str.split('/', n=1).str[1]
You were doing df['mycolumn'] = ..., which I believe is just replacing the entire Series for that column with the new one you formed.
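A quick sketch of that alignment effect: the filtered split only yields values for the matching index labels, so assigning it to the whole column leaves every other row as NaN:
import pandas as pd

df = pd.DataFrame({'mycolumn': ['850/07-498745', '850/07-148465', '07-499015']})
partial = df[df.mycolumn.str.contains('/')].mycolumn.str.split('/', n=1).str[1]
print(partial)       # only index 0 and 1 are present
df['bad'] = partial  # index alignment fills row 2 with NaN
print(df)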
For a regex solution:
df.mycolumn.str.extract('(?:.*/)?(.*)$')[0]
Output:
0 07-498745
1 07-148465
2 07-499015
Name: 0, dtype: object
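If you'd rather replace than extract, a hedged equivalent sketch is to strip everything up to the last '/' (reusing the df built above):
# greedy ^.*/ consumes up to the last '/'; rows without '/' are left untouched
df['mycolumn'].str.replace(r'^.*/', '', regex=True)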
I have a row that I would like to filter for in a dataframe.
ch=b611067=football
My question is that I would like to filter for just the b611067 section.
I understand I can use str.startswith('b') to find the start of the ID, but what I am looking for is a way to say something like str.contains('random 6-digit numerical value').
Hope this makes sense.
I am not sure (yet) how to do this efficiently in pandas, but you can use regex for the match:
import re

pattern = r'(b\d{6})'  # 'b' followed by exactly six digits
text = 'ch=b611067=football'
matches = re.findall(pattern=pattern, string=text)
for match in matches:
    pass  # do something
Edit: this answer explains how to use regex with pandas:
How to filter rows in pandas by regex
You can use the .str accessor to use string functions on string columns, including matching by regexp:
import pandas as pd
df = pd.DataFrame(data={"foo": ["us=b611068=handball", "ch=b611067=football", "de=b611069=hockey"]})
print(df.foo.str.match(r'.+=b611067=.+'))
Output:
0 False
1 True
2 False
Name: foo, dtype: bool
You can use this to index the dataframe, so for instance:
print(df[df.foo.str.match(r'.+=b611067=.+')])
Output:
foo
1 ch=b611067=football
If you want all rows that match the pattern b<6 numbers>, you can use the expression provided by tobias_k:
df.foo.str.match(r'.+=b[0-9]{6}=.+')
Note: this gives the same result as df.foo.str.contains(r'=b611067='), which doesn't require you to provide the wildcards and is the solution given in How to filter rows in pandas by regex; but, as mentioned in the pandas docs, with match you can be stricter.
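If you want to pull the ID itself out of the string rather than just filter the rows, str.extract with a capture group is one option; a minimal sketch on the same example frame:
import pandas as pd

df = pd.DataFrame(data={"foo": ["us=b611068=handball", "ch=b611067=football", "de=b611069=hockey"]})
# capture 'b' plus six digits between the '=' separators
print(df.foo.str.extract(r'=(b\d{6})=')[0])  # b611068, b611067, b611069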
I am trying to replace the strings that contain numbers with another string (an empty one in this case) within a pandas DataFrame.
I tried with the .replace method and a regex expression:
# creating dummy dataframe
data = pd.DataFrame({'A': ['test' for _ in range(5)]})
# the value that should get replaced with ''
data.iloc[0] = 'test5'
data.replace(regex=r'\d', value='', inplace=True)
print(data)
A
0 test
1 test
2 test
3 test
4 test
As you can see, it only replaces the '5' within the string, not the whole string.
I also tried using the .where method, but it doesn't seem to fit my need, as I don't want to replace any of the strings that don't contain numbers.
This is what it should look like:
A
0
1 test
2 test
3 test
4 test
You can use Boolean indexing via pd.Series.str.contains with loc:
data.loc[data['A'].str.contains(r'\d'), 'A'] = ''
Similarly, with mask or np.where (the latter needs import numpy as np):
data['A'] = data['A'].mask(data['A'].str.contains(r'\d'), '')
data['A'] = np.where(data['A'].str.contains(r'\d'), '', data['A'])
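For completeness, a runnable sketch applying the loc version to the dummy frame from the question:
import pandas as pd

data = pd.DataFrame({'A': ['test' for _ in range(5)]})
data.iloc[0] = 'test5'  # the value that should get replaced with ''
data.loc[data['A'].str.contains(r'\d'), 'A'] = ''
print(data)  # row 0 is now '', rows 1-4 remain 'test'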