Replacing a string in a dataframe - python

I have a (7, 11000) dataframe. Some of these 7 columns contain strings.
In column 2, row 1000, there is the string 'London'. I want to change it to 'Paris'.
How can I do this? I searched all over the web but couldn't find a way. I used these commands, but none of them works:
df['column2'].replace('London','Paris')
df['column2'].str.replace('London','Paris')
re.sub('London','Paris',df['column2'])
I usually receive this error:
TypeError: expected string or bytes-like object

If you want to replace a single row (you mention row 1000), you can do it with .loc. If you want to replace all occurrences of 'London', you could do this:
import pandas as pd
df = pd.DataFrame({'country': ['New York', 'London'],})
df.country = df.country.str.replace('London', 'Paris')
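For the single-row case mentioned above, a minimal sketch using .loc (this assumes the question's dataframe, with a column named 'column2' and a row label 1000):
# Set a single cell by row label and column name
df.loc[1000, 'column2'] = 'Paris'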
Alternatively, you could write your own replacement function, and then use .apply:
def replace_country(string):
    if string == 'London':
        return 'Paris'
    return string
df.country = df.country.apply(replace_country)
The second method is a bit overkill, but is a good example that generalizes better for more complex tasks.
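For instance, a sketch of a more general version that maps several values at once, reusing the df.country column from the example above (the mapping dict is made up for illustration):
# Hypothetical mapping of old values to new ones; unmapped values pass through unchanged
city_map = {'London': 'Paris', 'New York': 'Berlin'}
df.country = df.country.apply(lambda c: city_map.get(c, c))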

Before replacing, you can also apply regex substitutions with re. Here re_map is assumed to be a dict mapping each regex pattern to its replacement (note that re.sub raises the TypeError from the question if a value is not a string):
import re

re_map = {'London': 'Paris'}  # pattern -> replacement
for pattern, repl in re_map.items():
    df['column2'] = [re.sub(pattern, repl, x) for x in df['column2']]

These are all great answers but many are not vectorized, operating on every item in the series once rather than working on the entire series.
A very reliable filter + replace strategy is to create a mask or subset True/False series and then use loc with that series to replace:
mask = df.country == 'London'
df.loc[mask, 'country'] = 'Paris'
# On 10m records:
# this method: < 1 second
# @Charles' method 1: < 10 seconds
# @Charles' method 2: < 3.5 seconds
# @jose's method: didn't bother because it would be 30 seconds or more

Related

Is there a faster way to search every column of a dataframe for a String than with .apply and str.contains?

So basically I have a bunch of dataframes with about 100 columns and 500-3000 rows, filled with different string values. Now I want to search the entire dataframe for, let's say, the string "Airbag" and delete every row which doesn't contain this string. I was able to do this with the following code:
df = df[df.apply(lambda row: row.astype(str).str.contains('Airbag', regex=False).any(), axis=1)]
This works exactly like I want it to, but it is way too slow. So I tried to find a way to do it with vectorization or a list comprehension, but I wasn't able to do it or find any example code on the internet. So my question is whether it is possible to speed this process up or not.
Example Dataframe:
df = pd.DataFrame({'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'], 'col2': ['String1', 'String2', 'String3'], 'col3': ['Tires', 'Wheel_Airbag', 'Antenna']})
Let's start from this dataframe with random strings and numbers in COLUMN:
import numpy as np
import pandas as pd
np.random.seed(0)
strings = np.apply_along_axis(''.join, 1, np.random.choice(list('ABCD'), size=(100, 5)))
junk = list(range(10))
col = list(strings)+junk
np.random.shuffle(col)
df = pd.DataFrame({'COLUMN': col})
>>> df.head()
COLUMN
0 BBCAA
1 6
2 ADDDA
3 DCABB
4 ADABC
You can simply apply pandas.Series.str.contains. You need to use fillna to account for the non-string elements:
>>> df[df['COLUMN'].str.contains('ABC').fillna(False)]
COLUMN
4 ADABC
31 BDABC
40 BABCB
88 AABCA
101 ABCBB
testing all columns:
Here is an alternative using a good old custom function. One might think it should be slower than apply/transform, but it is actually faster when you have a lot of columns and a decent frequency of the searched term (tested on the example dataframe, a 3x3 with no match, and 3x3000 dataframes with matches and no matches):
def has_match(series):
    for s in series:
        if 'Airbag' in s:
            return True
    return False
df[df.apply(has_match, axis=1)]
Update (exact match)
Since it looks like you actually want an exact match, test with eq() instead of str.contains(). Then use boolean indexing with loc:
df.loc[df.eq('Airbag').any(axis=1)]
Original (substring)
Test for the string with applymap() and turn it into a row mask using any(axis=1):
df[df.applymap(lambda x: 'Airbag' in x).any(axis=1)]
# col1 col2 col3
# 0 Airbag_101 String1 Tires
# 1 Distance_xy String2 Wheel_Airbag
As mozway said, "optimal" depends on the data. These are some timing plots for reference.
Timing plots (not reproduced here) compared the approaches vs number of rows (fixed at 3 columns) and vs number of columns (fixed at 3,000 rows).
Ok I was able to speed it up with the help of numpy arrays, but thanks for the help :D
master_index = []
for column in df.columns:
    np_array = df[column].values
    index = np.where(np_array == 'Airbag')
    master_index.append(index)
print(df.iloc[master_index[1][0]])
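If the goal is to keep every row where any column equals 'Airbag', a sketch that combines the per-column matches from the approach above (the exact-match assumption is carried over from that snippet):
# Union of row positions where any column exactly equals 'Airbag'
rows = np.unique(np.concatenate([np.where(df[col].values == 'Airbag')[0] for col in df.columns]))
print(df.iloc[rows])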

Filtering in pandas: excluding rows that contain part of a string [duplicate]

I've done some searching and can't figure out how to filter a dataframe by
df["col"].str.contains(word)
however I'm wondering if there is a way to do the reverse: filter a dataframe by that set's complement, e.g. to the effect of
!(df["col"].str.contains(word))
Can this be done through a DataFrame method?
You can use the invert (~) operator (which acts like a not for boolean data):
new_df = df[~df["col"].str.contains(word)]
where new_df is the copy returned by the right-hand side.
contains also accepts a regular expression...
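For example, a small sketch of the negated filter with a regex pattern (the dataframe, column name and pattern are made up here):
import pandas as pd

df = pd.DataFrame({"col": ["apple pie", "banana", "apple tart"]})
# Drop rows whose 'col' matches either alternative of the regex
new_df = df[~df["col"].str.contains(r"pie|tart", regex=True)]
print(new_df)  # only the 'banana' row remains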
If the above throws a ValueError or TypeError, the reason is likely because you have mixed datatypes, so use na=False:
new_df = df[~df["col"].str.contains(word, na=False)]
Or,
new_df = df[df["col"].str.contains(word) == False]
I was having trouble with the not (~) symbol as well, so here's another way from another StackOverflow thread:
df[df["col"].str.contains('this|that')==False]
You can use apply and a lambda:
df[df["col"].apply(lambda x: word not in x)]
Or, if you want to define a more complex rule, you can combine conditions with and:
df[df["col"].apply(lambda x: word_1 not in x and word_2 not in x)]
I hope the basics are already covered by the posted answers; I am adding a framework to find multiple words and exclude those rows from the DataFrame.
Here 'word1', 'word2', 'word3', 'word4' are the patterns to search for, df is the DataFrame, and column_a is a column name from df:
values_to_remove = ['word1','word2','word3','word4']
pattern = '|'.join(values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
I had to get rid of the NULL values before using the command recommended by Andy above. An example:
df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
df.loc[:, 'first'] = 'myword'
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df
first second third
0 myword myword NaN
1 myword NaN myword
2 myword myword NaN
Now running the command:
~df["second"].str.contains(word)
I get the following error:
TypeError: bad operand type for unary ~: 'float'
I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.
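A minimal sketch of the fillna() variant, using the example dataframe above (column 'second', word 'myword'):
# Fill missing values with an empty string so str.contains never returns NaN
new_df = df[~df["second"].fillna('').str.contains('myword')]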
To negate your query use ~. Using query has the advantage of returning the valid observations of df directly:
df.query('~col.str.contains("word").values')
Additional to nanselm2's answer, you can use 0 instead of False:
df["col"].str.contains(word)==0
Somehow .contains didn't work for me, but .isin as mentioned by @kenan in this answer (How to drop rows from pandas data frame that contains a particular string in a particular column?) works. Additionally, if you want to look at the entire dataframe and remove the rows which contain the specific word (or set of words), just use the loop below:
for col in df.columns:
    df = df[~df[col].isin(['string or string list separated by comma'])]
Just remove the ~ to get the dataframe that contains the word.
To complement the above, if someone wants to remove all the rows with strings, one could do:
df_new=df[~df['col_name'].apply(lambda x: isinstance(x, str))]

How to change all string cells which include numbers to float all at once in pandas? [duplicate]

So I have a dataframe of NBA stats from last season which I am using to learn pandas and matplotlib, but all the numbers (points per game, salaries, PER, etc.) are strings. I noticed it when I tried to sum them and they just got concatenated. So I used this:
df['Salary'] = df['Salary'].astype(float)
to change the values, but there are many more columns that I would have to do the same thing for, and I don't want to do it manually. The first thing that comes to mind is some kind of regex, but I am not familiar with it, so I am seeking help. Thanks in advance!
In Pandas, DataFrame objects make a list of all columns contained in the frame available via the columns attribute. This attribute is iterable, which means you can use this as the iterable object of a for-in loop. This allows you to easily run through and apply an operation to all columns:
for col in df.columns:
    df[col] = df[col].astype('float', errors='ignore')
Documentation page for Pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
Another way to do this if you know the columns in advance is to specify the dtype when you import the dataframe.
df = pd.read_csv("file.tsv", sep='\t', dtype={'a': float, 'b': str, 'c': float})
A second method could be to use a conversion dictionary:
conversion_dict = {'a': float, 'c': float}
df = df.astype(conversion_dict)
A third method, if your columns are of type object, is to use the infer_objects() method from pandas. Using this method you don't have to specify all the columns yourself:
df = df.infer_objects()
good luck
I think you can use select_dtypes
The strategy is to find the columns with dtype object, which usually hold strings. You can check this with df.info(). So:
df.select_dtypes(include = ['object']).astype(float)
would do the trick
If you want to keep a trace of this:
str_cols = df.select_dtypes(include=['object']).columns
mapping = {col_name: float for col_name in str_cols}
df[str_cols] = df[str_cols].astype(mapping)
I like this approach because you can create a dictionary of the types you want your columns to be in.
If you know the names of the columns you can use a for loop to apply the same transformation to each column. This is useful if you don't want to convert entire data frame but only the numeric columns etc. Hope that helps 👍
cols = ['points','salary','wins']
for i in cols:
    df[i] = df[i].astype(float)
I think what the OP is asking is how to convert each column to its appropriate type (int, float, or str) without having to manually inspect each column and then explicitly convert it.
I think something like the below should work for you. Keep in mind that this is pretty exhaustive and checks every value in each column. You can always change the second for loop to only look at, say, the first 100 values to decide what type to use for that column.
import pandas as pd
import numpy as np
# Example dataframe full of strings
df = pd.DataFrame.from_dict({'name':['Lebron James','Kevin Durant'],'points':['38',' '],'steals':['2.5',''],'position':['Every Position','SG'],'turnovers':['0','7']})
def convertTypes(df):
    for col in df:
        is_an_int = True
        is_a_float = True
        if(df[col].dtype == np.float64 or df[col].dtype == np.int64):
            # If the column's type is already a float or int, skip it
            pass
        else:
            # Iterate through each value in the column
            for value in df[col].iteritems():
                if value[1].isspace() == True or value[1] == '':
                    continue
                # If the string's isnumeric method returns false, it's not an int
                if value[1].isnumeric() == False:
                    is_an_int = False
                # if the string is made up of two numerics split by a '.', it's a float
                if isinstance(value[1], str):
                    if len(value[1].split('.')) == 2:
                        if value[1].split('.')[0].isnumeric() and value[1].split('.')[1].isnumeric():
                            is_a_float = True
                        else:
                            is_a_float = False
                    else:
                        is_a_float = False
                else:
                    is_a_float = False
            if is_a_float == True:
                # If every value's a float, convert the whole column
                # Replace blanks and whitespaces with np.nan
                df[col] = df[col].replace(r'^\s*$', np.nan, regex=True).astype(float)
            elif is_an_int == True:
                # If every value's an int, convert the whole column
                # Replace blanks and whitespaces with 0
                df[col] = df[col].replace(r'^\s*$', 0, regex=True).astype(int)

convertTypes(df)

What is the most efficient way to regex search an entire Pandas Dataframe? [duplicate]

Thought this would be straightforward, but I had some trouble tracking down an elegant way to search all columns in a dataframe at the same time for a partial string match. Basically, how would I apply df['col1'].str.contains('^') to an entire dataframe at once and filter down to any rows that contain the match?
The Series.str.contains method expects a regex pattern (by default), not a literal string. Therefore str.contains("^") matches the beginning of any string. Since every string has a beginning, everything matches. Instead use str.contains("\^") to match the literal ^ character.
To check every column, you could use for col in df to iterate through the column names, and then call str.contains on each column:
mask = np.column_stack([df[col].str.contains(r"\^", na=False) for col in df])
df.loc[mask.any(axis=1)]
Alternatively, you could pass regex=False to str.contains to make the test use the Python in operator; but (in general) using regex is faster.
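For reference, a sketch of the regex=False variant on a single column (the column name 'col1' is taken from the question):
# Literal (non-regex) containment test; na=False treats missing values as non-matches
mask = df['col1'].str.contains('^', regex=False, na=False)
df[mask]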
Try:
df.apply(lambda row: row.astype(str).str.contains('TEST').any(), axis=1)
Here's a function to solve the problem of doing text search in all column of a dataframe df:
def search(regex: str, df, case=False):
    """Search all the text columns of `df`, return rows with any matches."""
    textlikes = df.select_dtypes(include=[object, "string"])
    return df[
        textlikes.apply(
            lambda column: column.str.contains(regex, regex=True, case=case, na=False)
        ).any(axis=1)
    ]
It differs from the existing answers by both staying in the pandas API and embracing that pandas is more efficient in column processing than row processing. Also, this is packed as a pure function :-)
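For example, a quick usage sketch of the search function defined above (the sample dataframe is made up):
import pandas as pd

df = pd.DataFrame({'name': ['Airbag_101', 'Distance_xy'], 'part': ['Tires', 'Wheel_Airbag'], 'n': [1, 2]})
# Returns the rows where any text-like column matches the pattern (case-insensitive by default)
print(search('airbag', df))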
Relevant docs:
DataFrame.apply
The .str accessor
DataFrame.any
Alternatively you can use eq and any:
df[df.eq('^').any(axis=1)]
Posting my findings in case someone needs them.
I had a dataframe (360,000 rows) and needed to search across the whole dataframe to find the rows (just a few) that contained the word 'TOTAL' (any variation, e.g. 'TOTAL PRICE', 'TOTAL STEMS', etc.) and delete those rows.
I finally processed the dataframe in two steps:
FIND COLUMNS THAT CONTAIN THE WORD:
for i in df.columns:
    df[i].astype('str').apply(lambda x: print(df[i].name) if x.startswith('TOTAL') else 'pass')
DELETE THE ROWS:
df[df['LENGTH/ CMS'].str.contains('TOTAL') != True]
Here is an example using applymap. I found other answers didn't work for me since they assumed that all data in a column would be strings causing Attribute Errors. Also it is surprisingly fast.
def search(dataFrame, item):
    mask = (dataFrame.applymap(lambda x: isinstance(x, str) and item in x)).any(1)
    return dataFrame[mask]
You can easily change the lambda to use regex if needed.
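A sketch of that regex variant (the function name search_regex and its pattern handling are illustrative, not from the original answer):
import re

def search_regex(dataFrame, pattern):
    # Same idea as search() above, but each string cell is tested against a regular expression
    regex = re.compile(pattern)
    mask = dataFrame.applymap(lambda x: isinstance(x, str) and bool(regex.search(x))).any(axis=1)
    return dataFrame[mask]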
Yet another solution. This selects the columns of type object, which is pandas' type for strings. Other solutions that coerce to str with .astype(str) could give false positives if you're searching for a number and only want to search string columns; if you do want the numeric columns included in the search, coercing may be the better approach.
As an added benefit, filtering the columns in this way seems to have a performance benefit; on my dataframe of shape (15807, 35), with only 17 of those 35 being strings, I see 4.74 s ± 108 ms per loop as compared to 5.72 s ± 155 ms.
df[
    df.select_dtypes(object)
    .apply(lambda row: row.str.contains("with"), axis=1)
    .any(axis=1)
]
Building on top of @unutbu's answer https://stackoverflow.com/a/26641085/2839786
I use something like this:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> def search(df: pd.DataFrame, substring: str, case: bool = False) -> pd.DataFrame:
...     mask = np.column_stack([df[col].astype(str).str.contains(substring.lower(), case=case, na=False) for col in df])
...     return df.loc[mask.any(axis=1)]
>>>
>>> # test
>>> df = pd.DataFrame({'col1':['hello', 'world', 'Sun'], 'col2': ['today', 'sunny', 'foo'], 'col3': ['WORLD', 'NEWS', 'bar']})
>>> df
col1 col2 col3
0 hello today WORLD
1 world sunny NEWS
2 Sun foo bar
>>>
>>> search(df, 'sun')
col1 col2 col3
1 world sunny NEWS
2 Sun foo bar

