Cleaning data frames with rogue elements using split() - python

Given the following data in an Excel sheet (read in as a DataFrame):
Name   Number          Date
AA     '9988779911'    '01-JAN-18'
'BB'   '8779912044'    '01-FEB-18'
I have used the following code to clean the DataFrame and remove the unnecessary apostrophes:
for name in list(df):
    df[name] = df[name].str.split("'").str[1]
And I want the following output:
Name   Number       Date
AA     9988779911   01-JAN-18
BB     8779912044   01-FEB-18
I am getting the following error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Thanks in advance for your help. :)

try this,
for name in list(df):
    df[name] = df[name].str.replace("'", "")
Replace ' with an empty string.
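For reference, a minimal end-to-end sketch of this fix; the sample values are assumptions based on the question, and astype(str) sidesteps the AttributeError when a column was not read in as strings:
import pandas as pd

# sample frame mimicking the question's data (values are an assumption)
df = pd.DataFrame({'Name': ["AA", "'BB'"],
                   'Number': ["'9988779911'", "'8779912044'"],
                   'Date': ["'01-JAN-18'", "'01-FEB-18'"]})

# cast to str first so the .str accessor works on every column
for name in df.columns:
    df[name] = df[name].astype(str).str.replace("'", "")

print(df)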

A simpler approach:
df = df.applymap(lambda x: x.replace("'", ""))
Note that applymap works element-wise, so every cell must already be a string.

The strip function is probably the shortest way here. The other answers are elegant too. Inside the question's loop:
df[name] = df[name].str.strip("'")
Moshevi said the same in one of the comments.
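Applied column by column, a sketch of the strip approach might look like this (same assumed frame as above):
# strip removes only leading/trailing apostrophes, so embedded ones survive
for name in df.columns:
    df[name] = df[name].astype(str).str.strip("'")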

Related

Extract substring from left to a specific character for each row in a pandas dataframe?

I have a dataframe that contains a collection of strings. These strings look something like this:
"oop9-hg78-op67_457y"
I need to cut everything from the underscore to the end in order to match this data with another set. My attempt looked something like this:
df['column'] = df['column'].str[0:'_']
I've tried toying around with .find() in this statement but nothing seems to work. Anybody have any ideas? Any and all help would be greatly appreciated!
You can try .str.split and then access the list with .str, or use .str.extract:
df['column'] = df['column'].str.split('_').str[0]
# or
df['column'] = df['column'].str.extract('^([^_]*)_')
print(df)
           column
0  oop9-hg78-op67
df['column'] = df['column'].str.extract('([^_]*)_', expand=False)
could also be used if another option is needed.
Adding to the solution provided above by @Ynjxsjmh, you can use str.extract:
df['column2'] = df['column'].str.extract(r'(^[^_]+)')
Output (as a separate column for clarity):
                column         column2
0  oop9-hg78-op67_457y  oop9-hg78-op67
Regex:
( # start capturing group
^ # match start of string
[^_]+ # one or more non-underscore
) # end capturing group
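A quick sanity check of both patterns on the sample string, assembled here as a one-row frame (an assumption for illustration):
import pandas as pd

df = pd.DataFrame({'column': ['oop9-hg78-op67_457y']})

# both keep everything before the first underscore
print(df['column'].str.split('_').str[0])        # 0    oop9-hg78-op67
print(df['column'].str.extract(r'(^[^_]+)')[0])  # 0    oop9-hg78-op67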

Filtering in pandas: excluding rows that contain part of a string [duplicate]

I've done some searching and can't figure out how to filter a dataframe by
df["col"].str.contains(word)
However, I'm wondering if there is a way to do the reverse: filter a dataframe by that set's complement, e.g. to the effect of
!(df["col"].str.contains(word))
Can this be done through a DataFrame method?
You can use the invert (~) operator (which acts like a not for boolean data):
new_df = df[~df["col"].str.contains(word)]
where new_df is the copy returned by the right-hand side.
contains also accepts a regular expression...
If the above throws a ValueError or TypeError, the reason is likely because you have mixed datatypes, so use na=False:
new_df = df[~df["col"].str.contains(word, na=False)]
Or,
new_df = df[df["col"].str.contains(word) == False]
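A minimal sketch combining these pieces (the frame and word are assumptions):
import pandas as pd

df = pd.DataFrame({'col': ['this one', 'that one', 'other', None]})
word = 'this'

# na=False turns missing values into non-matches instead of raising,
# so the None row also survives the negated filter
new_df = df[~df['col'].str.contains(word, na=False)]
print(new_df)  # keeps 'that one', 'other', and the None row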
I was having trouble with the not (~) symbol as well, so here's another way from another StackOverflow thread:
df[df["col"].str.contains('this|that')==False]
You can use apply and a lambda:
df[df["col"].apply(lambda x: word not in x)]
Or, if you want to define a more complex rule, you can use and:
df[df["col"].apply(lambda x: word_1 not in x and word_2 not in x)]
I hope answers are already posted above; I am adding a framework to find multiple words and exclude them from the DataFrame. Here 'word1','word2','word3','word4' is the list of patterns to search for, df is the DataFrame, and column_a is the name of the column to filter on:
values_to_remove = ['word1', 'word2', 'word3', 'word4']
pattern = '|'.join(values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
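For instance, with an assumed sample column the joined pattern becomes 'word1|word2|word3|word4' and every matching row is dropped:
import pandas as pd

df = pd.DataFrame({'column_a': ['has word1 inside', 'clean row', 'WORD3 here']})
values_to_remove = ['word1', 'word2', 'word3', 'word4']
pattern = '|'.join(values_to_remove)

# case=False makes the match case-insensitive, so 'WORD3 here' is excluded too
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
print(result)  # only 'clean row' remains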
I had to get rid of the NULL values before using the command recommended by Andy above. An example:
df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
df.loc[:, 'first'] = 'myword'
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df
    first  second   third
0  myword  myword     NaN
1  myword     NaN  myword
2  myword  myword     NaN
Now running the command:
~df["second"].str.contains(word)
I get the following error:
TypeError: bad operand type for unary ~: 'float'
I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.
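A hedged sketch of the fillna route, which keeps every row instead of dropping any:
# an empty string never contains 'myword', so the former NaN rows pass the negated filter
mask = ~df['second'].fillna('').str.contains('myword')
print(df[mask])  # row 1, where 'second' was NaN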
To negate your query use ~. Using query has the advantage of returning the valid observations of df directly:
df.query('~col.str.contains("word").values')
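One hedged caveat: depending on the pandas version, string methods inside query may need the Python engine spelled out, e.g.:
new_df = df.query('~col.str.contains("word")', engine='python')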
In addition to nanselm2's answer, you can use 0 instead of False:
df["col"].str.contains(word)==0
Somehow .contains didn't work for me, but .isin, as mentioned by @kenan in this answer (How to drop rows from pandas data frame that contains a particular string in a particular column?), did. Adding further: if you want to look at the entire dataframe and remove the rows that contain a specific word (or set of words), just use the loop below.
for col in df.columns:
    df = df[~df[col].isin(['string or string list separated by comma'])]
Just remove the ~ to get the dataframe whose rows contain the word.
To complement the above, if someone wants to remove all the rows with string values, one could do:
df_new = df[~df['col_name'].apply(lambda x: isinstance(x, str))]
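For example, on a small mixed-type column (the frame is an assumption for illustration):
import pandas as pd

df = pd.DataFrame({'col_name': ['text', 42, 3.14, 'more text']})

# keep only the rows whose value is not a str
df_new = df[~df['col_name'].apply(lambda x: isinstance(x, str))]
print(df_new)  # the rows holding 42 and 3.14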

Python Pandas - 'DataFrame' object has no attribute 'str' - .str.replace error

I am trying to replace "," with "" in 80 columns of a pandas dataframe.
I have created a list of these headers to iterate over:
headers = ['h1', 'h2', 'h3'... 'h80']
and then I am using the list of headers to replace the string value in multiple columns, as below:
dataFrame[headers] = dataFrame[headers].str.replace(',','')
Which gave me this error: AttributeError: 'DataFrame' object has no attribute 'str'
When I try the same on only one header it works well, and I need str.replace because plain replace sadly does not replace the ",".
Thank you
Using df.apply
pd.Series.str.replace is a Series method, not one for DataFrames. You can use apply on each row/column Series instead.
dataFrame[headers] = dataFrame[headers].apply(lambda x: x.str.replace(',',''))
Using df.applymap
Or, you can use applymap, treat each cell as a string, and call replace on it directly:
dataFrame[headers] = dataFrame[headers].applymap(lambda x: x.replace(',',''))
Using df.replace
You can also use df.replace, which replaces values directly across all the selected columns. But for this purpose you will have to set regex=True:
dataFrame[headers] = dataFrame[headers].replace(',', '', regex=True)
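A compact sketch of the three options on a tiny assumed frame (two headers standing in for the 80):
import pandas as pd

dataFrame = pd.DataFrame({'h1': ['1,000', '2,000'],
                          'h2': ['3,000', '4,000']})
headers = ['h1', 'h2']

# any one of the three lines below removes the commas
dataFrame[headers] = dataFrame[headers].apply(lambda x: x.str.replace(',', ''))
# dataFrame[headers] = dataFrame[headers].applymap(lambda x: x.replace(',', ''))
# dataFrame[headers] = dataFrame[headers].replace(',', '', regex=True)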

Find length of pd.Series and strip the last two characters Pandas

I am aware that I can find the length of a pd.Series by using pd.Series.str.len() but is there a method to strip the last two characters? I know we can use Python to accomplish this but I was curious to see if it could be done in Pandas.
For example:
$1000.0000
1..0009
456.2233
Would end up as:
$1000.00
1..00
456.22
Any insight would be greatly appreciated.
Just do:
import pandas as pd
s = pd.Series(['$1000.0000', '1..0009', '456.2233'])
res = s.str[:-2]
print(res)
Output
0    $1000.00
1       1..00
2      456.22
dtype: object
Pandas supports the built-in string methods through the str accessor; from the documentation:
These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods
Try with
df_new = df.astype(str).applymap(lambda x: x[:-2])
Or for only one column:
df_new = df['col'].astype(str).str[:-2]

How to replace a string that is a part of a dataframe with a list in pandas?

I am a beginner at coding, and since this is a very simple question, I know there must be answers out there. However, I've searched for about half an hour, typing countless queries into Google, and it has all flown over my head.
Let's say I have a dataframe with columns "Name" and "Hobbies" and 2 people, so 2 rows. Currently, I have the hobbies as strings in the form "hobby1, hobby2". I would like to change this into ["hobby1", "hobby2"]
hobbies_as_string = df.iloc[0, 2]
hobbies_as_list = hobbies_as_string.split(',')
df.iloc[0, -2] = hobbies_as_list
However, this fails with an error: ValueError: Must have equal len keys and value when setting with an iterable. I don't understand why, when I take hobbies_as_string out as a copy, I'm able to assign it a list no problem. I'm also able to assign df.iloc[0, -2] a string, such as "Hey", and that works fine. I'm guessing it has to do with the ValueError. Why won't pandas let me assign it as a list??
Thank you very much for your help and explanation.
Use the "at" method to replace a value with a list
import pandas as pd
# create a dataframe
df = pd.DataFrame(data={'Name': ['Stinky', 'Lou'],
                        'Hobbies': ['Shooting Sports', 'Poker']})
# replace Lou's hobby of poker with a list of degen hobbies with the at method
df.at[1, 'Hobbies'] = ['Poker', 'Ponies', 'Dice']
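A quick check that the whole list landed in a single cell:
print(df.at[1, 'Hobbies'])  # ['Poker', 'Ponies', 'Dice']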
Are you looking to apply a split row-wise, turning each value into a list?
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Kate'],
                   'Hobbies': ["Hobby1, Hobby2", "Hobby2, Hobby3"]})
df['Hobbies'] = df['Hobbies'].apply(lambda x: x.split(','))
df
OR, if you are not a big lambda user, you can do str.split() on the entire column instead, which is easier:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Kate'],
                   'Hobbies': ["Hobby1, Hobby2", "Hobby2, Hobby3"]})
df['Hobbies'] = df['Hobbies'].str.split(",")
df
Output:
   Name           Hobbies
0  John  [Hobby1, Hobby2]
1  Kate  [Hobby2, Hobby3]
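One hedged refinement: splitting on ',' alone keeps the leading space on later items ('Hobby1', ' Hobby2'); with a recent pandas (1.4+, where str.split accepts regex=True), a regex separator swallows the whitespace too:
# split on a comma plus any surrounding whitespace
df['Hobbies'] = df['Hobbies'].str.split(r',\s*', regex=True)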
Another way of doing it:
df = pd.DataFrame({'hobbiesStrings': ['"hobby1, hobby2"']})
df
Replace comma-plus-whitespace with "," and put the hobbiesStrings values in a list:
x = df.hobbiesStrings.str.replace(r',\s+', '","', regex=True).values.tolist()
x
Here I use a regular expression: basically I am replacing a comma \, followed by whitespace \s+ with ",".
Rewrite the column using df.assign:
df = df.assign(hobbies_stringsnes=[x])
Chained together:
df = df.assign(hobbies_stringsnes=[df.hobbiesStrings.str.replace(r',\s+', '","', regex=True).values.tolist()])
df
Output
     hobbiesStrings     hobbies_stringsnes
0  "hobby1, hobby2"  ['"hobby1","hobby2"']
