removing newlines from messy strings in pandas dataframe cells? - python

I've tried multiple ways of splitting and stripping the strings in my pandas dataframe to remove all the '\n' characters, but for some reason it simply won't delete the ones that are attached to other words, even after splitting. I have a pandas dataframe with a column that captures text from web pages using BeautifulSoup. The text has already been cleaned a bit by BeautifulSoup, but it failed to remove the newlines attached to other characters. My strings look a bit like this:
"hands-on\ndevelopment of games. We will study a variety of software technologies\nrelevant to games including programming languages, scripting\nlanguages, operating systems, file systems, networks, simulation\nengines, and multi-media design systems. We will also study some of\nthe underlying scientific concepts from computer science and related\nfields including"
Is there an easy python way to remove these "\n" characters?

EDIT: the correct answer to this is:
df = df.replace(r'\n',' ', regex=True)
I think you need replace:
df = df.replace('\n','', regex=True)
Or:
df = df.replace('\n',' ', regex=True)
Or:
df = df.replace(r'\\n',' ', regex=True)
Sample:
import pandas as pd

text = '''hands-on\ndev nologies\nrelevant scripting\nlang
'''
df = pd.DataFrame({'A':[text]})
print (df)
A
0 hands-on\ndev nologies\nrelevant scripting\nla...
df = df.replace('\n',' ', regex=True)
print (df)
A
0 hands-on dev nologies relevant scripting lang
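Note the two variants above: which pattern you need depends on whether the cells contain real newline characters or a literal backslash followed by n. A minimal sketch with made-up data for the literal-backslash case:
import pandas as pd

# Made-up data: the cell holds a literal backslash + "n" (two characters), not a real newline.
df = pd.DataFrame({'A': [r'hands-on\ndev nologies\nrelevant']})
print(df.replace(r'\\n', ' ', regex=True))
#                                 A
# 0  hands-on dev nologies relevant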

df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)
worked for me.
Source:
https://gist.github.com/smram/d6ded3c9028272360eb65bcab564a18a
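For illustration, a minimal sketch with made-up data showing how that one call handles both cases: the first pattern targets the literal "\t"/"\n"/"\r" text, the second the actual control characters.
import pandas as pd

df = pd.DataFrame({'A': ['line one\nline two', r'escaped\nnewline']})
df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["", ""], regex=True, inplace=True)
print(df)
#                   A
# 0  line oneline two
# 1    escapednewline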

To remove carriage return (\r), new line (\n) and tab (\t)
df = df.replace(r'\r+|\n+|\t+','', regex=True)

With messy data it might be a good idea to remove all whitespace: df.replace(r'\s', '', regex=True, inplace=True).
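A minimal sketch with made-up data showing the effect; note that \s also matches the ordinary spaces between words, so they get joined:
import pandas as pd

df = pd.DataFrame({'A': ['hands-on\ndevelopment of games']})
print(df.replace(r'\s', '', regex=True))
#                             A
# 0  hands-ondevelopmentofgames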

df = pd.DataFrame({'text': ['Sarah Marie Wimberly So so beautiful!!!\nAbram Staten You guys look good man.\nTJ Sloan I miss you guys\n']})
df = df.replace(r'\n', ' ', regex=True)
This worked for the messy data I had.

Related

Extract words after a symbol in python

I have the following data, where I would like to extract source= from the values. Is there a way to create a general regex function so that I can apply it to other columns as well to extract the words after the equals sign?
Data            Data2
source=book     social-media=facebook
source=book     social-media=instagram
source=journal  social-media=facebook
I'm using Python and I have tried the following:
df['Data'].astype(str).str.replace(r'[a-zA-Z]\=', '', regex=True)
but it didn't work.
You can try this:
df.replace(r'[a-zA-Z]+-?[a-zA-Z]+=', '', regex=True)
It gives you the following result :
Data Data2
0 book facebook
1 book instagram
2 journal facebook
Regex is not required in this situation:
print(df['Data'].apply(lambda x : x.split('=')[-1]))
print(df['Data2'].apply(lambda x : x.split('=')[-1]))
You have to repeat the character class 1 or more times, and you don't have to escape the equals sign.
What you can do is make the match a bit broader by matching all characters except a whitespace character or an equals sign.
Then set the result as the new value.
import pandas as pd
data = [
    "source=book",
    "source=journal",
    "social-media=facebook",
    "social-media=instagram"
]
df = pd.DataFrame(data, columns=["Data"])
df['Data'] = df['Data'].astype(str).str.replace(r'[^\s=]+=', '', regex=True)
print(df)
Output
Data
0 book
1 journal
2 facebook
3 instagram
If there has to be a value after the equals sign, you can also use str.extract
df['Data'] = df['Data'].astype(str).str.extract(r'[^\s=]+=([^\s=]+)')

How to strip/replace "domain\" from Pandas DataFrame Column?

I have a pandas DataFrame that's being read in from a CSV that has hostnames of computers including the domain they belong to along with a bunch of other columns. I'm trying to strip out the Domain information such that I'm left with ONLY the Hostname.
DataFrame ex:
name
domain1\computername1
domain1\computername45
dmain3\servername1
dmain3\computername3
domain1\servername64
....
I've tried using both str.strip() and str.replace() with a regex as well as a string literal, but I can't seem to target the domain information correctly.
Examples of what I've tried thus far:
df['name'].str.strip('.*\\')
df['name'].str.replace('.*\\', '', regex = True)
df['name'].str.replace(r'[.*\\]', '', regex = True)
df['name'].str.replace('domain1\\\\', '', regex = False)
df['name'].str.replace('dmain3\\\\', '', regex = False)
None of these seem to make any changes when I spit the DataFrame out using logging.debug(df)
You are already close to the answer, just use:
df['name'] = df['name'].str.replace(r'.*\\', '', regex = True)
which is one of the patterns you already tried, just written as an r-string.
Without the r-string, the literal '.*\\' produces the three characters .*\ with a single trailing backslash, which the regex engine treats as an incomplete escape. With the r-string, both backslashes are kept, and the regex engine reads the pair \\ as one literal backslash, which is exactly what you want to match.
Output:
0 computername1
1 computername45
2 servername1
3 computername3
4 servername64
Name: name, dtype: object
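A quick way to see the difference between the two literals:
# The non-raw literal keeps only a single trailing backslash (an incomplete
# regex escape), while the raw literal keeps both, which the regex engine
# then reads as one literal backslash to match.
print(len('.*\\'))   # 3 characters: . * \
print(len(r'.*\\'))  # 4 characters: . * \ \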
You can use .str.split:
df["name"] = df["name"].str.split("\\", n=1).str[-1]
print(df)
Prints:
name
0 computername1
1 computername45
2 servername1
3 computername3
4 servername64
No regex approach with ntpath.basename:
import pandas as pd
import ntpath
df = pd.DataFrame({'name':[r'domain1\computername1']})
df["name"] = df["name"].apply(lambda x: ntpath.basename(x))
Results: computername1.
With rsplit:
df["name"] = df["name"].str.rsplit('\\').str[-1]

Python pandas setting NaN values fails

I am trying to clean a dataset in pandas; the information is stored in a csv file and is imported using:
tester = pd.read_csv('date.csv')
Every column contains a '?' where the value is missing. For example, there is an age column that contains 9 question marks (?).
I am trying to set all the question marks to NaN; I have tried:
tester = pd.read_csv('date.csv', na_values=["?"])
tester['age'].replace("?", np.NaN)
tester.replace('?', np.NaN)
for col in tester:
    print(tester[col].value_counts(dropna=False))
This still returns 0 for the age column when I know there are 9 '?'s. In this case I assume the check is failing because the value is never seen as '?'.
I have looked at the csv file in Notepad and there are no spaces etc. around the character.
Is there any way of forcing this so that it is recognised?
sample data:
read_csv has a na_values parameter; see the pandas documentation.
df = pd.read_csv('date.csv', na_values='?')
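A minimal, self-contained sketch (made-up data) to confirm the behaviour:
import io
import pandas as pd

csv_data = "age,name\n34,alice\n?,bob\n?,carol\n"
tester = pd.read_csv(io.StringIO(csv_data), na_values='?')
print(tester['age'].value_counts(dropna=False))  # the two '?' rows now show up as NaN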
You are very close:
# It looks like the file has spaces after the commas, so use `sep`
tester = pd.read_csv('date.csv', sep=', ', engine='python')
tester['age'] = tester['age'].replace('?', np.nan)
If there seems to be a problem with the data somewhere, for debugging:
pd.read_csv('file', error_bad_lines=False)
tester = tester[~(tester == '?').any(axis=1)]
OR
pd.read_csv('file', sep='delimiter', header=None)
OR
pd.read_csv('file',header=None,sep=', ')

replacing quotes, commas, apostrophes w/ regex - python/pandas

I have a column with addresses, and sometimes it has these characters that I want to remove: ' " , (apostrophe, double quotes, commas).
I would like to replace these characters with a space in one shot. I'm using pandas, and this is the code I have so far to replace one of them.
test['Address 1'].map(lambda x: x.replace(',', ''))
Is there a way to modify these code so I can replace these characters in one shot? Sorry for being a noob, but I would like to learn more about pandas and regex.
Your help will be appreciated!
You can use str.replace:
test['Address 1'] = test['Address 1'].str.replace(r"[\"\',]", '', regex=True)
Sample:
import pandas as pd
test = pd.DataFrame({'Address 1': ["'aaa",'sa,ss"']})
print (test)
Address 1
0 'aaa
1 sa,ss"
test['Address 1'] = test['Address 1'].str.replace(r"[\"\',]", '', regex=True)
print (test)
Address 1
0 aaa
1 sass
Here's the pandas solution:
To apply it to an entire dataframe, use df.replace. Don't forget the \ to escape the apostrophe.
Example:
import pandas as pd
df = #some dataframe
df.replace('\'','', regex=True, inplace=True)
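A small self-contained version of the same idea, extended to all three characters from the question (made-up data):
import pandas as pd

df = pd.DataFrame({'Address 1': ["'aaa", 'sa,ss"']})
df.replace(r"[\"',]", '', regex=True, inplace=True)
print(df)
#   Address 1
# 0       aaa
# 1      sass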

Pandas KeyError: value not in index

I have the following code,
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]] = p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]].astype(int)
It had always worked until a csv file didn't have full coverage of all weekdays. For example, with the following .csv file,
DOW,Hour,Changes
4Wed,01,237
3Tue,07,2533
1Sun,01,240
3Tue,12,4407
1Sun,09,2204
1Sun,01,240
1Sun,01,241
1Sun,01,241
3Tue,11,662
4Wed,01,4
2Mon,18,4737
1Sun,15,240
2Mon,02,4
6Fri,01,1
1Sun,01,240
2Mon,19,2300
2Mon,19,2532
I'll get the following error:
KeyError: "['5Thu' '7Sat'] not in index"
It seems to have a very easy fix, but I'm just too new to Python to know how to fix it.
Use reindex to get all columns you need. It'll preserve the ones that are already there and put in empty columns otherwise.
p = p.reindex(columns=['1Sun', '2Mon', '3Tue', '4Wed', '5Thu', '6Fri', '7Sat'])
So, your entire code example should look like this:
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
columns = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]
p = p.reindex(columns=columns)
p[columns] = p[columns].astype(int)
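For illustration, a minimal sketch with hypothetical pivoted data showing how reindex adds the missing weekday columns:
import pandas as pd

p = pd.DataFrame({"1Sun": [240, 2204], "3Tue": [2533, 4407]}, index=["01", "07"])
columns = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]
p = p.reindex(columns=columns, fill_value=0)
print(p)
#     1Sun  2Mon  3Tue  4Wed  5Thu  6Fri  7Sat
# 01   240     0  2533     0     0     0     0
# 07  2204     0  4407     0     0     0     0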
I had a very similar issue. I got the same error because the csv contained spaces in the header. My csv contained a header "Gender " and I had it listed as:
[['Gender']]
If it's easy enough for you to access your csv, you can use the Excel formula TRIM() to clip any spaces off the cells.
Or remove them like this:
df.columns = df.columns.to_series().apply(lambda x: x.strip())
Please try this to clean and format your column names:
df.columns = (df.columns.str.strip().str.upper()
                        .str.replace(' ', '_')
                        .str.replace('(', '')
                        .str.replace(')', ''))
I had the same issue.
During the first development I used a .csv file (with a comma as separator) that I had modified a bit before saving it.
After saving, the commas became semicolons.
On Windows this depends on the "Regional and Language Options" customize screen, where you find a List separator setting. This is the character Windows applications expect to be the CSV separator.
When testing with a brand new file I ran into that issue again.
I removed the 'sep' argument from the read_csv call.
before:
df1 = pd.read_csv('myfile.csv', sep=',');
after:
df1 = pd.read_csv('myfile.csv');
That way, the issue disappeared.
