How do I remove punctuation from a string in a pandas Series - python

I am trying to remove punctuation from a pandas Series. My problem is that I am unable to iterate over all the lines in the Series. This is the code I tried, but it takes forever to run. Note that my dataset is fairly large, around 112 MB (200,000 rows).
import pandas as pd
import string

df = pd.read_csv('let us see.csv')
s = set(string.punctuation)
for st in df.reviewText.str:
    for j in s:
        if j in st:
            df.reviewText = df.reviewText.str.replace(j, '')
df.reviewText = df.reviewText.str.lower()
df['clean_review'] = df.reviewText
print(df.clean_review.tail())

D-E-N's answer is pretty good. I will just add another solution that improves the performance of your code.
Iterating over a list version of your Series should be faster than your approach.
import pandas as pd
import string

def replace_chars(text, chars):
    for c in chars:
        text = text.replace(c, '')
    return text.lower()

df = pd.read_csv('let us see.csv')
s = set(string.punctuation)
reviewTextList = df.reviewText.astype(str).tolist()
reviewTextList = [replace_chars(x, s) for x in reviewTextList]
df['clean_review'] = reviewTextList
print(df.clean_review.tail())
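As a possible further speed-up, the per-character loop in replace_chars can be swapped for a single str.translate call, which deletes all punctuation in one pass over each string. This is only a sketch, not the answerer's method: the helper name strip_punctuation and the sample strings are made up for illustration, and the list stands in for df.reviewText.astype(str).tolist().

```python
import string

# Translation table that maps every ASCII punctuation character to None (deletes it).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Hypothetical helper: delete punctuation in one pass, then lowercase."""
    return text.translate(PUNCT_TABLE).lower()

# Stand-in for df.reviewText.astype(str).tolist()
reviews = ["Great product!!!", "Didn't work... at all?", "OK"]
cleaned = [strip_punctuation(r) for r in reviews]
print(cleaned)  # ['great product', 'didnt work at all', 'ok']
```

The cleaned list can be assigned to df['clean_review'] exactly as in the answer above.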

Related

How can I get the specific strings in column value

What I want to do is delete certain parts of a string, keep only the number next to ACoS, and insert it into a new column.
import pandas as pd
data = [{"Campaign" : "Sf l Spy l Branded l ACoS 20 l Manual NX"}]
df = pd.DataFrame(data)
df.insert(1,"targetAcos", 0)
df["targetAcos"] = df["Campaign"].str.replace(r' l ACoS \(.*)\l', r'\1', regex=True)
print(df["targetAcos"])
But I guess I am kind of bad at this and couldn't get it to work correctly, so I hope you can explain how to do it.
I think the Pandas function you want to be using here is str.extract:
df["targetAcos"] = df["Campaign"].str.extract(r'\bl ACoS (\d+) l')
Or perhaps a more generic regex would be:
df["targetAcos"] = df["Campaign"].str.extract(r'\bACoS (\d+)\b')

Convert Multiple Python Lines to a Concurrent DataFrame and Merge with Source Data

I apologize if this is a rudimentary question. I feel like it should be easy but I cannot figure it out. I have the code that is listed below that essentially looks at two columns in a CSV file and matches up job titles that have a similarity of 0.7. To do this, I use difflib.get_close_matches. However, the output is multiple single lines and whenever I try to convert to a DataFrame, every single line is its own DataFrame and I cannot figure out how to merge/concat them. All code, as well as current and desired outputs are below. Any help would be much appreciated.
Current Code is:
import pandas as pd
import difflib
df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n=3
cutoff = 0.7
for aList in aLists:
    best = difflib.get_close_matches(aList, bLists, n, cutoff)
    print(best)
Current Output is:
['SW Engineer']
['Manu Engineer']
[]
['IT Help']
Desired Output is:
Output
0 SW Engineer
1 Manu Engineer
2 (blank)
3 IT Help
The table I am attempting to do this on is:
Any help would be greatly appreciated!
Here is a simple way to achieve this. Each result list is first converted to a string, the leading and trailing brackets are stripped from that string, and the result is appended to a global list.
import pandas as pd
import difflib
import numpy as np

df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n = 3
cutoff = 0.7
best = []
for aList in aLists:
    temp = difflib.get_close_matches(aList, bLists, n, cutoff)
    temp = str(temp)
    strippedString = temp.lstrip("[").rstrip("]")
    # print(temp)
    best.append(strippedString)
print(best)
Output
[
"'SW Engineer'",
"'Manu Engineer'",
'',
"'IT Help'"
]
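Here is another, better way to achieve this.
You can simply use numpy to concatenate the per-row result lists into a single array, and then convert it to a normal list if you want. Note that empty match lists contribute nothing to the concatenation, so the result no longer lines up row by row with the input.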
import pandas as pd
import difflib
import numpy as np

df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n = 3
cutoff = 0.7
best = []
for aList in aLists:
    temp = difflib.get_close_matches(aList, bLists, n, cutoff)
    best.append(temp)
# print(best)
# Use concatenate() to join the arrays
combinedNumpyArray = np.concatenate(best)
# Convert the numpy array to a normal list
normalArray = combinedNumpyArray.tolist()
print(normalArray)
Output
['SW Engineer', 'Manu Engineer', 'IT Help']
Thanks
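If the output has to stay aligned with the input rows, as in the desired output, one option is to keep exactly one value per input and use an empty string when nothing clears the cutoff. The following is a sketch under that assumption: best_match is a hypothetical helper, and the two lists are stand-ins for the JTs and JT columns of name.csv.

```python
import difflib

def best_match(word, candidates, n=3, cutoff=0.7):
    """Return the closest candidate to `word`, or '' when nothing clears the cutoff."""
    matches = difflib.get_close_matches(word, candidates, n=n, cutoff=cutoff)
    return matches[0] if matches else ""

# Stand-ins for list(df['JTs']) and list(df['JT'])
jts = ["Software Engineer", "Manufacturing Engineer", "Human Resource Manager", "IT Help Desk"]
jt = ["Manu Engineer", "SW Engineer", "IT Help", "f"]

# One entry per input row, so the list can be used directly as a DataFrame column
output = [best_match(word, jt) for word in jts]
print(output)
```

Because the list has one entry per row, it could be assigned with df['Output'] = output without any concatenation step.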
You could use pandas' .apply() to run your function on each entry. The result can then either be added as a new column or used to create a new dataframe.
For example:
import pandas as pd
import difflib

def get_best_match(word):
    matches = difflib.get_close_matches(word, JT, n, cutoff)
    return matches[0] if matches else None

df = pd.read_csv('name.csv')
JT = df['JT']
n = 3
cutoff = 0.7
df['Output'] = df['JTs'].apply(get_best_match)
Or for a new dataframe:
df_output = pd.DataFrame({'Output' : df['JTs'].apply(get_best_match)})
Giving you:
                      JTs             JT         Output
0       Software Engineer  Manu Engineer    SW Engineer
1  Manufacturing Engineer    SW Engineer  Manu Engineer
2  Human Resource Manager        IT Help           None
3            IT Help Desk              f        IT Help
Or:
          Output
0    SW Engineer
1  Manu Engineer
2           None
3        IT Help

Replace cells in a dataframe with a range of values

I have a large dataframe in which certain cells have values like <25-27>. Is there a simple way to convert these into something like 25|26|27?
Source data frame:
import pandas as pd
import numpy as np
f = {'function':['2','<25-27>','200'],'CP':['<31-33>','210','4001']}
filter = pd.DataFrame(data=f)
filter
Output Required
output = {'function':['2','25|26|27','200'],'CP':['31|32|33','210','4001']}
op = pd.DataFrame(data=output)
op
thanks a lot !
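import re

def convert_range(x):
    # Raw string avoids the invalid "\-" escape; re.match anchors at the start of the cell
    m = re.match(r"<([0-9]+)-([0-9]+)>", x)
    if m is None:
        return x
    s1, s2 = m.groups()
    return "|".join([str(s) for s in range(int(s1), int(s2)+1)])

# On pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap
op = filter.applymap(convert_range)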
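A variation that might be useful if a range can appear inside a longer cell value (an assumption, since the sample data only shows cells that are a range and nothing else): re.sub with a function as the replacement expands every <a-b> it finds and leaves the rest of the text alone. The name expand_ranges is made up for this sketch.

```python
import re

RANGE_RE = re.compile(r"<(\d+)-(\d+)>")

def expand_ranges(text):
    """Replace every <a-b> in text with a|a+1|...|b; other text is left untouched."""
    def _expand(match):
        start, stop = int(match.group(1)), int(match.group(2))
        return "|".join(str(i) for i in range(start, stop + 1))
    return RANGE_RE.sub(_expand, text)

print(expand_ranges("<25-27>"))    # 25|26|27
print(expand_ranges("200"))        # 200
print(expand_ranges("1,<31-33>"))  # 1,31|32|33
```

It can be applied to the dataframe element by element in the same way as convert_range.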

Pandas: Replacing string with hashed string via regex

I have a DataFrame with 29 columns, and need to replace part of a string in some columns with a hashed part of the string.
Example of the column is as follows:
ABSX, PLAN=PLAN_A ;SFFBJD
ADSFJ, PLAN=PLAN_B ;AHJDG
...
...
Code that captures the part of the string:
Test[14] = Test[14].replace({'(?<=PLAN=)(^"]+ ;)' :'hello'}, regex=True)
I want to change the 'hello' to hash of '(?<=PLAN=)(^"]+ ;)' but it doesn't work this way. Wanted to check if anyone did this before without looping line by line of the DataFrame?
Here is what I suggest:
import hashlib
import re
import pandas as pd

# First I reproduce a similar dataset
df = pd.DataFrame({"v1": ["ABSX", "ADSFJ"],
                   "v2": ["PLAN=PLAN_A", "PLAN=PLAN_B"],
                   "v3": ["SFFBJD", "AHJDG"]})
# I search for the regex and create a column matched_el with the hash
r = re.compile(r'=[a-zA-Z_]+')
df["matched_el"] = ["".join(r.findall(w)) for w in df.v2]
df["matched_el"] = df["matched_el"].str.replace("=", "")
df["matched_el"] = [hashlib.md5(w.encode()).hexdigest() for w in df.matched_el]
# Then I replace in v2 using this hash
# (regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching)
df["v2"] = df["v2"].str.replace("(=[a-zA-Z_]+)", "=", regex=True) + df["matched_el"]
df = df.drop(columns="matched_el")
Here is the result:
      v1                                     v2      v3
0   ABSX  PLAN=8d846f78aa0b0debd89fc1faafc4c40f  SFFBJD
1  ADSFJ  PLAN=3b9a3c8184829ca5571cb08c0cf73c8d   AHJDG
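For the single-column layout shown in the question, a more direct route may be to pass a function as the replacement to Series.str.replace, which calls it with each regex match. This is a sketch, not the answerer's method: md5_of_match is a hypothetical helper, MD5 is used only because the answer above uses it, and the pattern [^ ;]+ is an assumption about where the plan name ends.

```python
import hashlib
import pandas as pd

# Stand-in for Test[14] from the question
col = pd.Series(["ABSX, PLAN=PLAN_A ;SFFBJD", "ADSFJ, PLAN=PLAN_B ;AHJDG"])

def md5_of_match(match):
    """Return the MD5 hex digest of the matched text."""
    return hashlib.md5(match.group(0).encode()).hexdigest()

# The lookbehind keeps "PLAN=" in place; only the plan name itself is replaced.
# A callable replacement requires regex=True.
hashed = col.str.replace(r"(?<=PLAN=)[^ ;]+", md5_of_match, regex=True)
print(hashed.tolist())
```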

How to replace substrings in a dataframe in Python

I have a dataframe in which I want to replace some words with others, based on another dataframe:
import pandas as pd
dist = pd.DataFrame([["21","apple"],["25","balana"],["30","lemon"]],columns=["idx","item"])
a = pd.DataFrame(["apple - banana"],columns=["pf"])
a['pf'] = a['pf'].replace(dist["item"], dist["idx"], regex=True)
print(a)
How can I do that? (this does not work in its current form)
You can try this:
import pandas as pd

dist = pd.DataFrame([["21","apple"],["25","balana"],["30","lemon"]],columns=["idx","item"])
a = pd.DataFrame(["apple - banana"],columns=["pf"])
# Map each idx to its item, then substitute every item found in the text with its idx
b = dict(zip(dist["idx"], dist["item"]))

def replace_items(token):
    for key, value in b.items():
        token = token.replace(value, key)
    return token

a["pf"] = a["pf"].apply(replace_items)
Please be aware that the banana in your dist dataframe is balana. Not sure if this is intended...
Converting the translation table to a dictionary seems to solve the problem:
import pandas as pd
dist = pd.DataFrame([["apple","21"],["banana","25"],["lemon","30"]],columns=["item","idx"])
dist = dist.set_index('item')['idx'].to_dict()
a = pd.DataFrame(["apple - banana"],columns=["pf"])
a['pf'] = a['pf'].replace(dist, regex=True)
print(a)
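One caveat with regex=True is that the dictionary keys are treated as regular expressions and matched anywhere in the string, so a key such as "apple" would also rewrite part of "pineapple". If whole-word replacement is what is wanted (an assumption about the intent), a single compiled pattern with escaped keys and word boundaries is one way to sketch it. The names mapping and replace_items are illustrative.

```python
import re

# Same translation table as above, as a plain dict
mapping = {"apple": "21", "banana": "25", "lemon": "30"}

# Longest keys first so a longer key wins over one that is its prefix
keys = sorted(mapping, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")

def replace_items(text):
    """Replace each whole-word key in text with its mapped value."""
    return pattern.sub(lambda m: mapping[m.group(0)], text)

print(replace_items("apple - banana"))  # 21 - 25
print(replace_items("pineapple"))       # pineapple
```

On a dataframe column this could be applied with a['pf'].apply(replace_items).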
