Applying string operations to pandas data frame - python

There are similar answers, but I could not apply them to my own case.
I want to get rid of characters that are forbidden in Windows directory names in my pandas dataframe. I tried something like:
df1['item_name'] = "".join(x for x in df1['item_name'].rstrip() if x.isalnum() or x in [" ", "-", "_"]) if df1['item_name'] else ""
Assume I have a dataframe like this:
item_name
0 st*back
1 yhh?\xx
2 adfg%s
3 ghytt&{23
4 ghh_h
I want to get:
item_name
0 stback
1 yhhxx
2 adfgs
3 ghytt23
4 ghh_h
How could I achieve this?
Note: I scraped the data from the internet earlier and used the following code on the older version:
item_name = "".join(x for x in item_name.text.rstrip() if x.isalnum() or x in [" ", "-", "_"]) if item_name else ""
Now I have new observations for the same items and want to merge them with the older ones, but I forgot to apply the same cleaning when I rescraped.

You could summarize the condition as a negated character class and use str.replace to remove the matches. Here \w stands for word characters (alphanumerics plus _), \s stands for whitespace, and - is a literal dash. With ^ at the start of the character class, [^\w\s-] matches any character that is not alphanumeric, whitespace, -, or _, which you can then remove:
df.item_name.str.replace(r"[^\w\s-]", "", regex=True)
#0 stback
#1 yhhxx
#2 adfgs
#3 ghytt23
#4 ghh_h
#Name: item_name, dtype: object
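A self-contained sketch of this answer (the raw string and explicit regex=True keep newer pandas versions from warning):

```python
import pandas as pd

df1 = pd.DataFrame({"item_name": ["st*back", "yhh?\\xx", "adfg%s", "ghytt&{23", "ghh_h"]})

# [^\w\s-] matches anything that is not a word character, whitespace, or a dash
df1["item_name"] = df1["item_name"].str.replace(r"[^\w\s-]", "", regex=True)
print(df1["item_name"].tolist())  # ['stback', 'yhhxx', 'adfgs', 'ghytt23', 'ghh_h']
```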

Try
import re
df.item_name.apply(lambda x: re.sub(r'\W+', '', x))
0 stback
1 yhhxx
2 adfgs
3 ghytt23
4 ghh_h
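One caveat: \W is the complement of \w, so unlike the character class in the previous answer it also strips spaces and dashes. A quick comparison on a made-up string:

```python
import re

# \W+ removes spaces and dashes along with the punctuation
print(re.sub(r"\W+", "", "ab-c d*e"))       # abcde
# the narrower class keeps spaces and dashes
print(re.sub(r"[^\w\s-]", "", "ab-c d*e"))  # ab-c de
```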

If you have a properly escaped list of characters:
lst = [r'\\', r'\*', r'\?', '%', '&', r'\{']
df.replace(lst, '', regex=True)
item_name
0 stback
1 yhhxx
2 adfgs
3 ghytt23
4 ghh_h
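If the character list comes from elsewhere, re.escape can build the escaped patterns for you instead of hand-writing backslashes (a sketch on the sample frame):

```python
import re
import pandas as pd

df = pd.DataFrame({"item_name": ["st*back", "yhh?\\xx", "adfg%s", "ghytt&{23", "ghh_h"]})

chars = ["\\", "*", "?", "%", "&", "{"]
pattern = "|".join(re.escape(c) for c in chars)
df = df.replace(pattern, "", regex=True)
print(df["item_name"].tolist())  # ['stback', 'yhhxx', 'adfgs', 'ghytt23', 'ghh_h']
```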

Related

How to extract strings from a list in a column in a python pandas dataframe?

Let's say I have a list
lst = ["fi", "ap", "ko", "co", "ex"]
and we have this series
Explanation
a "fi doesn't work correctly"
b "apples are cool"
c "this works but translation is ko"
and I'm looking to get something like this:
Explanation Explanation Extracted
a "fi doesn't work correctly" "fi"
b "apples are cool" "N/A"
c "this works but translation is ko" "ko"
With a dataframe like
df = pd.DataFrame(
    {"Explanation": ["fi doesn't co work correctly",
                     "apples are cool",
                     "this works but translation is ko"]},
    index=["a", "b", "c"],
)
you can use .str.extract() to do
lst = ["fi", "ap", "ko", "co", "ex"]
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = df.Explanation.str.extract(pattern, expand=False)
to get
Explanation Explanation Extracted
a fi doesn't co work correctly fi
b apples are cool NaN
c this works but translation is ko ko
The regex pattern r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)" looks for an occurrence of one of the lst items either at the beginning with whitespace after it, in the middle with whitespace before and after, or at the end with whitespace before it. str.extract() extracts the capture group (the part between the parentheses). Without a match, the result is NaN.
If you want to extract multiple matches, you could use .str.findall() and then ", ".join the results:
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
    df.Explanation.str.findall(pattern).str.join(", ").replace({"": None})
)
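A runnable sketch of the multi-match variant, reusing the sample frame from above (note that the first row now yields both fi and co):

```python
import pandas as pd

df = pd.DataFrame(
    {"Explanation": ["fi doesn't co work correctly",
                     "apples are cool",
                     "this works but translation is ko"]},
    index=["a", "b", "c"],
)
lst = ["fi", "ap", "ko", "co", "ex"]
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
    df.Explanation.str.findall(pattern).str.join(", ").replace({"": None})
)
print(df["Explanation Extracted"].tolist())
```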
Alternative without regex:
df.index = df.index.astype("category")
matches = df.Explanation.str.split().explode().loc[lambda s: s.isin(lst)]
df["Explanation Extracted"] = (
    matches.groupby(level=0).agg(set).str.join(", ").replace({"": None})
)
If you only want to match at the beginning or end of the sentences, then replace the first part with:
df.index = df.index.astype("category")
splitted = df.Explanation.str.split()
matches = (
    (splitted.str[:1] + splitted.str[-1:]).explode().loc[lambda s: s.isin(lst)]
)
...
I think this solves your problem.
import pandas as pd

lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame([["fi doesn't work correctly"],
                   ["apples are cool"],
                   ["this works but translation is ko"]],
                  columns=["Explanation"])
extracted = []
for index, row in df.iterrows():
    tempList = []
    rowSplit = row['Explanation'].split(" ")
    for val in rowSplit:
        if val in lst:
            tempList.append(val)
    if len(tempList) > 0:
        extracted.append(','.join(tempList))
    else:
        extracted.append('N/A')
df['Explanation Extracted'] = extracted
The apply function of pandas might be helpful:
def extract_explanation(row):
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    substrings = row['Explanation'].split(" ")
    explanation = "N/A"
    for string in substrings:
        if string in custom_substring:
            explanation = string
    return explanation

df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)
The catch here is assumption of only one explanation, but it can be converted into a list, if multiple explanations are expected.
Option 1
Assuming that one wants to extract the exact strings in the list lst, one can start by creating a regex
regex = f'\\b({"|".join(lst)})\\b'
where \b is a word boundary (the beginning or end of a word), which requires that the match is not preceded or followed by additional word characters. So, given the string ap in the list lst, the word apple in the dataframe won't be matched.
Then use pandas.Series.str.extract and, to make it case-insensitive, pass flags=re.IGNORECASE:
import re
df['Explanation Extracted'] = df['Explanation'].str.extract(regex, flags=re.IGNORECASE, expand=False)
[Out]:
ID Explanation Explanation Extracted
0 1 fi doesn't work correctly fi
1 2 cap ples are cool NaN
2 3 this works but translation is ko ko
Option 2
One can also use pandas.Series.apply with a custom lambda function.
df['Explanation Extracted'] = df['Explanation'].apply(lambda x: next((i for i in lst if i.lower() in x.lower().split()), 'N/A'))
[Out]:
ID Explanation Explanation Extracted
0 1 fi doesn't work correctly fi
1 2 cap ples are cool N/A
2 3 this works but translation is ko ko
Notes:
.lower() makes the match case-insensitive.
.split() ensures that, even though ap is in the list, a word like apple is not matched.
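A self-contained sketch of Option 1 (sample data adapted from the question; the uppercase KO in the last row exercises the case-insensitive flag):

```python
import re
import pandas as pd

lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame({"Explanation": ["fi doesn't work correctly",
                                   "apples are cool",
                                   "this works but translation is KO"]})

# \b keeps "ap" from matching inside "apples"
regex = rf'\b({"|".join(lst)})\b'
df["Explanation Extracted"] = df["Explanation"].str.extract(
    regex, flags=re.IGNORECASE, expand=False)
print(df)
```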

How to replace last 2 characters if available in python

Parse the string and replace specific characters at the last two positions, if present.
E.g.: 0 → O, 5 → S, 1 → I
Input Data:
Col1
AZBYCCCD0
NZBY23HG1
BZYCQ05CO
YODZ225H0
NaN
CS45DRNZQ
Expected Output:
Col1
AZBYCCCDO
NZBY23HGI
BZYCQ05CO
YODZ225HO
NaN
CS45DRNZQ
I have been trying to use :
repl = {'0':'O','1':'I'}
df['Col1'] = df['Col1'].astype(str).str[7:].replace(repl, regex=True)
But the above script doesn't work.
Please suggest a fix.
The problem is that you are selecting only the last two characters, so the expression returns only those. You need to concatenate the untouched prefix with the replaced suffix.
import numpy as np
import pandas as pd

df = pd.DataFrame([
    'AZBYCCCD0',
    'NZBY23HG1',
    'BZYCQ05CO',
    'YODZ225H0',
    np.nan,
    'CS45DRNZQ'], columns=['Col1'])
repl = {'0': 'O', '1': 'I'}
df['Col1'].astype(str).str[:7] + df['Col1'].astype(str).str[7:].replace(repl, regex=True)
Which returns as expected:
0 AZBYCCCDO
1 NZBY23HGI
2 BZYCQ05CO
3 YODZ225HO
4 nan
5 CS45DRNZQ
Name: Col1, dtype: object
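The hard-coded [7:] assumes every value is exactly nine characters long; a length-independent variant can slice the last two characters with .str[-2:] (a sketch — the .str accessor propagates NaN, and fillna restores the missing values afterwards):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['AZBYCCCD0', 'NZBY23HG1', 'BZYCQ05CO',
                   'YODZ225H0', np.nan, 'CS45DRNZQ'], columns=['Col1'])

repl = {'0': 'O', '1': 'I', '5': 'S'}
# replace digits only within the last two characters, whatever the string length
fixed = df['Col1'].str[:-2] + df['Col1'].str[-2:].replace(repl, regex=True)
df['Col1'] = fixed.fillna(df['Col1'])
print(df['Col1'].tolist())
```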
This can be done via replace; modify the regex and the lambda if required:
repl = {'0': 'O', '1': 'I'}
df['Col1'] = df['Col1'].str[:7] + df['Col1'].str[7:].str.replace(r'0|1', lambda x: repl[x.group(0)], regex=True)
#or
df['Col1'] = df['Col1'].str[:7] + df['Col1'].str[7:].replace(repl, regex=True)
I think you can try something like this (assuming you are working on one word at a time):
repl = {'0': 'O', '1': 'I', '5': 'S'}

def func1(string):
    str_list = list(string)
    for i in range(len(str_list)):
        if str_list[i] in repl:
            str_list[i] = repl[str_list[i]]
    string = "".join(str_list)
    return string
I think this should work. It does not require any import.
x = "Col1,AZBYCCCD0,NZBY23HG1,BZYCQ05CO,YODZ225H0,NaN,CS45DRNZQ"
d = x.split(",")
for j in d:
    print("Original Word: ", j.strip())
    last_chars = j[-3:]
    j = j[:-3]  # slice off the suffix; str.strip(last_chars) would drop matching characters, not the suffix
    if "0" in last_chars:
        last_chars = last_chars.replace("0", "O")
    if "1" in last_chars:
        last_chars = last_chars.replace("1", "I")
    #print(last_chars)
    print("New Word: " + j + last_chars)

Removing numbers from strings in a Data frame column

I want to remove numbers from strings in a column, while at the same time keeping numbers that do not have any strings in the same column.
This is what the data looks like:
df=
id description
1 XG154LU
2 4562689
3 556
4 LE896E
5 65KKL4
This is how i want the output to look like:
id description
1 XGLU
2 4562689
3 556
4 LEE
5 KKL
I used the code below, but when I run it, it removes all the entries in the description column and replaces them with blanks:
def clean_text_round1(text):
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\r', '', text)
    return text
round1 = lambda x: clean_text_round1(x)
df['description'] = df['description'].apply(round1)
Try:
import numpy as np

df['description'] = np.where(df.description.str.contains(r'^\d+$'),
                             df.description,
                             df.description.str.replace(r'\d+', '', regex=True))
Output:
id description
1 XGLU
2 4562689
3 556
4 LEE
5 KKL
Logic:
Check whether the string contains only digits; if so, leave it alone and copy the number as-is. If the string mixes digits with other characters, replace the digits with the empty string '', leaving only the non-digit characters.
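The same logic as a runnable snippet, with raw strings and an explicit regex=True:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'description': ['XG154LU', '4562689', '556', 'LE896E', '65KKL4']})

# keep all-digit cells as-is; strip digits from mixed cells
df['description'] = np.where(df.description.str.contains(r'^\d+$'),
                             df.description,
                             df.description.str.replace(r'\d+', '', regex=True))
print(df['description'].tolist())  # ['XGLU', '4562689', '556', 'LEE', 'KKL']
```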
This should solve it for you.
def clean_text_round1(text):
    if type(text) == int:
        return text
    else:
        text = ''.join([i for i in text if not i.isdigit()])
        return text

df['description'] = df['description'].apply(clean_text_round1)
Let me know if this works for you. I am not sure about performance; you can use a regex instead of the join.
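One regex-based version of the same idea (clean_digits is a hypothetical name; it leaves pure numbers alone and strips digits everywhere else):

```python
import re

def clean_digits(text):
    # pure numbers pass through untouched; mixed strings lose their digits
    if isinstance(text, str) and not text.isdigit():
        return re.sub(r'\d+', '', text)
    return text

print([clean_digits(s) for s in ['XG154LU', '4562689', '556', 'LE896E', '65KKL4']])
# ['XGLU', '4562689', '556', 'LEE', 'KKL']
```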
def convert(v):
    # check whether the string contains anything besides digits
    if any([char.isalpha() for char in v]):
        va = [char for char in v if char.isalpha()]
        va = ''.join(va)
        return va
    else:
        return v

# apply() the function to a single column
df['description'] = df['description'].apply(convert)
print(df)
id description
0 XGLU
1 4562689
2 556
3 LEE
4 KKL

Python - Create dataframe by getting the numbers of alphabetic characters

I have a dataframe with a column called "Utterances", which contains strings (e.g.: "I wanna have a beer" is its first row).
What I need is to create a new dataframe that contains, for every row of "Utterances", the alphabet position of each letter.
For example, for "I wanna have a beer" I need to get the row: 9 23114141 81225 1 25518, since "I" is the 9th letter of the alphabet, "w" the 23rd, and so on. Note that I want the spaces (" ") to be kept.
What I have done so far is the following:
for word in df2[['Utterances']]:
    for character in word:
        new.append(ord(character.lower()) - 96)
str1 = ''.join(str(e) for e in new)
The above returns the concatenated string, but the loop only iterates once, and the string in str1 does not keep the required spaces (" "). Also, I cannot find a way to append the results to a new dataframe.
Any help would be greatly appreciated.
Thanks.
You can do
In [5572]: df
Out[5572]:
Utterances
0 I wanna have a beer
In [5573]: df['Utterances'].apply(lambda x: ' '.join([''.join(str(ord(c) - 96) for c in w)
                                                      for w in x.lower().split()]))
Out[5573]:
0 9 23114141 81225 1 25518
Name: Utterances, dtype: object
new = []
for word in ['I ab c def']:
    for character in word:
        if character == ' ':
            new.append(' ')
        else:
            new.append(ord(character.lower()) - 96)
str1 = ''.join(str(e) for e in new)
Output
9 12 3 456
Let's use a dictionary and get, which works if you only have letters and spaces, i.e.
import string

dic = {j: i + 1 for i, j in enumerate(string.ascii_lowercase)}
dic[' '] = ' '
df['new'] = df['Ut'].apply(lambda x: ''.join(str(dic.get(i)) for i in str(x).lower()))
Output :
Ut new
0 I wanna have a beer 9 23114141 81225 1 25518
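An alternative the answers above don't show: str.translate with a table built by str.maketrans converts every character in one pass (a sketch; spaces survive because they are not in the table):

```python
import string

# map each lowercase letter to its 1-based alphabet position
table = str.maketrans({c: str(i + 1) for i, c in enumerate(string.ascii_lowercase)})
print("I wanna have a beer".lower().translate(table))  # 9 23114141 81225 1 25518
```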

Removing stand-alone numbers from string in Python

There are a lot of similar questions, but I have not found a solution to my problem. I have a data frame with the following structure/form:
col_1
0 BULKA TARTA 500G KAJO 1
1 CUKIER KRYSZTAL 1KG KSC 4
2 KASZA JĘCZMIENNA 4*100G 2 0.92
3 LEWIATAN MAKARON WSTĄŻKA 1 0.89
However, I want to achieve the effect:
col_1
0 BULKA TARTA 500G KAJO
1 CUKIER KRYSZTAL 1KG KSC
2 KASZA JĘCZMIENNA 4*100G
3 LEWIATAN MAKARON WSTĄŻKA
So I want to remove the stand-alone integers and decimals, but keep the numbers that are attached to letters.
I tried df.col_1.str.isdigit().replace([True, False], [np.nan, df.col_1]), but it only checks whether the entire cell is a number or not.
Do you have any ideas how to do it? Or maybe it would be better to split the column on spaces and then compare?
We could create a function that tries to convert each token to float; if the conversion fails, we return True (not a float).
import pandas as pd

df = pd.DataFrame({"col_1": ["BULKA TARTA 500G KAJO 1",
                             "CUKIER KRYSZTAL 1KG KSC 4",
                             "KASZA JĘCZMIENNA 4*100G 2 0.92",
                             "LEWIATAN MAKARON WSTĄŻKA 1 0.89"]})

def is_not_float(string):
    try:
        float(string)
        return False
    except ValueError:  # string is not a number
        return True

df["col_1"] = df["col_1"].apply(lambda x: [i for i in x.split(" ") if is_not_float(i)])
df
Or, following the example of my fellow SO:ers (however, this would treat 130. as a number):
df["col_1"] = df["col_1"].apply(
    lambda x: [i for i in x.split(" ") if not i.replace(".", "").isnumeric()])
Returns
col_1
0 [BULKA, TARTA, 500G, KAJO]
1 [CUKIER, KRYSZTAL, 1KG, KSC]
2 [KASZA, JĘCZMIENNA, 4*100G]
3 [LEWIATAN, MAKARON, WSTĄŻKA]
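Since both variants return lists, a final .str.join(" ") turns them back into plain strings (a sketch on the same sample frame):

```python
import pandas as pd

df = pd.DataFrame({"col_1": ["BULKA TARTA 500G KAJO 1",
                             "CUKIER KRYSZTAL 1KG KSC 4",
                             "KASZA JĘCZMIENNA 4*100G 2 0.92",
                             "LEWIATAN MAKARON WSTĄŻKA 1 0.89"]})

def is_not_float(token):
    try:
        float(token)
        return False
    except ValueError:
        return True

# filter the tokens, then glue each list back into a string
df["col_1"] = (df["col_1"]
               .apply(lambda x: [t for t in x.split(" ") if is_not_float(t)])
               .str.join(" "))
print(df["col_1"].tolist())
```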
Sure, you could use a regex. Note that re.sub operates on a single string; for a whole column, use the vectorized str.replace with a pattern that only matches stand-alone numbers:
df.col_1 = (df.col_1
            .str.replace(r"(?<!\S)\d+(?:\.\d+)?(?!\S)", "", regex=True)
            .str.split().str.join(" "))
Here (?<!\S) and (?!\S) require that the number is not attached to any non-space characters, and the split/join pass tidies up the leftover spaces.
Yes, you can:
def no_nums(col):
    return ' '.join(filter(lambda word: not word.replace('.', '').isdigit(),
                           col.split()))

df.col_1.apply(no_nums)
This filters out the words that are made up entirely of digits, possibly with a decimal point.
If you also want to filter out numbers like 1,000, simply add another replace for ','.
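Folding the comma case into the same filter (a sketch):

```python
def no_nums(col):
    # also treat comma-grouped numbers like "1,000" as numbers
    return ' '.join(w for w in col.split()
                    if not w.replace('.', '').replace(',', '').isdigit())

print(no_nums("LEWIATAN MAKARON WSTĄŻKA 1 0.89 1,000"))  # LEWIATAN MAKARON WSTĄŻKA
```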
