I want to replace specific characters in the last two positions of each string, if they are present.
For example: 0 = O, 5 = S, 1 = I
Input Data:
Col1
AZBYCCCD0
NZBY23HG1
BZYCQ05CO
YODZ225H0
NaN
CS45DRNZQ
Expected Output:
Col1
AZBYCCCDO
NZBY23HGI
BZYCQ05CO
YODZ225HO
NaN
CS45DRNZQ
I have been trying to use:
repl = {'0':'O','1':'I'}
df['Col1'] = df['Col1'].astype(str).str[7:].replace(repl, regex=True)
But the above script doesn't work.
Please suggest a fix.
The problem you are seeing is that you are only selecting the last two characters and thus the function only returns the last two. You need to combine the original portion with the replaced portion.
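import numpy as np
import pandas as pd

df = pd.DataFrame([
    'AZBYCCCD0',
    'NZBY23HG1',
    'BZYCQ05CO',
    'YODZ225H0',
    np.nan,  # pd.np was removed from pandas; use numpy directly
    'CS45DRNZQ'], columns=['Col1'])
repl = {'0': 'O', '1': 'I'}
# keep the first 7 characters as they are and apply the replacements only to the rest
df['Col1'].astype(str).str[:7] + df['Col1'].astype(str).str[7:].replace(repl, regex=True)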
Which returns as expected:
0 AZBYCCCDO
1 NZBY23HGI
2 BZYCQ05CO
3 YODZ225HO
4 nan
5 CS45DRNZQ
Name: Col1, dtype: object
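The same split-and-rejoin idea can also be written as a plain function, which makes it easy to leave missing values alone instead of turning them into the string 'nan'. This is only a sketch: fix_tail, TAIL_MAP and the parameter n are names made up for this illustration, and it assumes that only 0 and 1 need replacing.

```python
# Hypothetical mapping for the illustration; add "5": "S" if that is wanted too.
TAIL_MAP = str.maketrans({"0": "O", "1": "I"})


def fix_tail(value, n=2):
    """Apply TAIL_MAP to the last n characters only (n >= 1)."""
    if not isinstance(value, str):
        return value  # NaN / None pass through unchanged
    return value[:-n] + value[-n:].translate(TAIL_MAP)


codes = ["AZBYCCCD0", "NZBY23HG1", "BZYCQ05CO", "YODZ225H0", float("nan"), "CS45DRNZQ"]
fixed = [fix_tail(code) for code in codes]
print(fixed)
```

With a DataFrame, the function could be applied as df['Col1'].map(fix_tail).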
This can be done with replace; modify the regex and the lambda function if required:
repl = {'0':'O','1':'I'}
df['Col1'] = df['Col1'].str[:7] + df['Col1'].str[7:].str.replace(r'0|1', lambda x : repl[x.group(0)], regex=True)
#or
df['Col1'] = df['Col1'].str[:7] + df['Col1'].str[7:].replace(repl, regex=True)
I think you can try something like this (assuming you are working on one word at a time):
repl = {'0':'O','1':'I', '5':'S'}
def func1(string):
    str_list = list(string)
    # only look at the last two positions, as the question asks
    for i in range(max(len(str_list) - 2, 0), len(str_list)):
        if str_list[i] in repl:
            str_list[i] = repl[str_list[i]]
    string = "".join(str_list)
    return string
I think this should work. It does not require any import.
x = "Col1,AZBYCCCD0,NZBY23HG1,BZYCQ05CO,YODZ225H0,NaN,CS45DRNZQ"
d = x.split(",")[1:]  # skip the 'Col1' header
for j in d:
    j = j.strip()
    print("Original Word: ", j)
    # slice off the last two characters; str.strip(chars) removes a set of
    # characters from both ends, not a suffix, so it cannot be used here
    head, last_chars = j[:-2], j[-2:]
    last_chars = last_chars.replace("0", "O").replace("1", "I")
    print("New Word: " + head + last_chars)
Let's say I have a list
lst = ["fi", "ap", "ko", "co", "ex"]
and we have this series
Explanation
a "fi doesn't work correctly"
b "apples are cool"
c "this works but translation is ko"
and I'm looking to get something like this:
Explanation Explanation Extracted
a "fi doesn't work correctly" "fi"
b "apples are cool" "N/A"
c "this works but translation is ko" "ko"
With a dataframe like
df = pd.DataFrame(
{"Explanation": ["fi doesn't co work correctly",
"apples are cool",
"this works but translation is ko"]},
index=["a", "b", "c"]
)
you can use .str.extract() to do
lst = ["fi", "ap", "ko", "co", "ex"]
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = df.Explanation.str.extract(pattern, expand=False)
to get
Explanation Explanation Extracted
a fi doesn't co work correctly fi
b apples are cool NaN
c this works but translation is ko ko
The regex pattern r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)" looks for one of the lst items either at the beginning with whitespace after it, in the middle with whitespace before and after, or at the end with whitespace before it. str.extract() returns the capture group (the part in the middle, in parentheses). Without a match the result is NaN.
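To see what the pattern does outside pandas, here is a small sketch using only the re module. The helper name first_item is made up for this illustration, and wrapping the items in re.escape is an added precaution that assumes the list holds literal text rather than regex fragments.

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
# Same shape as the pattern above; re.escape is an assumption (literal items).
pattern = re.compile(r"(?:^|\s+)(" + "|".join(map(re.escape, lst)) + r")(?:\s+|$)")


def first_item(text):
    """Return the first whitespace-delimited list item in text, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


sentences = [
    "fi doesn't co work correctly",
    "apples are cool",  # contains "ap" and "co", but only inside longer words
    "this works but translation is ko",
]
results = [first_item(s) for s in sentences]
print(results)
```

The second sentence yields None because "ap" and "co" are never surrounded by whitespace or the string boundaries.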
If you want to extract multiple matches, you could use .str.findall() and then ", ".join the results:
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
df.Explanation.str.findall(pattern).str.join(", ").replace({"": None})
)
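One thing to watch with findall: the pattern consumes the whitespace after each match, so two list items that sit directly next to each other share a single space and the second one is missed. A possible variant, sketched here with the re module only, uses lookarounds that check for whitespace without consuming it. The variable names are made up for this illustration.

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
alternatives = "|".join(map(re.escape, lst))

# Pattern from the answer: the surrounding whitespace is part of the match.
consuming = r"(?:^|\s+)(" + alternatives + r")(?:\s+|$)"
# Variant: (?<!\S) and (?!\S) only assert "no non-space character here".
lookaround = r"(?<!\S)(" + alternatives + r")(?!\S)"

text = "fi co works"
found_consuming = re.findall(consuming, text)
found_lookaround = re.findall(lookaround, text)
print(found_consuming, found_lookaround)
```

The lookaround form can be passed to .str.findall() in the same way as the original pattern.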
Alternative without regex:
df.index = df.index.astype("category")
matches = df.Explanation.str.split().explode().loc[lambda s: s.isin(lst)]
df["Explanation Extracted"] = (
matches.groupby(level=0).agg(set).str.join(", ").replace({"": None})
)
If you only want to match at the beginning or end of the sentences, then replace the first part with:
df.index = df.index.astype("category")
splitted = df.Explanation.str.split()
matches = (
(splitted.str[:1] + splitted.str[-1:]).explode().loc[lambda s: s.isin(lst)]
)
...
I think this solves your problem.
import pandas as pd
lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame([["fi doesn't work correctly"],["apples are cool"],["this works but translation is ko"]],columns=["Explanation"])
extracted =[]
for index, row in df.iterrows():
    tempList =[]
    rowSplit = row['Explanation'].split(" ")
    for val in rowSplit:
        if val in lst:
            tempList.append(val)
    if len(tempList)>0:
        extracted.append(','.join(tempList))
    else:
        extracted.append('N/A')
df['Explanation Extracted'] = extracted
The apply function of pandas might be helpful:
def extract_explanation(row):
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    # the column is named 'Explanation' (capital E)
    substrings = row['Explanation'].split(" ")
    explanation = "N/A"
    for string in substrings:
        if string in custom_substring:
            explanation = string
    return explanation
df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)
The catch here is the assumption of only one explanation per row (if several words match, the last one wins), but the function can be changed to return a list if multiple explanations are expected.
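A minimal sketch of that list-returning variant, in plain Python so it can be tried without a DataFrame. The names KEYWORDS, extract_all and extract_joined are invented for this example.

```python
KEYWORDS = {"fi", "ap", "ko", "co", "ex"}


def extract_all(text, keywords=KEYWORDS):
    """Return every whole word of text that is in keywords, in order of appearance."""
    return [word for word in text.split() if word in keywords]


def extract_joined(text, keywords=KEYWORDS, missing="N/A"):
    """Join all matches with ', ', or return the placeholder if there are none."""
    found = extract_all(text, keywords)
    return ", ".join(found) if found else missing


print(extract_all("fi doesn't co work correctly"))
print(extract_joined("apples are cool"))
```

On a DataFrame this could be used as df['Explanation'].apply(extract_joined).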
Option 1
Assuming that one wants to extract the exact strings in the list lst, one can start by creating a regex:
regex = f'\\b({"|".join(lst)})\\b'
where \b is a word boundary (the beginning or end of a word): it requires that the match is neither preceded nor followed by further word characters. So, although the string ap is in the list lst, the word apple in the dataframe will not be matched.
Then use pandas.Series.str.extract, passing re.IGNORECASE to make the match case-insensitive:
import re
df['Explanation Extracted'] = df['Explanation'].str.extract(regex, flags=re.IGNORECASE, expand=False)
[Out]:
ID Explanation Explanation Extracted
0 1 fi doesn't work correctly fi
1 2 cap ples are cool NaN
2 3 this works but translation is ko ko
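To check how \b behaves on individual strings, the same regex can be tried with the re module alone. This is a sketch; boundary_match is a name made up for the illustration. Note that \b also counts punctuation as a boundary, so an item followed by a hyphen still matches, which differs from a purely whitespace-based check.

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
regex = re.compile(f'\\b({"|".join(lst)})\\b', flags=re.IGNORECASE)


def boundary_match(text):
    """Return the first list item that appears as a whole word, or None."""
    match = regex.search(text)
    return match.group(1) if match else None


print(boundary_match("apples are cool"))   # "ap" and "co" are inside longer words
print(boundary_match("KO is the result"))  # case-insensitive match
print(boundary_match("co-op works"))       # "-" is not a word character
```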
Option 2
One can also use pandas.Series.apply with a custom lambda function.
df['Explanation Extracted'] = df['Explanation'].apply(lambda x: next((i for i in lst if i.lower() in x.lower().split()), 'N/A'))
[Out]:
ID Explanation Explanation Extracted
0 1 fi doesn't work correctly fi
1 2 cap ples are cool N/A
2 3 this works but translation is ko ko
Notes:
.lower() is to make it case insensitive.
.split() makes the comparison work on whole words, so even though ap is in the list, a word such as apple does not produce a match in the Explanation Extracted column.
I would like to know how to convert the first letter of each word in this column:
Test
There is a cat UNDER the table
The pen is working WELL.
Into lower case, in order to have
Test
there is a cat uNDER the table
the pen is working wELL.
I know there is capitalize() but I would need a function which does the opposite.
Many thanks
Please note that the strings are within a column.
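I don't believe there is a builtin for this, but I could be mistaken. It is, however, quite easy to do with a generator expression:
" ".join(i[:1].lower()+i[1:] for i in line.split(" "))
Where line is each individual line. Slicing with i[:1] instead of i[0] avoids an IndexError on the empty strings that split(" ") produces for consecutive spaces.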
According to this solution you can do :
>>> func = lambda s: s[:1].lower() + s[1:] if s else ''
>>> sent = "There is a cat UNDER the table "
>>> res = " ".join(list(map(func , sent.split())))
>>> res
'there is a cat uNDER the table'
If only the first word of each row needs to be lowercased, you can use .str.lower, .str.split and ' '.join:
s=df.Test.str.split()
df.Test=s.str[0].str.lower()+' '+s.str[1:].agg(' '.join)
This is the same as splitting the words with .str.split and then modifying them with apply:
df.Test=df.Test.str.split().apply(lambda x: [x[0].lower()]+x[1:] ).agg(' '.join)
Both outputs:
df
Test
0 there is a cat UNDER the table
1 the pen is working WELL.
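The output above changes only the first word of each row, whereas the question asks for the first letter of every word. One possible way to get that, sketched with the re module, is to lowercase each character that starts a whitespace-delimited word. WORD_START and lower_first_letters are names invented for this example.

```python
import re

# (?<!\S)\S matches a non-space character that is not preceded by another
# non-space character, i.e. the first character of each word.
WORD_START = re.compile(r"(?<!\S)\S")


def lower_first_letters(text):
    """Lowercase the first character of every word; spacing is left as it is."""
    return WORD_START.sub(lambda m: m.group().lower(), text)


print(lower_first_letters("There is a cat UNDER the table"))
print(lower_first_letters("The pen is working WELL."))
```

On the column itself, the same pattern and lambda should work with df.Test.str.replace(..., regex=True), since str.replace accepts a callable as the replacement when regex=True.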
I have a data frame that looks like this:
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT
I want to be able to strip suffixes such as ,60 ,R-12,HT using regex and also delete the moreinfo and GoCats rows from the df.
My expected Results:
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
2 MZF8-050Z-AAB
3 MZA2-0580-TFD
I first removed the strings:
to_drop = ['hello', 'moreinfo']  # "del" is a reserved word in Python and cannot be a variable name
for i in to_drop:
    df = df[df['value'] != i]
Can somebody suggest a way to use regex to match and delete all cases that do not match the A067-M4FL-CAA-020 or MZF8-050Z-AAB pattern, so I don't have to create a list of all possible cases?
I was able to strip a single line like this, but I want to be able to strip all matching cases in the dataframe:
pattern = r',\w+ \,\w+-\w+\,\w+ *'
line = 'MRF2-050A-TFC,60 ,R-12,HT'
for i in re.findall(pattern, line):
    line = line.replace(i,'')
>>> MRF2-050A-TFC
I tried adjusting my code, but it prints out the same output for each row:
pattern = r',\w+ \,\w+-\w+\,\w+ *'
for d in df:
    for i in re.findall(pattern, d):
        d = d.replace(i,'')
Any suggestions will be greatly appreciated. Thanks
You may try this
(?:\w+-){2,}[^,\n]*
A Python script may look as follows:
ss="""0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT"""
import re
regx=re.compile(r'(?:\w+-){2,}[^,\n]*')
m= regx.findall(ss)
for i in range(len(m)):
    print("%d %s" %(i, m[i]))
and the output is
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
2 MZF8-050Z-AAB
3 MZA2-0580-TFD
Here's a simpler approach you can try without the re module. pandas has many built-in functions to deal with text data.
# remove unwanted values (regex=True must be passed explicitly in pandas 2.0 and later)
df['value'] = df.value.str.replace(r'moreinfo|60|R-.*|HT|GoCats|,', '', regex=True).str.strip()
# drop the rows that are now empty
df = df[(df != '')].dropna()
# print
print(df)
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
3 MZF8-050Z-AAB
5 MZA2-0580-TFD
-----------
# data used
from io import StringIO
import pandas as pd
df = pd.read_fwf(StringIO(u'''
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT'''),header=1)
I'd suggest capturing the data you DO want, since it's pretty particular, and the data you do NOT want could be anything.
Your pattern should look something like this:
^\w{4}-\w{4}-\w{3}(?:-\d{3})?
https://regex101.com/r/NtH2Ut/2
I'd recommend being more specific than \w where possible (for example ^[A-Z]\w{3} if you know the first four-character chunk should start with a letter).
edit
Sorry, I may not have read your input and output literally enough:
https://regex101.com/r/NtH2Ut/3
^(?:\d+\s+\w{4}-\w{4}-\w{3}(?:-\d{3})?)|^\s+.*
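The "capture what you want" idea can be sketched on the bare values, without the row numbers, using the first pattern from this answer. The names WANTED, values and cleaned are made up for the illustration; re.match already anchors at the start of each string, so the leading ^ is left out.

```python
import re

# Four word characters, four word characters, three word characters,
# optionally followed by a dash and three digits.
WANTED = re.compile(r"\w{4}-\w{4}-\w{3}(?:-\d{3})?")

values = [
    "A067-M4FL-CAA-020",
    "MRF2-050A-TFC,60 ,R-12,HT",
    "moreinfo",
    "MZF8-050Z-AAB",
    "GoCats",
    "MZA2-0580-TFD,60 ,R-669,LT",
]

# Keep only the matched part of each value and drop values that do not match.
cleaned = [m.group(0) for m in map(WANTED.match, values) if m]
print(cleaned)
```

In pandas, a similar result should be obtainable with df['value'].str.extract(r'^(\w{4}-\w{4}-\w{3}(?:-\d{3})?)', expand=False) followed by dropna().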
There are a lot of similar questions, but I have not found a solution for my problem. I have a data frame with the following structure/form:
col_1
0 BULKA TARTA 500G KAJO 1
1 CUKIER KRYSZTAL 1KG KSC 4
2 KASZA JĘCZMIENNA 4*100G 2 0.92
3 LEWIATAN MAKARON WSTĄŻKA 1 0.89
However, I want to achieve the effect:
col_1
0 BULKA TARTA 500G KAJO
1 CUKIER KRYSZTAL 1KG KSC
2 KASZA JĘCZMIENNA 4*100G
3 LEWIATAN MAKARON WSTĄŻKA
So I want to remove the standalone integers and decimal numbers, but keep the digits that are part of a token containing letters.
I tried to use df.col_1.str.isdigit().replace([True, False],[np.nan, df.col_1]) , but it only works on comparing the entire cell whether it is a number or not.
Do you have any ideas how to do it? Or maybe it would be good to split the column on spaces and then compare?
We could create a function that tries to convert to float. If it fails we return True (not_float)
import pandas as pd
df = pd.DataFrame({"col_1" : ["BULKA TARTA 500G KAJO 1",
"CUKIER KRYSZTAL 1KG KSC 4",
"KASZA JĘCZMIENNA 4*100G 2 0.92",
"LEWIATAN MAKARON WSTĄŻKA 1 0.89"]})
def is_not_float(string):
    try:
        float(string)
        return False
    except ValueError: # String is not a number
        return True
df["col_1"] = df["col_1"].apply(lambda x: [i for i in x.split(" ") if is_not_float(i)])
df
Or, following the example of the other answers here, strip the decimal point and test with isnumeric(). Note, however, that this would also treat 130. as a number.
df["col_1"] = (df["col_1"].apply(
lambda x: [i for i in x.split(" ") if not i.replace(".","").isnumeric()]))
Returns
col_1
0 [BULKA, TARTA, 500G, KAJO]
1 [CUKIER, KRYSZTAL, 1KG, KSC]
2 [KASZA, JĘCZMIENNA, 4*100G]
3 [LEWIATAN, MAKARON, WSTĄŻKA]
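Sure, you could use a regex. Note that re.sub works on a single string, not on a Series, so use the .str accessor instead. The pattern below removes only whitespace-delimited numbers, so tokens such as 500G or 4*100G are kept:
df.col_1 = df.col_1.str.replace(r"\s+\d+(?:\.\d+)?(?=\s|$)", "", regex=True)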
Yes you can:
def no_nums(col):
    return ' '.join(filter(lambda word: not word.replace('.','').isdigit(), col.split()))
df.col_1.apply(no_nums)
This filters out of each value the words that consist entirely of digits, optionally with a decimal point.
If you want to filter out numbers like 1,000 as well, simply add another replace for ','.
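As a comparison, here is a regex-based sketch of the same filtering that can be tested on plain strings. STANDALONE_NUMBER and drop_standalone_numbers are hypothetical names, and accepting both '.' and ',' as the separator is an assumption made for this example.

```python
import re

# A run of digits, optionally followed by '.' or ',' and more digits, that is
# delimited by whitespace or the string boundaries on both sides.
STANDALONE_NUMBER = re.compile(r"(?<!\S)\d+(?:[.,]\d+)?(?!\S)")


def drop_standalone_numbers(text):
    """Remove standalone numbers, then collapse the leftover whitespace."""
    without_numbers = STANDALONE_NUMBER.sub("", text)
    return " ".join(without_numbers.split())


rows = [
    "BULKA TARTA 500G KAJO 1",
    "CUKIER KRYSZTAL 1KG KSC 4",
    "KASZA JĘCZMIENNA 4*100G 2 0.92",
    "LEWIATAN MAKARON WSTĄŻKA 1 0.89",
]
cleaned_rows = [drop_standalone_numbers(row) for row in rows]
print(cleaned_rows)
```

Tokens such as 500G, 1KG and 4*100G survive because the digits in them touch a non-space character. A token with a trailing dot, such as 130., is not removed by this pattern.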
There are similar answers, but I could not apply them to my own case.
I want to get rid of the characters that are forbidden in Windows directory names in my pandas dataframe. I tried to use something like:
df1['item_name'] = "".join(x for x in df1['item_name'].rstrip() if x.isalnum() or x in [" ", "-", "_"]) if df1['item_name'] else ""
Assume I have a dataframe like this
item_name
0 st*back
1 yhh?\xx
2 adfg%s
3 ghytt&{23
4 ghh_h
I want to get:
item_name
0 stback
1 yhhxx
2 adfgs
3 ghytt23
4 ghh_h
How could I achieve this?
Note: I scraped data from internet earlier, and used the following code for the older version
item_name = "".join(x for x in item_name.text.rstrip() if x.isalnum() or x in [" ", "-", "_"]) if item_name else ""
Now, I have new observations for the same items and I want to merge them with older observations. But I forgot to use the same method when I rescraped
You can express the condition as a negated character class and use str.replace to remove everything it matches. Here \w stands for word characters (alphanumerics plus _), \s stands for whitespace, and - is a literal dash. With ^ at the start of the class, [^\w\s-] matches any character that is neither alphanumeric nor one of [" ", "-", "_"]:
df.item_name.str.replace(r"[^\w\s-]", "", regex=True)
#0 stback
#1 yhhxx
#2 adfgs
#3 ghytt23
#4 ghh_h
#Name: item_name, dtype: object
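The character class can be checked on plain strings before applying it to the column. The sketch below uses hypothetical names (DISALLOWED, sanitize_name) and also shows how it differs from a bare \W, which removes spaces and dashes as well. The trailing rstrip() mirrors the rstrip() in the question's original scraping code.

```python
import re

# Anything that is not a word character, whitespace or a dash is removed.
DISALLOWED = re.compile(r"[^\w\s-]")


def sanitize_name(name):
    """Drop disallowed characters and any trailing whitespace."""
    return DISALLOWED.sub("", name).rstrip()


names = ["st*back", "yhh?\\xx", "adfg%s", "ghytt&{23", "ghh_h"]
sanitized = [sanitize_name(name) for name in names]
print(sanitized)

# For comparison: \W+ keeps only word characters, so the space and dash go too.
kept = sanitize_name("my item-1*")
stripped = re.sub(r"\W+", "", "my item-1*")
print(kept, stripped)
```

Which of the two behaviours is right depends on whether spaces and dashes were kept in the older observations; the question's original code kept them.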
Try
import re
# note: \W removes spaces and dashes too, not only the special characters
df.item_name.apply(lambda x: re.sub(r'\W+', '', x))
0 stback
1 yhhxx
2 adfgs
3 ghytt23
4 ghh_h
If you have a properly escaped list of characters (raw strings avoid invalid-escape warnings):
lst = [r'\\', r'\*', r'\?', '%', '&', r'\{']
df.replace(lst, '', regex=True)
item_name
0 stback
1 yhhxx
2 adfgs
3 ghytt23
4 ghh_h