I want to remove numbers from strings in a column, while keeping entries that consist of numbers only.
This is what the data looks like:
df=
id description
1 XG154LU
2 4562689
3 556
4 LE896E
5 65KKL4
This is how I want the output to look:
id description
1 XGLU
2 4562689
3 556
4 LEE
5 KKL
I used the code below, but when I run it, it removes all the entries in the description column and replaces them with blanks:
import re

def clean_text_round1(text):
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub(r'[‘’“”…]', '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\r', '', text)
    return text

df['description'] = df['description'].apply(clean_text_round1)
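To see why every entry comes back blank: \w*\d\w* matches any run of word characters that contains at least one digit, so it consumes whole tokens like 'XG154LU' and '4562689' rather than just their digits. A minimal standalone check, using sample values from the question:

import re

# The pattern swallows the entire token, mixed and digits-only alike.
print(re.sub(r'\w*\d\w*', '', 'XG154LU'))   # '' instead of 'XGLU'
print(re.sub(r'\w*\d\w*', '', '4562689'))   # '' instead of '4562689'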
Try:
import numpy as np
df['description'] = np.where(df.description.str.contains(r'^\d+$'),
                             df.description,
                             df.description.str.replace(r'\d+', '', regex=True))
Output:
id description
1 XGLU
2 4562689
3 556
4 LEE
5 KKL
Logic:
Check whether the string contains digits only; if so, leave it alone and copy the number as it is. If the string has numbers mixed with letters, replace the digits with the empty string '', leaving only the non-digit characters.
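For reference, a self-contained version of this answer using the sample frame from the question (regex=True added to str.replace, which newer pandas versions require):

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'description': ['XG154LU', '4562689', '556', 'LE896E', '65KKL4']})

# Keep digit-only entries as they are; strip digits from everything else.
df['description'] = np.where(df.description.str.contains(r'^\d+$'),
                             df.description,
                             df.description.str.replace(r'\d+', '', regex=True))
print(df)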
This should solve it for you.
def clean_text_round1(text):
    if isinstance(text, int):
        return text
    else:
        return ''.join(i for i in text if not i.isdigit())

df['description'] = df['description'].apply(clean_text_round1)
Let me know if this works for you. I am not sure about the speed; you could use a regex instead of the join.
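That regex variant might look like this (a sketch keeping the same integer guard):

import re

def clean_text_round1(text):
    if isinstance(text, int):
        return text  # leave genuine integers untouched
    return re.sub(r'\d+', '', text)  # strip digit runs from strings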
def convert(v):
    # check whether the string contains anything besides digits
    if any(char.isalpha() for char in v):
        return ''.join(char for char in v if char.isalpha())
    else:
        return v

# apply the function to a single column
df['description'] = df['description'].apply(convert)
print(df)
id description
0 XGLU
1 4562689
2 556
3 LEE
4 KKL
Related
I have a column with product names and information. I need to remove the codes from the names; every code starts with four or more zeros. Some names have four or more zeros in the weight, and some codes are joined to the name, as in the example below:
import pandas as pd

data = {
    'Name' : ['ANOA 250g 00004689', 'ANOA 10000g 00000059884', '80%c asjw 150000001568 ', 'Shivangi000000478761'],
}
testdf = pd.DataFrame(data)
The correct output would be:
results = {
'Name' : ['ANOA 250g', 'ANOA 10000g', '80%c asjw 150000001568 ', 'Shivangi'],
}
results = pd.DataFrame(results)
You can split the strings at the start of the code pattern, which is expressed by the regex (?<!\d)0{4,}. This pattern consumes four or more 0s that are not preceded by any digit. After splitting the string, take the first fragment; str.strip gets rid of a possible trailing space:
testdf.Name.str.split(r'(?<!\d)0{4,}', regex=True, expand=True)[0].str.strip()
# outputs:
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
note that this works for the case where the codes are always at the end of your string.
Use a regex with str.replace:
testdf['Name'] = testdf['Name'].str.replace(r'(?:(?<=\D)|\s*\b)0{4}\d*',
                                            '', regex=True)
Or, similarly to @HaleemurAli, with a negative lookbehind:
testdf['Name'] = testdf['Name'].str.replace(r'\s*(?<!\d)0{4,}\d*',
                                            '', regex=True)
Output:
Name
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
Try splitting on the spaces and dropping any token that starts with four zeros, as in the sketch below (note this only handles codes that form their own token, not ones glued to the name like the 'Shivangi' example):
answer = []
for i in testdf["Name"]:
    answer.append(" ".join(j for j in i.split() if not j.startswith("0000")))
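A sketch of writing the cleaned values back onto the frame:

testdf["Name"] = answer
print(testdf)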
I have a data frame where one column should hold string values and the other integers, but the columns contain special characters, and the string data is mixed with digits. I used a regex to remove the unwanted characters. It mostly works, but in the integer column a value like 'abc200' keeps its 'abc', and in the string column a value like '124abc' keeps its digits. I don't know whether the pattern or the code is wrong. Below is my code:
import re
import pandas as pd

d = [['abc','123'],['gbc#*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1 = pd.DataFrame(d, columns=['str','int'])
print(df1)
str int
0 abc 123
1 gbc#* 23abc
2 abc123 abc200
3 124abc 1230&*
4 abcer£$%&*! 230!?*&
num = r'\d+$'
alpha = r'[a-zA-Z]+$'
wrong = df1[~df1['int'].str.contains(num, na=True)]
correct_int = [re.sub(r'([^\d]+?)', '', item) for item in wrong['int']]
print(correct_int)
wrong_str = df1[~df1['str'].str.contains(alpha, na=True)]
correct_str = [re.sub(r'([^a-zA-Z ]+?)', '', item) for item in df1['str']]
print(correct_str)
Output:
correct_int: ['23', '1230', '230']
As you can see, it removed the unwanted characters from '23abc', '1230&*' and '230!?*&', but not from 'abc200', because the letters come first.
correct_str: ['abc', 'gbc', 'abc', 'abc', 'abcer']
This time it cleaned all of them, but sometimes it does not remove the digits when the value is like '124abc'.
Is my pattern wrong? I have also tried different patterns, but nothing worked.
To summarize: I am removing the digits and special characters from the 'str' column, and the letters and special characters from the 'int' column.
Expected output:
After cleaning and replacing the old values with the cleaned ones, the output would look like this:
str int
0 abc 123
1 gbc 23
2 abc 200
3 abc 1230
4 abcer 230
You can do it with:
df1['str'] = df1['str'].str.replace(r"[\d\W]+", '', regex=True)  # removes digits (\d) and non-word characters (\W)
df1['int'] = df1['int'].str.replace(r"\D+", '', regex=True)  # removes any non-digit character (\D, like [^0-9])
Returns:
str int
0 abc 123
1 gbc 23
2 abc 200
3 abc 1230
4 abcer 230
Try the following:
'\D' matches any non-digit character; substitute those with the empty string '' in the int column.
'[^a-zA-Z]' matches any character outside the ranges a-z and A-Z; substitute those with the empty string '' in the str column.
Apply these transformations to both columns using .apply() and a lambda function:
import pandas as pd
import re
d = [['abc','123'],['gbc#*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1 = pd.DataFrame(d, columns=['str','int'])
df1['int'] = df1['int'].apply(lambda r: re.sub(r'\D', '', r))
df1['str'] = df1['str'].apply(lambda r: re.sub(r'[^a-zA-Z]', '', r))
print(df1)
Output:
str int
0 abc 123
1 gbc 23
2 abc 200
3 abc 1230
4 abcer 230
I have a dataframe with a column called "Utterances", which contains strings (e.g.: "I wanna have a beer" is its first row).
What I need is to create a new data frame containing, for every row of "Utterances", the alphabet position of each of its letters.
This means that for example in the case of "I wanna have a beer", I need to get the following row: 9 23114141 81225 1 25518, since "I" is the 9th letter of the alphabet, "w" the 23rd and so on. Notice that I want the spaces " " to be maintained.
What I have done so far is the following:
for word in df2[['Utterances']]:
    for character in word:
        new.append(ord(character.lower()) - 96)
str1 = ''.join(str(e) for e in new)
The above returns the concatenated string, but the loop only iterates once, and the string returned in str1 does not keep the required spaces (" "). And of course, I cannot find a way to append these lines to a new dataframe.
Any help would be greatly appreciated.
Thanks.
You can do
In [5572]: df
Out[5572]:
Utterances
0 I wanna have a beer
In [5573]: df['Utterances'].apply(lambda x: ' '.join([''.join(str(ord(c)-96) for c in w)
for w in x.lower().split()]))
Out[5573]:
0 9 23114141 81225 1 25518
Name: Utterances, dtype: object
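As an aside, the original loop iterates only once because iterating over a DataFrame, even a one-column slice like df2[['Utterances']], yields its column labels rather than its rows. A minimal check:

import pandas as pd

df2 = pd.DataFrame({'Utterances': ['I wanna have a beer']})
for word in df2[['Utterances']]:
    print(word)  # prints 'Utterances' once: column labels, not row values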
new = []
for word in ['I ab c def']:
    for character in word:
        if character == ' ':
            new.append(' ')
        else:
            new.append(ord(character.lower()) - 96)
str1 = ''.join(str(e) for e in new)
print(str1)
Output
9 12 3 456
Let's use a dictionary and get; this works if the strings contain only letters and spaces:
import string

dic = {j: i + 1 for i, j in enumerate(string.ascii_lowercase)}
dic[' '] = ' '
df['new'] = df['Utterances'].apply(lambda x: ''.join(str(dic.get(i)) for i in str(x).lower()))
Output:
            Utterances                       new
0  I wanna have a beer  9 23114141 81225 1 25518
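A translation-table variant of the same dictionary idea (a sketch, not from the answers above; str.maketrans maps each letter to its position and leaves spaces untouched):

import string

table = str.maketrans({c: str(i) for i, c in enumerate(string.ascii_lowercase, start=1)})
df['new'] = df['Utterances'].str.lower().str.translate(table)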
There are a lot of similar questions, but I have not found a solution for my problem. I have a data frame with the following structure/form:
col_1
0 BULKA TARTA 500G KAJO 1
1 CUKIER KRYSZTAL 1KG KSC 4
2 KASZA JĘCZMIENNA 4*100G 2 0.92
3 LEWIATAN MAKARON WSTĄŻKA 1 0.89
However, I want to achieve the effect:
col_1
0 BULKA TARTA 500G KAJO
1 CUKIER KRYSZTAL 1KG KSC
2 KASZA JĘCZMIENNA 4*100G
3 LEWIATAN MAKARON WSTĄŻKA
So I want to remove the standalone natural and decimal numbers, but leave the numbers that sit inside a token together with letters.
I tried to use df.col_1.str.isdigit().replace([True, False], [np.nan, df.col_1]), but it only compares the entire cell to decide whether it is a number or not.
Do you have any ideas how to do it? Or maybe it would be better to split the column on the spaces and then compare the pieces?
We could create a function that tries to convert the string to float; if that fails, it returns True (not a float).
import pandas as pd
df = pd.DataFrame({"col_1" : ["BULKA TARTA 500G KAJO 1",
"CUKIER KRYSZTAL 1KG KSC 4",
"KASZA JĘCZMIENNA 4*100G 2 0.92",
"LEWIATAN MAKARON WSTĄŻKA 1 0.89"]})
def is_not_float(string):
    try:
        float(string)
        return False
    except ValueError:  # string is not a number
        return True

df["col_1"] = df["col_1"].apply(lambda x: [i for i in x.split(" ") if is_not_float(i)])
df
df
Or, following the example of my fellow SO users; however, this would treat '130.' as a number:
df["col_1"] = (df["col_1"].apply(
lambda x: [i for i in x.split(" ") if not i.replace(".","").isnumeric()]))
Returns
col_1
0 [BULKA, TARTA, 500G, KAJO]
1 [CUKIER, KRYSZTAL, 1KG, KSC]
2 [KASZA, JĘCZMIENNA, 4*100G]
3 [LEWIATAN, MAKARON, WSTĄŻKA]
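Both variants leave a list in each row; Series.str.join stitches the tokens back into strings matching the desired output:

df["col_1"] = df["col_1"].str.join(" ")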
Sure, you could use a regex with str.replace, matching only number tokens that stand on their own (whitespace before, whitespace or end of string after):
df.col_1 = df.col_1.str.replace(r'\s+\d+(?:\.\d+)?(?=\s|$)', '', regex=True)
Yes, you can:
def no_nums(col):
    return ' '.join(filter(lambda word: not word.replace('.', '').isdigit(),
                           col.split()))

df.col_1.apply(no_nums)
This filters out of each value the words that are made up entirely of digits, possibly with a decimal point. If you want to filter out numbers like 1,000 as well, simply add another replace for ',', as in the sketch below.
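A sketch of that comma-aware variant:

def no_nums(col):
    # also treat '1,000'-style tokens as pure numbers
    return ' '.join(w for w in col.split()
                    if not w.replace('.', '').replace(',', '').isdigit())

df.col_1 = df.col_1.apply(no_nums)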
There are similar answers, but I could not apply them to my own case.
I want to get rid of characters that are forbidden in Windows directory names in my pandas dataframe. I tried to use something like:
df1['item_name'] = "".join(x for x in df1['item_name'].rstrip() if x.isalnum() or x in [" ", "-", "_"]) if df1['item_name'] else ""
Assume I have a dataframe like this
item_name
0 st*back
1 yhh?\xx
2 adfg%s
3 ghytt&{23
4 ghh_h
I want to get:
item_name
0 stback
1 yhhxx
2 adfgs
3 ghytt23
4 ghh_h
How could I achieve this?
Note: I scraped the data from the internet earlier, and used the following code on the older version:
item_name = "".join(x for x in item_name.text.rstrip() if x.isalnum() or x in [" ", "-", "_"]) if item_name else ""
Now I have new observations for the same items and I want to merge them with the older observations, but I forgot to apply the same cleaning when I re-scraped.
You could summarize the condition as a negated character class and use str.replace to remove the matches. Here \w stands for word characters (alphanumerics plus _), \s stands for whitespace, and - is a literal dash. With ^ at the start of the class, [^\w\s-] matches any character that is not alphanumeric and not one of " ", "-", "_":
df.item_name.str.replace(r"[^\w\s-]", "", regex=True)
#0 stback
#1 yhhxx
#2 adfgs
#3 ghytt23
#4 ghh_h
#Name: item_name, dtype: object
Try
import re
df.item_name.apply(lambda x: re.sub(r'\W+', '', x))
0 stback
1 yhhxx
2 adfgs
3 ghytt23
4 ghh_h
If you have a properly escaped list of characters:
lst = ['\\\\', '\*', '\?', '%', '&', '\{']
df.replace(lst, '', regex=True)
item_name
0 stback
1 yhhxx
2 adfgs
3 ghytt23
4 ghh_h
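Rather than escaping by hand, re.escape can build the same list programmatically (a sketch using the characters from the example above):

import re

chars = ['\\', '*', '?', '%', '&', '{']
lst = [re.escape(c) for c in chars]  # yields patterns safe to pass with regex=True
df.replace(lst, '', regex=True)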