I have a list of strings and I would like to verify some conditions on the strings. For example:
String_1: 'The price is 15 euros'.
String_2: 'The price is 14 euros'.
Condition: The price is > 14 --> OK
How can I verify it?
This is what I'm currently doing:
if 'price is 13' in string:
    print('ok')
and I'm writing all the valid cases.
I would like to have just one condition.
You can list all of the integers in the string and use them in an if statement after.
text = "price is 16 euros"  # avoid naming a variable str, which shadows the built-in
for number in [int(s) for s in text.split() if s.isdigit()]:
    if number > 14:
        print("ok")
If your string contains more than one number, you can select which one you want to use in the list.
Hope it helps.
You can just compare the strings directly if they differ only by the number and the numbers have the same number of digits. For example:
String_1 = 'The price is 15 euros'
String_2 = 'The price is 14 euros'
String_3 = 'The price is 37 EUR'
They will be naturally sorted as String_3 > String_1 > String_2
But this will NOT work for:
String_4 = 'The price is 114 euros'
It has 3 digits instead of 2, so it compares as String_4 < String_3, which is wrong.
So it is better to extract the number from the string, like the following:
import re
def get_price(s):
    # capture the digits that follow the fixed prefix
    m = re.match(r"The price is ([0-9]+)", s)
    if m:
        return int(m.group(1))
    return 0  # no price found
Now you can compare prices as integers:
price = get_price(String_1)
if price > 14:
    print("Okay!")
or, without the intermediate variable:
if get_price(String_1) > 14:
    print("Okay!")
([0-9]+) is the capturing group of the regular expression: whatever is matched inside the parentheses is returned by the group(1) method of the Python match object. You can extend the simple pattern [0-9]+ further for your needs.
If you have a list of strings:
string_list = [String_1, String_2, String_3, String_4]
for s in string_list:
    if get_price(s) > 14:
        print("'{}' is okay!".format(s))
Is the string format always going to be exactly the same? As in, will it always start with "The price is", then have a positive integer, and then end with "euros"? If so, you can just split the string into words, index the integer, cast it to an int, and check if it's greater than 14.
if int(s.split()[3]) > 14:
    print('ok')
If the strings will not be consistent, you may want to consider a regex solution to get the numeral part of the sentence out.
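As a rough sketch of the regex route mentioned above: the helper name first_int is made up for illustration, and it assumes the first run of digits in the string is the price.

```python
import re

def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

strings = ["The price is 15 euros", "The price is 14 euros", "no price here"]
for s in strings:
    price = first_int(s)
    # skip strings without any number instead of raising an error
    if price is not None and price > 14:
        print("ok:", s)
```

Returning None rather than 0 keeps "no number found" distinct from a real price of 0.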
You could use a regular expression to extract the number after "price is", convert it from a string to an int, and then check whether it is greater than 14. For example:
import re
p = re.compile(r'price\sis\s\d+')
string1 = 'The price is 15 euros'
string2 = 'The price is 14 euros'
number = re.findall(p, string1)[0].split("price is ")
if int(number[1]) > 14:
    print('ok')
Output:
ok
I suppose you have only one value in your string, so we can do it with a regex.
import re
String_1 = 'The price is 15 euros.'
if float(re.findall(r'\d+', String_1)[0]) > 14:
    print("OK")
I was web scraping the Foot Locker website, and when I extract the prices, some of them come back with more than one decimal point.
I want to round each price to 2 digits after the decimal point. How can I do that?
My price list:
90.00
170.00
198.00
137.99137.99158.00
When I try the float function I get an error. Can someone please help?
print(float(Price))
90.0
170.0
198.0
ValueError: could not convert string to float: '137.99137.99158.00'
I also want to show two decimal places, so that 90.0 becomes 90.00.
After a second look at your prices it seems to me that the problem with the multiple decimal points is due to missing spaces between the prices. Maybe the webscraper needs a fix? If you want to go on with what you have, you can do it with regular expressions. But my fix only works if prices are always given with two decimal digits.
import re
list_prices = [ '90.00', '170.00', '198.00', '137.99137.99158.00' ]
pattern_price = re.compile(r'[0-9]+\.[0-9]{2}')
list_prices_clean = pattern_price.findall('\n'.join(list_prices))
print(list_prices_clean)
# ['90.00', '170.00', '198.00', '137.99', '137.99', '158.00']
You're getting that error because the input 137.99137.99158.00 is not a valid input for the float function. I have written the below function to clean your inputs.
def clean_invalid_number(num):
    split_num = num.split('.')
    num_len = len(split_num)
    if num_len > 1:
        # keep the first dot and drop all the others
        temp = split_num[0] + '.'
        for i in range(1, num_len):
            temp += split_num[i]
        return temp
    else:
        return num
To explain the above: I used the split function, which returns a list. If the list has more than one element, the string contains at least one full stop, so the function keeps the first one and joins the remaining pieces without any. The list does not contain the character you split on.
As for rounding to 2 decimal places, simply use
Price = round(Price,2)
Returning 90.00 instead of 90.0 does not make sense if you are casting to float, because a float does not keep trailing zeros.
Here is the full code as a demo:
prices = ['90.00', '170.00', '198.00', '137.99137.99158.00']
prices = [round(float(clean_invalid_number(p)),2 ) for p in prices]
print(prices)
[90.0, 170.0, 198.0, 137.99]
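If the trailing zeros matter (for example when printing prices), one option is to keep the values as decimal.Decimal instead of float. This is only a sketch: the helper name to_money is hypothetical, and it assumes the input strings have already been cleaned so that each holds a single number.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_money(text):
    """Parse a cleaned price string into a Decimal with exactly two decimal places."""
    return Decimal(text).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

prices = [to_money(p) for p in ["90.00", "170.0", "137.994"]]
print([str(p) for p in prices])  # ['90.00', '170.00', '137.99']
```

If you only need the two decimals for display, formatting a float with f"{value:.2f}" gives the same text without changing the stored type.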
1. Replace the first dot with a temporary delimiter
2. Delete all other dots
3. Replace the temporary delimiter with a dot
4. Round
5. Print with two decimals
Like this:
list_prices = [ '90.00', '170.00', '198.00', '137.99137.99158.00']
def clean_price(price, sep='.'):
    price = str(price)
    price = price.replace(sep, 'DOT', 1)  # protect the first separator
    price = price.replace(sep, '')        # drop all remaining separators
    price = price.replace('DOT', '.')
    rounded = round(float(price),2)
    return f'{rounded:.2f}'
list_prices_clean = [clean_price(price) for price in list_prices]
print(list_prices_clean)
# ['90.00', '170.00', '198.00', '137.99']
EDIT:
In case you mean rounding after the last decimal point:
def clean_price(price, sep='.'):
    price = str(price)
    num_seps = price.count(sep)
    price = price.replace(sep, '', num_seps-1)  # remove every separator except the last
    rounded = round(float(price),2)
    return f'{rounded:.2f}'
list_prices_clean = [clean_price(price) for price in list_prices]
print(list_prices_clean)
# ['90.00', '170.00', '198.00', '1379913799158.00']
No need to write custom methods; use regular expressions (regex) to extract patterns from strings. Your problem is that the long string (137.99137.99158.00) is 3 prices without spaces in between. The regular expression "[0-9]+\.[0-9]{2}" matches one or more digits before a "." and exactly two digits after it.
import re
reg = r"[0-9]+\.[0-9]{2}"
test = "137.99137.99158.00"
p = re.compile(reg)
result = p.search(test)  # search() returns only the first match; p.findall(test) returns all three
print(result.group(0))
Output:
137.99
Short explanation:
'[0-9]' "a digit"
'+' "one or more"
'\.' "a literal dot (an unescaped . would match any character)"
Regex may seem quite weird at first, but it is an essential skill, especially when you want to mine text.
OK, I have finally found a solution to my problem. Thank you everyone for helping out.
def Price(s):
    try:
        # price range such as "$90.00 to $120.00": take the part after "to"
        P = s.find("div",class_="ProductPrice").text.replace("$","").strip().split("to")[1].split(".")
        return round(float(".".join(P[0:2])),2)
    except Exception:
        # no "to" in the text: fall back to the single price
        P = s.find("div",class_="ProductPrice").text.replace("$","").strip().split("to")[0].split(".")
        return float(".".join(P[0:2]))
Hi, I want to parse only one number from a string. For example, I parse the number of user sessions in the last 5 minutes from userSes = "12342 last 5 min", and I want only 12342 (this number changes every 5 minutes). But when I parse this data, the result contains both 12342 and 5 (the 5 comes from "last 5 min"). Can anyone help me?
x= ('12342 from last 5 min ')
print(''.join(filter(lambda x: x.isdigit(),x)))
You can use regex:
import re
x = '12342 from last 5 min '
n_sessions = int(re.findall(r'^(\d+).*', x)[0])
print(n_sessions)
^(\d+).* looks for a number (\d+) at the start of the string (^), followed by everything else (.*).
If your string is consistent
n_sessions = int(x.split()[0])
should be enough.
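Both variants above raise an error if the string does not start with a number. A hedged sketch that returns None in that case instead (the function name leading_int is made up for illustration):

```python
import re

def leading_int(text):
    """Return the integer at the start of text, or None if text does not start with digits."""
    match = re.match(r"\s*(\d+)", text)  # allow leading whitespace before the digits
    return int(match.group(1)) if match else None

print(leading_int("12342 from last 5 min "))  # 12342
print(leading_int("from last 5 min"))         # None
```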
parsed_list = [item for item in x.split(' ')[0] if item.isdigit()]
print(''.join(parsed_list))
I have a bunch of strings in a pandas dataframe that contain numbers. I could run the code below and replace them all:
df.feature_col = df.feature_col.str.replace(r'\d+', ' NUM ', regex=True)
But what I need to do is replace any 10 digit number with a string like masked_id, any 16 digit numbers with account_number, or any three-digit numbers with yet another string, and so on.
How do I go about doing this?
PS: since my data set is small, a less optimal approach is also good enough for me.
Another way is to use replace with the option regex=True and a dictionary. You can also use somewhat more relaxed match patterns (in order) than Tim's:
import pandas as pd
# test data
df = pd.DataFrame({'feature_col': ['this has 1234567',
                                   'this has 1234',
                                   'this has 123',
                                   'this has none']})
# patterns in decreasing length order
# these of course would replace '12345' with 'ID45' :-)
df['feature_col'] = df.feature_col.replace({r'\d{7}': 'ID7',
                                            r'\d{4}': 'ID4',
                                            r'\d{3}': 'ID3'},
                                           regex=True)
Output:
feature_col
0 this has ID7
1 this has ID4
2 this has ID3
3 this has none
You could do a series of replacements, one for each length of number:
df.feature_col = df.feature_col.str.replace(r'\b\d{3}\b', ' 3mask ', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{10}\b', 'masked_id', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{16}\b', 'account_number', regex=True)
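An alternative to chaining one replacement per length is a single pass with a replacement function that picks the placeholder from the number's length. This is a sketch in plain Python: the MASKS mapping and the name mask_numbers are assumptions for illustration, not part of the question.

```python
import re

# Hypothetical mapping from digit count to placeholder; adjust to your data.
MASKS = {3: "three_digit_code", 10: "masked_id", 16: "account_number"}

def mask_numbers(text):
    """Replace each standalone number with a placeholder chosen by its length."""
    def repl(match):
        digits = match.group()
        # numbers whose length is not in MASKS are left unchanged
        return MASKS.get(len(digits), digits)
    return re.sub(r"\b\d+\b", repl, text)

print(mask_numbers("id 1234567890 card 1234567812345678 code 123 qty 42"))
# id masked_id card account_number code three_digit_code qty 42
```

On a dataframe it could be applied row by row with df.feature_col.apply(mask_numbers), which should be fine for a small data set.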
Let's suppose that I have a string like this:
sentence = 'I am 6,571.5 14 a 14 data 1,a211 43.2 scientist 1he3'
I want to have as an output the frequency of the most frequent number in the string.
In the string above this is 2, which corresponds to the number 14, the most frequent number in the string.
By number I mean a token that consists only of digits and , or . and is delimited by whitespace.
Hence, in the string above the only numbers are: 6,571.5, 14, 14, 43.2.
(Keep in mind that different countries use the , and . in the opposite way for decimals and thousands so I want to take into account all these possible cases)
How can I efficiently do this?
P.S.
It is funny to discover that in Python there is no (very) quick way to test if a word is a number (including integers and floats of different conventions about , and .).
You can try:
from collections import Counter
import re
pattern = r'\s*?\d+[,.]\d+[,.]\d+\s*?|\s*?\d+[,.]\d+\s*?|\s[0-9]+\s'
sentence = 'I am 6,571.5 14 a 14 data 1,a211 43.2 scientist 1he3'
[(_ , freq)] = Counter(re.findall(pattern, sentence)).most_common(1)
print(freq)
# output: 2
Or you can use:
def simple(w):
    if w.isalpha():
        return False
    if w.isnumeric():
        return True
    if w.count('.') > 1 or w.count(',') > 1:
        return False
    if w.startswith('.') or w.startswith(','):
        return False
    if w.replace(',', '').replace('.', '').isnumeric():
        return True
    return False
[(_ , freq)] = Counter([w for w in sentence.split() if simple(w)]).most_common(1)
print(freq)
# output: 2
But the second solution is about 2 times slower.
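A third variation, sketched under the question's definition of a number (a whitespace-delimited token made only of digits separated by , or .): split on whitespace and test each token with fullmatch. The names NUMBER and most_frequent_number_count are illustrative, and the sketch assumes tokens are compared as text, so 1,000 and 1000 would count as different numbers.

```python
import re
from collections import Counter

# digits, optionally followed by groups of "," or "." plus more digits
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def most_frequent_number_count(sentence):
    """Return how often the most frequent numeric token occurs (0 if there is none)."""
    tokens = [w for w in sentence.split() if NUMBER.fullmatch(w)]
    if not tokens:
        return 0
    return Counter(tokens).most_common(1)[0][1]

sentence = 'I am 6,571.5 14 a 14 data 1,a211 43.2 scientist 1he3'
print(most_frequent_number_count(sentence))  # 2
```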
If I have these names:
bob = "Bob 1"
james = "James 2"
longname = "longname 3"
And printing these gives me:
Bob 1
James 2
longname 3
How can I make sure that the numbers would be aligned (without using \t or tabs or anything)? Like this:
Bob     1
James   2
longname3
This is a good use for a format string, which can specify a width for a field to be filled with a character (including spaces). But you'll have to split() your strings first if they're in the format at the top of the post. For example:
"{: <10}{}".format(*bob.split())
# output: 'Bob       1'
The < means left align, and the space before it is the character that will be used to "fill" the "empty" part of the field. It doesn't have to be a space. 10 is the width of the field in characters, and the : separates the (here empty) argument name from the format spec, so that <10 is not read as the name of the argument to insert.
Based on your example, it looks like you want the width to be based on the longest name. In which case you don't want to hardcode 10 like I just did. Instead you want to get the longest length. Here's a better example:
names_and_nums = [x.split() for x in (bob, james, longname)]
longest_length = max(len(name) for (name, num) in names_and_nums)
format_str = "{: <" + str(longest_length) + "}{}"
for name, num in names_and_nums:
    print(format_str.format(name, num))
See: Format specification docs
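The same idea can be written with an f-string, where the width is itself a nested replacement field. This sketch assumes the names and numbers are already split into pairs (the rows list is made up for illustration) and adds one space of padding after the longest name.

```python
rows = [("Bob", 1), ("James", 2), ("longname", 3)]

# width of the name column: longest name plus one space of padding
width = max(len(name) for name, _ in rows) + 1

# {name:<{width}} left-aligns name in a field that is width characters wide
lines = [f"{name:<{width}}{num}" for name, num in rows]
print("\n".join(lines))
```

Dropping the + 1 reproduces the layout from the question, where the number follows the longest name directly.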