Python parse only digit - python

Hi, I want to parse only the digits. For example, I parse the number of user sessions in the last 5 minutes: userSes = "12342 last 5 min". I want to get only 12342 (this number changes every 5 minutes), but my code also picks up the 5 from "last 5 min", so the output is 123425. Can anyone help?
x= ('12342 from last 5 min ')
print(''.join(filter(lambda x: x.isdigit(),x)))

You can use regex:
import re
x = '12342 from last 5 min '
n_sessions = int(re.findall(r'^(\d+).*', x)[0])
print(n_sessions)
^(\d+).* looks for a number (\d+) at the start of the string (^), followed by everything else (.*).
If your string is consistent
n_sessions = int(x.split()[0])
should be enough.
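To see why the filter from the question returns too much, and what the suggestions above change, here is a minimal sketch. The helper name first_number is made up for illustration, and it assumes the session count is the first run of digits in the string.

```python
import re

def first_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

x = '12342 from last 5 min '

# Filtering characters keeps every digit, so the 5 gets glued onto the count.
all_digits = ''.join(ch for ch in x if ch.isdigit())
print(all_digits)        # 123425

# Stopping at the first run of digits ignores the 5 in "last 5 min".
print(first_number(x))   # 12342
```

Returning None for a string without digits is a design choice of this sketch; the answers above would raise an exception in that case.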

parsed_list = [item for item in x.split(' ')[0] if item.isdigit()]
print(''.join(parsed_list))


How to remove numbers from a string column that starts with 4 zeros?

I have a column of product names and information. I need to remove the codes from the names. Every code starts with four or more zeros, but some names also have four or more zeros in the weight, and some codes are joined directly to the name, as in the example below:
import pandas as pd
data = {
    'Name': ['ANOA 250g 00004689', 'ANOA 10000g 00000059884', '80%c asjw 150000001568 ', 'Shivangi000000478761'],
}
testdf = pd.DataFrame(data)
The correct output would be:
results = {
    'Name': ['ANOA 250g', 'ANOA 10000g', '80%c asjw 150000001568 ', 'Shivangi'],
}
results = pd.DataFrame(results)
You can split each string at the start of the code pattern, which is expressed by the regex (?<!\d)0{4,}. This pattern matches four or more 0s that are not preceded by a digit. After splitting, take the first fragment; str.strip then removes any trailing space.
testdf.Name.str.split(r'(?<!\d)0{4,}', regex=True, expand=True)[0].str.strip()
# outputs:
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
Note that this works for the case where the codes are always at the end of your string.
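The lookbehind can be checked without pandas. The following is a small sketch using only the standard re module on the four sample names; the helper name strip_code is hypothetical, and it assumes, as noted above, that the code is the last thing in each string.

```python
import re

# Four or more zeros that are not preceded by a digit mark the start of a code.
CODE_START = re.compile(r'(?<!\d)0{4,}')

def strip_code(name):
    """Keep the text before the first code start and trim surrounding spaces."""
    return CODE_START.split(name, maxsplit=1)[0].strip()

names = ['ANOA 250g 00004689', 'ANOA 10000g 00000059884',
         '80%c asjw 150000001568 ', 'Shivangi000000478761']
cleaned = [strip_code(n) for n in names]
print(cleaned)
```

The zeros in 10000g and 150000001568 are left alone because each run is preceded by a digit, so the lookbehind rejects them.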
Use a regex with str.replace:
testdf['Name'] = testdf['Name'].str.replace(r'(?:(?<=\D)|\s*\b)0{4}\d*',
'', regex=True)
Or, similar to @HaleemurAli, with a negative lookbehind:
testdf['Name'] = testdf['Name'].str.replace(r'\s*(?<!\d)0{4,}\d*',
                                            '', regex=True)
Output:
Name
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
Try splitting each name on whitespace and dropping every item that contains 0000, like:
answer = []
for name in testdf["Name"]:
    # caveat: this also drops tokens such as '10000g' and 'Shivangi000000478761'
    answer.append(" ".join(tok for tok in name.split() if "0000" not in tok))

Comparing strings containing numbers in Python [duplicate]

I have a list of strings and I would like to verify some conditions on the strings. For example:
String_1: 'The price is 15 euros'.
String_2: 'The price is 14 euros'.
Condition: The price is > 14 --> OK
How can I verify it?
I'm actually doing like this:
if ('price is 13' in string):
    print('ok')
and I'm writing all the valid cases.
I would like to have just one condition.
You can list all of the integers in the string and then use them in an if statement.
text = "price is 16 euros"
for number in [int(s) for s in text.split() if s.isdigit()]:
    if number > 14:
        print("ok")
If your string contains more than one number, you can select which one you want to use from the list.
Hope it helps.
You can compare the strings directly if they differ only in the number and the numbers have the same digit count, i.e.:
String_1 = 'The price is 15 euros'
String_2 = 'The price is 14 euros'
String_3 = 'The price is 37 EUR'
They will naturally sort as String_3 > String_1 > String_2.
But this will NOT work for:
String_4 = 'The price is 114 euros'
It has 3 digits instead of 2, so it compares as String_4 < String_3.
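A quick way to see the pitfall is to run the comparison itself. This is only an illustration; the variable names s3 and s4 are shorthand for the strings above.

```python
s3 = 'The price is 37 EUR'
s4 = 'The price is 114 euros'

# Strings are compared character by character: '1' sorts before '3',
# so the 114 string is "smaller" even though its price is larger.
string_order = s4 < s3
number_order = 114 < 37
print(string_order)   # True
print(number_order)   # False
```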
So it is better to extract the number from the string, like this:
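import re
def get_price(s):
    # capture the digits that follow the fixed prefix
    m = re.match("The price is ([0-9]*)", s)
    # [0-9]* can match an empty string, so check the group before converting
    if m and m.group(1):
        return int(m.group(1))
    return 0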
Now you can compare prices as integers:
price = get_price(String_1)
if price > 14:
    print("Okay!")
. . .
if get_price(String_1) > 14:
    print("Okay!")
([0-9]*) is the capturing group of the regular expression: whatever matches inside the parentheses is returned by the group(1) method of the Python match object. You can extend this simple regular expression [0-9]* further for your needs.
If you have a list of strings:
string_list = [String_1, String_2, String_3, String_4]
for s in string_list:
    if get_price(s) > 14:
        print("'{}' is okay!".format(s))
Is the string format always going to be exactly the same? As in, will it always start with "The price is", then have a positive integer, and then end with "euros"? If so, you can just split the string into words, index the integer, cast it to an int, and check whether it is greater than 14.
if int(s.split()[3]) > 14:
    print('ok')
If the strings will not be consistent, you may want to consider a regex solution to get the numeral part of the sentence out.
You could use a regular expression to extract the number after "price is", then convert that number from a string to an int, and finally check whether it is greater than 14. For example:
import re
p = re.compile(r'price\sis\s\d\d*')
string1 = 'The price is 15 euros'
string2 = 'The price is 14 euros'
number = re.findall(p, string1)[0].split("price is ")
if int(number[1]) > 14:
    print('ok')
Output:
ok
I suppose you have only one value in your string, so we can do it with a regex.
import re
String_1 = 'The price is 15 euros.'
if float(re.findall(r'\d+', String_1)[0]) > 14:
    print("OK")

Insert space to separate conjoined alpha and numeric strings - Python RegEx

In Python, I need to create a regex that inserts a space between any concatenated AlphaNum combinations. For example, this is what I want:
8min15sec ==> 8 min 15 sec
7m12s ==> 7 m 12 s
15mi25s ==> 15 mi 25 s
I am blundering around with solutions found online, but they are a bit too complex for me to parse/modify. For example, I have this:
[a-zA-Z][a-zA-Z\d]*
but it only identifies the first insertion point: 8Xmin15sec (the X)
And this
(?<=[a-z])(?=[A-Z0-9])|(?<=[0-9])(?=[A-Z])
but it only finds this point: 8minX15sec (the X)
I could sure use a hand with the full syntax for finding each insertion point and inserting the spaces.
How about the following approach:
import re
for test in ['8min15sec', '7m12s', '15mi25s']:
    print(re.sub(r'(\d+|\D+)', r'\1 ', test).strip())
Which would give you:
8 min 15 sec
7 m 12 s
15 mi 25 s
You can use this regex, which marks the boundaries between digits and letters in either order, i.e. digits followed by letters or vice versa.
(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)
The first alternative, (?<=\d)(?=[a-zA-Z]), marks a position using a positive lookbehind for a digit and a positive lookahead for a letter.
Similarly, (?<=[a-zA-Z])(?=\d) does the same in the opposite order.
Then just replace each marked position with a space.
Here is sample Python code for the same.
import re
arr = ['8min15sec', '7m12s', '15mi25s']
for s in arr:
    print(s + ' --> ' + re.sub(r'(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)', ' ', s))
Which prints the following output:
8min15sec --> 8 min 15 sec
7m12s --> 7 m 12 s
15mi25s --> 15 mi 25 s
How about:
"(\d+)([a-zA-Z]+)"
to
"\1 \2 "
https://regex101.com/r/yvqCtQ/2
And in python:
In [59]: re.sub(r'(\d+)([a-zA-Z]+)', r'\1 \2 ', '8min15sec')
Out[59]: '8 min 15 sec '
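The three suggestions can be compared side by side. In this sketch the function names are invented for illustration, and .strip() is added to the last variant as one possible way to drop the trailing space visible in the output above.

```python
import re

samples = ['8min15sec', '7m12s', '15mi25s']

def split_runs(text):
    # put a space after every run of digits or non-digits, then trim the end
    return re.sub(r'(\d+|\D+)', r'\1 ', text).strip()

def split_boundaries(text):
    # insert a space at each zero-width digit/letter boundary
    return re.sub(r'(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)', ' ', text)

def split_pairs(text):
    # rewrite each number+unit pair; strip() removes the trailing space
    return re.sub(r'(\d+)([a-zA-Z]+)', r'\1 \2 ', text).strip()

for s in samples:
    print(s, '->', split_boundaries(s))
```

Only the lookaround version needs no trimming, because it inserts spaces between characters instead of appending them to each match.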

Python - Create dataframe by getting the numbers of alphabetic characters

I have a dataframe with a column called "Utterances", which contains strings (e.g.: "I wanna have a beer" is its first row).
What I need is to create a new data frame that will contain the number of every letter of every row of "Utterances" in the alphabet.
This means that for example in the case of "I wanna have a beer", I need to get the following row: 9 23114141 81225 1 25518, since "I" is the 9th letter of the alphabet, "w" the 23rd and so on. Notice that I want the spaces " " to be maintained.
What I have done so far is the following:
for word in df2[['Utterances']]:
    for character in word:
        new.append(ord(character.lower())-96)
    str1 = ''.join(str(e) for e in new)
The above returns the concatenated string. However, the loop only iterates once, and the string in str1 does not have the required spaces (" "). And of course, I cannot find a way to append these lines to a new dataframe.
Any help would be greatly appreciated.
Thanks.
You can do
In [5572]: df
Out[5572]:
Utterances
0 I wanna have a beer
In [5573]: df['Utterances'].apply(lambda x: ' '.join([''.join(str(ord(c)-96) for c in w)
for w in x.lower().split()]))
Out[5573]:
0 9 23114141 81225 1 25518
Name: Utterances, dtype: object
new = []
for word in ['I ab c def']:
    for character in word:
        if character == ' ':
            # keep spaces as they are
            new.append(' ')
        else:
            new.append(ord(character.lower())-96)
str1 = ''.join(str(e) for e in new)
Output
9 12 3 456
Let's use a dictionary and get on the characters of the string, if you have only letters, i.e.:
import string
dic = {j:i+1 for i,j in enumerate(string.ascii_lowercase[:26])}
dic[' ']= ' '
df['Ut'].apply(lambda x : ''.join([str(dic.get(i)) for i in str(x).lower()]))
Output :
Ut new
0 I wanna have a beer 9 23114141 81225 1 25518
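The per-row logic can also be written as a plain function and tested without a dataframe. This is a sketch under the assumption that only the ASCII letters a to z should be converted and everything else except spaces dropped; the name letter_positions is hypothetical.

```python
def letter_positions(text):
    """Replace each ASCII letter with its 1-based alphabet position, word by word."""
    words = text.lower().split(' ')
    return ' '.join(
        # characters outside a-z are skipped rather than converted
        ''.join(str(ord(c) - ord('a') + 1) for c in word if 'a' <= c <= 'z')
        for word in words
    )

print(letter_positions('I wanna have a beer'))   # 9 23114141 81225 1 25518
```

With pandas, such a function could presumably be applied to the column with df['Utterances'].apply(letter_positions), in the same way as the lambda in the first answer.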

How to use regular expression extract data not followed by something with pandas

I just want to extract the years, but not the number. How can I define not followed by XXX?
I made the following example, but the result always contains one character more than I expected.
text = ["hi2017", "322017"]
text = pd.Series(text)
myPat = "([^\d]\d{4})"
res = text.str.extract(myPat)
res
Then I get the result:
0 i2017
1 NaN
dtype: object
Actually, I just want to get "2017", but not "i2017", how can I do it?
PS. The "322017" should not be extracted, because it is not a year, but a number
Give this a try:
(?<!\d)(\d{4})(?!\d)
which returns 2017 and is based almost entirely on the comment by @PauloAlmeida
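The behaviour of this pattern on the two sample strings can be checked with the standard re module. A small sketch follows; the helper name find_year is made up for illustration.

```python
import re

# exactly four digits with no digit directly before or after them
YEAR = re.compile(r'(?<!\d)(\d{4})(?!\d)')

def find_year(text):
    """Return the isolated four-digit run in text, or None if there is none."""
    m = YEAR.search(text)
    return m.group(1) if m else None

print(find_year('hi2017'))   # 2017
print(find_year('322017'))   # None
```

Since the pattern contains one capturing group, it should also work as the myPat argument to text.str.extract in the question's code.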
As I understand it, you need only the year, defined as 4 digits preceded by non-digit characters.
"(?:[a-z]+)(\d{4})$" works for me (it means 4 digits preceded by one or more lowercase letters, where the 4 digits are the last characters of the string).
text = ["hi2017", "322017"]
text = pd.Series(text)
myPat = r"(?:[a-z]+)(\d{4})$"
res = text.str.extract(myPat)
Output:
print(res)
'''
0 2017
1 NaN
'''
You want 4-digit numbers where the first digit is either a 1 or a 2. This translates to all the numbers between 1000 to 2999, inclusive.
The regex for this is: (1[0-9]{3})|(2[0-9]{3})
This will get all the numbers between 1000 and 2999, inclusive within a string.
In your case, hi2017 will result in 2017. Additionally, 322017 will result in 2201. This is also a valid year as per your definition.
Regexr is a great online tool http://regexr.com/3ghcq
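The difference between the range pattern on its own and the same range with digit boundaries around it can be sketched as follows. Combining it with the lookarounds from the other answer is an assumption about what the question wants, not part of this answer; the names loose and strict are illustrative.

```python
import re

# four digits starting with 1 or 2, anywhere in the string
loose = re.compile(r'[12][0-9]{3}')
# the same range, but not touching any other digit
strict = re.compile(r'(?<!\d)[12][0-9]{3}(?!\d)')

print(loose.findall('hi2017'))    # ['2017']
print(loose.findall('322017'))    # ['2201']
print(strict.findall('hi2017'))   # ['2017']
print(strict.findall('322017'))   # []
```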
myPat = "(\d{4})"
