Suppose I have a string like this:
sentence = 'I am 6,571.5 14 a 14 data 1,a211 43.2 scientist 1he3'
I want to have as an output the frequency of the most frequent number in the string.
For the string above this is 2, which corresponds to the number 14, the most frequent number in the string.
By number I mean a token that consists only of digits and , or . and is delimited by whitespace.
Hence, in the string above the only numbers are: 6,571.5, 14, 14, 43.2.
(Keep in mind that different countries use the , and . in the opposite way for decimals and thousands so I want to take into account all these possible cases)
How can I efficiently do this?
P.S.
It is funny to discover that in Python there is no (very) quick way to test if a word is a number (including integers and floats of different conventions about , and .).
You can try:
from collections import Counter
import re
# raw string, so the backslashes reach the regex engine unchanged
pattern = r'\s*?\d+[\,\.]\d+[\,\.]\d+\s*?|\s*?\d+[\,\.]\d+\s*?|\s[0-9]+\s'
sentence = 'I am 6,571.5 14 a 14 data 1,a211 43.2 scientist 1he3'
[(_ , freq)] = Counter(re.findall(pattern, sentence)).most_common(1)
print(freq)
# output: 2
Or you can use:
def simple(w):
    if w.isalpha():
        return False
    if w.isnumeric():
        return True
    if w.count('.') > 1 or w.count(',') > 1:
        return False
    if w.startswith('.') or w.startswith(','):
        return False
    if w.replace(',', '').replace('.', '').isnumeric():
        return True
    return False
[(_ , freq)] = Counter([w for w in sentence.split() if simple(w)]).most_common(1)
print(freq)
# output: 2
but the second solution is ~ 2 times slower
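For comparison, here is a minimal sketch that follows the question's definition literally: split on whitespace and keep only tokens made of digits, `,` and `.`. The helper name `most_frequent_number_count` and the extra assumption that a number must start and end with a digit are mine, not part of the answers above.

```python
import re
from collections import Counter

# Assumption: a "number" starts and ends with a digit and may contain
# digits, ',' or '.' in between (covers both 6,571.5 and 6.571,5).
NUMBER_RE = re.compile(r"\d(?:[\d.,]*\d)?")


def most_frequent_number_count(sentence):
    """Return how often the most frequent number token occurs (0 if none)."""
    tokens = [t for t in sentence.split() if NUMBER_RE.fullmatch(t)]
    if not tokens:
        return 0
    return Counter(tokens).most_common(1)[0][1]


sentence = 'I am 6,571.5 14 a 14 data 1,a211 43.2 scientist 1he3'
print(most_frequent_number_count(sentence))  # 2
```

Because tokens are compared as strings, `1,5` and `1.5` count as different numbers; normalising separators first would need a decision about which convention applies.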
Related
I have a bunch of strings in a pandas dataframe that contain numbers. I could run the code below and replace them all:
df.feature_col = df.feature_col.str.replace(r'\d+', ' NUM ', regex=True)
But what I need to do is replace any 10 digit number with a string like masked_id, any 16 digit numbers with account_number, or any three-digit numbers with yet another string, and so on.
How do I go about doing this?
PS: since my data set is small, a less efficient approach is also good enough for me.
Another way is to use replace with a dictionary and the option regex=True. You can also use somewhat more relaxed match
patterns (applied in order) than Tim's:
# test data
df = pd.DataFrame({'feature_col': ['this has 1234567',
                                   'this has 1234',
                                   'this has 123',
                                   'this has none']})
# patterns in decreasing length order
# these of course would replace '12345' with 'ID45' :-)
df['feature_col'] = df.feature_col.replace({r'\d{7}': 'ID7',
                                            r'\d{4}': 'ID4',
                                            r'\d{3}': 'ID3'},
                                           regex=True)
Output:
feature_col
0 this has ID7
1 this has ID4
2 this has ID3
3 this has none
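You could do a series of replacements, one for each length of number (regex=True is required in recent pandas versions, where str.replace treats the pattern as a literal string by default):
df.feature_col = df.feature_col.str.replace(r'\b\d{3}\b', ' 3mask ', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{10}\b', 'masked_id', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{16}\b', 'account_number', regex=True)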
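The per-length replacements can also be collapsed into a single pass with a replacement callback. The sketch below is only an illustration: the `LABELS` mapping, the label `three_digit_number`, and the function name `mask_numbers` are made up for the example.

```python
import re

# Hypothetical mapping from digit count to replacement label.
LABELS = {3: "three_digit_number", 10: "masked_id", 16: "account_number"}


def mask_numbers(text):
    """Replace each standalone run of digits according to its length."""
    def repl(match):
        digits = match.group()
        # Lengths without a label are left unchanged.
        return LABELS.get(len(digits), digits)

    return re.sub(r"\b\d+\b", repl, text)


print(mask_numbers("id 1234567890 code 123 keep 12345"))
# id masked_id code three_digit_number keep 12345
```

On a dataframe this could be applied row by row with `df.feature_col.apply(mask_numbers)`, which should be fine for small data.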
I would like to know how to count how many negation words (no, not) and contractions (n't) there are in each sentence and in the whole text.
To count the number of sentences I am using the following:
df["sent"]=df['text'].str.count(r'[\w][\.!\?]')
However, this only gives me the number of sentences in a text. I need the number of negation words per sentence and in the whole text.
Can you please give me some tips?
The expected output for text column is shown below
text                                   sent  count_n_s  count_tot
I haven't tried it yet                 1     1          1
I do not like it. What do you think?   2     0.5        1
It's marvellous!!!                     1     0          0
No, I prefer the other one.            2     1          1
count_n_s is given by counting the total number of negation words in the text, then dividing by the number of sentences.
I tried
split_w = re.split("\w+",df['text'])
neg_words=['no','not','n\'t']
words = [w for i,w in enumerate(split_w) if i and (split_w[i-1] in neg_words)]
This would get a count of total negations in the text (not for individual sentences):
import re
NEG = r"""(?:^(?:no|not)$)|n't"""
NEG_RE = re.compile(NEG, re.VERBOSE)
def get_count(text):
    # text is a list of words; count those that match a negation pattern
    count = 0
    for word in text:
        if NEG_RE.search(word):
            count += 1
    return count
df['text_list'] = df['text'].apply(lambda x: x.split())
df['count'] = df['text_list'].apply(lambda x: get_count(x))
To count negations for individual lines, use the code below. It reduces each token to its leading run of word characters, so haven't becomes haven and is not counted as a negation; if you want such contractions included, add the stripped forms (e.g. haven) to neg_words.
import re
str1 = '''I haven't tried it yet
I do not like it. What do you think?
It's marvellous!!!
No, I prefer the other one.'''
neg_words=['no','not','n\'t']
for text in str1.split('\n'):
    split_w = re.split(r"\s", text.lower())
    # to get rid of special characters such as comma in 'No,' use the below search
    split_w = [re.search(r'^\w+', w).group(0) for w in split_w]
    words = [w for w in split_w if w in neg_words]
    print(len(words))
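To get both columns from the question in one step, the two ideas above can be combined. This is a sketch under assumptions: sentences are counted with the asker's own `\w[.!?]` heuristic, a text without closing punctuation is treated as one sentence, and the function name `negation_stats` is invented for the example.

```python
import re

# "no" / "not" as whole words, or the contraction n't anywhere in a word.
NEG_RE = re.compile(r"\b(?:no|not)\b|n't", re.IGNORECASE)
# Sentence heuristic taken from the question: a word character followed by . ! or ?
SENT_RE = re.compile(r"\w[.!?]")


def negation_stats(text):
    """Return (total negations, negations per sentence) for one text."""
    total = len(NEG_RE.findall(text))
    # Assumption: text without terminal punctuation still counts as one sentence.
    sentences = max(len(SENT_RE.findall(text)), 1)
    return total, total / sentences


for line in ["I haven't tried it yet",
             "I do not like it. What do you think?",
             "It's marvellous!!!"]:
    print(line, negation_stats(line))
```

With a dataframe, `df['text'].apply(negation_stats)` would yield the pairs, which can then be split into `count_tot` and `count_n_s`.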
I am cleaning text and then passing it to the CountVectorizer function to give me a count of how many times each word appears in the text. The problem is that it is treating 10,000x as two words (10 and 000x). Similarly for 5.00 it is treating 5 and 00 as two different words.
I have tried the following:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
corpus = ["userna lightning strike megawaysnew release there's many "
          "ways win lightning strike megaways. start epic adventure today, seek "
          "mystery symbols, re-spins wild multipliers, mega spins gamble lead wins "
          "10,000x bet!"]
analyzer = CountVectorizer().build_analyzer()
vectorizer = CountVectorizer()
result = vectorizer.fit_transform(corpus).todense()
cols = vectorizer.get_feature_names()
res_df45 = pd.DataFrame(result, columns = cols)
In the data frame, both "10" and "000x" are given a count of 1 but I need them to be treated as one word (10,000x). How can I do this?
The default regex pattern the tokenizer is using for the token_pattern parameter is:
token_pattern='(?u)\\b\\w\\w+\\b'
So a word is delimited by a \b word boundary at each end, with \w\w+ (one alphanumeric character followed by one or more alphanumeric characters) in between. In a regular (non-raw) Python string literal, each backslash has to be escaped as \\.
So you could change the token pattern to:
token_pattern='\\b(\\w+[\\.,]?\\w+)\\b'
Explanation: [\\.,]? allows for the optional appearance of a . or , inside a token. The regex for the first alphanumeric character, \w, has to be extended to \w+ to match numbers with more than one digit before the punctuation.
For your slightly adjusted example:
corpus=["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
analyzer = CountVectorizer().build_analyzer()
vectorizer = CountVectorizer(token_pattern='\\b(\\w+[\\.,]?\\w+)\\b')
result = vectorizer.fit_transform(corpus).todense()
cols = vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
print(pd.DataFrame(result, columns = cols))
Output:
   10,000x  2.5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna
0        1    1   1    1   1          1     1   1   1        1      1       1      1       1
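The effect of the two patterns can be checked without scikit-learn by running them through `re.findall` on lowercased text, which is roughly what the default tokenizer does. The helper `tokenize` and the sample sentence are only for illustration.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"          # scikit-learn's default token_pattern
ADJUSTED_PATTERN = r"\b(\w+[\.,]?\w+)\b"    # pattern proposed in this answer


def tokenize(text, pattern):
    """Lowercase the text and return every token the pattern finds."""
    return re.findall(pattern, text.lower())


text = "wins 10,000x bet and 5.00 more"
print(tokenize(text, DEFAULT_PATTERN))
# ['wins', '10', '000x', 'bet', 'and', '00', 'more']
print(tokenize(text, ADJUSTED_PATTERN))
# ['wins', '10,000x', 'bet', 'and', '5.00', 'more']
```

One side effect worth knowing: the adjusted pattern also keeps ordinary words together when they are joined by a comma or period without a space, e.g. `a,b` becomes a single token.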
Alternatively you could modify your input text, e.g. by replacing the decimal point . with an underscore _ and removing commas that stand between digits.
import re
corpus = ["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
for i in range(len(corpus)):
    corpus[i] = re.sub(r"(\d+)\.(\d+)", r"\1_\2", corpus[i])
    corpus[i] = re.sub(r"(\d+),(\d+)", r"\1\2", corpus[i])
analyzer = CountVectorizer().build_analyzer()
vectorizer = CountVectorizer()
result = vectorizer.fit_transform(corpus).todense()
cols = vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
print(pd.DataFrame(result, columns = cols))
Output:
   10000x  2_5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna
0       1    1   1    1   1          1     1   1   1        1      1       1      1       1
I have a list of strings and I would like to verify some conditions on the strings. For example:
String_1: 'The price is 15 euros'.
String_2: 'The price is 14 euros'.
Condition: The price is > 14 --> OK
How can I verify it?
Currently I'm doing it like this:
if ('price is 13' in string):
    print('ok')
and writing out all the valid cases.
I would like to have just one condition.
You can collect all of the integers in the string and then use them in an if statement.
text = "price is 16 euros"
for number in [int(s) for s in text.split() if s.isdigit()]:
    if number > 14:
        print("ok")
If your string contains more than one number, you can select which one you want to use from the list.
Hope it helps.
You can compare the strings directly if they differ only in the number and the numbers have the same digit count, e.g.:
String_1 = 'The price is 15 euros'
String_2 = 'The price is 14 euros'
String_3 = 'The price is 37 EUR'
These sort naturally as String_3 > String_1 > String_2.
But this will NOT work for:
String_4 = 'The price is 114 euros'
It has 3 digits instead of 2, so string comparison gives String_4 < String_3.
So it is better to extract the number from the string, as follows:
import re
def get_price(s):
    # Capture the run of digits that follows the fixed prefix
    m = re.match("The price is ([0-9]+)", s)
    if m:
        return int(m.group(1))
    return 0
Now you can compare prices as integers:
price = get_price(String_1)
if price > 14:
    print("Okay!")
. . .
if get_price(String_1) > 14:
    print("Okay!")
([0-9]+) is the capturing group of the regular expression; whatever matches inside the parentheses is returned by the group(1) method of the Python match object. You can extend the simple pattern [0-9]+ further to suit your needs.
If you have a list of strings:
string_list = [String_1, String_2, String_3, String_4]
for s in string_list:
    if get_price(s) > 14:
        print("'{}' is okay!".format(s))
Is the string format always going to be exactly the same? As in, it will always start with "The price is", then have a positive integer, and then end with "euros"? If so, you can just split the string into words, index the integer, cast it to an int, and check if it's greater than 14.
if int(s.split()[3]) > 14:
    print('ok')
If the strings will not be consistent, you may want to consider a regex solution to get the numeral part of the sentence out.
You could use a regular expression to extract the number after "price is", convert it from a string to an int, and finally check whether it is greater than 14. For example:
import re
p = re.compile(r'price\sis\s\d\d*')
string1 = 'The price is 15 euros'
string2 = 'The price is 14 euros'
number = re.findall(p, string1)[0].split("price is ")
if int(number[1]) > 14:
    print('ok')
Output:
ok
I assume you have only one value in your string, so we can do it with a regex.
import re
String_1 = 'The price is 15 euros.'
if float(re.findall(r'\d+', String_1)[0]) > 14:
    print("OK")
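The regex answers above can be wrapped into one reusable condition. The following is a sketch that assumes the first number in the string is the price and that decimals use a `.`; the function name `price_exceeds` is invented for the example.

```python
import re


def price_exceeds(text, threshold):
    """True if the first number found in text is greater than threshold."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    # Strings without any number never satisfy the condition.
    return match is not None and float(match.group()) > threshold


strings = ['The price is 15 euros', 'The price is 14 euros', 'The price is 114 euros']
print([s for s in strings if price_exceeds(s, 14)])
# ['The price is 15 euros', 'The price is 114 euros']
```

Comparing the parsed value rather than the text avoids the digit-count problem of plain string comparison.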
There are a lot of similar questions, but I have not found a solution for my problem. I have a data frame with the following structure/form:
col_1
0 BULKA TARTA 500G KAJO 1
1 CUKIER KRYSZTAL 1KG KSC 4
2 KASZA JĘCZMIENNA 4*100G 2 0.92
3 LEWIATAN MAKARON WSTĄŻKA 1 0.89
However, I want to achieve the effect:
col_1
0 BULKA TARTA 500G KAJO
1 CUKIER KRYSZTAL 1KG KSC
2 KASZA JĘCZMIENNA 4*100G
3 LEWIATAN MAKARON WSTĄŻKA
So I want to remove the standalone integers and decimal numbers, but keep the numbers that are attached to letters within a token.
I tried to use df.col_1.str.isdigit().replace([True, False],[np.nan, df.col_1]) , but it only works on comparing the entire cell whether it is a number or not.
Do you have any ideas how to do it? Or maybe it would be good to split the column on spaces and then compare?
We could create a function that tries to convert to float. If it fails we return True (not_float)
import pandas as pd
df = pd.DataFrame({"col_1" : ["BULKA TARTA 500G KAJO 1",
                              "CUKIER KRYSZTAL 1KG KSC 4",
                              "KASZA JĘCZMIENNA 4*100G 2 0.92",
                              "LEWIATAN MAKARON WSTĄŻKA 1 0.89"]})
def is_not_float(string):
    try:
        float(string)
        return False
    except ValueError: # String is not a number
        return True
df["col_1"] = df["col_1"].apply(lambda x: [i for i in x.split(" ") if is_not_float(i)])
df
Or following the example of my fellow SO:ers. However this would treat 130. as a number.
df["col_1"] = (df["col_1"].apply(
    lambda x: [i for i in x.split(" ") if not i.replace(".","").isnumeric()]))
Returns
col_1
0 [BULKA, TARTA, 500G, KAJO]
1 [CUKIER, KRYSZTAL, 1KG, KSC]
2 [KASZA, JĘCZMIENNA, 4*100G]
3 [LEWIATAN, MAKARON, WSTĄŻKA]
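One caveat of the try/except approach is that `float()` accepts more than plain decimals, so tokens such as `inf`, `NaN`, `1e3` or `1_000` would also be dropped. The sketch below illustrates this and shows a stricter regex-based variant that also joins the result back into a string; the names `is_float_like`, `PLAIN_NUMBER_RE` and `drop_plain_numbers` are invented for the example.

```python
import re


def is_float_like(token):
    """Same test as the try/except approach: does float() accept the token?"""
    try:
        float(token)
        return True
    except ValueError:
        return False


for token in ["0.92", "1e3", "inf", "NaN", "1_000", "500G"]:
    print(token, is_float_like(token))  # only 500G is rejected

# Stricter alternative: digits with an optional decimal part, nothing else.
PLAIN_NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def drop_plain_numbers(text):
    """Remove whitespace-delimited integers/decimals and rejoin the rest."""
    return " ".join(t for t in text.split() if not PLAIN_NUMBER_RE.fullmatch(t))


print(drop_plain_numbers("KASZA JĘCZMIENNA 4*100G 2 0.92"))
# KASZA JĘCZMIENNA 4*100G
```

Applied with `df["col_1"].apply(drop_plain_numbers)`, this would return strings rather than lists, which matches the output asked for in the question.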
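Sure, you could use a regex. Note that re.sub works on a single string, not on a Series, so use the pandas .str.replace accessor with regex=True:
# Remove whitespace-delimited integers and decimals; tokens like 500G or 4*100G stay intact
df["col_1"] = df["col_1"].str.replace(r"\s+\d+(?:\.\d+)?(?=\s|$)", "", regex=True)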
Yes, you can:
def no_nums(col):
    # Keep only the words that are not made up entirely of digits (ignoring '.')
    return ' '.join(word for word in col.split() if not word.replace('.', '').isdigit())
df.col_1.apply(no_nums)
This filters out, from each value, the words that consist entirely of digits, optionally with a decimal point.
If you also want to filter out numbers like 1,000, simply add another replace for ','.
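The extra replace for `,` mentioned above could look like the sketch below. It is a plain-Python variant of `no_nums`; the name `no_nums_with_commas` is invented for the example, and like the original it treats any mix of digits, `.` and `,` as a number.

```python
def no_nums_with_commas(text):
    """Drop words that consist only of digits once ',' and '.' are removed."""
    def is_number(word):
        return word.replace(',', '').replace('.', '').isdigit()

    return ' '.join(word for word in text.split() if not is_number(word))


print(no_nums_with_commas("CUKIER KRYSZTAL 1KG KSC 4"))
# CUKIER KRYSZTAL 1KG KSC
print(no_nums_with_commas("X 1,000 Y 0.89"))
# X Y
```

It can be used in the same way as the original, e.g. `df.col_1.apply(no_nums_with_commas)`.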