I want to count 'end sentences' e.g. full stops, exclamation marks, and question marks.
I have written a little loop to do this, but I was wondering if there is a better way. Not allowed to use built-in functions.
for line in textContent:
    numberOfFullStops += line.count(".")
    numberOfQuestionMarks += line.count("?")
    numberOfExclamationMarks += line.count("!")
numberOfSentences = numberOfFullStops + numberOfQuestionMarks + numberOfExclamationMarks
Assuming you want to count terminal punctuation in one sentence, we can produce a dictionary of (character, count) pairs by looping over the characters of each string and filtering for the terminal characters.
Demo
Here are three options, presented from more advanced to more beginner-friendly data structures:
import collections as ct
sentence = "Here is a sentence, and it has some exclamations!!"
terminals = ".?!"
# Option 1 - Counter and Dictionary Comprehension
cd = {c:val for c, val in ct.Counter(sentence).items() if c in terminals}
cd
# Out: {'!': 2}
# Option 2 - Default Dictionary
dd = ct.defaultdict(int)
for c in sentence:
    if c in terminals:
        dd[c] += 1
dd
# Out: defaultdict(int, {'!': 2})
# Option 3 - Regular Dictionary
d = {}
for c in sentence:
    if c in terminals:
        if c not in d:
            d[c] = 0
        d[c] += 1
d
# Out: {'!': 2}
To extend further, for a list of separate sentences, wrap one of the latter options in a loop.
for sentence in sentences:
    # add option here
Note: to sum the total punctuation marks per sentence, total the dict values, e.g. sum(cd.values()).
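To illustrate, a small sketch (the example sentences are made up) that applies Option 1 per sentence and totals the values:

```python
import collections as ct

terminals = ".?!"
sentences = ["Here is one!", "Is this two?", "Yes... three."]

# One total per sentence: count all characters, keep only terminals, sum
totals = [sum(v for c, v in ct.Counter(s).items() if c in terminals)
          for s in sentences]
print(totals)  # [1, 1, 4]
```

Note the ellipsis in the last sentence inflates its count, which is exactly why the regex-split approach below is more reliable for counting sentences.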
Update: assuming you want to split a sentence by terminal punctuation, use regular expressions:
import re
line = "Here is a string of sentences. How do we split them up? Try regular expressions!!!"
# Option - Regular Expression and List Comprehension
pattern = r"[.?!]"
sentences = [sentence for sentence in re.split(pattern, line) if sentence]
sentences
# Out: ['Here is a string of sentences', ' How do we split them up', ' Try regular expressions']
len(sentences)
# Out: 3
Notice line has 5 terminal characters, but only 3 sentences; regex is thus a more reliable approach for counting sentences.
References
collections.Counter
collections.defaultdict
re.split
List comprehension
Related
I'm very new to python and I'm trying to work with strings.
I have some data with peptides for example (test string) KGSLADEE. I want to write a function which compares the test string to the reference string: AGSTQKP to see what percentage of the letters in the test string are the same as in the reference string. How can I do this? When looking online I can only find code for exact string matches.
For this example:
(1*K) + (1*G) + (1*S) + (1*L) = 4 (letters which are the same)
Divide by 8 (total number of letters in the test string)
(4/8) * 100 = 50%
This yields the same results as the answer from Hoxha Alban, but I find this one a bit easier to read. It uses the Counter class from the collections module (see here: https://pymotw.com/2/collections/counter.html)
from collections import Counter
def f(test, ref):
    intersection = Counter(test) & Counter(ref)
    return len(list(intersection.elements())) / len(test) * 100
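A quick check of the function above against the strings from the question and the examples in Hoxha Alban's answer:

```python
from collections import Counter

def f(test, ref):
    # & on Counters keeps the per-letter minimum of the two counts
    intersection = Counter(test) & Counter(ref)
    return len(list(intersection.elements())) / len(test) * 100

print(f('KGSLADEE', 'AGSTQKP'))  # 50.0
print(f('hello', 'h'))           # 20.0
print(f('abc', 'cba'))           # 100.0
```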
Are you searching for something like this?
def f(test, ref):
    d = dict()
    for c in ref:
        if c not in d:
            d[c] = min(ref.count(c), test.count(c))
    return sum(d.values()) / len(test) * 100
f('KGSLADEE', 'AGSTQKP') # 50%
f('hello', 'h') # 20%
f('abc', 'cba') # 100%
f('a', 'aaa') # 100%
f('aaa', 'a') # 33.333%
You will need to loop through every letter in your test string. For each letter, check whether it appears in your reference string and record the result.
You can then use this output to calculate your percentage.
word1 = 'KGSLADEE'
word2 = 'AGSTQKP'
same = 0
for letter1 in word1:
    for letter2 in word2:
        if letter1 == letter2:
            same += 1
            break
It is not a solution for all situations, but you can expand on it.
I am looking to be able to recursively remove adjacent letters in a string that differ only in their case, e.g. if s = AaBbccDd I would want to be able to remove Aa, Bb, and Dd but leave cc.
I can do this recursively using lists:
I think it ought to be possible using regex, but I am struggling:
with the test string 'fffAaaABbe' the answer should be 'fffe', but the regex I am using gives 'fe'
def test(line):
    res = re.compile(r'(.)\1{1}', re.IGNORECASE)
    # print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)
The way that works is:
def test(line):
    result = ''
    chr = list(line)
    cnt = 0
    i = len(chr) - 1
    while i > 0:
        if ord(chr[i]) == ord(chr[i - 1]) + 32 or ord(chr[i]) == ord(chr[i - 1]) - 32:
            cnt += 1
            chr.pop(i)
            chr.pop(i - 1)
            i -= 2
        else:
            i -= 1
    if cnt > 0:  # recurse until we can't find any more pairs.
        return test(''.join(chr))
    result = ''.join(chr)
    print(result)
Is it possible to do this using a regex?
re.IGNORECASE is not the way to solve this problem, as it treats aa, Aa, aA, and AA the same way. Technically it is possible using re.sub, in the following way.
import re
txt = 'fffAaaABbe'
after_sub = re.sub(r'Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ', '', txt)
print(after_sub) # fffe
Note that I explicitly defined all possible letter pairs, because as far as I know there is no way to say "inverted-case letter" using just an re pattern. Maybe another user will be able to provide a more concise re-based solution.
I suggest a different approach which uses groupby to group adjacent similar letters:
from itertools import groupby
def test(line):
    res = []
    for k, g in groupby(line, key=lambda x: x.lower()):
        g = list(g)
        if all(x == x.lower() for x in g):
            res.append(''.join(g))
    print(''.join(res))
Sample run:
>>> test('AaBbccDd')
cc
>>> test('fffAaaABbe')
fffe
r'(.)\1{1}' is wrong because it matches any character repeated twice, including non-letter characters. If you want to stick to letters, you can't use this.
However, even r'([A-Za-z])\1{1}' would still be wrong: it matches consecutive same characters with matching case, such as xx and XX, which, as you said in the original question, you don't want to remove.
It just so happens that there is no short-hand to do this conveniently, but it is still possible. You could also just write a small function to turn it into a short-hand.
Building on @Daweo's answer, you can generate the regex pattern needed to match pairs of the same letter with non-matching case, giving the final pattern aA|Aa|bB|Bb|cC|Cc|dD|Dd|eE|Ee|fF|Ff|gG|Gg|hH|Hh|iI|Ii|jJ|Jj|kK|Kk|lL|Ll|mM|Mm|nN|Nn|oO|Oo|pP|Pp|qQ|Qq|rR|Rr|sS|Ss|tT|Tt|uU|Uu|vV|Vv|wW|Ww|xX|Xx|yY|Yy|zZ|Zz:
import re
import string
def consecutiveLettersNonMatchingCase():
    # Get all 'xX|Xx' pairs with a list comprehension
    # and join them with '|'
    return '|'.join(['{0}{1}|{1}{0}'.format(s, t)
                     # Iterate through the lower/uppercase characters
                     # in lock-step
                     for s, t in zip(
                         string.ascii_lowercase,
                         string.ascii_uppercase)])

def test(line):
    res = re.compile(consecutiveLettersNonMatchingCase())
    print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)

print(consecutiveLettersNonMatchingCase())
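As a side note, on Python 3.6+ a scoped inline flag can express "same letter, opposite case" without spelling out every pair. This is a sketch, and it relies on the scoped flag `(?i:...)` applying case-insensitivity to the backreference:

```python
import re

# (.) captures a letter, (?!\1) rules out an exact repeat like 'ff' or 'FF',
# and the scoped flag (?i:\1) then matches the same letter case-insensitively,
# so together they match a letter followed by its opposite-case twin.
pair = re.compile(r'([A-Za-z])(?!\1)(?i:\1)')

def collapse(line):
    prev = None
    while prev != line:  # keep removing one pair at a time until stable
        prev, line = line, pair.sub('', line, count=1)
    return line

print(collapse('fffAaaABbe'))  # fffe
print(collapse('AaBbccDd'))    # cc
```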
I would like to count how many characters of the alphabet [a...z] are in a string; I don't care about "!", "?", "\n", ",", "." and many others. I just want the alphabet letters.
I was trying to use collections.Counter(string), but the problem is that Counter also picks up these unwanted characters and I don't know how to disregard them.
Another problem is that in my language we use not only "o" but also "ô" and "ó", as well as "e", "é", and "ç". These characters must be counted too.
How can I do that?
Remove the non-letters after you use collections.Counter.
import collections

counter = collections.Counter(yourString)
for char in list(counter):  # iterate over a copy; deleting while iterating raises RuntimeError
    if not char.isalpha():
        del counter[char]
This will count uppercase and lowercase letters separately, as well as letters that have different accents and diacritics. If you want to ignore case, you can use collections.Counter(yourString.lower()). If you also want to ignore diacritics, use collections.Counter(yourString.encode('ascii', 'ignore').decode().lower()) (note that this drops accented characters entirely rather than folding them).
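A quick sketch of the cleanup above on a made-up string; accented letters survive, while punctuation and digits do not:

```python
import collections

text = "Héllo, Wörld! 123"
counter = collections.Counter(text.lower())
for char in list(counter):  # iterate over a copy so deleting keys is safe
    if not char.isalpha():
        del counter[char]

print(counter['l'])    # 3
print('é' in counter)  # True
print('!' in counter)  # False
```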
If you only want to count unique characters, you can use set.intersection between the set of characters in your string, and a set of characters you accept:
maybe something like this:
import string
acceptable_chars = set('áàéíóúç').union(set(string.ascii_lowercase))
mystring = 'kfh;l1234sóúçids'
num_alpha = len(set(mystring.lower()).intersection(acceptable_chars))
print(num_alpha)
Output:
10
Given
import string
import collections as ct
s = "Lorem ipsum çéó?"
Code
Make a set of accepted characters - exclude what you don't want; include what you do want.
excluded = set("oe")
included = set("ôóéç")
accepted = set(string.ascii_letters) - excluded | included
Count and mask the accepted characters.
counted = ct.Counter(s)
masked = ct.Counter(accepted)
shared = set((counted & masked).elements())
Filter accepted characters from Counter.
Demo
Sum of characters
sum(v for k, v in counted.items() if k in shared)
# 11
Sum of unique characters
sum(1 for k, _ in counted.items() if k in shared)
# 10
Dict of tallied characters
{k: v for k, v in counted.items() if k in shared}
# {'L': 1, 'r': 1, 'm': 2, 'i': 1, 'p': 1, 's': 1, 'u': 1, 'ç': 1, 'é': 1, 'ó': 1}
So, guys!
I marked the Reblochon Masque answer as right because I thought it was right. But after testing a little more I found out I was wrong, so I made my own solution based on his code.
The problem with that code is that if my string is "kfh;l1234sóúçids", like his example, the output is correct, 10. But if I repeat it with spaces, like "kfh;l1234sóúçids kfh;l1234sóúçids", I get 10 again when it should be 20.
So, my solution is to create a set of acceptable chars like his code and use it as a reference.
Here it is (in my case the output is 11 because I added a few more special chars to the set):
import collections
import string
# This counter is considering only acceptable_chars. All the rest is ignored.
def numChars(sentence):
    acceptable_chars = set('áàéíóúç').union(set(string.ascii_lowercase))
    chars = collections.Counter(sentence)
    acum = 0
    for key in acceptable_chars:
        if key in chars:
            acum += chars[key]
    return acum
x = numChars("kfh;l1234sóúçids")
print("Acum: " + str(x))
Output:
Acum: 11
If the string is 'kfh;l1234sóúçids \n kfh;l1234sóúçids':
Acum: 22
This might be more information than necessary to explain my question, but I am trying to combine 2 scripts (I wrote for other uses) together to do the following.
TargetString (input_file) 4FOO 2BAR
Result (output_file) 1FOO 2FOO 3FOO 4FOO 1BAR 2BAR
My first script finds the pattern and copies to file_2
pattern = r"\d[A-Za-z]{3}"
matches = re.findall(pattern, input_file.read())
f1.write('\n'.join(matches))
My second script opens the output_file and, using re.sub, replaces and alters the target string(s) using capturing groups and back-references. But I am stuck on how to turn, e.g., 3 into 1 2 3.
Any ideas?
This simple example doesn't need to use regular expressions, but if you want to use re anyway, here's an example:
text_input = '4FOO 2BAR'
import re
matches = re.findall(r"(\d)([A-Za-z]{3})", text_input)
for (count, what) in matches:
    for i in range(1, int(count) + 1):
        print(f'{i}{what}', end=' ')
print()
Prints:
1FOO 2FOO 3FOO 4FOO 1BAR 2BAR
Note: If you want to support multiple digits, you can use (\d+) - note the + sign.
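For instance, with (\d+) the same loop handles a two-digit count (the input string here is made up; only the pattern changed):

```python
import re

text_input = '4FOO 12BAR'
# (\d+) captures one or more digits, so '12' is read as a single count
matches = re.findall(r"(\d+)([A-Za-z]{3})", text_input)
expanded = [f'{i}{what}' for count, what in matches
            for i in range(1, int(count) + 1)]
print(' '.join(expanded))
```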
Assuming your numbers are between 1 and 9, without regex, you can use a list comprehension with f-strings (Python 3.6+):
L = ['4FOO', '2BAR']
res = [f'{j}{i[1:]}' for i in L for j in range(1, int(i[0])+1)]
res
# Out: ['1FOO', '2FOO', '3FOO', '4FOO', '1BAR', '2BAR']
Reading and writing to CSV files are covered elsewhere: read, write.
More generalised, to account for numbers greater than 9, you can use itertools.groupby:
from itertools import groupby
L = ['4FOO', '10BAR']
def make_var(x, int_flag):
    return int(''.join(x)) if int_flag else ''.join(x)
vals = ((make_var(b, a) for a, b in groupby(i, str.isdigit)) for i in L)
res = [f'{j}{k}' for num, k in vals for j in range(1, num+1)]
print(res)
['1FOO', '2FOO', '3FOO', '4FOO', '1BAR', '2BAR', '3BAR', '4BAR',
'5BAR', '6BAR', '7BAR', '8BAR', '9BAR', '10BAR']
I have a list -
A=["hi how are you","have good day","where are you going ","do you like the place"]
and another list -
B=["how","good","where","going","like","place"]
List B includes some of words that exist in list A.
I want to replace every word in list A that occurs in list B with its (1-based) index in list B. If a word doesn't exist in B, replace it with 0.
So list A after the replacement should be
["0 1 0 0","0 2 0","3 0 0 4","0 0 5 0 6"]
I tried using a for loop, but it's not efficient as my list length is > 10000. I also tried using the map function, but I wasn't successful.
Here is my attempt :
for item in list_A:
    words = sorted(item.split(), key=len, reverse=True)
    for w in words:
        if w.strip() in list_B:
            item = item.replace(w, str(list_B.index(w.strip())))
        else:
            item = item.replace(w, "0")
What you could do is create a dictionary that maps each word in list B to its index. Then you only have to iterate through the first list once.
Something like
B = ["how", "yes"]
BDict = {}
index = 0
for x in B:
    BDict[x] = index
    index += 1

for sentence in A:
    for word in sentence.split():
        if word in BDict:
            pass  # BDict[word] has the index of the current word in B
        else:
            pass  # Word does not exist in B
This should significantly decrease the runtime since a dictionary has O(1) access time. However, depending on the size of B, the dictionary could become quite large.
EDIT:
Your code works; the reason it is slow is that the in and index operations have to perform a linear search when you are using a list, so if B gets large this can be a big slowdown. A dictionary, however, takes constant time to check whether a key exists and to retrieve its value. By using the dictionary you replace two O(n) operations with O(1) operations.
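Putting that explanation into runnable form, a sketch of the dictionary approach (the name BDict is from the answer above; indices are 1-based to match the expected output in the question):

```python
A = ["hi how are you", "have good day", "where are you going ", "do you like the place"]
B = ["how", "good", "where", "going", "like", "place"]

# Build the word -> index map once; 1-based so 0 can mean "not in B"
BDict = {word: idx for idx, word in enumerate(B, start=1)}

# Each lookup is now O(1) instead of a linear scan of B
result = [' '.join(str(BDict.get(word, 0)) for word in sentence.split())
          for sentence in A]
print(result)  # ['0 1 0 0', '0 2 0', '3 0 0 4', '0 0 5 0 6']
```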
You should define a function to return the index of a word in the second list:
def get_index_of_word(word):
    try:
        return str(B.index(word) + 1)
    except ValueError:
        return '0'
And then, You can use nested list comprehension to generate the result:
[' '.join(get_index_of_word(word) for word in sentence.split()) for sentence in A]
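Run against the lists from the question, this comprehension produces the expected result:

```python
A = ["hi how are you", "have good day", "where are you going ", "do you like the place"]
B = ["how", "good", "where", "going", "like", "place"]

def get_index_of_word(word):
    try:
        return str(B.index(word) + 1)  # 1-based, as in the expected output
    except ValueError:
        return '0'

result = [' '.join(get_index_of_word(word) for word in sentence.split())
          for sentence in A]
print(result)  # ['0 1 0 0', '0 2 0', '3 0 0 4', '0 0 5 0 6']
```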
UPDATE
from collections import defaultdict
index = defaultdict(lambda: 0, ((word, index) for index, word in enumerate(B, 1)))
[' '.join(str(index[word]) for word in sentence.split()) for sentence in A]
You can try this:
A = ["hi how are you", "have good day", "where are you going ", "do you like the place"]
B = ["how", "good", "where", "going", "like", "place"]
new = [[B.index(d) + 1 if d in B else 0 for d in i.split()] for i in A]
final = list(map(' '.join, map(lambda x: [str(i) for i in x], new)))
print(final)
Hi, your solution is making (too) many lookups.
here is mine:
A=["hi how are you",
"have good day",
"where are you going ",
"do you like the place"]
B=["how","good","where","going","like","place"]
# I assume B contains only unique elements.
gg = {word: idx for (idx, word) in enumerate(B, start=1)}
print(gg)

lookup = lambda word: str(gg.get(word, 0))  # builds your index and gives efficient search with the proper object types.

def translate(str_):
    return ' '.join(lookup(word) for word in str_.split())

print(translate("hi how are you"))  # check for one sentence.

translated = [translate(sentence) for sentence in A]  # yey victory.
print(translated)

# Advanced usage
class missingdict(dict):
    def __missing__(self, key):
        return 0

miss = missingdict(gg)

def tr2(str_):
    return ' '.join(str(miss[word]) for word in str_.split())

print([tr2(sentence) for sentence in A])
You might also use the yield keyword once you are more confident in Python.
This is in Python 3.x
A=["hi how are you","have good day","where are you going ","do you like the place"]
B=["how","good","where","going","like","place"]
list(map(' '.join, map(lambda x:[str(B.index(i)+1) if i in B else '0' for i in x], [i.split() for i in A])))
Output:
['0 1 0 0', '0 2 0', '3 0 0 4', '0 0 5 0 6']