I'm trying to preprocess message data from the StockTwits API. How can I remove all instances of $name from a string in Python?
For example if the string is:
$AAPL $TSLA $MSFT are all going up!
The output would be:
are all going up!
Something like this would do:
>>> s = "$AAPL $TSLA $MSFT are all going up!"
>>> re.sub(r"\$[a-zA-Z0-9]+\s*", "", s)
'are all going up!'
This allows digits in the name as well; remove 0-9 from the character class if that's not what you want (as written, it would also remove e.g. $15).
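For instance, a letters-only variant that keeps dollar amounts (the sample string here is made up for illustration):
>>> s = "$AAPL up 15% from $15"
>>> re.sub(r"\$[a-zA-Z]+\s*", "", s)
'up 15% from $15'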
I'm not sure I get it, but if the goal is to remove all words that start with $, I would split into individual strings, filter out the ones starting with $, and re-form the string using a list comprehension:
string = "$AAPL $TSLA $MSFT are all going up!"
substrings = string.split(' ')
# Keep only the words that don't start with '$'.
substrings = [s for s in substrings if not s.startswith('$')]
new_string = ' '.join(substrings)
Some would use regular expressions, which are likely more computationally efficient but less readable.
So let's say I have a string like:
string = 'abcd <# string that has whitespace> efgh'
I want to delete all the whitespace inside <#...> without affecting anything outside it. The characters outside <#...> can change too, so <#...> is not in a fixed position.
How should I do this?
This is not a complicated operation. Just do it the way you would by hand: find the two delimiters, keep the part before the first one, remove the spaces from the middle, and keep the rest.
string = 'abcd <# string that has whitespace> efgh'
i1 = string.find('<#')  # start of the bracketed section
i2 = string.find('>')   # end of the bracketed section
res = string[:i1] + string[i1:i2].replace(' ','') + string[i2:]
print(res)
Output:
abcd <#stringthathaswhitespace> efgh
How about this...
string = 'abcd <# string that has whitespace> efgh'
s = string.split()
# Works when <#...> spans everything between the first and last word.
s = ' '.join((s[0], ''.join(s[1:-1]), s[-1]))
If <#...> exists consistently, one way to find it is to use regular expressions (regex) to locate the part of the string with the characters you want to modify; you then strip out the whitespace.
It takes a bit to get your head around regex, but they can be a powerful tool.
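A minimal sketch of that regex approach, using re.sub with a replacement function (unlike the index-based version above, this also handles multiple <#...> groups):
import re

string = 'abcd <# string that has whitespace> efgh'
# Replace each <#...> group with a copy of itself minus the spaces.
result = re.sub(r'<#.*?>', lambda m: m.group(0).replace(' ', ''), string)
print(result)  # abcd <#stringthathaswhitespace> efgh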
I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to set all of these instances to an empty string, "". Is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".
If I understand correctly that the digits are always at the end of the string, have a look at the solutions below.
with regex:
import re
text = 'xyz125'
s = re.sub(r"\d+$", '', text)  # \d+$ matches digits anchored to the end
print(s)
it should print:
xyz
without regex (keep in mind that this solution removes all digits, not only the ones at the end of the string):
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
it should print:
xyz
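If the goal is the literal ^<number> markers from the original question rather than trailing digits, a single substitution covers every case (a sketch with a made-up string):
import re

s = 'alpha^0 beta^12 gamma^3'
# \^ is a literal caret; \d+ matches the digits that follow it.
print(re.sub(r'\^\d+', '', s))  # alpha beta gamma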
I need to split strings based on a sequence of regex patterns. I am able to apply each split individually, but the issue is recursively splitting the resulting sentences.
For example I have this sentence:
"I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
I would need to split the sentence based on ",", ";" and ".".
The result should be 5 sentences like:
"I want to be splitted using different patterns."
"It is a complex task,"
"and not easy to solve;"
"so,"
"I would need help."
My code so far:
import re

sample_sentence = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
patterns = [re.compile(r'(?<=\.) '),
            re.compile(r'(?<=,) '),
            re.compile(r'(?<=;) ')]
for pattern in patterns:
    splitted_sentences = pattern.split(sample_sentence)
    print(f'Pattern used: {pattern}')
How can I apply the different patterns without losing the results and get the expected result?
Edit: I need to run each pattern one by one, as I need to do some checks on the result of every pattern, running it in some sort of tree algorithm. Sorry for not explaining this fully; it was clear in my head, but I did not think it would have side effects.
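One way to keep each pattern's results and feed them into the next pass, sketched with the same patterns as above: split every piece produced so far with the current pattern and flatten, so nothing is lost between passes.
import re

sample_sentence = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
patterns = [re.compile(r'(?<=\.) '),
            re.compile(r'(?<=,) '),
            re.compile(r'(?<=;) ')]

pieces = [sample_sentence]
for pattern in patterns:
    # Re-split every piece from the previous pass and flatten the result,
    # so you can inspect the pieces after each pattern if you need to.
    pieces = [part for piece in pieces for part in pattern.split(piece)]
print(pieces)  # the five expected sentences, delimiters kept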
You can join each pattern with |:
import re
s = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
result = re.split(r'(?<=\.)\s|,\s*|;\s*', s)
Output:
['I want to be splitted using different patterns.', 'It is a complex task', 'and not easy to solve', 'so', 'I would need help.']
Note that the ,\s* and ;\s* branches consume the commas and semicolons; if you want to keep them, as in your expected output, use lookbehinds for all three delimiters (see the lookbehind answer below).
Python has this built into re.
Try
re.split(r'; |, |\. ', ourString)
Note that the dot must be escaped; an unescaped . matches any character.
I can't think of a single regex to do this, so what you can do is replace all the different types of delimiters with a custom-defined delimiter, say $DELIMITER$, and then split your sentence on that delimiter.
new_sent = re.sub('[.,;]', '$DELIMITER$', sent)
new_sent.split('$DELIMITER$')
This will result in the following:
['I want to be splitted using different patterns',
' It is a complex task',
' and not easy to solve',
' so',
' I would need help',
'']
NOTE: The above output has an additional empty string because there is a period at the end of the sentence. To avoid this, you can either remove that empty element from the list or strip the custom delimiter when it occurs at the end of the sentence:
new_sent = re.sub('[.,;]', '$DELIMITER$', sent)
new_sent = re.sub(r'\$DELIMITER\$$', '', new_sent)
new_sent.split('$DELIMITER$')
In case you have a list of delimiters, you can build your regex pattern using the following code:
delimiter_list = [',', '.', ':', ';']
pattern = '[' + ''.join(delimiter_list) + ']'  # will result in [,.:;]
new_sent = re.sub(pattern, '$DELIMITER$', sent)
new_sent = re.sub(r'\$DELIMITER\$$', '', new_sent)
new_sent.split('$DELIMITER$')
I hope this helps!
Use a lookbehind with a character class:
import re
s = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
result = re.split(r'(?<=[.,;])\s', s)
print(result)
Output:
['I want to be splitted using different patterns.',
'It is a complex task,',
'and not easy to solve;',
'so,',
'I would need help.']
The following code is for an assignment that asks that a string of sentences be entered by a user and that the beginning of each sentence be capitalized by a function.
For example, if a user enters: 'hello. these are sample sentences. there are three of them.'
The output should be: 'Hello. These are sample sentences. There are three of them.'
I have created the following code:
def main():
    sentences = input('Enter sentences with lowercase letters: ')
    capitalize(sentences)

# This function capitalizes the first letter of each sentence
def capitalize(user_sentences):
    sent_list = user_sentences.split('. ')
    new_sentences = []
    count = 0
    for count in range(len(sent_list)):
        new_sentences = sent_list[count]
        new_sentences = (new_sentences + '. ')
        print(new_sentences.capitalize())

main()
This code has two issues that I am not sure how to correct. First, it prints each sentence on a new line. Second, it adds an extra period at the end. The output from this code using the sample input from above would be:
Hello.
These are sample sentences.
There are three of them..
Is there a way to format the output to be one line and remove the final period?
The following works for reasonably clean input:
>>> s = 'hello. these are sample sentences. there are three of them.'
>>> '. '.join(x.capitalize() for x in s.split('. '))
'Hello. These are sample sentences. There are three of them.'
If there is more varied whitespace around the full stop, you might have to use some more sophisticated logic:
>>> '. '.join(x.strip().capitalize() for x in s.split('.'))
This normalizes the whitespace, which may or may not be what you want.
def main():
    sentences = input('Enter sentences with lowercase letters: ')
    capitalizeFunc(sentences)

def capitalizeFunc(user_sentences):
    sent_list = user_sentences.split('. ')
    print(".".join((i.capitalize() for i in sent_list)))

main()
Output:
Enter sentences with lowercase letters: "hello. these are sample sentences. there are three of them."
Hello.These are sample sentences.There are three of them.
(Joining with '. ' instead of '.' would restore the spaces between sentences.)
I think this might be helpful:
>>> sentence = input()
>>> '. '.join(map(lambda s: s.strip().capitalize(), sentence.split('.')))
>>> s = 'hello. these are sample sentences. there are three of them.'
>>> '. '.join(map(str.capitalize, s.split('. ')))
'Hello. These are sample sentences. There are three of them.'
This code has two issues that I am not sure how to correct. First, it prints each sentence on a new line.
That’s because you’re printing each sentence with a separate call to print. By default, print adds a newline. If you don’t want it to, you can override what it adds with the end keyword parameter. If you don’t want it to add anything at all, just use end=''
Second, it adds an extra period at the end.
That’s because you’re explicitly adding a period to every sentence, including the last one.
One way to fix this is to keep track of the index as well as the sentence as you’re looping over them—e.g., with for index, sentence in enumerate(sentences):. Then you only add the period if index isn’t the last one. Or, slightly more simply, you add the period at the start, if the index is anything but zero.
However, there's a better way out of both of these problems. You split the string into sentences by splitting on '. '. You can join those sentences back into one big string by doing the exact opposite:
sentences = '. '.join(sentences)
Then you don’t need a loop (there’s one hidden inside join of course), you don’t need to worry about treating the last or first one special, and you only have one print instead of a bunch of them so you don’t need to worry about end.
A different trick is to put the cleverness of print to work for you instead of fighting it. Not only does it add a newline at the end by default, it also lets you print multiple things and adds a space between them by default. For example, print(1, 2, 3) or, equivalently, print(*[1, 2, 3]) will print out 1 2 3. And you can override that space separator with anything else you want. So you can print(*sentences, sep='. ', end='') to get exactly what you want in one go. However, this may be a bit opaque or over-clever to people reading your code. Personally, whenever I can use join instead (which is usually), I do that even though it’s a bit more typing, because it makes it more obvious what’s happening.
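To make the two options concrete (the sentence list here is a made-up sample):
sentences = ['Hello', 'These are sample sentences', 'There are three of them']

# join-based: explicit and easy to read
print('. '.join(sentences) + '.')

# print-based: same output, using sep and end
print(*sentences, sep='. ', end='.\n')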
As a side note, a bit of your code is misleading:
new_sentences = []
count = 0
for count in range(len(sent_list)):
    new_sentences = sent_list[count]
    new_sentences = (new_sentences + '. ')
    print(new_sentences.capitalize())
The logic of that loop is fine, but it would be a lot easier to understand if you called the one-new-sentence variable new_sentence instead of new_sentences, and didn’t set it to an empty list at the start. As it is, the reader is led to expect that you’re going to build up a list of new sentences and then do something with it, but actually you just throw that list away at the start and handle each sentence one by one.
And, while we’re at it, you don’t need count here; just loop over sent_list directly:
for sentence in sent_list:
    new_sentence = sentence + '. '
    print(new_sentence.capitalize())
This does the same thing as the code you had, but I think it's easier to tell from a quick glance that it does that.
(Of course you still need the fixes for your two problems.)
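Putting the two fixes together, a minimal corrected version of the function (same split-on-'. ' assumption as the original) might look like:
def capitalize(user_sentences):
    sent_list = user_sentences.split('. ')
    # Rejoin with the same '. ' separator: one print, no stray final period.
    print('. '.join(sentence.capitalize() for sentence in sent_list))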
Use nltk.sent_tokenize to tokenize the string into sentences, capitalize each sentence, and join them again.
A sentence doesn't always end with a .; it can end with other things too, like a ? or !. Also, three consecutive dots (...) don't end a sentence. sent_tokenize handles all of these.
from nltk.tokenize import sent_tokenize

def capitalize(user_sentences):
    sents = sent_tokenize(user_sentences)
    capitalized_sents = [sent.capitalize() for sent in sents]
    joined_ = ' '.join(capitalized_sents)
    print(joined_)
The reason your sentences were being printed on separate lines is that print always ends its output with a newline, so printing the sentences separately in a loop puts each one on its own line. Print them all at once after joining them, or specify end='' in the print call so it doesn't append a newline.
The extra period at the end appears because you're appending '. ' to every sentence, including the last. The good thing about sent_tokenize is that it doesn't remove '.', '?', etc. from the ends of the sentences, so you don't have to append '. ' manually; you can just join the sentences with a space character and you'll be good to go.
If you get an error for nltk not being recognized, you can install it by running pip install nltk on the terminal/cmd.
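A quick usage check (this assumes the punkt tokenizer data has been fetched with nltk.download('punkt')):
capitalize('hello. these are sample sentences. there are three of them.')
# Hello. These are sample sentences. There are three of them.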
I really apologize if this has been answered before but I have been scouring SO and Google for a couple of hours now on how to properly do this. It should be easy and I know I am missing something simple.
I am trying to read from a file and count all occurrences of elements from a list. This list is not just whole words though. It has special characters and punctuation that I need to get as well.
This is what I have so far, I have been trying various ways and this post got me the closest:
Python - Finding word frequencies of list of words in text file
So I have a file that contains a couple of paragraphs and my list of strings is:
listToCheck = ['the', 'The ', 'the,', 'the;', 'the!', "the'", 'the.', "'the"]
My full code is:
#!/usr/bin/python
import re
from collections import Counter

f = open('text.txt', 'r')
wanted = ['the', 'The ', 'the,', 'the;', 'the!', "the'", 'the.', "'the"]
words = re.findall(r'\w+', f.read().lower())
cnt = Counter()
for word in words:
    if word in wanted:
        print(word)
        cnt[word] += 1
print(cnt)
my output thus far looks like:
the
the
the
(... seventeen lines of "the" in total ...)
Counter({'the': 17})
It is counting my "the" strings with punctuation, but it lumps them all together instead of counting each variant separately. I know it is because of the \w+. I am just not sure what the proper regex pattern to use here is, or if I'm going about this the wrong way.
I suspect there may be some extra details to your specific problem that you are not describing here for simplicity. However, I'll assume that what you are looking for is to find a given word, e.g. "the", which could have either an upper or lower case first letter, and can be preceded and followed either by a whitespace or by some punctuation characters such as ;,.!'. You want to count the number of all the distinct instances of this general pattern.
I would define a single (non-disjunctive) regular expression that defines this. Something like this:
import re
pattern = re.compile(r"[\s',;.!][Tt]he[\s.,;'!]")
(That might not be exactly what you are looking for in general. I am just assuming it is, based on what you stated above.)
Now, let's say our text is
text = '''
Foo and the foo and ;the, foo. The foo 'the and the;
and the' and the; and foo the, and the. foo.
'''
We could do
matches = pattern.findall(text)
where matches will be
[' the ',
';the,',
' The ',
"'the ",
' the;',
" the'",
' the;',
' the,',
' the.']
And then you just count.
from collections import Counter

count = Counter()
for match in matches:
    count[match] += 1
which in this case would lead to
Counter({' the;': 2, ' the.': 1, ' the,': 1, " the'": 1, ' The ': 1, "'the ": 1, ';the,': 1, ' the ': 1})
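Incidentally, Counter can consume an iterable directly, so the counting loop collapses to one call:
from collections import Counter

count = Counter(matches)  # same tallies as the loop above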
As I said at the start, this might not be exactly what you want, but hopefully you could modify this to get what you want.
Just to add: a difficulty with using a disjunctive regular expression like
'the|the;|the,|the!'
is that strings like "the," and "the;" will also match the first alternative, i.e. "the", and that is what will be returned as the match. Even though this problem could be avoided by ordering the alternatives more carefully, I think it is not easier in general.
The simplest option is to combine all "wanted" strings into one regular expression:
rr = '|'.join(map(re.escape, wanted))
and then find all matches in the text using re.findall.
To make sure longer strings match first, just sort the wanted list by length:
wanted.sort(key=len, reverse=True)
rr = '|'.join(map(re.escape, wanted))
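Putting it together, a sketch with a made-up snippet of text (the text value is hypothetical):
import re
from collections import Counter

wanted = ['the', 'The ', 'the,', 'the;', 'the!', "the'", 'the.', "'the"]
# Longest alternatives first, so e.g. 'the,' wins over plain 'the'.
wanted.sort(key=len, reverse=True)
rr = '|'.join(map(re.escape, wanted))

text = "The cat saw the, dog and 'the bird; found the."
print(Counter(re.findall(rr, text)))
# e.g. Counter({'The ': 1, 'the,': 1, "'the": 1, 'the.': 1})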