Deleting indeterminate substrings - python

I am relatively new to python. Suppose I have the following string -
tweet1= 'Check this out!! #ThrowbackTuesday I finally found this!!'
tweet2= "Man the summer is hot... #RisingSun #SummerIsHere Can't take it.."
Now, I am trying to delete all hashtags(#) within the tweets such that -
tweet1= 'Check this out!! I finally found this!!'
tweet2= "Man the summer is hot... Can't take it.."
My code was -
tweet1= 'Check this out!! #ThrowbackTuesday I finally found this!!'
i,j=0,0
s=tweet1
while i < len(tweet1):
    if tweet1[i]=='#':
        j=i
        while tweet1[j] != ' ':
            ++j
        while i<len(tweet1) and j<len(tweet1):
            ++j
            s[i]=tweet1[j]
            ++i
    ++i
print(s)
This code gives me no output and no errors which leads me to believe that I am using the wrong logic. Is there an easier solution to this using regex?

Here is a regex solution:
import re
re.sub(r'#\w+ ?', '', tweet1)
The regex means to delete a hash symbol followed by 1 or more word characters (letters, numbers, or underscore) optionally followed by a space (so you don't get two spaces in a row).
You can find plenty of material on regexes, both in general and in Python, with a quick search; it's not hard.
Additionally, to allow additional special characters, such as $ and #, replace \w with [\w$#], where the $# can be substituted with whatever characters you like, i.e. everything in the brackets is allowed.
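As a quick check of the pattern above, here is a minimal sketch that runs it on both tweets from the question. The helper name remove_hashtags is made up for illustration, and the trailing .strip() is an addition that only matters when a hashtag ends the tweet.

```python
import re

# '#' followed by one or more word characters and an optional single space
HASHTAG = re.compile(r'#\w+ ?')

def remove_hashtags(tweet):
    """Hypothetical helper: delete every hashtag, then trim leftover edge spaces."""
    return HASHTAG.sub('', tweet).strip()

tweet1 = 'Check this out!! #ThrowbackTuesday I finally found this!!'
tweet2 = "Man the summer is hot... #RisingSun #SummerIsHere Can't take it.."

print(remove_hashtags(tweet1))  # Check this out!! I finally found this!!
print(remove_hashtags(tweet2))  # Man the summer is hot... Can't take it..
```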

You can utilize split and startswith to accomplish your task.
Here split will make your tweet string a list of words separated by spaces. So then when iterating in a comprehension creating a new list, just omit anything starting with a #, by using startswith. Then ' '.join will simply make it a string again separated by spaces.
The code can be written as
tweet = 'Check this out!! #ThrowbackTuesday I finally found this!!'
print(' '.join([w for w in tweet.split() if not w.startswith('#')]))
Output:
Check this out!! I finally found this!!

Python doesn't have a ++ operator so ++j just applies the + operator to j twice which, of course, does nothing. You should use j += 1 instead.
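The snippet below is a small sketch of both points: that ++j leaves j unchanged, and what a character-by-character loop could look like once the counters really advance. It is one possible rewrite, not the asker's exact logic; the function name strip_hashtags is made up, and it builds a new list because Python strings are immutable (an assignment like s[i] = ... raises TypeError).

```python
j = 5
++j            # parsed as +(+j): evaluates to 5 and discards it, j is unchanged
print(j)       # 5
j += 1         # the actual increment
print(j)       # 6

def strip_hashtags(tweet):
    """Hypothetical helper: copy the tweet, skipping each '#word' and one space after it."""
    out = []
    i = 0
    while i < len(tweet):
        if tweet[i] == '#':
            # advance to the next space (or the end of the string)
            while i < len(tweet) and tweet[i] != ' ':
                i += 1
            # also skip that one space so no double space is left behind
            if i < len(tweet):
                i += 1
        else:
            out.append(tweet[i])
            i += 1
    return ''.join(out)

print(strip_hashtags('Check this out!! #ThrowbackTuesday I finally found this!!'))
```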

Related

how to find substring from a single line string

suppose, I have a string, s="panpanIpanAMpanJOEpan" . From this I want to find the word pan and replace it with spaces so that I can get the output string as "I AM JOE". How can I do it??
Actually I also don't know how to find certain substring from a long string without spaces such as mentioned above.
It will be great if someone helps me learning about this.
If you don't know pan in advance, you can exploit the fact that the letters you want to keep are all upper case.
fillword = min(set("".join(i if i.islower() else ' ' for i in s).split(' '))-set(['']),key=len)
This works by first replacing all upper case letters with space, then splitting on space and finding the minimal nonempty word.
Use replace to replace with space, and then strip to remove excess spacing.
s="panpanIpanAMpanJOEpan"
s.replace(fillword,' ').strip()
gives:
'I AM JOE'
Use replace:
s="panpanIpanAMpanJOEpan"
print(s.replace("pan"," ").strip())
Output:
I AM JOE
As DarrylG and others mentioned, .replace will do what you asked for, where you define what you want to replace ("pan") and what you want to replace it with (" ").
To find a certain string in a longer string you can use .find(), which takes a string you are looking for and optionally where to start and stop looking for it (as integers) as arguments.
If you want to find all of the occurrences of a string in a bigger string, there are two options:
Find the string with find(), then cut the string so it no longer contains your search term, and repeat this until the .find() method returns -1 (which means the search term is no longer found in the string),
or use the regex module (re) and its .finditer method to find all occurrences of your string.
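Both options can be sketched as follows. Instead of cutting the string, this variant passes a start position to find(), which has the same effect; the function name find_all is made up for illustration, and matches are treated as non-overlapping.

```python
import re

s = "panpanIpanAMpanJOEpan"

def find_all(haystack, needle):
    """Hypothetical helper: start index of every non-overlapping occurrence."""
    positions = []
    start = haystack.find(needle)
    while start != -1:                      # find() returns -1 when nothing is left
        positions.append(start)
        start = haystack.find(needle, start + len(needle))
    return positions

manual = find_all(s, "pan")

# The same search with the re module; re.escape keeps the needle literal.
with_regex = [m.start() for m in re.finditer(re.escape("pan"), s)]

print(manual)       # [0, 3, 7, 12, 18]
print(with_regex)   # [0, 3, 7, 12, 18]
```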
Edit: If you don't know what you are searching for, it becomes a bit more tricky, but you can write a regex expression that extracts this data as well, using the same regex module. This is easy if you know what the end result is supposed to be (I AM JOE in your case). If you don't, it becomes more complicated and we would need additional information to help with this.
You can use replace to replace all occurrences of a substring at once.
In case you want to find the substrings yourself, you can do it manually:
s = "panpanIpanAMpanJOEpan"
while True:
    panPosition = s.find('pan')  # find() returns -1 when 'pan' is not found
    if panPosition == -1:
        s = s.strip()
        break
    # Cut 'pan' out of s and replace it with a blank.
    s = s[:panPosition] + ' ' + s[panPosition + 3:]
print(s)
Out:
I AM JOE

Matching an optional '#' does not seem to be working properly

I'm attempting to get full words or hashtags from a string, it seems as though I'm applying the 'optional character' ? flag wrong in regex.
Here is my code:
print re.findall(r'(#)?\w*', text)
print re.findall(r'[#]?\w*', text)
Thus 'this is a sentence talking about this, #this, #that, #etc'
Should return matches for 'this' and '#this'
Yet it seems to be returning a list with empty strings as well as other random things.
What is wrong with the regex?
EDIT:
I'm attempting to get whole spam words, and I seem to have jumbled myself...
s = 'spamword'
print re.findall(r'(#)?'+s, text)
I need to match the whole word, and not word parts...
You can use word boundary in your regex:
s = 'spamword'
re.findall(r'#?' + s + r'\b', text)
The answers above explain why. Here is one piece of code that should work:
>>> re.findall(r'#?\w+\b', text)
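To make the cause of the empty strings concrete, here is a small sketch using the sentence from the question. With \w* the whole pattern can match nothing at all, and with a capturing group findall returns only the group. The helper find_spam and its leading (?<!\w) guard are additions for illustration, not part of the answers above; the guard is one way to also reject matches that begin in the middle of a word.

```python
import re

text = 'this is a sentence talking about this, #this, #that, #etc'

# '#?' and '\w*' may both match nothing, so empty matches appear at every space.
loose = re.findall(r'[#]?\w*', text)
print('' in loose)                       # True

# With a capturing group, findall returns only the group: '#' or ''.
grouped = re.findall(r'(#)?\w*', text)
print(set(grouped) == {'', '#'})         # True

# Requiring at least one word character removes the empty matches.
words = re.findall(r'#?\w+', text)
print(words)

def find_spam(word, text):
    """Hypothetical helper: match word or #word only as a whole word."""
    pattern = r'(?<!\w)#?' + re.escape(word) + r'\b'
    return re.findall(pattern, text)

hits = find_spam('spamword', 'spamword #spamword spamwords notspamword')
print(hits)                              # ['spamword', '#spamword']
```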

Multiple punctuation stripping

I tried multiple solutions here, and although they strip some characters, they don't seem to work on multiple consecutive punctuation marks, e.g. "[ or ',
This code:
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.match(i):
        regex.sub('', i)
which I got from "Best way to strip punctuation from a string in Python", was good, but I still encounter problems with double punctuation.
I added a while loop hoping to iterate over each word and remove multiple punctuation marks, but that does not seem to work; it just gets stuck on the first item, "[, and never exits.
Am I just missing some obvious piece that I am being oblivious to?
I solved the problem by adding a redundancy and double looping my lists, this takes extremely long time (well into the minutes) due to fairly large sets
I use Python 2.7
Your code doesn't work because regex.match needs the beginning of the string or complete string to match.
Also, you did not do anything with the return value of regex.sub(). sub doesn't work in place, but you need to assign its result to something.
regex.search returns a match if the pattern is found anywhere in the string and works as expected:
import re
import string
words = ['a.bc,,', 'cdd,gf.f.d,fe']
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.search(i):
        i = regex.sub('', i)
    print i
Edit: As pointed out below by @senderle, the while clause isn't necessary and can be left out completely.
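The reason the while loop can go is that sub already replaces every match in a single call. A minimal sketch, written in Python 3 syntax rather than the 2.7 used in the question; the sample words are made up.

```python
import re
import string

# Character class containing every ASCII punctuation character
punct = re.compile('[%s]' % re.escape(string.punctuation))

words = ['a.bc,,', 'cdd,gf.f.d,fe', '"[', "',"]

# One sub() call per word removes all punctuation, repeated or not.
cleaned = [punct.sub('', w) for w in words]
print(cleaned)   # ['abc', 'cddgffdfe', '', '']
```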
this will replace everything not alphanumeric ...
re.sub("[^a-zA-Z0-9 ]","",my_text)
>>> re.sub("[^a-zA-Z0-9 ]","","A [Black. Cat' On a Hot , tin roof!")
'A Black Cat On a Hot  tin roof'
Here is a simple way:
>>> print str.translate("My&& Dog's {{{%!@#%!@#$L&&&ove Sal*mon", None, '~`!@#$%^&*()_+=-[]\|}{;:/><,.?\"\'')
My Dogs Love Salmon
Using this str.translate function will eliminate the punctuation. I usually use this for eliminating numbers from DNA sequence reads.
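The call above relies on the Python 2 signature of str.translate. Under Python 3, a rough equivalent builds a deletion table with str.maketrans; the sketch below assumes that string.punctuation is an acceptable set of characters to delete, which is slightly different from the hand-typed list above.

```python
import string

# Third argument of maketrans: characters to delete
table = str.maketrans('', '', string.punctuation)

messy = "My&& Dog's {{{%!@#%!@#$L&&&ove Sal*mon"
clean = messy.translate(table)
print(clean)   # My Dogs Love Salmon
```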

How to find all words followed by symbol using Python Regex?

I need re.findall to detect words that are followed by a "="
So it works for an example like
re.findall('\w+(?=[=])', "I think Python=amazing")
but it won't work for "I think Python = amazing" or "Python =amazing"...
I do not know how to possibly integrate the whitespace issue here properly.
Thanks a bunch!
r'(\w+)\s*=\s*'
re.findall(r'(\w+)\s*=\s*', 'I think Python=amazing')    # returns ['Python']
re.findall(r'(\w+)\s*=\s*', 'I think Python = amazing')  # returns ['Python']
re.findall(r'(\w+)\s*=\s*', 'I think Python =amazing')   # returns ['Python']
You said "Again stuck in the regex" probably in reference to your earlier question Looking for a way to identify and replace Python variables in a script where you got answers to the question that you asked, but I don't think you asked the question you really wanted the answer to.
You are looking to refactor Python code, and unless your tool understands Python, it will generate false positives and false negatives; that is, finding instances of variable = that aren't assignments and missing assignments that aren't matched by your regexp.
There is a partial list of tools at What refactoring tools do you use for Python? and more general searches with "refactoring Python your_editing_environment" will yield more still.
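To illustrate the false-positive concern, here is a small sketch that compares a regex with Python's own parser on a made-up snippet. The sample source is invented, and the ast walk only looks at plain name = value assignments, so treat it as a starting point rather than a refactoring tool.

```python
import ast
import re

# Made-up source: a comparison, a real assignment, and '=' inside a string
source = "if x == y:\n    z = 1\nprint('a = b')\n"

# The regex cannot tell '==' or string contents from an assignment.
regex_hits = re.findall(r'\w+(?=\s*=)', source)
print(regex_hits)   # ['x', 'z', 'a']

# The parser sees only the real assignment target.
tree = ast.parse(source)
assigned = [
    target.id
    for node in ast.walk(tree)
    if isinstance(node, ast.Assign)
    for target in node.targets
    if isinstance(target, ast.Name)
]
print(assigned)     # ['z']
```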
Just add some optional whitespace before the =:
\w+(?=\s*=)
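A quick check of this pattern against the three spacings from the question (the samples list is just those strings collected together):

```python
import re

# Word followed by optional whitespace and '='; the lookahead keeps '=' out of the match
pattern = re.compile(r'\w+(?=\s*=)')

samples = [
    "I think Python=amazing",
    "I think Python = amazing",
    "Python =amazing",
]

results = [pattern.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(repr(sample), found)   # each line ends with ['Python']
```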
Use this instead
re.findall('^(.+)(?=[=])', "I think Python=amazing")
Explanation
# ^(.+)(?=[=])
#
# Options: case insensitive
#
# Assert position at the beginning of the string «^»
# Match the regular expression below and capture its match into backreference number 1 «(.+)»
# Match any single character that is not a line break character «.+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=[=])»
# Match the character “=” «[=]»
You need to allow for whitespace between the word and the =:
re.findall(r'\w+(?=\s*[=])', "I think Python = amazing")
You can also simplify the expression by using a capturing group around the word, instead of a lookahead around the equals sign:
re.findall(r'(\w+)\s*=', "I think Python = amazing")
r'(.*)=.*' would do it as well ...
You have anything #1 followed with a = followed with anything #2, you get anything #1.
>>> re.findall(r'(.*)=.*', "I think Python=amazing")
['I think Python']
>>> re.findall(r'(.*)=.*', " I think Python = amazing oh yes very amazing ")
[' I think Python ']
>>> re.findall(r'(.*)=.*', "= crazy ")
['']
Then you can strip() the string that is in the list returned.
re.split(r'\s*=', "I think Python=amazing")[0].split() # returns ['I', 'think', 'Python']

Find words with capital letters not at start of a sentence with regex

Using Python and regex I am trying to find words in a piece of text that start with a capital letter but are not at the start of a sentence.
The best way I can think of is to check that the word is not preceded by a full stop then a space. I am pretty sure that I need to use negative lookbehind. This is what I have so far, it will run but always returns nothing:
(?<!\.\s)\b[A-Z][a-z]*\b
I think the problem might be with the use of [A-Z][a-z]* inside the word boundary \b but I am really not sure.
Thanks for the help.
Your regex appears to work:
In [6]: import re
In [7]: re.findall(r'(?<!\.\s)\b[A-Z][a-z]*\b', 'lookbehind. This is what I have')
Out[7]: ['I']
Make sure you're using a raw string (r'...') when specifying the regex.
If you have some specific inputs on which the regex doesn't work, please add them to your question.
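The raw-string advice matters because, in an ordinary string literal, \b is a backspace character rather than the regex word boundary. The sketch below shows one plausible way the pattern could "always return nothing"; whether that was the asker's actual problem is an assumption.

```python
import re

print(len('\b'), len(r'\b'))   # 1 2  (backspace vs. backslash + 'b')

text = 'lookbehind. This is what I have'

pattern_raw = r'(?<!\.\s)\b[A-Z][a-z]*\b'
# Same pattern without the r prefix; \. and \s are escaped by hand,
# but each \b silently becomes a backspace character.
pattern_plain = '(?<!\\.\\s)\b[A-Z][a-z]*\b'

raw_hits = re.findall(pattern_raw, text)
plain_hits = re.findall(pattern_plain, text)

print(raw_hits)     # ['I']
print(plain_hits)   # []
```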
Although you asked specifically for a regex, it may be interesting to also consider a list comprehension. They're sometimes a bit more readable (although in this case, probably at the cost of efficiency). Here's one way to achieve this:
import string
S = ("T'was brillig, and the slithy Toves were gyring and gimbling in the "
     "Wabe. All mimsy were the Borogoves, and the Mome Raths outgrabe.")
LS = S.split(' ')
# Pair each word with its predecessor; the leading '.' marks the first word as a sentence start.
words = [x for (pre, x) in zip(['.'] + LS, LS + [' '])
         if (x[0] in string.ascii_uppercase) and (pre[-1] != '.')]
Try and loop over your input with:
(?!^)\b([A-Z]\w+)
and capture the first group. As you can see, a negative lookahead can be used as well, since the position you want to match is everything but a beginning of line. A negative lookbehind would have the same effect.
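The two ideas in this thread can be combined. On its own, the lookbehind from the question still accepts the very first word of the text, and (?!^) on its own only excludes the start of the string, not later sentence starts. The sample sentence below is made up, and the combined pattern is a sketch rather than a complete sentence detector (it ignores "!" and "?", for example).

```python
import re

text = 'The cat met Alice. Then Bob and Alice left.'

# Lookbehind only: the first word still matches, since nothing precedes it.
lookbehind_only = re.findall(r'(?<!\.\s)\b[A-Z][a-z]*\b', text)
print(lookbehind_only)   # ['The', 'Alice', 'Bob', 'Alice']

# (?!^) only: skips the start of the string but keeps 'Then'.
not_at_start = re.findall(r'(?!^)\b([A-Z]\w+)', text)
print(not_at_start)      # ['Alice', 'Then', 'Bob', 'Alice']

# Both conditions together.
combined = re.findall(r'(?!^)(?<!\.\s)\b[A-Z][a-z]*\b', text)
print(combined)          # ['Alice', 'Bob', 'Alice']
```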
