Removing parts of a string after certain chars in Python

New to Python.
I'd like to remove the substrings between the word AND and the comma character in the following string:
MyString = ' x.ABC AND XYZ, \ny.DEF AND Type, \nSome Long String AND Qwerty, \nz.GHI AND Tree \n'
The result should be:
MyString = ' x.ABC,\ny.DEF,\nSome Long String,\nz.GHI\n'
I'd like to do it without using regex.
I have tried various methods with splits and joins and indexes to no avail.
Any direction appreciated.
Thanks.

While Moses's answer is really good, I have a feeling this is a homework question and that you are not meant to use any imports. Here's an answer with no imports. It is not as efficient as the other answers, such as Moses's or a regex, but it works.
MyString = 'x.ABC AND XYZ, \ny.DEF AND Type, \nSome Long String AND Qwerty, \nz.GHI AND Tree \n'
new_string = ''
for each in [x.split(' AND ')[0] for x in MyString.split('\n')]:
    new_string += each
    new_string += '\n'
print(new_string)

You can split the string into lines, split each line into words, and use itertools.takewhile to keep only the words that come before AND (dropping AND itself and everything after it):
from itertools import takewhile
''.join(' '.join(takewhile(lambda x: x != 'AND', line.split())) + ',\n'
for line in MyString.splitlines())
Notice that the newline character and a comma are manually added after each line is reconstructed with str.join.
All the lines are then finally joined using str.join.
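If the trailing comma after the last line is unwanted, a variant without imports or regex could use str.partition. This is only a minimal sketch, assuming each line separates the part to keep with ' AND ' (single spaces on both sides); the names heads and result are illustrative.

```python
MyString = ' x.ABC AND XYZ, \ny.DEF AND Type, \nSome Long String AND Qwerty, \nz.GHI AND Tree \n'

# Keep only the text before the first ' AND ' on each line.
# partition returns (head, separator, tail); head is the whole line
# when the separator is absent.
heads = [line.partition(' AND ')[0] for line in MyString.splitlines()]

# Joining with ',\n' puts a comma between lines only, so the last
# line does not get one; the final newline is added back by hand.
result = ',\n'.join(heads) + '\n'
print(repr(result))
```

Because nothing is stripped, the leading space of the first line is preserved, which matches the result asked for in the question.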

Now it is working, and avoiding explicit append calls in a loop probably makes it fast:
In [19]: ',\n'.join([x.split('AND')[0].strip() for x in MyString.split('\n')])
Out[19]: 'x.ABC,\ny.DEF,\nSome Long String,\nz.GHI,\n'
To understand why, see this answer:
Comparing list comprehensions and explicit loops (3 array generators faster than 1 for loop)

Related

Remove instances of words following a character in python

I'm trying to preprocess message data from the StockTwits API. How can I remove all instances of $name from a string in Python?
For example if the string is:
$AAPL $TSLA $MSFT are all going up!
The output would be:
are all going up!
Something like this would do:
>>> import re
>>> s = "$AAPL $TSLA $MSFT are all going up!"
>>> re.sub(r"\$[a-zA-Z0-9]+\s*", "", s)
'are all going up!'
This allows digits in the name as well; remove 0-9 from the character class if that's not what you want (as written, it would also remove e.g. $15).
I'm not sure I get it, but if the goal is to remove every word that starts with $, I would split the string into words, filter out the ones that start with $, and rebuild the string using a list comprehension.
substrings = string.split(' ')
substrings = [s for s in substrings if not s.startswith('$')]
new_string = ' '.join(substrings)
Some would use regular expressions, which are likely more computationally efficient but harder to read.
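For comparison, the split-and-filter idea can be wrapped in a small runnable function. This is a sketch only, assuming words are separated by whitespace; remove_cashtags is a hypothetical name, and split() without arguments is used instead of split(' ') so that runs of spaces collapse.

```python
def remove_cashtags(message):
    """Drop every whitespace-separated word that starts with '$'."""
    words = message.split()
    kept = [word for word in words if not word.startswith('$')]
    return ' '.join(kept)


cleaned = remove_cashtags("$AAPL $TSLA $MSFT are all going up!")
print(cleaned)
```

Like the regex above, this also drops amounts such as $15, so an extra check would be needed if those should stay.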

How to split strings with special characters without removing those characters?

I'm writing a function that needs to return an abbreviated version of a str. The returned str must contain the first letter, the number of characters removed, and the last letter. It must abbreviate per word rather than per sentence, and afterwards I need to join the words again in the same format, including the special characters. I tried using re.findall(), but it drops the special characters, so " ".join() leaves them out of the result.
Here's my code:
import re
def abbreviate(wrd):
    return " ".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.findall(r"[\w']+", wrd)])
print(abbreviate("elephant-rides are really fun!"))
The output would be:
e6t r3s are r4y fun
But the output should be:
e6t-r3s are r4y fun!
No need for str.join. Might as well take full advantage of what the re module has to offer.
re.sub accepts a string or a callable object (like a function or lambda), which takes the current match as an input and must return a string with which to replace the current match.
import re
pattern = "\\b[a-z]([a-z]{2,})[a-z]\\b"
string = "elephant-rides are really fun!"
def replace(match):
    return f"{match.group(0)[0]}{len(match.group(1))}{match.group(0)[-1]}"
abbreviated = re.sub(pattern, replace, string)
print(abbreviated)
Output:
e6t-r3s are r4y fun!
Maybe someone else can improve upon this answer with a cuter pattern, or any other suggestions. The way the pattern is written now, it assumes that you're only dealing with lowercase letters, so that's something to keep in mind - but it should be pretty straightforward to modify it to suit your needs. I'm not really a fan of the repetition of [a-z], but that's just the quickest way I could think of for capturing the "inner" characters of a word in a separate capturing group. You may also want to consider what should happen with words/contractions like "don't" or "shouldn't".
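One way the lowercase-only limitation mentioned above might be lifted is to capture the first, inner, and last letters in separate groups and allow both cases. This is a sketch under the assumption that "letters" means ASCII letters only; PATTERN and this version of abbreviate are illustrative names, not part of the original answer.

```python
import re

# Assumption: only ASCII letters count, in either case.
# Group 1 = first letter, group 2 = inner letters, group 3 = last letter.
PATTERN = re.compile(r"\b([A-Za-z])([A-Za-z]{2,})([A-Za-z])\b")


def abbreviate(text):
    # Replace each word of four or more letters with
    # first letter + count of inner letters + last letter.
    return PATTERN.sub(
        lambda m: f"{m.group(1)}{len(m.group(2))}{m.group(3)}", text
    )


abbreviated = abbreviate("Elephant-rides are REALLY fun!")
print(abbreviated)
```

Using three groups avoids indexing into match.group(0), at the cost of repeating the character class once more.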
Thank you for viewing my question. After a few more searches and some trial and error, I finally found a way to make my code work without changing it too much. I simply replaced re.findall(r"[\w']+", wrd) with re.split(r'([\W\d\_])', wrd) and changed " ".join() to "".join(), since the capturing group keeps the separators in the split result and the extra spaces are no longer needed.
import re
def abbreviate(wrd):
    return "".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.split(r'([\W\d\_])', wrd)])
print(abbreviate("elephant-rides are not fun!"))
Output:
e6t-r3s are not fun!

TypeError: 'int' object is not iterable. When iterating over each string in string data

Good day,
My objective is to use .lower() to convert each string in the text data to lowercase. I tried to use .count() in a one-line iteration. However, I get the following error:
TypeError: 'int' object is not iterable
Here is my code:
# Iterating over the strings in the data. The data is called text
text_lowercase = ''.join((string.lower().strip() for string in text.count(0,)))
I want to use the one-liner iteration and do this.
Help would be truly appreciated. Cheers!
text.count returns an integer. You try to iterate over it:
for string in text.count(0,)
but since it returns an integer, there is nothing to iterate over (an int is not iterable). This is what the error message is telling you.
In the future, to better identify the source of an error, try to break up one-liners into multiple lines. This will give you better feedback on what part of your operation is failing.
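Breaking the one-liner into steps might look like the following. This is a sketch only: the question does not show what text actually is, so a list of strings is assumed here, and the names occurrences and message are illustrative.

```python
# Assumption: text is a list of strings; its real type is not shown
# in the question.
text = ["Hello ", " World"]

# Step 1: evaluate the suspicious sub-expression on its own.
occurrences = text.count(0)
print(type(occurrences))  # count() gives back an int, not a sequence

# Step 2: iterating over that int reproduces the error in isolation.
try:
    for item in occurrences:
        pass
except TypeError as error:
    message = str(error)
    print(message)

# Step 3: iterate over the data itself instead of over the count.
text_lowercase = ''.join(s.lower().strip() for s in text)
print(text_lowercase)
```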
There are a couple of issues to point out here:
text_lowercase = ''.join((string.lower().strip() for string in text.count(0,)))
Naming the temporary variable string is a bad idea, as it looks a lot like a type name. Something like s is more common and readable.
Or perhaps word, because that seems to be what you are after. This is the second problem: your approach seems to break the string down into characters, but from the comments it appears you'd like to operate on words (your use of strip also suggests this).
You're joining on '', which joins all the parts of the string with no space between them.
As others point out, count returns an integer, but you want to operate on the actual string. You indicated you only tried count in order to iterate, and that's not needed in Python the way it is in many other languages.
Putting it together for words:
text_lowercase = ' '.join([w.lower() for w in text.split(' ')])
Or if you're after characters:
text_lowercase = ''.join([ch.lower() for ch in text])
But then you could just:
text_lowercase = text.lower()
Perhaps you like words, but want to get rid of excess spaces?
text_lowercase = ' '.join([w.lower() for w in text.split(' ') if w != ''])
Or in shorthand:
text_lowercase = ' '.join([w.lower() for w in text.split(' ') if w])
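As a further shorthand, str.split() called with no argument splits on any run of whitespace and never returns empty strings, so the filter is not needed at all. A minimal sketch, with an illustrative sample value for text:

```python
# Sample input with leading, trailing, and repeated whitespace.
text = "  Hello   WORLD \n again "

# Lowercase once, split on any whitespace run, rejoin with single spaces.
text_lowercase = ' '.join(text.lower().split())
print(text_lowercase)
```

Note that this also treats tabs and newlines as separators, which split(' ') does not.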
The exception you get is because count() returns an int and you then try to iterate over that int. I think you should remove the count and you might be good to go (depending on what text looks like).
If you want a function that lowercases only the occurrences of string inside your text, maybe you can use something like this:
def lowercase_instance(text, string):
    return string.lower().join(text.split(string))
Now, if you have a list of texts then you can do something like this:
lowercase_texts = [lowercase_instance(text, string) for text in texts]
hopefully this helps!
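The split-and-join trick above has the same effect as str.replace, which may read more directly. A sketch with made-up sample data; texts and the target word "WORLD" are assumptions for illustration only.

```python
def lowercase_instance(text, string):
    # Replace every occurrence of string with its lowercase form.
    return text.replace(string, string.lower())


# Hypothetical sample data.
texts = ["Hello WORLD", "WORLD peace, WORLD wide"]
lowercase_texts = [lowercase_instance(text, "WORLD") for text in texts]
print(lowercase_texts)
```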

Python replace oneliner without using regexp

I have my code here:
a = u"\n".join(my_array).replace(u"\n\n", u"\n")
The problem is that a run such as "\n\n\n\n" is only reduced to "\n\n", and I want a single "\n".
So I've come up with:
a = u"\n".join(my_array)
while a.find(u"\n\n")>=0:
    a = a.replace(u"\n\n", u"\n")
I was wondering if there's a more elegant way / maybe oneliner without using regexp to do this in Python?
If you really want to do this in one line and without using regular expressions, one way to reduce all sequences of multiple \n to a single \n would be to first split by \n and then join all the non-empty segments with a single \n.
>>> a = "foo\n\nbar\n\n\nblub\n\n\n\nbaz"
>>> "\n".join(x for x in a.split("\n") if x)
'foo\nbar\nblub\nbaz'
Here, a is the entire string, i.e. after you did "\n".join(my_array), and depending on what my_array originally is, there may be better solutions, e.g. stripping \n from the individual lines prior to joining, but this will work nonetheless.
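The alternative hinted at here, stripping newlines from the individual lines before joining, could look like this. It is a sketch under the assumption that stray newlines only occur at the ends of the elements; the sample contents of my_array are made up.

```python
# Hypothetical input: elements with trailing newlines and empty entries.
my_array = ["foo\n", "bar\n\n", "", "\n", "baz"]

# Strip newlines from both ends of each element, then skip whatever
# is left empty so no blank lines are produced by the join.
stripped = (line.strip("\n") for line in my_array)
a = "\n".join(line for line in stripped if line)
print(repr(a))
```

Newlines in the middle of an element are left alone, so the split-and-filter version above remains the safer choice when that can happen.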
To convert sequences of newlines to single newlines you can split the string on newlines and then filter out the empty strings before re-joining. E.g.,
mystring = u"this\n\nis a\ntest string\n\nwith embedded\n\n\nnewlines\n"
a = u'\n'.join(filter(None, mystring.split(u'\n')))
print('{0!r}\n{1!r}'.format(mystring, a))
Output (Python 3; Python 2 shows the same strings with a u prefix):
'this\n\nis a\ntest string\n\nwith embedded\n\n\nnewlines\n'
'this\nis a\ntest string\nwith embedded\nnewlines'
Note that this eliminates any trailing newlines, but that shouldn't be a big deal.
Using reduce should work (in Python 3 it has to be imported from functools):
from functools import reduce
reduce(lambda x, y: (x + y).replace('\n\n', '\n'), x)
However, regular expressions would be more elegant:
import re
re.sub('\n+', '\n', x)
Perhaps this can help:
u"\n".join(s.replace(u'\n', '') for s in my_array)

An elegant way to get hashtags out of a string in Python?

I'm looking for a clean way to get a set (list, array, whatever) of words starting with # inside a given string.
In C#, I would write
var hashtags = input
    .Split (' ')
    .Where (s => s[0] == '#')
    .Select (s => s.Substring (1))
    .Distinct ();
What is comparatively elegant code to do this in Python?
EDIT
Sample input: "Hey guys! #stackoverflow really #rocks #rocks #announcement"
Expected output: ["stackoverflow", "rocks", "announcement"]
With @inspectorG4dget's answer, if you want no duplicates, you can use set comprehensions instead of list comprehensions.
>>> tags="Hey guys! #stackoverflow really #rocks #rocks #announcement"
>>> {tag.strip("#") for tag in tags.split() if tag.startswith("#")}
set(['announcement', 'rocks', 'stackoverflow'])
Note that { } syntax for set comprehensions only works starting with Python 2.7.
If you're working with older versions, feed the list comprehension ([ ]) output to the set function as suggested by @Bertrand.
[i[1:] for i in line.split() if i.startswith("#")]
Because split() with no arguments never yields empty strings, this version avoids the empty-string problem raised in the comments. Note that a token consisting only of "#" would still produce an empty string, so add a len(i) > 1 check if that matters. Also, as in Bertrand Marron's code, it's better to turn this into a set as follows (to avoid duplicates and for O(1) lookup time):
set([i[1:] for i in line.split() if i.startswith("#")])
The findall method of regular expression objects can get them all at once:
>>> import re
>>> s = "this #is a #string with several #hashtags"
>>> pat = re.compile(r"#(\w+)")
>>> pat.findall(s)
['is', 'string', 'hashtags']
I'd say
hashtags = [word[1:] for word in input.split() if word[0] == '#']
Edit: the following creates a set without any duplicates.
set(hashtags)
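A set drops duplicates but does not keep the order in which the tags appeared, while the expected output in the question is ordered. One way to get both might be dict.fromkeys, which keeps insertion order on Python 3.7 and later. A minimal sketch; the variable names are illustrative, and the len(word) > 1 check (an addition of this sketch) skips a bare "#".

```python
text = "Hey guys! #stackoverflow really #rocks #rocks #announcement"

# Collect the tags in order of appearance, without the leading '#'.
tags = [word[1:] for word in text.split()
        if word.startswith('#') and len(word) > 1]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates while preserving first-seen order.
hashtags = list(dict.fromkeys(tags))
print(hashtags)
```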
There are some problems with the answers presented here.
1. {tag.strip("#") for tag in tags.split() if tag.startswith("#")}
[i[1:] for i in line.split() if i.startswith("#")]
won't work if you have a hashtag like '#one#two#'.
2. re.compile(r"#(\w+)") won't work for many Unicode languages (even using re.UNICODE).
I have seen more ways to extract hashtags, but found that none of them handles all cases,
so I wrote a small piece of Python code to handle most of the cases. It works for me.
def get_hashtagslist(string):
    ret = []
    s=''
    hashtag = False
    for char in string:
        if char=='#':
            hashtag = True
            if s:
                ret.append(s)
                s=''
            continue
        # take only the prefix of the hashtag if it contains one of these chars (e.g. for '#happy,but i..' it takes only 'happy')
        if hashtag and char in [' ','.',',','(',')',':','{','}'] and s:
            ret.append(s)
            s=''
            hashtag=False
        if hashtag:
            s+=char
    if s:
        ret.append(s)
    return set(ret)
Another option is a regex:
import re
inputLine = "Hey guys! #stackoverflow really #rocks #rocks #announcement"
re.findall(r'(?i)\#\w+', inputLine) # matches include the leading #
re.findall(r'(?i)(?<=\#)\w+', inputLine) # matches do not include the #
