This question already has answers here:
Split Strings into words with multiple word boundary delimiters
(31 answers)
Closed 4 years ago.
The Python code below reads 'resting-place' as one word.
The modified list shows up as: ['This', 'is', 'my', 'resting-place.']
I want it to show as: ['This', 'is', 'my', 'resting', 'place']
That would give me a total of 5 words instead of 4 in the modified list.
original = 'This is my resting-place.'
modified = original.split()
print(modified)

numWords = 0
for word in modified:
    numWords += 1
print('Total words are:', numWords)
Output is:
Total words are: 4
I want the output to have 5 words.
To count the number of words in a sentence where '-' separates two words, without splitting the string:
>>> original = 'This is my resting-place.'
>>> sum(map(original.strip().count, [' ','-'])) + 1
5
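Note that this counts separators, so repeated spaces would inflate the total. A regex-based sketch that is robust to that (using only the standard re module) returns the words themselves:

import re

original = 'This  is my resting-place.'  # note the double space
# Runs of whitespace or hyphens are treated as a single boundary.
words = re.findall(r"[^\s\-]+", original)
print(words)       # ['This', 'is', 'my', 'resting', 'place.']
print(len(words))  # 5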
For comparison, splitting on spaces alone still counts 'resting-place.' as a single word:
s = 'This is my resting-place.'
len(s.split(" "))
# 4
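A minimal fix along those lines, assuming '-' is the only extra delimiter: replace the hyphen with a space before splitting.

s = 'This is my resting-place.'
words = s.replace('-', ' ').split()
print(words)       # ['This', 'is', 'my', 'resting', 'place.']
print(len(words))  # 5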
You can use regex:
import re
original = 'This is my resting-place.'
print(re.split(r"\s+|-", original))
Output:
['This', 'is', 'my', 'resting', 'place.']
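Note that the split above keeps the trailing period on 'place.'. If you want bare words, a re.findall sketch drops the punctuation as well:

import re

original = 'This is my resting-place.'
print(re.findall(r"\w+", original))
# ['This', 'is', 'my', 'resting', 'place']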
I think you will find what you want in the article linked below; it shows how to create a function that takes multiple separators to split a string. In your case, you'll be able to split on that extra character:
http://code.activestate.com/recipes/577616-split-strings-w-multiple-separators/
here is an example of the final result
>>> s = 'thing1,thing2/thing3-thing4'
>>> tsplit(s, (',', '/', '-'))
['thing1', 'thing2', 'thing3', 'thing4']
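The recipe itself isn't reproduced here; a minimal sketch of such a tsplit function (my own approximation, built on re.split with escaped literal separators) could look like this:

import re

def tsplit(s, separators):
    # Join the escaped separators into one alternation pattern and split on it.
    pattern = '|'.join(re.escape(sep) for sep in separators)
    return [part for part in re.split(pattern, s) if part]

print(tsplit('thing1,thing2/thing3-thing4', (',', '/', '-')))
# ['thing1', 'thing2', 'thing3', 'thing4']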
This question already has answers here:
Removing duplicates in lists
(56 answers)
Closed 1 year ago.
I have a string like:
'hi', 'what', 'are', 'are', 'what', 'hi'
I want to remove a specific repeated word. For example:
'hi', 'what', 'are', 'are', 'what'
Here, I remove only the repeated occurrences of 'hi' and keep the rest of the repeated words.
How can I do this using regex?
Regex is used for text search. You have structured data, so this is unnecessary.
def remove_all_but_first(iterable, removeword='hi'):
    remove = False
    for word in iterable:
        if word == removeword:
            if remove:
                continue
            else:
                remove = True
        yield word
Note that this will return an iterator, not a list. Cast the result to list if you need it to remain a list.
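For example, with the list from the question:

words = ['hi', 'what', 'are', 'are', 'what', 'hi']
print(list(remove_all_but_first(words)))
# ['hi', 'what', 'are', 'are', 'what']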
You can do this:
import re

s = "['hi', 'what', 'are', 'are', 'what', 'hi']"
# Convert the string to a list: drop the brackets, then remove quotes and spaces and split on commas.
s = s[1:-1].replace("'", '').replace(' ', '').split(',')

remove = 'hi'
# Store the index of the first occurrence so we can re-insert the word after removing all occurrences.
firstIndex = s.index(remove)

# Regex to remove all occurrences of the word.
regex = re.compile(r'(' + remove + ')', flags=re.IGNORECASE)
op = regex.sub("", '|'.join(s)).split('|')

# Clean up the list by removing the empty items left behind.
while "" in op:
    op.remove("")

# Re-insert the removed word at the index of its first occurrence.
op.insert(firstIndex, remove)
print(op)
You don't need regex for that. Convert the string to a list with ast.literal_eval, then find the index of the first occurrence of the word and filter it out of the rest of the list:
import ast

lst = "['hi', 'what', 'are', 'are', 'what', 'hi']"
lst = ast.literal_eval(lst)  # parse the string into a real list
word = 'hi'
index = lst.index(word) + 1  # keep everything up to and including the first occurrence
lst = lst[:index] + [x for x in lst[index:] if x != word]
print(lst)  # ['hi', 'what', 'are', 'are', 'what']
This question already has answers here:
How to methodically join two lists?
(4 answers)
Closed 1 year ago.
I have a function that takes in a string, which is then divided into two strings, a and b. I want to take the first word from a and append it to an array, then take the first word from b and append it to the same array, and keep alternating until the array contains every word from both strings. For example:
Start with
a = 'Hello are today'
b = 'how you'
End with
x = ['Hello', 'how', 'are', 'you', 'today']
Here's a one-liner that does it:
>>> import itertools
>>> [word for t in itertools.zip_longest(a.split(), b.split()) for word in t if word]
['Hello', 'how', 'are', 'you', 'today']
Using a for loop to iterate over the strings and add the words if they are within range:
def concate_strings(a, b):
    a = a.split()
    b = b.split()
    x = []
    for i in range(max(len(a), len(b))):
        if i < len(a):
            x.append(a[i])
        if i < len(b):
            x.append(b[i])
    return x


if __name__ == "__main__":
    a = 'Hello are today'
    b = 'how you'
    print(concate_strings(a, b))
Output:
['Hello', 'how', 'are', 'you', 'today']
Alternate solution:
You may have more than two strings and want to interleave them all. In that case, you can pass a list of strings to the method and append each string's i-th word whenever i is within range of that string's split length.
def concate_strings_multiple(strings):
    # Split each string once up front instead of re-splitting on every iteration.
    splits = [s.split() for s in strings]
    x = []
    for i in range(max(len(words) for words in splits)):
        for words in splits:
            if i < len(words):
                x.append(words[i])
    return x


if __name__ == "__main__":
    a = 'Hello are today'
    b = 'how you'
    print(concate_strings_multiple([a, b]))
Output:
['Hello', 'how', 'are', 'you', 'today']
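As a variation (not part of the answer above), the itertools one-liner from the first answer generalizes to any number of strings; the `is not None` test keeps legitimate words while dropping the fill values:

import itertools

def interleave(strings):
    # zip_longest pads the shorter word lists with None; filter those out.
    return [w
            for t in itertools.zip_longest(*(s.split() for s in strings))
            for w in t
            if w is not None]

print(interleave(['Hello are today', 'how you']))
# ['Hello', 'how', 'are', 'you', 'today']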
This question already has answers here:
Some built-in to pad a list in python
(14 answers)
Finding length of the longest list in an irregular list of lists
(10 answers)
Closed 6 months ago.
I have a list of lists of sentences and I want to pad all sentences so that they are of the same length.
I was able to do this, but I am trying to find the most optimal way to do it and challenge myself.
max_length = max(len(sent) for sent in sents)
list_length = len(sents)
sents_padded = [[pad_token for i in range(max_length)] for j in range(list_length)]
for i, sent in enumerate(sents):
    sents_padded[i][0:len(sent)] = sent
and I used the inputs:
sents = [["Hello","World"],["Where","are","you"],["I","am","doing","fine"]]
pad_token = "Hi"
Is my method an efficient way to do it, or are there better ways?
This is provided by itertools (in Python 3) as zip_longest. You can invert the transposition with zip(*...), and pass the result to list if you prefer that over an iterator.
import itertools
from pprint import pprint
sents = [["Hello","World"],["Where","are","you"],["I","am","doing","fine"]]
pad_token = "Hi"
padded = zip(*itertools.zip_longest(*sents, fillvalue=pad_token))
pprint (list(padded))
[('Hello', 'World', 'Hi', 'Hi'),
 ('Where', 'are', 'you', 'Hi'),
 ('I', 'am', 'doing', 'fine')]
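Note that the double transposition turns each inner list into a tuple. If you need lists of lists back, a one-line conversion (a small addition, not part of the original answer) restores them:

padded = [list(row) for row in zip(*itertools.zip_longest(*sents, fillvalue=pad_token))]
# [['Hello', 'World', 'Hi', 'Hi'], ['Where', 'are', 'you', 'Hi'], ['I', 'am', 'doing', 'fine']]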
Here is how you can use str.ljust() to pad each string, and max() with key=len to find the length to pad each string to:
lst = ['Hello World', 'Good day!', 'How are you?']
l = len(max(lst, key=len))       # the length of the longest sentence
lst = [s.ljust(l) for s in lst]  # pad each sentence to length l
print(lst)
Output:
['Hello World ',
'Good day! ',
'How are you?']
Assumption:
The output should be the same as OP output (i.e. same number of words in each sublist).
Inputs:
sents = [["Hello","World"],["Where","are","you"],["I","am","doing","fine"]]
pad_token = "Hi"
The following one-liner produces the same output as the OP's code; note that max_length must be computed first, as in the question:
max_length = max(len(sent) for sent in sents)
sents_padded = [sent + [pad_token] * (max_length - len(sent)) for sent in sents]
print(sents_padded)
# [['Hello', 'World', 'Hi', 'Hi'], ['Where', 'are', 'you', 'Hi'], ['I', 'am', 'doing', 'fine']]
This seemed to be faster when I timed it:
maxi = 0
for sent in sents:
    if len(sent) > maxi:
        maxi = len(sent)

for sent in sents:
    while len(sent) < maxi:
        sent.append(pad_token)

print(sents)
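To check such timing claims yourself, a quick timeit sketch (the helper name and number of runs are arbitrary choices for illustration):

import copy
import timeit

sents = [["Hello", "World"], ["Where", "are", "you"], ["I", "am", "doing", "fine"]]
pad_token = "Hi"

def pad_in_place(sents, pad_token):
    # Same approach as above: find the longest sublist, then append pad tokens.
    maxi = max(len(sent) for sent in sents)
    for sent in sents:
        while len(sent) < maxi:
            sent.append(pad_token)
    return sents

# Deep-copy per run because the function mutates its input.
print(timeit.timeit(lambda: pad_in_place(copy.deepcopy(sents), pad_token), number=100000))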
This question already has answers here:
Printing list elements on separate lines in Python
(10 answers)
Closed 4 years ago.
I've got this code. It removes the stopwords (listed in the stopwords.py file) from yelp.py:
def remove_stop(text, stopwords):
    disallowed = set(stopwords)
    return [word for word in text if word not in disallowed]

text = open('yelp.py', 'r').read().split()
stopwords = open('stopwords.py', 'r').read().split()
print(remove_stop(text, stopwords))
Currently, the output is one very long line.
I want the output to start a new line after every word from the yelp.py file.
How do I do that? Can somebody help, please?
The current output is ['near', 'best', "I've", 'ever', 'price', 'good', 'deal.', 'For', 'less', '6', 'dollars', 'person', 'get', 'pizza', 'salad', 'want.', 'If', 'looking', 'super', 'high', 'quality', 'pizza', "I'd", 'recommend', 'going', 'elsewhere', 'looking', 'decent', 'pizza', 'great', 'price,', 'go', 'here.']
How do I get it to print each word on its own line?
Once you have collected your output, a list l, you can print it as
print(*l, sep="\n")
where the * operator unpacks the list. Each element is used as a separate argument to the function.
Moreover, with the sep named argument you can customize the separator between items.
Full updated code:
def remove_stop(text, stopwords):
    disallowed = set(stopwords)
    return [word for word in text if word not in disallowed]

text = open('yelp.py', 'r').read().split()
stopwords = open('stopwords.py', 'r').read().split()
output = remove_stop(text, stopwords)
print(*output, sep="\n")
When you print a list you get one long line as output: [0, 1, 2, 3, 4, 5, ...]. Instead of printing the list, you can iterate over it:
for e in my_list:
    print(e)
and you will get a newline after each element in the list.
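Equivalently (a minor variation, not from the answer above), you can join the items with newlines first; str.join needs strings, hence the str() calls:

print("\n".join(str(e) for e in my_list))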
Hi, I'm new to Python programming. Please help me create a function that takes a text file as an argument and creates a list of words, removing all punctuation, where the list "splits" on double spaces. What I mean to say is that the function should create sublists at every double-space occurrence within the text file.
This is my function:
import re

def tokenize(document):
    text = open("document.txt", "r+").read()
    print(re.findall(r'\w+', text))
Input text file has a string as follows:
What's did the little boy tell the game warden? His dad was in the kitchen poaching eggs!
Note: There's a double spacing after warden? and before His
My function gives me an output like this
['what','s','did','the','little','boy','tell','the','game','warden','His','dad','was','in','the','kitchen','poaching','eggs']
Desired output :
[['what','s','did','the','little','boy','tell','the','game','warden'],
['His','dad','was','in','the','kitchen','poaching','eggs']]
First split the whole text on double spaces and then pass each item to the regex:
>>> import re
>>> file = "What's did the little boy tell the game warden?  His dad was in the kitchen poaching eggs!"
>>> file = file.split('  ')
>>> file
["What's did the little boy tell the game warden?", 'His dad was in the kitchen poaching eggs!']
>>> res = []
>>> for sen in file:
...     res.append(re.findall(r'\w+', sen))
...
>>> res
[['What', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'warden'], ['His', 'dad', 'was', 'in', 'the', 'kitchen', 'poaching', 'eggs']]
Here's a reasonable all-REs approach:
import re

def tokenize(document):
    with open(document) as f:
        text = f.read()
    blocks = re.split(r'\s\s+', text)  # split on runs of two or more whitespace characters
    return [re.findall(r'\w+', b) for b in blocks]
The builtin split function allows splitting on multiple spaces.
This:
a = "hello world. How are you"
b = a.split(' ')
c = [ x.split(' ') for x in b ]
Yields:
c = [['hello', 'world.'], ['how', 'are', 'you?']]
If you want to remove the punctuation too, apply a regex to the elements of b, or to x in the third statement.
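For instance, a sketch of that suggestion using re.findall on each chunk:

import re

a = "hello world.  How are you?"
b = a.split('  ')
c = [re.findall(r'\w+', x) for x in b]  # the regex strips punctuation
print(c)  # [['hello', 'world'], ['How', 'are', 'you']]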
First split the text by punctuation, and then on a second pass split the resulting strings by spaces:
import re

def splitByPunct(s):
    return (x.group(0) for x in re.finditer(r'[^\.\,\?\!]+', s) if x and x.group(0))

[x.split() for x in splitByPunct("some string, another string! The phrase")]
this yields
[['some', 'string'], ['another', 'string'], ['The', 'phrase']]