Sentences into words in Python [duplicate]

This question already has answers here:
Print list without brackets in a single row
(14 answers)
Closed 5 months ago.
I'm supposed to store all the words of a long sentence in a binary tree, reading them from a .txt file.
Example english.txt: A blurb or a tag is a statement about a book,
record or video, supplied by the publisher or
distributor, like "The best-selling novel" or
"Greatest hit tunes" or even "Perverse sex".
How do I store every single word of the sentence in the tree?
I have tried:
from bintreeFile import Bintree
engelska = Bintree()
with open("english.txt", "r", encoding="utf-8") as english_file:
    for rad in english_file:
        words = rad.strip().split(" ")
        engelska.put(words)
engelska.write()
This ends up printing, for example:
['A', 'blurb', 'or', 'a', 'tag', 'is', 'a', 'statement', 'about', 'a', 'book,']
['record', 'or', 'video,', 'supplied', 'by', 'the', 'publisher', 'or']
How can I fix this so it only prints the words?
A
blurb
or
tag
...etc

engelska is probably a list or an array of some sort. Arrays/lists are iterable, so to print each element you need something like
for x in engelska:
    print(x)
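Beyond the printing, note that the lists appear in the first place because put() receives the whole word list for each line. Assuming Bintree.put is meant to take a single word (an assumption, since bintreeFile isn't shown), a minimal sketch of the fix is to insert the words one at a time:
from bintreeFile import Bintree

engelska = Bintree()
with open("english.txt", "r", encoding="utf-8") as english_file:
    for rad in english_file:
        # put each word on its own instead of the whole list;
        # split() without an argument also handles repeated spaces
        for word in rad.strip().split():
            engelska.put(word)  # assumes put() accepts one word at a time
engelska.write()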


TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)

I'm looking to get all sentences in a text file that contain at least one of the conjunctions in the list "conjunctions". However, when applying this function to the text in the variable "text_to_look" like this:
import spacy
lang_model = spacy.load("en_core_web_sm")
text_to_look = "A woman is looking at books in a library. She's looking to buy one, but she hasn't got any money. She really wanted to book, so she asks another customer to lend her money. The man accepts. They get along really well, so they both exchange phone numbers and go their separate ways."
def get_coordinate_sents(file_to_examine):
    conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
    text = lang_model(file_to_examine)
    sentences = text.sents
    for sentence in sentences:
        coord_sents = []
        if any(conjunction in sentence for conjunction in conjunctions):
            coord_sents.append(sentence)
    return coord_sents
wanted_sents = get_coordinate_sents(text_to_look)
I get this error message :
TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)
There seems to be something about spaCy that I'm not aware of that prevents me from doing this...
The problem lies in the fact that conjunction is a string while sentence is a Span object; to check whether the sentence contains a conjunction, you need to access the Span's text property. On top of that, you re-initialize coord_sents inside the loop, effectively saving only the last sentence in the variable. Note that a list comprehension looks preferable in such cases.
So, a quick fix for your case is
def get_coordinate_sents(file_to_examine):
    conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
    text = lang_model(file_to_examine)
    return [sentence for sentence in text.sents if any(conjunction in sentence.text for conjunction in conjunctions)]
Here is my test:
import spacy
lang_model = spacy.load("en_core_web_sm")
text_to_look = "A woman is looking at books in a library. She's looking to buy one, but she hasn't got any money. She really wanted to book, so she asks another customer to lend her money. The man accepts. They get along really well, so they both exchange phone numbers and go their separate ways."
file_to_examine = text_to_look
conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
text = lang_model(file_to_examine)
sentences = text.sents
coord_sents = [sentence for sentence in sentences if any(conjunction in sentence.text for conjunction in conjunctions)]
Output:
>>> coord_sents
[She's looking to buy one, but she hasn't got any money., She really wanted to book, so she asks another customer to lend her money., They get along really well, so they both exchange phone numbers and go their separate ways.]
However, the in operation also matches substrings: it will find "nor" in "north", "so" in "crimson", etc.
You need a regex here:
import re
conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
rx = re.compile(fr'\b(?:{"|".join(conjunctions)})\b')
def get_coordinate_sents(file_to_examine):
    text = lang_model(file_to_examine)
    return [sentence for sentence in text.sents if rx.search(sentence.text)]
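To convince yourself that the word boundaries do the right thing, you can test the compiled pattern directly. A small sketch (the test sentences here are made up for illustration):
import re

conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
rx = re.compile(fr'\b(?:{"|".join(conjunctions)})\b')

# "north" and "crimson" contain "nor" and "so", but not as whole words
print(bool(rx.search("We drove north past crimson hills")))
>>> False
print(bool(rx.search("Neither rain nor snow")))
>>> True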

Split sentences into words and make it into a list (Python) [duplicate]

This question already has answers here:
What is the syntax to insert one list into another list in python?
(6 answers)
Closed 2 years ago.
I have a dataframe named "df" that has only one column, called "tweet". That dataframe consists of a bunch of sentences like this:
I have a cat
What do you mean by that?
This is my room.
Lorem ipsum dolor sit amet
I want to split all the sentences into words and put all the words into a list.
So far I tried:
def word_split():
    word = []
    for index, row in df.iterrows():
        words = row['tweet'].split()
        word.append(words)
    return word

word_split()
But rather than a list, I got a list of lists:
[['I', 'have', 'a', 'cat'],
['What', 'do', 'you', 'mean', 'by', 'that?'],
['This', 'is' .....]]
I want it to be a list rather than a list of lists :
['I', 'have', 'a', 'cat', 'What', 'do', 'you', .....]
Any suggestions?
Rather than
word.append(words)
it must be
word.extend(words)
Thank you @jonrsharpe for the answer.
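As a side note, the same flat list can be produced without the helper function at all. A minimal sketch, assuming df['tweet'] holds plain strings:
import pandas as pd

df = pd.DataFrame({"tweet": ["I have a cat", "What do you mean by that?"]})

# chained comprehension: split each tweet, then collect every word into one flat list
words = [word for tweet in df["tweet"] for word in tweet.split()]
print(words)
>>> ['I', 'have', 'a', 'cat', 'What', 'do', 'you', 'mean', 'by', 'that?']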

How to separate beginning and ending punctuation from words using python? [duplicate]

This question already has an answer here:
Python Tokenization
(1 answer)
Closed 4 years ago.
I have a list of words with possible punctuation at their beginning and the end. I need to separate the punctuation using regex as follows:
sample_input = ["I", "!Go", "I'm", "call.", "exit?!"]
sample_output = ["I", "!", "Go", "I'm", "call", ".", "exit", "?", "!"]
The original string looks like this:
string ="It's a mountainous wonderland decorated with ancient glaciers, breathtaking national parks and sumptuous vineyards, but behind its glossy image New Zealand is failing many of its children."
Does anybody have an idea, how to solve this problem?
Thank you.
You can tokenize each list item first:
import re
words = ["I", "!Go", "I'm", "call.", "exit?!"]
newwords = []
for i in words:
    newwords.append(re.findall(r"[\w']+|[\W]", i))
print(newwords)
>>> [['I'], ['!', 'Go'], ["I'm"], ['call', '.'], ['exit', '?', '!']]
then getting the result by:
result = [item for sublist in newwords for item in sublist]
print(result)
>>> ['I', '!', 'Go', "I'm", 'call', '.', 'exit', '?', '!']
The pattern breaks each string into either runs of word characters and apostrophes ([\w']+) or single non-word characters ([\W]), which gives the final list in your desired shape. You can adapt this approach to your own code as needed.
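Since the question also shows the original full string, note that the same idea can be applied to it directly, without splitting into a word list first. A minimal sketch, with [^\w\s] swapped in for [\W] (my adjustment, so that spaces are not returned as tokens):
import re

string = ("It's a mountainous wonderland decorated with ancient glaciers, "
          "breathtaking national parks and sumptuous vineyards, but behind its "
          "glossy image New Zealand is failing many of its children.")

# [^\w\s] matches punctuation only, so whitespace is skipped rather than emitted
tokens = re.findall(r"[\w']+|[^\w\s]", string)
print(tokens[:8])
>>> ["It's", 'a', 'mountainous', 'wonderland', 'decorated', 'with', 'ancient', 'glaciers']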

how to split string on word and punctuation [duplicate]

This question already has answers here:
Splitting a string into words and punctuation
(11 answers)
Closed 4 years ago.
I am relatively new to Python. Is there a way I can split the string "James kicked Bob's ball, laughed and ran away." so that I have the words and punctuation as list items: ["James", "kicked", "Bob's", "ball", ",", "laughed", "and", "ran", "away", "."]?
You can try this:
import re
str = "James kicked Bob's ball, laughed and ran away."
x = re.findall(r"[\w']+|[.,!?;]", str)
print(x)
Output:
['James', 'kicked', "Bob's", 'ball', ',', 'laughed', 'and', 'ran', 'away', '.']
It seems you are trying to tokenize a sentence.
Some tokenizers already exist and perform well.
For example, you can use spacy.
Once installed, you will need to download the model for your language:
python -m spacy download en_core_web_sm
Then you will be able to use it in your script:
import spacy
nlp = spacy.load('en_core_web_sm')
tokens = list(nlp("James kicked Bob's ball, laughed and ran away."))
Output:
['James', 'kicked', 'Bob', "'s", 'ball', ',', 'laughed', 'and', 'ran', 'away', '.']
A tokenizer will take care of some corner cases. For example, the sentence 'I tried but it failed...' will be tokenized as ['I', 'tried', 'but', 'it', 'failed', '...']. Here the dots at the end are grouped together as a single token. In the same way, "don't" is tokenized as ['do', "n't"] instead of the basic ['don', "'t"].
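One detail worth noting: list(nlp(...)) gives spaCy Token objects rather than plain strings. If you need actual str values, each token's text attribute converts it, as in this small sketch:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("James kicked Bob's ball, laughed and ran away.")

# Token.text is the plain string form of each token
tokens = [token.text for token in doc]
print(tokens)
>>> ['James', 'kicked', 'Bob', "'s", 'ball', ',', 'laughed', 'and', 'ran', 'away', '.']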

tokenize array consisting of strings

I have an array called allchats consisting of long strings. Some of the entries look like the following:
allchats[5,0] = "Hi, have you ever seen something like that? no?"
allchats[106,0] = "some word blabla some more words yes"
allchats[410,0] = "I don't know how we will ever get through this..."
I wish to tokenize each string in the array. Furthermore, I wish to use a regex tool to eliminate question marks, commas, etc.
I have tried the following:
import nltk
from nltk.tokenize import RegexpTokenizer
tknzr = RegexpTokenizer(r'\w+')
allchats1 = [[tknzr.tokenize(chat) for chat in str] for str in allchats]
I wish to end up with:
allchats[5,0] = ['Hi', 'have', 'you', 'ever', 'seen', 'something', 'like', 'that', 'no']
allchats[106,0] = ['some', 'word', 'blabla', 'some', 'more', 'words', 'yes']
allchats[410,0] = ['I', 'dont', 'know', 'how', 'we', 'will', 'ever', 'get', 'through', 'this']
I am quite sure that I am doing something wrong with the strings (str) in the for loop, but cannot figure out what I need to correct in order to succeed.
Thank you in advance for your help!
You have an error in your list comprehension: nested iteration is written with chained for clauses, not nested lists:
allchats1 = [tknzr.tokenize(chat) for str in allchats for chat in str]
If you want to iterate over words instead of just characters, you are looking for the str.split() method. So here is a fully working example:
allchats = ["Hi, have you ever seen something like that? no?", "some word blabla some more words yes", "I don't know how we will ever get through this..."]
def tokenize(word):
    # use real logic here
    return word + 'tokenized'

tokenized = [tokenize(word) for sentence in allchats for word in sentence.split()]
print(tokenized)
If you're not sure that you have only strings in your list and want to process only the strings, you can check with the isinstance method:
tokenized = [tokenize(word) for sentence in allchats if isinstance(sentence, str) for word in sentence.split()]
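Tying this back to the original goal (one token list per chat, with punctuation dropped), here is a minimal sketch using NLTK's RegexpTokenizer, assuming allchats is a flat list of strings:
from nltk.tokenize import RegexpTokenizer

allchats = ["Hi, have you ever seen something like that? no?",
            "some word blabla some more words yes",
            "I don't know how we will ever get through this..."]

# \w+ keeps runs of word characters, so commas, question marks etc. are dropped;
# note: \w+ splits "don't" into 'don' and 't', not 'dont' as in the question's expected output
tknzr = RegexpTokenizer(r"\w+")
tokenized_chats = [tknzr.tokenize(chat) for chat in allchats]
print(tokenized_chats[0])
>>> ['Hi', 'have', 'you', 'ever', 'seen', 'something', 'like', 'that', 'no']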
