Splitting individual sentences into lists - Python

I am asking how to make individual lists, not how to find a substring, which is what this question was marked a duplicate of.
I have the following file:
'Gentlemen do not read each others mail.' Henry Stinson
'The more corrupt the state, the more numerous the laws.' Tacitus
'The price of freedom is eternal vigilance.' Thomas Jefferson
'Few false ideas have more firmly gripped the minds of so many intelligent men than the one that, if they just tried, they could invent a cipher that no one could break.' David Kahn
'Who will watch the watchmen.' Juvenal
'Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.' John Von Neumann
'They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety.' Benjamin Franklin
'And so it often happens that an apparently ingenious idea is in fact a weakness which the scientific cryptographer seizes on for his solution.' Herbert Yardley
I am trying to convert each sentence to a list so that when I search for a word, say "Gentlemen", it prints the entire sentence.
I am able to get the lines to split, but I am unable to convert them to individual lists. I have tried a few things from the internet but nothing has helped so far.
Here is what I have so far:
def myFun(filename):
    file = open(filename, "r")
    c1 = [line for line in file]
    for i in c1:
        print(i)

You can use in to search a string or a list, for example 7 in a_list or "I" in "where am I".
You can iterate directly over a file if you want:
for line in open("my_file.txt"):
    ...
although to ensure the file is closed, people recommend using a context manager:
with open("my_file.txt") as f:
    for line in f:
        ...
That should at least get you going in the right direction.
If you want a case-insensitive search, you can simply use str.lower():
term.lower() in search_string.lower()  # case insensitive
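Putting those pieces together, here is a minimal sketch of the whole task (find_sentences and quotes.txt are placeholder names for your own function and file):
def find_sentences(filename, word):
    # Print every line of the file that contains the word, ignoring case.
    with open(filename) as f:
        for line in f:
            if word.lower() in line.lower():
                print(line.strip())

find_sentences("quotes.txt", "Gentlemen")
# 'Gentlemen do not read each others mail.' Henry Stinson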

Python strings have a split() method:
individual_words = 'This is my sentence.'.split()
print(len(individual_words)) # 4
Edit: As @ShadowRanger mentions below, running split() without an argument will take care of leading, trailing, and consecutive whitespace.
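For example, the no-argument form collapses any run of whitespace, including newlines:
print('  a   b \n c '.split())  # ['a', 'b', 'c']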

Related

To split text based on words using Python code

I have a long text like the one below. I need to split it based on some words, say ("In", "On", "These").
Below is sample data:
On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammelled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains.
Can this problem be solved with code, as I have 1000 rows in a CSV file?
As per my comment, I think a good option would be to use a regular expression with the pattern:
re.split(r'(?<!^)\b(?=(?:On|In|These)\b)', YourStringVariable)
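For example, here is a runnable sketch of that pattern applied to a shortened version of the sample text:
import re

text = ("On the other hand, we denounce with righteous indignation. "
        "These cases are perfectly simple. "
        "In a free hour, every pleasure is to be welcomed.")

# Split before each whole word "On", "In" or "These", except at the very
# start of the string.
parts = re.split(r'(?<!^)\b(?=(?:On|In|These)\b)', text)
for part in parts:
    print(part.strip())
# On the other hand, we denounce with righteous indignation.
# These cases are perfectly simple.
# In a free hour, every pleasure is to be welcomed.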
Yes, this can be done in Python. You can load the text into a variable and use the built-in split method on strings. For example:
with open(filename, 'r') as file:
    lines = file.read()
lines = lines.split('These')
# lines is now a list of strings, split wherever 'These' was encountered
To find whole words that are not part of larger words, I like using the regular expression:
[^\w]word[^\w]
Sample Python code, assuming the text is in a variable named text:
import re
exp = re.compile(r'[^\w]in[^\w]', flags=re.IGNORECASE)
all_occurrences = list(exp.finditer(text))
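You can then inspect each whole-word match, for example:
for match in all_occurrences:
    print(match.start(), match.group())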

Removing "\n"s when printing sentences from a text file in Python?

I am trying to print a list of sentences from a text file (one of the Project Gutenberg eBooks). When I print the file as a single string it looks fine:
file = open('11.txt','r+')
alice = file.read()
print(alice[:500])
Output is:
ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'
So she was considering in her own mind (as well as she could, for the
hot d
Now, when I split it into sentences (the assignment was specifically to do this by "splitting at the periods," so it's a very simplified split), I get this:
>>> print(sentences[:5])
["ALICE'S ADVENTURES IN WONDERLAND\n\nLewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3", '0\n\n\n\n\nCHAPTER I', " Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversations?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her", "\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\nOh dear! I shall be late!' (when she thought it over afterwards, it\noccurred to her that she ought to have wondered at this, but at the time\nit all seemed quite natural); but when the Rabbit actually TOOK A WATCH\nOUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,\nAlice started to her feet, for it flashed across her mind that she had\nnever before seen a rabbit with either a waistcoat-pocket, or a watch\nto take out of it, and burning with curiosity, she ran across the field\nafter it, and fortunately was just in time to see it pop down a large\nrabbit-hole under the hedge", '\n\nIn another moment down went Alice after it, never once considering how\nin the world she was to get out again']
Where do the extra "\n" characters come from and how can I remove them?
If you want to replace all the newlines with one space, do this:
import re
new_sentences = [re.sub(r'\n+', ' ', s) for s in sentences]
You may not want to use regex, but I would do:
import re
new_sentences = []
for s in sentences:
    new_sentences.append(re.sub(r'\n{2,}', '\n', s))
This should replace all instances of two or more '\n' with a single newline, so you still have newlines, but don't have "extra" newlines.
If you want to avoid creating a new list, and instead modify the existing one (credit to @gavriel and Andrew L.: I hadn't thought of using enumerate when I first posted my answer):
import re
for i, s in enumerate(sentences):
    sentences[i] = re.sub(r'\n{2,}', '\n', s)
The extra newlines aren't really extra, by which I mean they are meant to be there and are visible in the text in your question: the more '\n' there are, the more space is visible between the lines of text (i.e., one between the chapter heading and the first paragraph, and many between the edition and the chapter heading).
You'll understand where the \n characters come from with this little example:
alice = """ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'
So she was considering in her own mind (as well as she could, for the
hot d"""
print(len(alice.split(".")))
print(len(alice.split("\n")))
It all depends on the way you're splitting your text; the above example will give this output:
3
19
This means there are 3 substrings if you split the text using . and 19 substrings if you split using \n as the separator. You can read more about str.split.
In your case you've split your text using ., so the 3 substrings will contain multiple newline characters \n; to get rid of them you can either split these substrings again or remove them using str.replace.
The text uses newlines to delimit sentences as well as full stops. You have an issue where just replacing the newline characters with an empty string will result in words without spaces between them. Before you split alice by '.', I would use something along the lines of @elethan's solution to replace all of the multiple newlines in alice with a '.'. Then you could do alice.split('.') and all of the sentences separated by multiple newlines would be split appropriately, along with the sentences separated by . initially.
Then your only issue is the decimal point in the version number.
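A minimal sketch of that approach, assuming the full text is already in alice:
import re

# Turn runs of blank lines into sentence breaks, flatten the remaining
# single newlines into spaces, then split on periods.
normalized = re.sub(r'\n{2,}', '.', alice).replace('\n', ' ')
sentences = [s.strip() for s in normalized.split('.') if s.strip()]
As noted, the decimal point in the version number will still produce one spurious split.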
file = open('11.txt', 'r+')
lines = file.read().split('\n')

Python: Word-by-word text processing between two files

I'm new to NLP. I have two text files. The first file has dialogues formatted properly, like below.
RECEPTIONIST Can I help you?
LINCOLN Yes. Um, is this the State bank?
RECEPTIONIST If you have to ask, maybe you shouldn't be here.
SARAH I think this is the place.
RECEPTIONIST Fill in the query.
LINCOLN Thank-you. We'll be right back.
RECEPTIONIST Oh, take your time. I'll just finish my crossword puzzle.
oh, wait.
The second text file has 7 columns. In the 5th column I have the word sequence from the dialogues, like below.
Column 5
Can
I
help
you
?
yes
.
Um
,
Full stops and commas are considered words here, and a run of 3 or more full stops together, like "...", should be considered a single word. Also, words like "Thank-you" (because they have no space in between) should be considered a single word.
Now I want to write a script in Python to compare each word from the dialogues and then make a new column (Column 8) which shows who speaks the word, like below:
Column 5 Column 8
Can RECEPTIONIST
I RECEPTIONIST
help RECEPTIONIST
you RECEPTIONIST
? RECEPTIONIST
yes LINCOLN
. LINCOLN
Um LINCOLN
, LINCOLN
As I'm completely new to the Python environment, I don't know where to start. Please provide your suggestions and any coding tips!
The first file has the dialogues and the second file has information about the dialogues.
I suggest performing the following steps:
Process text file 1
Here you want to split a string like LEONARD Agreed, what's your point into a set of tokens. A naive approach is to use split(" "), which will split the text on spaces; however, you also need to take punctuation into consideration.
I suggest using NLTK, a Python library for natural language processing. A basic example shows how this might help you:
import nltk
sentence = """Hi this is a test."""
tokens = nltk.word_tokenize(sentence)
print(tokens)
# output: ['Hi', 'this', 'is', 'a', 'test', '.']
Once you have tokenised each sentence correctly, you will know how many lines it will have in the second text file.
Process text file 2
Now you iterate over each line in the second text file and check whether the word matches the expected token found in the first step. If it does, you add the first token of the dialogue line (the name of the person who said it) to the end of the line (column 8).
You can get the word from a string like TheBigBangTheory.Season01.Episode01.en 1 59.160 0.070 you 0.990 lex by simply doing sentence.split(" ")[4], which returns you in this case.
I believe it will still need some tweaking, but I'll leave that to you. This might outline the general idea.
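A rough sketch of those two steps, purely to illustrate the idea (the file names, and the assumption that every dialogue line starts with the speaker's name, are taken from the question, not tested code):
import nltk

# Step 1: build (speaker, token) pairs from the dialogue file; the first
# word of each line is treated as the speaker.
pairs = []
with open("dialogues.txt") as f:
    for line in f:
        speaker, _, speech = line.partition(" ")
        for token in nltk.word_tokenize(speech):
            pairs.append((speaker, token))

# Step 2: walk the second file in parallel; column 5 (index 4) should hold
# the same token, and the speaker becomes column 8.
with open("columns.txt") as f, open("output.txt", "w") as out:
    for (speaker, token), line in zip(pairs, f):
        columns = line.rstrip("\n").split(" ")
        if columns[4] != token:
            print("tokenisation mismatch:", columns[4], "vs", token)
        out.write(" ".join(columns + [speaker]) + "\n")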
Good luck, Bazinga!

How to parse names from raw text

I was wondering if anyone knew of any good libraries or methods of parsing names from raw text.
For example, let's say I've got these as examples (note that sometimes the names are capitalized, other times not):
James Vaynerchuck and the rest of the group will be meeting at 1PM.
Sally Johnson, Jim White and brad burton.
Mark angleman Happiness, Productivity & blocks. Mark & Evan at 4pm.
My first thought is to load some sort of part-of-speech tagger (like Python's NLTK), tag all of the words, then strip out only the nouns and compare them against a database of known words (i.e., a literal dictionary); if they aren't in the dictionary, assume they are names.
Other thoughts would be to delve into machine learning, but that might be beyond the scope of what I need here.
Any thoughts, suggestions or libraries you could point me to would be very helpful.
Thanks!
I don't know why you think you need NLTK just to rule out dictionary words; a simple dictionary (which you might have installed somewhere like /usr/share/dict/words, or you can download one off the internet) is all you need:
with open('/usr/share/dict/words') as f:
    dictwords = {word.strip() for word in f}
with open(mypath) as f:
    names = [word for line in f for word in line.rstrip().split()
             if word.lower() not in dictwords]
Your words list may include names, but if so, it will include them capitalized, so:
dictwords = {word.strip() for word in f if word.islower()}
Or, if you want to whitelist proper names instead of blacklisting dictionary words:
with open('/usr/share/dict/propernames') as f:
    namewords = {word.strip() for word in f}
with open(mypath) as f:
    names = [word for line in f for word in line.rstrip().split()
             if word.title() in namewords]
But this really isn't going to work. Look at "Jim White" from your example. His last name is obviously going to be in any dictionary, and his first name will be in many (as a short version of "jimmy", as a common romanization of the Arabic letter "jīm", etc.). "Mark" is also a common dictionary word. And the other way around, "Will" is a very common name even though you want to treat it as a word, and "Happiness" is an uncommon name, but at least a few people have it.
So, to make this work even the slightest bit, you probably want to combine multiple heuristics. First, instead of a word being either always a name or never a name, each word has a probability of being used as a name in some relevant corpus: White may be a name 13.7% of the time, Mark 41.3%, Jim 99.1%, Happiness 0.1%, etc. Next, if it's not the first word in a sentence but is capitalized, it's much more likely to be a name (how much more? I don't know, you'll need to test and tune for your particular input), and if it's lowercase, it's less likely to be a name. You could bring in more context: for example, you have a lot of full names, so if something is a possible first name and it appears right next to something that's a common last name, it's more likely to be a first name. You could even try to parse the grammar (it's OK if you bail on some sentences; they just won't get any input from the grammar rule), so if two adjacent words only work as part of a sentence when the second one is a verb, they're probably not a first and last name, even if that same second word could be a noun (and a name) in other contexts. And so on.
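A toy illustration of combining those heuristics (the probabilities are the made-up guesses from above, not real corpus statistics):
name_prob = {'white': 0.137, 'mark': 0.413, 'jim': 0.991, 'happiness': 0.001}

def name_score(word, is_sentence_start):
    # Start from the word's prior probability of being used as a name;
    # unseen words get a neutral 0.5.
    score = name_prob.get(word.lower(), 0.5)
    # Mid-sentence capitalization is evidence for a name; lowercase against.
    if not is_sentence_start:
        score = min(1.0, score * 2) if word[:1].isupper() else score / 2
    return score

print(name_score("White", is_sentence_start=False))  # 0.274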
I found this library quite useful for parsing names: Python Name Parser
It can also deal with names that are formatted Lastname, Firstname.
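For instance, a quick sketch using that package (installable with pip install nameparser; "White, Jim" reuses a name from the question):
from nameparser import HumanName

name = HumanName("White, Jim")
print(name.first, name.last)  # Jim White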

Regular Expressions: Match up to a word or a maximum number of words

I want to look for a phrase, match up to a few words following it, but stop early if I find another specific phrase.
For example, I want to match up to three words following "going to the", but stop the matching process if I encounter "to try". So, for example, "going to the luna park" should result in "luna park"; "going to the capital city of Peru" in "capital city of"; and "going to the moon to try some cheesecake" in "moon".
Can it be done with a single, simple regular expression (preferably in Python)? I've tried all the combinations I could think of, but failed miserably :).
This one matches up to 3 ({1,3}) words following going to the as long as they are not followed by to try ((?!to try)):
import re
infile = open("input", "r")
for line in infile:
    m = re.match(r"going to the ((?:\w+\s*(?!to try)){1,3})", line)
    if m:
        print(m.group(1).rstrip())
Output
luna park
capital city of
moon
I think you are looking for a way to extract proper nouns from sentences. You should look at NLTK for a proper approach. Regex is only helpful for a limited context-free grammar. On the other hand, you seem to be asking for the ability to parse human language, which is non-trivial (for computers).
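A sketch of that NLTK route (it assumes the punkt and averaged_perceptron_tagger data packages have been downloaded):
import nltk

# Tag each token with its part of speech and keep the proper nouns (NNP/NNPS).
sentence = "going to the capital city of Peru"
tokens = nltk.word_tokenize(sentence)
proper_nouns = [w for w, tag in nltk.pos_tag(tokens) if tag in ('NNP', 'NNPS')]
print(proper_nouns)  # likely ['Peru']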
