Let's say I open a text file with Python3:
fname1 = "filename.txt"
with open(fname1, "rt", encoding='latin1') as in_file:
    readable_file = in_file.read()
The content is a standard text file of paragraphs:
\n\n"Well done, Mrs. Martin!" thought Emma. "You know what you are about."\n\n"And when she had come away, Mrs. Martin was so very kind as to send\nMrs. Goddard a beautiful goose--the finest goose Mrs. Goddard had\never seen. Mrs. Goddard had dressed it on a Sunday, and asked all\nthe three teachers, Miss Nash, and Miss Prince, and Miss Richardson,\nto sup with her."\n\n"Mr. Martin, I suppose, is not a man of information beyond the line\nof his own business? He does not read?"\n\n"Oh yes!--that is, no--I do not know--but I believe he has\nread a good deal--but not what you would think any thing of.\nHe reads the Agricultural Reports, and some other books that lay\nin one of the window seats--but he reads all _them_ to himself.\nBut sometimes of an evening, before we went to cards, he would read\nsomething aloud out of the Elegant Extracts, very entertaining.\nAnd I know he has read the Vicar of Wakefield. He never read the\nRomance of the Forest, nor The Children of the Abbey. He had never\nheard of such books before I mentioned them, but he is determined\nto get them now as soon as ever he can."\n\nThe next question was--\n\n"What sort of looking man is Mr. Martin?"
How can one save only a certain string within this file? For example, how does one save the sentence
And when she had come away, Mrs. Martin was so very kind as to send\nMrs. Goddard a beautiful goose--the finest goose Mrs. Goddard had\never seen.
into a separate text file? How do you know the indices at which to access this sentence?
CLARIFICATION: There should be no decision statements to make. The end goal is to create a program with which the user could "save" sentences or paragraphs separately. I am asking a more general question at the moment.
Let's say there is a certain paragraph I like in this text. I would like a way to append it quickly to a JSON file or a text file. In principle, how does one do this? By tagging all sentences? Is there a way to isolate paragraphs? To repeat the question above: how do you know the indices at which to access this sentence, especially if there is no "decision logic"?
If I know the indices, couldn't I simply slice the string?
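In other words, something like this minimal sketch is what I have in mind, where str.find() supplies the indices and slicing recovers the text (the output filename is a placeholder of mine):

quote = ("And when she had come away, Mrs. Martin was so very kind as to send\n"
         "Mrs. Goddard a beautiful goose--the finest goose Mrs. Goddard had\n"
         "ever seen.")
start = readable_file.find(quote)    # index of the first character, or -1 if absent
end = start + len(quote)             # index just past the last character
excerpt = readable_file[start:end]   # slicing the string recovers the sentence

# Paragraphs can be isolated by splitting on the blank-line separator
paragraphs = readable_file.split("\n\n")

with open("saved_quotes.txt", "a", encoding="latin1") as out_file:
    out_file.write(excerpt + "\n")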
I am currently cleaning data from text files that contain transcriptions of speeches from daily conversations. Some of the files are multilingual; a few examples of multilingual portions look like this:
around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too
so at least need to <mandarin>跑两趟:pao liang tang</mandarin>,then I told them that it is fine
There can be multiple such languages in one file.
Going back to the first example, what I am trying to do is remove "<tamil>", "அம்மா:" and "</tamil>", keeping just the English pronunciation of the word. I have tried replacing "<tamil>" with "", but I am quite unsure how to approach the removal of the Tamil words themselves.
The expected output would be:
around that area, ammaa would have cooked too
so at least need to pao liang tang,then I told them that it is fine
How would I go about doing so?
Yes, please try this. A single regular expression can match the whole tag pair and keep only the transliteration after the colon:

import re

content = "around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too"

# Match <lang>original:transliteration</lang> and keep only the transliteration.
# The backreference \1 requires the closing tag to match the opening one,
# so the same pattern also handles <mandarin>...</mandarin> and any other language tag.
cleaned = re.sub(r'<(\w+)>[^<]*:([^<]*)</\1>', r'\2', content)
print(cleaned)

output

around that area,ammaa would have cooked too
I need a Python package that can get the related sentence from a text, based on the keywords provided.
For example, below is an excerpt from the Wikipedia page of J. Robert Oppenheimer:
Early life
Childhood and education
J. Robert Oppenheimer was born in New York City on April 22, 1904,[note 1][7] to Julius Oppenheimer, a wealthy Jewish textile importer who had immigrated to the United States from Germany in 1888, and Ella Friedman, a painter.
Julius came to the United States with no money, no baccalaureate studies, and no knowledge of the English language. He got a job in a textile company and within a decade was an executive with the company. Ella was from Baltimore.[8] The Oppenheimers were non-observant Ashkenazi Jews.[9]
The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.
Oppenheimer later remarked that it brought to mind words from the Bhagavad Gita: "Now I am become Death, the destroyer of worlds."
If my passed string is - "JJ Oppenheimer birth date", it should return "J. Robert Oppenheimer was born in New York City on April 22, 1904"
If my passed string is - "JJ Openheimer Trinity test", it should return "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico"
I tried searching a lot, but nothing comes close to what I want, and I don't know much about NLP vectorization techniques. It would be great if someone could suggest a package if one exists.
You could use fuzzywuzzy: fuzz.ratio(search_text, sentence) gives you a score of how similar two strings are.
https://github.com/seatgeek/fuzzywuzzy
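For example, a minimal sketch (the sentence list is made up from the question's example; in practice you would first split the page into sentences, and fuzz.token_set_ratio tends to handle keyword-style queries better than plain fuzz.ratio):

from fuzzywuzzy import fuzz

sentences = [
    "J. Robert Oppenheimer was born in New York City on April 22, 1904.",
    "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.",
]

query = "JJ Oppenheimer birth date"

# Score every sentence against the query and keep the best match
best = max(sentences, key=lambda s: fuzz.token_set_ratio(query, s))
print(best)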
I am pretty sure a module exists that could do this for you, but you could also try to build it yourself by parsing through the text and creating lists of related words like ["date of birth", "born", "birth date", etc.], one list per field. This would let you find whatever information is available.
The idea is (see the sketch after this list):
you grab your text or whatever you have,
you grab what you are looking for (for example, a date of birth),
you assign "date of birth" a list of similar words,
you look through your file to see if you find a sentence that contains any of them.
Maybe there is no ready-made module and I am wrong, but something like this should work.
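A rough sketch of that idea (the sample text, field name, and synonym list are all placeholders):

# Sample text taken from the question's Wikipedia excerpt
wiki_text = ("J. Robert Oppenheimer was born in New York City on April 22, 1904. "
             "Ella was from Baltimore.")

# One list of similar words per field we want to extract
synonyms = {
    "date of birth": ["date of birth", "born", "birth date"],
}

def find_sentences(text, field):
    # Naive sentence split on '.'; keep sentences mentioning any similar word
    return [s.strip() for s in text.split('.')
            if any(word in s.lower() for word in synonyms[field])]

print(find_sentences(wiki_text, "date of birth"))
# ['Robert Oppenheimer was born in New York City on April 22, 1904']
# (the naive split on '.' clips the "J." initial)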
The task you describe looks like Information Retrieval: given a query (the keywords), the model should return a list of documents (the sentences) that best match the query.
This is essentially what the answer suggesting fuzzywuzzy does. But maybe just counting the number of occurrences of the query words in each sentence is enough (and more efficient).
The next step would be to use Tf-Idf. It is a weighting scheme that gives high scores to words that are specific to a document with respect to a set of documents (the corpus).
This results in every document having an associated vector; you will then be able to sort the documents according to their similarity to the query vector. There is an SO answer describing how to do that.
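A minimal sketch of that pipeline with scikit-learn (the sentence list is made up from the question's example; a real corpus would come from sentence-tokenizing the page):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "J. Robert Oppenheimer was born in New York City on April 22, 1904.",
    "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.",
]

vectorizer = TfidfVectorizer()
sentence_vectors = vectorizer.fit_transform(sentences)  # one Tf-Idf vector per sentence
query_vector = vectorizer.transform(["JJ Oppenheimer Trinity test"])

# Rank sentences by cosine similarity between the query and each sentence vector
scores = cosine_similarity(query_vector, sentence_vectors).flatten()
print(sentences[scores.argmax()])  # expected to print the Trinity sentence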
I'm dealing with Victor Hugo's well-known novel "Les Misérables".
Part of my project is to detect the presence of each of the novel's characters in a sentence and count them. This can be done easily by something like this:
from collections import OrderedDict

def character_frequency(character_per_sentences_dict):
    # Count, for each character, the number of sentences they appear in
    characters_frequency = OrderedDict()
    for character, sentences in character_per_sentences_dict.items():
        if len(sentences) != 0:
            characters_frequency[character] = len(sentences)
    return characters_frequency
This piece of code works well for all of the characters except "Èponine".
I also read the text with the following piece of code:
from nltk.tokenize import sent_tokenize

with open(path_to_volume + '.txt', 'r', encoding='latin1') as fp:
    novel = ' '.join(fp.readlines())

# Tokenize sentences and calculate the number of sentences
sentences = sent_tokenize(novel)
num_volume = path_to_volume.split("-v")[-1]
I should add that the spelling of "Èponine" is the same everywhere.
Any idea what's going on?
Here is a sample in which this name appears:
" ONE SHOULD ALWAYS BEGIN BY ARRESTING THE VICTIMS
At nightfall, Javert had posted his men and had gone into ambush himself between the trees of the Rue de la Barrieredes-Gobelins which faced the Gorbeau house, on the other side of the boulevard. He had begun operations by opening his pockets, and dropping into it the two young girls who were charged with keeping a watch on the approaches to the den. But he had only caged Azelma. As for Èponine, she was not at her post, she had disappeared, and he had not been able to seize her. Then Javert had made a point and had bent his ear to waiting for the signal agreed upon. The comings and goings of the fiacres had greatly agitated him. At last, he had grown impatient, and, sure that there was a nest there, sure of being in luck, having recognized many of the ruffians who had entered, he had finally decided to go upstairs without waiting for the pistol-shot."
I agree with @BoarGules that there is likely a more efficient and effective way to approach this problem. With that said, I'm not sure what your problem is here. Python fully supports Unicode: you can "just do it" in terms of using Unicode in your program logic with Python's standard string operations and libraries.
For example, this works:
#!/usr/bin/env python
import requests

names = [
    u'Éponine',
    u'Cosette',
]

# Retrieve Les Misérables from Project Gutenberg
t = requests.get("http://www.gutenberg.org/files/135/135-0.txt").text

for name in names:
    c = t.count(name)
    print("{}: {}".format(name, c))
Results:
Éponine: 81
Cosette: 1004
I obviously don't have the text you have, so I don't know whether the problem is how it is encoded or how it is being read; I can't test that without having it. In this code, I get the source text off the internet. My point is just that non-ASCII characters should not pose any impediment to you as long as your inputs are reasonable.
Nearly all of the time to run this is spent downloading the text. I think even if you added dozens of names, it wouldn't add up to a noticeable delay on any decent computer. So this method works just fine.
My text file looks like this:
[0, "we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....\n"]
[1, "I want to write a . I think I will.\n"]
[2, "#va_stress broke my twitter..\n"]
[3, "\" "Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry\n"]
[4, "aww great "Picture to burn"\n"]
[5, "#jessdelight I just played ur joint two s ago. Everyone in studio was feeling it!\n"]
[6, "http://img207.imageshack.us/my.php?image=wpcl10670s.jpg her s are so perfect.\n"]
[7, "cannot hear the new due to geographic location. i am geographically undesirable. and tune-less\n"]
[8, "\" couples in public\n"]
[9, "damn wendy's commerical got that damn in my head.\n"]
[10, "i swear to cheese & crackers #zyuuup is in Detroit like every 2 months & i NEVER get to see him! i swear this blows monkeyballs!\n"]
[11, "\" getting ready for school. after i print out this\n"]
I want to read the second element of every list, i.e. all the tweet texts, into an array.
I wrote
tweets = []
for line in open('tweets.txt').readlines():
    print line[1]
    tweets.append(line)
but when I look at the output, it just takes the 2nd character of every line.
When you read a text file in Python, the lines are just strings. They aren't automatically converted to some other data structure.
In your case, it looks like each line in your file contains a JSON list. In that case, you can parse the line first using json.loads(). This converts the string to a Python list which you can then take the second element of:
import json
with open('tweets.txt') as fp:
tweets = [json.loads(line)[1] for line in fp]
Maybe you should consider using the json.loads method:

import json

tweets = []
for line in open('tweets.txt').readlines():
    tweet = json.loads(line)[1]
    print tweet
    tweets.append(tweet)

There is a more Pythonic way in @Erik Cederstrand's comment.
Rather than guessing what format the data is in, you should find out.
If you're generating it yourself, and don't know how to parse back in what you're creating, change your code to generate something that can be easily parsed with the same library used to generate it, like JsonLines or CSV.
If you're ingesting it from some API, read the documentation for that API and parse it the way it's documented.
If someone handed you the file and told you to parse it, ask that someone what format it's in.
Occasionally, you do have to deal with some crufty old file in some format that was never documented and nobody remembers what it was. In that case, you do have to reverse engineer it. But what you want to do then is guess at likely possibilities, and try to parse it with as much validation and error handling as possible, to verify that you guessed right.
In this case, the format looks a lot like either JSON lines or ndjson. Both are slightly different ways of encoding multiple objects with one JSON text per line, with specific restrictions on those texts and the way they're encoded and the whitespace between them.
So, while a quick&dirty parser like this will probably work:
import json

with open('tweets.txt') as f:
    for line in f:
        tweet = json.loads(line)
        dosomething(tweet)
You probably want to use a library like jsonlines:
import jsonlines

with jsonlines.open('tweets.txt') as f:
    for tweet in f:
        dosomething(tweet)
The fact that the quick&dirty parser works on JSON lines is, of course, part of the point of that format—but if you don't actually know whether you have JSON lines or not, you're better off making sure.
Since your input looks like Python expressions, I'd use ast.literal_eval to parse them.
Here is an example:
import ast

with open('tweets.txt') as fp:
    tweets = [ast.literal_eval(line)[1] for line in fp]

print(tweets)
Output:
['we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....\n', 'I want to write a . I think I will.\n', '#va_stress broke my twitter..\n', '" "Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry\n', 'aww great "Picture to burn"\n', '#jessdelight I just played ur joint two s ago. Everyone in studio was feeling it!\n', 'http://img207.imageshack.us/my.php?image=wpcl10670s.jpg her s are so perfect.\n', 'cannot hear the new due to geographic location. i am geographically undesirable. and tune-less\n', '" couples in public\n', "damn wendy's commerical got that damn in my head.\n", 'i swear to cheese & crackers #zyuuup is in Detroit like every 2 months & i NEVER get to see him! i swear this blows monkeyballs!\n', '" getting ready for school. after i print out this\n']
I am trying to give my text document an XML structure where each sentence gets an id. I have text documents with unstructured sentences, and I would like to split the sentences using the '.' delimiter and write them to XML. Here is my code:
import re

# Read the file
with open('C:\\Users\\ngwak\\Documents\\test.txt') as f:
    content = [f]

split_content = []
for element in content:
    split_content += re.split("(.)\s+", element)

print(split_content, sep='\n\n')
But I am already getting this error and I can't interpret it:
TypeError: expected string or buffer
How can I split my sentences and write them to XML? Thanks a lot.
This is what my txt file looks like:
In a formal sense, the germ of national consciousness can be traced back to the Peace Treaty of Hoachanas signed in 13–June-1858 between soldiers, all the chiefs except those of the Bondelswarts (who had not been involved in the previous fighting), as well as by Muewuta, two sons of amuaha, formerly a Commandant of Chief Onag of the Triku people. There is ample epistolary as well as oral evidence for this view. The most poignant statement is to be found in the now famous and oft-quoted letter of Onag to Bonagha written on May 13, 1890 in which, amongst other things, he says that on June 13 there are people coming. Again on the 01.02.2015 till the 01.05 there are some coming.
And I would like the sentences to be like this in xml:
<sentence id=01>In a formal sense, the germ of national consciousness
can be traced back to the Peace Treaty of Hoachanas signed in 13–June-
1858 between soldiers, all the chiefs except those of the Bondelswarts
(who had not been involved in the previous fighting), as well as by
Muewuta, two sons of amuaha, formerly a Commandant of Chief Onag of the
Triku people. </sentence>
text_file = open('C:\\Users\\ngwak\\Documents\\test.txt', "r")
textLinesFromFile = text_file.read().replace("\n", "").split('.')

for sentenceNumber in range(len(textLinesFromFile)):
    print(textLinesFromFile[sentenceNumber].strip())
    # Or write each sentence to your XML
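To actually produce the XML, here is a minimal sketch using the standard library's xml.etree.ElementTree (the output filename test.xml and the <document> root tag are placeholders of my own):

import xml.etree.ElementTree as ET

with open('C:\\Users\\ngwak\\Documents\\test.txt') as f:
    # Same naive '.'-based sentence split as above
    sentences = [s.strip() for s in f.read().replace('\n', ' ').split('.') if s.strip()]

root = ET.Element('document')
for i, sentence in enumerate(sentences, start=1):
    node = ET.SubElement(root, 'sentence', id='{:02d}'.format(i))
    node.text = sentence + '.'

ET.ElementTree(root).write('test.xml', encoding='utf-8')

Note that well-formed XML requires attribute values to be quoted, so this produces id="01" rather than the id=01 shown in the question.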
You don't need the content = [f] line.
import re

with open('C:\\Users\\ngwak\\Documents\\test.txt') as file:
    split_content = []
    for element in file:
        split_content += re.split("(.)\s+", element)

print(split_content, sep='\n\n')
File objects are iterable. Using them in a for loop will iterate over each line.
Further Reading
Methods on File objects in the Python Docs
The example in this SO answer: Iterating on a file using Python