I'd like to extract data from a sentence. I know that this is not a simple process, but I was wondering what the best way to go about it would be. For example, I'd have the sentence:
The silver boat moved through the dark waters, lit by the moonlight.
From that I'd like to get:
Boat: silver, moving through water // "moving through water" refers to the Water entry
Water: dark, lit by the moonlight, boat moving // "boat moving" refers to the Boat entry
Moonlight: lighting water
Basically, I'd like to get all the words that describe each noun, whether a single word or a short phrase. Some details are kept twice (the Boat entry has a reference to water, and the Water entry a reference to the boat).
I was thinking about getting a list of nouns and finding all of them in the sentence. Then I'd somehow get everything describing each one (all the links).
If someone has already done something similar or if there's a way to get all related data from a sentence, please tell me or give a link.
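For what it's worth, dependency parsing gets you most of the way there. Below is a minimal sketch using spaCy's dependency parser, assuming spaCy and its en_core_web_sm model are installed; the dependency labels I pull out (amod, acl, prep) are just a starting set, and the exact output depends on the parse the model produces.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The silver boat moved through the dark waters, lit by the moonlight.")

for token in doc:
    if token.pos_ == "NOUN":
        descriptors = []
        for child in token.children:
            if child.dep_ == "amod":  # adjective modifier, e.g. "silver" -> boat
                descriptors.append(child.text)
            elif child.dep_ in ("acl", "relcl"):  # clausal modifier, e.g. "lit by the moonlight" -> waters
                descriptors.append(" ".join(t.text for t in child.subtree))
        if token.dep_ == "nsubj":  # subject noun: attach its verb plus the verb's prepositional phrases
            verb = token.head
            for prep in (c for c in verb.children if c.dep_ == "prep"):
                descriptors.append(verb.lemma_ + " " + " ".join(t.text for t in prep.subtree))
        print(token.text, "->", descriptors)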
I'm looking to analyze how specific a statement is. I've checked out packages like NLTK but haven't found anything that seems to fit. I'm looking for something that can give an English sentence a score for how specific or general it is.
An example of a specific sentence:
"The box is green and weighed one pound last week."
An example of a general sentence:
"Red is a color."
Any suggestions or ideas?
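Not a ready-made solution, but one crude heuristic you could prototype: score a sentence by the density of "concrete" part-of-speech tags (numerals, proper nouns, adjectives). This is my own assumption about what signals specificity, not an established measure, and it assumes the punkt and averaged-perceptron-tagger NLTK data packages are downloaded.

import nltk

# Tags that loosely signal concrete detail: numbers, proper nouns, adjectives.
SPECIFIC_TAGS = {"CD", "NNP", "NNPS", "JJ", "JJR", "JJS"}

def specificity(sentence):
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    return sum(tag in SPECIFIC_TAGS for tag in tags) / max(len(tags), 1)

# The specific sentence should score higher than the general one,
# though POS-tagging errors will skew the numbers.
print(specificity("The box is green and weighed one pound last week."))
print(specificity("Red is a color."))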
This is my text file, "myFile.txt":
Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then
Now I want to count how many times the word "far" appears and print that count. If anyone can help me out with this, please explain. Thanks in advance.
Just use the built-in .count() method in Python.
my_string = "Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then"
print(my_string.count("far"))
would give the output
3
If you need to make it case-insensitive, you can convert my_string to lowercase first using .lower().
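Note that str.count() matches substrings too (it would count "farther" as a hit). Here is a small sketch that reads the file from the question and counts whole-word, case-insensitive matches instead:

import re

with open("myFile.txt") as f:
    text = f.read()

# \b anchors match whole words only; IGNORECASE makes "Far" count as well.
print(len(re.findall(r"\bfar\b", text, flags=re.IGNORECASE)))  # 4 for the sample text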
I'm writing a program to analyze the usage of color in text. I want to search for color words such as "apricot" or "orange". For example, an author might write "the apricot sundress billowed in the wind." However, I want to only count the apricots/oranges that actually describe color, not something like "I ate an apricot" or "I drank orange juice."
Is there any way to do this, perhaps using context() in NLTK?
Welcome to the broad field of homonymy, polysemy and word sense disambiguation (WSD). In corpus linguistics, one approach is to use collocations (e.g. "orange juice" vs. "orange jacket") to estimate the probability of the juice having the colour "orange" versus being made of the respective fruit. Both probabilities are high for juice, but the probability of a jacket being made of the fruit should be much lower.

There are different methods you could use. You could ask corpus annotators (specialists, crowdsourcing, etc.) to annotate occurrences in a text, and use that data to train a (machine learning) model, in this case a simple classifier. Alternatively, you could gather collocation counts from large text data in combination with WordNet, which may give you semantic information about whether it is usual for a jacket to be made of fruit. A fortunate detail is that people only rarely spell out stereotypical colours in text, so you don't have to worry much about cases like "the yellow banana".
Shallow parsing may also help, since colour adjectives are preferably used in attributive position.
A different approach would be to use word similarity measures (vector space semantics) or embeddings for WSD.
Maybe this helps:
https://web.stanford.edu/~jurafsky/slp3/slides/Chapter18.wsd.pdf
https://towardsdatascience.com/a-simple-word-sense-disambiguation-application-3ca645c56357
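As a concrete starting point, here is a minimal sketch using NLTK's built-in Lesk implementation (nltk.wsd.lesk) over WordNet; it assumes the punkt and wordnet NLTK data packages are downloaded. The colour test at the end (a lexname/definition check) is my own crude heuristic, and Lesk is brittle on short contexts, so treat this as a prototype rather than a reliable classifier.

from nltk import word_tokenize
from nltk.wsd import lesk

def is_colour_sense(sense):
    # WordNet files colour senses under noun.attribute, and many of their
    # definitions mention "color" explicitly; both checks are heuristics.
    return sense is not None and (
        sense.lexname() == "noun.attribute" or "color" in sense.definition()
    )

for sentence in [
    "The apricot sundress billowed in the wind.",
    "I ate an apricot.",
    "I drank orange juice.",
]:
    tokens = word_tokenize(sentence.lower())
    for word in ("apricot", "orange"):
        if word in tokens:
            sense = lesk(tokens, word)
            print(sentence, "->", word, "->", sense, "| colour use?", is_colour_sense(sense))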
I'm implementing an application that tracks the locations of Australia's sharks by analysing a Twitter dataset. So I'm using "shark" as the keyword and searching for tweets that contain "shark" and a location phrase.
So the question is: how do I identify that "Airlie Beach at Hardy Reef" is the phrase correlated with "shark"? If possible, can anyone provide working Python code to demonstrate? Thank you so much!
If you've already used NER to extract a list of locations, could you then create a table of target words and assign probabilities of being the correct location? For example, you are interested in beaches, not hospitals: if "beach" is mentioned within the location, the probability of it being the correct location increases. Another hacky way might be to measure the number of characters or tokens between the word "shark" and the location, hoping that the smaller the distance, the more likely the location is related to the actual attack.
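A minimal sketch of that distance heuristic, assuming the locations come from spaCy's NER (the en_core_web_sm model must be installed, and entity labels and boundaries will vary with the model):

import spacy

nlp = spacy.load("en_core_web_sm")

def nearest_location(tweet, keyword="shark"):
    doc = nlp(tweet)
    keyword_positions = [tok.i for tok in doc if tok.lower_ == keyword]
    if not keyword_positions:
        return None
    candidates = []
    for ent in doc.ents:
        if ent.label_ in ("GPE", "LOC", "FAC"):  # place-like entity types
            distance = min(abs(ent.start - pos) for pos in keyword_positions)
            candidates.append((distance, ent.text))
    # Smallest token distance wins, on the hope that proximity ~ relevance.
    return min(candidates)[1] if candidates else None

print(nearest_location("Shark sighting today at Airlie Beach near Hardy Reef!"))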
This is not an easy task; it would require named entity recognition (NER): https://www.quora.com/What-are-the-best-python-libraries-for-extracting-location-from-text
I have scraped a website using BeautifulSoup, and now I want to analyse all the text that I have scraped and create a long list of food items that occur in it.
Example text
If you’re a vegetarian and forever lamenting the fact that you can’t have wontons, these guys are for you! The filling is made with a simple mix of firm tofu crumbles, seasoned with salt, ginger, white pepper, and green onions. It’s super simple but so satisfying.
Make sure you drain your tofu well and dry it out as much as possible so that the filling isn’t too wet. You can even go a step further and give it a press: line a plate with paper towels, the put some paper towels on top and weigh the tofu down with another plate.
The best thing about these wontons is that the filling is completely cooked so you can adjust the seasoning just by tasting. Just make sure that the filling is slightly more saltier than you would have it if you were just eating it on it’s own. Wonton wrappers don’t have much in the way of seasoning.
These guys cook up in a flash because all you’re doing is cooking the wonton wrappers. Once you pop them in the boiling water and they float to the top, you’re good to go. Give them a toss in a spicy-soy-vinegar dressing and you’re in heaven!
I would like to create a long list from this which identifies:
wontons, tofu, vinegar, white pepper, onions, salt
I am not sure how I can do this without a pre-existing list of food items, so any suggestions would be great. I'm looking for something which can do this automatically without too much manual intervention! (I am quite new to NLP and deep learning, so any articles/methods you recommend would be super useful!)
Thanks!
If you are new to this field, you can use Gensim, a free Python library for topic modeling. You can extract the food items using Latent Semantic Analysis or similarity queries.
https://radimrehurek.com/gensim/index.html
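To make that concrete, here is a minimal sketch of a Gensim similarity query, following the library's tutorial API (Dictionary, doc2bow, LsiModel, MatrixSimilarity). It won't extract food items on its own, but it can rank your scraped passages against a small seed query of known food words so you can mine the top passages for new candidates; the two sample documents and the seed words below are placeholders for illustration.

from gensim import corpora, models, similarities

documents = [
    "The filling is a simple mix of firm tofu seasoned with salt ginger white pepper and green onions",
    "Once you pop them in the boiling water and they float to the top you are good to go",
]  # in practice: the paragraphs you scraped with BeautifulSoup

texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Build a small LSI space and a similarity index over the passages.
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

query = "tofu wontons vinegar salt"  # hypothetical seed of known food words
vec_lsi = lsi[dictionary.doc2bow(query.lower().split())]
for doc_id, score in sorted(enumerate(index[vec_lsi]), key=lambda item: -item[1]):
    print(round(float(score), 3), documents[doc_id])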