Generate WordCloud from multiple sets of text - python

Based on this question How to create a word cloud from a corpus in Python?, I built a word cloud using amueller's library. However, I fail to see how I can feed the cloud with more than one set of text. Here is what I have tried so far:
wc = WordCloud(background_color="white", max_words=2000, mask=alice_mask,
               stopwords=STOPWORDS.add("said"))
wc.generate(set_of_words)
wc.generate("foo")  # this overwrites the previous line of code,
                    # but I would like this to be appended to the set of words
I cannot find any manual for the library, so I have no idea how to proceed. Do you? :)
In reality, as you see here: Dictionary with array of different types as value in Python, I have this data structure:
category = { "World news": [2, "foo bla content of", "content of 2nd article"],
"Politics": [1, "only 1 article here"],
...
}
and I would like to append "foo bla content of" and "content of 2nd article" to the word cloud.

The easiest solution would be to regenerate the wordcloud with the updated corpus.
To build a corpus with the text contained in your category data structure (for all topics) you could use this comprehension:
# Update the corpus
corpus = " ".join([" ".join(value[1:]) for value in category.values()])
# Regenerate the word cloud
wc.generate(corpus)
To build the word cloud for a single key in your data structure (eg Politics):
# Update the corpus
corpus = " ".join(category["Politics"][1:])
# Regenerate the word cloud
wc.generate(corpus)
Explanation:
join glues multiple strings together, separated by a given delimiter
[1:] takes all the elements from a list except the first one
dict.values() gives a list of all the values in the dictionary
The expression " ".join([" ".join(value[1:]) for value in category.values()]) can thus be read as:
First glue together all the elements per key except the first one (as it is a counter), then glue together all the resulting strings.

From a brief skim of the class in https://github.com/amueller/word_cloud/blob/master/wordcloud/wordcloud.py, there isn't an update method, so you would either need to regenerate the word cloud or add an update method yourself.
The easiest way would probably be to keep the original source text, append the new text to the end of it, and then regenerate.
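A minimal sketch of that approach (the helper name and the corpus variable are my own, not part of the wordcloud API): keep the raw corpus as a plain string, append the new text to it, and call generate again.
corpus = set_of_words  # the text originally passed to wc.generate()

def append_and_regenerate(wc, corpus, new_text):
    corpus = corpus + " " + new_text  # extend the stored source text
    wc.generate(corpus)               # rebuild the cloud from the full corpus
    return corpus

corpus = append_and_regenerate(wc, corpus, "foo bla content of")
corpus = append_and_regenerate(wc, corpus, "content of 2nd article")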


Find and replace string with empty paragraph inside .doc and .docx word document

Sample environment:
Dictionary = {"camel":"create-para","donkey":"monkey","cat":"dog"}
cwd = os.getcwd(".")
for files in cwd
    if files.endswith(".doc") or files.endswith(".doc"):
        for Dictionary in files:
            do the changes
Two things to notice:
create-para in the dictionary means: remove string1 and create a new paragraph in place of string1.
In VBA macro it is like this:
Dictionary = {"camel":"^p","donkey":"monkey","cat":"dog"}
However, how to do that?
For example, I want to remove the word materials and replace it with a paragraph
(Before and after screenshots omitted.)
I'm not fully sure what you are trying to do here. What is for Dictionary in files:? Aren't Dictionary and files two separate variables? Also, I think your if condition should be:
if files.endswith(".doc") or files.endswith(".docx"):
If you are trying to change a doc/docx file, you can achieve it using python-docx. The documentation should be able to help you out. If you want to replace paragraphs, you can use this snippet from the library's GitHub page. If you want to add paragraphs, you can use the add_paragraph function:
document.add_paragraph('A plain paragraph having some ')
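For the specific example in the question (removing a word and breaking the paragraph in two), a rough sketch could look like the following. The file names and the target word "materials" are placeholders, and this assumes the document is a .docx file, since python-docx cannot open legacy .doc files.
from docx import Document

doc = Document("input.docx")  # placeholder file name
for para in doc.paragraphs:
    if "materials" in para.text:
        before, _, after = para.text.partition("materials")
        # Move the text before the word into a new paragraph above,
        # and keep the remainder in the current paragraph.
        para.insert_paragraph_before(before)
        para.text = after
        break
doc.save("output.docx")  # placeholder file name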

Using whoosh as matcher without an index

Is it possible to use whoosh as a matcher without building an index?
My situation is that I have subscriptions pre-defined with strings, and documents coming through in a stream. I check each document matches the subscriptions and send them if so. I don't need to store the documents, or recall them later. Once they've been sent to the subscriptions, they can be discarded.
Currently I am just using simple matching, but as consumers ask for searches based on fields, and/or logic, etc., I'm wondering whether it's possible to use a whoosh matcher and allow whoosh query syntax for this.
I could build an index for each document, query it, and then throw it away, but that seems very wasteful, is it possible to directly construct a Matcher? I couldn't find any docs or questions online indicating a way to do this and my attempts haven't worked.
Alternatively, is this just the wrong library for this task, and is there something better suited?
The short answer is no.
Search indices and matchers work quite differently. For example, when searching for the phrase "hello world", a matcher simply checks whether the document text contains the substring "hello world". A search index cannot work that way: it would have to scan every document, and that would be very slow.
Instead, as documents are added, every word in them is added to that word's index. So the index for "hello" will say that document 1 matches at position 0, and the index for "world" will say that document 1 matches at position 6. A search for "hello world" then finds all document IDs in the "hello" index, then all in the "world" index, and checks whether any document has a position for "world" that is 6 positions after its position for "hello".
So it's a completely orthogonal way of doing things in whoosh vs a matcher.
It is possible to do this with whoosh, using a new index for each document, like so:
# schema is a whoosh Schema and Document is the application's own document
# type, both defined elsewhere.
from whoosh.filedb.filestore import RamStorage
from whoosh.query import Query

def matches_subscription(doc: Document, q: Query) -> bool:
    with RamStorage() as store:
        ix = store.create_index(schema)
        writer = ix.writer()
        writer.add_document(
            title=doc.title,
            description=doc.description,
            keywords=doc.keywords
        )
        writer.commit()
        with ix.searcher() as searcher:
            results = searcher.search(q)
            return bool(results)
This takes about 800 milliseconds per check, which is quite slow.
A better solution is to build a parser with pyparsing and then create your own nested query classes which can do the matching, better fitting your specific search queries. It's quite extensible that way, too. That can bring it down to roughly 40 microseconds, i.e. about 20,000 times faster.
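A minimal sketch of the nested-query-class idea (the class names and the matches interface are illustrative, not whoosh API; the pyparsing step that would build these objects from a query string is omitted):
class Term:
    def __init__(self, word: str):
        self.word = word.lower()

    def matches(self, text: str) -> bool:
        return self.word in text.lower().split()

class And:
    def __init__(self, *subqueries):
        self.subqueries = subqueries

    def matches(self, text: str) -> bool:
        return all(q.matches(text) for q in self.subqueries)

class Or:
    def __init__(self, *subqueries):
        self.subqueries = subqueries

    def matches(self, text: str) -> bool:
        return any(q.matches(text) for q in self.subqueries)

# Usage: the subscription "hello AND (world OR planet)" becomes:
query = And(Term("hello"), Or(Term("world"), Term("planet")))
print(query.matches("Hello to the whole world"))  # True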

How to search for multiple multi-word phrases in pandas?

I have some JSON data converted into a Pandas DataFrame. I am looking to find all rows whose string content matches a list of multi-word phrases.
I am working with a massive amount of Twitter JSON data already downloaded for public use (so Twitter API usage is not applicable). This JSON is converted into a Pandas DataFrame. One of the available columns is text, which is the body of the tweet. An example is:
We’re kicking off the first portion of a citywide traffic calming project to make residential streets more safe & pedestrian-friendly, next week!
Tuesday, July 30 at 10:30 AM
Nautilus Drive and 42 Street
I want to be able to have a list of phrases, phrases = ["We're kicking off", "we're starting", "we're initiating"], and do something like pd[pd['text'].str.contains(phrases)] to ensure that I can obtain pandas DataFrame rows whose text column contains one of the phrases.
This is perhaps asking too much, but ideally I would also be able to match something like phrases = ["(We're| we are) kicking off", "(we're | we are) starting", "(we're| we are) initiating"]
Make a list of the keywords or phrases you want to match. I have used logic for an exact match; you can change that by adjusting the regex. It will also record which keyword caught each text.
Here is the code -
import re
import pandas as pd

commentlist = []  # texts that matched at least one keyword
keywordlist = []  # the keyword that produced each match
for i in range(len(mustkeywords)):
    for index in range(len(text)):
        result = re.search(r'\s*\b' + mustkeywords[i] + r'\W\s*', text[index])
        if result:
            commentlist.append(text[index])
            keywordlist.append(mustkeywords[i])
tempmustkeywordsdf = pd.DataFrame(columns=["Comments"], data=commentlist)  # temp df for matches
tempmustkeywordsdf["Keywords"] = keywordlist  # add the matching keyword for each row
Here mustkeywords is a list that contains your phrases or keywords,
text is the collection of strings (e.g. the DataFrame's text column) that you want to check for keywords,
and tempmustkeywordsdf is a DataFrame that contains the matched strings and the keywords that matched them.
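For reference, a more vectorized sketch of the same idea (assuming the tweets live in a DataFrame named df with a text column, as in the question; the variable names are my own):
import re
import pandas as pd

phrases = ["We're kicking off", "we're starting", "we're initiating"]
pattern = "|".join(re.escape(p) for p in phrases)  # one regex matching any phrase
matches = df[df["text"].str.contains(pattern, case=False, regex=True, na=False)]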
I hope this helps.

Clean API results to get the headlines of news articles?

I have been having trouble finding a way to pull out specific text info from the Guardian API for my dissertation. I have managed to get all my text onto Python but how do you then clean it to get say, just the headlines of the news articles?
This is a snippet of the API result that I want to pull out info from:
{
  "response": {
    "status":"ok",
    "userTier":"developer",
    "total":1869990,
    "startIndex":1,
    "pageSize":10,
    "currentPage":1,
    "pages":186999,
    "orderBy":"newest",
    "results":[
      {
        "id":"sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
        "type":"liveblog",
        "sectionId":"sport",
        "sectionName":"Sport",
        "webPublicationDate":"2016-07-09T13:21:36Z",
        "webTitle":"Tour de France 2016: stage eight – live!",
        "webUrl":"https://www.theguardian.com/sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
        "apiUrl":"https://content.guardianapis.com/sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
        "isHosted":false
      },
      {
        "id":"sport/live/2016/jul/09/serena-williams-v-angelique-kerber-wimbledon-womens-final-live",
        "type":"liveblog",
        "sectionId":"sport",
        "sectionName":"Sport",
        "webPublicationDate":"2016-07-09T13:21:02Z",
        "webTitle":"Serena Williams v Angelique Kerber: Wimbledon women's final –
        ...
Hoping the OP adds the used code to the question.
One solution in Python: whatever you get back (from the methods offered by the requests module?) will either already be a deeply nested structure you can index into, or you can easily map it to such a structure via json.loads(the_string_you_displayed).
Sample:
import json

d = json.loads(the_string_you_displayed)
head_line = d['response']['results'][0]['webTitle']
This would store in head_line the value found in the first dict (index 0) of the results "array" inside the response entry. (The question was updated, so the full path is now visible.)
This assumes I read the given sample snippet correctly and that it was merely cut off during copy and paste, because the sample as shown is invalid JSON.
If the text does not represent valid JSON, you will have to sift through it via substring or pattern matching, which may well be very brittle ...
Update: So assuming the full response structure is stored inside a variable named data:
result_seq = data['response']['results'] # yields a list here
headlines = [result['webTitle'] for result in result_seq]
The last line works like this: it is a list comprehension that compactly builds a list from all entries result in result_seq by picking the value of the key webTitle from each dict.
An equivalent explicit for loop picking them all would be:
result_seq = data['response']['results']
headlines = []
for result in result_seq:
headlines.append(result['webTitle'])
This does not check for errors like result dicts without a webTitle key, but Python will raise a matching exception, and one can decide whether to wrap the processing in a try: ... except block or hope for the best ...
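If you prefer to skip malformed entries instead of raising, a tolerant variant (a sketch, not part of the original answer) could look like this:
result_seq = data.get('response', {}).get('results', [])
headlines = [result['webTitle'] for result in result_seq if 'webTitle' in result]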

clustering of a text file

Original Question:
I have a flat file with each row representing text associated with an application. I would like to cluster applications based on the words associated with each application. Is there free code available for text mining a single flat file? Thank you.
Update 1:
There are 30,000 applications. I am trying to figure out what behaviors (of customers) are associated with each cluster. I don't have a predefined set of words to start with. I could inspect a random few and determine some words, but that would not give me an exhaustive list of words. I would like to capture the majority of the behaviors in a systematic way.
I tried converting the text file into an XML file and clustering it using the carrot2 workbench, but that didn't work. I haven't used carrot2 before, so I may be doing something wrong there.
My understanding is that you have a file like:
game Solitaire
productivity OpenOffice
game MineSweeper
...
And you want to categorize everything based on its tag word, i.e. put applications into buckets based on their associated tag/description/...
I think you can use a dictionary of lists for this purpose, e.g.:
out = {}
with open('input.txt') as f:
    for inline in f:
        if ' ' not in inline:
            continue  # skip lines that don't contain a "tag name" pair
        tag, appname = inline.strip('\n').split(' ', 1)
        if tag not in out:
            out[tag] = []
        out[tag].append(appname)

print(out['game'])
This iterates through input once and clusters application names based on their tags very efficiently.
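The same idea can be expressed with collections.defaultdict, which removes the explicit membership check (a sketch under the same assumed input format):
from collections import defaultdict

out = defaultdict(list)
with open('input.txt') as f:
    for line in f:
        if ' ' not in line:
            continue  # skip lines without a "tag name" pair
        tag, appname = line.strip('\n').split(' ', 1)
        out[tag].append(appname)

print(out['game'])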
