I just wrote a script that extracts all the spoken text of the Dutch Parliament from a few thousand XML files. For every speaker it counts the number of times that speaker said each word.
After doing this I calculated the TF*IDF value of every word for each speaker in the Dutch Parliament. If you are not familiar with this, see this link: TF IDF explanation
So now I have a dictionary for each speaker in the Dutch Parliament, where the keys are the words he said and the values are the corresponding TF*IDF values:
{u'asielzoekers': 0.0034861170591325486,
u'belastingverlaging': 0.0018551991553514675,
u'buma': 0.0020712555982839408,
u'islam': 0.0029519544163739155,
u'moslims': 0.0027958002747301355,
u'ouderen': 0.0022803123245457566,
u'pechtold': 0.0021525864470786928,
u'president': 0.003281844532743345,
u'rutte': 0.0023488684001475584,
u'samsom': 0.0019304632325980841}
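For readers who want to reproduce weights of this kind, here is a minimal sketch of one common TF*IDF variant. The question does not say which variant the script uses, so the formulas (tf = count / total words of the speaker, idf = natural log of speakers / speakers using the word), the function name tf_idf_per_speaker and the toy data are all assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf_per_speaker(counts_by_speaker):
    """Return {speaker: {word: tf * idf}} from {speaker: Counter of words}."""
    n_speakers = len(counts_by_speaker)
    # Document frequency: how many speakers used each word at least once.
    doc_freq = Counter()
    for counts in counts_by_speaker.values():
        doc_freq.update(counts.keys())

    scores = {}
    for speaker, counts in counts_by_speaker.items():
        total = sum(counts.values())
        scores[speaker] = {
            word: (count / total) * math.log(n_speakers / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical toy data, not taken from the parliamentary records.
counts_by_speaker = {
    "speaker_a": Counter({"islam": 3, "president": 1}),
    "speaker_b": Counter({"ouderen": 2, "president": 2}),
}
weights = tf_idf_per_speaker(counts_by_speaker)
```

With this variant, a word used by every speaker gets weight 0, so it would not show up in a cloud built from the weights.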
Right now I want to create a wordcloud from these values. I have briefly looked into the wordcloud module written by amueller, but as far as I can see this module does not work with a dictionary, only with plain text.
So any help on how to create a wordcloud from a dictionary's values would be highly appreciated.
Thanks in advance!
dictionary= {u'asielzoekers': 0.0034861170591325486,.. u'samsom': 0.0019304632325980841}
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# generate_from_frequencies takes the word -> weight dictionary directly
wc = WordCloud(background_color="white", width=1000, height=1000, max_words=10,
               relative_scaling=0.5, normalize_plurals=False).generate_from_frequencies(dictionary)
plt.imshow(wc)
plt.axis("off")
plt.show()
import matplotlib.pyplot as plt
from wordcloud import WordCloud
word_could_dict = {'Git': 100, 'GitHub': 100, 'push': 50, 'pull': 10, 'commit': 80, 'add': 30, 'diff': 10,
                   'mv': 5, 'log': 8, 'branch': 30, 'checkout': 25}
wordcloud = WordCloud(width=1000, height=500).generate_from_frequencies(word_could_dict)
plt.figure(figsize=(15, 8))
plt.imshow(wordcloud)
And we get:
Creating Wordcloud with Dictionaries
05/02/21 | 5th of February 2021 |
Producing a wordcloud image from a dictionary of name-value pairs with WordCloud's generate(), generate_from_text() and generate_from_frequencies() methods did not work for me, no matter how many times I tried to get around the problem.
Having checked Stack Overflow for a way around it, I tried replicating the solutions from the answers above in my program. They did not resolve my issue, which was a TypeError exception raised while creating and displaying a wordcloud image.
After checking the "WordCloud API Documentation" on the official module site, I found out that you have to use something called "multidict" manually. It's a Python module which acts like a dictionary, utilising..
"[a] collection of key-value pairs where key might be occurred more than
once in the container"
- quoted from Multidict's main PyPI introductory page
For more information on the Multidict module, click here: https://multidict.readthedocs.io/en/stable/
To check out their official GitHub repository, click the following: https://github.com/aio-libs/multidict
Extracted from WordCloud's "Gallery of Examples" page, here is a snippet that uses the multidict module to build a frequency dictionary, which is then visualised as a wordcloud:
import multidict as multidict
...
def getFrequencyDictForText(sentence):
    # instantiate multidict object
    fullTermsDict = multidict.MultiDict()
    tmpDict = {}
    # making dict for counting frequencies
    for text in sentence.split(" "):
        ...
        val = tmpDict.get(text, 0)
        tmpDict[text.lower()] = val + 1
    for key in tmpDict:
        fullTermsDict.add(key, tmpDict[key])
    return fullTermsDict
def makeImage(text):
    alice_mask = np.array(Image.open("alice_mask.png"))
    # instantiate and define wordcloud properties
    wc = WordCloud(background_color="white", max_words=1000, mask=alice_mask)
    # generate wordcloud
    wc.generate_from_frequencies(text)
    # display and show "wc"
    plt.imshow(wc, interpolation="bilinear")
    plt.show()
...
...
Note: This is not the full source. To see the whole code, check out Amueller WordCloud website
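As far as I can tell, generate_from_frequencies only needs a mapping from word to weight, so a plain dict or a collections.Counter may be enough in many cases. The following is a minimal sketch under that assumption, not the gallery's code: the helper names build_frequencies and make_image are made up for illustration, and the mask from the gallery example is left out.

```python
from collections import Counter


def build_frequencies(text):
    """Count lower-cased, whitespace-separated words in `text`."""
    return Counter(word.lower() for word in text.split())


def make_image(frequencies):
    """Render a word -> weight mapping as a wordcloud and show it."""
    # Imported here so the counting helper works without these packages.
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    wc = WordCloud(background_color="white", max_words=1000)
    wc.generate_from_frequencies(frequencies)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()


frequencies = build_frequencies("Git push git pull Git commit")
```

Calling make_image(frequencies) would then draw the cloud; if a TypeError still appears, it is worth checking that every value in the mapping is a number.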
As far as I know, and from what I have read, a wordcloud is defined as follows:
Wordcloud is a popular technique that helps us identify the keywords in a text.
In a wordcloud, more frequent words have a larger and bolder font, while less frequent words have smaller or thinner fonts.
In Python, you can make simple wordclouds with the wordcloud library and nice-looking wordclouds with the stylecloud library.
I have the following code to plot the keywords from a text:
import numpy as np
import matplotlib.pyplot as plt
import stylecloud
stylecloud.gen_stylecloud(file_path='SJ-Speech.txt',
                          icon_name="fas fa-apple-alt")
plt.show()
The expected output is this:
But the result is nothing:
C:\Users\User\PycharmProjects\AI_Project\venv\Scripts\python.exe C:/Users/User/PycharmProjects/AI_Project/Word_Clous_Example.py
Process finished with exit code 0
Did I miss something? Please help me.
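A possible explanation, as far as I can tell from the stylecloud README: gen_stylecloud writes the image to a file (by default stylecloud.png in the working directory) rather than drawing on a matplotlib figure, so plt.show() has nothing to display. A minimal sketch, assuming stylecloud, Pillow and matplotlib are installed; the function name render_and_show is made up for illustration.

```python
def render_and_show(text_path, output_path="stylecloud.png"):
    """Generate a stylecloud image file from a text file, then display it."""
    # Imported inside the function so this module loads without the packages.
    import matplotlib.pyplot as plt
    import stylecloud
    from PIL import Image

    # gen_stylecloud saves its result to `output_name`; it draws nothing itself.
    stylecloud.gen_stylecloud(file_path=text_path,
                              icon_name="fas fa-apple-alt",
                              output_name=output_path)

    # Load the saved image and hand it to matplotlib for display.
    plt.imshow(Image.open(output_path))
    plt.axis("off")
    plt.show()
```

Calling render_and_show('SJ-Speech.txt') should leave a stylecloud.png next to the script even if no window opens, so the file can also be opened directly.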
I am running the following code in Kaggle, with the language set to "Python". I am trying to perform NLP on a paragraph and create visuals using the code below.
import matplotlib
# Force matplotlib to not use any Xwindows backend.
matplotlib.use('Agg')
import matplotlib.pyplot
import matplotlib.pyplot as plt
import pandas as pd
import itertools
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.tokenize import RegexpTokenizer
from nltk import RegexpParser
document= """The BBC has been testing a new service called SoundIndex, which
lists the top 1,000 artists based on discussions crawled from Bebo,
Last.fm, Google Groups, iTunes, MySpace and YouTube. The top five
bands according to SoundIndex right now are Coldplay, Rihanna, The
Ting Tings, Duffy and Mariah Carey , but the index is refreshed
every six hours. SoundIndex also lets users sort by popular tracks,
search by artist, or create customized charts based on music
preferences or filters by age range, sex or location. Results can
also be limited to just one data source (such as Last.fm).
"""
sentences = sent_tokenize(document)
sentences = [word_tokenize(sent) for sent in sentences]
sentences = [pos_tag(sent) for sent in sentences]
sentence = list(itertools.chain(*sentences))
#grammar = "NP: {<DT>?<JJ>*<NN>}"
grammar = """
NOUN_VERB_NOUN: {<DT>?<VB>*<NN.*>+}
GRUND_NOUN: {<VBG.><NN.*>+}
VN:{<VBN><NN>+}
NOUN_AND_ADJ: {<NN>?<JJ>*<NN.*>+}
{<N.*|JJ.*>*<N.*>} # Nouns and Adjectives, terminated with Nouns
NOUN_PHRASE: {<DT>?<JJ>*<NN>}
ADJ_PHRASE: {}
KEYPHRASE: {<DT>?<JJ>*<NN>}
KEYWORDS: {<NN.*>}
VERB_PHRASE: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
CLAUSE: {<NP><VP>}
"""
cp = RegexpParser(grammar)
result = cp.parse(sentence)
#print(result)
result.draw()
for subtree in result.subtrees():
    #if subtree.label() == "NOUN_VERB_NOUN":
    #print("NOUN_VERB_NOUN: "+str(subtree.leaves()))
    print(str(subtree.label())+" "+str(subtree.leaves()))
result.draw()
return(result)
The error I am getting on execution is:
TclError: no display name and no $DISPLAY environment variable
Could anyone please help me resolve this issue?
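The error most likely comes from result.draw(): as far as I know, NLTK opens a Tk window for it, which needs a display, and the matplotlib 'Agg' setting does not affect Tk. One workaround is to fall back to a text rendering when no display is available. This is a minimal sketch, assuming a Linux/X11 environment where the DISPLAY variable signals a display; the helper name show_tree is made up for illustration.

```python
import os


def show_tree(tree):
    """Open the Tk viewer when a display exists, else print a text rendering."""
    if os.environ.get("DISPLAY"):
        tree.draw()  # opens a Tk window, so it needs a display
        return None
    text = tree.pformat()  # bracketed plain-text form of the parse tree
    print(text)
    return text
```

In the code above, show_tree(result) would replace the calls to result.draw().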
I have a dataset of Twitter texts which is a mixture of English, Arabic, and Farsi. I wanted to create a word cloud out of it. Unfortunately, my word cloud shows empty squares for Arabic and Persian words in the image. I have heard about three ways of tackling this problem:
Using different encodings: I tried "UTF-8", "UTF-16", "UTF-32" and "ISO-8859-1", which didn't fix the problem.
Using arabic_reshaper: didn't work.
Using a font which supports all three languages, like "Arial": when I try to change the font to Arial in the word cloud, I receive the following error:
input
wordcloud = WordCloud(font_path = 'arial',stopwords = stopwords, background_color = "white", max_font_size = 50, max_words = 100).generate(reshaped_text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
output
cannot open resource
This code works well in Anaconda but not in Google Colab. The only thing that needs to be solved is which path I should enter for font_path in Google Colab.
With the Persian language you have three problems to solve:
1. Persian characters don't show correctly. This is solved with either the encoding or the font, which I think you have already done.
2. Persian characters appear but they are separated. In this case you should use arabic_reshaper's reshape function. Keep in mind this doesn't solve your problem completely and you still need step 3.
3. Persian words are written left to right. You should solve this problem with the python-bidi library.
For example, I successfully created a word cloud with the following code:
import matplotlib.pyplot as plt
import arabic_reshaper
from bidi.algorithm import get_display
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
txt = '''I would love to try or hear the sample audio your app can produce. I do not want to purchase, because I've purchased so many apps that say they do something and do not deliver.
Can you please add audio samples with text you've converted? I'd love to see the end results.
Thanks!
سلام حال سلام سلام سلام حال شما چطوره است نیست
'''
word_cloud = WordCloud(font_path='arial', stopwords=STOPWORDS, background_color="white", max_font_size=50, max_words=100)
word_cloud = word_cloud.generate_from_text(get_display(arabic_reshaper.reshape(txt)))
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
I uploaded the font to my Google Drive and used this code, which worked:
wordcloud = WordCloud(font_path='/content/drive/My Drive/ARIAL.TTF',stopwords=stopwords, background_color="white", max_font_size=50, max_words=100).generate(get_display(arabic_reshaper.reshape(all_tweets)))
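As far as I know, "cannot open resource" usually means the font file could not be found or opened at the given path, so it can help to check the path before handing it to WordCloud. A minimal sketch; the helper name first_existing_font is made up for illustration, and the candidate list is whatever font files you have uploaded.

```python
import os


def first_existing_font(candidates):
    """Return the first path in `candidates` that points to an existing file."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(
        "none of the candidate font files exist: %r" % (candidates,)
    )


# Example use with the path from the snippet above (uncomment in Colab):
# font_path = first_existing_font(['/content/drive/My Drive/ARIAL.TTF'])
```

The returned path can then be passed as font_path, and a missing file fails early with a clearer message.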
You may want to test these specific word cloud libraries for Persian.
persian_wordcloud
wordcloud-fa
I used the correct encoding and the software does work, but the Arabic letters aren't connected (most Arabic letters are connected when written next to each other).
This is how the plot shows the image:
I used the bar chart function of the Python pandas module.
The only way I could get around this was to take all the Arabic words in the pandas column that I need to plot, reshape them with arabic_reshaper, append them to a list, and use this list as my x axis:
# reshaping the arabic words to show correctly on matplotlib
import arabic_reshaper
import matplotlib.pyplot as plt
from bidi.algorithm import get_display
x = []
for item in df.column_name.values:
    x.append(get_display(arabic_reshaper.reshape(item)))
Of course, you need to install the arabic_reshaper and python-bidi packages first.
This is a follow-up to this question. Since it addresses a more general issue, I am posting it as a new question.
I have a network whose node labels are in Farsi (Arabic alphabet). When I try to use networkx to display my network, it shows blank squares instead of Arabic letters. Below I copy a good example provided in the answers here.
from bidi.algorithm import get_display
import matplotlib.pyplot as plt
import arabic_reshaper
import networkx as nx
# Arabic text preprocessing
reshaped_text = arabic_reshaper.reshape(u'زبان فارسی')
artext = get_display(reshaped_text)
# constructing the sample graph
G=nx.Graph()
G.add_edge('a', artext ,weight=0.6)
pos=nx.spring_layout(G)
nx.draw_networkx_nodes(G,pos,node_size=700)
nx.draw_networkx_edges(G,pos,edgelist=G.edges(data=True),width=6)
# Drawing Arabic text
# Just Make sure your version of the font 'Times New Roman' has Arabic in it.
# You can use any Arabic font here.
nx.draw_networkx_labels(G,pos,font_size=20, font_family='Times New Roman')
# showing the graph
plt.axis('off')
plt.show()
which generates the following image:
I tried to select the needed font with the following commands in Python, but I get the same thing.
>>> import matplotlib.pyplot
>>> matplotlib.rcParams.update({'font.family' : 'TraditionalArabic'})
Here is the error message, to be more specific:
/usr/local/anaconda3/lib/python3.5/site-packages/matplotlib/font_manager.py:1288: UserWarning: findfont: Font family ['TraditionalArabic'] not found. Falling back to Bitstream Vera Sans
(prop.get_family(), self.defaultFamily[fontext])
I am also investigating ways to install the needed fonts from the Ubuntu CLI, if possible, and to put that in my Dockerfile, since the environment gets installed every time I spin up my runs.
Best regards, s.
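The warning above says matplotlib cannot find a font family with that name, which suggests the font file is either not installed or not known to matplotlib. A minimal sketch of one way to check, assuming the font files live in directories you can list; the helper names find_font_files and register_fonts are made up for illustration, and fontManager.addfont needs a reasonably recent matplotlib (3.2 or later, as far as I know), newer than the one in the traceback.

```python
import os


def find_font_files(directories, extensions=(".ttf", ".otf")):
    """Return sorted paths of font files found under the given directories."""
    found = []
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                if name.lower().endswith(extensions):
                    found.append(os.path.join(root, name))
    return sorted(found)


def register_fonts(paths):
    """Make the font files known to matplotlib and return their family names."""
    # Imported lazily so the file search works without matplotlib installed.
    from matplotlib import font_manager

    names = []
    for path in paths:
        font_manager.fontManager.addfont(path)
        names.append(font_manager.FontProperties(fname=path).get_name())
    return names
```

A family name returned by register_fonts is what goes into 'font.family' or the font_family argument of draw_networkx_labels; the file name itself is usually not accepted there.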