How to handle big textual data to create WordCloud? - python

I have a huge amount of textual data from which I need to create a word cloud. I am using the Python library word_cloud, which is quite configurable. The problem is that my data is so large that even a high-end computer cannot finish the task in many hours.
The data is initially stored in MongoDB. Because of cursor issues while reading it into a Python list, I exported the whole dataset to a plain text file - a single .txt file of 304 MB.
So the question I am looking to answer is: how can I handle this huge amount of text? The word_cloud library expects a single string containing the whole data, with words separated by spaces, in order to create the word cloud.
p.s. Python version: 3.7.1
p.s. word_cloud is an open source Word Cloud generator for Python which is available on GitHub: https://github.com/amueller/word_cloud

You don't need to load the whole file into memory.
from wordcloud import WordCloud
from collections import Counter
wc = WordCloud()
counts_all = Counter()
with open('path/to/file.txt', 'r') as f:
    for line in f:  # here you could also iterate over the MongoDB cursor
        counts_line = wc.process_text(line)
        counts_all.update(counts_line)
wc.generate_from_frequencies(counts_all)
wc.to_file('/tmp/wc.png')
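If you'd rather skip the 304 MB export entirely, the same incremental counting works directly over a MongoDB cursor. This is a minimal sketch assuming pymongo and a hypothetical database, collection, and text field (all three names are placeholders):
from pymongo import MongoClient
from collections import Counter
from wordcloud import WordCloud
client = MongoClient('mongodb://localhost:27017')  # assumed connection string
collection = client['mydb']['documents']           # hypothetical db/collection names
wc = WordCloud()
counts_all = Counter()
for doc in collection.find({}, {'text': 1}):        # 'text' is a placeholder field name
    counts_all.update(wc.process_text(doc.get('text', '')))
wc.generate_from_frequencies(counts_all)
wc.to_file('/tmp/wc.png')
Because only the running Counter is kept in memory, this sidesteps both the cursor-into-list problem and the giant single string.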

Related

Is there any feasible solution to read WOT battle results .dat files?

I am new here and trying to solve a question that interests me about World of Tanks. I want to do some batch analysis of our clan's battle performance, and I heard that every battle's data is saved on the client's disk in the Wargaming.net folder.
It is said that these .dat files are a kind of JSON, so I tried a couple of lines of Python code to read one, but it failed.
import json
f = open('ex.dat', 'r', encoding='unicode_escape')
content = f.read()
a = json.loads(content)
print(type(a))
print(a)
f.close()
The code is very simple and clearly fails. Well, could anyone tell me what is really going on with these files?
Added on Feb. 9th, 2022
After trying another snippet in a Jupyter Notebook, it seems that something can be read from the .dat files:
import struct
import numpy as np
import matplotlib.pyplot as plt
import io
with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    fbuff = io.BufferedReader(f)
    N = len(fbuff.read())
    print('byte length: ', N)
with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    data = struct.unpack('b' * N, f.read(1 * N))
The result is a tuple of integers, but I have no idea how to deal with it now.
Here's how you can parse some parts of it.
import pickle
import zlib
file = '4402905758116487.dat'
cache_file = open(file, 'rb')  # This could be improved to avoid keeping the file open.
# When loading pickles created by Python 2 under Python 3, you need the 'bytes' or 'latin1' encoding.
legacyBattleResultVersion, brAllDataRaw = pickle.load(cache_file, encoding='bytes', errors='ignore')
arenaUniqueID, brAccount, brVehicleRaw, brOtherDataRaw = brAllDataRaw
# The data stored inside the pickled file will be a compressed pickle again.
vehicle_data = pickle.loads(zlib.decompress(brVehicleRaw), encoding='latin1')
account_data = pickle.loads(zlib.decompress(brAccount), encoding='latin1')
brCommon, brPlayersInfo, brPlayersVehicle, brPlayersResult = pickle.loads(zlib.decompress(brOtherDataRaw), encoding='latin1')
# Lastly you can print all of these and see a lot of data inside.
The response contains a mixture of more binary files as well as some data captured from the replays.
This is not a complete solution but it's a decent start to parsing these files.
First, you can look at the replay file itself in a text editor, but that won't show you much: the code at the beginning of the file has to be cleaned out first. Then there is a ton of info that you have to read in and figure out, but it is the stats for each player in the game. Only then comes the part that deals with the actual replay, and you don't need that stuff.
You can grab the player IDs and tank IDs from WoT developer area API if you want.
After loading the pickle files as gabzo mentioned, you will see that the result is simply a list of values, and without knowing what each value refers to, it's hard to make sense of it. The identifiers for the values can be extracted from your game installation:
import zipfile
WOT_PKG_PATH = "Your/Game/Path/res/packages/scripts.pkg"
BATTLE_RESULTS_PATH = "scripts/common/battle_results/"
archive = zipfile.ZipFile(WOT_PKG_PATH, 'r')
for file in archive.namelist():
    if file.startswith(BATTLE_RESULTS_PATH):
        archive.extract(file)
You can then decompile the Python files (for example with uncompyle6) and go through the code to see the identifiers for the values.
One thing to note is that the list of values for the main pickle objects (like brAccount from gabzo's code) always has a checksum as the first value. You can use this to check whether you have the right order and the correct identifiers for the values. The way these checksums are generated can be seen in the decompiled python files.
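As a rough illustration of how the identifiers and values line up, here is a hedged sketch; the field names below are made up, and the real names and their order have to come from the decompiled battle_results modules for your game version:
# Hypothetical field names -- take the real list from the decompiled
# scripts/common/battle_results modules for your client version.
ACCOUNT_FIELDS = ['credits', 'xp', 'damageDealt']
def to_dict(field_names, values):
    # values[0] is the layout checksum mentioned above; the remaining
    # values are assumed to line up one-to-one with the identifiers.
    checksum, payload = values[0], values[1:]
    return {'checksum': checksum, **dict(zip(field_names, payload))}
# e.g. account_dict = to_dict(ACCOUNT_FIELDS, account_data)
If the checksum doesn't match the one computed by the game scripts, the field list you extracted is probably for a different client version.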
I have been tackling this problem for some time (albeit in Rust): https://github.com/dacite/wot-battle-results-parser/tree/main/datfile_parser.

How do I process API return without writing to file in Python?

I am trying to slice the results of an API response to process just the first n values in Python without writing to a file first.
Specifically I want to do analysis on the "front page" from HN, which is just the first 30 items. However the API (https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty) gives you the first 500 results.
Right now I'm pulling top stories and writing to a file, then importing the file and truncating the string:
import json
import requests
# request JSON data from the HN API
topstories = requests.get('https://hacker-news.firebaseio.com/v0/topstories.json')
# write the response to a .txt file named topstories.txt
with open('topstories.txt', 'w') as fd:
    fd.write(topstories.text)
# truncate the text file to the top 30 stories
f = open('topstories.txt', 'r+')
f.truncate(270)
f.close()
This is inelegant and inefficient, and I will have to do it again to extract each 8-digit object ID.
How do I process this API return data as much as possible in memory, without writing to a file?
Suggestion:
User jordanm suggested the code replacement:
fd.write(json.dumps(topstories.json()[:30]))
However, that only changes when I would need to write and read the file, rather than letting me work with the data entirely in memory.
What you want is the io module from the standard library:
https://docs.python.org/3/library/io.html
Basically:
import io
f = io.StringIO(topstories.text)
f.truncate(270)
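That keeps the truncation in memory, but for this particular API you can also skip the string handling entirely: parse the response as JSON, slice the first 30 IDs, and fetch each story directly. A minimal sketch, assuming the per-item endpoint documented in the official HN API:
import requests
top = requests.get('https://hacker-news.firebaseio.com/v0/topstories.json').json()
front_page_ids = top[:30]  # the front page is the first 30 story IDs
for story_id in front_page_ids:
    # item endpoint from the HN API documentation
    item = requests.get(f'https://hacker-news.firebaseio.com/v0/item/{story_id}.json').json()
    print(item.get('title'))
Slicing the parsed list avoids guessing a character count like 270 altogether.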

Read a large JSON file with Python Dask raises a delimiter error

I am trying to read a JSON file of about 12 GB, downloaded from the AMiner Citation Network dataset V12 from here. Below is a sample, from which I removed a couple of fields that make the JSON too long.
[{"id":1091,"authors":[{"name":"Makoto Satoh","org":"Shinshu University","id":2312688602},{"name":"Ryo Muramatsu","org":"Shinshu University","id":2482909946},{"name":"Mizue Kayama","org":"Shinshu University","id":2128134587},{"name":"Kazunori Itoh","org":"Shinshu University","id":2101782692},{"name":"Masami Hashimoto","org":"Shinshu University","id":2114054191},{"name":"Makoto Otani","org":"Shinshu University","id":1989208940},{"name":"Michio Shimizu","org":"Nagano Prefectural College","id":2134989941},{"name":"Masahiko Sugimoto","org":"Takushoku University, Hokkaido Junior College","id":2307479915}],"title":"Preliminary Design of a Network Protocol Learning Tool Based on the Comprehension of High School Students: Design by an Empirical Study Using a Simple Mind Map","year":2013,"n_citation":1,"page_start":"89","page_end":"93","doc_type":"Conference","publisher":"Springer, Berlin, Heidelberg","volume":"","issue":"","doi":"10.1007/978-3-642-39476-8_19","references":[2005687710,2018037215],"fos":[{"name":"Telecommunications network","w":0.45139},{"name":"Computer science","w":0.45245},{"name":"Mind map","w":0.5347},{"name":"Humanâcomputer interaction","w":0.47011},{"name":"Multimedia","w":0.46629},{"name":"Empirical research","w":0.49737},{"name":"Comprehension","w":0.47042},{"name":"Communications protocol","w":0.51907}],"venue":{"raw":"International Conference on Human-Computer Interaction","id":1127419992,"type":"C"}}
,{"id":1388,"authors":[{"name":"Pranava K. Jha","id":2718958994}],"title":"Further Results on Independence in Direct-Product Graphs.","year":2000,"n_citation":1,"page_start":"","page_end":"","doc_type":"Journal","publisher":"","volume":"56","issue":"","doi":"","fos":[{"name":"Graph","w":0.0},{"name":"Discrete mathematics","w":0.45872},{"name":"Combinatorics","w":0.4515},{"name":"Direct product","w":0.59104},{"name":"Mathematics","w":0.42784}],"venue":{"raw":"Ars Combinatoria","id":73158690,"type":"J"}}
,{"id":1674,"authors":[{"name":"G. Beale","org":"Archaeological Computing Research Group, University of Southampton, UK#TAB#","id":2103626414},{"name":"G. Earl","org":"Archaeological Computing Research Group, University of Southampton, UK#TAB#","id":2117665592}],"title":"A methodology for the physically accurate visualisation of roman polychrome statuary","year":2011,"n_citation":1,"page_start":"137","page_end":"144","doc_type":"Conference","publisher":"Eurographics Association","volume":"","issue":"","doi":"10.2312/VAST/VAST11/137-144","references":[1535888970,1992876689,1993710814,2035653341,2043970887,2087778487,2094478046,2100468662,2110331535,2120448006,2138624212,2149970020,2150266006,2296384428,2403283736],"fos":[{"name":"Statue","w":0.40216},{"name":"Engineering drawing","w":0.43427},{"name":"Virtual reconstruction","w":0.0},{"name":"Computer science","w":0.42062},{"name":"Visualization","w":0.4595},{"name":"Polychrome","w":0.4474},{"name":"Artificial intelligence","w":0.40496}],"venue":{"raw":"International Conference on Virtual Reality","id":2754954274,"type":"C"}}]
When I try to read the file with Python Dask (I cannot open it like any other file, since it is too big and I get a memory limit error):
import json
import dask.bag as db
if __name__ == '__main__':
    b = db.read_text('dblp.v12.json').map(json.loads)
    print(b.take(4))
I get the following error:
JSONDecodeError: Expecting ',' delimiter
I checked the sample above in an online validator and it passes. So, I guess it's not an error in JSON, but something on Dask and how I should read the file.
It seems like dask is passing each line of your file separately to json.loads. I've removed the line breaks from your sample and was able to load the data.
Of course, doing that for your entire file and sending a single JSON object to json.loads would defeat the purpose of using dask.
One possible solution (though I'm not sure how scalable it is for very large files) is to use jq to convert your JSON file into JSON Lines -- writing each element of the root JSON array onto its own line in a file:
jq -c '.[]' your.json > your.jsonl
Then you can load it with dask:
import dask.bag as db
import json
json_file = "your.jsonl"
b = db.read_text(json_file).map(json.loads)
print(b.take(4))
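If you'd rather stay in Python instead of shelling out to jq, a streaming parser can do the same conversion without loading the whole array. This is a sketch assuming the third-party ijson package, with the same filenames as above:
import json
import ijson  # third-party streaming JSON parser
with open('dblp.v12.json', 'rb') as src, open('your.jsonl', 'w') as dst:
    # 'item' selects each element of the top-level JSON array, one at a time
    for record in ijson.items(src, 'item'):
        # default=float handles the Decimal values ijson produces for numbers
        dst.write(json.dumps(record, default=float) + '\n')
Either way, the point is the same: one JSON document per line is what db.read_text(...).map(json.loads) expects.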

Printing top few lines of a large JSON file in Python

I have a JSON file whose size is about 5 GB. I know neither how the JSON file is structured nor the names of its roots. I'm not able to load the file on my local machine because of its size, so I'll be working on high-performance servers.
I need to load the file in Python and print the first N lines to understand the structure and proceed further with data extraction. Is there a way to load and print the first few lines of a JSON file in Python?
If you want to do it in Python, you can do this:
N = 3
with open("data.json") as f:
    for i in range(N):
        print(f.readline(), end='')
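One caveat worth hedging: very large machine-generated JSON files are often a single long line, in which case readline() would try to return the whole file at once. Reading a fixed number of characters is a safer peek in that situation:
# Peek at the first couple of thousand characters instead of whole lines.
with open("data.json") as f:
    print(f.read(2000))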
You can also use the head command to display the first N lines of the file, to get a sample of the JSON and see how it is structured, and then work on your data extraction from that sample.
Best regards

How to build a IMS open source corpus workbench and NLTK readable corpus?

Currently I have a bunch of .txt files. Within each .txt file, sentences are separated by newlines. How do I convert them to the IMS CWB format so they are readable by CWB, and also to an NLTK-readable format?
Can someone point me to a how-to page for this? Is there a guide? I've tried reading through the manual but I don't really understand it: www.cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf
Does it mean I create a data and registry directory and then run the cwb-encode command, and everything will be converted to a .vrt file? Does it convert one file at a time? How do I script it to run through multiple files in a directory?
It's easy to produce CWB's "verticalized" (.vrt) format from an NLTK-readable corpus:
from nltk.corpus import brown
with open('corpus.vrt', 'w') as out:
    for sentence in brown.sents():
        print('<s>', file=out)
        for word in sentence:
            print(word, file=out)
        print('</s>', file=out)
From there, you can follow the instructions on the CWB website.
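Since your input is plain .txt files with one sentence per line rather than an NLTK corpus object, the same verticalization can be scripted over a whole directory. A minimal sketch, assuming a hypothetical my_corpus/ directory and naive whitespace tokenization (you could swap in nltk.word_tokenize):
import glob
with open('corpus.vrt', 'w') as out:
    for path in sorted(glob.glob('my_corpus/*.txt')):  # hypothetical input directory
        for line in open(path, encoding='utf-8'):
            tokens = line.split()  # naive whitespace tokenization
            if not tokens:
                continue
            print('<s>', file=out)
            for token in tokens:
                print(token, file=out)
            print('</s>', file=out)
The resulting corpus.vrt can then be registered and encoded with cwb-encode as described in the CWB Encoding Tutorial linked in the question.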
