Printing top few lines of a large JSON file in Python - python

I have a JSON file of about 5 GB. I know neither how the file is structured nor the names of its root keys. I can't load the file on my local machine because of its size, so I'll be working on high-performance compute servers.
I need to load the file in Python and print the first N lines to understand the structure before proceeding with data extraction. Is there a way to load and print only the first few lines of a JSON file in Python?

If you want to do it in Python, you can do this:
N = 3
with open("data.json") as f:
    for i in range(0, N):
        print(f.readline(), end='')
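One caveat with readline(): if the whole JSON document is stored on a single line, the first readline() call tries to read all 5 GB. A minimal sketch that bounds the read by character count instead is shown below; the helper name peek_chars and the default of 1000 characters are illustrative choices, not part of the original answer.

```python
def peek_chars(path, n_chars=1000, encoding="utf-8"):
    """Return at most n_chars characters from the start of the file."""
    with open(path, encoding=encoding) as f:
        # read(n) stops after n characters, with or without line breaks
        return f.read(n_chars)
```

Calling print(peek_chars("data.json", 2000)) shows the opening of the document, which is usually enough to see the top-level keys and nesting.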

You can also use the head command to display the first N lines of the file and get a sample of the JSON that shows how it is structured.
Then use this sample to plan your data extraction.
Best regards

Related

Is there any feasible solution to read WOT battle results .dat files?

I am new here and am trying to answer a question about World of Tanks. I heard that the data for every battle is stored on the client's disk in the Wargaming.net folder, and I want to run a batch analysis of our clan's battle performance.
These .dat files are said to be a kind of JSON file, so I tried to read one with a few lines of Python, but it failed.
import json
f = open('ex.dat', 'r', encoding='unicode_escape')
content = f.read()
a = json.loads(content)
print(type(a))
print(a)
f.close()
The code is very simple, and it fails. Could anyone tell me what is actually going on with these files?
Added on Feb. 9th, 2022
After trying another piece of code in a Jupyter Notebook, it seems that something can be read from the .dat files:
import struct
import numpy as np
import matplotlib.pyplot as plt
import io
with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    fbuff = io.BufferedReader(f)
    N = len(fbuff.read())
    print('byte length: ', N)
with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
    data = struct.unpack('b'*N, f.read(1*N))
The result is a tuple, but I have no idea how to deal with it now.
Here's how you can parse some parts of it.
import pickle
import zlib
file = '4402905758116487.dat'
# When converting pickled items from Python 2 to Python 3, use the "bytes" or "latin1" encoding.
with open(file, 'rb') as cache_file:  # the with block closes the file after loading
    legacyBattleResultVersion, brAllDataRaw = pickle.load(cache_file, encoding='bytes', errors='ignore')
arenaUniqueID, brAccount, brVehicleRaw, brOtherDataRaw = brAllDataRaw
# The data stored inside the pickled file will be a compressed pickle again.
vehicle_data = pickle.loads(zlib.decompress(brVehicleRaw), encoding='latin1')
account_data = pickle.loads(zlib.decompress(brAccount), encoding='latin1')
brCommon, brPlayersInfo, brPlayersVehicle, brPlayersResult = pickle.loads(zlib.decompress(brOtherDataRaw), encoding='latin1')
# Lastly you can print all of these and see a lot of data inside.
The response contains a mixture of more binary files as well as some data captured from the replays.
This is not a complete solution but it's a decent start to parsing these files.
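The steps above can be wrapped in a small helper. This is only a sketch that assumes the layout described in this answer, that is, an outer pickle holding (version, (arenaUniqueID, account blob, vehicle blob, other blob)) where each blob is a zlib-compressed pickle; the function names and the dictionary keys are made up for illustration. Note that unpickling can execute arbitrary code, so only open files that come from your own game client.

```python
import pickle
import zlib


def unpack_blob(blob):
    """Decompress a zlib blob and unpickle it (latin1 for Python 2 strings)."""
    return pickle.loads(zlib.decompress(blob), encoding="latin1")


def load_battle_result(path):
    """Read one .dat file, assuming the layout described in the answer."""
    with open(path, "rb") as cache_file:
        # encoding="bytes" keeps the compressed blobs as bytes objects
        version, all_data = pickle.load(cache_file, encoding="bytes")
    arena_unique_id, account_raw, vehicle_raw, other_raw = all_data
    return {
        "version": version,
        "arena_unique_id": arena_unique_id,
        "account": unpack_blob(account_raw),
        "vehicle": unpack_blob(vehicle_raw),
        "other": unpack_blob(other_raw),
    }
```

Returning a dictionary keeps the three decoded parts together, so a batch job can loop over many .dat files and collect the results in a list.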
First, you can look at the replay file itself in a text editor, although it won't show the code at the beginning of the file that has to be stripped out. After that comes a large amount of information that you have to read in and work through; it holds the stats for each player in the game. Only then comes the part that covers the actual replay, which you don't need.
You can grab the player IDs and tank IDs from the WoT developer API if you want.
After loading the pickle files as gabzo described, you will see that the result is simply a list of values; without knowing what each value refers to, it's hard to make sense of it. The identifiers for the values can be extracted from your game installation:
import zipfile
WOT_PKG_PATH = "Your/Game/Path/res/packages/scripts.pkg"
BATTLE_RESULTS_PATH = "scripts/common/battle_results/"
archive = zipfile.ZipFile(WOT_PKG_PATH, 'r')
for file in archive.namelist():
    if file.startswith(BATTLE_RESULTS_PATH):
        archive.extract(file)
You can then decompile the Python files (with uncompyle6) and go through the code to find the identifiers for the values.
One thing to note is that the list of values for the main pickle objects (like brAccount from gabzo's code) always has a checksum as the first value. You can use this to check whether you have the right order and the correct identifiers for the values. The way these checksums are generated can be seen in the decompiled Python files.
I have been tackling this problem for some time (albeit in Rust): https://github.com/dacite/wot-battle-results-parser/tree/main/datfile_parser.
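Once the identifiers are known, pairing them with a decoded value list is straightforward. The sketch below only assumes what is stated above, namely that the first value is a checksum and the rest are fields in a fixed order. The helper name and any field names passed to it are hypothetical, and the checksum is returned as-is rather than verified, because how it is generated is defined in the decompiled game scripts.

```python
def label_values(names, values):
    """Pair field names with a value list whose first item is a checksum."""
    checksum, fields = values[0], values[1:]
    if len(names) != len(fields):
        # A length mismatch usually means the identifier list is wrong
        raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
    return checksum, dict(zip(names, fields))
```

The length check is a cheap first sanity test before comparing the checksum itself.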

Get different strings from a file and write a .txt

I'm trying to copy data from lines of a text file (.log) into a .txt document.
I need the same fields written to my .txt file every time, but the line itself sometimes differs. From what I have seen online, this is usually done with a pattern that anticipates how the line is built.
1525:22Player 11 spawned with userinfo: \team\b\forcepowers\0-5-030310001013001131\ip\46.98.134.211:24806\rate\25000\snaps\40\cg_predictItems\1\char_color_blue\34\char_color_green\34\char_color_red\34\color1\65507\color2\14942463\color3\2949375\color4\2949375\handicap\100\jp\0\model\desann/default\name\Faybell\pbindicator\1\saber1\saber_malgus_broken\saber2\none\sex\male\ja_guid\420D990471FC7EB6B3EEA94045F739B7\teamoverlay\1
The line I'm working with usually looks like this. The data I'm trying to collect are:
\ip\0.0.0.0
\name\NickName_of_the_player
\ja_guid\420D990471FC7EB6B3EEA94045F739B7
I want to write these values to a .txt file. Here is my current code.
As explained above, I'm unsure which keywords to search for on Google, or what this technique is called (since the string isn't always the same).
I have been looking around a lot, and most of my tests have let me do some things, but I'm not yet able to do what is described above, so I'm hoping for guidance here :) (Sorry if this is a beginner question. I understand a lot of how it works, but I never learned programming in school; I mostly write small scripts that usually work fine, and this one is much harder.)
def readLog(filename):
    with open(filename,'r') as eventLog:
        data = eventLog.read()
    dataList = data.splitlines()
    return dataList
eventLog = readLog('games.log')
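The backslashes need care only when a line is pasted into Python source: use a raw string literal (the r prefix) so that sequences such as \t and \b are not treated as escapes. Lines read from disk with open(filename,'r') already contain literal backslashes. To use your example, I ran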
text_input = r"1525:22Player 11 spawned with userinfo: \team\b\forcepowers\0-5-030310001013001131\ip\46.98.134.211:24806\rate\25000\snaps\40\cg_predictItems\1\char_color_blue\34\char_color_green\34\char_color_red\34\color1\65507\color2\14942463\color3\2949375\color4\2949375\handicap\100\jp\0\model\desann/default\name\Faybell\pbindicator\1\saber1\saber_malgus_broken\saber2\none\sex\male\ja_guid\420D990471FC7EB6B3EEA94045F739B7\teamoverlay\1"
text_as_array = text_input.split('\\')
You'll need to know which columns contain the strings you care about. For example,
with open('output.dat','w') as fil:
    fil.write(text_as_array[6])
You can figure these array positions from the sample string
>>> text_as_array[6]
'46.98.134.211:24806'
>>> text_as_array[34]
'Faybell'
>>> text_as_array[44]
'420D990471FC7EB6B3EEA94045F739B7'
If the column positions are not consistent but the key-value pairs are always adjacent, we can leverage that
>>> text_as_array.index("ip")
5
>>> text_as_array[text_as_array.index("ip")+1]
'46.98.134.211:24806'
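Because the keys and values alternate, the whole userinfo block can also be turned into a dictionary, which avoids repeated index() calls. The following is a sketch under that assumption; the function names, the tab-separated output format, and the "userinfo:" filter are illustrative choices rather than part of the original answer.

```python
def parse_userinfo(line):
    """Turn the backslash-delimited userinfo part of a log line into a dict."""
    # Everything before the first backslash is the log prefix, not a field.
    parts = line.rstrip("\n").split("\\")[1:]
    # Fields alternate key, value, key, value, ...
    return dict(zip(parts[0::2], parts[1::2]))


def extract_fields(lines, wanted=("ip", "name", "ja_guid")):
    """Yield one tab-separated string per line that has a userinfo block."""
    for line in lines:
        if "userinfo:" not in line:
            continue
        info = parse_userinfo(line)
        # Missing keys become empty strings instead of raising KeyError
        yield "\t".join(info.get(key, "") for key in wanted)
```

To produce the .txt file, open games.log for reading and the output file for writing, then write each string yielded by extract_fields followed by a newline.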

How to downsample .json file

I apologize if this is a very beginner-ish question. But I have a multivariate data set from reddit ( https://files.pushshift.io/reddit/submissions/), but the files are way too big. Is it possible to downsample one of these files down to 20% or less, and either save it as a new file (json or csv) or directly read it as a pandas dataframe? Any help will be very appreciated!
Here is my attempt thus far
import json
import pandas as pd
def load_json_df(filename, num_bytes = -1):
    '''Load the first `num_bytes` of the filename as a json blob, convert each line into a row in a Pandas data frame.'''
    fs = open(filename, encoding='utf-8')
    df = pd.DataFrame([json.loads(x) for x in fs.readlines(num_bytes)])
    fs.close()
    return df
january_df = load_json_df('RS_2019-01.json')
january_df.sample(frac=0.2)
However this gave me a memory error while trying to open it. Is there a way to downsample it without having to open the entire file?
The problem is that you cannot tell exactly what 20% of the data is without first reading through the whole file to find its length.
Reading a large file into memory all at once is what usually raises this error. You can avoid it by reading the file line by line with the code below:
import json
data = []
counter = 0
with open('file') as f:
    for line in f:
        data.append(json.loads(line))
        counter +=1
You should then be able to do this
df = pd.DataFrame([x for x in data]) #you can set a range here with counter/5 if you want to get 20%
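If an approximate 20% is acceptable, the file length does not need to be known in advance: each line can be kept independently with probability 0.2 while streaming. This is a different technique from the counter-based range mentioned above, and the sketch assumes the file holds one JSON object per line; the function name and the seed parameter are illustrative.

```python
import json
import random


def sample_jsonl(path, frac=0.2, seed=0):
    """Parse roughly `frac` of the lines in a JSON-lines file."""
    rng = random.Random(seed)
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Decide before parsing, so skipped lines cost almost nothing.
            if rng.random() < frac:
                records.append(json.loads(line))
    return records
```

The returned list can be passed straight to pd.DataFrame, and only the sampled records are ever held in memory.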
I downloaded the first of the files, i.e. https://files.pushshift.io/reddit/submissions/RS_2011-01.bz2,
decompressed it, and looked at the contents. As it happens, it is not a single JSON document but JSON Lines: a series of JSON objects, one per line (see http://jsonlines.org/ ). This means you can just cut out as many lines as you want, using any tool you want (for example, a text editor). Or you can process the file sequentially in your Python script, taking only every fifth line, like this:
import json
with open('RS_2019-01.json', 'r') as infile:
    for i, line in enumerate(infile):
        if i % 5 == 0:
            j = json.loads(line)
            # process the data here
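To save the downsampled data as a new file, the same loop can copy the selected lines instead of parsing them. A minimal sketch, assuming one JSON object per line as described above; the function name and the destination file are hypothetical.

```python
def downsample_jsonl(src, dst, every=5):
    """Copy every `every`-th line of src to dst and return how many were kept."""
    kept = 0
    with open(src, encoding="utf-8") as infile, \
         open(dst, "w", encoding="utf-8") as outfile:
        for i, line in enumerate(infile):
            if i % every == 0:
                # Lines are copied verbatim, so the output is still JSON Lines
                outfile.write(line)
                kept += 1
    return kept
```

The smaller file can then be loaded with pd.read_json(dst, lines=True), which reads one record per line.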

Not getting the full output out of a list

Objective
I'm trying to extract the GPS latitude and longitude from a batch of JPGs. That part works, but when I write the coordinates to a text file, only one set of coordinates is written, whereas the console output shows coordinates for every image. Here is an example: Console Output, and here is my text file, which is supposed to mirror the console output: Text file
I don't fully understand what the problem is and why it won't write all of them instead of just one. I believe the file is being overwritten somehow, or the GPSPhoto module is causing issues.
Code
from glob import glob
from GPSPhoto import gpsphoto
# Scan jpg's that are located in the same directory.
data = glob("*.jpg")
# Scan contents of images and GPS values.
for x in data:
    data = gpsphoto.getGPSData(x)
    data = [data.get("Latitude"), data.get("Longitude")]
    print("\nsource: {}".format(x), "\n ↪ {}".format(data))
# Write coordinates to a text file.
with open('output.txt', 'w') as f:
    print('Coordinates:', data, file=f)
I have tried pretty much everything that I can think of including: changing the write permissions, not using glob, no loops, loops, lists, no lists, different ways to write to the file, etc.
Any help is appreciated because I am completely lost at this point. Thank you.
You're replacing the data variable each time through the loop, not appending to a list.
all_coords = []
for x in data:
    data = gpsphoto.getGPSData(x)
    all_coords.append([data.get("Latitude"), data.get("Longitude")])
with open('output.txt', 'w') as f:
    print('Coordinates:', all_coords, file=f)
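If the text file should mirror the console output with one image per line, the collected values can be written in a loop after the file is opened once. This is a sketch that assumes each collected row is a (source, latitude, longitude) tuple, for example built in the loop with rows.append((x, info.get("Latitude"), info.get("Longitude"))); the function name and the line format are illustrative.

```python
def write_coords(rows, path):
    """Write (source, latitude, longitude) rows, one image per line."""
    # Opening with "w" once, outside any per-image loop, avoids overwriting
    with open(path, "w", encoding="utf-8") as f:
        for source, lat, lon in rows:
            print(f"{source}: {lat}, {lon}", file=f)
```

Images without GPS data show up as None values, which makes them easy to spot in the output.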

convert eye-tracking .edf file to ASC/CSV format

I have a recording of eye-tracking data in .edf format (SR Research EyeLink). I want to convert it to ASC/CSV format in Python. I have the GUI application, but I want to do it programmatically (in Python).
I found the package pyEDFlib but couldn't find an example of how to convert an eye-tracking .edf file to .asc or .csv.
What would be the best way to do it?
Thanks
If I trust the page here: http://pyedflib.readthedocs.io/en/latest, you can run through all the signals in the file this way:
import pyedflib
import numpy as np
f = pyedflib.EdfReader("data/test_generator.edf")
n = f.signals_in_file
signal_labels = f.getSignalLabels()
sigbufs = np.zeros((n, f.getNSamples()[0]))
for i in np.arange(n):
    sigbufs[i, :] = f.readSignal(i)
The pyEDFlib library simply reads the file into an EdfReader object.
Then you just need to go through the signals and make a row for each.
I assume that signal_labels (in the code above) is a list of all the labels, so make a comma-separated string out of them:
signal_labels_row = ",".join(signal_labels)
Then do the same for each signal, one comma-separated string per signal.
Then simply write them to a file.
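The writing step could look like the sketch below, which uses the standard csv module instead of joining strings by hand. It swaps the layout described above for one column per signal (a header row of labels, then one row per sample) and assumes all signals have the same length, as the sigbufs array in the code above does; the function name is hypothetical.

```python
import csv


def write_signals_csv(labels, signals, path):
    """Write equal-length signals as CSV columns under a header row of labels."""
    if len(labels) != len(signals):
        raise ValueError("need exactly one label per signal")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(labels)
        # zip(*signals) turns per-signal sequences into per-sample rows;
        # it stops at the shortest signal if lengths differ.
        writer.writerows(zip(*signals))
```

With the variables from the reading code, the call would be write_signals_csv(signal_labels, sigbufs, "signals.csv"), where the output file name is only an example.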
I can see they provide an example of how to read a file and extract all the data you need here
https://github.com/holgern/pyedflib/blob/master/demo/readEDFFile.py
Based on your answers, I have created this Python 3 script to export all signals to multiple .csv files: https://github.com/folkien/pyEdfToCsv
