How to convert jsonl to txt file? - python

I have a jsonl file (10.5 MB) and want to convert it to a single .txt file. My plan is to first split the jsonl into individual json files (about 12k of them), then convert each json file into txt, and finally merge all the txt files into one txt file.
The .jsonl file is in my Google Drive. After mounting the drive, this is my code for converting jsonl to json:
import glob
# Path of jsonl file
File_path = input("/content/drive/MyDrive/mednli_source/mli_train_v1.jsonl")  # Path of directory containing jsonl files
# Path to destination folder for created json files
dest_path = input("/content/drive/MyDrive/mednli_source/train_convert")  # Destination folder

files = [f for f in glob.glob(File_path + "**/*.jsonl", recursive=True)]
for f in files:
    with open(f, 'rb') as F:
        i = 1
        for row in F:
            # saving every line as a new json file
            with open(dest_path + "/file-" + str(i) + ".json", 'wb') as f:
                print(f)
                f.write(row)
            i += 1
The notebook output suggested the file was converted to json, but I can't find the saved files. I was expecting those .json files to be saved in the dest_path directory folder.
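Two things are worth checking in the code above: input(...) only displays its argument as a prompt and waits for typed input, so File_path and dest_path are set to whatever was typed, not to the literal paths; and the glob pattern appends **/*.jsonl to what is already a file path, so files may well be empty and the loop never runs. Beyond that, the intermediate .json files are probably unnecessary: a .jsonl file is just one JSON object per line, so it can be streamed straight into a single .txt file. A minimal sketch, assuming each line should become one line of text in the output (the output path and formatting are assumptions to adapt):

import json

src = "/content/drive/MyDrive/mednli_source/mli_train_v1.jsonl"
dst = "/content/drive/MyDrive/mednli_source/mli_train_v1.txt"  # hypothetical output path

with open(src, "r", encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # parse the JSON object on this line
        fout.write(json.dumps(record) + "\n")  # write one record per line

Adjust what gets written per record (for example, only specific fields) to whatever the .txt file is supposed to contain.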

Related

How to read pairwise csv and json files having same names inside a folder using python?

Consider my folder structure, which has files in this fashion:
abc.csv
abc.json
bcd.csv
bcd.json
efg.csv
efg.json
and so on, i.e. pairs of csv and json files with the same names. I have to read each same-named pair, do some operation, and proceed to the next pair of files. How do I go about this?
Basically, what I have in mind as pseudocode is:
for files in folder_name:
    df_csv = pd.read_csv('abc.csv')
    df_json = pd.read_json('abc.json')
    # some script to execute
    # now read the next pair and repeat for all files
Did you think of something like this?
import os

# collect the filenames in the folder
filelist = os.listdir()

# iterate through the filenames
for file in filelist:
    # pair each .csv file with the .json file of the same name
    if file.endswith(".csv"):
        with open(file) as csv_file:
            pre, ext = os.path.splitext(file)
            secondfile = pre + ".json"
            with open(secondfile) as json_file:
                # do something
                ...
You can use the glob module to extract the file names matching a pattern:
import glob
import os.path

for csvfile in glob.iglob('*.csv'):
    jsonfile = csvfile[:-3] + 'json'
    # optionally check that the paired file exists
    if not os.path.exists(jsonfile):
        # show an error message
        ...
        continue
    # do smth. with csvfile
    # do smth. else with jsonfile
    # and proceed to the next pair
If the directory structure is consistent, you could do the following:
import os
import pandas as pd

dir_path = './path/to/dir'
for f_name in {x.split('.')[0] for x in os.listdir(dir_path)}:
    df_csv = pd.read_csv(os.path.join(dir_path, f"{f_name}.csv"))
    df_json = pd.read_json(os.path.join(dir_path, f"{f_name}.json"))
    # execute the rest
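A pathlib-based variant of the same pairing idea avoids slicing filenames by hand; a sketch, assuming every .csv has a matching .json next to it:

from pathlib import Path
import pandas as pd

folder = Path('./path/to/dir')
for csv_path in sorted(folder.glob('*.csv')):
    json_path = csv_path.with_suffix('.json')  # same name, .json extension
    if not json_path.exists():
        continue  # skip unpaired files
    df_csv = pd.read_csv(csv_path)
    df_json = pd.read_json(json_path)
    # do something with the pair

with_suffix also handles names containing extra dots correctly, which the split('.')[0] approach would not.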

How to open and load a JSON file inside a .zip archive?

So I have a Python script that I run in a directory containing .zip files, and these zip files all have JSON files which I want to load. With my current method I get a 'No such file or directory' error when I try open(filename). I assume this is because namelist() doesn't actually enter the .zip archive. How can I load the JSON file inside a zip archive once I confirm that it is indeed a .zip archive?
import zipfile
import json
import os

myList = []
files = [f for f in os.listdir('.') if os.path.isfile(f)]
for f in files:
    if f.endswith('.zip'):
        with zipfile.ZipFile(f) as myzip:
            for filename in myzip.namelist():
                if filename.endswith('.json'):
                    g = open(filename)
                    data = json.load(g)
                    # do stuff with g and myList
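The error happens because namelist() only returns the names of members inside the archive, while the plain open() looks for those names on the filesystem, where they don't exist. ZipFile.open() reads a member directly from the archive without extracting it; a minimal sketch of the fix (appending the parsed objects to myList is an assumption about the intended "do stuff"):

import zipfile
import json
import os

myList = []
files = [f for f in os.listdir('.') if os.path.isfile(f)]
for f in files:
    if f.endswith('.zip'):
        with zipfile.ZipFile(f) as myzip:
            for filename in myzip.namelist():
                if filename.endswith('.json'):
                    # open the member inside the archive, not on disk
                    with myzip.open(filename) as g:
                        data = json.load(g)
                        myList.append(data)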

How to open multiple pickle files in a folder

I have multiple pickle files with the same format in one folder called pickle_files:
1_my_work.pkl
2_my_work.pkl
...
125_my_work.pkl
How would I go about loading those files into the workspace, without having to do it one file at a time?
Thank you!
Loop over the files and save the data in a structure, for example a dictionary:
# Imports
import pickle
import os

# Folder containing your files
directory = 'C://folder'

# Create an empty dictionary to hold the data
data = {}

# Loop over the files and read each pickle
for file in os.listdir(directory):
    if file.endswith('.pkl'):
        with open(os.path.join(directory, file), 'rb') as f:
            data[file.split('.')[0]] = pickle.load(f)

# Now you can print 1_my_work
print(data['1_my_work'])
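Note that os.listdir() returns names in arbitrary order, and a plain sort would put 10_my_work.pkl before 2_my_work.pkl. If the files need to be processed in numeric order, you can sort on the leading number; a sketch assuming the N_my_work.pkl naming shown above:

import os
import pickle

directory = 'C://folder'
pkl_files = [f for f in os.listdir(directory) if f.endswith('.pkl')]

# sort on the integer before the first underscore: 1, 2, ..., 125
for file in sorted(pkl_files, key=lambda name: int(name.split('_')[0])):
    with open(os.path.join(directory, file), 'rb') as f:
        work = pickle.load(f)
        # process each file's contents in order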

How to read json files in subfolders?

I have a path like '/mnt/extract'. Inside this extract folder I have the 3 subfolders below:
subfolder1
subfolder2
subfolder3 (it has one .json file inside it)
The json in subfolder3 looks like this -
{
    "x": "/mnt/extract/p",
    "y": "/mnt/extract/r"
}
I want to read the above json file from subfolder3 and concatenate the value /mnt/extract/p for the key 'x' with the string 'data', so that the final path becomes '/mnt/extract/p/data', which is where I want to export some data. I tried the approach below, but it's not working.
import os

for root, dirs, files in os.walk(path):
    for name in files:
        print(os.path.join(root, name))
Using the built-in glob module, you can read files in folders and subfolders.
Try this:
import glob
files = glob.glob('/mnt/extract/**/*.json', recursive=True)
The files list will contain paths to all json files in the extract directory.
Try this:
import glob
import json

final_paths = []
extract_path = '/mnt/extract'
files = glob.glob(extract_path + '/**/*.json', recursive=True)
for file in files:
    with open(file, 'r') as f:
        json_file = json.load(f)
        output_path = json_file['x'] + '/' + 'data'
        final_paths.append(output_path)
The final_paths list will contain the output paths built from all json files in the folder structure.
import glob
import json

extract_path = '/mnt/extract'
files = glob.glob(extract_path + '/**/*.json', recursive=True)
if len(files) != 0:
    with open(files[0], 'r') as f:
        data = json.load(f)  # avoid shadowing the built-in name `dict`
        final_output_path = data['x'] + '/' + 'data'
In the code above, glob returns a list whose only element is the JSON file's path. To make sure we pass a file path to open() rather than the list itself, I took files[0], which picks the json file out of the list so it can be parsed easily. If anyone has a suggestion for handling the list returned by glob more cleanly, feel free to answer.

Extracting the extracted with python

I have a zip file containing thousands of mixed .xml and .csv files. I used the following to extract the zip file:
import zipfile
zip = zipfile.ZipFile(r'c:\my.zip')
zip.extractall(r'c:\output')
Now I need to extract the thousands of individual zip files contained in the 'c:\output' folder. I am planning on concatenating just the .csv files into one file. Thank you for the help!
Try this code:
import zipfile
import os

zip = zipfile.ZipFile(r'c:/my.zip')
zip.extractall(r'c:/output')

filelist = []
for name in zip.namelist():
    filelist.append(name)
zip.close()

for i in filelist:
    newzip = zipfile.ZipFile(r'c:/output/' + str(i))
    for file in newzip.namelist():
        if '.csv' in file:
            newzip.extract(file, r'c:/output/')
    newzip.close()
    os.remove(r'c:/output/' + str(i))
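For the final step the question mentions, concatenating the extracted .csv files into one file, a minimal sketch (the combined.csv name is an assumption, and it assumes the .csv files land directly in c:/output rather than in subdirectories):

import glob

with open(r'c:/output/combined.csv', 'w') as outfile:
    for csv_path in glob.glob(r'c:/output/*.csv'):
        with open(csv_path) as infile:
            outfile.write(infile.read())

If each file begins with a header row, you may want to skip that row for every file after the first.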
