I have a text file that I want to convert to JSON:
red|2022-09-29|03:15:00|info 1
blue|2022-09-29|10:50:00|
yellow|2022-09-29|07:15:00|info 2
So I wrote a script to convert this file into JSON:
import json
filename = 'input_file.txt'
dict1 = {}
fields = ['name', 'date', 'time', 'info']
with open(filename) as fh:
    l = 1
    for line in fh:
        description = list(line.strip().split("|", 4))
        print(description)
        sno = 'name' + str(l)
        i = 0
        dict2 = {}
        while i < len(fields):
            dict2[fields[i]] = description[i]
            i = i + 1
        dict1[sno] = dict2
        l = l + 1
out_file = open("json_file.json", "w")
json.dump(dict1, out_file, indent = 4)
out_file.close()
and output looks like this:
{
"name1": {
"name": "red",
"date": "2022-09-29",
"time": "03:15:00",
"info": "info 1"
},
"name2": {
"name": "blue",
"date": "2022-09-29",
"time": "10:50:00",
"info": ""
},
"name3": {
"name": "yellow",
"date": "2022-09-29",
"time": "07:15:00",
"info": "info 2"
}
}
As you can see, this works, but now I want to change the layout of the JSON file. How can I make the output look like this:
[
{"name":"red", "date": "2022-09-29", "time": "03:15:00", "info":"info 1"},
{"name":"blue", "date": "2022-09-29", "time": "10:50:00", "info":""},
{"name":"yellow", "date": "2022-09-29", "time": "07:15:00", "info":"info 2"}
]
The JSON output you want is a list, not a dict like the one you build now. So collecting the records in a list (data) instead of a dict (dict1) should give the correct output.
The following updated code should generate the JSON data in the required format:
import json
filename = 'input_file.txt'
data = []
fields = ['name', 'date', 'time', 'info']
with open(filename) as fh:
    for line in fh:
        description = line.strip().split("|", 4)
        print(description)
        # Build one dict per line, pairing each field name with its value
        dict2 = {}
        i = 0
        while i < len(fields):
            dict2[fields[i]] = description[i]
            i = i + 1
        data.append(dict2)
# Dump the list (not a dict) so the top-level JSON value is an array
with open("json_file.json", "w") as out_file:
    json.dump(data, out_file, indent=4)
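The index-based loop over the field names can also be written with zip. The following is a minimal sketch rather than a definitive implementation: it assumes every non-blank line has exactly four |-separated fields, and the helper name lines_to_records is made up for illustration.

```python
import json

FIELDS = ["name", "date", "time", "info"]

def lines_to_records(lines, fields=FIELDS):
    """Turn '|'-separated lines into a list of dicts (hypothetical helper)."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split into at most len(fields) parts so extra '|' stay in the last field
        values = line.split("|", len(fields) - 1)
        records.append(dict(zip(fields, values)))
    return records

# Sample lines taken from the question; a real run would pass an open file instead
sample = [
    "red|2022-09-29|03:15:00|info 1\n",
    "blue|2022-09-29|10:50:00|\n",
    "yellow|2022-09-29|07:15:00|info 2\n",
]
records = lines_to_records(sample)
print(json.dumps(records, indent=4))
```

Because an open file is itself an iterable of lines, the same function could be called as lines_to_records(fh) inside the with block.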
I would use pandas; it lets you solve the problem in one statement and avoids reinventing the wheel:
import pandas as pd
pd.read_table("input_file.txt", sep="|", header=None,
names=["name", "date" , "time", "info"]).fillna("")\
.to_json("json_file.json", orient="records")
I have a dictionary in which each value is a list.
I want to write the individual items to separate JSON files.
For example
data_to_write = {"Names":["name1", "name2", "name3"], "email":["mail1", "mail2", "mail3"]}
Now I want 3 JSON files, i.e. data1.json, data2.json, data3.json, in the following (approximate) format.
data1.json
{
Name: name1,
email: mail1
}
data2.json
{
Name: name2,
email: mail2
}
and so on.
My current approach is
for file_no in range(no_of_files):
    for count, (key, info_list) in enumerate(data_to_write.items()):
        for info in info_list:
            with open(
                os.path.join(self.path_to_output_dir, str(file_no)) + ".json",
                "a",
            ) as resume:
                json.dump({key: info}, resume)
But this is wrong. Any help is appreciated.
You could use pandas to do the work for you. Read the dictionary into a dataframe, then iterate the rows of the dataframe to produce the json for each row:
import pandas as pd
data_to_write = {"Names":["name1", "name2", "name3"], "email":["mail1", "mail2", "mail3"]}
df = pd.DataFrame(data_to_write).rename(columns={'Names':'Name'})
for i in range(len(df)):
    jstr = df.iloc[i].to_json()
    with open(f"data{i+1}.json", "w") as f:
        f.write(jstr)
Output (each line is in a separate file):
{"Name":"name1","email":"mail1"}
{"Name":"name2","email":"mail2"}
{"Name":"name3","email":"mail3"}
Try:
import json
data_to_write = {
"Names": ["name1", "name2", "name3"],
"email": ["mail1", "mail2", "mail3"],
}
for i, val in enumerate(zip(*data_to_write.values()), 1):
    d = dict(zip(data_to_write, val))
    with open(f"data{i}.json", "w") as f_out:
        json.dump(d, f_out, indent=4)
This writes data(1..3).json with content:
# data1.json
{
"Names": "name1",
"email": "mail1"
}
# data2.json
{
"Names": "name2",
"email": "mail2"
}
...
import json
data_to_write = {
"Names": ["name1", "name2", "name3"],
"email": ["mail1", "mail2", "mail3"],
}
# Start counting at 1 so the files are named data1.json, data2.json, ...
for ind, val in enumerate(zip(*data_to_write.values()), 1):
    jsn = dict(zip(data_to_write, val))
    print(jsn)
    with open("data{}.json".format(ind), "w") as f:
        f.write(json.dumps(jsn))
I'm converting several JSON files to CSV using the code below. It works as intended, but it converts all of the data in the JSON file. Instead, I want it to do the following:
Load JSON file [done]
Extract certain nested data in the JSON file [wip]
Convert to CSV [done]
Current Code
import json, pandas
from flatten_json import flatten
# Enter the path to the JSON and the filename without appending '.json'
file_path = r'C:\Path\To\file_name'
# Open and load the JSON file
dic = json.load(open(file_path + '.json', 'r', encoding='utf-8', errors='ignore'))
# Flatten and convert to a data frame
dic_flattened = (flatten(d, '.') for d in dic)
df = pandas.DataFrame(dic_flattened)
# Export to CSV in the same directory with the original file name
export_csv = df.to_csv (file_path + r'.csv', sep=',', encoding='utf-8', index=None, header=True)
In the example at the bottom, I only want everything under the following keys: created, emails, and identities. The rest is useless information (such as statusCode) or it's duplicated under a different key name (such as profile and userInfo).
I know it requires a for loop and if statement to specify the key names later on, but not sure the best way to implement it. This is what I have so far when I want to test it:
Attempted Code
import json, pandas
from flatten_json import flatten
# Enter the path to the JSON and the filename without appending '.json'
file_path = r'C:\Path\To\file_name'
# Open and load the JSON file
json_file = open(file_path + '.json', 'r', encoding='utf-8', errors='ignore')
dic = json.load(json_file)
# List keys to extract
key_list = ['created', 'emails', 'identities']
for d in dic:
#print(d['identities']) #Print all 'identities'
#if 'identities' in d: #Check if 'identities' exists
if key_list in d:
# Flatten and convert to a data frame
#dic_flattened = (flatten(d, '.') for d in dic)
#df = pandas.DataFrame(dic_flattened)
else:
# Skip
# Export to CSV in the same directory with the original file name
#export_csv = df.to_csv (file_path + r'.csv', sep=',', encoding='utf-8', index=None, header=True)
Is this the right logic?
file_name.json Example
[
{
"callId": "abc123",
"errorCode": 0,
"apiVersion": 2,
"statusCode": 200,
"statusReason": "OK",
"time": "2020-12-14T12:00:32.744Z",
"registeredTimestamp": 1417731582000,
"UID": "_guid_abc123==",
"created": "2014-12-04T22:19:42.894Z",
"createdTimestamp": 1417731582000,
"data": {},
"preferences": {},
"emails": {
"verified": [],
"unverified": []
},
"identities": [
{
"provider": "facebook",
"providerUID": "123",
"allowsLogin": true,
"isLoginIdentity": true,
"isExpiredSession": true,
"lastUpdated": "2014-12-04T22:26:37.002Z",
"lastUpdatedTimestamp": 1417731997002,
"oldestDataUpdated": "2014-12-04T22:26:37.002Z",
"oldestDataUpdatedTimestamp": 1417731997002,
"firstName": "John",
"lastName": "Doe",
"nickname": "John Doe",
"profileURL": "https://www.facebook.com/John.Doe",
"age": 30,
"birthDay": 31,
"birthMonth": 12,
"birthYear": 1969,
"city": "City, State",
"education": [
{
"school": "High School Name",
"schoolType": "High School",
"degree": null,
"startYear": 0,
"fieldOfStudy": null,
"endYear": 0
}
],
"educationLevel": "High School",
"followersCount": 0,
"gender": "m",
"hometown": "City, State",
"languages": "English",
"locale": "en_US",
"name": "John Doe",
"photoURL": "https://graph.facebook.com/123/picture?type=large",
"timezone": "-8",
"thumbnailURL": "https://graph.facebook.com/123/picture?type=square",
"username": "john.doe",
"verified": "true",
"work": [
{
"companyID": null,
"isCurrent": null,
"endDate": null,
"company": "Company Name",
"industry": null,
"title": "Company Title",
"companySize": null,
"startDate": "2010-12-31T00:00:00"
}
]
}
],
"isActive": true,
"isLockedOut": false,
"isRegistered": true,
"isVerified": false,
"lastLogin": "2014-12-04T22:26:33.002Z",
"lastLoginTimestamp": 1417731993000,
"lastUpdated": "2014-12-04T22:19:42.769Z",
"lastUpdatedTimestamp": 1417731582769,
"loginProvider": "facebook",
"loginIDs": {
"emails": [],
"unverifiedEmails": []
},
"rbaPolicy": {
"riskPolicyLocked": false
},
"oldestDataUpdated": "2014-12-04T22:19:42.894Z",
"oldestDataUpdatedTimestamp": 1417731582894,
"registered": "2014-12-04T22:19:42.956Z",
"regSource": "",
"socialProviders": "facebook"
}
]
As mentioned by juanpa.arrivillaga, I simply need to add the following line after the key_list:
json_list = [{k:d[k] for k in key_list} for d in json_list]
This is the full working code:
import json, pandas
from flatten_json import flatten
# Enter the path to the JSON and the filename without appending '.json'
file_path = r'C:\Path\To\file_name'
# Open and load the JSON file
json_list = json.load(open(file_path + '.json', 'r', encoding='utf-8', errors='ignore'))
# Extract data from the defined key names
key_list = ['created', 'emails', 'identities']
json_list = [{k:d[k] for k in key_list} for d in json_list]
# Flatten and convert to a data frame
json_list_flattened = (flatten(d, '.') for d in json_list)
df = pandas.DataFrame(json_list_flattened)
# Export to CSV in the same directory with the original file name
export_csv = df.to_csv (file_path + r'.csv', sep=',', encoding='utf-8', index=None, header=True)
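Two details about the key filtering may be worth spelling out. The test `key_list in d` from the attempted code raises a TypeError, because a list is unhashable and cannot be looked up as a dict key. The comprehension `{k: d[k] for k in key_list}` raises a KeyError if any record lacks one of the keys. The sketch below is one tolerant variant, not the accepted method: the helper name pick_keys and the sample records are assumptions made for illustration.

```python
key_list = ["created", "emails", "identities"]

def pick_keys(record, keys):
    """Return a copy of record restricted to the keys that are present (hypothetical helper)."""
    return {k: record[k] for k in keys if k in record}

# Made-up records shaped like the example file; the second one lacks two keys
records = [
    {"statusCode": 200, "created": "2014-12-04T22:19:42.894Z",
     "emails": {"verified": [], "unverified": []}, "identities": []},
    {"statusCode": 200, "created": "2015-01-01T00:00:00.000Z"},
]
filtered = [pick_keys(r, key_list) for r in records]
print(filtered)
```

The filtered list can then be passed to flatten and pandas.DataFrame exactly as in the full code above.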
Hope you are doing fine,
I have a data file (containing thousands of blocks in a structured pattern), like below:
PARTNER="ABC"
ADDRESS1="ABC Country INN"
DEPARTMENT="ABC Department"
CONTACT_PERSON="HR"
TELEPHONE="+91.90.XX XX X XXX"
FAX="+01.XX.XX XX XX XX"
EMAIL=""
PARTNER="DEF"
ADDRESS1="DEF Malaysia"
DEPARTMENT=""
CONTACT_PERSON=""
TELEPHONE="(YYY)YYYYY"
FAX="(001)YYYYYYYY"
EMAIL=""
PARTNER="GEH-LOP"
ADDRESS1="GEH LOP Street"
DEPARTMENT="HR"
CONTACT_PERSON="Adam"
TELEPHONE="+91.ZZ.ZZ.ZZZZ"
FAX="+91.ZZ.ZZ.ZZZ"
EMAIL=""
I tried to convert the data file (partner.txt) to JSON with the code below:
Created empty dictionaries dict1 and dict2
Read the data file line by line
Used if not line.isspace() so that every non-blank line is written to dictionary dict1
When a blank line appears, appended the content of dict1 to dict2 using dict2.update(dict1)
import json
dict1 = {}
dict2 = {}
with open("partner.txt", "r") as fh:
    out_file = open("test1.json", "w")
    for line in fh:
        if not line.isspace():
            command, description = line.strip().split("=")
            dict1[command] = description.strip('"')
        else:
            dict2.update(dict1)
            print("space found")
    json.dump(dict2,out_file,indent=1)
    out_file.close()
    print("json file created")
But this code creates a JSON file (test1.json) with only a single PARTNER block:
{
"PARTNER": "DEF",
"ADDRESS1": "DEF Malaysia",
"DEPARTMENT": "",
"CONTACT_PERSON": "",
"TELEPHONE": "(YYY)YYYYY",
"FAX": "(001)YYYYYYYY",
"EMAIL": ""
}
Expected Output
I searched a lot but couldn't find a way to produce it:
{
"data":[
{
"PARTNER": "ABC",
"ADDRESS1": "ABC Country INN",
"DEPARTMENT": "ABC Department",
"CONTACT_PERSON": "HR",
"TELEPHONE": "+91.90.XX XX X XXX",
"FAX": "+01.XX.XX XX XX XX",
"EMAIL": ""
},
{
"PARTNER": "DEF",
"ADDRESS1": "DEF Malaysia",
"DEPARTMENT": "",
"CONTACT_PERSON": "",
"TELEPHONE": "(YYY)YYYYY",
"FAX": "(001)YYYYYYYY",
"EMAIL": ""
},
{
"PARTNER": "GEH-LOP",
"ADDRESS1": "GEH LOP Street",
"DEPARTMENT": "HR",
"CONTACT_PERSON": "Adam",
"TELEPHONE": "+91.ZZ.ZZ.ZZZZ",
"FAX": "+91.ZZ.ZZ.ZZZ",
"EMAIL": ""
}
]
}
You need to set dict1 to a new dict each time:
import json
dict1 = {}
dict2 = {}
with open("partner.txt", "r") as fh:
    out_file = open("test1.json", "w")
    for line in fh:
        if not line.isspace():
            command, description = line.strip().split("=")
            dict1[command] = description.strip('"')
        else:
            dict2.update(dict1)
            dict1 = {} # set it to new dict
            print("space found")
    json.dump(dict2,out_file,indent=1)
    out_file.close()
    print("json file created")
You need to append the dict to a list of dictionaries, not use update, as it overwrites the keys that are always the same:
import json
dict1 = {}
data = []
with open("partner.txt", "r") as fh:
    for line in fh:
        if not line.isspace():
            # Split on the first '=' only, in case a value contains '='
            command, description = line.strip().split("=", 1)
            dict1[command] = description.strip('"')
        else:
            data.append(dict1)
            dict1 = {}  # set it to new dict
            print("space found")
if dict1:  # the last block may not be followed by a blank line
    data.append(dict1)
output = {'data': data}
with open("test1.json", "w") as out_file:
    json.dump(output, out_file, indent=1)
print("json file created")
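The list-of-dicts approach can be isolated into a function that is easy to test without touching the file system. This is a minimal sketch under a few assumptions: blocks are separated by blank lines, every other line has the form KEY="value", and the name parse_blocks and the shortened sample are made up for illustration.

```python
import json

def parse_blocks(lines):
    """Group KEY="value" lines into one dict per blank-line-separated block."""
    blocks, current = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            if current:  # a blank line closes the current block
                blocks.append(current)
                current = {}
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        current[key] = value.strip('"')
    if current:  # the last block may lack a trailing blank line
        blocks.append(current)
    return blocks

# Shortened, made-up sample in the same format as partner.txt
sample = '''PARTNER="ABC"
EMAIL=""

PARTNER="DEF"
EMAIL="a=b@example.com"
'''.splitlines()
output = {"data": parse_blocks(sample)}
print(json.dumps(output, indent=1))
```

Checking `if current` before appending also keeps runs of several blank lines from producing empty dicts in the result.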
There are many ways to do this. Maybe we should make it maintainable:
def list_to_dict(lines):
    obj = {}
    for liner in lines:
        idx = liner.find("=")
        obj[liner[0:idx]] = liner[idx + 2 : len(liner) - 1]
    return obj
with open("file", "r") as f:
    results = []
    group = []
    for line in list(map(lambda x: x.strip(), f.read().split("\n"))):
        if line == "":
            results.append(list_to_dict(group))
            group = []
        else:
            group.append(line)
    print(results)
Solution
Using regex + json + dict/list-comprehension
You can do this using the re (regular expression) and json libraries together. The text processing is carried out with re, and the json library then serializes the dictionary to JSON, which is written to a .json file.
Additionally we use dict and list comprehensions to gather the intended fields.
Note:
The regex pattern used here is as follows:
# longer manually written version
pat = r'PARTNER="(.*)"\n\s*ADDRESS1="(.*)"\n\s*DEPARTMENT="(.*)"\n\s*CONTACT_PERSON="(.*)"\n\s*TELEPHONE="(.*)"\n\s*FAX="(.*)"\n\s*EMAIL="(.*)"'
# shorter equivalent automated version (raw string, so \s is not treated as a string escape)
pat = r'="(.*)"\n\s*'.join(field_labels) + '="(.*)"'
Code
import re
import json
# Read from file or use the dummy data
with open("partner.txt", "r") as f:
    s = f.read()
field_labels = [
    'PARTNER',
    'ADDRESS1',
    'DEPARTMENT',
    'CONTACT_PERSON',
    'TELEPHONE',
    'FAX',
    'EMAIL'
]
# Define regex pattern (raw string, so \s is not treated as a string escape) and compile for speed
pat = r'="(.*)"\n\s*'.join(field_labels) + '="(.*)"'
pat = re.compile(pat)
# Extract target fields
data = pat.findall(s)
# Prepare a list of dicts: each dict for a single block of data
d = [dict(zip(field_labels, field_values)) for field_values in data]
text = json.dumps({'data': d}, indent=2)
print(text)
# Write to a json file
with open('output.json', 'w') as f:
    f.write(text)
Output:
# output.json
{
"data": [
{
"PARTNER": "ABC",
"ADDRESS1": "ABC Country INN",
"DEPARTMENT": "ABC Department",
"CONTACT_PERSON": "HR",
"TELEPHONE": "+91.90.XX XX X XXX",
"FAX": "+01.XX.XX XX XX XX",
"EMAIL": ""
},
{
"PARTNER": "DEF",
"ADDRESS1": "DEF Malaysia",
"DEPARTMENT": "",
"CONTACT_PERSON": "",
"TELEPHONE": "(YYY)YYYYY",
"FAX": "(001)YYYYYYYY",
"EMAIL": ""
},
{
"PARTNER": "GEH-LOP",
"ADDRESS1": "GEH LOP Street",
"DEPARTMENT": "HR",
"CONTACT_PERSON": "Adam",
"TELEPHONE": "+91.ZZ.ZZ.ZZZZ",
"FAX": "+91.ZZ.ZZ.ZZZ",
"EMAIL": ""
}
]
}
Dummy Data
# Dummy Data
s = """
PARTNER="ABC"
ADDRESS1="ABC Country INN"
DEPARTMENT="ABC Department"
CONTACT_PERSON="HR"
TELEPHONE="+91.90.XX XX X XXX"
FAX="+01.XX.XX XX XX XX"
EMAIL=""
PARTNER="DEF"
ADDRESS1="DEF Malaysia"
DEPARTMENT=""
CONTACT_PERSON=""
TELEPHONE="(YYY)YYYYY"
FAX="(001)YYYYYYYY"
EMAIL=""
PARTNER="GEH-LOP"
ADDRESS1="GEH LOP Street"
DEPARTMENT="HR"
CONTACT_PERSON="Adam"
TELEPHONE="+91.ZZ.ZZ.ZZZZ"
FAX="+91.ZZ.ZZ.ZZZ"
EMAIL=""
"""
I have written code to convert a CSV file to a nested JSON format. I have multiple columns to nest, so I assign each one separately. The problem is that I'm getting two fields for the same column in the JSON output.
import csv
import json
from collections import OrderedDict
csv_file = 'data.csv'
json_file = csv_file + '.json'
def main(input_file):
    csv_rows = []
    with open(input_file, 'r') as csvfile:
        reader = csv.DictReader(csvfile, delimiter='|')
        for row in reader:
            row['TYPE'] = 'REVIEW',  # adding new key, value
            row['RAWID'] = 1,
            row['CUSTOMER'] = {
                "ID": row['CUSTOMER_ID'],
                "NAME": row['CUSTOMER_NAME']
            }
            row['CATEGORY'] = {
                "ID": row['CATEGORY_ID'],
                "NAME": row['CATEGORY']
            }
            del (row["CUSTOMER_NAME"], row["CATEGORY_ID"],
                 row["CATEGORY"], row["CUSTOMER_ID"])  # deleting since fields occurring twice
            csv_rows.append(row)
    with open(json_file, 'w') as f:
        json.dump(csv_rows, f, sort_keys=True, indent=4, ensure_ascii=False)
        f.write('\n')
The output is as below:
[
{
"CATEGORY": {
"ID": "1",
"NAME": "Consumers"
},
"CATEGORY_ID": "1",
"CUSTOMER_ID": "41",
"CUSTOMER": {
"ID": "41",
"NAME": "SA Port"
},
"CUSTOMER_NAME": "SA Port",
"RAWID": [
1
]
}
]
I'm getting 2 entries for the fields I have assigned using row[''].
Is there any other way to get rid of this? I want only one entry for a particular field in each record.
Also, how can I convert the keys to lower case after reading from csv.DictReader()? In my CSV file all the column names are in upper case, so I use the same names when assigning. But I want to convert all of them to lower case.
In order to convert the keys to lower case, it is simpler to build a new dict per row. That alone should also get rid of the duplicate fields:
for row in reader:
    orow = OrderedDict()  # imported above with: from collections import OrderedDict
    # No trailing comma here: 'REVIEW', would create a one-element tuple
    orow['type'] = 'REVIEW'  # adding new key, value
    orow['rawid'] = 1
    orow['customer'] = {
        "id": row['CUSTOMER_ID'],
        "name": row['CUSTOMER_NAME']
    }
    orow['category'] = {
        "id": row['CATEGORY_ID'],
        "name": row['CATEGORY']
    }
    csv_rows.append(orow)
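If the goal is to lower-case every column name rather than only the nested ones, a dict comprehension over each row is one option. The sketch below is an illustration under assumptions: the column names are taken from the question's code, the one-row sample is made up, and io.StringIO stands in for the real data.csv.

```python
import csv
import io
import json

# In-memory stand-in for data.csv (made-up sample row)
sample = io.StringIO(
    "CUSTOMER_ID|CUSTOMER_NAME|CATEGORY_ID|CATEGORY\n"
    "41|SA Port|1|Consumers\n"
)

rows = []
for row in csv.DictReader(sample, delimiter="|"):
    low = {k.lower(): v for k, v in row.items()}  # lower-case every column name
    rows.append({
        "type": "REVIEW",
        "rawid": 1,
        "customer": {"id": low["customer_id"], "name": low["customer_name"]},
        "category": {"id": low["category_id"], "name": low["category"]},
    })

print(json.dumps(rows, indent=4, sort_keys=True))
```

Building a fresh dict per row means the flat source columns never reach the output, so nothing needs to be deleted afterwards.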
I have multiple documents that together are approximately 400 GB and I want to convert them to json format in order to drop to elasticsearch for analysis.
Each file is approximately 200 MB.
Original file looked like:
IUGJHHGF#BERLIN:lhfrjy
0t7yfudf#WARSAW:qweokm246
0t7yfudf#CRACOW:Er747474
0t7yfudf#cracow:kui666666
000t7yf#Vienna:1йй2ц2й2цй2цц3у
It contains characters that are not only English. key1 is always separated by #, while the city is separated by either ; or :
I parsed it with this code:
#!/usr/bin/env python
# coding: utf8
import json
with open('2') as f:
    for line in f:
        s1 = line.find("#")
        rest = line[s1+1:]
        if rest.find(";") != -1:
            if rest.find(":") != -1:
                print("FOUND BOTH : ; ")
                s2 = -0
            else:
                s2 = s1+1+rest.find(";")
        elif rest.find(":") != -1:
            s2 = s1+1+rest.find(":")
        else:
            print("FOUND NO : ; ")
            s2 = -0
        key1 = line[:s1]
        city = line[s1+1:s2]
        description = line[s2+1:len(line)-1]
The whole file then looks like:
RRS12345 Cracow Sunflowers
RRD12345 Berin Data
After that parsing, I want the output to be:
{
"location_data":[
{
"key1":"RRS12345",
"city":"Cracow",
"description":"Sunflowers"
},
{
"key1":"RRD123dsd45",
"city":"Berlin",
"description":"Data"
},
{
"key1":"RRD123dsds45",
"city":"Berlin",
"description":"1йй2ц2й2цй2цц3у"
}
]
}
How can I quickly convert it to the required JSON format, given that the data contains non-English characters?
import json
def process_text_to_json():
    location_data = []
    with open("file.txt") as f:
        for line in f:
            line = line.split()
            location_data.append({"key1": line[0], "city": line[1], "description": line[2]})
    location_data = {"location_data": location_data}
    return json.dumps(location_data)
Output sample:
{"location_data": [{"city": "Cracow", "key1": "RRS12345", "description": "Sunflowers"}, {"city": "Berin", "key1": "RRD12345", "description": "Data"}, {"city": "Cracow2", "key1": "RRS12346", "description": "Sunflowers"}, {"city": "Berin2", "key1": "RRD12346", "description": "Data"}, {"city": "Cracow3", "key1": "RRS12346", "description": "Sunflowers"}, {"city": "Berin3", "key1": "RRD12346", "description": "Data"}]}
Iterate over each line and form your dict.
Ex:
d = {"location_data":[]}
with open(filename, "r") as infile:
    for line in infile:
        val = line.split()
        d["location_data"].append({"key1": val[0], "city": val[1], "description": val[2]})
print(d)
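For the non-English characters, json.dumps escapes everything outside ASCII by default; passing ensure_ascii=False keeps the original characters in the output. The sketch below also parses the original key1#CITY:description layout directly with a regular expression instead of the intermediate whitespace-separated file. It is an illustration under assumptions: the pattern treats the first ; or : after # as the city separator, and the names LINE_RE and parse_line are made up.

```python
import json
import re

# key1, then '#', then the city up to the first ';' or ':', then the description
LINE_RE = re.compile(r"^([^#]*)#([^;:]*)[;:](.*)$")

def parse_line(line):
    """Return a dict for one 'key1#CITY:description' line, or None if it does not match."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    key1, city, description = m.groups()
    return {"key1": key1, "city": city, "description": description}

# Sample lines from the question plus one malformed line
sample = [
    "IUGJHHGF#BERLIN:lhfrjy\n",
    "000t7yf#Vienna:1йй2ц2й2цй2цц3у\n",
    "no separator here\n",
]
records = [r for r in map(parse_line, sample) if r is not None]
text = json.dumps({"location_data": records}, ensure_ascii=False, indent=2)
print(text)
```

When reading and writing real files in Python 3, opening them with encoding="utf-8" avoids depending on the platform default. For files of around 200 MB each, processing one file at a time keeps only that file's records in memory.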