I have multiple documents that together are approximately 400 GB, and I want to convert them to JSON format so I can load them into Elasticsearch for analysis.
Each file is approximately 200 MB.
Original file looked like:
IUGJHHGF#BERLIN:lhfrjy
0t7yfudf#WARSAW:qweokm246
0t7yfudf#CRACOW:Er747474
0t7yfudf#cracow:kui666666
000t7yf#Vienna:1йй2ц2й2цй2цц3у
The data contains characters that are not only English. key1 is always separated by #, while the city is separated by either ; or :.
I parsed it with this code:
#!/usr/bin/env python
# coding: utf8
import json

with open('2') as f:
    for line in f:
        s1 = line.find("#")
        rest = line[s1+1:]
        if rest.find(";") != -1:
            if rest.find(":") != -1:
                print "FOUND BOTH : ; "
                s2 = -0
            else:
                s2 = s1+1+rest.find(";")
        elif rest.find(":") != -1:
            s2 = s1+1+rest.find(":")
        else:
            print "FOUND NO : ; "
            s2 = -0
        key1 = line[:s1]
        city = line[s1+1:s2]
        description = line[s2+1:len(line)-1]
After parsing, the file contents look like:
RRS12345 Cracow Sunflowers
RRD12345 Berin Data
From that parsed data I want to produce this output:
{
"location_data":[
{
"key1":"RRS12345",
"city":"Cracow",
"description":"Sunflowers"
},
{
"key1":"RRD123dsd45",
"city":"Berlin",
"description":"Data"
},
{
"key1":"RRD123dsds45",
"city":"Berlin",
"description":"1йй2ц2й2цй2цц3у"
}
]
}
How can I convert it to the required JSON format quickly, given that the data contains non-English characters?
import json

def process_text_to_json():
    location_data = []
    with open("file.txt") as f:
        for line in f:
            # split into at most 3 parts so multi-word descriptions survive
            parts = line.split(None, 2)
            location_data.append({"key1": parts[0], "city": parts[1], "description": parts[2]})
    location_data = {"location_data": location_data}
    return json.dumps(location_data)
Output sample:
{"location_data": [{"city": "Cracow", "key1": "RRS12345", "description": "Sunflowers"}, {"city": "Berin", "key1": "RRD12345", "description": "Data"}, {"city": "Cracow2", "key1": "RRS12346", "description": "Sunflowers"}, {"city": "Berin2", "key1": "RRD12346", "description": "Data"}, {"city": "Cracow3", "key1": "RRS12346", "description": "Sunflowers"}, {"city": "Berin3", "key1": "RRD12346", "description": "Data"}]}
Iterate over each line and form your dict.
Ex:
d = {"location_data": []}
with open(filename, "r") as infile:
    for line in infile:
        val = line.split()
        d["location_data"].append({"key1": val[0], "city": val[1], "description": val[2]})
print(d)
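For the original `key1#city:description` format (rather than the whitespace-separated sample), a minimal sketch along these lines might work. It assumes each line contains exactly one `#` and that the first `:` or `;` after it separates the city from the description; `ensure_ascii=False` keeps the non-English characters readable in the output:

```python
import json
import re

def parse_line(line):
    # 'KEY#CITY:DESC' or 'KEY#CITY;DESC' -> dict
    key1, _, rest = line.rstrip("\n").partition("#")
    # city and description are split on the first ':' or ';'
    city, description = re.split(r"[:;]", rest, maxsplit=1)
    return {"key1": key1, "city": city, "description": description}

lines = [
    "IUGJHHGF#BERLIN:lhfrjy",
    "0t7yfudf#WARSAW:qweokm246",
    "000t7yf#Vienna:1йй2ц2й2цй2цц3у",
]
doc = {"location_data": [parse_line(l) for l in lines]}
print(json.dumps(doc, ensure_ascii=False, indent=1))
```

For 400 GB of input you would stream this line by line per file rather than building one giant dict in memory.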
I have text file and I want to convert it to JSON:
red|2022-09-29|03:15:00|info 1
blue|2022-09-29|10:50:00|
yellow|2022-09-29|07:15:00|info 2
so I wrote a script to convert this file to JSON:
import json

filename = 'input_file.txt'
dict1 = {}
fields = ['name', 'date', 'time', 'info']

with open(filename) as fh:
    l = 1
    for line in fh:
        description = list(line.strip().split("|", 4))
        print(description)
        sno = 'name' + str(l)
        i = 0
        dict2 = {}
        while i < len(fields):
            dict2[fields[i]] = description[i]
            i = i + 1
        dict1[sno] = dict2
        l = l + 1

out_file = open("json_file.json", "w")
json.dump(dict1, out_file, indent=4)
out_file.close()
and output looks like this:
{
"name1": {
"name": "red",
"date": "2022-09-29",
"time": "03:15:00",
"info": "info 1"
},
"name2": {
"name": "blue",
"date": "2022-09-29",
"time": "10:50:00",
"info": ""
},
"name3": {
"name": "yellow",
"date": "2022-09-29",
"time": "07:15:00",
"info": "info 2"
}
}
As you can see, that works, but now I want to change the shape of this JSON file. How can I make the output look like this:
[
{"name":"red", "date": "2022-09-29", "time": "03:15:00", "info":"info 1"},
{"name":"blue", "date": "2022-09-29", "time": "10:50:00", "info":""},
{"name":"yellow", "date": "2022-09-29", "time": "07:15:00", "info":"info 2"}
]
Your required JSON output is a list, not a dict like you have right now. So using a list (data) instead of a dict (dict1) should give the correct output.
The following updated code should generate the JSON data in the required format:
import json

filename = 'input_file.txt'
data = []
fields = ['name', 'date', 'time', 'info']

with open(filename) as fh:
    l = 1
    for line in fh:
        description = list(line.strip().split("|", 4))
        print(description)
        sno = 'name' + str(l)
        i = 0
        dict2 = {}
        while i < len(fields):
            dict2[fields[i]] = description[i]
            i = i + 1
        data.append(dict2)
        l = l + 1

out_file = open("json_file.json", "w")
json.dump(data, out_file, indent=4)
out_file.close()
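As an alternative sketch, `csv.DictReader` can do the splitting and the field labelling in one step; here inline dummy data stands in for `input_file.txt`:

```python
import csv
import io
import json

raw = "red|2022-09-29|03:15:00|info 1\nblue|2022-09-29|10:50:00|\n"
fields = ["name", "date", "time", "info"]

# DictReader maps each '|'-separated line onto the field names;
# a trailing empty column becomes an empty string
data = list(csv.DictReader(io.StringIO(raw), fieldnames=fields, delimiter="|"))
print(json.dumps(data, indent=4))
```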
I would use pandas; it lets you solve the problem in one statement and avoid reinventing the wheel:
import pandas as pd

pd.read_table("input_file.txt", sep="|", header=None,
              names=["name", "date", "time", "info"]).fillna("")\
  .to_json("json_file.json", orient="records")
I wrote Python code that converts a file with these objects to JSON. It produces valid JSON, but the output is not exactly what I need.
{
name: (sindey, crosby)
game: "Hockey"
type: athlete
},
{
name: (wayne, gretzky)
game: "Ice Hockey"
type: athlete
}
Code:
import json

f = open("log.file", "r")
content = f.read()
splitcontent = content.splitlines()

d = []
for line in splitcontent:
    appendage = {}
    if ('}' in line) or ('{' in line):
        # Append a just-created record and start a new one
        continue
    d.append(appendage)
    key, val = line.split(':')
    if val.endswith(','):
        # strip a trailing comma
        val = val[:-1]
    appendage[key] = val

with open("json_log.json", 'w') as file:
    file.write(json.dumps(d, indent=4, sort_keys=False))
Desired output:
[
{
"name": "(sindey, crosby)",
"game": "Hockey",
"type": "athlete"
},
{
"name": "(wayne, gretzky)",
"game": "Ice Hockey",
"type": "athlete"
}
]
But I'm getting:
[
{
" name": " (sindey, crosby)"
},
{
" game": " \"Hockey\""
},
{
" type": " athlete"
},
{
" name": " (wayne, gretzky)"
},
{
" game": " \"Ice Hockey\""
},
{
" type": " athlete"
}
]
Any way to fix it to get the desired output and fix the {} around each individual line?
It's usually a good idea to split parsing into simpler tasks, e.g. first parse records, then parse fields.
I'm skipping the file handling and using a text variable:
intxt = """
{
name: (sindey, crosby)
game: "Hockey"
type: athlete
},
{
name: (wayne, gretzky)
game: "Ice Hockey"
type: athlete
}
"""
Then create a function that can yield all lines that are part of a record:
import json

def parse_records(txt):
    reclines = []
    for line in txt.split('\n'):
        if ':' not in line:
            if reclines:
                yield reclines
                reclines = []
        else:
            reclines.append(line)
and a function that takes those lines and parses each key/value pair:
def parse_fields(reclines):
    res = {}
    for line in reclines:
        key, val = line.strip().rstrip(',').split(':', 1)
        res[key.strip()] = val.strip()
    return res
the main function becomes trivial:
res = []
for rec in parse_records(intxt):
    res.append(parse_fields(rec))

print(json.dumps(res, indent=4))
the output, as desired:
[
{
"name": "(sindey, crosby)",
"game": "\"Hockey\"",
"type": "athlete"
},
{
"name": "(wayne, gretzky)",
"game": "\"Ice Hockey\"",
"type": "athlete"
}
]
The parsing functions can of course be made better, but you get the idea.
I hadn't checked the output properly; I've reworked the logic now, and the output is as expected.
import json

f = open("log.file", "r")
content = f.read()
print(content)
splitcontent = content.splitlines()

d = []
for line in splitcontent:
    if "{" in line:
        appendage = {}
    elif "}" in line:
        d.append(appendage)
    else:
        key, val = line.split(':')
        appendage[key.strip()] = val.strip()

with open("json_log.json", 'w') as file:
    file.write(json.dumps(d, indent=4, sort_keys=False))
I have a wrongly-formatted JSON file where I have numbers with leading zeroes.
p = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
arc = json.loads(p)
I get this error.
JSONDecodeError: Expecting ',' delimiter: line 8 column 24 (char 107)
Here's what is on char 107:
print(p[107])
#0
The problem is that this is the data I have. I'm only showing two examples here, but my file has millions of lines to parse, so I need a script. At the end of the day, I need this string:
"""[
{
"name": "Alice",
"RegisterNumber": "911100020001"
},
{
"name": "Bob",
"RegisterNumber": "000111110300"
}
]"""
How can I do it?
Read the file (best line by line) and replace all the offending values with their string representation. You can use regular expressions for that (the re module).
Then save and later parse the valid JSON.
If it fits into memory, you don't need to save the file, of course; just loads the then-valid JSON string.
Here is a simple version:
import json
from re import sub

p = """[
    {
        "name": "Alice",
        "RegisterNumber": 911100020001
    },
    {
        "name": "Bob",
        "RegisterNumber": 000111110300
    }
]"""

p = sub(r"(\d{12})", "\"\\1\"", p)
arc = json.loads(p)
print(arc[1])
This won't be pretty, but you could probably fix it using a regex.
import re
import json

p = "..."
sub = re.sub(r'"RegisterNumber":\W([0-9]+)', r'"RegisterNumber": "\1"', p)
json.loads(sub)
This will match all the cases where you have RegisterNumber followed by numbers.
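Since the file has millions of lines, a streaming variant of the same regex idea might look like this sketch (`fix_line` and the sample line are illustrative; it assumes each value sits on one line after its key):

```python
import re

# Quote the bare digits that follow "RegisterNumber", keeping leading zeros
pat = re.compile(r'("RegisterNumber":\s*)(\d+)')

def fix_line(line):
    # \1 is the key prefix, \2 the digits; wrap the digits in quotes
    return pat.sub(r'\1"\2"', line)

line = '    "RegisterNumber": 000111110300'
print(fix_line(line))
```

Applied with a plain `for line in infile:` loop writing to an output file, this never holds more than one line in memory.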
Since the problem is the leading zeroes, the easy way to fix the data would be to split it into lines and fix any lines that exhibit the problem. It's cheap and nasty, but this seems to work.
import json

data = """[
    {
        "name": "Alice",
        "RegisterNumber": 911100020001
    },
    {
        "name": "Bob",
        "RegisterNumber": 000111110300
    }
]"""

result = []
for line in data.splitlines():
    if ': 0' in line:
        while ": 0" in line:
            line = line.replace(': 0', ': ')
        result.append(line.replace(': ', ': "') + '"')
    else:
        result.append(line)

data = "".join(result)
arc = json.loads(data)
print(arc)
Hope you are doing fine.
I have a data file (containing thousands of records in a structured pattern) like the one below:
PARTNER="ABC"
ADDRESS1="ABC Country INN"
DEPARTMENT="ABC Department"
CONTACT_PERSON="HR"
TELEPHONE="+91.90.XX XX X XXX"
FAX="+01.XX.XX XX XX XX"
EMAIL=""
PARTNER="DEF"
ADDRESS1="DEF Malaysia"
DEPARTMENT=""
CONTACT_PERSON=""
TELEPHONE="(YYY)YYYYY"
FAX="(001)YYYYYYYY"
EMAIL=""
PARTNER="GEH-LOP"
ADDRESS1="GEH LOP Street"
DEPARTMENT="HR"
CONTACT_PERSON="Adam"
TELEPHONE="+91.ZZ.ZZ.ZZZZ"
FAX="+91.ZZ.ZZ.ZZZ"
EMAIL=""
I tried to convert the data file (partner.txt) to JSON with the code below:
Created empty dictionaries dict1 and dict2.
Read the data file line by line.
Used if not line.isspace() to make sure each nonblank line is written into dict1.
When a line break (an empty line) appeared, appended the contents of dict1 to dict2 using dict2.update(dict1).
import json

dict1 = {}
dict2 = {}

with open("partner.txt", "r") as fh:
    out_file = open("test1.json", "w")
    for line in fh:
        if not line.isspace():
            command, description = line.strip().split("=")
            dict1[command] = description.strip('"')
        else:
            dict2.update(dict1)
            print("space found")
    json.dump(dict2, out_file, indent=1)
    out_file.close()
print("json file created")
But this code creates a JSON file (test1.json) containing only a single PARTNER block:
{
"PARTNER": "DEF",
"ADDRESS1": "DEF Malaysia",
"DEPARTMENT": "",
"CONTACT_PERSON": "",
"TELEPHONE": "(YYY)YYYYY",
"FAX": "(001)YYYYYYYY",
"EMAIL": ""
}
Expected Output
I searched a lot but couldn't find a way:
{
"data":[
{
"PARTNER": "ABC",
"ADDRESS1": "ABC Country INN",
"DEPARTMENT": "ABC Department",
"CONTACT_PERSON": "HR",
"TELEPHONE": "+91.90.XX XX X XXX",
"FAX": "+01.XX.XX XX XX XX",
"EMAIL": ""
},
{
"PARTNER": "DEF",
"ADDRESS1": "DEF Malaysia",
"DEPARTMENT": "",
"CONTACT_PERSON": "",
"TELEPHONE": "(YYY)YYYYY",
"FAX": "(001)YYYYYYYY",
"EMAIL": ""
},
{
"PARTNER": "GEH-LOP",
"ADDRESS1": "GEH LOP Street",
"DEPARTMENT": "HR",
"CONTACT_PERSON": "Adam",
"TELEPHONE": "+91.ZZ.ZZ.ZZZZ",
"FAX": "+91.ZZ.ZZ.ZZZ",
"EMAIL": ""
}
]
}
You need to set dict1 to a new dict each time:
import json

dict1 = {}
dict2 = {}

with open("partner.txt", "r") as fh:
    out_file = open("test1.json", "w")
    for line in fh:
        if not line.isspace():
            command, description = line.strip().split("=")
            dict1[command] = description.strip('"')
        else:
            dict2.update(dict1)
            dict1 = {}  # set it to new dict
            print("space found")
    json.dump(dict2, out_file, indent=1)
    out_file.close()
print("json file created")
You need to append the dict to a list of dictionaries, not use update, as it overwrites the keys that are always the same:
import json

dict1 = {}
data = []

with open("partner.txt", "r") as fh:
    out_file = open("test1.json", "w")
    for line in fh:
        if not line.isspace():
            command, description = line.strip().split("=")
            dict1[command] = description.strip('"')
        else:
            data.append(dict1)
            dict1 = {}  # set it to new dict
            print("space found")
    output = {'data': data}
    json.dump(output, out_file, indent=1)
    out_file.close()
print("json file created")
There are many ways to do this; maybe we should make it maintainable:
def list_to_dict(lines):
    obj = {}
    for liner in lines:
        idx = liner.find("=")
        obj[liner[0:idx]] = liner[idx + 2 : len(liner) - 1]
    return obj

with open("file", "r") as f:
    results = []
    group = []
    for line in list(map(lambda x: x.strip(), f.read().split("\n"))):
        if line == "":
            results.append(list_to_dict(group))
            group = []
        else:
            group.append(line)
    print(results)
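Another compact variant of the same grouping idea (a sketch, assuming records are separated by blank lines and every value is wrapped in double quotes) uses `itertools.groupby`:

```python
from itertools import groupby

def blocks_to_dicts(lines):
    # bool("") is False, so groupby splits the runs of nonblank lines on blanks
    for nonblank, group in groupby((l.strip() for l in lines), key=bool):
        if nonblank:
            # each KEY="VALUE" line becomes one dict entry
            yield {k: v.strip('"') for k, v in (l.split("=", 1) for l in group)}

sample = ['PARTNER="ABC"', 'EMAIL=""', '', 'PARTNER="DEF"', 'EMAIL=""']
print(list(blocks_to_dicts(sample)))
```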
Solution
Using regex + json + dict/list-comprehension
You can do this using the regex (regular expression) and json libraries together. The text-processing is carried out with regex and finally the json library is used to format the dictionary into JSON format and write to a .json file.
Additionally we use dict and list comprehensions to gather the intended fields.
Note:
The regex pattern used here is as follows:
# longer manually written version
pat = r'PARTNER="(.*)"\n\s*ADDRESS1="(.*)"\n\s*DEPARTMENT="(.*)"\n\s*CONTACT_PERSON="(.*)"\n\s*TELEPHONE="(.*)"\n\s*FAX="(.*)"\n\s*EMAIL="(.*)"'
# shorter equivalent automated version (raw strings avoid the invalid \s escape)
pat = r'="(.*)"\n\s*'.join(field_labels) + r'="(.*)"'
Code
import re
import json

# Read from file or use the dummy data
with open("partner.txt", "r") as f:
    s = f.read()

field_labels = [
    'PARTNER',
    'ADDRESS1',
    'DEPARTMENT',
    'CONTACT_PERSON',
    'TELEPHONE',
    'FAX',
    'EMAIL'
]

# Define regex pattern and compile for speed
pat = r'="(.*)"\n\s*'.join(field_labels) + r'="(.*)"'
pat = re.compile(pat)

# Extract target fields
data = pat.findall(s)

# Prepare a list of dicts: each dict for a single block of data
d = [dict((k, v) for k, v in zip(field_labels, field_values)) for field_values in data]
text = json.dumps({'data': d}, indent=2)
print(text)

# Write to a json file
with open('output.json', 'w') as f:
    f.write(text)
Output:
# output.json
{
"data": [
{
"PARTNER": "ABC",
"ADDRESS1": "ABC Country INN",
"DEPARTMENT": "ABC Department",
"CONTACT_PERSON": "HR",
"TELEPHONE": "+91.90.XX XX X XXX",
"FAX": "+01.XX.XX XX XX XX",
"EMAIL": ""
},
{
"PARTNER": "DEF",
"ADDRESS1": "DEF Malaysia",
"DEPARTMENT": "",
"CONTACT_PERSON": "",
"TELEPHONE": "(YYY)YYYYY",
"FAX": "(001)YYYYYYYY",
"EMAIL": ""
},
{
"PARTNER": "GEH-LOP",
"ADDRESS1": "GEH LOP Street",
"DEPARTMENT": "HR",
"CONTACT_PERSON": "Adam",
"TELEPHONE": "+91.ZZ.ZZ.ZZZZ",
"FAX": "+91.ZZ.ZZ.ZZZ",
"EMAIL": ""
}
]
}
Dummy Data
# Dummy Data
s = """
PARTNER="ABC"
ADDRESS1="ABC Country INN"
DEPARTMENT="ABC Department"
CONTACT_PERSON="HR"
TELEPHONE="+91.90.XX XX X XXX"
FAX="+01.XX.XX XX XX XX"
EMAIL=""
PARTNER="DEF"
ADDRESS1="DEF Malaysia"
DEPARTMENT=""
CONTACT_PERSON=""
TELEPHONE="(YYY)YYYYY"
FAX="(001)YYYYYYYY"
EMAIL=""
PARTNER="GEH-LOP"
ADDRESS1="GEH LOP Street"
DEPARTMENT="HR"
CONTACT_PERSON="Adam"
TELEPHONE="+91.ZZ.ZZ.ZZZZ"
FAX="+91.ZZ.ZZ.ZZZ"
EMAIL=""
"""
I have written code to convert a CSV file to nested JSON format. I have multiple columns to be nested, hence I assign them separately for each column. The problem is that I'm getting two fields for the same column in the JSON output.
import csv
import json
from collections import OrderedDict

csv_file = 'data.csv'
json_file = csv_file + '.json'

def main(input_file):
    csv_rows = []
    with open(input_file, 'r') as csvfile:
        reader = csv.DictReader(csvfile, delimiter='|')
        for row in reader:
            row['TYPE'] = 'REVIEW',  # adding new key, value
            row['RAWID'] = 1,
            row['CUSTOMER'] = {
                "ID": row['CUSTOMER_ID'],
                "NAME": row['CUSTOMER_NAME']
            }
            row['CATEGORY'] = {
                "ID": row['CATEGORY_ID'],
                "NAME": row['CATEGORY']
            }
            del (row["CUSTOMER_NAME"], row["CATEGORY_ID"],
                 row["CATEGORY"], row["CUSTOMER_ID"])  # deleting since fields occur twice
            csv_rows.append(row)
    with open(json_file, 'w') as f:
        json.dump(csv_rows, f, sort_keys=True, indent=4, ensure_ascii=False)
        f.write('\n')
The output is as below:
[
{
"CATEGORY": {
"ID": "1",
"NAME": "Consumers"
},
"CATEGORY_ID": "1",
"CUSTOMER_ID": "41",
"CUSTOMER": {
"ID": "41",
"NAME": "SA Port"
},
"CUSTOMER_NAME": "SA Port",
"RAWID": [
1
]
}
]
I'm getting two entries for the fields I assigned using row[''].
Is there another way to get rid of this? I want only one entry per field in each record.
Also, how can I convert the keys to lower case after reading with csv.DictReader()? In my CSV file all the columns are upper case, which is why I use them as-is when assigning, but I want them all in lower case.
In order to convert the keys to lower case, it would be simpler to generate a new dict per row. BTW, it should be enough to get rid of the duplicate fields:
for row in reader:
    orow = collections.OrderedDict()
    orow['type'] = 'REVIEW'  # adding new key, value
    orow['rawid'] = 1
    orow['customer'] = {
        "id": row['CUSTOMER_ID'],
        "name": row['CUSTOMER_NAME']
    }
    orow['category'] = {
        "id": row['CATEGORY_ID'],
        "name": row['CATEGORY']
    }
    csv_rows.append(orow)
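Put together as a runnable sketch (the inline CSV is hypothetical dummy data standing in for data.csv), building a fresh lower-case dict per row also sidesteps the trailing-comma pitfall in the original, where `row['TYPE'] = 'REVIEW',` stores a one-element tuple instead of a string:

```python
import csv
import io
import json

raw = "CUSTOMER_ID|CUSTOMER_NAME|CATEGORY_ID|CATEGORY\n41|SA Port|1|Consumers\n"

csv_rows = []
for row in csv.DictReader(io.StringIO(raw), delimiter="|"):
    csv_rows.append({
        "type": "REVIEW",  # no trailing comma, so this stays a plain string
        "rawid": 1,
        "customer": {"id": row["CUSTOMER_ID"], "name": row["CUSTOMER_NAME"]},
        "category": {"id": row["CATEGORY_ID"], "name": row["CATEGORY"]},
    })
print(json.dumps(csv_rows, indent=4))
```

Because each output dict is built from scratch, no upper-case source columns leak through and nothing needs to be deleted afterwards.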