Hi everyone. I need help opening and reading a file.
I've got this txt file - https://yadi.sk/i/1TH7_SYfLss0JQ
It is a dictionary
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}
But it was written to a txt file using json.
#This is how I dump the data into a txt
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
So, the file structure is
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}{"id2":"url2", "id3":"url3", ..., "id4":"url4"}{"id5":"url5", "id6":"url6", ..., "id7":"url7"}
And it is all one string...
I need to open it, check for repeated IDs, delete the duplicates, and save it again.
But json.loads raises ValueError: Extra data.
Tried these:
How to read line-delimited JSON from large file (line by line)
Python json.loads shows ValueError: Extra data
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190)
But I still get that error, just in a different place.
Right now I've got as far as:
with open('111111111.txt', 'r') as log:
    before_log = log.read()

before_log = before_log.replace('}{', ', ').split(', ')
mu_dic = []
for i in before_log:
    mu_dic.append(i)
This eliminates the problem of several {}{}{} dictionaries/JSONs in a row.
Maybe there is a better way to do this?
Your file size is 9.5 MB, so it'll take you a while to open and debug it manually.
Using the head and tail tools (normally found in any GNU/Linux distribution), you'll see that:
# You can use Python as well to read chunks from your file
# and see the nature of it and what is causing the decode problem,
# but I prefer head & tail because they're ready to be used :-D
$> head -c 217 111111111.txt
{"1933252590737725178": "https://instagram.fiev2-1.fna.fbcdn.net/vp/094927bbfd432db6101521c180221485/5CC0EBDD/t51.2885-15/e35/46950935_320097112159700_7380137222718265154_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net",
$> tail -c 219 111111111.txt
, "1752899319051523723": "https://instagram.fiev2-1.fna.fbcdn.net/vp/a3f28e0a82a8772c6c64d4b0f264496a/5CCB7236/t51.2885-15/e35/30084016_2051123655168027_7324093741436764160_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net"}
$> head -c 294879 111111111.txt | tail -c 12
net"}{"19332
So the first guess is that your file is a malformed series of JSON data, and the best guess is to separate each }{ with a \n for further manipulation.
So, here is an example of how you can solve your problem using Python:
import json

input_file = '111111111.txt'
output_file = 'new_file.txt'

data = ''
with open(input_file, mode='r', encoding='utf8') as f_file:
    # this with statement part can be replaced by
    # using sed under your OS like this example:
    # sed -i 's/}{/}\n{/g' 111111111.txt
    data = f_file.read()
    data = data.replace('}{', '}\n{')

seen, total_keys, to_write = set(), 0, {}
# split the lines of the in-memory data
for elm in data.split('\n'):
    # convert the line to a valid Python dict
    converted = json.loads(elm)
    # loop over the keys
    for key, value in converted.items():
        total_keys += 1
        # if the key has not been seen, add it for further manipulation;
        # otherwise ignore it
        if key not in seen:
            seen.add(key)
            to_write.update({key: value})

# write the dict's keys & values into a new file as JSON
with open(output_file, mode='a+', encoding='utf8') as out_file:
    out_file.write(json.dumps(to_write) + '\n')

print(
    'found duplicated key(s): {seen} from {total}'.format(
        seen=total_keys - len(seen),
        total=total_keys
    )
)
Output:
found duplicated key(s): 43836 from 45367
And finally, the output file will be a valid JSON file, with the duplicated keys (and their values) removed.
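If you'd rather not do string surgery on }{ at all (it would misfire if that character pair ever appeared inside a URL value), json.JSONDecoder.raw_decode can walk the concatenated objects directly. A minimal sketch, assuming the objects sit back-to-back in the file as shown above:
import json

decoder = json.JSONDecoder()

def iter_json_objects(text):
    # raw_decode returns (object, end_index), so we can hop
    # from one concatenated object to the next
    pos = 0
    while pos < len(text):
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

merged = {}
with open('111111111.txt', mode='r', encoding='utf8') as f:
    for chunk in iter_json_objects(f.read().strip()):
        for key, value in chunk.items():
            merged.setdefault(key, value)  # first occurrence of an id wins

with open('new_file.txt', mode='w', encoding='utf8') as out:
    json.dump(merged, out)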
The basic difference between the file's structure and actual JSON format is the missing commas between objects, and that the objects are not enclosed within [ and ]. So the same can be achieved with the code snippet below:
import json

with open('json_file.txt') as f:
    # Read the complete file
    a = f.read()

# Convert into a single-line string
b = ''.join(a.splitlines())
# Add , after each object
b = b.replace("}", "},")
# Add opening and closing brackets and drop the trailing comma added in the previous step
b = '[' + b[:-1] + ']'
x = json.loads(b)
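If you also need the repeated IDs removed, a small follow-up sketch (assuming x is the list of dicts produced above; the output file name is just an example):
# Keep the first value seen for each id across all the dicts in x
merged = {}
for d in x:
    for key, value in d.items():
        merged.setdefault(key, value)

with open('deduped.txt', 'w') as out:
    json.dump(merged, out)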
Related
I'm trying to build a translator for subtitles using DeepL, but it isn't running perfectly. I managed to translate the subtitles, but for the most part I'm having problems replacing the lines: I can see that the lines are translated because it prints them, but it doesn't replace them in the file. Whenever I run the program, the output is the same as the original file.
This is the code responsible for it:
def translate(input, output, languagef, languaget):
    file = open(input, 'r').read()
    fileresp = open(output, 'r+')
    subs = list(srt.parse(file))
    for sub in subs:
        try:
            linefromsub = sub.content
            translationSentence = pydeepl.translate(linefromsub, languaget.upper(), languagef.upper())
            print(str(sub.index) + ' ' + translationSentence)
            for line in fileresp.readlines():
                newline = fileresp.write(line.replace(linefromsub, translationSentence))
        except IndexError:
            print("Error parsing data from deepl")
This is how the file looks:
1
00:00:02,470 --> 00:00:04,570
- Yes, I do.
- (laughs)
2
00:00:04,605 --> 00:00:07,906
My mom doesn't want
to babysit everyday
3
00:00:07,942 --> 00:00:09,274
or any day.
4
00:00:09,310 --> 00:00:11,977
But I need
my mom's help sometimes.
5
00:00:12,013 --> 00:00:14,046
She's just gonna
have to be grandma today.
Help will be appreciated :)
Thanks.
You are opening fileresp with r+ mode. When you call readlines(), the file's position will be set to the end of the file. Subsequent calls to write() will then append to the file. If you want to overwrite the original contents as opposed to append, you should try this instead:
allLines = fileresp.readlines()
fileresp.seek(0)     # Set position to the beginning
fileresp.truncate()  # Delete the contents
for line in allLines:
    fileresp.write(...)
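As a self-contained illustration of that read/seek/truncate pattern (the file name and replacement strings here are made up):
# Rewrite a file in place with r+ mode
with open('subs.srt', 'r+') as f:
    all_lines = f.readlines()  # position is now at the end of the file
    f.seek(0)                  # rewind to the beginning
    f.truncate()               # delete the old contents
    for line in all_lines:
        f.write(line.replace('old text', 'new text'))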
Update
It's difficult to see what you're trying to accomplish with r+ mode here, but it seems you have two separate input and output files. If that's the case, consider:
def translate(input, output, languagef, languaget):
    file = open(input, 'r').read()
    fileresp = open(output, 'w')  # Use w mode instead
    subs = list(srt.parse(file))
    for sub in subs:
        try:
            linefromsub = sub.content
            translationSentence = pydeepl.translate(linefromsub, languaget.upper(), languagef.upper())
            print(str(sub.index) + ' ' + translationSentence)
            fileresp.write(translationSentence)  # Write the translated sentence
        except IndexError:
            print("Error parsing data from deepl")
I have a file named sample.txt which looks like this:
ServiceProfile.SharediFCList[1].DefaultHandling=1
ServiceProfile.SharediFCList[1].ServiceInformation=
ServiceProfile.SharediFCList[1].IncludeRegisterRequest=n
ServiceProfile.SharediFCList[1].IncludeRegisterResponse=n
Here my requirement is to remove the brackets and the integer, and then enter OS commands built from the result:
ServiceProfile.SharediFCList.DefaultHandling=1
ServiceProfile.SharediFCList.ServiceInformation=
ServiceProfile.SharediFCList.IncludeRegisterRequest=n
ServiceProfile.SharediFCList.IncludeRegisterResponse=n
I am quite a newbie in Python. This is my first attempt. I have used this code to remove the brackets:
#!/usr/bin/python
import re
import os
import sys

f = os.open("sample.txt", os.O_RDWR)
ret = os.read(f, 10000)
os.close(f)
print ret

var1 = re.sub(r"[\(\[].*?[\)\]]", "", ret)
print var1

f = open("removed.cfg", "w+")
f.write(var1)
f.close()
After this, using the file as input, I want to form application-specific commands that look like this:
cmcli INS "DefaultHandling=1 ServiceInformation="
and the next set as
cmcli INS "IncludeRegisterRequest=n IncludeRegisterRequest=y"
So basically now I want all the output to be bunched into sets of two for me to execute the commands on the operating system.
Is there any way I could bunch them up in sets of two?
Reading 10,000 bytes of text into a string is really not necessary when your file is line-oriented text, and isn't scalable either. And you need a very good reason to be using os.open() instead of open().
So, treat your data as the lines of text that it is, and every two lines, compose a single line of output.
from __future__ import print_function
import re

command = [None, None]
cmd_id = 1
bracket_re = re.compile(r".+\[\d\]\.(.+)")
# This doesn't just remove the brackets: what you actually seem to want is
# to pick out everything after [1]. and ignore the rest.

with open("removed_cfg", "w") as outfile:
    with open("sample.txt") as infile:
        for line in infile:
            m = bracket_re.match(line)
            cmd_id = 1 - cmd_id  # gives 0, 1, 0, 1
            command[cmd_id] = m.group(1)
            if cmd_id == 1:  # we have a pair
                output_line = """cmcli INS "{0} {1}" """.format(*command)
                print(output_line, file=outfile)
This gives the output
cmcli INS "DefaultHandling=1 ServiceInformation="
cmcli INS "IncludeRegisterRequest=n IncludeRegisterResponse=n"
The second line doesn't correspond to your sample output. I don't know how the input IncludeRegisterResponse=n is supposed to become the output IncludeRegisterRequest=y. I assume that's a mistake.
Note that this code depends on your input data being precisely as you describe it and has no error checking whatsoever. So if the format of the input is in reality more variable than that, then you will need to add some validation.
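If you'd rather not toggle cmd_id by hand, the pairing can also be expressed with zip over a single iterator; a sketch under the same assumption that every input line matches the pattern:
import re

bracket_re = re.compile(r".+\[\d\]\.(.+)")

with open("sample.txt") as infile, open("removed_cfg", "w") as outfile:
    fields = (bracket_re.match(line).group(1) for line in infile)
    # zipping an iterator with itself consumes two items per loop
    for first, second in zip(fields, fields):
        outfile.write('cmcli INS "{0} {1}"\n'.format(first, second))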
I was wondering how to concatenate exon/DNA fasta files using Python or R.
Example files:
So far I have really liked using the R ape package for the cbind method, solely because of the fill.with.gaps=TRUE attribute. I really need gaps inserted when a species is missing an exon.
My code:
ex1 <- read.dna("exon1.txt", format="fasta")
ex2 <- read.dna("exon2.txt", format="fasta")
output <- cbind(ex1, ex2, fill.with.gaps=TRUE)
write.dna(output, "Output.txt", format="fasta")
Example:
exon1.txt
>sp1
AAAA
>sp2
CCCC
exon2.txt
>sp1
AGG-G
>sp2
CTGAT
>sp3
CTTTT
Output file:
>sp1
AAAAAGG-G
>sp2
CCCCCTGAT
>sp3
----CTTTT
So far I am having trouble trying to apply this technique when I have multiple exon files (I'm trying to figure out a loop to open and run the cbind method on all files ending in .fa in the directory), and sometimes not all files have exons of identical length - hence DNAbin stops working.
So far I have:
file_list <- list.files(pattern=".fa")
myFunc <- function(x) {
  for (file in file_list) {
    x <- read.dna(file, format="fasta")
    out <- cbind(x, fill.with.gaps=TRUE)
    write.dna(out, "Output.txt", format="fasta")
  }
}
However, when I run this and check my output text file, it misses many exons, and I think that is because not all files have the same exon length... or my script is failing somewhere and I can't figure it out :(
Any ideas? I can also try Python.
If you prefer using Linux one-liners, you have:
cat exon1.txt exon2.txt > outfile
If you want only the unique records from the outfile, use:
awk '/^>/{f=!d[$1];d[$1]=1}f' outfile > sorted_outfile
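The awk filter keeps only the first record seen for each header. If you want the same logic in Python (a rough sketch of the same idea, not part of the original one-liner):
seen = set()
keep = False
with open('outfile') as src, open('sorted_outfile', 'w') as dst:
    for line in src:
        if line.startswith('>'):
            header = line.split()[0]
            keep = header not in seen  # keep only the first occurrence
            seen.add(header)
        if keep:
            dst.write(line)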
I just came out with this answer in Python 3:
def read_fasta(fasta):  # Function that reads the files
    output = {}
    for line in fasta.split("\n"):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            active_sequence_name = line[1:]
            if active_sequence_name not in output:
                output[active_sequence_name] = []
            continue
        sequence = line
        output[active_sequence_name].append(sequence)
    return output

with open("exon1.txt", 'r') as file:  # read exon1.txt
    file1 = read_fasta(file.read())

with open("exon2.txt", 'r') as file:  # read exon2.txt
    file2 = read_fasta(file.read())

finaldict = {}  # concatenate the contents of both files
for i in list(file1.keys()) + list(file2.keys()):
    if i not in file1.keys():
        file1[i] = ["-" * len(file2[i][0])]
    if i not in file2.keys():
        file2[i] = ["-" * len(file1[i][0])]
    finaldict[i] = file1[i] + file2[i]

with open("output.txt", 'w') as file:  # write the result to output.txt
    for k, i in finaldict.items():
        file.write(">{}\n{}\n".format(k, "".join(i)))  # proper formatting
It's pretty hard to comment and explain it completely, and it might not help you, but this is better than nothing :P
I used Łukasz Rogalski's code from the answer to Reading a fasta file format into Python dict.
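Since the question mentions looping over every .fa file in the directory, the same idea generalizes; a sketch assuming the read_fasta helper above, and that all records within one file have equal length:
import glob

# Parse every .fa file with the read_fasta helper defined above
files = []
for path in sorted(glob.glob("*.fa")):
    with open(path) as fh:
        files.append(read_fasta(fh.read()))

all_ids = {name for fasta in files for name in fasta}

finaldict = {name: [] for name in all_ids}
for fasta in files:
    # exon length in this file, taken from any record present in it
    width = len("".join(next(iter(fasta.values()))))
    for name in all_ids:
        finaldict[name] += fasta.get(name, ["-" * width])

with open("output.txt", "w") as out:
    for name, seqs in finaldict.items():
        out.write(">{}\n{}\n".format(name, "".join(seqs)))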
I am fairly new to Python. I have a text file containing many blocks of data in following format along with other unnecessary blocks.
NOT REQUIRED :: 123
Connected Part-1:: A ~$
Connected Part-3:: B ~$
Connector Location:: 100 200 300 ~$
NOT REQUIRED :: 456
Connected Part-2:: C ~$
I wish to extract the info (A, B, C, 100 200 300) corresponding to each property (Connected Part-1, Connector Location) and store it as a list to use later. I have prepared the following code, which reads the file, cleans each line and stores the result as a list.
import fileinput

with open('C:/Users/file.txt') as f:
    content = f.readlines()

for line in content:
    if 'Connected Part-1' in line or 'Connected Part-3' in line:
        if 'Connected Part-1' in line:
            connected_part_1 = [s.strip(' \n ~ $ Connected Part -1 ::') for s in content]
            print('PART_1:', connected_part_1)
        if 'Connected Part-3' in line:
            connected_part_3 = [s.strip(' \n ~ $ Connected Part -3 ::') for s in content]
            print('PART_3:', connected_part_3)
    if 'Connector Location' in line:
        # removing unwanted characters and converting into a list
        content_clean_1 = [s.strip('\n ~ $ Connector Location::') for s in content]
        # converting a single string item in the list to a string
        s = " ".join(content_clean_1)
        # splitting the string and converting into a list
        weld_location = s.split(" ")
        print('POSITION', weld_location)
Here is the output:
PART_1: ['A', '\t\tConnector Location:: 100.00 200.00 300.00', '\t\tConnected Part-3:: C~\t']
POSITION ['d', 'Part-1::', 'A', '\t\tConnector', 'Location::', '100.00', '200.00', '300.00', '\t\tConnected', 'Part-3::', 'C~\t']
PART_3: ['1:: A', '\t\tConnector Location:: 100.00 200.00 300.00', '\t\tConnected Part-3:: C~\t']
From the output of this program, I may conclude that since 'content' is a string consisting of all the characters in the file, the program is not reading an individual line; instead it is treating all the text as a single string. Could anyone please help in this case?
I am expecting following output:
PART_1: ['A']
PART_3: ['C']
POSITION: ['100.00', '200.00','300.00']
(Note: when I am using individual files containing a single line of data, it works fine.) Sorry for such a long question.
I will try to make it clear and show how I would do it without regex. First of all, the biggest issue with the code presented is that when using the str.strip function, the entire content list is being read:
connected_part_1 = [s.strip(' \n ~ $ Connected Part -1 ::') for s in content]
content is the entire file's lines; I think you simply want something like:
connected_part_1 = [line.strip(' \n ~ $ Connected Part -1 ::')]
How to parse the file is a bit subjective, but given the file format posted as input, I would do it like this:
templatestr = "{}: {}"

with open('inputreadlines.txt') as f:
    content = f.readlines()

for line in content:
    label, value = line.split('::')
    ltokens = label.split()
    if ltokens[0] == 'Connected':
        print(templatestr.format(
            ltokens[-1],          # the last word of the label
            value.split()[:-1]))  # the split value without the last word '~$'
    elif ltokens[0] == 'Connector':
        print(value.split()[:-1])  # the split value without the last word '~$'
    else:  # NOT REQUIRED
        pass
You can use the str.strip function to remove the funny characters '~$' instead of removing the last token as in the example.
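For instance, a quick illustration of that variant (the sample value here is hypothetical):
value = ' 100 200 300 ~$\n'
# strip surrounding whitespace, then the trailing '~$' marker, then tokenize
print(value.strip().rstrip('~$').split())  # ['100', '200', '300']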
I am attempting to create a regular expression pattern for strings similar to the below, which are stored in a file. The aim is to get any column for any row; the rows need not be on a single line. For example, consider the following file:
"column1a","column2a","column
3a,", #entity 1
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b", #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c", #entity 3
"column\"this is, a test\"4c"
Each entity consists of four columns. Column 4 for entity 2 would be "column\"this is, a test\"4b"; column 2 for entity 3 would be "column2c". Each column begins and ends with a quote; however, you must be careful because some columns contain escaped quotes. Thanks in advance!
You could do it like this, i.e.:
Read the whole file.
Split the input on newline characters that are not preceded by a comma.
Iterate over the split elements and split again on a comma (plus an optional following newline) that is preceded and followed by a double quote.
Code:
import re

with open(file) as f:
    fil = f.read()
    m = re.split(r'(?<!,)\n', fil.strip())
    for i in m:
        print(re.split('(?<="),\n?(?=")', i))
Output:
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here is the check:
$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
f is the input file name and f.py is the file which contains the Python script.
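To make the two patterns concrete, here is a tiny standalone illustration (made-up data where, as in the file above, a continuation line ends with a comma):
import re

s = '"a","b,",\n"c"\n"d","e"'
# newlines NOT preceded by a comma mark record boundaries
for record in re.split(r'(?<!,)\n', s):
    # commas (plus an optional newline) between quotes separate columns
    print(re.split(r'(?<="),\n?(?=")', record))
# ['"a"', '"b,"', '"c"']
# ['"d"', '"e"']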
Your problem is terribly similar to what I have to deal with thrice every month :) Except I'm not using Python to solve it, but I can 'translate' what I usually do:
import re

text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''

# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column
check = re.compile(r'"(?:[^"\\]*|\\.)*"')

# Read the file line by line
for line in text.split("\n"):
    # If there's no stored partial line, this is a new line
    if buffer == "":
        # Check if we get 4 columns and print; if not, put the line
        # into buffer so we store a partial line for later
        matches = check.findall(line)
        if len(matches) == columns:
            print(matches)
        else:
            # use line.strip() if you need to trim whitespace
            buffer = line
    else:
        # Update the variable (containing a partial line) with the
        # next line and recheck if we get 4 columns
        # use line.strip() if you need to trim whitespace
        buffer = buffer + line
        matches = check.findall(buffer)
        # If we indeed get 4, our line is complete, so print it.
        # We must not forget to empty the buffer now that we have a whole line
        if len(matches) == columns:
            print(matches)
            buffer = ""
        # Optional; always good to have a safety backdoor though.
        # If there is a problem with the csv itself, like a weird unescaped
        # quote, you send it somewhere else
        elif len(matches) > columns:
            print("Error: cannot parse line:\n" + buffer)
            buffer = ""
ideone demo
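The same buffering idea can also be wrapped as a reusable generator (a sketch under the same four-columns-per-record assumption; the file name in the usage note is just an example):
import re

def rows(lines, columns=4):
    check = re.compile(r'"(?:[^"\\]*|\\.)*"')
    buffer = ""
    for line in lines:
        buffer += line.rstrip("\n")
        matches = check.findall(buffer)
        if len(matches) == columns:
            yield matches  # a complete record
            buffer = ""
        elif len(matches) > columns:
            raise ValueError("cannot parse line:\n" + buffer)

# usage:
# for row in rows(open("data.csv")):
#     print(row)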