I am trying to access weather API data. It returns a long, barely human-readable single line. I am trying to replace every opening bracket ({) with '{\n' so that the bracket remains but is followed by a newline character, just to make the JSON more readable.
But it prints every character on a new line in the shell.
import urllib2
url2 = 'http://api.openweathermap.org/data/2.5/find?q=london,PK&units=metric'
data = urllib2.urlopen(url2)
s = data.read()
count = 0
s = s.replace('{',"{\n")
#s = ''.join(s)
for line in s:
    print line
    count = count + 1
print count
After join() the problem still persists: the output is every character on its own line.
Why don't you use the built-in capabilities of the json library that's standard in Python?
import urllib2
import json
url2 = 'http://api.openweathermap.org/data/2.5/find?q=london,PK&units=metric'
data = urllib2.urlopen(url2)
# read the contents in and parse the JSON.
jsonData = json.loads(data.read())
# print it out nicely formatted:
print json.dumps(jsonData, sort_keys=True, indent=4, separators=(',', ': '))
output:
{
"cod": "200",
"count": 1,
"list": [
{
"clouds": {
"all": 20
},
"coord": {
"lat": 38.7994,
"lon": -89.9603
},
"dt": 1442072098,
"id": 4237717,
"main": {
"humidity": 67,
"pressure": 1020,
"temp": 16.82,
"temp_max": 18.89,
"temp_min": 15
},
"name": "Edwardsville",
"sys": {
"country": "United States of America"
},
"weather": [
{
"description": "few clouds",
"icon": "02d",
"id": 801,
"main": "Clouds"
}
],
"wind": {
"deg": 350,
"speed": 4.6
}
}
],
"message": "accurate"
}
The issue is here:
for line in s:
print line
At this point, it will print every character on a separate line - that's what print does (it adds a trailing newline to each print command), as shown by this code:
print 1
print
print 2
which outputs this:
1
2
You may be confused with the name line, but it's not a special variable name. You can change the word line to any valid variable name and it will work the same way.
A for loop will iterate over an iterable. If it's a file, it will do each line. A list will be each element, and a string goes over every character. Because you say to print it, it then prints them individually.
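A quick sketch of the difference (the sample string here is made up):

```python
s = "ab\ncd"

# Iterating over a string yields one character at a time...
chars = [item for item in s]

# ...while splitlines() yields whole lines.
lines = [item for item in s.splitlines()]

print(chars)  # ['a', 'b', '\n', 'c', 'd']
print(lines)  # ['ab', 'cd']
```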
Are you expecting a non-string response from the API? If it gives a list like this:
["calls=10","message=hello"]
then your for loop will print each in turn. But if it's just a string, like "message=hello" it will print each character.
And the reason there is a blank line after each {? Because your replace command is working fine: the newline you inserted is itself one of the characters being printed, each on its own line.
s is just a string, so doing for x in s actually iterates over individual characters of s, not over its lines. I think you're confusing it with for line in f when f is a file object!
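A minimal fix that keeps the original replace idea is to iterate over s.splitlines() instead of s (shown here in Python 3 syntax on a made-up string rather than the live API response):

```python
s = '{"cod": "200", "list": [{"clouds": {"all": 20}}]}'
s = s.replace('{', '{\n')

count = 0
for line in s.splitlines():
    print(line)
    count = count + 1
print(count)
```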
Related
I have a problem. I am reading in a file.
This file contains abbreviations, and I only want to read the abbreviations. That part works; however, the output is not in the desired format. I would like to save the abbreviations cleanly, one per line (see below for the desired output). The problem is that I'm getting something like '\t\\acro{...'. How can I convert this to my desired output?
import sys

def getPrice(symbol,
             shortForm,
             longForm):
    abbreviations = []
    with open("./file.tex", encoding="utf-8") as f:
        file = list(f)
    save = False
    for line in file:
        print("\n" + line)
        if line.startswith(r'\end{acronym}'):
            save = False
        if save:
            abbreviations.append(line)
        if line.startswith(r'\begin{acronym}'):
            save = True
    print(abbreviations)

if __name__ == "__main__":
    getPrice(str(sys.argv[1]),
             str(sys.argv[2]),
             str(sys.argv[3]))
[OUT]
['\t\\acro{knmi}[KNMI]{Koninklijk Nederlands Meteorologisch Instituut}\n', '\t\\acro{test}[TESTERER]{T E SDH SADHU AHENSAD }\n']
\chapter*{Short}
\addcontentsline{toc}{chapter}{Short}
\markboth{Short}{Short}
\begin{acronym}[TESTERER]
\acro{knmi}[KNMI]{Koninklijk Nederlands Meteorologisch Instituut}
\acro{example}[e.g.]{For example}
\end{acronym}
Desired Output
{
"abbreviation1": {
"symbol": "knmi",
"shortForm": "KNMI",
"longForm": "Koninklijk Nederlands Meteorologisch Instituut",
}
"abbreviation2": {
"symbol": "example",
"shortForm": "e.g.",
"longForm": "For example",
}
}
You can use re.findall() to capture all of the abbreviations, then use the json module to dump it out into a file. Your approach could work, but you'd have to do a lot of manual string parsing, which would be a pretty massive headache. (Note that a program that can parse arbitrary LaTeX would need something more powerful than regular expressions; however, since we're parsing a very small subset of LaTeX, regular expressions will do fine here.)
import re
import json

data = r"""\chapter*{Short}
\addcontentsline{toc}{chapter}{Short}
\markboth{Short}{Short}
\begin{acronym}[TESTERER]
\acro{knmi}[KNMI]{Koninklijk Nederlands Meteorologisch Instituut}
\acro{example}[e.g.]{For example}
\end{acronym}"""

pattern = re.compile(r"\\acro\{(.+)\}\[(.+)\]\{(.+)\}")
regex_result = re.findall(pattern, data)

final_output = {}
for index, (symbol, shortform, longform) in enumerate(regex_result, start=1):
    final_output[f'abbreviation{index}'] = \
        dict(symbol=symbol, shortform=shortform, longform=longform)

with open('output.json', 'w') as output_file:
    json.dump(final_output, output_file, indent=4)
output.json contains the following:
{
"abbreviation1": {
"symbol": "knmi",
"shortform": "KNMI",
"longform": "Koninklijk Nederlands Meteorologisch Instituut"
},
"abbreviation2": {
"symbol": "example",
"shortform": "e.g.",
"longform": "For example"
}
}
I have a large file (about 3 GB) which contains what looks like JSON but isn't, because it lacks commas (,) between "observations", i.e. JSON objects (I have about 2 million of these "objects" in my data file).
For example, this is what I have:
{
"_id": {
"$id": "fh37fc3huc3"
},
"messageid": "4757724838492485088139042828",
"attachments": [],
"usernameid": "47284592942",
"username": "Alex",
"server": "475774810304151552",
"text": "Must watch",
"type": "462050823720009729",
"datetime": "2018-08-05T21:20:20.486000+00:00",
"type": {
"$numberLong": "0"
}
}
{
"_id": {
"$id": "23453532dwq"
},
"messageid": "232534",
"attachments": [],
"usernameid": "273342",
"usernameid": "Alice",
"server": "475774810304151552",
"text": "https://www.youtube.com/",
"type": "4620508237200097wd29",
"datetime": "2018-08-05T21:20:11.803000+00:00",
"type": {
"$numberLong": "0"
}
And this is what I want (the comma between "observations"):
{
"_id": {
"$id": "fh37fc3huc3"
},
"messageid": "4757724838492485088139042828",
"attachments": [],
"username": "Alex",
"server": "475774810304151552",
"type": {
"$numberLong": "0"
}
},
{
"_id": {
"$id": "23453532dwq"
},
"messageid": "232534",
"attachments": [],
"usernameid": "Alice",
"server": "475774810304151552",
"type": {
"$numberLong": "0"
}
This is what I tried but it doesn't give me a comma where I need it:
import re
with open('dataframe.txt', 'r') as input, open('out.txt', 'w') as output:
    output.write("[")
    for line in input:
        line = re.sub('', '},{', line)
        output.write(' ' + line)
    output.write("]")
What can I do so that I can add a comma between each JSON object in my datafile?
This solution presupposes that none of the fields in the JSON contains { or }.
If we assume that there is at least one blank line between JSON dictionaries, here is an idea: maintain a running count of unclosed curly brackets ({) as unclosed_count, and when we meet an empty line while that count is zero, add the comma once.
Like this:
with open('test.json', 'r') as input_f, open('out.json', 'w') as output_f:
    output_f.write("[")
    unclosed_count = 0
    comma_after_zero_added = True
    for line in input_f:
        unclosed_count_change = line.count('{') - line.count('}')
        unclosed_count += unclosed_count_change
        if unclosed_count_change != 0:
            comma_after_zero_added = False
        if line.strip() == '' and unclosed_count == 0 and not comma_after_zero_added:
            output_f.write(",\n")
            comma_after_zero_added = True
        else:
            output_f.write(line)
    output_f.write("]")
Assuming sufficient memory, you can parse such a stream one object at a time using json.JSONDecoder.raw_decode directly, instead of using json.loads.
>>> import json
>>> decoder = json.JSONDecoder()
>>> x = '{"a": 1}\n{"b":2}\n'  # hypothetical output of open("dataframe.txt").read()
>>> decoder.raw_decode(x)
({'a': 1}, 8)
>>> decoder.raw_decode(x, 9)
({'b': 2}, 16)
The output of raw_decode is a tuple containing the first JSON value decoded and the position in the string where the remaining data starts. (Note that json.loads just creates an instance of JSONDecoder, and calls the decode method, which just calls raw_decode and artificially raises an exception if the entire input isn't consumed by the first decoded value.)
A little extra work is involved; note that you can't start decoding with whitespace, so you'll have to use the returned index to detect where the next value starts, following any additional whitespace at the returned index.
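A sketch of that loop (the generator name and the sample string are mine, not from the question):

```python
import json

def iter_json_values(text):
    """Yield each JSON value from a string of whitespace-separated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode cannot start on whitespace, so skip past it first.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value

print(list(iter_json_values('{"a": 1}\n\n{"b": 2}\n')))  # [{'a': 1}, {'b': 2}]
```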
Another way to view your data is that you have multiple json records separated by whitespace. You can use the stdlib JSONDecoder to read each record, then strip whitespace and repeat til done. The decoder reads a record from a string and tells you how far it got. Apply that iteratively to the data until all is consumed. This is far less risky than making a bunch of assumptions about what data is contained in the json itself.
import json
def json_record_reader(filename):
    with open(filename, encoding="utf-8") as f:
        txt = f.read().lstrip()
    decoder = json.JSONDecoder()
    result = []
    while txt:
        data, pos = decoder.raw_decode(txt)
        result.append(data)
        txt = txt[pos:].lstrip()
    return result
print(json_record_reader("data.json"))
Considering the size of your file, a memory mapped text file may be the better option.
If you're sure that the only place you will find a blank line is between two dicts, then you can go ahead with your current idea, after you fix its execution. For every line, check whether it's empty: if it isn't, write it as-is; if it is, write a comma instead.
with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    output_file.write("[")
    for line in input_file:
        if line.strip():
            output_file.write(line)
        else:
            output_file.write(",")
    output_file.write("]")
If you cannot guarantee that any blank line must be replaced by a comma, you need a different approach.
You want to replace a close-bracket, followed by an empty line (or multiple whitespace), followed by an open-bracket, with },{.
You can keep track of the previous two lines in addition to the current line, and if these are "}", "", and "{" in that order, then write a comma before writing the "{".
from collections import deque

with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
    last_two_lines = deque(maxlen=2)
    output_file.write("[")
    for line in input_file:
        line_s = line.strip()
        if line_s == "{" and list(last_two_lines) == ["}", ""]:
            output_file.write("," + line)
        else:
            output_file.write(line)
        last_two_lines.append(line_s)
Alternatively, if you want to stick with regex, then you could do
import re

with open('dataframe.txt') as input_file:
    file_contents = input_file.read()
repl_contents = re.sub(r'\}(\s+)\{', r'},\1{', file_contents)
with open('out.txt', 'w') as output_file:
    output_file.write(repl_contents)
Here, the regex r"\}(\s+)\{" matches the pattern we're looking for (\s+ matches one or more whitespace characters and captures them in group 1, which we then use in the replacement string as \1).
Note that you will need to read and run re.sub on the entire file, which will be slow.
I have a wrongly-formatted JSON file where I have numbers with leading zeroes.
p = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
arc = json.loads(p)
I get this error.
JSONDecodeError: Expecting ',' delimiter: line 8 column 24 (char 107)
Here's what is on char 107:
print(p[107])
#0
The problem is: this is the data I have. Here I am only showing two examples, but my file has millions of lines to be parsed, so I need a script. At the end of the day, I need this string:
"""[
{
"name": "Alice",
"RegisterNumber": "911100020001"
},
{
"name": "Bob",
"RegisterNumber": "000111110300"
}
]"""
How can I do it?
Read the file (ideally line by line) and replace all the offending values with their string representation. You can use regular expressions for that (the re module).
Then save, and later parse, the now-valid JSON.
If it fits into memory, you don't need to save the file, of course; just call loads on the then-valid JSON string.
Here is a simple version:
import json
p = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
from re import sub
p = sub(r"(\d{12})", "\"\\1\"", p)
arc = json.loads(p)
print(arc[1])
This probably won't be pretty but you could probably fix this using a regex.
import re
p = "..."
sub = re.sub(r'"RegisterNumber":\W([0-9]+)', r'"RegisterNumber": "\1"', p)
json.loads(sub)
This will match all the case where you have the RegisterNumber followed by numbers.
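Because the real file has millions of lines, the same substitution can also be applied one line at a time rather than on the whole string; a sketch on a single sample line (file handling omitted):

```python
import re

# Quote the numeric value so its leading zeroes survive JSON parsing.
pattern = re.compile(r'"RegisterNumber":\s*([0-9]+)')

line = '"RegisterNumber": 000111110300'
fixed = pattern.sub(r'"RegisterNumber": "\1"', line)
print(fixed)  # "RegisterNumber": "000111110300"
```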
Since the problem is the leading zeroes, the easy way to fix the data would be to split it into lines and fix any lines that exhibit the problem. It's cheap and nasty, but this seems to work.
data = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
import json

result = []
for line in data.splitlines():
    if ': 0' in line:
        while ": 0" in line:
            line = line.replace(': 0', ': ')
        result.append(line.replace(': ', ': "') + '"')
    else:
        result.append(line)
data = "".join(result)
arc = json.loads(data)
print(arc)
I am looking to format a text file from an api request output. So far my code looks like such:
import requests
url = 'http://URLhere.com'
headers = {'tokenname': 'tokenhash'}
response = requests.get(url, headers=headers,)
with open('newfile.txt', 'w') as outf:
outf.write(response.text)
and this creates a text file but the output is on one line.
What I am trying to do is have it start a new line every time the code reaches a certain word like "id", "status", or "closed_at", but unfortunately I have not been able to figure this out.
I am also trying to get a count of how many "id" fields there are in the file, but I think the script does not like the formatting.
The output is as follows:
{
[
{
"id": 12345,
"status": "open or close",
"closed_at": null,
"created_at": "yyyy-mm-ddTHH:MM:SSZ",
"due_date": "yyyy-mm-dd",
"notes": null,
"port": [pnumber
],
"priority": 1,
"identifiers": [
"12345"
],
"last_seen_time": "yyyy-mm-ddThh:mm:ss.sssZ",
"scanner_score": 1.0,
"fix_id": 12345,
"scanner_vulnerabilities": [
{
"port": null,
"external_unique_id": "12345",
"open": false
}
],
"asset_id": 12345
This continues on one line with the same names but for different assets.
This code:
with open('text.txt') as text_file:
    data = text_file.read()
print('\n'.join(data.split(',')))
Gives this output:
"{[{"id":12345
"status":"open or close"
"closed_at":null
"created_at":"yyyy-mm-ddTHH:MM:SSZ"
"due_date":"yyyy-mm-dd"
"notes":null
"port":[pnumber]
"priority":1
"identifiers":["12345"]
"last_seen_time":"yyyy-mm-ddThh:mm:ss.msmsmsZ"
"scanner_score":1.0
"fix_id":12345
"scanner_vulnerabilities":[{"port":null
"external_unique_id":"12345"
"open":false}]
"asset_id":12345"
And then to write it to a new file:
output = data.split(',')
with open('new.txt', 'w') as write_file:
    for line in output:
        write_file.write(line + '\n')
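As for counting how many "id" fields there are: once the response text is in memory, str.count on the quoted key name is enough. The sample string below is a stand-in for response.text:

```python
data = '{"id": 12345, "fix_id": 12345, "asset_id": 12345, "id": 678}'

# Counting the quoted key avoids false hits on "fix_id" or "asset_id".
id_count = data.count('"id"')
print(id_count)  # 2
```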
I am trying to create a dict like this:
{
"Name": "string",
"Info": [
{
"MainID": "00000000-0000-0000-0000-000000000000",
"Values": [
{
"IndvidualID": "00000000-0000-0000-0000-000000000000",
"Result": "string"
},
{
"IndvidualID": "00000000-0000-0000-0000-000000000000",
"Result": "string"
}
]
}
]
}
Where the Values section has 100+ entries inside of it; I put 2 as an example.
I'm unsure how to build this dynamically. Code I have attempted so far:
count = 0
with open('myTextFile.txt') as f:
    line = f.readline()
    line = line.rstrip()
    myValues[count]["IndvidualID"] = count
    myValues[count]["Result"] = line
    count = count + 1

data = {"Name": "Demo",
        "Info": [{
            "MainID": TEST_SUITE_ID,
            "Values": myValues
        }}
This does not work, however, due to a "KeyError: 0". It works if I do myValues[count] = count, but as soon as I add the extra layer, myValues[count]["IndvidualID"] = count breaks. I see some examples of setting a list in there, but I need a list (Values) with multiple things inside (ID and Result). I have tried a few things with no luck. Does anyone have a simple way to do this in Python?
Full traceback:
Traceback (most recent call last):
File "testExtractor.py", line 28, in <module>
myValues[count]["IndvidualID"] = count
KeyError: 0
If I add a few bits and pieces I can get this code to run:
count = 0
myValues = []
with open('myTextFile.txt') as f:
    for line in f:
        line = line.rstrip()
        d = {"IndvidualID": count, "Result": line}
        myValues.append(d)
        count = count + 1

TEST_SUITE_ID = 1
data = {"Name": "Demo",
        "Info": [{
            "MainID": TEST_SUITE_ID,
            "Values": myValues
        }]}
Output:
{'Name': 'Demo', 'Info': [{'MainID': 1,
'Values': [{'IndvidualID': 0, 'Result': 'apples'}, {'IndvidualID': 1, 'Result': 'another apples'}]
}]}
Note:
I have defined myValues as an empty list. I iterate over the file to read all the lines, create a new dict for each line, and append it to myValues each time. Finally I create the whole data object with myValues embedded inside.
To get the above output I actually substituted f = ['apples', 'another apples'] for the open file, but an actual file will work the same way.
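The same construction can be written a bit more compactly with enumerate and a list comprehension (again using a stand-in list for the file's lines):

```python
lines = ['apples', 'another apples']  # stand-in for the open file
TEST_SUITE_ID = 1

myValues = [{"IndvidualID": i, "Result": line.rstrip()}
            for i, line in enumerate(lines)]

data = {"Name": "Demo",
        "Info": [{"MainID": TEST_SUITE_ID, "Values": myValues}]}
print(data)
```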