Nested data to CSV in Python

I have a file of nested JSON data. I am trying to get() some objects (I think they are called objects: "some_object": "some_value") and write them to a CSV file, with one row for each group of nested items. This is my code:
import csv
import json

path = 'E:/Uni Arbeit/Prof Hayo/Sascha/Bill data/97/bills/hr/hr4242'

outputfile = open('TaxLaw1981.csv', 'w', newline='')
outputwriter = csv.writer(outputfile)

with open(path + "/" + "/data.json", "r") as f:
    data = json.load(f)
    for act in data['actions']:
        a = act.get('acted_at')
        b = act.get('text')
        c = act.get('type')
        outputwriter.writerow([a, b, c])

outputfile.close()
The problem I have is that it only writes the last group of data to the CSV; however, when I run
with open(path + "/" + "/data.json", "r") as f:
    data = json.load(f)
    for act in data['actions']:
        a = act.get('acted_at')
        b = act.get('text')
        c = act.get('type')
        print(a)
all of my "a" values print out.
Suggestions?

You need to flush outputfile so each row actually reaches the file; otherwise the rows sit in the write buffer and may only show up on disk once the file is closed. writerow alone does not guarantee the data is written out until you close the file or flush it.
for act in data['actions']:
    a = act.get('acted_at')
    b = act.get('text')
    c = act.get('type')
    outputwriter.writerow([a, b, c])
    outputfile.flush()
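A tidier alternative, sketched here under the same assumptions as the question (a data.json containing an 'actions' list), is to open the CSV inside a with block so it is flushed and closed automatically:
import csv
import json

path = 'E:/Uni Arbeit/Prof Hayo/Sascha/Bill data/97/bills/hr/hr4242'

with open(path + "/data.json", "r") as f:
    data = json.load(f)

# The with block closes (and therefore flushes) the CSV file automatically.
with open('TaxLaw1981.csv', 'w', newline='') as outputfile:
    outputwriter = csv.writer(outputfile)
    for act in data['actions']:
        outputwriter.writerow([act.get('acted_at'), act.get('text'), act.get('type')])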

The code you posted above works 100% with the file you have.
The file (for anyone interested) is available with rsync -avz --delete --delete-excluded --exclude **/text-versions/ govtrack.us::govtrackdata/congress/97/bills/hr/hr4242 .
And the output to the csv file is (omitting some lines in the middle)
1981-07-23,Referred to House Committee on Ways and Means.,referral
1981-07-23,"Consideration and Mark-up Session Held by Committee on Ways and Means Prior to Introduction (Jun 10, 81 through Jul 23, 81).",action
1981-07-23,"Hearings Held by House Committee on Ways and Means Prior to Introduction (Feb 24, 25, Mar 3, 4, 5, 24, 25, 26, 27, 30, 31, Apr 1, 2, 3, 7, 81).",action
...
...
...
1981-08-12,Measure Signed in Senate.,action
1981-08-12,Presented to President.,topresident
1981-08-13,Signed by President.,signed
1981-08-13,Became Public Law No: 97-34.,enacted
You should post the full error message you get when you run the script (it is probably an encoding error) so someone can understand why your code is failing.

Related

IndexError: list index out of range in Python Script

I'm new to Python, so I apologize if this question has already been answered. I've used this script before and it worked, so I'm not at all sure what is wrong.
I'm trying to transform a MALLET output document into a long list of topic, weight, and value rather than a wide list of topics, documents, and weights.
Here's what the original csv I'm trying to convert looks like, except there are 30 topics in it (it's a text file called mb_composition.txt):
0 file:/Users/mandyregan/Dropbox/CPH-DH/MiningtheSurge/txt/Abizaid.txt 6.509147794508226E-6 1.8463345214533957E-5 3.301298069640119E-6 0.003825178550032757 0.15240841618294929 0.03903974304065183 0.10454783676528623 0.1316719812119471 1.8018057013225344E-5 4.869261713020613E-6 0.0956868156114931 1.3521101623203115E-5 9.514591058923748E-6 1.822741355900598E-5 4.932324961835634E-4 2.756817586271138E-4 4.039186874601744E-5 1.0503346606335033E-5 1.1466132458804392E-5 0.007003443189848799 6.7094360963952E-6 0.2651753488982284 0.011727025879070194 0.11306132549594633 4.463460490946615E-6 0.0032751230536005056 1.1887304822238514E-5 7.382714572306351E-6 3.538808652077042E-5 0.07158823129977483
1 file:/Users/mandyregan/Dropbox/CPH-DH/MiningtheSurge/txt/Jeffrey,%20Jim%20-%20Chk5-%20ASC%20-%20FINAL%20-%20Sept%202017.docx.txt 4.296636200313062E-6 1.218750594272488E-5 1.5556725986514498E-4 0.043172816021532695 0.04645757277949794 0.01963429696910822 0.1328206370818606 0.116826297071711 1.1893574776047563E-5 3.2141605637859693E-6 0.10242945223692496 0.010439315937573735 0.2478814493196687 1.2031769351093548E-5 0.010142417179693447 2.858721603853616E-5 2.6662348272204834E-5 6.9331747684835E-6 7.745091995495631E-4 0.04235638910274044 4.428844900369446E-6 0.0175105406405736 0.05314379308820005 0.11788631730736487 2.9462944350793084E-6 4.746133386282654E-4 7.846714475661223E-6 4.873270616886766E-6 0.008919869163605806 0.02884824479155971
And here's the python script I'm trying to use to convert it:
infile = open('mallet_output_files/mb_composition.txt', 'r')
outfile = open('mallet_output_files/weights.csv', 'w+')
outfile.write('file,topicnum,weight\n')

for line in infile:
    tokens = line.split('\t')
    fn = tokens[1]
    topics = tokens[2:]
    #outfile.write(fn[46:] + ",")
    for i in range(0,59):
        outfile.write(fn[46:] + ",")
        outfile.write(topics[i*2]+','+topics[i*2+1]+'\n')
I'm running this in the terminal with python reshape.py and I get this error:
Traceback (most recent call last):
  File "reshape.py", line 12, in <module>
    outfile.write(topics[i*2]+','+topics[i*2+1]+'\n')
IndexError: list index out of range
Any idea what I'm doing wrong here? I can't seem to figure it out and am frustrated because I know I've used this script many times before with success! If it helps, I'm on Mac OS X with Python 2.7.10.
The problem is that your loop assumes 59 topic/weight pairs (118 list entries) per line, but your topics list has far fewer.
If you just want to print out the topics that actually appear on each line, you should define your range by the real number of topics per line:
for i in range(len(topics) // 2):
    outfile.write(fn[46:] + ",")
    outfile.write(topics[i*2]+','+topics[i*2+1]+'\n')
Stated more pythonically, it would look something like this:
# Group the topics into tuple-pairs for easier management
paired_topics = [tuple(topics[i:i+2]) for i in range(0, len(topics), 2)]

# Iterate the paired topics and print them each on a line of output
for topic in paired_topics:
    outfile.write(fn[46:] + ',' + ','.join(topic) + '\n')
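For illustration, here is how that pairing behaves on a short hypothetical list of values:
topics = ['0.1', '0.2', '0.3', '0.4', '0.5', '0.6']
paired_topics = [tuple(topics[i:i+2]) for i in range(0, len(topics), 2)]
print(paired_topics)  # [('0.1', '0.2'), ('0.3', '0.4'), ('0.5', '0.6')]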
You need to debug your code. Try printing out variables.
infile = open('mallet_output_files/mb_composition.txt', 'r')
outfile = open('mallet_output_files/weights.csv', 'w+')
outfile.write('file,topicnum,weight\n')

for line in infile:
    tokens = line.split('\t')
    fn = tokens[1]
    topics = tokens[2:]
    # outfile.write(fn[46:] + ",")
    for i in range(0,59):
        # Add a print statement like this (works on Python 2 and 3)
        print('Topics {}: {} and {}'.format(i, i*2, i*2+1))
        outfile.write(fn[46:] + ",")
        outfile.write(topics[i*2]+','+topics[i*2+1]+'\n')
Your 'topics' list only has 30 elements, right? It looks like you're trying to access items far outside the available range, i.e. topics[x] where x >= 30.
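As a quick check, not taken from the answers above, printing the length of the list before indexing into it makes the mismatch obvious:
with open('mallet_output_files/mb_composition.txt', 'r') as infile:
    for line in infile:
        topics = line.split('\t')[2:]
        # With 30 weights per line, any index of 30 or higher raises IndexError.
        print(len(topics))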

How to make read plus (r+) mode work in Python 3?

I'm trying to append some text to an already existing file using r+ mode, but I don't know why it is not working. Here is my code:
# Here I'm creating the file 'task5_file'
task5_file = open('task5_file.txt', 'w+')
task5_file.write("Line---1\nLine---2\nLine---3\nLine---4\nLine---5\nLine---6\nLine---7\nLine---8\nLine---9\nLine--10\n")
task5_file.seek(0)
print("Before:\n" + task5_file.read() + "\n")
task5_file.close()

# Next I'm trying to append text 5 times, adding it every 18 characters.
# (In the first loop item is 1 when using range(1, 5); seek will be set to 18, 36, 54, 72.)
task5_file = open('task5_file.txt', 'r+')
for item in range(1, 5):
    task5_file.seek(item*18)
    task5_file.write("append#" + str(item) + "\n")

print("After:\n" + task5_file.read())
This is what I get:
Before:
Line---1
Line---2
Line---3
Line---4
Line---5
Line---6
Line---7
Line---8
Line---9
Line--10
After:
Line--10
# [ ] complete the task
# Provided code creates and populates task5_file.txt
task5_file = open('task5_file.txt', 'w+')
task5_file.write("Line---1\nLine---2\nLine---3\nLine---4\nLine---5\nLine---6\nLine---7\nLine---8\nLine---9\nLine--10\n")
task5_file.seek(0)
print("Before:\n" + task5_file.read() + "\n")
task5_file.close()

# [ ] code here
task5_file = open('task5_file.txt', 'r+')
for item in range(1, 5):
    task5_file.write("append#" + str(item) + "\n")
    task5_file.seek(item*18)

task5_file.seek(0)  # here is the important part
print(task5_file.read(), "\n")
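The key point is that after writing with r+, the file position is wherever the last write or seek left it, so a plain read() returns only what follows that position; seeking back to 0 first shows the whole file. Here is a minimal sketch of the asker's loop (assuming the goal is to overwrite text at every 18th character), with the seek placed before each write and a final seek(0) before reading:
task5_file = open('task5_file.txt', 'r+')
for item in range(1, 5):
    # Move to the target offset first, then overwrite in place.
    task5_file.seek(item * 18)
    task5_file.write("append#" + str(item) + "\n")

# Rewind before reading; otherwise read() starts at the current position.
task5_file.seek(0)
print("After:\n" + task5_file.read())
task5_file.close()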

How to solve a problem decoding wrongly formatted JSON

Hi everyone. I need help opening and reading a file.
Got this txt file - https://yadi.sk/i/1TH7_SYfLss0JQ
It is a dictionary
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}
But it was written into a txt file using json.
#This is how I dump the data into a txt
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
So, the file structure is
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}{"id2":"url2", "id3":"url3", ..., "id4":"url4"}{"id5":"url5", "id6":"url6", ..., "id7":"url7"}
And it is all a string....
I need to open it, check for repeated IDs, delete them, and save the file again.
But I'm getting: json.loads shows ValueError: Extra data.
Tried these:
How to read line-delimited JSON from large file (line by line)
Python json.loads shows ValueError: Extra data
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190)
But I'm still getting that error, just in a different place.
Right now I've got as far as:
with open('111111111.txt', 'r') as log:
    before_log = log.read()

before_log = before_log.replace('}{', ', ').split(', ')

mu_dic = []
for i in before_log:
    mu_dic.append(i)
This eliminates the problem of several {}{}{} dictionaries/JSONs in a row.
Maybe there is a better way to do this?
P.S. This is how the file is made:
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
Your file is 9.5 MB, so it would take you a while to open it and debug it manually.
So, using the head and tail tools (normally found in any GNU/Linux distribution) you'll see that:
# You can use Python as well to read chunks from your file
# and see the nature of it and what it's causing a decode problem
# but i prefer head & tail because they're ready to be used :-D
$> head -c 217 111111111.txt
{"1933252590737725178": "https://instagram.fiev2-1.fna.fbcdn.net/vp/094927bbfd432db6101521c180221485/5CC0EBDD/t51.2885-15/e35/46950935_320097112159700_7380137222718265154_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net",
$> tail -c 219 111111111.txt
, "1752899319051523723": "https://instagram.fiev2-1.fna.fbcdn.net/vp/a3f28e0a82a8772c6c64d4b0f264496a/5CCB7236/t51.2885-15/e35/30084016_2051123655168027_7324093741436764160_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net"}
$> head -c 294879 111111111.txt | tail -c 12
net"}{"19332
So the first guess is that your file is a malformed series of JSON data, and the best move is to separate the }{ boundaries with a \n for further manipulation.
So, here is an example of how you can solve your problem using Python:
import json

input_file = '111111111.txt'
output_file = 'new_file.txt'

data = ''
with open(input_file, mode='r', encoding='utf8') as f_file:
    # this with statement part can be replaced by
    # using sed under your OS like this example:
    # sed -i 's/}{/}\n{/g' 111111111.txt
    data = f_file.read()
    data = data.replace('}{', '}\n{')

seen, total_keys, to_write = set(), 0, {}

# split the lines of the in-memory data
for elm in data.split('\n'):
    # convert the line to a valid Python dict
    converted = json.loads(elm)
    # loop over the keys
    for key, value in converted.items():
        total_keys += 1
        # if the key has not been seen yet, keep it for further manipulation;
        # else ignore it
        if key not in seen:
            seen.add(key)
            to_write.update({key: value})

# write the dict's keys & values into a new file as JSON
with open(output_file, mode='a+', encoding='utf8') as out_file:
    out_file.write(json.dumps(to_write) + '\n')

print(
    'found duplicated key(s): {seen} from {total}'.format(
        seen=total_keys - len(seen),
        total=total_keys
    )
)
Output:
found duplicated key(s): 43836 from 45367
And finally, the output file will be a valid JSON file and the duplicated keys will be removed with their values.
The basic difference between the file's structure and actual JSON format is the missing commas between objects and the fact that the objects are not enclosed within [ ]. So the same can be achieved with the snippet below:
import json

with open('json_file.txt') as f:
    # Read the complete file
    a = f.read()

# Convert into a single-line string
b = ''.join(a.splitlines())

# Add , after each object
b = b.replace("}", "},")

# Add opening and closing brackets and drop the trailing comma added in the previous step
b = '[' + b[:-1] + ']'

x = json.loads(b)
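Another option, offered here only as a sketch and not part of the answers above, is to avoid the string surgery entirely and use json.JSONDecoder.raw_decode, which parses one JSON value from a string and reports where it stopped, so a file of concatenated objects can be walked in a loop:
import json

def iter_json_objects(text):
    # Yield each top-level JSON object from a string of concatenated objects.
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        obj, end = decoder.raw_decode(text, idx)
        yield obj
        # Skip any whitespace between objects.
        while end < len(text) and text[end].isspace():
            end += 1
        idx = end

merged = {}
with open('111111111.txt', mode='r', encoding='utf8') as f:
    for obj in iter_json_objects(f.read()):
        for key, value in obj.items():
            merged.setdefault(key, value)  # keep the first value seen for a duplicate id

with open('new_file.txt', mode='w', encoding='utf8') as out_file:
    out_file.write(json.dumps(merged) + '\n')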

Python - Files - Changing line numbers based on search results

I need to search a debug file for a specific string or error and then, once it's found, look six lines up in the file and print whatever that line contains.
import linecache

file = "/file.txt"
fh = open("/file.txt", "r")
lookup = 'No Output'

with fh as myFile:
    for num, line in enumerate(myFile, 1):
        if lookup in line:
            numUp = num + 6
            new = linecache.getline(file, numUp)
            print new
I tried doing something along the lines of "num += 6" whenever I find the term I need to search, but my output is either blank or I receive this error:
File "testRead.py", line 12
print new
^
IndentationError: unexpected indent
If there's also another way to do "search, then scan up n lines, then print/return" that works line by line, that'd be great to know as well, because the files I'll be working with vary greatly in size.
::Edit::
I uploaded an example file of some of the things I typically see: http://pastebin.com/mzvCfZid
Any time I hit the string "(Err: No Output)", I need to find its associated ID, which is the number six lines above the error. So "No Output" is what I'd need to search for.
You want the functionality of a deque:
>>> from collections import deque
>>>
>>> lines = deque(maxlen=7)
>>> for i in range(35):
...     lines.append(i)
...
>>> print lines
deque([28, 29, 30, 31, 32, 33, 34], maxlen=7)
# Note: your "6 lines earlier" value would be lines[0] here.
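Applied to the original problem, a sketch along these lines (assuming the ID really is exactly six lines above each match) would keep the current line plus the six before it and print the oldest one whenever the lookup string appears:
from collections import deque

lookup = 'No Output'
lines = deque(maxlen=7)  # the current line plus the six lines before it

with open("/file.txt", "r") as myFile:
    for line in myFile:
        lines.append(line)
        if lookup in line and len(lines) == 7:
            # lines[0] is the line six above the current match
            print(lines[0].rstrip())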

ODS file to JSON

I would like to convert my spreadsheet of data to a JSON array of arrays.
This site does it: http://www.shancarter.com/data_converter/index.html
And I looked into the source code.
But what I would like is a macro / script / extension, or any way to program it, to convert my .ods into a JSON file:
Like:
NAME VALUE COLOR DATE
Alan 12 blue Sep. 25, 2009
Shan 13 "green blue" Sep. 27, 2009
John 45 orange Sep. 29, 2009
Minna 27 teal Sep. 30, 2009
To:
[
["Alan",12,"blue","Sep. 25, 2009"],
["Shan",13,"green\tblue","Sep. 27, 2009"],
["John",45,"orange","Sep. 29, 2009"],
["Minna",27,"teal","Sep. 30, 2009"]
]
The answer might again be late, but marcoconti83 has done exactly that: read an .ods file and return its contents as two-dimensional arrays.
https://github.com/marcoconti83/read-ods-with-odfpy/blob/master/ODSReader.py
Once you have the data in arrays, it's not that difficult to get them into a json file. Here's example code:
import json
from odftoarray import ODSReader # renamed the file to odftoarray.py
r = ODSReader("your_file.ods")
arrays = r.getSheet("your_data_sheet_name")
json.dumps(arrays)
This may be a bit late, but for those who come looking and want to do this: it would likely be easiest to save the .ods file as .csv, which nearly all spreadsheet programs can do. Then use something like this to convert it:
import csv
import sys
import json, os

def convert(csv_filename, fieldnames):
    print("Opening CSV file: ", csv_filename)
    f = open(csv_filename, 'r')
    csv_reader = csv.DictReader(f, fieldnames)
    json_filename = csv_filename.split(".")[0] + ".json"
    print("Saving JSON to file: ", json_filename)
    jsonf = open(json_filename, 'w')
    data = json.dumps([r for r in csv_reader])
    jsonf.write(data)
    f.close()
    jsonf.close()

csvfile = 'path/to/the/csv/file.csv'
field_names = [
    "a",
    "list",
    "of",
    "fieldnames"
]
convert(csvfile, field_names)
And a tip: CSV is pretty human-readable, so go through and make sure it was saved in the format you want, then run this script to convert it to JSON. Check it out in a JSON viewer like JSONView and you should be good to go!
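If you specifically want the array-of-arrays layout from the question rather than an array of objects, a minimal sketch (assuming the exported CSV's first row is a header you want to skip) could use csv.reader instead of DictReader:
import csv
import json

with open('path/to/the/csv/file.csv', newline='') as f:
    rows = list(csv.reader(f))

# rows[0] is the header row; rows[1:] are the data rows
with open('file.json', 'w') as jsonf:
    json.dump(rows[1:], jsonf)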
