Converting JSON to CSV with part of JSON value as row headers - python

I have just started to learn Python, and my task is converting a JSON file to a CSV file that uses a semicolon as the delimiter, subject to three constraints.
My JSON is:
{"_id": "5cfffc2dd866fc32fcfe9fcc",
"tuple5": ["system1/folder", "system3/folder"],
"tuple4": ["system1/folder/text3.txt", "system2/folder/text3.txt"],
"tuple3": ["system2/folder/text2.txt"],
"tuple2": ["system2/folder"],
"tuple1": ["system1/folder/text1.txt", "system2/folder/text1.txt"],
"tupleSize": 3}
The output CSV should be in a form:
system1 ; system2 ; system3
system1/folder ; ~ ; system3/folder
system1/folder/text3.txt ; system2/folder/text3.txt ; ~
~ ; system2/folder/text2.txt ; ~
~ ; system2/folder ; ~
system1/folder/text1.txt ; system2/folder/text1.txt ; ~
So the three constraints are: the tupleSize indicates the number of row headers; the first part of each array element, i.e. system1, system2 and system3, forms those row headers; and only the elements belonging to a particular system appear under that system in the CSV file (the rest are ~).
I found a few posts regarding the conversion in Python, like this and this. None of them dealt with constraints like these, and I am unable to figure out how to approach this.
Can someone help?
EDIT: I should mention that the array elements are dynamic and thus the row headers may vary in the CSV file.

What you want to do is fairly substantial, so if it's just a Python learning exercise, I suggest you begin with more elementary tasks.
I also think you've got what most folks call rows and columns reversed — so be warned that everything below, including the code, is using them in the opposite sense to the way you used them in your question.
Anyway, the code below first preprocesses the data to determine what the columns (fieldnames) of the CSV file are going to be, and to make sure there are no more of them than the 'tupleSize' key specifies.
Assuming that constraint is met, it then iterates through the data a second time and extracts the column/field values from each key's value, putting them into a dictionary whose contents represent a row, and writes each completed row to the output file.
Updated
Modified to remove all keys that start with "_id" in the JSON object dictionary.
import csv
import json
import re

SEP = '/'  # Value sub-component separator.
id_regex = re.compile(r"_id\d*")

json_string = '''
{"_id1": "5cfffc2dd866fc32fcfe9fc1",
"_id2": "5cfffc2dd866fc32fcfe9fc2",
"_id3": "5cfffc2dd866fc32fcfe9fc3",
"tuple5": ["system1/folder", "system3/folder"],
"tuple4": ["system1/folder/text3.txt", "system2/folder/text3.txt"],
"tuple3": ["system2/folder/text2.txt"],
"tuple2": ["system2/folder"],
"tuple1": ["system1/folder/text1.txt", "system2/folder/text1.txt"],
"tupleSize": 3}
'''

data = json.loads(json_string)  # Convert JSON string into a dictionary.

# Remove non-path items from the dictionary.
tupleSize = data.pop('tupleSize')
_ids = {key: data.pop(key)
        for key in tuple(data.keys()) if id_regex.search(key)}
#print(f'_ids: {_ids}')

max_columns = int(tupleSize)  # Used to check a constraint.

# Determine how many columns are present and what they are.
columns = set()
for key in data:
    paths = data[key]
    if not paths:
        raise RuntimeError('key with no paths')
    for path in paths:
        comps = path.split(SEP)
        if len(comps) < 2:
            raise RuntimeError('component with no subcomponents')
        columns.add(comps[0])

if len(columns) > max_columns:
    raise RuntimeError('too many columns - conversion aborted')

# Create the CSV file.
with open('converted_json.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, delimiter=';', restval='~',
                            fieldnames=sorted(columns))
    writer.writeheader()
    for key in data:
        row = {}
        for path in data[key]:
            column, *_ = path.split(SEP, maxsplit=1)
            row[column] = path
        writer.writerow(row)

print('Conversion complete')

Getting empty CSV in python

I have written a piece of code that compares data from two CSVs and writes the final output to a new CSV. The problem is that, except for the header, nothing else is being written into the CSV. Below is my code:
import csv

data_3B = open('3B_processed.csv', 'r')
reader_3B = csv.DictReader(data_3B)
data_2A = open('2A_processed.csv', 'r')
reader_2A = csv.DictReader(data_2A)

l_3B_2A = [["taxable_entity_id", "return_period", "3B", "2A"]]
for row_3B in reader_3B:
    for row_2A in reader_2A:
        if row_3B["taxable_entity_id"] == row_2A["taxable_entity_id"] and row_3B["return_period"] == row_2A["return_period"]:
            l_3B_2A.append([row_3B["taxable_entity_id"], row_3B["return_period"], row_3B["total"], row_2A["total"]])

with open("3Bvs2A_new.csv", "w") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows(l_3B_2A)
csv_file.close()
How do I solve this?
Edit:
2A_processed.csv sample:
taxable_entity_id,return_period,total
2d9cc638-5ed0-410f-9a76-422e32f34779,072019,0
2d9cc638-5ed0-410f-9a76-422e32f34779,062019,0
2d9cc638-5ed0-410f-9a76-422e32f34779,082019,0
e5091f99-e725-44bc-b018-0843953a8771,082019,0
e5091f99-e725-44bc-b018-0843953a8771,052019,41711.5
920da7ba-19c7-45ce-ba59-3aa19a6cb7f0,032019,2862.94
410ecd0f-ea0f-4a36-8fa6-9488ba3c095b,082018,48253.9
3B_processed.csv sample:
taxable_entity_id,return_period,total
1e5ccfbc-a03e-429e-b79a-68041b69dfb0,072017,0.0
1e5ccfbc-a03e-429e-b79a-68041b69dfb0,082017,0.0
1e5ccfbc-a03e-429e-b79a-68041b69dfb0,092017,0.0
f7d52d1f-00a5-440d-9e76-cb7fbf1afde3,122017,0.0
1b9afebb-495d-4516-96bd-1e21138268b7,072017,146500.0
1b9afebb-495d-4516-96bd-1e21138268b7,082017,251710.0
The csv.DictReader objects in your code can only read through the file once, because they are reading from file objects (created with open). Therefore, the second and subsequent times through the outer loop, the inner loop does not run, because there are no more row_2A values in reader_2A - the reader is at the end of the file after the first time.
The simplest fix is to read each file into a list first. We can make a helper function to handle this, and also ensure the files are closed properly:
def lines_of_csv(filename):
    with open(filename) as source:
        return list(csv.DictReader(source))

reader_3B = lines_of_csv('3B_processed.csv')
reader_2A = lines_of_csv('2A_processed.csv')
I put your code into a file test.py and created test files to simulate your csvs.
$ python3 ./test.py
$ cat ./3Bvs2A_new.csv
taxable_entity_id,return_period,3B,2A
1,2,3,2
$ cat ./3B_processed.csv
total,taxable_entity_id,return_period,3B,2A
3,1,2,3,4
3,4,3,2,1
$ cat ./2A_processed.csv
taxable_entity_id,return_period,2A,3B,total
1,2,3,4,2
4,3,2,1,2
So as you can see, the order of the columns doesn't matter, since they are accessed by name through the DictReader, and if the first row is a match your code works. But there are no rows left in the second CSV file after processing the first row from the first file. I suggest building a dictionary keyed by (taxable_entity_id, return_period) tuples: process the first CSV file by adding totals into the dict, then run through the second one and look them up.
row_lookup = {}
for row in first_csv:
    row_lookup[(row['taxable_entity_id'], row['return_period'])] = row['total']

for row in second_csv:
    key = (row['taxable_entity_id'], row['return_period'])
    if key in row_lookup:
        new_row = [row['taxable_entity_id'], row['return_period'],
                   row['total'], row_lookup[key]]
Of course that only works if pairs of taxable_entity_ids and return_periods are always unique... Hard to say exactly what you should do without knowing the exact nature of your task and full format of your csvs.
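If the pairs can repeat, one variant (a sketch, not from the original answer; the sample rows are made up) keeps a list of totals per key instead of a single value:

```python
from collections import defaultdict

# Hypothetical rows standing in for the parsed CSV data.
first_csv = [
    {'taxable_entity_id': 'x', 'return_period': '072019', 'total': '10'},
    {'taxable_entity_id': 'x', 'return_period': '072019', 'total': '20'},
]

# Each key maps to every total seen for that (id, period) pair.
row_lookup = defaultdict(list)
for row in first_csv:
    row_lookup[(row['taxable_entity_id'], row['return_period'])].append(row['total'])

print(row_lookup[('x', '072019')])
```

You would then decide per task whether to sum the list, keep the first entry, or emit one output row per stored total.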
You can do this with pandas if the data frames are equal-sized, like this:
import pandas as pd

reader_3B = pd.read_csv('3B_processed.csv')
reader_2A = pd.read_csv('2A_processed.csv')
l_3B_2A = reader_3B[(reader_3B["taxable_entity_id"] == reader_2A["taxable_entity_id"]) & (reader_3B["return_period"] == reader_2A["return_period"])]
l_3B_2A.to_csv('3Bvs2A_new.csv')

Load CSV to search for match in array/dictionary/list

Very new to Python, and I'm learning by trying to write a program that processes data from the first few lines of multiple text files. So far so good: getting the data in and reformatting it for output.
Now I'd like to change the format of one output field based on which row it sits in in the CSV file. The file is 15 rows with a variable number of columns.
The idea is that:
I preload the CSV file - I'd like to hardcode it into a list or dictionary - not sure what works better for the next step.
Go through the 15 rows in the list/dictionary and, if a match is found, set the output to column 1 in the same row.
Example Data:
BIT, BITSIZE, BITM, BS11, BIT, BS4, BIT1, BIT_STM
CAL, ID27, CALP, HCALI, IECY, CLLO, RD2, RAD3QI, ID4
DEN, RHO8[1], RHOZ1, RHOZ2, RHOB_HR, RHOB_ME, LDENX
DENC, CRHO, DRHO1, ZCOR2, HDRH2, ZCORQK
DEPT, DEPTH, DEPT, MD
DPL, PDL, PORZLS1, PORDLSH_Y, DPRL, HDPH_LIM, PZLS
DPS, HDPH_SAN1, DPHI_SAN2, DPUS, DPOR, PZSS1
DTC, DTCO_MFM[1], DT4PT2, DTCO_MUM[1], DTC
DTS, DT1R[1], DTSH, DT22, DTSM[1], DT24S
GR, GCGR, GR_R3, HGR3, GR5, GR6, GR_R1, MGSGR
NPL, NEU, NPOR_LIM, HTNP_LIM, NPOR, HNPO_LIM1
NPS, NPRS, CNC, NPHILS, NPOR_SS, NPRS1, CNCS, PORS
PE, PEFZ_2, HPEF, PEQK, PEF81, PEF83, PEDN, PEF8MBT
RD, AST90, ASF60, RD, RLLD, RTCH, LLDC, M2R9, LLHD
RS, IESN, FOC, ASO10, MSFR, AO20, RS, SFE, LL8, MLL
For example:
BIT, BITSIZE, BITM, BS11, BIT, BS4, BIT1, BIT_STM
returns BIT
Questions:
Is a list or a dictionary better for search speed?
If I use the csv module to load the data does it matter that number of columns aren't same for every row?
Is there a way to search either list or dictionary without using a loop?
My attempt to load into list and search:
import csv

with open('lookup.csv', 'rb') as f:
    reader = csv.reader(f)
    codelist = list(reader)
Would this work for searching for a matching code searchcode?
for subcodes in codelist:
    if searchcode in subcodes:
        print "Found it!", subcodes[0]
        break
I think that you should try a two-dimensional structure, such as a dictionary of sets:
new_dic = {}
new_dic[0] = {'BIT', 'BITSIZE', 'BITM', 'BS11', 'BS4', 'BIT1', 'BIT_STM'}
new_dic[1] = {'CAL', 'ID27', 'CALP', 'HCALI', 'IECY', 'CLLO', 'RD2', 'RAD3QI', 'ID4'}
Then you can search for the element and print it.
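A minimal sketch of that lookup, assuming the rows are hardcoded as sets of strings keyed by row number (the helper find_code is a made-up name):

```python
# Hypothetical lookup table: row number -> set of code names.
new_dic = {
    0: {'BIT', 'BITSIZE', 'BITM', 'BS11', 'BS4', 'BIT1', 'BIT_STM'},
    1: {'CAL', 'ID27', 'CALP', 'HCALI', 'IECY', 'CLLO', 'RD2', 'RAD3QI', 'ID4'},
}

def find_code(searchcode):
    # Return the row index whose set contains the code, or None.
    for row, codes in new_dic.items():
        if searchcode in codes:
            return row
    return None

print(find_code('CALP'))  # prints 1
```

Set membership tests are O(1) on average, so each row check is cheap even with many codes per row.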
You can use "index" to search for an item in a list. If the item is present, it returns the location of its first occurrence:
>>> my_list = ['a', 'b', 'c', 'd', 'e', 'c']  # defines the list
>>> copy_at = my_list.index('b')  # checks if 'b' is in the list
>>> copy_at  # the location in the list where 'b' was found
1
>>> copy_at = my_list.index('c')
>>> copy_at
2
>>> copy_at = my_list.index('f')
Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    my_list.index('f')
ValueError: 'f' is not in list
You can catch the error with a "try"/"except" and keep searching.
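For instance, a sketch of that try/except scan over the loaded rows (the sample codelist and searchcode values are assumptions):

```python
# Hypothetical rows as loaded from the lookup CSV.
codelist = [
    ['BIT', 'BITSIZE', 'BITM'],
    ['CAL', 'ID27', 'CALP'],
]
searchcode = 'CALP'

found = None
for subcodes in codelist:
    try:
        subcodes.index(searchcode)  # raises ValueError if not present
        found = subcodes[0]         # column 1 of the matching row
        break
    except ValueError:
        continue                    # not in this row, keep searching

print(found)  # prints CAL
```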

eliminate text after certain character in python pipeline- with slice?

This is a short script I've written to refine and validate a large dataset that I have.
# The purpose of this script is the refinement of the job data attained from the
# JSI as it is rendered by the `csv generator` contributed by Luis for purposes
# of presentation on the dashboard map.
import csv

# The number of columns
num_headers = 9

# Remove invalid characters from records
def url_escaper(data):
    for line in data:
        yield line.replace('&', '&amp;')

# Be sure to configure input & output files
with open("adzuna_input_THRESHOLD.csv", 'r') as file_in, open("adzuna_output_GO.csv", 'w') as file_out:
    csv_in = csv.reader(url_escaper(file_in))
    csv_out = csv.writer(file_out)
    # Get rid of rows that have the wrong number of columns
    # and rows that have only whitespace for a columnar value
    for i, row in enumerate(csv_in, start=1):
        if not [e for e in row if not e.strip()]:
            if len(row) == num_headers:
                csv_out.writerow(row)
            else:
                print "line %d is malformed" % i
I have one field that is structured like so:
finance|statistics|lisp
I've seen ways to do this using other utilities like R, but I want to ideally achieve the same effect within the scope of this python code.
Maybe I can iterate over all the characters of all the columnar values, perhaps as a list, and if I see a | I can dispose of the | and all the text that follows it within the scope of the column value.
I think surely it can be achieved with slices, as they do here, but I don't quite understand how the indices with slices work, and I can't see how I could include this process harmoniously within the cascade of the current script pipeline.
With regex I guess it's something like this
(?:\|)(.*)
Why not use string's split method?
In [4]: 'finance|statistics|lisp'.split('|')[0]
Out[4]: 'finance'
It also does not fail with an exception when the string does not contain the separator character:
In [5]: 'finance/statistics/lisp'.split('|')[0]
Out[5]: 'finance/statistics/lisp'
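To fold that into the existing pipeline, you could truncate the relevant column before writing each row. A sketch, assuming the pipe-delimited field sits at index 1 of each row (adjust FIELD to the real layout):

```python
import csv
import io

# Hypothetical rows: the field at index 1 may contain pipe-separated tags.
rows_in = [
    ['job-1', 'finance|statistics|lisp'],
    ['job-2', 'biology'],
]

out = io.StringIO()
csv_out = csv.writer(out)
FIELD = 1  # index of the column to truncate (an assumption)

for row in rows_in:
    row[FIELD] = row[FIELD].split('|')[0]  # keep text before the first '|'
    csv_out.writerow(row)

print(out.getvalue())
```

In the question's script this line would slot in just before csv_out.writerow(row), after the row-validity checks.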

Combining files in python using

I am attempting to combine a collection of 600 text files, each line looks like
Measurement title Measurement #1
ebv-miR-BART1-3p 4.60618701
....
evb-miR-BART1-200 12.8327289
with 250 or so rows in each file. Each file is formatted that way, with the same data headers. What I would like to do is combine the files such that it looks like this
Measurement title Measurement #1 Measurement #2
ebv-miR-BART1-3p 4.60618701 4.110878867
....
evb-miR-BART1-200 12.8327289 6.813287556
Is there an easy way in Python to strip out the second column of each file and append it to a master file? I was planning on pulling each line out, using regular expressions to find the second column, and appending it to the corresponding line in the master file. Is there something more efficient?
It is a small amount of data for today's desktop computers (around 150,000 measurements), so keeping everything in memory and dumping to a single file will be easier than any other strategy. If it would not fit in RAM, SQL might be a nice approach; but as it is, you can create a single default dictionary where each element is a list, read all your files, collect the measurements into this dictionary, and dump it to disk:
# create default list dictionary:
>>> from collections import defaultdict
>>> data = defaultdict(list)
# Read your data into it:
>>> from glob import glob
>>> import csv
>>> for filename in glob("my_directory/*csv"):
...     reader = csv.reader(open(filename))
...     # throw away the header row:
...     next(reader)
...     for name, value in reader:
...         data[name].append(value)
...
>>> # and record everything down in another file:
...
>>> mydata = open("mydata.csv", "wt")
>>> writer = csv.writer(mydata)
>>> for name, values in sorted(data.items()):
...     writer.writerow([name] + values)
...
>>> mydata.close()
Use the csv module to read the files in, create a dictionary of the measurement names, and make the values in the dictionary a list of the values from the file.
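A minimal sketch of that idea, assuming tab-delimited files with a header row (the in-memory strings stand in for the real files; adjust the delimiter to your format):

```python
import csv
import io

measurements = {}  # measurement name -> list of values, one per file

# Hypothetical in-memory stand-ins for two input files.
fake_files = [
    "Measurement title\tMeasurement #1\nebv-miR-BART1-3p\t4.6\n",
    "Measurement title\tMeasurement #2\nebv-miR-BART1-3p\t4.1\n",
]

for content in fake_files:
    reader = csv.reader(io.StringIO(content), delimiter='\t')
    next(reader)  # skip the header row
    for name, value in reader:
        measurements.setdefault(name, []).append(value)

print(measurements)
```

With real files, replace the fake_files loop with open() calls over a glob of the directory.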
I don't have comment privileges yet, therefore a separate answer.
jsbueno's answer works really well as long as you're sure that the same measurement IDs occur in every file (order is not important, but the sets should be equal!).
In the following situation:
file1:
measID,meas1
a,1
b,2
file2:
measID,meas1
a,3
b,4
c,5
you would get:
outfile:
measID,meas1,meas2
a,1,3
b,2,4
c,5
instead of the desired:
outfile:
measID,meas1,meas2
a,1,3
b,2,4
c,,5 # measurement c was missing in file1!
I'm using commas instead of spaces as delimiters for better visibility.
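One way around that (a sketch under the assumption that each file is first parsed into its own name-to-value dictionary) is to collect per-file dictionaries and pad missing entries when assembling the rows:

```python
# Sketch: file_data maps filename -> {measurement name: value}.
file_data = {
    "file1": {"a": "1", "b": "2"},
    "file2": {"a": "3", "b": "4", "c": "5"},
}

filenames = sorted(file_data)
# Union of all measurement names across files, sorted for stable output.
all_names = sorted({name for d in file_data.values() for name in d})

rows = []
for name in all_names:
    # Use '' when a measurement is missing from a file, keeping columns aligned.
    rows.append([name] + [file_data[f].get(name, '') for f in filenames])

print(rows)
```

This produces the desired `c,,5`-style row because the missing measurement becomes an empty column instead of shifting later values left.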

Grouping items by match set

I am trying to parse a large number of configuration files and group the results into separate groups based on content; I just do not know how to approach this. For example, say I have the following data in these files:
config1.txt
ntp 1.1.1.1
ntp 2.2.2.2
config2.txt
ntp 1.1.1.1
config3.txt
ntp 2.2.2.2
ntp 1.1.1.1
config4.txt
ntp 2.2.2.2
The results would be:
Sets of unique data 3:
Set 1 (1.1.1.1, 2.2.2.2): config1.txt, config3.txt
Set 2 (1.1.1.1): config2.txt
Set 3 (2.2.2.2): config4.txt
I understand how to glob the directory of files, loop over the glob results, open each file in turn, and use regex to match each line. The part I do not understand is how I could store these results and compare each file to a set of results, even if the entries are out of order but match entry-wise. Any help would be appreciated.
Thanks!
filenames = [ r'config1.txt',
              r'config2.txt',
              r'config3.txt',
              r'config4.txt' ]

results = {}

for filename in filenames:
    with open(filename, 'r') as f:
        contents = ( line.split()[1] for line in f )
        key = frozenset(contents)
        results.setdefault(key, []).append(filename)
from collections import defaultdict

#Load the data.
paths = ["config1.txt", "config2.txt", "config3.txt", "config4.txt"]
files = {}
for path in paths:
    with open(path) as file:
        for line in file.readlines():
            ... #Get data from files
        files[path] = frozenset(data)

#Example data.
files = {
    "config1.txt": frozenset(["1.1.1.1", "2.2.2.2"]),
    "config2.txt": frozenset(["1.1.1.1"]),
    "config3.txt": frozenset(["2.2.2.2", "1.1.1.1"]),
    "config4.txt": frozenset(["2.2.2.2"]),
}

sets = defaultdict(list)
for key, value in files.items():
    sets[value].append(key)
Note you need to use frozensets as they are immutable, and hence can be used as dictionary keys. As they are not going to change, this is fine.
This alternative is more verbose than others, but it may be more efficient depending on a couple of factors (see my notes at the end). Unless you're processing a large number of files with a large number of configuration items, I wouldn't even consider using this over some of the other suggestions, but if performance is an issue this algorithm might help.
Start with two dictionaries, both of which can be built as you glob the files: one from each configuration string to the set of files containing it (call it c2f), and one from each file to its set of configuration strings (call it f2c).
To be clear, c2f is a dictionary where the keys are strings and the values are sets of files; f2c is a dictionary where the keys are files and the values are sets of strings.
Loop over the file keys of f2c and pick one data item per file. Use c2f to find all files that contain that item; those are the only files you need to compare.
Here's the working code:
# this structure simulates the file system and contents.
cfg_data = {
    "config1.txt": ["1.1.1.1", "2.2.2.2"],
    "config2.txt": ["1.1.1.1"],
    "config3.txt": ["2.2.2.2", "1.1.1.1"],
    "config4.txt": ["2.2.2.2"]
}

# Build the dictionaries (this is O(n) over the lines of configuration data)
f2c = dict()
c2f = dict()
for file, data in cfg_data.iteritems():
    data_set = set()
    for item in data:
        data_set.add(item)
        if not item in c2f:
            c2f[item] = set()
        c2f[item].add(file)
    f2c[file] = data_set

# build the results as a list of pairs of lists:
results = []
# track the processed files
processed = set()
for file, data in f2c.iteritems():
    if file in processed:
        continue
    size = len(data)
    equivalence_list = []
    # get one item from data, preferably the one used by the smallest list of
    # files.
    item = None
    item_files = 0
    for i in data:
        if item == None:
            item = i
            item_files = len(c2f[item])
        elif len(c2f[i]) < item_files:
            item = i
            item_files = len(c2f[i])
    # All files with the same data as f must have at least the first item of
    # data, so just look at those files.
    for other_file in c2f[item]:
        other_data = f2c[other_file]
        if other_data == data:
            equivalence_list.append(other_file)
            # No need to visit these files again
            processed.add(other_file)
    results.append((data, equivalence_list))

# Display the results
for data, files in results:
    print data, ':', files
Adding a note on computational complexity: This is technically O((K log N)*(L log M)) where N is the number of files, M is the number of unique configuration items, K (<= N) is the number of groups of files with the same content and L (<= M) is the average number of files that have to be compared pairwise for each of the L processed files. This should be efficient if K << N and L << M.
I'd approach this like this:
First, build a dictionary mapping each entry to the files containing it:
{('1.1.1.1'): ('file1', 'file2', 'file3'), ('2.2.2.2'): ('file1', 'file3', 'file4')}
Then loop over the files, generating the sets:
{('file1'): (('1.1.1.1'), ('2.2.2.2')), etc.}
Then compare the values of the sets:
if val(file1) == val(file3):
    Set1 = {('1.1.1.1'), ('2.2.2.2'): ('file1', 'file2'), etc.}
This is probably not the fastest and most elegant solution, but it should work.
You need a dictionary mapping the contents of the files to the filename. So you have to read each file, sort the entries, build a tuple from them, and use this as a key.
If you can have duplicate entries in a file, read the contents into a set first.
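A small sketch of that approach, with the file contents assumed to be already read into lists of lines:

```python
# Hypothetical parsed contents: filename -> list of lines (may repeat).
contents_by_file = {
    "config1.txt": ["ntp 1.1.1.1", "ntp 2.2.2.2"],
    "config3.txt": ["ntp 2.2.2.2", "ntp 1.1.1.1"],
    "config4.txt": ["ntp 2.2.2.2"],
}

groups = {}
for filename, lines in contents_by_file.items():
    # Dedupe with a set, then sort into a hashable, order-insensitive key.
    key = tuple(sorted(set(lines)))
    groups.setdefault(key, []).append(filename)

for key, names in groups.items():
    print(key, '->', sorted(names))
```

This groups config1.txt and config3.txt together even though their lines appear in different orders.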
