parsing text file with JSON-like object into CSV - python

I have a text file containing key-value pairs, with the last two key-value pairs containing JSON-like objects that I would like to split out into columns and write with the other values, using the keys as column headings. The first three rows of the data file input.txt look like this:
InnerDiameterOrWidth::0.1,InnerHeight::0.1,Length2dCenterToCenter::44.6743867864386,Length3dCenterToCenter::44.6768028159989,Tag::<NULL>,{StartPoint::7858.35924983374[%2C]1703.69341358077[%2C]-3.075},{EndPoint::7822.85045874375[%2C]1730.80294308742[%2C]-3.53962362760298}
InnerDiameterOrWidth::0.1,InnerHeight::0.1,Length2dCenterToCenter::57.8689351603823,Length3dCenterToCenter::57.8700464193429,Tag::<NULL>,{StartPoint::7793.52927597915[%2C]1680.91224357457[%2C]-3.075},{EndPoint::7822.85045874375[%2C]1730.80294308742[%2C]-3.43363070193163}
InnerDiameterOrWidth::0.1,InnerHeight::0.1,Length2dCenterToCenter::68.7161350545728,Length3dCenterToCenter::68.7172034962765,Tag::<NULL>,{StartPoint::7858.35924983374[%2C]1703.69341358077[%2C]-3.075},{EndPoint::7793.52927597915[%2C]1680.91224357457[%2C]-3.45819643838485}
We eventually came up with something that works, but there must be a much better way:
import csv
with open('input.txt', 'rb') as fin, open('output.csv', 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for i, line in enumerate(reader):
        mysplit = [item.split('::') for item in line if item.strip()]
        if not mysplit: # blank line
            continue
        keys, vals = zip(*mysplit)
        start_vals = [item.split('[%2C]') for item in mysplit[-2]]
        end_vals = [item.split('[%2C]') for item in mysplit[-1]]
        a=list(keys[0:-2])
        a.extend(['start1','start2','start3','end1','end2','end3'])
        b=list(vals[0:-2])
        b.append(start_vals[1][0])
        b.append(start_vals[1][1])
        b.append(start_vals[1][2][:-1])
        b.append(end_vals[1][0])
        b.append(end_vals[1][1])
        b.append(end_vals[1][2][:-1])
        if i == 0:
            # if first line: write header
            writer.writerow(a)
        writer.writerow(b)
This produces the output file output.csv, which looks like this:
InnerDiameterOrWidth,InnerHeight,Length2dCenterToCenter,Length3dCenterToCenter,Tag,start1,start2,start3,end1,end2,end3
0.1,0.1,44.6743867864386,44.6768028159989,<NULL>,7858.35924983374,1703.69341358077,-3.075,7822.85045874375,1730.80294308742,-3.53962362760298
0.1,0.1,57.8689351603823,57.8700464193429,<NULL>,7793.52927597915,1680.91224357457,-3.075,7822.85045874375,1730.80294308742,-3.43363070193163
0.1,0.1,68.7161350545728,68.7172034962765,<NULL>,7858.35924983374,1703.69341358077,-3.075,7793.52927597915,1680.91224357457,-3.45819643838485
We don't want to write code like this in the future.
What is the best way to read data like this?

I'd use:
from itertools import chain
import csv
_header_translate = {
    'StartPoint': ('start1', 'start2', 'start3'),
    'EndPoint': ('end1', 'end2', 'end3')
}
def header(col):
    header = col.strip('{}').split('::', 1)[0]
    return _header_translate.get(header, (header,))
def cleancolumn(col):
    col = col.strip('{}').split('::', 1)[1]
    return col.split('[%2C]')
def chainedmap(func, row):
    return list(chain.from_iterable(map(func, row)))
with open('input.txt', 'rb') as fin, open('output.csv', 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for i, row in enumerate(reader):
        if not i: # first row, write header first
            writer.writerow(chainedmap(header, row))
        writer.writerow(chainedmap(cleancolumn, row))
The cleancolumn function takes any one of your columns and returns a list (possibly with only one value) after removing the braces, dropping everything up to and including the first ::, and splitting on the embedded 'comma' ([%2C]). By using itertools.chain.from_iterable() we flatten the series of lists generated from the columns into one list again for the csv writer.
When handling the first line we generate one header row from the same columns, replacing the StartPoint and EndPoint headers with the 6 expanded headers.
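The code above opens the files in binary mode, which is how the csv module was used on Python 2. As a rough sketch only, the same approach might look like this on Python 3, here run against an in-memory string rather than input.txt. The convert() helper and the shortened sample line are hypothetical names and data chosen for illustration, not part of the original answer.

```python
import csv
import io
from itertools import chain

# Same translation table as in the answer above.
HEADER_TRANSLATE = {
    'StartPoint': ('start1', 'start2', 'start3'),
    'EndPoint': ('end1', 'end2', 'end3'),
}

def header(col):
    # '{StartPoint::1[%2C]2[%2C]3}' -> ('start1', 'start2', 'start3')
    name = col.strip('{}').split('::', 1)[0]
    return HEADER_TRANSLATE.get(name, (name,))

def cleancolumn(col):
    # '{StartPoint::1[%2C]2[%2C]3}' -> ['1', '2', '3']
    value = col.strip('{}').split('::', 1)[1]
    return value.split('[%2C]')

def convert(text):
    """Hypothetical helper: convert the key::value text into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    wrote_header = False
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        if not wrote_header:
            writer.writerow(list(chain.from_iterable(map(header, row))))
            wrote_header = True
        writer.writerow(list(chain.from_iterable(map(cleancolumn, row))))
    return out.getvalue()

# Shortened, made-up sample line in the same format as input.txt.
sample = ('InnerHeight::0.1,Tag::<NULL>,'
          '{StartPoint::1.5[%2C]2.5[%2C]-3.075},{EndPoint::4[%2C]5[%2C]-6}\n')
result = convert(sample)
print(result)
```

When working with real files on Python 3, the usual pattern is to open both in text mode with newline='' instead of 'rb' and 'wb'.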

Related

Removing the end of line character from a read csv file

I tried several times to use strip() but I can't get it to work.
I removed that piece from this snippet, but every time I tried it I either got
an error or it did nothing. The sort is fine; I just want to strip the newline before writing to the new file.
import sys, csv, operator
data = csv.reader(open('tickets.csv'),delimiter=',')
sortedlist = sorted(data, key=operator.itemgetter(6))
# 0 specifies according to first column we want to sort
#now write the sort result into new CSV file
with open("newfiles.csv", "w") as f:
    #writablefile = csv.writer(f)
    fileWriter = csv.writer(f, delimiter=',')
    for row in sortedlist:
        #print(row)
        lst = (row)
        fileWriter.writerow(lst)
You need to add newline='' to your open() when writing a CSV file. This is explained in the documentation. Without it, your file can end up having a blank line per row.
import sys, csv, operator
data = csv.reader(open('tickets.csv'),delimiter=',')
header = next(data)
sortedlist = sorted(data, key=operator.itemgetter(6))
# itemgetter(6) sorts on the seventh column; use 0 to sort on the first column
#now write the sort result into a new CSV file
with open("newfiles.csv", "w", newline="") as f:
    fileWriter = csv.writer(f)
    fileWriter.writerow(header) # keep the header at the top
    fileWriter.writerows(sortedlist)
You also need to read the header row before loading everything else for sorting, so that the header is not sorted along with the data. It can then be written out separately, ahead of the sorted rows.
If your tickets.csv file contains blank lines, you would need to also remove these. For example:
for row in sortedlist:
    if row:
        fileWriter.writerow(row)
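Putting the three points together (newline='', keeping the header out of the sort, and dropping blank lines), a minimal sketch could look like the following. The sort_csv() function name and its parameters are assumptions made for this illustration. Note that csv.reader returns strings, so the sort here is lexicographic.

```python
import csv
import operator

def sort_csv(src_path, dst_path, column):
    """Hypothetical helper: sort a CSV file by one column, keeping the header first."""
    with open(src_path, newline='') as src:
        reader = csv.reader(src)
        header = next(reader)                   # keep the header out of the sort
        rows = [row for row in reader if row]   # drop blank lines
    # Values are strings, so this is a lexicographic sort.
    rows.sort(key=operator.itemgetter(column))
    # newline='' stops the text layer from translating the '\r\n' that the
    # csv writer emits, which is what causes the extra blank lines on Windows.
    with open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
```

With the file names from the question, the call would be sort_csv('tickets.csv', 'newfiles.csv', 6).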

python csv file add to field based off another field

I have a CSV file that looks like this:
I have a column called “Inventory”. I pulled the data in that column from another source, which put it in a dictionary format, as you can see.
What I need to do is iterate through the 1000+ lines and, if the keywords comforter, sheets and pillow all exist, write “bedding” to the “Location” column for that row; otherwise write “home-fashions”.
I have got as far as an if statement that tells me whether a row goes into “bedding” or “home-fashions”. I just do not know how to write the corresponding result to the “Location” field for that line.
In my script I'm printing just to see my results, but in the end I want to write to the same CSV file.
from csv import DictReader
with open('test.csv', 'r') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for line in csv_dict_reader:
        if 'comforter' in line['Inventory'] and 'sheets' in line['Inventory'] and 'pillow' in line['Inventory']:
            print('Bedding')
            print(line['Inventory'])
        else:
            print('home-fashions')
            print(line['Inventory'])
The last column of your csv contains commas. You cannot read it using DictReader.
import re
data = []
with open('test.csv', 'r') as f:
    # Get the header row
    header = next(f).strip().split(',')
    for line in f:
        # Parse 4 columns
        row = re.findall('([^,]*),([^,]*),([^,]*),(.*)', line)[0]
        # Create a dictionary of one row
        item = {header[0]: row[0], header[1]: row[1], header[2]: row[2],
                header[3]: row[3]}
        # Add each row to the list
        data.append(item)
After preparing your data, you can check with your conditions.
for item in data:
    if all([x in item['Inventory'] for x in ['comforter', 'sheets', 'pillow']]):
        item['Location'] = 'Bedding'
    else:
        item['Location'] = 'home-fashions'
Write output to a file.
import csv
# newline='' avoids extra blank lines between rows on Windows
with open('output.csv', 'w', newline='') as f:
    dict_writer = csv.DictWriter(f, data[0].keys())
    dict_writer.writeheader()
    dict_writer.writerows(data)
csv.DictReader yields each row as a dict, so just assign the new value to the column:
if 'comforter' in line['Inventory'] and ...:
    line['Location'] = 'Bedding'
else:
    line['Location'] = 'home-fashions'
print(line['Inventory'])
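To actually get the updated rows back into a CSV file, the modified dicts can be handed to csv.DictWriter. The sketch below assumes the Inventory field is quoted in the file, so that its embedded commas do not split the column; if it is not quoted, DictReader will not parse it correctly. The add_location() name, the Item column, and the sample data are made up for illustration.

```python
import csv
import io

KEYWORDS = ('comforter', 'sheets', 'pillow')

def add_location(src, dst):
    """Hypothetical helper: copy rows from src to dst, filling in 'Location'."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames)
    if 'Location' not in fieldnames:
        fieldnames.append('Location')
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    for row in reader:
        inventory = row['Inventory']
        if all(word in inventory for word in KEYWORDS):
            row['Location'] = 'bedding'
        else:
            row['Location'] = 'home-fashions'
        writer.writerow(row)

# Made-up sample; the Inventory field is quoted because it contains commas.
sample = '''Item,Location,Inventory
A1,,"{'comforter': 1, 'sheets': 2, 'pillow': 4}"
B2,,"{'towel': 3}"
'''
out = io.StringIO()
add_location(io.StringIO(sample), out)
print(out.getvalue())
```

Writing to a new file and then replacing the original is generally safer than overwriting the file while it is still being read.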

Iterating through particular rows in a csvFile in Python

I have a programming assignment that involves CSV files. So far, my only issue is obtaining values from specific rows, namely the rows that the user wants to look up.
When I got frustrated I just appended each column to a separate list, which is very slow (when the list is printed for testing) because each column has hundreds of values.
Question:
The desired rows are those whose first column (index 0) equals user_input. How can I obtain only these rows and ignore the others?
This should give you an idea:
import csv
with open('file.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    user_rows = filter(lambda row: row[0] == user_input, reader)
Python has the module csv
import csv
rows=[]
for row in csv.reader(open('a.csv','r'),delimiter=','):
    if(row[0]==user_input):
        rows.append(row)
def filter_csv_by_prefix(csv_path, prefix):
    with open(csv_path, 'r') as f:
        return tuple(filter(lambda line: line.split(',')[0] == prefix, f.readlines()))
for line in filter_csv_by_prefix('your_csv_file', 'your_prefix'):
    print(line)
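On Python 3, filter() is lazy, so a filter over a csv.reader has to be consumed while the file is still open. One way to sidestep that, shown here only as a sketch, is a small generator that yields the matching rows. The rows_matching() name and the sample data are hypothetical.

```python
import csv
import io

def rows_matching(lines, wanted):
    """Hypothetical helper: yield only the rows whose first column equals wanted."""
    for row in csv.reader(lines):
        if row and row[0] == wanted:   # 'row and' skips blank lines
            yield row

# Made-up data standing in for the CSV file.
sample = io.StringIO('alice,1,2\nbob,3,4\nalice,5,6\n')
matches = list(rows_matching(sample, 'alice'))
print(matches)
```

With a real file, pass the open file object (opened with newline='') in place of the StringIO, and consume the generator inside the with block.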

Replace column in csv with modified column

I have a CSV file with a couple of columns and a header spanning 4 rows. The first column contains the timestamp. Unfortunately it also includes fractional seconds, but whenever those are 00 they are omitted from the file. It looks like this:
"TOA5","CR1000","CR1000","E9048"
"TIMESTAMP","RECORD","BattV_Avg","PTemp_C_Avg"
"TS","RN","Volts","Deg C"
"","","Avg","Avg"
"2015-08-28 12:40:23.51",1,12.91,32.13
"2015-08-28 12:50:43.23",2,12.9,32.34
"2015-08-28 13:12:22",3,12.91,32.54
As I don't need the milliseconds, I want to get rid of those, as this makes further calculations containing time a bit complicated. My approach so far:
Extract first 20 digits in each row to get a format such as 2015-08-28 12:40:23
timestamp = []
with open(filepath) as f:
    for _ in xrange(4): #skip 4 header rows
        next(f)
    for line in f:
        time = line[1:20] #Get values for the current line
        timestamp.append(time) #Add values to list
From here on I'm struggling with how to proceed. I want to replace the first column in the csv file with the newly created timestamp list.
I tried creating a dictionary, but I don't know how to use the header caption in row 2 as the key:
d = {}
with open(filepath, 'rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for col in csv_reader:
        #use header info from row 2 as key here
This would import the whole csv file into a dict and I'd then change the TIMESTAMP entry in the dict with the timestamp list above. Is this even possible?
Or is there an easier approach on how to just change the first column in the csv with my new list so that my csv file in the end contains the timestamp just without the millisecond information?
So the first column in my csv should look like this:
"TOA5"
"TIMESTAMP"
"TS"
""
2015-08-28 12:40:23
2015-08-28 12:50:43
2015-08-28 13:12:22
This should do it and preserve the quoting:
with open(filepath1, 'rb') as fin, open(filepath2, 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
    for _ in xrange(4): # copy first 4 header rows
        writer.writerow(next(reader))
    for row in reader: # process data lines
        row[0] = row[0][:19] # strip fractional seconds from first column
        writer.writerow([row[0], int(row[1])] + map(float, row[2:]))
Since a csv.reader returns the columns of each row as a list of strings, it's necessary to convert any which contain numeric values into their actual int or float numeric value before they're written out to prevent them from being quoted.
I believe you can easily create a new csv from iterating over the original csv and replacing the timestamp as you want.
Example -
with open(filepath, 'rb') as csv_file, open('<new file>','wb') as outfile:
    csv_reader = csv.reader(csv_file, delimiter=',')
    csv_writer = csv.writer(outfile, delimiter=',')
    for i, row in enumerate(csv_reader): #Enumerating as we only need to change rows after 3rd index.
        if i <= 3:
            csv_writer.writerow(row)
        else:
            # csv.reader has already removed the quotes, so keep the first 19 characters
            csv_writer.writerow([row[0][:19]] + row[1:])
I'm not entirely sure about how to parse your csv but I would do something of the sort:
time = time.split(".")[0]
so if it does have a millisecond it would get removed and if it doesn't nothing will happen.
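The answers above are written for Python 2 (xrange, binary file modes, map() returning a list). As a hedged sketch, the split(".") idea can be combined with the csv module on Python 3 as follows. The strip_fraction() name and its header_rows parameter are assumptions for this example, and this version does not try to preserve the original quoting.

```python
import csv
import io

def strip_fraction(src, dst, header_rows=4):
    """Hypothetical helper: copy CSV rows, cutting fractional seconds from column 0."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator='\n')
    for i, row in enumerate(reader):
        if i >= header_rows and row:
            # '2015-08-28 12:40:23.51' -> '2015-08-28 12:40:23'
            # A timestamp without a fraction is left unchanged.
            row[0] = row[0].split('.')[0]
        writer.writerow(row)

# A subset of the sample data from the question.
sample = '''"TOA5","CR1000","CR1000","E9048"
"TIMESTAMP","RECORD","BattV_Avg","PTemp_C_Avg"
"TS","RN","Volts","Deg C"
"","","Avg","Avg"
"2015-08-28 12:40:23.51",1,12.91,32.13
"2015-08-28 13:12:22",3,12.91,32.54
'''
out = io.StringIO()
strip_fraction(io.StringIO(sample), out)
print(out.getvalue())
```

If the quoting has to be kept as in the input, passing quoting=csv.QUOTE_NONNUMERIC to the writer and converting the numeric columns first, as in the first answer, is one option.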

Compare two CSV files and look for matches Python

I have two CSV files that are like
CSV1
H1,H2,H3
arm,biopsy,forearm
heart,leg biopsy,biopsy
organs.csv
arm
leg
forearm
heart
skin
I need to compare both the files and get an output list like this [arm,forearm,heart,leg] but the script that I'm currently working on doesn't give me any output (I want leg also in the output, though it is mixed with biopsy in the same cell). Here's the code so far. How can I get all the matched words?
import csv
import io
alist, blist = [], []
with open("csv1.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("organs.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)
first_set = set(map(tuple, alist))
secnd_set = set(map(tuple, blist))
matches = set(first_set).intersection(secnd_set)
print matches
Try this:
import csv
alist, blist = [], []
with open("csv1.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        for row_str in row:
            alist += row_str.strip().split()
with open("organs.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist += row
first_set = set(alist)
second_set = set(blist)
print first_set.intersection(second_set)
Basically, iterating through the csv file via the csv reader returns each row as a list of items (strings) like ['arm', 'biopsy', 'forearm'], so you have to concatenate these lists to collect all of the items, splitting each cell on whitespace so that 'leg biopsy' contributes both 'leg' and 'biopsy'.
To remove duplicates, a single conversion with set() per list is enough, and the intersection method returns another set containing the common elements.
Change the part reading from csv1.csv to:
with open("csv1.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        # append all words in cell
        for word in row:
            alist.append(word)
I would treat the CSV files as text files, get a list of all the words in the first and in the second, then iterate over the first list to see which words exactly match any in the second list.
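The word-splitting and set-intersection idea can be sketched in Python 3 as below, using the sample data from the question held in memory. The words_in() helper and its skip_header flag are hypothetical names chosen for this example.

```python
import csv
import io

def words_in(lines, skip_header=False):
    """Hypothetical helper: collect every whitespace-separated word from every cell."""
    reader = csv.reader(lines)
    if skip_header:
        next(reader, None)   # discard the H1,H2,H3 line
    words = set()
    for row in reader:
        for cell in row:
            # 'leg biopsy' contributes both 'leg' and 'biopsy'
            words.update(cell.split())
    return words

# Sample data from the question.
csv1 = io.StringIO('H1,H2,H3\narm,biopsy,forearm\nheart,leg biopsy,biopsy\n')
organs = io.StringIO('arm\nleg\nforearm\nheart\nskin\n')

matches = sorted(words_in(csv1, skip_header=True) & words_in(organs))
print(matches)
```

Sorting the intersection gives a stable, list-shaped result, since sets themselves have no defined order.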
