My friend asked me to help him parse an eBay CSV file and save only a couple of important fields, so I thought it would be a good opportunity to learn Python (I mostly write C for now).
The problem is, the eBay CSV file format is giving me a hard time:
Numer rekordu sprzedaży,Nazwa użytkownika,Imię i nazwisko kupującego,Numer telefonu kupującego,Adres e-mail kupującego,Adres 1 kupującego,Adres 2 kupującego,Miejscowość kupującego,Województwo kupującego,Kod pocztowy kupującego,Kraj kupującego,Numer przedmiotu,Nazwa przedmiotu,Etykieta niestandardowa,Ilość,Cena sprzedaży,Wysyłka i obsługa,Ubezpieczenie,Koszt płatności za pobraniem,Cena łączna,Forma płatności,Data sprzedaży,Data realizacji transakcji,Data zapłaty,Data wysyłki,Opinia wystawiona,Opinia otrzymana,Uwagi własne,Identyfikator transakcji PayPal,Usługa wysyłkowa,Opcja płatności za pobraniem,Identyfikator transakcji,Identyfikator zamówienia,Szczegóły wersji
"610","xxx","John Rodriguez","(860) 000-00000","mail#yahoo.com","0 Branford Ave Bldg 11","","City","CT","00000","Stany Zjednoczone","330972592582","Honda CBR 900 RR","","1","US $21,49","US $5,50","US $0,00","","US $26,99","PayPal","23-03-2014","23-03-2014","23-03-2014","","Nie","","","4EP58","Standard Shipping from outside US","","9639014","",""
"627","yyy","Name","063100000","mail#orange.fr","Rue barillettes","","st main","Rhône","00000","Francja","3311071","Suzuki SV 650","","1","EUR 15,99","EUR 4,00","EUR 0,00","","EUR 19,99","PayPal","31-03-2014","31-03-2014","31-03-2014","","Nie","","","6E03683046","Livraison standard ? partir de l'étranger","","9659014","",""
Pobrano rekordów: 8,,od ,23-03-2014,15:06:14, do ,11-04-2014,14:32:17
Nazwa sprzedawcy: mail#gmail.com
Parsing it with csv.DictReader, as in the manual, results in every line coming out as None: list[...]
import csv

filename = "SalesHistory.csv"
csvfile = open(filename, encoding="iso-8859-2")
input_file = csv.DictReader(csvfile, quotechar='"', skipinitialspace=True)

for row in input_file:
    print(row)
{None: ['\tNumer rekordu sprzedaży', 'Nazwa użytkownika', 'Imię i nazwisko kupującego', 'Numer telefonu kupującego',
'Adres e-mail kupującego', 'Adres 1 kupującego', 'Adres 2 kupującego', 'Miejscowość kupującego',
'Województwo kupującego', 'Kod pocztowy kupującego', 'Kraj kupującego', 'Numer przedmiotu', 'Nazwa przedmiotu',
'Etykieta niestandardowa', 'Ilość', 'Cena sprzedaży', 'Wysyłka i obsługa', 'Ubezpieczenie',
'Koszt płatności za pobraniem', 'Cena łączna', 'Forma płatności', 'Data sprzedaży',
'Data realizacji transakcji', 'Data zapłaty', 'Data wysyłki', 'Opinia wystawiona', 'Opinia otrzymana',
'Uwagi własne', 'Identyfikator transakcji PayPal', 'Usługa wysyłkowa', 'Opcja płatności za pobraniem',
'Identyfikator transakcji', 'Identyfikator zamówienia', 'Szczegóły wersji']}
instead of the first line being read as the keys for the transactions in the other lines.
I read the Python csv documentation, looked at some examples, and searched Stack Overflow, but I still don't know what to do next - most of what I found covers a more 'standard' flavor of CSV.
Any tips to get me moving in the right direction would be great.
That's odd... your code didn't give me the error that you posted in your question (although I'm using Python 2.7 and you seem to be using 3.x, so maybe that's why).
Also, the file doesn't start with a blank (empty) line, does it? If it does, it will trip up the csv module, which uses the first line to guess the keys that csv.DictReader will use; with a blank line at the beginning it can't guess them. You should "clean" the file before trying to parse it with csv (removing empty lines should do the trick), or you could read row by row and skip empty rows, but that complicates using csv.DictReader (you would have to take the first non-empty row as the keys for your result dictionaries and then read the rest of the rows as the values). I'd just remove the empty lines from the file before parsing it.
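For example, a minimal pre-cleaning sketch (the "SalesHistory_clean.csv" name is just illustrative):

# copy the file, skipping blank lines, before handing it to csv.DictReader
with open("SalesHistory.csv", "rb") as src, open("SalesHistory_clean.csv", "wb") as dst:
    for line in src:
        if line.strip():          # keep only non-empty lines
            dst.write(line)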
In the code below I've added a try/except block to deal with incomplete lines (such as the last 2 lines in your sample file), but even without it, it was working pretty much OK:
import csv

filename = "SalesHistory.csv"
read_dcts = []
with open(filename, 'r') as csvfile:
    input_file = csv.DictReader(csvfile, quotechar='"', skipinitialspace=True)
    for i, dct in enumerate(input_file):
        try:
            utf_dict = dict((k.decode('utf-8'), v.decode('utf-8'))
                            for k, v in dct.items())
            read_dcts.append(utf_dict)
        except AttributeError:
            print "Weird line %d found" % (i + 1)

# Verify:
for i, dct in enumerate(read_dcts):
    print "Dict %d" % (i + 1)
    for k, v in dct.iteritems():
        print "\t%s: %s" % (k, v)
If I execute the code above, I get:
Weird line 3 found
Weird line 4 found
Dict 1
Opinia otrzymana:
Cena sprzedaży: US $21,49
[ . . . ]
Wysyłka i obsługa: US $5,50
Opcja płatności za pobraniem:
Dict 2
Opinia otrzymana:
Cena sprzedaży: EUR 15,99
[ . . . ]
Wysyłka i obsługa: EUR 4,00
Opcja płatności za pobraniem:
I've removed many of the loaded lines just for clarity's sake, but besides that, it should be loading what you wanted.
If you have an update, let me know through a comment.
EDIT:
Just in case the file contains an empty line and you don't want to pre-clean it, you can pretty much do "manually" what the DictReader class does for you (use the first non-empty line as keys, and the rest of the non-empty lines as values):
import csv

filename = "SalesHistory.csv"
read_dcts = []
keys = []
with open(filename, 'r') as csvfile:
    reader = csv.reader(csvfile, quotechar='"', skipinitialspace=True)
    for i, row in enumerate(reader):
        try:
            if len(row) == 0:
                raise IndexError("Row %d is empty. Should skip" % (i + 1))
            if len(keys) == 0:
                keys = [val.decode('utf-8') for val in row]
            elif len(row) == len(keys):
                utf_dict = dict(zip(keys, [val.decode('utf-8') for val in row]))
                read_dcts.append(utf_dict)
        except (IndexError, AttributeError), e:
            print "Weird line %d found (got %s)" % ((i + 1), e)

# Verify:
for i, dct in enumerate(read_dcts):
    print "Dict %d" % (i + 1)
    for k, v in dct.iteritems():
        print "\t%s: %s" % (k, v)
Here is a reasonably simple function that reads a CSV file, making keys of the first line in the file and values of the other lines.
import csv

def dict_from_csv(filename):
    '''
    (file) -> list of dictionaries

    Function to read a csv file and format it into a list of dictionaries.
    The headers are the keys, with all other data becoming values.
    '''
    # Open the file and read it using csv.reader().
    # For each row that has content, add it to the list mf.
    # The keys for our dicts are the first content line of the file, mf[0];
    # the values come from the remaining lines, mf[1:].
    mf = []
    with open(filename, 'r') as f:
        my_file = csv.reader(f)
        for row in my_file:
            if any(row):
                mf.append(row)
    file_keys = mf[0]
    file_values = mf[1:]
    # Combine the two lists into a list of dictionaries, using the keys list
    # as the keys and each value row as that dictionary's values.
    my_list = []
    for value in file_values:
        my_list.append(dict(zip(file_keys, value)))  # zip the keys with this row, not the whole list
    # Return the list of dictionaries.
    return my_list
Related
This is the content of my file:
david C001 C002 C004 C005 C006 C007
* C008 C009 C010 C011 C016 C017 C018
* C019 C020 C021 C022 C023 C024 C025
anna C500 C521 C523 C547 C555 C556
* C557 C559 C562 C563 C566 C567 C568
* C569 C571 C572 C573 C574 C575 C576
* C578
charlie C701 C702 C704 C706 C707 C708
* C709 C712 C715 C716 C717 C718
I want my output to be:
david=[C001,C002,C004,C005,C006,C007,C008,C009,C010,C011,C016,C017,C018,C019,C020,C021,C022,C023,C024,C025]
anna=[C500,C521,C523,C547,C555,C556,C557,C559,C562,C563,C566,C567,C568,C569,C571,C572,C573,C574,C575,C576,C578]
charlie=[C701,C702,C704,C706,C707,C708,C709,C712,C715,C716,C717,C718]
I am able to create:
david=[C001,C002,C004,C005,C006,C007]
anna=[C500,C521,C523,C547,C555,C556]
charlie=[C701,C702,C704,C706,C707,C708]
I do this by counting the number of words in a line, using line[0] as the array name, and adding the remaining words to the array.
However, I don't know how to append the continuation words from the following lines starting with "*" to the same array.
Can anyone help?
NOTE: This solution relies on defaultdict being ordered, which was introduced in Python 3.6. (A variant that avoids relying on insertion order is sketched after the examples below.)
Somewhat naive approach:
from collections import defaultdict

# Create a dictionary of people
people = defaultdict(list)

# Open up your file in read-only mode
with open('your_file.txt', 'r') as f:
    # Iterate over all lines, stripping them and splitting them into words
    for line in filter(bool, map(str.split, map(str.strip, f))):
        # Retrieve the name of the person,
        # either from the current line or from the last person processed
        name, words = list(people)[-1] if line[0] == '*' else line[0], line[1:]
        # Add all remaining words to that person's record
        people[name].extend(words)

print(people['anna'])
# ['C500', 'C521', 'C523', 'C547', 'C555', 'C556', 'C557', 'C559', 'C562', 'C563', 'C566', 'C567', 'C568', 'C569', 'C571', 'C572', 'C573', 'C574', 'C575', 'C576', 'C578']
It also has the additional benefit of returning an empty list for unknown names:
print(people['matt'])
# []
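If you can't rely on that ordering (older Pythons), a small variation along the same lines tracks the last name explicitly instead; this is only a sketch:

from collections import defaultdict

people = defaultdict(list)
last_name = None  # remember the most recent name instead of relying on dict order

with open('your_file.txt', 'r') as f:
    for line in map(str.split, f):
        if not line:            # skip blank lines
            continue
        if line[0] != '*':      # a new person's block starts here
            last_name = line[0]
        people[last_name].extend(line[1:])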
You could read the lists into a dictionary using regular expressions:
import re

with open('file_name') as file:
    contents = file.read()

res_list = re.findall(r"[a-z]+\s+[^a-z]+", contents)
res_dict = {}
for p in res_list:
    elt = p.split()
    res_dict[elt[0]] = [e for e in elt[1:] if e != '*']
print(res_dict)
I figured out a way myself. Thanks to those who gave their own solutions; they gave me a new perspective.
Below is my code:
persons_library = {}
persons = ['david', 'anna', 'charlie']
for i, person in enumerate(persons, start=0):
    persons_library[person] = []

with open('data.txt', 'r') as f:
    for line in f:
        line = line.replace('*', "")
        line = line.split()
        for i, val in enumerate(line, start=0):
            if val in persons_library:
                key = val
            else:
                persons_library[key].append(val)
print(persons_library)
I have a dictionary and a CSV file (which is actually tab delimited):
dict1:
{1 : ['Charles', 22],
2: ['James', 36],
3: ['John', 18]}
data.csv:
[ 22 | Charles goes to the cinema | Activity ]
[ 46 | John is a butcher | Profession ]
[ 95 | Charles is a firefighter | Profession ]
[ 67 | James goes to the zoo | Activity ]
I want to take the string (name) in the first item of each of dict1's values and search for it in the second column of the CSV. If the name appears in a sentence, I want to print the first (and only the first) such sentence.
But I am having a problem with the searching - how do I access the column/row data while iterating through dict1? I have tried something like this:
with open('data.csv', 'r', encoding='utf-8') as file:
    reader = csv.reader(file, delimiter='\t')
    for (id, (name, age)) in dict1.items():
        if name in reader.row[1]:  # reader.row[1] is wrong!!!
            print(reader.row[1])
Yes, roganjosh is right. A better way is to traverse the CSV file and look for any of the keys:
import csv

requested = {d[0] for d in dict1.values()}

with open('/tmp/f.csv', newline='') as csvfile:
    for row in csv.reader(csvfile, delimiter='\t'):
        sentence = row[1]
        found = {n for n in requested if n in sentence}
        for n in found:
            print(f'{n}: {sentence}')
        requested -= found
        if not requested:  # optimization, all names used
            break
EDIT: an answer to the question actually asked, not to my imagination of it.
EDIT2: after clarification (and some new requirements)... I hope I've hit the mark.
This prints each sentence only once per row; it does not check whether the same sentence appears in another row. You could use a set() to keep the matched sentences and print them once the CSV file has been processed (a sketch of that variant follows the code below).
I used a regex to match whole words, not arbitrary substrings.
import csv
import re

requested = {re.compile(r'\b' + re.escape(d[0]) + r'\b') for d in dict1.values()}

with open('/tmp/f.csv', newline='') as csvfile:
    for row in csv.reader(csvfile, delimiter='\t'):
        sentence = row[1]
        found = {n for n in requested if n.search(sentence)}
        if found:
            requested -= found
            print(sentence)
            if not requested:
                break
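A rough sketch of that set() variant (same dict1 and file path as above):

import csv
import re

# collect the matched sentences in a set and print them only after the whole
# CSV file has been processed
requested = {re.compile(r'\b' + re.escape(d[0]) + r'\b') for d in dict1.values()}
matched_sentences = set()

with open('/tmp/f.csv', newline='') as csvfile:
    for row in csv.reader(csvfile, delimiter='\t'):
        sentence = row[1]
        if any(r.search(sentence) for r in requested):
            matched_sentences.add(sentence)

for sentence in matched_sentences:
    print(sentence)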
EDIT3: recover the hit names (a new requirement – just like in a real dev project :-P)
First, you can match more than one name (see len(found)).
In the last example you can recover the name from the compiled regex (because r'\b' was added before and after the name):
found_names = [r.pattern[2:-2] for r in found]
But I don't think that's the best way.
A better way is to add the original name to requested. I decided to use a set of tuples; operations on sets are very fast.
requested = {(re.compile(r'\b' + re.escape(d[0]) + r'\b'), d[0])
             for d in dict1.values()}

with open('/tmp/f.csv', newline='') as csvfile:
    for row in csv.reader(csvfile, delimiter='\t'):
        sentence = row[1]
        found = {(r, n) for r, n in requested if r.search(sentence)}
        if found:
            found_names = tuple(n for r, n in found)
            print(found_names, sentence)
            requested -= found
            if not requested:
                break
Now the found names (the original d[0] values) are in found_names. You can use it however you want, for example to turn it into a string (replace the found_names = and print lines with):
found_names = ', '.join(n for r, n in found)
print(f'{found_names}: {sentence}')
I'm new to Python; I've read a bunch and watched a lot of videos. I can't get it to work and I'm getting frustrated.
I have a list of links like the ones below:
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
I'm trying to get Python to go to each "URL" and save it in a folder named after "Location" as filename "API".las.
ex) ......"location"/Section/"API".las
C://.../T32S R29W/Sec.27/15-119-00164.las
The file has hundreds of rows and links to download. I also wanted to implement a sleep function so as not to bombard the servers.
What are some of the different ways to do this? I've tried pandas and a few other methods... any ideas?
You will have to do something like this:
import urllib

for link, file_name in zip(links, file_names):
    u = urllib.urlopen(link)
    udata = u.read()
    f = open(file_name + ".las", "w")
    f.write(udata)
    f.close()
    u.close()
If the contents of your file are not what you wanted, you might want to look at a scraping library like BeautifulSoup for parsing.
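Since the original post also mentions wanting to pause between downloads, here is a variant of the loop above with time.sleep added (the 2-second delay is an arbitrary choice):

import time
import urllib

for link, file_name in zip(links, file_names):
    u = urllib.urlopen(link)
    udata = u.read()
    u.close()
    with open(file_name + ".las", "w") as f:
        f.write(udata)
    time.sleep(2)  # pause between requests so the server isn't hammered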
This might be a little dirty, but it's a first pass at solving the problem. This is all contingent on each value in the CSV being encompassed in double quotes. If this is not true, this solution will need heavy tweaking.
Code:
import os

csv = """
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
""".strip()  # trim excess space at top and bottom

root_dir = '/tmp/so_test'

lines = csv.split('\n')  # break CSV on newlines
header = lines[0].strip('"').split('","')  # grab first line and consider it the header

lines_d = []  # we're about to perform the core actions, storing the results in this variable
for l in lines[1:]:  # we want all lines except the top line, which is a header
    line_broken = l.strip('"').split('","')  # strip off leading and trailing double-quote
    line_assoc = zip(header, line_broken)  # creates pairs out of the line, with the header at the matching position as key
    line_dict = dict(line_assoc)  # turn this into a dict
    lines_d.append(line_dict)

    section_parts = [s.strip() for s in line_dict['Location'].split(',')]  # break the Location value to get the pieces we need

    file_out = os.path.join(root_dir, '%s%s%s%sAPI.las' % (section_parts[0], os.path.sep, section_parts[1], os.path.sep))  # format output filename the way I think is requested

    # stuff to show what's actually put in the files
    print file_out, ':'
    print ' ', '"%s"' % ('","'.join(header),)
    print ' ', '"%s"' % ('","'.join(line_dict[h] for h in header))
output:
~/so_test $ python so_test.py
/tmp/so_test/T32S R29W/Sec. 27/API.las :
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880800","37.2354869","-100.4607509","T32S R29W, Sec. 27, SW SW NE","Stanolind Oil and Gas Co.","William L. Rickers 1","15-119-00164","2705"," KB","2790","7652","http://www.kgs.ku.edu/WellLogs/32S29W/1043696830.zip"
/tmp/so_test/T34S R26W/Sec. 2/API.las :
"KGS ID","Latitude","Longitude","Location","Operator","Lease","API","Elevation","Elev_Ref","Depth_start","Depth_stop","URL"
"1002880821","37.1234622","-100.1158111","T34S R26W, Sec. 2, NW NW NE","SKELLY OIL CO","GRACE MCKINNEY 'A' 1","15-119-00181","2290"," KB","4000","5900","http://www.kgs.ku.edu/WellLogs/34S26W/1043696831.zip"
~/so_test $
Approach 1:
Suppose your file has 1000 rows.
Create a master list which stores the data in this form ->
[row1, row2, row3, and so on]
Once that's done, loop through this master list. In every iteration you get one row as a string.
Split it to make a list, slice out the last column (the URL, i.e. row[-1]),
and append it to an empty list named result_url (a rough sketch follows below). Once this has run for all rows, save the list to a file; you can easily create a directory using the os module and move your file there.
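A rough sketch of approach 1 (the file names are placeholders, and the naive comma split assumes the URL column itself contains no commas):

# Approach 1, roughly: read everything into a master list, then pull out the URLs
with open('data.csv', 'r') as f:
    masterlist = f.readlines()               # [row1, row2, row3, ...] as strings

result_url = []
for row in masterlist[1:]:                   # skip the header row
    columns = row.strip().split(',')         # naive split; the URL is still the last piece
    result_url.append(columns[-1].strip('"'))

with open('urls.txt', 'w') as out:
    out.write('\n'.join(result_url))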
Approach 2:
If the file is too huge, read it line by line inside a try block and process your data (using the csv module you can get each row as a list, slice out the URL and write it to the file API.las each time).
Once your program tries to move past the last line it will fall into the except block, where you can just 'pass' or print a message to get notified.
In approach 2 you are not keeping all the data in any data structure; you only hold a single row while processing it, so it is faster.
import csv, os

directory_creater = os.mkdir('Locations')
fme = open('./Locations/API.las', 'w+')

with open('data.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    print spamreader.next()
    while True:
        try:
            row = spamreader.next()
            get_url = row[-1]
            to_write = get_url + '\n'
            fme.write(to_write)
        except:
            print "Program has run. Check output."
            exit(1)
This code can do all that you mentioned, efficiently and in less time.
Below is a section from an app I have been working on. The section is used to update a text file with addValue. At first I thought it was working, but it seems to add extra lines and it is also very, very slow.
trakt_shows_seen is a dictionary of shows; one show's section looks like:
{'episodes': [{'season': 1, 'playcount': 0, 'episode': 1}, {'season': 1, 'playcount': 0, 'episode': 2}, {'season': 1, 'playcount': 0, 'episode': 3}], 'title': 'The Ice Cream Girls'}
The section should search the file for each title, season and episode and, when found, check whether the line has a watched marker (checkValue). If it does, it changes it to addValue; if it does not, it should add addValue to the end of the line.
A line from the file:
_F /share/Storage/NAS/Videos/Tv/The Ice Cream Girls/Season 01/The Ice Cream Girls - S01E01 - Episode 1.mkv _ai Episode 1 _e 1 _r 6.5 _Y 71 _s 1 _DT 714d861 _et Episode 1 _A 4379,4376,4382,4383 _id 2551 _FT 714d861 _v c0=h264,f0=25,h0=576,w0=768 _C T _IT 717ac9d _R GB: _m 1250 _ad 2013-04-19 _T The Ice Cream Girls _G d _U thetvdb:268910 imdb:tt2372806 _V HDTV
So my question: is there a better, faster way? Can I load the file into memory (the file is around 1 MB), change the required lines and then save the file, or can anyone suggest another method that will speed things up?
Thanks for taking the time to look.
EDIT
I have changed the code quite a lot and it does work a lot faster, but the output is not as expected; for some reason it writes lines_of_interest to the file even though there is no code to do this??
I also have not yet added any encoding options, but as the file is in UTF-8 I suspect there will be an issue with accented titles.
import re

if trakt_shows_seen:
    addValue = "\t_w\t1\t"
    replacevalue = "\t_w\t0\t"
    with open(OversightFile, 'rb') as infile:
        p = '\t_C\tT\t'
        for line in infile:
            if p in line:
                tv_offset = infile.tell() - len(line) - 1  # Find first TV in file, search from here
                break

        lines_of_interest = set()
        for show_dict in trakt_shows_seen:
            for episode in show_dict['episodes']:
                p = re.compile(r'\t_s\t('+str(episode["season"])+')\t.*\t_T\t('+show_dict["title"]+')\t.*\t_e\t('+str(episode["episode"])+')\t')
                infile.seek(tv_offset)  # search from the first TV show
                for line in infile:
                    if p.findall(line):
                        search_offset = infile.tell() - len(line) - 1
                        lines_of_interest.add(search_offset)  # all lines that need to be changed

    with open(OversightFile, 'rb+') as outfile:
        for lines in lines_of_interest:
            for change_this in outfile:
                outfile.seek(lines)
                if replacevalue in change_this:
                    change_this = change_this.replace(replacevalue, addValue)
                    outfile.write(change_this)
                    break  # Only check 1 line
                elif not addValue in change_this:
                    # change_this.extend(('_w', '1'))
                    change_this = change_this.replace("\t\n", addValue + "\n")
                    outfile.write(change_this)
                    break  # Only check 1 line
Aham -- you are opening, reading and rewriting your file in every repetition of your for loop - once for each episode of each show. Few things in the whole Multiverse could be slower than that.
You can go along the same lines - just read your whole file once, before the for loops,
iterate over the list you read, and write everything back to disk, just once -
more or less:
import re

if trakt_shows_seen:
    addValue = "\t_w\t1\t"
    checkvalue = "\t_w\t0\t"
    print ' %s TV shows episodes playcount will be updated on Oversight' % len(trakt_shows_seen)

    myfile_list = open(file).readlines()
    for show in trakt_shows_seen:
        print ' --> ' + show['title'].encode('utf-8')
        for episode in show['episodes']:
            print '    Season %i - Episode %i' % (episode['season'], episode['episode'])
            p = re.compile(r'\t_s\t('+str(episode["season"])+')\t.*\t_T\t('+show["title"]+')\t.*\t_e\t('+str(episode["episode"])+')\t')

            newList = []
            for line in myfile_list:
                if p.findall(line):
                    if checkvalue in line:
                        line = line.replace(checkvalue, addValue)
                    elif not addValue in line:
                        line = line.strip("\t\n") + addValue + "\n"
                newList.append(line)
            myfile_list = newList  # was "newlist" (a NameError); carry the updated lines into the next pass

    outref = open(file, 'w')
    outref.writelines(newList)
    outref.close()
This is still far from optimal - but it is the least amount of change to your code that stops what is slowing it down so much.
You're rereading and rewriting your entire file for every episode of every show you track - of course this is slow. Don't do that. Instead, read the file once. Parse out the show title and season and episode numbers from each line (probably using the csv built-in library with delimiter='\t'), and see if they're in the set you're tracking. Make your substitution if they are, and write the line either way.
It's going to look something like this:
title_index = # whatever column number has the show title
season_index = # whatever column number has the season number
episode_index = # whatever column number has the episode number
with open('somefile', 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t')
    modified_lines = []
    for line in reader:
        showtitle = line[title_index]
        if showtitle in trakt_shows_seen:
            season_number = int(line[season_index])
            episode_number = int(line[episode_index])
            if any(x for x in trakt_shows_seen[showtitle]
                   if x['season'] == season_number and x['episode'] == episode_number):
                # line matches a tracked episode
                if '_w' in line:  # list.index() raises ValueError rather than returning -1, so test membership first
                    watch_count_index = line.index('_w')
                    # possible check value found - you may be able to skip straight to assigning the next element to '1'
                    if line[watch_count_index + 1] == '0':
                        # check value found, replace
                        line[watch_count_index + 1] = '1'
                    elif line[watch_count_index + 1] != '1':
                        # not sure what you want to do if something like \t_w\t2\t is present
                        line[watch_count_index + 1] = '1'
                else:
                    line.extend(('_w', '1'))
        modified_lines.append(line)

with open('somefile', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerows(modified_lines)
The exact details will depend on how strict your file format is - the more you know about the structure of the line beforehand the better. If the indices of the title, season and episode fields vary, probably the best thing to do is iterate once through the list representing the line looking for the relevant markers.
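For the variable-position case, a rough sketch of locating fields by marker (field_after is just an illustrative helper; it assumes each marker is immediately followed by its value, as in the sample line):

def field_after(line, marker):
    """Return the element following `marker` in the split line, or None if absent."""
    try:
        return line[line.index(marker) + 1]
    except ValueError:
        return None

showtitle = field_after(line, '_T')
season_number = int(field_after(line, '_s'))
episode_number = int(field_after(line, '_e'))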
I have skipped over error checking - depending on your confidence in the original file you might want to ensure that season and episode numbers can be converted to ints, or stringify your trakt_shows_seen values. The csv reader will return encoded bytestrings, so if show names in trakt_shows_seen are Unicode objects (which they don't appear to be in your pasted code) you should either decode the csv reader's results or encode the dictionary values.
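For example, decoding the reader's output would look something like this (assuming the file really is UTF-8):

line = [field.decode('utf-8') for field in line]  # turn each bytestring field into unicode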
I personally would probably convert trakt_shows_seen to a set of (title, season, episode) tuples, for more convenient checking to see if a line is of interest - at least if the field numbers for title, season and episode are fixed. I would also write to my output file (under a different filename) as I read the input file, rather than keeping a list of lines in memory (a rough sketch of that follows the loop snippet below); that would allow some sanity checking with, say, a shell's diff utility before overwriting the original input.
To create a set from your existing dictionary - to some extent it depends on exactly what format trakt_shows_seen uses. Your example shows an entry for one show, but doesn't indicate how it represents more than one show. For now I'm going to assume it's a list of such dictionaries, based on your attempted code.
shows_of_interest = set()
for show_dict in trakt_shows_seen:
    title = show_dict['title']
    for episode_dict in show_dict['episodes']:
        shows_of_interest.add((title, episode_dict['season'], episode_dict['episode']))
Then in the loop that reads the file:
# the rest as shown above
season_number = int(line[season_index])
episode_number = int(line[episode_index])
if (showtitle, season_number, episode_number) in shows_of_interest:
    # line matches a tracked episode
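And a rough sketch of the write-as-you-read idea mentioned above (the '.new' suffix is just an illustrative choice):

import csv

with open('somefile', 'rb') as infile, open('somefile.new', 'wb') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    for line in reader:
        # ... examine and modify `line` exactly as in the loop above ...
        writer.writerow(line)
# compare 'somefile' and 'somefile.new' (e.g. with diff) before replacing the original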
I've been at this a while now, and I think it's in my best interest to ask the advice of the experts. I know I'm not writing this the best way possible, and I've gone down a rabbit hole and confused myself.
I have a csv. A bunch, actually. That part is not the problem.
The lines at the top of the CSV are not really CSV data, but they do contain an important piece of info: the date for which the data is valid. For certain kinds of report it is on one line, and for others it is on another.
My data starts some number of lines down from the top, usually 10 or 11, but I can't always be certain. I do know that the first column always has the same info (the header of the table of data).
I want to pull the report date from the preceding lines, and for file type A do stuffA, and for file type B do stuffB, then write out that row to a new file. I'm having a problem incrementing the row and I have no idea what I'm doing wrong.
Sample data:
"Attribute ""OPSURVEYLEVEL2_O"" [Category = ""Retail v1""]"
Date exported: 2/16/13
Exported by user: William
Project:
Classification: Online Retail v1
Report type: Attributes
Date range: from 12/14/12 to 12/14/12
"Filter OpSurvey Level 2(mine): [ LEVEL:SENTENCE TYPE:KEYWORD {OPSURVEYLEVEL2_O:""gift certificate redemption"", OPSURVEYLEVEL2_O:""combine accounts"", OPSURVEYLEVEL2_O:""cancel account"", OPSURVEYLEVEL2_O:""saved project moved to purchased project"", OPSURVEYLEVEL2_O:""unlock account"", OPSURVEYLEVEL2_O:""affiliate promotions"", OPSURVEYLEVEL2_O:""print to store coupons"", OPSURVEYLEVEL2_O:""disclaimer not clear"", OPSURVEYLEVEL2_O:""prepaid issue"", OPSURVEYLEVEL2_O:""customer wants to use coupons for print to store"", OPSURVEYLEVEL2_O:""customer received someone else's order"", OPSURVEYLEVEL2_O:""hi-res images unavailable"", OPSURVEYLEVEL2_O:""how to re-order"", OPSURVEYLEVEL2_O:""missing items"", OPSURVEYLEVEL2_O:""missing envelopes: print to store"", OPSURVEYLEVEL2_O:""missing envelopes: mail order"", OPSURVEYLEVEL2_O:""group rooms"", OPSURVEYLEVEL2_O:""print to store"", OPSURVEYLEVEL2_O:""print to store coupons"", OPSURVEYLEVEL2_O:""publisher: card not available for print to store"", OPSURVEYLEVEL2_O:publisher}]"
Total: 905
OPSURVEYLEVEL2_O,Distinct Document,% of Document,Sentiment Score
PRINT TO STORE,297,32.82,-0.1
...
Sample Code
#!/usr/bin/python
import csv, os, glob, sys, errno

path = '/path/to/Downloads'

for infile in glob.glob(os.path.join(path, 'report_ATTRIBUTE_OP*.csv')):
    if 'OPSURVEYLEVEL2' in infile:
        prime_column = 'ops2'
    elif 'OPSURVEYLEVEL3' in infile:
        prime_column = 'ops3'
    else:
        sys.exit(errno.ENOENT)
    with open(infile, "r") as csvfile:
        reader = csv.reader(csvfile)
        report_date = 'DATE NOT FOUND'
        # import pdb; pdb.set_trace()
        for row in reader:
            foo = 0
            while foo < 1:
                if row[0][0:].find('OPSURVEYLEVEL') == 0:
                    foo = 1
                if "Date range" in row:
                    report_date = row[0][-8:]
                break
            if foo >= 1:
                if row[0][0:].find('OPSURVEYLEVEL') == 0:
                    break
                if 'ops2' in prime_column:
                    dup_col = row[0]
                    row.insert(0, dup_col)
                    row.append(report_date)
                elif 'ops3' in prime_column:
                    row.append(report_date)
                with open('report_merge.csv', 'a') as outfile:
                    outfile.write(row)
            reader.next()
There are two problems that I can see in this code.
The first is that the code won't find the date range in row. The line:
if "Date range" in row:
... should be:
if "Date range" in row[0]:
The second is that the code:
if row[0][0:].find('OPSURVEYLEVEL') == 0:
break
... is breaking out of the for loop after the header line of the data table, because that is the closest enclosing loop. I suspect that there was another while in there somewhere in a previous version of this code.
The code is simpler (and bug-free) with an if statement instead of the while and if, as follows:
foo = 0  # initialize once, before the loop
for row in reader:
    if foo < 1:
        if row[0][0:].find('OPSURVEYLEVEL') == 0:
            foo = 1
        if "Date range" in row[0]:  # Changed this line
            print("found report date")
            report_date = row[0][-8:]
    else:
        print(row)
        if row[0][0:].find('OPSURVEYLEVEL') == 0:
            break
        if 'ops2' in prime_column:
            dup_col = row[0]
            row.insert(0, dup_col)
            row.append(report_date)
        elif 'ops3' in prime_column:
            row.append(report_date)
        with open('report_merge.csv', 'a') as outfile:
            outfile.write(','.join(row) + '\n')