Python csvreader separate lines - python

I am using the csv module for Python. I have had a good look at the CSV File Reading and Writing guide. I want to write a loop that runs through each row in the CSV file and assigns each row to a different variable. Does anyone have any ideas on this?
I am aware that there is a .next() and .line_num, I didn't think that these would be suitable in this case although I might be wrong.
Currently I have the following code, which prints out the whole CSV file:
print_csv = csv.reader(open(csv_name, 'rb'), delimiter=' ', quotechar='|')
for row in print_csv:
    print ', '.join(row)
[EDIT]
I am now aware, from this question thread, that the best way to do this will depend on what the first line is going to be used for.
What I want to do with the first line of the CSV file is to check whether it is in the correct format. This would involve:
checking to see whether it has the expected number of columns
checking to see whether the column headers have the correct name
checking to see whether the columns are in the correct order.

1.- Fast Answer
Instead of setting different independent variables you could do:
mydict = {}
for idx, item in enumerate(reader):
    mydict['var%i' %idx] = item
then you call your var like:
mydict['var0']
Or still shorter in py3k:
mydict = {'var%i' %idx : item for idx, item in enumerate(reader)}
But this doesn't make much sense when applied this way.
As a commenter said, this is no different from doing directly:
mylist = list(reader)
and then
mylist[0] # instead of 'var0'
and this option is much better.
The dictionary strategy is best suited when you extract the dictionary key from the very same reader line. For example, if it were at position 0:
mydict = {item[0] : item for item in reader}
2.- The Proper Answer
But if what you want is simply to check the format of the first line (maybe to calculate the space you need for printing), the method could be:
line = reader.next()
like_this = check_how_is_my(line)
if like_this == 'something_long':
    spaces = 23
else:
    spaces = 0
while True:
    try:
        print_with_spaces(line, spaces)
        line = reader.next()  # advance to the next row; raises StopIteration at the end
    except StopIteration:
        break

Well, you can obviously do:
var1 = reader.next()
var2 = reader.next()
var3 = reader.next()
var4 = reader.next()
var5 = reader.next()
or any variation thereof. This is not my favorite coding style, but it works.
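The three header checks listed in the question (column count, column names, column order) can be sketched roughly as follows. This is only an illustration written for Python 3, where `next(reader)` replaces `reader.next()`. The names `EXPECTED_HEADER` and `check_header` and the sample data are hypothetical and not part of the original code.

```python
import csv
import io

# Hypothetical expected header; replace with the real column names.
EXPECTED_HEADER = ["name", "order_num", "amt_due"]

def check_header(header, expected=EXPECTED_HEADER):
    """Return a list of problems found in the header row (empty if it is fine)."""
    problems = []
    if len(header) != len(expected):
        problems.append("expected %d columns, got %d" % (len(expected), len(header)))
    if set(header) != set(expected):
        problems.append("unexpected column names: %r" % (header,))
    elif header != expected:
        problems.append("columns are in the wrong order")
    return problems

# In-memory stand-in for the real file, so the sketch runs on its own.
sample = io.StringIO("name,order_num,amt_due\nalice,1,9.50\nbob,2,3.25\n")
reader = csv.reader(sample)
header = next(reader)            # consumes only the first row
problems = check_header(header)
rows = list(reader)              # the remaining rows, header excluded
```

Because the reader is an iterator, taking the header with `next()` leaves it positioned on the first data row, so the rest of the file can be processed with an ordinary for loop.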

Related

Python simple nested loop

I'm trying to do something very simple: I have a recurring csv file that may have repetitions of emails and I need to find how many times each email is repeated, so I did as follows:
file = open('click.csv')
reader = csv.reader(file)
for row in reader:
    email = row[0]
    print (email) # just to check which value is parsing
    counter = 1
    for row in reader:
        if email == row[0]:
            counter += 1
    print (counter) # just to check if it counted correctly
and it only prints out:
firstemailaddress
2
Indeed there are 2 occurrences of the first email, but somehow this stops after the first email in the csv file.
So I simplified it to
for row in reader:
    email = row[0]
    print (email)
and this indeed prints out all the email addresses in the csv file.
This is a simple nested loop, so what's the deal here?
Of course, just checking occurrences could be done without a script, but I then have to process those emails and the data related to them and merge it with another csv file, so that's why.
Many thanks,
The problem with your first snippet comes down to a misunderstanding of iterators, or how csv.reader works.
Your reader object is an iterator. That means it yields rows, and similar to a generator object, it has a certain "state" between iterations. Every time you iterate over one of its elements - in this case rows, you are "consuming" the next available row, until you've consumed all rows and the iterator is entirely exhausted. Here's an example of a different kind of iterator being exhausted:
Imagine you have a text file, file.txt with these lines:
hello
world
this
is
a
test
Then this code:
with open("file.txt", "r") as file:
    print("printing all lines for the first time:")
    for line in file:
        # strip the trailing newline character
        print(line.rstrip())
    print("printing all lines for the second time:")
    for line in file:
        # strip the trailing newline character
        print(line.rstrip())
    print("Done!")
Output:
printing all lines for the first time:
hello
world
this
is
a
test
printing all lines for the second time:
Done!
>>>
If this output surprises you, then it's because you've misunderstood how iterators work. In this case, file is an iterator, that yields lines. The first for-loop exhausts all available lines in the file iterator. This means the iterator will be exhausted by the time we reach the second for-loop, and there are no lines left to print.
The same thing is true for your reader. You're consuming rows from your csv-file for every iteration of your outer for-loop, and then consuming another row from the inner for-loop. You can expect to have your code behave strangely when you consume your rows in this way.
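The same exhaustion can be observed directly on a csv.reader. The following is a small sketch for Python 3 that uses an in-memory `io.StringIO` object in place of click.csv; the addresses are made up for the illustration.

```python
import csv
import io

data = io.StringIO("a@example.com,1\nb@example.com,2\na@example.com,3\n")
reader = csv.reader(data)

first_pass = [row[0] for row in reader]    # consumes every row
second_pass = [row[0] for row in reader]   # nothing left to read

# Materializing the rows in a list allows repeated passes.
data.seek(0)
rows = list(csv.reader(data))
third_pass = [row[0] for row in rows]
fourth_pass = [row[0] for row in rows]
```

Here `second_pass` ends up empty, while both passes over the list `rows` see all three addresses.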
You cannot use the reader that way: it is stream based and cannot be "wound back" as you try to do. You also never close your file.
Reading the file multiple times is not needed. You can get all the information in one pass through your file, using a dictionary to count the email addresses:
# create demo file
with open("click.csv", "w") as f:
    f.write("email#somewhere, other , value, in , csv\n" * 4)
    f.write("otheremail#somewhere, other , value, in , csv\n" * 2)
Process demo file:
from collections import defaultdict
import csv
emails = defaultdict(int)
with open('click.csv') as file:
    reader = csv.reader(file)
    for row in reader:
        email = row[0]
        print (email) # just to check which value is parsing
        emails[email] += 1
for adr,count in emails.items():
    print(adr, count)
Output:
email#somewhere 4
otheremail#somewhere 2
See:
Why can't I call read() twice on an open file?
defaultdict
As answered already, the problem is that reader is an iterator, so it is only good for a single pass. You can just put all the items in a container, like a list.
However, you only need a single pass to count things. Using a dict, the most basic approach is:
counts = {}
for row in reader:
    email = row[0]
    if email in counts:
        counts[email] += 1
    else:
        counts[email] = 1
There are even cleaner ways. For example, using a collections.Counter object, which is just a dict specialized for counting:
import collections
counts = collections.Counter(row[0] for row in reader)
Or even:
counts = collections.Counter(email for email, *_ in reader)
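As a quick check of the Counter approach, here is a minimal self-contained sketch for Python 3. It reads from an in-memory string rather than a real file, and the sample rows are invented for the example.

```python
import collections
import csv
import io

data = io.StringIO("a@example.com,x\nb@example.com,y\na@example.com,z\n")
counts = collections.Counter(row[0] for row in csv.reader(data))

top = counts.most_common(1)      # list of (email, count) pairs, highest count first
unseen = counts["nobody@example.com"]  # a Counter returns 0 for missing keys
```

A Counter behaves like a dict, so `counts.items()` can be iterated in the same way as in the defaultdict version.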
Try appending your email IDs to a list, then do the following:
import pandas as pd
email_list = ["abc#something.com","xyz#something.com","abc#something.com"]
series = pd.Series(email_list)
print(series.value_counts())
You will get output like:
abc#something.com 2
xyz#something.com 1
dtype: int64
The problem is that reader is a handle on the file (it is a stream).
You can walk through it in only one direction and cannot go back.
This is similar to how generators are "consumed" by walking through them once.
But what you need is to iterate again and again, if you want to use nested for-loops.
In any case, that is not an efficient approach, because you do not want to count again the rows you have already counted.
So for your purpose, the best option is to create a dictionary and
go row by row: if there is no entry in the dictionary for this email, create a new key for the email with its count as the value; otherwise increase the count.
import csv
file = open('click.csv')
reader = csv.reader(file)
email_counts = {} # initiate dictionary
for row in reader:
    email_counts[row[0]] = email_counts.get(row[0], 0) + 1
That's it!
email_counts[row[0]] = assigns a new value for that particular email in the dictionary.
The whole trick is in email_counts.get(row[0], 0) + 1.
email_counts.get(row[0]) looks up the same entry as email_counts[row[0]], except that it does not raise a KeyError when the key is missing.
The additional argument 0 is the default value returned in that case.
So this means: check whether row[0] has an entry in email_counts. If yes, return its value. Otherwise, if this is the first occurrence, return 0. Whatever is returned, increase it by 1. This handles both the membership check and the increment in one step.
Finally, email_counts will give you all entries with their counts.
And the best part: you go through the file only once!
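The behaviour of dict.get that this answer relies on can be seen in isolation. This is a small sketch for Python 3, with invented addresses standing in for the values of row[0].

```python
email_counts = {}

missing_with_get = email_counts.get("a@example.com")         # None, no exception
missing_with_default = email_counts.get("a@example.com", 0)  # 0

try:
    email_counts["a@example.com"]     # plain indexing raises for a missing key
    raised = False
except KeyError:
    raised = True

# Stand-in for the values of row[0] coming from the reader.
for email in ["a@example.com", "b@example.com", "a@example.com"]:
    email_counts[email] = email_counts.get(email, 0) + 1
```

After the loop, `email_counts` maps each address to the number of times it appeared.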
Not sure I get your question, but if you want a counter of emails you should not use those nested loops; just use one loop with a dictionary, like:
cnt_mails[email] = cnt_mails.get(email, 0) + 1
This stores the count. Your code is not working because you have two loops on the same iterator.
The problem is that reader is an iterator and you are depleting it with your second loop.
If you did something like:
with open('click.csv') as file:
    lines = list(csv.reader(file))
    for row in lines:
        email = row[0]
        print (email) # just to check which value is parsing
        counter = 1
        for row in lines:
            if email == row[0]:
                counter += 1
        print (counter) # just to check if it counted correctly
You should get what you are looking for.
A simpler implementation:
import csv
from collections import defaultdict
counter = defaultdict(int)
with open('click.csv') as file:
    reader = csv.reader(file)
    for row in reader:
        counter[row[0]] += 1
# counter is now a dictionary of each email address and the number of times it was seen.

Python: writing a text file with a lot of variables

I need to create (from Python code) a text file with some 50 variables in each line, separated by commas. I take the canonical way to be
output.write ("{},{},{},{},{},{},{},{},{},{}, ... \n".format(v,v,v,v,...
But that will be hard to read and difficult to maintain with such a lot of variables. Any other suggestions? I have thought of using the csv module, after all what I am writing is (kind of) a csv file, but thought I'd hear around for other suggestions first.
Using lists
When reaching a handful of variables that are related to each other, it is common to use a list or a dict. If you create a list:
myrow = []
myrow.append(v1)
...
This also allows for easier looping over each value. Once you have done that you can easily concatenate it to a string:
f.write(','.join(myrow))
In case your row might contain any commas (or whatever you use as a delimiter), you must ensure they are escaped. In this case the csv module helps:
import csv
with open('myfile.csv', 'w') as f:
    fw = csv.writer(f)
    fw.writerow(myrow) # where myrow is a list
Using dicts
Some people prefer to add additional structure, e.g.:
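myrow = {}
myrow['speed'] = speed_value
myrow['some_other_row'] = other_value
import csv
with open('myfile.csv', 'w') as f:
    # DictWriter writes the values; a plain csv.writer would write only the keys
    fw = csv.DictWriter(f, fieldnames=list(myrow))
    fw.writeheader()
    fw.writerow(myrow)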
If your variable names can be sorted, you could use locals().
For example:
I set three variables:
var1 = 'xx'
var2 = 'yy'
var3 = 'zz'
and I can sort them with sorted().
def sort(x):
    if len(x) != 4:
        return '99'
    else:
        return x[-1]
sortVars = sorted(locals(), key=sort)
Then combine them with a for loop.
result = ''
for i in sortVars[:3]:
    result += locals()[i]
print(result)

Extracting variable names and data from csv file using Python

I have a csv file that has each line formatted with the line name followed by 11 pieces of data. Here is an example of a line.
CW1,0,-0.38,2.04,1.34,0.76,1.07,0.98,0.81,0.92,0.70,0.64
There are 12 lines in total, each with a unique name and data.
What I would like to do is extract the first cell from each line and use that to name the corresponding data, either as a variable equal to a list containing that line's data, or maybe as a dictionary, with the first cell being the key.
I am new to working with inputting files, so the farthest I have gotten is to read the file in using the stock solution in the documentation
import csv
path = r'data.csv'
with open(path,'rb') as csvFile:
    reader = csv.reader(csvFile,delimiter=' ')
    for row in reader:
        print(row[0])
I am failing to figure out how to assign each row to a new variable, especially when I am not sure what the variable names will be (this is because the csv file will be created by a user other than myself).
The destination for this data is a tool that I have written. It accepts lists as input such as...
CW1 = [0,-0.38,2.04,1.34,0.76,1.07,0.98,0.81,0.92,0.70,0.64]
so this would be the ideal end solution. If it is easier, and considered better to have the output of the file read be in another format, I can certainly re-write my tool to work with that data type.
As Scironic said in their answer, it is best to use a dict for this sort of thing.
However, be aware that dict objects did not guarantee any "order" before Python 3.7, so on older versions the order of the rows will be lost if you use one. If this is a problem, you can use an OrderedDict instead (which is just what it sounds like: a dict that "remembers" the order of its contents):
import csv
from collections import OrderedDict as od
data = od() # ordered dict object remembers the order in the csv file
with open(path,'rb') as csvFile:
    reader = csv.reader(csvFile, delimiter = ' ')
    for row in reader:
        data[row[0]] = row[1:] # Slice the row up into 0 (first item) and 1: (remaining)
Now if you go looping through your data object, the contents will be in the same order as in the csv file:
for d in data.values():
    myspecialtool(*d)
You need to use a dict for these kinds of things (dynamic variables):
import csv
path = r'data.csv'
data = {}
with open(path,'rb') as csvFile:
    reader = csv.reader(csvFile,delimiter=' ')
    for row in reader:
        data[row[0]] = row[1:]
dicts are especially useful for dynamic variables and are the best method to store things like this. to access you just need to use:
data['CW1']
This solution also means that if you add any extra rows in with new names, you won't have to change anything.
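Since the tool in the question expects lists of numbers, the values can be converted while the dict is built. The following is a hedged sketch for Python 3. It assumes the file is comma separated, as the sample line in the question suggests, so it uses the default delimiter rather than delimiter=' '. The second sample row (CW2) is invented, and an in-memory string stands in for data.csv.

```python
import csv
import io

# In-memory stand-in for data.csv; the CW2 row is made up for the example.
sample = io.StringIO(
    "CW1,0,-0.38,2.04\n"
    "CW2,1,0.25,-1.5\n"
)

data = {}
for row in csv.reader(sample):       # default delimiter is a comma
    name, values = row[0], row[1:]
    data[name] = [float(v) for v in values]   # csv yields strings, so convert
```

With this, `data['CW1']` is a list of floats that can be passed straight to a function expecting numeric input.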
If you are desperate to have the variable names in the global namespace and not within a dict, use exec (N.B. IF ANY OF THIS USES INPUT FROM OUTSIDE SOURCES, USING EXEC/EVAL CAN BE HIGHLY DANGEROUS (rm * level) SO MAKE SURE ALL INPUT IS CONTROLLED AND UNDERSTOOD BY YOURSELF).
with open(path,'rb') as csvFile:
    reader = csv.reader(csvFile,delimiter=' ')
    for row in reader:
        exec("{} = {}".format(row[0], row[1:]))
In python, you can use slicing: row[1:] will contain the row, except the first element, so you could do:
>>> d={}
>>> with open("f") as f:
...     c = csv.reader(f, delimiter=',')
...     for r in c:
...         d[r[0]]=map(int,r[1:])
...
>>> d
{'var1': [1, 3, 1], 'var2': [3, 0, -1]}
Regarding variable variables, check How do I do variable variables in Python? or How to get a variable name as a string in Python?. I would stick to dictionary though.
An alternative to using the proper csv library could be as follows:
path = r'data.csv'
csvRows = open(path, "r").readlines()
dataRows = [[float(col) for col in row.rstrip("\n").split(",")[1:]] for row in csvRows]
for dataRow in dataRows: # Where dataRow is a list of numbers
    print dataRow
You could then call your function where the print statement is.
This reads the whole file in and produces a list of lines with trailing newlines. It then removes each newline and splits each row into a list of strings. It skips the initial column and calls float() for each entry, resulting in a list of lists. Whether this is suitable depends on how important the first column is.

Trouble with Python order of operations/loop

I have some code that is meant to convert CSV files into tab delimited files. My problem is that I cannot figure out how to write the correct values in the correct order. Here is my code:
for file in import_dir:
    data = csv.reader(open(file))
    fields = data.next()
    new_file = export_dir+os.path.basename(file)
    tab_file = open(export_dir+os.path.basename(file), 'a+')
    for row in data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        tab_file.write(item['name']+'\t'+item['order_num']...)
        tab_file.write('\n'+item['amt_due']+'\t'+item['due_date']...)
Now, since both my write statements are in the for row in data loop, my headers are being written multiple times over. If I outdent the first write statement, I'll have an obvious formatting error. If I move the second write statement above the first and then outdent, my data will be out of order. What can I do to make sure that the first write statement gets written once as a header, and the second gets written for each line in the CSV file? How do I extract the first 'write' statement outside of the loop without breaking the dictionary? Thanks!
The csv module contains methods for writing as well as reading, making this pretty trivial:
import csv
with open("test.csv") as file, open("test_tab.csv", "w") as out:
    reader = csv.reader(file)
    writer = csv.writer(out, dialect=csv.excel_tab)
    for row in reader:
        writer.writerow(row)
No need to do it all yourself. Note my use of the with statement, which should always be used when working with files in Python.
Edit: Naturally, if you want to select specific values, you can do that easily enough. You appear to be making your own dictionary to select the values - again, the csv module provides DictReader to do that for you:
import csv
with open("test.csv") as file, open("test_tab.csv", "w") as out:
    reader = csv.DictReader(file)
    writer = csv.writer(out, dialect=csv.excel_tab)
    for row in reader:
        writer.writerow([row["name"], row["order_num"], ...])
As kirelagin points out in the comments, writer.writerows() could also be used, here with a generator expression:
writer.writerows([row["name"], row["order_num"], ...] for row in reader)
Extract the code that writes the headers outside the main loop, in such a way that it only gets written exactly once at the beginning.
Also, consider using the CSV module for writing CSV files (not just for reading), don't reinvent the wheel!
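Both suggestions can be combined: write the header once before the loop, then one line per record inside it. The sketch below is for Python 3 and runs on in-memory data. The column list `wanted` and the sample rows are assumptions made for the illustration, not the asker's real data.

```python
import csv
import io

# In-memory stand-in for one input file; the rows are invented.
source = io.StringIO(
    "name,order_num,amt_due,due_date\n"
    " alice ,1,9.50,2020-01-01\n"
    "bob,2, 3.25,2020-02-01\n"
)
wanted = ["name", "order_num", "amt_due", "due_date"]  # hypothetical column choice

out = io.StringIO()
reader = csv.DictReader(source)      # uses the first row as field names
writer = csv.writer(out, delimiter="\t", lineterminator="\n")

writer.writerow(wanted)              # header: written once, before the loop
for row in reader:
    # one tab-separated line per record, with surrounding whitespace stripped
    writer.writerow([row[col].strip() for col in wanted])

result = out.getvalue()
```

With real files, `source` and `out` would be file objects opened in a with statement, as in the answer above.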
Ok, so I figured it out, but it's not the most elegant solution. Basically, I just ran the first loop, wrote to the file, then ran it a second time and appended the results. See my code below. I would love any input on a better way to accomplish what I've done here. Thanks!
for file in import_dir:
    data = csv.reader(open(file))
    fields = data.next()
    new_file = export_dir+os.path.basename(file)
    tab_file = open(export_dir+os.path.basename(file), 'a+')
    for row in data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
    tab_file.write(item['name']+'\t'+item['order_num']...)
    tab_file.close()
for file in import_dir:
    data = csv.reader(open(file))
    fields = data.next()
    new_file = export_dir+os.path.basename(file)
    tab_file = open(export_dir+os.path.basename(file), 'a+')
    for row in data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        tab_file.write('\n'+item['amt_due']+'\t'+item['due_date']...)
    tab_file.close()

Using Python to automate data cleanup, some issues

I have to process ~16,000 rows of data. Each row is a transaction record, with several parts. For example:
row= [ID, thing, widget]
What I would like to do is kind of simple- for each row, compare it to the rest of the rows one by one. If row A has a unique ID and unique widget, I want to write it to an outfile. Otherwise, I don't need it. (This program basically automates data cleanup for me.) Here's what I have so far:
try:
    infile=open(file1, 'r')
    for line in infile:
        line_wk=line.split(",")
        outfile=open(file2, 'r')
        for line in outfile:
            line_wk2=line.split(",")
            if line_wk[0]==line_wk2[0]:
                if line_wk[2]!=line_wk2[2]: #ID is not unique, but the widget is
                    to_write=','.join(line_wk) #queued to write later
            else:
                to_write=','.join(line_wk) #queued to write later
        if len(to_write)>0:
            outfile.close()
            outfile=open(file2, 'a')
            outfile.write(to_write)
            outfile.close()
            outfile=open(file2, 'r')
    infile.close()
    outfile.close()
except:
    print("Something went wrong.")
Running this on a small test set, it stays within the 'try' block but otherwise just writes everything, not only the ones with a unique ID and widget. I assume there is an infinitely simpler way to do this. Any help is appreciated!
What you want to do is create a dictionary where the key is a tuple of (ID, widget) and the value is thing. Dictionary keys are guaranteed unique. So, your code would look something like this.
uniques = {}
with open("yourfile.txt") as infile:
    for line in infile:
        ID, thing, widget = line.strip().split(',')
        uniques[(ID, widget)] = thing
with open("output.txt", "w") as outfile:
    for k, v in uniques.iteritems():
        outfile.write("%s,%s,%s\n" % (k[0], v, k[1]))
If preserving their original order is important then you can use OrderedDict from the collections package
You can also clean up how the outfile.write line is written, but it should work as is.
Lastly, since it appears you are reading/writing csv (comma separated values) format, you can make use of the csv module.
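A rough sketch of the same idea using the csv module is shown below, written for Python 3 (3.7 or later, where a plain dict keeps insertion order). It runs on invented in-memory rows. One deliberate difference from the code above: `setdefault` keeps the first "thing" seen for each (ID, widget) pair, whereas plain assignment keeps the last one.

```python
import csv
import io

# In-memory stand-in for yourfile.txt; the rows are made up.
source = io.StringIO(
    "1,foo,ITEM_1\n"
    "2,bar,ITEM_2\n"
    "1,baz,ITEM_1\n"
    "1,qux,ITEM_3\n"
)

uniques = {}
for record_id, thing, widget in csv.reader(source):
    # setdefault only stores the value if the key is not present yet
    uniques.setdefault((record_id, widget), thing)

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for (record_id, widget), thing in uniques.items():
    writer.writerow([record_id, thing, widget])

result = out.getvalue()
```

The third input row repeats the pair (1, ITEM_1), so it is dropped, and the remaining rows come out in their original order.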
To test this I wrote a script
import random
import string
IDS = range(1, 100)
widgets = ['ITEM_%s' % (i, ) for i in range(10)]
thing_chars = list(string.uppercase + string.lowercase)
def get_thing():
    return "".join(random.sample(thing_chars, 10))
with open("yourfile.txt", "w") as out:
    for i in xrange(0, 16000):
        ID = random.choice(IDS)
        widget = random.choice(widgets)
        thing = get_thing()
        out.write("%s,%s,%s\n" % (ID, thing, widget))
It appears to run with the correct results.
