Iterating over a csv file given a specific range - python

I'm iterating over a fairly large CSV file. startDate and endDate are supplied by the user, and I only need to search rows within that range.
However, when I run the program up to that point, it takes a long time and then just prints "set()". I've marked the line where I'm having trouble in the code below.
I'm looking for suggestions and possibly sample code. Thank you all in advance!
def compare(word1, word2, startDate, endDate):
    with open('all_words.csv') as allWords:
        readWords = csv.reader(allWords, delimiter=',')
        year = set()
        for row in readWords:
            if row[1] in range(int(startDate), int(endDate)): #< Having trouble here
                if row[0] == word1:
                    year.add(row[1])
    print(year)

The reason your test isn't finding any years is that the expression:
row[1] in range(int(startDate), int(endDate))
is checking to see if a string value appears in a list of integers. If you test:
"1970" in range(1960, 1980)
you will see that it returns False. You need to write:
int(row[1]) in range(int(startDate), int(endDate))
However, this is still quite inefficient. It checks whether the value int(row[1]) occurs anywhere in the sequence [int(startDate), int(startDate)+1, ..., int(endDate)-1], and it does so by linear search. Much faster is:
if int(startDate) <= int(row[1]) < int(endDate):
Note that your code above was written to exclude endDate from the range of possible dates (because range excludes its second argument), and I've done the same here.
Edit: Actually, I guess I should point out that it's only Python 2 where an expression like 500000 in range(1, 1000000) is inefficient. In Python 3 (or in Python 2 with xrange in place of range), it's fast.
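Putting the pieces of this answer together, here is a minimal sketch of the filtering loop. The function name years_for_word and the in-memory sample data are hypothetical; it assumes, as in the question, that the word is in column 0 and the year in column 1.

```python
import csv
import io


def years_for_word(rows, word, start_year, end_year):
    """Return the years in [start_year, end_year) in which `word` appears."""
    start, end = int(start_year), int(end_year)  # convert the bounds once
    years = set()
    for row in rows:
        year = int(row[1])  # CSV fields are strings, so convert before comparing
        if row[0] == word and start <= year < end:
            years.add(year)
    return years


# Hypothetical sample data standing in for all_words.csv
sample = io.StringIO("cat,1965\ncat,1985\ndog,1970\ncat,1970\n")
result = years_for_word(csv.reader(sample), "cat", "1960", "1980")
print(sorted(result))  # [1965, 1970]
```

Converting the bounds once outside the loop avoids repeating the same int() calls for every row.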

You can try the read_csv function from the pandas library. It lets you read a fixed number of rows at a time, so the whole file never has to fit in memory at once.
import pandas as pd

reader = pd.read_csv(file_name, chunksize=chunk_size, iterator=True)
while True:
    try:
        df = reader.get_chunk(chunk_size)
        # select data rows which have desired dates
    except StopIteration:  # raised once no rows are left
        break
del df

Related

How to find an average from a specific row of a csv file by using loops in python?

f = open('TB_burden_countries_2014-09-29.csv')
for row in csv.reader(f):
    print(row[7])
This reads the file and prints the value at index 7 of every row (in other words, one column). How do I find the average of those values using loops? Thank you.
After a quick Google search I found this post by "Billy":
Formatting data in a CSV file (calculating average) in python
Basically you want statistics for each row. In general you should do something like this:
import csv

with open('data.csv', 'r') as f:
    rows = csv.reader(f)
    for row in rows:
        name = row[0]
        # csv.reader yields strings, so convert the scores before doing arithmetic
        scores = [int(score) for score in row[1:]]
        # calculate statistics of scores
        attributes = {
            'NAME': name,
            'MAX' : max(scores),
            'MIN' : min(scores),
            'AVE' : 1.0 * sum(scores) / len(scores)
        }
        output_mesg = "name: {NAME:s} \t high: {MAX:d} \t low: {MIN:d} \t ave: {AVE:f}"
        print(output_mesg.format(**attributes))
Try not to worry about whether a particular step is locally inefficient. A good Pythonic script should be as readable as possible to everyone.
In your code, I spot two mistakes:
Appending to row won't change anything, since row is a local variable in for loop and will get garbage collected.
row[1:3] only gives the second and the third element. row[1:4] gives what you want, as well as row[1:]. Indexing in Python normally is end-exclusive.
And some questions for you to think about:
If I can open the file in Excel and it's not that big, why not just do it in Excel? Can I make use of all the tools I have to get work done as soon as possible with least effort? Can I get done with this task in 30 seconds?
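For the question as asked (averaging the values at one index across all rows with a plain loop), a minimal sketch could look like this. The helper name column_average and the sample data are hypothetical, and skipping non-numeric cells such as the header is an assumption about how the file should be handled.

```python
import csv
import io


def column_average(rows, index):
    """Average the numeric values found at `index` in each row."""
    total = 0.0
    count = 0
    for row in rows:
        try:
            value = float(row[index])
        except (ValueError, IndexError):
            continue  # skip the header, blank cells and short rows
        total += value
        count += 1
    return total / count if count else None


# Hypothetical sample data; with the real file you would pass csv.reader(f) and index 7
sample = io.StringIO("country,value\nA,10\nB,\nC,20\n")
average = column_average(csv.reader(sample), 1)
print(average)  # 15.0
```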

Python string not clearing correctly in loop

I am writing a script that goes through a CSV file looking for the first time a specific user logged in and the last time they logged out. I have loops set up that are working great, but when I clear the lists holding the login/logout times, I get an "index out of range" error. Can anyone spot anything incorrect with this?
#this gets the earliest login time for each agent (but it assumes all dates to be the same!)
with open(inputFile, 'r') as dailyAgentLog:
    csv_read = csv.DictReader(dailyAgentLog)
    firstLoginTime = []
    lastLogoutTime = []
    outputLine = []
    while x < len(agentName):
        for row in csv_read:
            if row["Agent"] == agentName[x]:
                firstLoginTime.append(datetime.strptime(row["Login Time"], '%I:%M:%S %p'))
                lastLogoutTime.append(datetime.strptime(row["Logout Time"], '%I:%M:%S %p'))
        firstLoginTime.sort()
        lastLogoutTime.sort()
        outputLine = [agentName[x], agentLogin[x], agentExtension[x], row["Login Date"], firstLoginTime[0], row["Logout Date"], lastLogoutTime[-1]]
        print(f'Agent {agentName[x]} first login was {firstLoginTime[0]} and last logout {lastLogoutTime[-1]}.')
        fileLines.append(outputLine)
        x += 1
        firstLoginTime.clear() #this should be emptying/clearing the list at the end of every iteration
        lastLogoutTime.clear()
The problem is that on the second and following iterations, the for row in csv_read: loop doesn't execute, because the reader is exhausted and there's nothing left to read. So you never fill in the firstLoginTime and lastLogoutTime lists on subsequent iterations, and indexing them fails.
If the file isn't too large, you can read it into a list before iterating:
csv_read = list(csv.DictReader(dailyAgentLog))
If it's too big to hold in memory, put
dailyAgentLog.seek(0)
at the end of the loop body.
Also, instead of sorting the lists, you can use min() and max():
firstLogin = min(firstLoginTime)
lastLogout = max(lastLogoutTime)
And I suggest you use
for x in range(len(agentName)):
rather than while and increment.
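Another option, sketched here under assumptions rather than taken from the question, is to read the file only once and keep the earliest login and latest logout per agent in a dictionary. The function name login_windows and the sample rows are hypothetical; the column names and time format are the ones used in the question.

```python
import csv
import io
from datetime import datetime

TIME_FORMAT = '%I:%M:%S %p'  # same format string as in the question


def login_windows(rows):
    """Map each agent to (earliest login, latest logout) in a single pass."""
    windows = {}
    for row in rows:
        login = datetime.strptime(row["Login Time"], TIME_FORMAT)
        logout = datetime.strptime(row["Logout Time"], TIME_FORMAT)
        agent = row["Agent"]
        if agent in windows:
            first, last = windows[agent]
            windows[agent] = (min(first, login), max(last, logout))
        else:
            windows[agent] = (login, logout)
    return windows


# Hypothetical sample data standing in for the daily agent log
sample = io.StringIO(
    "Agent,Login Time,Logout Time\n"
    "Ann,09:00:00 AM,11:00:00 AM\n"
    "Bob,08:30:00 AM,12:00:00 PM\n"
    "Ann,01:00:00 PM,05:00:00 PM\n"
)
windows = login_windows(csv.DictReader(sample))
for agent, (first, last) in windows.items():
    print(agent, first.time(), last.time())
```

Because the reader is consumed exactly once, there is no need to rewind the file or hold every row in memory.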

AIO Castle Cavalry - My code is too slow, is there a way I can shorten this?

So I am currently preparing for a competition (Australian Informatics Olympiad) and in the training hub, there is a problem in AIO 2018 intermediate called Castle Cavalry. I finished it:
input = open("cavalryin.txt").read()
output = open("cavalryout.txt", "w")
squad = input.split()
total = squad[0]
squad.remove(squad[0])
squad_sizes = squad.copy()
squad_sizes = list(set(squad))
yn = []
for i in range(len(squad_sizes)):
    n = squad.count(squad_sizes[i])
    if int(squad_sizes[i]) == 1 and int(n) == int(total):
        yn.append(1)
    elif int(n) == int(squad_sizes[i]):
        yn.append(1)
    elif int(n) != int(squad_sizes[i]):
        yn.append(2)
ynn = list(set(yn))
if len(ynn) == 1 and int(ynn[0]) == 1:
    output.write("YES")
else:
    output.write("NO")
output.close()
I submitted this code and I didn't pass because it was too slow, at 1.952secs. The time limit is 1.000 secs. I wasn't sure how I would shorten this, as to me it looks fine. PLEASE keep in mind I am still learning, and I am only an amateur. I started coding only this year, so if the answer is quite obvious, sorry for wasting your time 😅.
Thank you for helping me out!
One performance issue is calling int() over and over on the same entity, or on things that are already int:
if int(squad_sizes[i]) == 1 and int(n) == int(total):
elif int(n) == int(squad_sizes[i]):
elif int(n) != int(squad_sizes[i]):
if len(ynn) == 1 and int(ynn[0]) == 1:
But the real problem is your code doesn't work. And making it faster won't change that. Consider the input:
4
2
2
2
2
Your code will output "NO" (with missing newline) despite it being a valid configuration. This is due to your collapsing the squad sizes using set() early in your code. You've thrown away vital information and are only really testing a subset of the data. For comparison, here's my complete rewrite that I believe handles the input correctly:
with open("cavalryin.txt") as input_file:
    string = input_file.read()

total, *squad_sizes = map(int, string.split())

success = True

while squad_sizes:
    squad_size = squad_sizes.pop()
    for _ in range(1, squad_size):
        try:
            squad_sizes.remove(squad_size) # eliminate n - 1 others like me
        except ValueError:
            success = False
            break
    else: # no break
        continue
    break

with open("cavalryout.txt", "w") as output_file:
    print("YES" if success else "NO", file=output_file)
Note that I convert all the input to int early on so I don't have to consider that issue again. I don't know whether this will meet AIO's timing constraints.
I can see some things in there that might be inefficient, but the best way to optimize code is to profile it: run it with a profiler and sample data.
You can easily waste time trying to speed up parts that don't need it without having much effect. Read up on the cProfile module in the standard library to see how to do this and interpret the output. A profiling tutorial is probably too long to reproduce here.
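As a rough illustration of what profiling looks like, here is a minimal sketch using cProfile and pstats from the standard library. The function slow_count and its input are hypothetical stand-ins for the code under test, chosen to mimic the repeated list.count() calls in the question.

```python
import cProfile
import io
import pstats


def slow_count(items):
    """Hypothetical stand-in: one list.count() scan per distinct value."""
    return {value: items.count(value) for value in set(items)}


profiler = cProfile.Profile()
profiler.enable()
counts = slow_count([str(i % 50) for i in range(5000)])
profiler.disable()

# Collect the report in a string and show the five most expensive entries
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists how often each function was called and how much time was spent in it, which shows where optimization effort is worth spending.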
My suggestions, without profiling,
squad.remove(squad[0])
Removing the start of a big list is slow, because the rest of the list has to be copied as it is shifted down. (Removing the end of the list is faster, because lists are typically backed by arrays that are overallocated, with more slots than elements, to make .append() fast. Removing the last element only has to decrease the length and can keep the same array.)
It would be better to set this to a dummy value and remove it when you convert it to a set (sets are backed by hash tables, so removals are fast), e.g.
dummy = object()
squad[0] = dummy # len() didn't change. No shifting required.
...
squad_sizes = set(squad)
squad_sizes.remove(dummy) # Fast lookup by hash code.
Since we know these will all be strings, you can just use None instead of a dummy object, but the above technique works even when your list might contain Nones.
squad_sizes = squad.copy()
This line isn't required; it's just doing extra work. The set() already makes a shallow copy.
n = squad.count(squad_sizes[i])
This line might be the real bottleneck. It's effectively a loop inside a loop, so it basically has to scan the whole list for each outer loop. Consider using collections.Counter for this task instead. You generate the count table once outside the loop, and then just look up the numbers for each string.
You can also avoid generating the set altogether if you do this. Just use the Counter object's keys for your set.
Another point unrelated to performance. It's unpythonic to use indexes like [i] when you don't need them. A for loop can get elements from an iterable and assign them to variables in one step:
from collections import Counter
...
count_table = Counter(squad)
for squad_size, n in count_table.items():
...
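To make the Counter suggestion concrete, here is a minimal sketch of the whole check. It assumes that a configuration is valid exactly when, for every preferred size k, the number of knights asking for size k is a multiple of k; the function name squads_possible is hypothetical.

```python
from collections import Counter


def squads_possible(preferred_sizes):
    """Return True if the knights can be split into squads of their preferred sizes.

    Assumes validity means each size's count is a multiple of that size.
    """
    counts = Counter(preferred_sizes)  # one pass over the data
    return all(n % size == 0 for size, n in counts.items())


print("YES" if squads_possible([2, 2, 2, 2]) else "NO")
print("YES" if squads_possible([2, 2, 2]) else "NO")
```

Building the Counter is a single pass over the input, and the check afterwards only visits each distinct size once, so there is no nested scan of the list.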
You can collect all occurrences of each knight's preferred number in a dictionary.
Then test whether the number of knights with a given preferred number is divisible by that number.
with open('cavalryin.txt', 'r') as f:
    lines = f.readlines()
# convert to int
list_int = [int(a) for a in lines]
# initialise counting dictionary: key: preferred number, item: empty list to collect all knights with preferred number.
collect_dict = {a:[] for a in range(1,1+max(list_int[1:]))}
print(collect_dict)
# loop through list, ignoring first entry.
for a in list_int[1:]:
    collect_dict[a].append(a)
# initialise output
out='YES'
for key, item in collect_dict.items():
    # check number of items with preference for number is divisible
    # by that number
    if item: # if list has entries:
        if (len(item) % key) > 0:
            out='NO'
            break
with open('cavalryout.txt', 'w') as f:
    f.write(out)

How can I simplify this Python code (assignment from a book)?

I am studying the book "Python for Everybody" by Charles R. Severance, and I have a question about Exercise 2 from Chapter 7.
The task is to go through the mbox-short.txt file and "When you encounter a line that starts with “X-DSPAM-Confidence:” pull apart the line to extract the floating-point number on the line. Count these lines and then compute the total of the spam confidence values from these lines. When you reach the end of the file, print out the average spam confidence."
Here is my way of doing this task:
fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
count = 0
values = list()
for line in fhand:
    if line.startswith('X-DSPAM-Confidence:'):
        string = line
        count = count + 1
        colpos = string.find(":")
        portion = string[colpos+1:]
        portion = float(portion)
        values.append(portion)
print('Average spam confidence:', sum(values)/count)
I know this code works because I get the same result as in the book, however, I think this code can be simpler. The reason I think so is because I used a list in this code (declared it and then stored values in it). However, "Lists" is the next topic in the book and when solving this task I didn't know anything about lists and had to google them. I solved this task this way, because this is what I'd do in the R language (which I am already quite familiar with), I'd make a vector in which I'd store the values from my iteration.
So my question is: Can this code be simplified? Can I do the same task without using list? If yes, how can I do it?
You could change the "values" object to a float. The overhead of a list is not really needed for this problem.
values = 0.0
Then in the loop use
values += portion
Otherwise, there really is not a simpler way as this problem has tasks and you must meet all of the tasks in order to solve it.
Open File
Check For Error
Loop Through Lines
Find certain lines
Total up said lines
Print average
If you can do it in 3 lines of code great but that doesn't make what goes on in the background necessarily simpler. It will also probably look ugly.
You could filter the file's lines before the loop; then you can collapse the other variables into one and get the values using a list comprehension. The count is then simply the length of that list.
interesting_lines = (line for line in fhand if line.startswith('X-DSPAM-Confidence:'))
values = [float(line[(line.find(":")+1):]) for line in interesting_lines]
count = len(values)
Can I do the same task without using list?
If the output only needs to be an average, yes: you can accumulate the sum and the count in their own variables, so you don't need a list to call sum(values) on.
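A minimal sketch of that list-free version follows. The function name average_confidence and the sample lines are hypothetical; with the real file you would pass the open file handle instead of the list.

```python
def average_confidence(lines):
    """Average the X-DSPAM-Confidence values using only a running total and a count."""
    prefix = 'X-DSPAM-Confidence:'
    total = 0.0
    count = 0
    for line in lines:
        if line.startswith(prefix):
            total += float(line[len(prefix):])  # float() ignores surrounding whitespace
            count += 1
    return total / count if count else None


# Hypothetical sample lines standing in for mbox-short.txt
sample = [
    "From: someone@example.com\n",
    "X-DSPAM-Confidence: 0.25\n",
    "Subject: hello\n",
    "X-DSPAM-Confidence: 0.75\n",
]
print('Average spam confidence:', average_confidence(sample))  # 0.5
```

Returning None when no matching line is found avoids a division by zero on files without that header.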
Note that open(fname) is giving you an iterable collection anyway, and you're looping over the "list of lines" in the file.
List-comprehensions can often replace for-loops that add to a list:
fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
values = [float(l[l.find(":")+1:]) for l in fhand if l.startswith('X-DSPAM-Confidence:')]
print('Average spam confidence:', sum(values)/len(values))
The inner part is simply your code combined, so perhaps less readable.
EDIT: Without using lists, it can be done with "reduce":
from functools import reduce

fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
sum, count = reduce(lambda acc, l: (acc[0] + float(l[l.find(":")+1:]), acc[1]+1) if l.startswith('X-DSPAM-Confidence:') else acc, fhand, (0,0))
print('Average spam confidence:', sum / count)
Reduce is often called "fold" in other languages, and it basically allows you to iterate over a collection with an "accumulator". Here, I iterate the collection with an accumulator which is a tuple of (sum, count). With each item, we add to the sum and increment the count. See Reduce documentation.
All this being said, "simplify" does not necessarily mean as little code as possible, so I would stick with your own code if you're not comfortable with these shorthand notations.

appending array breaks program

I am writing a program to analyze some of our invoice data. Basically, I need to take an array containing each individual invoice we sent out over the past year and break it down into twelve arrays, each containing the invoices for one month, using the dateSeperate() function, so that monthly_transactions[0] returns January's transactions, monthly_transactions[1] returns February's, and so forth.
I've managed to get it working so that dateSeperate returns monthly_transactions[0] as the January transactions. However, once all of the January data is entered, I attempt to append to the monthly_transactions array using line 44. This just causes the program to break and become unresponsive. The code still executes and doesn't return an error, but Python becomes unresponsive and I have to force quit out of it.
I've been writing to the global array monthly_transactions. dateSeperate runs fine as long as I don't include the last else statement; without it, monthly_transactions[0] returns an array containing all of the January invoices. The issue arises in my last else statement, which, when added, causes Python to freeze.
Can anyone help shed any light on this?
I have written a program that defines all of the arrays I'm going to be using (yes, I know global arrays aren't good; I'm a marketer trying to learn programming, so any input you could give me on how to improve this would be much appreciated):
import csv
line_items = []
monthly_transactions = []
accounts_seperated = []
Then I import all of my data and place it into the line_items array
def csv_dict_reader(file_obj):
    global board_info
    reader = csv.DictReader(file_obj, delimiter=',')
    for line in reader:
        item = []
        item.append(line["company id"])
        item.append(line["user id"])
        item.append(line["Amount"])
        item.append(line["Transaction Date"])
        item.append(line["FIrst Transaction"])
        line_items.append(item)

if __name__ == "__main__":
    with open("ChurnTest.csv") as f_obj:
        csv_dict_reader(f_obj)

#formats the transacation date data to make it more readable
def dateFormat():
    for i in range(len(line_items)):
        ddmmyyyy =(line_items[i][3])
        yyyymmdd = ddmmyyyy[6:] + "-"+ ddmmyyyy[:2] + "-" + ddmmyyyy[3:5]
        line_items[i][3] = yyyymmdd

#Takes the line_items array and splits it into new array monthly_tranactions, where each value holds one month of data
def dateSeperate():
    for i in range(len(line_items)):
        #if there are no values in the monthly transactions, add the first line item
        if len(monthly_transactions) == 0:
            test = []
            test.append(line_items[i])
            monthly_transactions.append(test)
        # check to see if the line items year & month match a value already in the monthly_transaction array.
        else:
            for j in range(len(monthly_transactions)):
                line_year = line_items[i][3][:2]
                line_month = line_items[i][3][3:5]
                array_year = monthly_transactions[j][0][3][:2]
                array_month = monthly_transactions[j][0][3][3:5]
                #print(line_year, array_year, line_month, array_month)
                #If it does, add that line item to that month
                if line_year == array_year and line_month == array_month:
                    monthly_transactions[j].append(line_items[i])
                #Otherwise, create a new sub array for that month
                else:
                    monthly_transactions.append(line_items[i])

dateFormat()
dateSeperate()
print(monthly_transactions)
I would really, really appreciate any thoughts or feedback you guys could give me on this code.
Based on the comments on the OP, your csv_dict_reader function seems to do exactly what you want it to do, at least inasmuch as it appends data from its argument CSV file to the top-level variable line_items. You said yourself that if you print out line_items, it shows the data that you want.
"But appending doesn't work." I take it you mean that appending the line_items to monthly_transactions isn't being done. The reason is that you didn't tell the program to do it! The appending you're talking about is done as part of your dateSeperate function, but you still need to call the function.
I'm not sure exactly how you want to use your dateFormat and dateSeperate functions, but in order to use them, you need to call them from the main section somehow, i.e. dateFormat() and dateSeperate().
EDIT: The last else: section is what makes the program hang. It appends to monthly_transactions whenever the line's year/month differ from the entry being checked, and it does so inside the loop for j in range(len(monthly_transactions)):. Because range(len(...)) is evaluated once when that loop starts, each pass through it does finish, but every entry that doesn't match adds another element to monthly_transactions. The list therefore roughly doubles with each line item, and the program soon appears to be stuck in an endless loop.
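One way to avoid appending to the list while looping over it is to key the groups by year and month in a dictionary. This is a minimal sketch, assuming each date is a yyyy-mm-dd string (the format dateFormat() appears to aim for) stored at index 3 of each line item; the function name group_by_month and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_by_month(items, date_index=3):
    """Group line items by the 'yyyy-mm' prefix of their date string."""
    groups = defaultdict(list)
    for item in items:
        date = item[date_index]        # e.g. "2017-01-15"
        groups[date[:7]].append(item)  # key is the year and month, "2017-01"
    return dict(groups)


# Hypothetical sample rows in the same shape as line_items
rows = [
    ["c1", "u1", "10", "2017-01-15", "y"],
    ["c2", "u2", "5", "2017-02-01", "n"],
    ["c1", "u3", "7", "2017-01-30", "n"],
]
monthly = group_by_month(rows)
for month in sorted(monthly):
    print(month, len(monthly[month]))
```

Each line item is looked at exactly once, so the work grows linearly with the number of invoices, and sorted(monthly) gives the months in calendar order.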
