How to reduce latency while reading data from a CSV file? - Python

I have an Excel file, a.xls, with 2000 rows, each containing one item, like:
RowNum Item
1 'A'
2 'B'
3 'C'
.
.
.
2000 'xyz'
I have another file, b.xls, which contains about 6,300,000 rows of data. Some of the items from a.xls occur in this file. I need to pick every row from b.xls that corresponds to an item in a.xls and store those rows in separate files called A.csv, B.csv, etc.
I did it using multi-threading, but it's taking a long time to execute. Can anybody help me reduce the latency?
This is the code I have used; the following function is started in a thread:
def parseFromFile(pTickerList):
    global gSearchList
    lSearchList = gSearchList
    for lTickerName in pTickerList:
        c = csv.writer(open("op-new/" + lTickerName + ".csv", "wb"))
        c.writerow(["Ticker Name", "Time Stamp", "Price", "Size"])
        for line in lSearchList:
            lSplittedLine = line.split(",")
            lTickerNameFromSearchFile = lSplittedLine[0].strip()
            if lTickerNameFromSearchFile[0] == "#":
                continue
            if ord(lTickerName[0]) < ord(lTickerNameFromSearchFile[0]):
                break
            elif ord(lTickerName[0]) > ord(lTickerNameFromSearchFile[0]):
                continue
            if lTickerNameFromSearchFile == lTickerName:
                try:
                    lTimeStamp = Decimal(float(lSplittedLine[1]))
                    lPrice = lSplittedLine[2]
                    lSize = lSplittedLine[4]
                    if str(lTimeStamp)[len(str(lTimeStamp))-2:] == "60":
                        lTimeStamp = str(lTimeStamp)[:len(str(lTimeStamp))-2] + "59.9"
                    if str(lTimeStamp).find(".") >= 0:
                        lTimeStamp = float(str(lTimeStamp).split(".")[0] + "." + str(lTimeStamp).split(".")[1][0])
                        lTimeStamp1 = "%.1f" % float(lTimeStamp)
                        lHumanReadableTimeStamp = datetime.strptime(str(lTimeStamp1), "%Y%m%d%H%M%S.%f")
                    else:
                        lHumanReadableTimeStamp = datetime.strptime(str(lTimeStamp), "%Y%m%d%H%M%S")
                except Exception, e:
                    exc_type, exc_obj, exc_tb = sys.exc_info()
                    fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
                    print(exc_type, fname, exc_tb.tb_lineno)
                    print line
                    print lTimeStamp
                    raw_input()
                c.writerow([lTickerNameFromSearchFile, lHumanReadableTimeStamp, lPrice, lSize])

It's hard to look through your code and fully understand it, because it references variables differently from your explanation, but I believe this approach will help you.
Start by reading all of a.csv into a set keyed by the traits you want to be able to look up. Sets in Python have very fast lookup times. This will also help because, judging from your code above, you seem to do a lot of repeated computation in your inner loop.
Then read through b.csv, checking each row against the a.csv set. Whenever you find a match, write the row to the corresponding output file (A.csv, B.csv, etc.).
The big speedups over your current setup come from removing the repeated calculations in your inner loop and from removing the need for threads. Because a.csv is only 2000 lines, it will be incredibly fast to read.
Let me know if you want me to expand on any part of this.
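For illustration, a rough sketch of that approach might look like the following (untested; it keeps your Python 2 csv usage, and the "a.csv"/"b.csv" names, the column layout of the big file, and the "op-new/" output directory are assumptions taken from the question):
import csv
from collections import defaultdict

# 1. Read the 2000 items from a.csv into a set for O(1) membership checks.
with open("a.csv") as a_file:
    items = set(line.split(",")[0].strip() for line in a_file if line.strip())

# 2. Stream through the big file once, collecting matching rows per item.
matches = defaultdict(list)
with open("b.csv") as b_file:
    for line in b_file:
        fields = line.split(",")
        name = fields[0].strip()
        if not name or name.startswith("#"):
            continue
        if name in items:  # set lookup, instead of rescanning per ticker
            matches[name].append([name, fields[1], fields[2], fields[4]])

# 3. Write each item's rows out once, after the scan finishes.
for name, rows in matches.items():
    with open("op-new/" + name + ".csv", "wb") as out:
        writer = csv.writer(out)
        writer.writerow(["Ticker Name", "Time Stamp", "Price", "Size"])
        writer.writerows(rows)
If the matched rows are too numerous to hold in memory at once, you could instead keep one csv.writer open per item, but either way the key ideas are a single pass over the big file and the set lookup.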


Running out of memory on python product iteration chain

I am trying to build a list of possible string combinations and then iterate over it. I am running out of memory executing the line below, which makes sense because the result is several billion items.
data = list(map(''.join,chain.from_iterable(product(string.digits+string.ascii_lowercase+'/',repeat = i) for i in range(0,7))))
So I think that, rather than creating this massive list, I should create it and execute against it in waves, with some kind of "holding string" that I save to disk and can restart from when I want. I.e., generate and iterate over a million rows, then save the holding string to file; then start up again with the next million rows, starting my mapping/iterations at the "holding string" or the next row. I have no clue how to do that. I think I might have to drop the .from_iterable(product( code that I had implemented. If that idea is not clear (or is clear but stupid), let me know.
Another option, rather than working around the memory issue, would be to somehow optimize the iterable itself, but I'm not sure how I would do that either. I'm trying to map an API that has no existing documentation. While I don't know that a non-exhaustive list is the route to take, I'm certainly open to suggestions.
Here is the code chunk I've been using:
import csv
import string
from itertools import product, chain

#Open stringfile. If it doesn't exist, create it
try:
    with open(stringfile) as f:
        reader = csv.reader(f, delimiter=',')
        data = list(reader)
        f.close()
except:
    data = list(map(''.join, chain.from_iterable(product(string.digits + string.ascii_lowercase + '/', repeat=i) for i in range(0, 6))))
    f = open(stringfile, 'w')
    f.write(str('\n'.join(data)))
    f.close()
    pass
#Iterate against
...
EDIT: Further poking at this led me to this thread, which covers a similar topic. There is discussion about using islice, which helps me post-mapping (the script crashed last night while doing the API calls, due to an error in my exception handling); I just restarted it at the 400k-th item.
Can I use .islice within a product? That is, for the generator, generate items 10 million-12 million (for example) and operate on just those items, as a way to preserve memory?
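Something along these lines is what I imagine (a rough, untested sketch; the 10M/12M bounds are just an example):
from itertools import product, chain, islice
import string

chars = string.digits + string.ascii_lowercase + '/'

# Lazily chain all the products together without materializing a list,
# then slice out only the window of items to work on in this run.
all_combos = chain.from_iterable(
    (''.join(p) for p in product(chars, repeat=i)) for i in range(0, 7))

for kw in islice(all_combos, 10000000, 12000000):
    # ... do the API call for this keyword ...
    pass
(I realize islice would still have to step through and throw away the first 10 million combinations each run, but at least nothing would be held in memory.)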
Here is the most recent snippet of what I'm doing. You can see I plugged islice into the actual iteration further down, but I want to use islice in the actual generation (the data = line).
#Open stringfile. If it doesn't exist, create it
try:
    with open(stringfile) as f:
        reader = csv.reader(f, delimiter=',')
        data = list(reader)
        f.close()
except:
    data = list(map(''.join, chain.from_iterable(product(string.digits + string.ascii_lowercase + '/', repeat=i) for i in range(3, 5))))
    f = open(stringfile, 'w')
    f.write(str('\n'.join(data)))
    f.close()
    pass
print("Total items: " + str(len(data) - substart))
fdf = pd.DataFrame()
sdf = pd.DataFrame()
qdf = pd.DataFrame()
attctr = 0
#Iterate through the string combination list
for idx, kw in islice(enumerate(data), substart, substop):
    #Attempt API call. Do the cooldown function if there is an issue.
    if idx / 1000 == int(idx / 1000):
        print("Iteration " + str(idx) + " of " + str(len(data)))
    attctr += 1
    if attctr == attcd:
        print("Cooling down!")
        time.sleep(cdtimer)
        attctr = 0
    try:
        ....

How can I simplify this Python code (assignment from a book)?

I am studying the "Python for Everybody" book written by Charles R. Severance, and I have a question about exercise 2 from Chapter 7.
The task is to go through the mbox-short.txt file and "When you encounter a line that starts with “X-DSPAM-Confidence:” pull apart the line to extract the floating-point number on the line. Count these lines and then compute the total of the spam confidence values from these lines. When you reach the end of the file, print out the average spam confidence."
Here is my way of doing this task:
fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
count = 0
values = list()
for line in fhand:
    if line.startswith('X-DSPAM-Confidence:'):
        string = line
        count = count + 1
        colpos = string.find(":")
        portion = string[colpos+1:]
        portion = float(portion)
        values.append(portion)
print('Average spam confidence:', sum(values)/count)
I know this code works, because I get the same result as in the book; however, I think it can be simpler. The reason I think so is that I used a list (declared it and then stored values in it), but "Lists" is the next topic in the book, and when solving this task I didn't know anything about lists and had to google them. I solved the task this way because it is what I'd do in R (which I am already quite familiar with): make a vector and store the values from my iteration in it.
So my question is: Can this code be simplified? Can I do the same task without using list? If yes, how can I do it?
You could change the values object to a float; the overhead of a list is not really needed for this problem.
values = 0.0
Then in the loop use
values += portion
Otherwise, there really is not a simpler way, as this problem has several tasks and you must complete all of them in order to solve it:
Open File
Check For Error
Loop Through Lines
Find certain lines
Total up said lines
Print average
If you can do it in 3 lines of code, great, but that doesn't necessarily make what goes on in the background simpler. It will also probably look ugly.
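Put together, that accumulator version might look something like this (a sketch built from the code in the question):
fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
count = 0
values = 0.0  # running total instead of a list
for line in fhand:
    if line.startswith('X-DSPAM-Confidence:'):
        count = count + 1
        colpos = line.find(":")
        values += float(line[colpos+1:])
print('Average spam confidence:', values / count)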
You could filter the file's lines before the loop; then you can collapse the other variables into one and get the values with a list comprehension. From that, your count is just the length of the resulting list.
interesting_lines = (line for line in fhand if line.startswith('X-DSPAM-Confidence:'))
values = [float(line[(line.find(":")+1):]) for line in interesting_lines]
count = len(values)
Can I do the same task without using list?
If the output needs to be an average, then yes: you can accumulate the sum and the count as their own variables and not need a list to call sum(values) on.
Note that open(fname) gives you an iterable anyway, and you're already looping over the "list of lines" in the file.
List-comprehensions can often replace for-loops that add to a list:
fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
values = [float(l[l.find(":")+1:]) for l in fhand if l.startswith('X-DSPAM-Confidence:')]
print('Average spam confidence:', sum(values)/len(values))
The inner part is simply your code combined, so perhaps less readable.
EDIT: Without using lists, it can be done with "reduce":
from functools import reduce

fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
sum, count = reduce(lambda acc, l: (acc[0] + float(l[l.find(":")+1:]), acc[1]+1) if l.startswith('X-DSPAM-Confidence:') else acc, fhand, (0, 0))
print('Average spam confidence:', sum / count)
Reduce is often called "fold" in other languages, and it basically allows you to iterate over a collection with an "accumulator". Here, I iterate the collection with an accumulator which is a tuple of (sum, count). With each item, we add to the sum and increment the count. See Reduce documentation.
All this being said, "simplify" does not necessarily mean as little code as possible, so I would stick with your own code if you're not comfortable with these shorthand notations.

Python - How to read different json files but keep the same list running

This might be a tough question, and I will do my best to explain it as well as I can!
I'm trying to create a script where I run different JSON files through a for loop (each of these JSON files has its own data). I want to add those values to a list and then match the first object in the JSON against what is already stored. If an entry with the same name is found, we check whether its last object (the number) is higher or lower than it was previously for that name. If it is higher, we print that it has increased and change the stored value to the new one (inside the list).
Another thing I want it to do is append to the list only once, not n times depending on how many JSON files I use.
I'll start by showing the JSON I use; you can see it contains the same name but a different number.
{
    "name": "Albert",
    "image": "https://pbs.twimg.com/profile_images/....jpg",
    "number": "5"
}
-----------------------------------
{
    "name": "Albert",
    "image": "https://pbs.twimg.com/profile_images/....jpg",
    "number": "6"
}
OK, before I continue to explain, here is the code I wrote, so I can explain it better...
webhook_list = [
    'https://discordapp.com/api/webhooks/5133124/slack',
    'https://discordapp.com/api/webhooks/5124124/slack',
    'https://discordapp.com/api/webhooks/5112412/slack']  #Discord different API_key/Webhook

def get_identifier(thread):
    thread_id = thread['name']  #Grab name from Json
    try:
        thread_image = thread['image']  #Grab Image from Json
    except KeyError:
        thread_image = None
    try:
        thread_number = thread['numbers']  #Grab number from Json
    except KeyError:
        thread_number = None
    identifier = ('{}%{}%{}').format(thread_id, thread_image, thread_number)  #Make them all into one "String"
    return identifier

def script():
    old_list = []  #old_list, where we append when new items arrive.
    while True:
        for thread in [line.rstrip('\n') for line in open('names.txt')]:  #We check all names in the txt file (this could just as well be a list, it doesn't matter)
            get_value_identifier = get_identifier(thread)  #We send the thread to get_identifier, which returns the identifier
            if get_identifier(thread) not in old_list:  #If this value is not in the old list, then we go in here
                #Slack/Discord function
                directory = os.fsencode('./slack')
                for counters, file in enumerate(os.listdir(directory)):
                    filename = os.fsdecode(file)
                    if filename.endswith(".json"):
                        with open('./slack/' + filename) as slackAttachment:
                            data = json.loads(slackAttachment.read())
                        data_list = []
                        # *****************---Picture---*****************
                        try:
                            data["attachments"][0]["thumb_url"] = information['image_url']  #We add everything to data so we can later print it out to Discord/Slack
                        except Exception:
                            data["attachments"][0]["thumb_url"] = 'https://cdn.browshot.com/static/images/not-found.png'
                        # *****************---Footer---*****************
                        data["attachments"][0]["footer"] = str(
                            data["attachments"][0]["footer"] + ' | ' + datetime.now().strftime(
                                '%Y-%m-%d [%H:%M:%S.%f')[:-3] + "]")
                        # -------------------------------------------------------------------------
                        a = get_value_identifier.split("%")  #We split the identifier, meaning it will be name, image, number
                        for i, items in zip(range(len(old_list)), old_list):  #We walk old_list together with its index (I couldn't think of another way than this; it can be changed)
                            old_list_value = old_list[i].split("%")  #We also split the old_list values the same way as we did with *a*
                            if a[0] in old_list_value[0]:  #If the first value of *a* is found in an old_list entry's first value...
                                if old_list_value[2] < a[2]:  #We check if the number is higher than in old_list. If it is, then we do the things below
                                    data["attachments"][0]["title"] = information['name'].upper()
                                    data_list.append((webhook_list[counters], data))
                                    for hook, data in data_list:
                                        threading.Thread(target=sendData, args=(hook, data)).start()
                                    old_list[i] = get_value_identifier
                                    break
                                elif len(old_list_value[2]) >= len(a[2]):  #We check if the number is lower than in old_list. If it is, then we do the things below
                                    old_list[i] = get_value_identifier
                                    break
                        else:  #If nothing is found, then we just do the things below and add the value to old_list.
                            data["attachments"][0]["title"] = information['name'].upper()
                            data_list.append((webhook_list[counters], data))
                            for hook, data in data_list:
                                threading.Thread(target=sendData, args=(hook, data)).start()
                            old_list.append(get_value_identifier)
            else:
                randomtime = random.randint(3, 7)
                logger.warn('No new item found! - retrying in %d secs' % (randomtime))
                time.sleep(randomtime)
As you can see, this is the code I use for opening each JSON file; data = json.loads(slackAttachment.read()) means everything ends up in data, which is the parsed JSON.
directory = os.fsencode('./slack')
for counters, file in enumerate(os.listdir(directory)):
    filename = os.fsdecode(file)
    if filename.endswith(".json"):
        with open('./slack/' + filename) as slackAttachment:
            data = json.loads(slackAttachment.read())
Whenever one pass of that loop is done, everything has been added to data, and at the end we can print the data out or send it through a request to Discord/Slack.
But before I send it to Discord/Slack, I first check whether the name is already in the list. We do that by splitting on each %, which gives us "Name image Number".
In the if statement we check whether a[0] (the current thread's name) is somewhere in the old list.
If it is in the old list, then we check the last number on the object to see whether it is higher or lower.
If it is higher, then we print it out, since it is an increase in value, and then we change the previous old_list value to this new one.
If it is lower, then we just change the previous old_list value to the new one.
If nothing in the old_list matches, then we just append the value to the list.
a = get_value_identifier.split("%")
for i, items in zip(range(len(old_list)), old_list):
    old_list_value = old_list[i].split("%")
    if a[0] in old_list_value[0]:
        if old_list_value[2] < a[2]:
            data["attachments"][0]["title"] = information['name'].upper()
            data_list.append((webhook_list[counters], data))
            for hook, data in data_list:
                threading.Thread(target=sendData, args=(hook, data)).start()
            old_list[i] = get_value_identifier
            break
        elif len(old_list_value[2]) > len(a[2]):
            old_list[i] = get_value_identifier
            break
else:
    data["attachments"][0]["title"] = information['name'].upper()
    data_list.append((webhook_list[counters], data))
    for hook, data in data_list:
        threading.Thread(target=sendData, args=(hook, data)).start()
    old_list.append(get_value_identifier)
And here is the mechanical issue.
The problem is that when we run
directory = os.fsencode('./slack')
for counters, file in enumerate(os.listdir(directory)):
    filename = os.fsdecode(file)
    if filename.endswith(".json"):
        with open('./slack/' + filename) as slackAttachment:
            data = json.loads(slackAttachment.read())
it ends up looping through the block above
a = get_value_identifier.split("%")
for i, items in zip(range(len(old_list)), old_list):
    old_list_value = old_list[i].split("%")
    ......
x times, depending on how many files there are in the slack folder. This becomes a problem after the second pass, because if the first pass finds a new item, it appends it to old_list; that means that when the second Slack/Discord file is processed, old_list already contains that value, and from there things drift. The first pass will always be correct, but after that it no longer gives the correct answers.
So I really have two questions:
How can I make it so that whenever the first pass hits the if statement (or elif, or else), it sends the same thing to all the Discord/Slack webhooks at once?
If it hits the if or elif statement, inside those I take the value from old_list[i] and change it to the "newer" one, which is get_value_identifier. How can I make it do that only once? Because I believe that if I run through 3 Slacks, as in my case, there will be 3 copies of the same value in old_list because of the for loop.
I think that is it from me, and I hope I explained it well enough! If anything more needs to be added, please ask; I will be pretty active now and can edit the question as needed.

streamlining series of try-except + if-statements for faster processing in Python

I'm processing strings using regexes in a bunch of files in a directory. To each line in a file I apply a series of try-statements to match a pattern, and if it matches, I transform the input. After I have analyzed each line, I write it to a new file. I have a lot of these try/except blocks followed by if-statements (I only included two here as an illustration). My issue is that after processing a few files, the script slows down so much that it almost stalls completely. I don't know what in my code is causing the slowdown, but I have a feeling it is the combination of try/except + if-statements. How can I streamline the transformations so that the data is processed at a reasonable speed?
Or do I need a more efficient iterator that does not tax memory to the same extent?
Any feedback would be much appreciated!
import re
import glob

fileCounter = 0
for infile in glob.iglob(r'\input-files\*.txt'):
    fileCounter += 1
    outfile = r'\output-files\output_%s.txt' % fileCounter
    with open(infile, "rb") as inList, open(outfile, "wb") as outlist:
        for inline in inList:
            inword = inline.strip('\r\n')
            #apply some text transformations
            #Transformation #1
            try: result = re.match('^[AEIOUYaeiouy]([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy](.*\[=\].*)*', inword).group()
            except: result = None
            if result == inword:
                inword = re.sub('(?<=^[AEIOUYaeiouy])(?=([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy])', '[=]', inword)
            #Transformation #2 etc.
            try: result = re.match('(.*\[=\].*)*(\w?\w?)[AEIOUYaąeęioóuy]\[=\][ćsśz][ptkbdg][aąeęioóuyrfw](.*\[=\].*)*', inword).group()
            except: result = None
            if result == inword:
                inword = re.sub('(?<=[AEIOUYaąeęioóuy])\[=\](?=[ćsśz][ptkbdg][aąeęioóuyrfw])', '', inword)
                inword = re.sub('(?<=[AEIOUYaąeęioóuy][ćsśz])(?=[ptkbdg][aąeęioóuyrfw])', '[=]', inword)
            outline = inword + "\n"
            outlist.write(outline)
    print "Processed file number %s" % fileCounter
print "*** Processing completed ***"
try/except is indeed not the most efficient way (nor the most readable one) to test the result of re.match(), but the penalty should still be (more or less) constant - performance should not degrade during execution (unless perhaps some worst case is triggered by your data) - so chances are the problem is elsewhere.
FWIW, you can start by replacing your try/except blocks with the canonical solution, i.e. instead of:
try:
    result = re.match(someexp, yourline).group()
except:
    result = None
you want:
match = re.match(someexp, yourline)
result = match.group() if match else None
This will slightly improve performance but, most importantly, make your code more readable and much more maintainable - at least it won't hide any unexpected error.
As a side note, never use a bare except clause; always catch only the expected exceptions (here it would have been an AttributeError, since re.match() returns None when nothing matches, and None has, of course, no attribute group).
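So if you do keep the try/except form somewhere, the narrow version would be something like:
try:
    result = re.match(someexp, yourline).group()
except AttributeError:  # re.match() returned None, so .group() failed
    result = None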
This will very probably NOT solve your problem but at least you'll then know the issue is elsewhere.

Using multiple booleans in an if statement to decide which file to write to

I'm trying to catch 4570 close encounters between planets and output the data into certain files, depending on which two planets had the close encounter. I have 5 planets in total, and each planet has close encounters ONLY with the planet(s) adjacent to it, leaving 4 possible encounter pairs.
data1 = open('data1.txt', 'a+')
data2 = open('data2.txt', 'a+')
data3 = open('data3.txt', 'a+')
data4 = open('data4.txt', 'a+')

for i in range(0, 100000):  #range this big since close encounters don't happen every iteration
    def P_dist(p1, p2):
        #function calculating distances between planets
    init_SMA = [sim.particles[1].a, sim.particles[2].a, sim.particles[3].a, sim.particles[4].a, sim.particles[5].a]
    try:
        sim.integrate(10e+9*2*np.pi)
    except rebound.Encounter as error:
        print(error)
        for j in range(len(init_SMA)-1):
            distance = P_dist(j, j+1)
            if distance <= .01:
                count += 1
                if count > 4570:
                    break
                elif init_SMA[j] == init_SMA[0] and init_SMA[j+1] == init_SMA[1]:
                    #write stuff to data1
                elif init_SMA[j] == init_SMA[1] and init_SMA[j+1] == init_SMA[2]:
                    #write stuff to data2
                elif init_SMA[j] == init_SMA[2] and init_SMA[j+1] == init_SMA[3]:
                    #write stuff to data3
                elif init_SMA[j] == init_SMA[3] and init_SMA[j+1] == init_SMA[4]:
                    #write stuff to data4
#close files
Everyone, I apologize. I left out lots of the code that shows the creation of the planetary system. The main for loop is responsible for creating a planetary system, catching a close encounter, writing it to the files, and repeating until 4570 close encounters have occurred.
It isn't ideal to keep four different files open in a running script. What's more, you haven't opened those files using Python's convenient with context manager, which takes care of cleanly closing opened files among other things. You're also performing open operations every loop iteration - files usually should be opened and closed once as there is a lot of consequential I/O overhead.
As for a cleaner approach, I would conditionally accumulate items/lines in Python data storage objects, then just do a one-off open and write at the end of the script. That way, if something goes awry during the main logic, you don't have files that have been partially written to.
This would be something along the lines of:
create 4 empty lists

for loop:
    logic to conditionally append the lines to be written to the text files to those lists

with open('data1.txt', 'a+') as f:
    write contents of list1 to f

... copy and paste for the remaining 3
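A rough sketch of that outline (the pair_lines name and the placeholder comments are illustrative, not your actual encounter logic):
# Accumulate output lines per encounter pair and write them all once at the end.
pair_lines = [[], [], [], []]  # one list of lines per output file

for i in range(0, 100000):
    # ... integrate, detect a close encounter between planets j and j+1 ...
    # instead of writing immediately, stash the formatted line:
    # pair_lines[j].append(formatted_line)
    pass

filenames = ['data1.txt', 'data2.txt', 'data3.txt', 'data4.txt']
for name, lines in zip(filenames, pair_lines):
    if lines:  # skip pairs that never had an encounter
        with open(name, 'a+') as f:
            f.write('\n'.join(lines) + '\n')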
I'd probably put the four data files in a list, so you can just do:
filesArray = [data1, data2, data3, data4]

#inside your for loop:
if count > 4570:
    break
if distance <= 0.01:
    count += 1
    filesArray[j].write(data)  #for whatever your data is
else:
    break
It would be even better to do
fileNamesArray = ["data1.txt", "data2.txt", "data3.txt", "data4.txt"]

#inside your for loop:
if count > 4570:
    break
if distance <= 0.01:
    count += 1
    with open(fileNamesArray[j], "a") as dataFile:
        dataFile.write(data)  #for whatever your data is
This helps avoid data corruption in case your program crashes for some other reason.
It also avoids storing every result you get in a list in memory, which I'd guess could be expensive for complex simulations.
It does tie your performance to disk speed, though, so I guess it's a tradeoff.
