Running out of memory on Python product iteration chain

I am trying to build a list of possible string combinations so that I can then iterate over it. I am running out of memory executing the line below, which makes sense, because the list would contain several billion strings.
data = list(map(''.join,chain.from_iterable(product(string.digits+string.ascii_lowercase+'/',repeat = i) for i in range(0,7))))
So I think that, rather than creating this massive list up front, I should create and process it in waves, with some kind of "holding string" that I save to disk and can restart from. I.e., generate and iterate over a million rows, then save the holding string to a file; then start up again with the next million rows, beginning the mapping/iteration at the holding string (or the row after it). I have no clue how to do that, and I suspect I may have to drop the chain.from_iterable(product(...)) approach I had implemented. If that idea is not clear (or is clear but stupid), let me know.
Another option, rather than working around the memory issue, would be to somehow shrink or optimize the list itself, but I'm not sure how I would do that either. I'm trying to map an API that has no existing documentation. I don't know whether a non-exhaustive list is the right route to take, so I'm certainly open to suggestions.
Here is the code chunk I've been using:
import csv
import string
from itertools import product, chain

#Open stringfile. If it doesn't exist, create it
try:
    with open(stringfile) as f:
        reader = csv.reader(f, delimiter=',')
        data = list(reader)
except:
    data = list(map(''.join, chain.from_iterable(product(string.digits + string.ascii_lowercase + '/', repeat=i) for i in range(0, 6))))
    f = open(stringfile, 'w')
    f.write('\n'.join(data))
    f.close()
#Iterate against
...
EDIT: Further poking at this led me to this thread, which covers a similar topic. There is discussion about using islice, which helps me after the mapping step (the script crashed last night while doing the API calls due to an error in my exception handling); I just restarted it at roughly the 400,000th item.
Can I use islice within the product generation itself? That is, have the generator produce items 10 million through 12 million (for example) and operate on just those, as a way to preserve memory?
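Something like this is what I'm imagining (a rough sketch; chunk_start and chunk_stop are placeholder bounds I'd set per run):
from itertools import product, chain, islice
import string

chars = string.digits + string.ascii_lowercase + '/'

def candidates(max_len):
    # Lazily yields '', then every 1-character string, every 2-character string,
    # and so on, without ever building the full list in memory.
    return map(''.join, chain.from_iterable(product(chars, repeat=i) for i in range(0, max_len + 1)))

chunk_start, chunk_stop = 10_000_000, 12_000_000  # placeholder bounds for this run
for kw in islice(candidates(6), chunk_start, chunk_stop):
    pass  # API call against kw goes here
From what I've read, islice still has to step the generator past the first chunk_start items internally; it just doesn't keep them, so the skip costs time but not memory.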
Here is the most recent snippet of what I'm doing. You can see I plugged in the islice further down in the actual iteration, but I want to islice in the actual generation (the data = line).
#Open stringfile. If it doesn't exist, create it
try:
    with open(stringfile) as f:
        reader = csv.reader(f, delimiter=',')
        data = list(reader)
except:
    data = list(map(''.join, chain.from_iterable(product(string.digits + string.ascii_lowercase + '/', repeat=i) for i in range(3, 5))))
    f = open(stringfile, 'w')
    f.write(str('\n'.join(data)))
    f.close()

print("Total items: " + str(len(data) - substart))
fdf = pd.DataFrame()
sdf = pd.DataFrame()
qdf = pd.DataFrame()
attctr = 0
#Iterate through the string combination list
for idx, kw in islice(enumerate(data), substart, substop):
    #Attempt API call. Do the cooldown function if there is an issue.
    if idx % 1000 == 0:
        print("Iteration " + str(idx) + " of " + str(len(data)))
    attctr += 1
    if attctr == attcd:
        print("Cooling down!")
        time.sleep(cdtimer)
        attctr = 0
    try:
        ....

Related

Getting the last two variables in for loop

I am trying to make a program that shows me the data for two specific coins. It basically fetches the data in an infinite "for loop" and displays the info until I close the program.
Now I am trying to get the last two elements of this loop every time it runs and do calculations with them. I know I can't just hold all the items in a list, and I am not sure how to store the last two and use them each time.
for line in lines:
    coinsq = line.strip()
    url = priceKey + coinsq + "USDT"
    data = requests.get(url)
    datax = data.json()
    print(datax['symbol'] + " " + datax['price'])
Store the data in a deque (from the collections module).
Initialise your deque like this:
from collections import deque
d = deque([], 2)
Now you can append to d as many times as you like and it will only ever have the most recent two entries.
So, for example:
d.append('a')
d.append('b')
d.append('c')

for e in d:
    print(e)
Will give the output:
b
c
Adapting your code to use this technique should be trivial.
I recommend this approach over using two variables because it's easier to change: if you (for some reason) decide that you want the last N values, all you need to do is change the deque constructor.
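For example, to keep the last five values instead of the last two, only the constructor changes:
d = deque([], 5)  # now only ever holds the five most recent entries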
You can just use two variables that you update for each new element; at the end you will have the last two elements seen:
pre_last = None
last = None
for line in lines:
    coinsq = line.strip()
    url = priceKey + coinsq + "USDT"
    data = requests.get(url)
    datax = data.json()
    print(datax['symbol'] + " " + datax['price'])
    pre_last = last
    last = datax

#Do the required calculations with last and pre_last
(And just to be exact, this isn't an infinite loop; otherwise there wouldn't be a "last" element.)
As your script does not know in advance when execution is going to halt, I suggest defining a queue-like structure. In each iteration, you update your last item and your previous-to-last item, so you only ever keep two elements in memory. I don't know how you were planning on accessing those two elements when execution has finished, but you should be able to read that queue once it is over.
Sorry for not providing code, but this can be done in many ways; I thought it was better to suggest a way of proceeding.
You can define a variable for the second-to-last element of the for loop, and use the datax variable that's already defined in the loop as the last element:
sec_last = None
datax = None
for line in lines:
    sec_last = datax
    coinsq = line.strip()
    url = priceKey + coinsq + "USDT"
    data = requests.get(url)
    datax = data.json()
    print(datax['symbol'] + " " + datax['price'])

print(f"Last element", datax)
print(f"Second Last element", sec_last)

Share variable in concurrent.futures

I am trying to write a word counter with MapReduce using concurrent.futures. I previously made a multithreaded version, but it was slow because it is CPU bound.
I have done the mapping part, which splits the words into ['word1', 1], ['word2', 1], ['word1', 1], ['word3', 1] and divides them between the processes, so each process takes care of a part of the text file. The next step ("shuffling") is to put these words into a dictionary so that it looks like word1: [1, 1], word2: [1], word3: [1], but I cannot share the dictionary between the processes because we are using multiprocessing instead of multithreading. How can I make each process add its "1"s to a dictionary shared between all the processes? I'm stuck with this and can't continue.
I am at this point:
import sys
import re
import concurrent.futures
import time

# Read text file
def input(index):
    try:
        reader = open(sys.argv[index], "r", encoding="utf8")
    except OSError:
        print("Error")
        sys.exit()
    texto = reader.read()
    reader.close()
    return texto

# Convert text to list of words
def splitting(input_text):
    input_text = input_text.lower()
    input_text = re.sub('[,.;:!¡?¿()]+', '', input_text)
    words = input_text.split()
    n_processes = 4
    # Creating processes
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = []
        for id_process in range(n_processes):
            results.append(executor.submit(mapping, words, n_processes, id_process))
        for f in concurrent.futures.as_completed(results):
            print(f.result())

def mapping(words, n_processes, id_process):
    word_map_result = []
    for i in range(int((id_process / n_processes) * len(words)),
                   int(((id_process + 1) / n_processes) * len(words))):
        word_map_result.append([words[i], 1])
    return word_map_result

if __name__ == '__main__':
    if len(sys.argv) == 1:
        print("Please, specify a text file...")
        sys.exit()
    start_time = time.time()
    for index in range(1, len(sys.argv)):
        print(sys.argv[index], ":", sep="")
        text = input(index)
        splitting(text)
        # for word in result_dictionary_words:
        #     print(word, ':', result_dictionary_words[word])
    print("--- %s seconds ---" % (time.time() - start_time))
I've seen that when doing concurrent programming it is usually best to avoid shared state as far as possible, so how can I implement a MapReduce word count without sharing the dictionary between processes?
You can create a shared dictionary using a Manager from multiprocessing. I understand from your program that it is your word_map_result you need to share.
You could try something like this
from multiprocessing import Manager

...

def splitting():
    ...
    word_map_result = Manager().dict()
    with concurrent.futures.....:
        ...
        results.append(executor.submit(mapping, words, n_processes, id_process, word_map_result))
        ...
    ...

def mapping(words, n_processes, id_process, word_map_result):
    for ...
        # Do not return anything - word_map_result is up to date in your main process
Basically you remove the local copy of word_map_result from your mapping function and pass the managed dict in as a parameter instead. This word_map_result is then shared between all your subprocesses and the main program. Managers add data transfer overhead, though, so this might not help you very much.
In this case you do not return anything from the workers, so you do not need the for loop that processes results in your main program either; word_map_result ends up identical in all subprocesses and the main program.
I may have misunderstood your problem, and I am not familiar enough with the algorithm to say whether it could be re-engineered so that nothing needs to be shared between processes.
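To make that concrete, here is a minimal, self-contained sketch of the Manager-based version (the names follow the question's code; the lock is my addition, because a get-then-set on the shared proxy is not atomic):
import concurrent.futures
from collections import defaultdict
from multiprocessing import Manager

def mapping(words, n_processes, id_process, word_map_result, lock):
    start = int((id_process / n_processes) * len(words))
    stop = int(((id_process + 1) / n_processes) * len(words))
    local = defaultdict(list)          # build the partial result locally first
    for word in words[start:stop]:
        local[word].append(1)
    with lock:                         # serialize updates to the shared proxy
        for word, ones in local.items():
            word_map_result[word] = word_map_result.get(word, []) + ones

def splitting(words, n_processes=4):
    manager = Manager()
    word_map_result = manager.dict()   # shared between the main process and workers
    lock = manager.Lock()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(mapping, words, n_processes, i, word_map_result, lock)
                   for i in range(n_processes)]
        concurrent.futures.wait(futures)
    return dict(word_map_result)       # e.g. {'word1': [1, 1], 'word2': [1], ...}
As with any ProcessPoolExecutor code, the call to splitting() needs to sit under an if __name__ == '__main__': guard on platforms that use the spawn start method.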
It seems like a misconception to be using multiprocessing at all. First, there is overhead in creating the pool and overhead in passing data to and from the processes. And if you decide to use a shared, managed dictionary that worker function mapping can use to store its results in, know that a managed dictionary uses a proxy, the accessing of which is rather slow. The alternative to using a managed dictionary would be as you currently have it, i.e. mapping returns a list and the main process uses those results to create the keys and values of the dictionary. But what then is the point of mapping returning a list where each element is always a list of two elements where the second element is always the constant value 1? Isn't that rather wasteful of time and space?
I think your performance will be no faster (probably slower) than just implementing splitting as:
# Convert text to list of words
def splitting(input_text):
    input_text = input_text.lower()
    input_text = re.sub('[,.;:!¡?¿()]+', '', input_text)
    words = input_text.split()
    results = {}
    for word in words:
        results.setdefault(word, []).append(1)   # word1: [1, 1], word2: [1], ...
    return results
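(As an aside: if plain counts are what those word1: [1, 1] lists are ultimately for, collections.Counter does the whole reduce step in one call.)
from collections import Counter

words = "to be or not to be".split()   # stand-in for the splitting() word list
counts = Counter(words)                # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})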

Storing Multi-dimensional Lists?

(Code below)
I'm scraping a website, and the data I'm getting back is in two multi-dimensional arrays. I want everything in JSON format because I want to save it and load it in again later when I add "tags".
So, to be less vague: I'm writing a program which takes in data like which characters you have and what missions require you to do (you can complete multiple at once if the attributes align), then checks that against a list of attributes that each character fulfills, and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data, but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name, so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays, one for the headers of the table and one for the rows of the table. The rows contain the "answers" to the headers' "questions"/"titles"; e.g. Maximum Level, 50.
This is true for everything but the first entry, which is the Name, Pronunciation (and I just want to store the name, of course).
So:
Iterations = 0
While loop based on RowArray length / 9 (while Iterations <= that):
    HeaderArray[0] gives me the name
    RowArray[Iterations + 1] gives me data type 2
    RowArray[Iterations + 2] gives me data type 3
    Repeat until Array[Iterations + 8]
    Iterations += 9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure whether that's going to make this easier or not, because my end goal here is to send "CharacterName" and get stuff back based on that, AND to be able to send in "DesiredTraits" and get back "CharacterNames who fit that trait". Which means I also need to figure out how to store that category data semi-efficiently. There are over 80 possible categories and most characters only fit into about 10, and I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code-readability reasons - I don't want a file for each character.
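Roughly the shape I'm picturing (a made-up example; the values come from elsewhere in this post and the key names are placeholders):
TsumData = {
    "Mickey": {
        "Maximum Level": 50,
        "Tags": ["Black", "White Gloves", "Mickey & Friends"],
        # ...the rest of the infobox fields...
    },
    # ...one entry per character...
}

# Look up by name:
TsumData["Mickey"]["Maximum Level"]
# Look up by trait:
matching = [name for name, info in TsumData.items() if "Black" in info["Tags"]]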
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os

target_dir = r"D:\00Coding\Js\WebScraper"  #Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir, 'TsumData.txt')
StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
    links = [urljoin(URLPrefix, a['href']) for a in a_tags]  # convert relative url to absolute url
    return links

def get_tds(link):
    soup = make_soup(link)
    #tds = soup.find_all('li', class_="category normal")  #This will give me the attributes / tags of each character
    tds = soup.find_all('table', class_="wikia-infobox")
    RowArray = []
    HeaderArray = []
    if tds:
        for td in tds:
            #print(td.text.strip())  #This is everything
            rows = td.findChildren('tr')  #[0]
            headers = td.findChildren('th')  #[0]
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub('\s+', ' ', cell_content).strip()
                    if clean_content:
                        RowArray.append(clean_content)
            for row in rows:
                cells = row.findChildren('th')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub('\s+', ' ', cell_content).strip()
                    if clean_content:
                        HeaderArray.append(clean_content)
            print(HeaderArray)
            print(RowArray)
    return(RowArray, HeaderArray)
    #Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
    #print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
    #TempFile = open(fullname, 'w')  #Read only, Write Only, Append
    #TempFile.write("EHLLO")
    #TempFile.close()
    #print(td.tbody.Series)
    #print(td.tbody[Series])
    #print(td.tbody["Series"])
    #print(td.data-name)
    #time.sleep(1)

if __name__ == '__main__':
    links = get_links(StartURL)
    MainHeaderArray = []
    MainRowArray = []
    MaxIterations = 60
    Iterations = 0
    for link in links:  #Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
        #print("Getting tds calling")
        if Iterations > 38:  #There are this many webpages it'll first look at that don't have the data I need
            TempRA, TempHA = get_tds(link)
            MainHeaderArray.append(TempHA)
            MainRowArray.append(TempRA)
            MaxIterations -= 1
        Iterations += 1
        #print(MaxIterations)
        if MaxIterations <= 0:  #I don't want to scrape the entire website for a prototype
            break
        #print("This is the end ??")
        #time.sleep(3)
    #jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
    print(MainHeaderArray)
    #time.sleep(2.5)
    #print(MainRowArray)
    #time.sleep(2.5)
    #print(zip())
    TsumName = []
    TsumSeries = []
    TsumBoxType = []
    TsumSkillDescription = []
    TsumFullCharge = []
    TsumMinScore = []
    TsumScoreIncreasePerLevel = []
    TsumMaxScore = []
    TsumFullUpgrade = []
    Iterations = 0
    MaxIterations = len(MainRowArray)
    while Iterations <= MaxIterations:  #This will fire 1 time per Tsum
        print(Iterations)
        print(MainHeaderArray[Iterations][0])  #Holy this gives us Mickey ;
        print(MainHeaderArray[Iterations+1][0])
        print(MainHeaderArray[Iterations+2][0])
        print(MainHeaderArray[Iterations+3][0])
        TsumName.append(MainHeaderArray[Iterations][0])
        print(MainRowArray[Iterations][1])
        #At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
        #isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
        TsumSeries[Iterations] = MainRowArray[Iterations+1]
        TsumBoxType[Iterations] = MainRowArray[Iterations+2]
        TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
        TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
        TsumMinScore[Iterations] = MainRowArray[Iterations+5]
        TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
        TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
        TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
        Iterations += 9
        print(Iterations)
    print("It's Over")
    time.sleep(3)
    print(TsumName)
    print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back from a short break. With some minor looking over I see what happened: my append code says "append this list to the array", meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se, but two lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me find an answer quickly now that I'm not throwing "multi-dimensional list" around my Google searches or wondering why the list operations aren't working (they expect one value and get a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use with open(...) as json_file and json write/read (super easy).
I ultimately stored 3 JSON files. No big deal, and much easier than appending into one big file.
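For anyone landing here later, the read/write part is just this (a sketch; the filename is a placeholder):
import json

# write
with open("TsumData.json", "w") as json_file:
    json.dump(TsumData, json_file, indent=1)

# read
with open("TsumData.json") as json_file:
    TsumData = json.load(json_file)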

Python - how do I save an itertools.product loop and resume where it left off

I am writing a python script to basically check every possible url and log it if it responds to a request.
I found a post on StackOverflow that suggested a method of generating the strings for the urls which works well.
for n in range(1, 4 + 1):
    for comb in product(chars, repeat=n):
        url = ("http://" + ''.join(comb) + ".com")
        currentUrl = url
        checkUrl(url)
As you can imagine there are way too many URLs and it is going to take a very long time, so I am trying to find a way to save my script's progress and resume from where it left off.
My question is: how can I have the loop start from a specific place? Or does anyone have a working piece of code that does the same thing and allows me to specify a starting point?
This is my script so far:
import urllib.request
from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product

goodUrls = "Valid_urls.txt"
saveFile = "save.txt"
currentUrl = ''

def checkUrl(url):
    print("Trying - " + url)
    try:
        urllib.request.urlopen(url)
    except Exception as e:
        pass
    else:
        log = open(goodUrls, 'a')
        log.write(url + '\n')

chars = digits + ascii_lowercase

try:
    while True:
        for n in range(1, 4 + 1):
            for comb in product(chars, repeat=n):
                url = ("http://" + ''.join(comb) + ".com")
                currentUrl = url
                checkUrl(url)
except KeyboardInterrupt:
    print("Saving and Exiting")
    open(saveFile, 'w').write(currentUrl)
The return value of itertools.product is an iterator object, and an iterator remembers how far it has been consumed. As such, all you have to do is:
products = product(...)

for foo in products:
    if bar(foo):
        spam(foo)
        break

# other stuff

for foo in products:
    # starts where you left off.
    ...
In your case the time taken to iterate through the possibilities is pretty small, at least compared to the time it'll take to make all those network requests. You could either save all the possibilities to disk and dump a list of what's left after every run of the program, or you could just save which number you're on. Since product has deterministic output, that should do it.
try:
    with open("progress.txt") as f:
        first_up = int(f.read().strip())
except FileNotFoundError:
    first_up = 0

try:
    for i, foo in enumerate(products):
        if i < first_up:
            continue  # skip the iterations we've already done
        # do stuff down here
except KeyboardInterrupt:
    # this is really rude to do, by the by....
    print("Saving and exiting")
    with open("progress.txt", "w") as f:
        f.write(str(i))
If there's some reason you need a human-readable "progress" file, you can save the last string you generated (e.g. ''.join(comb) rather than the full URL) in the same way as above, and do:
for foo in itertools.dropwhile(lambda p: ''.join(p) != saved_string, products):
    # do stuff
Although the attempt to find all the URLs by this method is ridiculous, the general question posed is a very good one. The short answer is that you cannot pickle an iterator in a straightforward way, because the pickle mechanism can't save the iterator's internal state. However, you can pickle an object that implements both __iter__ and __next__. So if you create a class that has the desired functionality and also works as an iterator (by implementing those two functions), it can be pickled and reloaded. The reloaded object, when you make an iterator from it, will continue from where it left off.
#! python3.6
import pickle

class AllStrings:
    CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"

    def __init__(self):
        self.indices = [0]

    def __iter__(self):
        return self

    def __next__(self):
        s = ''.join([self.CHARS[n] for n in self.indices])
        for m in range(len(self.indices)):
            self.indices[m] += 1
            if self.indices[m] < len(self.CHARS):
                break
            self.indices[m] = 0
        else:
            self.indices.append(0)
        return s

try:
    with open("bookmark.txt", "rb") as f:
        all_strings = pickle.load(f)
except IOError:
    all_strings = AllStrings()

try:
    for s in iter(all_strings):
        print(s)
except KeyboardInterrupt:
    with open("bookmark.txt", "wb") as f:
        pickle.dump(all_strings, f)
This solution also removes the limitation on the length of the string. The iterator will run forever, eventually generating all possible strings. Of course at some point the application will stop due to the increasing entropy of the universe.

How to reduce latency while reading data from csv file?

I have an Excel file with 2000 rows, each containing one item, like:
a.xls

RowNum    Item
1         'A'
2         'B'
3         'C'
...
2000      'xyz'
I have another file, b.xls, which contains about 6,300,000 rows of data. In this file there are occurrences of the items from a.xls. I need to pick all the rows from b.xls corresponding to each item in a.xls and store them in separate files called A.csv, B.csv, etc.
I did it using multithreading, but it takes a lot of time to execute. Can anybody help me reduce the latency?
This is the code I have used. The following function is started in a thread:
def parseFromFile(pTickerList):
    global gSearchList
    lSearchList = gSearchList
    for lTickerName in pTickerList:
        c = csv.writer(open("op-new/" + lTickerName + ".csv", "wb"))
        c.writerow(["Ticker Name", "Time Stamp", "Price", "Size"])
        for line in lSearchList:
            lSplittedLine = line.split(",")
            lTickerNameFromSearchFile = lSplittedLine[0].strip()
            if lTickerNameFromSearchFile[0] == "#":
                continue
            if ord(lTickerName[0]) < ord(lTickerNameFromSearchFile[0]):
                break
            elif ord(lTickerName[0]) > ord(lTickerNameFromSearchFile[0]):
                continue
            if lTickerNameFromSearchFile == lTickerName:
                lTimeStamp = Decimal(float(lSplittedLine[1]))
                lPrice = lSplittedLine[2]
                lSize = lSplittedLine[4]
                try:
                    if str(lTimeStamp)[len(str(lTimeStamp))-2:] == "60":
                        lTimeStamp = str(lTimeStamp)[:len(str(lTimeStamp))-2] + "59.9"
                    if str(lTimeStamp).find(".") >= 0:
                        lTimeStamp = float(str(lTimeStamp).split(".")[0] + "." + str(lTimeStamp).split(".")[1][0])
                        lTimeStamp1 = "%.1f" % float(lTimeStamp)
                        lHumanReadableTimeStamp = datetime.strptime(str(lTimeStamp1), "%Y%m%d%H%M%S.%f")
                    else:
                        lHumanReadableTimeStamp = datetime.strptime(str(lTimeStamp), "%Y%m%d%H%M%S")
                except Exception, e:
                    exc_type, exc_obj, exc_tb = sys.exc_info()
                    fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
                    print(exc_type, fname, exc_tb.tb_lineno)
                    print line
                    print lTimeStamp
                    raw_input()
                c.writerow([lTickerNameFromSearchFile, lHumanReadableTimeStamp, lPrice, lSize])
It's hard to look through your code and fully understand it because it's referencing variables differently than your explanation, but I believe this approach will help you.
Start by reading all of a.csv into a set containing the items you want to look up. Sets in Python have very fast lookup times. This will also help because it looks like you do a lot of repeated computation in the inner loop of your code above.
Then read through b.csv, checking each row against that set. Whenever you find a match, write the row out to the corresponding file (A.csv, B.csv, etc.).
The big speedups over your current setup are removing the repeated calculations in the inner loop and removing the need for threads. Because a.csv is only 2000 lines, it will be incredibly fast to read.
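Here is a rough sketch of that shape (Python 3; I'm guessing the column layout from your code, with the ticker name in column 0, and leaving your timestamp handling out; the file names are placeholders):
import csv

# Read the 2000 lookup items into a set: membership tests are effectively O(1).
with open("a.csv", newline="") as f:
    wanted = {row[0].strip() for row in csv.reader(f)}

writers = {}   # item -> csv.writer for its output file, opened on first match
handles = []   # keep the file objects so we can close them at the end

with open("b.csv", newline="") as f:
    for row in csv.reader(f):
        item = row[0].strip()
        if not item or item.startswith("#") or item not in wanted:
            continue
        if item not in writers:
            out = open("op-new/%s.csv" % item, "w", newline="")  # op-new/ must already exist
            handles.append(out)
            writers[item] = csv.writer(out)
            writers[item].writerow(["Ticker Name", "Time Stamp", "Price", "Size"])
        # your timestamp reformatting would go here
        writers[item].writerow([item, row[1], row[2], row[4]])

for out in handles:
    out.close()
One caveat: with ~2000 distinct items this keeps ~2000 files open at once, which can hit the OS open-file limit; if that happens, buffer the matching rows per item in a dict of lists and write each file once at the end.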
Let me know if you want me to expand on any part of this.
