How can I avoid nested loops for maximum efficiency?

I wrote this to iterate through a series of Facebook user likes. The scrubbing process requires the code to first pick a user, then a like, then a character from that like. If too many characters in a like are not English characters (i.e. not in the alphanum string), the like is assumed to be gibberish and is removed.
This filtering process continues through all likes and all users. I know having nested loops is a no-no, but I don't see a way to do this without a triple nested loop. Any suggestions? Additionally, if anyone has any other advice on efficiency or conventions, I would love to hear it.
def cleaner(likes_path):
    '''
    estimated run time for 170k users: 3min
    this method takes a given csv format datasheet of noisy facebook likes.
    data is scrubbed row by row (meaning user by user) removing 'likes' that are not useful
    data is parsed into manageable size specified files.
    if more data is continuously added method will just keep adding new files
    if more data is added at a later time choosing a new folder to put it in would
    work best so that the update method can add it to existing counts instead
    of starting over
    '''
    with open(os.path.join(likes_path)) as likes:
        dct = [0]
        file_num = 0
        #initializes naming scheme for self-numbering files
        file_size = 30000
        #sets file size to 30000 userId's
        alphanum = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890 $%#-'
        user_count = 0
        too_big = 1000
        too_long = 30
        for rows in likes:
            repeat_check = []
            user_count += 1
            user_likes = make_like_list(rows)
            to_check = user_likes[1:]
            if len(to_check) < too_big:
                #users with more than 1000 likes take up much more resources/time
                #and are of less analytical value
                for like in to_check:
                    if len(like) > too_long or len(like) == 0:
                        #This changes the filter sensitivity. Most useful likes
                        #are under 30 char long
                        user_likes.remove(like)
                    else:
                        letter_check = sum(1 for letter in like[:5] if letter in alphanum)
                        if letter_check < len(like[:5])-1:
                            user_likes.remove(like)
                if len(user_likes) > 1 and len(user_likes[0]) == 32:
                    #userID's are 32 char long, this filters out some mistakes
                    #filters out users with no likes
                    scrubbed_to_check = user_likes[1:]
                    for like in scrubbed_to_check:
                        if like == 'Facebook' or like == 'YouTube':
                            #youtube and facebook are very common likes but
                            #aren't very useful
                            user_likes.remove(like)
                        #removes duplicate likes
                        elif like not in repeat_check:
                            repeat_check.append(like)
                        else:
                            user_likes.remove(like)
                    scrubbed_rows = '"'+'","'.join(user_likes)+'"\n'
                    if user_count%file_size == 1:
                        #This block allows for data to be parsed into
                        #multiple smaller files
                        file_num += 1
                        dct.append(file_num)
                        dct[file_num] = open(file_write_path + str(file_num) +'.csv', 'w')
                        if file_num != 1:
                            dct[file_num-1].close()
                    dct[file_num].writelines(scrubbed_rows)
            if user_counter(user_count, 'Users Scrubbed:', 200000):
                break
        print 'Total Users Scrubbed:', user_count
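
A direction that is often suggested for this kind of cleanup is to express the per-like checks as a single predicate and rebuild the like list, instead of calling remove inside nested loops. A minimal sketch of that idea, reusing the thresholds above (hypothetical helper names, not the original make_like_list pipeline):

ALPHANUM = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890 $%#-')

def is_useful(like):
    # Same checks as the inner loops: length bounds, common junk likes,
    # and a mostly-alphanumeric prefix.
    if not like or len(like) > 30 or like in ('Facebook', 'YouTube'):
        return False
    prefix = like[:5]
    return sum(letter in ALPHANUM for letter in prefix) >= len(prefix) - 1

def scrub_likes(user_likes):
    # Keep the user ID, drop junk likes, and de-duplicate while preserving order.
    user_id, likes = user_likes[0], user_likes[1:]
    seen = set()
    kept = []
    for like in likes:
        if is_useful(like) and like not in seen:
            seen.add(like)
            kept.append(like)
    return [user_id] + kept

The character loop disappears into the sum, and since remove rescans the list on every call, building a new list tends to be faster as well as flatter.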

Related

How can I loop or RegEx through a text stream in Python?

I need to look for patterns in huge text files (around 100 billion characters) that I can only viably access through a URL. Before the files got so big, I was just running a for loop through a str input with this function:
def check_text_for_pattern(source_text, substring_size):
    substrings_counter = 0
    unique_substrings = [""]
    is_unique = True
    print("Looking for desired patterns within source text")
    for x in range(len(source_text)):
        substring_candidate = source_text[x - 1: substring_size + x - 1]
        ***pattern rules***
        for y in unique_substrings:
            if y == substring_candidate:
                is_unique = False
        if is_unique:
            print("New unique substring found: " + substring_candidate)
            unique_substrings[substrings_counter] = substring_candidate
            substrings_counter += 1
        is_unique = True
    return unique_substrings
It is working well, but I can't seem to figure out the right way to loop through data that is not completely loaded from the start, partly because a data stream has no len. How can I keep moving through character subsets without missing any in a scenario like that? Also, since there are multiple files to run through now, how do I signal to my code that the end of a file has been reached so it can move on to the next URL?
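
A common way to handle this is to read the response in fixed-size chunks and carry the last substring_size - 1 characters over into the next chunk, so no window that spans a chunk boundary is missed; when the stream stops yielding chunks, that file is done and the code can move on to the next URL. A rough sketch with requests (hypothetical function name, assuming the text can be decoded as it streams):

import requests

def iter_windows(url, substring_size, chunk_size=1024 * 1024):
    # Stream the text in chunks, keeping an overlap so windows that span
    # a chunk boundary are not missed.
    tail = ""
    with requests.get(url, stream=True) as response:
        for chunk in response.iter_content(chunk_size=chunk_size, decode_unicode=True):
            text = tail + chunk
            for i in range(len(text) - substring_size + 1):
                yield text[i:i + substring_size]
            tail = text[-(substring_size - 1):] if substring_size > 1 else ""
    # iter_content simply stops at end of file, so the generator ends with it.

# Hypothetical usage over several URLs:
# for url in urls:
#     unique_substrings = set()
#     for candidate in iter_windows(url, 10):
#         unique_substrings.add(candidate)  # apply the pattern rules here instead

Each window is then a substring_candidate just like in the original function, but the whole file never has to sit in memory at once.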

Using program in another file gives different output

I'm having a rather unique issue with my code that I have not experienced before and could use some guidance.
Here is an attempt at a short explanation:
Basically, I have a program with many functions that are tied to one main one. It takes in data from files sent to it and gives output based on many factors. Running this function in the file itself gives the proper results; however, if I import this function and run it in main.py, it gives very, very incorrect output.
I am going to do my best to show the least amount of code in this post, so here is the GitHub. Please use it for further reference and understanding of what is happening. I don't know any websites that I can use to link and run my code for these purposes.
sentiment_analysis.py is the file with all of the functions. main.py is the file that utilizes it all, and driver.py is the file given by my prof to test this assignment.
Basic assignment explanation (skip if not needed for answering the question): Take in Twitter data from the given files along with keywords that have an associated happiness value. Take all data, split it into timezone regions (an approximation based on given point values, not real timezones), and then give back basic information about the data in the files, i.e. average happiness per timezone, total keyword tweets, and total tweets for each region.
Running sentiment_analysis will currently give correct output based on heavy testing.
Running main and driver will give incorrect output. Ex. tweets2 has 25 total lines of twitter data, but using driver will return 91 total tweets and keyword tweets (eastern data, 4th test scenario in driver.py) instead of the expected 15 total tweets in that region.
I've spent about 3 hours testing scenarios and outputting different information to try and debug but have had no luck. If anyone has any idea why it's returning different outputs when called in a different file, that would be great.
The following are the three most important functions in the file, with the first being the one called in another file.
def compute_tweets(tweets, keywords):
    try:
        with open(tweets, encoding="utf-8", errors="ignore") as f:  # opens the file
            tweet_list = f.read().splitlines()  # reads and splitlines the file. Gets rid of the \n
            print(tweet_list)
        with open(keywords, encoding="utf-8", errors="ignore") as f:
            keyword_dict = {k: int(v) for line in f for k, v in [line.strip().split(',')]}
            # instead of opening this file normally i am using dictionary comprehension to turn the entire file into a dictionary
            # instead of the standard list which would come from using the readlines() function.
        determine_timezone(tweet_list)  # this will run the function to split all pieces of the file into region specific ones
        eastern = calculations(keyword_dict, eastern_list)
        central = calculations(keyword_dict, central_list)
        mountain = calculations(keyword_dict, mountain_list)
        pacific = calculations(keyword_dict, pacific_list)
        return final_calculation(eastern, central, mountain, pacific)
    except FileNotFoundError as excpt:
        empty_list = []
        print(excpt)
        print("One or more of the files you entered does not exist.")
        return empty_list
# Constants for Timezone Detection
# eastern begin
p1 = [49.189787, -67.444574]
p2 = [24.660845, -67.444574]
# Central begin, eastern end
p3 = [49.189787, -87.518395]
# p4 = [24.660845, -87.518395] - Not needed
# Mountain begin, central end
p5 = [49.189787, -101.998892]
# p6 = [24.660845, -101.998892] - Not needed
# Pacific begin, mountain end
p7 = [49.189787, -115.236428]
# p8 = [24.660845, -115.236428] - Not needed
# pacific end, still pacific
p9 = [49.189787, -125.242264]
# p10 = [24.660845, -125.242264]

def determine_timezone(tweet_list):
    for index, tweet in enumerate(tweet_list):  # takes in index and tweet data and creates a for loop
        long_lat = get_longlat(tweet)  # determines the longlat for the tweet that is currently needed to work on
        if float(long_lat[0]) <= float(p1[0]) and float(long_lat[0]) >= float(p2[0]):
            if float(long_lat[1]) <= float(p1[1]) and float(long_lat[1]) > float(p3[1]):
                # this is testing for the eastern region
                eastern_list.append(tweet_list[index])
            elif float(long_lat[1]) <= float(p3[1]) and float(long_lat[1]) > float(p5[1]):
                # testing for the central region
                central_list.append(tweet_list[index])
            elif float(long_lat[1]) <= float(p5[1]) and float(long_lat[1]) > float(p7[1]):
                # testing for mountain region
                mountain_list.append(tweet_list[index])
            elif float(long_lat[1]) <= float(p7[1]) and float(long_lat[1]) >= float(p9[1]):
                # testing for pacific region
                pacific_list.append(tweet_list[index])
            else:
                # if nothing is found, continue to the next element in the tweet data and do nothing
                continue
        else:
            # if nothing is found for the longitude, then also continue
            continue
def calculations(keyword_dict, tweet_list):
    # - Constants for calculations and returns
    total_tweets = 0
    total_keyword_tweets = 0
    average_happiness = 0
    happiness_sum = 0
    for entry in tweet_list:  # saying for each piece of the tweet list
        word_list = input_splitting(entry)  # run through the input splitting for list of words
        total_tweets += 1  # add one to total tweets
        keyword_happened_counter = 0  # this is used to know if the word list has already had a keyword tweet. Needs to be
        # reset to 0 again in this spot.
        for word in word_list:  # for each word in that word list
            for key, value in keyword_dict.items():  # take the key and respective value for each item in the dict
                # print("key:", key, "val:", value)
                if word == key:  # if the word we got is the same as the key value
                    if keyword_happened_counter == 0:  # and the keyword counter hasnt gone up
                        total_keyword_tweets += 1  # add one to the total keyword tweets
                        keyword_happened_counter += 1  # then add one to keyword happened counter
                    happiness_sum += value  # and, if we have a keyword tweet, no matter what add to the happiness sum
                else:
                    continue  # if we don't have a word == key, continue iterating.
    if total_keyword_tweets != 0:
        average_happiness = happiness_sum / total_keyword_tweets  # calculation for the average happiness value
    else:
        average_happiness = 0
    return [average_happiness, total_keyword_tweets, total_tweets]  # returning a list of info in proper order
My apologies for the wall of both text and code. I'm new to making posts on here and am trying to include all relevant information... If anyone knows of a better way to do this aside from using github and code blocks, please do let me know.
Thanks in advance.
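
One thing that stands out in the snippet: determine_timezone appends to eastern_list, central_list, mountain_list and pacific_list, which are not defined inside any of the functions shown, so they are presumably module-level lists that live for as long as the module stays imported. A minimal sketch (hypothetical names, not the actual sentiment_analysis.py) of how that kind of module-level state accumulates across repeated calls:

# region_state.py -- hypothetical module illustrating shared module-level state
eastern_list = []  # created once, at import time

def add_tweet(tweet):
    eastern_list.append(tweet)  # every call appends to the same list
    return len(eastern_list)

if __name__ == "__main__":
    # A caller that imports this module and calls add_tweet repeatedly
    # (say, from several test scenarios in a driver script) sees the
    # counts keep growing, because the list is never reset:
    print(add_tweet("a"))  # 1
    print(add_tweet("b"))  # 2
    print(add_tweet("c"))  # 3

If that is what is happening here, re-creating or clearing those lists at the start of each compute_tweets call (or having determine_timezone return fresh lists) would be the usual fix.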

CS50 'DNA': Ways to speed up my Week 6 'dna.py' program?

So for this problem I had to create a program that takes in two arguments. A CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program works for sure, but it is ridiculously slow whenever it is run using the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. Unfortunately this is causing the 'check50' marking system to time out and return a negative result upon testing with this large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
    # Creates a list to store max recurrence values for each STR
    STR_count_values = [0] * STR_array_len
    # Temp value to store current count of STR recurrence
    temp_value = 0
    # Iterates over each STR in STR_array
    for i in range(STR_array_len):
        STR_len = len(STR_array[i])
        # Iterates over each sequence element
        for j in range(seq_len):
            # Ensures it's still physically possible for STR to be present in sequence
            while (seq_len - j >= STR_len):
                # Gets sequence substring of length STR_len, starting from jth element
                sub = sequence[j:(j + (STR_len))]
                # Compares current substring to current STR
                if (sub == STR_array[i]):
                    temp_value += 1
                    j += STR_len
                else:
                    # Ensures current STR_count_value is highest
                    if (temp_value > STR_count_values[i]):
                        STR_count_values[i] = temp_value
                    # Resets temp_value to break count, and pushes j forward by 1
                    temp_value = 0
                    j += 1
        i += 1
    return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        name_array = [] * (STR_array_len + 1)
        next(database)
        # Iterates over one row of database at a time
        for row in database:
            name_array.clear()
            # Copies entire row into name_array list
            for column in row:
                name_array.append(column)
            # Converts name_array number strings to actual ints
            for i in range(STR_array_len):
                name_array[i + 1] = int(name_array[i + 1])
            # Checks if a row's STR values match the sequence's values, prints the row name if match is found
            match = 0
            for i in range(0, STR_array_len, + 1):
                if (name_array[i + 1] == STR_values[i]):
                    match += 1
            if (match == STR_array_len):
                print(name_array[0])
                exit()
        print("No match")
        exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x
Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say it's just a lack of knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness - I haven't profiled it or anything, but my first impulse would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). If there is a match in the sequence, the total number of repetitions is equal to the length of the match divided by the length of the current tandem repeat. For example, if AGATCAGATCAGATC is present in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC") which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
    import argparse
    from csv import DictReader
    import re

    parser = argparse.ArgumentParser()
    parser.add_argument("database_filename")
    parser.add_argument("sequence_filename")
    args = parser.parse_args()

    with open(args.database_filename, "r") as file:
        reader = DictReader(file)
        short_tandem_repeats = reader.fieldnames[1:]
        people = list(reader)

    with open(args.sequence_filename, "r") as file:
        sequence = file.read().strip()

    count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))
    for short_tandem_repeat in short_tandem_repeats:
        pattern = f"({short_tandem_repeat}){{1,}}"
        match = re.search(pattern, sequence)
        if match is None:
            continue
        count[short_tandem_repeat] = len(match.group()) // len(short_tandem_repeat)

    try:
        person = next(person for person in people if all(int(person[k]) == count[k] for k in short_tandem_repeats))
        print(person["name"])
    except StopIteration:
        print("No match")

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
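
One caveat about the regex step above: re.search only looks at the first run of each repeat in the sequence, while the task asks for the longest consecutive run, which could occur later. A small variation on the same idea (hypothetical helper name) that scans every run and keeps the longest:

import re

def longest_run(sequence, repeat):
    # Find every consecutive run of the repeat and keep the longest one.
    runs = re.finditer(f"(?:{repeat})+", sequence)
    return max((len(m.group()) // len(repeat) for m in runs), default=0)

# The longest AGATC run here is 4, even though the first run has length 1.
print(longest_run("AGATCTTTTAGATCAGATCAGATCAGATC", "AGATC"))  # 4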

Iterate over list of multiple strings using for loop

I'm fairly new to coding in Python. For a personal project, I'm looking for different ways to retrieve birthdays and days of death from a list of Wikipedia pages. I am using the wikipedia package.
One way I'm trying to achieve that is by iterating over the Wikipedia summary and returning the index once I have counted four digits in a row.
import wikipedia as wp

names = ('Zaha Hadid', 'Rem Koolhaas')
wiki_summary = wp.summary(names)

b_counter = 0
i_b_year = []
d_counter = 0
i_d_year = []

for i, x in enumerate(wiki_summary):
    if x.isdigit() == True:
        b_counter += 1
        if b_counter == 4:
            i_b_year = i
            break
        else:
            continue
    else:
        b_counter = 0
So far, that works for the first person in my list, but I would like to iterate over all the names in my names list. Is there a way to keep the for loop that finds the index and add another for loop that iterates over the names?
I know there are other ways like parsing to find the bday tags, but I would like to try a couple of different solutions.
You are trying to:
Declare two empty lists to store birth year and death year of each person.
Get Wikipedia summary of each person from a tuple.
Parse first two numbers with 4 digits from the summary and append them to birth year and death year list.
The problem is that a person's summary may not include the birth year and death year as the first two 4-digit numbers. For example, Rem Koolhaas's Wikipedia summary includes his birth year as the first 4-digit number, but the second 4-digit number is in this line: In 2005, he co-founded Volume Magazine together with Mark Wigley and Ole Bouman.
We can see that the birth_year and death_year lists may therefore contain inaccurate information.
Here is the code that does what you are trying to achieve:
import wikipedia as wp

names = ('Zaha Hadid', 'Rem Koolhaas')
i_b_year = []
i_d_year = []

for person_name in names:
    wiki_summary = wp.summary(person_name)
    birth_year_found = False
    death_year_found = False
    digits = ""
    for c in wiki_summary:
        if c.isdigit() == True:
            if birth_year_found == False:
                digits += c
                if len(digits) == 4:
                    birth_year_found = True
                    i_b_year.append(int(digits))
                    digits = ""
            elif death_year_found == False:
                digits += c
                if len(digits) == 4:
                    death_year_found = True
                    i_d_year.append(int(digits))
                    break
        else:
            digits = ""
    if birth_year_found == False:
        i_b_year.append(0)
    if death_year_found == False:
        i_d_year.append(0)

for i in range(len(names)):
    print(names[i], i_b_year[i], i_d_year[i])
Output:
Zaha Hadid 1950 2016
Rem Koolhaas 1944 2005
Disclaimer: in the above code, I append 0 if two 4-digit numbers are not found in a person's summary. As already mentioned, there is no guarantee that a Wikipedia summary lists a person's birth year and death year as its first two 4-digit numbers, so the lists may include wrong information.
I am not familiar with the Wikipedia package, but it seems like you could just iterate over the names tuple:
import wikipedia as wp

names = ('Zaha Hadid', 'Rem Koolhaas')
i_b_year = []
for name in names:  # This line is new
    wiki_summary = wp.summary(name)  # Just changed names for name
    b_counter = 0
    d_counter = 0
    i_d_year = []
    for i, x in enumerate(wiki_summary):
        if x.isdigit() == True:
            b_counter += 1
            if b_counter == 4:
                i_b_year.append(i)  # I am guessing you want this list to increase with each name in names. Thus, 'append'.
                break
            else:
                continue
        else:
            b_counter = 0
First of all, your code won't work for several reasons:
Importing wikipedia will only work with a lowercase first letter: import wikipedia
The summary method accepts strings (in your case, individual names), so you would have to call it for every name separately
All of this aside, let's try to achieve what you're trying to do:
import wikipedia as wp
import re

# First thing we see (at least for pages provided) is that dates all share the same format:
# For those who are no longer with us: 31 October 1950 – 31 March 2016
# For those who are still alive: 17 November 1944
# So we have to build regex patterns to find those

# First is the months pattern, since it's quite a big one
MONTHS_PATTERN = r"January|February|March|April|May|June|July|August|September|October|November|December"
# Next we build our date pattern, double curly braces are used for literal text
DATE_PATTERN = re.compile(fr"\d{{1,2}}\s({MONTHS_PATTERN})\s\d{{,4}}")

# Declare our set of names, great choice of architects BTW :)
names = ('Zaha Hadid', 'Rem Koolhaas')
# Since we're trying to get birthdays and dates of death, we will create a dictionary for storing values
lifespans = {}

# Iterate over them in a loop
for name in names:
    lifespan = {'birthday': None, 'deathday': None}
    try:
        summary = wp.summary(name)
        # First we find the first date in summary, since it's most likely to be the birthday
        first_date = DATE_PATTERN.search(summary)
        if first_date:
            # If we've found a date – suppose it's the birthday
            bday = first_date.group()
            lifespan['birthday'] = bday
            # Let's check whether the person is no longer with us
            LIFESPAN_PATTERN = re.compile(fr"{bday}\s–\s{DATE_PATTERN.pattern}")
            lifespan_found = LIFESPAN_PATTERN.search(summary)
            if lifespan_found:
                lifespan['deathday'] = lifespan_found.group().replace(f"{bday} – ", '')
            lifespans[name] = lifespan
        else:
            print(f'No dates were found for {name}')
    except wp.exceptions.PageError:
        # Handle not found page, so that code won't break
        print(f'{name} was not found on Wikipedia')
        pass

# Print result
print(lifespans)
Output for provided names:
{'Zaha Hadid': {'birthday': '31 October 1950', 'deathday': '31 March 2016'}, 'Rem Koolhaas': {'birthday': '17 November 1944', 'deathday': None}}
This approach is inefficient and has many flaws, for example a page whose dates fit our regular expression yet are not the birthday and death day. It's quite ugly (even though I've tried my best :) ) and you'd be better off parsing tags.
If you're not happy with the date format from Wikipedia, I suggest you look into datetime. Also, keep in mind that those regular expressions fit these two specific pages; I did not conduct any research on how dates might be represented across Wikipedia. So, if there are any inconsistencies, I suggest you stick with parsing tags.
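
Following up on the datetime suggestion, a minimal sketch (assuming the "31 October 1950" format shown in the output above) of turning the matched strings into date objects:

from datetime import datetime

# "%d %B %Y" matches the "day, full month name, year" strings the regex extracts.
birthday = datetime.strptime("31 October 1950", "%d %B %Y").date()
deathday = datetime.strptime("31 March 2016", "%d %B %Y").date()

print(birthday, deathday)          # 1950-10-31 2016-03-31
print((deathday - birthday).days)  # lifespan in days, if that is ever useful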

Storing Multi-dimensional Lists?

(Code below)
I'm scraping a website and the data I'm getting back is in 2 multi-dimensional arrays. I want everything to be in a JSON format because I want to save it and load it in again later when I add "tags".
So, less vague: I'm writing a program which takes in data like what characters you have and what the missions require you to do (you can complete multiple at once if the attributes align), and then checks that against a list of attributes that each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays, 1 for the headers of the table and one for the rows of the table. The rows contain the "Answers" to the headers' "Questions" / "Titles"; e.g. Maximum Level, 50.
This is true for everything but the first entry, which is the Name, Pronunciation (and I just want to store the name, of course).
So:
Iterations = 0
While loop based on RowArray length / 9 (While Iterations <= that)
HeaderArray[0] gives me the name
RowArray[Iterations + 1] gives me data type 2
RowArray[Iterations + 2] gives me data type 3
Repeat until Array[Iterations + 8]
Iterations +=9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure if that's going to make this easier or not? Because my end goal here is to send "CharacterName" and get stuff back based on that AND be able to send in "DesiredTraits" and get "CharacterNames who fit that trait" back. Which means I also need to figure out how to store that category data semi-efficiently. There's over 80 possible categories and most only fit into about 10. I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code readability reasons - don't want a file for each character.
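
Since the headers line up with the row values for each character, one way to make both lookups easier is to zip each character's headers and row values into a dict and key everything by name. A rough sketch of the shape, with hypothetical values for one character:

# Hypothetical header/row pair for one character, as produced by the scraper.
headers = ['Mickey', 'Series', 'Box Type', 'Skill Description', 'Maximum Level']
rows = ['Mickey & Friends', 'Happiness', 'Clears a vertical line of Tsums', '50']

# The first header is the name; the rest pair up with the row values.
name = headers[0]
record = dict(zip(headers[1:], rows))

characters = {name: record}
print(characters['Mickey']['Maximum Level'])  # '50'

A structure like that also maps straight onto JSON, so it can live in a single file and be extended later with a list of tags per character.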
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os
target_dir = r"D:\00Coding\Js\WebScraper" #Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir,'TsumData.txt')
StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'
def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
    links = [urljoin(URLPrefix, a['href']) for a in a_tags]  # convert relative url to absolute url
    return links

def get_tds(link):
    soup = make_soup(link)
    #tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
    tds = soup.find_all('table', class_="wikia-infobox")
    RowArray = []
    HeaderArray = []
    if tds:
        for td in tds:
            #print(td.text.strip()) #This is everything
            rows = td.findChildren('tr')#[0]
            headers = td.findChildren('th')#[0]
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub('\s+', ' ', cell_content).strip()
                    if clean_content:
                        RowArray.append(clean_content)
            for row in rows:
                cells = row.findChildren('th')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub('\s+', ' ', cell_content).strip()
                    if clean_content:
                        HeaderArray.append(clean_content)
            print(HeaderArray)
            print(RowArray)
    return (RowArray, HeaderArray)
    #Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
    #print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
    #TempFile = open(fullname, 'w') #Read only, Write Only, Append
    #TempFile.write("EHLLO")
    #TempFile.close()
    #print(td.tbody.Series)
    #print(td.tbody[Series])
    #print(td.tbody["Series"])
    #print(td.data-name)
    #time.sleep(1)
if __name__ == '__main__':
    links = get_links(StartURL)
    MainHeaderArray = []
    MainRowArray = []
    MaxIterations = 60
    Iterations = 0
    for link in links:  # Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
        #print("Getting tds calling")
        if Iterations > 38:  # There are this many webpages it'll first look at that don't have the data I need
            TempRA, TempHA = get_tds(link)
            MainHeaderArray.append(TempHA)
            MainRowArray.append(TempRA)
            MaxIterations -= 1
        Iterations += 1
        #print(MaxIterations)
        if MaxIterations <= 0:  # I don't want to scrape the entire website for a prototype
            break
    #print("This is the end ??")
    #time.sleep(3)
    #jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
    print(MainHeaderArray)
    #time.sleep(2.5)
    #print(MainRowArray)
    #time.sleep(2.5)
    #print(zip())
    TsumName = []
    TsumSeries = []
    TsumBoxType = []
    TsumSkillDescription = []
    TsumFullCharge = []
    TsumMinScore = []
    TsumScoreIncreasePerLevel = []
    TsumMaxScore = []
    TsumFullUpgrade = []
    Iterations = 0
    MaxIterations = len(MainRowArray)
    while Iterations <= MaxIterations:  # This will fire 1 time per Tsum
        print(Iterations)
        print(MainHeaderArray[Iterations][0])  # Holy this gives us Mickey ;
        print(MainHeaderArray[Iterations+1][0])
        print(MainHeaderArray[Iterations+2][0])
        print(MainHeaderArray[Iterations+3][0])
        TsumName.append(MainHeaderArray[Iterations][0])
        print(MainRowArray[Iterations][1])
        # At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
        # Isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
        TsumSeries[Iterations] = MainRowArray[Iterations+1]
        TsumBoxType[Iterations] = MainRowArray[Iterations+2]
        TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
        TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
        TsumMinScore[Iterations] = MainRowArray[Iterations+5]
        TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
        TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
        TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
        Iterations += 9
    print(Iterations)
    print("It's Over")
    time.sleep(3)
    print(TsumName)
    print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back for a short break. With some minor looking over I see what happened - my append code is saying "Append this list to the array" meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se but they are definitely 2 lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my google searches or wondering why the list stuff isn't working (as it's expecting 1 value and gets a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use with open(...) as json_file and json write/read (super easy).
Ultimately I stored 3 JSON files. No big deal. Much easier than appending into one big file.
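
For completeness, a minimal sketch of that with open(...) as json_file pattern (hypothetical file name and record shape, with the category strings stored as a "Tags" list per character):

import json

# Hypothetical record shape: a dict keyed by character name.
tsums = {
    "Mickey": {
        "Series": "Mickey & Friends",
        "Tags": ["Black", "White Gloves", "Mickey & Friends"],
    },
}

# Write once...
with open("TsumData.json", "w") as json_file:
    json.dump(tsums, json_file, indent=2)

# ...read back later when adding tags or answering mission queries.
with open("TsumData.json") as json_file:
    loaded = json.load(json_file)

# Which characters carry a given tag?
print([name for name, data in loaded.items() if "Black" in data["Tags"]])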
