Using a dictionary (with changing values) as a flyweight in Python

Suppose I create a dictionary
d = {i:False for i in range(0,100)}
Then I make a list
l = [d[12], d[10], d[70]]
and then change the dictionary:
d[12] = True...
The list doesn't change. Is this behavior expected? If so, how can I add the values as references?
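For instance, a minimal sketch of what happens (standard Python semantics; the key list at the end is just one possible workaround, assumed rather than taken from the original code):
d = {i: False for i in range(100)}
l = [d[12], d[10], d[70]]    # the current values (False) are copied into the list
d[12] = True
print(l)                     # [False, False, False] -- the bools in the list never change

# one workaround: store the keys and look the values up when you need them
keys = [12, 10, 70]
print([d[k] for k in keys])  # [True, False, False]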
I'm doing this in a more complicated context, but this is the first thing I wanted to investigate (of many potential issues, I just wrote this).
Here's the full code:
import csv
# create a bingo num flyweight --> we don't have to update every board
b = {i: False for i in range(0, 100)}
boards = []
with open("04.csv") as f:
    reader = csv.reader(f)
    # get the numbers to be drawn
    draw = list(map(int, next(reader)))
    # skip the first empty line before boards start
    next(reader)
    # we want to store each board as a list of rows and columns
    # i.e. 5x5 board is a 5x10 list, this will make checking easier
    board = []
    for row in reader:
        print(row)
        if len(row) < 1:
            boards.append(board[:] + list(map(list, list(zip(*board)))))
            board = []
        else:
            board.append([b[int(i)] for i in filter(None, row[0].split(" "))])
for i in range(0, 100):
    b[i] = True
print(boards[0])
I suspect the way that I append board to boards could be the culprit...
You might recognize this as an 'advent of code' problem. I promise I can solve it with a more rudimentary technique; I just don't want to give up on this method... thank you!
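One way the flyweight idea could still work is to store a small mutable object per number instead of a bare bool, so every board shares the same object. A hedged sketch (the Cell class is hypothetical, not part of the code above):
class Cell:
    # Hypothetical mutable flag; every board that references it sees updates.
    def __init__(self):
        self.drawn = False

b = {i: Cell() for i in range(100)}
row = [b[12], b[10], b[70]]        # these are references to the same Cell objects held in b
b[12].drawn = True
print([c.drawn for c in row])      # [True, False, False]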

Related

finding students with highest and lowest credits in array in python

Firstly, I apologize for my bad English as it is not my native language, but I will try my best to explain everything as well as I can. I am trying to find the students with the highest and lowest credits from a .csv file.
Here is what my CSV looks like,
and here is my code so far.
I appended the first names into a first_names list (same thing with the last names and credits):
def arrays(i):
    import csv
    with open('FCredits.csv', 'r+') as f_data:
        csv_reader = csv.reader(f_data, delimiter=',')
        first_names = []
        last_names = []
        f_credits = []
        for row in csv_reader:
            first_name = row[0]
            last_name = row[1]
            f_credit = row[2]
            first_names.append(first_name)
            last_names.append(last_name)
            f_credits.append(f_credit)
        find_min_max(first_names, last_names, f_credits)
but then I got stuck on the next part:
def find_min_max(first_names, last_names, f_credits):
    minVal, maxVal = [], []
    for i in f_credits:
        minVal.append(f_credits)
        maxVal.append(f_credits)
    print(min(minVal))
    print(max(minVal))
Basically, what I wanted to do in the second part is to print out the students with the lowest and highest amounts of credits and write them to a new CSV file, but I gave up halfway.
There are a few things that I have noted in your question:
The highest or lowest credit value can be scored by multiple people.
That means the output may not be a single record, so a list is required.
Please see my code below:
def get_min_max(first_names, last_names, f_credits):
    max_value = max(f_credits)
    min_value = min(f_credits)
    minVal = []
    maxVal = []
    for element in zip(first_names, last_names, f_credits):
        if element[-1] == max_value:
            maxVal.append(element)
        elif element[-1] == min_value:
            minVal.append(element)
    print(len(maxVal), maxVal)
    print(len(minVal), minVal)
    return maxVal, minVal
For the if/elif part, I suggest that you use a list comprehension to make the code more concise and faster. I have written it this way so that you can understand it better.
You may also want to read about the min, max, and zip functions in the official Python documentation.
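A hedged sketch of the list-comprehension variant mentioned above (same inputs and intent as the answer's function):
def get_min_max(first_names, last_names, f_credits):
    rows = list(zip(first_names, last_names, f_credits))
    max_value = max(f_credits)
    min_value = min(f_credits)
    # keep every row whose credit value matches the extreme, since ties are possible
    maxVal = [row for row in rows if row[-1] == max_value]
    minVal = [row for row in rows if row[-1] == min_value]
    return maxVal, minVal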

How can I effectively and efficiently count integer matches in a large array and store those counts for later use?

For an assignment, I have to take a large number of "lottery" values from an Excel sheet (individually, a little over 2,300 integers, but they're grouped in six) and create an algorithm to try and predict the most likely next six numbers that will be drawn. So far, I have split up 396 six-number/six-slot combinations into arrays.
My plan was to count how many times each number (possible values, 1-75) shows up in each slot and divide it by 396, but I'm not exactly sure how to go through each slot array, count the matches, and store the percentages for each possible value.
The only way I have been able to count the matches was a super simple but absolutely humongous block of code (I created a variable for each possible value 1-75 and made a for loop with a nested if statement that would simply add a count to the correct variable) that I couldn't figure out how to condense, and it is horrible to try to use to complete the algorithm. I still have it, but I'm not going to post that part since it is horrible. The rest of my code I'll put down below:
# Open the Lottery_Data.csv and return the raw data.
def open_csv_file():
    file = open("Lottery_Data.csv", "r")
    csv_data = file.read()
    csv_data = csv_data.split("\n")
    file.close()
    return csv_data

# Main Function
def main():
    # Copy raw data from CSV file into a variable.
    csv_data = open_csv_file()
    # Create a separate array for the date (day and specific date), and the six lottery number-slots.
    lottery_date_day = []
    lottery_date_specific = []
    lottery_col_1 = []
    lottery_col_2 = []
    lottery_col_3 = []
    lottery_col_4 = []
    lottery_col_5 = []
    lottery_col_6 = []
    # Split the raw data from csv_data and spread them into the arrays
    for i in range(len(csv_data) - 2):
        line = csv_data[i].split(",")
        l_d_d = line[0]
        l_d_s = line[1]
        l1 = line[2]
        l2 = line[3]
        l3 = line[4]
        l4 = line[5]
        l5 = line[6]
        l6 = line[7]
        lottery_date_day.append(l_d_d)
        lottery_date_specific.append(l_d_s)
        lottery_col_1.append(l1)
        lottery_col_2.append(l2)
        lottery_col_3.append(l3)
        lottery_col_4.append(l4)
        lottery_col_5.append(l5)
        lottery_col_6.append(l6)
    # Go into lottery arrays 1-6 and find how many matches for numbers 1-75 there are
    # Divide the total matches from the total entries for each number's percentage

main()
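A hedged sketch of the counting step described above: collections.Counter replaces the 75 separate variables. The column list names and the 396 draw count come from the question; the helper itself is illustrative, not part of the original code.
from collections import Counter

def slot_frequencies(slot_values, total_draws=396):
    # Count how often each value 1-75 appears in one slot and convert to a fraction of all draws.
    counts = Counter(int(v) for v in slot_values)
    return {n: counts.get(n, 0) / total_draws for n in range(1, 76)}

# usage sketch, reusing the lists built in main() above:
# freqs_col_1 = slot_frequencies(lottery_col_1)
# print(freqs_col_1[42])   # fraction of draws where 42 appeared in slot 1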

Slow list parsing with python3.7 for duplicate item removal

I'm trying to remove duplicate items from a large text file containing 250 million items at 4.4 Gigabytes.
I was impressed to see that I could load this file into a python list in just a few minutes with the following code:
x = []
with open("online.txt") as file:
    for l in file:
        x.append(l)
print('count of array: ')
print(len(x))
But when I tried to simply check to make sure the next item doesn't exist before adding it to the array, it takes many hours to finish. I feel like I'm missing something simple that would really speed this up.
Here's the code I used to check for duplicate items:
a = []
x = []
with open("online.txt") as file:
    for l in file:
        if l in a:
            print('duplicate')
            print(l)
        else:
            x.append(l.strip())
            a.append(l)
print('with duplicates: ')
print(len(a))
print('without duplicates: ')
print(len(x))
This is running on a server with 64 Gigs of ram and modern dual xeon processors.
The problem is that with a simple list, Python has to search through every entry each time before adding a new one.
You could try a Python dictionary or a set instead of a list. These data structures are faster for determining whether an entry already exists.
Simply change your code:
a = set()
x = set()
with open("online.txt") as file:
    for l in file:
        if l in a:
            print('duplicate')
            print(l)
        else:
            x.add(l.strip())  # add to the set
            a.add(l)
You don't specify your input file format, but there may be speed increases possible by loading the whole data set into a giant string and then splitting it up with Python functions, rather than manually as you do here.
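A minimal sketch of that idea, assuming the file is newline-delimited and fits comfortably in the 64 GB of RAM mentioned above:
with open("online.txt") as file:
    data = file.read()                   # one giant string
unique_lines = set(data.splitlines())    # split once, deduplicate via the set
print('without duplicates: ')
print(len(unique_lines))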
In the end, here's the code I used to remove duplicates:
x = set([])
with open("all.txt") as file:
    for l in file:
        x.add(l)
print('count of array: ')
print(len(x))

Storing Multi-dimensional Lists?

(Code below)
I'm scraping a website and the data I'm getting back is in 2 multi-dimensional arrays. I'm wanting everything to be in a JSON format because I want to save this and load it in again later when I add "tags".
So, less vague. I'm writing a program which takes in data like what characters you have and what missions are requiring you to do (you can complete multiple at once if the attributes align), and then checks that against a list of attributes that each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays, 1 for the headers of the table and one for the rows of the table. The rows contain the "Answers" for the headers' "Questions" / "Titles"; i.e. Maximum Level, 50.
This is true for everything but the first entry which is the Name, Pronunciation (and I just want to store the name of course).
So:
Iterations = 0
While loop based on RowArray length / 9 (While Iterations <= that)
HeaderArray[0] gives me the name
RowArray[Iterations + 1] gives me data type 2
RowArray[Iterations + 2] gives me data type 3
Repeat until Array[Iterations + 8]
Iterations +=9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure if that's going to make this easier or not? Because my end goal here is to send "CharacterName" and get stuff back based on that AND be able to send in "DesiredTraits" and get "CharacterNames who fit that trait" back. Which means I also need to figure out how to store that category data semi-efficiently. There's over 80 possible categories and most only fit into about 10. I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code readability reasons - don't want a file for each character.
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os

target_dir = r"D:\00Coding\Js\WebScraper" #Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir, 'TsumData.txt')
StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
    links = [urljoin(URLPrefix, a['href']) for a in a_tags] # convert relative url to absolute url
    return links

def get_tds(link):
    soup = make_soup(link)
    #tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
    tds = soup.find_all('table', class_="wikia-infobox")
    RowArray = []
    HeaderArray = []
    if tds:
        for td in tds:
            #print(td.text.strip()) #This is everything
            rows = td.findChildren('tr')#[0]
            headers = td.findChildren('th')#[0]
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub('\s+', ' ', cell_content).strip()
                    if clean_content:
                        RowArray.append(clean_content)
            for row in rows:
                cells = row.findChildren('th')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub('\s+', ' ', cell_content).strip()
                    if clean_content:
                        HeaderArray.append(clean_content)
    print(HeaderArray)
    print(RowArray)
    return (RowArray, HeaderArray)
    #Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
    #print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
    #TempFile = open(fullname, 'w') #Read only, Write Only, Append
    #TempFile.write("EHLLO")
    #TempFile.close()
    #print(td.tbody.Series)
    #print(td.tbody[Series])
    #print(td.tbody["Series"])
    #print(td.data-name)
    #time.sleep(1)

if __name__ == '__main__':
    links = get_links(StartURL)
    MainHeaderArray = []
    MainRowArray = []
    MaxIterations = 60
    Iterations = 0
    for link in links: #Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
        #print("Getting tds calling")
        if Iterations > 38: #There are this many webpages it'll first look at that don't have the data I need
            TempRA, TempHA = get_tds(link)
            MainHeaderArray.append(TempHA)
            MainRowArray.append(TempRA)
            MaxIterations -= 1
        Iterations += 1
        #print(MaxIterations)
        if MaxIterations <= 0: #I don't want to scrape the entire website for a prototype
            break
        #print("This is the end ??")
        #time.sleep(3)
    #jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
    print(MainHeaderArray)
    #time.sleep(2.5)
    #print(MainRowArray)
    #time.sleep(2.5)
    #print(zip())
    TsumName = []
    TsumSeries = []
    TsumBoxType = []
    TsumSkillDescription = []
    TsumFullCharge = []
    TsumMinScore = []
    TsumScoreIncreasePerLevel = []
    TsumMaxScore = []
    TsumFullUpgrade = []
    Iterations = 0
    MaxIterations = len(MainRowArray)
    while Iterations <= MaxIterations: #This will fire 1 time per Tsum
        print(Iterations)
        print(MainHeaderArray[Iterations][0]) #Holy this gives us Mickey ;
        print(MainHeaderArray[Iterations+1][0])
        print(MainHeaderArray[Iterations+2][0])
        print(MainHeaderArray[Iterations+3][0])
        TsumName.append(MainHeaderArray[Iterations][0])
        print(MainRowArray[Iterations][1])
        #At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
        #Isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
        TsumSeries[Iterations] = MainRowArray[Iterations+1]
        TsumBoxType[Iterations] = MainRowArray[Iterations+2]
        TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
        TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
        TsumMinScore[Iterations] = MainRowArray[Iterations+5]
        TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
        TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
        TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
        Iterations += 9
        print(Iterations)
    print("It's Over")
    time.sleep(3)
    print(TsumName)
    print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back for a short break. With some minor looking over I see what happened - my append code is saying "Append this list to the array" meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se but they are definitely 2 lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my google searches or wondering why the list stuff isn't working (as it's expecting 1 value and gets a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
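A hedged sketch of the lookup described in Edit 4, using a plain dict of attribute lists; the names and attribute strings below are placeholders, not scraped data:
# hypothetical structure: character name -> list of attribute strings
tsum_attributes = {
    "Sully": ["Blue", "Monsters Inc", "Male", "High Score"],
    "Mickey": ["Black", "White Gloves", "Mickey & Friends", "Male"],
}

def best_matches(desired_traits, attributes=tsum_attributes):
    # Rank characters by how many of the desired traits they satisfy.
    scores = {name: sum(trait in traits for trait in desired_traits)
              for name, traits in attributes.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(best_matches(["Blue", "Monsters Inc", "Male"]))   # Sully ranks first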
Just use "with open(file) as json_file" and write/read (super easy).
Ultimately I stored 3 JSON files. No big deal. Much easier than appending into one big file.
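A minimal sketch of that pattern, assuming the scraped rows have already been reshaped into a dict keyed by character name (the file name and field values are illustrative placeholders):
import json

# placeholder data shaped as name -> fields; real fields would come from the scraped arrays
tsum_data = {"Mickey": {"Series": "Mickey & Friends", "MaxLevel": 50}}

# write
with open("TsumData.json", "w") as json_file:
    json.dump(tsum_data, json_file, indent=2)

# read it back later when adding tags
with open("TsumData.json") as json_file:
    tsum_data = json.load(json_file)
print(tsum_data["Mickey"]["Series"])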

How to match fields from two lists and further filter based upon the values in subsequent fields?

EDIT: My question was answered on reddit. Here is the link if anyone is interested in the answer to this problem https://www.reddit.com/r/learnpython/comments/42ibhg/how_to_match_fields_from_two_lists_and_further/
I am attempting to get the pos and alt strings from file1 to match up with what is in file2, fairly simple. However, file2 has values from the 17th split element/column to the last element/column (340th), which contain strings such as 1/1:1.2.2:51:12 that I also want to filter for.
I want to extract the rows from file2 that contain/match the pos and alt from file1. Thereafter, I want to further filter the matched results to only those that contain certain values in the 17th split element/column onwards. But to do so, the values would have to be split by ":" so I can filter for split[0] = "1/1" and split[2] > 50. The problem is I have no idea how to do this.
I imagine I will have to iterate over these and split, but I am not sure how to do this, as the code is presently in a loop and the values I want to filter are in columns, not rows.
Any advice would be greatly appreciated; I have sat with this problem since Friday and have yet to find a solution.
import os, itertools, re
file1 = open("file1.txt", "r")
file2 = open("file2.txt", "r")
matched = []
for (x), (y) in itertools.product(file2, file1):
    if not x.startswith("#"):
        cells_y = y.split("\t")
        pos_y = cells_y[0]
        alt_y = cells_y[3]
        cells_x = x.split("\t")
        pos_x = cells_x[0] + ":" + cells_x[1]
        alt_x = cells_x[4]
        if pos_y in pos_x and alt_y in alt_x:
            matched.append(x)
for z in matched:
    cells_z = z.split("\t")
    if cells_z[16:len(cells_z)]:
Your requirement is not clear, but you might mean this:
for (x), (y) in itertools.product(file2, file1):
    if x.startswith("#"):
        continue
    cells_y = y.split("\t")
    pos_y = cells_y[0]
    alt_y = cells_y[3]
    cells_x = x.split("\t")
    pos_x = cells_x[0] + ":" + cells_x[1]
    alt_x = cells_x[4]
    if pos_y != pos_x: continue
    if alt_y != alt_x: continue
    extra_match = False
    for f in range(16, len(cells_x)):   # 17th column onwards (0-indexed 16)
        x_extra = cells_x[f].split(':')
        if x_extra[0] != '1/1': continue
        if int(x_extra[2]) <= 50: continue
        extra_match = True
        break
    if not extra_match: continue
    xy = x + y
    matched.append(xy)
I chose to concatenate x and y into the matched array, since I wasn't sure whether or not you would want all the data. If not, feel free to go back to just appending x or y.
You may want to look into the csv library, which can use tab as a delimiter. You can also use a generator and/or guards to make the code a bit more pythonic and efficient. I think your approach with indexes works pretty well, but it would be easy to break when trying to modify down the road, or to update if your file lines change shape. You may wish to create objects (I use NamedTuples in the last part) to represent your lines and make it much easier to read/refine down the road.
Lastly, remember that Python short-circuits a comparative 'if'.
For example:
if x_evaluation and y_evaluation:
    do some stuff
When x_evaluation returns False, Python will skip y_evaluation entirely. In your code, cells_x[0]+":"+cells_x[1] is evaluated every single time you iterate the loop. Instead of storing this value, I wait until the easier alt comparison evaluates to True before doing this (comparatively) heavier/uglier check.
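For instance, a hedged illustration using the variable names from the question's loop (not part of the answer's code below):
# lighter check first: the string concatenation only runs when the alt fields already match
if alt_y == alt_x and pos_y == cells_x[0] + ":" + cells_x[1]:
    matched.append(x)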
import csv

def filter_matching_alt_and_pos(first_file, second_file):
    for x in csv.reader(open(first_file, 'r'), delimiter='\t'):
        for y in csv.reader(open(second_file, 'r'), delimiter='\t'):
            # continue will skip the rest of this loop and go to the next value for y
            # this way, we can abort as soon as one value isn't what we want
            # .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
            if x[3] == y[4] and x[0] == ":".join(y[:2]):
                yield y

def match_datestamp_and_alt_and_pos(first_file, second_file):
    for z in filter_matching_alt_and_pos(first_file, second_file):
        for element in z[16:]:
            # I am not sure I fully understood your filter needs for the 2nd half. Here, I split all elements from the 17th onward and look for the two cases you mentioned. This seems like it might be very heavy, but at least we're using generators!
            # same idea as before, we abort as early as possible to avoid needless indexing and checks
            chunks = element.split(":")
            # once again, I do the lighter check before the heavier one
            if not chunks[0] == "1/1":
                # continue automatically skips to the next iteration on element
                continue
            # WARNING: if you aren't 100% sure the 3rd field is an int, this is very dangerous
            if not int(chunks[2]) > 50:
                continue
            yield z

if __name__ == '__main__':
    first_file = "first.txt"
    second_file = "second.txt"
    # match_datestamp_and_alt_and_pos returns a generator; for-loop through it for the lines which matched all 4 cases
    for match in match_datestamp_and_alt_and_pos(first_file=first_file, second_file=second_file):
        print(match)
Namedtuples for the first part:
from collections import namedtuple

FirstFileElement = namedtuple("FirstFileElement", "pos unused1 unused2 alt")
SecondFileElement = namedtuple("SecondFileElement", "pos1 pos2 unused2 unused3 alt")

def filter_matching_alt_and_pos(first_file, second_file):
    for x in csv.reader(open(first_file, 'r'), delimiter='\t'):
        for y in csv.reader(open(second_file, 'r'), delimiter='\t'):
            # continue will skip the rest of this loop and go to the next value for y
            # this way, we can abort as soon as one value isn't what we want
            # .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
            x_element = FirstFileElement(*x[:4])
            y_element = SecondFileElement(*y[:5])
            if x_element.alt == y_element.alt and x_element.pos == ":".join([y_element.pos1, y_element.pos2]):
                yield y
