Iterate over list of multiple strings using for loop - python

I'm fairly new to coding in Python. For a personal project, I'm looking for different ways to retrieve birthdays and days of death from a list of Wikipedia pages. I am using the wikipedia package.
One approach I am trying is to iterate over the Wikipedia summary and return the index at which I have counted four digits in a row.
import wikipedia as wp
names = ('Zaha Hadid', 'Rem Koolhaas')
wiki_summary = wp.summary(names)
b_counter = 0
i_b_year = []
d_counter = 0
i_d_year = []
for i,x in enumerate(wiki_summary):
    if x.isdigit() == True:
        b_counter += 1
        if b_counter == 4:
            i_b_year = i
            break
        else:
            continue
    else:
        b_counter = 0
So far, that works for the first person in my list but I would like to iterate over all the names in my names list. Is there a way to use the for loop to find the index and use a for loop to iterate over the names?
I know there are other ways like parsing to find the bday tags, but I would like to try a couple of different solutions.

You are trying to:
Declare two empty lists to store the birth year and death year of each person.
Get the Wikipedia summary of each person from a tuple.
Parse the first two 4-digit numbers from the summary and append them to the birth year and death year lists.
The problem is that a person's summary may not list the birth year and death year as its first two 4-digit numbers. For example, Rem Koolhaas's Wikipedia summary has his birth year as the first 4-digit number, but the second 4-digit number comes from this line: In 2005, he co-founded Volume Magazine together with Mark Wigley and Ole Bouman.
So the birth year and death year lists may not contain accurate information.
Here is the code that does what you are trying to achieve:
import wikipedia as wp
names = ('Zaha Hadid', 'Rem Koolhaas')
i_b_year = []
i_d_year = []
for person_name in names:
    wiki_summary = wp.summary(person_name)
    birth_year_found = False
    death_year_found = False
    digits = ""
    for c in wiki_summary:
        if c.isdigit() == True:
            if birth_year_found == False:
                digits += c
                if len(digits) == 4:
                    birth_year_found = True
                    i_b_year.append(int(digits))
                    digits = ""
            elif death_year_found == False:
                digits += c
                if len(digits) == 4:
                    death_year_found = True
                    i_d_year.append(int(digits))
                    break
        else:
            digits = ""
    if birth_year_found == False:
        i_b_year.append(0)
    if death_year_found == False:
        i_d_year.append(0)
for i in range(len(names)):
    print(names[i], i_b_year[i], i_d_year[i])
Output:
Zaha Hadid 1950 2016
Rem Koolhaas 1944 2005
Disclaimer: in the above code, I append 0 if two 4-digit numbers are not found in a person's summary. As already mentioned, there is no guarantee that the Wikipedia summary lists a person's birth year and death year as its first two 4-digit numbers, so the lists may include wrong information.
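As a shorter variation on the same idea, the character-by-character scan could be replaced with a regular expression. This is only a sketch: the function name first_two_years and the sample text are made up for illustration (no live wp.summary() call is made), and unlike the loop above it only accepts standalone 4-digit numbers and returns None instead of 0 when a year is missing. The same caveat applies: the first two numbers are not guaranteed to be the birth and death years.

```python
import re

def first_two_years(summary):
    """Return the first two standalone 4-digit numbers in the text, padded with None."""
    years = [int(m) for m in re.findall(r"\b\d{4}\b", summary)[:2]]
    return years + [None] * (2 - len(years))

# Hypothetical summary text, used here instead of a live wp.summary() call.
sample = "Dame Zaha Hadid (31 October 1950 – 31 March 2016) was an architect."
birth, death = first_two_years(sample)
print(birth, death)  # 1950 2016
```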

I am not familiar with the Wikipedia package, but it seems like you could just iterate over the names tuple:
import wikipedia as wp
names = ('Zaha Hadid', 'Rem Koolhaas')
i_b_year = []
for name in names: #This line is new
    wiki_summary = wp.summary(name) #Just changed names for name
    b_counter = 0
    d_counter = 0
    i_d_year = []
    for i,x in enumerate(wiki_summary):
        if x.isdigit() == True:
            b_counter += 1
            if b_counter == 4:
                i_b_year.append(i) #I am guessing you want this list to increase with each name in names. Thus, 'append'.
                break
            else:
                continue
        else:
            b_counter = 0

First of all, your code won't work, for several reasons:
The import only works with a lowercase first letter: import wikipedia
The summary method accepts a single string (not your names tuple), so you have to call it once for every name in the tuple
All of this aside, let's try to achieve what you're trying to do:
import wikipedia as wp
import re
# First thing we see (at least for pages provided) is that dates all share the same format:
# For those who are no longer with us 31 October 1950 – 31 March 2016
# For those who are still alive 17 November 1944
# So we have to build regex patterns to find those
# First is the months pattern, since it's quite a big one
MONTHS_PATTERN = r"January|February|March|April|May|June|July|August|September|October|November|December"
# Next we build our date pattern; doubled curly braces produce literal braces inside an f-string
DATE_PATTERN = re.compile(fr"\d{{1,2}}\s({MONTHS_PATTERN})\s\d{{4}}")
# Declare our tuple of names, great choice of architects BTW :)
names = ('Zaha Hadid', 'Rem Koolhaas')
# Since we're trying to get birthdays and dates of death, we will create a dictionary for storing values
lifespans = {}
# Iterate over them in a loop
for name in names:
    lifespan = {'birthday': None, 'deathday': None}
    try:
        summary = wp.summary(name)
        # First we find the first date in summary, since it's most likely to be the birthday
        first_date = DATE_PATTERN.search(summary)
        if first_date:
            # If we've found a date – suppose it's birthday
            bday = first_date.group()
            lifespan['birthday'] = bday
            # Let's check whether the person is no longer with us
            LIFESPAN_PATTERN = re.compile(fr"{bday}\s–\s{DATE_PATTERN.pattern}")
            lifespan_found = LIFESPAN_PATTERN.search(summary)
            if lifespan_found:
                lifespan['deathday'] = lifespan_found.group().replace(f"{bday} – ", '')
            lifespans[name] = lifespan
        else:
            print(f'No dates were found for {name}')
    except wp.exceptions.PageError:
        # Handle not found page, so that code won't break
        print(f'{name} was not found on Wikipedia')
# Print result
print(lifespans)
Output for provided names:
{'Zaha Hadid': {'birthday': '31 October 1950', 'deathday': '31 March 2016'}, 'Rem Koolhaas': {'birthday': '17 November 1944', 'deathday': None}}
This approach is inefficient and has many flaws: for example, a page may contain dates that fit our regular expression yet are not the birthday and death day. It's quite ugly (even though I've tried my best :) ) and you'd be better off parsing tags.
If you're not happy with the date format from Wikipedia, I suggest you look into datetime. Also, bear in mind that these regular expressions fit those two specific pages; I did not research how dates might be represented elsewhere on Wikipedia. So, if there are any inconsistencies, I suggest you stick with parsing tags.
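To illustrate the datetime suggestion, here is a minimal sketch that converts a date string in the "31 October 1950" format into a date object. The helper name parse_wiki_date is made up for this example, and it assumes English month names (the default locale), which is what the two pages above use.

```python
from datetime import datetime

def parse_wiki_date(text):
    """Parse a 'day month-name year' string such as '31 October 1950' into a date."""
    return datetime.strptime(text, "%d %B %Y").date()

birthday = parse_wiki_date("31 October 1950")
print(birthday.isoformat())  # 1950-10-31
```

Once the value is a date object, it can be reformatted with strftime or compared with other dates.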

Related

I want to add my two list items in a dictionary

I am writing code for a timetable maker. It takes the subjects that you want to study and your study preference (morning, afternoon, evening). Then I take the total hours to be studied in a week (20 hours) and divide it by 7 to calculate the study time per day. Then I want to assign a random subject to each random time slot.
# Time Table Creator
import random
def TimeTableCreator(subjects,day,time_slots,total_hours):
    day = {}
    studyperday = total_hours/7
    studyperday = round(studyperday)
    subjects_study = random.sample(subjects,k=studyperday) # subjects that are selected randomly
    final_time_slots = random.sample(time_slots,k=studyperday) # list of my time slots
    #trying to add both these items using for loop in a dictionary
    for subject in subjects_study:
        for i in range(studyperday):
            day[subject] = final_time_slots[i]
    print(day)
subjects = []
while True:
    subject = input("Enter a subject:\n")
    if subject=="":
        break
    subjects.append(subject)
total_hours = 20
print("What is your preference of studying: Morning or afternoon or evening:")
time_preference = input()
if time_preference.lower()=="morning":
    time_slots = ["7:00-8:00","8:00-9:00","9:00-10:00","10:00-11:00","2:00-3:00"]
elif time_preference.lower()=="afternoon":
    time_slots = ["12:00-13:00","13:00-14:00","15:00-16:00","16:00-17:00","18:00-19:00"]
elif time_preference.lower()=="evening":
    time_slots = ["16:00-17:00","18:00-19:00","19:00-20:00","20:00-21:00","21:00-22:00"]
else:
    print("Invalid preference")
TimeTableCreator(subjects,"Monday",time_slots,total_hours)
Program output
Enter a subject:
phy
Enter a subject:
chem
Enter a subject:
bio
Enter a subject:
What is your preference of studying: Morning or afternoon or evening:
morning
{'chem': '2:00-3:00', 'phy': '2:00-3:00', 'bio': '2:00-3:00'}
Process finished with exit code 0
As you can see, the same time slot is assigned to every subject: it shows 2:00-3:00 for each one. I want a different time slot to be assigned to each subject.
Please help.
The problem in your code is the dictionary creation. You don't need the nested loop. You can use zip().
Just change this
for subject in subjects_study:
    for i in range(studyperday):
        day[subject] = final_time_slots[i]
to this
for x, y in zip(subjects_study, final_time_slots):
    day[x] = y
One-liner solution:
day = dict(zip(subjects_study, final_time_slots))
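To make the difference concrete, here is a small self-contained comparison with fixed lists instead of random.sample (the sample values are illustrative only). The nested loop overwrites each subject's slot until only the last slot remains, while zip pairs the items position by position.

```python
subjects_study = ["chem", "phy", "bio"]
final_time_slots = ["7:00-8:00", "8:00-9:00", "2:00-3:00"]

# Nested loops: every subject is overwritten with each slot in turn,
# so all of them end up with the last slot.
nested = {}
for subject in subjects_study:
    for i in range(len(final_time_slots)):
        nested[subject] = final_time_slots[i]

# zip: the first subject gets the first slot, the second gets the second, and so on.
paired = dict(zip(subjects_study, final_time_slots))

print(nested)  # {'chem': '2:00-3:00', 'phy': '2:00-3:00', 'bio': '2:00-3:00'}
print(paired)  # {'chem': '7:00-8:00', 'phy': '8:00-9:00', 'bio': '2:00-3:00'}
```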
# Time Table Creator
import random
def TimeTableCreator(subjects,day,time_slots,total_hours):
    day = {}
    studyperday = round(total_hours/7)
    subjects_study = random.sample(subjects,k=studyperday) # subjects that are selected randomly
    final_time_slots = random.sample(time_slots,k=studyperday) # list of my time slots
    #trying to add both these items using for loop in a dictionary
    i = 0
    for subject in subjects_study:
        day[subject] = final_time_slots[i]
        i += 1
    print(day)
You were using a nested for loop, which assigned every value from final_time_slots to each subject in turn, so every subject ended up with the same value (the last one).
This only needed a single for loop as above.
You are looping using nested loops, so you go through all the i values for each subject and end up with the value for the last i for all the subjects:
for subject in subjects_study:
    for i in range(studyperday):
        day[subject] = final_time_slots[i]
What you want is to loop over both together using enumerate:
for i, subject in enumerate(subjects_study):
    day[subject] = final_time_slots[i]

How to make user input not case sensitive?

I want to create a function to filter which files I want to open, and for which month and day specifically. That way, the users need to input which city (file) they want to analyze for a particular month or day. However, I want the user input to be case insensitive.
For example, the user can input 'chicago'/'CHICAGO'/'ChIcAgO' and still get the right output rather than the error-handling message. Here is the code I use:
def get_filters ():
    city_options = ['Chicago','New York City','Washington']
    month_options = ['January','February','March','April','May','June','All']
    day_options = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday','All']
    while True:
        try:
            city = city_options.index(input('\nInsert name of the city to analyze! (Chicago, New York City, Washington)\n'))
            month = month_options.index(input('\nInsert month to filter by or "All" to apply no month filter! (January, February, etc.)\n'))
            day = day_options.index(input('\nInsert day of the week to filter by or "All" to apply no day filter! (Monday, Tuesday, etc.)\n'))
            return city_options[city].lower(), month_options[month].lower(), day_options[day].lower()
        except ValueError:
            print ("Your previous choice is not available. Please try again")
def load_data (city,month,day):
    #load data file into DataFrame
    df = pd.read_csv(CITY_DATA[city].lower())
    #convert start time column (string) to datetime
    df['Start Time']=pd.to_datetime(df['Start Time'])
    #create new column to extract month and day of the week from start time
    df['Month'] = df['Start Time'].dt.month
    df['Day_of_Week'] = df['Start Time'].dt.weekday_name
    #filter by month if applicable
    if month.lower()!= 'All':
        #use the index of the month list to get corresponding into
        months = ['January', 'February', 'March', 'April', 'May', 'June']
        month = months.index(month) + 1
        #filter by month to create new dataframes
        df = df[df['Month'] == month]
    if day.lower()!= 'All':
        #filter by day_of_week to create new DataFrames
        df =df[df['Day_of_Week'] == day]
    return(df)
The best way to do this is to take the input and convert it to the case you need.
Use Python's built-in string methods:
variable.lower()
or
variable.upper()
You should use str.casefold to remove case sensitivity. As per the docs, this is stricter than str.lower:
str.casefold()
Return a casefolded copy of the string. Casefolded strings may be used
for caseless matching.
Casefolding is similar to lowercasing but more aggressive because it
is intended to remove all case distinctions in a string. For example,
the German lowercase letter 'ß' is equivalent to "ss". Since it is
already lowercase, lower() would do nothing to 'ß'; casefold()
converts it to "ss".
For example:
x = 'ßHello'
print(x.casefold())
sshello
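Applied to the option lists in the question, case-insensitive matching could look like the sketch below. The helper name match_option is hypothetical; it casefolds both the user's text and the known options, then returns the option in its original spelling (or None when nothing matches), so the rest of the program keeps working with 'Chicago' rather than 'chicago'.

```python
def match_option(user_text, options):
    """Return the option that matches user_text ignoring case, or None if there is none."""
    lookup = {option.casefold(): option for option in options}
    return lookup.get(user_text.strip().casefold())

city_options = ['Chicago', 'New York City', 'Washington']
print(match_option("ChIcAgO", city_options))        # Chicago
print(match_option("new york city", city_options))  # New York City
print(match_option("Boston", city_options))         # None
```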
I just started learning Python in December 2021.
I encountered a similar problem while writing the code for a "choose your own fantasy" game.
To remove the case sensitivity, use the string.lower() or string.upper() method. So far, this answer is no different from the first answer to this question.
So, what's special here?
Well, if you are testing the answer from the user, the string in your if-statement must be in the same case as the one you are converting to, e.g.:
answer = input("What is rat? 'Animal' or 'Food' ")
if answer.lower() == "animal":
    print("Congratulations. You win! ")
else:
    print("Sorry, you lose! ")
OR:
answer = input("What is rat? 'Animal' or 'Food' ")
if answer.upper() == "ANIMAL":
    print("Congratulations. You win! ")
else:
    print("Sorry, you lose! ")
N.B.: No matter how the user types their answer (ANIMAL, ANimal, animal), if the answer is correct, they win, so long as the case conversion used in the if-statement matches the case of the string it is compared against.
I hope this helps!
I am new too, but I think you should look at string methods. Presuming you use Python 3 (since you use input and get no ValueError), you can just add .lower().title() after the parentheses of the input call.
Example:
city = city_options.index(input('\nInsert name of the city to analyze! (Chicago, New York City, Washington)\n').lower().title())
That should do the trick: if you input cHIcaGO, it is converted to Chicago right away.
Hope it helps!
Edit: After correcting the misspelling of the lower() function, I tried it in a web browser, PyCharm, and Python itself, and it works just fine for me. (I'm using Python 2.7, so I changed every input to raw_input; if you are using Python 3 you don't have to change them.)

Any way of returning 2 objects from a function without making a tuple?

I am making a basic date converter and I need to update the date every time the user enters an invalid date and is asked to input again. From the function below, I need both day and year returned.
def day_valid (month, dates, feb_day, month_days):
    day = int(dates[2:4])
    while month_days == 31 and day > 31:
        print ("Invalid day input.")
        print()
        dates = input_date()
        day = int(dates[2:4])
        if month_days == 31 and day < 32:
            break
    while month_days == 30 and day > 30:
        print ("Invalid day input.")
        print()
        dates = input_date()
        day = int(dates[2:4])
        if month_days == 30 and day < 31:
            break
    while month_days == feb_day and day > feb_day:
        print ("Invalid day input.")
        print()
        dates = input_date()
        day = int(dates[2:4])
        if month_days == feb_day and day <= feb_day:
            break
    return day
When a user types in 00102002 in MMDDYYYY format, there is no month. So the user is prompted to enter again, entering 01102005. The code still displays the date as 10 January 2002 and not 2005 .
If any one needs clarification on the code, please ask!
My main function:
def main():
    loop = "Y"
    print()
    print("Welcome to Date Converter!")
    print()
    while loop.upper () == "Y" :
        dates = input_date()
        year = int(dates[4:])
        month = month_valid(dates)
        feb_day = feb_days(year)
        month_days = month_Days(month, feb_day)
        day = day_valid(month, dates, feb_day, month_days)
        month_str = month_names(month)
        print()
        print("The date is " + str(day) + " " + month_str + " " + str(year))
        loop = str(input ("Do you want to re-run this program? Y/N: "))
main()
This sounds first of all like an XY Problem: someone wants to do X, and comes up with a solution requiring doing Y. They need help with Y, so request help to do Y. However, it turns out that Y is not an appropriate solution. By recognizing the XY Problem and asking how to do X instead, the person gets better help and more insight into X.
XY Problems also often look suspiciously like homework problems, since those are often of the form "write a program that does X, by doing Y".
It's OK to pose a question that you want to do X and tried to solve it using Y.
Anyway, that's why you're probably going to get low-effort answers. I'll make the effort :)
Going with the Y question first:
There is a readability practice that considers tuples harmful because you don't know what the purpose of the items in the tuple are. Consider instead creating an object that holds the things, each with its own attribute, and then return that.
Since you stated that you needed day and year returned:
class DayAndYear(object):
    def __init__(self, day, year):
        self.day = day
        self.year = year
And that's how you do it without making a tuple, and it increases the readability of your program, such as it is.
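As a rough illustration of how such an object might be returned and used, here is a self-contained sketch. The function split_day_and_year is hypothetical (it is not part of the question's code) and simply slices an MMDDYYYY string; the point is that the caller reads result.day and result.year by name instead of unpacking positions.

```python
class DayAndYear(object):
    def __init__(self, day, year):
        self.day = day
        self.year = year

def split_day_and_year(dates):
    """Hypothetical helper: pull the day and year out of an MMDDYYYY string."""
    return DayAndYear(day=int(dates[2:4]), year=int(dates[4:]))

result = split_day_and_year("01102005")
print(result.day, result.year)  # 10 2005
```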
Now, going with the unstated X question:
without knowing what month_valid does,
assuming feb_days returns the number of days in February of the given year,
assuming month_Days returns the number of days in the given month when it isn't February,
it seems that you want a function that will check if a string is a valid MMDDYYYY string.
def is_valid_date(s):
    """Checks if the given date is a valid MMDDYYYY string.
    Args:
        s (str): A date to check.
    Returns:
        bool: True if the date is valid, False otherwise.
    """
    if len(s) != 8:
        return False
    try:
        month = int(s[:2])
        day = int(s[2:4])
        year = int(s[4:])
    except ValueError:
        return False
    if month < 1 or month > 12:
        return False
    # days_in_february, days_in_month and month_name are assumed to be defined elsewhere
    if month == 2:
        max_day = days_in_february(year)
    else:
        max_day = days_in_month(month)
    return 1 <= day <= max_day
def print_date(s):
    """Prints the given MMDDYYYY date, assuming it has already been checked for validity.
    Args:
        s (str): A date to print.
    """
    print("The date is {:d} {:s} {:d}.".format(
        int(s[2:4]), month_name(int(s[:2])), int(s[4:])))
I'd like to highlight a few general techniques to make your programs read better:
We don't know X. A well-posed question is one with specifications for the input and output of the program.
I've used verbose, readable function names.
I've used function comments, complete with args, arg types, and return values so there's no guessing about what things do.
I've chosen a split between checking validity and printing an already valid string. You can combine them. You can also return a string rather than print the date, and return instead the sentinel value None if the date was not valid.
Don't compute any more than you have to. Note the early returns.
No doubt there are library functions that will do this, but I've assumed you don't want to use any library functions.
The short key concepts:
Readability: Programs should be almost as easy to read as prose in your native language.
Readability: Function names should be descriptive.
Readability: Comment your code.
Readability: Choose a consistent format for functions and stick with it ("month_Days" vs "feb_days")
Efficiency: Return early.
Testability: Specify well what your program does in terms of inputs and outputs, give examples of good and bad inputs.
Effectiveness: Use library functions.
Stackoverflowness: Consider if your problem is an XY problem.
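For the "use library functions" point, one possible sketch relies on datetime.strptime from the standard library, which already knows month lengths and leap years. The function name parse_mmddyyyy is made up for this example; it returns a date object for a valid MMDDYYYY string and None otherwise.

```python
from datetime import datetime

def parse_mmddyyyy(s):
    """Return a date for a valid MMDDYYYY string, or None if it is not a real date."""
    if len(s) != 8 or not s.isdigit():
        return None
    try:
        return datetime.strptime(s, "%m%d%Y").date()
    except ValueError:
        # Raised for impossible dates such as month 00 or 30 February.
        return None

parsed = parse_mmddyyyy("01102005")
print(parsed.strftime("%d %B %Y"))  # 10 January 2005
print(parse_mmddyyyy("00102002"))   # None
```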

Storing Multi-dimensional Lists?

(Code below)
I'm scraping a website and the data I'm getting back is in 2 multi-dimensional arrays. I'm wanting everything to be in a JSON format because I want to save this and load it in again later when I add "tags".
So, less vague. I'm writing a program which takes in data like what characters you have and what missions are requiring you to do (you can complete multiple at once if the attributes align), and then checks that against a list of attributes that each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays, 1 for the headers of the table and one for the rows of the table. The rows contain the "Answers" for the Header's "Questions" / "Titles" ; ie Maximum Level, 50
This is true for everything but the first entry which is the Name, Pronunciation (and I just want to store the name of course).
So:
Iterations = 0
While loop based on RowArray length / 9 (While Iterations <= that)
HeaderArray[0] gives me the name
RowArray[Iterations + 1] gives me data type 2
RowArray[Iterations + 2] gives me data type 3
Repeat until Array[Iterations + 8]
Iterations +=9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure if that's going to make this easier or not? Because my end goal here is to send "CharacterName" and get stuff back based on that AND be able to send in "DesiredTraits" and get "CharacterNames who fit that trait" back. Which means I also need to figure out how to store that category data semi-efficiently. There's over 80 possible categories and most only fit into about 10. I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code readability reasons - don't want a file for each character.
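One way to picture the two lookups described above (by character name, and by desired trait) is a dictionary of records plus a reverse index from trait to names. This is only a sketch under assumptions: the field names, sample values, and helper names (build_records, index_by_trait) are invented for illustration and use 3 fields per character rather than the 9 on the real page.

```python
from collections import defaultdict

# Hypothetical field names; the real table has more columns.
FIELDS = ["Series", "Box Type", "Max Score"]

def build_records(names, flat_values, fields=FIELDS):
    """Cut a flat list of cell values into one dict per character, keyed by name."""
    records = {}
    for index, name in enumerate(names):
        start = index * len(fields)
        records[name] = dict(zip(fields, flat_values[start:start + len(fields)]))
    return records

def index_by_trait(traits_by_name):
    """Build a reverse lookup: trait -> set of character names that have it."""
    index = defaultdict(set)
    for name, traits in traits_by_name.items():
        for trait in traits:
            index[trait].add(name)
    return index

# Made-up sample data, not scraped values.
names = ["Mickey", "Sulley"]
flat = ["Mickey & Friends", "Happiness", "1000", "Monsters, Inc.", "Premium", "1200"]
records = build_records(names, flat)
by_trait = index_by_trait({"Mickey": ["Black", "White Gloves"], "Sulley": ["Blue", "Male"]})

print(records["Sulley"]["Series"])  # Monsters, Inc.
print(sorted(by_trait["Blue"]))     # ['Sulley']
```

Both structures contain only dicts, lists, and strings once the sets are converted to lists, so they can be written to a single JSON file.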
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os
target_dir = r"D:\00Coding\Js\WebScraper" #Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir,'TsumData.txt')
StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'
def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup
def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
    links = [urljoin(URLPrefix, a['href'])for a in a_tags] # convert relative url to absolute url
    return links
def get_tds(link):
    soup = make_soup(link)
    #tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
    tds = soup.find_all('table', class_="wikia-infobox")
    RowArray = []
    HeaderArray = []
    if tds:
        for td in tds:
            #print(td.text.strip()) #This is everything
            rows = td.findChildren('tr')#[0]
            headers = td.findChildren('th')#[0]
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub( '\s+', ' ', cell_content).strip()
                    if clean_content:
                        RowArray.append(clean_content)
            for row in rows:
                cells = row.findChildren('th')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub( '\s+', ' ', cell_content).strip()
                    if clean_content:
                        HeaderArray.append(clean_content)
    print(HeaderArray)
    print(RowArray)
    return(RowArray, HeaderArray)
    #Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
    #print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
    #TempFile = open(fullname, 'w') #Read only, Write Only, Append
    #TempFile.write("EHLLO")
    #TempFile.close()
    #print(td.tbody.Series)
    #print(td.tbody[Series])
    #print(td.tbody["Series"])
    #print(td.data-name)
    #time.sleep(1)
if __name__ == '__main__':
    links = get_links(StartURL)
    MainHeaderArray = []
    MainRowArray = []
    MaxIterations = 60
    Iterations = 0
    for link in links: #Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
        #print("Getting tds calling")
        if Iterations > 38: #There are this many webpages it'll first look at that don't have the data I need
            TempRA, TempHA = get_tds(link)
            MainHeaderArray.append(TempHA)
            MainRowArray.append(TempRA)
        MaxIterations -= 1
        Iterations += 1
        #print(MaxIterations)
        if MaxIterations <= 0: #I don't want to scrape the entire website for a prototype
            break
    #print("This is the end ??")
    #time.sleep(3)
    #jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
    print(MainHeaderArray)
    #time.sleep(2.5)
    #print(MainRowArray)
    #time.sleep(2.5)
    #print(zip())
    TsumName = []
    TsumSeries = []
    TsumBoxType = []
    TsumSkillDescription = []
    TsumFullCharge = []
    TsumMinScore = []
    TsumScoreIncreasePerLevel = []
    TsumMaxScore = []
    TsumFullUpgrade = []
    Iterations = 0
    MaxIterations = len(MainRowArray)
    while Iterations <= MaxIterations: #This will fire 1 time per Tsum
        print(Iterations)
        print(MainHeaderArray[Iterations][0]) #Holy this gives us Mickey ;
        print(MainHeaderArray[Iterations+1][0])
        print(MainHeaderArray[Iterations+2][0])
        print(MainHeaderArray[Iterations+3][0])
        TsumName.append(MainHeaderArray[Iterations][0])
        print(MainRowArray[Iterations][1])
        #At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
        #Isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
        TsumSeries[Iterations] = MainRowArray[Iterations+1]
        TsumBoxType[Iterations] = MainRowArray[Iterations+2]
        TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
        TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
        TsumMinScore[Iterations] = MainRowArray[Iterations+5]
        TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
        TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
        TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
        Iterations += 9
        print(Iterations)
    print("It's Over")
    time.sleep(3)
    print(TsumName)
    print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back for a short break. With some minor looking over I see what happened - my append code is saying "Append this list to the array" meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se but they are definitely 2 lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my google searches or wondering why the list stuff isn't working (as it's expecting 1 value and gets a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use the with open file as json_file , write/read (super easy).
Ultimately stored 3 json files. No big deal. Much easier than appending into one big file.
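The "with open ... as json_file" approach mentioned above might look roughly like the sketch below. The data layout, the function names save_tsums and load_tsums, and the file name are assumptions for illustration; a temporary directory is used so the example does not touch any real folder.

```python
import json
import os
import tempfile

# Illustrative record; the category strings come from the example in Edit 4.
tsums = {
    "Mickey": {"Categories": ["Black", "White Gloves", "Mickey & Friends"]},
}

def save_tsums(data, path):
    """Write the whole dictionary to one JSON file."""
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(data, json_file, indent=1)

def load_tsums(path):
    """Read the dictionary back from the JSON file."""
    with open(path, encoding="utf-8") as json_file:
        return json.load(json_file)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "TsumData.json")
    save_tsums(tsums, path)
    restored = load_tsums(path)

print(restored == tsums)  # True
```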

How can I avoid nested loops for maximum efficiency?

I wrote this to iterate through a series of Facebook user likes. The scrubbing process requires the code to first pick a user, then pick a like, then a character from that like. If too many characters in a like are not English characters (not in the alphanum string), the like is assumed to be gibberish and is removed.
This filtering process continues through all likes and all users. I know nested loops are a no-no, but I don't see a way to do this without a triple-nested loop. Any suggestions? Additionally, if anyone has any other efficiency or convention advice, I would love to hear it.
def cleaner(likes_path):
    '''
    estimated run time for 170k users: 3min
    this method takes a given csv format datasheet of noisy facebook likes.
    data is scrubbed row by row (meaning user by user) removing 'likes' that are not useful
    data is parsed into manageable size specified files.
    if more data is continuously added method will just keep adding new files
    if more data is added at a later time choosing a new folder to put it in would
    work best so that the update method can add it to existing counts instead
    of starting over
    '''
    with open(os.path.join(likes_path)) as likes:
        dct = [0]
        file_num = 0
        #initializes naming scheme for self-numbering files
        file_size = 30000
        #sets file size to 30000 userId's
        alphanum = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890 $%#-'
        user_count = 0
        too_big = 1000
        too_long = 30
        for rows in likes:
            repeat_check = []
            user_count += 1
            user_likes = make_like_list(rows)
            to_check = user_likes[1:]
            if len(to_check) < too_big:
                #users with more than 1000 likes take up much more resources/time
                #and are of less analytical value
                for like in to_check:
                    if len(like) > too_long or len(like) == 0:
                        #This changes the filter sensitivity. Most useful likes
                        #are under 30 char long
                        user_likes.remove(like)
                    else:
                        letter_check = sum(1 for letter in like[:5] if letter in alphanum)
                        if letter_check < len(like[:5])-1:
                            user_likes.remove(like)
                if len(user_likes) > 1 and len(user_likes[0]) == 32:
                    #userID's are 32 char long, this filters out some mistakes
                    #filters out users with no likes
                    scrubbed_to_check = user_likes[1:]
                    for like in scrubbed_to_check:
                        if like == 'Facebook' or like == 'YouTube':
                            #youtube and facebook are very common likes but
                            #aren't very useful
                            user_likes.remove(like)
                        #removes duplicate likes
                        elif like not in repeat_check:
                            repeat_check.append(like)
                        else:
                            user_likes.remove(like)
                    scrubbed_rows = '"'+'","'.join(user_likes)+'"\n'
                    if user_count%file_size == 1:
                        #This block allows for data to be parsed into
                        #multiple smaller files
                        file_num += 1
                        dct.append(file_num)
                        dct[file_num] = open(file_write_path + str(file_num) +'.csv', 'w')
                        if file_num != 1:
                            dct[file_num-1].close()
                    dct[file_num].writelines(scrubbed_rows)
            if user_counter(user_count, 'Users Scrubbed:', 200000):
                break
    print 'Total Users Scrubbed:', user_count
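One possible way to flatten the nesting is to move the per-like checks into a small helper and use sets for the membership tests, so each like is examined in a single pass. This is a sketch under assumptions, not a drop-in replacement: keep_like and scrub_likes are hypothetical names, the thresholds are copied from the function above, and duplicate handling is simplified to "keep the first occurrence". The per-character loop does not disappear, it just lives inside the helper, and it only ever looks at the first five characters.

```python
ALPHANUM = frozenset('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890 $%#-')
TOO_COMMON = frozenset(['Facebook', 'YouTube'])
TOO_LONG = 30

def keep_like(like):
    """Return True if a like is non-empty, short enough, not too common, and not gibberish."""
    if not like or len(like) > TOO_LONG or like in TOO_COMMON:
        return False
    head = like[:5]
    allowed = sum(1 for letter in head if letter in ALPHANUM)
    # At most one of the first five characters may fall outside ALPHANUM.
    return allowed >= len(head) - 1

def scrub_likes(likes):
    """Filter one user's likes in a single pass, dropping duplicates."""
    seen = set()
    kept = []
    for like in likes:
        if keep_like(like) and like not in seen:
            seen.add(like)
            kept.append(like)
    return kept

print(scrub_likes(['Python', 'Facebook', 'Python', '', 'x' * 31]))  # ['Python']
```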
