Python: No Traceback when Scraping Data into Excel Spreadsheet

I'm an inexperienced coder working in Python. I wrote a script to automate a process where certain information is scraped from a webpage and then pasted into a new Excel spreadsheet. I've written and executed the code, but the Excel spreadsheet I've designated to receive the data is completely empty. Worst of all, there is no traceback error. Would you help me find the problem in my code? And how do you generally solve your own problems when no traceback error is provided?
import xlsxwriter, urllib.request, string

def main():
    #gets the URL for the expert page
    open_sesame = urllib.request.urlopen('https://aries.case.com.pl/main_odczyt.php?strona=eksperci')
    #reads the expert page
    readpage = open_sesame.read()
    #opens up a new file in excel
    workbook = xlsxwriter.Workbook('expert_book.xlsx')
    #adds worksheet to file
    worksheet = workbook.add_worksheet()
    #initializing the variable used to move names and dates
    #in the excel spreadsheet
    boxcoA = ""
    boxcoB = ""
    #initializing expert attribute variables and lists
    expert_name = ""
    url_ticker = 0
    name_ticker = 0
    raw_list = []
    url_list = []
    name_list = []
    date_list = []
    #this loop goes through and finds all the lines
    #that contain the expert URL and name and saves them to raw_list:
    #raw_list loop
    for i in readpage:
        i = str(i)
        if i.startswith('<tr><td align=left><a href='):
            raw_list += i
    #this loop goes through the lines in raw_list and extracts
    #the name of the expert, saving it to a list:
    #name_list loop
    for n in raw_list:
        name_snip = n.split('target=_blank>','</a></td><')[1]
        name_list += name_snip
    #this loop fills a list with the dates the profiles were last updated:
    #date_list loop
    for p in raw_list:
        url_snipoff = p[28:]
        url_snip = url_snipoff.split('"')[0]
        url_list += url_snip
        expert_url = 'https://aries.case.com.pl/' + url_list[url_ticker]
        open_expert = urllib2.openurl(expert_url)
        read_expert = open_expert.read()
        for i in read_expert:
            if i.startswith('<p align=left><small>Last update:'):
                update = i.split('Last update:','</small>')[1]
        open_expert.close()
        date_list += update
    #now that we have a list of expert names and a list of profile update dates
    #we can work on populating the excel spreadsheet
    #this operation will iterate just as long as the list is long
    #meaning that it will populate the excel spreadsheet
    #with all of our names and dates that we wanted
    for z in raw_list:
        boxcoA = string('A',z)
        boxcoB = string('B',z)
        worksheet.write(boxcoA, name_list[z])
        worksheet.write(boxcoB, date_list[z])
    workbook.close()
    print('Operation Complete')

main()

The lack of a traceback only means your code raises no exceptions. It does not mean your code is logically correct.
I would look for logic errors by adding print statements, or by using a debugger such as pdb or pudb.
One problem I notice with your code is that the first loop seems to presume that i is a line, whereas it is actually a single character. You might find splitlines() more useful.
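A minimal sketch of that splitlines() suggestion, reusing the names from the question's code (decoding the response as UTF-8 is an assumption about the page's encoding):
#decode the response bytes, then iterate over real lines instead of single characters
readpage = open_sesame.read().decode('utf-8', errors='replace')
raw_list = []
for line in readpage.splitlines():
    if line.startswith('<tr><td align=left><a href='):
        raw_list.append(line)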

If there is no traceback then there is no error.
Most likely something has gone wrong with your scraping/parsing code and your raw_list or the other lists aren't being populated.
Try printing out the data that should be written to the worksheet in the last loop to see if there is any data to be written.
If you aren't writing data to the worksheet then it will be empty.
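For example, a quick sanity check along those lines, using the list names from the question's code, could look like this:
print('raw_list entries:', len(raw_list))
print('name_list entries:', len(name_list))
print('date_list entries:', len(date_list))
for z, (name, date) in enumerate(zip(name_list, date_list)):
    print(z, name, date)   #the values that would be written to the worksheet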

Related

CSV files - can't print specific values/entities from my csv

For a game I am planning, I want to create a piece of code that will write one specific value from a list of all the items in my game into the player's inventory (e.g. the player gets the item "potion", which would require searching the items CSV for potion and then putting the relevant info into the inventory CSV). Whenever I run my code, however, I get the error "TypeError: '_io.TextIOWrapper' object is not subscriptable".
I've tried researching and asking peers, but the closest I've gotten to a clear solution is someone mentioning reading the CSV into a list, and they didn't explain much more. I'm hoping someone could elaborate on this for me or provide an easier solution.
import csv

allitems = open("ALLITEMS.csv")
checkallitems = csv.reader(allitems)
playerinv = open("INVENTORY.csv")
checkinv = csv.reader(playerinv)

item = input("")
for x in checkallitems:
    print(allitems[x][0])
    if item == allitems[x][0]:
        playerinv.write(allitems[x][0]+"\n")

allitems.close()
playerinv.close()
The problem is that allitems is a file object returned by open, while for x in checkallitems iterates over the rows of that file as lists, so you are trying to use a list as an index into the file object. Also, you have to open INVENTORY.csv in write mode (using 'w' or 'a') to be able to write to it.
Just use x instead of allitems[x]. The snippet below should do the job:
for x in checkallitems:
    if item == x[0]:
        playerinv.write(x[0]+"\n")
So, the complete code could be:
Code
import csv

allitems = open("ALLITEMS.csv")
checkallitems = csv.reader(allitems)
playerinv = open("INVENTORY.csv", 'a')
checkinv = csv.reader(playerinv)

item = input("")
for x in checkallitems:
    if item == x[0]: # Check if the item is equal to the first item on the list
        playerinv.write(x[0]+"\n")

allitems.close()
playerinv.close()
I don't know exactly what you want to accomplish, so I tried to stick as closely as possible to your code.
If you only want to write the item provided by the user when it is found in the current list of items, this would do the job:
Code
import csv

allitems = open("ALLITEMS.csv")
checkallitems = csv.reader(allitems)
playerinv = open("INVENTORY.csv", 'a')
checkinv = csv.reader(playerinv)

item = input("")
for x in checkallitems:
    if item in x: # Check if the item is in the current list
        playerinv.write(item + "\n")

allitems.close()
playerinv.close()
I hope this helps. Let me know if any of this worked for you; otherwise, tell me what went wrong.
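Since the question also mentions reading the CSV into a list, a minimal sketch of that approach (reusing the filenames from the question; an illustration, not the only way to do it) could be:
import csv

with open("ALLITEMS.csv") as f:
    allitems_rows = list(csv.reader(f))   # a list of rows, each row a list of strings

item = input("")
matches = [row for row in allitems_rows if row and row[0] == item]
if matches:
    with open("INVENTORY.csv", "a") as inv:
        inv.write(matches[0][0] + "\n")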

My script keeps on printing even when nothing is left

I've written a script in Python using openpyxl to get some names and their corresponding values from Sheet1 and use them as parameters to be passed into a url to make it a valid url. The problem is that when I run my script, it keeps on printing urls even when there are only 5 of them in Sheet1. As far as my knowledge goes, the way I defined the max row is accurate. How does the max row become unlimited?
This is the script:
import requests
from openpyxl import load_workbook

wb = load_workbook('ReverseSearch.xlsx')
ws = wb['Sheet1']

def search_name(session,query,query1):
    session.headers = {"User-Agent":"Mozilla/5.0"}
    res = session.get(url.format(query,query1))
    print(res.url)

if __name__ == '__main__':
    url = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}"
    for row in range(2, ws.max_row + 1): #I used row 2 cause there are headers in row 1
        key = ws.cell(row=row,column=1).value
        key1 = ws.cell(row=row,column=2).value
        session = requests.Session()
        search_name(session,key,key1)
names I used:
café claude
sears fine food
chaat cafe
bean bag coffee house
primo patio cafe
values I used:
3392129
473113343
18528177
12192803
641231
I'm supposed to get only 5 links (fully qualified) but I'm getting blank urls once there are no parameters left.
https://www.yellowpages.com/san-francisco-ca/mip/cafe-claude-3392129?lid=3392129
https://www.yellowpages.com/san-francisco-ca/mip/sears-fine-food-473113343?lid=473113343
https://www.yellowpages.com/san-francisco-ca/mip/chaat-cafe-18528177?lid=18528177
https://www.yellowpages.com/san-francisco-ca/mip/bean-bag-coffee-house-12192803?lid=12192803
https://www.yellowpages.com/san-francisco-ca/mip/primo-patio-cafe-641231?lid=641231
https://www.yellowpages.com/los-angeles-ca/mip/None-None
https://www.yellowpages.com/los-angeles-ca/mip/None-None
https://www.yellowpages.com/los-angeles-ca/mip/None-None
I want my script to stop once the 5 links are printed.
Btw, this is what the url looks like:
url = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}"
I would put this as a comment but I don't have enough rep.
My first troubleshooting step would be to check what you get if you do:
print(ws.max_row)
Does it print 7?
If it prints a bigger number it might be counting empty rows in your document; in that case you would need to check the content of your cells and break the loop.
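A minimal sketch of that check, based on the loop in the question (the two-column layout is assumed from the question's code):
for row in range(2, ws.max_row + 1):
    key = ws.cell(row=row, column=1).value
    key1 = ws.cell(row=row, column=2).value
    if key is None or key1 is None: #empty row: nothing left to look up
        break
    session = requests.Session()
    search_name(session, key, key1)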

Storing Multi-dimensional Lists?

(Code below)
I'm scraping a website and the data I'm getting back is in 2 multi-dimensional arrays. I want everything to be in JSON format because I want to save it and load it in again later when I add "tags".
So, to be less vague: I'm writing a program which takes in data like what characters you have and what missions you're required to do (you can complete multiple at once if the attributes align), then checks that against a list of attributes that each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data, but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name, so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays, one for the headers of the table and one for the rows of the table. The rows contain the "Answers" for the headers' "Questions" / "Titles"; i.e. Maximum Level, 50.
This is true for everything but the first entry, which is the Name, Pronunciation (and I just want to store the name, of course).
So:
Iterations = 0
While loop based on RowArray length / 9 (While Iterations <= that)
HeaderArray[0] gives me the name
RowArray[Iterations + 1] gives me data type 2
RowArray[Iterations + 2] gives me data type 3
Repeat until Array[Iterations + 8]
Iterations +=9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure if that's going to make this easier or not, because my end goal here is to send "CharacterName" and get stuff back based on that, AND be able to send in "DesiredTraits" and get back "CharacterNames who fit that trait". Which means I also need to figure out how to store that category data semi-efficiently. There are over 80 possible categories and most characters only fit into about 10. I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code readability reasons - I don't want a file for each character.
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os

target_dir = r"D:\00Coding\Js\WebScraper" #Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir,'TsumData.txt')

StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
    links = [urljoin(URLPrefix, a['href']) for a in a_tags] # convert relative url to absolute url
    return links

def get_tds(link):
    soup = make_soup(link)
    #tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
    tds = soup.find_all('table', class_="wikia-infobox")
    RowArray = []
    HeaderArray = []
    if tds:
        for td in tds:
            #print(td.text.strip()) #This is everything
            rows = td.findChildren('tr')#[0]
            headers = td.findChildren('th')#[0]
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub('\s+', ' ', cell_content).strip()
                    if clean_content:
                        RowArray.append(clean_content)
            for row in rows:
                cells = row.findChildren('th')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub('\s+', ' ', cell_content).strip()
                    if clean_content:
                        HeaderArray.append(clean_content)
            print(HeaderArray)
            print(RowArray)
    return(RowArray, HeaderArray)
    #Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
    #print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
    #TempFile = open(fullname, 'w') #Read only, Write Only, Append
    #TempFile.write("EHLLO")
    #TempFile.close()
    #print(td.tbody.Series)
    #print(td.tbody[Series])
    #print(td.tbody["Series"])
    #print(td.data-name)
    #time.sleep(1)

if __name__ == '__main__':
    links = get_links(StartURL)
    MainHeaderArray = []
    MainRowArray = []
    MaxIterations = 60
    Iterations = 0
    for link in links: #Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
        #print("Getting tds calling")
        if Iterations > 38: #There are this many webpages it'll first look at that don't have the data I need
            TempRA, TempHA = get_tds(link)
            MainHeaderArray.append(TempHA)
            MainRowArray.append(TempRA)
            MaxIterations -= 1
        Iterations += 1
        #print(MaxIterations)
        if MaxIterations <= 0: #I don't want to scrape the entire website for a prototype
            break
        #print("This is the end ??")
        #time.sleep(3)
    #jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
    print(MainHeaderArray)
    #time.sleep(2.5)
    #print(MainRowArray)
    #time.sleep(2.5)
    #print(zip())
    TsumName = []
    TsumSeries = []
    TsumBoxType = []
    TsumSkillDescription = []
    TsumFullCharge = []
    TsumMinScore = []
    TsumScoreIncreasePerLevel = []
    TsumMaxScore = []
    TsumFullUpgrade = []
    Iterations = 0
    MaxIterations = len(MainRowArray)
    while Iterations <= MaxIterations: #This will fire 1 time per Tsum
        print(Iterations)
        print(MainHeaderArray[Iterations][0]) #Holy this gives us Mickey ;
        print(MainHeaderArray[Iterations+1][0])
        print(MainHeaderArray[Iterations+2][0])
        print(MainHeaderArray[Iterations+3][0])
        TsumName.append(MainHeaderArray[Iterations][0])
        print(MainRowArray[Iterations][1])
        #At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
        #Isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
        TsumSeries[Iterations] = MainRowArray[Iterations+1]
        TsumBoxType[Iterations] = MainRowArray[Iterations+2]
        TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
        TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
        TsumMinScore[Iterations] = MainRowArray[Iterations+5]
        TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
        TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
        TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
        Iterations += 9
        print(Iterations)
    print("It's Over")
    time.sleep(3)
    print(TsumName)
    print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command line output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back for a short break. With some minor looking over I see what happened - my append code is saying "Append this list to the array" meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se but they are definitely 2 lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my google searches or wondering why the list stuff isn't working (as it's expecting 1 value and gets a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use with open(...) as json_file and write/read from it (super easy).
I ultimately stored 3 JSON files. No big deal. Much easier than appending everything into one big file.
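A minimal sketch of that pattern (the filename is a placeholder, and the sample record just reuses attribute strings mentioned above):
import json

tsum_data = {"SomeTsum": {"Categories": ["Black", "White Gloves", "Mickey & Friends"]}}

with open("TsumData.json", "w") as json_file:   # write
    json.dump(tsum_data, json_file, indent=2)

with open("TsumData.json") as json_file:        # read it back in later
    loaded = json.load(json_file)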

Retrieve Cells w/ Data and Stop Once at Empty Cell openpyxl

I am writing a Python script that leverages the OpenPyXL module to open an Excel sheet that has a column of data that I'd then like to put into a list for future use. I've created a simple function to complete this task, but I wasn't sure if there was a more effective manner in which to approach it. Here is what I have right now:
import pyautogui
import openpyxl

wb = openpyxl.load_workbook('120.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')

def dataGrab(sheet):
    counter = 0
    aData = []
    while 1 > 0:
        cell = "A"+str(counter)
        print(cell)
        wo = sheet[cell].value
        print(wo)
        if wo is None:
            break
        aData.append(wo)
        counter += 1
    print(aData)
    return aData

aData = dataGrab(sheet)
So in essence what happens is the workbook is opened and then an infinite loop kicks off. While in this loop I am building the cell identifier ("A" + counter) to make strings like "A1" and so on, which are passed to the line that gets the value of the cell, which is then just appended to an existing list. This loops until wo is None.
This works perfectly fine and yields no errors; I am more looking for suggestions to improve the function or to know if I "did it right". Thank you for your feedback!
PS: The line wb = openpyxl.load_workbook('120.xlsx') takes a good 30-45 seconds to run. Any way to make that go faster? Not a priority.
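For comparison, here is a sketch of the same stop-at-the-first-empty-cell idea using openpyxl's own row iteration; the read_only flag relates to the slow-loading note, and whether it helps for this particular file is an assumption worth testing:
import openpyxl

wb = openpyxl.load_workbook('120.xlsx', read_only=True)  # read_only can speed up loading
sheet = wb['Sheet1']

aData = []
for (cell,) in sheet.iter_rows(min_col=1, max_col=1):  # walk down column A
    if cell.value is None:  # stop at the first empty cell
        break
    aData.append(cell.value)
print(aData)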

Can I write whole lines in Google Spreadsheets using gspread in python?

I am trying to write a simple script that will take a csv as input and write it into a single spreadsheet document. I have it working now, however the script is slow. It takes around 10 minutes to write about 350 lines into two worksheets.
Here is the script I have:
#!/usr/bin/python
import json, sys
import gspread
from oauth2client.client import SignedJwtAssertionCredentials

json_key = json.load(open('client_secrets.json'))
scope = ['https://spreadsheets.google.com/feeds']

# change to True to see Debug messages
DEBUG = False

def updateSheet(csv,sheet):
    linelen = 0
    counter1 = 1 # starting column in spreadsheet: A
    counter2 = 1 # starting row in spreadsheet: 1
    counter3 = 0 # helper for iterating through line entries
    credentials = SignedJwtAssertionCredentials(json_key['client_email'], json_key['private_key'], scope)
    gc = gspread.authorize(credentials)
    wks = gc.open("Test Spreadsheet")
    worksheet = wks.get_worksheet(sheet)
    if worksheet is None:
        if sheet == 0:
            worksheet = wks.add_worksheet("First Sheet",1,8)
        elif sheet == 1:
            worksheet = wks.add_worksheet("Second Sheet",1,8)
        else:
            print "Error: spreadsheet does not exist"
            sys.exit(1)
    worksheet.resize(1,8)
    for i in csv:
        line = i.split(",")
        linelen = len(line)-1
        if (counter3 > linelen):
            counter3 = 0
        if (counter1 > linelen):
            counter1 = 1
        if (DEBUG):
            print "entry length (starting from 0): ", linelen
            print "line: ", line
            print "counter1: ", counter1
            print "counter3: ", counter3
        while (counter3 <= linelen):
            if (DEBUG):
                print "writing line: ", line[counter3]
            worksheet.update_cell(counter2, counter1, line[counter3].rstrip('\n'))
            counter3 += 1
            counter1 += 1
        counter2 += 1
    worksheet.resize(counter2,8)
I am a sysadmin, so I apologize in advance for the shitty code.
Anyway, the script takes the csv line by line, splits it by comma and writes cell by cell, hence it takes time to write. The idea is to have cron execute this once a day; it will remove older entries and write new ones -- that's why I use resize().
Now, I am wondering if there is a better way to take a whole csv line and write it into the sheet with each value in its own cell, avoiding writing cell by cell like I have now? This would significantly reduce the time it takes to execute.
Thanks!
Yes, this can be done. I upload in chunks of 100 rows by 12 columns and it handles it fine - I'm not sure how well this scales, though, for something like a whole csv in one go. Also be aware that the default length of a sheet is 1000 rows and you will get an error if you try to reference a row outside of this range (so use add_rows beforehand to ensure there is space). Simplified example:
data_to_upload = [[1, 2], [3, 4]]

column_names = ['','A','B','C','D','E','F','G','H','I','J','K','L','M','N',
                'O','P','Q','R','S','T','U','V','W','X','Y','Z','AA']

# To make it dynamic, assuming that all rows contain same number of elements
cell_range = 'A1:' + str(column_names[len(data_to_upload[0])]) + str(len(data_to_upload))

cells = worksheet.range(cell_range)

# Flatten the nested list. 'cells' will not by default accept xy indexing.
flattened_data = flatten(data_to_upload)

# Go based on the length of flattened_data, not cells.
# This is because if you chunk large data into blocks, all excess cells will take an empty value
# Doing it the other way around will get an index out of range
for x in range(len(flattened_data)):
    cells[x].value = flattened_data[x].decode('utf-8')

worksheet.update_cells(cells)
If your rows are of different lengths then clearly you would need to insert the appropriate number of empty strings into cells to ensure that the two lists don't get out of sync. I use decode for convenience because I kept crashing on special characters, so it seems best to just have it in.
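Note that flatten() isn't defined in the snippet above; a minimal helper along these lines (my assumption, not part of the original answer) would do:
def flatten(nested):
    # turn [[1, 2], [3, 4]] into [1, 2, 3, 4], row by row
    return [item for row in nested for item in row]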
