I'm having a problem with a nested for loop (for doc in query) that runs only once. It's inside for item in news_items, which I have verified iterates 10 times, and the for doc in query loop should iterate 9 times. When I print doc, it prints 9 documents; however, when I try to make an if/else check on the document's content, it only runs one time. (I would expect 9 x 10 outputs, since every item from the parent loop is checked against every doc in query, but all I get is 9 outputs.)
I've tried searching Stack Overflow, but nothing I found seems to be relevant. Coming from the other programming languages I work with, I don't see why this wouldn't work, but maybe I'm missing something, since I'm fairly new to Python (1 week).
def scrape(url):
    # GET DATE AT THE TIME OF CRAWL START
    today = date.today()
    d1 = today.strftime("%d/%m/%Y")
    # D2 is used for query only
    d2 = today.strftime("%Y%m%d")
    # LOAD URL IN DRIVER
    driver.get(url)
    try:
        news_container = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "FlashNews-Box-Root"))
        )
        # array of items
        news_items = news_container.find_elements_by_class_name("FlashNews-Box-Item")
        refresher_ref = db.collection(u'news').document('sources').collection('refresher_news')
        # query for last article
        query = refresher_ref.order_by(u'article_timestamp', direction=firestore.Query.DESCENDING).limit(10).stream()
        for item in news_items:
            print("News items found: " + str(len(news_items)))
            try:
                # image is optional so we need to try it
                try:
                    item_image = item.find_element_by_class_name("FlashNews-Box-ItemImage").find_element_by_tag_name(
                        "img").get_attribute("src")
                except Exception as e:
                    item_image = "unavailable"
                # time will be added to the same day as when this was ran, since this will run often and compare
                # article texts, we won't have issue with wrong dates
                item_time = item.find_element_by_class_name("FlashNews-Box-ItemTime").text + " " + d1
                item_time_query_temp = item.find_element_by_class_name("FlashNews-Box-ItemTime").text.replace(":", "")
                # normalize timestamp for sorting
                if len(item_time_query_temp) == 3:
                    item_time_query_temp = "0" + item_time_query_temp
                item_time_query = d2 + item_time_query_temp
                item_text = item.find_element_by_class_name("FlashNews-Box-ItemText").text
                item_redirect = item.find_element_by_class_name("FlashNews-Box-ItemText").find_element_by_tag_name(
                    "a").get_attribute("href")
                result = {"article_time": item_time, "article_url": item_redirect, "article_image": item_image,
                          "article_text": item_text, "article_timestamp": item_time_query}
                # print(result)
                # save data to firestore - check for last item in firestore, then add this article
                is_new = True
                print("Printing 10x")
                # THIS EXECUTES ONLY ONCE?
                for doc in query:
                    # print(str(len(query)))
                    current_doc = doc.to_dict()
                    # print(current_doc)
                    # print("Iteration: " + current_doc['article_text'])
                    # print("Old: " + current_doc["article_text"] + " New: " + item_text)
                    if current_doc['article_text'] == item_text:
                        print("Match")
                        # print(current_doc['article_text'] + item_text)
                        # print("Old: " + current_doc['article_text'] + " New: " + item_text)
                    else:
                        print("Mismatch")
                        # print(current_doc['article_text'] + item_text)
                        # print("Skipping article as the text exists in last 10")
                # else:
                #     print("Old: " + current_doc['article_text'] + " New: " + item_text)
                # print(str(is_new))
                # if is_new:
                #     refresher_ref.add(result)
                #     print("Adding document")
            except Exception as e:
                print(e)
    except Exception as e:
        # HANDLE ERRORS
        print(e)
    print("Completed running.")
    # quit driver at the end of function run
    driver.quit()
query isn't a list, but some other iterable type that you can only consume once (similar to a generator). In order to use it multiple times in the outer loop, you'll need to create a list to hold the contents in memory. For example,
# query for last article
query = refresher_ref.order_by(u'article_timestamp', direction=firestore.Query.DESCENDING).limit(10).stream()
query = list(query)
for item in news_items:
    ...
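For comparison, here is a minimal, self-contained sketch of the same behaviour with a plain generator (nothing Firestore-specific is assumed): the second pass over a one-shot iterable produces nothing, while a list can be looped over repeatedly.
stream = (n for n in range(3))

for n in stream:
    print("first pass:", n)    # prints 0, 1, 2

for n in stream:
    print("second pass:", n)   # never prints: the generator is already exhausted

items = list(range(3))         # a list keeps its contents in memory
for _ in range(2):
    for n in items:
        print("repeatable pass:", n)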
I am trying to add multiprocessing to my web crawler. What I usually see online is passing the URLs as the args to map, map_async, or apply_async. The data I am crawling is in a table, so I extract it with two BeautifulSoup find_all calls, one for the rows and one for the columns. Since the data I am crawling is sometimes on a single page, which only requires one URL, I tried to use the list returned by find_all as the args for map_async, but I get the error "Fatal Python error: Cannot recover from stackoverflow."
The error occurs on the following line:
return_list = pool.map_async(func, Species_all_recorded_data_List)
How could I solve this, or where would it be better to put the multiprocessing?
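For reference, this is the pattern I am trying to follow, shown as a minimal sketch with a toy function instead of the real crawler (no web requests, the names here are made up):
import multiprocessing
from functools import partial

def process_row(prefix, row):
    # the last positional argument is filled from the iterable given to map_async
    return prefix + str(row)

if __name__ == '__main__':
    rows = [1, 2, 3, 4]
    func = partial(process_row, "row-")      # bind the fixed arguments first
    with multiprocessing.Pool(2) as pool:
        result = pool.map_async(func, rows)
        print(result.get())                  # ['row-1', 'row-2', 'row-3', 'row-4']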
The second problem is that if I put some code above the function crawl_all_data_mp, it all executes again when pool = Pool() runs. I worked around it by simply moving all the other code under that function, but that might not be correct, since I still can't really run the code because of the first error.
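A minimal sketch of the behaviour I mean, assuming the spawn start method that Windows uses: every worker process re-imports the module, so unguarded top-level code runs again, while code inside the if __name__ == '__main__': block only runs in the parent.
import multiprocessing

# Under the spawn start method this line prints once per process,
# because each worker re-imports the module.
print("top-level code runs in every process")

def square(x):
    return x * x

if __name__ == '__main__':
    # Only the parent process enters this block, so the Pool is created once.
    with multiprocessing.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))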
Looking for your advice
My code:
(1) Function to call for web crawling
from tkinter import filedialog
from tkinter import *
import csv
import os.path
from os import path
from Index import *
from Dragonfly import *
import codecs
from multiprocessing import Process, Value
# multiprocessing ver
def multiprocessing_row_data(Web_rawl_Species_family_name, Web_rawl_Species_name, Total_num, Limit_CNT, expecting_CNT, oldID, page, Species_all_record_data_Data_Set):
    global DataCNT, stop_crawl_all_data_mp
    tmp_List = Species_all_record_data_Data_Set.find_all('td')
    # End conditions:
    # 1. no data in the next page
    # 2. for updates, stop once the old data is found by inspecting its ID
    # 3. the count goes over the limit count
    id = tmp_List[0].text
    if (len(id) == 0) or (DataCNT >= expecting_CNT) or (DataCNT >= Limit_CNT):
        print(' --Finish crawl--' + ' crawl to page: ' + str(page) + ", ID: " + id + ", count: " + str(DataCNT))
        stop_crawl_all_data_mp = True
        raise StopIteration
    # access the same value in memory when doing multiprocessing
    with DataCNT.get_lock():
        DataCNT.value += 1
    response_DetailedInfo = session.post(general_url + Detailed_discriptions_url + id, headers=headers)
    soup2 = BeautifulSoup(response_DetailedInfo.text, 'html.parser')
    print("Current finished datas >> " + str(DataCNT.value) + " /" + str(Total_num) + " (" + str(DataCNT.value * 100 / Total_num) + "%)", end='\r')
    return DetailedTableInfo(tmp_List[0].text, tmp_List[1].text, tmp_List[2].text, tmp_List[3].text, tmp_List[4].text, tmp_List[5].text, tmp_List[7].text, tmp_List[6].text,
                             soup2.find(id='R_LAT').get('value'),
                             soup2.find(id='R_LNG').get('value'),
                             Web_rawl_Species_family_name,
                             Web_rawl_Species_name,
                             soup2.find(id='R_MEMO').get('value'))

def crawl_all_data_mp(Web_rawl_Species_family_name, Web_rawl_Species_name, Total_num, Limit_CNT, expecting_CNT, oldID):
    page = 0
    DataList = []
    while not stop_crawl_all_data_mp:
        pool = multiprocessing.Pool(10)
        Species_all_recorded_data = session.post(general_url +
                                                 species_all_record_data_first_url +
                                                 species_all_record_data_page_url + str(page) +
                                                 species_all_record_data_species_url +
                                                 Species_class_key[Web_rawl_Species_family_name] +
                                                 Species_key[Web_rawl_Species_name],
                                                 headers=headers)
        soup = BeautifulSoup(Species_all_recorded_data.text, 'html.parser')
        Species_all_recorded_data_List = soup.find_all(id='theRow')
        func = partial(multiprocessing_row_data, Web_rawl_Species_family_name, Web_rawl_Species_name, Total_num, Limit_CNT, expecting_CNT, oldID, page)
        return_list = pool.map_async(func, Species_all_recorded_data_List)
        DataList.append(list(filter(None, return_list.get())))
        page += 1
    # make sure that when main is finished, the subprocesses still keep rolling on
    pool.close()
    pool.join()
    return [DataList, page]
(2) main
It goes wrong on the following line, which calls the function above:
[datatmpList, page] = crawl_all_data_mp(Input_species_famliy, Input_species, Total_num, limit_cnt, expecting_CNT, oldID)
the main code:
# --main--
if __name__ == '__main__':
    # setting
    Input_species_famliy = "細蟌科"
    Input_species = "四斑細蟌"
    limit_cnt = 6000
    folder = 'Crawl_Data\\' + Species_class_key[Input_species_famliy]
    File_name = folder + "\\" + Species_class_key[Input_species_famliy] + Species_key[Input_species] + '.csv'
    oldID = 0
    oldData_len = 0
    print("--Start crawl-- " + Input_species_famliy + " " + Input_species)
    print("[folder]: " + folder)
    stop_crawl_all_data_mp = False
    # check whether the file exists or not
    file_check = path.exists(current_path + "\\" + File_name)
    # get the old ID
    if file_check:
        file_size = os.stat(current_path + "\\" + File_name).st_size
        if not file_size == 0:
            with open(File_name, newline='', errors="ignore") as F:
                R = csv.reader(F)
                oldData = [line for line in R]
                oldID = oldData[0][0]
                oldData_len = len(oldData) - 1
    # login
    Login_Web(myaccount, mypassword)
    # find the total number of the species_input (expected to execute only once)
    Species_total_num_Dict = Find_species_total_data()
    # get the data
    Total_num = int(Species_total_num_Dict[Input_species])
    # [datatmpList, page] = crawl_all_data(Input_species_famliy, Input_species, Total_num, limit_cnt, oldID)
    expecting_CNT = Total_num - oldData_len  # total number of records that need to be updated or crawled
    [datatmpList, page] = crawl_all_data_mp(Input_species_famliy, Input_species, Total_num, limit_cnt, expecting_CNT, oldID)
    Data = []
    for Data_tmp in datatmpList:
        Data.append([Data_tmp.SpeciesFamily,
                     Data_tmp.Species,
                     Data_tmp.IdNumber,
                     Data_tmp.Dates,
                     Data_tmp.Times,
                     Data_tmp.User,
                     Data_tmp.City,
                     Data_tmp.Dictrict,
                     Data_tmp.Place,
                     Data_tmp.Altitude,
                     Data_tmp.Latitude,
                     Data_tmp.Longitude,
                     Data_tmp.Description
                     ])
    # auto-make the directories
    newDir = current_path + "\\" + folder
    if not os.path.isdir(newDir):
        os.mkdir(newDir)
    # 'a' stands for append, which appends the new data to the old one
    with open(File_name, mode='a', newline='', errors="ignore") as employee_file:
        employee_writer = csv.writer(employee_file, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        # init, for when no file exists or the file is empty
        if (not file_check) or (file_size == 0):
            employee_writer.writerow(CSV_Head)
            employee_writer.writerows(Data)
        # for inserting the data into the old one
        else:
            for i in range(0, len(Data)):
                oldData.insert(i, Data[i])
            employee_writer.writerows(oldData)
When using pandas-datareader with Yahoo, if I set start and end to the same date, I get no information back when I ask on that date. If I ask a day later, it works. But I want today's close today.
import sys
from sqlalchemy import *
import os
import datetime
import pandas_datareader.data as web

end = datetime.datetime(2015, 10, 15)
start = datetime.datetime(2015, 10, 15)
path = 'c:\\python34\\myprojects\\msis\\'

try:
    os.mkdir(path)
except:
    pass

fname = path + 'test.txt'
fhand = open(fname, 'w')

engine = create_engine('mysql+mysqlconnector://root:#localhost /stockinfo')
connection = engine.connect()
result1 = engine.execute("select symbol from equities where daily = 'Y'")
for sqlrow in result1:
    try:
        info = web.DataReader(sqlrow[0], 'yahoo', start, end)
        print(info)
        close = info['Close'].ix['2015-10-14']
        print("=========================" + str(round(close, 4)))
        answer = "Closing price for " + sqlrow[0] + " is " + str(round(close, 4)) + "\n"
    except:
        answer = "No success for " + sqlrow[0] + "\n"
    fhand.write(answer)
    # result2 = engine.execute("update holdings set lasrprice = " + round(close,4) + " where symbol = '" + sqlrow[0] + "'")
    # result2.close()
result1.close()
fhand.close()
The code takes the second "except" route.
What am I doing wrong/what is happening?
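Based purely on the observation above that asking a day later works, a minimal sketch of the workaround would be to keep start on the target date and push end one day forward, then select the target date's row by label (the 'AAPL' symbol here is just a placeholder):
import datetime
import pandas_datareader.data as web

target = datetime.datetime(2015, 10, 15)
start = target
end = target + datetime.timedelta(days=1)   # one day past the date we actually want

info = web.DataReader('AAPL', 'yahoo', start, end)
close = info['Close'].loc['2015-10-15']     # pick the target date's row by label
print(round(close, 4))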
I am writing a script in Python that will look at all of the groups AND all of the users on a Linux system, and output a file.
I have a situation where, if an account is in a certain group, I want to set the user ID myself (in this example, because the account(s) is a non-user account).
Code below:
#!/usr/bin/python
import grp, pwd, os
from os.path import expanduser

destdir = expanduser("~")
destfile = '/newfile.txt'
appname = 'app name'
groupid = ''
userid = ''

# delete old feed file and create file
if os.path.exists(destdir + destfile):
    os.remove(destdir + destfile)
    print "file deleted...creating new file"
    output = open(destdir + destfile, 'w+')
    output.write('ACCOUNTID|USERLINKID|APPLICATIONROLE|APPLICATION' + '\n')
else:
    print "no file to delete...creating file"
    output = open(destdir + destfile, 'w+')
    output.write('ACCOUNTID|USERLINKID|APPLICATIONROLE|APPLICATION' + '\n')

# get user/group data for all users' non-primary groups
# documentation: https://docs.python.org/2/library/grp.html
groups = grp.getgrall()
for group in groups:
    groupid = group[2]
    print groupid  # checking to see if group ids print correctly. Yes it does
    for user in group[3]:
        if groupid == '33':  # Issue is here!
            userid = 'qwerty'
            print userid  # testing var
        output.write(user + '|' + userid + '|' + group[0] + '|' + appname + '\n')
The issue is here:
if groupid == '33':  # Issue is here!
    userid = 'qwerty'
    print userid  # testing var
The variable "userid" is never set to its value and never prints anything while testing.
The group "33" does exist and has users in it. I cannot figure out why this doesn't work :(
I have another piece of code that does this for users (as I am looking at both Primary and Secondary groups, and once I figure out this part, I can fix the rest)
You are validating groupid against a string, but it is an integer:
if groupid == 33:  # then do something
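A minimal sketch against grp.getgrall() showing the difference (gr_gid, the third field of each group entry, is an int):
import grp

for group in grp.getgrall():
    groupid = group[2]        # gr_gid is an int, e.g. 33
    if groupid == 33:         # integer comparison: matches
        print(group[0] + " has gid 33")
    if groupid == '33':       # string comparison: never matches an int
        print("never reached")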
I don't get why it keeps giving me the "list index out of range" error for sys.argv[1]. From my understanding, I am passing data to users_database. Help please.
import sys, MySQLdb

def PrintFields(database, table):
    host = 'localhost'
    user = 'root'
    password = 'boysnblue1'
    conn = MySQLdb.Connection(db=database, host=host, user=user, passwd=password)
    mysql = conn.cursor()
    sql = """ SHOW COLUMNS FROM %s """ % table
    mysql.execute("select id, date, time, status from report_table")
    fields = mysql.fetchall()
    print '<table border="0"><tr><th>order</th><th>name</th><th>type</th><th>description</th></tr>'
    print '<tbody>'
    counter = 0
    for field in fields:
        counter = counter + 1
        id = field[0]
        date = field[1]
        time = field[2]
        status = field[3]
        print '<tr><td>' + str(counter) + '</td><td>' + id + '</td><td>' + date + '</td><td>' + time + '</td><td>' + status + ' </td></tr>'
    print '</tbody>'
    print '</table>'
    mysql.close()
    conn.close()

users_database = sys.argv[1]
users_table = sys.argv[2]

print "Wikified HTML for " + users_database + "." + users_table
print "========================"

PrintFields(users_database, users_table)
sys.argv is a list containing the name of the program's file and all of the arguments it was passed on the command line.
If you run python script2.py, the contents of sys.argv will be ['script2.py'].
If you run python script2.py database_name table_name, the contents of sys.argv will be ['script2.py', 'database_name', 'table_name'], which is what your program is currently configured to expect:
users_database = sys.argv[1]
users_table = sys.argv[2]
Since you are calling it the first way, sys.argv[1] does not exist, and you get your error that the index (1) is out of range (it only goes to 0).
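If you would rather exit with a usage message than an IndexError when the arguments are missing, a minimal sketch of the guard could look like this:
import sys

if len(sys.argv) < 3:
    print("usage: python script2.py database_name table_name")
    sys.exit(1)

users_database = sys.argv[1]
users_table = sys.argv[2]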