Scrapy Hanging on CSV Parse - python

I've looked quite a bit for a solution to this and maybe it's just an error on my end.
I'm using the Python/Scrapy framework to scrape a couple of sites. The sites provide me with .CSV files for their datasets. So instead of parsing the gridview they provide line by line, I've set Scrapy to download the .CSV file, then open the CSV file and write each value as a Scrapy item.
I've got my concurrent requests set to 1. I figured this would stop Scrapy from parsing the next request in the list until it was finished adding all the items. Unfortunately, the bigger the .CSV file, the longer Scrapy takes to parse the rows and import them as items. It will usually do about half of a 500kb CSV file before it makes the next request.
logfiles = sorted([f for f in os.listdir(logdir) if f.startswith('QuickSearch')])
logfiles = str(logfiles).replace("['", "").replace("']", "")
##look for downloaded CSV to open
try:
    with open(logfiles, 'rU') as f: ##open
        reader = csv.reader(f)
        for row in reader:
            item['state'] = 'Florida'
            item['county'] = county ##variable set from response.url
            item['saleDate'] = row[0]
            item['caseNumber'] = row[2]
            ..
            yield item
except:
    pass
f.close()
print ">>>>>>>>>>>F CLOSED"
os.remove(logfiles)
I need Scrapy to completely finish importing all the CSV values as items before moving on to the next request. Is there a way to accomplish this?
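For what it's worth, here is a sketch (not from the original post) of parsing the CSV straight from the response body inside the callback, so there is no QuickSearch* file on disk to hand off between requests; the item fields mirror the question, the callback name parse_csv is just illustrative, and passing county through response.meta is an assumption (the question derives it from response.url). This does not by itself change how Scrapy schedules the next request, but every row is yielded by the same callback that received the CSV.
import csv
import io

def parse_csv(self, response):
    # Sketch: yield one item per CSV row, parsed in memory from the response body.
    county = response.meta.get('county', '')  # assumption: set via Request.meta
    reader = csv.reader(io.StringIO(response.text))
    for row in reader:
        yield {
            'state': 'Florida',
            'county': county,
            'saleDate': row[0],
            'caseNumber': row[2],
        }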

Related

Python Looping through urls in csv file returns \ufeffhttps://

I am new to Python and I am trying to loop through the list of URLs in a CSV file and grab the website title using BeautifulSoup, which I would then like to save to a file Headlines.csv. But I am unable to grab the webpage title. If I use a variable with a single URL as follows:
import requests as req
from bs4 import BeautifulSoup

url = 'https://www.space.com/japan-hayabusa2-asteroid-samples-landing-date.html'
resp = req.get(url)
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.title.text)
It works just fine and I get the title Japanese capsule carrying pieces of asteroid Ryugu will land on Earth Dec. 6 | Space
But when I use the loop,
import csv

with open('urls_file2.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url)
        resp = req.get(url)
        soup = BeautifulSoup(resp.text, 'lxml')
        print(soup.title.text)
I get the following
['\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']
and an error message
InvalidSchema: No connection adapters were found for "['\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']"
I am not sure what I am doing wrong.
You have a byte order mark \ufeff on the URL you parse from your file.
It looks like your file was saved with a UTF-8 signature (BOM), i.e. the utf-8-sig encoding.
You need to read the file with encoding='utf-8-sig'.
Read more here.
As the previous answer has already mentioned, the \ufeff means you need to change the encoding.
The second issue is that when you read a CSV file, you get a list containing all the columns of each row. The keyword here is list: you are passing the request a list instead of a string.
Based on the example you have given, I would assume that your URLs are in the first column of the CSV. Python lists start at index 0, not 1, so to extract the URL you need index 0, which refers to the first column.
import csv

with open('urls_file2.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url[0])
To read up more on lists, you can refer here.
You can add more columns to the CSV file and experiment to see how the results would appear.
If you would like to refer to the column name while reading each row, you can refer here.
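Putting the two answers together, a minimal sketch of the fixed loop (file name and column position taken from the question):
import csv
import requests as req
from bs4 import BeautifulSoup

with open('urls_file2.csv', newline='', encoding='utf-8-sig') as f:  # utf-8-sig strips the BOM
    reader = csv.reader(f)
    for row in reader:
        url = row[0]  # the URL is in the first column
        resp = req.get(url)
        soup = BeautifulSoup(resp.text, 'lxml')
        print(soup.title.text)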

Python Multi-threading scraping, write data in csv file

I use a multiprocessing pool to speed up scraping and everything is okay, except I don't understand why Python writes the header of my CSV every 30 rows. I know it is linked to the pool parameter I entered, but how can I correct this behavior?
def parse(url):
    dico = {i: '' for i in colonnes}
    r = requests.get("https://change.org" + url, headers=headers, timeout=10)
    # sleep(2)
    if r.status_code == 200:
        # I scrape my data here
        ...
        pprint(dico)
        writer.writerow(dico)
        return dico

with open(lang + '/petitions_' + lang + '.csv', 'a') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=colonnes)
    writer.writeheader()
    with Pool(30) as p:
        p.map(parse, liens)
Can someone tell me where to put the writer.writerow(dico) call to avoid repeating the header?
Thanks
Check if the file exists:
os.path.isfile('mydirectory/myfile.csv')
If it exists, don't write the header again. Create one function (def ...) for the header and another for the data.
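A rough sketch of that idea (the path and fieldnames are placeholders, and append_rows is just an illustrative name): the header is written only when the file is being created for the first time.
import csv
import os

def append_rows(path, fieldnames, rows):
    write_header = not os.path.isfile(path)  # header only for a brand-new file
    with open(path, 'a', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)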
Looks like the "header" you are referring to comes from the writer.writeheader() line, not the writer.writerow() line.
Without a complete piece of your code, I can only assume that you have something like an outer loop that wraps around the with open block. So, every time your code enters the with block, a header line is printed, and then 30 lines of your scraped data (because of the pool size).
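An alternative layout (a sketch built on the names from the question: parse, liens, lang, colonnes) is to have parse only return dico, collect the results from the pool, and do all CSV writing once in the parent process, so the header can only ever be written once:
from multiprocessing import Pool
import csv

def parse(url):
    dico = {i: '' for i in colonnes}
    # ... scrape the page and fill dico, as in the question ...
    return dico

if __name__ == '__main__':
    with Pool(30) as p:
        results = p.map(parse, liens)  # one dict per URL

    with open(lang + '/petitions_' + lang + '.csv', 'a', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=colonnes)
        writer.writeheader()  # written exactly once
        writer.writerows(r for r in results if r)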

Problems saving scraped data from a webpage into a excel file

I am new to scraping using Python. After using a lot of useful resources I was able to scrape the content of a Page. However, I am having trouble saving this data into a .csv file.
Python:
import mechanize
import time
import requests
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox(executable_path=r'C:\Users\geckodriver.exe')
driver.get("myUrl.jsp")
username = driver.find_element_by_name('USER')
password = driver.find_element_by_name('PASSWORD')
username.send_keys("U")
password.send_keys("P")
main_frame = driver.find_element_by_xpath('//*[@id="Frame"]')
src = driver.switch_to_frame(main_frame)
table = driver.find_element_by_xpath("/html/body/div/div[2]/div[5]/form/div[7]/div[3]/table")
rows = table.find_elements(By.TAG_NAME, "tr")
for tr in rows:
    outfile = open("C:/Users/Scripts/myfile.csv", "w")
    with outfile:
        writers = csv.writer(outfile)
        writers.writerows(tr.text)
Problem:
Only one of the rows gets written to the excel file. However, when I print the tr.text into the console, all the required rows show up. How can I get all the text inside tr elements to be written into an excel file?
Currently your code will open the file, write one line, close it, then on the next row open it again and overwrite the line. Please consider the following code snippet:
# We use 'with' to open the file and auto close it when done
# syntax is best modified as follows
with open('C:/Users/Scripts/myfile.csv', 'w') as outfile:
    writers = csv.writer(outfile)
    # we only need to open the file once so we open it first
    # then loop through each row to print everything into the open file
    for tr in rows:
        writers.writerows(tr.text)
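One caveat worth adding to the snippet above (not part of the original answer): csv.writer.writerows() expects an iterable of rows, so passing the string tr.text writes one character per line. A sketch that writes one CSV row per table row, with one cell per td element, could look like this:
import csv
from selenium.webdriver.common.by import By

with open('C:/Users/Scripts/myfile.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for tr in rows:
        cells = [td.text for td in tr.find_elements(By.TAG_NAME, 'td')]  # one cell per <td>
        writer.writerow(cells)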

Writing the exact same thing in CSV file using Python

I've encountered an issue with the CSV-writing part of my web-scraping project.
I have data formatted like this:
table = {
    "UR": url,
    "DC": desc,
    "PR": price,
    "PU": picture,
    "SN": seller_name,
    "SU": seller_url
}
I get this table from a loop that analyzes an HTML page and returns it.
Basically, the table is fine; it changes on every iteration of the loop.
The problem is that when I write every table from that loop into my CSV file, it just writes the same thing over and over again.
The only element written is the first one the loop produces, and it gets written about 10 million times instead of about 45 times (articles per page).
I tried to do it vanilla with the library 'csv' and then with pandas.
So here's my loop :
if os.path.isfile(file_path) is False:
    open(file_path, 'a').close()

file = open(file_path, "a", encoding="utf-8")

i = 1
while True:
    final_url = website + brand_formatted + "+handbags/?p=" + str(i)

    request = requests.get(final_url)
    soup = BeautifulSoup(request.content, "html.parser")

    articles = soup.find_all("div", {"class": "dui-card searchresultitem"})

    for article in articles:
        table = scrap_it(article)
        write_to_csv(table, file)

    if i == nb_page:
        break
    i += 1

file.close()
and here my method to write into a csv file :
def write_to_csv(table, file):
    import csv
    writer = csv.writer(file, delimiter=" ")
    writer.writerow(table["UR"])
    writer.writerow(table["DC"])
    writer.writerow(table["PR"])
    writer.writerow(table["PU"])
    writer.writerow(table["SN"])
    writer.writerow(table["SU"])
I'm pretty new to writing CSV files and to Python in general, but I can't find why this isn't working. I've followed many guides and ended up with more or less the same code for writing a CSV file.
edit: Here's an output in an img of my csv file
you can see that every element is exactly the same, even though my table changes
EDIT: I fixed my problem by making a separate file for each article I scrape. That's a lot of files, but apparently that is fine for my project.
This might be the solution you wanted:
import csv

fieldnames = ['UR', 'DC', 'PR', 'PU', 'SN', 'SU']

def write_to_csv(table, file):
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writerow(table)
Reference: https://docs.python.org/3/library/csv.html
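Tying this back to the question's loop, a sketch of how it could be used (names such as file_path, articles and scrap_it come from the question): open the file once, write the header once, then write one row per scraped article.
import csv

fieldnames = ['UR', 'DC', 'PR', 'PU', 'SN', 'SU']

with open(file_path, 'a', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()  # once per file
    for article in articles:  # inner loop from the question
        writer.writerow(scrap_it(article))  # one CSV row per article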

How to write data into multiple files in a directory

I have been stuck for hours on how to write my crawled data into multiple files. I wrote code that scrapes a website and extracts the body of each link on it. An example is crawling a news website: you extract all the links and then extract the body of each link. I have done that successfully. But now, instead of storing them all in one file using the code below,
def save_data(data):
    the_file = open('raw_data.txt', 'w')
    for title_text, body_content, url in data:
        the_file.write("%s\n" % [title_text, body_content, url])
how do I write the code so that I store each article in a different file? So I would have something like Article_00, Article_01, Article_02, and so on.
Thanks
If you want to save the data in multiple files, then you must open multiple files for writing.
Use enumerate to get a counter for which data set you're iterating over, so you can use it in the filename like this:
def save_data(data):
    for i, (title_text, body_content, url) in enumerate(data):
        file = open('Article_%02d' % (i,), 'w+')
        file.write("%s\n" % [title_text, body_content, url])
        file.close()
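The same thing written with a with block, so each file is closed even if a write fails (the .txt extension is just an assumption, since the data is written as plain text):
def save_data(data):
    for i, (title_text, body_content, url) in enumerate(data):
        # one file per article: Article_00.txt, Article_01.txt, ...
        with open('Article_%02d.txt' % i, 'w') as article_file:
            article_file.write("%s\n" % [title_text, body_content, url])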
