I use a multiprocessing Pool to speed up my scraping and everything is okay, except that I don't understand why Python writes the header of my CSV every 30 rows. I know this is linked to the pool parameter I passed, but how can I correct this behaviour?
def parse(url):
    dico = {i: '' for i in colonnes}
    r = requests.get("https://change.org" + url, headers=headers, timeout=10)
    # sleep(2)
    if r.status_code == 200:
        # I scrape my data here
        ...
    pprint(dico)
    writer.writerow(dico)
    return dico

with open(lang + '/petitions_' + lang + '.csv', 'a') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=colonnes)
    writer.writeheader()
    with Pool(30) as p:
        p.map(parse, liens)
Can someone tell me where to put the writer.writerow(dico) call to avoid the repetition of the header?
Thanks
Check if the file exists:
os.path.isfile('mydirectory/myfile.csv')
If it exists, don't write the header again. Create one function (def ...) for the header and another for the data.
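A minimal sketch of that split, assuming the column list colonnes and the file path come from your own script (the helper names write_header and write_row are just illustrative):

import csv
import os

def write_header(path, fieldnames):
    # Write the header only if the file does not exist yet.
    if not os.path.isfile(path):
        with open(path, 'w', newline='') as f:
            csv.DictWriter(f, fieldnames=fieldnames).writeheader()

def write_row(path, fieldnames, row):
    # Append a single data row; no header is ever written here.
    with open(path, 'a', newline='') as f:
        csv.DictWriter(f, fieldnames=fieldnames).writerow(row)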
Looks like the "header" you are referring to comes from the writer.writeheader() line, not the writer.writerow() line.
Without a complete piece of your code, I can only assume that you have something like an outer loop wrapping the with open block. So every time your code enters the with block, a header line is written, followed by 30 lines of your scraped data (because of the pool size).
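If that is the case, one way to avoid the problem altogether is to do no writing inside parse at all: let it just return the dict, and do all the CSV writing in the parent process, so the header is written exactly once. A rough sketch under that assumption, reusing parse, colonnes, lang and liens from your question:

from multiprocessing import Pool
import csv

if __name__ == '__main__':
    # 'w' so the file starts fresh; use 'a' plus an existence check
    # if you need to append across runs.
    with open(lang + '/petitions_' + lang + '.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=colonnes)
        writer.writeheader()                   # header written exactly once
        with Pool(30) as p:
            for dico in p.imap(parse, liens):  # parse only returns dico now
                if dico:
                    writer.writerow(dico)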
I'm having trouble with the Python csv module. I'm trying to write a new line to a CSV file; is there any reason why it would not work?
Code:
# csv writing function
def write_response_csv(name, games, mins):
    with open("sport_team.csv", 'w', newline='', encoding='utf-8') as csv_file:
        fieldnames = ['Vardas', 'Žaidimai', 'Minutės']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerow({'Vardas': name, 'Žaidimai': games, 'Minutės': mins})

with requests.get(url, headers=headers) as page:
    content = soup(page.content, 'html.parser')
    content = content.findAll('table', class_='table01 tablesorter')
    names = find_name(content)
    times = 0
    for name in names:
        matches = find_matches(content, times)
        min_in_matches = find_min(content, times)
        times += 1
        csv_file = write_response_csv(name, matches, min_in_matches)
        try:
            print(name, matches, min_in_matches)
        except:
            pass
When you call your write_response_csv function, it reopens the file in write mode and starts again at line 1 of the CSV, so each new row of data you pass to the function overwrites the previous one. What you could try is creating the CSV file (and its header) outside the scope of your writer function and opening the file in append mode inside the function. That ensures each call writes its data on the next empty CSV line instead of starting at line 1.
# Outside of function scope
fieldnames = ['Vardas', 'Žaidimai', 'Minutės']

# Create sport_team.csv with its header
with open('sport_team.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames)
    writer.writeheader()

# Write-response function: append mode, so each call adds a new row
def write_response_csv(name, games, mins):
    with open('sport_team.csv', 'a', newline='', encoding='utf-8') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames)
        writer.writerow({'Vardas': name, 'Žaidimai': games, 'Minutės': mins})
Note:
You will run into the same issue if you reuse this script to keep adding new rows to the same file, because each time you run it the code that creates the CSV file will recreate a blank sport_team.csv containing only the headers. If you want to reuse the code to keep appending data, look into os.path and use it to check whether sport_team.csv already exists; if it does, skip the file-creation block after the fieldnames.
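For example, a sketch of that guard (same fieldnames as above):

import csv
import os

fieldnames = ['Vardas', 'Žaidimai', 'Minutės']

# Create sport_team.csv with its header only if it does not exist yet,
# so re-running the script keeps appending instead of wiping the file.
if not os.path.isfile('sport_team.csv'):
    with open('sport_team.csv', 'w', newline='', encoding='utf-8') as csv_file:
        csv.DictWriter(csv_file, fieldnames).writeheader()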
Try using Metabob; it finds code errors for you. I've been using it as a Python beginner and have had pretty good results with it.
I'm running a Python script with BeautifulSoup in order to extract text, topics and tags from web articles. The website contains 210 pages, and each page contains 10 articles (each article's URL is stored in a txt file).
I'm using the following code:
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

data = []
with open('urls.txt', 'r') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if response.ok:
            try:
                soup = BeautifulSoup(response.text, "html.parser")
                text = soup.select_one('div.para_content_text').get_text(strip=True)
                topic = soup.select_one('div.article_tags_topics').get_text(strip=True)
                tags = soup.select_one('div.article_tags_tags').get_text(strip=True)
            except AttributeError:
                print(" ")
            data.append(
                {
                    'text': text,
                    'topic': topic,
                    'tags': tags
                }
            )
            pd.DataFrame(data).to_csv('text.csv', index=False, header=True)
            time.sleep(3)
My code seems to be correct, but it has been running for several days now.
I would like to understand whether an error is blocking progress or whether the process is simply very long.
To do this, I would like to add a "component" to my code that lets me track the number of URLs processed in real time.
Any ideas?
The way your code is written now, you are accumulating all the data in memory until it's all fetched. The easiest way to keep track of the progress without changing the code too much would be to just print either the current URL, or the number of the URL you're processing.
A better way, which involves changing the code a little more, would be to write the data to the CSV file as you parse it, instead of all at once at the end. Something like:
print("text,topic,tags")
with open('urls.txt', 'r') as inf:
for row in inf:
url = row.strip()
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
# Getting the data you want...
print(f"{text},{topic},{tags}")
If you are going with this method, make sure to escape/remove commas, or use an actual CSV library to produce the lines.
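For example, here is a sketch that combines the csv module with a running counter of processed URLs. The selectors are the ones from your question; the progress line and the output filename are illustrative assumptions:

import csv
import requests
from bs4 import BeautifulSoup

with open('urls.txt', 'r') as inf, open('text.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['text', 'topic', 'tags'])
    for count, row in enumerate(inf, start=1):
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if response.ok:
            try:
                soup = BeautifulSoup(response.text, 'html.parser')
                text = soup.select_one('div.para_content_text').get_text(strip=True)
                topic = soup.select_one('div.article_tags_topics').get_text(strip=True)
                tags = soup.select_one('div.article_tags_tags').get_text(strip=True)
                writer.writerow([text, topic, tags])  # written as soon as it is parsed
            except AttributeError:
                pass  # a selector was missing on this page; skip it
        print(f"processed {count} URLs")  # real-time progress counter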
I'm new to Python and scraping. I'm trying to run two loops: one goes and scrapes IDs from one page; then, using those IDs, I call another API to get more info/properties.
But when I run this program, it runs the first part fine (gets the IDs), then it exits without running the second part. I feel I'm missing something really basic about control flow in Python here. Why does Python exit after the first loop when I run it in the Terminal?
import requests
import csv
import time
import json
from bs4 import BeautifulSoup, Tag

file = open('parcelids.csv', 'w')
writer = csv.writer(file)
writer.writerow(['parcelId'])

for x in range(1, 10):
    time.sleep(1)  # slowing it down
    url = 'http://apixyz/Parcel.aspx?Pid=' + str(x)
    source = requests.get(url)
    response = source.content
    soup = BeautifulSoup(response, 'html.parser')
    parcelId = soup.find("span", id="MainContent_lblMblu").text.strip()
    writer.writerow([parcelId])

out = open('mapdata.csv', 'w')

with open('parcelIds.csv', 'r') as in1:
    reader = csv.reader(in1)
    writer = csv.writer(out)
    next(reader, None)  # skip header
    for row in reader:
        row = ''.join(row[0].split())[:-2].upper().replace('/', '-')  # formatting
        url = "https://api.io/api/properties/"
        url1 = url + row
        time.sleep(1)  # slowing it down
        response = requests.get(url1)
        resp_json_payload = response.json()
        address = resp_json_payload['property']['address']
        writer.writerow([address])
If you are running on Windows (where filenames are not case sensitive), then the file you have open for writing (parcelids.csv) is still open when you reopen it to read from it.
Try closing the file before opening it to read from it.
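A minimal sketch of that fix: wrap the first file in a with block (or call file.close()) before the second part runs, so the rows are flushed to disk and the handle is released.

import csv

# First part: write the parcel ids inside a with block so the file is
# closed (and flushed to disk) before the second loop tries to read it.
with open('parcelids.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['parcelId'])
    # ... the scraping loop from the question writes its rows here ...

# Second part: the file is now closed, so reading it back works.
with open('parcelids.csv', 'r') as in1:
    reader = csv.reader(in1)
    next(reader, None)  # skip header
    for row in reader:
        ...  # call the API as in the question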
I've run into an issue with the CSV-writing part of a web-scraping project.
I have data formatted like this:
table = {
    "UR": url,
    "DC": desc,
    "PR": price,
    "PU": picture,
    "SN": seller_name,
    "SU": seller_url
}
which I get from a loop that analyzes an HTML page and returns this table.
Basically, the table itself is fine; it changes on every iteration of the loop.
The problem is that when I write every table I get from that loop to my CSV file, the same thing gets written over and over again.
The only element written is the first one I get from my loop, and it is written about 10 million times instead of about 45 times (the number of articles per page).
I tried to do it plainly with the csv library and then with pandas.
So here's my loop :
if os.path.isfile(file_path) is False:
    open(file_path, 'a').close()

file = open(file_path, "a", encoding="utf-8")

i = 1
while True:
    final_url = website + brand_formatted + "+handbags/?p=" + str(i)
    request = requests.get(final_url)
    soup = BeautifulSoup(request.content, "html.parser")

    articles = soup.find_all("div", {"class": "dui-card searchresultitem"})
    for article in articles:
        table = scrap_it(article)
        write_to_csv(table, file)

    if i == nb_page:
        break
    i += 1

file.close()
and here is my method for writing to the CSV file:
def write_to_csv(table, file):
    import csv
    writer = csv.writer(file, delimiter=" ")
    writer.writerow(table["UR"])
    writer.writerow(table["DC"])
    writer.writerow(table["PR"])
    writer.writerow(table["PU"])
    writer.writerow(table["SN"])
    writer.writerow(table["SU"])
I'm pretty new to writing CSV files and to Python in general, but I can't find why this isn't working. I've followed many guides and ended up with more or less the same code for writing CSV files.
edit: Here's an output of my CSV file as an image;
you can see that every element is exactly the same, even though my table changes.
EDIT: I fixed my problem by creating a separate file for each article I scrape. That's a lot of files, but apparently it is fine for my project.
This might be the solution you wanted:
import csv

fieldnames = ['UR', 'DC', 'PR', 'PU', 'SN', 'SU']

def write_to_csv(table, file):
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writerow(table)
Reference: https://docs.python.org/3/library/csv.html
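Used together with the loop from your question, that might look roughly like this. The header guard, the newline='' argument and the os checks are assumptions on my part, not part of the answer above:

import csv
import os

fieldnames = ['UR', 'DC', 'PR', 'PU', 'SN', 'SU']

# Write the header only when the file is new or empty.
write_header = not os.path.isfile(file_path) or os.path.getsize(file_path) == 0

with open(file_path, 'a', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    if write_header:
        writer.writeheader()
    for article in articles:        # articles and scrap_it as in the question
        table = scrap_it(article)
        writer.writerow(table)      # one CSV row per scraped article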
I've looked quite a bit for a solution to this, and maybe it's just an error on my end.
I'm using the Python/Scrapy framework to scrape a couple of sites. The sites provide me with .CSV files for their datasets, so instead of parsing the gridview they provide line by line, I've set Scrapy to download the .CSV file, then open the CSV file and write each value as a Scrapy item.
I've got my concurrent requests set to 1. I figured this would stop Scrapy from parsing the next request in the list until it had finished adding all the items. Unfortunately, the bigger the .CSV file, the longer Scrapy takes to parse the rows and import them as items; it will usually get through about half of a 500 KB CSV file before it makes the next request.
logfiles = sorted([f for f in os.listdir(logdir) if f.startswith('QuickSearch')])
logfiles = str(logfiles).replace("['", "").replace("']", "")

## look for downloaded CSV to open
try:
    with open(logfiles, 'rU') as f:  ## open
        reader = csv.reader(f)
        for row in reader:
            item['state'] = 'Florida'
            item['county'] = county  ## variable set from response.url
            item['saleDate'] = row[0]
            item['caseNumber'] = row[2]
            ..
            yield item
except:
    pass

f.close()
print ">>>>>>>>>>>F CLOSED"
os.remove(logfiles)
I need Scrapy to completely finish importing all the CSV values as items before moving on to the next request. Is there a way to accomplish this?