I have been stuck for hours on how to write my crawled data into multiple files. I wrote code that scrapes a website and extracts the body of every link on the site. For example, you crawl a news website, extract all the links, and then extract the body of each link. I have done that successfully. But my concern now is that instead of storing everything in a single file using the code below,
def save_data(data):
    the_file = open('raw_data.txt', 'w')
    for title_text, body_content, url in data:
        the_file.write("%s\n" % [title_text, body_content, url])
how do I write the code so that each article is stored in a different file? I would end up with something like Article_00, Article_01, Article_02...
Thanks
If you want to save the data in multiple files, then you must open multiple files for writing.
Use enumerate to get a counter for which data set you're iterating over, so you can use it in the filename like this:
def save_data(data):
    for i, (title_text, body_content, url) in enumerate(data):
        file = open('Article_%02d' % (i,), 'w+')
        file.write("%s\n" % [title_text, body_content, url])
        file.close()
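As a side note, a with block will close each file even if the write fails partway through. A minimal sketch of the same idea (the .txt extension is just an assumption about how you want the files named):

def save_data(data):
    for i, (title_text, body_content, url) in enumerate(data):
        # each article goes into its own file: Article_00.txt, Article_01.txt, ...
        with open('Article_%02d.txt' % i, 'w') as f:
            f.write("%s\n" % [title_text, body_content, url])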
I have an Excel file with a column of 4000+ URLs, each in a different cell. I need to use Python to open each one with Chrome, scrape some data from the website, and paste the results into Excel. Then do the same for the next URL. Could you please help me with that?
Export the Excel file to a CSV file and read the data from it like this:
def data_collector(url):
    # do your scraping here and return the data you want written in place of the url
    return url

with open("myfile.csv") as fobj:
    content = fobj.read()

# the line below gives you the urls as a list
urls = content.replace(",", " ").split()

for url in urls:
    data_to_be_written = data_collector(url)
    # added extra quotes to prevent the csv from breaking; it is better
    # to use the csv module to write the csv file, but for ease of
    # understanding I did it like this, hoping you will correct it yourself
    content = "\"" + content.replace(url, data_to_be_written) + "\""

with open("new_file.csv", "wt") as fnew:
    fnew.write(content)
After running this code you will get new_file.csv; opening it with Excel, you will see your desired data in place of each URL.
If you want to keep the URL together with its data, just append them into one string separated by a colon.
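Since the answer above already notes that the csv module is the better tool for writing the file, here is a minimal sketch of that approach; it assumes the URLs sit in the first column of myfile.csv and that data_collector returns one string per URL:

import csv

def data_collector(url):
    # placeholder: scrape the page and return whatever you want stored for this url
    return url

with open("myfile.csv", newline="") as fobj:
    rows = list(csv.reader(fobj))

with open("new_file.csv", "w", newline="") as fnew:
    writer = csv.writer(fnew)
    for row in rows:
        url = row[0]
        # keep the url and put the scraped data in the next column
        writer.writerow([url, data_collector(url)])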
The goal is to download GTFS data through Python web scraping, starting with https://transitfeeds.com/p/agence-metropolitaine-de-transport/129/latest/download
Currently, I'm using requests like so:
import requests

def download(url):
    fpath = "prov/city/GTFS"
    r = requests.get(url)
    if r.ok:
        print("Saving file.")
        open(fpath, "wb").write(r.content)
    else:
        print("Download failed.")
Unfortunately, the result of requests.content for the above URL renders the following:
You can see the files of interest within the output (e.g. stops.txt) but how might I access them to read/write?
I fear you're trying to read a zip file with a text editor; try using the "zipfile" module instead.
The following worked:
def download(url):
    fpath = "path/to/output/"
    f = requests.get(url, stream=True, headers=headers)
    if f.ok:
        print("Saving to {}".format(fpath))
        g = open(fpath + 'output.zip', 'wb')
        g.write(f.content)
        g.close()
    else:
        print("Download failed with error code: ", f.status_code)
You need to write this content to a zip file.
import requests
url = "https://transitfeeds.com/p/agence-metropolitaine-de-transport/129/latest/download"
fname = "gtfs.zip"
r = requests.get(url)
open(fname, "wb").write(r.content)
Now fname exists and has several text files inside. If you want to programmatically extract the zip and then read the content of a file, for example stops.txt, you first need to extract that single file, or simply use extractall.
import zipfile
# this will extract only a single file, and
# raise a KeyError if the file is missing from the archive
zipfile.ZipFile(fname).extract("stops.txt")
# this will extract all the files found from the archive,
# overwriting files in the process
zipfile.ZipFile(fname).extractall()
Now you just need to work with your file(s).
thefile = "stops.txt"
# just plain text
text = open(thefile).read()
# csv file
import csv
reader = csv.reader(open(thefile))
for row in reader:
    ...
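If you would rather not extract anything to disk, you can also read stops.txt straight out of the archive. A minimal sketch, assuming the member inside the zip is named exactly stops.txt:

import csv
import io
import zipfile

with zipfile.ZipFile(fname) as z:
    # ZipFile.open() returns a binary file object, so wrap it for text/csv reading
    with z.open("stops.txt") as raw:
        reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
        for row in reader:
            print(row)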
I've encountered an issue with the CSV-writing part of a web-scraping project.
I have data formatted like this:
table = {
    "UR": url,
    "DC": desc,
    "PR": price,
    "PU": picture,
    "SN": seller_name,
    "SU": seller_url
}
I get this table from a loop that analyzes an HTML page and returns it.
Basically, the table itself is fine; it changes on every iteration of the loop.
The problem is that when I write every table I get from that loop into my CSV file, it just writes the same thing over and over again.
The only element written is the first one I get from my loop, and it is written about 10 million times instead of about 45 times (the number of articles per page).
I tried to do it vanilla with the 'csv' library and then with pandas.
So here's my loop:
if os.path.isfile(file_path) is False:
    open(file_path, 'a').close()

file = open(file_path, "a", encoding="utf-8")

i = 1
while True:
    final_url = website + brand_formatted + "+handbags/?p=" + str(i)

    request = requests.get(final_url)
    soup = BeautifulSoup(request.content, "html.parser")

    articles = soup.find_all("div", {"class": "dui-card searchresultitem"})

    for article in articles:
        table = scrap_it(article)
        write_to_csv(table, file)

    if i == nb_page:
        break
    i += 1

file.close()
and here is my method for writing to the CSV file:
def write_to_csv(table, file):
    import csv
    writer = csv.writer(file, delimiter=" ")
    writer.writerow(table["UR"])
    writer.writerow(table["DC"])
    writer.writerow(table["PR"])
    writer.writerow(table["PU"])
    writer.writerow(table["SN"])
    writer.writerow(table["SU"])
I'm pretty new to writing CSV files and to Python in general, but I can't figure out why this isn't working. I've followed many guides and ended up with more or less the same code for writing a CSV file.
Edit: here's an image of the output in my CSV file.
You can see that every element is exactly the same, even though my table changes.
EDIT: I fixed my problem by making a separate file for each article I scrape. That's a lot of files, but apparently that is fine for my project.
This might be the solution you wanted:
import csv

fieldnames = ['UR', 'DC', 'PR', 'PU', 'SN', 'SU']

def write_to_csv(table, file):
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writerow(table)
Reference: https://docs.python.org/3/library/csv.html
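A quick usage note on the answer above: writerow takes one dict per call, and if you want the column names at the top of the file you can call writeheader once right after opening it. A minimal sketch, where output.csv and scraped_tables are placeholder names:

import csv

fieldnames = ['UR', 'DC', 'PR', 'PU', 'SN', 'SU']

# newline="" prevents the csv module from writing blank lines between rows on Windows
with open("output.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for table in scraped_tables:  # one dict per scraped article
        writer.writerow(table)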
I've looked quite a bit for a solution to this and maybe it's just an error on my end.
I'm using the python/scrapy framework to scrape a couple of sites. The sites provide me with .CSV files for their datasets. So instead of parsing the gridview they provide line by line, I've set scrapy to download the .CSV file, then open the CSV file and write each value as a Scrapy item.
I've got my concurrent requests set to 1. I figured this would stop Scrapy from parsing the next request in the list until it was finished adding all the items. Unfortunately, the bigger the .CSV file, the longer Scrapy takes to parse the rows and import them as items. It will usually do about half of a 500kb CSV file before it makes the next request.
logfiles = sorted([f for f in os.listdir(logdir) if f.startswith('QuickSearch')])
logfiles = str(logfiles).replace("['", "").replace("']", "")
## look for downloaded CSV to open
try:
    with open(logfiles, 'rU') as f:  ## open
        reader = csv.reader(f)
        for row in reader:
            item['state'] = 'Florida'
            item['county'] = county  ## variable set from response.url
            item['saleDate'] = row[0]
            item['caseNumber'] = row[2]
            ..
            yield item
except:
    pass
f.close()
print ">>>>>>>>>>>F CLOSED"
os.remove(logfiles)
I need Scrapy to completely finish importing all the CSV values as items before moving on to the next request. Is there a way to accomplish this?
I need to save a file (.pdf), but I'm unsure how to do it. I need to save .pdfs and store them in such a way that they are organized in directories much like they are stored on the site I'm scraping them from.
From what I can gather, I need to make a pipeline, but from what I understand pipelines save "Items", and "items" are just basic data like strings/numbers. Is saving files a proper use of pipelines, or should I save the file in the spider instead?
Yes and no[1]. If you fetch a pdf it will be stored in memory, but as long as the pdfs are not big enough to fill up your available memory, that is fine.
You could save the pdf in the spider callback:
def parse_listing(self, response):
    # ... extract pdf urls
    for url in pdf_urls:
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    path = self.get_path(response.url)
    with open(path, "wb") as f:
        f.write(response.body)
If you choose to do it in a pipeline:
# in the spider
def parse_pdf(self, response):
    i = MyItem()
    i['body'] = response.body
    i['url'] = response.url
    # you can add more metadata to the item
    return i

# in your pipeline
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove body and add path as reference
    del item['body']
    item['path'] = path
    # let item be processed by other pipelines. ie. db store
    return item
[1] Another approach could be to store only the pdfs' URLs and use another process to fetch the documents without buffering them into memory (e.g. wget).
There is a FilesPipeline that you can use directly, assuming you already have the file URL. This link shows how to use FilesPipeline:
https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
It's a perfect tool for the job. The way Scrapy works is that you have spiders that transform web pages into structured data (items). Pipelines are post-processors, but they use the same asynchronous infrastructure as spiders, so they are perfect for fetching media files.
In your case, you'd first extract the locations of the PDFs in the spider, fetch them in the pipeline, and have another pipeline to save the items.
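Since FilesPipeline came up, here is a minimal sketch of how it is usually wired up; the item class, spider callback, and FILES_STORE path are placeholder names of my own, not something from the question:

# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/path/to/store/pdfs'

# items.py
import scrapy

class PdfItem(scrapy.Item):
    file_urls = scrapy.Field()  # urls for FilesPipeline to download
    files = scrapy.Field()      # filled in by the pipeline with the download results

# in the spider callback
def parse_listing(self, response):
    pdf_urls = response.css('a[href$=".pdf"]::attr(href)').getall()
    yield PdfItem(file_urls=[response.urljoin(u) for u in pdf_urls])

FilesPipeline downloads everything listed in file_urls, stores it under FILES_STORE, and records the resulting paths in the files field, so a later pipeline can still post-process or relocate the saved documents.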