Writing the exact same thing in CSV file using Python

I've run into an issue with the CSV-writing part of a web-scraping project.
My data is formatted like this:
table = {
    "UR": url,
    "DC": desc,
    "PR": price,
    "PU": picture,
    "SN": seller_name,
    "SU": seller_url
}
I get this from a loop that parses an HTML page and returns the table.
The table itself is fine; it changes on every iteration of the loop.
The problem is that when I write every table from that loop into my CSV file, it just writes the same thing over and over again.
The only element written is the first one the loop produces, and it is written about 10 million times instead of about 45 times (the number of articles per page).
I tried doing it with the standard csv library and then with pandas.
Here's my loop:
if os.path.isfile(file_path) is False:
    open(file_path, 'a').close()
file = open(file_path, "a", encoding = "utf-8")
i = 1
while True:
    final_url = website + brand_formatted + "+handbags/?p=" + str(i)
    request = requests.get(final_url)
    soup = BeautifulSoup(request.content, "html.parser")
    articles = soup.find_all("div", {"class": "dui-card searchresultitem"})
    for article in articles:
        table = scrap_it(article)
        write_to_csv(table, file)
    if i == nb_page:
        break
    i += 1
file.close()
And here is my function that writes to the CSV file:
def write_to_csv(table, file):
    import csv
    writer = csv.writer(file, delimiter = " ")
    writer.writerow(table["UR"])
    writer.writerow(table["DC"])
    writer.writerow(table["PR"])
    writer.writerow(table["PU"])
    writer.writerow(table["SN"])
    writer.writerow(table["SU"])
I'm pretty new to writing CSV files and to Python in general, but I can't figure out why this isn't working. I've followed many guides and ended up with more or less the same code for writing a CSV file.
Edit: here's an image of the output in my CSV file.
You can see that every element is exactly the same, even though my table changes.
EDIT: I fixed my problem by creating a file for each article I scrape. That's a lot of files, but apparently it's fine for my project.

This might be the solution you wanted:
import csv
fieldnames = ['UR', 'DC', 'PR', 'PU', 'SN', 'SU']
def write_to_csv(table, file):
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writerow(table)
Reference: https://docs.python.org/3/library/csv.html
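To illustrate the DictWriter approach, here is a minimal sketch (not the asker's actual scraper). The sample rows and the write_tables helper are made up for the example, and an in-memory buffer stands in for the real file. The last lines also show what csv.writer.writerow does when it is handed a plain string: the string is iterated character by character, so each character becomes its own field.

```python
import csv
import io

FIELDNAMES = ["UR", "DC", "PR", "PU", "SN", "SU"]


def write_tables(tables, file):
    """Write one header row, then one CSV row per table dict."""
    writer = csv.DictWriter(file, fieldnames=FIELDNAMES)
    writer.writeheader()
    for table in tables:
        writer.writerow(table)


# Hypothetical sample data standing in for the scraper's output.
sample = [
    {"UR": "http://example.com/1", "DC": "red bag", "PR": "10",
     "PU": "1.jpg", "SN": "ann", "SU": "http://example.com/ann"},
    {"UR": "http://example.com/2", "DC": "blue bag", "PR": "20",
     "PU": "2.jpg", "SN": "bob", "SU": "http://example.com/bob"},
]

buffer = io.StringIO()  # in-memory stand-in for the real CSV file
write_tables(sample, buffer)
csv_text = buffer.getvalue()

# A plain string passed to writerow is treated as a sequence of characters.
char_buffer = io.StringIO()
csv.writer(char_buffer, delimiter=" ").writerow("abc")
split_text = char_buffer.getvalue()
```

When writing to a real file, the csv documentation recommends opening it with newline="" so the module controls line endings itself.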

Related

After reading urls from a text file, how can I save all the responses into separate files?

I have a script that reads URLs from a text file, performs a request for each, and then saves all the responses in one text file. How can I save each response in a different text file instead? For example, if my text file input.txt has 20 URLs, I would like to save the responses in 20 different .txt files like output1.txt, output2.txt, and so on, instead of just one .txt file. So for each request, the response is saved in a new .txt file. Thank you
import requests
from bs4 import BeautifulSoup
with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class":'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            with open('output.txt', 'a+') as f:
                f.write(data + "\n")
                print(data)
Here's a quick way to implement what others have hinted at:
import requests
from bs4 import BeautifulSoup
with open('input.txt', 'r') as f_in:
    for i, line in enumerate(map(str.strip, f_in)):
        if not line:
            continue
        ...
        with open(f'output_{i}.txt', 'w') as f:
            f.write(data + "\n")
            print(data)
You can make a new file by using open('something.txt', 'w'). If the file is found, it'll erase its content. Else, it'll make a new file named 'something.txt'. Now, you can use file.write() to write your info!
I'm not sure I understood your problem correctly.
I would create a list, add an object to it for each URL request and its response, and then write a separate file for each object in the list.
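The collect-then-write idea could look roughly like the sketch below. It is only an illustration: the save_each helper, the (url, text) pairs, and the temporary output directory are assumptions made for the example, and the hard-coded pairs stand in for what requests.get(url).text would return.

```python
import os
import tempfile


def save_each(responses, directory):
    """Write every (url, text) pair to its own numbered file; return the paths."""
    paths = []
    for number, (url, text) in enumerate(responses, start=1):
        path = os.path.join(directory, f"output{number}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(url + "\n" + text + "\n")
        paths.append(path)
    return paths


# Hypothetical stand-ins for the downloaded responses.
responses = [
    ("http://example.com/a", "first body"),
    ("http://example.com/b", "second body"),
]

out_dir = tempfile.mkdtemp()  # throwaway directory for the demo
written = save_each(responses, out_dir)
```

Keeping the download step and the write step separate makes it easy to change the naming scheme later without touching the request code.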
There are at least two ways you could generate files for each URL. One, shown below, is to create a hash of some data that is unique to the file. In this case I chose the category text, but you could also use the whole contents of the file. This creates a string that is safe to use as a file name, so each distinct category text gets its own file (two links with identical category text would still map to the same file).
Another way, not shown, is to find some unique value within the data itself and use it as the filename without hashing it. However, this can cause more problems than it solves, since data on the Internet should not be trusted.
Here's your code with a hash used for the filename. The snippet uses SHA-256 from hashlib; MD5 would also work here, since it is not a secure hashing function for passwords but is fine for creating unique filenames.
Updated Snippet
import hashlib
import requests
from bs4 import BeautifulSoup
with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class":'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            filename = hashlib.sha256()
            filename.update(category.text.encode('utf-8'))
            with open('{}.html'.format(filename.hexdigest()), 'w') as f:
                f.write(data + "\n")
                print(data)
Code added
filename = hashlib.sha256()
filename.update(category.text.encode('utf-8'))
with open('{}.html'.format(filename.hexdigest()), 'w') as f:
Capturing Updated Pages
If you care about capturing the contents of a page at different points in time, hash the whole contents of the file. That way, if anything within the page changes, the previous contents of the page aren't lost. In this case, I hash both the URL and the file contents and build the filename from the URL hash followed by the content hash. That way, all versions of a file sit next to each other when the directory is sorted.
for category in categories:
    data = line + "," + category.text
    hashed_url = hashlib.sha256()
    hashed_url.update(category['href'].encode('utf-8'))
    page = requests.get(category['href'])
    hashed_content = hashlib.sha256()
    hashed_content.update(page.text.encode('utf-8'))
    # filename is already a complete string: '<url hash>_<content hash>.html'
    filename = '{}_{}.html'.format(hashed_url.hexdigest(), hashed_content.hexdigest())
    with open(filename, 'w') as f:
        f.write(data + "\n")
        print(data)
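The naming scheme can also be pulled into a small helper and checked without any network access. This is a sketch only: the versioned_filename name and the sample URL and page bodies are made up for illustration.

```python
import hashlib


def versioned_filename(url, content):
    """Build '<sha256 of url>_<sha256 of content>.html' for one page snapshot."""
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return "{}_{}.html".format(url_hash, content_hash)


# Two hypothetical snapshots of the same URL with different contents.
first = versioned_filename("http://example.com/page", "<p>old</p>")
second = versioned_filename("http://example.com/page", "<p>new</p>")
```

Both names share the same URL-hash prefix, so the two snapshots sort next to each other, while the differing content hash keeps the second from overwriting the first.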

Python scraping encoding excel formulas

I'm trying to scrape websites to CSV and work with the data, but text formulas won't work properly. I don't really understand what I'm doing wrong, but my guess is the encoding part.
This is the Python part:
page = requests.get(url)
encoding = page.encoding if 'charset' in page.headers.get('content-type', '').lower() else None
soup = BeautifulSoup(page.content, 'html.parser', from_encoding=encoding)
example = soup.find(class_= htmlClass).get_text()
example = "".join([s for s in example.splitlines(True) if s.strip()])
example = example.splitlines()
outputList.append(example)
[...]
with open(outputFile, "w") as fileHandle:
    fileHandle.writelines(outputFileData)
The text in the CSV looks OK, but when I try to use MATCH formulas they often won't find the data. =MATCH("*13 MARCH*";F1:F20;0) gives N/A even though the text 13 MARCH is in the column.
I've made many changes and tests, and I noticed that when I use this:
with codecs.open(outputFile, "w", "utf-8") as fileHandle:
I get special characters in the CSV file, which probably explains why the MATCH formulas don't find the text.
If it helps, I actually import the CSV into Google Sheets via a script and then work with MATCH formulas. The script is:
function importFromCSV() {
  var file = DriveApp.getFilesByName("menulist.csv");
  var csvFile = file.next().getBlob().getDataAsString();
  var csvData = Utilities.parseCsv(csvFile, ";");
  var ss = SpreadsheetApp.openById("xxx");
  var sheet = ss.getSheetByName('import');
  sheet.getRange('A7:AZ60').clear()
  sheet.getRange(7,1, csvData.length, csvData[0].length).setValues(csvData);
}
I got garbled characters with the above and added var csvFile = file.next().getBlob().getDataAsString('ISO-8859-1'); to avoid them, but the MATCH formula still won't work.
Any idea what I'm doing wrong with the encoding?
Try this; I hope it solves your problem:
with codecs.open(outputFile, "w", "utf-8-sig") as fileHandle:
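For context, the only thing the "utf-8-sig" codec changes on write is that it puts a byte order mark (the bytes EF BB BF) at the start of the file, which some spreadsheet tools use to detect UTF-8. Whether that fixes the MATCH problem depends on how the importing side decodes the file, so treat the following as a sketch of the codec's behavior rather than a guaranteed fix. It uses the built-in open with an encoding argument, which accepts the same codec name in Python 3; the file path and sample row are made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "menulist.csv")

# Writing with utf-8-sig prepends the BOM before the first character.
with open(path, "w", encoding="utf-8-sig", newline="") as file_handle:
    file_handle.write("13 MARCH;soup\n")

# Raw bytes on disk: BOM first, then the UTF-8 encoded text.
with open(path, "rb") as raw:
    raw_bytes = raw.read()

# Reading back with utf-8-sig strips the BOM again.
with open(path, "r", encoding="utf-8-sig", newline="") as f:
    text = f.read()
```

A reader that decodes the same file as plain "utf-8" would keep the BOM as an invisible first character of the first cell, which is one way a text comparison on that cell can fail.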

Python Multi-threading scraping, write data in csv file

I use a multiprocessing pool to speed up scraping, and everything works, except that I don't understand why Python writes my CSV header every 30 rows. I know it is linked to the pool parameter I passed, but how can I correct this behavior?
def parse(url):
    dico = {i: '' for i in colonnes}
    r = requests.get("https://change.org" + url, headers=headers, timeout=10)
    # sleep(2)
    if r.status_code == 200:
        # I scrape my data here
        ...
        pprint(dico)
        writer.writerow(dico)
        return dico
with open(lang + '/petitions_' + lang + '.csv', 'a') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames= colonnes)
    writer.writeheader()
    with Pool(30) as p:
        p.map(parse, liens)
Can someone tell me where to put writer.writerow(dico) to avoid repeating the header?
Thanks
Check if the file exists:
os.path.isfile('mydirectory/myfile.csv')
If it exists, don't write the header again. Create one function (def ...) for the header and another for the data.
Looks like the "header" you are referring to comes from the writer.writeheader() line, not the writer.writerow() line.
Without a complete piece of your code, I can only assume that you have something like an outer loop that wraps around the with open block. So, every time your code enters the with block, a header line is printed, and then 30 lines of your scraped data (because of the pool size).
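One way to sidestep the question of which worker writes what is to let the workers only return rows and do all the writing in the parent, in one place. The sketch below is an illustration under assumptions, not the asker's code: parse is a stand-in that fabricates a row instead of scraping, the column names are hypothetical, and multiprocessing.pool.ThreadPool is swapped in for Pool(30) to keep the example self-contained (it offers the same map interface).

```python
import csv
import io
from multiprocessing.pool import ThreadPool

COLUMNS = ["url", "title"]  # hypothetical column names


def parse(url):
    """Stand-in for the scraping function: return one row, write nothing."""
    return {"url": url, "title": "petition " + url.rsplit("/", 1)[-1]}


def scrape_to_csv(urls, csvfile, workers=4):
    """Collect rows from the pool, then write header and rows from one place."""
    with ThreadPool(workers) as pool:
        rows = pool.map(parse, urls)  # map returns results in input order
    writer = csv.DictWriter(csvfile, fieldnames=COLUMNS)
    writer.writeheader()  # runs exactly once, in the parent
    writer.writerows(rows)
    return len(rows)


links = ["/p/%d" % n for n in range(5)]
out = io.StringIO()  # in-memory stand-in for the real CSV file
count = scrape_to_csv(links, out)
lines = out.getvalue().splitlines()
```

Because no worker ever touches the writer or the file handle, the header can only appear as many times as writeheader is called in the parent.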

How to prevent writing into txt file the same words using open(text.txt,a)?

I have a question about appending to a text file. I have written a script that reads a URL returning JSON, extracts the list of titles, and writes them into the file "WordsInCategory.text".
Since this code will be used in a loop, I used f1 = open('WordsInCategory.text', 'a').
But I ran into a problem: it adds titles that already exist in the file.
I'm having trouble coming up with a solution, and using 'w' would overwrite what has already been written.
My code is as follows:
import urllib2
import json
url1 ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtype=page&cmtitle=Category:Geography&cmlimit=100'
json_obj = urllib2.urlopen(url1)
data1 = json.load(json_obj)
f1 = open('WordsInCategory.text', 'a')
for item in data1['query']:
    for i in data1['query']['categorymembers']:
        f1.write((i['title']).encode('utf8')+"\n")
Please advise on how I should modify my code.
Thank you.
I would suggest saving every title in an array, before writing to a file (and hence writing only once to the given file). You can modify your code this way :
import urllib2
import json
data = []
f1 = open('WordsInCategory.text', 'w')
url1 ='https://en.wikipedia.org/w/api.php?\
action=query&format=json&list=categorymembers\
&cmtype=page&cmtitle=Category:Geography&cmlimit=100'
json_obj = urllib2.urlopen(url1)
data1 = json.load(json_obj)
for item in data1['query']:
    for i in data1['query']['categorymembers']:
        data.append(i['title'].encode('utf8')+"\n")
# Do additional requests, and append the new titles to the data array
f1.write(''.join(set(data)))
f1.close()
set allows me to delete any duplicate entry.
If keeping the titles in memory is a problem, you can check if the title already exists before writing it to the file, but it may be awfully time consuming :
import urllib2
import json
data = []
url1 ='https://en.wikipedia.org/w/api.php?\
action=query&format=json&list=categorymembers\
&cmtype=page&cmtitle=Category:Geography&cmlimit=100'
json_obj = urllib2.urlopen(url1)
data1 = json.load(json_obj)
for item in data1['query']:
    for i in data1['query']['categorymembers']:
        title = (i['title'].encode('utf8')+"\n")
        with open('WordsInCategory.text', 'r') as title_check:
            if title not in title_check:
                data.append(title)
with open('WordsInCategory.text', 'a') as f1:
    f1.write(''.join(set(data)))
# Handle additional requests
Hope it'll be helpful.
You can track the titles you added.
titles = []
and then add each title to the list when writing
if title not in titles:
    # write to file
    titles.append(title)
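The tracking idea can be combined with the existing file contents so that duplicates are skipped across runs as well. The following is a minimal Python 3 sketch; the append_new_titles helper and the sample titles are assumptions for illustration, and a set is used instead of a list because membership checks on a set stay fast as the number of titles grows.

```python
import os
import tempfile


def append_new_titles(path, titles):
    """Append only titles not already in the file; return the ones written."""
    seen = set()
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            seen = {line.rstrip("\n") for line in f}
    written = []
    with open(path, "a", encoding="utf-8") as f:
        for title in titles:
            if title not in seen:
                f.write(title + "\n")
                seen.add(title)  # also guards against repeats within this batch
                written.append(title)
    return written


path = os.path.join(tempfile.mkdtemp(), "WordsInCategory.text")
first_run = append_new_titles(path, ["Geography", "Maps", "Geography"])
second_run = append_new_titles(path, ["Maps", "Rivers"])
```

The file is read once per call rather than once per title, which avoids the repeated re-opening that makes the check-before-write variant slow.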

Scrapy Hanging on CSV Parse

I've looked quite a bit for a solution to this and maybe it's just an error on my end.
I'm using the python/scrapy framework to scrape a couple of sites. The sites provide me with .CSV files for their datasets. So instead of parsing the gridview they provide line by line, I've set scrapy to download the .CSV file, then open the CSV file and write each value as a Scrapy item.
I've got my concurrent requests set to 1. I figured this would stop Scrapy from parsing the next request in the list until it was finished adding all the items. Unfortunately, the bigger the .CSV file, the longer Scrapy takes to parse the rows and import them as items. It will usually do about half of a 500kb CSV file before it makes the next request.
logfiles = sorted([ f for f in os.listdir(logdir) if f.startswith('QuickSearch')])
logfiles = str(logfiles).replace("['", "").replace("']", "")
##look for downloaded CSV to open
try:
    with open(logfiles, 'rU') as f: ##open
        reader = csv.reader(f)
        for row in reader:
            item['state'] = 'Florida'
            item['county'] = county ##variable set from response.url
            item['saleDate'] = row[0]
            item['caseNumber'] = row[2]
            ..
            yield item
except:
    pass
f.close()
print ">>>>>>>>>>>F CLOSED"
os.remove(logfiles)
I need Scrapy to completely finish importing all the CSV values as items before moving on to the next request. Is there a way to accomplish this?
