Stupid question. I have made my first scraper/crawler. It gives me exactly what I want, but when I write it to a CSV file, the text appears with \n'] brackets. If I try to remove them in any way, it breaks my output in the CSV file.
Although the website is in Hebrew, that shouldn't be a problem. Just look at the CSV that you get.
Thanks in advance.
import csv
import requests
from bs4 import BeautifulSoup as bs
import io

url='https://www.maariv.co.il/news/politics'
source = requests.get(url).text
soup = bs(source, 'html.parser')

file = io.open('maariv7.csv', 'w', encoding="utf-16")
csv_writer = csv.writer(file, delimiter='|')
csv_writer.writerow(['Headline', 'Summary', 'Text', 'name'])
file.close()

def single_page_scraper(url):
    source = requests.get(url).text
    soup = bs(source, 'html.parser')
    file = io.open('maariv7.csv', 'a', encoding="utf-16")
    csv_writer = csv.writer(file, delimiter='|')
    for article in soup.find_all(class_='article-title'):
        headline = article.h1.text
        print (headline,'\n')
    for article in soup.find_all(class_='article-description'):
        summary = article.h2.text
        print(summary,'\n')
    text=[]
    name=[]
    for par in soup.find_all(class_='article-body'):
        text.append(par.get_text())
    print(text)
    politics = io.open('politicians.txt', 'r', encoding="utf-8")
    my_list=politics.read().splitlines()
    my_file=str(text)
    for i in my_list:
        if i in my_file:
            name.append(i)
    name_list = ", ".join(name)
    print(name_list,'\n''\n''\n''\n')
    csv_writer.writerow([headline, summary, my_file, name_list])
    file.close()

for articles in soup.find_all(class_='three-articles-in-row'):
    link = articles.a['href']
    single_page_scraper(link)
Check out Yibo Yang's answer at the bottom.
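Basically, try switching this line:
file = io.open('maariv7.csv', 'a', encoding="utf-16")
to this:
file = io.open('maariv7.csv', 'a', encoding="utf-16", newline='')
And see if it makes a difference. Note that newline='' is an argument of open(), not of csv.writer(), which raises a TypeError if you pass it there.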
So, inside of single_page_scraper I use
They are actually putting newlines in their text, so you should strip them right where you append the text: instead of text.append(par.get_text()), use text.append(par.get_text().strip()). To handle each child of the article body separately:
for par in soup.find(class_='article-body'):
    if isinstance(par, NavigableString):
        t = par.strip()
    else:
        t = par.text.strip()
    if t != '':
        text.append(t)
edit: you would have to from bs4 import NavigableString
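The `\n']` in the CSV most likely comes from `my_file = str(text)` in the question: calling str() on a list produces the list's repr, with brackets, quotes and escaped newlines included. A minimal sketch of joining the paragraphs instead, assuming they have already been collected into a list of strings (the helper name `paragraphs_to_cell` is made up for illustration):

```python
def paragraphs_to_cell(paragraphs):
    """Join scraped paragraphs into one clean string for a single CSV cell."""
    # Strip each paragraph and drop the ones that are empty after stripping.
    cleaned = [p.strip() for p in paragraphs if p.strip()]
    # Join with a space instead of calling str() on the list,
    # which would write the list repr (brackets, quotes, \n) into the cell.
    return " ".join(cleaned)


sample = ["first paragraph\n", "  \n", "second paragraph\n"]
print(str(sample))                 # the list repr that ended up in the CSV
print(paragraphs_to_cell(sample))  # first paragraph second paragraph
```

In the scraper this would replace `my_file = str(text)` with `my_file = paragraphs_to_cell(text)`.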
I am trying to search and replace certain words in my .xml file and replace it with another, but I struggle a bit.
I have been using this code so far:
import xml.etree.ElementTree as ET

with open('Rom1.xml', encoding="utf8") as f:
    tree = ET.parse(f)
    #root = tree.find('ExportedObjects')
    root = tree.getroot()

    for elem in root.iter():
        try:
            elem.text = elem.text.replace('Rom1', 'Rom2')
        except AttributeError:
            pass
Rom1.xml: this is a snapshot from the XML file showing the structure.
The XML file is pretty big, but it contains the string 'Rom1' 41 times and I would like to replace all of them.
I know a simple search and replace in a text editor does the job, but I want to automate this since I will do it for several hundred files.
Any help is appreciated :)
If there is no possibility of ambiguity then you could just do this:
with open('Rom1.xml', encoding='utf-8', mode='r+') as xml:
    content = xml.read().replace('Rom1', 'Rom2')
    xml.seek(0)
    xml.write(content)
    xml.truncate()
In this case the truncate() call is not necessary, because both strings have the same length. However, if the second argument to replace() were shorter than the first, it would be crucial. Just leave it there to account for all eventualities.
Ok so I tried something else with great success:
import xml.etree.ElementTree as ET

Rom2 = input('Number: ')
input_file = "Rom1.xml"
output_file = Rom2+".xml"

with open(input_file) as f:
    xml_content = f.readlines()

with open(output_file, 'w+') as f:
    for line in xml_content:
        f.write(line.replace('Rom1', Rom2))
But if I also want to replace a second string, e.g. 'SQ4XXX' with 'SQ4050', then it replaces both and keeps the old ones as well? I'm confused.
import xml.etree.ElementTree as ET

Rom2 = input('Number: ')
sq = input('SQ: ')
input_file = "Rom1.xml"
output_file = Rom2+".xml"

with open(input_file) as f:
    xml_content = f.readlines()

with open(output_file, 'w+') as f:
    for line in xml_content:
        f.write(line.replace('Rom1', Rom2))
        f.write(line.replace('SQ4XXX', sq))
Ok I got it working like I wanted, thanks for the help guys!
Here's the final code:
import xml.etree.ElementTree as ET

Rom2 = input('Number: ')
sq4 = input('SQ4: ')
sq5 = input('SQ5: ')
input_file = "Rom1.xml"
output_file = Rom2+".xml"

with open(input_file) as f:
    xml_content = f.readlines()

with open(output_file, 'w+') as f:
    for line in xml_content:
        f.write(line.replace('Rom1', Rom2))

with open(output_file, encoding='utf-8', mode='r+') as xml:
    content = xml.read().replace('SQ4XXX', sq4)
    xml.seek(0)
    xml.write(content)
    xml.truncate()

with open(output_file, encoding='utf-8', mode='r+') as xml:
    content = xml.read().replace('SQ5XXX', sq5)
    xml.seek(0)
    xml.write(content)
    xml.truncate()
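The duplicated output in the earlier attempt comes from calling f.write() twice per line, so every input line is written once for each replacement. A sketch that applies all replacements to the text first and then writes once, assuming plain string substitution is safe for these files; the names `replace_all` and `convert_file` and the sample string are made up for illustration:

```python
from pathlib import Path


def replace_all(text, replacements):
    """Apply every old -> new substitution to text, in dictionary order."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


def convert_file(input_file, output_file, replacements):
    """Read input_file, apply the replacements, and write the result once."""
    content = Path(input_file).read_text(encoding="utf-8")
    Path(output_file).write_text(replace_all(content, replacements), encoding="utf-8")


# Example values standing in for the interactive input() calls above.
example = "<Room name='Rom1' sq4='SQ4XXX' sq5='SQ5XXX'/>"
print(replace_all(example, {"Rom1": "Rom2", "SQ4XXX": "SQ4050", "SQ5XXX": "SQ5060"}))
```

For several hundred files, convert_file could be called in a loop over the file names, with a different replacements dictionary per file.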
I'm trying to figure out what will be the next step to convert my webscrape to CSV.
I've tried putting every column into individual lists, but I feel like this is not the solution.
from bs4 import BeautifulSoup
import requests

url = 'https://www.pro-football-reference.com/years/2018/passing.htm'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

for row in tb.find_all('tr'):
    i = row.get_text()
    print(i)
This should work
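import csv #quite crucial

final_table = []
# tb is the table element from the question (its definition is not shown there)
for row in tb.find_all('tr'):  # the method is find_all, not findall
    next_line = row.get_text()
    final_table.append([next_line])

with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(final_table)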
Use the csv module. We'll grab the headers with soup.find("tr").find_all("th"), then loop over the body and write it to the text file. The first cell of each row is a <th>, so we need to handle that separately and prepend it to the <td> data. Note that the staggered headers every 30 lines are omitted.
import csv
import requests
from bs4 import BeautifulSoup

url = "https://www.pro-football-reference.com/years/2018/passing.htm"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

with open("output.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow([x.get_text() for x in soup.find("tr").find_all("th")])

    for row in soup.find_all("tr"):
        data = [row.find("th").get_text()] + [x.get_text() for x in row.find_all("td")]
        if data:
            writer.writerow(data)
Output (just the top few rows):
Rk,Player,Tm,Age,Pos,G,GS,QBrec,Cmp,Att,Cmp%,Yds,TD,TD%,Int,Int%,Lng,Y/A,AY/A,Y/C,Y/G,Rate,QBR,Sk,Yds,NY/A,ANY/A,Sk%,4QC,GWD
1,Ben Roethlisberger,PIT,36,QB,16,16,9-6-1,452,675,67.0,5129,34,5.0,16,2.4,97,7.6,7.5,11.3,320.6,96.5,71.0,24,166,7.10,7.04,3.4,2,3
2,Andrew Luck*,IND,29,QB,16,16,10-6-0,430,639,67.3,4593,39,6.1,15,2.3,68,7.2,7.4,10.7,287.1,98.7,69.4,18,134,6.79,6.95,2.7,3,3
3,Matt Ryan,ATL,33,QB,16,16,7-9-0,422,608,69.4,4924,35,5.8,7,1.2,75,8.1,8.7,11.7,307.8,108.1,68.5,42,296,7.12,7.71,6.5,1,1
4,Kirk Cousins,MIN,30,QB,16,16,8-7-1,425,606,70.1,4298,30,5.0,10,1.7,75,7.1,7.3,10.1,268.6,99.7,58.2,40,262,6.25,6.48,6.2,1,0
Check this thread if you see extra newlines in the CSV result on Windows.
I am crawling data from Wikipedia and it works so far. I can display it on the terminal, but I can't write it the way I need it into a csv file :-/
The code is pretty long, but I paste it here anyway and hope that somebody can help me.
import csv
import requests
from bs4 import BeautifulSoup


def spider():
    url = 'https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9F-_und_Mittelst%C3%A4dte_in_Deutschland'
    code = requests.get(url).text # Read source code and make unicode
    soup = BeautifulSoup(code, "lxml") # create BS object

    table = soup.find(text="Rang").find_parent("table")
    for row in table.find_all("tr")[1:]:
        partial_url = row.find_all('a')[0].attrs['href']
        full_url = "https://de.wikipedia.org" + partial_url
        get_single_item_data(full_url) # goes into the individual sites


def get_single_item_data(item_url):
    page = requests.get(item_url).text # Read source code & format with .text to unicode
    soup = BeautifulSoup(page, "lxml") # create BS object

    def getInfoBoxBasisDaten(s):
        return str(s) == 'Basisdaten' and s.parent.name == 'th'

    basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

    basisdaten_list = ['Bundesland', 'Regierungsbezirk:', 'Höhe:', 'Fläche:', 'Einwohner:', 'Bevölkerungsdichte:',
                       'Postleitzahl', 'Vorwahl:', 'Kfz-Kennzeichen:', 'Gemeindeschlüssel:', 'Stadtgliederung:',
                       'Adresse', 'Anschrift', 'Webpräsenz:', 'Website:', 'Bürgermeister', 'Bürgermeisterin',
                       'Oberbürgermeister', 'Oberbürgermeisterin']

    with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Bundesland', 'Regierungsbezirk:', 'Höhe:', 'Fläche:', 'Einwohner:', 'Bevölkerungsdichte:',
                      'Postleitzahl', 'Vorwahl:', 'Kfz-Kennzeichen:', 'Gemeindeschlüssel:', 'Stadtgliederung:',
                      'Adresse', 'Anschrift', 'Webpräsenz:', 'Website:', 'Bürgermeister', 'Bürgermeisterin',
                      'Oberbürgermeister', 'Oberbürgermeisterin']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=';', quotechar='|', quoting=csv.QUOTE_MINIMAL, extrasaction='ignore')
        writer.writeheader()

        for i in basisdaten_list:
            wanted = i
            current = basisdaten.parent.parent.nextSibling
            while True:
                if not current.name:
                    current = current.nextSibling
                    continue
                if wanted in current.text:
                    items = current.findAll('td')
                    print(BeautifulSoup.get_text(items[0]))
                    print(BeautifulSoup.get_text(items[1]))
                    writer.writerow({i: BeautifulSoup.get_text(items[1])})
                if '<th ' in str(current): break
                current = current.nextSibling


print(spider())
The output is incorrect in two ways: the cells are not in their right places, and only one city is written while all the others are missing. It looks like this:
But it should look like this + all other cities in it:
'... only one city is written ...': You call get_single_item_data for each city. Then inside this function you open the output file with the same name, in the statement with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile: which will overwrite the output file each time you call the function.
Each variable is written to a new row: In the statement writer.writerow({i: BeautifulSoup.get_text(items[1])}) you write the value for one variable to a row. What you need to do instead is to make a dictionary for values before you start looking for page values. As you accumulate the values from the page you shove them into the dictionary by field name. Then after you have found all of the values available you call writer.writerow.
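The two fixes above can be sketched as follows: the file is opened once, each city is collected into one dictionary keyed by field name, and writerow is called once per city. This is only an illustration under assumptions: the function name `write_cities`, the shortened field list and the sample values are made up, and an in-memory buffer stands in for staedte.csv.

```python
import csv
import io

FIELDNAMES = ["Bundesland", "Höhe:", "Einwohner:"]  # shortened list for illustration


def write_cities(csvfile, cities):
    """Write one row per city; each city is a dict keyed by field name."""
    writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES, delimiter=";",
                            extrasaction="ignore")
    writer.writeheader()       # the header is written once, not once per city
    for city in cities:
        writer.writerow(city)  # one complete row per city


# Made-up values standing in for what get_single_item_data would collect.
cities = [
    {"Bundesland": "A", "Höhe:": "10 m", "Einwohner:": "1000"},
    {"Bundesland": "B", "Einwohner:": "2000"},  # missing fields stay empty
]

# In real use: open('staedte.csv', 'w', newline='', encoding='utf-8')
buffer = io.StringIO()
write_cities(buffer, cities)
print(buffer.getvalue())
```

In the original script this means get_single_item_data would return the dictionary for one city, and spider would own the file and the writer.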
I have a bunch of web text that I'd like to scrape and export to a csv file. The problem is that the text is split over multiple lines on the website and that's how beautifulsoup reads it. When I export to csv, all the text goes into one cell but the cell has multiple lines of text. When I try to read the csv into another program, it interprets the multiple lines in a way that yields a nonsensical dataset. The question is, how do I put all the text into a single line after I pull it with beautifulsoup but before I export to csv?
Here's a simple working example demonstrating the problem of multiple lines (in fact, the first few lines in the resulting csv are blank, so at first glance it may look empty):
import csv
import requests
from bs4 import BeautifulSoup

def main():
    r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
    soup = BeautifulSoup(r.text,"html.parser")
    with open('Temp.csv', 'w', encoding='utf8', newline='') as f:
        writer = csv.writer(f,delimiter=",")
        abstract=soup.find("article").text
        writer.writerow([abstract])

if __name__ == '__main__':
    main()
UPDATE: there have been some good suggestions, but it's still not working. The following code still produces a csv file with line breaks in a cell:
import csv
import requests
from bs4 import BeautifulSoup

with open('Temp.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f,delimiter=',')
    r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
    soup = BeautifulSoup(r.text,'lxml')
    find_article = soup.find('article')
    find_2para = find_article.p.find_next_sibling("p")
    find_largetxt = find_article.p.find_next_sibling("p").nextSibling
    writer.writerow([find_2para,find_largetxt])
Here's another attempt based on a different suggestion. This one also ends up producing a line break in the csv file:
import csv
import requests
from bs4 import BeautifulSoup

def main():
    r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
    soup = BeautifulSoup(r.text,"html.parser")
    with open('Temp.csv', 'w', encoding='utf8', newline='') as f:
        writer = csv.writer(f,delimiter=",")
        abstract=soup.find("article").get_text(separator=" ", strip=True)
        writer.writerow([abstract])

if __name__ == '__main__':
    main()
Change your abstract = ... line into:
abstract = soup.find("article").get_text(separator=" ", strip=True)
It'll join the pieces of text using the separator parameter (in this case, it'll separate the strings with a single space).
The solution that ended up working for me is pretty simple:
abstract=soup.find("article").text.replace("\t", "").replace("\r", "").replace("\n", "")
That gets rid of all line breaks.
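One caveat with the chained replace() calls is that they remove the line breaks without leaving a space behind, so words on adjacent lines can run together. Also, get_text(strip=True) only trims each piece of text and leaves line breaks inside a piece untouched, which may be why the earlier attempt still produced them. A small alternative sketch that collapses every run of whitespace into one space (the helper name `collapse_whitespace` and the sample string are made up for illustration):

```python
def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    # str.split() with no arguments splits on any whitespace run
    # and discards leading and trailing whitespace.
    return " ".join(text.split())


raw = "\n\n  We present a model\r\n\tof the relationship\n"
print(collapse_whitespace(raw))  # We present a model of the relationship
```

In the question's code this would be used as `abstract = collapse_whitespace(soup.find("article").text)`.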
r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
soup = BeautifulSoup(r.text,'lxml') # I prefer using the lxml parser
find_article = soup.find('article')
# The next line finds the title, in this case: Econometrica: Mar 2017, Volume 85, Issue 2
find_title = find_article.h3
# find the "Search for Yield" heading
find_yeild = find_article.h1
# first paragraph, example: DOI: 10.3982/ECTA14057 p. 351-378
find_1para = find_article.p
# second paragraph, example: David Martinez‐Miera, Rafael Repullo
find_2para = find_article.p.find_next_sibling("p")
# find the large text area, e.g. 'We present a model of the relationship bet...'
find_largetxt = find_article.p.find_next_sibling("p").nextSibling
I used a variety of methods of getting to the text area you want, just for the purpose of education (you can use .text on each of these to get the text without tags, or you can use Zroq's method).
You can then write each one of these into the file, for example with the line below. Note that writerow expects a list of fields; passing a bare string would write every character into its own cell.
writer.writerow([find_title.text])
I have data which is being accessed via http request and is sent back by the server in a comma separated format, I have the following code :
site= 'www.example.com'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
soup = soup.get_text()
text=str(soup)
The content of text is as follows:
april,2,5,7
may,3,5,8
june,4,7,3
july,5,6,9
How can I save this data into a CSV file.
I know I can do something along the lines of the following to iterate line by line:
import StringIO
s = StringIO.StringIO(text)
for line in s:
But I'm unsure how to properly write each line to a CSV file.
EDIT: Thanks for the feedback. As suggested, the solution was rather simple and can be seen below.
Solution:
import StringIO
s = StringIO.StringIO(text)
with open('fileName.csv', 'w') as f:
    for line in s:
        f.write(line)
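The code above is Python 2 (urllib2 and the StringIO module). A rough Python 3 equivalent, assuming `text` holds the comma-separated data shown in the question; the function name `save_csv_text` is made up for illustration. Passing the text through csv.reader and csv.writer keeps quoting consistent if a field ever contains a comma:

```python
import csv
import io

text = "april,2,5,7\nmay,3,5,8\njune,4,7,3\njuly,5,6,9\n"


def save_csv_text(text, path):
    """Parse comma-separated text and write it back out as a CSV file."""
    rows = csv.reader(io.StringIO(text))
    # newline='' lets the csv module control the line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


save_csv_text(text, "fileName.csv")
```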
General way:
##text=List of strings to be written to file
# 'wb' is the Python 2 idiom; in Python 3 open the file with 'w' instead
with open('csvfile.csv','wb') as file:
    for line in text:
        file.write(line)
        file.write('\n')
OR
Using CSV writer :
import csv
# 'wb' is the Python 2 idiom; in Python 3 use open(..., "w", newline="")
with open(<path to output_csv>, "wb") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in data:
        writer.writerow(line)
OR
Simplest way:
f = open('csvfile.csv','w')
f.write('hi there\n') #Give your csv text here.
## Python will convert \n to os.linesep
f.close()
You could just write to the file as you would write any normal file.
with open('csvfile.csv','wb') as file:
    for l in text:
        file.write(l)
        file.write('\n')
If it happens to be a list of lists, you could use the built-in csv module directly:
import csv
with open("csvfile.csv", "wb") as file:
    writer = csv.writer(file)
    writer.writerows(text)
I would simply write each line to a file, since it's already in a CSV format:
write_file = "output.csv"
with open(write_file, "wt", encoding="utf-8") as output:
    for line in text:
        output.write(line + '\n')
I can't recall how to write lines with line-breaks at the moment, though :p
Also, you might like to take a look at this answer about write(), writelines(), and '\n'.
To complement the previous answers, I whipped up a quick class to write to CSV files. It makes it easier to manage and close open files and achieve consistency and cleaner code if you have to deal with multiple files.
import csv
import os

class CSVWriter():

    filename = None
    fp = None
    writer = None

    def __init__(self, filename):
        self.filename = filename
        self.fp = open(self.filename, 'w', encoding='utf8')
        self.writer = csv.writer(self.fp, delimiter=';', quotechar='"', quoting=csv.QUOTE_ALL, lineterminator='\n')

    def close(self):
        self.fp.close()

    def write(self, elems):
        self.writer.writerow(elems)

    def size(self):
        return os.path.getsize(self.filename)

    def fname(self):
        return self.filename
Example usage:
mycsv = CSVWriter('/tmp/test.csv')
mycsv.write((12,'green','apples'))
mycsv.write((7,'yellow','bananas'))
mycsv.close()
print("Written %d bytes to %s" % (mycsv.size(), mycsv.fname()))
Have fun
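A possible variation on the class above, sketched here as an assumption rather than part of the original answer: adding `__enter__` and `__exit__` lets it be used in a with statement, so the file is closed even if writing raises. The class name `ManagedCSVWriter` and the file name are made up for illustration.

```python
import csv
import os


class ManagedCSVWriter:
    """CSV writer usable in a with statement; the file closes on exit."""

    def __init__(self, filename):
        self.filename = filename
        self.fp = open(filename, "w", encoding="utf8", newline="")
        self.writer = csv.writer(self.fp, delimiter=";", quoting=csv.QUOTE_ALL,
                                 lineterminator="\n")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.fp.close()  # runs even if the with body raises

    def write(self, elems):
        self.writer.writerow(elems)


with ManagedCSVWriter("test_managed.csv") as out:
    out.write((12, "green", "apples"))
    out.write((7, "yellow", "bananas"))

print("Written %d bytes" % os.path.getsize("test_managed.csv"))
```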
What about this:
with open("your_csv_file.csv", "w") as f:
    f.write("\n".join(text))
str.join() Return a string which is the concatenation of the strings in iterable.
The separator between elements is
the string providing this method.
In my situation...
with open('UPRN.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('Name', 'UPRN','ADMIN_AREA','TOWN','STREET','NAME_NUMBER'))
    writer.writerows(lines)
You need to pass newline='' to open() and it will work.
https://www.programiz.com/python-programming/writing-csv-files
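To see what newline='' changes, it can help to look at the raw bytes. The csv module ends each row with \r\n by default; without newline='', text mode on Windows translates the \n again, which shows up as blank lines between rows. A small sketch with made-up sample rows and a made-up file name:

```python
import csv

rows = [("Name", "UPRN"), ("Example House", "12345")]  # made-up sample rows

with open("newline_demo.csv", "w", newline="") as out_file:
    csv.writer(out_file).writerows(rows)

# Inspect the raw bytes: each row ends in exactly one \r\n.
with open("newline_demo.csv", "rb") as f:
    raw = f.read()
print(raw)
```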