I'm a beginner in Python, and I'm trying to extract data from the web and display it in a table:
# import libraries
import urllib2
from bs4 import BeautifulSoup
import csv
from datetime import datetime
quote_page = 'http://www.bloomberg.com/quote/SPX:IND'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('h1', attrs={'class': 'name'})
name = name_box.text.strip()
print name
price_box = soup.find('div', attrs={'class':'price'})
price = price_box.text
print price
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([name, price, datetime.now()])
This is very basic code that extracts data from Bloomberg and writes it to a CSV file.
It should put the name in one column, the price in a second, and the date in a third.
But it actually puts all of this data into the first row (see the attached result of the index.csv file).
Am I missing something with my code?
Thank you for your help !
Wikipedia:
In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
The problem isn't related to your Python code! Your script really does write a plain-text file with the fields separated by commas. It is your CSV file viewer which doesn't treat commas as separators; you should check its preferences.
It looks like your CSV isn't being interpreted correctly when you import it into Excel. When I imported it, I noticed that the comma in "2,337.58" is messing up the CSV data, putting the ,337.58" into a column of its own. When you import the data into Excel, you should get a popup asking how the data is represented: pick the delimited option, then select delimiter: comma, and click finish.
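For what it's worth, Python's csv module already quotes any field that contains a comma, so the file on disk is well-formed; it is the import step that has to honour the quoting. A quick check (the values are made up):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["S&P 500 Index", "2,337.58"])  # the price contains a comma

print(buf.getvalue().strip())
# S&P 500 Index,"2,337.58"
```

The second field comes out quoted, so any CSV reader that honours quoting will keep it in a single column.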
Related
I'm a beginner at programming. I'm trying to build a system like Readwise (it collects highlights from Kindle and sends a batch of them to your email) for myself as my first project. Right now I'm working on the part that extracts highlights from an HTML file exported from Kindle and writes them into an Excel file. I think I somehow managed the first part, but I get this error on the second part.
TypeError: Value must be a list, tuple, range or generator, or a dict. Supplied value is <class 'str'>
I believe this means that I can't write strings into the file with my code. Could you tell me what I can do here?
from bs4 import BeautifulSoup
from openpyxl import load_workbook
with open("test.html", "r", encoding="utf-8") as html_file:
    content = html_file.read()

soup = BeautifulSoup(content, "lxml")
note_tags = soup.find_all("div", class_="noteText")

for note in note_tags:
    highlights = note.text
    print(highlights)

wb = load_workbook('highlights.xlsx')
ws = wb.active
ws.append(highlights)
wb.save
I tried to use pandas instead, because as a next step I want to make sure it won't write duplicates, and that seems easier to do with pandas. But every time I run the script the Excel file gets corrupted and I get an "at least one sheet must be visible" error.
ws.append() expects the highlights to be a "list, tuple, range or generator, or a dict", as the error says, but in your example you give it a string. If you just want the string from highlights in the first column of your Excel file, you can turn it into a one-item list by putting highlights between square brackets: [highlights]. Adding a row to the Excel file then becomes ws.append([highlights]), so you are appending a list instead of a string.
But the other problem I see (I guess) is that you have several notes that you loop over, yet your current code only appends the last iteration of for note in note_tags: to the Excel file. I assume you want every note as a new ROW in Excel (not a new column), so you want to append() every iteration over note_tags to your Excel file.
from bs4 import BeautifulSoup
from openpyxl import load_workbook

with open("test.html", "r", encoding="utf-8") as html_file:
    content = html_file.read()

soup = BeautifulSoup(content, "lxml")

# Open Excel file
wb = load_workbook('highlights.xlsx')
ws = wb.active

note_tags = soup.find_all("div", class_="noteText")
for note in note_tags:
    highlights = note.text
    print(highlights)
    ws.append([highlights])  # Add line to a new row in Excel

wb.save('highlights.xlsx')
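On the pandas follow-up about duplicates: a minimal sketch, assuming the highlights are plain strings and that pandas (with openpyxl as the .xlsx engine) is installed, is to collect everything into a DataFrame, drop duplicates, and write the sheet in one go instead of appending row by row:

```python
import pandas as pd

# Hypothetical highlight list; in the real script this would come from note_tags
highlights = ["First highlight", "Second highlight", "First highlight"]

# Deduplicate, keeping the first occurrence of each highlight
df = pd.DataFrame({"highlight": highlights}).drop_duplicates()

# Rewrite the whole file each run instead of appending to an existing workbook
df.to_excel("highlights.xlsx", index=False)
```

Rewriting the file from scratch also avoids appending to a workbook that was left in a broken state, which is one way to end up with the "at least one sheet must be visible" error.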
I am currently using the below code to web scrape data and then store it in a CSV file.
from bs4 import BeautifulSoup
import requests
url='https://www.business-standard.com/rss/companies-101.rss'
soup = BeautifulSoup(requests.get(url).content, 'xml')
news_items = []
for item in soup.findAll('item'):
    news_item = {}
    news_item['title'] = item.title.text
    news_item['excerpt'] = item.description.text
    print(item.link.text)
    s = BeautifulSoup(requests.get(item.link.text).content, 'html.parser')
    news_item['text'] = s.select_one('.p-content').get_text(strip=True, separator=' ')
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text
    news_item['Category'] = 'Company'
    news_items.append(news_item)
import pandas as pd
df = pd.DataFrame(news_items)
df.to_csv('company_data.csv',index = False)
When displaying the data frame, the results look fine (as attached).
But when opening the CSV file, the columns are not as expected. Can anyone tell me the reason?
The issue is that your data contains commas, and the default separator for to_csv is ",", so each comma in your data set is treated as a separate column.
If you use df.to_excel('company_data.xlsx', index=False) instead, you won't have this issue, since an .xlsx file is not comma-separated.
A CSV file is not an Excel file, despite what Microsoft pretends, but a text file containing records and fields. The records are separated by sequences of \r\n and the fields are separated by a delimiter, normally the comma. Fields can contain newlines or delimiters provided they are enclosed in quotation marks.
But Excel is known to have a very poor CSV handling module; more exactly, it can read what it has written, or what is formatted the way it would have written it. To be locale-friendly, the MS folks decided to use the locale decimal separator (the comma , in some Western European locales) and, when the comma is the decimal separator, a different field separator (the semicolon ;). As a result, using Excel to read CSV files produced by other tools is a nightmare, with possible workarounds like changing the separator when writing the CSV file, but no clean way. LibreOffice is far behind MS Office for most features, except CSV handling, so my advice is to avoid using Excel for CSV files and use LibreOffice Calc instead.
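As a hedged sketch of the separator workaround mentioned above (the column names and values are made up), pandas lets you choose the delimiter when writing, and its default comma output already quotes fields that contain commas:

```python
import pandas as pd

df = pd.DataFrame({"title": ["Acme, Inc. results"], "price": ["2,337.58"]})

# Default output: comma-separated, with comma-containing fields quoted
out = df.to_csv(index=False)
print('"2,337.58"' in out)
# True

# Workaround for comma-decimal Excel locales: use a semicolon separator
df.to_csv("company_data.csv", index=False, sep=";")
```

Note that a semicolon file is no longer "CSV" in the strict sense, so any downstream reader has to be told about the separator too (e.g. pd.read_csv(..., sep=";")).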
I have extracted a list of text from a section of a website. Specifically, I scraped the 'experience' section of Linkedin and have extracted each work experience item within that section.
However, the data is in the form of a text list, and I am having issues formatting it as a csv file in the way that I want.
My relevant code is below:
import csv
from selenium import webdriver

ChromeOptions = webdriver.ChromeOptions()
driver = webdriver.Chrome('/Users/jones/Downloads/chromedriver')
driver.get('https://www.linkedin.com/in/pauljgarner/')

rows = []
# sel is a Scrapy/parsel Selector over the page source, built elsewhere
name = sel.xpath('normalize-space(//li[@class="inline t-24 t-black t-normal break-words"])').extract_first()
experience = driver.find_elements_by_xpath('//section[@id = "experience-section"]/ul//li')

rows.append([name])
for item in experience:
    rows[0].append(item.text)
    print(item.text)
    print("")

with open(parameters.file_name, 'w', encoding='utf8') as file:
    writer = csv.writer(file)
    writer.writerows(rows)
The Excel output I am getting from this code is below:
As you can see, it seems like a line break is separating each observation.
My desired excel output is below:
(Note that each text list has its own variable names; for example, Company Name is for the first text list and Company Name_2 for the second text list.)
I suspect that I need to find a way to specify in Python that a line break is a delimiter in each list of text. However, I am unsure of how to do this. Any help would be appreciated.
Disclosure: I posted a question on this same issue a few days ago, but I am posting a more specific question on delimiters because I haven't seen anything about specifying linebreaks as a delimiter in writing to csv with Python.
I think you need to split each element of rows on '\n'.
You also need to specify the headers to get the desired output. Note that each entry in rows is itself a list of text blocks, so split every block and write the flattened fields as one row:
headers = ['Name', 'Title', ... ]

with open(parameters.file_name, 'w', encoding='utf8') as file:
    writer = csv.writer(file)
    writer.writerow(headers)
    for row in rows:
        fields = []
        for block in row:
            fields.extend(block.split('\n'))
        writer.writerow(fields)
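As a minimal illustration of the splitting described above (the cell text is made up), a field whose text contains embedded newlines becomes several columns once split:

```python
import csv
import io

cell = "Company Name\nPaul Garner\nSoftware Engineer"

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(cell.split('\n'))  # one column per line of the original text

print(buf.getvalue().strip())
# Company Name,Paul Garner,Software Engineer
```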
When scraping data with BS and writing to a .csv file, something is causing a carriage return to be added before and after the text within each cell. Effectively, when you look at the cell there is a blank row, then the data, then another blank row, which causes Excel to wrap the data.
Has anyone ever seen this occur? I know about writerow inserting a new row between each cell, but I've never seen it inside the cells themselves.
Crude example: the stars indicate an empty row within the cell
(*************************)
Data Written From BS
(*************************)
import urllib2
from bs4 import BeautifulSoup
import csv
quote_page = "http://www.starrpartners.com.au/selling?&agentOfficeID=0&officeID=0&alternateagentID=6435&agentID=6435&status=current&disposalMethod=sold&orderBy=sold&page=2"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'lxml')
with open('index.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for band in soup.find_all('h5', attrs={'class': 'text-ltBlu'}):
        writer.writerow([band.text])
Side note: the HTML within the page has an empty row before and after; could it be as simple as that? And if so, is there something I can do to ask BS to ignore empty rows?
<h5 class="text-ltBlu">
<b>North Parramatta</b><br>
<span class="text-dkBlu">1/13 Brickfield Street</span>
</h5>
If anyone ever looks this up and needs an answer:
The secret was in the writing method; I also needed to strip out the erroneous whitespace.
Instead of:
writer.writerow([band.text])
We also need to add a call to strip():
writer.writerow([band.text.strip()])
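For illustration, note that strip() only removes leading and trailing whitespace; a newline inside the cell (between suburb and street in the h5 above) survives and would still wrap in Excel:

```python
text = "\n  North Parramatta\n1/13 Brickfield Street\n"

print(repr(text.strip()))
# 'North Parramatta\n1/13 Brickfield Street'
```

If you also want the inner newline collapsed into a single line, something like " ".join(band.text.split()) would do it.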
Hope this helps someone else in the future.
So I am quite the beginner in Python, but what I'm trying to do is download the CSV file for each NYSE symbol. In an Excel file I have every symbol, and the Yahoo API lets you download the CSV file by adding the symbol to the base URL.
My first instinct was to use pandas, but pandas doesn't store strings.
So what I have
import urllib
strs = ["" for x in range(3297)]
#strs makes the blank string array for each symbol
#somehow I need to be able to iterate each symbol into the blank array spots
while y < 3297:
    strs[y] = "symbol of each company from csv"
    y = y+1

#loop for downloading each file from the link with strs[y].
while i < 3297:
    N = urllib.URLopener()
    N.retrieve('http://ichart.yahoo.com/table.csv?s='strs[y],'File')
    i = i+1
Perhaps the solution is simpler than what I am doing.
From what I can see in this question, you can't see how to connect your list of stock symbols to how you read the data in pandas; e.g. 'http://ichart.yahoo.com/table.csv?s='strs[y] is not valid syntax.
Valid syntax for this is
pd.read_csv('http://ichart.yahoo.com/table.csv?s={}'.format(strs[y]))
It would be helpful if you could add a few sample lines from your csv file to the question. Guessing at your structure you would do something like:
import pandas as pd
symbol_df = pd.read_csv('path_to_csv_file')
for stock_symbol in symbol_df.symbol_column_name:
    df = pd.read_csv('http://ichart.yahoo.com/table.csv?s={}'.format(stock_symbol))
    # process your dataframe here
Assuming you take that Excel file w/ the symbols and output as a CSV, you can use Python's built-in CSV reader to parse it:
import csv
import urllib

base_url = 'http://ichart.yahoo.com/table.csv?s={}'
reader = csv.reader(open('symbols.csv'))
for row in reader:
    symbol = row[0]
    data_csv = urllib.urlopen(base_url.format(symbol)).read()
    # save out to file, or parse with CSV library...