I'm a beginner at programming. I'm trying to make a system like Readwise (it collects highlights from Kindle and sends a bunch of highlights to your email) for myself as my first project. Right now I'm working on the part where I extract highlights from an HTML file exported from Kindle and write them into an Excel file. I think I somehow managed to do the first part, but I get this error on the second part.
TypeError: Value must be a list, tuple, range or generator, or a dict. Supplied value is <class 'str'>
I believe this means that I can't write strings into the file with my code. Could you tell me what I can do here?
from bs4 import BeautifulSoup
from openpyxl import load_workbook
with open("test.html", "r", encoding="utf-8") as html_file:
    content = html_file.read()
soup = BeautifulSoup(content, "lxml")
note_tags = soup.find_all("div", class_="noteText")
for note in note_tags:
    highlights = note.text
    print(highlights)
wb = load_workbook('highlights.xlsx')
ws = wb.active
ws.append(highlights)
wb.save
I tried to use Pandas instead because, as the next step, I want to make sure that it won't write duplicates, and that seems easier to do with Pandas. But every time I ran the script the Excel file got corrupted and I got an "at least one sheet must be visible" error.
ws.append() expects highlights to be a "list, tuple, range or generator, or a dict", as the error says, but in your example you give it a string. If you just want the string from highlights in the first column of your Excel file, you can make it into a list (with just one item) by putting highlights between square brackets: [highlights]. Adding a row to the Excel file then becomes ws.append([highlights]). Now you are appending a list instead of a string.
The other problem I see is that you loop over several notes, but your current code only appends the value from the last iteration of for note in note_tags: to the Excel file. I assume you want every note as a new row in Excel (not a new column), so you need to call append() on every iteration over note_tags. Also note that wb.save has to be called with the filename, as in wb.save('highlights.xlsx'); without the parentheses nothing is saved.
from bs4 import BeautifulSoup
from openpyxl import load_workbook
with open("test.html", "r", encoding="utf-8") as html_file:
    content = html_file.read()
soup = BeautifulSoup(content, "lxml")
# Open Excel file
wb = load_workbook('highlights.xlsx')
ws = wb.active
note_tags = soup.find_all("div", class_="noteText")
for note in note_tags:
    highlights = note.text
    print(highlights)
    ws.append([highlights])  # Add line to a new row in Excel
wb.save('highlights.xlsx')
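For the next step mentioned in the question (not writing duplicates), one possible approach that avoids Pandas is to keep a set of the texts already stored and only append the ones that are new. The following is only a minimal sketch: filter_new_highlights is a hypothetical helper name, and the two lists are stand-ins for the values already in the first column of the sheet and for the note.text values from the loop.

```python
def filter_new_highlights(existing_rows, candidates):
    """Return candidates that are not stored yet, in order and without repeats."""
    seen = {text.strip() for text in existing_rows if text}
    fresh = []
    for text in candidates:
        cleaned = text.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            fresh.append(cleaned)
    return fresh


# Stand-ins (assumptions): values already saved, and freshly scraped highlights.
already_saved = ["First highlight", "Second highlight"]
scraped = ["Second highlight", "Third highlight\n", "Third highlight"]

for highlight in filter_new_highlights(already_saved, scraped):
    # Each one-item list is what would be passed to ws.append().
    print([highlight])
```

In the real script the existing values would have to be read from the worksheet first, and each returned text would be passed to ws.append([highlight]) before saving.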
Dear all, I often need to concatenate CSV files with identical headers (i.e. put them into one big file). Usually I just use pandas, but I now need to operate in an environment where I am not at liberty to install any library. The csv and the html libs do exist.
I also need to remove all remaining HTML tags/encoded symbols, like &amp; for the ampersand symbol, which are still within the data. I do know in which columns these can come up.
I thought about doing it like this, and the concat part of my code seems to work fine:
import csv
import html
for file in files:  # files is a list of csv files.
    with open(file, "rt", encoding="utf-8-sig") as source, open(outfilePath, "at", newline='',
                                                                encoding='utf-8-sig') as result:
        d_reader = csv.DictReader(source, delimiter=";")
        # Set header based on first file in file_list:
        if file == test_files[0]:
            Common_header = d_reader.fieldnames
        # Define DictWriter object
        wtr = csv.DictWriter(result, fieldnames=Common_header, lineterminator='\n', delimiter=";")
        # Write header only once to empty file
        if result.tell() == 0:
            wtr.writeheader()
        # If I remove this block I get my concatenated single csv file as a result.
        # However, the html tags/encoded symbols are still present.
        for row in d_reader:
            print(html.unescape(row['ColA']))  # This prints the unescaped values in the column correctly
            # If I keep these two lines, I get an empty file with just the header as a result of the concatenation
            row['ColA'] = html.unescape(row['ColA'])
            row['ColB'] = html.unescape(row['ColB'])
        wtr.writerows(d_reader)
I would have thought that simply supplying the encoding='utf-8-sig' part to the result file would be sufficient to get rid of the HTML symbols, but that does not work. If you could give me a hint about what I am doing wrong in my use of the html.unescape function, that would be nice.
Thank you in advance
I am working on a script for reading specific cells from an Excel workbook into a list, and then from the list into a CSV. There's also a loop to open the workbooks in a folder.
My code:
import csv
import openpyxl
import os
path = r'C:\Users.....' # Folder holding workbooks
workbooks = os.listdir(path)
cell_values = [] # List for storing cell values from worksheets
for workbook in workbooks: # Workbook iteration
    wb = openpyxl.load_workbook(os.path.join(path, workbook), data_only=True)  # Open workbook
    sheet = wb.active  # Get sheet
    f = open('../record.csv', 'w', newline='')  # Open the CSV file
    cell_list = ["I9", "AK6", "N35"]  # List of cells to check
    with f:  # CSV writer loop
        record_writer = csv.writer(f)  # Open CSV writer
        for cells in cell_list:  # Loop through cell list to get cell values and write them to the cell_values list
            cell_values.append(sheet[cells].value)  # Append cell values to the cell_values list
        record_writer.writerow(cell_values)  # Write cell_values list to CSV
quit() # Terminate program after all workbooks in the folder have been analyzed
The output just puts all values on the same line, albeit separated by commas, but it doesn't help me when I go to open my results in Excel if everything is on the same line. When I was using xlrd, the format was vertical but all I had to do was transpose the dataset to be good. But I had to change from xlrd (which was a smart move in general) because it would not read merged cells.
I get this:
4083940,140-21-541,NP,8847060,140-21-736,NP
When I want this
4083940,140-21-541,NP
8847060,140-21-736,NP
Edit - I forgot the "what have I tried" portion of my post. I have tried changing my loops around to avoid overwriting the previous write to the CSV. I have tried clearing the list on each loop to get the script to treat each new entry as a new line. I have tried adding \n in the writer line as I saw in a couple of posts. I have tried to use writerows instead of writerow. I tried opening the file with 'a' instead of 'w', even though it is a workaround and not a solution, but that didn't quite work right either.
Your main problem is that cell_values is accumulating the cells from multiple sheets. You need to reset it (cell_values = []) for every sheet.
I went back to your original example and:
- moved the opening of record.csv up, and placed all the work inside the scope of that file being open and written into
- moved cell_values = [] inside your workbook loop
- moved cell_list = ["I9", "AK6", "N35"] to the top, because that's really scoped for the entire script, if every workbook has the same cells
- removed quit(); it's not necessary at the very end of the script, and in general should probably be avoided: Python exit commands - why so many and when should each be used?
import csv
import openpyxl
import os
path = r'C:\Users.....' # Folder holding workbooks
workbooks = os.listdir(path)
cell_list = ["I9", "AK6", "N35"] # List of cells to check
with open('record.csv', 'w', newline='') as f:
    record_writer = csv.writer(f)
    for workbook in workbooks:
        wb = openpyxl.load_workbook(os.path.join(path, workbook), data_only=True)
        sheet = wb.active
        cell_values = []  # reset for every sheet
        for cells in cell_list:
            cell_values.append(sheet[cells].value)
        # Write one row per sheet
        record_writer.writerow(cell_values)
Also, I can see you're new to the csv module and struggling a little conceptually (since you tried writerow, then writerows, while debugging your code). Python's official documentation for csv doesn't really give practical examples of how to use it. Try reading up here: Writing to a CSV.
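As a rough illustration of the difference between writerow and writerows, here is a minimal sketch that writes to an in-memory buffer instead of record.csv. The values are the ones from the sample output in the question, with one inner list standing in for each workbook.

```python
import csv
import io

# One inner list per workbook; values taken from the question's sample output.
rows = [
    ["4083940", "140-21-541", "NP"],
    ["8847060", "140-21-736", "NP"],
]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(rows[0])    # one call writes exactly one line
writer.writerows(rows[1:])  # writes one line per inner list

csv_text = buffer.getvalue()
print(csv_text)
```

Each call to writerow produces one line, which is why the list passed to it has to hold the values of a single workbook only.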
I'm getting an error when opening an Excel file after writing to it. This is what I have so far:
#locate source document
Path = Path(r'C:\Users\username\Test\EXCEL_Test.xlsx')
# open doc and go to active sheet
wb = load_workbook(filename = Path)
ws = wb.active
#add drop down list to each cell in a certain column
dv_v = DataValidation(type="list", formula1='"Y,N"', allow_blank=True)
for cell in ws['D']:
    cell = ws.add_data_validation(dv_v)
wb.save(Path)
And these are the two errors that come up on opening the Excel file:
First error popup:
"We found a problem with some content in 'EXCEL_Test.xlsx'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes."
Second error popup:
"Repaired Part: /xl/worksheets/sheet1.xml part with XML error. HRESULT 0x8000ffff Line 1, column 0."
My data validation is not showing up, and the file shows the above errors when I open it to view the openpyxl changes.
Can someone help me find out why these errors are popping up, even though Python finishes with exit code 0, and why the data validation comes up blank in the recovered file?
I think you are using ws.add_data_validation(dv) incorrectly. The data validation object gets added to the worksheet once, and then the cells or ranges it should apply to get added to the dv.
Try doing it like this.
import openpyxl
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation
#locate source document
Path = 'C:/Users/username/test/Excel_test.xlsx'
# open doc and go to active sheet
wb = openpyxl.load_workbook(filename = Path)
ws = wb['Sheet1']
#add drop down list to each cell in a certain column
dv = DataValidation(type="list", formula1='"Y,N"', allow_blank=True)
ws.add_data_validation(dv)
# This is the same as for the whole of column D
dv.add('D1:D1048576')
wb.save(Path)
Take a look at the Docs here: https://openpyxl.readthedocs.io/en/stable/validation.html
When scraping data with BS and writing it to a .csv file, something is adding a carriage return before and after the text within each cell. Effectively, when you look at the cell, there is a blank row, then the data, then another blank row, which causes Excel to wrap the data.
Has anyone ever seen this occur? I know about writerow inserting a new row between each cell, but I have never seen it inside the cells themselves.
Crude example (the stars indicate an empty row within the cell):
(*************************)
Data Written From BS
(*************************)
import urllib2
from bs4 import BeautifulSoup
import csv
quote_page = "http://www.starrpartners.com.au/selling?&agentOfficeID=0&officeID=0&alternateagentID=6435&agentID=6435&status=current&disposalMethod=sold&orderBy=sold&page=2"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'lxml')
with open('index.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for band in soup.find_all('h5', attrs={'class': 'text-ltBlu'}):
        writer.writerow([band.text])
Side note: the HTML within the page has an empty row before and after the text. Could it be as simple as that? And if so, is there something I can do to ask BS to ignore empty rows?
<h5 class="text-ltBlu">
<b>North Parramatta</b><br>
<span class="text-dkBlu">1/13 Brickfield Street</span>
</h5>
If anyone ever looks this up and needs an answer:
The secret was in the writing method: I also needed to strip out the surrounding whitespace (the newlines before and after the text).
Instead of:
writer.writerow([band.text])
We also need to call strip():
writer.writerow([band.text.strip()])
Hope this helps someone else in the future.
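To see what strip() changes, here is a minimal sketch using only the csv module. The raw_text string is an assumed stand-in for what band.text returns for the h5 snippet in the question, and write_cell is a hypothetical helper that returns the CSV text for a single-cell row.

```python
import csv
import io

# Assumed stand-in for band.text from the <h5> snippet in the question.
raw_text = "\nNorth Parramatta\n1/13 Brickfield Street\n"


def write_cell(value):
    """Write a single-cell row and return the resulting CSV text."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerow([value])
    return buffer.getvalue()


# Unstripped: the cell starts and ends with a newline (the "blank rows").
print(repr(write_cell(raw_text)))
# strip() removes the leading and trailing newlines only.
print(repr(write_cell(raw_text.strip())))
# Collapsing all whitespace also removes the line break inside the cell.
print(repr(write_cell(" ".join(raw_text.split()))))
```

Note that strip() keeps the line break between the suburb and the street address; the last variant is only needed if the cell should be a single line.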
I'm a beginner in Python, and I'm trying to extract data from the web and display it in a table :
# import libraries
import urllib2
from bs4 import BeautifulSoup
import csv
from datetime import datetime
quote_page = 'http://www.bloomberg.com/quote/SPX:IND'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('h1', attrs={'class': 'name'})
name = name_box.text.strip()
print name
price_box = soup.find('div', attrs={'class':'price'})
price = price_box.text
print price
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([name, price, datetime.now()])
This is very basic code that extracts data from Bloomberg and writes it to a CSV file.
It should put the name in one column, the price in another, and the date in a third.
But it actually copies all of this data into the first row (see the result of the index.csv file).
Am I missing something in my code?
Thank you for your help !
Wikipedia:
In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
The problem isn't related to your Python code! Your script actually writes a plain text file with the fields separated by commas. It is your CSV file viewer that doesn't treat commas as separators. You should check the preferences of your CSV file viewer.
It looks like your CSV isn't being interpreted correctly when you import it into Excel. When I imported it into Excel, I noticed that the comma in "2,337.58" is messing up the CSV data, putting the ,337.58" into a column of its own. When you import the data into Excel, you should get a popup that asks how the data is represented. You should pick the delimited option and then select delimiter: comma. Finally, click finish.
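To check that the file itself is fine, it can help to look at what the csv module writes for a price that contains a comma. This is a minimal Python 3 sketch with made-up values (the index name and the timestamp are assumptions); the writer puts quotes around the price, and a reader that honours those quotes gets three fields back.

```python
import csv
import io

# Example values (assumptions); the price contains a comma, as in "2,337.58".
row = ["S&P 500 Index", "2,337.58", "2017-02-15 10:30:00"]

buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line)

# Reading the line back with a quote-aware parser restores the three fields.
parsed = next(csv.reader(io.StringIO(line, newline="")))
print(parsed)
```

If a viewer splits the quoted price into two columns, the import settings (delimiter and text qualifier) are the place to look, not the script.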