Python: Export Parsed XML to txt file [duplicate]

This question already has an answer here:
How do I write all of these rows into a CSV file for a given range?
(1 answer)
Closed 6 years ago.
I'm parsing text from an XML file. Parsing works well, and I can print the results in full, but when I try to write the text into a text document, all I get in the document is the last item.
from bs4 import BeautifulSoup
import urllib.request
import sys

req = urllib.request.urlopen('file:///C:/Users/John/Desktop/Dow%20Jones/compaq%20neg%201.xml')
xml = BeautifulSoup(req, 'xml')
for item in xml.findAll('paragraph'):
    sys.stdout = open('CN1.txt', 'w')
    print(item.text)
sys.stdout.close()
What am I missing here?

It looks like you are opening the file every time you go through the loop, which I am surprised it lets you do. Each time it opens the file in write mode, it wipes out everything that was written on the previous pass through the loop.
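A minimal sketch of the fix, keeping your file and tag names: open the output file once, before the loop, and write each paragraph to it. There is no need to reassign sys.stdout at all.

from bs4 import BeautifulSoup
import urllib.request

req = urllib.request.urlopen('file:///C:/Users/John/Desktop/Dow%20Jones/compaq%20neg%201.xml')
xml = BeautifulSoup(req, 'xml')

# Open the file once in write mode; the with-block closes it for us.
with open('CN1.txt', 'w') as out:
    for item in xml.findAll('paragraph'):
        # Direct print to the file instead of reassigning sys.stdout.
        print(item.text, file=out)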

Related

Write multiple results to CSV [duplicate]

This question already has answers here:
How to append a new row to an old CSV file in Python?
(8 answers)
Closed 1 year ago.
I'm using selenium and beautifulsoup to iterate through a number of webpages and sort out the results. I have that working; however, I want to export the results to a CSV using this block of code:
with open('finallist.csv', mode='w') as final_list:
    stock_writer = csv.writer(final_list, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    stock_writer.writerow([ticker, element.get_text()])
The only issue is that, since there are multiple results, this code as it stands just overwrites the first line of the CSV every time a new result comes in. Is there any way I can have it write to a new line each time?
Per the Python documentation for the open() function, you can pass mode 'a'. Doing so will append any text to the end of the file, if the file already exists.
with open('finallist.csv', mode='a') as final_list:
    ...
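A sketch using your own variable names (ticker and element come from your loop): with mode='a', each result lands on a new line instead of overwriting the file. newline='' is the csv module's recommended setting for output files.

import csv

with open('finallist.csv', mode='a', newline='') as final_list:
    stock_writer = csv.writer(final_list, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # Appends one row per result instead of truncating the file each time.
    stock_writer.writerow([ticker, element.get_text()])

Alternatively, open the file once in 'w' mode before your loop and call writerow() inside it, which avoids reopening the file for every result.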

Delete everything except URLs with Python [duplicate]

This question already has answers here:
Extracting a URL in Python
(10 answers)
Closed 2 years ago.
I have a JSON file that contains metadata for 900 articles. I want to delete all the data except for the lines that contain URLs and resave the file as .txt.
I created this code, but I couldn't get the saving step to work:
import re

with open("path\url_example.json") as file:
    for line in file:
        urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        print(urls)
A part of the results:
['http://www.google.com.']
['https://www.tutorialspoint.com']
Another issue is that the results are wrapped in ['...'] and may end with a trailing period, neither of which I need. My expected result is:
http://www.google.com
https://www.tutorialspoint.com
If you know which key your URLs will be found under in your JSON, you may find it easier to deserialize the file with the json module from the Python standard library and work with a dict instead of using a regex.
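For instance, assuming the file holds a list of article objects that each store their link under a hypothetical "url" key (adjust to your actual structure), a sketch might look like:

import json

# Hypothetical structure: a list of article objects, each with a "url" key.
with open(r"path\url_example.json") as file:
    articles = json.load(file)

# Resave just the URLs as a plain text file, one per line.
with open("urls.txt", "w") as out:
    for article in articles:
        out.write(article["url"].rstrip(".") + "\n")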
However, if you want to work with regex, remember that urls is a list of matches. If you know there is definitely going to be only one match per line, just print the first entry and rstrip() off the terminal ".", if it's there.
import re

# Raw strings keep the backslashes in the path and the pattern intact.
with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        print(urls[0].rstrip('.'))
If you expect to see multiple matches per line:
import re

with open(r"path\url_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        for url in urls:
            print(url.rstrip('.'))
Without further information on the kind of file you have (txt? json?) and the kind of input lines you are looping through, here is a simple attempt without re.findall():
with open("path\url_example.txt") as handle:
for line in handle:
if not re.search('http'):
continue
spos = line.find('http')
epos = line.find(' ', spos)
url = line[spos:epos]
print(url)

Cannot save XML file using minidom [duplicate]

This question already has answers here:
Troubles while parsing with python very large xml file
(3 answers)
Closed 4 years ago.
I tried to modify and save an XML file using minidom in Python.
Everything works well except for one specific file, which I can read but cannot write back.
The code that I use to save the XML file:
domXMLFile = minidom.parse(dom_document_filename)
# some modification
F = open(dom_document_filename, "w")
domXMLFile.writexml(F)
F.close()
My questions are:
Is it true that minidom cannot handle a file this large (714 KB)?
How do I solve my problem?
In my opinion, lxml is way better than minidom for handling XML. If you have it, here is how to use it:
from lxml import etree

root = etree.parse('path/file.xml')
# some changes to root
# tostring() returns bytes by default, so open the file in binary mode
with open('path/file.xml', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))
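Alternatively, the parsed tree has a write() method that handles opening the file and encoding for you; a sketch with the same placeholder path:

from lxml import etree

tree = etree.parse('path/file.xml')
# some changes to the tree
# write() serializes straight to the file, no manual open() needed.
tree.write('path/file.xml', pretty_print=True, xml_declaration=True, encoding='utf-8')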
If you don't have lxml, you could use pdb to debug your code. Just write import pdb; pdb.set_trace() where you want a breakpoint, and when you run your function in a shell it should stop at that line. It may give you a better view of what is not working.

Read json file as input and output as pprint? [duplicate]

This question already has answers here:
How to prettyprint a JSON file?
(15 answers)
Closed 5 years ago.
I'm working with a large JSON file that is currently encoded as one long line.
This makes it unintelligible for other people to work with, so I want to render it using pprint.
At the moment I'm trying to import the full file and print it with pprint, but my output looks like this:
<_io.TextIOWrapper name='hash_mention.json' mode='r' encoding='UTF-8'>
My question is: what is that showing? How can I get it to output the JSON data via pprint?
The code I've written looks like this:
import pprint

with open('./hash_mention.json', 'r') as input_data_file:
    pprint.pprint(input_data_file)
You opened the file in read mode but forgot to read the file contents, so what you printed is the file object itself.
Just change pprint.pprint(input_data_file) to pprint.pprint(input_data_file.read()) and voila!
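Note that read() returns the contents as a single string, which pprint will still show as one long line. To pretty-print the actual structure, parse the JSON first; a minimal sketch:

import json
import pprint

with open('./hash_mention.json', 'r') as input_data_file:
    # Parse the one-line JSON string into Python objects.
    data = json.load(input_data_file)

# pprint now renders the nested structure with indentation.
pprint.pprint(data)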

Trying to download data from URL with CSV File

I'm slightly new to Python and have a question as to why the following code doesn't produce any output in the csv file. The code is as follows:
import csv
import urllib2

url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)
for row in cr:
    with open("AusCentralbank.csv", "wb") as f:
        writer = csv.writer(f)
        writer.writerows(row)
Cheers.
Edit:
Brien and Albert solved the initial issue I had. However, I now have one further question. The CSV file listed above comes from "http://www.rba.gov.au/statistics/tables/#interest-rates", under Zero-coupon "Interest Rates - Analytical Series - 2009 to Current - F17" (the F-17 Yields CSV). It has 5 workbooks, and I actually just want to gather the data in the 5th workbook. Is there a way I could do this? Cheers.
I could only test my code using Python 3. However, the only difference should be urllib2; hence I am using urllib.request for opening the desired URL.
The variable html is of type bytes and can generally be written to a file in binary mode. Additionally, your source is a CSV file already, so there should be no need to convert it:
#!/usr/bin/env python3
# coding: utf-8
import urllib.request

url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib.request.urlopen(url)
html = response.read()
with open('output.csv', 'wb') as f:
    f.write(html)
It is probably because of your opening mode. According to the documentation:
'w' for only writing (an existing file with the same name will be erased)
You should use append ('a') mode to add rows to the end of the file:
'a' opens the file for appending; any data written to the file is automatically added to the end.
Also, since the file you are trying to download is a CSV file, you don't need to convert it.
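Applied to the original loop, the minimal change might look like the sketch below (note writerow rather than writerows for a single row). Opening the file once before the loop, as the other answers do, is still the more efficient fix.

import csv
import urllib2

url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)

for row in cr:
    # 'a' appends instead of truncating; 'b' because the Python 2
    # csv module expects files opened in binary mode.
    with open("AusCentralbank.csv", "ab") as f:
        writer = csv.writer(f)
        writer.writerow(row)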
@albert had a great answer. I've gone ahead and converted it to the equivalent Python 2.x code. You were doing a bit too much work in your original program; since the file was already a CSV, you didn't need to do any special work to turn it into one.
import urllib2

url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib2.urlopen(url)
html = response.read()
with open('AusCentralbank.csv', 'wb') as f:
    f.write(html)
