Python CSV module handling comma within quote inside a field

I am using Python's csv module to parse data from a CSV file in my application. While testing the application, a colleague entered a piece of sample text copy-pasted from a random website.
The sample text has double quotes inside a field and a comma within those double quotes. The commas outside of the double quotes are handled correctly by the csv module, but the field is split at the comma inside the double quotes, spilling into the next column. I looked at the CSV specification, and the field does comply with it: the double quotes are escaped by doubling them.
I checked the file in LibreOffice and it is handled correctly there.
Here are the header and the problematic record from the CSV data:
company_name,company_revenue,company_start_year,company_website,company_description,company_email
Acme Inc,80000000000000,2004,http://google.com,"The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as ""Acme Rocket-Powered Products, Inc."" based in Fairfield, New Jersey. Many of its products appear to be produced specifically for Wile E. Coyote; for example, the Acme Giant Rubber Band, subtitled ""(For Tripping Road Runners)"".
Sometimes, Acme can also send living creatures through the mail, though that isn't done very often. Two examples of this are the Acme Wild-Cat, which had been used on Elmer Fudd and Sam Sheepdog (which doesn't maul its intended victim); and Acme Bumblebees in one-fifth bottles (which sting Wile E. Coyote). The Wild Cat was used in the shorts Don't Give Up the Sheep and A Mutt in a Rut, while the bees were used in the short Zoom and Bored.
While their products leave much to be desired, Acme delivery service is second to none; Wile E. can merely drop an order into a mailbox (or enter an order on a website, as in the Looney Tunes: Back in Action movie), and have the product in his hands within seconds.",roadrunner@acme.com
Here's what it looks like in the debug log:
2014-08-27 21:35:53,922 - DEBUG: company_website=http://google.com
2014-08-27 21:35:53,923 - DEBUG: company_revenue=80000000000000
2014-08-27 21:35:53,923 - DEBUG: company_start_year=2004
2014-08-27 21:35:53,923 - DEBUG: account_description=The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as "Acme Rocket-Powered Products
2014-08-27 21:35:53,924 - DEBUG: company_name=Acme Inc
2014-08-27 21:35:53,925 - DEBUG: company_email=Inc."" based in Fairfield
The relevant piece of code that handles the CSV parsing:
with open(csvfile, 'rU') as contactsfile:
    # sniff for dialect of csvfile so we can automatically determine
    # what delimiters to use
    try:
        dialect = csv.Sniffer().sniff(contactsfile.read(2048))
    except:
        dialect = 'excel'
    get_total_jobs(contactsfile, dialect)
    contacts = csv.DictReader(contactsfile, dialect=dialect, skipinitialspace=True,
                              quoting=csv.QUOTE_MINIMAL)
    # Start reading the rows
    for row in contacts:
        process_job()
        for key, value in row.iteritems():
            logging.debug("{}={}".format(key, value))
I understand that this is just junk data and we'll likely never encounter such data, but the CSV files we receive are not within our control, so we could hit such an edge case. And since it's a valid CSV file, handled correctly by LibreOffice, it makes sense for me to handle it correctly as well.
I have searched other questions on CSV handling where people have had problems with either the handling of quotes or of commas within a field. I have both of these working fine; my problem is when a comma is nested within quotes within a field. There is a question with the same problem, Comma in Double Quotes in CSV File, whose answer does solve the issue, but in a hackish way that does not preserve the contents as given to me, even though they are valid per RFC 4180.

The Dialect.doublequote attribute controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar. It defaults to True.
The sniffer is setting the doublequote attribute to False, but the CSV you posted should be parsed with doublequote = True:
import csv

with open(csvfile, 'rb') as contactsfile:
    # sniff for dialect of csvfile so we can automatically determine
    # what delimiters to use
    try:
        dialect = csv.Sniffer().sniff(contactsfile.read(2048))
    except:
        dialect = 'excel'
    # get_total_jobs(contactsfile, dialect)
    contactsfile.seek(0)
    contacts = csv.DictReader(contactsfile, dialect=dialect, skipinitialspace=True,
                              quoting=csv.QUOTE_MINIMAL, doublequote=True)
    # Start reading the rows
    for row in contacts:
        for key, value in row.iteritems():
            print("{}={}".format(key, value))
yields
company_description=The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as "Acme Rocket-Powered Products, Inc." based in Fairfield, New Jersey. Many of its products appear to be produced specifically for Wile E. Coyote; for example, the Acme Giant Rubber Band, subtitled "(For Tripping Road Runners)".
Sometimes, Acme can also send living creatures through the mail, though that isn't done very often. Two examples of this are the Acme Wild-Cat, which had been used on Elmer Fudd and Sam Sheepdog (which doesn't maul its intended victim); and Acme Bumblebees in one-fifth bottles (which sting Wile E. Coyote). The Wild Cat was used in the shorts Don't Give Up the Sheep and A Mutt in a Rut, while the bees were used in the short Zoom and Bored.
While their products leave much to be desired, Acme delivery service is second to none; Wile E. can merely drop an order into a mailbox (or enter an order on a website, as in the Looney Tunes: Back in Action movie), and have the product in his hands within seconds.
company_website=http://google.com
company_start_year=2004
company_name=Acme Inc
company_revenue=80000000000000
company_email=roadrunner@acme.com
Also, per the docs, in Python 2 the file handle should be opened in 'rb' mode, not 'rU' mode:
If csvfile is a file object, it must be opened with the ‘b’ flag on
platforms where that makes a difference.
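To see the effect of the flag in isolation, here is a minimal standalone sketch (not the poster's file, just a line shaped like it; on Python 2, io.StringIO needs unicode input):

import csv
import io

# A field containing both an RFC 4180 escaped quote ("") and a comma
# inside the quotes:
line = 'Acme Inc,"known as ""Acme, Inc."" in the cartoon"\r\n'

# doublequote=True (the default) collapses "" into a literal quote and
# keeps the comma inside the field:
print(next(csv.reader(io.StringIO(line), doublequote=True)))
# -> ['Acme Inc', 'known as "Acme, Inc." in the cartoon']

# With doublequote=False (what the sniffer guessed here), the first "
# inside the field ends the quoted section, so the embedded comma splits
# the field - the same failure as in the debug log above:
print(next(csv.reader(io.StringIO(line), doublequote=False)))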

Related

Scrape latitude and longitude locations obtained from Mapbox

I'm working on a Divvy dataset project.
I want to scrape the information for each suggested location, and the comments provided, from here: http://suggest.divvybikes.com/.
Am I able to scrape this information from Mapbox? It is displayed on a map, so it must have the information somewhere.
I visited the page, and logged my network traffic using Google Chrome's Developer Tools. Filtering the requests to view only XHR (XmlHttpRequest) requests, I saw a lot of HTTP GET requests to various REST APIs. These REST APIs return JSON, which is ideal. Only two of these APIs seem to be relevant for your purposes - one is for places, the other for comments associated with those places. The places API's JSON contains interesting information, such as place ids and coordinates. The comments API's JSON contains all comments regarding a specific place, identified by its id. Mimicking those calls is pretty straightforward with the third-party requests module. Fortunately, the APIs don't seem to care about request headers. The query-string parameters (the params dictionary) need to be well-formulated though, of course.
I was able to come up with the following two functions: get_places makes multiple calls to the same API, each time with a different page query-string parameter. It seems that "page" is the term they use internally to split up all their data into different chunks - all the different places/features/stations are split up across multiple pages, and you can only get one page per API call. The while-loop accumulates all places in a giant list, and it keeps going until we receive a response which tells us there are no more pages. Once the loop ends, we return the list of places.
The other function is get_comments, which takes a place id (string) as a parameter. It then makes an HTTP GET request to the appropriate API, and returns a list of comments for that place. This list may be empty if there are no comments.
def get_places():
    import requests
    from itertools import count

    api_url = "http://suggest.divvybikes.com/api/places"
    page_counter = count(1)
    places = []
    for page_nr in page_counter:
        params = {
            "page": str(page_nr),
            "include_submissions": "true"
        }
        response = requests.get(api_url, params=params)
        response.raise_for_status()
        content = response.json()
        places.extend(content["features"])
        if content["metadata"]["next"] is None:
            break
    return places

def get_comments(place_id):
    import requests

    api_url = "http://suggest.divvybikes.com/api/places/{}/comments".format(place_id)
    response = requests.get(api_url)
    response.raise_for_status()
    return response.json()["results"]

def main():
    from operator import itemgetter

    places = get_places()
    place_id = places[12]["id"]
    print("Printing comments for the thirteenth place (id: {})\n".format(place_id))
    for comment in map(itemgetter("comment"), get_comments(place_id)):
        print(comment)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Printing comments for the thirteenth place (id: 107062)
I contacted Divvy about this five years ago and would like to pick the conversation back up! The Evanston Divvy bikes are regularly spotted in Wilmette and we'd love to expand the system for riders. We could easily have four stations - at the Metra Train Station, and the CTA station, at the lakefront Gillson Park and possibly one at Edens Plaza in west Wilmette. Please, please, please contact me directly. Thanks.
For this example, I'm printing all the comments for the 13th place in our list of places. I picked that one because it is the first place that actually has comments (places 0 - 11 didn't have any; most places don't seem to have comments). In this case, the place only had one comment.
EDIT - If you wanted to save the place ids, latitude, longitude and comments in a CSV, you can try changing the main function to:
def main():
    import csv

    print("Getting places...")
    places = get_places()
    print("Got all places.")
    fieldnames = ["place id", "latitude", "longitude", "comments"]
    print("Writing to CSV file...")
    with open("output.csv", "w") as file:
        writer = csv.DictWriter(file, fieldnames)
        writer.writeheader()
        num_places_to_write = 25
        for place_nr, place in enumerate(places[:num_places_to_write], start=1):
            print("Writing place #{}/{} with id {}".format(place_nr, num_places_to_write, place["id"]))
            writer.writerow(dict(zip(fieldnames,
                                     [place["id"],
                                      *place["geometry"]["coordinates"],
                                      [c["comment"] for c in get_comments(place["id"])]])))
    return 0
With this, I got results like:
place id,latitude,longitude,comments
107098,-87.6711076553,41.9718155716,[]
107097,-87.759540081,42.0121073671,[]
107096,-87.747695446,42.0263916146,[]
107090,-87.6642036438,42.0162096564,[]
107089,-87.6609444613,41.8852953922,[]
107083,-87.6007853815,41.8199433342,[]
107082,-87.6355862613,41.8532736671,[]
107075,-87.6210737228,41.8862644836,[]
107074,-87.6210737228,41.8862644836,[]
107073,-87.6210737228,41.8862644836,[]
107065,-87.6499611139,41.9627251578,[]
107064,-87.6136027649,41.8332984674,[]
107062,-87.7073025402,42.0760990584,"[""I contacted Divvy about this five years ago and would like to pick the conversation back up! The Evanston Divvy bikes are regularly spotted in Wilmette and we'd love to expand the system for riders. We could easily have four stations - at the Metra Train Station, and the CTA station, at the lakefront Gillson Park and possibly one at Edens Plaza in west Wilmette. Please, please, please contact me directly. Thanks.""]"
In this case, I used the list-slicing syntax (places[:num_places_to_write]) to write only the first 25 places to the CSV file, just for demonstration purposes. However, after about the first thirteen were written, I got this exception message:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
So, I'm guessing that the comments API doesn't expect to receive so many requests in such a short amount of time. You may have to sleep in the loop for a bit to get around this, as sketched below. It's also possible that the API doesn't care, and the request just happened to time out.
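If rate limiting is indeed the cause, pausing between requests in the writing loop is the simplest workaround. A minimal sketch (the half-second pause is a guess, not something the API documents):

import time

for place_nr, place in enumerate(places[:num_places_to_write], start=1):
    comments = [c["comment"] for c in get_comments(place["id"])]
    writer.writerow(dict(zip(fieldnames,
                             [place["id"], *place["geometry"]["coordinates"], comments])))
    time.sleep(0.5)  # guessed pause between comment-API calls; tune as needed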

Want to extract text from a text or pdf file as different paragraphs

Check the following piece of text:
IN THE HIGH COURT OF GUJARAT AT AHMEDABAD
R/CRIMINAL APPEAL NO. 251 of 2009
FOR APPROVAL AND SIGNATURE:
HONOURABLE MR.JUSTICE R.P.DHOLARIA
==========================================================
1 Whether Reporters of Local Papers may be allowed to see the judgment ?
2 To be referred to the Reporter or not ?
3 Whether their Lordships wish to see the fair copy of the judgment ?
4 Whether this case involves a substantial question of law as to the interpretation of the Constitution of India or any order made thereunder ?
========================================================== STATE OF GUJARAT,S M RAO,FOOD INSPECTOR,OFFICE OF THE Versus DHARMESHBHAI NARHARIBHAI GANDHI ========================================================== Appearance: MS HB PUNANI, APP (2) for the Appellant(s) No. 1 MR DK MODI(1317) for the Opponent(s)/Respondent(s) No. 1 ==========================================================
CORAM: HONOURABLE MR.JUSTICE R.P.DHOLARIA
Date : 12/03/2019
ORAL JUDGMENT
1. The appellant State of Gujarat has
preferred the present appeal under section 378(1)
(3) of the Code of Criminal Procedure, 1973
against the judgment and order of acquittal dated
Page 1 of 12
R/CR.A/251/2009 JUDGMENT
17.11.2008 rendered by learned 2nd Additional
Civil Judge and Judicial Magistrate, First Class,
Nadiad in Food Case No.1 of 2007.
The short facts giving rise to the
present appeal are that on 10.11.2006 at about
18.00 hours, the complainant visited the place of
the respondent accused situated at Juna
Makhanpura, Rabarivad, Nadiad along with panch
witness and the respondent was found dealing in
provisional items. The complainant identified
himself as a Food Inspector and after giving
intimation in Form No.6 has purchased muddamal
sample of mustard seeds in the presence of the
panchas for the purpose of analysis. Thereafter,
the complainant Food Inspector has divided the
said sample in equal three parts and after
completing formalities of packing and sealing
obtained signatures of the vendor and panchas and
out of the said three parts, one part was sent to
the Public Analyst, Vadodara for analysis and
remaining two parts were sent to the Local Health
Authority, Gandhinagar. Thereafter, the Public
Analyst forwarded his report. In the said report,
it is stated that the muddamal sample of mustard
seeds is misbranded which is in breach of the
provisions of the Food Adulteration Act, 1954
(for short “the Act”) and the Rules framed
thereunder. It is alleged that, therefore, the
sample of mustard seeds was misbranded and,
thereby, the accused has committed the offence.
Page 2 of 12
R/CR.A/251/2009 JUDGMENT
Hence, the complaint came to be lodged against
the respondent accused.
I want to be able to write a program that follows the constraints given below. Keep in mind that this is only a single file; I have around 40k files, and it should run on all of them. The files all differ somewhat, but the basic format of every file is the same.
Constraints:
It should start the text extraction after the "metadata". The metadata is the information about the case from the start of the file, i.e. "IN THE HIGH COURT OF GUJARAT", up to "ORAL JUDGMENT". In all the files I have, there are various numbered POINTS after that string ends, and I need each of those points as a separate paragraph (the text above has two points; I need them as different paragraphs).
Note the lines shown in italics in the original file: these are the page headers (e.g. "Page 1 of 12" and "R/CR.A/251/2009 JUDGMENT") in the text/PDF file. I need to remove them, as they don't add any meaning to the text content I want.
The files are available in both TEXT and PDF format, so I can use either. But I am new to Python, so I don't know how or where to start; I only have basic knowledge of Python.
This data is going to be made into a "corpus" for further processing while building a large expert system, so I hope you see what needs to be done.
Read the official Python docs!
Start with Python's basic str type and its methods. One of its methods, find, locates substrings in your text.
Then use Python's slicing notation to extract the portion of text you need, e.g.:
text = """YOUR TEXT HERE..."""
meta_start = 'In the high court of gujarat'
meta_end = 'ORAL JUDGMENT'
pos1 = text.find(meta_start)
pos2 = text.find(meta_end)
if pos2 > pos1 and pos1 > -1:
# text is found, extract it
text1 = text[meta_start + len(meta_start):meta_end - 1]
After that, you can go ahead and save your extracted text to a database.
Of course, a better and more complicated solution would be to use regular expressions, but that's another story -- try finding the right way for yourself!
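As a starting point for that route, here is a sketch of one possible regex approach; the patterns are guesses based on the single sample above, so they will need tuning against the full set of files:

import re

# Drop everything up to and including the "ORAL JUDGMENT" marker,
# i.e. the metadata block:
match = re.search(r'ORAL JUDGMENT(.*)', text, re.DOTALL)
if match:
    body = match.group(1)
    # Split the body into paragraphs at numbered points such as "1." or
    # "2." appearing at the start of a line:
    paragraphs = [p.strip() for p in re.split(r'\n\s*\d+\.\s', body) if p.strip()]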
As for italics and other text formatting, you won't ever be able to pick it out in plain text (unless you have some 'meta' markers, e.g. [i] tags).

Python - How to read a CSV file separated by commas which has commas within the values?

The file has a URL which contains commas within it. For example:
~oref=https://tuclothing.tests.co.uk/c/Girls/Girls_Underwear_Socks&Tights?INITD=GNav-CW-GrlsUnderwear&title=Underwear,+Socks+&+Tights
Between Underwear and +Socks there is a comma, which is making my life difficult.
Is there a way to indicate to the reader (pandas, the csv reader, etc.) that the whole URL is just one value?
This is a bigger sample with columns and values:
Event Time,User ID,Advertiser ID,TRAN Value,Other Data,ORD Value,Interaction Time,Conversion ID,Segment Value 1,Floodlight Configuration,Event Type,Event Sub-Type,DBM Auction ID,DBM Request Time,DBM Billable Cost (Partner Currency),DBM Billable Cost (Advertiser Currency),
1.47E+15,CAESEKoMzQamRFTrkbdTDT5F-gM,2934701,,~oref=https://tuclothing.tests.co.uk/c/NewIn/NewIn_Womens?q=%3AnewArrivals&page=2&size=24,4.60E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESEKQhGXdLq0FitBKF5EPPfgs,2934701,,~oref=https://tuclothing.tests.co.uk/c/Women/Women_Accessories?INITD=GNav-WW-Accesrs&q=%3AnewArrivals&title=Accessories&mkwid=sv5biFf2y_dm&pcrid=90361315613&pkw=leather%20bag&pmt=e&med=Search&src=Google&adg=Womens_Accessories&kw=leather+bag&cmp=TU_Women_Accessories&adb_src=4,4.73E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESEEpNRaLne21k6juip9qfAos,2934701,,num=16512910;~oref=https://tuclothing.tests.co.uk/,1,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESEJ3a2YRrPSSeeRUFHDSoXNQ,2934701,,~oref=https://tuclothing.tests.co.uk/c/Girls/Girls_Underwear_Socks&Tights?INITD=GNav-CW-GrlsUnderwear&title=Underwear,+Socks+&+Tights,8.12E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,0,0,0
1.47E+15,CAESEGmwaNjTvIrQ3MoIvqiRC8U,2934701,,~oref=https://tuclothing.tests.co.uk/login/checkout,1.75E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESEM3G-Nh6Q0OhboLyOhtmtiI,2934701,,~oref=https://3984747.fls.doubleclick.net/activityi;~oref=http%3A%2F%2Fwww.tests.co.uk%2Fshop%2Fgb%2Fgroceries%2Ffrozen-%2Fbeef--pork---lamb,3.74E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESENlK7oc-ygl637Y2is3a90c,2934701,,~oref=https://tuclothing.tests.co.uk/,5.10E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
It looks like, in this case, the only comma you are having issues with is located in a URL. You could run your CSV file through a preprocessor which either strips out the commas in your URLs or URL-encodes them.
Personally, I would opt for URL encoding, which converts the comma to %2C. That way you don't have a comma in your URL when you start reading your CSV row values, yet the URL still retains its working link to the reference/destination page.
If you had this issue in other fields (not a URL), or in unknown/random locations in the row, the solution would not be easy at all. But since you know exactly where the issue occurs each time, you can look for that character in that particular field and replace it when found.
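For example, since the header fixes the number of columns and the URL sits in a known position ("Other Data", the fifth of the 17 fields in the sample above), each line can be split from both ends so that commas inside the URL never get a chance to break the row. A sketch, assuming that fixed layout (the file name is a placeholder):

def split_row(line, n_fields=17):
    # First 4 fields from the left, last 12 from the right; whatever is
    # left in the middle is the "Other Data" field, commas and all.
    head = line.rstrip('\n').split(',', 4)
    tail = head.pop().rsplit(',', n_fields - 5)
    return head + tail

with open('report.csv') as f:  # placeholder file name
    header = next(f).rstrip('\n').split(',')
    for line in f:
        row = dict(zip(header, split_row(line)))
        print(row['Other Data'])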

Import File to Database in phpMyAdmin

I want to import a file into a phpMyAdmin database. It is to have 5 columns: id, url, lat, lon and address. However, each line of the file is structured as follows:
23947501894 https://farm2.staticflickr.com/1664/23947501894_09e21ac1c4_q.jpg 53.404021 -2.996651 Belgian Merchant Seamen, Queensway (Mersey Tunnel), Liverpool, North West England, England, CH41, United Kingdom
Most of the data I want to input is separated by a space, except when it gets to the address at the end, which contains many spaces and commas. Is it possible to input this data into the database as is? If so, can anyone suggest how I might do this?
I am very new to phpMyAdmin, and I am using Python to do this. Thanks in advance for your help; I am very stuck!
You'll have to process the text file before importing, since the delimiter also appears unescaped in line with your data.
The good news is that your data format makes this really easy. Take the first four spaces and convert them to a special character (maybe ; or ~, something that doesn't appear anywhere else in your data). You can accomplish this with your favorite stream editor or text manipulation program (sed, awk, perl, and python are all good candidates for this work).
There are many ways to do this (see also these answers for an idea of how many different ways exist, though note that that question is about working on an entire file, while we want to work on individual lines), but probably the simplest is to run sed four times:
for i in $(seq 4) ; do sed -i -e 's/ /~/' ~/import.csv ; done
Make sure you do this with a copy of the file because this will edit the specified file in-place.
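If you'd rather do the preprocessing in Python, the same transformation is one replace call per line; a sketch with placeholder file names:

# Replace only the first four spaces of each line with '~', leaving the
# spaces and commas inside the address untouched.
with open('flickr.txt') as src, open('import.csv', 'w') as dst:
    for line in src:
        dst.write(line.replace(' ', '~', 4))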
From your phpMyAdmin Import tab, you'll then use ~ (or whatever separator you used) as the value for "Columns separated with:", and leave the other fields blank except for "Lines terminated with:", which stays at "auto".
Log in to phpMyAdmin, then, in the right frame:
1. Click Database Tab and Create DB
2. Click Import Tab
3. Click Browse and select csv file
4. Change Format from SQL to CSV
5. Click Go

Preserving line breaks when parsing with Scrapy in Python

I've written a Scrapy spider that extracts text from a page. The spider parses and outputs correctly on many of the pages, but a few throw it off. I'm trying to maintain the line breaks and formatting in the document. Pages such as http://www.state.gov/r/pa/prs/dpb/2011/04/160298.htm are formatted properly, like so:
April 7, 2011
Mark C. Toner
2:03 p.m. EDT
MR. TONER: Good afternoon, everyone. A couple of things at the top,
and then I’ll take your questions. We condemn the attack on innocent
civilians in southern Israel in the strongest possible terms, as well
as ongoing rocket fire from Gaza. As we have reiterated many times,
there’s no justification for the targeting of innocent civilians,
and those responsible for these terrorist acts should be held
accountable. We are particularly concerned about reports that indicate
the use of an advanced anti-tank weapon in an attack against civilians
and reiterate that all countries have obligations under relevant
United Nations Security Council resolutions to prevent illicit
trafficking in arms and ammunition. Also just a brief statement --
QUESTION: Can we stay on that just for one second?
MR. TONER: Yeah. Go ahead, Matt.
QUESTION: Apparently, the target of that was a school bus. Does that
add to your outrage?
MR. TONER: Well, any attack on innocent civilians is abhorrent, but
certainly the nature of the attack is particularly so.
While pages like http://www.state.gov/r/pa/prs/dpb/2009/04/121223.htm produce output like this, with no line breaks:
April 2, 2009
Robert Wood
11:53 a.m. EDTMR. WOOD: Good morning, everyone. I think it’s just
about still morning. Welcome to the briefing. I don’t have anything,
so – sir.QUESTION: The North Koreans have moved fueling tankers, or
whatever, close to the site. They may or may not be fueling this
missile. What words of wisdom do you have for the North Koreans at
this moment?MR. WOOD: Well, Matt, I’m not going to comment on, you
know, intelligence matters. But let me just say again, we call on the
North to desist from launching any type of missile. It would be
counterproductive. It’s provocative. It further inflames tensions in
the region. We want to see the North get back to the Six-Party
framework and focus on denuclearization.Yes.QUESTION: Japan has also
said they’re going to call for an emergency meeting in the Security
Council, you know, should this launch go ahead. Is this something that
you would also be looking for?MR. WOOD: Well, let’s see if this test
happens. We certainly hope it doesn’t. Again, calling on the North
not to do it. But certainly, we will – if that test does go forward,
we will be having discussions with our allies.
The code I'm using is as follows:
def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)
    hxs = HtmlXPathSelector(response)
    speaker = hxs.select("//span[contains(@class, 'official_s_name')]")  # gets the speaker
    speaker = speaker.select('string()').extract()[0]  # extracts speaker text
    date = hxs.select('//*[@id="date_long"]')  # gets the date
    date = date.select('string()').extract()[0]  # extracts the date
    content = hxs.select('//*[@id="centerblock"]')  # gets the content
    content = content.select('string()').extract()[0]  # extracts the content
    texts = "%s\n\n%s\n\n%s" % (date, speaker, content)  # puts everything together in a string
    filename = "/path/StateDailyBriefing-%s.txt" % date  # creates a file name using the date
    # opens the file defined above and writes 'texts' using utf-8
    with codecs.open(filename, 'w', encoding='utf-8') as output:
        output.write(texts)
I think the problem lies in the formatting of the HTML of the page. On the pages that output the text incorrectly, the paragraphs are separated by <br> <p></p>, while on the pages that output correctly, the paragraphs are contained within <p align="left" dir="ltr">. So, while I've identified this, I'm not sure how to make everything output consistently in the correct form.
The problem is that when you get text() or string(), <br> tags are not converted to newlines.
Workaround: replace the <br> tags before doing the XPath requests. Code:
response = response.replace(body=response.body.replace('<br />', '\n'))
hxs = HtmlXPathSelector(response)
And let me give some advice: if you know that there is only one node, you can use text() instead of string():
date = hxs.select('//*[@id="date_long"]/text()').extract()[0]
Try this xpath:
//*[@id="centerblock"]//text()
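For example, extracting the individual text nodes and rejoining them reintroduces the breaks that string() flattens away; a sketch along the lines of the code in the question:

# //text() returns one string per text node, so element boundaries
# (including those created by <br>) become line boundaries when joined:
lines = hxs.select('//*[@id="centerblock"]//text()').extract()
content = '\n'.join(line.strip() for line in lines if line.strip())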
