Remove nested newline characters in delimited file? - python

I have a caret-delimited file. The only carets in the file are delimiters -- there are none in text. Several of the fields are free text fields and contain embedded newline characters. This makes parsing the file very difficult. I need the newline characters at the end of the records, but I need to remove them from the fields with text.
This is open source maritime piracy data from the Global Integrated Shipping Information System. Here are three records, preceded by the header row. In the first, the boat name is NORMANNIA, in the second, it is "Unkown" and in the third, it is KOTA BINTANG.
ship_name^ship_flag^tonnage^date^time^imo_num^ship_type^ship_released_on^time_zone^incident_position^coastal_state^area^lat^lon^incident_details^crew_ship_cargo_conseq^incident_location^ship_status_when_attacked^num_involved_in_attack^crew_conseq^weapons_used_by_attackers^ship_parts_raided^lives_lost^crew_wounded^crew_missing^crew_hostage_kidnapped^assaulted^ransom^master_crew_action_taken^reported_to_coastal_authority^reported_to_which_coastal_authority^reporting_state^reporting_intl_org^coastal_state_action_taken
NORMANNIA^Liberia^24987^2009-09-19^22:30^9142980^Bulk carrier^^^Off Pulau Mangkai,^^South China Sea^3° 04.00' N^105° 16.00' E^Eight pirates armed with long knives and crowbars boarded the ship underway. They broke into 2/O cabin, tied up his hands and threatened him with a long knife at his throat. Pirates forced the 2/O to call the Master. While the pirates were waiting next to the Master’s door, they seized C/E and tied up his hands. The pirates rushed inside the Master’s cabin once it was opened. They threatened him with long knives and crowbars and demanded money. Master’s hands were tied up and they forced him to the aft station. The pirates jumped into a long wooden skiff with ship’s cash and crew personal belongings and escaped. C/E and 2/O managed to free themselves and raised the alarm^Pirates tied up the hands of Master, C/E and 2/O. The pirates stole ship’s cash and master’s, C/E & 2/O cash and personal belongings^In international waters^Steaming^5-10 persons^Threat of violence against the crew^Knives^^^^^^^^SSAS activated and reported to owners^^Liberian Authority^^ICC-IMB Piracy Reporting Centre Kuala Lumpur^-
Unkown^Marshall Islands^19846^2013-08-28^23:30^^General cargo ship^^^Cam Pha Port^Viet Nam^South China Sea^20° 59.92' N^107° 19.00' E^While at anchor, six robbers boarded the vessel through the anchor chain and cut opened the padlock of the door to the forecastle store. They removed the turnbuckle and lashing of the forecastle store's rope hatch. The robbers escaped upon hearing the alarm activated when they were sighted by the 2nd officer during the turn-over of duty watch keepers.^"There was no injury to the crew however, the padlock of the door to the forecastle store and the rope hatch were cut-opened.
Two centre shackles and one end shackle were stolen"^In port area^At anchor^5-10 persons^^None/not stated^Main deck^^^^^^^-^^^Viet Nam^"ReCAAP ISC via ReCAAP Focal Point (Vietnam)
ReCAAP ISC via Focal Point (Singapore)"^-
KOTA BINTANG^Singapore^8441^2002-05-12^15:55^8021311^Bulk carrier^^UTC^^^South China Sea^^^Seven robbers armed with long knives boarded the ship, while underway. They broke open accommodation door, held hostage a crew member and forced the Master to open his cabin door. They then tied up the Master and crew member, forced them back onto poop deck from where the robbers jumped overboard and escaped in an unlit boat^Master and cadet assaulted; Cash, crew belongings and ship's cash stolen^In territorial waters^Steaming^5-10 persons^Actual violence against the crew^Knives^^^^^^2^^-^^Yes. SAR, Djakarta and Indonesian Naval Headquarters informed^^ICC-IMB PRC Kuala Lumpur^-
You'll notice that the first and third records are fine and easy to parse. The second record, "Unkown," has some nested newline characters.
How should I go about removing the nested newline characters (but not those at the end of the records) in a python script (or otherwise, if there's an easier way) so that I can import this data into SAS?

Load the data into a string a, then do:
import re
newa = re.sub(r'\n', '', a)
and there will be no newlines in newa. Use
newa = re.sub(r'\n(?!$)', '', a)
instead, and the negative lookahead leaves the newline at the very end of the string but strips the rest.
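A quick sketch of the difference (the string a here is a stand-in for one assembled record):

```python
import re

a = "field one\nstill field one^field two\n"

# Stripping every newline also removes the record terminator:
no_newlines = re.sub(r'\n', '', a)

# The negative lookahead spares only the newline at the very end of the string:
keep_last = re.sub(r'\n(?!$)', '', a)
print(keep_last)
```

Note this only preserves record-ending newlines if you apply it one record at a time; run over the whole file at once, it keeps just the file's final newline.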

I see you've tagged this as regex, but I would recommend using the builtin CSV library to parse this. The CSV library will parse the file correctly, keeping newlines where it should.
Python CSV Examples: http://docs.python.org/2/library/csv.html
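For instance, a sketch (assuming the multi-line fields are double-quoted, as in the second record above; the inline data stands in for the real file):

```python
import csv
import io

# Two caret-delimited records; the second has a quoted field with an embedded newline
data = 'A^B^C\nD^"E part 1\nE part 2"^F\n'

reader = csv.reader(io.StringIO(data), delimiter='^', quotechar='"')
for record in reader:
    # Embedded newlines survive inside the quoted field; flatten them for export
    cleaned = [field.replace('\n', ' ') for field in record]
    print(cleaned)
```

If the multi-line fields are not consistently quoted, the csv module cannot tell an embedded newline from a record terminator, and you are back to counting delimiters.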

I solved the problem by counting the number of delimiters encountered and manually switching to a new record when I reached the number associated with a single record. I then stripped all of the newline characters and wrote the data back out to a new file. In essence, it's the original file with the newline characters stripped from the fields but with a newline character at the end of each record. Here's the code:
f = open("events.csv", "r")
carets_per_record = 33
final_file = []
temp_file = []
temp_str = ''
temp_cnt = 0
building = False
for line in f:
    # If there are no carets on the line, we are building a string
    if line.count('^') == 0:
        building = True
    # If we are not building a string, then set temp_str equal to the line
    if building is False:
        temp_str = line
    else:
        temp_str = temp_str + " " + line
    # Count the number of carets accumulated so far
    temp_cnt = temp_str.count('^')
    # If we do not have the proper number of carets, we are still building
    if temp_cnt < carets_per_record:
        building = True
    # If we do have the proper number of carets, the record is complete
    # and we can push it to the list
    elif temp_cnt == carets_per_record:
        building = False
        temp_file.append(temp_str)
# Strip embedded newline characters from each assembled record
for item in temp_file:
    final_file.append(item.replace('\n', ''))
# Write the cleaned records out to a new file
g = open("new_events.csv", "w")
for item in final_file:
    g.write(item + '\n')
# Close the files we were working with
f.close()
g.close()

Related

Accumulating Characters in Python

So I have this textfile, and in that file it goes like this... (just a bit of it)
"The truest love that ever heart
Felt at its kindled core
Did through each vein in quickened start
The tide of being pour
Her coming was my hope each day
Her parting was my pain
The chance that did her steps delay
Was ice in every vein
I dreamed it would be nameless bliss
As I loved loved to be
And to this object did I press
As blind as eagerly
But wide as pathless was the space
That lay our lives between
And dangerous as the foamy race
Of ocean surges green
And haunted as a robber path
Through wilderness or wood
For Might and Right and Woe and Wrath
Between our spirits stood
I dangers dared I hindrance scorned
I omens did defy
Whatever menaced harassed warned
I passed impetuous by
On sped my rainbow fast as light
I flew as in a dream
For glorious rose upon my sight
That child of Shower and Gleam"
Now, the task is to calculate the combined length of the words without the letter 'e' in each line of text. So the first line should give 4, then 5, then 17, etc.
My current code is
for line in open("textname.txt"):
    line_strip = line.strip()
    line_strip_split = line_strip.split()
    for word in line_strip_split:
        if "e" not in word:
            word_e = word
            print (len(word_e))
My reasoning: strip() removes the surrounding whitespace, and split() breaks the line into individual words, so it becomes ['Felt','at','its','kindled','core'], etc. Splitting lets me check each word individually for 'e', keep the words without it, and print the length of the string.
HOWEVER, this prints a separate length for each word rather than one combined total per line, so instead of adding the words together the answer comes out as "4 / 2 / 3".
Try this:
for line in open("textname.txt"):
    line_strip = line.strip()
    line_strip_split = line_strip.split()
    words_with_no_e = []
    for word in line_strip_split:
        if "e" not in word:
            # Add words without 'e' to a new list
            words_with_no_e.append(word)
    # ''.join() concatenates all the elements of the list
    # len() then counts the total length
    print(len(''.join(words_with_no_e)))
It appends all the words without 'e' on each line to a new list, concatenates them, and then prints the length of the result.
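The same idea collapses into a generator expression if you prefer (a sketch on a single sample line, so no file is needed):

```python
line = "The truest love that ever heart"

# Join the e-free words and measure the result in one step
total = len("".join(w for w in line.split() if "e" not in w))
print(total)  # 4 (only "that" lacks an 'e')
```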

Differences in language identification langid

I have a database from which I read. I want to identify the language in a specific cell, defined by column. I am using the langid library for python.
I read from my database like this:
connector = sqlite3.connect("somedb.db")
selecter = connector.cursor()
selecter.execute(''' SELECT tags FROM sometable''')
for row in selecter:  # iterate through all the rows in db
    #print (type(row))  # tuple
    rf = str(row)
    #print (type(rf))  # string
    lan = langid.classify("{}".format(rf))
Technically, it works. It identifies the languages used and later on (not displayed here) writes the identified language back into the database.
So, now comes the weird part.
I wanted to double check some results manually. So I have these words:
a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
When I perform the language identification on the database it plots me Portuguese into the database.
But, performing it like this:
a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
lan = langid.classify(a)
Well, that returns French. Setting aside that the text is neither French nor Portuguese, why does it return different results?!
In the loop row is bound to a tuple with a single item, i.e. ('tags',) - where 'tags' stands for the list of words. str(row) therefore (in Python 3) returns "('tags',)" and it is this string (including the single quotes, commas and braces) that is being passed to langid.classify(). If you are using Python 2 the string becomes "(u'tags',)".
Now, I am not sure that this explains the different language detection result, and my testing in Python 2 shows that it doesn't, but it is an obvious difference between database sourced data and the plain string data.
Possibly there is some encoding issue coming into play. How is the data stored in the database? Text data should be UTF-8 encoded.
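You can see the difference without the database at all; row below mimics what the sqlite3 cursor yields (a sketch with placeholder text):

```python
row = ("shadow party people",)  # a one-item tuple, as sqlite3 returns it

# str(row) drags the tuple punctuation into the text being classified
print(str(row))   # ('shadow party people',)

# Indexing the tuple gives the bare text you actually want to classify
print(row[0])     # shadow party people
```

So in the loop, pass row[0] (not str(row)) to langid.classify().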

Insert quotes in the string using index python

I want to insert quotes("") around the date and text in the string (which is in the file input.txt). Here is my input file:
created_at : October 9, article : ISTANBUL — Turkey is playing a risky game of chicken in its negotiations with NATO partners who want it to join combat operations against the Islamic State group — and it’s blowing back with violence in Turkish cities. As the Islamic militants rampage through Kurdish-held Syrian territory on Turkey’s border, Turkey says it won’t join the fight unless the U.S.-led coalition also goes after the government of Syrian President Bashar Assad.
created_at : October 9, article : President Obama chairs a special meeting of the U.N. Security Council last month. (Timothy A. Clary/AFP/Getty Images) When it comes to President Obama’s domestic agenda and his maneuvers to (try to) get things done, I get it. I understand what he’s up to, what he’s trying to accomplish, his ultimate endgame. But when it comes to his foreign policy, I have to admit to sometimes thinking “whut?” and agreeing with my colleague Ed Rogers’s assessment on the spate of books criticizing Obama’s foreign policy stewardship.
I want to put quotes around the date and text as follows:
created_at : "October 9", article : "ISTANBUL — Turkey is playing a risky game of chicken in its negotiations with NATO partners who want it to join combat operations against the Islamic State group — and it’s blowing back with violence in Turkish cities. As the Islamic militants rampage through Kurdish-held Syrian territory on Turkey’s border, Turkey says it won’t join the fight unless the U.S.-led coalition also goes after the government of Syrian President Bashar Assad".
created_at : "October 9", article : "President Obama chairs a special meeting of the U.N. Security Council last month. (Timothy A. Clary/AFP/Getty Images) When it comes to President Obama’s domestic agenda and his maneuvers to (try to) get things done, I get it. I understand what he’s up to, what he’s trying to accomplish, his ultimate endgame. But when it comes to his foreign policy, I have to admit to sometimes thinking “whut?” and agreeing with my colleague Ed Rogers’s assessment on the spate of books criticizing Obama’s foreign policy stewardship".
Here is my code, which finds the index of the comma (after the date) and the index of "article". Using these, I want to insert quotes around the date. I also want to insert quotes around the text, but how do I do this?
f = open("input.txt", "r")
for line in f:
    article_pos = line.find("article")
    print article_pos
    comma_pos = line.find(",")
    print comma_pos
While you can do this with low-level operations like find and slicing, that's really not the easy or idiomatic way to do it.
First, I'll show you how to do it your way:
comma_pos = line.find(", ")
first_colon_pos = line.find(" : ")
second_colon_pos = line.find(" : ", comma_pos)
line = (line[:first_colon_pos+3] +
        '"' + line[first_colon_pos+3:comma_pos] + '"' +
        line[comma_pos:second_colon_pos+3] +
        '"' + line[second_colon_pos+3:] + '"')
But you can more easily just split the line into bits, munge those bits, and join them back together:
dateline, article = line.split(', ', 1)
key, value = dateline.split(' : ')
dateline = '{} : "{}"'.format(key, value)
key, value = article.split(' : ')
article = '{} : "{}"'.format(key, value)
line = '{}, {}'.format(dateline, article)
And then you can take the repeated parts and refactor them into a simple function so you don't have to write the same thing twice (which may come in handy if you later need to write it four times).
It's even easier using a regular expression, but that might not be as easy to understand for a novice:
line = re.sub(r'(.*?:\s*)(.*?)(\s*,.*?:\s*)(.*)', r'\1"\2"\3"\4"', line)
This works by capturing everything up to the first : (and any spaces after it) in one group, then everything from there to the first comma in a second group, and so on:
(.*?:\s*)(.*?)(\s*,.*?:\s*)(.*)
Notice that the regex has the advantage that I can say "any spaces after it" very simply, while with find or split I had to explicitly specify that there was exactly one space on either side of the colon and one after the comma, because searching for "0 or more spaces" is a lot harder without some way to express it like \s*.
You could also take a look at the regex library re.
E.g.
>>> import re
>>> print(re.sub(r'created_at:\s(.*), article:\s(.*)',
... r'created_at: "\1", article: "\2"',
... 'created_at: October 9, article: ...'))
created_at: "October 9", article: "..."
The first param to re.sub is the pattern you are trying to match. The parens () capture the matches and can be used in the second argument with \1. The third argument is the line of text.

Extracting name from line

I have data in the following format:
Bxxxx, Mxxxx F Birmingham AL (123) 555-2281 NCC Clinical Mental Health, Counselor Education, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029 -99.8115
Axxxx, Axxxx Brown Birmingham AL (123) 555-2281 NCC Clinical Mental Health, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029 -99.8115
Axxxx, Bxxxx Mobile AL (123) 555-8011 NCC Childhood & Adolescence, Clinical Mental Health, Sexual Abuse Recovery, Disaster Counseling English 99.68639 -99.053238
Axxxx, Rxxxx Lunsford Athens AL (123) 555-8119 NCC, NCCC, NCSC Career Development, Childhood & Adolescence, School, Disaster Counseling, Supervision English 99.804501 -99.971283
Axxxx, Mxxxx Mobile AL (123) 555-5963 NCC Clinical Mental Health, Counselor Education, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling, Supervision English 99.68639 -99.053238
Axxxx, Txxxx Mountain Brook AL (123) 555-3099 NCC Addictions and Dependency, Career Development, Childhood & Adolescence, Corrections/Offenders, Sexual Abuse Recovery English 99.50214 -99.75557
Axxxx, Lxxxx Birmingham AL (123) 555-4550 NCC Addictions and Dependency, Eating Disorders English 99.52029 -99.8115
Axxxx, Wxxxx Birmingham AL (123) 555-2328 NCC English 99.52029 -99.8115
Axxxx, Rxxxx Mobile AL (123) 555-9411 NCC Addictions and Dependency, Childhood & Adolescence, Couples & Family, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill English 99.68639 -99.053238
And need to extract only the person names. Ideally, I'd be able to use humanName to get a bunch of name objects with fields name.first, name.middle, name.last, name.title...
I've tried iterating through until I hit the first two consecutive capital letters representing the state, storing everything before them in a list, and then calling humanName, but that was a disaster. I don't want to keep trying this method.
Is there a way to sense the starts and ends of words? That might be helpful...
Recommendations?
Your best bet is to find a different data source. Seriously. This one is farked.
If you can't do that, then I would do some work like this:
1. Replace all double spaces with single spaces.
2. Split the line by spaces.
3. Take the last 2 items in the list. Those are lat and lng.
4. Looping backwards in the list, do a lookup of each item into a list of potential languages. If the lookup fails, you are done with languages.
5. Join the remaining list items back with spaces.
6. In the line, find the first opening paren. Read about 13 or 14 characters in, replace all punctuation with empty strings, and reformat it as a normal phone number.
7. Split the remainder of the line after the phone number by commas.
8. Using that split, loop through each item in the list. If the text starts with more than 1 capital letter, add it to certifications. Otherwise, add it to areas of practice.
9. Going back to the index you found in step #6, get the line up until then. Split it on spaces, and take the last item. That's the state. All that's left is name and city!
10. Take the first 2 items in the space-split line. That's your best guess for name, so far.
11. Look at the 3rd item. If it is a single letter, add it to the name and remove it from the list.
12. Download US.zip from here: http://download.geonames.org/export/zip/US.zip
13. In the US data file, split all of it on tabs. Take the data at indexes 2 and 4, which are city name and state abbreviation. Loop through all the data and insert each row, concatenated as abbreviation + ":" + city name (i.e. AK:Sand Point), into a new list.
14. Make a combination of all possible joins of the remaining items in your line, in the same format as in step #13. So you'd end up with AL:Brown Birmingham and AL:Birmingham for the 2nd line.
15. Loop through each combination and search for it in the list you created in step #13. If you find it, remove it from the split list.
16. Add all remaining items in the string-split list to the person's name.
17. If desired, split the name on the comma: index[0] is the last name, index[1] is all remaining names. Don't make any assumptions about middle names.
Just for giggles, I implemented this. Enjoy.
import itertools

# This list of languages could be longer and should be read from a file
languages = ["English", "Spanish", "Italian", "Japanese", "French",
             "Standard Chinese", "Chinese", "Hindi", "Standard Arabic", "Russian"]
languages = [language.lower() for language in languages]

# Loop through US.txt and format it. Download from geonames.org.
cities = []
with open('US.txt', 'r') as us_data:
    for line in us_data:
        line_split = line.split("\t")
        cities.append("{}:{}".format(line_split[4], line_split[2]))

# This is the dataset
with open('state-teachers.txt', 'r') as teachers:
    next(teachers)  # skip header
    for line in teachers:
        # Replace all double spaces with single spaces
        while line.find("  ") != -1:
            line = line.replace("  ", " ")
        line_split = line.split(" ")

        # Lat/Lon are the last 2 items
        longitude = line_split.pop().strip()
        latitude = line_split.pop().strip()

        # Search for potential languages and trim off the line as we find them
        teacher_languages = []
        while True:
            language_check = line_split[-1]
            if language_check.lower().replace(",", "").strip() in languages:
                teacher_languages.append(language_check)
                del line_split[-1]
            else:
                break

        # Rejoin everything and then use the phone number as the special key to split on
        line = " ".join(line_split)
        phone_start = line.find("(")
        phone = line[phone_start:phone_start+14].strip()
        after_phone = line[phone_start+15:]

        # Certifications can be recognized as acronyms
        # Anything else is assumed to be an area of practice
        certifications = []
        areas_of_practice = []
        specialties = after_phone.split(",")
        for specialty in specialties:
            specialty = specialty.strip()
            if specialty[0:2].upper() == specialty[0:2]:
                certifications.append(specialty)
            else:
                areas_of_practice.append(specialty)

        before_phone = line[0:phone_start-1]
        line_split = before_phone.split(" ")

        # State is the last column before the phone
        state = line_split.pop()

        # Name should be the first 2 columns, at least. This is a basic guess.
        name = line_split[0] + " " + line_split[1]
        line_split = line_split[2:]

        # Add initials
        if len(line_split[0].strip()) == 1:
            name += " " + line_split[0].strip()
            line_split = line_split[1:]

        # Combo of all potential word combinations to see if we're dealing with a city or a name
        combos = [" ".join(combo) for combo in set(itertools.permutations(line_split))] + line_split
        line = " ".join(line_split)
        city = ""

        # See if the state:city combo is valid. If so, set it and let everything else be the name
        for combo in combos:
            if "{}:{}".format(state, combo) in cities:
                city = combo
                line = line.replace(combo, "")
                break

        # Remaining data must be a name
        if line.strip() != "":
            name += " " + line

        # Clean up names
        last_name, first_name = [piece.strip() for piece in name.split(",")]
        print first_name, last_name
Not a code answer, but it looks like you could get most/all of the data you're after from the licensing board at http://www.abec.alabama.gov/rostersearch2.asp?search=%25&submit1=Search. Names are easy to get there.

How to use text.split() and retain blank (empty) lines

New to python, need some help with my program. I have a code which takes in an unformatted text document, does some formatting (sets the pagewidth and the margins), and outputs a new text document. My entire code works fine except for this function which produces the final output.
Here is the segment of the problem code:
def process(document, pagewidth, margins, formats):
    res = []
    onlypw = []
    pwmarg = []
    count = 0
    marg = 0
    for segment in margins:
        for i in range(count, segment[0]):
            res.append(document[i])
        text = ''
        foundmargin = -1
        for i in range(segment[0], segment[1]+1):
            marg = segment[2]
            text = text + '\n' + document[i].strip(' ')
        words = text.split()
Note: segment[0] means the beginning of the document, and segment[1] just means the end of the document, if you are wondering about the range. My problem is that when I copy text to words (in words = text.split()) it does not retain my blank lines. The output I should be getting is:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me.
There now is your insular city of the Manhattoes, belted
round by wharves as Indian isles by coral reefs--commerce
surrounds it with her surf.
And what my current output looks like:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me. There now is your insular city of
the Manhattoes, belted round by wharves as Indian isles by
coral reefs--commerce surrounds it with her surf.
I know the problem happens when I copy text to words, since it doesn't keep the blank lines. How can I make sure it copies the blank lines plus the words?
Please let me know if I should add more code or more detail!
First split on at least 2 newlines, then split on words:
import re
paragraphs = re.split('\n\n+', text)
words = [paragraph.split() for paragraph in paragraphs]
You now have a list of lists, one per paragraph; process these per paragraph, after which you can rejoin the whole thing into new text with double newlines inserted back in.
I've used re.split() to support paragraphs being delimited by more than 2 newlines; you could use a simple text.split('\n\n') if there are ever only going to be exactly 2 newlines between paragraphs.
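A round-trip sketch of that approach (the text here is a short stand-in for the document):

```python
import re

text = "This is my substitute\nfor pistol and ball.\n\nThere now is your\ninsular city.\n"

# Split into paragraphs on blank lines, then each paragraph into words
paragraphs = re.split('\n\n+', text)
words = [paragraph.split() for paragraph in paragraphs]

# ... reflow each paragraph's word list to the desired page width here ...

# Rejoin with the blank lines restored between paragraphs
rebuilt = "\n\n".join(" ".join(ws) for ws in words)
print(rebuilt)
```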
Use a regexp to find the words and the blank lines rather than split:
import re
m = re.compile(r'(\S+|\n\n)')
words = m.findall(text)
