String Cutting with multiple lines - python

so I'm new to Python besides some experience with Tkinter (some GUI experiments).
I read an .mbox file and copy the text/plain part into a string. This text contains a registration form. So a Stefan, living in Maple Street, London, working for the company "MultiVendor XXVideos", has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put it in a .csv row with columns
"Name", "Adress", "Company", ...
Now I tried to cut and slice everything. For debugging I use print (IDE = Kate/KDE + terminal... :-D).
The problem is that the data contains multiple lines after the keywords, but I only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string

fieldnames = ["ID", "Subject", "Name", "Adress", "Company"]
searchKeys = ['Name_OF_Person', 'Adress_HOME', 'Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"

if __name__ == "__main__":
    with open(export_file_name, "w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
        writer.writeheader()
        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0] # only want text/plain.. Ill split right before HTML starts
                #print content
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            writer.writerow ('ID':idea,......) # CSV writing will work fine
            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        if next in line !=:
                            print pose + "\t" + pmt
            sleep(1)
    csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the following lines are missing. In the file they read:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT2:
@Yaniv
Thank you, I am still trying to understand every step, but I just wanted to give a comment. I like the idea of working with the list/matrix/vector "key_value_pairs".
The amount of keywords in the emails is ~20 words. Additionally, my values are sometimes line-broken by "=".
I was thinking something like:
search text for keyword A,
if true:
    search text from keyword A until keyword B
    if true:
        copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" are still there
should i replace ["="," = "] with "" ?
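Aside: those "=" signs are quoted-printable artifacts, not literal content. =20 is an encoded space, =3D an encoded "=", and a bare "=" at the end of a line is a soft line break. Rather than replacing them by hand, you can decode them properly. A minimal sketch with the standard quopri module (the raw string is just a hypothetical fragment like the ones above):

import quopri

raw = "Name_OF_=\nPerson: Stefan\nAdress_HOME: London, Maple\nStreet\n45"
# soft line breaks ("=" at end of line) vanish, "=3D" turns back into "="
decoded = quopri.decodestring(raw.encode()).decode("utf-8", errors="replace")
print(decoded)  # Name_OF_Person: Stefan ...

Inside the mailbox loop, part.get_payload(decode=True) achieves the same by honoring the Content-Transfer-Encoding header (it returns bytes, so decode afterwards).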

I would go for a "routine" parsing loop over the input lines, and maintain a current_key and current_value variables, as a value for a certain key in your data might be "annoying", and spread across multiple lines.
I've demonstrated such parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with a whitespace, I assumed it must be the case of such "annoying" value (spread across multiple lines). Such lines would be concatenated into a single value, using some configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespaces from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in searchKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
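To feed these pairs back into your CSV, a minimal sketch (the key_to_column mapping is hypothetical; adjust it to your real keywords and fieldnames):

import csv

fieldnames = ["ID", "Subject", "Name", "Adress", "Company"]
# hypothetical mapping from form keywords to CSV columns
key_to_column = {"Name_OF_Person": "Name", "Adress_HOME": "Adress", "Company_NAME": "Company"}

with open("test.csv", "w") as csvfile:
    writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
    writer.writeheader()
    row = {key_to_column[key]: value for key, value in parse_funky_text(text) if key in key_to_column}
    writer.writerow(row)  # columns without a value (ID, Subject) are left empty by DictWriter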

You indicate in the comments that your input strings in the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) your process is already complex enough, and 2) your question really boils down to how to process the string data across multiple lines. If that is the case, and the pattern is consistent, this will get this one-off job done.
content = content.replace('\n', ' ')
Then you can split on each of the boundaries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1]  # take second element of the list
person = content.split("Adress_HOME:")[0]      # take content before "Adress_HOME"
content = content.split("Adress_HOME:")[1]     # take second element of the list
address = content.split("Company_NAME:")[0]    # take content before "Company_NAME"
company = content.split("Company_NAME:")[1]    # take second element of the list (the remainder), which is company
Normally, I would suggest regex (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spent munging data. To make a regex pattern "cut" across multiple lines, you would use the re.DOTALL option, which lets . match newlines (re.MULTILINE only changes how ^ and $ match). So it might end up looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL).
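For instance, a minimal sketch of that regex approach (assuming the three keywords always appear in this order, each exactly once):

import re

# hypothetical example input, mirroring the form above
content = "Name_OF_Person: Stefan\nAdress_HOME: London, Maple\nStreet\n45\nCompany_NAME: MultiVendor\nXXVideos"
match = re.search(r'Name_OF_Person:(.*)Adress_HOME:(.*)Company_NAME:(.*)', content, re.DOTALL)
if match:
    person, address, company = (group.strip() for group in match.groups())
    print(person)   # Stefan
    print(address)  # still multi-line; replace '\n' with ' ' if needed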

Related

How to remove dash/hyphen from each line in .txt file

I wrote a little program to turn pages from book scans into a .txt file. On some lines, words are split and moved to another line. I wonder if there is any way to remove the dashes and merge them with the syllables in the line below?
E.g.:
effects on the skin is fully under-
stood one fights
to:
effects on the skin is fully understood
one fights
or:
effects on the skin is fully
understood one fights
Or something like that. As long as it was connected. Python is my third language and so far I can't think of anything, so maybe someone will give me a hint.
Edit:
The point is that if the last symbol of a line is a dash, it is removed and the fragment is merged with the rest of the word on the line below.
This is a generator which takes the input line-by-line. If it ends with a - it extracts the last word and holds it over for the next line. It then yields any held-over word from the previous line combined with the current line.
To combine the results back into a single block of text, you can join it against the line separator of your choice:
source = """effects on the skin is fully under-
stood one fights
check-out Daft Punk's new sin-
le "Get Lucky" if you hav-
e the chance. Sound of the sum-
mer."""
def reflow(text):
holdover = ""
for line in text.splitlines():
if line.endswith("-"):
lin, _, e = line.rpartition(" ")
else:
lin, e = line, ""
yield f"{holdover}{lin}"
holdover = e[:-1]
print("\n".join(reflow(source)))
""" which is:
effects on the skin is fully
understood one fights
check-out Daft Punk's new
single "Get Lucky" if you
have the chance. Sound of the
summer.
"""
To read one file line-by-line and write directly to a new file:
def reflow(infile, outfile):
    with open(infile) as source, open(outfile, "w") as dest:
        holdover = ""
        for line in source.readlines():
            line = line.rstrip("\n")
            if line.endswith("-"):
                lin, _, e = line.rpartition(" ")
            else:
                lin, e = line, ""
            dest.write(f"{holdover}{lin}\n")
            holdover = e[:-1]

if __name__ == "__main__":
    reflow("source.txt", "dest.txt")
Here is one way to do it:
with open('test.txt') as file:
    combined_strings = []
    merge_line = False
    for item in file:
        item = item.replace('\n', '')  # remove newline character at end of line
        if item.endswith('-'):  # check that the dash is the last character
            merge_line = True
            combined_strings.append(item[:-1])
        elif merge_line:
            merge_line = False
            combined_strings[-1] = combined_strings[-1] + item
        else:
            combined_strings.append(item)
If you just parse the line as a string, you can utilize the .split() function to move these kinds of items around:
words = "effects on the skin is fully under-\nstood one fights"

# splitting among the newlines
wordsSplit = words.split("\n")

# splitting among the word spaces
for i in range(len(wordsSplit)):
    wordsSplit[i] = wordsSplit[i].split(" ")

# checking for the end of line hyphens
for i in range(len(wordsSplit)):
    for g in range(len(wordsSplit[i])):
        if wordsSplit[i][g].endswith("-") and i + 1 < len(wordsSplit):
            # setting the new word in the list and removing the hyphen
            wordsSplit[i][g] = wordsSplit[i][g][0:-1] + wordsSplit[i+1][0]
            wordsSplit[i+1][0] = ""

# recreating the string
msg = ""
for i in range(len(wordsSplit)):
    for g in range(len(wordsSplit[i])):
        if wordsSplit[i][g] != "":
            msg += wordsSplit[i][g] + " "
What this does is split by the newlines, which is where the hyphens usually occur. Then it splits those into a smaller array by word. Then it checks for end-of-line hyphens, and if it finds one it removes the hyphen, appends the first word of the next line, and sets that word to an empty string. Finally, it reconstructs the string into a variable called msg, skipping any empty strings left over in the split array.
What about
import re
a = '''effects on the skin is fully under-
stood one fights'''
re.sub(r'-~([a-zA-Z0-9]*) ', r'\1\n', a.replace('\n', '~')).replace('~','\n')
Explanation
a.replace('\n', '~') joins the input into one line, with ~ instead of \n (you need to choose some other character if ~ can occur in the text).
The regex -~([a-zA-Z0-9]*) then matches the hyphen, the line break, and the following word fragment; the () group captures the fragment so that the replacement '\1\n' can re-insert it followed by a newline.
.replace('~','\n') finally replaces all remaining ~ chars with newlines.
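If you don't care where the line break ends up afterwards, a much simpler sketch is to delete each hyphen-newline pair outright (this merges the two lines into one; like the approach above, it will also merge genuinely hyphenated words that happen to end a line):

import re

text = "effects on the skin is fully under-\nstood one fights"
print(re.sub(r'-\n', '', text))
# effects on the skin is fully understood one fights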

extract data at specific columns in a line if there is any data at them

I have a file with lines of data like below. I need to pull out the characters at columns 74-79 and 122-124; some lines will not have any characters at 74-79, and I want to skip those lines.
import re

def main():
    file = open("CCDATA.TXT", "r")
    lines = file.readlines()
    file.close()
    for line in lines:
        lines = re.sub(r" +", " ", line)
        print(lines)

main()
CF214L214L1671310491084111159 Customer Name 46081 171638440 0000320800000000HCCCIUAW 0612170609170609170300000000003135
CF214L214L1671310491107111509 Customer Name 46144 171639547 0000421200000000DRNRIUAW 0612170613170613170300000000003135
CF214L214L1671380999999900002000007420
CF214L214L1671310491084111159 Customer Name 46081 171638440 0000320800000000DRCSIU 0612170609170609170300000000003135
CF214L214L1671380999999900001000003208
CF214L214L1671510446646410055 Customer Name 46436 171677320 0000027200000272AA 0616170623170623170300000050003001
CF214L214L1671510126566110169 Customer Name 46450 171677321 0000117900001179AA 0616170623170623170300000250003001
CF214L214L1671510063942910172 Customer Name 46413 171677322 0000159300001593AA 0616170623170623170300000150003001
CF214L214L1671510808861010253 Customer Name 46448 171677323 0000298600002986AA 0616170623170623170300000350003001
CF214L214L1671510077309510502 Customer Name 46434 171677324 0000294300002943AA 0616170622170622170300000150003001
CF214L214L1671580999999900029000077728
CF214L214L1671610049631611165 Customer Name 46221 171677648 0000178700000000 0616170619170619170300000000003000
CF214L214L1671610895609911978 Customer Name 46433 171677348 0000011800000118AC 0616170622170622170300000150003041
CF214L214L1671680999999900002000001905
Short answer:
Just take line[74:79] and such, as Roelant suggested. Since the lines in your input are always 230 chars long though, there'll never be an IndexError, so you rather need to check whether the result is all whitespace with .isspace():
field = line[74:79]
<...>
if field.isspace(): continue
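Put together, a minimal sketch of that skip-if-blank loop (assuming every line is padded to the full 230-character width, so the slices are never empty):

with open("CCDATA.TXT") as f:
    for line in f:
        field = line[74:79]
        if field.isspace():
            continue  # nothing at columns 74-79: skip this line
        print(field, line[122:124])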
A more robust approach that would also validate the input (check if you're required to do so) is to parse the entire line and use a specific element from the result.
One way is a regex, as per "Parse a text file and extract a specific column", "Tips for reading in a complex file - Python", and an example at "get the path in a file inside {} by python".
But your specific format appears to be an archaic, punchcard-derived one, with the column number defining the datum's meaning. Such a format can probably be expressed more conveniently as a sequence of column spans associated with field names (you never told us what they mean, so I'm using generic names):
fields = [
    ("id1", (0, 39)),
    ("cname_text", (40, 73)),
    ("num2", (74, 79)),
    ("num3", (96, 105)),
    # whether to introduce a separate field at [122:125]
    # or parse "id4" further after getting it is up to you.
    # I'd suggest you follow the official format spec.
    ("id4", (106, 130)),
    ("num5", (134, 168)),
]
line_end = 230
And parsed like this:
def parse_line(line, fields, end):
    result = {}
    # for whitespace validation
    # prev_ecol = 0
    for fname, (scol, ecol) in fields:
        # optionally validate delimiting whitespace
        # assert prev_ecol == scol or line[prev_ecol:scol].isspace()
        # lines in the input are always `end' symbols wide, so IndexError will never happen for a valid input
        field = line[scol:ecol]
        # optionally do conversion and such, this is completely up to you
        field = field.rstrip(' ')
        if not field:
            field = None
        result[fname] = field
        # for whitespace validation
        # prev_ecol = ecol
    # optionally validate line end
    # assert ecol == end or line[ecol:end].isspace()
    return result
All that's left is to skip lines where a required field is empty:
for line in lines:
    data = parse_line(line, fields, line_end)
    if any(data[fname] is None for fname in ('num2', 'id4')):
        continue
    # handle the data
def read_all_lines(filename='CCDATA.TXT'):
    with open(filename, "r") as file:
        for line in file:
            try:
                first = line[74:79]
                second = line[122:124]
            except IndexError:
                continue  # skip line
            else:
                do_something_with(first, second)
Edit: Thanks for commenting, apparently it should have been:
for line in file:
    first = line[74:79]
    second = line[122:124]
    if set(first) != set(' ') and set(second) != set(' '):
        do_something_with(first, second)

Creating a table which has sentences from a paragraph each on a row with Python

I have an abstract which I've split into sentences in Python. I want to write to 2 tables. One has the following columns: abstract ID (which is the file number that I extracted from my document), sentence ID (automatically generated), and each sentence of this abstract on a row.
I would want a table that looks like this:
abstractID    SentenceID    Sentence
a9001755      0000001       Myxococcus xanthus development is regulated by (1st sentence)
a9001755      0000002       The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How do I write the sentences (each on a row) to the table and assign the sentence ID as shown above?
This is my code:
import glob
import re
import json

org = "NSF Org"
fileNo = "File"
AbstractString = "Abstract"
abstractFlag = False
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt'
files = glob.glob(path)

for name in files:
    fileA = open(name, 'r')
    for line in fileA:
        if line.find(fileNo) != -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name, 'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n', '')
    abstract = abstract.split()
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer
    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:]
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments
    fh.close()
    concat_abstract = ' '.join(abstract.split())  # collapse all whitespace (incl. newlines) to single spaces
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, and collecting a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which for reasons cited is best to avoid. In the smaller version, you can see how I substituted in string literals in places where they are only used once and called print statements immediately instead of storing the results for later. The results are usually more concise and easily understood.
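If the goal is a file rather than console output, a small follow-up sketch using the csv module (reusing absID and concat_abstract from the code above; the 7-digit zero padding matches your example table):

import csv

# open with 'wb' on Python 2, or add newline='' on Python 3
with open('sentences.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(['abstractID', 'SentenceID', 'Sentence'])
    for s_id, sentence in enumerate(concat_abstract.split('.'), start=1):
        writer.writerow([absID.strip(), '{:07d}'.format(s_id), sentence.strip()])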

retrieving name from number ID

I have code that takes data from online, where items are referred to by a number ID, compares data about those items, and builds a list of item ID numbers based on some criteria. What I'm struggling with is taking this list of numbers and turning it into a list of names. I have a text file with the numbers and corresponding names, but am having trouble using it because it contains multi-word names and retains the \n at the end of each line when I try to parse the file in any way with Python. The text file looks like this:
number name\n
14 apple\n
27 anjou pear\n
36 asian pear\n
7645 langsat\n
I have tried split(), as well as replacing the whitespace in between with several different things, to no avail. I asked a question earlier which yielded a lot of progress, but it still didn't quite work. The two methods that were suggested were:
d = dict()
f = open('file.txt', 'r')
for line in f:
    number, name = line.split(None, 1)
    d[number] = name
This almost worked, but still left me with the \n, so if I call d['14'] I get 'apple\n'. The other method was:
import re
f = open('file.txt', 'r')
fr = f.read()
r = re.findall(r"(\w+)\s+(.+)", fr)
This seemed to have gotten rid of the \n at the end of every name, but leaves me with the problem of having a tuple for each number-name combo as a single entry, so if I were to say r[1] I would get ('14', 'apple'). I really don't want to delete each newline by hand on all ~8400 entries...
Any recommendations on how to get the corresponding name given a number from a file like this?
In your first method, change the line d[number] = name to d[number] = name[:-1]. This simply strips off the last character, which is your \n.
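A slightly safer variant is .rstrip('\n'), which also copes with a final line that lacks a trailing newline. A sketch of the whole loop:

d = {}
with open('file.txt') as f:
    for line in f:
        number, name = line.split(None, 1)
        d[number] = name.rstrip('\n')

print(d['14'])  # apple
# (the header row harmlessly becomes d['number'] = 'name')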
names = {}
with open("id_file.txt") as inf:
    header = next(inf, '')  # skip header row
    for line in inf:
        id, name = line.split(None, 1)
        names[int(id)] = name.strip()

names[27]  # => 'anjou pear'
Use this to modify your first approach:
raw_dict = dict()
cleaned_dict = dict()
Assuming you've imported the file into a dictionary:
raw_dict = {14: "apple\n", 27: "anjou pear\n", 36: "asian pear\n", 7645: "langsat\n"}
for key in raw_dict:
    cleaned_dict[key] = raw_dict[key][:-1]
So now, cleaned_dict is equal to:
{27: 'anjou pear', 36: 'asian pear', 7645: 'langsat', 14: 'apple'}
*Edited to add first sentence.

Using variables in a reg-ex

So I matched (with the help of kind contributors on stack overflow) the item number in:
User Number 1 will probably like movie ID: RecommendedItem[item:557, value:7.32173]the most!
Now I'm trying to extract the corresponding name from another text file using the item number. Its contents look like:
557::Voyage to the Bottom of the Sea (1961)::Adventure|Sci-Fi
For some reason I'm just coming up with 'None' on the terminal. No matches found.
import re

myfile = open('result.txt', 'r')
myfile2 = open('movies.txt', 'r')
content = myfile2.read()
for line in myfile:
    m = re.search(r'(?<=RecommendedItem\[item:)(\d+)', line)
    n = re.search(r'(?<=^' + m.group(0) + r'\:\:)(\w+)', content)
    print n
I'm not sure if I can use a variable in a look-behind assertion...
I really appreciate all the help I'm getting here!
EDIT: Turns out the only problem was the unneeded caret symbol in the second regular expression.
Here, once you've found the number, you use an 'old style' string format (you could equally use .format if you so desired) to put it into the regular expression. I thought it'd be nice to access the values via a dictionary, hence the named matches, though you could do it without them. To get a list of genres, just .split("|") the string under suggestionDict["Genres"].
import re
num = 557
suggestion="557::Voyage to the Bottom of the Sea (1961)::Adventure|Sci-Fi"
suggestionDict = re.search(r'%d::(?P<Title>[a-zA-Z0-9 ]+)\s\((?P<Date>\d+)\)::(?P<Genres>[a-zA-Z1-9|]+)' % num, suggestion).groupdict()
#printing to show if it works/doesn't
print('\n'.join(["%s:%s" % (k,d) for k,d in suggestionDict.items()]))
#clearer example of how to use
print("\nCLEAR EXAMPLE:")
print(suggestionDict["Title"])
Producing
Title:Voyage to the Bottom of the Sea
Genres:Adventure|Sci
Date:1961
CLEAR EXAMPLE:
Voyage to the Bottom of the Sea
>>>
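And per the note above, turning the genre string into a list is just a split (a tiny follow-up; note that adding '-' to the Genres character class, i.e. [a-zA-Z1-9|-], would keep "Sci-Fi" intact):

genres = suggestionDict["Genres"].split("|")
print(genres)  # ['Adventure', 'Sci']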
