So I matched (with the help of kind contributors on stack overflow) the item number in:
User Number 1 will probably like movie ID: RecommendedItem[item:557, value:7.32173]the most!
Now I'm trying to extract the corresponding name from another text file using the item number. Its contents look like:
557::Voyage to the Bottom of the Sea (1961)::Adventure|Sci-Fi
For some reason I'm just coming up with 'None' on terminal. No matches found.
myfile = open('result.txt', 'r')
myfile2 = open('movies.txt', 'r')
content = myfile2.read()
for line in myfile:
m = re.search(r'(?<=RecommendedItem\[item:)(\d+)',line)
n = re.search(r'(?<=^'+m.group(0)+'\:\:)(\w+)',content)
print n
I'm not sure if I can use a variable in a look behind assertion..
Really appreciate all the help I'm getting here!
EDIT: Turns out the only problem was the unneeded caret symbol in the second regular-expression.
Here, once you've found the number, you use a 'old style' (could equally use .format if you so desired) string format to put it into the regular expression. I thought it'd be nice to access the values via a dictionary hence the named matches, you could do it without this though. To get the a list of genres, just .split("|") the string under suggestionDict["Genres"].
import re
num = 557
suggestion="557::Voyage to the Bottom of the Sea (1961)::Adventure|Sci-Fi"
suggestionDict = re.search(r'%d::(?P<Title>[a-zA-Z0-9 ]+)\s\((?P<Date>\d+)\)::(?P<Genres>[a-zA-Z1-9|]+)' % num, suggestion).groupdict()
#printing to show if it works/doesn't
print('\n'.join(["%s:%s" % (k,d) for k,d in suggestionDict.items()]))
#clearer example of how to use
print("\nCLEAR EXAMPLE:")
print(suggestionDict["Title"])
Prodcuing
Title:Voyage to the Bottom of the Sea
Genres:Adventure|Sci
Date:1961
CLEAR EXAMPLE:
Voyage to the Bottom of the Sea
>>>
Related
I'm trying to get all the substrings under a "customLabel" tag, for example "Month" inside of ...,"customLabel":"Month"},"schema":"metric...
Unusual issue: this is a 1071552 characters long ndjson file, of a single line ("for line in file:" is pointless since there's only one).
The best I found was that:
How to find a substring of text with a known starting point but unknown ending point in python
but if I use this, the result obviously doesn't stop (at Month) and keeps writing the whole remaining of the file, same as if using partition()[2].
Just know that Month is only an example, customLabel has about 300 variants and they are not listed (I'm actually doing this to list them...)
To give some details here's my script so far:
with open("file.ndjson","rt", encoding='utf-8') as ndjson:
filedata = ndjson.read()
x="customLabel"
count=filedata.count(x)
for i in range (count):
if filedata.find(x)>0:
print("Found "+str(i+1))
So right now it properly tells me how many occurences of customLabel there are, I'd like to get the substring that comes after customLabel":" instead (Month in the example) to put them all in a list, to locate them way more easily and enable the use of replace() for traductions later on.
I'd guess regex are the solution but I'm pretty new to that, so I'll post that question by the time I learn about them...
If you want to search for all (even nested) customLabel values like this:
{"customLabel":"Month" , "otherJson" : {"customLabel" : 23525235}}
you can use RegEx patterns with the re module
import re
label_values = []
regex_pattern = r"\"customLabel\"[ ]?:[ ]?([1-9a-zA-z\"]+)"
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
for line in ndjson:
values = re.findall(regex_pattern, line)
label_values.extend(values)
print(label_values) # ['"Month"', '23525235']
# If you don't want the items to have quotations
label_values = [i.replace('"', "") for i in label_values]
print(label_values) # ['Month', '23525235']
Note: If you're only talking about ndjson files and not nested searching, then it'd be better to use the json module to parse the lines and then easily get the value of your specific key which is customLabel.
import json
label = "customLabel"
label_values = []
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
for line in ndjson:
line_json = json.loads(line)
if line_json.get(label) is not None:
label_values.append(line_json.get(label))
print(label_values) # ['Month']
I need to parse a part of my output file that looks like this (Image is also attached for clarity)
y,mz) = (-.504D-04,-.543D-04,-.538D-03)
The expected output is :
the code I have so far looks like below:
class NACParser(ParseSection):
name = "coupling"
which is good but there are some issues:
It only prints from the very last line and this, I think, is because it overwrites due to similar other lines.
This code will only work for this specific molecule and I want something that can work for any molecule. What I mean is : in this example - I have a molecule with 15 atoms and the first atom is c (carbon) , 5th atom is h (hydrogen) and 11th atom is s (sulfur) but the total number of atoms (which is currently 15 ) and the name of atoms can be different when I have different molecule.
So I am wondering how can I write a general code that can work for a general molecule . Any help?
This will to literally what you asked. Maybe you can use this as a basis. I just gather all the atom IDs when I find a line with "ATOM", and create the dict entries when I find a line with "d/d". I would show the output, but I just typed in faked data because I didn't want to retype all of that.
import re
from pprint import pprint
header = r"(\d+ [a-z]{1,2})"
atoms = []
gather = {}
for line in open('x.txt'):
if len(line) < 5:
continue
if 'ATOM' in line:
atoms = re.findall( header, line )
atoms = [s.replace(' ','') for s in atoms]
continue
if '/d' in line:
parts = line.split()
row = parts[0].replace('/','')
for at,val in zip(atoms,parts[1:]):
gather[at+'_'+row] = val
pprint(gather)
Here's the output from your test data. I hope you realize that the cut-and-paste data doesn't match the image. The image uses d/dx, but the cut and paste uses dE/dx. I have assumed you want the "E" in the dict tag too, but that's easy to fix if you don't.
{'10c_dEdx': '0.8337613D-02',
'10c_dEdy': '-0.8171767D-01',
'10c_dEdz': '-0.4316928D-02',
'11s_dEdx': '0.3138990D-01',
'11s_dEdy': '0.3893252D-01',
'11s_dEdz': '0.2767787D-02',
'12h_dEdx': '0.8416159D-02',
'12h_dEdy': '0.3335059D-02',
'12h_dEdz': '0.1357569D-01',
'13h_dEdx': '0.1128067D-02',
'13h_dEdy': '-0.1457401D-01',
'13h_dEdz': '-0.7834375D-03',
'14h_dEdx': '0.8941240D-02',
'14h_dEdy': '0.4869915D-02',
'14h_dEdz': '-0.1273530D-01',
'15h_dEdx': '0.4292434D-03',
'15h_dEdy': '-0.1418384D-01',
'15h_dEdz': '-0.7764904D-03',
'1c_dEdx': '-0.1150239D-01',
'1c_dEdy': '0.4798462D-02',
'1c_dEdz': '0.6015413D-05',
'2c_dEdx': '0.2259669D-01',
'2c_dEdy': '0.5902019D-01',
'2c_dEdz': '0.3707704D-02',
'3c_dEdx': '-0.3153006D-02',
'3c_dEdy': '-0.4060517D-01',
'3c_dEdz': '-0.2306249D-02',
'4n_dEdx': '-0.2718508D-01',
'4n_dEdy': '0.3404657D-01',
'4n_dEdz': '0.1334956D-02',
'5h_dEdx': '-0.1064344D-01',
'5h_dEdy': '-0.1054522D-01',
'5h_dEdz': '-0.8032586D-03',
'6c_dEdx': '0.3017851D-01',
'6c_dEdy': '-0.2805275D-01',
'6c_dEdz': '-0.9413310D-03',
'7s_dEdx': '-0.2253417D-01',
'7s_dEdy': '0.1196856D-01',
'7s_dEdz': '0.2069422D-03',
'8n_dEdx': '-0.3195785D-01',
'8n_dEdy': '0.1888257D-01',
'8n_dEdz': '0.3914382D-03',
'9h_dEdx': '-0.4441489D-02',
'9h_dEdy': '0.1382483D-01',
'9h_dEdz': '0.6724659D-03'}
so i'm new to python besides some experience with tKintner (some GUI experiments).
I read an .mbox file and copy the plain/text in a string. This text contains a registering form. So a Stefan, living in Maple Street, London working for the Company "MultiVendor XXVideos" has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put in a .csv row with column
"Name", "Adress", "Company",...
Now i tried to cut and slice everything. For debugging i use "print"(IDE = KATE/KDE + terminal... :-D ).
Problem is, that the data contains multiple lines after keywords but i only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string
fieldnames = ["ID","Subject","Name", "Adress", "Company"]
searchKeys = [ 'Name_OF_Person','Adress_HOME','Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"
if __name__ == "__main__":
with open(export_file_name,"w") as csvfile:
writer = csv.DictWriter(csvfile, dialect='excel',fieldnames=fieldnames)
writer.writeheader()
for message in mailbox.mbox(mbox_file):
if message.is_multipart():
content = '\n'.join(part.get_payload() for part in message.get_payload())
content = content.split('<')[0] # only want text/plain.. Ill split #right before HTML starts
#print content
else:
content = message.get_payload()
idea = message['message-id']
sub = message['subject']
fr = message['from']
date = message['date']
writer.writerow ('ID':idea,......) # CSV writing will work fine
for line in content.splitlines():
line = line.strip()
for pose in searchKeys:
if pose in line:
tmp = line.split(pose)
pmt = tmp[1].split(":")[1]
if next in line !=:
print pose +"\t"+pmt
sleep(1)
csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the lines are missing..
from file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT2:
#Yaniv
Thank you, iam still trying to understand every step, but just wanted to give a comment. I like the idea to work with the list/matrix/vector "key_value_pairs"
The amount of keywords in the emails is ~20 words. Additionally, my values are sometimes line broken by "=".
I was thinking something like:
Search text for Keyword A,
if true:
search text from Keyword A until keyword B
if true:
copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" are still there
should i replace ["="," = "] with "" ?
I would go for a "routine" parsing loop over the input lines, and maintain a current_key and current_value variables, as a value for a certain key in your data might be "annoying", and spread across multiple lines.
I've demonstrated such parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with a whitespace, I assumed it must be the case of such "annoying" value (spread across multiple lines). Such lines would be concatenated into a single value, using some configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespaces from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
key_value_pairs = []
current_key, current_value = None, ""
for line in text.splitlines():
line_split = line.split(':')
if line.startswith(" ") or len(line_split) == 1:
if current_key is None:
raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
current_value += join_lines_using_this + line.strip()
else:
if current_key is not None:
key_value_pairs.append((current_key, current_value))
current_key, current_value = None, ""
current_key = line_split[0].strip()
# We've just found a new key, so here you might want to perform additional checks,
# e.g. if current_key not in sharedKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
current_value = ':'.join(line_split[1:]).strip()
# Don't forget the last parsed key, value
if current_key is not None:
key_value_pairs.append((current_key, current_value))
return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) Your process is already complex enough, and 2) your question really boils down to how to process the string data across multiple lines. If that is the case, and the pattern is consistent, this will get this one off job done
content = content.replace('\n', ' ')
Then you can split on each of the boundries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1] #take second element of the list
person = content.split("Adress_HOME:")[0] # take content before "Adress Home"
content = content.split("Adress_HOME:")[1] #take second element of the list
address = content.split("Company_NAME:")[0] # take content before
company = content.split("Adress_HOME:")[1] #take second element of the list (the remainder) which is company
Normally, I would suggest regex. (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spend munging data. To make a regex function "cut" across multiple lines, you would use the re.MULTILINE option. So it might endup looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.MULTILINE)
I have an abstract which I've split to sentences in Python. I want to write to 2 tables. One which has the following columns: abstract id (which is the file number that I extracted from my document), sentence id (automatically generated) and each sentence of this abstract on a row.
I would want a table that looks like this
abstractID SentenceID Sentence
a9001755 0000001 Myxococcus xanthus development is regulated by(1st sentence)
a9001755 0000002 The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How to write sentences (each on a row) to table and assign sentenceId as shown above?
This is my code:
import glob;
import re;
import json
org = "NSF Org";
fileNo = "File";
AbstractString = "Abstract";
abstractFlag = False;
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt';
files = glob.glob(path);
for name in files:
fileA = open(name,'r');
for line in fileA:
if line.find(fileNo)!= -1:
file = line[14:]
if line.find(org) != -1:
nsfOrg = line[14:].split()
print file
print nsfOrg
fileA = open(name,'r')
content = fileA.read().split(':')
abstract = content[len(content)-1]
abstract = abstract.replace('\n','')
abstract = abstract.split();
abstract = ' '.join(abstract)
sentences = abstract.split('.')
print sentences
key = str(len(sentences))
print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob
for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
fh = open(filename, 'r')
abstract = fh.read().split(':')[-1]
fh.seek(0) # reset file pointer
# See comments below
for line in fh:
if line.find('File') != -1:
absID = line[14:]
print absID
if line.find('NSF Org') != -1:
print line[14:].split()
# End see comments
fh.close()
concat_abstract = ''.join(abstract.replace('\n', '').split())
for s_id, sentence in enumerate(concat_abstract.split('.')):
# Adjust numeric width arguments to prettify table
print absID.ljust(15),
print '{:06d}'.format(s_id).ljust(15),
print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, and collecting a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which for reasons cited is best to avoid. In the smaller version, you can see how I substituted in string literals in places where they are only used once and called print statements immediately instead of storing the results for later. The results are usually more concise and easily understood.
Program Details:
I am writing a program for python that will need to look through a text file for the line:
Found mode 1 of 12: EV= 1.5185449E+04, f= 19.612545, T= 0.050988.
Problem:
Then after the program has found that line, it will then store the line into an array and get the value 19.612545, from f = 19.612545.
Question:
I so far have been able to store the line into an array after I have found it. However I am having trouble as to what to use after I have stored the string to search through the string, and then extract the information from variable f. Does anyone have any suggestions or tips on how to possibly accomplish this?
Depending upon how you want to go at it, CosmicComputer is right to refer you to Regular Expressions. If your syntax is this simple, you could always do something like:
line = 'Found mode 1 of 12: EV= 1.5185449E+04, f= 19.612545, T= 0.050988.'
splitByComma=line.split(',')
fValue = splitByComma[1].replace('f= ', '').strip()
print(fValue)
Results in 19.612545 being printed (still a string though).
Split your line by commas, grab the 2nd chunk, and break out the f value. Error checking and conversions left up to you!
Using regular expressions here is maddness. Just use string.find as follows: (where string is the name of the variable the holds your string)
index = string.find('f=')
index = index + 2 //skip over = and space
string = string[index:] //cuts things that you don't need
string = string.split(',') //splits the remaining string delimited by comma
your_value = string[0] //extracts the first field
I know its ugly, but its nothing compared with RE.