I have a very long file that I parse with Python regular expressions, one value at a time. For example, here is the code I'm using to print all the values between the <h2> tags:
import os
import re

def query():
    f = open('company.txt', 'r')
    names = re.findall(r'<h2>(.*?)</h2>', f.read(), re.DOTALL)
    for name in names:
        print name

if __name__ == "__main__":
    query()
and I repeat the same thing to print out the area_code, only replacing the pattern passed to findall. This means I have to run the code twice.
My question is: is there a way to run the two queries at the same time and print the results on one line, separated by a pipe (|)?
like so: Planner | B21
Below is the short sample file I'm trying to parse.
<h2>Planner</h2>
area_place = 'City of Angels';
area_code = 'B21';
period = 'Summer';
... more content
<h2>Executive</h2>
area_place = 'London';
area_code = 'D33';
period = 'Winter';
...more content
This is working for me with your test data in Python 2.7, give it a try:
import os
import re

def query():
    f = open('company.txt', 'r')
    names = re.findall(r"<h2>(.+?)</h2>.*?area_code = '(.+?)'", f.read(), re.DOTALL)
    for name in names:
        print name[0] + " | " + name[1]

if __name__ == "__main__":
    query()
Basically, I'm just incorporating both queries into one and then selecting each capture group numerically. You may want to rename "names", since that name makes less sense the way I'm using it.
Alternatively, if you'd like to keep your existing queries and you can assume that they will all be the same length, you could do something like this:
names = re.findall(your names regex)
area_codes = re.findall(your area code regex)
for i in range(len(names)):  # very dangerous: one failed match can leave many entries mismatched!
    print names[i] + " | " + area_codes[i]
However, I would not recommend this approach unless you're extremely confident in the regularity of your data.
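If you do keep two separate passes, a safer pairing than parallel indexing is zip, which pairs the two result lists element-wise. This is a minimal sketch with the sample data inlined; note that zip silently stops at the shorter list, so a failed match drops a row rather than shifting all the later ones:

```python
import re

sample = """<h2>Planner</h2>
area_place = 'City of Angels';
area_code = 'B21';
<h2>Executive</h2>
area_place = 'London';
area_code = 'D33';
"""

names = re.findall(r'<h2>(.*?)</h2>', sample, re.DOTALL)
codes = re.findall(r"area_code = '(.*?)'", sample)

# pair each name with the area code that follows it in the file
pairs = [name + " | " + code for name, code in zip(names, codes)]
for pair in pairs:
    print(pair)
```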
I wrote a script to gather information out of an XML file. Inside, there are ENTITY definitions, and I need a regex to get the value out of them.
<!ENTITY ABC "123">
<!ENTITY BCD "234">
<!ENTITY CDE "345">
First, I open the XML file and save its contents in a variable.
xml = open("file.xml", "r")
lines = xml.readlines()
Then I have a for loop:
result = "ABC"
var_search_result_list = []
var_searcher = "ENTITY\s" + result + '.*"[^"]*"\>'
for line in lines:
    var_search_result = re.match(var_searcher, line)
    if var_search_result != None:
        var_search_result_list += list(var_search_result.groups())
print(var_search_result_list)
I really want to have the value 123 inside of my var_search_result_list list. Instead, I get an empty list every time I use this. Has anybody got a solution?
Thanks in Advance - Toki
There are a few issues in the code.
You are using re.match, which has to match from the start of the string, and your pattern ENTITY\s" + result + '.*"[^"]*"\>' does not match from the start of the given example strings.
Also, if you want to add just 123, you have to use a capture group, read it with var_search_result.group(1), and add it to the result list using append.
For example:
import re

xml = open("file.xml", "r")
lines = xml.readlines()

result = "ABC"
var_search_result_list = []
# raw strings so \s and \> are passed through to the regex engine
var_searcher = r"ENTITY\s" + result + r'.*"([^"]*)"\>'
print(var_searcher)
for line in lines:
    var_search_result = re.search(var_searcher, line)
    if var_search_result:
        var_search_result_list.append(var_search_result.group(1))
print(var_search_result_list)
Output
['123']
A bit more precise pattern could be
<!ENTITY\sABC\s+"([^"]*)"\>
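Applied to the sample data, a generalized version of that tighter pattern can also collect every entity name and value in one pass. This is a sketch with the lines inlined; the entities dict is just for illustration:

```python
import re

lines = [
    '<!ENTITY ABC "123">',
    '<!ENTITY BCD "234">',
    '<!ENTITY CDE "345">',
]

# \w+ generalizes the ABC literal so one pattern covers every entity
entity_re = re.compile(r'<!ENTITY\s+(\w+)\s+"([^"]*)"\s*>')

entities = {}
for line in lines:
    m = entity_re.search(line)
    if m:
        entities[m.group(1)] = m.group(2)

print(entities['ABC'])
```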
I am searching through a text file line by line, and I want to get back all strings that contain the prefix AAAXX1234. For example, my text file has these lines:
Hello my ID is [123423819::AAAXX1234_3412] #I want that(AAAXX1234_3412)
Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212] #I want both of them (AAAXX1234_3413, AAAXX1234_4212)
Hello my ID is [123423819::XXWWF1234_3098] #I don't care about that
The code I have just checks whether the line starts with "Hello my ID is":
with open(file_hrd, 'r', encoding='utf-8') as hrd:
    hrd = hrd.readlines()
    for line in hrd:
        if line.startswith("Hello my ID is"):
            # do something
Try this:
import re

with open(file_hrd, 'r', encoding='utf-8') as hrd:
    res = []
    for line in hrd:
        res += re.findall(r'AAAXX1234_\d+', line)

print(res)
Output:
['AAAXX1234_3412', 'AAAXX1234_3413', 'AAAXX1234_4212']
I'd suggest you parse your lines and extract the information into meaningful parts. That way, you can then use a simple startswith on the ID part of each line. In addition, this also lets you control where you find these prefixes, e.g. in case a line contains additional data that could theoretically also contain something that looks like an ID.
Something like this:
if line.startswith('Hello my ID is '):
    idx_start = line.index('[')
    idx_end = line.index(']', idx_start)
    idx_separator = line.index(':', idx_start, idx_end)
    num = line[idx_start + 1:idx_separator]
    ids = line[idx_separator + 2:idx_end].split(':')
    print(num, ids)
This would give you the following output for your three example lines:
123423819 ['AAAXX1234_3412']
738281937 ['AAAXX1234_3413', 'AAAXX1234_4212']
123423819 ['XXWWF1234_3098']
With that information, you can then check the ids for a prefix (note that any takes a single iterable, so the check is written as a generator expression):
if any(x.startswith('AAAXX1234') for x in ids):
    print('do something')
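Putting the parsing and the prefix check together, a self-contained sketch with the three example lines inlined might look like this:

```python
lines = [
    "Hello my ID is [123423819::AAAXX1234_3412]",
    "Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212]",
    "Hello my ID is [123423819::XXWWF1234_3098]",
]

matching = []
for line in lines:
    if not line.startswith('Hello my ID is '):
        continue
    # locate the [num::id:id:...] block and slice out the ID list
    idx_start = line.index('[')
    idx_end = line.index(']', idx_start)
    idx_separator = line.index(':', idx_start, idx_end)
    ids = line[idx_separator + 2:idx_end].split(':')
    # keep only the IDs carrying the wanted prefix
    matching.extend(x for x in ids if x.startswith('AAAXX1234'))

print(matching)
```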
Using regular expressions through the re module and its findall() function should be enough:
import re

with open('file.txt') as file:
    prefix = 'AAAXX1234'
    lines = file.read().splitlines()
    output = list()
    for line in lines:
        output.extend(re.findall(rf'{prefix}_\d+', line))
You can also do it with findall and the regex r'AAAXX1234_[0-9]+': it finds every part of the string that starts with AAAXX1234_ and then grabs all of the digits after it. Change + to * if you want it to match 'AAAXX1234_' on its own as well.
I want to write a Python UDF for Pig that reads lines from a file like
#'prefix.csv'
spol.
LLC
Oy
OOD
and matches the names; if it finds any matches, it replaces them with white space. Here is my Python code:
def list_files2(name, f):
    fin = open(f, 'r')
    for line in fin:
        final = name
        extra = 'nothing'
        if (name != name.replace(line.strip(), ' ')):
            extra = line.strip()
            final = name.replace(line.strip(), ' ').strip()
            return final, extra, 'insdie if'
    return final, extra, 'inside for'
Running this code in Python,
>print list_files2('LLC nakisa', 'prefix.csv' )
>print list_files2('AG company', 'prefix.csv' )
returns
('nakisa', 'LLC', 'insdie if')
('AG company', 'nothing', 'inside for')
which is exactly what I need. But when I register this code as a UDF in Apache Pig for this sample list:
nakisa company LLC
three Oy
AG Lans
Test OOD
Pig returns a wrong answer on the third line:
((nakisa company,LLC,insdie if))
((three,Oy,insdie if))
((A G L a n s,,insdie if))
((Test,OOD,insdie if))
The question is why the UDF enters the if branch for the third entry, which does not have any match in the prefix.csv file.
I don't know Pig, but the way you are checking for a match is strange and might be the cause of your problem.
If you want to check whether a string is a substring of another, python provides
the find method on strings:
if name.find(line.strip()) != -1:
    # find returns the first index of the substring, or -1 if it was not found
    # ... do some stuff
Additionally, your code might leave the file handle open. A much better approach to file operations is the with statement, which ensures that the file handle gets closed in any case (short of an interpreter crash).
with open(filename, "r") as file_:
# Everything within this block can use the opened file.
Last but not least, Python provides a module called csv with a reader and a writer that handle the parsing of the CSV file format.
Thus, you could try the following code and check if it returns the correct thing:
import csv

def list_files2(name, filename):
    with open(filename, 'rb') as file_:
        final = name
        extra = "nothing"
        for row in csv.reader(file_):
            # csv.reader yields a list per line; each line holds one token
            prefix = row[0]
            if name.find(prefix) != -1:
                extra = prefix
                final = name.replace(prefix, " ").strip()
                return final, extra, "inside if"
        return final, extra, "inside for"
Because your file is named prefix.csv, I assume you want to do prefix substitution. In this case, you could use startswith instead of find for the check, and replace the line final = name.replace(prefix, " ") with final = name[len(prefix):].strip(). This ensures that only an actual prefix gets substituted, not an occurrence in the middle of the name.
I hope this helps.
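To make the startswith variant concrete, here is a minimal, testable sketch (strip_known_prefix is a hypothetical name, and the prefix list is inlined instead of being read from prefix.csv):

```python
def strip_known_prefix(name, prefixes):
    # substitute only when the name actually starts with the token,
    # so 'AG Lans' is never mangled by a partial match elsewhere
    for prefix in prefixes:
        if name.startswith(prefix):
            return name[len(prefix):].strip(), prefix
    return name, 'nothing'

prefixes = ['spol.', 'LLC', 'Oy', 'OOD']
print(strip_known_prefix('LLC nakisa', prefixes))
print(strip_known_prefix('AG company', prefixes))
```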
I have an abstract which I've split into sentences in Python. I want to write to 2 tables. One has the following columns: abstract ID (the file number extracted from the document), sentence ID (automatically generated), and each sentence of the abstract on its own row.
I would want a table that looks like this
abstractID SentenceID Sentence
a9001755 0000001 Myxococcus xanthus development is regulated by(1st sentence)
a9001755 0000002 The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How to write sentences (each on a row) to table and assign sentenceId as shown above?
This is my code:
import glob;
import re;
import json

org = "NSF Org";
fileNo = "File";
AbstractString = "Abstract";
abstractFlag = False;
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt';
files = glob.glob(path);
for name in files:
    fileA = open(name, 'r');
    for line in fileA:
        if line.find(fileNo) != -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name, 'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n', '')
    abstract = abstract.split();
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0)  # reset file pointer

    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:]
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments

    fh.close()
    # normalize whitespace, then split the abstract into sentences
    concat_abstract = ' '.join(abstract.replace('\n', '').split())
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In the section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not, because the loop keeps overwriting your variables each time they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do this properly, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it, if this is in its header) rather than looping over it line by line.
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, and collecting a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which for reasons cited is best to avoid. In the smaller version, you can see how I substituted in string literals in places where they are only used once and called print statements immediately instead of storing the results for later. The results are usually more concise and easily understood.
Using Python, I want to extract the data rows shown below to a CSV file from a bunch of JavaScript files which contain hardcoded data like this:
....html code....
hotels[0] = new hotelData();
hotels[0].hotelName = "MANHATTAN";
hotels[0].hotelPhone = "";
hotels[0].hotelSalesPhone = "";
hotels[0].hotelPhone = 'Phone: 888-350-6432';
hotels[0].hotelStreet = "787 11TH AVENUE";
hotels[0].hotelCity = "NEW YORK";
hotels[0].hotelState = "NY";
hotels[0].hotelZip = "10019";
hotels[0].hotelId = "51543";
hotels[0].hotelLat = "40.7686";;
hotels[0].hotelLong = "-73.992645";;
hotels[1] = new hotelData();
hotels[1].hotelName = "KOEPPEL";
hotels[1].hotelPhone = "";
hotels[1].hotelSalesPhone = "";
hotels[1].hotelPhone = 'Phone: 718-721-9100';
hotels[1].hotelStreet = "57-01 NORTHERN BLVD.";
hotels[1].hotelCity = "WOODSIDE";
hotels[1].hotelState = "NY";
hotels[1].hotelZip = "11377";
hotels[1].hotelId = "51582";
hotels[1].hotelLat = "40.75362";;
hotels[1].hotelLong = "-73.90366";;
var mykey = "AlvQ9gNhp7oNuvjhkalD4OWVs_9LvGHg0ZLG9cWwRdAUbsy-ZIW1N9uVSU0V4X-8";
var map = null;
var pins = null;
var i = null;
var boxes = new Array();
var currentBox = null;
var mapOptions = {
credentials: mykey,
enableSearchLogo: false,
showMapTypeSelector: false,
enableClickableLogo: false
}
.....html code .....
Hence the required csv output would be like rows of the above data:
MANHATTAN,,,Phone: 888-350-6432 ...
KOEPPEL,,,Phone: 718-721-9100 ...
Should I use a code-generation tool to parse the above statements directly? What is the most efficient Python method to transform such data, contained in thousands of JavaScript files, into CSV tabular format?
Update:
Ideally, I would like the solution to parse the JavaScript statements as Python objects and then store them to CSV, to gain maximum independence from the ordering and formatting of the input script code.
I'd recommend using a regular expression to pick out all the "hotels[#]..." lines, then adding all of the results to a dictionary, and finally writing the dictionary out to a CSV file. The following should work:
import re
import csv

src_text = your_javascript_text

# \. escapes the attribute-access dot; named groups label the pieces
p = re.compile(r'hotels\[(?P<hotelid>\d+)\]\.(?P<attr>\w+) = ("|\')(?P<attr_val>.*?)("|\');', re.DOTALL)

hotels = {}
fieldnames = []
for result in [m.groupdict() for m in p.finditer(src_text)]:
    if int(result['hotelid']) not in hotels:
        hotels[int(result['hotelid'])] = {}
    if result['attr'] not in fieldnames:
        fieldnames.append(result['attr'])
    hotels[int(result['hotelid'])][result['attr']] = result['attr_val']

output = open('hotels.csv', 'wb')
csv_writer = csv.DictWriter(output, delimiter=',', fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
csv_writer.writerow(dict((f, f) for f in fieldnames))
for hotel_id, hotel in sorted(hotels.items()):
    csv_writer.writerow(hotel)
output.close()
You now have a dictionary of hotels with attributes, grouped by the ID in the JavaScript, as well as the output file "hotels.csv" (with a header row and proper escaping). I did things like named groups which really aren't necessary, but I find them more self-commenting.
It should be noted that if the same attribute is assigned twice in the JavaScript, like hotelPhone, only the last value is stored.
When dealing with this type of problem, it falls to you and your judgment how much tolerance and sanitation you need. You may need to modify the regular expression to handle examples not in the small sample provided (i.e. change the capture groups, restrict matches to those at the start of a line, etc.); or escape newline characters; or strip out certain text (e.g. "Phone: " in the phone numbers). There's no real way for us to know this, so keep that in mind.
Cheers!
If this is something you will have to do routinely and you want to make the process fully automatic I think the easiest would be just to parse the files using Python and then write to csv using the csv Python module.
Your code could look somewhat like this:
import csv

def write_to_csv(hotel_data):
    # append mode, so earlier rows are not overwritten on each call
    with open('hotels.csv', 'ab') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_MINIMAL)
        spamwriter.writerow(hotel_data)

with open("datafile.txt") as f:
    hotel_data = []
    for line in f:
        # Let's make sure the line is not empty
        if line:
            if "new hotelData();" in line:
                if hotel_data:
                    write_to_csv(hotel_data)
                hotel_data = []
            else:
                # Data still has the ending quote and semicolon
                data = line.split("= ")[1]
                # Remove ending quote and semicolon
                data = data[:-2]
                hotel_data.append(data)
Beware that I have not tested this code, it is just meant to help you and point you in the right direction, it is not the complete solution.
If each hotel has every field declared in your files (i.e. if all of the hotels have the same number of lines, even if some of them are empty), you may try to use a simple regular expression to extract every value surrounded by quotes ("xxx"), and then group the results by count (for example, group every 5 fields into a single line and then add a line break).
A simple regex that would work is ["'][^"']*["'] (EDIT: this is because some fields (e.g. Phone) use single quotes and the rest use double quotes).
To make the search, use findall:
pattern = r'["\'][^"\']*["\']'
compPattern = re.compile(pattern)
results = compPattern.findall(src_text)
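Concretely, the grouping step described above could be sketched like this. A capture group is added to the pattern so the quotes themselves are dropped, and 3 fields per hotel is an assumption made just to keep the demo short:

```python
import re

src_text = '''
hotels[0].hotelName = "MANHATTAN";
hotels[0].hotelCity = "NEW YORK";
hotels[0].hotelState = "NY";
hotels[1].hotelName = "KOEPPEL";
hotels[1].hotelCity = "WOODSIDE";
hotels[1].hotelState = "NY";
'''

values = re.findall(r'["\']([^"\']*)["\']', src_text)
# group every 3 consecutive values into one row (assumes every hotel
# declares the same fields in the same order)
rows = [values[i:i + 3] for i in range(0, len(values), 3)]
print(rows)
```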