Read and process data from URL in Python

I am trying to get the data from a URL. Below is the format of the data.
What I am trying to do:
1) Read line by line and find whether the line contains the desired keyword.
2) If yes, store the previous line's content ("GETCONTENT") in a list.
<http://www.example.com/XYZ/a-b-c/w#>DONTGETCONTENT
a <http://www.example.com/XYZ/mount/v1#NNNN> ,
<http://www.w3.org/2002/w#Individual> ;
<http://www.w3.org/2000/01/rdf-schema#label>
"some content , "some url content ;
<http://www.example.com/XYZ/log/v1#hasRelation>
<http://www.example.com/XYZ/data/v1#Change> ;
<http://www.example.com/XYZ/log/v1#ServicePage>
<https://dev.org.net/apis/someLabel> ;
<http://www.example.com/XYZ/log/v1#Description>
"Some API Content .
<http://www.example.com/XYZ/model/v1#GETBBBBBB>
a <http://www.w3.org/01/07/w#BBBBBB> ;
<http://www.w3.org/2000/01/schema#domain>
<http://www.example.com/XYZ/data/v1#xyz> ;
<http://www.w3.org/2000/01/schema#label1>
"some content , "some url content ;
<http://www.w3.org/2000/01/schema#range>
<http://www.w3.org/2001/XMLSchema#boolean> ;
<http://www.example.com/XYZ/log/v1#Description>
"Some description .
<http://www.example.com/XYZ/datamodel-ee/v1#GETAAAAAA>
a <http://www.w3.org/01/07/w#AAAAAA> ;
<http://www.w3.org/2000/01/schema#domain>
<http://www.example.com/XYZ/data/v1#Version> ;
<http://www.w3.org/2000/01/schema#label>
"some content ;
<http://www.w3.org/2000/01/schema#range>
<http://www.example.com/XYZ/data/v1#uuu> .
<http://www.example.com/XYZ/datamodel/v1#GETCCCCCC>
a <http://www.w3.org/01/07/w#CCCCCC ,
<http://www.w3.org/2002/07/w#Name>
<http://www.w3.org/2000/01/schema#domain>
<http://www.example.com/XYZ/data/v1#xyz> ;
<http://www.w3.org/2000/01/schema#label1>
"some content , "some url content ;
<http://www.w3.org/2000/01/schema#range>
<http://www.w3.org/2001/XMLSchema#boolean> ;
<http://www.example.com/XYZ/log/v1#Description>
"Some description .
Below is the code I have tried so far, but it prints all the content of the file:
import re

def read_from_url():
    try:
        from urllib.request import urlopen
    except ImportError:
        from urllib2 import urlopen
    url_link = "examle.com"
    html = urlopen(url_link)
    previous = None
    for line in html:
        previous = line
        line = re.search(r"^(\s*a\s*)|\#GETBBBBBB|#GETAAAAAA|#GETCCCCCC\b",
                         line.decode('UTF-8'))
        print(previous)

if __name__ == '__main__':
    read_from_url()
Expected output:
GETBBBBBB , GETAAAAAA , GETCCCCCC
Thanks in advance!!

When it comes to reading data from URLs, the requests library is much simpler:
import requests
url = "https://www.example.com/your/target.html"
text = requests.get(url).text
If you haven't got it installed you could use the following to do so:
pip3 install requests
Next, why go through the hassle of shoving all of your words into a single regular expression when you could use a list of search words and a for loop instead?
For example:
import re

search_words = "hello word world".split(" ")
matching_lines = []
for i, line in enumerate(text.splitlines()):
    line = line.strip()
    if len(line) < 1:
        continue
    for word in search_words:
        if re.search(r"\b" + word + r"\b", line):
            matching_lines.append(line)
            break
Then you'd output the result, like this:
print(matching_lines)
Running this where the text variable equals:
"""
this word will save the line
ignore me!
hello my friend!
what about me?
"""
Should output:
[
"this word will save the line",
"hello my friend!"
]
You could make the search case insensitive by using the lower method, like this:
search_words = "hello word world".lower().split(" ")
matching_lines = []
for i, line in enumerate(text.splitlines()):
    line = line.strip()
    if len(line) < 1:
        continue
    line = line.lower()
    for word in search_words:
        if re.search(r"\b" + word + r"\b", line):
            matching_lines.append(line)
            break
Notes and information:
the break keyword stops the inner loop after the first matching word, so each line is appended at most once
the enumerate function allows us to iterate over both the index and the current line
I lowercased the search words once, up front, so lower isn't called again for every word on every line
I didn't call lower on the line until after the empty-line check, because there's no point in lowercasing an empty line
Good luck.

I'm puzzled about a few things, and answering them may help the community better assist you. Specifically, I can't tell what form the file is in (i.e., is it a txt file, or a URL you're making a request to and parsing the response of?). I also can't tell if you're trying to get the entire line, just the URL, or just the bit that follows the hash symbol.
Nonetheless, you stated you were looking for the program to output GETBBBBBB , GETAAAAAA , GETCCCCCC, and here's a quick way to get those specific values (assuming the values are in the form of a string):
search = re.findall(r'#(GET[ABC]{6})>', string)
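For example, applied to text fetched from the URL (a minimal sketch; the URL is a placeholder, and the pattern assumes the sample data shown in the question):
import re
from urllib.request import urlopen

text = urlopen("http://www.example.com/your/target").read().decode("utf-8")
# '#GET' followed by six A/B/C characters and a closing '>', as in the sample
print(re.findall(r'#(GET[ABC]{6})>', text))
# expected, given the sample: ['GETBBBBBB', 'GETAAAAAA', 'GETCCCCCC']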
Otherwise, if you're reading from a txt file, this may help:
import re

with open('example_file.txt', 'r') as file:
    lst = []
    for line in file:
        search = re.findall(r'#(GET[ABC]{6})', line)
        if search != []:
            lst += search
    print(lst)
Of course, these are just some quick suggestions in case they may be of help. Otherwise, please answer the questions I mentioned at the beginning of my response and maybe it can help someone on SO better understand what you're looking to get.

Related

Get the full word(s) by knowing only a part of it

I am searching through a text file line by line and I want to get back all strings that contain the prefix AAAXX1234. For example, in my text file I have these lines:
Hello my ID is [123423819::AAAXX1234_3412]  #I want that (AAAXX1234_3412)
Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212]  #I want both of them (AAAXX1234_3413, AAAXX1234_4212)
Hello my ID is [123423819::XXWWF1234_3098]  #I don't care about that
The code I have just checks whether the line starts with "Hello my ID is":
with open(file_hrd, 'r', encoding='utf-8') as hrd:
    hrd = hrd.readlines()
    for line in hrd:
        if line.startswith("Hello my ID is"):
            # do something
            pass
Try this:
import re

with open(file_hrd, 'r', encoding='utf-8') as hrd:
    res = []
    for line in hrd:
        res += re.findall(r'AAAXX1234_\d+', line)
print(res)
Output:
['AAAXX1234_3412', 'AAAXX1234_3413', 'AAAXX1234_4212']
I’d suggest you parse your lines and extract the information into meaningful parts. That way, you can then use a simple startswith on the ID part of your line. In addition, this also lets you control where you find these prefixes, e.g. in case the lines contain additional data that could theoretically contain something that looks like an ID.
Something like this:
if line.startswith('Hello my ID is '):
    idx_start = line.index('[')
    idx_end = line.index(']', idx_start)
    idx_separator = line.index(':', idx_start, idx_end)
    num = line[idx_start + 1:idx_separator]
    ids = line[idx_separator + 2:idx_end].split(':')
    print(num, ids)
This would give you the following output for your three example lines:
123423819 ['AAAXX1234_3412']
738281937 ['AAAXX1234_3413', 'AAAXX1234_4212']
123423819 ['XXWWF1234_3098']
With that information, you can then check the ids for a prefix:
if any(x.startswith('AAAXX1234') for x in ids):
    print('do something')
Using regular expressions through the re module and its findall() function should be enough:
import re

with open('file.txt') as file:
    prefix = 'AAAXX1234'
    lines = file.read().splitlines()
    output = list()
    for line in lines:
        output.extend(re.findall(rf'{prefix}_\d+', line))
You can also do it with findall and the regex r'AAAXX1234_[0-9]+': it finds all parts of the string that start with AAAXX1234_ and then grabs all of the digits after it. Change + to * if you want it to match 'AAAXX1234_' on its own as well.
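For example, with one of the sample lines from the question (a quick sketch):
>>> import re
>>> line = "Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212]"
>>> re.findall(r'AAAXX1234_[0-9]+', line)
['AAAXX1234_3413', 'AAAXX1234_4212']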

Counting phrase frequencies in an html file

I'm currently trying to get used to Python and have recently hit a block in my coding. I can't get a piece of code running that counts the number of times a phrase appears in an HTML file. I've recently received some help constructing the code for counting the frequency in a text file, but am wondering whether there is a way to do this directly from the HTML file (to bypass the copy-and-paste alternative). Any advice will be sincerely appreciated. The previous code I have used is the following:
#!/bin/env python 3.3.2
import collections
import re
from itertools import tee

# Defining a function named "findWords".
def findWords(filepath):
    with open(filepath) as infile:
        for line in infile:
            words = re.findall(r'\w+', line.lower())
            yield from words

phcnt = collections.Counter()
phrases = {'central bank', 'high inflation'}
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2)
for w1, w2 in zip(fw1, fw2):
    phrase = ' '.join([w1, w2])
    if phrase in phrases:
        phcnt[phrase] += 1
print(phcnt)
You can use the some_str.count(some_phrase) method:
In [19]: txt = 'Text mining, also referred to as text data mining, Text mining,\
also referred to as text data mining,'
In [20]: txt.lower().count('data mining')
Out[20]: 2
What about just stripping the html tags before doing the analysis? html2text does this job quite well.
import html2text
content = html2text.html2text(infile.read())
would give you the text content (somewhat formatted, but that is not a problem for your approach, I think). There are additional options to ignore images and links, which you would use like this:
h = html2text.HTML2Text()
h.ignore_images = True
h.ignore_links = True
content = h.handle(infile.read())
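Combining that with the counting approach above would look something like this (a sketch; the file name is hypothetical):
import html2text

with open('02.2003.BenBernanke.html') as infile:
    content = html2text.html2text(infile.read()).lower()
# count each phrase directly in the tag-stripped text
for phrase in ('central bank', 'high inflation'):
    print(phrase, content.count(phrase))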

python, string.replace() and \n

(Edit: the script seems to work for others here trying to help. Is it because I'm running Python 2.7? I'm really at a loss...)
I have a raw text file of a book I am trying to tag with pages.
Say the text file is:
some words on this line,
1
DOCUMENT TITLE some more words here too.
2
DOCUMENT TITLE and finally still more words.
I am trying to use python to modify the example text to read:
some words on this line,
</pg>
<pg n=2>some more words here too,
</pg>
<pg n=3>and finally still more words.
My strategy is to load the text file as a string, build search-for and replace-with strings corresponding to a list of numbers, replace all instances in the string, and write to a new file.
Here is the code I've written:
from sys import argv
script, input, output = argv

textin = open(input, 'r')
bookstring = textin.read()
textin.close()

pages = []
x = 1
while x < 400:
    pages.append(x)
    x = x + 1

pagedel = "DOCUMENT TITLE"
for i in pages:
    pgdel = "%d\n%s" % (i, pagedel)
    nplus = i + 1
    htmlpg = "</p>\n<p n=%d>" % nplus
    bookstring = bookstring.replace(pgdel, htmlpg)

textout = open(output, 'w')
textout.write(bookstring)
textout.close()

print "Updates to %s printed to %s" % (input, output)
The script runs without error, but it also makes no changes whatsoever to the input text. It simply reprints it character for character.
Does my mistake have to do with the hard return? \n? Any help greatly appreciated.
In Python, strings are immutable, so replace returns the replaced output instead of replacing the string in place.
You must do:
bookstring = bookstring.replace(pgdel, htmlpg)
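A quick illustration of the difference (replace returns a new string and leaves the original untouched):
>>> s = 'abc'
>>> s.replace('a', 'x')
'xbc'
>>> s
'abc'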
You've also forgotten to call the close() function. See how you have textin.close? You have to call it with parentheses, like open:
textin.close()
Your code works for me, but I might just add some more tips:
input is a built-in function, so perhaps try renaming that variable. Although it works normally, it might not for you.
When running the script, don't forget to put the .txt ending:
$ python myscript.py file1.txt file2.txt
Make sure when testing your script to clear the contents of file2.
I hope these help!
Here's an entirely different approach that uses re (import the re module for this to work):
import re

doctitle = False
newstr = ''
page = 1
for line in bookstring.splitlines():
    res = re.match(r'^\d+', line)
    if doctitle:
        newstr += '<pg n=' + str(page) + '>' + re.sub('^DOCUMENT TITLE ', '', line)
        doctitle = False
    elif res:
        doctitle = True
        page += 1
        newstr += '\n</pg>\n'
    else:
        newstr += line
print newstr
Since no one knows what's going on, it's worth a try.

Python next substring search

I am transmitting a message with a pre/postamble multiple times. I want to be able to extract the message between two valid pre/postambles. My current code is:
print(msgfile[msgfile.find(preamble) + len(preamble):msgfile.find(postamble, msgfile.find(preamble))])
The problem is that if the postamble is corrupt, it will print all data between the first valid preamble and the next valid postamble. An example received text file would be:
garbagePREAMBLEmessagePOSTcMBLEgarbage
garbagePRdAMBLEmessagePOSTAMBLEgarbage
garbagePREAMBLEmessagePOSTAMBLEgarbage
and it will print
messagePOSTcMBLEgarbage
garbagePRdEAMBLEmessage
but what I really want it to print is the message from the third line, since it has both a valid pre- and postamble. So I guess what I want is to be able to find and index from the next instance of a substring. Is there an easy way to do this?
Edit: I don't expect my data to be in nice discrete lines. I just formatted it that way so it would be easier to see.
Process it line by line:
>>> test = "garbagePREAMBLEmessagePOSTcMBLEgarbage\n"
>>> test += "garbagePRdAMBLEmessagePOSTAMBLEgarbage\n"
>>> test += "garbagePREAMBLEmessagePOSTAMBLEgarbage\n"
>>> for line in test.splitlines():
...     if line.find(preamble) != -1 and line.find(postamble) != -1:
...         print(line[line.find(preamble) + len(preamble):line.find(postamble)])
Are all messages on single lines?
Then you can use regular expressions to identify lines with valid pre- and postamble:
import re

input_file = open(yourfilename)
pat = re.compile('PREAMBLE(.+)POSTAMBLE')
messages = [pat.search(line).group(1) for line in input_file
            if pat.search(line)]
print messages
import re

lines = ["garbagePREAMBLEmessagePOSTcMBLEgarbage",
         "garbagePRdAMBLEmessagePOSTAMBLEgarbage",
         "garbagePREAMBLEmessagePOSTAMBLEgarbage"]

# you can use a regex
my_regex = re.compile("garbagePREAMBLE(.*?)POSTAMBLEgarbage")
# get the match found between the preambles and print it
for line in lines:
    found = my_regex.match(line)
    # if there is a match, print it
    if found:
        print(found.group(1))

# or you can use string slicing
def validate(pre, post, lines):
    for line in lines:
        # skip any line shorter than the two ambles together
        if len(line) < len(pre) + len(post):
            print("error: line is too small")
            continue
        # see if the line fits the pattern
        if line[:len(pre)] == pre and line[-len(post):] == post:
            # print the message between the ambles
            print(line[len(pre):-len(post)])

validate("garbagePREAMBLE", "POSTAMBLEgarbage", lines)

Help parsing text file in python

I've really been struggling with this one for some time now. I have many text files with a specific format from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parameters for parsing, ensuring I get all the info correctly.
The format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of the whitespace and 2) separating the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (tmp_PA, tmp_K, tmp_DETAILS))
At the minute, I can pull the K & PA fields no problem using this; however, my DETAILS is only pulling one word. I need the entire sentence, or at least 25 chars of it.
Thanks very much for reading and I hope you can help! :)
You are splitting the whole line into words. You need to split into first word, second word and the rest. Like line.split(None, 2).
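For example, with one of the sample lines from the question (a quick sketch):
>>> line = "2 4565434 i need this sentace as one DB record"
>>> line.split(None, 2)
['2', '4565434', 'i need this sentace as one DB record']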
I would probably use regular expressions, and use the opposite logic: if the line starts with a number 1 through 5, use it; otherwise pass. Like:
import re

pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r')  # no calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using prepared statements. Parsing SQL is orders of magnitude slower than executing it.
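For instance, with sqlite3 the statement is parsed once and the parameters are bound per row (a sketch; the database file and sample row are assumptions, the table layout follows the snippets above):
import sqlite3

conn = sqlite3.connect('bugs.db')
cu = conn.cursor()
rows = [('4565434', '2', 'i need this sentace as one DB record')]
# one parse of the INSERT, many executions with bound parameters
cu.executemany('INSERT INTO bugInfo2 (pa, k, details) VALUES (?,?,?)', rows)
conn.commit()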
If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = file(filename, 'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K', 'PA', 'DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None, 2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K, PA, tokens[2]))
f.close()
for r in records:
    print r  # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re:
import re

stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")

for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print result.group('third')
        print result.group('second')
        print result.group('first')
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re

stuff = open("source", "r").readlines()

# create a regular expression using subpatterns.
# 'first', 'second' and 'third' are our own tags;
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")

for line in stuff:
    result = juicy_info.search(line)
    if result:  # do stuff with the data here, just use the tags we declared earlier
        print result.group('third')
        print result.group('second')
        print result.group('first')
import re

reg = re.compile('K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3*'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')

with open('XX.txt') as f:
    mat = reg.search(f.read())

for tripl in ((2,1,3), (5,4,6), (8,7,9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches the following characters:
blank, '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol representing more than what is to be matched, with the risk of matching erratic newlines at places where they shouldn't be.
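For instance, this shows \s happily matching across a newline while [ \t] does not:
>>> import re
>>> re.findall(r'1\s+2', '1\n2')
['1\n2']
>>> re.findall(r'1[ \t]+2', '1\n2')
[]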
Edit
It may be sufficient to do:
import re

reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$', re.MULTILINE)

with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2, 1, 3))
