Using Beautiful Soup to find a name in an HTML document - python

Hey, I've tried for a while and I can't figure out how to identify the name using the soup.find function. The item that I'm looking for is identified by ,"name": so how do I find it if it is in something like this? The text continues above and below.
,"100002078216989":{"watermark":1488952059387,"action":1488954831234},"100002219436413":{"watermark":1488717577383,"action":1488717619845},"100003348640283":{"watermark":1489154862229,"action":1489158262774},"100004986371453":{"watermark":1489154862229,"action":1489154866065}}],[]],["MDynaTemplate","registerTemplates",[],[{"URLg3i":["MMessageSourceTextTemplate","\u003Cspan
class=\"source mfss
fcg\">[[text]]\u003C/span>"],"DHGslp":["MMessageSourceTextWithLinkTemplate","\u003Cspan
class=\"mfss fcg\">\u003Ca
href=\"[[\u0025UNESCAPED]][[download_href]]\">[[text]]\u003C/a>\u003C/span>"],"vSvEYy":["MReadReceiptTextTemplate","\u003Cspan
class=\"mfss
fcg\">[[text]]\u003C/span>"]}],[]],["MShortProfiles","set",[],["Value",{"id":"Value","name":"Value","firstName":"Value","vanity":"Value","thumbSrc":null

Here is my solution:
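from bs4 import BeautifulSoup

def get_name(self, file):
    s = BeautifulSoup(open(file), "lxml")
    # iterate over the children of the first <p> tag
    for item in s.find("p"):
        print("The base item: \n" + item + "\n")
        # keep everything after the last "name":" marker
        item = item.split("name\":\"")
        print("1st split: \n" + item[-1] + "\n")
        # cut at the next "," separator to isolate the value
        item = item[-1].split("\",\"")
        print("2nd split: \n" + item[0] + "\n")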
Output:
The base item:
"100002078216989":{"watermark":1488952059387,"action":1488954831234},"100002219436413":{"watermark":1488717577383,"action":1488717619845},"100003348640283":{"watermark":1489154862229,"action":1489158262774},"100004986371453":{"watermark":1489154862229,"action":1489154866065}}],[]],["MDynaTemplate","registerTemplates",[],[{"URLg3i":["MMessageSourceTextTemplate","\u003Cspan class=\"source mfss fcg\">[[text]]\u003C/span>"],"DHGslp":["MMessageSourceTextWithLinkTemplate","\u003Cspan class=\"mfss fcg\">\u003Ca href=\"[[\u0025UNESCAPED]][[download_href]]\">[[text]]\u003C/a>\u003C/span>"],"vSvEYy":["MReadReceiptTextTemplate","\u003Cspan class=\"mfss fcg\">[[text]]\u003C/span>"]}],[]],["MShortProfiles","set",[],["Value",{"id":"Value","name":"Value","firstName":"Value","vanity":"Value","thumbSrc":null
1st split:
Value","firstName":"Value","vanity":"Value","thumbSrc":null
2nd split:
Value
Your HTML file is not well formed, so this string splitting is the best way I could find. It should still suit your need.
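If the text is available as a plain string, a regular expression may be less fragile than chained splits. The following is only a sketch under that assumption: the function name find_names and the pattern are illustrative and not part of the original answer. It collects every value stored directly under a "name" key and decodes JSON escapes such as \u00e9.

```python
import json
import re

# Matches "name": followed by a JSON string literal (escaped quotes allowed).
# Keys such as "firstName" are not matched because the opening quote must
# come directly before name.
NAME_FIELD = re.compile(r'"name":("(?:[^"\\]|\\.)*")')


def find_names(text):
    """Return every value stored under a "name" key in the raw text."""
    # json.loads turns the quoted literal into a normal Python string.
    return [json.loads(literal) for literal in NAME_FIELD.findall(text)]
```

With Beautiful Soup, the string passed in could be the text of the tag that holds the data, for example str(item) inside the loop above.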

Related

Pull data using regex and insert into a .csv file

So I am using regex to pull data from a webpage. Done.
Now I am trying to insert this data into a .csv file. No problem, right?
I am having trouble getting my data out of the loops I created so that I can insert it into the .csv file. It looks like the best way to conquer this is to create a list, and somehow insert the data into the list and write the data into the csv file. But how can I do that with my current setup?
import re
import sqlite3 as lite
import mysql.connector
import urllib.request
from bs4 import BeautifulSoup
import csv
#We're pulling info on socks from e-commerce site Aliexpress
url="https://www.aliexpress.com/premium/socks.html?SearchText=socks&ltype=wholesale&d=y&tc=ppc&blanktest=0&initiative_id=SB_20171202125044&origin=y&catId=0&isViewCP=y"
req = urllib.request.urlopen(url)
soup = BeautifulSoup(req, "html.parser")
div = soup.find_all("div", attrs={"class":"item"})
for item in div:
    title_pattern = '<img alt="(.*?)\"'
    comp = re.compile(title_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        print(x)
    price_pattern = 'itemprop="price">(.*?)<'
    comp = re.compile(price_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        print(x)
    seller_pattern = '<a class="store j-p4plog".*?>(.*?)<'
    comp = re.compile(seller_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        print(x)
    orders_pattern = '<em title="Total Orders">.*?<'
    comp = re.compile(orders_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        print(x[32:-1])
    feedback_pattern = '<a class="rate-num j-p4plog".*?>(.*)<'
    comp = re.compile(feedback_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        print(x)
# Creation and insertion of CSV file
# csvfile = "aliexpress.csv"
# csv = open(csvfile, "w")
# columnTitleRow = "Title,Price,Seller,Orders,Feedback,Pair"
# csv.write(columnTitleRow)
#
# for stuff in div:
#     title =
#     price =
#     seller =
#     orders =
#     feedback =
#     row = title + "," + price + "," + seller + "," + orders + "," + feedback + "," + "\n"
#     csv.write(row)
I want to be able to print these lists by their row.
It looks like the best way to conquer this is to create a list, and somehow insert the data into the list and write the data into the csv file. But how can I do that with my current setup?
Yes you're right. Replace your print statements with appends to a list:
data = []
for item in div:
    title_pattern = '<img alt="(.*?)\"'
    comp = re.compile(title_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        data.append(x)
    price_pattern = 'itemprop="price">(.*?)<'
    comp = re.compile(price_pattern)
    href = re.findall(comp, str(item))
    for x in href:
        data.append(x)
And then later, with f being the output file opened for writing:
csv.writer(f).writerow(data)
From what I remember, a csv writer's writerow takes a list and not a rendered CSV string anyway. That's the whole point: it takes the raw data, escapes it properly, and adds the commas for you.
Edit: As explained in the comment, I misremembered the interface to csv writer. writerow takes a list, not write. Updated.
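Note that a single flat list ends up as one very long row. If one CSV row per product is wanted, a possible sketch looks like the following. It assumes each item is the HTML of one product block, reuses only the title and price patterns from the question, and the helper names (first_match, extract_rows, write_rows) are made up for illustration.

```python
import csv
import re

# Patterns taken from the question; the other fields would follow the same idea.
TITLE = re.compile(r'<img alt="(.*?)"')
PRICE = re.compile(r'itemprop="price">(.*?)<')


def first_match(pattern, text):
    """Return the first captured group, or an empty string if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else ""


def extract_rows(items):
    """Build one [title, price] row per item so the columns stay aligned."""
    return [
        [first_match(TITLE, str(item)), first_match(PRICE, str(item))]
        for item in items
    ]


def write_rows(rows, handle):
    """Write a header line and then one CSV line per row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Title", "Price"])
    writer.writerows(rows)
```

In the question's script this could be called as write_rows(extract_rows(div), f), where f comes from open("aliexpress.csv", "w", newline=""). A missing field becomes an empty cell instead of shifting the later columns.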

Python: Replace() takes no effect on write to file

I'm working on a web parser for a webpage containing mathematical constants. I need to replace some characters to get a specific format. When I print the result it seems to work fine, but when I open the output file, the formatting done by replace() doesn't seem to have taken effect.
Here is the code:
#!/usr/bin/env python3
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://www.ebyte.it/library/educards/constants/ConstantsOfPhysicsAndMath.html"
soup = BeautifulSoup(urlopen(url).read(), "html5lib")
f = open("ebyteParse-output.txt", "w")
table = soup.find("table", attrs={"class": "grid9"})
rows = table.findAll("tr")
for tr in rows:
    # If its a category of constants we write that as a comment
    if tr.has_attr("bgcolor"):
        f.write("\n\n# " + tr.find(text=True) + "\n")
        continue
    cols = tr.findAll("td")
    if (len(cols) >= 2):
        if (cols[0]["class"][0] == "box" or cols[0]["class"][0] == "boxi" and cols[1]["class"][0] == "boxa"):
            constant = str(cols[0].find(text=True)).replace(" ", "-")
            value = str(cols[1].find(text=True))
            value = value.replace(" ", "").replace("...", "").replace("[", "").replace("]", "")
            print(constant + "\t" + value)
            f.write(constant + "\t" + value)
            f.write("\n")
f.close()
That is what print shows:
That is what I get on the output file
Thank you,
Salva
The file I was looking at was cached, so no changes were seen. Thanks for answering.

loops and replacing in python

I seem to have hit a wall with my script. I'm trying to make it grab the text of a commentary from a website and put in some basic XML tags. It grabs everything on a page and that needs to be fixed, but that's a secondary concern right now. I've gotten the script to split the text into chapters, but I can't figure out how to further divide it into the verses. I'm trying to replace every occurrence of "Verse" in a chapter with </verse><verse name = "n">, with "n" being the verse number. I've tried a few things, including for loops and ElementTree, but it either doesn't work or makes every verse name the same.
I tried putting in the following code, but it never seemed to complete when I tried it:
x = "Verse"
for x in para:
    para = para.replace (x, '</verse><verse name = " ' +str(n+1) + ' " >' )
    n = n + 1
The code below seems to be the most...functional that I've managed to make it. Any advice on how I should fix this or what else I might try?
from lxml import html
import requests
name = open("new.txt", "a")
name.write("""<?xml version="1.0"?>""")
name.write("<data>")
n = 0
for i in range(0, 17):
    url_base = "http://www.studylight.org/commentaries/acc/view.cgi?bk=45&ch="
    url_norm = url_base + str(i)
    page = requests.get(url_norm)
    tree = html.fromstring(page.text)
    para = tree.xpath('/html/body/div[2]//table//text()')
    name.write("<chapter name =\"" + str(i) + "\" >")
    para = str(para)
    para = para.replace("&", " ")
    para = para.replace ("Verse", '</verse><verse name = " ' +str(n+1) + ' " >' )
    name.write(str(para))
    name.write("</chapter>")
name.write("</data>")
name.close()
print "done"
You shouldn't be changing the text directly; when manipulating an XHTML document, use XSLT.
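For reference, str.replace substitutes every occurrence in one call with the same replacement string, which is why every verse name comes out the same. The earlier loop most likely stalls because for x in para rebinds x to each element of para instead of to "Verse". The following is not the XSLT approach suggested above, just a minimal string-based sketch: it assumes each verse starts at the literal word "Verse", and the function name tag_verses is made up for illustration.

```python
import itertools
import re


def tag_verses(chapter_text):
    """Replace each 'Verse' marker with a <verse> tag numbered 1, 2, 3, ..."""
    counter = itertools.count(1)

    def numbered(match):
        n = next(counter)
        # Close the previous verse before opening the next one,
        # except in front of the very first verse.
        prefix = "</verse>" if n > 1 else ""
        return '%s<verse name="%d">' % (prefix, n)

    # re.subn calls numbered() once per match, so each tag gets its own number.
    tagged, count = re.subn(r"\bVerse\b", numbered, chapter_text)
    if count:
        tagged += "</verse>"  # close the last verse of the chapter
    return tagged
```

Because the counter is created inside the function, numbering restarts at 1 for each chapter that is passed in.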

Python: Print Result Only

import os
def find_method(name):
    i = 0
    found_dic = { "$_GET":[], "$_POST":[], "include":[], "require":[], "mysql_query":[], "SELECT":[], "system":[], "exec":[], "passthru":[], "readfile":[], "fopen":[], "eval":[] }
    for x in file(name, "r"):
        i += 1
        for key in found_dic:
            if x.strip().find(key) != -1:
                found_dic[key].append("LINE:"+str(i)+":" + x.strip())
    print "="*20, name, "="*20
    for key in found_dic:
        if found_dic[key]:
            print " ", "-"*10, key, "-"*10
            for r in found_dic[key]:
                print " ",r
def search(dirname):
    flist = os.listdir(dirname)
    for f in flist:
        next = os.path.join(dirname, f)
        if os.path.isdir(next):
            search(next)
        else:
            doFileWork(next)
def doFileWork(filename):
    ext = os.path.splitext(filename)[-1]
    #if ext == '.html': print filename
    if ext == '.php':
        # print "target:" + filename
        find_method(filename)
How can I print only the results? It prints the name of every file even when the file doesn't have any results in it. I want it to print the file name only if the file has results.
This is about searching for words, but it shows every match: when searching for include, it also finds the word inside a sentence and prints the whole sentence. I want to find only the word "include" on its own, not where it is included in a sentence. It's really hard to explain. I hope you understand, sorry.
It looks like there may be a problem with the indentation of the first print command: you are printing 'name', but it is outside of the for loop.
Try populating your dictionary, and then printing the dictionary, along the lines of:
with open(your_file) as f:
    found_dic = {}
    key = 'your_key'
    # populate the dictionary
    found_dic[key] = [i for i in f if key in i and i not in found_dic]
With this as a starting point, hopefully you can format the result to the dictionary as you need it. Only lines that include the 'key' will be in the found_dic, so you should be able to print these out in any format you like.
Hope this helps
I hope that's what you asked for:
found = False
for i, line in enumerate(file(name, "r"), 1):
    for key in found_dic:
        if key in line.strip():
            found_dic[key].append("LINE:"+str(i)+":" + key)
            found = True
if found:
    print "="*20, name, "="*20
    for key in found_dic:
        if found_dic[key]:
            print " ", "-"*10, key, "-"*10
            for r in found_dic[key]:
                print " ",r
You have to check whether you found something if you only want to print the name when there are actual results. Also, the append now concatenates only key, because key is what you search for, and you only want to add what you searched for.
Further changes:
I used the enumerate function in the outer loop; it's far easier and more readable than incrementing your own i.
I also changed the condition to use the in keyword, which is the simpler and more readable way.
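If whole-word matching is what is wanted, so that include does not also hit included, one option is a regular expression with lookarounds. This is a Python 3 sketch rather than a drop-in replacement for the Python 2 code above: the shortened keyword list and the names find_keywords and report are assumptions made for illustration. Lookarounds are used instead of \b because keys such as $_GET start with a non-word character.

```python
import re

# Hypothetical, shortened keyword list; the question's dictionary has more entries.
KEYWORDS = ["$_GET", "$_POST", "include", "require", "eval"]


def find_keywords(lines, keywords=KEYWORDS):
    """Map each keyword to the (line number, text) pairs where it occurs as a whole word."""
    patterns = {
        key: re.compile(r"(?<!\w)" + re.escape(key) + r"(?!\w)")
        for key in keywords
    }
    found = {}
    for number, line in enumerate(lines, 1):
        for key, pattern in patterns.items():
            if pattern.search(line):
                found.setdefault(key, []).append((number, line.strip()))
    return found


def report(name, lines):
    """Print a header and the hits for one file, but only if there are hits."""
    found = find_keywords(lines)
    if not found:
        return False
    print("=" * 20, name, "=" * 20)
    for key, hits in found.items():
        print(" ", "-" * 10, key, "-" * 10)
        for number, text in hits:
            print("  LINE:%d:%s" % (number, text))
    return True
```

A file could be passed in with something like `with open(filename) as handle: report(filename, handle)`. Files without hits print nothing at all.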

Extract specific entries from blastx output file, write to new file

I have created a script that successfully searches for keywords (specified by user) within a Blastx output file in XML format. Now, I need to write those records (query, hit, score, evalue, etc) that contain the keyword in the alignment title to a new file.
I have created separate lists for each of the query titles, hit title, e-value and alignment lengths but cannot seem to write them to a new file.
Problem #1: what if Python errors, and one of the lists is missing a value...? Then all the other lists will be giving wrong information in reference to the query ("line slippage", if you will...).
Problem #2: even if Python doesn't error, and all the lists are the same length, how can I write them to a file so that the first item in each list is associated with each other (and thus, item #10 from each list is also associated?) Should I create a dictionary instead?
Problem #3: dictionaries have only a single value for a key, so what if my query has several different hits? I'm not sure if it will be overwritten or skipped, or if it will just error. Any suggestions? My current script:
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
import re
#obtain full path to blast output file (*.xml)
outfile = input("Full path to Blast output file (XML format only): ")
#obtain string to search for
search_string = input("String to search for: ")
#open the output file
result_handle = open(outfile)
#parse the blast record
blast_records = NCBIXML.parse(result_handle)
#initialize lists
query_list=[]
hit_list=[]
expect_list=[]
length_list=[]
#create 'for loop' that loops through each HIGH SCORING PAIR in each ALIGNMENT from each RECORD
for record in blast_records:
    for alignment in record.alignments: #for description in record.descriptions???
        for hsp in alignment.hsps: #for title in description.title???
            #search for designated string
            search = re.search(search_string, alignment.title)
            #if search comes up with nothing, end
            if search is None:
                print ("Search string not found.")
                break
            #if search comes up with something, add it to a list of entries that match search string
            else:
                #option to include an 'exception' (if it finds keyword then DOES NOT add that entry to list)
                if search is "trichomonas" or "entamoeba" or "arabidopsis":
                    print ("found exception.")
                    break
                else:
                    query_list.append(record.query)
                    hit_list.append(alignment.title)
                    expect_list.append(expect_val)
                    length_list.append(length)
                    #explicitly convert 'variables' ['int' object or 'float'] to strings
                    length = str(alignment.length)
                    expect_val = str(hsp.expect)
                    #print ("\nquery name: " + record.query)
                    #print ("alignment title: " + alignment.title)
                    #print ("alignment length: " + length)
                    #print ("expect value: " + expect_val)
                    #print ("\n***Alignment***\n")
                    #print (hsp.query)
                    #print (hsp.match)
                    #print (hsp.sbjct + "\n\n")
                    if query_len is not hit_len is not expect_len is not length_len:
                        print ("list lengths don't match!")
                        break
                    else:
                        qrylen = len(query_list)
                        query_len = str(qrylen)
                        hitlen = len(hit_list)
                        hit_len = str(hitlen)
                        expectlen = len(expect_list)
                        expect_len = str(expectlen)
                        lengthlen = len(length_list)
                        length_len = str(lengthlen)
outpath = str(outfile)
#create new file
outfile = open("__Blast_Parse_Search.txt", "w")
outfile.write("File contains entries from [" + outpath + "] that contain [" + search_string + "]")
outfile.close
#write list to file
i = 0
list_len = int(query_len)
for i in range(0, list_len):
    #append new file
    outfile = open("__Blast_Parse_Search.txt", "a")
    outfile.writelines(query_list + hit_list + expect_list + length_list)
    i = i + 1
#write to disk, close file
outfile.flush()
outfile.close
print ("query list length " + query_len)
print ("hit list length " + hit_len)
print ("expect list length " + expect_len)
print ("length list length " + length_len + "\n\n")
print ("first record: " + query_list[0] + " " + hit_list[0] + " " + expect_list[0] + " " + length_list[0])
print ("last record: " + query_list[-1] + " " + hit_list[-1] + " " + expect_list[-1] + " " + length_list[-1])
print ("\nFinished.\n")
If I understand your problem correctly, you could use a default value to deal with the line slippage, like:
try:
    x(list)
except Exception:
    append_default_value(list)
http://docs.python.org/tutorial/errors.html#handling-exceptions
Or use tuples for dictionary keys, like (0,1,1), and use the get method for your default value.
http://docs.python.org/py3k/library/stdtypes.html#mapping-types-dict
If you need to maintain data structures in your output files, you might try using shelve.
Or you could append some type of reference after each record and give each record a unique id, for example '#32{somekey:value}#21#22#44#'.
Again, you can have multiple keys using a tuple.
I don't know if that helps; you might clarify exactly which parts of your code you have trouble with, like "x() gives me output y but I expect z".
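Another way to avoid the slippage is to keep each hit together as a single tuple instead of spreading it over four parallel lists, and to group hits per query in a dictionary of lists so that several hits for one query are all kept. The sketch below assumes the values have already been pulled out of the BLAST records as (query, hit title, e-value, length) tuples; the function names group_hits and write_hits and the tab-separated layout are illustrative choices, not part of the question or the answer above.

```python
import csv
from collections import defaultdict


def group_hits(rows):
    """Group (query, hit_title, e_value, length) tuples by query name.

    A query with several hits keeps all of them in its list.
    """
    hits_by_query = defaultdict(list)
    for query, title, e_value, length in rows:
        hits_by_query[query].append((title, e_value, length))
    return dict(hits_by_query)


def write_hits(rows, handle):
    """Write one tab-separated line per hit, so the fields of a hit stay together."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["query", "hit", "e_value", "length"])
    writer.writerows(rows)
```

In the parsing loop this would mean appending one tuple such as (record.query, alignment.title, hsp.expect, alignment.length) to a single list, then passing that list to write_hits together with a file opened for writing.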
