Get result from XML code with Python

I have an XML stanza from the Nimbuzz chat app which searches for a chatroom on the Nimbuzz server using an input keyword:
<iq type="set" id="Nimbuzz_SearchRooms" to="conference.nimbuzz.com">
  <query xmlns="jabber:iq:search">
    <set xmlns="http://jabber.org/protocol/rsm">
      <index>0</index>
      <max>10</max>
    </set>
    <x type="get" xmlns="jabber:x:data">
      <field var="name"><value>**INPUT KEYWORD FOR ROOM NAME**</value></field>
      <field var="include_password_protected"><value>true</value></field>
    </x>
  </query>
</iq>
The stanza works and I'm getting a result back as XML (screenshot below; the full result can also be downloaded from the link at the end of this question):
[screenshot: the result XML]
I started with this code but I can't complete it because I can't understand how it works:
def handler_search_room(type, source, parameters):
    Key_word = raw_input("Please write the Key word: ")
    if parameters.split():
        iq = xmpp.Iq('set')
        iq.setID('Nimbuzz_SearchRooms')
        iq.setTo('conference.nimbuzz.com')
I need to send the first stanza to the Nimbuzz server and then read the result, which contains the information about each chatroom.
The result should provide this information for each chatroom:
name
subject
num_users
num_max_users
is_password_protected
is_member_only
language
location-type
location
How can I do that with Python? I would be happy if someone could help me finish my code.
Download the XML code if you want:
http://qfs.mobi/f1833350

def handler_akif_room(type, source, parameters):
    room = parameters
    if parameters.split():
        iq = xmpp.Iq('set')
        iq.setID('Nimbuzz_SearchRooms')
        iq.setTo('conference.nimbuzz.com')
        # <query xmlns="jabber:iq:search">
        query = xmpp.Node('query')
        query.setNamespace('jabber:iq:search')
        iq.addChild(node=query)
        # <set xmlns="http://jabber.org/protocol/rsm"> carries the paging limits
        sett = xmpp.Node('set')
        sett.setNamespace('http://jabber.org/protocol/rsm')
        query.addChild(node=sett)
        ind = xmpp.Node('index')
        ind.setData('0')
        sett.addChild(node=ind)
        maxx = xmpp.Node('max')
        maxx.setData('10')
        sett.addChild(node=maxx)
        # <x type="get" xmlns="jabber:x:data"> holds the search fields
        qqx = xmpp.Node('x')
        qqx.setNamespace('jabber:x:data')
        qqx.setAttr('type', 'get')
        query.addChild(node=qqx)
        field = xmpp.Node('field')
        field.setAttr('var', 'name')
        qqx.addChild(node=field)
        valueone = xmpp.Node('value')
        valueone.setData(room)
        field.addChild(node=valueone)
        fieldtwo = xmpp.Node('field')
        fieldtwo.setAttr('var', 'include_password_protected')
        qqx.addChild(node=fieldtwo)
        valuetwo = xmpp.Node('value')
        valuetwo.setData('true')
        fieldtwo.addChild(node=valuetwo)
        # send the stanza and handle the reply in handler_akif_room_answ
        JCON.SendAndCallForResponse(iq, handler_akif_room_answ, {'type': type, 'source': source})
        # echo the outgoing stanza back to the requester
        msg(source[1], u'' + str(iq) + ' .')
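The callback handler_akif_room_answ referenced above isn't shown anywhere. Here is a minimal, hypothetical sketch of what it could look like with xmpppy, assuming each chatroom in the reply arrives as an <item> element whose <field var="..."><value> children carry the attributes listed in the question (check the downloadable result XML for the exact element names before relying on this):

def handler_akif_room_answ(conn, result, type, source):
    # Pull the fields we care about out of every <item> in the reply.
    wanted = ['name', 'subject', 'num_users', 'num_max_users',
              'is_password_protected', 'is_member_only',
              'language', 'location-type', 'location']
    query = result.getTag('query')
    if query is None:
        return
    for item in query.getTags('item'):
        info = {}
        for field in item.getTags('field'):
            var = field.getAttr('var')
            if var in wanted:
                info[var] = field.getTagData('value')
        # Report one line per room back to the requester.
        msg(source[1], u'%s | users: %s/%s | subject: %s' % (
            info.get('name'), info.get('num_users'),
            info.get('num_max_users'), info.get('subject')))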

Related

How to scrape a link from a multipart email in python

I have a program which logs on to a specified Gmail account and gets all the emails in a selected inbox that were sent from an email address that you input at runtime.
I would like to grab all the links from each email and append them to a list so that I can filter out the ones I don't need before outputting them to another file. I was using a regex to do this, which requires me to convert the payload to a string. The problem is that the regex I am using doesn't work with findall(); it only works when I use search() (I am not too familiar with regexes). I was wondering if there is a better way to extract all links from an email that doesn't involve messing around with regexes?
My code currently looks like this:
print(f'[{Mail.timestamp}] Scanning inbox')
sys.stdout.write(Style.RESET)
self.search_mail_status, self.amount_matching_criteria = self.login_session.search(Mail.CHARSET, search_criteria)
if self.amount_matching_criteria == 0 or self.amount_matching_criteria == '0':
    print(f'[{Mail.timestamp}] No mails from that email address could be found...')
    Mail.enter_to_continue()
    import main
    main.main_wrapper()
else:
    pattern = r'(?P<url>https?://[^\s]+)'
    prog = re.compile(pattern)
    self.amount_matching_criteria = self.amount_matching_criteria[0]
    self.amount_matching_criteria_str = str(self.amount_matching_criteria)
    num_mails = re.search(r"\d.+", self.amount_matching_criteria_str)
    num_mails = ((num_mails.group())[:-1]).split(' ')
    sys.stdout.write(Style.GREEN)
    print(f'[{Mail.timestamp}] Status code of {self.search_mail_status}')
    sys.stdout.write(Style.RESET)
    sys.stdout.write(Style.YELLOW)
    print(f'[{Mail.timestamp}] Found {len(num_mails)} emails')
    sys.stdout.write(Style.RESET)
    num_mails = self.amount_matching_criteria.split()
    for message_num in num_mails:
        individual_response_code, individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
        message = email.message_from_bytes(individual_response_data[0][1])
        if message.is_multipart():
            print('multipart')
            multipart_payload = message.get_payload()
            for sub_message in multipart_payload:
                string_payload = str(sub_message.get_payload())
                print(prog.search(string_payload).group("url"))
I ended up using this for loop with a recursive function and a regex to get the links; I then removed all links that don't contain the substring you can input earlier in the program, before appending to a set:
for message_num in self.amount_matching_criteria.split():
    counter += 1
    _, self.individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
    self.raw = email.message_from_bytes(self.individual_response_data[0][1])
    raw = self.raw
    self.scraped_email_value = email.message_from_bytes(Mail.scrape_email(raw))
    self.scraped_email_value = str(self.scraped_email_value)
    self.returned_links = prog.findall(self.scraped_email_value)
    for i in self.returned_links:
        if self.substring_filter in i:
            self.link_set.add(i)
    self.timestamp = time.strftime('%H:%M:%S')
    print(f'[{self.timestamp}] Links scraped: [{counter}/{len(num_mails)}]')
The function used:
def scrape_email(raw):
    if raw.is_multipart():
        return Mail.scrape_email(raw.get_payload(0))
    else:
        return raw.get_payload(None, True)
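Since the question asks for a way to do this without regexes, here is a hedged alternative sketch: walk the message's MIME parts and let Python's built-in html.parser collect href attributes from the HTML parts. The message object is the same one built with email.message_from_bytes() above; the class and function names here are made up for the example.

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from every <a> tag fed to it."""
    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.add(value)

def links_from_message(message):
    collector = LinkCollector()
    for part in message.walk():                    # visit every MIME part, nested or not
        if part.get_content_type() == 'text/html':
            payload = part.get_payload(decode=True)    # decoded bytes (base64/quoted-printable handled)
            charset = part.get_content_charset() or 'utf-8'
            collector.feed(payload.decode(charset, errors='replace'))
    return collector.links

# usage, mirroring the loop above:
# message = email.message_from_bytes(individual_response_data[0][1])
# link_set |= links_from_message(message)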

monitoring a text site (json) using python

I'm working on a program to grab the variant ID from this website:
https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json
I'm using this code:
import json
import requests
import time

endpoint = "https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json"
req = requests.get(endpoint)
reqJson = json.loads(req.text)
for id in reqJson['product']:
    name = (id['title'])
    print(name)
I don't know what to do here in order to grab the name of the items. If you visit the link you will see that the name is under 'title'. If you could help me with this, that would be awesome.
I get the error message "TypeError: string indices must be integers", so I'm not too sure what to do.
Your biggest problem right now is that you are adding items to the list before checking whether they're already in it, so everything comes back as being in the list.
Looking at your code right now, I think what you want to do is combine things into a single for loop.
Also, as a heads up, you shouldn't use a variable name like list, as it shadows the built-in Python list().
list = []  # You really should change this to something else

def check_endpoint():
    endpoint = ""
    req = requests.get(endpoint)
    reqJson = json.loads(req.text)
    for id in reqJson['threads']:  # For each id in the threads list
        PID = id['product']['globalPid']  # Get current PID
        if PID in list:
            print('checking for new products')
        else:
            title = id['product']['title']
            Image = id['product']['imageUrl']
            ReleaseType = id['product']['selectionEngine']
            Time = id['product']['effectiveInStockStartSellDate']
            send(title, PID, Image, ReleaseType, Time)
            print('added to database {}'.format(PID))
            list.append(PID)  # Add PID to the list
    return

def main():
    while True:
        check_endpoint()
        time.sleep(20)
    return

if __name__ == "__main__":
    main()
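About the TypeError in the question itself: reqJson['product'] is a single dict, so for id in reqJson['product'] iterates over its string keys, and id['title'] then indexes a string with a string. A minimal sketch of reading the title and variant IDs directly (this assumes the usual Shopify product-JSON layout, where 'product' contains a 'variants' list; check the endpoint's actual output):

import json
import requests

endpoint = "https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json"
product = json.loads(requests.get(endpoint).text)['product']

print(product['title'])                      # product name
for variant in product.get('variants', []):  # one entry per size/variant
    print(variant['id'], variant['title'])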

Parsing Json with multiple "levels" with Python

I'm trying to parse a JSON file from an API call.
I have found this code that fits my need and I'm trying to adapt it to what I want:
import math, urllib2, json, re

def download():
    graph = {}
    page = urllib2.urlopen("http://fx.priceonomics.com/v1/rates/?q=1")
    jsrates = json.loads(page.read())
    pattern = re.compile("([A-Z]{3})_([A-Z]{3})")
    for key in jsrates:
        matches = pattern.match(key)
        conversion_rate = -math.log(float(jsrates[key]))
        from_rate = matches.group(1).encode('ascii', 'ignore')
        to_rate = matches.group(2).encode('ascii', 'ignore')
        if from_rate != to_rate:
            if from_rate not in graph:
                graph[from_rate] = {}
            graph[from_rate][to_rate] = float(conversion_rate)
    return graph
And I've turned it into:
import math, urllib2, json, re

def download():
    graph = {}
    page = urllib2.urlopen("https://bittrex.com/api/v1.1/public/getmarketsummaries")
    jsrates = json.loads(page.read())
    for pattern in jsrates['result'][0]['MarketName']:
        for key in jsrates['result'][0]['Ask']:
            matches = pattern.match(key)
            conversion_rate = -math.log(float(jsrates[key]))
            from_rate = matches.group(1).encode('ascii', 'ignore')
            to_rate = matches.group(2).encode('ascii', 'ignore')
            if from_rate != to_rate:
                if from_rate not in graph:
                    graph[from_rate] = {}
                graph[from_rate][to_rate] = float(conversion_rate)
    return graph
Now the problem is that there are multiple levels in the JSON: result > 0, 1, 2, etc.
[screenshot of the JSON structure]
for key in jsrates['result'][0]['Ask']:
I want the zero to be able to be any number (I don't know if that's clear), so I can get all the ask prices matched to their market names.
I have shortened the code so this doesn't make too long of a post.
Thanks.
PS: Sorry for the English, it's not my native language.
You could loop through all of the result values that are returned, ignoring the meaningless numeric index:
for result in jsrates['result'].values():
    ask = result.get('Ask')
    if ask is not None:
        # Do things with your ask...
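Folding that back into the adapted download() function could look roughly like this (a sketch, not tested against the live API: it assumes result is a list of market summaries, that MarketName looks like "BTC-LTC", and that a missing or zero Ask should be skipped):

import math, urllib2, json

def download():
    graph = {}
    page = urllib2.urlopen("https://bittrex.com/api/v1.1/public/getmarketsummaries")
    jsrates = json.loads(page.read())
    for summary in jsrates['result']:            # one dict per market
        ask = summary.get('Ask')
        if not ask:                              # skip markets with no usable ask price
            continue
        from_rate, _, to_rate = summary['MarketName'].partition('-')
        if from_rate != to_rate:
            graph.setdefault(from_rate, {})[to_rate] = -math.log(float(ask))
    return graph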

How to parse a single-column text file into a table using python?

I'm new here to Stack Overflow, but I have found a LOT of answers on this site. I'm also a programming newbie, so I figured I'd join and finally become part of this community, starting with a question about a problem that's been plaguing me for hours.
I log in to a website and scrape a big body of text within the b tag, to be converted into a proper table. The layout of the resulting Output.txt looks like this:
BIN STATUS
8FHA9D8H 82HG9F RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
INVENTORY CODE: FPBC *SOUP CANS LENTILS
BIN STATUS
HA8DHW2H HD0138 RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU 00A123 #2956- INVALID STOCK COUPON CODE (MISSING).
93827548 096DBR RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
There are a bunch of pages with the exact same blocks, but I need them to be combined into an ACTUAL table that looks like this:
BIN INV CODE STATUS
HA8DHW2HHD0138 FPBC-*SOUP CANS LENTILS RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU00A123 FPBC-*SOUP CANS LENTILS #2956- INVALID STOCK COUPON CODE (MISSING).
93827548096DBR FPBC-*SOUP CANS LENTILS RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8FHA9D8H82HG9F SSXR-98-20LM NM CORN CREAM RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
Essentially, all the separate text blocks in this example would become part of this table, with the inventory code repeated alongside its BIN values. I would post my attempts at parsing this data (I have tried pandas/BeautifulSoup/openpyxl/csv writer), but I'll admit they are a little embarrassing, as I cannot find any information on this specific problem. Is there any benevolent soul out there that can help me out? :)
(Also, I am using Python 2.7.)
A simple custom parser like the following should do the trick.
from __future__ import print_function

def parse_body(s):
    line_sep = '\n'
    getting_bins = False
    inv_code = ''
    for l in s.split(line_sep):
        if l.startswith('INVENTORY CODE:') and not getting_bins:
            inv_data = l.split()
            inv_code = inv_data[2] + '-' + ' '.join(inv_data[3:])
        elif l.startswith('INVENTORY CODE:') and getting_bins:
            print("unexpected inventory code while reading bins:", l)
        elif l.startswith('BIN') and l.endswith('MESSAGE'):
            # header line that starts a block of bins; adjust 'MESSAGE' if your
            # header actually ends with something else (e.g. 'STATUS')
            getting_bins = True
        elif getting_bins == True and l:
            bin_data = l.split()
            # need to add exception handling here to make sure:
            # 1) we have an inv_code
            # 2) bin_data is at least 3 items big (assuming two for
            #    bin_id and at least one for message)
            # 3) maybe some constraint checking to ensure that we have
            #    a valid instance of an inventory code and bin id
            bin_id = ''.join(bin_data[0:2])
            message = ' '.join(bin_data[2:])
            # we now have a bin, an inv_code, and a message to add to our table
            print(bin_id.ljust(20), inv_code.ljust(30), message, sep='\t')
        elif getting_bins == True and not l:
            # done getting bins for current inventory code
            getting_bins = False
            inv_code = ''
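A hedged usage example, assuming the scraped text has been saved to a file called Output.txt as described in the question:

with open('Output.txt') as f:
    parse_body(f.read())
# prints one tab-separated row per bin: bin_id, inventory code, status message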
A rather complex one, but this might get you started:
import re, pandas as pd
from pandas import DataFrame

rx = re.compile(r'''
    (?:INVENTORY\ CODE:)\s*
    (?P<inv>.+\S)
    [\s\S]+?
    ^BIN.+[\n\r]
    (?P<bin_msg>(?:(?!^\ ).+[\n\r])+)
''', re.MULTILINE | re.VERBOSE)

string = your_string_here

# set up the dataframe
df = DataFrame(columns=['BIN', 'INV', 'MESSAGE'])

for match in rx.finditer(string):
    inv = match.group('inv')
    bin_msg_raw = match.group('bin_msg').split("\n")
    rxbinmsg = re.compile(r'^(?P<bin>(?:(?!\ {2}).)+)\s+(?P<message>.+\S)\s*$', re.MULTILINE)
    for item in bin_msg_raw:
        for m in rxbinmsg.finditer(item):
            # append it to the dataframe
            df.loc[len(df.index)] = [m.group('bin'), inv, m.group('message')]

print(df)
Explanation
It looks for INVENTORY CODE and sets up the groups (inv and bin_msg) for further processing afterwards (note: it would be easier if you had only one line of bin/msg, as the bin_msg group has to be split later).
Afterwards, it splits the bin and msg parts and appends everything to the df object.
I had some code written for a website-scraping task which may help you.
Basically, what you need to do is right-click on the web page, inspect the HTML, find the tag for the table you are looking for, and extract the information using a parsing module (I am using BeautifulSoup). I am creating JSON because I need to store it in MongoDB; you can create a table instead.
#!/usr/bin/python
import sys
import requests
import re
from BeautifulSoup import BeautifulSoup
import pymongo

def req_and_parsing():
    url2 = 'http://businfo.dimts.in/businfo/Bus_info/EtaByRoute.aspx?ID='
    list1 = ['534UP', '534DOWN']
    # fetch and parse one page per route
    outdict = [parsing_file(requests.get(url2 + Route).text, Route) for Route in list1]
    print outdict
    conn = f_connection()
    for i in range(len(outdict)):
        insert_records(conn, outdict[i])

def parsing_file(txt, Route):
    soup = BeautifulSoup(txt)
    table = soup.findAll("table", {"id": "ctl00_ContentPlaceHolder1_GridView2"})
    trtddict = {}
    # bail out if the page only shows an error message instead of the table
    divtags = soup.findAll("span", {"id": "ctl00_ContentPlaceHolder1_ErrorLabel"})
    for divtag in divtags:
        print "div tag - ", divtag.text
        if divtag.text in ("Currently no bus is running on this route",
                           "This is not a cluster (orange bus) route"):
            print "Page not displayed. Errored with the below message for Route-", Route, ", ", divtag.text
            sys.exit()
    trtags = table[0].findAll('tr')
    for trtag in trtags:
        tdtags = trtag.findAll('td')
        if len(tdtags) == 2:
            # first cell becomes the key, second cell the value
            trtddict[tdtags[0].text] = sub_colon(tdtags[1].text)
    return trtddict

def sub_colon(tag_str):
    return re.sub(';', ',', tag_str)

def f_connection():
    try:
        conn = pymongo.MongoClient()
        print "Connected successfully!!!"
    except pymongo.errors.ConnectionFailure, e:
        print "Could not connect to MongoDB: %s" % e
    return conn

def insert_records(conn, stop_dict):
    db = conn.test
    print db.collection_names()
    mycoll = db.stopsETA
    mycoll.insert(stop_dict)

if __name__ == "__main__":
    req_and_parsing()
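If the goal is a table rather than MongoDB documents, one hedged option (assuming pandas is installed, and reusing the url2 and list1 values from the code above) is to collect the per-route dicts and hand them to a DataFrame; the same idea works for the inventory rows in the original question:

import pandas as pd

rows = [parsing_file(requests.get(url2 + route).text, route) for route in list1]
df = pd.DataFrame(rows)          # one row per route, one column per scraped <td> key
df.to_csv('eta_table.csv', index=False, encoding='utf-8')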

Same code used in multiple functions but with minor differences - how to optimize?

This is code from a Udacity course, and I changed it a little. Now, when it runs, it asks me for a movie name and the trailer opens in a pop-up in a browser (that's another part, which is not shown).
As you can see, this program has a lot of repetitive code in it: the functions extract_name, movie_poster_url and movie_trailer_url contain almost the same code. Is there a way to get rid of the repeated code but keep the same output? If so, will it run faster?
import fresh_tomatoes
import media
import urllib
import requests
from BeautifulSoup import BeautifulSoup

name = raw_input("Enter movie name:- ")
global movie_name

def extract_html(name):
    url = "website name" + name + "continuation of website name" + name + "again continuation of web site name"
    response = requests.get(url)
    page = str(BeautifulSoup(response.content))
    return page

def extract_name(page):
    start_link = page.find(' - IMDb</a></h3><div class="s"><div class="kv"')
    start_url = page.find('>', start_link - 140)
    start_url1 = page.find('>', start_link - 140)
    end_url = page.find(' - IMDb</a>', start_link - 140)
    name_of_movie = page[start_url1 + 1:end_url]
    return extract_char(name_of_movie)

def extract_char(name_of_movie):
    name_array = []
    for words in name_of_movie:
        word = words.strip('</b>,')
        name_array.append(word)
    return ''.join(name_array)

def movie_poster_url(name_of_movie):
    movie_name, seperator, tail = name_of_movie.partition(' (')
    #movie_name = name_of_movie.rstrip('()0123456789 ')
    page = urllib.urlopen('another web site name' + movie_name + 'continuation of website name').read()
    start_link = page.find('"Poster":')
    start_url = page.find('"', start_link + 9)
    end_url = page.find('"', start_url + 1)
    poster_url = page[start_url + 1:end_url]
    return poster_url

def movie_trailer_url(name_of_movie):
    movie_name, seperator, tail = name_of_movie.partition(' (')
    #movie_name = name_of_movie.rstrip('()0123456789 ')
    page = urllib.urlopen('another website name' + movie_name + " trailer").read()
    start_link = page.find('<div class="yt-lockup-dismissable"><div class="yt-lockup-thumbnail contains-addto"><a aria-hidden="true" href=')
    start_url = page.find('"', start_link + 110)
    end_url = page.find('" ', start_url + 1)
    trailer_url1 = page[start_url + 1:end_url]
    trailer_url = "www.youtube.com" + trailer_url1
    return trailer_url

page = extract_html(name)
movie_name = extract_name(page)
new_movie = media.Movie(movie_name, "Storyline WOW", movie_poster_url(movie_name), movie_trailer_url(movie_name))
movies = [new_movie]
fresh_tomatoes.open_movies_page(movies)
You could move the shared parts into their own function:
def find_page(url, name, find, offset):
    movie_name, seperator, tail = name.partition(' (')
    page = urllib.urlopen(url.format(movie_name)).read()
    start_link = page.find(find)
    start_url = page.find('"', start_link + offset)
    end_url = page.find('" ', start_url + 1)
    return page[start_url + 1:end_url]

def movie_poster_url(name_of_movie):
    return find_page("another website name{} continuation of website name", name_of_movie, '"Poster":', 9)

def movie_trailer_url(name_of_movie):
    trailer_url = find_page("another website name{} trailer", name_of_movie, '<div class="yt-lockup-dismissable"><div class="yt-lockup-thumbnail contains-addto"><a aria-hidden="true" href=', 110)
    return "www.youtube.com" + trailer_url
It definitely won't run faster (there is extra work to do to "switch" between the functions), but the performance difference is probably negligible.
For your second question: profiling is not a technique or method in itself; it's about finding out what's behaving badly in your code:
Profiling is a form of
dynamic program analysis that measures, for example, the space
(memory) or time complexity of a program, the usage of particular
instructions, or the frequency and duration of function calls.
(wikipedia)
So it's not something that speeds up your program, it's a word for things you do to find out what you can do to speed up your program.
Going really quickly here because I am a super newb, but I can see the repetition. What I would do is figure out the (mostly) repeating blocks of code shared by all 3 functions, then figure out where they differ, and write a new function that takes the differences as arguments. So, for instance:
def extract(tarString, delim, startDiff, endDiff):
    start_link = page.find(tarString)
    start_url = page.find(delim, start_link + startDiff)
    end_url = page.find(delim, start_url + endDiff)
    return page[start_url + 1:end_url]
Then, in your poster, trailer, etc. functions, just call this extract function with the appropriate arguments for each case; i.e. poster would call
poster_url = extract(tarString='"Poster":', delim='"', startDiff=9, endDiff=1)
I can see you've got another answer already, and it's very likely it was written by someone who knows more than I do, but I hope you get something out of my "philosophy of modularizing" from a newbie perspective.
