Need help parsing XML with ElementTree - python

I'm trying to parse the following XML data:
http://pastebin.com/UcbQQSM2
This is just an example of the two types of data I will run into: companies with the needed address information and companies without it.
From the data I need to collect 3 pieces of information:
1) The Company name
2) The Company street
3) The Company zipcode
I'm able to do this with the following code:
# Creates list of Company names
CompanyList = []
for company in xmldata.findall('company'):
    name = company.find('name').text
    CompanyList.append(name)
# Creates list of Company zipcodes
ZipcodeList = []
for company in xmldata.findall('company'):
    contact_data = company.find('contact-data')
    address1 = contact_data.find('addresses')
    for address2 in address1.findall('address'):
        ZipcodeList.append(address2.find('zip').text)
# Creates list of Company streets
StreetList = []
for company in xmldata.findall('company'):
    contact_data = company.find('contact-data')
    address1 = contact_data.find('addresses')
    for address2 in address1.findall('address'):
        StreetList.append(address2.find('street').text)
But it doesn't really do what I want it to, and I can't figure out how to do what I want. I believe it will be some type of 'if' statement but I don't know.
The problem is that where I have:
for address2 in address1.findall('address'):
    ZipcodeList.append(address2.find('zip').text)
and
for address2 in address1.findall('address'):
    StreetList.append(address2.find('street').text)
It only adds to the list the companies that actually have a street name or zipcode listed in the XML, but I also need a placeholder for the companies that DON'T have that information listed, so that my lists match up.
I hope this makes sense. Let me know if I need to add more information.
But, basically, I'm trying to find a way to say if there isn't a zipcode/street name for the Company put "None" and if there is then put the zipcode/street name.
Any help/guidance is appreciated.

Well I am going to do a bad thing and suggest you use a conditional (ternary) operator.
StreetList.append(address2.findtext('street') if address2.findtext('street') else 'None')
This statement appends address2.findtext('street') if it is not empty, and 'None' otherwise. findtext is used here instead of find('street').text because find returns None when the element is missing, and calling .text on None raises an AttributeError.
Additionally, you could create a new function to do the same test and call it in both places. Note my Python is rusty, but this should get you close:
def returnNoneIfEmpty(testText):
    # Covers both a missing element (None) and an empty one ('')
    if testText:
        return testText
    else:
        return 'None'
Then just call it:
StreetList.append(returnNoneIfEmpty(address2.findtext('street')))
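Note that both variants above only run for companies that have an address element at all, because they sit inside the inner loop. A minimal sketch of one way to record a placeholder for every company, assuming the element layout implied by the question's code (company > contact-data > addresses > address > street/zip). The sample XML is invented for illustration and is not the pastebin data:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real layout is assumed from the question's code.
SAMPLE = """
<companies>
  <company>
    <name>Acme</name>
    <contact-data>
      <addresses>
        <address><street>1 Main St</street><zip>12345</zip></address>
      </addresses>
    </contact-data>
  </company>
  <company>
    <name>NoAddress Inc</name>
    <contact-data><addresses/></contact-data>
  </company>
</companies>
"""

xmldata = ET.fromstring(SAMPLE)

company_list, street_list, zipcode_list = [], [], []
for company in xmldata.findall('company'):
    company_list.append(company.findtext('name', default='None'))
    # findtext returns the default when nothing matches the path;
    # only the first matching address is used here.
    street = company.findtext('contact-data/addresses/address/street', default='None')
    zipcode = company.findtext('contact-data/addresses/address/zip', default='None')
    # An element that exists but is empty yields '', so normalise that too.
    street_list.append(street or 'None')
    zipcode_list.append(zipcode or 'None')

print(list(zip(company_list, street_list, zipcode_list)))
```

Because every company contributes exactly one entry to each list, the three lists stay the same length and line up by index.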

Related

Scrape many products with the same name Html. I can only scrape one, not all of them

I have successfully scraped some data from a site with Selenium, and it downloaded without problems. On the site there are many products that share identical class names in the HTML.
At the moment only a single product and its details (name, description, price, seller, etc.) are scraped, but I would like to scrape ALL the products on the page, which, I repeat, use identical names. Here are the names:
#Selenium code for scraping
Product_Name = driver.find_element_by_class_name("tablet-desktop-only").text
Product_Description = driver.find_element_by_class_name("h-text-left").text
Vendor = driver.find_element_by_class_name("in-vendor").text
Price = driver.find_element_by_class_name("h-text-center").text
print(Product_Name)
print(Product_Description)
print(Vendor)
print(Price)
How can I scrape the other products too if they have exactly the same name? I would like to create a list of all products, not just one product. Thank you.
You are going to need to find all elements for each of the kinds of things you are looking for. So, start with:
Product_Names = driver.find_elements_by_class_name("tablet-desktop-only")
Product_Descriptions = driver.find_elements_by_class_name("h-text-left")
Vendors = driver.find_elements_by_class_name("in-vendor")
Prices = driver.find_elements_by_class_name("h-text-center")
You should have 4 lists of elements (not strings), each of which should be the same length, and picking up things in the same order. To be safe we will choose to work with the shortest list.
Num_Groups = min(len(Product_Names), len(Product_Descriptions), len(Vendors), len(Prices))
Then we loop over all 4 lists at the same time:
for i in range(Num_Groups):
    print(Product_Names[i].text)
    print(Product_Descriptions[i].text)
    print(Vendors[i].text)
    print(Prices[i].text)
    # you might want to add printing a blank line here
Note we need .text here so we get the text of the element, not a description of the element itself. Also note the [i] to get that element in the list.
Within this loop is where you would do your database inserts (though probably connect outside the loop), making sure to merge the .text into the SQL string, not the element's string representation.
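As an alternative to indexing up to the shortest length, zip walks the lists in parallel and stops at the shortest one. A small sketch of that idea: SimpleNamespace objects stand in for Selenium elements so it runs without a browser (an assumption for illustration), and the sample values are made up:

```python
from types import SimpleNamespace


def fake(text):
    """Hypothetical stand-in for a Selenium element exposing a .text attribute."""
    return SimpleNamespace(text=text)


product_names = [fake("Lamp"), fake("Chair")]
product_descriptions = [fake("Desk lamp"), fake("Oak chair")]
vendors = [fake("Shop A"), fake("Shop B")]
prices = [fake("10"), fake("45"), fake("99")]  # one extra item, ignored by zip

# One dict per product, built from the .text of each element
products = [
    {"name": n.text, "description": d.text, "vendor": v.text, "price": p.text}
    for n, d, v, p in zip(product_names, product_descriptions, vendors, prices)
]

for product in products:
    print(product)
```

Collecting plain strings into dicts like this also makes a later database insert simpler, since nothing in the list is an element object any more.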
To find multiple elements with a specific class, we can use find_elements_by_class_name (the difference from the function you wrote is that here you write elements instead of element). This function returns a list, from which we can select the desired element by index.
Note that because this gives you a list, you cannot use .text on the list itself; you must use it on the individual items.
Example :
elements = driver.find_elements_by_class_name('tablet-desktop-only')
print(elements[0].text)
# Or using a for loop:
for element in elements:
    print(element.text)
For example, if tablet-desktop-only matches multiple product names, you should use find_elements, not find_element:
name = driver.find_elements_by_class_name("tablet-desktop-only")
for nme in name:
    print(nme.text)
You can easily replicate this for the others, like Description, Vendor and Price.
Update 1:
Above, name is a list in Python; similarly, you can have lists for Description, Vendor and Price.
Now we have 4 lists, and we can print the items one by one like this:
for seq in name + Description + Vendor + Price:
    # each item is an element, so print its text rather than the element itself
    print(seq.text)

Python, need help breaking down a variable into a list

My code is meant to find elements on a page with the class "lh-copy truncate silver" and then copy all the links inside them, along with some other info, into a list. As of right now, the code simply saves the result into a variable instead, and I am having trouble making the conversion.
Here is the code that I have so far:
age_sex = browser.find_elements_by_xpath('//*[@class="lh-copy truncate silver"]')
for ii in age_sex:
    link = ii.find_element_by_xpath('.//a').get_attribute('href')
    sex = ii.find_element_by_xpath('.//span').text
    print(link, sex)
The code returns the information that I need in a variable as opposed to list format.
Edit: The reason I need a list rather than a plain variable is that with a string, variable[1] just gives me the second character of the https:// link, which is 't'. With a list, list[1] returns the full link. It's the only way I know to divide the block of text into separate links that my script can access individually.
It appears that your for loop is only printing individual elements. If you want lists of links and sexes, this may be helpful:
age_sex = browser.find_elements_by_xpath('//*[@class="lh-copy truncate silver"]')
link_list = []
sex_list = []
for ii in age_sex:
    link = ii.find_element_by_xpath('.//a').get_attribute('href')
    link_list.append(link)
    sex = ii.find_element_by_xpath('.//span').text
    sex_list.append(sex)
print(link_list, sex_list)
If you want to keep things together (i.e. list of link and sex pairs), you can have the following:
age_sex = browser.find_elements_by_xpath('//*[@class="lh-copy truncate silver"]')
result_list = []
for ii in age_sex:
    link = ii.find_element_by_xpath('.//a').get_attribute('href')
    sex = ii.find_element_by_xpath('.//span').text
    result_list.append([link, sex])
print(result_list)
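To illustrate the difference the question's edit describes, here is a short sketch of indexing a string versus indexing the list of pairs. The values are made-up placeholders standing in for the scraped links and labels:

```python
# Made-up sample values, not real scraped data
result_list = [
    ["https://example.com/a", "F"],
    ["https://example.com/b", "M"],
]

single_link = "https://example.com/a"
print(single_link[1])     # indexing a string gives a single character
print(result_list[1])     # indexing the list gives a whole [link, sex] pair
print(result_list[1][0])  # the first item of that pair is the full link

# Pull just the links back out if a flat list is needed
links = [pair[0] for pair in result_list]
print(links)
```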
I hope I'm understanding your problem correctly.
# If the info in your variable is separated by something (a space, for example, or any specific character), try the following,
# where char is the separator string:
new_list = variable.split(char)
# if it's a space:
new_list = variable.split(' ')
Could you please explain the problem more clearly?

Unable to get the output in an organized manner

I've written a script in Python to scrape some item names, along with the review texts and reviewers connected to each item name, from a webpage using their API. The thing is, my script below only does this partially. I need the output in an organized manner.
For example, each item name has multiple review texts and reviewer names connected to it. I wish to get them along the columns like:
Name review text reviewer review text reviewer -----
Basically, I can't work out how to use the already defined for loop in the right way within my script. Lastly, there are a few item names which do not have any reviews or reviewers, so the code breaks when it doesn't find any reviews.
This is my approach so far:
import requests
url = "https://eatstreet.com/api/v2/restaurants/{}?yelp_site="
res = requests.get("https://eatstreet.com/api/v2/locales/madison-wi/restaurants")
for item in res.json():
    itemid = item['id']
    req = requests.get(url.format(itemid))
    name = req.json()['name']
    for texualreviews in req.json()['yelpReviews']:
        reviews = texualreviews['message']
        reviewer = texualreviews['reviewerName']
        print(f'{name}\n{reviews}\n{reviewer}\n')
If I use the print statement outside the for loop, it only gives me a single review and reviewer.
Any help to fix that will be highly appreciated.
You need to append each review and reviewer name to a list to display them as you wish.
Try the following code.
review_data = dict()
review_data['name'] = req.json()['name']
review_data['reviews'] = []
for texualreviews in req.json()['yelpReviews']:
    review_sub_data = {'review': texualreviews['message'], 'reviewer': texualreviews['reviewerName']}
    review_data['reviews'].append(review_sub_data)
# Output: {'name': 'xxx', 'reviews': [{'review': 'xxx', 'reviewer': 'xxx'}, {'review': 'xxx', 'reviewer': 'xxx'}]}
Hope this helps!
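The question also mentions items with no reviews and a one-row-per-item layout. A minimal sketch of one way to handle both, assuming each response is a dict with the 'name', 'yelpReviews', 'message' and 'reviewerName' keys used in the question. The sample payloads are invented, and build_row is a hypothetical helper:

```python
def build_row(restaurant):
    """Flatten one restaurant dict into [name, review, reviewer, review, reviewer, ...]."""
    row = [restaurant.get('name', '')]
    # .get(...) or [] covers both a missing key and an explicit null value
    for review in restaurant.get('yelpReviews') or []:
        row.append(review.get('message', ''))
        row.append(review.get('reviewerName', ''))
    return row


# Hypothetical payloads shaped like the fields used in the question
sample = [
    {'name': 'Cafe A', 'yelpReviews': [
        {'message': 'Great', 'reviewerName': 'Ann'},
        {'message': 'Okay', 'reviewerName': 'Bob'},
    ]},
    {'name': 'Cafe B'},  # no reviews at all
]

rows = [build_row(r) for r in sample]
for row in rows:
    print(row)
```

In the real script, the dict passed to build_row would be req.json() for each item, and each row could then be written out with the csv module.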

Building a tuple to be rendered by a form

I have a model University which has a field city. I'm trying to build a form where the user can select cities or universities. The universities selection is fine:
universities = University.objects.all()
university = forms.ModelMultipleChoiceField(widget=CheckboxSelectMultiple, queryset=universities)
The way I'm trying to get the cities is what is causing the problem. Here's what I currently have:
cities = []
for uni in universities:
    cities.append(uni.city)
cities = tuple(cities)
city_select = forms.MultipleChoiceField(widget=CheckboxSelectMultiple, choices=cities)
This gives me the error "too many values to unpack" because the tuple isn't key paired. Is there an easier way to return the choices I've gathered? I feel like I'm going about it the wrong way. If not, how do I key pair the tuples of cities?
I think a simple change like the one below, where each entry in cities is a (value, label) tuple, should make this work:
cities = []
for uni in universities:
    cities.append((uni.city, uni.city))
cities = tuple(cities)
city_select = forms.MultipleChoiceField(widget=CheckboxSelectMultiple, choices=cities)
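If several universities share a city, the loop above produces duplicate checkboxes. A small sketch of building de-duplicated, sorted pairs in plain Python. The helper name build_city_choices and the sample cities are hypothetical, and the form field itself is left out so the snippet runs without Django:

```python
def build_city_choices(cities):
    """Return sorted, de-duplicated (value, label) pairs suitable for a choices argument."""
    return tuple((city, city) for city in sorted(set(cities)))


# Made-up names standing in for [uni.city for uni in universities]
cities = ["Leeds", "York", "Leeds", "Bath"]
choices = build_city_choices(cities)
print(choices)
```

The resulting tuple could then be passed as choices=choices to the field shown above.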
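MultipleChoiceField doesn't want a flat tuple of values; its choices argument expects an iterable of (value, label) pairs. (It is ModelMultipleChoiceField that takes a queryset argument.) You can use values_list to get the pairs with the fields you want:
city_select = forms.MultipleChoiceField(widget=CheckboxSelectMultiple, choices=University.objects.values_list('id', 'city'))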

sorting multi-language country names

I have a list of country names in different languages that I am attempting to sort by their country name. Currently, the sort is based on the index value.
Here is my truncated list of country names:
ADDRESS_COUNTRY_STYLE_TYPES = {}
for language_code in LANGUAGES.iterkeys():
    ADDRESS_COUNTRY_STYLE_TYPES[language_code] = OrderedDict()
if 'af' in LANGUAGES.iterkeys():
    ADDRESS_COUNTRY_STYLE_TYPES['af'][0] = " Kies 'n land of gebied"  # Select a country or territory
    ADDRESS_COUNTRY_STYLE_TYPES['af'][1] = "Afganistan"  # Afghanistan
    ADDRESS_COUNTRY_STYLE_TYPES['af'][2] = "Åland"  # Aland
    ADDRESS_COUNTRY_STYLE_TYPES['af'][3] = "Albanië"  # Albania
    ....
    ADDRESS_COUNTRY_STYLE_TYPES['af'][14] = "Australië"  # Australia
    ADDRESS_COUNTRY_STYLE_TYPES['af'][15] = "Oostenryk"  # Austria
    ADDRESS_COUNTRY_STYLE_TYPES['af'][16] = "Aserbeidjan"  # Azerbaijan
    ADDRESS_COUNTRY_STYLE_TYPES['af'][17] = "Bahamas"  # Bahamas
    ADDRESS_COUNTRY_STYLE_TYPES['af'][18] = "Bahrein"  # Bahrain
    ADDRESS_COUNTRY_STYLE_TYPES['af'][19] = "Bangladesj"  # Bangladesh
    ADDRESS_COUNTRY_STYLE_TYPES['af'][20] = "Barbados"  # Barbados
    ADDRESS_COUNTRY_STYLE_TYPES['af'][21] = "Wit-Rusland"  # Belarus
    ADDRESS_COUNTRY_STYLE_TYPES['af'][22] = "België"  # Belgium
    ....
Here is my code I have in my views.py file that calls the country names:
def get_address_country_style_types(available_languages, with_country_style_zero=True):
    address_country_style_types = {}
    preview_labels = {}
    for code, name in available_languages:
        address_country_style_types[code] = ADDRESS_COUNTRY_STYLE_TYPES[code].copy()
        if not with_country_style_zero:
            address_country_style_types[code].pop(0)
        preview_labels[code] = ADDRESS_DETAILS_LIVE_PREVIEW_LABELS[code]
        # in case preview labels are not defined for the language code
        # fall back to 'en', which should always be there
        if len(preview_labels[code]) == 0:
            preview_labels[code] = ADDRESS_DETAILS_LIVE_PREVIEW_LABELS['en']
    address_country_style_types = sorted(address_country_style_types, key=lambda x:x[1])
    return address_country_style_types, preview_labels
The above code only returns the index number in the html drop down list. The issue is with the following line of code (or more to the point my lack of knowledge of how to get it working):
address_country_style_types = sorted(address_country_style_types, key=lambda x:x[1])
How do I return the sorted country list ? Am I using lambda in the correct way here? Should I be using lambda here?
I have been working on this over several days, my coding skills are not very strong, and I have read many related posts to no avail, so any help is appreciated.
I have read this blog about sorting a list of multilingual country names that appear in a form HTML select drop down list - which is essentially what I am attempting to do.
EDIT
Commenting out the line of code below in the code above does return a list of country names, but the country names are sorted by the index value not the country name.
address_country_style_types = sorted(address_country_style_types, key=lambda x:x[1])
I have failed to sort the multi-language country names programmatically.
Instead, I copied the list into Excel and hit the sort button (based on the translated country name - the index value stays uniform), then copied the data back to the file. Works as expected - just a lot of work.
I hope that this helps someone.
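For anyone who wants to avoid the manual step: calling sorted() on the outer dictionary iterates over its keys (the language codes), so key=lambda x: x[1] compares the second character of each code rather than the country names. A minimal sketch of sorting one language's entries by name instead, written for Python 3 (the question's code uses Python 2's iterkeys). The helper names are hypothetical, and stripping accents is only a rough stand-in for proper locale-aware collation:

```python
import unicodedata
from collections import OrderedDict


def sort_key(name):
    """Strip accents and case so 'Åland' sorts next to 'Albanië' (a simplification)."""
    decomposed = unicodedata.normalize('NFKD', name)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()


def sort_countries(countries, keep_first=0):
    """Return a new OrderedDict sorted by name; entry `keep_first` stays on top if present."""
    head = [(k, v) for k, v in countries.items() if k == keep_first]
    rest = sorted(
        ((k, v) for k, v in countries.items() if k != keep_first),
        key=lambda item: sort_key(item[1]),
    )
    return OrderedDict(head + rest)


# A few entries copied from the question's 'af' list
af = OrderedDict([
    (0, " Kies 'n land of gebied"),
    (1, "Afganistan"), (2, "Åland"), (3, "Albanië"),
    (15, "Oostenryk"), (16, "Aserbeidjan"),
])

sorted_af = sort_countries(af)
print(list(sorted_af.values()))
```

Inside the view, the same helper could be applied per language code within the loop, which keeps the index values attached to their names while changing only the display order.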
