Exclude part of string with Regex in python with web scraping - python

I'm trying to scrape some data from an e-commerce website for a personal project. I'm trying to build a nested list of strings from the html but am having an issue with one part of the html.
Each list item appears as the following:
<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy","price":12.95,"brand":"Lush","category":"Bubble Bar","variant":"7 oz.","quantity":1,"list":"/bath/bubble-bars/sleepy/9999901920.html","dimension11":"","dimension12":"Naked,Self Preserving,Vegan","dimension13":1,"dimension14":1,"dimension15":true}\'></div>
What I have now is a regex that extracts the contents of the data-impressions attribute and then splits them at each comma:
list_return = [re.findall('\{([^{]+[^}\'></div>])', i) for i in bathshower_impressions]
list_return = [re.split(',', list_return[i][0]) for i in range(0, len(list_return))]
Which gives me a list of lists of lists for each thing which will become a key:value pair in a dictionary. For the example above here is what the second level item would be:
[['"id"', '"01920"'],
['"name"', '"Sleepy"'],
['"price"', '12.95'],
['"brand"', '"Lush"'],
['"category"', '"Bubble Bar"'],
['"variant"', '"7 oz."'],
['"quantity"', '1'],
['"list"', '"/bath/bubble-bars/sleepy/9999901920.html"'],
['"dimension11"', '""'],
['"dimension12"', '"Naked'],
['Self Preserving'],
['Vegan"'],
['"dimension13"', '1'],
['"dimension14"', '1'],
['"dimension15"', 'true']]
My problem is with dimension12, I can't figure out how to exclude that dimension from splitting at the comma, so that that list would appear as:
['"dimension12"', '"Naked,Self Preserving,Vegan"']
Any help is appreciated, thanks.

I'd like to suggest a slightly different approach. That attribute value looks like JSON, so why not use the json module? That way, you get a ready-made data structure that you can modify further.
import json
from bs4 import BeautifulSoup
html_list = [
"""<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy","price":12.95,"brand":"Lush","category":"Bubble Bar","variant":"7 oz.","quantity":1,"list":"/bath/bubble-bars/sleepy/9999901920.html","dimension11":"","dimension12":"Naked,Self Preserving,Vegan","dimension13":1,"dimension14":1,"dimension15":true}\'></div>""",
]
data_structures = []
for html_item in html_list:
    soup = BeautifulSoup(html_item, "html.parser").find("div", {"class": "impressions"})
    data_structures.append(json.loads(soup["data-impressions"]))
print(data_structures)
This outputs a list of dictionaries:
[{'id': '01920', 'name': 'Sleepy', 'price': 12.95, 'brand': 'Lush', 'category': 'Bubble Bar', 'variant': '7 oz.', 'quantity': 1, 'list': '/bath/bubble-bars/sleepy/9999901920.html', 'dimension11': '', 'dimension12': 'Naked,Self Preserving,Vegan', 'dimension13': 1, 'dimension14': 1, 'dimension15': True}]
To access the desired key, just do this:
for data_item in data_structures:
    print(data_item["dimension12"])
Prints: Naked,Self Preserving,Vegan
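If you still want the nested [key, value] lists from the question, they can be built from the parsed dictionary rather than from a comma split. The following is a minimal sketch using only the standard library, assuming the attribute always holds valid JSON as in the sample. The class name ImpressionsParser, the shortened HTML string, and the variable names are made up for illustration.

```python
import json
from html.parser import HTMLParser


class ImpressionsParser(HTMLParser):
    """Collects the data-impressions attribute of every div with class "impressions"."""

    def __init__(self):
        super().__init__()
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "div" and "impressions" in classes:
            raw = attributes.get("data-impressions")
            if raw is not None:
                self.payloads.append(raw)


# Shortened, hypothetical copy of one list item from the question.
html = """<div class="impressions" data-impressions='{"id":"01920","name":"Sleepy","dimension12":"Naked,Self Preserving,Vegan","dimension13":1}'></div>"""

parser = ImpressionsParser()
parser.feed(html)

# json.loads keeps commas inside quoted values intact, so no regex split is needed.
items = [json.loads(raw) for raw in parser.payloads]

# Rebuild the [key, value] pairs from the question, one list per product.
pairs = [[[key, value] for key, value in item.items()] for item in items]
print(pairs[0][2])
```

Here pairs[0][2] is the dimension12 entry, with its comma-separated value kept as a single string.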

Related

How do I remove the text in URL from the brackets, and organize the attributes into a list?

This is my code to extract the text from the url:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
from re import findall
url = 'https://api.coolcatsnft.com/cat/6003'
c = requests.get(url).text
soup = BeautifulSoup(c, 'html.parser')
with open("coolcat.txt", "w") as file:
    file.write(str(soup))
mylines = []
with open('coolcat.txt', 'rt') as myfile:
    for myline in myfile:
        mylines.append(myline)
print(mylines)
which outputs this:
['{"description":"Cool Cats is a collection of 9,999 randomly generated and stylistically curated NFTs that exist on the Ethereum Blockchain. Cool Cat holders can participate in exclusive events such as NFT claims, raffles, community giveaways, and more. Remember, all cats are cool, but some are cooler than others. Visit [www.coolcatsnft.com](https://www.coolcatsnft.com/) to learn more.","image":"https://ipfs.io/ipfs/QmcgQpxMNRw4Zt5B1tkDZfA4aAWCBhv6JVw4YnLync2x4e","name":"Cool Cat #6003","attributes":[{"trait_type":"body","value":"blue cat skin"},{"trait_type":"hats","value":"headband red"},{"trait_type":"shirt","value":"toga"},{"trait_type":"face","value":"ninja red"},{"trait_type":"tier","value":"cool_2"}],"points":{"Body":0,"Hats":1,"Shirt":2,"Face":1},"ipfs_image":"https://ipfs.io/ipfs/QmcgQpxMNRw4Zt5B1tkDZfA4aAWCBhv6JVw4YnLync2x4e","google_image":"https://drive.google.com/uc?id=1zI7STVzE6sEBPzfjs__ejqphxz9QUhLu"}']
How do I extract just the traits and organize them into a list?
Conveniently, your strings happen to be dictionaries. Is it always the case?
You can map your string list to a list of dictionaries:
li = ['{"description":"Cool Cats is a collection of 9,999 randomly generated and stylistically curated NFTs that exist on the Ethereum Blockchain. Cool Cat holders can participate in exclusive events such as NFT claims, raffles, community giveaways, and more. Remember, all cats are cool, but some are cooler than others. Visit [www.coolcatsnft.com](https://www.coolcatsnft.com/) to learn more.","image":"https://ipfs.io/ipfs/QmcgQpxMNRw4Zt5B1tkDZfA4aAWCBhv6JVw4YnLync2x4e","name":"Cool Cat #6003","attributes":[{"trait_type":"body","value":"blue cat skin"},{"trait_type":"hats","value":"headband red"},{"trait_type":"shirt","value":"toga"},{"trait_type":"face","value":"ninja red"},{"trait_type":"tier","value":"cool_2"}],"points":{"Body":0,"Hats":1,"Shirt":2,"Face":1},"ipfs_image":"https://ipfs.io/ipfs/QmcgQpxMNRw4Zt5B1tkDZfA4aAWCBhv6JVw4YnLync2x4e","google_image":"https://drive.google.com/uc?id=1zI7STVzE6sEBPzfjs__ejqphxz9QUhLu"}']
dic_list = list(map(eval, li))
for dic in dic_list:
    traits = dic["attributes"]
    print(traits)
You then print the values of the dictionaries.
The above example outputs:
[{'trait_type': 'body', 'value': 'blue cat skin'}, {'trait_type': 'hats', 'value': 'headband red'}, {'trait_type': 'shirt', 'value': 'toga'}, {'trait_type': 'face', 'value': 'ninja red'}, {'trait_type': 'tier', 'value': 'cool_2'}]
You might consider reading the content of coolcat.txt with json.load though.
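A minimal sketch of that json-based route, assuming the response body is valid JSON (it is for the sample shown). The payload below is shortened by hand for illustration, and the names raw, cat, and traits are placeholders.

```python
import json

# Shortened copy of the response body from the question (assumed to be valid JSON).
raw = (
    '{"name":"Cool Cat #6003","attributes":['
    '{"trait_type":"body","value":"blue cat skin"},'
    '{"trait_type":"hats","value":"headband red"},'
    '{"trait_type":"tier","value":"cool_2"}]}'
)

cat = json.loads(raw)

# Keep the traits as a list of (trait_type, value) tuples.
traits = [(attr["trait_type"], attr["value"]) for attr in cat["attributes"]]
print(traits)
```

With requests, requests.get(url).json() returns the same dictionary directly, which avoids the temporary file and the BeautifulSoup step.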

How to extract all text between certain characters with Python re

I'm trying to extract all text between certain characters but my current code simply returns an empty list. Each row has a long text string that looks like this:
"[{'index': 0, 'spent_transaction_hash': '4b3e9741022d4', 'spent_output_index': 68, 'script_asm': '3045022100e9e2280f5e6d965ced44', 'value': Decimal('381094.000000000')}\n {'index': 1, 'spent_transaction_hash': '0cfbd8591a3423', 'spent_output_index': 2, 'script_asm': '3045022100a', 'value': Decimal('3790496.000000000')}]"
I just need the values for "spent_transaction_hash". For example, I'd like to create a new column that has a list of ['4b3e9741022d4', '0cfbd8591a3423']. I'm trying to extract the values between 'spent_transaction_hash': and the comma. Here's my current code:
my_list = []
for row in df['column']:
    value = re.findall(r'''spent_transaction_hash'\: \(\[\'(.*?)\'\]''', row)
    my_list.append(value)
This code simply returns a blank list. Could anyone please tell me which part of my code is wrong?
Is this what you're looking for? 'spent_transaction_hash'\: '([a-z0-9]+)'
Test: https://regex101.com/r/cnviyS/1
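The pattern in the question returns nothing because it requires a literal ([' after the key, which never occurs in the text. Below is a small sketch of the suggested pattern applied with re.findall. The row string is a shortened copy of the sample, and the variable names are only illustrative.

```python
import re

# Shortened copy of one row from the question.
row = (
    "[{'index': 0, 'spent_transaction_hash': '4b3e9741022d4', 'spent_output_index': 68}\n"
    " {'index': 1, 'spent_transaction_hash': '0cfbd8591a3423', 'spent_output_index': 2}]"
)

# The capture group grabs the lowercase letters and digits between the quotes after the key.
pattern = re.compile(r"'spent_transaction_hash': '([a-z0-9]+)'")

hashes = pattern.findall(row)
print(hashes)
```

For a whole column, something like df['column'].apply(pattern.findall) should give one list of hashes per row.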
Since it looks like you already have a list of Python dict objects, but in string format, why not just eval it and grab the desired keys? Of course, with that approach you don't need the regex matching that the question asked about. Note that eval executes whatever the string contains, so only use it on data you trust.
from decimal import Decimal
v = """\
[{'index': 0, 'spent_transaction_hash': '4b3e9741022d4', 'spent_output_index': 68, 'script_asm': '3045022100e9e2280f5e6d965ced44', 'value': Decimal('381094.000000000')}\n {'index': 1, 'spent_transaction_hash': '0cfbd8591a3423', 'spent_output_index': 2, 'script_asm': '3045022100a', 'value': Decimal('3790496.000000000')}]\
"""
L = eval(v.replace('\n', ','))
hashes = [e['spent_transaction_hash'] for e in L]
print(hashes)
# ['4b3e9741022d4', '0cfbd8591a3423']

Converting a list to a dictionary in Python

I'm trying to convert a list that is in the form of a dictionary to an actual dictionary.
This is for a web scraping tool. I've tried removing the single quotes and setting the result as a dictionary, but I am new to programming and I think my logic is off in some way.
My list is of the form
['"name":"jack"', '"address":"1234 College Ave"']
I am trying to convert this general form to a dictionary of the form
{"name":"jack", "address":"1234 College Ave"}
You can convert it to a string JSON representation then use json.loads.
>>> import json
>>> data = ['"name":"jack"', '"address":"1234 College Ave"']
>>> json.loads('{' + ', '.join(data) + '}')
{'name': 'jack', 'address': '1234 College Ave'}
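l = ['"name":"jack"', '"address":"1234 College Ave"']
# Split each item on the first colon only, so values that contain ":" stay whole,
# then strip the surrounding double quotes from the key and the value.
d = {key.strip('"'): value.strip('"') for key, value in (elem.split(":", 1) for elem in l)}
print(d)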
One way to tackle this is to fix each individual string before passing it to json.loads.
inp = ['"name":"jack"', '"address":"1234 College Ave"']
import json
result = {}
for item in inp:
    result.update(json.loads("{" + item + "}"))
print(result)
{'name': 'jack', 'address': '1234 College Ave'}
However, ideally you should be getting data in a better format and not have to rely on manipulating the data before being able to use it. Fix this problem "upstream" if you can.

How to automatically take each word from a list

As the title says, I crawled Diigo and collected many tag lists, which I turn into sets. Like this:
data = [ ['spanish', 'web2.0', 'e-learning', 'education', 'social', 'spain', 'tools', 'learning', 'google', 'e-learning2.0'], ['education', 'technology', 'learning', 'classroom', 'students', 'web2.0'], ['education'], ['technology'] ]
Then I do some calculations:
search_table = {}
for i, tag_list in enumerate(data):
    for tag in tag_list:
        if tag not in search_table:
            search_table[tag] = set()
        search_table[tag].add(i)
# How many people have `technology`?
print(len(search_table["technology"]))
# How many people have `education`?
print(len(search_table["education"]))
# How many people have both `technology`, `education`?
print(len(search_table["technology"] & search_table["education"]))
The data has many tags. I want print(len(search_table["technology"])) to run automatically for each word, so that "technology" changes to the next word, such as "classroom".
I really don't know how to do this. All I can think of is:
for u in user_data:
    print(u)
but how do I put that word into print(len(search_table[" u "]))?
Sincerely, Bob
I think I understand what you mean. You were nearly there:
user_data = ["technology", "classroom"]
for u in user_data:
    print(len(search_table[u]))
will first print the number of items in search_table["technology"] and then print the number of items in search_table["classroom"].
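To cover every tag without typing the names by hand, the keys of search_table can be looped over directly. A small sketch, assuming the data list from the question (shortened here). The name tag_counts and the use of defaultdict are choices made for this illustration, not part of the original code.

```python
from collections import defaultdict

# Shortened version of the tag lists from the question, one list per user.
data = [
    ["spanish", "web2.0", "education", "learning"],
    ["education", "technology", "learning", "web2.0"],
    ["education"],
    ["technology"],
]

# Map each tag to the set of user indexes that carry it.
search_table = defaultdict(set)
for i, tag_list in enumerate(data):
    for tag in tag_list:
        search_table[tag].add(i)

# Count the users per tag for every tag that occurs in the data.
tag_counts = {tag: len(users) for tag, users in search_table.items()}
for tag, count in sorted(tag_counts.items()):
    print(tag, count)
```

The loop variable takes the place of the hard-coded "technology" string, so no quoting of u is needed.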
When you're working with lists, you access elements of the list using numbers so you will not need to change the word 'code'. You would merely access it like this:
>>> user_data = ['code','java','learn']
>>> user_data[0]
'code'
>>>
Generally, when you're accessing elements like user_data["code"], it's because you're accessing keys of a dictionary like so:
>>> user_data = {'code':'java, python, ruby'}
>>> user_data['code']
'java, python, ruby'
How you store the information determines how you access it. Considering you have user data, you will probably want to store it as dictionaries in a list, like:
[
{'name': 'bob', 'code': 'java, python', 'school':'StackOU'},
{'name': 'bobina', ...
]
You'd access bob's coding skills like:
>>> user_data = [
... {'name': 'bob', 'code': 'java, python', 'school':'StackOU'},
... ]
>>> user_data[0]['code']
'java, python'

Python: Extract info from xml to dictionary

I need to extract information from an xml file, isolate it from the xml tags before and after, store the information in a dictionary, then loop through the dictionary to print a list. I am an absolute beginner so I'd like to keep it as simple as possible and I apologize if how I've described what I'd like to do doesn't make much sense.
Here is what I have so far.
for line in open("/people.xml"):
    if "name" in line:
        print(line)
    if "age" in line:
        print(line)
Current Output:
<name>John</name>
<age>14</age>
<name>Kevin</name>
<age>10</age>
<name>Billy</name>
<age>12</age>
Desired Output
Name Age
John 14
Kevin 10
Billy 12
edit- So using the code below I can get the output:
{'Billy': '12', 'John': '14', 'Kevin': '10'}
Does anyone know how to get from this to a chart with headers like my desired output?
Try xmldict (converts XML to Python dictionaries, and vice versa):
>>> xmldict.xml_to_dict('''
... <root>
... <persons>
... <person>
... <name first="foo" last="bar" />
... </person>
... <person>
... <name first="baz" last="bar" />
... </person>
... </persons>
... </root>
... ''')
{'root': {'persons': {'person': [{'name': {'last': 'bar', 'first': 'foo'}}, {'name': {'last': 'bar', 'first': 'baz'}}]}}}
# Converting dictionary to xml
>>> xmldict.dict_to_xml({'root': {'persons': {'person': [{'name': {'last': 'bar', 'first': 'foo'}}, {'name': {'last': 'bar', 'first': 'baz'}}]}}})
'<root><persons><person><name><last>bar</last><first>foo</first></name></person><person><name><last>bar</last><first>baz</first></name></person></persons></root>'
Or try xmlmapper (a list of Python dictionaries with parent-child relationships):
>>> myxml='''<?xml version='1.0' encoding='us-ascii'?>
<slideshow title="Sample Slide Show" date="2012-12-31" author="Yours Truly" >
<slide type="all">
<title>Overview</title>
<item>Why
<em>WonderWidgets</em>
are great
</item>
<item/>
<item>Who
<em>buys</em>
WonderWidgets1
</item>
</slide>
</slideshow>'''
>>> x=xml_to_dict(myxml)
>>> for s in x:
print s
>>>
{'text': '', 'tail': None, 'tag': 'slideshow', 'xmlinfo': {'ownid': 1, 'parentid': 0}, 'xmlattb': {'date': '2012-12-31', 'author': 'Yours Truly', 'title': 'Sample Slide Show'}}
{'text': '', 'tail': '', 'tag': 'slide', 'xmlinfo': {'ownid': 2, 'parentid': 1}, 'xmlattb': {'type': 'all'}}
{'text': 'Overview', 'tail': '', 'tag': 'title', 'xmlinfo': {'ownid': 3, 'parentid': 2}, 'xmlattb': {}}
{'text': 'Why', 'tail': '', 'tag': 'item', 'xmlinfo': {'ownid': 4, 'parentid': 2}, 'xmlattb': {}}
{'text': 'WonderWidgets', 'tail': 'are great', 'tag': 'em', 'xmlinfo': {'ownid': 5, 'parentid': 4}, 'xmlattb': {}}
{'text': None, 'tail': '', 'tag': 'item', 'xmlinfo': {'ownid': 6, 'parentid': 2}, 'xmlattb': {}}
{'text': 'Who', 'tail': '', 'tag': 'item', 'xmlinfo': {'ownid': 7, 'parentid': 2}, 'xmlattb': {}}
{'text': 'buys', 'tail': 'WonderWidgets1', 'tag': 'em', 'xmlinfo': {'ownid': 8, 'parentid': 7}, 'xmlattb': {}}
The code above gives a generator. When you iterate over it, you get the information in dict keys such as tag, text, xmlattb, and tail, plus additional information in xmlinfo. The root element has a parentid of 0.
Use an XML parser for this. For example,
import xml.etree.ElementTree as ET
doc = ET.parse('people.xml')
names = [name.text for name in doc.findall('.//name')]
ages = [age.text for age in doc.findall('.//age')]
people = dict(zip(names,ages))
print(people)
# {'Billy': '12', 'John': '14', 'Kevin': '10'}
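For the follow-up about printing a chart with headers, here is one possible sketch. The question only shows the name and age lines, so the surrounding people and person elements in the sample XML are an assumption about the file's layout. The variable names and the column width of 8 are arbitrary choices.

```python
import xml.etree.ElementTree as ET

# Assumed layout of people.xml; the question only shows the <name> and <age> lines.
xml_text = """<people>
  <person><name>John</name><age>14</age></person>
  <person><name>Kevin</name><age>10</age></person>
  <person><name>Billy</name><age>12</age></person>
</people>"""

root = ET.fromstring(xml_text)

# Pair each name with the age that follows it, preserving file order.
names = [name.text for name in root.iter("name")]
ages = [age.text for age in root.iter("age")]
rows = list(zip(names, ages))

# Left-align the first column to a fixed width so the header lines up with the rows.
lines = ["{:<8}{}".format("Name", "Age")]
lines += ["{:<8}{}".format(name, age) for name, age in rows]
print("\n".join(lines))
```

To read from the file instead, replace ET.fromstring(xml_text) with ET.parse("people.xml").getroot(). A list of tuples is used rather than a dict so that two people with the same name are both kept.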
It seems to me that this is an exercise in learning how to parse this XML manually rather than simply pulling a library out of the bag to do it for you. If I am wrong, I suggest watching the udacity video by Steve Huffman that can be found here: http://www.udacity.com/view#Course/cs253/CourseRev/apr2012/Unit/362001/Nugget/365002. He explains how to use the minidom module to parse lightweight xml files such as these.
Now, the first point I want to make in my answer is that you don't want to create a Python dictionary to print all of these values. A Python dictionary is simply a set of keys that correspond to values. Before Python 3.7 there was no guaranteed ordering to them, so traversal in the order they appeared in the file was a pain in the butt, and in any version two people with the same name would collapse into a single key. You are trying to print out all of the names together with their corresponding ages, so a data structure like a list of tuples would probably be better suited to collating your data.
It seems like the structure of your xml file is such that each name tag is succeeded by an age tag that corresponds to it. There also seems to only be a single name tag per line. This makes matters fairly simple. I'm not going to write the most efficient or universal solution to this problem, but instead I will try to make the code as simple to understand as I can.
So let's first create a list to store the data:
a_list = []
Now open your file, and initialize a couple of variables to hold each name and age:
# extract_name and extract_age are defined below; place those definitions
# above this loop when you assemble the script.
with open("/people.xml") as f:
    name, age = None, None  # current name and age while scanning the file
    for line in f:
        name = extract_name(line, name)
        age = extract_age(line)
        if age:  # an age completes a person, so store the pair
            a_list.append((name, age))
            name, age = None, None  # reset before reading the next person
Now for each line in the file, we wanted to determine whether it contains a user. If it did, we wanted to extract the name. Let's create a function used to do this:
def extract_name(a_line, name):
    # a_line is the current line; name is the value carried over from earlier lines.
    if name:  # keep a name that is already set until its age has been found
        return name
    if "<name>" not in a_line:  # no name on this line
        return None
    name_pos = a_line.find("<name>") + 6  # skip past the 6 characters of "<name>"
    end_pos = a_line.find("</name>")
    return a_line[name_pos:end_pos]
Now, we must create a function to parse the line for a user's age. We can do this in a similar way to the previous function, but we know that once we have an age, it will be added into the list immediately. As such, we never need to concern ourselves with age's previous value. The function can therefore look like this:
def extract_age(a_line):
    if "<age>" not in a_line:  # no age on this line
        return None
    age_pos = a_line.find("<age>") + 5  # skip past the 5 characters of "<age>"
    end_pos = a_line.find("</age>")
    return a_line[age_pos:end_pos]
Finally, you want to print the list. You might do it as follows:
for item in a_list:
    print('\t'.join(item))
Hope this helped. I haven't tested out my code, so it might still be slightly buggy. The concepts are there, though. :)
Here's another way using lxml library:
from lxml import objectify
def xml_to_dict(xml_str):
    """ Convert xml to dict, using lxml v3.4.2 xml processing library, see http://lxml.de/ """
    def xml_to_dict_recursion(xml_object):
        dict_object = xml_object.__dict__
        if not dict_object:  # if empty dict returned
            return xml_object
        for key, value in dict_object.items():
            dict_object[key] = xml_to_dict_recursion(value)
        return dict_object
    return xml_to_dict_recursion(objectify.fromstring(xml_str))
# Bytes literal: on Python 3, lxml rejects str input that carries an encoding declaration.
xml_string = b"""<?xml version="1.0" encoding="UTF-8"?><Response><NewOrderResp>
<IndustryType>Test</IndustryType><SomeData><SomeNestedData1>1234</SomeNestedData1>
<SomeNestedData2>3455</SomeNestedData2></SomeData></NewOrderResp></Response>"""
print(xml_to_dict(xml_string))
To preserve the parent node, use this instead:
def xml_to_dict(xml_str):
    """ Convert xml to dict, using lxml v3.4.2 xml processing library, see http://lxml.de/ """
    def xml_to_dict_recursion(xml_object):
        dict_object = xml_object.__dict__
        if not dict_object:  # if empty dict returned
            return xml_object
        for key, value in dict_object.items():
            dict_object[key] = xml_to_dict_recursion(value)
        return dict_object
    xml_obj = objectify.fromstring(xml_str)
    return {xml_obj.tag: xml_to_dict_recursion(xml_obj)}
And if you want to only return a subtree and convert it to dict, you can use Element.find() :
xml_obj.find('.//') # lxml.objectify.ObjectifiedElement instance
See lxml documentation.
