Scraping a list with Scrapy and structuring it

I'm trying to scrape every title and score from the page https://myanimelist.net/animelist/MoonlessMidnite?status=7 and return the data in this form:
{"user" : moonlessmidnite, "anime" : A, "score" : x
"user" : moonlessmidnite, "anime" : B, "score" : x
"user" : moonlessmidnite, "anime" : C, "score" : x }
... etc.
I managed to get the table rows:
table = response.xpath('.//tr[@class = "list-table-data"]')
score = table.xpath('.//td[@class = "data score"]//a/text()').extract()
title = table.xpath('.//td//a[@class = "link sort"]').extract()
but when I try to scrape the title or score, I get some weird output like:
['\n ', '\n ', '${ item.anime_title }']

Look at the raw HTML of the page:
you will see that it does indeed contain ${ item.anime_title }.
That indicates that the content is generated via JavaScript.
There's usually no easy solution for that; you have to look at the XHR requests the page makes and see whether one of them returns something meaningful.
In this case, though, if you look closely at the HTML, you will see that the data is contained in a big JSON string in the table's data-items attribute.
Try this in the scrapy shell:
fetch('https://myanimelist.net/animelist/MoonlessMidnite?status=7')
import json
json.loads(response.xpath('//table[@class="list-table"]/@data-items').extract_first())
This outputs something like this:
{'status': 2,
'score': 0,
'tags': '',
'is_rewatching': 0,
'num_watched_episodes': 1,
'anime_title': 'Hidan no Aria Special',
'anime_num_episodes': 1,
'anime_airing_status': 2,
'anime_id': 10604,
'anime_studios': None,
'anime_licensors': None,
'anime_season': None,
'has_episode_video': False,
'has_promotion_video': True,
'has_video': True,
'video_url': '/anime/10604/Hidan_no_Aria_Special/video',
'anime_url': '/anime/10604/Hidan_no_Aria_Special',
'anime_image_path': 'https://cdn.myanimelist.net/r/96x136/images/anime/2/29138.jpg?s=90cb8381c58c92d39862ac700c43f7b5',
'is_added_to_list': False,
'anime_media_type_string': 'Special',
'anime_mpaa_rating_string': 'PG-13',
'start_date_string': None,
'finish_date_string': None,
'anime_start_date_string': '12-21-11',
'anime_end_date_string': '12-21-11',
'days_string': None,
'storage_string': '',
'priority_string': 'Low'},
{'status': 6,
'score': 0,
'tags': '',
'is_rewatching': 0,
'num_watched_episodes': 0,
'anime_title': '.hack//Roots',
'anime_num_episodes': 26,
'anime_airing_status': 2,
'anime_id': 873,
'anime_studios': None,
'anime_licensors': None,
'anime_season': None,
'has_episode_video': False,
'has_promotion_video': True,
'has_video': True,
'video_url': '/anime/873/hack__Roots/video',
'anime_url': '/anime/873/hack__Roots',
'anime_image_path': 'https://cdn.myanimelist.net/r/96x136/images/anime/3/13050.jpg?s=db9ff70bf19742172f1d0140c95c4a65',
'is_added_to_list': False,
'anime_media_type_string': 'TV',
'anime_mpaa_rating_string': 'PG-13',
'start_date_string': None,
'finish_date_string': None,
'anime_start_date_string': '04-06-06',
'anime_end_date_string': '09-28-06',
'days_string': None,
'storage_string': '',
'priority_string': 'Low'}
You then just have to use this list of dicts to get the info that you need.
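As a rough sketch of that last step, the decoded list can be reshaped into the user/anime/score form asked for in the question. The `data_items` string below is a hypothetical, shortened stand-in for the attribute value, and `build_rows` is an illustrative helper name, not part of Scrapy.

```python
import json

# Hypothetical, shortened stand-in for the value of the table's data-items attribute.
data_items = (
    '[{"anime_title": "Hidan no Aria Special", "score": 0, "status": 2},'
    ' {"anime_title": ".hack//Roots", "score": 0, "status": 6}]'
)


def build_rows(user, raw_json):
    """Turn the JSON string into one {"user", "anime", "score"} dict per list entry."""
    return [
        {"user": user, "anime": entry["anime_title"], "score": entry["score"]}
        for entry in json.loads(raw_json)
    ]


rows = build_rows("moonlessmidnite", data_items)
for row in rows:
    print(row)
```

In a spider, each of these dicts could be yielded as an item from the parse callback instead of being printed.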

Related

Django, query filter not applying to prefetched data

I have a table that contains the details of cards, and another table that holds the inventory data for those cards for the users.
I am trying to get the card's data and include the inventory quantities for the cards returned. I do this by adding a prefetch for the query.
However when I try to filter the prefetched data by the user id, it is still returning data for the users other than the one that is logged in.
Code:
sql_int = r"CAST((REGEXP_MATCH(number, '\d+'))[1] as INTEGER)"
cards = magic_set_cards.objects.all().exclude(side='b').extra(select={'int': sql_int}) \
.prefetch_related(Prefetch('inventoryCards', queryset=inventory_cards.objects.filter(user_id=request.user.id))) \
.values('id', 'set_id__code', 'set_id__name', 'set_id__isOnlineOnly', 'number', 'name', 'imageUri', 'hasNonFoil', 'hasFoil', 'inventoryCards__user_id', 'inventoryCards__nonfoil', 'inventoryCards__foil') \
.order_by('-set_id__releaseDate', 'set_id__name', 'int', 'number')
Data returned:
{
'id': UUID('a7ef0985-345d-4d0d-bd71-069870ce4fd6'),
'set_id__code': 'MID',
'set_id__name': 'Innistrad: Midnight Hunt',
'set_id__isOnlineOnly': False,
'number': '46',
'name': 'Curse of Surveillance',
'imageUri': 'https://c1.scryfall.com/file/scryfall-cards/normal/front/d/6/d6a5b3b2-4f27-4c97-9d87-d7bdcc06d36a.jpg?1634348722',
'hasNonFoil': True,
'hasFoil': True,
'inventoryCards__user_id': 3,
'inventoryCards__nonfoil': 2,
'inventoryCards__foil': 0
},
{
'id': UUID('a7ef0985-345d-4d0d-bd71-069870ce4fd6'),
'set_id__code': 'MID',
'set_id__name': 'Innistrad: Midnight Hunt',
'set_id__isOnlineOnly': False,
'number': '46',
'name': 'Curse of Surveillance',
'imageUri': 'https://c1.scryfall.com/file/scryfall-cards/normal/front/d/6/d6a5b3b2-4f27-4c97-9d87-d7bdcc06d36a.jpg?1634348722',
'hasNonFoil': True,
'hasFoil': True,
'inventoryCards__user_id': 1,
'inventoryCards__nonfoil': 0,
'inventoryCards__foil': 1
}
...
As you can see from the data above, they are the same cards but it is returning the inventory data for two different users.
Please note that 'inventoryCards__user_id' will not be returned in the final version, it is there for display purpose for this question.

How to find specific script tag from a webpage using Beautifulsoup

I'm new to Python and BeautifulSoup. I'm trying to find JSON data inside a script tag. My problem is that the webpage contains many script tags.
I need to get this script tag :
<script type="text/javascript">
P.when('A').register("ImageBlockATF", function(A){
var data = {
'colorImages': { 'initial': [{"hiRes":"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SL1003_.jpg",
"thumb":"https://images-na.ssl-images-amazon.com/images/I/41lv4ReBL4L._AC_US40_.jpg",
"large":"https://images-na.ssl-images-amazon.com/images/I/41lv4ReBL4L._AC_.jpg",
"main":{"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SY355_.jpg":[355,355],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SY450_.jpg":[450,450],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX425_.jpg":[425,425],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX466_.jpg":[466,466],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX522_.jpg":[522,522],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX569_.jpg":[569,569],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX679_.jpg":[679,679]},
"variant":"MAIN","lowRes":null},{"hiRes":"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SL1005_.jpg","thumb":"https://images-na.ssl-images-amazon.com/images/I/41shdN1aAoL._AC_US40_.jpg","large":"https://images-na.ssl-images-amazon.com/images/I/41shdN1aAoL._AC_.jpg","main":{"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SY355_.jpg":[355,355],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SY450_.jpg":[450,450],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX425_.jpg":[425,425],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX466_.jpg":[466,466],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX522_.jpg":[522,522],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX569_.jpg":[569,569],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX679_.jpg":[679,679]},"variant":"PT01","lowRes":null},{"hiRes":"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SL1005_.jpg","thumb":"https://images-na.ssl-images-amazon.com/images/I/41pt8OOHsaL._AC_US40_.jpg","large":"https://images-na.ssl-images-amazon.com/images/I/41pt8OOHsaL._AC_.jpg","main":{"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SY355_.jpg":[355,355],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SY450_.jpg":[450,450],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX425_.jpg":[425,425],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX466_.jpg":[466,466],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX522_.jpg":[522,522],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX569_.jpg":[569,569],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX679_.jpg":[679,679]},"variant":"PT02","lowRes":null}]},
'colorToAsin': {'initial': {}},
'holderRatio': 1.0,
'holderMaxHeight': 700,
'heroImage': {'initial': []},
'heroVideo': {'initial': []},
'spin360ColorData': {'initial': {}},
'spin360ColorEnabled': {'initial': 0},
'spin360ConfigEnabled': false,
'spin360LazyLoadEnabled': false,
'showroomEnabled': false,
'showroomViewModel': {'initial': {}},
'playVideoInImmersiveView':true,
'useTabbedImmersiveView':true,
'totalVideoCount':'0',
'videoIngressATFSlateThumbURL':'',
'mediaTypeCount':'0',
'atfEnhancedHoverOverlay' : true,
'winningAsin': 'B08373YYCM',
'weblabs' : {},
'aibExp3Layout' : 1,
'aibRuleName' : 'frank-powered',
'acEnabled' : true,
'dp60VideoPosition': 0,
'dp60VariantList': '',
'dp60VideoThumb': '',
'dp60MainImage': 'https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SY355_.jpg',
'airyConfig' :A.$.parseJSON('{"jsUrl":"https://images-na.ssl-images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1460.0/js/airy.skin._CB485981857_.js","cssUrl":"https://images-na.ssl-images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1460.0/css/beacon._CB485971591_.css","swfUrl":"https://images-na.ssl-images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1460.0/flash/AiryBasicRenderer._CB485925577_.swf","foresterMetadataParams":{"marketplaceId":"A2VIGQ35RCS4UG","method":"Kitchen.ImageBlock","requestId":"4MGH16D6R7WCR018779W","session":"259-8488476-1037262","client":"Dpx"}}')
};
A.trigger('P.AboveTheFold'); // trigger ATF event.
return data;
});
</script>
How can I get this script tag, which starts with "P.when('A').register("ImageBlockATF", function(A){", from the webpage using a regular expression?

You can get all script tags with
import requests
from bs4 import BeautifulSoup
page = requests.get("url")
soup = BeautifulSoup(page.text, "html.parser")
results = soup.find_all("script")
and then filter them with
your_script_tag = [x for x in results if "P.when('A').register" in str(x)]
print(your_script_tag)
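Since the question asks for a regular expression, here is a minimal standard-library sketch of that route. The `html` string is a hypothetical, heavily shortened page, and `MARKER` is simply the known start of the script body from the question; real pages may need a more forgiving pattern.

```python
import re

# Hypothetical, heavily shortened page source used only for illustration.
html = """
<script type="text/javascript">var unrelated = 1;</script>
<script type="text/javascript">
P.when('A').register("ImageBlockATF", function(A){
var data = {'winningAsin': 'B08373YYCM'};
return data;
});
</script>
"""

# The known beginning of the script body we are looking for.
MARKER = 'P.when(\'A\').register("ImageBlockATF"'

# re.DOTALL lets "." span newlines; the non-greedy ".*?" stops at the first </script>.
pattern = re.compile(
    r"<script[^>]*>\s*(" + re.escape(MARKER) + r".*?)</script>",
    re.DOTALL,
)

match = pattern.search(html)
script_body = match.group(1) if match else None
print(script_body)
```

Filtering the tags found by BeautifulSoup is usually more robust than running a regex over raw HTML; the regex version is mainly useful when only this one block is needed.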

Can't fetch an item out of weird json content

I'm trying to get some items from json content. However, the structure of that json content is foreign to me and as a result I can't fetch the value of property out of it.
I've tried so far with:
import json
import requests
from bs4 import BeautifulSoup
link = 'https://www.zillow.com/homedetails/5958-SW-4th-St-Miami-FL-33144/43835884_zpid/'
def fetch_content(link):
    content = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(content.text, "lxml")
    item = soup.select_one("script#hdpApolloPreloadedData").text
    print(json.loads(item)['apiCache'])
if __name__ == '__main__':
    fetch_content(link)
The result I get running the above script is:
{"VariantQuery{\"zpid\":43835884}":{"property":{"zpid":43835884,"streetAddress":"5958 SW 4th St",
I can't process this any further because of that weird key at the front.
Expected output:
{"zpid":43835884,"streetAddress":"5958 SW 4th St", ----
How can I get the value of that property?

You can get the zpid and address out of the double-encoded JSON (the value of 'apiCache' is itself a JSON string) with:
json.loads(json.loads(item)['apiCache'])['VariantQuery{"zpid":43835884}']['property']['zpid']
Out[1889]: 43835884
json.loads(json.loads(item)['apiCache'])['VariantQuery{"zpid":43835884}']['property']['streetAddress']
Out[1890]: '5958 SW 4th St'
I noticed you can always get the zpid like this:
link = 'https://www.zillow.com/homedetails/5958-SW-4th-St-Miami-FL-33144/43835884_zpid/'
content = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(content.text,"lxml")
item = soup.select_one("script#hdpApolloPreloadedData").text
print(json.loads(item)['zpid'])
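The double decoding can be tried offline with a small stand-in for the script text. In this sketch, `inner` and `item` are hypothetical sample data shaped like the output shown in the question, and the key is looked up by its prefix rather than hard-coded.

```python
import json

# Hypothetical, shortened stand-in for the script tag's text:
# "apiCache" holds a JSON *string*, not a nested object.
inner = {
    'VariantQuery{"zpid":43835884}': {
        "property": {"zpid": 43835884, "streetAddress": "5958 SW 4th St"}
    }
}
item = json.dumps({"apiCache": json.dumps(inner)})

# Decode the outer document first, then the string stored under "apiCache".
api_cache = json.loads(json.loads(item)["apiCache"])

# Pick the key by prefix instead of hard-coding the zpid.
key = next(k for k in api_cache if k.startswith("VariantQuery"))
prop = api_cache[key]["property"]
print(prop)
```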

Just modify your function to the following. I also added another function (process_fetched_content()) to give you some more freedom. You could simply run it and it will take care of situations even when you have multiple keys that start with 'VariantQuery{"zpid":'. The final output is a dict with the keys being your zpid and the values being what you are looking for.
If you have a lot of zpid values, then this will let you accumulate them all together and then process them. The benefit is the list of keys is then the list of zpids you have.
Here's how you could use this code.
results = process_fetched_content(raw_dictionary = fetch_content(link, verbose=False))
print(results)
output:
{'43835884': {'zpid': 43835884, 'streetAddress': '5958 SW 4th St', 'zipcode': '33144', 'city': 'Miami', 'state': 'FL', 'latitude': 25.76661, 'longitude': -80.292801, 'price': 340000, 'dateSold': 1576875600000, 'bathrooms': 2, 'bedrooms': 3, 'livingArea': 1757, 'yearBuilt': 1973, 'lotSize': 4331, 'homeType': 'SINGLE_FAMILY', 'homeStatus': 'RECENTLY_SOLD', 'photoCount': 19, 'imageLink': 'https://photos.zillowstatic.com/p_g/IS7yxihwtuqmlq1000000000.jpg', 'daysOnZillow': 0, 'isFeatured': False, 'shouldHighlight': False, 'brokerId': 0, 'zestimate': 341336, 'rentZestimate': 2200, 'listing_sub_type': {}, 'priceReduction': '', 'isUnmappable': False, 'rentalPetsFlags': 128, 'mediumImageLink': 'https://photos.zillowstatic.com/p_c/IS7yxihwtuqmlq1000000000.jpg', 'isPreforeclosureAuction': False, 'homeStatusForHDP': 'RECENTLY_SOLD', 'priceForHDP': 340000, 'festimate': 341336, 'isListingOwnedByCurrentSignedInAgent': False, 'isListingClaimedByCurrentSignedInUser': False, 'hiResImageLink': 'https://photos.zillowstatic.com/p_f/IS7yxihwtuqmlq1000000000.jpg', 'watchImageLink': 'https://photos.zillowstatic.com/p_j/IS7yxihwtuqmlq1000000000.jpg', 'tvImageLink': 'https://photos.zillowstatic.com/p_m/IS7yxihwtuqmlq1000000000.jpg', 'tvCollectionImageLink': 'https://photos.zillowstatic.com/p_l/IS7yxihwtuqmlq1000000000.jpg', 'tvHighResImageLink': 'https://photos.zillowstatic.com/p_n/IS7yxihwtuqmlq1000000000.jpg', 'zillowHasRightsToImages': True, 'desktopWebHdpImageLink': 'https://photos.zillowstatic.com/p_h/IS7yxihwtuqmlq1000000000.jpg', 'isNonOwnerOccupied': False, 'hideZestimate': False, 'isPremierBuilder': False, 'isZillowOwned': False, 'currency': 'USD', 'country': 'USA', 'taxAssessedValue': 224131, 'streetAddressOnly': '5958 SW 4th St', 'unit': ' '}}
Code
import json
import requests
from bs4 import BeautifulSoup
link = 'https://www.zillow.com/homedetails/5958-SW-4th-St-Miami-FL-33144/43835884_zpid/'
def fetch_content(link, verbose=False):
    content = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(content.text, "lxml")
    item = soup.select_one("script#hdpApolloPreloadedData").text
    # 'apiCache' is itself a JSON string, so it has to be decoded a second time
    d = json.loads(item)['apiCache']
    d = json.loads(d)
    if verbose:
        print(d)
    return d
def process_fetched_content(raw_dictionary=None):
    if raw_dictionary is not None:
        keys = [k for k in raw_dictionary.keys() if k.startswith('VariantQuery{"zpid":')]
        # read from raw_dictionary (the argument) rather than from a global d
        results = dict((k.split(':')[-1].replace('}', ''), raw_dictionary.get(k).get('property', None)) for k in keys)
        return results
    else:
        return None

How to find non-retweets in a MongoDB collection of tweets?

I have a collection of about 1.4 million tweets in a MongoDB collection. I want to find all that are NOT retweets, and am using Python. The structure of a document is as follows:
{
'_id': ObjectId('59388c046b0c1901172555b9'),
'coordinates': None,
'created_at': datetime.datetime(2016, 8, 18, 17, 17, 12),
'geo': None,
'is_quote': False,
'lang': 'en',
'text': b'Adam Cole Praises Kevin Owens + A Preview For Next Week\xe2\x80\x99s',
'tw_id': 766323071976247296,
'user_id': 2231233110,
'user_lang': 'en',
'user_loc': 'main; #Kan1shk3',
'user_name': 'sheezy0',
'user_timezone': 'Chennai'
}
I can write a query that works to find the particular tweet from above:
twitter_mongo_collection.find_one({
'text': b'Adam Cole Praises Kevin Owens + A Preview For Next Week\xe2\x80\x99s'
})
But when I try to find retweets, my code doesn't work, for example I try to find any tweets that start like this:
'text': b'RT some tweet'
Using this query:
find_one( {'text': {'$regex': "/^RT/" } } )
It doesn't return an error, but it doesn't find anything. I suspect it has something to do with that 'b' at the beginning before the text starts. I know I also need to put '$not:' in there somewhere but am not sure where.
Thanks!

It looks like your regex search is trying to match the string
b'RT'
but you want to match strings like
b'RT some text afterwards'
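Try using this regex instead (in PyMongo the pattern is passed as a plain string, so the surrounding slashes would be matched literally and have to go):
find_one( {'text': {'$regex': "^RT.*" } } )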

I had to decode the 'text' field, which was stored as binary. Then I was able to use
twitter_mongo_collection.find_one({'text': {'$not': re.compile("^RT.*")}})
to find a document that does not start with "RT" (use find() instead of find_one() to get all of them).
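The decode-then-match logic can be illustrated without a database. This is only a sketch: `docs` is hypothetical sample data with `text` stored as bytes, `is_retweet` is an illustrative helper, and the query dict at the end is built in the shape used above but never sent to a server.

```python
import re

# Hypothetical sample documents; 'text' is stored as bytes, as in the question.
docs = [
    {"tw_id": 1, "text": b"RT some tweet"},
    {"tw_id": 2, "text": b"Adam Cole Praises Kevin Owens + A Preview For Next Week\xe2\x80\x99s"},
]

RETWEET = re.compile("^RT")


def is_retweet(doc):
    """Decode the binary text as UTF-8 and test whether it starts with 'RT'."""
    return RETWEET.match(doc["text"].decode("utf-8")) is not None


originals = [doc for doc in docs if not is_retweet(doc)]
print([doc["tw_id"] for doc in originals])

# Filter document for string-valued 'text' fields; only constructed here.
not_retweet_query = {"text": {"$not": re.compile("^RT")}}
```

Note that "^RT" also matches text such as "RTX ..."; a stricter pattern like "^RT " may be preferable, depending on the data.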

Extracting Multiple Fields from an Array of Dictionaries

I have an array of dictionaries that looks like this :
[{u'description': None, u'url': u'https://epi.testsite.net/index.php?/suites/view/196', u'is_completed': False, u'is_baseline': False, u'completed_on': None, u'is_master': False, u'project_id': 13, u'id': 196, u'name': u'Very Basic'}, {u'description': None, u'url': u'https://epi.testsite.net/index.php?/suites/view/200', u'is_completed': False, u'is_baseline': False, u'completed_on': None, u'is_master': False, u'project_id': 13, u'id': 200, u'name': u'Stress Testing'}]
and some Python code written to extract the 'id' field. Code is as follows :
suites_list = client.send_get ('get_suites/' + pid)
suites_list_ids = [item['id'] for item in suites_list]
return (suites_list_ids)
suites_list generates the data above; suites_list_ids generates a tidy output as follows :
[196, 200]
I would like to pull a second field, 'name', and have it included in the output. The desired result looks like this :
[ {196,'Very Basic'}, {200, 'Stress Testing'} ]
I have been burning many cycles on this one and probably am overlooking something simple. Appreciate any advice.
Dan.

You can do something like this:
suites_list_vals = [(item['id'], item['name']) for item in suites_list]
output:
[(196, 'Very Basic'), (200, 'Stress Testing')]
That is a list of tuples. To iterate on the object you can do something like this:
for val in suites_list_vals:
print(val[0], ':', val[1])
output:
196 : Very Basic
200 : Stress Testing
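If named fields are preferred over positional tuples, two dict-based variants are sketched below. The `suites_list` here is sample data shortened from the question, and `fields`, `suites_subset`, and `names_by_id` are illustrative names.

```python
# Sample data shortened from the question (only a few fields kept).
suites_list = [
    {"id": 196, "name": "Very Basic", "project_id": 13},
    {"id": 200, "name": "Stress Testing", "project_id": 13},
]

# One small dict per suite, keeping only the selected fields.
fields = ("id", "name")
suites_subset = [{key: item[key] for key in fields} for item in suites_list]

# A single id -> name mapping, convenient for lookups.
names_by_id = {item["id"]: item["name"] for item in suites_list}

print(suites_subset)
print(names_by_id[200])
```

Adding another field then only requires extending `fields`, whereas the tuple version needs every index-based access to be revisited.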
