proper way to store scraped HTML for re-parsing? - python

I'm scraping lots of pages and storing the source in a Postgres database, then deciding which bits of information to parse out. I either parse with Postgres' own regex functions, which is super fast, or with Python and BeautifulSoup, row by row, which is perhaps more "proper" but much, much slower.
I'm wondering if I should convert the source to JSON and store it in a JSONB field. That seems faster because the JSON can be indexed... am I wrong? Or should I switch to MongoDB? I just feel that there must be a faster way. Let's assume, for the purposes of argument, that I cannot determine in advance all the data that needs to be parsed. Suggestions?
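For concreteness, here's the kind of thing I mean, a minimal sketch assuming a hypothetical pages(url, source, extracted jsonb) table and psycopg2; I'd keep the raw source and store whatever I can extract up front as JSONB:

import psycopg2
from psycopg2.extras import Json
from bs4 import BeautifulSoup

def store_page(conn, url, html):
    # Parse whatever structure is known in advance; the raw source is
    # kept so anything missed can still be re-parsed later.
    soup = BeautifulSoup(html, "html.parser")
    extracted = {
        "title": soup.title.string if soup.title else None,
        "links": [a.get("href") for a in soup.find_all("a", href=True)],
    }
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO pages (url, source, extracted)
            VALUES (%s, %s, %s)
            ON CONFLICT (url) DO UPDATE
                SET source = EXCLUDED.source,
                    extracted = EXCLUDED.extracted
            """,
            (url, html, Json(extracted)),
        )
    conn.commit()

My understanding is that a GIN index (CREATE INDEX ON pages USING gin (extracted)) would only speed up queries over keys I've already extracted; wrapping the raw HTML itself in JSON wouldn't make the markup any more indexable.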

Related

Storing, modifying and manipulating web scraped data

I'm working on a Python web scraper that pulls data from a car advertising site. I got the scraping part all done with BeautifulSoup, but I've run into many difficulties trying to store and modify the data. I would really appreciate some advice on this part, since I'm lacking knowledge in this area.
So here is what I want to do:
Scrape the data each hour (done).
Store scraped data as a dictionary in a .JSON file (done).
Every time an ad_link is not found in scraped_data.json, set dict['Status'] = 'Inactive' (done).
If a car's price changes, print a notification and add the old price to the dictionary. On this part I came across many challenges with the JSON approach.
I've been using two .json files and comparing them to each other (scraped_data_temp, permanent_data.json), but I think this is far from the best method.
What would you guys suggest? How should I do this?
What would be the best way to approach manipulating this kind of data? (Databases maybe? I've got no experience with them, but I'm eager to learn.) And what would be a good way to represent this kind of data, pygal?
Thank you very much.
If you have larger amounts of data, I would definitely recommend using some kind of DB. If you don't need a DB server, you can use SQLite; I have used it in the past to save bigger data locally. You can use SQLAlchemy in Python to interact with databases.
As for displaying data, I tend to use matplotlib. It's extremely flexible and has extensive documentation and examples, so you can adjust the output to your liking: graphs, charts, etc.
I'm assuming that you are using Python 3.
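A rough sketch of what that could look like with SQLAlchemy 1.4+ and SQLite (table and column names are made up for illustration):

from datetime import datetime
from sqlalchemy import create_engine, Column, Integer, String, Float, DateTime
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Ad(Base):
    __tablename__ = "ads"
    id = Column(Integer, primary_key=True)
    link = Column(String, unique=True, nullable=False)
    price = Column(Float)
    status = Column(String, default="Active")

class PriceChange(Base):
    __tablename__ = "price_changes"
    id = Column(Integer, primary_key=True)
    ad_id = Column(Integer, nullable=False)
    old_price = Column(Float)
    changed_at = Column(DateTime, default=datetime.utcnow)

engine = create_engine("sqlite:///ads.db")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def upsert_ad(session, link, price):
    # Insert a new ad, or record a price change on an existing one.
    ad = session.query(Ad).filter_by(link=link).one_or_none()
    if ad is None:
        session.add(Ad(link=link, price=price))
    elif ad.price != price:
        print(f"Price changed for {link}: {ad.price} -> {price}")
        session.add(PriceChange(ad_id=ad.id, old_price=ad.price))
        ad.price = price
    session.commit()

With the history in its own table you no longer need the second JSON file; marking ads 'Inactive' becomes an UPDATE on rows whose link didn't appear in the latest scrape.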

How to improve the write speed to sql database using python

I'm trying to find a better way to push data to a SQL database using Python. I have tried
the dataframe.to_sql() method and cursor.fast_executemany(),
but they don't seem to increase the speed with the data I'm working with right now (the data is in CSV files). Someone suggested that I could use named tuples and generators to load data much faster than pandas can.
[Generally the CSV files are at least 1 GB in size and it takes around 10-17 minutes to push one file.]
I'm fairly new to many of Python's concepts, so please suggest a method or at least a reference to an article with more info. Thanks in advance.
If you are trying to insert the CSV as-is into the database (i.e. without doing any processing in pandas), you could use SQLAlchemy in Python to execute a "BULK INSERT [params, file, etc.]". Alternatively, I've found that reading the CSVs, processing, writing back to CSV, and then bulk inserting can be an option.
Otherwise, feel free to specify a bit more what you want to accomplish, how you need to process the data before inserting it into the DB, etc.
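A hedged sketch of the BULK INSERT route, assuming SQL Server behind pyodbc; the connection string, table name and file path are placeholders, and the path has to be visible to the database server itself, not to the client running Python:

from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@my_dsn")

bulk_insert = text("""
    BULK INSERT staging_table
    FROM 'C:\\data\\input.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')
""")

with engine.begin() as conn:   # begin() commits on success
    conn.execute(bulk_insert)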

Is there an existing library to parse Wikipedia tables from a dump?

I need to extract the data from tables in a wiki dump in a somewhat convenient form, e.g. a list of lists. However, due to the format of the dump this looks rather tricky. I am aware of WikiExtractor, which is useful for getting clean text from a dump, but it drops tables altogether. Is there a parser that would get me conveniently readable tables in the same way?
I did not manage to find a good way to parse Wikipedia tables from the XML dumps. However, there seem to be some ways to do it with HTML parsers, e.g. the wikitables parser. This would require a lot of scraping unless you only need to analyze the tables from specific pages. It also seems possible to do it offline, as HTML Wiki dumps appear to be about to resume (dumps, phabricator task).
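If the tables you need live on a known set of pages, something like pandas.read_html (an alternative to the wikitables library; it needs lxml or html5lib installed) gets you most of the way to a list of lists. The URL below is just an example:

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_sovereign_states"
tables = pd.read_html(url)   # one DataFrame per <table> on the page

# Convert the first table to a plain list of lists, header row first.
rows = [tables[0].columns.tolist()] + tables[0].values.tolist()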

Using Beautifulsoup and regex to traverse javascript in page

I'm fetching webpages with a bunch of JavaScript on them, and I'm interested in parsing through the JavaScript portion of the pages for certain relevant info. Right now I have the following code in Python/BeautifulSoup/regex:
scriptResults = soup('script', {'type': 'text/javascript'})
which yields an array of scripts, of which I can use a for loop to search for text I'd like:
for script in scriptResults:
    for block in script:
        if *patterniwant* in block:
            **extract pattern from line using regex**
(Text in asterisks is pseudocode, of course.)
I was wondering if there was a better way for me to just use regex to find the pattern in the soup itself, searching only through the scripts themselves? My implementation works, but it just seems really clunky so I wanted something more elegant and/or efficient and/or Pythonic.
Thanks in advance!
A lot of websites have client-side data in JSON format. In that case I would suggest extracting the JSON part from the JavaScript code and parsing it with Python's json module (e.g. json.loads). As a result you will get a standard dictionary object.
Another option is to check with your browser what sort of AJAX requests the application makes. Quite often these also return structured data as JSON.
I would also check whether the page has any structured data already available (e.g. OpenGraph, microformats, RDFa, RSS feeds). A lot of websites include these to improve page SEO and to integrate better with social network sharing.
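A minimal sketch of the first suggestion; the pageData variable name and the regex are only illustrative, so adjust them to whatever pattern the page actually embeds:

import json
import re
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/page").text   # placeholder URL
soup = BeautifulSoup(html, "html.parser")

for script in soup.find_all("script", {"type": "text/javascript"}):
    if not script.string:
        continue
    # Grab the object literal assigned to a known variable.
    match = re.search(r"var\s+pageData\s*=\s*(\{.*?\});", script.string, re.DOTALL)
    if match:
        data = json.loads(match.group(1))   # now a normal Python dict
        break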

Efficient way to check whether a page changed (while storing as little info as possible)?

I have some webpages where I'm collecting data over time. I don't care about the content itself, just whether the page has changed.
Currently, I use Python's requests.get to fetch a page, hash the page (md5), and store that hash value to compare in the future.
Is there a computationally cheaper or smaller-storage strategy for this? Things work now; I just wanted to check if there's a better/cheaper way. :)
You can keep track of the date of the last version you got and send the If-Modified-Since header in your request. However, some resources ignore that header. (In general it's difficult to support for dynamically generated content.) In that case you'll have to fall back to a less efficient method.
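A rough sketch of that approach; the simplest robust variant is to echo back the Last-Modified value from the previous response (the URL is a placeholder):

import requests

url = "https://example.com/page"

first = requests.get(url)
last_modified = first.headers.get("Last-Modified")   # may be None

headers = {"If-Modified-Since": last_modified} if last_modified else {}
second = requests.get(url, headers=headers)

if second.status_code == 304:
    print("Not modified since last fetch")
else:
    print("Changed, or the server ignored the conditional request")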
A hash would be the most reliable basis for change detection. I would use CRC32: it's only 32 bits, as opposed to 128 bits for MD5, and even in-browser JavaScript implementations can be very fast. I have personal experience improving the speed of a JS implementation of CRC32 for very large datasets.
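A quick sketch with the standard library; zlib.crc32 returns an unsigned 32-bit integer, so the stored fingerprint is a quarter the size of an MD5 digest. CRC32 does collide more easily than MD5, but for checking whether a single page changed between fetches that risk is usually acceptable (the URL is a placeholder):

import zlib
import requests

url = "https://example.com/page"

old_checksum = zlib.crc32(requests.get(url).content)

# ...on the next run, fetch again and compare...
new_checksum = zlib.crc32(requests.get(url).content)
if new_checksum != old_checksum:
    print("Page changed")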
