Scraping websites - Is online or offline data processing better? - python

I am scraping websites for a research project using Python and BeautifulSoup.
I have scraped a few thousand records and put them in Excel.
In essence, I want to extract a substring of text (e.g. "python" from a post title "Introduction to python for dummies").
The post title is scraped and stored in a cell in Excel.
I want to extract "python" and put it in another cell.
I need some advice on whether it is better to do the extraction while scraping or to do it offline in Excel.
Since this is a research project, there is no need for real-time speed. I am mainly looking to save effort.
A related question is whether Python can be used to do the extraction in offline mode, i.e. open Excel, do the extraction, close Excel.
Any help or advice is really appreciated.

Do it at the same time. It will probably only take a handful of lines of code. There's no reason to do the work of walking over the whole file twice.
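As a minimal sketch of doing the extraction in the same pass: the keyword list, the sample titles, and the `extract_keyword` and `write_rows` helpers below are all assumptions made for illustration. The output is written as CSV (which Excel opens) using only the standard library; in a real scraper the titles would come from BeautifulSoup rather than a hard-coded list.

```python
import csv
import io
import re

# Hypothetical list of terms to look for; adjust to the research question.
KEYWORDS = ["python", "java", "excel"]
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE
)


def extract_keyword(title):
    """Return the first known keyword found in the title (lowercased), or ''."""
    match = PATTERN.search(title)
    return match.group(1).lower() if match else ""


def write_rows(titles, out_file):
    """Write one CSV row per title, with the extracted keyword beside it."""
    writer = csv.writer(out_file)
    writer.writerow(["post_title", "keyword"])
    for title in titles:
        writer.writerow([title, extract_keyword(title)])


# Stand-in for titles pulled out of the scraped pages.
sample_titles = ["Introduction to python for dummies", "Cooking for beginners"]
buffer = io.StringIO()
write_rows(sample_titles, buffer)
csv_text = buffer.getvalue()
```

The same `extract_keyword` function could also be run offline over an exported CSV later, so nothing is lost by doing it during the scrape. Reading or writing .xlsx files directly would need a third-party library such as openpyxl.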

Related

Scraping large and complex PDF tables

I've been trying to scrape some data off of PDFs regarding 2020 election results in California for my own morbid curiosity.
I need to scrape many tables that appear across many pages. In some cases, the rows will continue onto the next page, and additional columns will appear on other pages as well. I've included a link to one example. I'm comfortable with R, but I can also use Python if that will be better for scraping. I haven't found many resources indicating how to deal with tables that carry onto additional pages for either language though. I need to get these tables into a CSV or XLSX format.
Thank you in advance!
In this example, Pages 15-28 should be one table.
https://www.co.tehama.ca.us/images/images/Elections/StatementOfVotesCastNOV2020v2excel.pdf
I was able to get the entire table using the following procedure.
Open the PDF in MS Word, not Adobe Acrobat. Word will convert the document.
After the conversion has completed, select all. (Both may take some time.)
Paste into a blank Excel worksheet. Save and enjoy.

Storing, modifying and manipulating web scraped data

I'm working on a Python web scraper that pulls data from a car advertising site. I got the scraping part done with BeautifulSoup, but I've run into many difficulties trying to store and modify the data. I would really appreciate some advice, since I'm lacking knowledge in this area.
So here is what I want to do:
Scrape the data each hour (done).
Store scraped data as a dictionary in a .JSON file (done).
Every time an ad_link is not found in scraped_data.json, set dict['Status'] = 'Inactive' (done).
If a car's price changes, print a notification and add the old price to the dictionary. On this part I came across many challenges with the JSON approach.
I've kept using two .json files and comparing them to each other (scraped_data_temp, permanent_data.json), but I think this is far from the best method.
What would you suggest? How should I do this?
What would be the best way to approach manipulating this kind of data? (Databases maybe? I have no experience with them, but I'm eager to learn.) And what would be a good way to represent this kind of data, pygal?
Thank you very much.
If you have a larger amount of data, I would definitely recommend using some kind of DB. If you don't need a DB server, you can use SQLite. I have used it in the past to save larger data sets locally. You can use SQLAlchemy in Python to interact with databases.
As for displaying data, I tend to use matplotlib. It's extremely flexible and has extensive documentation and examples, so you can adjust graphs, charts, etc. to your liking.
I'm assuming that you are using python3.
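A rough sketch of how the price-change and inactive-ad tracking could look with SQLite, using Python's built-in `sqlite3` module instead of SQLAlchemy to keep it dependency-free. The table layout, the `open_db` and `record_scrape` helpers, and the sample links and prices are all assumptions, not something taken from the question.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ads (
               ad_link TEXT PRIMARY KEY,
               price   INTEGER,
               status  TEXT DEFAULT 'Active')"""
    )
    conn.execute(
        """CREATE TABLE IF NOT EXISTS price_history (
               ad_link    TEXT,
               old_price  INTEGER,
               changed_at TEXT DEFAULT CURRENT_TIMESTAMP)"""
    )
    return conn


def record_scrape(conn, scraped):
    """Store one scrape. `scraped` maps ad_link -> current price.

    Returns a list of (ad_link, old_price, new_price) for changed prices.
    """
    changes = []
    for link, price in scraped.items():
        row = conn.execute(
            "SELECT price FROM ads WHERE ad_link = ?", (link,)
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO ads (ad_link, price) VALUES (?, ?)", (link, price)
            )
            continue
        if row[0] != price:
            # Keep the old price before overwriting it.
            conn.execute(
                "INSERT INTO price_history (ad_link, old_price) VALUES (?, ?)",
                (link, row[0]),
            )
            changes.append((link, row[0], price))
        conn.execute(
            "UPDATE ads SET price = ?, status = 'Active' WHERE ad_link = ?",
            (price, link),
        )
    # Ads that were stored earlier but are missing from this scrape.
    known = [r[0] for r in conn.execute("SELECT ad_link FROM ads").fetchall()]
    for link in known:
        if link not in scraped:
            conn.execute(
                "UPDATE ads SET status = 'Inactive' WHERE ad_link = ?", (link,)
            )
    conn.commit()
    return changes


conn = open_db()
record_scrape(conn, {"ad/1": 5000, "ad/2": 7200})  # first hourly run
changes = record_scrape(conn, {"ad/1": 4800})      # next run: price drop, ad/2 gone
for link, old, new in changes:
    print(f"{link}: price changed from {old} to {new}")
```

With this layout there is no need to compare two files: each hourly run is checked against what is already stored, and the old prices accumulate in `price_history`.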

Python Web Scripting

I wanted to do this before for some websites but didn't know where to start. This time, however, I am adamant. I am talking about scripts that crawl a website and extract the data we require. My target is this: I have to appear for job interviews in December. There is this site (http://www.geeksforgeeks.org/) which contains a large number of questions from previous interviews (like http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/ & http://www.geeksforgeeks.org/adobe-interview-set-6-campus-mts-1/). Every title has the word "set" and a number in it. It is quite cumbersome to keep track of what I have done and what I have not. So I want to extract the questions from each of these pages and put them in a PDF with the title. How can I do this using curl, regex and Scrapy? I am intermediate in C/C++/Java but have only beginner proficiency in Python. Any help is much appreciated. Also, point me to any such scripts you know of. I want to do this on my own; I just need a starting point and some guidance. Thanks.
If you want just a starting point, try Scrapy, a screen-scraping library for Python. I would recommend that you use the requests library for making requests. It's by far the simplest option (with no loss of power).
Also, don't try to parse HTML or XML with a regex. Just don't. Use one of the fine libraries available (BeautifulSoup or lxml, or lxml with a BeautifulSoup backend, are the most popular, but there are others).
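To illustrate the "parser, not regex" advice, here is a small sketch that collects links with the standard library's `html.parser` (swapped in for BeautifulSoup or lxml only so the example has no dependencies) and then filters them by the "set" number. The `LinkCollector` class, the `interview_set_links` helper, and the sample HTML are assumptions; the real page structure may differ. The regex is applied only to the already-extracted URL string, not to the HTML.

```python
import re
from html.parser import HTMLParser

# Matches URL paths such as ".../amazon-interview-set-42-on-campus/".
SET_LINK = re.compile(r"interview-set-(\d+)")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in the document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def interview_set_links(html):
    """Return (url, set_number) for every link that names an interview set."""
    parser = LinkCollector()
    parser.feed(html)
    result = []
    for href in parser.links:
        match = SET_LINK.search(href)
        if match:
            result.append((href, int(match.group(1))))
    return result


# Made-up index page using the two URLs from the question.
sample_html = """
<ul>
  <li><a href="http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/">Amazon</a></li>
  <li><a href="http://www.geeksforgeeks.org/adobe-interview-set-6-campus-mts-1/">Adobe</a></li>
  <li><a href="http://www.geeksforgeeks.org/about/">About</a></li>
</ul>
"""
found = interview_set_links(sample_html)
```

Keeping the set number alongside each URL makes it easy to record which sets have already been worked through.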

How do I scrape a specific table from a web page and display it in Excel? The table goes horizontally?

I am trying to scrape information from the tables at this website >>Here<<
I want to be able to fetch the scores whenever I want and export them into Excel, with the data placed under the corresponding hole number. The data that I want is wrapped in a <table> tag with a class of "scoreboard". I would also like the players' names.
Is this possible, if so how?
Please answer.
Excel has its own import data from website feature. It has a nice GUI and can let you easily make dynamic web queries in your excel sheet so that the data will update every time you open the book. This might be the easiest and most efficient way for you to go.
Scrapy is much better in Python, especially for larger projects, but if you're going to put the data back into Excel anyway, it might not be worth the extra effort.
Check out the official Excel docs on creating dynamic web queries here.
You might wanna take a look at Scrapy. It's a web scraper framework written in Python. It's powerful and easily extensible and customizable.
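If the Python route is preferred, the idea can be sketched without any framework: pull the cells out of the `<table class="scoreboard">` element and write them as CSV for Excel. This uses the standard library's `html.parser`; the `ScoreboardParser` class and the sample markup (player name and scores) are made up for illustration, and the real page may be laid out differently.

```python
import csv
import io
from html.parser import HTMLParser


class ScoreboardParser(HTMLParser):
    """Collect cell text, row by row, from <table class="scoreboard">."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_table = False
        self._in_cell = False
        self._row = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "scoreboard" in classes:
            self._in_table = True
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._in_table and self._row is not None and tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_table = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell of the scoreboard table is kept.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


# Invented sample: hole numbers run horizontally in the header row.
sample_html = """
<table class="scoreboard">
  <tr><th>Player</th><th>1</th><th>2</th><th>3</th></tr>
  <tr><td>Player A</td><td>4</td><td>3</td><td>5</td></tr>
</table>
"""
parser = ScoreboardParser()
parser.feed(sample_html)
rows = parser.rows

# Horizontal layout: each score sits under its hole number.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# Transposed layout (one hole per row), in case that suits the sheet better.
by_hole = [list(column) for column in zip(*rows)]
```

Saving `csv_text` to a .csv file gives something Excel opens directly, with each score in the column of its hole number.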

Python scripting for XBMC

I am new to programming and to Python itself. I have no programming experience. I have managed to read up on Python and have done some fairly basic Python tutorials, and now I am ready for my first project in Python.
I am basing my project around XBMC; I want to develop some addons for this awesome media center.
I have a few websites that I want to scrape and display in XBMC. One is a music website and one is a paid TV website which is only available to people who have accounts with them. I have managed to scrape a website with feedparser, but I have no idea how to output these titles and links to play in XBMC.
My question here is: where do I start, how do I construct the script for these websites, what tools/libraries/modules do I need. And what do I need to do to include it into XBMC.
On the general topic that has been asked a ton of times regarding webpage scraping, the common answer is always Mechanize/Beautiful Soup for python. That would allow you to actually get your data.
Once you have your data, it's then just a matter of formatting it the way you want for your XBMC app: http://wiki.xbmc.org/index.php?title=HOW-TO:Write_Python_Scripts_for_XBMC
It's a two-step process:
Get your data from a source and format it into some common structure.
Use the common structure to populate your elements in the XBMC script.
What you actually want to do with your script will determine how you use your data. If it's simply providing information, then the link above pretty much explains it.
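The two steps above can be sketched as follows. This is only an outline of the separation: `to_items`, `populate`, and the `add_item` callback are hypothetical names, the sample entries are invented, and no actual XBMC API call is shown. In a real addon, `add_item` would be replaced by whatever the XBMC scripting documentation prescribes for creating a list entry.

```python
def to_items(raw_entries):
    """Step 1: normalise scraped entries into a common structure.

    Each raw entry is assumed to be a dict with "title" and "link" keys,
    roughly what a feed parser hands back. Incomplete entries are skipped.
    """
    items = []
    for entry in raw_entries:
        title = (entry.get("title") or "").strip()
        link = (entry.get("link") or "").strip()
        if title and link:
            items.append({"title": title, "url": link})
    return items


def populate(items, add_item):
    """Step 2: hand each item to the UI layer through a callback."""
    for item in items:
        add_item(item["title"], item["url"])
    return len(items)


# Invented sample data standing in for scraped results.
raw = [
    {"title": " Song One ", "link": "http://example.com/1"},
    {"title": "", "link": "http://example.com/2"},  # no title, so skipped
]

shown = []  # stands in for the media center's list of entries
count = populate(to_items(raw), lambda title, url: shown.append((title, url)))
```

Keeping the scraping and the display code apart like this means each website only needs its own version of step 1, while step 2 stays the same.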
