Storing, modifying and manipulating web-scraped data - Python

I'm working on a Python web scraper that pulls data from a car advertising site. I have the scraping part done with BeautifulSoup, but I've run into many difficulties trying to store and modify the data. I would really appreciate some advice on this part, since I'm lacking knowledge in this area.
So here is what I want to do:
Scrape the data each hour (done).
Store scraped data as a dictionary in a .JSON file (done).
Every time an ad_link is not found in scraped_data.json, set dict['Status'] = 'Inactive' (done).
If a car's price changes, print a notification and add the old price to the dictionary. This is the part where I ran into many challenges with the JSON approach.
So far I've kept two .json files (scraped_data_temp.json, permanent_data.json) and compared them to each other, but I think this is far from the best method.
What would you guys suggest? How should I do this?
What would be the best way to approach manipulating this kind of data? (Databases maybe? I have no experience with them, but I'm eager to learn.) And what would be a good way to represent this kind of data - pygal?
Thank you very much.

If you have larger data, I would definitely recommend using some kind of DB. If you don't need a DB server, you can use SQLite; I have used it in the past to store bigger data locally. You can use SQLAlchemy in Python to interact with databases.
As for displaying data, I tend to use matplotlib. It's extremely flexible and has extensive documentation and examples, so you can adjust the output to your liking: graphs, charts, etc.
I'm assuming that you are using Python 3.
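To make the SQLite suggestion concrete, here is a minimal sketch of the price-change check, assuming each scraped ad is a dict with ad_link, title and price keys (the schema and all names are illustrative, not taken from your code):

import sqlite3

conn = sqlite3.connect("cars.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS ads (
        ad_link   TEXT PRIMARY KEY,
        title     TEXT,
        price     INTEGER,
        old_price INTEGER,
        status    TEXT DEFAULT 'Active'
    )
""")

def upsert_ad(ad):
    # Insert a new ad, or detect and record a price change.
    row = conn.execute(
        "SELECT price FROM ads WHERE ad_link = ?", (ad["ad_link"],)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO ads (ad_link, title, price) VALUES (?, ?, ?)",
            (ad["ad_link"], ad["title"], ad["price"]),
        )
    elif row[0] != ad["price"]:
        print(f"Price change for {ad['ad_link']}: {row[0]} -> {ad['price']}")
        conn.execute(
            "UPDATE ads SET old_price = ?, price = ? WHERE ad_link = ?",
            (row[0], ad["price"], ad["ad_link"]),
        )
    conn.commit()

def mark_inactive(scraped_links):
    # Flag ads that did not appear in the latest scrape.
    placeholders = ",".join("?" * len(scraped_links))
    conn.execute(
        f"UPDATE ads SET status = 'Inactive' WHERE ad_link NOT IN ({placeholders})",
        scraped_links,
    )
    conn.commit()

One table replaces both of your JSON files: each hourly scrape updates it in place, so there is nothing to diff.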

Related

Swamped with real time data and tasked with building a database

I work for a power company and have been tasked with building a database. I have a beginner-to-intermediate understanding of Python and can fuddle along decently with MSSQL. They have procured Azure for this project, and I am completely lost on how to start this task.
Here is one of the sources of data that I want to scrape every minute.
http://ets.aeso.ca/ets_web/docroot/tradingPage.html - this is a complete overview of the Alberta power market in real time.
Ideally, I would want to be able to scrape this data and other sources, and then modify it to fit into in a certain format and push it onto the SQL server.
Do I need virtual machines that are just looping over Python scripts? Or do I need managed instances? The data also needs to be queryable right after it is scraped. Eventually this data may feed machine learning algorithms (I don't know jack about that either, but I have been told it should play nicely with that type of environment).
Just looking to see if anyone has any insight into how you would approach this, and can tell me what I clearly don't know and haven't thought of. Any insight is truly appreciated.
Thanks!
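As a rough sketch of the minute-by-minute scrape-and-insert loop described above, using requests + BeautifulSoup for the page and pyodbc for an Azure SQL database; the connection string, table and column names are placeholders, and the row parsing depends entirely on the page's real markup:

import time
import requests
import pyodbc
from bs4 import BeautifulSoup

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourserver.database.windows.net;DATABASE=power;UID=user;PWD=secret"
)

def scrape_once():
    html = requests.get(
        "http://ets.aeso.ca/ets_web/docroot/tradingPage.html", timeout=30
    ).text
    soup = BeautifulSoup(html, "html.parser")
    # Site-specific: pull the text out of every table row on the page.
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")
    ]
    cur = conn.cursor()
    for row in rows:
        if len(row) == 2:  # placeholder filter; adapt to the real table layout
            cur.execute("INSERT INTO market_data (label, value) VALUES (?, ?)", row)
    conn.commit()

while True:
    scrape_once()
    time.sleep(60)  # a cron job or an Azure Function on a timer would also work

A single small VM running a loop like this is enough to start, and the rows are queryable in SQL the moment the commit returns.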

Rethinking my approach to scraping dynamic content with Python and Selenium

Currently I am working on a project that will scrape content from various similarly designed websites which contain dynamic content. My end goal is to then aggregate all this data into one application or report of sorts. I made some progress in terms of pulling the needed data from one page but my lack of experience and knowledge in this realm has left me thinking I went down the wrong path.
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
The above link is the perfect example of the type of page I will be pulling from.
In my initial attempt I was able to have the page scroll to the bottom, all the while collecting data from the various elements using the selector below, plus the manual scroll.
cards = driver.find_elements_by_css_selector("div[class^='product-card__Content']")
This allowed me to pull, on the fly, all the data points I needed, minus the overarching category, which happens to be a parent element. That is something I can map manually in Excel, but I would prefer to have it pulled alongside everything else.
This got me thinking that maybe I should have taken a top-down approach, rather than what I am seeing as a bottom-up approach. But no matter how hard I tried, based on the advice of others, I could not get it working as intended, where I pull the category from the parent div, due to my lack of understanding.
Based on the input of others I was able to make a pivot of sorts, and using the code below I was able to get the category as well as the product name, without any need to scroll the page at all, which went against every experience I have had with this project so far - I am unclear how/why this is possible.
for product_group_name in driver.find_elements_by_css_selector("div[class^='products-grid__ProductGroupTitle']"):
    for product in driver.find_elements_by_xpath("//div[starts-with(@class,'products-grid__ProductGroup')][./div[starts-with(@class,'products-grid__ProductGroupTitle')][text()='" + product_group_name.text + "']]//div[starts-with(@class,'consumer-product-card__InViewContainer')]"):
        print(product_group_name.text, product.text)
The problem with this code, which is much quicker as it does not rely on scrolling, is that no matter how I approach it I am unable to pull the additional data points of brand and price. Obviously it is something in my approach, but it is outside my knowledge level currently.
Any specific or general advice would be appreciated, as I would like to scale this into something a bit more robust as my knowledge builds. I would like to have this scanning multiple different URLs at set points in the day; that is a long way away, but I want to make sure I start on the right path if possible. Based on what I have provided, is the top-down approach better in this case? Bottom-up? Is this subjective?
I have noticed comments about pulling the entire source code of the page and working with that. Would that be a valid approach, and possibly better suited to my needs? Would it even be possible given the dynamic nature of the page?
Thank you.
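One way to get at brand and price is to scope each lookup to its parent group element rather than re-querying the whole page by title text. A rough sketch in the same Selenium 3 style API as the snippets above; the exact class prefixes for the brand and price sub-elements are guesses and need checking against the live page:

# Note the leading "." in the inner XPaths: without it, the search would
# run against the whole document instead of the group element.
groups = driver.find_elements_by_xpath(
    "//div[starts-with(@class,'products-grid__ProductGroup') and "
    "not(starts-with(@class,'products-grid__ProductGroupTitle'))]")
for group in groups:
    category = group.find_element_by_xpath(
        ".//div[starts-with(@class,'products-grid__ProductGroupTitle')]").text
    for card in group.find_elements_by_xpath(
            ".//div[starts-with(@class,'consumer-product-card__InViewContainer')]"):
        # card.text holds the card's visible text; brand and price would be
        # pulled the same way with card-relative selectors once their class
        # prefixes are confirmed against the page.
        print(category, card.text)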

How to store data from a web scraping project

Background
I am currently playing with a web scraping project as I am learning Python.
I have a project which scrapes products, with information about price etc., using Selenium.
Then I add every record to a pandas DataFrame, do some additional data manipulation, then store the data in a CSV and upload it to Google Drive. This runs every night.
Question
I would like to watch price changes, new products etc. What would you recommend for storing the data with a date key, so there is an option to flag new products and the like?
My idea is to store every load in one CSV and add one column with "date_of_load"... But this seems noob-like... Maybe store the data in PostgreSQL? I would like to start learning SQL, so I would try making my own DB.
Thanks for your ideas
For this task I would rather use NoSQL (Mongo). You can create a JSON document of prices whose keys are dates.
This can help you:
https://www.mongodb.com/blog/post/getting-started-with-python-and-mongodb
https://www.mongodb.com/python
https://realpython.com/introduction-to-mongodb-and-python/
https://www.google.com/search?&q=python+mongo
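A minimal sketch of that idea, assuming a local MongoDB and one document per product whose prices subdocument is keyed by load date (all names here are illustrative):

from datetime import date
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["scraper"]["products"]

def record_price(product_id, name, price):
    today = date.today().isoformat()  # e.g. "2021-05-03"
    products.update_one(
        {"_id": product_id},
        {"$set": {"name": name, f"prices.{today}": price}},
        upsert=True,
    )
    # A product is new if today is the only date in its price history.
    doc = products.find_one({"_id": product_id})
    if list(doc["prices"]) == [today]:
        print(f"New product: {name}")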
That is cool! I would suggest sqlite3 (https://docs.python.org/3/library/sqlite3.html) just to get a feel for SQL. As you can see, the docs say "It's also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle", which is sort of what you suggested(?), so it could be a nice place to start.
However, CSV might do just fine. As long as there is not too much data (so it doesn't take forever to load and process everything you need), it doesn't matter much how you store it, as long as you manage to use it the way you want.
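A sketch of the single-CSV idea from the question: append each night's load with a date_of_load column, then compare the two most recent loads with pandas to flag new products (the product_id column is an assumed unique key):

import os
from datetime import date
import pandas as pd

def append_load(df, path="loads.csv"):
    # Stamp the load date and append; write the header only on the first load.
    df = df.assign(date_of_load=date.today().isoformat())
    df.to_csv(path, mode="a", header=not os.path.exists(path), index=False)

def flag_new_products(path="loads.csv"):
    loads = pd.read_csv(path)
    dates = sorted(loads["date_of_load"].unique())
    latest = loads[loads["date_of_load"] == dates[-1]]
    if len(dates) < 2:
        return latest  # first load ever: everything counts as new
    previous = set(loads.loc[loads["date_of_load"] == dates[-2], "product_id"])
    return latest[~latest["product_id"].isin(previous)]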

I need a starting point to code an app to extract text from PDF to Excel

To start I just want to state that I'm an Electrical Engineer with basic knowledge of programming.
My requirement is as follows:
I want to create an app where I can load and view PDF files that contain tables.
The tables in these PDF files are of irregular shapes and in a different position on every page (that's why tools like tabular couldn't help me).
Each table entry is multiline and of irregular dimensions (I cannot select a whole row at a time; it has to be each element alone. Simply copying the lines to Excel won't work either, because it would need a lot of formatting).
So I want to be able to select each table entry individually from the table (like a selection or cropping box over the required text), delete the newline if there is one in the text, and just keep spaces.
The generated Excel file (or Access database, I don't really mind which) should be reviewable and saveable (if those are even words XD).
I have a good knowledge of Python and a very elementary knowledge of Django, and I'm seeking an expert who can tell me what I really need to learn (and, if possible, where to learn it) to execute my project.
Is it too much for me to execute, and if I can dedicate 10 hours a week, how long would it take me to complete such a project?
Thanks all for your help in advance.
Don't use Python, use Word. Open the PDF, then step through the Tables collection to collect the data and put it into Excel. See this for an example.
Here is the advice I can offer:
First of all, search the internet for your questions:
https://lmddgtfy.net/?q=python%20library%20tabular%20pdf
-> Camelot, which is mentioned multiple times, seems to be relevant; see the sketch below.
For working with the Excel sheet, let me present one of the most famous libraries for manipulating DataFrames: pandas.
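A quick sketch combining the two suggestions: Camelot to pull the tables, pandas to flatten in-cell newlines into spaces and export to Excel. Camelot may struggle with the same irregular layouts that defeated the other tools, but it is a cheap way to test (assumes pip install camelot-py[cv] pandas openpyxl; the filename is a placeholder):

import camelot
import pandas as pd

# "stream" infers table structure from whitespace; "lattice" needs ruled lines.
tables = camelot.read_pdf("input.pdf", pages="all", flavor="stream")

with pd.ExcelWriter("output.xlsx") as writer:
    for i, table in enumerate(tables):
        # Replace in-cell newlines with spaces, as the question asks.
        df = table.df.replace("\n", " ", regex=True)
        df.to_excel(writer, sheet_name=f"table_{i}", index=False, header=False)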
You can take short courses on the internet, which will quickly give you the ability to manage your project more easily.
For the application, you can easily find YouTube courses where someone explains how to build a basic application with a given library. That could offer you the entry point you are talking about. Then you can simply ask yourself what else you need, or just want, to make it better.
As for the time needed, it depends on how much time you need to understand the basics and how much time you spend gaining a deeper understanding. I think that in one week, working in your free time with real interest, you could have it working (not perfect, but working, which is a good beginning).
PS: I am not sure your question is relevant to the aims of Stack Overflow. I suggest you read this page: https://stackoverflow.com/help/how-to-ask

How do I scrape a specific table from a web page and display it in Excel? The table goes horizontally?

I am trying to scrape information from the tables at this website >>Here<<
I want to be able to get the scores whenever I want, and to export them into Excel; I would also like the data to come under the hole number. The data that I want is wrapped in a <table> tag with a class of "scoreboard"; that is the bit that I want. I would also like the players' names.
Is this possible, if so how?
Please answer.
Excel has its own import-data-from-website feature. It has a nice GUI and lets you easily make dynamic web queries in your Excel sheet, so that the data updates every time you open the workbook. This might be the easiest and most efficient way for you to go.
Scrapy is much better, especially for larger projects, for use in Python, but if you're going to put the data back into Excel it might not be worth the extra effort.
Check out the official Excel docs on creating dynamic web queries here.
You might want to take a look at Scrapy. It's a web scraping framework written in Python. It's powerful, and easily extensible and customizable.
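If Scrapy feels heavy for one table, here is a rough sketch of the same job with requests + BeautifulSoup + pandas, assuming the page serves static HTML; the URL is a placeholder for the scores page linked above:

import requests
import pandas as pd
from bs4 import BeautifulSoup

html = requests.get("https://example.com/scores", timeout=30).text
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="scoreboard")

# Flatten the table into rows of cell text, taking the first row as the header.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
df = pd.DataFrame(rows[1:], columns=rows[0])
df.to_excel("scores.xlsx", index=False)  # writing .xlsx requires openpyxl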
