How to store data from a web scraping project - python

#Background
I am currently playing with a web scraping project as I am learning Python.
I have a project which scrapes products, with information about price etc., using Selenium.
Then I add every record to a pandas DataFrame, do some additional data manipulation, and then store the data in a CSV and upload it to Google Drive. This runs every night.
#Question itself
I would like to watch price changes, new products, etc. How would you recommend storing the data with a date key, so there is an option to flag new products and so on?
My idea is to store every load in one CSV and add one column with "date_of_load"... But this seems noob_like... Maybe store the data in PostgreSQL? I would like to start learning SQL, so I would try making my own DB.
Thanks for your ideas

In my opinion, it is better to use NoSQL (MongoDB) for this task. You can create JSON documents (the price data) whose keys are the dates.
This can help you:
https://www.mongodb.com/blog/post/getting-started-with-python-and-mongodb
https://www.mongodb.com/python
https://realpython.com/introduction-to-mongodb-and-python/
https://www.google.com/search?&q=python+mongo
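To make that concrete, a minimal sketch with pymongo might look like this. It assumes a local MongoDB instance, and the database, collection, and field names are invented for illustration:

```python
# Sketch only: one document per product per scrape, keyed by product id and
# load date, so price history can be queried per product.
from datetime import date

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["scraper"]["products"]  # hypothetical db/collection names

collection.insert_one({
    "product_id": "ABC-123",  # hypothetical product identifier
    "price": 19.99,
    "date_of_load": date.today().isoformat(),
})

# Walk one product's price points in date order to watch for changes.
for doc in collection.find({"product_id": "ABC-123"}).sort("date_of_load"):
    print(doc["date_of_load"], doc["price"])
```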

That is cool! I would suggest sqlite3 (https://docs.python.org/3/library/sqlite3.html) just to get a feeling with SQL. As you can see, it says "It’s also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle", which is sort of what you suggested(?), so it could be a nice place to start.
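To give a feel for what that looks like, here is a minimal sqlite3 sketch of your "date_of_load" idea; the table and column names are made up for illustration:

```python
# Sketch only: one row per product per nightly load, with a date_of_load
# column, so new products can be flagged with a query.
import sqlite3
from datetime import date

conn = sqlite3.connect("scrapes.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS products (
        product_id   TEXT,
        price        REAL,
        date_of_load TEXT
    )
""")

# Insert tonight's scraped rows (a single hard-coded example row here).
conn.execute("INSERT INTO products VALUES (?, ?, ?)",
             ("ABC-123", 19.99, date.today().isoformat()))
conn.commit()

# Flag products whose first appearance is in the latest load: they are new.
new_products = conn.execute("""
    SELECT product_id FROM products
    GROUP BY product_id
    HAVING MIN(date_of_load) = (SELECT MAX(date_of_load) FROM products)
""").fetchall()
print("new products:", new_products)
conn.close()
```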
However, CSV might do just fine. As long as there is not too much data (i.e., it doesn't take forever to load and process everything you need), it doesn't matter much how you store it, as long as you can use it the way you want.

Related

Storing, modifying and manipulating web scraped data

I'm working on a Python web scraper that pulls data from a car advertising site. I got the scraping part all done with BeautifulSoup, but I've run into many difficulties trying to store and modify the data. I would really appreciate some advice on this part, since I'm lacking knowledge here.
So here is what I want to do:
Scrape the data each hour (done).
Store the scraped data as a dictionary in a .json file (done).
Every time an ad_link is not found in scraped_data.json, set dict['Status'] = 'Inactive' (done).
If a car's price changes, print a notification and add the old price to the dictionary. On this part I came across many challenges with the .json approach.
I've kept using two .json files and comparing them to each other (scraped_data_temp, permanent_data.json), but I think this is by far not the best method.
What would you guys suggest? How should I do this?
What would be the best way to approach manipulating this kind of data? (Databases maybe? I've got no experience with them, but I'm eager to learn.) And what would be a good way to represent this kind of data, pygal?
Thank you very much.
If you have larger data, I would definitely recommend using some kind of DB. If you don't need a DB server, you can use SQLite; I have used it in the past to store bigger data locally. You can use SQLAlchemy in Python to interact with databases.
As for displaying data, I tend to use matplotlib. It's extremely flexible and has extensive documentation and examples, so you can adjust the data to your liking: graphs, charts, etc.
I'm assuming that you are using Python 3.
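To sketch the SQLite + SQLAlchemy idea against your price-change requirement: the snippet below keeps every observed price, so the old price is never lost and a change can be detected on insert. Table and column names are invented, and it assumes SQLAlchemy 1.4+:

```python
# Sketch only: append-only price history with a change notification on insert.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///ads.db")  # placeholder database file

with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS price_history (
            ad_link    TEXT,
            price      REAL,
            scraped_at TEXT
        )
    """))

def record_price(ad_link, price, scraped_at):
    """Insert the new observation; print a notification if the price moved."""
    with engine.begin() as conn:
        last = conn.execute(
            text("SELECT price FROM price_history WHERE ad_link = :link "
                 "ORDER BY scraped_at DESC LIMIT 1"),
            {"link": ad_link},
        ).fetchone()
        if last is not None and last[0] != price:
            print(f"price change for {ad_link}: {last[0]} -> {price}")
        conn.execute(
            text("INSERT INTO price_history VALUES (:link, :price, :ts)"),
            {"link": ad_link, "price": price, "ts": scraped_at},
        )
```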

I need a starting point to code an app to extract text from pdf to excel

To start I just want to state that I'm an Electrical Engineer with basic knowledge of programming.
My requirement is as follows:
I want to create an app where I can load and view PDF files that contain tables.
The tables in these PDF files are of irregular shapes and in a different position on every page (that's why tools like Tabula couldn't help me).
Each table entry is multiline and of irregular dimensions (I cannot select a whole row at a time; it has to be each element alone. Simply copying the lines to Excel won't work either, because it would need a lot of formatting).
So I want to be able to select each table entry individually from the table (like a selection or cropping box over the required text), delete the newline if there is one in the text, and just keep spaces.
The generated Excel file (or Access database, I don't really mind which) should be reviewable and saveable (if those are even words XD).
I have a good knowledge of Python and a very elementary knowledge of Django, and I'm seeking some expert who can tell me what I really need to learn (and, if possible, where to learn it) to execute my project.
Is it too much for me to take on, and if I can dedicate 10 hours a week, how long would it take me to execute such a project?
Thanks all for your help in advance.
Don't use Python, use Word. Open the PDF, then step through the Tables collection to collect the data and put it into Excel. See this for an example.
Here is the advice I can give you:
First of all, search the internet for your question:
https://lmddgtfy.net/?q=python%20library%20tabular%20pdf
-> Camelot, which is mentioned multiple times, seems to be relevant.
For working with the Excel sheet, I'd point you to one of the most famous libraries for manipulating DataFrames: pandas. (A sketch combining the two follows below.)
You can take short courses on the internet, which will quickly give you the ability to manage your project more easily.
For the application, you can easily find YouTube courses where someone walks you through building a basic application with a given library. That could offer you the entry point you are talking about. Then you can just ask yourself what else you need, or simply want, to make it better.
As for the time needed, it depends on how much time you need to understand the basics and how much time you spend building a deeper comprehension. I think that in one week, working in your free time with real interest, it could be working (not perfect, but working, which is a good beginning).
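Since Camelot keeps coming up, here is a hedged sketch of the Camelot + pandas route. It assumes Camelot can actually detect your tables, which may not hold given their irregular layout, and "input.pdf" / "output.xlsx" are placeholder names; treat it as a starting point, not a finished implementation.

```python
# Sketch only: Camelot works on text-based (not scanned) PDFs, and its
# automatic table detection may struggle with irregular layouts.
import camelot
import pandas as pd

tables = camelot.read_pdf("input.pdf", pages="all")  # placeholder file name

with pd.ExcelWriter("output.xlsx") as writer:
    for i, table in enumerate(tables):
        df = table.df
        # Collapse embedded newlines in multiline cells, keeping spaces.
        df = df.replace(r"\n", " ", regex=True)
        df.to_excel(writer, sheet_name=f"table_{i}", index=False)
```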
PS: I am not sure your question is a good fit for Stack Overflow's aims. I suggest you read this page: https://stackoverflow.com/help/how-to-ask

Parsing a CSV into a database for an API using Python?

I'm going to use data from a .csv to train a model to predict user activity on Google Ads (impressions, clicks) in relation to the weather for a given day. I have a .csv that contains 6000+ records of this info and want to parse it into a database using Python.
I tried making a df in pandas, but for some reason the whole table isn't shown. The middle columns (there are about 7 columns, I think) and rows (numbered over 6000, as I mentioned) are replaced with '...' when I print the table, so I'm not sure if the entirety of the information is being stored and whether this will be usable.
My next attempt will possibly be SQLite, but since it's a local file, will this interfere with someone else making requests to my API endpoint if I don't have the db actively open at all times?
Thanks in advance.
If you used pd.read_csv(), I can assure you all of the info is there; it's just not being displayed.
You can check by doing something like print(df['Column_name_you_are_interested_in'].tolist()) just to make sure, though. You can also use the various count-type methods in pandas to make sure all of your lines are there.
pandas is pretty versatile, so it shouldn't have trouble with 6000 lines.
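Two things worth seeing in code: the '...' is purely display truncation, and pandas can hand the whole CSV to SQLite in one call. File, table, and column names below are placeholders:

```python
# Sketch only: verify the CSV loaded fully, then park it in a SQLite file
# that the API process can open on demand (it's a regular file on disk,
# not memory that has to stay "open").
import sqlite3

import pandas as pd

df = pd.read_csv("weather_ads.csv")  # placeholder file name
print(df.shape)                      # confirms all 6000+ rows and every column

pd.set_option("display.max_columns", None)  # widen the printout if you want

with sqlite3.connect("ads.db") as conn:
    df.to_sql("ad_weather", conn, if_exists="replace", index=False)
```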

Database in Excel using win32com or xlrd, or database in MySQL

I have developed a website where the pages are simply HTML tables. I have also developed a server by expanding on Python's SimpleHTTPServer. Now I am developing my database.
Most of the table contents on each page are static and don't need to be touched. However, there is one column per table (i.e., page) that needs to be editable and stored. The values are simply text that the user can enter; the user enters the text via HTML textareas that are appended to the tables via JavaScript.
The database is to store key/value pairs, where the value is the user-entered text (for now at least).
Current situation
Because the original format of my webpages was xlsx files, I opted to use an Excel workbook as my database; it basically just mirrors the displayed HTML tables (pages).
I hook up to the Excel workbook through win32com. Every time a table (page) loads, JavaScript iterates through the HTML textareas and sends an individual request to the server to load each one's respective text from the database.
Currently this approach works but is terribly slow. I have tried to optimize everything as much as I can, and I believe the speed limitation is a direct consequence of win32com.
Thus, I see four possible ways to go:
1. Replace my current win32com functionality with xlrd.
2. Try to load all the HTML textareas for a table (page) at once through one server call to the database using win32com.
3. Switch to something like SQL (probably MySQL, since it's simple and robust enough for my needs).
4. Use xlrd but make a single call to the server for each table (page), as in (2).
My schedule to build this functionality is around two days.
Does anyone have any thoughts on the tradeoffs in time-spent-coding versus speed of these approaches? If anyone has any better/more streamlined methods in mind please share!
Probably not the answer you were looking for, but your post is very broad, and I've used win32com and Excel a fair bit and don't see those as good tools for your goal. An easier strategy is this:
for the server, use Flask: it is a Python HTTP server that makes it crazy easy to respond to HTTP requests via Python code and HTML templates. You'll have a fully capable server running in 5 minutes; then you will need a bit of time to write code that gets data from your DB and renders it from templates (which are really easy to use).
for the database, use SQLite (there is far more overhead integrating with MySQL), because you only have 2 days; a minimal sketch of this combination follows below.
you could also use a simple CSV file, since the API (Python has a CSV read/write module) is much simpler, with less ramp-up time. One CSV per user is easy to manage. You don't worry about inserting rows for a user, you just append; and you don't implement removing rows for a user, you just mark them as inactive (an active/inactive column in your CSV). When processing a GET request from the client, as you read from the CSV, you can count how many rows are inactive and do a rewrite of the CSV, so once in a while a request will be a little slower to respond to.
even simpler yet, you could use an in-memory data structure of your choice if you don't need persistence across restarts of the server. If this is for a demo, that should be an acceptable limitation.
for the client side, use jQuery on top of JavaScript (maybe you are doing that already). It makes it super easy to manipulate the DOM and use effects like slide-in/out, etc. Get yourself the book "Learning jQuery"; you'll be able to make good use of jQuery in just a couple of hours.
If you only have two days it might be a little tight, but you will probably need more than two days to get around the issues you are facing with your current strategy, and the issues you will face imminently.
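To make the Flask + SQLite combination concrete, here is a minimal sketch. The route names, table name, and request shape are all invented for illustration; it stores the key/value pairs and lets the JavaScript fetch every textarea's text for a page in one request instead of one request per cell.

```python
# Minimal sketch, not a finished design: one endpoint saves a key/value pair,
# another returns all pairs at once so the page loads with a single request.
import sqlite3

from flask import Flask, jsonify, request

app = Flask(__name__)
DB = "pages.db"  # placeholder database file

def get_conn():
    conn = sqlite3.connect(DB)
    conn.execute("CREATE TABLE IF NOT EXISTS cells (key TEXT PRIMARY KEY, value TEXT)")
    return conn

@app.route("/cell", methods=["POST"])
def save_cell():
    data = request.get_json()  # expects {"key": ..., "value": ...}
    with get_conn() as conn:
        conn.execute("INSERT OR REPLACE INTO cells VALUES (?, ?)",
                     (data["key"], data["value"]))
    return "", 204

@app.route("/cells")
def load_cells():
    with get_conn() as conn:
        rows = conn.execute("SELECT key, value FROM cells").fetchall()
    return jsonify(dict(rows))

if __name__ == "__main__":
    app.run()
```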

How do I scrape a specific table from a web page and display it in Excel? The table goes horizontally?

I am trying to scrape information from the tables at this website >>Here<<
I want to be able to get the scores whenever I want and export them into Excel, and I would like the data to come under the hole no. as well. The data that I want is wrapped in a <table> tag with a class of "scoreboard"; that is the bit I want. I would also like the player's name.
Is this possible, and if so, how?
Please answer.
Excel has its own import-data-from-website feature. It has a nice GUI and lets you easily make dynamic web queries in your Excel sheet, so that the data will update every time you open the workbook. This might be the easiest and most efficient way for you to go.
Scrapy is much better, especially for larger projects, for use in Python, but if you're going to put the data back into Excel it might not be worth the extra effort.
Check out the official Excel docs on creating dynamic web queries here.
You might wanna take a look at Scrapy. It's a web scraping framework written in Python. It's powerful and easily extensible and customizable.
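If you want to stay in Python without the full Scrapy framework, a minimal sketch with pandas.read_html can grab that specific table by its class and write it straight to Excel. This is a third route, not from either answer above; the URL is a placeholder, and it assumes the scoreboard table is plain HTML rather than built by JavaScript (it also needs lxml or beautifulsoup4 installed):

```python
# Sketch only: read_html returns every matching <table> as a DataFrame.
import pandas as pd

url = "https://example.com/leaderboard"  # placeholder for the site in the question
tables = pd.read_html(url, attrs={"class": "scoreboard"})

tables[0].to_excel("scores.xlsx", index=False)
print(f"found {len(tables)} scoreboard table(s); wrote the first to scores.xlsx")
```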
