Web scraping a forum [closed] - python

My question is about how to store the data I'm retrieving from certain threads of a forum. I want to be able to plot as much information as I want, so I don't want to lock everything into a rigid structure; I want to be able to use as much of the info as I can (which timezones are most active, which timezones are most active per user, keywords throughout the years, points across posters, etc.).
How should I store this? A tree with the upper nodes being pages and the lower nodes being posts? And how do I store that tree in a way that is easy* to read?
* easy as in encapsulated in a format I could easily export to other tools.

I suggest scraping only the posts (why would you ever need the pages?) into JSON, which you can keep in PostgreSQL in a jsonb field; that lets you query your JSON flexibly.
Later you'd write a script, or several, that iterate over the posts and do useful things like cleaning up the data, normalizing values, aggregating stats, etc.
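For illustration, here is a minimal sketch of that approach with psycopg2; the connection string, table, and post fields are made-up placeholders, and it assumes a table created with CREATE TABLE posts (id serial PRIMARY KEY, doc jsonb):
import json
import psycopg2  # assumes the psycopg2 (or psycopg2-binary) package is installed

conn = psycopg2.connect("dbname=forum user=scraper")  # placeholder connection string
cur = conn.cursor()

post = {  # example of one scraped post
    "author": "some_user",
    "posted_at": "2017-03-01T12:34:56Z",
    "body": "example post text",
}
cur.execute("INSERT INTO posts (doc) VALUES (%s::jsonb)", [json.dumps(post)])
conn.commit()

# query the JSON flexibly, e.g. all post bodies by one author
cur.execute("SELECT doc->>'body' FROM posts WHERE doc->>'author' = %s", ["some_user"])
print(cur.fetchall())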
See also
Someone wrote a post about PostgreSQL and querying JSON

Related

Sort and order output data [closed]

I have developed a small program that reads and outputs the live data of a machine. However, the data is output in a confusing, unordered way.
My question is: what can I do to sort the output data, e.g. into a table?
Best
You wrote many (topic, payload) tuples to a file, test_ViperData.txt.
Good.
Now, to view them in an ordered manner, just call /usr/bin/sort:
$ sort test_ViperData.txt
If you wish to do this entirely within python,
without e.g. creating a subprocess,
you might want to build up a long list of result tuples.
results = []
...
results.append((topic, payload))   # inside the loop that reads each message
...
print(sorted(results))             # sorts by topic, then by payload
The blank-delimited file format you are using is OK, as far as it goes.
But you might prefer to use comma-delimited CSV format.
Then you could view the file within spreadsheet software,
or could manipulate it with the standard csv module
or various pandas tools.
When you review the text file next week,
you might find it more useful if
each record includes a timestamp:
import datetime as dt
...
results.append((topic, payload, dt.datetime.now()))
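If you do go the CSV route, a minimal sketch with the standard csv module might look like this; the file name, field order, and example topics are just placeholders:
import csv
import datetime as dt

# example rows in the (topic, payload, timestamp) shape suggested above
results = [
    ("machine/temperature", "71.3", dt.datetime.now()),
    ("machine/pressure", "2.4", dt.datetime.now()),
]

with open("test_ViperData.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["topic", "payload", "timestamp"])        # header row
    for topic, payload, stamp in sorted(results):
        writer.writerow([topic, payload, stamp.isoformat()])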

Scrapy - Database choice [closed]

I'm currently building a Scrapy project that can crawl any website from the first depth to the last. I don't extract much data, but I store the full page HTML (response.body) in a database.
I am currently using Elasticsearch with the bulk API to store my raw HTML.
I had a look at Cassandra, but I did not find an equivalent of the Elasticsearch bulk API, and that hurts the performance of my spider.
I am interested in performance, and I was wondering whether Elasticsearch is a good choice, or whether there is a more appropriate NoSQL database.
That very much depends on what you are planning to do with the scraped data later on.
Elasticsearch does some complex indexing operations upon insertion, which makes subsequent searches in the database quite fast, but it also costs processing time and introduces latency.
So, to answer your question of whether Elasticsearch is a good choice:
If you plan to build some kind of search engine later on, Elasticsearch is a good choice (as the name indicates). But you should have a close look at the configuration of Elasticsearch's indexing options, etc., to make sure it works the way you need it to.
If, on the other hand, you just want to store the data and run processing tasks on it later, Elasticsearch is a poor choice and you would be better off with Cassandra or another NoSQL database.
Which NoSQL database suits your needs best depends, again, on the actual usage scenario.
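As a rough sketch of the bulk approach from a Scrapy item pipeline, using the elasticsearch Python client; the host, index name, batch size, and the item fields url and body are assumptions to adapt to your project:
from elasticsearch import Elasticsearch, helpers

class ElasticsearchPipeline:
    """Buffers scraped pages and writes them with the bulk helper."""

    def open_spider(self, spider):
        self.es = Elasticsearch(["http://localhost:9200"])  # adjust to your cluster
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append({
            "_index": "raw_html",                            # assumed index name
            "_source": {"url": item["url"], "body": item["body"]},
        })
        if len(self.buffer) >= 500:                          # flush in batches
            helpers.bulk(self.es, self.buffer)
            self.buffer = []
        return item

    def close_spider(self, spider):
        if self.buffer:                                      # flush whatever is left
            helpers.bulk(self.es, self.buffer)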

How would I go about pulling data from a website using Python? [closed]

In reference to my question: how would one input data into, and retrieve data from, various websites (not using an API)?
Is there a module that can act like a human user, filling in the relevant fields on a page, in order to (as said before) retrieve data?
Sorry if I'm making my question hard to follow; if so, here's an example of what I am trying to accomplish:
Directing an AI towards a specific website.
Inputting data into the search field.
Then finally, retrieving said data after the previously run processes.
I'm fairly new to the field of manipulating websites via APIs or other code, so sorry if I missed anything!
You can use the mechanize, BeautifulSoup, urllib, or urllib2 modules in Python. What I suggest is the mechanize module: it lets you scrape a website from a Python program and essentially acts as a browser driven by Python code.
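A minimal mechanize sketch of the flow described above; the URL and the form field name "q" are made up, so inspect the target page for the real ones:
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)          # some sites disallow bots via robots.txt
br.open("http://example.com/search") # placeholder URL

br.select_form(nr=0)                 # pick the first form on the page
br["q"] = "what I am looking for"    # fill the search field (name assumed)
response = br.submit()

html = response.read()               # raw HTML of the results page
print(html[:200])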

efficient database file trees [closed]

So I was making a simple chat app with python. I want to store user specific data in a database, but I'm unfamiliar with efficiency. I want to store usernames, public rsa keys, missed messages, missed group messages, urls to profile pics etc.
There's a couple of things in there that would have to be grabbed pretty often, like missed messages and profile pics and a couple of hashes. So here's the question: what database style would be fastest while staying memory efficient? I want it to be able to handle around 10k users (like that's ever gonna happen).
Here are some options I thought of:
everything in one file (might be bad on memory, and takes time to load, which matters since I would need to reload it after every change)
separate files per user (slower, but memory efficient)
separate files per data value
a directory for each user, with separate files for each value
Thanks, and try to keep it objective so this isn't instantly closed!
The only answer possible at this point is 'try it and see'.
I would start with MySQL (mostly because it's the 'lowest common denominator', freely available everywhere); it should do everything you need up to several thousand users, and if you get that far you should have a far better idea of what you need and where the bottlenecks are.
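To make that concrete, here is a minimal sketch of the kind of schema involved, using the stdlib sqlite3 module so it runs anywhere; the table and column names are illustrative, and the same layout carries over to MySQL:
import sqlite3

conn = sqlite3.connect("chat.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS users (
    id          INTEGER PRIMARY KEY,
    username    TEXT UNIQUE NOT NULL,
    rsa_public  TEXT NOT NULL,
    avatar_url  TEXT
);
CREATE TABLE IF NOT EXISTS missed_messages (
    id       INTEGER PRIMARY KEY,
    user_id  INTEGER NOT NULL REFERENCES users(id),
    sender   TEXT NOT NULL,
    body     TEXT NOT NULL
);
-- index the column you will filter on most often
CREATE INDEX IF NOT EXISTS idx_missed_user ON missed_messages(user_id);
""")
conn.commit()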

Sqlalchemy, python, easiest way to populate database with data [closed]

I tend to start projects that are far beyond what I am capable of doing, bad habit or a good way to force myself to learn, I don't know. Anyway, this project uses a postgresql database, python and sqlalchemy. I am slowly learning everything from sql to sqlalchemy and python. I have started to figure out models and the declarative approach, but I am wondering: what is the easiest way to populate the database with data that needs to be there from the beginning, such as an admin user for my project? How is this usually done?
Edit:
Perhaps this question was worded badly. What I wanted to know was the possible ways to insert initial data into my database. I tried using SQLAlchemy and checking whether each item already existed, inserting it if it did not. That seemed tedious and can't be the way to go if there is a lot of initial data. I am a beginner at this, and what better way to learn than to ask the people who do this regularly how they do it? Perhaps not a good fit for a question on Stack Overflow, sorry.
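For reference, the "check whether it exists, insert if not" approach described in the edit might look like this with a recent SQLAlchemy (1.4+); the User model, its columns, and the connection URL are placeholders:
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, nullable=False)
    role = Column(String, default="user")

engine = create_engine("postgresql://localhost/mydb")  # placeholder URL
Base.metadata.create_all(engine)

with Session(engine) as session:
    # only seed the admin user if it is not already there
    if not session.query(User).filter_by(name="admin").first():
        session.add(User(name="admin", role="admin"))
        session.commit()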
You could use a schema change management tool like Liquibase. Normally this is used to keep your database schema in source control and to apply patches that update your schema.
You can also use Liquibase to load data from CSV files. So you could add a startup.csv file to your Liquibase changelog that would be run the first time you run Liquibase against your database. You can also have it run at any time, and it will merge the data in the CSV with the database.
