Facebook/Instagram Comments Scraping - python

I am looking for a tutorial or any other information on scraping Facebook/Instagram comments with Python. It has to be legal: this is part of the data collection for my thesis, so I need legitimate ways to scrape the data.
Could you please point me to useful sources for doing this?

https://medium.com/analytics-vidhya/web-scraping-instagram-with-selenium-python-b8e77af32ad4
Medium usually provides fairly good instructions.

Related

TexSoup for Bib-Files

This is my first question, so I will try to do everything as properly as possible.
I currently use LaTeX to write documents at my university because I want the powerful citation capabilities provided by BibTeX. For ease of use, I am writing scripts that pull my .bib files into my .tex files more easily and simplify managing the .bib files. Since I am using Arch Linux, I did this in bash, but it is a little clunky. I therefore wanted to switch to Python, where I came across the TexSoup library.
My issue is that I cannot find resources on using TexSoup with .bib files; I can only find resources for .tex files. Does anybody know whether, and if so how, I can use TexSoup to find books, articles, or other entries in my .bib files with Python?
from TexSoup import TexSoup

with open("bib_complete.bib") as f:
    soup = TexSoup(f)
print(soup)
This is the code sample I am trying to use, but I don't know how to look up entry names or entry types with the package. I would really appreciate it if someone could point me to good resources, if they exist.
I hope my question was comprehensible and not too long.
Thanks everybody!
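Since the question only turned up TexSoup material for .tex files, one fallback that needs no extra library is a regular expression over the entry headers. This is only a minimal sketch: it assumes every entry starts with `@type{key,`, it does not parse fields or nested braces, and the helper name `list_entries` and the sample entries are made up for illustration.

```python
import re

# Matches the header of a BibTeX entry, e.g. "@book{knuth1984,".
ENTRY_HEADER = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def list_entries(bib_text):
    """Return (entry_type, citation_key) pairs found in BibTeX source."""
    return [(m.group(1).lower(), m.group(2)) for m in ENTRY_HEADER.finditer(bib_text)]


# Illustrative sample; in practice read the text from bib_complete.bib.
sample = """
@book{knuth1984,
  title = {The TeXbook},
  author = {Donald E. Knuth},
}
@Article{example2020,
  title = {A placeholder article},
}
"""

entries = list_entries(sample)
books = [key for entry_type, key in entries if entry_type == "book"]
print(entries)
print(books)
```

Filtering by entry type, as in `books`, is then an ordinary list comprehension. For anything beyond listing keys and types, a dedicated BibTeX parser is probably the safer route.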

Python Web Scripting

I wanted to do this before for some websites but didn't know where to start. This time, however, I am adamant. I am talking about scripts that crawl a website and extract the data we require.
My target is this: I have to appear for job interviews in December. There is a site (http://www.geeksforgeeks.org/) which contains a large number of questions from previous interviews (like http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/ & http://www.geeksforgeeks.org/adobe-interview-set-6-campus-mts-1/). Every title has the word "set" and a number in it. It is quite cumbersome to keep track of what I have done and what I have not, so I want to extract the questions from each of these pages and put them in a PDF with the title.
How can I do this using curl, regex and Scrapy? I am intermediate in C/C++/Java but have only beginner proficiency in Python. Any help is much appreciated. Also point me to any such scripts you know of. I want to do this on my own; I just need a starting point and some guidance. Thanks.
If you just want a starting point, try Scrapy, a screen-scraping library for Python. I would recommend that you use the requests library for making requests. It's by far the simplest option (with no loss of power).
Also, don't try to parse HTML or XML with a regex. Just don't. Use one of the fine libraries available (BeautifulSoup or lxml, or lxml with a BeautifulSoup backend, are the most popular, but there are others).
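To make the "use a parser, not a regex" advice concrete, here is a minimal sketch. It swaps in `html.parser` from the standard library in place of BeautifulSoup or lxml so that it runs without any installs. `LinkCollector`, `interview_set_links`, the sample HTML and the `/about/` URL are hypothetical, and the `"-set-"` filter assumes the URL pattern seen in the question's two example links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def interview_set_links(html):
    """Return only the links that look like interview 'set' pages."""
    parser = LinkCollector()
    parser.feed(html)
    return [link for link in parser.links if "-set-" in link]


# Stand-in for a downloaded index page (e.g. fetched with requests).
sample_html = """
<ul>
  <li><a href="http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/">Amazon</a></li>
  <li><a href="http://www.geeksforgeeks.org/about/">About</a></li>
  <li><a href="http://www.geeksforgeeks.org/adobe-interview-set-6-campus-mts-1/">Adobe</a></li>
</ul>
"""

set_links = interview_set_links(sample_html)
print(set_links)
```

The same structure carries over to BeautifulSoup or lxml: fetch the page, let the parser find the `<a>` tags, then filter the URLs in plain Python.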

how to count number of links shared by a facebook page?

I am working on a website for which it would be useful to know the number of links shared by a particular facebook page (e.g., http://www.facebook.com/cocacola) so that the user can know whether they are 'liking' a firehose of information or a dribble of goodness. What is the best way to get the number of links/status updates that are shared by a particular page?
+1 for implementations that use python (this is a django website) but any solutions are welcome! I tried using fbconsole to accomplish this but I have come up a little short.
For what it is worth, this unanswered question seems relevant. So does the fact that, as of 2012.04.18, you can export your data to CSV from the Insights management page on the Facebook site. The information is in there; I just don't know how to get it out...
Thanks for your help!
In the event that anyone else finds this useful, I thought I'd post my gist example here. fbconsole makes it fairly simple to extract data through the Facebook Graph API.
The caveat is that it was not terribly easy to extract data programmatically through fbconsole, so I wrote fbconsole.automatically_authenticate to make it much easier to access this information in a systematic way. This addition has not yet been incorporated into the master branch of fbconsole (it was just posted this morning), but it is available here in the meantime for those who are interested.

Help with parsing lxml

For a college project, I need to handle XML files. After doing some research, I chose lxml for this. However, I can't seem to find a good tutorial to help me get started, and I can't decide which type of parsing I should use. My XML files don't contain much data, and speed is the main concern, not memory.
Can anyone point me to a tutorial or a book that would help? I have already tried the tutorial on the lxml site, but it didn't help me much. Is there a small application I can look at to get the hang of parsing XML with lxml?
No applications but examples:
http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
http://infohost.nmt.edu/tcc/help/pubs/pylxml/pylxml.pdf
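For small files where speed matters more than memory, parsing the whole document into a tree and querying it is usually the simplest place to start. The sketch below uses only calls that lxml.etree shares with the standard library's ElementTree (`fromstring`, `findall`, `get`, `findtext`), so it falls back to the stdlib if lxml is not installed. The `<courses>` document and its element names are invented for illustration.

```python
try:
    from lxml import etree  # preferred if lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with a similar API

# Made-up sample document; a real file could be loaded with etree.parse(path).
xml_text = """
<courses>
  <course id="c1"><title>Algorithms</title></course>
  <course id="c2"><title>Databases</title></course>
</courses>
"""

root = etree.fromstring(xml_text.strip())

# Map each course id attribute to the text of its <title> child.
titles = {course.get("id"): course.findtext("title") for course in root.findall("course")}
print(titles)
```

If the files ever grow large, `etree.iterparse` is the incremental alternative, but for small inputs the in-memory tree is likely easier to work with.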

what next after 'dive into python'

I've been meaning to learn a language other than Java, so I started to poke around with Python. I've gone through 'Dive Into Python', so I have a decent knowledge of Python now.
Where do you suggest I go from here? I don't want to go through another advanced book and would like to use my Python knowledge towards building 'something'.
I've heard that Python is good for web crawling; however, I did not see that in Dive Into Python. Can the community suggest how to use my Python knowledge for web crawlers or spiders?
That really kind of depends on what you enjoy, or would like to build. Since you haven't said, I'll recommend something I enjoyed instead. Programming Collective Intelligence by Toby Segaran is a fun book, and the examples are all in Python. It might be more interesting to you -- if nothing else, it would give your web crawler something to do with the pages it gathers.
Edit: Fusspawn's suggestion of PyGame is very good if you don't want any more books and just want to "dive in" to something.
You can try my Building Skills in OO Design.
http://homepage.mac.com/s_lott/books/oodesign.html
If you like math, try learning Python by solving Project Euler problems. Each problem does not take much code, and it helped me improve my Python skills.
I always find that making a small game is a nice way to learn a language.
PyGame makes it simple and could help you learn more about Python. I suggest giving it a go if you're that way inclined.
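To give a feel for how little code a Project Euler problem can need, here is a minimal sketch for the first problem (summing the natural numbers below a limit that are multiples of 3 or 5). The function name `sum_of_multiples` is just an illustrative choice.

```python
def sum_of_multiples(limit):
    """Sum all natural numbers below `limit` that are divisible by 3 or 5."""
    return sum(n for n in range(limit) if n % 3 == 0 or n % 5 == 0)


# The problem statement's worked example: 3 + 5 + 6 + 9 = 23.
print(sum_of_multiples(10))
```

Later problems reward looking for a smarter approach than brute force, which is where most of the learning happens.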
To get started with web crawling, consider the Scrapy framework.
http://scrapy.org/
"Scrapy is a high level scraping and web crawling framework for writing spiders to crawl and parse web pages for all kinds of purposes, from information retrieval to monitoring or testing web sites."
It's still edging towards a first release, but is usable and has decent documentation.
For very basic web scraping, check out Mechanize (for basic web "browsing") and BeautifulSoup (for parsing "html soup"):
http://wwwsearch.sourceforge.net/mechanize/
http://www.crummy.com/software/BeautifulSoup/
One fun thing to do would be to combine these interests with some natural language processing projects. The NLTK book recently published by O'Reilly is available online as well:
http://www.nltk.org/book
Lots of fun to be had combining these interests. :-)
If you want to expand beyond web crawling and don't want to start your own project (or don't know what to do), check out The Python Challenge. It's a game where you have to solve puzzles with a bit of Python code. I really enjoyed it.
Is web crawling something you want to do or just something you think you can accomplish? Python is a good tool for web crawling (see here and here), but if you really just want ANY project to work on to get more familiar with the language/APIs, I'd suggest you pick a project that you have a general interest in regardless. That way it'll be easier to see it through, as you already have an interest in the project in addition to an interest in the language.
Find an interesting open source project to participate in. You could start looking on pythonsource or sourceforge.
The Tools/webchecker/ directory, which should be in your Python distribution (otherwise you can get it via the link I gave), is a start -- with lots of limitations (no threading except in wsgui.py, no async operation, ...), but removing some of them would be a great learning experience!
A vastly superior spidering system could be built on top of Twisted, e.g. starting with the snippet at the bottom of this mail (which only gets one page, but in the proper asynchronous way!) and adding the other functionality you see exemplified in webchecker (parse and respect robots.txt, get links from pages, etc, etc).
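As one small piece of that list, respecting robots.txt can be tried out with the standard library's `urllib.robotparser`, independently of whether the crawler ends up on Twisted or something else. This is a minimal sketch: the robots.txt content, the example.com URLs, the `example-crawler` agent name and the `allowed` helper are all made up for illustration, and the rules are parsed from a string so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the target site (RobotFileParser.set_url plus read do that).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())


def allowed(url, agent="example-crawler"):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return rules.can_fetch(agent, url)


for url in ("http://example.com/index.html", "http://example.com/private/data.html"):
    print(url, allowed(url))
```

A spider would call a check like `allowed` before queueing each link it extracts from a page.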
If you want an "advanced book", I recommend Alex's Python in a Nutshell, Second Edition (I learned quite a lot from it) and Tarek's Expert Python Programming; the title alone tells you it's an advanced book :).
For reading open source projects, I recommend SQLAlchemy and Django.
Starting your own project may be the best way, though.
Others have said it but I'll repeat: work on something you are interested in or it won't be fun.
If you do decide that a crawler would be fun, take a look at google-kongulo, web spider plugin for Google desktop search. The code is quite short and well-written, so this might make a good base for when you decide what you want to crawl.
If you're specifically interested in crawling the Web, check out the three-part talk called "Scrape the Web" given at PyCon 2009. It's part of this RSS feed.
Read Dive Into Python again; it discusses HTML processing and HTTP web services in chapters 8 and 11.
