I'm trying to gather metadata from a YouTube channel in Python via the YouTube Data API v3.
I've managed to get the data in JSON format, but text fields like titles, descriptions, and comments are in Korean, and they show up like this:
ex: "title": "\ud55c \uc5ec\ub984\ubc24\uc758 \uafc8"
Is there a way to set the encoding for specific fields/keys?
https://developers.google.com/youtube/v3/docs/i18nLanguages/list
The page above seems to be the way to solve the problem, but I am new to APIs and I do not know how to apply it. Can anyone give me an example or a link, so I can try it myself?
Or is there a way to decode the garbled-looking output back into the original language?
Thank you very much for your time.
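For what it's worth, those \uXXXX sequences are standard JSON Unicode escapes rather than a broken encoding, so any compliant JSON parser restores the original Korean by itself. A minimal sketch in Python, using the example title above:

import json

# The \uXXXX escapes are exactly what the API returned; json.loads decodes them.
raw = '{"title": "\\ud55c \\uc5ec\\ub984\\ubc24\\uc758 \\uafc8"}'
data = json.loads(raw)
print(data["title"])  # prints the Korean title: 한 여름밤의 꿈

If you re-serialize the data with json.dumps, pass ensure_ascii=False to keep the Korean characters readable instead of escaping them again.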
Related
Can anyone help me with converting a PDF file to an XML file using Python code? My PDF contains:
Unstructured data
Images
Mathematical equations
Chemical Equations
Table Data
Logos, tags, etc.
I tried using PDFMiner, but my PDF data was not converted into .xml/.json file format. Are there any libraries other than PDFMiner? PyPDF2, tabula-py, PDFQuery, Camelot, PyMuPDF, pdf2docx, pandas: none of these other libraries/utilities were suitable for my requirement.
Please advise me on any other options. Thank you.
The first thing I would recommend trying is GROBID (see here for the full documentation). You can play with an online demo here to see if it fits your needs (select TEI -> Process Fulltext Document, and upload a PDF). You can also check out this project from the Allen Institute (it is based on GROBID and has a handy function for converting TEI.XML to JSON).
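To give a feel for how GROBID is typically driven from Python, here is a minimal sketch that posts a PDF to a locally running GROBID server (e.g. started via Docker) through its REST API; the file name is a placeholder:

import requests

# Assumes a GROBID server is already running locally on port 8070.
with open("paper.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8070/api/processFulltextDocument",
        files={"input": f},
    )
resp.raise_for_status()
tei_xml = resp.text  # TEI XML as a string; parse it with lxml or BeautifulSoup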
The other option that does a good job is the Adobe PDF Extract API (see here). It is of course a paid service, but when you register for an account you get 1,000 document transactions for free. It's easy to use from Python, well documented, and a good way to experiment and get a feel for the difficulties of reliable data extraction from PDF.
I worked with both options to extract text, figures, tables etc. from scientific papers. Both yielded good results. The main problem with out-of-the-box solutions is that, when you work with complex formats (or badly formatted docs), erroneously identified document elements are quite common (for example a footnote or a header gets merged with the main text). Both options are based on machine learning models and, at least for GROBID, it is possible to retrain these models for your specific task (I haven't tried this so far, so I don't know how worthwhile it is).
However, if your target PDFs are all of the same (simple) format (or if you can control their format) you should be fine with either option.
So, I am trying to print out GIFs using the Tenor API.
I want it to print only one GIF link, but it prints out everything. Any idea how to fix this?
Thank you.
https://i.stack.imgur.com/xf084.png
Sadly, I cannot tell you the exact problem you are having; I replicated your code and used the official API docs here.
From what I can tell, this is one GIF, just in a lot of different formats.
You can filter them like so:
print(top_8gifs['weburl'])
or
print(top_8gifs['results'][0])
EDIT: Looking at your .png (please embed your code as text in the future), this should work for you if you want the URL:
print(top_8gifs['results'][0]['url'])
From a Python dict you select values by key (like gifs['weburl']); from a Python list you select by index (like gifs[0]).
Using these techniques you can gather the data you need from that output.
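To make that concrete, here is a rough sketch of fetching and filtering Tenor results; it assumes the v1 search endpoint and a placeholder API key, so check it against the current Tenor docs:

import requests

# Placeholder key and query; the v1 response holds a "results" list,
# and each result carries the same GIF in several media formats.
params = {"q": "excited", "key": "YOUR_TENOR_API_KEY", "limit": 8}
top_8gifs = requests.get("https://g.tenor.com/v1/search", params=params).json()

first_gif = top_8gifs["results"][0]         # pick a single GIF
print(first_gif["url"])                     # its page URL
print(first_gif["media"][0]["gif"]["url"])  # direct link to the .gif file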
This is a small project I'd like to get started on in the near future. It's still in the planning stage, so this post is more about being steered in the right direction.
Essentially, I'd like to obtain tweets from a user and parse them into a table/database, with the aim of running this program in real time.
My initial plan was to tackle this with Beautiful Soup, a Python library; however, I believe the Twitter API is the better approach (advice on this subject would be appreciated).
There are still 3 unknowns:
1) Where do I store the tweets once obtained?
2) How do I parse the tweets?
3) Where do I store the parsed data?
To answer (3), I suppose it depends on what I want to do with the data. I still haven't decided how I'll use the parsed data, but I know I'd like it put into categories, so my thinking is probably a database/table/Excel?
A few questions still to answer, and I'd like you guys to steer me in the right direction. My programming language knowledge is limited to just C for now, but as this project means a great deal to me, I'm willing to put in the effort and learn the necessary languages/APIs.
What languages/APIs will I need to gain an understanding of to accomplish this project? From where I stand, it seems to be the Twitter API and Python.
EDIT: So I have a basic script going which obtains a user's tweets. It works better than expected. However, I'd like to take it another step: I'd like to obtain the user's tweets only if a tweet contains a hashtag; all other tweets should be ignored. How best to do this?
Here is a snippet of the basic code I have going:
import tweepy
import twitter_credentials

# Authenticate with the keys stored in twitter_credentials.py
auth = tweepy.OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

# Fetch the user's 10 most recent tweets, excluding retweets
stuff = api.user_timeline(screen_name='XXXXXXXXXX', count=10, include_rts=False)
for status in stuff:
    print(status.text)
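Regarding the EDIT about keeping only tweets that contain a hashtag: one way, sketched below, relies on the entities dict that tweepy exposes on each status:

for status in stuff:
    # status.entities mirrors the API's "entities" object; its "hashtags"
    # value is a list, empty when the tweet contains no hashtags.
    if status.entities.get("hashtags"):
        print(status.text)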
Scraping Twitter (or any other social network) with, for example, Beautiful Soup is, as you said, not a good idea, for 2 reasons:
if the source pages change (name attributes, div ids...), you have to keep your code up to date
your script can be banned, because scraping is not allowed.
To answer your questions:
1) You can store the tweets wherever you want: CSV, MySQL, SQLite, Redis, Neo4j...
2) With the official API, you get JSON. Here is a Tweet object: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html . With tweepy, for example, status.text will give you the text of the tweet.
3) Same as #1. If you don't actually know yet what you will do with the data, store the full JSONs; you will be able to parse them later.
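As a sketch of that last point: tweepy keeps the raw API payload of each status on status._json (an internal but commonly used attribute), which you can append to a JSON-lines file for later parsing:

import json

# "stuff" is the list of statuses from the question's snippet
with open("tweets.jsonl", "a", encoding="utf-8") as f:
    for status in stuff:
        f.write(json.dumps(status._json, ensure_ascii=False) + "\n")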
I suggest tweepy/Python (http://www.tweepy.org/) or twit/Node.js (https://www.npmjs.com/package/twit). And read the official docs: https://developer.twitter.com/en/docs/api-reference-index
I have been wondering whether it is possible to access the data displayed on graphs such as these: https://coinmarketcap.com/currencies/bitcoin/
I have looked at the .htm file and other links within it, yet none of them seem to contain this data. Would anyone know how to retrieve and print information from the graph using Python?
You can retrieve the information from coinmarketcap using their API.
https://coinmarketcap.com/api/
This will allow you to request data from them, sometimes with given parameters (e.g. the currency), and you'll receive the data in JSON format, which you can then parse and display with Python.
There are plenty of API tutorials out there for python on the web. Good luck!
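As a starting point, here is a minimal sketch against CoinMarketCap's keyed API; the endpoint, parameters, and response shape are assumptions based on their docs at the time of writing, and the key is a placeholder, so verify the details on the page above:

import requests

url = "https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest"
headers = {"X-CMC_PRO_API_KEY": "YOUR_API_KEY"}  # placeholder key
params = {"symbol": "BTC", "convert": "USD"}

data = requests.get(url, headers=headers, params=params).json()
# Drill into the JSON for the latest USD price of Bitcoin
print(data["data"]["BTC"]["quote"]["USD"]["price"])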
I'd like to know how I can simply get the title or other information about a video using the YouTube API, in the case where the only thing I know is the URL of the video (so, basically, the video ID).
What other info can I get about a video? E.g. length, category, uploader name, country of origin...?
Can somebody provide me with a usable code snippet and the library to use for this data collection?
Thanks for the help in advance.
There are plenty of examples and documentation about what you can get using the YouTube API's Python bindings.
Here are Python code samples as provided by YouTube:
https://developers.google.com/youtube/v3/code_samples/python#create_and_manage_youtube_video_caption_tracks
And the code samples can also be downloaded from their GitHub repository:
https://github.com/youtube/api-samples/tree/master/python
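To sketch what this looks like with the official google-api-python-client (the API key and video ID below are placeholders): a videos().list call with part="snippet,contentDetails,statistics" returns the title, uploader (channel) name, category ID, duration, view counts, and more.

from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")
response = youtube.videos().list(
    part="snippet,contentDetails,statistics",
    id="VIDEO_ID",  # the ID parsed from the video URL
).execute()

video = response["items"][0]
print(video["snippet"]["title"])            # title
print(video["snippet"]["channelTitle"])     # uploader name
print(video["contentDetails"]["duration"])  # ISO 8601 length, e.g. PT4M13S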
Try the code below (search_result and DEVELOPER_KEY come from your own search response and API credentials):

import json
import requests

payload = {'id': search_result["id"]["videoId"],  # video ID from a prior search
           'part': 'contentDetails,statistics,snippet',
           'key': DEVELOPER_KEY}
l = requests.Session().get('https://www.googleapis.com/youtube/v3/videos', params=payload)
resp_dict = json.loads(l.content)
print("Title:", resp_dict['items'][0]['snippet']['title'])
Try using this API: https://pypi.org/project/python-youtube/
To grab the title, you can do something like this:
from pyyoutube import Api

api = Api(api_key="YOUR_API_KEY")  # create the client before querying
playlistVideoItems = api.get_playlist_items(playlist_id='PLOU2XLYxmsIKpaV8h0AGE05so0fAwwfTw').items
print(playlistVideoItems[0].snippet.title)
Note that the above is somewhat untested code. I copied the relevant bits and pieces from what I currently have, but I did not test this exact set of lines. And of course, I'm not actually using that playlist ID for my purposes.
In regards to what other types of information can be gathered, I would recommend reading the documentation or poking around in a debugger. As of this writing, I am trying to figure out how to obtain the video uploader's name; I don't really need this data, although it would be nice to have. I will update this answer if I figure it out.