Python Reddit API converting gifv to readable mp4 - python

I am completely stuck. While dabbling in Reddit's API (PRAW) I wanted to learn how to save the number 1 hottest post as an mp4. However, Reddit hosts most of its gifs on Imgur, which converts all gifs to gifv. How would I go about converting the gifv to mp4 so I can read it? By the way, simply renaming the file seems to produce a corrupted video.
This is my code so far: (details have been xxxx'd for confidentiality)
import praw
import requests

reddit = praw.Reddit(client_id="xxxx", client_secret="xxxx", username="xxxx", password="xxxx", user_agent="xxxx")
subreddit = reddit.subreddit("dankmemes")
hot_dm = subreddit.hot(limit=1)

for sub in hot_dm:
    print(sub)
    url = sub.url
    print(url)
    print(sub.permalink)
    meme = requests.get(url)
    newF = open("{}.mp4".format(sub), "wb")  # here the file is created, but when played it is corrupted
    newF.write(meme.content)
    newF.close()

Some posts already have an mp4 conversion inside the preview > variants portion of the JSON response.
So, to download only those posts that contain a gif (and therefore have an mp4 version), you could do something like this:
subreddit = reddit.subreddit("dankmemes")
hot_dm = subreddit.hot(limit=10)

for sub in hot_dm:
    if sub.selftext != "":  # skip self (text) posts; link posts to images/videos have an empty selftext
        continue
    try:  # try to access variants and catch the exception thrown
        has_variants = sub.preview['images'][0]['variants']  # variants contain both gif and mp4 versions (if available)
    except AttributeError:
        continue  # no conversion available as variants doesn't exist
    if 'mp4' not in has_variants:  # check that there is an mp4 conversion available
        continue
    mp4_video = has_variants['mp4']['source']['url']
    print(sub, sub.url, sub.permalink)
    meme = requests.get(mp4_video)
    with open(f"{sub}.mp4", "wb") as newF:
        newF.write(meme.content)
You will most likely want to increase the limit of posts you look through when searching hot, as the first post may be a pinned post (usually some rules about the subreddit); this is why I check the selftext first. In addition, some posts are only images, so with a small limit you might not return any posts that could be converted to mp4s.
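As an aside, when the submission's url points directly at an i.imgur.com .gifv link, a common workaround (an assumption about Imgur's hosting, not part of the answer above) is that Imgur serves an .mp4 at the same path, so swapping the extension before downloading usually yields a playable file. A rough sketch:

import requests

def download_imgur_gifv(url, out_path):
    # Assumption: i.imgur.com serves an .mp4 at the same path as the .gifv
    if url.endswith(".gifv"):
        url = url[:-len(".gifv")] + ".mp4"
    resp = requests.get(url)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)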

Merge jpg links into pdf python

I'm trying to use the kissmanga API to add manga to my website.
This is the code:
from kissmanga import get_search_results, get_manga_details, get_manga_episode, get_manga_chapter

manga_search = get_search_results(query="Attack on titan")

for k in manga_search:
    titleManga = k.get('title')
for k in manga_search:
    IdManga = k.get('mangaid')
for k in manga_search:
    print(titleManga)

manga_chapter = get_manga_chapter(mangaid=IdManga, chapNumber=1)
print(manga_chapter)
However, when I print manga_chapter I get:
{'totalPages': "['https://cdn.mangaclash.com/manga_5f3c9f1374eb8/237848eb4cd4b762b981f4e863a3edf9/1.jpg', 'https://cdn.mangaclash.com/manga_5f3c9f1374eb8/237848eb4cd4b762b981f4e863a3edf9/2.jpg', 'https://cdn.mangaclash.com/manga_5f3c9f1374eb8/237848eb4cd4b762b981f4e863a3edf9/3.jpg', 'https://cdn.mangaclash.com/manga_5f3c9f1374eb8/237848eb4cd4b762b981f4e863a3edf9/4.jpg', ']"}
How would I go about combining those jpgs into a pdf that a user can later download from my site (i.e., send the pdf to the user and then delete it to save space)?
I tried separating them individually with json but had no luck; I'm still kind of new to Python.
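Since the 'totalPages' value comes back as one big string rather than a list, one approach is to pull the URLs out with a regular expression, download each page, and let Pillow write them into a single PDF. A minimal sketch, assuming requests and Pillow are installed (the function name and output path are made up for illustration):

import io
import re

import requests
from PIL import Image

def pages_to_pdf(pages_field, pdf_path):
    # 'totalPages' is a string, not a list, so extract the URLs with a regex
    urls = re.findall(r"https?://[^'\s]+\.jpg", pages_field)
    images = []
    for url in urls:
        resp = requests.get(url)
        resp.raise_for_status()
        images.append(Image.open(io.BytesIO(resp.content)).convert("RGB"))
    if images:
        # Pillow writes a multi-page PDF when save_all=True and the remaining
        # pages are passed via append_images
        images[0].save(pdf_path, save_all=True, append_images=images[1:])

pages_to_pdf(manga_chapter['totalPages'], "chapter_1.pdf")

After sending the file to the user you could then delete it with os.remove(pdf_path) to save space.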

How to get twitter handle from tweet using Tweepy API 2.0

I am using the Twitter API StreamingClient from the Python module Tweepy. I am currently doing a short stream where I collect tweets and save the entire ID and text from each tweet inside a json object, which I write to a file.
My goal is to be able to collect the Twitter handle from each specific tweet and save it to a json file (preferably print it in the output terminal as well).
This is what the current code looks like:
import json
import time

import tweepy

KEY_FILE = './keys/bearer_token'
DURATION = 10

def on_data(json_data):
    json_obj = json.loads(json_data.decode())
    # print('Received tweet:', json_obj)
    print(f'Tweet Screen Name: {json_obj.user.screen_name}')
    with open('./collected_tweets/tweets.json', 'a') as out:
        json.dump(json_obj, out)

bearer_token = open(KEY_FILE).read().strip()
streaming_client = tweepy.StreamingClient(bearer_token)
streaming_client.on_data = on_data
streaming_client.sample(threaded=True)
time.sleep(DURATION)
streaming_client.disconnect()
I have no idea how to do this; the only thing I found is that someone did this:
json_obj.user.screen_name
However, this did not work at all, and I am completely stuck.
So, a couple of things.
Firstly, I'd recommend using on_response rather than on_data, because StreamingClient already defines an on_data function to parse the json (it then fires on_tweet, on_response, on_error, etc.).
Secondly, json_obj.user.screen_name is part of API v1, I believe, which is why it doesn't work.
To get extra data using the Twitter API v2, you'll want to use Expansions and Fields (Tweepy Documentation, Twitter Documentation).
For your case, you'll probably want "username", which is under the user_fields.
def on_response(response: tweepy.StreamResponse):
    tweet: tweepy.Tweet = response.data
    users: list = response.includes.get("users")
    # response.includes is a dictionary representing all the fields (user_fields, media_fields, etc)
    # response.includes["users"] is a list of `tweepy.User`
    # the first user in the list is the author (at least from what I've tested)
    # the rest of the users in that list are anyone who is mentioned in the tweet
    author_username = users and users[0].username
    print(tweet.text, author_username)

streaming_client = tweepy.StreamingClient(bearer_token)
streaming_client.on_response = on_response
# expansions="author_id" is needed so the requested user fields are included in the response
streaming_client.sample(threaded=True, expansions="author_id", user_fields=["id", "name", "username"])
time.sleep(DURATION)
streaming_client.disconnect()
Hope this helped.
(Also, the Tweepy documentation definitely needs more examples for API v2.)
KEY_FILE = './keys/bearer_token'
DURATION = 10

def on_data(json_data):
    json_obj = json.loads(json_data.decode())
    print('Received tweet:', json_obj)
    with open('./collected_tweets/tweets.json', 'a') as out:
        json.dump(json_obj, out)

bearer_token = open(KEY_FILE).read().strip()
streaming_client = tweepy.StreamingClient(bearer_token)
streaming_client.on_data = on_data
streaming_client.on_closed = on_finish  # on_finish is not shown in this snippet
streaming_client.sample(threaded=True, expansions="author_id", user_fields="username", tweet_fields="created_at")
time.sleep(DURATION)
streaming_client.disconnect()

Google Document Ai giving different outputs for the same file

I was using the Document OCR API to extract text from a pdf file, but part of the output is not accurate. I found that the reason may be the presence of some Chinese characters.
The following is a made-up example in which I cropped part of the region where the extracted text is wrong and added some Chinese characters to reproduce the problem.
When I use the website version, I cannot get the Chinese characters but the remaining characters are correct.
When I use Python to extract the text, I can get the Chinese characters correctly but part of the remaining characters are wrong.
The actual string that I got.
Are the versions of Document AI in the website and API different? How can I get all the characters correctly?
Update:
When I print detected_languages after printing the text, I get the following output. (I don't know why, but with lines = page.lines the detected_languages for both lines are empty lists; I needed to change to page.blocks or page.paragraphs first.)
Code:
from google.cloud import documentai_v1beta3 as documentai

project_id = 'secret-medium-xxxxxx'
location = 'us'  # Format is 'us' or 'eu'
processor_id = 'abcdefg123456'  # Create processor in Cloud Console

opts = {}
if location == "eu":
    opts = {"api_endpoint": "eu-documentai.googleapis.com"}

client = documentai.DocumentProcessorServiceClient(client_options=opts)

def get_text(doc_element: dict, document: dict):
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        start_index = (
            int(segment.start_index)
            if segment in doc_element.text_anchor.text_segments
            else 0
        )
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response

def get_lines_of_text(file_path: str, location: str = location, processor_id: str = processor_id, project_id: str = project_id):
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    # opts = {}
    # if location == "eu":
    #     opts = {"api_endpoint": "eu-documentai.googleapis.com"}

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processor/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": document}

    result = client.process_document(request=request)
    document = result.document
    document_pages = document.pages
    response_text = []

    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    # Read the text recognition output from the processor
    print("The document contains the following paragraphs:")
    for page in document_pages:
        lines = page.blocks
        for line in lines:
            block_text = get_text(line.layout, document)
            confidence = line.layout.confidence
            response_text.append((block_text[:-1] if block_text[-1:] == '\n' else block_text, confidence))
            print(f"Text: {block_text}")
            print("Detected Language", line.detected_languages)
    return response_text

if __name__ == '__main__':
    print(get_lines_of_text('/pdf path'))
It seems the language code is wrong; will this affect the result?
Posting this Community Wiki for better visibility.
One of features of DocumentAI is OCR - Optical Character Recognition which allows recognizing text from various files.
In this scenario, the OP received different outputs using the Try it function and the Client Libraries - Python.
Why are there discrepancies between Try it and the Python library?
It's hard to say, as both methods use the same API (documentai_v1beta3). It might be related to file modifications made when the pdf is uploaded to the Try it demo, different endpoints, language/alphabet recognition, or something else.
When you are using the Python client you also get the accuracy percentage of the text identification. Below are examples from my tests:
However, the OP's identification confidence is about 0.73, so it might produce wrong results, and in this situation that is a visible issue. I don't think it can be improved in code. It might help if the PDF were of better quality (in the OP's example there are some dots which might affect identification).
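For reference, here is a small sketch (my own addition, built on the response_text list of (text, confidence) tuples returned by get_lines_of_text above) that flags low-confidence blocks for manual review instead of trusting them:

CONFIDENCE_THRESHOLD = 0.8  # arbitrary cut-off, purely for illustration

for block_text, confidence in get_lines_of_text('/pdf path'):
    if confidence < CONFIDENCE_THRESHOLD:
        print(f"LOW CONFIDENCE ({confidence:.2f}): {block_text}")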

replying to a tweet in tweepy

There's an issue with my code where no matter what I try, every time I reply to a tweet, it just posts as a regular status update on my timeline.
Here is a snippet of the code:
import tweepy
from gtts import gTTS
from moviepy.editor import AudioFileClip, VideoFileClip, CompositeAudioClip

# api (the tweepy.API object) is configured elsewhere in the script

class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        tweetid = status.id
        tweetnouser = status.text.replace("#CarlWheezerBot", "")
        username = '#' + status.user.screen_name
        user_tweet = gTTS(text=tweetnouser, lang='en', slow=False)
        # Saving the converted audio
        user_tweet.save("useraudio/text2speech.mp3")
        # importing the audio and getting the audio all mashed up
        text2speech = AudioFileClip("useraudio/text2speech.mp3")
        videoclip = VideoFileClip("original_video/original_cut.mp4")
        editedAudio = videoclip.audio
        # splicing the original audio with the text2speech
        compiledAudio = CompositeAudioClip([editedAudio.set_duration(3.8), text2speech.set_start(3.8)])
        videoclip.audio = compiledAudio
        # saving the completed video file
        videoclip.write_videofile("user_video/edited.mp4", audio_codec='aac')
        upload_result = api.media_upload("user_video/edited.mp4")
        api.update_status(status='#CarlWheezerBot', in_reply_to_status_id=[tweetid], media_ids=[upload_result.media_id_string], auto_populate_reply_metadata=True)
I have also tried it without any status, as well as using status.id_str. Nothing seems to work. I have done it without the metadata parameter as well. I am following the documentation word for word.
OKAY, for everyone reading this in the future: use in_reply_to_status_id=tweetid and do not use the square brackets. Everything works perfectly now.
While playing around with it, I also noticed that you should mention the author of the tweet you're replying to, especially if you're replying to an existing reply, because otherwise it will still post as a status update. Line from the documentation:
in_reply_to_status_id – The ID of an existing status that the update is in reply to. Note: This parameter will be ignored unless the author of the Tweet this parameter references is mentioned within the status text. Therefore, you must include @username, where username is the author of the referenced Tweet, within the update.
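Putting those two points together, a reply might look like the sketch below; it is based on the snippets above (reusing the api, status, and upload_result objects) rather than being the bot's actual code:

# Pass the tweet id directly (no square brackets) and mention the author
# so that in_reply_to_status_id is not ignored.
reply_text = '@' + status.user.screen_name + ' #CarlWheezerBot'
api.update_status(
    status=reply_text,
    in_reply_to_status_id=status.id,
    media_ids=[upload_result.media_id_string],
)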

Office365-REST-Python-Client 401 on File Update

I finally got over the hurdle of uploading files into SharePoint, which enabled me to answer my own question here:
Office365-REST-Python-Client Access Token issue
However, the whole point of my project was to add metadata to the files being uploaded to make it possible to filter on them. For the avoidance of doubt, I am talking about column information in SharePoint's Document Libraries.
Ideally, I would like to do this when I upload the files in the first place, but my understanding of the REST API is that you have to upload first and then use a PUT request to update the file's metadata.
The link to the GitHub repository for Office365-REST-Python-Client:
https://github.com/vgrem/Office365-REST-Python-Client
This library seems to be the answer, but the closest thing I can find to documentation is under the examples folder. Sadly, an example for updating file metadata does not exist. I think part of the reason for this stems from the only option being to use a PUT request on a list item.
According to the REST API documentation, which this library is built on, an item's metadata must be operated on as part of a list.
REST API Documentation for file upload:
https://learn.microsoft.com/en-us/sharepoint/dev/sp-add-ins/working-with-folders-and-files-with-rest#working-with-files-by-using-rest
REST API Documentation for updating list metadata:
https://learn.microsoft.com/en-us/sharepoint/dev/sp-add-ins/working-with-lists-and-list-items-with-rest#update-list-item
There is an example for updating a list item:
https://github.com/vgrem/Office365-REST-Python-Client/blob/master/examples/sharepoint/listitems_operations_alt.py, but it returns a 401. If you look at my answer to my own question in the link up top, you will see that I granted this app full control. So an unauthorized response has stopped me dead in my tracks, wondering what to do next.
So after all that, my question is:
How do I upload a file to a Sharepoint Document Libary and add Metadata to its column information using Office365-REST-Python-Client?
Kind Regards
Rich
Upload endpoint request
url: http://site url/_api/web/GetFolderByServerRelativeUrl('/Shared Documents')/Files/Add(url='file name', overwrite=true)
method: POST
body: contents of binary file
headers:
    Authorization: "Bearer " + accessToken
    X-RequestDigest: form digest value
    content-type: "application/json;odata=verbose"
    content-length: length of post body
could be converted to the following Python example:
ctx = ClientContext(url, ctx_auth)
file_info = FileCreationInformation()
file_info.content = file_content
file_info.url = os.path.basename(path)
file_info.overwrite = True
target_file = ctx.web.get_folder_by_server_relative_url("Shared Documents").files.add(file_info)
ctx.execute_query()
Once the file is uploaded, its metadata can be set like this:
list_item = target_file.listitem_allfields # get associated list item
list_item.set_property("Title", "New title")
list_item.update()
ctx.execute_query()
I'm glad I stumbled upon this post and Office365-REST-Python-Client in general. However, I'm currently stuck trying to update a file's metadata; I keep receiving:
'File' object has no attribute 'listitem_allfields'
Any help is greatly appreciated. Note: I also updated this module to v2.3.1.
Here's my code:
list_title = "Documents"
target_folder = ctx.web.lists.get_by_title(list_title).root_folder
target_file = target_folder.upload_file(filename, filecontents)
ctx.execute_query()
list_item = target_file.listitem_allfields
I've also tried:
library_root = ctx.web.get_folder_by_server_relative_url('Shared Documents')
file_info = FileCreationInformation()
file_info.overwrite = True
file_info.content = filecontent
file_info.url = filename
upload_file = library_root.files.add(file_info)
ctx.load(upload_file)
ctx.execute_query()
list_item = upload_file.listitem_allfields
I've also tried to get the uploaded file item directly with the same result:
target_folder = ctx.web.lists.get_by_title(list_title).root_folder
target_file = target_folder.upload_file(filename, filecontent)
ctx.execute_query()
uploaded_file = ctx.web.get_file_by_server_relative_url(target_file.serverRelativeUrl)
print(uploaded_file.__dict__)
list_item = uploaded_file.listitem_allfields
All variations return:
'File' object has no attribute 'listitem_allfields'
What am I missing? How do I add metadata to a new SPO file/list item uploaded via Python/Office365-REST-Python-Client?
Update:
The problem was that I was looking for the wrong property of the uploaded file. The correct attribute is:
uploaded_file.listItemAllFields
Note the correct casing. Hopefully my question/answer will help someone else who is as unaware as I was of attribute/object casing.
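Putting the pieces together, a minimal sketch of upload plus metadata update (assuming Office365-REST-Python-Client v2.3.x; the "DocType" column is a hypothetical custom field, so adapt the list title and field names to your site):

list_title = "Documents"
target_folder = ctx.web.lists.get_by_title(list_title).root_folder
target_file = target_folder.upload_file(filename, filecontent)
ctx.execute_query()

list_item = target_file.listItemAllFields  # note the casing
list_item.set_property("Title", "New title")
list_item.set_property("DocType", "Invoice")  # hypothetical custom column
list_item.update()
ctx.execute_query()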
