I have written a Lambda function whose goal is to accept a PDF, parse it using PyPDF2, and return specific text fields as a payload.
import PyPDF2 as pypdf

def lambda_handler(event, context):
    file_path = event['body']
    pdfdoc = open(file_path, 'rb')
    pdfreader = pypdf.PdfFileReader(pdfdoc)
    # dictionary of extracted text fields
    dic = pdfreader.getFormTextFields()
    # fields we want to send in our payload
    firstname = dic['name'].split(' ')[0]
    lastname = dic['name[0]'].split(' ')[1]
    payload = {"FirstName": firstname,
               "LastName": lastname,
               }
    return jsonify(payload)
The problem is, I don't know if this is going to work, and I want to test it by feeding it a PDF and seeing what error it spits out at me. How can I manually feed my function a PDF without tying it to API Gateway or another app? I've tested the code locally, but I've never passed a PDF to a Lambda function before, so this is the part that is throwing me off.
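One way to do this without any AWS plumbing is to import the handler and call it with a hand-built event. This is only a sketch under the assumption (as in the code above) that event['body'] holds a plain path to a PDF on disk; the module name and sample path are placeholders, not a real API Gateway event:
# local_test.py -- minimal local harness (assumed layout: handler lives in lambda_function.py)
from lambda_function import lambda_handler

test_event = {"body": "/path/to/sample.pdf"}  # placeholder path to a local test PDF
print(lambda_handler(test_event, None))       # context isn't used by the handler, so None is fine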
I am using the Twitter API StreamingClient from the Python module Tweepy. I am currently doing a short stream where I am collecting tweets and saving the entire ID and text from each tweet inside a JSON object and writing it to a file.
My goal is to be able to collect the Twitter handle from each specific tweet and save it to a JSON file (and preferably print it in the output terminal as well).
This is what the current code looks like:
import json
import time
import tweepy

KEY_FILE = './keys/bearer_token'
DURATION = 10

def on_data(json_data):
    json_obj = json.loads(json_data.decode())
    # print('Received tweet:', json_obj)
    print(f'Tweet Screen Name: {json_obj.user.screen_name}')
    with open('./collected_tweets/tweets.json', 'a') as out:
        json.dump(json_obj, out)

bearer_token = open(KEY_FILE).read().strip()
streaming_client = tweepy.StreamingClient(bearer_token)
streaming_client.on_data = on_data
streaming_client.sample(threaded=True)
time.sleep(DURATION)
streaming_client.disconnect()
I have no idea how to do this; the only thing I found is that someone did this:
json_obj.user.screen_name
However, this did not work at all, and I am completely stuck.
So, a couple of things:
Firstly, I'd recommend using on_response rather than on_data, because StreamingClient already defines an on_data function to parse the JSON (it then fires on_tweet, on_response, on_error, etc.).
Secondly, json_obj.user.screen_name is part of API v1, I believe, which is why it doesn't work.
To get extra data using Twitter API v2, you'll want to use Expansions and Fields (Tweepy Documentation, Twitter Documentation).
For your case, you'll probably want to use "username", which is under user_fields.
def on_response(response: tweepy.StreamResponse):
    tweet: tweepy.Tweet = response.data
    users: list = response.includes.get("users")
    # response.includes is a dictionary representing all the fields (user_fields, media_fields, etc.)
    # response.includes["users"] is a list of `tweepy.User`
    # the first user in the list is the author (at least from what I've tested)
    # the rest of the users in that list are anyone who is mentioned in the tweet
    author_username = users and users[0].username
    print(tweet.text, author_username)

streaming_client = tweepy.StreamingClient(bearer_token)
streaming_client.on_response = on_response
# expansions="author_id" is needed so response.includes actually contains the users
streaming_client.sample(threaded=True, expansions="author_id", user_fields=["id", "name", "username"])  # using user fields
time.sleep(DURATION)
streaming_client.disconnect()
Hope this helped.
Also, the Tweepy documentation definitely needs more examples for API v2.
KEY_FILE = './keys/bearer_token'
DURATION = 10

def on_data(json_data):
    json_obj = json.loads(json_data.decode())
    print('Received tweet:', json_obj)
    with open('./collected_tweets/tweets.json', 'a') as out:
        json.dump(json_obj, out)

bearer_token = open(KEY_FILE).read().strip()
streaming_client = tweepy.StreamingClient(bearer_token)
streaming_client.on_data = on_data
streaming_client.on_closed = on_finish  # on_finish is assumed to be defined elsewhere
streaming_client.sample(threaded=True, expansions="author_id", user_fields="username", tweet_fields="created_at")
time.sleep(DURATION)
streaming_client.disconnect()
I'm new to developing, and my question involves creating an API endpoint in our route. The API will be used for a POST from a Vuetify UI. Data will come from our MongoDB. We will be getting a .txt file for our shell script, but it will have to be POSTed as JSON. I think these are the steps for converting the text file:
1) create a list for the lines of the .txt
2) add each line to the list
3) join the list elements into a string
4) create a dictionary with the file/file content and convert it to JSON
This is my current code for the steps:
import json

### something.txt: an example of the shell script ###
f = open("something.txt")

# create a list to put the lines of the file in
file_output = []

# add each line of the file to the list
for line in f:
    file_output.append(line)

# mashes all of the list elements together into one string
fileoutput2 = ''.join(file_output)
print(fileoutput2)

# create a dict with file and file content and then convert to JSON
json_object = {"file": fileoutput2}
json_response = json.dumps(json_object)
print(json_response)
{"file": "Hello\n\nSomething\n\nGoodbye"}
Below is my baseline code, which I execute on the button press in the UI:
@bp_customer.route('/install-setup/<string:customer_id>', methods=['POST'])
def install_setup(customer_id):
    cust = Customer()
    customer = cust.get_customer(customer_id)
    ### example of a series of lines with newline character between them.
    script_string = "Beginning\nof\nscript\n"
    json_object = {"file": script_string}
    json_response = json.dumps(json_object)
    # get the install shell script content
    # replace the values (somebody has already done this)
    # attempt to return the below example json_response
    return make_response(jsonify(json_response), 200)
My current Vuetify button-press code is here; I just have to amend it to a POST and the new route once this is established:
onClickScript() {
  console.log("clicked");
  axios
    .get("https://sword-gc-eadsusl5rq-uc.a.run.app/install-setup/")
    .then((resp) => {
      console.log("resp: ", resp.data);
      this.scriptData = resp.data;
    });
},
I'm having a hard time combining these 2 concepts in the correct way. Any input as to whether I'm on the right path? Insight from anyone who's much more experienced than me?
You're on the right path, but needlessly complicating things a bit. For example, the first bit could be just:
import json
with open("something.txt") as f:
json_response = json.dumps({'file': f.read()})
print(json_response)
And since you're looking to pass everything through jsonify anyway, even this would suffice:
with open("something.txt") as f:
data = {'file': f.read()}
You can then pass data directly through jsonify. The rest of it isn't sufficiently complete to offer any concrete comments, but the basic idea is OK.
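Putting that together with the route from your question, a minimal sketch could look like this (the script path and the Customer lookup are taken from your snippets and are assumptions, not a tested implementation):
from flask import jsonify, make_response

@bp_customer.route('/install-setup/<string:customer_id>', methods=['POST'])
def install_setup(customer_id):
    cust = Customer()
    customer = cust.get_customer(customer_id)  # values could be substituted into the script here
    with open("something.txt") as f:           # placeholder path to the shell script
        data = {'file': f.read()}
    return make_response(jsonify(data), 200)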
If you have a working whole, you could go to https://codereview.stackexchange.com/ to ask for some reviews; questions on Stack Overflow should be limited to actual questions about getting something to work.
I finally got over the hurdle of uploading files into SharePoint, which enabled me to answer my own question here:
Office365-REST-Python-Client Access Token issue
However, the whole point of my project was to add metadata to the files being uploaded to make it possible to filter on them. For the avoidance of doubt, I am talking about column information in SharePoint's Document Libraries.
Ideally, I would like to do this when I upload the files in the first place, but my understanding of the REST API is that you have to upload first and then use a PUT request to update the metadata.
The link to the Git Hub for Office365-REST-Python-Client:
https://github.com/vgrem/Office365-REST-Python-Client
This library seems to be the answer, but the closest I can find to documentation is under the examples folder. Sadly, the example for updating file metadata does not exist. I think part of the reason for this stems from the only option being to use a PUT request on a list item.
According to the REST API documentation, which this library is built on, an item's metadata must be operated on as part of a list.
REST API Documentation for file upload:
https://learn.microsoft.com/en-us/sharepoint/dev/sp-add-ins/working-with-folders-and-files-with-rest#working-with-files-by-using-rest
REST API Documentation for updating list metadata:
https://learn.microsoft.com/en-us/sharepoint/dev/sp-add-ins/working-with-lists-and-list-items-with-rest#update-list-item
There is an example for updating a list item:
https://github.com/vgrem/Office365-REST-Python-Client/blob/master/examples/sharepoint/listitems_operations_alt.py but it returns a 401. If you look at my answer to my own question in the link up top, you will see that I granted this app full control. So an unauthorized response has stopped me dead in my tracks, wondering what to do next.
So after all that, my question is:
How do I upload a file to a Sharepoint Document Libary and add Metadata to its column information using Office365-REST-Python-Client?
Kind Regards
Rich
Upload endpoint request
url: http://site url/_api/web/GetFolderByServerRelativeUrl('/Shared Documents')/Files/Add(url='file name', overwrite=true)
method: POST
body: contents of binary file
headers:
Authorization: "Bearer " + accessToken
X-RequestDigest: form digest value
content-type: "application/json;odata=verbose"
content-length: length of post body
could be converted to the following Python example:
ctx = ClientContext(url, ctx_auth)
file_info = FileCreationInformation()
file_info.content = file_content
file_info.url = os.path.basename(path)
file_info.overwrite = True
target_file = ctx.web.get_folder_by_server_relative_url("Shared Documents").files.add(file_info)
ctx.execute_query()
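In this snippet, file_content and path are assumed to already hold the raw bytes and the local path of the file; one way to populate them (a minimal sketch) is:
path = "/path/to/local/file.txt"   # placeholder local path
with open(path, 'rb') as content_file:
    file_content = content_file.read()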
Once the file is uploaded, its metadata can be set like this:
list_item = target_file.listitem_allfields # get associated list item
list_item.set_property("Title", "New title")
list_item.update()
ctx.execute_query()
I'm glad I stumbled upon this post and Office365-REST-Python-Client in general. However, I'm currently stuck trying to update a file's metadata; I keep receiving:
'File' object has no attribute 'listitem_allfields'
Any help is greatly appreciated. Note: I also updated this module to v2.3.1.
Here's my code:
list_title = "Documents"
target_folder = ctx.web.lists.get_by_title(list_title).root_folder
target_file = target_folder.upload_file(filename, filecontents)
ctx.execute_query()
list_item = target_file.listitem_allfields
I've also tried:
library_root = ctx.web.get_folder_by_server_relative_url('Shared Documents')
file_info = FileCreationInformation()
file_info.overwrite = True
file_info.content = filecontent
file_info.url = filename
upload_file = library_root.files.add(file_info)
ctx.load(upload_file)
ctx.execute_query()
list_item = upload_file.listitem_allfields
I've also tried to get the uploaded file item directly with the same result:
target_folder = ctx.web.lists.get_by_title(list_title).root_folder
target_file = target_folder.upload_file(filename, filecontent)
ctx.execute_query()
uploaded_file = ctx.web.get_file_by_server_relative_url(target_file.serverRelativeUrl)
print(uploaded_file.__dict__)
list_item = uploaded_file.listitem_allfields
All variations return:
'File' object has no attribute 'listitem_allfields'
What am I missing? How do I add metadata to a new SPO file/list item uploaded via Python and Office365-REST-Python-Client?
Update:
The problem was I was looking for the wrong property of the uploaded file. The correct attribute is:
uploaded_file.listItemAllFields
Note the correct casing. Hopefully my question/answer will help someone else who is as unaware as I was of attribute/property casing.
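For completeness, a minimal sketch of the working flow with the corrected casing (untested here; "Title" is just an example column):
target_folder = ctx.web.lists.get_by_title("Documents").root_folder
target_file = target_folder.upload_file(filename, filecontents)
ctx.execute_query()

list_item = target_file.listItemAllFields   # note the casing
list_item.set_property("Title", "New title")  # example column value
list_item.update()
ctx.execute_query()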
I'm trying to send the output of a reduce to a map by chaining them in a pipeline, similar to this question:
I would like to chain multiple mapreduce jobs in google app engine in Python
I tried his solution, but it didn't work.
The flow of my pipeline is:
Map1
Reduce1
Map2
Reduce2
I'm saving the output of Reduce1 to the blobstore under a blob_key, and then trying to access the blob from Map2. But I get the following error while executing the second map: "BadReaderParamsError: Could not find blobinfo for key <blob_key here>".
Here's the pipeline code:
class SongsPurchasedTogetherPipeline(base_handler.PipelineBase):
    def run(self, filekey, blobkey):
        bucket_name = app_identity.get_default_gcs_bucket_name()
        intermediate_output = yield mapreduce_pipeline.MapreducePipeline(
            "songs_purchased_together_intermediate",
            "main.songs_purchased_together_map1",
            "main.songs_purchased_together_reduce1",
            "mapreduce.input_readers.BlobstoreLineInputReader",
            "mapreduce.output_writers.GoogleCloudStorageOutputWriter",
            mapper_params={
                "blob_keys": blobkey,
            },
            reducer_params={
                "output_writer": {
                    "bucket_name": bucket_name,
                    "content_type": "text/plain",
                }
            },
            shards=1)
        yield StoreOutput("SongsPurchasedTogetherIntermediate", filekey, intermediate_output)
        intermediate_output_key = yield BlobKey(intermediate_output)
        output = yield mapreduce_pipeline.MapreducePipeline(
            "songs_purchased_together",
            "main.songs_purchased_together_map2",
            "main.songs_purchased_together_reduce2",
            "mapreduce.input_readers.BlobstoreLineInputReader",
            "mapreduce.output_writers.GoogleCloudStorageOutputWriter",
            mapper_params=(intermediate_output_key),
            reducer_params={
                "output_writer": {
                    "bucket_name": bucket_name,
                    "content_type": "text/plain",
                }
            },
            shards=1)
        yield StoreOutput("SongsPurchasedTogether", filekey, output)
and here's the BlobKey class which takes the intermediate output and generates the blob key for Map2 to use:
class BlobKey(base_handler.PipelineBase):
    def run(self, output):
        blobstore_filename = "/gs" + output[0]
        blobstore_gs_key = blobstore.create_gs_key(blobstore_filename)
        return {
            "blob_keys": blobstore_gs_key
        }
The StoreOutput class is the same as the one in Google's MapReduce demo https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/demo/main.py, and does the same thing as the BlobKey class, but additionally sends the blob's URL to HTML as a link.
Manually visiting the URL appname/blobstore/<blob_key> by typing it into a browser (after Reduce1 succeeds, but Map2 fails) displays the output expected from Reduce1. Why can't Map2 find the blob? Sorry, I'm a newbie to App Engine, and I'm probably going wrong somewhere because I don't fully understand blob storage.
Okay, I found out that Google has removed the BlobstoreOutputWriter from the list of standard writers in the GAE GitHub repository, which makes things a little more complicated. I had to write to Google Cloud Storage and read from there. I wrote a helper class which generates mapper parameters for the GoogleCloudStorageInputReader.
class GCSMapperParams(base_handler.PipelineBase):
    def run(self, GCSPath):
        bucket_name = app_identity.get_default_gcs_bucket_name()
        return {
            "input_reader": {
                "bucket_name": bucket_name,
                "objects": [path.split('/', 2)[2] for path in GCSPath],
            }
        }
The function takes as argument the output of one MapReduce stage which uses a GoogleCloudStorageOutputWriter, and returns a dictionary which can be assigned to mapper_params of the next MapReduce stage.
Basically, the output of the first MapReduce stage is a list containing <app_name>/<pipeline_name>/key/output-[i], with one entry per shard. In order to use a GoogleCloudStorageInputReader, the key to the data should be passed through the objects variable in the mapper_params. The key must be of the form key/output-[i], so the helper class simply removes the <app_name>/<pipeline_name>/ prefix from it.
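Used inside the pipeline from the question, the helper takes the place of the BlobKey step; a rough sketch (the input reader path is an assumption based on the standard GoogleCloudStorageInputReader, and the surrounding stage names come from the question):
# after the first stage, whose output_writer is GoogleCloudStorageOutputWriter:
mapper_params = yield GCSMapperParams(intermediate_output)
output = yield mapreduce_pipeline.MapreducePipeline(
    "songs_purchased_together",
    "main.songs_purchased_together_map2",
    "main.songs_purchased_together_reduce2",
    "mapreduce.input_readers.GoogleCloudStorageInputReader",  # assumed reader path
    "mapreduce.output_writers.GoogleCloudStorageOutputWriter",
    mapper_params=mapper_params,
    reducer_params={
        "output_writer": {
            "bucket_name": bucket_name,
            "content_type": "text/plain",
        }
    },
    shards=1)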
I'm uploading documents to Salesforce using beatbox and Python. The files attach correctly, but the data contained within them is completely corrupted.
def Send_File():
    import beatbox
    svc = beatbox.Client()  # instantiate the object
    svc.login(login1, pw1)  # login using your sf credentials
    update_dict = {
        'type': 'Attachment',
        'ParentId': accountid,
        'Name': 'untitled.txt',
        'body': '/Users/My_Files/untitled.txt',
    }
    results2 = svc.create(update_dict)
    print results2
output is:
00Pi0000005ek6gEAAtrue
So things are coming through well, but when I go to the Salesforce record 00Pi0000005ek6gEAA and view the file, its contents are:
˝KÆœ Wøä ï‡Îä˜øHÅCj÷øaÎ0j∑ø∫{b∂Wù
I have no clue what's causing the issue, and I can't find any situations where this has happened to other people.
Link to SFDC Documentation on uploads
The 'body' value in the dictionary should be the base64-encoded contents of the file, not the file name. You need to read and encode the file contents yourself, e.g.:
body = ""
with open("/Users/My_Files/untitled.txt", "rb") as f:
body = f.read().encode("base64")
update_dict = {
'type' : 'Attachement'
'ParentId' : accountId,
'Name' : 'untitled.txt',
'Body' : body }
...
Docs about Attachment
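One caveat: str.encode("base64") only exists on Python 2 (which the print statements above suggest you are using). On Python 3, an equivalent sketch with the standard base64 module would be:
import base64

with open("/Users/My_Files/untitled.txt", "rb") as f:
    body = base64.b64encode(f.read()).decode("ascii")  # base64 text suitable for the 'Body' field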