Listing extractors from import.io - python

I would like to know how to get the crawling data (list of URLs manually input through the GUI) from my import.io extractors.
The API documentation is very sparse and does not specify whether the GET requests I make actually start a crawler (and consume one of my available crawler runs) or just query the results of manually launched crawlers.
I would also like to know how to obtain the connector ID. As I understand it, an extractor is nothing more than a specialized connector, but when I use the extractor_id as the connector ID when querying the API, I get an error saying the connector does not exist.
One way I thought I could list the URLs in one of my extractors is this:
https://api.import.io/store/connector/_search?_sortDirection=DESC&_default_operator=OR&_mine=true&_apikey=123...
But the only result I get is:
{ "took": 2, "timed_out": false, "hits": {
"total": 0,
"hits": [],
"max_score": 0 } }
Nevertheless, even if I got a more complete response, the example result in the documentation does not mention any kind of list or element containing the URLs I'm trying to get from my import.io account.
I am using Python to make these API calls.

The legacy API will not work for any non-legacy connectors, so you will have to use the new Web Extractor API. Unfortunately, there is no documentation for this.
Luckily, with some snooping you can find the following call, which lists the extractors associated with your API key:
https://store.import.io/store/extractor/_search?_apikey=YOUR_API_KEY
From here, check each hit and verify that its _type property is set to EXTRACTOR. This will give you access to, among other things, the GUID associated with the extractor and the name you chose for it when you created it.
You can then do the following to download the latest run from the extractor in CSV format:
https://data.import.io/extractor/{{GUID}}/csv/latest?_apikey=YOUR_API_KEY
This was found in the Integrations tab of every Web Extractor. There are other queries there as well.
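As a rough illustration, here is a minimal Python sketch of both calls using the requests library. The endpoints are the two URLs above; the exact location of the GUID and name inside each hit is an assumption, so inspect the raw JSON to confirm the field names:

import requests

API_KEY = "YOUR_API_KEY"  # your import.io API key

# 1. List the extractors associated with the API key
search = requests.get(
    "https://store.import.io/store/extractor/_search",
    params={"_apikey": API_KEY},
)
search.raise_for_status()

for hit in search.json().get("hits", {}).get("hits", []):
    # Keep only hits whose _type is EXTRACTOR
    if hit.get("_type") != "EXTRACTOR":
        continue

    # Assumption: the extractor GUID is exposed as the hit's _id;
    # print the whole hit if you need to locate the name or URL list.
    guid = hit.get("_id")
    print(guid)

    # 2. Download the latest run of this extractor as CSV
    csv_resp = requests.get(
        "https://data.import.io/extractor/{}/csv/latest".format(guid),
        params={"_apikey": API_KEY},
    )
    csv_resp.raise_for_status()
    with open("{}.csv".format(guid), "wb") as f:
        f.write(csv_resp.content)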
Hope this helps.

Related

How to create a new branch, push a text file and send merge request to a gitlab repository using Python?

I found
https://github.com/python-gitlab/python-gitlab, but I was unable to understand the examples in the doc.
You're right, there aren't many complete examples in the docs. Here's a basic answer to your question.
If you would like a complete working script, I have attached it here:
https://github.com/torpidsnake/common_scripts/blob/main/automation_to_create_push_merge_in_gitlab/usecase_gitlab_python.py
Breaking down the steps below:
Create a personal access token for yourself: follow the steps here: https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html
Create a GitLab server instance and get your project:
import gitlab

server = gitlab.Gitlab('https://gitlab.example.com', private_token=YOUR_API_TOKEN)
project = server.projects.get(PROJECT_ID)
Create a branch using:
branch = project.branches.create(
    {"branch": branch_name, "ref": project.default_branch}
)
Upload a file using:
project.files.create(
    {
        "file_path": file_name,
        "branch": branch.name,
        "content": "data to be written",
        "encoding": "text",  # or 'base64'; useful for binary files
        "author_email": AUTHOR_EMAIL,  # Optional
        "author_name": AUTHOR_NAME,  # Optional
        "commit_message": "Create file",
    }
)
Create a merge request using:
project.mergerequests.create(
    {
        "source_branch": branch.name,
        "target_branch": project.default_branch,
        "title": "merge request title",
    }
)
Looking at python-gitlab, I don't see some of the things you are looking for. In that case, I suggest you break the task apart and do the individual steps using more basic tools and libraries.
You don't need the GitLab API for the first two parts. You can use Python to run the clone, branch, edit, and commit steps with the git executable against your local disk. In some ways that is easier, since you can reproduce the calls yourself. You could use GitPython, as sketched below.
I'd recommend doing it through one of these methods instead of trying to do it via the GitLab API. It's easier to understand, debug, and investigate if you do the branch work locally (or even inside a CI job).
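For example, a minimal GitPython sketch of the local clone/branch/commit/push part (the repository URL, branch name, and file name are placeholders):

from git import Repo  # pip install GitPython

# Clone the repository into a local working directory
repo = Repo.clone_from(
    "https://gitlab.example.com/some-user/some-project.git", "work_dir"
)

# Create and check out the branch for the merge request
branch = repo.create_head("name_of_your_mr_branch")
branch.checkout()

# Add or edit a file, then stage and commit it
with open("work_dir/new_file.txt", "w") as f:
    f.write("data to be written\n")
repo.index.add(["new_file.txt"])
repo.index.commit("Create file")

# Push the new branch to the remote
repo.remote(name="origin").push(refspec="{0}:{0}".format(branch.name))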
Once you push the code up into a branch, you can use GitLab's API to create a merge request via REST (for example, with the requests library). The description for creating the MR is at https://docs.gitlab.com/ee/api/merge_requests.html#create-mr, and most of the fields are optional, so the minimum looks like:
{
    "id": "some-user%2Fsome-project",
    "source_branch": "name_of_your_mr_branch",
    "target_branch": "main",
    "title": "Automated Merge Request..."
}
This is an authenticated POST call (to create). Between those links, you should have most of what you need to do this.
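As a rough sketch with the requests library (the instance URL, project path, branch names, and token are placeholders; the endpoint and fields come from the merge request API docs linked above):

import requests

GITLAB_URL = "https://gitlab.example.com"  # your GitLab instance
PROJECT_ID = "some-user%2Fsome-project"    # URL-encoded project path or numeric ID
TOKEN = "YOUR_API_TOKEN"                   # personal access token

response = requests.post(
    "{}/api/v4/projects/{}/merge_requests".format(GITLAB_URL, PROJECT_ID),
    headers={"PRIVATE-TOKEN": TOKEN},
    json={
        "source_branch": "name_of_your_mr_branch",
        "target_branch": "main",
        "title": "Automated Merge Request...",
    },
)
response.raise_for_status()
print(response.json()["web_url"])  # link to the newly created merge request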

Convert Node.js HTTP request to Python

I have this small piece of code in Node.js, which makes an API request, and I need to convert it into Python's requests.get()....
import got from 'got'
got.get(`https://example.com/api`, {
    json: true,
    query: {
        4,
        data: `12345`
    },
})
So my python code would start like this:
import requests
requests.get('https://example.com/api')
But how can I add the parameters
    json: true,
    query: {
        4,
        data: `12345`
    },
in the Python request?
I would highly recommend looking at the available docs when trying to solve problems like this; you'll generally get the answer a lot quicker and learn a lot more. I have linked the docs within this answer to make them easier to explore in the future. I have never used the Node.js got library, but I looked at the docs to identify what each of the parameters means; the npm page has good documentation for this:
json - Sets the Content-Type header to "application/json", sets the Accept header to "application/json", and will automatically run JSON.parse(response). I am not aware of your familiarity with HTTP headers, but more information can be looked up on MDN, and a list of headers can be found in the Wikipedia article on header fields.
query - This sets the query string for the request. I assume you are familiar with this, but more information can be found in the Wikipedia article on query strings.
So, from the above it looks like you are trying to send the following request to the server:
URL (with query string): https://example.com/api?4&data=12345
Headers
Content-Type: application/json
Accept: application/json
I would recommend reading through the python requests library user guide to get a better understanding of how to use the library.
For setting custom headers, the optional "headers" parameter can be used.
For the query string, the optional "params" parameter allows for this. The only problem with params is the lack of support for a valueless key (the 4 in your example); to get around this, encoding the query string in the URL directly may be the best approach until the requests library supports this feature. I'm not sure when support will be available, but I did find a closed issue on GitHub mentioning potential support in a later version. A rough equivalent is sketched below.
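A minimal sketch of what the equivalent Python call might look like (the URL and values come from the question; the valueless 4 goes into the URL directly as the workaround described above):

import requests

# Headers that got's `json: true` option would set on the request
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
}

# The valueless key "4" goes straight into the URL, since the params
# dict cannot express a key with no value; requests appends the rest.
url = "https://example.com/api?4"

response = requests.get(url, headers=headers, params={"data": "12345"})
response.raise_for_status()
data = response.json()  # equivalent of got running JSON.parse on the body
print(data)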

Facebook Graph API: Cannot pass categories array on Python

I'm trying to use Facebook's Graph API to get all the restaurants within a certain location. I'm doing this in Python, and my URL is
url ="https://graph.facebook.com/search?type=place&center=14.6091,121.0223,50000&categories=[\"FOOD_BEVERAGE\"]&fields=name,location,category,fan_count,rating_count,overall_star_rating&limit=100&access_token=" + access
However, this is the error message I get.
{
    "error": {
        "message": "(#100) For field 'placesearch': param categories must be an array.",
        "code": 100,
        "type": "OAuthException",
        "fbtrace_id": "EGQ8YdwnzUT"
    }
}
But when I paste the URL into the Graph API Explorer (linked below), it works. I can't do this exhaustively in the explorer because I need to collect restaurant data from all the next pages. Can someone explain why this is happening and how to fix it so that I can access it through Python?
Example Link
You did not specify an API version, so this will fall back to the lowest version your app can use.
I can reproduce the error for API versions <= v2.8 in Graph API Explorer - for v2.9 and above it works.
So use https://graph.facebook.com/v2.9/search?… (or a higher version, up to you), as sketched below.
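For illustration, a minimal Python sketch using the versioned endpoint (the parameter values are copied from the question's URL; letting requests handle the URL encoding is an assumption, not something verified here):

import requests

access_token = "YOUR_ACCESS_TOKEN"  # placeholder

# Note the explicit /v2.9/ so the request does not fall back to an
# older API version where this query fails.
url = "https://graph.facebook.com/v2.9/search"

params = {
    "type": "place",
    "center": "14.6091,121.0223,50000",   # values copied from the question
    "categories": '["FOOD_BEVERAGE"]',    # JSON-encoded array
    "fields": "name,location,category,fan_count,rating_count,overall_star_rating",
    "limit": 100,
    "access_token": access_token,
}

response = requests.get(url, params=params)
print(response.json())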

Youtube API get the current search term?

I am trying to get a JSON list of videos. My problem is that I can't get the current search results: when I search on YouTube itself, I get different results than when I run my Python script. I'm asking because I noticed this isn't covered anywhere on Stack Overflow.
Code:
import json
import requests

api_key = "YOUR_API_KEY"  # assumption: defined elsewhere in the original script

def getVideo():
    parameters = {
        "part": "snippet",
        "maxResults": 5,
        "order": "relevance",
        "pageToken": "",
        "publishedAfter": None,
        "publishedBefore": None,
        "q": "",
        "key": api_key,
        "type": "video",
    }
    url = "https://www.googleapis.com/youtube/v3/search"
    parameters["q"] = "Shroud"
    page = requests.request(method="get", url=url, params=parameters)
    j_results = json.loads(page.text)
    print(page.text)
    #print(j_results)

getVideo()
I have some thoughts. I think it's because of the publishedAfter and publishedBefore variables, but I don't know how to fix it.
Best Regards
Cren1993
Search: list
Returns a collection of search results that match the query parameters specified in the API request. By default, a search result set identifies matching video, channel, and playlist resources, but you can also configure queries to only retrieve a specific type of resource.
Nowhere in there does it mention that the search results from the YouTube API will be exactly the same as the results on the YouTube website. They are different systems. The API is for use by third-party developers like yourself. The YouTube website is controlled by Google and probably contains extra search features that Google has not exposed to third-party developers, either because they can't or because they don't want to.

Search index for flat HTML pages

I'm looking to add search capability into an existing entirely static website. Likely, the new search functionality itself would need to be dynamic, as the search index would need to be updated periodically (as people make changes to the static content), and the search results will need to be dynamically produced when a user interacts with it. I'd hope to add this functionality using Python, as that's my preferred language, though am open to ideas.
The Google Web Search API won't work in this case because the content being indexed is on a private network. Django haystack won't work for this case, as that requires that the content be stored in Django models. A tool called mnoGoSearch might be an option, as I think it can spider a website like Google does, but I'm not sure how active that project is anymore; the project site seems a bit dated.
I'm curious about using tools like Solr, ElasticSearch, or Whoosh, though I believe that those tools are only the indexing engine and don't handle the parsing of search content. Does anyone have any recommendations as to how one may index static html content for retrieving as a set of search results? Thanks for reading and for any feedback you have.
With Solr, you would write code that retrieves the content to be indexed, parses out the target portions from each item, and then sends it to Solr for indexing.
You would then interact with Solr for search and have it return either the entire indexed document, an ID, or some other identifying information about the original indexed content, and use that to display results to the user.
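As a rough illustration of that workflow in Python, here is a minimal sketch that fetches a list of static pages, strips out the text with BeautifulSoup, and posts the documents to a local Solr core with pysolr. The Solr URL, core name, page URLs, and field names are all placeholders, and the Solr schema would need matching fields:

import pysolr
import requests
from bs4 import BeautifulSoup

# Placeholder Solr core; the schema is assumed to have id, title, and content fields.
solr = pysolr.Solr("http://localhost:8983/solr/static_site", always_commit=True)

# Pages to index; in practice you might walk the filesystem or follow links instead.
pages = [
    "http://intranet.example.com/index.html",
    "http://intranet.example.com/docs/page1.html",
]

docs = []
for url in pages:
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    docs.append({
        "id": url,  # use the URL as the unique key
        "title": soup.title.string if soup.title else url,
        "content": soup.get_text(separator=" ", strip=True),
    })

# Send the parsed documents to Solr for indexing.
solr.add(docs)

# Later, query Solr and display identifying information for each hit.
for result in solr.search("content:search terms here", rows=10):
    print(result["id"], result.get("title"))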
