Pythonic way of parsing possibly quoted fields - python

To do this, I'd normally write a function that pulls one field at a time from the input string, and then loop until the input string is empty.
But there must be a more pythonic way of doing it that splits everything up at once.
Fields in the input string are separated by a space, and fields that contain spaces are enclosed by quotation marks. Quoted fields do not contain quotation marks.
A real example of this format is a web server's access_log file:
216.244.66.234 - - [01/Nov/2019:19:20:07 +0000] "GET /robots.txt HTTP/1.1" 200 67 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help#moz.com)"
EDIT:
access_log was a bad choice as an example, as it contains a bracket-delimited field that contains a space.
But since there is a simple solution to my original question (shlex.split()), I'll revise this question to include processing the bracketed field too (again with no internal delimiter character).
What I'm looking for is an example of parsing a string into fields in a way other than using a function to pull one token out of the string at a time.

IIUC, you could use shlex.split:
from shlex import split
s = '216.244.66.234 - - [01/Nov/2019:19:20:07 +0000] "GET /robots.txt HTTP/1.1" 200 67 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help#moz.com)"'
for field in split(s):
    print(field)
Output
216.244.66.234
-
-
[01/Nov/2019:19:20:07
+0000]
GET /robots.txt HTTP/1.1
200
67
-
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help#moz.com)
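As the output shows, shlex.split() still breaks the bracketed timestamp in two. For the revised question, one possible approach is a single regular expression with one alternative per field style, applied with re.findall so the whole line is split at once. This is only a sketch: the name FIELD is made up for illustration, and it assumes (as the question states) that quoted and bracketed fields never contain their own delimiter.

```python
import re

s = ('216.244.66.234 - - [01/Nov/2019:19:20:07 +0000] '
     '"GET /robots.txt HTTP/1.1" 200 67 "-" '
     '"Mozilla/5.0 (compatible; DotBot/1.1; '
     'http://www.opensiteexplorer.org/dotbot, help#moz.com)"')

# One alternative per field style: [bracketed], "quoted", or a bare token.
# The bare-token alternative comes last so the delimited forms are tried first.
FIELD = re.compile(r'\[([^\]]*)\]|"([^"]*)"|(\S+)')

# findall() returns one tuple of three groups per field; at most one group
# is non-empty, so joining the tuple yields the field without its delimiters.
fields = [''.join(groups) for groups in FIELD.findall(s)]

for field in fields:
    print(field)
```

With this pattern the timestamp comes back as the single field 01/Nov/2019:19:20:07 +0000, and the line yields nine fields in total.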


Ruby net/http GET requests with empty body [duplicate]

This question already has answers here:
Ruby - net/http - following redirects
I'm currently simply trying to get a simple GET request working in Ruby, however, I'm seeing some strange behavior.
I have an Open Web Analytics application running with Docker and it is reachable at http://127.0.0.1:8080/.
I can reach the login site and everything works fine.
Now I want to make a GET request with Ruby to analyze the response body, but I cannot get it to work. In other languages like Python, or with a simple GET request from the terminal, it works fine. Why not with Ruby?
Here is my very basic Ruby code:
require 'net/http'
url = 'http://127.0.0.1:8080/'
uri = URI(url)
session = Net::HTTP.new(uri.host, uri.port)
response = session.get(uri.request_uri)
puts response.body
Which doesn't output anything. If I look into the NGINX logs from the container, I can see the request being made but there is no further redirection as with the other methods (see below).
172.23.0.1 - - [02/Feb/2023:20:02:59 +0000] "GET / HTTP/1.1" 302 5 "-" "Ruby" "-" 0.088 0.088 . -
If I do a simple GET over the terminal, it works:
GET http://127.0.0.1:8080/
will output the correct body, and in the NGINX logs I can see the following:
172.23.0.1 - - [02/Feb/2023:20:20:10 +0000] "GET / HTTP/1.1" 302 5 "-" "lwp-request/6.61 libwww-perl/6.61" "-" 0.086 0.088 . -
172.23.0.1 - - [02/Feb/2023:20:20:10 +0000] "GET /index.php?owa_do=base.loginForm&owa_go=http%3A%2F%2F127.0.0.1%3A8080%2F& HTTP/1.1" 200 3200 "-" "lwp-request/6.61 libwww-perl/6.61" "-" 0.086 0.088 . -
Doing it in Python with the following basic code also works and gives similar results as with the terminal GET version:
import requests
x = requests.get("http://127.0.0.1:8080/")
print(x.content)
What am I doing wrong?
Got it working by following redirects (see the linked question above):
begin
  response = Net::HTTP.get_response(URI.parse(url))
  url = response['location']
end while response.is_a?(Net::HTTPRedirection)

Python get url from string(regex)

What I am trying to do is extract all URLs from a list of HTTP requests. They should be stripped of the protocol, the parameters, and the slash at the end of the path (if it exists). So for example:
10.4.180.222 [5/Feb/2018:08:03:40 +0100] "GET http://somewebsite.com/ HTTP/1.1" 200 1080
10.4.180.222 [5/Feb/2018:08:03:11 +0100] "GET http://www.somewebsite.cc/somesubdomain/ HTTP/1.1" 200 3056
10.4.180.222 [5/Feb/2018:08:03:11 +0100] "GET https://www.somewebsite.ua HTTP/1.1" 200 3056
Should be:
somewebsite.com
www.somewebsite.cc/somepath
www.somewebsite.ua
I've tried to do this in two steps, without using any sophisticated regex (just a general one for any URL):
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', file.read())
And then using urlparse.
domain = '{url.netloc}{url.path}'.format(url=urlparse(url))
It works almost fine. However I am getting path ending with slash.
www.somewebsite.cc/somepath/
So I've decided to use a regex. However, I only know the basics, so I can't come up with anything that works well. Right now I have something like this, but it doesn't cover the trailing "/" or different protocols :/
((?:www\.+)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-]*))
Thank you for any advice :)
If the end slash is your only problem, this is the solution.
urls = [x.rstrip('/') for x in re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', file.read())]
In other words, just do
urls = [ x.rstrip('/') for x in < your regex goes here > ].
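Putting the two steps from the question together (find the URL, then reduce it with urlparse) and adding the rstrip('/') fix might look like the sketch below. It uses Python 3's urllib.parse; the names REQUEST_URL and clean are invented for illustration, and the pattern assumes the request line always has the form "METHOD url HTTP/x.y" as in the sample log.

```python
import re
from urllib.parse import urlparse

log = '''10.4.180.222 [5/Feb/2018:08:03:40 +0100] "GET http://somewebsite.com/ HTTP/1.1" 200 1080
10.4.180.222 [5/Feb/2018:08:03:11 +0100] "GET http://www.somewebsite.cc/somesubdomain/ HTTP/1.1" 200 3056
10.4.180.222 [5/Feb/2018:08:03:11 +0100] "GET https://www.somewebsite.ua HTTP/1.1" 200 3056'''

# Capture the URL that sits between the HTTP method and the protocol version.
REQUEST_URL = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')

def clean(url):
    """Drop the scheme, query and fragment, then any trailing slash."""
    parts = urlparse(url)
    return (parts.netloc + parts.path).rstrip('/')

urls = [clean(u) for u in REQUEST_URL.findall(log)]
print(urls)
```

Because urlparse keeps the query string and fragment out of netloc and path, only the trailing slash has to be removed by hand.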

404 while connecting to /hello/1 but 200 while connecting to any other number such as /hello/12 in flask

Trying to learn flask but stuck with some error or maybe an issue.
from flask import Flask
app = Flask(__name__)

def check_int(no):
    return "number is %d" % no
app.add_url_rule('/hello/<int:no>', 'nothign_specific', check_int)
When I do a curl call to http://127.0.0.1:5000/hello/1 it fails, whereas the same curl call with any number other than 1 passes.
http://127.0.0.1:5000/hello/<any number apart from 1 passes>
127.0.0.1 - - [05/Aug/2016 14:17:48] "GET /hello/1/ HTTP/1.1" 404 -
127.0.0.1 - - [05/Aug/2016 14:18:01] "GET /hello/12 HTTP/1.1" 200 -
Can someone let me know what's happening?
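In Flask, it matters whether your route (or rule) definition has a trailing slash. Your log shows that the failing request was actually /hello/1/ (with a trailing slash), while your rule is defined without one. If you add a trailing / to your URL rule, i.e.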
'/hello/<int:no>/'
then you would be able to use both (request with or without /).
According to the Flask docs, a route with a trailing slash is treated like a folder name in a file system: if it is accessed without the slash, Flask will recognize it and redirect you to the one with the slash. In contrast, a route that is defined without a trailing slash is treated like the pathname of a file, i.e. it will return 404 when accessed with a trailing slash.
Read more: http://flask.pocoo.org/docs/0.11/quickstart/, section "Unique URLs / Redirection Behavior"
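The behaviour can be checked without curl by using Flask's built-in test client. The sketch below registers the same view under two rules; the /file and /folder paths and the endpoint names are made up for the comparison. The exact redirect status depends on the installed Flask/Werkzeug version (301 in older releases, 308 in newer ones), so the code only prints it.

```python
from flask import Flask

app = Flask(__name__)

def check_int(no):
    return "number is %d" % no

# Hypothetical rules for comparison: one without and one with a trailing slash.
app.add_url_rule('/file/<int:no>', 'file_style', check_int)
app.add_url_rule('/folder/<int:no>/', 'folder_style', check_int)

client = app.test_client()

# Rule defined without a trailing slash: the slash form is not found.
file_plain = client.get('/file/1').status_code
file_slash = client.get('/file/1/').status_code

# Rule defined with a trailing slash: the bare form is redirected to the slash form.
folder_plain = client.get('/folder/1').status_code
folder_slash = client.get('/folder/1/').status_code

print(file_plain, file_slash, folder_plain, folder_slash)
```

Here /file/1/ plays the role of the failing /hello/1/ request from the question.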

Django URL dispatcher not matching named group

I'm trying to make a Django site, but the group matching in the URL dispatcher is giving me "p" no matter what I enter into the URL. Here are the pertinent parts of my code:
From user's urls.py (it does get included in the main urls.py)
url(r'^lookup?(?P<match_str>\w+)/$', views.lookup, name='user_lookup')
From views.py
def lookup(request, match_str):
    users = User.objects.filter(name__contains=match_str)
    json = serializers.serialize("json", users)
    return json
And a couple log entries:
[01/Jul/2014 22:43:17] "GET /user/lookup/?z HTTP/1.1" 500 11363
[01/Jul/2014 22:43:18] "GET /user/lookup/?za HTTP/1.1" 500 11363
On closer inspection, it looks like my AJAX is actually sending two calls, and the second call is actually what's being matched. The logs for the second calls of the above log lines are:
[01/Jul/2014 22:43:17] "GET /merchant/lookup?z HTTP/1.1" 301 0
[01/Jul/2014 22:43:18] "GET /merchant/lookup?za HTTP/1.1" 301 0
I put a "debug" line in the view to print match_str, and no matter what I put in the URL, I get 'p'. What is going on here?
Per karthikr's request, here's the result of print request.GET, match_str
<QueryDict: {u'za': [u'']}> p
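Your regex doesn't match the URL the way you expect. The GET goes to /user/lookup/, and the user/ prefix is consumed by the including urls.py, so this pattern is matched against lookup/. In lookup?, the unescaped ? makes the preceding p optional instead of matching a literal question mark, so \w+ captures just the p and match_str is always p. Also note that Django matches URL patterns against the path only: the query string (?za) is never part of the match and is available through request.GET instead, as your QueryDict output shows. If you want the value in the path, change your regex to something like ^lookup/(?P<match_str>\w+)/$; the request lookup/someuser/ then creates a named group match_str with the value someuser.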
I recommend using one of the many online regex testers to play with the URL regex.
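If you would rather test locally, the standard re module is enough to see where the 'p' comes from. This is only a sketch: it assumes the pattern is applied to lookup/, i.e. the path left after the including URLconf has stripped the user/ prefix and without the query string, which Django does not pass to the URL resolver. The second pattern is one hypothetical alternative that carries the search term in the path.

```python
import re

# The pattern from the question. "lookup?" means "looku" plus an optional "p".
original = re.compile(r'^lookup?(?P<match_str>\w+)/$')

# Path as the resolver sees it: no "user/" prefix and no query string.
path = 'lookup/'

# The optional "p" is skipped so that \w+ has something to match: the "p" itself.
captured = original.search(path).group('match_str')
print(captured)

# Hypothetical alternative: put the search term in the path instead.
in_path = re.compile(r'^lookup/(?P<match_str>\w+)/$')
fixed = in_path.search('lookup/someuser/').group('match_str')
print(fixed)
```

The first print shows p regardless of what follows the question mark in the request, which matches the behaviour described in the question.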

Get more information from django/python or server log

I'm trying to debug a "POST" request error but I do not have enough information. Thus I need help to figure out more.
I get the following error in my tail -a. This is the only thing displayed, both by tail and inside the log itself. I assume that tail does not have a -v option for verbose output.
==> python/logs/access_log-20131102-000000-EST <==
85.75.241.1 - - [02/Nov/2013:09:09:47 -0400] "POST /dajaxice/async.store_event/ HTTP/1.1" 500 16516 "http://example.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36"
(I replaced the original domain with example.com above.)
Where should I search in the log files to get additional information about this 500 error? Can I force Python to tell me more?
On my local server I get the following, which does not tell me anything in particular either.
[02/Nov/2013 14:22:15] "POST /dajaxice/async.store_event/ HTTP/1.1" 200 24
Finally, do the numbers 16516 and 24 in 500 16516 and 200 24 respectively tell me anything in particular? I know that 500 and 200 are the HTTP status codes, but what are the others?
You're looking in the access log. Errors, not surprisingly, are logged in the error log - you should look there for more detail.
(The second value is the number of bytes in the response.)
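To make the two numbers concrete, here is a small sketch that pulls the status code and the response size out of the access log line from the question. The pattern and the name ENTRY are illustrative only and assume the request is the first quoted field on the line.

```python
import re

line = ('85.75.241.1 - - [02/Nov/2013:09:09:47 -0400] '
        '"POST /dajaxice/async.store_event/ HTTP/1.1" 500 16516 '
        '"http://example.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) '
        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36"')

# Quoted request line, then the HTTP status, then the response size in bytes
# (some servers log "-" when no body was sent).
ENTRY = re.compile(r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)')

entry = ENTRY.search(line).groupdict()
print(entry['status'])  # HTTP status code
print(entry['size'])    # number of bytes in the response
```

For this line the status is 500 and the size is 16516 bytes, which is most likely the size of the error page that was returned.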
I know one should not use DEBUG = True on a live server, but if it is the only way to hunt down a bug, you should consider switching it on for a few minutes to get more information.
Furthermore, the django debug-toolbar can be of help, e.g. it can display additional logging messages which are not written to a file but emitted using
import logging
logger = logging.getLogger(__name__) # Get an instance of a logger __name__ will be your app name
and use e.g.
logger.debug(str(form.cleaned_data))
Or you write your own logger:
"""Logger to File"""
import logging.handlers
import os
# MEDIA_ROOT is assumed to come from your Django settings module
file_logger = logging.getLogger("file_logger")
file_logger.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(message)s')
handler = logging.handlers.RotatingFileHandler(os.path.join(MEDIA_ROOT, "log", "filelogger_log.log"), maxBytes=10000000, backupCount=5)
handler.setFormatter(formatter)
file_logger.addHandler(handler)
