cURL, cat, python and missing parts from a web page - python

I have downloaded a web page (charset=iso-8859-1) using curl:
curl "webpage_URL" > site.txt
The encoding of my terminal is utf-8. Here I try to see the encoding of this file:
file -i site.txt
site.txt: regular file
Now the strange thing: if I open the file with nano I find all the words that are visible in a normal browser. But when I use:
cat site.txt
some words are missing. This made me curious, and after some hours of research I couldn't figure out why.
In Python too, it doesn't find all the words:
import re
import subprocess
from bs4 import BeautifulSoup

def function(url):
    p = subprocess.Popen(["curl", url], stdout=subprocess.PIPE)
    output, err = p.communicate()
    print output
    soup = BeautifulSoup(output)
    return soup.body.find_all(text=re.compile('common_word'))
I also tried to use urllib2 but I had no success.
What am I doing wrong?

If somebody will face the same problem:
The root of my problem was some carriage return characters (\r) present in the web page. The terminal cannot print them. This wouldn't be a big problem by itself, but every line that contains a \r is skipped entirely.
So, in order to see the content of the entire file, these characters should be made visible with the -v or -e option:
cat -v site.txt
(thanks to MendiuSolves who has suggested to use the cat command options)
In order to solve part of the Python problem, I changed the return value from soup.body.find_all(text=re.compile('common_word')) to soup.find_all(text=re.compile('common_word'))
Obviously, if the word you search for is on one of the lines containing a \r and you print it, you will not see the result. The solution is either to filter out the character or to write the content to a file.
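For the Python side, a minimal sketch of the filtering idea (the sample string below is a stand-in for the downloaded page content):

```python
# Carriage returns make the terminal overwrite the line; strip them before printing.
raw = "first part\rsecond part\r\nthird part"  # stand-in for the page text
clean = raw.replace("\r", "")
print(clean)
```
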

Related

stdout captured from command has twice as many newlines when passed to selenium

I have a bit of code from which I am trying to capture the stdout:
import subprocess

def MediaInfo():
    cmd = ['MediaInfo.exe', 'videofile.mkv']
    test = subprocess.run(cmd, capture_output=True)
    info = test.stdout.decode("utf-8")
    print(info)
When using print or writing it to file, it looks fine. But when I use selenium to fill it into a message box:
techinfo = driver.find_element(By.NAME, "techinfo").send_keys(info)
there is an additional empty line between every line. Originally I had an issue where the stdout was a byte literal. It looked like b"This is the first line.\r\nThis is the second line.\r\n" Adding .decode("utf-8") is what fixed that but I am wondering if in certain instances something is interpreting \r\n as creating two lines. I'm just not sure if it is an issue with Selenium or subprocess or something else. The webpage element Selenium is writing to doesn't seem to have an issue. It looks correct if I copy and paste it from the text file. Meaning, it's not just the way it's displayed, there are actually twice as many line feeds. Any ideas? I don't want to just loop through and delete the extra lines. Too kludgy. I'm guessing this is an issue with Python 3, from what I've read.
send_keys() sends each key individually, which means "\r\n" is sent as two key presses. Replacing "\r\n" with "\n" before sending to the element should do the trick.
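A minimal sketch of that normalisation (the sample string stands in for the decoded MediaInfo output):

```python
# Collapse Windows line endings so send_keys() presses Enter only once per line.
info = "First line.\r\nSecond line.\r\n"  # stand-in for test.stdout.decode("utf-8")
normalized = info.replace("\r\n", "\n")
# driver.find_element(By.NAME, "techinfo").send_keys(normalized)
print(normalized)
```
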

How to get data from web in python using curl?

In bash when I used
myscript.sh
file="/tmp/vipin/kk.txt"
curl -L "myabcurlx=10&id-11.com" > $file
cat $file
./myscript.sh gives me the output below:
1,2,33abc
2,54fdd,fddg3
3,fffff,gfr54
When I tried to fetch it using Python, I used the code below -
mypython.py
command = curl + ' -L ' + 'myabcurlx=10&id-11.com'
output = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE).stdout.read().decode('ascii')
print(output)
python mypython.py throws an error. Can you please point out what is wrong with my code?
Error :
/bin/sh: line 1: &id=11: command not found
Wrong Parameter
command = curl + ' -L ' + 'myabcurlx=10&id-11.com'
Print out what this string is, or just think about it. Assuming that curl is the string 'curl' or '/usr/bin/curl' or something, you get:
curl -L myabcurlx=10&id-11.com
That’s obviously not the same thing you typed at the shell. Most importantly, that last argument is not quoted, and it has a & in the middle of it, which means that what you’re actually asking it to do is to run curl in the background and then run some other program that doesn’t exist, as if you’d done this:
curl -L myabcurlx=10 &
id-11.com
Obviously you could manually include quotes in the string:
command = curl + ' -L ' + '"myabcurlx=10&id-11.com"'
… but that won’t work if the string is, say, a variable rather than a literal in your source—especially if that variable might have quote characters within it.
The shlex module has helpers for quoting things properly.
But the easiest thing to do is just not try to build a command line in the first place. You aren’t using any shell features here, so why add the extra headaches, performance costs, problems with the shell getting in the way of your output and retcode, and possible security issues for no benefit?
Make the arguments a list rather than a string:
command = [curl, '-L', 'myabcurlx=10&id-11.com']
… and leave off the shell=True
And it just works. No need to get spaces and quotes and escapes right.
Well, it still won’t work, because Popen doesn’t return output, it’s a constructor for a Popen object. But that’s a whole separate problem—which should be easy to solve if you read the docs.
But for this case, an even better solution is to use the Python bindings to libcurl instead of calling the command-line tool. Or, even better, since you’re not using any of the complicated features of curl in the first place, just use requests to make the same request. Either way, you get a response object as a Python object with useful attributes like text and headers and request.headers that you can’t get from a command line tool except by parsing its output as a giant string.
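To make the list-based call concrete, here is a sketch using subprocess.run to capture output (echo stands in for curl, since the URL in the question is a placeholder):

```python
import subprocess

# List form: no shell is involved, so the & in the argument needs no quoting.
result = subprocess.run(
    ["echo", "myabcurlx=10&id-11.com"],  # stand-in for ["curl", "-L", url]
    capture_output=True,
    text=True,
)
print(result.stdout)
```
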
import subprocess

fileName = "/tmp/vipin/kk.txt"
with open(fileName, "w") as f:
    subprocess.run(["curl", "-L", "myabcurlx=10&id-11.com"], stdout=f)
print(fileName)
recommended approaches:
https://docs.python.org/3.7/library/urllib.request.html#examples
http://docs.python-requests.org/en/master/user/install/
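As a sketch of the urllib.request route (building the request only; actually fetching it requires network access, and the host here is the question's placeholder):

```python
from urllib import parse, request

# urlencode handles the & safely; no shell quoting is ever needed.
params = parse.urlencode({"myabcurlx": "10", "id": "11"})
req = request.Request("http://example.com/?" + params)
print(req.full_url)
```
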

Safely echo python commands into file without executing them

So I have a python file with a ton of lines in it that I want to read into python then echo into another file over a socket.
Assuming I have file foo.py
import os
os.popen('some command blah')
print("some other commands, doesn't matter")
Then I try and open the file, read all the lines, and echo each line into a new file.
Something along the lines of
scriptCode = open(os.path.realpath(__file__)).readlines()
for line in scriptCode:
    connection.send("echo " + line + " >> newfile.py")
print("file transferred!")
However, when I do this, the command is executed in the remote shell.
So my question:
How do I safely echo text into a file without executing any keywords in it?
What have I tried?
Adding single quotes around line
Adding single quotes around line and then a backslash to single quotes in line
Things I've considered but haven't tried yet:
Base64 encoding the line and then decoding it on the remote machine (I don't want to do this because there's no guarantee it'll have this command)
I know this is odd. Why am I doing this?
I'm building a pentesting reverse shell handler.
shlex.quote will:
Return a shell-escaped version of the string s. The returned value is a string that can safely be used as one token in a shell command line, for cases where you cannot use a list.
Much safer than trying to quote a string by yourself.
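A minimal sketch of quoting each line before echoing it (connection.send is the question's socket; here we only build the command string):

```python
import shlex

line = "os.popen('some command blah')"  # a line read from foo.py
cmd = "echo " + shlex.quote(line) + " >> newfile.py"
# connection.send(cmd)  # the quoted line is now a single inert token
print(cmd)
```
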

Bash Variable Contains '\r' - Carriage Return Causing Problems in my Script

I have a bash script (rsync.sh) that works fine and has this line in it:
python /path/to/rsync_script.py $EMAIL "$RSYNC $PATH1 $PATH1_BACKUP"
I want to break the command (it's actually much longer than shown here because my variables have longer names) in two and use something like this:
python /path/to/rsync_script.py \
$EMAIL "$RSYNC $PATH1 $PATH1_BACKUP"
But when I do this I get the error:
scripts/rsync.sh: line 32: $'admin#mydomain.com\r': command not found
It puts the carriage return, \r in there.
How can I break this line up and not include the carriage return?
The problem looks like Windows line endings.
Here's how you can check in Python.
print(repr(open('rsync.sh', 'rb').read()))
# If you see any \r\n, it's Windows line endings
Here's how you can fix it:
text = open('rsync.sh', 'rb').read().replace(b'\r\n', b'\n')
open('rsync.sh', 'wb').write(text)
Edit
Here's some code that shows the problem.
# Python:
open('abc-n.sh', 'wb').write(b'echo abc \\' + b'\n' + b'def')
open('abc-r-n.sh', 'wb').write(b'echo abc \\' + b'\r\n' + b'def')
And then run the files we made...
$ sh abc-n.sh
abc def
$ sh abc-r-n.sh
abc
abc-r-n.sh: 2: def: not found
If you can change the Python script, it may be easier to pass it the variable names themselves instead of their content.
From within the Python code you have better and more consistent tools to deal with whitespace characters (like \r) than from within bash.
To do that, just change your .sh line to
python /path/to/rsync_script.py EMAIL "RSYNC PATH1 PATH1_BACKUP"
And on your rsync_script.py, use os.environ to read the contents of the shell variables (and clear the \r's in them) - something like:
import os, sys

paths = []
for var_name in sys.argv[2].split(" "):
    paths.append(os.environ[var_name].strip())
So I figured it out... I made a mistake in this question and I got so much awesome help but it was me doing a dumb thing that caused the problem. As I mentioned above, I may have copied and pasted from Windows at some point (I had forgotten since I did most of the edits in vim). I went back and wrote a short script with the essentials of the original in vim and then added in the '\' for line break and the script worked just fine. I feel bad accepting my own answer since it was so stupid. I made sure to up-vote everyone who helped me. Thanks again.

How can I fix encoding errors in a string in python

I have a python script as a subversion pre-commit hook, and I encounter some problems with UTF-8 encoded text in the submit messages. For example, if the input character is "å" the output is "?\195?\165". What would be the easiest way to replace those character parts with the corresponding byte values? Regexp doesn't work as I need to do processing on each element and merge them back together.
code sample:
infoCmd = ["/usr/bin/svnlook", "info", sys.argv[1], "-t", sys.argv[2]]
info = subprocess.Popen(infoCmd, stdout=subprocess.PIPE).communicate()[0]
info = info.replace("?\\195?\\166", "æ")
I do the same thing in my code, and you should be able to use:
...
u_changed_path = unicode(changed_path, 'utf-8')
...
When using the approach above, I've only run into issues with characters like line feeds and such. If you post some code, it could help.
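unicode() is Python 2; in Python 3 the equivalent is decoding the bytes returned by communicate() (a sketch, assuming svnlook emits UTF-8):

```python
# Stand-in for the bytes that communicate() returns
raw = "blåbærgrød".encode("utf-8")
info = raw.decode("utf-8")
print(info)
```
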
