Python 3 print to file creating errant new lines

I wrote code that reads URL and IP data into a dictionary, with each IP as the key and the URLs visited as the value. I am attempting to print each IP key followed by the number of URL visits for it.
The problem is that when printing to my file, a newline appears after some of the IPs.
Here is the output section of code:
for key, value in ipVisit.items():
    outputF.write(key + " " + str(len(ipVisit[key])) + '\n')
Even if I increase or decrease the number of spaces between the key and the number of visits, the third entry is always the only one that comes out on a single line. Here is the output:
194.33.212.111
28
12.65.4.100
28
205.23.104.49 31
205.23.104.49
29
Did I do something stupid with my loop? How can I fix this?

One thing I've found to be very helpful when writing to files is to ignore the write method entirely:
for key, value in ipVisit.items():
    print(key + " " + str(len(ipVisit[key])), file=outputF)
This has the possibly great side effect of writing to stdout when outputF is None, which I've taken advantage of in command-line programs in the past (passing in the output file vs. - or something).
Using print, you'll get the newline semantics that you're familiar with and the commenter's suggestion of .rstrip() will take care of any leftover errant newline characters.
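For instance, a minimal sketch of that pattern (the report function name here is illustrative, not from the question):
def report(ipVisit, outputF=None):
    # print(..., file=None) falls back to sys.stdout
    for key, value in ipVisit.items():
        print(key, len(value), file=outputF)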
EDIT: It might also be wise to avoid string building with the + operator and instead use the format method. Also, you already have the value from your for loop, so there's no need to index into ipVisit again:
for key, value in ipVisit.items():
    print('{} {}'.format(key, len(value)), file=outputF)
    # or rstrip if there's still extra newlines
    print('{} {}'.format(key.rstrip(), len(value)), file=outputF)  # this will only work if you're sure `key` is a str
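If the stray newlines are coming from the keys themselves, it can be cleaner to strip them when the dictionary is built. A minimal sketch, assuming the data was read line by line from a file (the file name and line format are illustrative):
ipVisit = {}
with open("visits.log") as f:
    for line in f:
        ip, url = line.strip().split()  # strip() drops the trailing '\n'
        ipVisit.setdefault(ip, []).append(url)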

Related

Python read strings from file, preserving variables to be printed

I am making a Python script that will choose a response at random from a list.
To fill this list I want to read strings from a file, the strings will look something like this:
"This number is " + str(num) + ", this is good"
"Oh no the number is " + str(num) +", this is good
Obviously these are read from the file as plain strings, so if I printed one of them it would come out exactly as you see it here, without the value of num substituted. Is there any way to read these strings from a file while keeping the ability to substitute variables (like a raw format), the way it would work if my code did
list.append("This number is " + str(num) + ", this is good")
The reason I want to read from a file is that I will have many different strings and they may change, so I would rather not hard-code them into the program (keep in mind the example strings are very basic).
Thanks
You could use the format specification mini-language, and then call .format on your strings before displaying them.
strings.txt:
This number is {num} this is good
Oh no the number is {num} this is good
main.py:
import random

with open("strings.txt") as file:
    possible_strings = file.read().split("\n")

number = 23
s = random.choice(possible_strings)
print(s.format(num=number))
Possible output:
This number is 23 this is good
Use something in your file to indicate that a substitution is needed, and then make those substitutions.
For example, if you need to put in the value of num, your text could use {{num}} where the substitution is needed. Then use a regex to find such substrings and replace them with the desired values, as in the sketch below.
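A minimal sketch of that regex approach (the {{name}} marker and the values dict are conventions of this answer, not a library feature):
import re

values = {"num": 23}
template = "Oh no the number is {{num}}, this is good"

# replace every {{name}} with the matching entry from `values`
result = re.sub(r"\{\{(\w+)\}\}", lambda m: str(values[m.group(1)]), template)
print(result)  # Oh no the number is 23, this is good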

Python - \n appearing in concatenated strings

I've been having an issue with my Python code. I am trying to concatenate the values of two string objects, but when I run the code it keeps printing a '\n' between the two strings.
My code:
while i < len(valList):
    curVal = valList[i]
    print(curVal)
    markupConstant = 'markup.txt'
    markupFileName = curVal + markupConstant
    markupFile = open(markupFileName)
Now when I run this, it gives me this error:
OSError: [Errno 22] Invalid argument: 'cornWhiteTrimmed\nmarkup.txt'
See that \n between the two strings? I've dissected the code a bit by printing each string individually, and neither one contains a \n on its own. Any ideas as to what I'm doing wrong?
Thanks in advance!
The concatenation itself certainly doesn't add the \n. valList is probably the result of calling readlines() on a file object, so each element in it will have a trailing \n. Call strip on each element before using it:
while i < len(valList):
    curVal = valList[i].strip()
    print(curVal)
    markupConstant = 'markup.txt'
    markupFileName = curVal + markupConstant
    markupFile = open(markupFileName)
The reason you don't see the \n when you print the strings is that \n is the newline character: printing it simply moves output to a new line, so it's invisible. But when it sits in the middle of your two strings, it breaks the file name. The solution is the strip method (documented at https://www.tutorialspoint.com/python/string_strip.htm), which you can use to remove the newline character from the ends of any of your strings.
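For example (a quick illustration; the value is taken from the error message above):
curVal = "cornWhiteTrimmed\n"
print(curVal + 'markup.txt')          # prints across two lines
print(curVal.strip() + 'markup.txt')  # cornWhiteTrimmedmarkup.txt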
Just to make an addition to the other answers explaining why this came about:
When you need to actually inspect what characters a string contains, you can't simply print it. Many characters are "invisible" when printed.
Turn the string into a list first:
list(curVal)
Or my personal favorite:
[c for c in curVal]
These will create lists that properly show all hard to see characters.

getting rid of trailing output

How do I get rid of the extra character at the end of a line when I flush output?
Output:
{Fifth Level} Last Key Ran: 7 Output: -7 =
That '=' is what I want to get rid of.
code:
for number in str(fourth_level):
    x = int(number)
    x = x ^ (priv_key - pub_key)
    print "\r{Fifth Level} Last Key Ran:", str(number), "Output:", x,
    sys.stdout.flush()
    time.sleep(sleep_time)
    fifth_level.append(x)
Also, is there any way to have multiple lines outputting data at the same time without moving down a line or changing the format? With flush, the second line's output gets wiped out.
As a side note, check the ,x, part of the print statement. That 'x' is fishy.
For string manipulations, try writing everything into a temporary string first; you can then edit that string before printing, which gives you more control.
Also, rstrip might do the trick if the characters being displayed are consistent.
Reference:
* http://docs.python.org/library/string.html
"string.rstrip(s[, chars]) Return a copy of the string with trailing characters removed."

Verify CSV against given format

I am expecting users to upload a CSV file of max size 1MB to a web form that should fit a given format similar to:
"<String>","<String>",<Int>,<Float>
That will be processed later. I would like to verify that the file fits the specified format, so that the program that later uses the file doesn't receive unexpected input and there are no security concerns (say, some injection attack against the parsing script that does some calculations and a DB insert).
(1) What would be the best way to do this that is both fast and thorough? From what I've researched, I could go the regex path or something along those lines. I've looked at the Python csv module, but that doesn't appear to have any built-in verification.
(2) Assuming I go for a regex, can anyone direct me toward the best way to do this? Do I match for illegal characters and reject on that (e.g. no '/', '\', '<', '>', '{', '}', etc.), or match on all legal ones, e.g. [a-zA-Z0-9]{1,10} for the string component? I'm not too familiar with regular expressions, so pointers or examples would be appreciated.
EDIT:
Strings should contain no commas or quotes; each would just contain a name (i.e. first name, last name). And yes, I forgot to add that they would be double-quoted.
EDIT #2:
Thanks for all the answers. Cutplace is quite interesting, but it is a standalone tool. I decided to go with pyparsing in the end because it gives more flexibility should I add more formats.
Pyparsing will process this data and is tolerant of unexpected things like spaces before and after commas, commas within quotes, etc. (the csv module is too, but regex solutions force you to add "\s*" bits all over the place).
from pyparsing import *

integer = Regex(r"-?\d+").setName("integer")
integer.setParseAction(lambda tokens: int(tokens[0]))
floatnum = Regex(r"-?\d+\.\d*").setName("float")
floatnum.setParseAction(lambda tokens: float(tokens[0]))
dblQuotedString.setParseAction(removeQuotes)
COMMA = Suppress(',')
validLine = dblQuotedString + COMMA + dblQuotedString + COMMA + \
            integer + COMMA + floatnum + LineEnd()
tests = """\
"good data","good2",100,3.14
"good data" , "good2", 100, 3.14
bad, "good","good2",100,3.14
"bad","good2",100,3
"bad","good2",100.5,3
""".splitlines()
for t in tests:
print t
try:
print validLine.parseString(t).asList()
except ParseException, pe:
print pe.markInputline('?')
print pe.msg
print
Prints
"good data","good2",100,3.14
['good data', 'good2', 100, 3.1400000000000001]
"good data" , "good2", 100, 3.14
['good data', 'good2', 100, 3.1400000000000001]
bad, "good","good2",100,3.14
?bad, "good","good2",100,3.14
Expected string enclosed in double quotes
"bad","good2",100,3
"bad","good2",100,?3
Expected float
"bad","good2",100.5,3
"bad","good2",100?.5,3
Expected ","
You will probably be stripping those quotation marks off at some future time; pyparsing can do that at parse time by adding:
dblQuotedString.setParseAction(removeQuotes)
If you want to add comment support to your input file, say a '#' followed by the rest of the line, you can do this:
comment = '#' + restOfLine
validLine.ignore(comment)
You can also add names to these fields, so that you can access them by name instead of index position (which I find gives more robust code in light of changes down the road):
validLine = dblQuotedString("key") + COMMA + dblQuotedString("title") + COMMA + \
            integer("qty") + COMMA + floatnum("price") + LineEnd()
And your post-processing code can then do this:
data = validLine.parseString(t)
print "%(key)s: %(title)s, %(qty)d in stock at $%(price).2f" % data
print data.qty*data.price
I'd vote for parsing the file, checking you've got 4 components per record, that the first two components are strings, the third is an int (checking for NaN conditions), and the fourth is a float (also checking for NaN conditions).
Python would be an excellent tool for the job.
I'm not aware of any libraries in Python to deal with validation of CSV files against a spec, but it really shouldn't be too hard to write.
import csv
import math

def check_file(filename):
    # wrapped in a function so the early returns are legal
    dataChecker = csv.reader(open(filename))
    for row in dataChecker:
        if len(row) != 4:
            print 'Invalid row length.'
            return
        my_int = int(row[2])
        my_float = float(row[3])
        if math.isnan(my_int):
            print 'Bad int found'
            return
        if math.isnan(my_float):
            print 'Bad float found'
            return
    print 'All good!'

check_file('data.csv')
Here's a small snippet I made:
import csv

f = csv.reader(open("test.csv"))
for value in f:
    value[0] = str(value[0])
    value[1] = str(value[1])
    value[2] = int(value[2])
    value[3] = float(value[3])
If you run that with a file that doesn't have the format you specified, you'll get an exception:
$ python valid.py
Traceback (most recent call last):
  File "valid.py", line 8, in <module>
    value[2] = int(value[2])
ValueError: invalid literal for int() with base 10: 'a3'
You can then put a try-except around the conversions to catch the ValueError and let the users know what they did wrong, as in the sketch below.
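A minimal sketch of that try/except (the row layout follows the question's format; the error message is illustrative):
import csv

with open("test.csv") as fh:
    for lineno, row in enumerate(csv.reader(fh), 1):
        try:
            name1, name2 = row[0], row[1]
            qty = int(row[2])
            price = float(row[3])
        except (ValueError, IndexError):
            print("Bad row %d: %r" % (lineno, row))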
There can be a lot of corner cases in parsing CSV, so you probably don't want to try doing it "by hand". At least start with a package/library built into the language you're using, even if it doesn't do all the "verification" you can think of.
Once you get there, examine the fields for your list of "illegal" characters, or check the values in each field to determine whether they're valid (if you can do so). You don't necessarily need a regex for this task, but it may be more concise that way.
You might also disallow embedded \r or \n, \0 or \t. Just loop through the fields and check them after you've loaded the data with your csv lib.
Try Cutplace. It verifies that tabular data conforms to an interface control document.
Ideally, you want your filtering to be as restrictive as possible - the fewer things you allow, the fewer potential avenues of attack. For instance, a float or int field has a very small number of characters (and very few configurations of those characters) which should actually be allowed. String filtering should ideally be restricted to only what characters people would have a reason to input - without knowing the larger context it's hard to tell you exactly which you should allow, but at a bare minimum the string match regex should require quoting of strings and disallow anything that would terminate the string early.
Keep in mind, however, that some names may contain things like single quotes ("O'Neil", for instance) or dashes, so you couldn't necessarily rule those out.
Something like...
/"[a-zA-Z' -]+"/
...would probably be ideal for double-quoted strings which are supposed to contain names. You could replace the + with a {x,y} length min/max if you wanted to enforce certain lengths as well.
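As a hedged sketch, here is that pattern checked with Python's re module (the anchors and test values are illustrative):
import re

name_re = re.compile(r"^\"[a-zA-Z' -]+\"$")

print(bool(name_re.match('"O\'Neil"')))       # True
print(bool(name_re.match('"has<brackets>"'))) # False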

Python help - Parsing Packet Logs

I'm writing a simple program that's going to parse a logfile of a packet dump from Wireshark into a more readable form. I'm doing this with Python.
Currently I'm stuck on this part:
for i in range(len(linelist)):
    if '### SERVER' in linelist[i]:
        # do server parsing stuff
        packet = linelist[i:find("\n\n", i, len(linelist))]
linelist is a list created using the readlines() method, so every line in the file is an element in the list. I'm iterating through it for all occurrences of "### SERVER", then grabbing all lines after it until the next empty line (which signifies the end of the packet). I must be doing something wrong, because not only is find() not working, but I have a feeling there's a better way to grab everything between ### SERVER and the next occurrence of a blank line.
Any ideas?
Looking at the file.readlines() doc:
file.readlines([sizehint])
Read until EOF using readline() and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.
and the file.readline() doc:
file.readline([size])
Read one entire line from the file. A trailing newline character is kept in the string (but may be absent when a file ends with an incomplete line). [6] If the size argument is present and non-negative, it is a maximum byte count (including the trailing newline) and an incomplete line may be returned. An empty string is returned only when EOF is encountered immediately.
"A trailing newline character is kept in the string" means that each line in linelist will contain at most one newline. That is why you cannot find a "\n\n" substring in any of the lines; look for a whole blank line instead (or an empty one at EOF):
if myline in ("\n", ""):
    handle_empty_line()
Note: I tried to explain the find behavior, but a pythonic solution looks very different from your code snippet. The general idea is:
inpacket = False
packets = []
for line in open("logfile"):
    if inpacket:
        content += line
        if line in ("\n", ""):  # empty line
            inpacket = False
            packets.append(content)
    elif '### SERVER' in line:
        inpacket = True
        content = line
# put here packets.append on eof if needed
This works well with an explicit iterator, also. That way, nested loops can update the iterator's state by consuming lines.
fileIter = iter(theFile)
for x in fileIter:
    if "### SERVER" in x:
        block = [x]
        for y in fileIter:
            if len(y.strip()) == 0:  # empty line
                break
            block.append(y)
        print block  # Or whatever
    # elif some other pattern:
This has the pleasant property of finding blocks that are at the tail end of the file, and don't have a blank line terminating them.
Also, this is quite easy to generalize: since there are no explicit state-change variables, you just go into another loop to soak up lines for other kinds of blocks.
The best way is to use generators. Read the presentation Generator Tricks for Systems Programmers; it's the best thing I've seen about parsing logs ;)
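A minimal generator-based sketch in that spirit (the '### SERVER' marker and blank-line terminator follow the question's format; the function name is illustrative):
def packets(lines):
    content = None
    for line in lines:
        if '### SERVER' in line:
            content = line          # start a new packet block
        elif content is not None:
            if line.strip() == "":  # a blank line ends the block
                yield content
                content = None
            else:
                content += line
    if content is not None:         # handle a block at EOF with no trailing blank line
        yield content

# usage:
# for p in packets(open("logfile")):
#     print(p)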
