First of all, I'm new to Python, so my code may be a little weird or borderline wrong, but it works, so there is that.
I've been googling this problem but can't find anyone who writes about it. I have a huge list written like this:
1 2 3 4 5
2 2 2 2 2
3 3 3 3 3
etc. Note that the separators are spaces and not tabs, and I can't change this, since I'm working with a printout from ls-dyna.
So I am using this script to remove the whitespace before the numbers, since it has been giving me trouble when trying to format the numbers into a matrix, and then I remove the empty lines afterwards:
for line in input:
    print >> output, line.lstrip(' ')
But for some reason, I have 4442 lines (and here I mean written lines, which are easy to track since they are numbered), while the output only has 4411, so it removes 31 lines containing numbers I need.
Why is this?
lstrip() itself won't remove lines, because it is used inside the print statement, which (the way you use it) always appends a newline character. But the for line in input loop might step through the lines in an unexpected way, i.e. it could skip lines or combine them in a manner you didn't expect.
Maybe newline and carriage return characters result in this strange problem.
I suggest leaving out the .lstrip(' ') for testing and comparing the output with the input to find the places where something changes. You should probably use output.write(line) to bypass everything the print statement does automatically (especially appending newline characters).
Then use a special marker when writing (output.write('###' + line) or similar) to find out how the iteration through the input actually proceeds.
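The debugging steps above could be sketched roughly as follows. This uses Python 3 syntax (the question uses Python 2), and the sample lines, the helper names and the '###' default are illustrative assumptions; an in-memory list stands in for the real ls-dyna printout.

```python
def mark_lines(lines, marker="###"):
    """Prefix every line the iteration yields with a visible marker."""
    return [marker + line for line in lines]


def strip_leading_spaces(lines):
    """Remove leading spaces only; each line keeps its own line ending."""
    return [line.lstrip(" ") for line in lines]


# Hypothetical sample input, including a '\r\n' ending and an empty line
sample = ["   1 2 3 4 5\n", "  2 2 2 2 2\r\n", "\n", " 3 3 3 3 3\n"]

marked = mark_lines(sample)
stripped = strip_leading_spaces(sample)

# repr() makes stray characters such as '\r' visible
for line in marked:
    print(repr(line))

# Stripping leading spaces must never change the number of lines
print(len(sample), len(stripped))
```

If the two counts differ on the real data, the lines are being lost during iteration or writing, not by lstrip().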
I have the following string: "E2017010000000601". The character E is a control character, followed by the year, then the month, and the last positions hold a user code with a maximum of 7 positions. I would like to know how I can remove, in Python, the unnecessary zeros from the middle of the string.
For example, in the string "E2018090001002202", I do not need the 3 zeros between 9 and 1.
Likewise, in the string "E2017010000000601", I do not need the 7 zeros between 1 and 6.
I have over 1000 files named with this type of string, and renaming them one by one is tedious. I know that Python can rename this many files, but the code I wrote does not produce the form I explained. Any help?
This is basic string slicing as long as you are sure the structure is identical for each string.
You can use something like:
original_string = "E2017010000000601"
cut_string = str(int(original_string[7:]))
This works because you first drop the first 7 characters: the control character, the year and the month.
Then you convert the rest to an integer, which removes all the leading zeros, and convert it back to a string.
Basically the same answer as Alexis, but since I can't comment yet, in a separate answer: since you want to keep the "EYYYYMM" part of the string, the code would be:
>>> original_string = 'E2017010000000601'
>>> cut_string = original_string[:7] + str(int(original_string[7:]))
>>> cut_string
>>>cut_string
'E201701601'
A quick explanation: we know what the first seven characters of the string will be, and we want to keep those in the string. Then we add the rest of the string, turned into an integer and back into a string, so that all unnecessary zeroes in front are removed.
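Since the question mentions renaming more than 1000 files, here is a minimal sketch of how the slicing could be applied to a directory. It is written for Python 3, and it assumes that each file name (without its extension) is exactly such a string; shorten_code and rename_all are hypothetical helper names, not part of either answer.

```python
import os


def shorten_code(name, prefix_len=7):
    """Keep the first prefix_len characters, drop leading zeros from the rest."""
    prefix, user_code = name[:prefix_len], name[prefix_len:]
    return prefix + str(int(user_code))


def rename_all(directory):
    """Rename every file in directory whose stem looks like 'E' + digits."""
    for filename in sorted(os.listdir(directory)):
        stem, ext = os.path.splitext(filename)
        if not (stem.startswith("E") and stem[1:].isdecimal() and len(stem) > 7):
            continue  # skip anything that does not match the assumed structure
        new_name = shorten_code(stem) + ext
        if new_name != filename:
            os.rename(os.path.join(directory, filename),
                      os.path.join(directory, new_name))


print(shorten_code("E2017010000000601"))
print(shorten_code("E2018090001002202"))
```

Before running something like rename_all on real data, it is probably worth printing the old and new names first, because two different files could end up with the same shortened name.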
In my code I am trying to print without a newline, so that when the program exits and I run it again the output continues on the same line. I am printing with a trailing comma, for example:
type_d=dict.fromkeys(totals,[0,0])
for keysl in type_d.keys():
    print type_d[keysl][0] , ",", type_d[keysl][1] , ",",
print "HI" ,
But when the program exits and I call another one, a newline is inserted after the last value in the file. How can I avoid this?
I believe that this is not documented, but it's intentional, and related to behavior that is documented.
If you look up print in the docs:
A space is written before each object is (converted and) written, unless the output system believes it is positioned at the beginning of a line. This is the case (1) when no characters have yet been written to standard output, (2) when the last character written to standard output is a whitespace character except ' ', or (3) when the last write operation on standard output was not a print statement. (In some cases it may be functional to write an empty string to standard output for this reason.)
And Python does keep track of whether the last thing written to sys.stdout was written by print (including by print >>sys.stdout) or not (unless sys.stdout is not an actual file object). (See PyFile_SoftSpace in the C API docs.) That's how it does (3) above.
And it's also what it uses to decide whether to print a newline when closing stdout.
So, if you don't want that newline at the end, you can just do the same workaround mentioned in (3) at the end of your program:
import sys

for i in range(10):
    print i,
print 'Hi',
sys.stdout.write('')
Now, when you run it:
$ python hi.py && python hi.py
0 1 2 3 4 5 6 7 8 9 Hi0 1 2 3 4 5 6 7 8 9 Hi
If you want to see the source responsible:
The PRINT_ITEM bytecode handler in ceval is where the "soft space" flag gets set.
The code that checks and outputs a newline or not depending on the flag is Py_FlushLine (which is also used by many other parts of Python).
The code that calls that check is, I believe, handle_system_exit—but notice that the various different "Very High Level C API" functions for running Python code in the same file also do the same thing at the end.
You can try to use the code below, it will eliminate the new line:
import sys
sys.stdout.write("HI")
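For readers on Python 3, where print is a function and the soft-space bookkeeping described above no longer exists, the same goal can be reached with the end argument of print(). A minimal sketch, assuming Python 3 and a made-up type_d dictionary in place of the one from the question; format_totals is a hypothetical helper, and its comma-only formatting differs slightly from the spacing the original print statement produces:

```python
import sys

# Hypothetical stand-in for the dictionary from the question
type_d = {"a": [0, 0], "b": [1, 2]}


def format_totals(totals):
    """Join all value pairs with commas, without a trailing newline."""
    parts = []
    for key in totals:
        parts.append(str(totals[key][0]))
        parts.append(str(totals[key][1]))
    parts.append("HI")
    return ",".join(parts)


# end="" suppresses the newline that print() would otherwise append
print(format_totals(type_d), end="")
sys.stdout.flush()
```

Building the whole string first and writing it once keeps the decision about the final newline in a single place.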
I am trying to compare two lists in Python and produce two arrays that contain the matching rows and the non-matching rows, but the program prints the data in an ugly format. How can I go about cleaning it up?
If you want to read the file without the \n characters, you might consider doing the following:
lines = list1.read().splitlines()
lines2 = list2.read().splitlines()
This reads each file into a list of lines without the "\n" characters (readlines() would keep them).
Alternatively, for each line, you can do .strip("\n").
The "ugly format" might be because you are using print(match), which prints the repr() of each element of the list: something that is useful for debugging or as input back to Python, but not "nice".
If you want it printed 'nicely', you'd have to decide what format that would be and write the code for it. In the simplest case, you might do:
for i in match:
    print(i)
(Note that your original list contains \n characters; that's what iterating over an open text file gives you. They will get printed as well, together with the '\n' added by print() itself. I don't know if you want them removed or not. See the other answer for possible ways of getting rid of them.)
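Putting both answers together, a rough sketch of the comparison could look like this. It assumes Python 3, uses two small made-up lists in place of the real files, and split_matches is a hypothetical helper name; the question does not show its own comparison code, so this is only one possible reading of "matching rows".

```python
def split_matches(lines_a, lines_b):
    """Return (matching, non_matching) rows of lines_a relative to lines_b."""
    cleaned_a = [line.rstrip("\n") for line in lines_a]
    cleaned_b = {line.rstrip("\n") for line in lines_b}
    matching = [row for row in cleaned_a if row in cleaned_b]
    non_matching = [row for row in cleaned_a if row not in cleaned_b]
    return matching, non_matching


# Hypothetical data, shaped like lines read from two text files
list_a = ["alpha\n", "beta\n", "gamma\n"]
list_b = ["beta\n", "gamma\n", "delta\n"]

match, no_match = split_matches(list_a, list_b)

# One row per line, without brackets, quotes or doubled newlines
for row in match:
    print(row)
```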
I'm trying to read the contents of a 5GB file and then sort them and find duplicates. The file is basically just a list of numbers (each on a new line). There are no empty lines or any symbols other than digits. The numbers are all pretty big (at least 6 digits). I am currently using
for line in f:
    do something to line
to avoid memory problems. I am fine with using that. However, I am interested to know why readline() and readlines() didn't work for me. When I try
print f.readline(10)
the program always returns the same line no matter which number I use as a parameter. To be precise, if I do readline(0) it returns an empty line, even though the first line in the file is a big number. If I try readline(1) it returns 2, even though the number 2 is not in the file. When the parameter is >= 6, it always returns the same number: 291965.
Additionally, the readlines() method always returns the same lines no matter what the parameter is. Even if I try to print f.readlines(2), it's still giving me a list of over 1000 numbers.
I am not sure if I explained it very well. Sorry, English is not my first language. Anyway, I can make it work without the readline methods but I really want to know why they don't work as expected.
This is what the first 10 lines of the file look like:
548098
968516
853181
485102
69638
689242
319040
610615
936181
486052
I can not reproduce f.readline(1) returning 2, or f.readlines(10) returning "thousands of lines", but it seems like you misunderstood what the integer parameters to those functions do.
Those numbers do not specify the number of the line to read, but the maximum number of bytes readline will read.
>>> f = open("data.txt")
>>> f.readline(1)
'5'
>>> f.readline(100)
'48098\n'
Both commands will read from the first line, which is 548098; the first will only read 1 byte, and the second command reads the rest of the line, as there are less than 100 bytes left. If you call readline again, it will continue with the second line, etc.
Similarly, f.readlines(10) will read full lines until the total amount of bytes read is larger than the specified number:
>>> f.readlines(10)
['968516\n', '853181\n']
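The same behaviour can be tried without a 5GB file. The sketch below assumes Python 3 and uses io.StringIO as an in-memory stand-in for the opened file, filled with the first lines shown in the question; note that for text-mode files in Python 3 the size argument counts characters rather than bytes.

```python
import io

data = "548098\n968516\n853181\n485102\n69638\n"
f = io.StringIO(data)  # in-memory stand-in for open("data.txt")

first = f.readline(1)    # at most 1 character of the current line
rest = f.readline(100)   # the remainder of that same line
some = f.readlines(10)   # whole lines until the size hint is exceeded

print(repr(first), repr(rest), some)
```

To get the n-th line, the argument is of no help; iterating over the file (for example with enumerate) is the usual approach.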
I have 2 lists of strings with the same length, but when I write them to a file with each item on a separate line, the length of the list and the number of lines in the file do not match:
print len(x)
print len(y)
317858
317858
However, when I write each item in the list to a text file, the number of lines in the text file does not match the length of the list.
with open('a.txt', 'wb') as f:
    for i in x[:222500]:
        print >> f, i
in linux, wc -l a.txt gives 222499 which is right.
with open('b.txt', 'wb') as f:
    for i in y[:222500]:
        print >> f, i
in linux, wc -l b.txt gives 239610 which is wrong.
When I open b.txt with vi in the terminal, it does have 239610 lines, so I am quite confused as to why this is happening.
How can i debug this?
The only way to end up with more lines in b.txt than the number of strings written is that some of the strings in y actually contain newlines.
Here is a small example
l = ['a', 'b\nc']
print len(l)
with open('tst.txt', 'wb') as fd:
    for i in l:
        print >> fd, i
This little code will print 2 because list l contains 2 elements, but the resulting file will contain 3 lines:
a
b
c
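To check whether this is what happens with y, one could look for the offending items directly instead of counting lines in the file. A minimal sketch, assuming Python 3; the two helper names are hypothetical, and the small list from the example above stands in for y:

```python
def find_embedded_newlines(items):
    """Return the indices of all strings that contain a newline character."""
    return [index for index, text in enumerate(items) if "\n" in text]


def lines_written(items):
    """Number of lines produced when each item is written followed by a newline."""
    return sum(text.count("\n") + 1 for text in items)


sample = ["a", "b\nc"]  # same data as the small example above

print(len(sample))                     # number of list elements
print(lines_written(sample))           # number of lines the file would have
print(find_embedded_newlines(sample))  # positions of the offending strings
```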
I'm sure others will quickly point out the cause of this difference (it's related to newline characters), but since you asked 'How can I debug this?' I'd like to address that question:
Since the only difference between the passing and the failing run are the lists themselves, I'd concentrate on those. There is some difference between the lists (i.e. at least one differing list element) which triggers this. Hence, you could perform a binary search to locate the first differing list element triggering this.
To do so, just chop the lists in halves, e.g. take the first 317858/2 lines of each list. Do you still observe the same symptom? If so, repeat the exercise with that first half. Otherwise, repeat that exercise with the second half. That way, you'll need at most 19 tries to identify the line which triggers this. And at that point, the issue is simplified to a single string.
Chances are that you can spot the issue by just looking at the strings, but in principle (e.g. if the strings are very long), you could then proceed to do a binary search on those strings to identify the first character triggering this issue.
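The halving procedure described above can also be automated. The sketch below assumes Python 3 and assumes that the symptom is "more lines written than list elements"; instead of writing real files it computes the line count in memory, and all function names are hypothetical.

```python
def count_written_lines(items):
    """Lines produced when each item is written followed by a newline."""
    return sum(item.count("\n") + 1 for item in items)


def shows_symptom(items):
    """True if writing these items yields more lines than there are items."""
    return count_written_lines(items) != len(items)


def first_offending_index(items):
    """Binary search for the first element that triggers the symptom."""
    if not shows_symptom(items):
        return None
    low, high = 0, len(items)  # invariant: items[:low] is fine, items[:high] is not
    while high - low > 1:
        mid = (low + high) // 2
        if shows_symptom(items[:mid]):
            high = mid
        else:
            low = mid
    return high - 1


# Hypothetical data with one problematic string at index 2
data = ["a", "b", "c\nd", "e"]
print(first_offending_index(data))
```

This works because the symptom is monotone: once a prefix of the list contains a bad element, every longer prefix shows the symptom too.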