Why is set not calculating my unique integers? - python

I just started teaching myself Python last night via Python documentation, tutorials and SO questions.
So far I can ask a user for a file, open and read the file, remove all # and beginning \n in the file, read each line into an array, and count the number of integers per line.
I want to calculate the number of unique integers per line. I realized that Python uses a set capability which I thought would work perfectly for this calculation. However, I always receive the value of one greater than the prior value (I will show you). I looked at other SO posts related to sets and do not see what I am not missing and have been stumped for a while.
Here is the code:
with open(filename, 'r') as file:
for line in file:
if line.strip() and not line.startswith("#"):
#calculate the number of integers per line
names_list.append(line)
#print "There are ", len(line.split()), " numbers on this line"
#print names_list
#calculate the number of unique integers
myset = set(names_list)
print myset
myset_count = len(myset)
print "unique:",myset_count
For further explanation:
names_list is:
['1 2 3 4 5 6 5 4 5\n', '14 62 48 14\n', '1 3 5 7 9\n', '123 456 789 1234 5678\n', '34 34 34 34 34\n', '1\n', '1 2 2 2 2 2 3 3 4 4 4 4 5 5 6 7 7 7 1 1\n']
and my_set is:
set(['1 2 3 4 5 6 5 4 5\n', '1 3 5 7 9\n', '34 34 34 34 34\n', '14 62 48 14\n', '1\n', '1 2 2 2 2 2 3 3 4 4 4 4 5 5 6 7 7 7 1 1\n', '123 456 789 1234 5678\n'])
The output I receive is:
unique: 1
unique: 2
unique: 3
unique: 4
unique: 5
unique: 6
unique: 7
The output that should occur is:
unique: 6
unique: 3
unique: 5
unique: 5
unique: 1
unique: 1
unique: 7
Any suggestions as to why my set per line is not calculating the correct number of unique integers per line? I would also like any suggestions on how to improve my code in general (if you would like) because I just started learning Python by myself last night and would love tips. Thank you.

The problem is that as you are iterating over your file you are appending each line to the list names_list. After that, you build a set out of these lines. Your text file does not seem to have any duplicate lines, so printing the length of your set just displays the current number of lines you have processed.
Here's a commented fix:
with open(filename, 'r') as file:
for line in file:
if line.strip() and not line.startswith("#"):
numbers = line.split() # splits the string by whitespace and gives you a list
unique_numbers = set(numbers) # builds a set of the strings in numbers
print(len(unique_numbers)) # prints number of items in the set
Note that we are using the currently processed line and build a set from it (after splitting the line). Your original code stores all lines and then builds a set from the lines in each loop.

myset = set(names_list)
should be
myset = set(line.split())

Related

Edit columns in a text file

I have a text file containing 4 columns. I need to remove the first two columns and replace them with one column. The value which I should put as the new column is being produced in a loop. Here is something I am trying to do.
The input is like this:
1 2 3 4
5 6 7 8
9 1 2 3
The output should be like this:
d 3 4
d 7 8
d 2 3
but "d" is a variable that is being produced in a loop for each line.
with open('EQ.txt','r') as f:
i = 0
for line in f
...
...
d=r+d
with open(c.txt, "w") as wrt:
new_line = d\n.format(line[2], line[3])
wrt.write(new_line)
What you want is:
new_line = "%d %d %d\n".format(d, line[2], line[3])
The format string has to be in quotes, with %d formats to specify that you want decimal numbers there. Then you list all three values in the argument list.

Splitting a string into a list of integers with Python

This method inputs a file and the directory of the file. It contains a matrix of data, and needs to copy the first 20 columns of each row after the given row name and the corresponding letter for the row. The first 3 lines of each file is skipped because it has unimportant information that is not needed, and it also doesn't need the data at the bottom of the file.
For example a file would look like:
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)
-blank line
unimportant information--------
unimportant information--------
The output of the method needs to print out a "matrix" in some given form.
So far the output gives a list of each row as a string, however I'm trying to figure out the best way to approach the problem. I don't know how to ignore the unimportant information at the end of the files. I don't know how to only retrieve the first 20 columns after the letter in each row, and I don't know how to ignore the row number and the row letter.
def pssmMatrix(self,ipFileName,directory):
dir = directory
filename = ipFileName
my_lst = []
#takes every file in fasta folder and put in files list
for f in os.listdir(dir):
#splits the file name into file name and its extension
file, file_ext = os.path.splitext(f)
if file == ipFileName:
with open(os.path.join(dir,f)) as file_object:
for _ in range(3):
next(file_object)
for line in file_object:
my_lst.append(' '.join(line.strip().split()))
return my_lst
Expected results:
['-1 2 -3 4 5 6 7'], ['3 -1 3 4 0 -2 1'], ['3 -1 3 6 0 -2 5']
Actual results:
['1 F -1 2 -3 4 5 6 7'], ['2 L 3 -1 3 4 0 -2 1'], ['3 A 3 -1 3 6 0 -2 5'], [' '], [' unimportant info'], ['unimportant info']
Try this solution.
import re
reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')
text = """
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)"""
ignore_start = 5 # 0,1,2,3 = 4
expected_array = []
for index, line in enumerate(text.splitlines()):
if(index >= ignore_start):
if reg.search(line):
result = reg.search(line).group(0).strip()
# Use Result
expected_array.append(' '.join(result))
print(expected_array)
# Result: [
#'- 1 2 - 3 4 5 6 7',
#'3 - 1 3 4 0 - 2 1',
#'3 - 1 3 6 0 - 2 5'
#]
Ok so it looks to me like you have a file with certain lines that you want and the lines that you want always start with a number followed by a letter. So what we can do is apply a regular expression to this that only gets lines that match that pattern and only get the numbers after the pattern
The expression for this would look like (?<=[0-9]\s[A-Z]\s)[0-9\-\s]+
import re
reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')
for line in file:
if reg.search(line):
result = reg.search(test).group(0)
# Use Result
my_lst.append(' '.join(result))
Hope that helps

reading a file that is detected as being one column

I have a file full of numbers in the form;
010101228522 0 31010 3 3 7 7 43 0 2 4 4 2 2 3 3 20.00 89165.30
01010222852313 3 0 0 7 31027 63 5 2 0 0 3 2 4 12 40.10 94170.20
0101032285242337232323 7 710153 9 22 9 9 9 3 3 4 80.52 88164.20
0101042285252313302330302323197 9 5 15 9 15 15 9 9 110.63 98168.80
01010522852617 7 7 3 7 31330 87 6 3 3 2 3 2 5 15 50.21110170.50
...
...
I am trying to read this file but I am not sure how to go about it, when I use the built in function open and loadtxt from numpy and i even tried converting to pandas but the file is read as one column, that is, its shape is (364 x 1) but I want it to separate the numbers to columns and the blank spaces to be replaced by zeros, any help would be appreciated. NOTE, some places there are two spaces following each other
If the columns content type is a string have you tried using str.split() This will turn the string into an array, then you have each number split up by each gap. You could then use a for loop for the amount of objects in the mentioned array to create a table out of it, not quite sure this has answered the question, sorry if not.
str.split():
So I finally solved my problem, I actually had to strip the lines and then read each "letter" from the line, in my case I am picking individual numbers from the stripped line and then appending them to an array. Here is the code for my solution;
arr = []
with open('Kp2001', 'r') as f:
for ii, line in enumerate(f):
arr.append([]) #Creates an n-d array
cnt = line.strip() #Strip the lines
for letter in cnt: #Get each 'letter' from the line, in my case it's the individual numbers
arr[ii].append(letter) #Append them individually so python does not read them as one string
df = pd.DataFrame(arr) #Then converting to DataFrame gives proper columns and actually keeps the spaces to their respectful columns
df2 = df.replace(' ', 0) #Replace the spaces with what you will

Finding the most frequent items in a dataset

I am working with a big dataset and thus I only want to use the items that are most frequent.
Simple example of a dataset:
1 2 3 4 5 6 7
1 2
3 4 5
4 5
4
8 9 10 11 12 13 14
15 16 17 18 19 20
4 has 4 occurrences,
1 has 2 occurrences,
2 has 2 occurrences,
5 has 2 occurrences,
I want to be able to generate a new dataset just with the most frequent items, in this case the 4 most common:
The wanted result:
1 2 3 4 5
1 2
3 4 5
4 5
4
I am finding the first 50 most common items, but I am failing to print them out in a correct way. (my output is resulting in the same dataset)
Here is my code:
from collections import Counter
with open('dataset.dat', 'r') as f:
lines = []
for line in f:
lines.append(line.split())
c = Counter(sum(lines, []))
p = c.most_common(50);
with open('dataset-mostcommon.txt', 'w') as output:
..............
Can someone please help me on how I can achieve it?
You have to iterate again the dataset and, for each line, show only those who are int the most common data set.
If the input lines are sorted, you may just do a set intersection and print those in sorted order. If it is not, iterate your line data and check each item
for line in dataset:
for element in line.split()
if element in most_common_elements:
print(element, end=' ')
print()
PS: For Python 2, add from __future__ import print_function on top of your script
According to the documentation, c.most-common returns a list of tuples, you can get the desired output as follow:
with open('dataset-mostcommon.txt', 'w') as output:
for item, occurence in p:
output.writelines("%d has %d occurrences,\n"%(item, occurence))

Python and CodeEval: Why isn't my solution to this Multiplication Table challenge not correct?

So I've been digging into some of the challenges at CodeEval during the last couple of days. One of them asks you to print a pretty-looking multiplication table going up to 12*12. While on my computer this solution works properly, the site does not accept it.
Can somebody help me out?
for i in range(1,13):
for j in range(1,13):
if j==1:
print i*j,
elif i>=10 and j==2:
print '{0:3}'.format(i*j),
else:
print '{0:4}'.format(i*j),
print
Here's a description of the challenge:
Print out the table in a matrix like fashion, each number formatted to a width of 4 (The numbers are right-aligned and strip out leading/trailing spaces on each line).
Print out the table in a matrix like fashion, each number formatted to a width of 4
(The numbers are right-aligned and strip out leading/trailing spaces on each line).
I think that following code may pass, according to the given rules. But it's only a guess
as I'm not playing on codeeval.
# python3 code, use from __future__ import print_function if you're using python2
for i in range(1,13):
for j in range(1,13):
print('{:4}'.format(i*j), end="")
print ("")
I guess your problem is that you're having a 5 alignment because you forgot that the print foo,
syntax in python 2 is adding a space. Also they say width of 4, whereas you tried to be smart, and
align accordingly to the size of the value.
edit: here's more details about what's wrong in your code, let's take two lines of your output:
1 2 3 4 5 6 7 8 9 10 11 12
10 20 30 40 50 60 70 80 90 100 110 120
AA␣BBB␣CCCC␣CCCC␣CCCC␣CCCC␣CCCC␣CCCC␣CCCC␣CCCC␣CCCC␣CCCC␣
so what your code is doing is:
if j==1: print i*j,
outputs the AA␣,
elif i>=10 and j==2:
print '{0:3}'.format(i*j),
outputs the BBB␣ and
else:
print '{0:4}'.format(i*j),
outputs the CCCC␣.
The ␣ is an extra space added by python because of the use of the , at the end of the print, to get that, see that snippet:
>>> def foo():
... print 1234,
... print 1234,
... print
...
>>> foo()
1234 1234
The format '{0:3}' right aligns the integer to the right as expected, but only with a padding of three, whereas the '{0:4}' does it according to the rules. But as you're adding an extra space at the end, it's not working as you're writing it. That's why it's better to use python3's print expression that makes it simpler, using the end keyword.
Print out the table in a matrix like fashion, each number formatted to a width of 4 (The numbers are right-aligned and strip out leading/trailing spaces on each line).
However, the output fails to meet this requirement which is made more visible by timing bars.
1 2 3 4 5 6 7 8 9 10 11 12^N
2 4 6 8 10 12 14 16 18 20 22 24^N
3 6 9 12 15 18 21 24 27 30 33 36^N
||||----||||----||||----||||----||||----||||----||||----
This correct padding for a "width of 4, right-aligned" should look like
1 2 3 4 5 6 7 8 9 10 11 12^N
||||++++||||++++||||++++||||++++||||++++||||++++
And then with "strip out leading/trailing spaces on each line" the correct result is
1 2 3 4 5 6 7 8 9 10 11 12^N
|++++||||++++||||++++||||++++||||++++||||++++
While the current incorrect output, with the timing bars for correct output looks like
1 2 3 4 5 6 7 8 9 10 11 12^N
|++++||||++++||||++++||||++++||||++++||||++++???????????

Categories

Resources