Merging 2 lists and removing duplicate entries - Python

I have a piece of code that loads up 2 lists with this code:
with open('blacklists.bls', 'r') as f:
    L = [dnsbls.strip() for dnsbls in f]
with open('ignore.bls', 'r') as f2:
    L2 = [ignbls.strip() for ignbls in f2]
dnsbls contains:
list1
list2
list3
ignbls contains:
list2
What I want to do is merge dnsbls and ignbls, remove any line that appears more than once (i.e. appears in both lists), and then print the result with a for loop. I was thinking something like:
for combinedlist in L3:
    print combinedlist
Which in the above example would print out:
list1
list3

You need to use sets instead of lists:
L3 = list(set(L).difference(L2))
Demonstration:
>>> L=['list1','list2','list3']
>>> L2=['list2']
>>> set(L).difference(L2)
set(['list1', 'list3'])
>>> list(set(L).difference(L2))
['list1', 'list3']
For your purposes you probably don't have to convert it back to a list again, you can iterate over the resulting set just fine.
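Putting the pieces together, a minimal end-to-end sketch using the file names from the question (untested):
with open('blacklists.bls', 'r') as f:
    L = [line.strip() for line in f]
with open('ignore.bls', 'r') as f2:
    L2 = [line.strip() for line in f2]

# Every blacklist entry that is not in the ignore list
for entry in set(L).difference(L2):
    print(entry)
Note that a set does not preserve the file order; wrap the result in sorted(...) if the output order matters.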

If the ignore list is smaller than the blacklist (which is normally the case, I think), build a set from the ignores and filter the blacklist against it (untested):
with open('blacklists.bls') as bl, open('ignore.bls') as ig:
    ignores = set(line.strip() for line in ig)
    res = [line.strip() for line in bl if line.strip() not in ignores]


Remove duplicates from large list but remove both if it does exist?

So I have a text file like this
123
1234
123
1234
12345
123456
You can see 123 appears twice, so both instances should be removed, but 12345 appears only once so it stays. My text file is about 70,000 lines.
Here is what I came up with.
file = open("test.txt",'r')
lines = file.read().splitlines() #to ignore the '\n' and turn to list structure
for appId in lines:
if(lines.count(appId) > 1): #if element count is not unique remove both elements
lines.remove(appId) #first instance removed
lines.remove(appId) #second instance removed
writeFile = open("duplicatesRemoved.txt",'a') #output the left over unique elements to file
for element in lines:
writeFile.write(element + "\n")
When I run this I feel like my logic is correct, but I know for a fact the output is supposed to be around 950, and I'm still getting 23,000 elements in my output, so a lot is not getting removed. Any ideas where the bug could reside?
Edit: I forgot to mention: an element can only appear twice at most.
Use Counter from the built-in collections module:
In [1]: from collections import Counter
In [2]: a = [123, 1234, 123, 1234, 12345, 123456]
In [3]: a = Counter(a)
In [4]: a
Out[4]: Counter({123: 2, 1234: 2, 12345: 1, 123456: 1})
In [5]: a = [k for k, v in a.items() if v == 1]
In [6]: a
Out[6]: [12345, 123456]
For your particular problem I would do it like this:
from collections import defaultdict

out = defaultdict(int)
with open('input.txt') as f:
    for line in f:
        out[line.strip()] += 1

with open('out.txt', 'w') as f:
    for k, v in out.items():
        if v == 1: #here you use logic suitable for what you want
            f.write(k + '\n')
Be careful about removing elements from a list while still iterating over that list. This changes the behavior of the list iterator, and can make it skip over elements, which may be part of your problem.
Instead, I suggest creating a filtered copy of the list using a list comprehension - instead of removing the elements that appear more than once, you keep only the elements that appear exactly once:
file = open("test.txt",'r')
lines = file.read().splitlines()
unique_lines = [line for line in lines if lines.count(line) <= 2] # if it appears twice or less
with open("duplicatesRemoved.txt", "w") as writefile:
writefile.writelines(unique_lines)
You could also easily modify this code to use a different threshold, for example keeping elements that appear at most twice (if lines.count(line) <= 2).
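One caveat worth noting: lines.count(line) inside the comprehension rescans the whole 70,000-line list for every element, which is quadratic. Counting once with Counter (as in the first answer) keeps it to a single pass; a sketch:
from collections import Counter

with open("test.txt") as f:
    lines = f.read().splitlines()

counts = Counter(lines)
unique_lines = [line for line in lines if counts[line] == 1]  # keep lines that appear exactly once

with open("duplicatesRemoved.txt", "w") as writefile:
    writefile.write("\n".join(unique_lines) + "\n")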
You can count all of the elements and store them in a dictionary:
dic = {a:lines.count(a) for a in lines}
Then remove all duplicated ones from the list:
for k in dic:
    if dic[k] > 1:
        while k in lines:
            lines.remove(k)
NOTE: The while loop is needed here because lines.remove(k) removes only the first occurrence of k from the list, so it must be repeated until there is no k value left in the list.
If that for loop seems too complicated, you can use the dictionary another way to get rid of the duplicated values:
lines = [k for k, v in dic.items() if v==1]
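As an aside (just a sketch of an alternative), Counter builds the same count mapping as the dictionary comprehension above, but in a single pass instead of one lines.count() scan per element:
from collections import Counter

dic = Counter(lines)  # equivalent to {a: lines.count(a) for a in lines}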

Creating lists from rows with different lengths in python

I am trying to create, in Python, a list for each column of my data, which looks like this:
399.75833 561.572000000 399.75833 561.572000000 a_Fe I 399.73920 nm
399.78316 523.227000000 399.78316 523.227000000
399.80799 455.923000000 399.80799 455.923000000 a_Fe I 401.45340 nm
399.83282 389.436000000 399.83282 389.436000000
399.85765 289.804000000 399.85765 289.804000000
The problem is that each row of my data is a different length. Is there any way to fill the remaining spaces of the shorter rows with a space so they are all the same length?
I would like my data to be in the form:
list one= [399.75833, 399.78316, 399.80799, 399.83282, 399.85765]
list two= [561.572000000, 523.227000000, 455.923000000, 389.436000000, 289.804000000]
list three= [a_Fe, " ", a_Fe, " ", " "]
This is the code I used to import the data into python:
fh = open('help.bsp').read()
the_list = []
for line in fh.split('\n'):
    print line.strip()
    splits = line.split()
    if len(splits) == 1 and splits[0] == line.strip():
        splits = line.strip().split(',')
    if splits: the_list.append(splits)
You need to use izip_longest to build your column lists, since the standard zip only runs to the length of the shortest of the given rows.
from itertools import izip_longest

with open('workfile', 'r') as f:
    fh = f.readlines()

# Process all the rows line by line
rows = [line.strip().split() for line in fh]

# Use izip_longest to get all columns, with None's filled in blank spots
cols = [col for col in izip_longest(*rows)]

# Then run your type conversions for your final data lists
list_one = [float(i) for i in cols[2]]
list_two = [float(i) for i in cols[3]]

# Since you want " " instead of None for blanks
list_three = [i if i else " " for i in cols[4]]
Output:
>>> print list_one
[399.75833, 399.78316, 399.80799, 399.83282, 399.85765]
>>> print list_two
[561.572, 523.227, 455.923, 389.436, 289.804]
>>> print list_three
['a_Fe', ' ', 'a_Fe', ' ', ' ']
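On Python 3, izip_longest was renamed to itertools.zip_longest and print is a function; the same approach (a sketch) looks like this:
from itertools import zip_longest

with open('workfile', 'r') as f:
    rows = [line.strip().split() for line in f]

cols = list(zip_longest(*rows))  # missing cells become None

list_one = [float(i) for i in cols[2]]
list_two = [float(i) for i in cols[3]]
list_three = [i if i else " " for i in cols[4]]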
So, your lines are either whitespace delimited or comma delimited, and if comma delimited, the line contains no whitespace? (note that if len(splits)==1 is true, then splits[0]==line.strip() is also true). That's not the data you're showing, and not what you're describing.
To get the lists you want from the data you show:
with open('help.bsp') as h:
    the_list = [line.strip().split() for line in h.readlines()]

list_one = [d[0] for d in the_list]
list_two = [d[1] for d in the_list]
list_three = [d[4] if len(d) > 4 else ' ' for d in the_list]
If you're reading comma separated (or similarly delimited) files, I always recommend using the csv module - it handles a lot of edge cases that you may not have considered.
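For example, a minimal csv sketch (assuming a hypothetical comma-delimited version of the file):
import csv

with open('help.csv') as f:          # hypothetical comma-separated file
    rows = list(csv.reader(f))       # each row is a list of strings
csv.reader handles quoted fields and embedded delimiters that a plain split(',') would get wrong.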

Creating a program that compares two lists

I am trying to create a program that checks whether items from one list are not in another. It keeps returning lines saying that x value is not in the list. Any suggestions? Sorry about my code, it's quite sloppy.
Searching Within an Array
Putting .txt files into arrays
with open('Barcodes', 'r') as f:
    barcodes = [line.strip() for line in f]
with open('EAN Staging', 'r') as f:
    EAN_staging = [line.strip() for line in f]
Arrays
list1 = barcodes
list2 = EAN_staging
Main Code
fixed = -1
for x in list1:
    for variable in list1:  # Moves along each variable in the list, in turn
        if list1[fixed] in list2:  # If the term is in the list, then
            fixed = fixed + 1
            location = list2.index(list1[fixed])  # Finds the term in the list
            print()
            print("Found", variable, "at location", location)  # Prints location of terms
Instead of lists, read the files as sets:
with open('Barcodes', 'r') as f:
    barcodes = {line.strip() for line in f}
with open('EAN Staging', 'r') as f:
    EAN_staging = {line.strip() for line in f}
Then all you need to do is calculate the difference between them:
diff = barcodes - EAN_staging  # or barcodes.difference(EAN_staging)
A small example:
a = {1, 2, 3}
b = {3, 4, 5}
print(a - b)
>> {1, 2}  # 1 and 2 are in a but not in b
Note that if you are operating with sets, information about how many times an element is present will be lost. If you care about situations when an element is present in barcodes 3 times, but only 2 times in EAN_staging, you should use Counter from collections.
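For completeness, a small sketch of that Counter variant (multiset difference; counts that drop to zero or below are discarded):
from collections import Counter

barcode_counts = Counter(barcodes)     # barcodes / EAN_staging as read above
staging_counts = Counter(EAN_staging)

diff = barcode_counts - staging_counts  # keeps entries that occur more often in barcodes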
Your code doesn't seem to quite answer your question. If all you want to do is see which elements aren't shared, I think sets are the way to go.
set1 = set(list1)
set2 = set(list2)
in_first_but_not_in_second = set1.difference(set2) # outputs a set
not_in_both = set1.symmetric_difference(set2) # outputs a set
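A short usage example with the question's variable names (the message wording is just illustrative):
for code in in_first_but_not_in_second:
    print("{0} is in barcodes but not in EAN Staging".format(code))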

Python: writing a list of lists to a file with different separators

I have a list of lists like this:
liste = [["1-2","3-4"],["5-6"]]
I would like to write it to a file like this:
1-2,3-4|5-6
I tried this:
for l in liste:
    sortie.write(",".join(l) + "|")
but it writes:
1-2,3-4|5-6|
How can I avoid writing the last pipe?
Use a list comprehension to join your lists in one .write() call:
sortie.write('|'.join([','.join(l) for l in liste]))
This replaces your for loop over liste.
Demo:
>>> liste = [["1-2","3-4"],["5-6"]]
>>> '|'.join([','.join(l) for l in liste])
'1-2,3-4|5-6'
Nested joins:
s = '|'.join(','.join(l) for l in liste)
>>> lines = []
>>> for line in liste:
...     lines.append(','.join(map(str, line)))
...
>>> print '|'.join(lines)
1-2,3-4|5-6
>>>
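Putting it together, a minimal sketch of the full write (the output file name is a placeholder for whatever sortie was opened as in the question):
liste = [["1-2", "3-4"], ["5-6"]]

with open('sortie.txt', 'w') as sortie:
    sortie.write('|'.join(','.join(l) for l in liste) + '\n')  # writes 1-2,3-4|5-6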

Change the display of a list taken from a text file

I have this code written in Python:
with open('textfile.txt') as f:
    list = []
    for line in f:
        line = line.split()
        if line:
            line = [int(i) for i in line]
            list.append(line)
print(list)
This reads integers from a text file and puts them in a list, but the result is:
[[10,20,34]]
However, I would like it to display as:
10 20 34
How to do this? Thanks for your help!
You probably just want to add the items to the list, rather than appending them:
with open('textfile.txt') as f:
    list = []
    for line in f:
        line = line.split()
        if line:
            list += [int(i) for i in line]

print " ".join([str(i) for i in list])
If you append a list to a list, you create a sub list:
a = [1]
a.append([2,3])
print a # [1, [2, 3]]
If you add it you get:
a = [1]
a += [2,3]
print a # [1, 2, 3]!
with open('textfile.txt') as f:
    lines = [x.strip() for x in f.readlines()]
print(' '.join(lines))
With an input file 'textfile.txt' that contains:
10
20
30
prints:
10 20 30
It sounds like you are trying to print a list of lists. The easiest way to do that is to iterate over it and print each list.
for line in list:
    print " ".join(str(i) for i in line)
Also, list is the name of a built-in type in Python, so try to avoid naming your variables that.
If you know that the file is not extremely long and you want the list of integers, you can do it all at once (two lines, one of which is the with open(...)). And if you want to print it your way, you can convert the elements to strings and join the result via ' '.join(...) -- like this:
#!python3

# Load the content of the text file as one list of integers.
with open('textfile.txt') as f:
    lst = [int(element) for element in f.read().split()]

# Print the formatted result.
print(' '.join(str(element) for element in lst))
Do not use the list identifier for your variables as it masks the name of the list type.
