Comparing two CSV files in Python when rows have multiple values - python

I have two CSV files that I want to compare. One looks like this:
"a" 1 6 3 1 8
"b" 15 6 12 5 6
"c" 7 4 1 4 8
"d" 14 8 12 11 4
"e" 1 8 7 13 12
"f" 2 5 4 13 9
"g" 8 6 9 3 3
"h" 5 12 8 2 3
"i" 5 9 2 11 11
"j" 1 9 2 4 9
So "a" possesses the numbers 1,6,3,1,8 etc. The actual CSV file is 1,000s of lines long so you know for efficiency sake when writing the code.
The second CSV file looks like this:
4
15
7
9
2
I have written some code to import these CSV files into lists in Python.
with open('winningnumbers.csv', 'rb') as wn:
    reader = csv.reader(wn)
    winningnumbers = list(reader)

wn1 = winningnumbers[0]
wn2 = winningnumbers[1]
wn3 = winningnumbers[2]
wn4 = winningnumbers[3]
wn5 = winningnumbers[4]
print(winningnumbers)

with open('Entries#x.csv', 'rb') as en:
    readere = csv.reader(en)
    enl = list(readere)
How would I now cross-reference number 4 (wn1 from the second CSV file) with the first CSV file, so that it returns that "b" has wn1 in it? I imported them as lists to see if I could figure out how to do it, but just ended up running in circles. I also tried using dict() but had no success.

If I understood you correctly, you want to find the first index (or all indexes) of the entries that match the winning numbers. If so, you can do it like this:
with open('winningnumbers.csv', 'rb') as wn:
    reader = csv.reader(wn)
    winningnumbers = list(reader)

with open('Entries#x.csv', 'rb') as en:
    readere = csv.reader(en)

    winning_number_index = -1  # Default value which we will print if nothing is found
    current_index = 0  # Initial index

    for line in readere:  # Iterate over entries file
        # Default value that will be set to False if any of the elements
        # doesn't match with winningnumbers
        all_numbers_match = True
        for i in range(len(line)):
            # If values of current line and winningnumbers with matching indexes are not equal
            if line[i] != winningnumbers[i]:
                all_numbers_match = False  # Our default value is set to False
                break  # Exit "for" without finishing
        # If our default value is still True (which indicates that all numbers match)
        if all_numbers_match == True:
            winning_number_index = current_index  # Current index is written to winning_number_index
            break  # Exit "for" without finishing
        else:  # Not all numbers match
            current_index += 1

print(winning_number_index)
This will print the index of the first winning number in entries (if you want all the indexes, write about it in the comments).
Note: this is not the optimal code to solve your problem. It's just easier to understand and debug if you're not familiar with Python's more advanced features.
You should probably consider not abbreviating your variables. entries_reader takes just a second more to write and five seconds less to understand than readere.
This variant is faster, shorter, and more memory-efficient, but may be harder to understand:
with open('winningnumbers.csv', 'rb') as wn:
    reader = csv.reader(wn)
    winningnumbers = list(reader)

with open('Entries#x.csv', 'rb') as en:
    readere = csv.reader(en)
    for line_index, line in enumerate(readere):
        if all(line[i] == winningnumbers[i] for i in xrange(len(line))):
            winning_number_index = line_index
            break
    else:
        winning_number_index = -1

print(winning_number_index)
The features that might be unclear are probably enumerate(), all(), and using else with for rather than if. Let's go through all of them one by one.
To understand this usage of enumerate, you'll need to understand this syntax:
a, b = [1, 2]
Variables a and b will be assigned the corresponding values from the list. In this case a will be 1 and b will be 2. Using this syntax we can do this:
for a, b in [[1, 2], [2, 3], ['spam', 'eggs']]:
    # do something with a and b
In each iteration, a and b will be 1 and 2, then 2 and 3, then 'spam' and 'eggs', accordingly.
Let's assume we have a list a = ['spam', 'eggs', 'potatoes']. enumerate() just returns a "list" of pairs like this: [(0, 'spam'), (1, 'eggs'), (2, 'potatoes')]. So, when we use it like that,
for line_index, line in enumerate(readere):
    # Do something with line_index and line
line_index will be 0, 1, 2, etc.
The all() function accepts an iterable (list, tuple, generator, etc.) and returns True if all the elements in it are truthy.
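A quick illustration at the prompt (my addition, not from the original answer):
>>> all([True, True, True])
True
>>> all(x > 0 for x in [1, 2, -3])  # hits -3 and returns False
False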
The list comprehension mylist = [line[i] == winningnumbers[i] for i in range(len(line))] builds a list and is equivalent to the following (the code above uses the parenthesized generator form, which produces the same values lazily instead of building a list):
mylist = []
for i in range(len(line)):
    mylist.append(line[i] == winningnumbers[i])  # a == b evaluates to True if a equals b
So all() will return True only when every number from the entry matches the winning numbers.
Code in the else section of a for loop runs only when the for was not interrupted by break, so in our situation it's a good place to set the default index.
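A minimal for/else illustration (added for clarity):
for x in [1, 2, 3]:
    if x == 99:
        break
else:
    print('no break happened')  # runs, because break never fired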

Having duplicate numbers seems illogical, but if you want the count of matched numbers for each row regardless of index, then make nums a set and sum how many numbers from each row are in the set:
from itertools import islice, imap
import csv

with open("in.txt") as f, open("numbers.txt") as nums:
    # make a set of all winning nums
    nums = set(imap(str.rstrip, nums))
    r = csv.reader(f)
    # iterate over each row and sum how many matches we get
    for row in r:
        print("{} matched {}".format(row[0],
              sum(n in nums for n in islice(row, 1, None))))
Which using your input will output:
a matched 0
b matched 1
c matched 2
d matched 1
e matched 0
f matched 2
g matched 0
h matched 1
i matched 1
j matched 2
presuming your file is comma-separated and you have one number per line in your numbers file.
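A side note for Python 3 readers (my addition): itertools.imap is gone there, but the built-in map is already lazy, so an equivalent of the snippet above is:
from itertools import islice
import csv

with open("in.txt") as f, open("numbers.txt") as nums:
    nums = set(map(str.rstrip, nums))
    r = csv.reader(f)
    for row in r:
        print("{} matched {}".format(row[0],
              sum(n in nums for n in islice(row, 1, None))))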
If you actually want to know which numbers, if any, are present, then you need to iterate over the numbers and print each one that is in our set:
from itertools import islice, imap
import csv

with open("in.txt") as f, open("numbers.txt") as nums:
    nums = set(imap(str.rstrip, nums))
    r = csv.reader(f)
    for row in r:
        for n in islice(row, 1, None):
            if n in nums:
                print("{} is in row {}".format(n, row[0]))
        print("")
But again, I am not sure having duplicate numbers makes sense.
To group the rows by how many matches they have, you can use a dict with the sum as the key, appending the first column value:
from itertools import islice, imap
import csv
from collections import defaultdict

with open("in.txt") as f, open("numbers.txt") as nums:
    # make a set of all winning nums
    nums = set(imap(str.rstrip, nums))
    r = csv.reader(f)
    results = defaultdict(list)
    # iterate over each row and sum how many matches we get
    for row in r:
        results[sum(n in nums for n in islice(row, 1, None))].append(row[0])
results:
defaultdict(<type 'list'>,
{0: ['a', 'e', 'g'], 1: ['b', 'd', 'h', 'i'],
2: ['c', 'f', 'j']})
The keys are the number of matches; the values are the row ids that matched that many numbers.
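For example, to print the groups from most to fewest matches (an illustrative follow-up, not part of the original answer):
for matches in sorted(results, reverse=True):
    print("{} matched: {}".format(matches, ", ".join(results[matches])))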

Related

Longest subset of five-positions five-elements permutations, only one-element-position in common

I am trying to get the longest possible list of five-position sequences, each position holding a value from 1 to 5, satisfying the condition that no two members of the list share more than one identical value at the same position (index). I.e., 11111 and 12222 is permitted (only the 1 at index 0 is shared), but 11111 and 11222 is not permitted (same value at index 0 and 1).
I have tried a brute-force attack, starting with the complete list of permutations, 3125 members, and walking through the list element by element, rejecting the ones that do not match the criteria, in several steps:
step one: testing elements 2 to 3125 against element 1, getting a new shorter list L'
step two: testing elements 3 to N' against element 2', getting a shorter list yet L'',
and so on.
I get a 17-member solution, perfectly valid. The problem is that:
I know there are, at least, two 25-member valid solutions, found by sheer good luck,
The solution found by this brute-force method depends strongly on the initial order of the 3125-member list, so by shuffling the L0 list I have been able to find 12- to 21-member solutions, but I have never hit the 25-member solutions.
Could anyone please put light on the problem? Thank you.
This is my approach so far
import csv, random

maxv = 0
soln = 0
for p in range(0, 1):  # Intended to run multiple times
    z = -1
    while True:
        z = z + 1
        file1 = 'Step' + "%02d" % (z+0) + '.csv'
        file2 = 'Step' + "%02d" % (z+1) + '.csv'
        nextdata = []
        with open(file1, 'r') as csv_file:
            data = list(csv.reader(csv_file))
        #if file1 == 'Step00.csv':  # related to p loop
        #    random.shuffle(data)
        i = 0
        while i <= z:
            nextdata.append(data[i])
            i = i + 1
        for j in range(z, len(data)):
            sum = 0
            for k in range(0, 5):
                if (data[z][k] == data[j][k]):
                    sum = sum + 1
            if sum < 2:
                nextdata.append(data[j])
        ofile = open(file2, 'wb')
        writer = csv.writer(ofile)
        writer.writerows(nextdata)
        ofile.close()
        if (len(nextdata) < z + 1 + 1):
            if (z+1) >= maxv:
                maxv = z+1
                print maxv
                ofile = open("Solution" + "%02d" % soln + '.csv', 'wb')
                writer = csv.writer(ofile)
                writer.writerows(nextdata)
                ofile.close()
                soln = soln + 1
            break
Here is a Picat model for the problem (as I understand it): http://hakank.org/picat/longest_subset_of_five_positions.pi It uses constraint modelling and a SAT solver.
Edit: Here is a MiniZinc model: http://hakank.org/minizinc/longest_subset_of_five_positions.mzn
The model (predicate go/0) checks lengths of 2 to 100. All lengths between 2 and 25 have at least one solution (probably a lot more). So 25 is the longest subsequence. Here is one length-25 solution:
{1,1,1,3,4}
{1,2,5,1,5}
{1,3,4,4,1}
{1,4,2,2,2}
{1,5,3,5,3}
{2,1,3,2,1}
{2,2,4,5,4}
{2,3,2,1,3}
{2,4,1,4,5}
{2,5,5,3,2}
{3,1,2,5,5}
{3,2,3,4,2}
{3,3,5,2,4}
{3,4,4,3,3}
{3,5,1,1,1}
{4,1,4,1,2}
{4,2,1,2,3}
{4,3,3,3,5}
{4,4,5,5,1}
{4,5,2,4,4}
{5,1,5,4,3}
{5,2,2,3,1}
{5,3,1,5,2}
{5,4,3,1,4}
{5,5,4,2,5}
There are a lot of different length-25 solutions (the predicate go2/0 checks that).
Here is the complete model (edited from the file above):
import sat.

main => go.

%
% Test all lengths from 2..100.
% 25 is the longest.
%
go ?=>
  nolog,
  foreach(M in 2..100)
    println(check=M),
    if once(check(M,_X)) then
      println(M=ok)
    else
      println(M=not_ok)
    end,
    nl
  end,
  nl.
go => true.

%
% Check if there is a solution with M numbers
%
check(M, X) =>
  N = 5,
  X = new_array(M,N),
  X :: 1..5,
  foreach(I in 1..M, J in I+1..M)
    % at most 1 same number in the same position
    sum([X[I,K] #= X[J,K] : K in 1..N]) #<= 1,
    % symmetry breaking: sort the sub sequence
    lex_lt(X[I],X[J])
  end,
  solve([ff,split],X),
  foreach(Row in X)
    println(Row)
  end,
  nl.
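If you want to sanity-check a solution such as the length-25 one above in plain Python, a small verifier of the pairwise condition could look like this (a sketch of mine, not part of the Picat model):
from itertools import combinations

def is_valid(rows):
    # No two rows may agree in more than one position.
    return all(sum(x == y for x, y in zip(a, b)) <= 1
               for a, b in combinations(rows, 2))

# e.g. is_valid([(1,1,1,3,4), (1,2,5,1,5), ...]) should return True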

Two-dimensional array of 0s & 1s, sliced and keys returned as set in Python

I have a number of records in a text file that represent days of a 'month' 1-30 and whether a shop is open or closed. The letters represent the shop.
A 00000000001000000000000000000
B 11000000000000000000000000000
C 00000000000000000000000000000
D 00000000000000000000000000000
E 00000000000000000000000000000
F 00000000000000000000000000000
G 00000000000000000000000000000
H 00000000000000000000000000000
I 11101111110111111011111101111
J 11111111111111111111111111111
K 00110000011000001100000110000
L 00010000001000000100000010000
M 00100000010000001000000100000
N 00000000000000000000000000000
O 11011111101111110111111011111
I want to store the 1s and 0s as-is in an array (I'm thinking numpy, but if there is another way (string, bitstring) I'd be happy with that). Then I want to be able to slice out one day, i.e. a column, and get the record keys back as a set.
e.g.
A 1
B 0
C 0
D 0
E 0
F 0
G 0
H 0
I 0
J 1
K 1
L 1
M 0
N 0
O 1
day10 = {A,J,K,L,O}
I also need this to be as performant as absolutely possible.
Simplest solution I've come up with:
shops = {}
with open('input.txt', 'r') as f:
    for line in f:
        name, month = line.strip().split()
        shops[name] = [d == '1' for d in month]

dayIndex = 14
result = [s for s, v in shops.iteritems() if v[dayIndex]]
print "Shops opened at", dayIndex, ":", result
A numpy solution:
import numpy as np

stores, isopen = np.genfromtxt('input.txt', dtype="S30", unpack=True)
isopen = np.array(map(list, isopen)).astype(bool)
Then,
>>> stores[isopen[:,10]]
array(['A', 'J', 'K', 'L', 'O'],
dtype='|S30')
with open("datafile") as fin:
D = {i[0]:int(i[:1:-1], 2) for i in fin}
days = [{k for k in D if D[k] & 1<<i} for i in range(31)]
Just keep the days variable between queries
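A quick usage sketch (my addition): the reversed bit string maps the first day character to bit 0, so the column the numpy answer reads as isopen[:,10] corresponds to days[10] here:
print(days[10])  # -> {'A', 'J', 'K', 'L', 'O'}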
First, I would hesitate to write the amount of code needed to make something like bitarray work.
Second, I already upvoted BartoszKP's answer, as it looks like a reasonable approach.
Last, I would use pandas instead of numpy for such a task; for most operations it uses the underlying numpy functions and will be reasonably fast.
If data contains your array as a string, converting it to a DataFrame can be done with
>>> df = pd.DataFrame([[x] + map(int, y)
...                    for x, y in [l.split() for l in data.splitlines()]])
>>> df.columns = ['Shop'] + map(str, range(1, 30))
and lookups are done with
>>> df[df['3']==1]['Shop']
8 I
9 J
10 K
12 M
Name: Shop, dtype: object
Use a multilayered dictionary:
all_shops = {'shopA': {1: True, 2: False, 3: True ...},
             .......}
Then your query is translated to
def query(shop_name, day):
    return all_shops[shop_name][day]
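A minimal sketch of building that structure from the question's file (the file name "datafile" is an assumption):
all_shops = {}
with open("datafile") as f:
    for line in f:
        shop, flags = line.split()
        # map day numbers 1..N to open/closed booleans
        all_shops[shop] = {day: flag == '1'
                           for day, flag in enumerate(flags, start=1)}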
with open("datafile") as f:
for line in f:
shop, _days = line.split()
for i,d in enumerate(_days):
if d == '1':
days[i].add(shop)
Simpler, faster, and it answers the question.

Iterate through lines changing only one character python

I have a file that looks like this
N1 1.023 2.11 3.789
Cl1 3.124 2.4534 1.678
Cl2 # # #
Cl3 # # #
Cl4
Cl5
N2
Cl6
Cl7
Cl8
Cl9
Cl10
N3
Cl11
Cl12
Cl13
Cl14
Cl15
The three numbers continue down throughout.
What I would like to do is pretty much a permutation. These are 3 data sets: set 1 is N1-Cl5, set 2 is N2-Cl10, and set 3 is N3 to the end.
I want every combination of N's and Cl's. For example the first output would be
Cl1
N1
Cl2
then everything else the same. The next set would be Cl1, Cl2, N1, Cl3... and so on.
I have some code but it won't do what I want, because it wouldn't know that there are three individual data sets. Should I have the three data sets in three different files and then combine them, using code like:
list1 = ['Cl1', 'Cl2', 'Cl3', 'Cl4', 'Cl5']
for line in file1:
    line.replace('N1', list1[0])
    list1.pop(0)
    print >> file.txt, line,
or is there a better way?? Thanks in advance
This should do the trick:
from itertools import permutations

def print_permutations(in_file):
    separators = ['N1', 'N2', 'N3']
    cur_separator = None
    related_elements = []
    with open(in_file, 'rb') as f:
        for line in f:
            line = line.strip()
            # Split Nx and Clx from the numbers.
            value = line.split()[0]
            if value in separators:
                # Found a new Nx. Print the previous group's permutations.
                if cur_separator and related_elements:
                    for perm in permutations([cur_separator] + related_elements):
                        print perm
                cur_separator = value
                related_elements = []
            else:
                # Found a new Clx. Append it to the list.
                related_elements.append(value)
    # Flush the last group, which the loop itself never reaches.
    if cur_separator and related_elements:
        for perm in permutations([cur_separator] + related_elements):
            print perm
You could use regex to find the line numbers of the "N" patterns and then slice the file using those line numbers:
import re

n_pat = re.compile(r'N\d')
N_matches = []
with open(sample, 'r') as f:
    for num, line in enumerate(f):
        if re.match(n_pat, line):
            N_matches.append((num, re.match(n_pat, line).group()))
>>> N_matches
[(0, 'N1'), (12, 'N2'), (24, 'N3')]
After you figure out the line numbers where these patterns appear, you can then use itertools.islice to break the file up into a list of lists:
import itertools

first = N_matches[0][0]
final = N_matches[-1][0]
step = N_matches[1][0]

dataset = []
locallist = []
while first < final + step:
    with open(sample, 'r') as f:  # same file as above
        for item in itertools.islice(f, first, first + step):
            if item.strip():
                locallist.append(item.strip())
        dataset.append(locallist)
        locallist = []
    first += step
itertools.islice is a really nice way to take a slice of an iterable. Here's the result of the above on a sample:
>>> dataset
[['N1 1.023 2.11 3.789', 'Cl1 3.126 2.6534 1.878', 'Cl2 3.124 2.4534 1.678', 'Cl3 3.924 2.1134 1.1278', 'Cl4', 'Cl5'], ['N2', 'Cl6 3.126 2.6534 1.878', 'Cl7 3.124 2.4534 1.678', 'Cl8 3.924 2.1134 1.1278', 'Cl9', 'Cl10'], ['N3', 'Cl11', 'Cl12', 'Cl13', 'Cl14', 'Cl15']]
After that, I'm a bit hazy on what you're seeking to do, but I think you want permutations of each sublist of the dataset? If so, you can use itertools.permutations to find permutations on various sublists of dataset:
for item in itertools.permutations(dataset[0]):
    print(item)
etc.
Final Note:
Assuming I understand correctly what you're doing, the number of permutations is going to be huge. You can calculate how many permutations there are by taking the factorial of the number of items. Anything with 10 items (10!) is going to produce over 3,600,000 permutations.
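To see the growth yourself (my own illustration):
import math
print(math.factorial(6))   # 720
print(math.factorial(10))  # 3628800
print(math.factorial(15))  # 1307674368000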

using an integer in a string to create a dictionary (or list) with that many numbers

So I have this text (WordNet) file made up of numbers and words, for example like this -
"09807754 18 n 03 aristocrat 0 blue_blood 0 patrician"
and I want to read in the first number as a dictionary key (or list name) for the words that follow. The layout of this never changes; it is always an 8-digit key followed by a two-digit number, a single letter, and a two-digit number. This last two-digit number (03) tells how many words (three words in this case) are associated with the first 8-digit key.
My idea was that I would look at the 14th place in the string and use that number to run a loop to pick up all of the words associated with that key.
So I think it would go something like this:
with open('nouns.txt', 'r') as f:
    for line in f:
        words = range(14, 15)
        numOfWords = int(words)
        while i =< numOfWords
            # here is where the problem arises,
            # i want to search for words after the spaces 3 (numOfWords) times
            # and put them into a dictionary (or list) associated with the key
            range(0, 7) = {word(i+1), word(i+2)}
Technically I am looking for whichever one of these makes more sense:
09807754 = { 'word1':aristocrat, 'word2':blue_blood , 'word3':patrician }
or
09807754 = ['aristocrat', 'blue_blood', 'patrician']
Obviously this doesn't run, but if anyone could give me any pointers it would be greatly appreciated.
>>> L = "09807754 18 n 03 aristocrat 0 blue_blood 0 patrician".split()
>>> L[0], L[4::2]
('09807754', ['aristocrat', 'blue_blood', 'patrician'])
>>> D = {}
>>> D.update({L[0]: L[4::2]})
>>> D
{'09807754': ['aristocrat', 'blue_blood', 'patrician']}
For the extra line in your comment, some extra logic is needed
>>> L = "09827177 18 n 03 aristocrat 0 blue_blood 0 patrician 0 013 # 09646208 n 0000".split()
>>> D.update({L[0]: L[4:4 + 2 * int(L[3]):2]})
>>> D
{'09807754': ['aristocrat', 'blue_blood', 'patrician'], '09827177': ['aristocrat', 'blue_blood', 'patrician']}
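To make the slice arithmetic explicit (an annotated restatement of the line above, not new functionality):
L = "09807754 18 n 03 aristocrat 0 blue_blood 0 patrician".split()
count = int(L[3])            # the two-digit word count, here 3
words = L[4:4 + 2*count:2]   # words alternate with sense numbers, so step by 2
# words == ['aristocrat', 'blue_blood', 'patrician']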
res = {}
with open('nouns.txt', 'r') as f:
    for line in f:
        splited = line.split()
        res[splited[0]] = [w for w in splited[4:] if not w.isdigit()]
Output:
{'09807754': ['aristocrat', 'blue_blood', 'patrician']}

Number of elements in Python Set

I have a list of phone numbers that have been dialed (nums_dialed).
I also have a set of phone numbers which are the number in a client's office (client_nums)
How do I efficiently figure out how many times I've called a particular client (total)
For example:
>>>nums_dialed=[1,2,2,3,3]
>>>client_nums=set([2,3])
>>>???
total=4
Problem is that I have a large-ish dataset: len(client_nums) ~ 10^5; and len(nums_dialed) ~10^3.
Which client has 10^5 numbers in his office? Do you do work for an entire telephone company?
Anyway:
print sum(1 for num in nums_dialed if num in client_nums)
That will give you the number as fast as possible.
If you want to do it for multiple clients, using the same nums_dialed list, then you could cache the count for each number first:
import collections

nums_dialed_dict = collections.defaultdict(int)
for num in nums_dialed:
    nums_dialed_dict[num] += 1
Then just sum the counts for each client's numbers:
sum(nums_dialed_dict[num] for num in this_client_nums)
That would be a lot quicker than iterating over the entire list of numbers again for each client.
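Putting it together with the question's sample data (illustrative):
import collections

nums_dialed = [1, 2, 2, 3, 3]
client_nums = set([2, 3])

nums_dialed_dict = collections.defaultdict(int)
for num in nums_dialed:
    nums_dialed_dict[num] += 1

print sum(nums_dialed_dict[num] for num in client_nums)  # prints 4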
>>> client_nums = set([2, 3])
>>> nums_dialed = [1, 2, 2, 3, 3]
>>> count = 0
>>> for num in nums_dialed:
... if num in client_nums:
... count += 1
...
>>> count
4
>>>
Should be quite efficient even for the large numbers you quote.
Using collections.Counter from Python 2.7:
import collections

dialed_count = collections.Counter(nums_dialed)
count = sum(dialed_count[t] for t in client_nums)
That's a very popular way to combine two sorted lists in a single pass:
nums_dialed = [1, 2, 2, 3, 3]
client_nums = [2, 3]

nums_dialed.sort()
client_nums.sort()

c = 0
i = iter(nums_dialed)
j = iter(client_nums)
try:
    a = i.next()
    b = j.next()
    while True:
        if a < b:
            a = i.next()
            continue
        if a > b:
            b = j.next()
            continue
        # a == b
        c += 1
        a = i.next()  # next dialed
except StopIteration:
    pass

print c
Because "set" is unordered collection (don't know why it uses hashes, but not binary tree or sorted list) and it's not fair to use it there. You can implement own "set" through "bisect" if you like lists or through something more complicated that will produce ordered iterator.
The method I use is to simply convert the set into a list and then use the len() function to count its values (note that len() also works directly on a set, so the conversion isn't strictly necessary):
set_var = {"abc", "cba"}
print(len(list(set_var)))
Output:
2
