I have a list of sorted data arranged so that each item in the list is a CSV line to be written to file.
The final step of the script compares consecutive lines field by field; if all but the last field match, it appends the current line's last field onto the previous line's last field.
Once I've found and processed one of these matches, I would like to skip the current line (the one the field was copied from), so that only one of the lines remains.
Here's an example set of data
field1,field2,field3,field4,something
field1,field2,field3,field4,else
Desired output
field1,field2,field3,field4,something else
This is what I have so far
output_csv = ['field1,field2,field3,field4,something',
              'field1,field2,field3,field4,else']

# run through the output
# open and create a csv file to save output
with open('output_table.csv', 'w') as f:
    previous_line = None
    part_duplicate_line = None
    part_duplicate_flag = False
    for line in output_csv:
        part_duplicate_flag = False
        if previous_line is not None:
            previous = previous_line.split(',')
            current = line.split(',')
            if (previous[0] == current[0]
                    and previous[1] == current[1]
                    and previous[2] == current[2]
                    and previous[3] == current[3]):
                print(previous[0], current[0])
                previous[4] = previous[4].replace('\n', '') + ' ' + current[4]
                part_duplicate_line = ','.join(previous)
                part_duplicate_flag = True
                f.write(part_duplicate_line)
            if part_duplicate_flag is False:
                f.write(previous_line)
        previous_line = line
At the moment the script adds the merged line but doesn't skip the next line. I've tried various renditions of continue statements after part_duplicate_line is written to file, but to no avail.
It looks like you want one entry for each combination of the first 4 fields.
You can use a dict to aggregate the data:
# First we extract the key and values
output_csv_keys = list(map(lambda x: ','.join(x.split(',')[:-1]), output_csv))
output_csv_values = list(map(lambda x: x.split(',')[-1], output_csv))

# Then we construct a dictionary with these keys and combine the values into a list
from collections import defaultdict
output_csv_dict = defaultdict(list)
for key, value in zip(output_csv_keys, output_csv_values):
    output_csv_dict[key].append(value)

# Then we extract the key/value combinations from this dictionary into a list
for_printing = [','.join([k, ' '.join(v)]) for k, v in output_csv_dict.items()]
print(for_printing)

# Output is ['field1,field2,field3,field4,something else']
# Each entry of this list can be output to the csv file
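For that last step, a minimal sketch of writing those entries out (assuming the for_printing list built above) could be:

# Minimal sketch (assumes the for_printing list from above):
# write each aggregated entry to the output CSV, one per line.
with open('output_table.csv', 'w') as f:
    for row in for_printing:
        f.write(row + '\n')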
I propose to encapsulate what you want to do in a function whose important part obeys this logic:
either join the new info to the old record,
or output the old record and forget it.
Of course, at the end of the loop there is in any case a dangling old record left to output.
def join(inp_fname, out_fname):
    '''Input file contains sorted records; when two (or more) records differ
    only in the last field, we join the last fields with a space
    and output only once, otherwise output the record as-is.'''
    ######################### Prepare for action ##########################
    from csv import reader, writer
    with open(inp_fname) as finp, open(out_fname, 'w') as fout:
        r, w = reader(finp), writer(fout)
        ######################### Important Part starts here ##############
        old = next(r)
        for new in r:
            if old[:-1] == new[:-1]:
                old[-1] += ' ' + new[-1]
            else:
                w.writerow(old)
                old = new
        w.writerow(old)
To check what I've proposed you can use these two snippets (note that these records are shorter than yours, but it's an example and it doesn't matter because we use only -1 to index our records).
The 1st one has a "regular" last record
open('a0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n3,3,0\n')
join('a0.csv', 'a1.csv')
while the 2nd has a last record that must be joined to the previous one.
open('b0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n')
join('b0.csv', 'b1.csv')
If you run these snippets (as I did before posting) in an environment where you have defined join, you should get what you want.
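As a quick sanity check, tracing the logic by hand suggests that a1.csv should come out as

1,1,2 3
1,2,0
1,3,1 2
3,3,0

while b1.csv should contain the same lines minus the final 3,3,0 record.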
Related
This is the file that I am working with called file1.txt
20
Gunsmoke
30
The Simpsons
10
Will & Grace
14
Dallas
20
Law & Order
12
Murder, She Wrote
And here is my code so far:
file = open('file1.txt')
lines = file.readlines()
print(lines)

new_list = []
for i in lines:
    new = i.strip()
    new_list.append(new)
print(new_list)

new_dict = {}
for i in range(0, len(new_list), 2):
    new_dict[new_list[i]] = new_list[i+1]
    if i in new_dict:
        i[key] = i.values()

new_dict = dict(sorted(new_dict.items()))
print(new_dict)

file_2 = open('output_keys.txt', 'w')
for x, y in new_dict.items():
    print(x, y)
    file_2.write(x + ': ')
    file_2.write(y)
    file_2.write('\n')
file_2.close()

file_3 = open('output_titles.txt', 'w')
new_list2 = []
for x, y in new_dict.items():
    new_list2.append(y)
new_list2.sort()
print(new_list2)

for i in new_list2:
    file_3.write(i)
    file_3.write('\n')
    print(i)
file_3.close()
The instructions state:
Write a program that first reads in the name of an input file and then reads the input file using the file.readlines() method. The input file contains an unsorted list of number of seasons followed by the corresponding TV show. Your program should put the contents of the input file into a dictionary where the number of seasons are the keys, and a list of TV shows are the values (since multiple shows could have the same number of seasons).
Sort the dictionary by key (least to greatest) and output the results to a file named output_keys.txt. Separate multiple TV shows associated with the same key with a semicolon (;), ordering by appearance in the input file. Next, sort the dictionary by values (alphabetical order), and output the results to a file named output_titles.txt.
I am having trouble with 2 parts:
The first is "Separate multiple TV shows associated with the same key with a semicolon (;)".
What I have written so far just replaces the existing item in the dictionary with the new one.
for i in range(0, len(new_list), 2):
    new_dict[new_list[i]] = new_list[i+1]
    if i in new_dict:
        i[key] = i.values()
The 2nd part is that the Zybooks program seems to add onto output_keys.txt and output_titles.txt every time it runs, but my code does not. For example, if I run file1.txt and then try to run file2.txt, my code replaces output_keys and output_titles instead of adding to them.
Try to break down the problem into smaller sub-problems. Right now, it seems like you're trying to solve everything at once. E.g., I'd suggest you omit the file input and output and focus on the basic functionality of the program. Once that is set, you can go for the I/O.
You first need to create a dictionary with numbers of seasons as keys and a list of tv shows as values. You almost got it; here's a working snippet (I renamed some of your variables: it's always a good idea to have meaningful variable names):
lines = file.readlines()

# formerly "new_list"
clean_lines = []
for line in lines:
    line = line.strip()
    clean_lines.append(line)

# formerly "new_dict"
seasons = {}
for i in range(0, len(clean_lines), 2):
    season_num = int(clean_lines[i])
    series = clean_lines[i+1]
    # there are only two options: either
    # the season_num is already in the dict...
    if season_num in seasons:
        # append to the existing entry
        seasons[season_num].append(series)
    # ...or it isn't
    else:
        # make a new entry with a list containing
        # the series
        seasons[season_num] = [series]
Here's how you can print the resulting dictionary with the tv shows separated by semicolon using join. Adapt to your needs:
for season_num, series in seasons.items():
    print(season_num, '; '.join(series))
Output:
20 Gunsmoke; Law & Order
30 The Simpsons
10 Will & Grace
14 Dallas
12 Murder, She Wrote
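From there, writing the sorted output_keys.txt file that the assignment asks for could look roughly like this (a sketch that assumes the seasons dict built above):

# Sketch (assumes the seasons dict from above): sort by key and write
# "<seasons>: <show1>; <show2>" lines to output_keys.txt.
with open('output_keys.txt', 'w') as f:
    for season_num in sorted(seasons):
        f.write('{}: {}\n'.format(season_num, '; '.join(seasons[season_num])))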
As I see it, you try to check whether the key already exists in the dictionary, but there is a mistake there: you should check the value (the season number), not the loop index. You also need to do that check before putting the entry into the dictionary, and if the key already exists, update the current value by appending ";" and the new title:

for i in range(0, len(new_list), 2):
    if not new_list[i] in new_dict.keys():
        new_dict[new_list[i]] = new_list[i+1]
    else:
        # update it here, like:
        new_dict[new_list[i]] = new_dict[new_list[i]] + ";" + new_list[i+1]
How do I read every line of a file in Python and check whether that line appears among the other lines of the same file?
I've created hashes of 2000 images and stored them in the same text file. So to find whether a duplicate image exists, I want to cross-check the hashes of all the generated images.
Below is the code with which I have extracted the data into a list:
with open('hash_info.txt') as f:
    content = f.readlines()
['fbefcfdf961f1919\n', 'aecc9e9696961f2f\n', 'cc1c9c9c1c1e272f\n', 'a4ce9e9e9793134b\n', 'e2e7e7e7e7e7e763\n', 'e64fcbcfcf8f0f27\n', '9c1c3c3c3e1e1e9c\n', 'c8cc9cb43e3c3b1b\n', 'cccd9e9e9e1e1f9f\n', 'ccce9e9e9ece0e4e\n', 'a6a7cbcfcf071736\n', 'f69c9c3c3636373b\n', 'ec9c9cbc3c26272b\n', 'f0cccc9c8c0e3f3b\n', '4c9c363e3e3e1e5d\n', '9c9cbc3e3c3c376f\n', 'f5ccce9e9e9e1f2c\n', 'cccc8c9ccc9ccdca\n', 'dc98ac2c363e5e5f\n', 'f2e7e7e7e7e76746\n', '9a9a1e3e3e3e373f\n', 'cc8c9e9e8ecece8f\n', 'db9f9f1e363e9e9e\n', 'e4cece8e9ececfcf\n', 'cecede9f9bce8f8f\n', 'b8ce4e4e9f1b1b29\n', 'ece6e6e7efcf0d05\n', 'cd8e9696b732163f\n', 'cece9e9ecececfcd\n', 'cc9d9f9f9f8dcdd9\n', '992d2c2c3c3ebe9e\n', 'e6e6cece8f2d2939\n', 'eccfcfcfcf4f6f7d\n', 'e6cecfcfcfefcec6\n', 'edf8e4cecece4e0e\n', 'e9d6e6e7e7a76667\n', 'edcecfcfcfcfcecf\n', 'a5a6c6ce8e0f43c7\n', '3a3e7c7c3d3e3f2f\n', 'cc9c963c361f173f\n', '8c9c9c9d9d9d1a9a\n', 'f0cc8e9e9e9f9d9e\n', '989c3c3c1c2e6e5b\n', 'f0989c1c9e1e1adb\n', 'f09c9c9c9c9e9e9f\n', 'e6ce4e1e86333309\n', 'a6cece9e8f0f0f2f\n', 'e8cccc9cccdc8d8c\n', 'f0ecced6969f0f2d\n', 'e0d89c3c3c3d3d1f\n', 'e6e7c7cfc7c64e4f\n', 'a6cf4b0f0e073739\n', 'cececececccf4b5b\n', 'a6c6cfcfcfc6c6c6\n', 'f0fcf3e3e3e3f303\n', 'f9f2e7e7cbcfcf97\n','fbefcfdf961f1919\n', 'f3e7e5e5e7e5c7c3\n', 'b3e7e7c7c7070f1e\n', 'cb9d97963e3f3325\n', '9b1e2c1c1e1e2e2b\n', '9d9e969f9f9f9f0f\n', 'e6a7a7e7e666666c\n', 'c64e9e9b0b072727\n','fbefcfdf961f1919\n', 'c7cfcfcfcfc7ce86\n', 'e6cecfcfcfc7c745\n', 'e6e6cecececfcfcf\n', 'cbcd9f9f9e1f3a7a\n', 'ccce9ecececec646\n', 'f1c7cfdf9f970325\n', '989d9c1c1e9e9f1f\n', '9c9e1c1e9e9d9c9a\n', '5f3d7656de5d3b1f\n', '5f3d76565e5d3b1f\n']
Below is the text file containing the same hashes as above:
33393cccde1b3f7b
71fb989ed79f3b79
78b0a3a34c7c3737
67781c5e9fcc1f4c
313c2ccf4b5f5f7f
ece8cc9c9696171f
f4ec8c9c9c9c1e1e
e8cc94b68c9c1ece
d89c36161c9c1e3f
ecccdacececf6d6d
a4cecbcacf87173d
f9f3e7ebcbc74707
d9e5c7cbd34b4f4d
e4ece6e3cbdb8f1d
ccde9a9ecccecfad
e6e6ced293d6cfc6
cc8c9c989ccc8e8b
f2ccc696cecfcfcf
cc8c9a9a9ececfcd
cc9c9c9cdc9c9ff3
How I solved it
import mmap

def check_dup(hash):
    f = open('hash_text_file.txt')
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # rstrip to remove \n; encode because mmap.find expects bytes on Python 3
    if s.find(hash.rstrip().encode()) != -1:
        print("Duplicate Image")
        return False
    else:
        return True
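For completeness, a hypothetical usage sketch (the file name and hash value below are just placeholders) might be:

# Hypothetical usage: append a new hash only if check_dup reports it as unique.
new_hash = 'cc9c9c9cdc9c9ff3'
if check_dup(new_hash):
    with open('hash_text_file.txt', 'a') as f:
        f.write(new_hash + '\n')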
# Opens the text document
file = open("Old.txt", "r")
# Reads the text document and splits it into a list with each line being an element
lines = file.read().split("\n")

new_lines = []
# Iterate over the list of lines
for line in lines:
    # If the line is not already in new_lines (i.e. the list that will contain unique lines), add it.
    # This makes sure that no line exists twice in the list.
    if line not in new_lines:
        new_lines.append(line)

# Open a new text file
file_new = open("New.txt", "w")
# Add each line of our new unique lines list to the text file
file_new.write("\n".join(new_lines))
file_new.close()
file.close()
I took some of your sample data, stripped the "\n" from it, converted it to a set, and tested items for in/not in the set like this:

data = ['fbefcfdf961f1919\n', 'aecc9e9696961f2f\n', 'cc1c9c9c1c1e272f\n',
        'a4ce9e9e9793134b\n', 'e2e7e7e7e7e7e763\n']

# create a set from your data, lookups are faster that way
cleaned = set(x.strip("\n") for x in data)

for testMe in ['not in it', 'fbefcfdf961f1919']:  # supply your list of "new" things
    if testMe in cleaned:
        print "Got a duplicate: " + testMe
    else:
        print "Unique: " + testMe
        # append to hash-file ("a" appends; "w+" would truncate the file)
        with open("hash_info.txt", "a") as f:   # if you have 1000 new hashes this
            f.write(testMe + "\n")              # reopens the file 1000 times (see below)
To compare huge new data to your existing data you should put the new data in a set as well:
newSet = set( ... your data here ... )
and use set operations to get all that are not yet in your cleaned set:
thingsToAddToFile = newSet - cleaned  # subtracts from newSet all known ones; only
                                      # new ones will be in thingsToAddToFile

# add them all to your existing ones by appending them:
with open("hash_info.txt", "a") as f:
    f.write("\n".join(thingsToAddToFile) + "\n")  # joins all in the set and appends '\n' at the end
See https://docs.python.org/2/library/sets.html:
x in s                                   test x for membership in s
x not in s                               test x for non-membership in s
s.issubset(t)              s <= t        test whether every element in s is in t
s.issuperset(t)            s >= t        test whether every element in t is in s
s.union(t)                 s | t         new set with elements from both s and t
s.intersection(t)          s & t         new set with elements common to s and t
s.difference(t)            s - t         new set with elements in s but not in t
s.symmetric_difference(t)  s ^ t         new set with elements in either s or t but not both
I have a leaderboard containing information on some scouts. The structure of this information is as follows: ID,forname,surname,points. This information is stored in a file, but isn't in order within the file. I don't want to change this.
I'd like it so that upon updating the listbox (when I call the calcPoints() function), it orders the scouts by the points field within their record, from largest to smallest. Below is the code for my calcPoints() method.
Thanks
def _calcPoints():
    mainWin._leaderboard.delete(0, END)
    with open(tempFileName, 'a') as ft:
        for sc in self._Scouts:
            sc._addPoints(-int(sc._getPoints()))
            with open(badgeFile, "r") as fb:
                lines = fb.readlines()
                for line in lines:
                    if sc._getID() == line.split(":")[0]:
                        badge = (line.split(':')[1]).split(',')[0]
                        if badge == "Core":
                            sc._addPoints(5)
                        elif badge == 'Activity':
                            sc._addPoints(1)
                        elif badge == 'Challenge':
                            sc._addPoints(3)
                        elif badge == 'Activity Pack':
                            sc._addPoints(5)
            ft.write(sc.getInfo() + "\n")
    os.remove(leadFile)
    with open(leadFile, "a") as f:
        with open(tempFileName, "r") as ft:
            lines = ft.readlines()
            for line in lines:
                f.write(line)
    os.remove(tempFileName)
    mainWin.addScoutsLeaderboard()
    return
Just call sorted, keying on the last element in each line (which is the points); reverse=True will sort from high to low:
lines = sorted(fb,key=lambda x: float(x.rsplit(":",1)[1]),reverse=True)
Not sure which file the data is in, so your file object and delimiter should match your actual file. Also, if you have a header you will need to add header = next(fb).
If you are using the values individually you might find the csv module a better fit:
import csv

with open(badgeFile, "r") as fb:
    r = csv.reader(fb, delimiter=":")
    lines = sorted(r, key=lambda x: float(x[-1]), reverse=True)
    for fid, fname, sname, pnts in lines:
On a side note, you only need to call .readlines() when you actually need a list; otherwise you can just iterate over the file object.
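Putting it together, a minimal sketch (assuming the leaderboard file leadFile is comma-delimited with ID,forname,surname,points as described, and that mainWin._leaderboard is the listbox) might look like:

import csv

# Sketch only: sort leaderboard rows by the points field (last column),
# high to low, and repopulate the listbox.
with open(leadFile) as f:
    rows = sorted(csv.reader(f), key=lambda row: float(row[-1]), reverse=True)

mainWin._leaderboard.delete(0, END)
for row in rows:
    mainWin._leaderboard.insert(END, ','.join(row))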
I am new to Python. I attempted to use the logic from answers by #mgilson, #endolith, and #zackbloom (Zack's example).
I am getting a bunch of blank columns placed in front of the first field of the primary record.
My out_file is empty (more than likely because the columns from the two files cannot match up).
How can I fix this?
The end result should look like the following:
('PUDO_id','Load_id','carrier_id','PUDO_from_company','PUDOItem_id';'PUDO_id';'PUDOItem_make')
('1','1','14','FMH MATERIAL HANDLING SOLUTIONS','1','1','CROWN','TR3520 / TWR3520','TUGGERS')
('2','2','7','WIESE USA','2','2','CAT','NDC100','3','2','CAT','NDC100','4','2',' 2 BATTERIES')
Note: In the output of the 3rd row, three rows from the sub file were appended to the array, while for the first 2 rows only 1 row from the sub file was appended. This is determined by whether the values in pri[0] and sub[1] compare equal.
Here is my code based on #Zack Bloom:
def build_set(filename):
    # A set stores a collection of unique items. Both adding items and searching for them
    # are quick, so it's perfect for this application.
    found = set()
    with open(filename) as f:
        for line in f:
            # Tuples, unlike lists, cannot be changed, which is a requirement for anything
            # being stored in a set.
            line = line.replace('"', '')
            line = line.replace("'", "")
            line = line.replace('\n', '')
            found.add(tuple(sorted(line.split(';'))))
    return found

set_primary_records = build_set('C:\\temp\\oz\\loads_pudo.csv')
set_sub_records = build_set('C:\\temp\\oz\\pudo_items.csv')

record = []
with open('C:\\temp\\oz\\loads_pudo_out.csv', 'w') as out_file:
    # Using with to open files ensures that they are properly closed, even if the code
    # raises an exception.
    for pri in set_primary_records:
        for sub in set_sub_records:
            #out_file.write(" ".join(res) + "\n")
            if sub[1] == pri[0]:
                record = pri.extend(sub)
                out_file.write(record + '\n')
Sample source data (primary records):
PUDO_id;"Load_id";"carrier_id";"PUDO_from_company"
1;"1";"14";"FMH MATERIAL HANDLING SOLUTIONS"
2;"2";"7";"WIESE USA"
Sample source data (sub records):
PUDOItem_id;"PUDO_id";"PUDOItem_make"
1;"1";"CROWN";"TR3520 / TWR3520";"TUGGERS"
2;"2";" CAT";"NDC100"
3;"2";"CAT";"NDC100"
4;"2";" 2 BATTERIES"
5;"11";"MIDLAND"
The extend attribute is not available for tuples, which is what build_set is creating. Tuples are immutable, but they can be concatenated or sliced with the normal Python sequence operations.
For example:
with open('C:\\temp\\oz\\loads_pudo_out.csv', 'w') as out_file:
    for pri in set_primary_records:
        for sub in set_sub_records:
            if sub[1] == pri[0]:
                record = pri + sub
                out_file.write(str(record)[1:-1] + '\n')
This is the same code as above, just modified to allow for tuple concatenation. In the write line we convert record to a string and strip the start and end brackets, before appending '\n'. Maybe there are better / prettier ways to do this, but I'm new to Python too.
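One arguably prettier option, sketched here assuming the same set_primary_records and set_sub_records are in scope, is to let the csv module handle the separators and quoting (newline='' is the Python 3 way to stop csv adding blank lines on Windows):

import csv

# Sketch only: csv.writer writes the joined tuple directly as a
# semicolon-delimited row, avoiding the str()/slice trick.
with open('C:\\temp\\oz\\loads_pudo_out.csv', 'w', newline='') as out_file:
    w = csv.writer(out_file, delimiter=';')
    for pri in set_primary_records:
        for sub in set_sub_records:
            if sub[1] == pri[0]:
                w.writerow(pri + sub)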
Edit:
To get the output you are expecting, a few changes are required:
# On this line, remove the sorted() call, as we do not wish to change the tuple item order:
found.add(tuple(line.split(';')))

...

with open('C:\\temp\\loads_out.csv', 'w') as out_file:
    for pri in set_primary_records:
        record = pri                       # record tuple is set in main loop
        for sub in set_sub_records:
            if sub[1] == pri[0]:
                record += sub              # for each match, sub is appended to record
        out_file.write(str(record) + '\n')   # removed stripping of brackets
I'm stuck in a script I have to write and can't find a way out...
I have two files with partly overlapping information. Based on the information in one file I have to extract info from the other and save it into multiple new files.
The first is simply a table with IDs and group information (which is used for the splitting).
The other contains the same IDs, but each twice with slightly different information.
What I'm doing:
I create a list of lists with ID and group information, like this:
table = [[ID, group], [ID, group], [ID, group], ...]
Then, because the second file is huge and not sorted in the same way as the first, I want to create a dictionary as index. In this index, I would like to save the ID and where it can be found inside the file so I can quickly jump there later. The problem there, of course, is that every ID appears twice. My simple solution (but I'm in doubt about this) is adding an -a or -b to the ID:
index = {"ID-a": [FPos, length], "ID-b": [FPOS, length], "ID-a": [FPos, length], ...}
The code for this:
for line in file:
    read = (line.split("\t"))[0]
    if not (read + "-a") in indices:
        index = read + "-a"
        length = len(line)
        indices[index] = [FPos, length]
    else:
        index = read + "-b"
        length = len(line)
        indices[index] = [FPos, length]
    FPos += length
What I am wondering now is if the next step is actually valid (I don't get errors, but I have some doubts about the output files).
for name in table:
    head = name[0]

    ## first round
    (FPos, length) = indices[head + "-a"]
    file.seek(FPos)
    line = file.read(length)
    line = line.rstrip()
    items = line.split("\t")
    output = ["#" + head + " " + "1:N:0:" + "\n" + items[9] + "\n" + "+" + "\n" + items[10] + "\n"]
    name.append(output)

    ## second round
    (FPos, length) = indices[head + "-b"]
    file.seek(FPos)
    line = file.read(length)
    line = line.rstrip()
    items = line.split("\t")
    output = ["#" + head + " " + "2:N:0:" + "\n" + items[9] + "\n" + "+" + "\n" + items[10] + "\n"]
    name.append(output)
Is it ok to use a for loop like that?
Is there a better, cleaner way to do this?
Use a defaultdict(list) to save all your file offsets by ID:
from collections import defaultdict

index = defaultdict(list)
for line in file:
    # ...code that loops through file finding ID lines...
    index[id_value].append((fileposn, length))
The defaultdict will take care of initializing to an empty list on the first occurrence of a given id_value, and then the (fileposn,length) tuple will be appended to it.
This will accumulate all references to each id into the index, whether there are 1, 2, or 20 references. Then you can just search through the given fileposn's for the related data.
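For example, the later lookup loop could then run against this index instead of the -a/-b keys. This is only a sketch, assuming the same table, file, and tab-separated field layout as above:

# Sketch: for each ID in table, walk all recorded (fileposn, length) pairs;
# the "1:N:0:" / "2:N:0:" tag follows the order in which records were indexed.
for name in table:
    head = name[0]
    for read_num, (FPos, length) in enumerate(index[head], start=1):
        file.seek(FPos)
        items = file.read(length).rstrip().split("\t")
        output = ["#" + head + " " + "%d:N:0:" % read_num + "\n"
                  + items[9] + "\n" + "+" + "\n" + items[10] + "\n"]
        name.append(output)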