How can I rename duplicates in a list without resorting to a unique suffix like "_DUPLICATED_#NO"? The names have to be unique when finished, preferably with incrementing numbers denoting the number of duplicates.
from collections import defaultdict
l = ["hello1","hello2","hello3",
"hello","hello","hello"]
tally = defaultdict(lambda:-1)
for i in range(len(l)):
    e = l[i]
    tally[e] += 1
    if tally[e] > 0:
        e += str(tally[e])
    l[i] = e
print(l)
Results:
['hello1', 'hello2', 'hello3', 'hello', 'hello1', 'hello2']
As you can see, the names are not unique: the renamed duplicates collide with the existing "hello1" and "hello2".
This seems simple enough. You start with a list of filenames:
l = ["hello1","hello2","hello3",
"hello","hello","hello"]
Then you iterate through them to build the finished filenames, appending a trailing number and incrementing it by 1 until the name is no longer a duplicate.
result = {}
for fname in l:
    orig = fname
    i = 1
    while fname in result:
        fname = orig + str(i)
        i += 1
    result[fname] = orig
This should leave you with a dictionary like:
{"hello1": "hello1",
"hello2": "hello2",
"hello3": "hello3",
"hello": "hello",
"hello4": "hello",
"hello5": "hello"}
Of course if you don't care about mapping the originals to the duplicate names, you can drop that part.
result = set()
for fname in l:
    orig = fname
    i = 1
    while fname in result:
        fname = orig + str(i)
        i += 1
    result.add(fname)
If you want a list afterward, just convert it. Keep in mind that a set does not preserve the original order.
final = list(result)
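If the order of the input matters, one option is to keep a set for the membership test and a separate list for the output. The following is a minimal sketch of that idea; the function name uniquify is made up for illustration and is not part of the answer above.

```python
def uniquify(names):
    """Return the names in their original order, renaming duplicates.

    A duplicate gets the smallest counter (starting at 1) that yields
    a name not seen so far.
    """
    seen = set()      # fast membership test
    unique = []       # preserves input order
    for name in names:
        candidate = name
        counter = 1
        while candidate in seen:
            candidate = name + str(counter)
            counter += 1
        seen.add(candidate)
        unique.append(candidate)
    return unique


names = ["hello1", "hello2", "hello3", "hello", "hello", "hello"]
print(uniquify(names))
# ['hello1', 'hello2', 'hello3', 'hello', 'hello4', 'hello5']
```

Like the loops above, this restarts the counter at 1 for every duplicate, so it can become slow for very long runs of the same name.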
Note that if you're creating files, this is exactly what the tempfile module is designed to do.
import tempfile
l = ["hello1","hello2","hello3",
"hello","hello","hello"]
fs = [tempfile.NamedTemporaryFile(prefix=fname, delete=False, dir="/some/directory/") for fname in l]
This will not create nicely incrementing filenames, but they are guaranteed to be unique. Note that fs will be a list of (open) file objects rather than a list of names; each object's name attribute gives you the filename.
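To try this out without touching a real directory, a sketch along these lines may help. It assumes a throwaway directory from tempfile.TemporaryDirectory instead of "/some/directory/", and the underscore added to the prefix is only there to make the generated names easier to read.

```python
import os
import tempfile

names = ["hello", "hello", "hello"]

# TemporaryDirectory is used here only so the example cleans up after itself.
with tempfile.TemporaryDirectory() as workdir:
    paths = []
    for name in names:
        # delete=False keeps the file on disk after it is closed.
        with tempfile.NamedTemporaryFile(prefix=name + "_", dir=workdir, delete=False) as fh:
            paths.append(fh.name)

    # Every path is distinct even though all three prefixes are the same.
    all_unique = len(set(paths)) == len(paths)
    all_prefixed = all(os.path.basename(p).startswith("hello_") for p in paths)
    print(all_unique, all_prefixed)
```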
Hi, I need to retrieve a list of files in a folder, ordered by two substrings.
The file names are like this:
503-03-Mar-2022..csv
604-07-Apr-2022..csv
503-17-Mar-2022..csv
604-16-Mar-2022..csv
I need to retrieve the latest file for each of the first three numbers.
So I'd have something like:
503 - 17-Mar-2022
604 - 16-Mar-2022
I'm retrieving the information as such:
import os, pandas as pd, myLibrary
csvPath = r"C:\csvs"
l = os.listdir(csvPath)
str = l[0]
count = 0
for path in os.listdir(csvPath):
    if os.path.isfile(os.path.join(csvPath, path)):
        count += 1
courseList = [i.split('-', 1)[0] for i in l]
tempList = [i.split('-', 1)[1] for i in l]
dateList = [i.split('..', 1)[0] for i in tempList]
sizeCourseList = len(courseList)
bit503 = courseList.count("503")
bit604 = courseList.count("604")
bit606 = courseList.count("606")
bit607 = courseList.count("607")
df = pd.read_csv(csvPath+"\\"+l[0])
print(l)
If you could point me in the right direction I'd appreciate it.
from datetime import datetime
from itertools import groupby
files = [
'503-03-Mar-2022..csv',
'604-07-Apr-2022..csv',
'503-17-Mar-2022..csv',
'604-16-Mar-2022..csv'
]
def f(s):
    k1 = s[:3]  # three-digit prefix, e.g. '503'
    k2 = datetime.strptime(s[4:15], '%d-%b-%Y')  # date part, e.g. '03-Mar-2022'
    return (k1, k2)  # sort by prefix, then by the full date
files.sort(key=f)
print(files) # ['503-03-Mar-2022..csv', '503-17-Mar-2022..csv', '604-16-Mar-2022..csv', '604-07-Apr-2022..csv']
result = [list(g)[-1] for _, g in groupby(files, lambda s: s[:3])]
print(result) # ['503-17-Mar-2022..csv', '604-07-Apr-2022..csv']
We take the three-digit prefix and the parsed date from each name and use them as a tuple to sort the list.
We then group by the prefix and take the last element of each group, which is the one with the latest date. Sorting on the full datetime matters here: a key such as (k1, k2.day, k2.month) compares the day before the month and would pick 16-Mar-2022 for 604, although 07-Apr-2022 is later.
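If sorting the whole list is not needed, the same result can be collected in a single pass with a dictionary. This is only a sketch: the function name latest_per_course is made up, it assumes every name follows the NNN-DD-Mon-YYYY..csv pattern from the question, and %b relies on English month abbreviations in the current locale. It compares full dates, so 07-Apr-2022 counts as later than 16-Mar-2022.

```python
from datetime import datetime


def latest_per_course(filenames):
    """Map each numeric prefix to the filename carrying its latest date."""
    latest = {}  # prefix -> (date, filename)
    for name in filenames:
        course, _, rest = name.partition("-")
        date = datetime.strptime(rest.split("..", 1)[0], "%d-%b-%Y")
        if course not in latest or date > latest[course][0]:
            latest[course] = (date, name)
    return {course: name for course, (date, name) in latest.items()}


files = [
    "503-03-Mar-2022..csv",
    "604-07-Apr-2022..csv",
    "503-17-Mar-2022..csv",
    "604-16-Mar-2022..csv",
]
print(latest_per_course(files))
# {'503': '503-17-Mar-2022..csv', '604': '604-07-Apr-2022..csv'}
```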
I'm new to programming and Python, and I'm looking for a way to distinguish between two input formats in the same input text file. For example, let's say I have an input file like this, where values are comma-separated:
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
Where the format is N followed by N lines of Data1, and M followed by M lines of Data2. I tried opening the file, reading it line by line and storing it in a single list, but I'm not sure how to produce two lists for Data1 and Data2, such that I would get:
Data1 = ["Washington,A,10", "New York,B,20", "Seattle,C,30", "Boston,B,20", "Atlanta,D,50"]
Data2 = ["New York,5", "Boston,10"]
My initial idea was to iterate through the list until I found an integer i, remove the integer from the list and continue for the next i iterations all while storing the subsequent values in a separate list, until I found the next integer and then repeat. However, this would destroy my initial list. Is there a better way to separate the two data formats in different lists?
You could use itertools.islice and a list comprehension:
from itertools import islice
string = """
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
"""
result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
          for parts in [string.split("\n")]
          for idx, line in enumerate(parts)
          if line.isdigit()]
print(result)
This yields
[['Washington,A,10', 'New York,B,20', 'Seattle,C,30', 'Boston,B,20', 'Atlanta,D,50'], ['New York,5', 'Boston,10']]
For a file, you need to change it to:
with open("testfile.txt", "r") as f:
    result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
              for parts in [f.read().split("\n")]
              for idx, line in enumerate(parts)
              if line.isdigit()]
print(result)
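For readers who have not used islice before, here is a small sketch of what a single call does. The sample lines are made up for illustration; islice(iterable, start, stop) yields the items at positions start up to, but not including, stop.

```python
from itertools import islice

lines = ["2", "New York,5", "Boston,10", "1", "Seattle,30"]

# The count on line 0 says that two data lines follow it.
first_block = list(islice(lines, 1, 1 + int(lines[0])))
# The count on line 3 says that one data line follows it.
second_block = list(islice(lines, 4, 4 + int(lines[3])))

print(first_block)   # ['New York,5', 'Boston,10']
print(second_block)  # ['Seattle,30']
```

Note that the comprehension above treats every all-digit line as a count, so it would misread a data line that consists only of digits.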
You're definitely on the right track.
If you want to preserve the original list here, you don't actually have to remove integer i; you can just go on to the next item.
Code:
originalData = []
formattedData = []
with open("data.txt", "r") as f :
    f = list(f)
originalData = f
i = 0
while i < len(f): # Iterate through every line
    try:
        n = int(f[i]) # See if line can be cast to an integer
        originalData[i] = n # Change string to int in original
        formattedData.append([])
        for j in range(n):
            i += 1
            item = f[i].replace('\n', '')
            originalData[i] = item # Remove newline char in original
            formattedData[-1].append(item)
    except ValueError:
        print("File has incorrect format")
    i += 1
print(originalData)
print(formattedData)
The following code produces a list, results, which is equal to [Data1, Data2].
The code assumes that each count matches exactly the number of entries that follow it. That means it will not work for a file like this, where the count says 2 but three lines follow:
2
New York,5
Boston,10
Seattle,30
The code:
# get the data from the text file
with open('filename.txt', 'r') as file:
    lines = file.read().splitlines()
results = []
index = 0
while index < len(lines):
    # Find the start and end values.
    start = index + 1
    end = start + int(lines[index])
    # Everything from the start up to and excluding the end index gets added
    results.append(lines[start:end])
    # Update the index
    index = end
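If you would rather get a clear error than a silently wrong result when a count and its data disagree, a variant along these lines could be used. It is a sketch, not a drop-in replacement: the function name split_sections is made up, and it works on a list of already-read lines so it can be tried without a file.

```python
from itertools import islice


def split_sections(lines):
    """Split 'count, then that many rows' blocks into a list of lists."""
    it = iter(lines)
    sections = []
    for header in it:
        if not header.strip():
            continue  # ignore blank lines between blocks
        count = int(header)  # raises ValueError if the line is not a count
        section = list(islice(it, count))  # consume the next `count` lines
        if len(section) != count:
            raise ValueError("expected %d rows, found %d" % (count, len(section)))
        sections.append(section)
    return sections


sample = [
    "5", "Washington,A,10", "New York,B,20", "Seattle,C,30",
    "Boston,B,20", "Atlanta,D,50",
    "2", "New York,5", "Boston,10",
]
data1, data2 = split_sections(sample)
print(data1)
print(data2)
```

With the mismatched file shown earlier (a count of 2 followed by three rows), the extra row is read as the next header and int() raises ValueError.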
I've been playing around with a program that will take in information from two files and then write the information out to a single file in sorted order.
So what I did was store each line of the file as an element in a list. I created another function that splits each element into a 2D array where I can easily access the name variables. From there I want to create a nested for loop that, as it iterates, checks for the highest value in the array, removes the value from the list, and appends it to a new list until there's a sorted list.
I think I am about 90% of the way there, but I am having trouble wrapping my head around the logic of sorting algorithms. It seems like the problem just keeps getting more complex and I keep wanting to use pointers. If someone could help shine some light on the subject I would greatly appreciate it.
import os
from http.cookiejar import DAYS
from macpath import split
# This program reads a given input file and finds its longest line.
class Employee:
    def __init__(self, EmployeeID, name, wage, days):
        self.EmployeeID = EmployeeID
        self.name = name
        self.wage = wage
        self.days = days
def Extraction(file,file2):
    employList = []
    while True:
        line1 = file.readline().strip()
        line2 = file2.readline().strip()
        #print(type(line1))
        employList.append(line1)
        #print(line1)
        employList.append(line2)
        #print(line2)
        if line1 == '' or line2 == '':
            break
    return employList
def Sort(mylist):
splitlist = []
sortedlist = []
print(len(mylist))
for items in range(len(mylist)):
#print(mylist[items].split())
splitlist.append(mylist[items].split())
print(splitlist)
#print(splitlist[1][1])
#print(splitlist[1][2])
highest = "z"
print(highest)
sortingLength = len(splitlist)
for i in range(10):
for items in range(len(splitlist)-2):
if highest > splitlist[items][2]:
istrue = highest < splitlist[items][2]
highest = splitlist[items][1]
print(items)
print(istrue)
print('marker')
print(splitlist[items][2])
if items == (len(splitlist)-2):
print("End of list",splitlist[items][2])
print(highest)
print(splitlist.index(highest))
print(splitlist[len(splitlist)-1][2])
print(sortingLength)
fPath = 'C:/Temp'
fileName = 'payroll1.txt'
fullFileName = os.path.join(fPath,fileName)
fileName2 = 'payroll2.txt'
fullFileName2 = os.path.join(fPath,fileName2)
f = open(fullFileName,'r')
f2 = open(fullFileName2, 'r')
employeeList = Extraction(f,f2)#pulling out each line in the file and placing into a list
Sort(employeeList)
ReportName= "List of Employees:"
marker = '-'* len(ReportName)
print (ReportName + ' \n' + marker)
total = 0
f.close()
I am having trouble with the next step: once I have the highest value, appending it to sortedlist, removing it from splitlist, and re-running the loop.
Using the built-in sorted function is much easier, per Joran's suggestion. I've edited your reading function so that it builds two lists of tuples, each holding the line number, the length of the line, and a marker for the source file. sorted returns a new list ordered by the key (line length), in descending order (reverse=True).
import os
from operator import itemgetter
class Employee:
    def __init__(self, EmployeeID, name, wage, days):
        self.EmployeeID = EmployeeID
        self.name = name
        self.wage = wage
        self.days = days
def Extraction(file,file2):
    # One (line_number, line_length, source_marker) tuple per line of each file
    mylines = [(i, len(l.strip()), 'file1') for i,l in enumerate(file.readlines())]
    mylines2 = [(i, len(l.strip()), 'file2') for i,l in enumerate(file2.readlines())]
    employList = [*mylines, *mylines2]
    return employList
fPath = 'C:/Temp'
fileName = 'payroll1.txt'
fullFileName = os.path.join(fPath,fileName)
fileName2 = 'payroll2.txt'
fullFileName2 = os.path.join(fPath,fileName2)
f = open(fullFileName,'r')
f2 = open(fullFileName2, 'r')
employeeList = Extraction(f,f2)#pulling out each line in the file and placing the line_number and length into a list
f.close()
f2.close()
# Itemgetter will sort on the second element of the tuple, len(line)
# and reverse will put it in descending order
ReportName = sorted(employeeList, key=itemgetter(1), reverse=True)
EDIT: I've added markers to the tuples so that you can keep track of which lines came from which file. It might be a bit confusing without them.
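If the goal is to order the records by one of the name fields rather than by line length, sorted can take a key that picks that field out of each split line. The sketch below makes several assumptions that are not in the question: the function name merge_and_sort is made up, the sample lines are hypothetical, and the layout "ID first-name last-name wage days" is only a guess based on the Employee class and the indices used in the question.

```python
def merge_and_sort(lines1, lines2, key_index=2):
    """Combine lines from two sources and sort the split records by one field."""
    records = [line.split() for line in list(lines1) + list(lines2) if line.strip()]
    return sorted(records, key=lambda fields: fields[key_index])


# Hypothetical sample data; the real payroll file layout is not shown in the question.
payroll1 = ["101 Ann Zimmer 15.50 5", "102 Bob Adams 12.00 4"]
payroll2 = ["201 Cy Moore 20.00 3"]

for record in merge_and_sort(payroll1, payroll2):
    print(" ".join(record))
# 102 Bob Adams 12.00 4
# 201 Cy Moore 20.00 3
# 101 Ann Zimmer 15.50 5
```

Passing reverse=True to sorted gives the highest value first, which matches the "find the highest, move it to a new list" loop described in the question.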
I am trying to return a list of unit numbers from about 1000 csv file names. I can read them in then get python to remove all the junk from around them and replace the 5th character to format it how I need it done. I would like to return a list of all the unit numbers so like ['6726-0501', '6826-1144']. What I am currently getting is it printing out the unit number one by one and not saving them. I have looked through previous questions but can't seem to get the mode of creating a list then appending the unit numbers to the list and saving that list to a variable to work. Does anyone know a good method for simply modifying this to output a list and save the list for later use?
Thanks,
Robin
file_names = ['job_1106_unit_672600501_las_PN23074.LAS.csv', 'job_1108_unit_682601144_las_PN23072.LAS.csv']
def change(file_names):
    for comps in file_names:
        comps_of_comps = list(comps)
        unit_num = comps_of_comps[14:23] #[672600501]
        a = (unit_num[0:4]) #[6726]
        b = (unit_num[5:9]) #[0501]
        unit_num = a + list('-') + b #[6,7,2,6,-,0,5,0,1]
        unit_num = ''.join(unit_num) #6726-0501
        print(unit_num)
change(file_names)
You can initialize a new list, append each unit number to it, and return that list, like this:
file_names = ['job_1106_unit_672600501_las_PN23074.LAS.csv', 'job_1108_unit_682601144_las_PN23072.LAS.csv']
def change(file_names):
    result = []
    for comps in file_names:
        comps_of_comps = list(comps)
        unit_num = comps_of_comps[14:23] #[672600501]
        a = (unit_num[0:4]) #[6726]
        b = (unit_num[5:9]) #[0501]
        unit_num = a + list('-') + b #[6,7,2,6,-,0,5,0,1]
        unit_num = ''.join(unit_num) #6726-0501
        result.append(unit_num)
    return result
print(change(file_names))
OR
import re
def change(file_names):
    result = []
    for i in file_names:
        s = re.match('.*unit_(.*)_las.*', i).group(1)
        # Integer division (//) keeps the slice index an int on Python 3 as well
        result.append(s[:len(s)//2]+"-"+s[(len(s)//2)+1:])
    return result
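A slightly stricter variation on the regular-expression approach spells out the digit layout in the pattern itself. This is a sketch that assumes every unit number is exactly nine digits with the fifth digit dropped, as in the two sample names; the names UNIT_PATTERN and unit_numbers are made up for illustration.

```python
import re

# Four digits, one digit to drop, four digits, then the underscore before "las".
UNIT_PATTERN = re.compile(r"unit_(\d{4})\d(\d{4})_")


def unit_numbers(file_names):
    """Return 'NNNN-NNNN' unit numbers for every name that matches the pattern."""
    units = []
    for name in file_names:
        match = UNIT_PATTERN.search(name)
        if match is None:
            continue  # skip names without a nine-digit unit number
        units.append(match.group(1) + "-" + match.group(2))
    return units


file_names = ['job_1106_unit_672600501_las_PN23074.LAS.csv',
              'job_1108_unit_682601144_las_PN23072.LAS.csv']
units = unit_numbers(file_names)
print(units)  # ['6726-0501', '6826-1144']
```

Because the function returns the list, the result can be stored in a variable and reused later instead of only being printed.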
When I create a random List of numbers like so:
columns = 10
rows = 10
for x in range(rows):
    a_list = []
    for i in range(columns):
        a_list.append(str(random.randint(1000000,99999999)))
    values = ",".join(str(i) for i in a_list)
    print(values)
then all is well.
But when I attempt to send the output to a file, like so:
sys.stdout = open('random_num.csv', 'w')
for i in a_list:
    print(", ".join(map(str, a_list)))
only the last row is output, 10 times. How do I write the entire list to a .csv file?
In your first example, you're creating a new list for every row. (By the way, you don't need to convert the numbers to strings twice.)
In your second example, you print the last list you created previously, once per element. Move the output into the first loop:
import random
columns = 10
rows = 10
with open("random_num.csv", "w") as outfile:
    for x in range(rows):
        a_list = [random.randint(1000000,99999999) for i in range(columns)]
        values = ",".join(str(i) for i in a_list)
        print(values)
        outfile.write(values + "\n")
Tim's answer works well, but I think you are trying to print to the terminal and to the file in different places.
So, with minimal modifications to your code, you can collect the rows in a new variable, all_list:
import random
import sys
all_list = []
columns = 10
rows = 10
for x in range(rows):
    a_list = []
    for i in range(columns):
        a_list.append(str(random.randint(1000000,99999999)))
    values = ",".join(str(i) for i in a_list)
    print(values)
    all_list.append(a_list)
sys.stdout = open('random_num.csv', 'w')
for a_list in all_list:
    print(", ".join(map(str, a_list)))
The csv module takes care of most of the fiddly details of dealing with CSV files.
As you can see below, you don't need to worry about conversion to strings or adding line endings.
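import csv
import random
columns = 10
rows = 10
# Python 3: open in text mode with newline="" so the csv module controls line endings
# (on Python 2, open the file with "wb" instead)
with open("random_num.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for x in range(rows):
        a_list = [random.randint(1000000,99999999) for i in range(columns)]
        writer.writerow(a_list)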
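To check what the writer produces without creating a file, the rows can be built first and written in one call with writerows. The following is a sketch for Python 3: the helper name random_rows is made up, the seed is fixed only to make runs repeatable, and io.StringIO stands in for the output file.

```python
import csv
import io
import random


def random_rows(rows, columns, rng):
    """Build a rows x columns table of random integers."""
    return [[rng.randint(1000000, 99999999) for _ in range(columns)]
            for _ in range(rows)]


rng = random.Random(0)  # fixed seed, so repeated runs give the same table
table = random_rows(10, 10, rng)

buffer = io.StringIO(newline="")  # in-memory stand-in for the CSV file
csv.writer(buffer).writerows(table)  # one call writes every row
text = buffer.getvalue()

print(len(text.splitlines()))  # 10
```

Keeping the whole table in a list also means it can be printed to the terminal and written to a file separately, without regenerating the numbers.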