Splitting a list into a file without duplicates - python

I have a large data file like this:
133621 652.4 496.7 1993.0 ...
END SAMPLES EVENTS RES 271.0 2215.0 ...
ESACC 935.6 270.6 2215.0 ...
115133 936.7 270.3 2216.0 ...
115137 936.4 270.4 2219.0 ...
115141 936.1 271.0 2220.0 ...
ESACC L 114837 115141 308 938.5 273.3 2200
115145 936.3 271.8 2220.0 ...
END 115146 SAMPLES EVENTS RES 44.11 44.09
SFIX L 133477
133477 650.8 500.0 2013.0 ...
133481 650.2 499.9 2012.0 ...
ESACC 650.0 500.0 2009.0 ...
I want to grab only the ESACC data into trials. When END appears, the preceding ESACC data is aggregated into a trial. Right now I can get the first chunk of ESACC data into a file, but because the loop restarts from the beginning of the data it keeps grabbing only the first chunk, so I have 80 trials with the exact same data.
for i in range(num_trials):
    with open(fid) as testFile:
        for tline in testFile:
            if 'END' in tline:
                fid_temp_start.close()
                fid_temp_end.close() # Close the files
                break
            elif 'ESACC' in tline:
                tline_snap = tline.split()
                sac_x_start = tline_snap[4]
                sac_y_start = tline_snap[5]
                sac_x_end = tline_snap[7]
                sac_y_end = tline_snap[8]
My question: how do I iterate to the next chunk of data without grabbing the previous chunks?

Try rewriting your code to something like this:
def data_parse(filepath): # Make it a function
    try:
        with open(filepath) as testFile:
            tline = '' # Initialize tline
            while True: # Switch to an infinite while loop (I'll explain why)
                while 'ESACC' not in tline: # Skip lines until one containing 'ESACC' is found
                    tline = next(testFile) # (since it seems like you're doing that anyway)
                tline_snap = tline.split()
                trial = [tline_snap[4], '', '', ''] # Initialize list and assign first value
                trial[1] = tline_snap[5]
                trial[2] = tline_snap[7]
                trial[3] = tline_snap[8]
                while 'END' not in tline: # Again, seems like you're skipping lines
                    tline = next(testFile) # so I'll do the same
                yield trial # Output list, save function state
    except StopIteration:
        fid_temp_start.close() # I don't know where these enter the picture
        fid_temp_end.close() # but you closed them so I will too
        testFile.close()

# Now, initialize a new list and call the function:
trials = list()
for trial in data_parse(fid):
    trials.append(trial) # Creates a list of lists
What this creates is a generator function. By using yield instead of return, the function returns a value AND saves its state. The next time you call the function (as you will do repeatedly in the for loop at the end), it picks up where it left off. It starts at the line after the most recently executed yield statement (which in this case restarts the while loop) and, importantly, it remembers the values of any variables (like the value of tline and the point it stopped at in the data file).
When you reach the end of the file (and have thus recorded all of your trials), the next execution of tline = next(testFile) raises a StopIteration error. The try/except structure catches that error and uses it to exit the while loop and close your files. This is why we use an infinite loop: we want to continue looping until that error forces us out.
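As a toy illustration of that save-and-resume behavior (separate from the parser above):

def counter():
    n = 0
    while True:
        n += 1
        yield n # hand back n, then pause here until the next request

gen = counter()
print(next(gen)) # 1
print(next(gen)) # 2 -- execution resumed after the yield, and n was remembered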
At the end of the whole thing, your data is stored in trials as a list of lists, where each item equals [sac_x_start, sac_y_start, sac_x_end, sac_y_end], as you defined them in your code, for one trial.
Note: it does seem to me like your code is skipping lines entirely when they don't contain ESACC or END. I've replicated that, but I'm not sure if that's what you want. If you want to get the lines in between, you can rewrite this fairly simply by adding to the 'END' loop as below:
while 'END' not in tline:
    tline = next(testFile)
    # (put assignment operations to be applied to each line here)
Of course, you'll have to adjust the variable you're using to store this data accordingly.
Edit: Oh dear lord, I just now noticed how old this question is.


Python string not clearing correctly in loop

I am writing a script where I need to go through a csv file and find the first time a specific user logged in, and the last time they logged out. I have loops set up that are working great, but when I clear the lists holding the time strings of their login/logout, I get an index out of range error. Can anyone spot anything incorrect with this?
# this gets the earliest login time for each agent (but it assumes all dates to be the same!)
with open(inputFile, 'r') as dailyAgentLog:
    csv_read = csv.DictReader(dailyAgentLog)
    firstLoginTime = []
    lastLogoutTime = []
    outputLine = []
    while x < len(agentName):
        for row in csv_read:
            if row["Agent"] == agentName[x]:
                firstLoginTime.append(datetime.strptime(row["Login Time"], '%I:%M:%S %p'))
                lastLogoutTime.append(datetime.strptime(row["Logout Time"], '%I:%M:%S %p'))
        firstLoginTime.sort()
        lastLogoutTime.sort()
        outputLine = [agentName[x], agentLogin[x], agentExtension[x], row["Login Date"], firstLoginTime[0], row["Logout Date"], lastLogoutTime[-1]]
        print(f'Agent {agentName[x]} first login was {firstLoginTime[0]} and last logout {lastLogoutTime[-1]}.')
        fileLines.append(outputLine)
        x += 1
        firstLoginTime.clear() # this should be emptying/clearing the list at the end of every iteration
        lastLogoutTime.clear()
The problem is that on the 2nd and following iterations, the for row in csv_read: loop doesn't execute, because there's nothing left to read. So you never fill in the firstLoginTime and lastLogoutTime lists on subsequent iterations, and indexing them fails.
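The same failure in miniature (a toy demonstration, not your code):

it = iter([1, 2, 3])
print([x for x in it]) # [1, 2, 3]
print([x for x in it]) # [] -- the iterator is already exhausted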
If the file isn't too large, you can read it into a list before iterating:
csv_read = list(csv.DictReader(dailyAgentLog))
If it's too big to hold in memory, put
dailyAgentLog.seek(0)
at the end of the loop body.
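If you take the seek route, it's safest to rebuild the reader as well, so the header row is consumed again rather than returned as a data row (a sketch):

dailyAgentLog.seek(0)
csv_read = csv.DictReader(dailyAgentLog)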
Also, instead of sorting the lists, you can use min() and max():
firstLogin = min(firstLoginTime)
lastLogout = max(lastLogoutTime)
And I suggest you use
for x in range(len(agentName)):
rather than while and increment.
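Putting those suggestions together, the loop might look something like this (a sketch; agentName and inputFile are the variables from your post, and the imports are assumed):

import csv
from datetime import datetime

with open(inputFile, 'r') as dailyAgentLog:
    csv_read = list(csv.DictReader(dailyAgentLog)) # read once, reuse on every pass

for x in range(len(agentName)):
    logins = [datetime.strptime(row["Login Time"], '%I:%M:%S %p')
              for row in csv_read if row["Agent"] == agentName[x]]
    logouts = [datetime.strptime(row["Logout Time"], '%I:%M:%S %p')
               for row in csv_read if row["Agent"] == agentName[x]]
    if logins:
        print(f'Agent {agentName[x]} first login was {min(logins)} and last logout {max(logouts)}.')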

Using multiple booleans in an if statement to decide which file to write to

I'm trying to catch 4570 close encounters between planets and output the data into certain files, depending on which two planets had the close encounter. I have 5 planets in total, and each planet has a close encounter ONLY with the planet(s) adjacent to it, leaving 4 possible encounter pairs.
data1 = open('data1.txt', 'a+')
data2 = open('data2.txt', 'a+')
data3 = open('data3.txt', 'a+')
data4 = open('data4.txt', 'a+')

for i in range(0, 100000): # range this big since close encounters don't happen every iteration
    def P_dist(p1, p2):
        # function calculating distances between planets
    init_SMA = [sim.particles[1].a, sim.particles[2].a, sim.particles[3].a, sim.particles[4].a, sim.particles[5].a]
    try:
        sim.integrate(10e+9*2*np.pi)
    except rebound.Encounter as error:
        print(error)
        for j in range(len(init_SMA)-1):
            distance = P_dist(j, j+1)
            if distance <= .01:
                count += 1
                if count > 4570:
                    break
                elif init_SMA[j] == init_SMA[0] and init_SMA[j+1] == init_SMA[1]:
                    # write stuff to data1
                elif init_SMA[j] == init_SMA[1] and init_SMA[j+1] == init_SMA[2]:
                    # write stuff to data2
                elif init_SMA[j] == init_SMA[2] and init_SMA[j+1] == init_SMA[3]:
                    # write stuff to data3
                elif init_SMA[j] == init_SMA[3] and init_SMA[j+1] == init_SMA[4]:
                    # write stuff to data4

# close files
Everyone, I apologize. I left out lots of the code that shows the creation of the planetary system. The main for loop is responsible for creating a planetary system, catching a close encounter, writing it to the files, and repeating until 4570 close encounters have occurred.
It isn't ideal to keep four different files open in a running script. What's more, you haven't opened those files using Python's convenient with context manager, which takes care of cleanly closing opened files among other things. You're also performing open operations every loop iteration - files usually should be opened and closed once as there is a lot of consequential I/O overhead.
As for a cleaner approach, I would conditionally accumulate items/lines in Python data storage objects, then just do a one-off open and write at the end of the script. That way, if something goes awry during the main logic, you don't have files that have been partially written to.
This would be something along the lines of:
create 4 empty lists
for loop
    logic to conditionally append lines to be written to the text files to those lists
with open('data1.txt', 'a+') as f:
    write contents of list1 to f
... copy paste for remaining 3
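In concrete terms, that might look like this (a sketch; the body of the loop is a placeholder for your own encounter logic):

pending = [[], [], [], []] # lines destined for data1.txt ... data4.txt

for i in range(100000):
    # ... integrate and detect a close encounter between planets j and j+1 ...
    # pending[j].append(line) # placeholder: record the output line for that pair
    pass

for j, lines in enumerate(pending):
    with open('data%d.txt' % (j + 1), 'a+') as f:
        f.writelines(lines)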
I'd probably put the four data files in a list, so you can just do:
filesArray = [data1, data2, data3, data4]

# inside your for loop:
if count > 4570:
    break
if distance <= 0.01:
    count += 1
    filesArray[j].write(data) # or whatever your data is
else:
    break
It would be even better to do
fileNamesArray = ["data1.txt", "data2.txt", "data3.txt", "data4.txt"]

# inside your for loop:
if count > 4570:
    break
if distance <= 0.01:
    count += 1
    with open(fileNamesArray[j], "a") as dataFile:
        dataFile.write(data) # or whatever your data is
This helps avoid data corruption in case your program crashes for some other reason.
It also avoids storing every result you get in a list in memory, which I'd guess could be expensive for complex simulations.
It does bind your performance to disk speed, though, so I guess it's a tradeoff.

Python - process a chunk of lines in a file

I have a file containing x number of values, each on its own line.
I need to be able to take n values from this file, put them into an array, pass that array into a new process, clear the array, and then take another n values from the file to give to the next process.
The problem I'm having is when x is a value like 12 and I'm trying to give chunks of, let's say, 10 values to each process.
The first process will get its 10 values no problem, but I'm having trouble giving the remaining 2 to the last process.
The problem would also arise if, let's say, you tell the program to give each process 10 values from the file, but the file only has 1, or even 9, values.
I need to know when I'm at the last set of values that is less than n.
I want to avoid taking every value in the file and storing it in an array all at once, since I could run into memory problems if there were millions of values in that file.
Here's an example of what I've tried to do:
chunk = 10
value_list = []
with open('file.txt', 'r') as f:
    for value in f:
        value_list.append(value)
        if len(value_list) >= chunk:
            print 'Got %d' % len(value_list)
            value_list = [] # Clear the list
            # Put array into new process
This will catch every 10 in this example, but it won't work if there happened to be fewer than 10 in the file to begin with.
What I typically do in this situation is just handle the last (short) array after the for loop. For example,
chunk = 10
value_list = []
with open('file.txt', 'r') as f:
    for value in f:
        if len(value_list) >= chunk:
            print 'Got %d' % len(value_list)
            value_list = [] # Clear the list
            # Put array into new process
        value_list.append(value)

# send leftovers to new process
if value_list:
    print 'Got %d' % len(value_list)
    # Put final array into new process
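For what it's worth, itertools.islice expresses the same chunking directly, and the short final chunk falls out naturally (a sketch, not the original poster's code):

from itertools import islice

chunk = 10
with open('file.txt', 'r') as f:
    while True:
        value_list = list(islice(f, chunk)) # up to `chunk` lines per pass
        if not value_list: # nothing left: the file is consumed
            break
        print('Got %d' % len(value_list))
        # Put array into new process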

appending array breaks program

I am writing a program to analyze some of our invoice data. Basically, I need to take an array containing each individual invoice we sent out over the past year and break it down into twelve arrays containing the invoices for each month, using the dateSeperate() function, so that monthly_transactions[0] returns January's transactions, monthly_transactions[1] returns February's, and so forth.
I've managed to get it working so that dateSeperate returns monthly_transactions[0] as the January transactions. However, once all of the January data is entered, I attempt to append to the monthly_transactions array in line 44, and this just causes the program to break and become unresponsive. The code still executes and doesn't return an error, but Python becomes unresponsive and I have to force quit out of it.
I've been writing to the global array monthly_transactions. dateSeperate runs fine as long as I don't include the last else statement. If I do, monthly_transactions[0] returns an array containing all of the January invoices. The issue arises in my last else statement, which, when added, causes Python to freeze.
Can anyone help me shed any light on this?
I have written a program that defines all of the arrays I'm going to be using (yes, I know global arrays aren't good; I'm a marketer trying to learn programming, so any input you could give me on how to improve this would be much appreciated).
import csv
line_items = []
monthly_transactions = []
accounts_seperated = []
Then I import all of my data and place it into the line_items array
def csv_dict_reader(file_obj):
    global board_info
    reader = csv.DictReader(file_obj, delimiter=',')
    for line in reader:
        item = []
        item.append(line["company id"])
        item.append(line["user id"])
        item.append(line["Amount"])
        item.append(line["Transaction Date"])
        item.append(line["FIrst Transaction"])
        line_items.append(item)

if __name__ == "__main__":
    with open("ChurnTest.csv") as f_obj:
        csv_dict_reader(f_obj)
# formats the transaction date data to make it more readable
def dateFormat():
    for i in range(len(line_items)):
        ddmmyyyy = (line_items[i][3])
        yyyymmdd = ddmmyyyy[6:] + "-" + ddmmyyyy[:2] + "-" + ddmmyyyy[3:5]
        line_items[i][3] = yyyymmdd

# Takes the line_items array and splits it into the new array monthly_transactions, where each value holds one month of data
def dateSeperate():
    for i in range(len(line_items)):
        # if there are no values in the monthly transactions, add the first line item
        if len(monthly_transactions) == 0:
            test = []
            test.append(line_items[i])
            monthly_transactions.append(test)
        # check to see if the line item's year & month match a value already in the monthly_transactions array.
        else:
            for j in range(len(monthly_transactions)):
                line_year = line_items[i][3][:2]
                line_month = line_items[i][3][3:5]
                array_year = monthly_transactions[j][0][3][:2]
                array_month = monthly_transactions[j][0][3][3:5]
                # print(line_year, array_year, line_month, array_month)
                # If it does, add that line item to that month
                if line_year == array_year and line_month == array_month:
                    monthly_transactions[j].append(line_items[i])
                # Otherwise, create a new sub array for that month
                else:
                    monthly_transactions.append(line_items[i])

dateFormat()
dateSeperate()
print(monthly_transactions)
I would really, really appreciate any thoughts or feedback you guys could give me on this code.
Based on the comments on the OP, your csv_dict_reader function seems to do exactly what you want it to do, at least inasmuch as it appends data from its argument csv file to the top-level variable line_items. You said yourself that if you print out line_items, it shows the data that you want.
"But appending doesn't work." I take it you mean that appending the line_items to monthly_transactions isn't being done. The reason for that is that you didn't tell the program to do it! The appending that you're talking about is done as part of your dateSeparate function, however you still need to call the function.
I'm not sure exactly how you want to use your dateFormat and dateSeparate functions, but in order to use them, you need to include them in the main function somehow as calls, i.e. dateFormat() and dateSeparate().
EDIT: You've created the potential for an endless loop in the last else: section, which extends monthly_transactions by 1 if the line/array year/month aren't equal. This is problematic because it's within the loop for j in range(len(monthly_transactions)):. This loop will never get to the end if the length of monthly_transactions is increased by 1 every time through.
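For completeness, one way to avoid appending to the list you're looping over is to group items in a dict keyed by month first and only build the list of lists at the end. A sketch (it assumes dates are yyyy-mm-dd strings once dateFormat() has run, which is an assumption on my part):

def dateSeperate():
    by_month = {}
    for item in line_items:
        key = item[3][:7] # 'yyyy-mm' prefix of the formatted date (assumed layout)
        by_month.setdefault(key, []).append(item)
    # one sub-array per month, in chronological order
    monthly_transactions.extend(by_month[k] for k in sorted(by_month))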

index a list in a Python for loop

I'm making a for loop within a for loop. I'm looping through a list and finding a specific string that contains a regular expression pattern. Once I find the line, I need to search for the next line matching a certain pattern. I need to store both lines so I can parse out the time from them. I've created a counter to keep track of the index number of the list as the outer for loop runs. Can I use a construction like this to find the second line I need?
index = 0
for lineString in summaryList:
    match10secExp = re.search('taking 10 sec. exposure', lineString)
    if match10secExp:
        startPlate = lineString
        for line in summaryList[index:index+10]:
            matchExposure = re.search('taking \d\d\d sec. exposure', lineString)
            if matchExposure:
                endPlate = line
                break
    index = index + 1
The code runs, but I'm not getting the result I'm looking for.
Thanks.
matchExposure = re.search('taking \d\d\d sec. exposure', lineString)
should probably be
matchExposure = re.search('taking \d\d\d sec. exposure', line)
Depending on your exact needs, you can just use an iterator on the list, or two of them as made by itertools.tee. I.e., if you want to search lines following the first pattern only for the second pattern, a single iterator will do:
theiter = iter(thelist)
for aline in theiter:
    if re.search(somestart, aline):
        for another in theiter:
            if re.search(someend, another):
                yield aline, another # or print, whatever
                break
This will not search lines from aline to the ending another for somestart, only for someend. If you need to search them for both purposes, i.e., leave theiter itself intact for the outer loop, that's where tee can help:
for aline in theiter:
    if re.search(somestart, aline):
        _, anotheriter = itertools.tee(theiter)
        for another in anotheriter:
            if re.search(someend, another):
                yield aline, another # or print, whatever
                break
This is an exception to the general rule about tee which the docs give:
Once tee() has made a split, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed.
because the advancing of theiter and that of anotheriter occur in disjoint parts of the code, and anotheriter is always rebuilt afresh when needed (so the advancement of theiter in the meantime is not relevant).
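To tie it back to the question, here is a self-contained sketch of the single-iterator version (the wrapper function and the print loop are mine; the patterns and summaryList come from the original code):

import re

def find_plates(summaryList):
    theiter = iter(summaryList)
    for lineString in theiter:
        if re.search(r'taking 10 sec. exposure', lineString):
            for line in theiter:
                if re.search(r'taking \d\d\d sec. exposure', line):
                    yield lineString, line # (startPlate, endPlate)
                    break

for startPlate, endPlate in find_plates(summaryList):
    print(startPlate, endPlate)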
