I have a file containing x values, each on its own line.
I need to take n values from this file, put them into an array, pass that array to a new process, clear the array, and then take another n values from the file to give to the next process.
The problem I'm having is when x is a value like 12 and I'm trying to give, say, a chunk of 10 values to each process.
The first process will get its first 10 values no problem, but I'm having trouble giving the remaining 2 to the last process.
The problem would also arise if, say, you tell the program to give each process 10 values from the file, but the file only has 1, or even 9, values.
I need to know when I'm at the last set of values, which is smaller than n.
I want to avoid reading every value in the file and storing it in an array all at once, since I could run into memory problems if there were millions of values in that file.
Here's an example of what I've tried to do:
chunk = 10
value_list = []
with open('file.txt', 'r') as f:
    for value in f:
        value_list.append(value)
        if len(value_list) >= chunk:
            print('Got %d' % len(value_list))
            value_list = []  # Clear the list
            # Put array into new process
This will catch every 10 in this example, but it won't work if there happened to be fewer than 10 in the file to begin with.
What I typically do in this situation is just handle the last (short) array after the for loop. For example,
chunk = 10
value_list = []
with open('file.txt', 'r') as f:
    for value in f:
        if len(value_list) >= chunk:
            print('Got %d' % len(value_list))
            value_list = []  # Clear the list
            # Put array into new process
        value_list.append(value)

# Send the leftovers to a new process
if value_list:
    print('Got %d' % len(value_list))
    # Put final array into new process
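As a side note, the same chunking can also be written with itertools.islice, which reads at most chunk lines per call and naturally yields a short final chunk. This is a minimal sketch of that alternative, not part of the answer above:

from itertools import islice

chunk = 10
with open('file.txt', 'r') as f:
    while True:
        value_list = list(islice(f, chunk))  # reads up to `chunk` lines
        if not value_list:
            break  # end of file reached
        print('Got %d' % len(value_list))
        # Put array into new process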
Good morning,
Because of memory issues, I'm trying to create a mean function that goes line by line through my file and calculates the mean of each column.
My file has 5000 columns and 20000 rows.
However, when I print the output of this function, the last part of my list is filled with zeros (the value I put to initialize it).
I tried writing the results one by one, but it only goes up to 4482 instead of 5000.
Is there a way to make sure it goes all the way to the end?
Here is my code:
def mean_by_line(file, size):
    calc = []
    file = open(file, "r")
    line = file.readline()
    table_line = line.split(",")
    mean_vector = [0 for i in range(len(table_line) - 1)]
    # We initialize from the first line since we need it beforehand for the length
    for j in range(len(table_line) - 1):
        calc.append(float(table_line[j]))
    # We get the other values
    for i in range(1, size):
        line = file.readline()
        table_line = line.split(",")
        for j in range(len(table_line) - 1):
            calc[j] += float(table_line[j])
    # We calculate the average
    for j in range(len(table_line) - 1):
        mean_vector[j] = calc[j] / size
        print(j, mean_vector[j])
    file.close()
    return mean_vector
Thanks in advance
Let's assume that we have a file with comma-separated numbers. Also assume that the number count on each line is always the same (but unknown) and we want to find the arithmetic mean of each "column".
Then:
totals = None
line_count = 0

with open('atest.csv') as csv:
    for line in map(str.strip, csv):
        nums = line.split(',')
        if totals is None:
            totals = [0.0] * len(nums)
        for i, v in enumerate(map(float, nums)):
            totals[i] += v
        line_count += 1

for i in range(len(totals)):
    totals[i] /= line_count

print(totals)
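Incidentally, a quick sanity check for ragged rows (a minimal sketch, assuming the same comma-separated file) can flag a short line before it skews the averages:

with open('atest.csv') as f:
    for n, line in enumerate(f, 1):
        width = len(line.rstrip('\n').split(','))
        if n == 1:
            expected = width
        elif width != expected:
            print('row %d has %d columns, expected %d' % (n, width, expected))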
Thank you very much for all your answers. I checked the input file again and, strangely, the last row didn't have 5000 columns like the other ones, so this was an error made while measuring the data.
Thank you, and sorry for such a bad mistake on my part.
The code I am running so far is as follows:
import os
import math
import statistics

def main():
    infile = open('USPopulation.txt', 'r')
    values = infile.read()
    infile.close()
    index = 0
    while index < len(values):
        values(index) = int(values(index))
        index += 1
    print(values)

main()
The text file contains 41 rows of numbers each entered on a single line like so:
151868
153982
156393
158956
161884
165069
168088
etc.
My task is to create a program which shows the average change in population during the time period, the year with the greatest increase in population, and the year with the smallest increase in population (from the previous year) during the time period.
The code will print each of the text file's entries on a single line, but when I try to convert to int for use with the statistics package I get the following error:
values(index) = int(values(index))
SyntaxError: can't assign to function call
The values(index) = int(values(index)) line was taken from my reading as well as resources on Stack Overflow.
You can change values = infile.read() to values = list(infile.read()) and it will give you a list instead of a string.
One thing that tends to happen whenever reading a file like this is that at the end of every line there is an invisible '\n' marking a new line in the text file. So an easy way to split the input by lines and turn them into integers is, instead of using values = list(infile.read()), to use values = values.split('\n'), which splits based on lines, as long as values was previously declared.
The while loop you have can easily be replaced with a for loop, using len(values) as the end.
The values(index) = int(values(index)) part is close, but indexing uses square brackets: in a for loop you can use values[i] = int(values[i]) to turn each entry into an integer, and then values becomes a list of integers.
How I would personally set it up:
import os
import math
import statistics

def main():
    infile = open('USPopulation.txt', 'r')
    values = infile.read()
    infile.close()
    values = values.strip().split('\n')  # Splits based on lines (strip avoids an empty trailing entry)
    for i in range(0, len(values)):  # Loops over values and turns each entry into an integer
        values[i] = int(values[i])
    changes = []
    # Use a for loop to get the changes between each number.
    for i in range(0, len(values) - 1):  # The -1 avoids an indexing error from reading i+1 at len(values)
        changes.append(values[i + 1] - values[i])  # The difference between the current value and the next
    print('The max change :', max(changes), 'The minimal change :', min(changes))
    # Since there is a 'change' for each element of values (minus one), changes.index() maps back into values.
    print('A change of :', max(changes), 'Happened at', values[changes.index(max(changes))])  # changes.index(max(changes)) gets the position of the largest change and finds the population with the same index
    print('A change of :', min(changes), 'Happened at', values[changes.index(min(changes))])  # pretty much the same as above, just with the minimum
    # If you wanted to print the second number, you would do values[changes.index(min(changes)) + 1]

main()
If you need any clarification on anything I did in the code, just ask.
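One gap worth noting: the task also asks for the average change, and since statistics is already imported, it is a one-liner inside main() once changes has been built (my addition, not part of the code above):

print('Average change :', statistics.mean(changes))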
I personally would use numpy for reading a text file.
In your case I would do it like this:
import numpy as np

def main():
    infile = np.loadtxt('USPopulation.txt')
    maxpop = np.argmax(infile)  # index of the largest value
    minpop = np.argmin(infile)  # index of the smallest value
    print(f'maximum population = {infile[maxpop]} and minimum population = {infile[minpop]}')

main()
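Note that argmax/argmin give the positions of the largest and smallest raw populations, while the task is about year-to-year changes; with numpy those fall out of np.diff. A hedged sketch building on the same file:

import numpy as np

pop = np.loadtxt('USPopulation.txt')
changes = np.diff(pop)  # year-over-year differences, length len(pop) - 1
print('average change =', changes.mean())
print('greatest increase after year index', np.argmax(changes))
print('smallest increase after year index', np.argmin(changes))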
I have the following Python code where I collect data from standard input into a list and run syntaxnet on it. The data is in the form of json objects from which I will extract the text field and feed it to syntaxnet.
import sys

data = []
for line in sys.stdin:
    data.append(line)

run_syntaxnet(data)  ## This is a function ##
I am doing this because I do not want Syntaxnet to run for every single tweet since it will take a very long time and hence decrease performance.
Also, when I run this code on very large data, I do not want to keep collecting it forever and run out of memory. So I want to collect data in chunks, maybe 10000 tweets at a time, and run Syntaxnet on them. Can someone help me with how to do this?
Also, I want to understand the maximum length the list data can reach before I run out of memory.
EDIT:
I used the code:
import sys

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)  ## This is a function ##
        data = []
which runs perfectly fine if the number of rows in the input data is a multiple of 10000. I am not sure what to do with the remainder of the rows.
For example, if the total number of rows is 12000, the first 10000 rows get processed as I want, but the remaining 2000 are left out since the condition len(data) == 10000 is never met again.
I want to do something like:
if len(data) == 10000 or 'EOF of input file is reached':
    run_syntaxnet(data)
Can someone tell me how to check for the EOF of the input file? Thanks in advance!
PS: All the data to the Python file comes from Pig Streaming. Also, I cannot afford to actually count the number of rows in the input data and pass it as a parameter, since I have millions of rows and counting itself would take forever.
I think this is all you need:
import sys

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)  ## This is a function ##
        data = []
Once the list gets to 10000 entries, run the function and reset your data list. The maximum workable size of the list will vary from machine to machine, depending on how much memory you have, so it is probably best to try it out with different lengths and find out what is optimal.
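If you want a rough feel for how much memory a given chunk size costs, sys.getsizeof can approximate it (a crude sketch; the real footprint depends on the interpreter and the actual string contents):

import sys

data = ['some tweet text\n'] * 10000  # stand-in for 10000 buffered lines
approx = sys.getsizeof(data) + sum(sys.getsizeof(s) for s in data)
print('approx bytes for this chunk: %d' % approx)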
I would gather the data into chunks and process those chunks when they get "large":
import sys

LARGE_DATA = 10

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) > LARGE_DATA:
        run_syntaxnet(data)
        data = []

if data:  # process whatever is left over after the loop ends
    run_syntaxnet(data)
Large data file like this:
133621 652.4 496.7 1993.0 ...
END SAMPLES EVENTS RES 271.0 2215.0 ...
ESACC 935.6 270.6 2215.0 ...
115133 936.7 270.3 2216.0 ...
115137 936.4 270.4 2219.0 ...
115141 936.1 271.0 2220.0 ...
ESACC L 114837 115141 308 938.5 273.3 2200
115145 936.3 271.8 2220.0 ...
END 115146 SAMPLES EVENTS RES 44.11 44.09
SFIX L 133477
133477 650.8 500.0 2013.0 ...
133481 650.2 499.9 2012.0 ...
ESACC 650.0 500.0 2009.0 ...
I want to grab only the ESACC data into trials. When END appears, the preceding ESACC data should be aggregated into a trial. Right now I can get the first chunk of ESACC data into a file, but because the loop restarts from the beginning of the data, it keeps grabbing only the first chunk, so I have 80 trials with the exact same data.
for i in range(num_trials):
    with open(fid) as testFile:
        for tline in testFile:
            if 'END' in tline:
                fid_temp_start.close()
                fid_temp_end.close()  # Close the files
                break
            elif 'ESACC' in tline:
                tline_snap = tline.split()
                sac_x_start = tline_snap[4]
                sac_y_start = tline_snap[5]
                sac_x_end = tline_snap[7]
                sac_y_end = tline_snap[8]
My question: How to iterate to the next chunk of data without grabbing the previous chunks?
Try rewriting your code to something like this:
def data_parse(filepath):  # Make it a function
    try:
        with open(filepath) as testFile:
            tline = ''  # Initialize tline
            while True:  # Switch to an infinite while loop (I'll explain why)
                while 'ESACC' not in tline:  # Skip lines until one containing 'ESACC' is found
                    tline = next(testFile)  # (since it seems like you're doing that anyway)
                tline_snap = tline.split()
                trial = [tline_snap[4], '', '', '']  # Initialize list and assign first value
                trial[1] = tline_snap[5]
                trial[2] = tline_snap[7]
                trial[3] = tline_snap[8]
                while 'END' not in tline:  # Again, it seems like you're skipping lines,
                    tline = next(testFile)  # so I'll do the same
                yield trial  # Output list, save function state
    except StopIteration:
        fid_temp_start.close()  # I don't know where these enter the picture,
        fid_temp_end.close()    # but you closed them so I will too
        testFile.close()

# Now, initialize a new list and call the function:
trials = list()
for trial in data_parse(fid):
    trials.append(trial)  # Creates a list of lists
What this creates is a generator function. By using yield instead of return, the function returns a value AND saves its state. The next time you call the function (as you will do repeatedly in the for loop at the end), it picks up where it left off. It starts at the line after the most recently executed yield statement (which in this case restarts the while loop) and, importantly, it remembers the values of any variables (like the value of tline and the point it stopped at in the data file).
When you reach the end of the file (and have thus recorded all of your trials), the next execution of tline = next(testFile) raises a StopIteration error. The try - except structure catches that error and uses it to exit the while loop and close your files. This is why we use an infinite loop; we want to continue looping until that error forces us out.
At the end of the whole thing, your data is stored in trials as a list of lists, where each item equals [sac_x_start, sac_y_start, sac_x_end, sac_y_end], as you defined them in your code, for one trial.
Note: it does seem to me like your code is skipping lines entirely when they don't contain ESACC or END. I've replicated that, but I'm not sure if that's what you want. If you want to get the lines in between, you can rewrite this fairly simply by adding to the 'END' loop as below:
while 'END' not in tline:
    tline = next(testFile)
    # (put assignment operations to be applied to each line here)
Of course, you'll have to adjust the variable you're using to store this data accordingly.
Edit: Oh dear lord, I just now noticed how old this question is.
I've written an algorithm that scans through a file of IDs and compares each value with an integer i (I convert the integer to a string for comparison, and I trim the trailing "\n" from each line). The algorithm compares these values for each line in the file (each ID). If they are equal, it increases i by 1 and recurses with the new value of i. If they are not equal, it compares i with the next line in the file. It does this until it has a value of i that isn't in the file, then returns that value for use as the ID of the next record.
My issue: I have a file of IDs reading 1,3,2 because I removed the record with ID 2 and then created a new record. This shows the algorithm working correctly, as it gave the new record the ID 2 that was previously freed. However, when I then create another record, the next ID chosen is 3, so my ID list reads 1,3,2,3 instead of 1,3,2,4. Below is my algorithm, with the results of the print() command. I can see where it's going wrong but can't work out why. Any ideas?
Algorithm:
def _getAvailableID(iD):
    i = iD
    f = open(IDFileName, "r")
    lines = f.readlines()
    for line in lines:
        print("%s,%s,%s" % ("i=" + str(i), "ID=" + line[:-1], (str(i) == line[:-1])))
        if str(i) == line[:-1]:
            i += 1
            f.close()
            _getAvailableID(i)
    return str(i)
Output:
(The output for when the algorithm was run for finding an appropriate ID for the record that should have ID of 4):
i=1,ID=1,True
i=2,ID=1,False
i=2,ID=3,False
i=2,ID=2,True
i=3,ID=1,False
i=3,ID=3,True
i=4,ID=1,False
i=4,ID=3,False
i=4,ID=2,False
i=4,ID=2,False
i=2,ID=3,False
i=2,ID=2,True
i=3,ID=1,False
i=3,ID=3,True
i=4,ID=1,False
i=4,ID=3,False
i=4,ID=2,False
i=4,ID=2,False
I think your program is failing because you need to change:
_getAvailableID(i)
to
return _getAvailableID(i)
(At the moment the recursive function finds the correct answer which is discarded.)
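For clarity, here is the question's function with just that one change applied (debug print dropped for brevity; IDFileName is the question's own global):

def _getAvailableID(iD):
    i = iD
    f = open(IDFileName, "r")
    lines = f.readlines()
    for line in lines:
        if str(i) == line[:-1]:
            i += 1
            f.close()
            return _getAvailableID(i)  # propagate the recursive result
    return str(i)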
However, it would probably be better to simply put all the ids you have seen into a set to make the program more efficient.
e.g. in pseudocode:
S = set()
loop over all lines and S.add(int(line.rstrip()))
i = 0
while i in S:
    i += 1
return i
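Spelled out as runnable Python (a sketch reusing the question's IDFileName; I start i at 1, since the example IDs do):

def _getAvailableID(IDFileName):
    seen = set()
    with open(IDFileName) as f:
        for line in f:
            if line.strip():  # skip blank lines
                seen.add(int(line.rstrip()))
    i = 1  # IDs in the example start at 1
    while i in seen:
        i += 1
    return i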
In case you are simply looking for the max ID in the file and then want to return the next available value:
def _getAvailableID(IDFileName):
    iD = '0'
    with open(IDFileName, "r") as f:
        for line in f:
            print("ID=%s, line=%s" % (iD, line))
            if line > iD:  # note: string comparison; fine for single-digit IDs, not for e.g. '9' vs '10'
                iD = line
    return str(int(iD) + 1)

print(_getAvailableID("IDs.txt"))
with an input file containing
1
3
2
it outputs
ID=1, line=1
ID=1
, line=3
ID=3
, line=2
4
However, we can solve it in a more pythonic way:
def _getAvailableID(IDFileName):
    with open(IDFileName, "r") as f:
        mx_id = max(f, key=int)
    return int(mx_id) + 1
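Called against the same three-line IDs.txt as above, this returns 4:

print(_getAvailableID("IDs.txt"))  # -> 4

One caveat: max() raises ValueError on an empty file, so in practice you may want a default, e.g. mx_id = max(f, key=int, default='0').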