I have a text file with car prices and their serial numbers; there are 50 lines in this file. I would like to find the max car price and its serial number for every 10 lines.
priceandserial.txt
102030 4000.30
102040 5000.40
102080 5500.40
102130 4000.30
102140 5000.50
102180 6000.50
102230 2000.60
102240 4000.30
102280 6000.30
102330 9000.70
102340 1000.30
102380 3000.30
102430 4000.80
102440 5000.30
102480 7000.30
When I tried NumPy's max function, I got 102480 as the max value, because np.max scans the whole flattened array and every serial number is larger than any price.
x = np.loadtxt('carserial.txt', unpack=True)
print('Max:', np.max(x))
Desired result:
102330 9000.70
102480 7000.30
There are 50 lines in the file, so I should end up with a 5-line result holding the serial and max price of each block of 10 lines.
Respectfully, I think the first solution is over-engineered. You don't need numpy or math for this task, just a dictionary. As you loop through, you update the dictionary if the latest value is greater than the current value, and do nothing if it isn't. Every 10th item, you append the values from the dictionary to an output list and reset the buffer.
with open('filename.txt', 'r') as opened_file:
    data = opened_file.read()

rowsplitdata = data.split('\n')
colsplitdata = [u.split(' ') for u in rowsplitdata]
x = [[int(j[0]), float(j[1])] for j in colsplitdata]

output = []
buffer = {"max": 0, "index": 0}
count = 0
# this assumes x is a list of lists, not a numpy array
for u in x:
    count += 1
    if u[1] > buffer["max"]:
        buffer["max"] = u[1]
        buffer["index"] = u[0]
    if count == 10:
        output.append([buffer["index"], buffer["max"]])
        buffer = {"max": 0, "index": 0}
        count = 0

# append the remainder of the buffer in case you didn't get to ten in the final pass
# (guarded so a file whose length is a multiple of 10 doesn't gain a spurious [0, 0] entry)
if count:
    output.append([buffer["index"], buffer["max"]])

output
[[102330, 9000.7], [102480, 7000.3]]
You should iterate over it and extract the maximum of each chunk of 10 lines:

import numpy as np

# new empty list for collecting the results
max_list = []
# iterate through x in steps of 10; slicing past the end is safe,
# so the final chunk simply holds whatever elements remain
for i in range(0, len(x), 10):
    max_list.append(np.max(x[i:i + 10]))
This should do your job.
number_list = [[], []]
with open('filename.txt', 'r') as opened_file:
    for line in opened_file:
        if len(line.split()) == 0:
            continue
        else:
            a, b = line.split(" ")
            # convert the columns so max() compares numbers rather than strings
            number_list[0].append(int(a))
            number_list[1].append(float(b))

col1_max, col2_max = max(number_list[0]), max(number_list[1])
col1_max, col2_max
Just change the filename. col1_max and col2_max hold the respective column's max value. You can edit the code to accommodate more columns.
You can transpose your input first, then use np.split and for each submatrix you calculate its max.
x = np.genfromtxt('carserial.txt', unpack=True).T
print(x)

for submatrix in np.split(x, len(x) // 10):
    print(max(submatrix, key=lambda l: l[1]))
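One caveat: np.split with an integer section count requires len(x) to divide evenly, which holds for the 50-line file. If the line count were not a multiple of 10, splitting at explicit indices is a tolerant alternative; a minimal sketch, assuming the same x as above:

import numpy as np

# splitting at explicit indices lets the final chunk be shorter than 10
for submatrix in np.split(x, np.arange(10, len(x), 10)):
    print(max(submatrix, key=lambda l: l[1]))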
I'm a beginner programmer, and I'm trying to figure out how to create a 2d nested list (grid) from a particular text file. For example, the text file would look like this:
3
3
150
109
80
892
123
982
0
98
23
The first two lines in the text file would be used to create the grid, meaning that it is 3x3. The next 9 lines would be used to populate the grid, with the first 3 making up the first row, the next 3 making up the middle row, and the final 3 making up the last row. So the nested list would look like this:
[[150, 109, 80], [892, 123, 982], [0, 98, 23]]
How do I go about doing this? I was able to make a list of all of the contents, but I can't figure out how to use the first 2 lines to define the size of the inner lists within the outer list:
lineContent = []
innerList = ?
for lines in open('document.txt', 'r'):
    value = int(lines)
    lineContent.append(value)
From here, where do I go to turn it into a nested list using the given values on the first 2 lines?
Thanks in advance.
You can make this quite neat using list comprehension.
def txt_grid(your_txt):
    with open(your_txt, 'r') as f:
        # Find columns and rows
        columns = int(f.readline())
        rows = int(f.readline())
        # outer loop builds the rows, inner loop fills each row's columns
        your_list = [[f.readline().strip() for j in range(columns)] for i in range(rows)]
    return your_list

print(txt_grid('document.txt'))
strip() just clears the newline characters (\n) from each line before storing them in the list.
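If you want integers rather than strings, as in the grid shown in the question, int() can do the conversion directly, since it tolerates surrounding whitespace. A small variant (txt_grid_ints is a hypothetical name; same file layout assumed):

def txt_grid_ints(your_txt):
    # hypothetical variant of txt_grid: values stored as ints
    with open(your_txt, 'r') as f:
        columns = int(f.readline())
        rows = int(f.readline())
        # int() ignores the trailing newline, so no strip() is needed
        return [[int(f.readline()) for j in range(columns)] for i in range(rows)]

print(txt_grid_ints('document.txt'))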
Edit: a modified version with logic for the case where your txt file doesn't have enough rows for the defined dimensions.
def txt_grid(your_txt):
    with open(your_txt, 'r') as f:
        # Find columns and rows
        columns = int(f.readline())
        rows = int(f.readline())
        dimensions = columns * rows
        # Read the remaining lines once; iterating over f consumes it,
        # so later readline() calls would only return empty strings
        nonempty_lines = [line.strip() for line in f if line.strip("\n")]
        # Test to see if there are enough rows, creating grid if there are
        if len(nonempty_lines) < dimensions:
            # Either raise an error
            # raise ValueError("Insufficient non-empty rows in text file for given dimensions")
            # Or return something that's not a list
            your_list = None
        else:
            # Creating grid
            your_list = [nonempty_lines[i * columns:(i + 1) * columns] for i in range(rows)]
    return your_list

print(txt_grid('document.txt'))
def parse_txt(filepath):
    lineContent = []
    with open(filepath, 'r') as txt:  # The with statement closes the txt file after it's been used
        nrows = int(txt.readline())
        ncols = int(txt.readline())
        for i in range(nrows):  # For each row
            row = []
            for j in range(ncols):  # Grab each value in the row
                row.append(int(txt.readline()))
            lineContent.append(row)
    return lineContent

grid_2d = parse_txt('document.txt')
lineContent = []
innerList = []
for lines in open('testQuestion.txt', 'r'):
    value = int(lines)
    lineContent.append(value)

rowSz = lineContent[0]  # row size
colSz = lineContent[1]  # column size
# make lineContent just the matrix values (could also start currentLine at 2);
# index 0 is deleted twice because the list shifts down after the first delete
del lineContent[0], lineContent[0]
# ensure there are enough values to fill an array of rowSz * colSz elements
assert rowSz * colSz == len(lineContent), 'not enough values for array'

arr = []
currentLine = 0
for x in range(rowSz):
    arr.append([])
    for y in range(colSz):
        arr[x].append(lineContent[currentLine])
        currentLine += 1
print(arr)
I have two sets of data which I would like to multiply together element by element, storing the result of each pair in an array.
For now I have this:
import csv
from mpdaf.obj import Spectrum, WaveCoord
import matplotlib.pyplot as plt
import pandas as pd
from csv import reader
file_path = input("Enter full transmission curve path : ")
with open(file_path, 'r') as f:
    data = list(reader(f, delimiter=","))
wavelength = [i[0] for i in data]
percentage = [float(str(i[1]).replace(',','.')) for i in data]
spectrum = input("Full spectrum path : ")
spe = Spectrum(filename=spectrum, ext=0)
data_flux = spe.data
flux_array = []
for i in percentage:
    for j in data_flux:
        flux = i * j
        flux_array.append(flux)
print(flux_array)
Written this way, it takes the first i, multiplies it by every j, then moves on to the next i, and so on.
Instead, I would like to multiply the first i by the first j and store the value, then multiply the 2nd i by the 2nd j and store that value, etc.
It is as the error message says: your indices i and j are floats, not integers. When you write for i in percentage:, i takes on every value in the percentage list. Instead, you might want to iterate through a range. Here's an example to illustrate the difference:
percentage = [50.0, 60.0, 70.0]

for i in percentage:
    print(i)
# 50.0
# 60.0
# 70.0

for i in range(len(percentage)):
    print(i)
# 0
# 1
# 2
To iterate through a list of indices, you probably want to iterate through a range:
for i in range(len(percentage)):
    for j in range(len(data_flux)):
        flux = percentage[i] * data_flux[j]
        flux_array.append(flux)
This will iterate through the indices of each list, starting at 0 and ending at the maximum index.
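Note that these nested range loops still produce every i, j combination, just like the original code. For the element-wise pairing the question actually asks for, zip is the idiomatic tool; a minimal sketch, assuming percentage and data_flux (as defined in the question) have the same length:

flux_array = []
# zip walks both lists in lockstep: percentage[0] with data_flux[0], and so on
for p, d in zip(percentage, data_flux):
    flux_array.append(p * d)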
I have a numpy array consisting of about 1200 arrays containing 10 values each, i.e. np.shape is (1200, 10). Each element has a value between 0 and 5.7 million.
Next I have a .csv file with 3800 lines. Every line contains 2 values. The first value indicates a range the second value is an identifier. The first and last 5 rows of the .csv file:
509,47222
1425,47220
2404,47219
4033,47218
6897,47202
...,...
...,...
...,...
5793850,211
5794901,186
5795820,181
5796176,43
5796467,33
The first column goes up until it reaches 5.7 million. For each value in the numpy array I want to look up the identifier in the .csv file. Each row covers the range from the first column of the previous row up to the first column of that row, e.g. 2404 - 4033 maps to identifier 47218, so for the value 3333 the identifier is 47218.
Now I want to get the identifier for each value in the numpy array, then save each identifier along with the frequency with which it is found in the numpy array. That means that for each of the 12,000 values I loop over a csv file of 3,800 lines and increment a counter. This process takes about 30 seconds, which is way too long.
This is the code I am currently using:
numpy_file = np.fromfile(filename, dtype=np.int32)
# some code to format numpy_file correctly

with open('/identifer_file.csv') as read_file:
    csv_reader = csv.reader(read_file, delimiter=',')
    csv_reader = list(csv_reader)

identifier_dict = {}
for numpy_array in numpy_file:
    for numpy_value in numpy_array:
        # there are 12000 numpy_value in numpy_file
        for row in csv_reader:
            last_identifier = 0
            if numpy_value <= int(row[0]):
                last_identifier = int(row[1])
                # adding the frequency of the identifier in numpy_file to a dict
                if last_identifier in identifier_dict:
                    identifier_dict[last_identifier] += 1
                else:
                    identifier_dict[last_identifier] = 1
            else:
                continue
            break

for x, y in identifier_dict.items():
    if y > 40:
        print("identifier: {} amount of times found: {}".format(x, y))
What algorithm should I implement to speed up this process?
Edit
I have tried flattening the numpy array to a 1D array, so it has 12000 values. This had no real effect on the speed; the latest test took 33 seconds.
Setup:

import io
import csv
import collections
import numpy as np

np.random.seed(100)
numpy_file = np.random.randint(0, 5700000, (1200, 10))

# '''range, identifier'''
read_file = io.StringIO('''509,47222
1425,47220
2404,47219
4033,47218
6897,47202
5793850,211
5794901,186
5795820,181
5796176,43
5796467,33''')
csv_reader = csv.reader(read_file, delimiter=',')
csv_reader = list(csv_reader)
# your example code put in a function and adapted for the setup above
def original(numpy_file, csv_reader):
    identifier_dict = {}
    for numpy_array in numpy_file:
        for numpy_value in numpy_array:
            # there are 12000 numpy_value in numpy_file
            for row in csv_reader:
                last_identifier = 0
                if numpy_value <= int(row[0]):
                    last_identifier = int(row[1])
                    # adding the frequency of the identifier in numpy_file to a dict
                    if last_identifier in identifier_dict:
                        identifier_dict[last_identifier] += 1
                    else:
                        identifier_dict[last_identifier] = 1
                else:
                    continue
                break
    # for x, y in identifier_dict.items():
    #     if y > 40:
    #         print("identifier: {} amount of times found: {}".format(x, y))
    return identifier_dict
Three solutions, each vectorizing some of the operations. The first function consumes the least memory, the last consumes the most.
def first(numpy_file, r):
    '''compare each value in the array to the entire first column of the csv'''
    alternate = collections.defaultdict(int)
    for value in np.nditer(numpy_file):
        comparison = value < r[:, 0]
        identifier = r[:, 1][comparison.argmax()]
        alternate[identifier] += 1
    return alternate
def second(numpy_file, r):
    '''compare each row of the array to the first column of csv'''
    alternate = collections.defaultdict(int)
    for row in numpy_file:
        comparison = row[..., None] < r[:, 0]
        indices = comparison.argmax(-1)
        id_s = r[:, 1][indices]
        for thing in id_s:
            # adding the frequency of the identifier in numpy_file to a dict
            alternate[thing] += 1
    return alternate
def third(numpy_file, r):
    '''compare the whole array to the first column of csv'''
    comparison = numpy_file[..., None] < r[:, 0]
    indices = comparison.argmax(-1)
    id_s = r[:, 1][indices]
    other = collections.Counter(map(int, np.nditer(id_s)))
    return other
The functions require the csv file to be read into a numpy array:

# baseline from the example function, using the csv list from the setup
zero = original(numpy_file, csv_reader)

read_file.seek(0)  # io.StringIO object from setup
csv_reader = csv.reader(read_file, delimiter=',')
r = np.array([list(map(int, thing)) for thing in csv_reader])

one = first(numpy_file, r)
two = second(numpy_file, r)
three = third(numpy_file, r)

assert zero == one
assert zero == two
assert zero == three
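Since the first column of the csv is sorted ascending, a fourth option is to binary-search all values at once with np.searchsorted instead of building comparison matrices. A sketch under the same setup (like the functions above, it assumes every value falls below the largest range boundary):

def fourth(numpy_file, r):
    '''binary-search every value against the sorted breakpoints at once'''
    # side='right' returns the first index whose breakpoint is strictly
    # greater than the value, matching the `value < r[:, 0]` comparisons above
    indices = np.searchsorted(r[:, 0], numpy_file.ravel(), side='right')
    return collections.Counter(map(int, r[:, 1][indices]))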
I'm new to programming and Python, and I'm looking for a way to distinguish between two input formats in the same input text file. For example, let's say I have an input file like so, where values are comma-separated:
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
Where the format is N followed by N lines of Data1, and M followed by M lines of Data2. I tried opening the file, reading it line by line and storing it into one single list, but I'm not sure how to go about producing 2 lists for Data1 and Data2, such that I would get:
Data1 = ["Washington,A,10", "New York,B,20", "Seattle,C,30", "Boston,B,20", "Atlanta,D,50"]
Data2 = ["New York,5", "Boston,10"]
My initial idea was to iterate through the list until I found an integer i, remove the integer from the list, and continue for the next i iterations while storing the subsequent values in a separate list, until I found the next integer, and then repeat. However, this would destroy my initial list. Is there a better way to separate the two data formats into different lists?
You could use itertools.islice and a list comprehension:
from itertools import islice
string = """
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
"""
result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
          for parts in [string.split("\n")]
          for idx, line in enumerate(parts)
          if line.isdigit()]

print(result)
This yields
[['Washington,A,10', 'New York,B,20', 'Seattle,C,30', 'Boston,B,20', 'Atlanta,D,50'], ['New York,5', 'Boston,10']]
For a file, you need to change it to:
with open("testfile.txt", "r") as f:
result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
for parts in [f.read().split("\n")]
for idx, line in enumerate(parts)
if line.isdigit()]
print(result)
You're definitely on the right track.
If you want to preserve the original list here, you don't actually have to remove integer i; you can just go on to the next item.
Code:
originalData = []
formattedData = []

with open("data.txt", "r") as f:
    f = list(f)

originalData = f
i = 0
while i < len(f):  # Iterate through every line
    try:
        n = int(f[i])  # See if line can be cast to an integer
        originalData[i] = n  # Change string to int in original
        formattedData.append([])
        for j in range(n):
            i += 1
            item = f[i].replace('\n', '')
            originalData[i] = item  # Remove newline char in original
            formattedData[-1].append(item)
    except ValueError:
        print("File has incorrect format")
    i += 1

print(originalData)
print(formattedData)
The following code will produce a list results which is equal to [Data1, Data2].
The code assumes that the number of entries specified matches the number actually present. That means that for a file like this, it will not work.
2
New York,5
Boston,10
Seattle,30
The code:
# get the data from the text file
with open('filename.txt', 'r') as file:
    lines = file.read().splitlines()

results = []
index = 0
while index < len(lines):
    # Find the start and end values.
    start = index + 1
    end = start + int(lines[index])
    # Everything from the start up to and excluding the end index gets added
    results.append(lines[start:end])
    # Update the index
    index = end
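An equivalent formulation drives a single iterator with itertools.islice, which removes the index bookkeeping; a sketch under the same assumption that the count lines are accurate:

from itertools import islice

with open('filename.txt', 'r') as file:
    lines = iter(file.read().splitlines())

results = []
for count in lines:  # any line reached by the for loop here is a count line
    # islice pulls the next `count` lines off the same iterator
    results.append(list(islice(lines, int(count))))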
We basically have a large Excel file, and what I'm trying to do is create a list that has the maximum and minimum values of each column. There are 13 columns, which is why the while loop should stop once it hits 14.
The problem is that once the counter is increased, the while loop does not seem to go through the for loop again: it runs the for loop only once, yet it does keep looping in the sense that it increases the counter by 1 and stops at 14.
It should be noted that the rows in the input file are strings of numbers, which is why I convert them to tuples and then check whether the value in the given position is greater than column_max or smaller than column_min; if so, I reassign column_max or column_min. Once a column is done, column_max and column_min are appended to a list (l) and the counter (position) is increased to repeat for the next column. Any help will be appreciated.
input_file = open('names.csv', 'r')
l = []
column_max = 0
column_min = 0
counter = 0
while counter < 14:
    for row in input_file:
        row = row.strip()
        row = row.split(',')
        row = tuple(row)
        if float(row[counter]) > column_max:
            column_max = float(row[counter])
        elif float(row[counter]) < column_min:
            column_min = float(row[counter])
        else:
            column_min = column_min
            column_max = column_max
    l.append((column_max, column_min))
    counter = counter + 1
I think you want to switch the order of your for and while loops.
Note that there is a slightly better way to do this:
with open('yourfile') as infile:
    # read first row; set column min and max to the values in the first row
    data = [float(x) for x in infile.readline().split(',')]
    column_maxs = data[:]
    column_mins = data[:]
    # read subsequent rows getting new min/max
    for line in infile:
        data = [float(x) for x in line.split(',')]
        for i, d in enumerate(data):
            column_maxs[i] = max(d, column_maxs[i])
            column_mins[i] = min(d, column_mins[i])
If you have enough memory to hold the file in memory at once, this becomes even easier:
with open('yourfile') as infile:
    data = [list(map(float, line.split(','))) for line in infile]
    # in Python 3, zip() returns a one-shot iterator, so materialize
    # the transpose before traversing it twice
    data_transpose = list(zip(*data))
    col_mins = [min(x) for x in data_transpose]
    col_maxs = [max(x) for x in data_transpose]
Once you have iterated over a file object, it has been consumed, so iterating over it again won't produce anything.
>>> for row in input_file:
... print row
1,2,3,...
4,5,6,...
etc.
>>> for row in input_file:
... print row
>>> # Nothing gets printed, the file is consumed
That is the reason why your code is not working.
You then have three main approaches:
Read the file each time (inefficient in I/O operations);
Load it into a list (inefficient for large files, as it stores the whole file in memory);
Rework the logic to operate line by line (quite feasible and efficient, though not as brief in code as loading everything into a two-dimensional structure, transposing it, and using min and max; a sketch of that second approach follows).
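For contrast, the second approach fits in a few lines when the file is small enough to hold in memory (a sketch, assuming the same 13-column names.csv):

with open('names.csv') as input_file:
    data = [[float(x) for x in line.split(',')] for line in input_file]

# transpose rows into columns, then take per-column extremes
combined_max_and_min = [(max(col), min(col)) for col in zip(*data)]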
Here is my technique for the third approach:
maxima = [float('-inf')] * 13
minima = [float('inf')] * 13

with open('names.csv') as input_file:
    for row in input_file:
        # enumerate pairs each value with its column index
        for col, value in enumerate(row.split(',')):
            value = float(value)
            maxima[col] = max(maxima[col], value)
            minima[col] = min(minima[col], value)

# This gets the value you called ``l``
combined_max_and_min = list(zip(maxima, minima))
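A quick way to inspect the result, assuming the 13-column names.csv from the question:

# print the per-column extremes collected above
for col, (col_max, col_min) in enumerate(combined_max_and_min):
    print("column {}: max={}, min={}".format(col, col_max, col_min))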