Reading multiple text files and separating them into different arrays in Python

I have this code in MATLAB
txtFiles = dir('*.txt');  % load txt files
N = length(txtFiles);
for i = 1:N
    data = importdata(txtFiles(i).name);
    x = data(:,1);
    y(:,i) = data(:,2);
end
This takes all 100 of my txt files, creates an array for x, then stores the y data in a separate array where each column corresponds to a different txt file's values.
Is there a similar trick in Python?
This is how the data files are constructed:
896.5000000000 0.8694710776
896.7500000000 0.7608314184
897.0000000000 0.6349069122
897.2500000000 0.5092121001
897.5000000000 0.3955858698
There are 50 of them, and each one has about 1000 rows like this.
My solution so far jams it all into one massive list, which is impossible to handle. In MATLAB it adds the second column of each text file to an array, and I can easily cycle through them.
This is my solution:
#%%
import matplotlib.pyplot as plt
import os
import numpy as np
import glob

# This can be shortened further, but that would make the code less clear.
# Gets the file names and reads them.
data = [open(file).read() for file in glob.glob('*.txt')]
# Each block of data has its lines split and is put
# into a separate list:
# [['data1 - line1', 'data1 - line2'], ['data2 - line1', ...]]
data = [block.splitlines() for block in data]

x, y = [], []
# Takes each line within each file's data,
# splits it, and appends to x and y.
for file in glob.glob('*.txt'):
    # open the file
    with open(file) as _:
        # read and split into lines
        for line in _.read().splitlines():
            # split the columns on whitespace
            # example line: -2000 data1
            # output: ['-2000', 'data1']
            line = line.split()
            # append to the lists
            x.append(line[0])
            y.append(line[1])

Your files are pretty much csv files and could be read using np.loadtxt or pd.read_csv.
But, as you did, you can also extract the values from the text yourself; the following will work for any number of columns:
def extract_values(text, sep=" ", dtype=float):
    return (
        np.array(x, dtype=dtype)
        for x in zip(*(l.split(sep) for l in text.splitlines()))
    )
Then just concatenate the results in the shape you want:
import pathlib

dir_in = pathlib.Path("files/")
indexes, datas = zip(
    *(
        extract_values(f.read_text())
        for f in sorted(dir_in.glob("*.txt"))
    )
)
index = np.stack(indexes, axis=-1)
data = np.stack(datas, axis=-1)
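For comparison, here is a minimal sketch of the np.loadtxt route mentioned above. It assumes every file has exactly two whitespace-separated columns and that all files share the same x grid, as in the example data:
import glob
import numpy as np

files = sorted(glob.glob('*.txt'))
# unpack=True returns the columns instead of the rows
columns = [np.loadtxt(f, unpack=True) for f in files]
x = columns[0][0]                               # x grid, taken from the first file
y = np.stack([c[1] for c in columns], axis=-1)  # one column of y per file
Each column of y then corresponds to one txt file, just like in the MATLAB version.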

Related

How to get around a NumPy error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"

The code below is being used to analyze a csv file, and at the moment I'm trying to remove the columns of the array which are not in my check_list. It only checks the first row, and if the first row of a particular column doesn't belong to the check_list it removes the entire column. But this error keeps getting thrown and I'm not sure how to avoid it.
import numpy as np

def load_metrics(filename):
    """opens a csv file and returns stuff"""
    check_list = ["created_at","tweet_ID","valence_intensity","anger_intensity","fear_intensity","sadness_intensity","joy_intensity","sentiment_category","emotion_category"]
    file = open(filename)
    data = []
    for lin in file:
        lin = lin.strip()
        lin = lin.split(",")
        data.append(lin)
    for col in range(len(data[0])):
        if np.any(data[0][col] not in check_list) == True:
            data[0] = np.delete(np.array(data), col, 1)
            print(col)
    return np.array(data)
The test below is being run on the code too:
data = load_metrics("covid_sentiment_metrics.csv")
print(data[0])
Test results:
['created_at' 'tweet_ID' 'valence_intensity' 'anger_intensity'
'fear_intensity' 'sadness_intensity' 'joy_intensity' 'sentiment_category'
'emotion_category']
Change your load_metrics function to:
def load_metrics(filename):
    check_list = ["created_at", "tweet_ID", "valence_intensity", "anger_intensity",
                  "fear_intensity", "sadness_intensity", "joy_intensity",
                  "sentiment_category", "emotion_category"]
    data = []
    with open(filename, 'r') as file:
        for lin in file:
            lin = lin.strip()
            lin = lin.split(",")
            data.append(lin)
    arr = np.array(data)
    colFilter = []
    for col in arr[0]:
        colFilter.append(col in check_list)
    return arr[:, colFilter]
I introduced the following corrections:
Use with to automatically close the input file (your code fails to close it).
Create a "full" Numpy array (all columns) after the data has been read.
Compute the colFilter list, marking which columns are in check_list.
Return only the filtered columns.
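As a standalone illustration (not part of the original answer) of the boolean-mask indexing that arr[:, colFilter] relies on:
import numpy as np

arr = np.array([["a", "b", "c"],
                ["1", "2", "3"]])
keep = [True, False, True]  # one flag per column
print(arr[:, keep])
# [['a' 'c']
#  ['1' '3']]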
Read columns by checklist
This code does not include checks for a missing file or a broken data structure, so that the main idea stays clear. Here I assume that the csv file exists and has at least two lines:
import numpy as np

def load_metrics(filename, check_list):
    """open a csv file and return data as numpy.array
    with columns from a check list"""
    data = []
    with open(filename) as file:
        headers = file.readline().rstrip("\n").split(",")
        for line in file:
            data.append(line.rstrip("\n").split(","))
    col_to_remove = []
    for col in reversed(range(len(headers))):
        if headers[col] not in check_list:
            col_to_remove.append(col)
            headers.pop(col)
    data = np.delete(np.array(data), col_to_remove, 1)
    return data, headers
Quick testing:
test_data = """\
hello,some,other,world
1,2,3,4
5,6,7,8
"""
with open("test.csv", 'w') as f:
    f.write(test_data)
check_list = ["hello","world"]
d, h = load_metrics("test.csv", check_list)
print(d, h)
Expected output:
[['1' '4']
['5' '8']] ['hello', 'world']
Some details:
Instead of np.any(data[0][col] not in check_list) == True, plain data[0][col] not in check_list is enough.
Stripping with default parameters is risky, because it can delete meaningful spaces; strip only the newline.
Do not delete anything while looping forward over a sequence; it can be done (with some reservations) while looping backward.
check_list is better as a parameter.
Separate data and headers as they may have different types.
In your case it is better still to use pandas.read_csv with its usecols parameter.
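A minimal sketch of that pandas route (usecols accepts a callable; the filename is the one from the question):
import pandas as pd

check_list = ["created_at", "tweet_ID", "valence_intensity", "anger_intensity",
              "fear_intensity", "sadness_intensity", "joy_intensity",
              "sentiment_category", "emotion_category"]

# only columns whose name is in check_list are read at all
df = pd.read_csv("covid_sentiment_metrics.csv", usecols=lambda c: c in check_list)
data = df.to_numpy()  # or keep it as a DataFrame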

how to iterate over files in python and export several output files

I have a code and I want to put it in a for loop. I want to feed some data stored as files into my code and, based on each input, generate outputs automatically. At the moment, my code works for only one input file and consequently gives one output. My input file is named model000.msh, but in fact I have a series of these input files named model000.msh, model001.msh, and so on.
In the code I do some calculation on the imported file and finally compare it to a numpy array (my_data) that is generated from another numpy array (ID) having one column and thousands of rows. The ID array is the second variable I want to iterate over; ID makes my_data through an np.concatenate call. I want to use each column of ID to make my_data (my_data = np.concatenate((ID[:, iterator], gr), axis=1)).
So I want to iterate over several files, extract arrays from each file (extracted), then continue the loop by generating my_data from each column of ID, do the calculations on my_data and extracted, and finally export the result of each iteration with a dynamic naming scheme (changed_000, changed_001, and so on). This is my code for one single input and one single my_data array (made from an ID that has only one column), but I want to iterate over several input files and several my_data arrays and finally get several outputs:
from itertools import islice

with open('model000.msh') as lines:
    nodes = np.genfromtxt(islice(lines, 0, 1000))

with open('model000.msh', "r") as f:
    saved_lines = np.array([line.split() for line in f if len(line.split()) == 9])
saved_lines[saved_lines == ''] = 0.0
elem = saved_lines.astype(np.int)

# following lines extract some data from my file
extracted = np.c_[elem[:, :-4], nodes[elem[:, -4]-1, 1:], nodes[elem[:, -3]-1, 1:], nodes[elem[:, -2]-1, 1:], nodes[elem[:, -1]-1, 1:]]
…
extracted = np.concatenate((extracted, avs), axis=1)  # each input file ('model000.msh') will make this numpy array

# another data set, stored as a numpy array, is compared to the data extracted from the file
ID = np.array([[… ..., …, …]])  # now it has one column, but it should have several columns, and on each iteration one column will make a my_data array
my_data = np.concatenate((ID, gr), axis=1)  # I think it should be something like my_data = np.concatenate((ID[:, iterator], gr), axis=1)

from scipy.spatial import distance
distances = distance.cdist(extracted[:, 17:20], my_data[:, 1:4])
ind_min_dis = np.argmin(distances, axis=1).reshape(-1, 1)
z = np.array([])
for i in ind_min_dis:
    u = my_data[i, 0]
    z = np.array([np.append(z, u)]).reshape(-1, 1)
final_merged = np.concatenate((extracted, z), axis=1)
new_vol = final_merged[:, -1].reshape(-1, 1)
new_elements = np.concatenate((elements, new_vol), axis=1)
new_elements[:, [4, -1]] = new_elements[:, [-1, 4]]

# The next block is the output block
chunk_size = 3
buffer = ""
i = 0
relavent_line = 0
with open('changed_00', 'a') as fout:
    with open('model000.msh', 'r') as fin:
        for line in fin:
            if len(line.split()) == 9:
                aux_string = ' '.join([str(num) for num in new_elements[relavent_line]])
                buffer += '%s\n' % aux_string
                relavent_line += 1
            else:
                buffer += line
            i += 1
            if i == chunk_size:
                fout.write(buffer)
                i = 0
                buffer = ""
    if buffer:
        fout.write(buffer)
        i = 0
        buffer = ""
I appreciate any help in advance.
I'm not very sure about your question. But it seems like you are asking for something like:
for idx in range(10):
    with open('changed_{:0>2d}'.format(idx), 'a') as fout:
        with open('model0{:0>2d}.msh'.format(idx), 'r') as fin:
            # read something from fin...
            # calculate something...
            # write something to fout...
            ...
If so, you could search for str.format() for more details.
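Building on that, a hedged sketch of the full loop described in the question; glob finds the input files, ID is the array from the question, and process_one is a hypothetical stand-in for the existing calculation:
import glob
import numpy as np

def process_one(msh_path, id_column):
    # hypothetical wrapper around the existing code:
    # read the .msh file, build my_data from id_column,
    # and return the new_elements array
    ...

for iterator, msh_path in enumerate(sorted(glob.glob('model*.msh'))):
    new_elements = process_one(msh_path, ID[:, iterator])
    out_name = 'changed_{:03d}'.format(iterator)
    np.savetxt(out_name, new_elements)  # or reuse the buffered writer from the question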

Function to read specific lines from CSV file with pandas

I have a CSV file containing 200 lines. I want to create a function that reads every 50 lines together and then stores those 50 lines in a .txt file, repeating until the csv file ends. How can I do that, please? Any help is appreciated.
import pandas as pd
import csv
def my_function(n):
    dataset = pd.read_csv('e.csv', nrows=50)
    X = dataset.iloc[:, [0, 0]].values
Update:
def my_function(n):
    dataset = pd.read_csv('e.csv', nrows=n)
    X = dataset.iloc[:, [0, 0]].values
    with open('funct.txt', 'w') as file:
        for i in X:
            file.write("{}\n".format(i))
    return

row_count = len(open("e.csv").readlines())
print(row_count)
n = 50
my_function(n)
Now my problem: how can I read each block of 50 lines, one after another, until I reach the maximum length (200)?
You could test the remainder of the euclidean division of the index by 50 to check if your row number is a multiple of 50:
df = pd.read_csv('e.csv')
df = df[df.index % 50 == 0]
df.to_csv('newfile.txt')
This way you do not need to iterate over your dataframe.
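If the goal is instead to write each consecutive block of 50 rows to its own file, a possible sketch uses read_csv's chunksize parameter (the output file names here are only illustrative):
import pandas as pd

# chunksize makes read_csv yield DataFrames of 50 rows each
for i, chunk in enumerate(pd.read_csv('e.csv', chunksize=50)):
    chunk.to_csv('funct_{}.txt'.format(i), index=False)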

Reading Data into Lists

I'm trying to open a CSV file that contains 100 columns and 2 rows. I want to read the file and put the data in the first column into one list (my x_coordinates) and the data in the second column into another list (my y_coordinates)
X = []
Y = []
data = open("data.csv")
headers = data.readline()
readMyDocument = data.read()
for data in readMyDocument:
    X = readMyDocument[0]
    Y = readMyDocument[1]
print(X)
print(Y)
I'm looking to get two lists but instead the output is simply a list of 2's.
Any suggestions on how I can change it/where my logic is wrong.
You can do something like:
import csv

# No need to initialize your lists here
X = []
Y = []

with open('data.csv', 'r') as f:
    data = list(csv.reader(f))

X = data[0]
Y = data[1]

print(X)
print(Y)
See if that works.
You can use pandas:
import pandas as pd
XY = pd.read_csv(path_to_file)
X = XY.iloc[:,0]
Y = XY.iloc[:,1]
or you can
X = []
Y = []
with open(path_to_file) as f:
    for line in f:
        xy = line.strip().split(',')
        X.append(xy[0])
        Y.append(xy[1])
First things first: you are not closing your file.
A good practice is to use with when opening files, so the file is closed even if the code breaks.
Then, if you want just one column, you can split your lines on the column separator and use just the column you want.
But this is mostly for learning; in a real situation you would rather use a library like the built-in csv module or, even better, pandas.
X = []
Y = []
with open("data.csv") as data:
    lines = data.read().split('\n')
# headers is not being used in this snippet
headers = lines[0]
lines = lines[1:]
# changing variable name for better reading
for line in lines:
    # split on the column separator before indexing
    columns = line.split(',')
    X.append(columns[0])
    Y.append(columns[1])
print(X)
print(Y)
P.S.: I'm ignoring some variables that you used but did not declare in your code snippet; they could be a problem too.
Use numpy's genfromtxt; read the docs here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
Some assumptions:
The delimiter is ",".
You obviously don't want the headers in the lists, which is why the header row is skipped.
You can read the docs and use other keywords as well.
import numpy as np

X = list(np.genfromtxt('data.csv', delimiter=",", skip_header=1)[:, 0])
Y = list(np.genfromtxt('data.csv', delimiter=",", skip_header=1)[:, 1])
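Parsing the file twice can be avoided by reading it once and slicing afterwards (same assumptions as above):
import numpy as np

arr = np.genfromtxt('data.csv', delimiter=",", skip_header=1)
X = list(arr[:, 0])
Y = list(arr[:, 1])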

I want to read a column x in a csv file and populate other columns based on the content in the column x ? How do i do that in Python?

I have a csv file. A column x has string values. Based on the values in column x, I want to populate other columns in a different csv. How do I do that?
You might be able to do something like this if you pass the function a line number and a column number:
def readCSVfile(line_number, column):
    # line_number is 1-based; column is a 0-based index
    fp = open("file")
    res = None
    for i, line in enumerate(fp):
        if i == line_number - 1:
            res = line.split(',')
    fp.close()
    return res[column]
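A quick usage sketch (the file name "file" comes from the answer above; the 1-based line number and 0-based column index are illustrative):
value = readCSVfile(4, 2)  # third value on line 4, assuming comma-separated columns
print(value)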
My answer addresses the problem of processing a column of your data
and writing a NEW file to save the results of processing.
The following code has inline comments that, I hope, will clarify its innards.
# processing csv files is simple,
# but there are lots of details that can go wrong,
# so let's use a builtin module
import csv

# to abstract your (underspecified) problem, let's assume that
# we have defined what we want to do to our data in terms
# of a set of functions
from my_module import f0, f1, f2, ..., fn

# let's define a bunch of constants; in real life these should rather be
# command line arguments
input = './a/path/name.csv'
out = './anothe/r_path/name.csv'
index_x = 5

# slurp in the data
with open(input) as f:
    data = [row for row in csv.reader(f)]

# transpose the data; list(...) is necessary for python 3,
# where zip() returns a generator
data = list(zip(*data))

# extract the column of interest
x = data[index_x]

# the data processing is done with a double loop:
# the outer loop is on x values,
# the inner loop is on the processing units (aka the imported functions)
processed = [[f(item) for f in [f0, f1, f2, ..., fn]] for item in x]

# eventually, output the results of our computations to a different csv file
# using the writerows() method that nicely iterates over the rows of its
# argument on our behalf (newline='' is recommended by the csv module)
with open(out, 'w', newline='') as f:
    csv.writer(f).writerows(processed)
