Using CSV arrays as an input to Python

I have been presented with a csv file that is full of 100+ arrays that I need to run through my data analysis code, but I am not sure how to read these arrays in Python. Each array is preceded by a line containing only an integer that gives the number of rows in the array, and ends with the line '1234567890', which is used as a separator.
Here is a snippet of the csv file:
7,,,,,,,
1,-199.117,-105.4,-4.525,227.5415,225.2925647,-0.0198891,-2.6547518
2,133.0423,55.4573,-48.4174,155.16,144.1380093,-0.322813,0.3949385
3,129.8405,-16.9527,-303.3192,331.0847,130.9425427,-1.5644458,-0.1298311
4,-73.6373,71.4677,151.517,183.9712,102.616198,1.1678785,2.3711453
5,41.2654,10.4196,30.3773,54.0915,42.5605604,0.6351541,0.2473322
6,-20.3159,-32.4484,62.4574,74.8581,38.2836056,1.2022641,-2.1301853
7,-13.2904,22.029,-28.2895,38.5096,25.7276422,-0.9386666,2.1136489
1234567890,,,,,,,
5,,,,,,,
1,-136.0755,-204.2787,-48.2127,259.2592,245.4512762,-0.1881526,-2.158425
2,220.5184,46.9388,-113.6448,265.1745,225.4586784,-0.4581388,0.2097266
3,-45.3132,169.6283,-49.2729,188.9506,175.576326,-0.2669358,1.8318334
4,-40.7141,34.7414,25.5414,60.9535,53.5219844,0.4465159,2.4351851
5,15.3863,-49.6703,17.1692,56.7635,51.9988166,0.312235,-1.2704018
1234567890,,,,,,,
6,,,,,,,
1,-19.3083,295.4128,191.8666,360.3712,296.0431267,0.5935079,1.6360639
2,-169.8708,-128.3904,-1.0052,215.4187,212.9323449,-0.0046663,-2.4943822
3,15.4505,-209.6656,-178.0715,279.4077,210.2341118,-0.7536439,-1.4972381
4,172.4142,13.0485,-63.7912,192.2842,172.9072576,-0.3447988,0.0755371
5,16.7456,24.8768,-46.5025,55.9188,29.9878358,-1.1933262,0.9783247
6,-8.911,4.1138,12.7751,17.7283,9.8147477,0.9089022,2.7090895
1234567890,,,,,,,
I am certain I could import the array if the csv were just one big array, but I am stumped when it comes to picking one array out of many. The data analysis needs to be run on the temporary array before it is replaced with the next array in the csv file.

You could use itertools.groupby to parse the rows into separate arrays:
import csv
import itertools

with open('errors', 'w') as err:
    pass  # create/truncate the errors file

with open('data', 'r') as f:
    for key, group in itertools.groupby(
            csv.reader(f),
            lambda row: row[0].startswith('1234567890')):
        if key:
            continue  # key is True means we've reached the end of an array
        group = list(group)            # group is an iterator; we turn it into a list
        array = group[1:]              # everything but the first row is data
        arr_length = int(group[0][0])  # first row contains the length
        if arr_length != len(array):   # sanity check
            with open('errors', 'a') as err:
                err.write('''\
Data file claims arr_length = {l}
{a}
{h}
'''.format(l=arr_length, a=str(list(array)), h='-' * 80))
        print(array)
itertools.groupby returns an iterator. It loops through the rows in csv.reader(f), and applies the lambda function to each row. The lambda function returns True when the row starts with '1234567890'. The return value (e.g. True or False) is called the key. The important point is that itertools.groupby collects together all contiguous rows that return the same key.
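To see the grouping behaviour in isolation, here is a minimal, hedged sketch using a few made-up rows (not the real file) that mirrors the key/group pairs the code above would see:
import itertools

rows = [['7'], ['1', '-199.1'], ['2', '133.0'],
        ['1234567890'], ['5'], ['1', '-136.0']]
for key, group in itertools.groupby(rows, lambda row: row[0].startswith('1234567890')):
    # key is False for data blocks, True for the separator rows
    print(key, list(group))
# False [['7'], ['1', '-199.1'], ['2', '133.0']]
# True  [['1234567890']]
# False [['5'], ['1', '-136.0']]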

This should give you a nicely formatted variable called "data" to work with.
import csv

rows = csv.reader(open('your_file.csv'))
data = []
temp = []
for row in rows:
    if '1234567890' in row:
        data.append(temp)
        temp = []
        continue
    else:
        temp.append(row)
if temp != []:
    data.append(temp)

Related

How to get around a NumPy error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"

The code below is being used to analyse a csv file; at the moment I'm trying to remove the columns of the array that are not in my check_list. It only checks the first row, and if the first row of a particular column doesn't belong to check_list it removes the entire column. But this error keeps getting thrown and I'm not sure how to avoid it.
import numpy as np

def load_metrics(filename):
    """opens a csv file and returns stuff"""
    check_list = ["created_at","tweet_ID","valence_intensity","anger_intensity","fear_intensity","sadness_intensity","joy_intensity","sentiment_category","emotion_category"]
    file = open(filename)
    data = []
    for lin in file:
        lin = lin.strip()
        lin = lin.split(",")
        data.append(lin)
    for col in range(len(data[0])):
        if np.any(data[0][col] not in check_list) == True:
            data[0] = np.delete(np.array(data), col, 1)
            print(col)
    return np.array(data)
The following test is run on the code as well:
data = load_metrics("covid_sentiment_metrics.csv")
print(data[0])
Test results:
['created_at' 'tweet_ID' 'valence_intensity' 'anger_intensity'
'fear_intensity' 'sadness_intensity' 'joy_intensity' 'sentiment_category'
'emotion_category']
Change your load_metrics function to:
def load_metrics(filename):
    check_list = ["created_at", "tweet_ID", "valence_intensity", "anger_intensity",
                  "fear_intensity", "sadness_intensity", "joy_intensity", "sentiment_category",
                  "emotion_category"]
    data = []
    with open(filename, 'r') as file:
        for lin in file:
            lin = lin.strip()
            lin = lin.split(",")
            data.append(lin)
    arr = np.array(data)
    colFilter = []
    for col in arr[0]:
        colFilter.append(col in check_list)
    return arr[:, colFilter]
I introduced the following corrections:
Use with to automatically close the input file (your code fails to close it).
Create a "full" Numpy array (all columns) after the data has been read.
Compute colFilter list - which columns are in check_list.
Return only the filtered columns (a short sketch of this boolean-mask indexing follows below).
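As a minimal, hedged illustration of that boolean-mask indexing (a tiny made-up array, not the actual tweet data):
import numpy as np

arr = np.array([["a", "b", "c"],
                ["1", "2", "3"]])
colFilter = [True, False, True]  # keep the first and third columns
print(arr[:, colFilter])
# [['a' 'c']
#  ['1' '3']]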
Read columns by checklist
This code does not include checks for a missing file or a broken data structure, so that the main idea stays clear. I assume here that the csv file exists and has at least two lines:
import numpy as np

def load_metrics(filename, check_list):
    """open a csv file and return data as numpy.array
    with columns from a check list"""
    data = []
    with open(filename) as file:
        headers = file.readline().rstrip("\n").split(",")
        for line in file:
            data.append(line.rstrip("\n").split(","))
    col_to_remove = []
    for col in reversed(range(len(headers))):
        if headers[col] not in check_list:
            col_to_remove.append(col)
            headers.pop(col)
    data = np.delete(np.array(data), col_to_remove, 1)
    return data, headers
Quick testing:
test_data = """\
hello,some,other,world
1,2,3,4
5,6,7,8
"""
with open("test.csv", 'w') as f:
    f.write(test_data)

check_list = ["hello", "world"]
d, h = load_metrics("test.csv", check_list)
print(d, h)
Expected output:
[['1' '4']
['5' '8']] ['hello', 'world']
Some details:
Instead of np.any(data[0][col] not in check_list) == True, a plain data[0][col] not in check_list is enough.
Stripping with the default parameters is risky, since it can remove meaningful spaces; strip only the trailing newline.
Do not delete elements while looping forward over them; it can be done (with some reservations) while looping backward.
check_list is better as a parameter.
Separate data and headers as they may have different types.
In your case it is better to use pandas.read_csv; a minimal sketch follows.
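A hedged sketch of the pandas approach, assuming the filename and column names from the question:
import pandas as pd

check_list = ["created_at", "tweet_ID", "valence_intensity", "anger_intensity",
              "fear_intensity", "sadness_intensity", "joy_intensity",
              "sentiment_category", "emotion_category"]
# usecols accepts a callable that is tested against each header name
df = pd.read_csv("covid_sentiment_metrics.csv", usecols=lambda c: c in check_list)
print(df.columns.tolist())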

how to iterate over files in python and export several output files

I have a code and I want to put it in a for loop. I want to input some data stored as files into my code and, based on each input, generate outputs automatically. At the moment, my code only works for one input file and consequently gives one output. My input file is named model000.msh, but in fact I have a series of these input files named model000.msh, model001.msh, and so on.

In the code I do some calculation on the imported file and finally compare it to a numpy array (my_data) that is generated by another numpy array (ID) having one column and thousands of rows. The ID array is the second variable I want to iterate over. ID makes my_data through an np.concatenate call, and I want to use each column of ID to make my_data (my_data=np.concatenate((ID[:,iterator], gr), axis =1)).

So, I want to iterate over several files, extract arrays from each file (extracted), then continue the loop by generating my_data from each column of ID, do calculations on my_data and extracted, and finally export the results of each iteration with a dynamic naming scheme (changed_000, changed_001 and so on). This is my code for one single input and one single my_data array (made by an ID that has only one column), but I want to iterate over several input files and several my_data arrays and finally produce several outputs:
from itertools import islice

with open('model000.msh') as lines:
    nodes = np.genfromtxt(islice(lines, 0, 1000))

with open('model000.msh', "r") as f:
    saved_lines = np.array([line.split() for line in f if len(line.split()) == 9])
saved_lines[saved_lines == ''] = 0.0
elem = saved_lines.astype(np.int)

# following lines extract some data from my file
extracted = np.c_[elem[:,:-4], nodes[elem[:,-4]-1, 1:], nodes[elem[:,-3]-1, 1:], nodes[elem[:,-2]-1, 1:], nodes[elem[:,-1]-1, 1:]]
…
extracted = np.concatenate((extracted, avs), axis=1)  # each input file ('model000.msh') will make this numpy array

# another data set, stored as a numpy array, is compared to the data extracted from the file
ID = np.array [[… ..., …, …]]  # now it has one column, but it should have several columns and in each iteration, one column will make a my_data array
my_data = np.concatenate((ID, gr), axis=1)  # I think it should be something like my_data=np.concatenate((ID[:,iterator], gr), axis =1)

from scipy.spatial import distance
distances = distance.cdist(extracted[:, 17:20], my_data[:, 1:4])
ind_min_dis = np.argmin(distances, axis=1).reshape(-1, 1)
z = np.array([])
for i in ind_min_dis:
    u = my_data[i, 0]
    z = np.array([np.append(z, u)]).reshape(-1, 1)
final_merged = np.concatenate((extracted, z), axis=1)
new_vol = final_merged[:, -1].reshape(-1, 1)
new_elements = np.concatenate((elements, new_vol), axis=1)
new_elements[:, [4, -1]] = new_elements[:, [-1, 4]]

# The next block is the output block
chunk_size = 3
buffer = ""
i = 0
relavent_line = 0
with open('changed_00', 'a') as fout:
    with open('model000.msh', 'r') as fin:
        for line in fin:
            if len(line.split()) == 9:
                aux_string = ' '.join([str(num) for num in new_elements[relavent_line]])
                buffer += '%s\n' % aux_string
                relavent_line += 1
            else:
                buffer += line
            i += 1
            if i == chunk_size:
                fout.write(buffer)
                i = 0
                buffer = ""
        if buffer:
            fout.write(buffer)
            i = 0
            buffer = ""
I appreciate any help in advance.
I'm not very sure about your question. But it seems like you are asking for something like:
for idx in range(10):
    with open('changed_{:0>2d}'.format(idx), 'a') as fout:
        with open('model0{:0>2d}.msh'.format(idx), 'r') as fin:
            # read something from fin...
            # calculate something...
            # write something to fout...
            pass
If so, you could search for str.format() for more details.
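Since the question's files use three digits (model000.msh, changed_000), here is a small hedged sketch of the same idea with a three-digit field width:
for idx in range(3):
    # '{:0>3d}' left-pads the number with zeros to width 3, e.g. 0 -> '000'
    print('model{:0>3d}.msh'.format(idx), '->', 'changed_{:0>3d}'.format(idx))
# model000.msh -> changed_000
# model001.msh -> changed_001
# model002.msh -> changed_002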

I want to read a column x in a csv file and populate other columns based on the content of column x. How do I do that in Python?

I have a csv file. A column x has string values. Based on the values in column x, I want to populate other columns in a different csv. How do I do that?
You might be able to do something like this if you pass the function a line number and a column number:
def readCSVfile(line, column):
    fp = open("file")
    for i, row in enumerate(fp):
        if i == line - 1:
            res = row.split(',')
    fp.close()
    return res[column]
My answer addresses the problem of processing a column of your data
and writing a NEW file to save the results of processing.
The following code has inline comments that, I hope, will clarify its innards.
# processing csv files is simple
# but there are lots of details that can go wrong,
# let's use a builtin module
import csv
# to abstract your (underspecified) problem, let's assume that
# we have defined what we want to do to our data in terms
# of a set of functions
from my_module import f0, f1, f2, ..., fn
# let's define a bunch of constants, in real life these should rather be
# command line arguments
input = './a/path/name.csv'
out = './another_path/name.csv'
index_x = 5

# slurp in the data
with open(input) as f:
    data = [row for row in csv.reader(f)]

# transpose the data — list(...) is necessary for python 3
# where zip() returns a generator
data = list(zip(*data))

# extract the data
x = data[index_x]

# the data processing is done with a double loop,
# the outer loop on x values,
# the inner loop on the processing units (aka the imported functions)
processed = [[f(item) for f in [f0, f1, f2, ..., fn]] for item in x]

# eventually, output the results of our computations to a different csv file
# using the writerows() method that nicely iterates over the rows of its
# argument on our behalf
with open(out, 'w') as f:
    csv.writer(f).writerows(processed)

Iterate through a for loop using multiple cores in Python

I have the following code that is currently running like normal Python code:
def remove_missing_rows(app_list):
    print("########### Missing row removal ###########")
    missing_rows = []
    ''' Remove any row that has missing data in the name, id, or description column'''
    for row in app_list:
        if not row[1]:
            missing_rows.append(row)
            continue  # Continue loop to next row. No need to check more columns
        if not row[5]:
            missing_rows.append(row)
            continue  # Continue loop to next row. No need to check more columns
        if not row[4]:
            missing_rows.append(row)
    print("Number of missing entries: " + str(len(missing_rows)))  # 967 with current method

    # Remove the missing_rows from the original data
    app_list = [row for row in app_list if row not in missing_rows]
    return app_list
Now, after writing this for a smaller sample I wish to run this on a very large data set. To do this I thought it would be useful to utilise the multiple cores of my computer.
I'm struggling to implement this using the multiprocessing module, though. The idea is that Core 1 could work through the first half of the data set while Core 2 works through the second half, and so on, all in parallel. Is this possible?
This is probably not CPU bound. Try the code below.
I've used a set for a very fast (hash-based) membership test (you need it when you evaluate if row not in missing_rows, which is very slow for a long list).
If this is the csv module, you're already holding tuples, which are hashable, so not many changes are needed:
def remove_missing_rows(app_list):
    print("########### Missing row removal ###########")
    filterfunc = lambda row: not all([row[1], row[4], row[5]])
    missing_rows = set(filter(filterfunc, app_list))
    print("Number of missing entries: " + str(len(missing_rows)))  # 967 with current method

    # Remove the missing_rows from the original data
    # note: should be a lot faster with a set
    app_list = [row for row in app_list if row not in missing_rows]
    return app_list
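As a rough, hedged illustration of why the set membership check pays off (hypothetical row counts, not the asker's data):
import timeit

rows = [(str(i), 'x') for i in range(10000)]
missing_list = rows[:1000]
missing_set = set(missing_list)
probe = rows[-1]
print(timeit.timeit(lambda: probe in missing_list, number=10000))  # linear scan each time
print(timeit.timeit(lambda: probe in missing_set, number=10000))   # constant-time hash lookup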
You can use filter to avoid iterating twice:
def remove_missing_rows(app_list):
    filter_func = lambda row: all((row[1], row[4], row[5]))
    return list(filter(filter_func, app_list))
But if you are doing data analysis, you probably should have a look into pandas.
There you could do something like this:
import pandas as pd

df = pd.read_csv('your/csv/data/file', usecols=(1, 4, 5))
df = df.dropna()  # remove missing values

Need more efficient way to parse out csv file in Python

Here's a sample csv file
id, serial_no
2, 500
2, 501
2, 502
3, 600
3, 601
This is the output I'm looking for (a list of serial_no values within a list of ids):
[2, [500,501,502]]
[3, [600, 601]]
I have implemented my solution but it's too much code and I'm sure there are better solutions out there. Still learning Python and I don't know all the tricks yet.
file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()
for row in data:
    each_row = []
    each_row.append(row[0])
    each_row.append(row[1])
    zipped_data.append(each_row)

for rec in zipped_data:
    if rec[0] not in ids:
        ids.append(rec[0])

for id in ids:
    for rec in zipped_data:
        if rec[0] == id:
            ser_no.append(rec[1])
    tmp.append(id)
    tmp.append(ser_no)
    print tmp
    tmp = []
    ser_no = []
**I've omitted var initializing for simplicity of code
print tmp
Gives me the output I mentioned above. I know there's a better, more Pythonic way to do this. It's just too messy! Any suggestions would be great!
from collections import defaultdict

records = defaultdict(list)
file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()
for row in data:
    records[row[0]].append(row[1])

# sorting by ids since keys don't maintain order
results = sorted(records.items(), key=lambda x: x[0])
print results
If the lists of serial_nos need to be unique, just replace defaultdict(list) with defaultdict(set) and records[row[0]].append(row[1]) with records[row[0]].add(row[1]).
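A quick hedged sketch of that set variant, using a few inlined sample rows instead of the file:
from collections import defaultdict

records = defaultdict(set)
for row in [['2', '500'], ['2', '500'], ['3', '600']]:  # duplicate serial on purpose
    records[row[0]].add(row[1])
print(sorted(records.items()))
# [('2', {'500'}), ('3', {'600'})]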
Instead of a list, I'd make it a collections.defaultdict(list), and then just call the append() method on the value.
import collections

result = collections.defaultdict(list)
for row in data:
    result[row[0]].append(row[1])
Here's a version I wrote, looks like there are plenty of answers for this one already though.
You might like using csv.DictReader, gives you easy access to each column by field name (from the header / first line).
#!/usr/bin/python
import csv

myFile = open('sample.csv', 'rb')
csvFile = csv.DictReader(myFile)
# first row will be used for field names (by default)
myData = {}
for myRow in csvFile:
    myId = myRow['id']
    if not myData.has_key(myId): myData[myId] = []
    myData[myId].append(myRow['serial_no'])

for myId in sorted(myData):
    print '%s %s' % (myId, myData[myId])
myFile.close()
Some observations:
0) file is a built-in (a synonym for open), so it's a poor choice of name for a variable. Further, the variable actually holds a file name, so...
1) The file can be closed as soon as we're done reading from it. The easiest way to accomplish that is with a with block.
2) The first loop appears to go over all the rows, grab the first two elements from each, and make a list with those results. However, your rows already all contain only two elements, so this has no net effect. The CSV reader is already an iterator over rows, and the simple way to create a list from an iterator is to pass it to the list constructor.
3) You proceed to make a list of unique ID values, by manually checking. A list of unique things is better known as a set, and the Python set automatically ensures uniqueness.
4) You have the name zipped_data for your data. This is telling: applying zip to the list of rows would produce a list of columns - and the IDs are simply the first column, transformed into a set.
5) We can use a list comprehension to build the list of serial numbers for a given ID. Don't tell Python how to make a list; tell it what you want in it.
6) Printing the results as we get them is kind of messy and inflexible; better to create the entire chunk of data (then we have code that creates that data, so we can do something else with it other than just printing it and forgetting it).
Applying these ideas, we get:
filename = 'test.csv'
with open(filename) as in_file:
    data = csv.reader(in_file)
    data.next()  # ignore the field labels
    rows = list(data)  # read the rest of the rows from the iterator

print [
    # We want a list of all serial numbers from rows with a matching ID...
    [serial_no for row_id, serial_no in rows if row_id == id]
    # for each of the IDs that there is to match, which come from making
    # a set from the first column of the data.
    for id in set(zip(*rows)[0])
]
We can probably do even better than this by using the groupby function from the itertools module.
Example using itertools.groupby. This only works if the rows are already grouped by id:
from csv import DictReader
from itertools import groupby
from operator import itemgetter

filename = 'test.csv'
# the context manager ensures that infile is closed when it goes out of scope
with open(filename) as infile:
    # group by id - this requires that the rows are already grouped by id
    groups = groupby(DictReader(infile), key=itemgetter('id'))
    # loop through the groups printing a list for each one
    for i, j in groups:
        print [i, map(itemgetter(' serial_no'), list(j))]
Note the space in front of ' serial_no'. This is because of the space after the comma in the header line of the input file.
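If you would rather not carry that space around, the csv readers accept skipinitialspace=True, which ignores whitespace immediately after the delimiter; a small sketch (Python 3 syntax) assuming the same test.csv:
from csv import DictReader

with open('test.csv') as infile:
    reader = DictReader(infile, skipinitialspace=True)
    for row in reader:
        print(row['id'], row['serial_no'])  # the field name no longer has the leading space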
