Reducing RAM consumption of a Python dict

I have a Python script that processes several files of a few gigabytes each. With the code shown below, I store some data into a list, which is stored in a dictionary snp_dict. RAM consumption is huge. Looking at my code, could you suggest some ways to reduce RAM consumption, if any?
import vcf  # PyVCF

def extractAF(files_vcf):
    z = 0
    snp_dict = dict()
    for infile_name in sorted(files_vcf):
        print ' * ' + infile_name
        ### single files
        vcf_reader = vcf.Reader(open(infile_name, 'r'))
        for record in vcf_reader:
            snp_position = '_'.join([record.CHROM, str(record.POS)])
            ref_F = float(record.INFO['DP4'][0])
            ref_R = float(record.INFO['DP4'][1])
            alt_F = float(record.INFO['DP4'][2])
            alt_R = float(record.INFO['DP4'][3])
            AF = (alt_F + alt_R) / (alt_F + alt_R + ref_F + ref_R)
            if snp_position not in snp_dict:
                snp_dict[snp_position] = [0 for _ in range(len(files_vcf))]
            snp_dict[snp_position][z] = round(AF, 3)  # record.INFO['DP4']
        z += 1
    return snp_dict
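One way to shrink the dict's footprint without changing the overall design would be to store each row as an array.array of 32-bit C floats (4 bytes each) instead of a list of boxed Python float objects. A sketch under that assumption (extractAF_compact is a made-up name; PyVCF usage as above):

from array import array
import vcf

def extractAF_compact(files_vcf):
    # Same logic as above, but each value row is a typed array of C floats
    # rather than a list of boxed Python objects.
    n = len(files_vcf)
    snp_dict = {}
    for z, infile_name in enumerate(sorted(files_vcf)):
        vcf_reader = vcf.Reader(open(infile_name, 'r'))
        for record in vcf_reader:
            snp_position = '_'.join([record.CHROM, str(record.POS)])
            dp4 = [float(x) for x in record.INFO['DP4']]
            AF = (dp4[2] + dp4[3]) / sum(dp4)
            if snp_position not in snp_dict:
                snp_dict[snp_position] = array('f', [0.0] * n)
            snp_dict[snp_position][z] = round(AF, 3)
    return snp_dict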

I finally adopted the following implementation with MySQL:
# z, snp_dict (in this version a plain list of already-seen positions) and
# db1 (an open MySQL connection) are set up before this loop.
for infile_name in sorted(files_vcf):
    print infile_name
    ### single files
    vcf_reader = vcf.Reader(open(infile_name, 'r'))
    for record in vcf_reader:
        snp_position = '_'.join([record.CHROM, str(record.POS)])
        ref_F = float(record.INFO['DP4'][0])
        ref_R = float(record.INFO['DP4'][1])
        alt_F = float(record.INFO['DP4'][2])
        alt_R = float(record.INFO['DP4'][3])
        AF = (alt_F + alt_R) / (alt_F + alt_R + ref_F + ref_R)
        if snp_position not in snp_dict:
            sql_insert_table = "INSERT INTO snps VALUES ('" + snp_position + "'," + ",".join('0' for _ in range(len(files_vcf))) + ")"
            cursor = db1.cursor()
            cursor.execute(sql_insert_table)
            db1.commit()
            snp_dict.append(snp_position)
        sql_update = "UPDATE snps SET " + str(z) + "g=" + str(AF) + " WHERE snp_pos='" + snp_position + "'"
        cursor = db1.cursor()
        cursor.execute(sql_update)
        db1.commit()
    z += 1
return snp_dict
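As an aside, interpolating values into SQL by string concatenation is fragile and open to injection. DB-API drivers such as MySQLdb can bind the values for you; a rough sketch against the same snps table and db1 connection as above (illustrative, not tested against the poster's schema):

# Column names cannot be bound as parameters, so the per-file column
# name (str(z) + "g") is still built by hand; the values are bound.
cursor = db1.cursor()
cursor.execute("UPDATE snps SET " + str(z) + "g=%s WHERE snp_pos=%s",
               (AF, snp_position))

Committing once per input file, after the inner loop, rather than after every statement would also cut transaction overhead considerably.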

For this sort of thing, you are probably better off using another data structure. A pandas DataFrame would work well in your situation.
The simplest solution would be to use an existing library, rather than writing your own parser. vcfnp can read vcf files into a format that is easily convertible to a pandas DataFrame. Something like this should work:
import numpy as np
import pandas as pd
import vcfnp

def extractAF(files_vcf):
    dfs = []
    for fname in sorted(files_vcf):
        vars = vcfnp.variants(fname, fields=['CHROM', 'POS', 'DP4'])
        snp_pos = np.char.add(np.char.add(vars.CHROM, '_'), vars.POS.astype('S'))
        dp4 = vars.DP4.astype('float')
        # DP4 holds (ref_F, ref_R, alt_F, alt_R) per variant, one row each
        AF = dp4[:, 2:].sum(axis=1) / dp4.sum(axis=1)
        dfs.append(pd.DataFrame(AF, index=snp_pos, columns=[fname]).T)
    return pd.concat(dfs).fillna(0.0)
If you absolutely must use PyVCF, it will be slower, but hopefully this will at least be faster than your existing implementation, and should produce the same result as the above code:
def extractAF(files_vcf):
    files_vcf = sorted(files_vcf)
    dfs = []
    for fname in files_vcf:
        print ' * ' + fname
        vcf_reader = vcf.Reader(open(fname, 'r'))
        vars = ((rec.CHROM, rec.POS) + tuple(rec.INFO['DP4']) for rec in vcf_reader)
        df = pd.DataFrame(vars, columns=['CHROMS', 'POS', 'ref_F', 'ref_R', 'alt_F', 'alt_R'])
        df['snp_position'] = df['CHROMS'] + '_' + df['POS'].astype(str)
        df_alt = df.loc[:, ('alt_F', 'alt_R')]
        df_dp4 = df.loc[:, ('alt_F', 'alt_R', 'ref_F', 'ref_R')]
        df[fname] = df_alt.sum(axis=1) / df_dp4.sum(axis=1)
        df = df.set_index('snp_position', drop=True).loc[:, fname:fname].T
        dfs.append(df)
    return pd.concat(dfs).fillna(0.0)
Now let's say you wanted to read a particular snp_position, say one contained in a variable snp_pos, that may or may not be there (from your comment). You wouldn't actually have to change anything:
all_vcf = extractAF(files_vcf)
if snp_pos in all_vcf:
    linea_di_AF = all_vcf[snp_pos]
The result will be slightly different, though. It will be a pandas Series, which is like an array but can also be accessed like a dictionary:
all_vcf = extractAF(files_vcf)
if snp_pos in all_vcf:
    linea_di_AF = all_vcf[snp_pos]
    f_di_AF = linea_di_AF[files_vcf[0]]
This allows you to access a particular file/snp_pos pair directly:
all_vcf = extractAF(files_vcf)
if snp_pos in all_vcf:
    f_di_AF = all_vcf[snp_pos][files_vcf[0]]
Or, better yet:
all_vcf = extractAF(files_vcf)
if snp_pos in all_vcf:
    f_di_AF = all_vcf.loc[files_vcf[0], snp_pos]
Or you can get all snp_pos values for a given file:
all_vcf = extractAF(files_vcf)
fpos = all_vcf.loc[fname]
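For a sense of the memory difference, here is a rough, illustrative sketch comparing a dict of Python lists with a single contiguous float block (toy shapes, not your data):

import sys
import numpy as np
import pandas as pd

n_files, n_snps = 10, 100000

# Dict-of-lists approach: every AF is a boxed Python float inside a
# Python list, so each SNP row carries list plus object overhead.
per_row = sys.getsizeof([0.0] * n_files) + n_files * sys.getsizeof(0.0)
print('dict of lists, values alone: ~%.0f MB' % (n_snps * per_row / 2.0**20))

# One contiguous float32 block: 4 bytes per value.
df = pd.DataFrame(np.zeros((n_files, n_snps), dtype=np.float32))
print('equivalent DataFrame: ~%.0f MB' % (df.memory_usage(deep=True).sum() / 2.0**20))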


How to trigger a Google Cloud Function with inputs [duplicate]

In the script below, I have manually assigned string values to the 4 variables (college, department, course, section).
import pandas as pd

def open_seats(request):
    college = "ENG"
    department = "EC"
    course = "414"
    section = "A1"
    url = 'https://www.bu.edu/link/bin/uiscgi_studentlink.pl/1630695081?ModuleName=univschr.pl&SearchOptionDesc=Class+Number&SearchOptionCd=S&KeySem=20223&ViewSem=Fall+2021&College=' \
          + college + '&Dept=' + department + '&Course=' + course + '&Section=' + section
    table = pd.read_html(url)[4]
    class_table = table['Class']
    open_seats_table = table['OpenSeats']
    new_table = pd.concat([class_table, open_seats_table], axis=1)
    full_section_string = college + '\u00A0' + department + course + '\u00A0' + section
    for i in range(len(new_table)):
        if new_table['Class'][i] == full_section_string:
            val = new_table['OpenSeats'][i]
            break
    return val
I would like to connect this script to a mobile app where the user is asked to input the data for these 4 variables. So instead of having them manually labeled, how can I assign the variables the data that comes from the trigger?
At first, I thought that the data would be sent as JSON in the form:
{
    "college": "ENG",
    "department": "EC",
    "course": "414",
    "section": "A1"
}
So I updated the code to look like this:
def open_seats(request):
    college = request["college"]
    department = request["department"]
    course = request["course"]
    section = request["section"]
I am lacking some fundamental knowledge about the way the HTTP trigger functions, and how I can pass inputs to the cloud function through the HTTP trigger.
Using the code below, I got a response of 12:
import pandas as pd

college = "ENG"
department = "EC"
course = "414"
section = "A1"

def open_seats(a, b, c, d):
    url_base = 'https://www.bu.edu/link/bin/uiscgi_studentlink.pl/1630695081?ModuleName=univschr.pl&SearchOptionDesc=Class+Number&SearchOptionCd=S&KeySem=20223&ViewSem=Fall+2021&College='
    url = url_base + a + "&Dept=" + b + "&Course=" + c + "&Section=" + d
    table = pd.read_html(url)[4]
    class_table = table['Class']
    open_seats_table = table['OpenSeats']
    new_table = pd.concat([class_table, open_seats_table], axis=1)
    full_section_string = a + '\u00A0' + b + c + '\u00A0' + d
    for i in range(len(new_table)):
        if new_table['Class'][i] == full_section_string:
            val = new_table['OpenSeats'][i]
    return print(f'{val}')

# I call the function with the predefined variables being passed
open_seats(college, department, course, section)
What data are you trying to pass to open_seats(request)?
def open_seats(request):
    college = request["college"]
    department = request["department"]
    course = request["course"]
    section = request["section"]
You didn't post the raw data coming from the app to the script. If you could post it, I would be able to give a full answer. Could you also post the code you are using to take user input?
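For what it's worth, in the Python runtime an HTTP-triggered Cloud Function receives a flask.Request object, not a bare dict, so the JSON body has to be parsed first. A minimal sketch of that pattern (field names taken from the question; the scraping logic is elided, and the final echo is only a placeholder):

def open_seats(request):
    # request is a flask.Request; get_json() parses a JSON POST body
    # (silent=True returns None instead of raising on a missing/bad body).
    data = request.get_json(silent=True) or {}
    college = data.get("college")
    department = data.get("department")
    course = data.get("course")
    section = data.get("section")
    if not all([college, department, course, section]):
        return ('Missing one of: college, department, course, section', 400)
    # ... build the URL from these four values and scrape the table as in
    # the original script; an HTTP function must return a Flask-compatible
    # response, e.g. a string. Placeholder below:
    return '{} {}{} {}'.format(college, department, course, section)

The app would then POST the JSON shown in the question to the function's trigger URL, e.g. curl -X POST -H "Content-Type: application/json" -d '{"college": "ENG", ...}' <trigger-url>.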

Parse XML w/ XSD to CSV with Python?

I am trying to parse a very large XML file, which I downloaded from OSHA's website, and convert it into a CSV so I can use it in a SQLite database along with some other spreadsheets. I would just use an online converter, but the OSHA file is apparently too big for all of them.
I wrote a script in Python which looks like this:
import csv
import xml.etree.cElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
xml_data_to_csv = open('Out.csv', 'w')
list_head = []
Csv_writer = csv.writer(xml_data_to_csv)
count = 0
for element in root.findall('data'):
    List_nodes = []
    if count == 0:
        inspection_number = element.find('inspection_number').tag
        list_head.append(inspection_number)
        establishment_name = element.find('establishment_name').tag
        list_head.append(establishment_name)
        city = element.find('city')
        list_head.append(city)
        state = element.find('state')
        list_head.append(state)
        zip_code = element.find('zip_code')
        list_head.append(zip_code)
        sic_code = element.find('sic_code')
        list_head.append(sic_code)
        naics_code = element.find('naics_code')
        list_head.append(naics_code)
        sampling_number = element.find('sampling_number')
        list_head.append(sampling_number)
        office_id = element.find('office_id')
        list_head.append(office_id)
        date_sampled = element.find('date_sampled')
        list_head.append(date_sampled)
        date_reported = element.find('date_reported')
        list_head.append(date_reported)
        eight_hour_twa_calc = element.find('eight_hour_twa_calc')
        list_head.append(eight_hour_twa_calc)
        instrument_type = element.find('instrument_type')
        list_head.append(instrument_type)
        lab_number = element.find('lab_number')
        list_head.append(lab_number)
        field_number = element.find('field_number')
        list_head.append(field_number)
        sample_type = element.find('sample_type')
        list_head.append(sample_type)
        blank_used = element.find('blank_used')
        list_head.append(blank_used)
        time_sampled = element.find('time_sampled')
        list_head.append(time_sampled)
        air_volume_sampled = element.find('air_volume_sampled')
        list_head.append(air_volume_sampled)
        sample_weight = element.find('sample_weight')
        list_head.append(sample_weight)
        imis_substance_code = element.find('imis_substance_code')
        list_head.append(imis_substance_code)
        substance = element.find('substance')
        list_head.append(substance)
        sample_result = element.find('sample_result')
        list_head.append(sample_result)
        unit_of_measurement = element.find('unit_of_measurement')
        list_head.append(unit_of_measurement)
        qualifier = element.find('qualifier')
        list_head.append(qualifier)
        Csv_writer.writerow(list_head)
        count += 1
    inspection_number = element.find('inspection_number').text
    List_nodes.append(inspection_number)
    establishment_name = element.find('establishment_name').text
    List_nodes.append(establishment_name)
    city = element.find('city').text
    List_nodes.append(city)
    state = element.find('state').text
    List_nodes.append(state)
    zip_code = element.find('zip_code').text
    List_nodes.append(zip_code)
    sic_code = element.find('sic_code').text
    List_nodes.append(sic_code)
    naics_code = element.find('naics_code').text
    List_nodes.append(naics_code)
    sampling_number = element.find('sampling_number').text
    List_nodes.append(sampling_number)
    office_id = element.find('office_id').text
    List_nodes.append(office_id)
    date_sampled = element.find('date_sampled').text
    List_nodes.append(date_sampled)
    date_reported = element.find('date_reported').text
    List_nodes.append(date_reported)
    eight_hour_twa_calc = element.find('eight_hour_twa_calc').text
    List_nodes.append(eight_hour_twa_calc)
    instrument_type = element.find('instrument_type').text
    List_nodes.append(instrument_type)
    lab_number = element.find('lab_number').text
    List_nodes.append(lab_number)
    field_number = element.find('field_number').text
    List_nodes.append(field_number)
    sample_type = element.find('sample_type').text
    List_nodes.append(sample_type)
    blank_used = element.find('blank_used').text
    List_nodes.append(blank_used)
    time_sampled = element.find('time_sampled').text
    List_nodes.append(time_sampled)
    air_volume_sampled = element.find('air_volume_sampled').text
    List_nodes.append(air_volume_sampled)
    sample_weight = element.find('sample_weight').text
    List_nodes.append(sample_weight)
    imis_substance_code = element.find('imis_substance_code').text
    List_nodes.append(imis_substance_code)
    substance = element.find('substance').text
    List_nodes.append(substance)
    sample_result = element.find('sample_result').text
    List_nodes.append(sample_result)
    unit_of_measurement = element.find('unit_of_measurement').text
    List_nodes.append(unit_of_measurement)
    qualifier = element.find('qualifier').text
    List_nodes.append(qualifier)
    Csv_writer.writerow(List_nodes)
xml_data_to_csv.close()
But when I run the code I get a CSV with nothing in it. I suspect this may have something to do with the XSD file associated with the XML, but I'm not totally sure.
Does anyone know what the issue is here?
The code below is a 'compact' version of your code.
It assumes that the XML structure looks like the one in the script variable xml (based on https://www.osha.gov/opengov/sample_data_2011.zip). Note that in that structure the records are DATA_RECORD elements, so root.findall('data') in your script matches nothing, which is why your CSV comes out empty.
The main difference between this sample code and yours is that I define the fields that I want to collect once (see FIELDS) and use this definition across the script.
import xml.etree.ElementTree as ET
FIELDS = ['lab_number', 'instrument_type'] # TODO add more fields
xml = '''<main xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="health_sample_data.xsd">
<DATA_RECORD>
<inspection_number>316180165</inspection_number>
<establishment_name>PROFESSIONAL ENGINEERING SERVICES, LLC.</establishment_name>
<city>EUFAULA</city>
<state>AL</state>
<zip_code>36027</zip_code>
<sic_code>1799</sic_code>
<naics_code>238990</naics_code>
<sampling_number>434866166</sampling_number>
<office_id>418600</office_id>
<date_sampled>2011-12-30</date_sampled>
<date_reported>2011-12-30</date_reported>
<eight_hour_twa_calc>N</eight_hour_twa_calc>
<instrument_type>TBD</instrument_type>
<lab_number>L13645</lab_number>
<field_number>S1</field_number>
<sample_type>B</sample_type>
<blank_used>N</blank_used>
<time_sampled></time_sampled>
<air_volume_sampled></air_volume_sampled>
<sample_weight></sample_weight>
<imis_substance_code>S777</imis_substance_code>
<substance>Soil</substance>
<sample_result>0</sample_result>
<unit_of_measurement>AAAAA</unit_of_measurement>
<qualifier></qualifier>
</DATA_RECORD>
<DATA_RECORD>
<inspection_number>315516757</inspection_number>
<establishment_name>MARGUERITE CONCRETE CO.</establishment_name>
<city>WORCESTER</city>
<state>MA</state>
<zip_code>1608</zip_code>
<sic_code>1771</sic_code>
<naics_code>238110</naics_code>
<sampling_number>423259902</sampling_number>
<office_id>112600</office_id>
<date_sampled>2011-12-30</date_sampled>
<date_reported>2011-12-30</date_reported>
<eight_hour_twa_calc>N</eight_hour_twa_calc>
<instrument_type>GRAV</instrument_type>
<lab_number>L13355</lab_number>
<field_number>9831B</field_number>
<sample_type>P</sample_type>
<blank_used>N</blank_used>
<time_sampled>184</time_sampled>
<air_volume_sampled>340.4</air_volume_sampled>
<sample_weight>.06</sample_weight>
<imis_substance_code>9135</imis_substance_code>
<substance>Particulates not otherwise regulated (Total Dust)</substance>
<sample_result>0.176</sample_result>
<unit_of_measurement>M</unit_of_measurement>
<qualifier></qualifier>
</DATA_RECORD></main>'''
root = ET.fromstring(xml)
records = root.findall('.//DATA_RECORD')
with open('out.csv', 'w') as out:
    out.write(','.join(FIELDS) + '\n')
    for record in records:
        values = [record.find(f).text for f in FIELDS]
        out.write(','.join(values) + '\n')
out.csv:
lab_number,instrument_type
L13645,TBD
L13355,GRAV
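One more note: the question says the real file is very large, and ET.parse() reads the whole tree into memory. If that becomes a problem, iterparse can stream the file instead; a sketch assuming the same DATA_RECORD structure as above:

import csv
import xml.etree.cElementTree as ET

FIELDS = ['lab_number', 'instrument_type']  # TODO add more fields

with open('out.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    # iterparse yields elements as their closing tags are seen,
    # so the file is never held in memory all at once.
    for event, elem in ET.iterparse('data.xml', events=('end',)):
        if elem.tag == 'DATA_RECORD':
            writer.writerow([elem.findtext(f, '') for f in FIELDS])
            elem.clear()  # free the subtree we just consumed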

Parsing Security Matrix Spreadsheet - NoneType is not Iterable

I'm trying to nest the no's and yes's with their respective applications and services, so that when a request comes in for a specific zone-to-zone sequence, a check can be run against this logic to verify accepted requests.
I have tried calling Decision_List[Zone_Name][yes_no].update, and I tried .append when it was a list type rather than a dict, but there is no update method?
# sh is presumably an xlrd worksheet (col_values, ncols, nrows)
Base_Sheet = range(5, sh.ncols)
Column_Rows = range(1, sh.nrows)
for colnum in Base_Sheet:
    Zone_Name = sh.col_values(colnum)[0]
    Zone_App_Header = {sh.col_values(4)[0]: {}}
    Zone_Svc_Header = {sh.col_values(3)[0]: {}}
    Zone_Proto_Header = {sh.col_values(2)[0]: {}}
    Zone_DestPort_Header = {sh.col_values(1)[0]: {}}
    Zone_SrcPort_Header = {sh.col_values(0)[0]: {}}
    Decision_List = {Zone_Name: {}}
    for rows in Column_Rows:
        app_object = sh.col_values(4)[rows]
        svc_object = sh.col_values(3)[rows]
        proto_object = sh.col_values(3)[rows]
        dst_object = sh.col_values(2)[rows]
        src_object = sh.col_values(1)[rows]
        yes_no = sh.col_values(colnum)[rows]
        if yes_no not in Decision_List[Zone_Name]:
            Decision_List[Zone_Name][yes_no] = [app_object]
        else:
            Decision_List[Zone_Name] = [yes_no].append(app_object)
I would like it to present the info as follows:
Decision_List = {Zone_Name: {'yes': ['ssh', 'ssl', 'soap'], 'no': ['web-browsing', 'facebook']}}
I would still like to know why I couldn't call the append method on that specific yes_no key whose value was a list (see the sketch at the end of this thread).
But in the meantime, I made a workaround of sorts. I created a tuple as the key and gave the yes_no as the value. This will allow me to pair many no-type values with keys that are a tuple of the application, port, service, etc., and then I can search for yes values and create additional dicts out of them for logic.
Any better ideas out there, I am all ears.
for rownum in range(0, sh.nrows):
    # row_val is all the values in the row of cell.index[rownum] as determined by rownum
    row_val = sh.row_values(rownum)
    col_val = sh.col_values(rownum)
    print rownum, col_val[0], col_val[1:CoR]
    header.append({col_val[0]: col_val[1:CoR]})
print header[0]['Start Port']

dec_tree = {}
count = 1
Base_Sheet = range(5, sh.ncols)
Column_Rows = range(1, sh.nrows)
for colnum in Base_Sheet:
    Zone_Name = sh.col_values(colnum)[0]
    Zone_App_Header = {sh.col_values(4)[0]: {}}
    Zone_Svc_Header = {sh.col_values(3)[0]: {}}
    Zone_Proto_Header = {sh.col_values(2)[0]: {}}
    Zone_DestPort_Header = {sh.col_values(1)[0]: {}}
    Zone_SrcPort_Header = {sh.col_values(0)[0]: {}}
    Decision_List = {Zone_Name: {}}
    for rows in Column_Rows:
        app_object = sh.col_values(4)[rows]
        svc_object = sh.col_values(3)[rows]
        proto_object = sh.col_values(3)[rows]
        dst_object = sh.col_values(2)[rows]
        src_object = sh.col_values(1)[rows]
        yes_no = sh.col_values(colnum)[rows]
        for rule_name in Decision_List.iterkeys():
            Decision_List[Zone_Name][(app_object, svc_object, proto_object)] = yes_no
Thanks again.
I think a still better way is to use collections.defaultdict. It ensures that I am able to append to the specific yes_no key as I had originally intended.
from collections import defaultdict

for colnum in Base_Sheet:
    Zone_Name = sh.col_values(colnum)[0]
    Zone_App_Header = {sh.col_values(4)[0]: {}}
    Zone_Svc_Header = {sh.col_values(3)[0]: {}}
    Zone_Proto_Header = {sh.col_values(2)[0]: {}}
    Zone_DestPort_Header = {sh.col_values(1)[0]: {}}
    Zone_SrcPort_Header = {sh.col_values(0)[0]: {}}
    Decision_List = {Zone_Name: defaultdict(list)}
    for rows in Column_Rows:
        app_object = sh.col_values(4)[rows]
        svc_object = sh.col_values(3)[rows]
        proto_object = sh.col_values(2)[rows]
        dst_object = sh.col_values(1)[rows]
        src_object = sh.col_values(0)[rows]
        yes_no = sh.col_values(colnum)[rows]
        if yes_no not in Decision_List[Zone_Name]:
            Decision_List[Zone_Name][yes_no] = [app_object, svc_object, proto_object, dst_object, src_object]
        else:
            Decision_List[Zone_Name][yes_no].append([(app_object, svc_object, proto_object, dst_object, src_object)])
This allows me to store the values and append to them as needed.
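To answer the 'why' left open above: list.append mutates its list in place and returns None, so the original else branch, Decision_List[Zone_Name] = [yes_no].append(app_object), rebinds the zone's entry to None, and the next membership test is what raises "'NoneType' is not iterable". A minimal standalone sketch of the pitfall and of the defaultdict idiom that avoids it (toy data, not the spreadsheet):

from collections import defaultdict

# The pitfall: list.append returns None, so assigning its result
# throws the freshly built list away.
bucket = ['ssh']
bucket = bucket.append('ssl')
print bucket          # None -- and 'x in bucket' now raises the NoneType error

# The defaultdict idiom: a missing key springs into an empty list,
# so every branch is just an append.
decisions = defaultdict(list)
for app, verdict in [('ssh', 'yes'), ('ssl', 'yes'), ('facebook', 'no')]:
    decisions[verdict].append(app)
print dict(decisions)  # {'yes': ['ssh', 'ssl'], 'no': ['facebook']}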

TypeError: 'DataFrame' object is not callable python function

I have two functions, one which creates a dataframe from a csv and another which manipulates that dataframe. There is no problem the first time I pass the raw data through the lsc_age(import_data()) functions. However, I get the above-referenced error (TypeError: 'DataFrame' object is not callable) on the second and subsequent attempts. Any ideas for how to solve the problem?
import pandas as pd

def import_data(csv, date1, date2):
    global data
    data = pd.read_csv(csv, header=1)
    data = data.iloc[:, [0, 1, 4, 6, 7, 8, 9, 11]]
    data = data.dropna(how='all')
    data = data.rename(columns={"National: For Dates 9//1//"+date1+" - 8//31//"+date2: 'event',
                                'Unnamed: 1': 'time', 'Unnamed: 4': 'points',
                                'Unnamed: 6': 'name', 'Unnamed: 7': 'age', 'Unnamed: 8': 'lsc',
                                'Unnamed: 9': 'club', 'Unnamed: 11': 'date'})
    data = data.reset_index().drop('index', axis=1)
    data = data[data.time != 'Time']
    data = data[data.points != 'Power ']
    data = data[data['event'] != "National: For Dates 9//1//"+date1+" - 8//31//"+date2]
    data = data[data['event'] != 'USA Swimming, Inc.']
    data = data.reset_index().drop('index', axis=1)
    for i in range(len(data)):
        if len(str(data['event'][i])) <= 3:
            data['event'][i] = data['event'][i-1]
        else:
            data['event'][i] = data['event'][i]
    data = data.dropna()
    age = []
    event = []
    gender = []
    for row in data.event:
        gender.append(row.split(' ')[0])
        if row[:9] == 'Female 10':
            n = 4
            groups = row.split(' ')
            age.append(' '.join(groups[1:n]))
            event.append(' '.join(groups[n:]))
        elif row[:7] == 'Male 10':
            n = 4
            groups = row.split(' ')
            age.append(' '.join(groups[1:n]))
            event.append(' '.join(groups[n:]))
        else:
            n = 2
            groups = row.split(' ')
            event.append(' '.join(groups[n:]))
            groups = row.split(' ')
            age.append(groups[1])
    data['age_group'] = age
    data['event_simp'] = event
    data['gender'] = gender
    data['year'] = date2
    return data

def lsc_age(data_two):
    global lsc, lsc_age, top, all_performers
    lsc = pd.DataFrame(data_two['event'].groupby(data_two['lsc']).count()).reset_index().sort_values(by='event', ascending=False)
    lsc_age = data_two.groupby(['year', 'age_group', 'lsc'])['event'].count().reset_index().sort_values(by=['age_group', 'event'], ascending=False)
    top = pd.concat([lsc_age[lsc_age.age_group == '10 & under'].head(), lsc_age[lsc_age.age_group == '11-12'].head(),
                     lsc_age[lsc_age.age_group == '13-14'].head(), lsc_age[lsc_age.age_group == '15-16'].head(),
                     lsc_age[lsc_age.age_group == '17-18'].head()], ignore_index=True)
    all_performers = pd.concat([lsc_age[lsc_age.age_group == '10 & under'], lsc_age[lsc_age.age_group == '11-12'],
                                lsc_age[lsc_age.age_group == '13-14'], lsc_age[lsc_age.age_group == '15-16'],
                                lsc_age[lsc_age.age_group == '17-18']], ignore_index=True)
    all_performers = all_performers.rename(columns={'event': 'no. top 100'})
    all_performers['age_year_lsc'] = all_performers.age_group + ' ' + all_performers.year.astype(str) + ' ' + all_performers.lsc
    return all_performers

years = [i for i in range(2008, 2018)]
for i in range(len(years)-1):
    lsc_age(import_data(str(years[i+1]) + "national100.csv",
                        str(years[i]), str(years[i+1])))
During the first call to your function lsc_age() in line
lsc_age = data_two.groupby(['year','age_group','lsc'])['event'].count().reset_index().sort_values(by=['age_group','event'],ascending=False)
you are overwriting your function object with a dataframe. This is happening since you imported the function object from the global namespace with
global lsc, lsc_age, top, all_performers
Functions in Python are objects. Please see more information about this here.
To solve your problem, try to avoid the global imports. They do not seem to be necessary. Try to pass your data around through the arguments of the function.
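A stripped-down reproduction of the failure mode may make this clearer (illustrative only, not the poster's code):

def lsc_age(x):
    global lsc_age      # the function's own name becomes a global binding...
    lsc_age = x * 2     # ...which is rebound here to a plain int
    return lsc_age

print(lsc_age(10))  # works once, prints 20
print(lsc_age(10))  # TypeError: 'int' object is not callable

The fix is simply to give the result a different name from the function (or, better, drop the global statements entirely and pass values through arguments and return values).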

Calculating the area of an irregular shape from coordinates in a CSV file using Python

I am using Python to import a csv file with coordinates in it, passing it to a list and using the contained data to calculate the area of each irregular figure. The data within the csv file looks like this:
ID Name DE1 DN1 DE2 DN2 DE3 DN3
88637 Zack Fay -0.026841782 -0.071375637 0.160878583 -0.231788845 0.191811833 0.396593863
88687 Victory Greenfelder 0.219394372 -0.081932907 0.053054879 -0.048356016
88737 Lynnette Gorczany 0.043632299 0.118916157 0.005488698 -0.268612073
88787 Odelia Tremblay PhD 0.083147337 0.152277791 -0.039216388 0.469656787 -0.21725977 0.073797219
The code I am using is below; however, it raises an IndexError, since the first line doesn't have data in all columns. Is there a way to write the csv file so it only uses the columns with data in them?
import csv
import math

def main():
    try:
        # ask user to open a file with coordinates for 4 points
        my_file = raw_input('Enter the Irregular Differences file name and location: ')
        file_list = []
        with open(my_file, 'r') as my_csv_file:
            reader = csv.reader(my_csv_file)
            print 'my_csv_file: ', (my_csv_file)
            reader.next()
            for row in reader:
                print row
                file_list.append(row)
        all = calculate(file_list)
        save_write_file(all)
    except IOError:
        print 'File reading error, Goodbye!'
    except IndexError:
        print 'Index Error, Check Data'

# now do your calculations on the 'data' in the file.
def calculate(my_file):
    return_list = []
    for row in my_file:
        de1 = float(row[2])
        dn1 = float(row[3])
        de2 = float(row[4])
        dn2 = float(row[5])
        de3 = float(row[6])
        dn3 = float(row[7])
        de4 = float(row[8])
        dn4 = float(row[9])
        de5 = float(row[10])
        dn5 = float(row[11])
        de6 = float(row[12])
        dn6 = float(row[13])
        de7 = float(row[14])
        dn7 = float(row[15])
        de8 = float(row[16])
        dn8 = float(row[17])
        de9 = float(row[18])
        dn9 = float(row[19])
        area_squared = abs((dn1 * de2) - (dn2 * de1)) + ((de3 * dn4) - (dn3 * de4)) + ((de5 * dn6) - (de6 * dn5)) + ((de7 * dn8) - (dn7 * de8)) + ((dn9 * de1) - (de9 * dn1))
        area = area_squared / 2
        row.append(area)
        return_list.append(row)
    return return_list

def save_write_file(all):
    with open('output_task4B.csv', 'w') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["ID", "Name", "de1", "dn1", "de2", "dn2", "de3", "dn3", "de4", "dn4", "de5", "dn5", "de6", "dn6", "de7", "dn7", "de8", "dn8", "de9", "dn9", "Area"])
        writer.writerows(all)

if __name__ == '__main__':
    main()
Any suggestions?
Your problem appears to be in the calculate function.
You are trying to access various indexes of row without first confirming they exist. One naive approach might be to consider the values to be zero if they are not present, except that the final term
+ ((dn9 * de1) - (de9 * dn1))
is an attempt to wrap around to the first point, and zero-filling would invalidate your math.
A better approach is probably to use a slice of the row, and use the sequence-iterating approach instead of trying to require a certain number of points. This lets your code fit the data.
coords = [float(c) for c in row[2:] if c != '']  # skip id and name; csv fields are strings, and trailing cells may be empty
assert len(coords) % 2 == 0, "Coordinates must come in pairs!"
prev_de = coords[-2]
prev_dn = coords[-1]
area_squared = 0.0
for de, dn in zip(coords[:-1:2], coords[1::2]):
    area_squared += (de * prev_dn) - (dn * prev_de)
    prev_de, prev_dn = de, dn
area = abs(area_squared) / 2
The next problem will be dealing with variable-length output. I'd suggest putting the area before the coordinates. That way you know it's always column 3 (or whatever).
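Wrapped up as a function, the same shoelace computation can be sanity-checked against a shape of known area; a small standalone sketch (the function name is mine, not from the thread):

def shoelace_area(coords):
    # coords is a flat list [e1, n1, e2, n2, ...] of the polygon's vertices
    assert len(coords) % 2 == 0, "Coordinates must come in pairs!"
    prev_e, prev_n = coords[-2], coords[-1]
    total = 0.0
    for e, n in zip(coords[:-1:2], coords[1::2]):
        total += (e * prev_n) - (n * prev_e)
        prev_e, prev_n = e, n
    return abs(total) / 2

# Unit square traversed counter-clockwise: should print 1.0
print shoelace_area([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0])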
