graphing columns from a CSV that includes null values - python

I am relatively new to python and I am running into a lot of issues.
I am trying to create a graph using two columns from a csv file that contains many null values. Is there a way to convert the null values to zeros, or to delete rows that contain null values in certain columns?

Your question as asked is underspecified, but I think if we pick a concrete example, you should be able to figure out how to adapt it to your actual use case.
So, let's say your values are all either a string representation of a float, or an empty string representing null:
A,B
1.0,2.0
2.0,
,3.0
4.0,5.0
And let's say you're reading this using a csv.reader, and you're explicitly handling the rows one by one with some do_stuff_with function:
import csv

with open('foo.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        a, b = map(float, row)
        do_stuff_with(a, b)
Now, if you want to treat null values as 0.0, you just need to replace float with a function that returns float(x) for non-empty x, and 0.0 for empty x:
def nullable_float(x):
    return float(x) if x else 0.0

with open('foo.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        a, b = map(nullable_float, row)
        do_stuff_with(a, b)
If you want to skip any rows that contain a null value in column B, you just check column B before doing the conversion:
with open('foo.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        if not row[1]:
            continue  # skip rows where column B is empty
        a, b = map(nullable_float, row)
        do_stuff_with(a, b)
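For the graphing part of the question, here is a minimal sketch that collects the two columns (treating nulls as 0.0) and plots one against the other. The use of matplotlib and the point-marker style are my assumptions, not something specified in the question:

import csv
import matplotlib.pyplot as plt  # assumed plotting library; any other would do

xs, ys = [], []
with open('foo.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        xs.append(nullable_float(row[0]))
        ys.append(nullable_float(row[1]))

plt.plot(xs, ys, 'o')  # plot column B against column A as points
plt.xlabel('A')
plt.ylabel('B')
plt.show()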

Related

Reading and writing to/from csv files

I want my program to read 2 columns (the first and the second one) and add them to an array. They are dependent on each other, so they need to be written alongside each other: the first row (both columns) together, then the second row, and so on.
I have managed to write the first column (containing the names) to the array, but have not managed to write the second column to the array.
rownum = 1
array = []
for row in reader:
    if row[1] != '' and row[1] != 'Score':
        array.append(row[1])
        rownum = rownum + 1
        if rownum == 11:
            break
I attempted to append more than one row; however, it returns the error message 'only accepts one argument'.
Any ideas how I can do this so I can reference the score for each name from the csv file?
Try using a dictionary.
d = {}  # curly braces denote an empty dictionary
for row in reader:
    d[row[0]] = row[1]
d, in this case, would be a dictionary with the first column of your csv file as the keys and the second column as the corresponding values.
You can access it very similarly to how you access a list. Say you had Brian,80 as one of the entries in your csv file; d["Brian"] would return '80' (as a string, since csv.reader reads everything in as strings).
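As a quick illustration (the entries here are made up for the example):

d = {"Brian": "80", "Alice": "95"}  # hypothetical entries, as csv.reader would produce them
print(d["Brian"])                   # prints: 80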
EDIT
OP has requested (in the comments) for a more complete version of the code. Assuming OP's code already works, I'll modify that code so it works with a dictionary:
rownum = 1
d = {}  # an empty dictionary
for row in reader:
    if row[1] != '' and row[1] != 'Score':
        d[row[0]] = row[1]  # first column is the key/index, second column is the value
        rownum = rownum + 1
        if rownum == 11:
            break

Find the average in a csv file with python (ignore null)

I have a csv file read with Python, and I need to find the average of each row and put it in a list. The averages should be computed ignoring null values; to be precise, the divisor for each row should not count null entries. In the example below, the average of A is 7 and the average of B should be 67.3.
(The csv file was shown as an image in the original post.)
The standard Python csv library should work here.
It gives you the rows and columns as nested lists, i.e. [[row0column0, row0column1, ...], ..., [rowNcolumn0, rowNcolumn1, ...]].
I think this code sample should provide a good framework...
import csv

columns_to_avg = [1, 2]  # a list of the indexes of the columns you
                         # want to avg. In this case, 1 and 2.
sums = dict((i, 0.0) for i in columns_to_avg)  # running sum per column
counts = dict((i, 0) for i in columns_to_avg)  # number of non-null entries per column

with open('example.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        # 'row' is just a list of column-organized entries
        for i, column in enumerate(row):
            # Check if this column has a value that is not "null"
            # and if it's a column we want to average!
            if column != "null" and i in columns_to_avg:
                entry_value = float(column)  # Convert string to number
                sums[i] += entry_value       # Update sum for this column
                counts[i] += 1

# Calculate final averages for each column here
averages = dict((i, sums[i] / counts[i]) for i in columns_to_avg if counts[i])
modified from https://docs.python.org/2/library/csv.html
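Since the question actually asks for the average of each row (ignoring nulls) rather than of each column, a minimal sketch of that variant might look like the following; it assumes the first column holds the row label (A, B, ...) and the remaining entries are either numbers or the string "null":

import csv

row_averages = []  # one average per row, with null entries excluded from the divisor
with open('example.csv', 'rb') as csvfile:
    for row in csv.reader(csvfile):
        values = [float(v) for v in row[1:] if v not in ("null", "")]
        if values:
            row_averages.append(sum(values) / len(values))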

Python CSV Reader - Compare Each Row with Each Other Row Within One Column

I want to compare each row of a CSV file with itself and every other row within a column. For example, if the column values are like this:
Value_1
Value_2
Value_3
The code should pick Value_1 and compare it with Value_1 (yes, with itself too), Value_2 and then with Value_3. Then it should pick up Value_2 and compare it with Value_1, Value_2, Value_3, and so on.
I've written the following code for this purpose:
import csv

csvfile = r"c:\temp\temp.csv"  # raw string so \t is not read as a tab character
with open(csvfile, newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for compare_row in reader:
            if row == compare_row:
                print(row, 'is equal to', compare_row)
            else:
                print(row, 'is not equal to', compare_row)
The code gives the following output:
['Value_1'] is not equal to ['Value_2']
['Value_1'] is not equal to ['Value_3']
The code compares Value_1 to Value_2 and Value_3 and then stops. Loop 1 does not pick Value_2, and Value_3. In short, the first loop appears to iterate over only the first row of the CSV file before stopping.
Also, I can't compare Value_1 to itself using this code. Any suggestions for the solution?
I would have suggested loading the CSV into memory, but that is not an option considering the size.
Instead, think of it like a SQL statement: for every row in the left table, you want to match it against every value in the right table. So you scan through the left table only once and keep re-scanning the right table until the left side reaches EOF.
with open(csvfile, newline='') as f_left:
    reader_left = csv.reader(f_left, delimiter=',')
    with open(csvfile, newline='') as f_right:
        reader_right = csv.reader(f_right, delimiter=',')
        for row in reader_left:
            for compare_row in reader_right:
                if row == compare_row:
                    print(row, 'is equal to', compare_row)
                else:
                    print(row, 'is not equal to', compare_row)
            f_right.seek(0)  # rewind the right-hand file before the next left-hand row
Try using Python's built-in itertools module:
from itertools import product

with open("abcTest.txt") as inputFile:
    aList = inputFile.read().split("\n")

aProduct = product(aList, aList)

for aElem, bElem in aProduct:
    if aElem == bElem:
        print aElem, 'is equal to', bElem
    else:
        print aElem, 'is not equal to', bElem
What you are computing here is a Cartesian product: every row of data is compared with itself and with every other row.
Reading the source file multiple times to do this will cause a significant performance hit if the file is big.
You could instead store the data in a list and iterate over it multiple times, but that also carries a large overhead.
The itertools module is useful here because it is optimized for exactly this kind of problem.

Filter Excel Table

I have 2 excel files: IDList.csv and Database.csv. IDList contains a list of 300 ID numbers that I want to filter out of the Database, which contains 2000 entries (leaving 1700 entries in the Database).
I tried writing a for loop (for each ID in the IDList, filter out that ID in Database.csv) but am having some trouble with the filter function. I am using Pyvot (http://packages.python.org/Pyvot/tutorial.html). I get a syntax error: Python/Pyvot doesn't like my syntax for xl.filter, but I can't figure out how to correct it. This is what the documentation says:
xl.tools.filter(func, range)
Filters rows or columns by applying func to the given range. func is called for each value in the range. If it returns False, the corresponding row / column is hidden. Otherwise, the row / column is made visible.
range must be a row or column vector. If it is a row vector, columns are hidden, and vice versa.
Note that, to unhide rows / columns, range must include hidden cells. For example, to unhide a range:
xl.filter(lambda v: True, some_vector.including_hidden)
And here's my code:
import xl

IDList = xl.Workbook("IDList.xls").get("A1:A200").get()
for i in range(1, 301):
    xl.filter(!=IDList[i-1], "A1:A2000")  # this is the line that raises the syntax error
How can I filter a column in Database.csv using criteria in IDList.csv? I am open to solutions in Python or an Excel VBA macro, although I prefer Python.
import csv
with open("IDList.csv","rb") as inf:
incsv = csv.reader(inf)
not_wanted = set(row[0] for row in incsv)
with open("Database.csv","rb") as inf, open("FilteredDatabase.csv","wb") as outf:
incsv = csv.reader(inf)
outcsv = csv.writer(outf)
outcsv.writerows(row for row in incsv if row[0] not in not_wanted)
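If you would rather stay inside Pyvot, then going only by the filter signature quoted in the question (a predicate function plus a range) and by the Workbook/get calls in your own snippet, the call might look roughly like the sketch below. The workbook names, the ranges, and the exact return types are assumptions on my part, so check them against the Pyvot documentation:

import xl

# Hypothetical sketch: collect the unwanted IDs once, then hide matching rows.
id_set = set(xl.Workbook("IDList.xls").get("A1:A300").get())
database_column = xl.Workbook("Database.xls").get("A1:A2000")  # assumed range object
xl.filter(lambda v: v not in id_set, database_column)          # False hides the row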

Dynamically parsing research data in python

The long (winded) version:
I'm gathering research data using Python. My initial parsing is ugly (but functional) code which gives me some basic information and turns my raw data into a format suitable for heavy duty statistical analysis using SPSS. However, every time I modify the experiment, I have to dive into the analysis code.
For a typical experiment, I'll have 30 files, each for a unique user. The field count is fixed within each experiment (but can vary from one experiment to another, 10-20 fields). Files are typically 700-1000 records long with a header row. The record format is tab separated (see the sample, which is 4 integers, 3 strings, and 10 floats).
I need to sort my list into categories. In a 1000 line file, I could have 4-256 categories. Rather than trying to pre-determine how many categories each file has, I'm using the code below to count them. The integers at the beginning of each line dictate what category the float values in the row correspond to. Integer combinations can be modified by the string values to produce wildly different results, and multiple combinations can sometimes be lumped together.
Once they're in categories, number crunching begins. I get statistical info (mean, sd, etc. for each category for each file).
The essentials:
I need to parse data like the sample below into categories. Categories are combos of the non-float values in each record. I'm also trying to come up with a dynamic (graphical) way to associate column combinations with categories; I will make a new post for this.
I'm looking for suggestions on how to do both.
# data is a list of tab separated records
# fields is a list of my field names
# get a list of fieldtypes via gettype on our first row
# gettype is a function to get type from string without changing data
fieldtype = [gettype(n) for n in data[1].split('\t')]
# get the indexes for fields that aren't floats
mask = [i for i, field in enumerate(fieldtype) if field!="float"]
# for each row of data[skipping first and last empty lists] we split(on tabs)
# and take the ith element of that split where i is taken from the list mask
# which tells us which fields are not floats
records = [[row.split('\t')[i] for i in mask] for row in data[1:-1]]
# we now get a unique set of combos
# since set doesn't happily take a list of lists, we join each row of values
# together in a comma separated string. So we end up with a list of strings.
uniquerecs = set([",".join(row) for row in records])
print len(uniquerecs)
quit()
def gettype(s):
    try:
        int(s)
        return "int"
    except ValueError:
        pass
    try:
        float(s)
        return "float"
    except ValueError:
        return "string"
Sample Data:
field0 field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13 field14 field15
10 0 2 1 Right Right Right 5.76765674196 0.0310912272139 0.0573603238282 0.0582901376612 0.0648936500524 0.0655294305058 0.0720571099855 0.0748289246137 0.446033755751
3 1 3 0 Left Left Right 8.00982745764 0.0313840132052 0.0576521406854 0.0585844966069 0.0644905497442 0.0653386429438 0.0712603578765 0.0740345755708 0.2641076191
5 19 1 0 Right Left Left 4.69440026591 0.0313852052224 0.0583165354345 0.0592403274967 0.0659404609478 0.0666070804916 0.0715314027001 0.0743022054775 0.465994962101
3 1 4 2 Left Right Left 9.58648184552 0.0303649003017 0.0571579895338 0.0580911765412 0.0634304670863 0.0640132919609 0.0702920967445 0.0730697946335 0.556525293
9 0 0 7 Left Left Left 7.65374257547 0.030318719717 0.0568551744109 0.0577785415066 0.0640577002605 0.0647226582655 0.0711459854908 0.0739256050784 1.23421547397
Not sure if I understand your question, but here are a few thoughts:
For parsing the data files, you usually use the Python csv module.
For categorizing the data you could use a defaultdict with the non-float fields joined as a key for the dict. Example:
from collections import defaultdict
import csv

reader = csv.reader(open('data.file', 'rb'), delimiter='\t')
data_of_category = defaultdict(list)
lines = [line for line in reader]

mask = [i for i, n in enumerate(lines[1]) if gettype(n) != "float"]
for line in lines[1:]:
    category = ','.join([line[i] for i in mask])
    data_of_category[category].append(line)
This way you don't have to calculate the categories in the first place and can process the data in one pass.
And I didn't understand the part about "a dynamic (graphical) way to associate column combinations with categories".
For at least part of your question, have a look at Named Tuples
Step 1: Use something like csv.DictReader to turn the text file into an iterable of rows.
Step 2: Turn that into a dict of first entry: rest of entries.
with open("...", "rb") as data_file:
lines = csv.Reader(data_file, some_custom_dialect)
categories = {line[0]: line[1:] for line in lines}
Step 3: Iterate over the items() of the data and do something with each line.
for category, line in categories.items():
    do_stats_to_line(line)
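Since the answer points at named tuples but doesn't show them, here is a minimal sketch of wrapping each row in a collections.namedtuple; the field names come from the header row in your sample data, and the filename is made up:

import csv
from collections import namedtuple

with open("data.tsv") as f:               # hypothetical filename
    reader = csv.reader(f, delimiter='\t')
    headers = next(reader)                # e.g. field0 ... field15
    DataRow = namedtuple('DataRow', headers)
    rows = [DataRow(*line) for line in reader]

print(rows[0].field4)                     # columns can then be accessed by name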
Some useful answers already but I'll throw mine in as well. Key points:
Use the csv module
Use collections.namedtuple for each row
Group the rows using a tuple of int field values as the key
If your source rows are sorted by the keys (the integer column values), you could use itertools.groupby instead, which would likely reduce memory consumption. Given your example data, and the fact that your files contain on the order of 1000 rows, this is probably not an issue to worry about.
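A minimal sketch of that groupby variant, using a few made-up, already-parsed rows that are sorted by their integer fields (groupby only groups adjacent rows, so the sort order matters):

from itertools import groupby

# Hypothetical parsed rows: leading ints, then strings, then floats,
# sorted by their integer fields so each category is contiguous.
rows = [
    (3, 1, 3, 0, 'Left', 'Left', 'Right', 8.0, 0.26),
    (3, 1, 4, 2, 'Left', 'Right', 'Left', 9.6, 0.56),
    (5, 19, 1, 0, 'Right', 'Left', 'Left', 4.7, 0.47),
]

def int_key(row):
    # the category key is the tuple of integer-typed fields in the row
    return tuple(field for field in row if isinstance(field, int))

for key, group in groupby(rows, key=int_key):
    group = list(group)
    floats = [f for r in group for f in r if isinstance(f, float)]
    print("%s -> mean %.3f" % (key, sum(floats) / len(floats)))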
import csv
from collections import defaultdict, namedtuple

def coerce_to_type(value):
    _types = (int, float)
    for _type in _types:
        try:
            return _type(value)
        except ValueError:
            continue
    return value

def parse_row(row):
    return [coerce_to_type(field) for field in row]

with open(datafile) as srcfile:
    data = csv.reader(srcfile, delimiter='\t')
    ## Read headers, create namedtuple
    headers = srcfile.next().strip().split('\t')
    datarow = namedtuple('datarow', headers)
    ## Wrap with parser and namedtuple
    data = (parse_row(row) for row in data)
    data = (datarow(*row) for row in data)
    ## Group by the leading integer columns
    grouped_rows = defaultdict(list)
    for row in data:
        integer_fields = [field for field in row if isinstance(field, int)]
        grouped_rows[tuple(integer_fields)].append(row)

## DO SOMETHING INTERESTING WITH THE GROUPS
import pprint
pprint.pprint(dict(grouped_rows))
EDIT You may find the code at https://gist.github.com/985882 useful.
