I have a CSV file read with Python, and I need to find the average of each row and put it in a list. The catch is that the average must ignore null values: the divisor should count only the non-null entries in each row. In the example below, the average of A is 7 and the average of B is 67.3.
(image of the example csv file)
The Python standard csv library should work here.
Its reader yields the file as rows of columns, i.e. [[row0column0, row0column1, ...], ..., [rowNcolumn0, rowNcolumn1, ...]]
I think this code sample should provide a good framework...
import csv

columns_to_avg = [1, 2]  # a list of the indexes of the columns you
                         # want to avg. In this case, 1 and 2.
col_sums = {i: 0.0 for i in columns_to_avg}  # running sum per column
col_counts = {i: 0 for i in columns_to_avg}  # count of non-null entries per column

with open('example.csv', newline='') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        # 'row' is just a list of column-organized entries
        for i, column in enumerate(row):
            # Check if this column has a value that is not "null"
            # and if it's a column we want to average!
            if column != "null" and i in columns_to_avg:
                entry_value = float(column)  # Convert string to number
                # Update sum and count for this column
                col_sums[i] += entry_value
                col_counts[i] += 1

# Calculate final averages for each column, ignoring null entries
averages = [col_sums[i] / col_counts[i] for i in columns_to_avg if col_counts[i]]
print(averages)
modified from https://docs.python.org/2/library/csv.html
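Since the question asks for one average per row rather than per column, here is a minimal per-row sketch; it assumes nulls appear as the literal string "null" and skips anything non-numeric, such as a row label (an assumption, since the example file isn't shown):
import csv

row_averages = []
with open('example.csv', newline='') as csvfile:
    for row in csv.reader(csvfile):
        values = []
        for cell in row:
            if cell != "null":
                try:
                    values.append(float(cell))  # keep only numeric entries
                except ValueError:
                    pass  # skip row labels and other non-numeric cells
        if values:
            # The divisor counts only the non-null entries
            row_averages.append(sum(values) / len(values))

print(row_averages)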
I'm using PyQt5 and want to compare values from a csv file with values entered by the user through QLineEdit(). Then, if the values are the same, I want the whole row imported into a QTableWidget.
The csv file contains 3 different columns, with width values, height values and thickness values.
I've tried this to solve the first problem:
import csv

with open('csvTest.csv') as file:
    reader = csv.reader(file)
    for row in reader:
        if row[0] == self.widthTextbox.text() or row[1] == self.heightTextbox.text() or row[2] == self.thickTextbox.text():
            print("Found: {}".format(row))
This didn't work, and I know that using "or" is problematic: I want this to act like a filter, so if the user inputs only one of the three attributes he gets some rows, if he inputs two he gets fewer rows, and if he inputs all three he gets fewer still. But "or" lets through any line that matches any single condition.
The second problem is that, if this worked, I'd like to make the number of rows in the table equal to the number of rows that passed through the filter, using something like self.tableWidget.setRowCount('''number of rows found''').
Finally, the last issue would be to make the QTableWidget rows identical to the ones that the filter found.
To solve the first and second issues, this could be a way:
import csv
from collections import Counter

rows_finded = []
with open('csvTest.csv') as file:
    reader = csv.reader(file)
    for row in reader:
        values = [self.widthTextbox.text(), self.heightTextbox.text(), self.thickTextbox.text()]
        if Counter(values) == Counter(row):
            rows_finded.append(row)

self.tableWidget.setRowCount(len(rows_finded))
To solve the last issue (source: Python - PyQt - QTable Widget - adding rows):
from PyQt5.QtWidgets import QTableWidgetItem

for i, row in enumerate(rows_finded):
    for j, col in enumerate(row):
        item = QTableWidgetItem(col)
        self.tableWidget.setItem(i, j, item)
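Note that Counter(values) == Counter(row) only matches when all three textboxes are filled in and the row holds exactly those values. If the progressive filtering described in the question is wanted (fewer filled fields matching more rows), a hedged sketch along these lines, reusing the same widget names, might work:
import csv

# Map each column index to the text the user typed for it
filters = {
    0: self.widthTextbox.text(),
    1: self.heightTextbox.text(),
    2: self.thickTextbox.text(),
}
# Keep only the fields the user actually filled in
active = {col: text for col, text in filters.items() if text}

rows_finded = []
with open('csvTest.csv') as file:
    for row in csv.reader(file):
        # A row passes if every filled-in textbox matches its column
        if all(row[col] == text for col, text in active.items()):
            rows_finded.append(row)

self.tableWidget.setRowCount(len(rows_finded))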
I am pretty new to Python, so I am possibly overlooking an easy solution, but everything I have tried thus far has been fruitless.
I have hundreds of CSV files with identical format. The format I have is
--Name of File (unimportant)
--Single Number Value (unimportant)
--Important Row of Column Names
--Two More Rows of Unimportant Formatting Garbage
--Thousands of Rows of Important Data
--Several Blank Rows
--Thousands of Rows of Unimportant Garbage Again
I need to format it so that I can easily grab the column names and the important data underneath. The format is fixed so that the column names are always on row 5 and the data always starts on row 8, but the amount of data can vary from several hundred to several thousand rows.
EDIT: I got the exact row number of the heading wrong. Also, I forgot to mention that I need to save the result to a dataframe for future analysis.
This is an image of the top of the csv file
This is an image of the bottom of the csv file. Note that when it switches from 'important data' to 'unimportant data' the number of columns increases, which might make programming difficult.
You can use the code below. It takes the column names from line number 5, reads the data values starting at line number 8, and stops where it encounters a blank line, collecting everything into a dataframe.
import csv
import pandas as pd

column_names = []
data_rows = []

with open("C:/Users/user/PycharmProjects/spacysample/MrX.csv", 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    for row in csvreader:
        # Strip trailing empty cells left over from the wider garbage columns
        while row and row[-1] == '':
            row.pop()
        if csvreader.line_num == 5:
            # Line 5 holds the column names
            print("THE COLUMN NAMES IN LINE NUMBER 5 ARE...........")
            print(', '.join(row))
            column_names = row
        elif csvreader.line_num == 8:
            # Line 8 is the first data row
            print("**********************************************************")
            print("THE DATA VALUES STARTING FROM LINE NUMBER 8 ARE...........")
            print(', '.join(row))
            data_rows.append(row)
        elif csvreader.line_num > 8:
            if not row:
                # First blank line after the data: stop reading
                print('Loop breaks at line number: ' + str(csvreader.line_num))
                break
            print(', '.join(row))
            data_rows.append(row)

# Collect everything into a dataframe for later analysis
df = pd.DataFrame(data_rows, columns=column_names)
Hope this does exactly what you want.
import pandas as pd
# header is 0-based: file row 5 is index 4, and [2:] drops
# the two formatting rows between the header and the data
df = pd.read_csv('path_to_your_csv', header=4)[2:]
# List columns
df.columns
In case you don't have pandas: pip install pandas
read_csv docs : https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
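Because the garbage block at the bottom has more columns than the data, read_csv may also choke on those lines. A hedged variation that skips them and truncates at the first blank row (assumes pandas >= 1.3 for the on_bad_lines argument):
import pandas as pd

# header=4: file row 5 (0-based index 4) holds the column names;
# skiprows drops the two formatting rows beneath the header
df = pd.read_csv('path_to_your_csv', header=4, skiprows=[5, 6],
                 skip_blank_lines=False,  # keep the blank separator rows
                 on_bad_lines='skip')     # drop the wider garbage rows

# Cut everything from the first completely blank row onward
blank_rows = df.index[df.isnull().all(axis=1)]
if len(blank_rows) > 0:
    df = df.loc[:blank_rows[0] - 1]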
I want to compare each row of a CSV file with itself and every other row within a column. For example, if the column values are like this:
Value_1
Value_2
Value_3
The code should pick Value_1 and compare it with Value_1 (yes, with itself too), Value_2 and then with Value_3. Then it should pick up Value_2 and compare it with Value_1, Value_2, Value_3, and so on.
I've written following code for this purpose:
import csv

csvfile = r"c:\temp\temp.csv"  # raw string so \t isn't read as a tab character
with open(csvfile, newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for compare_row in reader:
            if row == compare_row:
                print(row, 'is equal to', compare_row)
            else:
                print(row, 'is not equal to', compare_row)
The code gives the following output:
['Value_1'] is not equal to ['Value_2']
['Value_1'] is not equal to ['Value_3']
The code compares Value_1 to Value_2 and Value_3 and then stops. The outer loop never picks up Value_2 or Value_3; in short, it appears to iterate over only the first row of the CSV file before stopping.
Also, I can't compare Value_1 to itself using this code. Any suggestions for the solution?
I would have suggested loading the CSV into memory but this is not an option considering the size.
Instead, think of it like a SQL statement: for every row in the left table you want to match it against every row in the right table. So you scan the left table once, re-scanning the right table from the top for each left row, until the left side reaches EOF.
import csv

with open(csvfile, newline='') as f_left:
    reader_left = csv.reader(f_left, delimiter=',')
    with open(csvfile, newline='') as f_right:
        reader_right = csv.reader(f_right, delimiter=',')
        for row in reader_left:
            for compare_row in reader_right:
                if row == compare_row:
                    print(row, 'is equal to', compare_row)
                else:
                    print(row, 'is not equal to', compare_row)
            f_right.seek(0)  # rewind the right-hand file for the next left row
Try the built-in itertools package from the Python standard library:
from itertools import product

with open("abcTest.txt") as inputFile:
    aList = inputFile.read().splitlines()

aProduct = product(aList, aList)

for aElem, bElem in aProduct:
    if aElem == bElem:
        print(aElem, 'is equal to', bElem)
    else:
        print(aElem, 'is not equal to', bElem)
The problem you are facing is a Cartesian product: every row of the data compared with itself and every other row.
Reading the source multiple times will cause a significant performance hit if the file is big.
You could instead store the data in a list and iterate over it multiple times, but that also carries overhead.
The itertools package is useful here because it is optimized for exactly this kind of problem.
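Applied to the CSV file from the question, the same idea might look like this (a sketch assuming the file fits in memory; temp.csv stands in for the asker's path):
import csv
from itertools import product

with open("temp.csv", newline='') as f:
    rows = list(csv.reader(f))

# Compare every row with itself and every other row
for row, compare_row in product(rows, repeat=2):
    if row == compare_row:
        print(row, 'is equal to', compare_row)
    else:
        print(row, 'is not equal to', compare_row)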
I am relatively new to python and I am running into a lot of issues.
I am trying to create a graph using two columns from a csv file that contains many null values. Is there a way to convert a null value to a zero or delete the row that contains null values in certain columns?
Your question as asked is underspecified, but I think if we pick a concrete example, you should be able to figure out how to adapt it to your actual use case.
So, let's say your values are all either a string representation of a float, or an empty string representing null:
A,B
1.0,2.0
2.0,
,3.0
4.0,5.0
And let's say you're reading this using a csv.reader, and you're explicitly handling the rows one by one with some do_stuff_with function:
import csv

with open('foo.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for row in reader:
        a, b = map(float, row)
        do_stuff_with(a, b)
Now, if you want to treat null values as 0.0, you just need to replace float with a function that returns float(x) for non-empty x, and 0.0 for empty x:
def nullable_float(x):
    return float(x) if x else 0.0
with open('foo.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for row in reader:
        a, b = map(nullable_float, row)
        do_stuff_with(a, b)
If you want to skip any rows that contain a null value in column B, you just check column B before doing the conversion:
with open('foo.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for row in reader:
        if not row[1]:  # null in column B: skip this row
            continue
        a, b = map(nullable_float, row)
        do_stuff_with(a, b)
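If pandas is an option, both behaviors are one-liners (a sketch assuming the same foo.csv layout; read_csv parses empty fields as NaN):
import pandas as pd

df = pd.read_csv('foo.csv')
zero_filled = df.fillna(0.0)        # treat null values as zero
dropped = df.dropna(subset=['B'])   # or drop rows where column B is null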
I have 2 excel files: IDList.csv and Database.csv. IDList contains a list of 300 ID numbers that I want to filter out of the Database, which contains 2000 entries (leaving 1700 entries in the Database).
I tried writing a for loop (for each ID in IDList, filter out that ID in Database.csv) but am having some trouble with the filter function. I am using Pyvot (http://packages.python.org/Pyvot/tutorial.html). I get a syntax error: Python/Pyvot doesn't like my syntax for xl.filter, and I can't figure out how to correct it. This is what the documentation says:
xl.tools.filter(func, range)
Filters rows or columns by applying func to the given range. func is called for each value in the range. If it returns False, the corresponding row / column is hidden. Otherwise, the row / column is made visible.
range must be a row or column vector. If it is a row vector, columns are hidden, and vice versa.
Note that, to unhide rows / columns, range must include hidden cells. For example, to unhide a range:
xl.filter(lambda v: True, some_vector.including_hidden)
And here's my code:
import xl

IDList = xl.Workbook("IDList.xls").get("A1:A200").get()
for i in range(1,301):
    xl.filter(!=IDList[i-1], "A1:A2000")
How can I filter a column in Database.csv using criteria in IDList.csv? I am open to solutions in Python or an Excel VBA macro, although I prefer Python.
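For what it's worth, the syntax error comes from !=IDList[i-1], which is not a valid Python expression; per the signature quoted above, xl.filter expects a callable. A hedged guess at the intended call (untested, and assuming the range argument works the way the original snippet implies):
import xl

IDList = xl.Workbook("IDList.xls").get("A1:A200").get()
unwanted = set(IDList)
# filter expects a function: hide rows whose ID appears in IDList
xl.filter(lambda v: v not in unwanted, "A1:A2000")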
import csv

with open("IDList.csv", newline='') as inf:
    incsv = csv.reader(inf)
    not_wanted = set(row[0] for row in incsv)

with open("Database.csv", newline='') as inf, open("FilteredDatabase.csv", "w", newline='') as outf:
    incsv = csv.reader(inf)
    outcsv = csv.writer(outf)
    outcsv.writerows(row for row in incsv if row[0] not in not_wanted)
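A pandas alternative, in case the result should feed further analysis (a sketch assuming the ID sits in the first column and neither file has a header row):
import pandas as pd

ids = pd.read_csv("IDList.csv", header=None)[0]
db = pd.read_csv("Database.csv", header=None)

# Keep only the rows whose first column is not in the ID list
filtered = db[~db[0].isin(ids)]
filtered.to_csv("FilteredDatabase.csv", index=False, header=False)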