Comparing Data Inside CSV Files - Python

I am brand new to Python, so go easy on me!
I am simply trying to compare the values in two lists and get an output for them in yes or no form.
Image of the CSV file the values are stored in:
My code looks like this:
import csv

f = open("datatest.csv")
reader = csv.reader(f)
dataListed = [row for row in reader]
rc = csv.writer(f)

column1 = []
for row in dataListed:
    column1.append(row[0])

column2 = []
for row in dataListed:
    column2.append(row[1])

for row in dataListed:
    if (column1) > (column2):
        print("yes,")
    else:
        print("no,")
Currently, the output is just no, no, no, no, no ..., when it should not look like that based on the values!
I will show below what I have attempted, any help would be huge.

for row in dataListed:
    if (column1) > (column2):
        print("yes,")
    else:
        print("no,")
This loop is comparing the same column1 and column2 variables each time through the loop. Those variables never change.
The code does not magically know that column1 and column2 actually mean "the first column in the current row" and "the second column in the current row".
Presumably you meant something like this instead:
if row[0] > row[1]:
... because this actually does use the current value of row for each loop iteration.
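Also note that csv.reader yields every field as a string, so even row[0] > row[1] compares lexicographically ("7" > "12" is True); convert to int first. A minimal sketch of the corrected in-loop comparison, using the values from the question's sample:

```python
# csv.reader yields every field as a string; "7" > "12" is True because the
# comparison is lexicographic, so convert to int before comparing.
rows = [["1", "5"], ["7", "12"], ["11", "6"]]
for a, b in rows:
    print("yes," if int(a) > int(b) else "no,")
```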

You can simplify your code and obtain the expected result with something like this:
import csv
from pathlib import Path

dataset = Path('dataset.csv')
with dataset.open() as f:
    reader = csv.reader(f)
    headers = next(reader)
    for col1, col2 in reader:
        print(col1, col2, 'yes' if int(col1) > int(col2) else 'no', sep=',')
For the sample CSV you posted in the image, the output would be the following:
1,5,no
7,12,no
11,6,yes
89,100,no
99,87,yes

Here is a simple alternative for your program.
with open("sample.csv") as f:
    for each in f:
        row = each.split(",")
        try:
            column1 = int(row[0])
            column2 = int(row[1])
            if column1 > column2:
                print(f"{column1}\t{column2}\tYes")
            else:
                print(f"{column1}\t{column2}\tNo")
        except ValueError:
            # The header row cannot be converted to int, so print it here.
            header = each.split(",")
            head = f"{header[0]}\t{header[1].strip()}\tresult"
            print(head)
            print((len(head) + 5) * "-")

Related

How to delete a row in a CSV file if a cell is empty using Python

I want to go through large CSV files and, if there is missing data, remove that row completely. This is row specific: if a cell is 0 or has no value, I want to remove the entire row. This should apply to all the columns, so if any column has a blank cell the row should be deleted, returning the corrected data in a corrected CSV.
import csv

with open('data.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print(row)
        if not row[0]:
            print("12")
This is what I found and tried, but it does not seem to be working, and I don't have any ideas about how to approach this problem. Help, please?
Thanks!
Due to the way in which CSV reader presents rows of data, you need to know how many columns there are in the original CSV file. For example, if the CSV file content looks like this:
1,2
3,
4
Then the lists returned by iterating over the reader would look like this:
['1','2']
['3','']
['4']
As you can see, the third row only has one column whereas the first and second rows have 2 columns albeit that one is (effectively) empty.
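This behavior is easy to check without a file by feeding the reader an io.StringIO (a sketch of the example above):

```python
import csv
import io

# The same three-line CSV from above, as an in-memory string.
data = "1,2\n3,\n4\n"
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # the second row has an empty field; the third has only one field
```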
This function allows you to either specify the number of columns (if you know them before hand) or allow the function to figure it out. If not specified then it is assumed that the number of columns is the greatest number of columns found in any row.
So...
import csv

DELIMITER = ','

def valid_column(col):
    try:
        return float(col) != 0
    except ValueError:
        pass
    return len(col.strip()) > 0

def fix_csv(input_file, output_file, cols=0):
    if cols == 0:
        with open(input_file, newline='') as indata:
            cols = max(len(row) for row in csv.reader(indata, delimiter=DELIMITER))
    with open(input_file, newline='') as indata, open(output_file, 'w', newline='') as outdata:
        writer = csv.writer(outdata, delimiter=DELIMITER)
        for row in csv.reader(indata, delimiter=DELIMITER):
            if len(row) == cols:
                if all(valid_column(col) for col in row):
                    writer.writerow(row)

fix_csv('original.csv', 'fixed.csv')
Maybe like this:

import csv

with open('data.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    data = list(csvreader)
data = [x for x in data if '' not in x and '0' not in x]

You can then rewrite the CSV file if you like.
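Rewriting the filtered rows can be sketched like this (the data list here is a hypothetical stand-in for the filtered result, and data_fixed.csv is an assumed output name):

```python
import csv

# Hypothetical filtered rows, as produced by the list comprehension above.
data = [['name', 'age'], ['bob', '9'], ['rachel', '90']]

# newline='' avoids extra blank lines on Windows when writing CSV.
with open('data_fixed.csv', 'w', newline='') as out:
    csv.writer(out).writerows(data)
```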
Instead of using csv, you could use the pandas module, something like this:

import pandas as pd

df = pd.read_csv('file.csv')
print(df)
index = 1  # index of the row that you want to remove
df = df.drop(index)
print(df)
df.to_csv('file.csv')
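Since the question asks to drop every row that has any missing cell, dropna is probably closer to what is wanted than drop by index; a sketch, using a small hypothetical frame instead of the file:

```python
import pandas as pd

# Hypothetical frame with missing cells; dropna() removes any row containing one.
df = pd.DataFrame({'a': [1, None, 3], 'b': [4, 5, None]})
cleaned = df.dropna()
print(cleaned)
```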

Read Column Values Between Two Rows CSV Python

I have a CSV which is in the format:
Name1,Value1
,Value2
,Value3
Name2,Value40
,Value50
,Value60
Name3,Value5
,Value10
,Value15
There is not a set number of "values" per "name".
There is no pattern to the names.
I want to read the Values for Each Name into a dict such as:
Name1 : [Value1,Value2,Value3]
Name2 : [Value40,Value50,Value60]
etc.
My current code is this:
CSVFile = open("GroupsCSV.csv")
Reader = csv.reader(CSVFile)
for row in Reader:
    if row[0] and row[2]:
        objlist = []
        objlist.append(row[2])
        for row in Reader:
            if not row[0] and row[2]:
                objlist.append(row[2])
            else:
                break
        print(objlist)
This half-works.
It will do Name1, Name3, Name5, Name7, etc.
I can't seem to find a way to stop it skipping.
I would prefer to do this without the use of something like lambda (as it's not something I fully understand yet!).
EDIT: Image of example CSV (the real data has another unnecessary column, hence the row[2] in the code):
Try pandas:
import pandas as pd

df = pd.read_csv('your_file.csv', header=None)
(df.ffill()           # fill the blanks with the previous Name
   .groupby([0])[1]   # collect those with the same name
   .apply(list)       # put those in a list
   .to_dict()         # make a dictionary
)
Output:
{'Name1': ['Value1', 'Value2', 'Value3'],
'Name2': ['Value40', 'Value50', 'Value60'],
'Name3': ['Value5', 'Value10', 'Value15']}
Update: the pure python(3) solution:
with open('your_file.csv') as f:
    lines = f.readlines()

d = {}
for line in lines:
    row = line.strip().split(',')  # strip the trailing newline from each value
    if row[0] != '':
        key = row[0]
        d[key] = []
    d[key].append(row[1])
print(d)
I think the issue you are facing is because of your nested loop. Both loops are pointing to the same iterator. You are starting the second loop after it finds Name1 and breaking it when it finds Name2. By the time the outer loop continues after the break, you have already skipped Name2.
You could have both conditions in the same loop:
# with open("GroupsCSV.csv") as csv_file:
#     reader = csv.reader(csv_file)
reader = [[1, 2, 3], [None, 5, 6]]  # mocking the csv input
objlist = []
for row in reader:
    if row[0] and row[2]:
        objlist.clear()
        objlist.append(row[2])
    elif not row[0] and row[2]:
        objlist.append(row[2])
    print(objlist)
EDIT: I have updated the code to provide a testable output.
The printed output looks as follows:
[3]
[3, 6]
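To get the dict the question actually asks for (name mapped to its list of values), the same single-loop idea can collect into a dictionary instead; a sketch, assuming two columns where the name may be blank:

```python
# Hypothetical rows in the shape described by the question: a name starts a
# group, and blank-name rows continue the current group.
rows = [['Name1', 'Value1'], ['', 'Value2'], ['', 'Value3'],
        ['Name2', 'Value40'], ['', 'Value50']]

groups = {}
current = None
for name, value in rows:
    if name:  # a non-blank name starts a new group
        current = name
        groups[current] = []
    groups[current].append(value)

print(groups)
```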

Removing empty lines from csv file using Python

I am creating a code that will calculate the means of the first 5 rows. However, I cannot think of a way to remove a row if it was initially left empty. Here is a sketch of my code. Sorry if it is primitive, I am still a novice. Thanks!
import csv
import statistics

with open('Test.csv') as file:
    data = csv.reader(file, delimiter=',')
    sample1 = []
    sample2 = []
    sample3 = []
    sample4 = []
    sample5 = []
    # I was trying to do something like that but then
    # I receive an error message that states that statistics.mean
    # requires at least one value.
    # (for row in data:
    #     if row:
    #         some = row[1])
    for row in data:
        sp1 = row[0]
        sample1.append(sp1)
        sample1 = [int(x) for x in sample1]
        sp2 = row[1]
        sample2.append(sp2)
        sample2 = [int(x) for x in sample2]
        sp3 = row[2]
        sample3.append(sp3)
        sample3 = [int(x) for x in sample3]
        sp4 = row[3]
        sample4.append(sp4)
        sample4 = [int(x) for x in sample4]
        sp5 = row[4]
        sample5.append(sp5)
        sample5 = [int(x) for x in sample5]
    mean1 = statistics.mean(sample1)
    mean2 = statistics.mean(sample2)
    mean3 = statistics.mean(sample3)
    mean4 = statistics.mean(sample4)
    mean5 = statistics.mean(sample5)
    print(mean1)
    print(mean2)
    print(mean3)
    print(mean4)
    print(mean5)
Here's a cleaner way of doing it:
import csv
import statistics

fromFile = []
with open('sample.csv', 'r') as fi:
    data = csv.reader(fi, delimiter=',')
    first = True
    for row in data:
        if first:  # skip the header row
            first = False
            continue
        if not any(field != '' for field in row):  # skip entirely empty rows
            continue
        fromFile.append(row)
print(statistics.mean([int(item[1]) for item in fromFile]))
Sample CSV file:
name, age
bob,9
rachel,90
,,,,
joe,5
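Since the question asks for the mean of each of the first five columns, not just one, the kept rows can be transposed with zip; a sketch, assuming the rows have already been read, the header removed, and empty rows skipped:

```python
import statistics

# Hypothetical rows of integers, already cleaned of the header and empty lines.
rows = [[1, 2, 3, 4, 5],
        [3, 4, 5, 6, 7]]

# zip(*rows) transposes rows into columns, giving one mean per column.
means = [statistics.mean(col) for col in zip(*rows)]
print(means)
```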

shorten csv file based on rules python

I am stuck writing the following program.
I have a csv file
"SNo","Column1","Column2"
"A1","X","Y"
"A2","A","B"
"A1","X","Z"
"A3","M","N"
"A1","D","E"
I want to shorten this csv to follow these rules
a.) If the SNo occurs more than once in the file,
combine all column1 and column2 entries of that serial number
b.) If same column1 entries and column2 entries occur more than once,
then do not combine them twice.
Therefore the output of the above should be
"SNo","Column1","Column2"
"A1","X,D","Y,Z,E"
"A2","A","B"
"A3","M","N"
So far I am reading the csv file and iterating the rows, checking if the SNo of the next row is the same as the previous row's. What's the best way to combine them?
import csv

temp = "A1"
col1 = ""
col2 = ""
col3 = ""
with open("C:\\file\\file1.csv", "rb") as f:
    reader = csv.reader(f)
    for row in reader:
        if row[0] == temp:
            continue
        col1 = col1 + row[1]
        col2 = col2 + row[2]
        col3 = col3 + row[3]
        temp = row[0]
        print row[0] + ";" + col1 + ";" + col2 + ";" + col3
        col1 = ""
        col2 = ""
        col3 = ""
Please let me know a good way to do this.
Thanks
The simplest approach is to maintain a dictionary with keys as serial numbers and sets to contain the columns. Then you could do something like the following:
my_dict = {}
for row in reader:
    if row[0] not in my_dict:
        my_dict[row[0]] = [set(), set()]
    my_dict[row[0]][0].add(row[1])
    my_dict[row[0]][1].add(row[2])
Writing the file out (to a file opened as file_out) would be as simple as iterating through the dictionary using a join command:
for k in my_dict.keys():
    file_out.write("{0},\"{1}\",\"{2}\"\n".format(
        k,
        ','.join(my_dict[k][0]),
        ','.join(my_dict[k][1])
    ))
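One caveat: sets do not preserve insertion order, so the combined values may not come out as "X,D" / "Y,Z,E". A sketch that keeps first-seen order by using lists with a membership check, run against the sample data from the question (io.StringIO stands in for the file):

```python
import csv
import io

# The sample CSV from the question, as an in-memory string.
data = ('"SNo","Column1","Column2"\n"A1","X","Y"\n"A2","A","B"\n'
        '"A1","X","Z"\n"A3","M","N"\n"A1","D","E"\n')

groups = {}
reader = csv.reader(io.StringIO(data))
next(reader)  # skip the header
for sno, c1, c2 in reader:
    cols = groups.setdefault(sno, ([], []))
    if c1 not in cols[0]:  # lists keep first-seen order, unlike sets
        cols[0].append(c1)
    if c2 not in cols[1]:
        cols[1].append(c2)

for sno, (c1, c2) in groups.items():
    print('"{0}","{1}","{2}"'.format(sno, ','.join(c1), ','.join(c2)))
```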

Python: General CSV file parsing and manipulation

The purpose of my Python script is to compare the data present in multiple CSV files, looking for discrepancies. The data are ordered, but the ordering differs between files. The files contain about 70K lines, weighing around 15MB. Nothing fancy or hardcore here. Here's part of the code:
def getCSV(fpath):
    with open(fpath, "rb") as f:
        csvfile = csv.reader(f)
        for row in csvfile:
            allRows.append(row)
    allCols = map(list, zip(*allRows))
Am I properly reading from my CSV files? I'm using csv.reader, but would I benefit from using csv.DictReader?
How can I create a list containing whole rows which have a certain value in a precise column?
Are you sure you want to be keeping all rows around? This creates a list with matching values only. fname could also come from glob.glob() or os.listdir() or whatever other data source you choose. Just to note: you mention the 20th column, but row[20] will be the 21st column...
import csv

matching20 = []
for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
    with open(fname) as fin:
        csvin = csv.reader(fin)
        next(csvin)  # <--- if you want to skip the header row
        for row in csvin:
            if row[20] == 'value':
                matching20.append(row)  # or do something with it here
You only want csv.DictReader if you have a header row and want to access your columns by name.
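A minimal DictReader sketch showing the by-name access (the header and column names here are hypothetical, and io.StringIO stands in for the file):

```python
import csv
import io

# Hypothetical CSV with a header row; DictReader maps each row to the header.
data = "name,status\nalpha,value\nbeta,other\n"

matching = []
for row in csv.DictReader(io.StringIO(data)):
    if row['status'] == 'value':  # access the column by header name, not index
        matching.append(row)

print(matching)
```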
This should work, you don't need to make another list to have access to the columns.
import csv

def getCSV(fpath):
    with open(fpath) as ifile:
        csvfile = csv.reader(ifile)
        rows = list(csvfile)
    value_20 = [x for x in rows if x[20] == 'value']
    return value_20
If I understand the question correctly, you want to include a row if value is in the row, but you don't know which column value is, correct?
If your rows are lists, then this should work:
testlist = [row for row in allRows if 'value' in row]
post-edit:
If, as you say, you want a list of rows where value is in a specified column (specified by an integer pos), then:
testlist = []
pos = 20
for row in allRows:
    if row[pos] == 'value':
        testlist.append(row)
(I haven't tested this, but let me know if that works.)
