Removing empty lines from csv file using Python

I am creating a code that will calculate means of the first 5 rows. However, I cannot think of a way to remove a row if it was initially left empty. Here is the sketch of my code. Sorry if it is primitive, I am still a novice. Thanks!
import csv
import statistics

with open('Test.csv') as file:
    data = csv.reader(file, delimiter=',')
    sample1 = []
    sample2 = []
    sample3 = []
    sample4 = []
    sample5 = []
    # I was trying to do something like that but then
    # I receive error message that states that statistics.mean requires at least
    # one value.
    # (for row in data:
    #     if row:
    #         some = row[1])
    for row in data:
        sp1 = row[0]
        sample1.append(sp1)
        sample1 = [int(x) for x in sample1]
        sp2 = row[1]
        sample2.append(sp2)
        sample2 = [int(x) for x in sample2]
        sp3 = row[2]
        sample3.append(sp3)
        sample3 = [int(x) for x in sample3]
        sp4 = row[3]
        sample4.append(sp4)
        sample4 = [int(x) for x in sample4]
        sp5 = row[4]
        sample5.append(sp5)
        sample5 = [int(x) for x in sample5]
    mean1 = statistics.mean(sample1)
    mean2 = statistics.mean(sample2)
    mean3 = statistics.mean(sample3)
    mean4 = statistics.mean(sample4)
    mean5 = statistics.mean(sample5)
    print(mean1)
    print(mean2)
    print(mean3)
    print(mean4)
    print(mean5)

Here's a cleaner way of doing it:
import csv
import statistics

fromFile = []
with open('sample.csv', 'r') as fi:
    data = csv.reader(fi, delimiter=',')
    first = True
    for row in data:
        if first:       # skip the header row
            first = False
            continue
        if not any(field != '' for field in row):   # skip rows where every field is empty
            continue
        fromFile.append(row)

print(statistics.mean([int(item[1]) for item in fromFile]))
Sample CSV file:
name, age
bob,9
rachel,90
,,,,
joe,5
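If you want the mean of each of the first five columns, as the original sketch tries to do, here is a minimal sketch along the same lines, assuming Test.csv has five numeric columns and no header row:

import csv
import statistics

columns = [[], [], [], [], []]            # one list per column
with open('Test.csv', newline='') as f:
    for row in csv.reader(f):
        # skip rows that are completely blank (no fields, or every field empty)
        if not any(field.strip() for field in row):
            continue
        for i in range(5):
            columns[i].append(int(row[i]))

for i, col in enumerate(columns, start=1):
    print('mean of column', i, 'is', statistics.mean(col))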

Related

Comparing Data Inside CSV Files

I am brand new to Python, so go easy on me!
I am simply trying to compare the values in two lists and get an output for them in yes or no form.
Here is an image of the CSV file the values are stored in:
My code looks like this:
import csv

f = open("datatest.csv")
reader = csv.reader(f)
dataListed = [row for row in reader]
rc = csv.writer(f)

column1 = []
for row in dataListed:
    column1.append(row[0])

column2 = []
for row in dataListed:
    column2.append(row[1])

for row in dataListed:
    if (column1) > (column2):
        print("yes,")
    else:
        print("no,")
Currently, the output is just no, no, no, no, no ..., when it should not look like that based on the values!
I will show below what I have attempted, any help would be huge.
for row in dataListed:
    if (column1) > (column2):
        print("yes,")
    else:
        print("no,")
This loop is comparing the same column1 and column2 variables each time through the loop. Those variables never change.
The code does not magically know that column1 and column2 actually mean "the first column in the current row" and "the second column in the current row".
Presumably you meant something like this instead:
if row[0] > row[1]:
... because this actually does use the current value of row for each loop iteration.
You can simplify your code and obtain the expected result with something like this:
import csv
from pathlib import Path

dataset = Path('dataset.csv')
with dataset.open() as f:
    reader = csv.reader(f)
    headers = next(reader)
    for col1, col2 in reader:
        print(col1, col2, 'yes' if int(col1) > int(col2) else 'no', sep=',')
For the sample CSV you posted in the image, the output would be the following:
1,5,no
7,12,no
11,6,yes
89,100,no
99,87,yes
Here is a simple alternative for your program.
f = open("sample.csv")
for each in f:
    header = ""
    row = each.split(",")
    try:
        column1 = int(row[0])
        column2 = int(row[1])
        if column1 > column2:
            print(f"{column1}\t{column2}\tYes")
        else:
            print(f"{column1}\t{column2}\tNo")
    except ValueError:
        # the header line cannot be converted to int, so build and print the table header here
        header = each.split(",")
        head = f"{header[0]}\t{header[1].strip()}\tresult"
        print(head)
        print((len(head) + 5) * "-")

Reading CSV in Python and Outputting Specific Data

I am working on a python script that reads user input and returns values from the CSV. I am able to return all values, but I only need a few. There are many columns in the CSV, examples are:
LOC_NBR LOC_NAME ALPHA_CODE FRANCHISE_TYPE FRANCHISEE_LAST_NAME
My code is below; what could I add to it to only pull the data for, say, LOC_NBR, LOC_NAME, and FRANCHISE_TYPE? Right now, if I change the print statement, I get a data type error because the fields are STR in the csv.
import csv

store_Num = input("Enter 5-Digit Store Number: ")
with open('StoreDirectory.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    found = False
    for line in reader:
        if line[0] == store_Num:
            print(line)
            found = True
            break
    if not found:
        print(store_Num, "not found")
Using Python csv:
cat csv_test.csv
first,second
1, 1
3, 4
import csv
with open("csv_test.csv") as csv_file:
c = csv.DictReader(csv_file)
for row in c:
if int(row["first"]) == 3:
print(row["first"], row["second"])
3 4
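Applied to the lookup in the question, here is a sketch along the same lines that prints only the wanted fields by header name, assuming StoreDirectory.csv has a header row with columns named exactly LOC_NBR, LOC_NAME and FRANCHISE_TYPE:

import csv

store_Num = input("Enter 5-Digit Store Number: ")
with open('StoreDirectory.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if row['LOC_NBR'] == store_Num:
            # DictReader yields strings, so no type conversion is needed for printing
            print(row['LOC_NBR'], row['LOC_NAME'], row['FRANCHISE_TYPE'])
            break
    else:
        print(store_Num, "not found")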
The fastest (or most convenient) way to do this might be to use the pandas module and import the data into a dataframe.
import pandas as pd
df = pd.read_csv('data.csv')
From here you can extract rows or columns as you like:
column = "column_name"
row = 2
print(df[column][row])
Ideally the CSV has column headers, which makes this straightforward.
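For the specific columns named in the question, one possible sketch with pandas loads only those columns; reading LOC_NBR as a string is an assumption, made so it compares directly against the typed store number:

import pandas as pd

# load only the wanted columns; LOC_NBR stays a string so any leading zeros are kept
df = pd.read_csv('StoreDirectory.csv',
                 usecols=['LOC_NBR', 'LOC_NAME', 'FRANCHISE_TYPE'],
                 dtype={'LOC_NBR': str})

store_Num = input("Enter 5-Digit Store Number: ")
print(df[df['LOC_NBR'] == store_Num])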

How to get around a NumPy error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"

The code below is being used to analyse a csv file. At the moment I'm trying to remove the columns of the array that are not in my check_list: it only checks the first row, and if the first-row entry of a particular column doesn't belong to the check_list it removes the entire column. But this error keeps getting thrown and I'm not sure how to avoid it.
import numpy as np

def load_metrics(filename):
    """opens a csv file and returns stuff"""
    check_list = ["created_at", "tweet_ID", "valence_intensity", "anger_intensity",
                  "fear_intensity", "sadness_intensity", "joy_intensity",
                  "sentiment_category", "emotion_category"]
    file = open(filename)
    data = []
    for lin in file:
        lin = lin.strip()
        lin = lin.split(",")
        data.append(lin)
    for col in range(len(data[0])):
        if np.any(data[0][col] not in check_list) == True:
            data[0] = np.delete(np.array(data), col, 1)
            print(col)
    return np.array(data)
The test below is being run on the code too:
data = load_metrics("covid_sentiment_metrics.csv")
print(data[0])
Test results:
['created_at' 'tweet_ID' 'valence_intensity' 'anger_intensity'
'fear_intensity' 'sadness_intensity' 'joy_intensity' 'sentiment_category'
'emotion_category']
Change your load_metrics function to:
def load_metrics(filename):
    check_list = ["created_at", "tweet_ID", "valence_intensity", "anger_intensity",
                  "fear_intensity", "sadness_intensity", "joy_intensity",
                  "sentiment_category", "emotion_category"]
    data = []
    with open(filename, 'r') as file:
        for lin in file:
            lin = lin.strip()
            lin = lin.split(",")
            data.append(lin)
    arr = np.array(data)
    colFilter = []
    for col in arr[0]:
        colFilter.append(col in check_list)
    return arr[:, colFilter]
I introduced the following corrections:
Use with to automatically close the input file (your code fails to close it).
Create a "full" Numpy array (all columns) after the data has been read.
Compute colFilter list - which columns are in check_list.
Return only filtered columns.
Read columns by checklist
This code does not include checks for a missing file or a broken data structure, so that the main idea stays clear. Here I assume that the csv file exists and has at least 2 lines:
import numpy as np

def load_metrics(filename, check_list):
    """open a csv file and return data as numpy.array
    with columns from a check list"""
    data = []
    with open(filename) as file:
        headers = file.readline().rstrip("\n").split(",")
        for line in file:
            data.append(line.rstrip("\n").split(","))
    col_to_remove = []
    for col in reversed(range(len(headers))):
        if headers[col] not in check_list:
            col_to_remove.append(col)
            headers.pop(col)
    data = np.delete(np.array(data), col_to_remove, 1)
    return data, headers
Quick testing:
test_data = """\
hello,some,other,world
1,2,3,4
5,6,7,8
"""
with open("test.csv",'w') as f:
f.write(test_data)
check_list = ["hello","world"]
d, h = load_metrics("test.csv", check_list)
print(d, h)
Expected output:
[['1' '4']
['5' '8']] ['hello', 'world']
Some details:
Instead of np.any(data[0][col] not in check_list) == True, a plain data[0][col] not in check_list is enough.
Stripping with default parameters is not a good idea, since you can delete meaningful spaces.
Do not delete anything while looping forward over a sequence; we can do it (with some reservations) while looping backward.
check_list is better as a parameter.
Separate data and headers as they may have different types.
In your case it is better to use pandas.read_csv.
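A minimal sketch of that route, reusing the check_list and the covid_sentiment_metrics.csv file name from the question; usecols keeps only the listed columns:

import pandas as pd

check_list = ["created_at", "tweet_ID", "valence_intensity", "anger_intensity",
              "fear_intensity", "sadness_intensity", "joy_intensity",
              "sentiment_category", "emotion_category"]

# read_csv drops every column that is not in usecols
df = pd.read_csv("covid_sentiment_metrics.csv", usecols=check_list)
headers = list(df.columns)        # kept in file order
data = df.to_numpy()              # plain NumPy array of the remaining values
print(headers)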

How to find max and min values within lists without using maps/SQL?

I'm learning Python and have a data set (csv file). I've been able to split the lines by comma, but now I need to find the max and min values in the third column and output the corresponding value from the first column in the same row.
This is the .csv file: https://www.dropbox.com/s/fj8tanwy1lr24yk/loan.csv?dl=0
I also can't use Pandas or any external libraries; I think it would have been easier if I could use them.
I have written this code so far:
f = open("loanData.csv", "r")
mylist = []
for line in f:
    mylist.append(line)

newdata = []
for row in mylist:
    data = row.split(",")
    newdata.append(data)
I'd use the built-in csv library for parsing your CSV file, and then just generate a list with the 3rd column values in it:
import csv

with open("loanData.csv", "r") as loanCsv:
    loanCsvReader = csv.reader(loanCsv)
    # Comment out if no headers
    next(loanCsvReader, None)
    # convert to float so max/min compare the values numerically rather than as strings
    loan_data = [float(row[2]) for row in loanCsvReader]

max_val = max(loan_data)
min_val = min(loan_data)

print("Max: {}".format(max_val))
print("Min: {}".format(min_val))
I don't know the details of your file, i.e. whether it has a header row or not, but you can comment out
next(loanCsvReader, None)
if you don't have any headers present.
Something like this might work. The index would start at zero, so the third column should be 2.
# convert to float so the values compare numerically (assumes mylist holds data rows only)
min_val = min(float(row.split(',')[2]) for row in mylist)
max_val = max(float(row.split(',')[2]) for row in mylist)
Separately, you could probably read and reformat your data to a list with the following:
with open('loanData.csv', 'r') as f:
    data = f.read()
mylist = list(data.split('\n'))
This assumes that each row of data ends with a newline (\n); the exact line ending can differ depending on the OS the file was written on (for example \r\n on Windows).
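To also get the corresponding value from the first column of the same row, as the question asks, here is a minimal sketch using only the built-in csv module, assuming loanData.csv has a header row and a numeric third column:

import csv

with open("loanData.csv", newline='') as f:
    reader = csv.reader(f)
    next(reader)                              # skip the header row
    rows = [row for row in reader if row]     # ignore blank lines

# choose rows by the numeric value of their third column
row_with_max = max(rows, key=lambda row: float(row[2]))
row_with_min = min(rows, key=lambda row: float(row[2]))

print("Max:", row_with_max[2], "->", row_with_max[0])
print("Min:", row_with_min[2], "->", row_with_min[0])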

Dealing with strings amidst int in csv file, None value

I'm reading in data from a csv file where some of the values are "None". The values that are being read in are then contained in a list.
The list is then passed to a function which requires all values within the list to be in int() format.
However, I can't apply this with the "None" string values present. I've tried replacing "None" with None, or with "", but that hasn't worked; it results in an error. The data in the list also needs to stay in the same position, so I can't just ignore it altogether.
I could replace all "None" with 0, but None != 0 really.
EDIT: I've added my code so hopefully it'll make a bit more sense. I'm trying to create a line chart from data in a csv file:
import csv
import sys
from collections import Counter
import pygal
from pygal.style import LightSolarizedStyle
from operator import itemgetter

# Read in file to data variable and set header variable
filename = sys.argv[1]
data = []
with open(filename) as f:
    reader = csv.reader(f)
    header = reader.next()
    data = [row for row in reader]

# count rows in spreadsheet (minus header)
row_count = (sum(1 for row in data)) - 1

# extract headers which I want to use
headerlist = []
for x in header[1:]:
    headerlist.append(x)

# initialise line chart in module pygal. set style, title, and x axis labels using headerlist variable
line_chart = pygal.Line(style=LightSolarizedStyle)
line_chart.title = 'Browser usage evolution (in %)'
line_chart.x_labels = map(str, headerlist)

# create lists for data from spreadsheet to be put in to
empty1 = []
empty2 = []

# select which data i want from spreadsheet
for dataline in data:
    empty1.append(dataline[0])
    empty2.append(dataline[1:-1])

# DO SOMETHING TO "NONE" VALUES IN EMPTY TWO SO THEY CAN BE PASSED TO INT CONVERTER ASSIGNED TO EMPTY 3
# convert all items in the lists, that are in the list of empty two to int
empty3 = [[int(x) for x in sublist] for sublist in empty2]

# add data to chart line by line
count = -1
for dataline in data:
    while count < row_count:
        count += 1
        line_chart.add(empty1[count], [x for x in empty3[count]])

# function that only takes int data
line_chart.render_to_file("browser.svg")
There will be a lot of inefficiencies or weird ways of doing things; I'm still slowly learning.
The above script gives this chart:
With all the Nones set to 0, but this doesn't really reflect the fact that Chrome didn't exist before a certain date. Thanks
Without seeing your code, I can only offer limited help.
It sounds like you need to utilize ast.literal_eval().
import ast
import csv

csvread = csv.reader(file)   # 'file' is assumed to be an already-open csv file object
values = []
for row in csvread:
    values.append(ast.literal_eval(row[0]))
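As a more direct fix for the lists in the question, here is a minimal sketch that converts each field of empty2 to int while mapping "None" (or blank) fields to None, so every value keeps its position; pygal generally treats None entries as missing points rather than zeros:

def to_int_or_none(value):
    """Convert one csv field to int; map 'None' or empty strings to None."""
    value = value.strip()
    if value in ("", "None"):
        return None
    return int(value)

# same shape as empty2, so the rows still line up with their labels
empty3 = [[to_int_or_none(x) for x in sublist] for sublist in empty2]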
