python: How to clean the csv file


I am new to Python and would like to clean a CSV file for analysis, but I am having a problem with my code.
from csv import reader
def open_dataset(file_name):
    opened_file = open(file_name)
    read_file = reader(opened_file, delimiter=",")
    data = list(read_file)
    return data
def column(filename):
    filename = open_dataset(filename)
    for row in filename:
        print(row)
With the code above, the output looks like this:
['Name;Product;Sales;Country;Website']
[';Milk;30;Germany;something.com']
[';;;USA;']
['Chris;Milk;40;;']
I would like the output to look like this:
['Name','Product','Sales','Country','Website']
[NaN,'Milk','30','Germany','something.com']
[NaN,NaN,NaN,'USA',NaN]
['Chris','Milk',40,NaN,NaN]
I set the delimiter to "," but ";" still appears in the output, and I don't know why. Also, when I tried to replace the empty fields with "NaN", every space was replaced by "NaN" instead.
I would really appreciate any tips.
In the end, I would like to analyse each column, for example the percentage of "NaN" values.
Thank you!

You can get the result that you want by:
specifying ';' as the delimiter when constructing a reader object
passing each row through a function that converts empty cells to 'NaN' (or some other value of your choice)
Here is some sample code:
import csv
def row_factory(row):
    return [x if x != '' else 'NaN' for x in row]
with open(filename, 'r', newline='') as f:
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        print(row_factory(row))
Output:
['Name', 'Product', 'Sales', 'Country', 'Website']
['NaN', 'Milk', '30', 'Germany', 'something.com']
['NaN', 'NaN', 'NaN', 'USA', 'NaN']
['Chris', 'Milk', '40', 'NaN', 'NaN']
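To get to the per-column analysis the question mentions (for example, the percentage of missing values), one possible sketch using only the standard library is shown here. The function name missing_percentages is made up for illustration, and io.StringIO stands in for a real file; the sample rows are the ones shown in the question.

```python
import csv
import io

# Sample rows copied from the question; io.StringIO stands in for a real file.
SAMPLE = """Name;Product;Sales;Country;Website
;Milk;30;Germany;something.com
;;;USA;
Chris;Milk;40;;
"""

def missing_percentages(f, delimiter=';'):
    """Return {column name: percentage of empty cells} for an open CSV file.

    Hypothetical helper, not part of the csv module.
    """
    reader = csv.reader(f, delimiter=delimiter)
    header = next(reader)
    empty_counts = [0] * len(header)
    total_rows = 0
    for row in reader:
        total_rows += 1
        # Ignore any cells beyond the header width
        for i, cell in enumerate(row[:len(header)]):
            if cell == '':
                empty_counts[i] += 1
    if total_rows == 0:
        return {name: 0.0 for name in header}
    return {name: 100.0 * n / total_rows
            for name, n in zip(header, empty_counts)}

result = missing_percentages(io.StringIO(SAMPLE))
for name, pct in result.items():
    print(f'{name}: {pct:.1f}% missing')
```

Counting empty strings directly avoids having to pick a placeholder such as 'NaN' before the analysis.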

You need to specify the correct delimiter:
read_file = reader(opened_file, delimiter=";")
Your CSV file appears to be using a semicolon rather than a comma, so you need to tell reader() what to use.
Tip:
filename = open_dataset(filename)
Don't reassign a variable to mean something else. Before this line executes, filename is a string with the name of the file to open. After this assignment, filename is now a list of rows from the file. Use a different variable name instead:
rows = open_dataset(filename)
Now the two variables are distinct and their meaning is clear from the names. Of course, feel free to use a name other than rows, as long as it is not filename.
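Putting both suggestions together, the question's helper might look roughly like the sketch below. The temporary file and its two rows are only there so the example runs on its own; the delimiter default of ';' is an assumption based on the output shown in the question.

```python
import csv
import os
import tempfile

def open_dataset(file_name, delimiter=';'):
    """Read every row of the CSV file into a list of lists."""
    with open(file_name, newline='') as opened_file:
        return list(csv.reader(opened_file, delimiter=delimiter))

# Write a small temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='') as tmp:
    tmp.write('Name;Product;Sales;Country;Website\n'
              ';Milk;30;Germany;something.com\n')
    path = tmp.name

rows = open_dataset(path)  # path stays a string, rows holds the parsed data
os.remove(path)

for row in rows:
    print(row)
```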

You might want to look into using pandas. It can make data processing a lot easier, and it can read multiple file formats.
If you want to read a CSV:
import pandas as pd
my_file = '/path/to/my_csv.csv'
pd.read_csv(my_file, sep=';')  # sep=';' because this file is semicolon-separated

That's because each of your lists contains only one element, and that element is a single string. To parse that string into a list, you can split it.
This should do what you need:
for row in filename:
    parsed_row = row[0].split(';')
    for i in range(0, len(parsed_row)):
        if parsed_row[i] == '':
            parsed_row[i] = None
    print(parsed_row)
I made you a Repl.it example

Related

how to select a specific column of a csv file in python

I am a Python beginner and would like your opinion.
I wrote this code, which reads the only column in a file on my PC and puts it in a list.
I have difficulty understanding how I could modify the same code for a file that has multiple columns and select only the column I am interested in.
Can you help me?
list = []
with open(r'C:\Users\Desktop\mydoc.csv') as file:
    for line in file:
        item = int(line)
        list.append(item)
results = []
for i in range(0,1086):
    a = list[i-1]
    b = list[i]
    c = list[i+1]
    results.append(b)
print(results)

You can use pandas.read_csv() method very simply like this:
import pandas as pd
my_data_frame = pd.read_csv('path/to/your/data')
results = my_data_frame['name_of_your_wanted_column'].values.tolist()

A useful module for the kind of work you are doing is the imaginatively named csv module.
Many CSV files have a "header" line at the top. By convention, this is a useful way of labeling the columns of your file. Assuming you can insert a line at the top of your CSV file with comma-delimited field names, you could replace your program with something like:
import csv
with open(r'C:\Users\Desktop\mydoc.csv', newline='') as myfile:
    csv_reader = csv.DictReader(myfile)
    for row in csv_reader:
        print(row['column_name_of_interest'])
The above will print to the terminal all the values that match your specific 'column_name_of_interest' after you edit it to match your particular file.
It's normal to work with lots of columns at once, so that dictionary method of packing a whole row into a single object, addressable by column-name can be very convenient later on.
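As a rough sketch of how the DictReader approach could replace the list-building loop from the question, the example below collects one named column into a list of integers. The helper name read_column, the column names, and the sample values are all invented for illustration; io.StringIO stands in for the real file.

```python
import csv
import io

# Hypothetical sample data; replace io.StringIO(...) with open(path, newline='').
SAMPLE = """id,value,label
1,10,a
2,20,b
3,30,c
"""

def read_column(f, column_name, convert=int):
    """Collect one named column from an open CSV file, converting each cell."""
    reader = csv.DictReader(f)
    # fieldnames is read from the first line of the file on first access
    if column_name not in (reader.fieldnames or []):
        raise KeyError(f'no column named {column_name!r}')
    return [convert(row[column_name]) for row in reader]

values = read_column(io.StringIO(SAMPLE), 'value')
print(values)
```

Checking the header first gives a clearer error than the KeyError that would otherwise surface on the first data row.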

For a pure-Python implementation, you should use the csv module.
data.csv
Project1,folder1/file1,data
Project1,folder1/file2,data
Project1,folder1/file3,data
Project1,folder1/file4,data
Project1,folder2/file11,data
Project1,folder2/file42a,data
Project1,folder2/file42b,data
Project1,folder2/file42c,data
Project1,folder2/file42d,data
Project1,folder3/filec,data
Project1,folder3/fileb,data
Project1,folder3/filea,data
Your Python program should read it line by line:
import csv
a = []
with open('data.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        print(row)
        # ['Project1', 'folder1/file1', 'data']
If you print the row, you will see it is a list like this:
['Project1', 'folder1/file1', 'data']
To collect every element of column 1 in my list, I append that element:
a.append(row[1])
Now list a will contain:
['folder1/file1', 'folder1/file2', 'folder1/file3', 'folder1/file4', 'folder2/file11', 'folder2/file42a', 'folder2/file42b', 'folder2/file42c', 'folder2/file42d', 'folder3/filec', 'folder3/fileb', 'folder3/filea']
Here is the complete code:
import csv
a = []
with open('data.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        a.append(row[1])

Accessing Data in csv.reader

I'm trying to access a CSV file of currency pairs using csv.reader. The first column shows dates and the first row shows the currency pairs, e.g. USD/CAD. I can read in the file but cannot access the currency pair data to perform simple calculations.
I've tried using next(x) to skip the header row (currency pairs). If I do this, I get a TypeError: csv reader is not subscriptable.
path = x
file = open(path)
dataset = csv.reader(file, delimiter = '\t',)
header = next(dataset)
header
Output shows the header row which is
['Date,USD,Index,CNY,JPY,EUR,KRW,GBP,SGD,INR,THB,NZD,TWD,MYR,IDR,VND,AED,PGK,HKD,CAD,CHF,SEK,SDR']
I expect to be able to access the underlying currency pairs, but I'm getting the TypeError noted above. Is there a simple way to access the currency pairs? For example, I want to use USD.describe() to get simple statistics on the USD currency pair.
How can I move from this stage to accessing the data underlying the header row?

Try this example:
import csv
with open('file.csv', newline='') as csv_file:
    # The header in the question is comma-separated, so use ',' rather than '\t'
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        print(f'\t{row[0]} {row[1]} {row[3]}')

It's apparent from the output of your header row that the columns are comma-delimited rather than tab-delimited, so instead of passing delimiter = '\t' to csv.reader, you should let it use the default delimiter ',' instead:
dataset = csv.reader(file)
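If the delimiter is not known in advance, one crude option is to count candidate characters in the header line and pick the most frequent one. The sketch below does that and also converts the reader to a list, which sidesteps the "not subscriptable" error because a list supports indexing. The helper names and the sample values are hypothetical, and the heuristic ignores quoting, so treat it as a starting point only.

```python
import csv
import io

def guess_delimiter(header_line, candidates=(',', ';', '\t')):
    """Pick the candidate that occurs most often in the header line.

    A crude heuristic: it ignores quoting and can be fooled by unusual data.
    """
    return max(candidates, key=header_line.count)

def read_rows(f):
    """Read all rows from an open text file using the guessed delimiter."""
    first_line = f.readline()
    delimiter = guess_delimiter(first_line)
    f.seek(0)  # rewind so csv.reader parses the header line as well
    return list(csv.reader(f, delimiter=delimiter))

# Made-up sample values; io.StringIO stands in for open(path, newline='').
sample = io.StringIO("Date,USD,CNY\n"
                     "2019-05-01,0.70,4.72\n"
                     "2019-05-02,0.71,4.73\n")
rows = read_rows(sample)
header, data = rows[0], rows[1:]

# A list of rows can be indexed, unlike the csv.reader object itself.
usd = [float(row[header.index('USD')]) for row in data]
print(header)
print(usd)
```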

If you need to compute some statistics, pandas is your friend. There is no need to use the csv module; use pandas.read_csv.
import pandas
filename = 'path/of/file.csv'
dataset = pandas.read_csv(filename, sep = '\t') #or whatever the separator is
pandas.read_csv uses the first line as the header automatically.
To see statistics, simply do:
dataset.describe()
Or for a single column:
dataset['column_name'].describe()

Are you sure that your delimiter is '\t'? In the first row the delimiter is ','. In any case, you can skip the first row by calling file.readline() before passing the file to csv.reader:
import csv
example = """Date,USD,Index,CNY,JPY,EUR,KRW,GBP,SGD,INR,THB,NZD,TWD,MYR,IDR,VND,AED,PGK,HKD,CAD,CHF,SEK,SDR
1-2-3\tabc\t1.1\t1.2
4-5-6\txyz\t2.1\t2.2
"""
with open('demo.csv', 'w') as f:
    f.write(example)
with open('demo.csv') as f:
    f.readline()
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print(row)
# ['1-2-3', 'abc', '1.1', '1.2']
# ['4-5-6', 'xyz', '2.1', '2.2']
I think that you need something else... Can you add to your question:
example of first 3 lines in your csv
Example of what you'd like to access:
is using row[0], row[1] enough for you?
or do you want "named" access like row['Date'], row['USD'],
or you want something more complex like data_by_date['2019-05-01']['USD']
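For the "named" and nested access styles listed above, a minimal sketch with csv.DictReader might look like this. It assumes a comma-delimited file whose first column is called Date; the function name load_by_date and the rates are invented for the example, and the statistics module stands in for the describe() call the question asks about.

```python
import csv
import io
import statistics

# Made-up sample values; io.StringIO stands in for open(path, newline='').
SAMPLE = """Date,USD,CAD
2019-05-01,0.70,0.94
2019-05-02,0.72,0.95
2019-05-03,0.71,0.96
"""

def load_by_date(f):
    """Return {date: {currency: float}} from an open comma-delimited file."""
    data_by_date = {}
    for row in csv.DictReader(f):
        date = row.pop('Date')  # remaining keys are the currency columns
        data_by_date[date] = {name: float(value)
                              for name, value in row.items()}
    return data_by_date

data_by_date = load_by_date(io.StringIO(SAMPLE))
print(data_by_date['2019-05-01']['USD'])

# Simple statistics for one currency column
usd = [rates['USD'] for rates in data_by_date.values()]
print(min(usd), max(usd), statistics.mean(usd))
```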

Attempting to convert a csv file into a list inside a function using list comprehension

I would like to create a function (utilizing list comprehension) called csv_printer() that takes in the path of the csv file, a list of the columns that I want to print, and the option to change the delimiter that will be printed between each column.
I am starting with a call like csv_printer(path='dssss.csv', columns=['Ext', 'Time Zone', 'Caller ID First Name'], delimiter='$'), which should print something like the following:
1001 $ Asia/Pacific $ VAN
1002 $ Asia/Atlantic $ ALT
My initial thought process is to convert the above code into a list inside of a function to hold it together.
import csv
with open("dssss") as csvfile:
    reader = csv.reader(csvfile)
    def csv_printer(path='dssss.csv', columns=['Ext', 'Time Zone', 'Caller ID First Name'], delimiter='$'):
        for row in reader:
            print dict(row)
No data is printed from the above code. Any help would be appreciated. Thanks!

I think you need to do some research on how to use def.
With that aside you could try something like
import csv
def csv_printer(path='dssss.csv', columns=['Ext', 'Time Zone', 'Caller ID First Name'], delimiter='$'):
    with open(path, 'r') as csvfile:
        content = csv.reader(csvfile)
        #find columns, assumes they exist and are in the first row
        header = [x.strip() for x in next(content)]
        cols = [header.index(x) for x in columns]
        for row in content:
            print(delimiter.join([row[x] for x in cols]))
    return None #you can capture the output and return here if you want
csv_printer()
You will probably need to add some checks, like what happens if the columns you asked for don't exist.
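One way such a check could look is sketched below. Instead of printing, this variant returns the formatted lines via a list comprehension (as the question asked) and raises ValueError when a requested column is missing. The name format_columns, the extra "Other" column, and the default ' $ ' separator are assumptions made for the example.

```python
import csv
import io

def format_columns(f, columns, delimiter=' $ '):
    """Return one string per data row containing only the requested columns."""
    reader = csv.reader(f)
    header = [name.strip() for name in next(reader)]
    # Fail early with a clear message if any requested column is absent
    missing = [name for name in columns if name not in header]
    if missing:
        raise ValueError(f'columns not found in header: {missing}')
    indexes = [header.index(name) for name in columns]
    return [delimiter.join(row[i] for i in indexes) for row in reader]

# Hypothetical sample; io.StringIO stands in for open(path, newline='').
sample = io.StringIO(
    "Ext,Time Zone,Caller ID First Name,Other\n"
    "1001,Asia/Pacific,VAN,x\n"
    "1002,Asia/Atlantic,ALT,y\n"
)
lines = format_columns(sample, ['Ext', 'Time Zone', 'Caller ID First Name'])
for line in lines:
    print(line)
```

Returning the lines rather than printing them makes the function easier to test and reuse.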

How do I specify which columns to print to a text file using Python?

Here's an example of my current output:
['5216', 'SMITH', 'VICTORIA', 'F', '2009-12-19']
This is my code:
users1 = open('users1.txt','w')
with open('users.txt', 'r') as f:
    data = f.readlines()
    for line in data:
        words = str(line.split())
        #print words
    f.seek(0)
    users1.write(words)
I would like to read in users.txt and separate the information to send it to users1 and another text file I'll call users2. (Keep in mind this is a hypothetical situation and that I acknowledge it would not make sense to separate this information like I'm suggesting below.)
Is it possible to identify specific columns I'd like to insert into each text file?
For example, if I wanted users1.txt to contain, using my sample output from above, ['5216','2009-12-19'] and users2.txt to contain ['SMITH','VICTORIA'], what should I do?

You could use slicing to select items from the list. For example,
In [219]: words = ['5216', 'SMITH', 'VICTORIA', 'F', '2009-12-19']
In [220]: [words[0], words[-1]]
Out[220]: ['5216', '2009-12-19']
In [221]: words[1:3]
Out[221]: ['SMITH', 'VICTORIA']
with open('users.txt', 'r') as f,\
     open('users1.txt','w') as users1,\
     open('users2.txt','w') as users2:
    for line in f:
        words = line.split()
        # End each record with a newline so the entries do not run together
        users1.write(str([words[0], words[-1]]) + '\n')
        users2.write(str(words[1:3]) + '\n')
Including the brackets [] in the output is non-standard.
For portability, and proper handling of quoted strings and strings containing the comma delimiter, you would be better off using the csv module:
import csv
# Python 3: open in text mode with newline='' ('rb'/'wb' only work with Python 2's csv module)
with open('users.txt', 'r') as f,\
     open('users1.txt', 'w', newline='') as users1,\
     open('users2.txt', 'w', newline='') as users2:
    writer1 = csv.writer(users1, delimiter=',')
    writer2 = csv.writer(users2, delimiter=',')
    for line in f:
        words = line.split()
        writer1.writerow([words[0], words[-1]])
        writer2.writerow(words[1:3])

I (too) suggest you use the csv module. However, by using its DictReader and DictWriter you can assign field names to each column and use them to easily specify which ones you want to go into which output file. Here's an example of what I mean:
import csv
users_fieldnames = 'ID', 'LAST', 'FIRST', 'SEX', 'DATE' # input file field names
users1_fieldnames = 'ID', 'DATE' # fields to go into users1 output file
users2_fieldnames = 'LAST', 'FIRST' # fields to go into users2 output file
# Python 3: text mode with newline='' ('rb'/'wb' only work with Python 2's csv module)
with open('users.txt', 'r', newline='') as inf:
    csvreader = csv.DictReader(inf, fieldnames=users_fieldnames, delimiter=' ')
    with open('users1.txt', 'w', newline='') as outf1, open('users2.txt', 'w', newline='') as outf2:
        csvwriter1 = csv.DictWriter(outf1, fieldnames=users1_fieldnames,
                                    extrasaction='ignore', delimiter=' ')
        csvwriter2 = csv.DictWriter(outf2, fieldnames=users2_fieldnames,
                                    extrasaction='ignore', delimiter=' ')
        for row in csvreader:
            csvwriter1.writerow(row) # writes data for only users1_fieldnames
            csvwriter2.writerow(row) # writes data for only users2_fieldnames
Only the columns specified in the constructor calls to csv.DictWriter() will be written to the output file by the corresponding writerow() method call.

If your data has the same structure for all entries, you can make use of the pandas and numpy packages.
They give you a lot of flexibility for selecting whatever columns you need.

Python: General CSV file parsing and manipulation

The purpose of my Python script is to compare the data present in multiple CSV files, looking for discrepancies. The data are ordered, but the ordering differs between files. The files contain about 70K lines, weighing around 15MB. Nothing fancy or hardcore here. Here's part of the code:
def getCSV(fpath):
    with open(fpath,"rb") as f:
        csvfile = csv.reader(f)
        for row in csvfile:
            allRows.append(row)
    allCols = map(list, zip(*allRows))
Am I properly reading from my CSV files? I'm using csv.reader, but would I benefit from using csv.DictReader?
How can I create a list containing whole rows which have a certain value in a precise column?

Are you sure you want to be keeping all rows around? This creates a list with matching values only... fname could also come from glob.glob() or os.listdir() or whatever other data source you so choose. Just to note, you mention the 20th column, but row[20] will be the 21st column...
import csv
matching20 = []
for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
    with open(fname) as fin:
        csvin = csv.reader(fin)
        next(csvin) # <--- if you want to skip header row
        for row in csvin:
            if row[20] == 'value':
                matching20.append(row) # or do something with it here
You only want csv.DictReader if you have a header row and want to access your columns by name.

This should work; you don't need to build another list to access the columns.
import csv
def getCSV(fpath):
    with open(fpath, newline='') as ifile:
        csvfile = csv.reader(ifile)
        rows = list(csvfile)
    # Keep only the rows whose column at index 20 equals 'value'
    value_20 = [x for x in rows if x[20] == 'value']
    return value_20

If I understand the question correctly, you want to include a row if value is in the row, but you don't know which column value is, correct?
If your rows are lists, then this should work:
testlist = [row for row in allRows if 'value' in row]
post-edit:
If, as you say, you want a list of rows where value is in a specified column (specified by an integer pos), then:
pos = 20
testlist = [row for row in allRows if row[pos] == 'value']
(I haven't tested this, but let me know if that works.)
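A small self-contained sketch of filtering rows on one column is shown below. The helper name rows_matching and the sample data are hypothetical; the length check is an added precaution so that short rows are skipped rather than raising IndexError.

```python
import csv
import io

def rows_matching(f, column_index, wanted):
    """Return rows of an open CSV file whose cell at column_index equals wanted.

    Rows too short to have that column are skipped.
    """
    return [row for row in csv.reader(f)
            if len(row) > column_index and row[column_index] == wanted]

# Hypothetical sample; io.StringIO stands in for open(path, newline='').
sample = io.StringIO("a,value,1\nb,other,2\nc,value,3\nshort\n")
matches = rows_matching(sample, 1, 'value')
print(matches)
```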
