python reading csv files comma issue - python

I have a small issue while trying to parse some data from a table. My program reads a row of the table and puts it in a list of strings (this is what Python's reader.next() returns by default). Everything is fine as long as no cell contains a comma. When one does, the program treats that comma as a separator and produces two list items instead of one, which makes things like list[0].split(';') impossible.
I suck at explaining verbally, so let me illustrate:
csv_file = | House floors | Wooden, metal and golden | 2000 | # Illustration of an excel table
reader = csv.reader(open('csv_file.csv', 'r'))
row = reader.next() # row: ['House floors;Wooden', 'metal and golden; 2000']
columns = row.split(';') # columns: ['House floors, Wooden', 'metal and golden', '2000']
# But obviously what i want is this:
# columns : ['House floors', 'Wooden, metal and golden', '2000']
Thank you very much for your help!

Set the delimiter (see http://docs.python.org/2/library/csv.html):
csv.reader(fh, delimiter='|')

You need to set the correct delimiter, which in your case would be | or ; (it is not clear from the OP's example), e.g.
csv.reader(csvfile, delimiter=';')
Assuming you have data like "House floors;Wooden, metal and golden;2000", you can easily parse it using the csv module:
import csv
import StringIO
data = "House floors;Wooden, metal and golden;2000"
csvfile = StringIO.StringIO(data)
for row in csv.reader(csvfile, delimiter=';'):
    print row
output:
['House floors', 'Wooden, metal and golden', '2000']
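The snippet above is Python 2 (the StringIO module and the print statement). A minimal Python 3 sketch of the same idea, assuming the data really is semicolon-separated as in the sample string, might look like this:

```python
import csv
import io

# Sample line taken from the answer above; fields are separated by ';'.
data = "House floors;Wooden, metal and golden;2000"

# io.StringIO wraps the string in a file-like object that csv.reader can iterate.
rows = list(csv.reader(io.StringIO(data), delimiter=';'))

for row in rows:
    print(row)
```

Because the delimiter is ';', the commas inside the second field are treated as ordinary text.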

Related

How to manage a problem reading a csv that is a semicolon-separated file where some strings contain semi-colons?

The problem I have can be illustrated by showing a couple of sample rows in my csv (semicolon-separated) file, which look like this:
4;1;"COFFEE; COMPANY";4
3;2;SALVATION ARMY;4
Notice that in one row, a string is in quotation marks and has a semi-colon inside of it (none of the columns have quotations marks around them in my input file except for the ones containing semicolons).
These rows with the quotation marks and semicolons are causing a problem -- basically, my code is counting the semicolon inside quotation marks within the column/field. So when I read in this row, it reads this semicolon inside the string as a delimiter, thus making it seem like this row has an extra field/column.
The desired output would look like this, with no quotation marks around "coffee company" and no semicolon between 'coffee' and 'company':
4;1;COFFEE COMPANY;4
3;2;SALVATION ARMY;4
Actually, this column with "coffee company" is totally useless to me, so the final file could look like this too:
4;1;xxxxxxxxxxx;4
3;2;xxxxxxxxxxx;4
How can I get rid of just the semi-colons inside of this one particular column, but without getting rid of all of the other semi-colons?
The csv module makes it relatively easy to handle a situation like this:
# Contents of input_file.csv
# 4;1;"COFFEE; COMPANY";4
# 3;2;SALVATION ARMY;4
import csv
input_file = 'input_file.csv' # Contents as shown in your question.
with open(input_file, 'r', newline='') as inp:
    for row in csv.reader(inp, delimiter=';'):
        row[2] = row[2].replace(';', '')  # Remove embedded ';' chars.
        # If you don't care about what's in the column, use the following instead:
        # row[2] = 'xyz'  # Value not needed.
        print(';'.join(row))
Printed output:
4;1;COFFEE COMPANY;4
3;2;SALVATION ARMY;4
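To see why the csv module copes with this row where a plain string split does not, here is a small comparison. It is only a sketch: the line is copied from the question, and io.StringIO stands in for the real file.

```python
import csv
import io

# One of the sample rows from the question.
line = '4;1;"COFFEE; COMPANY";4'

# A plain split knows nothing about quoting, so it cuts inside the quoted field.
naive = line.split(';')

# csv.reader honours the double quotes and keeps the embedded ';' in one field.
parsed = next(csv.reader(io.StringIO(line), delimiter=';'))

print(len(naive), naive)
print(len(parsed), parsed)
```

The plain split yields five pieces, while csv.reader yields the expected four fields with the quotes already removed.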
Follow-on question: How to write this data to a new csv file?
import csv
input_file = 'input_file.csv' # Contents as shown in your question.
output_file = 'output_file.csv'
with open(input_file, 'r', newline='') as inp, \
     open(output_file, 'w', newline='') as outp:
    writer = csv.writer(outp, delimiter=';')
    for row in csv.reader(inp, delimiter=';'):
        row[2] = row[2].replace(';', '')  # Remove embedded ';' chars.
        writer.writerow(row)
Here's an alternative approach using the Pandas library which spares you having to code for loops:
import pandas as pd
#Read csv into dataframe df
df = pd.read_csv('data.csv', sep=';', header=None)
#Remove semicolon in column 2
df[2] = df[2].apply(lambda x: x.replace(';', ''))
This gives the following dataframe df:
   0  1               2  3
0  4  1  COFFEE COMPANY  4
1  3  2  SALVATION ARMY  4
Pandas provides several inbuilt functions to help you manipulate data or make statistical conclusions. Having the data in a tabular format can also make working with it more intuitive.

Using python to print strings between csv values

My overarching goal is to write a Python script that transforms each row of a spreadsheet into a standalone markdown file, using each column as a value in the file's YAML header. Right now, the final for loop I've written not only keeps going and going and going… it also doesn't seem to place the values correctly.
import csv
f = open('data.tsv')
csv_f = csv.reader(f, dialect=csv.excel_tab)
date = []
title = []
for column in csv_f:
    date.append(column[0])
    title.append(column[1])
for year in date:
    for citation in title:
        print "---\ndate: %s\ntitle: %s\n---\n\n" % (year, citation)
I'm using tab-separated values because some of the fields in my spreadsheet are chunks of text with commas. So ideally, the script should output something like the following (I figured I'd tackle splitting this output into individual markdown files later. One thing at a time):
---
date: 2015
title: foo
---
---
date: 2016
title: bar
---
But instead I'm getting misplaced values and output that never ends. I'm obviously learning as I go along here, so any advice is appreciated.
import csv
with open('data.tsv', newline='') as f:
    csv_f = csv.reader(f, dialect=csv.excel_tab)
    for column in csv_f:
        year, citation = column  # column is a list, so unpack it directly
        print("---\ndate: %s\ntitle: %s\n---\n\n" % (year, citation))
This is all I can do without the sample CSV file.
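The reason the question's output seemed endless is that nesting the two loops pairs every date with every title, so N rows produce N x N blocks. A single loop over the rows is enough. The sketch below assumes a two-column, tab-separated file with no header row; the sample string and the front_matter helper are made up for illustration and are not from the question.

```python
import csv
import io

# Hypothetical stand-in for data.tsv: two tab-separated columns, no header row.
sample = "2015\tfoo\n2016\tbar\n"

def front_matter(rows):
    """Yield one YAML front-matter block per row (date in column 0, title in column 1)."""
    for row in rows:
        year, citation = row[0], row[1]
        yield "---\ndate: %s\ntitle: %s\n---\n" % (year, citation)

reader = csv.reader(io.StringIO(sample), dialect=csv.excel_tab)
blocks = list(front_matter(reader))

for block in blocks:
    print(block)
```

Each element of blocks could later be written to its own markdown file instead of being printed.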

Make edits to the original csv file

I have three different columns in my csv file, with their respective values. Column B (the Name column) in the csv file has its values in all caps. I am trying to convert them to title case (first letter of each word capitalized), but when I run the code it returns all the columns squished together and in quotes.
The Original File:
Company    Name              Job Title
xxxxxx     JACK NICHOLSON    Manager
yyyyyy     BRAD PITT         Accountant
I am trying to do:
Company    Name              Job Title
xxxxxx     Jack Nicholson    Manager
yyyyyy     Brad Pitt         Accountant
My code:
import csv
with open('C:\\Users\\Data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    data = list(reader)
for item in data:
    if len(item) > 1:
        item[1] = item[1].title()
with open('C:\\Users\\Data.csv', 'wb') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(data)
When I run the code, instead of three separate columns with the second column title-cased, I get all three columns squished together into a single column, with quotes:
"Company","Name","Job Title"
xxxxxx,"JACK NICHOLSON","Manager"
yyyyyy,"BRAD PITT","Accountant"
I do not know what is wrong with my snippet. The result has absurd markings at the beginning.
A slight change to Mohammed's solution using read_fwf to simplify reading the file.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html
import pandas as pd
df = pd.read_fwf('old_csv_file')
df.Name = df.Name.str.title()
df.to_csv('new_csv_file', index=False, sep='\t')
EDIT:
Changed to use a string method instead of a lambda. I prefer to use lambdas as a last resort.
You can do something like this with pandas:
import pandas as pd
df = pd.read_csv('old_csv_file', sep=r'\s{3,}', engine='python')  # regex separators need the python engine
df.Name = df.Name.apply(lambda x: x.title())
df.to_csv('new_csv_file', index=False, sep='\t')
string.title() converts the string to title case, i.e. the first letter of every word in the string is capitalized and the remaining letters are converted to lower case.
With df.apply you can perform some operation on an entire column or row.
'\s{3,}' is a regular expression.
\s matches a single whitespace character. \s{3,} matches a run of 3 or more whitespace characters.
When you are reading a CSV format you have to specify how your columns are separated.
Generally columns are separated by a comma or a tab. But in your case you have about 5 or 6 spaces between each column of a row.
So by using \s{3,} I am telling the CSV processor that the columns in a row are delimited by 3 or more spaces.
If I had used only \s, it would have treated the first name and last name as two separate columns because they have 1 space between them. By requiring 3+ spaces, I kept the first name and last name together as a single column.
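The effect of the separator pattern can be checked with the standard re module, independent of pandas. This is only a sketch: the line below is a made-up row laid out the way the question describes (columns padded with several spaces), not the asker's real data.

```python
import re

# Hypothetical row: columns are padded with five spaces, names contain a single space.
line = "xxxxxx     JACK NICHOLSON     Manager"

# Splitting on every whitespace character breaks the name apart
# (and leaves empty strings where the padding was).
single = re.split(r"\s", line)

# Splitting only on runs of 3 or more whitespace characters keeps the name whole.
columns = re.split(r"\s{3,}", line)

columns[1] = columns[1].title()  # title-case just the Name column
print(columns)
```

With the single-character pattern, JACK and NICHOLSON end up as separate items, which is exactly the problem the 3+ pattern avoids.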
Take note that data stores each row as a list containing one string only.
Since each row has a length of 1, the statement inside this if block won't execute.
if len(item) > 1:
    item[1] = item[1].title()
Aside from that, reading and writing in binary format is unnecessary.
import csv
with open('C:\\Users\\Data.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    data = list(reader)
for item in data[1:]:  # excludes headers
    item[0] = item[0].title()  # will capitalize the Company column too
    item[0] = item[0][0].lower() + item[0][1:]  # that's why we need to revert
    print(item)
# see that data contains lists having one element only
# the print call above outputs (the header row is skipped by data[1:]):
# ['xxxxxx Jack Nicholson Manager']
# ['yyyyyy Brad Pitt Accountant']
with open('C:\\Users\\Data.csv', 'w', newline='') as f:  # newline='' avoids blank lines on Windows
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(data)

Printing selected columns from a csv file in Python

I have some code here:
with open("dsasa.csv", 'rb') as csvfile:
    content = csv.reader(csvfile, delimiter='|')
    for row in content:
        print row
I would like to print columns 2, 3, 4 from the csv file in the following format:
4556 | 432432898904 | Joseph Henry
4544 | 54522238904 | Mark Mulligan
I have two issues which I am encountering. One is that the delimiter pipe (|) is not appearing between the columns. The second issue is that I cannot print the specific columns I want by doing the manual way, ie. print row[2], row[3], row[4]
I looked at online info and tried a few different solutions but I can't seem to find the route to get this to work.
Any help would be greatly appreciated.
Thanks!
Try this:
with open("dsasa.csv", 'rb') as csvfile:
    content = csv.reader(csvfile)
    for row in content:
        print "|".join([row[2],row[3],row[4]])
The delimiter argument within csv.reader refers to the input file not the output.
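To make the input/output distinction concrete, here is a Python 3 sketch. The question never shows a line of dsasa.csv, so the comma-separated sample below (and its two leading filler columns) is purely an assumption for illustration.

```python
import csv
import io

# Hypothetical comma-separated input; the real layout of dsasa.csv is not shown.
sample = (
    "a,b,4556,432432898904,Joseph Henry\n"
    "c,d,4544,54522238904,Mark Mulligan\n"
)

lines = []
for row in csv.reader(io.StringIO(sample)):  # delimiter describes the input (default ',')
    lines.append(" | ".join(row[2:5]))       # the output separator is chosen when joining

print("\n".join(lines))
```

If the file really were pipe-separated, the only change would be passing delimiter='|' to csv.reader; the join step stays the same.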
What does appear between the columns as the delimiter? Are you sure it is '|' and not a comma? I am guessing because you do not have the correct delimiter you cannot use print row[2], row[3], row[4]. Can you post a line of the CSV?

Excel CSV help Python

I have the following CSV file:
How do I import the numbers only into an array in python one row at a time? No date, no string.
My code:
import csv
def test():
    out = open("example.csv","rb")
    data = csv.reader(out)
    data = [row for row in data]
    out.close()
    print data
Let me be more clear. I don't want a huge 2D array. I want to import just the 2nd row, manipulate the data, and then get the 3rd row. I would need a for loop for this, but I am not sure how csv fully works.
try this:
with open('the_CSV_file.csv','r') as f:
    box = f.readlines()
result_box = []
for line in box[1:]:
    items = line.split(';')  # adjust the separator character in the CSV as needed
    result_box.append(items[1:])
print result_box
% <csv # just a silly CSV I got from http://secrets.d8u.us/csv
Secret,Timestamp
Forza la fiera!,1368230474
American healthcare SUXXXXX,1368232342
I am not sure if I wanna take the girl out again,1368240406
I bred a race of intelligent penguin assassins to murder dick cheney. ,1368245584
"I guess it is my mother's time of the month, as it were",1368380424
i've seen walls breath,1368390258
In [33]: %paste
with open('csv', 'rb') as csvfile:
    csv_reader = csv.reader(csvfile, dialect='excel')  # excel may be the default, but doesn't hurt to be explicit
    csv_reader.next()
    for row in csv_reader:
        array.append(row[1:])
## -- End pasted text --
In [34]: array
Out[34]:
[['1368230474'],
['1368232342'],
['1368240406'],
['1368245584'],
['1368380424'],
['1368390258']]
Corrected as per @DSM's comment.
You should end up with what you want in array:
import csv
with open('theFile.csv', 'r', encoding='utf8') as data:
    reader = csv.reader(data)
    array = []
    next(reader)  # skips the row of strings (the header)
    for row in reader:
        numberRow = [float(x) for x in row[1:]]  # This slice skips the date column
        array.append(numberRow)
I'm not sure it's necessary to define the encoding. But if you want to treat these as numbers, you will have to use float(x), or else they'll just be strings.
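Since the question asks to handle one row at a time rather than build a large 2D array, a generator is one way to sketch that. The CSV contents were not included in the question, so the sample layout below (a header row, then a date followed by numbers) and the numeric_rows helper are assumptions.

```python
import csv
import io

# Hypothetical file contents: a header row, then a date followed by numbers.
sample = "date,a,b\n2013-05-10,1.5,2\n2013-05-11,3,4.25\n"

def numeric_rows(f):
    """Yield each data row as a list of floats, skipping the header and the first column."""
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        yield [float(x) for x in row[1:]]

results = []
for numbers in numeric_rows(io.StringIO(sample)):
    # Only one row is held in memory here; sum() is a placeholder for the real manipulation.
    results.append(sum(numbers))

print(results)
```

With a real file, the io.StringIO(sample) argument would be replaced by a file object opened with newline=''.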
