Related
So I have a list of "clients" that I need to count how many times in every line this "element" shows up. a little snippet of the text file inside the .zip FakeCostomers:
1,female,Melissa,J,Palmer,4 Lynch Street,Milwaukee,WI,53213,US,Melissa.J.Palmer#gmail.com,920-959-8247,9/29/1972,Visa,281.84,5
2,male,Edwin,M,Corder,4302 Pick Street,RIDGEWAY,CO,81432,US,Edwin.M.Corder#outlook.com,970-626-1897,2/28/1953,Visa,277.58,16
3,female,Laura,A,Olvera,365 Tori Lane,Salt Lake City,UT,84116,US,Laura.A.Olvera#yahoo.com,801-599-5964,4/11/1963,MasterCard,560.63,24
4,male,Wayne,D,Adams,3643 Nash Street,Chicago,IL,60605,US,Wayne.D.Adams#yahoo.com,312-948-6927,7/16/1957,Visa,320.11,3
5,female,Mari,R,Smith,3024 Atha Drive,Palmdale,CA,93550,US,Mari.R.Smith#gmail.com,661-574-4919,7/30/1973,MasterCard,798.58,28
6,male,Craig,H,Salazar,3929 Goosetown Drive,Hendersonville,NC,28792,US,Craig.H.Salazar#gmail.com,828-697-6697,1/15/1959,Visa,183.35,29
7,male,Henry,S,Clark,205 Charla Lane,Mesquite,TX,75150,US,Henry.S.Clark#gmail.com,972-686-5507,8/28/1962,Visa,650.58,27
8,male,Jerry,L,Littleton,1652 My Drive,Elmsford,NY,10523,US,Jerry.L.Littleton#gmail.com,347-219-4091,9/5/1975,MasterCard,525.73,8
9,female,Georgia,V,Allen,1226 Jefferson Street,Norfolk,VA,23510,US,Georgia.V.Allen#yahoo.com,757-774-4490,5/17/1952,Visa,910.39,6
10,male,Ted,A,Harding,2143 Lake Floyd Circle,HOCKESSIN,DE,19707,US,Ted.A.Harding#gmail.com,302-239-3674,7/12/1958,MasterCard,307.51,25
11,male,Jose,J,Houston,2639 Olive Street,Shelby,OH,44875,US,Jose.J.Houston#gmail.com,419-342-5793,4/23/1943,Visa,447.97,27
For example if I wanted to find out how many females there are in this list.
So far I have tried:
def getColumnDistribution(filename,columnNum):
file = open(filename,"r")
listoflists = []
for line in file:
stripped_line = line.strip()
line_list = stripped_line.split()
listoflists.append(line_list)
NUMBER = line_list.count(line_list[columnNum])
it keeps coming up with "list index out of range"
Anyone know how I can fix it or use a better method?
Python gives you a lot of tools that make these kind of tasks painless. For example, you can pass the issues of csv parsing to the csv module and counter to collections.Counter. Then it's just a couple lines:
import csv
from collections import Counter
with open(path, 'r') as f:
reader = csv.reader(f)
headers = next(reader) # pop off the header row
genderCounts = Counter(row[1] for row in reader)
print(genderCounts['female'])
# 15167
print(genderCounts['male'])
# 14833
If you use the Dictreader from csv, you can index by columns name, which makes the code more readable:
with open(path, 'r') as f:
reader = csv.DictReader(f)
genderCounts = Counter(row['Gender'] for row in reader)
Of course if you are doing a lot of work on data like this, pandas will make you life substantially easier:
import pandas as pd
df = pd.read_csv(path)
df['Gender'].value_counts()
# female 15167
# male 14833
# Name: Gender, dtype: int64
This will work for you. Tested on 3.6v
import csv
def openFile(file_name:str)->tuple:
with open(file_name,'r') as csv_file:
csv_reader = csv.reader(csv_file)
return tuple(csv_reader)
def getColumnDistribution(csv_data:tuple,name_to_count:str)->int:
num_of_count = [idx for idx,rows in enumerate(csv_data) if name_to_count in rows]
print("number of occurence:",num_of_count)
return len(num_of_count)
#driver code
csv_data = openFile(your_csv_file_name)
getColumnDistribution(csv_data,'female')
I have a file 'data.csv' that looks something like
ColA, ColB, ColC
1,2,3
4,5,6
7,8,9
I want to open and read the file columns into lists, with the 1st entry of that list omitted, e.g.
dataA = [1,4,7]
dataB = [2,5,8]
dataC = [3,6,9]
In reality there are more than 3 columns and the lists are very long, this is just an example of the format. I've tried:
csv_file = open('data.csv','rb')
csv_array = []
for row in csv.reader(csv_file, delimiter=','):
csv_array.append(row)
Where I would then allocate each index of csv_array to a list, e.g.
dataA = [int(i) for i in csv_array[0]]
But I'm getting errors:
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Also it feels like a very long winded way of just saving data to a few lists...
Thanks!
edit:
Here is how I solved it:
import pandas as pd
df = pd.read_csv('data.csv', names = ['ColA','ColB','ColC']
dataA = map(int,(df.ColA.tolist())[1:3])
and repeat for the rest of the columns.
Just to spell this out for people trying to solve a similar problem, perhaps without Pandas, here's a simple refactoring with comments.
import csv
# Open the file in 'r' mode, not 'rb'
csv_file = open('data.csv','r')
dataA = []
dataB = []
dataC = []
# Read off and discard first line, to skip headers
csv_file.readline()
# Split columns while reading
for a, b, c in csv.reader(csv_file, delimiter=','):
# Append each variable to a separate list
dataA.append(a)
dataB.append(b)
dataC.append(c)
This does nothing to convert the individual fields to numbers (use append(int(a)) etc if you want that) but should hopefully be explicit and flexible enough to show you how to adapt this to new requirements.
Use Pandas:
import pandas as pd
df = pd.DataFrame.from_csv(path)
rows = df.apply(lambda x: x.tolist(), axis=1)
To skip the header, create your reader on a seperate line. Then to convert from a list of rows to a list of columns, use zip():
import csv
with open('data.csv', 'rb') as f_input:
csv_input = csv.reader(f_input)
header = next(csv_input)
data = zip(*[map(int, row) for row in csv_input])
print data
Giving you:
[(1, 4, 7), (2, 5, 8), (3, 6, 9)]
So if needed:
dataA = data[0]
Seems like you have OSX line endings in your csv file. Try saving the csv file as "Windows Comma Separated (.csv)" format.
There are also easier ways to do what you're doing with the csv reader:
csv_array = []
with open('data.csv', 'r') as csv_file:
reader = csv.reader(csv_file)
# remove headers
reader.next()
# loop over rows in the file, append them to your array. each row is already formatted as a list.
for row in reader:
csv_array.append(row)
You can then set dataA = csv_array[0]
First if you read the csv file with csv.reader(csv_file, delimiter=','), you will still read the header.
csv_array[0] will be the header row -> ['ColA', ' ColB', ' ColC']
Also if you're using mac, this issues is already referenced here: CSV new-line character seen in unquoted field error
And I would recommend using pandas&numpy instead if you will do more analysis using the data. It read the csv file to pandas dataframe.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
use csv.DictReader() to select specific columns
dataA = []
dataB = []
with open('data.csv', 'r') as csv_file:
csv_reader = csv.DictReader(csv_file, delimiter=',')
for row in csv_reader:
dataA.append(row['ColA'])
dataB.append(row['ColB'])
I'm trying to access a csv file of currency pairs using csv.reader. The first column shows dates, the first row shows the currency pair eg.USD/CAD. I can read in the file but cannot access the currency pairs data to perform simple calculations.
I've tried using next(x) to skip header row (currency pairs). If i do this, i get a Typeerror: csv reader is not subscriptable.
path = x
file = open(path)
dataset = csv.reader(file, delimiter = '\t',)
header = next(dataset)
header
Output shows the header row which is
['Date,USD,Index,CNY,JPY,EUR,KRW,GBP,SGD,INR,THB,NZD,TWD,MYR,IDR,VND,AED,PGK,HKD,CAD,CHF,SEK,SDR']
I expect to be able to access the underlying currency pairs but i'm getting the type error as noted above. Is there a simple way to access the currency pairs, for example I want to use USD.describe() to get simple statistics on the USD currency pair.
How can i move from this stage to accessing the data underlying the header row?
try this example
import csv
with open('file.csv') as csv_file:
csv_reader = csv.Reader(csv_file, delimiter='\t')
line_count = 0
for row in csv_reader:
print(f'\t{row[0]} {row[1]} {row[3]}')
It's apparent from the output of your header row that the columns are comma-delimited rather than tab-delimited, so instead of passing delimiter = '\t' to csv.reader, you should let it use the default delimiter ',' instead:
dataset = csv.reader(file)
If you need to elaborate some statistics pandas is your friend. No need to use the csv module, use pandas.read_csv.
import pandas
filename = 'path/of/file.csv'
dataset = pandas.read_csv(filename, sep = '\t') #or whatever the separator is
pandas.read_csv uses the first line as the header automatically.
To see statistics, simply do:
dataset.describe()
Or for a single column:
dataset['column_name'].describe()
Are you sure that your delimiter is '\t'? In first row your delimiter is ','... Anyway you can skip first row by doing file.readline() before using it by csv.reader:
import csv
example = """Date,USD,Index,CNY,JPY,EUR,KRW,GBP,SGD,INR,THB,NZD,TWD,MYR,IDR,VND,AED,PGK,HKD,CAD,CHF,SEK,SDR
1-2-3\tabc\t1.1\t1.2
4-5-6\txyz\t2.1\t2.2
"""
with open('demo.csv', 'w') as f:
f.write(example)
with open('demo.csv') as f:
f.readline()
reader = csv.reader(f, delimiter='\t')
for row in reader:
print(row)
# ['1-2-3', 'abc', '1.1', '1.2']
# ['4-5-6', 'xyz', '2.1', '2.2']
I think that you need something else... Can you add to your question:
example of first 3 lines in your csv
Example of what you'd like to access:
is using row[0], row[1] enough for you?
or do you want "named" access like row['Date'], row['USD'],
or you want something more complex like data_by_date['2019-05-01']['USD']
I am asking Python to print the minimum number from a column of CSV data, but the top row is the column number, and I don't want Python to take the top row into account. How can I make sure Python ignores the first line?
This is the code so far:
import csv
with open('all16.csv', 'rb') as inf:
incsv = csv.reader(inf)
column = 1
datatype = float
data = (datatype(column) for row in incsv)
least_value = min(data)
print least_value
Could you also explain what you are doing, not just give the code? I am very very new to Python and would like to make sure I understand everything.
You could use an instance of the csv module's Sniffer class to deduce the format of a CSV file and detect whether a header row is present along with the built-in next() function to skip over the first row only when necessary:
import csv
with open('all16.csv', 'r', newline='') as file:
has_header = csv.Sniffer().has_header(file.read(1024))
file.seek(0) # Rewind.
reader = csv.reader(file)
if has_header:
next(reader) # Skip header row.
column = 1
datatype = float
data = (datatype(row[column]) for row in reader)
least_value = min(data)
print(least_value)
Since datatype and column are hardcoded in your example, it would be slightly faster to process the row like this:
data = (float(row[1]) for row in reader)
Note: the code above is for Python 3.x. For Python 2.x use the following line to open the file instead of what is shown:
with open('all16.csv', 'rb') as file:
To skip the first line just call:
next(inf)
Files in Python are iterators over lines.
Borrowed from python cookbook,
A more concise template code might look like this:
import csv
with open('stocks.csv') as f:
f_csv = csv.reader(f)
headers = next(f_csv)
for row in f_csv:
# Process row ...
In a similar use case I had to skip annoying lines before the line with my actual column names. This solution worked nicely. Read the file first, then pass the list to csv.DictReader.
with open('all16.csv') as tmp:
# Skip first line (if any)
next(tmp, None)
# {line_num: row}
data = dict(enumerate(csv.DictReader(tmp)))
You would normally use next(incsv) which advances the iterator one row, so you skip the header. The other (say you wanted to skip 30 rows) would be:
from itertools import islice
for row in islice(incsv, 30, None):
# process
use csv.DictReader instead of csv.Reader.
If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as field names. you would then be able to access field values using row["1"] etc
Python 2.x
csvreader.next()
Return the next row of the reader’s iterable object as a list, parsed
according to the current dialect.
csv_data = csv.reader(open('sample.csv'))
csv_data.next() # skip first row
for row in csv_data:
print(row) # should print second row
Python 3.x
csvreader.__next__()
Return the next row of the reader’s iterable object as a list (if the
object was returned from reader()) or a dict (if it is a DictReader
instance), parsed according to the current dialect. Usually you should
call this as next(reader).
csv_data = csv.reader(open('sample.csv'))
csv_data.__next__() # skip first row
for row in csv_data:
print(row) # should print second row
The documentation for the Python 3 CSV module provides this example:
with open('example.csv', newline='') as csvfile:
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)
# ... process CSV file contents here ...
The Sniffer will try to auto-detect many things about the CSV file. You need to explicitly call its has_header() method to determine whether the file has a header line. If it does, then skip the first row when iterating the CSV rows. You can do it like this:
if sniffer.has_header():
for header_row in reader:
break
for data_row in reader:
# do something with the row
this might be a very old question but with pandas we have a very easy solution
import pandas as pd
data=pd.read_csv('all16.csv',skiprows=1)
data['column'].min()
with skiprows=1 we can skip the first row then we can find the least value using data['column'].min()
The new 'pandas' package might be more relevant than 'csv'. The code below will read a CSV file, by default interpreting the first line as the column header and find the minimum across columns.
import pandas as pd
data = pd.read_csv('all16.csv')
data.min()
Because this is related to something I was doing, I'll share here.
What if we're not sure if there's a header and you also don't feel like importing sniffer and other things?
If your task is basic, such as printing or appending to a list or array, you could just use an if statement:
# Let's say there's 4 columns
with open('file.csv') as csvfile:
csvreader = csv.reader(csvfile)
# read first line
first_line = next(csvreader)
# My headers were just text. You can use any suitable conditional here
if len(first_line) == 4:
array.append(first_line)
# Now we'll just iterate over everything else as usual:
for row in csvreader:
array.append(row)
Well, my mini wrapper library would do the job as well.
>>> import pyexcel as pe
>>> data = pe.load('all16.csv', name_columns_by_row=0)
>>> min(data.column[1])
Meanwhile, if you know what header column index one is, for example "Column 1", you can do this instead:
>>> min(data.column["Column 1"])
For me the easiest way to go is to use range.
import csv
with open('files/filename.csv') as I:
reader = csv.reader(I)
fulllist = list(reader)
# Starting with data skipping header
for item in range(1, len(fulllist)):
# Print each row using "item" as the index value
print (fulllist[item])
I would convert csvreader to list, then pop the first element
import csv
with open(fileName, 'r') as csvfile:
csvreader = csv.reader(csvfile)
data = list(csvreader) # Convert to list
data.pop(0) # Removes the first row
for row in data:
print(row)
I would use tail to get rid of the unwanted first line:
tail -n +2 $INFIL | whatever_script.py
just add [1:]
example below:
data = pd.read_csv("/Users/xyz/Desktop/xyxData/xyz.csv", sep=',', header=None)**[1:]**
that works for me in iPython
Python 3.X
Handles UTF8 BOM + HEADER
It was quite frustrating that the csv module could not easily get the header, there is also a bug with the UTF-8 BOM (first char in file).
This works for me using only the csv module:
import csv
def read_csv(self, csv_path, delimiter):
with open(csv_path, newline='', encoding='utf-8') as f:
# https://bugs.python.org/issue7185
# Remove UTF8 BOM.
txt = f.read()[1:]
# Remove header line.
header = txt.splitlines()[:1]
lines = txt.splitlines()[1:]
# Convert to list.
csv_rows = list(csv.reader(lines, delimiter=delimiter))
for row in csv_rows:
value = row[INDEX_HERE]
Simple Solution is to use csv.DictReader()
import csv
def read_csv(file): with open(file, 'r') as file:
reader = csv.DictReader(file)
for row in reader:
print(row["column_name"]) # Replace the name of column header.
I have an array LiveTick = ['ted3m index','US0003m index','USGG3m index'] and I am reading a CSV file book1.csv. I have to find the row which contains the values in csv.
For example, 15th row will contain ted3m index 500 | 600 and 20th row will contain US0003m index 800 | 900 and likewise.
I then have to get the values contained in the row and parse it for each value contained in array LiveTick. How do I proceed? Below is my sample code:
with open('C:\\blp\\book1.csv', 'r') as f:
reader = csv.reader(f, delimiter=',')
writer = csv.writer(outf)
for row in reader:
for list in LiveTick:
if list in row:
print ('Found: {}'.format(row))
You can use pandas, it's pretty fast and will do all reading, writing and filtering job for you out of the box:
import pandas as pd
df = pd.read_csv('C:\\blp\\book1.csv')
filtered_df = df[df['your_column_name'].isin(LiveTick)]
# now you can save it
filtered_df.to_csv('C:\\blp\\book_filtered.csv')
You have the right idea, but there are a few improvements you can make:
Instead of a nested for loop which doesn't short-circuit, use any to compare the first column to multiple values.
Write to your csv as you go along instead of just print. This is memory-efficient, as you hold in memory only one line at any one time.
Define outf as an open object in your with statement.
Do not shadow built-in list. Use another identifier, e.g. i, for elements in LiveTick.
Here's a demo:
with open('in.csv', 'r') as f, open('out.csv', 'wb', newline='') as outf:
reader = csv.reader(f, delimiter=',')
writer = csv.writer(outf, delimiter=',')
for row in reader:
if any(i in row[0] for i in LiveTick):
writer.writerow(row)