I'm trying to parse through a csv file and extract the data from only specific columns.
Example csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
I'm trying to capture only specific columns, say ID, Name, Zip and Phone.
Code I've looked at has led me to believe I can refer to a specific column by its number, e.g. Name would correspond to 2, and iterating through each row using row[2] would produce all the items in column 2. Only it doesn't.
Here's what I've done so far:
import sys, argparse, csv
from settings import *
# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',\
    fromfile_prefix_chars="#" )
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file
# open csv file
with open(csv_file, 'rb') as csvfile:
    # get number of columns
    for line in csvfile.readlines():
        array = line.split(',')
        first_item = array[0]
    num_columns = len(array)
    csvfile.seek(0)
    reader = csv.reader(csvfile, delimiter=' ')
    included_cols = [1, 2, 6, 7]
    for row in reader:
        content = list(row[i] for i in included_cols)
        print content
I'm expecting this to print only the specific columns I want for each row, but it doesn't: I get the last column only.
The only way you would be getting the last column from this code is if you don't include your print statement in your for loop.
This is most likely the end of your code:
for row in reader:
    content = list(row[i] for i in included_cols)
print content
You want it to be this:
for row in reader:
    content = list(row[i] for i in included_cols)
    print content
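For reference, here is a minimal Python 3 sketch of picking columns by position with the csv module. It assumes a comma-delimited file and zero-based indices (so ID, Name, Zip and Phone are 0, 1, 5 and 6); the sample rows and the select_columns helper are made up for illustration and are not from the question.

```python
import csv
import io

# Hypothetical sample data standing in for the real file (comma-delimited).
SAMPLE = (
    "ID,Name,Address,City,State,Zip,Phone,OPEID,IPEDS\n"
    "10,College A,130 W Main,Montgomery,AL,36104,334-555-0100,01023,10063\n"
    "11,College B,5 Elm St,Mobile,AL,36602,251-555-0101,01024,10064\n"
)

def select_columns(lines, included_cols):
    """Yield only the requested (zero-based) columns from each row."""
    reader = csv.reader(lines)
    for row in reader:
        yield [row[i] for i in included_cols]

rows = list(select_columns(io.StringIO(SAMPLE), [0, 1, 5, 6]))
for content in rows:
    print(content)  # the print is inside the loop, so every row is shown
```

With a real file you would pass the object returned by open(csv_file, newline='') instead of the io.StringIO wrapper.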
Now that we have covered your mistake, I would like to take this time to introduce you to the pandas module.
Pandas is spectacular for dealing with csv files, and the following code would be all you need to read a csv and save an entire column into a variable:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']
So if you wanted to save all of the info in your Name column into a variable, this is all you need to do:
names = df.Name
It's a great module and I suggest you look into it. If your print statement was already inside the for loop and it was still only printing the last column, which shouldn't happen, let me know and I'll revisit my assumption. Your posted code has a lot of indentation errors, so it was hard to know what was supposed to be where. Hope this was helpful!
import csv
from collections import defaultdict
columns = defaultdict(list) # each value in each column is appended to a list
with open('file.txt') as f:
    reader = csv.DictReader(f) # read rows into a dictionary format
    for row in reader: # read a row as {column1: value1, column2: value2,...}
        for (k,v) in row.items(): # go over each column name and value
            columns[k].append(v) # append the value into the appropriate list
                                 # based on column name k
print(columns['name'])
print(columns['phone'])
print(columns['street'])
With a file like
name,phone,street
Bob,0893,32 Silly
James,000,400 McHilly
Smithers,4442,23 Looped St.
Will output
>>>
['Bob', 'James', 'Smithers']
['0893', '000', '4442']
['32 Silly', '400 McHilly', '23 Looped St.']
Or alternatively if you want numerical indexing for the columns:
with open('file.txt') as f:
    reader = csv.reader(f)
    next(reader)
    for row in reader:
        for (i,v) in enumerate(row):
            columns[i].append(v)
print(columns[0])
>>>
['Bob', 'James', 'Smithers']
To change the delimiter, add delimiter=" " to the appropriate instantiation, i.e. reader = csv.reader(f, delimiter=" ")
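As a rough illustration of the delimiter option, the sketch below reads a pipe-delimited table and strips the padding around each value. The sample data is hypothetical and read from memory rather than from a file.

```python
import csv
import io

# Hypothetical pipe-delimited data with padding around each value.
SAMPLE = (
    "ID | Name | Zip\n"
    "10 | Adam | 36104\n"
    "11 | Carl | 36602\n"
)

reader = csv.reader(io.StringIO(SAMPLE), delimiter="|")
header = [cell.strip() for cell in next(reader)]
name_idx = header.index("Name")

# Strip the padding that surrounds each value before storing it.
names = [row[name_idx].strip() for row in reader]
print(names)
```

Looking the index up from the header avoids hardcoding the column position.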
Use pandas:
import pandas as pd
my_csv = pd.read_csv(filename)
column = my_csv.column_name
# you can also use my_csv['column_name']
Discard unneeded columns at parse time:
my_filtered_csv = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])
P.S. I'm just aggregating what others have said in a simple manner. Actual answers are taken from here and here.
You can use numpy.loadtxt(filename). For example, if this is your database .csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | Adam | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Carl | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Adolf | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Den | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
And you want the Name column:
import numpy as np
b=np.loadtxt(r'filepath\name.csv',dtype=str,delimiter='|',skiprows=1,usecols=(1,))
>>> b
array([' Adam ', ' Carl ', ' Adolf ', ' Den '],
dtype='|S7')
More easily, you can use genfromtxt:
b = np.genfromtxt(r'filepath\name.csv', delimiter='|', names=True,dtype=None)
>>> b['Name']
array([' Adam ', ' Carl ', ' Adolf ', ' Den '],
dtype='|S7')
With pandas you can use read_csv with the usecols parameter:
df = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])
Example:
import pandas as pd
import io
s = '''
total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
'''
df = pd.read_csv(io.StringIO(s), usecols=['total_bill', 'day', 'size'])
print(df)
   total_bill  day  size
0       16.99  Sun     2
1       10.34  Sun     3
2       21.01  Sun     3
Context: For this type of work you should use the amazing Python petl library. That will save you a lot of work and potential frustration from doing things 'manually' with the standard csv module. AFAIK, the only people who still use the csv module are those who have not yet discovered better tools for working with tabular data (pandas, petl, etc.), which is fine, but if you plan to work with a lot of data in your career from various strange sources, learning something like petl is one of the best investments you can make. Getting started should only take 30 minutes after you've done pip install petl. The documentation is excellent.
Answer: Let's say you have the first table in a csv file (you can also load directly from the database using petl). Then you would simply load it and do the following.
from petl import fromcsv, look, cut, tocsv
# Load the table
table1 = fromcsv('table1.csv')
# Alter the columns
table2 = cut(table1, 'Song_Name','Artist_ID')
# Have a quick look to make sure things are ok. Prints a nicely formatted table to your console
print(look(table2))
# Save to new file
tocsv(table2, 'new.csv')
I think there is an easier way
import pandas as pd
dataset = pd.read_csv('table1.csv')
ftCol = dataset.iloc[:, 0].values
Here, in iloc[:, 0], the : selects all rows and the 0 is the position of the column.
In the example below, the ID column will be selected:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
import pandas as pd
csv_file = pd.read_csv("file.csv")
column_val_list = csv_file.column_name.to_numpy()  # public replacement for the private _ndarray_values attribute
Thanks to the way you can index and subset a pandas dataframe, a very easy way to extract a single column from a csv file into a variable is:
myVar = pd.read_csv('YourPath', sep = ",")['ColumnName']
A few things to consider:
The snippet above will produce a pandas Series, not a DataFrame.
The suggestion from ayhan with usecols will also be faster if speed is an issue.
Testing the two different approaches using %timeit on a 2122 KB sized csv file yields 22.8 ms for the usecols approach and 53 ms for my suggested approach.
And don't forget import pandas as pd
If you need to process the columns separately, I like to destructure the columns with the zip(*iterable) pattern (effectively "unzip"). So for your example, using zero-based indices for ID, Name, Zip and Phone:
ids, names, zips, phones = zip(*(
    (row[0], row[1], row[5], row[6])
    for row in reader
))
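A self-contained sketch of the same unzip pattern, using zero-based indices and made-up sample rows in place of the real file, might look like this:

```python
import csv
import io

# Hypothetical sample data; the real file would be opened with open(...).
SAMPLE = (
    "ID,Name,Address,City,State,Zip,Phone\n"
    "10,College A,130 W Main,Montgomery,AL,36104,334-555-0100\n"
    "11,College B,5 Elm St,Mobile,AL,36602,251-555-0101\n"
)

reader = csv.reader(io.StringIO(SAMPLE))
next(reader)  # skip the header row

# zip(*rows) transposes the rows into columns; each result is a tuple.
ids, names, zips, phones = zip(*(
    (row[0], row[1], row[5], row[6])
    for row in reader
))
print(names)
```

Note that the unpacking raises a ValueError when the file has no data rows, so an empty input needs its own check.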
import pandas as pd
dataset = pd.read_csv('Train.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
X is a bunch of columns; use it if you want to read more than one column.
y is a single column; use it to read one column.
[:, 1:-1] means [row_index : to_row_index, column_index : to_column_index], so 1:-1 selects every column from the second up to, but not including, the last.
SAMPLE.CSV
a, 1, +
b, 2, -
c, 3, *
d, 4, /
import pandas as pd
column_names = ["Letter", "Number", "Symbol"]
df = pd.read_csv("sample.csv", names=column_names)
print(df)
OUTPUT
  Letter  Number Symbol
0      a       1      +
1      b       2      -
2      c       3      *
3      d       4      /
letters = df.Letter.to_list()
print(letters)
OUTPUT
['a', 'b', 'c', 'd']
import csv
with open('input.csv', encoding='utf-8-sig') as csv_file:
    # the below statement will skip the first row
    next(csv_file)
    reader = csv.DictReader(csv_file)
    Time_col = {'Time': []}
    #print(Time_col)
    for record in reader:
        Time_col['Time'].append(record['Time'])
    print(Time_col)
From CSV File Reading and Writing you can import csv and use this code:
with open('names.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_name'], row['last_name'])
To fetch the column names, it is better to use readline() instead of readlines(); this avoids looping over the complete file and storing it in an array.
with open(csv_file, 'rb') as csvfile:
    # get number of columns
    line = csvfile.readline()
    first_item = line.split(',')
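On Python 3, a file opened in 'rb' mode yields bytes, and line.split(',') would fail on it. One hedged alternative is to open the file in text mode and let csv.reader parse the header row. The sketch below uses an in-memory sample (made up for illustration) in place of the real file:

```python
import csv
import io

# Hypothetical sample; in a real script use open(csv_file, newline='').
SAMPLE = "ID,Name,Zip\n10,College A,36104\n"

with io.StringIO(SAMPLE) as csvfile:
    reader = csv.reader(csvfile)
    header = next(reader)      # only the first row is read here
    num_columns = len(header)

print(header, num_columns)
```

Because next(reader) consumes only one row, the rest of the file is never loaded just to count the columns.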
I am new to Python file data processing. I have the following text file containing the report of a new college campus. I want to extract the data from the "colleges" column and the "book IDs_1" value for block_ABC_top, which is 23. I also want to know whether block_ABC_top occurs anywhere else in the colleges column, and to find the book IDs_1 value for those rows.
Is this possible with a text file, or will I have to change it to csv? How do I write code for this data processing? Kindly help me!
Copyright 1986-2019, Inc. All Rights Reserved.
Design Information
-----------------------------------------------------------------------------------------------------------------
| Version : (lin64) Build 2729669 Thu Dec 5 04:48:12 MST 2019
| Date : Wed Aug 26 00:46:08 2020
| Host : running 64-bit Red Hat Enterprise Linux Server release 7.8
| Command : college report
| Design : college
| Device : laptop
| Design State : in construction
-----------------------------------------------------------------------------------------------------------------
Table of Contents
-----------------
1. Information by Hierarchy
1. Information by Hierarchy
---------------------------
+----------------------------------------------+--------------------------------------------+------------+------------+---------+------+-----+
| colleges | Module | Total mems | book IDs_1 | canteen | BUS | UPS |
+----------------------------------------------+--------------------------------------------+------------+------------+---------+------+-----+
| block_ABC_top | (top) | 44 | 23 | 8 | 8 | 8 |
| (block_ABC_top_0) | block_ABC_top_0 | 5 | 5 | 5 | 2 | 9 |
+----------------------------------------------+--------------------------------------------+------------+------------+---------+------+-----+
I have a list called data which holds the colleges, such as block_ABC_top, block_ABC_top_1, block_ABC_top, block_ABC_top_1... My code is below.
The problem I face is that it only checks data[0], but data[0] and data[2] hold the same college and I expect the check to happen twice.
with open ("utility.txt", 'r') as f1:
    for line in f1:
        if data[x] in line:
            line_values = line.split('|')
            if (int(line_values[4]) == 23 or int(line_values[7]) == 8):
                filecheck = fullpath + "/" + filenames[x]
                print filecheck
                #print "check file "+ filenames[x]
            x = x + 1
f1.close()
print [x.split(' ')[0] for x in open(file).readlines()] #colleges column
print [x.split(' ')[3] for x in open(file).readlines()] #book_IDs_1 column
Try running these.
Instead of relying on the exact position of each field, a better way is to use the split() function, since your fields are separated by a | symbol. You can loop through the lines of the file and handle them accordingly.
with open('report.txt') as f:
    for line in f:
        line_values = line.split("|")
        if len(line_values) > 1:
            # index 0 is the empty string before the leading |
            print(line_values[1].strip()) # e.g. block_ABC_top
To extract the book IDs column data, use the code below:
with open('report.txt') as f:
    for line in f:
        if 'block_ABC_top' in line:
            line_values = line.split('|')
            print(line_values[4]) # PRINTS 23 AND 5
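The substring test above also matches (block_ABC_top_0), which is why both 23 and 5 are printed. If only exact matches in the colleges column are wanted, one possible sketch is to parse the table into dicts keyed by the header cells. The parse_table helper and the shortened REPORT text are assumptions for illustration, not part of the original report.

```python
# Shortened, hypothetical copy of the report; the real text would come from a file.
REPORT = """\
| Command : college report
| Design : college
+---------------+-----------------+------------+------------+---------+-----+-----+
| colleges | Module | Total mems | book IDs_1 | canteen | BUS | UPS |
+---------------+-----------------+------------+------------+---------+-----+-----+
| block_ABC_top | (top) | 44 | 23 | 8 | 8 | 8 |
| (block_ABC_top_0) | block_ABC_top_0 | 5 | 5 | 5 | 2 | 9 |
+---------------+-----------------+------------+------------+---------+-----+-----+
"""

def parse_table(lines):
    """Return the table rows as dicts keyed by the header cells."""
    header = None
    rows = []
    for line in lines:
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, titles and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            # Ignore the "| key : value" lines until the real header shows up.
            if "colleges" in cells:
                header = cells
            continue
        if len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows

rows = parse_table(REPORT.splitlines())
matches = [r["book IDs_1"] for r in rows if r["colleges"] == "block_ABC_top"]
print(matches)
```

The length of matches tells you how many times block_ABC_top occurs in the colleges column.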
These are the three lists I have:
# made up data
products = ['apple','banana','orange']
prices = ['£0.11','£0.07','£0.05']
dates = ['02/04/2017','14/09/2018','06/08/2016']
Important to know
The data in these lists will vary, along with their size, although they will keep the same data type.
The first elements of each list are linked, likewise for the second and third element etc...
Desired command line interface:
Product | Price | Date of Purchase
--------|-------|------------------
apple | £0.11 | 02/04/2017
--------|-------|------------------
banana | £0.07 | 14/09/2018
--------|-------|------------------
orange | £0.05 | 06/08/2016
I want to create a table like this. It should obviously continue if there are more elements in each list but I don't know how I would create it.
I could do
print(""" Product | Price | Date of Purchase # etc...
--------|-------|------------------
%s | %s | %s
""" % (products[0],prices[0],dates[0]))
But I think this would be hardcoding the interface, which isn't ideal because the list has an undetermined length
Any help?
If you want a version that doesn't utilize a library, here's a fairly simple function that makes use of some list comprehensions
def print_table(headers, *columns):
    # Ignore any columns of data that don't have a header
    columns = columns[:len(headers)]
    # Start with a space to set the header off from the left edge, then join the header strings with " | "
    print(" " + " | ".join(headers))
    # Draw the header separator with column dividers based on header length
    print("|".join(['-' * (len(header) + 2) for header in headers]))
    # Iterate over all lists passed in, and combine them together in a tuple by row
    for row in zip(*columns):
        # Center the contents within the space available in the column based on the header width
        print("|".join([
            col.center((len(headers[idx]) + 2), ' ')
            for idx, col in enumerate(row)
        ]))
This doesn't handle cell values that are longer than the column header length + 2. But that would be easy to implement with a truncation of the cell contents (an example of string truncation can be seen here).
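An alternative to truncation, sketched here under the assumption that widening the column is acceptable, is to size each column to its widest cell. The format_table name and the sample data are hypothetical.

```python
def format_table(headers, *columns):
    """Return the table as a list of lines, sizing each column to its widest cell."""
    # Ignore any columns of data that don't have a header
    columns = columns[:len(headers)]
    # Column width is the longest of the header and every cell in that column
    widths = [
        max([len(header)] + [len(str(cell)) for cell in column])
        for header, column in zip(headers, columns)
    ]

    def fmt(cells):
        # Left-justify each cell to its column width and join with dividers
        return " | ".join(str(c).ljust(w) for c, w in zip(cells, widths))

    lines = [fmt(headers), "-|-".join("-" * w for w in widths)]
    for row in zip(*columns):
        lines.append(fmt(row))
    return lines

# Made-up data with one cell that is wider than its header
products = ['apple', 'passion fruit']
prices = ['0.11', '1.25']
lines = format_table(["Product", "Price"], products, prices)
print("\n".join(lines))
```

Returning the lines instead of printing them directly makes the function easier to test or to write to a file.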
Try pandas:
import pandas as pd
products = ['apple','banana','orange']
prices = ['£0.11','£0.07','£0.05']
dates = ['02/04/2017','14/09/2018','06/08/2016']
df = pd.DataFrame({"Product": products, "Price": prices, "Date of Purchase": dates})
print(df)
Output:
  Product  Price Date of Purchase
0   apple  £0.11       02/04/2017
1  banana  £0.07       14/09/2018
2  orange  £0.05       06/08/2016
import beautifultable
from beautifultable import BeautifulTable
table = BeautifulTable()
# made up data
products = ['apple','banana','orange']
prices = ['£0.11','£0.07','£0.05']
dates = ['02/04/2017','14/09/2018','06/08/2016']
table.column_headers=['Product' ,'Price','Date of Purchase']
for i in zip(products,prices,dates):
    table.append_row(list(i))
print(table)
output is :
+---------+-------+------------------+
| Product | Price | Date of Purchase |
+---------+-------+------------------+
| apple | £0.11 | 02/04/2017 |
+---------+-------+------------------+
| banana | £0.07 | 14/09/2018 |
+---------+-------+------------------+
| orange | £0.05 | 06/08/2016 |
+---------+-------+------------------+
I am new to Python programming, so pardon me if I make any mistakes. I am writing a Python script to read a csv file and print a required cell from a row if another cell in that row meets a condition.
| A | B | C
---|----|---|---
1 | Re | Mg| 23
---|----|---|---
2 | Ra | Fe| 90
For example, I check column C for a value between 20 and 24. If the condition passes, it should return cell A1 (Re) as the result.
At the moment, I only have the following and I have no idea how to proceed from here.
f = open( 'imageResults.csv', 'rU' )
for line in f:
    cells = line.split( "," )
    if(cells[2] >= 20 and cells[2] <= 24):
f.close()
This might contain the answer to my question but i can't seem to make it work.
UPDATE
If there is a header in the first row, how do I get it to work? I wanted to change the condition to a string comparison, but that doesn't work if I want to search for a range of values.
| A | B | C
---|----|---|---
1 |Name|Lat|Ref
---|----|---|---
2 | Re | Mg| 23
---|----|---|---
3 | Ra | Fe| 90
You should use a csv reader. It's built into Python, so there are no dependencies to install. Then you need to tell Python that the third column is an integer. Something like this will do it:
import csv
# Python 3; on Python 2 use open('data.csv', 'rb') instead
with open('data.csv', newline='') as f:
    for line in csv.reader(f):
        if 20 <= int(line[2]) <= 24:
            print(line)
With this data in data.csv:
Re,Mg,23
Ra,Fe,90
Ha,Ns,50
Ku,Rt,20
the output will be:
$ python script.py
['Re', 'Mg', '23']
['Ku', 'Rt', '20']
Update:
If in the [first] row, there is a header, how do i get it to work?
There's csv.DictReader, which is made for that. Indeed, it is safer to work with DictReader, especially when the order of the columns might change or you insert a column before the third column. Given this data in data.csv:
Name,Lat,Ref
Re,Mg,23
Ra,Fe,90
Ha,Ns,50
Ku,Rt,20
Then this is the Python script:
import csv
# Python 3; on Python 2 use open('data.csv', 'rb') instead
with open('data.csv', newline='') as f:
    for line in csv.DictReader(f):
        if 20 <= int(line['Ref']) <= 24:
            print(line)
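If only the first cell of each matching row is wanted, as in the question, a small variation could collect the Name values instead of printing whole rows. This is just a sketch: the names_in_range helper is hypothetical, and the data is read from memory instead of data.csv.

```python
import csv
import io

# Same data as data.csv above, held in memory for the example.
SAMPLE = "Name,Lat,Ref\nRe,Mg,23\nRa,Fe,90\nHa,Ns,50\nKu,Rt,20\n"

def names_in_range(lines, low, high):
    """Return the Name of every row whose Ref value lies in [low, high]."""
    matches = []
    for record in csv.DictReader(lines):
        try:
            ref = int(record["Ref"])
        except (TypeError, ValueError):
            continue  # skip rows where Ref is missing or not a whole number
        if low <= ref <= high:
            matches.append(record["Name"])
    return matches

result = names_in_range(io.StringIO(SAMPLE), 20, 24)
print(result)
```

Skipping rows that fail the int() conversion keeps one malformed line from aborting the whole scan.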
P.S. Welcome to Python. It's a good language for learning to program.