How to read several rows from a csv - python

I have a csv file which contains, among other things, names and phone numbers. I'm only interested in a name if I have its phone number.
with open(phone_numbers) as f:
    reader = csv.DictReader(f)
    names = [record['Name'] for record in reader if record['phone']]
But I also want the respective phone number, so I've tried this:
user_data = {}
with open(phone_numbers) as f:
    reader = csv.DictReader(f)
    user_data['Name'] = [record['Name'] for record in reader if record['phone']]
    user_data['phone'] = [record['phone'] for record in reader if record['phone']]
But for the second item I got an empty list; I'm guessing that the reader is a generator and that's why I can't iterate over it twice.
I've tried to use tuples, but it only worked this way:
user_data = {}
with open(phone_numbers) as f:
    reader = csv.DictReader(f)
    user_data['Name'] = [(record['Name'], record['phone']) for record in reader if record['phone']]
In that case I have the two values, Name and phone, stored together in user_data['Name'], which isn't what I want.
And if I try this:
user_data = {}
with open(phone_numbers) as f:
    reader = csv.DictReader(f)
    user_data['Name'], user_data['phone'] = [(record['Name'], record['phone']) for record in reader if record['phone']]
I got the following error:
ValueError: too many values to unpack
Edit:
This is a sample of the table:
+--------+---------------+
| Name   | Phone         |
+--------+---------------+
| Luis   | 000 111 22222 |
+--------+---------------+
| Paul   | 000 222 3333  |
+--------+---------------+
| Andrea |               |
+--------+---------------+
| Jorge  | 111 222 3333  |
+--------+---------------+
So all rows have a Name but not all have phones.

You can use dict to convert your list of tuples into a dictionary. Also, you should use get in case a record has no phone value.
import csv
user_data = {}
with open(phone_numbers) as f:
    reader = csv.DictReader(f)
    user_data = dict([(record['Name'], record['phone']) for record in reader if record.get('phone', '').strip()])
If you want the names and phones as separate lists, you can use zip with the * unpacking expression:
with open(phone_numbers) as f:
    reader = csv.DictReader(f)
    names, phones = zip(*[(record['Name'], record['phone']) for record in reader if record.get('phone', '').strip()])

I think there is a much easier approach. Because it is a csv file with column headings, as you indicate, there is a value for phone in each row; it is either empty or not. So this tests for an empty value, and if the phone is not empty it adds the name and phone to user_data:
import csv
user_data = []
with open(f,'rb') as fh:
    my_reader = csv.DictReader(fh)
    for row in my_reader:
        if row['phone'] != '':
            user_details = dict()
            user_details['Name'] = row['Name']
            user_details['phone'] = row['phone']
            user_data.append(user_details)
By using DictReader we are letting the magic happen so we don't have to worry about seek etc.
If I misunderstood and you want a dictionary, then it's easy enough:
import csv
user_data = dict()
with open(f,'rb') as fh:
    my_reader = csv.DictReader(fh)
    for row in my_reader:
        if row['phone'] != '':
            user_data[row['Name']] = row['phone']

Your guess is quite right. If this is the approach you want to take - iterating twice - you should use seek(0):
reader = csv.DictReader(f)
user_data['Name'] = [record['Name'] for record in reader if record['phone']]
f.seek(0) # roll back to the beginning of the file ...
reader = csv.DictReader(f)
user_data['phone'] = [record['phone'] for record in reader if record['phone']]
However, this is not very efficient and you should try to get your data in one pass. The following should do it in one pass:
user_data = {}
def extract_user(user_data, record):
    if record['phone']:
        name = record.pop('name')
        user_data.update({name: record})
[extract_user(user_data, record) for record in reader]
Example:
In [20]: cat phones.csv
name,phone
hans,01768209213
grettel,
henzel,123457123
In [21]: f = open('phones.csv')
In [22]: reader = csv.DictReader(f)
In [24]: %paste
user_data = {}
def extract_user(user_data, record):
    if record['phone']:
        name = record.pop('name')
        user_data.update({name: record})
[extract_user(user_data, record) for record in reader]
## -- End pasted text --
Out[24]: [None, None, None]
In [25]: user_data
Out[25]: {'hans': {'phone': '01768209213'}, 'henzel': {'phone': '123457123'}}
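As a side note, the Out[24]: [None, None, None] is just the list comprehension collecting extract_user's return value (None) for each row; a plain loop avoids building that throwaway list:
for record in reader:
    extract_user(user_data, record)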

Is it possible that what you're looking for is throwing away some info in your data file?
In [26]: !cat data00.csv
Name,Phone,Address
goofey,,ade
mickey,1212,heaven
tip,3231,earth
In [27]: f = open('data00.csv')
In [28]: r = csv.DictReader(f)
In [29]: lod = [{'Name':rec['Name'], 'Phone':rec['Phone']} for rec in r if rec['Phone']]
In [30]: lod
Out[30]: [{'Name': 'mickey', 'Phone': '1212'}, {'Name': 'tip', 'Phone': '3231'}]
On the other hand, should your file contain ONLY Name and Phone columns, it's
just
In [31]: lod = [rec for rec in r if rec['Phone']]

I normally index into each row by column position:
import csv

input_file = open('mycsv.csv', 'r')
user_data = {}
for row in csv.reader(input_file):
    if row[<column # containing phone>]:
        name = row[<column # containing name>]
        user_data[name] = row[<column # containing phone>]
input_file.close()
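For the sample table in the question (name in the first column, phone in the second), a concrete version might look like the sketch below; the filename and the column positions are assumptions:
import csv

user_data = {}
with open('mycsv.csv', 'r', newline='') as input_file:
    reader = csv.reader(input_file)
    next(reader)  # skip the header row
    for row in reader:
        name, phone = row[0], row[1]
        if phone.strip():  # keep only rows that actually have a phone number
            user_data[name] = phone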

You were correct the whole time, except for the unpacking.
result = [(record["Name"], record["phone"]) for record in reader if record["phone"]]
# this gives [(name1, phone1), (name2,phone2),....]
You have to do [dostuff for name, phone in result], not name, phone = result, which does not make sense here either semantically or syntactically.
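For example, a small sketch of consuming that list of tuples (assuming result was built as above and is non-empty) might be:
# iterate pair by pair
for name, phone in result:
    print(name, phone)

# or split into two parallel sequences / build a name -> phone mapping
names, phones = zip(*result)
phone_book = dict(result)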

Related

Parsing a file with readlines and split function in python

I have files I'm trying to parse. I want to print the date_of_birth on each line. The code below only returns the first line. I don't want to use readlines, as some of my files are very large.
HEADER: Date_of_birth, ID, First_Name, Last_Name
1/1/1970, 1, John, Smith
12/31/1969, 2, Peter, Smith
with open("test.csv", "r") as f:
    lines = f.readline().split()[0]
    print(lines)
I suggest the csv module, though you have a slightly odd file format because it starts with "HEADER: " followed by the actual headers that you care about. Maybe just read in those initial 8 bytes, verify that they actually contain the string "HEADER: " but otherwise discard them, then pass the open file handle to csv to parse the rest of the file.
Here's a simple example, which you might want to tweak to do more graceful handling of any errors:
import csv
with open('test.csv') as f:
    start_bytes = f.read(8)
    assert(start_bytes == 'HEADER: ')
    c = csv.reader(f)
    header_row = next(c)
    column_number = header_row.index('Date_of_birth')
    for row in c:
        print(row[column_number])
Update: thanks to another contributor for suggesting csv.DictReader. Similarly it seems that you can instantiate this with a file object positioned at some non-zero offset to discard the initial bytes containing "HEADER: " from the start of the file.
import csv
with open('test.csv') as f:
    start_bytes = f.read(8)
    assert(start_bytes == 'HEADER: ')
    c = csv.DictReader(f)
    for row in c:
        print(row['Date_of_birth'])
Use the csv module:
import csv
with open("test.csv", "r") as f:
    reader = csv.DictReader(f)
    for line in reader:
        print(line['Date_of_birth'])
Sorry for my mistake. Check this:
dates = []
with open("test.csv") as f:
    for row in f:
        dates.append(row.split(',')[0])
The readline function returns only one line at a time, so you have to use a while loop to read the lines:
with open("test.csv", "r") as f:
    dates = []
    while True:
        line = f.readline()
        if not line:  # an empty string means there are no more lines
            break  # stop the loop
        dates.append(line.split(',')[0])
If the first row does not actually contain what you show as the Header, i.e. Date_of_birth, ID, First_Name, Last_Name, then:
import csv
with open("test.csv", "r", newline='') as f:
    fieldnames = ['Date_of_birth', 'ID', 'First_Name', 'Last_Name']
    rdr = csv.DictReader(f, fieldnames=fieldnames)
    for row in rdr:
        date_of_birth = row['Date_of_birth']
        print(date_of_birth)
Otherwise:
import csv
with open("test.csv", "r", newline='') as f:
    rdr = csv.DictReader(f)
    for row in rdr:
        date_of_birth = row['Date_of_birth']
        print(date_of_birth)
If the file's first row actually contains HEADER: Date_of_birth, ID, First_Name, Last_Name, then you must use the first alternative code but add logic to skip the first row.
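A minimal sketch of that skip-the-first-row logic, assuming the file really does start with the literal HEADER: line shown above, might be:
import csv

with open("test.csv", "r", newline='') as f:
    next(f)  # discard the "HEADER: ..." line, then supply the field names ourselves
    fieldnames = ['Date_of_birth', 'ID', 'First_Name', 'Last_Name']
    rdr = csv.DictReader(f, fieldnames=fieldnames)
    for row in rdr:
        print(row['Date_of_birth'])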
My answer would have been 60% shorter had you been 10% clearer.

Summing values from duplicate keys in a CSV file without pandas

I have a large dataset that looks like the following
party,cp,qualifier,amount
ABC,DEF,GOOGLE_2,100
ABC,DEF,GOOGLE_2,200
GHI,JKL,FACEBOOK_1,500
GHI,JKL,FACEBOOK_1,-600
I would like to output :
ABC,DEF,GOOGLE,300
GHI,JKL,FACEBOOK,-100
Here is my python code so far:
headers = ["valuation_date","party_group_name","type","party_name","cp_group_name","cp_name","qualifier","amount"]
data = {}
with open(t1file,'rb') as f:
    reader = csv.reader(f)
    headers = reader.next()
    for row in reader:
        party = row[headers.index('party')]
        cp = row[headers.index('cp')]
        qualifier = row[headers.index('qualifier')]
        amount = row[headers.index('amount')]
        if row[headers.index('type')] == "Equity":
            new_qualifier = qualifier.split("_")[0]
            if party in data.keys():
                if cp in data.keys():
                    if new_qualifier in data.keys():
                        data[party][cp][new_qualifier] += float(amount)
                    else:
                        data[party][cp][qualifier][amount] = data[party][cp][new_qualifier][amount]
                else:
                    data[cp] = cp
            else:
                data[party] = party
When I run the above code I get the following error:
data[party][cp][qualifier][amount] = data[party][cp][new_qualifier][amount]
TypeError: string indices must be integers, not str
Very rusty with Python, so apologies if it's glaringly obvious, but any insights as to what I'm doing wrong?
Thanks!
You can use pandas.drop_duplicates to drop duplicates across multiple columns and combine it with pandas.groupby() & sum to get the desired result:
>>>import pandas as pd
>>>#read file using pandas.read_csv()
>>>df
party cp qualifier amount
0 ABC DEF GOOGLE_2 100
1 ABC DEF GOOGLE_2 200
2 GHI JKL FACEBOOK_1 500
3 GHI JKL FACEBOOK_1 -600
>>>df['Total'] = df.groupby(['party','cp','qualifier'])['amount'].transform('sum')
>>>print(df.drop_duplicates(subset=['party','cp','qualifier'], keep='last'))
party cp qualifier amount Total
1 ABC DEF GOOGLE_2 200 300
3 GHI JKL FACEBOOK_1 -600 -100
Below is one way to do it without pandas:
from collections import defaultdict
PARTY_IDX = 0
CP_IDX = 1
QUALIFIER_IDX = 2
AMOUNT_IDX = 3
data = defaultdict(int)
with open('del-me.csv') as f:
    lines = [l.strip() for l in f.readlines()]
    for idx, line in enumerate(lines):
        if idx > 0:
            fields = line.split(',')
            party = fields[PARTY_IDX]
            cp = fields[CP_IDX]
            qualifier = fields[QUALIFIER_IDX]
            qualifier = qualifier[:qualifier.find('_')]
            key = ','.join([party, cp, qualifier])
            amount = int(fields[AMOUNT_IDX])
            data[key] += amount

with open('out.csv', 'w') as f:
    for k, v in data.items():
        f.write('{},{}\n'.format(k, v))
del-me.csv
party,cp,qualifier,amount
ABC,DEF,GOOGLE_2,100
ABC,DEF,GOOGLE_2,200
GHI,JKL,FACEBOOK_1,500
GHI,JKL,FACEBOOK_1,-600
out.csv
ABC,DEF,GOOGLE,300
GHI,JKL,FACEBOOK,-100
You already have enough answers, but let me correct your own code to help you derive the answer and understand the original issue:
import csv as csv
headers = ["valuation_date","party_group_name","party_name","cp_group_name","cp_name","qualifier","amount"]
data = {}
with open('test_data.csv','rt', encoding='utf-8') as f:
    reader = csv.reader(f)
    headers = next(reader)
    for row in reader:
        party = row[headers.index('party')]
        cp = row[headers.index('cp')]
        qualifier = row[headers.index('qualifier')]
        amount = row[headers.index('amount')]
        if row[headers.index('type')] == "Equity":
            new_qualifier = qualifier.split("_")[0]
            if party in data.keys():
                cp_ = data[party]
                if cp in cp_.keys():
                    qualifier_ = data[party][cp]
                    if new_qualifier in qualifier_.keys():
                        data[party][cp][new_qualifier] += float(amount)
                    else:
                        data[party][cp][qualifier][amount] = {}
                else:
                    data[cp] = {}
            else:
                data[party] = {}
                data[party][cp] = {}
                data[party][cp][qualifier.split("_")[0]] = float(amount)
print(data)
This gives you
{'ABC': {'DEF': {'GOOGLE': 300.0}}, 'GHI': {'JKL': {'FACEBOOK': -100.0}}}
The problem was how you were populating your dictionary and how you were accessing it.
In order to simplify things, you might use just one key for the dict, composed of the identifying parts of a given line.
You might have to extract values by the header names like you already did. The following is based on the specified input. rsplit is used to split the string once at the end in order to use the party,cp,qualifier combination as a key and extract the amount.
def sumUp():
    d = {}
    with open(t1file, 'rb') as f:
        for line in f:
            if 'party' in line:
                continue  # skip header
            key, value = line.rsplit(',', 1)  # split once at the end
            key = key.rsplit('_', 1)[0]  # normalise the qualifier (GOOGLE_2 -> GOOGLE)
            d[key] = d[key] + int(value) if key in d else int(value)
    return d
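A short usage sketch for this approach (assuming the function returns the accumulated dict as above and t1file points at the sample data) could be:
totals = sumUp()
with open('out.csv', 'w') as out:
    for key, amount in totals.items():
        out.write('{},{}\n'.format(key, amount))
# out.csv should then contain, for the sample rows:
# ABC,DEF,GOOGLE,300
# GHI,JKL,FACEBOOK,-100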
You can do it like this:
from csv import DictReader, DictWriter
map_dic = dict()
with open('test1.csv', 'r') as fr:
    csv_reader = DictReader(fr, delimiter=',')
    for line in csv_reader:
        key = '{}_{}_{}'.format(line['party'], line['cp'], line['qualifier'])
        if key not in map_dic.keys():
            map_dic[key] = {'party': line['party'], 'cp': line['cp'], 'qualifier': line['qualifier'], 'amount': int(line['amount'])}
        else:
            map_dic[key]['amount'] = map_dic[key]['amount'] + int(line['amount'])

with open('test2.csv', 'w') as csvfile:
    writer = DictWriter(csvfile, fieldnames=['party', 'cp', 'qualifier', 'amount'])
    writer.writeheader()
    for key, data in map_dic.items():
        writer.writerow(data)

(Simple Python) CSV input to usernames

I have a CSV file names.csv
First_name, Last_name
Mike, Hughes
James, Tango
, Stoke
Jack,
....etc
What I want is to take the first letter of the First_name and the full Last_name and output them on screen as usernames, but not include people whose First_name or Last_name is empty. I'm completely stuck; any help would be greatly appreciated.
import csv
ifile = open('names.csv', "rb")
reader = csv.reader(ifile)
rownum = 0
for row in reader:
    if rownum == 0:
        header = row
    else:
        colnum = 0
        for col in row:
            print '%-8s: %s' % (header[colnum], col)
            colnum += 1
    rownum += 1
ifile.close()
Attempt #2
import csv
dataFile = open('names.csv','rb')
reader = csv.reader(dataFile)
next(reader, None)
for row in reader:
    if (row in reader )
        print (row[0])
I haven't saved many attempts because none of them have worked :S
import csv
dataFile = open('names.csv','rb')
reader = csv.reader(dataFile, delimiter=',', quoting=csv.QUOTE_NONE)
for row in reader:
    if not row[0] or not row[1]:
        continue
    print (row[0][0] + row[1]).lower()
Or
import csv
dataFile = open('names.csv','rb')
reader = csv.reader(dataFile, delimiter=',', quoting=csv.QUOTE_NONE)
[(row[0][0] + row[1]).lower() for row in reader if
row[0] and row[1]]
Once you get the text from the .csv you can use the split() function to break up the text by the newlines. Your sample text is a little inconsistent, but if I understand your question correctly you can say:
dataFile = open('names.csv', 'r')
text = dataFile.read()
lines = text.split('\n')
for line in lines:
    print(line)
Or if you want to break it up by commas just replace the '\n' with ','
Maybe like this
from csv import DictReader
with open('names.csv') as f:
    dw = DictReader(f, skipinitialspace=True)
    fullnames = filter(lambda n: n['First_name'] and n['Last_name'], dw)
    for f in fullnames:
        print('{}{}'.format(f['First_name'][0], f['Last_name']))
You have headings in your csv, so use a DictReader, filter out those with an empty first or last name, and display the remaining names.

change order of columns in csv (python)

I made a script, which reads a given input-file (csv), manipulates the data somehow and writes an output-file (csv).
In my case, my given input-file looks like this:
| sku | article_name |
| 1 | MyArticle |
For my output file, I need to re-arrange these columns (there are plenty more, but I think I might be able to solve the rest once someone shows me the way).
My output-file should look like this:
| article_name | another_column | sku |
| MyArticle | | 1 |
Note that there is a new column that isn't in the source csv file, but it has to be printed anyway (the order is important as well).
This is what I have so far:
#!/usr/bin/env python
# -*- coding: latin_1 -*-
import csv
import argparse
import sys

header_mappings = {'attr_artikel_bezeichnung1': 'ARTICLE LABEL',
                   'sku': 'ARTICLE NUMBER',
                   'Article label locale': 'Article label locale',
                   'attr_purchaseprice': 'EK-Preis',
                   'attr_salesPrice': 'EuroNettoPreis',
                   'attr_salesunit': 'Einheit',
                   'attr_salesvatcode': 'MwSt.-Satz',
                   'attr_suppliercode': 'Lieferantennummer',
                   'attr_suppliersitemcode': 'Artikelnummer Lieferant',
                   'attr_isbatchitem': 'SNWarenausgang'}

row_mapping = {'Einheit': {'pc': 'St.'},
               'MwSt.-Satz': {'3': '19'}}


def remap_header(header):
    for h_map in header_mappings:
        if h_map in header:
            yield header_mappings.get(h_map), header.get(h_map)


def map_header(header):
    for elem in header:
        yield elem, header.index(elem)


def read_csv(filename):
    with open(filename, 'rb') as incsv:
        csv_reader = csv.reader(incsv, delimiter=';')
        for r in csv_reader:
            yield r


def add_header(header, fields=()):
    for f in fields:
        header.append(f)
    return header


def duplicate(csv_row, header_name, fields):
    csv_row[new_csv_header.index(fields)] = csv_row[new_csv_header.index(header_name)]
    return csv_row


def do_new_row(csv_row):
    for header_name in new_csv_header:
        for r_map in row_mapping:
            row_content = csv_row[mapped_header.get(r_map)]
            if row_content in row_mapping.get(r_map):
                csv_row[mapped_header.get(r_map)] = row_mapping.get(r_map).get(row_content)
        try:
            yield csv_row[mapped_header.get(header_name)]
        except TypeError:
            continue


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--infile', metavar='CSV')
    parser.add_argument('-o', '--outfile', metavar='CSV')
    args = parser.parse_args()
    arguments = vars(args)
    if len(sys.argv[1:]) == 0:
        parser.print_usage()
        sys.exit(0)
    # print arguments
    # parse_csv(**arguments)
    """
    """
    csv_reader_iter = read_csv(arguments.get('infile'))
    # neuer csv header
    new_csv_header = list()
    csv_header = next(csv_reader_iter)
    for h in csv_header:
        if h in header_mappings:
            new_csv_header.append(header_mappings.get(h))
    # print new_csv_header
    new_csv_header = add_header(new_csv_header, ('Article label locale', 'Nummer'))
    mapped_header = dict(remap_header(dict(map_header(csv_header))))
    # print mapped_header
    with open(arguments.get('outfile'), 'wb') as outcsv:
        csv_writer = csv.writer(outcsv, delimiter=';')
        csv_writer.writerow(new_csv_header)
        for row in csv_reader_iter:
            row = list(do_new_row(row))
            delta = len(new_csv_header) - len(row)
            if delta > 0:
                row = row + (delta * [''])
            # duplicate(row, 'SNWarenausgang', 'SNWareneingang')
            # duplicate(row, 'SNWarenausgang', 'SNWareneingang')
            csv_writer.writerow(row)
    print "Done."
    """
    print new_csv_header
    for row in csv_reader_iter:
        row = list(do_new_row(row))
        delta = len(new_csv_header) - len(row)
        if delta > 0:
            row = row + (delta * [''])
        duplicate(row, 'Herstellernummer', 'Nummer')
        duplicate(row, 'SNWarenausgang', 'SNWareneingang')
        print row
    """
Right now, even though it says "ARTICLE LABEL" first, the sku is printed first. My guess: this is due to the order of the csv file, since sku is the first field there... right?
If you use the DictWriter from the csv lib you can specify the order of the columns. Use DictReader to read in rows from your file as dicts. Then you just explicitly specify the order of the keys when you create your DictWriter.
https://docs.python.org/2/library/csv.html#csv.DictReader
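A minimal sketch of that approach, reusing the column names from the question (the file names, the ';' delimiter and extrasaction='ignore' for the extra source columns are assumptions), might look like this:
import csv

out_columns = ['article_name', 'another_column', 'sku']

with open('input.csv', newline='') as infile, \
     open('output.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile, delimiter=';')
    writer = csv.DictWriter(outfile, fieldnames=out_columns,
                            delimiter=';', extrasaction='ignore')
    writer.writeheader()
    for row in reader:
        row['another_column'] = ''  # new column that does not exist in the source
        writer.writerow(row)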
As riotburn already suggested, you can use a DictWriter and its fieldnames argument to adjust the order of columns in the new file.
Reordering a file could look like this:
def read_csv(filename):
    with open(filename) as incsv:
        reader = csv.DictReader(incsv, delimiter=';')
        for r in reader:
            yield r
columns = ['article_name', 'another_column', 'sku']
with open('newfile.csv', 'w+') as f:
    writer = csv.DictWriter(f, columns, delimiter=';')
    writer.writeheader()
    for row in read_csv('oldfile.csv'):
        # add a property
        row['another_column'] = 'foo'
        # write row (using the order specified in columns)
        writer.writerow(row)

How to read multiple records from a CSV file?

I have a csv file, l__cyc.csv, that contains this:
trip_id, time, O_lat, O_lng, D_lat, D_lng
130041910101,1300,51.5841153671,0.134444590094,51.5718053872,0.134878021928
130041910102,1335,51.5718053872,0.134878021928,51.5786920389,0.180940040247
130041910103,1600,51.5786920389,0.180940040247,51.5841153671,0.134444590094
130043110201,1500,51.5712712038,0.138532882664,51.5334949484,0.130489470325
130043110202,1730,51.5334949484,0.130489470325,51.5712712038,0.138532882664
And I am trying to pull out separate values, using:
with open('./l__cyc.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    origincoords = ['{O_lat},{O_lng}'.format(**row) for row in reader]

with open('./l__cyc.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    trip_id = ['{trip_id}'.format(**row) for row in reader]

with open('./l__cyc.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    destinationcoords = ['{D_lat},{D_lng}'.format(**row) for row in reader]
Where origincoords should be 51.5841153671, 0.134444590094,
trip_id should be 130041910101, and destinationcoords should be
51.5718053872, 0.134878021928.
However, I get a KeyError:
KeyError: 'O_lat'
Is this something simple, or is there something fundamental I'm misunderstanding?
You just need to remove the spaces after the commas in your header row:
trip_id,time,O_lat,O_lng,D_lat,D_lng
OR
reader = csv.DictReader(csvfile, skipinitialspace=True)
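For example, a single-pass sketch built on that option, reusing the question's file name and keys, could look roughly like:
import csv

origincoords, trip_id, destinationcoords = [], [], []
with open('./l__cyc.csv') as csvfile:
    reader = csv.DictReader(csvfile, skipinitialspace=True)
    for row in reader:
        origincoords.append('{O_lat},{O_lng}'.format(**row))
        trip_id.append('{trip_id}'.format(**row))
        destinationcoords.append('{D_lat},{D_lng}'.format(**row))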
First things first, you get the key error, because the key does not exist in your dictionary.
Next, I would advise against running through the file 3 times, when you can do it a single time!
For me it worked, when I added the fieldnames to the reader.
import csv
from cStringIO import StringIO
src = """trip_id, time, O_lat, O_lng, D_lat, D_lng
130041910101,1300,51.5841153671,0.134444590094,51.5718053872,0.134878021928
130041910102,1335,51.5718053872,0.134878021928,51.5786920389,0.180940040247
130041910103,1600,51.5786920389,0.180940040247,51.5841153671,0.134444590094
130043110201,1500,51.5712712038,0.138532882664,51.5334949484,0.130489470325
130043110202,1730,51.5334949484,0.130489470325,51.5712712038,0.138532882664
"""
f = StringIO(src)
# determine the fieldnames
fieldnames= "trip_id,time,O_lat,O_lng,D_lat,D_lng".split(",")
# read the file
reader = csv.DictReader(f, fieldnames=fieldnames)
# storage
origincoords = []
trip_id = []
destinationcoords = []
# iterate the rows
for row in reader:
    origincoords.append('{O_lat},{O_lng}'.format(**row))
    trip_id.append('{trip_id}'.format(**row))
    destinationcoords.append('{D_lat},{D_lng}'.format(**row))
# pop the header off the list
origincoords.pop(0)
trip_id.pop(0)
destinationcoords.pop(0)
# show the result
print origincoords
print trip_id
print destinationcoords
I don't really know what you are trying to achieve there, but I'm sure there is a better way of doing it!
