Need more efficient way to parse out csv file in Python

Here's a sample csv file:
id, serial_no
2, 500
2, 501
2, 502
3, 600
3, 601
This is the output I'm looking for (a list of serial_no values within a list for each id):
[2, [500,501,502]]
[3, [600, 601]]
I have implemented my solution but it's too much code and I'm sure there are better solutions out there. Still learning Python and I don't know all the tricks yet.
file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()
for row in data:
    each_row = []
    each_row.append(row[0])
    each_row.append(row[1])
    zipped_data.append(each_row)
for rec in zipped_data:
    if rec[0] not in ids:
        ids.append(rec[0])
for id in ids:
    for rec in zipped_data:
        if rec[0] == id:
            ser_no.append(rec[1])
    tmp.append(id)
    tmp.append(ser_no)
    print tmp
    tmp = []
    ser_no = []
(I've omitted variable initialization for brevity.)
This gives me the output I mentioned above. I know there's a better, more Pythonic way to do it; mine is just too messy! Any suggestions would be great!

from collections import defaultdict
import csv

records = defaultdict(list)
filename = 'test.csv'
data = csv.reader(open(filename))
fields = data.next()
for row in data:
    records[row[0]].append(row[1])
# sorting by id since dict keys don't maintain order
results = sorted(records.items(), key=lambda x: x[0])
print results
If the list of serial_nos needs to be unique, just replace defaultdict(list) with defaultdict(set), and records[row[0]].append(row[1]) with records[row[0]].add(row[1]).
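For instance, a minimal sketch of that set-based variant, assuming the same test.csv layout as above:
import csv
from collections import defaultdict

records = defaultdict(set)
with open('test.csv') as in_file:
    data = csv.reader(in_file)
    next(data)  # skip the header row
    for row in data:
        records[row[0]].add(row[1])  # a set silently discards duplicate serial_nos
print sorted(records.items())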

Instead of a list, I'd make it a collections.defaultdict(list), and then just call the append() method on the value.
result = collections.defaultdict(list)
for row in data:
    result[row[0]].append(row[1])

Here's a version I wrote; looks like there are plenty of answers for this one already, though.
You might like using csv.DictReader, which gives you easy access to each column by field name (taken from the header / first line).
#!/usr/bin/python
import csv

myFile = open('sample.csv', 'rb')
csvFile = csv.DictReader(myFile)
# first row will be used for field names (by default)
myData = {}
for myRow in csvFile:
    myId = myRow['id']
    if myId not in myData:
        myData[myId] = []
    myData[myId].append(myRow['serial_no'])
for myId in sorted(myData):
    print '%s %s' % (myId, myData[myId])
myFile.close()

Some observations:
0) file is a built-in (a synonym for open), so it's a poor choice of name for a variable. Further, the variable actually holds a file name, so...
1) The file can be closed as soon as we're done reading from it. The easiest way to accomplish that is with a with block.
2) The first loop appears to go over all the rows, grab the first two elements from each, and make a list with those results. However, your rows already all contain only two elements, so this has no net effect. The CSV reader is already an iterator over rows, and the simple way to create a list from an iterator is to pass it to the list constructor.
3) You proceed to make a list of unique ID values, by manually checking. A list of unique things is better known as a set, and the Python set automatically ensures uniqueness.
4) You have the name zipped_data for your data. This is telling: applying zip to the list of rows would produce a list of columns - and the IDs are simply the first column, transformed into a set.
5) We can use a list comprehension to build the list of serial numbers for a given ID. Don't tell Python how to make a list; tell it what you want in it.
6) Printing the results as we get them is kind of messy and inflexible; better to create the entire chunk of data (then we have code that creates that data, so we can do something else with it other than just printing it and forgetting it).
Applying these ideas, we get:
import csv

filename = 'test.csv'
with open(filename) as in_file:
    data = csv.reader(in_file)
    data.next()  # ignore the field labels
    rows = list(data)  # read the rest of the rows from the iterator
print [
    # We want a list of all serial numbers from rows with a matching ID...
    [serial_no for row_id, serial_no in rows if row_id == id]
    # ...for each of the IDs that there is to match, which come from making
    # a set from the first column of the data.
    for id in set(zip(*rows)[0])
]
We can probably do even better than this by using the groupby function from the itertools module.

Here's an example using itertools.groupby. Note that this only works if the rows are already grouped by id.
from csv import DictReader
from itertools import groupby
from operator import itemgetter
filename = 'test.csv'
# the context manager ensures that infile is closed when it goes out of scope
with open(filename) as infile:
    # group by id - this requires that the rows are already grouped by id
    groups = groupby(DictReader(infile), key=itemgetter('id'))
    # loop through the groups, printing a list for each one
    for i, j in groups:
        print [i, map(itemgetter(' serial_no'), list(j))]
Note the space in front of ' serial_no'. It is there because of the space after the comma in the header line of the input file.
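If you would rather not carry that space around, the csv module's skipinitialspace dialect option tells the reader to ignore whitespace right after the delimiter; here is a minimal sketch of the same groupby approach with it enabled (same test.csv assumed):
from csv import DictReader
from itertools import groupby
from operator import itemgetter

with open('test.csv') as infile:
    # skipinitialspace=True makes the field come through as 'serial_no', not ' serial_no'
    reader = DictReader(infile, skipinitialspace=True)
    for i, j in groupby(reader, key=itemgetter('id')):
        print [i, map(itemgetter('serial_no'), list(j))]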

Related

Writing a function that returns the number of unique names in a column in a dataset - Python

I'm currently trying to write a function that takes a dataset (one that I already have, named data) and looks for a column in this dataset called name. It then has to return the number of different names there are in the column (there are 4 values, but only 3 distinct ones--two of them are the same).
I'm having a hard time with this program, but this is what I have so far:
def name_count(data):
    unique = []
    for name in data:
        if name.strip() not in unique:
            unique[name] += 1
        else:
            unique[name] = 1
            unique.append(name)
The only import I'm allowed to use for this challenge is math.
Does anyone have any help or advice they can offer with this problem?
You can use a set to filter out the duplicates, for example:
data = ['name1', 'name2', 'name3', 'name3 ']
cleaned_data = map(lambda x: x.strip(), data)
count = len(set(cleaned_data))
print(count)
# output
3
You almost had it. Unique should be a dictionary, not a list.
def name_count(data):
    unique = {}
    for name in data:
        name = name.strip()  # strip once so lookup and insertion use the same key
        if name in unique:
            unique[name] += 1
        else:
            unique[name] = 1
    return unique
# test
print(name_count(['Jack', 'Jill', 'Mary', 'Sam', 'Jack', 'Mary']))
# output
{'Jack': 2, 'Jill': 1, 'Mary': 2, 'Sam': 1}
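If what you ultimately need is just the number of distinct names, note that it is simply the length of that dictionary; a quick sketch on top of the function above:
counts = name_count(['Jack', 'Jill', 'Mary', 'Sam', 'Jack', 'Mary'])
print(len(counts))
# output
4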
import pandas

def name_count(data):
    df = pandas.DataFrame(data)
    unique = []
    for name in df["name"]:  # if the column is named "name"
        if name and name not in unique:
            unique.append(name)
    return unique
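As an aside, if pandas is allowed at all, it already has a built-in for this: Series.nunique() returns the number of distinct values in a column. A minimal sketch (sample data made up for illustration):
import pandas

df = pandas.DataFrame({'name': ['Jack', 'Jill', 'Jack', 'Mary']})
print(df['name'].nunique())
# output
3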
You need to pass the complete dataset to the function and not just the integers.
It is not clear what kind of data variable you already have there.
So, I will suggest a solution, starting from reading the file.
Considering that you have a csv file and that there is a restriction to importing only the math module (as you mentioned), the following should work; it needs no imports at all.
def name_count(filename):
    with open(filename, 'r') as fh:
        headers = next(fh).strip().split(',')  # strip the trailing newline before splitting
        name_col_idx = headers.index('name')
        names = [
            line.strip().split(',')[name_col_idx]
            for line in fh
        ]
    return len(set(names))
Here we read the first line, identify the location of name in the header, collect all items in the name column into a variable names and finally return the length of the set, which contains only unique elements.
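For example, assuming a file people.csv whose header row contains a name column, calling the function is just:
print(name_count('people.csv'))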
Here is a solution if you are feeding a csv file to your function. It reads the csv file, drops the header line, collects the names (at index 1 of each line), casts the list to a set to remove the duplicates, and returns the length of the set, which is the number of unique names.
import csv

def name_count(filename):
    with open(filename, "r") as csvfile:
        csvreader = csv.reader(csvfile)
        names = [row[1] for row in csvreader if row][1:]
    return len(set(names))
Alternatively, if you don't want to use a csv reader, you can use a plain text file reader without any imports, as follows. The code splits each line on commas.
def name_count(filename):
    with open(filename, "r") as input_file:
        names = [row.rstrip('\n').split(',')[1] for row in input_file if row][1:]
    return len(set(names))

How to get the corresponding elements of a list from a data file more quickly?

I have a list of some old IDs, for example, list = [1, 2, 4, 7, 5].
I have a csv data file in which the third column is the old ID and the fifth column is the corresponding new ID. What can I do to extract the new IDs without re-reading the csv file for every element?
Right now my plan is to read the csv once per element in the list, but that takes a long time since the csv file is quite large.
id_list = []
for element in list:
    with open(path) as file:
        for row in file:
            if str(element) == row.split(",")[2]:
                id = int(row.split(",")[4])
                id_list.append(id)
Cheers!
A note about your program first: id and list are names of Python built-ins, so you should not shadow them with your own variables.
That said, here is a possible solution (with a very large csv file you may be better off loading it into a database, as stovfl suggested, but that's another idea).
You could use the code below and it should get the results. (You should probably use the csv module for working with csv files).
import csv

_list = [1, 2, 4, 7, 5]
old_id = set(_list)  # using a set gives much faster lookups
id_list = []
path = '/old_data/perlp/into.csv'  # insert your path here
with open(path, 'r') as infile:
    reader = csv.reader(infile)
    for row in reader:
        old = row[2]
        if int(old) in old_id:
            new_id = int(row[4])
            id_list.append(new_id)
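One caveat: if the csv file has a header row, int(old) will raise a ValueError on it. A small variant of the same idea that skips a possible header first (column positions assumed as in the question):
import csv

old_ids = set([1, 2, 4, 7, 5])
id_list = []
with open('/old_data/perlp/into.csv', 'r') as infile:
    reader = csv.reader(infile)
    next(reader, None)  # skip the header row, if there is one
    for row in reader:
        if int(row[2]) in old_ids:
            id_list.append(int(row[4]))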

Python: efficient way to create new csv from large dataset

I have a script that removes "bad elements" from a master list of elements, then returns a csv with the updated elements and their associated values.
My question is whether there is a more efficient way to perform the same operation as the for loop.
import pandas as pd

Master = pd.read_csv('some.csv', sep=',', header=0, error_bad_lines=False)
MasterList = Master['Elem'].tolist()
MasterListStrain1 = Master['Max_Principal_Strain'].tolist()
# this file should contain elements that are slated for deletion
BadElem = pd.read_csv('delete_me_elements_column.csv', sep=',', header=None, error_bad_lines=False)
BadElemList = BadElem[0].tolist()
NewMasterList = list(set(MasterList) - set(BadElemList))

filename = 'NewOutput.csv'
outfile = open(filename, 'w')
#pdb.set_trace()
for i, j in enumerate(NewMasterList):
    #pdb.set_trace()
    Elem_Loc = MasterList.index(j)
    line = '\n%s,%.25f' % (j, MasterListStrain1[Elem_Loc])
    outfile.write(line)
print("\n The new output file will be named: " + filename)
outfile.close()
Stage 1
If you necessarily want to iterate in a for loop, then besides using pd.to_csv (which is likely to improve performance) you can do the following:
...
SetBadElem = set(BadElemList)
...
for i, Elem in enumerate(MasterList):
    if Elem not in SetBadElem:
        line = '\n%s,%.25f' % (Elem, MasterListStrain1[i])
        outfile.write(line)
Jumping around the list with index() is never efficient, whereas iterating and skipping gives much better performance (checking membership in a set is an average O(1) operation, so it is quick).
Stage 2 Using Pandas properly
...
SetBadElem = set(BadElemList)
...
for _, row in Master.iterrows():  # iterating a dataframe directly yields column names, so use iterrows
    if row['Elem'] not in SetBadElem:
        line = '\n%s,%.25f' % (row['Elem'], row['Max_Principal_Strain'])
        outfile.write(line)
There is no need to create lists out of pandas dataframe columns. Using the whole dataframe (and indexing into it) is a much better approach.
Stage 3 Removing messy iterated formatting operations
We can add a column ('Formatted') that will contain formatted data. For that we will create a lambda function:
formatter = lambda row: '\n%s,%.25f' % (row['Elem'], row['Max_Principal_Strain'])
Master['Formatted'] = Master.apply(formatter, axis=1)  # axis=1 applies the function row-wise
Stage 4 Pandas-way filtering and output
We can format the dataframe in two ways. My preference is to reuse the formatting function:
import numpy as np

formatter = lambda row: '\n%s,%.25f' % (row['Elem'], row['Max_Principal_Strain']) if row['Elem'] not in SetBadElem else np.nan
Master['Formatted'] = Master.apply(formatter, axis=1)
Now we can use the built-in dropna, which drops all rows that have any NaN values (note that dropna returns a new dataframe rather than modifying Master in place):
Master = Master.dropna()
Master.to_csv(filename)
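For completeness, here is a fully vectorized sketch of the same filtering that avoids the formatter column entirely, using boolean indexing with isin (column names assumed to match the question):
import pandas as pd

Master = pd.read_csv('some.csv')
BadElem = pd.read_csv('delete_me_elements_column.csv', header=None)

# keep only the rows whose Elem does not appear in the bad-element column
kept = Master[~Master['Elem'].isin(BadElem[0])]
kept.to_csv('NewOutput.csv', columns=['Elem', 'Max_Principal_Strain'], index=False)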

How can I write to an existing csv file from a dictionary to a specific column?

I have a dictionary I created from a csv file and would like to use this dict to update the values in a specific column of a different csv file called sheet2.csv.
Sheet2.csv has many columns with different headers, and I need to update only the PartNumber column based on the key-value pairs in my dict.
My question is how would I use the keys in dict to search through sheet2.csv and update/write to only the column PartNumber with the appropriate value?
I am new to python so I hope this is not too confusing and any help is appreciated!
This is the code I used to create the dict:
import csv

a = open('sheet1.csv', 'rU')
csvReader = csv.DictReader(a)
dict = {}
for line in csvReader:
    dict[line["ReferenceID"]] = line["PartNumber"]
print(dict)
dict = {'R150': 'PN000123', 'R331': 'PN000873', 'C774': 'PN000064', 'L7896': 'PN000447', 'R0640': 'PN000878', 'R454': 'PN000333'}
To make things even more confusing, I also need to make sure that already existing rows in sheet2 remain unchanged. For example, if there is a row with ReferenceID as R1234 and PartNumber as PN000000, it should stay untouched. So I would need to skip rows which are not in my dict.
Link to sample CSVs:
http://dropbox.com/s/zkagunnm0xgroy5/Sheet1.csv
http://dropbox.com/s/amb7vr48mdc94v6/Sheet2.csv
EDIT: Let me rephrase my question and provide a better example csv file.
Let's say I have a Dict = {'R150': 'PN000123', 'R331': 'PN000873', 'C774': 'PN000064', 'L7896': 'PN000447', 'R0640': 'PN000878', 'R454': 'PN000333'}.
I need to fill in this csv file: https://www.dropbox.com/s/c95mlitjrvyppef/sheet.csv
Specifically, I need to fill in the PartNumber column using the keys of the dict I created. So I need to iterate through column ReferenceID and compare that value to my keys in dict. If there is a match I need to fill in the corresponding PartNumber cell with that value.... I'm sorry if this is all confusing!
The code below should do the trick. It first builds a dictionary just like your code and then moves on to read Sheet2.csv row by row, possibly updating the part number. The output goes to temp.csv, which you can compare with the initial Sheet2.csv. In case you want to overwrite Sheet2.csv with the contents of temp.csv, simply uncomment the line with shutil.move.
Note that the sample files you provided do not contain any updateable data, so Sheet2.csv and temp.csv will be identical. I tested this with a slightly modified Sheet1.csv where I made sure that it actually contains a reference ID used by Sheet2.csv.
import csv
import shutil

def createReferenceIdToPartNumberMap(csvToReadPath):
    result = {}
    print 'read part numbers to update from', csvToReadPath
    with open(csvToReadPath, 'rb') as csvInFile:
        csvReader = csv.DictReader(csvInFile)
        for row in csvReader:
            result[row['ReferenceID']] = row['PartNumber']
    return result

def updatePartNumbers(csvToUpdatePath, referenceIdToPartNumberMap):
    tempCsvPath = 'temp.csv'
    print 'update part numbers in', csvToUpdatePath
    with open(csvToUpdatePath, 'rb') as csvInFile:
        csvReader = csv.reader(csvInFile)
        # Figure out which columns contain the reference ID and part number.
        titleRow = csvReader.next()
        referenceIdColumn = titleRow.index('ReferenceID')
        partNumberColumn = titleRow.index('PartNumber')
        # Write temporary CSV file with updated part numbers.
        with open(tempCsvPath, 'wb') as tempCsvFile:
            csvWriter = csv.writer(tempCsvFile)
            csvWriter.writerow(titleRow)
            for row in csvReader:
                # Check if there is an updated part number.
                referenceId = row[referenceIdColumn]
                newPartNumber = referenceIdToPartNumberMap.get(referenceId)
                # If so, update the row just read accordingly.
                if newPartNumber is not None:
                    row[partNumberColumn] = newPartNumber
                    print ' update part number for %s to %s' % (referenceId, newPartNumber)
                csvWriter.writerow(row)
    # TODO: Move the temporary CSV file over the initial CSV file.
    # shutil.move(tempCsvPath, csvToUpdatePath)

if __name__ == '__main__':
    referenceIdToPartNumberMap = createReferenceIdToPartNumberMap('Sheet1.csv')
    updatePartNumbers('Sheet2.csv', referenceIdToPartNumberMap)

Writing to csv file

Let's say I have a dictionary:
dict = {'R150': 'PN000123', 'R331': 'PN000873', 'C774': 'PN000064', 'L7896': 'PN000447', 'R0640': 'PN000878', 'R454': 'PN000333'}
I need to fill in this sample csv file: https://www.dropbox.com/s/c95mlitjrvyppef/sheet.csv
example rows
HEADER,ID,ReferenceID,Value,Location X-Coordinate,Location Y-Coordinate,ROOM,ALT_SYMBOLS,Voltage,Thermal_Rating,Tolerance,PartNumber,MPN,Description,Part_Type,PCB Footprint,SPLIT_INST,SWAP_INFO,GROUP,Comments,Wattage,Tol,Population Notes,Gender,ICA_MFR_NAME,ICA_PARTNUM,Order#,CLASS,INSTALLED,TN,RATING,OriginalSymbolOrigin,Rated_Current,Manufacturer 2,Status,Need To Mirror/Rotate Pin Display Properties,TOLERANCE,LEVEL
,,R150,1,,,,,<null>,<null>,<null>,,,to be linked,Resistor,TODO,<null>,<null>,<null>,<null>,1/16W,?,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>
,,R4737,1,,,,,<null>,<null>,<null>,,,to be linked,Resistor,TODO,<null>,<null>,<null>,<null>,1/16W,?,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>
,,R4738,1,,,,,<null>,<null>,<null>,,,to be linked,Resistor,TODO,<null>,<null>,<null>,<null>,1/16W,?,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>,<null>
Specifically, I need to fill in the PartNumber column based on the keys of the dict I created. So I need to iterate through column ReferenceID and compare that value to my keys in dict. If there is a match I need to fill in the corresponding PartNumber cell with that value (in the dict)....
I'm sorry if this is all confusing; I am new to Python and am having trouble with the csv module.
To get you started, here's something that uses the csv.DictReader object and loops through your file one row at a time. Where the row's ReferenceID exists in your_dict, it sets PartNumber to the mapped value; otherwise it sets it to an empty string.
If you use this in conjunction with the docs at http://docs.python.org/2/library/stdtypes.html#typesmapping and http://docs.python.org/2/library/csv.html - you should be able to write out the data (see the sketch after the snippet below) and better understand what's happening.
import csv

your_dict = {'R150': 'PN000123', 'R331': 'PN000873', 'C774': 'PN000064', 'L7896': 'PN000447', 'R0640': 'PN000878', 'R454': 'PN000333'}

with open('your_csv_file.csv') as fin:
    csvin = csv.DictReader(fin)
    for row in csvin:
        row['PartNumber'] = your_dict.get(row['ReferenceID'], '')
        print row
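A minimal sketch of the write-out step those docs cover, using csv.DictWriter and leaving rows untouched when their ReferenceID is not in the dict (the output file name updated.csv is just an example):
import csv

your_dict = {'R150': 'PN000123', 'R331': 'PN000873', 'C774': 'PN000064', 'L7896': 'PN000447', 'R0640': 'PN000878', 'R454': 'PN000333'}

with open('your_csv_file.csv') as fin, open('updated.csv', 'w') as fout:
    csvin = csv.DictReader(fin)
    csvout = csv.DictWriter(fout, fieldnames=csvin.fieldnames)
    csvout.writeheader()
    for row in csvin:
        if row['ReferenceID'] in your_dict:
            # overwrite PartNumber only for matching rows
            row['PartNumber'] = your_dict[row['ReferenceID']]
        csvout.writerow(row)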
