I have a large dataset stored as a 17GB CSV file (fileData), which contains a variable number of records (up to approximately 30,000) for each customer_id. I am trying to search for specific customers (listed in fileSelection - around 1,500 out of a total of 90,000) and copy the records for each of these customers into a separate CSV file (fileOutput).
I am very new to Python, but I am using it because VBA and MATLAB (which I am more familiar with) can't handle the file size. (I am using Aptana Studio to write the code, but running Python directly from the cmd line for speed. Running 64-bit Windows 7.)
The code I have written is extracting some of the customers, but has two problems:
1) It is failing to find most of the customers in the large dataset. (I believe they are all in the dataset, but cannot be completely sure.)
2) It is VERY slow. Any way to speed up the code would be appreciated, including code that can better utilise a 16-core PC.
Here is the code:
def main():
    # Initialisation:
    # - identify columns in selection file
    fS = open(fileSelection, "r")
    if fS.mode == "r":
        header = fS.readline()
        selheaderlist = header.split(",")
        custkey = selheaderlist.index('CUSTOMER_KEY')

    # Identify columns in dataset file
    fileData = path2 + file_data
    fD = open(fileData, "r")
    if fD.mode == "r":
        header = fD.readline()
        dataheaderlist = header.split(",")
        custID = dataheaderlist.index('CUSTOMER_ID')
    fD.close()

    # For each customer in the selection file
    customercount = 1
    for sr in fS:
        # Find customer key and locate it in customer ID field in dataset
        selrecord = sr.split(",")
        requiredcustomer = selrecord[custkey]
        # Look for required customer in dataset
        found = 0
        fD = open(fileData, "r")
        if fD.mode == "r":
            while found == 0:
                dr = fD.readline()
                if not dr:
                    break
                datrecord = dr.split(",")
                if datrecord[custID] == requiredcustomer:
                    found = 1
                    # Open output file
                    fileOutput = path3 + file_out_root + str(requiredcustomer) + ".csv"
                    fO = open(fileOutput, "w+")
                    fO.write(str(header))
                    # Copy all records for required customer number
                    while datrecord[custID] == requiredcustomer:
                        fO.write(str(dr))
                        dr = fD.readline()
                        datrecord = dr.split(",")
                    # Close output file
                    fO.close()
        if found == 1:
            print("Customer Count " + str(customercount) + " Customer ID " + str(requiredcustomer) + " copied.")
            customercount = customercount + 1
        else:
            print("Customer ID " + str(requiredcustomer) + " not found in dataset")
            fL.write(str(requiredcustomer) + "," + "NOT FOUND")
        fD.close()
    fS.close()
It has taken a few days to extract a couple of hundred customers, but has failed to find many more.
Thanks @Paul Cornelius. This is much more efficient. I have adopted your approach, also using the CSV handling suggested by @Bernardo:
# Import modules
import csv

def main():
    # Initialisation
    fileSelection = path1 + file_selection
    fileData = path2 + file_data

    # Step through selection file and create dictionary with required IDs as keys, and empty lists as values
    with open(fileSelection, 'rb') as csvfile:
        selected_IDs = csv.reader(csvfile)
        ID_dict = {}
        for row in selected_IDs:
            ID_dict.update({row[1]: []})

    # Step through data file: for selected customer IDs, append records to dictionary values
    with open(fileData, 'rb') as csvfile:
        dataset = csv.reader(csvfile)
        for row in dataset:
            if row[0] in ID_dict:
                ID_dict[row[0]].extend([row[1] + ',' + row[4]])

    # Write all dictionary values to csv files
    for row in ID_dict.keys():
        fileOutput = path3 + file_out_root + row + '.csv'
        with open(fileOutput, 'wb') as csvfile:
            output = csv.writer(csvfile, delimiter='\n')
            output.writerows([ID_dict[row]])
Use the csv reader instead. Python has a good library for handling CSV files, so it is not necessary for you to do the splits yourself.
Check out the documentation: https://docs.python.org/2/library/csv.html
>>> import csv
>>> with open('eggs.csv', 'rb') as csvfile:
... spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
... for row in spamreader:
... print ', '.join(row)
Spam, Spam, Spam, Spam, Spam, Baked Beans
Spam, Lovely Spam, Wonderful Spam
It should perform much better.
The task is way too involved for a simple answer. But your approach is very inefficient because you have too many nested loops. Try making ONE pass through the list of customers, and for each build a "customer" object with any information that you need to use later. You put these in a dictionary; the keys are the different requiredcustomer variables and the values are the customer objects. If I were you, I would get this part to work first, before ever fooling around with the big file.
Now you step ONCE through the massive file of customer data, and each time you encounter a record whose datarecord[custID] field is in the dictionary, you append a line to the output file. You can use the relatively efficient in operator to test for membership in the dictionary.
No nested loops are necessary.
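A minimal sketch of that approach, using the csv module (a hedged illustration, not the asker's final code; the file names are placeholders and the column names come from the question):

import csv

# ONE pass over the selection file to build a set of the selected customer keys
with open('fileSelection.csv') as f_sel:
    sel_reader = csv.reader(f_sel)
    key_col = next(sel_reader).index('CUSTOMER_KEY')
    selected = {row[key_col] for row in sel_reader}

# ONE pass over the big file, collecting rows per selected customer
rows_by_customer = {cust: [] for cust in selected}
with open('fileData.csv') as f_data:
    data_reader = csv.reader(f_data)
    header = next(data_reader)
    id_col = header.index('CUSTOMER_ID')
    for row in data_reader:
        if row[id_col] in rows_by_customer:  # O(1) membership test
            rows_by_customer[row[id_col]].append(row)

# One output file per selected customer; buffering all matches assumes they
# fit in memory -- otherwise write them out incrementally instead
for cust, rows in rows_by_customer.items():
    with open(cust + '.csv', 'w') as f_out:
        writer = csv.writer(f_out)
        writer.writerow(header)
        writer.writerows(rows)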
The code as you present it can't run since you write to some object named fL without ever opening it. Also, as Tim Pietzcker pointed out, you aren't closing your files since you don't actually call the close function.
Try using pandas if your machine can handle the size of the csv in memory.
If you are looking for out-of-core computation, take a look at dask (it provides similar APIs).
In pandas, you can read only specific columns from a csv file if you run into memory problems.
Anyway, both pandas and dask use C bindings, which are significantly faster than pure Python.
In pandas, your code would look something like:
import pandas as pd

input_csv = pd.read_csv('path_to_csv')
records_for_interesting_customers = input_csv[input_csv['CUSTOMER_ID'].isin(list_of_ids)]
records_for_interesting_customers.to_csv('output_path')
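Since a 17GB file may not fit in memory, a chunked read is a reasonable middle ground. A hedged sketch (the column name CUSTOMER_ID comes from the question; the chunk size is an arbitrary choice):

import pandas as pd

selected = set(list_of_ids)  # assumes list_of_ids is defined as above
first = True
for chunk in pd.read_csv('path_to_csv', chunksize=100000):
    matches = chunk[chunk['CUSTOMER_ID'].isin(selected)]
    # write the header once, then append matching rows
    matches.to_csv('output_path', mode='w' if first else 'a', header=first, index=False)
    first = False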
Related
I'm having trouble figuring out where to dive in on this personal project and I was hoping this community could help me create a Python script to deal with this data.
I have a CSV file that contains a list of meals fed to dogs at an animal rescue, associated with the kennel number:
Source CSV - mealsandtreats.csv
blank_column,Kennel_Number,Species,Food,Meal_ID
,1,Dog,Meal,11.2
,5,Dog,Meal,45.2
,3,Dog,Meal,21.4
,4,Dog,Meal,17
,2,Dog,Meal,11.2
,4,Dog,Meal,21.4
,6,Dog,Meal,17
,2,Dog,Meal,45.2
I have a second CSV file that provides a key which maps the meals to what treats come with the meal:
Meal to Treat Key - MealsToTreatsKey.csv
Meals_fed,Treats_fed
10.1,2.4
11.2,2.4
13.5,3
15.6,3.2
17,3.2
20.1,5.1
21.4,5.2
35.7,7.7
45.2,7.9
I need to take every meal type (e.g., dropping duplicate entries) that was delivered from table 1, find the associated treat type, and then create an individual entry for every time a treat was served to a specific kennel. The final result should look something like this:
Result CSV - mealsandtreats.csv
blank_column,Kennel_Number,Species,Food,Meal_ID
,1,Dog,Meal,11.2
,5,Dog,Meal,45.2
,3,Dog,Meal,21.4
,4,Dog,Meal,17
,2,Dog,Meal,11.2
,4,Dog,Meal,21.4
,6,Dog,Meal,17
,2,Dog,Meal,45.2
,1,Dog,Treat,2.4
,5,Dog,Treat,7.9
,3,Dog,Treat,5.2
,4,Dog,Treat,3.2
,1,Dog,Treat,2.4
,4,Dog,Treat,5.2
Would prefer to do this with the csv module and not Pandas, but I'm open to using Pandas if necessary.
I have a bit of code so far just opening the CSVs, but I'm really stuck on where to go next:
import csv

with open('./meals/results/foodToTreats.csv', 'r') as t1, \
     open('./results/food.csv', 'r') as t2:
    key = t1.readlines()
    map = t2.readlines()

with open('./results/food.csv', 'w') as outFileF:
    for line in map:
        if line not in key:
            outFileF.write(line)

with open('./results/foodandtreats.csv', 'w') as outFileFT:
    for line in map:
        if line not in key:
            outFileFT.write(line)
So basically I just need to take every treat entry in the 2nd sheet, search for matching associated food entries in the 1st sheet, look up the kennel number associated with that entry and then write it to the 1st sheet.
Giving it my best shot in pseudo code, something like:
for x in column 0,y:
y,1 = Z
food = x
treat = y
kennel_number = z
when x,z:
writerows('', {'kennel_number"}, 'species', '{food/treat}',
{'meal_id"})
Update: Here is the exact code I'm using, thanks to @wwii. I'm seeing a minor bug:
import csv
import collections

treats = {}
with open('mealsToTreatsKey.csv') as f2:
    for line in f2:
        meal, treat = line.strip().split(',')
        treats[meal] = treat

new_items = set()
Treat = collections.namedtuple('Treat', ['blank_column', 'Kennel_Number', 'Species', 'Food', 'Meal_ID'])
with open('foodandtreats.csv') as f1:
    reader = csv.DictReader(f1)
    for row in reader:
        row['Food'] = 'Treat'
        row['Meal_ID'] = treats[row['Meal_ID']]
        new_items.add(Treat(**row))
    fieldnames = reader.fieldnames

with open('foodandtreats.csv', 'a') as f1:
    writer = csv.DictWriter(f1, fieldnames)
    for row in new_items:
        writer.writerow(row._asdict())
This works perfectly except for one small bug: the first new row written isn't starting on its own line.
Make a dictionary mapping meals to treats
treats = {}
with open(treatfile) as f2:
    for line in f2:
        meal, treat = line.strip().split(',')
        treats[meal] = treat
Iterate over the meal file and create set of new entries. Use namedtuples for the new items.
import collections
import csv

new_items = set()
Treat = collections.namedtuple('Treat', ['blank_column', 'Kennel_Number', 'Species', 'Food', 'Meal_ID'])
with open(mealfile) as f1:
    reader = csv.DictReader(f1)
    for row in reader:
        row['Food'] = 'Treat'
        row['Meal_ID'] = treats[row['Meal_ID']]
        new_items.add(Treat(**row))
    fieldnames = reader.fieldnames
Open the meal file (again) for appending and write the new entries
with open(mealfile, 'a') as f1:
    writer = csv.DictWriter(f1, fieldnames)
    for row in new_items:
        writer.writerow(row._asdict())
If the meals file does not end with a newline character, you will need to add one before writing the new treat lines. Since you have control of the files you should just make sure it always ends in a blank line.
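One way to guard against that, offered as an extra sketch rather than as part of the answer above, is to check the file's last byte before appending:

# Make sure the meal file ends with a newline before appending new rows
with open(mealfile, 'rb+') as f:
    f.seek(0, 2)              # jump to the end of the file
    if f.tell() > 0:
        f.seek(-1, 2)         # step back to the last byte
        if f.read(1) != b'\n':
            f.write(b'\n')    # terminate the last line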
I have a text file with about 20 entries. They look like this:
~
England
Link: http://imgur.com/foobar.jpg
Capital: London
~
Iceland
Link: http://imgur.com/foobar2.jpg
Capital: Reykjavik
...
etc.
I would like to take these entries and turn them into a CSV.
There is a '~' separating each entry. I'm scratching my head trying to figure out how to go through line by line and create the CSV values for each country. Can anyone give me a clue on how to go about this?
Use the libraries, Luke :)
I'm assuming your data is well formatted. Most real world data isn't that way. So, here goes a solution.
>>> content.split('~')
['\nEngland\nLink: http://imgur.com/foobar.jpg\nCapital: London\n', '\nIceland\nLink: http://imgur.com/foobar2.jpg\nCapital: Reykjavik\n', '\nEngland\nLink: http://imgur.com/foobar.jpg\nCapital: London\n', '\nIceland\nLink: http://imgur.com/foobar2.jpg\nCapital: Reykjavik\n']
For writing the CSV, Python has standard library functions.
>>> import csv
>>> csvfile = open('foo.csv', 'wb')
>>> fieldnames = ['Country', 'Link', 'Capital']
>>> writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
>>> writer.writeheader()
>>> entries = [e for e in content.split('~') if e.strip()]
>>> for entry in entries:
... cols = entry.strip().splitlines()
... writer.writerow({'Country': cols[0], 'Link':cols[1].split(': ')[1], 'Capital':cols[2].split(':')[1]})
...
If your data is more semi-structured or badly formatted, consider using a library like PyParsing.
Edit:
Second column contains URLs, so we need to handle the splits well.
>>> cols[1]
'Link: http://imgur.com/foobar2.jpg'
>>> cols[1].split(':')[1]
' http'
>>> cols[1].split(': ')[1]
'http://imgur.com/foobar2.jpg'
The way that I would do that would be to use the open() function using the syntax of:
f = open('NameOfFile.extensionType', 'a+')
Where "a+" is append mode. The file will not be overwritten and new data can be appended. You could also use "r+" to open the file in read mode, but would lose the ability to edit. The "+" after a letter signifies that if the document does not exist, it will be created. The "a+" I've never found to work without the "+".
After that I would use a for loop like this:
data = []
tmp = []
for line in f:
    line = line.strip()  # strip() returns a new string; this removes the trailing newline
    if line == '~':
        data.append(tmp)
        tmp = []
        continue
    else:
        tmp.append(line)
if tmp:
    data.append(tmp)  # flush the final entry if the file does not end with '~'
Now you have all of the data stored in a list, but you could also reformat it as a class object using a slightly different algorithm.
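For example, a small class-based variant might look like this (the field names are my own, hypothetical choices):

class Entry:
    """Holds the lines collected between '~' separators."""
    def __init__(self, lines):
        self.country = lines[0]
        self.link = lines[1]
        self.capital = lines[2]

# Assumes data was built by the loop above, one list of lines per country
entries = [Entry(lines) for lines in data if len(lines) >= 3]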
I have never edited CSV files using python, but I believe you can use a loop like this to add the data:
f2 = open('CSVfileName.csv', 'w')  # Can change "w" for other needs, i.e. "a+"
for entry in data:
    for subentry in entry:
        f2.write(str(subentry) + '\n')  # Use '\n' to create a new line
From my knowledge of CSV that loop would create a single column of all of the data. At the end remember to close the files in order to save the changes:
f.close()
f2.close()
You could combine the two loops into one in order to save space, but for the sake of explanation I have not.
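If you would rather have one row per country than a single column, a csv.writer sketch (assuming each entry in data holds the country line, the Link line, and the Capital line, in that order):

import csv

with open('CSVfileName.csv', 'w') as f2:
    writer = csv.writer(f2)
    writer.writerow(['Country', 'Link', 'Capital'])
    for entry in data:
        country = entry[0]
        link = entry[1].split(': ', 1)[1]     # drop the 'Link: ' prefix
        capital = entry[2].split(': ', 1)[1]  # drop the 'Capital: ' prefix
        writer.writerow([country, link, capital])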
I want to parse a csv file which is in the following format:
Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3
Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3
Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3
Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3
and would like to turn this into tab-separated format like the following:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
The number of TestAttributes varies from test to test. For some tests there are only 3 values, for others 7, etc. Also, as in the TestName4 example, some tests are executed more than once, and each execution has its own TestAttributeValue line. (In the example, TestName4 is executed 3 times, hence the 3 value lines.)
I am new to Python and do not have much knowledge, but I would like to parse the csv file with Python. I checked the 'csv' library and could not be sure whether it will be enough for me, or whether I should write my own string parser. Could you please help me?
Best
I'd go with a solution using the itertools.groupby function and the csv module. Please have a close look at the documentation of itertools -- you can use it more often than you think!
I've used blank lines to differentiate the datasets, and this approach uses lazy evaluation, storing only one dataset in memory at a time:
import csv
from itertools import groupby

with open('my_data.csv') as ifile, open('my_out_data.csv', 'wb') as ofile:
    # Use the csv module to handle reading and writing of delimited files.
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter='\t')

    # Skip info line
    next(reader)

    # Group datasets by the condition if len(row) > 0 or not, then filter
    # out all empty lines
    for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
        test_data = list(group)
        # Write header
        writer.writerow([test_data[0][1]])
        # Write transposed data
        writer.writerows(zip(*test_data[1:]))
        # Write blank line
        writer.writerow([])
Output, given that the supplied data is stored in my_data.csv:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
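To see how the grouping in that loop behaves, here is a toy run (made-up rows; blank lines come through the reader as empty lists, which are falsy):

from itertools import groupby

rows = [['Test', 'A', ''], ['x', 'y'], [], ['Test', 'B', ''], ['p', 'q']]
for key, group in groupby(rows, lambda r: bool(len(r))):
    print(key, list(group))
# True  [['Test', 'A', ''], ['x', 'y']]
# False [[]]
# True  [['Test', 'B', ''], ['p', 'q']]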
The following does what you want, and only reads up to one section at a time (saves memory for a large file). Replace in_path and out_path with the input and output file paths respectively:
import csv

def print_section(section, f_out):
    if len(section) > 0:
        # find maximum column length
        max_len = max([len(col) for col in section])
        # build and print each row
        for i in xrange(max_len):
            f_out.write('\t'.join([col[i] if len(col) > i else '' for col in section]) + '\n')
        f_out.write('\n')

# csv.reader is not a context manager, so open the file separately
with open(in_path, 'r') as ifile, open(out_path, 'w') as f_out:
    f_in = csv.reader(ifile)
    # skip the info line
    next(f_in)
    section = []
    for line in f_in:
        # test for new "Test" section
        if len(line) == 3 and line[0] == 'Test' and line[2] == '':
            # write previous section data
            print_section(section, f_out)
            # reset section
            section = []
            # write new section header
            f_out.write(line[1] + '\n')
        else:
            # add line to section
            section.append(line)
    # print the last section
    print_section(section, f_out)
Note that you'll want to change 'Test' in the line[0] == 'Test' statement to the correct word for indicating the header line.
The basic idea here is that we import the file into a list of lists, then write that list of lists back out using a list comprehension to transpose it (as well as adding in blank elements when the columns are uneven).
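As a tiny illustration of that pad-and-transpose idea (made-up values):

rows = [['a1', 'a2', 'a3'], ['v1', 'v2']]
width = max(len(r) for r in rows)
padded = [r + [''] * (width - len(r)) for r in rows]
print(list(zip(*padded)))
# [('a1', 'v1'), ('a2', 'v2'), ('a3', '')]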
I am a complete newb at this and I have the following script..
It writes some random data to a .csv. My end goal is to keep this preexisting .csv but add ONE randomly generated datapoint to the beginning of it in a separate Python script.
Completely new at this -- not sure how to go about doing this. Thanks for your help.
import csv

# a and b are defined earlier in the script; csvfile is an already-open file object
output = [a, b]
d = csv.writer(csvfile, delimiter=',', quotechar='|',
               quoting=csv.QUOTE_MINIMAL)
d.writerow(output)
Are you sure you are trying to add it to the start of the file? I feel like you would want to add it to the end; or, if you did want to add it at the beginning, you would at least want to put it after the header row, which is ['name', 'value'].
As it stands your current script has several errors when I try to compile it myself so I can help you out a bit there.
The directory string doesn't work because of the backslashes. It will work if you add an r in front (for a raw string), like so: r'C:/Users/AMB/Documents/Aptana Studio 3 Workspace/RAVE/RAVE/resources/csv/temperature.csv'
You don't need to import json or logging if this is the entirety of your code.
Inside your for loop you redefine the temperature writer, which is unnecessary; your definition at the start is good enough.
You have an extra comma in the line output = [timeperiod, temp,]
Moving on to a script that inserts a single data point. This script reads in your existing file. Inserts a new line (you would use random values, I used 1 for time and 2 for value) on the second line which is beneath the header. Let me know if this isn't what you are looking for.
directory = r"C:/Users/AMB/Documents/Aptana Studio 3 Workspace/RAVE/RAVE/resources/csv/temperature.csv"
with open(directory, 'r') as csvfile:
s = csvfile.readlines()
time = 1
value = 2
s.insert(2, '%(time)d,%(value)d\n\n' % \
{'time': time, "value": value})
with open(directory, 'w') as csvfile:
csvfile.writelines(s)
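For reference, list.insert(i, x) places x before the element currently at index i, so inserting at index 1 would put the new row directly beneath the header, while index 2 (as above) puts it after the first data row:

s = ['header\n', 'row1\n', 'row2\n']
s.insert(1, 'new\n')
print(s)  # ['header\n', 'new\n', 'row1\n', 'row2\n']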
This next section is in response to your more detailed question in the comments:
import csv
import random

directory = r"C:\Users\snorwood\Desktop\temperature.csv"

# Open the file
with open(directory, 'r') as csvfile:
    s = csvfile.readlines()

# This list will store your data
data = []

# This loop converts the data read from the text file into integer values in your data set
for i, point in enumerate(s[1:]):
    separatedPoint = point.strip("\n").split(",")
    if len(separatedPoint) == 2:
        data.append([int(dataPoint) for dataPoint in separatedPoint])

# Loop through your animation numberOfLoops times
numberOfLoops = 100
for i in range(numberOfLoops):
    if len(data) == 0:
        break
    newTime = data[-1][0] + 1  # An int that is one higher than the current last time value
    del data[0]  # Deletes the first data point
    newRandomValue = 2  # Placeholder; this is where a random value would go
    data.append([newTime, newRandomValue])  # Adds the new data point to the end of the list

    # Insert your drawing code here

# Write the data back into the text file
with open(directory, 'w') as csvfile:  # opens the file for writing
    temperature = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)  # the object that knows how to write to files
    temperature.writerow(["name", "values"])  # write the header row
    for point in data:  # loop through the points stored in data
        temperature.writerow(point)  # write current point in set
So, working on a program in Python 3.3.2. New to it all, but I've been getting through it. I have an app that I made that will take 5 inputs. 3 of those inputs are comboboxes, two are entry widgets. I have then created a button event that will save those 5 inputs into a text file and a csv file. Opening each file, everything looks proper. For example the saved info would look like this:
Brad M.,Mike K.,Danny,Iconnoshper,Strong Wolf Lodge
I then followed a csv demo and copied this...
import csv

ifile = open('myTestfile.csv', "r")
reader = csv.reader(ifile)

rownum = 0
for row in reader:
    # Save header row.
    if rownum == 0:
        header = row
    else:
        colnum = 0
        for col in row:
            print('%-15s: %s' % (header[colnum], col))
            colnum += 1
    rownum += 1

ifile.close()
and that ends up printing beautifully as:
rTech: Brad M.
pTech: Mike K.
cTech: Danny
proNam: ohhh
jobNam: Yeah
rTech: Damien
pTech: Aaron
so on and so on. What I'm trying to figure out is if I've named my headers via
if rownum == 0:
    header = row
is there a way to pull a specific row/column combo and print what is held there?
I have figured out that, after the program has run, I could do
print(col)
or
print(col[0:10])
and I am able to print the last col printed, or the letters from the last printed col. But I can't go any farther back than that last printed col.
My ultimate goal is to be able to assign variables so I could in turn have a label in another program get its information from the csv file.
rTech for job is???
look in Jobs csv at row 1, column 1, and return value for rTech
Do I need to create a dictionary that is loaded with the information and then call the dictionary? Thanks for any guidance.
Thanks for the direction. So I've been trying a few different things, one of which I'm really liking is the following...
import csv

labels = ['rTech', 'pTech', 'cTech', 'productionName', 'jobName']
fn = 'my file.csv'
cameraTech = 'Danny'

f = open(fn, 'r')
reader = csv.DictReader(f, labels)
jobInformation = [(item["productionName"],
                   item["jobName"],
                   item["pTech"],
                   item["rTech"]) for item in reader
                  if item['cTech'] == cameraTech]
f.close()

print("Camera Tech: %s\n" % (cameraTech))
print("\n".join(["Production Name: %s \nJob Name: %s \nPrep Tech: %s \nRental Agent: %s\n" % (item) for item in jobInformation]))
That shows me that I can set a variable, cameraTech, and as long as it matches a value in the cTech column of the csv file loaded into the reader, the proper information gets filled in. 95% there, WOOOOOO..
So now what I'm curious about is calling each item. The plan is in a window I have a listbox that is populated with items from a .txt file with "productionName" and "jobName". When I click on one of those items in the listbox a new window opens up and the matching information from the .csv file is then filled into the appropriate labels.
Thoughts??? Thanks again :)
I think that reading the CSV file into a dictionary might be a working solution for your problem.
The Python CSV package has built-in support for reading CSV files into a Python dictionary using DictReader, have a look at the documentation here: http://docs.python.org/2/library/csv.html#csv.DictReader
Here is an (untested) example using DictReader that reads the CSV file into a Python dictionary and prints the contents of the first row:
import csv

csv_data = csv.DictReader(open("myTestfile.csv"))
print(next(csv_data))  # a DictReader is an iterator, so fetch the first row with next()
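If you need random access by row and column (the row/column lookup asked about above), one option, sketched against the same hypothetical file, is to load all the rows into a list first:

import csv

with open("myTestfile.csv") as f:
    rows = list(csv.DictReader(f))

# Row index 1, column 'rTech' (header names taken from the question)
print(rows[1]["rTech"])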
Okay, so I was able to put this together after seeing the following: https://gist.github.com/zstumgoren/911615
That showed me how to give each header a variable I could call. From there I could create a function that allows certain variables to be called and compared, and if they match I can pull out the data I need. The example I made to show myself it could be done is as follows:
import csv

source_file = open('jobList.csv', 'r')
for line in csv.DictReader(source_file, delimiter=','):
    pTech = line['pTech']
    cTech = line['cTech']
    rAgent = line['rTech']
    prodName = line['productionName']
    jobName = line['jobName']
    if prodName == 'another':
        print(pTech, cTech, rAgent, jobName)
However, I just noticed something: while my .csv file has one line, this works great!!!! But with my proper .csv file, I am only able to print information from the last line read. Grrrrr.... Getting closer though.... I'm still searching, but if someone understands my issue, I would love some light.
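One guess at the cause, offered as a sketch rather than a confirmed fix: if the variables are used after the loop finishes, each iteration overwrites the previous values, so only the last row read survives. Collecting the matches in a list keeps them all:

import csv

matches = []
with open('jobList.csv') as source_file:
    for line in csv.DictReader(source_file, delimiter=','):
        if line['productionName'] == 'another':
            matches.append((line['pTech'], line['cTech'], line['rTech'], line['jobName']))

for match in matches:
    print(*match)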