Optimize Python CSV Reader Performance

My following code works correctly, but far too slowly. I would greatly appreciate any help you can provide:
import gf
import csv
cic = gf.ct
cii = gf.cit
li = gf.lt
oc = "Output.csv"
with open(cic, "rb") as input1:
    reader = csv.DictReader(input1,gf.ctih)
    with open(oc,"wb") as outfile:
        writer = csv.DictWriter(outfile,gf.ctoh)
        writer.writerow(dict((h,h) for h in gf.ctoh))
        next(reader)
        for ci in reader:
            row = {}
            row["ci"] = ci["id"]
            row["cyf"] = ci["yf"]
            with open(cii,"rb") as ciif:
                reader2 = csv.DictReader(ciif,gf.citih)
                next(reader2)
                with open(li, "rb") as lif:
                    reader3 = csv.DictReader(lif,gf.lih)
                    next(reader3)
                    # loop variables renamed so they do not overwrite the paths cii and li
                    for cii_row in reader2:
                        if ci["id"] == cii_row["id"]:
                            row["ci"] = cii_row["ca"]
                    for li_row in reader3:
                        if ci["id"] == li_row["en_id"]:
                            row["cc"] = li_row["c"]
            writer.writerow(row)
The reason I open reader2 and reader3 for every row in reader is because reader objects iterate through once and then are done. But there has to be a much more efficient way of doing this and I would greatly appreciate any help you can provide!
If it helps, the intuition behind this code is the following: from input file 1, grab two cells. See if input file 2 has the same primary key as input file 1; if so, grab a cell from input file 2 and save it with the two other saved cells. See if input file 3 has the same primary key as input file 1; if so, grab a cell from input file 3 and save it. Then output these four values. That is, I'm grabbing metadata from normalized tables and trying to denormalize it. There must be a way of doing this very efficiently in Python. One problem with the current code is that I iterate through reader objects until I find the relevant ID, when there must be a simpler way of searching for a given ID in a reader object...

For one, if this really does live in a relational database, why not just do a big join with some carefully phrased selects?
If I were doing this, I would use pandas.DataFrame and merge the 3 tables together, then I would iterate over each row and use suitable logic to transform the resulting "join"ed datasets into a single final result.
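Without a database or pandas, the same join can be sketched with plain dictionaries: each lookup file is read once into a dict keyed by ID, so the main loop does a constant-time lookup instead of rescanning files 2 and 3 for every row. This is only a sketch under assumptions. The column names (id, yf, ca, en_id, c) come from the question; the function name denormalize, the output column cca, the in-memory sample data, the use of each file's own header row, and Python 3 are all illustrative assumptions.

```python
import csv
import io


def denormalize(main_file, second_file, third_file, out_file):
    """Join three CSV streams on their ID columns using dict lookups."""
    # Read each lookup table once: ID -> the single cell we need from it.
    ca_by_id = {r["id"]: r["ca"] for r in csv.DictReader(second_file)}
    c_by_id = {r["en_id"]: r["c"] for r in csv.DictReader(third_file)}

    # "cca" is a hypothetical output column name for the value from file 2.
    writer = csv.DictWriter(out_file, fieldnames=["ci", "cyf", "cca", "cc"],
                            lineterminator="\n")
    writer.writeheader()
    for r in csv.DictReader(main_file):
        writer.writerow({
            "ci": r["id"],
            "cyf": r["yf"],
            "cca": ca_by_id.get(r["id"], ""),  # empty cell if there is no match
            "cc": c_by_id.get(r["id"], ""),
        })


# Tiny made-up inputs to show the shape of the result.
main = io.StringIO("id,yf\n1,2001\n2,2002\n")
second = io.StringIO("id,ca\n1,x\n")
third = io.StringIO("en_id,c\n2,y\n")
out = io.StringIO()
denormalize(main, second, third, out)
result = out.getvalue()
```

If an ID appears more than once in a lookup file, the last occurrence wins, which matches what the nested loops in the question end up doing.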

Related

(Closed) Python CSV: (Master and Detail) Search and Insert Value From Detail File into Specific Column On Master File

I am new to Python.
I have a task involving Python and CSV where I need to add values into a specific column of the Master file after getting the data from the Detail file.
Below are the sample Master File, Detail File, and expected output.
Master File:
Detail File:
Output Expected:
I have some source code, but it is not complete at this stage.
Here is what I have so far:
import csv  # the functions below call csv.DictReader
from collections import defaultdict
loaded = defaultdict(list)
month1=[]
month2=[]
month3=[]
def getdetailpayment(data):
    f=open(data)
    csv_file = csv.DictReader(f, delimiter=",")
    for row in csv_file:
        print(dict(row))
    f.close()
def search_masterfile(data):
    word = input("Search name: ")
    f=open(data)
    my_reader=csv.DictReader(f,delimiter=",")
    for row in my_reader:
        for entry in row:
            if row[entry]==word:
                print(row)
                #insert value into this row on specific column
    f.close()
search_masterfile("csv/master.csv")
getdetailpayment("csv/detail.csv")
My plan is to work with dictionaries, where I assume I can insert a value into a specific column of the Master file based on the matching record in the Detail file. Unfortunately my knowledge in this area is very weak; I have already tried to find source code on Google, but nothing does what I want.
Please help me with this matter. Thank you very much in advance.

If you want to do this with basic Python you could try the following:
import csv
from datetime import datetime as dt
with open("master.csv", "r") as file:
    reader = csv.DictReader(file)
    fieldnames = reader.fieldnames
    master = {(row["Name"], row["Key No"]): row for row in reader}
def to_month(string):
    return f"Month {dt.strptime(string, '%b-%y').month}"
with open("details.csv", "r") as file:
    next(file)
    for name, key, month, amount in csv.reader(file):
        master[name, key][to_month(month)] = amount
with open("result.csv", "w") as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(master.values())
The first step reads the master file (here master.csv) into a dictionary master with (Name, Key No) tuples as keys (for the merging). Along the way the column names (fieldnames) are picked up for writing later.
The second step is reading through the details file (here details.csv) and updating the respective values in master. The to_month function transforms the month-strings into the corresponding key (column name).
Then write the result into a new file (here result.csv).
You might have to adjust the filenames.
The result for the given input
master.csv:
Name,Key No,Month 1,Month 2,Month 3
A,1,,,
B,2,,,
C,3,,,
D,4,,,
details.csv:
Name,Key No,Month,Amount
A,1,Feb-22,100
B,2,Jan-22,80
C,3,Feb-22,80
D,4,Jan-22,100
A,1,Jan-22,200
C,3,Jan-22,90
is the following file result.csv:
Name,Key No,Month 1,Month 2,Month 3
A,1,200,100,
B,2,80,,
C,3,90,80,
D,4,100,,
If you have to do stuff like this often you might want to look into the Pandas library.

Parsing column from CSV and replace a value in text file with the new value

I have one CSV file, and I want to extract the first column of it. My CSV file is like this:
Device ID;SysName;Entry address(es);IPv4 address;Platform;Interface;Port ID (outgoing port);Holdtime
PE1-PCS-RANCAGUA;;;192.168.203.153;cisco CISCO7606 Capabilities Router Switch IGMP;TenGigE0/5/0/1;TenGigabitEthernet3/3;128 sec
P2-CORE-VALPO.cisco.com;P2-CORE-VALPO.cisco.com;;200.72.146.220;cisco CRS Capabilities Router;TenGigE0/5/0/0;TenGigE0/5/0/4;128 sec
PE2-CONCE;;;172.31.232.42;Cisco 7204VXR Capabilities Router;GigabitEthernet0/0/0/14;GigabitEthernet0/3;153 sec
P1-CORE-CRS-CNT.entel.cl;P1-CORE-CRS-CNT.entel.cl;;200.72.146.49;cisco CRS Capabilities Router;TenGigE0/5/0/0;TenGigE0/1/0/6;164 sec
For that purpose I use the following code that I saw here:
import csv
makes = []
with open('csvoutput/topologia.csv', 'rb') as f:
    reader = csv.reader(f)
    # next(reader) # Ignore first row
    for row in reader:
        makes.append(row[0])
print makes
Then I want to take a text file and, for each value in the first column, replace a particular value in it with that column value and save the result as a new file.
Original textfile:
PLANNED.IMPACTO_ID = IMPACTO.ID AND
PLANNED.ESTADOS_ID = ESTADOS_PLANNED.ID AND
TP_CLASIFICACION.ID = TP_DATA.ID_TP_CLASIFICACION AND
TP_DATA.PLANNED_ID = PLANNED.ID AND
PLANNED.FECHA_FIN >= CURDATE() - INTERVAL 1 DAY AND
PLANNED.DESCRIPCION LIKE '%P1-CORE-CHILLAN%';
Expected output:
PLANNED.IMPACTO_ID = IMPACTO.ID AND
PLANNED.ESTADOS_ID = ESTADOS_PLANNED.ID AND
TP_CLASIFICACION.ID = TP_DATA.ID_TP_CLASIFICACION AND
TP_DATA.PLANNED_ID = PLANNED.ID AND
PLANNED.FECHA_FIN >= CURDATE() - INTERVAL 1 DAY AND
PLANNED.DESCRIPCION LIKE 'FIRST_COLUMN_VALUE';
And so on for every value in the first column, and save it as a separate file.
How can I do this? Thank you very much for your help.

You could just read the file, apply the changes, and write the file back again. There is no efficient way to edit a file in place (inserting characters is not efficiently possible); you can only rewrite it.
If your file is going to be big, you should not keep the whole table in memory.
import csv
makes = []
with open('csvoutput/topologia.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        makes.append(row)
# Apply changes in makes
with open('csvoutput/topologia.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(makes)
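For the task as the question states it (one output file per first-column value, with the pattern in the text file replaced), a minimal Python 3 sketch might look like the following. The function names first_column and write_queries, the output naming scheme (device name plus ".txt"), and the miniature sample data are hypothetical. The ";" delimiter is taken from the sample CSV in the question, and the whole '%P1-CORE-CHILLAN%' pattern is replaced because the expected output there drops the percent signs.

```python
import csv
import io
import os
import tempfile


def first_column(csv_stream, delimiter=";"):
    """Return the first field of every data row, skipping the header."""
    reader = csv.reader(csv_stream, delimiter=delimiter)
    next(reader, None)  # skip the header row
    return [row[0] for row in reader if row]


def write_queries(template, placeholder, values, out_dir):
    """Write one copy of the template per value, with the placeholder replaced."""
    paths = []
    for value in values:
        # Hypothetical naming scheme; assumes values are valid file names.
        path = os.path.join(out_dir, value + ".txt")
        with open(path, "w") as f:
            f.write(template.replace(placeholder, value))
        paths.append(path)
    return paths


# Made-up miniature inputs modeled on the question.
sample_csv = io.StringIO("Device ID;SysName\nPE1-PCS-RANCAGUA;\nPE2-CONCE;\n")
template = "PLANNED.DESCRIPCION LIKE '%P1-CORE-CHILLAN%';\n"

devices = first_column(sample_csv)
out_dir = tempfile.mkdtemp()
paths = write_queries(template, "%P1-CORE-CHILLAN%", devices, out_dir)
```

With real files, first_column would be given an open file object instead of the StringIO, and template would be the contents of the original text file.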

how to edit a csv in python and add one row after the 2nd row that will have the same values in all columns except 1

I'm new to Python and I'm facing a small challenge that I haven't been able to figure out so far.
I receive a csv file with around 30-40 columns and 5-50 rows with various details in each cell. The 1st row of the csv has the title for each column, and the item values start on the 2nd row.
What I want to do is create a Python script that reads the csv file and does the following every time:
Add a row after the first item row (literally after the 2nd row, because the 1st row is titles). The new 3rd row should contain the same information as the row above it, with one difference only: in the column "item_subtotal" I want to add the value from the column "discount_total".
All the rows below should remain as they are, and the modified csv should be saved as a new file with the word "edited" added to the file name.
I could really use some help, because so far I have only managed to open the csv file with the Python script I'm developing; I have not been able to copy the contents of the row above into the newly created row and replace that specific value.
Looking forward to any help.
Thank you
Here I'm attaching the CSV with some values changed for privacy reasons.
order_id,order_number,date,status,shipping_total,shipping_tax_total,fee_total,fee_tax_total,tax_total,discount_total,order_total,refunded_total,order_currency,payment_method,shipping_method,customer_id,billing_first_name,billing_last_name,billing_company,billing_email,billing_phone,billing_address_1,billing_address_2,billing_postcode,billing_city,billing_state,billing_country,shipping_first_name,shipping_last_name,shipping_address_1,shipping_address_2,shipping_postcode,shipping_city,shipping_state,shipping_country,shipping_company,customer_note,item_id,item_product_id,item_name,item_sku,item_quantity,item_subtotal,item_subtotal_tax,item_total,item_total_tax,item_refunded,item_refunded_qty,item_meta,shipping_items,fee_items,tax_items,coupon_items,order_notes,download_permissions_granted,admin_custom_order_field:customer_type_5
15001_TEST_2,,"2017-10-09 18:53:12",processing,0,0.00,0.00,0.00,5.36,7.06,33.60,0.00,EUR,PayoneCw_PayPal,"0,00",0,name,surname,,name.surname#gmail.com,0123456789,"address 1",,41541_TEST,location,,DE,name,surname,address,01245212,14521,location,,DE,,,1328,302,"product title",103,1,35.29,6.71,28.24,5.36,0.00,0,,"id:1329|method_id:free_shipping:3|method_title:0,00|total:0.00",,id:1330|rate_id:1|code:DE-MWST-1|title:MwSt|total:5.36|compound:,"id:1331|code:#getgreengent|amount:7.06|description:Launchcoupon for friends","text string",1,

You can also use pandas to manipulate the data from the csv like this:
import os
import pandas
Read the csv file into a pandas dataframe:
df = pandas.read_csv(filename)
Copy the first row of data and add the discount total to the item subtotal:
new_row = df.iloc[0].copy()
new_row['item_subtotal'] += new_row['discount_total']
Concatenate the first data row with the new row and then everything after that:
df = pandas.concat([df.iloc[:1], new_row.to_frame().T, df.iloc[1:]], ignore_index=True)
Change the filename and write out the new csv file:
filename = os.path.splitext(filename)[0] + 'edited.csv'
df.to_csv(filename, index=False)
I hope this helps! Pandas is great for cleanly handling massive amounts of data, but may be overkill for what you are trying to do. Then again, maybe not. It would help to see an example data file.

The first step is to turn that .csv into something that is a little easier to work with. Fortunately, python has the 'csv' module which makes it easy to turn your .csv file into a much nicer list of lists. The below will give you a way to both turn your .csv into a list of lists and turn the modified data back into a .csv file.
import csv
import copy
def csv2list(ifile):
    """
    ifile = the path of the csv to be converted into a list of lists
    """
    olist=[]
    with open(ifile,'rb') as f: # the with block closes the file
        c = csv.reader(f, dialect='excel')
        for line in c:
            olist.append(line) #and update the outer array
    return olist
#------------------------------------------------------------------------------
def list2csv(ilist,ofile):
    """
    ilist = the list of lists to be converted
    ofile = the output path for your csv file
    """
    with open(ofile, 'wb') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',',
                               quotechar='|', quoting=csv.QUOTE_MINIMAL)
        csvwriter.writerows(ilist)
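Now you can simply copy ilist[1] and change the appropriate element to reflect your summed value. Here n is the index of the column to update and n-x is the index of the column to add to it; the cells are strings, so convert them before adding:
listTemp = copy.deepcopy(ilist[1])
listTemp[n] = str(float(listTemp[n]) + float(listTemp[n-x]))
ilist.insert(2,listTemp)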
As for how to change the file name, just use:
import os
newFileName = os.path.splitext(oldFileName)[0] + "edited" + os.path.splitext(oldFileName)[1]
Hopefully this will help you out!
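Putting the pieces above together, a Python 3 sketch of the whole task could look like this. It is one possible reading of the question, not a definitive implementation: the function name add_discount_row, the "_edited" suffix, the two-decimal formatting of the sum, and the three-column sample file are assumptions. Only the column names item_subtotal and discount_total come from the posted CSV header.

```python
import csv
import os
import tempfile


def add_discount_row(in_path):
    """Copy the first data row, add discount_total to item_subtotal, and
    write everything to '<name>_edited<ext>'. Returns the new path."""
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f))

    header, first = rows[0], rows[1]
    sub = header.index("item_subtotal")
    disc = header.index("discount_total")

    new_row = list(first)  # a shallow copy is enough for a list of strings
    # Cells are strings, so convert before adding (two decimals assumed).
    new_row[sub] = "{:.2f}".format(float(first[sub]) + float(first[disc]))
    rows.insert(2, new_row)  # directly after the first data row

    base, ext = os.path.splitext(in_path)
    out_path = base + "_edited" + ext
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return out_path


# Made-up three-column file standing in for the real export.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "orders.csv")
with open(src, "w", newline="") as f:
    f.write("order_id,discount_total,item_subtotal\nA,7.06,35.29\nB,0.00,10.00\n")

edited = add_discount_row(src)
with open(edited, newline="") as f:
    edited_rows = list(csv.reader(f))
```

Looking the columns up by name with header.index avoids hard-coding positions in a file with 30-40 columns.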

Loop within loop when comparing csv files in Python

I have two csv files. I am trying to look up a value the first column in one file (file 1) in the first column in the other file (file 2). If they match then print the row from file 2.
Pseudo code:
read file1.csv
read file2.csv
loop through file1
    compare each row with each row of file 2 in turn
    if file1[0] == file2[0]:
        print row of file 2
file1:
45,John
46,Fred
47,Bill
File2:
46,Roger
48,Pete
49,Bob
I want it to print :
46 Roger
EDIT - these are examples, the actual file is much bigger (5,000 rows, 7 columns)
I have the following:
import csv
with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv1)
However I am getting no output.
I am aware there are other ways of doing it (with dict, pandas), but I am keen to know why my approach is not working.
EDIT: I now see that it is only iterating through the first row of file 1 and then closing, but I am unclear how to stop it closing (I also understand that this is not the best way to do it).

You open csv2reader = csv.reader(csvfile2) then iterate through it vs the first row of csv1reader - it has now reached end of file and will not produce any more data.
So for the second through last rows of csv1reader you are comparing against the items of an empty list, ie no comparison takes place.
In any case, this is a very inefficient method; unless you are working on very large files, it would be much better to do
import csv
# load second file as lookup table
data = {}
with open("csvfile2.csv") as inf2:
    for row in csv.reader(inf2):
        data[row[0]] = row
# now process first file against it
with open("csvfile1.csv") as inf1:
    for row in csv.reader(inf1):
        if row[0] in data:
            print(data[row[0]])
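The exhaustion described in this answer is easy to reproduce. The sketch below uses in-memory copies of the two sample files from the question (io.StringIO stands in for real files, which is an assumption made only to keep the example self-contained). It shows that a second pass over the same reader yields nothing, and that rewinding the underlying file with seek(0) before each inner loop makes the nested-loop approach work, even though it is still slow.

```python
import csv
import io

file1 = io.StringIO("45,John\n46,Fred\n47,Bill\n")
file2 = io.StringIO("46,Roger\n48,Pete\n49,Bob\n")

reader2 = csv.reader(file2)
first_pass = list(reader2)   # consumes every row of file 2
second_pass = list(reader2)  # nothing left: the reader is exhausted

# Rewinding the underlying file lets a fresh reader start over, which is
# what the nested-loop version needs before each inner loop.
matches = []
for row1 in csv.reader(file1):
    file2.seek(0)
    for row2 in csv.reader(file2):
        if row1[0] == row2[0]:
            matches.append(row2)
```

This still reads file 2 once per row of file 1, so the dictionary lookup remains the better choice for anything beyond small files.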

See Hugh Bothwell's answer for why your code isn't working. For a fast way of doing what you stated you want to do in your question, try this:
import csv
with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))
duplicates = {a[0] for a in csv1} & {a[0] for a in csv2}
for row in csv2:
    if row[0] in duplicates:
        print(row)
It gets the duplicate numbers from the two csv files, then loops through the second csv file, printing the row if the number at index 0 is also in the first csv file. This is a much faster algorithm than what you were attempting to do.

If order matters, as @hugh-bothwell mentioned in @will-da-silva's answer, you could do:
import csv
from collections import OrderedDict
with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))
d = {row[0]: row for row in csv2}
# unique keys of file 1, in their original order
keys = OrderedDict.fromkeys([a[0] for a in csv1]).keys()
duplicate_keys = [k for k in keys if k in d]
for k in duplicate_keys:
    print(d[k])

I'm pretty sure there's a better way to do this, but try out this solution, it should work.
counter = 0
import csv
with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[counter] == rowcsv2[counter]:
                print(rowcsv1)
            counter += 1 #increment it out of the IF statement.

Sorting a table in python

I am creating a league table for a 6 a side football league and I am attempting to sort it by the points column and then display it in easygui. The code I have so far is this:
data = csv.reader(open('table.csv'), delimiter = ',')
sortedlist = sorted(data, key=operator.itemgetter(7))
with open("Newtable.csv", "wb") as f:
    fileWriter = csv.writer(f, delimiter=',')
    for row in sortedlist:
        fileWriter.writerow(row)
        os.remove("table.csv")
        os.rename("Newtable.csv", "table.csv")
        os.close
The number 7 relates to the points column in my csv file. I have a problem with Newtable only containing the teams information that has the highest points and the table.csv is apparently being used by another process and so cannot be removed.
If anyone has any suggestions on how to fix this it would be appreciated.

If the indentation in your post is actually the indentation in your script (and not a copy-paste error), then the problem is obvious:
os.rename() is executed during the for loop (which means that it's called once per line in the CSV file!), at a point in time where Newtable.csv is still open (not by a different process but by your script itself), so the operation fails.
You don't need to close f, by the way - the with statement takes care of that for you. What you do need to close is data - that file is also still open when the call occurs.
Finally, since a csv object contains strings, and strings are sorted alphabetically, not numerically (so "10" comes before "2"), you need to sort according to the numerical value of the string, not the string itself.
You probably want to do something like
with open('table.csv', 'rb') as infile:
    data = csv.reader(infile, delimiter = ',')
    sortedlist = [next(data)] + sorted(data, key=lambda x: int(x[7])) # or float?
    # next(data) reads the header before sorting the rest
with open("Newtable.csv", "wb") as f:
    fileWriter = csv.writer(f, delimiter=',')
    fileWriter.writerows(sortedlist) # No for loop needed :)
os.remove("table.csv")
os.rename("Newtable.csv", "table.csv")
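The difference between sorting the strings and sorting their numeric values can be seen on a tiny made-up table. The team names and points below are illustrative, and the points sit in column 1 here rather than column 7 as in the question. Passing reverse=True to put the highest score first is an assumption about how a league table would normally be displayed.

```python
rows = [["team1", "5"], ["team2", "10"], ["team3", "2"]]

# Sorting the raw strings compares character by character, so "10" < "2".
by_string = sorted(rows, key=lambda r: r[1])

# Converting to int in the key function sorts by numeric value instead.
by_number = sorted(rows, key=lambda r: int(r[1]))

# A league table usually lists the highest score first.
highest_first = sorted(rows, key=lambda r: int(r[1]), reverse=True)
```

Here by_string starts with team2 (10 points) even though it has the most points, while by_number and highest_first order the teams by their actual scores.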

I'd suggest using pandas:
Assuming an input file like this:
team,points
team1, 5
team2, 6
team3, 2
You could do:
import pandas as pd
a = pd.read_csv('table.csv')
b=a.sort_values('points',ascending=False)
b.to_csv('table.csv',index=False)
