How to specify index start point in CSV in python? - python

I have a Python script that generates a CSV file for me. The script adds an index column to the CSV that correctly starts at 0 and runs to the last row. This works as expected. However, I need to find a way to make the index continue from the last index number in the previously generated CSV file. That is, if I have the following CSV:
csv1.csv
,Name,Customer
0,test1,customer1
1,test2,customer2
I would like the next CSV file generated to read:
csv2.csv
,Name,Customer
2,test1,customer1
3,test2,customer2
I suspect we will need to import the last generated file to read the last index on it, but how do I make the new CSV file generated start from that point?
Note: I do not wish to import data from the previously generated CSV. I simply want to have the new index start from where the last index ended.
Below is my script thus far, as a reference.
import csv
import pandas as pd

# date_time, month, year and unixtime are assumed to be defined earlier in the script
by_name = {}
with open('flavor.csv') as b:
    for row in csv.DictReader(b):
        name = row.pop('Name')
        by_name[name] = row
with open('output.csv', 'w') as c:
    w = csv.DictWriter(c, ['ID', 'Name', 'Flavor', 'RAM', 'Disk', 'Ephemeral', 'VCPUs', 'Customer', 'Misc', 'date_stamp', 'Month', 'Year','Unixtime'])
    w.writeheader()
    with open('instance.csv') as a:
        for row in csv.DictReader(a):
            try:
                match = by_name[row['Flavor']]
            except KeyError:
                continue
            row.update(match)
            w.writerow(row)
df = pd.read_csv('output.csv')
df[['Customer','Misc']] = df.Name.str.split('-', n=1,expand=True)
df[['date_stamp']] = date_time
df[['Month']] = month
df[['Year']] = year
df[['Unixtime']] = unixtime
df.loc[df.Misc.str.startswith('business', na=False),'Customer']+='-business'
df.Misc=df.Misc.str.strip('business-')
df[['Customer']] = df.Customer.str.title()
df.to_csv('final-output.csv')
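One possible approach, sketched with only the standard library: read just the first column of the previous file to find its last index, then number the new rows from there. The helper names next_index_start and write_with_index are made up for this illustration, and the sketch assumes the previous file has a header row and an integer index in its first column.

```python
import csv

def next_index_start(previous_csv):
    """Return the index the next file should start at (last index + 1).

    Only the first column of the previous file is inspected; none of its
    data is carried over. Returns 0 if the file has no data rows.
    """
    last = -1
    with open(previous_csv, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in reader:
            if row and row[0].strip():
                last = int(row[0])
    return last + 1

def write_with_index(path, header, rows, start):
    """Write rows to path, numbering them from `start` in an unnamed first column."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([''] + list(header))
        for i, row in enumerate(rows, start=start):
            writer.writerow([i] + list(row))
```

With pandas, as in the script above, the same start value could probably be applied just before saving, for example with df.index = range(start, start + len(df)) followed by df.to_csv('final-output.csv').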

Related

How to dynamically join two CSV files?

I have two CSV files and wanted to combine them with Python to practice my skills, but it turned out to be much more difficult than I imagined.
In short: I feel like my code should be correct, but the edited CSV file is not what I expected.
The first file, chrM_location.csv, is the one I want to edit.
The first file looks like this:
The other file, chrM_genes.csv, is the one I use as a reference.
The second file looks like this:
There are a few other columns, but I'm not using them at the moment. The first few rows have the subject "CDS", then there is a blank row, followed by a few rows with the subject "exon", then another blank row, followed by some rows with "genes" (and a few others).
What I want to do is read the first file row by row, take the number in the second column (42 for the first row after the header), and check whether it falls within the range given by columns 4 and 5 of the second file (also read row by row). If it does, I record the information from that matching row and paste it at the end of the row in the first file; if not, I skip it.
Below is my code. I set out to run everything through the CDS section first, so I wrote a function refcds(). It returns:
whether or not the value is in range;
if in range, a list of the information I want to paste back.
Everything works fine for the main part of the code: the list final[] contains all the information for that row, so supposedly I only need to paste it on that row and overwrite what was there before. I used print(final) to check the contents and they look just like what I want.
But this is what the result looks like:
I have no idea why a new row is inserted and why some rows are pasted together here, when column 2 is supposedly sorted from small to large.
Similar things happened in other places as well.
Thank you so much for your help! I'm running out of ideas. No error messages are given and I couldn't figure out what went wrong.
import csv
from csv import reader
from csv import writer
mylist=[]
a=0
final=[]
def refcds(value):
    mylist=[]
    with open("chrM_genes.csv", "r") as infile:
        r = csv.reader(infile)
        for rows in r:
            for i in range(0,12):
                if value >= rows[3] and value <= rows[4]:
                    mylist = ["CDS",rows[3],rows[4],int(int(value)-int(rows[3])+1)]
                    return 0, mylist
                else:
                    return 1,[]
with open('chrM_location.csv','r+') as myfile:
    csv_reader = csv.reader(myfile)
    csv_writer = csv.writer(myfile)
    for row in csv_reader:
        if (row[1]) != 'POS':
            final=[]
            a,mylist = refcds(row[1])
            if a==0:
                lista=[row[0],row[1],row[2],row[3],row[4],row[5]]
                final.extend(lista)
                final.extend(mylist),
                csv_writer.writerow(final)
            if a==1:
                pass
        if (row[1]) == 'END':
            break
myfile.close()
If I understand correctly - your code is trying to read and write to the same file at the same time.
csv_reader = csv.reader(myfile)
csv_writer = csv.writer(myfile)
I haven't tried your code: but I'm pretty sure this is going to cause weird stuff to happen... (If you refactor and output to a third file - do you still see the same issue?)
I think the problem is that you have your reader and writer set to the same file—I have no idea what that does. A much cleaner solution is to accumulate your modified rows in the read loop, then once you're out of the read loop (and have closed the file), open the same file for writing (not appending) and write your accumulated rows.
I've made the one big change that fixes the problem.
You also said you were trying to improve your Python, so I made some other changes that are more pythonic.
import csv
# Return a matched list, or return None
def refcds(value):
    with open('chrM_genes.csv', 'r', newline='') as infile:
        reader = csv.reader(infile)
        for row in reader:
            if value >= row[3] and value <= row[4]:
                computed = int(value)-int(row[3])+1 # probably negative??
                mylist = ['CDS', row[3], row[4], computed]
                return mylist
    return None # if we get to this return, we've evaluated every row and didn't already return (because of a match)
# Accumulate rows here
final_rows = []
with open('chrM_location.csv', 'r', newline='') as myfile:
    reader = csv.reader(myfile)
    # next(reader) ## if you know your file has a header
    for row in reader:
        # Show unusual conditions first...
        if row[1] == 'POS':
            continue # skip header??
        if row[1] == 'END':
            break
        # ...and if not met, do desired work
        mylist = refcds(row[1])
        if mylist is not None:
            # no need to declare an empty list and then extend it
            # just create it with initial items...
            final = row[0:6] # use slice notation to get a subset of a list (6 non-inclusive, so only to 5th col)
            final.extend(mylist)
            final_rows.append(final)
# Write accumulated rows here
with open('final.csv', 'w', newline='') as finalfile:
    writer = csv.writer(finalfile)
    writer.writerows(final_rows)
I also tried to figure out the whole thing, and came up with the following...
I think you want to look up rows of chrM_genes by Subject and compare a POS (from chrM_location) against the Start and End bounds for each gene. If POS is within the range of Start and End, return the chrM_genes data and fill in some empty cells already in chrM_location.
My first step would be to create a data structure from chrM_genes, since we'll be reading from that over and over again. Reading a bit into your problem, I can see the need to "filter" the results by subject ('CDS','exon', etc...), but I'm not sure of this. Still, I'm going to index this data structure by subject:
import csv
from collections import defaultdict
# This will create a dictionary, where subject will be the key
# and the value will be a list (of chrM (gene) rows)
chrM_rows_by_subject = defaultdict(list)
# Fill the data structure
with open('chrM_genes.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader) # read (skip) header
    subject_col = 2
    for row in reader:
        # you mentioned empty rows, that divide subjects, so skip empty rows
        if row == []:
            continue
        subject = row[subject_col]
        chrM_rows_by_subject[subject].append(row)
I mocked up chrM_genes.csv (and added a header, to try and clarify the structure):
Col1,Col2,Subject,Start,End
chrM,ENSEMBL,CDS,3307,4262
chrM,ENSEMBL,CDS,4470,5511
chrM,ENSEMBL,CDS,5904,7445
chrM,ENSEMBL,CDS,7586,8266
chrM,ENSEMBL,exon,100,200
chrM,ENSEMBL,exon,300,400
chrM,ENSEMBL,exon,700,750
Just printing the data structure to get an idea of what it's doing:
import pprint
pprint.pprint(chrM_rows_by_subject)
yields:
defaultdict(<class 'list'>,
{'CDS': [['chrM', 'ENSEMBL', 'CDS', '3307', '4262'],
['chrM', 'ENSEMBL', 'CDS', '4470', '5511'],
...
],
'exon': [['chrM', 'ENSEMBL', 'exon', '100', '200'],
['chrM', 'ENSEMBL', 'exon', '300', '400'],
...
],
})
Next, I want a function to match a row by subject and POS:
# Return a row that matches `subject` with `pos` between Start and End; or return None.
def match_gene_row(subject, pos):
    rows = chrM_rows_by_subject[subject]
    pos = int(pos)
    start_col = 3
    end_col = 4
    for row in rows:
        # csv fields are strings; convert them so the comparison is numeric, not lexicographic
        start = int(row[start_col])
        end = int(row[end_col])
        if pos >= start and pos <= end:
            # return just the data we want...
            return row
    # or return nothing at all
    return None
If I run these commands to test:
print(match_gene_row('CDS', '42'))
print(match_gene_row('CDS', '4200'))
print(match_gene_row('CDS', '7586'))
print(match_gene_row('exon', '500'))
print(match_gene_row('exon', '399'))
I get :
['chrM', 'ENSEMBL', 'CDS', '3307', '4262']
['chrM', 'ENSEMBL', 'CDS', '3307', '4262']
['chrM', 'ENSEMBL', 'CDS', '7586', '8266']
None # exon: 500
['chrM', 'ENSEMBL', 'exon', '300', '400']
Read chrM_location.csv, and build a list of rows with matching gene data.
final_rows = [] # accumulate all rows here, for writing later
with open('chrM_location.csv', newline='') as f:
    reader = csv.reader(f)
    # Modify header
    header = next(reader)
    header.extend(['CDS','Start','End','cc'])
    final_rows.append(header)
    # Read rows and match to genes
    pos_column = 1
    for row in reader:
        pos = row[pos_column]
        matched_row = match_gene_row('CDS', pos) # hard-coded to CDS
        if matched_row is not None:
            subj, start, end = matched_row[2:5]
            computed = str(int(pos)-int(start)+1) # this is coming out negative??
            row.extend([subj, start, end, computed])
            final_rows.append(row)
Finally, write.
with open('final.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(final_rows)
I mocked up chrM_location.csv:
name,POS,id,Ref,ALT,Frequency
chrM,41,.,C,T,0.002498
chrM,42,rs377245343,T,TC,0.001562
chrM,55,.,TA,T,0.00406
chrM,55,.,T,C,0.001874
When I run the whole thing, I get a final.csv that looks like this:
name,POS,id,Ref,ALT,Frequency,CDS,Start,End,sequence_cc
chrM,41,.,C,T,0.002498,CDS,3307,4262,-3265
chrM,42,rs377245343,T,TC,0.001562,CDS,3307,4262,-3264
chrM,55,.,TA,T,0.00406,CDS,4470,5511,-4414
chrM,55,.,T,C,0.001874,CDS,4470,5511,-4414
I put this all together in a Gist.
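The negative values in the last column look like a side effect of comparing POS with Start and End as strings: csv.reader returns every field as str, and Python compares strings character by character, so '42' sorts between '3307' and '4262'. A minimal sketch of the difference (the two helper names are made up for this illustration):

```python
def in_range_as_text(pos, start, end):
    """Compare the raw CSV strings (lexicographic, character by character)."""
    return start <= pos <= end

def in_range_as_int(pos, start, end):
    """Compare numerically after converting each field to int."""
    return int(start) <= int(pos) <= int(end)

pos, start, end = '42', '3307', '4262'
print(in_range_as_text(pos, start, end))  # True: '3' < '4', and '42' is a prefix of '4262'
print(in_range_as_int(pos, start, end))   # False: 42 is below 3307
```

If that is what happened, converting the fields with int() before comparing should make positions such as 41, 42 and 55 fall outside every CDS range in the mocked-up data.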

How do I sort through a CSV file in python so that it only returns certain values?

I am trying to sort through a CSV file in python so that only a certain value from each entry is printed. Each line of my csv files has the date, location, weather, temperature, etc. I am trying to return the temperature column, but instead it is printing the entire csv file. This is what I currently have:
with open('2000-2009.csv', newline = "") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter = ',')
    temp = 0
    tempList = []
    index = 0
    for Tavg in csv_reader:
        temp = int(Tavg)
        tempList.append(temp)
    print(tempList)
That's because csv.reader gives you the entire row on each iteration, not a single value. You want to extract just one column: work out the position of the temperature column (counting columns from zero) and read that element of each row.
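As a rough sketch of pulling out a single column, csv.DictReader can look the column up by its header name instead of counting positions. This assumes the file has a header row and that the temperature column is literally named Tavg; the helper name read_column and the sample data are invented for the example.

```python
import csv
import io

def read_column(csv_file, column):
    """Return the values of one named column as a list of floats."""
    reader = csv.DictReader(csv_file)
    # Skip rows where the column is empty rather than failing on float('')
    return [float(row[column]) for row in reader if row[column] != '']

# Made-up sample standing in for '2000-2009.csv'; the real header names may differ.
sample = io.StringIO(
    "Date,Location,Weather,Tavg\n"
    "2000-01-01,Springfield,Sunny,41\n"
    "2000-01-02,Springfield,Rain,38\n"
)
print(read_column(sample, 'Tavg'))  # [41.0, 38.0]
```

For the real file, the same function could be called on the object returned by open('2000-2009.csv', newline='').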

How to find max and min values within lists without using maps/SQL?

I'm learning Python and have a data set (a CSV file). I've been able to split the lines by comma, but now I need to find the max and min values in the third column and output the corresponding value from the first column of the same row.
This is the .csv file: https://www.dropbox.com/s/fj8tanwy1lr24yk/loan.csv?dl=0
I also can't use Pandas or any external libraries; I think it would have been easier if I used them
I have written this code so far:
f = open("loanData.csv", "r")
mylist = []
for line in f:
    mylist.append(line)
newdata = []
for row in mylist:
    data = row.split(",")
    newdata.append(data)
I'd use the built-in csv library for parsing your CSV file, and then just generate a list with the 3rd column values in it:
import csv
with open("loanData.csv", "r", newline="") as loanCsv:
    loanCsvReader = csv.reader(loanCsv)
    # Comment out if no headers
    next(loanCsvReader, None)
    # Convert to float so the values compare numerically, not as text
    loan_data = [float(row[2]) for row in loanCsvReader]
max_val = max(loan_data)
min_val = min(loan_data)
print("Max: {}".format(max_val))
print("Min: {}".format(min_val))
I don't know whether your file has a header row. If it doesn't, comment out
next(loanCsvReader, None)
so that the first data row isn't skipped.
Something like this might work. Indexes start at zero, so the third column is index 2. The values are converted to float so they compare numerically, and the results get their own names so the built-in min and max aren't overwritten. If mylist still contains a header line, skip it first (for example with mylist[1:]).
min_val = min(float(row.split(',')[2]) for row in mylist)
max_val = max(float(row.split(',')[2]) for row in mylist)
Separately, you could probably read and reformat your data to a list with the following:
with open('loanData.csv', 'r') as f:
    data = f.read()
mylist = data.splitlines()
splitlines() copes with both \n (Unix) and \r\n (Windows) line endings, and it doesn't leave an empty string at the end when the file finishes with a newline.
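Since the question also asks for the first-column value that belongs to the max and min, one option is to let min() and max() pick whole rows using a key function. This is only a sketch: the function name extremes, the column layout and the sample data are assumptions, not taken from the real loan file.

```python
import csv
import io

def extremes(csv_file, value_col=2, label_col=0, has_header=True):
    """Return ((label, value) of the smallest row, (label, value) of the largest row)."""
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the header row
    rows = [row for row in reader if row]  # ignore blank lines

    def numeric_value(row):
        return float(row[value_col])

    lowest = min(rows, key=numeric_value)
    highest = max(rows, key=numeric_value)
    return (lowest[label_col], lowest[value_col]), (highest[label_col], highest[value_col])

# Made-up data; the real file's columns may differ.
sample = io.StringIO(
    "id,term,amount\n"
    "A1,36,5000\n"
    "B2,60,12000\n"
    "C3,36,900\n"
)
low, high = extremes(sample)
print(low)   # ('C3', '900')
print(high)  # ('B2', '12000')
```

Only the built-in csv module is used, so this stays within the "no external libraries" constraint.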

how to edit a csv in python and add one row after the 2nd row that will have the same values in all columns except 1

I'm new to Python and I'm facing a small challenge that I haven't been able to figure out so far.
I receive a CSV file with around 30-40 columns and 5-50 rows, with various details in each cell. The 1st row of the CSV has the title of each column, and the item values start from the 2nd row.
What I want is a Python script that reads the CSV file and does the following every time:
Add a row after the first item row (literally after the 2nd row, because the 1st row holds the titles). The new 3rd row should contain the same information as the row above it, with one difference only: in the column "item_subtotal" I want to add the value from the column "discount total".
All the rows below should remain as they are, and the modified CSV should be saved as a new file with the word "edited" added to the file name.
I could really use some help, because so far I've only managed to open the CSV file with the Python script I'm developing. I haven't been able to copy the contents of the row above into the newly created row and replace that specific value.
Looking forward to any help.
Thank you
Here I'm attaching the CSV, with some values changed for privacy reasons.
order_id,order_number,date,status,shipping_total,shipping_tax_total,fee_total,fee_tax_total,tax_total,discount_total,order_total,refunded_total,order_currency,payment_method,shipping_method,customer_id,billing_first_name,billing_last_name,billing_company,billing_email,billing_phone,billing_address_1,billing_address_2,billing_postcode,billing_city,billing_state,billing_country,shipping_first_name,shipping_last_name,shipping_address_1,shipping_address_2,shipping_postcode,shipping_city,shipping_state,shipping_country,shipping_company,customer_note,item_id,item_product_id,item_name,item_sku,item_quantity,item_subtotal,item_subtotal_tax,item_total,item_total_tax,item_refunded,item_refunded_qty,item_meta,shipping_items,fee_items,tax_items,coupon_items,order_notes,download_permissions_granted,admin_custom_order_field:customer_type_5
15001_TEST_2,,"2017-10-09 18:53:12",processing,0,0.00,0.00,0.00,5.36,7.06,33.60,0.00,EUR,PayoneCw_PayPal,"0,00",0,name,surname,,name.surname#gmail.com,0123456789,"address 1",,41541_TEST,location,,DE,name,surname,address,01245212,14521,location,,DE,,,1328,302,"product title",103,1,35.29,6.71,28.24,5.36,0.00,0,,"id:1329|method_id:free_shipping:3|method_title:0,00|total:0.00",,id:1330|rate_id:1|code:DE-MWST-1|title:MwSt|total:5.36|compound:,"id:1331|code:#getgreengent|amount:7.06|description:Launchcoupon for friends","text string",1,
You can also use pandas to manipulate the data from the csv like this:
import os
import pandas
Read the csv file into a pandas dataframe:
df = pandas.read_csv(filename)
Copy the first row of data (position 0, since the header is not counted as a row) and add the discount total to the item subtotal:
new_row = df.iloc[[0]].copy()
new_row['item_subtotal'] += new_row['discount_total']
Concatenate the first row, the new row, and then everything after that:
df = pandas.concat([df.iloc[:1], new_row, df.iloc[1:]], ignore_index=True)
Change the filename and write out the new csv file:
root, ext = os.path.splitext(filename)
filename = root + 'edited' + ext
df.to_csv(filename, index=False)
I hope this helps! Pandas is great for cleanly handling massive amounts of data, but may be overkill for what you are trying to do. Then again, maybe not. It would help to see an example data file.
The first step is to turn that .csv into something that is a little easier to work with. Fortunately, python has the 'csv' module which makes it easy to turn your .csv file into a much nicer list of lists. The below will give you a way to both turn your .csv into a list of lists and turn the modified data back into a .csv file.
import csv
import copy
def csv2list(ifile):
    """
    ifile = the path of the csv to be converted into a list of lists
    """
    olist=[]
    # newline='' lets the csv module handle line endings itself (Python 3)
    with open(ifile, newline='') as f:
        c = csv.reader(f, dialect='excel')
        for line in c:
            olist.append(line) #and update the outer array
    return olist
#------------------------------------------------------------------------------
def list2csv(ilist,ofile):
    """
    ilist = the list of lists to be converted
    ofile = the output path for your csv file
    """
    with open(ofile, 'w', newline='') as csvfile:
        # keep the default '"' quote character so csv2list can read the file back
        csvwriter = csv.writer(csvfile, delimiter=',',
                               quoting=csv.QUOTE_MINIMAL)
        csvwriter.writerows(ilist)
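Now you can simply copy ilist[1] (the first item row) and change the appropriate element to reflect your summed value. The csv module returns every field as a string, so convert to float before adding; n and n-x stand for the positions of the two columns being summed:
listTemp = copy.deepcopy(ilist[1])
listTemp[n] = str(float(listTemp[n]) + float(listTemp[n-x]))
ilist.insert(2,listTemp)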
As for how to change the file name, just use:
import os
newFileName = os.path.splitext(oldFileName)[0] + "edited" + os.path.splitext(oldFileName)[1]
Hopefully this will help you out!
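Putting those steps together, a minimal sketch could look the columns up by name in the header instead of hard-coding positions. The function names add_discount_row and edited_name are hypothetical, the '_edited' suffix is one reading of "the word edited added to the file name", and rounding to two decimals is an assumption based on the currency-like values in the sample.

```python
import os

def add_discount_row(rows):
    """Return a new list of rows with a copy of the first item row inserted after it.

    In the copy, item_subtotal is replaced by item_subtotal + discount_total.
    `rows` is a list of lists as produced by csv.reader, header first.
    """
    header, first, rest = rows[0], rows[1], rows[2:]
    sub = header.index('item_subtotal')
    disc = header.index('discount_total')
    new_row = list(first)
    # Fields arrive as strings, so convert before adding
    new_row[sub] = '{:.2f}'.format(float(first[sub]) + float(first[disc]))
    return [header, first, new_row] + rest

def edited_name(path):
    """Insert '_edited' before the file extension."""
    root, ext = os.path.splitext(path)
    return root + '_edited' + ext
```

The result of add_discount_row could then be written out with csv.writer(...).writerows(...) to the path returned by edited_name.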

Python 2.7 Export csv file with data

I am trying to export my data into a csv file
My code is as follows
x1=1
x2=2
y1=1
y2=4
z1=1
z2=2
a = (x1,y1,z1)
b = (x2,y2,z2)
def distance(x1,y1,z1,x2,y2,z2):
    return (((x2-x1)**2)+((y2-y1)**2)+((z2-z1)**2))**0.5
w=distance(x1,y1,z1,x2,y2,z2)
print w
if w>1 and w<2:
    time = 1
if w>2 and w<3:
    time = 2
if w>3 and w<4:
    time = 3
import csv
with open('concert_time.csv', 'w') as output:
I am stuck at this last line. I would like a file with two columns for 'Letter Combinations' and 'Time', and the output to look like:
Letters Time
a and b 3
I know there is a way to get this output by explicitly labeling a row as '3' for the appropriate time, but I would like a csv file that will change if I change the values of a or b, and thus the value of 'time' would change. I have tried
writer.writerows(enumerate(range(time), 1))
but this does not get me the desired output (in fact it is probably very far off, but I am new to Python so my methods have been guess and check)
Any help is greatly appreciated!
Completing your code for the csv writer. On Python 2, also make sure to open the file in 'wb' mode on Windows; otherwise the line endings written by the csv module get translated a second time and you end up with a blank line after every row.
with open('concert_time.csv', 'wb') as output:
    writer = csv.writer(output)
    writer.writerow(["Letter Combinations", "Time"]) #header
    writer.writerow(["a and b", time]) #data
Content of concert_time.csv:
Letter Combinations,Time
a and b,3
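For readers on Python 3, a rough equivalent is sketched below. There the file is opened in text mode with newline='' rather than 'wb'. The helper names distance and write_time are illustrative, and int(w) is an assumption standing in for the chain of if statements in the question (it gives the same result for non-integer distances between 1 and 4).

```python
import csv

def distance(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((b - a) ** 2 for a, b in zip(p, q)) ** 0.5

def write_time(path, label, time):
    """Write a header row and a single data row to a CSV file."""
    with open(path, 'w', newline='') as output:
        writer = csv.writer(output)
        writer.writerow(['Letter Combinations', 'Time'])  # header
        writer.writerow([label, time])                    # data

a = (1, 1, 1)
b = (2, 4, 2)
w = distance(a, b)  # square root of 11, roughly 3.32
time = int(w)       # assumption: replaces the if-chain from the question
```

Calling write_time('concert_time.csv', 'a and b', time) should then produce the same two lines as above, and the value written changes whenever a or b changes.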
