How to iterate through two CSV files in Python - python

I have two data tables about journals (title, ISSN, ...) and I want to know whether each journal of table 1 is present in table 2. For the comparison I only use one identifier, the ISSN.
My problem is that I can't manage to iterate through all of tab1: the program stops once it reaches the end of tab2.
import csv
tab1 = open("tab1.csv", 'r', encoding='utf8')
readtab1 = csv.DictReader(tab1)
tab2 = open("tab2.csv", 'r', encoding='utf8')
readtab2 = csv.DictReader(tab2)
linenb1 = 0
for row1 in readtab1:
    issn1 = row1['ISSN'].strip()
    linenb1 += 1
    linenb2 = 0
    for row2 in readtab2:
        issn2 = row2['ISSN'].strip()
        if len(issn2) < 2: continue
        linenb2 += 1
        print(linenb1, linenb2, issn1, issn2)
Console output:
1 1 2552-8831 0001-253X
1 2 2552-8831 0002-2667
[Finished in 0.2s]
tab1
Nom du titre,ISSN,Format,nom editeur
Revue Droit & Litterature,2552-8831,Papier,LGDJ MONTCHRESTIEN
Memoires en Jeu,2497-2711,Papier,EDITIONS KIME
Le Monde,2262-4694,Online,LE MONDE
Journal des Energies Renouvelables,2491-8687,Papier + e-mail,OBSERVER
tab2
ISSN,TITLE,TARGET_PUBLIC_NAME,TARGET_SERVICE,THRESHOLD_ACTIVE,THRESHOLD_GLOBAL,PUBLISHER,LOCAL_THRESHOLD
0001-253X,Aslib proceedings,French National Licences Emerald,getFullTxt,"$obj->parsedDate('>=',1949,1,1) && $obj->parsedDate('<=',2010,65,6)","$obj->parsedDate('>=',1949,1,1) && $obj->parsedDate('<=',2010,65,6)",Emerald Group Publishing Ltd.,
0002-2667,Aircraft Engineering,French National Licences Emerald,getFullTxt,"$obj->parsedDate('>=',1929,1,1) && $obj->parsedDate('<=',1986,58,3)","$obj->parsedDate('>=',1929,1,1) && $obj->parsedDate('<=',1986,58,3)",Emerald Group Pub.,
I don't understand why, because a nested for loop over ranges works:
for i in range(0,5):
    for j in range(1,10):
        print(i,j)

I would try setting the fieldnames and delimiter values of the DictReaders, because they default to the first row's values and a comma respectively, whereas the data does not have a header row and is separated by spaces.
fieldnames = "Nom du titre,ISSN,Format,Nom de l'éditeur,Nom de l'abonné".split(',')
abt = open("tab1.csv", 'r', encoding='utf8')
readTab1 = csv.DictReader(abt, fieldnames=fieldnames, delimiter=' ')
sfx = open("tab2.csv", 'r', encoding='utf8')
readTab2 = csv.DictReader(sfx, fieldnames=fieldnames, delimiter=' ')
linenb = 0
for row1 in readTab1:
    issn1 = row1['ISSN'].strip()
    if len(issn1) < 2: continue
    linenb += 1
    linenb2 = 0
    for row2 in readTab2:
        issn2 = row2['ISSN'].strip()
        if len(issn2) < 2: continue
        linenb2 += 1
        print(linenb, issn1, issn2)

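After its first pass, the inner loop's reader stayed at the end of tab2, so later passes had nothing left to read. The solution is to call tab2.seek(0) after the inner loop.
I found the solution here:
https://stackoverflow.com/a/26526224/3334635
What seems strange to me is that nested loops work with numbers, but not with a csv reader.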
import csv
tab1 = open("tab1.csv", 'r', encoding='utf8', newline='')
readtab1 = csv.DictReader(tab1)
tab2 = open("tab2.csv", 'r', encoding='utf8', newline='')
readtab2 = csv.DictReader(tab2)
linenb1 = 0
for row1 in readtab1:
    issn1 = row1['ISSN'].strip()
    linenb1 += 1
    linenb2 = 0
    for row2 in readtab2:
        issn2 = row2['ISSN'].strip()
        if len(issn2) < 2: continue
        linenb2 += 1
        print(linenb1, linenb2, issn1, issn2)
    tab2.seek(0)    # rewind tab2 for the next row of tab1
    next(readtab2)  # skip header
1 1 2552-8831 0001-253X
1 2 2552-8831 0002-2667
2 1 2497-2711 0001-253X
2 2 2497-2711 0002-2667
3 1 2262-4694 0001-253X
3 2 2262-4694 0002-2667
4 1 2491-8687 0001-253X
4 2 2491-8687 0002-2667
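The difference from the range example is that range(1, 10) can be iterated again on every pass of the outer loop, whereas a csv.DictReader reads from the file object and is exhausted after one pass. If the goal is only to know which journals of tab1 appear in tab2, one possible alternative (a sketch, not the code from the question) is to read the ISSNs of tab2 into a set once, so no rewinding is needed. The io.StringIO objects stand in for the real files, tab2 is cut down to two columns, and its third row is hypothetical, added only so that one match occurs.

```python
import csv
import io

# In-memory stand-ins for tab1.csv and tab2.csv (assumption: the real files
# have the same 'ISSN' header). The last tab2 row is hypothetical.
tab1_text = """Nom du titre,ISSN,Format,nom editeur
Revue Droit & Litterature,2552-8831,Papier,LGDJ MONTCHRESTIEN
Memoires en Jeu,2497-2711,Papier,EDITIONS KIME
Le Monde,2262-4694,Online,LE MONDE
"""
tab2_text = """ISSN,TITLE
0001-253X,Aslib proceedings
0002-2667,Aircraft Engineering
2262-4694,Le Monde
"""

def read_issns(handle):
    """Return the set of ISSN values (at least 2 characters long) in a CSV file object."""
    issns = set()
    for row in csv.DictReader(handle):
        issn = row['ISSN'].strip()
        if len(issn) >= 2:
            issns.add(issn)
    return issns

# tab2 is read exactly once; membership tests on a set need no second pass.
issns_tab2 = read_issns(io.StringIO(tab2_text))

found, missing = [], []
for row in csv.DictReader(io.StringIO(tab1_text)):
    issn = row['ISSN'].strip()
    if issn in issns_tab2:
        found.append(issn)
    else:
        missing.append(issn)

print("present in tab2:", found)
print("absent from tab2:", missing)
```

With real files, the two io.StringIO(...) calls would be replaced by open("tab2.csv", encoding='utf8', newline='') and open("tab1.csv", encoding='utf8', newline='').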

Related

Python - issue with removing fields from the output

I have an issue with my code. I need the script to remove only the rows that meet all three conditions:
CreatedBy is koala,
Book is PI, SI, II, OT or FG,
and Category is Cert, CertPlus, Cap or Downside.
Currently my code removes all koala rows and all rows with those books, and only takes the last condition into account, so my current output keeps rows only if the category is different. I would like a row to be removed ONLY if all three conditions are met, not whenever CreatedBy is koala or Book is PI, SI, II, OT or FG, and everything else in range should be shown.
If a row was created by koala and its category is Cert, I want to see it, but currently it is removed.
If none of the conditions are met, I also want to see those rows (e.g. CreatedBy is Extra, Book is NG and Category is Multiple). Currently those are removed from the output as well.
Example dataset:
In the link below - I wish to remove only those marked red:
import os
import re
import sys
import numpy as np
import pandas as pd
current_path = os.path.dirname(os.path.realpath(sys.argv[0]))
a_path, q_path = 0, 0
def assign_path(current_path, a_path = 0, q_path = 0):
    # look for the activity and query files in the given folder
    files = os.listdir(current_path)
    for i in files:
        if re.search('(?i)activity',i):
            a_path = '\\'.join([current_path,i])
        elif re.search('(?i)query',i):
            q_path = '\\'.join([current_path,i])
    return a_path, q_path
a_path, q_path = assign_path(current_path)
if a_path == 0 or q_path == 0:
    # not found next to the script: look inside an "input" subfolder
    files = os.listdir(current_path)
    directories = []
    for i in files:
        if os.path.isdir(i): directories.append(i)
    for i in directories:
        if re.search('(?i)input',i):
            a_path, q_path = assign_path('\\'.join([current_path,i]), a_path, q_path)
L = list(range(len(qr)))
L1 = list(range(len(qr2)))
L2 = list(range(len(ac)))
# -------------------------------------------------------
qr = pd.read_excel(q_path)
qr2 = pd.read_excel(q_path)
qr_rec = qr2.iloc[[0,1]]
d = qr2.iloc[0].to_dict()
for i in list(d.keys()): d[i] = np.nan
for i in range(len(qr2)):
    if qr2.iloc[i]['LinkageLinkType'] != 'B2B_COUNTER_TRADE'\
    and qr2.iloc[i]['CreatedBy'] == 'koala_'\
    and qr2.iloc[i]['Book'] in {'PI','SI','II','OT','FG'}\
    and qr2.iloc[i]['Category'] not in {'Cert','CertPlus','Cap','Downside'}:
        while i in L: L.remove(i)
        if qr2.iloc[i]['PrimaryRiskId'] not in list(aID):
            qr_rec = qr_rec.append(qr2.iloc[i],ignore_index=True)
I have added the beginning of the code, which allows me to use the Excel file. I have two files; one of them is a_path (please disregard that one). The issue I have is with q_path.
Check this out:
df = pd.read_csv('stackoverflow.csv')
df
   category book createdby
0  Multiple   NG     panda
1      Cert   DG     koala
2       Cap   PI    monkey
3  CertPlus   ZZ     panda
4       Cap   ll      joey
5      Cert   OT     koala
6       Cap   FG     koala
7      Cert   PI     koala
8     Block   SI     koala
9       Cap   II     koala
df.query("~(category in ['Cert', 'Cap'] and book in ['OT', 'FG', 'PI', 'II'] and createdby=='koala')")
   category book createdby
0  Multiple   NG     panda
1      Cert   DG     koala
2       Cap   PI    monkey
3  CertPlus   ZZ     panda
4       Cap   ll      joey
8     Block   SI     koala
pd.DataFrame.query can be used to filter rows; the ~ at the beginning is the "not" operator, so only the rows that match all three conditions are dropped.
BR
E
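For the full lists given in the question, the same idea can be written as an explicit predicate: a row is dropped only when all three conditions hold at once, and kept in every other case. Below is a minimal sketch in plain Python, using a few rows of the sample table above as dictionaries. The helper name should_remove is made up for illustration and the column names follow the sample table, not the Excel file.

```python
# Value lists taken from the question.
BOOKS = {'PI', 'SI', 'II', 'OT', 'FG'}
CATEGORIES = {'Cert', 'CertPlus', 'Cap', 'Downside'}

def should_remove(row):
    """Return True only when all three conditions hold together (hypothetical helper)."""
    return (row['createdby'] == 'koala'
            and row['book'] in BOOKS
            and row['category'] in CATEGORIES)

rows = [
    {'category': 'Multiple', 'book': 'NG', 'createdby': 'panda'},
    {'category': 'Cert', 'book': 'DG', 'createdby': 'koala'},
    {'category': 'Cap', 'book': 'PI', 'createdby': 'monkey'},
    {'category': 'Cert', 'book': 'OT', 'createdby': 'koala'},
    {'category': 'Block', 'book': 'SI', 'createdby': 'koala'},
]

# Keep every row that does not satisfy all three conditions.
kept = [row for row in rows if not should_remove(row)]
for row in kept:
    print(row)
```

Only the Cert / OT / koala row is dropped here. Rows that match one or two of the conditions, such as Cert / DG / koala, are kept.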

Python Dictionary to CSV Issue

I put together a Python script to clean CSV files. The reformatting works, but the data rows that the writer writes to the new CSV file are wrong. I build a list of dictionaries, one per output row, before writing them with writer.writerows(). When I print each dictionary just before appending it, the data is correct. However, after appending, the list contains incorrect values.
import csv
data = []
with open(r'C:\\Data\\input.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    street_fields = [] # Store new field names in list
    street_fields.append("startdate")
    street_fields.append("starttime")
    street_fields.append("sitecode")
    street_fields.append("recordtime")
    street_fields.append("direction")
    street_fields.append("turnright")
    street_fields.append("wentthrough")
    street_fields.append("turnleft")
    street_fields.append("pedestrians")
    for row in csv_reader: # Read input rows
        if line_count == 0:
            startdate = row[1] # Get Start Date from B1
            line_count += 1
        elif line_count == 1:
            starttime = row[1] # Get Start Time from B2
            line_count += 1
        elif line_count == 2:
            sitecode = str(row[1]) # Get Site code from B3
            line_count += 1
        elif line_count == 3:
            street_count = len(row) - 3 # Determine number of streets in report
            streetnames = []
            i = 1
            while i < street_count:
                streetnames.append(row[i]) # Add streets to list
                i += 4
            line_count += 1
        elif line_count > 4:
            street_values = {} # Create dictionary to store new row values
            n = 1
            for street in streetnames:
                turnright = 0 + n
                wentthrough = 1 + n
                turnleft = 2 + n
                pedestrians = 3 + n
                street_values["startdate"] = startdate
                street_values["starttime"] = starttime
                street_values["sitecode"] = sitecode
                street_values["recordtime"] = row[0]
                street_values["direction"] = street
                street_values["turnright"] = int(row[turnright])
                street_values["wentthrough"] = int(row[wentthrough])
                street_values["turnleft"] = int(row[turnleft])
                street_values["pedestrians"] = int(row[pedestrians])
                data.append(street_values) # Append row dictionary to list
                #print(street_values) ### UNCOMMENT TO SEE CORRECT ROW DATA ###
                #print(data) ### UNCOMMENT TO SEE INCORRECT ROW DATA ###
                n += 4
            line_count += 1
        else:
            line_count += 1
with open(r'C:\\Data\\output.csv', 'w', newline='', encoding="utf-8") as w_scv_file:
    writer = csv.DictWriter(w_scv_file,fieldnames=street_fields)
    writer.writerow(dict((fn,fn) for fn in street_fields)) # Write headers to new CSV
    writer.writerows(data) # Write data from list of dictionaries
An example of the list of dictionaries created (JSON):
[
{
"startdate":"11/9/2017",
"starttime":"7:00",
"sitecode":"012345",
"recordtime":"7:00",
"direction":"Cloud Dr. From North",
"turnright":0,
"wentthrough":2,
"turnleft":11,
"pedestrians":0
},
{
"startdate":"11/9/2017",
"starttime":"7:00",
"sitecode":"012345",
"recordtime":"7:00",
"direction":"Florida Blvd. From East",
"turnright":4,
"wentthrough":433,
"turnleft":15,
"pedestrians":0
},
{
"startdate":"11/9/2017",
"starttime":"7:00",
"sitecode":"012345",
"recordtime":"7:00",
"direction":"Cloud Dr. From South",
"turnright":15,
"wentthrough":4,
"turnleft":6,
"pedestrians":0
},
{
"startdate":"11/9/2017",
"starttime":"7:00",
"sitecode":"012345",
"recordtime":"7:00",
"direction":"Florida Blvd. From West",
"turnright":2,
"wentthrough":219,
"turnleft":2,
"pedestrians":0
},
{
"startdate":"11/9/2017",
"starttime":"7:00",
"sitecode":"012345",
"recordtime":"7:15",
"direction":"Cloud Dr. From North",
"turnright":1,
"wentthrough":3,
"turnleft":8,
"pedestrians":0
}
]
What actually writes to the CSV:
Note the Direction field and data rows are incorrect. For some reason when it loops through the streetnames list, the last street name and the corresponding row values persist for the individual record time.
Do I need to delete my variables before re-assigning them values?
It looks like you are appending the same dictionary to the list over and over.
In general, when appending a number of separate dictionaries to a list, I would use mylist.append(mydict.copy()). Otherwise, when you later assign new values to the same dictionary, you are really just updating the one existing object, and every entry in your list that refers to it changes too (see mutable vs immutable objects in Python).
In short: if you want the dictionary in the list to be a separate object from the one you keep modifying, append a copy made with dict.copy(). This is a shallow copy, which is enough here because the values are plain strings and integers.
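The effect can be reproduced in a few lines. The sketch below is not the script from the question: it appends the same dictionary twice to one list and a copy of it to another, using two of the street names from the sample output.

```python
rows_shared = []   # will hold the same dict object twice
rows_copied = []   # will hold independent snapshots

record = {}        # created once, outside the loop, as in the question
for street in ["Cloud Dr. From North", "Florida Blvd. From East"]:
    record["direction"] = street
    rows_shared.append(record)          # appends a reference to the one dict
    rows_copied.append(record.copy())   # appends a separate shallow copy

print([r["direction"] for r in rows_shared])
print([r["direction"] for r in rows_copied])
print(rows_shared[0] is rows_shared[1])
```

In rows_shared both entries show the last street, because they are the same object. An equivalent fix is to create the dictionary inside the inner loop, so that each iteration starts from a new object.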

Biopython translate() error

I have a file that looks as so:
Type Variant_class ACC_NUM dbsnp genomic_coordinates_hg18 genomic_coordinates_hg19 HGVS_cdna HGVS_protein gene disease sequence_context_hg18 sequence_context_hg19 codon_change codon_number intron_number site location location_reference_point author journal vol page year pmid entrezid sift_score sift_prediction mutpred_score
1 DM CM920001 rs1800433 null chr12:9232351:- NM_000014.4 NP_000005.2:p.C972Y A2M Chronicobstructivepulmonarydisease null CACAAAATCTTCTCCAGATGCCCTATGGCT[G/A]TGGAGAGCAGAATATGGTCCTCTTTGCTCC TGT TAT 972 null null 2 null Poller HUMGENET 88 313 1992 1370808 2 0 DAMAGING 0.594315245478036
1 DM CM004784 rs74315453 null chr22:43089410:- NM_017436.4 NP_059132.1:p.M183K A4GALT Pksynthasedeficiency(pphenotype) null TGCTCTCCGACGCCTCCAGGATCGCACTCA[T/A]GTGGAAGTTCGGCGGCATCTACCTGGACAC ATG AAG 183 null null 2 null Steffensen JBC 275 16723 2000 10747952 53947 0 DAMAGING 0.787878787878788
I want to translate the information from column 13 and 14 to their corresponding amino acids. Here is the script that I've generated:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
InFile = open("disease_mut_splitfinal.txt", 'rU')
InFile.readline()
OriginalSeq_list = []
MutSeq_list = []
import csv
with open("disease_mut_splitfinal.txt") as f:
    reader = csv.DictReader(f, delimiter= "\t")
    for row in reader:
        OriginalSeq = row['codon_change']
        MutSeq = row['codon_number']
        region = row["genomic_coordinates_hg19"]
        gene = row["gene"]
        OriginalSeq_list.append(OriginalSeq)
        MutSeq_list.append(MutSeq)
OutputFileName = "Translated.txt"
OutputFile = open(OutputFileName, 'w')
OutputFile.write(''+region+'\t'+gene+'\n')
for i in range(0, len(OriginalSeq_list)):
    OrigSeq = OriginalSeq_list[i]
    MutSEQ = MutSeq_list[i]
    print OrigSeq
    translated_original = OrigSeq.translate()
    translated_mut= MutSEQ.translate()
    OutputFile.write("\n" + OriginalSeq_list[i]+ "\t" + str(translated_original) + "\t" +MutSeq_list[i] + "\t" + str(translated_mut)+ "\n")
However, I keep getting this error:
TypeError: translate expected at least 1 arguments, got 0
I'm kind of at a loss for what I'm doing wrong. Any suggestions?
https://www.dropbox.com/s/cd8chtacj3glb8d/disease_mut_splitfinal.txt?dl=0
(File should still be downloadable even if you don't have a dropbox)
You are calling the string method translate instead of the translate method of the Biopython Seq object, which is what I assume you want. The values read from the CSV file are plain strings, so you need to convert each one into a Seq object and then translate that. Try:
from Bio import Seq
OrigSeq = Seq.Seq(OriginalSeq_list[i])
translated_original = OrigSeq.translate()
Alternatively
from Bio.Seq import Seq
OrigSeq = Seq(OriginalSeq_list[i])
translated_original = OrigSeq.translate()
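To see the difference without installing Biopython: a plain str has its own translate method (used for character mapping), and calling it with no arguments raises a TypeError, which is the error in the question. The lookup table below is only a stand-in for Seq.translate, limited to the four codons that appear in the two sample rows. It is not Biopython's implementation, and translate_codon is a made-up helper name.

```python
# Stand-in codon table: just the four codons from the sample rows
# (standard genetic code). Not a replacement for Bio.Seq.Seq.translate.
CODON_TABLE = {"TGT": "C", "TAT": "Y", "ATG": "M", "AAG": "K"}

def translate_codon(codon):
    """Translate one DNA codon to its one-letter amino acid (hypothetical helper)."""
    return CODON_TABLE[codon.upper()]

# str.translate expects a mapping table, so calling it without arguments fails.
try:
    "TGT".translate()
except TypeError as exc:
    str_error = exc
    print("plain string:", exc)

# Original and mutated codons from the two sample rows.
pairs = [("TGT", "TAT"), ("ATG", "AAG")]
changes = [translate_codon(orig) + ">" + translate_codon(mut) for orig, mut in pairs]
print(changes)
```

The two results agree with the HGVS_protein column of the sample rows (p.C972Y and p.M183K).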

Data Analysis using Python

I have two CSV files. The first contains city name, population and humidity. The second maps cities to states. I want to get the total population and the average humidity per state. Can someone help? Here is an example:
CSV 1:
CityName,population,humidity
Austin,1000,20
Sanjose,2200,10
Sacramento,500,5
CSV 2:
State,city name
Ca,Sanjose
Ca,Sacramento
Texas,Austin
I would like to get this output (total population and average humidity per state):
Ca,2700,7.5
Texas,1000,20
The above solution doesn't work because the dictionary will contain only one key-value pair per state. I gave up and finally used a loop. The code below works; the input is shown as well.
csv1
state_name,city_name
CA,sacramento
utah,saltlake
CA,san jose
Utah,provo
CA,sanfrancisco
TX,austin
TX,dallas
OR,portland
CSV2
city_name population humidity
sacramento 1000 1
saltlake 300 5
san jose 500 2
provo 100 7
sanfrancisco 700 3
austin 2000 4
dallas 2500 5
portland 300 6
import csv
from pandas import read_csv
def mapping_within_dataframe(self, file1, file2, file3):
    self.csv1 = file1
    self.csv2 = file2
    self.outcsv = file3
    one_state_data = 0
    # open the path stored in self.outcsv (the original opened the literal name 'self.outcsv')
    outfile = csv.writer(open(self.outcsv, 'w'), delimiter=',')
    state_city = read_csv(self.csv1)
    city_data = read_csv(self.csv2)
    all_state = list(set(state_city.state_name))
    for one_state in all_state:
        one_state_cities = list(state_city.loc[state_city.state_name == one_state, "city_name"])
        one_state_data = 0
        for one_city in one_state_cities:
            # add up the population of every city that belongs to this state
            one_city_data = city_data.loc[city_data.city_name == one_city, "population"].sum()
            one_state_data = one_state_data + one_city_data
        print(one_state, one_state_data)
    outfile.writerows(whatever)  # placeholder kept from the original post: pass the rows to write
def output(file1, file2):
    f = lambda x: x.strip()  # strips newline and white space characters
    with open(file1) as cities:
        with open(file2) as states:
            states_dict = {}
            cities_dict = {}
            for line in states:
                line = line.split(',')
                states_dict[f(line[0])] = f(line[1])
            for line in cities:
                line = line.split(',')
                cities_dict[f(line[0])] = (int(f(line[1])), int(f(line[2])))
            for state, city in states_dict.items():
                try:
                    print(state, cities_dict[city])
                except KeyError:
                    pass
output(CSV1, CSV2)  # these are the names of the files
This gives the output you wanted. Just make sure the names of cities in both files are the same in terms of capitalization.
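A dictionary keyed by state can hold only one city per state, so one way around that is to key the lookup by city and accumulate per state. The sketch below uses only the standard library and the sample data from the question; the io.StringIO objects stand in for the two files, and the column names are the ones shown in the question.

```python
import csv
import io
from collections import defaultdict

# Sample data from the question, held in memory instead of files.
cities_csv = """CityName,population,humidity
Austin,1000,20
Sanjose,2200,10
Sacramento,500,5
"""
states_csv = """State,city name
Ca,Sanjose
Ca,Sacramento
Texas,Austin
"""

# city -> state, so several cities can share one state
city_to_state = {}
for row in csv.DictReader(io.StringIO(states_csv)):
    city_to_state[row['city name'].strip()] = row['State'].strip()

population = defaultdict(int)    # state -> total population
humidity = defaultdict(list)     # state -> humidity readings of its cities
for row in csv.DictReader(io.StringIO(cities_csv)):
    state = city_to_state.get(row['CityName'].strip())
    if state is None:
        continue                 # city without a known state is skipped
    population[state] += int(row['population'])
    humidity[state].append(float(row['humidity']))

result = {}
for state in population:
    mean_humidity = sum(humidity[state]) / len(humidity[state])
    result[state] = (population[state], mean_humidity)

for state, (total, mean_humidity) in sorted(result.items()):
    print(state, total, mean_humidity)
```

As noted above, city names must match exactly in both files, including capitalization.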

How can I count different values per same key with Python?

I have code that produces a list like this:
Name     id number   week number
Piata    4           6
Mali     2           20,5
Goerge   5           4
Gooki    3           24,64,6
Mali     5           45,9
Piata    6           1
Piata    12          2,7,8,27,16 etc..
with the code below:
import csv
import datetime
from collections import defaultdict
from datetime import date
datedict = defaultdict(set)
with open('d:/info.csv', 'r') as csvfile:
    filereader = csv.reader(csvfile, 'excel')
    # skipping the header
    read_header = False
    start_date = date(year=2009, month=1, day=1)
    #print((seen_date - start_date).days)
    tdic = {}
    for row in filereader:
        if not read_header:
            read_header = True
            continue
        # reading the remaining rows
        name, id, firstseen = row[0], row[1], row[3]
        try:
            seen_date = datetime.datetime.strptime(firstseen, '%d/%m/%Y').date()
            deltadays = (seen_date - start_date).days
            deltaweeks = deltadays // 7 + 1  # integer week number
            key = name, id
            currentvalue = tdic.get(key, set())
            currentvalue.add(deltaweeks)
            tdic[key] = currentvalue
        except ValueError:
            print('Date value error')
            pass
Now I want to convert this list into one that gives, for each name, the number of ids and all of its week numbers, like this:
Name     number of ids   weeknumbers
Mali     2               20,5,45,9
Piata    3               1,6,2,7,8,27,16
Goerge   1               4
Gooki    1               24,64,6
Can anyone help me with writing the code for this part?
Since it looks like your csv file has headers (which you are currently ignoring) why not use a DictReader instead of the standard reader class? If you don't supply fieldnames the DictReader will assume the first line contains them, which will also save you from having to skip the first line in your loop.
This seems like a great opportunity to use defaultdict and Counter from the collections module.
import csv
import datetime
from collections import defaultdict, Counter
from datetime import date
datedict = defaultdict(set)
namecounter = Counter()
with open('d:/info.csv', 'r') as csvfile:
    filereader = csv.DictReader(csvfile)
    start_date = date(year=2009, month=1, day=1)
    for row in filereader:
        name, id, firstseen = row['name'], row['id'], row['firstseen']
        try:
            seen_date = datetime.datetime.strptime(firstseen, '%d/%m/%Y').date()
        except ValueError:
            print('Date value error')
            continue  # skip rows whose date cannot be parsed
        deltadays = (seen_date - start_date).days
        deltaweeks = deltadays // 7 + 1  # integer week number
        datedict[name].add(deltaweeks)
        namecounter.update([name])  # Without putting name into a list, update would count each character
This assumes that (name, id) is unique. If this is not the case, you can use another defaultdict for namecounter. I've also moved the try-except statement so that it is more explicit about what you are testing.
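A possible sketch of that variant, for the case where the same (name, id) pair can occur on several rows: keeping a set of ids per name means repeated pairs are counted once. The records below reuse some values from the table in the question, with one row repeated on purpose. The names ids_by_name and weeks_by_name are made up for the example.

```python
from collections import defaultdict

# (name, id, week) records; one Piata row is repeated deliberately to show
# that duplicates do not inflate the id count.
records = [
    ("Mali", 2, 20), ("Mali", 2, 5), ("Mali", 5, 45), ("Mali", 5, 9),
    ("Goerge", 5, 4),
    ("Piata", 4, 6), ("Piata", 6, 1), ("Piata", 6, 1),
]

ids_by_name = defaultdict(set)     # name -> distinct ids
weeks_by_name = defaultdict(set)   # name -> distinct week numbers
for name, id_, week in records:
    ids_by_name[name].add(id_)
    weeks_by_name[name].add(week)

# Number of distinct ids and sorted week numbers per name.
summary = {name: (len(ids_by_name[name]), sorted(weeks_by_name[name]))
           for name in ids_by_name}
for name, (id_count, weeks) in summary.items():
    print(name, id_count, weeks)
```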
Given that:
tdict = {('Mali', 5): set([9, 45]), ('Gooki', 3): set([24, 64, 6]), ('Goerge', 5): set([4]), ('Mali', 2): set([20, 5]), ('Piata', 4): set([6]), ('Piata', 6): set([1]), ('Piata', 12): set([8, 16, 2, 27, 7])}
then, to output the result above:
names = {}
for ((name, id), more_weeks) in tdict.items():
    (ids, weeks) = names.get(name, (0, set()))
    ids = ids + 1  # one more id seen for this name
    weeks = weeks.union(more_weeks)
    names[name] = (ids, weeks)
for (name, (id, weeks)) in names.items():
    print("%s, %s, %s" % (name, id, weeks))
