How can I count different values per same key with Python?

I have code that produces a list like this:
Name     id number   week number
Piata    4           6
Mali     2           20,5
Goerge   5           4
Gooki    3           24,64,6
Mali     5           45,9
Piata    6           1
Piata    12          2,7,8,27,16
etc.
It is produced by the code below:
import csv
import datetime
from collections import defaultdict
from datetime import date
datedict = defaultdict(set)
with open('d:/info.csv', 'r') as csvfile:
    filereader = csv.reader(csvfile, 'excel')
    # skip the header row
    read_header = False
    start_date = date(year=2009, month=1, day=1)
    #print((seen_date - start_date).days)
    tdic = {}
    for row in filereader:
        if not read_header:
            read_header = True
            continue
        # read the remaining rows
        name, id, firstseen = row[0], row[1], row[3]
        try:
            seen_date = datetime.datetime.strptime(firstseen, '%d/%m/%Y').date()
            deltadays = (seen_date - start_date).days
            deltaweeks = deltadays // 7 + 1  # integer week number
            key = name, id
            currentvalue = tdic.get(key, set())
            currentvalue.add(deltaweeks)
            tdic[key] = currentvalue
        except ValueError:
            print('Date value error')
Now I want to convert this into a list that gives, for each name, the number of ids and all of its week numbers, like this:
Name     number of ids   week numbers
Mali     2               20,5,45,9
Piata    3               1,6,2,7,8,27,16
Goerge   1               4
Gooki    1               24,64,6
Can anyone help me write the code for this part?

Since it looks like your csv file has headers (which you are currently ignoring) why not use a DictReader instead of the standard reader class? If you don't supply fieldnames the DictReader will assume the first line contains them, which will also save you from having to skip the first line in your loop.
This seems like a great opportunity to use defaultdict and Counter from the collections module.
import csv
import datetime
from datetime import date
from collections import defaultdict, Counter
datedict = defaultdict(set)
namecounter = Counter()
with open('d:/info.csv', 'r') as csvfile:
    filereader = csv.DictReader(csvfile)
    start_date = date(year=2009, month=1, day=1)
    for row in filereader:
        name, id, firstseen = row['name'], row['id'], row['firstseen']
        try:
            seen_date = datetime.datetime.strptime(firstseen, '%d/%m/%Y').date()
        except ValueError:
            print('Date value error')
            continue  # skip rows whose date cannot be parsed
        deltadays = (seen_date - start_date).days
        deltaweeks = deltadays // 7 + 1  # integer week number
        datedict[name].add(deltaweeks)
        namecounter.update([name])  # Without putting name into a list, update would count each character
This assumes that each (name, id) pair appears only once in the file. If that is not the case, you can use another defaultdict for namecounter. I've also narrowed the try-except statement so it is more explicit about what you are testing.
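If the same (name, id) pair can show up on several rows, one way to still count distinct ids is to keep a set of ids per name next to the set of weeks. This is only a minimal sketch: the rows below are made-up, already-parsed (name, id, week) tuples, and the names ids_by_name, weeks_by_name and summary are mine, not from the code above.

```python
from collections import defaultdict

# Hypothetical, already-parsed rows: (name, id, week number).
# Note that ("Mali", "2") appears twice on purpose.
rows = [
    ("Mali", "2", 20), ("Mali", "2", 5),
    ("Mali", "5", 45), ("Mali", "5", 9),
    ("Goerge", "5", 4),
    ("Piata", "4", 6), ("Piata", "6", 1), ("Piata", "12", 2),
]

ids_by_name = defaultdict(set)    # name -> distinct ids seen
weeks_by_name = defaultdict(set)  # name -> distinct week numbers seen
for name, id_, week in rows:
    ids_by_name[name].add(id_)
    weeks_by_name[name].add(week)

# name -> (number of distinct ids, sorted week numbers)
summary = {
    name: (len(ids_by_name[name]), sorted(weeks_by_name[name]))
    for name in ids_by_name
}
for name, (n_ids, weeks) in summary.items():
    print(name, n_ids, ",".join(map(str, weeks)))
```

Because sets ignore duplicates, a repeated (name, id) row does not inflate the id count.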

Given that:
tdict = {('Mali', 5): set([9, 45]), ('Gooki', 3): set([24, 64, 6]), ('Goerge', 5): set([4]), ('Mali', 2): set([20, 5]), ('Piata', 4): set([4]), ('Piata', 6): set([1]), ('Piata', 12): set([8, 16, 2, 27, 7])}
then, to output the result above:
names = {}
for (name, id), more_weeks in tdict.items():
    ids, weeks = names.get(name, (0, set()))
    ids = ids + 1  # one more id seen for this name
    weeks = weeks.union(more_weeks)
    names[name] = (ids, weeks)
for name, (ids, weeks) in names.items():
    print("%s, %s, %s" % (name, ids, weeks))

Related

How to get rid of the rest of the text after getting the results I want?

import urllib.request
import json
from collections import Counter
def count_coauthors(author_id):
    coauthors_dict = {}
    url_str = ('https://api.semanticscholar.org/graph/v1/author/47490276?fields=name,papers.authors')
    respons = urllib.request.urlopen(url_str)
    text = respons.read().decode()
    for line in respons:
        print(line.decode().rstrip())
    data = json.loads(text)
    print(type(data))
    print(list(data.keys()))
    print(data["name"])
    print(data["authorId"])
    name = []
    for lines in data["papers"]:
        for authors in lines["authors"]:
            name.append(authors.get("name"))
    print(name)
    count = dict()
    names = name
    for i in names:
        if i not in count:
            count[i] = 1
        else:
            count[i] += 1
    print(count)
    c = Counter(count)
    top = c.most_common(10)
    print(top)
    return coauthors_dict
author_id = '47490276'
cc = count_coauthors(author_id)
top_coauthors = sorted(cc.items(), key=lambda item: item[1], reverse=True)
for co_author in top_coauthors[:10]:
    print(co_author)
This is how my code looks so far; there are no errors. I need to get rid of the extra output when I run it, so that it prints only this:
('Diego Calvanese', 47)
('D. Lanti', 28)
('Martín Rezk', 21)
('Elem Güzel Kalayci', 18)
('B. Cogrel', 17)
('E. Botoeva', 16)
('E. Kharlamov', 16)
('I. Horrocks', 12)
('S. Brandt', 11)
('V. Ryzhikov', 11)
I have tried using rstrip and split on my 'c' variable, but it doesn't work. I'm only allowed to import what I have already imported, and I must use the link that is included.
Tips on simplifying or improving the code are also appreciated!
(The assignment reads: "Extend the program below so that it prints the names of the top-10 coauthors together with the numbers of the coauthored publications".)

From what I understand you are not quite sure where your successful output originates from. It is not the 5 lines at the end.
Your result is printed by the print(top) call near the end of the function. This top variable is what you want to return from the function, because the coauthors_dict you are currently returning never actually gets any data written to it.
You will also have to slightly adjust your sorted(...) as you now have a list and not a dictionary, but you should then get the correct result.
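To illustrate the point without calling the API: Counter.most_common(n) already returns a list of (name, count) tuples sorted by count, so returning it directly gives the caller something useful instead of an empty dictionary. A minimal sketch; the sample names and the helper top_coauthors are made up for illustration and stand in for the list built from the API response.

```python
from collections import Counter

# Hypothetical author names, standing in for the names collected from the response
names = ["A. One", "B. Two", "A. One", "C. Three", "B. Two", "A. One"]

def top_coauthors(names, n=10):
    """Return the n most frequent names as (name, count) tuples."""
    return Counter(names).most_common(n)

# The result is a list of tuples, so no .items() call is needed
for pair in top_coauthors(names, n=2):
    print(pair)
```

Since the function now returns a list that is already sorted, the caller can loop over it directly and drop both the extra sorted(...) call and the debugging prints.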

If I understand correctly you are wanting this function to return a count of each distinct co-author (excluding the author), which it seems like you already have in your count variable, which you don't return. The variable you DO return is empty.
Instead consider:
import urllib.request
import json
from collections import Counter
def count_coauthors(author_id):
    url_str = (f'https://api.semanticscholar.org/graph/v1/author/{author_id}?fields=name,papers.authors')
    response = urllib.request.urlopen(url_str)
    text = response.read().decode()
    data = json.loads(text)
    names = [a.get("name") for l in data["papers"] for a in l["authors"] if a['authorId'] != author_id]
    # The statement above can be written long-hand like:
    # names = []
    # for l in data["papers"]:
    #     for a in l["authors"]:
    #         if a['authorId'] != author_id:
    #             names.append(a.get("name"))
    return list(Counter(names).items())
author_id = '47490276'
cc = count_coauthors(author_id)
top_coauthors = sorted(cc, key=lambda item: item[1], reverse=True)
for co_author in top_coauthors[:10]:
    print(co_author)
('Diego Calvanese', 47)
('D. Lanti', 28)
('Martín Rezk', 21)
('Elem Güzel Kalayci', 18)
('B. Cogrel', 17)
('E. Botoeva', 16)
('E. Kharlamov', 16)
('I. Horrocks', 12)
('S. Brandt', 11)
('V. Ryzhikov', 11)
You might also consider moving the top-N logic into the function as an optional parameter:
import urllib.request
import json
from collections import Counter
def count_coauthors(author_id, top=0):
    url_str = (f'https://api.semanticscholar.org/graph/v1/author/{author_id}?fields=name,papers.authors')
    response = urllib.request.urlopen(url_str)
    text = response.read().decode()
    data = json.loads(text)
    names = [a.get("name") for l in data["papers"] for a in l["authors"] if a['authorId'] != author_id]
    name_count = list(Counter(names).items())
    # top=0 means "return everything"
    top = top if top != 0 else len(name_count)
    return sorted(name_count, key=lambda x: x[1], reverse=True)[:top]
author_id = '47490276'
for auth in count_coauthors(author_id, top=10):
    print(auth)

combine two for loops in to fill same dictionary

I am trying to pick two different merchants from a list of dictionaries, giving priority to merchants that have prices. If two different merchants with prices cannot be found, merchant 1 or 2 should be filled with other entries from the list; if the list is not long enough, merchant 1 or 2 should be None.
In other words, the loop should return two merchants, preferring those with prices. If that is not enough to fill both slots, it falls back to merchants without prices, and if a slot is still empty, it is filled with a None value.
Here is the code I have so far. It does the job, but I believe it can be combined in a more Pythonic way.
import csv
dummy_list = []
with open('/home/timmy/testing/example/example/test.csv') as csvFile:
    reader = csv.DictReader(csvFile)
    for row in reader:
        dummy_list.append(row)
item = dict()
index = 1
# first pass: merchants that have a price
for merchant in dummy_list:
    if merchant['price']:
        if index == 2:
            if item['merchant_1'] == merchant['name']:
                continue
        item['merchant_%d' % index] = merchant['name']
        item['merchant_%d_price' % index] = merchant['price']
        item['merchant_%d_stock' % index] = merchant['stock']
        item['merchant_%d_link' % index] = merchant['link']
        index += 1
        if index == 3:
            break
# second pass: fill the remaining slots with any other merchant
for merchant in dummy_list:
    if index == 3:
        break
    if index < 3:
        try:
            if item['merchant_1'] == merchant['name']:
                continue
        except KeyError:
            pass
        item['merchant_%d' % index] = merchant['name']
        item['merchant_%d_price' % index] = merchant['price']
        item['merchant_%d_stock' % index] = merchant['stock']
        item['merchant_%d_link' % index] = merchant['link']
        index += 1
# pad with empty values if fewer than two merchants were found
while index < 3:
    item['merchant_%d' % index] = ''
    item['merchant_%d_price' % index] = ''
    item['merchant_%d_stock' % index] = ''
    item['merchant_%d_link' % index] = ''
    index += 1
print(item)
Here are the contents of the CSV file:
price,link,name,stock
,https://www.samsclub.com/sams/donut-shop-100-ct-k-cups/prod19381344.ip,Samsclub,
,https://www.costcobusinessdelivery.com/Green-Mountain-Original-Donut-Shop-Coffee%2C-Medium%2C-Keurig-K-Cup-Pods%2C-100-ct.product.100297848.html,Costcobusinessdelivery,
,https://www.costco.com/The-Original-Donut-Shop%2C-Medium-Roast%2C-K-Cup-Pods%2C-100-count.product.100381350.html,Costco,
57.99,https://www.target.com/p/the-original-donut-shop-regular-medium-roast-coffee-keurig-k-cup-pods-108ct/-/A-13649874,Target,Out of Stock
10.99,https://www.target.com/p/the-original-donut-shop-dark-roast-coffee-keurig-k-cup-pods-18ct/-/A-16185668,Target,In Stock
,https://www.homedepot.com/p/Keurig-Kcup-Pack-The-Original-Donut-Shop-Coffee-108-Count-110030/204077166,Homedepot,Undertermined

As you only want to keep at most 2 merchants, I would process the CSV only once, keeping one list of merchants with prices and a separate list of merchants without prices, and stopping as soon as 2 merchants with prices have been found.
After that loop, I would concatenate those 2 lists and a list of two empty merchants, and take the first 2 elements of the result. That is enough to satisfy your requirement of 2 distinct merchants with priority given to those that have prices. Finally, I would use that to fill the item dict.
The code would be:
import csv
with open('/home/timmy/testing/example/example/test.csv') as csvFile:
    reader = csv.DictReader(csvFile)
    names_price = set()
    names_no_price = set()
    merchant_price = []
    merchant_no_price = []
    item = {}
    for merchant in reader:
        if merchant['price']:
            if merchant['name'] not in names_price:
                names_price.add(merchant['name'])
                merchant_price.append(merchant)
                if len(merchant_price) == 2:
                    break
        else:
            if merchant['name'] not in names_no_price:
                names_no_price.add(merchant['name'])
                merchant_no_price.append(merchant)
    # placeholder merchant with an empty value for every column
    void = {k: '' for k in reader.fieldnames}
    merchant_list = (merchant_price + merchant_no_price + [void, void.copy()])[:2]
    for index, merchant in enumerate(merchant_list, 1):
        item['merchant_%d' % index] = merchant['name']
        item['merchant_%d_price' % index] = merchant['price']
        item['merchant_%d_stock' % index] = merchant['stock']
        item['merchant_%d_link' % index] = merchant['link']

My code doesn't produce any output -- Python

I have two columns of data (sample data) and I want to calculate the total number of users for each weekday.
For instance, I'd want my output to look like this (a dict, a list, anything will do):
Monday: 25,
Tuesday: 30,
Wednesday: 45,
Thursday: 50,
Friday: 24,
Saturday: 22,
Sunday: 21
Here's my attempt:
import csv
def rider_ship(filename):
    with open('./data/Washington-2016-Summary.csv', 'r') as f_in:
        Sdict = []
        Cdict = []
        reader = csv.DictReader(f_in)
        for row in reader:
            if row['user_type'] == "Subscriber":
                if row['day_of_week'] in Sdict:
                    Sdict[row['day_of_week']] += 1
                else:
                    Sdict[row['day_of_week']] = row['day_of_week']
            else:
                if row['day_of_week'] in Cdict:
                    Cdict[row['day_of_week']] += 1
                else:
                    Cdict[row['day_of_week']] = row['day_of_week']
        return Sdict, Cdict
        print(Sdict)
        print(Cdict)
t = rider_ship('./data/Washington-2016-Summary.csv')
print(t)
TypeError: list indices must be integers or slices, not str

How about using pandas?
Let's first create a file-like object with the io library, to stand in for the real CSV file:
import io
s = u"""day_of_week,user_type
Monday,subscriber
Tuesday,customer
Tuesday,subscriber
Tuesday,subscriber"""
file = io.StringIO(s)
Now to the actual code:
import pandas as pd
df = pd.read_csv(file) # "path/to/file.csv"
Sdict = df[df["user_type"] == "subscriber"]["day_of_week"].value_counts().to_dict()
Cdict = df[df["user_type"] == "customer"]["day_of_week"].value_counts().to_dict()
Now we have:
Sdict = {'Tuesday': 2, 'Monday': 1}
Cdict = {'Tuesday': 1}
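If pandas is not an option, the same counts can be built with the standard library. The TypeError in the question comes from Sdict and Cdict being lists, which cannot be indexed by a day name; a dict, or a collections.Counter as in this sketch, can. The in-memory sample data and the names subscribers, customers, s_counts and c_counts are assumptions for illustration; the "Subscriber" spelling follows the question's code.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the real CSV file
sample = io.StringIO(
    "day_of_week,user_type\n"
    "Monday,Subscriber\n"
    "Tuesday,Customer\n"
    "Tuesday,Subscriber\n"
    "Tuesday,Subscriber\n"
)

def rider_ship(f_in):
    """Count rides per weekday, separately for subscribers and everyone else."""
    subscribers = Counter()
    customers = Counter()
    for row in csv.DictReader(f_in):
        if row["user_type"] == "Subscriber":
            subscribers[row["day_of_week"]] += 1  # missing keys start at 0
        else:
            customers[row["day_of_week"]] += 1
    return dict(subscribers), dict(customers)

s_counts, c_counts = rider_ship(sample)
print(s_counts)
print(c_counts)
```

With a real file, the same function can be called as rider_ship(open(path, newline='')) inside a with block.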

calculating the area of an irregular shape from coordinates in a csv file using python

I am using Python to import a CSV file with coordinates in it, passing it to a list and using the contained data to calculate the area of each irregular figure. The data within the CSV file looks like this:
ID Name DE1 DN1 DE2 DN2 DE3 DN3
88637 Zack Fay -0.026841782 -0.071375637 0.160878583 -0.231788845 0.191811833 0.396593863
88687 Victory Greenfelder 0.219394372 -0.081932907 0.053054879 -0.048356016
88737 Lynnette Gorczany 0.043632299 0.118916157 0.005488698 -0.268612073
88787 Odelia Tremblay PhD 0.083147337 0.152277791 -0.039216388 0.469656787 -0.21725977 0.073797219
The code I am using is below. However, it raises an IndexError, because the first line doesn't have data in all columns. Is there a way to process the CSV file so that it only uses the columns with data in them?
import csv
import math
def main():
    try:
        # ask user to open a file with coordinates for 4 points
        my_file = input('Enter the Irregular Differences file name and location: ')
        file_list = []
        with open(my_file, 'r') as my_csv_file:
            reader = csv.reader(my_csv_file)
            print('my_csv_file: ', my_csv_file)
            next(reader)  # skip the header row
            for row in reader:
                print(row)
                file_list.append(row)
        all = calculate(file_list)
        save_write_file(all)
    except IOError:
        print('File reading error, Goodbye!')
    except IndexError:
        print('Index Error, Check Data')
# now do your calculations on the 'data' in the file.
def calculate(my_file):
    return_list = []
    for row in my_file:
        de1 = float(row[2])
        dn1 = float(row[3])
        de2 = float(row[4])
        dn2 = float(row[5])
        de3 = float(row[6])
        dn3 = float(row[7])
        de4 = float(row[8])
        dn4 = float(row[9])
        de5 = float(row[10])
        dn5 = float(row[11])
        de6 = float(row[12])
        dn6 = float(row[13])
        de7 = float(row[14])
        dn7 = float(row[15])
        de8 = float(row[16])
        dn8 = float(row[17])
        de9 = float(row[18])
        dn9 = float(row[19])
        area_squared = abs((dn1 * de2) - (dn2 * de1)) + ((de3 * dn4) - (dn3 * de4)) + ((de5 * dn6) - (de6 * dn5)) + ((de7 * dn8) - (dn7 * de8)) + ((dn9 * de1) - (de9 * dn1))
        area = area_squared / 2
        row.append(area)
        return_list.append(row)
    return return_list
def save_write_file(all):
    with open('output_task4B.csv', 'w') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["ID", "Name", "de1", "dn1", "de2", "dn2", "de3", "dn3", "de4", "dn4", "de5", "dn5", "de6", "dn6", "de7", "dn7", "de8", "dn8", "de9", "dn9", "Area"])
        writer.writerows(all)
if __name__ == '__main__':
    main()
Any suggestions?

Your problem appears to be in the calculate function.
You are trying to access various indexes of row without first confirming that they exist. One naive approach might be to treat missing values as zero, except that:
+ ((dn9 * de1) - (de9 * dn1))
is an attempt to wrap around from the last point to the first, and this would invalidate your math, because the missing points would all collapse to zero.
A better approach is probably to use a slice of the row, and use the sequence-iterating approach instead of trying to require a certain number of points. This lets your code fit the data.
coords = [float(v) for v in row[2:] if v]  # skip id and name, drop empty cells
assert len(coords) % 2 == 0, "Coordinates must come in pairs!"
prev_de = coords[-2]
prev_dn = coords[-1]
area_squared = 0.0
for de, dn in zip(coords[:-1:2], coords[1::2]):
    area_squared += (de * prev_dn) - (dn * prev_de)
    prev_de, prev_dn = de, dn
area = abs(area_squared) / 2
The next problem will be dealing with variable length output. I'd suggest putting the area before the coordinates. That way you know it's always column 3 (or whatever).
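Putting both suggestions together, here is a rough sketch of a per-row area function that accepts any number of coordinate pairs and places the area right after the name. It assumes rows shaped like [ID, Name, de1, dn1, de2, dn2, ...] with possibly empty trailing cells; the function names polygon_area and with_area and the sample rows are made up for illustration.

```python
def polygon_area(row):
    """Shoelace area for a row shaped like [id, name, de1, dn1, de2, dn2, ...]."""
    # Skip id and name, drop empty cells, convert the rest to floats
    coords = [float(v) for v in row[2:] if v.strip()]
    if len(coords) % 2 != 0:
        raise ValueError("Coordinates must come in pairs")
    points = list(zip(coords[0::2], coords[1::2]))
    if len(points) < 3:
        return 0.0  # fewer than three points enclose no area
    twice_area = 0.0
    prev_de, prev_dn = points[-1]  # start from the last point to close the shape
    for de, dn in points:
        twice_area += prev_de * dn - de * prev_dn
        prev_de, prev_dn = de, dn
    return abs(twice_area) / 2

def with_area(row):
    """Return a new row with the area as the third column."""
    return row[:2] + [polygon_area(row)] + row[2:]

# Hypothetical rows: a unit square with empty trailing cells, and a 3-4-5 triangle
square = ["1", "Unit square", "0", "0", "1", "0", "1", "1", "0", "1", "", ""]
triangle = ["2", "Triangle", "0", "0", "4", "0", "0", "3"]
print(with_area(square)[2])
print(with_area(triangle)[2])
```

Keeping the area in a fixed column means the header only needs "ID", "Name", "Area" followed by as many coordinate columns as the longest row.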

Doing operations on a large data set

I have to perform some analysis on a PSL record which contains information on DNA sequence fragments. Basically I have to find entries that are from the same read in the same contig (these are both values in the PSL entry). The problem is the PSL records are large (10-30 Mb text documents). I wrote a program that works on short records and on the long records given enough time but it took way longer than specified. I was told the program shouldn't take more than ~15 seconds. Mine took over 15 minutes.
PSL records look like this:
275 11 0 0 0 0 0 0 - M02034:35:000000000-A7UU0:1:1101:19443:1992/2 286 0 286 NODE_406138_length_13407_cov_13.425076 13465 408 694 1 286, 0, 408,
171 5 0 0 0 0 0 0 + M02034:35:000000000-A7UU0:1:1101:13497:2001/2 294 0 176 NODE_500869_length_34598_cov_30.643419 34656 34334 34510 1 176, 0, 34334,
188 14 0 10 0 0 0 0 + M02034:35:000000000-A7UU0:1:1101:18225:2002/1 257 45 257 NODE_455027_length_12018_cov_13.759444 12076 11322 11534 1 212, 45, 11322,
My code looks like this:
import sys
class PSLreader:
    '''
    Class to provide reading of a file containing psl alignments
    formatted sequences:
    object instantiation:
    myPSLreader = PSLreader(<file name>):
    object attributes:
    fname: the initial file name
    methods:
    readPSL() : reads psl file, yielding those alignments that are within the first or last
    1000 nt
    readPSLpairs() : yields psl pairs that support a circular hypothesis
    Author: David Bernick
    Date: May 12, 2013
    '''
    def __init__(self, fname=''):
        '''constructor: saves attribute fname '''
        self.fname = fname
    def doOpen(self):
        if self.fname == '':
            return sys.stdin
        else:
            return open(self.fname)
    def readPSL(self):
        '''
        using filename given in init, returns each filtered psl records
        that contain alignments that are within the terminal 1000nt of
        the target. Incomplete psl records are discarded.
        If filename was not provided, stdin is used.
        This method selects for alignments that may be part of a
        circle.
        Illumina pairs aligned to the top strand would have read1(+) and read2(-).
        For the bottom strand, read1(-) and read2(+).
        For potential circularity,
        these are the conditions that can support circularity:
        read1(+) near the 3' terminus
        read1(-) near the 5' terminus
        read2(-) near the 5' terminus
        read2(+) near the 3' terminus
        so...
        any read(+) near the 3', or
        any read(-) near the 5'
        '''
        nearEnd = 1000  # this constant determines "near the end"
        with self.doOpen() as fileH:
            for line in fileH:
                pslList = line.split()
                if len(pslList) < 17:
                    continue
                tSize = int(pslList[14])
                tStart = int(pslList[15])
                strand = str(pslList[8])
                if strand.startswith('+') and (tSize - tStart > nearEnd):
                    continue
                elif strand.startswith('-') and (tStart > nearEnd):
                    continue
                yield line
    def readPSLpairs(self):
        read1 = []
        read2 = []
        for psl in self.readPSL():
            parsed_psl = psl.split()
            strand = parsed_psl[9][-1]  # last character of the read name: '1' or '2'
            if strand == '1':
                read1.append(parsed_psl)
            elif strand == '2':
                read2.append(parsed_psl)
        output = {}
        for psl1 in read1:
            name1 = psl1[9][:-1]
            contig1 = psl1[13]
            for psl2 in read2:
                name2 = psl2[9][:-1]
                contig2 = psl2[13]
                if name1 == name2 and contig1 == contig2:
                    try:
                        output[contig1] += 1
                        break
                    except KeyError:
                        output[contig1] = 1
                        break
        print(output)
PSL_obj = PSLreader('EEV14-Vf.filtered.psl')
PSL_obj.readPSLpairs()
I was given some example code that looks like this:
def doSomethingPairwise(a):
    for leftItem in a[1]:
        for rightItem in a[2]:
            if leftItem[1] == rightItem[1]:
                print(a)
thisStream = [['David', 'guitar', 1], ['David', 'guitar', 2],
              ['John', 'violin', 1], ['John', 'oboe', 2],
              ['Patrick', 'theremin', 1], ['Patrick', 'lute', 2]]
thisGroup = None
thisGroupList = [[], [], []]
for name, instrument, num in thisStream:
    if name != thisGroup:
        # a new name starts: process the finished group, then reset
        doSomethingPairwise(thisGroupList)
        thisGroup = name
        thisGroupList = [[], [], []]
    thisGroupList[num].append([name, instrument, num])
doSomethingPairwise(thisGroupList)
But when I tried to implement it my program still took a long time. Am I thinking about this the wrong way? I realize the nested loop is slow but I don't see an alternative.
Edit: I figured it out, the data was presorted which made my brute force solution very impractical and unnecessary.
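For presorted input like this, one possible single-pass approach is itertools.groupby, which groups adjacent records with the same read name so that only mates within one small group are compared. This is a sketch under assumptions, not the intended course solution: the records are made-up (qName, tName) pairs, read names are assumed to end in 1 or 2, and it counts each read at most once per contig, which may differ slightly from the brute-force count.

```python
from collections import Counter
from itertools import groupby

# Hypothetical (qName, tName) pairs, already sorted so that mates are adjacent
records = [
    ("readA/1", "NODE_1"), ("readA/2", "NODE_1"),
    ("readB/1", "NODE_1"), ("readB/2", "NODE_2"),
    ("readC/1", "NODE_2"), ("readC/2", "NODE_2"),
]

def count_pairs(records):
    """Count reads whose 1 and 2 mates align to the same contig."""
    pairs = Counter()
    # Group on the read name without its trailing 1 or 2
    for _, group in groupby(records, key=lambda r: r[0][:-1]):
        contigs = {"1": set(), "2": set()}
        for qname, tname in group:
            if qname[-1] in contigs:
                contigs[qname[-1]].add(tname)
        # Contigs hit by both mates of this read
        for contig in contigs["1"] & contigs["2"]:
            pairs[contig] += 1
    return dict(pairs)

print(count_pairs(records))
```

Each record is visited once, so the work grows linearly with the file size instead of with the product of the two read lists.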

I hope this helps, although the question would benefit from a better example input file.
import sys
# It is better to create a PSLRecord class
class PSLRecord:
    def __init__(self, line):
        pslList = line.split()
        properties = ("matches", "misMatches", "repMatches", "nCount",
                      "qNumInsert", "qBaseInsert", "tNumInsert",
                      "tBaseInsert", "strand", "qName", "qSize", "qStart",
                      "qEnd", "tName", "tSize", "tStart", "tEnd", "blockCount",
                      "blockSizes", "qStarts", "tStarts")
        self.__dict__.update(dict(zip(properties, pslList)))
class PSLreader:
    def __init__(self, fname=''):
        self.fname = fname
    def doOpen(self):
        if self.fname == '':
            return sys.stdin
        else:
            return open(self.fname)
    def readPSL(self):
        with self.doOpen() as fileH:
            for line in fileH:
                pslrc = PSLRecord(line)
                yield pslrc
    # return a dictionary with all psl records grouped by qName and tName
    def readPSLpairs(self):
        dictpsl = {}
        for pslrc in self.readPSL():
            # OP requirement: remove the trailing '1' or '2' char with pslrc.qName[:-1]
            key = (pslrc.qName[:-1], pslrc.tName)
            if key not in dictpsl:
                dictpsl[key] = []
            dictpsl[key].append(pslrc)
        return dictpsl
# The filter function is better kept outside the class and self-contained
def f_filter(pslrec, nearEnd=1000):
    if (pslrec.strand.startswith('+') and
            (int(pslrec.tSize) - int(pslrec.tStart) > nearEnd)):
        return False
    if (pslrec.strand.startswith('-') and
            (int(pslrec.tStart) > nearEnd)):
        return False
    return True
PSL_obj = PSLreader('EEV14-Vf.filtered.psl')
# read the dictionary of pairs
dictpsl = PSL_obj.readPSLpairs()
from itertools import product
# product from itertools:
# (1) x (2,3) = (1,2),(1,3)
output = {}
for key, v in dictpsl.items():
    name, contig = key
    # filtered alignments from read 1
    strand_princ = [pslrec for pslrec in v if f_filter(pslrec) and
                    pslrec.qName[-1] == '1']
    # filtered alignments from read 2
    strand_sec = [pslrec for pslrec in v if f_filter(pslrec) and
                  pslrec.qName[-1] == '2']
    for pslrec_princ, pslrec_sec in product(strand_princ, strand_sec):
        # This loop makes fewer comparisons, since the records were grouped beforehand
        if contig not in output:
            output[contig] = 0
        output[contig] += 1
Note: 10-30 MB isn't a large file, if you ask me.
