I'm trying to standardize street address by converting the abbreviations to the full word (e.g. RD - Road). I created many lines to account for different spellings and ran into an issue where one replace code overrode another one
import pandas as pd
mydata = {'Street_type': ['PL', 'pl', 'Pl', 'PLACE', 'place']}
mydata = pd.DataFrame(mydata)
mydata['Street_type'] = mydata['Street_type'].replace('PL','Place',regex=True)
mydata['Street_type'] = mydata['Street_type'].replace('pl','Place',regex=True)
mydata['Street_type'] = mydata['Street_type'].replace('Pl','Place',regex=True)
mydata['Street_type'] = mydata['Street_type'].replace('PLACE','Place',regex=True)
mydata['Street_type'] = mydata['Street_type'].replace('place','Place',regex=True)
Instead of Place, I got Placeace. What is the best way to avoid this error? Do I write a if-else statement or any function? Thanks in advance!
Among other problems, you have overlapping logic: you fail to check that the target ("old") string is a full word before you replace it. For instance, with the input type of "PLACE", you trigger both the first and third replacements, generating PlaceACE and then PlaceaceACE before you get to the condition you wanted.
You need to work through your tracking and exclusion logic carefully, and then apply only one of the replacements. You can check the length of the street_type and apply the unique transition you need for that length.
If you're trying to convert a case statement, then you need to follow that logic pattern, rather than the successive applications you coded. You can easily look up how to simulate a "case" statement in Python.
Also consider using a translation dictionary, such as
type_trans = {
"pl": "Place",
"Pl": "Place",
"PLACE": "Place",
...
}
Then your change is simply
mydata['Street_type'] = type_trans[mydata['Street_type']]
Also, you might list all of the variants in a tuple, such as:
type_place = ("PL", "Pl", "pl", "PLACE", "place")
if mydata['Street_type'] in type_place
mydata['Street_type'] = "Place"
... but be sure to generalize this properly for your entire list of street types.
You can do this correctly with a single pass if you use a proper regex here, e.g. use word boundaries (\b):
In [11]: places = ["PL", "pl", "Pl", "PLACE", "Place", "place"]
In [12]: mydata.Street_type
Out[12]:
0 PL
1 pl
2 Pl
3 PLACE
4 place
Name: Street_type, dtype: object
In [13]: mydata.Street_type.replace("(^|\b)({})(\b|$)".format("|".join(places)), "Place", regex=True)
Out[13]:
0 Place
1 Place
2 Place
3 Place
4 Place
Name: Street_type, dtype: object
#Needlemanwunch
def zeros(shape):
retval = []
for x in range(shape[0]):
retval.append([])
for y in range(shape[1]):
retval[-1].append(0)
return retval
match_award = 10
mismatch_penalty = -3
gap_penalty = -4 # both for opening and extanding
def match_score(alpha, beta):
if alpha == beta:
return match_award
elif alpha == '-' or beta == '-':
return gap_penalty
else:
return mismatch_penalty
def finalize(align1, align2):
align1 = align1[::-1] #reverse sequence 1
align2 = align2[::-1] #reverse sequence 2
i,j = 0,0
#calcuate identity, score and aligned sequeces
symbol = ''
found = 0
score = 0
identity = 0
for i in range(0,len(align1)):
# if two AAs are the same, then output the letter
if align1[i] == align2[i]:
symbol = symbol + align1[i]
identity = identity + 1
score += match_score(align1[i], align2[i])
# if they are not identical and none of them is gap
elif align1[i] != align2[i] and align1[i] != '-' and align2[i] != '-':
score += match_score(align1[i], align2[i])
symbol += ' '
found = 0
#if one of them is a gap, output a space
elif align1[i] == '-' or align2[i] == '-':
symbol += ' '
score += gap_penalty
identity = float(identity) / len(align1) * 100
print('Similarity =', "%3.3f" % identity, 'percent')
print('Score =', score)
# print(align1)
# print(symbol)
# print(align2)
def needle(seq1, seq2):
m, n = len(seq1), len(seq2) # length of two sequences
# Generate DP table and traceback path pointer matrix
score = zeros((m+1, n+1)) # the DP table
# Calculate DP table
for i in range(0, m + 1):
score[i][0] = gap_penalty * i
for j in range(0, n + 1):
score[0][j] = gap_penalty * j
for i in range(1, m + 1):
for j in range(1, n + 1):
match = score[i - 1][j - 1] + match_score(seq1[i-1], seq2[j-1])
delete = score[i - 1][j] + gap_penalty
insert = score[i][j - 1] + gap_penalty
score[i][j] = max(match, delete, insert)
# Traceback and compute the alignment
align1, align2 = '', ''
i,j = m,n # start from the bottom right cell
while i > 0 and j > 0: # end toching the top or the left edge
score_current = score[i][j]
score_diagonal = score[i-1][j-1]
score_up = score[i][j-1]
score_left = score[i-1][j]
if score_current == score_diagonal + match_score(seq1[i-1], seq2[j-1]):
align1 += seq1[i-1]
align2 += seq2[j-1]
i -= 1
j -= 1
elif score_current == score_left + gap_penalty:
align1 += seq1[i-1]
align2 += '-'
i -= 1
elif score_current == score_up + gap_penalty:
align1 += '-'
align2 += seq2[j-1]
j -= 1
# Finish tracing up to the top left cell
while i > 0:
align1 += seq1[i-1]
align2 += '-'
i -= 1
while j > 0:
align1 += '-'
align2 += seq2[j-1]
j -= 1
finalize(align1, align2)
needle('kizlerlo','killerpo' )
***********************************************************************************************************************
#import textdistance as txd
import numpy
txd.overlap('kizlerlo','kilerpo' )
txd.jaro('kizlerlo','killerpo' )
txd.cosine('kizlerlo','killerpo' )
#txd.needleman_wunsch('kizlerlo','killerpo' )
txd.jaro_winkler('kizlerlo','killerpo' )
#txd.smith_waterman('Loans and Accounts','Loans Accounts' )
#txd.levenshtein.normalized_similarity('Loans and Accounts','Loans Accounts' )
from scipy.spatial import distance
a = 'kizlerlo'
b = 'kilerpoo'
#txd.gotoh('Loans and Accounts','Loans Accounts' )
print(txd.needleman_wunsch.normalized_similarity('Loans and Accounts','Loans Accounts' ))
***************************************************************************************************************************
#Euclidean
import math
import numpy as np
def euclid(str1,str2):
dist=0.0
x=str1
y=str2
set1=set()
for a in range(0,len(x)):
set1.add(x[a])
for a in range(0,len(y)):
set1.add(y[a])
vec1=[None]*len(set1)
vec2=[None]*len(set1)
for counter,each_char in enumerate(set1):
vec1[counter]=x.count(each_char)
vec2[counter]=y.count(each_char)
dist=1/(1+math.sqrt(sum([(a - b) ** 2 for a, b in zip(vec1, vec2)])))
print(dist)
euclid('kizlerlo','killerpo')
***************************************************************************************************************************
from similarity.qgram import QGram
import affinegap
qgram = QGram(2)
#print(qgram.distance('kizlerlo', 'killerpo'))
affinegap.affineGapDistance('kizlerlokill' ,'erpozlerlzler')
***************************************************************************************************************************
#manhattan
def manhattan(str1,str2):
dist=0.0
x=str1
y=str2
set1=set()
for a in range(0,len(x)):
set1.add(x[a])
for a in range(0,len(y)):
set1.add(y[a])
vec1=[None]*len(set1)
vec2=[None]*len(set1)
for counter,each_char in enumerate(set1):
vec1[counter]=x.count(each_char)
vec2[counter]=y.count(each_char)
#dist= sum([np.abs(a - b) for a, b in zip(vec1, vec2)])
dist=1/(1+sum([np.abs(a - b) for a, b in zip(vec1, vec2)]))
print(dist)
manhattan('kizlerlo','killerpo')
import jellyfish
import json
from Levenshtein import distance,jaro_winkler,jaro,ratio,seqratio
def comp(a,b):
return jellyfish.jaro_winkler(a,b)*100 + distance(a,b) + jaro(a,b)*100
ip = {"CED":"WALMART INC_10958553"}
ala = {}
for index,row in df_ala.iterrows():
a = ip.get("CED")
b = row['NN_UID']
c = comp(a,b)
ala.update({row['N_UID'] : c})
ala_max = max(ala, key=ala.get)
ala_f = {"ALACRA" : ala_max}
ces_f = {"CESIUM" : "WALMART_10958553_CESIUM"}
dun_f = {"DUNS" : "WALMART_10958053_DUNS"}
ref_f = {"REF" : "WALMART INC_10958553_REF"}
cax_f = {"CAX" : "WALMART LTD_10958553_CAX"}
final_op = {**ala_f,**ces_f,**dun_f,**ref_f,**cax_f }
final_json = json.dumps(final_op)
print(final_json)
from flask import Flask,request, jsonify
app = Flask(__name__)
#app.route('/test',methods = ['GET','POST'])
def test():
if request.method == "GET":
return jsonify({"response":"Get request called"})
elif request.method == "POST":
req_Json = request.json
name = req_Json['name']
return jsonify({"response": "Hi" + name})
if __name__ == '__main__':
app.run(debug = True,port = 9090)
{
"name": "Mike"
}
import usaddress
import pandas as pd
import statistics
#sa = dict(usaddress.parse('123 Main St. Suite Chicago, IL' ))
adr = pd.read_excel('C:\\VINAYAK\\Address.xlsx')
adr.columns = ['Address']
strlen = []
scr = []
loop = adr['Address'].tolist()
for i in loop:
strlen.append(len(i))
x = statistics.median(strlen)
for i in loop:
sa = dict(usaddress.parse(i))
sa = list(sa.values())
a = 0
if len(i) > x :
a+= 5
if 'AddressNumber' in sa :
a+= 23
if 'StreetName' in sa :
#a = a + 20
a+= 17
if 'OccupancyType' in sa :
a+= 6
if 'OccupancyIdentifier' in sa :
a+= 12
if 'PlaceName' in sa :
a+= 12
if 'StateName' in sa :
a+= 13
if 'ZipCode' in sa :
a+= 12
scr.append(a)
adr['Adr_Score'] = scr
adr.head()
#(pd.DataFrame([(key) for key in sa.items()])).transpose()
#pd.DataFrame(dict([(value, key) for key, value in sa.items()]))
#pd.DataFrame(dict([(value, key) for key, value in sa.items()]))
# df_ts = pd.DataFrame(columns = ['AddressNumber' , 'Age', 'City' , 'Country'])
# df_ts.append(sa, ignore_index=False, verify_integrity=False, sort=None)
# df_ts.head()
import pandas as pd
from zipfile import ZipFile
# core = []
# f = open('C:/Users/s.natarajakarayalar/1.txt','r')
# core.append(str(f.readlines()))
# print(core)
import os
import zipfile
import re
import nltk
import os
core = []
with zipfile.ZipFile('C:/Users/s.natarajakarayalar/TF.zip') as z:
a = 0
for filename in z.namelist():
#if a < 1:
#if not os.path.isdir(filename):
# read the file
with z.open(filename) as f:
#a = 2
x = f.readlines()
core = core + x
with open('C:/Users/s.natarajakarayalar/fins.txt', 'w') as f:
for item in core:
f.write("%s\n" % item)
# for i in core:
# if k < 5:
# tkt = re.sub(r'.*CONTENT', '', i)
# new_core.append(tkt)
# k = k+1
# for item in core:
# new_core.append(len(item.split()))
# print(sum(new_core))
# from nltk.tokenize import word_tokenize
# new_core = []
# stp = ['URL:https://','TITLE:b','META-KEYWORDS:','None','DOC ID:','CONTENT:b','URL:','TITLE:','META-CONTENT:']
# #new_core = [word for word in core if word not in stopwords]
# for i in core:
# wk = word_tokenize(i)
# for w in wk:
# if w not in stp:
# new_core.append(w)
Related
I have a piece of code to calculate one matrix from another. But when I try to print the output of the function the code runs without any errors but does not print anything. Furthermore, if I try to print the input matrix to the function and then call the function, the original input does not get printed. It prints fine when I don't call this function inf_mat. Does anybody know what might be the reason?
import numpy as np
from itertools import permutations
def inf_mat(adjacency, l_max):
## Influence matrix - Returns the influence matrix corresponding to the input adjacency matrix
inf_arr = np.zeros(np.shape(adjacency))
print(inf_arr)
norm_mat = adjacency
for i in range(np.shape(norm_mat)[0]):
for j in range(np.shape(norm_mat)[1]):
if norm_mat[i,j] != 0:
norm_mat[i,j] = 1
for i in range(np.shape(inf_arr)[0]):
for j in range(np.shape(inf_arr)[1]):
ind_lists = []
for k in range(l_max-1):
all_ind = [m for m in range(np.shape(adjacency)[1])]
all_ind.remove(i)
if j != i:
all_ind.remove(j)
ind_lists.append(list(permutations(all_ind, k+1)))
pairwise_list = []
for p in range(len(ind_lists)):
pairs = []
for q in range(len(ind_lists[p])):
mult_list = [i]
for r in range(len(ind_lists[p][q])):
mult_list.append(ind_lists[p][q][r])
mult_list.append(j)
pairs.append(mult_list)
pairwise_list.append(pairs)
inf_sum = adjacency[i,j]
norm_sum = norm_mat[i,j]
for t in range(len(pairwise_list)):
for t_ind in range(len(pairwise_list[t])):
inf_ele = 1
norm_ele = 1
for s in range(len(pairwise_list[t_ind])-1):
inf_ele *= adjacency[s,s+1]
norm_ele *= norm_mat[s,s+1]
inf_sum += inf_ele
norm_sum += norm_ele
inf_arr[i,j] = inf_sum/norm_sum
return inf_arr/l_max
def readFile(Filename):
fileObj = open(Filename, "r")
words = fileObj.read().splitlines()
fileObj.close()
return words
topo_list = readFile("EMT_RACIPE.topo")
topo_list.remove(topo_list[0])
gene_list = []
for i in range(len(topo_list)):
topo_list[i] = topo_list[i].split()
for j in range(2):
if gene_list.count(topo_list[i][j]) == 0:
gene_list.append(topo_list[i][j])
adjacency = np.zeros(shape = (len(gene_list),len(gene_list)))
for i in range(len(gene_list)):
indices = []
for j in range(len(topo_list)):
if topo_list[j].count(gene_list[i]) > 0:
indices.append(j)
for k in range(len(indices)):
if topo_list[indices[k]][0] == gene_list[i]:
if topo_list[indices[k]][2] == '1':
adjacency[i, gene_list.index(topo_list[indices[k]][1])] = 1
if topo_list[indices[k]][2] == '2':
adjacency[i, gene_list.index(topo_list[indices[k]][1])] = -1
if topo_list[indices[k]][1] == gene_list[i]:
if topo_list[indices[k]][2] == '1':
adjacency[gene_list.index(topo_list[indices[k]][0]), i] = 1
if topo_list[indices[k]][2] == '2':
adjacency[gene_list.index(topo_list[indices[k]][0]), i] = -1
N22E82 = adjacency
print(N22E82)
inf_N22E82 = inf_mat(N22E82, 10)
print(inf_N22E82)
I have 2 separate pandas dataframes, one of which tracks whether the platform of a train station is free or not and another which tracks the movement of the trains (note I am only at proof-of-concept stage, I appreciate my code is not tidy).
Code is as below:
L = []
M = []
x = 0
for i in range(0,k*2):
L.append(0)
M.append(x)
if (i == k):
x = 1
list_of_tuples = list(zip(M, L))
blocks_df = pd.DataFrame(list_of_tuples, columns = ['Direction', 'BlockTaken'])
L = ["London Depot", "London Platform", "Birmingham Platform", "Crossover"]
M = [0,0,0,0]
list_of_tuples = list(zip(L, M))
stations_control = pd.DataFrame(list_of_tuples, columns = ['Location', 'BlockTaken'])
for i in range (0,3600):
if (i%300==0): #Every 5 minutes, a new train enters service
print("Train " + str(TrainNumber) + " leaving depot for " + str(train_df.loc[train_df['Train_Number'] == TrainNumber, 'Start_Station'].iloc[0]) + " at " + str(t.time()) )
train_df.loc[train_df['Train_Number'] == TrainNumber, 'Dep'] = 'N'
train_df.loc[train_df['Train_Number'] == TrainNumber, 'Dep_Time'] = t.time()
train_df.loc[train_df['Train_Number'] == TrainNumber, 'From'] = L[0]
train_df.loc[train_df['Train_Number'] == TrainNumber, 'To'] = L[1]
if(stations_control[train_df.loc[train_df['Train_Number'] == TrainNumber, 'To']]['BlockTaken'] ==0):
print("Platform is free!!!")
t = t + datetime.timedelta(0,300)
#TrainNumber+=1
I know I am doing the if statement 4 lines from the end wrong, but I can't figure out how to do it properly. What I want to check is where the train is going and if the platform is free, print. What is the correct syntax here?
I think it is
b = train_df.loc[train_df['Train_Number'] == TrainNumber, 'To'].iloc[0]
if(stations_control.loc[stations_control['Location']==b]['BlockTaken'] == 0):
Answering my own question here. Thanks for the help guys.
if(stations_control.loc[stations_control['Location']==b].BlockTaken.iloc[0]== 0):
print("Platform is free!!!")
I am conducting a project in data science to analyse large volumes of cancer genome data, my computer is relatively inefficient and has a low cpu and low ram. As a result to run through all the samples it take sufficiently too long.
I have tried reducing any excess code, I have tried getting rid of for loops for list comprehensions, I have used multiprocessing to split up my tasks to run faster.
import re
import xlrd
import os
import time
from multiprocessing import Pool
import collections
import pandas as pd
if os.path.exists("C:\\Users\\js769\\genomemutations\\Input\\ChromosomesVersion") == True:
print("chromosomes in folder")
else:
os.makedirs("C:\\Users\\js769\\genomemutations\\Input\\ChromosomesVersion")
print(
"Chromosome Folder Created, Please transfer current version of chromosome number base data to new file."
)
if os.path.exists("C:\\Users\\js769\\genomemutations\\Input\\MutationSamples") == True:
print("Add sample data to run.")
else:
os.makedirs("C:\\Users\\js769\\genomemutations\\Input\\MutationSamples")
print("Mutation Sample Folder Created, please add mutation sample data to folder.")
if os.path.exists("C:\\Users\\js769\\genomemutations\\output") == True:
print("3")
else:
os.makedirs("C:\\Users\\js769\\genomemutations\\output")
# Require editing of this so it works both on a mac or windows system. Currently this version suited to mac because of higher processing power.
# Require ability to check to see if error occurs
def Main(Yeram):
import os
import glob
import errno
import shutil
import xlrd
import pandas as pd
import time
import re
import numpy as np
FragmentSize = 10000000 # This is fragment size which is adjustable.
# Code not needed
Position1 = Yeram.vectx
Position2 = Yeram.vecty
samplelist = Yeram.samplelist
dictA = Yeram.dictA
FragmentSize = Yeram.FragmentSize
chromosomesizes = Yeram.chromosomesizes
def chromosomex_mutation_data(
chromosomenumber, mutationlist
): # It selects the correct chromosome mutation point data, then it selects the data before the -. Mutation data in form(12-20)
chromosomexlist = ["0-1"]
for mutationposition in mutationlist:
if mutationposition[0:2] == str(chromosomenumber):
chromosomexlist.append(mutationposition[3:])
elif mutationposition[0:2] == (str(chromosomenumber) + ":"):
chromosomexlist.append(mutationposition[2:])
else:
continue
Puremutationdatapoints = [int(mutationposition.split("-")[0]) for mutationposition in chromosomexlist]
return Puremutationdatapoints
def Dictionary_Of_Fragment_mutation(FragmentSize, MutationData, ChromosomeNumber): #
chromosomes = {} # Dictionary
chromosomesize = chromosomesizes[ChromosomeNumber - 1]
# Opening up specific chromosome data and calculating amount of bases present in chromosome
Number_of_fragments = int(chromosomesize / FragmentSize)
for mutation in MutationData:
for i in range(0, (Number_of_fragments), 1):
a = (
"Chromosome"
+ str(ChromosomeNumber)
+ "Fragment"
+ str(i)
+ ",Basepairs "
+ str(i * FragmentSize + 1)
+ "-"
+ str(i * FragmentSize + FragmentSize)
)
if mutation in range(i * FragmentSize + 1, i * FragmentSize + FragmentSize + 1):
if chromosomes.get(a) == None:
chromosomes.update({a: 1})
else:
b = (chromosomes.get(a)) + 1
chromosomes.update({a: b})
else:
if chromosomes.get(a) == None:
chromosomes.update({a: 0})
else:
continue
return chromosomes # adds
# This adds mutations or no mutation to each fragment for chromosome,makes dicitonaries
def DictionaryRead(FragmentSize, Dict, ChromosomeNumber):
chromosomesize = chromosomesizes[ChromosomeNumber - 1]
Number_of_fragments = int(chromosomesize / FragmentSize)
chromosomefragmentlist = []
for i in range(0, (Number_of_fragments), 1):
a = (
"Chromosome"
+ str(ChromosomeNumber)
+ "Fragment"
+ str(i)
+ ",Basepairs "
+ str(i * FragmentSize + 1)
+ "-"
+ str(i * FragmentSize + FragmentSize)
)
chromosomefragmentlist.append(str(Dict.get((a))))
return chromosomefragmentlist
# This uses dictionary to create list
def forwardpackage2(FragmentSize, PureMutationData):
C = [] # list of data in numerical order 0 = no mutation
for i in range(1, 23, 1):
A = chromosomex_mutation_data(i, PureMutationData) # Purifies Data
B = Dictionary_Of_Fragment_mutation(FragmentSize, A, i) # Constructs Dictionary
C += DictionaryRead(
FragmentSize, B, i
) # Uses constructed Dictionary amd generates list of numbers, each number being a fragment in numerical order.
return C
def Mutationpointdata(Position1, Position2, dictA, FragmentSize): # Require dictA
vectx = Position1
vecty = Position2
Samplesandmutationpoints = []
for i in range(vectx, vecty):
print(samplelist[i])
new = [k for k, v in dictA.items() if int(v) == samplelist[i]]
mutationlist = [excelsheet.cell_value(i, 23) for i in new]
mutationlist.sort()
Samplesandmutationpoints.append(forwardpackage2(FragmentSize, mutationlist))
return Samplesandmutationpoints
# Opening sample data from excel table
return Mutationpointdata(Position1, Position2, dictA, FragmentSize) # yeram to james samples
def ChromosomeSequenceData(ChromosomeNumber): # Formats the chromosome file into readable information
with open(
r"C:\Users\js769\genomemutations\Input\ChromosomesVersion\chr" + str(ChromosomeNumber) + ".fa"
) as text_file:
text_data = text_file.read()
listA = re.sub("\n", "", text_data)
# list2=[z for z in text_data if z!= "\n"]
if ChromosomeNumber < 10:
ChromosomeSequenceData = listA[5:]
else:
ChromosomeSequenceData = listA[6:]
return ChromosomeSequenceData
def basepercentage_single(
i, FragmentSize, ChromosomeSequenceData
): # Creates a list of base percentage known for certain type of chromosome.
sentence = ChromosomeSequenceData[(i * FragmentSize + 1) : (i * FragmentSize + FragmentSize)]
a = sentence.count("N") + sentence.count("n")
c = str(((FragmentSize - a) / FragmentSize) * 100) + "%"
return c
def basepercentage_multiple(
FragmentSize, ChromosomeSequenceData
): # Creates a a list of base percentages known which correspond with the dna fragments for every chromosome.
fragmentamount = int(len(ChromosomeSequenceData) / FragmentSize)
list = [
basepercentage_single(i, FragmentSize, ChromosomeSequenceData) for i in range(0, (fragmentamount), 1)
]
return list
def FragmentEncodedPercentage(
FragmentSize
): # Packages a list of base percentages known which correspond with the dna fragments for every chromosome.
Initial_list = [basepercentage_multiple(FragmentSize, ChromosomeSequenceData(i)) for i in range(1, 23, 1)]
List_of_fragment_encoded_percentages = [item for sublist in Initial_list for item in sublist]
return List_of_fragment_encoded_percentages
def chromosomefragmentlist(
FragmentSize, ChromosomeNumber
): # Creares a list of fragment sizes for a specific chromosome.
chromosomesize = chromosomesizes[ChromosomeNumber - 1]
Number_of_fragments = int(chromosomesize / FragmentSize)
chromosomefragmentlist = []
for i in range(0, (Number_of_fragments), 1):
a = (
"Chromosome"
+ str(ChromosomeNumber)
+ "Fragment"
+ str(i)
+ ",Basepairs "
+ str(i * FragmentSize + 1)
+ "-"
+ str(i * FragmentSize + FragmentSize)
)
chromosomefragmentlist.append(str(((a))))
return chromosomefragmentlist
def GenomeFragmentGenerator(
FragmentSize
): # Creates the genome fragments for all chromosomes and adds them all to a list.
list = [chromosomefragmentlist(FragmentSize, i) for i in range(1, 23, 1)]
A = [item for sublist in list for item in sublist]
return A
def excelcreation(
mutationdata, samplelist, alpha, bravo, FragmentSize, A, B
): # Program runs sample alpha to bravo and then constructs excel table
data = {"GenomeFragments": A, "Encoded Base Percentage": B}
for i in range(alpha, bravo):
data.update({str(samplelist[i]): mutationdata[i]})
df = pd.DataFrame(data, index=A)
export_csv = df.to_csv(
r"C:/Users/js769/genomemutations/output/chromosomeAll.csv", index=None, header=True
)
start_time = time.time()
# Code determine base fragment size
FragmentSize = 1000000
chromosomesizes = [] # This calculates the base pair sizes for each chromosome.
for i in range(1, 23):
with open(r"C:\Users\js769\genomemutations\Input\ChromosomesVersion\chr" + str(i) + ".fa") as text_file:
text_data = text_file.read()
list = re.sub("\n", "", text_data)
if i < 10:
chromosomesizes.append(len(list[5:]))
else:
chromosomesizes.append(len(list[6:]))
wb = xlrd.open_workbook("C:/Users/js769/genomemutations/input/MutationSamples/Complete Sample For lungs.xlsx")
excelsheet = wb.sheet_by_index(0)
excelsheet.cell_value(0, 0)
sampleswithduplicates = [excelsheet.cell_value(i, 5) for i in range(1, excelsheet.nrows)]
samplelist = []
for sample in sampleswithduplicates:
if sample not in samplelist:
samplelist.append(int(sample)) # Constructs list of sample , each sample only comes up once
dictA = {}
counter = 1 # Creates a dictionary where it counts the
for sample in sampleswithduplicates:
dictA.update({counter: int(sample)})
counter = counter + 1
A = GenomeFragmentGenerator(FragmentSize)
B = FragmentEncodedPercentage(FragmentSize)
value = collections.namedtuple(
"value", ["vectx", "vecty", "samplelist", "dictA", "FragmentSize", "chromosomesizes"]
)
SampleValues = (
value(
vectx=0,
vecty=2,
samplelist=samplelist,
dictA=dictA,
FragmentSize=FragmentSize,
chromosomesizes=chromosomesizes,
),
value(
vectx=2,
vecty=4,
samplelist=samplelist,
dictA=dictA,
FragmentSize=FragmentSize,
chromosomesizes=chromosomesizes,
),
value(
vectx=4,
vecty=6,
samplelist=samplelist,
dictA=dictA,
FragmentSize=FragmentSize,
chromosomesizes=chromosomesizes,
),
value(
vectx=6,
vecty=8,
samplelist=samplelist,
dictA=dictA,
FragmentSize=FragmentSize,
chromosomesizes=chromosomesizes,
),
value(
vectx=8,
vecty=10,
samplelist=samplelist,
dictA=dictA,
FragmentSize=FragmentSize,
chromosomesizes=chromosomesizes,
),
value(
vectx=10,
vecty=12,
samplelist=samplelist,
dictA=dictA,
FragmentSize=FragmentSize,
chromosomesizes=chromosomesizes,
),
value(
vectx=12,
vecty=14,
samplelist=samplelist,
dictA=dictA,
FragmentSize=FragmentSize,
chromosomesizes=chromosomesizes,
),
value(
vectx=14,
vecty=16,
samplelist=samplelist,
dictA=dictA,
FragmentSize=FragmentSize,
chromosomesizes=chromosomesizes,
),
)
print("starting multiprocessing")
if __name__ == "__main__":
with Pool(4) as p:
result = p.map(Main, SampleValues)
Allmutationdata = []
for i in result:
for b in i:
Allmutationdata.append(b)
excelcreation(Allmutationdata, samplelist, 0, 16, FragmentSize, A, B)
print("My program took " + str(time.time() - start_time) + " to run")
So the program runs that isn't the issue, the issue is the time it runs,can anyone spot anywhere where my code maybe at fault.
This article How to make your pandas loop run 72,000x faster has really resonated with me and I think will help you.
It provides clear instructions on how to vectorize your for loops to drastically speed them up
Methods to speed up a For Loop:
Utilize pandas iterrows()
~321 times faster
Example
for index, row in dataframe.iterrows():
print(index, row)
Pandas Vectorization
~9280 times faster
Example
df.loc[((col1 == val1) & (col2 == val2)), column_name] = conditional_result
Numpy Vectorization
~72,000 times faster
Example
df.loc[((col1.values == val1) & (col2.values == val2)), column_name] = conditional_result
By adding .values we receive a numpy array.
Credit for the timing results goes to this article
Basically in the last for loop the k variable uses the number of items in the list and then I have a false and unique answer rather than multiple answers I want to do some sort of n roots of a complex number (if my question isn't clear sorry i'm not a native english speaker I'll do my best to make it clearer)
from math import *
deg = int(input("entrez le degré:"))
re = int(input("le réel:"))
im = int(input("l'imaginaire:"))
counter = 0
while counter < deg:
counter = counter + 1
kkk = []
kkk.append(counter)
r = sqrt(pow(re,2)+pow(im,2))
if im != 0:
teton = round(pi/radians(degrees(acos(re/r))),1)
else:
teton = round(pi/radians(degrees(acos(im/r))),1)
if round(r) != r:
r = "sqrt(",(pow(re,2)+pow(im,2)),")"
else:
r = r
teta = "pi/%s" %teton
print("z = ",r,"e^i",teta,)
for k in kkk:
if re != 0 or im != 0:
print(r,"e^i*2*",teta,"*",k,"pi")
else:
print(r,"^1/",deg,"e^i(",teta,"/",deg," +(2",k,"pi)/",deg)
print(k)
If I understood the problem correctly, you are saying that for loop is not iterating over all the items in the list kkk.
if you check your code the list kkk always have only one item as you are initializing and appending item in same loop.
please move below statement out of the first loop.
kkk = []
like below.
from math import *
deg = int(input("entrez le degré:"))
re = int(input("le réel:"))
im = int(input("l'imaginaire:"))
counter = 0
kkk = []
while counter < deg:
counter = counter + 1
kkk.append(counter)
r = sqrt(pow(re,2)+pow(im,2))
if im != 0:
teton = round(pi/radians(degrees(acos(re/r))),1)
else:
teton = round(pi/radians(degrees(acos(im/r))),1)
if round(r) != r:
r = "sqrt(",(pow(re,2)+pow(im,2)),")"
else:
r = r
teta = "pi/%s" %teton
print("z = ",r,"e^i",teta,)
for k in kkk:
if re != 0 or im != 0:
print(r,"e^i*2*",teta,"*",k,"pi")
else:
print(r,"^1/",deg,"e^i(",teta,"/",deg," +(2",k,"pi)/",deg)
print(k)
I just can't get it done. Therefore I'll post the full code.
The .csv used is from http://www.football-data.co.uk/mmz4281/1415/E0.csv
Now when run, the variables home_team_a, home_team_d, away_team_a and away_team_d are based on all of the previous matches but I want them to be based always on the last 6 matches.
import csv, math, ast, numpy as np
def poisson(actual, mean):
return math.pow(mean, actual) * math.exp(-mean) / math.factorial(actual)
csvFile = '20152016.csv'
team_list = []
k = open('team_list.txt', 'w')
k.write("""{
""")
csvRead = csv.reader(open(csvFile))
next(csvRead)
for row in csvRead:
if row[2] not in team_list:
team_list.append(row[2])
if row[3] not in team_list:
team_list.append(row[3])
team_list.sort()
for team in team_list:
k.write(""" '%s': {'home_goals': 0, 'away_goals': 0, 'home_conceded': 0, 'away_conceded': 0, 'home_games': 0, 'away_games': 0, 'alpha_h': 0, 'beta_h': 0, 'alpha_a': 0, 'beta_a': 0},
""" % (team))
k.write("}")
k.close()
s = open('team_list.txt', 'r').read()
dict = ast.literal_eval(s)
GAMES_PLAYED = 0
WEEKS_WAIT = 4
TOTAL_VALUE = 0
csvRead = csv.reader(open(csvFile))
next(csvRead)
for game in csvRead:
home_team = game[2]
away_team = game[3]
home_goals = int(game[4])
away_goals = int(game[5])
home_win_prob = 0
draw_win_prob = 0
away_win_prob = 0
curr_home_goals = 0
curr_away_goals = 0
avg_home_goals = 1
avg_away_goals = 1
team_bet = ''
ev_bet = ''
# GETTING UPDATED VARIABLES
for key, value in dict.items():
curr_home_goals += dict[key]['home_goals']
curr_away_goals += dict[key]['away_goals']
if GAMES_PLAYED > (WEEKS_WAIT * 10):
avg_home_goals = curr_home_goals / (GAMES_PLAYED)
avg_away_goals = curr_away_goals / (GAMES_PLAYED)
# CALCULATING FACTORS
if GAMES_PLAYED > (WEEKS_WAIT * 10):
home_team_a = (dict[home_team]['alpha_h'] + dict[home_team]['alpha_a']) / 2
away_team_a = (dict[away_team]['alpha_h'] + dict[away_team]['alpha_a']) / 2
home_team_d = (dict[home_team]['beta_h'] + dict[home_team]['beta_a']) / 2
away_team_d = (dict[away_team]['beta_h'] + dict[away_team]['beta_a']) / 2
home_team_exp = avg_home_goals * home_team_a * away_team_d
away_team_exp = avg_away_goals * away_team_a * home_team_d
# RUNNING POISSON
l = open('poisson.txt', 'w')
for i in range(10):
for j in range(10):
prob = poisson(i, home_team_exp) * poisson(j, away_team_exp)
l.write("Prob%s%s = %s\n" % (i, j, prob))
l.close()
with open('poisson.txt') as f:
for line in f:
home_goals_m = int(line.split(' = ')[0][4])
away_goals_m = int(line.split(' = ')[0][5])
prob = float(line.split(' = ')[1])
if home_goals_m > away_goals_m:
home_win_prob += prob
elif home_goals_m == away_goals_m:
draw_win_prob += prob
elif home_goals_m < away_goals_m:
away_win_prob += prob
#CALCULATE VALUE
bet365odds_h, bet365odds_d, bet365odds_a = float(game[23]), float(game[24]), float(game[25])
ev_h = (home_win_prob * (bet365odds_h - 1)) - (1 - home_win_prob)
ev_d = (draw_win_prob * (bet365odds_d - 1)) - (1 - draw_win_prob)
ev_a = (away_win_prob * (bet365odds_a - 1)) - (1 - away_win_prob)
highestEV = max(ev_h, ev_d, ev_a)
if (ev_h == highestEV) and (ev_h > 0):
team_bet = home_team
ev_bet = ev_h
if home_goals > away_goals:
TOTAL_VALUE += (bet365odds_h - 1)
else:
TOTAL_VALUE -= 1
elif (ev_d == highestEV) and (ev_d > 0):
team_bet = 'Draw'
ev_bet = ev_d
if home_goals == away_goals:
TOTAL_VALUE += (bet365odds_d - 1)
else:
TOTAL_VALUE -= 1
elif (ev_a == highestEV) and (ev_a > 0):
team_bet = away_team
ev_bet = ev_a
if home_goals < away_goals:
TOTAL_VALUE += (bet365odds_a - 1)
else:
TOTAL_VALUE -= 1
if (team_bet != '') and (ev_bet != ''):
print ("Bet on '%s' (EV = %s)" % (team_bet, ev_bet))
print (TOTAL_VALUE)
# UPDATE VARIABLES AFTER MATCH HAS BEEN PLAYED
dict[home_team]['home_goals'] += home_goals
dict[home_team]['home_conceded'] += away_goals
dict[home_team]['home_games'] += 1
dict[away_team]['away_goals'] += away_goals
dict[away_team]['away_conceded'] += home_goals
dict[away_team]['away_games'] += 1
GAMES_PLAYED += 1
# CREATE FACTORS
if GAMES_PLAYED > (WEEKS_WAIT * 10):
for key, value in dict.items():
alpha_h = (dict[key]['home_goals'] / dict[key]['home_games']) / avg_home_goals
beta_h = (dict[key]['home_conceded'] / dict[key]['home_games']) / avg_away_goals
alpha_a = (dict[key]['away_goals'] / dict[key]['away_games']) / avg_away_goals
beta_a = (dict[key]['away_conceded'] / dict[key]['away_games']) / avg_home_goals
dict[key]['alpha_h'] = alpha_h
dict[key]['beta_h'] = beta_h
dict[key]['alpha_a'] = alpha_a
dict[key]['beta_a'] = beta_a
Use a deque to keep the 6 most recent items in memory; adding a new record will "push out" the oldest one.
import collections
import itertools
import csv
with open("foo.csv") as fh:
# Skip the first 44 rows
csv_read = islice(csv.reader(fh), 44, None)
# Initialize the deque with the next 6 rows
d = collections.deque(islice(csv_read, 6), 6)
for record in csv_read:
d.append(record)
print(list(d)) # Rows 46-51, then 47-52, then 48-53, etc
Because you set the maximum length of the deque to 6, each append to a "full" deque pushes out the older one. On the first iteration, d.append pushes out row 45 and adds row 51. On the next iteration, adding row 52 pushes out row 46, etc.
In general, a deque is a data structure that is like a combination of a queue and a stack; you can add or remove items to either end efficiently, but accessing an arbitrary item or modifying the "middle" is slow. Here, we're taking advantage of the fact that appending to a full deque causes an implicit removal from the opposite end.
How about:
if seen_records == 200:
recs = list(csvRead)[seen_records - 6:seen_records + 1]
You can do something like this....
previous_index = 0
previous_max = 6 # max number of previous numbers to remember
previous = [None for _ in range(previous_max)]
csvFile = 'X.csv'
seen_records = 0
csvRead = csv.reader(open(csvFile))
# Enumerate over the records to keep track of the index of each one
for i, records in enumerate(csvRead):
if (i > 50):
seen_records =+ 1
if previous_index == previous_max:
previous_index = 0 # Reset to the beginning when we reach the end
# Store the record and increment the index to the next location
previous[previous_index] = record
previous_index += 1
This creates a very basic array of length previous_max and just stores the oldest data at index 0 and newest at previous_max -1.