Python Separating String into Words and Recursion

Python Separating String into Words and Recursion - python

I'm trying to create a code that will take in an input (example below)
Input:
BHK158 VEHICLE 11
OIUGHH MOTORCYCLE 34.46
BHK158 VEHICLE 12.000
TRIR TRUCK 2.0
BLAS215 MOTORCYCLE 0.001
END
and produce an output where each license plate number is listed and the total cost is listed beside (example below)
Corresponding output:
OIUGHH: 5.8582
BHK158: 5.75
TRIR: 2.666
BLAS215: 0.00017
The vehicle license plates are charged $0.25 per kilometer (the kilometers are the number values in the input list), trucks are charged $1.333 per kilometer, and motorcycles $0.17 per kilometer. The output is listed in descending order.
Here is my code thus far:
fileinput = input('Input: \n')
split_by_space = fileinput.split(' ')
vehicles = {}
if split_by_space[1] == 'VEHICLE':
split_by_space[2] = (float(split_by_space[2]) * 0.25)
elif split_by_space[1] == 'TRUCK':
split_by_space[2] = float(split_by_space[2]) * 1.333
elif split_by_space[1] == 'MOTORCYCLE':
split_by_space[2] = float(split_by_space[2]) * 0.17
if split_by_space[0] in vehicles:
previousAmount = vehicles[split_by_space[0]]
vehicles[split_by_space[0]] = previousAmount + split_by_space[2]
else:
vehicles[split_by_space[0]] = split_by_space[2]
Thanks, any help/hints would be greatly appreciated.

Going through your code I noticed a few things, list indicies in python start at 0, not 1 so you were get a bunch of out of bounds errors. Secondly, input only takes the first line of the input so it was never going past the first line. .split() splits text by \n by default, you have to specify if you want to split by something else, like a space.
test.txt contents:
BHK158 VEHICLE 11
OIUGHH MOTORCYCLE 34.46
BHK158 VEHICLE 12.000
TRIR TRUCK 2.0
BLAS215 MOTORCYCLE 0.001
python code:
fileinput = open('test.txt', 'r')
lines = fileinput.readlines()
vehicles = {}
for line in lines:
split_by_space = line.split(' ')
if split_by_space[1] == "VEHICLE":
split_by_space[2] = (float(split_by_space[2]) * 0.25)
elif split_by_space[1] == "TRUCK":
split_by_space[2] = float(split_by_space[2]) * 1.333
elif split_by_space[1] == "MOTORCYCLE":
split_by_space[2] = float(split_by_space[2]) * 0.17
if split_by_space[0] in vehicles:
previousAmount = vehicles[split_by_space[0]]
vehicles[split_by_space[0]] = previousAmount + split_by_space[2]
else:
vehicles[split_by_space[0]] = split_by_space[2]
output:
{'BLAS215': 0.00017, 'OIUGHH': 5.858200000000001, 'TRIR': 2.666, 'BHK158': 5.75}

Related

How to add running total using complex if statement in Python

I have the following data in a text file.
Intium II,2.8,24,128
Celerisc I,1.6,32,256
Ethloan III,2.6,32,128
Powerup II,1.9,64,512
Mitduo III,3.1,24,128
I am doing the following:
Allocating points to each processor
Points will be awarded to each processor as follows:
Clock speed is less than 2 GHz, award 10 points.
Clock speed is between 2 GHz and 3 GHz inclusive, award 20 points.
Clock speed is greater than 3 GHz, award 30 points.
Data Bus Points
1 point for each line on the data bus e.g. 24 bit data bus, award 24 points.
Cache Size Points
1 point for each whole 10 Kb of cache e.g. 128Kb cache, award 12 points.
(128/10 = 12·8, which should be rounded down to 12)
The output from the program should be similar to the following:
John Doe your order code is JD3f
Processor Points
Intium II 56
Celerisc I 67
Ethloan III 64
Powerup II 125
Mitduo III 66
Here is my code
import random
import string
def getDetails():
forename = input("Enter first name: ")
surname = input("Enter last name: ")
number = random.randint(0,9)
character = random.choice(string.ascii_letters).lower()
code = (forename[:1] + str(surname[:1]) + str(number) + str(character))
return forename, surname, code
def readProcessorDetails():
processorName = []*5
clockSpeed = []*5
dataBusWidth = []*5
cacheSize = []*5
file = open("processors.txt","r")
for line in file:
data = line.split(",")
processorName.append(data[0])
clockSpeed.append(float(data[1]))
dataBusWidth.append(data[2])
cacheSize.append(data[3])
input("file read successfully.. Press any key to continue")
return processorName, clockSpeed, dataBusWidth, cacheSize
def allocatePoints(clockSpeed, dataBusWidth, cacheSize):
proPoints = 0.0
processorPoints = []
for counter in range(len(clockSpeed)):
if clockSpeed[counter] < 2.0 or dataBusWidth[counter] == 24 or dataBusWidth[counter] == 128:
proPoints = proPoints + 10 + 24 + 12
processorPoints.append(proPoints)
elif clockSpeed[counter] > 2.0 or clockSpeed[counter] <= 3.0 or dataBusWidth[counter] == 24 or dataBusWidth[counter] == 128:
proPoints = proPoints + 20 + 24 + 12
processorPoints.append(proPoints)
elif clockSpeed[counter] > 3.0 or dataBusWidth[counter] == 24 or dataBusWidth[counter] == 128:
proPoints = proPoints + 30 + 24 + 12
processorPoints.append(proPoints)
return processorPoints
def display(processorName, processorPoints):
print(f"Processor \t Points")
for counter in range(len(processorPoints)):
print(f"{processorName[counter]}\t {processorPoints[counter]}\n")
def main():
forename, surname, code = getDetails()
processorName, clockSpeed, dataBusWidth, cacheSize = readProcessorDetails()
processorPoints = allocatePoints(clockSpeed, dataBusWidth, cacheSize)
print()
print(forename + " " + surname + " your order code is " + code)
print()
display(processorName, processorPoints)
main()
Here is my code output
Enter first name: John
Enter last name: Doe
file read successfully.. Press any key to continue
John Doe your order code is JD8z
Processor Points
Intium II 56.0
Celerisc I 102.0
Ethloan III 158.0
Powerup II 204.0
Mitduo III 260.0
I am not sure what I am doing wrong in my allocatePoints() function where I have used complex if statements and running total to add processor points.

Consider keeping all of the data in one list, rather than trying to coordinate multiple lists. This lets you avoid the unPythonic "for x in range(len(x))" construct, and iterate through the list directly.
I also fixed your root problem, which is that you weren't resetting the points to zero for every new processor.
import random
import string
def getDetails():
forename = input("Enter first name: ")
surname = input("Enter last name: ")
number = random.randint(0,9)
character = random.choice(string.ascii_letters).lower()
code = (forename[:1] + str(surname[:1]) + str(number) + str(character))
return forename, surname, code
def readProcessorDetails():
pdata = []
for line in open("processors.txt"):
data = line.split(",")
pdata.append( [data[0], float(data[1]), int(data[2]), int(data[3])] )
return pdata
def allocatePoints(pdata):
for row in pdata:
points = 0
if row[1] < 2.0:
points += 10
elif row[1] <= 3.0:
points += 20
else:
points += 30
points += row[2]
points += row[3] // 10
row.append( points )
def display(pdata):
print("Processor\tPoints")
for row in pdata:
print(f"{row[0]}\t{row[-1]}")
def main():
forename, surname, code = getDetails()
processorData = readProcessorDetails()
allocatePoints(processorData)
print()
print(forename + " " + surname + " your order code is " + code)
print()
display(processorData)
main()

How do you replace segments in a line using fileinput

I am creating a program for counting coins and I want to create a mechanism which essentially scans a specifically written text file and is able to calculate whether it has been falsely counted but also will replace the ending segment of the line with either Y for Yes or N for No.
The txt file reads as such:
Abena,5p,325.00,Y
Malcolm,1p,3356.00,Y
Jane,£2,120.00,Y
Andy,£1,166.25,N
Sandip,50p,160.00,Y
Liz,20p,250.00,Y
Andy,20p,250.00,Y
Andy,50p,160.00,Y
Jane,£1,183.75,N
Liz,£,179.0,N
Liz,50p,170.0,N
Jane,50p,160.0,Y
Sandip,£1,183.0,N
Jane,£2,132.0,N
Abena,1p,3356.0,N
Andy,2p,250.0,N
Abena,£1,175.0,Y
Malcolm,50p,160.0,Y
Malcolm,£2,175.0,N
Malcolm,£1,175.0,Y
Malcolm,1p,356.0,Y
Liz,20p,250.0,Y
Jane,£2,120.0,Y
Jane,50p,160.0,Y
Andy,£1,175.0,Y
Abena,1p,359.56,N
Andy,5p,328.5,N
Andy,£2,108.0,N
Malcolm,£2,12.0,N
as you can see every line is split into 4 segments, I want the fileinput to be able to replace the fourth segment within the specified line.
My program (all the relevant things to see right now) is as follows:
class Volunteer:
def __init__(self, name, coin_type, weight_of_bag, true_count):
self.name = name
self.coin_type = coin_type # a function allowing me to class the data
self.weight_of_bag = weight_of_bag
self.true_count = true_count
just a simple object system to make things easier for later
with open("CoinCount.txt", "r", encoding="'utf-8'") as csvfile:
volunteers = []
for line in csvfile:
volunteers.append(Volunteer(*line.strip().split(',')))
just to create a list as well as an object for easier calculations
def runscan():
with open("CoinCount.txt", "r+", encoding='utf-8') as csvfile:
num_lines = 0
for line in csvfile:
num_lines = num_lines + 1
i = 0
while i < num_lines:
ct = (volunteers[i].coin_type)
wob = float(volunteers[i].weight_of_bag)
if ct == ("£2" or "2"):
accurate_weight = float(12.0)
limit = 10
bag_value = 10 * 12
elif ct == ("£1" or "1"):
accurate_weight = float(8.75)
limit = 20
bag_value = 20 * 8.75
elif ct == "50p":
accurate_weight = float(8)
limit = 20
bag_value = 20 * 8
elif ct == "20p":
accurate_weight = float(5)
limit = 50
bag_value = 5 * 50
elif ct == "10p":
accurate_weight = float(6.5)
limit = 50
bag_value = 6.5 * 50
elif ct == "5p":
accurate_weight = float(3.25)
limit = 100
bag_value = 3.25 * 100
elif ct == "2p":
accurate_weight = float(7.12)
limit = 50
bag_value = 50 * 7.12
elif ct == "1p":
accurate_weight = float(3.56)
limit = 100
bag_value = 3.56 * 100
number_of_bags = wob / bag_value
print("Number of bags on this is" + str(number_of_bags))
import fileinput
line = line[i]
if number_of_bags.is_integer():
with fileinput.FileInput('CoinCount.txt',inplace=True) as fileobj:
for line in fileobj:
x = line.split(',')
for w, word in enumerate(x):
if w == 3 and word == 'N':
print(line[i].replace('N', 'Y'), end='')
i = i + 1
else:
i = i + 1
else:
with fileinput.FileInput('CoinCount.txt',inplace=True) as fileobj:
for line in fileobj:
x = line.split(',')
for w, word in enumerate(x):
if w == 3 and word == 'Y':
print(line[i].replace('Y', 'N'), end='')
i = i + 1
else:
i = i + 1
and finally the thing Im having issues with, the scan function.
the issue is specifically within the last few lines of code here (the replacement part):
import fileinput
if number_of_bags.is_integer():
target, replacement = ('N', 'Y')
else:
target, replacement = ('Y', 'N')
with fileinput.FileInput('CoinCount.txt', inplace=True) as fileobj:
for i, line in enumerate(fileobj):
words = line.rstrip().split(',')
if line.words[3] == target:
line.words[3] = replacement
print(','.join(words))
i = i + 1
f = fileobj.lineno() # Number of lines processed.
print(f'Done, {f} lines processed')
I basically have created a function which goes down each line and calculates the next line down until there aren't anymore lines, the issue with the last part is that I am unable to replace the actual txt file, If I were to run this program right now the result would be a completely blank page. I know that the fix is most likely a simple but tricky discovery but It is really bothering me as this is all that is needed for my program to be complete.
I understand the majority of the coding used but I am very new to fileinput, I want to be able to go from each line and replace the final segment if the segments name (i.e "Y" or "N") given is inaccurate to the actual legitimate segment name as Y is for true and N is for false. Please help, I tried to make sure this question was as easily understandable as possible, please make your example relatable to my program

As far as I understood, the problem is whether the calculation of the weight is correct or not. So just create another file instead of using fileinput. Do you really need it ?
test.csv
Abena,5p,325.00,Y
Malcolm,1p,3356.00,Y
Read the csv and assign some header names
Remove the last column, we don't care if it's correct or not, we will calculate the result anyways
Gather your calculation function in one method, we will apply this to every row
Apply function to every row, if it's correct write "Y" else write "N"
Truncate the whole file and write it over
import pandas as pd
with open("test.csv", "r+") as f:
df = pd.read_csv(f, names=["name", "coin", "weight", "res"])
del df["res"]
def calculate(row):
if row["coin"] == "5p":
return "Y" if 3.25 * 100 == row["weight"] else "N"
elif row["coin"] == "1p":
return "Y" if 3.56 * 100 == row["weight"] else "N"
df["res"] = df.apply(lambda row: calculate(row), axis=1)
f.seek(0)
f.truncate()
df.to_csv(f, index=False, header=False)
test.csv
Abena,5p,325.0,Y
Malcolm,1p,3356.0,N

Python 3 equivalent to code involving tuples passed to a function

I have a code which calculates the movie similarities given a dataset, which is ratings.dat and movies.dat. The code however, is written in python 2.7.
I tried converting the code to python-3 but was unable to get the desired results. Need some expert help to review if there are any mistakes in the code.
Below is the code is the code area which I need to convert to python 3:
def makePairs((user, ratings)):
(movie1, rating1) = ratings[0]
(movie2, rating2) = ratings[1]
return ((movie1, movie2), (rating1, rating2))
def filterDuplicates( (userID, ratings) ):
(movie1, rating1) = ratings[0]
(movie2, rating2) = ratings[1]
return movie1 < movie2
and also this
# Filter for movies with this sim that are "good" as defined by
# our quality thresholds above
filteredResults = moviePairSimilarities.filter(lambda((pair,sim)): \
(pair[0] == movieID or pair[1] == movieID) \
and sim[0] > scoreThreshold and sim[1] > coOccurenceThreshold)
# Sort by quality score.
results = filteredResults.map(lambda((pair,sim)): (sim, pair)).sortByKey(ascending = False).take(10)
the complete code as follows
spark-submit mycodefile.py 50
here's the code in python 2.7
import sys
from pyspark import SparkConf, SparkContext
from math import sqrt
def loadMovieNames():
movieNames = {}
with open("movies.dat") as f:
for line in f:
fields = line.split("::")
movieNames[int(fields[0])] = fields[1].decode('ascii', 'ignore')
return movieNames
def makePairs((user, ratings)):
(movie1, rating1) = ratings[0]
(movie2, rating2) = ratings[1]
return ((movie1, movie2), (rating1, rating2))
def filterDuplicates( (userID, ratings) ):
(movie1, rating1) = ratings[0]
(movie2, rating2) = ratings[1]
return movie1 < movie2
def computeCosineSimilarity(ratingPairs):
numPairs = 0
sum_xx = sum_yy = sum_xy = 0
for ratingX, ratingY in ratingPairs:
sum_xx += ratingX * ratingX
sum_yy += ratingY * ratingY
sum_xy += ratingX * ratingY
numPairs += 1
numerator = sum_xy
denominator = sqrt(sum_xx) * sqrt(sum_yy)
score = 0
if (denominator):
score = (numerator / (float(denominator)))
return (score, numPairs)
conf = SparkConf()
sc = SparkContext(conf = conf)
print("\nLoading movie names...")
nameDict = loadMovieNames()
data = sc.textFile("ratings.dat")
# Map ratings to key / value pairs: user ID => movie ID, rating
ratings = data.map(lambda l: l.split("::")).map(lambda l: (int(l[0]), (int(l[1]), float(l[2]))))
# Emit every movie rated together by the same user.
# Self-join to find every combination.
ratingsPartitioned = ratings.partitionBy(100)
joinedRatings = ratingsPartitioned.join(ratingsPartitioned)
# At this point our RDD consists of userID => ((movieID, rating), (movieID, rating))
# Filter out duplicate pairs
uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)
# Now key by (movie1, movie2) pairs.
moviePairs = uniqueJoinedRatings.map(makePairs).partitionBy(100)
# We now have (movie1, movie2) => (rating1, rating2)
# Now collect all ratings for each movie pair and compute similarity
moviePairRatings = moviePairs.groupByKey()
# We now have (movie1, movie2) = > (rating1, rating2), (rating1, rating2) ...
# Can now compute similarities.
moviePairSimilarities = moviePairRatings.mapValues(computeCosineSimilarity).persist()
# Save the results if desired
moviePairSimilarities.sortByKey()
moviePairSimilarities.saveAsTextFile("movie-sims")
# Extract similarities for the movie we care about that are "good".
if (len(sys.argv) > 1):
scoreThreshold = 0.97
coOccurenceThreshold = 1000
movieID = int(sys.argv[1])
# Filter for movies with this sim that are "good" as defined by
# our quality thresholds above
filteredResults = moviePairSimilarities.filter(lambda((pair,sim)): \
(pair[0] == movieID or pair[1] == movieID) \
and sim[0] > scoreThreshold and sim[1] > coOccurenceThreshold)
# Sort by quality score.
results = filteredResults.map(lambda((pair,sim)): (sim, pair)).sortByKey(ascending = False).take(10)
print("Top 10 similar movies for " + nameDict[movieID])
for result in results:
(sim, pair) = result
# Display the similarity result that isn't the movie we're looking at
similarMovieID = pair[0]
if (similarMovieID == movieID):
similarMovieID = pair[1]
print(nameDict[similarMovieID] + "\tscore: " + str(sim[0]) + "\tstrength: " + str(sim[1]))
any help is much appreciated.
Regard
what I have already done is converting this code to a python 3 equivalent code as follows, but unable to get the desired results.
import sys
from pyspark import SparkConf, SparkContext
from math import sqrt
def loadMovieNames():
movieNames = {}
with open("movies.dat") as f:
for line in f:
fields = line.split("::")
movieNames[int(fields[0])] = fields[1] #.decode('ascii', 'ignore')
return movieNames
def makePairs(*ratings):
for t in ratings:
(movie1, rating1) = t[1][0]
(movie2, rating2) = t[1][1]
return ((movie1, movie2), (rating1, rating2))
def filterDuplicates(*ratings):
for t in ratings:
(movie1, rating1) = t[1][0]
(movie2, rating2) = t[1][1]
return movie1 < movie2
def computeCosineSimilarity(ratingPairs):
numPairs = 0
sum_xx = sum_yy = sum_xy = 0
for ratingX, ratingY in ratingPairs:
sum_xx += ratingX * ratingX
sum_yy += ratingY * ratingY
sum_xy += ratingX * ratingY
numPairs += 1
numerator = sum_xy
denominator = sqrt(sum_xx) * sqrt(sum_yy)
score = 0
if (denominator):
score = (numerator / (float(denominator)))
return (score, numPairs)
conf = SparkConf().setMaster("local[*]").setAppName("MovieSimilarities")
sc = SparkContext(conf = conf)
print("\nLoading movie names...")
nameDict = loadMovieNames()
print("\nLoading movie ratings...")
data = sc.textFile("ratings100.dat")
print("\nDone..")
# Map ratings to key / value pairs: user ID => movie ID, rating
ratings = data.map(lambda l: l.split("::")).map(lambda l: (int(l[0]), (int(l[1]), float(l[2]))))
# Emit every movie rated together by the same user.
# Self-join to find every combination.
ratingsPartitioned = ratings.partitionBy(100)
joinedRatings = ratingsPartitioned.join(ratingsPartitioned)
#joinedRatings = ratings.join(ratings)
# At this point our RDD consists of userID => ((movieID, rating), (movieID, rating))
# Filter out duplicate pairs
uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)
# Now key by (movie1, movie2) pairs.
moviePairs = uniqueJoinedRatings.map(makePairs).partitionBy(100)
# We now have (movie1, movie2) => (rating1, rating2)
# Now collect all ratings for each movie pair and compute similarity
moviePairRatings = moviePairs.groupByKey()
# We now have (movie1, movie2) = > (rating1, rating2), (rating1, rating2) ...
# Can now compute similarities.
moviePairSimilarities = moviePairRatings.mapValues(computeCosineSimilarity).persist()
# Save the results if desired
moviePairSimilarities.sortByKey()
moviePairSimilarities.saveAsTextFile("movie-sims")
# Extract similarities for the movie we care about that are "good".
if (len(sys.argv) > 1):
scoreThreshold = 0.9
coOccurenceThreshold = 1000
movieID = int(sys.argv[1])
# Filter for movies with this sim that are "good" as defined by
# our quality thresholds above
filteredResults = moviePairSimilarities.filter(lambda pairSim: (pairSim[0][0] == movieID or pairSim[0][1] == movieID) and pairSim[1][0] > scoreThreshold and pairSim[1][1] > coOccurenceThreshold)
# Sort by quality score.
results = filteredResults.map(lambda pairSim: (pairSim[1], pairSim[0])).sortByKey(ascending = False).take(10)
print("Top 10 similar movies for " + str(nameDict[movieID]))
for result in results:
(sim, pair) = result
# Display the similarity result that isn't the movie we're looking at
similarMovieID = pair[0]
if (similarMovieID == movieID):
similarMovieID = pair[1]
print(nameDict[similarMovieID] + "\tscore: " + str(sim[0]) + "\tstrength: " + str(sim[1]))
Below is expected result, which should show top 10 similar movie results.
Top 10 similar movies for Wizard of Oz, The (1939)
Toy Story (1995) score: 661 strength: 1545
Some Other Movie score: 594 strength: 720
Another Movie score: 2018 strength: 2804

def f(*tuplex) is not the same as def f((x, y)); it is (more or less) the same as def f(x, y). That is, the first (py3) function receives a list of non-keyword argument, the second (py2) a single tuple argument. Since you're passing a single-element (which happens to be a tuple), tuplex will be a tuple of one element (and the resulting for t in tuplex will only iterate once). You should make it def(xy), where xy will be your (x, y) tuple.
Your Python 2 code:
def makePairs((user, ratings)):
(movie1, rating1) = ratings[0]
(movie2, rating2) = ratings[1]
return ((movie1, movie2), (rating1, rating2))
The actual compatible Python 3 code:
def makePairs(user_ratings):
_, ratings = user_ratings
(movie1, rating1) = ratings[0]
(movie2, rating2) = ratings[1]
return ((movie1, movie2), (rating1, rating2))
As also mentioned somewhere in the comments, you can replace this whole function by a simple zip call, for example:
>>> a = (('movie1', 'rating1'), ('movie2', 'rating2'))
>>> list(zip(*a))
[('movie1', 'movie2'), ('rating1', 'rating2')]
(you don't need list(...) if you just need to return an iterator, but that doesn't show the actual contents on the command line. so leave out the call to list(...) unless you get an error about a "zip object" in your actual code.)
The unfortunate parts here are that
- the map method where you use makePairs passes only a function, so you can't specify the asterisk.
- you need to get rid of the first argument, user.
You could probably use the following:
moviePairs = uniqueJoinedRatings.map(lambda x: zip(*x[1])).partitionBy(100)
(untested)
That gets rid of the complete makePairs function, at the cost of some clarity.
Last tidbit: make_pairs follows the style guide; makePairs is not Python style. As goes for all other names in your code. Since you mentioned the word review at the top of your question (but that's probably more a matter for Code Review.

python is inexplicably shortening the step size with each iteration of a sliding window analysis

I am working on a program that estimates the statistic Tajima's D in a series of sliding windows across a chromosome. The chromosome itself is also divided into a number of different regions with (hopefully) functional significance. The sliding window analysis is performed by my script on each region.
At the start of the program, I define the size of the sliding windows and the size of the steps that move from one window to the next. I import a file which contains the coordinates for each different chromosomal region, and import another file which contains all the SNP data I am working with (this is read line-by-line, as it is a large file). The program loops through the list of chromosomal locations. For each location, it generates an index of steps and windows for the analysis, partitions the SNP data into output files (corresponding with the steps), calculates key statistics for each step file, and combines these statistics to estimate Tajima's D for each window.
The program works well for small files of SNP data. It also works well for the first iteration over the first chromosomal break point. However, for large files of SNP data, the step size in the analysis is inexplicably decreased as the program iterates over each chromosomal regions. For the first chromosomal regions, the step size is 2500 nucleotides (this is what it is suppose to be). For the second chromosome segment, however, the step size is 1966, and for the third it is 732.
If anyone has any suggestions at to why this might be the case, please let me know. I am especially stumped as this program seems to work size for small files but not for larger ones.
My code is below:
import sys
import math
import fileinput
import shlex
import string
windowSize = int(500)
stepSize = int(250)
n = int(50) #number of individuals in the anaysis
SNP_file = open("SNPs-1.txt",'r')
SNP_file.readline()
breakpoints = open("C:/Users/gwilymh/Desktop/Python/Breakpoint coordinates.txt", 'r')
breakpoints = list(breakpoints)
numSegments = len(breakpoints)
# Open a file to store the Tajima's D results:
outputFile = open("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/Tajima's D estimates.txt", 'a')
outputFile.write(str("segmentNumber\tchrSegmentName\tsegmentStart\tsegmentStop\twindowNumber\twindowStart\twindowStop\tWindowSize\tnSNPs\tS\tD\n"))
#Calculating parameters a1, a2, b1, b2, c1 and c2
numPairwiseComparisons=n*((n-1)/2)
b1=(n+1)/(3*(n-1))
b2=(2*(n**2+n+3))/(9*n*(n-1))
num=list(range(1,n)) # n-1 values as a list
i=0
a1=0
for i in num:
a1=a1+(1/i)
i=i+1
j=0
a2=0
for j in num:
a2=a2+(1/j**2)
j=j+1
c1=(b1/a1)-(1/a1**2)
c2=(1/(a1**2+a2))*(b2 - ((n+2)/(a1*n))+ (a2/a1**2) )
counter6=0
#For each segment, assign a number and identify the start and stop coodrinates and the segment name
for counter6 in range(counter6,numSegments):
segment = shlex.shlex(breakpoints[counter6],posix = True)
segment.whitespace += '\t'
segment.whitespace_split = True
segment = list(segment)
segmentName = segment[0]
segmentNumber = int(counter6+1)
segmentStartPos = int(segment[1])
segmentStopPos = int(segment[2])
outputFile1 = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_Count of SNPs and mismatches per step.txt")%(str(segmentNumber),str(segmentName))), 'a')
#Make output files to index the lcoations of each window within each segment
windowFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_windowFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'a')
k = segmentStartPos - 1
windowNumber = 0
while (k+1) <=segmentStopPos:
windowStart = k+1
windowNumber = windowNumber+1
windowStop = k + windowSize
if windowStop > segmentStopPos:
windowStop = segmentStopPos
windowFileIndex.write(("%s\t%s\t%s\n")%(str(windowNumber),str(windowStart),str(windowStop)))
k=k+stepSize
windowFileIndex.close()
# Make output files for each step to export the corresponding SNP data into + an index of these output files
stepFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_stepFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'a')
i = segmentStartPos-1
stepNumber = 0
while (i+1) <= segmentStopPos:
stepStart = i+1
stepNumber = stepNumber+1
stepStop = i+stepSize
if stepStop > segmentStopPos:
stepStop = segmentStopPos
stepFile = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_step_%s.txt")%(str(segmentNumber),str(segmentName),str(stepNumber))), 'a')
stepFileIndex.write(("%s\t%s\t%s\n")%(str(stepNumber),str(stepStart),str(stepStop)))
i=i+stepSize
stepFile.close()
stepFileIndex.close()
# Open the index file for each step in current chromosomal segment
stepFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_stepFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'r')
stepFileIndex = list(stepFileIndex)
numSteps = len(stepFileIndex)
while 1:
currentSNP = SNP_file.readline()
if not currentSNP: break
currentSNP = shlex.shlex(currentSNP,posix=True)
currentSNP.whitespace += '\t'
currentSNP.whitespace_split = True
currentSNP = list(currentSNP)
SNPlocation = int(currentSNP[0])
if SNPlocation > segmentStopPos:break
stepIndexBin = int(((SNPlocation-segmentStartPos-1)/stepSize)+1)
#print(SNPlocation, stepIndexBin)
writeFile = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_step_%s.txt")%(str(segmentNumber),str(segmentName),str(stepIndexBin))), 'a')
writeFile.write((("%s\n")%(str(currentSNP[:]))))
writeFile.close()
counter3=0
for counter3 in range(counter3,numSteps):
# open up each step in the list of steps across the chromosomal segment:
L=shlex.shlex(stepFileIndex[counter3],posix=True)
L.whitespace += '\t'
L.whitespace_split = True
L=list(L)
#print(L)
stepNumber = int(L[0])
stepStart = int(L[1])
stepStop = int(L[2])
stepSize = int(stepStop-(stepStart-1))
#Now open the file of SNPs corresponding with the window in question and convert it into a list:
currentStepFile = open(("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_step_%s.txt")%(str(segmentNumber),str(segmentName),str(counter3+1)),'r')
currentStepFile = list(currentStepFile)
nSNPsInCurrentStepFile = len(currentStepFile)
print("number of SNPs in this step is:", nSNPsInCurrentStepFile)
#print(currentStepFile)
if nSNPsInCurrentStepFile == 0:
mismatchesPerSiteList = [0]
else:
# For each line of the file, estimate the per site parameters relevent to Tajima's D
mismatchesPerSiteList = list()
counter4=0
for counter4 in range(counter4,nSNPsInCurrentStepFile):
CountA=0
CountG=0
CountC=0
CountT=0
x = counter4
lineOfData = currentStepFile[x]
counter5=0
for counter5 in range(0,len(lineOfData)):
if lineOfData[counter5]==("A" or "a"): CountA=CountA+1
elif lineOfData[counter5]==("G" or "g"): CountG=CountG+1
elif lineOfData[counter5]==("C" or "c"): CountC=CountC+1
elif lineOfData[counter5]==("T" or "t"): CountT=CountT+1
else: continue
AxG=CountA*CountG
AxC=CountA*CountC
AxT=CountA*CountT
GxC=CountG*CountC
GxT=CountG*CountT
CxT=CountC*CountT
NumberMismatches = AxG+AxC+AxT+GxC+GxT+CxT
mismatchesPerSiteList=mismatchesPerSiteList+[NumberMismatches]
outputFile1.write(str(("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n")%(segmentNumber, segmentName,stepNumber,stepStart,stepStop,stepSize,nSNPsInCurrentStepFile,sum(mismatchesPerSiteList))))
outputFile1.close()
windowFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_windowFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'r')
windowFileIndex = list(windowFileIndex)
numberOfWindows = len(windowFileIndex)
stepData = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_Count of SNPs and mismatches per step.txt")%(str(segmentNumber),str(segmentName))), 'r')
stepData = list(stepData)
numberOfSteps = len(stepData)
counter = 0
for counter in range(counter, numberOfWindows):
window = shlex.shlex(windowFileIndex[counter], posix = True)
window.whitespace += "\t"
window.whitespace_split = True
window = list(window)
windowNumber = int(window[0])
firstCoordinateInCurrentWindow = int(window[1])
lastCoordinateInCurrentWindow = int(window[2])
currentWindowSize = lastCoordinateInCurrentWindow - firstCoordinateInCurrentWindow +1
nSNPsInThisWindow = 0
nMismatchesInThisWindow = 0
counter2 = 0
for counter2 in range(counter2,numberOfSteps):
step = shlex.shlex(stepData[counter2], posix=True)
step.whitespace += "\t"
step.whitespace_split = True
step = list(step)
lastCoordinateInCurrentStep = int(step[4])
if lastCoordinateInCurrentStep < firstCoordinateInCurrentWindow: continue
elif lastCoordinateInCurrentStep <= lastCoordinateInCurrentWindow:
nSNPsInThisStep = int(step[6])
nMismatchesInThisStep = int(step[7])
nSNPsInThisWindow = nSNPsInThisWindow + nSNPsInThisStep
nMismatchesInThisWindow = nMismatchesInThisWindow + nMismatchesInThisStep
elif lastCoordinateInCurrentStep > lastCoordinateInCurrentWindow: break
if nSNPsInThisWindow ==0 :
S = 0
D = 0
else:
S = nSNPsInThisWindow/currentWindowSize
pi = nMismatchesInThisWindow/(currentWindowSize*numPairwiseComparisons)
print(nSNPsInThisWindow,nMismatchesInThisWindow,currentWindowSize,S,pi)
D = (pi-(S/a1))/math.sqrt(c1*S + c2*S*(S-1/currentWindowSize))
outputFile.write(str(("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n")%(segmentNumber,segmentName,segmentStartPos,segmentStopPos,windowNumber,firstCoordinateInCurrentWindow,lastCoordinateInCurrentWindow,currentWindowSize,nSNPsInThisWindow,S,D)))

A quick search shows that you do change your stepSize on line 110:
stepStart = int(L[1])
stepStop = int(L[2])
stepSize = int(stepStop-(stepStart-1))
stepStop and stepStart appear to depend on your files' contents, so we can't debug it further.

getting all of the currecny values out of a text file

im having trouble with getting my python code to read through a text file and add together all of the monetary values. the code seemed to be working fine on my pc, but as soon as i transferred the file to my mac it gave me a whole slew of errors. here is the code
#!usr/bin/python
import sys
def findnum(x):
list = x.split(' ')
index = 0
listindex = -1
numlist = []
sum = 0
for w in list:
if ((w.strip('. n,')).isalpha() != True and w[0].isalpha() != True and w[-2].isdigit() == True):
numlist.append(w)
listindex += 1
while listindex >= 0:
sum += float(numlist[listindex].strip('$ n.'))
listindex -= 1
return sum
def main():
text = open(sys.argv[1])
x = text.readline()
sum = 0
if len(x) > 0:
findnum(x)
while len(x) > 0:
sum += findnum(x)
x = text.readline()
print '{0:.2f}'.format(sum)
if __name__ == '__main__':
main()
here is the text
This is your invoice from the ACME materials
company. You received 50lbs of sand at a
cost of $40. The brick we delivered is 70.5
for the 75Kg. In addition, we delivered 30yards
of sod for $200.00. Delivery charge is $35.
so i need to add 40 + 70.5 + 200 +35
i keep getting a index out of range error..
anyone think they can help me out?

import re
print re.findall('(\$\d+(?:\.\d{2})?)', x)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Separating String into Words and Recursion - python

Related

How to add running total using complex if statement in Python

How do you replace segments in a line using fileinput

Python 3 equivalent to code involving tuples passed to a function

python is inexplicably shortening the step size with each iteration of a sliding window analysis

getting all of the currecny values out of a text file

Categories

Resources