I used random module in my ptoject. Im not a programmer, but i need to do a script which will create a randomized database from another database. In my case, I need the script to select random words lined up in a column from the "wordlist.txt" file, and then also absolutely randomly line them up in a line of 12 words, write them to another file (for example, "result1.txt") and switch to a new line, and so many times, ad infinitum. Everything seems to be working, but I noticed that it adds absolutely all words, except for words consisting of 8 letters.
And also i want to increase the perfomance of this code.
Code:
import random
# Initial wordlists
fours = []
fives = []
sixes = []
sevens = []
eights = []
# Fill above lists with corresponding word lengths from wordlist
with open('wordlist.txt') as wordlist:
for line in wordlist:
if len(line) == 4:
fours.append(line.strip())
elif len(line) == 5:
fives.append(line.strip())
elif len(line) == 6:
sixes.append(line.strip())
elif len(line) == 7:
sevens.append(line.strip())
elif len(line) == 8:
eights.append(line.strip())
# Create new lists and fill with number of items in fours
fivesLess = []
sixesLess = []
sevensLess = []
eightsLess = []
fivesCounter = 0
while fivesCounter < len(fours):
randFive = random.choice(fives)
if randFive not in fivesLess:
fivesLess.append(randFive)
fivesCounter += 1
sixesCounter = 0
while sixesCounter < len(fours):
randSix = random.choice(sixes)
if randSix not in sixesLess:
sixesLess.append(randSix)
sixesCounter += 1
sevensCounter = 0
while sevensCounter < len(fours):
randSeven = random.choice(sevens)
if randSeven not in sevensLess:
sevensLess.append(randSeven)
sevensCounter += 1
eightsCounter = 0
while eightsCounter < len(fours):
randEight = random.choice(eights)
if randEight not in eightsLess:
eightsLess.append(randEight)
eightsCounter += 1
choices = [eights]
# Generate n number of seeds and print
seedCounter = 0
while seedCounter < 1:
seed = []
while len(seed) < 12:
wordLengthChoice = random.choice(choices)
wordChoice = random.choice(wordLengthChoice)
seed.append(wordChoice)
seedCounter += 0
with open("result1.txt", "a") as f:
f.write(' '.join(seed))
f.write('\n')
If I'm understanding you correctly, something like
import random
from collections import defaultdict
words_by_length = defaultdict(list)
def generate_line(*, n_words, word_length):
return ' '.join(random.choice(words_by_length[word_length]) for _ in range(n_words))
with open('wordlist.txt') as wordlist:
for line in wordlist:
line = line.strip()
words_by_length[len(line)].append(line)
for x in range(10):
print(generate_line(n_words=12, word_length=8))
should be enough – you can just use a single dict-of-lists to contain all of the words, no need for separate variables. (Also, I suspect your original bug stemmed from not strip()ing the line before looking at its length.)
If you need a single line to never repeat a word, you'll want
def generate_line(*, n_words, word_length):
return ' '.join(random.sample(words_by_length[word_length], n_words))
instead.
Related
I am trying to generate combination of ID's
Input: cid = SPARK
oupout: list of all the comibnations as below, position of each element should be constant. I am a beginner in python any help here is much appreciated.
'S****'
'S***K'
'S**R*'
'S**RK'
'S*A**'
'S*A*K'
'S*AR*'
'S*ARK'
'SP***'
'SP**K'
'SP*R*'
'SP*RK'
'SPA**'
'SPA*K'
'SPAR*'
'SPARK'
I tried below, I need a dynamic code:
cid = 'SPARK'
# print(cid.replace(cid[1],'*'))
# cu_len = lenth of cid [SPARK] here which is 5
# com_stars = how many stars i.e '*' or '**'
def cubiod_combo_gen(cu_len, com_stars, j_ite, i_ite):
cubiodList = []
crange = cu_len
i = i_ite #2 #3
j = j_ite #1
# com_stars = ['*','**','***','****']
while( i <= crange):
# print(j,i)
if len(com_stars) == 1:
x = len(com_stars)
n_cid = cid.replace(cid[j:i],com_stars)
i += x
j += x
cubiodList.append(n_cid)
elif len(com_stars) == 2:
x = len(com_stars)
n_cid = cid.replace(cid[j:i],com_stars)
i += x
j += x
cubiodList.append(n_cid)
elif len(com_stars) == 3:
x = len(com_stars)
n_cid = cid.replace(cid[j:i],com_stars)
i += x
j += x
cubiodList.append(n_cid)
return cubiodList
#print(i)
#print(n_cid)
# for item in cubiodList:
# print(item)
print(cubiod_combo_gen(5,'*',1,2))
print(cubiod_combo_gen(5,'**',1,3))
For every character in your given string, you can represent it as a binary string, using a 1 for a character that stays the same and a 0 for a character to replace with an asterisk.
def cubiod_combo_gen(string, count_star):
str_list = [char0 for char0 in string] # a list with the characters of the string
itercount = 2 ** (len(str_list)) # 2 to the power of the length of the input string
results = []
for config in range(itercount):
# return a string of i in binary representation
binary_repr = bin(config)[2:]
while len(binary_repr) < len(str_list):
binary_repr = '0' + binary_repr # add padding
# construct a list with asterisks
i = -1
result_list = str_list.copy() # soft copy, this made me spend like 10 minutes debugging lol
for char in binary_repr:
i += 1
if char == '0':
result_list[i] = '*'
if char == '1':
result_list[i] = str_list[i]
# now we have a possible string value
if result_list.count('*') == count_star:
# convert back to string and add to list of accepted strings
result = ''
for i in result_list:
result = result + i
results.append(result)
return results
# this function returns the value, so you have to use `print(cubiod_combo_gen(args))`
# comment this stuff out if you don't want an interactive user prompt
string = input('Enter a string : ')
count_star = input('Enter number of stars : ')
print(cubiod_combo_gen(string, int(count_star)))
It iterates through 16 characters in about 4 seconds and 18 characters in about 17 seconds. Also you made a typo on "cuboid" but I left the original spelling
Enter a string : DPSCT
Enter number of stars : 2
['**SCT', '*P*CT', '*PS*T', '*PSC*', 'D**CT', 'D*S*T', 'D*SC*', 'DP**T', 'DP*C*', 'DPS**']
As a side effect of this binary counting, the list is ordered by the asterisks, where the earliest asterisk takes precedence, with next earliest asterisks breaking ties.
If you want a cumulative count like 1, 4, 5, and 6 asterisks from for example "ABCDEFG", you can use something like
star_counts = (1, 4, 5, 6)
string = 'ABCDEFG'
for i in star_counts:
print(cubiod_combo_gen(string, star_counts))
If you want the nice formatting you have in your answer, try adding this block at the end of your code:
def formatted_cuboid(string, count_star):
values = cubiod_combo_gen(string, count_star)
for i in values:
print(values[i])
I honestly do not know what your j_ite and i_ite are, but it seems like they have no use so this should work. If you still want to pass these arguments, change the first line to def cubiod_combo_gen(string, count_star, *args, **kwargs):
I am not sure what com_stars does, but to produce your sample output, the following code does.
def cuboid_combo(cid):
fill_len = len(cid)-1
items = []
for i in range(2 ** fill_len):
binary = f'{i:0{fill_len}b}'
#print(binary, 'binary', 'num', i)
s = cid[0]
for idx, bit in enumerate(binary,start=1):
if bit == '0':
s += '*'
else: # 'bit' == 1
s += cid[idx]
items.append(s)
return items
#cid = 'ABCDEFGHI'
cid = 'DPSCT'
result = cuboid_combo(cid)
for item in result:
print(item)
Prints:
D****
D***T
D**C*
D**CT
D*S**
D*S*T
D*SC*
D*SCT
DP***
DP**T
DP*C*
DP*CT
DPS**
DPS*T
DPSC*
DPSCT
The substring has to be with 6 characters. The number I'm gettig is smaller than it should be.
first I've written code to get the sequences from a file, then put them in a dictionary, then written 3 nested for loops: the first iterates over the dictionary and gets a sequence in each iteration. The second takes each sequence and gets a substring with 6 characters from it. In each iteration, the second loop increases the index of the start of the string (the long sequence) by 1. The third loop takes each substring from the second loop, and counts how many times it appears in each string (long sequence).
I tried rewriting the code many times. I think I got very close. I checked if the loops actually do their iterations, and they do. I even checked manually to see if the counts for a substring in random sequences are the same as the program gives, and they are. Any idea? maybe a different approach? what debugger do you use for Python?
I added a file with 3 shortened sequences for testing. Maybe try smaller substring: say with 3 characters instead of 6: rep_len = 3
The code
matches = []
count = 0
final_count = 0
rep_len = 6
repeat = ''
pos = 0
seq_count = 0
seqs = {}
f = open(r"file.fasta")
# inserting each sequences from the file into a dictionary
for line in f:
line = line.rstrip()
if line[0] == '>':
seq_count += 1
name = seq_count
seqs[name] = ''
else:
seqs[name] += line
for key, seq in seqs.items(): # getting one sequence in each iteration
for pos in range(len(seq)): # setting an index and increasing it by 1 in each iteration
if pos <= len(seq) - rep_len: # checking no substring from the end of the sequence are selected
repeat = seq[pos:pos + rep_len] # setting a substring
if repeat not in matches: # checking if the substring was already scanned
matches.append(repeat) # adding the substring to previously checked substrings' list
for key1, seq2 in seqs.items(): # iterating over each sequence
count += seq2.count(repeat) # counting the substring's repetitions
if count > final_count: # if the count is greater than the previously saved greatest number
final_count = count # the new value is saved
count = 0
print('repetitions: ', final_count) # printing
sequences.fasta
The code is not very clear, so it is a bit difficult to debug. I suggest rewriting.
Anyway, I (currently) just noted one small mistake:
if pos < len(seq) - rep_len:
Should be
if pos <= len(seq) - rep_len:
Currently, the last character in each sequence is ignored.
EDIT:
Here some rewriting of your code that is clearer and might help you investigate the errors:
rep_len = 6
seq_count = 0
seqs = {}
filename = "dna2.txt"
# Extract the data into a dictionary
with open(filename, "r") as f:
for line in f:
line = line.rstrip()
if line[0] == '>':
seq_count += 1
name = seq_count
seqs[name] = ''
else:
seqs[name] += line
# Store all the information, so that you can reuse it later
counter = {}
for key, seq in seqs.items():
for pos in range(len(seq)-rep_len):
repeat = seq[pos:pos + rep_len]
if repeat in counter:
counter[repeat] += 1
else:
counter[repeat] = 1
# Sort the counter to have max occurrences first
sorted_counter = sorted(counter.items(), key = lambda item:item[1], reverse=True )
# Display the 5 max occurrences
for i in range(5):
key, rep = sorted_counter[i]
print("{} -> {}".format(key, rep))
# GCGCGC -> 11
# CCGCCG -> 11
# CGCCGA -> 10
# CGCGCG -> 9
# CGTCGA -> 9
It might be easier to use Counter from the collections module in Python. Also check out the NLTK library.
An example:
from collections import Counter
from nltk.util import ngrams
sequence = "cggttgcaatgagcgtcttgcacggaccgtcatgtaagaccgctacgcttcgatcaacgctattacgcaagccaccgaatgcccggctcgtcccaacctg"
def reps(substr):
"Counts repeats in a substring"
return sum([i for i in Counter(substr).values() if i>1])
def make_grams(sent, n=6):
"splits a sentence into n-grams"
return ["".join(seq) for seq in (ngrams(sent,n))]
grams = make_grams(sequence) # splits string into substrings
max_length = max(list(map(reps, grams))) # gets maximum repeat count
result = [dna for dna in grams if reps(dna) == max_length]
print(result)
Output: ['gcgtct', 'cacgga', 'acggac', 'tgtaag', 'agaccg', 'gcttcg', 'cgcaag', 'gcaagc', 'gcccgg', 'cccggc', 'gctcgt', 'cccaac', 'ccaacc']
And if the question is look for the string with the most repeated character:
repeat_count = [max(Counter(a).values()) for a in result] # highest character repeat count
result_dict = {dna:ct for (dna,ct) in zip(result, repeat_count)}
another_result = [dna for dna in result_dict.keys() if result_dict[dna] == max(repeat_count)]
print(another_result)
Output: ['cccggc', 'cccaac', 'ccaacc']
I want to divide a file in a two random halfs with python. I have a small script, but it did not divide exactly into 2. Any suggestions?
import random
fin = open("test.txt", 'rb')
f1out = open("test1.txt", 'wb')
f2out = open("test2.txt", 'wb')
for line in fin:
r = random.random()
if r < 0.5:
f1out.write(line)
else:
f2out.write(line)
fin.close()
f1out.close()
f2out.close()
The notion of randomness means that you will not be able to deterministically rely on the number to produce an equal amount of results below 0.5 and above 0.5.
You could use a counter and check if it is even or odd after shuffling all the lines in a list:
file_lines = [line for line in fin]
random.shuffle(file_lines)
counter = 0
for line in file_lines:
counter += 1
if counter % 2 == 0:
f1out.write(line)
else:
f2out.write(line)
You can use this pattern with any number (10 in this example):
counter = 0
for line in file_lines:
counter += 1
if counter % 10 == 0:
f1out.write(line)
elif counter % 10 == 1:
f2out.write(line)
elif counter % 10 == 2:
f3out.write(line)
elif counter % 10 == 3:
f4out.write(line)
elif counter % 10 == 4:
f5out.write(line)
elif counter % 10 == 5:
f6out.write(line)
elif counter % 10 == 6:
f7out.write(line)
elif counter % 10 == 7:
f8out.write(line)
elif counter % 10 == 8:
f9out.write(line)
else:
f10out.write(line)
random will not give you exactly half each time. If you flip a coin 10 times, you dont necessarily get 5 heads and 5 tails.
One approach would be to use the partitioning method described in Python: Slicing a list into n nearly-equal-length partitions, but shuffling the result beforehand.
import random
N_FILES = 2
out = [open("test{}.txt".format(i), 'wb') for i in range(min(N_FILES, n))]
fin = open("test.txt", 'rb')
lines = fin.readlines()
random.shuffle(lines)
n = len(lines)
size = n / float(N_FILES)
partitions = [ lines[int(round(size * i)): int(round(size * (i + 1)))] for i in xrange(n) ]
for f, lines in zip(out, partitions):
for line in lines:
f.write(line)
fin.close()
for f in out:
f.close()
The code above will split the input file into N_FILES (defined as a constant at the top) of approximately equal size, but never splitting beyond one line per file. Handling things this way would let you put this into a function that can take a variable number of files to split into without having to alter code for each case.
Here is my code I need it to read a line of text that just composes of y's a's and n's y meaning yes n meaning no a meaning abstain, I'm trying to add up the number of yes votes. The text file looks like this:
Aberdeenshire
yyynnnnynynyannnynynanynaanyna
Midlothian
nnnnynyynyanyaanynyanynnnanyna
Berwickshire
nnnnnnnnnnnnnnnnnnnnynnnnnynnnnny
here is my code:
def main():
file = open("votes.txt")
lines = file.readlines()
votes = 0
count = 0
count_all = 0
for m in range(1,len(lines),2):
line = lines[m]
for v in line:
if v == 'a':
votes += 1
elif v == 'y':
count_all += 1
count += 1
votes += 1
else:
count_all += 1
print("percentage:" + (str(count/count_all)))
print("Overall there were ", (count/count_all)," yes votes")
main()
First of all, you should note that your file.readlines() actually gives you the \n at the end of each line, which in your code will both be treated in the else block, so as no's:
>>> with open("votes.txt","r") as f:
... print(f.readlines())
...
['Aberdeenshire\n',
'yyynnnnynynyannnynynanynaanyna\n',
'Midlothian\n',
'nnnnynyynyanyaanynyanynnnanyna\n',
'Berwickshire\n',
'nnnnnnnnnnnnnnnnnnnnynnnnnynnnnny\n']
So that might explain why you don't find the good numbers...
Now, to make the code a bit more efficient, we could look into the count method of str, and maybe also get rid of those \n with a split rather than a readlines:
with open("votes.txt","r") as f:
full = f.read()
lines = full.split("\n")
votes = 0
a = 0
y = 0
n = 0
for m in range(1,len(lines),2):
line = lines[m]
votes += len(line) # I'm counting n's as well here
a += line.count("a")
y += line.count("y")
n += line.count("n")
print("Overall, there were " + str(100 * y / (y + n)) + "% yes votes.")
Hope that helped!
More or less pythonic one liner, it doesn't give you the votes for each person/city tho:
from collections import Counter
l = """Aberdeenshire
yyynnnnynynyannnynynanynaanyna
Midlothian
nnnnynyynyanyaanynyanynnnanyna
Berwickshire
nnnnnnnnnnnnnnnnnnnnynnnnnynnnnny"""
Counter([char for line in l.split('\n')[1::2] for char in line.strip()])
Returns:
Counter({'a': 11, 'n': 60, 'y': 22})
I have an input file:
3
PPP
TTT
QPQ
TQT
QTT
PQP
QQQ
TXT
PRP
I want to read this file and group these cases into proper boards.
To read the Count (no. of boards) i have code:
board = []
count =''
def readcount():
fp = open("input.txt")
for i, line in enumerate(fp):
if i == 0:
count = int(line)
break
fp.close()
But i don't have any idea of how to parse these blocks into List:
TQT
QTT
PQP
I tried using
def readboard():
fp = open('input.txt')
for c in (1, count): # To Run loop to total no. of boards available
for k in (c+1, c+3): #To group the boards into board[]
board[c].append(fp.readlines)
But its wrong way. I know basics of List but here i am not able to parse the file.
These boards are in line 2 to 4, 6 to 8 and so on. How to get them into Lists?
I want to parse these into Count and Boards so that i can process them further?
Please suggest
I don't know if I understand your desired outcome. I think you want a list of lists.
Assuming that you want boards to be:
[[data,data,data],[data,data,data],[data,data,data]], then you would need to define how to parse your input file... specifically:
line 1 is the count number
data is entered per line
boards are separated by white space.
If that is the case, this should parse your files correctly:
board = []
count = 0
currentBoard = 0
fp = open('input.txt')
for i,line in enumerate(fp.readlines()):
if i == 0:
count = int(i)
board.append([])
else:
if len(line[:-1]) == 0:
currentBoard += 1
board.append([])
else: #this has board data
board[currentBoard].append(line[:-1])
fp.close()
import pprint
pprint.pprint(board)
If my assumptions are wrong, then this can be modified to accomodate.
Personally, I would use a dictionary (or ordered dict) and get the count from len(boards):
from collections import OrderedDict
currentBoard = 0
board = {}
board[currentBoard] = []
fp = open('input.txt')
lines = fp.readlines()
fp.close()
for line in lines[1:]:
if len(line[:-1]) == 0:
currentBoard += 1
board[currentBoard] = []
else:
board[currentBoard].append(line[:-1])
count = len(board)
print(count)
import pprint
pprint.pprint(board)
If you just want to take specific line numbers and put them into a list:
line_nums = [3, 4, 5, 1]
fp = open('input.txt')
[line if i in line_nums for i, line in enumerate(fp)]
fp.close()