How to convert a text file to a dictionary in Python

I need to convert lines of different lengths to one dictionary. It's for player stats. The text file is formatted like below. I need to return a dictionary with each player's stats.
{Lebron James:(25,7,1),(34,5,6), Stephen Curry: (25,7,1),(34,5,6), Draymond Green: (25,7,1),(34,5,6)}
Data:
Lebron James
25,7,1
34,5,6
Stephen Curry
25,7,1
34,5,6
Draymond Green
25,7,1
34,5,6
I need help starting the code. So far I have code that removes the blank lines and splits each line into a list.
myfile = open("stats.txt", "r")
for line in myfile.readlines():
    if line.rstrip():
        line = line.replace(",", "")
        line = line.split()

I think this should do what you want:
data = {}
with open("myfile.txt", "r") as f:
    for line in f:
        # Skip empty lines
        line = line.rstrip()
        if len(line) == 0:
            continue
        toks = line.split(",")
        if len(toks) == 1:
            # New player, assumed to have no commas in the name
            player = toks[0]
            data[player] = []
        elif len(toks) == 3:
            data[player].append(tuple(int(tok) for tok in toks))
        else:
            raise ValueError  # or something more descriptive
The format is somewhat ambiguous, so we have to make some assumptions about what the names can be. I've assumed here that names can't contain commas, but you could relax that a bit if needed by trying to parse int,int,int and falling back to treating the line as a name if that parse fails.
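For example, here is a minimal sketch of that fallback idea, using the same assumed file name as above: a line that parses as comma-separated integers is treated as a stat line, and anything else is treated as a player name.
data = {}
player = None
with open("myfile.txt") as f:
    for line in f:
        line = line.rstrip()
        if not line:
            continue
        try:
            stats = tuple(int(tok) for tok in line.split(","))
        except ValueError:
            # not int,int,int -- assume it's a player name
            player = line
            data[player] = []
        else:
            data[player].append(stats)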

Here's a simple way to do this:
scores = {}
with open('stats.txt', 'r') as infile:
    i = 0
    for line in infile.readlines():
        if line.rstrip():
            if i % 3 != 0:
                t = tuple(int(n) for n in line.split(","))
                j = j + 1
                if j == 1:
                    score1 = t  # save for the next step
                if j == 2:
                    score = (score1, t)  # finalize the tuple
                    scores.update({name: score})  # add to the dictionary
            else:
                name = line[0:-1]  # trim \n and save the key
                j = 0  # start over
            i = i + 1  # increase the counter
print scores

Maybe something like this:
For Python 2.x
myfile = open("stats.txt","r")
lines = filter(None, (line.rstrip() for line in myfile))
dictionary = dict(zip(lines[0::3], zip(lines[1::3], lines[2::3])))
For Python 3.x
myfile = open("stats.txt","r")
lines = list(filter(None, (line.rstrip() for line in myfile)))
dictionary = dict(zip(lines[0::3], zip(lines[1::3], lines[2::3])))
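Note that with the slicing approach the scores stay as strings like '25,7,1' rather than the tuples of ints shown in the question. A small follow-up conversion, sketched here against the dictionary built above, would be:
dictionary = {
    name: tuple(tuple(int(n) for n in score.split(",")) for score in scores)
    for name, scores in dictionary.items()
}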

Related

From a .txt, how can I get the content of the line above a certain line

I have a .txt file which includes the following:
Karin
3543
Joe
2354
Bob
2019
I am able to find the maximum value of all integers and have the line in a variable, i, but the problem comes in when I try and find the contents of the line above the highest integer. For example, 3543 is the highest, so "Karin" would be stored in a variable. Any idea on how this would be done?
with open("r.txt", 'r') as f:
highestScore = 0
highestPlayer = ""
line_numbers = [1, 3, 5]
for i, line in enumerate(f):
if i in line_numbers:
if int(line) > int(highestScore):
highestScore = line
elif i > 5: #ammount of lines in .txt file
break
print(highestPlayer, str(highestScore))
The most efficient way should be a derivative of York's answer, something like this (without using pickle or json files):
with open('r.txt') as f:
    highestScore = 0
    highestPlayer = ""
    currentPlayer = ""
    for i, line in enumerate(f):
        if i % 2 == 0:  # even line --> player name
            currentPlayer = line
        else:  # odd line --> score
            if int(line) > highestScore:
                highestScore = int(line)
                highestPlayer = currentPlayer.strip()
print(highestPlayer, highestScore)
A shorter way is:
with open('r.txt') as f:
    lines = f.readlines()
number_per_name = {name.strip(): int(number) for name, number in zip(lines[::2], lines[1::2])}
print(max(number_per_name, key=number_per_name.get))
I think the best way to get consecutive lines from a file lined up horizontally is to use the builtin zip function (this works because the file handles returned by open are iterators):
with open("r.txt", "r") as file:
max_score = float("-inf")
for player, score in zip(file, file):
if max_score < (score := int(score)):
max_player, max_score = player.rstrip(), score
print(max_player, max_score)
If the sample text file you described is representative of the full file, you might want to keep track of whether the line number is even or odd instead of explicitly listing the line numbers where scores appear. You can record the name on each even line and compare the score on the following odd line; if it is larger than the highest score so far, overwrite your highestPlayer and highestScore variables.
As an additional note, the final elif statement you have there is also unnecessary as the loop will end once it runs out of lines in the text file.
Here's an example trying to keep the code as similar as possible to your current draft.
with open("r.txt", 'r') as f:
highestScore = 0
highestPlayer = ""
currentPlayer = ""
for i, line in enumerate(f):
# Modulo determines if the line number is even or odd
if i % 2 == 0:
currentPlayer = line
else:
if int(line) > int(highestScore):
highestScore = line
highestPlayer = currentPlayer
print(highestPlayer, str(highestScore))

How to keep appending lines into one list entry until a certain character?

I'm trying to append the multiple lines before a '>' character into one list entry so I can use it as a value in a dictionary. For example, I'm trying to make:
> 1
AAA
CCC
> 2
become AAACCC.
The code is below:
def parse_fasta(path):
    with open(path) as thefile:
        label = []
        sequences = []
        for k, line in enumerate(thefile):
            if line.startswith('>'):
                labeler = line.strip('>').strip('\n')
                label.append(labeler)
            else:
                seqfix = ''.join(line.strip('\n'))
                sequences.append(seqfix)
        dict_version = {k: v for k, v in zip(label, sequences)}
        return dict_version

parse_fasta('small.fasta')
You can create the dictionary as you go. Here is a method for doing that.
EDIT: removed defaultdict (so no modules)
from pprint import pprint

dict_version = {}
with open('fasta_sample.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>'):
            key = line[1:]
        else:
            if key in dict_version:
                dict_version[key] += line
            else:
                dict_version[key] = line
pprint(dict_version)
The sample file:
>1FN3:A|PDBID|CHAIN|SEQUENCE
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>5OKT:A|PDBID|CHAIN|SEQUENCE
MGSSHHHHHHSSGLVPRGSHMELRVGNRYRLGRKIGSGSFGDIYLGTDIAAGEEVAIKLECVKTKHPQLHIESKIYKMMQ
GGVGIPTIRWCGAEGDYNVMVMELLGPSLEDLFNFCSRKFSLKTVLLLADQMISRIEYIHSKNFIHRDVKPDNFLMGLGK
KGNLVYIIDFGLAKKYRDARTHQHIPYRENKNLTGTARYASINTHLGIEQSRRDDLESLGYVLMYFNLGSLPWQGLKAAT
KRQKYERISEKKMSTPIEVLCKGYPSEFATYLNFCRSLRFDDKPDYSYLRQLFRNLFHRQGFSYDYVFDWNMLK*
>2PAB:A|PDBID|CHAIN|SEQUENCE
GPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLTTEEQFVEGIYKVEIDTKSYWK
ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE*
>3IDP:B|PDBID|CHAIN|SEQUENCE
HHHHHHDRNRMKTLGRRDSSDDWEIPDGQITVGQRIGSGSFGTVYKGKWHGDVAVKMLNVTAPTPQQLQAFKNEVGVLRK
TRHVNILLFMGYSTKPQLAIVTQWCEGSSLYHHLHIIETKFEMIKLIDIARQTAQGMDYLHAKSIIHRDLKSNNIFLHED
LTVKIGDFGLATEKSRWSGSHQFEQLSGSILWMAPEVIRMQDKNPYSFQSDVYAFGIVLYELMTGQLPYSNINNRDQIIF
MVGRGYLSPDLSKVRSNCPKAMKRLMAECLKKKRDERPLFPQILASIELLARSLPKIHRS
>4QUD:A|PDBID|CHAIN|SEQUENCE
MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRN
LKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKIFNFFRGDRCRSLTGKPKLFII
QACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVN
RKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH
Pretty print of the dictionary created is:
{'1FN3:A|PDBID|CHAIN|SEQUENCE': 'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR',
'2PAB:A|PDBID|CHAIN|SEQUENCE': 'GPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLTTEEQFVEGIYKVEIDTKSYWKALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE*',
'3IDP:B|PDBID|CHAIN|SEQUENCE': 'HHHHHHDRNRMKTLGRRDSSDDWEIPDGQITVGQRIGSGSFGTVYKGKWHGDVAVKMLNVTAPTPQQLQAFKNEVGVLRKTRHVNILLFMGYSTKPQLAIVTQWCEGSSLYHHLHIIETKFEMIKLIDIARQTAQGMDYLHAKSIIHRDLKSNNIFLHEDLTVKIGDFGLATEKSRWSGSHQFEQLSGSILWMAPEVIRMQDKNPYSFQSDVYAFGIVLYELMTGQLPYSNINNRDQIIFMVGRGYLSPDLSKVRSNCPKAMKRLMAECLKKKRDERPLFPQILASIELLARSLPKIHRS',
'4QUD:A|PDBID|CHAIN|SEQUENCE': 'MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRNLKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKIFNFFRGDRCRSLTGKPKLFIIQACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVNRKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH',
'5OKT:A|PDBID|CHAIN|SEQUENCE': 'MGSSHHHHHHSSGLVPRGSHMELRVGNRYRLGRKIGSGSFGDIYLGTDIAAGEEVAIKLECVKTKHPQLHIESKIYKMMQGGVGIPTIRWCGAEGDYNVMVMELLGPSLEDLFNFCSRKFSLKTVLLLADQMISRIEYIHSKNFIHRDVKPDNFLMGLGKKGNLVYIIDFGLAKKYRDARTHQHIPYRENKNLTGTARYASINTHLGIEQSRRDDLESLGYVLMYFNLGSLPWQGLKAATKRQKYERISEKKMSTPIEVLCKGYPSEFATYLNFCRSLRFDDKPDYSYLRQLFRNLFHRQGFSYDYVFDWNMLK*'}
EDIT: To work the solution following your try:
from pprint import pprint

def parse_fasta(path):
    with open(path) as thefile:
        label = []
        sequences = ''
        total_seq = []
        for line in thefile:
            line = line.strip()
            if len(line) == 0:
                continue
            if line.startswith('>'):
                line = line.strip('>')
                label.append(line)
                if len(sequences) > 0:
                    total_seq.append(sequences)
                    sequences = ''
            else:
                sequences += line
        total_seq.append(sequences)
        dict_version = {k: v for k, v in zip(label, total_seq)}
        return dict_version

d = parse_fasta('fasta_sample.txt')
pprint(d)
You'll see I made some changes to get the correct output. I added a list, total_seq, to hold the full sequence for each header (you didn't have this, which was a problem in your solution). The join in your code wasn't doing anything, since the value it was given was already a single string, although you had the right idea. In the revised code the lines belonging to one header are accumulated into a single string of FASTA characters with +=.
I test for blank lines and continue if the line is blank (len(line) == 0).
The len(sequences) > 0 test checks whether any sequence lines have been seen yet; they won't have been on the first record, because the ID is read before any of its sequence lines.
After the for loop completes, it is necessary to add the last sequence
total_seq.append(sequences)
since all other sequences except the last are added to the total_seq when a new ID is detected.
I hope this explanation is helpful as it more closely follows your code.

How can I print lines of a file specified by a list of numbers in Python?

I open a dictionary file and pull specific lines; the lines are specified using a list, and at the end I need to print a complete sentence on one line.
I want to open a dictionary that has a word in each line
then print a sentence in one line with a space between the words:
N = ['19','85','45','14']
file = open("DICTIONARY", "r")
my_sentence = #?????????
print my_sentence
If your DICTIONARY is not too big (i.e. can fit your memory):
N = [19, 85, 45, 14]
with open("DICTIONARY", "r") as f:
    words = f.readlines()
my_sentence = " ".join([words[i].strip() for i in N])
EDIT: A small clarification, the original post didn't use space to join the words, I've changed the code to include it. You can also use ",".join(...) if you need to separate the words by a comma, or any other separator you might need. Also, keep in mind that this code uses zero-based line index so the first line of your DICTIONARY would be 0, the second would be 1, etc.
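If your N really holds 1-based line numbers as strings, as in your snippet, a hedged adaptation of the same idea would convert and shift the indexes first:
N = ['19', '85', '45', '14']
with open("DICTIONARY", "r") as f:
    words = f.readlines()
# convert the 1-based string indexes to 0-based ints
my_sentence = " ".join(words[int(i) - 1].strip() for i in N)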
UPDATE: If your dictionary is too big for your memory, or you just want to consume as little memory as possible (if that's the case, why would you go for Python in the first place? ;)), you can extract only the words you're interested in:
N = [19, 85, 45, 14]
words = {}
word_indexes = set(N)
counter = 0
with open("DICTIONARY", "r") as f:
    for line in f:
        if counter in word_indexes:
            words[counter] = line.strip()
        counter += 1
my_sentence = " ".join([words[i] for i in N])
You can use linecache.getline to fetch the specific lines you want; note that it takes 1-based line numbers:
import linecache

sentence = []
for line_number in N:
    # getline expects an int line number; N in the question holds strings
    word = linecache.getline('DICTIONARY', int(line_number))
    sentence.append(word.strip('\n'))
sentence = " ".join(sentence)
Here's a simple one with more basic approach:
n = ['2', '4', '7', '11']
file = open("DICTIONARY")
counter = 1  # start at 1 if you count lines in DICTIONARY from 1, else start at 0
output = ""
for line in file:
    line = line.rstrip()  # rstrip() deletes the trailing \n; without it every word prints on its own line
    if str(counter) in n:
        output += line + " "
    counter += 1
print output[:-1]  # slicing removes the trailing space after the last word (optional)

Problems using the readline method with nested while loops in Python

I am trying to write code that will take a .txt file containing words and their definitions and produce a dictionary of {'word1': ['definition1', 'definition2', ...]}. The .txt file is in the following format:
word1
definition1
definition2
(blank line)
word2
definition1
definition2
...
so far the body of the function I have written is as follows:
line = definition_file.readline()
dictx = {}
while line != '':
    key = line.strip()
    defs = []
    line = definition_file.readline()
    while line != '\n':
        defx = [line.strip()]
        defs += defx
        line = definition_file.readline()
    if key not in dictx:
        dictx[key] = defs
    return dictx
I quickly realized the problem with this code is that it will only return a dictionary with the very first word within it. I need a way to make the code loop so that it returns a dictionary with all the words + definitions. I was hoping to do this without using a break.
thanks!
This should do it:
from collections import defaultdict

d = defaultdict(list)
is_definition = False
with open('test.txt') as f:
    for line in f:
        line = line.strip().rstrip('\n')
        if line == '':  # blank line
            is_definition = False
            continue
        if is_definition:  # definition line
            d[word].append(line)
        else:  # word line
            word = line
            is_definition = True
This one-liner will also do the trick:
>>> tesaurus = open('tesaurus.txt').read()
>>> dict(map(lambda x: (x[0], x[1].split()), [term.split("\n", 1) for term in tesaurus.replace("\r", "").split("\n\n")]))
{'word1': ['definition1', 'definition2'], 'word3': ['def1', 'def2'], 'word2': ['definition1', 'definition2']}
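For readability, here is a hedged decomposition of the same idea, assuming each block of the file is one word line followed by at least one definition line:
text = open('tesaurus.txt').read().replace("\r", "")
blocks = [block.split("\n", 1) for block in text.split("\n\n") if block.strip()]
tesaurus_dict = {word: defs.split() for word, defs in blocks}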
Here's another possibility:
d = dict()
defs = list()
with open('test.txt') as infile:
    for line in infile:
        if not line.strip():  # blank line ends the current group
            if defs:
                d[defs[0]] = defs[1:]
                defs = list()
        else:
            defs.append(line.strip())
if defs:  # add the final group if the file doesn't end with a blank line
    d[defs[0]] = defs[1:]
Read the whole file
d = dict()
with open('file.txt') as f:
    stuff = f.read()
Split the file on blank lines.
word_defs = stuff.split('\n\n')
Iterate over the definition groups and split the word from the definitions.
for word_def in word_defs:
    word_def = word_def.split('\n')
    word = word_def[0]
    defs = word_def[1:]
    d[word] = defs
If you prefer something more functional/compact (same thing, just expressed differently): first an iterator that produces [word, def, def, ...] groups.
definition_groups = (thing.split('\n') for thing in stuff.split('\n\n'))
dict comprehension to build the dictionary
import operator
word = operator.itemgetter(0)
defs = operator.itemgetter(slice(1,None))
g = {word(group):defs(group) for group in definition_groups}
Here is my best answer that meets your criteria.
import sys

d = {}
with open(sys.argv[1], "r") as f:
    done = False
    while not done:
        word = f.readline().strip()
        done = not word
        line = True
        defs = []
        while line:
            line = f.readline().rstrip('\n')
            if line.strip():
                defs.append(line)
        if not done:
            d[word] = defs
print(d)
But I don't understand why you are trying to avoid using break. I think this code is clearer with break... the flow of control is simpler and we don't need as many variables. When word is an empty string, this code just breaks out (immediately stops what it is doing) and that is very easy to understand. You have to study the first code to make sure you know how it works when end-of-file is reached.
import sys

d = {}
with open(sys.argv[1], "r") as f:
    while True:
        word = f.readline().strip()
        defs = []
        if not word:
            break
        while True:
            line = f.readline().rstrip('\n')
            if not line:
                break
            defs.append(line)
        d[word] = defs
print(d)
But I think the best way to write this is to make a helper function that packages up the job of parsing out the definitions:
import sys

def _read_defs(f):
    while True:
        line = f.readline().rstrip('\n')
        if not line:
            break
        yield line

d = {}
with open(sys.argv[1], "r") as f:
    while True:
        word = f.readline().strip()
        if not word:
            break
        d[word] = list(_read_defs(f))
print(d)
The first one is trickier because it is avoiding the use of break. The others are simpler to understand, with two similar loops that have similar flow of control.

Python Beginning Program Dictionary and List Issue

Write a program that reads the contents of a random text file. The program should create a dictionary in which the keys are individual words found in the file and the values are the number of times each word appears.
How would I go about doing this?
def main():
    c = 0
    dic = {}
    words = set()
    inFile = open('text2', 'r')
    for line in inFile:
        line = line.strip()
        line = line.replace('.', '')
        line = line.replace(',', '')
        line = line.replace("'", '')  # strips the punctuation
        line = line.replace('"', '')
        line = line.replace(';', '')
        line = line.replace('?', '')
        line = line.replace(':', '')
        words = line.split()
        for x in words:
            for y in words:
                if x == y:
                    c += 1
            dic[x] = c
    print(dic)
    print(words)
    inFile.close()

main()
Sorry for the vague question. Never asked any questions here before. This is what I have so far. Also, this is the first ever programming I've done so I expect it to be pretty terrible.
with open('path/to/file') as infile:
# code goes here
That's how you open a file
for line in infile:
# code goes here
That's how you read a file line-by-line
line.strip().split()
That's how you split a line into (white-space separated) words.
some_dictionary['abcd']
That's how you access the key 'abcd' in some_dictionary.
Questions for you:
What does it mean if you can't access the key in a dictionary?
What error does that give you? Can you catch it with a try/except block?
How do you increment a value?
Is there some function that GETS a default value from a dict if the key doesn't exist?
For what it's worth, there's also a function that does almost exactly this, but since this is pretty obviously homework it won't fulfill your assignment requirements anyway. It's in the collections module. If you're interested, try and figure out what it is :)
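As a hedged illustration of the counting pattern those hints point toward (the file name is a placeholder, and this is one possible approach rather than the only one):
counts = {}
with open('text2', 'r') as infile:  # placeholder file name
    for line in infile:
        for word in line.strip().split():
            # dict.get supplies 0 when the word hasn't been seen yet
            counts[word] = counts.get(word, 0) + 1
print(counts)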
There are at least three different approaches for adding a new word to the dictionary and counting the number of occurrences in the file.
def add_element_check1(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 1
        else:
            my_dict[e] += 1

def add_element_check2(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 0
        my_dict[e] += 1

def add_element_except(my_dict, elements):
    for e in elements:
        try:
            my_dict[e] += 1
        except KeyError:
            my_dict[e] = 1

my_words = {}
with open('pathtomyfile.txt', 'r') as in_file:
    for line in in_file:
        words = [word.strip().lower() for word in line.strip().split()]
        add_element_check1(my_words, words)
        # or add_element_check2(my_words, words)
        # or add_element_except(my_words, words)
If you are wondering which is fastest, the answer is: it depends on how often a given word occurs in the file. The try/except version only pays a cost when the KeyError is actually raised (i.e. on the first occurrence of a word), so it tends to be the best choice when most words occur many times; if most words appear only once or twice, the explicit membership checks are usually faster.
I have done some simple benchmarks here
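If you want to reproduce such a measurement yourself, here is a rough, hedged benchmark sketch using timeit; the word list is synthetic and the three function names refer to the definitions above:
import random
import string
import timeit

# Build a synthetic word list with many repeats, so missing keys are rare.
vocab = ["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(1000)]
words = random.choices(vocab, k=100000)

for fn in (add_element_check1, add_element_check2, add_element_except):
    elapsed = timeit.timeit(lambda: fn({}, words), number=10)
    print(fn.__name__, round(elapsed, 3))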
This is a perfect job for the built-in Python collections module. From it, you can import Counter, which is a dictionary subclass made for just this.
How you want to process your data is up to you. One way to do this would be something like this
from collections import Counter

# Open your file and read its contents
with open("yourfile.txt", "r") as infile:
    textData = infile.read()

# Replace characters you don't want with empty strings
textData = textData.replace(".", "")
textData = textData.replace(",", "")
textList = textData.split(" ")

# Put your data into the Counter container datatype
dic = Counter(textList)

# Print out the results
for key, value in dic.items():
    print "Word: %s\n Count: %d\n" % (key, value)
Hope this helps!
Matt
