I have a text file formatted like below, which I am bringing into Python:
hammer#9.95
saw#20.15
shovel#35.40
Ultimately I want to write something dynamic that removes the '#' symbol, replaces it with a '$' symbol, then adds up the values in the text file and counts the number of items. I came up with the following through some trial and error, but it isn't dynamic enough to handle changes in the text file:
# display header line for items list
print('{0: <10}'.format('Item'), '{0: >17}'.format('Cost'), sep = '' )
# add your remaining code below
with open('invoice.txt', 'rt') as infile:
    for line in infile:
        print("{:<21} {}".format(line.strip().split('#')[0], "$" + line.strip().split("#")[1]))
print(' ')
str1 = 'Total cost\t' +' ' + '$65.50'
print(str1)
str2 = 'Number of tools\t' + ' ' +'3'
print(str2)
Any suggestions? Thanks ahead of time for reading.
You can do it the following way:
d = ['hammer#9.95', 'saw#20.15', 'shovel#35.40']
## replace hash
values = []
items = set()
for line in d:
    line = line.replace('#', '$')
    values.append(line.split('$')[1])
    items.add(line.split('$')[0])
## sum values
sum(map(lambda x: float(x), values))
65.5
## count items
len(items)
3
Explanation:
To count the items, we've used a set so that only unique names are counted. If you want to count every line, use a list instead.
We've calculated the sum by extracting the numbers from the list, splitting each entry on the dollar sign.
prices = []
with open(...) as infile:
    for line in infile.readlines():
        price = line.split('#')[-1]
        prices.append(float(price))
result = sum(prices)
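Putting those pieces together, a fully dynamic version that reads straight from the file might look like this (a sketch, assuming the file is named invoice.txt as in the question):
# Sketch only: expects one "name#price" pair per line of invoice.txt.
total = 0.0
count = 0
print('{0: <10}'.format('Item'), '{0: >17}'.format('Cost'), sep='')
with open('invoice.txt') as infile:
    for line in infile:
        line = line.strip()
        if not line:
            continue                          # skip blank lines
        name, price = line.split('#', 1)      # split on the first '#' only
        print("{:<21} {}".format(name, "$" + price))
        total += float(price)
        count += 1
print(' ')
print('Total cost\t ${0:.2f}'.format(total))
print('Number of tools\t {0}'.format(count))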
What about:
items = {}
with open("temp.txt") as f:
for line in f:
item,cost = line.split('#')
cost = float(cost)
items[item] = cost
Now you have a dictionary keyed by item "name" (so the names need to be unique in your file; otherwise a dictionary isn't the best structure here), and each value is a float corresponding to the parsed cost.
# Print items and cost
print(items.items())
#> dict_items([('hammer', 9.95), ('saw', 20.15), ('shovel', 35.4)])
# Print Number of Items
print(len(items))
#> 3
# Print Total Cost (unformatted)
print(sum(items.values()))
#> 65.5
# Print Total Cost (formatted)
print("$%.02f" % sum(items.values()))
#> $65.50
There are some corner cases you may want to handle to make this solution more robust: for example, an item "name" that includes a # sign (i.e. more than one # per line), or values that aren't properly formatted to be parsed by float.
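For instance, splitting from the right handles item names that themselves contain a # (a sketch, assuming the cost is always the last #-separated field):
items = {}
with open("temp.txt") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        item, cost = line.rsplit('#', 1)  # split on the last '#' only
        try:
            items[item] = float(cost)
        except ValueError:
            print("Skipping malformed line:", line)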
You can use:
total_price, total_products = 0, 0
for line in open('invoice.txt').read().split("\n"):
    total_price += float(line.split("#")[1]); total_products += 1
print("Total Price\n${}".format(total_price))
print("Number of tools\n{}".format(total_products))
Total Price
$65.5
Number of tools
3
We have to cast the price (line.split("#")[1]), which is a string, to a float; otherwise we get a TypeError when we try to add it to total_price.
float(line.split("#")[1])
Since it is long overdue that I refresh my Python skills, I had some fun with your question and came up with a parser class:
import re
from contextlib import contextmanager
class Parser(object):
    def __init__(self, file_path, regex):
        self.file_path = file_path
        self.pattern = re.compile(regex, flags=re.IGNORECASE | re.UNICODE)
        self.values = []
        self.parse()

    @contextmanager
    def read_lines(self):
        try:
            with open(self.file_path, "r", encoding="utf-8") as f:
                yield f.readlines()
        except FileNotFoundError:
            print("Couldn't open file:", self.file_path)
            yield []

    def parse_line(self, line):
        try:
            return self.pattern.match(line).groupdict()
        except AttributeError:
            # the line didn't match the pattern, so match() returned None
            return None

    def parse(self):
        with self.read_lines() as lines:
            self.values = [value for value in map(self.parse_line, lines) if value]

    def get_values(self, converters=dict()):
        if not converters:
            return self.values
        new_values = []
        for value in self.values:
            new_value = {}
            for key in value:
                if key in converters:
                    new_value[key] = converters[key](value[key])
                else:
                    new_value[key] = value[key]
            new_values.append(new_value)
        return new_values
This class takes a file path and a regex-like string, which is then compiled to a regex object. On instantiation it reads and parses the contents of the file while ignoring invalid lines (lines that don't match the regex, such as empty lines).
I also added a get_values method which can apply converters to named groups from the regex; see the example below (it converts the named group price on every line into a float):
parser = Parser(r"fully_qualified_file_path.txt", r".\s*(?P<name>[\w\s]+)\#(?P<price>[\d\.]+)")
total = 0
count = 0
for line in parser.get_values({'price': lambda x: float(x)}):
    total += line['price']
    count += 1
    print('Item: {name}, Price: ${price}'.format(**line))
print()
print('Item count:', count)
print('Total:', "${0}".format(total))
Result
Item: hammer, Price: $9.95
Item: saw, Price: $20.15
Item: shovel, Price: $35.4
Item count: 3
Total: $65.5
But coding fun aside, I suggest you try to get clean CSV-like data and handle it properly with the csv module.
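For example, the csv module can treat # as a delimiter directly (a minimal sketch, assuming the same invoice.txt layout as in the original question):
import csv

with open('invoice.txt', newline='') as f:
    rows = [(row[0], float(row[1])) for row in csv.reader(f, delimiter='#') if row]

for name, price in rows:
    print('{0:<21} ${1:.2f}'.format(name, price))
print('Total: ${0:.2f}'.format(sum(price for _, price in rows)))
print('Items:', len(rows))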
I put together a Python script which reads the string BatchSequence="NUMBER INCREMENT HERE" and just returns the integers. How can I find a certain integer and increment the rest by one, but leave the integers before it the same? It skips 3 and goes to 5; I want it to go 3, 4, 5.
Also,
Once I have figured this script out, how can I replace the numbers in the original text file with the new script numbers? Would I have to write into a new file?
I have tried incrementing the numbers by one, but it starts from the beginning.
Code that I tried:
import re
file = '\\\MyDataNEE\\user$\\bxt058y\\Desktop\\75736.oxi.error'
counter = 0
for line in open(file):
    match = re.search(r'BatchSequence="(\d+)"', line)
    if match:
        print(int(match.group(1)) + 1)
Original Code:
import re
file = 'FILENAME HERE'
counter = 0
for line in open(file):
    match = re.search(r'BatchSequence="(\d+)"', line)
    if match:
        print(match.group(1))
Currently:
BatchSequence="1"
BatchSequence="2"
BatchSequence="3"
BatchSequence="5"
BatchSequence="6"
BatchSequence="7"
BatchSequence="8"
New output should be:
BatchSequence="1"
BatchSequence="2"
BatchSequence="3"
BatchSequence="4"
BatchSequence="5"
BatchSequence="6"
BatchSequence="7"
My take on the problem:
txt = '''BatchSequence="1"
BatchSequence="2"
BatchSequence="3"
BatchSequence="5"
BatchSequence="6"
BatchSequence="7"
BatchSequence="8"'''
import re
def fn(my_number):
    val = yield
    while True:
        val = yield str(val) if val < my_number else str(val - 1)
f = fn(4)
next(f)
s = re.sub(r'BatchSequence="(\d+)"', lambda g: 'BatchSequence="' + f.send(int(g.group(1))) + '"', txt)
print(s)
Prints:
BatchSequence="1"
BatchSequence="2"
BatchSequence="3"
BatchSequence="4"
BatchSequence="5"
BatchSequence="6"
BatchSequence="7"
The generator fn(my_number) returns the values unchanged until it reaches my_number; from that point on, each value is decremented by one.
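To cover the second part of the question (replacing the numbers in the original file), you can read the whole file, run the same substitution, and write the result back, either to a new file or over the original. A sketch, where path is a placeholder for the actual file path from the question:
path = 'original.txt'   # placeholder: substitute the real file path here

with open(path) as fh:
    txt = fh.read()

f = fn(4)
next(f)
s = re.sub(r'BatchSequence="(\d+)"',
           lambda g: 'BatchSequence="' + f.send(int(g.group(1))) + '"',
           txt)

with open(path, 'w') as fh:
    fh.write(s)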
I have a simple script that takes a list containing 3 columns of data. The second column of data contains currency values with a leading dollar sign. I have stripped away the dollar sign from the second column, now I need to add up the values. I'm getting a "decimal.Decimal is not iterable" error. Here is the code:
from decimal import Decimal
def main():
    total = 0.0
    try:
        infile = open('list.txt', 'r')
        for i in infile:
            parts = i.split()
            if len(parts) > 1:
                dollar_dec = Decimal((parts[1]).strip('$'))
                total = sum(dollar_dec)
        print(total)
        infile.close()
    except Exception as err:
        print(err)

main()
Say, you have the following file content:
content = """\
one $1.50
two $3.00
three $4.50"""
You can use the in-place operator += to calculate the total:
from decimal import Decimal
import io
total = Decimal(0)
with io.StringIO(content) as fd:
    for line in fd:
        parts = line.strip().split()
        if len(parts) > 1:
            dollar_dec = Decimal(parts[1].strip("$"))
            total += dollar_dec
print(total)
Here, you get: 9.00
You can also use sum() on an iterable (here a generator expression):
with io.StringIO(content) as fd:
    total = sum(Decimal(line.strip().split()[1].strip("$"))
                for line in fd)
print(total)
Yes, you get 9.00 too!
total = sum(dollar_dec)
sum() takes an iterable (a list, for example) and adds up all the values. You are passing it a single number, which is an error. You probably want
total = Decimal('0.0')
...
total += dollar_dec
This will keep a running total.
(Edit: total must be a Decimal for you to add Decimals to it.)
sum() takes an iterable. Just change your code to total += dollar_dec
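In context, a corrected version of the loop from the question might look like this (a sketch based on the posted code):
from decimal import Decimal

def main():
    total = Decimal('0')
    try:
        with open('list.txt', 'r') as infile:
            for i in infile:
                parts = i.split()
                if len(parts) > 1:
                    total += Decimal(parts[1].strip('$'))   # running total instead of sum()
        print(total)
    except Exception as err:
        print(err)

main()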
Write a program that reads the contents of a random text file. The program should create a dictionary in which the keys are individual words found in the file and the values are the number of times each word appears.
How would I go about doing this?
def main():
    c = 0
    dic = {}
    words = set()
    inFile = open('text2', 'r')
    for line in inFile:
        line = line.strip()
        line = line.replace('.', '')
        line = line.replace(',', '')
        line = line.replace("'", '')   # strips the punctuation
        line = line.replace('"', '')
        line = line.replace(';', '')
        line = line.replace('?', '')
        line = line.replace(':', '')
        words = line.split()
        for x in words:
            for y in words:
                if x == y:
                    c += 1
            dic[x] = c
    print(dic)
    print(words)
    inFile.close()

main()
Sorry for the vague question; I've never asked a question here before. This is what I have so far. Also, this is the first programming I've ever done, so I expect it to be pretty rough.
with open('path/to/file') as infile:
# code goes here
That's how you open a file.
for line in infile:
# code goes here
That's how you read a file line by line.
line.strip().split()
That's how you split a line into (white-space separated) words.
some_dictionary['abcd']
That's how you access the key 'abcd' in some_dictionary.
Questions for you:
What does it mean if you can't access the key in a dictionary?
What error does that give you? Can you catch it with a try/except block?
How do you increment a value?
Is there some function that GETS a default value from a dict if the key doesn't exist?
For what it's worth, there's also a function that does almost exactly this, but since this is pretty obviously homework it won't fulfill your assignment requirements anyway. It's in the collections module. If you're interested, try and figure out what it is :)
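For reference, the default-value lookup hinted at above looks like this (a minimal sketch with made-up words):
counts = {}
for word in ['to', 'be', 'or', 'not', 'to', 'be']:
    counts[word] = counts.get(word, 0) + 1   # 0 is used when the key is missing
print(counts)   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}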
There are at least three different approaches to add a new word to the dictionary and count the number of occurences in this file.
def add_element_check1(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 1
        else:
            my_dict[e] += 1

def add_element_check2(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 0
        my_dict[e] += 1

def add_element_except(my_dict, elements):
    for e in elements:
        try:
            my_dict[e] += 1
        except KeyError:
            my_dict[e] = 1
my_words = {}
with open('pathtomyfile.txt', 'r') as in_file:
    for line in in_file:
        words = [word.strip().lower() for word in line.strip().split()]
        add_element_check1(my_words, words)
        # or add_element_check2(my_words, words)
        # or add_element_except(my_words, words)
If you are wondering which is fastest, the answer is: it depends on how often a given word occurs in the file. The try-except version tends to win when most words repeat many times (so the KeyError is rare); if most words appear only once or twice, the membership-check versions are usually the better choice.
I have done some simple benchmarks here
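If you want to measure it yourself, timeit gives a rough comparison (a sketch that reuses the three functions above on an arbitrary word list):
import timeit

words = ('the quick brown fox jumps over the lazy dog ' * 1000).split()

for func in (add_element_check1, add_element_check2, add_element_except):
    elapsed = timeit.timeit(lambda: func({}, words), number=100)
    print(func.__name__, round(elapsed, 3))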
This is a perfect job for the built-in Python collections module. From it, you can import Counter, which is a dictionary subclass made for just this.
How you want to process your data is up to you. One way to do it would be something like this:
from collections import Counter
# Open your file and split by white spaces
with open("yourfile.txt","r") as infile:
textData = infile.read()
# Replace characters you don't want with empty strings
textData = textData.replace(".","")
textData = textData.replace(",","")
textList = textData.split(" ")
# Put your data into the counter container datatype
dic = Counter(textList)
# Print out the results
for key,value in dic.items():
print "Word: %s\n Count: %d\n" % (key,value)
Hope this helps!
Matt
For example, if I take this file: http://vlm1.uta.edu/~athitsos/courses/cse1310_summer2013/assignments/assignment7/albums.txt
I need the function to count each band and the number of times they are listed in the file and print it on screen in descending order.
It should be in this format
band1: number1
band2: number2
band3: number3
This is what I have so far:
def read_albums(filename):
    counter = 0
    work_list = []
    my_file = open(filename, 'r')
    for line in my_file:
        my_list = line.split()
        work_list = line.split()
        for i in range(0, len(my_list)):
            item = my_list[0]
            counter = 1
            j = i + 1
            for j in range(j, len(my_list)):
                if j > len(my_list):
                    j = len(my_list)
                if item == my_list[0]:
                    counter = counter + 1
                    work_list[j] = None
                else:
                    continue
            if work_list[0] != None:
                print(work_list[0], counter)
Any tips regarding what I am doing wrong would be very helpful; I just can't seem to get it.
from collections import defaultdict

d = defaultdict(int)
with open("some_file.txt") as f:
    for line in f:
        artist, album = line.split("-")
        d[artist] += 1

for k, v in d.items():
    print("%s:%s" % (k, v))
Something like this would be the Pythonic way to go:
from collections import Counter
with open('albums.txt') as f:
    print(Counter(line.split(' - ')[0] for line in f))
I recommend you take a look at this talk.
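Since the question asks for descending order, note that Counter.most_common() returns (band, count) pairs sorted from most to least frequent, so the requested output format can be produced like this (a sketch):
from collections import Counter

with open('albums.txt') as f:
    counts = Counter(line.split(' - ')[0] for line in f)

for band, number in counts.most_common():
    print('{0}: {1}'.format(band, number))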
You already have a working answer so I will just say where you went wrong.
my_list = line.split()
work_list = line.split()
They are exactly the same so I'll just stick with work_list.
work_list = line.split()
This splits the text at each space, so "Pink Floyd - Album" becomes ["Pink", "Floyd", "-", "Album"]. It also reassigns work_list to the latest line you split, overwriting the previous one. What you want is to collect something from every line in a list:
work_list.append(line.split("-")[0])
This splits the line properly and only returns the first element, which is the band name. This is then appended to the list work_list, which you have properly initialised as empty at the beginning.
Once you have the bands in a list, you can use any method to count all the occurrences. Counter is brilliant for that. Your method had a lot of logic flaws, but I think what you were going for was (in pseudocode):
for each item in the array (item)
    go through all the remaining items (new_item)
        if item == new_item
            increase counter
This doesn't count the occurrences of each item just once: every time it comes across a band, it counts all the duplicates from that point forward again. What you want instead is a set, which is like a list but with no duplicate entries.
work_set = set(work_list)
for band in work_set:
    counter = 0
    for i in range(len(work_list)):
        if work_list[i] == band:
            counter += 1
    print(band, counter)
If your programs don't behave as expected, print your variables to check whether they hold what you expect them to.
I am trying to parse a large FASTA file and I am encountering out-of-memory errors. Some suggestions to improve the data handling would be appreciated. Currently the program correctly prints out the names, but partway through the file I get a MemoryError.
Here is the generator:
def readFastaEntry( fp ):
name = ""
seq = ""
for line in fp:
if line.startswith( ">" ):
tmp = []
tmp.append( name )
tmp.append( seq )
name = line
seq = ""
yield tmp
else:
seq = seq.join( line )
and here is the caller stub; more will be added after this part works:
import sys

fp = open(sys.argv[1], 'r')
for seq in readFastaEntry(fp):
    print(seq[0])
For those not familiar with the FASTA format, here is an example:
>1 (PB2)
AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC
TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC
TGCACTCAGGATGAAGTGGATGATG
>2 (PB1)
AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC
ACATTTCCCTATACTGGAGACCCTCC
Each entry starts with a ">" line stating the name etc., and the next N lines are data. There is no defined end of the data other than the next line starting with ">".
Have you considered using BioPython? It has a sequence reader that can read FASTA files. And if you are interested in coding one yourself, you can take a look at BioPython's code.
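For example, a minimal sketch with BioPython's SeqIO (assuming BioPython is installed; the file name is a placeholder):
from Bio import SeqIO

for record in SeqIO.parse("example.fasta", "fasta"):   # "example.fasta" is a placeholder path
    print(record.id, len(record.seq))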
Edit: Code added
def read_fasta(fp):
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            if name: yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name: yield (name, ''.join(seq))

with open('f.fasta') as fp:
    for name, seq in read_fasta(fp):
        print(name, seq)
A pyparsing parser for this format is only a few lines long. See the annotations in the following code:
data = """>1 (PB2)
AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC
TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC
TGCACTCAGGATGAAGTGGATGATG
>2 (PB1)
AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC
ACATTTCCCTATACTGGAGACCCTCC"""
from pyparsing import Word, nums, QuotedString, Combine, OneOrMore
# define some basic forms
integer = Word(nums)
key = QuotedString("(", endQuoteChar=")")
# sequences are "words" made up of the characters A, G, C, and T
# we want to match one or more of them, and have the parser combine
# them into a single string (Combine by default requires all of its
# elements to be adjacent within the input string, but we want to allow
# for the intervening end of lines, so we add adjacent=False)
sequence = Combine(OneOrMore(Word("AGCT")), adjacent=False)
# define the overall pattern to scan for - attach results names
# to each matched element
seqEntry = ">" + integer("index") + key("key") + sequence("sequence")
for seq, s, e in seqEntry.scanString(data):
    # just dump out the matched data
    print(seq.dump())
    # could also access fields as seq.index, seq.key and seq.sequence
Prints:
['>', '1', 'PB2', 'AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATACTCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCTGCACTCAGGATGAAGTGGATGATG']
- index: 1
- key: PB2
- sequence: AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATACTCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCTGCACTCAGGATGAAGTGGATGATG
['>', '2', 'PB1', 'AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACCACATTTCCCTATACTGGAGACCCTCC']
- index: 2
- key: PB1
- sequence: AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACCACATTTCCCTATACTGGAGACCCTCC
Without having a great understanding of what you are doing, I would have written the code like this:
def readFastaEntry(fp):
    name = ""
    while True:
        line = name or fp.readline()
        if not line:
            break
        seq = []
        while True:
            name = fp.readline()
            if not name or name.startswith(">"):
                break
            else:
                seq.append(name)
        yield (line, "".join(seq))
This gathers up the data after a header line, up to the next header line. Making seq a list means that the string joining is postponed until the last possible moment. Yielding a tuple makes more sense than a list.
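For example, the rewritten generator can be driven by the same kind of caller stub as in the question (a sketch):
import sys

with open(sys.argv[1]) as fp:
    for name, seq in readFastaEntry(fp):
        print(name.strip())                         # the ">..." header line
        print(len(seq.replace("\n", "")), "bases")  # seq still contains newlines here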
def read_fasta(filename):
    name = None
    seq = ""
    with open(filename) as file:
        for line in file:
            if line[0] == ">":
                if name:
                    yield (name, seq)
                name = line[1:-1].split("|")[0]   # header without the ">" and trailing newline
                seq = ""
            else:
                seq += line[:-1]                  # append the sequence line without its newline
    if name:
        yield (name, seq)