Counting Entries in a File - python

I am trying to count entries in a text file but having difficulty. The key is that each line is one entry and if the term "ADALIMUMAB" shows up in the line, it counts as one. If it shows up twice, it still should only count as one. Here is an example of lines in the text file.
101700392$10170039$3$I$BUDESONIDE.$BUDESONIDE$1$Oral$9 MG, DAILY$$$$$$$$9$MG$$
101700392$10170039$4$C$ADALIMUMAB$ADALIMUMAB$1$$UNK$$$$$$$$$$$
102117144$10211714$1$PS$HUMIRA$ADALIMUMAB$1$Subcutaneous$$$$$N$ NOT AVAILABLE,NOT
I currently have this working:
fDRUG14Q3 = open("DRUG14Q3.txt")
data = fDRUG14Q3.read()
occurencesDRUG14Q3 = data.count("ADALIMUMAB")
But it will count line 2 in the example above as 2 entries rather than one.

You can use a generator expression passed to sum(). Each line will evaluate to either True (1) or False (0), and you take the total. Basically you are counting how many lines return True for 'ADALIMUMAB' in line:
with open(path, 'r') as f:
    total = sum('ADALIMUMAB' in line for line in f)
print(total)
# 2
This has the added benefit of not requiring you to read the whole file into memory first too.
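As a minimal sketch of the difference between the two approaches, the snippet below runs both on the three sample lines from the question. The function name count_matching_lines and the temporary file are assumptions made for illustration only.

```python
import os
import tempfile

def count_matching_lines(path, term):
    """Count lines that contain term at least once (each line counts once)."""
    with open(path, 'r') as f:
        # 'term in line' is True or False; sum() treats them as 1 or 0.
        return sum(term in line for line in f)

# The three sample lines from the question.
sample = (
    "101700392$10170039$3$I$BUDESONIDE.$BUDESONIDE$1$Oral$9 MG, DAILY$$$$$$$$9$MG$$\n"
    "101700392$10170039$4$C$ADALIMUMAB$ADALIMUMAB$1$$UNK$$$$$$$$$$$\n"
    "102117144$10211714$1$PS$HUMIRA$ADALIMUMAB$1$Subcutaneous$$$$$N$ NOT AVAILABLE,NOT\n"
)

# Write the sample to a throwaway file so the function can read it by path.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

total = count_matching_lines(sample_path, "ADALIMUMAB")  # one per matching line
occurrences = sample.count("ADALIMUMAB")                 # every occurrence
os.remove(sample_path)

print(total, occurrences)
```

Here total counts the second line once, while occurrences counts it twice, which is the behaviour the question wanted to avoid.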

Related

Too many values to unpack in python: Caused by the file format

I have two files, each with two columns, as follows:
file 1
------
main 46
tag 23
bear 15
moon 2
file 2
------
main 20
rocky 6
zoo 4
bear 2
I am trying to compare the first 2 rows of each file, and where the same word appears in both, sum up the numbers and write the result to a new file.
I read the files and used a for loop to go through each line, but it raises ValueError: too many values to unpack.
import os
from itertools import islice
DIR = r'dir'
for filename in os.listdir(DIR):
    with open(os.path.sep.join([DIR, filename]), 'r') as f:
        for i in range(2):
            line = f.readline().strip()
            word, freq = line.split()
            print(word)
            print(freq)
In the file, there is an extra empty line after each line of text. I searched for \n, but found nothing.
Then I removed the empty lines manually, and it worked.
If you don't know how many items you have in the line, then you can't use the nice unpack facility. You'll need to split and check how many you got. For instance:
with open(os.path.sep.join([DIR, filename]), 'r') as f:
    for line in f:
        data = line.split()
        if len(data) >= 2:
            # unpack from the split fields, not from the raw line
            word, count = data[:2]
This will get you the first two fields of any line containing at least that many. Since you haven't specified what to do with other lines or extra fields, I'll leave that (any else part) up to you. I've also left out the strip call to simplify the existing code: split() with no arguments already discards leading and trailing whitespace, including the newline.
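To illustrate how the length check fits the original goal, here is a rough sketch that takes the first two usable rows of each file and sums the numbers for words present in both. The helper name first_pairs is hypothetical, and the in-memory lists stand in for the two files, with the blank lines the question describes.

```python
def first_pairs(lines, limit=2):
    """Return up to `limit` (word, number) pairs, skipping lines with fewer than two fields."""
    pairs = []
    for line in lines:
        data = line.split()
        if len(data) < 2:
            continue  # blank or malformed line: nothing to unpack
        pairs.append((data[0], int(data[1])))
        if len(pairs) == limit:
            break
    return pairs

# Stand-ins for the two files, including the extra empty lines.
file1 = ["main 46\n", "\n", "tag 23\n", "\n", "bear 15\n", "\n", "moon 2\n"]
file2 = ["main 20\n", "\n", "rocky 6\n", "\n", "zoo 4\n", "\n", "bear 2\n"]

top1 = dict(first_pairs(file1))
top2 = dict(first_pairs(file2))

# Sum the numbers only for words that appear in both selections.
shared = {word: top1[word] + top2[word] for word in top1 if word in top2}
print(shared)
```

With the sample data, only main appears in the first two rows of both files, so shared holds a single entry with the value 66.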

Calculate the average value of the numbers in a file

I am having a problem calculating the average value of the numbers in a file.
So far I have made a function that reads in a file and calculates the number of lines.
The file consists of many columns of numbers, but column 8 is the one I need to calculate from.
def file_read():
    fname = input("Input filename: ")
    infile = open(fname,'r')
    txt = infile.readlines()
    print("opens",fname,"...")
    num_lines = sum(1 for line in open(fname))
    #The first line in the file is only text, so i subtract 1
    print("Number of days:",(num_lines-1))
The numbers are also decimals, so I use float.
This is my attempt at calculating the sum of the numbers,
which should then be divided by the number of lines, but I get an error because the first line is text.
with open(fname) as txt:
    return sum(float(x)
               for line in txt
               for x in line.split()[8])
Is there a way I can get Python to ignore the first line and just concentrate on the numbers below it?
You could use txt.readline() to read the first line, but to stick with the iterator approach, just drop the first line by calling next on the file:
with open(fname) as txt:
    next(txt) # it returns the first line, we just ignore the return value
    # your iterator is now on the second line, where the numbers are
    for line in txt:
        ...
Side note: this is also very useful for skipping the title lines of files opened with the csv module. That's where next is better than readline, since a csv title can span multiple lines.
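Putting the pieces together, a possible sketch of the averaging step looks like this. The function name column_average and the two-column sample are assumptions for illustration; whether "column 8" in the real file means index 7 or index 8 depends on how the columns are counted, so the index is left as a parameter.

```python
import io

def column_average(lines, column):
    """Average the values in one zero-based column, skipping the header line."""
    rows = iter(lines)
    next(rows, None)  # discard the text header; None avoids an error on empty input
    # Take one whole field per line and convert it; ignore blank lines.
    values = [float(line.split()[column]) for line in rows if line.strip()]
    return sum(values) / len(values) if values else 0.0

# A small stand-in for the real file: a header line followed by numeric rows.
sample = io.StringIO(
    "day temp\n"
    "1 2.5\n"
    "2 3.5\n"
    "3 6.0\n"
)

average = column_average(sample, 1)
print(average)
```

Note that the field is converted as a whole with float(line.split()[column]); iterating over the field itself would walk through its characters one by one.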
Try this:
import re
#regular expression for decimals
digits_reg = re.compile(r"\d+\.\d+|\d+")
with open('''file name''', "r") as file:
    allNum = []
    #find numbers in each line and add them to the list
    for line in file:
        allNum.extend(digits_reg.findall(line))
    #should be a list that contains all numbers in the file (as strings)
    print(allNum)

Python 3.4.3: Iterating over each line and each character in each line in a text file

I have to write a program that iterates over each line in a text file and then over each character in each line in order to count the number of entries in each line.
Here is a segment of the text file:
N00000031,B,,D,D,C,B,D,A,A,C,D,C,A,B,A,C,B,C,A,C,C,A,B,D,D,D,B,A,B,A,C,B,,,C,A,A,B,D,D
N00000032,B,A,D,D,C,B,D,A,C,C,D,,A,A,A,C,B,D,A,C,,A,B,D,D
N00000033,B,A,D,D,C,,D,A,C,B,D,B,A,B,C,C,C,D,A,C,A,,B,D,D
N00000034,B,,D,,C,B,A,A,C,C,D,B,A,,A,C,B,A,B,C,A,,B,D,D
The first and last lines are "unusable lines" because they contain the wrong number of entries (more or fewer than 25). I would like to count the number of unusable lines in the file.
Here is my code:
for line in file:
    answers=line.split(",")
    i=0
    for i in answers:
        i+=1
unusable_line=0
for line in file:
    if i!=26:
        unusable_line+=1
print("Unusable lines in the file:", unusable_line)
I tried using this method as well:
alldata=file.read()
for line in file:
    student=alldata.split("\n")
    answer=student.split(",")
My problem is each variable I create doesn't exist when I try to run the program. I get a "students" is not defined error.
I know my coding is awful but I'm a beginner. Sorry!!! Thank you and any help at all is appreciated!!!
Here is a simplified version of your approach, using a list, len() and an if condition.
Code:
unusable_line = 0
for line in file:
    answers = line.strip().split(",")
    if len(answers) != 26:
        unusable_line += 1
print("Unusable lines in the file:", unusable_line)
Notes:
First I create a variable, unusable_line, to store the count of unusable lines.
Then I iterate over the lines of the file object.
Then I split each line at , to create a list.
Then I check whether the length of the list differs from 26 (the ID plus 25 answers). If so, I increment the unusable_line variable.
Finally I print it.
You could use something like this and wrap it into a function. You don't need to re-iterate over the items in the line: str.split() returns a list that has your elements in it, and you can count its elements with len().
my_file = open('temp.txt', 'r')
lines_count = usable = unusable = 0
for line in my_file:
    lines_count+=1
    if len(line.split(',')) == 26:
        usable+=1
    else:
        unusable+=1
my_file.close()
print("Processed %d lines, %d usable and %d unusable" % (lines_count, usable, unusable))
You can do it much shorter:
with open('my_file.txt') as fobj:
    unusable = sum(1 for line in fobj if len(line.split(',')) != 26)
The line with open('my_file.txt') as fobj: opens the file for reading and closes it automatically after leaving the indented block. I use a generator expression to go through all lines and count those whose number of fields differs from 26.
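The same idea can be wrapped in a small function and tried on made-up data. This is only a sketch: the name count_unusable, the expected_fields parameter and the synthetic rows are assumptions, with 26 standing for one ID field plus 25 answers as in the question.

```python
def count_unusable(lines, expected_fields=26):
    """Count lines whose number of comma-separated fields differs from expected_fields."""
    return sum(
        1 for line in lines
        if len(line.strip().split(',')) != expected_fields
    )

# Synthetic rows: an ID followed by a given number of answers.
usable_row = "N1," + ",".join(["A"] * 25) + "\n"   # 26 fields
short_row = "N2," + ",".join(["A"] * 10) + "\n"    # 11 fields
long_row = "N3," + ",".join(["A"] * 40) + "\n"     # 41 fields
blank_answers = "N4," + "," * 24 + "\n"            # 26 fields, all answers empty

rows = [usable_row, short_row, long_row, blank_answers]
unusable = count_unusable(rows)
print("Unusable lines in the file:", unusable)
```

Empty answers still count as fields, because split(',') keeps the empty strings between consecutive commas.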

How to count the number of characters in a file (not using the len function)?

Basically, I want to be able to count the number of characters in a txt file (with user input of file name). I can get it to display how many lines are in the file, but not how many characters. I am not using the len function and this is what I have:
def length(n):
    value = 0
    for char in n:
        value += 1
    return value
filename = input('Enter the name of the file: ')
f = open(filename)
for data in f:
    data = length(f)
    print(data)
All you need to do is sum the number of characters in each line (data):
total = 0
for line in f:
    data = length(line)
    total += data
print(total)
There are two problems.
First, for each line in the file, you're passing f itself—that is, a sequence of lines—to length. That's why it's printing the number of lines in the file. The length of that sequence of lines is the number of lines in the file.
To fix this, you want to pass each line, data—that is, a sequence of characters. So:
for data in f:
    print(length(data))
Next, while that will properly calculate the length of each line, you have to add them all up to get the length of the whole file. So:
total_length = 0
for data in f:
    total_length += length(data)
print(total_length)
However, there's another way to tackle this that's a lot simpler. If you read() the file, you will get one giant string, instead of a sequence of separate lines. So you can just call length once:
data = f.read()
print(length(data))
The problem with this is that you have to have enough memory to store the whole file at once. Sometimes that's not appropriate. But sometimes it is.
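If memory is a concern, one possible middle ground (not part of the answer above, just a sketch) is to read the file in fixed-size chunks and add up their lengths. The function name count_characters and the chunk_size parameter are assumptions; length is the same hand-written counter as in the question, so len is still not used.

```python
import io

def length(iterable):
    """Count items by iterating, without calling len()."""
    value = 0
    for _ in iterable:
        value += 1
    return value

def count_characters(f, chunk_size=4096):
    """Count characters in an open text file, reading chunk_size characters at a time."""
    total = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # an empty string means end of file
            break
        total += length(chunk)
    return total

# io.StringIO behaves like a file opened in text mode.
sample = io.StringIO("abc\nde\n")
total = count_characters(sample, chunk_size=4)
print(total)
```

Newline characters are included in the total, just as they are when summing the length of each line.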
When you iterate over a file (opened in text mode) you are iterating over its lines.
for data in f: could be rewritten as for line in f: and it is easier to see what it is doing.
Your length function looks like it should work but you are sending the open file to it instead of each line.

Update iteration value in Python for loop

Pretty new to Python and have been writing up a script to pick out certain lines of a basic log file
Basically the function searches the lines of the file, and when it finds one I want to output to a separate file, it adds that line to a list, then also adds the five lines following it. This then gets output to a separate file at the end in a different function.
What I've been trying to do after that is jump the loop forward so it continues from the last of those five lines, rather than going over them again. I thought the last line in the code would solve the problem, but unfortunately not.
Are there any recommended variations of a for loop I could use for this purpose?
def readSingleDayLogs(aDir):
    print 'Processing files in ' + str(aDir) + '\n'
    lineNumber = 0
    try:
        open_aDirFile = open(aDir) #open the log file
        for aLine in open_aDirFile: #total the num. lines in file
            lineNumber = lineNumber + 1
        lowerBound = 0
        for lineIDX in range(lowerBound, lineNumber):
            currentLine = linecache.getline(aDir, lineIDX)
            if (bunch of logic conditions):
                issueList.append(currentLine)
                for extraLineIDX in range(1, 6): #loop over the next five lines of the error and append to issue list
                    extraLine = linecache.getline(aDir, lineIDX+ extraLineIDX) #get the x extra line after problem line
                    issueList.append(extraLine)
                issueList.append('\n\n')
                lowerBound = lineIDX
You should use a while loop:
line = lowerBound
while line < lineNumber:
    ...
    if conditions:
        ...
        for lineIDX in range(line, line+6):
            ...
        line = line + 6
    else:
        line = line + 1
A for-loop uses an iterator over the range, so reassigning the loop variable inside the body has no effect on the next iteration.
Consider using a while-loop instead. That way, you can update the line index directly.
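A minimal sketch of the while-loop idea on an in-memory list follows. The function name collect_by_index, the is_match callback and the sample log are hypothetical stand-ins for the real conditions and file; extra is set to 2 here only to keep the example short.

```python
def collect_by_index(lines, is_match, extra=5):
    """Collect each matching line plus the `extra` lines after it, then skip past them."""
    collected = []
    i = 0
    while i < len(lines):
        if is_match(lines[i]):
            # Slicing past the end of the list is safe; it just returns fewer items.
            collected.extend(lines[i:i + 1 + extra])
            i += 1 + extra  # jump over the lines that were just collected
        else:
            i += 1
    return collected

log = ["ok 1", "ERROR a", "ctx 1", "ERROR b", "ctx 3", "ok 2", "ERROR c", "ctx 4"]
result = collect_by_index(log, lambda s: s.startswith("ERROR"), extra=2)
print(result)
```

Because the index is advanced by hand, "ERROR b" is collected as context for "ERROR a" and is not examined again as a match of its own.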
I would look at something like:
from itertools import islice
with open('somefile') as fin:
    line_count = 0
    my_lines = []
    for line in fin:
        line_count += 1
        if some_logic(line):
            my_lines.append(line)
            next_5 = list(islice(fin, 5))
            line_count += len(next_5)
            my_lines.extend(next_5)
This way, by using islice on the input, you're able to move the iterator ahead and resume after the 5 lines (perhaps fewer if near the end of the file) are exhausted.
This is based on if I'm understanding correctly that you can read forward through the file, identify a line, and only want a fixed number of lines after that point, then resume looping as per normal. (You may not even require the line counting if that's all you're after as it only appears to be for the getline and not any other purpose).
If you do want to take the next 5 lines and still consider them in the main loop afterwards, you can use itertools.tee to branch at the point of the faulty line, islice that branch, and let the fin iterator resume on the next line.
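To see how islice advances the shared iterator, here is a small self-contained sketch that uses a list in place of the file. The names collect_with_context and is_match are hypothetical, and extra is set to 2 only to keep the sample short.

```python
from itertools import islice

def collect_with_context(lines, is_match, extra=5):
    """Collect each matching line plus up to `extra` following lines from one shared iterator."""
    it = iter(lines)
    collected = []
    for line in it:
        if is_match(line):
            collected.append(line)
            # islice pulls from the same iterator the for loop uses,
            # so the loop resumes after the lines consumed here.
            collected.extend(islice(it, extra))
    return collected

log = ["ok 1", "ERROR a", "ctx 1", "ERROR b", "ctx 3", "ok 2", "ERROR c", "ctx 4"]
result = collect_with_context(log, lambda s: s.startswith("ERROR"), extra=2)
print(result)
```

The last match has only one line after it, so islice simply returns what is left instead of raising an error.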
