Counting specific characters in a file (Python) - python

I'd like to count specific things from a file, i.e. how many times "--undefined--" appears. Here is a piece of the file's content:
"jo:ns 76.434
pRE 75.417
zi: 75.178
dEnt --undefined--
ba --undefined--
I tried to use something like this. But it won't work:
with open("v3.txt", 'r') as infile:
data = infile.readlines().decode("UTF-8")
count = 0
for i in data:
if i.endswith("--undefined--"):
count += 1
print count
Do I have to implement, say, dictionary of tuples to tackle this or there is an easier solution for that?
EDIT:
The word in question appears only once in a line.

you can read all the data in one string and split the string in a list, and count occurrences of the substring in that list.
with open('afile.txt', 'r') as myfile:
data=myfile.read().replace('\n', ' ')
data.split(' ').count("--undefined--")
or directly from the string :
data.count("--undefined--")

readlines() returns the list of lines, but they are not stripped (ie. they contain the newline character).
Either strip them first:
data = [line.strip() for line in data]
or check for --undefined--\n:
if line.endswith("--undefined--\n"):
Alternatively, consider string's .count() method:
file_contents.count("--undefined--")

Or don't limit yourself to .endswith(), use the in operator.
data = ''
count = 0
with open('v3.txt', 'r') as infile:
data = infile.readlines()
print(data)
for line in data:
if '--undefined--' in line:
count += 1
count

When reading a file line by line, each line ends with the newline character:
>>> with open("blookcore/models.py") as f:
... lines = f.readlines()
...
>>> lines[0]
'# -*- coding: utf-8 -*-\n'
>>>
so your endswith() test just can't work - you have to strip the line first:
if i.strip().endswith("--undefined--"):
count += 1
Now reading a whole file in memory is more often than not a bad idea - even if the file fits in memory, it still eats fresources for no good reason. Python's file objects are iterable, so you can just loop over your file. And finally, you can specify which encoding should be used when opening the file (instead of decoding manually) using the codecs module (python 2) or directly (python3):
# py3
with open("your/file.text", encoding="utf-8") as f:
# py2:
import codecs
with codecs.open("your/file.text", encoding="utf-8") as f:
then just use the builtin sum and a generator expression:
result = sum(line.strip().endswith("whatever") for line in f)
this relies on the fact that booleans are integers with values 0 (False) and 1 (True).

Quoting Raymond Hettinger, "There must be a better way":
from collections import Counter
counter = Counter()
words = ('--undefined--', 'otherword', 'onemore')
with open("v3.txt", 'r') as f:
lines = f.readlines()
for line in lines:
for word in words:
if word in line:
counter.update((word,)) # note the single element tuple
print counter

Related

left padding with python

I have following data and link combination of 100000 entries
dn:id=2150fccc-beb8-42f8-b201-182a6bf5ddfe,ou=test,dc=com
link:545214569
dn:id=ffa55959-457d-49e6-b4cf-a34eff8bbfb7,ou=test,dc=com
link:32546897
dn:id=3452a4c3-b768-43f5-8f1e-d33c14787b9b,ou=test,dc=com
link:6547896541
I am trying to write a program in python 2.7 to add left padding zeros if value of link is less than 10 .
Eg:
545214569 --> 0545214569
32546897 --> 0032546897
can you please guide me what am i doing wrong with the following program :
with open("test.txt", "r") as f:
line=f.readline()
line1=f.readline()
wordcheck = "link"
wordcheck1= "dn"
for wordcheck1 in line1:
with open("pad-link.txt", "a") as ff:
for wordcheck in line:
with open("pad-link.txt", "a") as ff:
key, val = line.strip().split(":")
val1 = val.strip().rjust(10,'0')
line = line.replace(val,val1)
print (line)
print (line1)
ff.write(line1 + "\n")
ff.write('%s:%s \n' % (key, val1))
The usual pythonic way to pad values in Python is by using string formatting and the Format Specification Mini Language
link = 545214569
print('{:0>10}'.format(link))
Your for wordcheck1 in line1: and for workcheck in line: aren't doing what you think. They iterate one character at a time over the lines and assign that character to the workcheck variable.
If you only want to change the input file to have leading zeroes, this can be simplified as:
import re
# Read the whole file into memory
with open('input.txt') as f:
data = f.read()
# Replace all instances of "link:<digits>", passing the digits to a function that
# formats the replacement as a width-10 field, right-justified with zeros as padding.
data = re.sub(r'link:(\d+)', lambda m: 'link:{:0>10}'.format(m.group(1)), data)
with open('output.txt','w') as f:
f.write(data)
output.txt:
dn:id=2150fccc-beb8-42f8-b201-182a6bf5ddfe,ou=test,dc=com
link:0545214569
dn:id=ffa55959-457d-49e6-b4cf-a34eff8bbfb7,ou=test,dc=com
link:0032546897
dn:id=3452a4c3-b768-43f5-8f1e-d33c14787b9b,ou=test,dc=com
link:6547896541
i don't know why you have to open many times. Anyway, open 1 time, then for each line, split by :. the last element in list is the number. Then you know what lenght the digits should consistently b, say 150, then use zfill to padd the 0. then put the lines back by using join
for line in f.readlines():
words = line.split(':')
zeros = 150-len(words[-1])
words[-1] = words[-1].zfill(zeros)
newline = ':'.join(words)
# write this line to file

splitlines() and iterating over an opened file give different results

I have files with sometimes weird end-of-lines characters like \r\r\n. With this, it works like I want:
with open('test.txt', 'wb') as f: # simulate a file with weird end-of-lines
f.write(b'abc\r\r\ndef')
with open('test.txt', 'rb') as f:
for l in f:
print(l)
# b'abc\r\r\n'
# b'def'
I want to able to get the same result from a string. I thought about splitlines but it does not give the same result:
print(b'abc\r\r\ndef'.splitlines())
# [b'abc', b'', b'def']
Even with keepends=True, it's not the same result.
Question: how to have the same behaviour than for l in f with splitlines()?
Linked: Changing str.splitlines to match file readlines and https://bugs.python.org/issue22232
Note: I don't want to put everything in a BytesIO or StringIO, because it does a x0.5 speed performance (already benchmarked); I want to keep a simple string. So it's not a duplicate of How do I wrap a string in a file in Python?.
Why don't you just split it:
input = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
result = input.split(b'\n')
print(result)
[b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']
You will loose the trailing \n that can be added later to every line, if you really need them. On the last line there is a need to check if it is really needed. Like
fixed = [bstr + b'\n' for bstr in result]
if input[-1] != b'\n':
fixed[-1] = fixed[-1][:-1]
print(fixed)
[b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']
Another variant with a generator. This way it will be memory savvy on the huge files and the syntax will be similar to the original for l in bin_split(input) :
def bin_split(input_str):
start = 0
while start>=0 :
found = input_str.find(b'\n', start) + 1
if 0 < found < len(input_str):
yield input_str[start : found]
start = found
else:
yield input_str[start:]
break
There are a couple ways to do this, but none are especially fast.
If you want to keep the line endings, you might try the re module:
lines = re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text)
# or equivalently
line_split_regex = re.compile(r'[\r\n]+|[^\r\n]+[\r\n]*')
lines = line_split_regex.findall(text)
If you need the endings and the file is really big, you may want to iterate instead:
for r in re.finditer(r'[\r\n]+|[^\r\n]+[\r\n]*', text):
line = r.group()
# do stuff with line here
If you don't need the endings, then you can do it much more easily:
lines = list(filter(None, text.splitlines()))
You can omit the list() part if you just iterate over the results (or if using Python2):
for line in filter(None, text.splitlines()):
pass # do stuff with line
I would iterate through like this:
text = "b'abc\r\r\ndef'"
results = text.split('\r\r\n')
for r in results:
print(r)
This is a for l in f: solution:
The key to this is the newline argument on the open call. From the documentation:
[![enter image description here][1]][1]
Therefore, you should use newline='' when writing to suppress newline translation and then when reading use newline='\n', which will work if all your lines terminate with 0 or more '\r' characters followed by a '\n' character:
with open('test.txt', 'w', newline='') as f:
f.write('abc\r\r\ndef')
with open('test.txt', 'r', newline='\n') as f:
for line in f:
print(repr(line))
Prints:
'abc\r\r\n'
'def'
A quasi-splitlines solution:
This strictly speaking not a splitlines solution since to be able to handle arbitrary line endings a regular expression version of split would have to be used capturing the line endings and then re-assembling the lines and their endings. So, instead this solution just uses a regular expression to break up the input text allowing line endings consisting of any number of '\r' characters followed by a '\n' character:
import re
input = '\nabc\r\r\ndef\nghi\r\njkl'
with open('test.txt', 'w', newline='') as f:
f.write(input)
with open('test.txt', 'r', newline='') as f:
text = f.read()
lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', text)
for line in lines:
print(repr(line))
Prints:
'\n'
'abc\r\r\n'
'def\n'
'ghi\r\n'
'jkl'
Regex Demo

How would I read only the first word of each line of a text file?

I wanted to know how I could read ONLY the FIRST WORD of each line in a text file. I tried various codes and tried altering codes but can only manage to read whole lines from a text file.
The code I used is as shown below:
QuizList = []
with open('Quizzes.txt','r') as f:
for line in f:
QuizList.append(line)
line = QuizList[0]
for word in line.split():
print(word)
This refers to an attempt to extract only the first word from the first line. In order to repeat the process for every line i would do the following:
QuizList = []
with open('Quizzes.txt','r') as f:
for line in f:
QuizList.append(line)
capacity = len(QuizList)
capacity = capacity-1
index = 0
while index!=capacity:
line = QuizList[index]
for word in line.split():
print(word)
index = index+1
You are using split at the wrong point, try:
for line in f:
QuizList.append(line.split(None, 1)[0]) # add only first word
Changed to a one-liner that's also more efficient with the strip as Jon Clements suggested in a comment.
with open('Quizzes.txt', 'r') as f:
wordlist = [line.split(None, 1)[0] for line in f]
This is pretty irrelevant to your question, but just so the line.split(None, 1) doesn't confuse you, it's a bit more efficient because it only splits the line 1 time.
From the str.split([sep[, maxsplit]]) docs
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns [].
' 1 2 3 '.split() returns ['1', '2', '3']
and
' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
with Open(filename,"r") as f:
wordlist = [r.split()[0] for r in f]
I'd go for the str.split and similar approaches, but for completness here's one that uses a combination of mmap and re if you needed to extract more complicated data:
import mmap, re
with open('quizzes.txt') as fin:
mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
wordlist = re.findall('^(\w+)', mf, flags=re.M)
You should read one character at a time:
import string
QuizList = []
with open('Quizzes.txt','r') as f:
for line in f:
for i, c in enumerate(line):
if c not in string.letters:
print line[:i]
break
l=[]
with open ('task-1.txt', 'rt') as myfile:
for x in myfile:
l.append(x)
for i in l:
print[i.split()[0] ]

python line separated values in a text when converted to list, adds "\n" to the elements in the list

I was astonished that a thing this simple has been troubling me. Below is the code
list = []
f = open("log.txt", "rb") # log.txt file has line separated values,
for i in f.readlines():
for value in i.split(" "):
list.append(value)
print list
The output is
['xx00', '\n', 'xx01in', '\n', 'xx01na', '\n', 'xx01oz', '\n', 'xx01uk', '\n']
How can I get rid of the new line i.e. '\n'?
list = []
f = open("log.txt", "rb") # log.txt file has line separated values,
for i in f.readlines():
for value in i.strip().split(" "):
list.append(value)
print list
.strip() removes trailing newlines. to be explicit you can use .strip('\n') or .strip('\r\n') in some cases.
you can read more about .strip() here
edit
better way to do what you wanted:
with open("log.txt", 'rb') as f:
mylist = [val for subl in [l.split(' ') for l in f.read().splitlines()] for val in subl]
for an answer which is much easier on the eyes, you can import itertools and use chain to flatten the list of lists, like #Jon Clements example
so it would look like this:
from itertools import chain
with open("log.txt", 'rb') as f:
mylist = list(chain.from_iterable(l.split(' ') for l in f.read().splitlines()))
If line-separated means that there is only one value per line, you don't need split() at all:
with open('log.txt', 'rb') as f:
mylist = map(str.strip, f)
In Python 3 wrap map() in a list().
with open("log.txt", "rb") as f:
mylist = f.read().splitlines()
Also, don't use list as a variable name, as it overshadows the python type list().
The correct way to do this, is:
with open('log.txt') as fin:
for line in fin:
print line.split()
By using split() without an argument, the '\n''s automatically don't become a problem (as split or split(None) uses different rules for splitting).
Or, more concisely:
from itertools import chain
with open('log.txt') as fin:
mylist = list(chain.from_iterable(line.split() for line in fin))
If you have a bunch of lines with space separated values, and you just want a list of all the values without caring about where the line breaks were (which appears to be the case from your example, since you're always appending to the same list regardless of what line you're on), then don't bother looping over lines. Just read the whole file as a single string and call split() with no arguments; it will split the string on any sequence of one or more whitespace characters, including both spaces and newlines, with the result that none of the values will contain any whitespace:
with open('log.txt', 'rb') as f:
values = f.read().split()

How to read a text file into a string variable and strip newlines?

I have a text file that looks like:
ABC
DEF
How can I read the file into a single-line string without newlines, in this case creating a string 'ABCDEF'?
For reading the file into a list of lines, but removing the trailing newline character from each line, see How to read a file without newlines?.
You could use:
with open('data.txt', 'r') as file:
data = file.read().replace('\n', '')
Or if the file content is guaranteed to be one-line
with open('data.txt', 'r') as file:
data = file.read().rstrip()
In Python 3.5 or later, using pathlib you can copy text file contents into a variable and close the file in one line:
from pathlib import Path
txt = Path('data.txt').read_text()
and then you can use str.replace to remove the newlines:
txt = txt.replace('\n', '')
You can read from a file in one line:
str = open('very_Important.txt', 'r').read()
Please note that this does not close the file explicitly.
CPython will close the file when it exits as part of the garbage collection.
But other python implementations won't. To write portable code, it is better to use with or close the file explicitly. Short is not always better. See https://stackoverflow.com/a/7396043/362951
To join all lines into a string and remove new lines, I normally use :
with open('t.txt') as f:
s = " ".join([l.rstrip("\n") for l in f])
with open("data.txt") as myfile:
data="".join(line.rstrip() for line in myfile)
join() will join a list of strings, and rstrip() with no arguments will trim whitespace, including newlines, from the end of strings.
This can be done using the read() method :
text_as_string = open('Your_Text_File.txt', 'r').read()
Or as the default mode itself is 'r' (read) so simply use,
text_as_string = open('Your_Text_File.txt').read()
I'm surprised nobody mentioned splitlines() yet.
with open ("data.txt", "r") as myfile:
data = myfile.read().splitlines()
Variable data is now a list that looks like this when printed:
['LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN', 'GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE']
Note there are no newlines (\n).
At that point, it sounds like you want to print back the lines to console, which you can achieve with a for loop:
for line in data:
print(line)
It's hard to tell exactly what you're after, but something like this should get you started:
with open ("data.txt", "r") as myfile:
data = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
I have fiddled around with this for a while and have prefer to use use read in combination with rstrip. Without rstrip("\n"), Python adds a newline to the end of the string, which in most cases is not very useful.
with open("myfile.txt") as f:
file_content = f.read().rstrip("\n")
print(file_content)
Here are four codes for you to choose one:
with open("my_text_file.txt", "r") as file:
data = file.read().replace("\n", "")
or
with open("my_text_file.txt", "r") as file:
data = "".join(file.read().split("\n"))
or
with open("my_text_file.txt", "r") as file:
data = "".join(file.read().splitlines())
or
with open("my_text_file.txt", "r") as file:
data = "".join([line for line in file])
you can compress this into one into two lines of code!!!
content = open('filepath','r').read().replace('\n',' ')
print(content)
if your file reads:
hello how are you?
who are you?
blank blank
python output
hello how are you? who are you? blank blank
You can also strip each line and concatenate into a final string.
myfile = open("data.txt","r")
data = ""
lines = myfile.readlines()
for line in lines:
data = data + line.strip();
This would also work out just fine.
This is a one line, copy-pasteable solution that also closes the file object:
_ = open('data.txt', 'r'); data = _.read(); _.close()
f = open('data.txt','r')
string = ""
while 1:
line = f.readline()
if not line:break
string += line
f.close()
print(string)
python3: Google "list comprehension" if the square bracket syntax is new to you.
with open('data.txt') as f:
lines = [ line.strip('\n') for line in list(f) ]
Oneliner:
List: "".join([line.rstrip('\n') for line in open('file.txt')])
Generator: "".join((line.rstrip('\n') for line in open('file.txt')))
List is faster than generator but heavier on memory. Generators are slower than lists and is lighter for memory like iterating over lines. In case of "".join(), I think both should work well. .join() function should be removed to get list or generator respectively.
Note: close() / closing of file descriptor probably not needed
Have you tried this?
x = "yourfilename.txt"
y = open(x, 'r').read()
print(y)
To remove line breaks using Python you can use replace function of a string.
This example removes all 3 types of line breaks:
my_string = open('lala.json').read()
print(my_string)
my_string = my_string.replace("\r","").replace("\n","")
print(my_string)
Example file is:
{
"lala": "lulu",
"foo": "bar"
}
You can try it using this replay scenario:
https://repl.it/repls/AnnualJointHardware
I don't feel that anyone addressed the [ ] part of your question. When you read each line into your variable, because there were multiple lines before you replaced the \n with '' you ended up creating a list. If you have a variable of x and print it out just by
x
or print(x)
or str(x)
You will see the entire list with the brackets. If you call each element of the (array of sorts)
x[0]
then it omits the brackets. If you use the str() function you will see just the data and not the '' either.
str(x[0])
Maybe you could try this? I use this in my programs.
Data= open ('data.txt', 'r')
data = Data.readlines()
for i in range(len(data)):
data[i] = data[i].strip()+ ' '
data = ''.join(data).strip()
Regular expression works too:
import re
with open("depression.txt") as f:
l = re.split(' ', re.sub('\n',' ', f.read()))[:-1]
print (l)
['I', 'feel', 'empty', 'and', 'dead', 'inside']
with open('data.txt', 'r') as file:
data = [line.strip('\n') for line in file.readlines()]
data = ''.join(data)
from pathlib import Path
line_lst = Path("to/the/file.txt").read_text().splitlines()
Is the best way to get all the lines of a file, the '\n' are already stripped by the splitlines() (which smartly recognize win/mac/unix lines types).
But if nonetheless you want to strip each lines:
line_lst = [line.strip() for line in txt = Path("to/the/file.txt").read_text().splitlines()]
strip() was just a useful exemple, but you can process your line as you please.
At the end, you just want concatenated text ?
txt = ''.join(Path("to/the/file.txt").read_text().splitlines())
This works:
Change your file to:
LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE
Then:
file = open("file.txt")
line = file.read()
words = line.split()
This creates a list named words that equals:
['LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN', 'GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE']
That got rid of the "\n". To answer the part about the brackets getting in your way, just do this:
for word in words: # Assuming words is the list above
print word # Prints each word in file on a different line
Or:
print words[0] + ",", words[1] # Note that the "+" symbol indicates no spaces
#The comma not in parentheses indicates a space
This returns:
LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN, GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE
with open(player_name, 'r') as myfile:
data=myfile.readline()
list=data.split(" ")
word=list[0]
This code will help you to read the first line and then using the list and split option you can convert the first line word separated by space to be stored in a list.
Than you can easily access any word, or even store it in a string.
You can also do the same thing with using a for loop.
file = open("myfile.txt", "r")
lines = file.readlines()
str = '' #string declaration
for i in range(len(lines)):
str += lines[i].rstrip('\n') + ' '
print str
Try the following:
with open('data.txt', 'r') as myfile:
data = myfile.read()
sentences = data.split('\\n')
for sentence in sentences:
print(sentence)
Caution: It does not remove the \n. It is just for viewing the text as if there were no \n

Categories

Resources