Remove commas and newlines from a text file in Python

I have a text file which looks like this:
ab initio
ab intestato
ab intra
a.C.
acanka, acance, acanek, acankach, acankami, acanką
Achab, Achaba, Achabem, Achabie, Achabowi
I would like to parse every comma-separated word into a list, so that it looks like ['ab initio', 'ab intestato', 'ab intra', 'a.C.', 'acanka', ...]. Note that some entries sit on their own lines and do not end with commas.
When I used list1.append(line.strip()), I got one string per line instead of separate words. Can someone give me some insight into this?
Full code below:
list1 = []
filepath = "words.txt"
with open(filepath, encoding="utf8") as fp:
    line = fp.readline()
    while line:
        list1.append(line.strip(','))
        line = fp.readline()

Very close, but I think you want split instead of strip, and extend instead of append
You can also iterate directly over the lines with a for loop.
list1 = []
filepath = "words.txt"
with open(filepath, encoding="utf8") as fp:
    for line in fp:
        list1.extend(line.strip().split(', '))

You can use your code to get a list of lines and then apply:
cleaned = [ x for y in list1 for x in y.split(',')]
This takes everything you parsed into your list, splits each item at "," and builds a new flat list from the pieces.
sberry's all-in-one solution, which uses no intermediate list, is faster.
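For reference, a minimal self-contained sketch of the split-and-extend idea that also copes with commas that are not followed by a space. The helper name parse_words and the sample text are made up for illustration, and io.StringIO stands in for the real words.txt.

```python
import io

def parse_words(lines):
    """Collect comma-separated entries from an iterable of lines into one flat list."""
    words = []
    for line in lines:
        # Split on bare commas, then strip spaces and the trailing newline from each piece.
        pieces = (piece.strip() for piece in line.split(","))
        # Skip empty pieces produced by blank lines or trailing commas.
        words.extend(piece for piece in pieces if piece)
    return words

# Hypothetical sample standing in for words.txt.
sample = io.StringIO("ab initio\na.C.\nacanka, acance,acanek\n\nAchab, Achaba\n")
words = parse_words(sample)
print(words)
```

With a real file, the same function can be called as parse_words(fp) inside the with open(...) block, since a file object is also an iterable of lines.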

Related

I'm trying to unify words together in my Python application [duplicate]

This question already has answers here:
How to read a file without newlines?
(12 answers)
Closed 5 years ago.
I have a .txt file with values in it.
The values are listed like so:
Value1
Value2
Value3
Value4
My goal is to put the values in a list. When I do so, the list looks like this:
['Value1\n', 'Value2\n', ...]
The \n is not needed.
Here is my code:
t = open('filename.txt')
contents = t.readlines()
This should do what you want (file contents in a list, by line, without \n)
with open(filename) as f:
    mylist = f.read().splitlines()
I'd do this:
alist = [line.rstrip() for line in open('filename.txt')]
or:
with open('filename.txt') as f:
    alist = [line.rstrip() for line in f]
You can use .rstrip('\n') to only remove newlines from the end of the string:
alist = []
for i in contents:
    alist.append(i.rstrip('\n'))
This leaves all other whitespace intact. If you don't care about whitespace at the start and end of your lines, then the big heavy hammer is called .strip().
However, since you are reading from a file and pulling everything into memory anyway, it is better to use the str.splitlines() method. It splits one string on line separators and returns a list of lines without those separators. Use it on the file.read() result and don't use file.readlines() at all:
alist = t.read().splitlines()
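To make the difference between these options concrete, here is a small sketch that runs them on the same data. The sample contents are invented, and io.StringIO is used in place of a real file so the snippet is self-contained.

```python
import io

# Hypothetical file contents; note the trailing spaces on the second line.
f = io.StringIO("Value1\nValue2  \n\nValue4")

raw = f.readlines()                                 # newlines kept
only_newline = [line.rstrip("\n") for line in raw]  # trailing spaces survive
all_whitespace = [line.strip() for line in raw]     # trailing spaces removed

f.seek(0)
via_splitlines = f.read().splitlines()

print(raw)             # ['Value1\n', 'Value2  \n', '\n', 'Value4']
print(only_newline)    # ['Value1', 'Value2  ', '', 'Value4']
print(all_whitespace)  # ['Value1', 'Value2', '', 'Value4']
print(via_splitlines == only_newline)  # True for this sample
```

For this sample, read().splitlines() and rstrip('\n') give the same list, while strip() additionally drops the trailing spaces.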
After opening the file, a list comprehension can do this in one line:
fh = open('filename')
newlist = [line.rstrip() for line in fh.readlines()]
fh.close()
Just remember to close your file afterwards.
I used the strip function to get rid of the newline character, as splitlines() was throwing memory errors on a 4 GB file.
Sample code:
with open('C:\\aapl.csv', 'r') as apple:
    # iterate over the file lazily instead of loading every line with readlines()
    for apps in apple:
        print(apps.strip())
For each string in your list, use .strip(), which removes whitespace from the beginning and end of the string:
alist = []
for i in contents:
    alist.append(i.strip())
But depending on your use case, you might be better off using something like numpy.loadtxt or even numpy.genfromtxt if you need a nice array of the data you're reading from the file.
with open('bvc.txt') as f:
    # str.rstrip replaces the Python 2-only string.rstrip function
    alist = list(map(str.rstrip, f))
Nota bene: rstrip() without an argument removes trailing whitespace, that is to say \f, \n, \r, \t, \v and the space character,
but I suppose you only want to keep the significant characters in the lines. In that case map(str.strip, f) will fit better, since it removes the leading whitespace too.
If you really want to remove only the line-ending characters NL (\n) and CR (\r), do:
with open('bvc.txt') as f:
    alist = f.read().splitlines()
splitlines() called without an argument doesn't keep the NL and CR characters (Windows writes files with CR NL, i.e. \r\n, at the end of lines, at least on my machine) but keeps the other whitespace, notably spaces and tabs.
with open('bvc.txt') as f:
    alist = f.read().splitlines(True)
has essentially the same effect as
with open('bvc.txt') as f:
    alist = f.readlines()
that is to say, the line endings are kept.
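A short sketch of the keepends behaviour described above, run on an in-memory string rather than a file. The sample string is invented and deliberately mixes Windows-style and Unix-style line endings.

```python
# Hypothetical text with a \r\n ending, a \n ending, and no ending on the last line.
raw = "a\r\nb\tc \nd"

without_endings = raw.splitlines()       # line endings dropped, tabs and spaces kept
with_endings = raw.splitlines(True)      # keepends=True leaves the endings in place

print(without_endings)  # ['a', 'b\tc ', 'd']
print(with_endings)     # ['a\r\n', 'b\tc \n', 'd']
```

Note that a file opened in text mode with the default newline handling usually has \r\n translated to \n before your code sees it, so the \r only shows up like this when the string comes from elsewhere or the file was opened with newline=''.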
I had the same problem and found the following solution to be very efficient. I hope it helps you or anyone else who wants to do the same thing.
First of all, I would start with a "with" statement, as it ensures the file is properly opened and closed.
It should look something like this:
with open("filename.txt", "r") as f:
    contents = [x.strip() for x in f.readlines()]
If you want to convert those strings (every item in the contents list is a string) to integers or floats, you can do the following:
contents = [float(x) for x in contents]
Use int instead of float if you want to convert to integer.
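As a hedged illustration of the conversion step, the sketch below reads numeric lines and skips blank ones, which would otherwise make float() raise ValueError. The sample values are invented and io.StringIO stands in for the file.

```python
import io

# Hypothetical numeric file with a blank line and stray spaces.
f = io.StringIO("1.5\n2\n\n 3.25 \n")

# float() tolerates surrounding whitespace, so only empty lines need filtering.
values = [float(line) for line in f if line.strip()]

print(values)  # [1.5, 2.0, 3.25]
```

Replacing float with int works the same way for whole numbers, although int("1.5") raises ValueError, so the choice depends on the data.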
It's my first answer in SO, so sorry if it's not in the proper formatting.
I recently used this to read all the lines from a file:
alist = open('maze.txt').read().split()
or you can use this for that little bit of extra added safety:
with open('maze.txt') as f:
    alist = f.read().split()
It doesn't work with whitespace in-between text in a single line, but it looks like your example file might not have whitespace splitting the values. It is a simple solution and it returns an accurate list of values, and does not add an empty string: '' for every empty line, such as a newline at the end of the file.
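The trade-off described above can be seen in a small sketch comparing split() with splitlines() on the same text. The sample text is made up for illustration.

```python
# Hypothetical contents: one value contains a space, and there is an empty line.
text = "Value1\nValue 2\n\nValue3\n"

by_split = text.split()            # splits on any run of whitespace
by_splitlines = text.splitlines()  # splits on line boundaries only

print(by_split)       # ['Value1', 'Value', '2', 'Value3']
print(by_splitlines)  # ['Value1', 'Value 2', '', 'Value3']
```

So split() avoids empty strings but breaks values that contain spaces, while splitlines() keeps each line whole and keeps empty lines as ''.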
with open('D:\\file.txt', 'r') as f1:
    lines = f1.readlines()
    lines = [s[:-1] for s in lines]
The easiest way to do this is to write file.readline()[0:-1]
This will read everything except the last character, which is normally the newline (the final line of a file may not end with one).
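One caveat with slicing off the last character is worth sketching: if the final line has no trailing newline, the slice removes a real character. The list below is an invented stand-in for what readlines() might return.

```python
# Hypothetical readlines() result where the last line lacks a trailing newline.
lines = ["Value1\n", "Value2\n", "Value3"]

sliced = [s[:-1] for s in lines]             # blindly drops the last character
stripped = [s.rstrip("\n") for s in lines]   # drops only trailing newlines

print(sliced)    # ['Value1', 'Value2', 'Value']
print(stripped)  # ['Value1', 'Value2', 'Value3']
```

For that reason rstrip('\n') is generally the safer choice when the file may or may not end with a newline.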

How to create a list from a text file in Python

I have a text file called "test", and I would like to create a list in Python and print it. I have the following code, but it does not print a list of words; it prints the whole document in one line.
file = open("test", 'r')
lines = file.readlines()
my_list = [line.split(' , ') for line in open("test")]
print(my_list)
You could do
my_list = open("filename.txt").readlines()
When you do this:
file = open("test", 'r')
lines = file.readlines()
lines is a list of lines. If you want to get a list of words for each line you can do:
list_word = []
for l in lines:
    list_word.append(l.split(" "))
I believe you are trying to achieve something like this:
data = [word.split(',') for word in open("test", 'r').readlines()]
It would also help if you specified what type of text file you are trying to read, as there are several modules (e.g. csv) that would produce the result in a much simpler way.
As pointed out, you may also want to strip the newline (depending on which line ending you are using), and you'll get something like this:
data = [word.strip('\n').split(',') for word in open("test", 'r').readlines()]
This produces a list of lines, each of which is a list of words.
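If a single flat list of words is wanted instead of one list per line, the nested result can be flattened. This is a sketch under the assumption that the file is comma-separated; the sample text is invented and io.StringIO replaces the real "test" file.

```python
import io
from itertools import chain

# Hypothetical comma-separated contents.
source = io.StringIO("a, b, c\nd, e\n")

# One inner list per line, with spaces and newlines stripped from each word.
nested = [[word.strip() for word in line.split(",")] for line in source]

# Flatten the list of lists into a single list of words.
flat = list(chain.from_iterable(nested))

print(nested)  # [['a', 'b', 'c'], ['d', 'e']]
print(flat)    # ['a', 'b', 'c', 'd', 'e']
```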

Python - file contents to nested list

I have a file in tab delimited format with trailing newline characters, e.g.,
123 abc
456 def
789 ghi
I wish to write a function to convert the contents of the file into a nested list. To date I have tried:
def ls_platform_ann():
    keyword = []
    for line in open("file", "r").readlines():
        for value in line.split():
            keyword.append(value)
and
def nested_list_input():
    nested_list = []
    for line in open("file", "r").readlines():
        for entry in line.strip().split():
            nested_list.append(entry)
            print(nested_list)
The former creates a nested list but includes \n and \t characters. The latter does not make a nested list but rather lots of equivalent lists without \n and \t characters.
Anyone help?
Regards,
S ;-)
You want the csv module.
import csv
source = "123\tabc\n456\tdef\n789\tghi"
lines = source.split("\n")
reader = csv.reader(lines, delimiter='\t')
print(list(reader))  # each row is already a list of fields
Output:
[['123', 'abc'], ['456', 'def'], ['789', 'ghi']]
In the code above I've put the content of the file right in there for easy testing. If you're reading a file from disk you can do this as well (which might be considered cleaner):
import csv
with open("source.csv", newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    print(list(reader))
Another option that doesn't involve the csv module is:
data = [[item.strip() for item in line.rstrip('\r\n').split('\t')] for line in open('input.txt')]
As a multiple line statement it would look like this:
data = []
for line in open('input.txt'):
    items = line.rstrip('\r\n').split('\t')  # strip new-line characters and split on column delimiter
    items = [item.strip() for item in items]  # strip extra whitespace off data items
    data.append(items)
First off, have a look at the csv module, it should handle the whitespace for you. You may also want to call strip() on value/entry.
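One reason to prefer the csv module over a plain split('\t') is quoting. The sketch below uses invented data in which one field is quoted and contains a tab; it is only meant to show the difference, not to describe the asker's actual file.

```python
import csv
import io

# Hypothetical tab-delimited data; the second row has a quoted field with an embedded tab.
source = '123\tabc\n456\t"d\tef"\n'

# csv.reader honours the quotes and keeps the embedded tab inside one field.
rows = list(csv.reader(io.StringIO(source, newline=""), delimiter="\t"))

# A naive split treats every tab as a delimiter and leaves the quote characters in.
naive = [line.split("\t") for line in source.splitlines()]

print(rows)   # [['123', 'abc'], ['456', 'd\tef']]
print(naive)  # [['123', 'abc'], ['456', '"d', 'ef"']]
```

If the data never contains quoted fields, both approaches give the same nested list, and the split-based version has no extra import.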
