Remove regex elements from list - python

I use python 2.7.
I have data in file 'a':
myname1#abc.com;description1
myname2#abc.org;description2
myname3#this_is_ok.ok;description3
myname5#qwe.in;description4
myname4#qwe.org;description5
abc#ok.ok;description7
I read this file like:
with open('a', 'r') as f:
    data = [x.strip() for x in f.readlines()]
I have a list named bad:
bad = ['abc', 'qwe'] # could be more than 20 elements
Now I'm trying to remove all lines that have 'abc' or 'qwe' right after the # and write the rest to newfile.
So newfile should contain only 2 lines:
myname3#this_is_ok.ok;description3
abc#ok.ok;description7
I've been trying to use the regexp (.?)#(.?);(.*) to get groups, but I don't know what to do next.
Please advise!

Here's a non-regex solution:
bad = set(['abc', 'qwe'])
with open('a', 'r') as f:
    # keep only the lines whose label right after '#' is NOT in bad
    data = [line.strip() for line in f if line.split('#')[1].split('.')[0] not in bad]
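The same split-based filter can be tried on in-memory lines. This is only a minimal sketch using the sample data from the question; keep_line is a hypothetical helper name, and keeping lines that have no # at all is an assumption the question does not cover.

```python
def keep_line(line, bad):
    """Return True when the label right after '#' is not in `bad`."""
    _, sep, rest = line.partition('#')
    if not sep:
        return True  # assumption: lines without '#' are kept
    label = rest.split('.', 1)[0]  # text between '#' and the first '.'
    return label not in bad


lines = [
    'myname1#abc.com;description1',
    'myname2#abc.org;description2',
    'myname3#this_is_ok.ok;description3',
    'myname5#qwe.in;description4',
    'myname4#qwe.org;description5',
    'abc#ok.ok;description7',
]
bad = {'abc', 'qwe'}

kept = [line for line in lines if keep_line(line, bad)]
print(kept)
```

Because only the text after the # is tested, the last sample line survives even though it starts with abc.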

import re
bad = ['abc', 'qwe']
with open('a') as f:
    print [line.strip()
           for line in f
           if not re.search('|'.join(bad), line.partition('#')[2])]
This solution works as long as bad only contains normal characters, e.g. letters, numbers and underscores, and nothing that interferes with the regular expression such as 'a|b', as @phihag pointed out.

The regexp .? matches either no or one character. You want .*?, which is a lazy match of multiple characters:
import re
bad = ['abc', 'qwe']
filterf = re.compile('(.*?)#(?!' + '|'.join(map(re.escape, bad)) + ')').match
with open('a') as inf, open('newfile', 'w') as outf:
    outf.writelines(filter(filterf, inf))
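To see why the re.escape call matters, here is a small sketch on in-memory data. The extra entry 'a.b' in bad and the last sample line are made up for illustration; they are not part of the question.

```python
import re

lines = [
    'myname1#abc.com;description1',
    'myname2#abc.org;description2',
    'myname3#this_is_ok.ok;description3',
    'myname5#qwe.in;description4',
    'myname4#qwe.org;description5',
    'abc#ok.ok;description7',
    'other#aXb.net;description8',  # hypothetical line, added for illustration
]
bad = ['abc', 'qwe', 'a.b']  # 'a.b' is a made-up entry with a regex metacharacter

# Same pattern shape as in the answer: lazy prefix, '#', then a negative lookahead.
escaped = re.compile('(.*?)#(?!' + '|'.join(map(re.escape, bad)) + ')')
# Without escaping, the '.' in 'a.b' matches any character.
unescaped = re.compile('(.*?)#(?!' + '|'.join(bad) + ')')

kept = [line for line in lines if escaped.match(line)]
kept_unescaped = [line for line in lines if unescaped.match(line)]

print(kept)
print(kept_unescaped)
```

With escaping, the aXb.net line is kept; without it, the unescaped dot makes 'a.b' match 'aXb' and the line is dropped by mistake.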

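I have used a regular expression to remove the lines that contain #abc or #qwe. I am not sure it is the right method:
import re
with open('testFile.txt', 'r') as f:
    data = [x.strip() for x in f.readlines() if re.match(r'.*#([^abc|qwe]+)\..*;.*',x)]
print data
Now data holds the lines that do not have '#abc' or '#qwe'. Be aware that [^abc|qwe] is a character class, not an alternation: it rejects any line whose label after the # contains one of the characters a, b, c, q, w, e or |, so it only happens to work on this sample.
Or, based on astynax's suggestion, use a negative lookahead:
    data = [x.strip() for x in f.readlines() if re.search(r'.*#(?!abc|qwe)',x)]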
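The difference between the two patterns in this answer shows up as soon as a domain contains one of the letters inside the brackets. A quick sketch, where the second test line is a made-up example:

```python
import re

char_class = re.compile(r'.*#([^abc|qwe]+)\..*;.*')  # character class, not alternation
lookahead = re.compile(r'.*#(?!abc|qwe)')            # negative lookahead

blocked = 'myname1#abc.com;description1'
harmless = 'someone#web.de;description'  # hypothetical line, not in the question

# Both patterns reject the line that should be removed.
both_reject_blocked = (char_class.match(blocked) is None
                       and lookahead.search(blocked) is None)

# 'web' contains 'w', 'e' and 'b', so the character class cannot match it.
class_keeps_harmless = char_class.match(harmless) is not None
lookahead_keeps_harmless = lookahead.search(harmless) is not None

print(both_reject_blocked, class_keeps_harmless, lookahead_keeps_harmless)
```

So the lookahead version is the safer of the two once the data goes beyond the sample.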

Related

Find specific word and print till newline in Python

I have a string and I want to match something that starts with a specific word and ends with a newline. How can this be done?
Website:https://www.abc1.xyz/
Product:Apparal
TM Link:https://www.abc2.xyz/
Other Link:https://www.abc3.xyz/
I want to extract [Website,Product,TM Link,Other Link] and save this in a CSV.
I am new to writing regular expressions. If anyone has a good solution to this, that would be awesome!
No need for regex, two splits do the trick here too
s = """Website:https://www.abc1.xyz/
Product:Apparal
TM Link:https://www.abc2.xyz/
Other Link:https://www.abc3.xyz/"""
res = [line.split(':')[0] for line in s.split('\n')]
with open('file.csv', 'w') as f:
    f.write(','.join(res))
If you want to get the part after the first colon:
s = """Website:https://www.abc1.xyz/
Product:Apparal
TM Link:https://www.abc2.xyz/
Other Link:https://www.abc3.xyz/"""
res = [line.split(':', maxsplit=1)[1] for line in s.split('\n')]
with open('file.csv', 'w') as f:
    f.write(','.join(res))
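If both the names and the values are needed, str.partition splits each line at the first colon only, and the csv module takes care of quoting. This is just a sketch: the io.StringIO buffer stands in for a real file, which would be opened with open('file.csv', 'w', newline='').

```python
import csv
import io

s = """Website:https://www.abc1.xyz/
Product:Apparal
TM Link:https://www.abc2.xyz/
Other Link:https://www.abc3.xyz/"""

# partition(':') returns (head, ':', tail); [::2] keeps head and tail.
pairs = [line.partition(':')[::2] for line in s.splitlines() if line]
keys = [key for key, _ in pairs]
values = [value for _, value in pairs]

buffer = io.StringIO()  # stand-in for a file opened with newline=''
writer = csv.writer(buffer)
writer.writerow(keys)    # header row
writer.writerow(values)  # one data row
print(buffer.getvalue())
```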

splitlines() and iterating over an opened file give different results

I have files that sometimes contain unusual end-of-line characters such as \r\r\n. Iterating over the opened file works the way I want:
with open('test.txt', 'wb') as f:  # simulate a file with unusual end-of-lines
    f.write(b'abc\r\r\ndef')
with open('test.txt', 'rb') as f:
    for l in f:
        print(l)
# b'abc\r\r\n'
# b'def'
I want to be able to get the same result from a string. I thought about splitlines, but it does not give the same result:
print(b'abc\r\r\ndef'.splitlines())
# [b'abc', b'', b'def']
Even with keepends=True, it's not the same result.
Question: how can I get the same behaviour as for l in f with splitlines()?
Linked: Changing str.splitlines to match file readlines and https://bugs.python.org/issue22232
Note: I don't want to wrap everything in a BytesIO or StringIO, because that roughly halves the speed (already benchmarked); I want to keep a simple string. So this is not a duplicate of How do I wrap a string in a file in Python?.
Why don't you just split it:
input = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
result = input.split(b'\n')
print(result)
[b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']
You will lose the trailing \n, which can be added back to every line if you really need it. The last piece needs special handling, depending on whether the input ended with \n:
fixed = [bstr + b'\n' for bstr in result]
if input.endswith(b'\n'):
    fixed.pop()  # split() left an empty piece after the final \n
else:
    fixed[-1] = fixed[-1][:-1]  # the last line had no \n to begin with
print(fixed)
[b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']
Another variant uses a generator. This way it is memory-efficient on huge inputs, and the syntax is similar to the original: for l in bin_split(input):
def bin_split(input_str):
    start = 0
    while start >= 0:
        found = input_str.find(b'\n', start) + 1
        if 0 < found < len(input_str):
            yield input_str[start:found]
            start = found
        else:
            yield input_str[start:]
            break
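A slightly different way to write such a generator, checked against file-style iteration, might look like the sketch below. iter_lines is a hypothetical name; io.BytesIO is used here only to verify the result, not as part of the solution.

```python
import io


def iter_lines(data):
    """Yield pieces of `data` that end at each b'\\n', like `for l in f` does."""
    start = 0
    while start < len(data):
        end = data.find(b'\n', start)
        if end == -1:
            yield data[start:]  # last line without a trailing newline
            return
        yield data[start:end + 1]
        start = end + 1


samples = [b'', b'a\n', b'abc\r\r\ndef', b'\nabc\r\r\r\nd\ref\nghi\r\njkl']

# Compare with iterating over a binary file-like object.
matches = all(list(iter_lines(s)) == list(io.BytesIO(s)) for s in samples)
print(list(iter_lines(b'abc\r\r\ndef')), matches)
```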
There are a couple ways to do this, but none are especially fast.
If you want to keep the line endings, you might try the re module:
lines = re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text)
# or equivalently
line_split_regex = re.compile(r'[\r\n]+|[^\r\n]+[\r\n]*')
lines = line_split_regex.findall(text)
If you need the endings and the file is really big, you may want to iterate instead:
for r in re.finditer(r'[\r\n]+|[^\r\n]+[\r\n]*', text):
    line = r.group()
    # do stuff with line here
If you don't need the endings, then you can do it much more easily:
lines = list(filter(None, text.splitlines()))
You can omit the list() part if you just iterate over the results (or if using Python2):
for line in filter(None, text.splitlines()):
    pass # do stuff with line
I would iterate through like this:
text = 'abc\r\r\ndef'
results = text.split('\r\r\n')
for r in results:
    print(r)
This is a for l in f: solution:
The key to this is the newline argument of the open call. According to the documentation, when reading with newline='\n', lines are terminated only by '\n' and the line ending is returned to the caller untranslated; when writing with newline='' or newline='\n', no translation takes place.
Therefore, you should use newline='' when writing to suppress newline translation, and newline='\n' when reading, which works if all your lines terminate with zero or more '\r' characters followed by a '\n' character:
with open('test.txt', 'w', newline='') as f:
    f.write('abc\r\r\ndef')
with open('test.txt', 'r', newline='\n') as f:
    for line in f:
        print(repr(line))
Prints:
'abc\r\r\n'
'def'
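For comparison, the sketch below reads the same content twice, once with the default newline handling and once with newline='\n'. It writes to a temporary directory so it leaves nothing behind; the file name is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.txt')

    with open(path, 'w', newline='') as f:    # '': write characters untranslated
        f.write('abc\r\r\ndef')

    with open(path, 'r') as f:                # default: universal newlines
        default_lines = list(f)

    with open(path, 'r', newline='\n') as f:  # only '\n' terminates a line
        strict_lines = list(f)

print(default_lines)
print(strict_lines)
```

With the default, each lone '\r' is treated as a line ending of its own, which is why an extra empty line appears there.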
A quasi-splitlines solution:
Strictly speaking, this is not a splitlines solution: to handle arbitrary line endings, a regular-expression version of split would have to capture the line endings and then re-assemble the lines with their endings. Instead, this solution just uses a regular expression to break up the input text, allowing line endings that consist of any number of '\r' characters followed by a '\n' character:
import re
input = '\nabc\r\r\ndef\nghi\r\njkl'
with open('test.txt', 'w', newline='') as f:
    f.write(input)
with open('test.txt', 'r', newline='') as f:
    text = f.read()
lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', text)
for line in lines:
    print(repr(line))
Prints:
'\n'
'abc\r\r\n'
'def\n'
'ghi\r\n'
'jkl'

Extracting certain strings from a file using Python

I have a file with some lines, and I only want the lines that start with xxx. Those lines have the following pattern:
xxx:(12:"pqrs",223,"rst",-90)
xxx:(23:"abc",111,"def",-80)
I want to extract only the string inside the first pair of double quotes, i.e. "pqrs" and "abc".
Any help using regex is appreciated.
My code is as follows:
with open("log.txt","r") as f:
         f = f.readlines()
    for line in f:
        line=line.rstrip()
        for phrase in 'xxx:':
            if re.match('^xxx:',line):
                c=line
                break
This code gives me an error.
Your code is wrongly indented. Your f = f.readlines() has 9 spaces in front while for line in f: has 4 spaces. It should look like below.
import re
list_of_prefixes = ["xxx","aaa"]
resulting_list = []
with open("raw.txt","r") as f:
    f = f.readlines()
    for line in f:
        line = line.rstrip()
        for phrase in list_of_prefixes:
            # capture the word inside the first pair of double quotes
            m = re.match(phrase + r':\(\d+:"(\w+)', line)
            if m is not None:
                resulting_list.append(m.group(1))
Well you are heading in the right direction.
If the input is this simple, you can use regex groups.
import re
with open("log.txt","r") as f:
    f = f.readlines()
for line in f:
    line = line.rstrip()
    m = re.match(r'^xxx:\(\d*:("[^"]*")', line)
    if m is not None:
        print(m.group(1))
All the magic is in the regular expression.
^xxx:\(\d*:("[^"]*") means
Start from the beginning of the line, match on "xxx:(<any number of digits>:"<anything but ">"
and because the sequence "<anything but ">" is enclosed in round brackets it will be available as a group (by calling m.group(1)).
PS: next time make sure to include the exact error you are getting
results = []
with open("log.txt","r") as f:
    f = f.readlines()
for line in f:
    if line.startswith("xxx"):
        parts = line.split(":")  # e.g. ['xxx', '(12', '"pqrs",223,"rst",-90)']
        result = parts[2].split(",")[0][1:-1]  # will be pqrs
        results.append(result)
You want to look for lines that start with xxx, then split the line on the colons. The piece after the second : starts with what you want, up to the first comma. Your result is that string with the quotes removed. There is no need for regex; Python string methods will be fine.
To check if a line starts with xxx do
line.startswith('xxx')
To find the text in first double-quotes do
re.search(r'"(.*?)"', line).group(1)
(as match.group(1) is the first parenthesized subgroup)
So the code will be
import re
with open("file") as f:
    for line in f:
        if line.startswith('xxx'):
            print(re.search(r'"(.*?)"', line).group(1))
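The prefix check and the quote extraction can also be folded into one compiled pattern. A minimal sketch on in-memory lines; FIRST_QUOTED is a hypothetical name, and the yyy line is made up to show that other prefixes are skipped.

```python
import re

# 'xxx:(' then digits, ':', then capture the text inside the first quotes.
FIRST_QUOTED = re.compile(r'^xxx:\(\d+:"([^"]*)"')

lines = [
    'xxx:(12:"pqrs",223,"rst",-90)',
    'yyy:(99:"skip",1,"me",0)',  # hypothetical line with a different prefix
    'xxx:(23:"abc",111,"def",-80)',
]

found = []
for line in lines:
    m = FIRST_QUOTED.match(line)
    if m:  # lines that do not fit the pattern are ignored
        found.append(m.group(1))

print(found)
```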
re module docs

How to replace a character in some specific word in a text file using python

I have a task to replace "O" (capital O) with "0" in a text file using Python. One condition is that I have to preserve other words like Over, NATO, etc. I only have to replace the O in words like 9OO (to 900), 2OO6 (to 2006) and so on. I have tried a lot but without success. My code is given below. Can anyone help? Thanks in advance.
import re
srcpatt = 'O'
rplpatt = '0'
cre = re.compile(srcpatt)
with open('myfile.txt', 'r') as file:
    content = file.read()
wordlist = re.findall(r'(\d+O|O\d+)',str(content))
print(wordlist)
for word in wordlist:
    subcontent = cre.sub(rplpatt, word)
    newrep = re.compile(word)
    newcontent = newrep.sub(subcontent,content)
with open('myfile.txt', 'w') as file:
    file.write(newcontent)
print('"',srcpatt,'" is successfully replaced by "',rplpatt,'"')
re.sub can take in a replacement function, so we can pare this down pretty nicely:
import re
with open('myfile.txt', 'r') as file:
    content = file.read()
with open('myfile.txt', 'w') as file:
    file.write(re.sub(r'\d+[\dO]+|[\dO]+\d+', lambda m: m.group().replace('O', '0'), content))
import re
srcpatt = 'O'
rplpatt = '0'
reg = r'\b(\d*)O(O*\d*)\b'
with open('input', 'r') as f:
    for line in f:
        # re.search, not re.match: the token may appear anywhere in the line
        while re.search(reg, line):
            line = re.sub(reg, r'\g<1>0\2', line)
        print(line, end='')  # the line already ends with a newline
print('"',srcpatt,'" is successfully replaced by "',rplpatt,'"')
You can probably get away with matching just a leading digit followed by O. This won't handle OO7, but it will work nicely with 8080, for example, which none of the answers here that match on trailing digits will. If you want to handle that case too, you need a lookahead match.
re.sub(r'(\d)(O+)', lambda m: m.groups()[0] + '0'*len(m.groups()[1]), content)
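One way such a lookahead could look is sketched below. It is an assumption about what counts as a number here: a whole word made only of digits and capital O that contains at least one digit. fix_zeros and the sample sentence are made up for illustration.

```python
import re

# A word made only of digits and 'O' ...
# ... where the lookahead (?=[\dO]*\d) requires at least one real digit.
TOKEN = re.compile(r'\b(?=[\dO]*\d)[\dO]+\b')


def fix_zeros(text):
    """Replace 'O' with '0' inside digit/O-only words that contain a digit."""
    return TOKEN.sub(lambda m: m.group().replace('O', '0'), text)


sample = 'Over 9OO NATO 2OO6 OO7 8O8O O'
fixed = fix_zeros(sample)
print(fixed)
```

Words such as Over and NATO are left alone because they contain letters other than O, and a lone O is left alone because it has no digit. A token like O2 would be changed to 02, though, so this is not suitable for text where that is a legitimate word.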

How to read a text file into a string variable and strip newlines?

I have a text file that looks like:
ABC
DEF
How can I read the file into a single-line string without newlines, in this case creating a string 'ABCDEF'?
For reading the file into a list of lines, but removing the trailing newline character from each line, see How to read a file without newlines?.
You could use:
with open('data.txt', 'r') as file:
    data = file.read().replace('\n', '')
Or, if the file content is guaranteed to be one line:
with open('data.txt', 'r') as file:
    data = file.read().rstrip()
In Python 3.5 or later, using pathlib you can copy text file contents into a variable and close the file in one line:
from pathlib import Path
txt = Path('data.txt').read_text()
and then you can use str.replace to remove the newlines:
txt = txt.replace('\n', '')
You can read from a file in one line:
text = open('very_Important.txt', 'r').read()
Please note that this does not close the file explicitly.
CPython closes the file when the file object is garbage-collected, but other Python implementations may not do so promptly. To write portable code, it is better to use with or to close the file explicitly. Short is not always better. See https://stackoverflow.com/a/7396043/362951
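A small sketch of the with form, which also confirms that the file is closed once the block ends. It uses a temporary directory so it can run anywhere; the file name and content are assumptions based on the question.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.txt')  # hypothetical file for the demo

    with open(path, 'w') as out:
        out.write('ABC\nDEF\n')

    with open(path) as f:
        data = f.read().replace('\n', '')

    # The with statement has already closed the file at this point.
    closed_after_block = f.closed

print(data, closed_after_block)
```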
To join all lines into a string and remove new lines, I normally use :
with open('t.txt') as f:
    s = " ".join([l.rstrip("\n") for l in f])
with open("data.txt") as myfile:
    data = "".join(line.rstrip() for line in myfile)
join() will join a list of strings, and rstrip() with no arguments will trim whitespace, including newlines, from the end of strings.
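The difference between rstrip() and rstrip('\n') only shows when a line has other trailing whitespace. A tiny sketch with a made-up line:

```python
line = '  ABC \t\n'  # made-up line with trailing space, tab and newline

stripped_all = line.rstrip()          # removes all trailing whitespace
stripped_newline = line.rstrip('\n')  # removes only trailing newlines

print(repr(stripped_all), repr(stripped_newline))
```

Leading whitespace is untouched in both cases; use strip() if that should go too.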
This can be done using the read() method :
text_as_string = open('Your_Text_File.txt', 'r').read()
Or as the default mode itself is 'r' (read) so simply use,
text_as_string = open('Your_Text_File.txt').read()
I'm surprised nobody mentioned splitlines() yet.
with open("data.txt", "r") as myfile:
    data = myfile.read().splitlines()
Variable data is now a list that looks like this when printed:
['LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN', 'GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE']
Note there are no newlines (\n).
At that point, it sounds like you want to print back the lines to console, which you can achieve with a for loop:
for line in data:
    print(line)
It's hard to tell exactly what you're after, but something like this should get you started:
with open("data.txt", "r") as myfile:
    data = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
I have fiddled around with this for a while and prefer to use read in combination with rstrip. Without rstrip("\n"), the string keeps the file's trailing newline, which in most cases is not very useful.
with open("myfile.txt") as f:
    file_content = f.read().rstrip("\n")
    print(file_content)
Here are four snippets to choose from:
with open("my_text_file.txt", "r") as file:
    data = file.read().replace("\n", "")
or
with open("my_text_file.txt", "r") as file:
    data = "".join(file.read().split("\n"))
or
with open("my_text_file.txt", "r") as file:
    data = "".join(file.read().splitlines())
or
with open("my_text_file.txt", "r") as file:
    data = "".join([line.rstrip("\n") for line in file])
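To check that these variants agree, here is a sketch that runs them on the same content. io.StringIO stands in for the opened file, and the content is assumed from the question.

```python
import io

content = 'ABC\nDEF\n'  # assumed file content

results = {
    'replace': io.StringIO(content).read().replace('\n', ''),
    'split': ''.join(io.StringIO(content).read().split('\n')),
    'splitlines': ''.join(io.StringIO(content).read().splitlines()),
    'rstrip per line': ''.join(line.rstrip('\n') for line in io.StringIO(content)),
}
print(results)

# For a string that still contains Windows line endings, only splitlines()
# removes the '\r' as well.
crlf = 'ABC\r\nDEF\r\n'
crlf_replace = crlf.replace('\n', '')
crlf_splitlines = ''.join(crlf.splitlines())
print(repr(crlf_replace), repr(crlf_splitlines))
```

When a file is opened in text mode with the default settings, '\r\n' is already translated to '\n' on reading, so the second part mostly matters for strings that come from elsewhere.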
You can compress this into two lines of code:
content = open('filepath','r').read().replace('\n',' ')
print(content)
If your file reads:
hello how are you?
who are you?
blank blank
the Python output is:
hello how are you? who are you? blank blank
You can also strip each line and concatenate into a final string.
myfile = open("data.txt","r")
data = ""
lines = myfile.readlines()
for line in lines:
    data = data + line.strip()
This would also work out just fine.
This is a one line, copy-pasteable solution that also closes the file object:
_ = open('data.txt', 'r'); data = _.read(); _.close()
f = open('data.txt','r')
string = ""
while 1:
    line = f.readline()
    if not line: break
    string += line
f.close()
print(string)
python3: Google "list comprehension" if the square bracket syntax is new to you.
with open('data.txt') as f:
    lines = [line.strip('\n') for line in f]
One-liners:
List: "".join([line.rstrip('\n') for line in open('file.txt')])
Generator: "".join((line.rstrip('\n') for line in open('file.txt')))
A list comprehension is usually faster than a generator expression but heavier on memory; a generator is slower but lighter on memory, like iterating over the lines. For "".join(), I think both should work well. Remove the "".join() call to get the list or the generator itself.
Note: explicitly closing the file is probably not needed here.
Have you tried this?
x = "yourfilename.txt"
y = open(x, 'r').read()
print(y)
To remove line breaks using Python you can use replace function of a string.
This example removes all 3 types of line breaks:
my_string = open('lala.json').read()
print(my_string)
my_string = my_string.replace("\r","").replace("\n","")
print(my_string)
Example file is:
{
"lala": "lulu",
"foo": "bar"
}
You can try it using this replay scenario:
https://repl.it/repls/AnnualJointHardware
I don't feel that anyone addressed the [ ] part of your question. When you read each line into your variable, because there were multiple lines before you replaced the \n with '' you ended up creating a list. If you have a variable of x and print it out just by
x
or print(x)
or str(x)
You will see the entire list with the brackets. If you call each element of the (array of sorts)
x[0]
then it omits the brackets. If you use the str() function you will see just the data and not the '' either.
str(x[0])
Maybe you could try this? I use this in my programs.
Data = open('data.txt', 'r')
data = Data.readlines()
for i in range(len(data)):
    data[i] = data[i].strip() + ' '
data = ''.join(data).strip()
Regular expression works too:
import re
with open("depression.txt") as f:
    l = re.split(' ', re.sub('\n',' ', f.read()))[:-1]
print(l)
['I', 'feel', 'empty', 'and', 'dead', 'inside']
with open('data.txt', 'r') as file:
    data = [line.strip('\n') for line in file.readlines()]
    data = ''.join(data)
from pathlib import Path
line_lst = Path("to/the/file.txt").read_text().splitlines()
This is the best way to get all the lines of a file; the '\n' characters are already stripped by splitlines(), which recognizes Windows, Mac and Unix line endings.
But if you nonetheless want to strip each line:
line_lst = [line.strip() for line in Path("to/the/file.txt").read_text().splitlines()]
strip() is just a useful example; you can process each line as you please.
In the end, do you just want the concatenated text?
txt = ''.join(Path("to/the/file.txt").read_text().splitlines())
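Put together, the pathlib steps can be tried as in the sketch below. It creates its own file in a temporary directory; the file name and content are assumptions for the demo.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'file.txt'  # hypothetical file for the demo
    path.write_text('  ABC \nDEF\n')

    lines = path.read_text().splitlines()       # newlines removed
    stripped = [line.strip() for line in lines]  # surrounding spaces removed
    joined = ''.join(stripped)                   # single string

print(lines, stripped, joined)
```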
This works:
Change your file to:
LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE
Then:
file = open("file.txt")
line = file.read()
words = line.split()
This creates a list named words that equals:
['LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN', 'GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE']
That got rid of the "\n". To answer the part about the brackets getting in your way, just do this:
for word in words:  # Assuming words is the list above
    print(word)  # Prints each word in the file on a different line
Or:
print(words[0] + ",", words[1])  # "+" concatenates without adding a space
# The comma between the two arguments makes print insert a space
This returns:
LLKKKKKKKKMMMMMMMMNNNNNNNNNNNNN, GGGGGGGGGHHHHHHHHHHHHHHHHHHHHEEEEEEEE
with open(player_name, 'r') as myfile:
    data = myfile.readline()
    words = data.split(" ")
    word = words[0]
This code reads the first line and splits it on spaces, so the words of that line are stored in a list.
Then you can easily access any word, or even store it in a string.
You can also do the same thing using a for loop.
file = open("myfile.txt", "r")
lines = file.readlines()
text = ''  # string declaration
for i in range(len(lines)):
    text += lines[i].rstrip('\n') + ' '
print(text)
Try the following:
with open('data.txt', 'r') as myfile:
    data = myfile.read()
sentences = data.split('\\n')
for sentence in sentences:
    print(sentence)
Caution: It does not remove the \n. It is just for viewing the text as if there were no \n
