I have a string and I want to match everything that starts with a specific word and ends with a newline. How can this be done?
Website:https://www.abc1.xyz/
Product:Apparal
TM Link:https://www.abc2.xyz/
Other Link:https://www.abc3.xyz/
I want to extract [Website, Product, TM Link, Other Link] and save the result in a CSV file.
I am new to writing regular expressions. If anyone has a good solution to this, that would be awesome!
No need for regex, two splits do the trick here too
s = """Website:https://www.abc1.xyz/
Product:Apparal
TM Link:https://www.abc2.xyz/
Other Link:https://www.abc3.xyz/"""
res = [line.split(':')[0] for line in s.split('\n')]
with open('file.csv', 'w') as f:
    f.write(','.join(res))
If you want the part after the first colon instead:
s = """Website:https://www.abc1.xyz/
Product:Apparal
TM Link:https://www.abc2.xyz/
Other Link:https://www.abc3.xyz/"""
res = [line.split(':', maxsplit=1)[1] for line in s.split('\n')]
with open('file.csv', 'w') as f:
    f.write(','.join(res))
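The two snippets above can be combined so that the names go in a header row and the values in the row below it. This is only a sketch: it uses the standard csv module instead of ','.join so that quoting is handled if a value ever contains a comma, and it writes to an in-memory buffer (an assumption made for the example) rather than to file.csv.

```python
import csv
import io

s = """Website:https://www.abc1.xyz/
Product:Apparal
TM Link:https://www.abc2.xyz/
Other Link:https://www.abc3.xyz/"""

# Split each line at the first colon only, so the "https:" in the
# values stays intact. Lines without a colon are skipped.
pairs = [line.split(':', 1) for line in s.splitlines() if ':' in line]
keys = [key for key, _ in pairs]
values = [value for _, value in pairs]

# Written to an in-memory buffer here; open('file.csv', 'w', newline='')
# would be used the same way for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(keys)
writer.writerow(values)
print(buffer.getvalue())
```

If only the names are needed, the second writerow call can simply be dropped.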
Background
I have a two-column CSV file like this:
Find    Replace
is      was
A       one
b       two
etc.
The first column is the text to find and the second is the text to replace it with.
I have a second file with some text like this:
"This is A paragraph in a text file." (Please note the case sensitivity)
My requirement:
I want to use that CSV file to search and replace in the text file, with three conditions:
Whole-word replacement.
Case-sensitive replacement.
Replace all instances of each entry in the CSV.
Script tried:
import csv
import re

with open('CSV_file.csv', mode='r') as infile:
    reader = csv.reader(infile)
    mydict = {(r'\b' + rows[0] + r'\b'): (r'\b' + rows[1] + r'\b') for rows in reader}  # <-- Requires Attention
print(mydict)
with open('find.txt') as infile, open(r'resul_out.txt', 'w') as outfile:
    for line in infile:
        for src, target in mydict.items():
            line = re.sub(src, target, line)  # <-- Requires Attention
            # line = line.replace(src, target)
        outfile.write(line)
Description of script
I have loaded my csv into a python dictionary and use regex to find whole words.
Problems
I used r'\b' to mark word boundaries for whole-word replacement, but the printed dictionary shows '\\b' instead of '\b'. Why?
Using the replace function instead gives:
"Thwas was one paragraph in a text file."
Also, I don't know how to make the replacement case-sensitive in the regex pattern.
Does anyone know a better solution, or a way to improve this script?
Thanks for any help.
Here's a more cumbersome approach (more code) but which is easier to read and does not rely on regular expressions. In fact, given the very simple nature of your CSV control file, I wouldn't normally bother using the csv module at all:-
import csv
with open('temp.csv', newline='') as c:
    reader = csv.DictReader(c, delimiter=' ')
    D = {}
    for row in reader:
        D[row['Find']] = row['Replace']
with open('input.txt', newline='') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            tokens = line.split()
            for i, t in enumerate(tokens):
                if t in D:
                    tokens[i] = D[t]
            outfile.write(' '.join(tokens)+'\n')
I'd just put pure strings into mydict so it looks like
{'is': 'was', 'A': 'one', ...}
and replace this line:
# line = re.sub(src, target, line) # old
line = re.sub(r'\b' + src + r'\b', target, line) # new
Note that you don't need \b in the replacement pattern. Regarding your other questions,
regular expressions are case-sensitive by default,
changing '\b' to '\\b' is exactly what the r'' prefix does: r'\b' is a backslash followed by b, and print shows that as '\\b'. You can omit the r and write '\\b', but that quickly gets ugly with more complex regexes.
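As a rough sketch of the whole-word, case-sensitive replacement discussed above, all entries can also be combined into one compiled pattern so that each line is scanned once and text that was already replaced is never replaced again. The function name replace_whole_words and the inline mapping (standing in for the CSV contents) are made up for this example; re.escape is added on the assumption that an entry might contain regex metacharacters.

```python
import re

def replace_whole_words(text, mapping):
    """Replace whole-word, case-sensitive occurrences of the keys of mapping."""
    if not mapping:
        return text
    # Longest keys first so a longer entry wins over a shorter prefix of it.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape guards against entries containing regex metacharacters.
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keys)) + r')\b')
    # One pass over the text: replaced text is never scanned again.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

mapping = {'is': 'was', 'A': 'one', 'b': 'two'}  # stands in for the CSV contents
result = replace_whole_words("This is A paragraph in a text file.", mapping)
print(result)
```

Because regular expressions are case-sensitive by default, the lowercase "a" in "a text file" is left alone, and the \b anchors keep the "is" inside "This" from matching.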
I have a file with some lines. From those lines I want only the ones that start with xxx. The lines that start with xxx follow this pattern:
xxx:(12:"pqrs",223,"rst",-90)
xxx:(23:"abc",111,"def",-80)
I want to extract only the string inside the first pair of double quotes,
i.e., "pqrs" and "abc".
Any help using regex is appreciated.
My code is as follows:
with open("log.txt","r") as f:
         f = f.readlines()
    for line in f:
        line=line.rstrip()
        for phrase in 'xxx:':
            if re.match('^xxx:',line):
                c=line
                break
This code gives me an error.
Your code is wrongly indented. Your f = f.readlines() has 9 spaces in front while for line in f: has 4 spaces. It should look like the code below.
import re
list_of_prefixes = ["xxx", "aaa"]
resulting_list = []
with open("raw.txt", "r") as f:
    f = f.readlines()
    for line in f:
        line = line.rstrip()
        for phrase in list_of_prefixes:
            # raw string so that \( and \d reach the regex engine unchanged
            m = re.match(phrase + r':\(\d+:"(\w+)', line)
            if m is not None:
                resulting_list.append(m.group(1))
Well you are heading in the right direction.
If the input is this simple, you can use regex groups.
import re

with open("log.txt", "r") as f:
    f = f.readlines()
    for line in f:
        line = line.rstrip()
        m = re.match(r'^xxx:\(\d*:("[^"]*")', line)
        if m is not None:
            print(m.group(1))
All the magic is in the regular expression.
^xxx:\(\d*:("[^"]*") means:
Start from the beginning of the line and match xxx:( followed by any number of digits, then a colon, then a double quote, any characters other than a double quote, and a closing double quote.
Because the sequence "<anything but ">" is enclosed in round brackets, it is available as a group (by calling m.group(1)).
PS: next time, make sure to include the exact error you are getting.
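To see the group at work without a log file, the same pattern can be tried on a few in-memory lines. This is a small sketch: the sample_lines list is made up from the two lines in the question plus one hypothetical non-matching line.

```python
import re

# Same pattern as above; the parentheses around "[^"]*" form group 1.
pattern = re.compile(r'^xxx:\(\d*:("[^"]*")')

sample_lines = [
    'xxx:(12:"pqrs",223,"rst",-90)',
    'yyy:(99:"skip",1,"me",0)',      # hypothetical non-matching line
    'xxx:(23:"abc",111,"def",-80)',
]

# pattern.match returns None for lines that do not start with xxx:(
matches = [m.group(1) for m in map(pattern.match, sample_lines) if m]
print(matches)
```

Note that the captured text includes the double quotes, because they sit inside the group. Moving the parentheses inside the quotes, as in "([^"]*)", would capture pqrs and abc without them.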
results = []
with open("log.txt", "r") as f:
    for line in f:
        if line.startswith("xxx"):
            parts = line.split(":")  # ['xxx', '(12', '"pqrs",223,"rst",-90)']
            result = parts[2].split(",")[0][1:-1]  # will be pqrs
            results.append(result)
You want to look for lines that start with xxx,
then split the line on the colons. The piece after the second colon begins with what you want, up to the first comma. Your result is that string with the surrounding quotes removed. There is no need for regex; Python's string methods are enough.
To check if a line starts with xxx do
line.startswith('xxx')
To find the text in first double-quotes do
re.search(r'"(.*?)"', line).group(1)
(as match.group(1) is the first parenthesized subgroup)
So the code will be
import re

with open("file") as f:
    for line in f:
        if line.startswith('xxx'):
            print(re.search(r'"(.*?)"', line).group(1))
See the re module docs for details.
I am trying to search through a long text file to locate sections where a phrase is located and then print the phrase in one column and the corresponding data in another in a new text file.
Phrase I am looking for is "Initialize All". The text file will have thousands of lines - the one I am looking for will look something like this:
14-09-23 13:47:46.053 -07 000000027 INF: Initialize All start
This is where I am so far.
I am still trying to write three separate columns: Initialize All, Date, Time.
with open('Result.txt', 'w') as wFile:
    with open('Log.txt', 'r') as f:
        for line in f:
            if 'Initialize All' in line:
                date, time = line.split(" ",2)[:2]
                wFile.write(date)
with open('file.txt', 'r') as f:
    for line in f:
        if 'Initialize All' in line:
            pass  # do stuff with line
You can use regex:
import re

with open('file.txt', 'r') as f:
    lines = f.readlines()
timestamps = [re.search(r'\d{2}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}', line).group(0) for line in lines if 'Initialize All' in line]
s = "14-09-23 13:47:46.053 -07 000000027 INF: Initialize All start"
if "Initialize All" in s:  # check for substring
    date, time = s.split(" ",2)[:2]  # split on whitespace and get the first two elements
    print(date, time)
14-09-23 13:47:46.053
The 2 in s.split(" ",2) sets maxsplit to 2, so the string is split only twice rather than all the way through. s.split()[:2] also works, since split() uses whitespace by default, but as we only want the first two substrings there is no point in splitting the whole string.
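Putting the pieces together for the three columns asked about (phrase, date, time), a minimal sketch could look like the following. The helper name extract_rows, the tab separator, and the second log line are assumptions made for the example. With a real log, the lines would come from Log.txt and the rows would be written to Result.txt.

```python
def extract_rows(lines, phrase="Initialize All"):
    """Return (phrase, date, time) tuples for every line containing phrase."""
    rows = []
    for line in lines:
        if phrase in line:
            # The first two space-separated fields are the date and time.
            date, time = line.split(" ", 2)[:2]
            rows.append((phrase, date, time))
    return rows

log_lines = [
    "14-09-23 13:47:46.053 -07 000000027 INF: Initialize All start",
    "14-09-23 13:47:47.101 -07 000000028 INF: Something else",  # hypothetical line
]
rows = extract_rows(log_lines)

# Tab-separated columns; with a real file this would be written to Result.txt.
for row in rows:
    print("\t".join(row))
```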
I wanted to know how I could read ONLY the FIRST WORD of each line in a text file. I tried various snippets and tried altering them, but I can only manage to read whole lines from a text file.
The code I used is as shown below:
QuizList = []
with open('Quizzes.txt','r') as f:
    for line in f:
        QuizList.append(line)
line = QuizList[0]
for word in line.split():
    print(word)
This is an attempt to extract only the first word from the first line. To repeat the process for every line, I would do the following:
QuizList = []
with open('Quizzes.txt','r') as f:
    for line in f:
        QuizList.append(line)
capacity = len(QuizList)
capacity = capacity-1
index = 0
while index!=capacity:
    line = QuizList[index]
    for word in line.split():
        print(word)
    index = index+1
You are using split at the wrong point, try:
for line in f:
    QuizList.append(line.split(None, 1)[0]) # add only first word
Changed to a one-liner that's also more efficient with the strip as Jon Clements suggested in a comment.
with open('Quizzes.txt', 'r') as f:
    wordlist = [line.split(None, 1)[0] for line in f]
This is pretty irrelevant to your question, but just so the line.split(None, 1) doesn't confuse you, it's a bit more efficient because it only splits the line 1 time.
From the str.split([sep[, maxsplit]]) docs
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns [].
' 1 2 3 '.split() returns ['1', '2', '3']
and
' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
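One consequence of the documented behaviour is that a blank line splits to [], so indexing [0] on it raises IndexError. A small sketch that skips such lines is shown here. The function name first_words and the sample list are made up for the example, and the list stands in for the lines of Quizzes.txt.

```python
def first_words(lines):
    """Return the first whitespace-delimited word of each non-blank line."""
    words = []
    for line in lines:
        parts = line.split(None, 1)  # at most one split, on any whitespace
        if parts:                    # blank lines give [], so skip them
            words.append(parts[0])
    return words

sample = ["alpha beta gamma\n", "\n", "  delta epsilon\n", "zeta\n"]
print(first_words(sample))
```

With a real file, first_words(f) can be called on the open file object directly, since iterating a file yields its lines.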
with open(filename, "r") as f:
    wordlist = [r.split()[0] for r in f]
I'd go for str.split and similar approaches, but for completeness here's one that uses a combination of mmap and re, in case you need to extract more complicated data:
import mmap, re
with open('quizzes.txt', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    # mmap exposes bytes, so in Python 3 the pattern must be a bytes pattern
    wordlist = re.findall(rb'^(\w+)', mf, flags=re.M)
You should read one character at a time:
import string
QuizList = []
with open('Quizzes.txt','r') as f:
    for line in f:
        for i, c in enumerate(line):
            # string.ascii_letters replaces the Python 2-only string.letters
            if c not in string.ascii_letters:
                print(line[:i])
                break
l=[]
with open ('task-1.txt', 'rt') as myfile:
    for x in myfile:
        l.append(x)
for i in l:
    print(i.split()[0])
I use python 2.7.
I have data in file 'a':
myname1#abc.com;description1
myname2#abc.org;description2
myname3#this_is_ok.ok;description3
myname5#qwe.in;description4
myname4#qwe.org;description5
abc#ok.ok;description7
I read this file like:
with open('a', 'r') as f:
    data = [x.strip() for x in f.readlines()]
I have a list named bad:
bad = ['abc', 'qwe'] # could be more than 20 elements
Now I'm trying to remove all lines that have 'abc' or 'qwe' right after the # and write the rest to newfile.
So newfile should contain only 2 lines:
myname3#this_is_ok.ok;description3
abc#ok.ok;description7
I've been trying to use the regexp (.?)#(.?);(.*) to get groups, but I don't know what to do next.
Please advise!
Here's a non-regex solution:
bad = set(['abc', 'qwe'])
with open('a', 'r') as f:
    # keep only the lines whose label after '#' is NOT in bad
    data = [line.strip() for line in f if line.split('#')[1].split('.')[0] not in bad]
import re
bad = ['abc', 'qwe']
with open('a') as f:
    print([line.strip()
           for line in f
           if not re.search('|'.join(bad), line.partition('#')[2])])
This solution works as long as bad contains only ordinary characters (letters, digits, underscores) and nothing with a special meaning in a regular expression, such as 'a|b', as #phihag pointed out.
The regexp .? matches either no or one character. You want .*?, which is a lazy match of multiple characters:
import re
bad = ['abc', 'qwe']
filterf = re.compile('(.*?)#(?!' + '|'.join(map(re.escape, bad)) + ')').match
with open('a') as inf, open('newfile', 'w') as outf:
    outf.writelines(filter(filterf, inf))
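I used a regular expression to remove lines that contain #abc or #qwe. I am not sure it is the right method:
import re
with open('testFile.txt', 'r') as f:
    # Caution: [^abc|qwe] is a character class (any character except
    # a, b, c, |, q, w, e), not "neither abc nor qwe"; it only happens
    # to give the right answer on this sample data.
    data = [x.strip() for x in f.readlines() if re.match(r'.*#([^abc|qwe]+)\..*;.*',x)]
print(data)
Now data holds the lines that do not contain '#abc' or '#qwe'.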
Or use
data = [x.strip() for x in f.readlines() if re.search(r'.*#(?!abc|qwe)',x)]
Based on astynax's suggestion.
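For comparison, here is a sketch of the same filter without a regular expression, comparing the label between the # and the first dot against a set. The helper name keep_line is made up, and the lines list simply repeats the sample data from the question instead of reading file 'a'. Unlike the lookahead version, this does an exact comparison, so a label such as abcd would be kept.

```python
def keep_line(line, bad):
    """True if the label between '#' and the first '.' is not in bad."""
    # partition never raises, even when the line has no '#'.
    _, _, rest = line.partition('#')
    label = rest.split('.', 1)[0]
    return label not in bad

bad = {'abc', 'qwe'}
lines = [
    "myname1#abc.com;description1\n",
    "myname2#abc.org;description2\n",
    "myname3#this_is_ok.ok;description3\n",
    "myname5#qwe.in;description4\n",
    "myname4#qwe.org;description5\n",
    "abc#ok.ok;description7\n",
]
kept = [line for line in lines if keep_line(line, bad)]

# With real files: read the lines from open('a') and pass kept to
# outf.writelines() on open('newfile', 'w').
for line in kept:
    print(line.rstrip())
```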