Extracting certain strings from a file using Python

I have a file with some lines. From those I want only the lines that start with xxx. Those lines follow this pattern:
xxx:(12:"pqrs",223,"rst",-90)
xxx:(23:"abc",111,"def",-80)
I want to extract only the string inside the first pair of double quotes,
i.e. "pqrs" and "abc".
Any help using regex is appreciated.
My code is as follows:
with open("log.txt","r") as f:
f = f.readlines()
for line in f:
line=line.rstrip()
for phrase in 'xxx:':
if re.match('^xxx:',line):
c=line
break
This code gives me an error.

Your code is incorrectly indented: f = f.readlines() has 9 spaces in front of it, while for line in f: has 4. It should look like the code below.
import re
list_of_prefixes = ["xxx", "aaa"]
resulting_list = []
with open("raw.txt", "r") as f:
    f = f.readlines()
for line in f:
    line = line.rstrip()
    for phrase in list_of_prefixes:
        # Capture the word inside the first double quotes after "<prefix>:(<digits>:"
        m = re.match(phrase + r':\(\d+:"(\w+)', line)
        if m is not None:
            resulting_list.append(m.group(1))

Well, you are heading in the right direction.
If the input is this simple, you can use regex groups.
import re
with open("log.txt", "r") as f:
    f = f.readlines()
for line in f:
    line = line.rstrip()
    # group(1) is the first double-quoted string, quotes included
    m = re.match(r'^xxx:\(\d*:("[^"]*")', line)
    if m is not None:
        print(m.group(1))
All the magic is in the regular expression.
^xxx:\(\d*:("[^"]*") means:
Start from the beginning of the line and match xxx:(<any number of digits>:"<anything but ">"
Because the sequence "<anything but ">" is enclosed in round brackets, it is available as a group (by calling m.group(1)).
PS: next time, make sure to include the exact error you are getting.
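The capture-group idea can also be wrapped in a small helper. The following is only a minimal sketch: the function name first_quoted and the sample lines are made up for illustration, and the pattern is the one from this answer with the quotes moved outside the group so that the result comes back without them.

```python
import re

# Same pattern as in the answer, but the quotes sit outside the group,
# so group(1) holds the bare text (pqrs rather than "pqrs").
PATTERN = re.compile(r'^xxx:\(\d*:"([^"]*)"')

def first_quoted(lines):
    """Yield the first double-quoted string of every line that matches the pattern."""
    for line in lines:
        m = PATTERN.match(line)
        if m is not None:
            yield m.group(1)

# Hypothetical sample input; a real run would iterate over an open file instead.
sample = [
    'xxx:(12:"pqrs",223,"rst",-90)',
    'yyy:(99:"skip",1,"me",0)',
    'xxx:(23:"abc",111,"def",-80)',
]
print(list(first_quoted(sample)))  # ['pqrs', 'abc']
```

Because the helper accepts any iterable of lines, an open file object can be passed to it directly.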

results = []
with open("log.txt", "r") as f:
    f = f.readlines()
for line in f:
    if line.startswith("xxx"):
        line = line.split(":")  # line[2] is what follows the second colon
        result = line[2].split(",")[0][1:-1]  # will be pqrs
        results.append(result)
You want to look for lines that start with xxx, then split the line on the :. Each line contains two colons, so the part you want is the first thing after the second : (up to the comma). Your result is that string with the quotes removed. There is no need for regex; Python string functions will be fine.
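Splitting on the double quote itself is another string-only option, sketched below. The helper name first_quoted_plain and its prefix parameter are hypothetical, and the sketch assumes the wanted text never contains a double quote.

```python
def first_quoted_plain(line, prefix="xxx"):
    """Return the text inside the first pair of double quotes, or None."""
    if not line.startswith(prefix):
        return None
    parts = line.split('"')
    # parts[1] is the text between the first and second quote, if both exist
    return parts[1] if len(parts) >= 3 else None

print(first_quoted_plain('xxx:(12:"pqrs",223,"rst",-90)'))  # pqrs
```

This avoids depending on how many colons or commas come before the quoted string.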

To check whether a line starts with xxx, use
line.startswith('xxx')
To find the text in the first double quotes, use
re.search(r'"(.*?)"', line).group(1)
(match.group(1) is the first parenthesized subgroup.)
So the code will be
import re
with open("file") as f:
    for line in f:
        if line.startswith('xxx'):
            print(re.search(r'"(.*?)"', line).group(1))
See the re module docs for details.

Related

Find specific word and print till newline in Python

I have a string and I want to match something that starts with a specific word and ends with a newline. How can this be done?
Website:https://www.abc1.xyz/
Product:Apparal
TM Link:https://www.abc2.xyz/
Other Link:https://www.abc3.xyz/
I want to extract [Website,Product,TM Link,Other Link] and save this in a CSV.
I am new to writing regular expressions. If anyone has a good solution to this, that would be awesome!

No need for regex; two splits do the trick here too:
s = """Website:https://www.abc1.xyz/
Product:Apparal
TM Link:https://www.abc2.xyz/
Other Link:https://www.abc3.xyz/"""
res = [line.split(':')[0] for line in s.split('\n')]
with open('file.csv', 'w') as f:
    f.write(','.join(res))
If you want to get the part after the first colon:
s = """Website:https://www.abc1.xyz/
Product:Apparal
TM Link:https://www.abc2.xyz/
Other Link:https://www.abc3.xyz/"""
res = [line.split(':', maxsplit=1)[1] for line in s.split('\n')]
with open('file.csv', 'w') as f:
    f.write(','.join(res))
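If both the keys and the values are needed, a sketch along the following lines may help. It uses str.partition to split at the first colon only and the standard csv module to handle quoting. The in-memory buffer is a stand-in chosen for this illustration; a real script would open a file such as file.csv instead.

```python
import csv
import io

s = """Website:https://www.abc1.xyz/
Product:Apparal
TM Link:https://www.abc2.xyz/
Other Link:https://www.abc3.xyz/"""

# partition(':') splits at the first colon only, so the URLs stay intact
pairs = [line.partition(':') for line in s.splitlines() if ':' in line]
keys = [key for key, _, _ in pairs]
values = [value for _, _, value in pairs]

buffer = io.StringIO()  # stand-in for open('file.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(keys)    # header row
writer.writerow(values)  # data row
print(buffer.getvalue())
```

Using csv.writer rather than ','.join means a value that happens to contain a comma is quoted instead of silently creating an extra column.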

regex Extract word and end with space in string

I am trying to filter and extract one word from a line.
Pattern is: GR.C.24 GRCACH GRALLDKD GR_3AD etc.
Input will be: the data is GRCACH got from server.
Output: GRCACH
Problem: the pattern starts with GR<can be anything> and ends when whitespace is encountered.
I am able to find the pattern, but not able to stop when a space is encountered.
The code is:
import re
fp_data = []
with open("output", "r") as fp:
    fp_data = fp.readlines()
for da in fp_data:
    match = re.search(r"\sGR.*", da)
    print(da)
    if match:
        print(dir(match))
        print(match.group())
Output: GRCACH got from server
Expected: GRCACH (or any word starting with GR)

Use:
(?:\s|^)(GR\S*)
(?:\s|^) matches whitespace or start of string
(GR\S*) matches GR followed by 0 or more non-whitespace characters and places match in Group 1
No need to read the entire file into memory (what if the file were very large?). You can iterate the file line by line.
import re
with open("output", "r") as fp:
    for line in fp:
        matches = re.findall(r"(?:\s|^)(GR\S*)", line)
        print(line, matches)
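To see what the pattern returns without needing a file, here is a small sketch on in-memory strings. The helper name gr_words and the second sample sentence are invented for this illustration.

```python
import re

GR_WORD = re.compile(r"(?:\s|^)(GR\S*)")

def gr_words(line):
    """Return every whitespace-delimited word in line that starts with GR."""
    return GR_WORD.findall(line)

print(gr_words("the data is GRCACH got from server."))  # ['GRCACH']
print(gr_words("GR.C.24 and GR_3AD but not AGRX"))       # ['GR.C.24', 'GR_3AD']
```

Because the pattern has exactly one capturing group, findall returns the group contents rather than the whole match, so the leading whitespace is not included.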

The readlines() method leaves the trailing newline character "\n" on each line, so I used a list comprehension that removes it with rstrip() and skips blank lines with isspace().
import re
fp_data = []
with open("output", "r") as fp:
    fp_data = [line.rstrip() for line in fp if not line.isspace()]
for line in fp_data:
    match = re.search(r"\sGR.*", line)
    print(line)
    if match:
        print(match)
        print(match.group())

Not sure if I understood your reply and your edit following my question about the desired output correctly, but assuming that you want to list all occurrences of words that start with GR, here is a suggestion:
import re
fp_data = []
with open("output", "r") as fp:
    fp_data = fp.readlines()
for da in fp_data:
    print(da)
    match = re.findall(r'\b(GR\S*)\b', da)
    if match:
        print(match)
The usage of word boundaries (\b) has the benefit of matching at the beginning and end of a line as well.
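The word-boundary pattern and the whitespace-delimited pattern do not always agree. The sketch below runs both on a made-up sample line to show where they differ.

```python
import re

# Hypothetical sample line, chosen to show where the two patterns diverge
line = "got x-GRY and GRCACH."

boundary = re.findall(r"\b(GR\S*)\b", line)
whitespace = re.findall(r"(?:\s|^)(GR\S*)", line)

print(boundary)    # ['GRY', 'GRCACH']
print(whitespace)  # ['GRCACH.']
```

With \b, a match may start after any non-word character (here the hyphen), and the trailing \b drops the final period. The whitespace version only starts after whitespace or at the start of the string, and keeps every non-whitespace character that follows.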

Is there any shortcut in Python to remove all blanks at the end of each line in a file?

I've learned that we can easily remove blank lines in a file or remove blanks from a single string, but how about removing all blanks at the end of each line in a file?
One way would be to process each line of the file, like:
stripped = []
with open(file) as f:
    for line in f:
        stripped.append(line.rstrip())  # rstrip() removes trailing blanks only
Is this the only way to complete the task?

Possibly the ugliest implementation possible, but here's what I just scratched up :0
def strip_str(string):
    last_ind = 0
    split_string = string.split(' ')
    for ind, word in enumerate(split_string):
        if word == '\n':
            return ''.join([split_string[0]] + [' {} '.format(x) for x in split_string[1:last_ind]])
        last_ind += 1

I don't know if these count as different ways of accomplishing the task. The first is really just a variation on what you have. The second processes the whole file at once rather than line by line.
Use map to call the rstrip method on each line of the file:
import operator
with open(filename) as f:
    # basically the same as (line.rstrip() for line in f)
    for line in map(operator.methodcaller('rstrip'), f):
        print(line)  # do something with the line
Or read the whole file and use re.sub():
import re
with open(filename) as f:
    text = f.read()
# remove spaces and tabs that directly precede a newline (blank lines are kept)
text = re.sub(r"[ \t]+(?=\n)", "", text)
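As a variation on the whole-file approach, the sketch below anchors on the end of each line with $ in multiline mode, which also covers a last line that has no trailing newline. The function name strip_trailing_blanks and the sample text are made up for illustration, and "blanks" is assumed to mean spaces and tabs.

```python
import re

def strip_trailing_blanks(text):
    """Remove spaces and tabs at the end of every line, keeping blank lines."""
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)

text = "first  \n\nsecond\t \nthird "
print(repr(strip_trailing_blanks(text)))  # 'first\n\nsecond\nthird'
```

The result is the same as calling rstrip(" \t") on each line and joining the lines back together with "\n".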

If you just want to remove spaces, another solution would be
line.replace(" ", "")
Note that this removes every space in the line, not only the trailing ones.

How do I find a token in a file and then count the number of spaces that precede the token?

Using python, I am trying to search a file for a token, and then count the number of white spaces which precede that token to the start of the line.
So if the file is like this:
<index>
   <scm>
   </scm>
</index>
I want to find the number of spaces which precede <scm>

The solution for a single-line mode:
import itertools
with open('yourfile.txt', 'r') as f:
    txt = f.read()
print(len(list(itertools.takewhile(lambda c: c.isspace(), txt[txt.index('<scm>')-1::-1]))))
The output:
5
txt[txt.index('<scm>')-1::-1] - "reversed" slice from the position of the string <scm> to the beginning of the text
itertools.takewhile(func, iterable) - accumulates characters from the input string (iterable) for as long as each character is whitespace (c.isspace())

If you meant just the single-line case, this would get you the preceding spaces for that line:
import re
def get_preceeding_spaces(file_name, tag):
    with open(file_name, 'r') as f:
        for line in f:
            if tag in line:
                prefix = line.split(tag)[0]
                # only count the prefix if it consists solely of whitespace
                if re.fullmatch(r'\s*', prefix):
                    return len(prefix)
print(get_preceeding_spaces('test.html', '<scm>'))
This returns, for your file:
3
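A regex-free sketch of the same per-line idea is shown below. The helper name spaces_before is hypothetical, and the sample lines assume three spaces of indentation before <scm>, in line with the result reported in this answer.

```python
def spaces_before(lines, token):
    """Return the number of spaces directly before token on its line, or None."""
    for line in lines:
        index = line.find(token)
        if index != -1:
            prefix = line[:index]
            # Count only the run of spaces that touches the token
            return len(prefix) - len(prefix.rstrip(' '))
    return None

# Assumed sample content (three spaces of indentation)
sample = ["<index>\n", "   <scm>\n", "   </scm>\n", "</index>\n"]
print(spaces_before(sample, "<scm>"))  # 3
```

Since the function takes any iterable of lines, an open file object can be passed in place of the sample list.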

You could use a regular expression. The number of spaces would be:
import re
with open('input.txt') as f_input:
    r = re.search('( +)' + re.escape("<scm>"), f_input.read(), re.S)
print(len(r.groups()[0]))
Which would be 3. Or the number of whitespace characters:
with open('input.txt') as f_input:
    r = re.search(r'(\s+)' + re.escape("<scm>"), f_input.read(), re.S)
print(len(r.groups()[0]))
Which would be 5.

how to replace a line of two words in a file using python

I want to replace a line in a file, but my code doesn't change that line. The problem seems to be the space between ALS and 4277 in input.txt. I need to keep that space in the file. How can I fix my code?
A part of input.txt:
ALS 4277
Related part of the code:
for lines in fileinput.input('input.txt', inplace=True):
    print(lines.rstrip().replace("ALS"+str(4277), "KLM" + str(4945)))
Desired output:
KLM 4945

Using the same idea that other users have already pointed out, you can also reproduce the original spacing by first matching it and saving it in a variable (spacing in my code):
import re
with open('input.txt') as f:
    lines = f.read()
# search() finds the pair anywhere in the text, not only at the very start
match = re.search(r'ALS(\s+)4277', lines)
if match is not None:
    spacing = match.group(1)
    lines = re.sub(r'ALS\s+4277', 'KLM%s4945' % spacing, lines.rstrip())
print(lines)
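The same spacing-preserving substitution can be written as a single re.sub call with a backreference, so each occurrence keeps its own spacing. This is only a sketch, and the function name replace_pair is made up for illustration.

```python
import re

def replace_pair(text):
    """Replace ALS<whitespace>4277 with KLM<same whitespace>4945."""
    # \g<1> re-inserts whatever whitespace was captured between the two tokens
    return re.sub(r"ALS(\s+)4277", r"KLM\g<1>4945", text)

print(replace_pair("ALS   4277"))  # KLM   4945
```

The \g<1> form is used instead of \1 because the replacement continues with digits, and \14945 would be read as a reference to a different group number.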

As the spaces vary you will need to use regex to account for the spaces.
import re
lines = "ALS 4277 "
line = re.sub(r"(ALS\s+4277)", "KLM 4945", lines.rstrip())
print(line)

Try:
with open('input.txt') as f:
    for line in f:
        parts = line.split()
        # only rewrite lines made of exactly the two expected tokens
        if parts == ['ALS', '4277']:
            line = line.replace('ALS', 'KLM').replace('4277', '4945')
        print(line, end='')  # line already ends with '\n'
