I've seen variations of this question asked a million times but somehow can't figure out a solution for myself.
( PIN 700W_start_stop( STS_PROP( POS_X 1233 )( POS_Y 456 )( BIT_CNT 1 )( CNCT_ID 7071869 ))(USR_PROP( VAR 1( Var_typ -1 )(AssocCd H12 )( termLBLttt +S)( Anorm 011.1)(Amax 1.0))
How do I pull out the number after 'POS_X'? i.e. 1233
I thought I had it figured out using regex because it seems extremely straightforward. But it's not working (go figure).
import re
import pandas as pd

df_pin = pd.DataFrame(columns=
    ['ID', 'Pos_x', 'Pos_y', 'conn_ID', 'Association_Code', 'Anorm', 'Amax'])
with open(r'C:\Users\user1\Documents\Python Scripts\test1.txt', 'r',
          encoding="ISO-8859-1") as txt:
    for line in txt:
        data = txt.read()
        line = line.strip()
        x = re.search(r'POS_X (\d+)', data)
        df_pin = df_pin.append({'POS_X': x}, ignore_index=True)
        print(x)
Shouldn't this give me the numbers after 'POS_X' and then append them to the corresponding column in my dataframe? There may be multiple occurrences of 'POS_X ###' on the same line; I only want to find the first. What if I wanted to do the same for 'PIN' and extract '700W_start_stop'?
re.search() returns a match object, or None if nothing matched. The digits matched by \d+ are captured by the first group in the regexp, so you need to use

if x:
    print(x.group(1))
else:
    print("POS_X not found")

to print that.
The whole loop should be (note that re.search already returns only the first match on each line):

import re

with open(r'C:\Users\user1\Documents\Python Scripts\test1.txt', 'r', encoding="ISO-8859-1") as txt:
    for line in txt:
        line = line.strip()
        x = re.search(r'POS_X (\d+)', line)
        if x:
            print(x.group(1))
        else:
            print("POS_X not found in", line)
For PIN, you could use:
x = re.search(r'PIN (\w+)', line)
\w matches alphanumeric characters and _.
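Since the question also asks about getting the matches into the dataframe: DataFrame.append was removed in pandas 2.0, so the usual approach now is to collect one dict per line and build the frame in a single call. A minimal sketch (the sample line is adapted from the question; the column names are the asker's own):

```python
import re

# One sample line adapted from the question; re.search returns only
# the FIRST occurrence of 'POS_X ###' on a line, as requested.
sample = ("( PIN 700W_start_stop( STS_PROP( POS_X 1233 )( POS_Y 456 )"
          "( BIT_CNT 1 )( CNCT_ID 7071869 ))")

rows = []
for line in sample.splitlines():
    pin = re.search(r'PIN (\w+)', line)
    pos_x = re.search(r'POS_X (\d+)', line)
    if pin and pos_x:
        rows.append({'ID': pin.group(1), 'Pos_x': int(pos_x.group(1))})

print(rows)
# rows is now ready for a single constructor call:
# df_pin = pd.DataFrame(rows)
```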
Related
I want to search for a multi-line string in a file in Python. If there is a match, then I want to get the start line number, end line number, start column and end column of the match. For example, in the below file,
I want to match the below multi-line string:
pattern = """b'0100000001685c7c35aabe690cc99f947a8172ad075d4401448a212b9f26607d6ec5530915010000006a4730'
b'440220337117278ee2fc7ae222ec1547b3a40fa39a05f91c1e19db60060541c4b3d6e4022020188e1d5d843c'"""
The result of the match should be: start_line: 2, end_line: 3, start_column: 23 and end_column: 114.
The start column is the index in that line where the first character of the pattern matches, and the end column is the index in the line where the last character of the pattern matches.
I tried the re package of Python, but it returns None as it could not find any match.
import re
pattern = """b'0100000001685c7c35aabe690cc99f947a8172ad075d4401448a212b9f26607d6ec5530915010000006a4730'
b'440220337117278ee2fc7ae222ec1547b3a40fa39a05f91c1e19db60060541c4b3d6e4022020188e1d5d843c'"""
with open("test.py") as f:
    content = f.read()
print(re.search(pattern, content))
I can find the metadata of the location of the match for single-line strings in a file using:

with open("test.py") as f:
    data = f.read()
    for n, line in enumerate(data.splitlines()):
        match_index = line.find(pattern)
        if match_index != -1:
            print("Start Line:", n + 1)
            print("End Line:", n + 1)
            print("Start Column:", match_index)
            print("End Column:", match_index + len(pattern) + 1)
            break
But, I am struggling to make it work for multi-line strings. How can I match multi-line strings in a file and get the metadata of the location of the match in python?
To search across multiple lines, include the newline in the pattern itself (the re.MULTILINE flag only changes how ^ and $ behave):
import re
pattern = r"(c\nd)"
string = """
a
b
c
d
e
f
"""
match = re.search(pattern, string, flags=re.MULTILINE)
print(match)
To get the start line, you could count the newline characters as follows
start, stop = match.span()
start_line = string[:start].count('\n')
You could do the same for end_line, or if you know how many lines your pattern spans, you can just add that number to avoid counting twice.
To also get the start column, you can check the line itself, or a pure regex solution could look like:

pattern = r"(?:.*\n)*(\s*(c\s*\n\s*d)\s*)"
match = re.match(pattern, string, flags=re.MULTILINE)
start_column = match.start(2) - match.start(1)
start_line = string[:match.start(1)].count('\n')
print(start_line, start_column)
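Putting the pieces together, here is a minimal self-contained sketch (using a made-up two-line sample, not the asker's file) that reports 1-based line numbers and 0-based columns. re.escape is used because the text being searched for is a fixed string, not a pattern:

```python
import re

content = "line one\nstart here\nand end\nline four\n"
needle = "here\nand end"  # a fixed multi-line string (hypothetical sample)

m = re.search(re.escape(needle), content)
start, stop = m.span()
start_line = content[:start].count('\n') + 1   # 1-based line numbers
end_line = content[:stop].count('\n') + 1
# column = offset from the last newline before the position (0-based)
start_column = start - content.rfind('\n', 0, start) - 1
end_column = stop - content.rfind('\n', 0, stop) - 1
print(start_line, end_line, start_column, end_column)   # → 2 3 6 7
```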
However, I think difflib could be more useful here.
Alternative Solution
Below is a more creative solution to your problem:
You are interested in the row and column position of some sample text (not a pattern, but a fixed text) in a larger text.
This problem reminds me a lot on image registration, see https://en.wikipedia.org/wiki/Digital_image_correlation_and_tracking for a short introduction or https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate2d.html for a more sophisticated example.
from itertools import zip_longest

import numpy as np

text = """Some Title
abc xyz ijk
12345678
abcdefgh
xxxxxxxxxxx
012345678
abcabcabc
yyyyyyyyyyy
"""
template = (
    "12345678",
    "abcdefgh"
)
moving = np.array([
    [ord(char) for char in line]
    for line in template
])
# use str.splitlines rather than text.split(os.linesep):
# it works regardless of the platform's line separator
lines = text.splitlines()
values = [
    [ord(char) for char in line]
    for line in lines
]
# use zip_longest to pad the array with a fill value
reference = np.array(list(zip_longest(*values, fillvalue=0))).T
windows = np.lib.stride_tricks.sliding_window_view(reference, moving.shape)
# get a distance matrix
distance = np.linalg.norm(windows - moving, axis=(2, 3))
# find the minimum and return its index location
row, column = np.unravel_index(np.argmin(distance), distance.shape)
print(row, column)
I need to extract the names of the constants and their corresponding values from a .txt file into a dictionary, where key = NameOfConstants and value = float.
The start of the file looks like this:
speed of light 299792458.0 m/s
gravitational constant 6.67259e-11 m**3/kg/s**2
Planck constant 6.6260755e-34 J*s
elementary charge 1.60217733e-19 C
How do I get the names of the constants easily?
This is my attempt:
with open('constants.txt', 'r') as infile:
    file1 = infile.readlines()

constants = {i.split()[0]: i.split()[1] for i in file1[2:]}
I'm not getting it right with the split(), and I need a little correction!
{' '.join(line.split()[:-2]):' '.join(line.split()[-2:]) for line in lines}
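Applied to the sample lines from the question (inlined here instead of read from constants.txt), the idea looks like this: everything but the last two whitespace-separated tokens is the name, and the second-to-last token is the value, converted to float as the question asks:

```python
lines = [
    "speed of light 299792458.0 m/s",
    "gravitational constant 6.67259e-11 m**3/kg/s**2",
    "Planck constant 6.6260755e-34 J*s",
    "elementary charge 1.60217733e-19 C",
]

# name = all tokens except the last two; value = second-to-last token
constants = {' '.join(line.split()[:-2]): float(line.split()[-2])
             for line in lines}
print(constants['speed of light'])   # → 299792458.0
```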
From your text file I couldn't determine a consistent number of spaces to split on, so the code below parses character by character instead. It worked on the file you posted:
import string

valid_char = string.ascii_letters + ' '
# include 'e', 'E' and signs so scientific notation like 6.67259e-11 parses
valid_numbers = string.digits + '.eE+-'

constants = {}
with open('constants.txt') as file1:
    for line in file1.readlines():
        key = ''
        for index, char in enumerate(line):
            if char in valid_char:
                key += char
            else:
                key = key.strip()
                break
        value = ''
        for char in line[index:]:
            if char in valid_numbers:
                value += char
            else:
                break
        constants[key] = float(value)
print constants
Have you tried using regular expressions?
For example
([a-zA-Z]|\s)*
matches the first part of a line (the names contain uppercase letters too), up to where the digits of the constant begin.
Python provides a very good tutorial on regular expressions (regex)
https://docs.python.org/2/howto/regex.html
You can try out your regex online as well
https://regex101.com/
with open('constants.txt', 'r') as infile:
    lines = infile.readlines()

constants = {' '.join(line.split()[:-2]): float(line.split()[-2]) for line in lines[2:]}

The [2:] slice skips the first two lines, which aren't needed.
This would best be solved using a regexp.
Focusing on your question (how to get the names) and your desire to have something shorter:
import re

# The regular expression fetches all characters
# until the first occurrence of a number
REGEXP = re.compile(r'^([a-zA-Z\s]+)\d.*$')

with open('tst.txt', 'r') as f:
    for line in f:
        match = REGEXP.match(line)
        if match:
            # On a match, the part between parentheses
            # is copied to the first group
            name = match.group(1).strip()
        else:
            # Raise something, or change the regexp :)
            pass
What about re.split?
import re

lines = open(r"C:\txt.txt", 'r').readlines()
for line in lines:
    data = re.split(r'\s{3,}', line)
    print "{0} : {1}".format(data[0], ''.join(data[1:]))
Or use a one-liner to make the dictionary:
{k:v.strip() for k,v in [(re.split(r'\s{3,}',line)[0],''.join(re.split(r'\s{3,}',line)[1:])) for line in open(r"C:\txt.txt",'r').readlines() ]}
Output-
gravitational constant : 6.67259e-11m**3/kg/s**2
Planck constant : 6.6260755e-34J*s
elementary charge : 1.60217733e-19C
Dictionary-
{'Planck constant': '6.6260755e-34J*s', 'elementary charge': '1.60217733e-19C', 'speed of light': '299792458.0m/s', 'gravitational constant': '6.67259e-11m**3/kg/s**2'}
import sys
import re
x = sys.argv[1]
y = sys.argv[2]
f = open("formula.txt" ,'r')
line = f.read()
match = re.search(r'x',line,re.M|re.I)
match = re.search(r'y',line,re.M|re.I)
f.close()
print x
print y
I tried the above program but I could not get the output.
The desired output should be as follows. When I execute the above program:
>>>python argument.py circle_area rectangle_area
the output should like this:
x = 2*3.14*r*r
y = l*b
And the given file in program is formula.txt
formula.txt file contains following data;
circle_area = '3.14*r*r'
circle_circumference = '2*3.14*r'
rectangle_area = 'l*b'
rectangle_perimeter = '2(l+b)'
Can anybody help me implement the above?
You made several mistakes in your code:
Don't put variable names inside quotes.
Use capturing groups or lookarounds to match the text you want to print.
Call .group() on the match object returned by re.search to get the matched text.
The code should look like:
import sys
import re
x = sys.argv[1]
y = sys.argv[2]
f = open("formula.txt" ,'r')
line = f.read()
match1 = re.search(x + r"\s*=\s*'([^']*)'" , line, re.M|re.I).group(1)
match2 = re.search(y + r"\s*=\s*'([^']*)'" , line, re.M|re.I).group(1)
f.close()
print match1
print match2
In r"\s*=\s*'([^']*)'", \s* matches zero or more spaces and [^']* matches any character other than a single quote, zero or more times. This text (the value part) is captured into group 1; later we refer to the captured characters by passing that index number to group().
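As a quick check of that capture against one line of the formula.txt shown in the question:

```python
import re

line = "circle_area = '3.14*r*r'"
# \s*=\s* allows optional spaces around '='; ([^']*) captures the value
m = re.search("circle_area" + r"\s*=\s*'([^']*)'", line, re.I)
print(m.group(1))   # → 3.14*r*r
```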
First off, you don't search for the argument values themselves; rather do:
match = re.search(r"^%s\s+=\s*'(.*)'" % x, line, re.M|re.I)
Then do something with the match, like putting it back into an existing variable.
x = match.group(1)
I have a large text document that I am reading in and attempting to split into multiple lists. I'm having a hard time with the logic behind actually splitting up the string.
example of the text:
Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410
This data contains 4 pieces of information in this format:
City[coordinates]Population Distances_to_previous
My aim is to split this data up into a List:
Data = [[City] , [Coordinates] , [Population] , [Distances]]
As far as I know I need to use .split statements but I've gotten lost trying to implement them.
I'd be very grateful for some ideas to get started!
I would do this in stages.
Your first split is at the '[' of the coordinates.
Your second split is at the ']' of the coordinates.
Third split is end of line.
The next line (if it starts with a number) is your distances.
I'd start with something like:

numCities = 0
Data = []
i = 0
while i < len(lines):
    split = lines[i].partition('[')
    if split[1]:  # We found something
        city = split[0]
        split = split[2].partition(']')
        if split[1]:
            coords = split[0]  # If you want this as a list then rsplit it
            population = split[2]
            distances = []
            if i > 0:
                i += 1
                distances = lines[i].rsplit(' ')
            Data.append([city, coords, population, distances])
            numCities += 1
    i += 1

for data in Data:
    print(data)
This will print
['Youngstown, OH', '4110,8065', '115436', []]
['Yankton, SD', '4288,9739', '12011', ['966']]
['Yakima, WA', '4660,12051', '49826', ['1513', '2410']]
The easiest way would be with a regex.

lines = """Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410"""

import re

pat = re.compile(r"""
    (?P<City>.+?)                 # all characters up to the first [
    \[(?P<Coordinates>\d+,\d+)\]  # grabs [digits,digits]
    (?P<Population>\d+)           # population digits here
    \s                            # a space or a newline
    (?P<Distances>[\d ]+)?        # everything else is distances""", re.M | re.X)

groups = pat.finditer(lines)
results = [[[g.group("City")],
            [g.group("Coordinates")],
            [g.group("Population")],
            g.group("Distances").split() if
            g.group("Distances") else [None]]
           for g in groups]
DEMO:
In[50]: results
Out[50]:
[[['Youngstown, OH'], ['4110,8065'], ['115436'], [None]],
[['Yankton, SD'], ['4288,9739'], ['12011'], ['966']],
[['Yakima, WA'], ['4660,12051'], ['49826'], ['1513', '2410']]]
Though if I may, it's probably best to do this as a list of dictionaries, one dict per match:

groups = pat.finditer(lines)
results = [{key: g.group(key) for key in
            ["City", "Coordinates", "Population", "Distances"]}
           for g in groups]

# then modify later
for d in results:
    try:
        d['Distances'] = d['Distances'].split()
    except AttributeError:
        # distances is None -- that's okay
        pass
I've really been struggling with this one for some time now. I have many text files with a specific format from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parsing parameters to ensure I get all the info correctly.
The format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of the whitespace, and 2) separating the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''', (tmp_PA, tmp_K, tmp_DETAILS))
At the minute, I can pull the K & PA fields no problem using this; however my DETAILS only pulls one word, and I need the entire sentence, or at least 25 chars of it.
Thanks very much for reading and I hope you can help! :)
You are splitting the whole line into words. You need to split into the first word, the second word, and the rest, like line.split(None, 2).
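For instance, a tiny illustration with one of the sample lines (maxsplit=2 keeps the whole sentence together as the third piece):

```python
line = "2 4565434 i need this sentence as one DB record"
# None = split on any whitespace run; 2 = at most two splits
k, pa, details = line.split(None, 2)
print(k, pa, details)
```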
I would probably use regular expressions, with the opposite logic: if the line starts with a number 1 through 5, use it; otherwise skip it. Like:
import re

pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r')  # no calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using prepared statements; parsing SQL is orders of magnitude slower than executing it.
If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = file(filename, 'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K', 'PA', 'DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None, 2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K, PA, tokens[2]))
f.close()

for r in records:
    print r  # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re:
import re

stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print result.group('third')
        print result.group('second')
        print result.group('first')
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re

stuff = open("source", "r").readlines()
# create a regular expression using subpatterns.
# 'first', 'second' and 'third' are our own tags;
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result:  # do stuff with the data here, just use the tag we declared earlier
        print result.group('third')
        print result.group('second')
        print result.group('first')
import re

reg = re.compile('K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3*'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')

with open('XX.txt') as f:
    mat = reg.search(f.read())

for tripl in ((2, 1, 3), (5, 4, 6), (8, 7, 9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches the following characters:
' ', '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol representing more than what is to be matched, with the risk of matching stray newlines in places where they shouldn't be.
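The difference is easy to demonstrate with a small check: \s+ happily crosses a newline, while [ \t]+ stops at it:

```python
import re

text = "K\nPA"
print(re.findall(r"\w+\s+\w+", text))     # \s matches the newline: ['K\nPA']
print(re.findall(r"\w+[ \t]+\w+", text))  # [ \t] does not: []
```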
Edit
It may be sufficient to do:
import re

reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$', re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2, 1, 3))