How to grab the value that is being 'globbed' in Python

I have code that concatenates two parts of a file path I am interested in dissecting:
import glob
prefix = "/aldo/programs/info"
suffix = "/final/*_cube/myFile.txt"
prefix = prefix.rstrip()
file = glob.glob(prefix + suffix)
print(file)
Printing the final file gives me:
/aldo/programs/info/final/Michael_cube/myFile.txt
Which is GOOD and INTENDED. However, I am trying to set the string that was globbed, in this case 'Michael', equal to a variable. I have tried using regular expressions but cannot find a way to grab the value ('Michael') that was globbed. I am quite stuck and any guidance would be greatly appreciated.

You can use string slicing: you have all the parts you need to strip from the result to get what was provided as the *-value:
import glob
prefix = "/aldo/programs/info"
s0, g, s1 = "/final/", "*", "_cube/myFile.txt"  # split the parts around the *
suffix = s0 + g + s1                            # and recombine
prefix = prefix.rstrip()
file = glob.glob(prefix + suffix)
name = "/aldo/programs/info/final/Michael_cube/myFile.txt"  # in practice: name = file[0]
# slice: start at len(prefix+s0), stop at -len(s1)
print(name[len(prefix + s0):-len(s1)])
Output:
Michael
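If the pattern matches more than one directory, the same slice works on every result, e.g.:
for name in file:
    print(name[len(prefix + s0):-len(s1)])  # prints each globbed name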

Alternatively, you can use a regular expression:
^.*?\/final\/(.*?)_cube\/myFile\.txt$
You can either grab the contents from group 1, or replace the entire match with the substitution string $1 to get the output.
Explanation:
Starting and ending the pattern with ^ and $ requires the pattern to match the entire line. You can account for any unknowns in the data with the lazy "match anything" quantifier .*?, and then all you need to do is grab the desired output with a capture group.
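In Python, that might look like the following sketch (the path is the one from the question; forward slashes need no escaping here):
import re

path = "/aldo/programs/info/final/Michael_cube/myFile.txt"
m = re.search(r'^.*?/final/(.*?)_cube/myFile\.txt$', path)
if m:
    print(m.group(1))  # Michael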

Related

How to assign variables to a value in text file and check if it satisfies a given condition?

I have a file in.txt
name="XYZ_PP_0" number="0x12" bytesize="4" info="0x0000001A"
name="GK_LMP_2_0" number="0xA5" bytesize="8" info="0x00000000bbae321f"
name="MP_LKO_1_0" number="0x356" bytesize="4" info="0x00000234"
I need to check the condition that the info value of number="0x12" plus 0x00000004 equals the info value of number="0x356".
If it matches, print that the resulting value matches the given info value of number="0x356";
else print "not matching".
How can I do this?
This is my current attempt:
import re
pattern = r'(number="\w+").*(info="\w+")'
with open("in.txt", "r") as fin:
    for line in fin:
        for match_number, match_info in re.findall(pattern, line):
            ...
but this will simply extract the number and info value.
Break it into steps.
Look up how to read in a text file, line by line. You'll end up with a list of lines of this file.
Figure out how to extract the value from the "number" field. A simple regular expression would serve you well here I think.
[Optional] Cast this value to the correct data type for your problem.
Do the comparison you're interested in.
You can easily google the syntax for all of these I think.
Edit: posted before there was any code in the original post. I'm not entirely sure what the question is anymore. Do you need help debugging?
Edit 2: Taking another stab at this since I think you're asking for RegEx syntax.
Change your RegEx pattern to have parentheses around the information you want to extract. A RegEx match for such a pattern will allow you to assign the values inside this parentheses to Python variables.
See this partial example.
import re
pattern = r'number=(\"\w+\").*info=(\"\w+\")'
s = 'name="XYZ_PP_0" number="0x12" bytesize="4" info="0x0000001A"'
m = re.search(pattern, s)
if m:
    number, info = m.groups()
    print("number is ", number)
    print("info is", info)
# number is "0x12"
# info is "0x0000001A"
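To finish the task in the question, you could collect the number → info pairs for the whole file and then compare them as integers. A minimal sketch, assuming every value is a quoted hex number:
import re

pattern = r'number="(\w+)".*info="(\w+)"'
values = {}
with open("in.txt") as fin:
    for line in fin:
        m = re.search(pattern, line)
        if m:
            number, info = m.groups()
            values[number] = int(info, 16)  # parse the hex string into an int

if values["0x12"] + 0x00000004 == values["0x356"]:
    print('the resulting value matches the given info value of number="0x356"')
else:
    print("not matching")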

pandas read_table with regex header definition

For a data file formatted like this:
("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")
0 0.55432343242 0.34323443432242 0.00001
I can use pd.read_table(filename, sep=' ', header=0) and it will get everything correct except for the very first header, "Time Step".
Is there a way to specify a regex string for read_table() to use to parse out the header names?
I know a way to solve the issue is to just use regex to create a list of names for the read_table() function to use, but I figured there might/should be a way to directly express that in the import itself.
Edit: Here's what it returns as headers:
['("Time', 'Step"', 'courantnumber_max', 'courantnumber_avg', 'flow-time']
So it doesn't appear to be actually possible to do this inside the pandas.read_table() function. Below is posted the actual solution I ended up using to fix the problem:
import re

def get_headers(file, headerline, regexstring, exclude):
    # Get string of selected headerline
    with file.open() as f:
        for i, line in enumerate(f):
            if i == headerline - 1:
                headerstring = line
            elif i > headerline - 1:
                break
    # Parse headerstring
    reglist = re.split(regexstring, headerstring)
    # Filter entries in reglist
    # filter out blank strs
    filteredlist = list(filter(None, reglist))
    # filter out items in exclude list
    headerslist = []
    if exclude:
        for entry in filteredlist:
            if entry not in exclude:
                headerslist.append(entry)
    return headerslist

get_headers(filename, 3, r'(?:" ")|["\)\(]', ['\n'])
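The call above reads the header from line 3 (headerline=3); for the two-line sample shown earlier it would be line 1. Applied to the sample header line, the function returns:
['Time Step', 'courantnumber_max', 'courantnumber_avg', 'flow-time']
which you can then hand to read_table() through its names argument (together with header=None and a suitable skiprows).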
Code explanation:
get_headers():
Arguments: file is a pathlib.Path pointing at the file that contains the header (note the file.open() call). headerline is the line number (starting at 1) on which the header names live. regexstring is the pattern fed to re.split(); it is highly recommended to prefix the pattern with r. exclude is a list of miscellaneous strings you want removed from the header list.
The regex pattern I used:
First up we have the pipe (|) symbol. It separates the "normal" split delimiter (the " ") from the other characters that need to be removed (namely the parentheses).
Starting with the first group: (?:" "). We have the (...) since we want to match those characters in order. The " " is what we want to split around. The ?: says not to capture the contents of the group; this matters because re.split() keeps any captured groups as separate items in the result. See re.split() in the documentation.
The second group is simply the other characters. Without it, the first and last items would be '("Time Step' and 'flow-time)\n'. Note that this causes \n to be treated as a separate entry in the list, which is why the exclude argument is used to clean that up after the fact.
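As a quick check, splitting the sample header line with that pattern behaves like this (the trailing \n is what the exclude list filters out):
import re

headerstring = '("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")\n'
parts = re.split(r'(?:" ")|["\)\(]', headerstring)
print(list(filter(None, parts)))
# ['Time Step', 'courantnumber_max', 'courantnumber_avg', 'flow-time', '\n']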

Regex: Capture a line when certain columns are equal to certain values

Let's say we have this data extract:
ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10
I want to retrieve a line only when from = paris and type = member.
In this example, only one line:
1,paris,berlin,member,12
satisfies these rules. I am trying to do this with regex only. I am still learning and could only come up with this:
^.*(paris).*(member).*$
However, this also matches the second line, where paris is the destination.
The idea I guess is to:
Divide the line by commas.
Check if the second item is equal to 'paris'
Check if the fourth item is equal to 'member', or even check if there is 'member' in that line as there is no confusion with this part.
Any solution where I can use only regex?
Use [^,]* instead of .* to match a sequence of characters that doesn't include the comma separator. Use this for each field you want to skip when matching the line.
^[^,]*,paris,[^,]*,member,
Note that this is a very fragile mechanism compared to using the csv module, since it will break if any field contains a comma (the csv module understands quoting a field to protect the delimiter).
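In Python, that could look like this sketch (data as in the question):
import re

data = """ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10"""

pattern = re.compile(r'^[^,]*,paris,[^,]*,member,')
for line in data.splitlines():
    if pattern.match(line):
        print(line)
# 1,paris,berlin,member,12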
This should do it:
^.*,(paris),.*,(member),.*$
As many have pointed out, I would read this into a dictionary using csv. However, if you insist on using regex, this should work:
[0-9]+\,paris.*[^-]member.*
Try this:
import re
regex = r"\d,paris,\w+,member,\d+"
data = """ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10"""
for line in data.split("\n"):
    if re.match(regex, line):
        print(line)
You can try this:
import re
s = """
ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10
"""
final_data = re.findall(r'\d+,paris,\w+,member,\d+', s)
Output:
['1,paris,berlin,member,12']
However, note that the best solution is to read the file and use a dictionary:
import csv
# read every row, then turn each data row into a dict keyed by the header row
l = list(csv.reader(open('filename.csv')))
final_l = [dict(zip(l[0], i)) for i in l[1:]]
# keep only rows where from == paris and type == member, re-joined as CSV lines
final_data = [','.join(i[b] for b in l[0]) for i in final_l if i['from'] == 'paris' and i['type'] == 'member']

fuzzy string split in Python 2.x

Input file:
    rep_origin      607..1720
                    /label=Rep
    Region          2643..5020
                    /label="region"
                    extra_info and stuff
I'm trying to split by the first column-esque entry. For example, I want to get a list that looks like this...
Desired Output:
['rep_origin 607..1720 /label=Rep', 'Region 2643..5020 /label="region" extra_info and stuff']
I tried splitting by ' ' but that gave me some crazy stuff. If I could add a "fuzzy" search term at the end that matches any alphabet character but NOT a whitespace, that would solve the problem. I suppose you could do it with regex, with something like a findall on ' [A-Z]', but I wasn't sure if there was a less complicated way.
Is there a way to add a "fuzzy" search term at the very end of a string.split identifier (i.e. original_string.split(' [alphabet_character]'))?
I'm not sure exactly what you're looking for, but the parse function below takes the text from your question and returns a list of sections, where each section is a list of the lines of that section (with leading and trailing whitespace removed).
#!/usr/bin/env python
import re

# This is the input from your question
INPUT_TEXT = '''\
    rep_origin      607..1720
                    /label=Rep
    Region          2643..5020
                    /label="region"
                    extra_info and stuff'''

# A regular expression that matches the start of a section. A section
# start is a line that has 4 spaces before the first non-space
# character.
match_section_start = re.compile(r'^    [^ ]').match

def parse(text):
    sections = []
    section_lines = None

    def append_section_if_lines():
        if section_lines:
            sections.append(section_lines)

    for line in text.split('\n'):
        if match_section_start(line):
            # We've found the start of a new section. Unless this is
            # the first section, save the previous section.
            append_section_if_lines()
            section_lines = []
        section_lines.append(line.strip())

    # Save the last section.
    append_section_if_lines()
    return sections

sections = parse(INPUT_TEXT)
print(sections)
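For the sample input this prints each section as a list of stripped lines. To get exactly the flat strings asked for in the question, you can join each section and normalize the internal whitespace, for example:
flat = [' '.join(' '.join(section).split()) for section in sections]
print(flat)
# ['rep_origin 607..1720 /label=Rep', 'Region 2643..5020 /label="region" extra_info and stuff']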

Search for string in file while ignoring id and replacing only a substring

I've got a master .xml file generated by an external application and want to create several new .xml files by adapting and deleting some rows with Python. The search strings and replace strings for these adaptations are stored in an array, e.g.:
replaceArray = [
[u'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"',
u'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'],
[u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>',
u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="true"/>'],
[u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>',
u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="true"/>']]
So I'd like to iterate through my file and replace all occurrences of 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"' with 'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"', and so on.
Unfortunately the ID values of "RowID", "id_tool_base" and "ref_layerid_mapping" might change occasionally. So what I need is to search the master file for matches of the whole string, regardless of which ID value sits between the quotation marks, and to replace only the substring that differs between the two strings of the replaceArray (e.g. use="true" instead of use="false"). I'm not very familiar with regular expressions, but I think I need something like this for my search:
re.sub(r'<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use="false"/>', "", sentence)
I'm happy about any hint that points me in the right direction! If you need any further information or if something is not clear in my question, please let me know.
One way to do this is to use a function for the replacement. The function gets the match object from re.sub and reinserts the id captured from the string being replaced.
import re
s = 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"'
pat = re.compile(r'ref_layerid_mapping=(.+) lyvis="off" toc_visible="off"')
def replacer(m):
    # put the captured id back, followed by the new attribute values
    return 'ref_layerid_mapping=' + m.group(1) + ' lyvis="on" toc_visible="on"'
re.sub(pat, replacer, s)
Output:
'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'
Another way is to use back-references in the replacement pattern (see http://www.regular-expressions.info/replacebackref.html).
For example:
import re
s = "Ab ab"
re.sub(r"(\w)b (\w)b", r"\1d \2d", s)
Output:
'Ad ad'
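Applied to the strings from the question, the back-reference approach might look like this sketch: capture the parts containing the changing IDs and write them back with \1 and \2, so that only use="false" is replaced:
import re

line = '<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>'
new = re.sub(r'(<TOOL_BUFFER RowID="\d+" id_tool_base="\d+" use=)"false"(/>)',
             r'\1"true"\2', line)
print(new)
# <TOOL_BUFFER RowID="106874" id_tool_base="3651" use="true"/>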
