For a data file formatted like this:
("Time Step" "courantnumber_max" "courantnumber_avg" "flow-time")
0 0.55432343242 0.34323443432242 0.00001
I can use pd.read_table(filename, sep=' ', header=0) and it will get everything correct except for the very first header, "Time Step".
Is there a way to specify a regex string for read_table() to use to parse out the header names?
I know a way to solve the issue is to just use regex to create a list of names for the read_table() function to use, but I figured there might/should be a way to directly express that in the import itself.
Edit: Here's what it returns as headers:
['("Time', 'Step"', 'courantnumber_max', 'courantnumber_avg', 'flow-time']
So it doesn't appear to be actually possible to do this inside the pandas.read_table() function. Posted below is the actual solution I ended up using to fix the problem:
import re

def get_headers(file, headerline, regexstring, exclude):
    # Get the string of the selected header line
    with file.open() as f:
        for i, line in enumerate(f):
            if i == headerline - 1:
                headerstring = line
            elif i > headerline - 1:
                break

    # Parse the header string with the supplied regex
    reglist = re.split(regexstring, headerstring)

    # Filter out blank strings
    filteredlist = list(filter(None, reglist))

    # Filter out items in the exclude list
    if exclude:
        headerslist = [entry for entry in filteredlist if entry not in exclude]
    else:
        headerslist = filteredlist

    return headerslist

get_headers(filename, 3, r'(?:" ")|["\)\(]', ['\n'])
Code explanation:
get_headers():
Arguments: file is a path object (e.g. a pathlib.Path) pointing to the file that contains the header; the function calls its .open() method. headerline is the line number (starting at 1) on which the header names live. regexstring is the pattern that will be fed into re.split(); it is highly recommended to prefix the pattern with r (a raw string literal). exclude is a list of miscellaneous strings that you want removed from the resulting header list.
The regex pattern I used:
First up we have the pipe (|) symbol. This separates the "normal" split delimiter (which is the " ") from the other stuff that needs to be removed (namely the parentheses).
Starting with the first group: (?:" "). We have the (...) since we want to match those characters in order. The " " is what we want to match as the stuff to split around. The ?: basically says to not capture the contents of the group. This is important/useful as otherwise re.split() will keep any groups as a separate item. See re.split() in documentation.
The second group is simply the other characters. Without them, the first and last items would be '("Time Step' and 'flow-time)\n'. Note that this causes \n to be treated as a separate entry to the list. This is why we use the exclude argument to fix that up after the fact.
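In case it helps, here is a minimal sketch of feeding the result back into pandas. It assumes, as in the call above, that filename is a pathlib.Path (get_headers calls its .open() method), that the header sits on line 3, and that whitespace-separated data rows start immediately after it:
import pandas as pd

headers = get_headers(filename, 3, r'(?:" ")|["\)\(]', ['\n'])
# skip everything up to and including the header line and supply our own names
df = pd.read_table(filename, sep=r'\s+', skiprows=3, header=None, names=headers)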
Related
I am searching for sentences containing certain characters using Python regular expressions, but I can't find the sentence I want. Please help me.
regex.py
import re

opfile = open('file.txt', 'r')
contents = opfile.read()
opfile.close()
index = re.findall(r'\[start file\](?:.|\n)*\[end file\]', contents)
item = re.search(r'age.*', str(index))
file.txt (example)
[start file]
name: steve
age: 23
[end file]
result
<re.Match object; span=(94, 738), match='age: >
The age is not printed
There are several issues here:
str(index) returns the string representation of the list, which makes it difficult to process the result any further
(?:.|\n)* is a very resource-consuming construct; use a plain . with the re.S or re.DOTALL option instead
If you plan to find a single match, use re.search, not re.findall.
Here is a possible solution:
match = re.search(r'\[start file].*\[end file]', contents, re.S)
if match:
    match2 = re.search(r"\bage:\s*(\d+)", match.group())
    if match2:
        print(match2.group(1))
Output:
23
If you want to get age in the output, use match2.group().
If you want to match the age only once between the start and end file markers, you could use a single pattern with a capture group and in between match all lines that do not start with age: or the start or end marker.
^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]
Regex demo
Example
import re
regex = r"^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]"
s = ("[start file]\n" "name: steve \n" "age: 23\n" "[end file]")
m = re.search(regex, s)
if m:
    print(m.group(1))
Output
23
The example input looks like a list of key, value pairs enclosed between some start/end markers. For this use-case, it might be more efficient and readable to write the parsing stage as:
re.search to locate the document
splitlines() to isolate individual records
split() to extract the key and value of each record
Then, in a second step, access the extracted records.
Doing this separates the parsing and exploitation parts and makes the code easier to maintain.
Additionally, a good practice is to wrap access to a file in a "context manager" (the with statement) to guarantee that all resources are correctly cleaned up, even on error.
Here is a full standalone example:
import re

# 1: Load the raw data from disk, in a context manager
with open('/tmp/file.txt') as f:
    contents = f.read()

# 2: Parse the raw data
fields = {}
if match := re.search(r'\[start file\]\n(.*)\[end file\]', contents, re.S):
    for line in match.group(1).splitlines():
        k, v = line.split(':', 1)
        fields[k.strip()] = v.strip()

# 3: Actual data exploitation
print(fields['age'])
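Note that the := assignment expression (the "walrus operator") used in step 2 requires Python 3.8 or later; on older versions, assign the result of re.search to match on its own line before the if.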
A long time ago I wrote a tool for parsing text files line by line and doing some stuff, depending on commands and conditions in the file.
I used regex for this; however, I was never good at regex.
A line holding a condition looks like this:
[type==STRING]
And the regex I use is:
re.compile(r'^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
This regex gives me the keyword "type" and the value "STRING".
However, now I need to update my tool to have more conditions in one line, e.g.
[type==STRING][amount==0]
I need to update my regex to get me two pairs of results, one pair type/STRING and one pair amount/0.
But I'm lost on this. My regex above gets me zero results with this line.
Any ideas how to do this?
You could either match a second pair of groups:
^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*(?:\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*)?$
Regex demo
Or you can omit the anchors and the [^\[\]]* parts to get the group 1 and group 2 values multiple times:
\[([^\]\[=]*)==([^\]\[=]*)\]
Regex demo
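For what it's worth, a small sketch of that second approach; with two capture groups, re.findall returns the key/value pairs directly:
import re

line = "[type==STRING][amount==0]"
pairs = re.findall(r"\[([^\]\[=]*)==([^\]\[=]*)\]", line)
print(pairs)  # [('type', 'STRING'), ('amount', '0')]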
Is it a requirement that you use regex? You can alternatively accomplish this pretty easily using the split function twice and stripping the first opening and last closing bracket.
line_to_parse = "[type==STRING]"
# omit the first and last char before splitting
pairs = line_to_parse[1:-1].split("][")
for pair in pairs:
    x, y = pair.split("==")
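Applied to the two-condition line from the question, the same idea yields both pairs:
line_to_parse = "[type==STRING][amount==0]"
pairs = line_to_parse[1:-1].split("][")
print([tuple(pair.split("==")) for pair in pairs])
# [('type', 'STRING'), ('amount', '0')]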
It rather depends on the precise "rules" that describe your data. However, for your given data, why not:
import re
text = '[type==STRING][amount==0]'
words = re.findall(r'\w+', text)
lst = []
for i in range(0, len(words), 2):
    lst.append((words[i], words[i+1]))
print(lst)
Output:
[('type', 'STRING'), ('amount', '0')]
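A slightly more compact variant, if you prefer, pairs the even- and odd-indexed words with zip instead of the explicit loop:
lst = list(zip(words[::2], words[1::2]))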
I have code that concatenates two parts of a file path I am interested in dissecting:
import glob
prefix = '/aldo/programs/info'
suffix = '/final/*_cube/myFile.txt'
prefix = prefix.rstrip()
file = glob.glob(prefix+'/final/*_cube/myFile.txt')
print (file)
Printing the final file gives me:
/aldo/programs/info/final/Michael_cube/myFile.txt
Which is GOOD and INTENDED. However, I am trying to set the string that was globbed, in this case, 'Michael' equal to a variable. I have tried using regular expressions but cannot find a way to grab the value (Michael) that was globbed. I am quite stuck and any guidance would be greatly appreciated.
You can use string slicing: you have all the parts that you need to strip from the result to get what was provided as the *-value:
import glob
prefix = "/aldo/programs/info"
s0, g, s1 = "/final/", "*", "_cube/myFile.txt"  # split the suffix around the *
suffix = s0 + g + s1  # and recombine
prefix = prefix.rstrip()
file = glob.glob(prefix+'/final/*_cube/myFile.txt')
name = "/aldo/programs/info/final/Michael_cube/myFile.txt"
# slice: start at len(prefix+s0) and stop at -len(s1)
print(name[len(prefix+s0):-len(s1)])
Output:
Michael
DEMO
^.*?\/final\/(.*?)_cube\/myFile\.txt$
You can either grab the contents from group 1, or replace the entire match with the substitution string $1 to get the output.
Explanation:
Starting and ending the pattern with ^ and $ requires the pattern to match the entire line. You can account for any unknowns in the data with the "match all" quantifiers .*?, and then all you need to do is grab the desired output with a capture group.
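For illustration, here is a rough sketch of that pattern applied in Python (the path is hard-coded only for the example; forward slashes need no escaping in a Python regex):
import re

name = "/aldo/programs/info/final/Michael_cube/myFile.txt"
m = re.search(r'^.*?/final/(.*?)_cube/myFile\.txt$', name)
if m:
    print(m.group(1))  # Michael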
So, I've written the code below to extract hashtags and also tags with '#', append them to a list, and sort them in descending order. The thing is that the text might not be perfectly formatted and may lack spaces between individual hashtags, in which case the following problem occurs (as can be checked with the #print statement inside the for loop):
#socality#thisismycommunity#themoderndayexplorer#modernoutdoors#mountaincultureelevated
So, the .split() method doesn't deal with those. What would be the best practice for this issue?
Here is the .txt file
Grateful for your time.
name = input("Enter file:")
if len(name) < 1 : name = "tags.txt"
handle = open(name)

tags = dict()
lst = list()

for line in handle :
    hline = line.split()
    for word in hline:
        if word.startswith('#') : tags[word] = tags.get(word,0) + 1
        else :
            tags[word] = tags.get(word,0) + 1
            #print(word)

for k,v in tags.items() :
    tags_order = (v,k)
    lst.append(tags_order)

lst = sorted(lst, reverse=True)[:34]

print('Final Dictionary: ' , '\n')

for v,k in lst :
    print(k , v, '')
Use a regular expression. There are only a few limits: a tag must start with either # or @, and it may not contain any spaces or other whitespace characters.
This code
import re

tags = []
# 'U' = universal-newline mode; this is the default in Python 3 (and the flag
# was removed in 3.11), so a plain 'r' works on current versions
with open('../Downloads/tags.txt', 'Ur') as file:
    for line in file:
        tags += re.findall(r'[#@][^\s#@]+', line)
creates a list of all tags in the file. You can easily adjust it to store the found tags in your dictionary; instead of storing the result straight away in tags, loop over it and do with each item as you please.
The regex is built up from these two custom character classes:
[#@] - either the single character # or @ at the start
[^\s#@]+ - a sequence of anything that is not a whitespace character (\s matches all whitespace such as space, tab, and newlines), #, or @; at least one, and as many as possible.
So findall starts matching at the start of any tag and then grabs as much as it can, stopping only when encountering any of the "not" characters.
findall returns a list of matching items, which you can immediately add to an existing list, or loop over the found items in turn:
for tag in re.findall(r'[#@][^\s#@]+', line):
    # process "tag" any way you want here
The source text file contains Windows-style \r\n line endings, and so I initially got a lot of empty "lines" on my Mac. Opening the text file in Universal newline mode makes sure that is handled transparently by the line reading part of Python.
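As a rough sketch only, here is one way the findall approach could replace the counting loop from the question, using collections.Counter (the file name and the 34-item cut-off are taken from the question; everything else is just one possible arrangement):
import re
from collections import Counter

tags = Counter()
with open('tags.txt') as file:
    for line in file:
        tags.update(re.findall(r'[#@][^\s#@]+', line))

for tag, count in tags.most_common(34):
    print(tag, count)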
Let's say we have this data extract:
ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10
I want to retrieve the line when from = paris, and type = member.
Which means in this example I have only:
1,paris,berlin,member,12
That satisfies these rules. I am trying to do this with regex only. I am still learning and I could only get this:
^.*(paris).*(member).*$
However, this will also give me the second line, where paris is a destination.
The idea I guess is to:
Divide the line by commas.
Check if the second item is equal to 'paris'
Check if the fourth item is equal to 'member', or even check if there is 'member' in that line as there is no confusion with this part.
Any solution where I can use only regex?
Use [^,]* instead of .* to match a sequence of characters that doesn't include the comma separator. Use this for each field you want to skip when matching the line.
^[^,]*,paris,[^,]*,member,
Note that this is a very fragile mechanism compared to using the csv module, since it will break if you have any fields that contain a comma (the csv module understands quoting a field to protect the delimiter).
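For example, with .*$ appended so each whole matching line is returned, and assuming the sample data is held in a string s:
import re

s = """ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10"""

print(re.findall(r'^[^,]*,paris,[^,]*,member,.*$', s, re.M))
# ['1,paris,berlin,member,12']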
This should do it:
^.*,(paris),.*,(member),.*$
As many have pointed out, I would read this into a dictionary using csv. However, if you insist on using regex, this should work:
[0-9]+\,paris.*[^-]member.*
Try this:
import re
regex = r"\d,paris,\w+,member,\d+"
str = """ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10"""
str = str.split("\n")
for line in str:
    if re.match(regex, line):
        print(line)
You can try this:
import re
s = """
ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10
"""
final_data = re.findall(r'\d+,paris,\w+,member,\d+', s)
Output:
['1,paris,berlin,member,12']
However, note that the best solution is to read the file and use a dictionary:
import csv
l = list(csv.reader(open('filename.csv')))
final_l = [dict(zip(l[0], i)) for i in l[1:]]
final_data = [','.join(i[b] for b in l[0]) for i in final_l if i['from'] == 'paris' and i['type'] == 'member']
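With the sample data above saved as filename.csv, final_data here should again come out as ['1,paris,berlin,member,12'].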