Use regex to isolate information in filename

Use regex to isolate information in filename - python

I have various files that are formatted like so;
file_name_twenty_135032952.txt
where file_name_twenty is a description of the contents, and 13503295 is the id.
I want two different regexes; one to get the description out of the filename, and one to get the id.
Here are some other rules that the filenames follow:
The filename will never contains spaces, or uppercase characters
the id will always come directly before the extension
the id will always follow an underscore
the description may sometimes have numbers in it; for example, in this filename: part_1_of_file_324980332.txt, part_1_of_file is the description, and 324980332 is the id.
I've been toiling for a while and can't seem to figure out a regex to solve this. I'm using python, so any limitations thereof with its regex engine follow.

rsplit once on an underscore and to remove the extension from id.
s = "file_name_twenty_13503295.txt"
name, id = s.rsplit(".",1)[0].rsplit("_", 1)
print(name, id)
file_name_twenty 13503295

Related

How to extract questions from a word doc with Python using regex

I am using docx library to read files from a word doc, I am trying to extract only the questions using regex search and match. I found infinite ways of doing it but I keep getting a "TypeError".
The data I am trying to extract is this:
Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition and fees, any remaining funds will be sent to you as a refund that will either be directly deposited (which can be set up through your account) or mailed to you as a paper check. You can then use the refund to pay your rent. It is important to note that financial aid may not be available when rent is due, so make sure to have a plan in place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
"money" "chisho"
"pay" "chap"
"grant" "laurel"
What are the requirements to receive a room and grant?
How do I pay for my housing?
How do I pay for housing?
If there's also an easier method of exporting the word doc into a different type of file, that'll be great to know for feedback. Thank you
I am using regex 101, I've tried the following regex expressions to match only the sentences that end in a question mark
".*[?=?]$"
"^(W|w).*[?=?]$"
"^[A-Za-z].*[?=?]$"
import re
import sys
from docx import Document
wordDoc = Document('botDoc.docx')
result = re.search('.*[?=?]$', wordDoc)
print(result)
if result:
print(result.group(0))
for table in wordDoc.tables:
for row in table.rows:
for cell in row.cells:
print("test")
I expect to save the matching patterns into directories so I can export the data to a csv file

Your error:
result = re.search('.*[?=?]$', wordDoc)
I believe that this line is the cause of the problem. search() is expecting a string as a second parameter, but is receiving a Document object.
What you should do is use the findall() function. search() only finds the first match for a pattern; findall() finds all the matches and returns them as a list of strings, with each string representing one match.
Since you are working with docx, you would have to extract the contents of the docx and use them as second parameter of the findall() method. If I remember correctly, this is done by first extracting all the paragraphs, and then extracting the text of the individual paragraphs. Refer to this question.
FYI, the way you would do this for a simple text file is the following:
# Open file
f = open('test.txt', 'r')
# Feed the file text into findall(); it returns a list of all the found strings
strings = re.findall(r'your pattern', f.read())
Your Regex:
Unfortunately, your regex is not quite correct, because although logically it makes sense to match only sentences that end on a ?, one of your matches is place to pay your rent. Will my financial aid pay for housing?, for example. Only the second part of that sentence is an actual question. So discard any lower case letters. Your regex should be something like:
[A-Z].*\?$

Best way to parse the human names in a large list of addresses when the order of names is unkown in advance [duplicate]

ok so basically I am asking the question of their name
I want this to be one input rather than Forename and Surname.
Now is there any way of splitting this name? and taking just the last word from the "Sentence" e.g.
name = "Thomas Winter"
print name.split()
and what would be output is just "Winter"

You'll find that your key problem with this approach isn't a technical one, but a human one - different people write their names in different ways.
In fact, the terminology of "forename" and "surname" is itself flawed.
While many blended families use a hyphenated family name, such as Smith-Jones, there are some who just use both names separately, "Smith Jones" where both names are the family name.
Many european family names have multiple parts, such as "de Vere" and "van den Neiulaar". Sometimes these extras have important family history - for example, a prefix awarded by a king hundreds of years ago.
Side issue: I've capitalised these correctly for the people I'm referencing - "de" and "van den" don't get captial letters for some families, but do for others.
Conversely, many Asian cultures put the family name first, because the family is considered more important than the individual.
Last point - some people place great store in being "Junior" or "Senior" or "III" - and your code shouldn't treat those as the family name.
Also noting that there are a fair number of people who use a name that isn't the one bestowed by their parents, I've used the following scheme with some success:
Full Name (as normally written for addressing mail);
Family Name;
Known As (the name commonly used in conversation).
e.g:
Full Name: William Gates III; Family Name: Gates; Known As: Bill
Full Name: Soong Li; Family Name: Soong; Known As: Lisa

This is a pretty old issue but I found it searching around for a solution to parsing out pieces from a globbed together name.
http://code.google.com/p/python-nameparser/

The problem with trying to split the names from a single input is that you won't get the full surname for people with spaces in their surname, and I don't believe you'll be able to write code to manage that completely.
I would recommend that you ask for the names separately if it is at all possible.

An easy way to do exactly what you asked in python is
name = "Thomas Winter"
LastName = name.split()[1]
(note the parantheses on the function call split.)
split() creates a list where each element is from your original string, delimited by whitespace. You can now grab the second element using name.split()[1] or the last element using name.split()[-1]
However, as others said, unless you're SURE you're just getting a string like "First_Name Last_Name", there are a lot more issues involved.

Golden rule of data - don't aggregate too early - it is much easier to glue fields together than separate them. Most people also have a middle name which should be an optional field. Some people have a plethora of middle names. Some people only have one name, one word. Some cultures commonly have a dictionary of middle names, paying homage to the family tree back to the Golgafrincham Ark landing.
You don't need a code solution here - you need a business rule.

Like this:
print name.split()[-1]

If you're trying to parse apart a human name in PHP, I recomment Keith Beckman's nameparse.php script.

This is how I do it in my application:
def get_first_name(fullname):
firstname = ''
try:
firstname = fullname.split()[0]
except Exception as e:
print str(e)
return firstname
def get_last_name(fullname):
lastname = ''
try:
index=0
for part in fullname.split():
if index > 0:
if index > 1:
lastname += ' '
lastname += part
index += 1
except Exception as e:
print str(e)
return lastname
def get_last_word(string):
return string.split()[-1]
print get_first_name('Jim Van Loon')
print get_last_name('Jim Van Loon')
print get_last_word('Jim Van Loon')

Since there are so many different variation's of how people write their names, but here's how a basic way to get the first/lastname via regex.
import re
p = re.compile(r'^(\s+)?(Mr(\.)?|Mrs(\.)?)?(?P<FIRST_NAME>.+)(\s+)(?P<LAST_NAME>.+)$', re.IGNORECASE)
m = p.match('Mr. Dingo Bat')
if(m != None):
first_name = m.group('FIRST_NAME')
last_name = m.group('LAST_NAME')

Splitting names is harder than it looks. Some names have two word last names; some people will enter a first, middle, and last name; some names have two work first names. The more reliable (or least unreliable) way to handle names is to always capture first and last name in separate fields. Of course this raises its own issues, like how to handle people with only one name, making sure it works for users that have a different ordering of name parts.
Names are hard, handle with care.

It's definitely a more complicated task than it appears on the surface. I wrote up some of the challenges as well as my algorithm for solving it on my blog. Be sure to check out my Google Code project for it if you want the latest version in PHP:
http://www.onlineaspect.com/2009/08/17/splitting-names/

Here's how to do it in SQL. But data normalization with this kind of thing is really a bear. I agree with Dave DuPlantis about asking for separate inputs.

I would specify a standard format (some forms use them), such as "Please write your name in First name, Surname form".
It makes it easier for you, as names don't usually contain a comma. It also verifies that your users actually enter both first name and surname.

name = "Thomas Winter"
first, last = name.split()
print("First = {first}".format(first=first))
#First = Thomas
print("Last = {last}".format(last=" ".join(last)))
#Last = Winter

You can use str.find() for this.
x=input("enter your name ")
l=x.find(" ")
print("your first name is",x[:l])
print("your last name is",x[l:])

You would probably want to use rsplit for this:
rsplit([sep [,maxsplit]])
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done, the rightmost ones. If sep is not specified or None, any whitespace string is a separator. Except for splitting from the right, rsplit() behaves like split() which is described in detail below. New in version 2.4.

Parsing directories and detecting unexpected blanks

I'm trying to parse some directories and identifying folders witch do not have a specific correct pattern. Let's exemplify:
Correct: Level1\\Level2\\Level3\\Level4_ID\\Date\\Hour\\file.txt
Incorrect: Level1\\Level2\\Level3\\Level4\\Date\\Hour\\file.txt
Notice that the incorrect one does not have the _ID. My final desired goal is parse the data replacing the '\' for a delimiter to import for MS excel:
Level1;Level2;Level3;Level4;ID;Date;Hour;file.txt
Level1;Level2;Level3;Level4; ;Date;Hour;file.txt
I had successfully parsed all the correct data making this steps:
Let files be a list of my all directories
for i in arange(len(files)):
processed_str = files[i].replace(" ", "").replace("_", "\\")
processed_str = processed_str.split("\\")
My issue is detecting whether or not Level4 folder does have an ID after the underscore using the same script, since "files" contains both correct and incorrect directories.
The problem is that since the incorrect one does not have the ID, after performing split("\") I end up having the columns mixed without a blanck between Level4 and Date:
Level1;Level2;Level3;Level4;Date;Hour;file.txt
Thanks,

Do the "_ID" check after splitting the directories, that way you don't loose information. Assuming the directory names themselves don't contain escaped backslashes and that the ID field is always in level 4 (counting from 1), this should do it:
for i in arange(len(files)):
parts = files[i].split("\\")
if parts[3].endswith("_ID"):
parts.insert(4, parts[3][:-len("_ID")])
else:
parts.insert(4, " ")
final = ";".join(parts)

Incremental Saves

I am trying to write up a script on incremental saves but there are a few hiccups that I am running into.
If the file name is "aaa.ma", I will get the following error - ValueError: invalid literal for int() with base 10: 'aaa' # and it does not happens if my file is named "aaa_0001"
And this happens if I wrote my code in this format: Link
As such, to rectify the above problem, I input in an if..else.. statement - Link, it seems to have resolved the issue on hand, but I was wondering if there is a better approach to this?
Any advice will be greatly appreciated!

Use regexes for better flexibility especially for file rename scripts like these.
In your case, since you know that the expected filename format is "some_file_name_<increment_number>", you can use regexes to do the searching and matching for you. The reason we should do this is because people/users may are not machines, and may not stick to the exact naming conventions that our scripts expect. For example, the user may name the file aaa_01.ma or even aaa001.ma instead of aaa_0001 that your script currently expects. To build this flexibility into your script, you can use regexes. For your use case, you could do:
# name = lastIncFile.partition(".")[0] # Use os.path.split instead
name, ext = os.path.splitext(lastIncFile)
import re
match_object = re.search("([a-zA-Z]*)_*([0-9]*)$", name)
# Here ([a-zA-Z]*) would be group(1) and would have "aaa" for ex.
# and ([0-9]*) would be group(2) and would have "0001" for ex.
# _* indicates that there may be an _, or not.
# The $ indicates that ([0-9]*) would be the LAST part of the name.
padding = 4 # Try and parameterize as many components as possible for easy maintenance
default_starting = 1
verName = str(default_starting).zfill(padding) # Default verName
if match_object: # True if the version string was found
name = match_object.group(1)
version_component = match_object.group(2)
if version_component:
verName = str(int(version_component) + 1).zfill(padding)
newFileName = "%s_%s.%s" % (name, verName, ext)
incSaveFilePath = os.path.join(curFileDir, newFileName)
Check out this nice tutorial on Python regexes to get an idea what is going on in the above block. Feel free to tweak, evolve and build the regex based on your use cases, tests and needs.
Extra tips:
Call cmds.file(renameToSave=True) at the beginning of the script. This will ensure that the file does not get saved over itself accidentally, and forces the script/user to rename the current file. Just a safety measure.
If you want to go a little fancy with your regex expression and make them more readable, you could try doing this:
match_object = re.search("(?P<name>[a-zA-Z]*)_*(?P<version>[0-9]*)$", name)
name = match_object.group('name')
version_component = match_object('version')
Here we use the ?P<var_name>... syntax to assign a dict key name to the matching group. Makes for better readability when you access it - mo.group('version') is much more clearer than mo.group(2).
Make sure to go through the official docs too.
Save using Maya's commands. This will ensure Maya does all it's checks while and before saving:
cmds.file(rename=incSaveFilePath)
cmds.file(save=True)
Update-2:
If you want space to be checked here's an updated regex:
match_object = re.search("(?P<name>[a-zA-Z]*)[_ ]*(?P<version>[0-9]*)$", name)
Here [_ ]* will check for 0 - many occurrences of _ or (space). For more regex stuff, trying and learn on your own is the best way. Check out the links on this post.
Hope this helps.

How to check set of files conform to a naming scheme

I have a bunch of files (TV episodes, although that is fairly arbitrary) that I want to check match a specific naming/organisation scheme..
Currently: I have three arrays of regex, one for valid filenames, one for files missing an episode name, and one for valid paths.
Then, I loop though each valid-filename regex, if it matches, append it to a "valid" dict, if not, do the same with the missing-ep-name regexs, if it matches this I append it to an "invalid" dict with an error code (2:'missing epsiode name'), if it matches neither, it gets added to invalid with the 'malformed name' error code.
The current code can be found here
I want to add a rule that checks for the presence of a folder.jpg file in each directory, but to add this would make the code substantially more messy in it's current state..
How could I write this system in a more expandable way?
The rules it needs to check would be..
File is in the format Show Name - [01x23] - Episode Name.avi or Show Name - [01xSpecial02] - Special Name.avi or Show Name - [01xExtra01] - Extra Name.avi
If filename is in the format Show Name - [01x23].avi display it a 'missing episode name' section of the output
The path should be in the format Show Name/season 2/the_file.avi (where season 2 should be the correct season number in the filename)
each Show Name/season 1/ folder should contain "folder.jpg"
.any ideas? While I'm trying to check TV episodes, this concept/code should be able to apply to many things..
The only thought I had was a list of dicts in the format:
checker = [
{
'name':'valid files',
'type':'file',
'function':check_valid(), # runs check_valid() on all files
'status':0 # if it returns True, this is the status the file gets
}

I want to add a rule that checks for
the presence of a folder.jpg file in
each directory, but to add this would
make the code substantially more messy
in it's current state..
This doesn't look bad. In fact your current code does it very nicely, and Sven mentioned a good way to do it as well:
Get a list of all the files
Check for "required" files
You would just have have add to your dictionary a list of required files:
checker = {
...
'required': ['file', 'list', 'for_required']
}
As far as there being a better/extensible way to do this? I am not exactly sure. I could only really think of a way to possibly drop the "multiple" regular expressions and build off of Sven's idea for using a delimiter. So my strategy would be defining a dictionary as follows (and I'm sorry I don't know Python syntax and I'm a tad to lazy to look it up but it should make sense. The /regex/ is shorthand for a regex):
check_dict = {
'delim' : /\-/,
'parts' : [ 'Show Name', 'Episode Name', 'Episode Number' ],
'patterns' : [/valid name/, /valid episode name/, /valid number/ ],
'required' : ['list', 'of', 'files'],
'ignored' : ['.*', 'hidden.txt'],
'start_dir': '/path/to/dir/to/test/'
}
Split the filename based on the delimiter.
Check each of the parts.
Because its an ordered list you can determine what parts are missing and if a section doesn't match any pattern it is malformed. Here the parts and patterns have a 1 to 1 ratio. Two arrays instead of a dictionary enforces the order.
Ignored and required files can be listed. The . and .. files should probably be ignored automatically. The user should be allowed to input "globs" which can be shell expanded. I'm thinking here of svn:ignore properties, but globbing is natural for listing files.
Here start_dir would be default to the current directory but if you wanted a single file to run automated testing of a bunch of directories this would be useful.
The real loose end here is the path template and along the same lines what path is required for "valid files". I really couldn't come up with a solid idea without writing one large regular expression and taking groups from it... to build a template. It felt a lot like writing a TextMate language grammar. But that starts to stray on the ease of use. The real problem was that the path template was not composed of parts, which makes sense but adds complexity.
Is this strategy in tune with what you were thinking of?

maybe you should take the approach of defaulting to: "the filename is correct" and work from there to disprove that statement:
with the fact that you only allow filenames with: 'show name', 'season number x episode number' and 'episode name', you know for certain that these items should be separated by a "-" (dash) so you have to have 2 of those for a filename to be correct.
if that checks out, you can use your code to check that the show name matches the show name as seen in the parent's parent folder (case insensitive i assume), the season number matches the parents folder numeric value (with or without an extra 0 prepended).
if however you don't see the correct amount of dashes you instantly know that there is something wrong and stop before the rest of the tests etc.
and separately you can check if the file folder.jpg exists and take the necessary actions. or do that first and filter that file from the rest of the files in that folder.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.