Address Parsing [Py] - python

So I'm looking to write a function that will take input in the form:
123 1st street APT 32S or
320 Jumping Alien Road
555 Google Ave
and output all the information parsed from the input string as a dictionary / JSON.
The dictionary would look something like:
output = {
    "streetNum": "123",
    "roadName": "1st",
    "suffix": "street",
    "enders": "APT",  # or None / null
    "room": "32S"     # or None / null
}
I'm trying to think of the logic, but the best I can come up with is something along the lines of address.split(' ') and then taking where the roadName, suffix, and streetNum would typically be located in the string. Obviously these things aren't always going to be in that order, and road names with spaces inside them would break the function as well.
def addressParser(addressString):
    # Split once; pad with None so short addresses don't raise IndexError
    parts = addressString.split(' ') + [None] * 5
    return {
        "streetNum": parts[0],  # prob need regex help
        "roadName": parts[1],
        "suffix": parts[2],
        "enders": parts[3],
        "room": parts[4]
    }
Edit: found exactly what I needed here: https://pypi.org/project/address/
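For anyone who still wants a dependency-free version, here is a minimal regex sketch; the suffix and unit-keyword alternations below are my own assumptions, not an exhaustive list:

import re

# Number, road name (may contain spaces), suffix, optional unit keyword + room.
ADDRESS_RE = re.compile(
    r'^(?P<streetNum>\d+)\s+'
    r'(?P<roadName>.+?)\s+'
    r'(?P<suffix>street|st|road|rd|avenue|ave|boulevard|blvd|lane|ln)'
    r'(?:\s+(?P<enders>APT|STE|UNIT)\s+(?P<room>\S+))?$',
    re.IGNORECASE)

def addressParser(addressString):
    m = ADDRESS_RE.match(addressString.strip())
    return m.groupdict() if m else None

# addressParser("123 1st street APT 32S")
# -> {'streetNum': '123', 'roadName': '1st', 'suffix': 'street',
#     'enders': 'APT', 'room': '32S'}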

Related

Selenium innerHTML list, print specific value

First of all, I'm new to working with Python, especially Selenium. I connected to a page with the webdriver and already grabbed the innerHTML I need. Here's my problem: the innerHTML is a "list" and I only want to output one value. It looks something like this:
<html>
<body>
<pre style="example" xpath="1">
"amount": 12{
"value" : 3
},
</pre>
</body>
</html>
^It's just for illustration, because the actual thing is much longer. InnerHTML looks like this:
"amount": 12{
"value" : 3
},
^This is where I am now. I can't specify a line because the page is not static. How do I make Python find "value" in the innerHTML variable? Please note that there is a colon after "value"!
Thank you very much in advance!
I suggest using a regular expression to find the value. I assume that you only need the number part, so here's the code:
import re

innerHTML = '''
"amount": 12{
"value" : 3
},"value":4
'value': 5
'''

regex = re.compile(r'''("|')value("|')\s*:\s*(?P<number>\d+)''')
startpos = 0
while True:
    m = regex.search(innerHTML, startpos)
    if m is None:
        break
    print(m.group("number"))
    startpos = m.start() + 1
# output:
# 3
# 4
# 5
This will print out all the value numbers found, as strings. You can convert them to integers afterwards, for example.
NOTE: My code also accounts for the case where value is surrounded by single quotes ' rather than double quotes ". This is for your convenience; if you don't need that, you can change the appropriate line above to:
regex = re.compile(r'''"value"\s*:\s*(?P<number>\d+)''')
In that case, the output would not include the value 5.
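For what it's worth, the same loop can be written more compactly with re.finditer, converting to int along the way. A sketch using the same test string and regex as above (finditer reports non-overlapping matches, which gives the same output here):

import re

innerHTML = '''
"amount": 12{
"value" : 3
},"value":4
'value': 5
'''

regex = re.compile(r'''("|')value("|')\s*:\s*(?P<number>\d+)''')
values = [int(m.group("number")) for m in regex.finditer(innerHTML)]
print(values)  # [3, 4, 5]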

How to ignore tokens in ply.yacc

I'm writing a JSON configuration (i.e config file in JSON format) interpreter with PLY.
There are huge swaths of the configuration file that I'd like to ignore. Some parts that I'd like to ignore contain tokens that I can't ignore in other parts of the file.
For example, I want to ignore:
"features" : [{
"name" : "someObscureFeature",
"version": "1.2",
"options": {
"values" : ["a", "b", "c"]
"allowWithoutContentLength": false,
"enabled": true
}
...
}]
But I do NOT want to ignore:
"features" : [{
"name" : "importantFeature",
"version": "1.1",
"options": {
"value": {
"id": 587842,
"description": "ramy-single-hostmatch",
"products": [
"Fresca"
]
...
}]
There are also lots of other tokens within the array of features that I want to ignore if the name value is not 'importantFeature'. For example there is likely to be an array of values in both important and obscure features. I need to ignore accordingly.
Notice also that I need to extract certain elements of the values field and that I'd like the values field to be tokenized so I can make use of it. Effectively, I'd like to conditionally tokenize the values field if it's inside of an importantMatch.
Also note that importantFeature is just standing in for what will eventually be about a dozen different features, each with their own grammar inside of the their respective features blocks.
The problem I'm running into is that every feature, obviously, has a name. I'd like to write something along these lines:
def p_FEATURES(p):
    '''FEATURES : ARRAY_START FEATURE COMMA FEATURES ARRAY_END
                | ARRAY_START FEATURE ARRAY_END'''

def p_FEATURE(p):
    '''FEATURE : TESTABLE_FEATURE
               | UNTESTABLE_FEATURE'''

def p_TESTABLE_FEATURE(p):
    '''TESTABLE_FEATURE : BLOCK_START QUOTE NAME_KEY QUOTE COLON QUOTE CPCODE_FEATURE QUOTE COMMA IGNORE_KEY_VAL_PAIRS COMMA CPCODE_OPTIONS COMMA IGNORE_KEY_VAL_PAIRS'''

def p_UNTESTABLE_FEATURE(p):
    '''UNTESTABLE_FEATURE : IGNORE_BLOCK'''

def p_IGNORE_BLOCK(p):
    '''IGNORE_BLOCK : BLOCK_START LINES BLOCK_END'''
However, the problem I'm running into is that I can't just IGNORE_BLOCK, because the block will have a 'name', and I have a token in my lexer called 'name':
def t_NAME_KEY(t):
    r'name'
    return t
Any help greatly appreciated.
When you define a regex rule function, you can choose whether or not to return the token. Depending on what is returned, the token is either ignored or considered. For example:
def t_BLOCK(t):
    r'\{\s*name\s*:\s*(importantFeature|obscureFeature)\}'  # matches a full block with the 'name' key in it
    if 'obscureFeature' not in t.value:  # t is a LexToken; the matched text is in t.value
        return t
    # returning nothing discards the token
You can build a rule somewhat along these lines, and then choose whether to return the token or not based on whether your important feature was present or not.
Also, the general convention for tokens you want the lexer to discard is to prefix the rule name with t_ignore_.
Based on OP's edit: forget about elimination during tokenisation. What you could do instead is manually rebuild the JSON as you parse it with the grammar. For example, replace
def p_FEATURE(p):
    '''FEATURE : TESTABLE_FEATURE
               | UNTESTABLE_FEATURE'''

def p_TESTABLE_FEATURE(p):
    '''TESTABLE_FEATURE : BLOCK_START QUOTE NAME_KEY QUOTE COLON QUOTE CPCODE_FEATURE QUOTE COMMA IGNORE_KEY_VAL_PAIRS COMMA CPCODE_OPTIONS COMMA IGNORE_KEY_VAL_PAIRS'''

def p_UNTESTABLE_FEATURE(p):
    '''UNTESTABLE_FEATURE : IGNORE_BLOCK'''
with
data = []

def p_FEATURE(p):
    '''FEATURE : BLOCK_START DATA BLOCK_END FEATURE
               | BLOCK_START DATA BLOCK_END'''

def p_DATA(p):
    '''DATA : KEY COLON VALUE COMMA DATA
            | KEY COLON VALUE'''  # and so on (have another function for values)
What you can do now is examine p[2] and see if it is important. If yes, add it to your data variable; else, ignore it. A sketch of this follows below.
This is just a rough idea. You'll still have to figure out the exact grammar rules (for example, VALUE would probably lead to another state too), and decide how and when to add the right blocks to data. But it is possible.
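As a rough sketch of what those actions might look like, assuming each DATA rule passes a dict up through p[0] (the token names are the placeholders from the question, and whether p[1]/p[3] hold raw strings depends on how your lexer rules are written):

data = []

def p_FEATURE(p):
    '''FEATURE : BLOCK_START DATA BLOCK_END FEATURE
               | BLOCK_START DATA BLOCK_END'''
    # p[2] is the dict built by p_DATA; keep only the features we care about
    if p[2].get('name') == 'importantFeature':
        data.append(p[2])

def p_DATA(p):
    '''DATA : KEY COLON VALUE COMMA DATA
            | KEY COLON VALUE'''
    d = {p[1]: p[3]}    # one key/value pair
    if len(p) == 6:     # the recursive alternative matched
        d.update(p[5])  # fold in the remaining pairs
    p[0] = d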

how to "find" docs in mongodb (in python) where a substring exists in a field which is a list of strings? [duplicate]

I want to query something with SQL's like query:
SELECT * FROM users WHERE name LIKE '%m%'
How can I achieve the same in MongoDB? I can't find an operator for like in the documentation.
That would have to be:
db.users.find({"name": /.*m.*/})
Or, similar:
db.users.find({"name": /m/})
You're looking for something that contains "m" somewhere (SQL's '%' operator is equivalent to regular expressions' '.*'), not something that has "m" anchored to the beginning of the string.
Note: MongoDB uses regular expressions which are more powerful than "LIKE" in SQL. With regular expressions you can create any pattern that you imagine.
For more information on regular expressions, refer to Regular expressions (MDN).
db.users.insert({name: 'patrick'})
db.users.insert({name: 'petra'})
db.users.insert({name: 'pedro'})
Therefore:
For:
db.users.find({name: /a/}) // Like '%a%'
Output: patrick, petra
For:
db.users.find({name: /^pa/}) // Like 'pa%'
Output: patrick
For:
db.users.find({name: /ro$/}) // Like '%ro'
Output: pedro
In
PyMongo, using Python
Mongoose, using Node.js
Jongo, using Java
mgo, using Go
you can do:
db.users.find({'name': {'$regex': 'sometext'}})
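In PyMongo specifically, that looks like this (a quick sketch; the connection details are placeholders):

import re
from pymongo import MongoClient

client = MongoClient('localhost', 27017)  # placeholder connection
db = client['test']

# Either pass a compiled Python regex object...
print(list(db.users.find({'name': re.compile('m', re.IGNORECASE)})))

# ...or use the $regex operator directly:
print(list(db.users.find({'name': {'$regex': 'm', '$options': 'i'}})))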
In PHP, you could use the following code:
$collection->find(array('name' => array('$regex' => 'm')));
Here are different types of requirements and solutions for string search with regular expressions.
You can search with a regular expression that contains the word, i.e., like. You can also use '$options': 'i' for a case-insensitive search.
Contains string
db.collection.find({name:{'$regex' : 'string', '$options' : 'i'}})
Doesn't contain string, only with a regular expression
db.collection.find({name:{'$regex' : '^((?!string).)*$', '$options' : 'i'}})
Exact case insensitive string
db.collection.find({name:{'$regex' : '^string$', '$options' : 'i'}})
Start with string
db.collection.find({name:{'$regex' : '^string', '$options' : 'i'}})
End with string
db.collection.find({name:{'$regex' : 'string$', '$options' : 'i'}})
Keep Regular Expressions Cheat Sheet as a bookmark, and a reference for any other alterations you may need.
You would use a regular expression for that in MongoDB.
For example,
db.users.find({"name": /^m/})
You have two choices:
db.users.find({"name": /string/})
or
db.users.find({"name": {"$regex": "string", "$options": "i"}})
For the second one, you have more options, like "i" in the options for a case-insensitive search.
And about the "string": you can use patterns like ".*string.*" (%string%), "string.*" (string%), or ".*string" (%string), for example. You can use any regular expression you want.
If you are using Node.js, the documentation says that you can write this:
db.collection.find( { field: /acme.*corp/i } );
// Or
db.collection.find( { field: { $regex: 'acme.*corp', $options: 'i' } } );
Also, you can write this:
db.collection.find( { field: new RegExp('acme.*corp', 'i') } );
You already have the answers, but to match a regular expression case-insensitively, you could use the following query:
db.users.find ({ "name" : /m/i } ).pretty()
The i in the /m/i indicates case insensitivity and .pretty() provides a prettier output.
For Mongoose in Node.js:
db.users.find({'name': {'$regex': '.*sometext.*'}})
With MongoDB Compass, you need to use the strict mode syntax, as such:
{ "text": { "$regex": "^Foo.*", "$options": "i" } }
(In MongoDB Compass, it's important that you use " instead of ')
You can use the new feature of MongoDB 2.6:
db.foo.insert({desc: "This is a string with text"});
db.foo.insert({desc:"This is a another string with Text"});
db.foo.ensureIndex({"desc":"text"});
db.foo.find({
$text:{
$search:"text"
}
});
In a Node.js project and using Mongoose, use a like query:
var User = mongoose.model('User');
var searchQuery = {};
searchQuery.email = req.query.email;
searchQuery.name = {$regex: req.query.name, $options: 'i'};
User.find(searchQuery, function(error, user) {
    if (error || user === null) {
        return res.status(500).send(error);
    }
    return res.status(200).send(user);
});
You can use a where statement to build any JavaScript script:
db.myCollection.find( { $where: "this.name.toLowerCase().indexOf('m') >= 0" } );
Reference: $where
In MongoDB, you can do a like query using the regular expression ($regex) operator.
For the same examples:
MySQL - SELECT * FROM users WHERE name LIKE '%m%'
MongoDb
1) db.users.find({ "name": { "$regex": "m", "$options": "i" } })
2) db.users.find({ "name": { $regex: new RegExp("m", 'i') } })
3) db.users.find({ "name": { $regex:/m/i } })
4) db.users.find({ "name": /mail/ })
5) db.users.find({ "name": /.*m.*/ })
MySQL - SELECT * FROM users WHERE name LIKE 'm%'
MongoDB: any of the above anchored with /^string/
6) db.users.find({ "name": /^m/ })
MySQL - SELECT * FROM users WHERE name LIKE '%m'
MongoDB: any of the above anchored with /string$/
7) db.users.find({ "name": /m$/ })
In Go and the mgo driver:
Collection.Find(bson.M{"name": bson.RegEx{"m", ""}}).All(&result)
where the result is the struct instance of the sought-after type.
In SQL, the ‘like’ query looks like this:
select * from users where name like '%m%'
In the MongoDB console, it looks like this:
db.users.find({"name": /m/}) // Not JSON formatted
db.users.find({"name": /m/}).pretty() // JSON formatted
In addition, the pretty() method produces formatted JSON output everywhere, which is more readable.
For PHP Mongo like:
I had several issues with PHP Mongo like queries. I found that concatenating the regular expression parameters helps in some situations, e.g., PHP Mongo find field starts with.
For example,
db()->users->insert(['name' => 'john']);
db()->users->insert(['name' => 'joe']);
db()->users->insert(['name' => 'jason']);
// starts with
$like_var = 'jo';
$prefix = '/^';
$suffix = '/';
$name = $prefix . $like_var . $suffix;
db()->users->find(['name' => array('$regex'=>new MongoRegex($name))]);
output: (joe, john)
// contains
$like_var = 'j';
$prefix = '/';
$suffix = '/';
$name = $prefix . $like_var . $suffix;
db()->users->find(['name' => array('$regex'=>new MongoRegex($name))]);
output: (joe, john, jason)
Suppose the collection yourdb contains the names {deepakparmar, dipak, parmar}:
db.getCollection('yourdb').find({"name": /^dee/})
answer: deepakparmar
db.getCollection('yourdb').find({"name": /d/})
answer: deepakparmar, dipak
db.getCollection('yourdb').find({"name": /mar$/})
answer: deepakparmar, parmar
Using template literals with variables also works:
{"firstname": {$regex : `^${req.body.firstname}.*` , $options: 'si' }}
Regular expressions are expensive to process.
Another way is to create an index of text and then search it using $search.
Create a text index of fields you want to make searchable:
db.collection.createIndex({name: 'text', otherField: 'text'});
Search for a string in the text index:
db.collection.find({
    $text: {$search: "The string"}
})
Use regular expressions matching as below. The 'i' shows case insensitivity.
var collections = mongoDatabase.GetCollection("Abcd");
var queryA = Query.And(
Query.Matches("strName", new BsonRegularExpression("ABCD", "i")),
Query.Matches("strVal", new BsonRegularExpression("4121", "i")));
var queryB = Query.Or(
Query.Matches("strName", new BsonRegularExpression("ABCD","i")),
Query.Matches("strVal", new BsonRegularExpression("33156", "i")));
var getA = collections.Find(queryA);
var getB = collections.Find(queryB);
It seems that there are reasons for using both the JavaScript /regex_pattern/ pattern as well as the MongoDB {'$regex': 'regex_pattern'} pattern. See: MongoDB RegEx Syntax Restrictions
This is not a complete regular expression tutorial, but I was inspired to run these tests after seeing a highly voted ambiguous post above.
> ['abbbb','bbabb','bbbba'].forEach(function(v){db.test_collection.insert({val: v})})
> db.test_collection.find({val: /a/})
{ "val" : "abbbb" }
{ "val" : "bbabb" }
{ "val" : "bbbba" }
> db.test_collection.find({val: /.*a.*/})
{ "val" : "abbbb" }
{ "val" : "bbabb" }
{ "val" : "bbbba" }
> db.test_collection.find({val: /.+a.+/})
{ "val" : "bbabb" }
> db.test_collection.find({val: /^a/})
{ "val" : "abbbb" }
> db.test_collection.find({val: /a$/})
{ "val" : "bbbba" }
> db.test_collection.find({val: {'$regex': 'a$'}})
{ "val" : "bbbba" }
A like query would be as shown below:
db.movies.find({title: /.*Twelve Monkeys.*/}).sort({regularizedCorRelation : 1}).limit(10);
For the Scala ReactiveMongo API,
val query = BSONDocument("title" -> BSONRegex(".*" + name + ".*", "")) // like
val sortQ = BSONDocument("regularizedCorRelation" -> BSONInteger(1))
val cursor = collection.find(query).sort(sortQ).options(QueryOpts().batchSize(10)).cursor[BSONDocument]
If you are using Spring-Data MongoDB, you can do it in this way:
String tagName = "m";
Query query = new Query();
query.limit(10);
query.addCriteria(Criteria.where("tagName").regex(tagName));
If you have a string variable, you must convert it to a regex, so MongoDB will run a like query with it:
const name = req.query.title; // e.g. "John"
db.users.find({ "name": new RegExp(name) });
This gives the same result as:
db.users.find({"name": /John/})
Use aggregation substring search (with index!!!):
db.collection.aggregate([{
$project : {
fieldExists : {
$indexOfBytes : ['$field', 'string']
}
}
}, {
$match : {
fieldExists : {
$gt : -1
}
}
}, {
$limit : 5
}
]);
You can query with a regular expression:
db.users.find({"name": /m/});
If the string is coming from the user, you may want to escape it before using it. This prevents literal characters from the user from being interpreted as regex tokens.
For example, searching for the string "A." would also match "AB" if not escaped.
You can use a simple replace to escape your string before using it. I made it a function for reuse:
function textLike(str) {
    var escaped = str.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, '\\$&');
    return new RegExp(escaped, 'i');
}
So now, the string becomes a case-insensitive pattern matching also the literal dot. Example:
> textLike('A.');
< /A\./i
Now we are ready to generate the regular expression on the go:
db.users.find({ "name": textLike("m") });
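In Python, the standard library already has this escaping built in as re.escape, so a PyMongo version of textLike is just (a sketch):

import re

def text_like(s):
    # re.escape neutralizes regex metacharacters such as the dot
    return re.compile(re.escape(s), re.IGNORECASE)

# With a pymongo collection, for example:
# db.users.find({'name': text_like('A.')})  # matches "A." literally, not "AB"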
If you want a 'like' search in MongoDB then you should go with $regex. By using it, the query will be:
db.product.find({name:{$regex:/m/i}})
For more, you can read the documentation as well - $regex
One way to find the result as with equivalent to a like query:
db.collection.find({name:{'$regex' : 'string', '$options' : 'i'}})
where i is used for a case-insensitive data fetch.
Another way by which we can also get the result:
db.collection.find({"name":/aus/})
The above will return the documents whose name contains aus.
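Regarding the list-of-strings part of the question title: a regex query against an array field matches the document if any element of the array matches, so the same syntax works unchanged. A quick PyMongo sketch (connection is a placeholder):

from pymongo import MongoClient

db = MongoClient()['test']  # placeholder connection
db.users.insert_one({'name': ['amy', 'bob']})
print(list(db.users.find({'name': {'$regex': 'm'}})))  # matches: 'amy' contains 'm'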

MongoDB file path as unique index

How should I organize my collection for documents like this:
{
"path" : "\\192.168.77.1\user\1.wav", // unique text index
"sex" : "male", "age" : 28 // some fields
}
I use this schema in Python (pymongo):
from pymongo import MongoClient, TEXT

client = MongoClient(self.addr)
db = client['some']
db.files.ensure_index([('path', TEXT)], unique=True)
data = [
    {"path": r'\\192.168.77.5\1.wav', "base": "CAGS2"},
    {"path": r'\\192.168.77.5\2.wav', "base": "CAGS2"}
]
sid = db.files.insert(data)
But an error occurs:
pymongo.errors.DuplicateKeyError: insertDocument ::
caused by :: 11000 E11000 duplicate key error index:
some.files.$path_text dup key: { : "168", : 0.75 }
If I remove all dots ('.') inside path keys, everything is ok. What is wrong?
Why are you creating a unique text index? For that matter, why is MongoDB letting you? When you create a text index, the input field values are tokenized:
"the brown fox jumps" -> ["the", "brown", "fox", "jumps"]
The tokens are stemmed, meaning they are reduced (in a language-specific way) to a different form to support natural language matching like "loves" with "love" and "loving" with "love". Stopwords like "the", which are common words that would be more harmful than helpful to match on, are thrown out.
["the", "brown", "fox", "jumps"] -> ["brown", "fox", "jump"]
The index entries for the document are the stemmed tokens of the original field value with a score that's calculated based off of how important the term is in the value string. Ergo, when you put a unique index on these values, you are ensuring that you cannot have two documents with terms that stem to the same thing and have the same score. This is pretty much never what you would want because it's hard to tell what it's going to reject. Here is an example:
> db.test.drop()
> db.test.ensureIndex({ "t" : "text" }, { "unique" : true })
> db.test.insert({ "t" : "ducks are quacking" })
WriteResult({ "nInserted" : 1 })
> db.test.insert({ "t" : "did you just quack?" })
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 11000,
"errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.test.$a_text dup key: { : \"quack\", : 0.75 }"
}
})
> db.test.insert({ "t" : "though I walk through the valley of the shadow of death, I will fear no quack" })
WriteResult({ "nInserted" : 1 })
The stemmed term "quack" will result from all three documents, but in the first two it receives the score of 0.75, so the second insertion is rejected by the unique constraint. It receives a score of 0.5625 in the third document.
What are you actually trying to achieve with the index on the path? A unique text index is not what you want.
Have you escaped all the text in the input fields to ensure that it is a valid JSON document?
Here is a valid JSON document:
{
    "path": "\"\\\\192.168.77.1\\user\\1.wav\"",
    "sex": "male",
    "age": 28
}
You have set the text index to be unique - is there already a document in the collection with a path value of "\\192.168.77.1\user\1.wav" ?
Mongo may also be treating the punctuation in the path as delimiters, which may affect how it's stored.
I created a schema with a TEXT index for 'path' and it was saved in the DB. I then tried to change TEXT to ASCENDING/DESCENDING, and nothing worked because I didn't reset the index (or delete and recreate the entire DB).
So, as wdberkeley wrote above: when you create a text index, the input field values are tokenized:
"the brown fox jumps" -> ["the", "brown", "fox", "jumps"]
A TEXT index is not a solution for filenames. Use ASCENDING/DESCENDING instead.
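A minimal sketch of that fix with the modern PyMongo API (path_text is the default name MongoDB generates for a text index on path, but verify it with index_information() first):

from pymongo import MongoClient, ASCENDING

db = MongoClient()['some']           # placeholder connection
print(db.files.index_information())  # check the actual index name first
db.files.drop_index('path_text')     # assumed default name of the old text index
db.files.create_index([('path', ASCENDING)], unique=True)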

extracting numbers from list of strings with python

I have a list of strings that I am trying to parse for data that is meaningful to me. I need an ID number that is contained within the string. Sometimes it might be two or even three of them. Example string might be:
lst1 = [
    '(Tower 3rd floor window corner_ : option 3_floor cut out_large : GA - floors : : model lines : id 3999595(tower 4rd floor window corner : option 3_floor: : whatever else is in iit " new floor : id 3999999)',
    '(Tower 3rd floor window corner_ : option 3_floor cut out_large : GA - floors : : model lines : id 3998895(tower 4rd floor window corner : option 3_floor: : id 5555456 whatever else is in iit " new floor : id 3998899)'
]
I would like to be able to iterate over that list of strings and extract only the id values.
The output would be lst1 = ["3999595; 3999999", "3998895; 5555456; 3998899"], where the id values from the same input string are separated by a semicolon, but the list order still matches the input list.
You can use the id\s(\d{7}) regular expression.
Iterate over the items in the list and join the results of each findall() call with "; ":
import re

lst1 = [
    '(Tower 3rd floor window corner_ : option 3_floor cut out_large : GA - floors : : model lines : id 3999595(tower 4rd floor window corner : option 3_floor: : whatever else is in iit " new floor : id 3999999)',
    '(Tower 3rd floor window corner_ : option 3_floor cut out_large : GA - floors : : model lines : id 3998895(tower 4rd floor window corner : option 3_floor: : id 5555456 whatever else is in iit " new floor : id 3998899)'
]

pattern = re.compile(r'id\s(\d{7})')
print(["; ".join(pattern.findall(item)) for item in lst1])
prints:
['3999595; 3999999', '3998895; 5555456; 3998899']
Based on @alecxe's solution, you can also do it without any imports.
If your id numbers always come after id and have a fixed number of digits (7), you can just use .split('id ') to separate the string and take the 7 digits from each block from the second onwards.
You can put them together in the desired format by using '; '.join().
Putting everything together:
pattern = ['; '.join([value[:7] for value in valueList.split('id ')[1:]]) for valueList in lst1]
Which prints out:
['3999595; 3999999', '3998895; 5555456; 3998899']
