I got the following problem for the Google Coding Challenge which happened on 16th August 2020. I tried to solve it but couldn't.
There are N words in a dictionary such that each word is of fixed
length M and consists only of lowercase English letters, that is
('a', 'b', ..., 'z'). A query word is denoted by Q. The length
of query word is M. These words contain lowercase English letters
but at some places instead of a letter between 'a', 'b', ...,'z'
there is '?'. Refer to the Sample input section to understand this
case. A match count of Q, denoted by match_count(Q) is the
count of words that are in the dictionary and contain the same English
letters(excluding a letter that can be in the position of ?) in the
same position as the letters are there in the query word Q. In other
words, a word in the dictionary can contain any letters at the
position of '?' but the remaining alphabets must match with the
query word.
You are given a query word Q and you are required to compute
match_count.
Input Format
The first line contains two space-separated integers N and M denoting the number of words in the dictionary and length of each word
respectively.
The next N lines contain one word each from the dictionary.
The next line contains an integer Q denoting the number of query words for which you have to compute match_count.
The next Q lines contain one query word each.
Output Format
For each query word, print match_count for that word on a new line.
Constraints
1 <= N <= 5X10^4
1 <= M <= 7
1 <= Q <= 10^5
So, I got 30 minutes for this question and I could write the following code which is incorrect and hence didn't give the expected output.
def Solve(N, M, Words, Q, Query):
    output = []
    count = 0
    for i in range(Q):
        x = Query[i].split('?')
        for k in range(N):
            if x in Words:
                count += 1
            else:
                pass
        output.append(count)
    return output

N, M = map(int , input().split())
Words = []
for _ in range(N):
    Words.append(input())
Q = int(input())
Query = []
for _ in range(Q):
    Query.append(input())
out = Solve(N, M, Words, Q, Query)
for x in out_:
    print(x)
Can somebody help me with some pseudocode or algorithm which can solve this problem, please?
I guess my first try would have been to replace the ? with a . in the query, i.e. change ?at to .at, and then use those as regular expressions and match them against all the words in the dictionary, something as simple as this:
import re
for q in queries:
    p = re.compile(q.replace("?", "."))
    print(sum(1 for w in words if p.match(w)))
However, seeing the input sizes, with N up to 5×10^4 and Q up to 10^5, this might be too slow, just like any other algorithm that compares all pairs of words and queries.
On the other hand, note that M, the number of letters per word, is constant and rather low. So instead, you could create Mx26 sets of words for all letters in all positions and then get the intersection of those sets.
from collections import defaultdict
from functools import reduce
M = 3
words = ["cat", "map", "bat", "man", "pen"]
queries = ["?at", "ma?", "?a?", "??n"]
sets = defaultdict(set)
for word in words:
    for i, c in enumerate(word):
        sets[i,c].add(word)
all_words = set(words)
for q in queries:
    possible_words = (sets[i,c] for i, c in enumerate(q) if c != "?")
    w = reduce(set.intersection, possible_words, all_words)
    print(q, len(w), w)
In the worst case (a query that has a non-? letter that is common to most or all words in the dictionary) this may still be slow, but should be much faster in filtering down the words than iterating all the words for each query. (Assuming random letters in both words and queries, the set of words for the first letter will contain N/26 words, the intersection for the first two has N/26² words, etc.)
This could probably be improved a bit by taking the different cases into account, e.g. (a) if the query does not contain any ?, just check whether it is in the set (!) of words without creating all those intersections; (b) if the query is all-?, just return the set of all words; and (c) sort the possible-words-sets by size and start the intersection with the smallest sets first to reduce the size of temporarily created sets.
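The three refinements (a), (b) and (c) could be sketched roughly as follows. This is only an illustration, not a tuned solution: the helper names `build_index` and `match_count` are made up here, and, like the set-based code above, it assumes the dictionary words are distinct (duplicates would collapse in the sets).

```python
from collections import defaultdict

def build_index(words):
    """Map (position, letter) -> set of words that have that letter there."""
    index = defaultdict(set)
    for word in words:
        for i, c in enumerate(word):
            index[i, c].add(word)
    return index

def match_count(query, word_set, index):
    fixed = [(i, c) for i, c in enumerate(query) if c != "?"]
    if not fixed:
        # (b) all '?': every word matches
        return len(word_set)
    if len(fixed) == len(query):
        # (a) no '?': a plain membership test is enough
        return int(query in word_set)
    # (c) intersect the smallest sets first; .get() avoids adding
    # empty entries to the defaultdict for unseen (position, letter) pairs
    candidates = sorted((index.get(key, set()) for key in fixed), key=len)
    result = candidates[0]
    for s in candidates[1:]:
        if not result:
            break  # already empty, no need to intersect further
        result = result & s
    return len(result)

words = ["cat", "map", "bat", "man", "pen"]
word_set = set(words)
index = build_index(words)
counts = [match_count(q, word_set, index)
          for q in ["?at", "ma?", "?a?", "??n", "cat", "???"]]
print(counts)  # [2, 2, 4, 2, 1, 5]
```

Sorting by size means the first intersection is bounded by the smallest candidate set, so the temporary sets never get larger than that.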
About time complexity: To be honest, I am not sure what time complexity this algorithm has. With N, Q, and M being the number of words, number of queries, and length of words and queries, respectively, creating the initial sets will have complexity O(N*M). After that, the complexity of the queries obviously depends on the number of non-? in the queries (and thus the number of set intersections to create), and the average size of the sets. For queries with zero, one, or M non-? characters, the query will execute in O(M) (evaluating the situation and then a single set/dict lookup), but for queries with two or more non-?-characters, the first set intersections will have on average complexity O(N/26), which strictly speaking is still O(N). (All following intersections will only have to consider N/26², N/26³ etc. elements and are thus negligible.) I don't know how this compares to The Trie Approach and would be very interested if any of the other answers could elaborate on that.
This question can be solved with the help of a trie data structure.
First, add all the words to the trie.
Then check whether the query word is present in the trie. The '?' needs special handling: when the current character is '?', any letter is accepted at that position, so the search continues from every child of the current node with the next character of the word.
I think this approach will work; there is a similar question on LeetCode.
Link : https://leetcode.com/problems/design-add-and-search-words-data-structure/
This should be an O(N) time and space approach, given that M is small and can be considered constant. You might want to look at an implementation of a trie here.
Make a first pass and store the words in a trie.
Next, for each query, perform a combination of DFS and BFS in the following order:
If you receive a ?, perform a BFS step and add all the children.
For a non-? character, perform a DFS step along the matching child; reaching the end of the query indicates that a matching word exists.
For further optimization, a suffix tree may also be used as the storage data structure.
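A minimal Python sketch of these steps, assuming a nested-dict trie and a level-by-level (BFS-style) frontier. The function names and the "$" end-of-word marker are choices made for this illustration, not part of the problem statement.

```python
def build_trie(words):
    """Nested-dict trie; each node maps a letter to its child node."""
    root = {}
    for word in words:
        node = root
        for c in word:
            node = node.setdefault(c, {})
        # "$" stores how many dictionary words end at this node
        node["$"] = node.get("$", 0) + 1
    return root

def trie_match_count(root, query):
    frontier = [root]  # every node still compatible with the query so far
    for c in query:
        if c == "?":
            # wildcard: expand to all children of all frontier nodes
            frontier = [child for node in frontier
                        for key, child in node.items() if key != "$"]
        else:
            # fixed letter: follow only the matching child, if any
            frontier = [node[c] for node in frontier if c in node]
        if not frontier:
            return 0
    return sum(node.get("$", 0) for node in frontier)

root = build_trie(["cat", "map", "bat", "man", "pen"])
print([trie_match_count(root, q) for q in ["?at", "ma?", "?a?", "??n"]])
# [2, 2, 4, 2]
```

Because the count is stored at the end node, repeated dictionary words are counted once per occurrence.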
You can use a simplified version of a trie because the query string has a predefined length. There is no need for an end-of-word flag in the trie node.
#include <bits/stdc++.h>
using namespace std;
typedef struct TrieNode_ {
struct TrieNode_* nxt[26] = {}; // zero-initialize so every child starts as NULL, even with plain `new TrieNode`
} TrieNode;
void addWord(TrieNode* root, string s) {
TrieNode* node = root;
for(int i = 0; i < s.size(); ++i) {
if(node->nxt[s[i] - 'a'] == NULL) {
node->nxt[s[i] - 'a'] = new TrieNode;
}
node = node->nxt[s[i] - 'a'];
}
}
void matchCount(TrieNode* root, string s, int& cnt) {
if(root == NULL) {
return;
}
if(s.empty()) {
++cnt;
return;
}
TrieNode* node = root;
if(s[0] == '?') {
for(int i = 0; i < 26; ++i) {
matchCount(node->nxt[i], s.substr(1), cnt);
}
}
else {
matchCount(node->nxt[s[0] - 'a'], s.substr(1), cnt);
}
}
int main() {
int N, M;
cin >> N >> M;
vector<string> s(N);
TrieNode *root = new TrieNode;
for (int i = 0; i < N; ++i) {
cin >> s[i];
addWord(root, s[i]);
}
int Q;
cin >> Q;
for(int i = 0; i < Q; ++i) {
string queryString;
int cnt = 0;
cin >> queryString;
matchCount(root, queryString, cnt);
cout << cnt << endl;
}
}
Notes: 1. This code doesn't read the input but instead takes params from main method.
2. For large inputs, we could use java 8 streams to parallelize the search process and improve the performance.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class WordSearch {
private void matchCount(int N, int M, int Q, String[] words, String[] queries) {
Pattern p = null;
Matcher m = null;
int count = 0;
for (int i=0; i<Q; i++) {
p = Pattern.compile(queries[i].replace('?','.'));
for (int j=0; j<N; j++) {
m = p.matcher(words[j]);
if (m.matches()) { // matches() requires the whole word to fit the pattern, not just a substring
count++;
}
}
System.out.println("For query word '"+ queries[i] + "', the count is: " + count) ;
count=0;
}
System.out.println("\n");
}
public static void main(String[] args) {
WordSearch ws = new WordSearch();
int N = 5; int M=3; int Q=4;
String[] w = new String[] {"cat", "map", "bat", "man", "pen"};
String[] q = new String[] {"?at", "ma?", "?a?", "??n" };
ws.matchCount(N, M, Q, w, q);
w = new String[] {"uqqur", "lxzev", "ydfgz"};
q = new String[] {"?z???", "???i?", "???e?", "???f?", "?z???"};
N=3; M=5; Q=5;
ws.matchCount(N, M, Q, w, q);
}
}
I can think of a kind of trie with a BFS lookup approach:
from queue import SimpleQueue

class Node:
    def __init__(self, letter):
        self.letter = letter
        self.children = {}

    @classmethod
    def construct(cls):
        return cls(letter=None)

    def add_word(self, word):
        current = self
        for letter in word:
            if letter not in current.children:
                node = Node(letter)
                current.children[letter] = node
            else:
                node = current.children[letter]
            current = node

    def lookup_word(self, word, m):
        def _lookup_next_letter(_letter, _node):
            # Enqueue every child that can follow; `q` and `i` come from the enclosing scope
            if _letter == '?':
                for node in _node.children.values():
                    q.put((node, i))
            elif _letter in _node.children:
                q.put((_node.children[_letter], i))

        q = SimpleQueue()
        count = 0
        i = 0
        current = self
        letter = word[i]
        i += 1
        _lookup_next_letter(letter, current)
        while not q.empty():
            current, i = q.get()
            if i == m:
                count += 1
                continue
            letter = word[i]
            i += 1
            _lookup_next_letter(letter, current)
        return count

    def __eq__(self, other):
        return self.letter == other.letter if isinstance(other, Node) else other

    def __hash__(self):
        return hash(self.letter)
I would create a lookup table for each letter of each word, and then use that table to iterate with. While the lookup table will cost O(NM) memory (or 15 entries in the situation shown), it will allow an easy O(NM) time complexity to be implemented, with a best case O(log N * log M).
The lookup table can be stored in the form of a coordinate plane. Each letter will have an "x" position (the letter's index) as well as a "y" position (the word's index in the dictionary). This will allow a quick cross reference from the query to look up a letter's position for existence and the word's position for eligibility.
Worst case, this approach has a time complexity O(NM) whereby there must be N iterations, one for each dictionary entry, times M iterations, one for each letter in each entry. In many cases it will skip the lookups though.
A coordinate system is also created, which also has O(NM) spatial complexity.
I'm unfamiliar with Python, so this is written in JavaScript, which was as close as I could come language-wise. Hopefully this at least serves as an example of a possible solution.
In addition, I included a heavily loaded section to use for performance comparisons. This takes about 5 seconds to complete a set with 2000 words and 5000 queries, each at a length of 200.
// Main function running the analysis
function run(dict, qs) {
// Use a coordinate system for tracking the letter and position
var coordinates = 'abcdefghijklmnopqrstuvwxyz'.split('').reduce((p, c) => (p[c] = {}, p), {});
// Populate the system
for (var i = 0; i < dict.length; i++) {
// Current word in the given dictionary
var dword = dict[i];
// Iterate the letters for tracking
for (var j = 0; j < dword.length; j++) {
// Current letter in our current word
var letter = dword[j];
// Make sure that there is object existence for assignment
coordinates[letter][j] = coordinates[letter][j] || {};
// Note the letter's coordinate by storing its array
// position (i) as well as its letter position (j)
coordinates[letter][j][i] = 1;
}
}
// Lookup the word letter by letter in our coordinate system
function match_count(Q) {
// Create an array which maps from the dictionary indices
// to a truthy value of 1 for tracking successful matches
var availLookup = dict.reduce((p,_,i) => (p[i]=1,p),{});
// Iterate the letters of Q to check against the coordinate system
for (var i = 0; i < Q.length; i++) {
// Current letter in Q
var letter = Q[i];
// Skip '?' characters
if (letter == '?') continue;
// Look up the existence of "points" in our coordinate system for
// the current letter
var points = coordinates[letter];
// If nothing from the dictionary matches in this position,
// then there are no matches anywhere and we return a 0
if (!points || !points[i]) return 0;
// Iterate the availability truth table made earlier
// and look up whether any points in our coordinate system
// are present for the current letter. If they are, then the word
// remains, if not, it is removed from consideration.
for(var n in availLookup){
if(!points[i][n]) delete availLookup[n];
}
}
// Sum the "truthy" 1 values we used earlier to determine the count of
// matched words
return Object.values(availLookup).reduce((x, y) => x + y, 0);
}
var matches = [];
for (var i = 0; i < qs.length; i++) {
matches.push(match_count(qs[i]));
}
return matches;
}
document.querySelector('button').onclick=_=>{
console.clear();
var d1 = [
'cat',
'map',
'bat',
'man',
'pen'
];
var q1 = [
'?at',
'ma?',
'?a?',
'??n'
];
console.log('running...');
console.log(run(d1, q1));
var d2 = [
'uqqur',
'lxzev',
'ydfgz'
];
var q2 = [
'?z???',
'???i?',
'???e?',
'???f?',
'?z???'
];
console.log('running...');
console.log(run(d2, q2));
// Load it up (try this with other versions to compare with efficiency)
var d3 = [];
var q3 = [];
var wordcount = 2000;
var querycount = 5000;
var len = 200;
var alphabet = 'abcdefghijklmnopqrstuvwxyz'.split('');
for(var i = 0; i < wordcount; i++){
var word = "";
for(var n = 0; n < len; n++){
var rand = (Math.random()*25)|0;
word += alphabet[rand];
}
d3.push(word);
}
for(var i = 0; i < querycount; i++){
var qword = d3[(Math.random()*(wordcount-1))|0];
var query = "";
for(var n = 0; n < len; n++){
var rand = (Math.random()*100)|0;
if(rand > 98){ query += alphabet[(Math.random()*25)|0]; }
else{ query += rand > 75 ? qword[n] : '?'; }
}
q3.push(query);
}
if(document.querySelector('input').checked){
//console.log(d3,q3);
console.log('running...');
console.log(run(d3, q3).reduce((x, y) => x + y, 0) + ' matches');
}
};
<input type=checkbox>Include the ~5 second larger version<br>
<button type=button>run</button>
I don't know Python, but the gist of the naive algorithm looks like this:
# count how many words in the Words list match a single query
def DoQuery(Words, OneQuery):
    count = 0
    # for each word in the Words list
    for word in Words:
        # compare each letter to the query
        match = True
        for j in range(len(word)):
            wordLetter = word[j]
            queryLetter = OneQuery[j]
            # if the letters do not match and are not ?, then skip to next word
            if queryLetter != '?' and queryLetter != wordLetter:
                match = False
                break
        # if we did not skip, the words match. Increase the count
        if match:
            count = count + 1
    # we have now checked all the words, return the count
    return count
Of course, this executes the innermost loop around 3.5x10^10 times, which might be too slow. So one would need to read in the dictionary, precompute some sort of shortcut data structure, then use the shortcut to find the answers faster.
One shortcut data structure would be to make a map of possible queries to answers, making the query O(1). There are only 4.47*10^9 possible queries, so this is possibly faster.
A similar shortcut data structure would be to make a trie of possible queries to answers, making the query O(M). There are only 4.47*10^9 possible queries, so this is possibly faster. This is more complex code, but may also be easier to understand for some people.
Another shortcut would be to "assume" each query has exactly one non-question-mark, and make a map of possible queries to subset dictionaries. This would mean you'd still have to run the naive query on the subset dictionary, but it would be ~26x smaller, and thus ~26x faster. You'd also have to convert the real query into only having one non-question-mark to lookup the subset dictionary in the map, but that should be easy.
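The first shortcut above (a map from possible queries to answers) does not have to cover every conceivable query. Only patterns that can be formed by masking positions of an actual dictionary word have a non-zero answer, which is at most N × 2^M = 5×10^4 × 128 = 6.4×10^6 entries for the given constraints. A rough sketch, where the name `build_pattern_counts` is made up for this example and memory use in Python at full input size is not measured:

```python
from collections import Counter
from itertools import product

def build_pattern_counts(words):
    """Count, for every masked variant of every word, how many words produce it."""
    counts = Counter()
    for word in words:
        # Each tuple says, per position, whether to keep the letter (True)
        # or replace it with '?' (False): 2**len(word) variants per word.
        for keep in product((True, False), repeat=len(word)):
            pattern = "".join(c if k else "?" for c, k in zip(word, keep))
            counts[pattern] += 1
    return counts

words = ["cat", "map", "bat", "man", "pen"]
pattern_counts = build_pattern_counts(words)
# A Counter returns 0 for patterns it has never seen, without storing them.
answers = [pattern_counts[q] for q in ["?at", "ma?", "?a?", "??n", "?z?"]]
print(answers)  # [2, 2, 4, 2, 0]
```

After the one-time preprocessing, each query is a single dictionary lookup.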
I think we can use a trie to solve this problem.
Initially, we add all the strings to the trie, and later, for each query, we check whether it exists in the trie.
The only difference here is the '?', which we can treat as a match-any-character wildcard: whenever we encounter '?' in the search string, we branch into every child of the current node and continue the search with a DFS along all of those paths.
Below is the C++ code:
#include <iostream>
#include <string>
#include <vector>
using namespace std;
class Trie {
public:
bool isEnd;
vector<Trie*> children;
Trie() {
this->isEnd = false;
this->children = vector<Trie*>(26, nullptr);
}
};
Trie* root;
void insert(string& str) {
int n = str.size(), idx, i = 0;
Trie* node = root;
while(i < n) {
idx = str[i++] - 'a';
if (node->children[idx] == nullptr) {
node->children[idx] = new Trie();
}
node = node->children[idx];
}
node->isEnd = true;
}
int getMatches(int i, string& str, Trie* node) {
int idx, n = str.size();
while(i < n) {
if (str[i] >= 'a' && str[i] <='z')
idx = str[i] - 'a';
else {
int res = 0;
for(int j = 0;j<26;j++) {
if (node->children[j] != nullptr)
res += getMatches(i+1, str, node->children[j]);
}
return res;
}
if (node->children[idx] == nullptr) return 0;
node = node->children[idx];
++i;
}
return node->isEnd ? 1 : 0;
}
int main() {
int n, m;
cin>>n>>m;
string str;
root = new Trie();
while(n--) {
cin>>str;
insert(str);
}
int q;
cin>>q;
while(q--) {
cin>>str;
cout<<(str.size() == m ? getMatches(0, str, root) : 0)<<"\n";
}
}
Can I do it with ASCII values, like this:
For the characters in the query word, calculate the sum of their ASCII values.
For each word in the dictionary, accumulate the ASCII values character by character and compare the running sum with the query word's ASCII sum. For example, for bat: if the ASCII value of b matches the query's sum, increment the count; otherwise add the ASCII value of a and check again, and so on. At the end, return the count.
How's this approach?
Java Implementation using Trie
import java.util.*;
import java.io.*;
import java.lang.*;
public class Main {
static class TrieNode
{
TrieNode []children = new TrieNode[26];
boolean endOfWord;
TrieNode()
{
this.endOfWord = false;
for (int i = 0; i < 26; i++) {
this.children[i] = null;
}
}
void addWord(String word)
{
// Crawl pointer points the object
// in reference
TrieNode pCrawl = this;
// Traverse the given array of words
for (int i = 0; i < word.length(); i++) {
int index = word.charAt(i) - 'a';
if (pCrawl.children[index]==null)
pCrawl.children[index]
= new TrieNode();
pCrawl = pCrawl.children[index];
}
pCrawl.endOfWord = true;
}
public static int ans2 = 0;
void search(String word, boolean found, String curr_found, int pos)
{
TrieNode pCrawl = this;
if (pos == word.length()) {
if (pCrawl.endOfWord) {
found = true;
ans2++;
}
return;
}
if (word.charAt(pos) == '?') {
// Iterate over every letter and
// proceed further by replacing
// the character in place of '?'
for (int i = 0; i < 26; i++) {
if (pCrawl.children[i] != null) {
pCrawl.children[i].search(word,found,curr_found + (char)('a' + i),pos + 1);
}
}
}
else { // Check if pointer at character
// position is available,
// then proceed
if (pCrawl.children[word.charAt(pos) - 'a'] != null) {
pCrawl.children[word.charAt(pos) - 'a']
.search(word,found,curr_found + word.charAt(pos),pos + 1);
}
}
return;
}
// Utility function for search operation
int searchUtil(String word)
{
TrieNode pCrawl = this;
boolean found = false;
ans2 = 0;
pCrawl.search(word, found,"",0);
return ans2;
}
}
static int searchPattern(String arr[], int N,String str)
{
// Object of the class Trie
TrieNode obj = new TrieNode();
for (int i = 0; i < N; i++) {
obj.addWord(arr[i]);
}
// Search pattern
return obj.searchUtil(str);
}
public static void ans(String []arr , int n, int m,String [] query, int q){
for(int i=0;i<q;i++)
System.out.println(searchPattern(arr,n,query[i]));
}
public static void main(String args[]) {
Scanner scn = new Scanner(System.in);
int n = scn.nextInt();
int m = scn.nextInt();
String []arr = new String[n];
for(int i=0;i<n;i++){
arr[i] = scn.next();
}
int q = scn.nextInt();
String []query = new String[q];
for(int i=0;i<q;i++){
query[i] = scn.next();
}
ans(arr,n,m,query,q);
}
}
This is brute force, but a trie is a better implementation.
import collections

"""
Input: db, which is a list of words
       chk: str to find
"""
def check(db, chk):
    seen = collections.defaultdict(list)
    for i in db:
        # Only patterns with exactly one "?" are generated here
        for j in range(len(i)):
            temp = i[:j] + "?" + i[j+1:]
            seen[temp].append(i)
    return len(seen[chk])

print(check(["cat", "bat"], "?at"))
Sounds like it was a coding challenge about https://en.wikipedia.org/wiki/Space%E2%80%93time_tradeoff
Depending on parameters N,M,Q as well as data and query distribution, the "best" algorithm will be different. A simple example, given the query ??? you know the answer — the length of the dictionary — without any computation 😸
In the general case, most likely, it pays to create a search index in advance (that is while reading the dictionary, before any query is seen).
I'd go with this: number the input 0 cat; 1 map; ...
Then build a search index per letter position:
index = [
    {"c": 0b00001, "m": 0b01010, ...},  # first query letter
    {"a": 0b01111, "e": 0b10000},       # second query letter
]
Prepare all = 0b11111 (all bits set) as "matches everything".
Then query lookup: ?a? ⇒ all & index[1]["a"] & all. †
Afterwards you'll need to count number of bits set in the result.
The time complexity of single query is therefore O(N) * (M + O(1)) ‡, which is a decent trade-off.
The entire batch is O(N*M*Q).
Python (as well as es2020) supports native arbitrary precision integers, which can be elegantly used for bitmaps, as well as native dictionaries, use them :) However if the data is sparse, an adaptive or compressed bitmap such as https://pypi.org/project/roaringbitmap may perform better.
† In practice ... & index[1].get("a", 0) & ... in case you hit a blank.
‡ Python data structure time complexity is reported O(...) amortised worst case while in CS O(...) worst case is usually considered. While the difference is subtle, it can bite even experienced developers, see e.g. https://bugs.python.org/issue13703
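A small sketch of the bitmap index using plain Python integers. The function names are invented for this example, and word k is assumed to own bit k, as in the numbering above:

```python
def build_bitmaps(words, m):
    """index[pos][letter] is an int whose bit k is set when words[k][pos] == letter."""
    index = [dict() for _ in range(m)]
    for k, word in enumerate(words):
        for pos, c in enumerate(word):
            index[pos][c] = index[pos].get(c, 0) | (1 << k)
    return index

def bitmap_match_count(index, n, query):
    result = (1 << n) - 1  # all n bits set: every word is still a candidate
    for pos, c in enumerate(query):
        if c != "?":
            # .get(c, 0) handles a letter that never occurs at this position
            result &= index[pos].get(c, 0)
    return bin(result).count("1")  # number of set bits

words = ["cat", "map", "bat", "man", "pen"]
index = build_bitmaps(words, 3)
print(bin(index[1]["a"]))  # 0b1111
print([bitmap_match_count(index, len(words), q)
       for q in ["?at", "ma?", "?a?", "??n"]])  # [2, 2, 4, 2]
```

On Python 3.10 and later, `result.bit_count()` could replace the `bin(...).count("1")` step.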
One approach could be to use Python's fnmatch module (for every pattern sum the matches in words):
import fnmatch
names = ['uqqur', 'lxzev', 'ydfgs']
patterns = ['?z???', '???i?', '???e?', '???f?', '?z???']
[sum(fnmatch.fnmatch(name, pattern) for name in names) for pattern in patterns]
# [0, 0, 1, 0, 0]
I need to know how I would go about recreating a version of the int() function in Python so that I can fully understand it and create my own version with multiple bases that go past base 36. I can convert from a decimal to my own base (base 54, altered) just fine, but I need to figure out how to go from a string in my base's format to an integer (base 10).
Basically, I want to know how to go from my base, which I call base 54, to base 10. I don't need specifics, because if I have an example, I can work it out on my own. Unfortunately, I can't find anything on the int() function, though I know it has to be somewhere, since Python is open-source.
This is the closest I can find to it, but it doesn't provide source code for the function itself. Python int() test.
If you can help, thanks. If not, well, thanks for reading this, I guess?
You're not going to like this answer, but int(num, base) is defined in C (it's a builtin)
I went searching around and found it:
https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Objects/longobject.c
/* Parses an int from a bytestring. Leading and trailing whitespace will be
* ignored.
*
* If successful, a PyLong object will be returned and 'pend' will be pointing
* to the first unused byte unless it's NULL.
*
* If unsuccessful, NULL will be returned.
*/
PyObject *
PyLong_FromString(const char *str, char **pend, int base)
{
int sign = 1, error_if_nonzero = 0;
const char *start, *orig_str = str;
PyLongObject *z = NULL;
PyObject *strobj;
Py_ssize_t slen;
if ((base != 0 && base < 2) || base > 36) {
PyErr_SetString(PyExc_ValueError,
"int() arg 2 must be >= 2 and <= 36");
return NULL;
}
while (*str != '\0' && Py_ISSPACE(Py_CHARMASK(*str))) {
str++;
}
if (*str == '+') {
++str;
}
else if (*str == '-') {
++str;
sign = -1;
}
if (base == 0) {
if (str[0] != '0') {
base = 10;
}
else if (str[1] == 'x' || str[1] == 'X') {
base = 16;
}
else if (str[1] == 'o' || str[1] == 'O') {
base = 8;
}
else if (str[1] == 'b' || str[1] == 'B') {
base = 2;
}
else {
/* "old" (C-style) octal literal, now invalid.
it might still be zero though */
error_if_nonzero = 1;
base = 10;
}
}
if (str[0] == '0' &&
((base == 16 && (str[1] == 'x' || str[1] == 'X')) ||
(base == 8 && (str[1] == 'o' || str[1] == 'O')) ||
(base == 2 && (str[1] == 'b' || str[1] == 'B')))) {
str += 2;
/* One underscore allowed here. */
if (*str == '_') {
++str;
}
}
if (str[0] == '_') {
/* May not start with underscores. */
goto onError;
}
start = str;
if ((base & (base - 1)) == 0) {
int res = long_from_binary_base(&str, base, &z);
if (res < 0) {
/* Syntax error. */
goto onError;
}
}
else {
/***
Binary bases can be converted in time linear in the number of digits, because
Python's representation base is binary. Other bases (including decimal!) use
the simple quadratic-time algorithm below, complicated by some speed tricks.
First some math: the largest integer that can be expressed in N base-B digits
is B**N-1. Consequently, if we have an N-digit input in base B, the worst-
case number of Python digits needed to hold it is the smallest integer n s.t.
BASE**n-1 >= B**N-1 [or, adding 1 to both sides]
BASE**n >= B**N [taking logs to base BASE]
n >= log(B**N)/log(BASE) = N * log(B)/log(BASE)
The static array log_base_BASE[base] == log(base)/log(BASE) so we can compute
this quickly. A Python int with that much space is reserved near the start,
and the result is computed into it.
The input string is actually treated as being in base base**i (i.e., i digits
are processed at a time), where two more static arrays hold:
convwidth_base[base] = the largest integer i such that base**i <= BASE
convmultmax_base[base] = base ** convwidth_base[base]
The first of these is the largest i such that i consecutive input digits
must fit in a single Python digit. The second is effectively the input
base we're really using.
Viewing the input as a sequence <c0, c1, ..., c_n-1> of digits in base
convmultmax_base[base], the result is "simply"
(((c0*B + c1)*B + c2)*B + c3)*B + ... ))) + c_n-1
where B = convmultmax_base[base].
Error analysis: as above, the number of Python digits `n` needed is worst-
case
n >= N * log(B)/log(BASE)
where `N` is the number of input digits in base `B`. This is computed via
size_z = (Py_ssize_t)((scan - str) * log_base_BASE[base]) + 1;
below. Two numeric concerns are how much space this can waste, and whether
the computed result can be too small. To be concrete, assume BASE = 2**15,
which is the default (and it's unlikely anyone changes that).
Waste isn't a problem: provided the first input digit isn't 0, the difference
between the worst-case input with N digits and the smallest input with N
digits is about a factor of B, but B is small compared to BASE so at most
one allocated Python digit can remain unused on that count. If
N*log(B)/log(BASE) is mathematically an exact integer, then truncating that
and adding 1 returns a result 1 larger than necessary. However, that can't
happen: whenever B is a power of 2, long_from_binary_base() is called
instead, and it's impossible for B**i to be an integer power of 2**15 when
B is not a power of 2 (i.e., it's impossible for N*log(B)/log(BASE) to be
an exact integer when B is not a power of 2, since B**i has a prime factor
other than 2 in that case, but (2**15)**j's only prime factor is 2).
The computed result can be too small if the true value of N*log(B)/log(BASE)
is a little bit larger than an exact integer, but due to roundoff errors (in
computing log(B), log(BASE), their quotient, and/or multiplying that by N)
yields a numeric result a little less than that integer. Unfortunately, "how
close can a transcendental function get to an integer over some range?"
questions are generally theoretically intractable. Computer analysis via
continued fractions is practical: expand log(B)/log(BASE) via continued
fractions, giving a sequence i/j of "the best" rational approximations. Then
j*log(B)/log(BASE) is approximately equal to (the integer) i. This shows that
we can get very close to being in trouble, but very rarely. For example,
76573 is a denominator in one of the continued-fraction approximations to
log(10)/log(2**15), and indeed:
>>> log(10)/log(2**15)*76573
16958.000000654003
is very close to an integer. If we were working with IEEE single-precision,
rounding errors could kill us. Finding worst cases in IEEE double-precision
requires better-than-double-precision log() functions, and Tim didn't bother.
Instead the code checks to see whether the allocated space is enough as each
new Python digit is added, and copies the whole thing to a larger int if not.
This should happen extremely rarely, and in fact I don't have a test case
that triggers it(!). Instead the code was tested by artificially allocating
just 1 digit at the start, so that the copying code was exercised for every
digit beyond the first.
***/
twodigits c; /* current input character */
Py_ssize_t size_z;
Py_ssize_t digits = 0;
int i;
int convwidth;
twodigits convmultmax, convmult;
digit *pz, *pzstop;
const char *scan, *lastdigit;
char prev = 0;
static double log_base_BASE[37] = {0.0e0,};
static int convwidth_base[37] = {0,};
static twodigits convmultmax_base[37] = {0,};
if (log_base_BASE[base] == 0.0) {
twodigits convmax = base;
int i = 1;
log_base_BASE[base] = (log((double)base) /
log((double)PyLong_BASE));
for (;;) {
twodigits next = convmax * base;
if (next > PyLong_BASE) {
break;
}
convmax = next;
++i;
}
convmultmax_base[base] = convmax;
assert(i > 0);
convwidth_base[base] = i;
}
/* Find length of the string of numeric characters. */
scan = str;
lastdigit = str;
while (_PyLong_DigitValue[Py_CHARMASK(*scan)] < base || *scan == '_') {
if (*scan == '_') {
if (prev == '_') {
/* Only one underscore allowed. */
str = lastdigit + 1;
goto onError;
}
}
else {
++digits;
lastdigit = scan;
}
prev = *scan;
++scan;
}
if (prev == '_') {
/* Trailing underscore not allowed. */
/* Set error pointer to first underscore. */
str = lastdigit + 1;
goto onError;
}
/* Create an int object that can contain the largest possible
* integer with this base and length. Note that there's no
* need to initialize z->ob_digit -- no slot is read up before
* being stored into.
*/
double fsize_z = (double)digits * log_base_BASE[base] + 1.0;
if (fsize_z > (double)MAX_LONG_DIGITS) {
/* The same exception as in _PyLong_New(). */
PyErr_SetString(PyExc_OverflowError,
"too many digits in integer");
return NULL;
}
size_z = (Py_ssize_t)fsize_z;
/* Uncomment next line to test exceedingly rare copy code */
/* size_z = 1; */
assert(size_z > 0);
z = _PyLong_New(size_z);
if (z == NULL) {
return NULL;
}
Py_SIZE(z) = 0;
/* `convwidth` consecutive input digits are treated as a single
* digit in base `convmultmax`.
*/
convwidth = convwidth_base[base];
convmultmax = convmultmax_base[base];
/* Work ;-) */
while (str < scan) {
if (*str == '_') {
str++;
continue;
}
/* grab up to convwidth digits from the input string */
c = (digit)_PyLong_DigitValue[Py_CHARMASK(*str++)];
for (i = 1; i < convwidth && str != scan; ++str) {
if (*str == '_') {
continue;
}
i++;
c = (twodigits)(c * base +
(int)_PyLong_DigitValue[Py_CHARMASK(*str)]);
assert(c < PyLong_BASE);
}
convmult = convmultmax;
/* Calculate the shift only if we couldn't get
* convwidth digits.
*/
if (i != convwidth) {
convmult = base;
for ( ; i > 1; --i) {
convmult *= base;
}
}
/* Multiply z by convmult, and add c. */
pz = z->ob_digit;
pzstop = pz + Py_SIZE(z);
for (; pz < pzstop; ++pz) {
c += (twodigits)*pz * convmult;
*pz = (digit)(c & PyLong_MASK);
c >>= PyLong_SHIFT;
}
/* carry off the current end? */
if (c) {
assert(c < PyLong_BASE);
if (Py_SIZE(z) < size_z) {
*pz = (digit)c;
++Py_SIZE(z);
}
else {
PyLongObject *tmp;
/* Extremely rare. Get more space. */
assert(Py_SIZE(z) == size_z);
tmp = _PyLong_New(size_z + 1);
if (tmp == NULL) {
Py_DECREF(z);
return NULL;
}
memcpy(tmp->ob_digit,
z->ob_digit,
sizeof(digit) * size_z);
Py_DECREF(z);
z = tmp;
z->ob_digit[size_z] = (digit)c;
++size_z;
}
}
}
}
if (z == NULL) {
return NULL;
}
if (error_if_nonzero) {
/* reset the base to 0, else the exception message
doesn't make too much sense */
base = 0;
if (Py_SIZE(z) != 0) {
goto onError;
}
/* there might still be other problems, therefore base
remains zero here for the same reason */
}
if (str == start) {
goto onError;
}
if (sign < 0) {
Py_SIZE(z) = -(Py_SIZE(z));
}
while (*str && Py_ISSPACE(Py_CHARMASK(*str))) {
str++;
}
if (*str != '\0') {
goto onError;
}
long_normalize(z);
z = maybe_small_long(z);
if (z == NULL) {
return NULL;
}
if (pend != NULL) {
*pend = (char *)str;
}
return (PyObject *) z;
onError:
if (pend != NULL) {
*pend = (char *)str;
}
Py_XDECREF(z);
slen = strlen(orig_str) < 200 ? strlen(orig_str) : 200;
strobj = PyUnicode_FromStringAndSize(orig_str, slen);
if (strobj == NULL) {
return NULL;
}
PyErr_Format(PyExc_ValueError,
"invalid literal for int() with base %d: %.200R",
base, strobj);
Py_DECREF(strobj);
return NULL;
}
If you want to define it in C, go ahead and try using this; if not, you'll have to write it yourself.
I need to generate the following string in C:
$(python -c "print('\x90' * a + 'blablabla' + '\x90' * b + 'h\xef\xff\xbf')")
where a and b are arbitrary integers and blablabla represents an arbitrary string. I am attempting to do this by first creating
char str1[size];
and then doing:
for (int i = 0; i < a; i+=1) {
strcat(str1, "\x90");
}
Next I use strcat again:
strcat(str1, "blablabla");
and I run the loop again, this time b times, to concatenate the next b x90 characters. Finally, I use strcat once more as follows:
strcat(str1, "h\xef\xff\xbf");
However, these two strings do not match. Is there a more efficient way of replicating the behaviour of python's * in C? Or am I missing something?
char str1[size];
Even assuming you calculated size correctly, I recommend using
char * str = malloc(size);
Either way, once you have the memory for the string, you will have to initialize it by first doing
str[0]=0;
if you intend to use strcat.
for (int i = 0; i < a; i+=1) {
strcat(str1, "\x90");
}
A loop like this is only worthwhile if "\x90" really is a string (i.e. something composed of more than one character), that string is short (it is hard to give a firm limit, but about 16 bytes would be the most) and a is rather small[1]. Here, as John Coleman already suggested, memset is a better way to do it.
memset(str, '\x90', a);
Because you know the location where "blablabla" is to be stored, just store it there using strcpy instead of strcat:
// strcat(str1, "blablabla");
strcpy(str + a, "blablabla");
However, you need the address of the character after "blablabla" (one way or the other). So I would not even do it that way but instead like this:
const char * add_str = "blablabla";
size_t sl = strlen(add_str);
memcpy(str + a, add_str, sl);
Then, instead of your second loop, use another memset:
memset(str + a + sl, '\x90', b);
Last but not least, instead of strcat again strcpy is better (here, memcpy doesn't help):
strcpy(str + a + sl + b, "h\xef\xff\xbf");
But you need its size for the size calculation at the beginning, so it is better to handle it like the blablabla string anyway (and remember the trailing '\0').
Finally, I would put all this code into a function like this:
char * gen_string(int a, int b) {
const char * add_str_1 = "blablabla";
size_t sl_1 = strlen(add_str_1);
const char * add_str_2 = "h\xef\xff\xbf";
size_t sl_2 = strlen(add_str_2);
size_t size = a + sl_1 + b + sl_2 + 1;
// The + 1 is important for the '\0' at the end
char * str = malloc(size);
if (!str) {
return NULL;
}
memset(str, '\x90', a);
memcpy(str + a, add_str_1, sl_1);
memset(str + a + sl_1, '\x90', b);
memcpy(str + a + sl_1 + b, add_str_2, sl_2);
str[a + sl_1 + b + sl_2] = 0; // 0 is the same as '\0'
return str;
}
Remember to free() the retval of gen_string at some point.
If the list of memset and memcpy calls gets longer, then I'd suggest doing it like this:
char * ptr = str;
memset(ptr, '\x90', a ); ptr += a;
memcpy(ptr, add_str_1, sl_1); ptr += sl_1;
memset(ptr, '\x90', b ); ptr += b;
memcpy(ptr, add_str_2, sl_2); ptr += sl_2;
*ptr = 0; // 0 is the same as '\0'
maybe even creating a macro for memset and memcpy:
#define MEMSET(c, l) do { memset(ptr, c, l); ptr += l; } while (0)
#define MEMCPY(s, l) do { memcpy(ptr, s, l); ptr += l; } while (0)
char * ptr = str;
MEMSET('\x90', a );
MEMCPY(add_str_1, sl_1);
MEMSET('\x90', b );
MEMCPY(add_str_2, sl_2);
*ptr = 0; // 0 is the same as '\0'
#undef MEMSET
#undef MEMCPY
For the reasons behind these recommendations, I suggest you read the blog post Back to Basics (by one of the founders of Stack Overflow), which happens to be not only John Coleman's favorite blog post but mine as well. There you will learn that using strcat in a loop, the way you first tried it, has quadratic run time, and hence why you should not use it that way.
[1] If a is big and/or the string that needs to be repeated is long, a better solution would be something like this:
const char * str_a = "\x90";
size_t sl_a = strlen(str_a);
char * ptr = str;
for (size_t i = 0; i < a; ++i) {
strcpy(ptr, str_a);
ptr += sl_a;
}
// then go on at address str + a * sl_a
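To check the C buffer against what the Python command in the question is meant to produce, it can help to build the same layout as raw bytes. The following is only a sketch: `build_payload` is a hypothetical helper name, and it assumes Python 3, where `b"\x90"` is a single byte while the text string `'\x90'` becomes two bytes when encoded as UTF-8. That difference is one possible reason for a mismatch, in addition to the uninitialized buffer discussed above.

```python
def build_payload(a: int, body: bytes, b: int, tail: bytes) -> bytes:
    """Return a filler bytes, then body, then b filler bytes, then tail."""
    # bytes * int repeats the byte string, like str * int does for text.
    return b"\x90" * a + body + b"\x90" * b + tail


expected = build_payload(3, b"blablabla", 2, b"h\xef\xff\xbf")
print(expected.hex())   # hex dump to compare with the C buffer
print(len(expected))    # 3 + 9 + 2 + 4 = 18

# In Python 3 a str holding U+0090 is not the single byte 0x90 once encoded:
print("\x90".encode("utf-8"))  # b'\xc2\x90'
```

The length printed here corresponds to `a + sl_1 + b + sl_2` in gen_string, without the terminating '\0'.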
For individual 1-byte chars you can use memset to partially replicate the behavior of Python's *:
#include<stdio.h>
#include<string.h>
int main(void){
char buffer[100];
memset(buffer,'#',10);
buffer[10] = '\0';
printf("%s\n",buffer);
memset(buffer, '*', 5);
buffer[5] = '\0';
printf("%s\n",buffer);
return 0;
}
Output:
##########
*****
For a more robust solution, see this.
Efficient way to count number of 1s in the binary representation of a number in O(1) if you have enough memory to play with. This is an interview question I found on an online forum, but it had no answer. Can somebody suggest something? I can't think of a way to do it in O(1) time.
That's the Hamming weight problem, a.k.a. population count. The link mentions efficient implementations. Quoting:
With unlimited memory, we could simply create a large lookup table of the Hamming weight of every 64 bit integer
I've got a solution that counts the bits in O(Number of 1's) time:
bitcount(n):
    count = 0
    while n > 0:
        count = count + 1
        n = n & (n-1)
    return count
In worst case (when the number is 2^n - 1, all 1's in binary) it will check every bit.
Edit:
Just found a very nice constant-time, constant memory algorithm for bitcount. Here it is, written in C:
int BitCount(unsigned int u)
{
unsigned int uCount;
uCount = u - ((u >> 1) & 033333333333) - ((u >> 2) & 011111111111);
return ((uCount + (uCount >> 3)) & 030707070707) % 63;
}
You can find proof of its correctness here.
Please note the fact that: n&(n-1) always eliminates the least significant 1.
Hence we can write the code for calculating the number of 1's as follows:
count=0;
while(n!=0){
n = n&(n-1);
count++;
}
cout<<"Number of 1's in n is: "<<count;
The complexity of the program is proportional to the number of 1's in n (which is at most 32 for a 32-bit integer).
I saw the following solution from another website:
int count_one(int x){
x = (x & (0x55555555)) + ((x >> 1) & (0x55555555));
x = (x & (0x33333333)) + ((x >> 2) & (0x33333333));
x = (x & (0x0f0f0f0f)) + ((x >> 4) & (0x0f0f0f0f));
x = (x & (0x00ff00ff)) + ((x >> 8) & (0x00ff00ff));
x = (x & (0x0000ffff)) + ((x >> 16) & (0x0000ffff));
return x;
}
public static void main(String[] args) {
int a = 3;
int orig = a;
int count = 0;
while(a>0)
{
a = a >> 1 << 1;
if(orig-a==1)
count++;
orig = a >> 1;
a = orig;
}
System.out.println("Number of 1s are: "+count);
}
countBits(x){
    y=0;
    while(x){
        y += x & 1 ;
        x = x >> 1 ;
    }
    return y;
}
That's it?
Below are two simple examples (in C++) among many by which you can do this.
We can simply count set bits (1's) using __builtin_popcount().
int numOfOnes(int x) {
return __builtin_popcount(x);
}
Loop through all bits in an integer, check if a bit is set and if it is then increment the count variable.
int hammingWeight(int x) {
    int count = 0;
    for(int i = 0; i < 32; i++)
        if(x & (1u << i))   // unsigned literal: 1 << 31 overflows a signed int
            count++;
    return count;
}
That will be the shortest answer in my SO life: lookup table.
Apparently, I need to explain a bit: "if you have enough memory to play with" means, we've got all the memory we need (nevermind technical possibility). Now, you don't need to store lookup table for more than a byte or two. While it'll technically be Ω(log(n)) rather than O(1), just reading a number you need is Ω(log(n)), so if that's a problem, then the answer is, impossible—which is even shorter.
Which of two answers they expect from you on an interview, no one knows.
There's yet another trick: while engineers can take a number and talk about Ω(log(n)), where n is the number, computer scientists will say that actually we're to measure running time as a function of a length of an input, so what engineers call Ω(log(n)) is actually Ω(k), where k is the number of bytes. Still, as I said before, just reading a number is Ω(k), so there's no way we can do better than that.
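The byte-sized lookup table mentioned above can be sketched roughly as follows. This is only an illustration: the names `BYTE_POPCOUNT` and `popcount_lookup` and the default width of 4 bytes are assumptions, not something from the question.

```python
# Precompute the number of set bits for every possible byte value (0..255).
BYTE_POPCOUNT = [bin(value).count("1") for value in range(256)]


def popcount_lookup(n: int, width_bytes: int = 4) -> int:
    """Count set bits of n, treated as an unsigned integer of width_bytes bytes."""
    n &= (1 << (8 * width_bytes)) - 1  # keep only the low width_bytes bytes
    total = 0
    for _ in range(width_bytes):
        total += BYTE_POPCOUNT[n & 0xFF]  # one table lookup per byte
        n >>= 8
    return total


print(popcount_lookup(0b1011))      # 3
print(popcount_lookup(0xFFFFFFFF))  # 32
```

For a fixed width the loop runs a constant number of times, which is the sense in which this is "O(1)"; in terms of input length it is still one lookup per byte, as the answer points out.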
Below will work as well.
int nofone(unsigned int x) {
    int a = 0;
    while(x != 0) {
        if(x & 1)   /* test the lowest bit before shifting it out */
            a++;
        x >>= 1;
    }
    return a;
}
The following is a C solution using bit operators:
int numberOfOneBitsInInteger(int input) {
int numOneBits = 0;
int currNum = input;
while (currNum != 0) {
if ((currNum & 1) == 1) {
numOneBits++;
}
currNum = currNum >> 1;
}
return numOneBits;
}
The following is a Java solution using powers of 2:
public static int numOnesInBinary(int n) {
if (n < 0) return -1;
int j = 0;
while ( n > Math.pow(2, j)) j++;
int result = 0;
for (int i=j; i >=0; i--){
if (n >= Math.pow(2, i)) {
n = (int) (n - Math.pow(2,i));
result++;
}
}
return result;
}
The function takes an int and returns the number of Ones in binary representation
public static int findOnes(int number)
{
    // Base case: 1 has exactly one set bit; 0 (and negative input) gives 0.
    if(number < 2)
    {
        return number == 1 ? 1 : 0;
    }
    // The lowest bit contributes number % 2; recurse on the remaining bits.
    return (number % 2) + findOnes(number / 2);
}
I came here having a great belief that I know beautiful solution for this problem. Code in C:
short numberOfOnes(unsigned int d) {
short count = 0;
for (; (d != 0); d &= (d - 1))
++count;
return count;
}
But after doing a little research on this topic (reading the other answers :)) I found 5 more efficient algorithms. Love SO!
There is even a CPU instruction designed specifically for this task: popcnt.
(mentioned in this answer)
Description and benchmarking of many algorithms you can find here.
The best way in javascript to do so is
function getBinaryValue(num){
return num.toString(2);
}
function checkOnces(binaryValue){
return binaryValue.toString().replace(/0/g, "").length;
}
where binaryValue is the binary String eg: 1100
There's only one way I can think of to accomplish this task in O(1)... that is to 'cheat' and use a physical device (with linear or even parallel programming I think the limit is O(log(k)) where k represents the number of bytes of the number).
However, you could very easily imagine a physical device that connects each bit to an output line with a 0/1 voltage. Then you could just electronically read off the total voltage on a 'summation' line in O(1). It would be quite easy to make this basic idea more elegant with some basic circuit elements to produce the output in whatever form you want (e.g. a binary encoded output), but the essential idea is the same and the electronic circuit would produce the correct output state in fixed time.
I imagine there are also possible quantum computing possibilities, but if we're allowed to do that, I would think a simple electronic circuit is the easier solution.
I have actually done this using a bit of sleight of hand: a single lookup table with 16 entries will suffice and all you have to do is break the binary rep into nibbles (4-bit tuples). The complexity is in fact O(1) and I wrote a C++ template which was specialized on the size of the integer you wanted (in # bits)… makes it a constant expression instead of indetermined.
fwiw you can use the fact that (i & -i) will return you the LS one-bit and simply loop, stripping off the lsbit each time, until the integer is zero — but that’s an old parity trick.
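Both ideas from this answer can be sketched in a few lines. The original was a C++ template; the version below is only a Python illustration, and the names `NIBBLE_POPCOUNT`, `popcount_nibbles` and `popcount_lowest_bit` as well as the default width of 32 bits are assumptions.

```python
# Set-bit counts for the 16 possible nibble values 0b0000..0b1111.
NIBBLE_POPCOUNT = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4]


def popcount_nibbles(n: int, bits: int = 32) -> int:
    """Count set bits of n (as an unsigned bits-wide value), one nibble at a time."""
    n &= (1 << bits) - 1
    total = 0
    for _ in range(0, bits, 4):
        total += NIBBLE_POPCOUNT[n & 0xF]  # look up the lowest 4 bits
        n >>= 4
    return total


def popcount_lowest_bit(n: int) -> int:
    """Count set bits of a non-negative n by stripping the lowest set bit."""
    count = 0
    while n:
        n -= n & -n  # n & -n isolates the lowest set bit; subtracting clears it
        count += 1
    return count


print(popcount_nibbles(0b1111011))     # 6
print(popcount_lowest_bit(0b1111011))  # 6
```

With the width fixed, the nibble version always does the same number of lookups (8 for 32 bits), while the second loop runs once per set bit.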
The below method can count the number of 1s in negative numbers as well.
private static int countBits(int number) {
int result = 0;
while(number != 0) {
result += number & 1;
number = number >>> 1;
}
return result;
}
However, a number like -1 is represented in binary as 11111111111111111111111111111111 and so will require a lot of shifting. If you don't want to do so many shifts for small negative numbers, another way could be as follows:
private static int countBits(int number) {
boolean negFlag = false;
if(number < 0) {
negFlag = true;
number = ~number;
}
int result = 0;
while(number != 0) {
result += number & 1;
number = number >> 1;
}
return negFlag? (32-result): result;
}
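For comparison, here is a rough Python sketch of the same two approaches. Python integers have no fixed width, so the 32-bit two's-complement view has to be applied explicitly; the function names are hypothetical, and the input is assumed to fit in a signed 32-bit integer.

```python
def count_bits_32(number: int) -> int:
    """Count set bits in the 32-bit two's-complement representation of number."""
    # Masking with 0xFFFFFFFF maps a negative value to its unsigned 32-bit pattern.
    return bin(number & 0xFFFFFFFF).count("1")


def count_bits_32_complement(number: int) -> int:
    """Same result, but count the bits of ~number when number is negative."""
    if number < 0:
        # ~number is non-negative here; its 1 bits are the 0 bits of number.
        return 32 - bin(~number).count("1")
    return bin(number).count("1")


print(count_bits_32(-1))             # 32
print(count_bits_32_complement(-2))  # 31
```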
In Python (or any other language), convert the number to a binary string, split it on '0' to get rid of the 0's, then join the pieces and take the length.
len(''.join(str(bin(122011)).split('0')))-1
By utilizing string operations of JS one can do as follows;
0b1111011.toString(2).split(/0|(?=.)/).length // returns 6
or
0b1111011.toString(2).replace(/0/g,"").length // returns 6
I had to golf this in ruby and ended up with
l=->x{x.to_s(2).count ?1}
Usage :
l[2**32-1] # returns 32
Obviously not efficient but does the trick :)
Ruby implementation
def find_consecutive_1(n)
num = n.to_s(2)
arr = num.split("")
counter = 0
max = 0
arr.each do |x|
if x.to_i==1
counter +=1
else
max = counter if counter > max
counter = 0
end
max = counter if counter > max
end
max
end
puts find_consecutive_1(439)
Three ways:
/* Method-1 */
int count1s(long num)
{
int tempCount = 0;
while(num)
{
tempCount += (num & 1); //inc, based on right most bit checked
num = num >> 1; //right shift bit by 1
}
return tempCount;
}
/* Method-2 */
int count1s_(int num)
{
int tempCount = 0;
std::string strNum = std::bitset< 16 >( num ).to_string(); // string conversion
cout << "strNum=" << strNum << endl;
for(int i=0; i<strNum.size(); i++)
{
if('1' == strNum[i])
{
tempCount++;
}
}
return tempCount;
}
/* Method-3 (algorithmically - boost string split could be used) */
1) split the binary string over '1'.
2) count = vector (containing splits) size - 1
Usage::
int count = 0;
count = count1s(0b00110011);
cout << "count(0b00110011) = " << count << endl; //4
count = count1s(0b01110110);
cout << "count(0b01110110) = " << count << endl; //5
count = count1s(0b00000000);
cout << "count(0b00000000) = " << count << endl; //0
count = count1s(0b11111111);
cout << "count(0b11111111) = " << count << endl; //8
count = count1s_(0b1100);
cout << "count(0b1100) = " << count << endl; //2
count = count1s_(0b11111111);
cout << "count(0b11111111) = " << count << endl; //8
count = count1s_(0b0);
cout << "count(0b0) = " << count << endl; //0
count = count1s_(0b1);
cout << "count(0b1) = " << count << endl; //1
A Python one-liner
def countOnes(num):
    return bin(num).count('1')
I have to do a program that gives all permutations of n numbers {1,2,3..n} using backtracking. I managed to do it in C, and it works very well, here is the code:
int st[25], n=4;
int valid(int k)
{
int i;
for (i = 1; i <= k - 1; i++)
if (st[k] == st[i])
return 0;
return 1;
}
void bktr(int k)
{
int i;
if (k == n + 1)
{
for (i = 1; i <= n; i++)
printf("%d ", st[i]);
printf("\n");
}
else
for (i = 1; i <= n; i++)
{
st[k] = i;
if (valid(k))
bktr(k + 1);
}
}
int main()
{
bktr(1);
return 0;
}
Now I have to write it in Python. Here is what I did:
st=[]
n=4
def bktr(k):
    if k==n+1:
        for i in range(1,n):
            print (st[i])
    else:
        for i in range(1,n):
            st[k]=i
            if valid(k):
                bktr(k+1)
def valid(k):
    for i in range(1,k-1):
        if st[k]==st[i]:
            return 0
    return 1
bktr(1)
I get this error:
list assignment index out of range
at st[k]==st[i].
Python has a "permutations" function in the itertools module:
import itertools
itertools.permutations([1,2,3])
If you need to write the code yourself (for example if this is homework), here is the issue:
Python lists do not have a predetermined size, so you can't just set e.g. the 10th element to 3. You can only change existing elements or add to the end.
Python lists (and C arrays) also start at 0. This means you have to access the first element with st[0], not st[1].
When you start your program, st has a length of 0; this means you can not assign to st[1], as it is not the end.
If this is confusing, I recommend you use the st.append(element) method instead, which always adds to the end.
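Putting these points together, a backtracking version built on append might look roughly like the sketch below. It is not a line-by-line port of the C program: the function name `permutations_backtracking` and the use of pop to undo a choice are my own choices for illustration.

```python
def permutations_backtracking(n):
    """Return all permutations of 1..n as lists, built by backtracking."""
    results = []
    st = []  # current partial permutation; grows with append, shrinks with pop

    def bktr():
        if len(st) == n:
            results.append(st.copy())  # store a copy, st keeps changing
            return
        for value in range(1, n + 1):  # range excludes its end, hence n + 1
            if value not in st:        # same role as valid() in the question
                st.append(value)
                bktr()
                st.pop()               # undo the choice before trying the next one

    bktr()
    return results


for perm in permutations_backtracking(3):
    print(*perm)
```

Note that range(1, n) in the question stops at n - 1, which is a separate problem from the index error.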
If the code is done and works, I recommend you head over to code review stack exchange because there are a lot more things that could be improved.