I need to generate the following string in C:
$(python -c "print('\x90' * a + 'blablabla' + '\x90' * b + 'h\xef\xff\xbf')")
where a and b are arbitrary integers and blablabla represents an arbitrary string. I am attempting to do this by first creating
char str1[size];
and then doing:
for (int i = 0; i < a; i+=1) {
strcat(str1, "\x90");
}
Next I use strcat again:
strcat(str1, "blablabla");
and I run the loop again, this time b times, to concatenate the next b x90 characters. Finally, I use strcat once more as follows:
strcat(str1, "h\xef\xff\xbf");
However, the two strings do not match. Is there a more efficient way of replicating the behaviour of Python's * in C? Or am I missing something?
char str1[size];
Even assuming you calculated size correctly, I recommend using
char * str = malloc(size);
Either way, once you have the memory for the string, you have to initialize it by first doing
str[0]=0;
if you intend to use strcat.
for (int i = 0; i < a; i+=1) {
strcat(str1, "\x90");
}
This is useful if "\x90" actually is a string (i.e. something composed of more than one character), that string is short (it is hard to give a firm limit, but about 16 bytes would be the upper end), and a is rather small[1]. Here, as John Coleman already suggested, memset is a better way to do it.
memset(str, '\x90', a);
Because you know the location where "blablabla" is to be stored, just store it there using strcpy instead of strcat:
// strcat(str1, "blablabla");
strcpy(str + a, "blablabla");
However, you need the address of the character after "blablabla" (one way or the other). So I would not even do it that way but instead like this:
const char * add_str = "blablabla";
size_t sl = strlen(add_str);
memcpy(str + a, add_str, sl);
Then, instead of your second loop, use another memset:
memset(str + a + sl, '\x90', b);
Last but not least, instead of strcat again strcpy is better (here, memcpy doesn't help):
strcpy(str + a + sl + b, "h\xef\xff\xbf");
But you need its size for the size calculation at the beginning, so it is better to handle it like the blablabla string anyway (and remember the trailing '\0').
Finally, I would put all this code into a function like this:
#include <stdlib.h>
#include <string.h>

char * gen_string(int a, int b) {
    const char * add_str_1 = "blablabla";
    size_t sl_1 = strlen(add_str_1);
    const char * add_str_2 = "h\xef\xff\xbf";
    size_t sl_2 = strlen(add_str_2);
    size_t size = a + sl_1 + b + sl_2 + 1;
    // The + 1 is important for the '\0' at the end
    char * str = malloc(size);
    if (!str) {
        return NULL;
    }
    memset(str, '\x90', a);
    memcpy(str + a, add_str_1, sl_1);
    memset(str + a + sl_1, '\x90', b);
    memcpy(str + a + sl_1 + b, add_str_2, sl_2);
    str[a + sl_1 + b + sl_2] = 0; // 0 is the same as '\0'
    return str;
}
Remember to free() the return value of gen_string at some point.
If the list of memset and memcpy calls gets longer, I'd suggest doing it like this:
char * ptr = str;
memset(ptr, '\x90', a ); ptr += a;
memcpy(ptr, add_str_1, sl_1); ptr += sl_1;
memset(ptr, '\x90', b ); ptr += b;
memcpy(ptr, add_str_2, sl_2); ptr += sl_2;
*ptr = 0; // 0 is the same as '\0'
maybe even creating a macro for memset and memcpy:
#define MEMSET(c, l) do { memset(ptr, (c), (l)); ptr += (l); } while (0)
#define MEMCPY(s, l) do { memcpy(ptr, (s), (l)); ptr += (l); } while (0)
char * ptr = str;
MEMSET('\x90', a );
MEMCPY(add_str_1, sl_1);
MEMSET('\x90', b );
MEMCPY(add_str_2, sl_2);
*ptr = 0; // 0 is the same as '\0'
#undef MEMSET
#undef MEMCPY
For the reasons behind these recommendations, I suggest you read the blog post Back to Basics (by one of the founders of Stack Overflow), which happens to be not only John Coleman's favorite blog post but mine as well. There you will learn that using strcat in a loop the way you first tried has quadratic run time, and hence why you should not use it that way.
[1] If a is big and/or the string that needs to be repeated is long, a better solution would be something like this:
const char * str_a = "\x90";
size_t sl_a = strlen(str_a);
char * ptr = str;
for (size_t i = 0; i < a; ++i) {
strcpy(ptr, str_a);
ptr += sl_a;
}
// then go on at address str + a * sl_a
For individual 1-byte chars you can use memset to partially replicate the behavior of Python's *:
#include<stdio.h>
#include<string.h>
int main(void){
char buffer[100];
memset(buffer,'#',10);
buffer[10] = '\0';
printf("%s\n",buffer);
memset(buffer, '*', 5);
buffer[5] = '\0';
printf("%s\n",buffer);
return 0;
}
Output:
##########
*****
For a more robust solution, see this.
I need to know how I would go about recreating a version of the int() function in Python so that I can fully understand it and create my own version with multiple bases that go past base 36. I can convert from a decimal to my own base (base 54, altered) just fine, but I need to figure out how to go from a string in my base's format to an integer (base 10).
Basically, I want to know how to go from my base, which I call base 54, to base 10. I don't need specifics, because if I have an example, I can work it out on my own. Unfortunately, I can't find anything on the int() function, though I know it has to be somewhere, since Python is open-source.
This is the closest I can find to it, but it doesn't provide source code for the function itself. Python int() test.
If you can help, thanks. If not, well, thanks for reading this, I guess?
You're not going to like this answer, but int(num, base) is defined in C (it's a builtin).
I went searching around and found it:
https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Objects/longobject.c
/* Parses an int from a bytestring. Leading and trailing whitespace will be
* ignored.
*
* If successful, a PyLong object will be returned and 'pend' will be pointing
* to the first unused byte unless it's NULL.
*
* If unsuccessful, NULL will be returned.
*/
PyObject *
PyLong_FromString(const char *str, char **pend, int base)
{
int sign = 1, error_if_nonzero = 0;
const char *start, *orig_str = str;
PyLongObject *z = NULL;
PyObject *strobj;
Py_ssize_t slen;
if ((base != 0 && base < 2) || base > 36) {
PyErr_SetString(PyExc_ValueError,
"int() arg 2 must be >= 2 and <= 36");
return NULL;
}
while (*str != '\0' && Py_ISSPACE(Py_CHARMASK(*str))) {
str++;
}
if (*str == '+') {
++str;
}
else if (*str == '-') {
++str;
sign = -1;
}
if (base == 0) {
if (str[0] != '0') {
base = 10;
}
else if (str[1] == 'x' || str[1] == 'X') {
base = 16;
}
else if (str[1] == 'o' || str[1] == 'O') {
base = 8;
}
else if (str[1] == 'b' || str[1] == 'B') {
base = 2;
}
else {
/* "old" (C-style) octal literal, now invalid.
it might still be zero though */
error_if_nonzero = 1;
base = 10;
}
}
if (str[0] == '0' &&
((base == 16 && (str[1] == 'x' || str[1] == 'X')) ||
(base == 8 && (str[1] == 'o' || str[1] == 'O')) ||
(base == 2 && (str[1] == 'b' || str[1] == 'B')))) {
str += 2;
/* One underscore allowed here. */
if (*str == '_') {
++str;
}
}
if (str[0] == '_') {
/* May not start with underscores. */
goto onError;
}
start = str;
if ((base & (base - 1)) == 0) {
int res = long_from_binary_base(&str, base, &z);
if (res < 0) {
/* Syntax error. */
goto onError;
}
}
else {
/***
Binary bases can be converted in time linear in the number of digits, because
Python's representation base is binary. Other bases (including decimal!) use
the simple quadratic-time algorithm below, complicated by some speed tricks.
First some math: the largest integer that can be expressed in N base-B digits
is B**N-1. Consequently, if we have an N-digit input in base B, the worst-
case number of Python digits needed to hold it is the smallest integer n s.t.
BASE**n-1 >= B**N-1 [or, adding 1 to both sides]
BASE**n >= B**N [taking logs to base BASE]
n >= log(B**N)/log(BASE) = N * log(B)/log(BASE)
The static array log_base_BASE[base] == log(base)/log(BASE) so we can compute
this quickly. A Python int with that much space is reserved near the start,
and the result is computed into it.
The input string is actually treated as being in base base**i (i.e., i digits
are processed at a time), where two more static arrays hold:
convwidth_base[base] = the largest integer i such that base**i <= BASE
convmultmax_base[base] = base ** convwidth_base[base]
The first of these is the largest i such that i consecutive input digits
must fit in a single Python digit. The second is effectively the input
base we're really using.
Viewing the input as a sequence <c0, c1, ..., c_n-1> of digits in base
convmultmax_base[base], the result is "simply"
(((c0*B + c1)*B + c2)*B + c3)*B + ... ))) + c_n-1
where B = convmultmax_base[base].
Error analysis: as above, the number of Python digits `n` needed is worst-
case
n >= N * log(B)/log(BASE)
where `N` is the number of input digits in base `B`. This is computed via
size_z = (Py_ssize_t)((scan - str) * log_base_BASE[base]) + 1;
below. Two numeric concerns are how much space this can waste, and whether
the computed result can be too small. To be concrete, assume BASE = 2**15,
which is the default (and it's unlikely anyone changes that).
Waste isn't a problem: provided the first input digit isn't 0, the difference
between the worst-case input with N digits and the smallest input with N
digits is about a factor of B, but B is small compared to BASE so at most
one allocated Python digit can remain unused on that count. If
N*log(B)/log(BASE) is mathematically an exact integer, then truncating that
and adding 1 returns a result 1 larger than necessary. However, that can't
happen: whenever B is a power of 2, long_from_binary_base() is called
instead, and it's impossible for B**i to be an integer power of 2**15 when
B is not a power of 2 (i.e., it's impossible for N*log(B)/log(BASE) to be
an exact integer when B is not a power of 2, since B**i has a prime factor
other than 2 in that case, but (2**15)**j's only prime factor is 2).
The computed result can be too small if the true value of N*log(B)/log(BASE)
is a little bit larger than an exact integer, but due to roundoff errors (in
computing log(B), log(BASE), their quotient, and/or multiplying that by N)
yields a numeric result a little less than that integer. Unfortunately, "how
close can a transcendental function get to an integer over some range?"
questions are generally theoretically intractable. Computer analysis via
continued fractions is practical: expand log(B)/log(BASE) via continued
fractions, giving a sequence i/j of "the best" rational approximations. Then
j*log(B)/log(BASE) is approximately equal to (the integer) i. This shows that
we can get very close to being in trouble, but very rarely. For example,
76573 is a denominator in one of the continued-fraction approximations to
log(10)/log(2**15), and indeed:
>>> log(10)/log(2**15)*76573
16958.000000654003
is very close to an integer. If we were working with IEEE single-precision,
rounding errors could kill us. Finding worst cases in IEEE double-precision
requires better-than-double-precision log() functions, and Tim didn't bother.
Instead the code checks to see whether the allocated space is enough as each
new Python digit is added, and copies the whole thing to a larger int if not.
This should happen extremely rarely, and in fact I don't have a test case
that triggers it(!). Instead the code was tested by artificially allocating
just 1 digit at the start, so that the copying code was exercised for every
digit beyond the first.
***/
twodigits c; /* current input character */
Py_ssize_t size_z;
Py_ssize_t digits = 0;
int i;
int convwidth;
twodigits convmultmax, convmult;
digit *pz, *pzstop;
const char *scan, *lastdigit;
char prev = 0;
static double log_base_BASE[37] = {0.0e0,};
static int convwidth_base[37] = {0,};
static twodigits convmultmax_base[37] = {0,};
if (log_base_BASE[base] == 0.0) {
twodigits convmax = base;
int i = 1;
log_base_BASE[base] = (log((double)base) /
log((double)PyLong_BASE));
for (;;) {
twodigits next = convmax * base;
if (next > PyLong_BASE) {
break;
}
convmax = next;
++i;
}
convmultmax_base[base] = convmax;
assert(i > 0);
convwidth_base[base] = i;
}
/* Find length of the string of numeric characters. */
scan = str;
lastdigit = str;
while (_PyLong_DigitValue[Py_CHARMASK(*scan)] < base || *scan == '_') {
if (*scan == '_') {
if (prev == '_') {
/* Only one underscore allowed. */
str = lastdigit + 1;
goto onError;
}
}
else {
++digits;
lastdigit = scan;
}
prev = *scan;
++scan;
}
if (prev == '_') {
/* Trailing underscore not allowed. */
/* Set error pointer to first underscore. */
str = lastdigit + 1;
goto onError;
}
/* Create an int object that can contain the largest possible
* integer with this base and length. Note that there's no
* need to initialize z->ob_digit -- no slot is read up before
* being stored into.
*/
double fsize_z = (double)digits * log_base_BASE[base] + 1.0;
if (fsize_z > (double)MAX_LONG_DIGITS) {
/* The same exception as in _PyLong_New(). */
PyErr_SetString(PyExc_OverflowError,
"too many digits in integer");
return NULL;
}
size_z = (Py_ssize_t)fsize_z;
/* Uncomment next line to test exceedingly rare copy code */
/* size_z = 1; */
assert(size_z > 0);
z = _PyLong_New(size_z);
if (z == NULL) {
return NULL;
}
Py_SIZE(z) = 0;
/* `convwidth` consecutive input digits are treated as a single
* digit in base `convmultmax`.
*/
convwidth = convwidth_base[base];
convmultmax = convmultmax_base[base];
/* Work ;-) */
while (str < scan) {
if (*str == '_') {
str++;
continue;
}
/* grab up to convwidth digits from the input string */
c = (digit)_PyLong_DigitValue[Py_CHARMASK(*str++)];
for (i = 1; i < convwidth && str != scan; ++str) {
if (*str == '_') {
continue;
}
i++;
c = (twodigits)(c * base +
(int)_PyLong_DigitValue[Py_CHARMASK(*str)]);
assert(c < PyLong_BASE);
}
convmult = convmultmax;
/* Calculate the shift only if we couldn't get
* convwidth digits.
*/
if (i != convwidth) {
convmult = base;
for ( ; i > 1; --i) {
convmult *= base;
}
}
/* Multiply z by convmult, and add c. */
pz = z->ob_digit;
pzstop = pz + Py_SIZE(z);
for (; pz < pzstop; ++pz) {
c += (twodigits)*pz * convmult;
*pz = (digit)(c & PyLong_MASK);
c >>= PyLong_SHIFT;
}
/* carry off the current end? */
if (c) {
assert(c < PyLong_BASE);
if (Py_SIZE(z) < size_z) {
*pz = (digit)c;
++Py_SIZE(z);
}
else {
PyLongObject *tmp;
/* Extremely rare. Get more space. */
assert(Py_SIZE(z) == size_z);
tmp = _PyLong_New(size_z + 1);
if (tmp == NULL) {
Py_DECREF(z);
return NULL;
}
memcpy(tmp->ob_digit,
z->ob_digit,
sizeof(digit) * size_z);
Py_DECREF(z);
z = tmp;
z->ob_digit[size_z] = (digit)c;
++size_z;
}
}
}
}
if (z == NULL) {
return NULL;
}
if (error_if_nonzero) {
/* reset the base to 0, else the exception message
doesn't make too much sense */
base = 0;
if (Py_SIZE(z) != 0) {
goto onError;
}
/* there might still be other problems, therefore base
remains zero here for the same reason */
}
if (str == start) {
goto onError;
}
if (sign < 0) {
Py_SIZE(z) = -(Py_SIZE(z));
}
while (*str && Py_ISSPACE(Py_CHARMASK(*str))) {
str++;
}
if (*str != '\0') {
goto onError;
}
long_normalize(z);
z = maybe_small_long(z);
if (z == NULL) {
return NULL;
}
if (pend != NULL) {
*pend = (char *)str;
}
return (PyObject *) z;
onError:
if (pend != NULL) {
*pend = (char *)str;
}
Py_XDECREF(z);
slen = strlen(orig_str) < 200 ? strlen(orig_str) : 200;
strobj = PyUnicode_FromStringAndSize(orig_str, slen);
if (strobj == NULL) {
return NULL;
}
PyErr_Format(PyExc_ValueError,
"invalid literal for int() with base %d: %.200R",
base, strobj);
Py_DECREF(strobj);
return NULL;
}
If you want to define it in C, go ahead and try using this; if not, you'll have to write it yourself.
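For the "write it yourself" route, the core of the C code above is the left-to-right accumulation result = result * base + digit. Here is a minimal Python sketch of that idea for an arbitrary alphabet. The names parse_int and BASE54 and the particular 54-symbol alphabet are my own assumptions (the question does not say which symbols its base 54 uses), and the sign handling assumes '+' and '-' are not digits of the alphabet.

```python
import string

# Hypothetical 54-symbol alphabet: 0-9, a-z, then A-R. Swap in your own.
BASE54 = string.digits + string.ascii_lowercase + string.ascii_uppercase[:18]


def parse_int(text, alphabet=BASE54):
    """Convert a string written in a custom base to a Python int."""
    base = len(alphabet)
    if base < 2 or len(set(alphabet)) != base:
        raise ValueError("alphabet needs at least 2 distinct symbols")
    # Map each symbol to its numeric value (its position in the alphabet).
    digit_value = {symbol: index for index, symbol in enumerate(alphabet)}

    text = text.strip()
    sign = 1
    if text[:1] in ("+", "-"):
        sign = -1 if text[0] == "-" else 1
        text = text[1:]
    if not text:
        raise ValueError("no digits to parse")

    result = 0
    for symbol in text:
        if symbol not in digit_value:
            raise ValueError("invalid digit %r for base %d" % (symbol, base))
        # Shift what we have one place to the left, then add the new digit.
        result = result * base + digit_value[symbol]
    return sign * result
```

With a hexadecimal alphabet this agrees with the builtin: parse_int("ff", "0123456789abcdef") and int("ff", 16) both give 255.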
First of all, I know that nested functions are not supported by the C standard.
However, it's often very useful, in other languages, to define an auxiliary recursive function that will make use of data provided by the outer function.
Here is an example, computing the number of solutions of the N-queens problem, in Python. It's easy to write the same in Lisp, Ada or Fortran for instance, which all allow some kind of nested function.
def queens(n):
    a = list(range(n))
    u = [True]*(2*n - 1)
    v = [True]*(2*n - 1)
    m = 0
    def sub(i):
        nonlocal m
        if i == n:
            m += 1
        else:
            for j in range(i, n):
                p = i + a[j]
                q = i + n - 1 - a[j]
                if u[p] and v[q]:
                    u[p] = v[q] = False
                    a[i], a[j] = a[j], a[i]
                    sub(i + 1)
                    u[p] = v[q] = True
                    a[i], a[j] = a[j], a[i]
    sub(0)
    return m
Now my question: is there a way to do something like this in C? I would think of two solutions: using globals or passing data as parameters, but they both look rather unsatisfying.
There is also a way to write this as an iterative program, but it's clumsy: actually, I first wrote the iterative solution in Fortran 77 for Rosetta Code and then wanted to sort out this mess. Fortran 77 does not have recursive functions.
For those who wonder, the function manages the NxN board as a permutation of [0, 1 ... N-1], so that queens are alone on lines and columns. The function is looking for all permutations that are also solutions of the problem, starting to check the first column (actually nothing to check), then the second, and recursively calling itself only when the first i columns are in a valid configuration.
Of course. You need to simulate the special environment used by your nested function as static variables at the module (file) level. Declare them above your nested function.
To keep things tidy, put this whole thing into a separate module.
Editor's Note: This answer was moved from the content of a question edit, it is written by the Original Poster.
Thanks all for the advice. Here is a solution using a structure passed as an argument. This is roughly equivalent to what gfortran and gnat do internally to deal with nested functions. The argument i could also be passed in the structure, by the way.
The inner function is declared static to help compiler optimizations. If it's not recursive, the code can then be inlined into the outer function (tested with GCC on a simple example), since the compiler knows the function will not be called from the "outside".
#include <stdio.h>
#include <stdlib.h>
struct queens_data {
int n, m, *a, *u, *v;
};
static void queens_sub(int i, struct queens_data *e) {
if(i == e->n) {
e->m++;
} else {
int p, q, j;
for(j = i; j < e->n; j++) {
p = i + e->a[j];
q = i + e->n - 1 - e->a[j];
if(e->u[p] && e->v[q]) {
int k;
e->u[p] = e->v[q] = 0;
k = e->a[i];
e->a[i] = e->a[j];
e->a[j] = k;
queens_sub(i + 1, e);
e->u[p] = e->v[q] = 1;
k = e->a[i];
e->a[i] = e->a[j];
e->a[j] = k;
}
}
}
}
int queens(int n) {
int i;
struct queens_data s;
s.n = n;
s.m = 0;
s.a = malloc((5*n - 2)*sizeof(int));
s.u = s.a + n;
s.v = s.u + 2*n - 1;
for(i = 0; i < n; i++) {
s.a[i] = i;
}
for(i = 0; i < 2*n - 1; i++) {
s.u[i] = s.v[i] = 1;
}
queens_sub(0, &s);
free(s.a);
return s.m;
}
int main() {
int n;
for(n = 1; n <= 16; n++) {
printf("%d %d\n", n, queens(n));
}
return 0;
}
I am looking for an algorithm that takes a string and splits it into a certain number of parts. These parts shall contain complete words (so whitespaces are used to split the string) and the parts shall be of nearly the same length, or contain the longest possible parts.
I know it is not that hard to code a function that can do what I want but I wonder whether there is a well-proven and fast algorithm for that purpose?
edit:
To clarify my question I'll describe the problem I am trying to solve.
I generate images with a fixed width. Into these images I write user names using GD and Freetype in PHP. Since I have a fixed width I want to split the names into 2 or 3 lines if they don't fit into one.
In order to fill as much space as possible, I want to split the names so that each line contains as many words as possible. By this I mean that each line should hold as many words as necessary to keep its length close to the average line length of the whole text block. So if there are one long word and two short words, the two short words should share a line if that makes all lines about equally long.
(Then I compute the text block width using 1, 2 or 3 lines, and if it fits into my image I render it. Only if there are 3 lines and it still won't fit do I decrease the font size until everything fits.)
Example:
This is a long text
should be displayed something like this:
This is a
long text
or:
This is
a long
text
but not:
This
is a long
text
and also not:
This is a long
text
I hope this explains more clearly what I am looking for.
If you're talking about line-breaking, take a look at Dynamic Line Breaking, which gives a Dynamic Programming solution to divide words into lines.
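To give an idea of what a dynamic programming solution can look like, here is a minimal Python sketch. It is not the code from the linked article; it assumes the goal is to minimize the length of the longest line, and it measures lines in characters (for a proportional font you would substitute real pixel widths). The name split_balanced is made up for this example.

```python
from functools import lru_cache


def split_balanced(text, parts):
    """Split text into `parts` lines of whole words, minimizing the longest line."""
    words = text.split()
    n = len(words)
    if parts < 1 or n < parts:
        raise ValueError("need at least one word per line")

    # prefix[i] is the total number of characters in words[:i].
    prefix = [0]
    for word in words:
        prefix.append(prefix[-1] + len(word))

    def width(i, j):
        # Length of words[i:j] joined by single spaces.
        return prefix[j] - prefix[i] + (j - i - 1)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Best (longest_line, break_positions) for words[i:] placed on k lines.
        if k == 1:
            return width(i, n), (n,)
        result = None
        # The first line is words[i:j]; leave at least k - 1 words for the rest.
        for j in range(i + 1, n - k + 2):
            rest_cost, rest_breaks = best(j, k - 1)
            cost = max(width(i, j), rest_cost)
            if result is None or cost < result[0]:
                result = (cost, (j,) + rest_breaks)
        return result

    lines, start = [], 0
    for stop in best(0, parts)[1]:
        lines.append(" ".join(words[start:stop]))
        start = stop
    return lines
```

For the example from the question, split_balanced("This is a long text", 2) gives ['This is a', 'long text'] and asking for 3 parts gives ['This is', 'a long', 'text'].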
I don't know about proven, but it seems like the simplest and most efficient solution would be to divide the length of the string by N then find the closest white space to the split locations (you'll want to search both forward and back).
The code below seems to work, though there are plenty of error conditions that it doesn't handle. It seems like it would run in O(n), where n is the number of strings you want.
using System;

class Program
{
static void Main(string[] args)
{
var s = "This is a string for testing purposes. It will be split into 3 parts";
var p = s.Length / 3;
var w1 = 0;
var w2 = FindClosestWordIndex(s, p);
var w3 = FindClosestWordIndex(s, p * 2);
Console.WriteLine(string.Format("1: {0}", s.Substring(w1, w2 - w1).Trim()));
Console.WriteLine(string.Format("2: {0}", s.Substring(w2, w3 - w2).Trim()));
Console.WriteLine(string.Format("3: {0}", s.Substring(w3).Trim()));
Console.ReadKey();
}
public static int FindClosestWordIndex(string s, int startIndex)
{
int wordAfterIndex = -1;
int wordBeforeIndex = -1;
for (int i = startIndex; i < s.Length; i++)
{
if (s[i] == ' ')
{
wordAfterIndex = i;
break;
}
}
for (int i = startIndex; i >= 0; i--)
{
if (s[i] == ' ')
{
wordBeforeIndex = i;
break;
}
}
if (wordAfterIndex - startIndex <= startIndex - wordBeforeIndex)
return wordAfterIndex;
else
return wordBeforeIndex;
}
}
The output for this is:
1: This is a string for
2: testing purposes. It will
3: be split into 3 parts
Again, following Brian's answer, I made a PHP version of his code:
// Input text
$txt = "This is a really long string that should be broken up onto lines of about the same number of characters.";
// Number of lines
$numLines = 3;
/* Do it, result comes as an array: */
$aResult = splitLinesByClosestWhitespace($txt, $numLines);
/* Output result: */
if ($aResult)
{
for ($x=1; $x<=$numLines; $x++)
echo "Line ".$x.": ".$aResult[$x]."<br>";
} else {
echo "Not enough spaces to generate the lines!";
}
/**********************/
/**
* Splits a string into multiple lines of the closest possible same length,
* using the closest whitespaces
* @param string $txt String to split
* @param integer $numLines Number of lines
* @return array|false
*/
function splitLinesByClosestWhitespace($txt, $numLines)
{
$p = intval( strlen($txt) / $numLines );
$aTxtIndx = array();
$aTxt = array();
// Check we have enough whitespaces to generate the number of lines
$wsCount = count( explode(" ", $txt) ) - 1;
if ($wsCount<$numLines)
return false;
// Get the indexes
for ($x=1; $x<=$numLines; $x++)
{
$aTxtIndx[$x] = FindClosestWordIndex($txt, $p * ($x-1) );
}
// Do the split
for ($x=1; $x<=$numLines; $x++)
{
if ($x != $numLines)
// substr() takes a length, not an end index, so subtract the start index
$aTxt[$x] = trim( substr( $txt, $aTxtIndx[$x], $aTxtIndx[$x+1] - $aTxtIndx[$x] ) );
else
$aTxt[$x] = trim( substr( $txt, $aTxtIndx[$x] ) );
}
return $aTxt;
}
/**
* Finds the closest word to a string index
* @param string $s String to search
* @param integer $startIndex Index at which to find the closest word
* @return integer
*/
function FindClosestWordIndex($s, $startIndex)
{
$wordAfterIndex = 0;
$wordBeforeIndex = 0;
for ($i = $startIndex; $i < strlen($s); $i++)
{
if ($s[$i] == ' ')
{
$wordAfterIndex = $i;
break;
}
}
for ($i = $startIndex; $i >= 0; $i--)
{
if ($s[$i] == ' ')
{
$wordBeforeIndex = $i;
break;
}
}
if ($wordAfterIndex - $startIndex <= $startIndex - $wordBeforeIndex)
return $wordAfterIndex;
else
return $wordBeforeIndex;
}
Partitioning into equal sizes is NP-Complete
Working Python code:
Wrap.py - Break paragraphs into lines, attempting to avoid short lines.
SMAWK.py - Same thing in O(n)
Code by David Eppstein.
The way word-wrap is usually implemented is to place as many words as possible onto one line, and break to the next when there is no more room. This assumes, of course, that you have a maximum-width in mind.
Regardless of what algorithm you use, keep in mind that unless you are working with a fixed-width font, you want to work with the physical width of the word, not the number of letters.
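As a rough illustration of that greedy approach, here is a small Python sketch. The function name wrap_greedy and the measure parameter are my own; measure stands in for whatever returns the rendered width of a piece of text in your environment, and it defaults to len, which is only appropriate for a fixed-width font.

```python
def wrap_greedy(words, max_width, measure=len):
    """Put as many words as fit on each line, then break to the next line."""
    lines = []
    current = ""
    for word in words:
        candidate = word if not current else current + " " + word
        if current and measure(candidate) > max_width:
            # The word does not fit any more: close this line, start a new one.
            lines.append(current)
            current = word
        else:
            # Either the word fits, or it is the first word on the line
            # (a single over-long word still gets a line of its own).
            current = candidate
    if current:
        lines.append(current)
    return lines
```

Note that this fills lines up to a maximum width rather than balancing them, so the last line can end up much shorter than the others.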
Following Brian's answer, I made a JavaScript version of his code: http://jsfiddle.net/gmoz22/CPGY2/.
// Input text
var txt = "This is a really long string that should be broken up onto lines of about the same number of characters.";
// Number of lines
var numLines = 3;
/* Do it, result comes as an array: */
var aResult = splitLinesByClosestWhitespace(txt, numLines);
/* Output result: */
if (aResult)
{
for (var x = 1; x<=numLines; x++)
document.write( "Line "+x+": " + aResult[x] + "<br>" );
} else {
document.write("Not enough spaces to generate the lines!");
}
/**********************/
// Original algorithm by http://stackoverflow.com/questions/2381525/algorithm-split-a-string-into-n-parts-using-whitespaces-so-all-parts-have-nearl/2381772#2381772, rewritten for JavaScript by Steve Oziel
/**
* Trims a string for older browsers
* Defined only if trim() is not already available on the prototype object,
* since overriding a native method is a huge performance hit (this check is generally recommended when extending native objects)
*/
if (!String.prototype.trim)
{
String.prototype.trim = function(){return this.replace(/^\s+|\s+$/g, '');};
}
/**
* Splits a string into multiple lines of the closest possible same length,
* using the closest whitespaces
* @param {string} txt String to split
* @param {integer} numLines Number of lines
* @returns {Array}
*/
function splitLinesByClosestWhitespace(txt, numLines)
{
var p = parseInt(txt.length / numLines);
var aTxtIndx = [];
var aTxt = [];
// Check we have enough whitespaces to generate the number of lines
var wsCount = txt.split(" ").length - 1;
if (wsCount<numLines)
return false;
// Get the indexes
for (var x=1; x<=numLines; x++)
{
aTxtIndx[x] = FindClosestWordIndex(txt, p * (x-1) );
}
// Do the split
for (var x=1; x<=numLines; x++)
{
if (x != numLines)
aTxt[x] = txt.slice(aTxtIndx[x], aTxtIndx[x+1]).trim();
else
aTxt[x] = txt.slice(aTxtIndx[x]).trim();
}
return aTxt;
}
/**
* Finds the closest word to a string index
* @param {string} s String to search
* @param {integer} startIndex Index at which to find the closest word
* @returns {integer}
*/
function FindClosestWordIndex(s, startIndex)
{
var wordAfterIndex = 0;
var wordBeforeIndex = 0;
for (var i = startIndex; i < s.length; i++)
{
if (s[i] == ' ')
{
wordAfterIndex = i;
break;
}
}
for (var i = startIndex; i >= 0; i--)
{
if (s[i] == ' ')
{
wordBeforeIndex = i;
break;
}
}
if (wordAfterIndex - startIndex <= startIndex - wordBeforeIndex)
return wordAfterIndex;
else
return wordBeforeIndex;
}
It works fine as long as the number of desired lines is not too close to the number of whitespaces.
In the example I gave, there are 19 whitespaces and it starts to misbehave when you ask it to break the text into 17, 18 or 19 lines.
Edits welcome!
If I have a list:
a = [1,2,3,4]
and then add 5 elements using extend
a.extend(range(5,10))
I get
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
How does Python do this? Does it create a new list and copy the elements across, or does it make 'a' bigger? I'm just concerned that using extend will gobble up memory. I'm also asking because there is a comment in some code I'm revising saying that extending by 10000 x 100 is quicker than doing it in one block of 1000000.
Python's documentation on it says:
Extend the list by appending all the items in the given list; equivalent to a[len(a):] = L.
As to "how" it does it behind the scene, you really needn't concern yourself about it.
L.extend(M) is amortized O(n) where n=len(M), so excessive copying is not usually a problem. It can become a problem when there is not enough space to extend into, so a copy is performed. This matters when the list is large and you have limits on how much time is acceptable for an individual extend call.
That is the point when you should look for a more efficient data structure for your problem. I find it is rarely a problem in practice.
Here is the relevant code from CPython; you can see that extra space is allocated when the list is extended to avoid excessive copying:
static PyObject *
listextend(PyListObject *self, PyObject *b)
{
PyObject *it; /* iter(v) */
Py_ssize_t m; /* size of self */
Py_ssize_t n; /* guess for size of b */
Py_ssize_t mn; /* m + n */
Py_ssize_t i;
PyObject *(*iternext)(PyObject *);
/* Special cases:
1) lists and tuples which can use PySequence_Fast ops
2) extending self to self requires making a copy first
*/
if (PyList_CheckExact(b) || PyTuple_CheckExact(b) || (PyObject *)self == b) {
PyObject **src, **dest;
b = PySequence_Fast(b, "argument must be iterable");
if (!b)
return NULL;
n = PySequence_Fast_GET_SIZE(b);
if (n == 0) {
/* short circuit when b is empty */
Py_DECREF(b);
Py_RETURN_NONE;
}
m = Py_SIZE(self);
if (list_resize(self, m + n) == -1) {
Py_DECREF(b);
return NULL;
}
/* note that we may still have self == b here for the
* situation a.extend(a), but the following code works
* in that case too. Just make sure to resize self
* before calling PySequence_Fast_ITEMS.
*/
/* populate the end of self with b's items */
src = PySequence_Fast_ITEMS(b);
dest = self->ob_item + m;
for (i = 0; i < n; i++) {
PyObject *o = src[i];
Py_INCREF(o);
dest[i] = o;
}
Py_DECREF(b);
Py_RETURN_NONE;
}
it = PyObject_GetIter(b);
if (it == NULL)
return NULL;
iternext = *it->ob_type->tp_iternext;
/* Guess a result list size. */
n = _PyObject_LengthHint(b, 8);
if (n == -1) {
Py_DECREF(it);
return NULL;
}
m = Py_SIZE(self);
mn = m + n;
if (mn >= m) {
/* Make room. */
if (list_resize(self, mn) == -1)
goto error;
/* Make the list sane again. */
Py_SIZE(self) = m;
}
/* Else m + n overflowed; on the chance that n lied, and there really
* is enough room, ignore it. If n was telling the truth, we'll
* eventually run out of memory during the loop.
*/
/* Run iterator to exhaustion. */
for (;;) {
PyObject *item = iternext(it);
if (item == NULL) {
if (PyErr_Occurred()) {
if (PyErr_ExceptionMatches(PyExc_StopIteration))
PyErr_Clear();
else
goto error;
}
break;
}
if (Py_SIZE(self) < self->allocated) {
/* steals ref */
PyList_SET_ITEM(self, Py_SIZE(self), item);
++Py_SIZE(self);
}
else {
int status = app1(self, item);
Py_DECREF(item); /* append creates a new ref */
if (status < 0)
goto error;
}
}
/* Cut back result list if initial guess was too large. */
if (Py_SIZE(self) < self->allocated)
list_resize(self, Py_SIZE(self)); /* shrinking can't fail */
Py_DECREF(it);
Py_RETURN_NONE;
error:
Py_DECREF(it);
return NULL;
}
PyObject *
_PyList_Extend(PyListObject *self, PyObject *b)
{
return listextend(self, b);
}
It works as if it were defined like this
def extend(lst, iterable):
    for x in iterable:
        lst.append(x)
This mutates the list, it does not create a copy of it.
Depending on the underlying implementation, append and extend may trigger the list to copy its own data structures but this is normal and nothing to worry about. For example array-based implementations typically grow the underlying array exponentially and need to copy the list of elements when they do so.
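If you want to see the exponential growth for yourself, here is a small sketch. It relies on sys.getsizeof reporting the size of the list's currently allocated storage, which is a CPython implementation detail and not something the language guarantees; the helper name count_reallocations is made up for this example.

```python
import sys


def count_reallocations(n):
    """Append n items one by one and count how often the list's storage grows."""
    items = []
    identity = id(items)
    last_size = sys.getsizeof(items)
    reallocations = 0
    for i in range(n):
        items.append(i)
        size = sys.getsizeof(items)
        if size != last_size:
            # The reported size changed, so the underlying array was resized.
            reallocations += 1
            last_size = size
    # Still the very same list object: growing never replaced it.
    assert id(items) == identity
    return reallocations
```

On CPython the count returned for, say, 10000 appends is far smaller than 10000, because each resize reserves some room for future appends.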
How does python do this? does it create a new list and copy the elements across or does it make 'a' bigger?
>>> a = ['apples', 'bananas']
>>> b = a
>>> a is b
True
>>> c = ['apples', 'bananas']
>>> a is c
False
>>> a.extend(b)
>>> a
['apples', 'bananas', 'apples', 'bananas']
>>> b
['apples', 'bananas', 'apples', 'bananas']
>>> a is b
True
>>>
It does not create a new list object, it extends a. This is self-evident from the fact that you don't make an assignment. Python will not magically replace your objects with other objects. :-)
How the memory allocation happens inside the list object is implementation dependent.
I have to write a program that gives all permutations of n numbers {1,2,3..n} using backtracking. I managed to do it in C, and it works very well, here is the code:
#include <stdio.h>

int st[25], n=4;
int valid(int k)
{
int i;
for (i = 1; i <= k - 1; i++)
if (st[k] == st[i])
return 0;
return 1;
}
void bktr(int k)
{
int i;
if (k == n + 1)
{
for (i = 1; i <= n; i++)
printf("%d ", st[i]);
printf("\n");
}
else
for (i = 1; i <= n; i++)
{
st[k] = i;
if (valid(k))
bktr(k + 1);
}
}
int main()
{
bktr(1);
return 0;
}
Now I have to write it in Python. Here is what I did:
st=[]
n=4
def bktr(k):
    if k==n+1:
        for i in range(1,n):
            print (st[i])
    else:
        for i in range(1,n):
            st[k]=i
            if valid(k):
                bktr(k+1)
def valid(k):
    for i in range(1,k-1):
        if st[k]==st[i]:
            return 0
    return 1
bktr(1)
I get this error:
list assignment index out of range
at st[k]==st[i].
Python has a "permutations" function in the itertools module:
import itertools
itertools.permutations([1,2,3])
If you need to write the code yourself (for example if this is homework), here are the issues:
Python lists do not have a predetermined size, so you can't just set e.g. the 10th element to 3. You can only change existing elements or add to the end.
Python lists (and C arrays) also start at 0. This means you have to access the first element with st[0], not st[1].
When you start your program, st has a length of 0; this means you can not assign to st[1], as it is not the end.
If this is confusing, I recommend you use the st.append(element) method instead, which always adds to the end.
If the code is done and works, I recommend you head over to code review stack exchange because there are a lot more things that could be improved.
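Putting the points above together, one possible way to write the backtracking in Python looks like this. It is a sketch, not the only way to do it: the function name permutations_backtracking is my own, it collects the results in a list instead of printing them, and it uses append/pop on a 0-indexed list instead of assigning to st[k]. Note also that range(1, n) stops at n - 1, so the translated loops need range(1, n + 1).

```python
def permutations_backtracking(n):
    """Return all permutations of 1..n as tuples, built by backtracking."""
    results = []
    st = []  # current partial permutation, grows and shrinks at the end

    def bktr():
        if len(st) == n:
            # A full permutation: store a copy, since st keeps changing.
            results.append(tuple(st))
            return
        for value in range(1, n + 1):
            # Same test as valid() in the C code: value must not be used yet.
            if value not in st:
                st.append(value)
                bktr()
                st.pop()  # undo the choice before trying the next value

    bktr()
    return results


if __name__ == "__main__":
    for perm in permutations_backtracking(4):
        print(*perm)
```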