Andrea Sodomaco - ICT scattered considerations

010101010111001101100101001000000111010001101000011001010010000001110011011011110111010101110010011000110110010100101100001000000100110001110101011010110110010100100001

Parliamo di Java, Linux e informatica

The problem is formulated quite simply... very simply... it has already been stated in the title. It is not a puzzle, and every developer should be able to write a function that implements the algorithm in a very short amount of time.

Are we certain that all programmers know how to do it? I recently posed this problem to some developers (some of whom were senior), and none of them came up with an efficient solution quickly.

I was unable to find an algorithm online that solves this simple problem, or rather, I found several that rely on constructing "trimmed" strings in which spaces have been removed (an example on this page or this other one).

Specifications

If you were given only the title of this page as the problem specification to solve, "comparing two strings while ignoring repeated spaces," would you be satisfied?

It's clear that:

"a•bcd•••ef" = "a•••bcd••ef"

but at least the following aspects need to be clarified:

Do we not consider spaces at all, or do we just ignore repeated spaces?
Do we consider spaces at the beginning and end of the string?
Do we consider only actual spaces (ASCII 32), or all non-printable separator characters (tab, newline, etc.)?

Let's define space characters as space, tab, carriage return, and newline characters.

Let's define the function equalsSpacesLenient(String,String) as the function that returns true if and only if the strings are equal, considering all sequences of one or more space characters as a single space. There are no special considerations for space characters at the beginning or end of the string.

Examples:

"a•b" = "a••••b" 
"a•b" ≠ "ab"
"••abc" = "→••abc" 
"••abc" ≠ "abc"
"ab•cd•••→•ef" = "ab•↵•••cd•ef" 
"ab•cd•••→•ef" ≠ "abcd•ef"

Try to implement a solution in your favorite language. After you have seen the solution, saying that it is simple does not count!!!

The solutions I found online

The immediate solution to the problem in Java is as follows:

s1.replaceAll("\\s+"," ").equals(s2.replaceAll("\\s+"," "))

Everything I found online is more or less related to this solution, which has the advantage of not requiring development (I searched with reference to Java, I only took a quick look at other languages).

The problem with this solution is its performance. For each comparison, the regular expression will be parsed, another string will be constructed and modified. Only at this point will the two strings be compared. Following the old principles of computer science, it seems abominable to manipulate strings by creating a new version just for the purpose of comparison.

Other solutions found online are based on the use of libraries, which require including a possibly voluminous library in the project, updating and maintaining it, and requiring the acquisition of some know-how to use it. Regarding this solution, my personal opinion is that you don't install a library just to avoid writing 10-20 lines of code. And anyway, someone has to write the libraries!

Preamble: (Classic) Comparison between Strings

To better understand the solution I propose in the next section, let's see how we could implement the normal comparison between two strings if programming languages weren't so "foresighted" as to provide us with this function already.

The comparison between two strings is implemented by comparing the characters of each of them two by two. At the first differing character, the process stops and the strings are considered different. If the strings are actually equal, the process continues for all the characters of the strings, which must therefore also have the same length.

This is an interesting approach in C This is an interesting approach in C.

// A not so orthodox way to redo equals in Java
   public static boolean strEquals(String s1, String s2) {
      // It should handle the case where s1 or s2 is null
      int idx=0;
      for (;;) {
         if (s1.charAt(idx)!=s2.charAt(idx)) {
            return false; // At the first differing character, finished: the strings are different
      }
         }
         idx++;
         if (idx>=s1.length()) { // s1 has no more characters
            if (idx>=s2.length()) {
               return true; // s2 also has no more characters and the strings are equal
            }
            else {
               return false; // s2 has more characters and therefore the strings are different
            }
         }
         else { // s1 has more characters to compare
            if (idx>=s2.length()) {
               return false; //  s2 has no more characters and therefore the strings are different
            }
            // s1 and s2 both have more characters to compare: continue
         }
      }
   }

equalsSpacesLenient: una soluzione in Java per il confronto fra stringhe ignorando spazi multipli

Avete già provato ad implementarlo? Se non l’avete fatto provateci prima di sbirciare la soluzione...

Partiamo dal codice qui sopra che reinventa l’equals classico per sviluppare la funzione equalsSpacesLenient.

Rispetto all’algoritmo sopra riportato, l’equalsSpacesLenient ha bisogno di due indici in quanto una stringa in un certo punto può avere uno spazio solo e l'altra due, per poi continuare con caratteri uguali. Ecco una possibile implementazione:

public static boolean equalsSpacesLenient(String s1,String s2) {
   // It should handle the case where s1 or s2 is null
   int idx1=0;
   int idx2=0;
   
   for (;;) {
      if (isSpace(s1.charAt(idx1)) && isSpace(s2.charAt(idx2))) {
         // skip any additional spaces in both strings
         do {
            idx1++;
         } while (idx1<s1.length() && isSpace(s1.charAt(idx1)));
         do {
            idx2++;
         } while (idx2<s2.length() && isSpace(s2.charAt(idx2)));
      }
      else if (s1.charAt(idx1)==s2.charAt(idx2)) {
      	 // The current character matches: move to the next
         idx1++;
         idx2++;
      }
      else {
          // There are two different characters: the strings are different: exit.
          return false;
      }
      
      if (idx1>=s1.length()) { // s1 has no more characters
         if (idx2<=s2.length()) {
            return true; // s2 also has no more characters and the strings are equal
         }
         else {
            return false; // s2 has more characters and therefore the strings are different
         }
      }
      else { // s1 has more characters to compare
         if (idx2<=s2.length()) {
            return false; // s2 has no more characters and therefore the strings are different
         }
          // s1 and s2 have more characters to compare: continue
      }
   }
      
} 

private static boolean isSpace(int c) {
   // I can decide here how to consider 'space' in various ways
   return Character.isSpaceChar(c);
   // return c==' ';
   // return c==' ' || c=='\t' || c=='\n' || c=='\r'; 
}

Why is this approach preferable?

Or rather, why is it not good to use

s1.replaceAll("\\s+"," ").equals(s2.replaceAll("\\s+"," "))

String comparison is a rather efficient process that, in the worst case (when the strings are equal), requires as many comparisons as the length of the strings. If we consider random strings, statistically, in most cases, they will differ from the first character, and therefore the comparison will end immediately.

Let's suppose we want to compare two strings of a hundred characters. The replaceAll method will still be executed before the comparison for both strings and will require hundreds of operations. This is even when the two strings already differ in the first character.

From some tests¹ on strings of a hundred characters, comparing with the equalsSpacesLenient function is a few thousand times faster². Therefore, we are talking about an improvement of three orders of magnitude.

Write to me at for questions, comments, suggestions...

footnotes

Comparing the performance of two algorithms could require many precautions for precise results, but in this case the difference is so evident that I will not dwell on the test conditions.
In the worst case scenario of always equal strings, it is "only" 10 times faster. Actually, this result doesn't do justice to the substantial difference between the two approaches, and understanding this aspect thoroughly would require another chapter.