Bash and Spaces in File Names

On LinkedIn one can specify skills that one has as a collection of keywords. Contacts can verify these skills by vouching that one has them. A recently added feature is that one can take a 15-question multiple choice test and show a badge if the test result lies in the 0.7 quantile or above. In principle a nice idea.

The tests for C++, Python, R and Git seemed sensible. The one about Bash was rather good for the most part, except for one question:

In order to write a script that iterates through the files in a directory, which of the following could you use?

  1. for $ls; do …; done
  2. for $(ls); do …; done
  3. for i in $(ls); do …; done
  4. for i in $ls; do …; done

Well, the third one will get the job done under a lot of dangerous assumptions on the programmer's side, so it likely is the “correct” answer. But it is terribly brittle and I would never accept that in any code that I review. Let's take a look at this and make it fail.

Take the following test program:

for i in $(ls); do
    echo "Processing '$i'"
done

Next we create four simple files in our current directory, such that the directory tree looks like this:

.
├── a
├── b
├── c
└── d

And when we run that simple Bash script in this directory, the output is exactly what we would expect:

Processing 'a'
Processing 'b'
Processing 'c'
Processing 'd'

Now I just add a subdirectory with some more files to the mix. We then have the following structure:

.
├── a
├── b
├── c
├── d
└── subdirectory
    ├── 1
    ├── 2
    ├── 3
    └── 4

The output of the script is now this:

Processing 'a'
Processing 'b'
Processing 'c'
Processing 'd'
Processing 'subdirectory'

Well, the question asked to iterate through all the files, so we already have one example where we broke this approach. In order to make sure that they are files, we could add some test to it:

for i in $(ls); do
    if ! [[ -f "$i" ]]; then
        continue
    fi
    echo "Processing '$i'"
done

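This way we skip everything that is not a regular file, which includes FIFOs, sockets and devices. Symlinks are followed by the -f test, so a symlink to a regular file is still processed, while dangling symlinks and symlinks to directories are skipped. One has to be a bit careful how to test. One could also just skip directories if that is the intention.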
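
How the individual tests behave can be sketched with a small helper. The function name classify is made up for this illustration; -L, -d, -f and -e are standard Bash conditional operators. The symlink test comes first because -d and -f follow symlinks.

```bash
#!/usr/bin/env bash
# classify is a hypothetical helper, not part of the original script.
# It prints one word (or two) describing what kind of entry a path is.
classify() {
    local path=$1
    if [[ -L "$path" ]]; then
        echo "symlink"          # checked first, -d and -f would follow the link
    elif [[ -d "$path" ]]; then
        echo "directory"
    elif [[ -f "$path" ]]; then
        echo "regular file"
    elif [[ -e "$path" ]]; then
        echo "other"            # FIFO, socket, device, ...
    else
        echo "missing"
    fi
}

# Report the kind of every entry in the current directory.
for i in ./*; do
    printf '%s: %s\n' "$i" "$(classify "$i")"
done
```

Depending on the intention, the loop body can then skip only the entries reported as directory, or everything that is not a regular file.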

Another crucial point is that Bash performs word splitting on whitespace in an awful lot of places, and the output of ls is meant for a human reader. So I have added a file called "two words" (with a space) to the directory. It now looks like this:

.
├── a
├── b
├── c
├── d
├── subdirectory
│   ├── 1
│   ├── 2
│   ├── 3
│   └── 4
└── two words

When running with the original code we get the following output:

Processing 'a'
Processing 'b'
Processing 'c'
Processing 'd'
Processing 'subdirectory'
Processing 'two'
Processing 'words'

Oops! There are two loop iterations dispatched for the single file. Therefore all file names which contain a space will break this program.
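
The effect can be reproduced in a throwaway directory. The following is only a sketch: count_with_ls and count_with_glob are hypothetical helper names, the directory comes from mktemp -d, and the default value of IFS is assumed.

```bash
#!/usr/bin/env bash
# count_with_ls and count_with_glob are hypothetical helpers for this demo.

# Counts loop iterations when iterating over the unquoted output of ls.
count_with_ls() {
    local dir=$1 n=0 i
    for i in $(ls "$dir"); do
        n=$((n + 1))
    done
    echo "$n"
}

# Counts loop iterations when iterating over a glob.
count_with_glob() {
    local dir=$1 n=0 i
    for i in "$dir"/*; do
        n=$((n + 1))
    done
    echo "$n"
}

demo=$(mktemp -d)
touch "$demo/a" "$demo/two words"
echo "ls:   $(count_with_ls "$demo")"     # 3 iterations for 2 files
echo "glob: $(count_with_glob "$demo")"   # 2 iterations for 2 files
rm -rf -- "$demo"
```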

But we can also get outright malicious. I've added a file named "-n .." (a hyphen, the letter n, a space and two dots). These are our files now:

.
├── a
├── b
├── c
├── d
├── -n ..
├── subdirectory
│   ├── 1
│   ├── 2
│   ├── 3
│   └── 4
└── two words

The output is now this, which is still innocent.

Processing 'a'
Processing 'b'
Processing 'c'
Processing 'd'
Processing '-n'
Processing '..'
Processing 'subdirectory'
Processing 'two'
Processing 'words'

But let's do a simpler echo statement like this:

for i in $(ls); do
    echo $i was processed
done

The output is not quite what one would expect:

a was processed
b was processed
c was processed
d was processed
was processed.. was processed
subdirectory was processed
two was processed
words was processed

Well, "-n .." is certainly a legal file name on an ext4 file system. But the way the code is written, word splitting turns it into the two words -n and .., and in the first of those iterations the line echo $i was processed has the variable substituted to echo -n was processed. That tells Bash to print “was processed” without a trailing newline, and that is what we get there.
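
The mechanism can be isolated without touching the file system. In this sketch the variable name is just a stand-in for the loop variable after word splitting, and printf is shown as one possible way to print a value that always treats it as data.

```bash
#!/usr/bin/env bash
# name is a stand-in for the loop variable after word splitting.
name='-n'

# Unquoted: echo sees -n as its own option and drops the newline,
# so the marker from the second echo ends up on the same line.
unquoted=$(echo $name was processed; echo '|')

# printf takes the format string first, so the name is only ever data.
safe=$(printf '%s was processed\n' "$name"; echo '|')

printf '%s\n' "$unquoted"   # was processed|
printf '%s\n' "$safe"       # two lines: "-n was processed" and "|"
```

Note that quoting alone does not rescue echo in this case: echo "$name" still passes -n as the first argument, and echo treats it as an option.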

Now think of what might happen when we just have rm $i as our loop body. Then there will be two iterations, one with rm -n and another with rm .., instead of a single one for the actual file. Luckily rm does not remove directories by default. But if the loop body were rm -rf $i, the script would ask rm to delete the parent directory as soon as a file name containing .. as a separate word is present. Current versions of GNU rm refuse to remove . and .., but a script should not depend on the tool to catch that.

So how do we fix that? Never parse the output of ls! Also read the Bash Pitfalls to see more dangerous code patterns. We use the following:

for i in ./*; do
    echo "$i" was processed
done

This solves a few issues at the same time:

  • Using a glob means that the file names that are generated from the pattern are not subject to whitespace splitting and therefore filenames with spaces are not a problem.

  • The leading ./ prevents file names starting with a hyphen from being interpreted as command line options by the programs. Most programs support -- to signal the end of the command line options, but some don't. So it is easier to just have the leading ./.

  • The quotes around "$i" prevent whitespace splitting of the content of the variable. Splitting unquoted variables is a very strange quirk of Bash and one has to look out for it all the time.

  • The sorting of the files that ls returns depends on user preferences and the locale. This means that the order can be different depending on the user platform. With globs it still depends on the locale, but one can try to control that from the script.
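
One way to control the ordering from the script, and to handle an empty directory at the same time, is sketched below. The function name list_sorted is hypothetical. The sketch assumes Bash, sets LC_ALL=C so that the glob results are sorted by byte value, and enables the nullglob shell option so that a pattern without matches expands to nothing instead of the literal ./*.

```bash
#!/usr/bin/env bash
# list_sorted is a hypothetical helper. Its body is a subshell, so the
# locale, the shell option and the directory change stay inside it.
list_sorted() (
    LC_ALL=C              # sort glob results by byte value
    shopt -s nullglob     # no matches: expand to nothing
    cd -- "$1" || exit 1
    for i in ./*; do
        printf '%s\n' "$i"
    done
)

demo=$(mktemp -d)
touch "$demo/a" "$demo/B" "$demo/two words"
list_sorted "$demo"   # ./B, ./a, ./two words (one per line)
rm -rf -- "$demo"
```

In the C locale uppercase letters sort before lowercase ones, which is why ./B comes first here. In many other locales the order would be ./a, ./B.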

With these precautions the user can choose whatever file names they want, and our script stays robust.

LinkedIn lets you send feedback if there is a mistake in a question. But you had better be quick, because the clock keeps running and the question counts as failed if you do not answer it within 90 seconds. I hope that they will pick up the suggestion and stop showing bad code as the correct answer.