In Stata coding, Style is the Essential: A brief commentary on do-file style

“In all unimportant matters, style, not sincerity is the essential.
In all important matters, style, not sincerity is the essential.”
– Oscar Wilde, Phrases and Philosophies for the Young

Wilde, well known in his time for decadent style and flamboyant behavior, could never have imagined that his rather abrupt witticism extolling the virtue of style could also aptly be applied to writing code in Stata.  Stata code, like computer code of any kind, has two purposes.  The first purpose, which Wilde may have seen as “sincerity,” is to communicate a series of commands to the computer.  We are explaining to Stata, step by step, what we wish it to do.  Stata doesn’t care much about style; its only concern is syntax.  As long as the syntax is correct, the program will run.  Style be damned.  However, the second purpose of computer code is to provide a record of what instructions were given.  The code itself stands as documentation of the programmer’s mind: how the problem was framed, the approach taken, and the end result.  This is where style becomes essential.  Like any written document, code exists to be read, and the readability of Stata code by the code’s creator, a colleague, or a mere spectator depends upon the style in which the code is written.

Lester McCann (1997), a senior lecturer at the University of Arizona, outlines the importance of programming style in Toward Developing Good Programming Style (written for C and for Pascal for those of us who are that old).  He explains:

“A program that is perfectly clear today is clear only because you just wrote it. Put it away for a few months, and it will most likely take you a while to figure out what it does and how it does it. If it takes you a while to figure it out, how long would it take someone else to figure it out?”

In the wake of the LaCour scandal, the importance of documenting our data, our processes, and our analyses so that others can evaluate what we have done can hardly be overstated.  Our code should be well thought out, well organized, and well documented.  As McCann (1997) says, it is important to strive to “structure and document your program the way you wish other programmers would” (the emphasis on “document” is mine).  McCann focuses on themes relevant to general programming, but some of these themes are also woven throughout “Suggestions on Stata programming style” (Cox, 2005), which appeared in the Stata Journal.  Cox’s article, though primarily focused on Stata programs (.ado files) rather than procedural “.do” files, goes to great lengths to support the idea that, above all else, programs must be clear.

In Stata, at its most basic level, a do-file is simply a text file that contains a script of commands for Stata to execute.  I suspect a true “programmer” would scoff at the relative simplicity of a Stata do-file, but those of us in the various disciplines that rely heavily on Stata for data analysis know that a Stata do-file can become quite the complex beast, reflecting hours (if not days) of work.  Indeed, there are do-files that make me more proud than any paper I’ve written.  Generally, do-files should be Robust and Legible (Long, 2009).

Robust do-files produce exactly the same result when run at a later time or on another computer.
Legible do-files are documented and formatted so that it is easy to understand what is being done.
(Long, 2009)

The stylistic approach that I take to writing do-files can be broken down into two key categories: (1) process and (2) format and style.

1. Process

Processes in Stata have been documented extensively in manuals and books published by the Stata Press (see Long, 2009, and Kohler & Kreuter, 2012, for two very useful examples).  I tend to keep suggestions from both, as well as several other practices, in mind as I work through my do-files:

a. Always create a log

The importance of a log file cannot be overemphasized.  Log files can help you trace errors, preserve output, and provide documentation.  Every do-file should open a new log file (preferred) or append to an existing log.  I also prefer to log as plain text, even though Stata’s default format is SMCL (Stata Markup and Control Language): a text log is generally quicker and easier to open, particularly on a machine without Stata or a SMCL viewer, such as a mobile device.  A log file can easily be created using the log using command:

log using logname.txt, replace text
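
For the less-preferred case of appending to an existing log, a minimal sketch might look like the following.  The file name logname.txt is carried over from the line above, and the summarize line is only a placeholder for real work.

```stata
capture log close                      // close any log left open by an earlier run
log using logname.txt, append text     // add this run's output to the end of the log
summarize                              // placeholder for the do-file's real commands
log close                              // always close the log you opened
```

The capture prefix keeps the do-file from stopping with an error when no log happens to be open.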

b. Make do-files self-contained

Stata do-files should be self-contained.  In other words, each do-file should be able to run as a stand-alone program without relying on the user to first load a dataset or take other action.  The do-file should load its own data sets, execute its own commands, and initiate logging.  It shouldn’t evaluate estimates or rely on macros that it has not created itself (see my comments on global vs. local macros below).  I think it worth pointing out that there are always some exceptions to these guidelines, particularly this one.  For example, it can be a good idea to use separate do-files to prepare and analyze data.  Clearly the preparation needs to be completed before the analysis.  However, once a final data file has been prepared, the analysis can be run as many times as needed without re-running the preparation script.
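
One way to keep a prepare/analyze split reproducible is a short master do-file that runs the pieces in order.  This is only a sketch: master.do, prepare.do, and analyze.do are hypothetical file names, and each of them is assumed to be self-contained in the sense described above.

```stata
* master.do (hypothetical): run the whole project from raw data to results
version 13
do prepare.do       // hypothetical: reads the raw data and saves the analysis file
do analyze.do       // hypothetical: loads the saved analysis file and runs the models
```

Because analyze.do loads its own data, it can still be run by itself once prepare.do has been run at least once.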

c. Identify the Stata version the do-file was written for

Stata changes over time.  New versions introduce new commands, retire old commands, and handle syntax differently.  For that reason, the do-file should declare the version of Stata it was written for.  This should be done at the beginning of the do-file using a simple command:

version 13

Not identifying the version can create troubleshooting headaches later on.  Take for example the merge command in Stata 10 or prior:

isid id
sort id
merge id using source.dta

Earlier versions of Stata required that the data be sorted on the merge variable (in this example, “id”) and, for a one-to-one merge such as this one, both the master and the using dataset needed to be uniquely identified on the merge variable.  Thus, before merging it was necessary to check for uniqueness (isid), sort on the variable, and then perform the merge.  In Stata 11 and higher, the merge command became more powerful, with explicit syntax for merging one-to-one, one-to-many, many-to-one, and even many-to-many on the merge variable(s).  The new merge command also handles the sorting of data internally, eliminating the need to pre-sort the master and using data sets.  The same process as above can now be accomplished by using:

isid id
merge 1:1 id using source.dta

The newer merge command is perfectly capable of understanding the old syntax, but a do-file written with the new syntax will not run successfully on an earlier version of Stata.  If the “version” command at the beginning of the do-file asks for a newer version than the copy of Stata running it, execution halts at that line with an error, rather than failing somewhere further down in a confusing way.
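
As an illustration of the newer syntax, the many-to-one case mentioned above might be sketched as follows.  The file names students.dta and classes.dta are hypothetical; the sketch assumes several students share each class_id and that classes.dta has exactly one row per class_id.

```stata
version 13
use students.dta, clear                 // hypothetical master file: one row per student
merge m:1 class_id using classes.dta    // hypothetical using file: one row per class
tabulate _merge                         // inspect how many rows matched
```

The _merge variable that merge creates records, row by row, whether an observation came from the master file, the using file, or both, so tabulating it is a quick sanity check.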

d. Use relative paths

Commands that read and write files (e.g., data files, log files, and output files) should use folder (directory, if you must) paths that are relative to the location of the do-file or to the project’s home location.  As I’ll discuss in the next section, your do-file should identify a home location in the header; after that, everything else should be relative.  If the location of the project changes, only the home location needs updating, and the do-file will still execute without your having to find and replace every path in the file.  For example, a do-file might open a dataset:

use "c:\Users\JohnDoe\Documents\Stata Data\My Project\work\analysis\mycooldata.dta", clear

That path is all well and good until you move “My Project” and all of its contents onto a flash drive or external hard drive, or you migrate the project from a PC to a Mac or Linux platform.  To avoid future issues, use a relative path, written with forward slashes so that it works on Windows, Mac, and Linux alike:

use analysis/mycooldata.dta, clear

This approach does, however, require that you use a uniform directory structure for your projects.  I create a separate folder for every project I work on (e.g., “My Project”) and that project folder contains consistent subfolders:

documentation/
drafts/
readings/
work/
work/raw
work/intermediate
work/analysis
work/output

Some of the subfolders listed above are self-explanatory.  The “documentation” folder contains project documentation, codebooks, memos, etc., and the “drafts” folder contains drafts of papers or reports.  I have to admit that the “readings” folder has become a bit obsolete since I now use Mendeley Desktop to organize PDFs and references, but in a group project, the folder can be useful for sharing articles.  The key folder for Stata work is the “work” folder.  In this folder and its subfolders, I keep the “raw” data, the “intermediate” (i.e., temporary) data files, and the final “analysis” data files.  The “output” folder contains tables, graphs, or other output generated by my do-files.
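
Given that folder layout, a header might set the home location once and keep every later path relative to it.  This is a sketch under assumptions: the home path reuses the example path from earlier in this section, and the saved file name is hypothetical.

```stata
* Home location: the only machine-specific line in the do-file
cd "C:/Users/JohnDoe/Documents/Stata Data/My Project/work"

* Everything after this point is relative to the work folder
use "analysis/mycooldata.dta", clear
save "intermediate/mycooldata_step1.dta", replace    // hypothetical file name
```

If the project moves, only the cd line changes.  Stata accepts forward slashes in paths on Windows as well as on Mac and Linux, which is why they are used throughout.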

I also store all non-sensitive projects in a cloud-based folder.  I prefer Dropbox, but Box and Google Drive work equally well.  These services create a local folder on your computer that is then synchronized with the cloud.  This serves two really important purposes.  First, it makes sure that my critical files are always backed up.  Second, since I have the cloud services installed on both my desktop and my laptop, files are automatically updated between the two and the paths remain the same.  A do-file that I write on my desktop will run on my laptop with no modifications at all.

e. The pen is mightier than the sword: Use the right tools

There is little doubt (for me anyway) that Stata is a pretty awesome analytical tool.  It is very difficult, however, for any program to be all things to all people.  The do-file editor built in to Stata definitely has its benefits, particularly with the built-in project management in Stata 13 and higher, but there are also advantages to using a third-party editor.  It probably goes without saying that you shouldn’t use Word or Pages to edit your do-files: word processors are most definitely not the right tool for the job because they are designed to handle formatting and layout issues that aren’t relevant to our Stata code.  On my Mac, I use Textastic, a great editor that supports Stata syntax highlighting (e.g., commands are highlighted and the code is visually very easy to follow), automatic balancing of brackets and quotes, and automatic indentation.  Most importantly for me, it automatically saves, whereas the Stata do-file editor does not.  Textastic also has an iOS editor and iCloud/Dropbox support, which makes it convenient to edit and view do-files on the go.  For PC users, Cox (2005) provides a web link with text editor recommendations: http://fmwww.bc.edu/repec/bocode/t/textEditors.html.  He cites Heiberger and Holland’s (2004) requirements for any text editor (pp. 633-4):

1. Permit easy modification of computing instructions and facilitate their resubmission for processing
2. Be able to operate on output as well as input
3. Be able to cut, paste, and rearrange text; to search documents for strings of text; to work with rectangular regions of text
4. Be aware of the language, for example, to illustrate the syntax with appropriate indentation and with color and font highlighting, to detect syntactic errors in input code, and to check the spelling of statistical keywords
5. Handle multiple files simultaneously
6. Interact cleanly with the presentation documents (reports and correspondence) based on the statistical results
7. Check spelling of words in natural languages
8. Permit placement of graphics within text
9. Permit placement of mathematical expressions in the text.

Points 6, 8, and 9 are less relevant for Stata do-files, but the remaining requirements hold.

2. Format and Style

a. Be consistent

Consistency is key to making sure your do-files are readable and easy for others (and your future self) to understand.  Develop consistent patterns and habits for all aspects of the format and style of your coding, including the overall structure of the file, foreach and forvalues loops, and local and global macros.

b. Comment and document

Comments are critical to making your do-file easy to understand and refer back to.  Stata allows three basic methods for typing comments:

* Comment
// Comment
/* Comment */

You can’t have too many comments, and you will certainly not regret the time spent on comments and internal documentation if you have to return to the file later.  However, comments are only useful if you can make sense of them later, so they should be clear and accurate.

For the purposes of consistency, I use different comment notation in different ways.  I tend to use * comments for headers and dividers, to “comment-out” specific lines, and also to indicate notes to self that require follow-up.  I use // comments to indicate comments on do-file operations, to explain what the file is doing and why.  For example:

*--------------------------------------------------
* My Project
* samplefile.do
* 7/30/2014, version 1
* Michael S. Hill, University of California, Davis
*--------------------------------------------------

*--------------------------------------------------
* Program Setup
*--------------------------------------------------
version 13              // Set Version number for backward compatibility
set more off            // Disable partitioned output
clear all               // Start with a clean slate
set linesize 80         // Line size limit to make output more readable
macro drop _all         // clear all macros
capture log close       // Close existing log files
log using teachermobility.txt, text replace      // Open log file
*--------------------------------------------------

// Open data file created by createmydata.do
use analysis/mycooldata.dta, clear

// Summarize data
summarize

// Close the log, end the file
log close
exit

In the example above, I also show you the kind of header and initial commands that I run in every do-file.  Not only does this identify the do-file and its project, but it also identifies me, the author, and the date and version number.  If I come back to this do-file in the future, I’ll know what it was supposed to do.  As a side note, I would also introduce any global macros in the header.  After that, I would only use local macros.
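
To make the global-versus-local habit concrete, here is a minimal sketch.  The macro names controls and outcome are hypothetical, and the data file is assumed to contain the grade variables used elsewhere in this post.

```stata
* Header: global macros are defined once, at the top
global controls "grade1 grade2"        // hypothetical list of variables

use analysis/mycooldata.dta, clear

* Body: local macros only
local outcome "grade3"                 // hypothetical; disappears when the do-file ends
summarize `outcome' $controls
```

A local macro exists only while the do-file that defined it is running, whereas a global macro stays in memory until it is dropped or Stata closes.  That persistence is why I confine globals to the header, and why the header above starts with macro drop _all.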

c. Use spacing and indentation well

Stata ignores extra spaces and tabs between the words of a command.  So go wild with spacing – consistently, of course – and line up your commands in neat columns, tab to offset different parts of code (especially foreach and forvalues commands), and make your code pretty.  Yes, that’s right:  pretty.  To borrow an example from J. Scott Long (2009), which is easier to read, this:

rename k12_unique_id sid
rename class_unique_id class_id
rename teacher_name teacher
rename semester_1_grade grade1
rename semester_2_grade grade2
rename final_course_grade grade3
rename pass_nopass pass

or this:

rename k12_unique_id         sid
rename class_unique_id       class_id
rename teacher_name          teacher
rename semester_1_grade      grade1
rename semester_2_grade      grade2
rename final_course_grade    grade3
rename pass_nopass           pass

Perhaps more importantly, which would be easier to troubleshoot if names started showing up wrong?

Using /// is a visually appealing and helpful way to organize a complex command across multiple lines.  Stata will execute the following two commands the same way:

use analysis/mycooldata.dta, clear

use ///
analysis/mycooldata.dta, ///
clear

There is likely no practical reason to split a use command across three lines, but as an example, Stata will treat those three lines of code as if they were one rather than breaking the command at the end of each line.  It is also possible to include comments after each /// just like comments can be included after //.  The difference is that a // leaves the line break intact.
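
As a sketch of comments placed after ///, the keep command used elsewhere in this post could be annotated like this.  The comment wording is only illustrative.

```stata
keep sid class_id                ///  identifiers
     teacher                     ///  teacher name
     grade1 grade2 grade3 pass   //   grades and pass flag; the command ends here
```

The first two lines end in ///, so Stata joins them to the line that follows.  The last line ends in //, which comments out the rest of that line but still lets the command end there.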

It can also be helpful to use indentation to visually indicate that a command breaks across multiple lines.  This is easier to visually follow:

keep   sid class_id teacher grade1 ///
       grade2 grade3 pass

than this:

keep sid class_id teacher grade1 ///
grade2 grade3 pass

I also recommend not putting any code on a line after a brace { or }, and making sure that braces line up for easier visual tracking.

foreach var of varlist * {
     sum `var'
     rename `var' data_`var'
}
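
The same layout carries over to forvalues.  A small sketch, assuming the variables grade1 through grade3 from the rename example exist in memory; the label text is hypothetical.

```stata
forvalues i = 1/3 {
     label variable grade`i' "Grade, period `i'"    // hypothetical label text
}
```

The opening brace ends the forvalues line, the body is indented, and the closing brace sits alone on its own line, aligned with the command that opened the loop.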

I also tend to group similar functions together without a space between lines, whereas a new set of functions gets a space between lines.

rename k12_unique_id     sid
rename class_unique_id   class_id

label variable sid          "Student ID"
label variable class_id     "Course ID"

d. Don’t substitute brevity for readability

Stata usually allows you to abbreviate commands, parameters, and variable names to the fewest possible characters needed to uniquely identify the name.  This is particularly handy when entering commands directly to the console.  However, in do-files, too much brevity can make it difficult to decipher the command later, particularly for variables.  Make sure that if you use abbreviations for commands that you can identify the command from the abbreviation.  Take, for example:

summarize grade1
sum grade1
su grade1

Each of the three commands will cause Stata to summarize the values of variable “grade1,” but “su” may be unclear down the road.  The abbreviation “sum,” on the other hand, is clear in its meaning and considerably shorter than the full “summarize.”  As with all things, the key here is consistency.  If the do-file is well documented and the abbreviations are consistently applied, then there should be no problems.

I also reject any argument that spacing, tabs, multiple lines, or fully spelled out commands slow down the execution of my do-file.  If an extra line of spacing makes my file easier to read, a few extra nanoseconds are irrelevant.

e. Stata is not C

To quote from Cox (2005), “Stata is not C” (or Pascal for that matter).  It is not necessary to delimit and end commands with a semi-colon.  For those not familiar with the command, using:

#delimit ;

will cause Stata to treat all characters before the semi-colon as part of the same command, no matter how many lines they span.  From the earlier example, Stata will treat:

#delimit ;
keep   sid class_id teacher grade1
       grade2 grade3 pass;
#delimit cr

the same as:

keep   sid class_id teacher grade1 ///
       grade2 grade3 pass

Cox (2005) describes the use of /// as “tidier.”  I agree with this, and would extend the rationale for avoiding the semi-colon by pointing out that most of the time my commands remain on a single line.  It is less likely I will need to span multiple lines with a ///, and when I do it is generally because I’m trying to improve readability.  The use of /// at the end of lines also allows me to place comments after the /// to explain why I am organizing my command the way I am.  The /// is more elegant and in keeping with Stata’s design — switching to a semi-colon requires actually telling Stata to behave counter to its default mode.

With that being said, the semi-colon delimiter has its time and place.  However, it should be the exception, not the rule.
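
One place where the semi-colon delimiter may earn its keep is a single long command with many options, such as a graph.  The sketch below assumes the grade1 and grade2 variables from earlier examples; the title text is hypothetical.

```stata
#delimit ;
twoway (scatter grade2 grade1)
       (lfit    grade2 grade1),
       title("Semester 2 vs. semester 1 grades")
       legend(off);
#delimit cr
summarize grade1 grade2
```

The #delimit cr line switches back to Stata’s default, where a command ends at the end of its line, so the exception stays confined to the one command that needs it.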

Some final comments

Research in a vacuum is useless.  We, as researchers, have an obligation to communicate our work — both what we have done/discovered and how we discovered it.  Documenting how and why we do what we do is an important part of this communication process, and if we are using Stata (or any other statistical package for that matter), our code is part of our documentation.  My goal for every do-file is that anybody else with a passing familiarity with Stata should be able to review my file and understand what I did as well as the how and why behind it.

None of what has been written here is intended to be “The Law of Stata” or any hard and fast rule of coding.  It certainly isn’t “wrong” to do things in another way.  These guidelines are intended to be helpful — and they will be for some, they may not be for others.  I encourage comments and suggestions, and I intend to update this post as Stata changes or as conventions change.

4 responses to “In Stata coding, Style is the Essential: A brief commentary on do-file style”

  1. […] A full style guide for R can be found here, and a Stata style guide can be found here. […]

  2. […] In Stata coding, Style is the Essential: A brief commentary on do-file style Embrace Your Fallibility: Thoughts on Code Integrity […]

  3. Thank you, Michael, for sharing your reflections … I found them useful.
    (One minor change in your program setup; I had to change the syntax in the log creation instruction: I changed txt to text. working on stata 15)
    Highly appreciated still.

  4. Good catch – thank you for the comment and fixed!
