List R Markdown



Links: GitHub Gist

Elements of an R Markdown file: the YAML header. At the top of our newly initiated R Markdown file, enclosed in --- tags, we see the first of the essential elements of an R Markdown file, the YAML header. YAML stands for “YAML Ain’t Markup Language” or “Yet Another Markup Language”, and is a human-readable language, which we use here to communicate with Pandoc.

GitHub user @stanstrup posted a question today on the blogdown GitHub repo about manually positioning a table of contents in blogdown:

When I use toc: true in a post the toc is inserted at the very top of the post.… If you could specify the position of the toc with some keyword you could work around it.

I don’t use the academic theme for Hugo (I use a modified version of hyde), so I’m not entirely sure if I can completely solve stanstrup’s problems, but I know I’ve run into something similar recently.

And while Yihui is probably right that the effort isn’t worth it when fiddling with trivial aesthetics, I use R Markdown in enough places and have run into this a few times. Knowing that someone else out there felt the same pain was enough to push me to code up a quick solution.

The function I’ve worked up is called render_toc() and it allows you to drop in a table of contents anywhere inside an R Markdown document. This means you can use it to manually position a table of contents in:

  • A README file for your package repo
  • A long blogdown post
  • An overview slide in xaringan

and many more places.

Get It

I’ve posted the function and an example document as a GitHub Gist. To use it in your document, choose one of the following:

  1. Download render_toc.R and source('render_toc.R') in your project or script

  2. Copy the function code into your R Markdown document

  3. Source the function from GitHub using devtools:
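The third option might look like the sketch below. devtools::source_gist() is an existing devtools function, but the Gist ID is not given in the text above, so "<gist-id>" is only a placeholder that has to be replaced with the ID from the Gist's URL.

```r
# Option 3: source render_toc() straight from the GitHub Gist.
# "<gist-id>" is a placeholder, not the real ID; copy it from the Gist URL.
devtools::source_gist("<gist-id>", filename = "render_toc.R")
```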

Use It

I included an example file in the GitHub Gist. Essentially, you just need to source render_toc.R somewhere (such as a setup chunk) and then call it in the document where you want to render the table of contents.

The output will just be a markdown list, so if you want to give the table of contents its own header, you’ll have to include that in the document.

Here’s what a simple R Markdown document would look like.
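The original example is not reproduced here, so the following is a minimal sketch rather than the author's file. It assumes that render_toc.R sits next to the document, that the document is saved as example.Rmd, and that render_toc() takes the path of the document as its first argument.

````markdown
---
title: "Manually positioned TOC"
output: html_document
---

```{r setup, include = FALSE}
# Assumption: render_toc.R was downloaded into the same folder as this file
source("render_toc.R")
```

# Introduction

Some text that should appear before the table of contents.

## Table of Contents

```{r toc, echo = FALSE}
# Assumption: this document is saved as example.Rmd
render_toc("example.Rmd")
```

# Methods

## Data

# Results
````

The "Table of Contents" header is written by hand because the function only emits the list itself.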

which renders as this document.

Behind the Scenes

The function simply reads through the lines of the R Markdown document and strips out any code blocks. The supported code fencing style is three or more ` characters in a row.

Then I extract the headers, which must be in the hashtag-style to work. In other words, headers that start with one or more # characters work well, while headers written in any other style won’t be processed.

The function creates the header anchor if not manually specified – see the pandoc header identifiers help page for more information – or uses the identifier if it is included.

For example, a header titled “A nice header” would link to #a-nice-header, while a header that specifies the identifier my-shortcut links to #my-shortcut.
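As an illustration of the two cases, the header text below is made up to match the anchors named above; the {#id} suffix is pandoc's syntax for an explicit header identifier.

```markdown
# A nice header

# Another header {#my-shortcut}
```

The first header gets the automatic anchor #a-nice-header, and the second is reached through #my-shortcut.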

Any headers with a higher depth than the toc_depth parameter (default is 3) are discarded. Any initial headers that appear before the first base-level header and sit at a deeper level (say ### when the base level is ##) are discarded as well.

Finally, if toc_header_name is set, the header with that name is discarded so that the TOC itself isn’t included in the TOC.
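The steps described in this section can be sketched in a few lines of base R. This is not the code of render_toc(): sketch_toc() is a hypothetical helper, its anchor rule is a simplified approximation of pandoc's identifier algorithm, and it leaves out the handling of toc_header_name and of early headers below the base level.

```r
# Hypothetical helper: a simplified sketch of the steps described above,
# not the author's render_toc() implementation.
sketch_toc <- function(lines, toc_depth = 3) {
  # 1. Drop fenced code blocks (fences are three or more backticks).
  is_fence <- grepl("^\\s*`{3,}", lines)
  in_code <- cumsum(is_fence) %% 2 == 1
  lines <- lines[!in_code & !is_fence]

  # 2. Keep hashtag-style headers only.
  headers <- grep("^#{1,6}\\s+\\S", lines, value = TRUE)
  if (length(headers) == 0) return(character(0))
  level <- nchar(sub("^(#+).*$", "\\1", headers))
  text <- trimws(sub("^#+\\s+", "", headers))

  # 3. Use an explicit {#id} when present, otherwise build a simple anchor.
  has_id <- grepl("\\{#[^}]+\\}\\s*$", text)
  id <- sub("^.*\\{#([^} ]+)[^}]*\\}\\s*$", "\\1", text)
  text <- trimws(sub("\\s*\\{#[^}]+\\}\\s*$", "", text))
  auto <- gsub("\\s+", "-", gsub("[^a-z0-9 _.-]", "", tolower(text)))
  anchor <- ifelse(has_id, id, auto)

  # 4. Discard headers deeper than toc_depth and build the markdown list.
  keep <- level <= toc_depth
  if (!any(keep)) return(character(0))
  indent <- strrep("    ", level[keep] - min(level[keep]))
  paste0(indent, "- [", text[keep], "](#", anchor[keep], ")")
}

# Usage on a small in-memory document (the content is illustrative).
fence <- strrep("`", 3)
doc <- c(
  "# Title",
  paste0(fence, "{r}"),
  "# a comment inside a chunk, not a header",
  fence,
  "## A nice header",
  "## Another header {#my-shortcut}",
  "#### Too deep"
)
cat(sketch_toc(doc), sep = "\n")
```

The comment line inside the chunk starts with # but is skipped because code blocks are removed before headers are extracted.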

The end result is a simple markdown list that can be rendered anywhere!

Which, underneath, is just markdown.

Let me know on Twitter @grrrck if you found this helpful or run into any issues!

Run the code below in your console to download this exercise as a set of R scripts.

Reproducibility in scientific research

Reproducibility is “the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.”1 Scholars who implement reproducibility in their projects can quickly and easily reproduce the original results and trace back to determine how they were derived. This enables verification and replication, and allows researchers to precisely replicate their own analysis. This is extremely important when you write a paper, submit it to a journal, and then come back months later for a revise and resubmit: by then you won’t remember how all the code and analysis fit together.

Reproducibility is also key for communicating findings with other researchers and decision makers; it allows them to verify your results, assess your assumptions, and understand how your answers were formed rather than solely relying on your claims. In the data science framework employed in R for Data Science, reproducibility is infused throughout the entire workflow.

R Markdown is one approach to ensuring reproducibility by providing a single cohesive authoring framework. R Markdown documents combine code, output, and analysis in a single file, are easily reproducible, and can be output to many different file formats. R Markdown is just one tool for enabling reproducibility. Another tool is Git for version control, which is crucial for collaboration and tracking changes to code and analysis.

Jupyter Notebooks

In the data science realm, another popular unified authoring framework is the Jupyter Notebook. The Jupyter Notebook (originally called IPython Notebook) is a web application that incorporates text, code, and output into a single document. Originally created for the Python programming language, Jupyter Notebooks are now multi-language and support over 40 programming languages, including R. You have probably seen or used them before.

There is nothing wrong with Jupyter Notebooks, but I prefer R Markdown because it is integrated into RStudio, arguably the best integrated development environment (IDE) for R. Furthermore, as you will see, an R Markdown file is a plain-text file. This means the content of the file can be read by any text editor and is easily tracked by Git. Jupyter Notebooks are stored as JSON documents, a different and more complex file format. JSON is a useful format, as we will see when we get to our modules on obtaining data from the web, but JSON files are also much more difficult to track for revisions using Git. For this reason, in this course we will exclusively use R Markdown for reproducible documents.

R Markdown basics

An R Markdown file is a plain text file that uses the extension .Rmd.

R Markdown documents contain 3 major components:

  1. A YAML header surrounded by ---s
  2. Chunks of R code surrounded by ``` (triple-backticks)
  3. Text mixed with simple text formatting using the Markdown syntax
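The three components can be seen together in a minimal file such as the sketch below. The title, the chunk label, and the use of the built-in cars data set are illustrative choices.

````markdown
---
title: "Example"
output: html_document
---

This is text written with **Markdown** formatting.

```{r cars-summary}
summary(cars)
```
````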

Code chunks are interspersed with text throughout the document. To complete the document, you “Knit” or “render” the document. Most of you probably knit the document by clicking the “Knit” button in the script editor panel. You can also do this programmatically from the console by running the command rmarkdown::render('example.Rmd').

When you knit the document you send your .Rmd file to knitr, a package for R that executes all the code chunks and creates a second markdown document (.md). That markdown document is then passed on to pandoc, a document rendering software program independent from R. Pandoc allows users to convert back and forth between many different document formats such as HTML, LaTeX, Microsoft Word, etc. By splitting the workflow up, you can convert your R Markdown document into a wide range of output formats.
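The two stages can be observed separately. The sketch below writes a tiny, made-up R Markdown file to a temporary directory and runs only the knitr stage, which leaves a plain Markdown file behind; the pandoc stage is left commented out because it requires pandoc to be installed.

```r
# Write a tiny R Markdown file to a temporary directory (illustrative content).
fence <- strrep("`", 3)
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(c(
  "---",
  "title: Example",
  "---",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
), rmd_file)

# Stage 1: knitr executes the chunks and writes a Markdown (.md) file.
md_file <- knitr::knit(
  rmd_file,
  output = file.path(tempdir(), "example.md"),
  quiet = TRUE
)
cat(readLines(md_file), sep = "\n")

# Stage 2 is normally run for you: rmarkdown::render() calls knitr and then
# hands the Markdown file to pandoc. Uncomment if pandoc is available.
# rmarkdown::render(rmd_file, output_format = "html_document")
```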

Text formatting with Markdown

We have previously practiced formatting text using the Markdown syntax. I will not go into it further, but do note that there is a quick reference guide to Markdown built into RStudio. To access it, go to Help > Markdown Quick Reference.

Exercise

  • Render gun-deaths.Rmd as an HTML document
  • Add text describing the frequency polygon

Code chunks

Code chunks are where you store R code that will be executed. You can name a code chunk using the syntax ```{r name-here}. Naming chunks is a good practice to get into for several reasons. First, it makes navigating an R Markdown document using the drop-down code navigator in the bottom-left of the script editor easier since your chunks will have intuitive names. Second, it generates meaningful file names for any graphs created within the chunk, rather than unhelpful names such as unnamed-chunk-1.png. Finally, once you start caching your results (more on that below), using consistent names for chunks avoids having to repeat computationally intensive calculations.
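A named chunk might look like the sketch below; the label age-histogram and the random data are illustrative.

````markdown
```{r age-histogram}
# Figures from this chunk are saved with names such as age-histogram-1.png
hist(rnorm(100))
```
````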

Customizing chunks

Code chunks can be customized to adjust the output of the chunk. Some important and useful options are:

  • eval = FALSE - prevents code from being evaluated. I use this in my notes for class when I want to show how to write a specific function but don’t need to actually use it.
  • include = FALSE - runs the code but doesn’t show the code or results in the final document. This is useful when you have setup code at the beginning of your document (loading packages, adjusting options, etc.) that may generate a lot of messages that are not really necessary to include in the final report.
  • echo = FALSE - prevents code from showing in the final output, but does show the results of the code. Use this if you are writing a paper or document for someone who cares more about the substantive results and less about the programming used to obtain them.
  • message = FALSE or warning = FALSE - prevents messages or warnings from appearing in the final document.
  • results = 'hide' - hides printed output.
  • error = TRUE - causes the document to continue knitting and rendering even if the code generates a fatal error. I use this a lot when I want to intentionally demonstrate an error in class. If you’re debugging your code, you might want to use this option. However for the final version of your document, you probably do not want to allow errors to pass through unnoticed.

For example, if I wanted a code chunk to not print the code itself or any warnings/messages generated by the chunk (i.e. only print tables and figures), I would write this as:
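A chunk header along these lines would do that. The chunk label and the plotting code are placeholders; the three options are the ones described in the list above.

````markdown
```{r summary-plot, echo = FALSE, message = FALSE, warning = FALSE}
library(ggplot2)
ggplot(mpg, aes(displ, hwy)) + geom_point()
```
````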

Caching

Remember the R Markdown workflow?

By default, every time you knit a document R starts completely fresh. None of the previous results are saved. If you have code chunks that run computationally intensive tasks, you might want to store these results to be more efficient and save time. If you use cache = TRUE, R will do exactly this. The output of the chunk will be saved to a specially named file on disk. If your .gitignore file is set up correctly, this cached file will not be tracked by Git. This is in fact preferable since the cached file could be hundreds of megabytes in size. Now, every time you knit the document the cached results will be used instead of running the code fresh.
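As a small sketch, a cached chunk could look like this. The label slow-model is made up, and Sys.sleep() stands in for an expensive computation.

````markdown
```{r slow-model, cache = TRUE}
Sys.sleep(5)                 # stand-in for a slow computation
result <- mean(rnorm(1e6))
```
````

The first knit takes at least five seconds for this chunk; later knits reuse the stored result as long as the chunk's code and options stay the same.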

Dependencies

This could be problematic when chunks rely on the output of previous chunks. Take this example from R for Data Science
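The example itself is not reproduced here. The sketch below follows the chunk and object names used in the next paragraph; the file name, the column names, and complicated_transformation() are placeholders rather than real data or functions.

````markdown
```{r raw_data}
rawdata <- readr::read_csv("a_very_large_file.csv")
```

```{r processed_data, cache = TRUE}
processed_data <- rawdata %>%
  filter(!is.na(import_var)) %>%
  mutate(new_variable = complicated_transformation(x, y, z))
```
````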

processed_data relies on the rawdata file created in the raw_data chunk. If you change your code in raw_data, processed_data will continue to rely on the older cached results. This means even if rawdata is altered, the cached results will continue to erroneously be used. To prevent this, use the dependson option to declare any chunks the cached chunk relies upon:
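Using the chunk names from the paragraph above, the cached chunk's header could declare the dependency as follows. The chunk body is a placeholder pipeline with made-up column and function names.

````markdown
```{r processed_data, cache = TRUE, dependson = "raw_data"}
processed_data <- rawdata %>%
  filter(!is.na(import_var)) %>%
  mutate(new_variable = complicated_transformation(x, y, z))
```
````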

Now if the code in the raw_data chunk is changed, processed_data will be run and the cache updated.

Global options

Rather than setting these options for each individual chunk, you can make them the default options for all chunks by using knitr::opts_chunk$set(). Just include this in a code chunk (typically in the first code chunk in the document). For example, knitr::opts_chunk$set(echo = FALSE) hides the code by default in all code chunks. To override this new default, you can still declare echo = TRUE for individual chunks.
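In a document this could be laid out as in the sketch below, where the chunk labels and the summary(cars) call are illustrative.

````markdown
```{r setup, include = FALSE}
# New default for every chunk that follows: do not show the code
knitr::opts_chunk$set(echo = FALSE)
```

```{r show-this-one, echo = TRUE}
# This chunk overrides the default and shows its code again
summary(cars)
```
````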

Inline code

Until now, you have only run code in a specially designated chunk. However you can also run R code in-line by using the `r ` syntax. For example, look at the text from the example document earlier:

We have data about `r nrow(gun_deaths)` individuals killed by guns. Only `r nrow(gun_deaths) - nrow(youth)` are older than 65. The distribution of the remainder is shown below:

When you knit the document, the R code is executed:

We have data about 100798 individuals killed by guns. Only 15687 are older than 65. The distribution of the remainder is shown below:

Exercise: practice chunk options

  • Set echo = FALSE as a global option
  • Enable caching as a global option and render the document. Look at the file structure for the cache. Now render the document again. Does it run faster?

YAML header

Yet Another Markup Language, or YAML (rhymes with camel) is a standardized format for storing hierarchical data in a human-readable syntax. The YAML header controls how rmarkdown renders your .Rmd file. A YAML header is a section of key: value pairs surrounded by --- marks.

The most important option is output, as this determines the final document format. However there are other common options such as providing a title and author for your document and specifying the date of publication.
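A header with these common options might look like the sketch below; the title, author, and date values are placeholders.

```yaml
---
title: "Gun deaths"
author: "Your Name"
date: "2021-01-21"
output: html_document
---
```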

Output formats

HTML document

For your homework assignments, we have used github_document to generate a Markdown document. However there are other document formats that are more commonly used.

output: html_document produces an HTML document. The nice feature of this document is that all images are embedded in the HTML file itself, so you can email just the .html file to someone and they will be able to open and read it.

Table of contents

Each output format has various options to customize the appearance of the final document. One option for HTML documents is to add a table of contents through the toc option. To add any option for an output format, just add it in a hierarchical format like this:
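A minimal sketch of that nesting follows. Note that the format name gains a trailing colon once options are indented beneath it, and the toc_depth value of 2 is only an illustration.

```yaml
---
output:
  html_document:
    toc: true
    toc_depth: 2
---
```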

You can explicitly set the number of levels included in the table of contents with toc_depth (the default is 3).

Appearance and style

There are several options that control the visual appearance of HTML documents.

  • theme specifies the Bootstrap theme to use for the page (themes are drawn from the Bootswatch theme library). Valid themes include 'default', 'cerulean', 'journal', 'flatly', 'readable', 'spacelab', 'united', 'cosmo', 'lumen', 'paper', 'sandstone', 'simplex', and 'yeti'.
  • highlight specifies the syntax highlighting style for code chunks. Supported styles include 'default', 'tango', 'pygments', 'kate', 'monochrome', 'espresso', 'zenburn', 'haddock', and 'textmate'.
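Both options sit under the output format, as in this sketch; the two values are picked arbitrarily from the lists above.

```yaml
---
output:
  html_document:
    theme: cerulean
    highlight: tango
---
```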

Code folding

Sometimes when knitting an R Markdown document you want to include your R source code (echo = TRUE) without making it visible by default. The code_folding: hide option allows you to include your R code but hide it. Users can then decide whether or not they want to see specific chunks or all chunks in the document. This strikes a good balance between readability and reproducibility.
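In the YAML header the option is nested under html_document, for example:

```yaml
---
output:
  html_document:
    code_folding: hide
---
```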

Keeping Markdown

When knitr processes your .Rmd document, it creates a Markdown (.md) file that is subsequently deleted. If you want to keep a copy of the Markdown file use the keep_md option:
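A sketch of the corresponding header:

```yaml
---
output:
  html_document:
    keep_md: true
---
```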

Exercise: test HTML options

  1. Add a table of contents
  2. Use the 'cerulean' theme
  3. Modify the figures so they are 8x6

PDF document

pdf_document converts the .Rmd file to a LaTeX file which is used to generate a PDF.

You do need to have a full installation of TeX on your computer to generate PDF output. However the nice thing is that because it uses the LaTeX rendering engine, you can use raw LaTeX code in your .Rmd file (if you know how to use it).

Table of contents

Many options for HTML documents also work for PDFs. For instance, you create a table of contents the same way:
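For a PDF the nesting is the same, with pdf_document as the format name; the toc_depth value here is illustrative.

```yaml
---
output:
  pdf_document:
    toc: true
    toc_depth: 2
---
```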

Syntax highlighting

You cannot customize the theme of a pdf_document (at least not in the same way as HTML files), but you can still customize the syntax highlighting.
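For instance, assuming the tango style is wanted:

```yaml
---
output:
  pdf_document:
    highlight: tango
---
```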

LaTeX options

You can also directly control options in the LaTeX template itself via the YAML options. Note that these options are passed as top-level YAML metadata, not underneath the output section:
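As a sketch, fontsize and geometry are two variables the default template understands; the title and the specific values are illustrative.

```yaml
---
title: "Gun deaths"
output: pdf_document
fontsize: 11pt
geometry: margin=1in
---
```

Both keys are written at the same level as output rather than indented beneath pdf_document.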

Keep intermediate TeX

R Markdown documents are first converted to a .tex file, which the LaTeX engine then converts to PDF. To keep the .tex file, use the keep_tex option:
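A sketch of the corresponding header:

```yaml
---
output:
  pdf_document:
    keep_tex: true
---
```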

Presentations

You can use R Markdown not only to generate full documents, but also slide presentations. There are four major presentation formats:

  • ioslides - HTML presentation with ioslides
  • reveal.js - HTML presentation with reveal.js
  • Slidy - HTML presentation with W3C Slidy
  • Beamer - PDF presentation with LaTeX Beamer

Each has its own strengths and weaknesses. ioslides and Slidy are probably the easiest to use initially, but are more difficult to customize. reveal.js is more complex, but allows for more customization (this is the format I use for my slides in this class). Beamer is the only presentation format that creates a PDF document and is probably a smoother transition for those already used to Beamer.

Multiple formats

You can even render your document into multiple output formats by supplying a list of formats:
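A sketch with two formats, where the first has a customized option and the second keeps its defaults (the toc option is only an illustration):

```yaml
---
output:
  html_document:
    toc: true
  pdf_document: default
---
```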

If you don’t want to change any of the default options for a format, use the default option. You must assign some value to the second output format, hence the use of default.

Rendering multiple outputs programmatically

When rendering multiple output formats, you cannot just click the “Knit” button. Doing so will only render the first output format listed in the YAML. To render all output formats, you need to programmatically render the document using rmarkdown::render('my-document.Rmd', output_format = 'all'). Type ?render in the console to look up the help file for render() and see the different arguments the function can accept.

Exercise: render in multiple formats

  • Render gun-deaths.Rmd as both an HTML document and a PDF document

If you do not have LaTeX installed on your computer, render gun-deaths.Rmd as both an HTML document and a Word document. And at some point install LaTeX on your computer so you can create PDF documents.

R scripts

So far we’ve done a lot of our work in R Markdown documents, knitting together code chunks, output, and Markdown text. However we don’t have to use R Markdown documents for all our work. In many instances, using a script might be preferable.

What is a script?

A script is a plain-text file with a .R file extension. It contains R code. You can add comments using the # symbol. For example, gun-deaths.R would look something like this:
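The original script is not shown here, so the following is a hypothetical sketch that reuses the object names from the inline-code example earlier (gun_deaths and youth). The CSV file name and the age column are assumptions.

```r
# gun-deaths.R
# Hypothetical sketch of an analysis script; file and column names are assumed.

library(dplyr)
library(ggplot2)

# load the data (assumed to be stored as a CSV file next to the script)
gun_deaths <- read.csv("gun_deaths.csv")

# keep individuals aged 65 or younger
youth <- filter(gun_deaths, age <= 65)

# draw a frequency polygon of age
ggplot(youth, aes(age)) +
  geom_freqpoly(binwidth = 1)
```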

You edit scripts in the editor panel in RStudio.

When to use a script?

Scripts are much easier to troubleshoot than R Markdown documents because your code is not split across chunks and you can run everything interactively. When you first begin a project, you may find it useful to use scripts initially to build and debug code, then convert it to an R Markdown document once you begin the substantive analysis and write-up. Or you may use a mix of scripts and R Markdown documents depending on the size and complexity of your project. For instance, you could use a reproducible pipeline which uses a sequence of R scripts to download, import, and transform your data, then use an R Markdown document to produce a final report.

Check out this example for how one could use a pipeline in this fashion.

In this class while the final product is generally submitted as an R Markdown document, it is fine to do your initial work in an R script. If you find it easier to write and debug code there, then use that approach. Or if you prefer the R Markdown lab notebook workflow, then use that. By this point you have enough competence in R to decide what works for you and what does not. Find what works best for you and do that.

Running scripts interactively

You can run sections of your script by highlighting the appropriate code and typing Cmd/Ctrl + Enter. You can also run code expression-by-expression by placing your cursor at the appropriate expression in the script and typing Cmd/Ctrl + Enter. To run the entire script at once, type Cmd/Ctrl + Shift + S or press “Source” at the top of the script editor panel.

Running scripts programmatically

To run a script saved on your computer, use the source() function in the console, as in source('gun-deaths.R'). You can also include this command in a second script. By doing this you can execute a sequence of related scripts in order, rather than having to run each one manually in the console. See runfile.R from the pipeline-example repo to see this in action. Remember that R scripts (.R) are executed via the source() function, whereas R Markdown files (.Rmd) are executed via the rmarkdown::render() function.
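A driver script of this kind might be sketched as follows. It is not the runfile.R from the pipeline-example repo, and all four file names are hypothetical.

```r
# Hypothetical driver script: runs each stage of a pipeline in order.
# The script and report names below are placeholders.

source("01-download.R")    # download the raw data
source("02-import.R")      # import it into R
source("03-transform.R")   # clean and transform it

# R Markdown files are rendered rather than sourced
rmarkdown::render("report.Rmd")
```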

Want to create a report from an R script? Just call rmarkdown::render('gun-deaths.R') to author an R Markdown document based on the R script. It will never be as fully featured as if you originally wrote it in an R Markdown document, but can sometimes be handy. Read this overview for more details on this procedure.

Running scripts via the shell

You can also run scripts directly from the shell using Rscript:
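Using the script name from earlier in this section, the call would be:

```bash
Rscript gun-deaths.R
```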

To render an R Markdown document from the shell, we use the syntax:
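One way to write this uses the -e flag of Rscript, which evaluates an R expression passed as a string:

```bash
Rscript -e "rmarkdown::render('gun-deaths.Rmd')"
```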

This creates a temporary R script which contains the single command rmarkdown::render('gun-deaths.Rmd') and executes it via Rscript.

Acknowledgments

  • Artwork by @allison_horst

Session Info

Last updated on Jan 21, 2021