Data Sets in R

This page describes how Data Sets are represented when their contents are referred to in R Code.

Variables

If you type a variable's name into R Code, in either an R Output or an R Variable, Displayr will automatically use the data for that variable. For example, in the Cola.Q example project file, if you refer to Q3 in R Code, it will be interpreted as a variable of length 327.

Missing values appear as NAs in R, and NaNs remain as NaNs.
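
The difference between the two kinds of missing value matters when filtering or summarizing. A minimal base R sketch, using a made-up vector x as a stand-in for a variable from a data set:

```r
# Hypothetical stand-in for a numeric variable; the values are made up.
x <- c(1, NA, NaN, 4)

is.na(x)    # TRUE for both NA and NaN
is.nan(x)   # TRUE only for NaN

# na.rm = TRUE drops both NA and NaN before computing the mean
mean(x, na.rm = TRUE)
```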

Categorical variables

When a categorical variable (i.e., Nominal or Nominal - Multi) is used in R, it is automatically converted to a factor or, if its Variable Type is Ordered Categorical, to an ordered factor (both are R classes). If the categories have been merged, this merging is reflected in the way the data appears in R, as follows:

  • If all the categories of the variable are mutually exclusive and exhaustive, they all appear in R.
  • Where there are overlapping categories, the broadest of these will be excluded. For example, if the data contains three unique values, 0, 1, and 2, with labels of A, B, and C, respectively, and the categories shown on the table are A, B, C, NET, the NET category will be removed. Similarly, if the categories are A, B, B + C, C, NET, then both NET and B + C are removed.
  • Any categories that are missing (i.e., hidden) are inserted, such that the categories are mutually exclusive and exhaustive.
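
The result of this conversion can be imitated in base R. The sketch below is only an illustration: the conversion itself happens automatically, and the vectors values and labels are made-up stand-ins for the 0, 1, 2 and A, B, C example above.

```r
# Hypothetical raw values and labels; the real conversion is automatic.
values <- c(0, 1, 2, 1, 0)
labels <- c("A", "B", "C")

# A Nominal variable corresponds to a factor
nominal <- factor(values, levels = c(0, 1, 2), labels = labels)

# An Ordered Categorical variable corresponds to an ordered factor
ordinal <- factor(values, levels = c(0, 1, 2), labels = labels, ordered = TRUE)

levels(nominal)      # only the mutually exclusive categories; no NET
is.ordered(ordinal)  # TRUE for the ordered factor
table(nominal)       # counts per category
ordinal < "C"        # order comparisons only work on ordered factors
```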

Attributes of variables

When a variable from a data set is referred to in R Code, the variable is automatically uploaded to the R Server prior to any R Code being run. A variable will have the following attributes:

  • name. This is the name in the original data file that has been imported into Displayr (where such a name exists and is not problematic).
  • question. This is the name of the Variable Set, where the name is provided in the metadata or can be inferred.
  • label. This is the label of the variable, where such a label exists.

While these attributes can be accessed in R in the usual way (e.g., attr(my.variable, "label")), the best way to access them is often flipFormat::Labels. It attempts to construct a label of the form Question Name: Variable Label where the two differ, and Variable Label where they are the same (e.g., flipFormat::Labels(Q3) will show Q3. Age). It falls back to name and, if even this is not provided, attempts to discern the original name of the argument.
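
Reading the attributes directly needs only base R. In the sketch below, the vector age and the attribute contents are made up for illustration; real variables arrive with these attributes already set.

```r
# Hypothetical stand-in for a variable; the values and attribute
# contents are made up, and are normally set automatically.
age <- c(23, 41, 35)
attr(age, "name") <- "Q3"
attr(age, "question") <- "Q3. Age"
attr(age, "label") <- "Q3. Age"

# Read a single attribute; exact = TRUE avoids partial name matching
attr(age, "label", exact = TRUE)

# attr() returns NULL for an attribute that is not set
is.null(attr(age, "units", exact = TRUE))

# List the names of all attributes present
names(attributes(age))
```

Because attr() returns NULL rather than raising an error, code that relies on label or question should check for NULL first, since these attributes exist only where the metadata provides them.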

Variable Set

You can refer to a Variable Set by its name in R Code. Where names contain spaces, they are surrounded by backticks (i.e., `). For example: `Q3. Age`.

Where a Variable Set contains multiple variables, they are provided as a data.frame, and individual variables can be selected using $. For example, `Q4. Frequency numeric`$Coffee returns a variable from the Variable Set called Q4. Frequency numeric. Here, Coffee refers to the name of one of the categories in the Variable Set, and may not correspond to a variable in the original data file (e.g., because the user has renamed the category, or created a new category by merging categories).
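
The selection syntax can be tried on an ordinary data.frame. The sketch below builds a made-up stand-in for the Variable Set; the column Tea, the column Soft drink, and all values are hypothetical.

```r
# Hypothetical stand-in for a multi-variable Variable Set; values are made up.
# check.names = FALSE keeps column names exactly as given.
`Q4. Frequency numeric` <- data.frame(
  Coffee = c(3, 0, 7),
  Tea = c(1, 2, 0),
  `Soft drink` = c(0, 5, 2),
  check.names = FALSE
)

# Select one variable with $
`Q4. Frequency numeric`$Coffee

# A column name containing spaces also needs backticks after $
`Q4. Frequency numeric`$`Soft drink`

# [[ ]] takes the name as a plain string instead
`Q4. Frequency numeric`[["Soft drink"]]
```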

Multiple Data Sets

If you have multiple Data Sets in the project, and these contain variables or questions with the same names, the data file name is used to disambiguate (e.g., Cola.sav$`Q3. Age`). See Avoiding ambiguous references names for more information.
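
The qualified syntax can be imitated with named lists. This is only an illustration of how the reference reads, not of how Data Sets are stored: Phone.sav is a hypothetical second data file, and all values are made up.

```r
# Hypothetical stand-ins: two data sets that both contain `Q3. Age`.
# Named lists are used here only to imitate the syntax; values are made up.
Cola.sav <- list(`Q3. Age` = c(23, 41, 35))
Phone.sav <- list(`Q3. Age` = c(52, 19))

# Prefixing with the data file name makes each reference unambiguous
length(Cola.sav$`Q3. Age`)
length(Phone.sav$`Q3. Age`)
```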
