# Data 1 - data types

There are many data types that we might come across when reviewing clinical data. Because clinical informatics sits at the intersection of health science, statistics, and computer science, we may encounter different nomenclature for the same topics. Where possible, we will flag those differences as *stats* or *comp sci*.

## Boolean/Binary

The first data type you will be exposed to in the computer sciences is the Boolean data type. Boolean (*comp sci*) data is named after the mathematician George Boole who, in 1854, first defined the rules for working with binary (*stats*) digits, known as Boolean algebra. Boolean algebra is foundational in modern information science because it defines how simple electrical circuits can be combined into functional computers. Those circuits contain gates that open or close, producing off and on states in the circuit. Numerically, off = 0 and on = 1, so with Boolean/binary data types you are working with 0s and 1s. Those 0s and 1s can be translated into *False* and *True*, *No* and *Yes*, *Absent* and *Present*, *Negative* and *Positive*, and so on; examples abound throughout clinical data.
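As a minimal sketch of how Boolean values behave in practice, the Python snippet below shows the 0/1 correspondence and the basic Boolean operations. The patient flags and the `LABELS` mapping are hypothetical names chosen for illustration.

```python
# Hypothetical flags for one patient; the field names are illustrative only.
is_smoker = True
has_diabetes = False

# Python booleans are a subtype of int, so True == 1 and False == 0.
as_int = int(is_smoker)
from_int = bool(0)

# Boolean algebra: AND, OR, NOT.
both = is_smoker and has_diabetes
either = is_smoker or has_diabetes
non_smoker = not is_smoker

# Translate a 0/1 code into a clinical label.
LABELS = {0: "Negative", 1: "Positive"}
test_result = LABELS[int(either)]
```

The same 0/1 value can carry any of the label pairs listed above; only the mapping changes.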

## Categorical

Data that is classified into groups or types is categorical. This data type includes nominal data, where the groups have no inherent order. Nominal data can be in a binary format such as Yes or No, True or False, 1 or 0. It may also have more than two categories, as with the ABO blood groups (A, B, AB, and O).
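One way to represent nominal categories in code is an enumeration, sketched here in Python. The `BloodGroup` class and the `raw_values` list are assumptions made for illustration, not part of any particular clinical system.

```python
from collections import Counter
from enum import Enum


class BloodGroup(Enum):
    """ABO blood groups: nominal categories with no inherent order."""
    A = "A"
    B = "B"
    AB = "AB"
    O = "O"


# Hypothetical raw values as they might arrive from a data extract.
raw_values = ["A", "O", "O", "AB", "B", "O"]

# Converting to the Enum validates each value; an unknown group raises ValueError.
groups = [BloodGroup(value) for value in raw_values]

# Counting per category is meaningful; ranking or averaging categories is not.
counts = Counter(groups)
most_common_group, n = counts.most_common(1)[0]
```

Restricting values to a fixed set of categories catches data-entry errors early, which is the main reason to prefer this over free strings.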

## Ordinal

Categorical data can also be ordinal, meaning that the groups into which the data is classified have a natural order. A classic example is cancer staging, where the disease progresses from lower to higher stages (for example, stage I through stage IV).

You will often encounter ordinal data in subject surveys where a person has to rate feelings on a scale (e.g. pain scales).
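The ordering of ordinal categories can be sketched in Python with an `IntEnum`. The `CancerStage` class below is a simplified, hypothetical encoding of overall stage groups; the integers express rank only.

```python
from enum import IntEnum


class CancerStage(IntEnum):
    """Simplified overall stage groups; the integers encode rank only."""
    STAGE_I = 1
    STAGE_II = 2
    STAGE_III = 3
    STAGE_IV = 4


# Hypothetical stages recorded for a small cohort.
cohort = [CancerStage.STAGE_II, CancerStage.STAGE_IV, CancerStage.STAGE_I]

# Order comparisons are meaningful for ordinal data.
is_more_advanced = CancerStage.STAGE_III > CancerStage.STAGE_I
highest = max(cohort)
ordered = sorted(cohort)

# Differences are not: the gap between stages I and II is not
# necessarily the same size as the gap between stages III and IV.
```

Comparing and sorting are safe with this encoding, while arithmetic such as averaging stage numbers should be treated with caution.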

## Numerical

Numerical data, which is data that can be represented with numbers, can be classified as either discrete or continuous. Discrete numerical data is represented by integer values (1, 2, 3, etc.). Cell counts in hematology labs are good examples of discrete numerical data, because a cell is a whole unit:

| Group | WBC per µL |
|---|---|
| Men | 5000 - 10000 |
| Women | 4500 - 11000 |
| Children | 5000 - 10000 |
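As an illustrative sketch, the reference ranges in the table above can be stored as integer pairs and used to check a count. The function name `wbc_in_range` and the sample count are hypothetical.

```python
# Reference ranges copied from the table above (cells per microliter).
WBC_REFERENCE_RANGES = {
    "Men": (5000, 10000),
    "Women": (4500, 11000),
    "Children": (5000, 10000),
}


def wbc_in_range(group: str, wbc_per_ul: int) -> bool:
    """Return True if a whole-number WBC count falls within the group's range."""
    low, high = WBC_REFERENCE_RANGES[group]
    return low <= wbc_per_ul <= high


# Counts are stored as integers because a cell is a whole unit.
count = 4800
women_ok = wbc_in_range("Women", count)
men_ok = wbc_in_range("Men", count)
```

With the hypothetical count of 4800, the check passes for the Women range and fails for the Men range.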

Continuous numerical data includes data whose values can fall between integers. Such a value is known as a real number in mathematics and is represented as a floating-point approximation in computer science. You have to take care with continuous data when performing calculations to ensure that rounding errors do not accumulate and propagate through the statistical analysis. Patient heights and weights are good examples of continuous data that can be measured with increasing levels of precision.
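The floating-point approximation can be seen directly in a short Python sketch. The weight and height values are hypothetical, and the body mass index calculation is used only as a familiar example of arithmetic on continuous measurements.

```python
import math
from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly.
total = 0.1 + 0.2
exactly_equal = (total == 0.3)
close_enough = math.isclose(total, 0.3)

# Hypothetical measurements: weight in kilograms, height in meters.
weight_kg = 70.5
height_m = 1.75

# Keep full precision during the calculation and round only for display.
bmi = weight_kg / height_m ** 2
bmi_display = round(bmi, 1)

# Decimal arithmetic is exact for decimal inputs given as strings.
decimal_total = Decimal("0.1") + Decimal("0.2")
```

Here `exactly_equal` is `False` while `close_enough` is `True`, which is why floating-point values are usually compared with a tolerance rather than with `==`.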

## String

String data makes up a large proportion of the data found in Electronic Medical Records (EMRs). Strings can be ordinary text, as found in providers' notes. More generally, they can be any variable-length or fixed-length sequence of characters, such as genomic sequence data. Most databases include data types such as CHAR or VARCHAR, which for our purposes we will interpret as string values.
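A few common string operations are sketched below in Python. The note text and the DNA fragment are made-up examples, and padding with `ljust` is only a rough imitation of how a fixed-width field such as CHAR(10) behaves.

```python
# Hypothetical free-text note and DNA fragment, for illustration only.
note = "  Patient reports mild chest pain. No shortness of breath.  "
sequence = "ATGCGTAC"

# Free text usually needs cleaning before analysis.
cleaned = note.strip().lower()
mentions_pain = "pain" in cleaned

# A sequence is a string over a small alphabet, so string methods apply.
gc_count = sequence.count("G") + sequence.count("C")
gc_fraction = gc_count / len(sequence)
is_valid_dna = set(sequence) <= set("ACGT")

# A fixed-width field can be imitated by padding with spaces.
fixed_width = sequence.ljust(10)
```

Note that a simple substring test like `mentions_pain` ignores negation ("no pain"), which is one reason free-text notes are harder to analyze than coded fields.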

## BLOB (binary large object)

BLOB data is the final data type we will review here. This is a special data type reserved for storing large binary or encoded data. Image, audio, and video data are all typical examples of binary large objects. Data stored in this format often has to be handled differently from the other data types, both for storage and for analysis.
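In Python, binary data of this kind is typically held in a `bytes` object. The sketch below uses a small hypothetical stand-in for an image file (the first eight bytes are the PNG file signature, followed by padding) to show operations that treat the content as opaque bytes.

```python
import base64
import hashlib

# Hypothetical stand-in for an image file's contents; real data would be
# read from disk in binary mode, e.g. open(path, "rb").read().
blob = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]) + b"\x00" * 16

size_in_bytes = len(blob)

# A checksum identifies the content without interpreting it.
checksum = hashlib.sha256(blob).hexdigest()

# Base64 turns arbitrary bytes into text that can be embedded in JSON or XML.
encoded = base64.b64encode(blob).decode("ascii")
decoded = base64.b64decode(encoded)
```

Unlike strings or numbers, the bytes have no meaning until a format-specific decoder (for images, audio, or video) interprets them.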

There are many other data types to consider, depending on whether you are working in a statistical or a database domain. For now, most of the data that we will work with fits neatly into one of these types.