bioCancer Package

bioCancer is a platform-independent interface for dynamic interaction with cancer genomics data. The web is implemented in the R language and based on the Shiny package. It runs on any modern Web browser and requires no programming skills, increasing the accessibility to the huge, complex and heterogeneous cancer genomic data. The data are provided from cBioPortal that contains data from 105 cancer genomics studies. The studies are updated monthly, based on the last TCGA production runs. User can access easily to studies, search in clinical data or by genetic profiles. All data are displayed in table which user can filter, combine, download, visualize and get statistics on it. For more global exploring, zoomable circular layout are available to merge and view around twenty matrices in the same plot. The circular layout makes easy and rapid to identify pertinent multi-assays changes in genes through multiple cancers or studies. The web page implements multiple methods, to classify genes by study or by disease, to cluster studies by biological process or other ontology annotation. From gene list user can predicts functional interaction network. Nodes and edges can be colored and formatted by omics cancer data. User is free to choose which dimension will be included in network and can set some thresholds to view only significant biological scenario. The web accepts multiple format of input data that can be included by user to compare/analysis with/without cancer studies. All investigation done by user can be saved in session and can be reloaded later or shared with colleagues. The main R plotting features are available and easy to use. User needs only to chose the type of plot and select variables to be viewed. All generated plot is downloadable with a high resolution. bioCancer has dynamic sidebar dashboard that changes and displays functionalities depending on user request. It reduces excessive clicking or false queries. It can be launched in local machine with any system with R installed or used from a remote server as in (bioCancer Server). All navigating panel are well assisted and documented by examples. bioCancer is free and open to all users and there is no login requirement.

Pipeline Overview

Plot_enrich

How to run bioCancer

library(bioCancer)
bioCancer()

Portal Panel

Display available Cancer Studies in Table

Studies Panel

This panel displays in table all available cancer studies hosted and maintained by Memorial Sloan Kettering Cancer Center (MSKCC). It provides access to data by The Cancer Genome Atlas as well as many carefully curated published data sets.

Every row lists one study by Identity, name and description.

Browse the data

By default only 10 rows of are shown at one time. You can change this setting through the Show ... entries dropdown. Press the Next and Previous buttons at the bottom-right of the screen to navigate through the data.

Sort

Click on a column header in the table to sort the data by the values of that variable. Clicking again will toggle between sorting in ascending and descending order. To sort on multiple columns at once press shift and then click on the 2nd, 3rd, etc. column to sort by.

Filters in Table

The search is possible for numerical or categorical variables. It is possible to match string or to use mathematical operator to filter data. For more detail see help page in Processing > View panel. #### Global Search the Filter box on the left (click the check-box first). #### Column filter Every column has its filetr at the column header.

Download table as csv file

User can download table as csv file. Use the download icon in the top-right of the page.

Show Clinical Data in Table

Clinical panel displays informations related to patients as AGE, GENDER and other variables depending on study and type of cancer. Some variables are shared between studies and others are specific. Each row corresponds to one patient.

Show Profiles Data in Table

Profiles panel displays informations related to gene list. User needs to specify a Study, a Case, and a Genetic Profile to get the right profile.

It is more practice to select that have all data (case_all) and change only the profile.

There are in general but not always, 6 types of genetic profiles: * Copy Number Alteration (CNA). * mRNA expression (mRNA) * Mutations (Mut) * Methylation (Met): There are two probes HM_27 and HM_450 * microRNA expression (miRNA) * Reverse Phase Protein Array (RPPA)

It is possible to find other kind of data related to one of listed types. For example the log or z_score of mRNA expression.

Load Gene List

User can upload gene list examples or upload own gene list.

When user selects examples and clic on Load examples button, the gene list examples is loaded in DropDown Gene List.

When User selects clipboard, it is possible to copy own gene list from text file (gene symbol by line) and clic on Paste Gene List button. The gene List will be named Genes in DropDown Gene List.

Load Profiles to Datasets

It is interesting to get any statistics analysis or transformation with genetic profiles. Any table from Profiles panel can be loaded to Processing panel by checking Load Profiles to Datasets and press the button. The data frame will be named ProfData. # Processing Panel

Manage data and state: Load data into bioCancer, Save data to disk, Remove a dataset from memory, or Save/Load the full state of the app

Datasets

When you start bioCancer a dataset (epiGenomics) with information on how it was formatted is shown in Processing panel.

It is good practice to add a description of the data and variables to each file you use. For the files that are part of bioCancer you will see a brief overview of the variables etc. below the table of the first 10 rows of the data. If you would like to add a description for your own data check the ‘Add/edit data description’ check-box. A window will open below the data table where you can add text in markdown format. The descriptions of the data included with bioCancer should serve as a good starting point.

If you would like to rename a dataset loaded in bioCancer check the Rename data box, enter a new name for the data, and click the Rename button

Load data

The best way to load and save data for use in bioCancer (and R) is to use the R-data format (rda). These are binary files that can be stored compactly and read into R quickly. Choose rda from the Load data of type dropdown and click Choose Files to locate the file(s) you want to load. If the rda file is available online choose rda (url) from the dropdown, paste the url into the text input, and press Load.

You can get data from a spreadsheet (e.g., Excel or Google sheets) into bioCancer in two ways. First, you can save data from the spreadsheet in csv format and then, in bioCancer, choose csv from the Load data of type dropdown. Most likely you will have a header row in the csv file with variable names. If the data are not comma separated you can choose semicolon or tab separated. To load a csv file click ‘Choose files’ and locate the file on your computer. If the csv data is available online choose csv (url) from the dropdown, paste the url into the text input shown, and press Load.

Note: For Windows users with data that contain multibyte characters please make sure your data are in ANSI format so bioCancer can load the characters correctly.

Alternatively, you can select and copy the data in the spreadsheet using CTRL-C (or CMD-C on mac), go to bioCancer, choose clipboard from the dropdown, and click the Paste data button. This is a short-cut that can be convenient for smaller datasets that are cleanly formatted. If you see a message in bioCancer that the data were not transferred cleanly try saving the data in csv format and loading it into bioCancer as described above.

To access all data files bundled with bioCancer choose examples from the Load data of type dropdown and click Load examples. These files are used to illustrate the various analysis tools accessible in bioCancer. For example, the catalog sales data is used as an example in the help file for regression (i.e., Regression > Linear (OLS)).

Save data

As mentioned above, the most convenient way to get data in and out of bioCancer is to use the R-data format (rda). Choose rda from the Save data dropdown and click the Save data button to save selected dataset to file.

It is good practice to add a description of the data and variables to each file you use. For the files that are part of bioCancer you will see a brief overview of the variables etc. below the table of the first 10 rows of the data. If you would like to add a description for your own data check the ‘Add/edit data description’ check-box. A window will open below that data table where you can add text in markdown format. The descriptions of the data included with bioCancer should serve as a good starting point. When you save the data as an rda file the description you created (or edited) will automatically be added to the file.

Getting data from bioCancer into a spreadsheet can be achieved in two ways. First, you can save data in csv format and load the file into the spreadsheet (i.e., choose csv from the Save data dropdown and click the Save data button). Alternatively, you can copy the data from bioCancer into the clipboard by choosing clipboard from the dropdown and clicking the Copy data button, open the spreadsheet, and paste the data from bioCancer using CTRL-V (or CMD-V on mac).

Save and load state

You can save and load the state of the bioCancer app just as you would a data file. The state file (extension rda) will contain (1) the data loaded in bioCancer, (2) settings for the analyses you were working on, (3) and any reports or code from the R-menu. Save the state-file to your hard-disk and when you are ready to continue simply load it by selecting the state radio button and clicking the Choose file button.

The best way to save your analyses is to save the state of the app to a file by clicking on the icon in the navbar and then on Save state. Similar functionality is available in Data > Manage tab.

This is convenient if you want to save your work to be completed at another time, perhaps on another computer, or to review any assignments you completed using bioCancer. You can also share the file with others that would like to replicate your analyses. As an example, download and then load the state_file RadiantState.rda. Go to Data > View, Data > Visualize to see some of the settings loaded from the statefile. There is also a report in R > Report created using the Radiant interface. The html file RadiantState.html contains the output.

A related feature in bioCancer is that state is maintained if you accidentally navigate to another page, close (and reopen) the browser, and/or hit refresh. Use Reset in the menu in the navigation bar to return to a clean/new state.

Loading and saving state also works with Rstudio. If you start bioCancer from Rstudio and use > Stop to stop the app, lists called r_data and r_state will be put into Rstudio’s global workspace. If you start bioCancer again using bioCancer() it will use these lists to restore state. This can be convenient if you want to make changes to a data file in Rstudio and load it back into bioCancer. Also, if you load a state file directly into Rstudio it will be used when you start bioCancer to recreate a previous state.

Remove data from memory

If data are loaded that you no longer need access to in the current session check the Remove data from memory box. Then select the data to remove and click the Remove data button. One datafile will always remain open.

Using commands to load and save data

The loadr command can be used to load data from a file directly into a bioCancer session and add it to the Datasets dropdown. The saver command can be used to exact data from bioCancer and save it to disk. Data can be loaded or saved as rda or rds format depending on the file extension chosen. These commands can be used both inside or without the bioCancer browser interface. See ?loadr and ?saver for details.

Show data in table form

Datasets

Choose one of the datasets from the Datasets dropdown. Files are loaded into bioCancer through the Manage tab.

Select columns

By default all columns in the data are shown. Click on any variable to focus on it alone. To select several variables use the SHIFT and ARROW keys on your keyboard. On a mac the CMD key can also be used to select multiple variables. The same effect is achieved on windows using the CTRL key. To select all variable use CTRL-A (or CMD-A on mac).

Browse the data

By default only 10 rows of are shown at one time. You can change this setting through the Show ... entries dropdown. Press the Next and Previous buttons at the bottom-right of the screen to navigate through the data.

Sort

Click on a column header in the table to sort the data by the values of that variable. Clicking again will toggle between sorting in ascending and descending order. To sort on multiple columns at once press shift and then click on the 2nd, 3rd, etc. column to sort by.

Filter

There are several ways to select a subset of the data to view. The Filter box on the left (click the check-box first) can be used with > and < signs and you can also combine subset commands. For example, x > 3 & y == 2 would show only those rows for which the variable x has values larger than 3 and for which y has values equal to 2. Note that in R, and most other programming languages, = is used to assign a value and == to evaluate if the value of a variable is equal to some other value. In contrast != is used to determine if a variable is unequal to some value. You can also use expressions that have an or condition. For example, to select rows where mutation frequency is smaller than 20 and larger than 10 use FreqMut > 10 & FreqMut < 20. & is the symbol for and. The table below gives an overview of common operators.

You can also use string matching to select rows. For example, type grepl("lu", Diseases) to select rows with lung Cancers. This search is case sensitive by default. For case insensitive search you would use grepl("TCGA", name, ignore.case = TRUE). Type your statement in the Filter box and press return to see the result on screen or an error below the box if the expression is invalid.

It is important to note that these filters are persistent. A filter entered in one of the Data-tabs will also be applied to other tabs and to any analysis conducted through the bioCancer menus. To deactivate a filter uncheck the Filter check-box. To remove a filter simply erase it.

Operator Description Example
< less than price < 5000
<= less than or equal to carat <= 2
> greater than price > 1000
>= greater than or equal to carat >= 2
== exactly equal to cut == 'Fair'
!= not equal to cut != 'Fair'
| x OR y price > 10000 | cut == 'Premium'
& x AND y carat < 2 & cut == 'Fair'
%in% x is one of y cut %in% c('Fair', 'Good')

Visualize data

Filter

Use the Filter box to select (or omit) specific sets of rows from the data. See the helpfile for Data > View for details.

Plot-type

Select the plot type you want. Choose histograms or density for one or more single variable plots. For example, with the epiGenomics data loaded select Histogram and all (X) variables (use CTRL-a or CMD-a). This will create histograms for all variables in your dataset. Scatter plots are used to visualize the relationship between two variables. Select one or more variables to plot on the Y-axis and one or more variables to plot on the X-axis. Line plots are similar to scatter plots but they connect-the-dots and are particularly useful for time-series data. Bar plots are used to show the relationship between a categorical variable (X-axis) and the average value of a numeric variable (Y-axis). Box-plots are also used when you have a numeric Y-variable and a categorical X-variable. They are more informative than bar charts but also require a bit more effort to evaluate.

Box plots

The upper and lower “hinges” of the box correspond to the first and third quartiles (the 25th and 75th percentiles) in the data. The middle hinge is the median value of the data. The upper whisker extends from the upper hinge (i.e., the top of the box) to the highest value in the data that is within 1.5 x IQR of the upper hinge. IQR is the inter-quartile range, or distance, between the first and third quartiles. The lower whisker extends from the lower hinge to the lowest value in the data within 1.5 x IQR of the lower hinge. Data beyond the end of the whiskers could be outliers and are plotted as points (as suggested by Tukey).

In sum: 1. The upper whisker extends from Q3 to min(max(data), Q3 + 1.5 x IQR) 2. The lower whisker extends from Q1 to max(min(data), Q1 - 1.5 x IQR)

You may have to read the two bullets above a few times before it sinks in. The plot below should help to explain the structure of the box plot.

Box-plot Source

Sub-plots and heat-maps

Facet row and Facet column can be used to split the data into different groups and create separate plots for each group.

If you select a scatter or line plot a Color drop-down will be shown. Selecting a Color variable will create a type of heat-map where the colors are linked to the values of the Color variable. Selecting a categorical variable from the Color dropdown for a line plot will split the data into groups and will show a line of a different color for each group.

Line, loess, and jitter

To add a linear or non-linear regression line to a scatter plot check the Line and/or Loess boxes. If your data take on a limited number of values checking Jitter can be useful to get a better feel for where most of the data points are located. Jitter-ing simply adds a small random value to each data point so they do not overlap completely in the plot(s).

Axis scale

The relationship between variables depicted in a scatter plot may be non-linear. There are numerous transformations we might apply to the data so this relationship becomes (approximately) linear (see Data > Transform) and easier to estimate. Perhaps the most common data transformation applied to business data is the (natural) log. To see if a log-linear or log-log transformation may be appropriate for your data check the Log X and/or Log Y boxes.

By default the scale of the y-axis is the same across sub-plots when using Facet row. To allow the y-axis to be specific to each sub-plot click the Scale-y check-box.

Flip axes

To switch the variable on the X- and Y-axis check the Flip box.

Plot height and width

To make plots bigger or smaller adjust the values in the height and width boxes on the bottom left.

Customizing plots in R > Report

To customize a plot first generate the visualize command by clicking the report (book) icon on the bottom left of your screen. The example below illustrates how to customize a command in the R > Report tab. Notice that custom is set to TRUE.

visualize(dataset = "diamonds", yvar = "price", xvar = "carat", type = "scatter", custom = TRUE) +
  ggtitle("A scatterplot") + xlab("price in $")

See the ggplot2 documentation page for available options http://docs.ggplot2.org.

Create pivot tables to explore your data

If you have used pivot-tables in Excel the functionality provided in the Pivot tab should be familiar to you. Similar to the Explore tab, you can generate summary statistics for variables in your data. You can also easily generate frequency tables. Perhaps the most powerful feature in Pivot is that you can describe the data by one or more other variables.

For example, with the epiGenomics data select Genes, Diseases and CNA from the Categorical variables drop-down. You can drag-and-drop the selected variables to change their order. The categories for the first variable will be the column headers. After selecting these three variables a frequency table of data with different Diseases and Genes. Choose Row, Column, or Total from the Normalize drop-down to normalize the frequencies by row, column, or overall total. If a normalize option is selected it can be convenient to check the Percentage box to express the numbers as percentages. Choose Color bar or Heat map from the Conditional formatting drop-down to emphasize the highest frequency counts.

It is also possible to summarize numerical variables. Select FreqMut from the Numerical variables drop-down. This will create the table shown below. Just as in the View tab you can sort the table by clicking on the column headers. You can also use sliders (e.g., click in the input box below I1) to limit the view to values in a specified range. To view only information for CNA with 0 or -1 levels click in the input box below the CNA header.

pivotr table

You can also create a bar chart based on the generated table (see image above). To download the table to csv format or the plot to a png format click the download icon on the right.

Filter

Use the Filter box to select (or omit) specific sets of rows from the data. See the help file for Data > View for details.

Summarize and explore your data

Generate summary statistics for one or more variables in your data. The most powerful feature in Explore is that you can easy describe the data by one or more other variables. Where the Pivot tab works best for frequency tables and to summarize a single numerical variable, the Explore tab allows you to summarize multiple variables at the same time using various statistics.

For example, if we select Genes from the xmRNA dataset we can see the number of observations (n), the mean, the median, etc. etc.

The created summary table can be stored in bioCancer by clicking the Store button. This can be useful if you want to create plots using the summarized data. To download the table to csv format click the download icon on the top-right.

You can select options from Column variable dropdown to switch between different column headers. Select either the functions (e.g., mean, median, etc), the variables (e.g., Genes), or the levels of the (first) Group by variable (e.g., Studies).

explore table

Filter

Use the Filter box to select (or omit) specific sets of rows from the data. See the helpfile for Data > View for details.

Transform command log

All transformations applied in the Data > Transform tab can be logged. If, for example, you apply a log transformation to numeric variables the following code is generated and put in the Transform command log window at the bottom of your screen when you click the Store button.

## transform variable
r_data[["epiGenomics"]] <- mutate_each(r_data[["epiGenomics"]], funs(log), ext = "_log", mRNA, Met450)

This is an important feature if you need to recreate your results at some point in the future or you want to re-run a report with new, but similar, data. Even more important is that there is a record of the steps taken to generate all results.

To add commands contained in the command log window to a report in R > Report click the icon.

Filter

Filter functionality must be turned off when transforming variables. If a filter is active the transform functions will show a warning message. Either remove the filter statement or un-check the Filter check-box. Alternatively, navigate to the Data > View tab and click the Store button to store the filtered data and select the newly create dataset. Then return to the Transform tab to make the desired variable changes.

Type

When you select Type from the Transformation type drop-down another drop-down menu is shown that will allow you to change the type (or class) of one or more variables. For example, you can change a variable of type integer to a variable of type factor. Click the Store button to change variable(s) in the data set. A description of the transformations included in bioCancer is provided below.

  1. As factor: convert a variable to type factor (i.e., a categorical variable)
  2. As number: convert a variable to type numeric
  3. As integer: convert a variable to type integer
  4. As character: convert a variable to type character (i.e., strings)
  5. As date (mdy): convert a variable to a date if the dates are ordered as month-day-year
  6. As date (dmy): convert a variable to a date if the dates are ordered as day-month-year
  7. As date (ymd): convert a variable to a date if the dates are ordered as year-month-day
  8. As date/time (mdy_hms): convert a variable to a date if the dates are ordered as month-day-year-hour-minute-second
  9. As date/time (mdy_hm): convert a variable to a date if the dates are ordered as month-day-year-hour-minute
  10. As date/time (dmy_hms): See mdy_hms
  11. As date/time (dmy_hm): See mdy_hm
  12. As date/time (ymd_hms): See mdy_hms
  13. As date/time (ymd_hm): See mdy_hm

Transform

When you select Transform from the Transformation type drop-down another drop-down menu is shown that will allow you to apply common transformations to one or more variables in the data. For example, to take the (natural) log of a variable select the variable(s) you want to transform and choose Log from the Apply function drop-down. A new variable is created with the extension specified in the ’Variable name extensiontext input (e.g,._log). Make sure to pressreturnafter changing the extension. Click theStore` button to add the variable(s) to the data set. A description of the transformation functions included in bioCancer is provided below.

  1. Log: create a natural log-transformed version of the selected variable (i.e., log(x) or ln(x))
  2. Square: multiply a variable by itself (i.e., x^2 or square(x))
  3. Square-root: take the square-root of a variable (i.e., x^.5)
  4. Absolute: Absolute value of a variable (i.e., abs(x))
  5. Center: create a new variable with a mean of zero (i.e., x - mean(x))
  6. Standardize: create a new variable with a mean of zero and standard deviation of one (i.e., (x - mean(x))/sd(x))
  7. Invert: 1/x
  8. Median split: create a new factor with two levels (Above and Below) that splits the variable values at the median
  9. Deciles: create a new factor with 10 levels (deciles) that splits the variable values at the 10th, 20th, …, 90th percentiles.

Create

Choose Create from the Transformation type drop-down. This is the most flexible command to create new or transformed variables. However, it also requires some basic knowledge of R-syntax. A new variable can be any function of other variables in the (active) dataset. Some examples are given below. In each example the name to the left of the = sign is the name of the new variable. To the right of the = sign you can include other variable names and basic R-functions. After you have typed the command press return to create the new variable and press Store to add it to the dataset.

  1. Create a new variable z that is the difference between variables x and y

    z = x - y

  2. Create a new variable z that is a transformation of variable x but with mean equal to zero (note that this transformation is also available in the Transform drop-down as Center):

    z = x - mean(x)

  3. Create a new logical variable z that takes on the value TRUE when x > y and FALSE otherwise

    z = x > y

  4. Create a new logical z that takes on the value TRUE when x is equal to y and FALSE otherwise

    z = x == y

  5. Create a variable z that is equal to x lagged by 3 periods

    z = log(x,3)

  6. Create a categorical variable with two levels

    z = ifelse(x < y, ‘smaller’, ‘bigger’)

  7. Create a categorical variable with three levels. An alternative approach would be to use the Recode function described below

    z = ifelse(x < 60, ‘< 60’, ifelse(x > 65, ‘> 65’, ‘60-65’))

  8. Convert an outlier to a missing value. For example, if we want to remove the maximum value from a variable called xmRNA that is equal to 400 we could use an ifelse statement and enter the command below in the Create box. Press return and Store to add the new xmRNA_rc variable. Note that if we had entered xmRNA on the left-hand side of the = sign the original variable would have been overwritten

xmRNA_rc = ifelse(xmRNA > 400, NA, sales)

  1. Similarly, if a respondent with ID 3 provided information in the wrong scale on a survey (e.g., income in $1s rather than in $1000s) we could use an ifelse statement and enter the command below in the Create box. As before, press return and Store to add the new sales_rc variable

    income_rc = ifelse(ID == 3, income/1000, income)

  2. If multiple respondents made the same scaling mistake (e.g., those with ID 1, 3, and 15) we again use Create and enter:

    income_rc = ifelse(ID %in% c(1, 3, 15), income/1000, income)

  3. If you have a date in a format not available through the Type menu you can use the parse_date_time function. For a date formated as “2-1-14” you would specify the command below (note that this format will also be parsed correctly by the mdy function in the Type menu)

    date = parse_date_time(x, “%m%d%y”)

  4. Determine the time difference between two dates/times in seconds

    time_diff = as_duration(time2 - time1)

  5. Extract the month from a date variable

    month = month(date)

  6. Other attributes that can be extracted from a date or date-time variable are minute, hour, day, week, quarter, year, wday (for weekday). For wday and month it can be convenient to add label = TRUE to the call. For example, to extract the weekday from a date variable and use a label rather than a number

    weekday = wday(date, label = TRUE)

  7. Calculating the distance between two locations using lat-long information

    trip_distance = as_distance(lat1, long1, lat2, long2)

Note: For examples 6, 7, and 14 above you may need to change the new variable to type factor before using it for further analysis (see Type above)

Recode

To use the recode feature select the variable you want to change and choose Recode from the Transformation type drop-down. Provide one or more recode commands, separated by a ;, and press return to see the newly created variable. Note that you can specify the names for the recoded variable in the Recoded variable name input box (press return to submit changes). Finally, click Store to add the new variable to the data. Some examples are given below.

  1. Values below 20 are set to ‘Low’ and all others to ‘High’

    lo:20 = ‘Low’; else = ‘High’

  2. Values above 20 are set to ‘High’ and all others to ‘Low’

    20:hi = ‘High’; else = ‘Low’

  3. Values 1 through 12 are set to ‘A’, 13:24 to ‘B’, and the remainder to ‘C’

    1:12 = ‘A’; 13:24 = ‘B’; else = ‘C’

  4. Collapse age categories for a cross-tab analysis. In the example below ‘<25’ and ‘25-34’ are recoded to ‘<35’, ‘35-44’ and ‘35-44’ are recoded to ‘35-54’, and ‘55-64’ and ‘>64’ are recoded to ‘>54’

    ‘<25’ = ‘<35’; ‘25-34’ = ‘<35’; ‘35-44’ = ‘35-54’; ‘45-54’ = ‘35-54’; ‘55-64’ = ‘>54’; ‘>64’ = ‘>54’

  5. To exclude a particular value (e.g., an outlier in the data) for subsequent analyses we can recode it to a missing value. For example, if we want to remove the maximum value from a variable called FreqMut that is equal to 102 we would (1) select the variable FreqMut in the Select variable(s) box and enter the command below in the Recode box. Press return and Store to add the recoded variable to the data

    102 = NA

  6. To recode specific numeric values (e.g., carat) to a new value (1) select the variable carat in the Select variable(s) box and enter the command below in the Recode box to set the value for carat to 2 in all rows where carat is currently larger than or equal to 2. Press return and Store to add the recoded variable to the data

    2:hi = 2

Note: Never use a = symbol in a label when using the recode function (e.g., 50:hi = ‘>= 50’) as this will cause an error.

Rename

Choose Rename from the Transformation type drop-down, select one or more variables, and enter new names for them in the rename box shown. Separate each name by a ,. Press return to see the variables with their new names on screen and press Store to alter the variable names in the original data.

Replace

Choose Replace from the Transformation type drop-down if you want to replace existing variables in the data with new ones created using, for example, Create, Transform, Clipboard, etc.. Select one or more variables to overwrite and the same number of replacement variables. Press Store to alter the data.

Clipboard

It is possible to manipulate your data in a spreadsheet (e.g., Excel or Google sheets) and copy-and-paste the data back into bioCancer. If you don’t have the original data in a spreadsheet already use the clipboard feature in Data > Manage so you can paste it into the spreadsheet or click the download icon on the top right of your screen in the Data > View tab. Apply your transformations in the spreadsheet program and then copy the new variable(s), with a header label, to the clipboard (i.e., CTRL-C on windows and CMD-C on mac). Select Clipboard from the Transformation type drop-down and paste the new data into the Paste from spreadsheet box. It is key that new variable(s) have the same number of observations as the data in bioCancer. To add the new variables to the data click Store.

Note: Using the clipboard feature for data transformation is discouraged because it is not reproducible.

Normalize

Choose Normalize from the Transformation type drop-down to standardize one or more variables. For example, in the epiGenomics data we may want to express mRNA of a Genes per-FreqMut. Select FreqMut as the normalizing variable and mRNA in the Select variable(s) box. You will see summary statistics for the new variable (e.g., mRNA_FreqMut) in the main panel. Store changes by clicking the Store button.

Reorder or remove columns

Choose Reorder/Remove columns from the Transformation type drop-down. Drag-and-drop variables to reorder them in the data. To remove a variable click the \(\times\) next to the label. Press Store to commit the changes.

Reorder or remove levels

If a (single) variable of type factor is selected in Select variable(s), choose Reorder/Remove levels from the Transformation type drop-down to reorder and/or remove levels. Drag-and-drop levels to reorder them or click the \(\times\) to remove them. Press Store to commit the changes. To temporarily exclude levels from the data use the Filter box (see the help file linked in the Data > View tab).

Remove missing values

Choose Remove missing from the Transformation type drop-down to eliminate rows with one or more missing values. If all variables are selected a row with a missing values in any column will be removed. If one or more variables are selected only those rows will be removed with missing values for the selected variables. Press Store to change the data. If missing values were present you will see the number of observations in the data summary change (i.e., the value of n changes).

Remove duplicates

It is common to have one or more variables in a dataset that should have only unique values (i.e., no duplicates). Customers id’s, for example, should be unique unless the dataset contains multiple orders for the same customer. In that case the combination of customer id and order id should be unique. To remove duplicate select one or more variables to determine uniqueness. Choose Remove duplicates from the Transformation type drop-down and check how the summary statistics change. Press Store to change the data. If there are duplicate rows you will see the number of observations in the data summary change (i.e., the value of n and n_distinct will change).

Show duplicates

If there are duplicates in the data use Show duplicates to get a better sense for the data points that have the same value in multiple rows. If you want to explore duplicates using the View tab make sure to Store them in a different dataset (i.e., make sure not to overwrite the data you are working on). If you choose to show duplicates based on all columns in the data only one of the duplicate rows will be shown. These rows are exactly the same so showing 2 or 3 isn’t helpful. If, however, we look for duplicates based on a subset of the available variables bioCancer will generate a dataset with all rows that are deemed similar.

Combine two datasets

There are six join (or merge) options available in bioCancer from the dplyr package developed by Hadley Wickham and Romain Francois on GitHub.

The examples below are adapted from Cheatsheet for dplyr join functions by Jenny Bryan and focus on three small datasets, superheroes, publishers, and avengers, to illustrate the different join types and other ways to combine datasets in R and bioCancer. The data is also available in csv format through the links below:

superheroes.csv

publishers.csv

avengers.csv

Superheroes
name alignment gender publisher
Magneto bad male Marvel
Storm good female Marvel
Mystique bad female Marvel
Batman good male DC
Joker bad male DC
Catwoman bad female DC
Hellboy good male Dark Horse Comics
Publishers
publisher yr_founded
DC 1934
Marvel 1939
Image 1992

In the screen-shot of the Data > Combine tab below we see the two datasets. The tables share the variable publisher which is automatically selected for the join. Different join options are available from the Combine type dropdown. You can also specify a name for the combined dataset in the Data name text input box.

join


Inner join (superheroes, publishers)

If x = superheroes and y = publishers: > An inner join returns all rows from x with matching values in y, and all columns from both x and y. If there are multiple matches between x and y, all match combinations are returned.

name alignment gender publisher yr_founded
Magneto bad male Marvel 1939
Storm good female Marvel 1939
Mystique bad female Marvel 1939
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934

In the table above we lose Hellboy because, although this hero does appear in superheroes, the publisher (Dark Horse Comics) does not appear in publishers. The join result has all variables from superheroes, plus yr_founded, from publishers. We can visualize an inner join with the venn-diagram below:

inner_join

The bioCancer commands are:

# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "inner_join")

# R
inner_join(superheroes, publishers, by = "publisher")


Left join (superheroes, publishers)

A left join returns all rows from x, and all columns from x and y. If there are multiple matches between x and y, all match combinations are returned.

name alignment gender publisher yr_founded
Magneto bad male Marvel 1939
Storm good female Marvel 1939
Mystique bad female Marvel 1939
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934
Hellboy good male Dark Horse Comics NA

The join result contains superheroes with variable yr_founded from publishers. Hellboy, whose publisher does not appear in publishers, has an NA for yr_founded. We can visualize a left join with the venn-diagram below:

left_join

The bioCancer commands are:

# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "left_join")

# R
left_join(superheroes, publishers, by = "publisher")


Right join (superheroes, publishers)

A right join returns all rows from y, and all columns from y and x. If there are multiple matches between y and x, all match combinations are returned.

name alignment gender publisher yr_founded
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934
Magneto bad male Marvel 1939
Storm good female Marvel 1939
Mystique bad female Marvel 1939
NA NA NA Image 1992

The join result contains all rows and columns from publishers and all variables from superheroes. We lose Hellboy, whose publisher does not appear in publishers. Image is retained in the table but has NA values for the variables name, alignment, and gender from superheroes. Notice that a join can change both the row and variable order so you should not rely on these in your analysis. We can visualize a right join with the venn-diagram below:

right_join

The bioCancer commands are:

# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "right_join")

# R
right_join(superheroes, publishers, by = "publisher")


Full join (superheroes, publishers)

A full join combines two datasets, keeping rows and columns that appear in either.

name alignment gender publisher yr_founded
Magneto bad male Marvel 1939
Storm good female Marvel 1939
Mystique bad female Marvel 1939
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934
Hellboy good male Dark Horse Comics NA
NA NA NA Image 1992

In this table we keep Hellboy (even though Dark Horse Comics is not in publishers) and Image (even though the publisher is not listed in superheroes) and get variables from both datasets. Observations without a match are assigned the value NA for variables from the other dataset. We can visualize a full join with the venn-diagram below:

full_join

The bioCancer commands are:

# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "full_join")

# R
full_join(superheroes, publishers, by = "publisher")

Semi join (superheroes, publishers)

A semi join keeps only columns from x. Whereas an inner join will return one row of x for each matching row of y, a semi join will never duplicate rows of x.

name alignment gender publisher
Batman good male DC
Joker bad male DC
Catwoman bad female DC
Magneto bad male Marvel
Storm good female Marvel
Mystique bad female Marvel

We get a similar table as with inner_join but it contains only the variables in superheroes. The bioCancer commands are:

# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "semi_join")

# R
semi_join(superheroes, publishers, by = "publisher")


Anti join (superheroes, publishers)

An anti join returns all rows from x without matching values in y, keeping only columns from x

name alignment gender publisher
Hellboy good male Dark Horse Comics

We now get only Hellboy, the only superhero not in publishers and we do not get the variable yr_founded either. We can visualize an anti join with the venn-diagram below:

anti_join


Dataset order

Note that the order of the datasets selected may matter for a join. If we setup the Data > Combine tab as below the results are as follows:

join order


Inner join (publishers, superheroes)

publisher yr_founded name alignment gender
DC 1934 Batman good male
DC 1934 Joker bad male
DC 1934 Catwoman bad female
Marvel 1939 Magneto bad male
Marvel 1939 Storm good female
Marvel 1939 Mystique bad female

Every publisher that has a match in superheroes appears multiple times, once for each match. Apart from variable and row order, this is the same result we had for the inner join shown above.


Left and Right join (publishers, superheroes)

Apart from row and variable order, a left join of publishers and superheroes is equivalent to a right join of superheroes and publishers. Similarly, a right join of publishers and superheroes is equivalent to a left join of superheroes and publishers.


Full join (publishers, superheroes)

As you might expect, apart from row and variable order, a full join of publishers and superheroes is equivalent to a full join of superheroes and publishers.


Semi join (publishers, superheroes)

publisher yr_founded
Marvel 1939
DC 1934

With semi join the effect of switching the dataset order is more clear. Even though there are multiple matches for each publisher only one is shown. Contrast this with an inner join where “If there are multiple matches between x and y, all match combinations are returned.” We see that publisher Image is lost in the table because it is not in superheroes.


Anti join (publishers, superheroes)

publisher yr_founded
Image 1992

Only publisher Image is retained because both Marvel and DC are in superheroes. We keep only variables in publishers.


Additional tools to combine datasets (avengers, superheroes)

When two datasets have the same columns (or rows) there are additional ways in which we can combine them into a new dataset. We have already used the superheroes dataset and will now try to combine it with the avengers data. These two datasets have the same number of rows and columns and the columns have the same names.

In the screen-shot of the Data > Combine tab below we see the two datasets. There is no need to select variables to combine the datasets here. Any variables in Select variables are ignored in the commands below. Again, you can specify a name for the combined dataset in the Data name text input box.

combine


Bind rows

name alignment gender publisher
Thor good male Marvel
Iron Man good male Marvel
Hulk good male Marvel
Hawkeye good male Marvel
Black Widow good female Marvel
Captain America good male Marvel
Magneto bad male Marvel
Magneto bad male Marvel
Storm good female Marvel
Mystique bad female Marvel
Batman good male DC
Joker bad male DC
Catwoman bad female DC
Hellboy good male Dark Horse Comics

If the avengers dataset were meant to extend the list of superheroes we could just stack the two datasets, one below the other. The new datasets has 14 rows and 4 columns. Due to a coding error in the avengers dataset (i.e.., Magneto is not an Avenger) there is a duplicate row in the new combined dataset. Something we probably don’t want.

The bioCancer commands are:

# bioCancer
combinedata("avengers", "superheroes", type = "bind_rows")

# R
bind_rows(avengers, superheroes)


Bind columns

name alignment gender publisher name alignment gender publisher
Thor good male Marvel Magneto bad male Marvel
Iron Man good male Marvel Storm good female Marvel
Hulk good male Marvel Mystique bad female Marvel
Hawkeye good male Marvel Batman good male DC
Black Widow good female Marvel Joker bad male DC
Captain America good male Marvel Catwoman bad female DC
Magneto bad male Marvel Hellboy good male Dark Horse Comics

If the dataset had different columns for the same superheroes we could combine the two datasets, side by side. In bioCancer you will see an error message if you try to bind these columns because they have the same name. Something that we should always avoid. The method can be useful if we know the order of the row ids of two dataset are the same but the columns are all different.


Intersect

A good way to check if two datasets with the same columns have duplicate rows is to choose intersect from the Combine type dropdown. There is indeed one row that is identical in the avengers and superheroes data (i.e., Magneto).

The biCancer commands are the same as shown above, except you will need to replace bind_rows by intersect.


Union

Thor good male Marvel Magneto bad male Marvel
Iron Man good male Marvel Storm good female Marvel
Hulk good male Marvel Mystique bad female Marvel
Hawkeye good male Marvel Batman good male DC
Black Widow good female Marvel Joker bad male DC
Captain America good male Marvel Catwoman bad female DC
Magneto bad male Marvel Hellboy good male Dark Horse Comics

A union of avengers and superheroes will combine the datasets but will omit duplicate rows (i.e., it will keep only one copy of the row for Magneto). Likely what we want here.

The bioCancer commands are the same as shown above, except you will need to replace bind_rows by union.


Setdiff

name alignment gender publisher
Thor good male Marvel
Iron Man good male Marvel
Hulk good male Marvel
Hawkeye good male Marvel
Black Widow good female Marvel
Captain America good male Marvel
Magneto bad male Marvel

Finally, a setdiff will keep rows from avengers that are not in superheroes. If we reverse the inputs (i.e., choose superheroes from the Datasets dropdown and superheroes from the Combine with dropdown) we will end up with all rows from superheroes that are not in avengers. In both cases the entry for Magneto will be omitted.

The bioCancer commands are the same as shown above, except you will need to replace bind_rows by setdiff.



For additional discussion see http://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html

Enrichment Panel

Show multi-Omics Data in Circular Layout

The world Circomics comes from the association between Circos and Omics.

Circos is a package for visualizing data and information with circular layouts. User can visualize multiple matrices of Omics data at the same time and makes easy the exploring of relationships between dimensions using coloring sectors.

This function uses CoffeeWheel package developped by Dr. ARman Aksoy.

Studies in Wheel

User needs to: * Choice in which Studies is interested. * Visualize the availability of dimensions by checking Availability. + The output is a table with Yes/No availability. * Load Omics data for selected Studies by checking Load. The output is a list of loaded dimensions for selected Studies.

When Profiles Data are loaded, the button Load Profiles in Datasets appears. It uploads all Profiles Data to Processing panel for more exploring or analysis.

Legend checkbox displays the meaning of the color palette.

Load Profiles in Datasets

For every dimension, the tables are merged by study and saved as: xCNA, xMetHM27, xMetHM450, xmiRNA, xmRNA, xMut, xRPPA in Datasets (Processing panel).

Genes / Diseases / Pathways Classification and clustering

Classification

The classifier uses geNetClassifier methods [1] to classify genes by disease based only on gene expression (mRNA). The approach is implemented in an R package, named geNetClassifier, available as an open access tool in Bioconductor. All proccess are resumed into 5 steps: * Select Studies * get sample size by processing > Samples * Set the sample size and the posterior probability * Select one Case and one Genetic Profile for every study. Respect the order of studies. it is recommanded to use _v2_mrna for all genetic profiles. * Run classifier by processing > Classifier

The ranking is built by ordering the genes decreasingly by their pos- terior probability for each study (class). Each gene is assigned to a class in which has the best ranking. As a result of this process, even if a gene is found associated to several classes during the expression analysis, each gene can only be on the ranking of one class [1]. The resulting output is a table (Table 1) that associates genes to study and displays PostProb and gene expression sign exprsUpDw. The exprsMeanDiff value is the expression difference between the mean for each gene in the given class and the mean in the closest class.

Table1: Ranking Genes by Study Classifier table

Plot Clusters

Gene Diseases Association

GeneList/Diseases predicts Wich disease are involving your GeneList. It uses annotations from DisGeNET [2] and Methods from clusterProfiler package [3].

The GeneList/Diseases association uses gene list as input. The assess of this prediction is based on two parameters: * The number of genes that are involving in the disease (x-axis) * The P-value of this association (color). In the following example, there are two annotation related to Breast cancer which involve more than 130 genes and has small P-Value.

Figure 1: Genes / Diseases Association Plot_enrich

The Diseases Onthology uses genes/Study groups computed by Classifier (Table 1). The dotplot position indicates wish Diseases are annotated for genes/study [4]. The dot size indicates the ratio of genes involved in the disease for the same genes groups (lihc_tcga has 2/3 genes involved for the 4 disease). The color indicates the P-Value.

Figure 2: Diseases Onthology DO

The same process is possible with Gene Onthology (GO) and KEGG.

Figure 3: GO Pathway Enrishment GO

Figure 4: KEGG Pathway Enrishment KEGG

Function Interaction Network Enrichment

Edges Attributes

Function Interactions (FIs) Type

Arrowhead Reaction Arrowhead Reaction
-> activate, express, regulate -| inhibit
diamond -<> complexe curve catalyze, reaction
point -o phosphorylate binding, input, compound
-< dissociation …. predicted, indirect,ubiquitinated

Use Linkers

Picks up as few as possible of linkers that can connect input genes together. For example, if the algorithm finds one gene can link all input genes together, it will not try other genes (not from gene list) that may be used as a linker.

The linker gene hes box format.

Layouts

dot

The dot engine flows the directed graph in the direction of rank (i.e., downstream nodes of the same rank are aligned). By default, the direction is from top to bottom ##### twopi The twopi engine provides radial layouts. Nodes are placed on concentric circles depending their distance from a given root node.

neato

The neato engine provides spring model layouts. This is a suitable engine if the graph is not too large (less than 100 nodes) and you don’t know anything else about it. The neato engine attempts to minimize a global energy function, which is equivalent to statistical multi-dimensional scaling.

circo

The circo engine provide circular layouts. This is suitable for certain diagrams of multiple cyclic structures, such as certain telecommunications networks.

Nodes Attributes

From ReactomeFI

The size of node is related to the number of inetractions. If node has multiple interaction, it will has bigger size than node with few interaction. Otherwise, i will be easier to locate important gene in the network.

From Classifier

mRNA

Attribute node color using exprsMeanDiff values from Classifier panel.

Studies

Link study to associated genes from Classifier table.

From Profiles Data

User needs to * Select studies (From Which Studies) * Load profiles data (Load). * Select Profiles Data * Set threshold from Sliders

Legend

Reactome Legend

Interpretation

Reactome Network

References

[1] Aibar S, Fontanillo C, Droste C, Roson-Burgo B, Campos-Laborie F, Hernandez-Rivas J and De Las Rivas J (2015). “Analyse multiple disease subtypes and build associated gene networks using genome-wide expression profiles.” BMC Genomics, 16(Suppl 5:S3). http://dx.doi.org/10.1186/1471-2164-16-S5-S3.

[2] Piñero, J., Queralt-Rosinach, N., Bravo, A., Deu-Pons, J., Bauer-Mehren, A., Baron, M., Ferran Sanz, and Furlong, L. I. (2015). DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database: The Journal of Biological Databases and Curation, 2015, bav028. http://doi.org/10.1093/database/bav028

[3] Yu G, Wang L, Han Y and He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), pp. 284-287. http://dx.doi.org/10.1089/omi.2011.0118.

[4] Yu G, Wang L, Yan G and He Q (2015). “DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis.” Bioinformatics, 31(4), pp. 608-609. http://dx.doi.org/10.1093/bioinformatics/btu684, http://bioinformatics.oxfordjournals.org/content/31/4/608.