bioCancer Package
bioCancer is a platform-independent interface for dynamic interaction with cancer genomics data. The web application is implemented in R and based on the Shiny package. It runs in any modern web browser and requires no programming skills, which makes large, complex, and heterogeneous cancer genomics data more accessible. The data are provided by cBioPortal, which contains data from 105 cancer genomics studies. The studies are updated monthly, based on the latest TCGA production runs.
Users can easily access studies and search clinical data or genetic profiles. All data are displayed in tables that users can filter, combine, download, visualize, and summarize with statistics. For broader exploration, a zoomable circular layout can merge and display around twenty matrices in the same plot. The circular layout makes it quick and easy to identify relevant multi-assay changes in genes across multiple cancers or studies. The application implements multiple methods to classify genes by study or by disease and to cluster studies by biological process or other ontology annotation. From a gene list, users can predict a functional interaction network. Nodes and edges can be colored and formatted using omics cancer data. Users are free to choose which dimensions are included in the network and can set thresholds to view only significant biological scenarios.
The application accepts multiple input data formats, so users can include their own data and analyze it with or without the cancer studies. All work done by a user can be saved in a session and reloaded later or shared with colleagues. The main R plotting features are available and easy to use: users only need to choose the type of plot and select the variables to view. All generated plots can be downloaded at high resolution. bioCancer has a dynamic sidebar dashboard that changes and displays functionality depending on the user's request, which reduces excessive clicking and invalid queries.
bioCancer can be launched on a local machine with any operating system that has R installed, or used from a remote server (bioCancer Server). All navigation panels are documented with examples. bioCancer is free and open to all users, and there is no login requirement.
Pipeline Overview
How to run bioCancer
library(bioCancer)
bioCancer()
Portal Panel
Display available Cancer Studies in Table
Studies Panel
This panel displays in table all available cancer studies hosted and maintained by Memorial Sloan Kettering Cancer Center (MSKCC). It provides access to data by The Cancer Genome Atlas as well as many carefully curated published data sets.
Every row lists one study by Identity, name, and description.
Browse the data
By default only 10 rows are shown at one time. You can change this setting through the Show ... entries dropdown. Press the Next and Previous buttons at the bottom-right of the screen to navigate through the data.
Sort
Click on a column header in the table to sort the data by the values of that variable. Clicking again will toggle between sorting in ascending and descending order. To sort on multiple columns at once press shift and then click on the 2nd, 3rd, etc. column to sort by.
Filters in Table
Searching is possible for numerical and categorical variables. You can match a string or use a mathematical operator to filter the data. For more detail see the help page in the Processing > View panel.
Global Search
Use the Filter box on the left (click the check-box first).
Column filter
Every column has its own filter at the column header.
Download table as csv file
Users can download the table as a csv file. Use the download icon in the top-right of the page.
Show Clinical Data in Table
The Clinical panel displays information related to patients, such as AGE, GENDER, and other variables depending on the study and type of cancer. Some variables are shared between studies and others are specific. Each row corresponds to one patient.
Show Profiles Data in Table
The Profiles panel displays information related to the gene list. Users need to specify a Study, a Case, and a Genetic Profile to get the right profile.
It is more practical to select the case that contains all data (case_all) and change only the profile.
There are in general, but not always, 6 types of genetic profiles:
- Copy Number Alteration (CNA)
- mRNA expression (mRNA)
- Mutations (Mut)
- Methylation (Met): there are two probes, HM_27 and HM_450
- microRNA expression (miRNA)
- Reverse Phase Protein Array (RPPA)
It is possible to find other kinds of data related to one of the listed types, for example the log or z_score of mRNA expression.
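To illustrate how such derived profiles relate to raw expression values, the sketch below computes a log transform and z-scores in base R. The vector `mrna` and its values are made up for illustration, and the z-scores served by cBioPortal may be computed against a different reference population than the one assumed here.

```r
# Hypothetical mRNA expression values for one gene across five samples
mrna <- c(120, 250, 90, 400, 310)

# Log2 transform (a pseudo-count of 1 avoids log(0))
mrna_log <- log2(mrna + 1)

# z-score: distance from the sample mean in units of standard deviation
mrna_z <- (mrna - mean(mrna)) / sd(mrna)

round(mrna_z, 2)
```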
Load Gene List
Users can load the example gene lists or upload their own gene list.
When the user selects examples and clicks the Load examples button, the example gene lists are loaded in the Gene List dropdown.
When the user selects clipboard, it is possible to copy a gene list from a text file (one gene symbol per line) and click the Paste Gene List button. The gene list will be named Genes in the Gene List dropdown.
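Before pasting, it can help to check that a text file really holds one clean gene symbol per line. The sketch below does this in base R; the file contents are hypothetical, and the cleaning steps (trimming, upper-casing, de-duplicating) are an assumption about what is useful, not a description of what bioCancer does internally.

```r
# Write a hypothetical gene list to a temporary file, one symbol per line
gene_file <- tempfile(fileext = ".txt")
writeLines(c("TP53", "BRCA1", " egfr ", "", "TP53"), gene_file)

# Read, trim whitespace, upper-case, drop empty lines, remove duplicates
genes <- readLines(gene_file)
genes <- toupper(trimws(genes))
genes <- unique(genes[nzchar(genes)])

genes
```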
Load Profiles to Datasets
It is often useful to run statistical analyses or transformations on genetic profiles. Any table from the Profiles panel can be loaded into the Processing panel by checking Load Profiles to Datasets and pressing the button. The data frame will be named ProfData.
Processing Panel
Manage data and state: Load data into bioCancer, Save data to disk, Remove a dataset from memory, or Save/Load the full state of the app
Datasets
When you start bioCancer a dataset (epiGenomics) with information on how it was formatted is shown in the Processing panel.
It is good practice to add a description of the data and variables to each file you use. For the files that are part of bioCancer you will see a brief overview of the variables etc. below the table of the first 10 rows of the data. If you would like to add a description for your own data check the ‘Add/edit data description’ check-box. A window will open below the data table where you can add text in markdown format. The descriptions of the data included with bioCancer should serve as a good starting point.
If you would like to rename a dataset loaded in bioCancer check the Rename data box, enter a new name for the data, and click the Rename button.
Load data
The best way to load and save data for use in bioCancer (and R) is to use the R-data format (rda). These are binary files that can be stored compactly and read into R quickly. Choose rda
from the Load data of type
dropdown and click Choose Files
to locate the file(s) you want to load. If the rda
file is available online choose rda (url)
from the dropdown, paste the url into the text input, and press Load
.
You can get data from a spreadsheet (e.g., Excel or Google sheets) into bioCancer in two ways. First, you can save data from the spreadsheet in csv format and then, in bioCancer, choose csv
from the Load data of type
dropdown. Most likely you will have a header row in the csv file with variable names. If the data are not comma separated you can choose semicolon or tab separated. To load a csv file click ‘Choose files’ and locate the file on your computer. If the csv
data is available online choose csv (url)
from the dropdown, paste the url into the text input shown, and press Load
.
Note: For Windows users with data that contain multibyte characters please make sure your data are in ANSI format so bioCancer can load the characters correctly.
Alternatively, you can select and copy the data in the spreadsheet using CTRL-C (or CMD-C on mac), go to bioCancer, choose clipboard
from the dropdown, and click the Paste data
button. This is a short-cut that can be convenient for smaller datasets that are cleanly formatted. If you see a message in bioCancer that the data were not transferred cleanly try saving the data in csv format and loading it into bioCancer as described above.
To access all data files bundled with bioCancer choose examples
from the Load data of type
dropdown and click Load examples
. These files are used to illustrate the various analysis tools accessible in bioCancer. For example, the catalog sales data is used as an example in the help file for regression (i.e., Regression > Linear (OLS)).
Save data
As mentioned above, the most convenient way to get data in and out of bioCancer is to use the R-data format (rda). Choose rda from the Save data dropdown and click the Save data button to save the selected dataset to file.
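For readers who also work with rda files directly in R, the round trip looks roughly like the sketch below. It uses only base R `save()` and `load()`; the data frame `prof_data` is a hypothetical stand-in, and the way bioCancer attaches descriptions to saved files is not shown.

```r
# Hypothetical dataset standing in for a table loaded in bioCancer
prof_data <- data.frame(Genes = c("TP53", "EGFR"), mRNA = c(2.1, 5.4))

# Save to an rda file, remove the object, then load it back
rda_file <- tempfile(fileext = ".rda")
save(prof_data, file = rda_file)
rm(prof_data)
loaded_names <- load(rda_file)  # load() returns the names of the restored objects

loaded_names
```

Because an rda file stores objects under their original names, loading it can overwrite an existing object with the same name in the workspace.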
When you save the data as an rda file, any description you added or edited through the ‘Add/edit data description’ check-box is automatically added to the file.
Getting data from bioCancer into a spreadsheet can be achieved in two ways. First, you can save data in csv format and load the file into the spreadsheet (i.e., choose csv
from the Save data
dropdown and click the Save data
button). Alternatively, you can copy the data from bioCancer into the clipboard by choosing clipboard
from the dropdown and clicking the Copy data
button, open the spreadsheet, and paste the data from bioCancer using CTRL-V (or CMD-V on mac).
Save and load state
You can save and load the state of the bioCancer app just as you would a data file. The state file (extension rda) will contain (1) the data loaded in bioCancer, (2) settings for the analyses you were working on, (3) and any reports or code from the R-menu. Save the state-file to your hard-disk and when you are ready to continue simply load it by selecting the state radio button and clicking the Choose file
button.
The best way to save your analyses is to save the state of the app to a file by clicking on the icon in the navbar and then on Save state
. Similar functionality is available in Data > Manage
tab.
This is convenient if you want to save your work to be completed at another time, perhaps on another computer, or to review any assignments you completed using bioCancer. You can also share the file with others that would like to replicate your analyses. As an example, download and then load the state_file RadiantState.rda
. Go to Data > View
, Data > Visualize
to see some of the settings loaded from the statefile. There is also a report in R > Report
created using the Radiant interface. The html file RadiantState.html
contains the output.
A related feature in bioCancer is that state is maintained if you accidentally navigate to another page, close (and reopen) the browser, and/or hit refresh. Use Reset
in the menu in the navigation bar to return to a clean/new state.
Loading and saving state also works with Rstudio. If you start bioCancer from Rstudio and use > Stop
to stop the app, lists called r_data
and r_state
will be put into Rstudio’s global workspace. If you start bioCancer again using bioCancer()
it will use these lists to restore state. This can be convenient if you want to make changes to a data file in Rstudio and load it back into bioCancer. Also, if you load a state file directly into Rstudio it will be used when you start bioCancer to recreate a previous state.
Remove data from memory
If data are loaded that you no longer need access to in the current session check the Remove data from memory
box. Then select the data to remove and click the Remove data
button. One datafile will always remain open.
Using commands to load and save data
The loadr command can be used to load data from a file directly into a bioCancer session and add it to the Datasets dropdown. The saver command can be used to extract data from bioCancer and save it to disk. Data can be loaded or saved in rda or rds format depending on the file extension chosen. These commands can be used both inside and outside the bioCancer browser interface. See ?loadr and ?saver for details.
Show data in table form
Datasets
Choose one of the datasets from the Datasets
dropdown. Files are loaded into bioCancer through the Manage tab.
Select columns
By default all columns in the data are shown. Click on any variable to focus on it alone. To select several variables use the SHIFT and ARROW keys on your keyboard. On a mac the CMD key can also be used to select multiple variables. The same effect is achieved on windows using the CTRL key. To select all variables use CTRL-A (or CMD-A on mac).
Browse the data
By default only 10 rows are shown at one time. You can change this setting through the Show ... entries dropdown. Press the Next and Previous buttons at the bottom-right of the screen to navigate through the data.
Sort
Click on a column header in the table to sort the data by the values of that variable. Clicking again will toggle between sorting in ascending and descending order. To sort on multiple columns at once press shift and then click on the 2nd, 3rd, etc. column to sort by.
Filter
There are several ways to select a subset of the data to view. The Filter box on the left (click the check-box first) can be used with > and < signs and you can also combine subset commands. For example, x > 3 & y == 2 would show only those rows for which the variable x has values larger than 3 and for which y has values equal to 2. Note that in R, and most other programming languages, = is used to assign a value and == to evaluate if the value of a variable is equal to some other value. In contrast != is used to determine if a variable is unequal to some value. You can also combine conditions with and or or. For example, to select rows where mutation frequency is larger than 10 and smaller than 20 use FreqMut > 10 & FreqMut < 20. & is the symbol for and, and | is the symbol for or. The table below gives an overview of common operators.
You can also use string matching to select rows. For example, type grepl("lu", Diseases) to select rows with lung cancers. This search is case sensitive by default. For a case-insensitive search you would use grepl("TCGA", name, ignore.case = TRUE). Type your statement in the Filter box and press return to see the result on screen or an error below the box if the expression is invalid.
It is important to note that these filters are persistent. A filter entered in one of the Data-tabs will also be applied to other tabs and to any analysis conducted through the bioCancer menus. To deactivate a filter uncheck the Filter
check-box. To remove a filter simply erase it.
Operator | Description | Example |
---|---|---|
< | less than | price < 5000 |
<= | less than or equal to | carat <= 2 |
> | greater than | price > 1000 |
>= | greater than or equal to | carat >= 2 |
== | exactly equal to | cut == 'Fair' |
!= | not equal to | cut != 'Fair' |
\| | x OR y | price > 10000 \| cut == 'Premium' |
& | x AND y | carat < 2 & cut == 'Fair' |
%in% | x is one of y | cut %in% c('Fair', 'Good') |
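The filter expressions described in this section are ordinary R logical expressions, so they can be tried out in a plain R session. The sketch below applies a few of them with base R `subset()` to a small made-up data frame; the `studies` table and its values are hypothetical and only borrow the column names used in the text.

```r
# Hypothetical study-level data
studies <- data.frame(
  name     = c("Lung TCGA", "Breast tcga", "Glioma MSKCC"),
  Diseases = c("lung", "breast", "brain"),
  FreqMut  = c(15, 25, 12)
)

# Rows with mutation frequency strictly between 10 and 20
in_range <- subset(studies, FreqMut > 10 & FreqMut < 20)

# Case-sensitive and case-insensitive string matching
lung_rows <- subset(studies, grepl("lu", Diseases))
tcga_rows <- subset(studies, grepl("TCGA", name, ignore.case = TRUE))

# Membership test with %in%
picked <- subset(studies, Diseases %in% c("lung", "brain"))
```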
Column filters and Search
For variables that have a limited number of different values (i.e., a factor) you can select the levels to keep from the column filter below the variable name. For example, to filter on rows with CNA = -1
click in the box below the CNA
column header and select -1
from the dropdown menu shown. You can also type a string into these column filters followed by return. Note that matching is case-insensitive. In fact, typing 1
would produce the same result because the search will match any part of a string. Similarly, you can type a string to select observations for character variables (e.g., street names).
For numeric variables the column filter boxes have some special features that make them almost as powerful as the Filter box. For numerical and integer variables you can use ... to indicate a range. For example, to select mRNA values between 200 and 500 type 200 ... 500 and press return. The range is inclusive of the values typed. Furthermore, to filter on FreqMut, typing 20 ... will show only studies with a mutation frequency larger than or equal to 20. Numeric variables also have a slider that you can use to define the range of values to keep.
If you want to get really fancy you can use the search box on the top right to search across all columns in the data using regular expressions. For example, to find all rows that have an entry in any column ending with the number 72 type 72$
(i.e., the $
sign is used to indicate the end of an entry). For all rows with entries that start with 60 use ^60
(i.e., the ^
is used to indicate the first character in an entry). Regular expressions are incredibly powerful for search but this is a big topic area. To learn more about regular expressions see this tutorial.
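The two anchors can be tried in base R with `grepl()`. The sketch below searches every column of a small hypothetical data frame; the helper `match_any` is an illustrative name, not a bioCancer function, and the search box in the app may treat values slightly differently.

```r
# Hypothetical table with a character and a numeric column
df <- data.frame(id = c("6001", "1172", "8860"), value = c(72, 13, 605))

# TRUE for every row where at least one column matches the pattern
match_any <- function(data, pattern) {
  hits <- sapply(data, function(col) grepl(pattern, as.character(col)))
  apply(hits, 1, any)
}

ends_72   <- df[match_any(df, "72$"), ]  # entries ending in 72
starts_60 <- df[match_any(df, "^60"), ]  # entries starting with 60
```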
It is important to note that column sorting, column filters, and search are not persistent. To store these settings for use in other parts of bioCancer press the Store
button. You can store the data and settings under a different dataset name by changing the value in the text input to the left of the Store
button. This feature can also be used to select a subset of variables to keep. Just select the ones you want to keep and press the Store
button. For more control over the variables you want to keep or remove and to specify their order in the dataset use the Data > Transform
tab.
Visualize data
Filter
Use the Filter
box to select (or omit) specific sets of rows from the data. See the helpfile for Data > View for details.
Plot-type
Select the plot type you want. Choose histograms or density for one or more single variable plots. For example, with the epiGenomics
data loaded select Histogram
and all (X) variables (use CTRL-a or CMD-a). This will create histograms for all variables in your dataset. Scatter plots are used to visualize the relationship between two variables. Select one or more variables to plot on the Y-axis and one or more variables to plot on the X-axis. Line plots are similar to scatter plots but they connect-the-dots and are particularly useful for time-series data. Bar plots are used to show the relationship between a categorical variable (X-axis) and the average value of a numeric variable (Y-axis). Box-plots are also used when you have a numeric Y-variable and a categorical X-variable. They are more informative than bar charts but also require a bit more effort to evaluate.
Box plots
The upper and lower “hinges” of the box correspond to the first and third quartiles (the 25th and 75th percentiles) in the data. The middle hinge is the median value of the data. The upper whisker extends from the upper hinge (i.e., the top of the box) to the highest value in the data that is within 1.5 x IQR of the upper hinge. IQR is the inter-quartile range, or distance, between the first and third quartiles. The lower whisker extends from the lower hinge to the lowest value in the data within 1.5 x IQR of the lower hinge. Data beyond the end of the whiskers could be outliers and are plotted as points (as suggested by Tukey).
In sum:
1. The upper whisker extends from Q3 to min(max(data), Q3 + 1.5 x IQR)
2. The lower whisker extends from Q1 to max(min(data), Q1 - 1.5 x IQR)
You may have to read the two bullets above a few times before it sinks in. The plot below should help to explain the structure of the box plot.
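A small numeric example may also help. The sketch below computes the quartiles, the 1.5 x IQR fences, and the resulting whisker ends for a made-up vector in base R. It assumes the default `quantile()` method; plotting functions can compute the hinges slightly differently, so results may differ for small samples.

```r
# Hypothetical data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Whiskers end at the most extreme observations inside the 1.5 x IQR fences
upper_whisker <- max(x[x <= q3 + 1.5 * iqr])
lower_whisker <- min(x[x >= q1 - 1.5 * iqr])

# Observations beyond the whiskers are drawn as individual points
outliers <- x[x > upper_whisker | x < lower_whisker]
```

Here the upper fence is 7.75 + 1.5 x 4.5 = 14.5, so the upper whisker stops at 9 and the value 30 is plotted as a separate point.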
Sub-plots and heat-maps
Facet row
and Facet column
can be used to split the data into different groups and create separate plots for each group.
If you select a scatter or line plot a Color
drop-down will be shown. Selecting a Color
variable will create a type of heat-map where the colors are linked to the values of the Color
variable. Selecting a categorical variable from the Color
dropdown for a line plot will split the data into groups and will show a line of a different color for each group.
Line, loess, and jitter
To add a linear or non-linear regression line to a scatter plot check the Line and/or Loess boxes. If your data take on a limited number of values checking Jitter can be useful to get a better feel for where most of the data points are located. Jitter-ing simply adds a small random value to each data point so they do not overlap completely in the plot(s).
Axis scale
The relationship between variables depicted in a scatter plot may be non-linear. There are numerous transformations we might apply to the data so this relationship becomes (approximately) linear (see Data > Transform) and easier to estimate. Perhaps the most common data transformation applied to business data is the (natural) log. To see if a log-linear or log-log transformation may be appropriate for your data check the Log X
and/or Log Y
boxes.
By default the scale of the y-axis is the same across sub-plots when using Facet row
. To allow the y-axis to be specific to each sub-plot click the Scale-y
check-box.
Flip axes
To switch the variable on the X- and Y-axis check the Flip
box.
Plot height and width
To make plots bigger or smaller adjust the values in the height and width boxes on the bottom left.
Customizing plots in R > Report
To customize a plot first generate the visualize command by clicking the report (book) icon on the bottom left of your screen. The example below illustrates how to customize a command in the R > Report
tab. Notice that custom
is set to TRUE
.
visualize(dataset = "diamonds", yvar = "price", xvar = "carat", type = "scatter", custom = TRUE) +
ggtitle("A scatterplot") + ylab("price in $")
See the ggplot2 documentation page for available options http://docs.ggplot2.org.
Create pivot tables to explore your data
If you have used pivot-tables in Excel the functionality provided in the Pivot tab should be familiar to you. Similar to the Explore tab, you can generate summary statistics for variables in your data. You can also easily generate frequency tables. Perhaps the most powerful feature in Pivot is that you can describe the data by one or more other variables.
For example, with the epiGenomics data select Genes, Diseases and CNA from the Categorical variables drop-down. You can drag-and-drop the selected variables to change their order. The categories for the first variable will be the column headers. After selecting these three variables a frequency table of the data by Diseases and Genes is shown. Choose Row, Column, or Total from the Normalize drop-down to normalize the frequencies by row, column, or overall total. If a normalize option is selected it can be convenient to check the Percentage box to express the numbers as percentages. Choose Color bar or Heat map from the Conditional formatting drop-down to emphasize the highest frequency counts.
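The three normalization options correspond to familiar operations on a contingency table. The sketch below reproduces them with base R `table()` and `prop.table()`; the data frame `d` is hypothetical and the Pivot tab may build its tables differently internally.

```r
# Hypothetical data with two categorical variables
d <- data.frame(
  Diseases = c("lung", "lung", "breast", "breast", "breast"),
  CNA      = c("-1", "0", "0", "0", "1")
)

freq     <- table(d$Diseases, d$CNA)      # raw counts
by_row   <- prop.table(freq, margin = 1)  # Row: each row sums to 1
by_col   <- prop.table(freq, margin = 2)  # Column: each column sums to 1
by_total <- prop.table(freq)              # Total: all cells sum to 1

round(100 * by_row, 1)                    # expressed as percentages
```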
It is also possible to summarize numerical variables. Select FreqMut
from the Numerical variables drop-down. This will create the table shown below. Just as in the View tab you can sort the table by clicking on the column headers. You can also use sliders (e.g., click in the input box below I1
) to limit the view to values in a specified range. To view only information for CNA
with 0
or -1
levels click in the input box below the CNA
header.
You can also create a bar chart based on the generated table (see image above). To download the table to csv format or the plot to a png format click the download icon on the right.
Filter
Use the Filter
box to select (or omit) specific sets of rows from the data. See the help file for Data > View for details.
Summarize and explore your data
Generate summary statistics for one or more variables in your data. The most powerful feature in Explore is that you can easily describe the data by one or more other variables. Where the Pivot tab works best for frequency tables and to summarize a single numerical variable, the Explore tab allows you to summarize multiple variables at the same time using various statistics.
For example, if we select Genes from the xmRNA dataset we can see the number of observations (n), the mean, the median, etc.
The created summary table can be stored in bioCancer by clicking the Store
button. This can be useful if you want to create plots using the summarized data. To download the table to csv format click the download icon on the top-right.
You can select options from the Column variable dropdown to switch between different column headers. Select either the functions (e.g., mean, median, etc), the variables (e.g., Genes), or the levels of the (first) Group by variable (e.g., Studies).
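A grouped summary of this kind can be sketched in base R as follows. The data frame `d` and its columns are hypothetical, and the Explore tab offers more statistics than the three shown here.

```r
# Hypothetical expression values grouped by study
d <- data.frame(
  Studies = c("A", "A", "B", "B", "B"),
  mRNA    = c(2, 4, 10, 20, 30)
)

# One row per study with n, mean, and median of mRNA
summ <- do.call(rbind, lapply(split(d, d$Studies), function(g) {
  data.frame(Studies = g$Studies[1], n = nrow(g),
             mean = mean(g$mRNA), median = median(g$mRNA))
}))

summ
```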
Filter
Use the Filter
box to select (or omit) specific sets of rows from the data. See the helpfile for Data > View for details.
Transform command log
All transformations applied in the Data > Transform tab can be logged. If, for example, you apply a log
transformation to numeric variables the following code is generated and put in the Transform command log window at the bottom of your screen when you click the Store
button.
## transform variable
r_data[["epiGenomics"]] <- mutate_each(r_data[["epiGenomics"]], funs(log), ext = "_log", mRNA, Met450)
This is an important feature if you need to recreate your results at some point in the future or you want to re-run a report with new, but similar, data. Even more important is that there is a record of the steps taken to generate all results.
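The logged command relies on objects such as `r_data` that only exist in a running session, and `mutate_each()` and `funs()` have since been deprecated in dplyr. As a rough base R equivalent of the same log transformation, the sketch below adds `_log` columns to a small hypothetical stand-in for the epiGenomics data.

```r
# Hypothetical stand-in for the epiGenomics dataset
epi <- data.frame(mRNA = c(1, 10, 100), Met450 = c(0.2, 0.5, 0.9))

# Add a natural-log version of each selected column with a "_log" suffix
vars <- c("mRNA", "Met450")
epi[paste0(vars, "_log")] <- lapply(epi[vars], log)

names(epi)
```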
To add commands contained in the command log window to a report in R > Report click the icon.
Filter
Filter functionality must be turned off when transforming variables. If a filter is active the transform functions will show a warning message. Either remove the filter statement or un-check the Filter check-box. Alternatively, navigate to the Data > View tab and click the Store button to store the filtered data and select the newly created dataset. Then return to the Transform tab to make the desired variable changes.
Type
When you select Type
from the Transformation type
drop-down another drop-down menu is shown that will allow you to change the type (or class) of one or more variables. For example, you can change a variable of type integer to a variable of type factor. Click the Store
button to change variable(s) in the data set. A description of the transformations included in bioCancer is provided below.
- As factor: convert a variable to type factor (i.e., a categorical variable)
- As number: convert a variable to type numeric
- As integer: convert a variable to type integer
- As character: convert a variable to type character (i.e., strings)
- As date (mdy): convert a variable to a date if the dates are ordered as month-day-year
- As date (dmy): convert a variable to a date if the dates are ordered as day-month-year
- As date (ymd): convert a variable to a date if the dates are ordered as year-month-day
- As date/time (mdy_hms): convert a variable to a date if the dates are ordered as month-day-year-hour-minute-second
- As date/time (mdy_hm): convert a variable to a date if the dates are ordered as month-day-year-hour-minute
- As date/time (dmy_hms): See mdy_hms
- As date/time (dmy_hm): See mdy_hm
- As date/time (ymd_hms): See mdy_hms
- As date/time (ymd_hm): See mdy_hm
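For comparison, several of these conversions have direct base R counterparts. The sketch below is only an approximation of what the Type menu does: the data frame `raw` is hypothetical, and the date is parsed with `as.Date()` and an explicit format string rather than with whatever date parser the app uses.

```r
# Hypothetical raw columns, all read in as character
raw <- data.frame(cna  = c("-1", "0", "1"),
                  when = c("2-1-14", "12-25-13", "7-4-15"),
                  stringsAsFactors = FALSE)

raw$cna_factor <- as.factor(raw$cna)                      # As factor
raw$cna_int    <- as.integer(raw$cna)                     # As integer
raw$date_mdy   <- as.Date(raw$when, format = "%m-%d-%y")  # As date (mdy)

str(raw)
```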
Transform
When you select Transform from the Transformation type drop-down another drop-down menu is shown that will allow you to apply common transformations to one or more variables in the data. For example, to take the (natural) log of a variable select the variable(s) you want to transform and choose Log from the Apply function drop-down. A new variable is created with the extension specified in the `Variable name extension` text input (e.g., `_log`). Make sure to press `return` after changing the extension. Click the `Store` button to add the variable(s) to the data set. A description of the transformation functions included in bioCancer is provided below.
- Log: create a natural log-transformed version of the selected variable (i.e., log(x) or ln(x))
- Square: multiply a variable by itself (i.e., x^2 or square(x))
- Square-root: take the square-root of a variable (i.e., x^.5)
- Absolute: Absolute value of a variable (i.e., abs(x))
- Center: create a new variable with a mean of zero (i.e., x - mean(x))
- Standardize: create a new variable with a mean of zero and standard deviation of one (i.e., (x - mean(x))/sd(x))
- Invert: 1/x
- Median split: create a new factor with two levels (Above and Below) that splits the variable values at the median
- Deciles: create a new factor with 10 levels (deciles) that splits the variable values at the 10th, 20th, …, 90th percentiles.
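Several of the functions in this list can be written out in a few lines of base R, which may make the definitions more concrete. The vector `x` is hypothetical, and details such as how ties at the median or at decile boundaries are assigned are assumptions that may differ from the app.

```r
x <- c(2, 4, 6, 8, 100)  # hypothetical numeric variable

x_center <- x - mean(x)             # Center
x_std    <- (x - mean(x)) / sd(x)   # Standardize
x_msplit <- factor(ifelse(x > median(x), "Above", "Below"))  # Median split

# Deciles: cut at the 10th, 20th, ..., 90th percentiles
breaks   <- quantile(x, probs = seq(0, 1, length.out = 11))
x_decile <- cut(x, breaks = unique(breaks), include.lowest = TRUE, labels = FALSE)
```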
Create
Choose Create
from the Transformation type
drop-down. This is the most flexible command to create new or transformed variables. However, it also requires some basic knowledge of R-syntax. A new variable can be any function of other variables in the (active) dataset. Some examples are given below. In each example the name to the left of the =
sign is the name of the new variable. To the right of the =
sign you can include other variable names and basic R-functions. After you have typed the command press return
to create the new variable and press Store
to add it to the dataset.
1. Create a new variable z that is the difference between variables x and y
   `z = x - y`
2. Create a new variable z that is a transformation of variable x but with mean equal to zero (note that this transformation is also available in the `Transform` drop-down as `Center`):
   `z = x - mean(x)`
3. Create a new `logical` variable z that takes on the value TRUE when x > y and FALSE otherwise
   `z = x > y`
4. Create a new `logical` variable z that takes on the value TRUE when x is equal to y and FALSE otherwise
   `z = x == y`
5. Create a variable z that is equal to x lagged by 3 periods
   `z = lag(x, 3)`
6. Create a categorical variable with two levels
   `z = ifelse(x < y, 'smaller', 'bigger')`
7. Create a categorical variable with three levels. An alternative approach would be to use the `Recode` function described below
   `z = ifelse(x < 60, '< 60', ifelse(x > 65, '> 65', '60-65'))`
8. Convert an outlier to a missing value. For example, if we want to remove the maximum value from a variable called `xmRNA` that is equal to 400 we could use an `ifelse` statement and enter the command below in the `Create` box. Press `return` and `Store` to add the new `xmRNA_rc` variable. Note that if we had entered `xmRNA` on the left-hand side of the `=` sign the original variable would have been overwritten
   `xmRNA_rc = ifelse(xmRNA >= 400, NA, xmRNA)`
9. Similarly, if a respondent with ID 3 provided information in the wrong scale on a survey (e.g., income in $1s rather than in $1000s) we could use an `ifelse` statement and enter the command below in the `Create` box. As before, press `return` and `Store` to add the new `income_rc` variable
   `income_rc = ifelse(ID == 3, income/1000, income)`
10. If multiple respondents made the same scaling mistake (e.g., those with ID 1, 3, and 15) we again use `Create` and enter:
   `income_rc = ifelse(ID %in% c(1, 3, 15), income/1000, income)`
11. If you have a date in a format not available through the `Type` menu you can use the `parse_date_time` function. For a date formatted as "2-1-14" you would specify the command below (note that this format will also be parsed correctly by the `mdy` function in the `Type` menu)
   `date = parse_date_time(x, "%m%d%y")`
12. Determine the time difference between two dates/times in seconds
   `time_diff = as_duration(time2 - time1)`
13. Extract the month from a date variable
   `month = month(date)`
14. Other attributes that can be extracted from a date or date-time variable are `minute`, `hour`, `day`, `week`, `quarter`, `year`, `wday` (for weekday). For `wday` and `month` it can be convenient to add `label = TRUE` to the call. For example, to extract the weekday from a date variable and use a label rather than a number
   `weekday = wday(date, label = TRUE)`
15. Calculating the distance between two locations using lat-long information
   `trip_distance = as_distance(lat1, long1, lat2, long2)`
Note: For examples 6, 7, and 14 above you may need to change the new variable to type factor before using it for further analysis (see Type above).
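Several of these Create expressions can be tried in plain R by prefixing the variable names with a data frame. The sketch below uses a small hypothetical data frame `d`; the lag is written out with base R (shifted by one period here rather than three), since a `lag(x, n)` that shifts a plain vector is assumed to come from a package such as dplyr rather than from base R.

```r
# Hypothetical data
d <- data.frame(ID = c(1, 2, 3, 15), x = c(58, 62, 70, 64),
                y = c(60, 60, 60, 60), income = c(50000, 52, 48000, 61000))

d$z_diff    <- d$x - d$y                  # difference between x and y
d$z_gt      <- d$x > d$y                  # logical: TRUE when x > y
d$z_lag     <- c(NA, head(d$x, -1))       # x lagged by one period
d$z_three   <- ifelse(d$x < 60, "< 60", ifelse(d$x > 65, "> 65", "60-65"))
d$income_rc <- ifelse(d$ID %in% c(1, 3, 15), d$income / 1000, d$income)
```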
Recode
To use the recode feature select the variable you want to change and choose Recode
from the Transformation type
drop-down. Provide one or more recode commands, separated by a ;
, and press return to see the newly created variable. Note that you can specify the names for the recoded variable in the Recoded variable name
input box (press return to submit changes). Finally, click Store
to add the new variable to the data. Some examples are given below.
Values below 20 are set to ‘Low’ and all others to ‘High’
lo:20 = 'Low'; else = 'High'
Values above 20 are set to ‘High’ and all others to ‘Low’
20:hi = 'High'; else = 'Low'
Values 1 through 12 are set to ‘A’, 13:24 to ‘B’, and the remainder to ‘C’
1:12 = 'A'; 13:24 = 'B'; else = 'C'
Collapse age categories for a cross-tab analysis. In the example below ‘<25’ and ‘25-34’ are recoded to ‘<35’, ‘35-44’ and ‘45-54’ are recoded to ‘35-54’, and ‘55-64’ and ‘>64’ are recoded to ‘>54’
'<25' = '<35'; '25-34' = '<35'; '35-44' = '35-54'; '45-54' = '35-54'; '55-64' = '>54'; '>64' = '>54'
To exclude a particular value (e.g., an outlier in the data) for subsequent analyses we can recode it to a missing value. For example, if we want to remove the maximum value from a variable called FreqMut that is equal to 102 we would (1) select the variable FreqMut in the Select variable(s) box and (2) enter the command below in the Recode box. Press return and Store to add the recoded variable to the data.
102 = NA
To recode specific numeric values (e.g., carat) to a new value, (1) select the variable carat in the Select variable(s) box and (2) enter the command below in the Recode box to set the value for carat to 2 in all rows where carat is currently larger than or equal to 2. Press return and Store to add the recoded variable to the data.
2:hi = 2
Note: Never use a =
symbol in a label when using the recode function (e.g., 50:hi = ‘>= 50’) as this will cause an error.
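The recode commands above can be mirrored in base R. The sketch below assumes that range endpoints are inclusive (so that lo:20 covers the value 20 itself), which the text does not state explicitly. All vectors are made up for the example.

```r
x <- c(5, 20, 21, 40)

# lo:20 = 'Low'; else = 'High'  (assuming the upper bound is inclusive)
low_high <- ifelse(x <= 20, "Low", "High")

# 1:12 = 'A'; 13:24 = 'B'; else = 'C'
m <- c(1, 12, 13, 24, 30)
abc <- ifelse(m >= 1 & m <= 12, "A",
              ifelse(m >= 13 & m <= 24, "B", "C"))

# Collapse age categories with a named lookup vector
age <- c("<25", "25-34", "35-44", "45-54", "55-64", ">64")
lookup <- c("<25" = "<35", "25-34" = "<35", "35-44" = "35-54",
            "45-54" = "35-54", "55-64" = ">54", ">64" = ">54")
age_rc <- unname(lookup[age])

# 102 = NA : turn a single outlying value into a missing value
FreqMut <- c(3, 102, 17)
FreqMut_rc <- ifelse(FreqMut == 102, NA, FreqMut)

# 2:hi = 2 : cap values at 2
carat <- c(0.5, 2, 3.1)
carat_rc <- pmin(carat, 2)
```

The lookup-vector form is convenient when many labels collapse into a few, since the mapping is listed once and applied by name.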
Rename
Choose Rename
from the Transformation type
drop-down, select one or more variables, and enter new names for them in the rename box shown. Separate each name by a ,
. Press return to see the variables with their new names on screen and press Store
to alter the variable names in the original data.
Replace
Choose Replace
from the Transformation type
drop-down if you want to replace existing variables in the data with new ones created using, for example, Create, Transform, Clipboard, etc. Select one or more variables to overwrite and the same number of replacement variables. Press Store
to alter the data.
Clipboard
It is possible to manipulate your data in a spreadsheet (e.g., Excel or Google sheets) and copy-and-paste the data back into bioCancer. If you don’t have the original data in a spreadsheet already use the clipboard feature in Data > Manage so you can paste it into the spreadsheet or click the download icon on the top right of your screen in the Data > View tab. Apply your transformations in the spreadsheet program and then copy the new variable(s), with a header label, to the clipboard (i.e., CTRL-C on windows and CMD-C on mac). Select Clipboard
from the Transformation type
drop-down and paste the new data into the Paste from spreadsheet
box. It is key that new variable(s) have the same number of observations as the data in bioCancer. To add the new variables to the data click Store
.
Note: Using the clipboard feature for data transformation is discouraged because it is not reproducible.
Normalize
Choose Normalize
from the Transformation type
drop-down to standardize one or more variables. For example, in the epiGenomics data we may want to express the mRNA of a gene per unit of FreqMut. Select FreqMut
as the normalizing variable and mRNA
in the Select variable(s)
box. You will see summary statistics for the new variable (e.g., mRNA_FreqMut
) in the main panel. Store changes by clicking the Store
button.
Reorder or remove columns
Choose Reorder/Remove columns
from the Transformation type
drop-down. Drag-and-drop variables to reorder them in the data. To remove a variable click the \(\times\) next to the label. Press Store
to commit the changes.
Reorder or remove levels
If a (single) variable of type factor
is selected in Select variable(s)
, choose Reorder/Remove levels
from the Transformation type
drop-down to reorder and/or remove levels. Drag-and-drop levels to reorder them or click the \(\times\) to remove them. Press Store
to commit the changes. To temporarily exclude levels from the data use the Filter
box (see the help file linked in the Data > View
tab).
Remove missing values
Choose Remove missing
from the Transformation type
drop-down to eliminate rows with one or more missing values. If all variables are selected, a row with a missing value in any column will be removed. If only some variables are selected, only rows with missing values for the selected variables will be removed. Press Store
to change the data. If missing values were present you will see the number of observations in the data summary change (i.e., the value of n changes).
Remove duplicates
It is common to have one or more variables in a dataset that should have only unique values (i.e., no duplicates). Customer IDs, for example, should be unique unless the dataset contains multiple orders for the same customer. In that case the combination of customer ID and order ID should be unique. To remove duplicates, select one or more variables to determine uniqueness. Choose Remove duplicates
from the Transformation type
drop-down and check how the summary statistics change. Press Store
to change the data. If there are duplicate rows you will see the number of observations in the data summary change (i.e., the value of n and n_distinct will change).
Show duplicates
If there are duplicates in the data use Show duplicates
to get a better sense for the data points that have the same value in multiple rows. If you want to explore duplicates using the View tab make sure to Store
them in a different dataset (i.e., make sure not to overwrite the data you are working on). If you choose to show duplicates based on all columns in the data only one of the duplicate rows will be shown. These rows are exactly the same so showing 2 or 3 isn’t helpful. If, however, we look for duplicates based on a subset of the available variables bioCancer will generate a dataset with all rows that are deemed similar.
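The three clean-up steps described above (remove missing, remove duplicates, show duplicates) can be sketched in base R as follows. The `orders` data frame is a made-up example built around the customer id / order id case from the text; it is not part of bioCancer.

```r
# Hypothetical order data with one missing amount and one repeated order
orders <- data.frame(
  customer_id = c(1, 1, 2, 3, 3),
  order_id    = c(10, 11, 12, 13, 13),
  amount      = c(5, NA, 7, 9, 9)
)

# Remove missing: drop rows with an NA in any column
no_missing <- orders[complete.cases(orders), ]

# Remove duplicates: keep the first row for each customer_id / order_id pair
key <- orders[, c("customer_id", "order_id")]
deduped <- orders[!duplicated(key), ]

# Show duplicates: list every row whose key occurs more than once
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
dups <- orders[is_dup, ]

c(n = nrow(orders), n_no_missing = nrow(no_missing),
  n_deduped = nrow(deduped), n_dups = nrow(dups))
```

Storing `dups` under its own name, rather than overwriting `orders`, follows the advice above to keep the working data untouched while inspecting duplicates.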
Combine two datasets
There are six join (or merge) options available in bioCancer from the dplyr package developed by Hadley Wickham and Romain Francois on GitHub.
The examples below are adapted from Cheatsheet for dplyr join functions by Jenny Bryan and focus on three small datasets, superheroes
, publishers
, and avengers
, to illustrate the different join types and other ways to combine datasets in R and bioCancer. The data is also available in csv format through the links below:
name | alignment | gender | publisher |
---|---|---|---|
Magneto | bad | male | Marvel |
Storm | good | female | Marvel |
Mystique | bad | female | Marvel |
Batman | good | male | DC |
Joker | bad | male | DC |
Catwoman | bad | female | DC |
Hellboy | good | male | Dark Horse Comics |
publisher | yr_founded |
---|---|
DC | 1934 |
Marvel | 1939 |
Image | 1992 |
In the screen-shot of the Data > Combine tab below we see the two datasets. The tables share the variable publisher which is automatically selected for the join. Different join options are available from the Combine type
dropdown. You can also specify a name for the combined dataset in the Data name
text input box.
Inner join (superheroes, publishers)
If x = superheroes and y = publishers, an inner join returns all rows from x with matching values in y, and all columns from both x and y. If there are multiple matches between x and y, all match combinations are returned.
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Magneto | bad | male | Marvel | 1939 |
Storm | good | female | Marvel | 1939 |
Mystique | bad | female | Marvel | 1939 |
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
In the table above we lose Hellboy because, although this hero does appear in superheroes
, the publisher (Dark Horse Comics) does not appear in publishers
. The join result has all variables from superheroes
, plus yr_founded, from publishers
. We can visualize an inner join with the venn-diagram below:
The bioCancer commands are:
# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "inner_join")
# R
inner_join(superheroes, publishers, by = "publisher")
Left join (superheroes, publishers)
A left join returns all rows from x, and all columns from x and y. If there are multiple matches between x and y, all match combinations are returned.
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Magneto | bad | male | Marvel | 1939 |
Storm | good | female | Marvel | 1939 |
Mystique | bad | female | Marvel | 1939 |
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
Hellboy | good | male | Dark Horse Comics | NA |
The join result contains superheroes
with variable yr_founded
from publishers
. Hellboy, whose publisher does not appear in publishers
, has an NA
for yr_founded. We can visualize a left join with the venn-diagram below:
The bioCancer commands are:
# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "left_join")
# R
left_join(superheroes, publishers, by = "publisher")
Right join (superheroes, publishers)
A right join returns all rows from y, and all columns from y and x. If there are multiple matches between y and x, all match combinations are returned.
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
Magneto | bad | male | Marvel | 1939 |
Storm | good | female | Marvel | 1939 |
Mystique | bad | female | Marvel | 1939 |
NA | NA | NA | Image | 1992 |
The join result contains all rows and columns from publishers
and all variables from superheroes
. We lose Hellboy, whose publisher does not appear in publishers
. Image is retained in the table but has NA
values for the variables name, alignment, and gender from superheroes
. Notice that a join can change both the row and variable order so you should not rely on these in your analysis. We can visualize a right join with the venn-diagram below:
The bioCancer commands are:
# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "right_join")
# R
right_join(superheroes, publishers, by = "publisher")
Full join (superheroes, publishers)
A full join combines two datasets, keeping rows and columns that appear in either.
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Magneto | bad | male | Marvel | 1939 |
Storm | good | female | Marvel | 1939 |
Mystique | bad | female | Marvel | 1939 |
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
Hellboy | good | male | Dark Horse Comics | NA |
NA | NA | NA | Image | 1992 |
In this table we keep Hellboy (even though Dark Horse Comics is not in publishers
) and Image (even though the publisher is not listed in superheroes
) and get variables from both datasets. Observations without a match are assigned the value NA for variables from the other dataset. We can visualize a full join with the venn-diagram below:
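The bioCancer commands are:
# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "full_join")
# R
full_join(superheroes, publishers, by = "publisher")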
Semi join (superheroes, publishers)
A semi join keeps only columns from x. Whereas an inner join will return one row of x for each matching row of y, a semi join will never duplicate rows of x.
name | alignment | gender | publisher |
---|---|---|---|
Batman | good | male | DC |
Joker | bad | male | DC |
Catwoman | bad | female | DC |
Magneto | bad | male | Marvel |
Storm | good | female | Marvel |
Mystique | bad | female | Marvel |
We get a similar table as with inner_join
but it contains only the variables in superheroes
. The bioCancer commands are:
# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "semi_join")
# R
semi_join(superheroes, publishers, by = "publisher")
Anti join (superheroes, publishers)
An anti join returns all rows from x without matching values in y, keeping only columns from x.
name | alignment | gender | publisher |
---|---|---|---|
Hellboy | good | male | Dark Horse Comics |
We now get only Hellboy, the only superhero whose publisher does not appear in publishers
and we do not get the variable yr_founded either. We can visualize an anti join with the venn-diagram below:
Dataset order
Note that the order of the datasets selected may matter for a join. If we set up the Data > Combine tab as below the results are as follows:
Inner join (publishers, superheroes)
publisher | yr_founded | name | alignment | gender |
---|---|---|---|---|
DC | 1934 | Batman | good | male |
DC | 1934 | Joker | bad | male |
DC | 1934 | Catwoman | bad | female |
Marvel | 1939 | Magneto | bad | male |
Marvel | 1939 | Storm | good | female |
Marvel | 1939 | Mystique | bad | female |
Every publisher that has a match in superheroes
appears multiple times, once for each match. Apart from variable and row order, this is the same result we had for the inner join shown above.
Left and Right join (publishers, superheroes)
Apart from row and variable order, a left join of publishers
and superheroes
is equivalent to a right join of superheroes
and publishers
. Similarly, a right join of publishers
and superheroes
is equivalent to a left join of superheroes
and publishers
.
Full join (publishers, superheroes)
As you might expect, apart from row and variable order, a full join of publishers
and superheroes
is equivalent to a full join of superheroes
and publishers
.
Semi join (publishers, superheroes)
publisher | yr_founded |
---|---|
Marvel | 1939 |
DC | 1934 |
With semi join the effect of switching the dataset order is clearer. Even though there are multiple matches for each publisher only one is shown. Contrast this with an inner join where “If there are multiple matches between x and y, all match combinations are returned.” We see that publisher Image is lost in the table because it is not in superheroes
.
Anti join (publishers, superheroes)
publisher | yr_founded |
---|---|
Image | 1992 |
Only publisher Image is retained because both Marvel and DC are in superheroes
. We keep only variables in publishers
.
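The join tables in this section can be reproduced in plain R. The sketch below rebuilds the two example datasets by hand from the tables shown earlier and runs each dplyr join in both dataset orders; it assumes the dplyr package is installed. Only row and column counts are printed, since row order may differ between dplyr versions.

```r
library(dplyr)

superheroes <- data.frame(
  name      = c("Magneto", "Storm", "Mystique", "Batman", "Joker",
                "Catwoman", "Hellboy"),
  alignment = c("bad", "good", "bad", "good", "bad", "bad", "good"),
  gender    = c("male", "female", "female", "male", "male", "female", "male"),
  publisher = c("Marvel", "Marvel", "Marvel", "DC", "DC", "DC",
                "Dark Horse Comics"),
  stringsAsFactors = FALSE
)
publishers <- data.frame(
  publisher  = c("DC", "Marvel", "Image"),
  yr_founded = c(1934, 1939, 1992),
  stringsAsFactors = FALSE
)

# x = superheroes, y = publishers
joins <- list(
  inner = inner_join(superheroes, publishers, by = "publisher"),
  left  = left_join(superheroes, publishers, by = "publisher"),
  right = right_join(superheroes, publishers, by = "publisher"),
  full  = full_join(superheroes, publishers, by = "publisher"),
  semi  = semi_join(superheroes, publishers, by = "publisher"),
  anti  = anti_join(superheroes, publishers, by = "publisher")
)

# x = publishers, y = superheroes: only the filtering joins change shape
swapped <- list(
  semi = semi_join(publishers, superheroes, by = "publisher"),
  anti = anti_join(publishers, superheroes, by = "publisher")
)

# Rows and columns of each result
sapply(joins, dim)
sapply(swapped, dim)
```

Comparing the two sets of dimensions shows why dataset order matters most for semi and anti joins: they only ever return columns from the first dataset.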
Additional tools to combine datasets (avengers, superheroes)
When two datasets have the same columns (or rows) there are additional ways in which we can combine them into a new dataset. We have already used the superheroes
dataset and will now try to combine it with the avengers
data. These two datasets have the same number of rows and columns and the columns have the same names.
In the screen-shot of the Data > Combine tab below we see the two datasets. There is no need to select variables to combine the datasets here. Any variables in Select variables
are ignored in the commands below. Again, you can specify a name for the combined dataset in the Data name
text input box.
Bind rows
name | alignment | gender | publisher |
---|---|---|---|
Thor | good | male | Marvel |
Iron Man | good | male | Marvel |
Hulk | good | male | Marvel |
Hawkeye | good | male | Marvel |
Black Widow | good | female | Marvel |
Captain America | good | male | Marvel |
Magneto | bad | male | Marvel |
Magneto | bad | male | Marvel |
Storm | good | female | Marvel |
Mystique | bad | female | Marvel |
Batman | good | male | DC |
Joker | bad | male | DC |
Catwoman | bad | female | DC |
Hellboy | good | male | Dark Horse Comics |
If the avengers
dataset were meant to extend the list of superheroes we could just stack the two datasets, one below the other. The new dataset has 14 rows and 4 columns. Due to a coding error in the avengers
dataset (i.e., Magneto is not an Avenger) there is a duplicate row in the new combined dataset, which is something we probably don’t want.
The bioCancer commands are:
# bioCancer
combinedata("avengers", "superheroes", type = "bind_rows")
# R
bind_rows(avengers, superheroes)
Bind columns
name | alignment | gender | publisher | name | alignment | gender | publisher |
---|---|---|---|---|---|---|---|
Thor | good | male | Marvel | Magneto | bad | male | Marvel |
Iron Man | good | male | Marvel | Storm | good | female | Marvel |
Hulk | good | male | Marvel | Mystique | bad | female | Marvel |
Hawkeye | good | male | Marvel | Batman | good | male | DC |
Black Widow | good | female | Marvel | Joker | bad | male | DC |
Captain America | good | male | Marvel | Catwoman | bad | female | DC |
Magneto | bad | male | Marvel | Hellboy | good | male | Dark Horse Comics |
If the datasets had different columns for the same superheroes we could combine the two datasets, side by side. In bioCancer you will see an error message if you try to bind these columns because they have the same names, something that we should always avoid. The method can be useful if we know the order of the row ids of the two datasets is the same but the columns are all different.
Intersect
A good way to check if two datasets with the same columns have duplicate rows is to choose intersect
from the Combine type
dropdown. There is indeed one row that is identical in the avengers
and superheroes
data (i.e., Magneto).
The bioCancer commands are the same as shown above, except you will need to replace bind_rows
by intersect
.
Union
name | alignment | gender | publisher |
---|---|---|---|
Thor | good | male | Marvel |
Iron Man | good | male | Marvel |
Hulk | good | male | Marvel |
Hawkeye | good | male | Marvel |
Black Widow | good | female | Marvel |
Captain America | good | male | Marvel |
Magneto | bad | male | Marvel |
Storm | good | female | Marvel |
Mystique | bad | female | Marvel |
Batman | good | male | DC |
Joker | bad | male | DC |
Catwoman | bad | female | DC |
Hellboy | good | male | Dark Horse Comics |
A union
of avengers
and superheroes
will combine the datasets but will omit duplicate rows (i.e., it will keep only one copy of the row for Magneto). Likely what we want here.
The bioCancer commands are the same as shown above, except you will need to replace bind_rows
by union
.
Setdiff
name | alignment | gender | publisher |
---|---|---|---|
Thor | good | male | Marvel |
Iron Man | good | male | Marvel |
Hulk | good | male | Marvel |
Hawkeye | good | male | Marvel |
Black Widow | good | female | Marvel |
Captain America | good | male | Marvel |
Finally, a setdiff
will keep rows from avengers
that are not in superheroes
. If we reverse the inputs (i.e., choose superheroes
from the Datasets
dropdown and avengers
from the Combine with
dropdown) we will end up with all rows from superheroes
that are not in avengers
. In both cases the entry for Magneto will be omitted.
The bioCancer commands are the same as shown above, except you will need to replace bind_rows
by setdiff
.
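The bind and set operations described in this section can be checked with a short dplyr script. The sketch below assumes dplyr is installed and rebuilds both datasets by hand from the tables shown above.

```r
library(dplyr)

avengers <- data.frame(
  name      = c("Thor", "Iron Man", "Hulk", "Hawkeye", "Black Widow",
                "Captain America", "Magneto"),
  alignment = c("good", "good", "good", "good", "good", "good", "bad"),
  gender    = c("male", "male", "male", "male", "female", "male", "male"),
  publisher = rep("Marvel", 7),
  stringsAsFactors = FALSE
)
superheroes <- data.frame(
  name      = c("Magneto", "Storm", "Mystique", "Batman", "Joker",
                "Catwoman", "Hellboy"),
  alignment = c("bad", "good", "bad", "good", "bad", "bad", "good"),
  gender    = c("male", "female", "female", "male", "male", "female", "male"),
  publisher = c("Marvel", "Marvel", "Marvel", "DC", "DC", "DC",
                "Dark Horse Comics"),
  stringsAsFactors = FALSE
)

stacked          <- bind_rows(avengers, superheroes)  # keeps the duplicate row
common           <- intersect(avengers, superheroes)  # rows present in both
combined         <- union(avengers, superheroes)      # duplicates dropped
only_avengers    <- setdiff(avengers, superheroes)    # in avengers only
only_superheroes <- setdiff(superheroes, avengers)    # in superheroes only

sapply(list(stacked = stacked, common = common, combined = combined,
            only_avengers = only_avengers,
            only_superheroes = only_superheroes), nrow)
```

The row counts make the difference between the operations explicit: stacking keeps Magneto twice, the union keeps him once, and both set differences drop him.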
For additional discussion see http://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html
Enrichment Panel
Show multi-Omics Data in Circular Layout
The word Circomics
comes from the association between Circos
and Omics
.
Circos is a package for visualizing data and information with circular layouts. Users can visualize multiple matrices of Omics data at the same time, and colored sectors make it easy to explore relationships between dimensions.
This function uses the CoffeeWheel package developed by Dr. Arman Aksoy.
Studies in Wheel
The user needs to:
* Choose the studies of interest.
* Visualize the availability of dimensions by checking Availability. The output is a table with Yes/No availability.
* Load Omics data for the selected studies by checking Load. The output is a list of loaded dimensions for the selected studies.
When Profiles Data are loaded, the button Load Profiles in Datasets
appears. It uploads all Profiles Data to the Processing
panel for further exploration or analysis.
Legend
checkbox displays the meaning of the color palette.
Load Profiles in Datasets
For every dimension, the tables are merged by study and saved as: xCNA
, xMetHM27
, xMetHM450
, xmiRNA
, xmRNA
, xMut
, xRPPA
in Datasets (Processing panel).
Genes / Diseases / Pathways Classification and clustering
Classification
The classifier uses geNetClassifier
methods [1] to classify genes by disease based only on gene expression (mRNA). The approach is implemented in an R package, named geNetClassifier, available as an open access tool in Bioconductor. The whole process is summarized in 5 steps:
* Select Studies.
* Get the sample size by processing > Samples.
* Set the sample size and the posterior probability.
* Select one Case and one Genetic Profile for every study. Respect the order of the studies. It is recommended to use _v2_mrna for all genetic profiles.
* Run the classifier by processing > Classifier.
The ranking is built by ordering the genes decreasingly by their posterior probability for each study (class). Each gene is assigned to the class in which it has the best ranking. As a result of this process, even if a gene is found associated to several classes during the expression analysis, each gene can only be on the ranking of one class [1]. The resulting output is a table (Table 1) that associates genes to study and displays PostProb
and gene expression sign exprsUpDw
. The exprsMeanDiff
value is the expression difference between the mean for each gene in the given class and the mean in the closest class.
Table 1: Ranking Genes by Study
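The ranking rule described above (rank genes within each class by posterior probability, then keep each gene only in the class where it ranks best) can be illustrated with a few lines of base R. This is a simplified sketch with invented probabilities and placeholder gene and study names; it is not the geNetClassifier implementation.

```r
# Hypothetical posterior probabilities: rows are genes, columns are studies
post_prob <- matrix(
  c(0.95, 0.40, 0.10,
    0.20, 0.90, 0.30,
    0.60, 0.50, 0.85),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"),
                  c("study_1", "study_2", "study_3"))
)

# Rank genes within each study: rank 1 = highest posterior probability
ranks <- apply(-post_prob, 2, rank)

# Assign each gene to the study in which it has its best (smallest) rank
best_idx   <- apply(ranks, 1, which.min)
best_class <- colnames(post_prob)[best_idx]
names(best_class) <- rownames(post_prob)

best_class
```

Ties between classes are resolved here by taking the first column, which is an arbitrary choice made for the sketch.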
Plot Clusters
Gene Diseases Association
GeneList/Diseases
predicts which diseases involve your GeneList. It uses annotations from DisGeNET [2] and methods from the clusterProfiler package [3].
The GeneList/Diseases
association uses a gene list as input. The assessment of this prediction is based on two parameters:
* The number of genes involved in the disease (x-axis).
* The P-value of this association (color).
In the following example, there are two annotations related to breast cancer which involve more than 130 genes and have a small P-value.
Figure 1: Genes / Diseases Association
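To give a feel for the P-value shown in the plot: over-representation analyses of this kind are typically based on the hypergeometric distribution. The sketch below computes such a P-value by hand in base R. All four counts are invented for illustration, and the exact background set and multiple-testing correction used by bioCancer are not specified in this text.

```r
# Hypothetical counts, chosen only for illustration
N <- 20000  # background genes
M <- 500    # background genes annotated to the disease
n <- 300    # genes in the user's gene list
k <- 20     # genes in the list that are annotated to the disease

# P(overlap >= k) under random sampling without replacement
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# Same quantity from the probability mass function, as a cross-check
p_check <- sum(dhyper(k:min(n, M), M, N - M, n))

c(expected_overlap = n * M / N, p_value = p_value)
```

A small P-value indicates that the overlap between the gene list and the disease annotation is larger than expected by chance under these assumptions.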
The Diseases Ontology
uses genes/Study groups computed by Classifier
(Table 1). The dotplot position indicates which diseases are annotated for genes/study [4]. The dot size indicates the ratio of genes involved in the disease for the same gene group (lihc_tcga has 2/3 genes involved for the 4 diseases). The color indicates the P-value.
Figure 2: Diseases Ontology
The same process is possible with Gene Ontology (GO) and KEGG.
Figure 3: GO Pathway Enrichment
Figure 4: KEGG Pathway Enrichment
Function Interaction Network Enrichment
Edges Attributes
Function Interactions (FIs) Type
Arrowhead | Reaction | Arrowhead | Reaction |
---|---|---|---|
-> | activate, express, regulate | -| | inhibit |
diamond -<> | complex | curve | catalyze, reaction |
point -o | phosphorylate | – | binding, input, compound |
-< | dissociation | …. | predicted, indirect, ubiquitinated |
Use Linkers
Picks as few linkers as possible to connect the input genes together. For example, if the algorithm finds that one gene can link all input genes together, it will not try other genes (not from the gene list) that could be used as linkers.
The linker gene has a box format.
Layouts
dot
The dot engine flows the directed graph in the direction of rank (i.e., downstream nodes of the same rank are aligned). By default, the direction is from top to bottom.
twopi
The twopi engine provides radial layouts. Nodes are placed on concentric circles depending on their distance from a given root node.
neato
The neato engine provides spring model layouts. This is a suitable engine if the graph is not too large (less than 100 nodes) and you don’t know anything else about it. The neato engine attempts to minimize a global energy function, which is equivalent to statistical multi-dimensional scaling.
circo
The circo engine provides circular layouts. This is suitable for certain diagrams of multiple cyclic structures, such as certain telecommunications networks.
Nodes Attributes
From ReactomeFI
The size of a node is related to its number of interactions. A node with many interactions is drawn larger than a node with few interactions, which makes it easier to locate important genes in the network.
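The idea of sizing nodes by their number of interactions can be sketched in base R by counting how often each gene appears in an edge list. The edge list and the scaling constants below are illustrative assumptions; they are not taken from ReactomeFI or from the bioCancer source.

```r
# Toy edge list: one row per functional interaction (illustrative only)
edges <- data.frame(
  from = c("TP53", "TP53", "TP53", "MDM2"),
  to   = c("MDM2", "ATM", "CHEK2", "ATM"),
  stringsAsFactors = FALSE
)

# Degree = number of interactions each gene takes part in
degree <- table(c(edges$from, edges$to))

# Map degree to a relative node size between 1 and 2 (arbitrary scaling)
node_size <- 1 + as.numeric(degree) / max(degree)
names(node_size) <- names(degree)

sort(node_size, decreasing = TRUE)
```

The most connected gene receives the largest size, so hub genes stand out when the network is drawn.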
From Classifier
mRNA
Attribute node color using exprsMeanDiff
values from Classifier
panel.
Studies
Link study to associated genes from Classifier
table.
From Profiles Data
The user needs to:
* Select studies (From Which Studies).
* Load profiles data (Load).
* Select Profiles Data.
* Set thresholds from Sliders.
Legend
Interpretation
References
[1] Aibar S, Fontanillo C, Droste C, Roson-Burgo B, Campos-Laborie F, Hernandez-Rivas J and De Las Rivas J (2015). “Analyse multiple disease subtypes and build associated gene networks using genome-wide expression profiles.” BMC Genomics, 16(Suppl 5:S3). http://dx.doi.org/10.1186/1471-2164-16-S5-S3.
[2] Piñero, J., Queralt-Rosinach, N., Bravo, A., Deu-Pons, J., Bauer-Mehren, A., Baron, M., Sanz, F., and Furlong, L. I. (2015). DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database: The Journal of Biological Databases and Curation, 2015, bav028. http://doi.org/10.1093/database/bav028
[3] Yu G, Wang L, Han Y and He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), pp. 284-287. http://dx.doi.org/10.1089/omi.2011.0118.
[4] Yu G, Wang L, Yan G and He Q (2015). “DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis.” Bioinformatics, 31(4), pp. 608-609. http://dx.doi.org/10.1093/bioinformatics/btu684, http://bioinformatics.oxfordjournals.org/content/31/4/608.