%\VignetteEngine{R.rsp::tex} %\VignetteIndexEntry{MdsOPT package, published in M. Walesiak and A. Dudek, "Searching for an Optimal MDS Procedure for Metric and Interval-Valued Data using mdsOpt R package", in: Education Excellence and Innovation Management: A 2025 Vision to Sustain Economic Development during Global Challenges, 2020, pp. 307-324.} \documentclass[article, nojss]{jss} \usepackage{underscore} \usepackage[ascii]{inputenc} \usepackage[T1]{fontenc} \usepackage[english]{babel} \usepackage{amsmath} \usepackage{amssymb,amsfonts,textcomp} \usepackage{array} \usepackage{supertabular} \usepackage{hhline} \usepackage{booktabs} \usepackage{multirow} \usepackage{graphicx} \makeatletter \newcommand\arraybslash{\let\\\@arraycr} \makeatother \setlength\tabcolsep{1mm} \renewcommand\arraystretch{1.3} \newcommand\normalsubformula[1]{\text{\mathversion{normal}$#1$}} \newcounter{saveenum} \author{Marek Walesiak \\Wroc{\l}aw University of Economics \And Andrzej Dudek \\Wroc{\l}aw University of Economics} \title{mdsOpt -- Searching for Optimal MDS Procedure \newline for Metric and Interval-Valued Data} \Plainauthor{Marek Walesiak, Andrzej Dudek} %% comma-separated \Plaintitle{mdsOpt -- Searching for Optimal MDS Procedure for Metric and Interval-Valued Datat} %% without formatting \Shorttitle{mdsOpt -- Searching for Optimal MDS Procedure } %% a short title (if necessary) \Abstract{ In multidimensional scaling (MDS) carried out on the basis of a metric data matrix (interval, ratio) or interval-valued data table three approaches can be distinguished: classic-to-classic -- for metric data, symbolic-to-classic and symbolic-to-symbolic -- for interval-valued data. The article presents the \pkg{mdsOpt} package which helps to solve the problem of choosing the optimal multidimensional scaling procedure. It uses two criteria for selecting the optimal multidimensional scaling procedure: Kruskal's $\mathit{Stress}\text{{}-}1$ fit measure ($I\text{{}-}\mathit{Stress}$ in case of symbolic-to-symbolic approach) and Hirschman-Herfindahl $\mathit{HHI}$ index calculated based on Stress per point (Interval stress per box in case of symbolic-to-symbolic approach) values. In first part three possible approaches are characterised with theoretical background of used methods and the relationships between \pkg{mdsOpt} package and existing R packages. Second part explains procedure and criteria for selection of the optimal multidimensional scaling procedure for metric and interval-valued data. The last contains in details the usage of package and examples (R scripts) on real data sets related to tourist attractiveness of Polish voivodships (provinces) and Lower-Silesian counties.} \Address{ Marek Walesiak\\ Wroclaw University of Economics\\ Department of Econometrics and Computer Science\\ Komandorska 118/120\\ 53-345 Wroc\l{}aw, Poland\\ E-mail: \email{marek.walesiak@ue.wroc.pl}\\ Andrzej Dudek\\ Wroclaw University of Economics\\ Department of Econometrics and Computer Science\\ Komandorska 118/120\\ 53-345 Wroc\l{}aw, Poland\\ E-mail: \email{andrzej.dudek@ue.wroc.pl} } \Keywords{ multidimensional scaling, metric and interval-valued data, tourist attractiveness, $HHI$ index, R} \Plainkeywords{ multidimensional scaling, metric and interval-valued data, tourist attractiveness, HHI index, R} \begin{document} \section[Introduction]{Introduction} \subsection[The aim of multidimensional scaling]{The aim of multidimensional scaling} Classical multidimensional scaling (MDS) is a method that represents (dis)similarity data as distances in a low-dimensional space (typically 2 or 3 dimensional) in order to make these data accessible to visual inspection and exploration (\cite{Borg+Groenen:2005}, p. 3). Classical MDS requires that each entry of dissimilarity matrix be a single numerical value. Dissimilarity between object \textit{i} and object \textit{k} can be fuzzy (\cite{Groenen+Winsberg+Rodriguez+Diday:2006}, p. 361). The fuzzy dissimilarity is represented by an interval and $\mathit{n\times n}$ dissimilarity matrix is an interval of values $\left[\delta _{\mathit{ik}}^l,\delta_{\mathit{ik}}^u\right]$, where $\delta_{\mathit{ik}}^l(\delta _{\mathit{ik}}^u)$ denotes {the lower (upper) bound of the dissimilarity of objects }\textit{{i }}{and }\textit{{k }}in \textit{m}{}-dimensional space. Multidimensional scaling of interval dissimilarities represents the lower and upper bounds of the dissimilarities as distances between hypercubes (rectangles in two-dimensional space and cubes in three-dimensional space). The dimensions are not directly observable. They have the nature of latent variables. MDS allows the similarities and differences between the analyzed objects to be explained. Multidimensional scaling is a widely used technique in many areas, including psychology (\cite{Takane:2007}), sociology (\cite{Pinkley:2005}), linguistics (\cite{Embleton:2013}), marketing research (\cite{Cooper:1983}), tourism (\cite{Marcussen:2014}), musicology (\cite{Adams:1995}). \subsection[The approaches in multidimensional scaling via mdsOpt package]{The approaches in multidimensional scaling via \pkg{mdsOpt} package} In MDS carried out on the basis of a metric data matrix (interval, ratio) or interval-valued data table, via \pkg{mdsOpt} package, three approaches can be distinguished: \begin{enumerate} \item { Classic-to-classic -- for metric data:} \end{enumerate} \vspace*{-0.5cm} \textbf{} \begin{equation} \label{eq:1} \mathbf{X}=[x_{\mathit{ij}}]_{n\times m}\rightarrow \mathbf{Z}=[z_{\mathit{ij}}]_{n\times m}\rightarrow \left[\delta _{\mathit{ik}}(\mathbf{Z})\right]_{\mathit{nxn}}\rightarrow f:\left[\delta _{\mathit{ik}}\left(\mathbf{Z}\right)\rightarrow d_{\mathit{ik}}(\mathbf{V})\right]\rightarrow \mathbf{V}=[v_{\mathit{ij}}]_{n\times q},\end{equation} \noindent where: $x_{\mathit{ij}}$ -- the value of the \textit{j}{}-th variable for the \textit{i}{}-th object{,} $z_{\mathit{ij}}$ -- the normalized value of the \textit{j}{}-th variable for the \textit{i}{}-th object{,} $i,k=1,{\dots},n$ -- the number of the object, $j=1,{\dots},m$ -- the number of variable, $\left[\delta_{\mathit{ik}}(\mathbf{Z})\right]_{\mathit{nxn}}$ -- a distance matrix (dissimilarities) between objects in \textit{m}{}-dimensional space (distances between objects are calculated via e.g. city-block, Euclidean, Chebyshev, squared Euclidean), $[d_{\mathit{ik}}(\mathbf{V})]$ -- a distance matrix between objects in \textit{q}{}-dimensional space ($q0), \end{equation} \noindent where: $x_{\mathit{ij}}$ -- the value of \textit{j}{}-th variable for the \textit{i}{}-th object, $z_{\mathit{ij}}$ -- the normalized value of \textit{j}{}-th variable for the \textit{i}{}-th object, $A_j$ -- shift parameter to arbitrary zero for the \textit{j}{}-th variable, $B_j$ -- scale parameter for the \textit{j}{}-th variable. \begin{table}[!ht] \begin{tabular}{m{1cm}m{8.5cm}m{3.5cm}m{1.5cm}} \toprule \multirow{2}{*}{\parbox{1cm}{\centering Type}} & \multirow{2}{*}{\parbox{8.5cm}{\centering Method}} & \multicolumn{2}{m{5cm}}{\centering Parameter}\\ & & \centering $B_j$ & \centering\arraybslash $A_j$\\ \toprule \centering n1 & Standardization & \centering $s_j$ & \centering\arraybslash $\overline x_j$\\ \centering n2 & Positional standardization & \centering $\mathit{mad}_j$ & \centering\arraybslash $\mathit{med}_j$\\ \centering n3 & Unitization & \centering $r_j$ & \centering\arraybslash $\overline x_j$\\ \centering n3a & Positional unitization & \centering $r_j$ & \centering\arraybslash $\mathit{med}_j$\\ \centering n4 & Unitization with zero minimum & \centering $r_j$ & \centering\arraybslash $\underset i{\mathit{min}}\left\{x_{\mathit{ij}}\right\}$\\ \centering n5 & Normalization in range [--1; 1] & \centering $\underset i{\mathit{max}}\left|x_{\mathit{ij}}-\overline x_j\right|$ & \centering\arraybslash $\overline x_j$\\ \centering n5a & Positional normalization in range [--1; 1] & \centering $\underset i{\mathit{max}}\left|x_{\mathit{ij}}-\mathit{med}_j\right|$ & \centering\arraybslash $\mathit{med}_j$\\ \centering n6 & \multirow{8}{*}{\parbox{7.5cm}{Quotient \newline transformations}} & \centering $s_j$ & \centering\arraybslash 0\\ \centering n6a & & \centering $\mathit{mad}_j$ & \centering\arraybslash 0\\ \centering n7 & & \centering $r_j$ & \centering\arraybslash 0\\ \centering n8 & & \centering $\underset i{\mathit{max}}\left\{x_{\mathit{ij}}\right\}$ & \centering\arraybslash 0\\ \centering n9 & & \centering $\overline x_j$ & \centering\arraybslash 0\\ \centering n9a & & \centering $\mathit{med}_j$ & \centering\arraybslash 0\\ \centering n10 & & \centering $\sum _{i=1}^nx_{\mathit{ij}}$ & \centering\arraybslash 0\\ \centering n11 & & \centering $\sqrt{\sum _{i=1}^nx_{\mathit{ij}}^2}$ & \centering\arraybslash 0\\ \centering n12 & Normalization & \centering $\sqrt{\sum _{i=1}^n\left(x_{\mathit{ij}}-\overline x_j\right)^2}$ & \centering\arraybslash $\overline x_j$\\ \centering n12a & Positional normalization & \centering $\sqrt{\sum _{i=1}^n\left(x_{\mathit{ij}}-\mathit{med}_j\right)^2}$ & \centering\arraybslash $\mathit{med}_j$\\ \centering n13 & Normalization with zero being the central point & \centering $r_j/2$ & \centering\arraybslash $m_j$\\\bottomrule \end{tabular} $\overline x_j$ -- mean for the \textit{j}{}-th variable, $s_j$ -- standard deviation for the \textit{j}{}-th variable, $r_j$ -- range for the \textit{j}{}-th variable, $m_j=\left(\underset i{\mathit{max}}\left\{x_{\mathit{ij}}\right\}+\underset i{\mathit{min}}\left\{x_{\mathit{ij}}\right\}\right)/2$ -- mid-range for the \textit{j}{}-th variable, $\mathit{med}_j=\underset i{\mathit{med}}\left(x_{\mathit{ij}}\right)$ -- median for the \textit{j}{}-th variable, $\mathit{mad}_j=\underset i{\mathit{mad}}\left(x_{\mathit{ij}}\right)$ -- median absolute deviation for the \textit{j}{}-th variable. \caption{Normalization methods (based on \cite{Jajuga+Walesiak:2000}; \cite{Walesiak:2018})}\label{tab:2} \end{table} The normalization of variables is carried out when the variables describing the analyzed objects are metric or interval-valued. The purpose of normalization is to achieve the comparability of variables (\cite{Milligan+Cooper:1988}). For classical metric data an observation on the \textit{j}{}-th variable for the \textit{i}{}-th object in a data matrix $\mathbf{X}=[x_{\mathit{ij}}]_{n\times m}$ is expressed as one real number. Column 1 in Table \ref{tab:2} presents the type of normalization method adopted as the function data.Normalization of \pkg{clusterSim} package (\cite{Walesiak+Dudek:2019}). Similar procedure for data normalization is available as the function scale of \pkg{base} package. In this function the researcher defines the parameters $A_j$ and $B_j$. For interval-valued variables each cell $x_{\mathit{ij}}$ in a data table represents the interval $x_{\mathit{ij}}=[x_{\mathit{ij}}^l,x_{\mathit{ij}}^u]$ $(x_{\mathit{ij}}^l{\leq}x_{\mathit{ij}}^u$). Interval-valued data require a special normalization approach. The lower and upper bound of the interval of the \textit{j-}th variable for \textit{n} objects are combined into one vector containing \textit{2n} observations. This approach makes it possible to apply normalization methods used for classical metric data. {Other approaches to normalization of interval-valued data are presented in } (\cite{Mlodak:2014}). {After normalization process observations on each variable from 1 to}{\textit{ n}}{ create the lower bound of intervals while observations from } $n+1${ to } $2n${ create the upper bound. }The data were normalized using the interval\_normalization function from the \pkg{clusterSim} package. Table \ref{tab:3} presents selected distance measures for interval-valued data that have been used in the selection of the optimal multidimensional scaling procedure. {The methods for calculating these distances are available in dist.symbolic function of \pkg{clusterSim} package.} \begin{table}[!ht] \begin{tabular}{m{1.5cm}m{4.5cm}m{8cm}} \toprule Symbol & Name & \arraybslash Distance measure $\delta _{\mathit{ik}}\left(\mathbf{Z}\right)$\\\toprule U\_2\_q1 & {Ichino-Yaguchi\par} $q=1,\gamma =0.5$ & $\sum _{j=1}^m\varphi \left(z_{\mathit{ij}},z_{\mathit{kj}}\right)$\\ U\_2\_q2 & {Euclidean Ichino-Yaguchi\par} $q=2,\gamma =0.5$ & $\sqrt{\sum _{j=1}^m\varphi \left(z_{\mathit{ij}},z_{\mathit{kj}}\right)^2}$\\ H\_q1 & Hausdorff\par $q=1$ & $\sum _{j=1}^m\left[\mathit{max}\left(\left|z_{\mathit{ij}}^l-z_{\mathit{kj}}^l\right|,\left|z_{\mathit{ij}}^u-z_{\mathit{kj}}^u\right|\right)\right]$\\ H\_q2 & {Euclidean Hausdorff\par} $q=2$ & $\left\{\sum _{j=1}^m\left[\mathit{max}\left(\left|z_{\mathit{ij}}^l-z_{\mathit{kj}}^l\right|,\left|z_{\mathit{ij}}^u-z_{\mathit{kj}}^u\right|\right)\right]^2\right\}^{1/2}$\\\bottomrule \end{tabular} $\varphi \left(z_{\mathit{ij}},z_{\mathit{kj}}\right)=\left|z_{\mathit{ij}}\oplus z_{\mathit{kj}}\right|-\left|z_{\mathit{ij}}\otimes z_{\mathit{kj}}\right|+\gamma \left(2{\cdot}\left|z_{\mathit{ij}}\otimes z_{\mathit{kj}}\right|-\left|z_{\mathit{ij}}\right|-\left|z_{\mathit{kj}}\right|\right)$; $\left|\right|$ -- length of interval, $z_{\mathit{ij}}\oplus z_{\mathit{kj}}=z_{\mathit{ij}}{\cup}z_{\mathit{kj}}$, $z_{\mathit{ij}}\otimes z_{\mathit{kj}}=z_{\mathit{ij}}{\cap}z_{\mathit{kj}}.$ \caption{Distance measures for interval-valued data (based on \cite{Billard+Diday:2006}, pp. 244-246; \cite{Esposito+Malerba+Tamma:2000}, pp. 165-185; \cite{Ichino+Yaguchi:1994})}\label{tab:3} \end{table} \subsection{Stages in selecting the optimal procedure for MDS} The initial point of the application of smacofSym function is to determine e.g. the following values of arguments ({all parameters can be changed by the user}): {{}--\ \ initial configuration (``torgerson'' classical scaling starting solution),} {{}--\ \ convergence criterion (eps=1e-06),} {{}--\ \ maximum number of iterations (itmax=1000).} The initial point of the application of IMDS function of \pkg{smds} package is to determine e.g. the following values of arguments ({all parameters can be changed by the user}): {{}--\ \ initial configuration (the hyper-rectangles with centres assigned as the result of classical multidimensional scaling of primary space interval centres and vertices distant from the centres by one),} {{}--\ \ convergence criterion (}eps=1e-5){,} {{}--\ \ maximum number of iterations (maxit=1000).} Selecting the optimal procedure for multidimensional scaling takes place in several stages{:} \begin{enumerate} \item { Set the number of dimensions in MDS to two (ndim=2).} \end{enumerate} \setcounter{saveenum}{\value{enumi}} \begin{enumerate} \setcounter{enumi}{\value{saveenum}} \item { Taking into account in the analysis:} \end{enumerate} \setcounter{saveenum}{\value{enumi}} \begin{enumerate} \setcounter{enumi}{\value{saveenum}} \item[\textbullet] In classic-to-classic approach 10 normalization methods, 5 distance measures and 4 MDS models (mspline model -- polynomial function of second and third degree), there are 200 multidimensional scaling procedures. \end{enumerate} Due to the fact that the groups of A, B, C and D (see Table \ref{tab:4}) normalization methods give identical multidimensional scaling results, further analysis covers the first methods of the identified groups (n1, n2, n3, n9), as well as the other methods (n5, n5a, n8, n9a, n11, n12a). \setcounter{saveenum}{\value{enumi}} \begin{enumerate} \setcounter{enumi}{\value{saveenum}} \item[\textbullet] In symbolic-to-classic approach 18 normalization methods, 4 distance measures for interval-valued data and 4 MDS models, there are 288 multidimensional scaling procedures. \item[\textbullet] In symbolic-to-symbolic approach 18 normalization methods and 2 {optimization methods }there are 36 multidimensional scaling procedures. \end{enumerate} \begin{table}[!ht] \begin{tabular}{m{1.5cm}m{4cm}m{8.5cm}} \toprule \multirow{2}{*}{\parbox{1.5cm}{Groups}} & \multicolumn{2}{m{11cm}}{\centering Normalization methods}\\ & GDM1 distance & \arraybslash {Minkowski distances, squared Euclidean distance}*\\\toprule {A} & {n1, n6, n12} & {n1, n6, n12}\\ {B} & {n2, n6a} & {n2, n6a}\\ {C} & {n3, n3a, n4, n7, n13} & {n3, n3a, n4, n7, n13}\\ {D} & {n9, n10} & {n9, n10}\\\bottomrule \end{tabular} * after dividing distances in each distance matrix by the maximum value. \caption{The groups of normalization methods resulting in identical distance matrices (\cite{Walesiak+Dudek:2017})}\label{tab:4} \end{table} \setcounter{saveenum}{\value{enumi}} \begin{enumerate} \setcounter{enumi}{\value{saveenum}} \item { Multidimensional scaling is performed for each procedure separately. It then orders the procedures by increasing:} \end{enumerate} \setcounter{saveenum}{\value{enumi}} \begin{enumerate} \setcounter{enumi}{\value{saveenum}} \item[\textbullet] $\mathit{Stress}\text{{}-}1$ fit measure in classic-to-classic and in symbolic-to-classic approaches (see e.g. {\cite{Borg+Groenen+Mair:2018}}, p. 32{)}: \end{enumerate} \vspace*{-0.5cm} \begin{equation}\mathit{Stress}\text{{}-}1_p=\sqrt{\sum _{i library(mdsOpt) R> data(data_lower_silesian) \end{CodeInput} \end{CodeChunk} Then set the normalizations methods, distance measures and MDS models used in selection of optimal MDS procedure. \begin{CodeChunk} \begin{CodeInput} R> metnor<-c("n1","n2","n3","n5","n5a","n8","n9","n9a","n11","n12a") R> metscale<-c("ratio","interval","mspline") R> metdist<-c("euclidean","manhattan","seuclidean","maximum","GDM1") \end{CodeInput} \end{CodeChunk} The normalizations methods, distance measures and MDS models are used in next step (please notice that model mspline is used twice with spline.degree parameter equals 2 or 3) for selecting the optimal multidimensional scaling procedure. \begin{CodeChunk} \begin{CodeInput} R> res<-optSmacofSym_mMDS(data_lower_silesian,normalizations=metnor, + distances=metdist,mdsmodels=metscale,spline.degrees=c(2:3),outDec=".", + stressDigits=6,HHIDigits=2) \end{CodeInput} \end{CodeChunk} The results contain 200 rows (10 normalization methods x 5 distance measures x 4 MDS models) each describing one procedure with the six columns: Normalization method, MDS model, Spline degree, Distance measure, STRESS 1, HHI spp. The values are ordered by STRESS 1 value. Before displaying the result we need to change the max.print system option to value greater or equals 1200 (10 normalization methods x 5 distance measures x 4 MDS models x 6 columns). \begin{CodeChunk} \begin{CodeInput} R> options(max.print=1200) R> print(res) \end{CodeInput} \begin{CodeOutput} Normalization MDS model Spline Distance STRESS 1 HHI spp method degree measure [1,] "n9a" "mspline" "3" "euclidean" "0.026339" " 821.90" [2,] "n9a" "mspline" "2" "euclidean" "0.026451" " 856.47" [3,] "n9a" "mspline" "2" "seuclidean" "0.026967" " 791.68" ... [198,] "n8" "ratio" "" "maximum" "0.261772" " 414.10" [199,] "n3" "ratio" "" "maximum" "0.265246" " 414.13" [200,] "n5" "ratio" "" "maximum" "0.266663" " 404.71" \end{CodeOutput} \end{CodeChunk} Then we convert $\mathit{Stress}\text{{}-}1$ and $\mathit{HHI}$ values to numeric vectors. \begin{CodeChunk} \begin{CodeInput} R> stress<-as.numeric(res[,"STRESS 1"]) R> hhi<-as.numeric(res[,"HHI spp"]) \end{CodeInput} \end{CodeChunk} The maximal acceptable $\textit{{cs}}$ value is calculated as a mid-range of $\mathit{Stress}\text{{}-}1$ values. \begin{CodeChunk} \begin{CodeInput} R> cs<-(min(stress)+max(stress))/2 R> print(cs) [1] 0.146501 \end{CodeInput} \end{CodeChunk} Then the best MDS procedure from all combinations is chosen. \begin{CodeChunk} \begin{CodeInput} # Elements of optimal MDS procedure R> t<-findOptimalSmacofSym(res,cs) R> print(t) \end{CodeInput} \begin{CodeOutput} \$`Nr` [1] 117 \$Normalization_method [1] "n12a" \$MDS_model [1] "interval" \$Spline_degree [1] "" \$Distance_measure [1] "euclidean" \$STRESS_1 [1] 0.132176 \$HHI_spp [1] 420.74 \end{CodeOutput} \end{CodeChunk} In next step we can plot dependency between $\mathit{Stress}\text{{}-}1$ and $\mathit{HHI}$ index (see Figure \ref{figure:1}) with best solution marked by red circle and finally we choose the MDS solution that satisfies condition $\mathit{Stress}\text{{}-}1{\leq}\mathit{cs}$ and minimizes $\mathit{HHI}$. \begin{CodeChunk} \begin{CodeInput} # Plot dependency between Stress-1 and HHI index R> plot(stress[-t$Nr],hhi[-t$Nr],xlab="Stress-1",ylab="HHI", + type="n",font.lab=3) R> text(stress[-t$Nr],hhi[-t$Nr],labels=(1:nrow(res))[-t$Nr]) R> abline(v=cs,col="red") R> points(stress[t$Nr],hhi[t$Nr],cex=5,col="red") R> text(stress[t$Nr],hhi[t$Nr],labels=(1:nrow(res))[t$Nr],col="red") \end{CodeInput} \end{CodeChunk} \begin{figure}[htb] \centering \includegraphics[width=9.791cm,height=8.662cm]{Fig1.pdf} \caption{T{he values of } $\mathit{Stress}\text{{}-}1$ {fit measure }and $\mathit{HHI}$ index\newline for \textit{p} multidimensional scaling procedures (with best solution marked by red circle)} \label{figure:1} \end{figure} The results of optimal multidimensional scaling procedure (117), via below script, for 31 objects (29 Lower Silesian counties, Pattern and Anti-pattern object) according to the level of tourist attractiveness are presented on Figure \ref{figure:2}. \begin{CodeChunk} \begin{CodeInput} R> library(mdsOpt) R> data(data_lower_silesian) R> z<-data.Normalization(data_lower_silesian,type="n12a") R> d<-dist(z,method="euclidean") R> res<-smacofSym(delta=d,ndim=2,type="interval") R> par(mfrow=c(2,2),pty="s") # Shepard Diagram R> plot(res,plot.type="Shepard",cex.main=0.8,cex.lab=0.8,cex.axis=0.8,cex=0.2) # Stress plot R> spp<-sort(res$spp,decreasing=TRUE) R> names(spp)<-order(res$spp,decreasing=TRUE) R> plot(spp,main="Stress plot",ylab="Stress contribution in percents", + xlab="Objects",ylim=c(-2,30),cex=0.4,cex.main=0.8,cex.lab=0.8,cex.axis=0.8) R> text(spp,pos=3,names(spp),cex=0.4) # Configuration plot with bubble R> bubsize=res$spp/length(spp)*4 R> plot(res$conf,main="Configuration plot with bubble",xlab="Dimension 1", + ylab="Dimension 2",cex=bubsize,cex.main=0.8,cex.lab=0.8,cex.axis=0.8,asp=1) R> text(res$conf[,1],res$conf[,2],pos=3,1:nrow(res$conf),cex=0.7) R> arrows(res$conf[nrow(z),1],res$conf[nrow(z),2],res$conf[nrow(z)-1,1], + res$conf[nrow(z)-1,2],length=0.05,col="black") R> plot.new() R> legend("center",paste(1:nrow(res$conf),rownames(res$conf)), + bty="n",cex=0.7,ncol=2,title="Legend") \end{CodeInput} \end{CodeChunk} \begin{figure}[htb] \centering \includegraphics[width=12.7cm,height=12.7cm]{Fig2.pdf} \caption{The results of multidimensional scaling (procedure 117) of 31 objects\newline (29 Lower Silesian counties, Pattern and Anti-pattern) according to the level of tourist attractiveness} \label{figure:2} \end{figure} Figure \ref{figure:2} (Configuration plot with bubble) presents additional the quota of each object in total error is shown by the size of radius of the circle around each object. Shepard Diagram and Stress plot confirm the correctness of the chosen scaling model. On Figure \ref{figure:2} {(Configuration plot with bubble)}, the axis of the set, which means the shortest connection between Pattern and Anti-pattern object, is designated. It indicates the level of development of the tourist attractiveness of counties. Objects that are closer to Pattern object have higher level of tourist attractiveness. To opposite to the best MDS procedure (117) we show the results for the one of the worst procedures (13): n9a normalization method, mspline of third degree MDS model, maximum (Chebyshev) distance. {In relation to the previous script, changes in lines 3-5 and in Shepard diagram are required.} \begin{CodeChunk} \begin{CodeInput} R> z<-data.Normalization(data_lower_silesian,type="n9a") R> d<-dist(z,method="maximum") R> res<-smacofSym(delta=d,ndim=2,type="mspline",spline.degree=3) ... # Shepard Diagram R> plot(res,plot.type="Shepard",cex.main=0.8,cex.lab=0.8,cex.axis=0.8,cex=0.2) R> t1<-as.matrix(res$delta) R> t2<-as.matrix(res$confdist) R> text(t1[7,3],t2[7,3],pos=4,"7,3",cex=0.6) R> text(t1[31,3],t2[31,3],pos=1,"31,3",cex=0.6) \end{CodeInput} \end{CodeChunk} The results of multidimensional scaling for procedure 13 according to the level of tourist attractiveness are presented on Figure \ref{figure:3}. \begin{figure}[htb] \centering \includegraphics[width=12.7cm,height=12.7cm]{Fig3.pdf} \caption{The results of multidimensional scaling (procedure 13) of 31 objects\newline (29 Lower Silesian counties, Pattern and Anti-pattern) according to the level of tourist attractiveness } \label{figure:3} \end{figure} Overall Stress for procedure 13 ({0.0381}) is much better than for procedure 117 (0.1322). Figure \ref{figure:3} (Stress plot) exhibits that objects Jeleniogorski (3), Anti-pattern (31) and Zgorzelecki (7) contribute most to the overall Stress (56.62\%). It also shows (see Shepard Diagram -- in the lower left-hand corner) that two points (distance between Jeleniogorski county (3) and Anti-pattern object (31); Jeleniogorski county (3) and Zgorzelecki (7) county) are outliers. {These outliers contribute over-proportionally to the total Stress. }MDS configuration (Figure \ref{figure:3} -- Configuration plot with bubble) does not represent all proximities equally good. Jeleniogorski county (3) is one of the best of Lower Silesian counties according to the level of tourist attractiveness. In Configuration plot with bubble this county lies near Anti-pattern object (the worst object). The greater the value of the $\mathit{HHI}_p$ index, the worse is the effect of multidimensional scaling in terms of representation real relationships between objects. \subsection{Interval-valued data (symbolic-to-symbolic approach)} In second example we will find the optimal solution for symbolic-to-symbolic MDS approach. The dataset data\_symbolic\_interval\_polish\_voivodships comes from \pkg{clusterSim} package. For the evaluation of tourist attractiveness of Polish voivodships (provinces) in the year 2016 a two-stage data collection has been carried out. {Step 1. Data on tourist attractiveness were collected for 380 Polish counties for the following metric variables:} \noindent {x1 -- beds in hotels per 1,000 inhabitants of a county,} \noindent {x2 -- number of nights spent daily by resident tourists per 1,000 inhabitants of a county,} \noindent {x3 -- number of nights spent daily by foreign tourists per 1,000 inhabitants of a county,} \noindent {x4 -- dust pollution emission in tons per 10 km\textsuperscript{2} of a county area,} \noindent {x5 -- gas pollution emission in tons per 1 km\textsuperscript{2} of a county area,} \noindent {x6 -- number of criminal offences, crimes against life and health and property crimes per 1,000 inhabitants of a county,} \noindent {x7 -- forest cover of the county in \%,} \noindent {x8 -- participants of mass events per 1,000 inhabitants of a county,} \noindent {x9 -- number of tourist economy entities (sections: I, N79) registered in the system REGON per 1,000 inhabitants of a county.} {Three variables x4, x5 i x6 can be treated as destimulants. All other variables are stimulants. } {Step 2. Data table has been aggregated up to the voivodships with interval-valued data as an a result. The lower bound of the interval for each variable was obtained by calculating the first quartile based on data from the counties. In turn, the upper bound of the interval was obtained by calculating the third quartile. In result dataset contains data about 18 objects (16 voivodships, Pattern and Anti-pattern) described by 9 interval-valued variables.} First we load package \pkg{mdsOpt} and dataset (please notice that there is no need to load \pkg{clusterSim} package -- it is auto-loaded automatically by \pkg{mdsOpt}). \begin{CodeChunk} \begin{CodeInput} R> library(mdsOpt) R> data("data_symbolic_interval_polish_voivodships") R> data<-data_symbolic_interval_polish_voivodships \end{CodeInput} \end{CodeChunk} Then set the normalizations methods and {optimization methods} used in selection of optimal MDS procedure. \begin{CodeChunk} \begin{CodeInput} R> metnor<-c("n1","n2","n3","n3a","n4","n5","n5a","n6","n6a","n7","n8","n9", + "n9a","n10","n11","n12","n12a","n13") R> methods<-c("MM","BFGS") \end{CodeInput} \end{CodeChunk} In next step we run $I\text{{}-}\mathit{scal}$ algorithm for all combinations of normalization methods and {optimization methods} with default parameters. \begin{CodeChunk} \begin{CodeInput} R> res<-optIscalInterval(x=data,dataType="simple",normalizations=metnor, + optMethods=methods,outDec=".",stressDigits=6,HHIDigits=2) \end{CodeInput} \begin{CodeOutput} initial value 568.744280 iter 100 stress = 33.568175 .... final (iter 1000) stress = 9.392537 stopped after 1000 iterations initial value 217.126687 final value 3.943757 converged \end{CodeOutput} \begin{CodeInput} R> print(res) \end{CodeInput} \begin{CodeOutput} Normalization method Opt method I-STRESS HHI spb [1,] "n9a" "MM" "0.000087" " 746.31" [2,] "n9a" "BFGS" "0.000108" "1156.36" [3,] "n2" "BFGS" "0.000200" " 863.13" ... [34,] "n4" "MM" "0.007690" "1316.30" [35,] "n12" "MM" "0.008430" "1148.12" [36,] "n12a" "MM" "0.009668" "1058.55" \end{CodeOutput} \end{CodeChunk} The values are ordered by $I\text{{}-}\mathit{Stress}$ value. Then we convert $I\text{{}-}\mathit{Stress}$ and $\mathit{HHI}$ values to numeric vectors. \begin{CodeChunk} \begin{CodeInput} R> Istress<-as.numeric(res[,"I-STRESS"]) R> hhi<-as.numeric(res[,"HHI spb"]) \end{CodeInput} \end{CodeChunk} The maximal acceptable $cs\text{{}}$ value is calculated as an median of $I\text{{}-}\mathit{Stress}$ values. \begin{CodeChunk} \begin{CodeInput} R> cs<-median(Istress) R> print(cs) \end{CodeInput} \begin{CodeOutput} [1] 0.003215 \end{CodeOutput} \end{CodeChunk} Then the best MDS procedure from all combinations\textcolor{red}{ } is chosen. \begin{CodeChunk} \begin{CodeInput} # Elements of optimal MDS procedure R> t<-findOptimalIscalInterval(res,cs) R> print(t) \end{CodeInput} \begin{CodeOutput} $Nr [1] 5 $Normalization_method [1] "n2" $Opt_method [1] "MM" $I_STRESS [1] 0.000268 $HHI_spb [1] 743.61 \end{CodeOutput} \end{CodeChunk} In next step we can plot dependency between $I\text{{}-}\mathit{Stress}$ and $\mathit{HHI}$ index (see Figure \ref{figure:4}) with best solution marked by red circle and finally we choose the MDS solution that satisfies condition $I\text{{}-}\mathit{Stress}{\leq}\mathit{cs}$ and minimizes $\mathit{HHI}$. \begin{CodeChunk} \begin{CodeInput} # Plot dependency between I-Stress and HHI index R> plot(Istress[-t$Nr],hhi[-t$Nr], xlab="I-Stress",ylab="HHI", + type="n",font.lab=3) R> text(Istress[-t$Nr],hhi[-t$Nr],labels=(1:nrow(res))[-t$Nr]) R> abline(v=cs,col="red") R> points(Istress[t$Nr],hhi[t$Nr], cex=5,col="red") R> text(Istress[t$Nr],hhi[t$Nr],labels=(1:nrow(res))[t$Nr],col="red") \end{CodeInput} \end{CodeChunk} \begin{figure}[htb] \centering \includegraphics[width=9.791cm,height=8.662cm]{Fig4.pdf} \caption{ T{he values of } $I\text{{}-}\mathit{Stress}$ {fit measure }and $\mathit{HHI}$ index\newline for \textit{p} multidimensional scaling procedures (with best solution marked by red circle) } \label{figure:4} \end{figure} Now we will display the results of the best MDS procedure (5). First we need to load \pkg{smds} library. \begin{CodeChunk} \begin{CodeInput} R> library(smds) \end{CodeInput} \end{CodeChunk} The results of optimal multidimensional scaling procedure (5), via below script, for 18 objects (16 voivodships, Pattern and Anti-pattern object) according to the level of tourist attractiveness are presented on Figure \ref{figure:5}. \begin{CodeChunk} \begin{CodeInput} R> library(mdsOpt) R> data("data_symbolic_interval_polish_voivodships") R> data<-data_symbolic_interval_polish_voivodships R> normalized<-interval_normalization(x=data,dataType="simple",type="n2") R> x<-normalized$simple[,,1];y<-normalized$simple[,,2] R> my.idiss<-idistBox(X=(x+y)/2,R=(y-x)/2) R> cmat<-(my.idiss[2,,]+my.idiss[1,,])/2 R> iniX<-cmdscale(as.dist(cmat),k=2) R> n=dim(my.idiss)[2] R> iniR<-matrix(rep(1,n*2),nrow=n,ncol=2) R> res.box<-IMDS(IDM=my.idiss,p=2,model="box",opt.method="MM", + report=1001,ini=list(iniX,iniR)) R> x_l<-res.box$EIDM[1,,];x_u<-res.box$IDM[2,,] R> y_l<-res.box$IDM[1,,];y_u<-res.box$EIDM[2,,] R> spb<-ispb(res.box$EIDM,my.idiss) R> HHI<-sum(spb^2) R> par(mfrow=c(2,2),pty="s") # I-dist diagram R> plot(x_u,y_u, main="I-dist diagram", + ylab="The lower (red) and upper (green)\n configuration distances", + xlab="The lower (red) and upper\n (green) dissimilarities", + col="green",cex.main=0.8,cex.lab=0.8,cex.axis=0.8,cex=0.5) R> points(x_l,y_l,col="red",cex=0.5) # I-Stress plot R> w<-sort(spb,decreasing=TRUE) R> names(w)<-order(spb,decreasing=TRUE) R> plot(w,main="I-Stress plot",xlab="Object",ylab="ispb in percents", + ylim=c(-2,25),cex=0.4,cex.main=0.8,cex.lab=0.8,cex.axis=0.8) R> text(w,pos=3,names(w),cex=0.6) # Configuration plot R> x<-(res.box$X-res.box$R);y<-(res.box$X+res.box$R) R> plot(NULL,xlim=c(min(x[,1]),max(y[,1])),ylim=c(min(x[,2]),max(y[,2])), + pch=1,cex=0.4,main="Configuration plot",xlab="Dimension 1", + ylab="Dimension 2",cex.main=0.8,cex.lab=0.8,asp=1,cex.axis=0.8) R> rect(x[,1],x[,2],y[,1],y[,2]) R> text(res.box$X[,1],res.box$X[,2],labels=1:18,cex=0.8) R> plot.new() R> legend("center",legend=paste(1:dim(data)[[1]],attr(data,"row.names")), + bty="n",ncol=2,cex=0.65,title="Legend") \end{CodeInput} \end{CodeChunk} \begin{figure}[htb] \centering \includegraphics[width=12.7cm,height=12.7cm]{Fig5.pdf} \caption{ The results of multidimensional scaling (procedure 5) of 18 objects\newline (16 voivodships, Pattern and Anti-pattern) according to the level of tourist attractiveness } \label{figure:5} \end{figure} {Figure \ref{figure:5} (}$I\text{{}-}\mathit{dist}$ {diagram and } $I\text{{}-}\mathit{Stress}$ plot) confirms the correctness of the MDS results (Configuration plot). Objects that are closer to pattern of development have higher level of tourist attractiveness. To opposite to the best MDS procedure (5) we show, via below script, the results for the one of the worst procedures (12) according to $\mathit{HHI}$ index. {In relation to the previous script, changes in lines 5, 13-14 and } $I\text{{}-dist}${ diagram are required.} \begin{CodeChunk} \begin{CodeInput} R> normalized<-interval_normalization(x=data,dataType="simple",type="n5a") ... R> res.box<-IMDS(IDM=my.idiss,p=2,model="box",opt.method="BFGS", + report=1001,ini=list(iniX,iniR)) ... # I-dist diagram R> plot(x_u,y_u, main="I-dist diagram", + ylab="The lower (red) and upper (green)\n configuration distances", + xlab="The lower (red) and upper\n (green) dissimilarities",col="green", + cex.main=0.8,cex.lab=0.8,cex.axis=0.8,cex=0.5) R> points(x_l,y_l,col="red",cex=0.5) R> text(x_u[17,16],y_u[17,16],pos=2,"17,16",cex=0.6) R> text(x_u[17,4],y_u[17,4],pos=1,"17,4",cex=0.6) R> text(x_l[16,9],y_l[16,9],pos=3,"16,9",cex=0.6) \end{CodeInput} \end{CodeChunk} The results of multidimensional scaling for procedure 12 according to the level of tourist attractiveness are presented on Figure \ref{figure:6}. \begin{figure}[htb] \centering \includegraphics[width=12.7cm,height=12.7cm]{Fig6.pdf} \caption{ The results of multidimensional scaling (procedure 5) of 18 objects\newline (16 voivodships, Pattern and Anti-pattern) according to the level of tourist attractiveness } \label{figure:6} \end{figure} Figure \ref{figure:6} ($I\text{{}-}\mathit{Stress}$ plot) exhibits that objects Lubuskie (4), Pattern (17) and Zachodniopomorskie (16) contribute most to the overall $I\text{{}-}\mathit{Stress}$ (57,68\%). It also shows (see Figure \ref{figure:6} -- $I\text{{}-}\mathit{dist}$ diagram) that some points (upper distances between Zachodniopomorskie (16) voivodship and Pattern object (17); Pattern object (17) and Lubuskie voivodship (4); lower distance between Zachodniopomorskie (16) voivodship and Podkarpackie voivodship (9)) are outliers. {These outliers contribute over-proportionally to the total } $I\text{{}-}\mathit{Stress}${. }MDS configuration (Figure \ref{figure:6} -- Configuration plot) does not represent all proximities equally good. Zachodniopomorskie (16) is the best of Polish voivodships according to the level of tourist attractiveness. In Figure \ref{figure:6} (Configuration plot) this voivodship lies further from Pattern object than Lubuskie (4). The greater the value of the $\mathit{HHI}_p$ index, the worse is the effect of multidimensional scaling in terms of representation real relationships between objects. \section{Summary} The article proposes a methodology that allows the selection of the optimal MDS procedure for classical metric and interval-valued data. For classic-to-classic approach we choose best MDS procedure due to the used methods of normalization, distance measures and scaling models carried out on the basis of the metric data matrix. On the basis of this methodology research results are illustrated by first example to find the optimal procedure for multidimensional scaling of set of objects representing 29 counties in Lower Silesia according to the level of tourist attractiveness. For symbolic-to-classic approach we choose the best MDS procedure due to the used methods of normalization, distance measures for interval-valued data and scaling models carried out on the basis of the interval-valued data table. For symbolic-to-symbolic approach we choose the best MDS procedure due to the used methods of normalization and optimization methods carried out on the basis of the interval-valued data table. On the basis of this methodology research results are illustrated by second example to find the optimal procedure for multidimensional scaling of set of objects representing 16 Polish voivodships according to the level of tourist attractiveness. To solve the problem of choosing the optimal multidimensional scaling procedure two criteria were applied in \pkg{mdsOpt} package Kruskal's $\mathit{Stress}\text{{}-}1$ fit measure and the Hirschman-Herfindahl $\mathit{HHI}$ index (in classic-to-classic and symbolic-to-classic approaches) and $I\text{{}-}\mathit{Stress}$ fit measure and the Hirschman-Herfindahl $\mathit{HHI}$\textit{ }index (in symbolic-to-symbolic approach). In step 6 the maximal acceptable value of fit measures $\mathit{Stress}\text{{}-}1$ and $I\text{{}-}\mathit{Stress}$ has been arbitrary assumed. It is not determined how much error distribution for each object may deviate from the uniform distribution. Among the procedures of multidimensional scaling for which $\mathit{Stress}\text{{}-}1{\leq}cs$ ($I\text{{}-}\mathit{Stress}{\leq}cs$) the one for which occurs $\underset p{\mathit{min}}\left\{\mathit{HHI}_p\right\}$ is selected. This constraint does not essentially limit the presented proposal, as additional criteria for acceptability such as Shepard diagram (\cite{Leeuw+Mair:2015}) and Stress plot or $I\text{{}-}\mathit{dist}$ diagram and $I\text{{}-}\mathit{Stress}$ plot confirm the correctness of the MDS results. \addcontentsline{toc}{chapter}{References} \bibliography{\jobname} \end{document}