\documentclass[a4paper,10pt]{article} \usepackage{Sweave} \usepackage{graphicx} \usepackage[utf8]{inputenc} \addtolength{\topmargin}{-2cm} \setlength{\textheight}{24cm} \addtolength{\leftmargin}{-2cm} \setlength{\textwidth}{16.25cm} \hoffset-2cm % \VignetteIndexEntry{pendensity} % \VignetteDepends{lattice,fda,pendensity} % \VignetteKeyword{density} \SweaveOpts{prefix.string=graphics} \begin{document} % Title Page \title{ {\ttfamily R-package pendensity} - Density Estimation with a Penalized Mixture Approach} \author{Christian Schellhase\\ \small Centre for Statistics, Bielefeld University\\[-0.8ex] \small Department for Business Administration and Economics\\[-0.8ex] \\[-0.8ex] Version 0.2.5\\[-0.8ex] \textit{Density Estimation with a Penalized Mixture Approach}, Schellhase C. and Kauermann G. (2012), Computational Statistics 27 (4), p. 757-777.} \maketitle \begin{abstract} We give an overview about our {\ttfamily R-package pendensity}. Therefore, we briefly discuss the several options and advantages of our approach. First of all, the univariate density estimation without any covariate is discussed, followed by the extension to covariate dependencies. Furthermore, the variety of possibilities of the plot function are discussed. Thereby, we focus on examples of stock prices. \end{abstract} \section{Simple Example} For a short introduction, we show two simple examples. \subsection{Example without any covariate} \label{ex-simpl} We start directly and load the corresponding R-package \SweaveOpts{echo=TRUE} <<>>= library(pendensity) @ We consider a random sample of size $N=100$ from the standard normal distribution. To estimate the corresponding density, we can easily run the estimation with {\ttfamily pendensity()}. <<>>= set.seed(27) y <- rnorm(100) test <- pendensity(y~1) @ We can plot the estimated density using the corresponding function {\ttfamily plot()}, see Figure \ref{fig:one}. <>= plot(test) @ \begin{figure}[!b] \begin{center} \includegraphics[height=8cm,width=8cm]{graphics-fig1} \end{center} \caption{Plot of the estimated density.} \label{fig:one} \end{figure} \subsection{Example with covariate} \label{ex-cov} We show a simple example, with sample size $N=400$, separated in two groups of equal size $N=200$. In between these groups of normal distributed values, we shift the mean by $0.5$ for a constant variance. <<>>= set.seed(27) x <- rep(c(0,1),200) y <- rnorm(400,x*0.5,1) test2 <- pendensity(y~as.factor(x)) @ Again, we can plot the estimated density using the corresponding function {\ttfamily plot()}, see Figure \ref{fig:two}. <>= plot(test2) @ \begin{figure}[!t] \begin{center} \includegraphics[height=8cm,width=8cm]{graphics-fig2} \end{center} \caption{Plot of the estimated densities in Section \ref{ex-cov}.} \label{fig:two} \end{figure} Once we have estimated a density with covariates, we can test for equality of the corresponding densities. As we can see, the p-value indicates inequality. <<>>= test.equal(test2) @ \section{pendensity} We contribute an {\ttfamily R-package pendensity} for density estimation based on the idea of penalized splines. Therefore, the main program contains several parameters, which can be varied by user. First, we speak about the possible global options of {\ttfamily pendensity()}. \subsection{global options of pendensity()} Most of the options are equal, independent if covariates are considered or not. \subsubsection{Density estimation without covariate} Density estimation without covariate(s) is easily done by using {\ttfamily pendensity()}. We need a formula to describe the desired results. For the following we define $y$ as the interesting variable. If we want to estimate the density of $y$ without any covariate, we type {\ttfamily pendensity(y$\sim$1)}. We can change several parameters for the estimate. Here we mention the important parameter: \begin{itemize} \item[no.base:]{We can change the used number of basis functions, by default this parameter is 'NULL'. Without any other changes, this results in a base with 41 B-spline bases. We can change this parameter, but we should not choose it too small, e.g. no.base=10 results in $2*10+1=21$ B-spline bases for B-splines of degree two.} \item[max.iter:]{The maximal number of iterations should be selected too large or even too small. Depending on the selected penalty parameter $\lambda$, {\ttfamily pendensity()} needs more or less iterations. The default is 20.} \item[$\lambda_0$:]{The default for the penalty parameter is 50000. Sometimes, one may change this value, resulting in fewer iterations.} \item[q:]{The order of the used B-splines, default is $q=3$} \item[m:]{the used difference order of the penalty. The default is m=q, that is the order of the used B-splines.} \item[with.border:]{We run several simulations and sometimes add additional bases at the end of the support of the observed value. This may improve the fit, in particular at the boundary. Usually we set {\ttfamily with.border=1} or {\ttfamily with.border=2}}. \end{itemize} \subsubsection{Density estimation with covariate} If we want to estimate a density with covariate, only the formula in {\ttfamily pendensity()} changes. Keep in mind, only factorial covariates are used, indicating the corresponding groups. First we consider an example with one covariate x for the response y. So, we type {\ttfamily pendensity(y$\sim$x)} (x has to be a factor, otherwise the call has to be {\ttfamily pendensity(y~as.factor(x))}. Of course, we can also estimate densities with two or more covariates. Therefore we add the different factorial covariates with '+' in the call, e.g. {\ttfamily pendensity(y$\sim$x+z)}. \subsection{plot.pendensity()} Having done an estimation without covariate, we can easily plot the density with {\ttfamily plot()}. Using this function of plot, we can determine several parameters. At first, we can choose between the usual plot and lattice. Secondly, we can get three different type of plots. Now, we highlight the main options. \begin{itemize} \item[latt:] {TRUE/FALSE indicates if lattice should be used.} \item[plot.val:] {$1$ indicates the normal density plot, $2$ indicates the distribution function of the observation values and $3$ indicates the analytic distribution function.} \item[val:] {We can calculate the corresponding density at unobserved points. Therefore, we set a vector of y, at which the estimated density is calculated} \item[confi:] {TRUE/FALSE, if the confidence intervals should be plotted or not} \end{itemize} By default, the title of each plot contains the final number of knots 'K', the corresponding Akaike Information Criteria (AIC) and the optimal $\lambda$. We can change this with the known option of the generic function {\ttfamily plot()}, using {\ttfamily main=...} . Additionally, we can add text below the plot, using {\ttfamily sub=...} . Moreover, we can use {\ttfamily xlab} and {\ttfamily ylab}. \subsection{test.equal()} \label{test-equal} For each estimated pendensity object dependent on some covariate we can test for equality of the densities. Therefore, we can easily use the {\ttfamily test.equal()} function and get an output with the corresponding p-value(s). See the corresponding example in Section \ref{ex-cov}.\par If one estimates a density with covariate indicating more than two groups, we compare the densities pairwise. Therefore, we present the following example <<>>= x <- c(rep(0,50),rep(1,100),rep(2,100)) y <- rnorm(250,x,1) x <- as.factor(x) test3 <- pendensity(y~x) @ <>= plot(test3,latt=TRUE) @ <<>>= test.equal(test3) @ \begin{figure}[!b] \begin{center} \includegraphics[height=8cm,width=8cm]{graphics-fig11} \end{center} \caption{Plot of the estimated densities of example in Section \ref{test-equal}.} \label{fig:three} \end{figure} \subsection{Calculate density values} In many applications one is interested to calculate the density in unobserved values. To do this, we can use the {\ttfamily plot()} function using the argument {\ttfamily val}. We consider the example from Section \ref{ex-simpl} and Section \ref{ex-cov}. If we want to get the density value in a given set of points, we type just {\ttfamily plot(obj,val=...)}. This function is only available for densities {\ttfamily plot.val=1}, not for the distribution. We get the values for each group separately. Additionally, the corresponding values of the standard deviation are listed. For the example in Section \ref{ex-simpl} we get <<>>= points <- c(1,-0.5,0,0.5,1) plot(test,val=points) @ For the example in Section \ref{ex-cov} we get <<>>= points <- c(1,-0.5,0,0.5,1) plot(test2,val=points) @ \section{Stock Examples} For a more detailed overview of our package {\ttfamily pendensity}, we consider the attached data of the daily final values of German stocks Deutsche Bank AG und Lufthansa AG from 2000 to 2008. We give some density examples and corresponding comments. \subsection{Allianz 2006} We are interested to look at the density of the German stock {\textit Allianz} in 2006. Therefore, we build differences of first order and estimate the density. <<>>= data(Allianz) form<-'%d.%m.%y' time.Allianz <- strptime(Allianz[,1],form) data.Allianz <- Allianz[which(time.Allianz$year==106),2] d.Allianz <- diff(data.Allianz) density.Allianz <- pendensity(d.Allianz~1) @ This density is estimated with the default settings of {\ttfamily pendensity()}. Now, we like to highlight the different options to plot densities. We can also use {\ttfamily lattice} to create plots, see Figure \ref{fig:three}. <>= plot(density.Allianz) @ <>= plot(density.Allianz,latt=TRUE) @ \begin{figure}[!b] \begin{center} \includegraphics[height=8cm,width=8cm]{graphics-fig3} \includegraphics[height=8cm,width=8cm]{graphics-fig4} \end{center} \caption{Plot of the estimated density of Allianz 2006.} \label{fig:three} \end{figure} \subsection{Allianz 2006 and 2007} We are interested to look at the density of the german stock {\textit Allianz} in 2006 and 2007. Therefore, we build differences of first order of the stock values and estimate the density depending on covariate. Again, we can plot this estimate. We show the result in Figure \ref{fig:four}, also for the lattice output. <<>>= data.Allianz <- Allianz[which(time.Allianz$year==106|time.Allianz$year==107),2] d.Allianz <- diff(data.Allianz) density.Allianz2 <- pendensity(d.Allianz~as.factor(time.Allianz$year)) @ <>= plot(density.Allianz2) @ <>= plot(density.Allianz2,latt=TRUE) @ \begin{figure}[!b] \begin{center} \includegraphics[height=8cm,width=8cm]{graphics-fig5} \includegraphics[height=8cm,width=8cm]{graphics-fig6} \end{center} \caption{Plot of the estimated densities of Allianz in 2006 and 2007.} \label{fig:four} \end{figure} Checking for equality of the estimated densities is done with <<>>= test.equal(density.Allianz2) @ \section{Number of knots} As mentioned in many famous papers, the number of knots has not to be too small for penalized spline smoothing. A good rule is to select 20 up to 40 knots.If one chooses more knots, the results do not change, due to the used penalization concept. Here, we show a short example for a sample $y$ from standard normal distribution of size $N=400$. <<>>= set.seed(27) y <- rnorm(400) density1 <- pendensity(y~1,no.base=10) density2 <- pendensity(y~1,no.base=15) density3 <- pendensity(y~1,no.base=20) density4 <- pendensity(y~1,no.base=25) @ <>= plot(density1) @ <>= plot(density2) @ <>= plot(density3) @ <>= plot(density4) @ For comparison, we plot these densities in Figure \ref{fig:five}. We see easily, that there is no change in between these figures, while the penalty increases with the size of the base. The AIC values are more or less equal, the very small differences are numerical noises. By default, {\ttfamily pendensity()} estimates the densities with 41 B-spline bases. \begin{figure}[!b] \begin{center} \includegraphics[height=8cm,width=8cm]{graphics-fig7} \includegraphics[height=8cm,width=8cm]{graphics-fig8} \includegraphics[height=8cm,width=8cm]{graphics-fig9} \includegraphics[height=8cm,width=8cm]{graphics-fig10} \end{center} \caption{Plot of a sample $y$ for different number of knots.} \label{fig:five} \end{figure} \end{document}