Seminars of the academic year 2004-2005
- Wednesday 21-04-2004 - 16h 15
Prof. Gérard Antille
Université de Neuchâtel - Suisse
Descriptive analysis of the chronological component of data matrices
In this seminar we present a new data analysis technique applied to cross tables indexed by time. The objective of this technique is to describe the temporal evolution of the statistical units. The method is based on the principles of three-way table analysis, such as generalized principal component analysis, canonical analysis and discriminant analysis, as well as on the STATIS method, and is complemented by an automatic classification of the statistical units' trajectories over time.
- Wednesday 27-10-2004 - 16h 00
Prof. Jürg Hüsler
Université de Bern - Suisse
Extreme values and tests for the extreme value conditions
We introduce the basics of extreme value theory, with the (asymptotic) extreme value distributions and the generalized Pareto distributions. These distributions are applied to the extremes or largest order statistics of a sample with large sample size. Such applications, however, are based on certain theoretical assumptions which should be checked. Simple graphical devices are known for this purpose, but goodness-of-fit tests should also be investigated. Such tests will be discussed during the talk.
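As background (not part of the abstract), the two limiting families mentioned above are usually parametrized, with location \mu, scale \sigma > 0 and shape \xi, as
\[ G(x) = \exp\!\left\{ -\left[ 1 + \xi \frac{x-\mu}{\sigma} \right]^{-1/\xi} \right\} \quad \text{(generalized extreme value, for maxima)}, \]
\[ H(y) = 1 - \left( 1 + \xi \frac{y}{\sigma} \right)^{-1/\xi}, \quad y > 0 \quad \text{(generalized Pareto, for excesses over a high threshold)}, \]
both defined where the bracketed terms are positive; the case \xi = 0 is read as the limit, giving the Gumbel and exponential forms respectively.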
- Tuesday 09-11-2004 - 11h 00
Prof. Farhad Mehran
Université de Neuchâtel - BIT (GE) - Suisse
Measuring the number of persons in forced labour in the world using capture-recapture sampling of reported cases
Forced labour is "all work or service which is exacted from any person under the menace of any penalty and for which the said person has not offered himself voluntarily." The International Labour Office is attempting to derive credible estimates of the incidence of forced labour in the world. Such global estimates should serve the purpose of attracting public attention to the phenomenon and drawing support for its elimination. In the absence of solid and widely accepted national estimates, the estimation approach adopted by the ILO is to rely on traces of forced labour, by analysing and counting validated reports of forced labour cases. The methodology is based on capture-recapture sampling of reported cases of forced labour, and leads to minimum estimates providing lower bounds on the total number of victims of forced labour in the world.
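For readers unfamiliar with capture-recapture, the classical two-list (Lincoln-Petersen) estimator conveys the basic idea; the ILO methodology is of course more elaborate than this illustration. With n_1 cases appearing in a first independent list of reports, n_2 in a second, and m appearing in both,
\[ \hat{N} = \frac{n_1 \, n_2}{m} \]
estimates the total number of cases, reported or not.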
- Tuesday 23-11-2004 - 11h 00
Prof. Christian Hesse
INSEE - France
TYPICAL SCHEMES FOR SAMPLE COORDINATION
We present the typical sample coordination schemes used in government statistical institutes.
To simplify, consider the successive surveys as a chronological sequence (a stochastic process). For each unit of the population there is a vector of indicators of membership in each sample. The probabilities of these indicators, which are called multiple probabilities, are constrained by the marginal inclusion probabilities, by the probabilities of leaving the sample at each occasion for periodic surveys, or even by inclusion durations. They lie inside a convex polyhedron (boundary included), and the typical coordination schemes correspond to vertices of these polyhedra. These include, in particular, maximal positive coordination, maximal rotation, and partial rotation for a periodic survey. In addition, several periodic surveys must be negatively coordinated.
These typical schemes can be achieved with Poisson sampling. However, Poisson sampling has the serious drawback of a random sample size. Designs with controlled sample size, such as stratified simple random sampling, are preferred. But then the vertices of the polyhedron can, in general, no longer be reached, and these coordination schemes hold only approximately. A good coordination method is one that does not stray too far from them. Moreover, it is then possible to estimate the multiple inclusion probabilities approximately, and hence, in particular, the variances of differences between two dates.
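A minimal sketch (assuming numpy; not the INSEE schemes discussed in the talk) of how Poisson sampling can be coordinated through permanent random numbers: each unit keeps one uniform draw across all surveys, and shifting these draws modulo 1 moves the design from positive towards negative coordination.

    import numpy as np

    rng = np.random.default_rng(42)

    N = 1000                               # population size (illustrative)
    prn = rng.uniform(size=N)              # permanent random numbers, fixed across surveys

    def poisson_sample(pik, prn, shift=0.0):
        # Poisson sampling: unit k is selected iff its (shifted) PRN falls below pi_k.
        # shift = 0 for every survey gives positive coordination (overlap maximized);
        # distinct shifts for different surveys give negative coordination.
        u = (prn + shift) % 1.0
        return u < pik

    pik_a = np.full(N, 0.10)               # inclusion probabilities, survey A
    pik_b = np.full(N, 0.10)               # inclusion probabilities, survey B

    sample_a = poisson_sample(pik_a, prn)              # survey A
    sample_b = poisson_sample(pik_b, prn, shift=0.5)   # survey B, negatively coordinated
    print(sample_a.sum(), sample_b.sum(), (sample_a & sample_b).sum())

With these settings the expected sample sizes are 100 each, and the shift of 0.5 makes the two selection intervals disjoint, so the expected overlap is zero; using shift = 0 for both surveys would instead maximize the overlap. Note also that the realized sample sizes are random, which is exactly the drawback of Poisson sampling mentioned above.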
- Friday 26-11-2004 - 10h 00
Prof. Gabriella Schoier
University of Trieste, Italy
On Spatial Data Mining and Clustering Methods: a Proposal
Nowadays large amounts of data are stored in databases. Moreover, spatial data have many features, such as topological and/or distance information, so that they require an integration of Data Mining with spatial database techniques. Spatial Data Mining allows the integration of traditional clustering methods with spatial analysis methods, with an emphasis on efficiency, interaction with users, and the discovery of new types of knowledge. In particular, Spatial Data Mining can be used for browsing spatial databases, understanding spatial data, discovering spatial relationships, and optimizing spatial queries.
Recently, clustering techniques have been recognized as primary Data Mining methods for knowledge discovery in spatial databases. The well-known clustering algorithms, however, have some drawbacks when applied to large spatial databases. On the one hand, traditional algorithms seem to be inefficient when managing spatial data; on the other hand, problems arise when spatial and non-spatial data are considered together for clustering. Algorithms for spatial data detect clusters in the geographical distribution of the data, but they do not always seem suited to also considering their attributes, such as intensity, frequency, or other characteristics of the observed phenomenon.
In this talk we present a new algorithm which is a modification of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm proposed by Ester et al. (1996). This algorithm is based on a density approach to clustering. The units to be clustered are represented as points in the measurement space; the algorithm can discover both the clusters (i.e., the dense regions of the space) and the noise (i.e., the low-density regions of the space), and it can identify clusters of different shapes.
Our proposed modification takes into consideration both spatial and non-spatial variables which are relevant for the phenomenon to be analysed. It has been applied to a geographical database.
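A minimal sketch (assuming numpy and scikit-learn) of the standard DBSCAN algorithm of Ester et al. (1996) on purely spatial coordinates; the modification proposed in the talk, which also handles non-spatial attributes, is not reproduced here.

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    coords = np.vstack([
        rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2)),  # dense region 1
        rng.normal(loc=(5.0, 5.0), scale=0.3, size=(100, 2)),  # dense region 2
        rng.uniform(low=-2.0, high=7.0, size=(20, 2)),          # sparse background noise
    ])

    db = DBSCAN(eps=0.5, min_samples=5).fit(coords)
    labels = db.labels_   # -1 marks noise; other integers are cluster ids
    print(np.unique(labels, return_counts=True))

The eps and min_samples parameters define the density threshold: a point belongs to a cluster if enough neighbours lie within radius eps, and points in low-density regions are labelled -1 (noise).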
- Tuesday 30-11-2004 - 11h 00
Prof. Stephan Morgenthaler
EPFL - Lausanne - Suisse
Robust linear models
The intuitive notion of resistance to the influence of outliers has been formalized in different ways over the last century. Among other approaches, one can trim, compute the breakdown point, compute the sensitivity and/or the influence, study robustness under perturbations of the model, or find minimax estimators with respect to variance and bias. This research shows that robustness of estimators can be reached in two different ways: on the one hand by requiring resistance properties, and on the other hand by studying estimators that are optimal for robust models. In this presentation we take the second route, considering contaminated models and their use in linear regression.
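As background (not part of the abstract): the contaminated models referred to above are typically gross-error neighbourhoods of a central model F_0,
\[ \mathcal{F}_\varepsilon = \{ F : F = (1-\varepsilon) F_0 + \varepsilon H, \ H \text{ an arbitrary distribution} \}, \]
where \varepsilon is a (small) contamination fraction; estimators are then sought that behave well over the whole neighbourhood rather than at F_0 alone.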
- Friday 3-12-2004 - 10h 30
Guillaume Chauvet, CREST-ENSAI, Rennes
Prof. Yves Tillé, Université de Neuchâtel
The CUBE sampling macro: presentation and demonstration
The cube method proposed by Jean-Claude Deville and Yves Tillé is a balanced sampling algorithm that selects samples whose means, for a set of known auxiliary variables, coincide with those of the population. The fact that the auxiliary variables are reproduced exactly by the sample yields a very substantial gain in precision.
The cube method has been implemented at INSEE as a SAS-IML macro distributed as free software. The CUBE macro is easy to use and makes it possible to select samples with equal or unequal probabilities from large databases, while preserving balance on several dozen auxiliary variables. Among other applications, the algorithm has been used for two large-scale statistical operations in France: the selection of the rotation groups of the renovated population census, and the selection of the master sample.
Without going into overly technical details, we will begin with a description of the algorithm. We will then demonstrate the software and apply it to several concrete examples.
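As a reminder (not part of the abstract), a design with inclusion probabilities \pi_k is balanced on the auxiliary variables x_k if every sample s it can select satisfies
\[ \sum_{k \in s} \frac{x_k}{\pi_k} = \sum_{k \in U} x_k, \]
i.e. the Horvitz-Thompson estimates of the auxiliary totals reproduce the known population totals; the cube method selects such samples, exactly when possible and approximately otherwise.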
- Tuesday 7-12-2004 - 11h 15
Prof. Maria-Pia Victoria-Feser and Samuel Copt
University of Geneva - Switzerland
High Breakdown Inference for Mixed Linear Models
Mixed linear models are used to analyse data in many settings. These models have in most cases a multivariate normal formulation. The maximum likelihood estimator (MLE) or the residual MLE (REML) are usually chosen to estimate the parameters. However, the latter are based on the strong assumption of exact multivariate normality. Welsh and Richardson (1997) have shown that these estimators are not robust to small deviations from multivariate normality. This means in practice that a small proportion of the data (even a single observation) can drive the value of the estimates on its own. Since the model is multivariate, we propose in this paper a high breakdown robust estimator for very general mixed linear models that include, for example, covariates. This robust estimator belongs to the class of S-estimators (Rousseeuw and Yohai, 1984), from which we can derive the asymptotic properties for inference. We also use it as a diagnostic tool to detect outlying subjects. We discuss the advantages of this estimator compared to other robust estimators proposed previously and illustrate its performance with simulation studies and the analysis of three datasets. We also consider robust inference for multivariate hypotheses as an alternative to the classical F-test by using a robust score-type test statistic proposed by Heritier and Ronchetti (1994) and study its properties by means of simulations and the real-data analysis.
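For reference (background, not part of the abstract), the multivariate normal formulation alluded to above writes the observations as
\[ y = X\beta + Zu + \varepsilon, \qquad u \sim N(0, G), \quad \varepsilon \sim N(0, R), \]
so that y \sim N(X\beta, ZGZ^\top + R); robustness concerns arise because the MLE and REML estimates of \beta and of the variance components rely on this exact normality.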
- Tuesday 25-01-2005 - 11h 00
Prof. Ali S. Hadi
Department of Mathematics, The American University in Cairo and Department of Statistical Sciences, Cornell University.
Modelling Extreme Events Data
In some statistical applications, the interest is centered on estimating some population characteristics (e.g., the average rainfall, the average temperature, the median income, etc.) based on random samples taken from the population under study. In other areas of application, we are not interested in estimating the average but rather in estimating the maxima or the minima. For example, in designing a dam, engineers would not be interested in the average flood, but in the maximum flood. Farmers would be interested in both the maximum flood (which causes flooding) and the minimum flood (which causes drought).
The maxima or minima are called the extremes. The knowledge of the distributions of the extremes of the relevant phenomena is important. Additionally, estimating extreme quantities is very difficult because of the lack of available data. This is an expository talk in which the commonly used distributions for modelling extremes and the various methods for estimating their parameters and quantiles will be reviewed. Special attention will be given to recent literature.
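As one concrete example of the quantile estimation mentioned above (background only): once a generalized extreme value distribution with parameters (\mu, \sigma, \xi) has been fitted to annual maxima, the level exceeded in a given year with probability p is
\[ z_p = \mu - \frac{\sigma}{\xi} \left[ 1 - \{ -\log(1-p) \}^{-\xi} \right], \qquad \xi \neq 0, \]
the so-called return level, estimated by plugging in the fitted parameters.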
- Tuesday 15-02-2005 - 11h 00
PD Dr. Giuseppe Melfi
Groupe de Statistique, Université de Neuchâtel
"On certain positive integer sequences "
- Tuesday 22-02-2005 - 11h 00
PD Dr. Riccardo Gatto
Universität Bern, Switzerland
Institut für mathematische Statistik und Versicherungslehre
riccardo.gatto@stat.unibe.ch
AN ACCURATE ASYMPTOTIC APPROXIMATION FOR EXPERIENCE RATED PREMIUMS
In the Bayesian approach, the experience rated premium is the value which minimizes an expected loss with respect to a posterior distribution. The posterior distribution is conditioned on the claim experience of the insured risk, represented by an n-tuple of observations. An exact analytical calculation of the experience rated premium is possible only under restrictive circumstances regarding the prior distribution, the likelihood function, and the loss function. In this article we provide an analytical asymptotic approximation, as n goes to infinity, for the experience rated premium. This approximation can be obtained under more general circumstances, it is simple to compute, and it inherits the good accuracy of the Laplace approximation on which it is based. In contrast with numerical methods, this approximation allows for analytical interpretations. When exact calculations are possible, some analytical comparisons confirm the good accuracy of this approximation, which can even lead to the exact experience rated premium.
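As background (not part of the abstract): under the common squared-error loss, the experience rated premium is the posterior mean of the risk premium \mu(\theta), a ratio of two integrals,
\[ P_n = \mathrm{E}[\mu(\Theta) \mid x_1, \dots, x_n] = \frac{\int \mu(\theta)\, \pi(\theta) \prod_{i=1}^n f(x_i \mid \theta)\, d\theta}{\int \pi(\theta) \prod_{i=1}^n f(x_i \mid \theta)\, d\theta}, \]
and the Laplace method approximates each such integral by expanding its log-integrand around its maximum,
\[ \int e^{n h(\theta)}\, d\theta \approx e^{n h(\hat\theta)} \sqrt{\frac{2\pi}{n\, |h''(\hat\theta)|}}, \qquad \hat\theta = \arg\max_\theta h(\theta), \]
which is the type of approximation the abstract refers to.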
- Tuesday 22-03-2005 - 11h 00
Prof. Jose-Miguel Bernardo
Universidad de Valencia, España
Objective Bayesian Inference: A General Definition of a Reference Prior
Reference analysis produces objective Bayesian inference, that is, Bayesian inferential statements which only depend on the assumed model and the available data. A reference prior function is a mathematical description of the situation where the data would best dominate prior knowledge about the quantity of interest. Reference priors are not descriptions of personal beliefs; they are proposed as technical devices to produce reference posteriors for the quantities of interest, obtained by formal use of Bayes' theorem with a reference prior function followed by appropriate probability operations. It is argued that reference posteriors encapsulate inferential statements over which there could be a general consensus and, therefore, may be used as standards for scientific communication. In this paper, statistical information theory is used to provide a general definition of a reference prior function from first principles. An explicit form for the reference prior is then obtained under very weak regularity conditions, and this is shown to contain the original reference algorithms as particular cases. Examples are given where a reference prior does not exist. Maximum entropy priors and Jeffreys priors are both obtained as particular cases under suitable conditions. This presentation concentrates on one-parameter models, but the basic ideas are easily extended to multiparameter problems.
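As a concrete special case (background, not part of the abstract): for a regular one-parameter model p(x \mid \theta), the reference prior reduces to Jeffreys prior,
\[ \pi(\theta) \propto I(\theta)^{1/2}, \qquad I(\theta) = -\mathrm{E}_\theta\!\left[ \frac{\partial^2}{\partial \theta^2} \log p(X \mid \theta) \right], \]
which is one of the particular cases mentioned at the end of the abstract.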
- Tuesday 12-04-2005 - 11h 00
Prof. Jana Jureckova
Charles University, Prague
"Testing the Tail Index in Autoregressive Models "
- Tuesday 19-04-2005 - 14h 00
Prof. Gilbert Saporta
Holder of the Chair of Applied Statistics, Conservatoire National des Arts et Métiers
email: saporta@cnam.fr
Clusterwise PLS regression for functional data
We use PLS regression in the following setting: simultaneous determination of classes of observations and of a regression formula within each class, when the set of predictors forms a second-order stochastic process. This approach is compared with other methods on stock market data.
Reference: C. Preda, G. Saporta: Clusterwise PLS regression on a stochastic process, Computational Statistics and Data Analysis, 49(1): 99-108, 2005.
- Friday 10-06-2005 - 11h 00
Prof. Ingram Olkin
Stanford University, USA
META-ANALYSIS: HISTORY AND STATISTICAL ISSUES FOR COMBINING THE RESULTS OF INDEPENDENT STUDIES
Meta-analysis enables researchers to synthesize the results of independent studies so that the combined weight of evidence can be considered and applied. Meta-analysis is increasingly being used in medicine and other health sciences, and in the behavioral and educational fields, to augment traditional methods of narrative research by systematically aggregating and quantifying research literature.
Meta-analysis requires several steps prior to statistical analysis: formulation of the problem, literature search, coding and evaluation of the literature, after which one can address the statistical issues.
We here review some of the history of meta-analysis and discuss some of the problematic issues such as various forms of bias that may exist. The statistical techniques that have been used are nonparametric methods, combining proportions, the use of different metrics, and combining effect sizes from continuous data.
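As one concrete instance of combining effect sizes (background only, not necessarily the focus of the talk): in the fixed-effect model, study-level estimates \hat\theta_i with variances v_i are pooled by inverse-variance weighting,
\[ \hat\theta = \frac{\sum_i w_i \hat\theta_i}{\sum_i w_i}, \qquad w_i = \frac{1}{v_i}, \qquad \mathrm{Var}(\hat\theta) = \frac{1}{\sum_i w_i}. \]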
- Tuesday 21-06-2005 - 11h 00
Prof. Alfio Marazzi
Faculté de biologie et médecine, Université de Lausanne, Suisse
Robust response transformations based on optimal prediction
Response transformations have become a widely used tool to make data conform to a linear regression model. The most common example is the Box-Cox transformation. The transformed response is usually assumed to be linearly related to the covariates and the errors normally distributed with constant variance. The regression coefficients, as well as the parameter lambda defining the transformation, are generally estimated by maximum likelihood (ML). Unfortunately, near normality and homoscedasticity are hard to attain simultaneously with a single transformation. In addition, the ML-estimate is not consistent under non-normal or heteroscedastic errors and it is not robust.
Various semiparametric and nonparametric approaches to relax the parametric structure of the response distribution have been studied. However, these procedures do not provide effective protection against heavy contamination and heteroscedasticity. A first proposal of Box-Cox transformations for simple regression that are robust and consistent even if the assumptions of normality and homoscedasticity do not hold was given by Marazzi and Yohai in 2003.
Here, we present new estimates based on optimization of the prediction error. Our multiple regression model does not specify a parametric form for the error distribution. In order to develop a new nonparametric criterion, we introduce the basic concept of the conditional M-expectation (CME), a robust version of the classical conditional expectation of the response for a given covariate vector. The CME minimizes an M-scale in place of the classical mean squared error. We then consider the CME of the transformed response as a function of lambda, the coefficients being estimated with a robust (e.g., MM-) estimator. The optimal prediction property of the CME provides a criterion to define the CME-estimate of lambda. Since the conditional mean of the response on the original scale is often the parameter of interest, we also provide a robust version of the well-known smearing estimate which is consistent for the CME. Monte Carlo results show that the new estimators perform better than other available methods. Applications concerning the modeling of hospital cost of stay with the help of covariates such as length of stay and admission type are presented. This is joint work with Victor Yohai (Department of Mathematics, University of Buenos Aires).
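For reference (not part of the abstract), the Box-Cox family mentioned above transforms a positive response y as
\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log y, & \lambda = 0, \end{cases} \]
and the talk is concerned with estimating \lambda (and the regression coefficients) robustly when the usual normal, homoscedastic error assumptions fail.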