Acknowledgment of the challenges of extracting useful information from ever-growing torrents and oceans of raw data has become nearly ubiquitous over the past decade. Mathematical and statistical reasoning are central to addressing these challenges, and the mathematical sciences have established an impressive track record in providing methodology for “big data” problems as they have emerged in recent decades. The ASU Research Training Group (RTG) program is sponsored by the National Science Foundation to keep pace with these challenges. The program includes training in three areas:
The RTG program fosters integration across these areas to cultivate mathematical scientists who have skills in all three of them and can furthermore understand how to draw on concepts from multiple areas in addressing data-oriented problems. Examples of research questions to be addressed by the synergy of these disciplines include (but are not limited to):
All ASU undergraduate students, graduate students, and postdoctoral fellows are welcome to participate in the RTG seminar, which will include both research and professional development components.
Undergraduate students, graduate students, and postdoctoral fellows participating in the RTG program will have the opportunity to complete some research activity at an off-site location, typically during the summer at a national research laboratory or medical center. This will give participants a chance to collaborate with researchers from diverse backgrounds and other scientific disciplines on real data-oriented problems.
Those interested in participating should contact Rodrigo Platte.
Funding is provided by the National Science Foundation and the School of Mathematical and Statistical Sciences.
(Now at Dartmouth College)
Inverse Problems, Tomography,
MAT/STP 591 Topic: Data-Oriented Mathematical and Statistical Sciences
Schedule: Mondays 1:30 - 2:30pm in WXLR 021 (lower level)
Description: This seminar series is part of the NSF-RTG Data-Oriented Mathematical and Statistical Sciences. Seminar speakers will include ASU faculty and post-docs, outside visitors, and students. The RTG seminar will focus on both research and professional development. Topics of interest include mathematical and statistical challenges related to data problems that have emerged in recent years.
The seminar is open to all ASU students and faculty. In addition, students may register for 1 credit hour (pass/fail) or 3 credit hours (standard grading). Students registering for 1 credit must attend all talks. Students registering for 3 credits must attend all talks and present two regular length seminar talks on pre-approved topics (or two parts of the same topic). Under special circumstances, the course instructor may propose a different set of requirements. RTG fellows are required to register for three credit hours.
Prerequisite: Degree- or nondegree-seeking graduate student. Registration for three credit hours requires instructor approval.
The seminars are at 1:30pm in Wexler 021.
The RTG seminar is open to everyone. ASU students may register for 1 or 3 credits. Further information is available here.
African Easterly Wave (AEW) activity during the most recent decade (2008-2015) is reported and analyzed, and the same methodology is applied to predictions for a decade at the end of the century (2090-2099). The data utilized are obtained from assimilated analyses of the National Center for Environmental Prediction (NCEP) and climate projections from the Community Earth System Model (CESM). The power spectral density, computed by the multi-taper spectral analysis method and averaged over West Africa and over both decades, shows the dominance of waves with periods in the 3-5 day window. The spectrum of AEWs in the future climate shows a shift towards low frequencies. The role of the intensity of the jet in the wave activity is supported by idealized simulations.
Superparamagnetic Relaxometry (SPMR) is a novel technique which uses antigen-bound nanoparticles to assist in early cancer detection. A challenge of translating this technique to mainstream clinical applications is the reconstruction of the bound particle signal. The primary focus of this semester’s work was to determine if a multi-resolution approach could be used to accurately reconstruct the signal, including the position and magnitude of a source. By reducing the search space we hoped for a method which would be less computationally intensive. From our results it appears that the multi-resolution approach is promising for accurately localizing the bound particles.
Numerous studies have found an impact of gender, race, socioeconomic status (SES), school achievement, school engagement, and academic ability on academic achievement. A few have looked at more than one factor together. However, no study has combined all these factors and examined their joint impact on academic achievement. In this paper, we first analyzed data gathered at multiple time points using ordered multinomial logistic regression (OMLR) to identify the main factors of academic achievement. Second, we built a discrete time Markov chain (DTMC) model using the findings from the OMLR.
It is well-known that polynomial interpolation using equispaced points in one dimension is unstable. On the other hand, using Chebyshev nodes in one dimension provides both stable and highly accurate points for polynomial interpolation. In higher dimensional complex regions, optimal interpolation points are not well understood. The goals of this project are to find nearly optimal sampling points in one- and two-dimensional domains for interpolation, least-squares fitting, and finite difference approximations. The optimality of sampling points is investigated using the Lebesgue constant.
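The one-dimensional contrast described above can be checked numerically. The following is a minimal sketch, not the project's code: it estimates the Lebesgue constant (the maximum over [-1, 1] of the sum of absolute Lagrange basis polynomials) on a fine grid for equispaced and Chebyshev nodes. The function names, the node count of 15, and the grid size are illustrative assumptions.

```python
import math

def lebesgue_constant(nodes, n_eval=2001):
    """Estimate max over [-1, 1] of sum_j |l_j(x)| on a uniform grid."""
    best = 0.0
    for k in range(n_eval):
        x = -1.0 + 2.0 * k / (n_eval - 1)
        total = 0.0
        for j, xj in enumerate(nodes):
            # Lagrange basis polynomial l_j evaluated at x.
            term = 1.0
            for i, xi in enumerate(nodes):
                if i != j:
                    term *= (x - xi) / (xj - xi)
            total += abs(term)
        best = max(best, total)
    return best

def equispaced(n):
    """n equally spaced nodes on [-1, 1], endpoints included."""
    return [-1.0 + 2.0 * j / (n - 1) for j in range(n)]

def chebyshev(n):
    """n Chebyshev nodes (roots of the degree-n Chebyshev polynomial)."""
    return [math.cos((2 * j + 1) * math.pi / (2 * n)) for j in range(n)]

n = 15  # illustrative node count
leb_equi = lebesgue_constant(equispaced(n))
leb_cheb = lebesgue_constant(chebyshev(n))
print(f"equispaced: {leb_equi:.1f}, Chebyshev: {leb_cheb:.2f}")
```

A small Lebesgue constant means the interpolant is nearly as good as the best polynomial approximation of the same degree, so the gap between the two printed values is a direct measure of how much worse equispaced nodes behave.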
The desire to build predictive models based on datasets with tens of millions of observations is not uncommon today. However, with large datasets, standard statistical methods for analysis and model building can become infeasible due to computational limitations. One approach is to take a subsample from the full dataset. Standard statistical methods can then be applied to build predictive models using only the subdata. Existing approaches to data reduction often rely on the assumption that the full data follow a specified model (Wang et al., 2017). However, such assumptions are not always applicable, particularly in the big data context. We explore two new methods of subdata selection that do not require model assumptions. These proposed approaches use k-means clustering and space-filling designs in an attempt to spread the subdata uniformly throughout the region of the full data. We perform a simulation study and an analysis of real data to investigate the efficacy of the predictive models that result from these methods.
Movement of proteins is a biophysical process involving transient binding of particles to a microtubule. Specifically, different types of motors aid in the transport of cargo, such as vesicles and organelles. The movement is modeled as a series of switches, based on a Poisson process, between two possible states: random diffusion or Brownian directed movement. Using observed data that is obscured by assumed Gaussian error, the true movement of the cargo and regime switches are predicted. The predictions are based on the stochastic Expectation-Maximization (EM) algorithm, implementing a particle filter and maximum likelihood estimation. The results are first tested through a simulation study and then applied to real data.
Motivated by a comparison between classifiers built using balanced and imbalanced datasets, this project aimed to address issues with imbalance in training data when using the soft margin Support Vector Machine. Oversampling and Synthetic Minority Oversampling were used to balance the training dataset to illustrate how these resampling techniques could be used to alleviate problems arising from imbalance. This allowed us to conclude that both of these re-sampling based approaches could increase the specificity of a classifier.
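To make the resampling idea concrete, here is a minimal sketch of Synthetic Minority Oversampling under simplifying assumptions: each synthetic point is placed at a random position on the segment between a minority-class point and one of its k nearest minority-class neighbours. The function name `smote`, the choice k = 3, and the toy 2-D data are hypothetical and are not taken from the project.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        # Other minority points, sorted by squared distance to p.
        others = [q for q in minority if q is not p]
        others.sort(key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))
        q = rng.choice(others[:k])
        t = rng.random()  # position along the segment from p to q
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return synthetic

# Hypothetical 2-D training set: 4 minority points, 12 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i % 4), 3.0 + i // 4) for i in range(12)]

new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority), len(majority))
```

Plain oversampling would instead duplicate existing minority points; the interpolation step is what distinguishes the synthetic variant.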
While genetic modification in soybeans has allowed farmers to increase their yield over the years, models for predicting which genetic strain could be the most successful in particular regions have fallen behind. This project uses three different methods to construct viable prediction models for newly created varieties of soybeans: clustering methods, Kalman filtering, and parenclitic networks.
Jones and Nachtsheim (2011) introduced a class of three-level screening designs called definitive screening designs (DSDs). The structure of these designs results in the statistical independence of main effects and two-factor interactions; the absence of complete confounding among two-factor interactions; and the ability to estimate all quadratic effects. In this paper we explore the construction of series of augmented designs, moving from the starting DSD to designs comparable in sample size to central composite designs. We perform a simulation study to calculate the predictive mean square error for each design to determine the number of augmented runs necessary to effectively fit the correct second-order model.
In this paper we consider inverse problems in the presence of Poisson noise. A probabilistic treatment of the noisy regularization problem allows for a more comprehensive quantification of uncertainty in the problem. The Bayesian framework for optimization is explored by adding data-oriented terms to the image reconstruction problem and comparing with the classic function space optimization techniques. The reconstruction effort is described and implemented for image data containing Poisson noise, a situation relevant to many particle-counting imaging problems.
Energy dispersive X-ray (EDX) spectroscopy is a technique used to determine the chemical composition of a sample. The sample is exposed to an excitation energy, triggering atomic reactions that result in X-ray emission. The number of emitted X-rays is recorded at each energy level, and the result is a spectrum indicating peaks for different elements at particular energy levels. From the series of spectrum data, an image representation of the density for each element in the sample may be recovered. While EDX spectroscopy offers the power to resolve the densities of each element in the sample, the process of generating images for each element is nontrivial. In this paper we explore various image processing tools, such as low-pass filters and principal component analysis, that can be used to produce improved images from EDX spectroscopy data. Once we understand how these tools affect the resulting images, we hope to implement more advanced image reconstruction tools to improve the image formation.
This project investigates a gridding technique for function approximation on a spherical domain. This work is motivated by problems that arise in atmospheric research. The goal is to study the discretization based on the cubed sphere domain decomposition. This method decomposes the sphere into six identical regions where uniformly distributed nodes map onto nearly uniformly distributed nodes on the cube. We contrast this to the latitude and longitude discretization where the uniformity of the node distributions is completely lost by the change of coordinates and results in oversampling near the poles. The effect of using different sampling distributions for function approximation is explored.
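The difference in node uniformity between the two discretizations can be illustrated with a short sketch. It assumes the simplest variant of the cubed sphere, in which a uniform grid on one cube face is projected radially onto the unit sphere; the project may use a different mapping. The function names and the grid size n = 21 are illustrative assumptions.

```python
import math

def cube_face_to_sphere(n):
    """Project an n x n uniform grid on the cube face z = 1 radially
    onto the unit sphere."""
    ticks = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    grid = []
    for x in ticks:
        row = []
        for y in ticks:
            r = math.sqrt(x * x + y * y + 1.0)
            row.append((x / r, y / r, 1.0 / r))
        grid.append(row)
    return grid

def spacing_ratio_cubed(n):
    """Largest over smallest distance between neighbours along grid lines."""
    g = cube_face_to_sphere(n)
    d = [math.dist(g[i][j], g[i + 1][j]) for i in range(n - 1) for j in range(n)]
    d += [math.dist(g[i][j], g[i][j + 1]) for i in range(n) for j in range(n - 1)]
    return max(d) / min(d)

def spacing_ratio_latlon(n):
    """Same ratio for east-west neighbours on a latitude-longitude grid
    with n latitude circles (poles excluded) and n longitudes."""
    dlon = 2.0 * math.pi / n
    lats = [-math.pi / 2 + math.pi * (i + 0.5) / n for i in range(n)]
    widths = [math.cos(lat) * dlon for lat in lats]  # arc length per cell
    return max(widths) / min(widths)

n = 21
ratio_cubed = spacing_ratio_cubed(n)
ratio_latlon = spacing_ratio_latlon(n)
print(f"cubed sphere: {ratio_cubed:.2f}, lat-lon: {ratio_latlon:.2f}")
```

The cubed-sphere ratio stays bounded as n grows, while the latitude-longitude ratio grows without bound because the east-west spacing shrinks toward zero near the poles.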
The purpose of this project is to motivate and develop the general tomographic imaging problem. The Radon transform and its close relationship with the classic Fourier transform will be established. The inverse problem will be defined, along with an exploration of related iterative reconstruction schemes. The optimal use of sampling patterns is also explored.
With the increasing need to analyze data sets with potentially billions of entries and thousands of predictor variables, many methods have been proposed to study these so-called “big data” sets in a computationally efficient way; one recently proposed method is Information-Based Optimal Subdata Selection (IBOSS). Preliminary studies have found the method to be more effective than previously introduced methods, such as the Uniform Sampling Method and Leverage-based Sampling Methods, with regard to the linear regression equation constructed from the subdata selected by each method, using a variety of simulated data sets and some real data sets. In the Fall of 2016, I conducted preliminary studies regarding the distribution of simulated data sets, and concluded that the process succeeds when the distribution used to generate the covariates is generally symmetric, though in all cases the responses for each data set were constructed using a linear model, and a linear model was fit to the subdata. Naturally, this raises questions about how well the IBOSS algorithm would perform in basic model selection. In this project, I study how model selection performs when using IBOSS across two-factor interaction terms. Additionally, I explore the effects that skewed predictor data have on subdata selection methods.
Big data analysis has been on the rise and with it, a need for new research methods. One area of focus is subdata selection. In this project, several types of subdata selection methods are discussed and compared, including basic leverage sampling (BLEV), shrinkage leverage sampling (SLEV), unweighted leverage sampling (LEVUNW), and uniform sampling (UNIF). After an in-depth comparison using mean squared error on simulated data as the criterion, it was determined that the unweighted leverage sampling method resulted in the most accurate estimation of the true parameters among these four methods, making leverage-based subsampling a valuable solution to modeling big data. However, this was only determined under the assumption that an ordinary linear model with one response was being used and that the errors were independent and followed a normal distribution. To see if the results still held in other circumstances, three new models were proposed that all involved multivariate normally distributed data. The three models had ten parameters and two responses, although they could be generalized to even more responses or a different number of parameters. In the first, the errors were independent and identically distributed. In the second, the errors remained independent but had different levels of variance for each response. Finally, the third model had different levels of dependence among the errors, causing correlation among both the errors and the responses. Leverage sampling proved to perform well in multivariate data with and without the assumption of independence and identical distributions, with unweighted leverage sampling consistently performing the best. That is, the previous results can be extended to these new types of models. Although the methods were implemented using manageable-sized data, these methods can be applied in multivariate systems of a much larger size and on real data instead of simulated data.
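As a minimal single-response illustration of the sampling schemes named above, the sketch below compares unweighted leverage sampling (LEVUNW) with uniform sampling (UNIF) for a simple linear model. For one predictor with an intercept, the leverage of observation i is h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)². The simulated data, the sample sizes, and all function names are hypothetical; this is not the project's multivariate setup.

```python
import random

def ols(xs, ys):
    """Intercept and slope of an ordinary least-squares line."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def leverages(xs):
    """Leverage scores h_i for the model y = b0 + b1 * x."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]

rng = random.Random(0)
n, r = 5000, 200  # full-data size and subdata size (illustrative)
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]  # true b0 = 1, b1 = 2

h = leverages(xs)
# LEVUNW: sample rows with probability proportional to leverage,
# then fit an unweighted least-squares model to the subsample.
idx = rng.choices(range(n), weights=h, k=r)
b0_lev, b1_lev = ols([xs[i] for i in idx], [ys[i] for i in idx])

# UNIF: sample rows uniformly at random for comparison.
idx_u = rng.choices(range(n), k=r)
b0_unif, b1_unif = ols([xs[i] for i in idx_u], [ys[i] for i in idx_u])

print(f"LEVUNW slope: {b1_lev:.3f}, UNIF slope: {b1_unif:.3f}")
```

Leverage-based sampling favours observations with extreme predictor values, which is why it tends to pin down the slope more tightly than a uniform subsample of the same size. A single run such as this one does not establish that ordering; the comparison in the abstract relies on mean squared error over repeated simulations.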
The Fast Fourier Transform (FFT) allows for the efficient computation of the Discrete Fourier Transform (DFT), which decomposes a set of values into its frequency components. The FFT, along with its inverse, is widely used in many applications in science, engineering, mathematics, and medicine. The FFT reduces the computational workload of the DFT from O(n^2) down to O(n log n); however, in order to implement the FFT, a uniformly spaced set of data is required in both the time and frequency domains. In many applications, samples are nonuniform and multiple iterations of Fourier transforms are required. In order to overcome computational limitations, Nonuniform FFTs (NUFFTs) are often used. In recent years, a number of algorithms have been developed to solve this type of problem. These NUFFTs are derived by combining interpolation and the use of the traditional FFT on an oversampled uniform space. This project addresses the basics of the Fourier transform as well as the DFT, the derivation of the FFT, motivation for NUFFTs, and the derivation of one NUFFT algorithm.
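The reduction from O(n^2) to O(n log n) can be seen by placing the direct DFT sum next to a recursive radix-2 Cooley-Tukey FFT. This is a minimal sketch for uniformly spaced samples whose count is a power of two; it does not cover the nonuniform case, and the function names and test signal are illustrative assumptions.

```python
import cmath

def dft(x):
    """Direct O(n^2) evaluation of the discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # transform of even-indexed samples
    odd = fft(x[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

# A sine wave completing two cycles over eight uniform samples.
samples = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
spectrum = fft(samples)
max_diff = max(abs(a - b) for a, b in zip(dft(samples), spectrum))
print(max_diff, [round(abs(c), 6) for c in spectrum])
```

Each level of recursion does O(n) work combining two half-size transforms, and there are log2(n) levels, which gives the O(n log n) cost.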
This project compares the l1 and l2 norms for signal reconstruction from noisy measurements. Suppose f is our unknown (an n×1 vector). We would like to recover f from a given data vector b, where f and b are related such that Af + e = b. Here, A is m×n and e is an unknown vector of errors. Then f can be approximated by solving the minimization problem min_f ||Af − b||. A popular method of solving this problem is least squares, which minimizes the l2 norm. However, the least squares method can perform poorly when the errors on the signal have large magnitude, even if they are few. This provides the motivation for solving the minimization problem with the l1 norm. It has been shown that if certain conditions are met on both A and e, solving the minimization problem with the l1 norm is equivalent to solving it with l0. In this project, we explore some numerical examples to illustrate the effectiveness of recovering a signal using the l1 norm.
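The sensitivity of least squares to a few large errors shows up already in the simplest special case, assumed here for illustration: A is a column of ones, so f is a single unknown value measured m times. In that case the l2 minimizer is the mean of b and the l1 minimizer is the median of b. The measurements below are made up for the example.

```python
import statistics

true_value = 5.0
# Hypothetical measurements b = A f + e with A a column of ones:
# small noise on every entry, plus two large sparse errors.
b = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 45.0, 5.1, 4.9, 60.0, 5.0]

f_l2 = statistics.mean(b)    # minimizes sum of (f - b_i)^2
f_l1 = statistics.median(b)  # minimizes sum of |f - b_i|

print(f"l2 estimate: {f_l2:.2f}, l1 estimate: {f_l1:.2f}")
```

The two outliers pull the l2 estimate far from the true value, while the l1 estimate is unaffected because the median depends only on the ordering of the measurements.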
This project extended the traditional 2D convolutional neural network to 3D. Fourier convolution is investigated to deal with the increased computational cost brought by the extra dimension. Numerical results show that our model is able to achieve nine percent testing error on the ModelNet10 dataset, which is comparable to the best result reported in 2015.
Bootstrap provides a simple but powerful way of assessing the quality of estimators (“assessors”). However, when working with big or massive data sets, most computers cannot keep up with the computationally demanding process required for bootstrap. Variants of bootstrap have been developed to deal with computational costs. This project explores the Bag of Little Bootstraps, a proposed bootstrap technique for big and massive data.
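The idea behind the Bag of Little Bootstraps can be sketched as follows: draw a few small subsets of size b = n^γ, and within each subset simulate resamples of the full size n by drawing multinomial counts over the b points, so that every resample touches only b distinct values. The per-subset quality assessments are then averaged. This minimal sketch estimates the standard error of the sample mean; the function name, the values γ = 0.6, 5 subsets, and 30 resamples, and the simulated data are illustrative assumptions.

```python
import random
import statistics
from collections import Counter

def blb_standard_error(data, n_subsets=5, n_resamples=30, gamma=0.6, seed=0):
    """Bag of Little Bootstraps estimate of the standard error of the mean."""
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))       # size of each small subset
    subset_estimates = []
    for _ in range(n_subsets):
        subset = rng.sample(data, b)  # subset drawn without replacement
        means = []
        for _ in range(n_resamples):
            # Multinomial counts: a size-n resample supported on b points.
            counts = Counter(rng.choices(range(b), k=n))
            means.append(sum(subset[i] * c for i, c in counts.items()) / n)
        subset_estimates.append(statistics.stdev(means))
    # Average the per-subset quality assessments.
    return statistics.mean(subset_estimates)

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
se_blb = blb_standard_error(data)
se_theory = statistics.stdev(data) / len(data) ** 0.5
print(f"BLB: {se_blb:.4f}, s / sqrt(n): {se_theory:.4f}")
```

For the mean, the textbook standard error s/√n is available as a check. The appeal of the method is that the same procedure applies to estimators without such a closed form, while each resample stores only b values and their counts.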
APM 506 Computational Methods
APM 501 Differential Equations 1
APM 505 Applied Linear Algebra
APM 506 Computational Methods
Undergraduate students are welcome to join the RTG. The most common ways to engage in research under the supervision of one of the RTG faculty are honors theses and research assistantships. The latter is most likely to take place during the summer. Limited NSF funding is available for qualifying US citizens and permanent residents. If you are interested, email one of the faculty members.
Possible projects are listed below.