Phylogenetic Reconstruction Analysis on Gene Order and Copy Number Variation

Friday, November 1, 2019 - 11:00 am
Meeting Room 2265, Innovation Center
DISSERTATION DEFENSE
Department of Computer Science and Engineering, University of South Carolina
Author: Ruofan Xia
Advisor: Dr. Jijun Tang

Abstract

Genome rearrangement is one of the main evolutionary mechanisms at the genomic level. Phylogenetic analysis based on rearrangements has played a crucial role in biological research over the past decades, especially with the increasing availability of fully sequenced genomes. In general, phylogenetic analysis aims to solve two problems: the Small Parsimony Problem (SPP) and the Big Parsimony Problem (BPP). Maximum parsimony, a popular approach to both, relies on iteratively solving an NP-hard problem, the median problem. As a result, current median solvers, and the phylogenetic inference methods built on them, face serious scalability problems and cannot be applied to datasets with large and distant genomes. In this thesis, we propose a new median solver for gene order data that combines double-cut-and-join (DCJ) sorting with simulated annealing (SAMedian). On top of this median solver, we built a new phylogenetic inference method that solves both the SPP and the BPP. Our experimental results show that the new median solver achieves excellent performance on simulated datasets and that the phylogenetic inference tool built on it outperforms other existing methods.

Cancer is known for its heterogeneity and is regarded as an evolutionary process driven by somatic mutations and clonal expansions. This process can be modeled by a phylogenetic tree, and phylogenetic analysis of multiple subclones of cancer cells can facilitate the study of tumor progression. Copy-number aberrations, in the form of segmental amplifications and deletions, occur frequently in many types of tumors. In this thesis, we developed a distance-based method for reconstructing phylogenies from copy-number profiles of cancer cells. We demonstrate the importance of correcting the edit (minimum) distance to an estimate of the actual number of events. Experimental results show that our approaches are accurate and scalable, both in estimating the actual number of evolutionary events between copy-number profiles and in reconstructing phylogenies.

High-throughput sequencing of tumor samples has revealed varying degrees of genetic heterogeneity between primary tumors and their distant subpopulations. The clonal theory of cancer evolution holds that tumor cells descend from a common origin cell; this origin cell carries an advantageous mutation that triggers a clonal expansion, producing a large population of descendant cells. To further investigate cancer progression, phylogenetic analysis of the tumor cells is imperative. In this thesis, we developed a novel approach to infer phylogenies from both Next-Generation Sequencing and Long-Read Sequencing data. Experimental results show that the proposed method infers the entire phylogenetic progression very accurately on both data types.

In summary, this thesis focuses on phylogenetic analysis of both gene order data and copy number variations, in three parts. First, we developed a new median solver for the median problem and phylogeny inference under the DCJ model and applied it to both simulated data and real yeast data. Second, we explored a new approach to infer phylogenies from copy-number profiles across a wide range of parameters (e.g., different numbers of leaf genomes, different numbers of positions in the genome, and different tree diameters). Third, we proposed a novel approach to phylogenetically analyze the entire expansion process of cancer cells from high-throughput sequencing data, covering both Next-Generation Sequencing and Long-Read Sequencing.
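The idea of exploring candidate medians with simulated annealing can be sketched generically. The toy below is illustrative only, not the thesis code: integers stand in for genomes and absolute differences for DCJ distances, and all names are made up for the sketch. It anneals toward a point minimizing the total distance to three inputs.

```python
import math
import random

def anneal_median(inputs, neighbor, dist, t0=10.0, cooling=0.95, steps=500, seed=0):
    """Simulated-annealing skeleton for a median-like objective:
    minimize the summed distance from one candidate to all inputs."""
    rng = random.Random(seed)
    cost = lambda cand: sum(dist(cand, g) for g in inputs)
    current = inputs[0]
    cur_cost = cost(current)
    best, best_cost = current, cur_cost
    t = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        cand_cost = cost(cand)
        delta = cand_cost - cur_cost
        # Always accept improvements; accept worse moves with Boltzmann
        # probability exp(-delta / t) so the search can escape local optima.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = current, cur_cost
        t *= cooling  # geometric cooling schedule
    return best, best_cost

# Toy instance: "genomes" are integers, the "distance" is |x - y|;
# the true median of 3, 7, 11 is 7, with total distance 8.
best, best_cost = anneal_median(
    [3, 7, 11],
    neighbor=lambda x, rng: x + rng.choice([-1, 1]),
    dist=lambda x, y: abs(x - y),
)
```

In the actual solver, the neighbor move would be a DCJ operation on a candidate genome and the cost a sum of DCJ distances; the annealing loop itself is unchanged.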

Properties, Learning Algorithms and applications of Chain Graphs and Bayesian Hypergraphs

Wednesday, October 23, 2019 - 11:00 am
Meeting Room 2277, Innovation Center
DISSERTATION DEFENSE
Department of Computer Science and Engineering, University of South Carolina
Author: Mohammad Ali Javidian
Advisor: Dr. Marco Valtorta

Abstract

Probabilistic graphical models (PGMs) use graphs, either undirected, directed, or mixed, to represent possible dependencies among the variables of a multivariate probability distribution. PGMs, such as Bayesian networks and Markov networks, are now widely accepted as a powerful and mature framework for reasoning and decision making under uncertainty in knowledge-based systems. With their increasing popularity, the range of graphical models being investigated and used has also expanded. Several types of graphs with different conditional independence interpretations, also known as Markov properties, have been proposed and used in graphical models. The graphical structure of a Bayesian network is a directed acyclic graph (DAG), which has the advantage of supporting an interpretation of the graph in terms of cause-effect relationships. A limitation, however, is that a DAG can model only asymmetric relationships between variables, such as cause and effect. Chain graphs, which admit both directed and undirected edges, can be used to overcome this limitation. Three main interpretations of chain graphs exist in the literature: the Lauritzen-Wermuth-Frydenberg (LWF), the Andersson-Madigan-Perlman, and the multivariate regression interpretations. In this thesis, we study these interpretations based on their separation criteria and the intuition behind their edges. Since structure learning is a critical component in constructing an intelligent system based on a chain graph model, we propose new feasible and efficient structure learning algorithms that learn chain graphs from data under the faithfulness assumption.

The proliferation of PGMs that allow factorizations of different kinds leads us to consider a more general graphical structure in this thesis, namely directed acyclic hypergraphs: the graphical structure of a new probabilistic graphical model that we call Bayesian hypergraphs. Since there are many more hypergraphs than DAGs, undirected graphs, chain graphs, and, indeed, other graph-based networks, Bayesian hypergraphs can model much finer factorizations and are thus more computationally efficient. Bayesian hypergraphs also allow a modeler to represent causal patterns of interaction, such as Noisy-OR, graphically and without additional annotations. We introduce global, local, and pairwise Markov properties of Bayesian hypergraphs and prove under which conditions they are equivalent. We define a projection operator, called shadow, that maps Bayesian hypergraphs to chain graphs, and show that the Markov properties of a Bayesian hypergraph are equivalent to those of its corresponding chain graph. We extend the causal interpretation of LWF chain graphs to Bayesian hypergraphs and provide corresponding formulas and a graphical criterion for intervention.

The framework of graphical models provides algorithms for discovering and analyzing structure in complex distributions, for describing that structure succinctly, and for extracting unstructured information, allowing such models to be constructed and utilized effectively. Two of the most important applications of graphical models are causal inference and information extraction. To demonstrate these abilities, we conduct a causal analysis comparing the performance behavior of highly configurable systems across environmental conditions (changing workload, hardware, and software versions), exploring when and how causal knowledge can be exploited for performance analysis.
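For illustration, the Noisy-OR interaction mentioned above has a simple closed form. The sketch below (parameter names are mine, not taken from the thesis) computes P(Y = 1) given which parent causes are active:

```python
def noisy_or(active, p_trigger, leak=0.0):
    """P(Y = 1) under a Noisy-OR gate.

    active    -- booleans, one per parent cause (is the cause present?)
    p_trigger -- per-parent probability that the cause, if present,
                 triggers the effect on its own
    leak      -- probability the effect occurs with no cause present
    """
    p_y_off = 1.0 - leak
    for is_active, p in zip(active, p_trigger):
        if is_active:
            # each active cause independently fails with probability 1 - p
            p_y_off *= 1.0 - p
    return 1.0 - p_y_off

# Two active causes with trigger probabilities 0.9 and 0.8, no leak:
# P(Y = 1) = 1 - 0.1 * 0.2 = 0.98
p = noisy_or([True, True], [0.9, 0.8])
```

A DAG-based model would need an annotated conditional probability table to express this; the point of Bayesian hypergraphs is that such an interaction pattern can be read off the graph structure itself.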

Stacked Modelling Framework

Monday, October 21, 2019 - 03:30 pm
Meeting Room 2265, Innovation Center
DISSERTATION DEFENSE
Department of Computer Science and Engineering, University of South Carolina
Author: Kareem Abdelfatah
Advisor: Dr. Gabriel Terejanu

Abstract

The thesis develops a predictive modeling framework based on stacked Gaussian processes and applies it to two main applications in environmental and chemical engineering. First, a network of independently trained Gaussian processes (StackedGP) is introduced to obtain analytical predictions of quantities of interest (model outputs) with quantified uncertainties. The StackedGP framework supports component-based modeling in fields such as environmental and chemical science, enhances predictions of quantities of interest through a cascade of intermediate predictions (a task usually addressed by cokriging), and propagates uncertainties through emulated dynamical systems driven by uncertain forcing variables. Using analytical first- and second-order moments of a Gaussian process with uncertain inputs, under squared exponential and polynomial kernels, approximate expectations of model outputs that require an arbitrary composition of functions can be obtained. The performance of the proposed nonparametric stacked model in model composition and cascading predictions is measured on several applications and datasets. The framework has been evaluated on wildfire and mineral resource problems using real data, and its application to time-series prediction is demonstrated on a 2D puff advection problem.

In addition, StackedGP is applied to a challenging environmental problem: the prediction of mycotoxins. In this part of the work, we develop a stacked Gaussian process using both field and wet-lab measurements to predict fungal toxin (aflatoxin) concentrations in corn in South Carolina. While most aflatoxin contamination issues associated with the post-harvest period in the U.S. can be controlled with expensive testing, a systematic and economical approach is lacking for determining how pre-harvest aflatoxin risk affects crop producers, since aflatoxin is virtually unobservable on a geographical and temporal scale. This information gap carries significant cost burdens for grain producers, and it is filled by the proposed stacked Gaussian process. The novelty of this part is twofold. First, probabilistic aflatoxin maps are obtained using an analytical scheme to propagate uncertainty through the stacked Gaussian process. The model predictions are validated both at the level of individual Gaussian process components and at the system level for the entire stacked Gaussian process, using historical field data. Second, a novel derivation is introduced to calculate the analytical covariance of aflatoxin production at two geographical locations. As in kriging/Gaussian process regression, this covariance is used to predict aflatoxin at unobserved locations from measurements at nearby locations, but with the prior mean and covariance provided by the stacked Gaussian process. As field measurements arrive, this measurement-update scheme may be used to target field inspections and to warn farmers of emerging aflatoxin contamination.

Lastly, we apply the StackedGP framework to a chemical engineering application. Computational catalyst discovery involves identifying a meaningful model and suitable descriptors that determine catalyst properties. First, we study the impact of combining various descriptors (e.g., reaction energies, metal descriptors, and bond counts) for modeling transition state (TS) energies, based on a database of adsorption and TS energies across transition metal surfaces (Pd(111), Pt(111), Ni(111), Ru(0001), and Rh(111)) for the decarboxylation and decarbonylation of propionic acid, a chemistry characteristic of biomass conversion. Results from different machine learning models over more than 1330 descriptor combinations suggest that there is no statistically significant difference between linear and nonlinear models when the right combination of reactant energies, metal descriptors, and bond counts is used. However, linear models are inferior when bond counts and metal descriptors are not included. Furthermore, when data for reaction steps are missing on all metals, conventional linear scaling is inferior to linear and nonlinear models with a proper choice of descriptors, which prove surprisingly robust. Finally, the StackedGP framework is evaluated for modeling the adsorption and transition state energies as functions of metal descriptors, with data from all metal surfaces. Given these energies, the turnover frequency (TOF) can be estimated using microkinetic models.
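The stacking idea can be illustrated minimally: chain two Gaussian process regressors so that one GP's output becomes the next GP's input. This sketch is not the thesis code and sidesteps the analytical moment propagation described above by simply feeding the first stage's posterior mean into the second stage; all data and settings are invented for the example.

```python
import numpy as np

def rbf(a, b, ls=0.2, var=1.0):
    """Squared exponential kernel on 1-D inputs."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def gp_predict(x_train, y_train, x_test, ls=0.2, noise=1e-6):
    """Posterior mean of GP regression (noise acts as jitter)."""
    K = rbf(x_train, x_train, ls) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_test, ls)
    alpha = np.linalg.solve(K, y_train)
    return Ks.T @ alpha

# Stage 1 emulates x -> u; stage 2 emulates u -> y. Stacking the two
# GPs yields a prediction of the composition y(u(x)).
x = np.linspace(0.0, 1.0, 10)
u = np.sin(2 * np.pi * x)   # intermediate quantity
y = u ** 2                  # final quantity of interest

u_hat = gp_predict(x, u, np.array([0.25]))  # stage-1 prediction at x = 0.25
y_hat = gp_predict(u, y, u_hat)             # stage-2 prediction at u_hat
```

The thesis goes further than this sketch: instead of passing only the mean between stages, it propagates the first and second moments analytically, so the downstream GP sees an uncertain input rather than a point estimate.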

ACM Codeathon

Friday, October 18, 2019 - 07:00 pm
Swearingen 2A11
We're excited to invite you to the Fall 2019 Code-a-Thon! The competition begins this Friday, October 18th, at 7 pm. You may attend in person in the Swearingen 2A11 computer lab or participate online at the links provided at the bottom of this email.

Every semester, our chapter of the Association for Computing Machinery hosts a 24-hour Code-a-Thon (coding competition) open to all University of South Carolina students. Come solve problems, eat (FREE) pizza, find internships, and battle for prizes at ACM's most anticipated event of the semester. Do not be intimidated by the 24-hour run time! While the Code-a-Thon does indeed run 24 hours (7 PM - 7 PM), you do not have to be present for all of it. Submitting problems from home is allowed and welcomed, especially for alumni not in Columbia. There will be pizza, snacks, and drinks at the event.

We have 4 divisions so that you do not have to compete with people of wildly different skill levels. Please note that the divisions are split by CURRENT course enrollment: if you have completed CSCE 146 and are currently enrolled in CSCE 240, you compete in the 240 division. However, if you have completed CSCE 146 but are not currently enrolled in CSCE 240, you are eligible to compete in the 146 division. Competitors not in a computing degree program at USC should self-select into the 145, 146, 240, or 350 division by rating themselves as (1) absolute beginner, (2) beginner with some coding experience, (3) intermediate, or (4) advanced, respectively. Prizes are awarded for 1st, 2nd, and 3rd place in each division. Prizes have not yet been determined, but past prizes have included Arduinos, gift cards, flash drives, and USB keyboards.

Check out the news post on our ACM chapter's website for more information and links to the competition: https://acm.cse.sc.edu/events/2019-10-18-code-a-thon.html

Once the competition starts, you can click on the links to join your division:
145: introductory programming questions
146: data structures questions
240: algorithms questions
350: advanced algorithms questions

Look forward to seeing you there! Please reach out to me at hdamron@email.sc.edu with any questions.

Hunter Damron
ACM Student Chapter President

Algorithms for Autonomous Near Earth Sensing with Robot Teams

Friday, October 18, 2019 - 02:20 pm
Storey Innovation Center, Room 1400
Speaker: Pratap Tokekar
Affiliation: University of Maryland
Location: Innovation Center, Room 1400
Time: Friday 10/18/2019, 2:20 - 3:10 pm

Abstract: A connected network of robots and smart sensors can solve grand challenges in domains such as agronomy, oceanography, and infrastructure monitoring. Robots can collect data from hard-to-reach places at unprecedented spatio-temporal scales. In this talk, I will present our recent work on devising efficient algorithms for data collection with heterogeneous robot teams. We use tools from stochastic optimization, Bayesian statistics, combinatorial optimization, and information theory to design these algorithms. I will present our experimental results on bridge inspection with aerial robots, precision agriculture with aerial and ground robots, and monitoring of marine environments with aerial robots and robotic boats. I will conclude by discussing our recent efforts to bridge the gap between algorithmic and field robotics.

Bio: Pratap Tokekar is an Assistant Professor in the Department of Computer Science at the University of Maryland. Between 2015 and 2019, he was an Assistant Professor in the Department of Electrical and Computer Engineering at Virginia Tech. Previously, he was a Postdoctoral Researcher at the GRASP lab of the University of Pennsylvania. He obtained his Ph.D. in Computer Science from the University of Minnesota in 2014 and a Bachelor of Technology degree in Electronics and Telecommunication from the College of Engineering Pune, India, in 2008. He is a recipient of the NSF CISE Research Initiation Initiative award and an Associate Editor for IEEE Robotics & Automation Letters and IEEE Transactions on Automation Science & Engineering.

AI-Enabled Learning. Anyone. Anywhere. Anytime.

Friday, October 18, 2019 - 11:00 am
Sumwalt 305
ABSTRACT: The grand challenge for the Georgia Tech emPrize team in the XPrize AI competition is to make human learning more accessible, effective, and achievable. We are developing a coordinated set of AI techniques to assist human learning, including (1) virtual cognitive tutors for learning domain concepts and skills; (2) virtual teaching assistants that answer questions and respond to student introductions on online discussion forums; (3) a question-asking virtual teaching assistant for formative assessment that helps learners refine their business models for spawning startups; (4) a virtual research assistant for literature review that helps engineering students locate and understand biology articles relevant to design problems; and (5) a virtual research assistant that helps biology students conduct computational experiments in ecology. I will present the current status of these technologies, including their assessment and evaluation, focusing on the virtual teaching and research assistants. I will describe how our notion of AI has changed during the course of this program from a technology into a socio-technical system, and, in particular, from a tool to a medium to a social actor.

BIO: Ashok Goel is Professor of Computer Science and Human-Centered Computing in the School of Interactive Computing at the Georgia Institute of Technology and Chief Scientist for Georgia Tech's Center for 21st Century Universities. He conducts research in artificial intelligence and cognitive science, with a focus on computational design and creativity and, more recently, AI-powered learning and education. He is the Editor of AAAI's AI Magazine and a Co-Chair of the 41st Annual Meeting of the Cognitive Science Society. He is a co-editor of the volume Blended Learning in Practice: A Guide for Practitioners and Researchers, published by MIT Press in April 2019. dilab.gatech.edu/ashok-k-goel

Technology and Ocean Exploration

Wednesday, October 9, 2019 - 11:50 am
Bert Storey Innovation Center, Room 1400
550 Assembly St., Columbia, SC

Abstract: The seminar will include a segment on how to properly set up a program/project for the ocean environment.

"The Navy selected Boeing and Lockheed Martin to pursue the Extra Large Unmanned Underwater Vehicle (XLUUV) program, which will test the Navy's ability to manage a rapid-acquisition project and the Navy-industry team's ability to develop and integrate an unmanned system that operates completely independently of manned ships... Lance Towers, Boeing's director of Advanced Technology Programs for Autonomous Systems, told USNI News that the extra-large UUV concept will ultimately cost the Navy less money than trying to conduct similar missions with a smaller unmanned vehicle, due to the XLUUV operating completely independently of a manned ship or submarine. Smaller 'tactical' UUVs not only require a manned ship to be nearby to serve as a host, but their operations can also be limited by bad weather or other factors that affect that host ship."

Speaker: Lance Towers, P.E.
Director, Autonomous Maritime Site Executive, Huntington Beach, Calif., and Herndon, Va.
Autonomous Systems, Defense, Space & Security

Cybersecurity Issues in the Context of Cryptographic Shuffling Algorithms and Concept Drift: Challenges and Solutions

Monday, October 7, 2019 - 03:30 pm
Meeting Room 2267, Innovation Center
DISSERTATION DEFENSE
Department of Computer Science and Engineering, University of South Carolina
Author: Hatim Alsuwat
Advisors: Dr. Csilla Farkas and Dr. Marco Valtorta

Abstract

In this dissertation, we investigate and address two kinds of data integrity threats. We first study the limitations of secure cryptographic shuffling algorithms with respect to preserving data dependencies. We then study the limitations of machine learning models with respect to concept drift detection, and we propose solutions to address both threats.

Shuffling algorithms have been used to protect the confidentiality of sensitive data. However, these algorithms may not preserve data dependencies, such as functional dependencies and data-driven associations. We present two solutions to address these shortcomings: (1) a functional-dependency-preserving shuffle and (2) a data-driven-association-preserving shuffle. For preserving functional dependencies, we propose a method based on Boyce-Codd Normal Form (BCNF) decomposition. Instead of shuffling the original relation, we recommend shuffling each BCNF decomposition; the final shuffled relation is constructed by joining the shuffled decompositions. We show that our approach is lossless and preserves functional dependencies if the BCNF decomposition is dependency preserving. For preserving data-driven associations, we generate the transitive closure of the sets of associated attributes, and the attributes of each set are bundled together during shuffling.

Concept drift is a significant challenge that greatly influences the accuracy and reliability of machine learning models. There is, therefore, a need to detect concept drift in order to ensure the validity of learned models. We study concept drift in the context of discrete Bayesian networks. We propose a probabilistic graphical model framework that explicitly detects the presence of concept drift using latent variables, which we employ to model real concept drift and uncertainty drift over time. For modeling real concept drift, we propose monitoring the mean of the distribution of the latent variable over time. For modeling uncertainty drift, we suggest monitoring the change in belief of the latent variable over time, i.e., the maximum value that the probability density function of the distribution takes over time. We also propose a probabilistic graphical model framework, based on latent variables, that provides an explanation of the detected posterior probability drift across time.

Our results show that neither cryptographic shuffling algorithms nor machine learning models are robust against data integrity threats; our proposed approaches, however, are capable of detecting and mitigating such threats.
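As a toy illustration of the decomposition idea (a hypothetical three-column relation, not data or code from the dissertation), shuffling the sensitive column only within the projection R2(A, C) leaves the projection R1(A, B), and hence the functional dependency A -> B, intact:

```python
import random

# Toy relation R(A, B, C) satisfying the functional dependency A -> B.
R = [(1, 'x', 10), (2, 'y', 20), (1, 'x', 30), (3, 'z', 40)]

def fd_preserving_shuffle(rows, seed=0):
    """Sketch of the BCNF-based approach: decompose R on A -> B into
    R1(A, B) and R2(A, C), shuffle the sensitive values inside R2 only,
    then join the pieces back on A. Because every (A, B) pair is kept
    whole in R1, the dependency A -> B survives the shuffle."""
    rng = random.Random(seed)
    r1 = {a: b for a, b, _ in rows}        # R1(A, B): the FD itself
    a_col = [a for a, _, _ in rows]        # R2(A, C), split by column
    c_col = [c for _, _, c in rows]
    rng.shuffle(c_col)                     # break the A-to-C linkage
    return [(a, r1[a], c) for a, c in zip(a_col, c_col)]

def holds_fd(rows):
    """Check that A -> B holds: every A value maps to a single B."""
    seen = {}
    return all(seen.setdefault(a, b) == b for a, b, _ in rows)

shuffled = fd_preserving_shuffle(R, seed=1)
assert holds_fd(shuffled)                  # dependency preserved
```

A naive column-wise shuffle of the whole relation, by contrast, can pair A = 1 with B = 'y' and silently break the dependency, which is exactly the failure mode the dissertation's construction avoids.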

On the complexity of fault-tolerant consensus

Friday, September 27, 2019 - 02:20 pm
Storey Innovation Center (Room 1400)
Speaker: Dariusz Kowalski
Affiliation: Augusta University
Location: Innovation Center, Room 1400
Time: Friday 9/27/2019, 2:20 - 3:10 pm

Abstract: The problem of reaching agreement in a distributed message-passing system prone to failures is one of the most important and vibrant problems in distributed computing. This talk concentrates on the impact of process crashes on the performance of consensus. Crashes are generated by constrained adversaries: a weakly-adaptive adversary, who must fix, in advance, the set of f crash-prone processes, and a chain-adaptive adversary, who orders all the processes into k disjoint chains and must follow this order when crashing them. Apart from these constraints, both may crash processes adaptively at any time, thus emulating (almost) worst-case failure scenarios against a given algorithm. While the commonly used strongly-adaptive adversaries model attacks, and non-adaptive ones model pre-defined faults, the constrained adversaries model more realistic scenarios in which there are fault-prone dependent processes, e.g., in hierarchical or dependable software/hardware systems. In this view, our approach helps in better understanding crash-tolerant consensus under more realistic executions. We propose time-efficient consensus algorithms against such adversaries. We complement our algorithmic results with (almost) tight lower bounds, and extend the bound for weakly-adaptive adversaries to hold also for (syntactically) weaker non-adaptive adversaries. Together with the consensus algorithm against weakly-adaptive adversaries (which automatically translates to non-adaptive adversaries), these results extend the state of the art for the popular class of non-adaptive adversaries, in particular the result of Chor, Merritt, and Shmoys [JACM 1989], and prove a separation gap between the constrained adversaries (including the non-adaptive ones) and the strongly-adaptive adversaries analyzed by Bar-Joseph and Ben-Or [PODC 1998] and others.

Host: Pooyan Jamshidi, https://pooyanjamshidi.github.io/