Speaker: Christoph Koch From Cornell
Title: MayBMS: A Probabilistic Database Management System
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605.
When: 11:00 AM (Thursday, July 3rd, 2008)
Abstract:
Databases that contain uncertain data arise naturally in many data management scenarios, such as Web information extraction, data cleaning, data integration, sensor data management, and scientific databases. There are currently no scalable systems for managing and querying such databases. In this talk I present MayBMS, a database management system for efficiently managing and processing large collections of uncertain data that is currently under development at Cornell. MayBMS is based on a clean yet expressive query language that captures many important use cases of probabilistic databases, including what-if queries and the conditioning of databases using new evidence. MayBMS employs a carefully designed succinct representation system for probabilistic databases called U-relations, which nicely unifies various approaches to representing uncertain data, such as c-tables, relational decomposition, and probabilistic graphical models. U-relations allow for the natural reuse of mature relational storage, indexing and query processing techniques to build scalable probabilistic database systems. In addition to the exact processing of probabilistic database queries on U-relations, I discuss the efficient approximation of expressive, compositional queries on probabilistic databases.
Bio:
Christoph Koch is an associate professor of computer science at Cornell University. He is interested in both the theoretical and systems-oriented aspects of data management, and currently works on managing uncertain data, community data management systems, data-driven games, data integration, and Web information extraction and management.
Speaker: Sam Madden From MIT
Title: Column-Oriented Databases: Where's the Beef?
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605.
When: 11:30 AM (Friday, June 13th, 2008)
Abstract:
Vertical partitioning is a well-established technique for improving query processing performance in relational database systems. Surprisingly, the database community has recently unleashed a flurry of research projects (C-Store, MonetDB) and startup companies (Vertica, InfoBright, ParAccel) proposing "column-oriented databases", which appear to be nothing more than a conventional database with a fully vertically partitioned storage system. In this talk, I will describe how our work on the C-Store system goes beyond simple vertical partitioning. I will begin with an overview of column-oriented technology and its applications and then focus on the unusual aspects of the design of the storage system and query executor. I will also describe a series of experiments that show why vertical partitioning in a conventional database does not perform as well as a system designed from the ground up to support columns, showing that our academic prototype can achieve order-of-magnitude performance improvements over a commercial database on a recently proposed data warehousing benchmark.
Bio:
Samuel Madden is an Associate Professor in the EECS Department and CSAIL at MIT. He is a specialist in networked data management and database systems. He is the author of the TinyDB system for sensor network data collection, the co-creator of the CarTel mobile sensor network system for automobiles, one of the architects of the C-Store database system, and a co-founder of Vertica Systems, a database startup commercializing column-stores. He has published articles in top computer science conferences, including SIGMOD, SenSys, and OSDI, on data acquisition and processing, database optimization, query planning, and distributed databases. Madden received the NSF CAREER Award in 2004 and the Sloan Fellowship in 2006, was named one of Technology Review's Top 35 Under 35 in 2006 for his work in data management in sensor networks, and won best paper awards at VLDB 2004 and 2007 and MobiCom 2006.
Speaker: Chris Olston From Yahoo! Research
Title: Processing Web-Scale Data with Pig
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605.
When: 10:30 AM - 11:30 AM (Tuesday, February 19th)
Abstract:
There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of logs collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. In this talk I will describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. Pig is an open-source Apache incubator project and is available for general use. The talk will also cover some of the other topics we are addressing in the Pig project, including: (1) data sampling and synthesis techniques to assist in query debugging, (2) how to schedule queries that can share work, (3) adaptive approaches to physical database design, and (4) adaptive data placement techniques.
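The low-level style of map-reduce that the abstract contrasts with SQL can be illustrated with a minimal, single-process sketch. This is not Pig, Pig Latin, or Hadoop code; the `map_reduce` helper, the sample `log` records, and the per-URL hit count are hypothetical names and data chosen only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def map_reduce(
    records: Iterable[tuple],
    mapper: Callable[[tuple], Iterable[Tuple[Hashable, int]]],
    reducer: Callable[[Hashable, List[int]], int],
) -> Dict[Hashable, int]:
    """Toy in-memory map-reduce: map each record, group by key, reduce each group."""
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for record in records:
        # Map phase: each record emits zero or more (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: each key's values are combined into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical click log of (user, url) pairs.
log = [("alice", "/home"), ("bob", "/home"), ("alice", "/cart")]

# Counting hits per URL requires hand-written mapper and reducer functions.
hits = map_reduce(
    log,
    mapper=lambda rec: [(rec[1], 1)],
    reducer=lambda url, ones: sum(ones),
)
print(hits)
```

Even this small grouping query needs custom mapper and reducer code, which is the kind of boilerplate the abstract says a higher-level dataflow language is meant to avoid.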
Bio:
Christopher Olston is a senior research scientist at Yahoo! Research, after a stint as assistant professor at Carnegie Mellon University from 2003 to 2005. His research interests include data management and web search. Olston received his Ph.D. in 2003 from Stanford University, where he was supported by fellowship awards from the National Science Foundation and the Stanford Graduate Fellowship program. Prior to attending graduate school, he received the 1998 Computing Research Association Award for Outstanding Undergraduates for his work at UC Berkeley. Olston is an avid Cal fan but likes to rollerblade at Stanford.
Distinguished Lecturer Series
Speaker: Joe Hellerstein From UC Berkeley
Title: Declarative Networking: "What" is Next.
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, EEB-105.
When: 3:30 PM - 4:30 PM (Thursday, February 7)
Abstract:
Declarative languages allow programmers to say *what* they want, without worrying over the details of *how* to achieve it. These kinds of languages revolutionized data management decades ago (SQL, spreadsheets), but have had limited success in other aspects of computing. The story seems to be changing in recent years, however. One new chapter is work that my colleagues and I have been pursuing on the design and implementation of declarative languages and runtime systems for network protocol specification. Distributed Systems and Networking appear to be surprisingly natural domains for declarative specifications, and -- given recent interest in revisiting Internet Architecture from scratch -- these domains are ripe for a new programming methodology. The results of our first phase of research have been exciting: we have implemented complex networking infrastructure in 100x less code than traditional implementations, and our programs often match very closely (sometimes line-for-line) with pseudocode published by protocol inventors. As the work on core declarative networking has matured, a number of groups have begun pursuing related applications for declarative languages, including our own emerging work on hybrid protocol synthesis, distributed Machine Learning, and language metacompilation, as well as initial work by others on replication systems, modular robotics, security, distributed debugging, and consensus protocols. This talk will introduce the concepts of Declarative Networking, the state of the research agenda today, and new directions being pursued.
Bio:
Joseph M. Hellerstein is a Professor of Computer Science at the University of California, Berkeley, whose research focuses on data management and networking. His work has been recognized via awards including an Alfred P. Sloan Research Fellowship, MIT Technology Review's inaugural TR100 list, and two ACM-SIGMOD "Test of Time" awards. Key ideas from his research have been incorporated into commercial and open-source database software released by IBM, Oracle, and PostgreSQL. He has also held industrial posts including Director of Intel Research Berkeley and Chief Scientist of Cohera Corporation.
Speaker: Tova Milo from Tel Aviv University.
When: 11:00 AM - 12:00 PM (Friday, July 13th)
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.
Campus map pointing to our building.
Title: Querying and Monitoring Business Processes
Abstract:
We present in this talk BPQL, a novel query language for querying and monitoring business processes. The BPQL language is based on an intuitive model of business processes, an abstraction of the emerging BPEL (Business Process Execution Language) standard. It allows users to query business process specifications, as well as their run-time behavior, visually, in a manner analogous to how such processes are typically specified, and it can be employed in a distributed setting, where process components may be provided by distinct providers (peers).
We describe here the query language as well as its underlying formal model. We consider the properties of the various language components and explain how they influenced the language design. In particular, we distinguish features that can be efficiently supported from those that incur a prohibitively high cost or cannot be computed at all. We also present our implementation, which complies with real-life standards for business process specifications, XML, and Web services, and is used in the BPQL system.
Speaker: Anhai Doan from University of Wisconsin-Madison.
When: 3:00 - 4:00 PM (Friday, June 1st)
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.
Title: The Cimple Project on Community Information Management
Abstract:
In this talk I will give an overview of Cimple, a joint project between the University of Wisconsin-Madison and Yahoo! Research. Cimple develops a generic solution that crawls, extracts, and integrates data, to build structured "portals" for online communities. I will first describe the envisioned working of Cimple and our prototype, DBlife, which is a structured portal being developed for the database research community. Next, I describe the technical challenges underlying Cimple and our solution approaches. Finally, I discuss the connections between Cimple and research in data integration, information extraction, human computation, and Web data management. More information about Cimple can be found at http://www.cs.wisc.edu/~anhai/projects/cimple
Speaker: Deepak Patil from Microsoft.
When: 4:30 - 5:30PM (Monday, June 4th)
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.
Title: Scale and Manageability Challenges for high OLTP and VLDB Web Services
Abstract:
Microsoft’s Windows Live and MSN Web services are growing at a rapid pace. For the infrastructure that delivers these services to hundreds of millions of users in over 200 countries, the scale, quality, manageability, security, and performance challenges are daunting, yet exciting. Join the talk to hear a technical perspective on how Microsoft is taking these challenges on and maintaining its winning position. You will hear details on the scale of these services and on the technology and process challenges of delivering ‘able’ services from an operational point of view. As the company strives to maintain 99.99% availability and high transaction performance for its key web services, such as Windows Live Messenger, Hotmail, Search, and Spaces, its Global Foundation Services division is leaving no stone unturned with regard to manageability investments, operational intelligence through data mining, and service security. As the division manages petabytes of storage, both structured and unstructured, it is devising newer ways to deal with the challenges of data management, manipulation, abstraction, mining, transfer, and security. Hear the details of all this and more.
Speaker: David Maier from Portland State University.
When: 10:30 - 11:30AM (Friday, April 27th)
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.
Title: My Database Does Grids: Generating Data Products in the GridField Model
Abstract:
Scientists’ ability to generate and store simulation results is outpacing their ability to analyze them via ad hoc programs. We believe such programs have an algebraic structure that can be exploited to improve reasoning and performance. In this talk, we present the GridField model that exposes this structure and can be used to express, optimize, and reason about data transformations over gridded datasets. Grid structures are first-class citizens in the model, and operators can manipulate data on both structured and unstructured grids. As simulation results are primarily write-once, our implementation stores data in a column-oriented format that saves space and enables efficient algorithms. We advocate a light-weight design: Our services access the same native data representations as the scientists use themselves and can therefore coexist with legacy applications. Our evaluation of applicability and performance involves datasets from oceanography, seismology, and medicine.
In this talk I will discuss the requirements for representing gridded datasets, present the GridField model for organizing and manipulating such data, illustrate the definition and optimization of data products in the model, and briefly report on experimental evaluation.
Speaker: Patrick Valduriez from INRIA-Rennes.
When: 2:00 - 3:00 PM (Wednesday, April 18th)
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.
Title: Data Currency in Replicated DHTs (Joint work with Reza Akbarinia and Esther Pacitti, to appear at SIGMOD 2007.)
Abstract:
Distributed Hash Tables (DHTs) provide a scalable solution for data sharing in P2P systems. To ensure high data availability, DHTs typically rely on data replication, yet without data currency guarantees. Supporting data currency in replicated DHTs is difficult as it requires the ability to return a current replica despite peers leaving the network or concurrent updates. In this paper, we give a complete solution to this problem. We propose an Update Management Service (UMS) to deal with data availability and efficient retrieval of current replicas based on timestamping. For generating timestamps, we propose a Key-based Timestamping Service (KTS) which performs distributed timestamp generation using local counters. Through probabilistic analysis, we compute the expected number of replicas which UMS must retrieve for finding a current replica. Except for the cases where the availability of current replicas is very low, the expected number of retrieved replicas is typically small, e.g., if at least 35% of current replicas are available then the expected number of retrieved replicas is less than 3. We validated our solution through implementation and experimentation over a 64-node cluster and evaluated its scalability through simulation up to 10,000 peers using SimJava. The results show the effectiveness of our solution. They also show that our algorithm used in UMS achieves major performance gains, in terms of response time and communication cost, compared with a baseline algorithm.
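The idea of retrieving replicas until a current one is found can be sketched as follows. This is a simplified illustration, not the UMS or KTS algorithms from the paper: the `Replica` class, the `retrieve_current` and `expected_probes` functions, and the assumption that each probed replica is independently current with the same probability are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Replica:
    value: str
    timestamp: int  # timestamp attached when this replica was written


def retrieve_current(
    replicas: Iterable[Optional[Replica]], latest_ts: int
) -> Tuple[Optional[Replica], int]:
    """Probe replicas in order until one carries the latest timestamp.

    None stands for a replica whose peer has left the network.
    Returns the current replica (or None) and the number of probes made.
    """
    probes = 0
    for replica in replicas:
        probes += 1
        if replica is not None and replica.timestamp == latest_ts:
            return replica, probes
    return None, probes


def expected_probes(p_current: float) -> float:
    """Expected probes until the first current replica, assuming each probe
    succeeds independently with probability p_current (geometric mean 1/p)."""
    if not 0.0 < p_current <= 1.0:
        raise ValueError("p_current must be in (0, 1]")
    return 1.0 / p_current


# One stale replica, one unreachable peer, one current replica.
found, probes = retrieve_current(
    [Replica("old", 3), None, Replica("new", 5)], latest_ts=5
)
print(found, probes, expected_probes(0.35))
```

Under this simplified independence assumption, 35% availability gives 1 / 0.35, roughly 2.86 expected probes, which is consistent with the "less than 3" figure quoted in the abstract.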
Speaker: Daniel Abadi from MIT.
When: Friday, January 5th at 2pm.
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.
Title: Query Execution in Column-Oriented Database Systems
Abstract:
Recent research on column-oriented database systems (DBMSs) has shown that these systems can outperform existing row-oriented DBMSs by one to two orders of magnitude on read-mostly query workloads like those found in data warehouses, decision support, and customer relationship management systems. In this talk, I will discuss this exciting new class of database systems and will provide an overview of the C-Store system that we have developed over the past two years at MIT. I will then focus on the design of the column-oriented query execution engine I have developed. In particular, I will discuss the impact on query performance of tuple construction (stitching together attributes from multiple columns into a row-oriented "tuple") and operation on compressed data. Tuple construction allows column-oriented DBMSs to offer a standards-compliant relational database interface (e.g., ODBC, JDBC); however, if done at the wrong point in a query plan, a significant performance penalty is paid.
Similarly, data compression can improve query performance by an order of magnitude by trading cheap CPU cycles for expensive I/O bandwidth.
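Operating directly on compressed data can be illustrated with run-length encoding (RLE). The abstract does not say which compression schemes are used, so RLE is an assumption chosen for illustration, and the function names below are hypothetical rather than taken from C-Store.

```python
from itertools import groupby
from typing import List, Tuple

Run = Tuple[int, int]  # (value, run length)


def rle_encode(column: List[int]) -> List[Run]:
    """Collapse consecutive equal values into (value, run length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(column)]


def rle_sum(runs: List[Run]) -> int:
    """Sum the column without decompressing: one multiply per run."""
    return sum(value * length for value, length in runs)


def rle_count_equal(runs: List[Run], target: int) -> int:
    """Count rows equal to target by inspecting one entry per run."""
    return sum(length for value, length in runs if value == target)


column = [5, 5, 5, 7, 7, 5]
runs = rle_encode(column)
print(runs, rle_sum(runs), rle_count_equal(runs, 5))
```

The aggregates touch one entry per run instead of one per row, so fewer bytes need to be read and processed when the column contains long runs of repeated values.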
Speaker: Mirek Riedewald from Cornell.
When: Friday, December 1st at 2pm.
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.
Title: Indexing for Function Approximation
Abstract:
The availability of information technology is fundamentally changing the face of modern science. As the Report of the NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure states, "Scientists in many disciplines have begun revolutionizing their fields by using computers, digital data, and networks to replace and extend their traditional efforts. The calculations that can be performed and the information that can be archived and used are exploding." Efficient management of massive amounts of data is crucial for Cyberinfrastructure, also known as eScience, to succeed.
The research presented in this talk was motivated by scientific simulations. Simulation is one of the most powerful tools for studying and understanding real-world physical phenomena, but realistic mathematical models are often very complex and run for a large number of steps. It is infeasible to evaluate these models exactly at each step, and thus scientists trade accuracy for reduced simulation cost. We model high-dimensional function approximation (HFA) as a storage and retrieval problem, and we show that HFA defines a new class of applications for high-dimensional index structures. HFA imposes a mixed query-update workload on the index which leads to novel tradeoffs between efficiency of search versus updates. We present hardness results and we investigate in detail one specific approach to HFA based on Taylor Series expansions, analyzing the index design tradeoffs through a thorough experimental study.
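The mixed query-update workload described above can be sketched with a one-dimensional cache that answers queries from a first-order Taylor expansion around a stored point and falls back to exact evaluation (followed by an insert) on a miss. This is only an illustration under stated assumptions: the `TaylorCache` class, the fixed `radius` threshold, and the linear scan standing in for a high-dimensional index are hypothetical and are not the method presented in the talk.

```python
from typing import Callable, List, Tuple


class TaylorCache:
    """Approximate f near stored points using f(x0) + f'(x0) * (x - x0)."""

    def __init__(
        self,
        f: Callable[[float], float],
        df: Callable[[float], float],
        radius: float,
    ) -> None:
        self.f = f
        self.df = df
        self.radius = radius
        # Each entry is (x0, f(x0), f'(x0)); a real system would index these.
        self.points: List[Tuple[float, float, float]] = []
        self.exact_evaluations = 0

    def query(self, x: float) -> float:
        # Search: find the nearest stored expansion point (linear scan here).
        nearest = min(self.points, key=lambda p: abs(p[0] - x), default=None)
        if nearest is not None and abs(nearest[0] - x) <= self.radius:
            x0, fx0, dfx0 = nearest
            return fx0 + dfx0 * (x - x0)
        # Miss: evaluate exactly, then update the store with a new point.
        fx = self.f(x)
        self.exact_evaluations += 1
        self.points.append((x, fx, self.df(x)))
        return fx


cache = TaylorCache(f=lambda x: x * x, df=lambda x: 2 * x, radius=0.2)
first = cache.query(1.0)   # miss: exact evaluation, point stored
second = cache.query(1.1)  # hit: approximated from the point at 1.0
print(first, second, cache.exact_evaluations)
```

Every miss becomes an insert, so the structure holding the points sees interleaved searches and updates, which is the tradeoff between search and update efficiency that the abstract highlights.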
Speaker: David Andersen from CMU.
When: Friday, November 3rd at 2pm.
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 403.
Title: Easing the Pain of Network Data Analysis: Why network researchers need database help.
Abstract:
The "science" of networking research remains disconcertingly ad-hoc, continually reinventing the same analysis tools and different storage formats. The result of this state of being is considerable wasted effort, the inability to easily replicate prior analyses, and gratuitous format differences that make it difficult to compare data-sets.
With a view to lowering the barriers to exploratory network measurement, repeatable experimentation, and data sharing, I will present several challenges in collecting and archiving large volumes of network measurement data. This talk deliberately raises more questions than it answers, in the hope that the database community can find (and solve!) many of these interesting problems. What are the best architectures for managing and analyzing these volumes of data? Can conventional databases and query languages be adapted to deal with the common queries encountered in network data mining?
I will then present our early work on crafting a large-scale Internet data storage and analysis facility -- the Datapository -- designed to create a framework for collaboratively addressing these challenges. While we have not yet tackled many of the database problems listed, our experience so far with the Datapository as an analysis tool has been very encouraging. In many cases, we can reduce the major analysis components of contemporary Internet measurement research to one or a few SQL statements and a small amount of glue code to compose the analysis.
Speaker: Dennis Lee from Amazon.
When: Friday, April 14th at 11:30am-12:30pm.
Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.
Title: Operating and Scaling a Website to Millions of Users.
Abstract:
I relate my experience in running multiple websites, all of which have to be up 24x7x365 while serving millions of hits an hour. Many things that are taken for granted in small sites become huge issues when scaled up to this magnitude: deployment, configuration management, machine setup, upgrades, random byzantine failures, resource management, persistence, and consistency. I start by sketching the multi-tiered service architecture of Amazon's website. Then I will go through some of the lessons I learned in running the website: both things that seem easy but are not at this scale, and things that seem hard but lend themselves to very different solutions due to the nature of the application that I was managing.
Our first guest speaker: Shankar Pal from Microsoft Research.
When: Friday, February 17th at 1:30pm-3pm.
Title: XML Processing in SQL Server 2005.
Abstract:
SQL Server 2005 introduced support for XML as a rich data type within its relational infrastructure. XML instances are stored as byte sequences to support the XML data model faithfully. This raises new challenges for storage, query processing, indexing, and schema management.
This talk discusses the many innovations that have gone into the XML processor. A node labeling scheme called OrdPath captures document order and hierarchical relationships in a compressed binary representation while supporting insertion of new nodes at arbitrary positions in the XML tree. The query language supported is XQuery, a candidate recommendation from W3C, using the relational infrastructure with a handful of new operators. Several optimizations have been introduced for high performance. XML instances can be indexed in an edge-table-like format. Additionally, packaged indexes are available for optimizing different classes of XQuery workloads. XML schema evolution is supported in a novel way without requiring an upgrade of existing data or disrupting existing applications.