Pivotal Greenplum Database

Enable Analytic Innovation

Insight from Big Data is essential to business today. Predictive analytics of high volumes of data can make the difference between a profit and a loss, save lives, or predict the weather. Pivotal Greenplum® Database is a purpose-built, dedicated analytic data warehouse designed to extract value from your data.

Pivotal Greenplum Database (Pivotal GPDB) manages, stores and analyzes terabytes to petabytes of data in large-scale analytic data warehouses around the world. Using the purpose-built Pivotal GPDB platform, your organization can experience 10x, 100x or even 1000x better performance over traditional RDBMS products. Using a shared-nothing, massively parallel processing (MPP) architecture and flexible column- and row-oriented storage, Pivotal GPDB extracts the data you need to determine which modern applications will best serve customers in context, at the right location, and during the right activities. It also leverages fully parallel communication with external databases and Hadoop to continually harness data.

With Pivotal GPDB, your enterprise can rapidly develop deep analytics using preferred toolsets and languages that your team already knows, including SQL, Python, Ruby, and Java. And if ever your team is facing a particularly difficult data problem, the Pivotal Data Science team is ready to help. Together, the performance of Pivotal technology and the Pivotal team enable you to solve your analytic challenges faster than ever.

PIVOTAL Greenplum Database Features

Power Your Big Data Analytics

Advanced analytics must begin with simple principles. Over the last decade, the engineering team developing Pivotal Greenplum Database (Pivotal GPDB) has determined that an effective analytic data warehouse requires five key characteristics, which is why Pivotal GPDB features:

  • A Highly Scalable, Shared-Nothing Database - Pivotal GPDB’s architecture provides automatic parallelization of data loading and queries for high performance. Its self-healing, fault-tolerance capabilities deliver intelligent fault detection and fast online differential recovery, lowering your TCO and ensuring your cloud-scale systems have the highest levels of availability. The Pivotal GPDB architecture continuously balances a cluster’s resources across all running queries in real time. With a diverse range of indexing, compression and partitioning options, Pivotal GPDB meets the needs of a wide variety of reporting and analytic workloads. It also supports best-in-class mixed-workload management, and includes health monitoring and alerting through integrated email and SNMP notifications.
  • A Platform for Advanced Analytics on Any (and All) Data - Pivotal GPDB supports a large and rich ecosystem so your users can develop analytics with tools that they know. Pivotal GPDB also supports an extensive—and growing—collection of analytics. It features a library of in-database analytic functions including data modeling and predictive routines and supports analytic extensions such as the MADlib analytic library. If you want to invent your own routines, Pivotal GPDB has the most flexible extensibility framework for custom analytics and database functions. Because much of the world’s data is semi- or unstructured text, Pivotal GPDB supports free-text search, as well as sophisticated text analysis through GPText. And with location becoming increasingly relevant, Pivotal GPDB offers a geospatial add-in to create, store and query geospatial data.
  • A Flexible, Enterprise-Ready Platform - Pivotal GPDB enables the high-performance, parallel import and export of compressed and uncompressed data from Hadoop clusters, file systems and other databases using gNet, a parallel communications transport protocol. It supports simpler, scalable backup via EMC Data Domain Boost. Because Pivotal GPDB pioneered Polymorphic Data Storage™, it includes tunable compression and support for both row- and column-oriented storage within a database, and can be extended to allow the placement of data on specific storage types, such as SSD media or NAS archival stores. Pivotal GPDB makes it easy for you to leverage multiple storage technologies to balance performance and cost, and the solution is available as either software only or as an appliance.
  • Deploy a Foundation for the Future - Pivotal GPDB is a foundational component of the Pivotal technological vision. Pivotal continues to invest significant development resources in the platform, knowing it must serve your needs today and into the future. For example, in-database analytics support data science teams with functions such as principal components analysis (PCA), enhanced support vector machines (SVM) and a variety of general linear models. Additionally, developments in unparalleled database optimizer technology will make GPDB even faster and more adaptable for the increasingly sophisticated analytics your organization requires. Because Pivotal GPDB is core to your data-driven enterprise, Pivotal engineers—working in lock-step with an unmatched team of in-house Data Science experts—continue to push the limits of innovation in analytics.
  • Quickly Enhance Data Science Capabilities - Building a data science capability in your enterprise is often a slow process. To give you a head start, engage with experienced Pivotal Data Science team members. Their industry experience and expertise in achieving business objectives can help you identify opportunities to extract business value from Big Data using advanced analytics while helping to grow the skills of your organization’s own data science team through coaching and onsite support.

Your organization can quickly ramp up data science capabilities with the Pivotal Data Science team and the Pivotal GPDB technologies that power Big Data analytics.

PIVOTAL Greenplum Database Technology

Most of today’s general-purpose relational databases (e.g., Oracle, Microsoft SQL Server) originated as online transaction processing (OLTP) systems. Their shared-disk or shared-everything architectures are optimized for high-transaction rates at the expense of analytical query performance and concurrency. In contrast, Pivotal Greenplum Database™ (Pivotal GPDB) is an extensible relational database platform that uses a shared-nothing, massively parallel processing (MPP) architecture built atop commodity hardware to vastly accelerate your analytical processing.

Pivotal GPDB’s shared-nothing MPP architecture provides every segment with an independent high-bandwidth connection to dedicated storage. The segment servers are able to process every query in a fully parallel manner, use all disk connections simultaneously, and efficiently flow data between segments as query plans dictate. The degree of parallelism and overall scalability that this allows far exceeds that of general-purpose database systems.

By transparently distributing data and work across multiple segment servers, Pivotal GPDB executes mathematically intensive analytical queries “close to the data” with performance that scales linearly with the number of segment servers.
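To make the idea of transparent distribution concrete, here is a minimal Python sketch of spreading rows across segments by hashing a distribution key. It is an illustration only: the CRC32 hash, the function names, and the sample rows are assumptions made for this example, not the hashing scheme or API that Pivotal GPDB uses.

```python
import zlib
from collections import defaultdict


def segment_for(key: str, num_segments: int) -> int:
    """Map a distribution key to a segment using a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments):
    """Assign each row (a dict) to exactly one segment by hashing its key."""
    segments = defaultdict(list)
    for row in rows:
        seg_id = segment_for(str(row[key_column]), num_segments)
        segments[seg_id].append(row)
    return segments


# Hypothetical sample data: 1,000 rows keyed by customer_id.
rows = [{"customer_id": i, "amount": i * 1.5} for i in range(1000)]
segments = distribute(rows, "customer_id", num_segments=4)

for seg_id in sorted(segments):
    print(seg_id, len(segments[seg_id]))
```

Because every row lands on exactly one segment, each segment can scan and compute over its own slice independently, which is what lets work grow with the number of segment servers.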

Analytical capabilities are extensible through MADlib, an open-source library of statistical and mathematical algorithms for correlation, segmentation and machine learning, or through custom-code extensions provided by users. Workload management facilities balance concurrent queries, while automatic failover provides you with high availability by taking advantage of redundant hardware, providing an enterprise-capable analytics platform.

As data sets increasingly include unstructured data (text, logs, images or sound), Pivotal GPDB provides fast, parallel integration with Hadoop to enable co-processing: analysis of both structured and unstructured data within a unified analytics platform (UAP).

Using GPText, a no-cost Pivotal GPDB add-in, your organization can quickly set up text feature extraction which, when combined with Pivotal GPDB’s deep analytic support, enables users to ask analytic questions at scale that they simply could not ask before. For example, does activity on Twitter affect security prices? If so, how? With this knowledge in hand, your users can now differentiate your business with uniquely powerful analytics.

Moreover, as the world becomes more instrumented, location information becomes increasingly relevant. The Pivotal GPDB Geospatial add-in enables the creation, storage and query of geospatial data as well as spatial indexing. Pivotal GPDB’s geospatial functions include support for both geometric and geographic data types so your users can represent either simple polygons or more complex shapes in three dimensions. This functionality, combined with Pivotal GPDB’s built-in analytics, enables sophisticated capabilities including graph analysis.

Pivotal GPDB is available to your organization in software or pre-installed on a Pivotal DCA.

How Does the Parallel Query Optimizer Work?

Pivotal GPDB’s parallel query optimizer is responsible for converting SQL or MapReduce into a physical execution plan. It does this using a cost-based optimization algorithm in which it evaluates a vast number of potential plans and selects the one that it believes will lead to the most efficient query execution.

Unlike a traditional query optimizer, Pivotal GPDB’s optimizer takes a global view of execution across the cluster, and factors in the cost of moving data between nodes in any candidate plan. The benefit to you of this global query planning approach is that it can use global knowledge and statistical estimates to build an optimal plan once and ensure all nodes execute it in a fully coordinated fashion. This leads to far more predictable results than the alternative approach of pushing SQL snippets that must be re-planned at each node.
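As a rough illustration of why data movement belongs in the cost model, the following Python sketch compares two candidate plans for a join and picks the cheaper one. Everything in it is hypothetical: the cost constants, the row-count formulas, and the names `CandidatePlan`, `total_cost`, and `choose_plan` are assumptions for this example and do not describe the actual Pivotal GPDB optimizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePlan:
    name: str
    local_cost: float  # hypothetical cost of scans and joins on the segments
    rows_moved: int    # rows shipped between segments by motion operations


def total_cost(plan: CandidatePlan, cost_per_row_moved: float = 0.01) -> float:
    """Global cost = local work plus the cost of moving rows between nodes."""
    return plan.local_cost + plan.rows_moved * cost_per_row_moved


def candidate_join_plans(large_rows: int, small_rows: int, num_segments: int):
    """Two simplified ways to co-locate rows for a join."""
    return [
        # Every small-table row must reach the other (N - 1) segments.
        CandidatePlan("broadcast small table", 100.0, small_rows * (num_segments - 1)),
        # Upper bound: every row of both tables moves once to its hash owner.
        CandidatePlan("redistribute both tables", 100.0, large_rows + small_rows),
    ]


def choose_plan(plans):
    """Pick the candidate with the lowest estimated global cost."""
    return min(plans, key=total_cost)


best = choose_plan(candidate_join_plans(large_rows=1_000_000, small_rows=100, num_segments=8))
print(best.name)
```

With a tiny dimension table, broadcasting moves far fewer rows than redistributing the large table, so the sketch selects the broadcast plan; with two tables of similar size the choice flips.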

The resulting query plans contain traditional physical operations—such as scans, joins, sorts, and aggregations—as well as parallel motion operations that describe when and how data should be transferred between nodes during query execution. You can find three kinds of motion operations in a Pivotal GPDB query plan:

  • Broadcast Motion (N:N) - Every segment sends the target data to all other segments
  • Redistribute Motion (N:N) - Every segment rehashes the target data (by join column) and redistributes each row to the appropriate segment
  • Gather Motion (N:1) - Every segment sends the target data to a single node (usually the master)
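The three motion operations above can be sketched in a few lines of Python, treating each segment as a list of rows. This is a toy model for intuition only: the CRC32 hash and the function names are assumptions for this example, not Pivotal GPDB internals.

```python
import zlib


def _hash(value) -> int:
    """Stable hash used to pick the owning segment (an assumption for this sketch)."""
    return zlib.crc32(str(value).encode("utf-8"))


def broadcast_motion(segments):
    """N:N - every segment ends up with a full copy of all rows."""
    all_rows = [row for seg in segments for row in seg]
    return [list(all_rows) for _ in segments]


def redistribute_motion(segments, key):
    """N:N - each row is rehashed on `key` and sent to the segment that owns it."""
    out = [[] for _ in segments]
    for seg in segments:
        for row in seg:
            out[_hash(row[key]) % len(segments)].append(row)
    return out


def gather_motion(segments):
    """N:1 - every segment sends its rows to a single node (for example the master)."""
    return [row for seg in segments for row in seg]


# Hypothetical three-segment cluster; the third segment holds no rows.
segments = [
    [{"id": 1, "k": "a"}],
    [{"id": 2, "k": "b"}, {"id": 3, "k": "a"}],
    [],
]
redistributed = redistribute_motion(segments, "k")
```

After `redistribute_motion`, all rows sharing the same join key sit on the same segment, so each segment can join its rows locally.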

[Figure not reproduced: an example SQL statement and the resulting physical execution plan containing motion operations.]

How Does the Parallel Dataflow Engine Work?

At the heart of Pivotal GPDB is the Parallel Dataflow Engine, which conducts the real work of processing and analyzing data. The Parallel Dataflow Engine is an optimized parallel processing infrastructure designed to help you process data as it flows from disk, from external files or applications, or from other segments over the gNet interconnect. The engine is inherently parallel: it spans all segments of a Pivotal GPDB cluster and can scale effectively to thousands of commodity processing cores.

The engine was designed based on supercomputing principles, with the idea that large volumes of data have weight (i.e., are not easily moved around) and so processing should be pushed as close as possible to the data. In the Pivotal GPDB architecture, this coupling is extremely efficient, with massive I/O bandwidth directly to and from the engine on each segment. As a result, a wide variety of complex processing can be pushed down as close as possible to the data for maximum processing efficiency and expressiveness.



The Pivotal Parallel Dataflow Engine is highly optimized for executing both SQL and MapReduce, and does so in a massively parallel manner. The engine can directly execute all necessary SQL building blocks, including performance-critical operations such as hash-join, multistage hash-aggregation, SQL 2003 windowing, and arbitrary MapReduce programs.
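One way to picture multistage hash-aggregation is as a two-stage sum: each segment first aggregates its own rows, and only the small partial results are then combined. The Python sketch below shows that general technique under simplifying assumptions; the function names and sample data are hypothetical, and a real engine would typically also redistribute partial results by grouping key between the stages.

```python
from collections import Counter


def local_aggregate(segment_rows):
    """Stage 1: a segment sums amounts per key over its own rows only."""
    partial = Counter()
    for key, amount in segment_rows:
        partial[key] += amount
    return partial


def final_aggregate(partials):
    """Stage 2: combine the small per-segment partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)


# Hypothetical (key, amount) rows spread over three segments.
segment_data = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("b", 4), ("c", 5)],
]
totals = final_aggregate(local_aggregate(rows) for rows in segment_data)
print(totals)
```

The benefit is that stage 1 runs next to the data, so only one partial row per key and segment has to cross the interconnect.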

What Is the gNet Software Interconnect?

In shared-nothing MPP database systems, you often need to move data whenever there is a join or an aggregation process for which the data requires repartitioning across the segments. As a result, the interconnect serves as one of the most critical components within Pivotal GPDB. Pivotal gNet™ software interconnect optimizes the flow of data to allow continuous pipelining of processing without blocking on all nodes of the system. The gNet interconnect is tuned and optimized to scale to tens of thousands of processors and leverages commodity Gigabit Ethernet and 10GigE switch technology.



At its core, the gNet software interconnect is a supercomputing-based soft-switch that is responsible for efficiently pumping streams of data between motion nodes during query-plan execution. It delivers messages, moves data, collects results and coordinates work among the segments in the system. It is the infrastructure underpinning the execution of motion nodes that occur within parallel query plans on Pivotal GPDB.

Within the execution of each node in the query plan, pipelining processes are responsible for multiple relational operations. For example, while a table scan is taking place, rows selected can be pipelined into a join process. Pipelining is the ability to begin a task before its predecessor task has completed, and this ability is key to increasing basic query parallelism. Pivotal GPDB utilizes pipelining whenever possible to ensure the highest-possible performance.
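Pipelining as described here maps naturally onto Python generators, where a downstream operator consumes each row as soon as the upstream operator produces it. The sketch below is a conceptual illustration only; the operator names and sample tables are hypothetical and are not Pivotal GPDB code.

```python
def scan(table):
    """Yield rows one at a time instead of materializing the whole table."""
    for row in table:
        yield row


def select(rows, predicate):
    """Pass through only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def hash_join(rows, lookup, key):
    """Probe a prebuilt hash table as each upstream row arrives."""
    for row in rows:
        match = lookup.get(row[key])
        if match is not None:
            yield {**row, **match}


# Hypothetical tables: six orders and a two-entry customer hash table.
orders = [{"order_id": i, "customer_id": i % 3, "amount": 10 * i} for i in range(6)]
customers = {0: {"name": "Ann"}, 1: {"name": "Bob"}}

pipeline = hash_join(
    select(scan(orders), lambda r: r["amount"] >= 20),
    customers,
    "customer_id",
)

# The first joined row is available before the scan has finished.
first = next(pipeline)
print(first)
```

No stage waits for its predecessor to complete: the join emits its first result after the scan has read only as many rows as were needed to produce it.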

How Does Pivotal GPDB Handle Multilevel Fault Tolerance?

Pivotal GPDB is architected to avoid any single point of failure. The system utilizes log shipping and segment-level replication to achieve redundancy, and it provides automated failover.
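The effect of segment-level replication with automated failover can be pictured with a small Python model in which every segment has a primary copy and a mirror copy on a different host. This is a simplified sketch: the class, the function, and the host names `sdw1` and `sdw2` are hypothetical, and the model ignores how replication and fault detection actually work in Pivotal GPDB.

```python
from dataclasses import dataclass


@dataclass
class SegmentPair:
    content_id: int
    primary_host: str
    mirror_host: str
    primary_up: bool = True
    mirror_up: bool = True

    def active_host(self) -> str:
        """Return the host currently serving this segment's data."""
        if self.primary_up:
            return self.primary_host
        if self.mirror_up:
            return self.mirror_host  # automated failover to the mirror
        raise RuntimeError(f"segment {self.content_id} has no live copy")


def fail_host(pairs, host):
    """Mark every segment copy on the failed host as down."""
    for pair in pairs:
        if pair.primary_host == host:
            pair.primary_up = False
        if pair.mirror_host == host:
            pair.mirror_up = False


# Hypothetical two-host layout where each host mirrors the other's segment.
pairs = [
    SegmentPair(0, primary_host="sdw1", mirror_host="sdw2"),
    SegmentPair(1, primary_host="sdw2", mirror_host="sdw1"),
]
fail_host(pairs, "sdw1")
print([pair.active_host() for pair in pairs])
```

Because no segment keeps both copies on the same host, losing one machine leaves every segment with a live copy.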

Pivotal GPDB can also perform post-recovery segment rebalancing without disrupting in-progress client sessions or taking the database offline. Pivotal GPDB’s data-rebalancing technology also allows the cluster to be scaled incrementally with flexible administrative tools that provide minimal client impact.

In addition to multiple levels of redundancy, Pivotal GPDB includes integrity checking. At the lowest level, Pivotal GPDB utilizes RAID-0+1 or RAID-5 storage to detect and mask disk failures. At the system level, Pivotal GPDB continuously replicates all segment and master data to other nodes within the system to ensure that the loss of a machine will not impact the overall database availability. Pivotal GPDB also utilizes redundant network interfaces on all systems and specifies redundant switches in all reference configurations. The result is a system that meets the reliability requirements of some of the most mission-critical operations in the world.

[Figures not reproduced: Pivotal GPDB segment configurations.]


What Is the Advantage of MPP Scatter/Gather Streaming Technology?

Pivotal GPDB’s architecture provides automatic parallelization of data loading and queries. Using Scatter/Gather Streaming™ (SG Streaming™) technology, your organization can eliminate bottlenecks associated with other approaches to data loading, giving you a lightning-fast flow of data into Pivotal GPDB for large-scale analytics and data warehousing. Pivotal customers are achieving production-loading speeds of over four terabytes per hour with negligible impact on concurrent database operations.

SG Streaming enables your organization to manage the flow of data into all nodes of the database, eliminate requirements for additional software or systems, and take advantage of the same Parallel Dataflow Engine nodes in Pivotal GPDB.

Pivotal GPDB utilizes a parallel-everywhere approach to loading in which data flows from one or more source systems to every node of the database without any sequential choke points. This differs from traditional bulk loading technologies—used by most mainstream database and MPP appliance vendors—that push data from a single source, often over a single or small number of parallel channels, and result in fundamental bottlenecks and ever-increasing load times. The Pivotal GPDB approach also avoids the need for a loader tier of servers, as required by some other MPP database vendors, which can add significant complexity and cost while effectively slowing bandwidth and communication parallelism into the database.

Pivotal SG Streaming technology ensures parallelism by scattering data from all source systems across hundreds or thousands of parallel streams that simultaneously flow to all nodes of Pivotal GPDB. Performance scales with the number of Pivotal GPDB nodes, and the technology supports both large-batch and continuous, near-real-time loading patterns with negligible impact on concurrent database operations. Data can be transformed and processed in-flight, utilizing all nodes of the database in parallel, for extremely high-performance ELT (extract-load-transform) and ETLT (extract-transform-load-transform) loading pipelines. Final gathering and storage of data to disk takes place on all nodes simultaneously, with data automatically partitioned across nodes and optionally compressed. This technology is exposed to the database administrator via a flexible and programmable external table interface and a traditional command-line loading interface.
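The scatter, in-flight transform, and gather steps can be sketched in Python with a thread pool standing in for parallel streams. This is a conceptual model only: the `meter_id,reading` input format, the modulo placement rule, and all function names are assumptions for this example, not the SG Streaming implementation or its interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(source_lines, num_streams):
    """Split the input round-robin into streams that can be read in parallel."""
    return [source_lines[i::num_streams] for i in range(num_streams)]


def transform(line):
    """In-flight transformation: parse 'meter_id,reading' into a typed tuple."""
    meter_id, reading = line.split(",")
    return int(meter_id), float(reading)


def load_stream(stream, num_nodes):
    """Parse one stream and bucket its rows by destination node."""
    buckets = [[] for _ in range(num_nodes)]
    for line in stream:
        row = transform(line)
        buckets[row[0] % num_nodes].append(row)
    return buckets


def parallel_load(source_lines, num_streams, num_nodes):
    """Process all streams concurrently, then gather rows on their owning node."""
    nodes = [[] for _ in range(num_nodes)]
    streams = scatter(source_lines, num_streams)
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        for buckets in pool.map(load_stream, streams, [num_nodes] * num_streams):
            for node_id, rows in enumerate(buckets):
                nodes[node_id].extend(rows)
    return nodes


# Hypothetical input: 100 meter readings.
lines = [f"{i},{i * 0.5}" for i in range(100)]
nodes = parallel_load(lines, num_streams=4, num_nodes=3)
print([len(rows) for rows in nodes])
```

No single stream or node sees the whole input, which is the property that removes the sequential choke point of a single-channel bulk loader.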

What Is the Benefit of Pivotal MapReduce?

MapReduce is a successful technique for high-scale data analysis that has been proven by Internet leaders including Google and Yahoo. With Pivotal GPDB, your organization gets the best of both worlds: MapReduce for programmers and SQL for database administrators (DBAs).

Pivotal MapReduce enables your programmers to run analytics against petabyte-scale data sets stored inside and outside of the Pivotal GPDB. Pivotal MapReduce brings the benefits of a growing standard programming model to the reliability and familiarity of the relational database. The capability expands the Pivotal GPDB to support MapReduce programs.
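For readers new to the programming model, here is a minimal, single-process Python sketch of the map, shuffle, and reduce phases using a word count. It illustrates the general MapReduce technique only; the function names are hypothetical, and it does not show the Pivotal MapReduce interface or how jobs are specified to Pivotal GPDB.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as happens between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(doc):
    """Emit (word, 1) for every word in a document."""
    for word in doc.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Total the counts emitted for one word."""
    return sum(values)


docs = ["big data", "Big analytics", "data data"]
counts = reduce_phase(shuffle(map_phase(docs, word_mapper)), sum_reducer)
print(counts)
```

In a parallel setting the map calls run independently on each node's data and the shuffle moves pairs so that each key is reduced in one place.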

Pivotal customers were involved in an early-access program utilizing Pivotal MapReduce for advanced analytics. For example, LinkedIn used Pivotal GPDB for new, innovative social networking features such as “People You May Know” and evaluated Pivotal MapReduce as a way to develop compelling analytics products faster. As a primary benefit, its capabilities enable customers to combine SQL queries and MapReduce programs into unified tasks that are executed in parallel across hundreds or thousands of cores.

The following comments are from other customers in the program:

“The integration of MapReduce into Pivotal GPDB creates new ways to manage our text analysis efforts. What previously would require us to take data out of the database or write complex SQL queries can now be simplified into a few lines of code.”

— Roger Magoulas, Research Director, O’Reilly Media

"The most exciting aspect of MapReduce is the excitement it is generating. It's attracting talented programmers—many of whom don't want to buy or use SQL databases—and enabling them to wrangle enormous data sets without leaving their familiar programming paradigms. Any movement that brings that much compute power to a larger talent base has the potential to produce game-changing results."

— Joe Hellerstein, Professor, UC Berkeley

How Does Pivotal GPDB Handle Polymorphic Data Storage?

Traditionally, relational data has been stored in rows, as a sequence of tuples in which all the columns of each tuple are stored together on disk. This was inherited from early OLTP systems that introduced the slotted-page layout that is still commonly used today. However, analytical databases tend to have different access patterns than OLTP systems. Instead of seeing many single-row reads and writes, analytical databases must process larger, more complex queries that touch much larger volumes of data (i.e., read-mostly with big scanning reads and infrequent batch appends of data).

Technology leaders have taken a number of different approaches to more complex queries of Big Data. Some have optimized their disk layouts to eliminate the OLTP overhead and do smarter disk scans. Others have turned to column-stores. Each of these approaches has proven to be successful for some queries and less successful for others.

Rather than advocating a single approach, Pivotal GPDB provides flexibility so that you can choose the right strategy for a particular query. For each table or partition of a table, your database administrator (DBA) can select the storage, execution and compression settings that suit the way that table will be accessed. Then using Polymorphic Data Storage, the database transparently abstracts the details of any table or partition, allowing a wide variety of underlying models:

  • Read/Write Optimized - Traditional slotted-page, row-oriented table (based on PostgreSQL's native table type), optimized for fine-grained CRUD operations.
  • Row-Oriented / Read-Mostly Optimized - Optimized for read-mostly scans and bulk append loads. DDL allows optional compression ranging from fast/light to deep/archival.
  • Column-Oriented / Read-Mostly Optimized - Provides a true column-store just by specifying 'WITH (orientation=column)' on a table. Data is vertically partitioned and each column is stored in a series of large densely-packed blocks that can be efficiently compressed from fast/light to deep/archival (with notably higher compression ratios than row-oriented tables). Performance is excellent for workloads suited to column-store. The Pivotal GPDB implementation only scans those columns required by the query, does not have the overhead of per-tuple IDs, and conducts efficient early materialization using an optimized 'columnar append' operator.

When combined with Pivotal's multi-level table partitioning, Polymorphic Data Storage enables your organization to tune the storage types and compression settings of different partitions within the same table. For example, a single partitioned table could have older data stored as 'column-oriented with deep/archival compression,' more recent data could be stored as 'column-oriented with fast/light compression,' and the most recent data could be stored as 'read/write optimized' to support fast updates and deletes.
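The age-based tiering in the example above could be expressed as a small helper that picks a storage clause per partition. The Python sketch below is hypothetical: the age thresholds and function name are invented for illustration, and apart from `orientation=column`, which the text above mentions, the option names (`appendonly`, `compresstype`, `compresslevel`) are assumptions about the DDL that should be checked against the documentation for your Pivotal GPDB version.

```python
def storage_options(age_days: int) -> str:
    """Pick a WITH clause for a partition based on the age of its data.

    The 30-day and 365-day thresholds are arbitrary example values.
    """
    if age_days <= 30:
        # Most recent data: default read/write optimized table, no clause needed.
        return ""
    if age_days <= 365:
        # More recent data: column-oriented with fast/light compression.
        return "WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=1)"
    # Older data: column-oriented with deep/archival compression.
    return "WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=9)"


for age in (7, 90, 1000):
    print(age, storage_options(age) or "(default storage)")
```

Keeping the policy in one function makes it easy to review which partitions trade update speed for compression as the data ages.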
