PIVOTAL Chorus Technology
What Is the Analytics Cycle?
The effectiveness of your Big Data analytics lies not only in the speed of the data warehouse or your analytical tools, but also in the agility of your corporate business process, as well as in your ability to iterate analysis quickly on a full set of data and to collaborate with the entire data science team.
A legacy analytics process requires several distinct steps: analysts must weed through data warehouses to find information, obtain access to that data from a variety of data owners, schedule time with data owners to learn about data structures, request sandbox access from corporate IT, move data to a sandbox, analyze the data, and finally operationalize the model. This extremely time-intensive process often yields a stale and incomplete dataset, which in turn jeopardizes the accuracy of your model. In addition, the legacy process is fragmented by the analyst’s environment, minimizing collaborative efforts.
The first solution of its kind, Pivotal Chorus provides an analytic productivity platform that enables your team to search, explore, visualize, and import data both from within your organization and from external sources. It provides rich social network features that revolve around datasets, insights, methods, and workflows, allowing data analysts, data scientists, IT staff, database administrators, executives, and other stakeholders to collaborate on Big Data analyses.
Using a single interface, your data scientists can save time and increase their productivity as they navigate four Big Data analytics processes using Pivotal Chorus:
Explore the Data
Because Pivotal Chorus automatically indexes data, your data scientists and analysts can now more easily explore data stores across the enterprise. They can quickly identify datasets regardless of location, browse schemas and Hadoop files, and preview data for instant understanding. A comprehensive search function allows your analysts to search all data within Pivotal Chorus—including comments, SQL, Hadoop, and Pivotal Greenplum Database (GPDB). They can probe the data warehouse and gain all of the insight that other analysts have generated, producing a list of imports and datasets complete with the comments and questions associated with each.
Previously, analysts who wanted access to data needed to deploy and operate a new massively parallel processing (MPP) data warehouse, a time-consuming process. Now, because agile Big Data practices need to move at a much faster pace, organizations require a different solution. Using Pivotal Chorus, IT staff and database administrators can establish pools of commodity servers and storage ahead of analyst demand, enabling data science teams to create new database instances and sandboxes in minutes, with just a few mouse clicks. Analysts and data scientists no longer need to file an IT help-desk ticket and wait days or even weeks for a resolution.
As they work to understand datasets, most data scientists export data to a local desktop, import it into R or a similar tool, and then create a visualization. Pivotal Chorus accelerates insight and understanding by providing data science teams with a rapid visual representation of information. Pivotal Chorus supports histograms, frequency, heat-map, time-series, and box-plot charts for on-demand visualizations. As a precursor to more sophisticated tools, this functionality enables the simple, ad hoc reporting that often serves as a starting point for innovation.
While providing data analysts with their own databases and sandboxes is important, it is essential that analysts be able to move data fluidly into a place where they can work on it and combine it with other data of interest. Through Pivotal Chorus self-service provisioning, data scientists can obtain a new sandbox as a standalone Pivotal GPDB instance on VMware vFabric virtual infrastructure. Alternatively, they can instantiate new sandbox schemas on the fly within existing Pivotal GPDB instances.
Once a sandbox goes live, Pivotal Chorus enables analysts to browse for interesting datasets among all of those that they have permission to access. Regardless of where the data “lives,” analysts can request that all or a slice of data be populated into their Pivotal Chorus sandbox. This capability applies not only to data in Pivotal GPDB instances within Pivotal Chorus, but also to file sources, web services, and sources such as Oracle. Throughout these processes, Pivotal Chorus schedules and manages the flow of data, tracks dependencies, and always shows a view of the most recent data. At the conclusion of an analysis project, IT can return computing resources to the resource pool for use in future projects.
Once data scientists have imported data into Pivotal Chorus workspaces, they can analyze that information and share the results as a new dataset that other analysts can discover and access. They can also invite colleagues to the project and solicit their input. Because today’s data science teams increasingly include business users as well as analysts, Pivotal Chorus provides a data wizard that enables business users to analyze datasets without requiring them to remember complex syntax or write SQL queries. As teams progress, anyone associated with the workspace—including executives and managers—can monitor progress, without having to wait to hear results at the end of a project. At the same time, database administrators can see a list of sandboxes that are using any instance of data, giving them important context into the business value of any dataset.
Publish and Iterate
After hours of hard work, too many analysts publish valuable insights via email to limited audiences. Their insights remain invisible to searches and are difficult to leverage in future analysis projects. Pivotal Chorus provides rich social network features that facilitate participation and collaboration among data analysts, IT staff, database administrators, executives, and other stakeholders. As a result, the entire data science team can collaboratively discover, share, and discuss insights that have a meaningful impact on the business. For example, the Create Insight feature makes it easy for an analyst to publish an insight across Pivotal Chorus so other team members can comment, brainstorm about overlooked possibilities, and generate new questions of their own.
How Are Data Exploration, Search, and Data Dictionary Handled?
Pivotal Chorus provides an easy way to navigate and search all data in your environment—including Pivotal Greenplum Database (GPDB), Pivotal HD, indexes, codes, comments, users, and more—and gives your data analysts access to information in a collaborative environment. Analysts can explore and visualize the data, create new datasets, and add documents and comments—ensuring that data sources become enriched as part of the everyday work of your data science teams. Pivotal Chorus then automatically indexes all enriched metadata so that anyone authorized can easily search for anything within your enterprise data cloud.
With Pivotal Chorus, your analysts can browse and explore datasets across the enterprise. Once data is imported into the workspace, Pivotal Chorus manages the flow of data, tracks dependencies, and refreshes the imported data when the original source is updated. Analysts can operate on the data and share the results of their analysis as a new dataset that other analysts can discover and access.
The global search function allows analysts to search all data within Pivotal Chorus—including comments, SQL, Hadoop, and Pivotal GPDB. They can probe the data warehouse and gain all of the insight that other analysts have generated from it, producing a list of imports and datasets complete with the associated comments and questions.
Analysts no longer need to hunt through emails or shared drives for a comment or a piece of code. Everything is tracked and available through the Pivotal Chorus interface. The solution delivers federated search across data assets anywhere in the enterprise. Pivotal Chorus indexes all metadata, comments, SQL, and data assets to create a living data dictionary available in the form of a search prompt.
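The idea of indexing metadata, comments, and SQL so that they are all reachable from one search prompt can be illustrated with a small inverted index. This is only a sketch of the general technique, not a description of how Pivotal Chorus is implemented; the class name, asset identifiers, and sample text are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class MetadataIndex:
    """Toy inverted index over dataset descriptions, SQL, and comments (hypothetical)."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> set of asset ids
        self._assets = {}                  # asset id -> (kind, text)

    def add(self, asset_id, kind, text):
        """Register an asset and index every token in its text."""
        self._assets[asset_id] = (kind, text)
        for token in tokenize(text):
            self._postings[token].add(asset_id)

    def search(self, query):
        """Return the sorted ids of assets that contain every query token."""
        tokens = tokenize(query)
        if not tokens:
            return []
        hits = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            hits &= self._postings.get(token, set())
        return sorted(hits)


# Hypothetical assets of three different kinds sharing one index.
index = MetadataIndex()
index.add("dataset:orders", "dataset", "orders table with customer_id and order_total")
index.add("comment:17", "comment", "Order totals look skewed for Q4, check customer churn")
index.add("sql:churn.sql", "sql", "SELECT customer_id FROM orders WHERE order_total > 100")

print(index.search("customer_id"))
print(index.search("churn"))
```

Because datasets, comments, and SQL work files share one index, a single query can surface all three kinds of asset, which is the behavior the "living data dictionary" describes.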
How Does Insight Sharing Work?
Pivotal Chorus combines a single interface for all of your organization’s data with virtual databases for exploration and innovation and social collaboration for insight and analysis. A rich social network, with features that revolve around datasets, insights, and other key Pivotal Chorus components, allows all analysts, IT staff, and business stakeholders to participate and collaborate in the same environment. As a result, your organization can begin to collaboratively discover, share, and discuss insights that have a meaningful impact on your business.
The Create Insight feature, for example, makes it easy to publish an insight across Pivotal Chorus. An analyst can share any insight directly with team members, who can then make comments, brainstorm about overlooked possibilities, and generate new questions of their own. With Pivotal Chorus, your organization can achieve greater business insight and economic value from your data than ever before.
What Is a Pivotal Chorus Workspace?
In Pivotal Chorus, a workspace is where users collaborate on an analytics project. Analysts can request a workspace from a Pivotal Chorus administrator. A workspace can be either public (all users have read access) or private (only members have access).
Each workspace contains the following four tabs:
Summary Tab – This area includes a brief description of the workspace, including workspace-specific insights. It also includes the recent activity stream pertaining to this workspace.
Sandbox Tab – This area is the workspace's sandbox, where Pivotal Chorus users import and work with data.
Data Sources Tab – This area is where users search, browse, visualize, and create datasets, which are then imported and used in the sandbox.
Work Files Tab – All content used in the workspace can be shared and version controlled in the Work Files tab, including documents, presentations, and special file types (e.g., SQL). An SQL work file, for example, allows a user to edit SQL against the data in the sandbox. Users can then conduct analytics directly in the database or use R in the database. In addition, this feature enables a user to see what the resulting dataset looks like and to create a dataset using the same SQL.
How Are User/Access Management and Data Instances Supported?
In Pivotal Chorus, a data instance is either a data source or a data destination. Data from Pivotal Greenplum Database (GPDB) or Pivotal HD’s Hadoop Distributed File System (HDFS) are connected natively to Pivotal Chorus through the instance setup. Pivotal Chorus can also access data in existing non-Pivotal systems through the use of Pivotal GPDB external tables.
User and Access Management
Your data science team should not have to worry about where data is stored. Regardless of whether data is structured or unstructured, data should be easy to preview, move, and analyze. Pivotal Chorus supports direct integration to corporate LDAP or Microsoft Active Directory for user and password management. For data access control, Pivotal Chorus automatically takes into account database permissions. Users can see only the data that they have access to and can perform only the functions that administrators have enabled.
Registering Pivotal GPDB or Hadoop instances in Pivotal Chorus is straightforward. Users can work with data easily, whether it is stored in the database or in Hadoop, which means that Pivotal Chorus provides a collaborative platform for the entire analytics workflow. Using Pivotal Chorus, for example, data engineers and analysts can share the code for MapReduce jobs and view the results of those jobs as structured files within Hadoop, or they can import the results into the database as structured data for further analysis.
Instance Details and Usage
Pivotal Chorus is deeply integrated with Pivotal GPDB. It understands the structure of tables, columns, and other database objects. It exposes useful information to analysts who need to understand the nature of the data as they explore the data cloud.
A Pivotal GPDB instance can be registered in one of the following two ways:
- Shared account – Multiple Pivotal Chorus users can share one account for connectivity and data access. Manual role-based access control can also be configured through multiple shared-account instance setups, with appropriate data grants done in the data source. The Sales-Ops database instance, for example, can be registered with a database account having access to only sales-ops information. Similarly, the Marketing database instance can be registered with an account having access to only marketing information, but both instances can point to the same database.
- Non-shared account – If an instance is connected to the data source through a non-shared account, the Pivotal Chorus user will be prompted for a data source account/password upon connection.
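The two registration modes can be modeled in a few lines. The sketch below is illustrative only: the class, function, account names, and database name are hypothetical and do not correspond to any Pivotal Chorus API. It shows two shared-account instances pointing at the same database through different accounts, plus a non-shared instance that requires each user's own credentials.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class InstanceRegistration:
    """Hypothetical record of a registered data instance."""
    name: str                      # label shown to users
    database: str                  # underlying database the instance points to
    shared_account: Optional[str]  # None means each user supplies an account


def resolve_account(instance: InstanceRegistration,
                    user_accounts: Dict[str, str]) -> str:
    """Pick the database account a user's connection would run under."""
    if instance.shared_account is not None:
        # Shared mode: every user connects through the same account.
        return instance.shared_account
    if instance.name not in user_accounts:
        # Non-shared mode: the user must be prompted for an account.
        raise PermissionError(f"credentials required for {instance.name}")
    return user_accounts[instance.name]


# Two instances on the same database, separated only by account grants.
sales_ops = InstanceRegistration("Sales-Ops", "enterprise_dw", "sales_ops_reader")
marketing = InstanceRegistration("Marketing", "enterprise_dw", "marketing_reader")
finance = InstanceRegistration("Finance", "enterprise_dw", None)

print(resolve_account(sales_ops, {}))
print(resolve_account(marketing, {}))
print(resolve_account(finance, {"Finance": "alice_db"}))
```

In this model, access separation comes entirely from the grants attached to each database account in the data source, which is why two instances can safely share one database.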
Moreover, if a Pivotal Chorus administrator specifies hardware parameters (e.g., CPU, memory), Pivotal Chorus can provision and create a new, standalone Pivotal GPDB instance on VMware virtualized infrastructure.
What Is a Sandbox in Pivotal Chorus?
Each workspace contains one sandbox, which is equivalent to a schema in Pivotal Greenplum Database (GPDB). A sandbox can contain data from more than one source, including Pivotal GPDB, Hadoop Distributed File System (HDFS), and external sources. Your analysts can use the sandbox as the place to import all data and perform joins to provide structure to the data.
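To make the "import, then combine" pattern concrete, the following sketch loads two small tables into a single schema and combines them with SQL. It uses Python's built-in sqlite3 module purely as a stand-in for a sandbox schema; Pivotal GPDB itself is not involved, and the table names and rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a sandbox schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE web_events (customer_id INTEGER, page TEXT);
""")

# Pretend these rows were imported from two different sources.
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "EMEA"), (2, "APAC")])
conn.executemany("INSERT INTO web_events VALUES (?, ?)",
                 [(1, "/pricing"), (1, "/docs"), (2, "/pricing")])

# Once both tables live in one place, they can be combined directly.
rows = conn.execute("""
    SELECT c.region, COUNT(*) AS events
    FROM web_events AS e
    JOIN customers AS c ON c.customer_id = e.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

print(rows)
conn.close()
```

The point of the example is that the combination only becomes possible after both datasets have been brought into the same schema, which is the role the sandbox plays.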
Self-Service Provisioning of Sandbox
With Pivotal Chorus, your data science team can create new sandboxes with a few simple mouse clicks. They no longer need to file an IT ticket and wait hours or even weeks for a resolution. With self-service provisioning, your users can get a new sandbox as a standalone Pivotal GPDB instance on VMware vFabric virtual infrastructure. Alternatively, your users can instantiate new sandbox schemas on the fly within existing Pivotal GPDB instances.
Data Import from External Data Sources
Pivotal Chorus dramatically simplifies the process of importing data into a sandbox. Analysts can upload a delimited file (e.g., CSV, tab separated) directly from their desktops, and Pivotal Chorus will automatically suggest the correct table structure for the data in the file. Analysts can further refine the structure and then share the resulting dataset with other Pivotal Chorus users. The solution also integrates with existing systems through a REST API. Data imports can be scheduled to run automatically at a set interval, or they can be refreshed manually.
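One plausible way to suggest a table structure from a delimited file is to read the header row and pick the narrowest type that fits every value in each column. The sketch below shows that general approach; it is an assumption about how such a feature could work, not the algorithm Pivotal Chorus uses, and the function names, table name, and sample data are hypothetical.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest SQL type that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "TEXT"  # nothing to learn from; fall back to text
    for sql_type, parse in (("INTEGER", int), ("DOUBLE PRECISION", float)):
        try:
            for value in non_empty:
                parse(value)
            return sql_type
        except ValueError:
            continue  # at least one value did not fit; try a wider type
    return "TEXT"


def suggest_table(name, delimited_text, delimiter=","):
    """Build a CREATE TABLE statement from a header row plus sample rows."""
    reader = csv.reader(io.StringIO(delimited_text), delimiter=delimiter)
    header = next(reader)
    rows = list(reader)
    columns = []
    for i, column in enumerate(header):
        values = [row[i] for row in rows if i < len(row)]
        columns.append(f"{column.strip()} {infer_type(values)}")
    return f"CREATE TABLE {name} ({', '.join(columns)});"


sample = "order_id,amount,region\n1,19.99,EMEA\n2,5,APAC\n"
print(suggest_table("orders_upload", sample))
```

A suggestion produced this way is only a starting point, which matches the workflow described above: the analyst reviews the proposed structure and refines it before sharing the dataset.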
Pivotal Chorus provides visualization options for data preview to speed data insight and understanding. It supports histograms, frequency, heat-map, time-series, and box-plot charts for on-demand visualizations. Because Pivotal Chorus provides your data science team with a rapid visual representation of information, they no longer need to export data to a local desktop, import it into R or another analytics tool, and then create a visualization.