Google Cloud Storage
The destination creates an object for each batch of data written to Google Cloud Storage.
With the Google Cloud Storage destination, you configure the bucket and common prefix to define where to write objects. You can use a partition prefix to specify the partition to write to. You can configure a prefix for the object name, and a time basis and data time zone for the stage. When using any data format except whole file, you can also configure a suffix for the object name and compress data with gzip before writing to Google Cloud Storage.
The destination can generate events for an event stream. For more information about the event framework, see Dataflow Triggers Overview.
Credentials
Before writing to Google Cloud Storage, the Google Cloud Storage destination must pass credentials to Google Cloud Storage. Configure the destination to retrieve the credentials from the Google Application Default Credentials or from a Google Cloud service account credentials file.
Default Credentials Provider
When configured to use the Google Application Default Credentials, the destination checks for the credentials file defined in the GOOGLE_APPLICATION_CREDENTIALS environment variable. If the environment variable doesn't exist and Data Collector is running on a virtual machine (VM) in Google Cloud Platform (GCP), the destination uses the built-in service account associated with the virtual machine instance.
For more information about the default credentials, see Google Application Default Credentials in the Google Developer documentation.
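The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not how Data Collector or the Google client libraries implement it. The function name, the returned labels, and the running_on_gcp_vm flag are hypothetical, and treating an empty variable as unset is an assumption.

```python
import os


def resolve_credentials_source(env, running_on_gcp_vm):
    """Return a label describing where default credentials would come from.

    env is a mapping of environment variables; running_on_gcp_vm is a
    hypothetical flag standing in for real VM detection.
    """
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if path:
        # The environment variable wins when it is defined.
        return f"file:{path}"
    if running_on_gcp_vm:
        # Otherwise fall back to the service account attached to the VM.
        return "vm-service-account"
    return "none"


if __name__ == "__main__":
    print(resolve_credentials_source(dict(os.environ), running_on_gcp_vm=False))
```

Passing the environment as a plain mapping keeps the function easy to check without modifying the real process environment.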
Complete the following steps to define the credentials file in the environment variable:
- Use the Google Cloud Platform Console or the gcloud command-line tool to create a Google service account and have your application use it for API access.
For example, to use the command-line tool, run the following commands:
gcloud iam service-accounts create my-account
gcloud iam service-accounts keys create key.json --iam-account=my-account@my-project.iam.gserviceaccount.com
- Store the generated credentials file on the Data Collector machine.
- Add the GOOGLE_APPLICATION_CREDENTIALS environment variable to the appropriate file, using the method required by your installation type, and point it to the credentials file.
Set the environment variable as follows:
export GOOGLE_APPLICATION_CREDENTIALS="/var/lib/sdc-resources/keyfile.json"
- Restart Data Collector to enable the changes.
- On the Credentials tab for the stage, select Default Credentials Provider for the credentials provider.
Service Account Credentials (JSON)
When configured to use the Google Cloud service account credentials file, the destination checks for the file defined in the destination properties.
- Generate a service account credentials file in JSON format.
Use the Google Cloud Platform Console or the gcloud command-line tool to generate and download the credentials file. For more information, see generating a service account credential in the Google Cloud Platform documentation.
- Store the generated credentials file on the Data Collector machine.
As a best practice, store the file in the Data Collector resources directory, $SDC_RESOURCES.
- On the Credentials tab for the stage, select Service Account Credentials File for the credentials provider and enter the path to the credentials file.
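Before entering the path in the stage, it can help to confirm that the downloaded file is a service account key. The following Python sketch is an optional sanity check and not part of Data Collector. It assumes the usual layout of a Google service account key file, which includes the type, client_email, and private_key entries. The function name and the sample values are hypothetical.

```python
import json

# Keys assumed to be present in a service account key file.
REQUIRED_KEYS = {"type", "client_email", "private_key"}


def looks_like_service_account_key(text):
    """Return True when text parses as JSON and resembles a service account key."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and data.get("type") == "service_account"
        and REQUIRED_KEYS.issubset(data)
    )


if __name__ == "__main__":
    # Hypothetical, non-functional sample values for illustration only.
    sample = json.dumps({
        "type": "service_account",
        "client_email": "my-account@my-project.iam.gserviceaccount.com",
        "private_key": "placeholder",
    })
    print(looks_like_service_account_key(sample))
```

The check only inspects the structure of the file. It does not verify that the key is valid or that the account has access to the bucket.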
Partition Prefix
You can use a partition prefix to organize objects by partitions. You can use the partition prefix to write to existing partitions or to create new partitions as needed. When a partition specified in the partition prefix does not exist, the destination creates the partition.
You can specify an exact partition name for the partition prefix, or you can use an expression that evaluates to a partition name.
For example, to write to partitions based on data in the Country field, you can use the following expression as the partition prefix:
${record:value('/Country')}
With this expression, the destination writes records to partitions based on the country data in the record, and creates partitions for countries that do not already have a partition.
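As a rough illustration of this behavior, the following Python sketch groups records by the value of their Country field, creating a group the first time a country appears. It models the idea only and is not how the destination is implemented. The function name and the sample records are hypothetical.

```python
from collections import defaultdict


def partition_records(records, field="Country"):
    """Group records by the value that the partition prefix would evaluate to."""
    partitions = defaultdict(list)
    for record in records:
        # A new partition is created the first time a value is seen.
        partitions[str(record[field])].append(record)
    return dict(partitions)


records = [
    {"Country": "US", "amount": 10},
    {"Country": "FR", "amount": 7},
    {"Country": "US", "amount": 3},
]
grouped = partition_records(records)

if __name__ == "__main__":
    for name, items in sorted(grouped.items()):
        print(name, len(items))
```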
If you use datetime variables in the expression, be sure to configure the time basis for the stage.
Time Basis, Data Time Zone, and Time-Based Partition Prefixes
Together, the time basis and the data time zone determine the time that the Google Cloud Storage destination uses to write records to a time-based partition prefix. When the configured partition prefix does not include time-based functions, you can ignore the time basis property.
A partition prefix has a time component when it includes datetime variables, such as ${YYYY()} or ${DD()}, or when it includes an expression that evaluates to a datetime value, such as ${record:value("/Timestamp")}.
For details about datetime variables, see Datetime Variables.
- Processing Time
- When you use processing time as the time basis, the destination performs writes based on the processing time and the configured partition prefix. By default, the processing time is the time associated with the Data Collector running the pipeline. You can specify a different time zone by configuring the Data Time Zone property. To use the processing time as the time basis, use the following expression:
${time:now()}
This is the default time basis.
- Record Time
- When you use the time associated with a record as the time basis, you specify a date field in the record. The destination writes data based on the datetimes associated with the records, adjusting for the value specified for the Data Time Zone property.
For example, say you define the partition prefix using the following datetime variables:
logs-${YYYY()}-${MM()}-${DD()}
If you use the time of processing as the time basis, the destination writes records to partitions based on when it processes each record. If you use the time associated with the data, such as a transaction timestamp, then the destination writes records to the partitions based on that timestamp. If a partition does not exist, the destination creates the needed partition.
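The effect of the time basis and the data time zone on a prefix such as logs-${YYYY()}-${MM()}-${DD()} can be sketched in Python. This is an illustration under stated assumptions, not the destination's implementation: the function name is hypothetical, the timestamps are invented sample values, and a fixed UTC-8 offset stands in for a configured data time zone.

```python
from datetime import datetime, timedelta, timezone


def partition_prefix(moment, data_tz):
    """Evaluate a logs-YYYY-MM-DD style prefix for a moment in a given time zone."""
    local = moment.astimezone(data_tz)
    return f"logs-{local:%Y}-{local:%m}-{local:%d}"


# Hypothetical sample values.
utc_minus_8 = timezone(timedelta(hours=-8))
record_time = datetime(2024, 3, 1, 2, 30, tzinfo=timezone.utc)
processing_time = datetime(2024, 3, 5, 12, 0, tzinfo=timezone.utc)

# Record time as the time basis, evaluated in two different data time zones.
by_record_utc = partition_prefix(record_time, timezone.utc)
by_record_local = partition_prefix(record_time, utc_minus_8)

# Processing time as the time basis.
by_processing = partition_prefix(processing_time, timezone.utc)

if __name__ == "__main__":
    print(by_record_utc, by_record_local, by_processing)
```

In this sketch the same record lands in a different partition depending on the data time zone, because 02:30 UTC on March 1 is still February 29 at a UTC-8 offset.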
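Object Names
When using any data format except whole file, the destination names objects using the following format:
<prefix>-<UUID>
You configure the object name prefix. For example: sdc-c9a2db16-b5d0-44cb-b3f5-d0781cced760.
When you also configure an object name suffix, the destination uses the following format:
<prefix>-<UUID>.<optional suffix>
For example: sdc-c9a2db16-b5d0-44cb-b3f5-d0781cced760.txt.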
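The naming pattern above, a prefix followed by a UUID and an optional suffix, can be reproduced with a short Python sketch. The function name is hypothetical, and the use of a random version 4 UUID is an assumption. The source does not say which UUID variant the destination generates.

```python
import uuid


def object_name(prefix, suffix=None):
    """Build a name of the form <prefix>-<UUID> with an optional .<suffix>."""
    name = f"{prefix}-{uuid.uuid4()}"
    if suffix:
        name = f"{name}.{suffix}"
    return name


if __name__ == "__main__":
    print(object_name("sdc"))
    print(object_name("sdc", "txt"))
```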
Whole File Names
When writing whole files, the destination names objects using the following format:
<prefix>-<results of the file name expression>
Event Generation
The Google Cloud Storage destination can generate events that you can use in an event stream. When you enable event generation, the destination generates event records each time it completes writing to an object or completes streaming a whole file.
You can use the events in any logical way. For example:
- With the Email executor to send a custom email after receiving an event.
For an example, see Case Study: Sending Email.
- With a destination to store event information.
For an example, see Case Study: Event Storage.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses one of the event types described below, such as wholeFileProcessed for a whole file processed event. |
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
- Object written
- The destination generates an object written event record when it completes writing to an object.
- Whole file processed
- The destination generates an event record when it completes streaming a whole file. Whole file event records have the sdc.event.type record header attribute set to wholeFileProcessed and include the following fields:
Field | Description |
---|---|
sourceFileInfo | A map of attributes about the original whole file that was processed. The attribute names depend on the information provided by the origin system. |
targetFileInfo | A map of attributes about the whole file written to the destination system. The attributes include bucket, the bucket where the whole file is written, and objectKey, the object key name that was written. |
checksum | Checksum generated for the written file. Included only when you configure the destination to include checksums in the event record. |
checksumAlgorithm | Algorithm used to generate the checksum. Included only when you configure the destination to include checksums in the event record. |
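To make the shape of a whole file event record concrete, the following Python sketch models the header attributes and fields as plain dictionaries and builds a one-line summary. It is an illustration only. Data Collector does not expose event records as Python dictionaries in this way, and the function name and all sample values, including the bucket, object key, and checksum, are hypothetical.

```python
def describe_event(header, fields):
    """Summarize an event record modeled as two plain dictionaries."""
    event_type = header.get("sdc.event.type")
    if event_type == "wholeFileProcessed":
        target = fields["targetFileInfo"]
        summary = f"whole file written to {target['bucket']}/{target['objectKey']}"
        # Checksum fields appear only when checksums are enabled in the stage.
        if "checksum" in fields:
            summary += f" ({fields['checksumAlgorithm']}: {fields['checksum']})"
        return summary
    return f"unhandled event type: {event_type}"


# Hypothetical sample event.
header = {"sdc.event.type": "wholeFileProcessed", "sdc.event.version": 1}
fields = {
    "sourceFileInfo": {"filename": "report.csv"},
    "targetFileInfo": {"bucket": "my-bucket", "objectKey": "sdc-report.csv"},
    "checksum": "abc123",
    "checksumAlgorithm": "MD5",
}

if __name__ == "__main__":
    print(describe_event(header, fields))
```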
Data Formats
- Avro
- The destination writes records based on the Avro schema. You can use one of the following methods to specify the location of the Avro schema definition:
- Delimited
- The destination writes records as delimited data. When you use this data format, the root field must be list or list-map.
- JSON
- The destination writes records as JSON data. You can use one of the following formats:
- Array - Each file includes a single array. In the array, each element is a JSON representation of each record.
- Multiple objects - Each file includes multiple JSON objects. Each object is a JSON representation of a record.
- Protobuf
- Writes a batch of messages in each file.
- SDC Record
- The destination writes records in the SDC Record data format.
- Text
- The destination writes data from a single text field to the destination system. When you configure the stage, you select the field to use.
- Whole File
- Streams whole files to the destination system. The destination writes the data to the file and location defined in the stage. If a file of the same name already exists, you can configure the destination to overwrite the existing file or send the current file to error.
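The difference between the two JSON formats described above, array and multiple objects, can be shown with a short Python sketch. The function names and sample records are hypothetical. Separating the objects with newlines is an assumption made for readability, because the source does not state how the destination delimits multiple objects.

```python
import json


def to_json_array(records):
    """Serialize all records as a single JSON array."""
    return json.dumps(records)


def to_multiple_objects(records):
    """Serialize each record as its own JSON object, one per line (assumed)."""
    return "\n".join(json.dumps(record) for record in records)


# Hypothetical sample records.
records = [{"id": 1, "Country": "US"}, {"id": 2, "Country": "FR"}]
as_array = to_json_array(records)
as_objects = to_multiple_objects(records)

if __name__ == "__main__":
    print(as_array)
    print(as_objects)
```

A reader of the array format parses the whole file at once, while a reader of the multiple objects format can parse one object at a time.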