Google Cloud Storage
The Google Cloud Storage origin reads objects stored in Google Cloud Storage. The objects must be fully written and reside in a single bucket. The object names must share a prefix pattern.
With the Google Cloud Storage origin, you define the bucket, prefix pattern, and optional common prefix. These properties determine the objects that the origin processes.
You also define the project and credentials provider to use to connect to Google Cloud Storage. The origin can retrieve credentials from the Google Application Default Credentials or from a Google Cloud service account credentials file.
After processing an object or upon encountering errors, the origin can keep, archive, or delete the object. When archiving, the origin can copy or move the object.
When the pipeline stops, the Google Cloud Storage origin notes where it stops reading. When the pipeline starts again, the origin continues processing from where it stopped by default. You can reset the origin to process all requested objects.
The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Credentials
Before reading objects in Google Cloud Storage, the Google Cloud Storage origin must pass credentials to Google Cloud Storage. Configure the origin to retrieve the credentials from the Google Application Default Credentials or from a Google Cloud service account credentials file.
Default Credentials Provider
When configured to use the Google Application Default Credentials, the origin checks for the credentials file defined in the GOOGLE_APPLICATION_CREDENTIALS environment variable. If the environment variable doesn't exist and Data Collector is running on a virtual machine (VM) in Google Cloud Platform (GCP), the origin uses the built-in service account associated with the virtual machine instance.
For more information about the default credentials, see Google Application Default Credentials in the Google Developer documentation.
Complete the following steps to define the credentials file in the environment variable:
- Use the Google Cloud Platform Console or the gcloud command-line tool to create a Google service account and have your application use it for API access.
For example, to use the command-line tool, run the following commands:
gcloud iam service-accounts create my-account
gcloud iam service-accounts keys create key.json --iam-account=my-account@my-project.iam.gserviceaccount.com
- Store the generated credentials file on the Data Collector machine.
- Add the GOOGLE_APPLICATION_CREDENTIALS environment variable to the appropriate file and point it to the credentials file. Modify environment variables using the method required by your installation type.
Set the environment variable as follows:
export GOOGLE_APPLICATION_CREDENTIALS="/var/lib/sdc-resources/keyfile.json"
- Restart Data Collector to enable the changes.
- On the Credentials tab for the stage, select Default Credentials Provider for the credentials provider.
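To verify that the default credentials resolve on the Data Collector machine before starting the pipeline, a quick standalone check can help. The following is a minimal sketch assuming the google-auth Python package is installed; it is not part of Data Collector.

```python
# Quick standalone check that Google Application Default Credentials resolve.
# Assumes the google-auth package is installed (pip install google-auth);
# run this on the Data Collector machine after setting the environment variable.
import google.auth

credentials, project_id = google.auth.default()
print("Resolved default credentials for project:", project_id)
```

If no credentials can be found, google.auth.default() raises an error instead of returning, which points to a missing or misconfigured GOOGLE_APPLICATION_CREDENTIALS variable.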
Service Account Credentials (JSON)
When configured to use the Google Cloud service account credentials file, the origin checks for the file defined in the origin properties.
- Generate a service account credentials file in JSON format.
Use the Google Cloud Platform Console or the gcloud command-line tool to generate and download the credentials file. For more information, see generating a service account credential in the Google Cloud Platform documentation.
- Store the generated credentials file on the Data Collector machine.
As a best practice, store the file in the Data Collector resources directory, $SDC_RESOURCES.
- On the Credentials tab for the stage, select Service Account Credentials File for the credentials provider and enter the path to the credentials file.
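To confirm that a service account key file is readable and well formed before entering its path in the stage, a standalone check like the following can help. This is a minimal sketch assuming the google-auth Python package; the key file path is an example, not a required location.

```python
# Standalone sanity check of a service account credentials file.
# Assumes google-auth is installed; the key file path is an example.
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "/var/lib/sdc-resources/keyfile.json"
)
print("Key file loads; service account:", creds.service_account_email)
```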
Common Prefix, Prefix Pattern, and Wildcards
The Google Cloud Storage origin appends the common prefix to the prefix pattern to define the objects that the origin processes. You can specify an exact prefix pattern, or you can use Ant-style path patterns to read multiple objects recursively.
Ant-style path patterns can include the following wildcards:
- Question mark (?) to match a single character
- Asterisk (*) to match zero or more characters
- Double asterisks (**) to match zero or more directories
For example, to process all log files in US/East/MD/ and all nested prefixes, you can use the following common prefix and prefix pattern:
Common Prefix: US/East/MD/
Prefix Pattern: **/*.log
If the nested prefixes that you want to include appear earlier in the hierarchy, such as US/**/weblogs/, you can split the hierarchy between the common prefix and the prefix pattern, or define the entire hierarchy in the prefix pattern, as follows:
Common Prefix: US/
Prefix Pattern: **/weblogs/*.log

Common Prefix:
Prefix Pattern: US/**/weblogs/*.log
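To see how these wildcards combine with a common prefix, the following sketch translates an Ant-style prefix pattern to a regular expression and tests it against sample object names. It mirrors the matching rules described above for illustration only; it is not the origin's actual implementation.

```python
# Illustrative Ant-style pattern matching against object names.
# Not the origin's implementation; it just mirrors the rules above.
import re

def ant_to_regex(pattern: str) -> str:
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")   # ** spans zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")      # * matches zero or more characters
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")       # ? matches a single character
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)

common_prefix, prefix_pattern = "US/East/MD/", "**/*.log"
matcher = re.compile(re.escape(common_prefix) + ant_to_regex(prefix_pattern))

for name in ("US/East/MD/app.log",
             "US/East/MD/2023/01/app.log",
             "US/East/MD/app.txt"):
    print(name, "->", bool(matcher.fullmatch(name)))
# Prints True, True, False: both direct and nested .log objects match.
```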
Event Generation
The Google Cloud Storage origin can generate events when it completes processing all available data and the configured batch wait time has elapsed.
Google Cloud Storage origin events can be used in any logical way. For example:
- With the Pipeline Finisher executor to stop the pipeline and transition the pipeline to a Finished state when the origin completes processing available data.
When you restart a pipeline stopped by the Pipeline Finisher executor, the origin continues processing from the last-saved offset unless you reset the origin.
For an example, see Case Study: Stop the Pipeline.
- With a destination to store event information.
For an example, see Case Study: Event Storage.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Event records generated by the Google Cloud Storage origin have the following event-related record header attributes. Record header attributes are stored as String values:

| Record Header Attribute | Description |
|---|---|
| sdc.event.type | Event type. Uses the following type: no-more-data |
| sdc.event.version | An integer that indicates the version of the event record type. |
| sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
The Google Cloud Storage origin can generate the following event record:
- no-more-data
- The Google Cloud Storage origin generates a no-more-data event record when the origin completes processing all available records and the number of seconds configured for Batch Wait Time elapses without any new objects appearing to be processed.
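As an illustration, the header attributes on a no-more-data event record would look roughly like the following. This is a hypothetical Python sketch based on the table above; the version and timestamp values are illustrative, not captured from a real pipeline. Within Data Collector itself, routing on the event type is typically done with a stream condition or precondition such as ${record:eventType() == 'no-more-data'}.

```python
# Hypothetical header attributes of a no-more-data event record, per the
# table above. All values are stored as Strings; the version and timestamp
# shown here are illustrative.
event_headers = {
    "sdc.event.type": "no-more-data",
    "sdc.event.version": "1",
    "sdc.event.creation_timestamp": "1700000000000",
}

# A consumer that reads stored event records could filter on the event type:
if event_headers["sdc.event.type"] == "no-more-data":
    print("Origin finished processing all available objects.")
```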
Data Formats
The Google Cloud Storage origin processes data differently based on the data format. The origin processes the following types of data:
- Avro
- Generates a record for every Avro record. Includes a "precision" and "scale" field attribute for each Decimal field. For more information about field attributes, see Field Attributes.
- Delimited
- Generates a record for each delimited line (a brief parsing sketch follows this list). You can use the following delimited format types:
- Default CSV - File that includes comma-separated values. Ignores empty lines in the file.
- RFC4180 CSV - Comma-separated file that strictly follows RFC4180 guidelines.
- MS Excel CSV - Microsoft Excel comma-separated file.
- MySQL CSV - MySQL comma-separated file.
- PostgreSQL CSV - PostgreSQL comma-separated file.
- PostgreSQL Text - PostgreSQL text file.
- Tab-Separated Values - File that includes tab-separated values.
- Custom - File that uses user-defined delimiter, escape, and quote characters.
- Excel
- Generates a record for every row in the file. Can process .xls or .xlsx files.
- JSON
- Generates a record for each JSON object. You can process JSON files that include multiple JSON objects or a single JSON array.
- Log
- Generates a record for every log line.
- Protobuf
- Generates a record for every protobuf message.
- SDC Record
- Generates a record for every record. Use to process records generated by a Data Collector pipeline using the SDC Record data format.
- Text
- Generates a record for each line of text or for each section of text based on a custom delimiter.
- Whole File
- Streams whole files from the origin system to the destination system. You can specify a transfer rate or use all available resources to perform the transfer.
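As a simple illustration of the delimited case mentioned above, each non-empty line of a comma-separated object becomes one record with one value per field. The following sketch uses Python's csv module for illustration only; the origin performs this parsing internally.

```python
# Illustration of "a record for each delimited line" using comma-separated
# values; empty lines are skipped, as the Default CSV format type describes.
import csv
import io

data = io.StringIO("id,name\n1,alpha\n\n2,beta\n")
for row in csv.reader(data):
    if not row:          # ignore empty lines
        continue
    print(row)           # each non-empty line yields one record's values
```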