Provision Data Collectors
Step 1. Create a Custom Image
Use Docker to customize the public StreamSets Data Collector Docker image as needed, and then store the private image in your private repository. The customized image can include the following items:
- Customized configuration files
- Resource files
- External libraries, such as JDBC drivers
- Custom stages
- Additional stage libraries. The public Data Collector Docker image includes the basic, development, statistics, and Windows stage libraries only.
  Important: Control Hub requires that the statistics stage library be installed on each registered Data Collector. Control Hub uses the library to run system pipelines on the Data Collector. Be sure to include the statistics stage library in your private image.
- Packages and files required to enable Kerberos authentication for Data Collector:
  - On Linux, the krb5-workstation and krb5-client Kerberos client packages.
  - The Hadoop or HDFS configuration files required by the Kerberos-enabled stage, for example:
    - core-site.xml
    - hdfs-site.xml
    - yarn-site.xml
    - mapred-site.xml
Each deployment managed by a Provisioning Agent specifies the Data Collector Docker image to deploy. So you can create a unique Data Collector Docker image for each deployment, or you can use one Docker image for all deployments.
For example, let's say that one deployment of provisioned Data Collectors reads from web server logs, so the Data Collector Docker image used by that deployment requires only the basic and statistics stage libraries. Another deployment of provisioned Data Collectors reads from the Google Cloud platform, so the Data Collector Docker image used by that deployment requires the Google Cloud stage library in addition to the basic and statistics stage libraries. You can create and manage two separate Data Collector Docker images for the deployments. Or you can create and manage a single image that meets the needs of both deployments.
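The choice between separate images and one shared image comes down to which stage libraries each image must carry. The following is a minimal Python sketch of that reasoning, not part of any StreamSets tooling. The library labels ("basic", "statistics", "google-cloud") are illustrative shorthand rather than real stage library identifiers, and the function and variable names are hypothetical.

```python
# Illustrative labels only; these are not real stage library identifiers.
REQUIRED_BY_CONTROL_HUB = {"statistics"}


def libraries_for_image(deployment_needs):
    """Return the stage libraries one image needs to serve all given deployments."""
    libraries = set(REQUIRED_BY_CONTROL_HUB)
    for needs in deployment_needs:
        libraries |= set(needs)
    return libraries


# Needs of the two deployments described in the example above.
web_logs = {"basic"}
google_cloud = {"basic", "google-cloud"}

# Option 1: one image per deployment.
separate_images = {
    "web-logs": libraries_for_image([web_logs]),
    "google-cloud": libraries_for_image([google_cloud]),
}

# Option 2: a single image that meets the needs of both deployments.
shared_image = libraries_for_image([web_logs, google_cloud])

for name, libraries in separate_images.items():
    print(name, sorted(libraries))
print("shared", sorted(shared_image))
```

Either option always includes the statistics library, because Control Hub requires it on every registered Data Collector.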
For more information about running Data Collector from Docker, see https://hub.docker.com/r/streamsets/datacollector/.
For more information about creating private Docker images and publishing them to a private repository, see the Docker documentation.
Step 2. Create a Provisioning Agent
You can create a Provisioning Agent in either of the following ways:
- Using Helm: Helm is a tool that streamlines installing and managing Kubernetes applications.
- Without using Helm: If you do not want to use Helm, you can define a Provisioning Agent YAML specification file, and then use Kubernetes commands to create and deploy the Provisioning Agent.
When you use either method, you can configure the Provisioning Agent to provision Data Collector containers enabled for Kerberos authentication. However, StreamSets recommends using Helm to enable Kerberos authentication.
Creating an Agent Using Helm
To create a Provisioning Agent using Helm, install Helm and download the Control Agent Helm chart that StreamSets provides. After modifying values in the Helm chart, use the Helm install command to create and deploy the Provisioning Agent as a containerized application to a Kubernetes pod.
Create one Provisioning Agent for each Kubernetes cluster where you want to provision Data Collectors. For example, if you have a production cluster and a disaster recovery cluster, you would create a total of two Provisioning Agents - one for each cluster.
Creating an Agent without Using Helm
To create a Provisioning Agent without using Helm, configure a Provisioning Agent YAML specification file, and then use the Kubernetes create command to create and deploy the Provisioning Agent as a containerized application to a Kubernetes pod.
Create one Provisioning Agent for each Kubernetes cluster where you want to provision Data Collectors. For example, if you have a production cluster and a disaster recovery cluster, you would create a total of two Provisioning Agents - one for each cluster.
Step 3. Define a Deployment YAML Specification
Define a deployment in a YAML specification file. Each file can define a single deployment. The file can optionally define a Kubernetes Horizontal Pod Autoscaler, service, or Ingress associated with the deployment.
Use the apps/v1 API version to define each deployment.
The YAML specification file can contain the following components:
- Deployment: Use for a deployment of one or more execution Data Collectors that can be manually scaled. To manually scale a deployment, you modify the deployment in the Control Hub UI to increase the number of Data Collector instances.
- Deployment associated with a Kubernetes Horizontal Pod Autoscaler: Use for a deployment of one or more execution Data Collectors that must automatically scale during times of peak performance. Define the deployment and Horizontal Pod Autoscaler in the same YAML specification file. The Kubernetes Horizontal Pod Autoscaler automatically scales the deployment based on CPU utilization. For more information, see the Kubernetes Horizontal Pod Autoscaler documentation.
- Deployment associated with a Kubernetes service and Ingress: Use for a deployment of a single authoring Data Collector. To allow users to log into a Data Collector container automatically provisioned on the Kubernetes cluster, you must expose the Data Collector container outside the cluster using a Kubernetes service.
Deployment Sample
Define only a deployment in the YAML specification file when creating a deployment for one or more execution Data Collectors that can be manually scaled.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector-deployment
  namespace: <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <deploymentLabel>
  template:
    metadata:
      labels:
        app : <deploymentLabel>
        kerberosEnabled: true
        krbPrincipal: <KerberosUser>
    spec:
      containers:
      - name : datacollector
        image: <privateImage>
        ports:
        - containerPort: 18630
        volumeMounts:
         - name: krb5conf
           mountPath: /etc/krb5.conf
           subPath: krb5.conf
           readOnly: true
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PORT0
          value: "18630"
      imagePullSecrets:
      - name: <imagePullSecrets>
      volumes:
      - name: krb5conf
        secret:
          secretName: krb5conf
The sample includes the following Kerberos-related attributes:
...
        kerberosEnabled: true
        krbPrincipal: <KerberosUser>
...
        volumeMounts:
         - name: krb5conf
           mountPath: /etc/krb5.conf
           subPath: krb5.conf
           readOnly: true
...
      volumes:
      - name: krb5conf
        secret:
          secretName: krb5conf
The sample uses the following variables:
| Variable | Description |
|---|---|
| agentNamespace | Namespace used for the Provisioning Agent that manages this deployment. | 
| deploymentLabel | Label for this deployment. Must be unique for all deployments managed by the Provisioning Agent. | 
| KerberosUser | User for the Kerberos principal when enabling Kerberos authentication. This attribute is optional. If you remove this attribute, the Provisioning Agent uses a default user. The Provisioning Agent creates a unique Kerberos principal for each deployed Data Collector container. For example, if you define the KerberosUser attribute as marketing and the Provisioning Agent deploys two Data Collector containers, the agent creates two Kerberos principals, one for each container. |
| privateImage | Path to your private Data Collector Docker image stored in your private repository. Or, if using the public StreamSets Data Collector Docker image, modify the attribute to reference the public image, tagged with the Data Collector version. |
| imagePullSecrets | Pull secrets required for the private image stored in your private repository. If using the public StreamSets Data Collector Docker image, remove these lines. |
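Replacing the variables by hand is error-prone when you maintain several deployments. The following Python sketch shows one way to substitute the angle-bracket variables in a specification template and to fail loudly if a variable has no value. It is an illustration only, not a StreamSets or Kubernetes tool. The function name fill_placeholders and the values streamsets and weblogs-sdc are hypothetical.

```python
import re

# Matches variables written as <name>, for example <agentNamespace>.
PLACEHOLDER = re.compile(r"<([A-Za-z][A-Za-z0-9]*)>")


def fill_placeholders(template, values):
    """Replace every <name> variable in the template with values[name]."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            # Refuse to produce a file that still contains an unresolved variable.
            raise KeyError(f"no value supplied for <{name}>")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


# A fragment of the deployment sample; the values below are hypothetical.
template = """\
metadata:
  name: datacollector-deployment
  namespace: <agentNamespace>
spec:
  selector:
    matchLabels:
      app: <deploymentLabel>
"""

filled = fill_placeholders(
    template,
    {"agentNamespace": "streamsets", "deploymentLabel": "weblogs-sdc"},
)
print(filled)
```

Raising an error for a missing value is a deliberate choice here: a leftover variable such as <privateImage> would otherwise only surface later, when the deployment fails to start.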
Deployment and Horizontal Pod Autoscaler Sample
Define a deployment and Horizontal Pod Autoscaler in the YAML specification file when creating a deployment for one or more execution Data Collectors that automatically scale during times of peak performance.
apiVersion: v1
kind: List
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: datacollector-deployment
    namespace: <agentNamespace>
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: <deploymentLabel>
    template:
      metadata:
        labels:
          app : <deploymentLabel>
          kerberosEnabled: true
          krbPrincipal: <KerberosUser>
      spec:
        containers:
        - name : datacollector
          image: <privateImage>
          ports:
          - containerPort: 18630
          volumeMounts:
          - name: krb5conf
            mountPath: /etc/krb5.conf
            subPath: krb5.conf
            readOnly: true
          env:
          - name: HOST
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: PORT0
            value: "18630"
        imagePullSecrets:
        - name: <imagePullSecrets>
        volumes:
        - name: krb5conf
          secret:
            secretName: krb5conf
- apiVersion: autoscaling/v1
  kind: HorizontalPodAutoscaler
  metadata:
    name: datacollector-hpa
    namespace: <agentNamespace>
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: <deploymentLabel>
    minReplicas: 1 
    maxReplicas: 10
    targetCPUUtilizationPercentage: 50
The sample includes the following Kerberos-related attributes:
...
          kerberosEnabled: true
          krbPrincipal: <KerberosUser>
...
          volumeMounts:
          - name: krb5conf
            mountPath: /etc/krb5.conf
            subPath: krb5.conf
            readOnly: true
...
        volumes:
        - name: krb5conf
          secret:
            secretName: krb5conf
The sample uses the following variables:
| Variable | Description |
|---|---|
| agentNamespace | Namespace used for the Provisioning Agent that manages this deployment. | 
| deploymentLabel | Label for this deployment. Must be unique for all deployments managed by the Provisioning Agent. | 
| KerberosUser | User for the Kerberos principal when enabling Kerberos authentication. This attribute is optional. If you remove this attribute, the Provisioning Agent uses a default user. The Provisioning Agent creates a unique Kerberos principal for each deployed Data Collector container. For example, if you define the KerberosUser attribute as marketing and the Provisioning Agent deploys two Data Collector containers, the agent creates two Kerberos principals, one for each container. |
| privateImage | Path to your private Data Collector Docker image stored in your private repository. Or, if using the public StreamSets Data Collector Docker image, modify the attribute to reference the public image, tagged with the Data Collector version. |
| imagePullSecrets | Pull secrets required for the private image stored in your private repository. If using the public StreamSets Data Collector Docker image, remove these lines. |
In the sample, the Horizontal Pod Autoscaler is associated with the deployment through the following scaleTargetRef attributes:
kind: Deployment
name: <deploymentLabel>
In the Horizontal Pod Autoscaler definition, you also might want to modify the minimum and maximum replica values and the target CPU utilization percentage value. For more information on these values, see the Kubernetes Horizontal Pod Autoscaler documentation.
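To get a feel for how the minimum, maximum, and target CPU values interact, it can help to work through the scaling rule. The Kubernetes documentation describes the core calculation as the current replica count multiplied by the ratio of current to target utilization, rounded up. The Python sketch below is a simplified illustration of that rule with the minReplicas and maxReplicas bounds applied. It ignores details such as pod readiness and stabilization windows. The 0.1 tolerance is assumed to be the Kubernetes default, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_cpu_percent, target_cpu_percent,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified Horizontal Pod Autoscaler calculation for a CPU target."""
    ratio = current_cpu_percent / target_cpu_percent
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    # The autoscaler never goes below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


# Values from the sample: minReplicas 1, maxReplicas 10, target CPU 50 percent.
print(desired_replicas(2, 90, 50, 1, 10))   # load above target: scale out
print(desired_replicas(4, 20, 50, 1, 10))   # load below target: scale in
print(desired_replicas(8, 100, 50, 1, 10))  # capped by maxReplicas
```

A lower target percentage makes the deployment scale out earlier, while maxReplicas limits how many Data Collector containers the cluster must be able to host.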
Deployment, Service, and Ingress Sample
Define a deployment, service, and Ingress in the YAML specification file when creating a deployment for a single authoring Data Collector that users must log into.
The following sample YAML specification file defines a deployment associated with a Kubernetes service and Ingress:
apiVersion: v1
kind: List
items:
- apiVersion: v1
  kind: Service
  metadata:
    name: datacollector-service
    namespace: <agentNamespace>
  spec:
    type: LoadBalancer
    ports:
    - name: iot
      port: 18636
      targetPort: 18636
      protocol: TCP
    selector:
      app: <deploymentLabel>
- apiVersion: extensions/v1beta1
  kind: Ingress
  metadata:
    name: authoring-sdc
    namespace: <agentNamespace>
  spec:
    rules:
    - host:
      http:
        paths:
        - path: / 
          backend:
            serviceName: datacollector-service
            servicePort: 18636
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: datacollector-deployment
    namespace: <agentNamespace>
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: <deploymentLabel>
    template:
      metadata:
        labels:
          app : <deploymentLabel>
          kerberosEnabled: true
          krbPrincipal: <KerberosUser>
      spec:
        containers:
        - name : datacollector
          image: <privateImage>
          ports:
          - containerPort: 18630
          volumeMounts:
          - name: krb5conf
            mountPath: /etc/krb5.conf
            subPath: krb5.conf
            readOnly: true
          env:
          - name: HOST
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: PORT0
            value: "18630"
          - name: SDC_CONF_SDC_BASE_HTTP_URL
            value: <serviceURL>
          - name: SDC_CONF_HTTP_ENABLE_FORWARDED_REQUESTS
            value: "true"
        imagePullSecrets:
        - name: <imagePullSecrets>
        volumes:
        - name: krb5conf
          secret:
            secretName: krb5conf
The sample includes the following Kerberos-related attributes:
...
          kerberosEnabled: true
          krbPrincipal: <KerberosUser>
...
          volumeMounts:
          - name: krb5conf
            mountPath: /etc/krb5.conf
            subPath: krb5.conf
            readOnly: true
...
        volumes:
        - name: krb5conf
          secret:
            secretName: krb5conf
The sample uses the following variables:
| Variable | Description |
|---|---|
| agentNamespace | Namespace used for the Provisioning Agent that manages this deployment. | 
| deploymentLabel | Label for this deployment. Must be unique for all deployments managed by the Provisioning Agent. | 
| KerberosUser | User for the Kerberos principal when enabling Kerberos authentication. This attribute is optional. If you remove this attribute, the Provisioning Agent uses a default user. The Provisioning Agent creates a unique Kerberos principal for each deployed Data Collector container. For example, if you define the KerberosUser attribute as marketing and the Provisioning Agent deploys two Data Collector containers, the agent creates two Kerberos principals, one for each container. |
| privateImage | Path to your private Data Collector Docker image stored in your private repository. Or, if using the public StreamSets Data Collector Docker image, modify the attribute to reference the public image, tagged with the Data Collector version. |
| imagePullSecrets | Pull secrets required for the private image stored in your private repository. If using the public StreamSets Data Collector Docker image, remove these lines. |
| serviceURL | URL for the Kubernetes service used to access the authoring Data Collector. The URL must use the same protocol, HTTP or HTTPS, as the Control Hub system. |
Note the following requirements for the file:
- The Ingress must be associated with a service defined in the same file. In the sample above, the Ingress is associated with the defined service through the following attributes:
    serviceName: datacollector-service
    servicePort: 18636
- The service must be associated with the deployment defined in the same file. In the sample above, the service is associated with the defined deployment through the following attribute:
    app: <deploymentLabel>
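These two associations are easy to break when editing the file, for example by renaming the service in one place only. The following Python sketch checks them on a specification that has already been loaded into plain dictionaries. It is a hedged illustration rather than a StreamSets or Kubernetes utility: the function name is hypothetical, the label value authoring-sdc stands in for <deploymentLabel>, and the Ingress backend fields follow the extensions/v1beta1 layout used in the sample above.

```python
import copy


def find_link_problems(items):
    """Return a list of problems; an empty list means the links are consistent."""
    by_kind = {item["kind"]: item for item in items}
    service = by_kind.get("Service")
    ingress = by_kind.get("Ingress")
    deployment = by_kind.get("Deployment")
    if service is None or ingress is None or deployment is None:
        return ["file must define a Service, an Ingress, and a Deployment"]

    problems = []

    # The Ingress must point at the service defined in the same file.
    service_name = service["metadata"]["name"]
    service_ports = {port["port"] for port in service["spec"]["ports"]}
    for rule in ingress["spec"]["rules"]:
        for path in rule["http"]["paths"]:
            backend = path["backend"]
            if backend["serviceName"] != service_name:
                problems.append(f"Ingress targets unknown service {backend['serviceName']}")
            if backend["servicePort"] not in service_ports:
                problems.append(f"Ingress targets unknown port {backend['servicePort']}")

    # The service selector must match the labels on the deployment's pods.
    pod_labels = deployment["spec"]["template"]["metadata"]["labels"]
    for key, value in service["spec"]["selector"].items():
        if pod_labels.get(key) != value:
            problems.append(f"service selector {key}={value} matches no pod label")

    return problems


# Reduced version of the sample; "authoring-sdc" is a hypothetical label value.
sample_items = [
    {"kind": "Service",
     "metadata": {"name": "datacollector-service"},
     "spec": {"ports": [{"name": "iot", "port": 18636, "targetPort": 18636}],
              "selector": {"app": "authoring-sdc"}}},
    {"kind": "Ingress",
     "metadata": {"name": "authoring-sdc"},
     "spec": {"rules": [{"http": {"paths": [
         {"path": "/", "backend": {"serviceName": "datacollector-service",
                                    "servicePort": 18636}}]}}]}},
    {"kind": "Deployment",
     "metadata": {"name": "datacollector-deployment"},
     "spec": {"template": {"metadata": {"labels": {"app": "authoring-sdc"}}}}},
]

print(find_link_problems(sample_items))

# Changing the pod label without updating the service selector breaks the link.
broken_items = copy.deepcopy(sample_items)
broken_items[2]["spec"]["template"]["metadata"]["labels"]["app"] = "other-label"
print(find_link_problems(broken_items))
```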
Attributes for AWS Fargate with EKS
When provisioning Data Collectors to AWS Fargate with Amazon Elastic Kubernetes Service (EKS), add the following additional attributes to the deployment YAML specification file:
- Required attribute: Add the following required environment variable to avoid having to configure the maximum open file limit on the virtual machines provisioned by AWS Fargate:
    - name: SDC_FILE_LIMIT
      value: "0"
- Optional attribute: Add the following optional resources attribute to define the size of the virtual machines that AWS Fargate provisions. Set the values of the cpu and memory attributes as needed:
    resources:
      limits:
        cpu: 500m
        memory: 2G
      requests:
        cpu: 200m
        memory: 2G
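The cpu and memory values use Kubernetes quantity notation, where 500m means 500 millicores (half a CPU core) and 2G means 2 × 10^9 bytes, which is slightly less than 2Gi (2 × 2^30 bytes). The following Python sketch converts a few such quantities to plain numbers and checks that each request does not exceed its limit. It is an illustration that covers only a subset of the suffixes Kubernetes accepts, and the function names are hypothetical.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes: milli, decimal SI, and binary.
MULTIPLIERS = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
}


def parse_quantity(text):
    """Convert a quantity such as '500m' or '2G' to a Decimal number."""
    text = str(text)
    # Try two-letter suffixes first so that 'Gi' is not mistaken for another suffix.
    for suffix in sorted(MULTIPLIERS, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[:-len(suffix)]) * MULTIPLIERS[suffix]
    return Decimal(text)


def requests_within_limits(resources):
    """Return True if every request is less than or equal to its limit."""
    limits = resources.get("limits", {})
    return all(
        parse_quantity(value) <= parse_quantity(limits[name])
        for name, value in resources.get("requests", {}).items()
        if name in limits
    )


# The values from the optional resources attribute above.
resources = {
    "limits": {"cpu": "500m", "memory": "2G"},
    "requests": {"cpu": "200m", "memory": "2G"},
}

print(parse_quantity("500m"))  # CPU cores
print(parse_quantity("2G"))    # bytes
print(requests_within_limits(resources))
```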
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector-deployment
  namespace: <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <deploymentLabel>
  template:
    metadata:
      labels:
        app : <deploymentLabel>
    spec:
      containers:
      - name : datacollector
        image: <privateImage>
        ports:
        - containerPort: 18630
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PORT0
          value: "18630"
        - name: SDC_FILE_LIMIT
          value: "0"
        resources:
           limits:
             cpu: 500m
             memory: 2G
           requests:
             cpu: 200m
             memory: 2G
Step 4. Create a Deployment
After defining the deployment YAML specification file, use Control Hub to create a deployment.
You can create multiple deployments for a single Provisioning Agent. For example, for the Provisioning Agent running in the production cluster, you might create one deployment dedicated to running jobs that read web server logs and another deployment dedicated to running jobs that read data from Google Cloud.
Step 5. Start the Deployment
When you start a deployment, the Provisioning Agent deploys the Data Collector containers to the Kubernetes cluster and starts each Data Collector container.
If you configured the Provisioning Agent for Kerberos authentication, the Provisioning Agent works with Kerberos to dynamically create and inject Kerberos credentials (a service principal and keytab) into each deployed Data Collector container.
The agent deploys each container to a Kubernetes pod. So if the deployment specifies three Data Collector instances, the agent deploys three containers to three Kubernetes pods.
During the startup of each Data Collector container, the Data Collector registers itself with Control Hub.