Whole Directory
The Whole Directory origin reads all files within the specified directory on HDFS or a local file system in a single batch. Every file must be fully written, contain data in the same supported format, and use the same schema.
For example, you might use the Whole Directory origin in a batch pipeline where you want to reread a directory of files each time the pipeline runs. Or, you might use the origin in a slowly changing dimension pipeline that updates partitioned file dimension data.
To read files using a more traditional origin, one that tracks offsets and allows caching, use the File origin.
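The single-batch behavior described above can be pictured with a small local sketch. This is not the origin's implementation: `read_whole_directory` is a hypothetical helper, the files are assumed to be comma-delimited with a header line, and the sample directory is created only so the example runs on its own. The helper keeps no offsets, so every call rereads every file, and it rejects a file whose header differs from the first file's header.

```python
import csv
import tempfile
from pathlib import Path


def read_whole_directory(directory):
    """Read every CSV file in `directory` as one batch of dict records.

    Hypothetical helper for illustration only. It keeps no offsets, so
    each call rereads every file, and it raises an error when a file's
    header differs from the header of the first file read.
    """
    records = []
    expected_header = None
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            header = reader.fieldnames
            if expected_header is None:
                expected_header = header
            elif header != expected_header:
                raise ValueError(f"{path.name} has a different schema: {header}")
            records.extend(reader)
    return records


# Build a small sample directory so the sketch is self-contained.
sample_dir = Path(tempfile.mkdtemp())
(sample_dir / "part-0.csv").write_text("id,name\n1,ada\n2,grace\n")
(sample_dir / "part-1.csv").write_text("id,name\n3,edsger\n")

first_run = read_whole_directory(sample_dir)
second_run = read_whole_directory(sample_dir)  # rereads all files again
```

Because no state is carried between calls, the second run returns the same records as the first. An origin that tracks offsets would instead skip data it had already processed.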
The Whole Directory origin reads from HDFS using connection information stored in a Hadoop configuration file.
When you configure the Whole Directory origin, you specify the directory to read. You select the data format and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties.
You can also specify HDFS configuration properties for an HDFS-compatible system. Any specified properties override those defined in the Hadoop configuration file.
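The override rule can be illustrated with a minimal sketch. It assumes the Hadoop configuration file uses the standard `<configuration>`/`<property>` XML layout. The helper names `parse_hadoop_config` and `effective_config`, the host name, and the property values are hypothetical and chosen only for illustration; the origin's actual merge logic is not shown in this documentation.

```python
import xml.etree.ElementTree as ET

# Sample content in the standard Hadoop configuration file layout.
# The host name and values are made up for this example.
HADOOP_CONFIG_XML = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>"""


def parse_hadoop_config(xml_text):
    """Return the name/value pairs from a Hadoop configuration document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


def effective_config(file_properties, stage_properties):
    """Merge both sources; properties set on the stage take precedence."""
    merged = dict(file_properties)
    merged.update(stage_properties)
    return merged


file_props = parse_hadoop_config(HADOOP_CONFIG_XML)
stage_props = {"dfs.replication": "1"}  # property specified in the origin
config = effective_config(file_props, stage_props)
```

In this sketch, `config["dfs.replication"]` resolves to the value set on the stage, while `fs.defaultFS` keeps the value from the configuration file because nothing overrides it.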
Data Formats
The Whole Directory origin generates records based on the specified data format.
- Avro
- The origin generates a record for every Avro record in an Avro container file. Each file must contain the Avro schema. The origin uses the Avro schema to generate records.
- Delimited
- The origin generates a record for each line in a delimited file. You can specify a custom delimiter, quote, and escape character used in the data.
- JSON
- By default, the origin generates a record for each line in a JSON Lines file. Each line in the file should contain a valid JSON object. For details, see the JSON Lines website.
- ORC
- The origin generates a record for each row in an Optimized Row Columnar (ORC) file.
- Parquet
- The origin generates a record for every Parquet record in the file. The file must contain the Parquet schema. The origin uses the Parquet schema to generate records.
- Text
- The origin generates a record for each line in a text file. The file must use \n as the newline character.
- XML
- The origin generates a record for every row defined in an XML file. You specify the root tag used in files and the row tag used to define records.
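To make the record boundaries for a few of these formats concrete, the following sketch uses only the Python standard library. It is an approximation under stated assumptions, not the origin's parser: the three function names are hypothetical, the delimited example returns each line as a list of values, and the XML example assumes each row element contains simple child elements.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def records_from_json_lines(text):
    """One record per non-empty line; each line is a JSON object."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def records_from_delimited(text, delimiter=",", quote='"', escape="\\"):
    """One record per line, using custom delimiter, quote, and escape."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quotechar=quote,
        escapechar=escape,
    )
    return [row for row in reader]


def records_from_xml(text, root_tag, row_tag):
    """One record per element named `row_tag` under the `root_tag` element."""
    root = ET.fromstring(text)
    if root.tag != root_tag:
        raise ValueError(f"expected root tag {root_tag!r}, found {root.tag!r}")
    return [
        {child.tag: child.text for child in row}
        for row in root.iter(row_tag)
    ]


json_records = records_from_json_lines('{"id": 1}\n{"id": 2}\n')
delimited_records = records_from_delimited(
    "1|'ada|lovelace'|uk\n2|grace|us\n", delimiter="|", quote="'"
)
xml_records = records_from_xml(
    "<people><person><id>1</id><name>ada</name></person>"
    "<person><id>2</id><name>grace</name></person></people>",
    root_tag="people",
    row_tag="person",
)
```

In the delimited sample, the quote character keeps `ada|lovelace` together as a single value even though it contains the delimiter. In the XML sample, each `person` element becomes one record.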
Configuring a Whole Directory Origin
Configure a Whole Directory origin to read all files within a directory on HDFS or the local file system in a single batch.