There appears to be an issue reading internally compressed sas7bdat files, as discussed in #32. This is a recap of what we know so far about the issue and what is required to identify the root cause and a possible fix.
`sas7bdat` is a binary data storage file format used by [SAS](https://www.sas.com/en_us/home.html). There is no public documentation for this file format, and different versions of SAS appear to have evolved it over the years. The best documentation of `sas7bdat` can be found at [SAS7BDAT Database Binary Format](http://www2.uaem.mx/r-mirror/web/packages/sas7bdat/vignettes/sas7bdat.pdf). However, it should be taken with a grain of salt, since it does not accurately reflect the latest revisions of the format nor the internal compression used in `sas7bdat`.
At a high level, `sas7bdat` stores data in `page`s, and each `page` contains rows of serialized data. This is the basis for the `spark-sas7bdat` package, which splits a dataset so it can be processed in parallel. Internally, `spark-sas7bdat` delegates deserialization to the [`parso`](https://github.com/epam/parso) Java library, which does an amazing job deserializing `sas7bdat` files sequentially.
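To make the page layout concrete, here is a minimal sketch of the byte-offset arithmetic it implies: a fixed-size header followed by fixed-size pages. The header and page sizes below are hypothetical, purely for illustration; in a real file these values are read from the `sas7bdat` header (parso surfaces comparable metadata through its file-properties API).

```java
// Sketch of the byte layout implied by the page structure: a fixed-size
// header followed by pageCount fixed-size pages. All numbers here are
// illustrative; real values come from the sas7bdat file header.
public class PageOffsets {
    // Byte offset at which page `pageIndex` (0-based) begins.
    static long pageOffset(long headerLength, long pageLength, long pageIndex) {
        return headerLength + pageIndex * pageLength;
    }

    public static void main(String[] args) {
        long header = 65536;  // hypothetical 64 KiB header
        long page = 65536;    // hypothetical 64 KiB pages
        System.out.println(pageOffset(header, page, 0));  // first page starts right after the header
        System.out.println(pageOffset(header, page, 3));
    }
}
```

This is exactly the arithmetic a splitter must respect: a split that does not land on such a page boundary will hand a reader a stream positioned mid-page.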
`spark-sas7bdat` contains [unit tests](https://github.com/saurfang/spark-sas7bdat/blob/master/src/test/scala/com/github/saurfang/sas/SasRelationSpec.scala#L15) to verify that `sas7bdat` can indeed be split and read correctly as a DataFrame. However, there have been reports that it fails for many datasets in the wild. See #32.
It has been verified that `parso` can read these problematic files just fine (https://github.com/saurfang/spark-sas7bdat/issues/32#issuecomment-419593441). Therefore the bug most likely lies in how we determine the appropriate splits to divide the `sas7bdat` file for parallel processing, i.e. in https://github.com/saurfang/spark-sas7bdat/blob/master/src/main/scala/com/github/saurfang/sas/mapred/SasRecordReader.scala, since everything else is just a thin wrapper over parso thanks to @mulya 's contribution in #10.
Furthermore, it is likely this issue only happens with certain versions of `sas7bdat`, or with `sas7bdat` files that enable internal compression. Recall that externally compressed files (e.g. gzip) are simply not splittable, and we do not support parallel reads of them in `spark-sas7bdat`.
### Build Test Case
We first need to collect a dataset that exhibits the said issue.
@vivard has provided one: https://github.com/saurfang/spark-sas7bdat/issues/32#issuecomment-412535049
@nelson Where did you get your problematic dataset? Can you generate a dummy dataset that exhibits the same problem?
By setting a very [low block size](https://github.com/saurfang/spark-sas7bdat/blob/master/src/test/scala/com/github/saurfang/sas/SasRelationSpec.scala#L13) in unit test, we can force Spark to split the input data and hopefully trigger the error.
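To see why a low block size forces splits: Hadoop's `FileInputFormat` computes the split size as `max(minSize, min(maxSize, blockSize))`, so shrinking the block size shrinks every split (until the minimum caps it). A minimal sketch of that rule:

```java
public class SplitSize {
    // Mirrors Hadoop FileInputFormat's split-size rule:
    // splitSize = max(minSize, min(maxSize, blockSize)).
    // Lowering the block size therefore yields more, smaller splits.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // A typical 128 MiB block: one large split per block.
        System.out.println(computeSplitSize(128L << 20, 1, Long.MAX_VALUE));
        // A tiny 1 MiB block, as the unit test forces: many small splits.
        System.out.println(computeSplitSize(1L << 20, 1, Long.MAX_VALUE));
    }
}
```

Each of those small splits exercises the seek-to-page logic in `SasRecordReader`, which is what we want to stress.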
### Debug Test Case
It can be fairly convoluted to debug the splitting logic in Spark and Hadoop. One potential way to debug this is to create a local multi-threaded Parso reader, without Spark, to replicate and validate the splitting logic.
See `parso`'s unit test [here](https://github.com/epam/parso/blob/caf2cbd948ac8ed623ec6eeee40c82caf081fc76/src/test/java/com/epam/parso/SasFileReaderUnitTest.java#L160) where we read row by row from `sas7bdat` and write to `csv`.
The idea would be to generalize this: determine the number of pages in the `sas7bdat` file, split the pages into chunks, create a separate input stream for each chunk, seek each stream to its chunk's starting page, and process the rows from all pages in the chunk. The splitting logic can be refactored from [here](https://github.com/saurfang/spark-sas7bdat/blob/master/src/main/scala/com/github/saurfang/sas/mapred/SasRecordReader.scala#L81) in this package. Since each Spark executor creates its own input stream, the issue is unlikely to be related to concurrency, so one can run the above logic sequentially for each chunk, which should be easier to debug.
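The chunking step above can be sketched as page-aligned arithmetic, assuming a fixed page length and a page count known up front from the file metadata. The `Chunk` type and the chunk size parameter are hypothetical, not part of the existing code; the point is that every chunk boundary lands exactly on a page boundary:

```java
import java.util.ArrayList;
import java.util.List;

public class PageSplitter {
    // A chunk of whole pages [firstPage, firstPage + numPages), plus the byte
    // range to which an input stream should be seeked to read that chunk.
    record Chunk(long firstPage, long numPages, long byteOffset, long byteLength) {}

    // Split pageCount pages into chunks of at most pagesPerChunk pages,
    // never cutting a page in half (every boundary is page-aligned).
    static List<Chunk> split(long headerLength, long pageLength,
                             long pageCount, long pagesPerChunk) {
        List<Chunk> chunks = new ArrayList<>();
        for (long first = 0; first < pageCount; first += pagesPerChunk) {
            long n = Math.min(pagesPerChunk, pageCount - first);
            chunks.add(new Chunk(first, n,
                    headerLength + first * pageLength, n * pageLength));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Hypothetical file: 1 KiB header, five 512-byte pages, 2 pages per chunk.
        for (Chunk c : split(1024, 512, 5, 2)) {
            System.out.println(c);
        }
    }
}
```

A local harness could then open one input stream per chunk, seek to `byteOffset`, and read `numPages` pages' worth of rows sequentially, comparing the total row count against a plain sequential read of the whole file.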
This might help identify and isolate the issue. We might also discover functions, interfaces, and encapsulation that can be contributed back to parso, which could greatly simplify this package.