There's a way to load Spark DataFrames using path patterns (Hadoop globs, a simpler cousin of regular expressions). I load a lot of data from AWS to process in Spark, and specifying a path pattern cuts loading times tremendously.
Spark inherits Hadoop's ability to match paths against patterns, and this works when reading from AWS too. The documentation is linked here.
Sample usage:
spark.read.parquet("s3://backet/data/year=20{20,19}/month=*/day=[01]*")
The closest alternative would be:
spark.read.parquet("s3://backet/data").where("[insane sql here]")
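Time-wise, the first approach is orders of magnitude faster, presumably because Spark only has to list the directories that match the pattern instead of everything under the root.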
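For reuse, the pattern from the sample usage can be assembled by a small helper. This is only a sketch: `build_glob` and `read_partitions` are hypothetical names of mine, the `year=/month=/day=` layout is assumed from the example path, and `spark` is assumed to be an existing SparkSession. The `basePath` option is a documented Spark data source option that tells partition discovery where to start, so `year`, `month` and `day` should still show up as columns.

```python
def build_glob(root, years, month="*", day="*"):
    """Build a Hadoop-style glob over an assumed year=/month=/day= layout."""
    years = [str(y) for y in years]
    # {a,b} is glob alternation; a single year needs no braces.
    year_part = years[0] if len(years) == 1 else "{" + ",".join(years) + "}"
    return f"{root}/year={year_part}/month={month}/day={day}"


def read_partitions(spark, root, years, month="*", day="*"):
    """Read only the matching partition directories as one DataFrame.

    `spark` is an existing SparkSession passed in by the caller.
    """
    pattern = build_glob(root, years, month, day)
    # basePath points partition discovery at the dataset root.
    return spark.read.option("basePath", root).parquet(pattern)


# Same selection as the sample usage: 2019 and 2020, every month,
# days whose number starts with 0 or 1.
pattern = build_glob("s3://backet/data", [2019, 2020], day="[01]*")
print(pattern)
```

With a live session this would be called as `read_partitions(spark, "s3://backet/data", [2019, 2020], day="[01]*")`.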