Monday, August 02, 2021

Apache Spark JSON parsing: confusing errors

Input JSON:

{
  "shipping_address": {
    "street_address": "1600 Pen Avenue NW",
    "city": "Washington",
    "state": "DC",
    "type": "business",
    "additionalProperties": {
      "test": "one",
      "test1": "two"
    }
  }
}

Spark code:

# File location and type
file_location = "/FileStore/tables/mock_example-1.json"
file_type = "json"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)



Running the cell fails with a confusing AnalysisException:

Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column

Spark raises this when it cannot parse a single record in the file: every row lands in the internal _corrupt_record column, and the query references nothing else.
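To see what Spark actually read, the full exception text suggests caching the parsed results before querying the corrupt record column. A minimal sketch, assuming the DataFrame df from the snippet above:

# Caching first works around the Spark 2.3+ restriction on queries
# that reference only the internal _corrupt_record column.
df.cache()
display(df.select("_corrupt_record"))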

Solution: by default, Spark expects one JSON object per line (the JSON Lines format).
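For example, the same document collapsed onto a single line parses cleanly:

{"shipping_address": {"street_address": "1600 Pen Avenue NW", "city": "Washington", "state": "DC", "type": "business", "additionalProperties": {"test": "one", "test1": "two"}}}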

In general we use an editor such as Notepad++ to pretty-print JSON examples, just to validate the structure of the document. If you save the pretty-printed (multi-line) JSON, Spark will fail with the error above.

The Notepad++ JSON plugin also offers a Compress JSON option: compress the file, save it, and Spark reads it fine.
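If reformatting the file is not an option, Spark 2.2+ can also parse pretty-printed JSON directly via the multiLine reader option. A minimal sketch, reusing file_location from the snippet above:

# Treat the whole file as one JSON document instead of
# expecting one JSON object per line.
df = spark.read \
  .option("multiLine", "true") \
  .json(file_location)

display(df)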