Input JSON:
{
  "shipping_address": {
    "street_address": "1600 Pen Avenue NW",
    "city": "Washington",
    "state": "DC",
    "type": "business",
    "additionalProperties": {
      "test": "one",
      "test1": "two"
    }
  }
}
Spark code:
# File location and type
file_location = "/FileStore/tables/mock_example-1.json"
file_type = "json"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

display(df)
Running this fails with the following error:
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column.
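Why this happens (a sketch, assuming the default reader settings): the default JSON reader parses one document per line, so every line of a pretty-printed file fails to parse and the only column Spark can infer is the internal corrupt-record column. Querying that column alone is what raises the exception above.

# Inspecting the inferred schema makes the problem visible
df.printSchema()
# Expected output under these assumptions:
# root
#  |-- _corrupt_record: string (nullable = true)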
Solution: By default, Spark expects one JSON message per line. We generally use Notepad++ or a similar editor to format JSON examples (just to validate the structure of the documents), but if you save the formatted (pretty-printed) JSON, Spark will fail with the error above. The Notepad++ JSON plugin also offers a Compress JSON option, so compress the file, save it, and the read works fine.
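A minimal sketch of both approaches, reusing file_location and spark from the snippet above. Option A follows the compression fix described here; Option B uses Spark's multiLine reader option, an alternative not mentioned in the original post.

# Option A: the file now contains one JSON document per line, e.g. after
# "Compress JSON" in the Notepad++ plugin the sample becomes a single line like:
# {"shipping_address":{"street_address":"1600 Pen Avenue NW","city":"Washington", ... }}
df = spark.read.format("json").load(file_location)
display(df)

# Option B (alternative): keep the pretty-printed file and tell Spark to
# parse multi-line JSON explicitly (option available since Spark 2.2)
df_multiline = spark.read.format("json") \
    .option("multiLine", "true") \
    .load(file_location)
display(df_multiline)

Note that multiLine avoids re-saving the file, but line-delimited JSON is usually preferable for large datasets because a multi-line JSON file cannot be split across executors.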