Hey @barchiel33 ,
Let me share my thoughts. I’m not entirely sure all of this applies directly to your use case, but hopefully some of it helps, and we can iterate further if needed.
From what I understand, you’re using a Python data function in Spotfire to fetch a Databricks table (1.28M rows, ~140 MB) using the Databricks SQL connector, convert it into a pandas DataFrame, and load that into your Spotfire report.
Simplify the data pull — one query instead of many
Right now, you’re executing a separate query for each distinct week using a loop. While this is a valid approach if you need to isolate each week’s data for processing or memory control, it has several performance drawbacks:
Each query involves a full round-trip to the Databricks SQL endpoint, which adds significant latency, especially across regions.
Each response is serialized and deserialized independently, which increases CPU and memory usage on your end.
If the filtering doesn’t substantially reduce the amount of data per query, you’re essentially paying the cost of fetching the entire table, just in smaller, slower chunks.
Unless you have a strong reason to isolate each week’s data up front, a more efficient alternative would be to run a single query:
query = f"SELECT * FROM {databricks_table_name}"
This allows Databricks to optimize the query execution and minimizes overhead by:
Reducing the number of network calls to one.
Handling the result as a single, more efficient data stream.
Letting the compute engine (Databricks SQL) manage the scan and projection internally, which is usually much faster than issuing multiple queries manually.
You can still apply any necessary filtering later in pandas or whichever tool you use, but this way you avoid redundant computation and data transfers at the query layer.
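To make that concrete, here is a minimal sketch of the single-query version using the Databricks SQL connector. The connection details (hostname, HTTP path, token) and the week column name are placeholders rather than values from your script; fetchall_arrow() pulls the result as an Arrow table, which usually converts to pandas faster and with less memory overhead than row-by-row fetching:

from databricks import sql  # databricks-sql-connector

# Connection details are placeholders -- substitute your own workspace values
with sql.connect(
    server_hostname="dbc-xxxxxxxx.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # One round-trip for the whole table instead of one query per week
        cursor.execute(f"SELECT * FROM {databricks_table_name}")
        # fetchall_arrow() returns a pyarrow.Table; to_pandas() avoids the
        # per-row overhead of fetchall()
        df = cursor.fetchall_arrow().to_pandas()

# Any per-week slicing can then happen locally, e.g.
# weekly = {week: grp for week, grp in df.groupby("week_column")}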
Alternative: export once, reuse many times
If your data doesn’t need to be real-time, a great optimization is:
Run a Databricks job to export the table to Parquet or CSV in cloud storage (S3, for example).
In Spotfire, use pandas.read_parquet() or read_csv() to fetch that file directly.
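As a rough sketch of that flow (the table name, bucket, paths, and credentials below are made up, so adjust them to your setup): the first part would run in a scheduled Databricks job or notebook, the second inside your Spotfire Python data function. Reading Parquet from S3 with pandas needs the s3fs package; a pre-signed URL or HTTPS location works too if you'd rather not pass credentials.

# --- In a scheduled Databricks job/notebook (PySpark) ---
# Export the table once to Parquet in cloud storage; coalesce(1) keeps it to
# a single file, which is fine for a ~140 MB table
(
    spark.table("my_catalog.my_schema.my_table")    # hypothetical table name
    .coalesce(1)
    .write.mode("overwrite")
    .parquet("s3://your-bucket/exports/my_table/")  # hypothetical path
)

# --- In the Spotfire Python data function ---
import pandas as pd

df = pd.read_parquet(
    "s3://your-bucket/exports/my_table/",
    storage_options={"key": "<aws-access-key>", "secret": "<aws-secret-key>"},
)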
Cross-region impact is real
Since you’re in the Northeastern US and your workspace is in Western Europe, you’re dealing with unavoidable latency, which is especially visible when issuing 20+ SQL queries in a loop.
The export + download method might be your best bet for speed and reliability.
Hope this helps 🙂
Isi