Hey @barchiel33 ,
Let me share my thoughts. I’m not entirely sure all of this applies directly to your use case, but hopefully some of it helps, and we can iterate further if needed.
From what I understand, you’re using a Python data function in Spotfire to fetch a Databricks table (1.28M rows, ~140 MB) using the Databricks SQL connector, convert it into a pandas DataFrame, and load that into your Spotfire report.
Simplify the data pull — one query instead of many
Right now, you’re executing a separate query for each distinct week using a loop. While this is a valid approach if you need to isolate each week’s data for processing or memory control, it has several performance drawbacks:
Each query involves a full round-trip to the Databricks SQL endpoint, which adds significant latency, especially across regions.
Each response is serialized and deserialized independently, which increases CPU and memory usage on your end.
If the filtering doesn’t substantially reduce the amount of data per query, you’re essentially paying the cost of fetching the entire table, just in smaller, slower chunks.
Unless you have a strong reason to isolate each week’s data up front, a more efficient alternative would be to run a single query:
query = f"SELECT * FROM {databricks_table_name}"
This allows Databricks to optimize the query execution and minimizes overhead by:
Reducing the number of network calls to one.
Handling the result as a single, more efficient data stream.
Letting the compute engine (Databricks SQL) manage the scan and projection internally, which is usually much faster than issuing multiple queries manually.
You can still apply any necessary filtering later in pandas or whichever tool you use, but this way you avoid redundant computation and data transfers at the query layer.
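To make that concrete, here is a minimal sketch of the single-query version using the Databricks SQL connector. The connection details (hostname, HTTP path, token) and the week column name are placeholders rather than values from your script; fetchall_arrow() pulls the result as an Arrow table, which usually converts to pandas faster and with less memory overhead than row-by-row fetching:

from databricks import sql  # databricks-sql-connector

# Connection details are placeholders -- substitute your own workspace values
with sql.connect(
    server_hostname="dbc-xxxxxxxx.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # One round-trip for the whole table instead of one query per week
        cursor.execute(f"SELECT * FROM {databricks_table_name}")
        # fetchall_arrow() returns a pyarrow.Table; to_pandas() avoids the
        # per-row overhead of fetchall()
        df = cursor.fetchall_arrow().to_pandas()

# Any per-week slicing can then happen locally, e.g.
# weekly = {week: grp for week, grp in df.groupby("week_column")}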
Alternative: export once, reuse many times
If your data doesn’t need to be real-time, a great optimization is:
Run a Databricks job to export the table to Parquet or CSV in cloud storage (S3, for example).
In Spotfire, use pandas.read_parquet() or read_csv() to fetch that file directly.
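As a rough sketch of that flow (the table name, bucket, paths, and credentials below are made up, so adjust them to your setup): the first part would run in a scheduled Databricks job or notebook, the second inside your Spotfire Python data function. Reading Parquet from S3 with pandas needs the s3fs package; a pre-signed URL or HTTPS location works too if you'd rather not pass credentials.

# --- In a scheduled Databricks job/notebook (PySpark) ---
# Export the table once to Parquet in cloud storage; coalesce(1) keeps it to
# a single file, which is fine for a ~140 MB table
(
    spark.table("my_catalog.my_schema.my_table")    # hypothetical table name
    .coalesce(1)
    .write.mode("overwrite")
    .parquet("s3://your-bucket/exports/my_table/")  # hypothetical path
)

# --- In the Spotfire Python data function ---
import pandas as pd

df = pd.read_parquet(
    "s3://your-bucket/exports/my_table/",
    storage_options={"key": "<aws-access-key>", "secret": "<aws-secret-key>"},
)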
Cross-region impact is real
Since you’re in the Northeastern US and your workspace is in Western Europe, you’re dealing with unavoidable latency, which is especially visible when issuing 20+ SQL queries in a loop.
The export + download method might be your best bet for speed and reliability.
Hope this helps 🙂
Isi