Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Databricks SQL Python - Result fetching takes extremely long time

barchiel33
New Contributor II

Hello All!

I have a Python script which uses the Databricks SQL Connector for Python to pull a Databricks table into a pandas DataFrame, which is then used to create a table in a Spotfire report. The table contains ~1.28 million rows and 155 columns, totaling around 140 MB of data.

The script I use to pull the data looks like this:

import pandas as pd
from databricks import sql

# access_token and databricks_table_name are script parameters supplied by the
# Spotfire data function interface.

server_hostname = "###########.##.azuredatabricks.net"
connection = sql.connect(server_hostname=server_hostname,
                         http_path="################",
                         access_token=access_token)

# First query: get the distinct week values to chunk the main pull by.
with connection.cursor() as cursor:
    cursor.execute(f"SELECT DISTINCT Week FROM {databricks_table_name}")
    weeks = [row[0] for row in cursor.fetchall()]

# Pull each week separately and concatenate the chunks.
df_list = []
for week in weeks:
    with connection.cursor() as cursor:
        cursor.execute(f"SELECT * FROM {databricks_table_name} WHERE Week = {week}")
        df_list.append(cursor.fetchall_arrow().to_pandas())

output_df = pd.concat(df_list, ignore_index=True)
connection.close()

I am located in the Northeastern United States and the server I'm using is in Western Europe, so my current assumption is that cross-region latency is what's causing the slowdown, but I'm not sure how to prove this. The table is composed of several Parquet files joined together; I was able to download those Parquet files directly in under 5 minutes when I downloaded them all simultaneously.

I tried using a 2X-Small SQL warehouse (our company does not allow us to create our own SQL warehouse computes, and 2X-Small is the default they provide to projects), a Standard 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12) compute (worker and driver had 14 GB memory, 4 cores, 2 min workers, 8 max workers, with autoscaling), and a larger Standard 15.4 LTS compute (worker and driver had 32 GB memory, 8 cores, 4 min workers, 8 max workers). I didn't see a meaningful speed difference between those three, but I haven't tried the current chunked approach on the SQL warehouse or the smaller cluster yet.

3 REPLIES

Isi
Contributor III

Hey @barchiel33 ,

Let me share my thoughts. I'm not entirely sure if all of this applies directly to your use case, but hopefully some of it helps, and we can iterate further if needed.

From what I understand, you’re using a Python data function in Spotfire to fetch a Databricks table (1.28M rows, ~140 MB) using the Databricks SQL connector, convert it into a pandas DataFrame, and load that into your Spotfire report.

Simplify the data pull — one query instead of many

 

Right now, you’re executing a separate query for each distinct week using a loop. While this is a valid approach if you need to isolate each week’s data for processing or memory control, it has several performance drawbacks:

  • Each query involves a full round-trip to the Databricks SQL endpoint, which adds significant latency, especially across regions.

  • Each response is serialized and deserialized independently, which increases CPU and memory usage on your end.

  • If the filtering doesn’t substantially reduce the amount of data per query, you’re essentially paying the cost of fetching the entire table, just in smaller, slower chunks.

Unless you have a strong reason to isolate each week’s data up front, a more efficient alternative would be to run a single query:

 

query = f"SELECT * FROM {databricks_table_name}"

 

This allows Databricks to optimize the query execution and minimizes overhead by:

  • Reducing the number of network calls to one.

  • Handling the result as a single, more efficient data stream.

  • Letting the compute engine (Databricks SQL) manage the scan and projection internally, which is usually much faster than issuing multiple queries manually.

You can still apply any necessary filtering later in pandas or whichever tool you use, but this way you avoid redundant computation and data transfers at the query layer.
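To make that concrete, here is a minimal sketch of the single-pull-then-filter pattern. An in-memory DataFrame stands in for the result of the one `SELECT *` query (in practice it would come from `cursor.fetchall_arrow().to_pandas()`); the column names are placeholders:

```python
import pandas as pd

# Stand-in for the single-query result; in practice this would be
# cursor.fetchall_arrow().to_pandas() after one SELECT * over the table.
output_df = pd.DataFrame({
    "Week": ["2024-W01", "2024-W01", "2024-W02", "2024-W03"],
    "value": [1, 2, 3, 4],
})

# Any per-week isolation can then happen locally, with no extra round-trips.
per_week = {week: grp for week, grp in output_df.groupby("Week")}
```

This keeps the network cost to a single query while still giving you the per-week chunks on the client side.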

Alternative: export once, reuse many times

If your data doesn’t need to be real-time, a great optimization is:

  • Run a Databricks job to export the table to Parquet or CSV in cloud storage (S3, for example).

  • In Spotfire, use pandas.read_parquet() or read_csv() to fetch that file directly.
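As a rough sketch of that export/reuse flow, hedged with a local file path and CSV so it is self-contained (in practice the path would be a cloud-storage URI such as an S3 or ADLS location, and Parquet would usually be the better format for a table this wide):

```python
import pandas as pd

# Hypothetical local path; in practice this would be a cloud-storage URI
# reachable from both the Databricks job and the Spotfire host.
export_path = "weekly_export.csv"

# Databricks side (sketch): a scheduled job exports the table once.
df = pd.DataFrame({"Week": ["2024-W01", "2024-W02"], "value": [10, 20]})
df.to_csv(export_path, index=False)

# Spotfire side: read the exported file directly, with no SQL round-trip.
reloaded = pd.read_csv(export_path)
```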

Cross-region impact is real

Since you’re in the Northeastern US and your workspace is in Western Europe, you’re dealing with unavoidable latency, especially visible when doing 20+ SQL queries in a loop...

Export + download method might be your best bet for speed and reliability.

Hope this helps 🙂

Isi

barchiel33
New Contributor II

Hello @Isi !

Thank you for the response!

I was originally running the query as a single full pull, however what I ran into was that it seems to be pulling the data as a single stream of data. It was taking >20 mins to pull the data with the singular query, while running parallel queries filtered to each of the weeks only took <5 mins TOTAL (each individual query takes between 3-5 mins, but because they're running as separate threads, those 3-5 mins were running concurrently instead of in serial). The increased CPU and Memory usage was negligible, since each stream only received minimal processing before being rejoined with the other streams.

I understand a single query may be more efficient, but it seems to come at the cost of time, which is more critical here. Having an engineer sit for 20+ mins to maybe get the data (since sometimes the process failed for reasons I don't quite understand) is more expensive than the additional wasted databricks resources from having an inefficient query. Maybe I'm just doing something wrong with the single query process? It seems strange to me that the full data download would take so long when it's not that large of a data set. The optimal answer to me would be to download the data in parallel, which is what I assumed the databricks system would do by default, but it doesn't seem to be the case based on the performance.
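For reference, the concurrent per-week pull described above can be sketched like this. `fetch_week` is a stand-in that, in practice, would open its own cursor and run the filtered `SELECT` (each thread needs its own cursor, since cursors should not be shared across threads):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

weeks = ["2024-W01", "2024-W02", "2024-W03"]

def fetch_week(week):
    # Stand-in for: open a cursor, execute
    # SELECT * FROM table WHERE Week = <week>, then fetchall_arrow().to_pandas().
    return pd.DataFrame({"Week": [week], "value": [len(week)]})

# Run the per-week pulls concurrently; pool.map preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(fetch_week, weeks))

output_df = pd.concat(frames, ignore_index=True)
```

With real queries, the wall-clock time approaches the slowest single chunk rather than the sum of all chunks, which matches the <5 min total you observed.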

The alternative solution you proposed is something I had considered but I'm not sure the exact methodology. Is there some way to download the data as parquet to a temporary location? Or would it need to be downloaded to an actual location on the local computer?

Thank you again for the reply! It is well thought out and I appreciate the time you spent on it!

Kind Regards,

Braxton Stover

Isi
Contributor III

Hey @barchiel33 ,

After reviewing your context further, I believe the most effective approach would be to set up an automated pipeline within Databricks that periodically extracts data based on the frequency you need (daily, weekly, hourly, etc.), creating an optimized intermediate or final table that is ready for your analysis.

 

Instead of running ad-hoc queries directly from your local environment, the idea is to create an automated flow in Databricks that runs these extractions on a schedule and stores the results in a table. This would address the performance and latency issues you’re facing because the table would reside in the cloud environment where your Databricks instance runs, enabling virtually instant queries.

 

This table could either be managed directly by Databricks (a managed table in Delta format) or stored in a specific cloud storage location like Azure Storage or AWS S3 (an external table). In either case, Databricks would simply maintain a logical pointer to that location, facilitating both management and governance.
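As an illustration of the difference, here are the kinds of DDL statements such a scheduled job might run, built as plain strings; the table names, source table, and storage location are all hypothetical:

```python
# Managed table (sketch): Databricks controls the storage location.
managed_ddl = """
CREATE OR REPLACE TABLE analytics.weekly_extract AS
SELECT * FROM analytics.source_table
"""

# External table (sketch): data lives at an explicit cloud-storage location
# that other tools, such as Spotfire, could also read from directly.
external_ddl = """
CREATE TABLE analytics.weekly_extract_ext
USING DELTA
LOCATION 'abfss://container@account.dfs.core.windows.net/exports/weekly_extract'
AS SELECT * FROM analytics.source_table
"""
```

In a real job these strings would be executed via `spark.sql(...)` in a scheduled notebook or workflow.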

 

Since you’re using Spotfire, I understand that your current workflow involves pulling data from Databricks into Spotfire for visualization or exploratory analysis, likely via Pandas. Although I’m not very familiar with Spotfire, these types of tools generally allow you to connect directly to tables or specific cloud storage locations (like S3 or Azure Blob Storage). Therefore, it’s possible that you could directly read from the table created by the pipeline.

However, if Spotfire doesn't natively support the Delta format, for example, we could write the periodic data output from Databricks to Parquet or CSV files in a specific cloud storage location, and Spotfire could then connect to that location to consume the data directly.


This way, you’d have a fully automated, robust pipeline, optimized for fast querying, eliminating the latency you’ve encountered, and simplifying your day-to-day workflow. Spotfire would always have access to up-to-date data, available immediately.

I look forward to hearing your thoughts or any questions you may have regarding this approach.

Best regards, 🙂

Isi
