AI tools can reason and generate functioning code in languages like Python, SQL, and R, so they can provide impressive value in data analysis. But can they replace data analysts?
By Justus Mulli, Data Scientist on October 31, 2023, in Data Science
What kind of Data Analysis can AI do?
ChatGPT is widely known as one of the most versatile AI tools, with plugins that extend it to a wide range of tasks. It can generate functioning code in Python, R, and many other languages, as well as complex SQL queries. As you can imagine, combining these capabilities lets you use AI for just about every part of your data analysis work.
The use cases include:
- Cleaning and other processing
When it comes to working with data, specialized tools like Julius AI (for CSV files) or BlazeSQL (for SQL databases) are designed specifically for this purpose. Unlike ChatGPT, these tools don’t require you to upload or connect your data and explain it every time you open them up.
ChatGPT works for some quick analysis on a CSV file, but most companies store data in SQL databases inside private networks. Specialized tools, by contrast, can connect to these secured SQL databases and answer your questions by querying your database and visualizing the results.
How could AI replace data analysts?
Data analysis is all about getting insights from data, and data analysts and data scientists are the ones with the technical skills to provide stakeholders with the insights they need. But things have changed, and AI tools can now successfully complete some of the tasks that previously only data analysts and data scientists could do.
In theory, a business stakeholder with no technical skills could now connect their data to an AI tool, and make a request such as “Get the monthly revenue grouped by product, for the top 3 products of the year”. The AI can then grab the data, and even visualize it. The user would only need to spend a few seconds writing out the request. If they had asked a human colleague, they might not have gotten an answer for a few days, or longer.
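To make this concrete, here is a rough sketch of the kind of query a tool might generate for that request. The `sales` table, its columns, and the sample rows are all invented for illustration, and Python's built-in sqlite3 module stands in for a real company database.

```python
import sqlite3

# Hypothetical schema and sample rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, sale_date TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('Widget',    '2023-01-15', 500), ('Widget', '2023-02-10', 700),
        ('Gadget',    '2023-01-20', 300), ('Gadget', '2023-02-05', 400),
        ('Gizmo',     '2023-01-08', 200), ('Gizmo',  '2023-02-18', 250),
        ('Doohickey', '2023-01-30', 50);
""")

QUERY = """
WITH top_products AS (
    -- The three products with the highest total revenue in 2023
    SELECT product
    FROM sales
    WHERE strftime('%Y', sale_date) = '2023'
    GROUP BY product
    ORDER BY SUM(revenue) DESC
    LIMIT 3
)
SELECT strftime('%Y-%m', sale_date) AS sales_month,
       product,
       SUM(revenue) AS monthly_revenue
FROM sales
WHERE product IN (SELECT product FROM top_products)
  AND strftime('%Y', sale_date) = '2023'
GROUP BY sales_month, product
ORDER BY sales_month, product;
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Even in this small sketch the request leaves room for interpretation: "top 3 products" is read here as highest total revenue for the calendar year, and a tool could just as plausibly pick a different definition.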
Seeing a result like this can be both amazing and worrying for data analysts, but replacing data analysts and data scientists isn’t that simple. Running a SQL query and graphing the result is only part of their job, and even that can’t always be done reliably by AI. It may work in an example like this one, but what if the result is wrong even though it looks OK?
Sounds like it’s time to talk about some limitations of AI for working with Data.
Limitation #1: AI Hallucinations
Most people who have worked with ChatGPT and similar tools have heard the term “hallucination” in this context. When you ask them about something they don’t know about, they will sometimes just make stuff up.
The reason for these hallucinations is simple: LLMs are like very advanced autocomplete algorithms. They return the most likely next message in a conversation, based on the data they were trained on. Thanks to high-quality datasets and advanced training techniques, this “autocomplete” works so well that these tools can fulfil complex requests with remarkably high-quality results. Unfortunately, when they encounter situations their training data did not prepare them for, the most likely next message might not actually make much sense.
What if it generates some code that runs, but the code returns the wrong data? The business stakeholders using the AI data analyst might have no idea that the result is wrong, because they can’t spot the mistake if they don’t understand the code.
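One common way this can happen is a join that silently duplicates rows. The sketch below uses made-up `orders` and `shipments` tables in an in-memory SQLite database. Both queries run without errors, but only one of them returns the intended total.

```python
import sqlite3

# Hypothetical tables and values, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100), (2, 200);
    -- Order 2 was shipped in two parcels.
    INSERT INTO shipments VALUES (10, 1), (11, 2), (12, 2);
""")

# Plausible-looking query for "total revenue of shipped orders".
# The join repeats order 2 once per shipment, so its amount is counted twice.
naive_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON s.order_id = o.order_id
""").fetchone()[0]

# Counting each shipped order only once gives the intended figure.
correct_total = conn.execute("""
    SELECT SUM(amount)
    FROM orders
    WHERE order_id IN (SELECT order_id FROM shipments)
""").fetchone()[0]

print("Naive total:  ", naive_total)
print("Correct total:", correct_total)
```

Nothing in the first query looks obviously broken, which is why this kind of mistake is hard to catch for someone who can't read the code or doesn't know that an order can have several shipments.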
Limitation #2: Business information
Usually, when a new data analyst starts working at a company, they’ll have to learn what some of the columns and values mean. This is because the data model was designed by the business. You can’t just analyze data without understanding where it comes from, because common knowledge isn’t enough to understand most databases.
AI tools like BlazeSQL do allow you to include this information for the AI to use, but a data analyst or data scientist will be required to keep it up to date.
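As a rough idea of what "including this information" can look like, the sketch below prepends a small data dictionary to a question before it would be sent to a model. This is not how BlazeSQL or any particular tool works internally. The column names, their meanings, and the `build_prompt` helper are all hypothetical.

```python
# Hypothetical data dictionary: the column meanings a new analyst would have to learn.
DATA_DICTIONARY = {
    "orders.status": "1 = placed, 2 = shipped, 3 = returned (exclude from revenue)",
    "orders.amt_net": "Order value in EUR, after discounts, before tax",
    "customers.seg": "Sales segment code: 'E' = enterprise, 'S' = small business",
}


def build_prompt(question: str, dictionary: dict[str, str]) -> str:
    """Prepend column definitions to the question before sending it to a model."""
    notes = "\n".join(
        f"- {column}: {meaning}" for column, meaning in sorted(dictionary.items())
    )
    return f"Column definitions:\n{notes}\n\nQuestion: {question}"


prompt = build_prompt(
    "What was net revenue from enterprise customers?", DATA_DICTIONARY
)
print(prompt)
```

The dictionary is only useful while it matches the database, so every renamed column or new status code means someone has to update it.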
Limitation #3: Sometimes, AI just gets stuck. AKA “Blind spots”
You may have seen examples of ChatGPT getting stuck on a very basic question. These questions are often very easy to answer, but require the AI to reason in a way that it’s not very good at.
We can call these cases “blind spots”, and they also exist for writing code. For example, a common blind spot when AI generates SQL queries involves subqueries. AI models will often generate queries that try to select a column from a subquery, even though that column does not exist in the subquery.
WITH recent_orders AS (
    SELECT customer_id, MAX(order_date) AS latest_order_date
    FROM orders
    GROUP BY customer_id
)
SELECT
    customer_id,
    product_id,  -- (This column is not defined in the subquery)
    latest_order_date
FROM recent_orders
Even when the mistake is pointed out, they will often make the same mistake when trying again.
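To see the failure in practice, the sketch below runs that query against a tiny in-memory SQLite database with made-up rows, then tries one possible fix: joining back to `orders` to pick up `product_id`. Other databases will word the error differently, and the fix shown is just one option, not necessarily what the model should have produced.

```python
import sqlite3

# Hypothetical sample data, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, product_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 101, '2023-09-01'), (1, 102, '2023-10-15'), (2, 103, '2023-08-20');
""")

CTE = """
WITH recent_orders AS (
    SELECT customer_id, MAX(order_date) AS latest_order_date
    FROM orders
    GROUP BY customer_id
)
"""

# The generated query: product_id is not exposed by recent_orders, so it fails.
error = None
try:
    conn.execute(
        CTE + "SELECT customer_id, product_id, latest_order_date FROM recent_orders"
    )
except sqlite3.OperationalError as exc:
    error = str(exc)
print("Faulty query failed with:", error)

# One possible fix: join back to orders to pick up product_id.
fixed_rows = conn.execute(CTE + """
SELECT o.customer_id, o.product_id, r.latest_order_date
FROM recent_orders r
JOIN orders o
  ON o.customer_id = r.customer_id
 AND o.order_date = r.latest_order_date
ORDER BY o.customer_id
""").fetchall()
print(fixed_rows)
```

At least this kind of mistake fails loudly. The database rejects the query, so nobody ends up acting on a wrong number.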
Limitation #4: AI Models agree too much
AI models tend to agree with you, even when you’re wrong. This can be a huge problem when the AI model is supposed to play the role of an expert, since an expert should be able to correct you when you’re wrong.
Limitation #5: Input length
A human might spend months learning about a project and the database, gathering lots of important information. An LLM, on the other hand, typically has a “token limit”, which means it can only take a certain amount of input.
This input length (AKA “token limit”) is often restrictive when it comes to complex tasks. How could you possibly distill those months of learning into a few pages and fit them into the AI model?
At the time of writing, the widely available version of GPT-4 is limited to about 12 pages of input and output combined (roughly 8,000 tokens). Keep in mind that a data analyst will attend hours of meetings and read documentation and reports. Everything the model produces (code and explanations) also counts against those 12 pages, since the limit covers the output, not just the input.
This means a major data analysis project that requires lots of learning and exploration is simply not feasible.
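A quick back-of-the-envelope calculation shows how fast that budget disappears. The numbers below are assumptions: an 8,000-token limit as a stand-in for the "12 pages", and the rough rule of thumb of about four characters per token for English text. A real tokenizer would give different counts.

```python
CONTEXT_LIMIT_TOKENS = 8_000  # assumed limit, standing in for the "12 pages"
CHARS_PER_TOKEN = 4           # rough rule of thumb for English text, not exact


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: character count divided by 4, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def remaining_input_budget(prompt_parts: list[str], reserved_for_output: int) -> int:
    """Tokens left after the prompt parts and the reserved output are subtracted."""
    used = sum(estimate_tokens(part) for part in prompt_parts)
    return CONTEXT_LIMIT_TOKENS - reserved_for_output - used


# Stand-ins for real documents: only their length matters for this estimate.
schema_notes = "x" * 12_000   # about 3,000 tokens of schema documentation
meeting_notes = "y" * 16_000  # about 4,000 tokens of meeting notes

budget = remaining_input_budget(
    [schema_notes, meeting_notes], reserved_for_output=1_500
)
print("Tokens left for the actual question:", budget)
```

A negative result means the material doesn't fit at all, and that is with only two modest documents and before the actual question has been asked.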
Limitation #6: Soft skills
Last but definitely not least, ChatGPT and other AI chatbots are… just chatbots. Human interaction and soft skills are a big part of working on data projects. Gaining trust, dealing with office politics, and interpreting non-verbal communication are all crucial to successfully collaborating with stakeholders and completing a project.
As you can see, AI has a number of limitations that prevent it from being a fully capable data analyst. The above list just contains some of the main limitations, but there are plenty of other big hurdles when it comes to actually replacing a data expert. In other words, you don’t need to worry about AI replacing you!
That being said, AI is already having a significant impact on Data Analysts and Data Scientists. It may not be perfect, but it is already providing incredible value.
Working faster with AI
Writing code, whether it’s Python, SQL, or R, can be time-consuming. These AI tools may not be 100% accurate, but they still work well a lot of the time. It’s often 10x faster to review what they generated than to do everything from scratch.
In cases where AI struggles or often makes mistakes, it may be faster to just do it from scratch. In other cases, the massive increase in productivity is worth the occasional debugging effort. The important thing is to experiment with different tools, learn their strengths and weaknesses, and integrate them into your workflow accordingly.
What about the future?
Things are progressing extremely quickly, so some of the current limitations won’t necessarily be a factor for long. This is especially true now that AI tools are being used by so many people, as they learn from their users. These interactions are used to train the models, and there are millions of interactions every day.
ChatGPT has the fastest-growing user base of all time, and it learns from that user base.
With competitors like Claude, Bard, and others joining the race, we’re bound to see some massive improvements coming along soon.
Being prepared for these changes is simple: keep an eye out for new tools and experiment with them. That way you’ll know their strengths and weaknesses, and you can make sure you’re leveraging the latest technology and adapting as it evolves.
On that note, a few tools to keep an eye on include:
- BlazeSQL (for SQL databases)
- ChatGPT Advanced Data Analysis (for CSV and other files)
- Pandas AI (adding generative AI to the pandas library)