Harmonic's latest analysis shows how GenAI tools are quietly exposing enterprise data in Q2 2025
In the second quarter of 2025, Harmonic analyzed 1 million GenAI prompts and 20,000 uploaded files across more than 300 GenAI and AI-enabled SaaS applications. The data confirms what many security leaders suspect but struggle to quantify: sensitive data is leaking into GenAI tools at scale.
Nearly 22% of uploaded files and 4.37% of prompts contained sensitive content. Meanwhile, the average enterprise saw 23 GenAI tools in use that it had not previously identified.
The risk of data exposure to GenAI tools is not hypothetical. It is routine, measurable, and growing.
This analysis captures real-world corporate GenAI activity at scale
Harmonic’s research is based on anonymized, real usage data from employees at organizations across the United States and the United Kingdom. Activity was recorded via the Harmonic Security Browser Extension, which captures usage across SaaS environments and GenAI platforms, then sanitizes it for aggregate analysis.
Key parameters of the study:
- 1,000,000 GenAI prompts
- 20,000 uploaded files
- More than 300 GenAI or AI-embedded tools analyzed
- Focus on web-based SaaS tools in active enterprise environments
- Data collected between April and June 2025
Ensuring data privacy and security was paramount throughout this research. No sensitive customer data, personally identifiable information (PII), or proprietary file contents left any customer tenant. The analysis relied exclusively on anonymized data and aggregated counts generated by the Harmonic platform within customer environments.
Certain limitations apply to this data. The findings represent usage patterns within organizations that have deployed Harmonic's protection solutions, potentially indicating a higher level of security awareness compared to the general enterprise population. The analysis is confined to browser-based interactions captured by the extension; usage of GenAI tools via native mobile applications or direct API integrations outside the browser context is not included in this dataset.
ChatGPT remains the largest source of prompt-based data leakage
Of all sensitive prompts analyzed in Q2, 72.6% originated in ChatGPT. Microsoft Copilot (13.7%) and Google Gemini (5.0%) followed, with Claude (2.5%), Poe (2.1%), and Perplexity (1.8%) rounding out the “top six” list.
One dominant trend stands out: code leakage. It is the most common type of sensitive data sent to GenAI tools and was especially prevalent in:
- ChatGPT
- Claude
- DeepSeek
- Baidu Chat
This aligns with developer behavior. Developers frequently use GenAI tools to generate, test, and review code, including proprietary logic and access credentials. The result is a persistent flow of high-value intellectual property into platforms not designed for confidentiality.
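As a rough illustration of how this kind of leakage can be caught before a prompt leaves the browser, the snippet below sketches a pre-submission check for credential-like strings. It is a minimal, hypothetical example: the pattern names and regexes are illustrative only and are not Harmonic's detection logic, which covers far more data types and uses more than simple pattern matching.

```python
import re

# Hypothetical detector patterns, for illustration only. Real secret-scanning
# and DLP engines use much larger pattern sets plus validation and context.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_api_key": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token)\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]"
    ),
}

def flag_secrets(prompt_text: str) -> list[str]:
    """Return the names of any credential-like patterns found in the prompt."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(prompt_text)]

# Example: a prompt that pastes proprietary code along with an embedded key.
prompt = 'Can you review this config? api_key = "sk_live_51Habc12345678901234"'
hits = flag_secrets(prompt)
if hits:
    print("Warn or block before submission:", ", ".join(hits))
```

In practice, a check like this would sit in the browser or at a proxy, in front of whichever GenAI endpoint the developer happens to be using, rather than relying on the destination platform to enforce confidentiality.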
Other usage trends reveal different risk profiles. Prompts to ChatGPT often involved M&A planning, financial modeling, and investor communications. Claude showed a disproportionate number of prompts containing proprietary code and PII, suggesting deeper usage in system-level or regulated workflows.
Use of enterprise versions increased, but 26.3% of exposure still goes through ChatGPT Free
While 47.8% of sensitive prompts and files went through ChatGPT Enterprise, 26.3% still went through the free version of ChatGPT. A further 15.13% of sensitive prompts and files were submitted through free, personal Google Gemini accounts.
On Perplexity, 47.42% of sensitive uploads came from standard (non-enterprise) accounts.
In other words, a large share of data is still flowing through channels with weak or nonexistent enterprise controls. While that share is shrinking, it remains a significant blind spot for organizations.
File uploads to GenAI platforms are consistently sensitive
Despite accounting for just 13.9% of overall data exposure events, uploaded files had a disproportionate concentration of sensitive and strategic content compared to prompt data.
For instance, files were the source of 79.7% of all stored credit card exposures, 75.3% of customer profile leaks, and 68.8% of employee PII incidents—all categories with high regulatory or reputational risk.
Even in financial projections, where both channels are active, files edged out prompts with 52.6% of total exposure volume.

Across core GenAI applications, the average enterprise uploaded 1.32GB of files in Q2. A full 21.86% of these files contained sensitive data.
The most common file types:
- PDFs (over half of all uploads)
- Word and Excel documents
- CSVs
China-based GenAI platforms are growing fast and exposing high-value data
GenAI tools developed in China (such as Baidu Chat, DeepSeek, Kimi Moonshot, Manus, and Qwen) are seeing notable adoption in Western enterprises. Despite being largely unsanctioned, these platforms attract developers because they are fast, capable, and free.
Key statistics from our research released last week:
- 7.95% of employees in the average enterprise used a Chinese GenAI tool
- 1,059 users uploaded more than 17MB of content each
- 535 separate incidents of sensitive exposure were recorded
Breakdown of what was leaked:
- 32.8% involved source code, access credentials, or proprietary algorithms
- 18.2% included M&A documents and investment models
- 17.8% exposed PII such as customer or employee records
- 14.4% contained internal financial data
Each Chinese tool showed distinct risk characteristics. Baidu Chat, for instance, was disproportionately responsible for leaked legal and payment documents. DeepSeek had a high rate of credit card and employee PII exposure.
These tools often provide little to no visibility, auditability, or retention controls, yet employees adopt them with minimal friction.
Embedded AI in SaaS tools is creating an invisible layer of exposure
Not all GenAI risk comes from obvious chatbots. A growing share now stems from everyday SaaS tools that quietly embed LLMs and train on user content. On average, organizations discovered their employees using 23.22 new AI apps in Q2. Each of these needs to be properly vetted and reviewed, leaving security teams more stretched than ever.
To better understand what type of data is going into these tools, we analyzed ten of the most frequently used embedded-AI applications in the enterprise. These tools are not flagged as GenAI tools by most enterprise controls. Yet they often receive sensitive content:
- Canva was used to create documents containing legal strategy, M&A planning, and client data
- Replit and Lovable.dev handled proprietary code and access keys
- Grammarly and QuillBot were used for editing contracts, client emails, and internal legal language

Without robust governance, enterprise data now enters these systems by default (and often stays there).
Enterprises must adapt their governance to reflect AI’s new shape
Shadow IT has long challenged security teams. But the rise of AI embedded in mainstream tools has exacerbated this challenge.
The stopgap measure has been to block any tool in the “AI” category, but AI is now embedded in the very tools employees rely on every day. In many cases, employees do not even realize they are exposing business data.
This shift demands a data-first governance model. Gartner has started referring to this as AI Usage Control (AI-UC), an emerging category focused on monitoring what data flows into AI systems, not just which tools are used.
To adapt, enterprises must:
- Gain visibility into tool usage (including free tiers and embedded tools)
- Monitor what types of data are entering GenAI systems
- Enforce context-aware controls at the data layer (a minimal sketch follows this list)
- Establish opt-out policies and model training restrictions with vendors
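What a data-first control might look like is easiest to see with a small sketch. The example below is hypothetical and simplified, not Harmonic's product logic: it classifies outbound content into a few coarse categories, then applies a different action depending on whether the destination is a sanctioned enterprise tenant or a free, personal account.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical, simplified policy check for illustration only. Category names,
# tool tiers, and rules below are examples, not a real product's logic.

class Action(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"

@dataclass
class Destination:
    tool: str
    enterprise_tier: bool  # sanctioned enterprise tenant vs. free/personal account

def classify(text: str) -> set[str]:
    """Toy classifier; in practice this would be a trained model or DLP engine."""
    categories = set()
    if "BEGIN PRIVATE KEY" in text or "api_key" in text.lower():
        categories.add("credentials")
    if "@" in text and "invoice" in text.lower():
        categories.add("customer_pii")
    return categories

def decide(text: str, dest: Destination) -> Action:
    """Context-aware rule: the same content gets different treatment
    depending on where it is going."""
    categories = classify(text)
    if "credentials" in categories:
        return Action.BLOCK          # never acceptable, regardless of tier
    if categories and not dest.enterprise_tier:
        return Action.BLOCK          # sensitive data bound for an unmanaged tool
    if categories:
        return Action.REDACT         # sensitive data bound for a sanctioned tool
    return Action.ALLOW

print(decide("Summarize this invoice for jane@example.com", Destination("ChatGPT", False)))
# Action.BLOCK -> customer PII headed to a free/personal account
```

The point of the sketch is the decision structure: policy keys off the data and its destination, not off a static list of “AI apps” to block.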
The tools are already here. The data is already flowing. The only question is whether governance can catch up in time.