Artificial intelligence didn’t just arrive; it opened a gateway. Like the ship Event Horizon, it was built to push the limits of innovation. Instead, it slipped into something else. Today, every search, voice command, and document is pulled into its gravity, drawn past a threshold where consent becomes irrelevant and return is impossible. Once your data crosses in, it stays. It trains the system, shapes its responses, and leaves a trace that can’t be unlearned.
This isn’t speculative. It’s already happening. Just ask Samsung, whose engineers fed sensitive code into a chatbot without realizing it would be swallowed whole. The line between public and private no longer bends; it fractures. If that leaves you unsettled, it should. Because discomfort is a warning sign. And whether you’re a casual user or a corporate gatekeeper, the question is no longer if your data is in the system but how much, and who’s using it now.
Step in. The fold’s already started.
The Rise of AI and the Quest for Data
Artificial intelligence has made extraordinary progress in recent years, with large language models and image generators leading the way. These systems require enormous volumes of data to learn, improve, and generate coherent, useful outputs.
Whether it’s composing emails or generating lifelike images, their effectiveness depends on how much and what kind of data they consume. The more data they ingest, the better they become at mimicking human language, creativity, and interaction.
But where does all this data come from? Much of it originates from the internet itself: forums, blogs, social media posts, product reviews, and even customer service interactions. These seemingly innocuous digital traces form the building blocks of modern AI systems.
Over time, the distinction between public and private data has become increasingly blurred, making it harder to track what is truly off-limits. Data that was once considered sensitive may now be swept up in massive scraping operations, often without the user’s awareness or consent.
The AI Data Trap: Once In, Never Out
AI chatbots like ChatGPT are now embedded in browsers, mobile operating systems, smart assistants, and even surveillance technologies. These tools quietly monitor and learn from how we interact with the digital world. From typing patterns to location check-ins, AI systems are constantly collecting signals. What’s more concerning is the casual manner in which data is harvested. Companies often claim the data is “publicly available” or has been “anonymized and aggregated,” but those terms are rarely explained clearly. Consent is usually buried deep in lengthy terms of service, which most users never read.
As AI becomes more embedded in daily life, every click, tap, and scroll becomes part of a growing dataset that fuels machine intelligence.
Click, Paste, Breach
The corporate risks of AI data leakage became painfully clear in 2023, when Samsung engineers accidentally uploaded sensitive source code into ChatGPT while troubleshooting. Once submitted, that data left Samsung’s control and could be retained and used as training material. What was intended as a simple tech fix became a data breach that no firewall could prevent, highlighting how easily confidential information can be compromised.
And the Samsung AI data leak isn’t an isolated incident. Across industries, employees frequently copy and paste internal documents, legal drafts, or medical notes into AI tools seeking assistance. AI tools, while helpful, are not secure by default: their design is built around learning from user input, which poses significant risks when proprietary or regulated data is involved. The problem is particularly acute in sectors like tech, law, and finance.
Embrace AI or Protect Data?
Companies are caught in a difficult balancing act when they use AI tools. On one hand, AI promises massive productivity improvements, automating routine tasks and uncovering insights faster than human teams ever could. On the other hand, AI use introduces new vectors for data leaks and compliance violations. For businesses that handle sensitive information, this is a critical concern. The reality is that many organizations have yet to define clear policies around AI use.
In the absence of strong governance, employees often experiment with AI tools without IT oversight, a phenomenon known as “shadow IT.” Compounding the issue is the lack of transparency from AI vendors. Businesses often have no visibility into how data is stored, retained, or reused once it’s input into an AI system.
What’s truly at stake isn’t just corporate secrets, but deeply personal information. AI tools can absorb private conversations, intimate photos, browsing histories, and even biometric data from apps and devices. Once that information is used for training, it becomes a permanent part of the AI’s knowledge base. There is no way to “delete” or extract it later.
As AI becomes more powerful and pervasive, the boundaries of privacy are being redrawn without public discussion. The shift is happening quietly, leaving individuals with little say over how their data is used. And the unsettling truth is: if your data has already been used to train an AI, it can’t be untrained. It will always live somewhere in the model’s decision-making process.
Slow March of Regulation
Existing data protection laws like the GDPR and CCPA were not crafted with generative AI in mind. While they emphasize consent and accountability, they fall short when it comes to AI’s ability to absorb vast datasets, learn from them, and repurpose them in unpredictable ways. These frameworks are reactive rather than proactive, which leaves significant gaps in the protection of confidential and proprietary data.
Efforts to address AI data theft issues remain fragmented across jurisdictions. In some regions, courts are beginning to address lawsuits over AI tools trained on copyrighted images, books, or private content scraped without permission. These early rulings may set important precedents. But for now, the pace of legal reform lags far behind the speed of advancement in AI technology.
How Can You Protect Yourself?
In a world where AI systems are constantly watching, learning, and collecting, safeguarding your personal and professional data has never been more urgent.
While you may not be able to stop every instance of data collection, you can take strategic steps to minimize your exposure and protect sensitive information. Think of it as a digital risk-reduction plan made of small, intentional steps. For individuals and organizations alike, protection starts with awareness and evolves into action. It’s not about avoiding AI entirely, but using it wisely and setting clear boundaries.
Here are some practical ways to take control:
Avoid Sharing Sensitive Info in AI Platforms: Never input private documents, passwords, client records, or medical data into generative AI tools. Even helpful prompts can unintentionally leak confidential information (see the redaction sketch after this list).
Opt Out of Data Sharing When Possible: Many apps and platforms now allow users to opt out of data collection or ad personalization. Always check your settings and decline unnecessary permissions.
Implement Strong Organizational AI Policies: Businesses should create clear, enforceable guidelines around how employees can and cannot use AI tools. Regular training can prevent accidental data leaks.
Choose Responsible AI Vendors: When selecting third-party AI services, ask detailed questions about how data is handled. Look for vendors that commit to data encryption, limited retention, and clear user consent policies.
Use Encrypted and Private Alternatives: Seek out tools and platforms that prioritize user privacy with end-to-end encryption and minimal data retention. From browsers to productivity apps, privacy-focused alternatives can reduce your exposure to AI data theft while still enabling you to benefit from artificial intelligence responsibly.
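On the first point, one practical habit is to scrub obvious identifiers locally before a prompt ever leaves your machine. Below is a minimal sketch, assuming a Python environment; the regular-expression patterns and the redact helper are invented here for illustration, not an exhaustive or foolproof safeguard.

```python
import re

# Illustrative patterns only; real coverage would need far more
# (names, account numbers, internal project codes, source snippets, etc.).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?\d{1,3}[ .-]?)?\d{3}[ .-]?\d{3}[ .-]?\d{4}\b"),
    "api_key": re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9_]{16,}\b"),
}

def redact(text: str) -> str:
    """Replace likely-sensitive substrings with placeholders before the
    text is pasted or sent to an AI tool. A coarse filter, not a guarantee."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

if __name__ == "__main__":
    prompt = "Summarize this ticket from jane.doe@example.com, API key sk_live_1234567890abcdef."
    print(redact(prompt))
    # Summarize this ticket from [REDACTED EMAIL], API key [REDACTED API_KEY].
```

Even a crude filter like this forces a pause before sensitive material crosses the threshold, which is half the battle; whatever it misses still has to be caught by policy and by habit.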
Conclusion
Staying ahead of AI means treating your data like it matters; because it does. Use tools that protect you. Ask the hard questions. Stop assuming convenience is free; if the product is seamless, you’re probably the seam.
This isn’t just about privacy; it’s about leverage. Your data isn’t just floating out there; it’s being captured, indexed, modelled, and sold back to you in the form of predictive systems. Every upload, search, or throwaway prompt isn’t lost to the void; it’s absorbed. And what’s absorbed doesn’t disappear. It learns. It multiplies. It becomes architecture.
That’s the part most people miss: AI doesn’t just reflect the internet; it reflects you. Your patterns, your voice, your vulnerabilities. It’s not neutral. It’s shaped by what you feed it. And once it has that data, you don’t get a say in how it moves next. So stay sharp. Choose your inputs like they’ll last; because they will. Move deliberately. Use privacy-first tools not because they’re perfect, but because they give you a line of defense. And remember: awareness isn’t protection; but it’s where protection starts.
Marc-Roger Gagne MAPP
@ottlegalrebels