A machine learning researcher has released a massive dataset containing 1 million public posts from left-wing social media echo chamber Bluesky, raising questions about data privacy and consent. The data could be used to train AI to be even more woke than notoriously left-leaning AI chatbots like ChatGPT.
404 Media reports that Daniel van Strien, machine learning librarian at AI community platform Hugging Face, published a dataset consisting of 1 million Bluesky posts in a move that raises concerns about user privacy and consent. The dataset is intended for machine learning research and includes the text content of each post, along with metadata such as posting time and the user's decentralized identifier (DID).
Announcing the dataset on Bluesky last week, van Strien said, “This dataset includes 1 million public records collected from Bluesky Social's Firehose API for machine learning research and social media data experimentation. Each post contains information about the text content, metadata, media attachments, and the relationship between replies.”
The data was collected from Bluesky's public Firehose API, which aggregates all public data updates on the platform in real time, but the inclusion of users' DIDs raises privacy concerns. The dataset is not anonymized, and van Strien also created a search tool for finding users by their DID, which he published on Hugging Face.
A quick look at the dataset reveals a wide range of content, from political discussions and concert chatter to pornography. Notably, this dataset is a snapshot of Bluesky at a specific point in time, so it may contain posts that have been deleted by users.
According to the project page, this dataset could be used for a variety of purposes, including training language models, analyzing social media posting patterns, and studying conversation structure. However, the page also lists “out-of-scope” uses, such as building automated posting systems, creating fake or impersonated content, and extracting users' personal information.
The dataset quickly gained popularity on Hugging Face, becoming one of the top trending projects on the platform. This rapid adoption highlights the growing interest in using social media data for machine learning research.
Bluesky, which is built on the open AT Protocol, has previously stated that it does not use user content to train generative AI, nor does it intend to do so. The platform uses AI internally for content moderation and its Discover algorithmic feed, but it says it does not train generative AI systems on user content.
In response to the release of the dataset, a Bluesky spokesperson issued the following statement: “Bluesky is an open, public social network, much like websites on the Internet. Just as robots.txt files don't always prevent outside companies from crawling these sites, the same is true here. We would like to find a way for Bluesky users to tell external organizations/developers whether they consent to this and whether external organizations will respect their consent, and we are actively discussing how to achieve this.”
Van Strien has since deleted the dataset, writing in a post on Bluesky: “While I wanted to support the development of tools for the platform, I recognize that this approach violates the principles of transparency and consent in data collection. We apologize for this mistake.”
Breitbart News previously reported that the left has flooded Bluesky with complaints, demands for censorship and even child pornography.
After several days of explosive growth on the platform, the Bluesky Safety team announced Friday that it had received 42,000 moderation reports in the previous 24 hours, compared with 360,000 in all of 2023. Most troubling, the company admitted that it had received reports of “CSAM,” or child sexual abuse material, commonly known as child pornography.
On X/Twitter, users have noted that the new platform is quick to censor those who express the wrong ideas. That includes users whose accounts were reportedly suspended the same day they signed up, echoing Twitter's moderation practices before Elon Musk bought the company.
Read more at 404 Media.
Lucas Nolan is a reporter for Breitbart News, covering free speech and online censorship issues.





