Daniel van Strien PRO
AI & ML interests
Machine Learning Librarian
Recent Activity
liked a dataset about 2 hours ago
impresso-project/impresso-mediaagencies-ner-dataset
liked a dataset about 3 hours ago
stanford-vision-lab/gpic
Organizations
reacted to FlameF0X's post with 🔥 about 2 hours ago
reacted to sergiopaniego's post with 🚀 7 months ago
Post
3185
Meet OpenEnv 👋, an open ecosystem of environments for intelligent agents. Build, share, and test agents safely and consistently.
Ideal for training with TRL (we include examples🤓), deployment, and community collaboration via the HF Hub
Blog: https://huggingface.co/blog/openenv
Hub for Environments: openenv
OpenEnv repo: https://github.com/meta-pytorch/OpenEnv
Try it out using TRL: https://huggingface.co/docs/trl/main/en/openenv
reacted to stefan-it's post with 😎🔥 7 months ago
Post
4815
Wohoo 🥳 I have finished my 2025 GPU workstation build and I am very excited to train new awesome open source models on it.
I built my last GPU workstation 5 years ago featuring an AMD Ryzen 5900X, 64GB of G.SKILL Trident Z RGB on an ASRock X570 Taichi cooled by an Alphacool Eisbär 420. GPU was a Zotac RTX 3090 AMP Extreme. Unfortunately, I was never satisfied with the case - some Fractal Define 7, as it is definitely too small, airflow is not optimal as I had to open the front door all the time and it also arrived with a partly damaged side panel.
For my new build, I've used the following components: an outstanding new AMD Ryzen 9950X3D with 64GB of Corsair Dominator Titanium (what a name). As a huge Noctua fan - warm greetings to my Austrian neighbors - I am using the brand new Noctua NH-D15 G2 on an ASRock X870E Taichi in an amazing Lian Li LANCOOL III chassis. One joke that only NVIDIA Blackwell users will understand: you definitely need a tempered glass panel to check if your GPU cables/connectors start melting 😂 And the best is yet to come: I returned my previously bought Zotac RTX 5090 Solid to the eBay seller (because of... missing ROPs, only NVIDIA Blackwell users will again understand) and bought a Zotac 5090 AMP Extreme INFINITY (yes, the long name indicates that this is the flagship model from Zotac) from a more trustworthy source (NBB in Germany).
I am so happy to start training and fine-tuning new open source models - stay tuned!!!
posted an update 9 months ago
Post
2724
I fine-tuned a smol VLM to generate specialized art history metadata!
https://huggingface.co/davanstrien/iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!)
Trained with TRL + HF Jobs - single UV script, no GPU needed!
Space to explore predictions on a test set: davanstrien/iconclass-predictions
Blog soon!
The model could be super depressed and stressed out!
Hope so!
Yeah, quite bold that they put health + legal use cases so prominently
reacted to clem's post with 🔥 10 months ago
Post
6165
Thread to gossip during the openai GPT-5 livestream: https://www.youtube.com/watch?v=0Uu_VJeVVfo. Feel free to post your impressions below!
Very off topic, but on the theme of music to welcome aliens, this short film is lovely: https://www.youtube.com/watch?v=Jr83bJsT6OA!
posted an update 12 months ago
Post
3748
Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.
Key capabilities:
- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation
The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.
Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)
https://github.com/davanstrien/hub-semantic-search-mcp
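As an illustration of the "parameter count analysis via safetensors metadata" capability, here is a minimal sketch of how a parameter count can be derived from a safetensors header alone, without loading any weights. The post does not say how the tool does this, so the approach below is an assumption: it relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). The function names and the toy tensors are hypothetical.

```python
import json
import struct
from math import prod


def count_parameters(safetensors_bytes: bytes) -> int:
    """Sum the element counts of all tensors listed in a safetensors header."""
    # The first 8 bytes hold the JSON header length as a little-endian uint64.
    (header_len,) = struct.unpack("<Q", safetensors_bytes[:8])
    header = json.loads(safetensors_bytes[8 : 8 + header_len])
    total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        total += prod(info["shape"])  # an empty shape (scalar) counts as 1
    return total


def build_toy_file() -> bytes:
    """Build a tiny in-memory file in safetensors layout (hypothetical tensors)."""
    header = {
        "__metadata__": {"format": "pt"},
        "linear.weight": {"dtype": "F32", "shape": [4, 3], "data_offsets": [0, 48]},
        "linear.bias": {"dtype": "F32", "shape": [4], "data_offsets": [48, 64]},
    }
    header_bytes = json.dumps(header).encode("utf-8")
    payload = bytes(64)  # 16 float32 values, zero-filled
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


print(count_parameters(build_toy_file()))  # 16 (12 weights + 4 biases)
```

Because only the header is read, the same idea works on a remote file by fetching just its first bytes, which is what makes this kind of analysis cheap.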
reacted to cbensimon's post with 🔥 about 1 year ago
Post
6192
🚀 ZeroGPU medium size is now available as a power-user feature
Nothing too fancy for now: ZeroGPU Spaces still default to large (70GB VRAM), but this paves the way for:
- 💰 size-based quotas / pricing (medium will offer significantly more usage than large)
- 🦣 the upcoming xlarge size (141GB VRAM)
You can as of now control GPU size via a Space variable. Accepted values:
- auto (future default)
- medium
- large (current default)
The auto mode checks total CUDA tensor size during startup:
- More than 30GB → large
- Otherwise → medium
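The size selection described in the ZeroGPU post can be sketched as a small decision function. This is an illustration rather than the actual ZeroGPU implementation: the post does not name the Space variable or show any code, so the function name, the constants, and the reading of "30GB" as decimal gigabytes are all assumptions.

```python
GB = 10**9  # assumption: the post's "GB" is read as decimal gigabytes
AUTO_THRESHOLD_BYTES = 30 * GB  # threshold stated in the post for auto mode


def resolve_gpu_size(setting: str, total_cuda_tensor_bytes: int) -> str:
    """Return the GPU size for a Space given its size setting (hypothetical helper)."""
    if setting in ("medium", "large"):
        # Explicit values are used as-is.
        return setting
    if setting == "auto":
        # auto: more than 30GB of CUDA tensors at startup -> large, otherwise medium.
        if total_cuda_tensor_bytes > AUTO_THRESHOLD_BYTES:
            return "large"
        return "medium"
    raise ValueError(f"unsupported size setting: {setting!r}")


print(resolve_gpu_size("auto", 45 * GB))  # large
print(resolve_gpu_size("auto", 8 * GB))   # medium
```

Exactly 30GB falls on the medium side here, following the post's wording "more than 30GB".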
posted an update about 1 year ago
Post
2417
Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).
The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:
- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model
It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.
I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.
Dataset can be found here: marcodsn/academic-chains (give it a like!)
reacted to jasoncorkill's post with 🔥 about 1 year ago
Post
3105
🔥 Yesterday was a fire day!
We dropped two brand-new datasets capturing Human Preferences for text-to-video and text-to-image generations powered by our own crowdsourcing tool!
Whether you're working on model evaluation, alignment, or fine-tuning, this is for you.
1. Text-to-Video Dataset (Pika 2.2 model):
Rapidata/text-2-video-human-preferences-pika2.2
2. Text-to-Image Dataset (Reve-AI Halfmoon):
Rapidata/Reve-AI-Halfmoon_t2i_human_preference
Let’s train AI on AI-generated content with humans in the loop.
Let’s make generative models that actually get us.
reacted to ajibawa-2023's post with 🔥 about 1 year ago
Post
4654
Hi All, I recently released two Audio datasets which are generated using my earlier released dataset: ajibawa-2023/Children-Stories-Collection
First Audio Dataset: https://huggingface.co/datasets/ajibawa-2023/Audio-Children-Stories-Collection-Large has 5600++ stories in .mp3 format.
Second Audio Dataset: https://huggingface.co/datasets/ajibawa-2023/Audio-Children-Stories-Collection has 600 stories in .mp3 format.
reacted to jasoncorkill's post with 🚀🔥 about 1 year ago
Post
3334
🚀 We tried something new!
We just published a dataset using a new (for us) preference modality: direct ranking based on aesthetic preference. We ranked a couple of thousand images from most to least preferred, all sampled from the Open Image Preferences v1 dataset by the amazing @data-is-better-together team.
📊 Check it out here:
Rapidata/2k-ranked-images-open-image-preferences-v1
We're really curious to hear your thoughts!
Is this kind of ranking interesting or useful to you? Let us know! 💬
If it is, please consider leaving a ❤️ and if we hit 30 ❤️s, we’ll go ahead and rank the full 17k image dataset!
replied to jasoncorkill's post about 1 year ago
This is very cool! I was always curious about doing something like this! Could be quite cool to train an "aesthetic preference model" on this kind of dataset. Could be quite cool to try and use as a reward model for image gen training...
cc @sayakpaul @multimodalart @linoyts @davidberenstein1957 who might also find this data interesting :)
reacted to jasoncorkill's post with ❤️ about 1 year ago