A local AI build for a team of 15-20 people. Review and describe this… - @analyticsfromzero

A local AI build for a team of 15-20 people

Review and describe this home-made AI lab setup in detail

__On 4 Mac Studios with 512 GB of unified memory we brought up a local Kimi 2.5 + Qwen 3.6 + OpenWebUI + LiteLLM stack. It makes a perfectly workable local setup for a team of 10-15 people.__

Your setup is a solid prosumer/SMB-scale local AI lab. Here’s the breakdown:

Hardware: 4× Mac Studio with 512 GB unified memory each

This is the M3 Ultra configuration, Apple’s top-tier workstation. With 2 TB of unified memory in total across the cluster (4 × 512 GB, each machine addressing only its own share), you have enough VRAM-equivalent capacity to run frontier-class open-weight models that would otherwise require an 8× H100 server costing 10–20× more. The unified memory architecture is the key advantage: GPU and CPU share the same memory pool, so massive models load without the PCIe bottleneck you’d hit on a traditional GPU rig.

Models: Kimi K2.5 + Qwen 3.6

Kimi K2.5 (Moonshot AI) is a trillion-parameter MoE model, very strong at agentic and coding tasks and competitive with Claude and GPT-class models on many benchmarks. Even quantized (likely 4-bit), it needs hundreds of GB of memory, which is exactly why you need the clustered Mac Studios. Qwen 3.6 (Alibaba) complements it well: Qwen models tend to be excellent at multilingual tasks, reasoning, and tool use, and they’re available in many sizes, so you can route lighter queries there to save compute.

Software stack: OpenWebUI + LiteLLM

LiteLLM acts as the unified API gateway: it gives every model an OpenAI-compatible endpoint and handles routing, rate limits, key management, and per-user usage tracking. OpenWebUI sits on top as the ChatGPT-style frontend with multi-user accounts, RAG, document uploads, and per-team workspaces. Together they replicate a “ChatGPT Enterprise” experience, but entirely on-prem.

Verdict for a 10–15 person team

This is genuinely a sweet spot.
You get full data sovereignty (nothing leaves your network), zero per-token costs, two complementary frontier models, and capacity that comfortably handles concurrent users at that team size. The main tradeoffs versus cloud APIs are the throughput ceiling (a few simultaneous heavy requests will queue), slower tokens/sec than dedicated H100s, and the ops burden: you own the uptime. For an internal R&D, legal, or engineering team that values privacy over raw speed, the math works out very favorably against API spend within a year.

See https://t.me/rybakalexey

Over-the-top luxury or a sensible setup?

P.S. Apple has removed the 512 GB version from its lineup. The current maximum is 256 GB.
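
The memory arithmetic behind the reply (a trillion-parameter model, "likely 4-bit", on four machines) can be sketched roughly as below. This is a back-of-the-envelope estimate, not a measurement. The `headroom` fraction and the helper names `weight_bytes` and `fits` are hypothetical, introduced only for illustration. The sketch counts weights only and ignores the KV cache, activations, and whatever mechanism actually shards a model across machines.

```python
# Rough check: do quantized weights fit in memory?
# All figures are illustrative assumptions, not measurements.

GB = 10**9  # decimal gigabyte


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone (no KV cache, no runtime overhead)."""
    return params * bits_per_weight / 8


def fits(params: float, bits_per_weight: float, node_gb: float, nodes: int,
         headroom: float = 0.8):
    """Return (weights_gb, usable_cluster_gb, fits_on_one_node, fits_on_cluster).

    headroom is an assumed fraction of each node's memory that is left for
    weights once the OS, KV cache and activations have taken their share.
    """
    weights_gb = weight_bytes(params, bits_per_weight) / GB
    usable_node_gb = node_gb * headroom
    usable_cluster_gb = usable_node_gb * nodes
    return (weights_gb, usable_cluster_gb,
            weights_gb <= usable_node_gb, weights_gb <= usable_cluster_gb)


# A 1-trillion-parameter model at 4 bits per weight on 4 nodes,
# for the 512 GB machines in the post and the 256 GB maximum from the P.S.
for node_gb in (512, 256):
    w, usable, one_node, cluster = fits(1e12, 4, node_gb, nodes=4)
    print(f"{node_gb} GB nodes: weights ~{w:.0f} GB, usable ~{usable:.0f} GB, "
          f"fits on one node: {one_node}, fits on cluster: {cluster}")
```

Under these assumptions the 4-bit weights come to about 500 GB. That is too much for a single machine of either size once headroom is reserved, but it fits across four machines in both cases, with far less room left over on 256 GB nodes.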
