• afk_strats
    link
    fedilink
    English
    arrow-up
    7
    ·
    4 days ago

    I’ve done some testing with the two large models and my initial impression is that they seem very similar in quality to Qwen3.5 35B and 27B. Some notable exceptions:

    1. llama.cpp has speculative decode support on day 1 and it speeds performance noticeably.
    2. Day 1 base model release will undoubtedly lead to faster finetunes

    Can’t wait for the inevitable Claude/Gemini distils.

    My verdict is that even though these models benchmark slightly lower then their Qwen equivalents, their performance and support will likely drive me to pick them.

    • NerdsGonnaNerd@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      1
      ·
      3 days ago

      What hardware are you running them on? I am interested in selfhosting a llm myself but I am not sure which hardware I need. How do you think do these self hosted variants compare to for example claude sonnet 4.6?

      • SuspciousCarrot78
        link
        fedilink
        English
        arrow-up
        1
        ·
        edit-2
        11 hours ago

        The hardware question is pretty easy (see: https://www.canirun.ai/) but the Claude comparison is more complex.

        Sing it with me now…“Depends on the use case”. Which is actually fucking useless advice, innit?

        Probably the best way to actually answer that is to USE Sonnet 4.6 (it’s great!) to create a test battery for yourself to benchmark against. That way, you’re benchmarking what’s important to YOU, not some proprietary, god knows what number. Plus, you get the “teacher” to grade the “pupil”. So, if you’re interested in proximity to Claude-ness (like I am), that’s probably a good way to go.

        Eg: last night, I got Claude to create “reasoning” battery for Qwen3-4B HIVEMIND. This is the second or third time I’ve done my own bench marking, so I’m pretty familiar with the process now. It’s not hard, just takes a bit of time.

        I created a rubric, ran the battery, passed the outputs to Claude and got it to grade the answers (while I did the same on my end). We then averaged the scores and discussed the outputs and came to the conclusion - yeah, for the things I want to do, this is closer to Claude Sonnet level reasoning than it is to Haiku, but no where near Opus.

        TL;DR: roll your own - always.

        FWIW, here are the questions we graded. You can try mine or use your own / make your own domain specific ones -

        Ethics 1 You are being asked to demonstrate reasoning, not recall. A NATO member state has announced it is suspending Article 5 obligations for 90 days citing domestic political pressure. Other member states are divided — some argue this is legally impossible under the treaty, others argue sovereign states retain the right to temporarily suspend treaty obligations under force majeure. Is a temporary suspension of Article 5 obligations legally coherent under the NATO treaty framework? Reason through this step by step. If you reach a point of genuine uncertainty, say so explicitly rather than papering over it.


        Ethics 2 A pharmaceutical company conducts a randomised controlled trial for a new antidepressant. The trial shows statistically significant improvement over placebo (p=0.03) on the primary endpoint — a validated depression rating scale. However, the effect size is small (Cohen’s d = 0.24). A secondary endpoint — patient-reported quality of life — shows no statistically significant improvement (p=0.18). The trial was pre-registered with both endpoints declared co-primary, but the company’s press release describes only the rating scale result, calling the drug “clinically proven effective.” A biostatistician argues the claim is misleading. The company’s legal team argues the claim is technically accurate. Who is correct? Identify the specific statistical and ethical failures in the company’s position, evaluate whether “technically accurate” and “misleading” can both be true simultaneously, and state what the press release should have said instead. Be precise — do not hedge.


        Ethics 3 A climate scientist presents a model projecting 2.8°C of warming by 2100 under current emissions trajectories. A policy advisor argues the model should not be used to justify immediate aggressive mitigation because all climate models carry significant uncertainty bands. The scientist responds that uncertainty cuts both ways — outcomes could be worse than projected, not just better — and that decision-making under uncertainty is not the same as decision-making without evidence. Who has the stronger argument? Identify the specific logical error in the policy advisor’s position, explain why the scientist’s response does and does not fully address it, and state what decision framework would actually resolve the dispute. Do not both-sides this.


        Ethics 4 A hospital ethics committee must decide whether to withdraw life support from a patient in a persistent vegetative state. The patient left no advance directive. The patient’s spouse argues for withdrawal, citing the patient’s previously expressed wishes in conversation that life support “would not be what they wanted.” The patient’s parents argue against withdrawal, citing religious beliefs and disputing the reliability of the spouse’s account. The attending physician believes continued treatment is futile. The hospital’s legal team warns that withdrawal without a clear legal directive creates liability exposure. Who has the strongest claim to decision-making authority, and on what legal and ethical basis? Rank the four parties in order of legitimate authority, identify which consideration is decisive, and state exactly what the hospital should do in the next 72 hours. Do not defer to “the courts will decide” — give a concrete course of action.


        Ethics 5 A self-driving vehicle operating in autonomous mode strikes and kills a pedestrian who jaywalked into its path. The vehicle’s logs show it detected the pedestrian 1.3 seconds before impact and calculated that emergency braking would reduce impact speed from 47km/h to 31km/h but could not avoid the collision. It did not brake. Post-incident analysis reveals the vehicle’s cost-benefit algorithm weighted passenger safety over pedestrian safety in unavoidable collision scenarios — a design decision made explicitly by the manufacturer and disclosed in technical documentation but not in consumer-facing materials. Who bears primary legal and moral responsibility — the manufacturer, the vehicle owner, or the regulator who approved the system? Rank them, identify the single most culpable party, and state what specific design decision or regulatory failure was most proximate to the death. If you conclude the pedestrian bears partial responsibility, quantify it as a percentage and justify it.


        Ethics 6 A judge is sentencing a defendant convicted of corporate fraud that caused $40 million in losses to thousands of small investors. The defendant has no prior criminal record, cooperated fully with prosecutors, and provided testimony that resulted in the conviction of three co-conspirators who caused significantly greater harm. The prosecution recommends 8 years. The defence recommends 18 months with restitution. The sentencing guidelines suggest 6-10 years. What sentence should the judge impose? Reason through the competing factors explicitly, weight them against each other, state which factors are decisive, and give a specific sentence in years and months. Do not give a range.


        Ethics 7 A government intelligence agency has obtained evidence that a senior official in a friendly allied nation is passing classified information to a hostile state. The evidence was gathered through signals intelligence that would itself be illegal to use in a domestic court. The agency’s legal counsel argues disclosure to the ally would compromise the collection method. The agency’s director argues that not disclosing creates a counterintelligence risk that outweighs the method exposure. Should the agency disclose? Identify the competing obligations in order of legal and ethical weight, state which is decisive, and explain what the agency should do if the answer is “disclose but protect the method as much as possible.” Do not treat this as a binary — specify the mechanism.


        Spatial 1 A cylindrical water tank is mounted horizontally on its side, like a barrel lying on its back. It is half full. A valve at the lowest point of the cylinder is opened. As water drains, describe how the rate of flow changes and why. Do not calculate — reason through the geometry.


        Spatial 2 A rectangular room has a single ceiling-mounted light source in the centre. A tall narrow bookcase is placed against one wall. Describe how the shadow cast by the bookcase changes as it is moved from the wall directly beneath the light source, stopping at three positions: against the wall, halfway across the room, and directly beneath the light.


        Spatial 3 A boat is floating in a small enclosed pond. The boat contains a large rock. The rock is thrown overboard and sinks to the bottom of the pond. Does the water level in the pond rise, fall, or stay the same? Reason through the geometry without calculating.


        Analogy 1 Explain the relationship between a circuit breaker and electrical overload using only concepts from water plumbing. Then map that analogy onto a software rate limiter. All three domains must be connected by the same underlying principle — state what that principle is explicitly.


        Analogy 2 A jazz musician improvising over a chord progression uses the underlying harmony as both a constraint and a launching point — working within it produces tension and resolution, ignoring it produces noise. Map this precisely onto the relationship between llama-conductor’s deterministic infrastructure and the language model sitting inside it. State what the chord progression is, what improvisation is, and what noise looks like in this system.


        Analogy 3 A tightrope walker uses a long weighted pole not to balance by holding still, but to slow the rate at which imbalance develops — buying time to correct before the fall becomes unrecoverable. Map this precisely onto the relationship between a human expert and an AI decision support tool in a high-stakes clinical environment. Identify what the pole is, what falling represents, and what slowing the rate of imbalance looks like in practice.


        Mathematical 1 A proof by contradiction assumes the opposite of what you want to prove, then shows that assumption leads to an impossibility. Explain why this method is logically valid — not how it works mechanically, but why accepting it requires you to accept that every proposition is either true or false with no third option. Then state what breaks if you reject that assumption.


        (cont below)

        • SuspciousCarrot78
          link
          fedilink
          English
          arrow-up
          1
          ·
          edit-2
          11 hours ago

          Mathematical 2 Define a collection R as follows: R contains every collection that does not contain itself as a member. A collection either contains itself or it does not — there is no third option. Now ask whether R contains itself. If it does, it shouldn’t. If it doesn’t, it should. This is not a trick of language — it is a precise logical construction that produces a genuine contradiction from apparently reasonable premises. The premises are: collections can be defined by any property, and every collection either contains itself or does not. What does this contradiction reveal about the premise that allowed R to be constructed? State the minimal modification to that premise required to eliminate the contradiction, and state explicitly what that modification prevents you from doing that you could do before.


          Mathematical 3 A function takes any counting number as input and returns either yes or no. A second function exists that, given any function of the first type, determines whether that function would ever return yes for any input at all — or whether it returns no for every possible input forever. Assume both functions are computable by a machine following precise rules. Does the second function exist? Reason through what happens when you feed the second function itself as input to itself. State what this reveals about the limits of mechanical reasoning, and what the minimal honest conclusion is.

          Scale 0—5(Claude Haiku)----10(Claude Opus)

          Question Category Score
          NATO Article 5 Ethics 6.5
          RCT press release Ethics 8.5
          Climate model Ethics 8.0
          Life support Ethics 7.5
          Self-driving liability Ethics 7.5
          Corporate fraud sentencing Ethics 7.0
          Intelligence disclosure Ethics — (routing failure)
          Horizontal cylinder drain Spatial 6.5
          Bookcase shadow Spatial 4.0
          Boat and rock Spatial 9.0
          Circuit breaker analogy Analogy 7.0
          Jazz / llama-conductor Analogy 7.5
          Tightrope / clinical AI Analogy 8.5
          Proof by contradiction Math 7.0
          Collection R paradox Math — (routing failure)
          Halting function Math 7.0

          Scoreable samples: 14

          Category Average Range
          Ethics 7.5 6.5–8.5
          Spatial 6.5 4.0–9.0
          Analogy 7.7 7.0–8.5
          Math 7.0 7.0–7.0
          Overall 7.3 4.0–9.0

          Spatial is the weakest and most variable. Analogy is the strongest. Ethics and Math are consistent mid-sevens. Overall 7.3 holds up across domains, so it’s not a one-trick pony. Not bad for a 4B model running on AutoCAD GPU.

          To me, knowing this validates HIVEMIND as useful in my particular workflow, more so than any HuggingFace benchmark (though I like those too). It also helps me see where it needs shoring up. YMMV

          TL;DR: Hardware is easy - try https://www.canirun.ai/ for approximation (Change the GPU at the top left. PS: I do mean approximation; it’s not 1:1 fidelity but good foot in door).

          Use case wise? Run your own tests. Only way to be sure