drdread
  • 100% (Exalted)
  • Advanced Member Topic Starter
a day ago
Could we make a framework for using the MS Server 2022 AI components to run LLMs in an IIS-published instance of an AI?



sirclesadmin
a day ago

Yes, it is possible to build a framework for running Large Language Models (LLMs) using Windows Server 2022 AI components within an IIS (Internet Information Services) instance. Here's a structured approach to making it work:




Framework for Running LLMs on MS Server 2022 with IIS


1. System Requirements


Before setting up, ensure your server has:



  • Windows Server 2022 with IIS installed.

  • AI & ML components (such as ONNX Runtime, Windows ML, or DirectML for acceleration).

  • A GPU (optional but recommended) if leveraging CUDA, DirectML, or ONNX acceleration.

  • .NET 6/7+ (ASP.NET Core) or Python for hosting AI inference.

  • LLM model files (e.g., GPT-2, Llama, Falcon, or a smaller ONNX-converted model); note that API-only models such as GPT-3 cannot be downloaded and self-hosted.




2. Setting Up IIS for AI Model Hosting


IIS (Internet Information Services) can serve an AI inference API by hosting an ASP.NET Core application directly, or by fronting a Python service (Flask via FastCGI/WSGI, or FastAPI via HttpPlatformHandler or a reverse proxy).


Steps to configure IIS for LLM API hosting:



  1. Enable IIS on Windows Server 2022:


    Install-WindowsFeature -name Web-Server -IncludeManagementTools


  2. Ensure IIS supports .NET and Python apps:



    • Install the .NET Core Hosting Bundle if using ASP.NET.

    • For Python-based LLM services, install wfastcgi/FastCGI (Flask), or plan to reverse-proxy an ASGI server such as uvicorn (FastAPI), as shown later in this thread.



  3. Deploy an AI Web Service in IIS:



    • Deploy an ASP.NET API (C#) or Flask/FastAPI (Python) for model inference.

    • Ensure the AI model is preloaded on the backend.

    • The AI API should expose endpoints such as:
      POST /predict   → takes input text, returns the LLM-generated response
      (a minimal ASP.NET Core sketch of such an endpoint follows this list)




  4. Set IIS Application Pool to Use the Right Python or .NET Runtime



    • If running Python, use FastCGI (Flask/WSGI) or configure IIS as a reverse proxy to the ASGI server (FastAPI).

    • If using .NET, ensure the AI module is inside an ASP.NET Core application.
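
To make step 3 concrete, here is a minimal sketch of the kind of POST /predict endpoint described above, written as an ASP.NET Core minimal API. The RunModel helper and the PredictRequest type are illustrative placeholders, not part of any framework; real ONNX Runtime inference code is shown in section 3.2 below.

// Program.cs -- minimal ASP.NET Core app exposing POST /predict
// "RunModel" is a placeholder for your ONNX Runtime inference code.
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapPost("/predict", (PredictRequest request) =>
{
    string reply = RunModel(request.Text);          // hypothetical helper
    return Results.Ok(new { response = reply });
});

app.Run();

static string RunModel(string text) => $"(model output for: {text})";

record PredictRequest(string Text);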






3. Running LLM Inference on MS Server 2022 AI Components


Windows Server 2022 supports:



  • ONNX Runtime (for running models efficiently).

  • DirectML (GPU-accelerated inference on supported hardware).

  • Windows ML (integrates with ONNX and DirectML).


3.1 Convert and Optimize the LLM for ONNX



  • Convert an LLM (e.g., GPT-2, LLaMA, or Falcon) to ONNX format (a simplified trace export; for production use the transformers/optimum ONNX exporters):
    from transformers import AutoModelForCausalLM
    import torch

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    dummy_input = torch.randint(0, 50256, (1, 64))  # random token IDs used only for tracing
    torch.onnx.export(model, (dummy_input,), "gpt2.onnx",
                      input_names=["input_ids"], output_names=["logits"],
                      dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}},
                      opset_version=14)


  • Load and serve the model with ONNX Runtime:
    import onnxruntime as ort

    session = ort.InferenceSession("gpt2.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])  # CUDA EP needs onnxruntime-gpu; CPU is the fallback


  • Integrate this into an IIS-hosted Python API.


3.2 Running ONNX Inference via .NET API


Alternatively, if using ASP.NET with ONNX, a C# web API can run AI inference:


// requires the Microsoft.ML.OnnxRuntime and Microsoft.ML.OnnxRuntime.Tensors namespaces
var session = new InferenceSession("gpt2.onnx");
// GPT-2 expects int64 token IDs shaped [batch, sequence]
var inputTensor = new DenseTensor<long>(new long[64] /* token IDs from a tokenizer */, new int[] { 1, 64 });
var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor("input_ids", inputTensor) };
using var results = session.Run(inputs);

Expose this via an ASP.NET Core API inside IIS.




4. Performance Optimization



  1. Use DirectML for GPU Acceleration



    • If running on a Windows GPU, use ONNX Runtime with DirectML:
      ort.InferenceSession("gpt2.onnx", providers=["DmlExecutionProvider"])


    • This accelerates inference through ONNX Runtime's DirectML execution provider (install the onnxruntime-directml package); a .NET equivalent is sketched after this list.



  2. Enable HTTP/2 & Compression in IIS



    • Improves API response time for AI queries.

    • Set IIS to enable HTTP/2 and gzip compression for AI responses.



  3. Use WebSockets for Streaming



    • IIS supports WebSockets, allowing real-time AI text generation.
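
A .NET equivalent of the DirectML snippet in item 1, as a hedged sketch: it assumes the Microsoft.ML.OnnxRuntime.DirectML NuGet package is installed in place of the CPU-only package, and that the model path matches the one used elsewhere in this thread.

using Microsoft.ML.OnnxRuntime;

// Ask ONNX Runtime for the DirectML execution provider (GPU); CPU remains the fallback.
var options = new SessionOptions();
options.AppendExecutionProvider_DML(0);   // 0 = default DirectX adapter

using var session = new InferenceSession("C:/AIModels/gpt2.onnx", options);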






5. Deployment Steps



  1. Convert the AI model to ONNX and place it on the server.

  2. Choose Python (Flask/FastAPI) or .NET (ASP.NET Core) for serving the API.

  3. Configure IIS to run the AI Web Service.

  4. Set up ONNX Runtime, DirectML, or CUDA for LLM inference.

  5. Optimize performance using WebSockets, caching, and compression.




Conclusion


This framework allows Microsoft Server 2022 to host LLMs inside IIS using:



  • ONNX Runtime for inference.

  • DirectML or CUDA for acceleration.

  • IIS-based APIs for AI response handling.


This setup gives you self-hosted LLM serving on a Windows Server stack, and it can be scaled out behind IIS as demand grows. 🚀



drdread
  • 100% (Exalted)
  • Advanced Member Topic Starter
a day ago
so how do I deploy the AI after this?
sirclesadmin
a day ago

Deploying the AI Model After Running the Setup Script


Now that your Windows Server 2022 is ready with IIS, ONNX Runtime, and Python/.NET, follow these steps to deploy your Large Language Model (LLM) API.




🚀 Deployment Steps for LLM on IIS (Using Python/ONNX)


1. Choose Your AI API Framework


You can deploy the LLM as a web API using:



  • Python (Flask or FastAPI)

  • .NET Core (ASP.NET Web API)

  • Node.js (Express, but less common for AI inference on Windows Server)


For this example, we will use FastAPI (Python) because it's lightweight, fast, and integrates well with ONNX Runtime.




2. Upload Your AI Model to the Server


If you converted your model to ONNX, upload it to C:\AIModels\. You can use:



  • SFTP/FTP to transfer files

  • Remote Desktop (RDP) copy-paste

  • PowerShell command to download the model:
    Invoke-WebRequest -Uri "https://your-model-source.com/gpt2.onnx" -OutFile "C:\AIModels\gpt2.onnx"





3. Install Python Dependencies (If Not Already Installed)


Log in to your server, open PowerShell, and run:


pip install onnxruntime fastapi uvicorn transformers

This ensures all necessary packages for inference and API deployment are installed.




4. Create the AI API Using FastAPI


On the server, create a new folder for the AI service:


mkdir C:\AIService
cd C:\AIService

Create a Python script for the API:


notepad ai_server.py

Paste the following Python code to load the LLM and serve API requests:


from fastapi import FastAPI
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
import uvicorn

# Load the ONNX model (exported earlier) and the matching tokenizer
model_path = "C:/AIModels/gpt2.onnx"
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained("gpt2")

app = FastAPI()

@app.post("/generate")
async def generate_text(prompt: str, max_new_tokens: int = 32):
    # Tokenize the prompt; the exported graph expects int64 token IDs
    input_ids = tokenizer(prompt, return_tensors="np")["input_ids"].astype("int64")

    # Simple greedy decoding: the ONNX graph returns logits, so pick the
    # highest-scoring token each step and append it to the sequence
    for _ in range(max_new_tokens):
        logits = session.run(None, {"input_ids": input_ids})[0]
        next_token = int(np.argmax(logits[0, -1]))
        input_ids = np.concatenate([input_ids, [[next_token]]], axis=1)
        if next_token == tokenizer.eos_token_id:
            break

    generated_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    return {"response": generated_text}

if __name__ == "__main__":
    # Allows "python ai_server.py" (used by the NSSM service below) as well as
    # "uvicorn ai_server:app --host 0.0.0.0 --port 8000"
    uvicorn.run(app, host="0.0.0.0", port=8000)

Save the file and close Notepad.




5. Test the AI API Locally


Run the API manually first:


python ai_server.py

Then, in a web browser, open:


http://localhost:8000/docs

You should see the FastAPI Swagger UI to test the /generate endpoint.
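
If you would rather test from code than the Swagger UI, a small client call works too. A hedged C# sketch (the URL, port, and prompt are example values; note that the API above takes the prompt as a query parameter, not a JSON body):

using System.Net.Http;

// Quick smoke test of the FastAPI /generate endpoint from a .NET console app.
using var client = new HttpClient();
var url = "http://localhost:8000/generate?prompt=" + Uri.EscapeDataString("Hello, world");

var response = await client.PostAsync(url, content: null);   // no body required
Console.WriteLine(await response.Content.ReadAsStringAsync());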




6. Deploy the AI API as a Windows Service (Persistent)


Instead of running manually, let's set it up as a Windows Service so it runs automatically.



  1. Install NSSM (Non-Sucking Service Manager) to create a persistent service (this uses Chocolatey; you can also download NSSM manually and add it to PATH):
    choco install nssm -y


  2. Create a service that runs the AI API:
    nssm install AIService "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\python.exe" "C:\AIService\ai_server.py"


  3. Start the AI API service:
    nssm start AIService



Now, the AI service will automatically restart if the server reboots.




7. Configure IIS as a Reverse Proxy for Public Access


Now, let's make the AI API publicly accessible via IIS.



  1. Open IIS Manager (inetmgr in PowerShell).

  2. Select your server and click Application Request Routing (ARR and the URL Rewrite module are separate IIS extensions; install them first if the icon is missing).

  3. Enable Reverse Proxy.

  4. Click Server Farms > Create Server Farm > Add:

    • localhost:8000 as the target backend (where FastAPI is running).

    • Enable load balancing if needed.



  5. Create a new IIS website pointing to the proxy.


Now, you can access the AI API using:


https://your-server-ip-or-domain/generate



✅ Final Deployment Checklist


✔ AI Model Deployed in ONNX Format
✔ AI API Running via FastAPI on Port 8000
✔ AI Service Running as a Windows Service (Persistent)
✔ IIS Configured as a Reverse Proxy for Public Access


Now your AI is live and accessible from the web! 🚀



drdread
  • 100% (Exalted)
  • Advanced Member Topic Starter
a day ago
how would I do it with .net core to keep everything MS compatible?
sirclesadmin
a day ago

Deploying LLM with .NET Core on Windows Server 2022 (IIS & ONNX)


To keep everything Microsoft-compatible, we'll use .NET 6/7+ (ASP.NET Core), ONNX Runtime, and IIS for hosting the AI API.


This method ensures:
✅ Full Microsoft stack compatibility
✅ ONNX Runtime acceleration for inference
✅ IIS integration for a secure and scalable deployment




🚀 Steps to Deploy LLM in .NET Core on Windows Server 2022


1. Install Required Software


Ensure your Windows Server 2022 is ready:


# Install IIS and dependencies
Install-WindowsFeature -Name Web-Server, Web-Asp-Net45, Web-WebSockets -IncludeManagementTools

# Install .NET 7 Hosting Bundle (modify for .NET 6 if needed)
Invoke-WebRequest -Uri "https://download.visualstudio.microsoft.com/download/pr/7f8b8947-7b23-47d1-bbe1-88e9dc191ba2/1e3f5e4e8ff87edcf00584cc1edcb3f6/dotnet-hosting-7.0.10-win.exe" -OutFile "dotnet-hosting.exe"
Start-Process -FilePath "dotnet-hosting.exe" -ArgumentList "/quiet /norestart" -Wait
Remove-Item "dotnet-hosting.exe"

# Note: the Hosting Bundle installs the ASP.NET Core runtime only. To run "dotnet new"
# and "dotnet publish" on the server (step 3) you also need the .NET SDK installed,
# or you can build on a development machine and copy the published output across.
# ONNX Runtime is added as a NuGet package inside the project in step 3.



2. Prepare AI Model


If you haven't already converted your LLM to ONNX format, do this:


from transformers import AutoModelForCausalLM
import torch

# Convert the model to ONNX format (simplified trace export)
model = AutoModelForCausalLM.from_pretrained("gpt2")
dummy_input = torch.randint(0, 50256, (1, 64))
torch.onnx.export(model, (dummy_input,), "gpt2.onnx",
                  input_names=["input_ids"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}},
                  opset_version=14)

Upload gpt2.onnx to C:\AIModels\.
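
Before wiring the model into an API, it is worth confirming what the exported graph actually expects, since input names, element types, and shapes must match the tensors you feed it in step 4. A small sketch using ONNX Runtime's metadata properties (the path assumes the upload location above):

using Microsoft.ML.OnnxRuntime;

// Print the input/output names, element types, and shapes of the exported model.
using var session = new InferenceSession(@"C:\AIModels\gpt2.onnx");

foreach (var (name, meta) in session.InputMetadata)
    Console.WriteLine($"input  {name}: {meta.ElementType} [{string.Join(",", meta.Dimensions)}]");

foreach (var (name, meta) in session.OutputMetadata)
    Console.WriteLine($"output {name}: {meta.ElementType} [{string.Join(",", meta.Dimensions)}]");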




3. Create .NET Core AI Web API



  1. Open PowerShell and create a new .NET Web API project:


    mkdir C:\AIService
    cd C:\AIService
    dotnet new webapi -n AIService
    cd AIService


  2. Add Dependencies:


    dotnet add package Microsoft.ML.OnnxRuntime
    dotnet add package Microsoft.AspNetCore.Cors


  3. Modify Program.cs to register controllers, Swagger (used for testing in step 5), and CORS:


    var builder = WebApplication.CreateBuilder(args);

    builder.Services.AddControllers();
    builder.Services.AddEndpointsApiExplorer();
    builder.Services.AddSwaggerGen();
    builder.Services.AddCors(options =>
    {
        options.AddPolicy("AllowAll", policy =>
        {
            policy.AllowAnyOrigin().AllowAnyMethod().AllowAnyHeader();
        });
    });

    var app = builder.Build();

    app.UseSwagger();
    app.UseSwaggerUI();
    app.UseCors("AllowAll");
    app.MapControllers();
    app.Run();





4. Implement ONNX Model Inference


Replace Controllers/AIController.cs with this code:


using Microsoft.AspNetCore.Mvc;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

[Route("api/ai")]
[ApiController]
public class AIController : ControllerBase
{
    // Loaded once and reused; InferenceSession.Run is safe to call from concurrent requests
    private static readonly InferenceSession _session = new InferenceSession("C:/AIModels/gpt2.onnx");

    [HttpPost("generate")]
    public IActionResult GenerateText([FromBody] InputData input)
    {
        // Tokenization of input.Prompt would be required here; the fixed IDs below
        // are placeholders kept for demonstration only.
        var inputIds = new long[] { 1, 2, 3, 4, 5 };   // GPT-2 ONNX graphs expect int64 token IDs
        var container = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input_ids",
                new DenseTensor<long>(inputIds, new int[] { 1, inputIds.Length }))
        };

        using var results = _session.Run(container);
        // The first output is the logits tensor, shaped [batch, sequence, vocab]
        var logits = results.First().AsEnumerable<float>().ToArray();

        return Ok(new { response = logits });
    }
}

public class InputData
{
    public string Prompt { get; set; } = string.Empty;
}
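
The controller above returns raw logits because tokenization is out of scope for the sketch. For reference, turning logits into a next-token prediction is an argmax over the vocabulary dimension at the last sequence position. A hedged helper sketch, assuming the exported graph outputs logits shaped [batch, sequence, vocab] (obtained via results.First().AsTensor<float>()):

using Microsoft.ML.OnnxRuntime.Tensors;

// Greedy next-token selection from GPT-2 logits shaped [batch, sequence, vocab].
static long PickNextToken(Tensor<float> logits)
{
    int lastPos = logits.Dimensions[1] - 1;
    int vocab = logits.Dimensions[2];

    long nextToken = 0;
    float best = float.NegativeInfinity;
    for (int v = 0; v < vocab; v++)
    {
        float score = logits[0, lastPos, v];
        if (score > best) { best = score; nextToken = v; }
    }
    return nextToken;   // append to input_ids and re-run the session to keep generating
}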



5. Test the API Locally


Run the API:


dotnet run --urls http://localhost:5000

Visit:


http://localhost:5000/swagger/index.html

and test the /api/ai/generate endpoint.
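
You can also exercise the endpoint from code rather than Swagger. A hedged sketch posting the JSON body the InputData model expects (URL, port, and prompt are example values matching the configuration above):

using System.Net.Http.Json;

// POST a JSON body matching the InputData model: { "prompt": "..." }
using var client = new HttpClient();
var response = await client.PostAsJsonAsync(
    "http://localhost:5000/api/ai/generate",
    new { prompt = "Hello from the test client" });

Console.WriteLine(await response.Content.ReadAsStringAsync());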




6. Publish & Deploy to IIS



  1. Publish the API:


    dotnet publish -c Release -o C:\AIService\publish


  2. Configure IIS:



    • Open IIS Manager (inetmgr)

    • Create a New Site:

      • Path: C:\AIService\publish

      • Port: 5000 (or 80 for public access)



    • Set the Application Pool's .NET CLR version to "No Managed Code" (ASP.NET Core apps run via the ASP.NET Core Module, not the .NET Framework CLR)



  3. Enable WebSockets & Compression in IIS:


    Set-WebConfigurationProperty -filter "system.webServer/webSocket" -name "enabled" -value "True" -PSPath "IIS:\"
    Set-WebConfigurationProperty -filter "system.webServer/httpCompression" -name "dynamicCompressionLevel" -value "9" -PSPath "IIS:\"


  4. Restart IIS:


    iisreset



Now your AI API is live at:


http://your-server-ip/api/ai/generate



✅ Final Deployment Summary


✔ Windows Server 2022 AI-ready
✔ ONNX LLM Model Deployed
✔ .NET Core API Running on IIS
✔ IIS Optimized with WebSockets & Compression
✔ Public AI API Ready to Serve LLM Responses! 🚀


Now you have a Microsoft-native LLM API running on IIS with .NET Core! 🎉