drdread
  • 100% (Exalted)
  • Advanced Member Topic Starter
a day ago
Could we make a framework for using the MS Server 2022 AI components to run LLMs in an IIS-published instance of an AI?



sirclesadmin
a day ago

Yes, it is possible to build a framework for running Large Language Models (LLMs) using Windows Server 2022 AI components within an IIS (Internet Information Services) instance. Here's a structured approach to making it work:




Framework for Running LLMs on MS Server 2022 with IIS


1. System Requirements


Before setting up, ensure your server has:



  • Windows Server 2022 with IIS installed.

  • AI & ML components (such as ONNX Runtime, Windows ML, or DirectML for acceleration).

  • A GPU (optional but recommended) if leveraging CUDA, DirectML, or ONNX acceleration.

  • .NET 6/7+ (ASP.NET Core) or Python for hosting AI inference.

  • LLM model files (e.g., GPT-2, Llama, Falcon, or a smaller ONNX-converted model); note that API-only models such as GPT-3 cannot be downloaded and self-hosted.




2. Setting Up IIS for AI Model Hosting


IIS (Internet Information Services) can serve an AI inference API by hosting an ASP.NET Core application directly, or by fronting a Python service (Flask via FastCGI/WSGI, or FastAPI via HttpPlatformHandler or a reverse proxy).


Steps to configure IIS for LLM API hosting:



  1. Enable IIS on Windows Server 2022:


    Install-WindowsFeature -name Web-Server -IncludeManagementTools


  2. Ensure IIS supports .NET and Python apps:



    • Install the .NET Core Hosting Bundle if using ASP.NET.

    • For Python-based LLM services, install wfastcgi/FastCGI (Flask), or plan to reverse-proxy an ASGI server such as uvicorn (FastAPI), as shown later in this thread.



  3. Deploy an AI Web Service in IIS:



    • Deploy an ASP.NET API (C#) or Flask/FastAPI (Python) for model inference.

    • Ensure the AI model is preloaded on the backend.

    • The AI API should expose endpoints such as:
      POST /predict   → takes input text, returns the LLM-generated response
      (a minimal ASP.NET Core sketch of such an endpoint follows this list)




  4. Set IIS Application Pool to Use the Right Python or .NET Runtime



    • If running Python, use FastCGI (Flask/WSGI) or configure IIS as a reverse proxy to the ASGI server (FastAPI).

    • If using .NET, ensure the AI module is inside an ASP.NET Core application.
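
To make step 3 concrete, here is a minimal sketch of the kind of POST /predict endpoint described above, written as an ASP.NET Core minimal API. The RunModel helper and the PredictRequest type are illustrative placeholders, not part of any framework; real ONNX Runtime inference code is shown in section 3.2 below.

// Program.cs -- minimal ASP.NET Core app exposing POST /predict
// "RunModel" is a placeholder for your ONNX Runtime inference code.
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapPost("/predict", (PredictRequest request) =>
{
    string reply = RunModel(request.Text);          // hypothetical helper
    return Results.Ok(new { response = reply });
});

app.Run();

static string RunModel(string text) => $"(model output for: {text})";

record PredictRequest(string Text);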






3. Running LLM Inference on MS Server 2022 AI Components


Windows Server 2022 supports:



  • ONNX Runtime (for running models efficiently).

  • DirectML (GPU-accelerated inference on supported hardware).

  • Windows ML (integrates with ONNX and DirectML).


3.1 Convert and Optimize the LLM for ONNX



  • Convert an LLM (e.g., GPT-2, LLaMA, or Falcon) to ONNX format (a simplified trace export; for production use the transformers/optimum ONNX exporters):
    from transformers import AutoModelForCausalLM
    import torch

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    dummy_input = torch.randint(0, 50256, (1, 64))  # random token IDs used only for tracing
    torch.onnx.export(model, (dummy_input,), "gpt2.onnx",
                      input_names=["input_ids"], output_names=["logits"],
                      dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}},
                      opset_version=14)


  • Load and serve the model with ONNX Runtime:
    import onnxruntime as ort

    session = ort.InferenceSession("gpt2.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])  # CUDA EP needs onnxruntime-gpu; CPU is the fallback


  • Integrate this into an IIS-hosted Python API.


3.2 Running ONNX Inference via .NET API


Alternatively, if using ASP.NET with ONNX, a C# web API can run AI inference:


// requires the Microsoft.ML.OnnxRuntime and Microsoft.ML.OnnxRuntime.Tensors namespaces
var session = new InferenceSession("gpt2.onnx");
// GPT-2 expects int64 token IDs shaped [batch, sequence]
var inputTensor = new DenseTensor<long>(new long[64] /* token IDs from a tokenizer */, new int[] { 1, 64 });
var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor("input_ids", inputTensor) };
using var results = session.Run(inputs);

Expose this via an ASP.NET Core API inside IIS.




4. Performance Optimization



  1. Use DirectML for GPU Acceleration



    • If running on a Windows GPU, use ONNX Runtime with DirectML:
      ort.InferenceSession("gpt2.onnx", providers=["DmlExecutionProvider"])


    • This accelerates inference through ONNX Runtime's DirectML execution provider (install the onnxruntime-directml package); a .NET equivalent is sketched after this list.



  2. Enable HTTP/2 & Compression in IIS



    • Improves API response time for AI queries.

    • Set IIS to enable HTTP/2 and gzip compression for AI responses.



  3. Use WebSockets for Streaming



    • IIS supports WebSockets, allowing real-time AI text generation.
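
A .NET equivalent of the DirectML snippet in item 1, as a hedged sketch: it assumes the Microsoft.ML.OnnxRuntime.DirectML NuGet package is installed in place of the CPU-only package, and that the model path matches the one used elsewhere in this thread.

using Microsoft.ML.OnnxRuntime;

// Ask ONNX Runtime for the DirectML execution provider (GPU); CPU remains the fallback.
var options = new SessionOptions();
options.AppendExecutionProvider_DML(0);   // 0 = default DirectX adapter

using var session = new InferenceSession("C:/AIModels/gpt2.onnx", options);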






5. Deployment Steps



  1. Convert the AI model to ONNX and place it on the server.

  2. Choose Python (Flask/FastAPI) or .NET (ASP.NET Core) for serving the API.

  3. Configure IIS to run the AI Web Service.

  4. Set up ONNX Runtime, DirectML, or CUDA for LLM inference.

  5. Optimize performance using WebSockets, caching, and compression.




Conclusion


This framework allows Microsoft Server 2022 to host LLMs inside IIS using:



  • ONNX Runtime for inference.

  • DirectML or CUDA for acceleration.

  • IIS-based APIs for AI response handling.


This setup gives you self-hosted LLM serving on a Windows Server stack, and it can be scaled out behind IIS as demand grows. 🚀



drdread
  • 100% (Exalted)
  • Advanced Member Topic Starter
a day ago
so how do I deploy the AI after this?
sirclesadmin
a day ago

Deploying the AI Model After Running the Setup Script


Now that your Windows Server 2022 is ready with IIS, ONNX Runtime, and Python/.NET, follow these steps to deploy your Large Language Model (LLM) API.




🚀 Deployment Steps for LLM on IIS (Using Python/ONNX)


1. Choose Your AI API Framework


You can deploy the LLM as a web API using:



  • Python (Flask or FastAPI)

  • .NET Core (ASP.NET Web API)

  • Node.js (Express, but less common for AI inference on Windows Server)


For this example, we will use FastAPI (Python) because it's lightweight, fast, and integrates well with ONNX Runtime.




2. Upload Your AI Model to the Server


If you converted your model to ONNX, upload it to C:\AIModels\. You can use:



  • SFTP/FTP to transfer files

  • Remote Desktop (RDP) copy-paste

  • PowerShell command to download the model:
    Invoke-WebRequest -Uri "https://your-model-source.com/gpt2.onnx" -OutFile "C:\AIModels\gpt2.onnx"





3. Install Python Dependencies (If Not Already Installed)


Log in to your server, open PowerShell, and run:


pip install onnxruntime fastapi uvicorn transformers

This ensures all necessary packages for inference and API deployment are installed.




4. Create the AI API Using FastAPI


On the server, create a new folder for the AI service:


mkdir C:\AIService
cd C:\AIService

Create a Python script for the API:


notepad ai_server.py

Paste the following Python code to load the LLM and serve API requests:


from fastapi import FastAPI
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
import uvicorn

# Load the ONNX model (exported earlier) and the matching tokenizer
model_path = "C:/AIModels/gpt2.onnx"
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained("gpt2")

app = FastAPI()

@app.post("/generate")
async def generate_text(prompt: str, max_new_tokens: int = 32):
    # Tokenize the prompt; the exported graph expects int64 token IDs
    input_ids = tokenizer(prompt, return_tensors="np")["input_ids"].astype("int64")

    # Simple greedy decoding: the ONNX graph returns logits, so pick the
    # highest-scoring token each step and append it to the sequence
    for _ in range(max_new_tokens):
        logits = session.run(None, {"input_ids": input_ids})[0]
        next_token = int(np.argmax(logits[0, -1]))
        input_ids = np.concatenate([input_ids, [[next_token]]], axis=1)
        if next_token == tokenizer.eos_token_id:
            break

    generated_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    return {"response": generated_text}

if __name__ == "__main__":
    # Allows "python ai_server.py" (used by the NSSM service below) as well as
    # "uvicorn ai_server:app --host 0.0.0.0 --port 8000"
    uvicorn.run(app, host="0.0.0.0", port=8000)

Save the file and close Notepad.




5. Test the AI API Locally


Run the API manually first:


python ai_server.py

Then, in a web browser, open:


http://localhost:8000/docs

You should see the FastAPI Swagger UI to test the /generate endpoint.
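
If you would rather test from code than the Swagger UI, a small client call works too. A hedged C# sketch (the URL, port, and prompt are example values; note that the API above takes the prompt as a query parameter, not a JSON body):

using System.Net.Http;

// Quick smoke test of the FastAPI /generate endpoint from a .NET console app.
using var client = new HttpClient();
var url = "http://localhost:8000/generate?prompt=" + Uri.EscapeDataString("Hello, world");

var response = await client.PostAsync(url, content: null);   // no body required
Console.WriteLine(await response.Content.ReadAsStringAsync());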




6. Deploy the AI API as a Windows Service (Persistent)


Instead of running manually, let's set it up as a Windows Service so it runs automatically.



  1. Install NSSM (Non-Sucking Service Manager) to create a persistent service (this uses Chocolatey; you can also download NSSM manually and add it to PATH):
    choco install nssm -y


  2. Create a service that runs the AI API:
    nssm install AIService "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\python.exe" "C:\AIService\ai_server.py"


  3. Start the AI API service:
    nssm start AIService



Now, the AI service will automatically restart if the server reboots.




7. Configure IIS as a Reverse Proxy for Public Access


Now, let's make the AI API publicly accessible via IIS.



  1. Open IIS Manager (inetmgr in PowerShell).

  2. Select your server and click Application Request Routing (ARR and the URL Rewrite module are separate IIS extensions; install them first if the icon is missing).

  3. Enable Reverse Proxy.

  4. Click Server Farms > Create Server Farm > Add:

    • localhost:8000 as the target backend (where FastAPI is running).

    • Enable load balancing if needed.



  5. Create a new IIS website pointing to the proxy.


Now, you can access the AI API using:


https://your-server-ip-or-domain/generate



✅ Final Deployment Checklist


✔ AI Model Deployed in ONNX Format
✔ AI API Running via FastAPI on Port 8000
✔ AI Service Running as a Windows Service (Persistent)
✔ IIS Configured as a Reverse Proxy for Public Access


Now your AI is live and accessible from the web! 🚀



drdread
  • 100% (Exalted)
  • Advanced Member Topic Starter
a day ago
how would I do it with .net core to keep everything MS compatible?
sirclesadmin
a day ago

Deploying LLM with .NET Core on Windows Server 2022 (IIS & ONNX)


To keep everything Microsoft-compatible, we'll use .NET 6/7+ (ASP.NET Core), ONNX Runtime, and IIS for hosting the AI API.


This method ensures:
✅ Full Microsoft stack compatibility
✅ ONNX Runtime acceleration for inference
✅ IIS integration for a secure and scalable deployment




🚀 Steps to Deploy LLM in .NET Core on Windows Server 2022


1. Install Required Software


Ensure your Windows Server 2022 is ready:


# Install IIS and dependencies
Install-WindowsFeature -Name Web-Server, Web-Asp-Net45, Web-WebSockets -IncludeManagementTools

# Install .NET 7 Hosting Bundle (modify for .NET 6 if needed)
Invoke-WebRequest -Uri "https://download.visualstudio.microsoft.com/download/pr/7f8b8947-7b23-47d1-bbe1-88e9dc191ba2/1e3f5e4e8ff87edcf00584cc1edcb3f6/dotnet-hosting-7.0.10-win.exe" -OutFile "dotnet-hosting.exe"
Start-Process -FilePath "dotnet-hosting.exe" -ArgumentList "/quiet /norestart" -Wait
Remove-Item "dotnet-hosting.exe"

# Note: the Hosting Bundle installs the ASP.NET Core runtime only. To run "dotnet new"
# and "dotnet publish" on the server (step 3) you also need the .NET SDK installed,
# or you can build on a development machine and copy the published output across.
# ONNX Runtime is added as a NuGet package inside the project in step 3.



2. Prepare AI Model


If you haven't already converted your LLM to ONNX format, do this:


from transformers import AutoModelForCausalLM
import torch

# Convert the model to ONNX format (simplified trace export)
model = AutoModelForCausalLM.from_pretrained("gpt2")
dummy_input = torch.randint(0, 50256, (1, 64))
torch.onnx.export(model, (dummy_input,), "gpt2.onnx",
                  input_names=["input_ids"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}},
                  opset_version=14)

Upload gpt2.onnx to C:\AIModels\.
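
Before wiring the model into an API, it is worth confirming what the exported graph actually expects, since input names, element types, and shapes must match the tensors you feed it in step 4. A small sketch using ONNX Runtime's metadata properties (the path assumes the upload location above):

using Microsoft.ML.OnnxRuntime;

// Print the input/output names, element types, and shapes of the exported model.
using var session = new InferenceSession(@"C:\AIModels\gpt2.onnx");

foreach (var (name, meta) in session.InputMetadata)
    Console.WriteLine($"input  {name}: {meta.ElementType} [{string.Join(",", meta.Dimensions)}]");

foreach (var (name, meta) in session.OutputMetadata)
    Console.WriteLine($"output {name}: {meta.ElementType} [{string.Join(",", meta.Dimensions)}]");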




3. Create .NET Core AI Web API



  1. Open PowerShell and create a new .NET Web API project:


    mkdir C:\AIService
    cd C:\AIService
    dotnet new webapi -n AIService
    cd AIService


  2. Add Dependencies:


    dotnet add package Microsoft.ML.OnnxRuntime
    dotnet add package Microsoft.AspNetCore.Cors


  3. Modify Program.cs to register controllers, Swagger (used for testing in step 5), and CORS:


    var builder = WebApplication.CreateBuilder(args);

    builder.Services.AddControllers();
    builder.Services.AddEndpointsApiExplorer();
    builder.Services.AddSwaggerGen();
    builder.Services.AddCors(options =>
    {
        options.AddPolicy("AllowAll", policy =>
        {
            policy.AllowAnyOrigin().AllowAnyMethod().AllowAnyHeader();
        });
    });

    var app = builder.Build();

    app.UseSwagger();
    app.UseSwaggerUI();
    app.UseCors("AllowAll");
    app.MapControllers();
    app.Run();





4. Implement ONNX Model Inference


Replace Controllers/AIController.cs with this code:


using Microsoft.AspNetCore.Mvc;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

[Route("api/ai")]
[ApiController]
public class AIController : ControllerBase
{
    // Loaded once and reused; InferenceSession.Run is safe to call from concurrent requests
    private static readonly InferenceSession _session = new InferenceSession("C:/AIModels/gpt2.onnx");

    [HttpPost("generate")]
    public IActionResult GenerateText([FromBody] InputData input)
    {
        // Tokenization of input.Prompt would be required here; the fixed IDs below
        // are placeholders kept for demonstration only.
        var inputIds = new long[] { 1, 2, 3, 4, 5 };   // GPT-2 ONNX graphs expect int64 token IDs
        var container = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input_ids",
                new DenseTensor<long>(inputIds, new int[] { 1, inputIds.Length }))
        };

        using var results = _session.Run(container);
        // The first output is the logits tensor, shaped [batch, sequence, vocab]
        var logits = results.First().AsEnumerable<float>().ToArray();

        return Ok(new { response = logits });
    }
}

public class InputData
{
    public string Prompt { get; set; } = string.Empty;
}
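
The controller above returns raw logits because tokenization is out of scope for the sketch. For reference, turning logits into a next-token prediction is an argmax over the vocabulary dimension at the last sequence position. A hedged helper sketch, assuming the exported graph outputs logits shaped [batch, sequence, vocab] (obtained via results.First().AsTensor<float>()):

using Microsoft.ML.OnnxRuntime.Tensors;

// Greedy next-token selection from GPT-2 logits shaped [batch, sequence, vocab].
static long PickNextToken(Tensor<float> logits)
{
    int lastPos = logits.Dimensions[1] - 1;
    int vocab = logits.Dimensions[2];

    long nextToken = 0;
    float best = float.NegativeInfinity;
    for (int v = 0; v < vocab; v++)
    {
        float score = logits[0, lastPos, v];
        if (score > best) { best = score; nextToken = v; }
    }
    return nextToken;   // append to input_ids and re-run the session to keep generating
}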



5. Test the API Locally


Run the API:


dotnet run --urls http://localhost:5000

Visit:


http://localhost:5000/swagger/index.html

and test the /api/ai/generate endpoint.
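
You can also exercise the endpoint from code rather than Swagger. A hedged sketch posting the JSON body the InputData model expects (URL, port, and prompt are example values matching the configuration above):

using System.Net.Http.Json;

// POST a JSON body matching the InputData model: { "prompt": "..." }
using var client = new HttpClient();
var response = await client.PostAsJsonAsync(
    "http://localhost:5000/api/ai/generate",
    new { prompt = "Hello from the test client" });

Console.WriteLine(await response.Content.ReadAsStringAsync());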




6. Publish & Deploy to IIS



  1. Publish the API:


    dotnet publish -c Release -o C:\AIService\publish


  2. Configure IIS:



    • Open IIS Manager (inetmgr)

    • Create a New Site:

      • Path: C:\AIService\publish

      • Port: 5000 (or 80 for public access)



    • Set the Application Pool's .NET CLR version to "No Managed Code" (ASP.NET Core apps run via the ASP.NET Core Module, not the .NET Framework CLR)



  3. Enable WebSockets & Compression in IIS:


    Set-WebConfigurationProperty -filter "system.webServer/webSocket" -name "enabled" -value "True" -PSPath "IIS:\"
    Set-WebConfigurationProperty -filter "system.webServer/httpCompression" -name "dynamicCompressionLevel" -value "9" -PSPath "IIS:\"


  4. Restart IIS:


    iisreset



Now your AI API is live at:


http://your-server-ip/api/ai/generate



✅ Final Deployment Summary


✔ Windows Server 2022 AI-ready
✔ ONNX LLM Model Deployed
✔ .NET Core API Running on IIS
✔ IIS Optimized with WebSockets & Compression
✔ Public AI API Ready to Serve LLM Responses! 🚀


Now you have a Microsoft-native LLM API running on IIS with .NET Core! 🎉