Deploying the AI Model After Running the Setup Script
Now that your Windows Server 2022 is ready with IIS, ONNX Runtime, and Python/.NET, follow these steps to deploy your Large Language Model (LLM) API.
Deployment Steps for the LLM on IIS (Using Python/ONNX)
1. Choose Your AI API Framework
You can deploy the LLM as a web API using:
- Python (Flask or FastAPI)
- .NET Core (ASP.NET Web API)
- Node.js (Express, but less common for AI inference on Windows Server)
For this example, we will use FastAPI (Python) because it's lightweight, fast, and integrates well with ONNX Runtime.
2. Upload Your AI Model to the Server
If you have converted your model to ONNX, upload it to C:\AIModels\ on the server, for example by copying it over your RDP session or through a shared folder.
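If you still need to convert the model, the sketch below shows one way to export GPT-2 to ONNX on your local machine before uploading it. This is only an illustration, not part of the original setup script: it assumes torch and transformers are installed locally, and the wrapper class, opset version, and output file name are all choices you can adjust.

```python
# Rough sketch: export GPT-2 to ONNX locally, then copy gpt2.onnx to C:\AIModels\ on the server.
# Assumes torch and transformers are installed on the machine doing the export.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class GPT2LogitsWrapper(torch.nn.Module):
    """Wraps the HF model so the exported graph takes input_ids and returns logits only."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids):
        # use_cache=False keeps past key/values out of the exported graph
        return self.model(input_ids, use_cache=False, return_dict=False)[0]


tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

sample = tokenizer("Hello, world", return_tensors="pt")

torch.onnx.export(
    GPT2LogitsWrapper(model),
    (sample["input_ids"],),
    "gpt2.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=14,  # illustrative; use any recent opset your torch version supports
)
```

Note that a plain export like this returns logits rather than generated token IDs, which is why the API code in step 4 includes a comment about where a decoding loop would go.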
3. Install Python Dependencies (If Not Already Installed)
Log in to your server, open PowerShell, and run:
pip install onnxruntime fastapi uvicorn transformers
This ensures all necessary packages for inference and API deployment are installed.
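As a quick, optional sanity check (assuming the model from step 2 is already at C:\AIModels\gpt2.onnx), you can confirm that ONNX Runtime loads the file and inspect the model's input and output names, since the API code in the next step refers to them:

```python
# Verify ONNX Runtime is installed and the uploaded model loads on CPU.
import onnxruntime as ort

print("ONNX Runtime version:", ort.__version__)
print("Available providers:", ort.get_available_providers())

session = ort.InferenceSession("C:/AIModels/gpt2.onnx",
                               providers=["CPUExecutionProvider"])
print("Model inputs:", [i.name for i in session.get_inputs()])
print("Model outputs:", [o.name for o in session.get_outputs()])
```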
4. Create the AI API Using FastAPI
On the server, create a new folder for the AI service:
mkdir C:\AIService
cd C:\AIService
Create a Python script for the API:
notepad ai_server.py
Paste the following Python code to load the LLM and serve API requests:
from fastapi import FastAPI
import onnxruntime as ort
from transformers import AutoTokenizer
import torch
import uvicorn

# Load the ONNX model (uploaded in step 2)
model_path = "C:/AIModels/gpt2.onnx"
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

app = FastAPI()


@app.post("/generate")
async def generate_text(prompt: str):
    # Tokenize the prompt and convert to a NumPy array for ONNX Runtime
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].numpy()

    # Run inference with ONNX Runtime
    outputs = session.run(None, {"input_ids": input_ids})

    # NOTE: this assumes the model's first output contains generated token IDs
    # (e.g. a generation-enabled export). A plain GPT-2 export returns logits
    # instead, in which case you would add a decoding loop here.
    generated_ids = torch.tensor(outputs[0])
    generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return {"response": generated_text}


# Allows "python ai_server.py" (used in steps 5 and 6); you can also run:
# uvicorn ai_server:app --host 0.0.0.0 --port 8000
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
Save the file and close Notepad.
5. Test the AI API Locally
Run the API manually first:
python ai_server.py
Then, in a web browser, open:
http://localhost:8000/docs
You should see the FastAPI Swagger UI, where you can test the /generate endpoint.
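You can also hit the endpoint from a small script instead of the Swagger UI. This sketch uses the requests package (pip install requests) and passes the prompt as a query parameter, which is how the FastAPI handler above declares it:

```python
# Minimal local test of the /generate endpoint.
import requests

resp = requests.post("http://localhost:8000/generate",
                     params={"prompt": "Hello, world"})
print(resp.status_code)
print(resp.json())
```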
6. Deploy the AI API as a Windows Service (Persistent)
Instead of running it manually, let's set it up as a Windows Service so it runs automatically.
- Install NSSM (Non-Sucking Service Manager) to create a persistent service:
choco install nssm -y
- Create a service that runs the AI API:
nssm install AIService "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\python.exe" "C:\AIService\ai_server.py"
- Start the AI API service:
nssm start AIService
Now the AI service will start automatically whenever the server reboots, and NSSM will restart it if the process exits unexpectedly.
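If you want to confirm the service really does come back after a reboot, a small standard-library script like the one below can poll the API until it responds (the URL, retry count, and delay are illustrative):

```python
# Smoke test: wait for the AIService-hosted API to come back up after a reboot.
import time
import urllib.request

url = "http://localhost:8000/docs"
for attempt in range(10):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print("AIService is up, HTTP", resp.status)
            break
    except OSError as exc:
        print(f"Attempt {attempt + 1}: not ready yet ({exc})")
        time.sleep(5)
else:
    print("AIService did not respond after 10 attempts")
```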
7. Configure IIS as a Reverse Proxy for Public Access
Now, let's make the AI API publicly accessible via IIS.
- Open IIS Manager (run inetmgr from PowerShell).
- Select the server node, open Application Request Routing Cache, and enable the proxy under Server Proxy Settings.
- Click Server Farms > Create Server Farm and add localhost:8000 as the target backend (where FastAPI is running).
- Enable load balancing if needed.
- Create a new IIS website (or URL Rewrite rule) that forwards incoming requests to the server farm.
Now you can access the AI API using:
https://your-server-ip-or-domain/generate
(Use https once an SSL certificate is bound to the site in IIS; until then, test over plain http.)
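To verify the whole chain (IIS proxy in front of the FastAPI service), you can run the same kind of request from any machine that can reach the server. The hostname below is a placeholder, and the certificate note only applies if you are testing with a self-signed certificate:

```python
# End-to-end check through the IIS reverse proxy.
import requests

resp = requests.post("https://your-server-ip-or-domain/generate",
                     params={"prompt": "Hello from the proxy"})
# With a self-signed test certificate you may need verify=False (not for production).
print(resp.status_code, resp.json())
```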
Final Deployment Checklist
- AI Model Deployed in ONNX Format
- AI API Running via FastAPI on Port 8000
- AI Service Running as a Windows Service (Persistent)
- IIS Configured as a Reverse Proxy for Public Access
Now your AI is live and accessible from the web!