Articles by cby - 小陈运维

chenby

211 articles written in total
124 comments received in total

Found 211 results for cby

2022-11-23
Installing the CI/CD platform devtron on k8s

Prerequisites

Dynamic storage provisioning must already be in place. Deploy it by following my earlier article "kubernetes(k8s) 存储动态挂载" (dynamic storage mounting):
https://www.oiox.cn/index.php/archives/32/

Install the helm tool

root@cby:~# curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
root@cby:~# chmod 700 get_helm.sh
root@cby:~# ./get_helm.sh
Downloading https://get.helm.sh/helm-v3.10.2-linux-amd64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm
root@cby:~#

Install with helm

root@cby:~# helm repo add devtron https://helm.devtron.ai
"devtron" has been added to your repositories
root@cby:~#
root@cby:~# helm install devtron devtron/devtron-operator --create-namespace --namespace devtroncd --set installer.modules={cicd}
NAME: devtron
LAST DEPLOYED: Fri Nov 18 05:22:13 2022
NAMESPACE: devtroncd
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Run the following command to get the password for the default admin user:
   kubectl -n devtroncd get secret devtron-secret -o jsonpath='{.data.ADMIN_PASSWORD}' | base64 -d
2. Run the following command to get the dashboard URL for the service type: LoadBalancer
   kubectl get svc -n devtroncd devtron-service -o jsonpath='{.status.loadBalancer.ingress}'
3. To track the progress of Devtron microservices installation, run the following command:
   kubectl -n devtroncd get installers installer-devtron -o jsonpath='{.status.sync.status}'
root@cby:~#

Verify

root@cby:~# kubectl get pod -n devtroncd
NAME                                     READY   STATUS      RESTARTS        AGE
app-sync-cronjob-27815700-lz565          0/1     Completed   0               2d5h
app-sync-cronjob-27817140-6wsj6          0/1     Completed   0               29h
app-sync-cronjob-27818580-kzjdb          0/1     Completed   0               5h33m
argo-rollouts-68dc6f5b75-949x9           1/1     Running     2 (152m ago)    4d10h
argocd-application-controller-0          1/1     Running     2 (152m ago)    4d9h
argocd-dex-server-54c8d7cbdf-nfjj2       1/1     Running     2 (153m ago)    4d10h
argocd-redis-7967b6b9f7-6c69j            1/1     Running     2 (152m ago)    4d9h
argocd-repo-server-6f9d65d87f-9p9p8      1/1     Running     2 (152m ago)    4d9h
argocd-server-7cf98cdffb-4qxgm           1/1     Running     2 (152m ago)    4d9h
clair-8cd58cdd9-nhglm                    1/1     Running     46 (152m ago)   4d9h
dashboard-777c9bb5f9-zz4b5               1/1     Running     2 (152m ago)    4d10h
devtron-d74cf8958-2x7sb                  1/1     Running     4 (151m ago)    4d8h
devtron-grafana-6657cbc8f9-9j7fp         2/2     Running     2 (153m ago)    4d8h
devtron-grafana-test                     0/1     Completed   6               4d8h
devtron-housekeeping-qp59k               0/1     Completed   0               4d10h
devtron-nats-0                           3/3     Running     6 (152m ago)    4d10h
devtron-nats-test-request-reply          0/1     Completed   0               4d10h
git-sensor-0                             1/1     Running     6 (152m ago)    4d10h
grafana-org-job-jgzjp                    0/1     Completed   0               4d8h
image-scanner-8679b48b66-t7bd2           1/1     Running     8 (151m ago)    4d9h
inception-846694f944-5hjtq               1/1     Running     2 (152m ago)    4d10h
kubelink-67985f58d5-xmds2                1/1     Running     2 (152m ago)    4d10h
kubewatch-655f8669dd-xrx5q               1/1     Running     8 (152m ago)    4d10h
lens-6c86975478-vwpq2                    1/1     Running     9 (151m ago)    4d10h
notifier-5b4b48b677-dkcls                1/1     Running     1 (152m ago)    4d8h
postgresql-migrate-casbin-2lz42          0/1     Completed   0               4d10h
postgresql-migrate-casbin-bnzdb-954p6    0/1     Completed   0               4d8h
postgresql-migrate-devtron-t2w25         0/1     Completed   0               4d10h
postgresql-migrate-devtron-vlym3-jnvmf   0/1     Completed   0               4d8h
postgresql-migrate-gitsensor-sxpcr       0/1     Completed   0               4d10h
postgresql-migrate-lens-tmvt5            0/1     Completed   0               4d10h
postgresql-postgresql-0                  2/2     Running     4 (152m ago)    4d10h
root@cby:~#
root@cby:~# kubectl get svc -n devtroncd
NAME                             TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                 AGE
argo-rollouts-metrics            ClusterIP      10.98.113.34     <none>        8090/TCP                                                4d10h
argocd-application-controller    ClusterIP      10.107.155.128   <none>        8082/TCP                                                4d9h
argocd-dex-server                ClusterIP      10.97.14.200     <none>        5556/TCP,5557/TCP,5558/TCP                              4d10h
argocd-redis                     ClusterIP      10.102.166.243   <none>        6379/TCP                                                4d9h
argocd-repo-server               ClusterIP      10.111.245.9     <none>        8081/TCP                                                4d9h
argocd-server                    ClusterIP      10.106.6.25      <none>        80/TCP,443/TCP                                          4d9h
clair                            ClusterIP      10.109.97.107    <none>        6060/TCP,6061/TCP                                       4d9h
dashboard-service                ClusterIP      10.110.239.18    <none>        80/TCP                                                  4d10h
devtron-grafana                  ClusterIP      10.111.200.165   <none>        80/TCP                                                  4d8h
devtron-nats                     ClusterIP      None             <none>        4222/TCP,6222/TCP,8222/TCP,7777/TCP,7422/TCP,7522/TCP   4d10h
devtron-service                  LoadBalancer   10.100.28.2      <pending>     80:32489/TCP                                            4d10h
git-sensor-service               ClusterIP      10.99.53.176     <none>        80/TCP                                                  4d10h
image-scanner-service            ClusterIP      10.103.97.46     <none>        80/TCP                                                  4d9h
kubelink-service                 ClusterIP      10.97.172.63     <none>        50051/TCP                                               4d10h
lens-service                     ClusterIP      10.100.239.205   <none>        80/TCP                                                  4d10h
notifier-service                 ClusterIP      10.102.67.212    <none>        80/TCP                                                  4d8h
postgresql-postgresql            ClusterIP      10.104.194.12    <none>        5432/TCP                                                4d10h
postgresql-postgresql-headless   ClusterIP      None             <none>        5432/TCP                                                4d10h
postgresql-postgresql-metrics    ClusterIP      10.103.17.122    <none>        9187/TCP                                                4d10h
root@cby:~#

Access test

# Log in with the username admin and the password printed by the command below.
root@cby:~# kubectl -n devtroncd get secret devtron-secret -o jsonpath='{.data.ADMIN_PASSWORD}' | base64 -d
Qn7GuI26j4HcuVW2

# devtron-service is of type LoadBalancer and its EXTERNAL-IP stays <pending> in this cluster,
# so the dashboard is reached through NodePort 32489 on a node IP.
# URL: http://192.168.8.61:32489/
# Username: admin
# Password: Qn7GuI26j4HcuVW2

About

https://www.oiox.cn/
https://www.oiox.cn/index.php/start-page.html

CSDN, GitHub, 51CTO, Zhihu, OSCHINA, SegmentFault, Juejin, Jianshu, Huawei Cloud, Alibaba Cloud, Tencent Cloud, Bilibili, Toutiao, Sina Weibo, personal blog
Search for "小陈运维" on any of these platforms.
Articles are published mainly on the WeChat official account.
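In the devtron walkthrough above, devtron-service is a LoadBalancer whose EXTERNAL-IP stays pending, and the dashboard is opened through the NodePort listed in the PORT(S) column (80:32489/TCP). The sketch below shows one way to turn that column value into the access URL. The function names node_port_from_ports and devtron_url are hypothetical helpers, not part of kubectl or devtron, and the node IP is simply the one used in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of kubectl or devtron).

# Print the NodePort from a PORT(S) value such as "80:32489/TCP".
# A value without a colon (e.g. "80/TCP") has no NodePort; in that
# case the service port is printed unchanged.
node_port_from_ports() {
  local ports="$1"
  ports="${ports#*:}"           # strip the service port and the colon
  printf '%s\n' "${ports%%/*}"  # strip the "/TCP" protocol suffix
}

# Build the dashboard URL from a node IP and a PORT(S) value.
devtron_url() {
  printf 'http://%s:%s/\n' "$1" "$(node_port_from_ports "$2")"
}

# Values taken from the kubectl output in this article.
devtron_url 192.168.8.61 "80:32489/TCP"
```

On a live cluster the same port number can be read directly with kubectl get svc -n devtroncd devtron-service -o jsonpath='{.spec.ports[0].nodePort}', which avoids parsing the table at all.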
- November 23, 2022
- 840 views
- 2 comments
- 1 like
2022-11-23
Installing Harbor on k8s

Prerequisites

Dynamic storage provisioning and an ingress controller must already be in place. Deploy them by following my earlier articles "kubernetes(k8s) 存储动态挂载" (dynamic storage mounting) and "在k8s（kubernetes）上安装 ingress V1.1.3" (installing ingress V1.1.3 on k8s):
https://www.oiox.cn/index.php/archives/32/
https://www.oiox.cn/index.php/archives/142/

The script I use to bulk-copy images from Docker Hub to Alibaba Cloud

#!/bin/bash
export docker_images="goharbor/harbor-db:v2.6.2 goharbor/harbor-jobservice:v2.6.2 goharbor/harbor-portal:v2.6.2 goharbor/harbor-registryctl:v2.6.2 goharbor/notary-server-photon:v2.6.2 goharbor/notary-signer-photon:v2.6.2 goharbor/redis-photon:v2.6.2 goharbor/registry-photon:v2.6.2 goharbor/trivy-adapter-photon:v2.6.2"
export aliyun_image="registry.cn-hangzhou.aliyuncs.com/chenby/"
for images in $docker_images; do
    # Keep only the last path component, e.g. harbor-db:v2.6.2
    export end_image=`echo "$images" | awk -F "/" '{print $NF}'`
    docker pull "$images"
    docker tag "$images" "$aliyun_image""$end_image"
    docker push "$aliyun_image""$end_image"
    # Remove the local copies once the push is done
    docker rmi "$images"
    docker rmi "$aliyun_image""$end_image"
done

Install the helm tool

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

Add the official Harbor Helm chart repository

root@cby:~# helm repo add harbor https://helm.goharbor.io
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
"harbor" has been added to your repositories

List the repositories

root@cby:~# helm repo list
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
NAME      URL
devtron   https://helm.devtron.ai
harbor    https://helm.goharbor.io
root@cby:~#

List the latest chart versions

root@cby:~# helm search repo harbor -l | grep harbor/harbor | head -4
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
harbor/harbor   1.10.2   2.6.2   An open source trusted cloud native registry th...
harbor/harbor   1.10.1   2.6.1   An open source trusted cloud native registry th...
harbor/harbor   1.10.0   2.6.0   An open source trusted cloud native registry th...
harbor/harbor   1.9.4    2.5.4   An open source trusted cloud native registry th...
root@cby:~#

Download the chart locally

root@cby:~# helm pull harbor/harbor --version 1.10.2
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
root@cby:~#
root@cby:~# ls harbor-1.10.2.tgz
harbor-1.10.2.tgz
root@cby:~#
root@cby:~# tar zxvf harbor-1.10.2.tgz
root@cby:~# cd harbor/
root@cby:~/harbor# ll
total 276
drwxr-xr-x  5 root root   4096 Nov 22 10:35 ./
drwx------ 12 root root   4096 Nov 22 10:35 ../
drwxr-xr-x  2 root root   4096 Nov 22 10:35 cert/
-rw-r--r--  1 root root    567 Nov 10 09:08 Chart.yaml
drwxr-xr-x  2 root root   4096 Nov 22 10:35 conf/
-rw-r--r--  1 root root     57 Nov 10 09:08 .helmignore
-rw-r--r--  1 root root  11357 Nov 10 09:08 LICENSE
-rw-r--r--  1 root root 202142 Nov 10 09:08 README.md
drwxr-xr-x 16 root root   4096 Nov 22 10:35 templates/
-rw-r--r--  1 root root  33779 Nov 10 09:08 values.yaml
root@cby:~/harbor#

Edit values.yaml

# Replace the default domain with mine
root@cby:~/harbor# sed -i "s#harbor.domain#oiox.cn#g" values.yaml

# Point the images at my Alibaba Cloud registry
root@cby:~/harbor# sed -i "s#repository: goharbor#repository: registry.cn-hangzhou.aliyuncs.com/chenby#g" values.yaml

# Edit the externalURL field
# Note: 30785 is my ingress port; yours will most likely be different
root@cby:~/harbor# vim values.yaml
externalURL: https://core.oiox.cn:30785

# Do a dry run to check whether the rendered configuration matches your environment or still needs changes
root@cby:~/harbor# helm install harbor ./ --dry-run | grep oiox.cn
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
  EXT_ENDPOINT: "https://core.oiox.cn:30785"
    - core.oiox.cn
    host: core.oiox.cn
    - notary.oiox.cn
    host: notary.oiox.cn
Then you should be able to visit the Harbor portal at https://core.oiox.cn:30785
root@cby:~/harbor#

Install

# Create the namespace
root@cby:~/harbor# kubectl create namespace harbor
namespace/harbor created
root@cby:~/harbor#

# Run the installation
root@cby:~/harbor# helm install harbor . -n harbor
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
NAME: harbor
LAST DEPLOYED: Tue Nov 22 10:56:50 2022
NAMESPACE: harbor
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Please wait for several minutes for Harbor deployment to complete.
Then you should be able to visit the Harbor portal at https://core.oiox.cn
For more details, please visit https://github.com/goharbor/harbor
root@cby:~/harbor#

Edit the ingress configuration

root@cby:~# kubectl edit ingress -n harbor harbor-ingress
root@cby:~# kubectl edit ingress -n harbor harbor-ingress-notary

# In both objects, add the field ingressClassName: nginx under spec
spec:
  ingressClassName: nginx
  rules:
  - host: core.oiox.cn
    http:

# Check the result
root@cby:~# kubectl get ingress -n harbor harbor-ingress -o yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/proxy-body-size: "0"
    ingress.kubernetes.io/ssl-redirect: "true"
    meta.helm.sh/release-name: harbor
    meta.helm.sh/release-namespace: harbor
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
  creationTimestamp: "2022-11-22T15:21:35Z"
  generation: 3
  labels:
    app: harbor
    app.kubernetes.io/managed-by: Helm
    chart: harbor
    heritage: Helm
    release: harbor
  name: harbor-ingress
  namespace: harbor
  resourceVersion: "2070090"
  uid: def0b549-3a00-49a4-8ece-b5ce18205427
spec:
  ingressClassName: nginx
  rules:
  - host: core.oiox.cn
    http:
      paths:
      - backend:
          service:
            name: harbor-core
            port:
              number: 80
        path: /api/
        pathType: Prefix
      - backend:
          service:
            name: harbor-core
            port:
              number: 80
        path: /service/
        pathType: Prefix
      - backend:
          service:
            name: harbor-core
            port:
              number: 80
        path: /v2/
        pathType: Prefix
      - backend:
          service:
            name: harbor-core
            port:
              number: 80
        path: /chartrepo/
        pathType: Prefix
      - backend:
          service:
            name: harbor-core
            port:
              number: 80
        path: /c/
        pathType: Prefix
      - backend:
          service:
            name: harbor-portal
            port:
              number: 80
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - core.oiox.cn
    secretName: harbor-ingress
status:
  loadBalancer:
    ingress:
    - ip: 192.168.8.65
root@cby:~#
root@cby:~# kubectl get ingress -n harbor
NAME                    CLASS   HOSTS            ADDRESS        PORTS     AGE
harbor-ingress          nginx   core.oiox.cn     192.168.8.65   80, 443   9m8s
harbor-ingress-notary   nginx   notary.oiox.cn   192.168.8.65   80, 443   9m8s
root@cby:~#

Access test

# Look up the administrator password
root@cby:~# kubectl get secret -n harbor harbor-core -o jsonpath='{.data.HARBOR_ADMIN_PASSWORD}'|base64 --decode
Harbor12345

# Add the host name to the local hosts file
root@cby:~# echo "192.168.8.65 core.oiox.cn" >> /etc/hosts

# Allow docker to talk to the registry without a trusted certificate
root@cby:~# sudo mkdir -p /etc/docker
root@cby:~# sudo tee /etc/docker/daemon.json <<-'EOF'
{
  "registry-mirrors": [
    "https://hub-mirror.c.163.com",
    "https://mirror.baidubce.com"
  ],
  "insecure-registries": [
    "hb.oiox.cn",
    "core.oiox.cn:30785"
  ],
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
root@cby:~# sudo systemctl daemon-reload
root@cby:~# sudo systemctl restart docker

root@cby:~# docker login -uadmin -pHarbor12345 core.oiox.cn:30785
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
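The image-mirroring script in the Harbor article above derives each target name by keeping only the last path component of the source image and prefixing it with the Alibaba Cloud registry path. The sketch below previews that mapping as a dry run: it only prints the docker commands and pulls or pushes nothing. The function names mirror_target and plan_mirror are hypothetical, and shell parameter expansion is used here in place of the article's awk call.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for previewing the mirroring loop.
# Nothing is pulled or pushed; the commands are only printed.

# Map a source image to its name under the target registry prefix.
mirror_target() {
  local prefix="$1" image="$2"
  printf '%s%s\n' "$prefix" "${image##*/}"  # keep only "name:tag"
}

# Print the docker commands the loop would run for each image.
plan_mirror() {
  local prefix="$1" image target
  shift
  for image in "$@"; do
    target="$(mirror_target "$prefix" "$image")"
    printf 'docker pull %s\n' "$image"
    printf 'docker tag %s %s\n' "$image" "$target"
    printf 'docker push %s\n' "$target"
  done
}

# Two of the images from the article, with the article's registry prefix.
plan_mirror "registry.cn-hangzhou.aliyuncs.com/chenby/" \
  goharbor/harbor-db:v2.6.2 goharbor/redis-photon:v2.6.2
```

Reviewing the printed plan before running the real loop makes it easier to spot a wrong prefix or a missing tag.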
- November 23, 2022
- 1,082 views
- 0 comments
- 0 likes
2022-11-16
Installing Harbor

Install docker

# Install the apt dependencies
apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common

# Add Docker's official GPG key
curl -fsSL https://mirrors.ustc.edu.cn/docker-ce/linux/ubuntu/gpg | sudo apt-key add -

# Set up the stable repository
add-apt-repository \
   "deb [arch=amd64] https://mirrors.ustc.edu.cn/docker-ce/linux/ubuntu/ \
   $(lsb_release -cs) \
   stable"

# Install the latest Docker Engine - Community and containerd
apt-get update
apt-get install docker-ce docker-ce-cli containerd.io

Install docker compose

root@cby:~# wget https://ghproxy.com/https://github.com/docker/compose/releases/download/v2.12.2/docker-compose-linux-x86_64
root@cby:~# mv docker-compose-linux-x86_64 /usr/local/bin/docker-compose
root@cby:~# chmod +x /usr/local/bin/docker-compose
root@cby:~# docker-compose --version
Docker Compose version v2.12.2
root@cby:~#

Download the harbor installer

# Download the Harbor offline installer
wget https://ghproxy.com/https://github.com/goharbor/harbor/releases/download/v2.6.2/harbor-offline-installer-v2.6.2.tgz

# Unpack it
root@cby:~# tar xvf harbor-offline-installer-v2.6.2.tgz -C /usr/local/
harbor/harbor.v2.6.2.tar.gz
harbor/prepare
harbor/LICENSE
harbor/install.sh
harbor/common.sh
harbor/harbor.yml.tmpl
root@cby:~# cd /usr/local/harbor/

Create the certificates

# Create a directory for the CA files
root@cby:/usr/local/harbor# mkdir ca
root@cby:/usr/local/harbor# cd ca/
root@cby:/usr/local/harbor/ca#

# Generate the CA private key
root@cby:/usr/local/harbor/ca# openssl genrsa -out ca.key 4096

# Generate the CA certificate
root@cby:/usr/local/harbor/ca# openssl req -x509 -new -nodes -sha512 -days 3650 \
    -subj "/C=CN/ST=Beijing/L=Beijing/O=example/OU=Personal/CN=hb.oiox.cn" \
    -key ca.key \
    -out ca.crt

# Server certificate: generate the private key
root@cby:/usr/local/harbor/ca# openssl genrsa -out hb.oiox.cn.key 4096

# Generate the certificate signing request (CSR)
root@cby:/usr/local/harbor/ca# openssl req -sha512 -new \
    -subj "/C=CN/ST=Beijing/L=Beijing/O=example/OU=Personal/CN=hb.oiox.cn" \
    -key hb.oiox.cn.key \
    -out hb.oiox.cn.csr

# Generate an x509 v3 extension file
root@cby:/usr/local/harbor/ca# cat > v3.ext <<-EOF
authorityKeyIdentifier=keyid,issuer
basicConstraints=CA:FALSE
keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names

[alt_names]
DNS.1=oiox.cn
DNS.2=hb.oiox.cn
DNS.3=www.oiox.cn
EOF

# Use v3.ext to sign the certificate for the Harbor host
root@cby:/usr/local/harbor/ca# openssl x509 -req -sha512 -days 3650 \
    -extfile v3.ext \
    -CA ca.crt -CAkey ca.key -CAcreateserial \
    -in hb.oiox.cn.csr \
    -out hb.oiox.cn.crt

Configure the docker certificates

# Convert the .crt to .cert for Docker. The Docker daemon interprets .crt files as CA certificates and .cert files as client certificates.
root@cby:/usr/local/harbor/ca# openssl x509 -inform PEM -in hb.oiox.cn.crt -out hb.oiox.cn.cert

# Copy the server certificate, the key and the CA file into Docker's certificate folder on the Harbor host. The folder has to be created first.
root@cby:/usr/local/harbor/ca# mkdir -p /etc/docker/certs.d/hb.oiox.cn/
root@cby:/usr/local/harbor/ca# cp hb.oiox.cn.cert /etc/docker/certs.d/hb.oiox.cn/
root@cby:/usr/local/harbor/ca# cp hb.oiox.cn.key /etc/docker/certs.d/hb.oiox.cn/
root@cby:/usr/local/harbor/ca# cp ca.crt /etc/docker/certs.d/hb.oiox.cn/

# If the default nginx port 443 is mapped to another port, create the folder
# /etc/docker/certs.d/yourdomain.com:port instead

# Restart Docker Engine
root@cby:/usr/local/harbor/ca# systemctl restart docker

Check the files

# List the certificate files in the directory
root@cby:/usr/local/harbor/ca# ll
total 36
drwxr-xr-x 2 root root 4096 Nov 16 06:23 ./
drwxr-xr-x 5 root root 4096 Nov 16 06:16 ../
-rw-r--r-- 1 root root 2041 Nov 16 06:20 ca.crt
-rw------- 1 root root 3272 Nov 16 06:16 ca.key
-rw-r--r-- 1 root root 2143 Nov 16 06:23 hb.oiox.cn.cert
-rw-r--r-- 1 root root 2143 Nov 16 06:22 hb.oiox.cn.crt
-rw-r--r-- 1 root root 1704 Nov 16 06:22 hb.oiox.cn.csr
-rw------- 1 root root 3268 Nov 16 06:22 hb.oiox.cn.key
-rw-r--r-- 1 root root  261 Nov 16 06:22 v3.ext
root@cby:/usr/local/harbor/ca#

Configure the harbor service

# Create and edit the harbor configuration file
root@cby:/usr/local/harbor# cp harbor.yml.tmpl harbor.yml
root@cby:/usr/local/harbor# vim harbor.yml
root@cby:/usr/local/harbor# cat harbor.yml | grep -v '^#' | grep -v '^$' | grep -v ' #'
hostname: hb.oiox.cn
http:
  port: 80
https:
  port: 443
  certificate: /usr/local/harbor/ca/hb.oiox.cn.crt
  private_key: /usr/local/harbor/ca/hb.oiox.cn.key
harbor_admin_password: Harbor12345
database:
  password: root123
  max_idle_conns: 100
  max_open_conns: 900
data_volume: /data
trivy:
  ignore_unfixed: false
  skip_update: false
  offline_scan: false
  security_check: vuln
  insecure: false
jobservice:
  max_job_workers: 10
notification:
  webhook_job_max_retry: 10
chart:
  absolute_url: disabled
log:
  level: info
  local:
    rotate_count: 50
    rotate_size: 200M
    location: /var/log/harbor
_version: 2.6.0
proxy:
  http_proxy:
  https_proxy:
  no_proxy:
  components:
    - core
    - jobservice
    - trivy
upload_purging:
  enabled: true
  age: 168h
  interval: 24h
  dryrun: false
cache:
  enabled: false
  expire_hours: 24
root@cby:/usr/local/harbor#

Install harbor

# Run the installer
root@cby:/usr/local/harbor# ./install.sh
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified

[Step 0]: checking if docker is installed ...

Note: docker version: 20.10.21

[Step 1]: checking docker-compose is installed ...

Note: docker-compose version: 2.12.2

[Step 2]: loading Harbor images ...
Loaded image: goharbor/harbor-jobservice:v2.6.2
Loaded image: goharbor/trivy-adapter-photon:v2.6.2
Loaded image: goharbor/chartmuseum-photon:v2.6.2
Loaded image: goharbor/redis-photon:v2.6.2
Loaded image: goharbor/nginx-photon:v2.6.2
Loaded image: goharbor/notary-signer-photon:v2.6.2
Loaded image: goharbor/harbor-core:v2.6.2
Loaded image: goharbor/harbor-db:v2.6.2
Loaded image: goharbor/harbor-registryctl:v2.6.2
Loaded image: goharbor/harbor-exporter:v2.6.2
Loaded image: goharbor/prepare:v2.6.2
Loaded image: goharbor/registry-photon:v2.6.2
Loaded image: goharbor/notary-server-photon:v2.6.2
Loaded image: goharbor/harbor-portal:v2.6.2
Loaded image: goharbor/harbor-log:v2.6.2

[Step 3]: preparing environment ...

[Step 4]: preparing harbor configs ...
prepare base dir is set to /usr/local/harbor
Clearing the configuration file: /config/core/app.conf
Clearing the configuration file: /config/core/env
Clearing the configuration file: /config/jobservice/env
Clearing the configuration file: /config/jobservice/config.yml
Clearing the configuration file: /config/nginx/nginx.conf
Clearing the configuration file: /config/registryctl/env
Clearing the configuration file: /config/registryctl/config.yml
Clearing the configuration file: /config/portal/nginx.conf
Clearing the configuration file: /config/db/env
Clearing the configuration file: /config/registry/passwd
Clearing the configuration file: /config/registry/config.yml
Clearing the configuration file: /config/log/logrotate.conf
Clearing the configuration file: /config/log/rsyslog_docker.conf
Generated configuration file: /config/portal/nginx.conf
Generated configuration file: /config/log/logrotate.conf
Generated configuration file: /config/log/rsyslog_docker.conf
Generated configuration file: /config/nginx/nginx.conf
Generated configuration file: /config/core/env
Generated configuration file: /config/core/app.conf
Generated configuration file: /config/registry/config.yml
Generated configuration file: /config/registryctl/env
Generated configuration file: /config/registryctl/config.yml
Generated configuration file: /config/db/env
Generated configuration file: /config/jobservice/env
Generated configuration file: /config/jobservice/config.yml
loaded secret from file: /data/secret/keys/secretkey
Generated configuration file: /compose_location/docker-compose.yml
Clean up the input dir

Note: stopping existing Harbor instance ...

[Step 5]: starting Harbor ...
[+] Running 10/10
 ⠿ Network harbor_harbor        Created   0.0s
 ⠿ Container harbor-log         Started   0.6s
 ⠿ Container harbor-portal      Started   0.8s
 ⠿ Container registryctl        Started   1.1s
 ⠿ Container redis              Started   0.9s
 ⠿ Container registry           Started   1.1s
 ⠿ Container harbor-db          Started   1.2s
 ⠿ Container harbor-core        Started   1.3s
 ⠿ Container nginx              Started   1.9s
 ⠿ Container harbor-jobservice  Started   2.0s
✔ ----Harbor has been installed and started successfully.----
root@cby:/usr/local/harbor#

Configure name resolution and docker

# Host name (FQDN) resolution
cat > /etc/hosts <<EOF
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.8.61 k8s-master01
192.168.8.62 k8s-master02
192.168.8.63 k8s-master03
192.168.8.64 k8s-node01
192.168.8.65 k8s-node02
192.168.8.66 lb-vip
192.168.8.3 hb.oiox.cn
EOF

# Example docker configuration
[root@k8s-master-1 ~]# cat > /etc/docker/daemon.json <<EOF
{
  "registry-mirrors": [
    "https://hub-mirror.c.163.com",
    "https://mirror.baidubce.com"
  ],
  "exec-opts": ["native.cgroupdriver=systemd"],
  "insecure-registries": ["hb.oiox.cn"]
}
EOF

# Restart docker
[root@k8s-master-1 ~]# systemctl restart docker && systemctl status docker -l

Try it out

# Log in
[root@k8s-master-1 ~]# docker login hb.oiox.cn
Username: admin
Password:
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
[root@k8s-master-1 ~]#

# Pull an image, retag it for Harbor, push it, and pull it back
[root@k8s-master-1 ~]# docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/dashboard:v2.7.0
[root@k8s-master-1 ~]# docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/dashboard:v2.7.0 hb.oiox.cn/library/dashboard:v2.7.0
[root@k8s-master-1 ~]# docker push hb.oiox.cn/library/dashboard:v2.7.0
[root@k8s-master-1 ~]# docker pull hb.oiox.cn/library/dashboard:v2.7.0
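The v3.ext file in the certificate step above lists each host name as a numbered DNS.N entry under [alt_names]. When the set of names changes, writing that file by hand is easy to get wrong, so the sketch below generates it from a list of names. The function name make_v3_ext is hypothetical; the fixed header lines are copied from the article's v3.ext.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an x509 v3 extension file for the given DNS names.
make_v3_ext() {
  local i=1 name
  # Fixed part, identical to the v3.ext used in this article.
  cat <<'EOF'
authorityKeyIdentifier=keyid,issuer
basicConstraints=CA:FALSE
keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names

[alt_names]
EOF
  # One numbered DNS entry per argument, starting at DNS.1.
  for name in "$@"; do
    printf 'DNS.%d=%s\n' "$i" "$name"
    i=$((i + 1))
  done
}

# Same names as in the article; redirect the output to v3.ext to use it.
make_v3_ext oiox.cn hb.oiox.cn www.oiox.cn
```

Redirecting the output to v3.ext and passing it to openssl x509 -req with -extfile v3.ext, as in the article, should give a certificate whose subjectAltName covers every listed name.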
- November 16, 2022
- 751 views
- 1 comment
- 0 likes
2022-11-12
Grafana Prometheus Altermanager Grafana Prometheus Altermanager 监控系统基本概念Prometheus 是一套开源的系统监控、报警、时间序列数据库的组合，最初有 SoundCloud 开发的，后来随着越来越多公司使用，于是便独立成开源项目。Alertmanager 主要用于接收 Prometheus 发送的告警信息，它支持丰富的告警通知渠道，例如邮件、微信、钉钉、Slack 等常用沟通工具，而且很容易做到告警信息进行去重，降噪，分组等，是一款很好用的告警通知系统。Prometheus架构如下：安装Grafana服务root@cby:~# sudo apt-get install -y adduser libfontconfig1 root@cby:~# wget https://dl.grafana.com/enterprise/release/grafana-enterprise_9.2.4_amd64.deb root@cby:~# sudo dpkg -i grafana-enterprise_9.2.4_amd64.deb root@cby:~# systemctl enable --now grafana-server.service Synchronizing state of grafana-server.service with SysV service script with /lib/systemd/systemd-sysv-install. Executing: /lib/systemd/systemd-sysv-install enable grafana-server Created symlink /etc/systemd/system/multi-user.target.wants/grafana-server.service → /lib/systemd/system/grafana-server.service. root@cby:~# 安装Prometheus服务root@cby:~# wget https://github.com/prometheus/prometheus/releases/download/v2.40.1/prometheus-2.40.1.linux-amd64.tar.gz root@cby:~# tar xvf prometheus-2.40.1.linux-amd64.tar.gz prometheus-2.40.1.linux-amd64/ prometheus-2.40.1.linux-amd64/NOTICE prometheus-2.40.1.linux-amd64/prometheus prometheus-2.40.1.linux-amd64/LICENSE prometheus-2.40.1.linux-amd64/console_libraries/ prometheus-2.40.1.linux-amd64/console_libraries/menu.lib prometheus-2.40.1.linux-amd64/console_libraries/prom.lib prometheus-2.40.1.linux-amd64/promtool prometheus-2.40.1.linux-amd64/prometheus.yml prometheus-2.40.1.linux-amd64/consoles/ prometheus-2.40.1.linux-amd64/consoles/prometheus-overview.html prometheus-2.40.1.linux-amd64/consoles/prometheus.html prometheus-2.40.1.linux-amd64/consoles/node-cpu.html prometheus-2.40.1.linux-amd64/consoles/node-overview.html prometheus-2.40.1.linux-amd64/consoles/node-disk.html prometheus-2.40.1.linux-amd64/consoles/index.html.example prometheus-2.40.1.linux-amd64/consoles/node.html root@cby:~# mv prometheus-2.40.1.linux-amd64 prometheus root@cby:~# 进行全局配置root@cby:~# vim 
prometheus/prometheus.yml root@cby:~# cat prometheus/prometheus.yml # Prometheus全局配置项 global: scrape_interval: 15s # 设定抓取数据的周期，默认为1min evaluation_interval: 15s # 设定更新rules文件的周期，默认为1min scrape_timeout: 15s # 设定抓取数据的超时时间，默认为10s external_labels: # 额外的属性，会添加到拉取得数据并存到数据库中 monitor: 'codelab_monitor' # Alertmanager配置 alerting: alertmanagers: - static_configs: - targets: ["127.0.0.1:9093"] # 设定alertmanager和prometheus交互的接口，即alertmanager监听的ip地址和端口 # rule配置，首次读取默认加载，之后根据evaluation_interval设定的周期加载 rule_files: - "dist/*.yml" # scape配置 scrape_configs: - job_name: 'prometheus' # job_name默认写入timeseries的labels中，可以用于查询使用 scrape_interval: 15s # 抓取周期，默认采用global配置 static_configs: # 静态配置 - targets: ['127.0.0.1:9090'] # prometheus所要抓取数据的地址，即instance实例项 - job_name: 'web' scrape_interval: 15s static_configs: - targets: ['10.0.0.10:9200'] - job_name: 'node-exporter' scrape_interval: 15s file_sd_configs: - files: - "static_conf/*.yaml" refresh_interval: 1s root@cby:~# 进行写入动态配置文件内容写需要监控的主机即可root@cby:~# mkdir prometheus/static_conf/ root@cby:~# vim /prometheus/static_conf/file.yaml root@cby:~# cat /prometheus/static_conf/file.yaml - targets: ['10.0.0.1:9200'] - targets: ['10.0.0.2:9200'] - targets: ['10.0.0.3:9200'] - targets: ['10.0.0.4:9200'] - targets: ['10.0.0.5:9200'] - targets: ['10.0.0.6:9200'] - targets: ['10.0.0.7:9200'] - targets: ['10.0.0.8:9200'] - targets: ['10.0.0.9:9200'] - targets: ['10.0.0.10:9200'] - targets: ['10.0.0.11:9200'] - targets: ['10.0.0.12:9200'] - targets: ['10.0.0.13:9200'] - targets: ['10.0.0.14:9200'] - targets: ['10.0.0.15:9200'] - targets: ['10.0.0.16:9200'] - targets: ['10.0.0.17:9200'] - targets: ['10.0.0.18:9200'] - targets: ['10.0.0.19:9200'] - targets: ['10.0.0.20:9200'] - targets: ['10.0.0.21:9200'] - targets: ['10.0.0.22:9200'] - targets: ['10.0.0.23:9200'] - targets: ['10.0.0.24:9200'] - targets: ['10.0.0.25:9200'] - targets: ['10.0.0.26:9200'] - targets: ['10.0.0.27:9200'] - targets: ['10.0.0.28:9200'] - targets: ['10.0.0.29:9200'] - targets: 
['10.0.0.30:9200'] - targets: ['10.0.0.31:9200'] - targets: ['10.0.0.32:9200'] - targets: ['10.0.0.33:9200'] - targets: ['10.0.0.34:9200'] - targets: ['10.0.0.35:9200'] - targets: ['10.0.0.36:9200'] - targets: ['10.0.0.37:9200'] - targets: ['10.0.0.38:9200'] - targets: ['10.0.0.39:9200'] - targets: ['10.0.0.40:9200'] - targets: ['10.0.0.41:9200'] - targets: ['10.0.0.42:9200'] - targets: ['10.0.0.43:9200'] - targets: ['10.0.0.44:9200'] - targets: ['10.0.0.45:9200'] - targets: ['10.0.0.46:9200'] - targets: ['10.0.0.47:9200'] - targets: ['10.0.0.48:9200'] - targets: ['10.0.0.49:9200'] - targets: ['10.0.0.50:9200'] - targets: ['10.0.0.51:9200'] - targets: ['10.0.0.52:9200'] - targets: ['10.0.0.53:9200'] - targets: ['10.0.0.54:9200'] - targets: ['10.0.0.55:9200'] - targets: ['10.0.0.56:9200'] - targets: ['10.0.0.57:9200'] - targets: ['10.0.0.58:9200'] - targets: ['10.0.0.59:9200'] - targets: ['10.0.0.60:9200'] - targets: ['10.0.0.61:9200'] - targets: ['10.0.0.62:9200'] - targets: ['10.0.0.63:9200'] - targets: ['10.0.0.64:9200'] - targets: ['10.0.0.65:9200'] - targets: ['10.0.0.66:9200'] - targets: ['10.0.0.67:9200'] - targets: ['10.0.0.68:9200'] - targets: ['10.0.0.69:9200'] - targets: ['10.0.0.70:9200'] - targets: ['10.0.0.71:9200'] - targets: ['10.0.0.72:9200'] - targets: ['10.0.0.73:9200'] - targets: ['10.0.0.74:9200'] - targets: ['10.0.0.75:9200'] - targets: ['10.0.0.76:9200'] - targets: ['10.0.0.77:9200'] - targets: ['10.0.0.78:9200'] - targets: ['10.0.0.79:9200'] - targets: ['10.0.0.80:9200'] - targets: ['10.0.0.81:9200'] - targets: ['10.0.0.82:9200'] - targets: ['10.0.0.83:9200'] - targets: ['10.0.0.84:9200'] - targets: ['10.0.0.85:9200'] - targets: ['10.0.0.86:9200'] - targets: ['10.0.0.87:9200'] - targets: ['10.0.0.88:9200'] - targets: ['10.0.0.89:9200'] - targets: ['10.0.0.90:9200'] - targets: ['10.0.0.91:9200'] - targets: ['10.0.0.92:9200'] - targets: ['10.0.0.93:9200'] - targets: ['10.0.0.94:9200'] - targets: ['10.0.0.95:9200'] - targets: ['10.0.0.96:9200'] 
      - targets: ['10.0.0.97:9200']
      - targets: ['10.0.0.98:9200']
      - targets: ['10.0.0.99:9200']
      - targets: ['10.0.0.100:9200']
      - targets: ['10.0.0.101:9200']
      - targets: ['10.0.0.102:9200']
      - targets: ['10.0.0.103:9200']
      - targets: ['10.0.0.104:9200']
      - targets: ['10.0.0.105:9200']
      - targets: ['10.0.0.106:9200']
      - targets: ['10.0.0.107:9200']
      - targets: ['10.0.0.108:9200']
      - targets: ['10.0.0.109:9200']
      - targets: ['10.0.0.110:9200']
      - targets: ['10.0.0.111:9200']
      - targets: ['10.0.0.112:9200']
      - targets: ['10.0.0.113:9200']
      - targets: ['10.0.0.114:9200']
      - targets: ['10.0.0.115:9200']
      - targets: ['10.0.0.116:9200']
      - targets: ['10.0.0.117:9200']
      - targets: ['10.0.0.118:9200']
      - targets: ['10.0.0.119:9200']
      - targets: ['10.0.0.120:9200']
      - targets: ['10.0.0.121:9200']
      - targets: ['10.0.0.122:9200']
      - targets: ['10.0.0.123:9200']
      - targets: ['10.0.0.124:9200']
      - targets: ['10.0.0.125:9200']
      - targets: ['10.0.0.126:9200']
      - targets: ['10.0.0.127:9200']
      - targets: ['10.0.0.128:9200']
      - targets: ['10.0.0.129:9200']
      - targets: ['10.0.0.130:9200']
      - targets: ['10.0.0.131:9200']
      - targets: ['10.0.0.132:9200']
      - targets: ['10.0.0.133:9200']
      - targets: ['10.0.0.134:9200']
      - targets: ['10.0.0.135:9200']
      - targets: ['10.0.0.136:9200']
      - targets: ['10.0.0.137:9200']
      - targets: ['10.0.0.138:9200']
      - targets: ['10.0.0.139:9200']
      - targets: ['10.0.0.140:9200']
      - targets: ['10.0.0.141:9200']
      - targets: ['10.0.0.142:9200']
      - targets: ['10.0.0.143:9200']
      - targets: ['10.0.0.144:9200']
      - targets: ['10.0.0.145:9200']
      - targets: ['10.0.0.146:9200']
      - targets: ['10.0.0.147:9200']
      - targets: ['10.0.0.148:9200']
      - targets: ['10.0.0.149:9200']
      - targets: ['10.0.0.150:9200']
      - targets: ['10.0.0.151:9200']
      - targets: ['10.0.0.152:9200']
      - targets: ['10.0.0.153:9200']
      - targets: ['10.0.0.154:9200']
      - targets: ['10.0.0.155:9200']
      - targets: ['10.0.0.156:9200']
      - targets: ['10.0.0.157:9200']
      - targets: ['10.0.0.158:9200']
      - targets: ['10.0.0.159:9200']
      - targets: ['10.0.0.160:9200']
      - targets: ['10.0.0.161:9200']
      - targets: ['10.0.0.162:9200']
      - targets: ['10.0.0.163:9200']
      - targets: ['10.0.0.164:9200']
      - targets: ['10.0.0.165:9200']
      - targets: ['10.0.0.166:9200']
      - targets: ['10.0.0.167:9200']
      - targets: ['10.0.0.168:9200']
      - targets: ['10.0.0.169:9200']
      - targets: ['10.0.0.170:9200']
      - targets: ['10.0.0.171:9200']
      - targets: ['10.0.0.172:9200']
      - targets: ['10.0.0.173:9200']
      - targets: ['10.0.0.174:9200']
      - targets: ['10.0.0.175:9200']
      - targets: ['10.0.0.176:9200']
      - targets: ['10.0.0.177:9200']
      - targets: ['10.0.0.178:9200']
      - targets: ['10.0.0.179:9200']
      - targets: ['10.0.0.180:9200']
      - targets: ['10.0.0.181:9200']
      - targets: ['10.0.0.182:9200']
      - targets: ['10.0.0.183:9200']
      - targets: ['10.0.0.184:9200']
      - targets: ['10.0.0.185:9200']
      - targets: ['10.0.0.186:9200']
      - targets: ['10.0.0.187:9200']
      - targets: ['10.0.0.188:9200']
      - targets: ['10.0.0.189:9200']
      - targets: ['10.0.0.190:9200']
      - targets: ['10.0.0.191:9200']
      - targets: ['10.0.0.192:9200']
      - targets: ['10.0.0.193:9200']
      - targets: ['10.0.0.194:9200']
      - targets: ['10.0.0.195:9200']
      - targets: ['10.0.0.196:9200']
      - targets: ['10.0.0.197:9200']
      - targets: ['10.0.0.198:9200']
      - targets: ['10.0.0.199:9200']
      - targets: ['10.0.0.200:9200']
      - targets: ['10.0.0.201:9200']
      - targets: ['10.0.0.202:9200']
      - targets: ['10.0.0.203:9200']
      - targets: ['10.0.0.204:9200']
      - targets: ['10.0.0.205:9200']
      - targets: ['10.0.0.206:9200']
      - targets: ['10.0.0.207:9200']
      - targets: ['10.0.0.208:9200']
      - targets: ['10.0.0.209:9200']
      - targets: ['10.0.0.210:9200']
      - targets: ['10.0.0.211:9200']
      - targets: ['10.0.0.212:9200']
      - targets: ['10.0.0.213:9200']
      - targets: ['10.0.0.214:9200']
      - targets: ['10.0.0.215:9200']
      - targets: ['10.0.0.216:9200']
      - targets: ['10.0.0.217:9200']
      - targets: ['10.0.0.218:9200']
      - targets: ['10.0.0.219:9200']
      - targets: ['10.0.0.220:9200']
      - targets: ['10.0.0.221:9200']
      - targets: ['10.0.0.222:9200']
      - targets: ['10.0.0.223:9200']
      - targets: ['10.0.0.224:9200']
      - targets: ['10.0.0.225:9200']
      - targets: ['10.0.0.226:9200']
      - targets: ['10.0.0.227:9200']
      - targets: ['10.0.0.228:9200']
      - targets: ['10.0.0.229:9200']
      - targets: ['10.0.0.230:9200']
      - targets: ['10.0.0.231:9200']
      - targets: ['10.0.0.232:9200']
      - targets: ['10.0.0.233:9200']
      - targets: ['10.0.0.234:9200']
      - targets: ['10.0.0.235:9200']
      - targets: ['10.0.0.236:9200']
      - targets: ['10.0.0.237:9200']
      - targets: ['10.0.0.238:9200']
      - targets: ['10.0.0.239:9200']
      - targets: ['10.0.0.240:9200']
      - targets: ['10.0.0.241:9200']
      - targets: ['10.0.0.242:9200']
      - targets: ['10.0.0.243:9200']
      - targets: ['10.0.0.244:9200']
      - targets: ['10.0.0.245:9200']
      - targets: ['10.0.0.246:9200']
      - targets: ['10.0.0.247:9200']
      - targets: ['10.0.0.248:9200']
      - targets: ['10.0.0.249:9200']
      - targets: ['10.0.0.250:9200']
      - targets: ['10.0.0.251:9200']
      - targets: ['10.0.0.252:9200']
      - targets: ['10.0.0.253:9200']
      - targets: ['10.0.0.254:9200']
      - targets: ['10.0.0.255:9200']

Configure the service to start at boot

root@cby:~# vim /lib/systemd/system/prometheus.service
root@cby:~# cat /lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network-online.target

[Service]
Type=simple
ExecStart=/prometheus/prometheus --config.file=/prometheus/prometheus.yml
Restart=on-failure
ExecStop=/bin/kill -9 $MAINPID

[Install]
WantedBy=multi-user.target
root@cby:~# systemctl daemon-reload
root@cby:~# systemctl enable --now prometheus.service
Created symlink /etc/systemd/system/multi-user.target.wants/prometheus.service → /lib/systemd/system/prometheus.service.
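A misspelled `Restart=` value in a unit file is easy to miss, because systemd generally logs a warning and ignores the setting rather than refusing to start the service. The sketch below is a hypothetical helper (`check_restart` is my own name, not part of systemd) that compares the value against the restart policies listed in systemd.service(5). On a host with systemd, `systemd-analyze verify <unit file>` gives a fuller check.

```shell
#!/bin/sh
# check_restart UNIT_FILE
# Hypothetical helper: prints "ok: <value>" when the Restart= setting is one of
# the policies documented in systemd.service(5), prints "invalid: <value>" and
# returns 1 otherwise. A unit without Restart= prints "unset".
check_restart() {
    # Take the last Restart= line, since later assignments override earlier ones.
    value=$(sed -n 's/^Restart=//p' "$1" | tail -n 1)
    case "$value" in
        "")
            echo "unset" ;;
        no|always|on-success|on-failure|on-abnormal|on-abort|on-watchdog)
            echo "ok: $value" ;;
        *)
            echo "invalid: $value"
            return 1 ;;
    esac
}
```

For the unit above it would be called as `check_restart /lib/systemd/system/prometheus.service`.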
root@cby:~# systemctl status prometheus.service

Install the node_exporter monitoring component

root@cby:~# wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0/node_exporter-1.4.0.linux-amd64.tar.gz
root@cby:~# tar xvf node_exporter-1.4.0.linux-amd64.tar.gz
node_exporter-1.4.0.linux-amd64/
node_exporter-1.4.0.linux-amd64/LICENSE
node_exporter-1.4.0.linux-amd64/NOTICE
node_exporter-1.4.0.linux-amd64/node_exporter
root@cby:~# mv node_exporter-1.4.0.linux-amd64 node_exporter
root@cby:~# mv prometheus /
root@cby:~# mv node_exporter /

Enable start at boot

root@cby:~# vim /lib/systemd/system/node_exporter.service
root@cby:~# cat /lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network-online.target

[Service]
Type=simple
ExecStart=/node_exporter/node_exporter --web.listen-address=":9200"
Restart=on-failure
ExecStop=/bin/kill -9 $MAINPID

[Install]
WantedBy=multi-user.target
root@cby:~# systemctl daemon-reload
root@cby:~# systemctl enable --now node_exporter.service
Created symlink /etc/systemd/system/multi-user.target.wants/node_exporter.service → /lib/systemd/system/node_exporter.service.
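Every host that runs node_exporter on port 9200 needs a matching entry under `static_configs` in /prometheus/prometheus.yml. A long run of consecutive addresses, like the 10.0.0.x list shown earlier, can be generated with a loop instead of being typed by hand. This is a minimal sketch: `gen_targets` is a hypothetical helper name, and the six-space indent is an assumption that has to match the indentation of your own file.

```shell
#!/bin/sh
# gen_targets PREFIX FIRST LAST PORT
# Hypothetical helper: prints one static_configs entry per address from
# PREFIX.FIRST to PREFIX.LAST, in the same shape as the list in prometheus.yml.
gen_targets() {
    i=$2
    while [ "$i" -le "$3" ]; do
        printf "      - targets: ['%s.%s:%s']\n" "$1" "$i" "$4"
        i=$((i + 1))
    done
}

# Example: three hosts, 10.0.0.97 to 10.0.0.99, on the node_exporter port.
gen_targets 10.0.0 97 99 9200
```

After pasting the generated lines into the configuration, `promtool check config /prometheus/prometheus.yml` can be used to validate the file (promtool ships in the Prometheus tarball; its location on your host is an assumption).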
root@cby:~# systemctl status node_exporter.service

Download and install Alertmanager

root@cby:~# wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
root@cby:~# tar xvf alertmanager-0.24.0.linux-amd64.tar.gz
alertmanager-0.24.0.linux-amd64/
alertmanager-0.24.0.linux-amd64/alertmanager.yml
alertmanager-0.24.0.linux-amd64/LICENSE
alertmanager-0.24.0.linux-amd64/NOTICE
alertmanager-0.24.0.linux-amd64/alertmanager
alertmanager-0.24.0.linux-amd64/amtool
root@cby:~# mv alertmanager-0.24.0.linux-amd64 alertmanager
root@cby:~# mv alertmanager /

Global configuration

root@cby:~# vim /alertmanager/alertmanager.yml
root@cby:~# cat /alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: 'cby@chenby.cn'
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'
  smtp_auth_username: 'cby@chenby.cn'
  smtp_auth_password: 'xxxxxxxx'
  smtp_require_tls: false
  smtp_hello: 'chenby.cn'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: 'cby@chenby.cn'
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

Configure alert rules

For rule templates, I recommend picking the ones that fit your setup from this site: https://awesome-prometheus-alerts.grep.to/

Example:

groups:
- name: test-rules
  rules:
  - alert: InstanceDown  # alert name
    expr: up == 0  # alert condition, written as a PromQL query
    for: 2m  # how long the condition must hold before the alert is sent
    labels:  # labels
      team: node
    annotations:  # annotations: a detailed description of the alert
      summary: "{{$labels.instance}}: has been down"
      description: "{{$labels.instance}}: job {{$labels.job}} has been down "
      value: "{{$value}}"

My alert configuration

root@cby:~# mkdir /prometheus/dist/
root@cby:~# vim /prometheus/dist/123.yml
root@cby:~# cat /prometheus/dist/123.yml
groups:
- name: generals.rules
  rules:
  - alert: PrometheusJobMissing
    expr: absent(up{job="prometheus"})
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus job missing (instance {{ $labels.instance }})
      description: "A Prometheus job has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTargetMissing
    expr: up == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target missing (instance {{ $labels.instance }})
      description: "A Prometheus target has disappeared. An exporter might be crashed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusAllTargetsMissing
    expr: sum by (job) (up) == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus all targets missing (instance {{ $labels.instance }})
      description: "A Prometheus job does not have living target anymore.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTargetMissingWithWarmupTime
    expr: sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target missing with warmup time (instance {{ $labels.instance }})
      description: "Allow a job time to start up (10 minutes) before alerting that it's down.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusConfigurationReloadFailure
    expr: prometheus_config_last_reload_successful != 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus configuration reload failure (instance {{ $labels.instance }})
      description: "Prometheus configuration reload error\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTooManyRestarts
    expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus too many restarts (instance {{ $labels.instance }})
      description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusAlertmanagerJobMissing
    expr: absent(up{job="alertmanager"})
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus AlertManager job missing (instance {{ $labels.instance }})
      description: "A Prometheus AlertManager job has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusAlertmanagerConfigurationReloadFailure
    expr: alertmanager_config_last_reload_successful != 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})
      description: "AlertManager configuration reload error\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusAlertmanagerConfigNotSynced
    expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})
      description: "Configurations of AlertManager cluster instances are out of sync\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusAlertmanagerE2eDeadManSwitch
    expr: vector(1)
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
      description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusNotConnectedToAlertmanager
    expr: prometheus_notifications_alertmanagers_discovered < 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})
      description: "Prometheus cannot connect the alertmanager\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusRuleEvaluationFailures
    expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTemplateTextExpansionFailures
    expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} template text expansion failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusRuleEvaluationSlow
    expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
      description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusNotificationsBacklog
    expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus notifications backlog (instance {{ $labels.instance }})
      description: "The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusAlertmanagerNotificationFailing
    expr: rate(alertmanager_notifications_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
      description: "Alertmanager is failing sending notifications\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTargetEmpty
    expr: prometheus_sd_discovered_targets == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target empty (instance {{ $labels.instance }})
      description: "Prometheus has no target in service discovery\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTargetScrapingSlow
    expr: prometheus_target_interval_length_seconds{quantile="0.9"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile="0.5"} > 1.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus target scraping slow (instance {{ $labels.instance }})
      description: "Prometheus is scraping exporters slowly since it exceeded the requested interval time. Your Prometheus server is under-provisioned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusLargeScrape
    expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus large scrape (instance {{ $labels.instance }})
      description: "Prometheus has many scrapes that exceed the sample limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTargetScrapeDuplicate
    expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
      description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTsdbCheckpointCreationFailures
    expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} checkpoint creation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTsdbCheckpointDeletionFailures
    expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTsdbCompactionsFailed
    expr: increase(prometheus_tsdb_compactions_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB compactions failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTsdbHeadTruncationsFailed
    expr: increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTsdbReloadFailures
    expr: increase(prometheus_tsdb_reloads_failures_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB reload failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTsdbWalCorruptions
    expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTsdbWalTruncationsFailed
    expr: increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: PrometheusTimeserieCardinality
    expr: label_replace(count by(__name__) ({__name__=~".+"}), "name", "$1", "__name__", "(.+)") > 10000
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus timeserie cardinality (instance {{ $labels.instance }})
      description: "The \"{{ $labels.name }}\" timeserie cardinality is getting very high: {{ $value }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostMemoryUnderMemoryPressure
    expr: rate(node_vmstat_pgmajfault[1m]) > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host memory under memory pressure (instance {{ $labels.instance }})
      description: "The node is under heavy memory pressure. High rate of major page faults\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostUnusualNetworkThroughputIn
    expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput in (instance {{ $labels.instance }})
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostUnusualNetworkThroughputOut
    expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput out (instance {{ $labels.instance }})
      description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostUnusualDiskReadRate
    expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read rate (instance {{ $labels.instance }})
      description: "Disk is probably reading too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostUnusualDiskWriteRate
    expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write rate (instance {{ $labels.instance }})
      description: "Disk is probably writing too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  # Please add ignored
  # mountpoints in node_exporter parameters like
  # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
  # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of disk space (instance {{ $labels.instance }})
      description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  # Please add ignored mountpoints in node_exporter parameters like
  # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
  # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
  - alert: HostDiskWillFillIn24Hours
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host disk will fill in 24 hours (instance {{ $labels.instance }})
      description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostOutOfInodes
    expr: node_filesystem_files_free / node_filesystem_files * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of inodes (instance {{ $labels.instance }})
      description: "Disk is almost running out of available inodes (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostInodesWillFillIn24Hours
    expr: node_filesystem_files_free / node_filesystem_files * 100 < 10 and predict_linear(node_filesystem_files_free[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host inodes will fill in 24 hours (instance {{ $labels.instance }})
      description: "Filesystem is predicted to run out of inodes within the next 24 hours at current write rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostUnusualDiskReadLatency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read latency (instance {{ $labels.instance }})
      description: "Disk latency is growing (read operations > 100ms)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostUnusualDiskWriteLatency
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write latency (instance {{ $labels.instance }})
      description: "Disk latency is growing (write operations > 100ms)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host high CPU load (instance {{ $labels.instance }})
      description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostCpuStealNoisyNeighbor
    expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
      description: "CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostCpuHighIowait
    expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 5
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host CPU high iowait (instance {{ $labels.instance }})
      description: "CPU iowait > 5%. A high iowait means that you are disk or network bound.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  # 1000 context switches is an arbitrary number.
  # Alert threshold depends on nature of application.
  # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
  - alert: HostContextSwitching
    expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host context switching (instance {{ $labels.instance }})
      description: "Context switching is growing on node (> 1000 / s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostSwapIsFillingUp
    expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host swap is filling up (instance {{ $labels.instance }})
      description: "Swap is filling up (>80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostSystemdServiceCrashed
    expr: node_systemd_unit_state{state="failed"} == 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host systemd service crashed (instance {{ $labels.instance }})
      description: "systemd service crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostPhysicalComponentTooHot
    expr: node_hwmon_temp_celsius > 75
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host physical component too hot (instance {{ $labels.instance }})
      description: "Physical hardware component too hot\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostNodeOvertemperatureAlarm
    expr: node_hwmon_temp_crit_alarm_celsius == 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Host node overtemperature alarm (instance {{ $labels.instance }})
      description: "Physical node temperature alarm triggered\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostRaidArrayGotInactive
    expr: node_md_state{state="inactive"} > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Host RAID array got inactive (instance {{ $labels.instance }})
      description: "RAID array {{ $labels.device }} is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostRaidDiskFailure
    expr: node_md_disks{state="failed"} > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host RAID disk failure (instance {{ $labels.instance }})
      description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostKernelVersionDeviations
    expr: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: Host kernel version deviations (instance {{ $labels.instance }})
      description: "Different kernel versions are running\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostOomKillDetected
    expr: increase(node_vmstat_oom_kill[1m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host OOM kill detected (instance {{ $labels.instance }})
      description: "OOM kill detected\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostEdacCorrectableErrorsDetected
    expr: increase(node_edac_correctable_errors_total[1m]) > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostEdacUncorrectableErrorsDetected
    expr: node_edac_uncorrectable_errors_total > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostNetworkReceiveErrors
    expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Receive Errors (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostNetworkTransmitErrors
    expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Transmit Errors (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last two minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostNetworkInterfaceSaturated
    expr: (rate(node_network_receive_bytes_total{device!~"^tap.*|^vnet.*|^veth.*|^tun.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*|^vnet.*|^veth.*|^tun.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*|^vnet.*|^veth.*|^tun.*"} > 0.8 < 10000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Host Network Interface Saturated (instance {{ $labels.instance }})
      description: "The network interface \"{{ $labels.device }}\" on \"{{ $labels.instance }}\" is getting overloaded.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostNetworkBondDegraded
    expr: (node_bonding_active - node_bonding_slaves) != 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Bond Degraded (instance {{ $labels.instance }})
      description: "Bond \"{{ $labels.device }}\" degraded on \"{{ $labels.instance }}\".\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostConntrackLimit
    expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host conntrack limit (instance {{ $labels.instance }})
      description: "The number of conntrack is approaching limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostClockSkew
    expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host clock skew (instance {{ $labels.instance }})
      description: "Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostClockNotSynchronising
    expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host clock not synchronising (instance {{ $labels.instance }})
      description: "Clock not synchronising. Ensure NTP is configured on this host.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostRequiresReboot
    expr: node_reboot_required > 0
    for: 4h
    labels:
      severity: info
    annotations:
      summary: Host requires reboot (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} requires a reboot.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesNodeReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Node ready (instance {{ $labels.instance }})
      description: "Node {{ $labels.node }} has been unready for a long time\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesMemoryPressure
    expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes memory pressure (instance {{ $labels.instance }})
      description: "{{ $labels.node }} has MemoryPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesDiskPressure
    expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes disk pressure (instance {{ $labels.instance }})
      description: "{{ $labels.node }} has DiskPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesNetworkUnavailable
    expr: kube_node_status_condition{condition="NetworkUnavailable",status="true"} == 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes network unavailable (instance {{ $labels.instance }})
      description: "{{ $labels.node }} has NetworkUnavailable condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesOutOfCapacity
    expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes out of capacity (instance {{ $labels.instance }})
      description: "{{ $labels.node }} is out of capacity\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesContainerOomKiller
    expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes container oom killer (instance {{ $labels.instance }})
      description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesJobFailed
    expr: kube_job_status_failed > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes Job failed (instance {{ $labels.instance }})
      description: "Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesCronjobSuspended
    expr: kube_cronjob_spec_suspend != 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes CronJob suspended (instance {{ $labels.instance }})
      description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesPersistentvolumeclaimPending
    expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }})
      description: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesVolumeOutOfDiskSpace
    expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes Volume out of disk space (instance {{ $labels.instance }})
      description: "Volume is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesVolumeFullInFourDays
    expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Volume full in four days (instance {{ $labels.instance }})
      description: "{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesPersistentvolumeError
    expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="kube-state-metrics"} > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }})
      description: "Persistent volume is in bad state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesStatefulsetDown
    expr: (kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes StatefulSet down (instance {{ $labels.instance }})
      description: "A StatefulSet went down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesHpaScalingAbility
    expr: kube_horizontalpodautoscaler_status_condition{status="false", condition="AbleToScale"} == 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes HPA scaling ability (instance {{ $labels.instance }})
      description: "Pod is unable to scale\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesHpaMetricAvailability
    expr: kube_horizontalpodautoscaler_status_condition{status="false", condition="ScalingActive"} == 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes HPA metric availability (instance {{ $labels.instance }})
      description: "HPA is not able to collect metrics\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesHpaScaleCapability
    expr: kube_horizontalpodautoscaler_status_desired_replicas >= kube_horizontalpodautoscaler_spec_max_replicas
    for: 2m
    labels:
      severity: info
    annotations:
      summary: Kubernetes HPA scale capability (instance {{ $labels.instance }})
      description: "The maximum number of desired Pods has been hit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesPodNotHealthy
    expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
      description: "Pod has been in a non-ready state for longer than 15 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesPodCrashLooping
    expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
      description: "Pod {{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesReplicassetMismatch
    expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes ReplicasSet mismatch (instance {{ $labels.instance }})
      description: "Deployment Replicas mismatch\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesDeploymentReplicasMismatch
    expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }})
      description: "Deployment Replicas mismatch\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: KubernetesStatefulsetReplicasMismatch
    expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
    for: 10m
    labels:
      severity: warning
annotations: summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }}) description: "A StatefulSet does not match the expected number of replicas.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" - alert: KubernetesDeploymentGenerationMismatch expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation for: 10m labels: severity: critical annotations: summary: Kubernetes Deployment generation mismatch (instance {{ $labels.instance }}) description: "A Deployment has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" - alert: KubernetesStatefulsetGenerationMismatch expr: kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation for: 10m labels: severity: critical annotations: summary: Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }}) description: "A StatefulSet has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" - alert: KubernetesStatefulsetUpdateNotRolledOut expr: max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated) for: 10m labels: severity: warning annotations: summary: Kubernetes StatefulSet update not rolled out (instance {{ $labels.instance }}) description: "StatefulSet update has not been rolled out.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" - alert: KubernetesDaemonsetRolloutStuck expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0 for: 10m labels: severity: warning annotations: summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }}) description: "Some Pods of DaemonSet are not scheduled or not ready\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" - alert: 
KubernetesDaemonsetMisscheduled expr: kube_daemonset_status_number_misscheduled > 0 for: 1m labels: severity: critical annotations: summary: Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }}) description: "Some DaemonSet Pods are running where they are not supposed to run\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" - alert: KubernetesCronjobTooLong expr: time() - kube_cronjob_next_schedule_time > 3600 for: 0m labels: severity: warning annotations: summary: Kubernetes CronJob too long (instance {{ $labels.instance }}) description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" - alert: KubernetesJobSlowCompletion expr: kube_job_spec_completions - kube_job_status_succeeded > 0 for: 12h labels: severity: critical annotations: summary: Kubernetes job slow completion (instance {{ $labels.instance }}) description: "Kubernetes Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" - alert: KubernetesApiServerErrors expr: sum(rate(apiserver_request_total{job="apiserver",code=~"^(?:5..)$"}[1m])) / sum(rate(apiserver_request_total{job="apiserver"}[1m])) * 100 > 3 for: 2m labels: severity: critical annotations: summary: Kubernetes API server errors (instance {{ $labels.instance }}) description: "Kubernetes API server is experiencing high error rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" - alert: KubernetesApiClientErrors expr: (sum(rate(rest_client_requests_total{code=~"(4|5).."}[1m])) by (instance, job) / sum(rate(rest_client_requests_total[1m])) by (instance, job)) * 100 > 1 for: 2m labels: severity: critical annotations: summary: Kubernetes API client errors (instance {{ $labels.instance }}) description: "Kubernetes API client is experiencing high error rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" - alert: KubernetesClientCertificateExpiresNextWeek expr: 
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 7*24*60*60 for: 0m labels: severity: warning annotations: summary: Kubernetes client certificate expires next week (instance {{ $labels.instance }}) description: "A client certificate used to authenticate to the apiserver is expiring next week.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" - alert: KubernetesClientCertificateExpiresSoon expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 24*60*60 for: 0m labels: severity: critical annotations: summary: Kubernetes client certificate expires soon (instance {{ $labels.instance }}) description: "A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" - alert: KubernetesApiServerLatency expr: histogram_quantile(0.99, sum(rate(apiserver_request_latencies_bucket{subresource!="log",verb!~"^(?:CONNECT|WATCHLIST|WATCH|PROXY)$"} [10m])) WITHOUT (instance, resource)) / 1e+06 > 1 for: 2m labels: severity: warning annotations: summary: Kubernetes API server latency (instance {{ $labels.instance }}) description: "Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}" root@cby:~# 设置开机自启root@cby:~# vim /lib/systemd/system/alertmanager.service root@cby:~# cat /lib/systemd/system/alertmanager.service [Unit] Description=Alertmanager for Prometheus After=network-online.target [Service] Type=simple ExecStart=/alertmanager/alertmanager --config.file=/alertmanager/alertmanager.yml Restart=on-failur ExecStop=/bin/kill -9 $MAINPID [Install] WantedBy=multi-user.target 
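A mistyped value on the Restart= line is easy to miss, because systemd generally logs a warning and ignores the line instead of refusing to start the service. The following is a minimal sketch of a check for that one setting, assuming the accepted values listed in systemd.service(5). The function name check_restart_value is hypothetical and used only for illustration; it is not part of systemd.

```shell
#!/bin/sh
# Sketch: flag an invalid Restart= value in a systemd unit file.
# check_restart_value is a hypothetical helper name, not a systemd tool.
check_restart_value() {
    unit_file=$1
    if [ ! -r "$unit_file" ]; then
        echo "cannot read $unit_file" >&2
        return 2
    fi
    # Take the last Restart= assignment in the file, without the key.
    value=$(sed -n 's/^Restart=//p' "$unit_file" | tail -n 1)
    case "$value" in
        no|always|on-success|on-failure|on-abnormal|on-watchdog|on-abort)
            echo "Restart=$value is valid"
            return 0
            ;;
        "")
            echo "no Restart= line found (systemd defaults to Restart=no)"
            return 0
            ;;
        *)
            echo "Restart=$value is not a valid value" >&2
            return 1
            ;;
    esac
}

# Example call, using the unit path from the transcript above:
# check_restart_value /lib/systemd/system/alertmanager.service
```

Where it is available, `systemd-analyze verify /lib/systemd/system/alertmanager.service` performs a fuller check of the whole unit; the function above only looks at one key.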
root@cby:~# systemctl daemon-reload
root@cby:~#
root@cby:~# systemctl enable --now alertmanager.service
Created symlink /etc/systemd/system/multi-user.target.wants/alertmanager.service → /lib/systemd/system/alertmanager.service.
root@cby:~#
root@cby:~# systemctl status alertmanager.service

Access addresses
# Node
http://10.0.0.2:9090/
# Grafana
http://10.0.0.2:3000/
# Prometheus
http://10.0.0.2:9200/
# Alertmanager
http://10.0.0.2:9093/

Connecting Grafana to Prometheus

Alert records

Email content

About
https://www.oiox.cn/
https://www.oiox.cn/index.php/start-page.html
CSDN, GitHub, 51CTO, Zhihu, OSChina, SegmentFault, Juejin, Jianshu, Huawei Cloud, Alibaba Cloud, Tencent Cloud, Bilibili, Toutiao, Sina Weibo, and the personal blog
Search for 《小陈运维》 on any of these platforms.
Articles are published mainly on the WeChat official account.
- November 12, 2022
2022-11-09
Installing the Samba file service on Ubuntu

Install the Samba service
root@v:~# apt install samba samba-common
root@v:~#

Create the shared directory
root@v:~# mkdir /cby/smb/ -pv
root@v:~# chmod 777 -R /cby/smb/
root@v:~#

Edit the configuration file
# Add a share section that allows anonymous access
[share]
   path = /cby/smb
   public = yes
   read only = no
   guest ok = Yes
   create mask = 0644
   force create mode = 0644
   directory mask = 0755
   force directory mode = 0755
   available = yes

# The complete configuration is as follows
root@v:~# vim /etc/samba/smb.conf
root@v:~# cat /etc/samba/smb.conf
#
# Sample configuration file for the Samba suite for Debian GNU/Linux.
#
#
# This is the main Samba configuration file. You should read the
# smb.conf(5) manual page in order to understand the options listed
# here. Samba has a huge number of configurable options most of which
# are not shown in this example
#
# Some options that are often worth tuning have been included as
# commented-out examples in this file.
#  - When such options are commented with ";", the proposed setting
#    differs from the default Samba behaviour
#  - When commented with "#", the proposed setting is the default
#    behaviour of Samba but the option is considered important
#    enough to be mentioned here
#
# NOTE: Whenever you modify this file you should run the command
# "testparm" to check that you have not made any basic syntactic
# errors.

#======================= Global Settings =======================

[global]

## Browsing/Identification ###

# Change this to the workgroup/NT-domain name your Samba server will part of
   workgroup = WORKGROUP

#### Networking ####

# The specific set of interfaces / networks to bind to
# This can be either the interface name or an IP address/netmask;
# interface names are normally preferred
;   interfaces = 127.0.0.0/8 eth0

# Only bind to the named interfaces and/or networks; you must use the
# 'interfaces' option above to use this.
# It is recommended that you enable this feature if your Samba machine is
# not protected by a firewall or is a firewall itself.  However, this
# option cannot handle dynamic or non-broadcast interfaces correctly.
;   bind interfaces only = yes

#### Debugging/Accounting ####

# This tells Samba to use a separate log file for each machine
# that connects
   log file = /var/log/samba/log.%m

# Cap the size of the individual log files (in KiB).
   max log size = 1000

# We want Samba to only log to /var/log/samba/log.{smbd,nmbd}.
# Append syslog@1 if you want important messages to be sent to syslog too.
   logging = file

# Do something sensible when Samba crashes: mail the admin a backtrace
   panic action = /usr/share/samba/panic-action %d

####### Authentication #######

# Server role. Defines in which mode Samba will operate. Possible
# values are "standalone server", "member server", "classic primary
# domain controller", "classic backup domain controller", "active
# directory domain controller".
#
# Most people will want "standalone server" or "member server".
# Running as "active directory domain controller" will require first
# running "samba-tool domain provision" to wipe databases and create a
# new domain.
   server role = standalone server

   obey pam restrictions = yes

# This boolean parameter controls whether Samba attempts to sync the Unix
# password with the SMB password when the encrypted SMB password in the
# passdb is changed.
   unix password sync = yes

# For Unix password sync to work on a Debian GNU/Linux system, the following
# parameters must be set (thanks to Ian Kahan <kahan@informatik.tu-muenchen.de> for
# sending the correct chat script for the passwd program in Debian Sarge).
   passwd program = /usr/bin/passwd %u
   passwd chat = *Enter\snew\s*\spassword:* %n\n *Retype\snew\s*\spassword:* %n\n *password\supdated\ssuccessfully* .

# This boolean controls whether PAM will be used for password changes
# when requested by an SMB client instead of the program listed in
# 'passwd program'. The default is 'no'.
   pam password change = yes

# This option controls how unsuccessful authentication attempts are mapped
# to anonymous connections
   map to guest = bad user

########## Domains ###########

#
# The following settings only takes effect if 'server role = classic
# primary domain controller', 'server role = classic backup domain controller'
# or 'domain logons' is set
#

# It specifies the location of the user's
# profile directory from the client point of view) The following
# required a [profiles] share to be setup on the samba server (see
# below)
;   logon path = \\%N\profiles\%U
# Another common choice is storing the profile in the user's home directory
# (this is Samba's default)
#   logon path = \\%N\%U\profile

# The following setting only takes effect if 'domain logons' is set
# It specifies the location of a user's home directory (from the client
# point of view)
;   logon drive = H:
#   logon home = \\%N\%U

# The following setting only takes effect if 'domain logons' is set
# It specifies the script to run during logon. The script must be stored
# in the [netlogon] share
# NOTE: Must be store in 'DOS' file format convention
;   logon script = logon.cmd

# This allows Unix users to be created on the domain controller via the SAMR
# RPC pipe.  The example command creates a user account with a disabled Unix
# password; please adapt to your needs
; add user script = /usr/sbin/adduser --quiet --disabled-password --gecos "" %u

# This allows machine accounts to be created on the domain controller via the
# SAMR RPC pipe.
# The following assumes a "machines" group exists on the system
; add machine script = /usr/sbin/useradd -g machines -c "%u machine account" -d /var/lib/samba -s /bin/false %u

# This allows Unix groups to be created on the domain controller via the SAMR
# RPC pipe.
; add group script = /usr/sbin/addgroup --force-badname %g

############ Misc ############

# Using the following line enables you to customise your configuration
# on a per machine basis. The %m gets replaced with the netbios name
# of the machine that is connecting
;   include = /home/samba/etc/smb.conf.%m

# Some defaults for winbind (make sure you're not using the ranges
# for something else.)
;   idmap config * : backend = tdb
;   idmap config * : range = 3000-7999
;   idmap config YOURDOMAINHERE : backend = tdb
;   idmap config YOURDOMAINHERE : range = 100000-999999
;   template shell = /bin/bash

# Setup usershare options to enable non-root users to share folders
# with the net usershare command.

# Maximum number of usershare. 0 means that usershare is disabled.
#   usershare max shares = 100

# Allow users who've been granted usershare privileges to create
# public shares, not just authenticated ones
   usershare allow guests = yes

#======================= Share Definitions =======================

[homes]
   comment = Home Directories
   browseable = no

# By default, the home directories are exported read-only. Change the
# next parameter to 'no' if you want to be able to write to them.
   read only = yes

# File creation mask is set to 0700 for security reasons. If you want to
# create files with group=rw permissions, set next parameter to 0775.
   create mask = 0700

# Directory creation mask is set to 0700 for security reasons. If you want to
# create dirs. with group=rw permissions, set next parameter to 0775.
   directory mask = 0700

# By default, \\server\username shares can be connected to by anyone
# with access to the samba server.
# The following parameter makes sure that only "username" can connect
# to \\server\username
# This might need tweaking when using external authentication schemes
   valid users = %S

# Un-comment the following and create the netlogon directory for Domain Logons
# (you need to configure Samba to act as a domain controller too.)
;[netlogon]
;   comment = Network Logon Service
;   path = /home/samba/netlogon
;   guest ok = yes
;   read only = yes

# Un-comment the following and create the profiles directory to store
# users profiles (see the "logon path" option above)
# (you need to configure Samba to act as a domain controller too.)
# The path below should be writable by all users so that their
# profile directory may be created the first time they log on
;[profiles]
;   comment = Users profiles
;   path = /home/samba/profiles
;   guest ok = no
;   browseable = no
;   create mask = 0600
;   directory mask = 0700

[printers]
   comment = All Printers
   browseable = no
   path = /var/spool/samba
   printable = yes
   guest ok = no
   read only = yes
   create mask = 0700

# Windows clients look for this share name as a source of downloadable
# printer drivers
[print$]
   comment = Printer Drivers
   path = /var/lib/samba/printers
   browseable = yes
   read only = yes
   guest ok = no
# Uncomment to allow remote administration of Windows print drivers.
# You may need to replace 'lpadmin' with the name of the group your
# admin users are members of.
# Please note that you also need to set appropriate Unix permissions
# to the drivers directory for these users to have write rights in it
;   write list = root, @lpadmin

[share]
   path = /cby/smb
   public = yes
   read only = no
   guest ok = Yes
   create mask = 0644
   force create mode = 0644
   directory mask = 0755
   force directory mode = 0755
   available = yes
root@v:~#

Restart the service
root@v:~# systemctl restart smbd
root@v:~#
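Because the [share] section is appended to a long default file, it can help to print only that section and confirm the settings landed where intended. The following is a minimal sketch using awk; print_section is a hypothetical helper name and not part of Samba, and it assumes section headers sit on their own lines as in the file above.

```shell
#!/bin/sh
# Sketch: print the active "key = value" lines of one section of an
# smb.conf-style file. print_section is a hypothetical helper, not a Samba tool.
print_section() {
    section=$1
    conf_file=$2
    awk -v want="[$section]" '
        # A line holding only a [section] header switches the current section.
        /^[ \t]*\[.*\][ \t]*$/ {
            gsub(/^[ \t]+|[ \t]+$/, "")
            in_section = ($0 == want)
            next
        }
        # Inside the wanted section, skip blank lines and # or ; comments.
        in_section && !/^[ \t]*([#;]|$)/ {
            gsub(/^[ \t]+|[ \t]+$/, "")
            print
        }
    ' "$conf_file"
}

# Example call on the server configured above:
# print_section share /etc/samba/smb.conf
```

This only reads the file. For a real syntax check, the default smb.conf itself recommends running testparm after every change, for example `testparm -s`.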
- November 9, 2022