Troubleshooting an OOM Crash in a Java Process

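Symptoms

A backend developer reported that the Java process on one server had exited abnormally. When the JVM crashed, it automatically wrote an hs_err_pid*.log file.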

OS: Ubuntu 22.04 (as reported by the crash log)
CPU: E7- 4820, 8 cores
Memory: 32 GB

The log clearly shows an Out of Memory Error:

# cat hs_err_pid1679886.log|head -n 50
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 41943040 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   The process is running with CompressedOops enabled, and the Java Heap may be blocking the growth of the native heap
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
#   JVM is running with Zero Based Compressed Oops mode in which the Java heap is
#     placed in the first 32GB address space. The Java Heap base address is the
#     maximum limit for the native heap growth. Please use -XX:HeapBaseMinAddress
#     to set the Java Heap base and to place the Java Heap above 32GB virtual address.
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:3229), pid=1679886, tid=1680174
#
# JRE version: OpenJDK Runtime Environment (11.0.27+6) (build 11.0.27+6-post-Ubuntu-0ubuntu122.04)
# Java VM: OpenJDK 64-Bit Server VM (11.0.27+6-post-Ubuntu-0ubuntu122.04, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -F%F -- %E" (or dumping to /home/hdigital/buildbot/work_root/worker/backend-test/build/HDAidMaster-server/HDAidMaster-server-admin/target/core.1679886)
#

---------------  S U M M A R Y ------------

Command Line: HDAidMaster-server-admin.jar

Host: Intel(R) Xeon(R) CPU E7- 4820  @ 2.00GHz, 8 cores, 31G, Ubuntu 22.04.2 LTS
Time: Thu Jul 24 17:08:50 2025 CST elapsed time: 1302287.230222 seconds (15d 1h 44m 47s)

---------------  T H R E A D  ---------------

Current thread (0x00007444c189e800):  JavaThread "http-nio-8091-exec-97" daemon [_thread_in_vm, id=1680174, stack(0x0000744387a00000,0x0000744387b00000)]

Stack: [0x0000744387a00000,0x0000744387b00000],  sp=0x0000744387afc5b0,  free space=1009k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xebaf2a]  VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x19a
V  [libjvm.so+0xebbce1]  VMError::report_and_die(Thread*, char const*, int, unsigned long, VMErrorType, char const*, __va_list_tag*)+0x31
V  [libjvm.so+0x67f98a]  report_vm_out_of_memory(char const*, int, unsigned long, VMErrorType, char const*, ...)+0xca
V  [libjvm.so+0xc031b3]  os::pd_commit_memory_or_exit(char*, unsigned long, unsigned long, bool, char const*)+0xf3
V  [libjvm.so+0xbfc121]  os::commit_memory_or_exit(char*, unsigned long, unsigned long, bool, char const*)+0x21
V  [libjvm.so+0x799f25]  G1PageBasedVirtualSpace::commit_preferred_pages(unsigned long, unsigned long)+0x65
V  [libjvm.so+0x79a308]  G1PageBasedVirtualSpace::commit(unsigned long, unsigned long)+0x1a8
V  [libjvm.so+0x7a56d7]  G1RegionsLargerThanCommitSizeMapper::commit_regions(unsigned int, unsigned long, WorkGang*)+0x47
V  [libjvm.so+0x82aa18]  HeapRegionManager::commit_regions(unsigned int, unsigned long, WorkGang*)+0x58
V  [libjvm.so+0x82b668]  HeapRegionManager::make_regions_available(unsigned int, unsigned int, WorkGang*)+0x38

However, monitoring showed that the machine's memory usage was very low at the time of the failure.

The startup command was:

nohup java -jar HDAidMaster-server-admin.jar &

This command does not explicitly set any memory flags, so the JVM sizes its heap automatically based on system memory.
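
As a rough illustration of what "automatically" means here: HotSpot in JDK 11 defaults the maximum heap to 25% of physical memory (MaxRAMPercentage) and the initial heap to 1/64 of it (InitialRAMPercentage = 1.5625). The sketch below applies those defaults to the 31G reported in the crash log. The function name and the exact 31 GiB figure are assumptions made for illustration; the real values on a given host can be confirmed with java -XX:+PrintFlagsFinal -version.

```python
GIB = 1024 ** 3

def default_heap_sizes(physical_bytes, max_ram_pct=25.0, initial_ram_pct=1.5625):
    """Approximate HotSpot's default initial and maximum heap sizes in bytes.

    The percentages are the JDK 11 defaults for MaxRAMPercentage and
    InitialRAMPercentage; the JVM additionally aligns the results.
    """
    max_heap = int(physical_bytes * max_ram_pct / 100)
    initial_heap = int(physical_bytes * initial_ram_pct / 100)
    return initial_heap, max_heap

# 31 GiB is taken from the "Host:" line of the crash log (assumed exact here).
initial, maximum = default_heap_sizes(31 * GIB)
print(f"initial heap ~ {initial / GIB:.2f} GiB, max heap ~ {maximum / GIB:.2f} GiB")
```

Under these defaults the heap starts small and grows on demand. Each growth step commits more memory, which matches the crash log above, where G1 failed inside commit_regions while mapping 41943040 bytes.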

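Root Cause Analysis

When the process was started with explicit memory flags, it failed immediately. The output shows that commit_memory could not obtain enough space.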

#  java -Xms4g -Xmx8g -XX:-UseCompressedOops -XX:+UseG1GC -jar HDAidMaster-server-admin.jar
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007bd000000000, 4294967296, 0) failed; error='Not enough space' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 4294967296 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/hdigital/buildbot/work_root/worker/backend-test/build/HDAidMaster-server/HDAidMaster-server-admin/target/hs_err_pid2950582.log
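
The 4294967296 bytes in this error is exactly the -Xms4g initial heap, which the JVM tries to commit in one piece at startup. A small sketch of the suffix arithmetic follows; parse_jvm_size is a hypothetical helper written for this illustration, not a JVM API.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}

def parse_jvm_size(value):
    """Convert a JVM-style size such as '4g' or '512m' to bytes."""
    value = value.strip().lower()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)  # no suffix: the value is already in bytes

print(parse_jvm_size("4g"))  # initial heap requested by -Xms4g
print(parse_jvm_size("8g"))  # maximum heap allowed by -Xmx8g
```

The same arithmetic shows that the 41943040 bytes in the original crash log is a 40 MiB heap expansion step.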

Check the current memory state:

# cat /proc/meminfo | grep -E "(CommitLimit|Committed_AS|VmallocTotal|VmallocUsed)"
CommitLimit: 18515456 kB
Committed_AS: 14614528 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 127664 kB

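This pinpoints the cause: the remaining commit headroom is too small to satisfy the Java process's request. It also explains why monitoring showed low memory usage, because the limit applies to committed (promised) virtual memory rather than to memory actually in use.

The calculation:
Physical memory: 32 GB (the crash log reports 31G usable)
Swap: 2 GB
Maximum commit limit: 32 GB × 50% + 2 GB = 18 GB nominal (CommitLimit: 18515456 kB ≈ 17.66 GiB)
Currently committed: Committed_AS: 14614528 kB ≈ 13.94 GiB
Remaining headroom: 18515456 kB - 14614528 kB = 3900928 kB ≈ 3.72 GiB
JVM attempted to allocate: 4 GiB (4294967296 bytes), which exceeds the remaining headroom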
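
The same check can be sketched in a few lines of Python, using the two /proc/meminfo values shown above and the 4294967296-byte request from the JVM error. The function name is an assumption made for this example.

```python
import re

# Values copied from the /proc/meminfo output above.
MEMINFO = """\
CommitLimit: 18515456 kB
Committed_AS: 14614528 kB
"""

def commit_headroom_kb(meminfo_text):
    """Return CommitLimit minus Committed_AS, in kB."""
    fields = dict(re.findall(r"^(\w+):\s+(\d+) kB$", meminfo_text, flags=re.M))
    return int(fields["CommitLimit"]) - int(fields["Committed_AS"])

headroom_kb = commit_headroom_kb(MEMINFO)
request_kb = 4294967296 // 1024  # the 4 GiB the JVM tried to commit
print(f"headroom: {headroom_kb} kB ({headroom_kb / 1024**2:.2f} GiB)")
print(f"request:  {request_kb} kB, fits: {request_kb <= headroom_kb}")
```

On a live system the function can be fed the contents of /proc/meminfo directly, for example commit_headroom_kb(open("/proc/meminfo").read()).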

Check the current kernel parameters:

# sysctl -a |grep overcommit
vm.nr_overcommit_hugepages = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 2
vm.overcommit_ratio = 50

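About the overcommit_memory kernel parameter:

It is exposed at /proc/sys/vm/overcommit_memory. It controls whether the Linux kernel allows memory overcommit when handling allocation requests, and how that overcommit is accounted.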

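0 (default): Heuristic Overcommit

Behavior: This is the Linux default. The kernel applies a heuristic to each allocation request: obvious overcommits of address space are refused, while ordinary requests are granted even when the total committed memory (Committed_AS, the virtual memory that processes have requested but may not yet have touched) exceeds physical memory plus swap.
Characteristics:
- Optimistic, with a sanity check: a degree of overcommit is allowed, but a single very large request that clearly cannot be satisfied may be refused.
- OOM risk: the OOM Killer can still be triggered, especially when many processes request and actually use large amounts of memory at the same time.
- Typical use: most general-purpose servers and desktops use this setting because it balances memory utilization against stability.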

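1: Always Overcommit

Behavior: The kernel grants almost every allocation request regardless of how much memory is currently available. It assumes that applications will not actually use all the memory they request.
Characteristics:
- Maximum overcommit: unless the system cannot even supply the virtual address space, allocation functions such as malloc() almost never fail.
- High OOM risk: requests always succeed, but when a process actually touches the pages and physical memory or swap is insufficient, the OOM Killer is triggered. An application may therefore be killed abruptly at run time instead of receiving an error at allocation time.
- Typical use: certain specific workloads, for example:
  * Applications that need large sparse memory (allocating big regions but touching only a small part of them).
  * Some database systems that have their own memory management and cope better with memory pressure.
  * Test environments used to simulate out-of-memory conditions.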

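2: Never Overcommit (Strict Overcommit)

Behavior: The kernel strictly caps the total committed memory. It derives a commit limit from a configured ratio (the overcommit_ratio parameter). If a new allocation request would push the total committed memory above that limit, the request is rejected immediately (malloc() returns an error).
Characteristics:
- Strictest: every allocation request is checked against the limit, and malloc() fails once the limit would be exceeded.
- Low OOM Killer risk: because allocations are rejected up front, the system is far less likely to run out of memory and invoke the OOM Killer. Applications receive an allocation failure and can handle it themselves.
- May waste memory: if an application requests a lot of memory but uses only a small part of it, the unused "promised" memory still counts against the limit. Other applications may then fail to obtain memory even though physical memory is plentiful.
- The overcommit_ratio parameter: when overcommit_memory is 2, overcommit_ratio becomes very important. It defines the percentage of physical memory that may be committed. For example, with overcommit_ratio = 50 the system can commit at most 50% of physical memory plus swap.
  * The default overcommit_ratio is 50.
  * The usual formula is: commit limit = (total physical memory * overcommit_ratio / 100) + total swap.
  * Note: the actual computation differs slightly (huge pages are excluded from the physical memory term, and a nonzero overcommit_kbytes replaces the ratio), but the core idea is a fixed cap on committed virtual memory derived from physical memory and swap.
- Typical use:
  * Applications with explicit handling for allocation failures.
  * Critical systems that need tight control over memory and must avoid the OOM Killer (for example embedded systems and HPC clusters).
  * Some security-sensitive environments that require precise resource control.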
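
To tie the three modes back to this host, the sketch below interprets the sysctl output shown earlier. It is only an illustration: the function name and the wording of the summaries are assumptions, and the input is the captured text rather than a live query.

```python
POLICIES = {
    0: "heuristic overcommit",
    1: "always overcommit",
    2: "never overcommit (strict)",
}

# Captured from `sysctl -a | grep overcommit` on the affected host.
SYSCTL_OUTPUT = """\
vm.nr_overcommit_hugepages = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 2
vm.overcommit_ratio = 50
"""

def describe_overcommit(sysctl_text):
    """Summarize the overcommit policy described by sysctl output."""
    settings = {}
    for line in sysctl_text.splitlines():
        key, _, value = line.partition("=")
        if value:
            settings[key.strip()] = int(value)
    mode = settings["vm.overcommit_memory"]
    summary = POLICIES[mode]
    if mode == 2:
        # A nonzero overcommit_kbytes takes the place of overcommit_ratio.
        if settings.get("vm.overcommit_kbytes", 0) > 0:
            summary += f", limit = swap + {settings['vm.overcommit_kbytes']} kB"
        else:
            summary += f", limit = swap + {settings['vm.overcommit_ratio']}% of RAM"
    return summary

print(describe_overcommit(SYSCTL_OUTPUT))
```

For this host the result is strict mode with a limit of swap plus 50% of RAM, which is why a 32 GB machine refuses commits beyond roughly 18 GB.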

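Resolution

There are two fixes: adjust the kernel parameters, and configure the JVM memory flags explicitly.

There are two ways to adjust the kernel parameters: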

Raise vm.overcommit_ratio; in this case it could be raised to 80, for example. vm.overcommit_ratio only takes effect when vm.overcommit_memory=2.
Change vm.overcommit_memory from 2 to 0 or 1; the default is 0.

To make the kernel parameters persistent, edit /etc/sysctl.conf and then run sysctl -p.
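
To get a feel for the first option, the sketch below applies the commit limit formula to this host's nominal sizes (32 GiB of RAM, 2 GiB of swap) for both ratios. The function name and the hugepages_gib parameter are assumptions for illustration; the real limit should be read from CommitLimit in /proc/meminfo after the change.

```python
def commit_limit_gib(ram_gib, swap_gib, overcommit_ratio, hugepages_gib=0.0):
    """Commit limit under vm.overcommit_memory=2, in GiB.

    Huge pages are excluded from the RAM term before the ratio is applied.
    """
    return (ram_gib - hugepages_gib) * overcommit_ratio / 100 + swap_gib

for ratio in (50, 80):
    limit = commit_limit_gib(32, 2, ratio)
    print(f"overcommit_ratio={ratio}: limit = {limit:.1f} GiB")
```

With roughly 14 GiB already committed, a ratio of 80 would leave enough headroom for the JVM's 4 GiB initial heap, whereas the ratio of 50 does not.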

Explicitly configure the JVM flags:
java -Xms4g -Xmx8g -XX:+UseG1GC -jar ***.jar

Posted in: Linux

2025-07-30

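Reprint notice: unless otherwise stated, articles on this site are published under the CC-4.0 license. Please credit the source when reprinting: https://www.opshub.cn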
