1.
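This post continues from "Clear Linux and Chelsio, Part One."
In the previous post I got as far as confirming the build. A successful build is not the end of the story, though. Chelsio provides its TOE-related features as kernel modules, so those modules have to be loaded and running in the kernel.
After booting the system, I ran the following command.
root@Clearlinux/home/smallake # modprobe t4_tom
Then I checked whether the kernel modules had loaded properly.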
root@Clearlinux/home/smallake # lsmod | grep t4_tom
t4_tom 188416 1
toecore 36864 1 t4_tom
cxgb4 811008 1 t4_tom
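From this alone you might think everything is fine, but I deliberately left one thing out: the output of modprobe itself. When modprobe succeeds it normally prints nothing, which is why I check with lsmod. This time, however, it responded as follows.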
root@Clearlinux/home/smallake # modprobe t4_tom
Killed
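Because modprobe prints nothing on success, its exit status is a more reliable signal than its output. The following is a minimal sketch, not part of the original session: `run_and_report` is a hypothetical helper name, and it assumes the shell convention that a status above 128 means "128 + signal number" (so a process ended by SIGKILL, which the shell reports as "Killed", shows up as 137).

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and describe how it ended.
# Assumption: an exit status above 128 is read as "128 + signal number",
# which is the shell's convention; a program could also return such a
# value on its own, so treat the result as a hint.
run_and_report() {
    local status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        echo "terminated by signal $((status - 128))"
    else
        echo "failed with exit status $status"
    fi
    return "$status"
}

# Example (needs root): run_and_report modprobe t4_tom
```

A "terminated by signal 9" result would match the "Killed" message above and is a cue to look at the kernel log next.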
To find out why, I ran the dmesg command.
[ 53.074828] calling offload_init+0x0/0x1000 [toecore] @ 878
[ 53.074840] toecore: IPv6 Offload not supported with this module.
[ 53.074844] initcall offload_init+0x0/0x1000 [toecore] returned 0 after 7 usecs
[ 53.136700] calling t4_tom_init+0x0/0xd03 [t4_tom] @ 878
[ 53.136716] BUG: unable to handle page fault for address: ffffffff82297f80
[ 53.143701] #PF: supervisor instruction fetch in kernel mode
[ 53.149447] #PF: error_code(0x0010) - not-present page
[ 53.154664] PGD 176013067 P4D 176013067 PUD 176014063 PMD 0
[ 53.160436] Oops: 0010 [#1] SMP PTI
[ 53.164007] CPU: 5 PID: 878 Comm: modprobe Tainted: G OE 5.15.2-1096.native #1
[ 53.172587] Hardware name: Gigabyte Technology Co., Ltd. Z370 UD3H/Z370 UD3H-CF, BIOS F4 07/05/2018
[ 53.181704] RIP: 0010:0xffffffff82297f80
[ 53.185728] Code: Unable to access opcode bytes at RIP 0xffffffff82297f56.
[ 53.192737] RSP: 0018:ffffafa3c232fd50 EFLAGS: 00010246
[ 53.198083] RAX: ffffffff82297f80 RBX: 0000000000000000 RCX: 0000000000000000
[ 53.205407] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffc050a031
[ 53.212644] RBP: ffffafa3c232fd58 R08: 0000000000000000 R09: 0000000000000000
[ 53.219906] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffc040a2fd
[ 53.227222] R13: ffff9c48be793130 R14: ffffffff8b828708 R15: 0000000000000000
[ 53.234458] FS: 00007fa3836fe740(0000) GS:ffff9c4baf540000(0000) knlGS:0000000000000000
[ 53.242684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 53.248534] CR2: ffffffff82297f56 CR3: 00000002ef58a006 CR4: 00000000003706e0
[ 53.255812] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 53.263093] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 53.270416] Call Trace:
[ 53.272869] ? prepare_tom_for_offload+0x20/0x1c0 [t4_tom]
[ 53.278450] t4_tom_init+0xc/0xd03 [t4_tom]
[ 53.282698] ? t4_init_listen_cpl_handlers+0x2c/0x2c [t4_tom]
[ 53.288608] do_one_initcall+0x43/0x200
[ 53.292541] ? __cond_resched+0x15/0x80
[ 53.296433] ? kmem_cache_alloc_trace+0x3b/0x400
[ 53.301087] do_init_module+0x5d/0x280
[ 53.304900] load_module+0x958/0xa00
[ 53.308533] __do_sys_finit_module+0xb2/0x140
[ 53.312970] __x64_sys_finit_module+0x13/0x40
[ 53.317406] do_syscall_64+0x3b/0xc0
[ 53.321065] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 53.326231] RIP: 0033:0x7fa3838327c0
[ 53.329879] Code: c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 2e 2e 2e 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 28 b6 0f 00 f7 d8 64 89 01 48
[ 53.348955] RSP: 002b:00007ffcccbf83d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 53.356711] RAX: ffffffffffffffda RBX: 0000555fe6f02d00 RCX: 00007fa3838327c0
[ 53.363939] RDX: 0000000000000000 RSI: 0000555fe54df93f RDI: 0000000000000004
[ 53.371235] RBP: 00007ffcccbf8420 R08: 0000000000000000 R09: 0000000000000000
[ 53.378516] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000000000
[ 53.385804] R13: 0000000000040000 R14: 0000555fe54df93f R15: 0000555fe6f02e30
[ 53.392989] Modules linked in: t4_tom(OE+) toecore(OE) ee1004 ppdev mei_hdcp intel_wmi_thunderbolt wmi_bmof mxm_wmi intel_tcc_cooling ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd e1000e soundcore i2c_i801 i2c_smbus joydev cxgb4(OE) wmi intel_pmc_core parport_pc parport thermal mei_me mei edac_core
[ 53.432335] CR2: ffffffff82297f80
[ 53.435717] ---[ end trace 59f991b80e76b2b7 ]---
[ 53.440413] RIP: 0010:0xffffffff82297f80
[ 53.444427] Code: Unable to access opcode bytes at RIP 0xffffffff82297f56.
[ 53.451462] RSP: 0018:ffffafa3c232fd50 EFLAGS: 00010246
[ 53.456777] RAX: ffffffff82297f80 RBX: 0000000000000000 RCX: 0000000000000000
[ 53.464065] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffc050a031
[ 53.471345] RBP: ffffafa3c232fd58 R08: 0000000000000000 R09: 0000000000000000
[ 53.478624] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffc040a2fd
[ 53.485878] R13: ffff9c48be793130 R14: ffffffff8b828708 R15: 0000000000000000
[ 53.493080] FS: 00007fa3836fe740(0000) GS:ffff9c4baf540000(0000) knlGS:0000000000000000
[ 53.501244] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 53.507086] CR2: ffffffff82297f56 CR3: 00000002ef58a006 CR4: 00000000003706e0
[ 53.514357] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 53.521646] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
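A dump this long is easier to read once the handful of lines that identify the fault are pulled out. The sketch below is one possible filter, not something the original session used: `oops_summary` is a hypothetical name, and the list of patterns is my own selection of markers that appear in this particular log.

```shell
#!/usr/bin/env bash
# Hypothetical filter: keep only the kernel log lines that name the fault.
# The pattern list is an assumption based on the markers seen in this log;
# extend it for other kinds of kernel errors.
oops_summary() {
    # "|| true" keeps the function from failing when nothing matches.
    grep -E 'BUG:|Oops:|#PF:|Tainted:|RIP: 0010:' || true
}

# Example (may need root): dmesg | oops_summary
```

Applied to the log above, this would keep the BUG line, the two #PF lines, the Oops line, the CPU/Tainted line, and the kernel-mode RIP lines.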
2.
The log above contains the following lines.
[ 53.136716] BUG: unable to handle page fault for address: ffffffff82297f80
[ 53.143701] #PF: supervisor instruction fetch in kernel mode
[ 53.149447] #PF: error_code(0x0010) - not-present page
[ 53.154664] PGD 176013067 P4D 176013067 PUD 176014063 PMD 0
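The error_code value in these lines is a bit field, and decoding it shows where the two #PF descriptions come from. The sketch below uses the x86 page-fault error code bits (bit 0 protection, bit 1 write, bit 2 user, bit 3 reserved bit, bit 4 instruction fetch, bit 5 protection keys). `decode_pf_error_code` is a hypothetical helper, and its output wording is my own approximation of the kernel's messages, not a copy of them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode an x86 page-fault error code.
# Bit meanings follow the x86 page-fault error code layout; the output
# wording only approximates what the kernel prints.
decode_pf_error_code() {
    local code=$(( $1 ))
    local who access reason
    # Bit 2: the access came from user mode (set) or supervisor mode (clear).
    if (( code & 0x04 )); then who="user"; else who="supervisor"; fi
    # Bit 4: instruction fetch; bit 1: write; otherwise a read.
    if (( code & 0x10 )); then access="instruction fetch"
    elif (( code & 0x02 )); then access="write access"
    else access="read access"; fi
    # Bit 0 clear: the page was not present at all.
    if (( (code & 0x01) == 0 )); then reason="not-present page"
    elif (( code & 0x08 )); then reason="reserved bit violation"
    elif (( code & 0x20 )); then reason="protection keys violation"
    else reason="permissions violation"; fi
    echo "$who $access, $reason"
}

decode_pf_error_code 0x0010   # prints: supervisor instruction fetch, not-present page
```

For 0x0010 only bit 4 is set: the CPU, running in supervisor mode, tried to fetch an instruction from an address with no page mapped. That is consistent with the RIP value in the log being the faulting address itself, though it does not by itself say why execution jumped there.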
Searching for this message turns up the following answer. Of course, it is not about the t4_tom module.
Getting a “BUG” exception taints the kernel. In this case it would have been tainted already though. G – there is a non GPL module loaded. D – Oops or Bug occurred. The process executing when the exception occurred was rsync (Comm name of process)
The supervisor read access in kernel mode message on such a page fault error is common. (Not saying I understand it… just that I’ve seen it lots)
It could be a bug in the kernel, or a module that’s loaded, or chipset/memory issues. I’d be less confident that it’s a module, as it looks like a page cache read that failed.
P.S. I’d probably try a different kernel first, and see if the problem occurs.
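(From "BUG: unable to handle page fault")
Since the answer points to a bug in the kernel or in a module, I tried a different version of Clear Linux and also the kernel used by RHEL 8, but the result was no different. In the end, it failed.
The remaining tests are Solarflare and Mellanox.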