时间复杂度与硬件现实：为何链表遍历可能比数组慢数倍？

时间复杂度并非万能，算法设计必须与硬件架构的现实相结合，而不是仅仅基于理想化的计算模型

引言

以下均为有感而发, 若有不正确的地方还请在评论区内纠正

想要撰写这篇文章是看到数学转码的相关讨论有感而发

ArrayList - LinkedList 时间复杂度简单介绍

以下是一个简化的表格，展示了数组（Array）和链表（Linked List）在各种操作情况下的时间复杂度：

操作	数组（Array）	链表（Linked List）
访问元素	O(1)	O(n)
插入（头部）	O(n)	O(1)
插入（尾部）	O(n)（满时）	O(1)（有尾指针）
插入（中间）	O(n)	O(n)
删除（头部）	O(n)	O(1)
删除（尾部）	O(n)	O(n)
删除（中间）	O(n)	O(n)
遍历	O(n)	O(n)

不难看出, 在插入工况下的LinkedList时间复杂度明显占优(或持平), 哪实际情况又会如何

实际情况

以下是实际操作得到的结果, 可以看出LinkedList慢的不只一点

1
2
3
4


r@r:/data/test/a-l-test$ gcc main.c -o main
r@r:/data/test/a-l-test$ ./main
ArrayList middle insert time: 0.066473 seconds
LinkedList middle insert time: 1.509272 seconds

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118


#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h> // For memmove

#define N 10000
#define INITIAL_SIZE 100000

// Simple dynamic array structure
typedef struct {
    int* data;
    size_t size;
    size_t capacity;
} ArrayList;

void arraylist_insert(ArrayList* arr, size_t index, int value) {
    if (arr->size >= arr->capacity) {
        // Basic resize (actual implementations are smarter)
        size_t new_capacity = arr->capacity == 0 ? 10 : arr->capacity * 2;
        int* new_data = (int*)realloc(arr->data, new_capacity * sizeof(int));
        if (!new_data) { perror("Failed to realloc"); exit(1); }
        arr->data = new_data;
        arr->capacity = new_capacity;
    }
    if (index > arr->size) index = arr->size; // Clamp index

    // Move elements: The O(n) part, but potentially cache-friendlier
    memmove(&arr->data[index + 1], &arr->data[index], (arr->size - index) * sizeof(int));

    arr->data[index] = value;
    arr->size++;
}

// LinkedList Node
typedef struct Node {
    int data;
    struct Node* next;
} Node;

// Basic linked list insert (find + insert)
void linkedlist_insert(Node** head_ref, size_t index, int value) {
    Node* newNode = (Node*)malloc(sizeof(Node));
    if (!newNode) { perror("Failed malloc"); exit(1); }
    newNode->data = value;

    if (index == 0 || *head_ref == NULL) {
        newNode->next = *head_ref;
        *head_ref = newNode;
        return;
    }

    // Traverse: The cache-unfriendly O(n) part
    Node* current = *head_ref;
    for (size_t i = 0; i < index - 1 && current->next != NULL; ++i) {
        current = current->next; // Potential cache miss here
    }

    // Insert (O(1) pointer ops + allocation cost)
    newNode->next = current->next;
    current->next = newNode;
}


int main() {
     // --- ArrayList ---
    ArrayList arr = {NULL, 0, 0};
    // Pre-allocate to focus on insertion cost, not just realloc
    arr.capacity = INITIAL_SIZE + N;
    arr.data = (int*)malloc(arr.capacity * sizeof(int));
     if (!arr.data) return 1;
    for(size_t i = 0; i < INITIAL_SIZE; ++i) arr.data[i] = (int)i;
    arr.size = INITIAL_SIZE;

    size_t insertIndexArr = INITIAL_SIZE / 2;
    clock_t start_arr = clock();
    for (int i = 0; i < N; ++i) {
        arraylist_insert(&arr, insertIndexArr, -i);
        // insertIndexArr++; // Simulate always inserting in the expanding middle
    }
    clock_t end_arr = clock();
    printf("ArrayList middle insert time: %f seconds\n",
           ((double)(end_arr - start_arr)) / CLOCKS_PER_SEC);
    free(arr.data);


    // --- LinkedList ---
    Node* head = NULL;
    Node* tail = NULL;
     for(int i = 0; i < INITIAL_SIZE; ++i) { // Initialize
        Node* newNode = (Node*)malloc(sizeof(Node));
        if (!newNode) return 1;
        newNode->data = i; newNode->next = NULL;
        if (!head) head = newNode;
        else tail->next = newNode;
        tail = newNode;
    }

    size_t insertIndexList = INITIAL_SIZE / 2;
    clock_t start_list = clock();
    for (int i = 0; i < N; ++i) {
        linkedlist_insert(&head, insertIndexList, -i);
         // insertIndexList++;
    }
    clock_t end_list = clock();
    printf("LinkedList middle insert time: %f seconds\n",
           ((double)(end_list - start_list)) / CLOCKS_PER_SEC);

    // ... free list ...
    Node* current_free = head;
    while(current_free != NULL){
        Node* next_free = current_free->next;
        free(current_free);
        current_free = next_free;
    }

    // Compare the times. Often, ArrayList is faster despite O(n) move.
    return 0;
}

解释

CPU：一个手脚麻利的大厨。
内存 (RAM)：一个巨大但遥远的仓库。
CPU 缓存：你手边的小工作台，容量小但拿东西飞快。
缓存行 (Cache Line)：仓库管理员一次性从货架上帮你搬到工作台的一小批（比如 16 个）连续的原材料。
缓存命中：你要的东西正好在工作台上，秒取。
缓存失效：你要的东西不在工作台，得等仓库管理员大老远送过来，很慢。
数组/Slice：仓库里一整排连续、整齐的货架。
链表：仓库里散落各处的箱子，每个箱子里除了货物，还有一张纸条写着下一个箱子的位置。

Array

数组/Slice (在货架中间插队):
- 操作步骤: 你想在货架第 k 个位置放个新食材。
  1. （如果货架满了，先经历一次上面说的“搬家扩容”）。
  2. 你得指挥仓库管理员，把从第 k 个位置开始到末尾的所有食材，都统一往后挪一个位置，腾出空当 (O(n-k) 的 memmove)。
  3. 把新食材放到第 k 个位置。
- 硬件现实: 挪东西是主要成本 (O(n))。但是，因为是在整齐的货架上操作（连续内存），管理员搬东西时可以比较有条理地批量进行 (memmove 优化得很好)，你的工作台（缓存）也能一定程度上帮上忙（比如预读要移动的数据块）。虽然累，但过程相对“顺畅”。

Linked

链表 (在第 k 个箱子后加个新箱子):
- 操作步骤:
  1. 灾难性的第一步：找到第 k-1 个箱子！ 你必须从第一个箱子开始，拿出里面的地址纸条，找到第二个箱子；再拿出第二个箱子的地址纸条，找到第三个…… 一步一步地在仓库里跑腿，直到找到第 k-1 个箱子。这是 O(k) 的查找。
  2. 找到后，再执行类似 Append 的操作：找新空箱子 (分配)，放食材，修改第 k-1 个箱子和新箱子的地址纸条，把它们“链接”起来 (理论 O(1) 的指针修改)。
- 硬件现实:
  1. 找箱子的过程是性能杀手！ 每一次 current = current.next 的跳转，都极大概率意味着你要访问的下一个箱子不在你的工作台（缓存）上，你得眼巴巴等着仓库管理员从老远的地方把箱子拿过来 (缓存失效 Cache Miss)。这个 O(k) 遍历的实际时间成本可能极其高昂，远超你的想象。
  2. O(1) 插入只是“最后一哆嗦”: 就算最后链接新箱子的步骤再快 (理论 O(1))，也无法弥补前面漫长而低效的查找时间。而且别忘了还有分配新箱子的开销。
- 总结 (链表 Insert): 理论上插入动作本身快 (O(1))，但必须建立在 O(k) 查找的基础上。这个查找因为缓存极不友好，导致总时间成本往往比数组 O(n) 的整体移动还要高得多。

简单总结

这就导致了明明时间复杂度更低, 但还是更慢的情况; 寻址这一操作消耗了大部分的时间

链表结构对于CPU而言是相当棘手的, 链表导致数据散落在内存各处, 难以被高效缓存; 而Array是整块内存, 对缓存更友好

再说明

接下来简单以读取举例一下

数组列表（Slice/Vector）：缓存的最爱

现在，让我们看看数组列表在硬件层面是如何运作的：

内存布局：数组列表的核心优势在于其内存连续性。元素 a[0], a[1], a[2], … 在物理内存中是紧挨着存放的。
遍历过程：当你访问 a[0] 时，CPU 发现缓存中没有，于是从主内存加载包含 a[0] 的整个 Cache Line。这个 Cache Line 很可能同时包含了 a[1], a[2], …, a[15]（假设一个元素 4 字节，64 / 4 = 16）。
缓存命中：当你接下来访问 a[1] 时，CPU 惊喜地发现它已经在高速缓存里了（Cache Hit）！访问 a[2]、a[3]… 直到这个 Cache Line 用完，都是极快的缓存访问。然后访问 a[16] 时会触发下一次主内存加载（加载下一个 Cache Line），但之后 a[17] 到 a[31] 又会命中缓存。
结果：遍历数组列表的过程，大部分时间都是在高速缓存上进行的。虽然理论上是 O(n) 次操作，但单次操作的实际成本非常低。

链表：缓存的噩梦

链表的遭遇则截然不同：

内存布局：链表的节点（包含数据和指向下一个节点的指针）在内存中是离散存储的。node1 可能在地址 1000，而它指向的 node2 可能在地址 58000，node3 又可能在地址 2500。它们的位置不可预测，彼此相距遥远。
遍历过程：当你访问第一个节点 node1 时，它被加载到缓存。然后，你需要读取 node1 中的指针，得到 node2 的地址。当你尝试访问 node2 时，极大概率它不在当前加载的任何 Cache Line 中。
缓存失效：CPU 不得不再次访问缓慢的主内存去获取 node2 所在的 Cache Line（Cache Miss）。访问 node3 时，同样的故事很可能再次上演。
结果：遍历链表的过程，充满了昂贵的缓存失效。CPU 大部分时间都在等待数据从主内存传输，而不是在执行计算。尽管理论上也是 O(n) 次操作，但单次操作的实际成本（由于缓存失效）远高于数组列表。这就是为何链表遍历在实践中常常慢得多的根本原因。

结语

本文并不是鼓吹什么时间复杂度无用论, 但在编程优化中不能理想化, 程序终究是运行在硬件之上

撰写本文的目的也是让那些脱离了底层硬件知识的空想派醒悟过来, 这套理论在其他场景也会适用

时间复杂度并非万能，算法设计必须与硬件架构的现实相结合，而不是仅仅基于理想化的计算模型

实际优化必须从多个角度入手, 团队协助也是如何, 即使是functional编程, 也最好进行整体bench和测试

时间复杂度在某个方面反应了这段代码的效率, 但只看时间复杂度是不可取的

引言#

ArrayList - LinkedList 时间复杂度简单介绍#

实际情况#

解释#

Array#

Linked#

简单总结#

再说明#

数组列表（Slice/Vector）：缓存的最爱#

链表：缓存的噩梦#

结语#

引言