# Hypergraph Convolution and Hyper Attention

# Introduction

- Authors: University of Oxford
- Contribution: proposed end-to-end trainable convolution and attention operators for hypergraphs.
- Paper

# What is Hypergraph

A simple graph's edge connects exactly two vertices, while a hypergraph's edge (a hyperedge) can connect more than two vertices, as shown in Fig. 1:

Let $\mathcal{G} = (V, E)$ be a hypergraph with $N$ vertices and $M$ hyperedges. Each hyperedge $\epsilon \in E$ is assigned a positive weight $W_{\epsilon\epsilon}$, with all the weights stored in a diagonal matrix $\mathbf{W} \in \mathbb{R}^{M \times M}$.

Unlike a simple graph, which is described by an adjacency matrix, a hypergraph uses an incidence matrix $\mathbf{H} \in \mathbb{R}^{N \times M}$ to represent its connections: $H_{i\epsilon} = 1$ if vertex $v_i$ is connected by hyperedge $\epsilon$, and $0$ otherwise. The vertex degree is then $$D_{ii} = \sum_{\epsilon=1}^{M} W_{\epsilon\epsilon} H_{i\epsilon}$$ and the hyperedge degree is $$B_{\epsilon\epsilon} = \sum_{i=1}^{N} H_{i\epsilon},$$ and the vertex features are collected in $\mathbf{X}$.
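These definitions can be checked on a tiny hypergraph. The sketch below (vertex/hyperedge choices and weights are my own illustrative values, not from the paper) builds $\mathbf{H}$, $\mathbf{W}$ and derives $\mathbf{D}$ and $\mathbf{B}$:

```python
import numpy as np

# Toy hypergraph with N = 4 vertices and M = 2 hyperedges:
# hyperedge e0 connects {v0, v1, v2}, hyperedge e1 connects {v2, v3}.
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]], dtype=float)   # incidence matrix, N x M

w = np.array([1.0, 2.0])              # positive hyperedge weights
W = np.diag(w)                        # diagonal weight matrix, M x M

D = np.diag(H @ w)                    # vertex degrees: D_ii = sum_e W_ee * H_ie
B = np.diag(H.sum(axis=0))            # hyperedge degrees: B_ee = sum_i H_ie
```

Here $v_2$, which sits in both hyperedges, gets degree $1 + 2 = 3$, while the hyperedge degrees simply count member vertices (3 and 2).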

# Hyper Conv

Assumptions:

- more propagation should happen between vertices connected by a common hyperedge
- hyperedges with larger weights deserve more confidence in propagation.

So one step of convolution can be: $$x _ { i } ^ { ( l + 1 ) } = \sigma \left( \sum _ { j = 1 } ^ { N } \sum _ { \epsilon = 1 } ^ { M } H _ { i \epsilon } H _ { j \epsilon } W _ { \epsilon \epsilon } x _ { j } ^ { ( l ) } \mathbf { P } \right)$$

Written in matrix form: $$\mathbf { X } ^ { ( l + 1 ) } = \sigma \left( \mathbf { H } \mathbf { W } \mathbf { H } ^ { \mathrm { T } } \mathbf { X } ^ { ( l ) } \mathbf { P } \right)$$
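A quick sanity check (my own, with arbitrary random sizes) that the element-wise double sum and the matrix form agree, omitting the pointwise nonlinearity $\sigma$ since it is applied identically to both:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random incidence matrix, weights, features, and projection.
N, M, F = 4, 3, 2
H = (rng.random((N, M)) < 0.5).astype(float)
W = np.diag(rng.random(M) + 0.1)
X = rng.standard_normal((N, F))
P = rng.standard_normal((F, F))

# Matrix form: H W H^T X P
matrix_form = H @ W @ H.T @ X @ P

# Element-wise form: sum over j and epsilon of H_ie H_je W_ee x_j, then P
elementwise = np.zeros((N, F))
for i in range(N):
    acc = np.zeros(F)
    for j in range(N):
        for e in range(M):
            acc += H[i, e] * H[j, e] * W[e, e] * X[j]
    elementwise[i] = acc @ P
```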

However, $\mathbf { H } \mathbf { W } \mathbf { H } ^ { \mathrm { T } }$ does not have a constrained spectral radius, which means the scale of $\mathbf{X}^{(l)}$ (its distribution, in my understanding) changes after each layer. This leads to numerical instabilities and increases the risk of exploding/vanishing gradients when multiple hypergraph convolution layers are stacked. So the authors apply a symmetric normalization: $$\mathbf { X } ^ { ( l + 1 ) } = \sigma \left( \mathbf { D } ^ { - 1 / 2 } \mathbf { H } \mathbf { W } \mathbf { B } ^ { - 1 } \mathbf { H } ^ { \mathrm { T } } \mathbf { D } ^ { - 1 / 2 } \mathbf { X } ^ { ( l ) } \mathbf { P } \right)$$
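A minimal NumPy sketch of one normalized convolution step, assuming $\tanh$ as the nonlinearity and random values for the learnable projection $\mathbf{P}$ (both are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def hypergraph_conv(X, H, w, P, sigma=np.tanh):
    """One normalized hypergraph convolution step:
    sigma(D^{-1/2} H W B^{-1} H^T D^{-1/2} X P)."""
    W = np.diag(w)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(H @ w))   # D^{-1/2}, vertex degrees
    B_inv = np.diag(1.0 / H.sum(axis=0))         # B^{-1}, hyperedge degrees
    A = D_inv_sqrt @ H @ W @ B_inv @ H.T @ D_inv_sqrt
    return sigma(A @ X @ P)

# Toy hypergraph: 4 vertices, 2 hyperedges ({v0,v1,v2} and {v2,v3}).
H = np.array([[1, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
w = np.array([1.0, 2.0])
X = rng.standard_normal((4, 3))    # N x F input features
P = rng.standard_normal((3, 3))    # F x F' learnable projection

X_next = hypergraph_conv(X, H, w, P)
```

The normalized operator is similar to a row-stochastic random-walk matrix, so its spectral radius stays bounded by 1 and repeated application no longer blows up the feature scale.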

# Hyper Attention

The incidence matrix $\mathbf{H}$ can be viewed as a kind of attention matrix, but it is fixed once defined and is neither learnable nor trainable.

So the authors propose hypergraph attention. However, hypergraph attention is only feasible when the vertex set and the hyperedge set are from (or can be projected to) the same homogeneous domain, since only in this case are the similarities between vertices and hyperedges directly comparable. The authors therefore choose a simple construction: each vertex collects its $K$ nearest neighbors to form a hyperedge. The attention score is then: $$H _ { i j } = \frac { \exp \left( \sigma \left( \operatorname { sim } \left( x _ { i } \mathbf { P } , x _ { j } \mathbf { P } \right) \right) \right) } { \sum _ { k \in \mathcal { N } _ { i } } \exp \left( \sigma \left( \operatorname { sim } \left( x _ { i } \mathbf { P } , x _ { k } \mathbf { P } \right) \right) \right) }$$ where $\operatorname{sim}(\cdot)$ is a similarity function that computes the pairwise similarity between two vertices.
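A sketch of this KNN-based attention matrix. Assumptions of mine, not fixed by the note: dot-product as $\operatorname{sim}(\cdot)$, $\tanh$ as $\sigma$, Euclidean distance for the neighbor search, and each vertex's hyperedge including the vertex itself (so $M = N$ and $\mathbf{H}$ is $N \times N$):

```python
import numpy as np

rng = np.random.default_rng(1)

def attention_incidence(X, P, k, sigma=np.tanh):
    """Soft incidence matrix: vertex i's hyperedge holds its k nearest
    neighbors (plus itself); row i is a softmax over sim(x_i P, x_j P)."""
    Z = X @ P                                        # projected features
    N = Z.shape[0]
    dist = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
    H = np.zeros((N, N))                             # one hyperedge per vertex
    for i in range(N):
        nbrs = np.argsort(dist[i])[:k + 1]           # k neighbors + i itself
        scores = sigma(Z[nbrs] @ Z[i])               # sim = dot product
        H[i, nbrs] = np.exp(scores) / np.exp(scores).sum()
    return H

X = rng.standard_normal((6, 4))
P = rng.standard_normal((4, 4))
H = attention_incidence(X, P, k=2)
```

Unlike the binary incidence matrix, each row of this $\mathbf{H}$ is a probability distribution over the vertex's neighborhood and is differentiable with respect to $\mathbf{P}$, which is what makes the attention end-to-end trainable.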

# What Has Not Been Done

- Hypergraph attention on heterogeneous domains.
- Why not treat a hyperedge as a fully connected subgraph? Would that be better or worse than GCN+Pooling or HGN?