
Matrix Multiplication in Python with NumPy, PyTorch & TensorFlow

Hi everyone! In this post we will learn matrix multiplication from first principles and implement it in Python using NumPy, PyTorch, and TensorFlow. The guide includes a short introduction, a table of contents, and line-by-line explanations of every code snippet so you can copy, run, and understand each implementation.

Table of Contents

  1. Introduction — Why matrix multiplication matters (SEO-friendly)

  2. Matrix multiplication: quick definition and rules

  3. Worked numeric example — computed step-by-step

  4. Implementations overview — NumPy, PyTorch, TensorFlow

  5. NumPy implementation — code + line-by-line explanation

  6. PyTorch implementation (CPU) — code + line-by-line explanation

  7. PyTorch implementation (GPU) — code + line-by-line explanation

  8. TensorFlow implementation — code + line-by-line explanation

  9. Common errors and how to fix them

  10. Best practices for performance and stability

  11. Conclusion — recap and next steps

1. Introduction — Why matrix multiplication matters (SEO-friendly)

Matrix multiplication is the backbone of linear algebra and an essential operation for fields like machine learning, data science, computer graphics, physics simulations, and scientific computing. If you’re building neural networks, implementing transformations in 3D graphics, solving systems of linear equations, or working with Markov chains, you will use matrix multiplication repeatedly.

This guide is built for developers, data scientists, and students who want:

  • A clear conceptual understanding of matrix multiplication,

  • Working code for NumPy, PyTorch, and TensorFlow,

  • Line-by-line explanations of all code so you can learn by reading and running.

2. Matrix multiplication: quick definition and rules

Definition: If $A$ is an $m \times n$ matrix and $B$ is an $n \times q$ matrix, then the product $C = AB$ is an $m \times q$ matrix whose entries are

$$c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj}.$$

That is, the element at row $i$ and column $j$ of $C$ equals the dot product of row $i$ of $A$ with column $j$ of $B$.
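To make the formula concrete, here is a minimal pure-Python sketch that follows the definition term by term. The function name matmul_naive and the small test matrices are illustrative choices, not part of any library, and this is meant for understanding rather than speed.

```python
def matmul_naive(A, B):
    """Multiply two matrices given as lists of rows, straight from the definition."""
    m, n = len(A), len(A[0])
    n_b, q = len(B), len(B[0])
    if n != n_b:
        # Inner dimensions must match for the product to be defined.
        raise ValueError(f"inner dimensions differ: {n} vs {n_b}")
    C = [[0] * q for _ in range(m)]
    for i in range(m):
        for j in range(q):
            # c_ij = sum over k of a_ik * b_kj
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    return C


X = [[1, 2], [3, 4]]
Y = [[0, 1], [1, 0]]
result = matmul_naive(X, Y)
print(result)  # [[2, 1], [4, 3]]
```

Multiplying by Y swaps the two columns of X, which is easy to check by hand against the sum in the definition.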

Shape rule (must-check):

  • Inner dimensions must match: $A\:(m \times n)$ and $B\:(n \times q)$.

  • Resulting shape: $C\:(m \times q)$.

Important properties (short):

  • Associative: $(AB)C = A(BC)$.

  • Distributive over addition.

  • Not commutative in general: $AB \ne BA$.

  • Transpose reverses order: $(AB)^T = B^T A^T$.
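These properties can be spot-checked numerically. The sketch below uses small random integer matrices (so the comparisons are exact) with an arbitrary fixed seed; the variable names are illustrative. A passing check on one example is not a proof, only a sanity test.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
A = rng.integers(-5, 5, size=(2, 3))
B = rng.integers(-5, 5, size=(3, 4))
C = rng.integers(-5, 5, size=(4, 2))
D = rng.integers(-5, 5, size=(3, 4))

# Associativity: (AB)C == A(BC)
assoc = np.array_equal((A @ B) @ C, A @ (B @ C))
# Distributivity over addition: A(B + D) == AB + AD
distrib = np.array_equal(A @ (B + D), A @ B + A @ D)
# Transpose reverses the order: (AB)^T == B^T A^T
transpose = np.array_equal((A @ B).T, B.T @ A.T)

# A simple counterexample to commutativity
P = np.array([[1, 2], [3, 4]])
Q = np.array([[0, 1], [1, 0]])
commutes = np.array_equal(P @ Q, Q @ P)

print(assoc, distrib, transpose, commutes)  # True True True False
```

Here P @ Q swaps the columns of P while Q @ P swaps its rows, so the two products differ.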

3. Worked numeric example — computed step-by-step

We’ll use a small, concrete example that we’ll also use to verify the Python implementations.

Let

$$A = \begin{bmatrix}1 & 2 & 3 \\[4pt] 4 & 5 & 6\end{bmatrix} \quad (2\times3), \qquad B = \begin{bmatrix}7 & 8 \\[4pt] 9 & 10 \\[4pt] 11 & 12\end{bmatrix} \quad (3\times2).$$

Because $A$ is $2\times3$ and $B$ is $3\times2$, the product $C = AB$ is defined and will have shape $2\times2$.

Compute entries explicitly using the row·column dot product:

  • $c_{11} = 1\cdot 7 + 2\cdot 9 + 3\cdot 11$.
    Compute step-by-step:
    $1\cdot7 = 7$, $2\cdot9 = 18$, $3\cdot11 = 33$. Sum: $7 + 18 + 33 = 58$.

  • $c_{12} = 1\cdot 8 + 2\cdot 10 + 3\cdot 12$.
    $1\cdot8 = 8$, $2\cdot10 = 20$, $3\cdot12 = 36$. Sum: $8 + 20 + 36 = 64$.

  • $c_{21} = 4\cdot 7 + 5\cdot 9 + 6\cdot 11$.
    $4\cdot7 = 28$, $5\cdot9 = 45$, $6\cdot11 = 66$. Sum: $28 + 45 + 66 = 139$.

  • $c_{22} = 4\cdot 8 + 5\cdot 10 + 6\cdot 12$.
    $4\cdot8 = 32$, $5\cdot10 = 50$, $6\cdot12 = 72$. Sum: $32 + 50 + 72 = 154$.

So the result:

$$AB = \begin{bmatrix}58 & 64 \\[4pt] 139 & 154 \end{bmatrix}.$$

We’ll verify this exact result with each library.

4. Implementations overview — NumPy, PyTorch, TensorFlow

All three libraries provide optimized matrix multiplication routines:

  • NumPy: A @ B, np.dot(A, B), np.matmul(A, B). Great for CPU numerical work; typically backed by BLAS/LAPACK.

  • PyTorch: A @ B, torch.matmul(A, B); supports GPU tensors (.to(device) or device= on creation) and autograd.

  • TensorFlow (2.x): A @ B, tf.matmul(A, B); eager mode by default makes code look similar to NumPy.

Below we present the canonical code for each library, then explain every line in detail so you understand what each part does.

5. NumPy implementation — code + line-by-line explanation

NumPy code (complete)

import numpy as np

# Define two matrices
A = np.array([[1, 2, 3],
              [4, 5, 6]])   # Shape (2,3)

B = np.array([[7, 8],
              [9, 10],
              [11, 12]])    # Shape (3,2)

# Matrix multiplication
C = A @ B   # or np.dot(A, B) or np.matmul(A, B)

print("NumPy Result:\n", C)

Line-by-line explanation

  1. import numpy as np

    • Imports the NumPy library and gives it the alias np. NumPy provides the ndarray type and many numerical routines. Using an alias (np) is standard and concise.

  2. # Define two matrices

    • A comment describing the next statements. Comments are for human readers and ignored by Python.

  3. A = np.array([[1, 2, 3], [4, 5, 6]])

    • Calls np.array with a nested Python list to create a 2D NumPy array.

    • The created array A has shape (2, 3). Because the inputs are Python integers, NumPy uses its default integer dtype, which is int64 on most 64-bit platforms but is platform dependent. You can check A.shape and A.dtype to confirm.

  4. B = np.array([[7, 8], [9, 10], [11, 12]])

    • Creates the second array B with shape (3, 2).

  5. C = A @ B

    • Uses Python’s matrix multiplication operator @ (PEP 465). For 2-D arrays this does standard matrix multiplication, so this line computes the 2×2 matrix C. For floating-point arrays NumPy usually hands the work to an optimized BLAS routine; integer arrays like the ones here go through NumPy's own compiled loops instead. Alternatives: C = np.dot(A, B) or C = np.matmul(A, B).

  6. print("NumPy Result:\n", C)

    • Prints the result to standard output. You should see:

      NumPy Result:
      [[ 58  64]
       [139 154]]
      

Extra NumPy notes

  • If you prefer floating point values (common in ML), create arrays with dtype=np.float32 or np.float64:
    A = np.array(..., dtype=np.float32).

  • For batched matrix multiplication use np.matmul on arrays with rank > 2; NumPy will follow broadcasting rules.
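As a rough illustration of both notes, the sketch below builds a float32 batch of four matrices and multiplies each by the same B. The scaling trick used to build the batch is just a convenient way to get distinct matrices; the names scales, A_batch, and C_batch are illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)       # shape (2, 3)
B = np.array([[7, 8], [9, 10], [11, 12]], dtype=np.float32)  # shape (3, 2)

# Build a batch of 4 matrices: 0*A, 1*A, 2*A, 3*A  -> shape (4, 2, 3)
scales = np.arange(4, dtype=np.float32).reshape(4, 1, 1)
A_batch = scales * A

# B has no batch dimension, so it is broadcast across the batch -> shape (4, 2, 2)
C_batch = np.matmul(A_batch, B)

print(C_batch.shape, C_batch.dtype)  # (4, 2, 2) float32
print(C_batch[1])                    # same values as A @ B
```

The leading dimension is treated as the batch axis, and only the last two dimensions take part in each matrix product.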

6. PyTorch implementation (CPU) — code + line-by-line explanation

PyTorch code (CPU)

import torch

A = torch.tensor([[1, 2, 3],
                  [4, 5, 6]], dtype=torch.float32)   # shape (2,3)

B = torch.tensor([[7, 8],
                  [9, 10],
                  [11, 12]], dtype=torch.float32)    # shape (3,2)

C = torch.matmul(A, B)   # or A @ B

print("PyTorch Result:\n", C)

Line-by-line explanation

  1. import torch

    • Imports the PyTorch library. Typical aliasing like import torch as th is possible, but plain torch is standard.

  2. A = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)

    • Constructs a torch.Tensor from a Python nested list. The dtype=torch.float32 argument ensures the tensor uses 32-bit floating point numbers, which is typical for gradient-based optimization (neural nets). If dtype is omitted, PyTorch infers it from the data: integer literals like these would give torch.int64, while Python floats give the default float dtype (torch.float32 unless you change it).

  3. B = torch.tensor([[7, 8], [9, 10], [11, 12]], dtype=torch.float32)

    • Constructs the second tensor B with shape (3, 2).

  4. C = torch.matmul(A, B)

    • Computes matrix multiplication. torch.matmul supports various input ranks: for 2-D tensors it does matrix multiply; for 3-D it does batched matmul; for general ranks it follows broadcasting rules. A @ B is equivalent to torch.matmul(A, B).

  5. print("PyTorch Result:\n", C)

    • Prints the tensor. Example output:

      PyTorch Result:
      tensor([[ 58.,  64.],
              [139., 154.]])
      
    • Note the tensor(...) wrapper and decimal points since dtype is float32.

PyTorch specifics (CPU)

  • PyTorch tensors can be converted to NumPy arrays via C.numpy() if C is on CPU. If C is on GPU, you must .cpu() first.

  • PyTorch supports autograd (automatic differentiation) — if A and B require gradients, operations will populate the computation graph.
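A small sketch of the autograd point, assuming a scalar loss defined as the sum of all entries of A @ B (an arbitrary choice made only so that backward() can be called):

```python
import torch

# requires_grad=True asks PyTorch to track operations on these tensors
A = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad=True)
B = torch.tensor([[7., 8.], [9., 10.], [11., 12.]], requires_grad=True)

loss = (A @ B).sum()  # scalar: 58 + 64 + 139 + 154
loss.backward()       # fills A.grad and B.grad

print(loss.item())  # 415.0
print(A.grad)       # every row is [15., 19., 23.], the row sums of B
print(B.grad)       # row k is the k-th column sum of A: [5., 5.], [7., 7.], [9., 9.]
```

For this particular loss the gradients have a closed form: the gradient with respect to A is a matrix of ones times B transposed, and the gradient with respect to B is A transposed times a matrix of ones, which matches the printed values.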

7. PyTorch implementation (GPU) — code + line-by-line explanation

If you have a CUDA-capable GPU and a CUDA-enabled PyTorch build, matmul can run on the GPU, which can give large speedups for big matrices.

PyTorch code (GPU-aware)

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

A = torch.tensor([[1,2,3],[4,5,6]], dtype=torch.float32, device=device)
B = torch.tensor([[7,8],[9,10],[11,12]], dtype=torch.float32, device=device)

C = A @ B   # GPU-accelerated if device is CUDA

# Move to CPU for printing (if needed):
print("PyTorch Result (from device):\n", C.cpu().numpy())

Line-by-line explanation

  1. import torch

    • Import PyTorch.

  2. device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    • Checks if CUDA is available. If yes, device will be the GPU; otherwise CPU. This pattern allows the same code to run on both GPU and CPU seamlessly.

  3. A = torch.tensor(..., device=device) and B = ...

    • Construct tensors directly on the chosen device. Creating tensors on the target device avoids an extra transfer (.to(device)) call.

  4. C = A @ B

    • Performs matrix multiplication. If device is CUDA, PyTorch dispatches to GPU kernels (typically cuBLAS) for the matmul.

  5. C.cpu().numpy()

    • If you want to inspect results in Python (NumPy), move the tensor back to CPU (.cpu()) and convert to a NumPy array (.numpy()). Direct .numpy() on a CUDA tensor raises an error.

Practical GPU tips

  • For small matrices, GPU may be slower due to transfer and kernel launch overhead. Use sufficiently large or batched multiplications to amortize overhead.

  • For maximum throughput, minimize CPU↔GPU transfers inside loops.

8. TensorFlow implementation — code + line-by-line explanation

TensorFlow code (TF 2.x — eager mode)

import tensorflow as tf

A = tf.constant([[1, 2, 3],
                 [4, 5, 6]], dtype=tf.float32)   # shape (2,3)

B = tf.constant([[7, 8],
                 [9, 10],
                 [11, 12]], dtype=tf.float32)    # shape (3,2)

C = tf.matmul(A, B)   # or A @ B

print("TensorFlow Result:\n", C.numpy())   # .numpy() to get a NumPy array

Line-by-line explanation

  1. import tensorflow as tf

    • Imports TensorFlow; tf is the standard alias.

  2. A = tf.constant([...], dtype=tf.float32)

    • Creates a tf.Tensor using tf.constant. Specifying dtype=tf.float32 makes it a 32-bit float tensor.

  3. B = tf.constant([...], dtype=tf.float32)

    • Creates the second tensor.

  4. C = tf.matmul(A, B)

    • Performs matrix multiplication. In TensorFlow 2 (eager mode) this runs immediately and returns a tf.Tensor.

  5. C.numpy()

    • Converts the eager tf.Tensor to a NumPy array for display or CPU-side processing.

TensorFlow GPU behavior

  • TensorFlow automatically places ops on GPU if available and if the tensor/device placement allows it. For most typical setups, tf.matmul will use GPU-accelerated kernels without extra code.
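If you want to confirm where the computation actually ran, a quick check along these lines may help. It only reads the list of visible GPUs and the device attribute of the result, and it works on CPU-only machines too.

```python
import tensorflow as tf

# Empty list means TensorFlow sees no GPU and will run on CPU
gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to TensorFlow:", gpus)

A = tf.constant([[1, 2, 3], [4, 5, 6]], dtype=tf.float32)
B = tf.constant([[7, 8], [9, 10], [11, 12]], dtype=tf.float32)
C = tf.matmul(A, B)

# Device string of the result tensor; it ends in CPU:0 or GPU:0 on typical setups
print("Result lives on:", C.device)
print(C.numpy())
```

The numeric result should be the same on either device for this small example.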

9. Common errors and how to fix them

Below are typical problems you may encounter and concrete fixes.

1. Shape mismatch (inner dimensions don't match)

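  • Symptom: an error at the multiplication line complaining about shapes: a ValueError in NumPy, a RuntimeError in PyTorch, or an InvalidArgumentError in TensorFlow.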

  • Fix: Check shapes by printing A.shape and B.shape. If needed, transpose (B.T) or reshape so inner dimensions match. Example: (2,3) @ (3,2) works; (2,3) @ (2,3) does not.

2. Dtype mismatch (PyTorch)

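  • Symptom: a RuntimeError such as “expected m1 and m2 to have the same dtype” (the exact wording varies between PyTorch versions). torch.matmul does not silently promote mixed dtypes.

  • Fix: Convert one tensor so both share a dtype, for example A = A.float(), B = B.double(), or B = B.to(A.dtype).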

3. Trying .numpy() on a CUDA tensor (PyTorch)

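  • Symptom: a TypeError such as “can't convert cuda:0 device type tensor to numpy”. A tensor that requires grad raises a separate error asking you to detach it first.

  • Fix: Move the tensor to CPU first: C.cpu().numpy(). If it also requires grad, use C.detach().cpu().numpy().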

4. Using * instead of @ or matmul

  • Symptom: unexpected elementwise multiplication results.

  • Fix: Use @ (recommended) or np.matmul, torch.matmul, or tf.matmul for matrix multiplication.

5. Performance issues on GPU

  • Symptom: GPU code slower than CPU for small arrays.

  • Fix: Batch operations or increase matrix size; reduce CPU↔GPU transfers; profile to find bottlenecks.
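Errors 1 and 4 are easy to reproduce with NumPy. The sketch below is only an illustration: it triggers the shape mismatch on purpose, fixes it with a transpose, and shows how * differs from @. The variable names are illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
B = np.array([[7, 8], [9, 10], [11, 12]])   # shape (3, 2)

# Error 1: (2,3) @ (2,3) has mismatched inner dimensions
try:
    A @ A
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print("shape error:", exc)

# Fix: transpose the right operand so the inner dimensions match
fixed = A @ A.T   # (2,3) @ (3,2) -> (2,2)
print(fixed)      # [[14 32] [32 77]]

# Error 4: * multiplies element by element and keeps the (2,3) shape
elementwise = A * A
print(elementwise)  # [[ 1  4  9] [16 25 36]]

# The intended matrix product
print(A @ B)        # [[ 58  64] [139 154]]
```

Printing the shapes before the failing line is usually the fastest way to spot which operand needs a transpose or reshape.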

10. Best practices for performance and stability

  • Use library matmul (NumPy/PyTorch/TensorFlow). Don’t write triple nested loops in Python — these are much slower.

  • Pick correct dtype: float32 is standard for deep learning; use float64 for high-precision needs.

  • Batch your work: GPUs are designed for throughput; combine many small multiplies into a batched matmul (rank-3 tensors) for efficiency.

  • Minimize transfers: Move data to GPU once, do many ops, then move results back once.

  • Profile: Use torch.profiler or TensorFlow profiling tools to identify hotspots.

  • Check conditioning: Matrix multiplication is stable, but operations like inversion or solving linear systems following matmul can be ill-conditioned — watch for numerical issues.
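To get a feel for the dtype and conditioning points, here is a hedged NumPy sketch. The matrix size, the seed, and the nearly singular matrix M are arbitrary choices for illustration, and the exact numbers printed will vary with your NumPy and BLAS build.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
X = rng.standard_normal((200, 200))
Y = rng.standard_normal((200, 200))

# Same product computed in float64 (reference) and float32
ref = X @ Y
approx = X.astype(np.float32) @ Y.astype(np.float32)
max_err = np.max(np.abs(ref - approx))
print("max abs difference, float32 vs float64:", max_err)

# A nearly singular matrix: its two rows are almost identical
M = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-10]])
cond = np.linalg.cond(M)
print("condition number of M:", cond)
```

A small but nonzero difference in the first check is expected and usually harmless for deep learning. A very large condition number in the second check is a warning that inverting M or solving a system with it can amplify rounding errors considerably.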

11. Conclusion — recap and next steps

Matrix multiplication is a simple concept with deep consequences. In this guide you learned:

  • The definition and shape rules for matrix multiplication.

  • A step-by-step numeric example (with arithmetic shown).

  • Practical, copy-and-run implementations in NumPy, PyTorch (CPU & GPU), and TensorFlow — with thorough line-by-line explanations for every line of code.

  • Common errors, fixes, and performance best practices.

Next steps you can take:

  • Try the examples locally and verify the numeric output matches the worked example ([[58, 64], [139, 154]]).

  • Experiment with batched matmul for larger workloads (e.g., batch x m x n times batch x n x q).

  • Benchmark CPU vs GPU for matrices of varying sizes to see when GPU becomes beneficial.

  • Explore applications: build a small fully-connected neural network to see matmul in action or use matmul to transform coordinates in a graphics demo.
