Hi guys! In this blog we will learn about matrix multiplication in Python with NumPy, PyTorch, and TensorFlow. We'll build the concept from first principles, work through a concrete numeric example, and then implement it in each library, with line-by-line explanations of every code snippet so you can copy, run, and understand each implementation.
Table of Contents
- Introduction — why matrix multiplication matters
- Matrix multiplication: quick definition and rules
- Worked numeric example — computed step-by-step
- Implementations overview — NumPy, PyTorch, TensorFlow
- NumPy implementation — code + line-by-line explanation
- PyTorch implementation (CPU) — code + line-by-line explanation
- PyTorch implementation (GPU) — code + line-by-line explanation
- TensorFlow implementation — code + line-by-line explanation
- Common errors and how to fix them
- Best practices for performance and stability
- Conclusion — recap and next steps
1. Introduction — Why matrix multiplication matters
Matrix multiplication is the backbone of linear algebra and an essential operation for fields like machine learning, data science, computer graphics, physics simulations, and scientific computing. If you’re building neural networks, implementing transformations in 3D graphics, solving systems of linear equations, or working with Markov chains, you will use matrix multiplication repeatedly.

This guide is built for developers, data scientists, and students who want:
- A clear conceptual understanding of matrix multiplication,
- Working code for NumPy, PyTorch, and TensorFlow,
- Line-by-line explanations of all code so you can learn by reading and running.
2. Matrix multiplication: quick definition and rules
Definition: If $A$ is an $m \times n$ matrix and $B$ is an $n \times p$ matrix, then the product $C = AB$ is an $m \times p$ matrix whose entries are

$$c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj}.$$

That is, the element at row $i$ and column $j$ of $C$ equals the dot product of row $i$ of $A$ with column $j$ of $B$.
Shape rule (must-check):
- Inner dimensions must match: $A$ has shape $(m, n)$ and $B$ has shape $(n, p)$.
- Resulting shape: $(m, p)$.
Important properties (short):
- Associative: $(AB)C = A(BC)$.
- Distributive over addition: $A(B + C) = AB + AC$.
- Not commutative in general: $AB \neq BA$ (see the sketch below).
- Transpose reverses order: $(AB)^T = B^T A^T$.
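As a quick sanity check on non-commutativity, here is a minimal NumPy sketch (the values are purely illustrative):

import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

print(A @ B)   # [[2 1], [4 3]]: multiplying by B on the right swaps A's columns
print(B @ A)   # [[3 4], [1 2]]: multiplying by B on the left swaps A's rows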
3. Worked numeric example — computed step-by-step
We’ll use a small, concrete example that we’ll also use to verify the Python implementations.
Let

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \qquad B = \begin{bmatrix} 7 & 8 \\ 9 & 10 \\ 11 & 12 \end{bmatrix}.$$

Because $A$ is $2 \times 3$ and $B$ is $3 \times 2$, the product $C = AB$ is defined and will have shape $2 \times 2$.
Compute the entries explicitly using the row·column dot product:
- $c_{11} = 1 \cdot 7 + 2 \cdot 9 + 3 \cdot 11 = 7 + 18 + 33 = 58$
- $c_{12} = 1 \cdot 8 + 2 \cdot 10 + 3 \cdot 12 = 8 + 20 + 36 = 64$
- $c_{21} = 4 \cdot 7 + 5 \cdot 9 + 6 \cdot 11 = 28 + 45 + 66 = 139$
- $c_{22} = 4 \cdot 8 + 5 \cdot 10 + 6 \cdot 12 = 32 + 50 + 72 = 154$
So the result:

$$C = AB = \begin{bmatrix} 58 & 64 \\ 139 & 154 \end{bmatrix}.$$
We’ll verify this exact result with each library.
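To make the definition concrete before reaching for library routines, here is a minimal pure-Python sketch of the triple loop (for understanding only; in practice use the library matmul routines shown below):

# Triple-loop matmul straight from the definition: C[i][j] = sum_k A[i][k] * B[k][j]
A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]

m, n, p = len(A), len(B), len(B[0])
C = [[0] * p for _ in range(m)]
for i in range(m):            # rows of A
    for j in range(p):        # columns of B
        for k in range(n):    # inner dimension
            C[i][j] += A[i][k] * B[k][j]

print(C)  # [[58, 64], [139, 154]]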
4. Implementations overview — NumPy, PyTorch, TensorFlow
All three libraries provide optimized matrix multiplication routines:
- NumPy: A @ B, np.dot(A, B), or np.matmul(A, B). Great for CPU numerical work; typically backed by BLAS/LAPACK.
- PyTorch: A @ B or torch.matmul(A, B); supports GPU tensors (.to(device) or device= on creation) and autograd.
- TensorFlow (2.x): A @ B or tf.matmul(A, B); eager mode by default makes code look similar to NumPy.
Below we present the canonical code for each library, then explain every line in detail so you understand what each part does.
5. NumPy implementation — code + line-by-line explanation
NumPy code (complete)
import numpy as np
# Define two matrices
A = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])     # shape (3, 2)
# Matrix multiplication
C = A @ B # or np.dot(A, B) or np.matmul(A, B)
print("NumPy Result:\n", C)
Line-by-line explanation
- import numpy as np
  - Imports the NumPy library and gives it the alias np. NumPy provides the ndarray type and many numerical routines. Using an alias (np) is standard and concise.
- # Define two matrices
  - A comment describing the next statements. Comments are for human readers and ignored by Python.
- A = np.array([[1, 2, 3], [4, 5, 6]])
  - Calls np.array with a nested Python list to create a 2D NumPy array.
  - The created array A has shape (2, 3) and, because the inputs are integers, NumPy infers dtype int64 (platform dependent). You can check A.shape and A.dtype to confirm.
- B = np.array([[7, 8], [9, 10], [11, 12]])
  - Creates the second array B with shape (3, 2).
- C = A @ B
  - Uses Python’s matrix multiplication operator @ (PEP 465). For 2-D arrays this does standard matrix multiplication. This line computes the 2×2 matrix C using an optimized routine (often BLAS under the hood). Alternatives: C = np.dot(A, B) or C = np.matmul(A, B).
- print("NumPy Result:\n", C)
  - Prints the result to standard output. You should see:
    NumPy Result:
     [[ 58  64]
     [139 154]]
Extra NumPy notes
- If you prefer floating point values (common in ML), create arrays with dtype=np.float32 or np.float64: A = np.array(..., dtype=np.float32).
- For batched matrix multiplication use np.matmul on arrays with rank > 2; NumPy will follow broadcasting rules, as in the sketch below.
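Here is a minimal batched-matmul sketch (the shapes are chosen purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2, 3))   # a batch of 4 matrices, each 2x3
B = rng.standard_normal((4, 3, 5))   # a batch of 4 matrices, each 3x5
C = np.matmul(A, B)                  # one 2x5 product per batch entry
print(C.shape)                       # (4, 2, 5)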
6. PyTorch implementation (CPU) — code + line-by-line explanation
PyTorch code (CPU)
import torch
A = torch.tensor([[1, 2, 3],
                  [4, 5, 6]], dtype=torch.float32)   # shape (2, 3)
B = torch.tensor([[7, 8],
                  [9, 10],
                  [11, 12]], dtype=torch.float32)    # shape (3, 2)
C = torch.matmul(A, B) # or A @ B
print("PyTorch Result:\n", C)
Line-by-line explanation
- import torch
  - Imports the PyTorch library. Aliasing like import torch as th is possible, but plain torch is standard.
- A = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)
  - Constructs a torch.Tensor from a nested Python list. The dtype=torch.float32 argument ensures the tensor uses 32-bit floating point numbers. Floats are typical for gradient-based optimization (neural nets). If dtype is not given, PyTorch infers a default numeric dtype from the data.
- B = torch.tensor([[7, 8], [9, 10], [11, 12]], dtype=torch.float32)
  - Constructs the second tensor B with shape (3, 2).
- C = torch.matmul(A, B)
  - Computes the matrix multiplication. torch.matmul supports various input ranks: for 2-D tensors it does a matrix multiply; for 3-D tensors it does batched matmul; for general ranks it follows broadcasting rules. A @ B is equivalent to torch.matmul(A, B).
- print("PyTorch Result:\n", C)
  - Prints the tensor. Example output:
    PyTorch Result:
     tensor([[ 58.,  64.],
            [139., 154.]])
  - Note the tensor(...) wrapper and the decimal points, since the dtype is float32.
PyTorch specifics (CPU)
- PyTorch tensors can be converted to NumPy arrays via C.numpy() if C is on the CPU. If C is on the GPU, you must call .cpu() first.
- PyTorch supports autograd (automatic differentiation): if A and B require gradients, operations on them are recorded in the computation graph, as in the sketch below.
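Here is a minimal autograd sketch showing gradients flowing through a matmul, using the same A and B as above:

import torch

A = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]], requires_grad=True)
B = torch.tensor([[7., 8.],
                  [9., 10.],
                  [11., 12.]], requires_grad=True)

loss = (A @ B).sum()   # reduce to a scalar so backward() can be called
loss.backward()        # fills in A.grad and B.grad

print(A.grad)  # [[15., 19., 23.], [15., 19., 23.]] (each row holds the row sums of B)
print(B.grad)  # [[5., 5.], [7., 7.], [9., 9.]] (each row repeats a column sum of A)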
7. PyTorch implementation (GPU) — code + line-by-line explanation
If you have a CUDA-capable GPU and the correct PyTorch build, matmul can be computed on the GPU for large speedups.
PyTorch code (GPU-aware)
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
A = torch.tensor([[1,2,3],[4,5,6]], dtype=torch.float32, device=device)
B = torch.tensor([[7,8],[9,10],[11,12]], dtype=torch.float32, device=device)
C = A @ B # GPU-accelerated if device is CUDA
# Move to CPU for printing (if needed):
print("PyTorch Result (from device):\n", C.cpu().numpy())
Line-by-line explanation
- import torch
  - Imports PyTorch.
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  - Checks whether CUDA is available. If yes, device will be the GPU; otherwise the CPU. This pattern lets the same code run on both GPU and CPU seamlessly.
- A = torch.tensor(..., device=device) and B = ...
  - Construct the tensors directly on the chosen device. Creating tensors on the target device avoids an extra transfer (.to(device)) call.
- C = A @ B
  - Performs the matrix multiplication. If device is CUDA, PyTorch dispatches to GPU-accelerated kernels (e.g., cuBLAS).
- C.cpu().numpy()
  - To inspect results in Python (NumPy), move the tensor back to the CPU (.cpu()) and convert it to a NumPy array (.numpy()). Calling .numpy() directly on a CUDA tensor raises an error.
Practical GPU tips
- For small matrices, the GPU may be slower than the CPU due to transfer and kernel-launch overhead. Use sufficiently large or batched multiplications to amortize the overhead; the timing sketch below shows one way to compare.
- For maximum throughput, minimize CPU↔GPU transfers inside loops.
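Here is a minimal timing sketch (not a rigorous benchmark; the size n is an arbitrary choice):

import time
import torch

n = 2048                     # assumption: large enough for the GPU to show a benefit
A = torch.randn(n, n)
B = torch.randn(n, n)

t0 = time.perf_counter()
C_cpu = A @ B
cpu_s = time.perf_counter() - t0

if torch.cuda.is_available():
    A_gpu, B_gpu = A.cuda(), B.cuda()
    torch.cuda.synchronize()          # wait for the transfers to finish
    t0 = time.perf_counter()
    C_gpu = A_gpu @ B_gpu
    torch.cuda.synchronize()          # GPU kernels run asynchronously; sync before reading the clock
    gpu_s = time.perf_counter() - t0
    print(f"CPU: {cpu_s:.4f}s  GPU: {gpu_s:.4f}s")
else:
    print(f"CPU: {cpu_s:.4f}s (no CUDA device available)")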
8. TensorFlow implementation — code + line-by-line explanation
TensorFlow code (TF 2.x — eager mode)
import tensorflow as tf
A = tf.constant([[1, 2, 3],
                 [4, 5, 6]], dtype=tf.float32)   # shape (2, 3)
B = tf.constant([[7, 8],
                 [9, 10],
                 [11, 12]], dtype=tf.float32)    # shape (3, 2)
C = tf.matmul(A, B) # or A @ B
print("TensorFlow Result:\n", C.numpy()) # .numpy() to get a NumPy array
Line-by-line explanation
- import tensorflow as tf
  - Imports TensorFlow; tf is the standard alias.
- A = tf.constant([...], dtype=tf.float32)
  - Creates a tf.Tensor using tf.constant. Specifying dtype=tf.float32 makes it a 32-bit float tensor.
- B = tf.constant([...], dtype=tf.float32)
  - Creates the second tensor.
- C = tf.matmul(A, B)
  - Performs the matrix multiplication. In TensorFlow 2 (eager mode) this runs immediately and returns a tf.Tensor.
- C.numpy()
  - Converts the eager tf.Tensor to a NumPy array for display or CPU-side processing.
TensorFlow GPU behavior
- TensorFlow automatically places ops on the GPU if one is available and device placement allows it. For most typical setups, tf.matmul will use GPU-accelerated kernels without extra code; the sketch below shows how to check placement.
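Here is a minimal sketch for listing visible GPUs and checking where a result lives:

import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))  # empty list if no GPU is visible

A = tf.constant([[1., 2., 3.], [4., 5., 6.]])
B = tf.constant([[7., 8.], [9., 10.], [11., 12.]])
C = tf.matmul(A, B)
print(C.device)  # ends in ".../device:GPU:0" if the op ran on a GPU, else ".../device:CPU:0"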
9. Common errors and how to fix them
Below are typical problems you may encounter and concrete fixes.
1. Shape mismatch (inner dimensions don't match)
- Symptom: a ValueError or similar shape-mismatch error.
- Fix: Check the shapes by printing A.shape and B.shape. If needed, transpose (B.T) or reshape so the inner dimensions match. Example: (2,3) @ (3,2) works; (2,3) @ (2,3) does not. See the sketch below.
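A minimal sketch of the error and the transpose fix (NumPy):

import numpy as np

A = np.ones((2, 3))
M = np.ones((2, 3))

try:
    A @ M                  # inner dimensions 3 and 2 do not match
except ValueError as e:
    print("Error:", e)

C = A @ M.T                # (2,3) @ (3,2) -> (2,2)
print(C.shape)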
2. Dtype mismatch (PyTorch)
- Symptom: an error like expected m1 and m2 to have the same dtype, or silent upcasting.
- Fix: Make sure the tensors have the same dtype: A = A.float() or B = B.double(). See the sketch below.
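A minimal sketch of the mismatch and its fix (PyTorch):

import torch

A = torch.ones(2, 3, dtype=torch.float32)
B = torch.ones(3, 2, dtype=torch.float64)

try:
    A @ B                  # mixing float32 and float64 raises a dtype error
except RuntimeError as e:
    print("Error:", e)

C = A @ B.float()          # cast B to float32 so the dtypes match
print(C.dtype)             # torch.float32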
3. Trying .numpy() on a CUDA tensor (PyTorch)
- Symptom: an error like “can't call numpy() on Tensor that requires grad or is on CUDA device.”
- Fix: Move the tensor to the CPU first: C.cpu().numpy().
4. Using * instead of @ or matmul
- Symptom: unexpected elementwise multiplication results.
- Fix: Use @ (recommended) or np.matmul, torch.matmul, or tf.matmul for matrix multiplication. See the sketch below.
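A minimal sketch of the difference (NumPy):

import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A * B)   # elementwise product: [[ 5 12], [21 32]]
print(A @ B)   # matrix product:      [[19 22], [43 50]]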
5. Performance issues on GPU
- Symptom: GPU code runs slower than CPU code for small arrays.
- Fix: Batch operations or increase matrix size; reduce CPU↔GPU transfers; profile to find bottlenecks.
10. Best practices for performance and stability
- Use the library matmul (NumPy/PyTorch/TensorFlow). Don’t write triple nested loops in Python — these are much slower.
- Pick the correct dtype: float32 is standard for deep learning; use float64 for high-precision needs.
- Batch your work: GPUs are designed for throughput; combine many small multiplies into a batched matmul (rank-3 tensors) for efficiency.
- Minimize transfers: move data to the GPU once, do many ops, then move results back once.
- Profile: use torch.profiler or TensorFlow profiling tools to identify hotspots (see the sketch after this list).
- Check conditioning: matrix multiplication itself is stable, but operations like inversion or solving linear systems following a matmul can be ill-conditioned — watch for numerical issues.
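Here is a minimal torch.profiler sketch (the API is available in recent PyTorch versions; sizes are arbitrary):

import torch
from torch.profiler import profile, ProfilerActivity

A = torch.randn(512, 512)
B = torch.randn(512, 512)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(10):
        C = A @ B          # the op we expect to dominate the profile

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))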
11. Conclusion — recap and next steps
Matrix multiplication is a simple concept with deep consequences. In this guide you learned:
- The definition and shape rules for matrix multiplication.
- A step-by-step numeric example (with the arithmetic shown).
- Practical, copy-and-run implementations in NumPy, PyTorch (CPU & GPU), and TensorFlow — with line-by-line explanations for every line of code.
- Common errors, fixes, and performance best practices.
Next steps you can take:
- Try the examples locally and verify the numeric output matches the worked example ([[58, 64], [139, 154]]).
- Experiment with batched matmul for larger workloads (e.g., batch × m × n times batch × n × q).
- Benchmark CPU vs GPU for matrices of varying sizes to see when the GPU becomes beneficial.
- Explore applications: build a small fully-connected neural network to see matmul in action, or use matmul to transform coordinates in a graphics demo.