🎓 Bachelor Thesis · GUC 2026

Implementing PageRank
Using Python NetworkX

A comprehensive implementation, analysis, and simulation of the PageRank algorithm across six computational methods, with an interactive browser-based visualisation engine.

Author: Hana Ayman Abdelhamid

Supervisor: Dr. Islam El-maddah

Faculty of Media Engineering & Technology

PR Methods

Experiments

791

Node Graph

1.53×

Pruning Speedup

+287%

Max SEO Gain

6.7K

Lines of Code

Project Overview

Five Requirements, One Complete System

Each requirement builds on the last — from implementing the raw algorithm to building a full search engine with real-time simulation.

Requirement 1

Progressive Pruning PageRank

A novel enhancement to power iteration that freezes low-rank pages after convergence, achieving 1.53× speedup with zero accuracy loss on 2,000-node graphs.

NumPyBenchmarkL1 Error = 0

Requirement 2

Realistic Large-Scale Data

A BFS web scraper with 4 bug fixes, validated against a 791-node synthetic documentation graph. All 7 PageRank methods benchmarked with L1 error and timing.

BeautifulSoup791 nodes7 Methods

Requirement 3

Manual PageRank Implementation

Custom PageRank algorithm using manual power iteration and NetworkX. Implements the core mathematical formula with explicit dangling node handling, convergence tracking, and validation against exact matrix algebra solution.

PythonNetworkXPower Iteration

Bonus — Week 6

Interactive Simulation Engine

A full Streamlit web app with live BFS crawling, Vis.js graph visualisation, heat-scale colour coding, TF-IDF search, and Google benchmark comparison.

StreamlitVis.jsLive Demo

Source Code

Complete Code Workspace

Dynamic source-code browser. Code files are parsed and highlighted live directly from your workspace directory path context.

Initializing...

# Processing workspace logic variables... Please wait.

Experimental Results

Key Numbers at a Glance

All results are measured on real graph executions, not estimated.

All 7 Methods · 791-Node Graph

Method	L1 Error	Time
1 NetworkX	0 (ref)	6.5 ms
2 Manual	1.47e-4	79.2 ms
3 Matrix	1.47e-4	30.1 ms
4 SciPy	1.47e-4	0.6 ms
5 Monte Carlo	2.76e-2	645 ms
6 HITS	2.48e-1	214 ms
7 Pruning	1.57e-4	93.9 ms

Pruning Benchmark · Web-Like Graphs

N	Std Time	Prune	Speedup
50	0.1 ms	0.2 ms	0.51×
500	6.8 ms	16.7 ms	0.41×
2,000	176 ms	115 ms	1.53×

L1 Error = 0.00e+00 in all cases

Monte Carlo · Error vs Surfer Count

Surfers	L1 Error	Time
100	0.0284	8 ms
1,000	0.0092	76 ms
10,000	0.0031	760 ms
50,000	0.0018	3.8 s

Search Engine Tests · 8/8 Passing

Test	Status
Constructor & init	PASS
Return type & structure	PASS
Pure PageRank mode	PASS
Keyword relevance	PASS
Recency weight	PASS
Topic-sensitive PR	PASS
Combined weights	PASS
Edge cases	PASS

Interactive Tool

PageRank Simulation Engine

A browser-based Streamlit application built in Week 6. Crawl any website, run all 5 methods, search with TF-IDF, and compare with Google.

🕸️

Smart BFS Crawler

Crawl any URL with depth, path prefix, keyword, and asset filters.

📊

Live Vis.js Graph

Nodes sized and coloured by PageRank on a blue→amber→red heat scale.

🔍

TF-IDF Search

Combined score = 0.6×PR + 0.4×TF-IDF.

🆚

Google Benchmark

Multi-tier fuzzy URL matching. Reports Spearman ρ and Kendall τ.

Requirement 5

SEO Recommendations — Proven by Data

Each recommendation is validated by modifying the graph and measuring exact PageRank changes using NetworkX.

+287%

1. Earn a High-Authority Inbound Link

One link from python_tutorial (PR=0.10) increased web_scraping's rank from 0.010 to 0.039.

+35.5%

2. Create an Internal Hub Page

Adding a db_hub page increased database_sql PageRank from 0.049 to 0.066.

+763%

3. Accumulate Multiple Inbound Links

5 inbound links raised PageRank from 0.010 to 0.086 — a 763% total increase.

+23.8%

4. Fix Dangling Pages

Adding one link back to the homepage increased homepage PR from 0.190 to 0.236.

Equalise

5. Use Hub-and-Spoke, Not Chains

Hub-and-spoke equalises distribution — A improves +29.8%, E drops 65%.

Run It Yourself

Installation & Run Commands

All code requires Python 3.8+. Install dependencies once, then run any requirement independently.

pip install networkx numpy scipy matplotlib requests beautifulsoup4 streamlit plotly pandas

# ============================================================
# experiments.py
# ============================================================

# Run a specific experiment:
python experiments.py --exp 1       # Convergence Speed
python experiments.py --exp 2       # Scalability
python experiments.py --exp 3       # Accuracy Comparison
python experiments.py --exp 4       # Damping Factor Sensitivity
python experiments.py --exp 5       # Monte Carlo Accuracy

# Run all experiments:
python experiments.py --exp all

# ============================================================
# pr_simulator.py
# ============================================================

streamlit run app.py

# ============================================================
# progressive_pruning_pr.py
# ============================================================

python pruning_test.py

# ============================================================
# realistic_data.py
# ============================================================

# Demo with synthetic graph (no internet required):
python req2_realistic_data.py --demo

# Scrape a real website (internet required):
python req2_realistic_data.py --scrape --url https://docs.python.org/3/ --depth 2

# Demo on a previously saved edge list:
python req2_realistic_data.py --demo --input edge_list.csv

# Dependencies:
pip install networkx numpy scipy matplotlib requests beautifulsoup4

# ============================================================
# search_engine.py
# ============================================================

python search_engine_test.py

# Dependencies:
pip install networkx numpy

# ============================================================
# metro_daily_with_recommendations.py
# ============================================================

python metro_daily_with_recommendations.py

Implementing PageRankUsing Python NetworkX