🎓 Bachelor Thesis · GUC 2026

Implementing PageRank
Using Python NetworkX

A comprehensive implementation, analysis, and simulation of the PageRank algorithm across six computational methods, with an interactive browser-based visualisation engine.

Author: Hana Ayman Abdelhamid
Supervisor: Dr. Islam El-maddah
Faculty of Media Engineering & Technology
6
PR Methods
5
Experiments
791
Node Graph
1.53×
Pruning Speedup
+287%
Max SEO Gain
6.7K
Lines of Code

Five Requirements, One Complete System

Each requirement builds on the last — from implementing the raw algorithm to building a full search engine with real-time simulation.

Requirement 1

Progressive Pruning PageRank

A novel enhancement to power iteration that freezes low-rank pages after convergence, achieving 1.53× speedup with zero accuracy loss on 2,000-node graphs.

NumPyBenchmarkL1 Error = 0
Requirement 2

Realistic Large-Scale Data

A BFS web scraper with 4 bug fixes, validated against a 791-node synthetic documentation graph. All 7 PageRank methods benchmarked with L1 error and timing.

BeautifulSoup791 nodes7 Methods
Requirement 3

Manual PageRank Implementation

Custom PageRank algorithm using manual power iteration and NetworkX. Implements the core mathematical formula with explicit dangling node handling, convergence tracking, and validation against exact matrix algebra solution.

PythonNetworkXPower Iteration
Bonus — Week 6

Interactive Simulation Engine

A full Streamlit web app with live BFS crawling, Vis.js graph visualisation, heat-scale colour coding, TF-IDF search, and Google benchmark comparison.

StreamlitVis.jsLive Demo

Complete Code Workspace

Dynamic source-code browser. Code files are parsed and highlighted live directly from your workspace directory path context.

Initializing...
# Processing workspace logic variables... Please wait.

Key Numbers at a Glance

All results are measured on real graph executions, not estimated.

All 7 Methods · 791-Node Graph

MethodL1 ErrorTime
1 NetworkX0 (ref)6.5 ms
2 Manual1.47e-479.2 ms
3 Matrix1.47e-430.1 ms
4 SciPy1.47e-40.6 ms
5 Monte Carlo2.76e-2645 ms
6 HITS2.48e-1214 ms
7 Pruning1.57e-493.9 ms

Pruning Benchmark · Web-Like Graphs

NStd TimePruneSpeedup
500.1 ms0.2 ms0.51×
5006.8 ms16.7 ms0.41×
2,000176 ms115 ms1.53×

L1 Error = 0.00e+00 in all cases

Monte Carlo · Error vs Surfer Count

SurfersL1 ErrorTime
1000.02848 ms
1,0000.009276 ms
10,0000.0031760 ms
50,0000.00183.8 s

Search Engine Tests · 8/8 Passing

TestStatus
Constructor & initPASS
Return type & structurePASS
Pure PageRank modePASS
Keyword relevancePASS
Recency weightPASS
Topic-sensitive PRPASS
Combined weightsPASS
Edge casesPASS

PageRank Simulation Engine

A browser-based Streamlit application built in Week 6. Crawl any website, run all 5 methods, search with TF-IDF, and compare with Google.

🕸️

Smart BFS Crawler

Crawl any URL with depth, path prefix, keyword, and asset filters.

📊

Live Vis.js Graph

Nodes sized and coloured by PageRank on a blue→amber→red heat scale.

🔍

TF-IDF Search

Combined score = 0.6×PR + 0.4×TF-IDF.

🆚

Google Benchmark

Multi-tier fuzzy URL matching. Reports Spearman ρ and Kendall τ.

SEO Recommendations — Proven by Data

Each recommendation is validated by modifying the graph and measuring exact PageRank changes using NetworkX.

+287%

1. Earn a High-Authority Inbound Link

One link from python_tutorial (PR=0.10) increased web_scraping's rank from 0.010 to 0.039.

+35.5%

2. Create an Internal Hub Page

Adding a db_hub page increased database_sql PageRank from 0.049 to 0.066.

+763%

3. Accumulate Multiple Inbound Links

5 inbound links raised PageRank from 0.010 to 0.086 — a 763% total increase.

+23.8%

4. Fix Dangling Pages

Adding one link back to the homepage increased homepage PR from 0.190 to 0.236.

Equalise

5. Use Hub-and-Spoke, Not Chains

Hub-and-spoke equalises distribution — A improves +29.8%, E drops 65%.

Installation & Run Commands

All code requires Python 3.8+. Install dependencies once, then run any requirement independently.

pip install networkx numpy scipy matplotlib requests beautifulsoup4 streamlit plotly pandas

# ============================================================
# experiments.py
# ============================================================

# Run a specific experiment:
python experiments.py --exp 1       # Convergence Speed
python experiments.py --exp 2       # Scalability
python experiments.py --exp 3       # Accuracy Comparison
python experiments.py --exp 4       # Damping Factor Sensitivity
python experiments.py --exp 5       # Monte Carlo Accuracy

# Run all experiments:
python experiments.py --exp all

# ============================================================
# pr_simulator.py
# ============================================================

streamlit run app.py

# ============================================================
# progressive_pruning_pr.py
# ============================================================

python pruning_test.py

# ============================================================
# realistic_data.py
# ============================================================

# Demo with synthetic graph (no internet required):
python req2_realistic_data.py --demo

# Scrape a real website (internet required):
python req2_realistic_data.py --scrape --url https://docs.python.org/3/ --depth 2

# Demo on a previously saved edge list:
python req2_realistic_data.py --demo --input edge_list.csv

# Dependencies:
pip install networkx numpy scipy matplotlib requests beautifulsoup4

# ============================================================
# search_engine.py
# ============================================================

python search_engine_test.py

# Dependencies:
pip install networkx numpy

# ============================================================
# metro_daily_with_recommendations.py
# ============================================================

python metro_daily_with_recommendations.py