
Conversation


@cdgaete cdgaete commented Apr 23, 2025

Closes #12
Supersedes #225

Changes proposed in this Pull Request

  • Add new capability to extract power plant data from OpenStreetMap (OSM)
  • Implement caching system to reduce data download requirements
  • Add support for various power plant types from OSM data
  • Include capacity estimation for plants with missing information
  • Add configuration options in config.yaml for customizing OSM extraction
  • Include example script comparing OSM data with GEM database

Current Implementation Focus
The OSM module has been designed with several key architectural components:

  1. Comprehensive Data Processing Pipeline: A structured workflow that processes OSM elements based on hierarchy (relations, ways, nodes), with specialized handling for each element type

  2. Advanced Geometric Analysis: Utilizes spatial relationships to identify overlapping power plants and prevent duplication

  3. Multi-level Data Extraction: Implements a cascading approach to extract capacity information (see the sketch after this list):

    • Direct extraction from tags
    • Pattern-based extraction using configurable regex
    • Estimation based on geographic properties when direct values aren't available

  4. Efficient Caching Architecture: Implements a multi-tiered caching system that significantly reduces API requests and improves performance, enabling quick analysis of large geographic areas

  5. Rejection Tracking System: The rejection tracker serves two critical purposes:

    • Configuration refinement tool: Helps identify configuration gaps that can be addressed to improve data capture
    • Quality improvement mechanism: Provides structured feedback that can be shared with the OSM community
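
To make the capacity cascade concrete, here is a minimal sketch of the fallback logic. The tag keys are standard OSM keys, but the regex pattern, unit table, and area-based density value are illustrative assumptions, not the module's actual implementation:

    import re

    # Illustrative sketch of the cascading extraction; the pattern list and the
    # density assumption are examples only.
    CAPACITY_PATTERNS = [
        r"^(\d+(?:\.\d+)?)\s*(W|kW|MW|GW)$",  # e.g. "100 MW", "15.5kW"
    ]
    UNIT_TO_MW = {"w": 1e-6, "kw": 1e-3, "mw": 1.0, "gw": 1e3}


    def extract_capacity_mw(tags: dict, area_km2: float | None = None) -> float | None:
        """Return capacity in MW, falling back through the three levels."""
        # 1. Direct extraction from tags
        raw = tags.get("plant:output:electricity") or tags.get("generator:output:electricity")
        if raw:
            # 2. Pattern-based extraction using configurable regex
            for pattern in CAPACITY_PATTERNS:
                match = re.match(pattern, raw.strip(), flags=re.IGNORECASE)
                if match:
                    value, unit = match.groups()
                    return float(value) * UNIT_TO_MW[unit.lower()]
        # 3. Estimation from geographic properties when no usable value is tagged
        if area_km2 is not None and tags.get("plant:source") == "solar":
            return area_km2 * 50.0  # assumed ~50 MW per km² for utility-scale PV
        return None


    print(extract_capacity_mw({"plant:output:electricity": "150 MW"}))  # 150.0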

Benefits

  • Access new data source with global coverage for power plants
  • Complement existing databases with community-maintained data
  • Improve coverage in regions with limited official data
  • Get standardized output compatible with all powerplantmatching functions
  • Help OSM mappers identify and fix data issues through rejection tracking
  • Contribute to a continuous feedback loop that improves both OSM and powerplantmatching data quality
  • Fine-tune data collection through config options based on rejection insights

Checklist

  • ✅ Code changes are sufficiently documented
  • ✅ Tests for new features were added
  • ✅ Release notes have been updated
  • ✅ I consent to the release of this PR's code under the MIT license

@euronion changed the title from "Feature osm perf improvements" to "OSM improvements (features, code, performance)" on May 6, 2025

cdgaete commented May 12, 2025

OSM Integration Example

I've added an example script demonstrating the new OSM module functionality. The script (analysis/osm_example_ppm.py) compares power plant data from OpenStreetMap with Global Energy Monitor data for three countries: Chile, South Africa, and Indonesia.

Example Script:

# Imports as assumed for this example; exact import paths may differ in the PR
import os

import matplotlib.pyplot as plt

from powerplantmatching import get_config
from powerplantmatching.data import GEM, OSM
from powerplantmatching.plot import (
    fueltype_and_country_totals_bar,
    fueltype_totals_bar,
)

# Main execution
if __name__ == "__main__":
    # Set up output directory
    output_dir = "outputs"
    os.makedirs(output_dir, exist_ok=True)
    
    # List of countries to process
    countries = [
        "Chile",
        "South Africa",
        "Indonesia",
    ]
    
    # Get the base configuration
    config = get_config()
    config["main_query"] = ""
    config["target_countries"] = countries
    config["OSM"]["force_refresh"] = False
    config["OSM"]["plants_only"] = False
    config["OSM"]["units_clustering"]["enabled"] = False
    config["OSM"]["missing_name_allowed"] = False
    config["OSM"]["missing_technology_allowed"] = False
    config["OSM"]["missing_start_date_allowed"] = True
    
    # Get combined data for all countries
    osm_data = OSM(raw=False, update=False, config=config)
    gem_data = GEM(raw=False, update=False, config=config)
    
    fig, axis = fueltype_totals_bar([gem_data, osm_data], keys=["GEM", "OSM"])
    plt.savefig(os.path.join(output_dir, "osm_gem_ppm.png"))
    
    fig, axis = fueltype_and_country_totals_bar([gem_data, osm_data], keys=["GEM", "OSM"])
    plt.savefig(os.path.join(output_dir, "osm_gem_ppm_country.png"))

Results:

The example generates two plots comparing OSM and GEM data:

  1. Country-specific breakdown by fuel type: osm_gem_ppm_country.png (figure)

  2. Aggregated capacity by fuel type: osm_gem_ppm.png (figure)

These results show that OSM provides data quality comparable to specialized sources like GEM for major generation types (Hard Coal, Hydro, Natural Gas), with particularly strong alignment in South Africa. The OSM module successfully captures most major power plant types while also providing additional data for some categories (like Oil and Waste) that may be missing in other sources.

The differences highlight opportunities for enhancement, particularly for Wind and Geothermal power plants, which show lower coverage in OSM. These gaps will be addressed in the upcoming tasks, which focus on data validation and code revisions aimed at improving data quality.


cdgaete commented May 16, 2025

Next Steps for OSM Implementation

Data Validation Across Multiple Countries

The next phase will focus on thorough data validation:

  • Testing several countries worldwide while updating the configuration file's source and tech mapping to maximize OSM element inclusion
  • Identifying potential enhancements based on findings and challenges encountered during testing
  • Comparing results with other global power plant datasets and reporting key differences
    - GEM datasets already implemented in PPM
  • Using these comparisons to enhance the tool when inconsistencies or code issues are found
  • Testing and reporting on PPM's matching feature performance, determining optimal configuration settings for matching datasets with the OSM implementation
    - reliability_score

Documentation and Tutorials

To support the implementation, development of supporting materials will include:

  • Step-by-step guides for using the OSM module with practical examples
  • Documentation of key configuration options and their effects on data collection
  • Basic tutorials showing common use cases and recommended settings

diazr-david and others added 11 commits July 10, 2025 11:06
Added explanation to pipeline image
Added content to the Contributing to OSM Data Quality section
- Document comprehensive OpenStreetMap data source with advanced processing
- Highlight dual purpose for energy analysts and OSM community
- Include details on caching, enhancement features, and quality control
@cdgaete enabled auto-merge (squash) July 10, 2025 21:02
@fneum self-requested a review July 11, 2025 07:06
@lkstrp linked an issue Jul 11, 2025 that may be closed by this pull request
@lkstrp changed the title from "OSM improvements (features, code, performance)" to "OpenStreetMap Integration" on Jul 11, 2025
@FabianHofmann commented:

@cdgaete I have tested your example and it all works impressively well. Would you mind fixing the remaining failing tests (it is probably just a deprecation fix; we can also drop support for Python 3.9 if needed)? Then we can finally merge.

@cdgaete cdgaete force-pushed the feature_osm_perf_improvements branch from 4f9a6f5 to 614713e Compare July 29, 2025 00:29
Breaking changes:
- Minimum Python version is now 3.10 (was 3.9)
- Remove mypy and all type checking infrastructure

Changes:
- Update requires-python to >=3.10 in pyproject.toml
- Remove Python 3.9 from classifiers
- Delete type-checking.yml GitHub workflow
- Remove mypy and type stub dependencies (mypy, types-requests, types-PyYAML, pandas-stubs, types-tqdm, types-six)
- Remove [tool.mypy] configuration section from pyproject.toml
- Update release notes to document breaking changes
- Update isinstance calls to use Python 3.10+ union syntax (X | Y)
@cdgaete cdgaete force-pushed the feature_osm_perf_improvements branch from 66327d7 to 749e3fc Compare July 29, 2025 00:56
- Replace Optional[X] with X | None throughout OSM module
- Update isinstance calls to use X | Y syntax
- Apply ruff formatting for consistency with Python 3.10+ standards
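
As a small illustration of the union-syntax change described above (the functions below are made up for this example, not taken from the PR):

    from typing import Optional, Union


    # Python 3.9 style, as used before this change
    def scale_old(value: Optional[float], factor: Union[int, float]) -> Optional[float]:
        if not isinstance(factor, (int, float)):
            raise TypeError("factor must be numeric")
        return None if value is None else value * factor


    # Python 3.10+ union syntax now used throughout the OSM module
    def scale_new(value: float | None, factor: int | float) -> float | None:
        if not isinstance(factor, int | float):
            raise TypeError("factor must be numeric")
        return None if value is None else value * factor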

lkstrp commented Jul 31, 2025

Can either @fneum or I have another look before merging? Ref to the first review in #225.


fneum commented Jul 31, 2025

Yes, I would like to have a look, too. Before merging, we need to understand what impact it has on the powerplants.csv in the default config (that is used in PyPSA-Eur). This should be another validation step.

@fneum fneum requested a review from Copilot July 31, 2025 07:28
Copilot AI left a comment:

Pull Request Overview

This pull request adds comprehensive OpenStreetMap (OSM) integration capabilities to powerplantmatching, enabling extraction of power plant data from the global OpenStreetMap database. The implementation provides a robust data processing pipeline with caching, quality tracking, and analysis tools.

  • Implements a complete OSM data extraction pipeline with hierarchical processing (relations → ways → nodes)
  • Adds sophisticated multi-level caching system to minimize API calls and improve performance
  • Introduces capacity estimation, plant reconstruction, and rejection tracking for data quality improvement

Reviewed Changes

Copilot reviewed 43 out of 45 changed files in this pull request and generated 4 comments.

Summary per file:

  • pyrightconfig.json: Adds Pyright type checker configuration for basic type checking
  • pyproject.toml: Updates Python version requirement to 3.10+, adds new dependencies (scikit-learn, shapely), removes mypy configuration
  • powerplantmatching/plot.py: Removes large commented code blocks, adds new plotly_map function for interactive mapping
  • powerplantmatching/package_data/config.yaml: Adds comprehensive OSM configuration section with API settings, source mappings, and processing parameters
  • powerplantmatching/osm/workflow.py: Implements main workflow orchestration for the OSM data processing pipeline
  • powerplantmatching/osm/utils.py: Provides utility functions for capacity parsing, country validation, and cache path management
  • powerplantmatching/osm/retrieval/regional.py: Implements regional download functionality for custom geographic areas
  • powerplantmatching/osm/retrieval/populate.py: Provides cache population utilities for batch processing multiple countries
  • powerplantmatching/osm/retrieval/client.py: Core Overpass API client with retry logic, caching, and progress tracking
  • powerplantmatching/osm/retrieval/cache.py: Multi-level caching system for OSM elements and processed units
  • powerplantmatching/osm/retrieval/__init__.py: Package initialization for the retrieval module
  • powerplantmatching/osm/quality/rejection.py: Comprehensive rejection tracking system for data quality analysis
  • powerplantmatching/osm/quality/coverage.py: Cache coverage analysis and maintenance tools
  • powerplantmatching/osm/quality/__init__.py: Package initialization for the quality module
  • powerplantmatching/osm/parsing/plants.py: Parser for OSM power plant relations with reconstruction capabilities
Comments suppressed due to low confidence (2)

pyproject.toml:13

  • Removing Python 3.9 support is a breaking change. Consider documenting this in release notes or maintaining backward compatibility if existing users depend on Python 3.9.
license = { file = "LICENSE" }

powerplantmatching/osm/parsing/plants.py:434

  • Using assert for runtime validation in production code is problematic. The assertion could be disabled with python -O. Consider raising a proper exception like ValueError or RuntimeError instead.
                mismatch_ratio = (

return False, None, "regex_error"
else:
if regex_patterns is None:
regex_patterns = [
Copilot AI commented Jul 31, 2025:

The regex patterns for capacity parsing are hardcoded. Consider moving these to configuration to make them easily customizable without code changes.

A contributor replied:

I would not move these away, but add a comment what each regex intends to match

while retries < self.max_retries:
try:
response = requests.post(
self.api_url, data={"data": query}, timeout=self.timeout + 30
Copilot AI commented Jul 31, 2025:

The timeout value is calculated as self.timeout + 30 but this could result in very long timeouts if self.timeout is large. Consider setting a maximum timeout limit to prevent indefinite waits.

Suggested change
self.api_url, data={"data": query}, timeout=self.timeout + 30
self.api_url, data={"data": query}, timeout=min(self.timeout + 30, MAX_TIMEOUT)


def _save_cache(self, cache_path: str, data: dict) -> None:
"""Save dictionary to JSON cache file."""
cache_data = data if data else {}
Copilot AI commented Jul 31, 2025:

[nitpick] The condition data if data else {} can be simplified to data or {} for better readability.

Suggested change
cache_data = data if data else {}
cache_data = data or {}

rejections_to_process.extend(rejections)

for rejection in rejections_to_process:
if rejection.coordinates is None or "cluster" in rejection.id.lower():
Copilot AI commented Jul 31, 2025:

Using string matching with "cluster" in rejection.id.lower() is fragile. Consider adding a proper flag or enum value to identify cluster rejections more reliably.


FabianHofmann commented Jul 31, 2025

It does not have an effect on the default powerplants.csv, as it is not added to the default matching sources. So at this stage it is purely about adding an optional source. However, @cdgaete, would you mind making a short comparison of the powerplants.csv with and without OSM (preferably for Europe)?

@maurerle left a comment:

This is huge.
It increases the code base by a multiple of its current size and is hard to review.

Files which are not essential to the features of this PR are changed here as well.

I made some code remarks for better readability, along with recommendations to split functions or reduce the scope of classes (by not passing the config everywhere).

# - repo: https://github.com/RobertCraigie/pyright-python
# rev: v1.1.383
# hooks:
# - id: pyright

I would remove this unrelated change


Unrelated change.

# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

Instead of full removal, maybe add SPDX headers?

logger.info(f"Total countries to process: {len(all_countries)}")

if dry_run:
print("\nDry run - would download the following countries:")

Use logger instead of printing?

return False, None, "regex_error"
else:
if regex_patterns is None:
regex_patterns = [

I would not move these away, but add a comment what each regex intends to match


total_elements = total_plants + total_generators

if check_live_counts:

maybe extract this branch into a separate function which updates the passed cached_countries?

)
cached_countries[country_code]["cache_status"] = "error"

if return_data:

Instead of spending another 100 lines of code in this function, I would rather create a function that works on the output dict and just handles the printing. That way, we can drop the return_data parameter as well and end up with cleaner code.


Looks good.

Minimum similarity ratio for common substrings
"""

def __init__(self, config: dict[str, Any]):

This class only needs name_similarity_threshold - why give it the whole config?


def __init__(
self,
config: dict[str, Any],

This class only needs the min_generators_for_reconstruction - why give it the whole config?

Carlos Gaete added 2 commits September 26, 2025 01:15
…fig management

- Refactor classes to accept specific parameters instead of full config objects
- Split large functions in coverage.py into focused, single-purpose functions
- Replace print statements with proper logging throughout codebase
- Add SPDX license headers to all new files
- Replace runtime assertions with proper exception handling
- Enhance caching with diskcache for better memory management
- Add comprehensive European power plant analysis (39K plants, 850+ GW OSM contribution)
- Improve error handling and CSV corruption recovery
- Add omitted countries support for API limitations

cdgaete commented Sep 26, 2025

Response to Review Feedback

Thank you @maurerle, @FabianHofmann, @fneum, and @lkstrp for the thorough reviews and constructive feedback. I've made substantial improvements to address the core concerns raised about code organization, configuration management, and maintainability. Disclaimer: the large jump in lines of code is due to an HTML report included in the documentation.

@maurerle's Review Comments

Configuration Dependencies Reduction:

  • Issue: Classes receiving entire config objects when only needing specific parameters
  • Resolution: Refactored key classes to accept only required parameters:
    • NameAggregator: Now takes similarity_threshold: float = 0.7 instead of full config
    • PlantReconstructor: Now takes min_generators: int instead of full config
  • Files Changed: powerplantmatching/osm/enhancement/reconstruction.py, powerplantmatching/osm/parsing/plants.py, powerplantmatching/osm/parsing/generators.py
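
For illustration, the refactoring pattern looks roughly like this (only the class name and the new parameter come from the list above; the config key and class body are assumptions):

    # Before: the class received the whole config and picked out one value
    class NameAggregator:
        def __init__(self, config: dict):
            self.similarity_threshold = config["OSM"]["name_similarity_threshold"]  # assumed key


    # After: only the required parameter is injected, with a sensible default
    class NameAggregator:
        def __init__(self, similarity_threshold: float = 0.7):
            self.similarity_threshold = similarity_threshold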

Function Decomposition:

  • Issue: Large functions with multiple responsibilities, particularly in coverage analysis
  • Resolution: Split show_country_coverage into focused functions:
    • get_country_coverage_data() - data collection only
    • print_coverage_report() - formatting and display
    • Helper functions: _print_status_summary(), _print_country_table(), _print_missing_countries(), _print_continent_breakdown()
  • Files Changed: powerplantmatching/osm/quality/coverage.py
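
A rough sketch of the resulting split (function names from the list above; the bodies are placeholders, not the module's actual logic):

    import logging

    logger = logging.getLogger(__name__)


    def get_country_coverage_data(countries: list[str]) -> dict:
        """Data collection only: inspect the cache and return coverage statistics."""
        # Placeholder logic; the real function reads cache files from disk.
        return {country: {"cached": False, "plants": 0} for country in countries}


    def print_coverage_report(coverage: dict) -> None:
        """Formatting and display only, separated from data collection."""
        for country, stats in coverage.items():
            logger.info("%s: cached=%s, plants=%d", country, stats["cached"], stats["plants"])


    print_coverage_report(get_country_coverage_data(["CL", "ZA", "ID"]))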

Logging Instead of Print Statements:

  • Issue: Usage of print() statements for output
  • Resolution: Replaced all print() calls with appropriate logger.info(), logger.warning(), etc.
  • Files Changed: powerplantmatching/osm/retrieval/populate.py and others

License Headers:

  • Issue: Removal of license information
  • Resolution: Added SPDX license headers throughout codebase:
    # SPDX-FileCopyrightText: Contributors to powerplantmatching <https://github.com/pypsa/powerplantmatching>
    # SPDX-License-Identifier: MIT
  • Files Changed: All new OSM module files

Unrelated Changes:

  • Issue: Modifications to pre-commit configuration and other files
  • Resolution: Removed pyright configuration changes from .pre-commit-config.yaml as requested

@copilot Comments

Runtime Assertions:

  • Issue: Using assert statements for runtime validation in production code
  • Resolution: Replaced with proper exception handling:
    # Before
    assert existing_capacity_source, "..."
    
    # After  
    if existing_capacity_source is None:
        raise ValueError("Existing capacity source should not be None if existing_capacity is not None")
  • Files Changed: powerplantmatching/osm/parsing/plants.py

Breaking Changes Documentation:

  • Issue: Python 3.9 support removal not properly documented
  • Resolution: Documented breaking changes in commit messages and updated project classifiers appropriately

Code Simplification:

  • Issue: Verbose conditional expressions
  • Resolution: Simplified expressions like data if data else {} to data or {}
  • Files Changed: powerplantmatching/osm/retrieval/cache.py

Regex Pattern Documentation:

  • Issue: Hardcoded regex patterns without explanation
  • Resolution: Added comprehensive comments explaining what each pattern matches:
    # Matches: "100 MW", "15.5 kW", "50 MWp" (number with optional space and unit)
    r"^(\d+(?:\.\d+)?)\s*([a-zA-Z]+(?:p|el|e)?)$",
    # Matches: "100MW", "15.5kW", "50MWp" (number directly followed by unit)  
    r"^(\d+(?:\.\d+)?)([a-zA-Z]+(?:p|el|e)?)$",
  • Files Changed: powerplantmatching/osm/utils.py

Additional Architectural Improvements

Enhanced Caching System:

  • Introduced diskcache for large global caches (nodes, ways, relations) with size limits
  • Maintained JSON caching for small country-specific data
  • Added proper connection management with context managers and cleanup
  • Implemented cache corruption detection and recovery
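
A minimal sketch of this two-tier setup, assuming illustrative paths and keys (diskcache is the library named above; the rest is not the module's actual code):

    import json
    from pathlib import Path

    import diskcache

    CACHE_DIR = Path("osm_cache")  # illustrative location


    def open_element_cache() -> diskcache.Cache:
        """Large global element caches (nodes, ways, relations) use diskcache with a size limit."""
        return diskcache.Cache(str(CACHE_DIR / "elements"), size_limit=4 * 1024**3)  # 4 GiB


    def save_country_summary(country_code: str, data: dict) -> None:
        """Small per-country summaries stay in plain JSON files."""
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        (CACHE_DIR / f"{country_code}.json").write_text(json.dumps(data))


    # The context manager closes the diskcache connection after use.
    with open_element_cache() as cache:
        cache.set(("way", 123456), {"tags": {"power": "plant"}})
        element = cache.get(("way", 123456))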

Improved Error Handling:

  • Enhanced CSV reading with proper quoting and error recovery
  • Added corrupted cache file detection and automatic cleanup
  • Better exception handling throughout the processing pipeline
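
A hedged sketch of what such a recovery path can look like (the pandas options shown are standard; the cache-file handling is an assumption, not the module's exact code):

    import csv
    import logging
    from pathlib import Path

    import pandas as pd

    logger = logging.getLogger(__name__)


    def read_cached_csv(path: Path) -> pd.DataFrame | None:
        """Read a cached CSV with explicit quoting; drop the file if it is corrupted."""
        try:
            return pd.read_csv(path, quoting=csv.QUOTE_MINIMAL, on_bad_lines="warn")
        except (pd.errors.ParserError, pd.errors.EmptyDataError, UnicodeDecodeError):
            logger.warning("Corrupted cache file %s, removing it so it can be rebuilt", path)
            path.unlink(missing_ok=True)
            return None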

Configuration Management:

  • Added support for omitted_countries to handle API limitations (e.g., Kosovo)
  • Enhanced country validation with better error messages and suggestions
  • Improved parameter validation and type hints

Documentation and European Power Plant Analysis (addressing @FabianHofmann and @fneum's request for understanding OSM impact):

  • Added comprehensive European power plant analysis with interactive visualizations covering 39,155 plants across Europe
  • Key findings: OSM contributes 850+ GW (55% of total European capacity), with 115 GW exclusively from OSM-only plants
  • Geographic distribution: France leads with 23.8 GW OSM-only capacity, followed by Spain (13.5 GW) and Germany (13.5 GW)
  • Technology coverage: Strong OSM representation in pumped storage (80%), CCGT (75%), and hydro (70%+), with growth opportunities in offshore wind and solar PV
  • Dual role identified: OSM both discovers new plants (17.5% of plants, 7.4% capacity) and validates existing infrastructure (21% of plants, 47.6% capacity)
  • This analysis demonstrates that OSM as an optional source provides substantial value without affecting the default powerplants.csv
  • Created downloadable analysis scripts (osm_ppm_eu_analysis.py) for reproducibility
  • Enhanced module documentation with detailed usage examples and interactive maps

Comments Evaluated but Not Implemented

Timeout Management (Copilot AI suggestion):

  • Issue: Timeout calculation self.timeout + 30 could result in very long timeouts
  • Evaluation: Current implementation works reliably in practice. The timeout is calculated per request and defaults to reasonable values (300s + 30s buffer). Adding maximum limits would add complexity without significant benefit given typical API response times.

String Matching for Rejection Types (Copilot AI suggestion):

  • Issue: Using "cluster" in rejection.id.lower() is fragile
  • Evaluation: This string matching serves a specific purpose in rejection tracking where cluster IDs are clearly identifiable by naming convention. The impact is limited to internal debugging/tracking with no effect on data processing. The current approach is simple and effective for the intended use case.

Viewing the EU Analysis Report

The comprehensive European analysis is available in multiple formats:

OSM_Data_Coverage_Analysis_documentation.pdf

Interactive Map: Clone the branch and open doc/_static/europe_power_plants_osm_comparison.html in your browser to explore 39K+ plants with OSM involvement comparison.

Static Analysis: The downloadable script doc/_static/osm_ppm_eu_analysis.py generates all visualizations and can be run independently.

Documentation: See doc/osm-data-analysis.rst for the full analysis writeup with key findings and methodology.

Please let me know if any specific areas need further clarification or if you'd like me to address any of the remaining optimization suggestions.

@euronion commented:

@cdgaete could you use scatter clustering for easier viewing of your map?
