OpenStreetMap Integration #236
Conversation
…gaete/powerplantmatching into feature_osm_perf_improvements
OSM Integration Example

I've added an example script demonstrating the new OSM module functionality.

Example Script:

```python
# Imports assumed from the package layout shown in this PR
import os

import matplotlib.pyplot as plt

from powerplantmatching import get_config
from powerplantmatching.data import GEM, OSM
from powerplantmatching.plot import fueltype_and_country_totals_bar, fueltype_totals_bar

# Main execution
if __name__ == "__main__":
    # Set up output directory
    output_dir = "outputs"
    os.makedirs(output_dir, exist_ok=True)

    # List of countries to process
    countries = [
        "Chile",
        "South Africa",
        "Indonesia",
    ]

    # Get the base configuration
    config = get_config()
    config["main_query"] = ""
    config["target_countries"] = countries
    config["OSM"]["force_refresh"] = False
    config["OSM"]["plants_only"] = False
    config["OSM"]["units_clustering"]["enabled"] = False
    config["OSM"]["missing_name_allowed"] = False
    config["OSM"]["missing_technology_allowed"] = False
    config["OSM"]["missing_start_date_allowed"] = True

    # Get combined data for all countries
    osm_data = OSM(raw=False, update=False, config=config)
    gem_data = GEM(raw=False, update=False, config=config)

    fig, axis = fueltype_totals_bar([gem_data, osm_data], keys=["GEM", "OSM"])
    plt.savefig(os.path.join(output_dir, "osm_gem_ppm.png"))

    fig, axis = fueltype_and_country_totals_bar([gem_data, osm_data], keys=["GEM", "OSM"])
    plt.savefig(os.path.join(output_dir, "osm_gem_ppm_country.png"))
```

Results:

The example generates two plots comparing OSM and GEM data. These results show that OSM provides data quality comparable to specialized sources like GEM for major generation types (Hard Coal, Hydro, Natural Gas), with particularly strong alignment in South Africa. The OSM module successfully captures most major power plant types while also providing additional data for some categories (like Oil and Waste) that may be missing in other sources. The differences highlight opportunities for enhancement, particularly for Wind and Geothermal power plants, which show lower coverage in OSM. This will be covered in the coming tasks, which focus on data validation and code changes toward data quality enhancement.
Next Steps for OSM Implementation

Data Validation Across Multiple Countries

The next phase will focus on thorough data validation:

Documentation and Tutorials

To support the implementation, development of supporting materials will include:
Added explanation to pipeline image
Added content to the Contributing to OSM Data Quality section
- Document comprehensive OpenStreetMap data source with advanced processing - Highlight dual purpose for energy analysts and OSM community - Include details on caching, enhancement features, and quality control
…en-energy-transition/powerplantmatching into feature_osm_perf_improvements
@cdgaete I have tested your example and it all works impressively well. Would you mind fixing the remaining failing tests (it is probably just a matter of fixing a deprecation; we can also drop support for Python 3.9 if needed)? Then we can finally merge.
Force-pushed from 4f9a6f5 to 614713e
Breaking changes:
- Minimum Python version is now 3.10 (was 3.9)
- Remove mypy and all type checking infrastructure

Changes:
- Update requires-python to >=3.10 in pyproject.toml
- Remove Python 3.9 from classifiers
- Delete type-checking.yml GitHub workflow
- Remove mypy and type stub dependencies (mypy, types-requests, types-PyYAML, pandas-stubs, types-tqdm, types-six)
- Remove [tool.mypy] configuration section from pyproject.toml
- Update release notes to document breaking changes
- Update isinstance calls to use Python 3.10+ union syntax (X | Y)
Force-pushed from 66327d7 to 749e3fc
- Replace Optional[X] with X | None throughout OSM module
- Update isinstance calls to use X | Y syntax
- Apply ruff formatting for consistency with Python 3.10+ standards
Yes, I would like to have a look, too. Before merging, we need to understand what impact it has on the powerplants.csv output.
Pull Request Overview
This pull request adds comprehensive OpenStreetMap (OSM) integration capabilities to powerplantmatching, enabling extraction of power plant data from the global OpenStreetMap database. The implementation provides a robust data processing pipeline with caching, quality tracking, and analysis tools.
- Implements a complete OSM data extraction pipeline with hierarchical processing (relations → ways → nodes)
- Adds sophisticated multi-level caching system to minimize API calls and improve performance
- Introduces capacity estimation, plant reconstruction, and rejection tracking for data quality improvement
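The hierarchical processing named in the first bullet (relations → ways → nodes) might be sketched like this; the element shapes mirror Overpass JSON output, but the function and its deduplication logic are illustrative, not the PR's actual implementation:

```python
# Hypothetical sketch: process OSM elements in hierarchy order so that
# members already consumed by a parent relation are not double-counted.

def process_elements(elements: list[dict]) -> list[dict]:
    order = {"relation": 0, "way": 1, "node": 2}
    seen: set[tuple[str, int]] = set()
    plants: list[dict] = []
    # Relations first, then ways, then nodes
    for el in sorted(elements, key=lambda e: order[e["type"]]):
        key = (el["type"], el["id"])
        if key in seen:
            continue
        # Mark members of a relation as consumed so standalone ways/nodes
        # belonging to it are skipped later in the pass
        for member in el.get("members", []):
            seen.add((member["type"], member["ref"]))
        plants.append(el)
    return plants
```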
Reviewed Changes
Copilot reviewed 43 out of 45 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pyrightconfig.json | Adds Pyright type checker configuration for basic type checking |
| pyproject.toml | Updates Python version requirement to 3.10+, adds new dependencies (scikit-learn, shapely), removes mypy configuration |
| powerplantmatching/plot.py | Removes large commented code blocks, adds new plotly_map function for interactive mapping |
| powerplantmatching/package_data/config.yaml | Adds comprehensive OSM configuration section with API settings, source mappings, and processing parameters |
| powerplantmatching/osm/workflow.py | Implements main workflow orchestration for OSM data processing pipeline |
| powerplantmatching/osm/utils.py | Provides utility functions for capacity parsing, country validation, and cache path management |
| powerplantmatching/osm/retrieval/regional.py | Implements regional download functionality for custom geographic areas |
| powerplantmatching/osm/retrieval/populate.py | Provides cache population utilities for batch processing multiple countries |
| powerplantmatching/osm/retrieval/client.py | Core Overpass API client with retry logic, caching, and progress tracking |
| powerplantmatching/osm/retrieval/__init__.py | Package initialization for retrieval module |
| powerplantmatching/osm/quality/rejection.py | Comprehensive rejection tracking system for data quality analysis |
| powerplantmatching/osm/quality/coverage.py | Cache coverage analysis and maintenance tools |
| powerplantmatching/osm/quality/__init__.py | Package initialization for quality module |
| powerplantmatching/osm/parsing/plants.py | Parser for OSM power plant relations with reconstruction capabilities |
Comments suppressed due to low confidence (2)
pyproject.toml:13
- Removing Python 3.9 support is a breaking change. Consider documenting this in release notes or maintaining backward compatibility if existing users depend on Python 3.9.

```toml
license = { file = "LICENSE" }
```

powerplantmatching/osm/parsing/plants.py:434
- Using `assert` for runtime validation in production code is problematic. The assertion could be disabled with `python -O`. Consider raising a proper exception like `ValueError` or `RuntimeError` instead.

```python
mismatch_ratio = (
```
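The replacement the suppressed comment asks for can be sketched as follows; the function name and the validated quantity are illustrative, since the PR's actual code is truncated here:

```python
def safe_ratio(mismatched: int, total: int) -> float:
    # Instead of `assert total > 0`, which `python -O` strips away,
    # raise an explicit exception that survives optimized runs
    if total <= 0:
        raise ValueError(f"total must be positive, got {total}")
    return mismatched / total
```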
```python
return False, None, "regex_error"
else:
    if regex_patterns is None:
        regex_patterns = [
```
Copilot AI (Jul 31, 2025)
The regex patterns for capacity parsing are hardcoded. Consider moving these to configuration to make them easily customizable without code changes.
I would not move these away, but add a comment explaining what each regex intends to match.
```python
while retries < self.max_retries:
    try:
        response = requests.post(
            self.api_url, data={"data": query}, timeout=self.timeout + 30
```
Copilot AI (Jul 31, 2025)
The timeout value is calculated as self.timeout + 30 but this could result in very long timeouts if self.timeout is large. Consider setting a maximum timeout limit to prevent indefinite waits.
Suggested change:

```python
self.api_url, data={"data": query}, timeout=min(self.timeout + 30, MAX_TIMEOUT)
```
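The suggested cap could be factored into a small helper; `MAX_TIMEOUT` and the function name are assumptions for illustration, not values from the PR:

```python
# Illustrative timeout cap per the Copilot suggestion; MAX_TIMEOUT is
# an assumed constant, not the PR's actual value.
MAX_TIMEOUT = 300  # seconds

def effective_timeout(base_timeout: int, margin: int = 30) -> int:
    # Give the HTTP layer headroom over the Overpass-side timeout,
    # but never wait longer than MAX_TIMEOUT
    return min(base_timeout + margin, MAX_TIMEOUT)
```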
```python
def _save_cache(self, cache_path: str, data: dict) -> None:
    """Save dictionary to JSON cache file."""
    cache_data = data if data else {}
```
Copilot AI (Jul 31, 2025)
[nitpick] The condition `data if data else {}` can be simplified to `data or {}` for better readability.

Suggested change:

```python
cache_data = data or {}
```
```python
rejections_to_process.extend(rejections)

for rejection in rejections_to_process:
    if rejection.coordinates is None or "cluster" in rejection.id.lower():
```
Copilot AI (Jul 31, 2025)
Using string matching with `"cluster" in rejection.id.lower()` is fragile. Consider adding a proper flag or enum value to identify cluster rejections more reliably.
It does not have an effect on the default powerplants.csv, as it is not added to the default matching sources. So at this stage it is purely about adding an optional source. However, @cdgaete, would you mind making a short comparison of the powerplants.csv with and without OSM (preferably for Europe)?
This is huge.
It increases the code base by a multiple of its current size and is hard to review.
Files which are not essential to the features of this PR are changed here as well.
I made some code remarks for better readability, and recommendations to split functions or reduce the scope of classes (by not including the config everywhere).
.pre-commit-config.yaml (outdated)

```yaml
# - repo: https://github.com/RobertCraigie/pyright-python
#   rev: v1.1.383
#   hooks:
#     - id: pyright
```
I would remove this unrelated change
Unrelated change.
```python
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
```
Instead of full removal, maybe add SPDX headers?
```python
logger.info(f"Total countries to process: {len(all_countries)}")

if dry_run:
    print("\nDry run - would download the following countries:")
```
Use logger instead of printing?
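The swap from `print` to the module logger could be sketched like this; the logger name and function are illustrative:

```python
import logging

logger = logging.getLogger("powerplantmatching.osm.populate")  # illustrative name

def report_dry_run(countries: list[str]) -> None:
    # Log instead of print so output respects the library's logging config
    # and can be silenced or redirected by the caller
    logger.info("Dry run - would download the following countries:")
    for name in countries:
        logger.info("  %s", name)
```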
```python
total_elements = total_plants + total_generators

if check_live_counts:
```
Maybe extract this branch into a separate function which updates the passed `cached_countries`?
```python
    )
    cached_countries[country_code]["cache_status"] = "error"

if return_data:
```
Instead of spending another 100 lines of code in this function, I would rather create a function which works on the output dict and just handles the printing. That way, we can spare the `return_data` parameter as well and receive cleaner code.
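The suggested split, computation in one function and presentation in another, might look like this; the function names and dict keys are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical refactor per the review: the analysis function only builds
# and returns the dict; a separate function handles presentation, so no
# return_data flag is needed.
def summarize_coverage(cached_countries: dict[str, dict]) -> dict[str, int]:
    return {
        "total": len(cached_countries),
        "errors": sum(
            1 for c in cached_countries.values() if c.get("cache_status") == "error"
        ),
    }

def report_coverage(summary: dict[str, int]) -> None:
    logger.info(
        "Cached countries: %d (%d with errors)", summary["total"], summary["errors"]
    )
```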
Looks good.
```python
    Minimum similarity ratio for common substrings
    """

    def __init__(self, config: dict[str, Any]):
```
This class only needs `name_similarity_threshold` - why give it the whole config?
```python
def __init__(
    self,
    config: dict[str, Any],
```
This class only needs the `min_generators_for_reconstruction` - why give it the whole config?
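The refactor both of these comments ask for can be sketched as follows; the class and config key are illustrative, not the PR's actual names:

```python
# Sketch of the requested refactor: pass the one value the class needs
# instead of the whole config dict. Class and key names are illustrative.

class NameMatcher:
    def __init__(self, name_similarity_threshold: float):
        # Only the single parameter is stored; the class no longer
        # depends on the shape of the global config
        self.threshold = name_similarity_threshold

    def is_similar(self, ratio: float) -> bool:
        return ratio >= self.threshold

# The caller unpacks the config once, at the boundary:
# matcher = NameMatcher(config["OSM"]["name_similarity_threshold"])
```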
…fig management
- Refactor classes to accept specific parameters instead of full config objects
- Split large functions in coverage.py into focused, single-purpose functions
- Replace print statements with proper logging throughout codebase
- Add SPDX license headers to all new files
- Replace runtime assertions with proper exception handling
- Enhance caching with diskcache for better memory management
- Add comprehensive European power plant analysis (39K plants, 850+ GW OSM contribution)
- Improve error handling and CSV corruption recovery
- Add omitted countries support for API limitations
Response to Review Feedback

Thank you @maurerle, @FabianHofmann, @fneum, and @lkstrp for the thorough reviews and constructive feedback. I've made substantial improvements to address the core concerns raised about code organization, configuration management, and maintainability.

Disclaimer: the huge jump in lines of code is associated with an HTML report in the documentation.

@maurerle's Review Comments

Configuration Dependencies Reduction:
Function Decomposition:
Logging Instead of Print Statements:
License Headers:
Unrelated Changes:

@copilot Comments

Runtime Assertions:
Breaking Changes Documentation:
Code Simplification:
Regex Pattern Documentation:

Additional Architectural Improvements

Enhanced Caching System:
Improved Error Handling:
Configuration Management:
Documentation and European Power Plant Analysis (addressing @FabianHofmann and @fneum's request for understanding OSM impact):

Comments Evaluated but Not Implemented

Timeout Management (Copilot AI suggestion):
String Matching for Rejection Types (Copilot AI suggestion):

Viewing the EU Analysis Report

The comprehensive European analysis is available in multiple formats: OSM_Data_Coverage_Analysis_documentation.pdf

Interactive Map: Clone the branch and open
Static Analysis: The downloadable script
Documentation: See

Please let me know if any specific areas need further clarification or if you'd like me to address any of the remaining optimization suggestions.
@cdgaete could you use scatter clustering for easier viewing of your map?
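One simple way to cluster map markers, independent of the plotting library, is grid binning; the sketch below is illustrative (libraries such as folium's MarkerCluster plugin provide equivalent behavior out of the box):

```python
from collections import defaultdict

# Illustrative grid-based clustering: bin plant coordinates into cells so a
# map shows one aggregated marker per cell instead of thousands of points.
def cluster_points(
    points: list[tuple[float, float]], cell_deg: float = 1.0
) -> list[tuple[float, float, int]]:
    bins: dict[tuple[int, int], list[tuple[float, float]]] = defaultdict(list)
    for lat, lon in points:
        bins[(int(lat // cell_deg), int(lon // cell_deg))].append((lat, lon))
    clusters = []
    for members in bins.values():
        # One marker at the centroid of each cell, sized by member count
        lat = sum(p[0] for p in members) / len(members)
        lon = sum(p[1] for p in members) / len(members)
        clusters.append((lat, lon, len(members)))
    return clusters
```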


Closes #12
Supersedes #225
Changes proposed in this Pull Request
Current Implementation Focus
The OSM module has been designed with several key architectural components:
Comprehensive Data Processing Pipeline: A structured workflow that processes OSM elements based on hierarchy (relations, ways, nodes), with specialized handling for each element type
Advanced Geometric Analysis: Utilizes spatial relationships to identify overlapping power plants and prevent duplication
Multi-level Data Extraction: Implements a cascading approach to extract capacity information:
Efficient Caching Architecture: Implements a multi-tiered caching system that significantly reduces API requests and improves performance, enabling quick analysis of large geographic areas
Rejection Tracking System: The rejection tracker serves two critical purposes:
Benefits
Checklist