dfmorrison/naperformances
The script naperformances.py is used to extract North American change ringing performances from Bellboard. It can be run standalone, emitting formatted versions of the relevant performances to standard out, or it can update a MySQL database.

It requires Python 3. Probably any version 3.x will do, but it has primarily been developed using 3.4 and 3.5. In addition, if it will be updating a MySQL database the module pymysql is required.

To run it standalone simply run

    python3 naperformances.py

Command line arguments can be supplied to specify a date range from which to extract performances and how to sort the results. The --from argument takes a date in the form yyyy-mm-dd and extracts only performances rung on or after that date. The --to argument takes a date and extracts only performances rung on or before that date. The --changed-before and --changed-after arguments are similar, but (a) they work on the date a report was entered or most recently modified, and (b) --changed-before, unlike --to, is not inclusive. The --days argument is an alternative to --from, and sets the --from date to that many days before today. The default values of these arguments are 120 days for --days, and today for --to. The --sort argument should be one of chronological, bbid, or clapper; if not supplied it defaults to clapper.

To update a database no command line arguments are used, but a config file should be installed in /etc/naperformances_credentials.py giving the name of the database, and the user and password for logging into it. A template of this file is provided. When run in this mode, typically as a cron job or something similar, the date range is the default number of days back to today. If desired, the number of days back used can also be modified in /etc/naperformances_credentials.py.

The file db.sql contains SQL commands for creating a database of the correct form for use with this script.
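As an illustration of how the date arguments described above combine, here is a minimal sketch. It is not the script's actual argument handling: the function name resolve_range is hypothetical, and the choice to let --from take precedence over --days when both are given is an assumption.

```python
import argparse
from datetime import date, datetime, timedelta

DEFAULT_DAYS = 120  # default look-back stated in the text above


def parse_date(text):
    """Parse a yyyy-mm-dd string into a date."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def resolve_range(argv, today=None):
    """Return (from_date, to_date) for the given command line arguments.

    Hypothetical helper: --from wins over --days if both are supplied,
    which is an assumption, not documented behaviour of the script.
    """
    today = today or date.today()
    parser = argparse.ArgumentParser()
    parser.add_argument("--from", dest="from_date", type=parse_date)
    parser.add_argument("--to", dest="to_date", type=parse_date)
    parser.add_argument("--days", type=int)
    args = parser.parse_args(argv)

    to_date = args.to_date or today
    if args.from_date is not None:
        from_date = args.from_date
    else:
        days = args.days if args.days is not None else DEFAULT_DAYS
        from_date = today - timedelta(days=days)
    return from_date, to_date
```

With no arguments this yields the range from 120 days ago up to today, which matches the range described for the database-updating mode.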
----------------------------------------------------------------------

This describes a couple of the major choices made, and why they were made.

The overall goal is to maintain a database for the NAGCR web site of peal and quarter peal reports relevant to North American ringing, extracted from Bellboard. Relevant performances are those rung anywhere in the world for the North American Guild or the Abraham Lincoln Society, or rung anywhere in North America, regardless of association. And these should include handbell performances. This parallel sub-database should reflect updates and deletions to the data in Bellboard.

The two major issues to be confronted are (a) how to identify performances that are relevant to North American ringing, and (b) how to maintain a parallel sub-database in sync with Bellboard.

Part (a):

Because of the complexity of deciding whether or not a performance is relevant to North American ringing, a fixed query to Bellboard expected to match just such performances is impossible. Instead we query all performances since a certain date (more on this, below), and filter them. Note that while we can only query Bellboard for a page-sized chunk of performances at a time, we abstract this into an iterator, and deal with things on our side as if it were a single, long list of performances.

For deciding which performances are relevant we match regular expressions against four different fields of the returned performances: the association, and the place, county and dedication of the place-name.

The matching against association is simple: if the association contains the word "American" or the word "Abraham" the performance is declared relevant. For those not matching this category, things are more complicated.

The two principal regular expressions for matching against places are both essentially lists of relevant words; one collection of words is matched against the place field of a place name, and the other against the county field.
All the county words are also included in the place list, reflecting the fact that folks are inconsistent in how they enter things, sometimes entering state or province names as "counties", and sometimes including them in the place, and putting a country name in the county field. And, of course, there are plenty of further inconsistencies.

The words in the county list include all the spelled out names, and two-letter abbreviations (and a few other abbreviations, such as "PEI"), of all the states of the United States, of all the provinces and territories of Canada, and a variety of country name words such as "America", "Canada", and "USA". The places list augments these words with the names of towns in North America containing towers.

Unfortunately, these are not unambiguous. For example, "WA" is used for the state of Washington in the United States, but also for Western Australia. Thus, for a variety of words in this list there are some ancillary include/exclude regular expressions, that are consulted only if the corresponding word from the main list has matched. In the case of WA, we note any performances with a county of WA as relevant to North America only if they do not include the words Bunbury, Mandurah, Perth, Rockingham or York, the names of the towns in Western Australia containing towers. Note that we will still signal a false positive for a handbell quarter rung in a different town of Western Australia, though.

It is important to note that these include/exclude expressions are only applied if the corresponding match from the main expression has been found. For example, if we match "Rochester", then we exclude any performances that also include "Kent" in the place/county name, but it would be incorrect to exclude the word "Kent" from place names in general, since there is a tower in Kent, Connecticut.

Further examples of include/exclude restrictions include: Boston is considered to match only if the dedication does not include "Botolph".
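A minimal sketch of how the include/exclude restrictions described above might be applied. The word lists here are tiny illustrative samples (the real lists are far longer), performances are represented as plain dicts for convenience, and the names RULES, word_counts and is_relevant are hypothetical rather than taken from naperformances.py.

```python
import re

# Illustrative samples only; the real script's lists are much longer.
ASSOCIATION_RE = re.compile(r"\b(American|Abraham)\b")
COUNTY_RE = re.compile(r"\b(WA|MA|NY|CT|USA|Canada)\b")
PLACE_RE = re.compile(
    r"\b(Boston|Rochester|Northampton|Kent|WA|MA|NY|CT|USA|Canada)\b")

# Ancillary rules, consulted only after the keyed word has matched.
# word -> (kind, fields to inspect, pattern)
RULES = {
    "WA": ("exclude", ("place", "county"),
           re.compile(r"\b(Bunbury|Mandurah|Perth|Rockingham|York)\b")),
    "Rochester": ("exclude", ("place", "county"), re.compile(r"\bKent\b")),
    "Boston": ("exclude", ("dedication",), re.compile(r"\bBotolph\b")),
    "Northampton": ("include", ("dedication",),
                    re.compile(r"\b(Smith|Mendenhall)\b")),
}


def word_counts(word, perf):
    """Decide whether a matched word survives its include/exclude rule."""
    rule = RULES.get(word)
    if rule is None:
        return True  # no ancillary rule: the match stands
    kind, fields, pattern = rule
    found = any(pattern.search(perf.get(f, "")) for f in fields)
    return found if kind == "include" else not found


def is_relevant(perf):
    """Apply the association test, then the place and county word lists."""
    if ASSOCIATION_RE.search(perf.get("association", "")):
        return True
    words = [m.group(0) for m in PLACE_RE.finditer(perf.get("place", ""))]
    words += [m.group(0) for m in COUNTY_RE.finditer(perf.get("county", ""))]
    return any(word_counts(w, perf) for w in words)
```

Because the rules are keyed by the matched word, "Kent" only disqualifies a performance that matched on "Rochester"; a performance at Kent, CT still matches on its own.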
And Northampton is considered to match only if the dedication does include either "Smith" or "Mendenhall".

It is worth noting that as new towers are added in North America, the words used for matching in the code will need to be augmented.

Part (b):

I had originally hoped to be able to simply track changes to records in Bellboard, probably using the changed_since and deleted parameters to queries against it. However, the more I tried to figure out how to make that work reliably, the harder it seemed. Among the most problematic difficulties is that an id in Bellboard does not identify a performance, in the sense of a bunch of ringers ringing a number of rows at a place and time; rather an id identifies a particular revision of a report of such a performance. As a report is edited it is supplanted by one with a new id. The chain of ids is only visible in one direction: when requesting a report by id, if it has been supplanted, the new one will be returned instead of the old one. However, there is no way to link from the XML describing the new one back to the old id. I'm pretty sure any attempt to keep things in sync this way will require making one or more queries over the network for each individual performance that might have been updated, an awkward procedure.

(If anyone else wants to try this, it is worth noting that you cannot use export.php to do this, either. For some reason export.php doesn't know about the id parameter. Instead, you must use view.php, with content negotiation to ensure XML is returned instead of HTML.)

Anyway, once my head had exploded sufficiently trying to make this work, I settled on an alternative scheme. I replace all performances in our local copy that were rung (not entered, or modified, rung) in the last N days (N is currently 120) with all those in Bellboard rung during this period.
In most cases I'm simply replacing what's in the local database with an exact duplicate, but if there are additions, deletions or modifications the result will be different. The expectation is that this script will be run on a periodic schedule. I suggest once a day. During the wee hours, North American time, is probably best.

This is not foolproof. It does not pick up any modifications made to performances that were rung more than N days ago. Fortunately, I believe most modifications are made within a few weeks of something being rung, so we will miss little here. Similarly, changes to the date that cross the N-days-ago boundary may in rare instances result in a performance being lost or duplicated (if duplicated, the copies will have different ids, though, representing different revisions in Bellboard). And it does not pick up any that are entered long after they were rung (though such performances hardly count as news, and so are less relevant to the NAGCR web site :-). But I think on balance it is good enough.

In general, there are plenty of ways for both halves of this scheme to fail, but I think such failures will be rare. In fact, I believe they will be considerably rarer than failures of our old, Campanophile-based scheme, which depended upon folks always using the magic form to enter things, which they did not; and upon non-North American folks not stumbling over the magic form and accidentally using it for non-North American performances, which did happen occasionally.

A further potential problem is that the version of MySQL currently in use on the NAG web server does not support transactions. Whenever it is upgraded to one that does, the function replacePerformances in the script should be revised to use them to ensure atomicity of the deletion of recent performances and their re-insertion into the database.
As it currently stands there is a race condition where an interruption at just the wrong time could leave the database in a state where recent performances have been deleted but not yet replaced.

A virtue of the new scheme, however, is that it is largely self-healing. If something goes catastrophically wrong one day, simply re-running the script will generally bring things back into sync, modulo a day or two around N days ago. In particular, if the race condition referred to in the previous paragraph were to be tripped over, the missing performances (sans those rung exactly N days ago) would automatically be replaced the next day. And even that gap could be refilled easily simply by running the script once by hand, with N increased slightly.
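For whenever transactions do become available, here is a rough sketch of what an atomic delete-and-reinsert could look like. It is not the script's replacePerformances: the function name replace_performances_atomically, the table name performances and its columns are all hypothetical. To keep the example self-contained it uses the standard library sqlite3 module; with pymysql the same commit/rollback pattern applies, but the parameter placeholder is %s rather than ?.

```python
import sqlite3
from datetime import date


def replace_performances_atomically(conn, fresh_rows, cutoff):
    """Replace every row rung on or after cutoff with fresh_rows.

    The delete and the inserts are committed together, so an
    interruption or error leaves the previous contents in place.
    fresh_rows is a list of (bbid, rung_on, title) tuples; this
    schema is an assumption made for illustration.
    """
    cur = conn.cursor()
    try:
        cur.execute("DELETE FROM performances WHERE rung_on >= ?",
                    (cutoff.isoformat(),))
        cur.executemany(
            "INSERT INTO performances (bbid, rung_on, title) "
            "VALUES (?, ?, ?)",
            fresh_rows)
        conn.commit()
    except Exception:
        conn.rollback()  # undo the delete as well as any partial inserts
        raise
```

If the inserts fail part way through, the rollback restores the deleted rows, so the window in which recent performances are missing disappears.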