Thursday, October 13, 2011

Lifecycle of a Node in Alfresco





The lifecycle of your average Alfresco 'node' isn't that complicated, is it? A node (perhaps a document) is created; it gets changed, hangs around for a while, and is finally deleted at some point. End of story, right? From a user perspective that's correct, but what's actually going on under the abstraction that is the 'Object Content Repository'?
What follows is a step-by-step journey of one humble node from birth to its final demise, at the level of the file-system and database entries.

The Content Repository: A quick reminder
Remember that the content repository is composed of three aspects. Without any one of these, Alfresco will fail to operate correctly (at least prior to version 4.0, where the lack of an index isn't the end of the world):
  1. The database which describes all the details about a content node/document.
  2. The file-system (or 'content-store'), which is where all file content is kept. The node information in the database holds a reference to each file. Note that every change to a file is written to a new file instead of overwriting an existing one.
  3. The Lucene search-index, which contains both information from the DB and the content on the file-system. If lost, this index can be rebuilt based on the presence of the other two aspects.
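The "never overwrite, always write a new file" behaviour of the content store can be illustrated with a minimal Python sketch. This is a toy model, not Alfresco code: the function name write_content is made up for illustration, and the year/month/day/hour/minute folder layout is simply copied from the example path shown in Step 1 of this article.

```python
import tempfile
import uuid
from datetime import datetime
from pathlib import Path


def write_content(store_root: Path, data: bytes, when: datetime) -> Path:
    """Write data to a brand-new file; existing files are never touched."""
    # Folder layout year/month/day/hour/minute (assumed from the example path).
    parts = (when.year, when.month, when.day, when.hour, when.minute)
    folder = store_root.joinpath(*(str(p) for p in parts))
    folder.mkdir(parents=True, exist_ok=True)
    # The file name is a fresh UUID, unrelated to the node's own UUID.
    target = folder / f"{uuid.uuid4()}.bin"
    target.write_bytes(data)
    return target


store = Path(tempfile.mkdtemp()) / "contentstore"
v1 = write_content(store, b"first draft", datetime(2011, 1, 1, 13, 14))
v2 = write_content(store, b"second draft", datetime(2011, 1, 1, 13, 20))
print(v1.relative_to(store))
print(v2.relative_to(store))
```

Editing the document produces a second file; the first one stays on disk untouched until some later clean-up decides it is no longer referenced.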
 

Step 1 - Create a Node
  • The user creates a new node in the repo called: "Toms Document.doc". The date is the 1st January 2011.
  • Under the hood, the 'nodeService.createNode' operation is called.
  • Where does the node live?
    • FS: It's on the file-system, under the folder: alf_data/contentstore/
      • The actual file content has been renamed and located somewhere like: alf_data/contentstore/2011/1/1/13/14/6e228904-d5d2-4a99-b7b1-8fe7c03c71f3.bin
      • Notice that the file content has been renamed using a new unrelated UUID. This is *not* the same UUID as the node-ref in the DB.
    • DB: It's in the database as: workspace://SpacesStore/4529b059-a4d1-4938-b0ce-ff7fce3a5c9a  
      •   QUERY to get UUID of a node:
SELECT * FROM alf_node as an, alf_node_properties as anp where an.id=anp.node_id and string_value like '%Toms Document%';
      • The node contains a property in the DB that allows it to find its content in the file-system content store. This can be found in alf_content_data and alf_content_url tables.     
      • QUERY to get Content’s URL in the File System:
SELECT anp.node_id, acu.content_url, an.uuid, an.audit_creator, an.audit_created
FROM alf_content_url as acu, alf_content_data as acd, alf_node_properties anp, alf_node as an
where acd.content_url_id = acu.id
and acd.id=anp.long_value
and anp.node_id= an.id
ORDER BY audit_created DESC;
    • IDX: It's in the search index found in folder alf_data/lucene-indexes/workspace/SpacesStore/
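To connect the DB and FS views, here is a small Python sketch that turns a content_url value into a path under the content store. It assumes the content_url values returned by the query above look like 'store://2011/1/1/13/14/<uuid>.bin'; check the actual values in your own alf_content_url table. The helper name content_url_to_path and the default store root are illustrative assumptions.

```python
from pathlib import PurePosixPath

STORE_PREFIX = "store://"  # assumed prefix of alf_content_url.content_url values


def content_url_to_path(content_url: str,
                        store_root: str = "alf_data/contentstore") -> PurePosixPath:
    """Map a content URL onto a file path below the content store root."""
    if not content_url.startswith(STORE_PREFIX):
        raise ValueError(f"unexpected content URL: {content_url!r}")
    # Everything after the prefix is the path relative to the store root.
    return PurePosixPath(store_root) / content_url[len(STORE_PREFIX):]


url = "store://2011/1/1/13/14/6e228904-d5d2-4a99-b7b1-8fe7c03c71f3.bin"
path = content_url_to_path(url)
print(path)
```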
Step 2 - User Deletion of the Node
  • The user deletes this node ("Toms Document.doc") using the web-UI, or the usual Alfresco services (all of which lead to nodeService.deleteNode). The date is the 1st February 2011.
  • Where does the node live now?
    • FS: On the file-system it lives in exactly the same place. i.e.: /alf_data/contentstore
      • Note: Once a file is written to the content store, it is never modified.
      • Note: I think various optimizations exist that may cause two nodes to point to the same file content (for example, a copy operation).
    • DB: In the database the document-node record is marked as living in a different store: archive://SpacesStore/4529b059-a4d1-4938-b0ce-ff7fce3a5c9a
      • Note: The record in the alf_node table has a store_id property that is changed from 6 to 5 (the actual store_id values depend on the Alfresco version, so run the queries below to find the exact store_id and store name).
      • QUERIES:
-- To get the store_id of a particular UUID.
SELECT * FROM alf_node where uuid='4529b059-a4d1-4938-b0ce-ff7fce3a5c9a';
-- To get all store names.
SELECT * FROM alf_store;
-- To get the store_id and store name of a particular UUID.
SELECT an.id, an.store_id, als.protocol, als.identifier, an.uuid
FROM alf_node as an, alf_store as als
where als.id = an.store_id
and an.uuid='4529b059-a4d1-4938-b0ce-ff7fce3a5c9a';
    • IDX: It's still in the search index, but it's moved and is now found in folder alf_data/lucene-indexes/archive/SpacesStore/
      • So, it's removed from one index and added to another.
  • From a user's perspective, at this point they can go to their user-profile (the 'Manage Deleted Items' part) and un-delete items, which moves them back to the store: workspace://SpacesStore.
    • Undeleting changes the store_id of the node in the DB, and moves it back to the correct Lucene index.
  • From a programmer's perspective, they can bypass step 2 completely and go straight on to Step 3, if they apply the 'cm:temporary' aspect before they call 'nodeService.deleteNode'.
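The key point of Step 2 is that the node keeps its UUID and only the store changes. A minimal Python sketch of that idea follows; the NodeRef class and the archive/restore helpers are a toy model written for this article, not the Alfresco Java API.

```python
from typing import NamedTuple


class NodeRef(NamedTuple):
    protocol: str    # e.g. "workspace" or "archive"
    identifier: str  # e.g. "SpacesStore"
    uuid: str

    @classmethod
    def parse(cls, text: str) -> "NodeRef":
        """Split 'protocol://identifier/uuid' into its three parts."""
        protocol, rest = text.split("://", 1)
        identifier, node_uuid = rest.split("/", 1)
        return cls(protocol, identifier, node_uuid)

    def __str__(self) -> str:
        return f"{self.protocol}://{self.identifier}/{self.uuid}"


def archive(ref: NodeRef) -> NodeRef:
    """User deletion: same UUID, different store."""
    return ref._replace(protocol="archive")


def restore(ref: NodeRef) -> NodeRef:
    """Un-delete: move the node back to the live store."""
    return ref._replace(protocol="workspace")


live = NodeRef.parse("workspace://SpacesStore/4529b059-a4d1-4938-b0ce-ff7fce3a5c9a")
trashed = archive(live)
print(trashed)
print(restore(trashed) == live)
```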
Step 3 - Empty the Trash
  • The date is the 1st March 2011 and the user decides to empty "Toms Document.doc" from the trash-can.
    • They can do this using the "Manage Deleted Items" link from the "User Profile" in the JSF Alfresco Explorer client, and selecting 'Purge'. Let's say they do it 30 days after deleting the node.
    • Alternatively, we can use the TrashCanCleaner scheduled task to automatically remove items older than 30 days from the 'trash-can' archive. Our automatic cleaner is based on the one in the forge but it's effectively the same action as the manual task the user can perform in the Alfresco Explorer UI (JSF client).
    • Under the hood, the 'archiveService.purge' operation is called.

  • Where does the node live now?
    • FS: On the file-system it lives in exactly the same place. i.e.: /alf_data/contentstore
    • DB: In the database you would expect it to be deleted, except it isn't: it's only *marked* as deleted.
      • The relevant row in the alf_node table has a field named 'node_deleted' which is set to '1' or 't' to indicate that it is a deleted node.
      • QUERY:
-- To check the node_deleted value.
SELECT * FROM alf_node where uuid='4529b059-a4d1-4938-b0ce-ff7fce3a5c9a';
      • Note: This update from 'node_deleted=0 or f' to 'node_deleted=1 or t' does not affect the timestamp value of the 'alf_node.audit_modified' field.
      • Alfresco now considers any related content file found in the file-system content-store to be an 'orphan'.
      • At the point where the 'node_deleted' field becomes '1' or 't', the orphan is declared by updating the 'orphan_time' field in the table 'alf_content_url' from NULL to the current timestamp. (You'll see that this is important for quickly identifying orphaned files later on).
      • QUERY:
-- To find all orphaned content.
SELECT * FROM alf_content_url where orphan_time is not null;
      • Pretty much all DB queries made by Alfresco will only consider rows where node_deleted=0 or f.
      • Auxiliary node information is in fact deleted right now (that is, *most* related rows in tables other than alf_node, such as alf_node_properties, alf_node_assoc, alf_child_assoc, etc).
    • IDX: The search index will be empty of this node. It is removed from all search indices.
      • It can be found in neither the 'alf_data/lucene-indexes/workspace/SpacesStore/' nor the 'alf_data/lucene-indexes/archive/SpacesStore/' Lucene index.
NOTE: The statements above about orphaned nodes hold up to Alfresco v3.3.5; please check the behaviour for Alfresco 3.4.x onwards. I tried this on Alfresco v3.4.4 by deleting a node from 'Manage Deleted Items': it did not keep the orphaned node in the content store or in the database, but deleted every entry for it everywhere. So please check and confirm for your version.
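The database side of the purge (as described for versions up to 3.3.5) can be sketched with an in-memory SQLite database in Python. This is a deliberately simplified model: the three tables and their columns are stand-ins invented for the example (the real schema links content through alf_content_data, for instance), and the purge function is hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, uuid TEXT, node_deleted INTEGER DEFAULT 0);
    CREATE TABLE node_properties (node_id INTEGER, name TEXT, string_value TEXT);
    CREATE TABLE content_url (id INTEGER PRIMARY KEY, node_id INTEGER,
                              content_url TEXT, orphan_time INTEGER);
    INSERT INTO node VALUES (1, '4529b059-a4d1-4938-b0ce-ff7fce3a5c9a', 0);
    INSERT INTO node_properties VALUES (1, 'cm:name', 'Toms Document.doc');
    INSERT INTO content_url VALUES
        (1, 1, 'store://2011/1/1/13/14/6e228904-d5d2-4a99-b7b1-8fe7c03c71f3.bin', NULL);
""")


def purge(conn: sqlite3.Connection, node_uuid: str, now_ms: int) -> None:
    """Mark the node row as deleted, drop auxiliary rows, flag the content as orphaned."""
    (node_id,) = conn.execute(
        "SELECT id FROM node WHERE uuid = ?", (node_uuid,)).fetchone()
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("UPDATE node SET node_deleted = 1 WHERE id = ?", (node_id,))
        conn.execute("DELETE FROM node_properties WHERE node_id = ?", (node_id,))
        conn.execute("UPDATE content_url SET orphan_time = ? WHERE node_id = ?",
                     (now_ms, node_id))


PURGE_TIME_MS = 1298937600000  # 1st March 2011, 00:00 UTC
purge(db, "4529b059-a4d1-4938-b0ce-ff7fce3a5c9a", PURGE_TIME_MS)
print(db.execute(
    "SELECT content_url, orphan_time FROM content_url "
    "WHERE orphan_time IS NOT NULL").fetchall())
```

After the call the node row still exists (flagged as deleted), its properties are gone, and the content URL row carries an orphan_time for the cleaner in Step 4 to pick up.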

Step 4 - Die Content File, Die!
  • A scheduled orphan-cleaner job activates (that is, contentStoreCleanerTrigger fires the contentStoreCleaner bean, embodied by 'org.alfresco.repo.content.cleanup.ContentStoreCleaner.java').
    • This schedule executes every day at 4am, by default.
    • This orphaned file cleaner doesn't act on orphans immediately; it waits for a period of X protected days.
    • The query to find orphans (query template key 'select_ContentUrlsOrphaned') looks within alf_content_url for 'orphan_time' field values more than 14 days old.
    • See the XML configuration for 'contentStoreCleaner' and the 'protectDays' attribute for the number of days it waits before acting on content files.
    • It does not actually delete the content files; instead it simply moves them out of the '/alf_data/contentstore' folder location and into the folder 'alf_data/contentstore.deleted'.
    • Assuming that "Toms Document.doc" was orphaned 14 days ago, the date is now 14th March 2011.
    • After moving an orphaned content file out of the active content-store, the relevant row in the alf_content_url table is deleted from the DB.
    • Note: there is *not* a crazy folder-tree 'file crawl' process looking for content files, performing lookups to determine if they're orphans eligible for removal.
  • Orphaned content files that have been removed from the content store sit in the 'contentstore.deleted' folder forever, until a System Administrator backs it up, moves it, or deletes it.
  • More info:
    • The contentStoreCleaner bean is defined in the ALFRESCO_HOME\tomcat\webapps\alfresco\WEB-INF\classes\alfresco\content-services-context.xml file.
    • The contentStoreCleanerTrigger bean is defined in the ALFRESCO_HOME\tomcat\webapps\alfresco\WEB-INF\classes\alfresco\scheduled-jobs-context.xml file.
    • The system.content.orphanCleanup.cronExpression property is defined in the ALFRESCO_HOME\tomcat\webapps\alfresco\WEB-INF\classes\alfresco\repository.properties file.
  • Where does the node live now?
    • FS: Gone. On the file system, it's finally gone from the content store, although it still does exist in a folder outside the content store, that's safe to delete.
    • DB: It is *still* in the DB
      • It's still unchanged in the alf_node table, with its 'node_deleted' field property still set to '1'.
      • However, any reference in the alf_content_url table has been removed.
    • IDX: The search index is unchanged: it still doesn't exist in any search index.
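The behaviour described in Step 4 can be sketched in Python as follows. This is an illustration of the logic only, not the actual ContentStoreCleaner implementation: the function clean_orphans, the row dictionaries, and the 'store://' URL prefix are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path

DAY_MS = 24 * 60 * 60 * 1000
STORE_PREFIX = "store://"  # assumed content URL prefix


def clean_orphans(url_rows, store_root, deleted_root, now_ms, protect_days=14):
    """Move files orphaned for longer than protect_days; return the rows that remain."""
    cutoff = now_ms - protect_days * DAY_MS
    kept = []
    for row in url_rows:
        orphan_time = row["orphan_time"]
        if orphan_time is None or orphan_time >= cutoff:
            kept.append(row)  # still referenced, or still inside the protected window
            continue
        relative = row["content_url"][len(STORE_PREFIX):]
        source, target = store_root / relative, deleted_root / relative
        if source.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(target))  # moved, not deleted
        # The row is not kept, which models its removal from alf_content_url.
    return kept


root = Path(tempfile.mkdtemp())
store, deleted = root / "contentstore", root / "contentstore.deleted"
relative = "2011/1/1/13/14/6e228904-d5d2-4a99-b7b1-8fe7c03c71f3.bin"
(store / relative).parent.mkdir(parents=True)
(store / relative).write_bytes(b"Toms Document")

rows = [
    {"content_url": STORE_PREFIX + relative, "orphan_time": 0},
    {"content_url": STORE_PREFIX + "2011/2/2/9/30/live.bin", "orphan_time": None},
]
remaining = clean_orphans(rows, store, deleted, now_ms=15 * DAY_MS)
print(len(remaining), (deleted / relative).exists())
```

Note that the work is driven entirely by the list of orphaned rows; nothing walks the folder tree looking for candidates.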
Step 5 – Be gone from the Database!
  • A separate scheduled job runs to tidy up the database
    • This clean-up job executes every day at 21:00 (bean 'nodeServiceCleanupTrigger' leading to bean 'nodeServiceCleanupJobDetail'), and performs the work found inside 'DeletedNodeCleanupWorker'.
    • After 30 days from when the 'node_deleted' field was set to '1', this process considers it safe to truly delete the node with a call to the DAO service purge.
    • Note: it doesn't use the audit_modified date, since this wasn't changed when the row was marked for deletion. Instead, it uses the commit_time_ms transaction time from the alf_transaction table.
    • Note: this job also removes old transactions from the alf_transaction table. Transactions are considered old using the same threshold as the node removal work: 30 days, defined by the property 'index.tracking.minRecordPurgeAgeDays'.
  • Where does the node live now?
    • FS: Gone (it finally went in step 4).
    • DB: Gone (just now in step 5).
    • IDX: Gone (it went in step 3).
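The age check in Step 5 boils down to comparing a transaction commit time in milliseconds against a cutoff. A minimal Python sketch, assuming the 30-day default mentioned above; the function names are made up for illustration and are not taken from DeletedNodeCleanupWorker.

```python
from datetime import datetime, timedelta, timezone


def purge_cutoff_ms(now: datetime, min_purge_age_days: int = 30) -> int:
    """Commit times (epoch milliseconds) below this value count as old enough."""
    return int((now - timedelta(days=min_purge_age_days)).timestamp() * 1000)


def is_purgeable(node_deleted: bool, commit_time_ms: int, now: datetime,
                 min_purge_age_days: int = 30) -> bool:
    """Only rows already marked deleted, in a sufficiently old transaction, qualify."""
    return bool(node_deleted) and commit_time_ms < purge_cutoff_ms(now, min_purge_age_days)


# The transaction that marked the node as deleted, on 1st March 2011 (UTC).
marked_deleted_ms = int(datetime(2011, 3, 1, tzinfo=timezone.utc).timestamp() * 1000)

print(is_purgeable(True, marked_deleted_ms, datetime(2011, 3, 20, 21, 0, tzinfo=timezone.utc)))
print(is_purgeable(True, marked_deleted_ms, datetime(2011, 4, 5, 21, 0, tzinfo=timezone.utc)))
```

The same cutoff would also apply to old rows in alf_transaction, since the article states that both clean-ups share the one property.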
Finally the Node is Truly Gone
So, 14 days after a node is removed from the archive store, its content is taken out of the content store on the file-system, and after a further 15 days (approx.) the node is finally removed from the database too.

So, why aren't nodes deleted from the DB immediately?
The 'content_url' information is needed for a while until the orphan content cleaner is happy to finally remove the content file. This alf_content_url table is the only way to find content files to remove from the file-system.

Once the trash is emptied, there's no going back, so why does it not simply remove the content file there and then, instead of relying on an orphan-cleaner removing it 14 days later? This is because in backup-recovery situations, you can restore an old 'DB-and-Index' content-set but you don't need to also revert the file-system contents to an earlier version. This useful ability (the file content-store could be huge and time-consuming to restore) is possible since the file content-store will still have the files that have been deleted since the DB snapshot was taken; assuming that snapshot is not older than 14 days. Note: You may have a number of recently created extraneous, inaccessible files, but everything will be in a consistent state.
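The backup-recovery argument above reduces to one comparison: a DB snapshot can be restored against the current content store as long as the snapshot is no older than the cleaner's protected period. A tiny Python sketch, with a hypothetical function name and the 14-day default from Step 4:

```python
from datetime import datetime, timedelta


def content_store_still_covers(snapshot_taken: datetime, now: datetime,
                               protect_days: int = 14) -> bool:
    """True if every file referenced by the snapshot should still be in the live store."""
    # Anything orphaned after the snapshot has been an orphan for less than
    # protect_days, so the cleaner has not moved it out yet.
    return now - snapshot_taken <= timedelta(days=protect_days)


now = datetime(2011, 3, 20)
print(content_store_still_covers(datetime(2011, 3, 10), now))
print(content_store_still_covers(datetime(2011, 3, 1), now))
```

For older snapshots, the missing files would have to be recovered from 'contentstore.deleted' or from a file-system backup.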

The row in the alf_node table is kept for a number of reasons, one of which is that Lucene indexes are incrementally rebuilt (in backup-recovery scenarios, usually from a nightly backup) by replaying transactions recorded in the alf_transaction table. That transaction points to a row in the alf_node table, and if it's marked as deleted then the index can be correctly updated with the fact that the node has been deleted since the index snapshot was taken (the associated Lucene document deleted).

If you try to perform an incremental index recovery based on a backed-up Lucene index that's older than the oldest row in the alf_transaction table, it will fail and perform a full re-index based on the full DB tables instead.
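The choice between the two recovery paths can be sketched in Python as a single comparison. The function name and the returned labels are illustrative only; they are not Alfresco configuration values.

```python
def choose_recovery(index_last_txn_ms: int, oldest_retained_txn_ms: int) -> str:
    """Pick a recovery strategy for a restored Lucene index backup."""
    # If the backup predates the oldest surviving alf_transaction row, some
    # transactions needed for replay have already been purged.
    if index_last_txn_ms < oldest_retained_txn_ms:
        return "full-reindex"
    return "incremental"


DAY_MS = 24 * 60 * 60 * 1000
oldest_retained = 100 * DAY_MS  # hypothetical commit time of the oldest kept transaction

print(choose_recovery(index_last_txn_ms=120 * DAY_MS, oldest_retained_txn_ms=oldest_retained))
print(choose_recovery(index_last_txn_ms=60 * DAY_MS, oldest_retained_txn_ms=oldest_retained))
```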

So you see, there are good reasons for a node to take this long journey to its final death.

Summary:


[Figure: Alfresco Node Lifecycle]