1st post

June 10, 2011

I’ve created this blog to share my experiences with the Autonomy Interwoven 8.5 (IDOL) indexer product. I’ve been working with this product for about a year and have many things to share, most of which would have been helpful to me! I seem to be gifted at causing problems or stumbling upon them, which in turn means I have to find solutions, most of which I was not able to simply Google.

This blog will be focused on the Autonomy Interwoven Indexer 8.5, which uses IDOL. Most of the IDOL components are the same as those found in “generic” IDOL: IDOL server, DIH, DAH, and the Content Engines. We also have the Connector and Ingestion services, which seem to be unique to Interwoven IDOL.

I’ll try to blog about things you will not find elsewhere.

You’ll find some very good “generic” IDOL information on the ITToolbox.com web site here:

http://knowledgemanagement.ittoolbox.com/groups/technical-functional/autonomy-l/

Other sources of information are the Autonomy customer support forums; I think you need to be a customer or an Autonomy partner to access these.

I also think there is some IDOL discussion in LinkedIn groups, but I have not had any success joining those.

Naim.

A benchmark platform

May 9, 2013

The difficulty with benchmarking WorkSite IDOL begins with finding suitable hardware. Our firm has virtualized 95% of its workload in a vSphere environment with an iSCSI SAN, but I don’t want to run IDOL in a shared environment.

That leaves me with a standard firm-issued Lenovo M58p Core 2 desktop computer. Not exactly cutting edge!

Enter eBay: I was able to find a used Dell Precision Workstation T7500 on the Dell Financial Services eBay store for a few hundred dollars. After a BIOS update and a gamble on unsupported DDR3 RAM, I now have a Xeon X5560 CPU with 96 GB of RAM.

Why in the world would I want so much RAM? Very simple: we don’t have access to exotic storage such as a SAS RAID card with SAS SSDs, or PCIe flash storage cards such as the Fusion-io card.

But I want to see what IDOL, or more precisely Content Engine, performance looks like during searching and indexing on the fastest type of storage: a RAM disk. My assumption is that any PCIe flash based storage will always be slower than a RAM disk in terms of latency and bandwidth (it turns out this is probably an incorrect assumption).

After testing a few RAM disk products I settled on the one from SoftPerfect, as it had the best overall performance. This is a direct-I/O type of RAM disk, as opposed to a SCSI-disk type such as the StarWind product. The direct-I/O products offer better performance, but they don’t work with many utilities that expect the device to present itself as a physical disk.

Using Anvil Storage Utilities RC6, the SoftPerfect RAM disk measured as follows:

READ 4K QD16: 422,120 IOPS (1688 MB/s)
WRITE 4K QD16: 450,142 IOPS (1758 MB/s)
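
If you want a quick sanity check without Anvil, a few lines of Python can give a ballpark sequential throughput number for the RAM disk. This is only a rough sequential test, not a 4K random I/O benchmark, and the R: drive letter is just a placeholder for wherever your RAM disk is mounted:

# Crude sequential throughput check for a RAM disk volume.
# Assumption: the RAM disk is mounted as R:\ -- adjust the path to your setup.
import os
import time

path = r"R:\throughput_test.bin"
size_mb = 1024                        # write and read back 1 GB
block = b"\0" * (1024 * 1024)         # 1 MB blocks

start = time.perf_counter()
with open(path, "wb", buffering=0) as f:
    for _ in range(size_mb):
        f.write(block)
write_mb_s = size_mb / (time.perf_counter() - start)

start = time.perf_counter()
with open(path, "rb", buffering=0) as f:
    while f.read(1024 * 1024):
        pass
read_mb_s = size_mb / (time.perf_counter() - start)

os.remove(path)
print("sequential write: %.0f MB/s, sequential read: %.0f MB/s" % (write_mb_s, read_mb_s))

As a cross-check on the Anvil numbers, 422,120 IOPS at 4 KB per I/O works out to roughly 1688 MB/s, so the IOPS and MB/s figures above are consistent.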

The Fusion-io ioDrive2 2.4 TB card has the following published specs:

READ 4K: 480,000 IOPS
WRITE 4K: 490,000 IOPS

After a lot of testing I realized that RAM disk performance is highly dependent on the processor. Benchmarks run on newer Sandy Bridge Xeons can achieve almost 1 million IOPS. Having a load on the CPU also decreases RAM disk performance.

I am optimistic that the RAM disk should achieve performance similar to a Fusion-io card (more on this later).

Considering that generic server-grade DDR3 RDIMMs can be found for around $8 per GB, versus $15 per GB for flash storage such as the Fusion-io card, one could use a RAM disk for the Active Content Engine and get the read/write performance, since the data contained in an Active Content engine is exported out every 24 hours anyway. As another benefit, write cycles on the flash storage would not be wasted.
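
To put the price difference in perspective, here is the back-of-the-envelope arithmetic using the per-GB prices quoted above (the 100 GB Active Content size is just an illustrative assumption):

# Rough cost comparison for hosting the Active Content engine on RAM vs flash.
# Assumption: a 100 GB Active Content engine; prices are the $/GB quoted above.
ram_per_gb = 8.0        # server grade DDR3 RDIMM
flash_per_gb = 15.0     # Fusion-io class flash storage
engine_gb = 100

print("RAM disk cost:  $%.0f" % (engine_gb * ram_per_gb))     # $800
print("Flash SSD cost: $%.0f" % (engine_gb * flash_per_gb))   # $1500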

Naim

Continued …

April 8, 2013

Circumstances have changed and I am back to where I left off in 2012: thinking about and working on implementing the WorkSite IDOL indexers.

What’s happened in the HP-Autonomy WorkSite Indexer landscape? Not much.

We are still waiting for HP-Autonomy to release WorkSite Indexer 8.5 SP3, which should bring the public release of new features such as the Workspace and Metadata Content Engines, along with newer binary files.

The new Content Engines will lower query response times for workspace and metadata queries simply because they are specialized, configured only to index and search a smaller dataset.

Our firm is on track to deploy the IDOL indexer this year, with a proper budget for a solution using dedicated hardware on the server and storage side.

I’ve been working on a plan to evaluate the performance of WorkSite IDOL on different hardware setups, and that’s what I’ll be writing about in the following weeks.

Naim.

End of my blog on IDOL

March 9, 2012

In 2011 my past employer decided to pull the plug on our IDOL adventure. After my SSD experiment it became clear that we would have needed to spend some serious coinage on hardware to run IDOL on a physical platform with dedicated local SSD-based storage, as opposed to running IDOL in a virtualized environment with SAN storage.

Our budget request was denied, so it was decided, with reason, that our IDOL system would be dismantled to free up resources in the virtual environment and on the SAN. The DMS system would continue running off the old 8.2 Verity indexer.

I had gotten so absorbed in the IDOL indexer that I was rather disappointed in the outcome of all our hard work.

Today I have moved on to a new firm in a new role. I am now working with the OpenText eDOCS DM system… which, guess what, also has a full text indexer. Perhaps I will one day be required to blog about that indexer.

Naim

IDOL high performance at a low cost: are SATA SSDs the answer?

October 31, 2011

We’ve been running our IDOL Content Engines on a Dell EqualLogic SAN, without following the Autonomy recommendation for disks, which is as follows: use dedicated 15K disks for each Content Engine, so that the Autonomy Perf tool reports 150 MB/s reads and 100 MB/s writes on all Content Engine volumes when running concurrently.

We have 7 Content Engines loaded up with about 6.5 million documents each; the volumes they reside on report individual performance numbers of about 40 MB/s read and 30 MB/s write. Clearly, with this disk performance, we are still not running IDOL in production. Searches such as "show my emails from a matter" will take over 60 seconds, which with default settings causes IDOL, the WorkSite servers, and the FileSite/DeskSite clients to all time out.

Trying to find cost-effective alternatives, we ran an experiment using consumer SATA SSDs. We copied our Active Content Engine and a regular Content Engine to a desktop computer: a Core 2 processor with 8 GB of RAM and two Crucial M4 120 GB SSD drives, one for the Active Content and the other for Content Engine 1. The SATA controller on this motherboard is of an older generation, so the full potential of our M4s was not exploited; SATA II limits bandwidth to 300 MB/s per SATA device.

We set up a DIH mirror so both the original and the SSD Active Content Engine run in parallel (getting the same indexing jobs). I also added the SSD Active Content and Content Engine to the DAH configuration, so search queries would go to both the original Content Engines and the clones running on the SSDs.

Let’s talk about performance. The Autonomy Perf tool reported the following metrics for each SSD; the Perf tool was running on both simultaneously.

File write speed: 162.41 MB/s
File read speed: 223.19 MB/s

How does this translate into actual IDOL operations?

Search performance:
Take for example a full text search of the term “0” (this is the most taxing search, as zero is the most frequently recurring term on all Content Engines).

AC on SSD with 300k docs: 1 second
CE1 on SSD with 6.5 million docs: 4 seconds
CE1 on SAN with 6.5 million docs: 18 seconds (this was the fastest result for a SAN-based CE; other engines reported times of up to 60 seconds)

A search query on the SAN takes 350% longer than on the SSD.

Indexing:
Indexing 100k documents on the SAN CE takes 220 seconds.
Indexing the same 100k documents on the SSD CE takes 85 seconds.

It takes about 160% longer to index on the SAN CE than on the SSD CE.
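
For clarity, the “percent longer” figures come straight from the measured times:

# How the "X% longer" numbers above are derived from the measured timings.
def pct_longer(slow_s, fast_s):
    return (slow_s - fast_s) / fast_s * 100

print("search:   %.0f%% longer" % pct_longer(18, 4))     # SAN 18 s vs SSD 4 s  -> 350%
print("indexing: %.0f%% longer" % pct_longer(220, 85))   # SAN 220 s vs SSD 85 s -> ~159%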

Maintenance:
The DRECOMPACT command takes an average of about 14 hours on each SAN Content Engine, while the same compact takes about 1 hour to complete on the SSD CE.

These numbers are very good. We then have to ask ourselves how long the SSDs will last in an IDOL environment. The life expectancy of an SSD is derived from the total amount of data that can be written to the disk before the flash cells start dying. This is called the write endurance, and this number is sometimes published by the SSD manufacturers. Some SSDs will report the total amount of data written to the disk in a S.M.A.R.T. attribute.

The consumer Crucial M4 120 GB disks are rated for a write endurance of 72 TB, meaning that flash cells will start to die after this point. This does not mean the disk will immediately fail, but performance will start to degrade and eventually the drive will die.

With IDOL running for a week, we measured the amount of Windows host writes to both the Active Content and Content Engine 1 (disk reads do not affect SSD longevity).

Over a 5-day span we averaged, per day:

500 GB of writes for the Active Content engine, which means our M4 SSD would start to die after about 144 days.
80 GB of writes for the Content Engine, which means our M4 SSD would start to die after about 878 days.
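
The lifetime math is simple enough. A small sketch, assuming the figures above are average writes per day and using the 72 TB rated endurance (my rounded result for the regular Content Engine comes out near 900 days; the 878 above presumably comes from a more precise daily average):

# Rough SSD wear-out estimate from rated write endurance and daily host writes.
# Assumption: the 500 GB / 80 GB figures above are average writes per day.
ENDURANCE_GB = 72 * 1000    # Crucial M4 120 GB rated for 72 TB written

def days_to_wearout(gb_written_per_day):
    return ENDURANCE_GB / gb_written_per_day

print("Active Content SSD: %.0f days" % days_to_wearout(500))   # ~144 days
print("Content Engine SSD: %.0f days" % days_to_wearout(80))    # ~900 days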

The amount of writes per day will depend on the number of documents indexed per day and the frequency of the sync to disk (which can be configured). Because we are running in Active Content mode, the other Content Engines only add new documents once per day, when the sync tool runs.

We foresee that a production build would require enterprise-grade SSDs. To keep costs down, we think enterprise SATA SSDs will do fine. In a proposed build we would have 6 Content Engines running on one physical server (this is the boilerplate recommendation you get from the Autonomy sizing team). We would have 12 SSDs configured as 6 RAID 1 volumes. This would be done with a SAS PCIe 2.0 RAID card; SAS RAID cards such as those from HP and LSI support SATA disks, you just need the fan-out adapter cables.

We are currently trying to figure out the best cost/longevity SSD to match to the Active Content engine, which is write intensive, so it probably needs SLC or eMLC flash, both of which have a longer life expectancy than the regular MLC found in consumer SSD drives. The new Intel 710 SSD (eMLC) looks like a good product for our application: good write endurance at half the cost of SLC SSDs.

Raw performance of the SSDs is a secondary factor, because the performance bottleneck will be the RAID controller. All of the SAS RAID cards I have seen are of the PCIe 2.0 x8 (8 lanes) type, supporting 4 GB/s of bandwidth to the host.

With 6 Content Engines we would have 12 SSDs. If they are of the SATA 6 Gb/s generation using SandForce 22xx controllers, they will support reads and writes in the 500 MB/s range, times 12, which means a total of 6 GB/s. The 4 GB/s SAS RAID card limit would cap the I/O to each SSD at about 333 MB/s. Still very good numbers if you compare that to the performance of spinning disks.
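
Here is how that per-SSD ceiling works out, as a quick sketch (the 500 MB/s per-drive figure is the ballpark number quoted above):

# Where the ~333 MB/s per-SSD ceiling comes from: a single PCIe 2.0 x8 RAID
# card caps host bandwidth at ~4 GB/s, shared across all 12 SATA SSDs.
ssd_count = 12
ssd_mb_s = 500          # per-SSD read/write, SandForce 22xx class
card_mb_s = 4000        # PCIe 2.0 x8 bandwidth to the host

aggregate_ssd_mb_s = ssd_count * ssd_mb_s      # 6000 MB/s of raw SSD bandwidth
per_ssd_capped = card_mb_s / ssd_count         # ~333 MB/s each behind one card

print(aggregate_ssd_mb_s, per_ssd_capped)

# With two RAID cards and 6 SSDs on each (the idea below), the ceiling becomes
# 4000 / 6, about 667 MB/s per SSD, which is more than the drives can deliver.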

In order to benefit from the full bandwidth of the SSDs, we would need to place two SAS RAID cards in each server, with 6 SSDs connected to each controller.

Hopefully I’ll be able to write about the outcome of this in a few months.

If any of you out there are currently using SSDs with IDOL, please let me know how your experience has been.

Naim

How to track the status of indexing jobs.

October 27, 2011

Troubleshooting IDOL problems can be complex.

Indexing problems typically go as follows in a multiple Content Engine environment: the DIH distributes indexing jobs to the child Content Engines and monitors their progress, removing a job from the queue only when all the child members have completed it. If a child job never finishes, the index jobs will pile up in the DIH until its queue is filled. At that point, upstream services like the Active DIH, Ingestion and the Connector will all start to queue jobs and will stop when their queue limits are exceeded.

So how do you quickly find the Content Engine(s) having an issue? In the past I would connect to every Content Engine and review the application and index logs to find the source of the problem. Or, using the Index browser, you can connect to each Content Engine and issue the command IndexerGetStatus&IndexStatus=0, which will return the indexing job(s) currently running.

Typically I’ve found that index jobs which run forever are encountering out-of-disk-space situations. A critical error will simply cause a Content Engine to stop; the DIH status will show the engine as down. Again, jobs for the engine that is down are queued until the engine comes back up or until the queue on the DIH is filled.

So, in order to quickly find which Content Engine is having a problem, use the Index browser to connect to IDOL port 9000 and issue the following command: IndexerGetStatus&IndexStatus=0&ChildDetails=true&child=8 (set child= to the number of Content Engines you have; in my case 8).

With this command you’ll see which indexing job the IDOL DIH service is processing, along with the status on each Content Engine and its respective job ID.

In the example below, job 299 was sent to 7 Content Engines (job 299 is a DRESYNC job).

The job is queued on Content Engines 0 and 6, running on Content Engines 1 and 4, and finished on Content Engines 2, 3 and 5.

item =
id = 299
origin_ip = 127.0.0.1
received_time = 2011/10/27 10:59:04
start_time = 2011/10/27 10:59:04
end_time = Not Finished
duration_secs = 2547
percentage_processed = 42
status = 0
description = PROCESSING
index_command = /DRESYNC?
childenginesstatus =
childstatus =
engineid = 0
indexid = 251
percentage_processed = 0
status = -7
description = Queued
childstatus =
engineid = 1
indexid = 368
percentage_processed = 0
status = 0
description = PROCESSING
childstatus =
engineid = 2
indexid = 368
percentage_processed = 100
status = -1
description = Finished
childstatus =
engineid = 3
indexid = 368
percentage_processed = 100
status = -1
description = Finished
childstatus =
engineid = 4
indexid = 368
percentage_processed = 0
status = 0
description = PROCESSING
childstatus =
engineid = 5
indexid = 368
percentage_processed = 100
status = -1
description = Finished
childstatus =
engineid = 6
indexid = 368
percentage_processed = 0
status = -7
description = Queued
item =
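
If you find yourself doing this check often, the same thing can be scripted instead of using the Index browser. A rough sketch in Python, assuming the Index browser is simply issuing IndexerGetStatus as a standard action against the ACI port; the host and port are placeholders for my environment, and I am assuming the HTTP response is XML using the same element names shown above:

# Summarize the per-child status of the current indexing job.
# Assumptions: IDOL ACI port 9000 on localhost; the response is XML with the
# engineid / status / description element names seen in the output above.
import urllib.request
import xml.etree.ElementTree as ET

url = ("http://localhost:9000/action=IndexerGetStatus"
       "&IndexStatus=0&ChildDetails=true&child=8")

root = ET.fromstring(urllib.request.urlopen(url, timeout=30).read())

current = {}
for elem in root.iter():
    tag = elem.tag.split("}")[-1].lower()      # strip any XML namespace prefix
    if tag in ("engineid", "status", "description"):
        current[tag] = (elem.text or "").strip()
        if tag == "description" and "engineid" in current:
            print("engine %s: %s (status %s)" % (
                current["engineid"], current["description"], current["status"]))
            current = {}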

In order to make sure all of your Content Engines are up, you can connect the Index browser to port 9070 (the DIH service, which handles indexing) and make sure that Status=UP for each engine. You can also send indexer commands directly to the DIH service; the IndexerGetStatus&IndexStatus=0&ChildDetails=true&child=8 command will also work here and give you the same results.

product = DIH
indexport = 9071
aciport = 9070
serviceport = 9072
version = 7.5.9.0
build = 793214
full = false
full_ratio = 0.00
engines =
engine =
number = 0
group = 0
host = VTSS-DMSIDXC01
port = 11000
status = UP
updateonly = false
weight = 1
disabled = false
engine =
number = 1
group = 1
host = VTSS-DMSIDXC02
port = 11000
status = UP
updateonly = false
weight = 1
disabled = false
engine =
number = 2
group = 2
host = VTSS-DMSIDXC02
port = 12000
status = UP
updateonly = false
weight = 1
disabled = false
engine =
number = 3
group = 3
host = VTSS-DMSIDXC03
port = 11000
status = UP
updateonly = false
weight = 1
disabled = false
engine =
number = 4
group = 4
host = VTSS-DMSIDXC03
port = 12000
status = UP
updateonly = false
weight = 1
disabled = false
engine =
number = 5
group = 5
host = VTSS-DMSIDXC04
port = 11000
status = UP
updateonly = false
weight = 1
disabled = false
engine =
number = 6
group = 6
host = VTSS-DMSIDXC04
port = 12000
status = UP
updateonly = false
weight = 1
disabled = false
indexqueue =
indexqueuereceived = 304
indexqueueprocessed = 304
indexqueuecompleted = 114
indexqueuequeued = 0
indexqueuequeuedforchildren = 190
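
The up/down check can be scripted the same way; here is a minimal sketch against the DIH ACI port shown above (again, the host name is a placeholder, and I am assuming the response is XML with the host / port / status element names seen in the output):

# Flag any DIH child engine that is not reporting status UP.
# Assumptions: DIH ACI port 9070 on localhost; XML response with the same
# host / port / status element names shown in the output above.
import urllib.request
import xml.etree.ElementTree as ET

root = ET.fromstring(
    urllib.request.urlopen("http://localhost:9070/action=GetStatus", timeout=30).read())

engine, down = {}, 0
for elem in root.iter():
    tag = elem.tag.split("}")[-1].lower()
    if tag in ("host", "port", "status"):
        engine[tag] = (elem.text or "").strip()
    if tag == "status":
        if engine.get("status", "").upper() != "UP":
            down += 1
            print("engine DOWN: %s:%s" % (engine.get("host"), engine.get("port")))
        engine = {}

print("%d engine(s) not reporting UP" % down)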

Naim

DRECOMPACT: how to reduce execution time from 99 hours to 12 hours without any hardware changes.

August 12, 2011

About a year ago I encountered a perplexing problem with the DRECOMPACT command: it was taking over 100 hours on a Content Engine that had about 6 million documents.

I posted about this problem on the ITToolbox Autonomy user forum and got responses suggesting that our disk I/O might not be fast enough. Which is true: our SAN disks fall short of the recommended 150 MB/s read speed and 100 MB/s write speed.

I also opened a support ticket with Autonomy and was told, basically, that our slow disks were the cause.

But there was one very important fact that I overlooked. We have a total of 7 Content Engines distributed across 4 servers, which means that one of the Content Engines shares its server with the Active Content engine. The AC gets purged every day, so it’s a small database, and the DRECOMPACT runs faster on that server.

I did all sorts of things to try to understand why the DRECOMPACT would take 99 hours. I traced an entire DRECOMPACT cycle using the Sysinternals Procmon tool, which is very good in that it gives summarized disk activity data. It showed that during the DRECOMPACT a couple of terabytes of data were read from disk.

I also found that enabling debug logging in the application log produced some extra entries that gave me a better picture of what the DRECOMPACT does in its different stages. That might be the topic of another post.

As I was using Windows 2008 R2, I also used the Resource Monitor Disk tab to view the disk activity. It showed that during the bulk of the DRECOMPACT, disk I/O was low, in the 3-4 MB/s range.

So at some point during the on-and-off period in which I was thinking about this problem, I suddenly had a flash of insight, after reading something about Procmon I had never known. While I was using the Procmon file summary information, I was also trying to get other tools to report the terabytes of disk read activity. Sysinternals Procexp has some per-process summary information which does include reads and writes from disk, and these numbers were different: the reads were in the GB range in Procexp, not the TB range. What’s going on? Well, Procmon reports all of an application’s disk I/O, but it does not tell you whether the data gets served from the disk, the file system cache, or the Memory Manager standby lists.

What was happening was simple enough: the DRECOMPACT causes a lot of repetitive random disk reads, and unused memory in Windows gets repurposed for caching them. The more unused RAM you have, the larger the Memory Manager standby lists can grow. The graph below shows the outcome of an experiment I did. I found an unused server which happened to have 16 GB of RAM, and I used bcdedit boot options to limit the RAM from 2 GB up to 16 GB in increments. With each memory setting I performed the same DRECOMPACT on a Content Engine which had 6 million documents. At one end of the spectrum, the DRECOMPACT took about 70 hours to run when the server only had 2 GB of RAM; at the other end, with 16 GB of RAM, the compact finished in 4 hours. You can also see that as you add more RAM, the performance increase flattens out.

Back on our production Content Engines, I had to devise a strategy that would maximize the amount of free RAM on a server where two Content Engines are running at the same time. Our production servers currently have 7.25 GB of RAM. The OS will use about 800 MB, which leaves about 6.45 GB for both Content Engines. I have seen that a Content Engine will use as much RAM as you have on a server after days and weeks of operation, so not much RAM is left for the OS to use as cache.

So my solution was simple: prior to doing a DRECOMPACT, I issue the DRESYNC command followed by DREFLUSHANDPAUSE. This causes the engines to commit all indexing jobs to disk and stops them from processing new jobs; any jobs received are queued. In this state I stop both Content Engines on a given server and restart only the engine on which I want to run the compact. A Content Engine has a footprint of about 1 GB of RAM, which leaves about 5.45 GB of RAM for the OS to use for caching. As the second Content Engine on the server is off, the DAH service will mark it down so no queries are sent to it, and the DIH service will also mark it down, queuing indexing jobs for it until the engine comes back online. Make sure the DIH queue size is large enough to handle the jobs that will be queued during the compact.
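
For reference, these index commands can also be sent as plain HTTP requests to the index ports rather than through the Index browser. A sketch of the sequence; the host names and ports are placeholders from my environment (9071 is the DIH index port from the GetStatus output in a previous post, the Content Engine index port in the comment is a guess to be checked against your config, and I am assuming the DIH forwards these commands to its children like any other index command):

# Sketch of the pre-DRECOMPACT sequence described above, sent as HTTP index
# commands. Stopping/restarting the Content Engine services is done outside
# this script (e.g. through the Windows service manager).
import urllib.request

def index_cmd(host, index_port, command):
    # Index commands are plain GETs against the *index* port, e.g. /DRESYNC?
    url = "http://%s:%d/%s?" % (host, index_port, command)
    return urllib.request.urlopen(url, timeout=60).read().decode("utf-8", "replace")

dih_host, dih_index_port = "localhost", 9071   # DIH index port (placeholder host)

# 1. Commit all pending indexing work to disk, then pause new processing.
print(index_cmd(dih_host, dih_index_port, "DRESYNC"))
print(index_cmd(dih_host, dih_index_port, "DREFLUSHANDPAUSE"))

# 2. Stop both Content Engines on the target server, restart only the one to
#    be compacted, then run the compact against that engine's own index port,
#    for example (engine host and index port are placeholders):
# print(index_cmd("VTSS-DMSIDXC02", 11001, "DRECOMPACT"))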

With this solution in place, the DRECOMPACT that took 99 hours on our production servers now takes about 12 hours, which is what the above graph shows. I did the test on a physical server connected by FC to an HP EVA SAN; our production servers are all VMs connected to an EqualLogic iSCSI SAN. So in the end disk speed was not that important for the DRECOMPACT; it was the amount of RAM the OS had available to use as cache.

During this troubleshooting I used a very nice tool called ATM.exe, which is the only tool I know of that will visually show you the amount of memory the OS is using for file system caching and the standby lists. You can find the tool here: http://www.tmurgent.com/Tool_ATM.aspx.

This tool also allows you to clear the standby caches, something you can also do with Sysinternals RAMMap.exe.

[Screenshot: standby memory usage (caching) during the DRECOMPACT]

I do not fully understand the OS file system caching, but I think it works as follows: on Windows 2008 R2 the server uses a set amount of memory for the file system cache, which I have read is about 500 MB. When things fall out of the cache, they get placed on the standby lists if the server has free unused RAM. When an application makes a read call for a file, even if it’s no longer in the file system cache, the OS can retrieve the data from the standby list without having to physically read it from disk.

But the takeaway is clear: you do have control over how long the DRECOMPACT takes. Even if you only have one Content Engine per server, you should still restart your CE prior to the DRECOMPACT so it releases all the memory it is using, and this can be many GB over time.

You might be wondering what happens when a Content Engine is shut down. It will be marked down by the DIH service, which will internally queue the indexing commands destined for the down engine. The DAH service, which takes care of the search queries, will also mark the engine as down. So, in order to keep Content Engines turned off, you need some sort of mirroring or a duplicate IDOL build, so you can direct search traffic to a second IDOL system while you are doing maintenance on your primary. But keep in mind you don’t have to turn off your Connector and Ingestion if your DRECOMPACT completes before your Active Content flush cycle. During the DRECOMPACT, with an Active Content setup, only the delete indexing commands get sent to the regular Content Engines, and you can queue lots of these on the DIH since they take up almost no space.

Naim

Connector / DiffTool : how to control the scan starttime and endtime

August 10, 2011

I’ve been trying to understand why the Connector / difftool running in scanMode=Diff_only reports so many duplicate documents. I think one of the major sources of duplicates is encountered because of the Active Content database setup.

Here is my theory: one of the things the diff tool does is look for duplicate documents, which are documents that share the same reference ID in IDOL but have a different index ID. With an Active Content setup, all new documents and edits to existing documents get sent to the Active Content and redistributed to the normal Content Engines every 24 hours. So if I edit, today, a document I created a few days ago, an instance of the document will be found on the regular Content Engine, plus a new copy gets saved in the Active Content. The diff tool will report that the same reference ID is found on both the Content Engine and the Active Content.

I was thinking about how one could run the diff tool while having it skip documents that are saved in the Active Content. By default the Connector / diff tool will always use an epoch start time of 1 (Wed Dec 31 17:00:00 MST 1969), and the end time will always be the time at which the tool was started.

The IDOL admin guide states that the Connector / diff tool .cfg file supports the StartTime and EndTime parameters in the [TaskName] section.

From the WorkSite Administrator Guide 8.5, version 16, page 269:

StartTime: Defines a specific document date at which the crawl should begin. Any documents created before this date will not be indexed. This parameter should be used only for testing purposes.

EndTime: Defines a specific document date at which the crawl should end. Any documents created after this date will not be indexed. This parameter should be used only for testing purposes.

So my logic is that we should be able to use the EndTime parameter to specify a date and time of (now) minus 24 hours. Thus we would skip all of the documents that were indexed into the Active Content database. This should greatly reduce the false duplicates that are reported in the Diff_now comparison log.

I tried the above parameters and they did not work. A support case was opened with Autonomy, and I was told to try entering the time values in Unix epoch time, down to the millisecond.

In a follow-up I was told by Autonomy support that this functionality had been removed but that the IDOL documentation had not been updated to reflect this. Now, this seemed very odd to me and I did not believe it was correct.

Time to investigate things on my own.

When the Connector / difftool starts up, it lists all of the parameters that are read from the config file. So my game plan was simple: trace an existing parameter through the code to find where all the parameters are defined, where I suspected I would find the proper syntax for starttime and endtime.

Most of the Connector / difftool is written in Java, so we can call on a free Java decompiler (http://java.decompiler.free.fr/?q=jdgui) to take a look inside the code.

After searching inside, I found two promising parameters in the Worksiteconnector.jar file: startTimeOverRide and endTimeOverRide.

I did some testing, and startTimeOverRide and endTimeOverRide do work as expected. They are the proper syntax for the starttime and endtime described in the documentation.

To test, I set the following in the difftool cfg file:

startTimeOverRide=1
endTimeOverRide=2

Partial sections from the worksiteCrawler.log:

2011-08-10 11:54:10,721 [Thread-1            ] [INFO ] [WSParamsBuilder     ] – NumberToReturn: -1
2011-08-10 11:54:10,721 [Thread-1            ] [INFO ] [WSParamsBuilder     ] – StartTime Override: 1
2011-08-10 11:54:10,721 [Thread-1            ] [INFO ] [WSParamsBuilder     ] – EndTime Override: 2
2011-08-10 11:54:10,721 [Thread-1            ] [INFO ] [WSParamsBuilder     ] – IndexDbName: TEST
……
2011-08-10 11:54:11,387 [Thread-1            ] [INFO ] [WSLibraryImpl       ] – Overrode startTime: 1
2011-08-10 11:54:11,387 [Thread-1            ] [INFO ] [WSLibraryImpl       ] – Overrode endTime: 2
2011-08-10 11:54:11,469 [Thread-1            ] [INFO ] [WsDbCrawler         ] – Scan From: Wed Dec 31 17:00:00 MST 1969   (1)
2011-08-10 11:54:11,469 [Thread-1            ] [INFO ] [WsDbCrawler         ] – Scan To  : Wed Dec 31 17:00:00 MST 1969   (2)
2011-08-10 11:54:11,475 [DbCrawler-0         ] [INFO ] [CrawlerFramework    ] – Got ZERO records in the failed chunk.

Now that I have the parameters, I can continue my tests with the difftool to confirm whether I can get it to skip over documents that are in the Active Content.
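
For the real run, the override values need to be actual timestamps instead of 1 and 2. Based on the log above, where a value of 1 maps to Wed Dec 31 17:00:00 MST 1969 (i.e. the Unix epoch), the values look like Unix epoch milliseconds; here is a quick way to generate a "now minus 24 hours" end time (the milliseconds assumption is mine):

# Generate startTimeOverRide / endTimeOverRide values for a diff scan that
# skips the last 24 hours. Assumption: the overrides are Unix epoch
# milliseconds, based on the log above where a value of 1 maps to the epoch.
import time

start_time_override = 1                                        # beginning of time
end_time_override = int((time.time() - 24 * 60 * 60) * 1000)   # now minus 24 hours

print("startTimeOverRide=%d" % start_time_override)
print("endTimeOverRide=%d" % end_time_override)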

Stay tuned for more.

Naim

The nebulous Diff tool delete_only mode

July 12, 2011

I had a little adventure using the WorkSite Diff Tool. This is a custom connector that’s used to perform different types of scans to ensure that the WorkSite SQL databases are in sync with the data in the IDOL index.

While recovering from the DRECOMPACT bug described in a previous post, I misunderstood what the Diff tool in delete_only mode would do for me. I was expecting it to delete duplicate documents, which it does not do; it’s the comparison_now mode that does that job.

Here is what the WorkSite Indexing Admin guide (version 16, April 2011) says about the delete_only mode:

delete_only: Identify documents that have been deleted in the WorkSite database but which still exist in the index.

That would imply that the Diff tool connects to the IDOL index, queries it by document range, compares the same document range in SQL to find documents deleted in SQL that still exist in the IDOL index, and then submits the delete operations to the Ingestion engine.

Nope. I ran the delete_only mode on our production IDOL system, which contains about 43 million docs. The tool flooded the Content Engines with DREDELETEREF commands, and it took a few hours for the queues to empty. I was also amazed at how fast the tool would run. Something was strange, and even stranger when I ran the tool again, this time targeting only a small database: the same delete operations were submitted yet again. The other strange thing: I did not see any queries show up in the DAH query log or the Content Engine query logs. What’s up? An undocumented way of searching the Content Engines for deleted documents?

I had a suspicion that the delete_only mode was not running any queries against IDOL at all! It was only getting a list of documents to delete from SQL and then submitting delete operations for all of these documents to IDOL, without checking whether they existed or not. To test this I stopped the Content Engines and ran the diff tool again; it produced the same results.

Here is the SQL query that the diff tool uses to identify all documents that should be deleted from an IDOL index connection:

SELECT DOCNUM, VERSION, ACTIVITY FROM MHGROUP.DOCHISTORY DH WITH (NOLOCK) WHERE ACTIVITY_DATETIME >= '1970-01-01 00:00:00' AND ACTIVITY_DATETIME <= '2011-07-07 16:54:19' AND DOCNUM >= 4 AND DOCNUM < 1004 AND (ACTIVITY = 'DELETE') AND NOT EXISTS (SELECT 1 FROM MHGROUP.DOCMASTER DM WHERE DH.DOCNUM=DM.DOCNUM AND DH.VERSION=DM.VERSION)

I opened a support case with Autonomy. After a few days I got a response that this was the intended behavior.

OK, so I beg your pardon. My response was: please update the documentation so it accurately explains what is about to happen. Here is the type of entry I’d expect:

delete_only: Identify all documents that have ever been deleted in a WorkSite database and replay the delete operations in the IDOL index. Every time you run the tool in delete_only mode, it will re-flag the same documents for deletion.

So in our environment I caused the Content Engines to be flooded with delete operations, which tied up the indexers for hours needlessly. I guess the delete_only mode serves some kind of business requirement, but the reasons escape me.
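
If you want to gauge the blast radius before ever running delete_only again, the query the tool issues (shown above) can be run as a read-only count first; a sketch, with the connection string as a placeholder and the tool's per-chunk DOCNUM range left out:

# Count how many delete operations a delete_only run would replay, using the
# same MHGROUP.DOCHISTORY query the diff tool issues (shown above) as a COUNT.
# Assumptions: pyodbc is installed; server/database names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=your-sql-server;DATABASE=WorkSite;"
    "Trusted_Connection=yes")

sql = """
SELECT COUNT(*)
FROM MHGROUP.DOCHISTORY DH WITH (NOLOCK)
WHERE ACTIVITY_DATETIME >= '1970-01-01 00:00:00'
  AND ACTIVITY_DATETIME <= GETDATE()
  AND (ACTIVITY = 'DELETE')
  AND NOT EXISTS (SELECT 1 FROM MHGROUP.DOCMASTER DM
                  WHERE DH.DOCNUM = DM.DOCNUM AND DH.VERSION = DM.VERSION)
"""

print("delete operations that would be replayed:",
      conn.cursor().execute(sql).fetchone()[0])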

Naim

Autonomy WorkSite Indexer 8.5 SP2 Update 1 (IDOL): beware, DRECOMPACT breaks document security

July 5, 2011

The following applies to Autonomy WorkSite Indexer 8.5 SP2 Update 1 (IDOL).

This is the second time this bug has appeared in the Content Engines. We first experienced it in September 2010 with WorkSite Indexer 8.5 SP1. It’s very surprising to see that the bug got reintroduced into the Content Engine code.

Be aware: if you run the DRECOMPACT indexing command on a Content Engine (version 7.5.7.0, build 755734), all of the documents with private security in your DMS system will be hidden from everyone: NRTADMIN users, the Diff tool, Index browser searches, etc.

The bug is known as issue NT-27969.

This bug can quickly create chaos: you discover that thousands of documents are missing from your IDOL indexer, you run the Diff tool in comparison_now mode, your documents are re-imported into IDOL, your committed document count goes up, you run out of disk space on your Content Engines, you run the DRECOMPACT to purge the duplicate documents and reclaim the disk space…

Fortunately, you can apply the following manual fix: stop your Content Engines and modify the WorkSite Content X.cfg file.

Under the [Server] section:

RegenerateSecIndex=true  (don’t forget to remove this after your engines have started)

When you start the Content Engine, it will delete and regenerate the security index, which can take a few hours to run. Your Content Engine application log will show this activity as follows:

04/07/2011 15:43:46 [1] Normal: Thread 1 initialized security modules.
04/07/2011 15:43:46 [1] Full: Creating new nbitfield index 1
04/07/2011 15:43:46 [1] Full: Creating new nbitfield index 2
04/07/2011 15:43:46 [1] Full: Creating new nbitfield index 3
04/07/2011 15:43:46 [1] Full: Creating new nbitfield index 4
04/07/2011 15:43:46 [1] Normal: Indexing secindex information for 6497137 documents.
04/07/2011 15:56:57 [1] Normal: Indexing information from documents 10% complete.
04/07/2011 16:09:08 [1] Normal: Indexing information from documents 20% complete.
04/07/2011 16:22:10 [1] Normal: Indexing information from documents 30% complete.
04/07/2011 16:34:30 [1] Normal: Indexing information from documents 40% complete.
04/07/2011 16:47:00 [1] Normal: Indexing information from documents 50% complete.
04/07/2011 16:57:42 [1] Normal: Indexing information from documents 60% complete.
04/07/2011 17:08:56 [1] Normal: Indexing information from documents 70% complete.
04/07/2011 17:26:34 [1] Normal: Indexing information from documents 80% complete.
04/07/2011 17:42:09 [1] Normal: Indexing information from documents 90% complete.
04/07/2011 17:56:51 [1] Normal: Opening nbitfield index 1
04/07/2011 17:56:51 [1] Normal: Opening nbitfield index 2
04/07/2011 17:56:51 [1] Normal: Updated secindex information for 6196584 documents. 0 Failures.

Naim

UPDATE, August 11: I was given a new version of the Content Engine which is supposed to fix this issue. I was going to test it on a small all-in-one box. Prior to testing the new version, I ran the DRECOMPACT on a small Content Engine which only contained 11 documents; the bug did not occur. I contacted Autonomy and they confirmed that this bug does not occur in all environments. I ran the compact again on an engine that has 2.3 million documents and the bug did not occur there either. We have over 6 million documents in our production Content Engines, and the bug does occur there.

So if you only have a few million documents in IDOL, you probably will not experience this bug.