Login ProductsSalesSupportDownloadsAbout |
Home » Technical Support » DBISAM Technical Support » Support Forums » DBISAM General » View Thread |
Messages 1 to 9 of 9 total |
Full-text indexing taking a long time |
Tue, May 8 2007 9:37 PM | Permanent Link |
kk aw | I know this issue has been raised a few times.
I have a table of just over 135 MB with one memo field indexed. It is taking hours for the table to be restructured to include this index. Is there anything I can do to speed up this process? This is on a 2GB, Quad Core machine. I read about Roy suggestion to de-dupe the string before indexing. That would mean changing my 4 applications plus dbsys itself. Also, the OnTextIndexFilter event is not fired unless the index is removed and reinstated again. Seems to be a long work-around. Would this be an issue with ElevateDB? I have bought ElevateDB but has not got round to using it yet. Regards, KK Aw |
Wed, May 9 2007 2:51 AM | Permanent Link |
Roy Lambert NLH Associates Team Elevate | kk
From a couple of simple tests I've carried out ElevateDB is a LOT faster than DBISAM for FTI. BTW my suggestions aren't to dedup the string, but rather to set up appropriate STOP WORDS. I also dump any word less than x or greater than y characters, and clear out any HTML tags or numbers. Not only does this speed up the indexing it also reduces the size of the index by a large amount. Roy Lambert |
Wed, May 9 2007 6:13 AM | Permanent Link |
kk aw | Roy,
Thanks for the information about ElevateDB. Where do you dump words less than x and greater than y and where do you strip the HTML tags? Regards, KK Aw |
Wed, May 9 2007 7:31 AM | Permanent Link |
Roy Lambert NLH Associates Team Elevate | kk
Here you go - the complete wadge Roy Lambert procedure Tmnd.ftiMassage(Sender: TObject; const TableName, FieldName: string; var TextToIndex: string); var Cntr: integer; ThisUn: string; Skip: boolean; WorkStr: string; sl: TStringList; slCnt: integer; MinWordLength: integer; MaxWordLength: integer; ftiWebStuff: boolean; ftiHelp: TDBISAMTable; ftiSession: TDBISAMSession; URLandEmail: string; const Delims = [' ', ',', '.', ';', ':', '!', '"', '?', '(', ')', '/', '\', '>', '<', '[', ']', '}', '{']; Alphas = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_', '-']; function RemoveHTML(FormattedText: string): string; var CharsToProcess: integer; CurrentChar: integer; TestChar: Char; function HTMLTag: string; var Tag: string; CRCheck: string; EqPos: integer; procedure FindClosingTag(WhatTag: string); var TagLength: integer; function NoTagMatch: boolean; var Cntr: integer; begin Result := False; for Cntr := 1 to TagLength do begin if WhatTag[Cntr] <> UpperCase(FormattedText[CurrentChar + Cntr]) then begin Result := True; Break; end; end; end; begin TagLength := Length(WhatTag); while (CurrentChar < CharsToProcess - TagLength) and (NoTagMatch) do inc(CurrentChar); CurrentChar := CurrentChar + TagLength + 1; end; begin Tag := ''; while (FormattedText[CurrentChar] <> '>') and (CurrentChar <= CharsToProcess) do begin Tag := Tag + FormattedText[CurrentChar]; inc(CurrentChar); end; Tag := Tag + FormattedText[CurrentChar]; //This should get the > Tag := UpperCase(Tag); if Tag = '<HEAD>' then begin FindClosingTag('</HEAD>'); end else if Tag = '<XML>' then begin FindClosingTag('</XML>'); end else if Tag = '<TITLE>' then begin // We need to dump everything until the closing tag FindClosingTag('</TITLE>'); Result := ''; end else if Copy(Tag, 1, 6) = '<STYLE' then begin FindClosingTag('</STYLE>'); // We need to dump everything until the closing tag - especially css stuff Result := ''; end else if Tag = '<BR>' then begin Result := #13#10; end else if Copy(Tag, 1, 2) = '<P' then begin Result := #13#10; end else if Tag = '</DIV>' then begin if CurrentChar < CharsToProcess then begin if (FormattedText[CurrentChar - 6] <> '>') and (FormattedText[CurrentChar - 10] <> '<') then Result := #13#10 else begin CRCheck := FormattedText[CurrentChar - 10] + FormattedText[CurrentChar - 9] + FormattedText[CurrentChar - 8] + FormattedText[CurrentChar - 7] + FormattedText[CurrentChar - 6]; if UpperCase(CRCheck) <> '<DIV>' then Result := #13#10; end end else Result := ''; end else if (Copy(Tag, 1, 3) = '</H') and (Tag[4] in ['0'..'9']) then begin Result := #13#10; end else Result := ''; end; function SpecialChar: string; var HTMLChar: string; begin HTMLChar := ''; while (FormattedText[CurrentChar] <> ';') and (CurrentChar <= CharsToProcess) do begin HTMLChar := HTMLChar + FormattedText[CurrentChar]; inc(CurrentChar); end; HTMLChar := LowerCase(HTMLChar + FormattedText[CurrentChar]); //This should get the ; Result := ''; end; begin if 0 <> Pos('<html', LowerCase(FormattedText)) then begin Result := ''; CharsToProcess := Length(FormattedText); CurrentChar := 1; while CurrentChar <= CharsToProcess do begin TestChar := FormattedText[CurrentChar]; case TestChar of #0..#9, #11, #12, #14..#31: {do nothing}; '<': Result := Result + HTMLTag; '&': Result := Result + SpecialChar; else Result := Result + TestChar; end; inc(CurrentChar); end; end else Result := FormattedText; end; function LineIsUUEncoded: boolean; var uuCntr: integer; const TableUU = '`!"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_'; begin Result := False; if (Length(WorkStr) > MinWordLength) and (((((Length(WorkStr) - 1) * 3) / 4) = Pos(WorkStr[1], TableUU) - 1) and (WorkStr[1] < 'a')) then begin // Only if it hits here is there a possibility that its UUEncoded, but we need to check to make sure Result := True; for uuCntr := 1 to Length(WorkStr) do begin if 0 = Pos(WorkStr[uuCntr], TableUU) then begin Result := False; Break; end; end; end; end; function GetWebStuff: string; var wsl: TStringList; wCntr: integer; lCntr: integer; Line: string; DelimPos: integer; const wDelims = [' ', ',', ';', ':', '!', '"', '?', '(', ')', '/', '/', '>', '<', '[', ']', '}', '{']; function SeekEnd: string; var Chop: integer; begin Chop := lCntr + 1; while Chop <= Length(Line) do begin if (Line[Chop] in wDelims) then Break else inc(Chop); end; Result := Copy(Line, DelimPos, Chop - DelimPos) + ' '; Delete(Line, DelimPos, Chop - DelimPos); lCntr := DelimPos; end; begin { Rules 1. If the line is less than 10 characters just forget it 2. Only interested in URLs starting www. 3. I'm only interested in the base URL ie if any / then chop before it 4. I'm only interested in one line at a time } wsl := TStringList.Create; wsl.Text := TextToIndex; TextToIndex := ''; Result := ''; for wCntr := 0 to wsl.Count - 1 do begin DelimPos := 1; if Length(wsl[wCntr]) > 10 then begin Line := wsl[wCntr]; lCntr := 1; while lCntr <= Length(Line) do begin if Line[lCntr] in wDelims then DelimPos := lCntr + 1; if Line[lCntr] = '@' then Result := Result + SeekEnd; if LowerCase(Copy(Line, lCntr, 4)) = 'www.' then Result := Result + SeekEnd; inc(lCntr); end; if Line <> '' then TextToIndex := TextToIndex + #13#10 + Line; end else TextToIndex := TextToIndex + #13#10 + wsl[wCntr]; end; wsl.Free; end; procedure AddWordToBeIndexed; begin if ThisUn = '' then Exit; if ThisUn[Length(ThisUn)] = '-' then Delete(ThisUn, Length(ThisUn), 1); if (Length(ThisUn) > MinWordLength) and (Length(ThisUn) <= MaxWordLength) then TextToIndex := TextToIndex + ThisUn + ' '; ThisUn := ''; end; begin if (TextToIndex = '') or (FieldName = '_Flags') then Exit; if FieldName = '_Headers' then begin TextToIndex := GetWebStuff; Exit; end; ftiSession := MakeDBISAMSession; ftiHelp := MakeDBISAMTable('ftiHelper', 'Memory', ftiSession); ftiHelp.Open; MinWordLength := ftiHelp.FieldByName('_MinWordLength').AsInteger; MaxWordLength := ftiHelp.FieldByName('_MaxWordLength').AsInteger; ftiWebStuff := ftiHelp.FieldByName('_ftiWebStuff').AsBoolean; ftiHelp.Close; ftiHelp.Free; ftiSession.Free; sl := TStringList.Create; URLandEmail := GetWebStuff; if FieldName = '_Message' then sl.Text := RemoveHTML(TextToIndex) else sl.Text := TextToIndex; TextToIndex := ''; for slCnt := 0 to sl.Count - 1 do begin WorkStr := sl[slCnt]; Skip := False; if LineIsUUEncoded then Break; // assumption is the rest of the message is UU stuff if Length(WorkStr) > MinWordLength then begin for Cntr := 1 to length(WorkStr) do begin if (not Skip) and (WorkStr[Cntr] in Alphas) then begin ThisUn := ThisUn + WorkStr[Cntr]; Skip := false; if (Cntr = length(WorkStr)) or (WorkStr[Cntr + 1] in Delims) then AddWordToBeIndexed; end else begin if (Cntr = length(WorkStr)) or (WorkStr[Cntr + 1] in Delims) then AddWordToBeIndexed else begin Skip := true; ThisUn := ''; end; end; if Skip then Skip := (Cntr < Length(WorkStr)) and (WorkStr[Cntr + 1] in Delims); end; end; end; if ftiWebStuff and (URLandEmail <> '') then TextToIndex := TextToIndex + URLandEmail; if TextToIndex <> '' then Delete(TextToIndex, Length(TextToIndex), 1); // get rid of the trailing space sl.Free; end; procedure Tmnd.DataModuleCreate(Sender: TObject); begin with Engine do begin Active := False; MaxTableDataBufferSize := MaxTableDataBufferSize * 4; MaxTableDataBufferCount := MaxTableDataBufferCount * 4; MaxTableIndexBufferSize := MaxTableIndexBufferSize * 2; MaxTableIndexBufferCount := MaxTableIndexBufferCount * 2; //MaxTableBlobBufferSize //MaxTableBlobBufferCount LargeFileSupport := True; OnTextIndexFilter := ftiMassage; BeforeDeleteTrigger := BeforeDeleteTrigger; Active := True; end; end; |
Wed, May 9 2007 10:25 AM | Permanent Link |
kk aw | Roy,
Thanks. I will give it a try. KK Aw |
Wed, May 9 2007 10:35 AM | Permanent Link |
kk aw | Roy,
I am missing makeDbisamSession and makeDBISAMTable. Regards, KK Aw |
Wed, May 9 2007 11:35 AM | Permanent Link |
Roy Lambert NLH Associates Team Elevate | kk
Some people need spoon feeding Roy Lambert function MakeDBISAMSession: TDBISAMSession; begin Result := TDBISAMSession.Create(nil); Result.PrivateDir := GetWindowsTempPath; Result.AutoSessionName := True; Result.LockProtocol := lpPessimistic; end; function MakeDBISAMDatabase(const iSession: TDBISAMSession; const Path: string): TDBISAMDatabase; begin Result := TDBISAMDatabase.Create(nil); Result.DatabaseName := 'db' + iSession.SessionName; Result.Directory := Path; Result.SessionName := iSession.SessionName; end; function MakeDBISAMTable(const iName: string; const iDBPath: string; const iSession: TDBISAMSession): TDBISAMTable; begin Result := TDBISAMTable.Create(nil); Result.TableName := iName; Result.DatabaseName := iDBPath; Result.SessionName := iSession.SessionName; end; function MakeDBISAMQuery(const iDBPath: string; const iSession: TDBISAMSession): TDBISAMQuery; begin Result := TDBISAMQuery.Create(nil); Result.DatabaseName := iDBPath; Result.SessionName := iSession.SessionName; end; |
Thu, May 10 2007 4:49 AM | Permanent Link |
kk aw | Roy,
Thanks. I wasn't sure what these functions were supposed to do. KK Aw |
Thu, May 10 2007 5:21 AM | Permanent Link |
Roy Lambert NLH Associates Team Elevate | kk
>Thanks. > >I wasn't sure what these functions were supposed to do. I made the names as descriptive as I could <vbg> Roy Lambert |
This web page was last updated on Tuesday, September 17, 2024 at 04:19 AM | Privacy PolicySite Map © 2024 Elevate Software, Inc. All Rights Reserved Questions or comments ? E-mail us at info@elevatesoft.com |