#1 2009-09-27 11:48

narayan
Senior Member
Registered: 2009-02-08
Posts: 470

Separate words separates successive Capitalized letters

The Cleanup rule's "insert space before capitalized letter" option chops up all-caps words (acronyms) such as HDTV, HTML, etc.

When it encounters successive capitalized letters, it should insert a space BEFORE and AFTER the entire bunch, but not in between. That keeps the acronym intact and also separates it from the rest of the text.

Last edited by narayan (2009-09-27 11:49)

#2 2009-09-28 21:21

den4b
Administrator
From: den4b.com
Registered: 2006-04-06
Posts: 3,379

Re: Separate words separates successive Capitalized letters

Sounds good!

My immediate thought was that "A" on its own might create some trouble. For example, "ThisIsADJExample" should be "This Is A DJ Example", but by the logic above "A DJ" would become "ADJ".

I will work on this one now...

#3 2009-09-28 23:11

den4b
Administrator
From: den4b.com
Registered: 2006-04-06
Posts: 3,379

Re: Separate words separates successive Capitalized letters

Ok, I have finished. Available in the latest beta version.

It was a little tricky to figure out all of the possible scenarios. Maybe somebody wants to double check it?

Here is the new algorithm used for this option:

  Result := S;
  for I := Length(S) downto 2 do
  begin
    // if upper case
    if IsUp(I) then
    begin
      // if upper case precedes
      if IsUp(I-1) then
      begin
        // if last char
        if I = Length(S) then
          Continue
        // if not last
        else
          // if upper case or space follows
          if IsUp(I+1) or IsSpace(I+1) then
            Continue;
      end;
      // if space precedes
      if IsSpace(I-1) then
        Continue;
      // else insert space
      Insert(' ', Result, I);
    end;
  end;

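Where IsUp(I) returns TRUE if the character at position I in the original string S is an upper-case letter; IsSpace(I) works similarly, except that it checks whether the character is a space. Because the loop runs from the end of the string to the start, inserting into Result never shifts the positions that are still to be checked.

P.S. This does not solve the problem when "A" is used together with the acronym.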

#4 2009-09-29 08:12

Stefan
Moderator
From: Germany, EU
Registered: 2007-10-23
Posts: 1,161

Re: Separate words separates successive Capitalized letters

From my tests I have a question:

FROM:
ThisIsADJExample
ThisIsA_Long-Sentence,WithoutAny Spaces

TO:
This Is ADJ Example
This Is A_ Long- Sentence, Without Any Spaces

Do we want to exclude punctuation marks?
To get
This Is A_Long-Sentence, Without Any Spaces
(I think comma and dot don't have to be excluded, but underscore and dash?)

Denis, will you please provide fully working PS code too, so we can play around with it ourselves?


Read the *WIKI* for HELP + MANUAL + Tips&Tricks.
If ReNamer has helped you, please *DONATE* to Denis or buy a PRO license. (Read *Lite vs Pro*)

#5 2009-09-29 14:01

narayan
Senior Member
Registered: 2009-02-08
Posts: 470

Re: Separate words separates successive Capitalized letters

Works perfectly. (I had overlooked the possibility that a word could follow the acronym, in which case the last capitalized letter actually belongs to the next word.)
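
To see why that last capital matters, here is a small Python sketch of the original suggestion taken literally (a space before and after every run of capitals). The function name `pad_capital_runs` and the regular expressions are illustrative assumptions, ASCII-only, and not anything ReNamer does:

```python
import re

def pad_capital_runs(s):
    """Literal reading of the first suggestion: pad every run of 2+ capitals."""
    # Put a space before and after each run of two or more ASCII capitals.
    s = re.sub(r"[A-Z]{2,}", r" \g<0> ", s)
    # Ordinary case: a space before a single capital that follows a lower-case letter.
    s = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", s)
    # Collapse doubled spaces left over from the padding and trim the ends.
    return re.sub(r" {2,}", " ", s).strip()

print(pad_capital_runs("ThisIsADJExample"))  # This Is ADJE xample
print(pad_capital_runs("HDTV"))              # HDTV
```

The run swallows the "E" of "Example". The posted algorithm avoids this by putting the space before the last capital of a run when a word follows it.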

That leaves only the article "a" preceding the acronym (but not the articles "The" and "An", which will be treated correctly).

On the other hand, overcorrecting for this would rule out any acronym beginning with A: http://www.acronymslist.com/alphabet/A.html

#6 2009-09-29 20:00

den4b
Administrator
From: den4b.com
Registered: 2006-04-06
Posts: 3,379

Re: Separate words separates successive Capitalized letters

Ok, basically the code I posted previously remains unchanged, except that the IsSpace(I) function now also returns TRUE if the character is a "_" (underscore) or a "-" (dash). This will fix Stefan's examples.

FROM:
ThisIsADJExample
ThisIsA_Long-Sentence,WithoutAny Spaces

TO:
This Is ADJ Example
This Is A_Long-Sentence, Without Any Spaces
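
For anyone who wants to try this outside ReNamer, the rule with the extended IsSpace can be sketched in Python roughly as follows. This is a hand-written port, not the ReNamer source: the names `insert_space_v1` and `space_alike` are assumptions, and Python's `str.isupper` and `str.isspace` stand in for the helper functions.

```python
def insert_space_v1(s, space_alike=""):
    """Rough port of the first algorithm; space_alike lists extra 'space' characters."""

    def is_up(i):
        # Out-of-range positions count as "not upper case".
        return 0 <= i < len(s) and s[i].isalpha() and s[i].isupper()

    def is_space(i):
        return 0 <= i < len(s) and (s[i].isspace() or s[i] in space_alike)

    chars = list(s)
    # Walk backwards so that inserting never shifts positions still to be checked.
    for i in range(len(s) - 1, 0, -1):
        if not is_up(i):
            continue
        if is_up(i - 1):
            # Inside a run of capitals: keep it intact unless a new word starts here.
            if i == len(s) - 1:
                continue
            if is_up(i + 1) or is_space(i + 1):
                continue
        if is_space(i - 1):
            continue
        chars.insert(i, " ")
    return "".join(chars)

sample = "ThisIsA_Long-Sentence,WithoutAny Spaces"
print(insert_space_v1(sample))        # This Is A_ Long- Sentence, Without Any Spaces
print(insert_space_v1(sample, "_-"))  # This Is A_Long-Sentence, Without Any Spaces
```

Passing the extra characters as a parameter makes it easy to compare the behaviour before and after the change.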

P.S. Stefan, my code was the real source code (not pseudo code), except that the declarations of the IsSpace and IsUp functions are missing.

#7 2009-09-30 09:52

Stefan
Moderator
From: Germany, EU
Registered: 2007-10-23
Posts: 1,161

Re: Separate words separates successive Capitalized letters

Yes, these functions are missing. That is why I can't test the modifications.
So I asked if you would like to post the whole code, so that we can adapt and modify it for different needs.
If it is closed source... don't do it, I would understand that :P


Thanks for improving your tools again and again.


#8 2009-09-30 11:42

den4b
Administrator
From: den4b.com
Registered: 2006-04-06
Posts: 3,379

Re: Separate words separates successive Capitalized letters

Here is the full implementation of the InsertSpaceBeforeCapitals function. I haven't tried pasting it into the PascalScript rule, and I have a feeling it is not going to work straight away. PascalScript might not support sub-routines (that is, functions within functions), and possibly not WideChar sets either. Anyway, IsSpace and IsUp are only helper functions which use cached values for the character types in the input string, so they can easily be reimplemented for PascalScript.

Good luck! :)

function InsertSpaceBeforeCapitals(const S: WideString): WideString;
const
  SPACE_ALIKE = [WideChar('_'), WideChar('-')];
var
  I: Integer;
  IsUpCache: Array of Boolean;
  IsSpaceCache: Array of Boolean;

  procedure Init;
  var A: Integer;
  begin
    SetLength(IsUpCache, Length(S));
    SetLength(IsSpaceCache, Length(S));
    for A := 1 to Length(S) do
    begin
      IsUpCache[A-1] := IsWideCharAlpha(S[A]) and IsWideCharUpper(S[A]);
      IsSpaceCache[A-1] := IsWideCharSpace(S[A]) or (S[A] in SPACE_ALIKE);
    end;
  end;
  function IsSpace(Index: Integer): Boolean;
  begin
    Result := (Index >= 1) and (Index <= Length(S));
    if Result then Result := IsSpaceCache[Index-1];
  end;
  function IsUp(Index: Integer): Boolean;
  begin
    Result := (Index >= 1) and (Index <= Length(S));
    if Result then Result := IsUpCache[Index-1];
  end;

begin
  Init;
  Result := S;
  for I := Length(S) downto 2 do
  begin
    // if upper case
    if IsUp(I) then
    begin
      // if upper case precedes
      if IsUp(I-1) then
      begin
        // if last char
        if I = Length(S) then
          Continue
        // if not last
        else
          // if upper case or space follows
          if IsUp(I+1) or IsSpace(I+1) then
            Continue;
      end;
      // if space precedes
      if IsSpace(I-1) then
        Continue;
      // else insert space
      Insert(' ', Result, I);
    end;
  end;
end;

#9 2009-09-30 12:17

Stefan
Moderator
From: Germany, EU
Registered: 2007-10-23
Posts: 1,161

Re: Separate words separates successive Capitalized letters

Thank you, Denis!

---

Playing around with the beta from Tuesday evening,
I found a few more possibilities to take into account (maybe; I leave this up to you).
I don't think I have found all the possibilities yet, but I want to report these so you/we can check whether this should be the version to release.

FROM:
TestStringA WWWWithUpper.Case,Letters_And_Underscores-And-Dashes
TestStringA (BC) -- [CD] -- !EF! -- ,GH, 
De'Argostino ---  'THE`Master
ThisIsMy$Var
!NotImportendFile
.AnDotAtBegin
TheBeatles-TheSongRMX (Master)
TheBeatles-TheSongRMX(Master)


TO:
Test String A WWW With Upper. Case, Letters_And_Underscores-And-Dashes
Test String A ( B C) -- [ C D] -- ! E F! -- , G H,
De' Argostino --- ' TH E` Master
This Is My$ Var
! Not Importend File
. An Dot At Begin
The Beatles-The Song RMX ( Master)
The Beatles-The Song RM X( Master)


I think the expected result would be (maybe I am wrong):
Test String A (BC) -- [CD] -- !EF! -- , GH, 
De'Argostino ---  'THE` Master
This Is My $Var
!Not Importend File
.An Dot At Begin
The Beatles-The Song RMX (Master)
The Beatles-The Song RMX(Master)



-------- Some more tests. I know you can't check them all, I just want to show them.

FROM:
MusicFinder\Pop\2Raumwohnung - Wir Werden Sehen_01.mp3
Unknown\AventuraFeat.Akon,Wisin&Yandel-AllUp2You.mp3
Hip hop\T.I. - Dead and gone.mp3
G.G. Anderson - Discofox Hit Mix.mp3
D.J. Bobo - Pray.mp3
O.M.D - Pandora´s Box.mp3
Radio NJoy - Broadcasted by Atlantis.BG.ogg
Bobby 'Boris' Pickett - Monster Mash.mp3


TO:
Music Finder\ Pop\2 Raumwohnung - Wir Werden Sehen_01.mp3
Unknown\ Aventura Feat. Akon, Wisin& Yandel-All Up2 You.mp3
Hip hop\ T. I. - Dead and gone.mp3  ---------- back slash left
G. G. Anderson - Discofox Hit Mix.mp3 -------- dots
D. J. Bobo - Pray.mp3
O. M. D - Pandora´s Box.mp3
Radio N Joy - Broadcasted by Atlantis. BG.ogg  --- two upper case chars
Bobby ' Boris' Pickett - Monster Mash.mp3

Last edited by Stefan (2009-09-30 13:23)


#10 2009-10-04 13:46

den4b
Administrator
From: den4b.com
Registered: 2006-04-06
Posts: 3,379

Re: Separate words separates successive Capitalized letters

I think this will be my last attempt to make an adjustment to this option. There are just too many exceptions and exceptions to exceptions and so on, i.e. no well defined rules to these adjustments.

Stefan, I think you got carried away and forgot that the option is called "Insert spaces in front of capitals", as opposed to "Insert spaces where they seem to fit" ;)

The latest modifications work in the following way:

INPUT:
ThisIsADJExample
ThisIsA_Long-Sentence,WithoutAny Spaces
TestStringA WWWWithUpper.Case,Letters_And_Underscores-And-Dashes
TestStringA (BC) -- [CD] -- !EF! -- ,GH, 
De'Argostino ---  'THE`Master
ThisIsMy$Var
!NotImportendFile
.AnDotAtBegin
TheBeatles-TheSongRMX (Master)
TheBeatles-TheSongRMX(Master)
MusicFinder\Pop\2Raumwohnung - Wir Werden Sehen_01.mp3
Unknown\AventuraFeat.Akon,Wisin&Yandel-AllUp2You.mp3
Hip hop\T.I. - Dead and gone.mp3
G.G. Anderson - Discofox Hit Mix.mp3
O.M.D - Pandora´s Box.mp3
Radio NJoy - Broadcasted by Atlantis.BG.ogg
Bobby 'Boris' Pickett - Monster Mash.mp3

OUTPUT:
This Is ADJ Example
This Is A_Long-Sentence,Without Any Spaces
Test String A WWW With Upper.Case,Letters_And_Underscores-And-Dashes
Test String A (BC) -- [CD] -- !EF! -- ,GH,
De'Argostino --- 'THE`Master
This Is My$Var
!Not Importend File
.An Dot At Begin
The Beatles-The Song RMX (Master)
The Beatles-The Song RMX(Master)
Music Finder\Pop\2 Raumwohnung - Wir Werden Sehen_01.mp3
Unknown\Aventura Feat.Akon,Wisin&Yandel-All Up2 You.mp3
Hip hop\T.I. - Dead and gone.mp3
G.G. Anderson - Discofox Hit Mix.mp3
O.M.D - Pandora´s Box.mp3
Radio N Joy - Broadcasted by Atlantis.BG.ogg
Bobby 'Boris' Pickett - Monster Mash.mp3

Conclusions:
In this way all capitals are handled unless punctuation is involved. If the user wants some punctuation to be handled differently, he can easily replace it with spaces or insert spaces around it with a simple Replace rule. On the other hand, if we handled punctuation as well and the user didn't want some of it corrected (like O.M.D style abbreviations), it would be harder to remove the spaces from only those cases. I think the base goal is achieved here.


function InsertSpaceBeforeCapitals(const S: WideString): WideString;
var
  I: Integer;
  IsUpCache, IsLowCache, IsLetterCache: Array of Boolean;

  procedure Init;
  var A: Integer;
  begin
    SetLength(IsUpCache, Length(S));
    SetLength(IsLowCache, Length(S));
    SetLength(IsLetterCache, Length(S));
    for A := 1 to Length(S) do
    begin
      IsUpCache[A-1] := IsWideCharAlpha(S[A]) and IsWideCharUpper(S[A]);
      IsLowCache[A-1] := IsWideCharAlpha(S[A]) and IsWideCharLower(S[A]);
      IsLetterCache[A-1] := IsWideCharAlphaNumeric(S[A]);
    end;
  end;
  function IsLetter(Index: Integer): Boolean;
  begin
    Result := IsLetterCache[Index-1];
  end;
  function IsUp(Index: Integer): Boolean;
  begin
    Result := IsUpCache[Index-1];
  end;
  function IsLow(Index: Integer): Boolean;
  begin
    Result := IsLowCache[Index-1];
  end;

begin
  Init;
  Result := S;
  for I := Length(S) downto 2 do
  begin
    // if upper case
    if IsUp(I) then
    begin
      // if upper case precedes
      if IsUp(I-1) then
      begin
        // if last char
        if I = Length(S) then Continue
        // if not last and not lower case follows
        else if not IsLow(I+1) then Continue;
      end;
      // if no letter or digit precedes (IsLetter uses IsWideCharAlphaNumeric)
      if not IsLetter(I-1) then Continue;
      // else insert a space
      Insert(' ', Result, I);
    end;
  end;
end;
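
For experimenting with other inputs outside ReNamer, the final version above can be approximated in Python as below. This is a hand-written sketch rather than the ReNamer source: the name `insert_space_before_capitals` is simply carried over, and `str.isupper`, `str.islower` and `str.isalnum` are assumed to be close enough to the IsWideChar* functions for the examples in this thread.

```python
def insert_space_before_capitals(s):
    """Rough Python port of the final InsertSpaceBeforeCapitals."""
    # Cache the character classes once, as the Pascal Init procedure does.
    up = [c.isalpha() and c.isupper() for c in s]
    low = [c.isalpha() and c.islower() for c in s]
    alnum = [c.isalnum() for c in s]

    chars = list(s)
    # Walk backwards so that inserting never shifts positions still to be checked.
    for i in range(len(s) - 1, 0, -1):
        if not up[i]:
            continue
        if up[i - 1]:
            # In a run of capitals, split only where a lower-case letter follows.
            if i == len(s) - 1 or not low[i + 1]:
                continue
        # Never insert after punctuation, spaces or other non-alphanumerics.
        if not alnum[i - 1]:
            continue
        chars.insert(i, " ")
    return "".join(chars)

print(insert_space_before_capitals("ThisIsADJExample"))
# This Is ADJ Example
print(insert_space_before_capitals("TheBeatles-TheSongRMX(Master)"))
# The Beatles-The Song RMX(Master)
print(insert_space_before_capitals("Radio NJoy - Broadcasted by Atlantis.BG.ogg"))
# Radio N Joy - Broadcasted by Atlantis.BG.ogg
```

Digits count as "letters" here, which is why "AllUp2You" becomes "All Up2 You" in the output list above.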
