The Cleanup rule's option to insert a space before capitalized letters chops up all-caps words (acronyms) such as HDTV, HTML, etc.
When it encounters successive capitalized letters, it should insert a space BEFORE and AFTER the entire bunch, but not in between. That will keep the acronym intact, and it will also separate it from the rest of the text.
Last edited by narayan (2009-09-27 11:49)
Sounds good!
My immediate thought was that "A" on its own might create some trouble. For example, "ThisIsADJExample" should be "This Is A DJ Example", but by the logic above "A DJ" would become "ADJ".
I will work on this one now...
Ok, I have finished. Available in the latest beta version.
It was a little tricky to figure out all of the possible scenarios. Maybe somebody wants to double check it?
Here is the new algorithm used for this option:
Result := S;
for I := Length(S) downto 2 do
begin
  // if upper case
  if IsUp(I) then
  begin
    // if upper case precedes
    if IsUp(I-1) then
    begin
      // if last char
      if I = Length(S) then
        Continue
      // if not last
      else
        // if upper case or space follows
        if IsUp(I+1) or IsSpace(I+1) then
          Continue;
    end;
    // if space precedes
    if IsSpace(I-1) then
      Continue;
    // else insert space
    Insert(' ', Result, I);
  end;
end;
Where IsUp(I) returns TRUE if the character at position I in the original string S is an upper case letter; IsSpace(I) works similarly, except that it checks whether the character is a space.
P.S. This does not solve the problem when "A" is used together with the acronym.
From my tests I have a question:
FROM:
ThisIsADJExample
ThisIsA_Long-Sentence,WithoutAny Spaces
TO:
This Is ADJ Example
This Is A_ Long- Sentence, Without Any Spaces
Do we want to exclude punctuation marks?
To get
This Is A_Long-Sentence, Without Any Spaces
(I think comma and dot don't have to be excluded, but underscore and dash?)
Denis, would you please also provide fully working PS code, so that we can play around with it ourselves?
Read the *WIKI* for HELP + MANUAL + Tips&Tricks.
If ReNamer had helped you, please *DONATE* to Denis or buy a PRO license. (Read *Lite vs Pro*)
It works perfectly (I had overlooked the possibility that there could be a word following the acronym, so the last capitalized letter would actually belong to the next word).
That leaves only the article "a" preceding the acronym (but not the articles "The" and "An", which will be treated correctly).
On the other hand, overcorrecting won't let us have any acronym beginning with A. http://www.acronymslist.com/alphabet/A.html
Ok, basically the code (previously posted by me) remains unchanged, except IsSpace(I) function now also returns TRUE if character is a "_" (underscore) or "-" (dash). This will fix Stefan's examples.
FROM:
ThisIsADJExample
ThisIsA_Long-Sentence,WithoutAny Spaces
TO:
This Is ADJ Example
This Is A_Long-Sentence, Without Any Spaces
P.S. Stefan, my code was the real source code (not pseudo code), except declarations for IsSpace and IsUp functions are missing.
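For anyone who wants to experiment outside ReNamer, the loop posted above, with the widened IsSpace check, can be approximated in Python. This is only a sketch, not ReNamer's code: the function name, the `space_alike` parameter, and the use of Python's `str.isupper` / `str.isspace` in place of the original character tests are assumptions made for illustration.

```python
def insert_space_before_capitals(s: str, space_alike: str = "_-") -> str:
    """Hypothetical Python port of the loop posted earlier in this thread."""
    n = len(s)

    def is_up(i: int) -> bool:
        # 0-based index; out-of-range positions count as "not upper case"
        return 0 <= i < n and s[i].isalpha() and s[i].isupper()

    def is_space(i: int) -> bool:
        # whitespace, or one of the characters treated like a space
        return 0 <= i < n and (s[i].isspace() or s[i] in space_alike)

    chars = list(s)
    # Walk right to left so earlier insertions do not shift pending indices.
    for i in range(n - 1, 0, -1):
        if not is_up(i):
            continue
        if is_up(i - 1):
            # inside or at the end of a run of capitals
            if i == n - 1:
                continue
            if is_up(i + 1) or is_space(i + 1):
                continue
        if is_space(i - 1):
            continue
        chars.insert(i, " ")
    return "".join(chars)


print(insert_space_before_capitals("ThisIsADJExample"))
print(insert_space_before_capitals("ThisIsA_Long-Sentence,WithoutAny Spaces"))
```

Passing `space_alike=""` gives the earlier behaviour, where only real whitespace suppresses the insertion and the second example comes out as "This Is A_ Long- Sentence, Without Any Spaces".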
Yes, these functions are missing. That is why I can't test the modifications.
So I asked whether you would like to post the whole code, so that we can adapt it to different needs.
If it is closed source, don't do it; I would understand.
Thanks for improving your tools again and again.
Here is the full implementation of InsertSpaceBeforeCapitals function. I haven't tried pasting it into the PascalScript rule, I have a feeling it is not going to work straight away. PascalScript might not support sub-routines (that is functions within functions), and possibly WideChar sets. Anyway, IsSpace and IsUp are only helper functions which use cached values for character types in the input string, so they can easily be reimplemented for PascalScript.
Good luck!
function InsertSpaceBeforeCapitals(const S: WideString): WideString;
const
  SPACE_ALIKE = [WideChar('_'), WideChar('-')];
var
  I: Integer;
  IsUpCache: Array of Boolean;
  IsSpaceCache: Array of Boolean;

  procedure Init;
  var A: Integer;
  begin
    SetLength(IsUpCache, Length(S));
    SetLength(IsSpaceCache, Length(S));
    for A := 1 to Length(S) do
    begin
      IsUpCache[A-1] := IsWideCharAlpha(S[A]) and IsWideCharUpper(S[A]);
      IsSpaceCache[A-1] := IsWideCharSpace(S[A]) or (S[A] in SPACE_ALIKE);
    end;
  end;

  function IsSpace(Index: Integer): Boolean;
  begin
    Result := (Index >= 1) and (Index <= Length(S));
    if Result then Result := IsSpaceCache[Index-1];
  end;

  function IsUp(Index: Integer): Boolean;
  begin
    Result := (Index >= 1) and (Index <= Length(S));
    if Result then Result := IsUpCache[Index-1];
  end;

begin
  Init;
  Result := S;
  for I := Length(S) downto 2 do
  begin
    // if upper case
    if IsUp(I) then
    begin
      // if upper case precedes
      if IsUp(I-1) then
      begin
        // if last char
        if I = Length(S) then
          Continue
        // if not last
        else
          // if upper case or space follows
          if IsUp(I+1) or IsSpace(I+1) then
            Continue;
      end;
      // if space precedes
      if IsSpace(I-1) then
        Continue;
      // else insert space
      Insert(' ', Result, I);
    end;
  end;
end;
Thank you, Denis!
---
Playing around with the beta from Tuesday evening,
I found a few more possibilities to take into account (maybe; I leave this up to you).
I don't think I have found all the possibilities yet, but I wanted to report these so you/we can check whether this should be the version to release.
FROM:
TestStringA WWWWithUpper.Case,Letters_And_Underscores-And-Dashes
TestStringA (BC) -- [CD] -- !EF! -- ,GH,
De'Argostino --- 'THE`Master
ThisIsMy$Var
!NotImportendFile
.AnDotAtBegin
TheBeatles-TheSongRMX (Master)
TheBeatles-TheSongRMX(Master)
TO:
Test String A WWW With Upper. Case, Letters_And_Underscores-And-Dashes
Test String A ( B C) -- [ C D] -- ! E F! -- , G H,
De' Argostino --- ' TH E` Master
This Is My$ Var
! Not Importend File
. An Dot At Begin
The Beatles-The Song RMX ( Master)
The Beatles-The Song RM X( Master)
I think the expected result would be (maybe I am wrong):
Test String A (BC) -- [CD] -- !EF! -- , GH,
De'Argostino --- 'THE` Master
This Is My $Var
!Not Importend File
.An Dot At Begin
The Beatles-The Song RMX (Master)
The Beatles-The Song RMX(Master)
-------- Some more tests. I know you can't check them all; I just want to show them.
FROM:
MusicFinder\Pop\2Raumwohnung - Wir Werden Sehen_01.mp3
Unknown\AventuraFeat.Akon,Wisin&Yandel-AllUp2You.mp3
Hip hop\T.I. - Dead and gone.mp3
G.G. Anderson - Discofox Hit Mix.mp3
D.J. Bobo - Pray.mp3
O.M.D - Pandora´s Box.mp3
Radio NJoy - Broadcasted by Atlantis.BG.ogg
Bobby 'Boris' Pickett - Monster Mash.mp3
TO:
Music Finder\ Pop\2 Raumwohnung - Wir Werden Sehen_01.mp3
Unknown\ Aventura Feat. Akon, Wisin& Yandel-All Up2 You.mp3
Hip hop\ T. I. - Dead and gone.mp3 ---------- back slash left
G. G. Anderson - Discofox Hit Mix.mp3 -------- dots
D. J. Bobo - Pray.mp3
O. M. D - Pandora´s Box.mp3
Radio N Joy - Broadcasted by Atlantis. BG.ogg --- two upper case chars
Bobby ' Boris' Pickett - Monster Mash.mp3
Last edited by Stefan (2009-09-30 13:23)
I think this will be my last attempt to make an adjustment to this option. There are just too many exceptions and exceptions to exceptions and so on, i.e. no well defined rules to these adjustments.
Stefan, I think you got carried away and forgot that the option is called "Insert spaces in front of capitals", as opposed to "Insert spaces where they seem to fit".
The latest modifications work in the following way:
INPUT:
ThisIsADJExample
ThisIsA_Long-Sentence,WithoutAny Spaces
TestStringA WWWWithUpper.Case,Letters_And_Underscores-And-Dashes
TestStringA (BC) -- [CD] -- !EF! -- ,GH,
De'Argostino --- 'THE`Master
ThisIsMy$Var
!NotImportendFile
.AnDotAtBegin
TheBeatles-TheSongRMX (Master)
TheBeatles-TheSongRMX(Master)
MusicFinder\Pop\2Raumwohnung - Wir Werden Sehen_01.mp3
Unknown\AventuraFeat.Akon,Wisin&Yandel-AllUp2You.mp3
Hip hop\T.I. - Dead and gone.mp3
G.G. Anderson - Discofox Hit Mix.mp3
O.M.D - Pandora´s Box.mp3
Radio NJoy - Broadcasted by Atlantis.BG.ogg
Bobby 'Boris' Pickett - Monster Mash.mp3
OUTPUT:
This Is ADJ Example
This Is A_Long-Sentence,Without Any Spaces
Test String A WWW With Upper.Case,Letters_And_Underscores-And-Dashes
Test String A (BC) -- [CD] -- !EF! -- ,GH,
De'Argostino --- 'THE`Master
This Is My$Var
!Not Importend File
.An Dot At Begin
The Beatles-The Song RMX (Master)
The Beatles-The Song RMX(Master)
Music Finder\Pop\2 Raumwohnung - Wir Werden Sehen_01.mp3
Unknown\Aventura Feat.Akon,Wisin&Yandel-All Up2 You.mp3
Hip hop\T.I. - Dead and gone.mp3
G.G. Anderson - Discofox Hit Mix.mp3
O.M.D - Pandora´s Box.mp3
Radio N Joy - Broadcasted by Atlantis.BG.ogg
Bobby 'Boris' Pickett - Monster Mash.mp3
Conclusions:
This way, all capitals are handled unless punctuation is involved. If a user wants some punctuation marks handled differently, they can easily replace them with spaces, or insert spaces around them, with a simple replace rule. On the other hand, if we handled punctuation as well and the user didn't want some cases corrected (like O.M.D-style abbreviations), it would be harder to remove the spaces from only those cases. I think the base goal is achieved here.
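The "simple replace rule" idea can be illustrated with a few lines of Python. This is a rough sketch of the kind of follow-up step a user might apply, not a ReNamer feature; the helper name `space_after` and its `marks` parameter are made up for the example.

```python
def space_after(text: str, marks: str = ",") -> str:
    """Insert one space after each chosen mark unless whitespace already follows."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # only add a space when another non-whitespace character follows the mark
        if ch in marks and nxt and not nxt.isspace():
            out.append(" ")
    return "".join(out)


print(space_after("This Is A_Long-Sentence,Without Any Spaces"))
```

Because the user picks the marks, commas can be spaced out while dots in O.M.D-style abbreviations are left alone.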
function InsertSpaceBeforeCapitals(const S: WideString): WideString;
var
  I: Integer;
  IsUpCache, IsLowCache, IsLetterCache: Array of Boolean;

  procedure Init;
  var A: Integer;
  begin
    SetLength(IsUpCache, Length(S));
    SetLength(IsLowCache, Length(S));
    SetLength(IsLetterCache, Length(S));
    for A := 1 to Length(S) do
    begin
      IsUpCache[A-1] := IsWideCharAlpha(S[A]) and IsWideCharUpper(S[A]);
      IsLowCache[A-1] := IsWideCharAlpha(S[A]) and IsWideCharLower(S[A]);
      IsLetterCache[A-1] := IsWideCharAlphaNumeric(S[A]);
    end;
  end;

  function IsLetter(Index: Integer): Boolean;
  begin
    Result := IsLetterCache[Index-1];
  end;

  function IsUp(Index: Integer): Boolean;
  begin
    Result := IsUpCache[Index-1];
  end;

  function IsLow(Index: Integer): Boolean;
  begin
    Result := IsLowCache[Index-1];
  end;

begin
  Init;
  Result := S;
  for I := Length(S) downto 2 do
  begin
    // if upper case
    if IsUp(I) then
    begin
      // if upper case precedes
      if IsUp(I-1) then
      begin
        // if last char
        if I = Length(S) then Continue
        // if not last and not lower case follows
        else if not IsLow(I+1) then Continue;
      end;
      // if not letter precedes
      if not IsLetter(I-1) then Continue;
      // else insert a space
      Insert(' ', Result, I);
    end;
  end;
end;
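For testing more inputs without a Delphi compiler, the function above can be approximated in Python. Treat this as a sketch under assumptions: Python's `str.isalpha`, `str.isupper`, `str.islower` and `str.isalnum` stand in for the IsWideChar* functions and may classify some Unicode characters differently, and the snake_case name is simply a transliteration of the original.

```python
def insert_space_before_capitals(s: str) -> str:
    """Approximate Python port of the final InsertSpaceBeforeCapitals."""
    n = len(s)
    # Cache the character classes once, as the original Init procedure does.
    is_up = [c.isalpha() and c.isupper() for c in s]
    is_low = [c.isalpha() and c.islower() for c in s]
    is_alnum = [c.isalnum() for c in s]

    chars = list(s)
    # 0-based i runs from the last character down to the second one,
    # matching "for I := Length(S) downto 2" in the 1-based original.
    for i in range(n - 1, 0, -1):
        if not is_up[i]:
            continue
        if is_up[i - 1]:
            # last character of the string: keep the run of capitals together
            if i == n - 1:
                continue
            # not followed by a lower case letter: still inside the acronym
            if not is_low[i + 1]:
                continue
        # preceded by punctuation, whitespace, etc.: leave untouched
        if not is_alnum[i - 1]:
            continue
        chars.insert(i, " ")
    return "".join(chars)


for name in [
    "ThisIsADJExample",
    "TheBeatles-TheSongRMX(Master)",
    r"Unknown\AventuraFeat.Akon,Wisin&Yandel-AllUp2You.mp3",
]:
    print(insert_space_before_capitals(name))
```

Iterating from right to left means each inserted space only shifts characters that have already been processed, so the cached indices stay valid.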