#1 2017-11-10 05:42

B00ze
Member
Registered: 2016-03-11
Posts: 10

Rename 500,000 files - very slow clearing

Good day.

Well yesterday and today I finally had a need for ReNamer (I don't use it often). I downloaded the 2015 and 2016 SPAM Archives from untroubled.org/spam/ and discovered that the files needed to be renamed *.eml for me to import them into Thunderbird.

So yesterday I fired-up ReNamer 6.6 and added the 2015 folder (with all subfolders) and had ~500K files in ReNamer. Created a simple Extension rule to add .EML to the filenames. Ran the rule -> All files were renamed in about 30-45 seconds, but afterwards I had two unresponsive ReNamer items/windows in my Windows toolbar/taskbar. After a couple minutes I just end-tasked.

Today I upgraded to ReNamer 6.7, and tried the 2016 archive. I first tried a single month (about 30K files) -> This worked fine. So I added all the remaining months (folders) and again ran the rename on ~500K files. Once again, ReNamer renamed everything quickly and went into never never land afterwards. This time I came here to post, and while I was writing, ReNamer finally finished and became responsive again.

You might want to look at the "clean-up" phase in ReNamer. It takes 30-45 seconds to rename ~500K files but then takes 5 minutes to clear-up those renamed files from the list-view...

Anyway, glad I had ReNamer; saved me from pulling out the FOR /R syntax to write a small DOS rename command and possibly screw-up and mangle all the filenames.
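For readers without ReNamer to hand, the same "append .eml to everything under a folder" job can be sketched in a few lines of Python. This is only an illustration: the function name `append_extension` and the skip-if-target-exists policy are assumptions made here, not anything ReNamer does.

```python
import os
import tempfile


def append_extension(root: str, ext: str = ".eml") -> int:
    """Append `ext` to every file under `root` that does not already end with it.

    Returns the number of files renamed. Hypothetical helper for illustration.
    """
    renamed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(ext.lower()):
                continue  # already has the extension, leave it alone
            src = os.path.join(dirpath, name)
            dst = src + ext
            if os.path.exists(dst):
                continue  # never overwrite an existing file
            os.rename(src, dst)
            renamed += 1
    return renamed


if __name__ == "__main__":
    # Try it on a throwaway directory rather than on real data.
    with tempfile.TemporaryDirectory() as demo:
        os.makedirs(os.path.join(demo, "01"))
        open(os.path.join(demo, "01", "1420070400.1234"), "w").close()
        print(append_extension(demo), "file(s) renamed")
```

Running it on a copy of the data first is the safer choice, since a rename script has no preview step.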

Best Regards,

Offline

#2 2017-11-10 12:33

den4b
Administrator
From: den4b.com
Registered: 2006-04-06
Posts: 2,547

Re: Rename 500,000 files - very slow clearing

Hello,

Let's have a look at the benchmark timings from a development station:

Num files    | 100K | 200K | 500K | 
Create files |   20 |   56 |  163 | seconds
Load files   |   14 |   29 |   72 | seconds
Memory use   |   60 |  100 |  224 | MB
Preview      |    1 |    1 |    1 | seconds
Rename       |   59 |  132 |  357 | seconds
Clear files  |    3 |   12 |   72 | seconds

* A separate tool was used to time the creation of files.
* Memory use is the working memory set as displayed in the Process Explorer.
* Preview was performed with a simple Insert rule and without validation.

The time taken to clear all files grows much faster than linearly (roughly quadratically in the timings above: doubling the file count from 100K to 200K quadruples the clearing time), which is undesirable, but unfortunately unavoidable. Memory management takes its toll when reshuffling millions of variable allocations, but it is necessary to maintain a clean state (stable and free of memory leaks). Note that the timing values are still very reasonable.
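One way to read the growth rate off the table above is to estimate the exponent k in "time grows like n^k" from consecutive columns. The sketch below uses only the Load and Clear rows from the table; the helper name `scaling_exponent` is made up for this illustration.

```python
import math

# (number of files, seconds), copied from the benchmark table above.
load_times = [(100_000, 14), (200_000, 29), (500_000, 72)]
clear_times = [(100_000, 3), (200_000, 12), (500_000, 72)]


def scaling_exponent(a, b):
    """Return k such that time grows like n**k between two measurements."""
    (n1, t1), (n2, t2) = a, b
    return math.log(t2 / t1) / math.log(n2 / n1)


def exponents(rows):
    """Exponent estimates for each pair of neighbouring measurements."""
    return [scaling_exponent(a, b) for a, b in zip(rows, rows[1:])]


if __name__ == "__main__":
    print("load :", [round(k, 2) for k in exponents(load_times)])
    print("clear:", [round(k, 2) for k in exponents(clear_times)])
```

An exponent near 1 means linear scaling and an exponent near 2 means quadratic scaling. With these numbers, loading comes out close to 1 and clearing close to 2.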

I would advise you to inspect and tweak your settings in ReNamer. Some of them may have a significant impact on performance when processing large numbers of files.

Regarding the 2 frozen windows appearing in the taskbar: that is normal. ReNamer has a hidden application management window in addition to the main window. Windows may decide to show both in the taskbar when the application is too busy and appears frozen.

Killing ReNamer's process instead of waiting for it to clear all files is an acceptable workaround. The operating system will clean up the entire memory footprint instantly, instead of freeing millions of memory allocations individually.

Offline

#3 2017-11-13 06:09

B00ze
Member
Registered: 2016-03-11
Posts: 10

Re: Rename 500,000 files - very slow clearing

Good day.

First, I am not sure what your development station looks like, but your numbers are not exactly like mine - It definitely takes LESS time to "Create and Load" the files than it takes to "Clear" them. Even renaming, because of my write-back cache I guess, takes less time than clearing. It really took a whole 5 minutes to clear 500K files - on a Skylake @ 4.5GHz with fast memory; I have no idea how it can take you only 72 seconds.

Second, I had occasion to test this again, this time with 2017's junk mail archive. This amounted to ~1.3M files. Again, it took about 2 minutes to load all the files into ReNamer, and 2 minutes to rename everything. However, it took an HOUR to clear the list. That's right, a whole HOUR! Clearly this is an area that can be improved upon. It shouldn't be that renaming files is O(N) but clearing the list-view is O(N^2) or worse. It's almost as if you were re-sorting the list every time an element was removed (it's not, of course, since removing an element from a sorted linked list keeps the list in order, but /something/ is eating up all those cycles.)

Some suggestions: Use memory pools. And since creating the list is super-fast, why not, after a rename, walk the list and create a new list with the items that are supposed to stay behind, and then free-up the entire pool where the first list resides? I don't know how you're doing it now, but it fails miserably with a large number of files.
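The "build a new list with the items that stay behind" idea can be sketched as follows. This is a toy model using a plain Python list of strings, not ReNamer's actual data structures; the function names and the `done` set are assumptions made for the illustration.

```python
import time


def remove_one_by_one(items, done):
    """Delete finished entries in place.

    Each `del` shifts every later element down by one slot, so removing
    most of the list this way costs O(N^2) element moves overall.
    """
    i = 0
    while i < len(items):
        if items[i] in done:
            del items[i]
        else:
            i += 1
    return items


def rebuild(items, done):
    """Copy the survivors into a fresh list in a single O(N) pass.

    The old list can then be dropped as a whole.
    """
    return [item for item in items if item not in done]


if __name__ == "__main__":
    n = 20_000
    files = [f"file{i}" for i in range(n)]
    # Pretend every file was renamed except every 1000th one (an "error").
    done = {f for i, f in enumerate(files) if i % 1000 != 0}

    start = time.perf_counter()
    slow = remove_one_by_one(list(files), done)
    middle = time.perf_counter()
    fast = rebuild(list(files), done)
    end = time.perf_counter()

    assert slow == fast
    print(f"one by one: {middle - start:.3f}s, rebuild: {end - middle:.3f}s")
```

Both functions leave the same survivors in the same order; only the amount of work differs, and the gap widens as the list grows.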

Best Regards.

Offline

#4 2017-11-15 00:04

den4b
Administrator
From: den4b.com
Registered: 2006-04-06
Posts: 2,547

Re: Rename 500,000 files - very slow clearing

Our benchmarks did not reveal such a severe degradation in performance. However, your latest observations are indeed very far from normal.

So, we have conducted an in-depth investigation and discovered an inefficiency in the clearing operation, which causes large memory relocations due to inadvertent poor sequencing. In simple terms, an efficient clearing of one list causes very inefficient clearing of referenced items in another list.
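As a purely hypothetical illustration of how the order of clearing can matter (this is a toy model, not ReNamer's actual code), imagine that every item removed from one list must also be looked up and removed in a second list that references it. The function name and the step counting below are assumptions made for the sketch.

```python
def clear_counting_steps(owner, registry, from_back):
    """Empty `owner`; each removed item is also searched for in `registry`.

    Returns how many registry positions were scanned in total. Only the
    registry scans are counted, to isolate the effect of the ordering.
    """
    steps = 0
    while owner:
        # Popping from the back is the cheap direction for `owner` itself.
        item = owner.pop() if from_back else owner.pop(0)
        idx = registry.index(item)  # linear scan from the front
        steps += idx + 1
        del registry[idx]
    return steps


if __name__ == "__main__":
    n = 1000
    back = clear_counting_steps(list(range(n)), list(range(n)), from_back=True)
    front = clear_counting_steps(list(range(n)), list(range(n)), from_back=False)
    print(back)   # 500500: every lookup scans the whole remaining registry
    print(front)  # 1000: every lookup hits the first registry slot
```

In this model the direction that is cheapest for the first list is the most expensive one for the second, and the total scan count grows quadratically with the number of items.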

The cost of this inefficiency grows much faster than linearly with the number of files and can be amplified by differences in memory and CPU performance as well as memory availability.

A fix is on the way and should become available shortly.

Thank you for your persistence! ;)

Offline

#5 2017-11-15 03:33

B00ze
Member
Registered: 2016-03-11
Posts: 10

Re: Rename 500,000 files - very slow clearing

Good day.

Glad you found the problem. I never wanted to say "ReNamer is crap" or anything like that, and I am glad you took my criticism openly - ReNamer as far as I am concerned is the best renaming utility out there, bar none; the "program your rename operation" model is exceptional, and you go beyond string manipulations with support for Tags etc (and if I recall, you went nuts and included a scripting language). But something was really wrong with clearing-up of the renamed items, so I said "fails miserably with large number of files." A bit strong, my apologies.

Best Regards,

Offline

#6 2017-11-15 11:35

den4b
Administrator
From: den4b.com
Registered: 2006-04-06
Posts: 2,547

Re: Rename 500,000 files - very slow clearing

No worries, no offense taken! You were absolutely right to continue the debate.

The fix has been integrated into ReNamer 6.7.0.9 Beta. Please try the latest development version:
http://www.den4b.com/download/renamer/beta

Clearing all files is significantly faster now.

Note: Partial clearing of files will still be slow, because it is not possible to apply the same optimization in this case.

Offline

#7 2017-11-18 03:57

B00ze
Member
Registered: 2016-03-11
Posts: 10

Re: Rename 500,000 files - very slow clearing

Good day.

Bad news; still takes an hour. So I thought maybe I was using some wrong option. Found "Remove Renamed Files" and "Clear List" options. I was using "Remove Renamed Files" so I figured enabling "Clear List" would make it go really fast? No change. Then I thought maybe I should not be enabling both options, so I unchecked "Remove Renamed Files" and kept "Clear List" but still no change, took an hour again.

Could be some case of cache problem where the way you clear the lists causes a cache miss constantly? And the Xeon in your workstation doesn't have that problem because it has bigger cache lines or something?

Just for your information, I don't see any problems with my memory subsystem:
Dual DDR4-3200 14-14-14-34 CR2
Read Speed 44768 MB/s
Write Speed 47505 MB/s
Latency 44.6 ns

Best Regards,

Offline

#8 2017-11-18 19:42

den4b
Administrator
From: den4b.com
Registered: 2006-04-06
Posts: 2,547

Re: Rename 500,000 files - very slow clearing

There is an explanation for that.

Currently, only the explicit clearing of all files via the context menu of the files table (Clear -> Clear All) is able to use the optimized clearing method.

Options which automatically clear files on rename actually perform a selective clearing. The "clear files table on rename" option skips files with an error status, while "clear renamed files on rename" clears only files with a renamed status.

A potential workaround could be to check whether the selective clearing would affect all files, and use the optimized clearing method in that case, but this would add extra overhead when it is not the case.

Another potential workaround is to add an option which clears all files on rename regardless of their status.
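The first workaround (check whether the selective clearing would hit every file, then take the fast path) could look roughly like this. It is a sketch under assumptions: `Status`, `FileEntry` and `clear_on_rename` are names invented here, and a Python list stands in for the files table.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    RENAMED = "renamed"
    ERROR = "error"


@dataclass
class FileEntry:
    path: str
    status: Status


def clear_on_rename(files, should_clear):
    """Clear entries whose status satisfies `should_clear`.

    Returns "all" if the bulk fast path was taken, else "selective".
    """
    # all() stops at the first entry that must stay, so the extra check
    # costs a full pass only when every entry really is cleared.
    if all(should_clear(f.status) for f in files):
        files.clear()  # stands in for the optimized "Clear All"
        return "all"
    files[:] = [f for f in files if not should_clear(f.status)]
    return "selective"


if __name__ == "__main__":
    table = [FileEntry("a.eml", Status.RENAMED), FileEntry("b", Status.ERROR)]
    # "Clear files table on rename": everything except errors goes.
    print(clear_on_rename(table, lambda s: s is not Status.ERROR), table)
```

The second workaround (an option that clears everything regardless of status) would simply skip the check and always take the `files.clear()` branch.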

Offline

#9 2017-11-28 06:03

B00ze
Member
Registered: 2016-03-11
Posts: 10

Re: Rename 500,000 files - very slow clearing

Hey, sorry for the delay in replying.

Yeah, you could rename the "Clear files table" option, or add a little "?" circle next to it, saying files with rename errors stay there, and then add a 3rd option "Clear all files on rename (quick)" or something similar. But I still think there has to be a better way. I mean, I can load Excel with a million lines and I am sure I can do lots of stuff with the data much quicker than ReNamer clears the file list (I haven't tried tho). You could maybe keep a count of the number of errors, and if less than 1/2 the number of list entries, build a new list with the files with errors and flush the old list. ReNamer is really good at loading the list, it takes less than 2 minutes to load all 1.3M entries; I am assuming that copying a few entries to a new list would be just as fast. It would go from an hour to a few minutes, a sizable improvement.

Best Regards,

Offline

#10 Today 13:00

den4b
Administrator
From: den4b.com
Registered: 2006-04-06
Posts: 2,547

Re: Rename 500,000 files - very slow clearing

There is a potential solution to all these problems.

We have investigated several alternative data structures and benchmarked Search, Insert and Delete operations in various orders. Basic structures such as array and linked list simply grind to a halt for large datasets, at least for some of these operations.

Then we came across the AVL tree data structure, which is a self-balancing binary search tree. It has shown incomparable superiority in all benchmarks for large datasets, at the cost of slightly worse performance for small datasets.

The plan is to use an AVL tree, hopefully in the near future.
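For readers unfamiliar with the structure, here is a minimal AVL tree sketch with Insert, Delete and Search. It is a textbook-style illustration written for this thread, not the implementation planned for ReNamer; all names are made up here.

```python
class Node:
    __slots__ = ("key", "left", "right", "height")

    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def balance(node):
    return height(node.left) - height(node.right)


def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    update(y)
    update(x)
    return x


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    update(x)
    update(y)
    return y


def rebalance(node):
    """Restore the AVL invariant (subtree heights differ by at most 1)."""
    update(node)
    if balance(node) > 1:  # left-heavy
        if balance(node.left) < 0:
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance(node) < -1:  # right-heavy
        if balance(node.right) > 0:
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    else:
        return node  # duplicate keys are ignored
    return rebalance(node)


def delete(node, key):
    if node is None:
        return None
    if key < node.key:
        node.left = delete(node.left, key)
    elif key > node.key:
        node.right = delete(node.right, key)
    else:
        if node.left is None:
            return node.right
        if node.right is None:
            return node.left
        succ = node.right  # smallest key in the right subtree
        while succ.left:
            succ = succ.left
        node.key = succ.key
        node.right = delete(node.right, succ.key)
    return rebalance(node)


def contains(node, key):
    while node:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


def in_order(node):
    return in_order(node.left) + [node.key] + in_order(node.right) if node else []
```

Because the tree rebalances after every insert and delete, its height stays logarithmic in the number of keys, so each of the three operations touches only a handful of nodes even for millions of entries.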

Offline
