[sip-comm-dev] RFC: Save chat history in embedded database


#1

Hi all,
I saw this listed as a GSoC idea but couldn't find a discussion around it.

Why do you need to change it? Why use XML in the first place?

From a user's standpoint, I'd like a communication app to store its history in
plain text files so one can easily run grep on them. Replacing them with a
binary database format will eliminate this possibility.

Another thing I personally use the history logs for is keeping the log of an
IRC discussion and then posting it to an archives page. With plain text files it
is as simple as attaching the file.

I'd like to understand what makes history so slow that you've decided to speed
it up with a database (preferably without reading all the code).

Thanks,
Alexander.

···

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net


#2

Hi Alexander,

Why use XML in the first place?

With the various metadata associated with a message, I personally
agree with the developers of the current implementation that XML is a
sensible choice when compared to custom-grown plain text formats.

Why do you need to change it?

Stability (including concurrency) and speed issues (which practically
render the history unusable in non-trivial cases) are the reasons to
seek a modification of the current implementation.

From a user's standpoint, I'd like a communication app to store its history in
plain text files so one can easily run grep on them. Replacing them with a
binary database format will eliminate this possibility.

Absolutely!!!

Which means that it's one of the disadvantages of the database approach
that we'll have to weigh against the advantages a database brings us.

Another thing I personally use the history logs for is keeping the log of an
IRC discussion and then posting it to an archives page. With plain text files it
is as simple as attaching the file.

Well, this use case is only straightforward when the format of the
chat log is one of the formats the archive supports. In all other
cases, one has to translate from the application format to the archive
format: in the case of XML one may use XSLT, and in the case of a
database with a vibrant community and relatively extensive utility
support one may go with one of the accompanying dump utilities.

And if an export from the application format to plain text is
supported, the grep use case may not be entirely lost, though it
becomes more difficult for everyday use.

I'd like to understand what makes history so slow that you've decided to speed
it up with a database (preferably without reading all the code).

I personally agree with you here.

What I understand from my discussion with Emil and Damencho (guys,
please correct me wherever I'm wrong) is that making the modifications
to the current implementation that address the stability and speed
issues would effectively lead us to implementing a database of our
own. I cannot completely disagree with that.

Finally, I'd like to thank you for bringing up this discussion and to
invite the community members to share their views.

Regards,
Lubo

···

On Wed, Mar 25, 2009 at 2:38 PM, Alexander Todorov <alexx.todorov@gmail.com> wrote:



#3

Lubomir Marinov wrote:

Hi Alexander,

Hi Lubomir,
thanks for starting the discussion.

Why use XML in the first place?

With the various metadata associated with a message, I personally
agree with the developers of the current implementation that XML is a
sensible choice when compared to custom-grown plain text formats.

Why do you need to change it?

Stability (including concurrency) and speed issues (which practically
render the history unusable in non-trivial cases) are the reasons to
seek a modification of the current implementation.

Hmm, what concurrency? As I see it, history logs are stored under
$account_name/$contact_name. Unless you have 2 instances of SC running on the
same machine and talking to the same contact at the same time (which is nearly
impossible), I don't see the problem. We're only appending to the history, so
even in the event of a collision just append the 1st one, then the 2nd one, etc.

Do you mean concurrency in another context? Please specify.

...

Another thing I personally use the history logs for is keeping the log of an
IRC discussion and then posting it to an archives page. With plain text files it
is as simple as attaching the file.

Well, this use case is only straightforward when the format of the
chat log is one of the formats the archive supports.

I mean plain to the extent that it is easily human readable as in:
user1 DD.MM.YYYY HH:MM:SS > message
user2 DD.MM.YYYY HH:MM:SS > message

A similar format is used by most of the major IM clients.
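
To make the proposed line format concrete, here is a minimal sketch of writing and reading such lines. It is written in Python purely for brevity (SIP Communicator itself is Java), and the function names `format_line` and `parse_line`, as well as the backslash escaping of newlines, are assumptions made for illustration and not part of the proposal above.

```python
import re
from datetime import datetime

# DD.MM.YYYY HH:MM:SS, as in the example lines above.
STAMP = "%d.%m.%Y %H:%M:%S"


def format_line(user, when, message):
    """Render one history record as a single grep-friendly line."""
    # Escape backslashes first, then newlines, so a multi-line message
    # still occupies exactly one line (an assumption of this sketch).
    body = message.replace("\\", "\\\\").replace("\n", "\\n")
    return f"{user} {when.strftime(STAMP)} > {body}"


def _unescape(match):
    # "\n" (backslash + n) becomes a newline; "\\" becomes one backslash.
    char = match.group(1)
    return "\n" if char == "n" else char


def parse_line(line):
    """Split a line back into (user, timestamp, message)."""
    head, body = line.split(" > ", 1)
    # The date and time are the last two space-separated tokens of the
    # head, so user names containing spaces still parse.
    user, date_part, time_part = head.rsplit(" ", 2)
    when = datetime.strptime(f"{date_part} {time_part}", STAMP)
    return user, when, re.sub(r"\\(.)", _unescape, body)
```

A file of such lines stays searchable with plain grep, but the sketch also shows the limits: a user name containing " > " would break the parser, and fields such as direction or message UIDs have no place in the line.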

Looking at the XML format of a history record, I can understand most of the
fields and why they exist. The question is whether users really perform searches
(and searching is pretty much what history is used for, right?) so complicated
that the format can't be simplified.

What I propose is to review all the XML tags and the information contained in
them. What does each one mean, how is it used, what use cases does it serve? I
have the feeling that all of this can be simplified to a brain-dead text file,
and the only thing left will be search speed optimization, which may resolve
itself automagically by that time. Let's not jump into the embedded database
option straight away but investigate and compare other possibilities before
changing the code.

Regards,
Alexander.



#4

Hi Alexander,

As I alluded to in my previous e-mail on the subject, I'm with you rather than with the database-oriented rewrite. Indeed, I was the one to replace the XML configuration store with a plainer (Java .properties) format which does perform much better in terms of execution speed and memory consumption (the change isn't active right now and is scheduled for v2.0).

However, I must recognize the opinion of those with more development experience in SIP Communicator and more knowledge of how its implementation evolved. The history service has reached its current form through numerous iterations and bug fixes, and the most active developers believe that addressing its stability and speed issues in that form stands a low chance of success (not least because there are no reliable steps to reproduce the problems). And since I've started working on other issues, including the addition of conferencing support, and am thus unlikely to be able to devote reasonable time to fixing the current implementation, I'm inclined not to deny the advantages and the prospects of the proposed database idea.

For what it's worth, the database approach cannot be viewed as more than an alternative right now: it's currently just a Google Summer of Code 2009 project, its implementation has yet to start, and its integration will only be considered after the end of the program. So it seems to me that there's still plenty of time for volunteers to stand up and fix the current implementation while others look at the alternative they feel to be more promising, and then let the better one win.

Best regards,
Lubomir

···

On Mar 25, 2009, at 4:32 PM, Alexander Todorov wrote:



#5

Hey Alex,

Alexander Todorov wrote:

Lubomir Marinov wrote:

Hi Alexander,

Hi Lubomir,
thanks for starting the discussion.

Why use XML in the first place?

With the various metadata associated with a message, I personally
agree with the developers of the current implementation that XML is a
sensible choice when compared to custom-grown plain text formats.

Why do you need to change it?

Stability (including concurrency) and speed issues (which practically
render the history unusable in non-trivial cases) are the reasons to
seek a modification of the current implementation.

Hmm, what concurrency? As I see it, history logs are stored under
$account_name/$contact_name. Unless you have 2 instances of SC running on the
same machine and talking to the same contact at the same time (which is nearly
impossible), I don't see the problem. We're only appending to the history, so
even in the event of a collision just append the 1st one, then the 2nd one, etc.

Do you mean concurrency in another context? Please specify.

I believe Lubomir meant concurrent operations over the history files
such as:

* a thread executing a search and another one logging a new message
* a thread logging an outgoing message and another one logging an
incoming one
* (only valid for certain protocols) multiple threads trying to log
consecutive incoming messages received from a server/peer in a single
burst (for example, when signing in and retrieving offline messages)

There are probably other examples too, but these are the first that come
to mind.
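
The first two cases in the list above can be illustrated with a minimal sketch in which every writer and every reader goes through one lock. It is in Python for brevity, the `HistoryLog` class and the `demo` function are hypothetical, and it only covers threads inside a single process; it does nothing for crash recovery or for a second running instance.

```python
import os
import tempfile
import threading


class HistoryLog:
    """Append-only log whose reads and writes are serialized by one lock."""

    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()

    def append(self, line):
        # Only one thread at a time may open and write, so records from
        # concurrent writers never interleave within a line.
        with self._lock:
            with open(self._path, "a", encoding="utf-8") as f:
                f.write(line + "\n")

    def search(self, needle):
        # Readers take the same lock, so a search never runs while a
        # record is being written.
        with self._lock:
            if not os.path.exists(self._path):
                return []
            with open(self._path, encoding="utf-8") as f:
                return [row.rstrip("\n") for row in f if needle in row]


def demo(writers=4, per_writer=50):
    """Let several threads log at once, then return every logged line."""
    path = os.path.join(tempfile.mkdtemp(), "history.log")
    log = HistoryLog(path)

    def worker(n):
        for i in range(per_writer):
            log.append(f"writer{n} message {i}")

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log.search("message")
```

A single coarse lock like this is simple but makes a long search block all logging, which is one reason hand-rolled solutions tend to grow towards what a database already provides.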

There's something else. The history is not the only module that would
benefit from an embedded DB. There are others where the advantages are
even more obvious. It simply made more sense to build the GSoC project
around it to facilitate comprehension. I'll explain more later.

...

Another thing I personally use the history logs for is keeping the log of an
IRC discussion and then posting it to an archives page. With plain text files it
is as simple as attaching the file.

Well, this use case is only straightforward when the format of the
chat log is one of the formats the archive supports.

I mean plain to the extent that it is easily human readable as in:
user1 DD.MM.YYYY HH:MM:SS > message
user2 DD.MM.YYYY HH:MM:SS > message

A similar format is used by most of the major IM clients.

Looking at the XML format of a history record, I can understand most of the
fields and why they exist. The question is whether users really perform searches
(and searching is pretty much what history is used for, right?) so complicated
that the format can't be simplified.

What I propose is to review all the XML tags and the information contained in
them. What does each one mean, how is it used, what use cases does it serve? I
have the feeling that all of this can be simplified to a brain-dead text file,
and the only thing left will be search speed optimization, which may resolve
itself automagically by that time.

A simplistic format like this would be a tough sell in our case for a
variety of reasons, such as:

* it would imply that we'd need to use a separate logging facility for
chats (we currently use the same service for logging calls and IM)
* the output is going to end up mangled anyway for protocols using HTML
formatting (and we do want to keep that in the history)
* in order for the simplification to make sense we would have to get rid
of some of the data we currently store, such as direction and UIDs for
example, and we do need them in the implementation.

As for the search performance optimizations - yes we could indeed work
on implementing indexes, transactions and failure management ourselves
but the history service is not the only place where we'd need them. The
MetaContactList for example has even more glaring issues that could be
addressed by an embedded DB.
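
As a rough illustration of what an embedded database would take over, the sketch below uses SQLite through Python's standard `sqlite3` module. This is only a stand-in: the thread does not name a database engine, and the table layout, column names and helper functions are assumptions made for the example.

```python
import sqlite3


def open_history(path=":memory:"):
    """Open (or create) a hypothetical message store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS message (
               uid       TEXT PRIMARY KEY,
               contact   TEXT NOT NULL,
               direction TEXT NOT NULL CHECK (direction IN ('in', 'out')),
               sent_at   TEXT NOT NULL,  -- ISO 8601, sorts as text
               body      TEXT NOT NULL
           )"""
    )
    # Index so per-contact, time-ordered lookups avoid a full table scan.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_message_contact_time "
        "ON message (contact, sent_at)"
    )
    return conn


def log_burst(conn, records):
    """Insert a burst of messages atomically."""
    # 'with conn' commits if the block succeeds and rolls back if it
    # raises, so a failing burst leaves no partial data behind.
    with conn:
        conn.executemany("INSERT INTO message VALUES (?, ?, ?, ?, ?)", records)


def search(conn, contact, needle):
    """Return (sent_at, direction, body) rows for one contact, oldest first."""
    cur = conn.execute(
        "SELECT sent_at, direction, body FROM message "
        "WHERE contact = ? AND body LIKE ? ORDER BY sent_at",
        (contact, f"%{needle}%"),
    )
    return cur.fetchall()
```

Here the index narrows a search to one contact, and the `with conn:` block gives all-or-nothing behaviour for a burst of messages, which is the kind of machinery that would otherwise have to be written and maintained by hand.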

Let's not jump into the embedded database option
straight away but investigate and compare other possibilities before changing
the code.

We are not exactly jumping into this :). Persistent storage has been one
of the most serious issues in SIP Communicator for a while now. We are
still having unacceptable reliability problems with our MetaContactList
for example and they are all related to issues with graceful failure
management and concurrency. We've spent a considerable amount of effort
there and we are still not where we'd like to be.

The idea of "subcontracting" all of the above to an embedded DB has
therefore been considered on numerous occasions. We've been hoping that
in addition to simply resolving the issues, it would spare us the
maintenance resources that we need to concentrate on AV telephony and
protocols in general.

We therefore decided that we could start with a GSoC project and see how
it goes. Instead of focusing the project on the meta contact list,
however (which is the one with the most urgent issues), we decided that
storing the chat logs in a DB would be a concept easier for a student
to grasp.

We are not yet 100% decided that this is absolutely the way we'll go but
we want to give it a try and see how it goes. If at the end of the
summer it turns out that a DB would bring more issues than it would
resolve then we'd certainly look into other possible solutions.

Hope this sounds reasonable.

Cheers
Emil
