[sip-comm-dev] Uniquely identifying a news item - Reloaded


#1

Hello all,
Sorry for not giving any sign of life for the past 10 days (it
seems that programmers can get writer's block too :D).
As the discussion started here (
https://sip-communicator.dev.java.net/servlets/ReadMsg?listName=dev&msgNo=1814
) hasn't reached a solution regarding this very specific issue
(uniquely identifying items in an RSS feed), I felt it needed a little
more consideration. One reason is that strict
identification of news items is needed for a more urgent issue:
presenting only new items.
In the current implementation of the RSS plugin, only feeds that
present a date property for each item are handled correctly. Other
feeds are either not handled at all (see
http://www.freenews.fr/feeds/rss.php although I suspect some invalid
RSS there too), or all their news items are presented as new even though
they have been read in the past (this is the case with
http://www.pheedo.com/f/drdobbs_all_articles but also with almost
all Atom feeds I tried).
For the moment I see a couple of solutions for this matter, each with
advantages and disadvantages, as follows:
1. Keeping track of the last item displayed. When retrieving the feed
we look for the item we last showed. If it's not in the feed, then all
items are (presumably) new. If it's in the feed, we only show the
items after last_show_item.
  + easy to implement
  + little overhead in the contact class
  - we rely on the fact that items are stored sequentially and
chronologically ordered in the feed, which usually happens but is
not mandatory
  - if the last_show_item disappears from the feed (e.g. for a blog
feed the author marks the post as private, or draft), then the feed is
considered to contain only new items, which is obviously false
  - might be a little slow as we first have to identify the
last_show_item in the feed and only then can new items be shown
2. Keeping track of all the items displayed (or at least a fair amount
of items). This solution would most likely use a hash table or something
similar that contains a per-feed list of shown items. URI/URL-s of news
items or some other hash (CRC comes to mind, but it's just an idea)
could be used as key.
  + it should require only one pass through the feed structure, as it
can be easily decided if an item has been shown or not
  + by using a proper hash we could treat all feeds equally (right now
feeds with no dates fail miserably :frowning: )
  - a little more memory consuming
  - I'm not quite sure about this hashing stuff :smiley: (I'm not going to
implement SHA1 for feed items, but you get the point)
3. Any other idea?
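
To make option 2 more concrete, here is a minimal Java sketch of a per-feed set of already-shown items. It is only an illustration under assumptions: the class name ShownItemsTracker and the method filterNew are hypothetical (nothing like them exists in the plugin as far as this thread says), and the item URI is assumed to be the key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the existing RSS plugin.
class ShownItemsTracker {
    // Feed URL -> keys (assumed here: item URIs) of the items already shown.
    private final Map<String, Set<String>> shownByFeed = new HashMap<>();

    // Returns the keys that were not shown before, in feed order,
    // and remembers them so they are not reported again.
    List<String> filterNew(String feedUrl, List<String> itemKeys) {
        Set<String> shown =
            shownByFeed.computeIfAbsent(feedUrl, k -> new HashSet<>());
        List<String> fresh = new ArrayList<>();
        for (String key : itemKeys) {
            // Set.add returns false when the key was already present.
            if (shown.add(key)) {
                fresh.add(key);
            }
        }
        return fresh;
    }
}
```

This needs a single pass over the feed and does not depend on item order, at the cost of a set that grows with every new item unless it is trimmed somehow.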

So what do you think?

Thanks,
Mihai


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net


#2

Hello Mihai,

I feel that these 2 points are moving the focus slightly off the real
problem.

The question is not really whether we keep track of one or more items
but how we keep track of items at all: what attributes do we use to
compare one item to another?

We have actually already exchanged some thoughts on the subject here:

https://sip-communicator.dev.java.net/servlets/ReadMsg?list=dev&msgNo=1767

Vincent proposes using:
- the published date.
- the title.
- the link/URI (which is always there ... but is not necessarily unique)

I'd also add:
- other tags that standards may define, or that you may discover while
studying various feeds or the mechanisms that other RSS clients use

So to summarize, we need to

1. make sure that our RSS protocol provider implementation supports
using different keys for different feeds, and

2. make sure that we are able to uniquely identify feed items for as many
cases as possible.

Concerning your question as to whether we should be keeping track of one
or more of the last RSS items: I don't really see the point of keeping
track of more than one item. If we assume that we properly implement
unique identification, then keeping track of the last item is enough. In
case our unique identification implementation is not reliable then it
would still be unreliable even if we keep track of more than one item.

What do you think?

Emil



#3

Hello Emil, all,
See inline comments:

> I feel that these 2 points are moving the focus slightly off the real
> problem.
>
> The question is not really whether we keep track of one or more items
> but how we keep track of items at all: what attributes do we use to
> compare one item to another?
>
> We have actually already exchanged some thoughts on the subject here:
>
> https://sip-communicator.dev.java.net/servlets/ReadMsg?list=dev&msgNo=1767
>
> Vincent proposes using:
> - the published date.
> - the title.
> - the link/URI (which is always there ... but is not necessarily unique)

As a matter of fact, I think that on a per-feed basis the link ought
to be unique. What sense would it make to have two different
items pointing to the same address? (OK, that CAN be done, and the
standard doesn't forbid it, but I don't know of any automated
feed generator that would provide the same link for different feed
items.)

> I'd also add:
> - other tags that standards may define, or that you may discover while
> studying various feeds or the mechanisms that other RSS clients use

See further on.

> So to summarize, we need to
>
> 1. make sure that our RSS protocol provider implementation supports
> using different keys for different feeds, and

I have some doubts that using different keys for different feeds would
be the best approach. From what I could understand of how ROME is
built, the main idea is to treat all feeds as uniformly as possible,
without regard to what kind of feed they are (RSS, Atom, etc.). Using a
per-feed-type identification mechanism would, I feel, defeat the
purpose of using ROME, as it would require hacks to bypass ROME
structures.
Differentiating between dated and dateless feeds would be doable though.

> 2. make sure that we are able to uniquely identify feed items for as many
> cases as possible.
>
> Concerning your question as to whether we should be keeping track of one
> or more of the last RSS items: I don't really see the point of keeping
> track of more than one item. If we assume that we properly implement
> unique identification, then keeping track of the last item is enough. In
> case our unique identification implementation is not reliable then it
> would still be unreliable even if we keep track of more than one item.

As a matter of fact, the issue isn't that straightforward and I'm
going to explain why.
At the moment, we handle only feeds that have a date associated with
every item. This is very convenient for a few reasons:
  * Dates can be compared not only for equality but also for order
(is a Date before or after another Date). This way, we can ensure that
the items we consider newer (and unread) really are newer
  * even if the items aren't necessarily ordered by date, by keeping
track of the last item's date, we can make sure we don't display the
same item twice
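
The date-based behaviour described in the two points above can be sketched roughly as follows. This is not the plugin's actual code: the DateBasedFilter and Item classes are hypothetical stand-ins, and every item is assumed to carry a non-null publication date.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical sketch; assumes every item has a publication date.
class DateBasedFilter {
    static final class Item {
        final String title;
        final Date published;

        Item(String title, Date published) {
            this.title = title;
            this.published = published;
        }
    }

    // Date of the newest item shown so far; null until something was shown.
    private Date lastShown;

    // Returns the items strictly newer than the newest one shown before.
    // Works even if the feed is not sorted by date.
    List<Item> filterNew(List<Item> items) {
        List<Item> fresh = new ArrayList<>();
        Date newest = lastShown;
        for (Item item : items) {
            if (lastShown == null || item.published.after(lastShown)) {
                fresh.add(item);
                if (newest == null || item.published.after(newest)) {
                    newest = item.published;
                }
            }
        }
        lastShown = newest;
        return fresh;
    }
}
```

One limitation of remembering a single date: a new item that carries exactly the same timestamp as the last shown one would be skipped.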

On the other hand, for feeds that don't have a date associated with
each item (or for which the date is in an incorrect format) it's quite
difficult to come up with a way to uniquely identify a feed item
that's comparable for both equality and order AND that remains
consistent even if items appear in or disappear from the feed (some
sort of "absolute" order, independent of the feeds' universe :wink: ).
For identifying this kind of feed item (date-less items) I'd go for
the URL/URI solution for the aforementioned reasons.
The main reason for keeping more than one last item is
that the very item we refer to could disappear from the feed, thus
leading to the (wrong) conclusion that all items are new items.
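
A rough Java sketch of keeping several of the last items as anchors, so that one of them vanishing does not mark the whole feed as new. Everything here is an assumption for illustration: the class RecentKeysAnchor and its capacity parameter are hypothetical, item URIs serve as keys, and the feed is assumed to list the newest item first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; assumes the feed lists the newest item first.
class RecentKeysAnchor {
    private final int capacity;
    // Keys of the most recently shown items, newest first.
    private final Deque<String> recent = new ArrayDeque<>();

    RecentKeysAnchor(int capacity) {
        this.capacity = capacity;
    }

    // Collects keys from the top of the feed until any remembered key
    // is met; everything above that anchor is considered new.
    List<String> filterNew(List<String> feedKeys) {
        List<String> fresh = new ArrayList<>();
        for (String key : feedKeys) {
            if (recent.contains(key)) {
                break;
            }
            fresh.add(key);
        }
        // Remember the new keys, oldest first, so the newest ends up in
        // front; drop the oldest remembered key when over capacity.
        for (int i = fresh.size() - 1; i >= 0; i--) {
            recent.addFirst(fresh.get(i));
            if (recent.size() > capacity) {
                recent.removeLast();
            }
        }
        return fresh;
    }
}
```

With a capacity of 1 this degenerates to tracking only the last item; a larger capacity tolerates that many remembered items disappearing from the feed.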

> What do you think?

Trying to further refine the handling behaviour (per feed type / by
using feed-specific tags) seems a little superfluous to me :smiley:

> Emil

Any other ideas?

Mihai



#4

Hi Mihai,

Mihai Balan wrote:

> As a matter of fact, I think that on a per-feed basis the link ought
> to be unique. What sense would it make to have two different
> items pointing to the same address?

How about an RSS feed that notifies you of updates on one particular
page? I don't have a specific example but it doesn't seem that
impossible for someone to want to have such a thing.

Then again, these are only doubts and if you come across a standard that
stipulates unique links then I am perfectly fine with it.

> I have some doubts that using different keys for different feeds would
> be the best approach. From what I could understand of how ROME is
> built, the main idea is to treat all feeds as uniformly as possible,
> without regard to what kind of feed they are (RSS, Atom, etc.). Using a
> per-feed-type identification mechanism would, I feel, defeat the
> purpose of using ROME, as it would require hacks to bypass ROME
> structures.

I completely agree and if you find a way of implementing identification
the same way for all feeds by using the default ROME mechanisms - then
go ahead, this would be great!

However, from our previous discussions I had the feeling that the guys
who were looking into the matter were not finding such a mechanism
inside ROME and that we needed to create our own keys.

> As a matter of fact, the issue isn't that straightforward and I'm
> going to explain why.
> At the moment, we handle only feeds that have a date associated with
> every item. This is very convenient for a few reasons:
>   * Dates can be compared not only for equality but also for order
> (is a Date before or after another Date). This way, we can ensure that
> the items we consider newer (and unread) really are newer
>   * even if the items aren't necessarily ordered by date, by keeping
> track of the last item's date, we can make sure we don't display the
> same item twice
>
> On the other hand, for feeds that don't have a date associated with
> each item (or for which the date is in an incorrect format) it's quite
> difficult to come up with a way to uniquely identify a feed item
> that's comparable for both equality and order AND that remains
> consistent even if items appear in or disappear from the feed (some
> sort of "absolute" order, independent of the feeds' universe :wink: ).
> For identifying this kind of feed item (date-less items) I'd go for
> the URL/URI solution for the aforementioned reasons.
> The main reason for keeping more than one last item is
> that the very item we refer to could disappear from the feed, thus
> leading to the (wrong) conclusion that all items are new items.

I have no experience with RSS but I really feel it's safe to assume that
item order won't change across consecutive feed retrievals and that if
it does, it would be a bug in the feed producer.

I'd therefore say that we don't have to worry that much about order and
that we can safely assume all feeds we're handling would preserve it (i.e.
new items appear at only one end).

In other words, in cases where we don't have a date, we simply go
backwards through the feed until we find an item with a key that we have
already seen and assume that we have also seen all items that are
preceding it.
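
The walk described above might look something like this in Java. It is a sketch under assumptions rather than the agreed implementation: the class LastItemWalk and the method newItems are hypothetical, keys are plain strings (for instance item URIs), and the feed is assumed to list the newest item first, so that "going backwards" means starting from the top of the list.

```java
import java.util.List;

// Hypothetical sketch; assumes the feed lists the newest item first.
class LastItemWalk {
    // Returns the keys that come before the last seen key in the feed.
    // If the key is unknown (null) or no longer in the feed, every item
    // is treated as new.
    static List<String> newItems(List<String> feedKeys, String lastSeenKey) {
        if (lastSeenKey == null) {
            return feedKeys;
        }
        int index = feedKeys.indexOf(lastSeenKey);
        return index < 0 ? feedKeys : feedKeys.subList(0, index);
    }
}
```

The fallback branch is exactly the weak spot raised earlier in the thread: if the remembered item has been removed from the feed, all items are reported as new.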

Cheers
Emil



#5

Hi all,
Here's the optimal solution Emil and I finally settled on :wink:
ROME doesn't give access to format-specific tags
(such as the <guid> in RSS 2.0 feeds), and trying to access such
tags is strongly discouraged (quote from the Javadoc for
getForeignMarkup(): «Returns: Opaque object to discourage use»). So the
best way to make sure each new item is displayed only once is to determine,
on feed retrieval, what identification information can be used and
adapt all further processing to it.
Basically, if the plugin can find suitable date/time information, it will use
that information. Otherwise, it uses the item's URI to identify it, and
retains this URI in order to determine new items. I'll try to make
things as reusable and independent as possible in order to accommodate
a possible better solution that neither Emil nor I could find at the
moment :slight_smile:
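
As a rough illustration of deciding at retrieval time which identification to use, here is a small Java sketch. It deliberately avoids ROME types so that it stays self-contained: the Item class is a hypothetical stand-in for a feed entry, and the names KeyStrategyChooser, Strategy and choose are assumptions, not existing plugin code.

```java
import java.util.Date;
import java.util.List;

// Hypothetical sketch of picking the identification scheme per feed.
class KeyStrategyChooser {
    enum Strategy { DATE, URI }

    // Stand-in for a feed entry; the date may be missing (null).
    static final class Item {
        final String uri;
        final Date published;

        Item(String uri, Date published) {
            this.uri = uri;
            this.published = published;
        }
    }

    // Uses dates only when every item in the retrieved feed has one;
    // otherwise (including an empty feed) falls back to the URI.
    static Strategy choose(List<Item> items) {
        boolean allDated = !items.isEmpty()
            && items.stream().allMatch(item -> item.published != null);
        return allDated ? Strategy.DATE : Strategy.URI;
    }
}
```

Treating an empty feed as URI-based is an arbitrary choice made for this sketch; the decision could just as well be postponed until the first items arrive.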

All the best,
Mihai

···

On 8/8/07, Emil Ivov <emcho@emcho.com> wrote:



#6

Hi Mihai, Emil and all,

Sorry for the late reply (I was away on summer holidays).

Mihai Balan wrote:

> Hi all,
> Here's the optimal solution Emil and I finally settled on :wink:
> ROME doesn't give access to format-specific tags
> (such as the <guid> in RSS 2.0 feeds), and trying to access such
> tags is strongly discouraged (quote from the Javadoc for
> getForeignMarkup(): «Returns: Opaque object to discourage use»). So the
> best way to make sure each new item is displayed only once is to determine,
> on feed retrieval, what identification information can be used and
> adapt all further processing to it.

I completely agree with this point.

> Basically, if the plugin can find suitable date/time information, it will use
> that information. Otherwise, it uses the item's URI to identify it, and
> retains this URI in order to determine new items. I'll try to make
> things as reusable and independent as possible in order to accommodate
> a possible better solution that neither Emil nor I could find at the
> moment :slight_smile:

The current SC implementation uses the date to determine whether a feed
item is new or not, and this solution doesn't work well. Indeed, the last
item, or sometimes the last n items, are re-sent each time SC starts.
I haven't had a look at the code yet to see whether it's only a date
problem or simply a coding mistake, but if it turns out to be a problem
with the dates, it would be nice to use a dual key (even if the date seems
correct) for identifying the item, i.e. the date/URI pair.
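
A dual date/URI key could be modelled as a small value class, as in the sketch below. The class name ItemKey is hypothetical and nothing here comes from the SC code base; it only shows how the pair can be compared and stored in a hash-based collection.

```java
import java.util.Date;
import java.util.Objects;

// Hypothetical value class combining the two identifying attributes.
final class ItemKey {
    private final Date published; // may be null for date-less items
    private final String uri;

    ItemKey(Date published, String uri) {
        // Date is mutable, so keep a private copy.
        this.published =
            published == null ? null : new Date(published.getTime());
        this.uri = uri;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof ItemKey)) {
            return false;
        }
        ItemKey that = (ItemKey) other;
        // Two keys match only if both the date and the URI match.
        return Objects.equals(published, that.published)
            && Objects.equals(uri, that.uri);
    }

    @Override
    public int hashCode() {
        return Objects.hash(published, uri);
    }
}
```

Because equals and hashCode cover both fields, such keys can go straight into a HashSet of already-shown items. Note that an item whose date changes while its URI stays the same would then count as a new item.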

Hope this helps,
Vincent

···

All the best,
Mihai

On 8/8/07, Emil Ivov <emcho@emcho.com> wrote:

Hi Mihai,

Mihai Balan wrote:

as a matter of fact I think that on a per-feed basis the link is ought
to be unique. What sense would it make to have two items (different
items) pointing to the same address

How about an RSS feed that notifies you of updates on one particular
page? I don't have a specific example but it doesn't seem that
impossible for someone to want to have such a think.

Then again, these are only doubts and if you come across a standard that
stipulates unique links then I am perfectly fine with it.

I have some doubts that using different keys for different flows would
be the best approach. From what I could understand on how ROME is
built the main idea is to treat all feeds as unitary as possible,
without regard to what kind of feed they are (RSS, ATOM, etc). Using a
per-feed-type indentification mechanism I feel would defeat the
purpose of using ROME, as it would require hacks to bypass ROME
structures.

I completely agree and if you find a way of implementing identification
the same way for all flows by using the default ROME mechanisms - then
go ahead, this would be great!

However, from our previous discussions I had the feeling that the guys
who were looking into the matter were not finding such a mechanism
inside ROME and that we needed to create our own keys.

As a matter of fact the issue isn't that straight-forward and I'm
gonna explain why.
At the moment, we treat only feeds that have a date associated with
every item. This is very convenient for a few reasons:
  * Date-s can not only be compared for equality but also for order
(is a Date before or after another date). This way, we can assure that
what we consider newer (and unread) feeds really are newer
  * even if the items aren't necessarily ordered by date, by keeping
track of the last item's date, we can make sure we don't display the
same item twice

On the other hand, for feeds that don't have a date associated with
each item (or for which the date is in an incorrect format) it's quite
difficult to come up with a way to uniquely identify an item in a way
that's comparable for both equality and order AND that can remain
consistent even if items appear in or disappear from a feed (some
sort of "absolute" order, independent of the feeds' universe :wink: ).
For identifying this kind of item (date-less items) I'd go for
the URL/URI solution for the aforementioned reasons.
The main reason for keeping more than just the last item is that
this very item we refer to could disappear from the feed, thus
leading to the (wrong) conclusion that all items are new.
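
One way to keep "more than just the last item" without unbounded
memory use is a fixed-size set of recently shown keys. The sketch below
is only an illustration under assumptions: the SeenKeys class and its
capacity parameter are hypothetical, keys are plain strings (for
example an item's URL/URI), and capacity is assumed to be at least 1.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical bounded memory of item keys that were already shown. */
class SeenKeys {
    private final int capacity;
    // LinkedHashSet keeps insertion order, so the first element is the oldest.
    private final Set<String> keys = new LinkedHashSet<>();

    SeenKeys(int capacity) {
        this.capacity = capacity;
    }

    /** Records the key; returns true if it had not been seen before. */
    boolean markSeen(String key) {
        if (!keys.add(key)) {
            return false; // already known, so not a new item
        }
        if (keys.size() > capacity) {
            // Evict the oldest remembered key to stay within the bound.
            Iterator<String> oldest = keys.iterator();
            oldest.next();
            oldest.remove();
        }
        return true;
    }

    boolean contains(String key) {
        return keys.contains(key);
    }
}
```

With several keys remembered, one item vanishing from the feed no
longer makes every remaining item look new.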

I have no experience with RSS but I really feel it's safe to assume that
item order won't change through consecutive feed retrievals and that if
it does, it would be a bug in the feed producer.

I'd therefore say that we don't have to worry that much about order and
that we can safely assume all feeds we're handling would preserve it (i.e.
new items appear at only one end).

In other words, in cases where we don't have a date, we simply go
backwards through the feed until we find an item with a key that we have
already seen and assume that we have also seen all items that
precede it.
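
The scan described above could be sketched roughly as follows. This is
an illustration only: the BackwardScan class is a hypothetical name,
keys are plain strings, and the list is assumed to be ordered newest
first, so "going backwards" means walking from the newest item towards
older ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class BackwardScan {
    /**
     * Walks the feed from the newest item towards older ones and stops at
     * the first key that was already shown. Everything collected before
     * that point is treated as new; everything after it is assumed seen.
     */
    static List<String> unseen(List<String> keysNewestFirst, Set<String> seen) {
        List<String> fresh = new ArrayList<>();
        for (String key : keysNewestFirst) {
            if (seen.contains(key)) {
                break; // order assumption: all older items were seen too
            }
            fresh.add(key);
        }
        return fresh;
    }
}
```

If none of the remembered keys is still in the feed, the loop never
stops early and every item is reported as new, which is the failure
case raised earlier in the thread.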

Cheers
Emil

What do you think?

Trying to further refine the handling behaviour (per feed type / by
using feed specific tags) seems a little superfluous to me :smiley:

Emil

Any other ideas?

Mihai


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net



#7

Hi Vincent,

Vincent Lucas wrote:

The current SC implementation uses the date to determine whether an item
is new, and this solution doesn't work well. Indeed, the last item, or
sometimes the last n items, are re-sent each time SC starts.
I haven't yet looked at the code to see whether it's a date
problem or simply a coding mistake, but if it turns out to be a problem
with the dates, it would be nice to use a dual key (even if the date
seems correct) for identifying the item: i.e. the date/URI pair.

I was also experiencing this problem. However, I am afraid I don't
really understand how a dual key (date/URI) would fix it. If the date
changes the whole key changes and the item would still look like a new
one. So, I'd still prefer that we have a preference list for keys and we
use the topmost available one.
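
A rough sketch of such a preference list, for illustration only. The
KeyedItem and ItemKeys classes are hypothetical, and the order shown
(URI, then link, then title) is an assumption; the thread has not
settled on which fields to use or in which order.

```java
/** Hypothetical stand-in for a feed item; fields may be null when absent. */
class KeyedItem {
    final String uri;
    final String link;
    final String title;

    KeyedItem(String uri, String link, String title) {
        this.uri = uri;
        this.link = link;
        this.title = title;
    }
}

class ItemKeys {
    /**
     * Returns the first non-blank candidate in preference order,
     * or null if the item offers nothing usable as a key.
     */
    static String keyFor(KeyedItem item) {
        // Assumed preference order; adjust once the list is agreed on.
        String[] candidates = { item.uri, item.link, item.title };
        for (String candidate : candidates) {
            if (candidate != null && !candidate.trim().isEmpty()) {
                return candidate.trim();
            }
        }
        return null;
    }
}
```

Because the date is not part of the key, a changed date on an otherwise
identical item does not make it look new.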

Cheers
Emil
