[sip-comm-dev] Codecs implementation


#1

Hi devs,

Currently I'm working on the speex implementation. It's almost done, but I have some problems with computing the duration of the media based on the input data.

So, as I'm nearly finished with the speex codec, I need some help.

Emil sent me an iLBC implementation in Java written by Jean Lorchat, and I started looking at it and at how it can be plugged into JMF. But I have some difficulties; maybe Jean can help me :))
The encoder and decoder require byte arrays as input and output. As I'm not so familiar with this codec, I don't know, for example, how to compute the length of the output array for the IlbcDecoder based on the input one. I also have the same problem as with speex: how to compute the duration in milliseconds of the media based on the given length of the data.

damencho



#2

Hi damencho,

I hear someone is talking about me :wink:

> Currently I'm working on the speex implementation. It's almost done,
> but I have some problems with computing the duration of the media
> based on the input data.

Shouldn't be hard to fix. Let's discuss iLBC, shall we?

> The encoder and decoder require byte arrays as input and output. As
> I'm not so familiar with this codec, I don't know, for example, how to
> compute the length of the output array for the IlbcDecoder based on
> the input one.

As per the RFC specification of the iLBC codec, the input data MUST be
16-bit samples at an 8000 Hz sampling rate. To put it in friendlier
words, the data comes as 16-bit elements (a short int), and you need
8000 such elements (samples) to represent one second.

Now, we still have to talk about how to feed it to the codec. Once again
we refer to the RFC, which says that iLBC can operate in two modes. It
always has to handle data in blocks, but it can do so with 20 ms or
30 ms blocks. Since we have 8000 samples per second, this means that an
input block is exactly 160 samples (in 20 ms mode) or 240 samples (in
30 ms mode).

I'll have to look at the code again, but I think there is a version of
the encoding/decoding function that works with short[]. Otherwise, you
have to split all your 16-bit values across the byte array. This means
that if you use byte arrays, they are 320 bytes large (20 ms mode) or
480 bytes large (30 ms mode).
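
If you do have to split the shorts yourself, it's nothing fancy; here
is a minimal sketch (assuming little-endian byte order; check what the
codec actually expects):

// Sketch: split 16-bit samples into a byte array, low byte first
// (little-endian). Verify against the byte order the codec expects.
static byte[] shortsToBytes(short[] samples) {
    byte[] out = new byte[samples.length * 2];
    for (int i = 0; i < samples.length; i++) {
        out[2 * i]     = (byte) (samples[i] & 0xff);         // low byte
        out[2 * i + 1] = (byte) ((samples[i] >> 8) & 0xff);  // high byte
    }
    return out;
}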

Of course, the size of the compressed data is also defined in the RFC.
In the 20 ms mode, a compressed block is 304 bits (38 bytes), and in
the 30 ms mode it is 400 bits (50 bytes). The bitrates are respectively
around 15 and 13 kbps.

Symmetrically, if you decode 50 bytes of data, you will get 480 bytes
of sound (i.e. 240 samples, or 30 ms), and if you decode 38 bytes of
data, you'll get 320 bytes (i.e. 160 samples, or 20 ms).

As a side note, the 20 ms/30 ms mode must be configurable, because
although iLBC is interoperable, a 20 ms stream is no good when decoded
in 30 ms mode, and vice versa.
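
To make the arithmetic concrete, here is a minimal sketch of the
length/duration math (the names are made up for illustration; they do
not come from the actual codec classes):

// Illustration only: made-up names, not the real codec classes.
public final class IlbcMath {
    // 30 ms mode: 50 encoded bytes <-> 240 samples <-> 480 PCM bytes
    static final int ENC_BYTES_30MS = 50;
    static final int PCM_BYTES_30MS = 480;
    // 20 ms mode: 38 encoded bytes <-> 160 samples <-> 320 PCM bytes
    static final int ENC_BYTES_20MS = 38;
    static final int PCM_BYTES_20MS = 320;

    // Decoded (PCM) output length in bytes for an encoded input, 30 ms mode.
    static int decodedLength30(int encodedLengthInBytes) {
        return (encodedLengthInBytes / ENC_BYTES_30MS) * PCM_BYTES_30MS;
    }

    // Duration in milliseconds of raw 16-bit, 8000 Hz mono data.
    static long pcmDurationMillis(int pcmLengthInBytes) {
        long samples = pcmLengthInBytes / 2;  // 2 bytes per sample
        return samples * 1000 / 8000;         // 8000 samples per second
    }
}

So decodedLength30(50) gives 480, and pcmDurationMillis(480) gives 30,
as expected.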

> I also have the same problem as with speex: how to compute the
> duration in milliseconds of the media based on the given length of
> the data.

As you can see from the iLBC example, it all depends on the
specifications. If you have some document about speex at hand, I'll
check right away. Otherwise I'll dig around the web to find the
answers =)

Cheers,
Jean



#3

Hi Jean,
thanks for the quick answer; I think it will be very helpful, and it answers my questions :))
Actually, the problem with speex is not the duration issue I mentioned.
I'm testing with Asterisk, and it seems that when encoding speech, Asterisk sends RTP packets containing more than one frame.
The RFC says the decoder must detect and handle such data if it is passed to it, but it seems jspeex doesn't.
I cannot get the number of frames in a packet. The RFC says there is no point in putting such information in the RTP packet, since the decoder must detect it.
Anyway, thanks again. I will struggle a little more with speex and then start on iLBC. I will write about my progress.

damencho



#4

Hi again,

> I'm testing with Asterisk, and it seems that when encoding speech, Asterisk sends RTP packets containing more than one frame.

Makes sense if you want to use bandwidth more efficiently, although you raise the latency at the same time.

> The RFC says the decoder must detect and handle such data if it is passed to it, but it seems jspeex doesn't.

:frowning:

> I cannot get the number of frames in a packet. The RFC says there is no point in putting such information in the RTP packet, since the decoder must detect it.

Actually, this makes sense too. I can imagine that the size of a frame must be fixed, and that, as such, it is possible to guess the number of frames from the payload size. Then again, I might have missed something. Let me think about that with the RFCs. I'll be back =)
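
Concretely, if the frame size really is fixed, the guess would be
nothing more than this (a sketch, assuming a fixed narrowband mode
where every 20 ms frame encodes to the same number of bytes, e.g.
20 bytes in the 8 kbps mode; jspeex may well offer something better):

// Sketch of the guess: payload size divided by the (assumed fixed)
// encoded frame size gives the frame count, hence the duration.
static int guessFrameCount(int payloadLengthInBytes, int bytesPerFrame) {
    return payloadLengthInBytes / bytesPerFrame;
}

static long guessDurationMillis(int payloadLengthInBytes, int bytesPerFrame) {
    return guessFrameCount(payloadLengthInBytes, bytesPerFrame) * 20L; // 20 ms frames
}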

Jean



#5

Hi,

Here is my progress on the speex codec implementation.
It is recommended to put one frame of speex data in an RTP packet;
1 frame = 160 samples = 20 ms.
I found some code that counts the samples in a given chunk of encoded data.
There are two situations when receiving media:
1. The received data is 160 samples, and it is decoded OK.
2. The received data has a varying number of samples (160, 320, 480). These are processed by the decoder, but the sound is garbage. As I read in various documents, the decoder must handle this, but it seems it doesn't.
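
For what it's worth, here is the kind of per-frame handling I would
expect (a sketch; decodeFrame() is a made-up stand-in, not the actual
jspeex call):

// Sketch: decode a payload that may contain several 20 ms frames, one
// frame at a time. bytesPerFrame is fixed by the negotiated mode;
// decodeFrame() is a placeholder for the real speex decoder call.
static short[] decodeAll(byte[] payload, int bytesPerFrame) {
    int frames = payload.length / bytesPerFrame;
    short[] pcm = new short[frames * 160];          // 160 samples per frame
    for (int i = 0; i < frames; i++) {
        short[] one = decodeFrame(payload, i * bytesPerFrame, bytesPerFrame);
        System.arraycopy(one, 0, pcm, i * 160, 160);
    }
    return pcm;
}

// Placeholder: decode exactly one frame with the real decoder here.
static short[] decodeFrame(byte[] buf, int offset, int length) {
    return new short[160];
}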
I have written to the jspeex forum and I'm waiting for a response.
That's it for now. I'm starting on iLBC right now; I hope it will go OK :))

damencho



#6

Hi Jean,

The iLBC decoder works fine :slight_smile: now I'm struggling with the encoder. JMF passes byte buffers with a length of about 2000.
I'm trying to process them in portions. As I understood from your previous mail, the encoder takes 480 bytes (30 ms mode) and
returns the encoded data in 50 bytes. Am I right?

damencho



#7

Hi damencho,

> The iLBC decoder works fine :slight_smile: now I'm struggling with the encoder.

Glad to see it's not just a case of *works for me* :smiley:

> JMF passes byte buffers with a length of about 2000.

Wow... can't we lower this a bit? Because at 8 kHz, 2000 bytes make
1000 samples (i.e. 125 ms), which means almost as much latency. Anyway,
let's get it working first.

> I'm trying to process them in portions. As I understood from your
> previous mail, the encoder takes 480 bytes (30 ms mode) and

yes

> returns the encoded data in 50 bytes. Am I right?

yes!

A small problem you might have is when sizeof(jmf_byte_buffer) % 480 != 0.
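
One way to handle that case is to carry the tail bytes over to the next
process() call, along these lines (just a sketch of the idea, with
made-up names, not the actual plugin code):

// Sketch: feed the encoder whole 480-byte (30 ms) frames only, keeping
// any remainder until the next call. Made-up names for illustration.
public class FrameChunker {
    private static final int FRAME = 480;        // 240 samples, 30 ms mode
    private final byte[] pending = new byte[FRAME];
    private int pendingLen = 0;

    // Accepts buffers of any size, e.g. JMF's ~2000-byte ones.
    public void feed(byte[] data, int offset, int length) {
        int pos = offset, end = offset + length;
        while (pos < end) {
            int n = Math.min(FRAME - pendingLen, end - pos);
            System.arraycopy(data, pos, pending, pendingLen, n);
            pendingLen += n;
            pos += n;
            if (pendingLen == FRAME) {
                encodeFrame(pending);            // one 30 ms frame -> 50 bytes
                pendingLen = 0;
            }
        }
    }

    private void encodeFrame(byte[] frame) {
        // hand the complete 480-byte frame to the iLBC encoder here
    }
}

With a ~2000-byte JMF buffer you would encode four whole frames (1920
bytes) and keep the remaining 80 bytes for the next call.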

To all other people (if you've read this far... Emil? Are you there?),
I'd also like to have a small discussion about latency, though I don't
know when is the right time to start that thread. Low latency is a very
important feature, and it's easier to implement it properly from the
start, even if that means some overhead right now. Of course, I'll have
more information to contribute to this issue when the native ALSA
source is finished... All right, I'm late. Feel free to hit me (that's
why I'm living so far away)...

Cheers,
Jean



#8

Jean Lorchat wrote:

> A small problem you might have is when sizeof(jmf_byte_buffer) % 480 != 0.

I've tried to handle this, but every time I try, the sound is garbage. So I tried another experiment: not processing the bytes from the buffer, but bytes that
come in portions of 480 bytes and had already been processed by the decoder (like an echo application). But that way the sound is also not OK. Could this be something in the encoder?

damencho



#9

Hi Jean,

Jean Lorchat wrote:

> To all other people (if you've read this far... Emil? Are you
> there?), I'd also like to have a small discussion about latency,
> though I don't know when is the right time to start that thread. Low
> latency is a very important feature, and it's easier to implement it
> properly from the start, even if that means some overhead right now.
> Of course, I'll have more information to contribute to this issue
> when the native ALSA source is finished... All right, I'm late. Feel
> free to hit me (that's why I'm living so far away)...

I couldn't agree more. Latency is crucial, and even more so for us, given all the java-is-too-slow-for-voip comments that we're bound to be getting.

Latency could be coming from any of the following: capture, encoding/decoding, network streaming, and playback. I have never seen an official study of the impact of any of these in JMF, and therefore in SIP Communicator. From my experience, however, capture (and possibly playback) seems to be the one causing the most trouble, especially on Linux.

JMF's Windows performance pack includes a DirectSound data source, so things aren't that bad there (though they could be better). The Linux performance pack has no native data source and uses JavaSound, which I believe is the cause of much of the latency there.

Encoding and decoding are more or less OK even when not implemented natively (though, once again, I don't have anything official on how much time they take).

To summarize, I believe that a good study of the various parts of our audio system (enc/dec, capture, playback, and streaming) and their impact on latency would be a very nice thing and could give us many pointers on how best to optimize it. If I had to take a stab, however, I'd go for capture first.

WDYT?

Cheers
Emil



#10

Hi again,

> I've tried to handle this, but every time I try, the sound is
> garbage. So I tried another experiment: not processing the bytes from
> the buffer, but bytes that come in portions of 480 bytes and had
> already been processed by the decoder (like an echo application). But
> that way the sound is also not OK. Could this be something in the
> encoder?

First of all, did you try the code as a standalone application? I mean
based on local files. Since the code is based on the RFC reference
code, it can work standalone. This is how I tested it back then, and it
sounded fine. HOWEVER, I might have sent you a wrong version.

Steps you can try:

1/ convert some audio file to raw 16-bit (LE), 8000 Hz
2/ feed it to the standalone encoder/decoder application;
   it is going to produce two more files: one compressed stream
   and one decoded stream based on the compressed data
3/ listen to the decoded file; if it sounds like the source file, then
there you are. Otherwise, blame me :wink:
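
In code, the round trip looks roughly like this (a sketch only;
encodeFrame()/decodeFrame() are placeholders for the real codec calls,
and a trailing partial frame is simply dropped):

import java.io.*;

// Sketch of the standalone round trip: raw PCM in, compressed stream
// and re-decoded stream out. The codec calls are placeholders.
public class IlbcRoundTrip {
    public static void main(String[] args) throws IOException {
        FileInputStream in = new FileInputStream("input.raw");       // 16-bit LE, 8000 Hz
        FileOutputStream enc = new FileOutputStream("compressed.ilbc");
        FileOutputStream dec = new FileOutputStream("decoded.raw");
        byte[] frame = new byte[480];                    // one 30 ms frame
        while (in.read(frame) == frame.length) {         // assumes full reads from a local file
            byte[] packed = encodeFrame(frame);          // 480 bytes -> 50 bytes
            enc.write(packed);
            dec.write(decodeFrame(packed));              // 50 bytes -> 480 bytes
        }
        in.close(); enc.close(); dec.close();
    }

    // Placeholders: call the real iLBC encoder/decoder here.
    static byte[] encodeFrame(byte[] pcm)  { return new byte[50]; }
    static byte[] decodeFrame(byte[] ilbc) { return new byte[480]; }
}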

Then, if it works that way, we will make it work the other way. Can
you please describe what is not working in more detail:
. what is the input stream (JMF from a capture device, an audio file,
JMF from the network)?
. what is the format (compressed iLBC, or 8000 Hz audio)?
. what is it you get (obviously, awful noise)?

If you want, we can discuss this at more length on ICQ/IRC.

jean



#11

Hi Emil,

> I couldn't agree more. Latency is crucial, and even more so for us, given all the java-is-too-slow-for-voip comments that we're bound to be getting.

The same could be said of C++ then. Why don't we have a full-ASM softphone? :-D

> JMF's Windows performance pack includes a DirectSound data source, so
> things aren't that bad there (though they could be better). The Linux
> performance pack has no native data source and uses JavaSound, which
> I believe is the cause of much of the latency there.

JavaSound... oh my god, I wasn't aware of that.

> To summarize, I believe that a good study of the various parts of our
> audio system (enc/dec, capture, playback, and streaming) and their
> impact on latency would be a very nice thing and could give us many
> pointers on how best to optimize it. If I had to take a stab, however,
> I'd go for capture first.

I happen to know someone working on a native data source for Linux. It is not exactly a stab, but it might just work.

> WDYT?

The latency problem is only a question of tradeoffs. We are trading latency for several things, like processing time (because it's better to handle chunks than one sample at a time), so-called efficiency (because if we have buffers and data is late, then we drain the buffer), and so on.

While working on the data source, I am considering the question of "how much data should I read at once?". Of course, reading byte by byte is not possible, because you can't transfer 2 bytes from the sound card 8000 times per second. So I thought it could be nice to reach a granularity that allows operating properly with all the codecs. From the iLBC case, I figured that a 10 ms buffer might be interesting. As you know, I still have some buffer overrun issues because... well, because.
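
For reference, the arithmetic behind that granularity (assuming 16-bit mono at 8000 Hz):

// The arithmetic for a 10 ms read granularity (16-bit mono, 8000 Hz).
class BufferMath {
    static final int SAMPLE_RATE      = 8000; // samples per second
    static final int BYTES_PER_SAMPLE = 2;    // 16-bit
    static final int BUFFER_MILLIS    = 10;

    static final int SAMPLES_PER_BUFFER = SAMPLE_RATE * BUFFER_MILLIS / 1000;    // 80
    static final int BYTES_PER_BUFFER   = SAMPLES_PER_BUFFER * BYTES_PER_SAMPLE; // 160
}

Two such reads make one 20 ms frame and three make one 30 ms frame, so both iLBC modes (and speex's 20 ms frames) divide evenly.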

But the principle is sound, and as soon as the data source is working, I can have a look at something else. Of course, encoding/decoding might be an issue (the C version of iLBC runs 2~3 times faster than the Java version), but I suspect that if JavaSound has poor recording performance, I don't see why it should have good playback performance ;-).

jean



#12

Jean Lorchat wrote:

> First of all, did you try the code as a standalone application? I
> mean based on local files. Since the code is based on the RFC
> reference code, it can work standalone. This is how I tested it back
> then, and it sounded fine. HOWEVER, I might have sent you a wrong
> version.

Yes, I've tried it standalone and the result is the same.

> Steps you can try:
>
> 1/ convert some audio file to raw 16-bit (LE), 8000 Hz

I've recorded some of the incoming data after decoding. I've played that data in the standalone app and it's OK.

> 2/ feed it to the standalone encoder/decoder application;
>    it is going to produce two more files: one compressed stream
>    and one decoded stream based on the compressed data
> 3/ listen to the decoded file; if it sounds like the source file, then
> there you are. Otherwise, blame me :wink:

I've encoded the data, then decoded and played it, and the result is the same as within JMF and SIP Communicator, all of this in the standalone app.

> Then, if it works that way, we will make it work the other way. Can
> you please describe what is not working in more detail:
> . what is the input stream (JMF from a capture device, an audio file,
> JMF from the network)?

A JMF capture device.

> . what is the format (compressed iLBC, or 8000 Hz audio)?

iLBC, sent to Asterisk.

> . what is it you get (obviously, awful noise)?

Awful noise.

Maybe the version I have is wrong? Can you send me the latest one? :))
Thanks once again.

damencho



#13

Hey Jean,

Jean Lorchat wrote:

> I happen to know someone working on a native data source for Linux.

Oh, do I know the guy?

> The latency problem is only a question of tradeoffs. We are trading latency for several things, like processing time (because it's better to handle chunks than one sample at a time), so-called efficiency (because if we have buffers and data is late, then we drain the buffer), and so on.
>
> While working on the data source, I am considering the question of "how much data should I read at once?". Of course, reading byte by byte is not possible, because you can't transfer 2 bytes from the sound card 8000 times per second. So I thought it could be nice to reach a granularity that allows operating properly with all the codecs. From the iLBC case, I figured that a 10 ms buffer might be interesting. As you know, I still have some buffer overrun issues because... well, because.
>
> But the principle is sound, and as soon as the data source is working, I can have a look at something else.

I am looking forward to that moment :slight_smile:

> Of course, encoding/decoding might be an issue (the C version of iLBC runs 2~3 times faster than the Java version)

What exactly does 2-3 times mean? Was it something like 5% CPU usage for the native version and 15% for the Java version, or was it rather 30% and 90%?

> but I suspect that if JavaSound has poor recording performance, I don't see why it should have good playback performance ;-).

You are right. Some of my scarce testing involved capturing audio locally, then streaming it to a hardware phone while also playing it locally. I distinctly remember hearing my voice on the phone before it played on the local machine...

Emil

···

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net


#14

Hi,

> Oh, do I know the guy?

I don't think so :-p

> What exactly does 2-3 times mean? Was it something like 5% CPU usage for the native version and 15% for the Java version, or was it rather 30% and 90%?

At that time, I didn't bother to measure CPU usage. So "2~3 times faster" means that my sample audio file was encoded in half a minute with the C version and in 1~1.5 minutes with the Java version. Still much faster than realtime anyway.

> You are right. Some of my scarce testing involved capturing audio locally, then streaming it to a hardware phone while also playing it locally. I distinctly remember hearing my voice on the phone before it played on the local machine...

I hate that kind of situation. It's exactly what made people (including me) think that Java sucks. Although it looks like that is not the case :wink:

Jean
