[sip-comm-dev] Re: svn commit: r7627 - trunk/src/net/java/sip/communicator/util/Html2Text.java


#1

Hi,
I may be wrong because it seems that this class was already in use,
but this class is not threadsafe, if the
extractText(String html) method is called twice from different threads
at the same time it can result in troubles since the StringBuilder
which contains the data is shared. Are we sure that it cannot happens
?

Matthieu

···

2010/8/21 <lubomir_m@dev.java.net>:

Author: lubomir_m
Date: 2010-08-20 23:04:04+0000
New Revision: 7627

Modified:
trunk/src/net/java/sip/communicator/util/Html2Text.java

Log:
Fixes a memory leak in Html2Text which used to keep the plain text extracted from a specific HTML text after it was no longer necessary.

Modified: trunk/src/net/java/sip/communicator/util/Html2Text.java
Url: https://sip-communicator.dev.java.net/source/browse/sip-communicator/trunk/src/net/java/sip/communicator/util/Html2Text.java?view=diff&rev=7627&p1=trunk/src/net/java/sip/communicator/util/Html2Text.java&p2=trunk/src/net/java/sip/communicator/util/Html2Text.java&r1=7626&r2=7627

--- trunk/src/net/java/sip/communicator/util/Html2Text.java (original)
+++ trunk/src/net/java/sip/communicator/util/Html2Text.java 2010-08-20 23:04:04+0000
@@ -4,7 +4,6 @@
* Distributable under LGPL license.
* See terms of license at gnu.org.
*/
-
package net.java.sip.communicator.util;

import java.io.*;
@@ -13,25 +12,34 @@
import javax.swing.text.html.parser.*;

/**
- * A utility class that allows to extract the text content of an html page
+ * A utility class that allows to extract the text content of an HTML page
* stripped from all formatting tags.
*
* @author Emil Ivov <emcho at sip-communicator.org>
* @author Yana Stamcheva
+ * @author Lubomir Marinov
*/
public class Html2Text
{
+ /**
+ * The <tt>Logger</tt> used by the <tt>Html2Text</tt> class for logging
+ * output.
+ */
private static final Logger logger
= Logger.getLogger(Html2Text.class);

- private static HTMLParserCallBack parser;
+ /**
+ * The HTML parser used by {@link #extractText(String)} to parse HTML so
+ * that plain text can be extracted from it.
+ */
+ private static HTMLParserCallback parser;

/\*\*

- * A utility method that allows to extract the text content of an html page
+ * A utility method that allows to extract the text content of an HTML page
* stripped from all formatting tags. Method is synchronized to avoid
- * concurrent access to the underlying html editor kit.
+ * concurrent access to the underlying <tt>HTMLEditorKit</tt>.
*
- * @param html the html string that we will extract the text from.
+ * @param html the HTML string that we will extract the text from.
* @return the text content of the <tt>html</tt> parameter.
*/
public static synchronized String extractText(String html)
@@ -40,59 +48,86 @@
return null;

    if \(parser == null\)

- parser = new HTMLParserCallBack();
+ parser = new HTMLParserCallback();

    try
    \{
        StringReader in = new StringReader\(html\);

- parser.parse(in);
- in.close();

- return parser.getText();
+ try
+ {
+ return parser.parse(in);
+ }
+ finally
+ {
+ in.close();
+ }
}
- catch (Exception exc)
+ catch (Exception ex)
{
if (logger.isInfoEnabled())
- logger.info("Failed to extract plain text from html="+html, exc);
+ logger.info("Failed to extract plain text from html="+html, ex);
return html;
}
}

/\*\*

- * The ParserCallback that will parse the html.
+ * The ParserCallback that will parse the HTML.
*/
- private static class HTMLParserCallBack extends HTMLEditorKit.ParserCallback
+ private static class HTMLParserCallback
+ extends HTMLEditorKit.ParserCallback
{
- StringBuffer s;
+ /**
+ * The <tt>StringBuilder</tt> which accumulates the parsed text while it
+ * is being parsed.
+ */
+ private StringBuilder sb;

    /\*\*
     \* Parses the text contained in the given reader\.
     \*
     \* @param in the reader to parse\.

+ * @return the parsed text
* @throws IOException thrown if we fail to parse the reader.
*/
- public void parse(Reader in) throws IOException
+ public String parse(Reader in)
+ throws IOException
{
- s = new StringBuffer();
- ParserDelegator delegator = new ParserDelegator();
- // the third parameter is TRUE to ignore charset directive
- delegator.parse(in, this, Boolean.TRUE);
+ sb = new StringBuilder();
+
+ String s;
+
+ try
+ {
+ new ParserDelegator().parse(in, this, /* ignoreCharSet */ true);
+ s = sb.toString();
+ }
+ finally
+ {
+ /*
+ * Since the Html2Text class keeps this instance in a static
+ * reference, the field sb should be reset to null as soon as
+ * completing its goad in order to avoid keeping the parsed
+ * text in memory after it is no longer needed i.e. to prevent
+ * a memory leak. This method has been converted to return the
+ * parsed string instead of having a separate getter method for
+ * the parsed string for the same purpose.
+ */
+ sb = null;
+ }
+ return s;
}

    /\*\*
     \* Appends the given text to the string buffer\.

+ *
+ * @param text
+ * @param pos
*/
+ @Override
public void handleText(char[] text, int pos)
{
- s.append(text);
- }
-
- /**
- * Returns the parsed text.
- */
- public String getText()
- {
- return s.toString();
+ sb.append(text);
}
}
}

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: commits-help@sip-communicator.dev.java.net

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net


#2

Hey Matthieu,

Nice catch! Want to take care of it or should it log an issue?

Cheers,
Emil

На 25.08.10 15:22, Matthieu Casanova написа:

···

Hi,
I may be wrong because it seems that this class was already in use,
but this class is not threadsafe, if the
extractText(String html) method is called twice from different threads
at the same time it can result in troubles since the StringBuilder
which contains the data is shared. Are we sure that it cannot happens
?

Matthieu

2010/8/21 <lubomir_m@dev.java.net>:

Author: lubomir_m
Date: 2010-08-20 23:04:04+0000
New Revision: 7627

Modified:
  trunk/src/net/java/sip/communicator/util/Html2Text.java

Log:
Fixes a memory leak in Html2Text which used to keep the plain text extracted from a specific HTML text after it was no longer necessary.

Modified: trunk/src/net/java/sip/communicator/util/Html2Text.java
Url: https://sip-communicator.dev.java.net/source/browse/sip-communicator/trunk/src/net/java/sip/communicator/util/Html2Text.java?view=diff&rev=7627&p1=trunk/src/net/java/sip/communicator/util/Html2Text.java&p2=trunk/src/net/java/sip/communicator/util/Html2Text.java&r1=7626&r2=7627

--- trunk/src/net/java/sip/communicator/util/Html2Text.java (original)
+++ trunk/src/net/java/sip/communicator/util/Html2Text.java 2010-08-20 23:04:04+0000
@@ -4,7 +4,6 @@
* Distributable under LGPL license.
* See terms of license at gnu.org.
*/
-
package net.java.sip.communicator.util;

import java.io.*;
@@ -13,25 +12,34 @@
import javax.swing.text.html.parser.*;

/**
- * A utility class that allows to extract the text content of an html page
+ * A utility class that allows to extract the text content of an HTML page
* stripped from all formatting tags.
*
* @author Emil Ivov <emcho at sip-communicator.org>
* @author Yana Stamcheva
+ * @author Lubomir Marinov
*/
public class Html2Text
{
+ /**
+ * The <tt>Logger</tt> used by the <tt>Html2Text</tt> class for logging
+ * output.
+ */
    private static final Logger logger
        = Logger.getLogger(Html2Text.class);

- private static HTMLParserCallBack parser;
+ /**
+ * The HTML parser used by {@link #extractText(String)} to parse HTML so
+ * that plain text can be extracted from it.
+ */
+ private static HTMLParserCallback parser;

    /**
- * A utility method that allows to extract the text content of an html page
+ * A utility method that allows to extract the text content of an HTML page
     * stripped from all formatting tags. Method is synchronized to avoid
- * concurrent access to the underlying html editor kit.
+ * concurrent access to the underlying <tt>HTMLEditorKit</tt>.
     *
- * @param html the html string that we will extract the text from.
+ * @param html the HTML string that we will extract the text from.
     * @return the text content of the <tt>html</tt> parameter.
     */
    public static synchronized String extractText(String html)
@@ -40,59 +48,86 @@
            return null;

        if (parser == null)
- parser = new HTMLParserCallBack();
+ parser = new HTMLParserCallback();

        try
        {
            StringReader in = new StringReader(html);
- parser.parse(in);
- in.close();

- return parser.getText();
+ try
+ {
+ return parser.parse(in);
+ }
+ finally
+ {
+ in.close();
+ }
        }
- catch (Exception exc)
+ catch (Exception ex)
        {
            if (logger.isInfoEnabled())
- logger.info("Failed to extract plain text from html="+html, exc);
+ logger.info("Failed to extract plain text from html="+html, ex);
            return html;
        }
    }

    /**
- * The ParserCallback that will parse the html.
+ * The ParserCallback that will parse the HTML.
     */
- private static class HTMLParserCallBack extends HTMLEditorKit.ParserCallback
+ private static class HTMLParserCallback
+ extends HTMLEditorKit.ParserCallback
    {
- StringBuffer s;
+ /**
+ * The <tt>StringBuilder</tt> which accumulates the parsed text while it
+ * is being parsed.
+ */
+ private StringBuilder sb;

        /**
         * Parses the text contained in the given reader.
         *
         * @param in the reader to parse.
+ * @return the parsed text
         * @throws IOException thrown if we fail to parse the reader.
         */
- public void parse(Reader in) throws IOException
+ public String parse(Reader in)
+ throws IOException
        {
- s = new StringBuffer();
- ParserDelegator delegator = new ParserDelegator();
- // the third parameter is TRUE to ignore charset directive
- delegator.parse(in, this, Boolean.TRUE);
+ sb = new StringBuilder();
+
+ String s;
+
+ try
+ {
+ new ParserDelegator().parse(in, this, /* ignoreCharSet */ true);
+ s = sb.toString();
+ }
+ finally
+ {
+ /*
+ * Since the Html2Text class keeps this instance in a static
+ * reference, the field sb should be reset to null as soon as
+ * completing its goad in order to avoid keeping the parsed
+ * text in memory after it is no longer needed i.e. to prevent
+ * a memory leak. This method has been converted to return the
+ * parsed string instead of having a separate getter method for
+ * the parsed string for the same purpose.
+ */
+ sb = null;
+ }
+ return s;
        }

        /**
         * Appends the given text to the string buffer.
+ *
+ * @param text
+ * @param pos
         */
+ @Override
        public void handleText(char[] text, int pos)
        {
- s.append(text);
- }
-
- /**
- * Returns the parsed text.
- */
- public String getText()
- {
- return s.toString();
+ sb.append(text);
        }
    }
}

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: commits-help@sip-communicator.dev.java.net

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net

--
Emil Ivov, Ph.D. 67000 Strasbourg,
Project Lead France
SIP Communicator
emcho@sip-communicator.org PHONE: +33.1.77.62.43.30
http://sip-communicator.org FAX: +33.1.77.62.47.31

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net


#3

Html2Text.extractText(String) is synchronized so that no two threads
can execute it at one and the same time. Could you please clarify?

···

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net


#4

ok, I'll fix that

Matthieu

···

On Thu, Aug 26, 2010 at 2:53 PM, Emil Ivov <emcho@sip-communicator.org> wrote:

Hey Matthieu,

Nice catch! Want to take care of it or should it log an issue?

Cheers,
Emil

На 25.08.10 15:22, Matthieu Casanova написа:

Hi,
I may be wrong because it seems that this class was already in use,
but this class is not threadsafe, if the
extractText(String html) method is called twice from different threads
at the same time it can result in troubles since the StringBuilder
which contains the data is shared. Are we sure that it cannot happens
?

Matthieu

2010/8/21 <lubomir_m@dev.java.net>:

Author: lubomir_m
Date: 2010-08-20 23:04:04+0000
New Revision: 7627

Modified:
trunk/src/net/java/sip/communicator/util/Html2Text.java

Log:
Fixes a memory leak in Html2Text which used to keep the plain text extracted from a specific HTML text after it was no longer necessary.

Modified: trunk/src/net/java/sip/communicator/util/Html2Text.java
Url: https://sip-communicator.dev.java.net/source/browse/sip-communicator/trunk/src/net/java/sip/communicator/util/Html2Text.java?view=diff&rev=7627&p1=trunk/src/net/java/sip/communicator/util/Html2Text.java&p2=trunk/src/net/java/sip/communicator/util/Html2Text.java&r1=7626&r2=7627

--- trunk/src/net/java/sip/communicator/util/Html2Text.java (original)
+++ trunk/src/net/java/sip/communicator/util/Html2Text.java 2010-08-20 23:04:04+0000
@@ -4,7 +4,6 @@
* Distributable under LGPL license.
* See terms of license at gnu.org.
*/
-
package net.java.sip.communicator.util;

import java.io.*;
@@ -13,25 +12,34 @@
import javax.swing.text.html.parser.*;

/**
- * A utility class that allows to extract the text content of an html page
+ * A utility class that allows to extract the text content of an HTML page
* stripped from all formatting tags.
*
* @author Emil Ivov <emcho at sip-communicator.org>
* @author Yana Stamcheva
+ * @author Lubomir Marinov
*/
public class Html2Text
{
+ /**
+ * The <tt>Logger</tt> used by the <tt>Html2Text</tt> class for logging
+ * output.
+ */
private static final Logger logger
= Logger.getLogger(Html2Text.class);

- private static HTMLParserCallBack parser;
+ /**
+ * The HTML parser used by {@link #extractText(String)} to parse HTML so
+ * that plain text can be extracted from it.
+ */
+ private static HTMLParserCallback parser;

/\*\*

- * A utility method that allows to extract the text content of an html page
+ * A utility method that allows to extract the text content of an HTML page
* stripped from all formatting tags. Method is synchronized to avoid
- * concurrent access to the underlying html editor kit.
+ * concurrent access to the underlying <tt>HTMLEditorKit</tt>.
*
- * @param html the html string that we will extract the text from.
+ * @param html the HTML string that we will extract the text from.
* @return the text content of the <tt>html</tt> parameter.
*/
public static synchronized String extractText(String html)
@@ -40,59 +48,86 @@
return null;

    if \(parser == null\)

- parser = new HTMLParserCallBack();
+ parser = new HTMLParserCallback();

    try
    \{
        StringReader in = new StringReader\(html\);

- parser.parse(in);
- in.close();

- return parser.getText();
+ try
+ {
+ return parser.parse(in);
+ }
+ finally
+ {
+ in.close();
+ }
}
- catch (Exception exc)
+ catch (Exception ex)
{
if (logger.isInfoEnabled())
- logger.info("Failed to extract plain text from html="+html, exc);
+ logger.info("Failed to extract plain text from html="+html, ex);
return html;
}
}

/\*\*

- * The ParserCallback that will parse the html.
+ * The ParserCallback that will parse the HTML.
*/
- private static class HTMLParserCallBack extends HTMLEditorKit.ParserCallback
+ private static class HTMLParserCallback
+ extends HTMLEditorKit.ParserCallback
{
- StringBuffer s;
+ /**
+ * The <tt>StringBuilder</tt> which accumulates the parsed text while it
+ * is being parsed.
+ */
+ private StringBuilder sb;

    /\*\*
     \* Parses the text contained in the given reader\.
     \*
     \* @param in the reader to parse\.

+ * @return the parsed text
* @throws IOException thrown if we fail to parse the reader.
*/
- public void parse(Reader in) throws IOException
+ public String parse(Reader in)
+ throws IOException
{
- s = new StringBuffer();
- ParserDelegator delegator = new ParserDelegator();
- // the third parameter is TRUE to ignore charset directive
- delegator.parse(in, this, Boolean.TRUE);
+ sb = new StringBuilder();
+
+ String s;
+
+ try
+ {
+ new ParserDelegator().parse(in, this, /* ignoreCharSet */ true);
+ s = sb.toString();
+ }
+ finally
+ {
+ /*
+ * Since the Html2Text class keeps this instance in a static
+ * reference, the field sb should be reset to null as soon as
+ * completing its goad in order to avoid keeping the parsed
+ * text in memory after it is no longer needed i.e. to prevent
+ * a memory leak. This method has been converted to return the
+ * parsed string instead of having a separate getter method for
+ * the parsed string for the same purpose.
+ */
+ sb = null;
+ }
+ return s;
}

    /\*\*
     \* Appends the given text to the string buffer\.

+ *
+ * @param text
+ * @param pos
*/
+ @Override
public void handleText(char[] text, int pos)
{
- s.append(text);
- }
-
- /**
- * Returns the parsed text.
- */
- public String getText()
- {
- return s.toString();
+ sb.append(text);
}
}
}

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: commits-help@sip-communicator.dev.java.net

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net

--
Emil Ivov, Ph.D. 67000 Strasbourg,
Project Lead France
SIP Communicator
emcho@sip-communicator.org PHONE: +33.1.77.62.43.30
http://sip-communicator.org FAX: +33.1.77.62.47.31

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net


#5

Oh you're right I missed the synchronized keyword, so I finally don't
change anything it seems good now.

Matthieu

···

On Thu, Aug 26, 2010 at 4:34 PM, Lubomir Marinov <lubomir.marinov@gmail.com> wrote:

Html2Text.extractText(String) is synchronized so that no two threads
can execute it at one and the same time. Could you please clarify?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@sip-communicator.dev.java.net
For additional commands, e-mail: dev-help@sip-communicator.dev.java.net