How to extract text from an HTML file using Python?

You can extract text from an HTML file in Python with a library such as BeautifulSoup or lxml. Here’s an example using BeautifulSoup:

from bs4 import BeautifulSoup

# open the HTML file (assumes it is UTF-8 encoded; adjust the encoding if yours differs)
with open('example.html', 'r', encoding='utf-8') as file:
    # read the contents of the file
    contents = file.read()

# create a BeautifulSoup object from the HTML content
soup = BeautifulSoup(contents, 'html.parser')

# find all the text in the HTML file
text = soup.get_text()

# print the extracted text
print(text)

In this example, we first open an HTML file named example.html using the open() function and read its contents into a variable called contents. We then create a BeautifulSoup object from the contents variable using the 'html.parser' parser.

Next, we use the get_text() method of the BeautifulSoup object to extract all the text from the HTML file. Finally, we print the extracted text using the print() function.

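You can customize the behavior of the get_text() method by passing arguments to it. For example, the separator argument sets the string placed between the extracted text fragments, and strip=True trims leading and trailing whitespace from each fragment. Note that get_text() has no argument for excluding tags. To leave out elements such as script or style, remove them from the BeautifulSoup object first, for example with decompose(), and then call get_text().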
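As a rough sketch of these options, the snippet below parses a small inline HTML string instead of example.html (the markup and the html variable are made up for illustration). It removes the script and style elements with decompose() and then calls get_text() with a newline separator and strip=True:

```python
from bs4 import BeautifulSoup

# hypothetical sample markup, used here in place of example.html
html = """
<html>
  <head><title>Demo</title><style>p { color: red; }</style></head>
  <body>
    <h1>Heading</h1>
    <p>First <b>bold</b> paragraph.</p>
    <script>console.log("not page text");</script>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# remove elements whose contents should not appear in the extracted text
for tag in soup.find_all(['script', 'style']):
    tag.decompose()

# join the text fragments with newlines and strip whitespace from each one
text = soup.get_text(separator='\n', strip=True)

# print the extracted text
print(text)
```

Because the separator is inserted between every text fragment, inline tags such as b also produce line breaks here. If that is not what you want, a single space as the separator may be a better fit.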
