You can extract text from an HTML file in Python with a library such as BeautifulSoup or lxml. Here's an example using BeautifulSoup:
from bs4 import BeautifulSoup

# open the HTML file
with open('example.html', 'r') as file:
    # read the contents of the file
    contents = file.read()

# create a BeautifulSoup object from the HTML content
soup = BeautifulSoup(contents, 'html.parser')

# find all the text in the HTML file
text = soup.get_text()

# print the extracted text
print(text)
In this example, we first open an HTML file named example.html using the open() function and read its contents into a variable called contents. We then create a BeautifulSoup object from the contents variable using the 'html.parser' parser.
Next, we use the get_text() method of the BeautifulSoup object to extract all the text from the HTML file. Finally, we print the extracted text using the print() function.
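The opening sentence also mentions lxml, so here is a rough equivalent using that library. This is only a sketch: it assumes lxml is installed, and it parses a small inline string (the html_doc variable is made up for illustration) rather than example.html so that it runs on its own.

```python
from lxml import html

# Hypothetical sample document, used instead of reading example.html
html_doc = "<html><body><h1>Title</h1><p>Hello <b>world</b>.</p></body></html>"

# Parse the string into an element tree
tree = html.fromstring(html_doc)

# text_content() returns the text of the element and all of its descendants
text = tree.text_content()
print(text)
```

Unlike get_text(), text_content() takes no separator argument, so text from adjacent elements is joined without any spacing added.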
You can customize the behavior of the get_text() method by passing arguments to it. For example, the separator argument sets the string placed between the extracted pieces of text, and strip=True trims leading and trailing whitespace from each piece. get_text() does not have an argument for excluding tags. To leave certain tags out, remove them from the parsed tree first (for example with decompose()) and then call get_text().
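As a minimal sketch of these options, the snippet below parses a small inline document (the html_doc string and the choice of tags to drop are assumptions made for illustration), removes a few tags from the tree, and then extracts the remaining text with a custom separator.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used instead of reading example.html
html_doc = """
<html>
  <head><title>Demo</title></head>
  <body>
    <nav>Home | About</nav>
    <h1>Heading</h1>
    <p>First <b>bold</b> paragraph.</p>
    <script>console.log("not content");</script>
    <footer>Copyright notice</footer>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Remove tags whose contents should not appear in the output.
# Calling soup(...) with a list of names is shorthand for soup.find_all(...).
for tag in soup(["nav", "footer", "script"]):
    tag.decompose()

# separator is placed between the text fragments; strip=True trims each
# fragment and skips fragments that contain only whitespace.
text = soup.get_text(separator=" ", strip=True)
print(text)
```

Because decompose() modifies the tree in place, parse the document again (or work on a copy) if you still need the removed elements later.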