Read a Report Format File From Python

This post will talk about how to read Word documents with Python. We're going to cover three different packages: docx2txt, docx, and my personal favorite, docx2python.

The docx2txt package

Let's talk about docx2txt first. This is a Python package that allows you to scrape text and images from Word documents. The example below reads in a Word document containing the Zen of Python. As you can see, once we've imported docx2txt, all we need is one line of code to read in the text from the Word document. We can read in the document using a function in the package called process, which takes the name of the file as input. Regular text, listed items, hyperlink text, and table text will all be returned in a single string.

    import docx2txt

    # read in word file
    result = docx2txt.process("zen_of_python.docx")
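
Because everything comes back as one string, a common next step is to break it into lines. The sketch below is only an illustration: split_nonempty_lines is a hypothetical helper (it is not part of docx2txt), and sample_text is a made-up stand-in for the string returned by docx2txt.process.

```python
def split_nonempty_lines(text):
    """Split extracted text into stripped lines, dropping blank ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical stand-in for the string returned by docx2txt.process(...)
sample_text = (
    "The Zen of Python\n\n"
    "Beautiful is better than ugly.\n\n"
    "\tExplicit is better than implicit.\n"
)

lines = split_nonempty_lines(sample_text)
print(lines)
```

With a real document, we would pass the result variable from the code above instead of sample_text.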

What if the file has images? In that case we just need a small tweak to our code. When we run the process function, we can pass an extra parameter that specifies the name of an output directory. Running docx2txt.process will extract any images in the Word document and save them into this specified folder. The text from the file will still also be extracted and stored in the result variable.

    import docx2txt

    result = docx2txt.process("zen_of_python_with_image.docx", "C:/path/to/store/files")
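
One practical detail, offered as an assumption rather than documented behavior: it seems safest to make sure the output folder exists before asking docx2txt to write into it. The sketch below uses only the standard library. ensure_dir and list_image_files are hypothetical helper names, and the set of image extensions is my own guess at what is useful.

```python
from pathlib import Path

# Assumed set of image extensions; extend it to suit your documents
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp"}


def ensure_dir(path):
    """Create the directory (and any parents) if missing; return it as a Path."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


def list_image_files(path):
    """Return the sorted names of image files found directly inside a directory."""
    return sorted(
        item.name
        for item in Path(path).iterdir()
        if item.is_file() and item.suffix.lower() in IMAGE_SUFFIXES
    )
```

We could call ensure_dir on the folder first, pass that folder as the second argument to docx2txt.process, and then call list_image_files on it to see which images were written.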

docx2txt will also scrape any text from tables. Again, this will be returned in a single string with any other text found in the document, which means this text can be more difficult to parse. Later in this post we'll talk about docx2python, which allows you to scrape tables in a more structured format.

The docx package

The source code behind docx2txt is derived from code in the docx package (distributed on PyPI as python-docx), which can also be used to scrape Word documents. docx is a powerful library for manipulating and creating Word documents, but can also (with some restrictions) read in text from Word files.

In the example below, we open a connection to our sample Word file using docx.Document. Here we just input the name of the file we want to connect to. Then, we can scrape the text from each paragraph in the file using a list comprehension in conjunction with doc.paragraphs. This will include scraping separate lines defined in the Word document for listed items. Unlike docx2txt, docx cannot scrape images from Word documents. Also, reading doc.paragraphs this way will not scrape out hyperlinks and text in tables defined in the Word document.

    import docx

    # open connection to Word Document
    doc = docx.Document("zen_of_python.docx")

    # read in each paragraph in file
    result = [p.text for p in doc.paragraphs]
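
For some intuition about what paragraph-based extraction is doing: a .docx file is a zip archive, and its main text lives in an XML part named word/document.xml. The sketch below reads paragraph text with only the standard library. It is not how docx or docx2txt are implemented, just a rough illustration. paragraphs_from_docx and make_demo_archive are hypothetical names, and the demo archive contains only that one XML part, so it is not a complete Word file. Note that, unlike doc.paragraphs, this walks every paragraph element, including ones inside tables.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

W_URI = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_URI + "}"


def paragraphs_from_docx(source):
    """Return the text of every w:p element in word/document.xml."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [
        "".join(node.text or "" for node in para.iter(W + "t"))
        for para in root.iter(W + "p")
    ]


def make_demo_archive(lines):
    """Build an in-memory zip holding a minimal word/document.xml (demo only)."""
    body = "".join(
        "<w:p><w:r><w:t>" + escape(line) + "</w:t></w:r></w:p>" for line in lines
    )
    xml = (
        '<w:document xmlns:w="' + W_URI + '"><w:body>'
        + body
        + "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


demo = paragraphs_from_docx(
    make_demo_archive(
        ["Beautiful is better than ugly.", "Explicit is better than implicit."]
    )
)
print(demo)
```

paragraphs_from_docx also accepts a file path, so it can be pointed at a real .docx file in place of the demo archive.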

The docx2python package

docx2python is another package we can use to scrape Word documents. It has some additional features beyond docx2txt and docx. For example, it is able to return the text scraped from a document in a more structured format. Let's test out our Word document with docx2python. We're going to add a simple table to the document so that we can extract that as well.

docx2python contains a function with the same name. If we call this function with the document's name as input, we get back an object with several attributes.

    from docx2python import docx2python

    # extract docx content
    doc_result = docx2python('zen_of_python.docx')

Each attribute provides either text or information from the file. For example, consider that our file has three main components: the text containing the Zen of Python, a table, and an image. If we call doc_result.body, each of these components will be returned as separate items in a list.

    # get separate components of the document
    # (the indices depend on the order of components in the document)
    doc_result.body

    # get the text from Zen of Python
    doc_result.body[0]

    # get the image
    doc_result.body[1]

    # get the table text
    doc_result.body[2]

Scraping a word document table with docx2python

The table text result is returned as a nested list. Each row (including the header) gets returned as a separate sub-list. The 0th element of the list refers to the header, or 0th row of the table. The next element refers to the next row in the table and so on. In turn, each value in a row is returned as an individual sub-list within that row's corresponding list.
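
To make that nesting concrete, here is a small sketch that indexes into a list of the same shape. The table below is made up for illustration (it is not the output from our document), and the leading tabs are only there to mimic the kind of raw values described in this post.

```python
# Hypothetical nested list mimicking the shape described above:
# table -> rows -> cells -> strings (values are invented for illustration)
table = [
    [["\tLanguage"], ["\tYear"]],  # row 0: the header
    [["\tPython"], ["\t1991"]],    # row 1
    [["\tR"], ["\t1993"]],         # row 2
]

header_row = table[0]            # a list of cells
first_data_row = table[1]        # the next row in the table
first_cell = first_data_row[0]   # one cell: a list holding a single string
first_value = first_cell[0]      # the string itself

n_rows = len(table)
n_cols = len(header_row)
print(n_rows, n_cols, repr(first_value))
```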

We can convert this result into a tabular format using pandas. The code below takes the table from doc_result.body[1]; adjust that index to wherever the table sits in your document. The data frame is still a little messy: each cell in the data frame is a list containing a single value. This value also has quite a few "\t"s (which represent tab characters).

    import pandas as pd

    pd.DataFrame(doc_result.body[1][1:])

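Here, we use the applymap method to apply the lambda function below to every cell in the data frame. This function gets the individual value within the list in each cell and strips the leading and trailing "\t" characters from it. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which takes the same function.) We store the result in df so that we can keep working with it.

    import pandas as pd

    df = pd.DataFrame(doc_result.body[1][1:]).applymap(lambda val: val[0].strip("\t"))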

Next, let's change the column headers to what we see in the Word file (which was also returned to us in doc_result.body).

    df.columns = [val[0].strip("\t") for val in doc_result.body[1][0]]
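
As an alternative sketch, the same cleanup can be done on the nested list itself before pandas gets involved. table_to_records is a hypothetical helper (not part of docx2python or pandas), and the sample table is invented. It assumes each cell is a list whose first item is the value we want, as in the examples above.

```python
def table_to_records(table):
    """Turn a table -> rows -> cells -> strings list into dicts keyed by the header row."""

    def cell_text(cell):
        # Take the first string in the cell and trim surrounding tabs
        return cell[0].strip("\t") if cell else ""

    header = [cell_text(cell) for cell in table[0]]
    return [
        dict(zip(header, (cell_text(cell) for cell in row)))
        for row in table[1:]
    ]


# Hypothetical table in the same shape as doc_result.body[1]
sample_table = [
    [["\tLanguage"], ["\tYear"]],
    [["\tPython"], ["\t1991"]],
    [["\tR"], ["\t1993"]],
]

records = table_to_records(sample_table)
print(records)
```

A list of dictionaries like this can be passed straight to pd.DataFrame, which uses the keys as column headers.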

Extracting images

We can extract the Word file's images using the images attribute of our doc_result object. doc_result.images consists of a dictionary where the keys are the names of the image files (not automatically written to disk) and the corresponding values are the image files in binary format.

    type(doc_result.images)  # dict

    doc_result.images.keys()  # dict_keys(['image1.png'])

We can write the binary-formatted image out to a physical file like this:

    for key, val in doc_result.images.items():
        # write each image's bytes to a file named after its key
        with open(key, "wb") as f:
            f.write(val)

Above we're just looping through the keys (image file names) and values (binary images) in the dictionary and writing each out to file. In this case, we only have one image in the document, so we only get one written out.
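
If we would rather collect the images in their own folder, the loop can be wrapped in a small function. This is a minimal sketch using only the standard library. save_images is a hypothetical helper, and the dictionary in the usage note is a made-up stand-in for doc_result.images.

```python
from pathlib import Path


def save_images(images, out_dir):
    """Write a {file name: bytes} dictionary into out_dir; return the written paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, data in images.items():
        # Keep only the file-name part so everything lands inside out_dir
        path = target / Path(name).name
        path.write_bytes(data)
        written.append(path)
    return written
```

For example, save_images({"image1.png": b"..."}, "extracted_images") would create the extracted_images folder if needed and write image1.png into it. With a real document we would pass doc_result.images as the first argument.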

Other attributes

The docx2python result has several other attributes we can use to extract text or data from the file. For instance, if we want to just get all of the file's text in a single string (similar to docx2txt) we can run doc_result.text.

    # get all text in a single string
    doc_result.text

In addition to text, we can also get metadata about the file using the properties attribute. This returns information such as the creator of the document, the created / last modified dates, and number of revisions.

    doc_result.properties

If the document you're scraping has headers and footers, you can also scrape those out like this (note the singular version of "header" and "footer"):

    # get the headers
    doc_result.header

    # get the footers
    doc_result.footer

Footnotes can also be extracted like this:

    doc_result.footnotes
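
Assuming these attributes come back as nested lists like doc_result.body (an assumption on my part, so check it against your own output), a small recursive helper can pull out just the strings. iter_text is a hypothetical name, and the sample data is invented.

```python
def iter_text(nested):
    """Yield every non-empty, stripped string found in an arbitrarily nested list."""
    if isinstance(nested, str):
        text = nested.strip()
        if text:
            yield text
    else:
        for item in nested:
            yield from iter_text(item)


# Hypothetical nested result, standing in for something like doc_result.header
sample_nested = [[[["Page header text"]]], [[["", "\tSecond line"]]]]

flat = list(iter_text(sample_nested))
print(flat)
```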

Getting HTML returned with docx2python

We can also ask the docx2python function to return HTML, which supports a few types of tags including font (size and color), italics, bold, and underlined text. We just need to specify the parameter html=True. In our example document, The Zen of Python appears in bold and underlined print, and the HTML version of the text marks it up with the corresponding tags. The HTML feature does not currently support table-related tags, so I would recommend using the method we went through above if you're looking to scrape tables from Word documents.

    doc_html_result = docx2python('zen_of_python.docx', html=True)

Hope you enjoyed this post!

Source: http://theautomatic.net/2019/10/14/how-to-read-word-documents-with-python/
