Tabula PythonAn IntroductionGenerally, it is not necessary that the data we use available in CSV or JSON format. The data can be stored in the form of a table in a PDF file. As a most straightforward case, we can copy and paste the table into a spreadsheet or a text editor. But it may also be that we can more than one table in the same PDF that has similar structures. For such cases, we have to copy and paste each of these tables separately, which makes the work tedious. However, to cut this dreary work, Python provides an open-source library, also known as tabula-py, that allows users to extract more than one table distinctly. In the following tutorial, we will learn about tabula and their functions. What is Tabula?Tabular is a basic wrapper of tabula-java that allows users to the extraction of the table and converts the PDF file directly into Data frames or JSON using Python Programming language. The user can also extract tables from PDF and convert them into TSV, CSV, or JSON format files. Tabula is a tool based on Graphical User Interface (GUI) Application; however, tabula-java is a tool based on Command-Line User Interface (CUI). tabula-java provides the bindings of Ruby, R, and NodeJS but not for Python. Thus, the developers introduced the concept of tabula-py that provides Python binding. Now, let us understand who uses Tabula and how we can install it. Who utilizes Tabula?Tabula is a powerful tool that is being used by News organizations of all sizes in order to power investigative reporting. These News organizations are The Times of London, ProPublica, Foreign Policy, The New York Times, La Nacion (Argentina), and St. Paul (MN) Pioneer Press. There are Grassroots Organizations such as SchoolCuts.org that also depend upon Tabula in order to convert clunky documents into human-friendly public resources. Apart from the above, there are researchers from other backgrounds who utilize Tabula for turning their PDF reports into Excel Spreadsheets, CSVs, and JSON format files and use it for the purpose of Analysing and Database Applications. Implementation of Tabula in PythonOnce we have discussed a bit Tabula, let us understand its implementation in Python. Installation of the library Since tabula-py is an open-source library of Python, we will use the pip installer in order to install the library. Importing of the library Once the installation is completed, we can verify it by simply importing the library as shown below: In case the program returns an importing error, it is recommended to reinstall the package. The tabula-py library provides various functions such as reading a PDF file, reading a table on a specific page of a PDF file, reading multiple tables on the same page of a PDF file, or Converting PDF files directly a CSV file. Let us begin with reading a PDF file Reading a PDF fileThe tabula-py library allows its users to read a PDF file using the function known as the read_pdf() function. Syntax: Parameters: filename: The filename parameter is the name of the pdf file; we would like to read the data from. Let us convert the following pdf data table into pandas Data Frame. File Name: marksheet_table.py Page: 1
Here is an example given below, that demonstrates how to extract the data from the pdf. Example: Output: Name English Physics Chemistry Biology Total 0 A 86 54 65 83 288 1 B 56 45 80 55 236 2 C 34 66 73 90 263 3 D 77 75 46 34 232 4 E 74 82 55 77 288 5 F 69 76 82 46 273 6 G 53 33 29 45 160 7 H 70 41 67 23 201 8 I 80 43 88 28 239 9 J 90 37 45 71 243 10 K 98 55 88 81 322 11 L 90 54 67 37 248 12 M 87 76 88 54 305 13 N 86 69 82 66 303 14 O 67 74 54 65 260 15 P 75 96 53 67 291 16 Q 45 87 80 45 257 17 R 44 66 49 78 237 18 S 78 39 78 80 275 19 T 56 54 77 86 273 20 U 43 90 64 77 274 21 V 95 88 66 55 304 22 W 64 67 86 80 297 23 X 82 56 45 65 248 24 Y 79 65 70 54 268 25 Z 83 54 40 75 252 Explanation: In the above example, we have imported the required library and defined a variable that stores the address of the pdf data file. We have then used the read_pdf() function to read the data from the pdf and printed it for the users. As a result, the data table has been read successfully. Note: We have used the pages parameter in the read_pdf() function to read the data from the specified page(s). Let us consider another example to print the tables from a specific page, say page number 2. Example: Output: Name Final Scores 0 A 288 1 B 236 2 C 263 3 D 232 4 E 288 5 F 273 6 G 160 7 H 201 8 I 239 9 J 243 3 D 232 4 E 288 5 F 273 6 G 160 7 H 201 8 I 239 9 J 243 10 K 322 11 L 248 12 M 305 13 N 303 14 O 260 15 P 291 16 Q 257 17 R 237 18 S 275 19 T 273 20 U 274 21 V 304 22 W 297 23 X 248 24 Y 268 25 Z 252 Explanation: In the above example, we have followed the same procedure as we did earlier. However, we have assigned the pages parameter to 2 and printed the first table of the specified page. As a result, the table of index zero on page 2 has been printed successfully. Now, let us understand what happens when there is more than one table on the same page of a PDF data file. Handling multiple tables on the same page of a PDF fileWe can handle more than one table on the same using an additional parameter known as multiple_tables. The multiple_tables parameter takes a Boolean value for which the read_pdf() function reads multiple tables as independent tables if true or reads multiple tables as a single table if false. Let us consider the following example demonstrating how to read multiple tables as independent tables. Example: Output: Name Final Scores 0 A 288 1 B 236 2 C 263 3 D 232 4 E 288 5 F 273 6 G 160 7 H 201 8 I 239 9 J 243 10 K 322 11 L 248 12 M 305 13 N 303 14 O 260 15 P 291 16 Q 257 17 R 237 18 S 275 19 T 273 20 U 274 21 V 304 22 W 297 23 X 248 24 Y 268 25 Z 252 Name Position 0 K I 1 M II 2 V III 3 N IV 4 W V Explanation: In the following example, we have again imported the required library and defined the variable that stores the address of the PDF file. We have then used the read_pdf() function and included the multiple_tables parameter setting it to True. We have then printed the multiple tables present on page 2 of the PDF file separately. Now, let us consider an example to understand how to read multiple tables as a single table. Example: Output: Name Final Scores 0 A 288 1 B 236 2 C 263 3 D 232 4 E 288 5 F 273 6 G 160 7 H 201 8 I 239 9 J 243 10 K 322 11 L 248 12 M 305 13 N 303 14 O 260 15 P 291 9 J 243 10 K 322 11 L 248 12 M 305 13 N 303 14 O 260 15 P 291 16 Q 257 17 R 237 18 S 275 19 T 273 20 U 274 21 V 304 22 W 297 23 X 248 24 Y 268 25 Z 252 26 Name Position 27 K I 28 M II 29 V III 30 N IV 31 W V Explanation: In the following example, we have now set the multiple_tables parameter to False. As a result, the tables present on page 2 are treated as a single table. Converting PDF file directly to a CSV fileWe can convert a PDF file that contains tabular data directly into a CSV file with the help of the convert_into() method in the tabula library. Syntax: Let us consider the following example illustrating the conversion of the PDF file into CSV file. Example: Output: 'pages' argument isn't specified.Will extract only from page 1 by default. The PDF file has been converted successfully. Explanation: In the above example, we have again imported the required library and defined the variable containing the address of the PDF file. We have then used the convert_into() method to convert the PDF file into the CSV file and printed a success message. Moreover, we can also observe that the program returned a statement saying that the 'pages' argument is not specified. Thus, the table present on page 1 will be extracted by default. |