[PYTHON - Library]★PDF to DataFrame★

goAhEAd_29 2024. 4. 11. 15:08

1. Goal: convert a table inside a PDF file into a DataFrame.

from tabula import read_pdf
import pandas as pd

def read_pdf_table_to_dataframe(pdf_path, page_number):
    # tabula-py only extracts tables, so make sure the page actually contains one.
    df_list = read_pdf(pdf_path, pages=page_number, multiple_tables=True)

    # read_pdf returns a list of DataFrames; concatenate them into one,
    # or fall back to an empty DataFrame when no table was found.
    df = pd.concat(df_list, ignore_index=True) if df_list else pd.DataFrame()

    return df

# Specify the path to your PDF, and the page number you want to extract the table from.
pdf_path = '젠톡_김지은_2.pdf'  # Change to your PDF file path.
page_number = '8'  # Change to your specific page number.

# Call the function and get the DataFrame.
df = read_pdf_table_to_dataframe(pdf_path, page_number)

# Now you can work with the DataFrame.
print(df.head())
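
The function above concatenates every table found on the page into one DataFrame. When a page holds several tables with different columns, that result gets padded with NaN. Below is a minimal sketch of one way around this, assuming you would rather merge only the tables that share the same column labels. The helper name group_tables_by_columns and the sample frames are made up for illustration; in practice the list would come from read_pdf.

```python
import pandas as pd


def group_tables_by_columns(df_list):
    """Concatenate only the tables that share identical column labels.

    Hypothetical helper: takes the list returned by read_pdf and returns
    one DataFrame per distinct set of column labels.
    """
    groups = {}
    for table in df_list:
        # Use the column labels (as strings) as the grouping key.
        key = tuple(map(str, table.columns))
        groups.setdefault(key, []).append(table)
    # dicts keep insertion order, so groups come back in first-seen order.
    return [pd.concat(tables, ignore_index=True) for tables in groups.values()]


# Made-up stand-ins for what read_pdf might return for one page.
sample_tables = [
    pd.DataFrame({"item": ["a"], "score": [1]}),
    pd.DataFrame({"item": ["b"], "score": [2]}),
    pd.DataFrame({"note": ["other table"]}),
]

for grouped in group_tables_by_columns(sample_tables):
    print(grouped)
```

With the real data you would pass df_list from read_pdf instead of sample_tables and pick the group you need by index.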

① Install tabula-py with pip (version 2.9.0 was used here). Do not install the package named tabula, which is a different package.

② If an error related to the missing jpype module appears, install JPype1:

pip install JPype1
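
The two installation notes above can be checked from Python before running the extraction. This is a rough sketch using only the standard library; the helper names installed_version and has_module are made up here, and the check only reports what is installed without changing anything.

```python
from importlib import metadata
from importlib.util import find_spec


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def has_module(module_name):
    """Return True if a top-level module can be located (without importing it)."""
    return find_spec(module_name) is not None


# tabula-py is the package this post relies on; tabula is the one to avoid.
print("tabula-py version:", installed_version("tabula-py"))
print("tabula version (expected None):", installed_version("tabula"))
# JPype1 is imported under the module name jpype.
print("jpype importable:", has_module("jpype"))
```

If the second line prints a version instead of None, uninstalling tabula and reinstalling tabula-py is probably the safest way to avoid a name clash, since both packages are imported as tabula.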
