Transform scanned PDF documents into Excel using OCR

I have been asked to convert cable schedules into an Excel spreadsheet. I have tried several OCR (Optical Character Recognition) approaches: websites, R code (tesseract), JavaScript, and Excel's "Data from Picture" feature, and I have also looked into C.
So far none of these have worked. I believe the three main problems are:
- Where several consecutive rows hold the same value in a column, the schedule shows a single arrow pointing down across those rows instead of repeating the value.
- Many of the tools I have tried seem to think that a column randomly splits into two columns.
- The layout of the documents might be an issue, as they are in the format of an engineering drawing exported from AutoCAD, with the reference grid all around it.
Each document is a non-editable PDF that was made by drawing it in AutoCAD, so it does not technically contain any text, which is why I am trying OCR.
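The first problem above (a down arrow standing in for repeated values) might be easier to repair after OCR than during it. Below is a minimal Python sketch, assuming the OCR step has already produced rows of cell strings and that the arrow comes through as one of a few placeholder characters. `fill_down`, `to_csv`, and the `DITTO_MARKERS` set are hypothetical names chosen for illustration, and the marker characters are guesses that would need adjusting to whatever the OCR tool actually emits.

```python
import csv
import io

# Assumption: characters an OCR engine might output for the "same as above" arrow.
# "\u2193" is the down-arrow character; the others are common misreadings.
DITTO_MARKERS = {"\u2193", "|", '"'}


def fill_down(rows, markers=DITTO_MARKERS):
    """Replace ditto markers with the last real value seen in the same column."""
    filled = []
    last_seen = {}  # column index -> most recent non-marker value
    for row in rows:
        new_row = []
        for col, cell in enumerate(row):
            value = cell.strip()
            if value in markers and col in last_seen:
                value = last_seen[col]   # copy the value from above
            else:
                last_seen[col] = value   # remember this as the current value
            new_row.append(value)
        filled.append(new_row)
    return filled


def to_csv(rows):
    """Serialise rows as CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

The CSV text can be saved to a file and opened in Excel, which fits the point that only the data, not the layout, needs to be right.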
The layout of the exported Excel file doesn't matter as long as the data is right, as I can manually copy and paste the columns into the correct format.
I have approximately 350 of these to do and only December and January to do them in. Between searching for other methods and entering the data manually, I can complete about 4 per day, assuming no distractions. Those numbers do not fit the timeline, which is why I am asking if anyone here knows of other options.
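To make the mismatch concrete, here is the arithmetic as a short Python sketch. The 350 documents and 4 per day come from the figures above; treating every calendar day of December and January as a working day is an assumption (and an optimistic one).

```python
import math

DOCUMENTS = 350            # approximate number of cable schedules
RATE_PER_DAY = 4           # current manual pace
DAYS_AVAILABLE = 31 + 31   # assumption: every day of December and January is worked

days_needed = math.ceil(DOCUMENTS / RATE_PER_DAY)        # days required at the current pace
required_rate = DOCUMENTS / DAYS_AVAILABLE               # pace needed to finish in time
shortfall = DOCUMENTS - RATE_PER_DAY * DAYS_AVAILABLE    # documents left over at the current pace

print(f"Days needed at current pace: {days_needed}")
print(f"Required pace: {required_rate:.2f} per day")
print(f"Documents unfinished after {DAYS_AVAILABLE} days: {shortfall}")
```

Even under that assumption, the current pace needs 88 days against 62 available, leaving roughly 100 documents unfinished.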
What I have tried:
- Random online PDF-to-Excel converters
  - They either output a corrupted table or just add the table as an image in Excel.
- Adobe's built-in PDF-to-Excel export
  - I don't have access to this, and I haven't pursued access because Adobe's online version has the same issues as the online converters above.
- Excel's "Data from Picture" function
  - It doesn't read the data correctly. It converts everything wrongly; even for basic words like "TRAY" it fails to get a single letter correct.
- Tesseract in RStudio
  - I found this method on Stack Overflow (https://stackoverflow.com/questions/31979857/doing-ocr-with-r); however, it struggles with the layout of the cable schedule and is only able to extract a few parts.
  - This may still be viable, as I am not very experienced with either R or Tesseract.
- OneNote's built-in OCR
  - It basically makes hieroglyphs from the document.
- docsumo.com (https://www.docsumo.com/free-tools/extract-tables-from-pdf-images)
  - This somewhat works. It is what I have been using to extract the larger columns; however, it struggles with some of the smaller columns, so about half of the data still needs to be entered manually.
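Since several of these attempts fail on layout rather than on character recognition, one possible direction is to keep Tesseract but rebuild the table from word positions instead of from its plain-text output. The Tesseract command line can write word-level bounding boxes as TSV (for example `tesseract page.png out tsv`, where `page.png` is a placeholder file name). The Python sketch below is an untested-on-real-schedules illustration that assumes the column boundaries are known in advance (measured once, since all the schedules share a format). `words_from_tsv`, `to_grid`, `column_starts`, and `row_tolerance` are hypothetical names and parameters, not part of Tesseract.

```python
import csv
import io
from bisect import bisect_right


def words_from_tsv(tsv_text):
    """Extract (top, left, text) for each word-level row of Tesseract TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for rec in reader:
        text = (rec.get("text") or "").strip()
        # Level 5 rows are individual words; other levels describe blocks and lines.
        if rec.get("level") == "5" and text:
            words.append((int(rec["top"]), int(rec["left"]), text))
    return words


def to_grid(words, column_starts, row_tolerance=10):
    """Group words into table rows by vertical position and into cells by x position.

    column_starts: sorted pixel x positions where each column begins (assumed known).
    row_tolerance: maximum vertical offset, in pixels, from the first word of a row.
    """
    rows = []
    row_top = None
    for top, left, text in sorted(words):
        if row_top is None or top - row_top > row_tolerance:
            rows.append([])      # start a new table row
            row_top = top
        rows[-1].append((left, text))

    grid = []
    for row in rows:
        cells = [""] * len(column_starts)
        for left, text in sorted(row):  # left-to-right within the row
            col = max(bisect_right(column_starts, left) - 1, 0)
            cells[col] = f"{cells[col]} {text}".strip()
        grid.append(cells)
    return grid
```

Fixing the column boundaries up front is a deliberate choice here: it stops one column from being split in two, at the cost of having to measure the boundaries on a sample page first.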
Below is an example of what I am trying to extract.
It is mainly the data inside the red box that I need to extract; all the cable schedules are in this format.