My PDF file content is not meaningful after extracting its content [duplicate] - python

I have been having a serious problem with my PDF file. I want to extract all of its text, but after extraction all I have is raw bytes.
Below is part of the extracted content:
b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(en-US) /Metadata 89 0 R/ViewerPreferences 90 0 R>>\r\nendobj\r\n2 0 obj\r\n<</Type/Pages/Count 11/Kids[ 3 0 R 28 0 R 36 0 R 38 0 R 42 0 R 49 0 R 58 0 R 60 0 R 62 0 R 64 0 R 66 0 R] >>\r\nendobj\r\n3 0 obj\r\n<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 9 0 R/F3 12 0 R/F4 17 0 R/F5 19 0 R>>/ExtGState<</GS7 7 0 R/GS8 8 0 R>>/XObject<</Image27 27 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/Annots[ 11 0 R 24 0 R 25 0 R 26 0 R] /MediaBox[ 0 0 612 792] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S>>\r\nendobj\r\n4 0 obj\r\n<</Filter/FlateDecode/Length 5962>>\r\nstream\r\nx\x9c\xc5][o\xe3\xc6\x92~\x1f`\xfeC?J\x81\x87!\xbby\x1d\x1c,0\x17\'9\x07\xc9\\l\x03\xd9 \xc9\x03-\xd1\x16weI!9\xe3\xf1\xbf\xdf\xfa\xaa\x9b\x17\x89\xa4\xec\x91Z\xde\x01\xac\x91\xa8&\xab\xba\xaa\xba\xee\xdd\xfa\xe7\xe5\x0b\xd7q\xf1/\xf1\xa4pEH\xafQ"E\x91\xbd|\xf1\xfb\x0fb\xf5\xf2\xc5\xdb\xab\x97/~\xfc\xc9\x13\x9e\xe7\xb8\xbe\xb8\xbay\xf9\xc2\xa3q\xae\xf0\x84\x1f\x06\x8e\xa4\xe1A\xe2$\xa1\xb8\xba\xa3q?_F\xe2\xb6\xa4g\x8a[\xfe\x14\x9bO?\xbf|\xf1\xe7\xe4\xd7\xe9+5I\xcbJ\xe0\xff/S5\xd9\xd0\xdf\x9c\xfe\xd2j\xea\xb9\x93l\xfeZL\xff\x16W\xffy\xf9\xe2\x9c`~~\xf9\xe2\x9f#\x90\x0bd\xec\x04q\x179\xc6\xc9\xa0\xa2\x80\xc2\x8f\xd3P\xbfq\xa7\x11}x\xe5O$\xbd\xc1\x07\x0fWc\x8b\xc8D\xa1\xe3\xc91d\xbe{\xd6z\x90r\x9d\xd8\x17a(\x9d\xc8\x17^\xec9I$\x12\xfa#\x17\xdb\xa1O\x1d\xa7q\x97\x82`u\x11W\xa1\x88|\x1f\xb8?\x8e\xf4\xe7\xfa\x8d\xf4\x94#\x93\x1a\xa2\nb\xc7U\x83\x98=m`\x83Z\xc0\xc4\xeb`\'\xbd\xd8\xf1\x03\xc2\xd0ud\xdc\xc3\xf0\xb7\xacJ\xb5t\xa5\xd3Wr2
The code for this is as follows:
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url, stream=True)
data = response.content
print(data)
How can I extract the text from this?

You would need to use a package that parses the PDF file and extracts the text from it. For example, PyPDF2 could be used as follows:
import io
import requests
import PyPDF2

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url, stream=True)

# Wrap the downloaded bytes in a file-like object for PyPDF2
pdf = PyPDF2.PdfFileReader(io.BytesIO(response.content))

# Extract the text of each page and write it to a text file
with open('output.txt', 'w') as f_output:
    for page in range(pdf.getNumPages()):
        f_output.write(pdf.getPage(page).extractText())
This would create an output.txt file starting:
Last updated:
3/30/2018
Metadata:
Tivoli Bay
South
Hydrologic
Station
Location:
Tivoli Bay
, NY
(
42.027038,
-
73.925957
)
Data collection period:
July
1996*
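Note that PdfFileReader, getNumPages, getPage and extractText are the legacy PyPDF2 names. If you are on the library's current successor, pypdf, a rough equivalent sketch would be:

import io
import requests
from pypdf import PdfReader

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
reader = PdfReader(io.BytesIO(requests.get(url).content))

# In the modern API each page object exposes extract_text()
with open('output.txt', 'w') as f_output:
    for page in reader.pages:
        f_output.write(page.extract_text())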

Related

Character-specific conditional check in a string

I have to read and analyse some log files using Python, which usually contain lines in the following desired format:
date Don Dez 10 21:41:41.747 2020
base hex timestamps absolute
no internal events logged
// version 13.0.0
//28974.328957 previous log file: 21-41-41_Voltage.asc
// Measurement UUID: 9e0029d6-43a0-49e3-8708-3ec70363124c
28976.463987 LoggingString := "Log,5:45 AM, Friday, December 11, 2020,05:45:20.6,65.48,11.99,0.009843,12,0.01078,11.99,0.01114,11.99,0.01096,12,0.009984,4.595,0,1.035,0,0.1745,0,2,OM_2_1,0"
28978.600018 LoggingString := "Log,5:45 AM, Friday, December 11, 2020,05:45:22.7,65.47,11.99,0.009896,12,0.01079,11.99,0.01117,11.99,0.01097,12,0.009965,4.628,0,1.044,0,0.1698,0,2,OM_2_1,0"
However, sometimes files are created with undesired formats like the ones below:
date Die Jul 13 08:40:22.878 2021
base hex timestamps absolute
no internal events logged
// version 13.0.0
//1035.595166 previous log file: 08-40-22_Voltage.asc
// Measurement UUID: 2baf3f3f-300a-4f0a-bcbf-0ba5679d8be2
"1203.997816 LoggingString := ""Log" 9:01 am Tuesday July 13 2021 09:01:58.3 24.53 13.38 0.8948 13.37 0.8801 13.37 0.89 13.37 0.9099 13.47 0.8851 4.551 0.00115 0.8165 0 0.2207 0 5 OM_3_2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 "0"""
"1206.086064 LoggingString := ""Log" 9:02 am Tuesday July 13 2021 09:02:00.4 24.53 13.37 0.8945 13.37 0.8801 13.37 0.8902 13.37 0.9086 13.46 0.8849 5.142 0.001185 1.033 0 0.1897 0 5 OM_3_2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 "0"""
OR
date Mit Jun 16 10:11:43.493 2021
base hex timestamps absolute
no internal events logged
// version 13.0.0
// Measurement UUID: fe4a6a97-d907-4662-89f9-bd246aa54a33
10025.661597 LoggingString := """""""Log""" 12:59 PM Wednesday June 16 2021 12:59:01.1 66.14 0.00423 0 0.001206 0 0.001339 0 0.001229 0 0.001122 0 0.05017 0 0.01325 0 0.0643 0 0 OM_2_1_transition 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 """0"""""""
10030.592652 LoggingString := """""""Log""" 12:59 PM Wednesday June 16 2021 12:59:06.1 66.14 11.88 0.1447 11.88 0.1444 11.88 0.1442 11.87 0.005552 11.9 0.00404 2.55 0 0.4712 0 0.09924 0 0 OM_2_1_transition 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 """0"""""""
Since I am only concerned with the data below the "// Measurement UUID" line, I am using this code to extract data from lines that are in the desired format:
files = os.listdir(directory)
files = natsorted(files)
for file in files:
    base, ext = os.path.splitext(file)
    if file not in processed_files and ext == '.asc':
        print("File added:", file)
        file_path = os.path.join(directory, file)
        count = 0
        with open(file_path, 'r') as file_in:
            processed_files.append(file)
            Output_list = []  # Each string from file is read into this list
            Final = []  # Required specific data from each string is isolated & stored here
            for line in map(str.strip, file_in):
                if "LoggingString" in line:
                    first_quote = line.index('"')  # column where " first appears in the whole string
                    last_quote = line.index('"', first_quote + 1)  # column where " appears last (end of line)
                    # print(first_quote)
                    Output_list.append(
                        line[:first_quote].split(maxsplit=1)
                        + line[first_quote + 1: last_quote].split(","),
                    )
                    Final.append(Output_list[count][7:27])
In the undesired format there are one or more whitespace characters between fields, as seen above. I guess the log file generator sometimes produces a non-comma-separated file, or a comma-separated file with errors; I am not sure.
I tried adding a condition:
if "LoggingString" in line :
if ',' in line:
first_quote = line.index('"')
last_quote = line.index('"', first_quote + 1)
Output_list.append(line[:first_quote].split(maxsplit=1)
+ line[first_quote + 1: last_quote].split(","),)
Final.append(Output_list[count][7:27])
else:
raise Exception("Error in File")
However, this didn't serve the purpose: if an undesired-format line happens to contain even a single ',', the program considers it valid and processes it, which produces false results.
How do I ensure that only files whose lines are in the desired format are processed, and that an error message is printed for the others? What kind of conditional check could be implemented here?
You can use pandas.read_csv with a regex separator:
import glob
import pandas as pd

l = []
for f in glob.glob("/tmp/Log*.txt"):
    df = (pd.read_csv(f, sep=r',|(?<=[\w"])\s+(?=[\w"])',
                      header=None, skiprows=6, engine="python").iloc[:, 2:28])
    df.insert(0, "filename", f.split("\\")[-1])
    l.append(df)
out = pd.concat(l)
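To see why that separator handles both formats, here is a small illustrative check (the two sample lines are shortened from the question):

import re

sep = re.compile(r',|(?<=[\w"])\s+(?=[\w"])')

# Desired format: comma-separated payload inside the quotes
desired = '28976.463987 LoggingString := "Log,5:45 AM, Friday, December 11, 2020,05:45:20.6,65.48"'
# Undesired format: whitespace-separated payload
undesired = '"1203.997816 LoggingString := ""Log" 9:01 am Tuesday July 13 2021 09:01:58.3 24.53'

# The pattern splits on commas, or on whitespace squeezed between
# word characters/quotes, so both variants break into comparable field lists
print(sep.split(desired))
print(sep.split(undesired))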

Scrape web with info from several years and create a csv file for each year

I have scraped information with the results of the 2016 Chess Olympiad, using the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Imports the HTML into python
url = 'https://www.olimpbase.org/2016/2016te14.html'
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.text, 'lxml')

# Subsets the HTML to only get the HTML of the table we need
table = soup.find('table', attrs={'border': '1'})
print(table)

# Gets the column headers of our table, but just for the first eleven columns on the webpage
headers = []
for i in table.find_all('td', class_='bog')[1:12]:
    title = i.text.strip()
    headers.append(title)

# Creates a dataframe using the column headers from our table
df = pd.DataFrame(columns=headers)
# We grab data from the fourth row on; the previous rows belong to the headers
for j in table.find_all('tr')[3:]:
    row_data = j.find_all('td')
    row = [tr.text for tr in row_data][0:11]
    length = len(df)
    df.loc[length] = row
I want to do the same for the results of 2014 and 2012 (the Olympiad is normally played every two years), automatically. I have gotten about halfway there, but I really don't know how to continue. This is what I've done so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Imports the HTML into python
url = 'https://www.olimpbase.org/2016/2016te14.html'
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.text, 'lxml')

# Subsets the HTML to only get the HTML of the table we need
table = soup.find('table', attrs={'border': '1'})
print(table)

# Gets all the column headers of our table
headers = []
for i in table.find_all('td', class_='bog')[1:12]:
    title = i.text.strip()
    headers.append(title)

# Creates a dataframe using the column headers from our table
df = pd.DataFrame(columns=headers)

start_year = 2012
i = 2
end_year = 2016

def download_chess(start_year):
    url = f'https://www.olimpbase.org/{start_year}/{start_year}te14.html'
    response = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')
    # We grab data from the fourth row on; the previous rows belong to the headers
    for j in table.find_all('tr')[3:]:
        row_data = j.find_all('td')
        row = [tr.text for tr in row_data][0:11]
        length = len(df)
        df.loc[length] = row

while start_year < end_year:
    download_chess(start_year)
    start_year += i
download_chess(start_year)
I don't have much experience so I don't quite understand the logic of writing filenames. I hope you can help me.
The following will retrieve information for a range of years, in this case 2000 to 2018, and save each table to CSV as well:
import pandas as pd

years = range(2000, 2019, 2)
for y in years:
    try:
        df = pd.read_html(f'https://www.olimpbase.org/{y}/{y}te14.html')[1]
        new_header = df.iloc[2]
        df = df[3:]
        df.columns = new_header
        print(df)
        df.to_csv(f'chess_olympics_{y}.csv')
    except Exception as e:
        print(y, 'error', e)
This will print out the results table for each year:
   no.  team     Elo   flag  code  pos.  pts  Buch   MP  gms  nan  +   =  -  nan  +   =   -  nan  %     Eloav  Elop  ind.medals
3  1    Russia   2685  nan   RUS   1     38   457.5  20  56   nan  8   4  2  nan  23  30  3  nan  67.9  2561   2694  1 - 0 - 2
4  2    Germany  2604  nan   GER   2     37   455.5  22  56   nan  10  2  2  nan  21  32  3  nan  66.1  2568   2685  0 - 0 - 2
5  3    Ukraine  2638  nan   UKR   3     35½  457.5  21  56   nan  8   5  1  nan  18  35  3  nan  63.4  2558   2653  1 - 0 - 0
6  4    Hungary  2661  nan   HUN   4     35½  455.5  21  56   nan  8   5  1  nan  22  27  7  nan  63.4  2570   2665  0 - 0 - 0
7  5    Israel   2652  nan   ISR   5     34½  463.5  20  56   nan  7   6  1  nan  17  35  4  nan  61.6  2562   2649  0 - 0 - 0
[...]
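The only non-obvious step above is the header promotion: read_html returns the raw grid, row 2 holds the real column names, and the data starts at row 3. A toy illustration of that step, with made-up values:

import pandas as pd

# Rows 0-1 are decoration, row 2 holds the column names, data starts at row 3
raw = pd.DataFrame([['x', 'y'], ['x', 'y'], ['no.', 'team'], [1, 'Russia'], [2, 'Germany']])
new_header = raw.iloc[2]    # grab the row holding the real names
table = raw[3:]             # keep only the data rows
table.columns = new_header  # promote the names to column headers
print(table)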
Relevant documentation for pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

Why can't I scrape table data in order?

I'm trying to scrape table data off of this website:
https://www.nfl.com/standings/league/2019/REG
I have working code (below), however, it seems like the table data is not in the order that I see on the website.
On the website I see (top-down):
Baltimore Ravens, Green Bay Packers, ..., Cincinnati Bengals
But in my code results, I see (top-down): Bengals, Lions, ..., Ravens
Why is soup returning the tags out of order? Does anyone know why this is happening? Thanks!
import requests
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import lxml

url = 'https://www.nfl.com/standings/league/2019/REG'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(soup)  # not sure why soup isn't returning tags in the order I see on website

table = soup.table
headers = []
for th in table.select('th'):
    headers.append(th.text)
print(headers)

df = pd.DataFrame(columns=headers)
for sup in table.select('sup'):
    sup.decompose()  # Removes sup tag from the table tree so x, xz* in nfl_team_name will not show up
for tr in table.select('tr')[1:]:
    td_list = tr.select('td')
    td_str_list = [td_list[0].select('.d3-o-club-shortname')[0].text]
    td_str_list = td_str_list + [td.text for td in td_list[1:]]
    df.loc[len(df)] = td_str_list
print(df.to_string())
After the initial load, the table is dynamically sorted by the PCT column. To get the same order, sort your DataFrame the same way using sort_values():
pd.read_html('https://www.nfl.com/standings/league/2019/REG')[0].sort_values(by='PCT',ascending=False)
Or based on your example:
df.sort_values(by='PCT',ascending=False)
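One caveat, not in the original answer: in the hand-built DataFrame the PCT values come from td.text and are therefore strings, so it is safer to cast them before sorting:

# Cast PCT to float so the sort is numeric rather than lexicographic
df['PCT'] = df['PCT'].astype(float)
df = df.sort_values(by='PCT', ascending=False)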
Output:
NFL Team  W   L  T  PCT    PF   PA   Net Pts  Home       Road       Div        Pct    Conf        Pct    Non-Conf   Strk  Last 5
Ravens    14  2  0  0.875  531  282  249      7 - 1 - 0  7 - 1 - 0  5 - 1 - 0  0.833  10 - 2 - 0  0.833  4 - 0 - 0  12W   5 - 0 - 0
49ers     13  3  0  0.813  479  310  169      6 - 2 - 0  7 - 1 - 0  5 - 1 - 0  0.833  10 - 2 - 0  0.833  3 - 1 - 0  2W    3 - 2 - 0
Saints    13  3  0  0.813  458  341  117      6 - 2 - 0  7 - 1 - 0  5 - 1 - 0  0.833  9 - 3 - 0   0.75   4 - 0 - 0  3W    4 - 1 - 0
Packers   13  3  0  0.813  376  313  63       7 - 1 - 0  6 - 2 - 0  6 - 0 - 0  1      10 - 2 - 0  0.833  3 - 1 - 0  5W    5 - 0 - 0
...

how to open csv in python?

I have a dataset in the following format.
row_num;locale;day_of_week;hour_of_day;agent_id;entry_page;path_id_set;traffic_type;session_durantion;hits
"988681;L6;Monday;17;1;2111;""31672;0"";6;7037;\N"
"988680;L2;Thursday;22;10;2113;""31965;0"";2;49;14"
"988679;L4;Saturday;21;2;2100;""0;78464"";1;1892;14"
"988678;L3;Saturday;19;8;2113;51462;6;0;1;\N"
I want it to be in the following format:
row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N
I tried the following code:
import pandas as pd
df = pd.read_csv("C:\Users\Rahhy\Desktop\trivago.csv", delimiter = ";")
But I am getting an error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
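(For context, a note not in the original answers: the SyntaxError comes from \U in C:\Users being parsed as a unicode escape in a normal string literal; a raw string, or forward slashes, avoids it.)

import pandas as pd

# Raw string: backslashes in the Windows path are no longer treated as escapes
df = pd.read_csv(r"C:\Users\Rahhy\Desktop\trivago.csv", delimiter=";")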
Using replace():
with open("data_test.csv", "r") as fileObj:
contents = fileObj.read().replace(';',' ').replace('\\', '').replace('"', '')
print(contents)
OUTPUT:
row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N
EDIT:
You can open the file, read its content, replace the unwanted chars, write the new contents back to the file, and then read it with pd.read_csv:
with open("data_test.csv", "r") as fileObj:
contents = fileObj.read().replace(';',' ').replace('\\', '').replace('"', '')
# print(contents)
with open("data_test.csv", "w+") as fileObj2:
fileObj2.write(contents)
import pandas as pd
df = pd.read_csv(r"data_test.csv", index_col=False)
print(df)
OUTPUT:
row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N
import pandas as pd
from io import StringIO

# Load the file to a string (prefix r (raw) to not use \ for escaping)
filename = r'c:\temp\x.csv'
with open(filename, 'r') as file:
    raw_file_content = file.read()

# Remove the quotes which break the CSV file
file_content_without_quotes = raw_file_content.replace('"', '')

# Simulate a file with the corrected CSV content
simulated_file = StringIO(file_content_without_quotes)

# Get the CSV as a table with pandas
# Since the first field in each data row shall not be used for indexing we need to set index_col=False
csv_data = pd.read_csv(simulated_file, delimiter=';', index_col=False)
print(csv_data['hits'])  # print some column
csv_data
Since there are 11 data fields but only 10 headers, only the first 10 fields are used. You'll have to figure out what to do with the last one (values: \N, 14); one option is sketched after the output below.
Output:
0 7037
1 49
2 1892
3 1
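One hedged way to keep that 11th field is to hand read_csv an explicit list of 11 names ('extra' here is a made-up name for the header-less column):

import pandas as pd
from io import StringIO

with open(r'c:\temp\x.csv', 'r') as file:
    content = file.read().replace('"', '')

# The 10 names from the header row, plus a made-up 'extra' for the 11th field
cols = ['row_num', 'locale', 'day_of_week', 'hour_of_day', 'agent_id', 'entry_page',
        'path_id_set', 'traffic_type', 'session_durantion', 'hits', 'extra']

# header=0 skips the original 10-name header line; names= supplies all 11 columns
csv_data = pd.read_csv(StringIO(content), delimiter=';', header=0, names=cols, index_col=False)
print(csv_data[['hits', 'extra']])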
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Compare some columns from some tables using python

I need to compare two values MC and JT from 2 tables:
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -8.25705 0.219113 -0.000800014 20.8926 41.4347 5.75852 0 4.13067 0
1 423 17950 18150 210 180 17400 18430 1 0 -4.26426 0.586578 -0.053 77.22 85.2104 22.0534 0 3.551 0
2 468 41790 42020 240 50 41360 42380 0 0 7.82681 0.181248 -0.00269566 90.0646 92.7698 5.0841 0 4.19304 0
and
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -0.846655 0.0218695 2.59898e-05 2.0724 4.1259 0.583259 10 0.412513 0
1 423 17950 18150 210 180 17400 18780 1 0 -0.453311 0.058732 -0.00526783 7.7403 8.52544 2.19627 0 0.354126 0
2 468 41790 42020 240 70 41360 42380 0 0 0.743716 0.0181613 -0.000256186 9.08777 9.21395 0.502506 0 0.419265 0
I need to do it using the csv module. I know how to do it using pandas and xlrd, but not with csv.
Desired output:
Number_of_strings MC JT
and the rows where the values differ should be printed.
import csv

old = csv.reader(open('old.csv', 'rb'), delimiter=',')
row1 = old.next()
new = csv.reader(open('new.csv', 'rb'), delimiter=',')
row2 = new.next()
if (row1[8] == row2[8]) and (row1[9] == row2[9]):
    continue
else:
    print row1[0] + ':' + row1[8] + '!=' + row2[8]
You can try something like the following:
old = list(csv.reader(open('old.csv', 'rb'), delimiter=','))
new = list(csv.reader(open('new.csv', 'rb'), delimiter=','))
old = zip(*old)
new = zip(*new)
print ['%s-%s-%s'%(str(a), str(b), str(c)) for a, b, c in zip(old[0], new[8], old[8]) if b != c]
First, we get a list of lists. zip(*x) will transpose a list of lists. The rest should be easy to decipher ...
You can actually put whatever you want within the string ...
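Note that both snippets above are Python 2 ('rb' mode, .next(), the print statement). A rough Python 3 sketch of the row-by-row MC/JT comparison, assuming comma-separated files as in the question's code:

import csv

with open('old.csv', newline='') as f_old, open('new.csv', newline='') as f_new:
    old_rows = list(csv.reader(f_old, delimiter=','))
    new_rows = list(csv.reader(f_new, delimiter=','))

# MC is column 8 and JT is column 9 (0-based), matching the question's indexing
for row1, row2 in zip(old_rows, new_rows):
    if (row1[8], row1[9]) != (row2[8], row2[9]):
        print(row1[0] + ': MC ' + row1[8] + ' != ' + row2[8] + ', JT ' + row1[9] + ' != ' + row2[9])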
