I am unable to extract the MCC details from a PDF, although I can extract other data with my code.
import tabula.io as tb
from tabula.io import read_pdf

pdf_path = "IR21_SVNMT_Telekom Slovenije d.d._20210506142456.pdf"
# read all tables from the PDF into a list of DataFrames (this call was missing from the snippet)
df_list = read_pdf(pdf_path, pages="all", multiple_tables=True)

for df in df_list:
    if 'MSRN Number Range(s)' in df.columns:
        df = df.drop(df.index[0])
        df.columns = df.columns.str.replace('\r', '')
        df.columns = df.columns.str.replace(' ', '')
        df.columns = df.columns.str.replace('Unnamed:0', 'CountryCode(CC)')
        df.columns = df.columns.str.replace('Unnamed:1', 'NationalDestinationCode(NDC)')
        df.columns = df.columns.str.replace('Unnamed:2', 'SNRangeStart')
        df.columns = df.columns.str.replace('Unnamed:3', 'SNRangeStop')
        break

msrn_table = df[['CountryCode(CC)', 'NationalDestinationCode(NDC)', 'SNRangeStart', 'SNRangeStop']]
print(msrn_table)
I am trying to use the same logic to retrieve the "Mobile Country Code (MCC)" details, but the Pandas DataFrame shows different data from what is in the PDF.
for df in df_list:
    if 'Mobile Country Code (MCC)' in df.columns:
        break
print(df)
The Pandas output is shown here:
The actual content in the PDF file is:
This code works:
import pdfplumber
import re

# capture the line that follows the "Mobile Network Code (MNC)" header
pattern = re.compile(r'Mobile Network Code \(MNC\)[\r\n]+([^\r\n]+)')
#pattern = re.compile(r'Mobile\sNetwork\sCode\s\(MNC\)')

pdf = pdfplumber.open(pdf_path)
n = len(pdf.pages)
final = ""
for page in range(n):
    data = pdf.pages[page].extract_text()
    final = final + "\n" + data

matches = pattern.findall(final)
mcc_mnc = " ".join(matches)
# the captured line holds the MCC and MNC separated by a space
mcc = mcc_mnc.split(" ")
actual_mcc = mcc[0]
actual_mnc = mcc[1]
print(actual_mcc)
print(actual_mnc)
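If it helps, here is a slightly more defensive variant as a sketch (it assumes, like the regex above, that the MCC and MNC are the first two numbers on the line right after the "Mobile Network Code (MNC)" header):
import re
import pdfplumber

# assumes the header is immediately followed by a line like "<MCC> <MNC>"
pattern = re.compile(r'Mobile Network Code \(MNC\)\s+(\d+)\s+(\d+)')

with pdfplumber.open(pdf_path) as pdf:
    text = "\n".join((page.extract_text() or "") for page in pdf.pages)

match = pattern.search(text)
if match:
    actual_mcc, actual_mnc = match.groups()
    print(actual_mcc, actual_mnc)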
In my code, the csv writer is writing some unrealistic values to the CSV file.
My goal is to read all csv files in one directory, apply a filter on a specific column, and write the filtered dataframes to a consolidated csv file.
I am able to get the required output in the VS Code console, but I am not able to write it into a csv file.
Kindly help me understand what I am doing incorrectly.
This is my sample input:
And this is the output I am getting:
Code:
import pandas as pd
import os
import glob
import csv
from pandas.errors import EmptyDataError

# use glob to get all the csv files in the folder
path = os.getcwd()
#print(path)
csv_files = glob.glob(os.path.join(path, "*.csv"))
print(csv_files)

col_name = input("Enter the column name to filter: ")
print(col_name)
State_Input = input("Enter the {} ".format(col_name))
print(State_Input)

df_empty = pd.DataFrame()
for i in csv_files:
    try:
        df = pd.read_csv(i)
        #print(df.head(5))
        State_Filter = df["State"] == State_Input
        print(df[State_Filter])
        df_child = df[State_Filter]
        with open('D:\\PythonProjects\\File-Split-Script\\temp\\output\\csv_fil111.csv', 'w') as csvfile:
            data_writer = csv.writer(csvfile, dialect='excel')
            for row in df_child:
                data_writer.writerows(row)
    except EmptyDataError as e:
        print('There was an error in your input, please try again :{0}'.format(e))
Use DataFrame.to_csv to write your file in one go. Prefer storing your filtered dataframes in a list, then concatenate them row-wise into a new dataframe:
import pandas as pd
import pathlib

data_dir = pathlib.Path.cwd()

# Your input here
state = input('Enter the state: ')  # Gujarat, Bihar, ...
print(state)

data = []
for csvfile in data_dir.glob('*.csv'):
    df = pd.read_csv(csvfile)
    df = df.loc[df['State'] == state]
    data.append(df)

# stack the filtered frames row-wise and write them without the index column
df = pd.concat(data, axis=0, ignore_index=True)
df.to_csv('output.csv', index=False)
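As a side note, the same result can be written more compactly with a generator expression and DataFrame.query (same assumption as above: every CSV has a 'State' column):
# compact equivalent of the loop above
df = pd.concat(
    (pd.read_csv(f).query("State == @state") for f in data_dir.glob('*.csv')),
    ignore_index=True,
)
df.to_csv('output.csv', index=False)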
I am new to Python pandas and need your guidance. I have the code below, which extracts specific data from PDF files and exports it into an Excel file. The code works fine; however, all data is exported as text. Is there any way I can extract text and numbers in the same code?
import os
import pandas as pd
import numpy as np
import glob
import pdfplumber


def get_keyword(start, end, text):
    for i in range(len(start)):
        try:
            field = (text.split(start[i]))[1].split(end[i])[0]
            return field
        except:
            continue


def main():
    my_dataframe = pd.DataFrame()
    for files in glob.glob("C:/PDFs/*.pdf"):
        with pdfplumber.open(files) as pdf:
            page = pdf.pages[0]
            text = page.extract_text()
            text = " ".join(text.split())

            # obtain keyword #1 - find Supplier - this is text and it is fine
            start = ['SUPPLIER ']
            end = [' Purchase']
            keyword1 = get_keyword(start, end, text)

            # obtain keyword #2 - find Invoice; this needs to be a number, not text
            start = ['Invoice Weight(Kg) ']
            end = ['.00 Net Weight.(Kg)']
            keyword2 = get_keyword(start, end, text)

            my_list = [keyword1, keyword2]
            my_list = pd.Series(my_list)
            my_dataframe = my_dataframe.append(my_list, ignore_index=True)
            print("Document's keywords have been extracted successfully!")

    my_dataframe = my_dataframe.rename(columns={0: 'Supplier',
                                                1: 'Invoice Number',
                                                2: 'Mill Lot Number'})
    save_path: str = 'C:/PDFs'
    os.chdir(save_path)

    # extract my dataframe to an .xlsx file!
    my_dataframe.to_excel('sample.xlsx', sheet_name='Sheet1')
    print("")
    print(my_dataframe)


if __name__ == '__main__':
    main()
I tried using str.extract(r"([A-Za-z\s]+)([\d-]+)"), but it did not work. I also tried the link below, but could not decipher it. Kindly help!
Python pandas extracting numbers and text within text to two new column
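One possible approach, sketched here rather than tested against these PDFs: keep the string extraction as it is and coerce the numeric column afterwards with pd.to_numeric (the column name 'Invoice Number' comes from the rename in the code above):
# after my_dataframe has been built and renamed
my_dataframe['Invoice Number'] = pd.to_numeric(my_dataframe['Invoice Number'],
                                               errors='coerce')
# errors='coerce' turns anything that is not a valid number into NaN
my_dataframe.to_excel('sample.xlsx', sheet_name='Sheet1')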
I have a data file which is the result of combining several sources that contain name information. Each name has a unique ID (column ID).
Sorting by the ID column, I would like to remove the second/third source findings in the Source column.
My output today:
(all the red rows are "duplicates" since we already got them from the first source (blue rows))
What I would like to achieve:
How can I achieve this result?
Is there a way to iterate row by row, removing duplicates of an ID already while iterating in the "for file in files:" part of the code?
Or is it easier to do it in df_merged before I output the dataframe to an Excel file?
Code:
import pandas as pd
import os
from datetime import datetime
from shutil import copyfile
from functools import reduce
import numpy as np

# Path
base_path = "G:/Till/"

# Def
def get_files(folder, filetype):
    list_files = []
    directory = os.fsencode(folder)
    for file in os.listdir(directory):
        filename = os.fsdecode(file)
        if filename.endswith("." + filetype.strip().lower()):
            list_files.append(filename)
    return list_files

# export files
df_result_e = pd.DataFrame()
files = get_files(base_path + "datasource/" + "export", "xlsx")
df_append_e = pd.DataFrame()
for file in files:
    df_temp = pd.read_excel(base_path + "datasource/" + "export/" + file, "Results", dtype=str)
    df_temp["Source"] = file
    df_append_e = pd.concat([df_append_e, df_temp])
df_result_e = pd.concat([df_result_e, df_append_e])
print(df_result_e)

# match files
df_result_m = pd.DataFrame()
files = get_files(base_path + "datasource/" + "match", "xlsx")
df_append_m = pd.DataFrame()
for file in files:
    df_temp = pd.read_excel(base_path + "datasource/" + "match/" + file, "Page 1", dtype=str)
    df_append_m = pd.concat([df_append_m, df_temp])
df_result_m = pd.concat([df_result_m, df_append_m])

df_result_m = df_result_m[['ID_Our', 'Name_Our', 'Ext ID']]
df_result_m.rename(columns={'ID_Our': 'ID', 'Name_Our': 'Name', 'Ext ID': 'Match ID'}, inplace=True)
df_result_m.dropna(subset=["Match ID"], inplace=True)  # Drop all NA

data_frames = [df_result_e, df_result_m]

# Join files
df_merged = reduce(lambda left, right: pd.merge(left, right, on=["Match ID"], how='outer'), data_frames)

# Output of files
df_merged.to_excel(base_path + "Total datasource Export/" + datetime.now().strftime("%Y-%m-%d_%H%M") + ".xlsx", index=False)
To remove them, you can try transform with factorize:
newdf = df[df.groupby('ID')['Source'].transform(lambda x: x.factorize()[0]) == 0]
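To illustrate what the factorize transform does, here is a small sketch on toy data (the column names ID and Source are assumed from the question):
import pandas as pd

# toy frame: two different sources report the same ID
df = pd.DataFrame({
    'ID':     [1, 1, 2, 2, 2],
    'Source': ['a.xlsx', 'b.xlsx', 'a.xlsx', 'a.xlsx', 'c.xlsx'],
})

# within each ID, factorize()[0] numbers the sources in order of appearance,
# so 0 marks rows coming from the first source seen for that ID
newdf = df[df.groupby('ID')['Source'].transform(lambda x: x.factorize()[0]) == 0]
print(newdf)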
I used Python 3 and pandas to parse the daily close from WSJ into Excel. However, the daily close shown on the web page cannot be extracted. Here is the link: https://quotes.wsj.com/index/COMP/historical-prices
How do I download the close data shown on screen into Excel?
And how do I download the "DOWNLOAD A SPREADSHEET" button file into Excel with another name, like comp.xlsx?
Here is the code:
import requests
import pandas as pd

url = 'https://quotes.wsj.com/index/COMP/historical-prices'
jsonData = requests.get(url).json()

final_df = pd.DataFrame()
for row in jsonData['data']:
    #row = jsonData['data'][1]
    data_row = []
    for idx, colspan in enumerate(row['colspan']):
        colspan_int = int(colspan[0])
        data_row.append(row['td'][idx] * colspan_int)
    flat_list = [item for sublist in data_row for item in sublist]
    temp_row = pd.DataFrame([flat_list])
    final_df = final_df.append(temp_row, sort=True).reset_index(drop=True)

wait2 = input("PRESS ENTER TO CONTINUE.")
Follow-up question code:
#
url = 'https://quotes.wsj.com/index/HK/XHKG/HSI/historical-prices/download?num_rows=15&range_days=15&endDate=12/06/2019'
response = requests.get(url)
open('HSI.csv', 'wb').write(response.content)
read_file = pd.read_csv(r'C:\A-CEO\REPORTS\STOCKS\PROFILE\Python\HSI.csv')
read_file.to_excel(r'C:\A-CEO\REPORTS\STOCKS\PROFILE\Python\HSI.xlsx', index=None, header=True)

#
url = 'https://quotes.wsj.com/index/SPX/historical-prices/download?num_rows=15&range_days=15&endDate=12/06/2019'
response = requests.get(url)
open('SPX.csv', 'wb').write(response.content)
read_file = pd.read_csv(r'C:\A-CEO\REPORTS\STOCKS\PROFILE\Python\SPX.csv')
read_file.to_excel(r'C:\A-CEO\REPORTS\STOCKS\PROFILE\Python\SPX.xlsx', index=None, header=True)

#
url = 'https://quotes.wsj.com/index/COMP/historical-prices/download?num_rows=15&range_days=15&endDate=12/06/2019'
response = requests.get(url)
open('COMP.csv', 'wb').write(response.content)
read_file = pd.read_csv(r'C:\A-CEO\REPORTS\STOCKS\PROFILE\Python\COMP.csv')
read_file.to_excel(r'C:\A-CEO\REPORTS\STOCKS\PROFILE\Python\COMP.xlsx', index=None, header=True)
The URL is wrong; once the file is downloaded, you can do "Get Info" on a Mac and check "Where From:". You will see it is of the form below.
import requests
import pandas as pd
import io
#original URL had a bunch of other parameters I omitted, only these seem to matter but YMMV
url = 'https://quotes.wsj.com/index/COMP/historical-prices/download?num_rows=360&range_days=360&endDate=11/06/2019'
response = requests.get(url)
#do this if you want the CSV written to your machine
open('test_file.csv', 'wb').write(response.content)
# this decodes the content of the downloaded response and presents it to pandas
df_test = pd.read_csv(io.StringIO(response.content.decode('utf-8')))
To answer your additional question -- you can simply loop across a list of tickers or symbols, something like:
base_url = 'https://quotes.wsj.com/index/{ticker_name}/historical-prices/download?num_rows=360&range_days=360&endDate=11/06/2019'
ticker_list = ['COMP', 'SPX', 'HK/XHKG/HSI']
for ticker in ticker_list:
    response = requests.get(base_url.format(ticker_name=ticker))
    #do this if you want the CSV written to your machine
    open('prices_' + ticker.replace('/', '-') + '.csv', 'wb').write(response.content)
Note that for HK/XHKG/HSI we need to replace the slashes with hyphens, or it is not a valid filename. You can also use this pattern to make dataframes.
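For example, the same loop can feed each response straight into a dataframe (a sketch reusing the base_url and ticker_list above; the dict keys are just the ticker strings):
import io
import requests
import pandas as pd

frames = {}
for ticker in ticker_list:
    response = requests.get(base_url.format(ticker_name=ticker))
    # decode the CSV payload in memory instead of writing it to disk first
    frames[ticker] = pd.read_csv(io.StringIO(response.content.decode('utf-8')))

print(frames['COMP'].head())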
Below is my code for pulling the annual data. Since the quarterly view is not a separate link but a button, I cannot figure out how to pull it. I have spent days on this and am finally resorting to asking for help.
The end game is an Excel output with balance sheets, cash flows, etc., but I need it on a quarterly basis.
Any help is welcome. Thank you.
import lxml
from lxml import html
import requests
import numpy as np
import pandas as pd
import xlrd


def scrape_table(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    ## page.content rather than page.text because html.fromstring implicitly expects bytes as input
    table = tree.xpath('//table')
    ## XPath is a way of locating information in structured documents such as HTML or XML
    assert len(table) == 1
    df = pd.read_html(lxml.etree.tostring(table[0], method='html'))[0]
    df = df.set_index(0)
    # df = df.dropna()
    df = df.transpose()
    df = df.replace('-', '0')
    # The first column should be a date
    df[df.columns[0]] = pd.to_datetime(df[df.columns[0]])
    cols = list(df.columns)
    cols[0] = 'Date'
    df = df.set_axis(cols, axis='columns', inplace=False)
    numeric_columns = list(df.columns)[1::]
    df[numeric_columns] = df[numeric_columns].astype(np.float64)
    return df


loc = r"F:\KateLaptop2019\Work\DataAnalysis\listpubliccompanies.xlsx"
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
sheet.cell_value(0, 0)

companies = []
for i in range(1, sheet.nrows):
    companies.append(sheet.cell_value(i, 1).strip())


def annual_financials():
    for item in companies:
        try:
            balance_sheet_url = 'https://finance.yahoo.com/quote/' + item + '/balance-sheet?p=' + item
            download_destination = (r'F:\KateLaptop2019\Work\DataAnalysis\OilCompanyResearch\CompanyFinancials\BalanceSheet\\' + item + ".xlsx")
            df_balance_sheet = scrape_table(balance_sheet_url)
            df_balance_sheet.to_excel(download_destination)
        except:
            print(item, "key error")
            pass


annual_financials()