Parsing mixed flat-file data to write into xls using Python

I have a complex flat file with a huge amount of mixed-type data. I am trying to parse it using Python (the language I know best) and have succeeded in segregating the data by category using manual parsing.
I am now stuck at the point where I have extracted the data and need to make it tabular so that I can write it into xls, using pandas or any other library.
I have pasted the data at pastebin: https://pastebin.com/qn9J5nUL
The data comes in non-tabular and tabular formats; I need to discard the non-tabular data and write only the tabular data into xls.
To be precise, I want to delete data like the following -
ABC Command-----UIP BLOCK:;
SE : ABC_UIOP_89TP
Report : +ve ABC_UIOP_89TP 2016-09-23 15:16:14
O&M #998459350
%%/*Web=1571835373:;%%
ID = 0 Result Ok.
and only write data in the following format into xls (an example, not exact; please refer to the pastebin URL for the complete data format) -
Local Info ID ID Name ID Frequency ID Data My ID
0 XXX_1 0 12 13

Since your data file has a definite pattern, I think you can do it this way:
import pandas as pd

s = []  # line indices where a table block starts (header line containing 'Local')
e = []  # line indices where a table block ends
with open('data_to_be_parsed.txt') as f:
    datafile = f.readlines()
for idx, line in enumerate(datafile):
    if 'Local' in line:
        s.append(idx)
    if '(Number of results' in line:
        e.append(idx)

frames = []
for i in range(len(s)):
    # split the header line on whitespace to get the column names
    head = [x for x in datafile[s[i]].split(" ") if x.strip()]
    rows = []
    for l_ in range(s[i] + 1, e[i]):
        da = datafile[l_]
        if len(da) > 1:
            data = [x for x in da.split(" ") if x.strip()]
            rows.append(dict(zip(head, data)))
    frames.append(pd.DataFrame(rows, columns=head))

maindf = pd.concat(frames, ignore_index=True)
maindf.to_excel("output.xlsx")
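Collecting each block's rows in a plain list and building the DataFrame once per block is also the forward-compatible route: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so accumulating rows first is both faster and safer across versions.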

Related

Optimize string split to get a pandas dataframe

I am extracting data from a large PDF file with regex, using Python in Databricks. The data comes out as one long string, and I am using a string split to turn it into a pandas DataFrame, since I want the final data as a CSV file. The line.split command takes about 5 hours to run, and I am looking for ways to optimize this. I am new to Python and not sure which part of the code I should look at to reduce the run time.
import os
import re
import PyPDF2

for pdf in os.listdir(data_directory):
    # creating a file object (data_directory is defined elsewhere in the notebook)
    file = open(data_directory + pdf, 'rb')
    # creating file reader object
    fileReader = PyPDF2.PdfFileReader(file)
    num_pages = fileReader.numPages
    # print("total pages = " + str(num_pages))
    extracted_string = "start of file"
    current_page = 0
    while current_page < num_pages:
        # print("adding page " + str(current_page) + " to the file")
        extracted_string += fileReader.getPage(current_page).extract_text()
        current_page = current_page + 1
    regex_date = r"\d{2}\/\d{2}\/\d{4}[^\n]*"
    table_lines = re.findall(regex_date, extracted_string)
The code above gets the data out of the PDF.
# create dataframe out of extracted string and load into a single dataframe
for line in table_lines:
    df = pd.DataFrame([x.split(' ') for x in line.split('\n')])
    df.rename(columns={0: 'date_of_import', 1: 'entry_num', 2: 'warehouse_code_num',
                       3: 'declarant_ref_num', 4: 'declarant_EORI_num', 5: 'VAT_due'},
              inplace=True)
    table = pd.concat([table, df], sort=False)
This part of the code is what takes up the most time. I have tried different ways to get a DataFrame out of this data, but the above has worked best for me. I am looking for a faster way to run this code.
PDF file for reference: https://drive.google.com/file/d/1ew3Fw1IjeToBA-KMbTTD_hIINiQm0Bkg/view?usp=share_link
There are two immediate optimization steps in your code.
Pre-compile regexes if they are used many times. It may or may not be relevant here, because I cannot tell how many times table_lines = re.findall(regex_date, extracted_string) is executed, but this is often more efficient:
# before any loop
regex_date = re.compile(r"\d{2}\/\d{2}\/\d{4}[^\n]*")
...
# inside the loop
table_lines = regex_date.findall(extracted_string)
Do not repeatedly append to a DataFrame. A DataFrame is a rather complex container, and appending rows is a costly operation. It is generally much more efficient to build a Python container (list or dict) first and then convert it to a DataFrame as a whole:
# each matched line contains no newline (the regex excludes it),
# so each line can be split on spaces directly
data = [line.split(' ') for line in table_lines]
table = pd.DataFrame(data, columns=['date_of_import', 'entry_num',
                                    'warehouse_code_num', 'declarant_ref_num',
                                    'declarant_EORI_num', 'VAT_due'])
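As a rough rule of thumb, each append copies the frame it grows, so N appended rows cost on the order of N² element copies, while filling a list and converting it once stays linear in N.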

Vlookup using a for-loop

I have been trying to build a vlookup function in Python. I have two files: data, created in Python, which has more than 2000 rows, and comp_data, a CSV file loaded into the system, which has 35 rows. I have to match the Date of the data file with comp_data and load the corresponding Exp_date. The current code gives an error at 35 (presumably when z runs past the last row of comp_data). I am not able to understand the problem.
Here is the code:
import datetime

data['Exp_date'] = datetime.date(2020, 3, 30)
z = 0
for i in range(len(data)):
    if data['Date'][i] == comp_data['Date'][z]:
        data['Exp_date'][i] = comp_data['Exp_date'][z]
    else:
        z = z + 1
One option would be to put your comp_data in a dictionary with your date/exp_date as key/value pairs and let Python do the lookup for you.
data = {"date": ["a", "b", "c", "d", "e", "f"], "exp_date": [0, 0, 0, 0, 0, 0]}
comp = {"a": 10, "d": 13}
for i in range(len(data['date'])):
    if data['date'][i] in comp:
        data['exp_date'][i] = comp[data['date'][i]]
print(data)
There's probably a one-liner way of doing this with iterators!
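For example, something like this comprehension (a sketch reusing the same toy data, where dict.get falls back to the current exp_date when a date has no match):

import pandas as pd  # not needed here; plain dicts suffice

data = {"date": ["a", "b", "c", "d", "e", "f"], "exp_date": [0, 0, 0, 0, 0, 0]}
comp = {"a": 10, "d": 13}

# look each date up in comp; keep the existing exp_date when there is no match
data["exp_date"] = [comp.get(d, e) for d, e in zip(data["date"], data["exp_date"])]
print(data)  # exp_date becomes [10, 0, 0, 13, 0, 0]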

How to create a multiIndex dataframe from a streaming csv file

I'm streaming data to a csv file.
This is the request:
symbols=["SPY", "IVV", "SDS", "SH", "SPXL", "SPXS", "SPXU", "SSO", "UPRO", "VOO"]
Each symbol has a list with range (0, 8).
This is how it looks in 3 columns:
-1583353249601,symbol,SH
-1583353249601,delayed,False
-1583353249601,asset-main-type,EQUITY
-1583353250614,symbol,SH
-1583353250614,last-price,24.7952
-1583353250614,bid-size,362
-1583353250614,symbol,VOO
-1583353250614,bid-price,284.79
-1583353250614,bid-size,3
-1583353250614,ask-size,1
-1583353250614,bid-id,N
My end goal is to reshape the data; this is what I need to achieve (the desired layout was shown as an image in the original post).
The problems that I encountered were not being able to group by timestamp and not being able to pivot.
1) I tried to create a dict so it could later be passed to pandas, but I am missing data in the process. I need to find a way to group the data that has the same timestamp; it looks like this approach omits lines with the same timestamp.
code:
import csv
import pandas as pd

new_data_dict = {}
with open("stream_data.csv", 'r') as data_file:
    data = csv.DictReader(data_file, delimiter=",")
    for row in data:
        item = new_data_dict.get(row["timestamp"], dict())
        item[row["symbol"]] = row["value"]
        new_data_dict[row['timestamp']] = item
data = new_data_dict
data = pd.DataFrame.from_dict(data)
print(data.T)
2) This is another approach. I was able to group by timestamp by creating 2 different frames, but I cannot split the value column into multiple columns to be merged later on matching indexes.
code:
data = pd.read_csv("tasty_hola.csv", sep=',')
data1 = data.groupby(['timestamp']).apply(lambda v: v['value'].unique())
data = data.groupby(['timestamp']).apply(lambda v: v['symbol'].unique())
data1 = pd.DataFrame({'timestamp': data1.index, 'value': data1.values})
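To illustrate the reshaping I am after, something like this pivot (a rough sketch, not tested against the real stream; it assumes the raw file has no header row and that the three columns are a timestamp, a field name such as symbol or bid-size, and a value):

import pandas as pd

df = pd.read_csv("stream_data.csv", names=["timestamp", "key", "value"])

# broadcast each 'symbol' marker row onto the other rows of its timestamp group
df["symbol"] = df["value"].where(df["key"] == "symbol")
df["symbol"] = df.groupby("timestamp")["symbol"].ffill()

# MultiIndex (timestamp, symbol) rows, one column per field
wide = df[df["key"] != "symbol"].pivot_table(
    index=["timestamp", "symbol"], columns="key",
    values="value", aggfunc="first")
print(wide)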
At this moment I don't know if the logic I am trying to apply is the correct one. I am very lost, not being able to see the light at the end of the tunnel.
Thank you very much.

Read data from excel after a string matches

I want to read an entire row of data and store it in variables, then later use them in Selenium to write to web elements. The programming language is Python.
Example: I have an Excel sheet of incidents and their details regarding priority, date, assignee, etc.
If I give the string INC00000, it should match the Excel data, fetch all the above details, and store them in separate variables, like
INC # = INC00000  Priority = Moderate  Date = 11/2/2020
Is this feasible? I tried and failed to write the code. Please suggest other possible ways to do this.
I would:
load the sheet into a pandas DataFrame
filter the corresponding column in the DataFrame by the INC # of interest
convert the row to a dictionary (assuming the INC filter produces only 1 row)
get the corresponding value in the dictionary to assign to the corresponding web element
Example:
import pandas as pd

df = pd.read_excel("full_file_path", sheet_name="name_of_sheet")
# assuming the incident numbers are in a column named "INC #" in the spreadsheet,
# and that the filter matches exactly one row
dict_data = df[df['INC #'] == 'INC00000'].to_dict('records')[0]
webelement1.send_keys(dict_data[columnname1])
webelement2.send_keys(dict_data[columnname2])
webelement3.send_keys(dict_data[columnname3])
.
.
.
Please find the code below and make changes as per your variables after saving your Excel file as CSV (the dummy data was shown in an image in the original post):
import csv

# Set up input and output variables for the script
gTrack = open("file1.csv", "r")

# Set up CSV reader and process the header
csvReader = csv.reader(gTrack)
header = next(csvReader)
print(header)

id_index = header.index("id")
date_index = header.index("date ")
var1_index = header.index("var1")
var2_index = header.index("var2")

# Make an empty list
cList = []

# Loop through the lines in the file and get the required id
for row in csvReader:
    id = row[id_index]
    if id == 'INC001':
        date = row[date_index]
        var1 = row[var1_index]
        var2 = row[var2_index]
        cList.append([id, date, var1, var2])

# Print the collected list
print(cList)

Using Pandas dataframe with FOR loops

Thank you for looking.
I am trying my hand at modifying a Python script to download a bunch of data from a website. Given the large amount of data that will be used, I want to convert the script to pandas. I have this code so far:
import csv
import urllib.request
import pandas as pd

snames = ['Index', 'Node_ID', 'Node', 'Id', 'Name', 'Tag', 'Datatype', 'Engine']
sensorinfo = pd.read_csv(sensorpath, header=None, names=snames, index_col=['Node', 'Index'])
for j in sensorinfo['Node']:
    for z in sensorinfo['Index']:
        # create a string for the url of the data
        data_url = "http://www.mywebsite.com/emoncms/feed/data.json?id=" + sensorinfo['Id'] + "&apikey1f8&start=&end=&dp=600"
        print(data_url)
        # read in the data from emoncms
        sock = urllib.request.urlopen(data_url)
        data_str = sock.read()
        sock.close()
        # data is output as a string so we convert it to a list of lists
        data_list = eval(data_str)
        myfile = open(feed_list['Name'[k]] + ".csv", 'wb')
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
The first part of the code gives me a very nice table, which means I am opening my CSV data file and importing the information. My question is this:
So I am trying to do this, in pseudocode:
for node in nodes (4 nodes so far):
    for index in indexes:
        data_url = websiteinfo + Id + sampleinformation
        smalldata.read.csv(data_url)
        merge(bigdata, smalldata.no_time_column)
This is my first post here; I tried to keep it short but still supply the relevant data. Let me know if I need to clarify anything.
In your pseudocode, you can do this:
dfs = []
for node in nodes:  # 4 nodes so far
    for index in indexes:
        data_url = websiteinfo + Id + sampleinformation
        df = smalldata.read.csv(data_url)
        dfs.append(df)
df = pd.concat(dfs)
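Concretely, that pattern might look like the sketch below (the URL and the (node, index, id) triples are placeholders standing in for the question's variables, not a real API):

import pandas as pd

# placeholder base URL modelled on the question; not a real endpoint
base_url = "http://www.mywebsite.com/emoncms/feed/data.json"

dfs = []
for node, index, feed_id in [(1, 0, 101), (1, 1, 102)]:  # hypothetical triples
    data_url = f"{base_url}?id={feed_id}&start=&end=&dp=600"
    # accumulate each small frame in a list instead of merging immediately
    dfs.append(pd.read_csv(data_url))

# build the big frame once at the end
bigdata = pd.concat(dfs, ignore_index=True)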
