I have a complex flat file with a large amount of mixed-type data. I am trying to parse it using Python (the language I know best), and I have succeeded in segregating the data into categories with manual parsing.
Now I am stuck at the point where I have the extracted data and need to make it tabular so that I can write it into xls, using pandas or any other library.
I have pasted the data at Pastebin; the URL is https://pastebin.com/qn9J5nUL
The data comes in both non-tabular and tabular format, and I need to discard the non-tabular data and write only the tabular data into xls.
To be precise, I want to delete the data below:
ABC Command-----UIP BLOCK:;
SE : ABC_UIOP_89TP
Report : +ve ABC_UIOP_89TP 2016-09-23 15:16:14
O&M #998459350
%%/*Web=1571835373:;%%
ID = 0 Result Ok.
and only use data in the format below in the xls (an example, not exact; please refer to the Pastebin URL to see the complete data format):
Local Info ID ID Name ID Frequency ID Data My ID
0 XXX_1 0 12 13
Since your datafile has a certain pattern, I think you can do it this way.
import pandas as pd

starts = []  # indices of lines where a table header ('Local ...') begins
ends = []    # indices of lines where a table ends ('(Number of results ...')

with open('data_to_be_parsed.txt') as f:
    datafile = f.readlines()

for idx, line in enumerate(datafile):
    if 'Local' in line:
        starts.append(idx)
    if '(Number of results' in line:
        ends.append(idx)

frames = []
for s, e in zip(starts, ends):
    # split the header row on whitespace and drop empty fields
    head = [x for x in datafile[s].split(" ") if x.strip()]
    rows = []
    for l_ in range(s + 1, e):
        da = datafile[l_]
        if len(da) > 1:
            data = [x for x in da.split(" ") if x.strip()]
            rows.append(dict(zip(head, data)))
    frames.append(pd.DataFrame(rows, columns=head))

maindf = pd.concat(frames, ignore_index=True)
maindf.to_excel("output.xlsx")
I am extracting data from a large PDF file with regex, using Python in Databricks. The data comes out as one long string, and I am using the string split function to convert it into a pandas DataFrame, since I want the final data as a CSV file. But the line.split command takes about 5 hours to run, and I am looking for ways to optimize this. I am new to Python and I am not sure which part of the code I should look at to reduce this running time.
import os
import re
import PyPDF2
import pandas as pd

for pdf in os.listdir(data_directory):
    # create a file object
    file = open(data_directory + pdf, 'rb')
    # create a PDF reader object
    fileReader = PyPDF2.PdfFileReader(file)
    num_pages = fileReader.numPages
    #print("total pages = " + str(num_pages))
    extracted_string = "start of file"
    current_page = 0
    while current_page < num_pages:
        #print("adding page " + str(current_page) + " to the file")
        extracted_string += fileReader.getPage(current_page).extract_text()
        current_page = current_page + 1
    regex_date = r"\d{2}/\d{2}/\d{4}[^\n]*"
    table_lines = re.findall(regex_date, extracted_string)
The code above gets the data out of the PDF.
# create a dataframe out of the extracted string and load it into a single dataframe
for line in table_lines:
    df = pd.DataFrame([x.split(' ') for x in line.split('\n')])
    df.rename(columns={0: 'date_of_import', 1: 'entry_num', 2: 'warehouse_code_num',
                       3: 'declarant_ref_num', 4: 'declarant_EORI_num', 5: 'VAT_due'},
              inplace=True)
    table = pd.concat([table, df], sort=False)
This part of the code is what is taking up most of the time. I have tried different ways to get a DataFrame out of this data, but the above has worked best for me. I am looking for a faster way to run this code.
PDF file for reference: https://drive.google.com/file/d/1ew3Fw1IjeToBA-KMbTTD_hIINiQm0Bkg/view?usp=share_link
There are two immediate optimization steps in your code.
Pre-compile regexes if they are used many times. It may or may not be relevant here, because I could not guess how many times table_lines = re.findall(regex_date, extracted_string) is executed, but this is often more efficient:
# before any loop
regex_date = re.compile(r"\d{2}/\d{2}/\d{4}[^\n]*")
...
# inside the loop
table_lines = regex_date.findall(extracted_string)
Do not repeatedly append to a DataFrame. A DataFrame is a rather complex container, and appending rows is a costly operation. It is generally much more efficient to build a Python container (a list or dict) first and then convert it to a DataFrame as a whole:
data = [x.split(' ') for line in table_lines for x in line.split('\n')]
table = pd.DataFrame(data, columns=['date_of_import', 'entry_num',
                                    'warehouse_code_num', 'declarant_ref_num',
                                    'declarant_EORI_num', 'VAT_due'])
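Put together, a rough sketch of how the two changes might sit in the original per-PDF loop; the PyPDF2 calls and the data_directory name are taken from the question, the directory value here is a placeholder, and it assumes each matched line splits into exactly the six fields named below:
import os
import re
import PyPDF2
import pandas as pd

data_directory = "/dbfs/path/to/pdfs/"  # placeholder; use the real directory from your notebook

# compiled once, before any loop
regex_date = re.compile(r"\d{2}/\d{2}/\d{4}[^\n]*")

rows = []
for pdf in os.listdir(data_directory):
    with open(data_directory + pdf, 'rb') as file:
        fileReader = PyPDF2.PdfFileReader(file)
        extracted_string = "".join(
            fileReader.getPage(p).extract_text() for p in range(fileReader.numPages)
        )
    # collect plain Python lists; the DataFrame is built once at the end
    rows.extend(line.split(' ') for line in regex_date.findall(extracted_string))

table = pd.DataFrame(rows, columns=['date_of_import', 'entry_num', 'warehouse_code_num',
                                    'declarant_ref_num', 'declarant_EORI_num', 'VAT_due'])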
I have been trying to build a VLOOKUP-style function in Python. I have two files: data, created in Python, which has more than 2000 rows, and comp_data, a CSV file loaded into the system, which has 35 rows. I have to match the Date of the data file with comp_data and load the corresponding Exp_date. The current code gives error 35, and I am not able to understand the problem.
Here is the code:
data['Exp_date'] = datetime.date(2020, 3, 30)
z = 0
for i in range(len(data)):
    if data['Date'][i] == comp_data['Date'][z]:
        data['Exp_date'][i] = comp_data['Exp_date'][z]
    else:
        z = z + 1
One option would be to put your comp_data in a dictionary with your date/exp_date as key/value pairs and let Python do the lookup for you.
data = {"date":["a","b","c","d","e","f"],"exp_date":[0,0,0,0,0,0]}
comp = {"a":10,"d":13}
for i in range(len(data['date'])):
if data['date'][i] in comp:
data['exp_date'][i] = comp[data['date'][i]]
print(data)
There's probably a one-liner way of doing this with iterators!
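Since both tables are pandas objects in the question, the same idea can also be expressed with Series.map; a minimal sketch, assuming data and comp_data are DataFrames with Date and Exp_date columns (the sample values here are made up):
import datetime
import pandas as pd

# made-up frames standing in for the question's data and comp_data
data = pd.DataFrame({"Date": ["a", "b", "c", "d", "e", "f"]})
comp_data = pd.DataFrame({"Date": ["a", "d"], "Exp_date": [10, 13]})

# build a Date -> Exp_date lookup, map it onto data, and fill unmatched rows with a default
lookup = comp_data.set_index("Date")["Exp_date"]
data["Exp_date"] = data["Date"].map(lookup).fillna(datetime.date(2020, 3, 30))
print(data)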
I'm streaming data to a csv file.
This is the request:
symbols=["SPY", "IVV", "SDS", "SH", "SPXL", "SPXS", "SPXU", "SSO", "UPRO", "VOO"]
each symbol has a list of fields, indexed from 0 to 8
this is what it looks like in 3 columns:
-1583353249601,symbol,SH
-1583353249601,delayed,False
-1583353249601,asset-main-type,EQUITY
-1583353250614,symbol,SH
-1583353250614,last-price,24.7952
-1583353250614,bid-size,362
-1583353250614,symbol,VOO
-1583353250614,bid-price,284.79
-1583353250614,bid-size,3
-1583353250614,ask-size,1
-1583353250614,bid-id,N
My end goal is to reshape the data into the layout I need.
The problems that I encountered were not being able to group by timestamp and not being able to pivot.
1) I tried to create a dict so that it could later be passed to pandas, but I am missing data in the process.
I need to find a way to group the data that has the same timestamp; it looks like my code omits lines with the same timestamp.
code:
import csv
import pandas as pd

new_data_dict = {}
with open("stream_data.csv", 'r') as data_file:
    data = csv.DictReader(data_file, delimiter=",")
    for row in data:
        item = new_data_dict.get(row["timestamp"], dict())
        item[row["symbol"]] = row["value"]
        new_data_dict[row["timestamp"]] = item

data = new_data_dict
data = pd.DataFrame.from_dict(data)
data.T
print(data.T)
2) This is another approach. I was able to group by timestamp by creating two different DataFrames, but I cannot split the value column into multiple columns to merge later on matching indexes.
code:
data = pd.read_csv("tasty_hola.csv",sep=',' )
data1 = data.groupby(['timestamp']).apply(lambda v: v['value'].unique())
data = data.groupby(['timestamp']).apply(lambda v: v['symbol'].unique())
data1 = pd.DataFrame({'timestamp':data1.index, 'value':data1.values})
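For reference, a minimal pivot sketch of what approach 2 seems to be aiming for, assuming the CSV columns really are named timestamp, symbol, and value as in the code above (with symbol holding the field name and value its reading):
import pandas as pd

data = pd.read_csv("tasty_hola.csv", sep=',')

# one row per timestamp, one column per field name; 'first' keeps the first value
# when the same timestamp/field pair appears more than once
wide = data.pivot_table(index='timestamp', columns='symbol',
                        values='value', aggfunc='first')
print(wide.head())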
At this moment I don't know if the logic I am trying to apply is the correct one. I am very lost and cannot see the light at the end of the tunnel.
Thank you very much.
I want to read an entire row of data and store it in variables, then later use them in Selenium to write to web elements. The programming language is Python.
Example: I have an Excel sheet of incidents and their details regarding priority, date, assignee, etc.
If I give the string INC00000, it should match the Excel data, fetch all the above details, and store them in separate variables, like
INC # = INC0000   Priority = Moderate   Date = 11/2/2020
Is this feasible? I tried writing code and failed. Please suggest other possible ways to do this.
I would:
load the sheet into a pandas DataFrame,
filter the corresponding column in the DataFrame by the INC # of interest,
convert the row to a dictionary (assuming the INC filter produces only one row),
get the corresponding value from the dictionary to assign to the corresponding web element.
Example:
import pandas as pd

df = pd.read_excel("full_file_path", sheet_name="name_of_sheet")
# assuming the INC numbers are in a column named "INC #" in the spreadsheet
dict_data = df[df['INC #'] == "INC00000"].to_dict("records")[0]
webelement1.send_keys(dict_data[columnname1])
webelement2.send_keys(dict_data[columnname2])
webelement3.send_keys(dict_data[columnname3])
.
.
.
Please find the code below and make changes per your variables after saving your Excel file as CSV:
import csv

# set up the input file
gTrack = open("file1.csv", "r")

# set up the CSV reader and process the header
csvReader = csv.reader(gTrack)
header = next(csvReader)
print(header)

id_index = header.index("id")
date_index = header.index("date ")
var1_index = header.index("var1")
var2_index = header.index("var2")

# make an empty list
cList = []

# loop through the lines in the file and get the required id
for row in csvReader:
    id = row[id_index]
    if id == 'INC001':
        date = row[date_index]
        var1 = row[var1_index]
        var2 = row[var2_index]
        cList.append([id, date, var1, var2])

# print the collected list
print(cList)
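If the end goal is to push those values into Selenium web elements, a rough sketch using the cList built above might look like this (the URL and element IDs are hypothetical placeholders):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/incident-form")  # hypothetical form URL

# cList holds [id, date, var1, var2] rows collected above; take the first match
inc, date, var1, var2 = cList[0]
driver.find_element(By.ID, "incident").send_keys(inc)   # element IDs are assumptions
driver.find_element(By.ID, "date").send_keys(date)
driver.find_element(By.ID, "priority").send_keys(var1)
driver.find_element(By.ID, "assignee").send_keys(var2)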
Thank you for looking.
I am trying my hand at modifying a Python script to download a bunch of data from a website. Given the large amount of data that will be used, I have decided to convert the script to pandas. I have this code so far.
snames = ['Index', 'Node_ID', 'Node', 'Id', 'Name', 'Tag', 'Datatype', 'Engine']
sensorinfo = pd.read_csv(sensorpath, header=None, names=snames, index_col=['Node', 'Index'])

for j in sensorinfo['Node']:
    for z in sensorinfo['Index']:
        # create a string for the url of the data
        data_url = "http://www.mywebsite.com/emoncms/feed/data.json?id=" + sensorinfo['Id'] + "&apikey1f8&start=&end=&dp=600"
        print data_url
        # read in the data from emoncms
        sock = urllib.urlopen(data_url)
        data_str = sock.read()
        sock.close
        # data is output as a string so we convert it to a list of lists
        data_list = eval(data_str)
        myfile = open(feed_list['Name'[k]] + ".csv", 'wb')
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
The first part of the code gives me a very nice table, which means I am opening my CSV data file and importing the information. My question is this:
So I am trying to do this, in pseudocode:
for node in nodes (4 nodes so far)
    for index in indexes
        data_url = websiteinfo + Id + sampleinformation
        smalldata.read.csv(data_url)
        merge(bigdata, smalldata.no_time_column)
This is my first post here, I tried to keep it short but still supply the relevant data. Let me know if I need to clarify anything.
In your pseudocode, you can do this:
dfs = []
for node in nodes (4 nodes so far)
    for index in indexes
        data_url = websiteinfo + Id + sampleinformation
        df = smalldata.read.csv(data_url)
        dfs.append(df)

df = pd.concat(dfs)
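A slightly more concrete sketch of the same pattern, assuming sensorinfo is the table read in the question and that each data_url returns something pandas can parse (the feed is data.json, so pd.read_json may be the right reader; the URL pieces are copied from the question):
import pandas as pd

# sensorinfo as read in the question; sensorpath is the question's CSV path
snames = ['Index', 'Node_ID', 'Node', 'Id', 'Name', 'Tag', 'Datatype', 'Engine']
sensorinfo = pd.read_csv(sensorpath, header=None, names=snames)

# URL pieces copied from the question's code
base_url = "http://www.mywebsite.com/emoncms/feed/data.json?id="
suffix = "&apikey1f8&start=&end=&dp=600"

dfs = []
for sensor_id in sensorinfo['Id']:
    data_url = base_url + str(sensor_id) + suffix
    dfs.append(pd.read_csv(data_url))  # or pd.read_json(data_url) if the feed returns JSON

# concatenate once at the end instead of merging inside the loop
bigdata = pd.concat(dfs, ignore_index=True)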