Hello, and thank you for looking.
I am trying my hand at modifying a Python script to download a bunch of data from a website. Given how large the data set will be, I have decided to convert the script to pandas. This is the code I have so far:
import csv
import json
import urllib
import pandas as pd

snames = ['Index', 'Node_ID', 'Node', 'Id', 'Name', 'Tag', 'Datatype', 'Engine']
sensorinfo = pd.read_csv(sensorpath, header=None, names=snames, index_col=['Node', 'Index'])

# 'Node' and 'Index' are index levels here, so walk the rows directly
# instead of nesting loops over the two columns
for (node, index), row in sensorinfo.iterrows():
    # create a string for the url of the data
    data_url = ("http://www.mywebsite.com/emoncms/feed/data.json?id="
                + str(row['Id']) + "&apikey1f8&start=&end=&dp=600")
    print data_url
    # read in the data from emoncms
    sock = urllib.urlopen(data_url)
    data_str = sock.read()
    sock.close()
    # the feed returns a JSON string, so parse it into a list of lists
    # (json.loads is safer than eval for text from the network)
    data_list = json.loads(data_str)
    myfile = open(row['Name'] + ".csv", 'wb')
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
The first part of the code gives me a very nice table, which means I am successfully opening my csv file and importing its information. My question is this:
So I am trying to do this in pseudocode:

for node in nodes (4 nodes so far)
    for index in indexes
        data_url = websiteinfo + Id + sampleinformation
        smalldata.read.csv(data_url)
        merge(bigdata, smalldata.no_time_column)
This is my first post here; I tried to keep it short while still supplying the relevant data. Let me know if I need to clarify anything.
In your pseudocode, you can do this:

dfs = []
for node in nodes (4 nodes so far)
    for index in indexes
        data_url = websiteinfo + Id + sampleinformation
        df = smalldata.read.csv(data_url)
        dfs.append(df)
df = pd.concat(dfs)
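For a concrete sketch of that pattern: pd.read_csv can read straight from a URL, so each feed can be pulled and collected in a list, with one concat at the end. This assumes the per-feed URL is built the way the question builds it, and that the feed returns something read_csv can parse (the emoncms feeds here actually return JSON, so pd.read_json may be the better reader):

import pandas as pd

dfs = []
for sensor_id in sensorinfo['Id']:
    # build the per-feed url the same way the question does
    url = ("http://www.mywebsite.com/emoncms/feed/data.json?id="
           + str(sensor_id) + "&apikey1f8&start=&end=&dp=600")
    dfs.append(pd.read_csv(url))  # collect; don't merge inside the loop
bigdata = pd.concat(dfs, ignore_index=True)  # one concat at the end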
I have a script that I use to fire orders from a csv file to an exchange, using a for loop.
data = pd.read_csv('orderparameters.csv')
df = pd.DataFrame(data)

for i in range(len(df)):
    order = Client.new_order(...
                             ...)
    file = open('orderData.txt', 'a')
    original_stdout = sys.stdout
    with file as f:
        sys.stdout = f
        print(order)
    file.close()
    sys.stdout = original_stdout
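As an aside, the stdout swap isn't needed just to write to a file, since print accepts a file argument. A minimal equivalent of that logging step (same orderData.txt target) would be:

with open('orderData.txt', 'a') as f:
    print(order, file=f)  # writes the response without touching sys.stdout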
I put the response from the exchange in a txt file like this...
I want to turn the multiple responses into one single dataframe. I would hope it would look something like...
(I did that manually).
I tried:
data = pd.read_csv('orderData.txt', header=None)
dfData = pd.DataFrame(data)
print(dfData)
but I got:
I have also tried:
data = pd.read_csv('orderData.txt', header=None)
organised = data.apply(pd.Series)
print(organised)
but I got the same output.
I can print order['symbol'] within the loop etc.
I'm not certain whether I should be populating this dataframe within the loop, or by capturing and writing the responses and processing them afterwards. I'd appreciate your advice.
It looks like you are getting JSON strings back. You could parse each one into a dictionary and then create a dataframe from the collected dictionaries. Perhaps try something like this (it no longer needs a file):
import json

data = pd.read_csv('orderparameters.csv')
df = pd.DataFrame(data)

response_data = []
for i in range(len(df)):
    order_json = Client.new_order(...
                                  ...)
    response_data.append(json.loads(order_json))  # parse each JSON string into a dict
response_dataframe = pd.DataFrame(response_data)
If I understand your question correctly, you can simply do the following:
import pandas as pd
orders = pd.read_csv('orderparameters.csv')
responses = pd.DataFrame([Client.new_order(...) for _ in range(len(orders))])
So this is kind of weird, but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, each as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to Excel just overwrites the data in the file; I want to append to the end of the .xlsx file. The biggest catch to all of this is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, so I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
wb.save(outfile_path)  # without an explicit save, the appended rows never reach disk
Note: in another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am, however, committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will include the entire program below so I can hopefully explain what's in my head. Also, feel free to roast my code, because I have no idea what good Python practice looks like.
import os
import pandas as pd
from openpyxl import load_workbook

#the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
#the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
#the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

#establishing the lists that I will store looped data into
file_list = []
main_df = []
master_list = []

#open the file path to store the directory in files
files = os.listdir(in_path)

#database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

#searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

#read in the files to a dataframe, main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    #get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    #lists to store where headers are located in the dataframe
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                #main compare, if str and matches search params, then do...
                #(insensitive_compare is a case-insensitive matching helper defined elsewhere)
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name])  #store data headers
                        row_list.append(number)  #store row number where it is in that data frame
                        column_list.append(name)  #store column number where it is in that data frame
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    #turns the dataframe into a set of booleans that are True wherever
    #there's something there
    na_finder = df.notna()
    #create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            #I turn the dataframe into booleans and read until False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                #store the actual data in my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into excel like:
So the comment from Ashish really helped me: all of the dataframes had different column titles, so my 100+ dataframes eventually concat'd into a single dataframe that is 569x52. Here is the code that I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together, I just had to export the result using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
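A design note on the loop above: pd.concat accepts the whole list at once, so the growing-frame loop can be collapsed into a single call. A minimal equivalent, assuming main_df still holds the per-file frames:

to_xlsx_df = pd.concat(main_df)  # concatenate every collected frame in one call
to_xlsx_df.to_excel(outfile_path)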
The output to excel ended up looking something like this:
Hopefully this can help someone else out too.
I have a complex flat file with a large amount of mixed-type data. I am trying to parse it with Python (the language I know best) and have succeeded in segregating the data by category using manual parsing.
Now I am stuck at the point where I have the extracted data and need to make it tabular so that I can write it to xls, using pandas or any other library.
I have pasted the data at pastebin: https://pastebin.com/qn9J5nUL
The data comes in both non-tabular and tabular format, and I need to discard the non-tabular data and only write the tabular data into xls.
To be precise, I want to discard the data below:
ABC Command-----UIP BLOCK:;
SE : ABC_UIOP_89TP
Report : +ve ABC_UIOP_89TP 2016-09-23 15:16:14
O&M #998459350
%%/*Web=1571835373:;%%
ID = 0 Result Ok.
and only write data in the below format into xls (an example, not exact; please refer to the pastebin url to see the complete data format):
Local Info ID ID Name ID Frequency ID Data My ID
0 XXX_1 0 12 13
Since your datafile has a consistent pattern, I think you can do it this way:
import pandas as pd

s = []
e = []
with open('data_to_be_parsed.txt') as f:
    datafile = f.readlines()

# find where each table starts ('Local ...' header line) and ends
for idx, line in enumerate(datafile):
    if 'Local' in line:
        s.append(idx)
    if '(Number of results' in line:
        e.append(idx)

maindf = pd.DataFrame()
for i in range(len(s)):
    head = list(datafile[s[i]].split(" "))
    head = [x for x in head if x.strip()]
    tmpdf = pd.DataFrame(columns=head)
    for l_ in range(s[i] + 1, e[i]):
        da = datafile[l_]
        if len(da) > 1:
            data = list(da.split(" "))
            data = [x for x in data if x.strip()]
            tmpdf = tmpdf.append(dict(zip(head, data)), ignore_index=True)
    maindf = pd.concat([maindf, tmpdf])

maindf.to_excel("output.xlsx")
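One caveat: DataFrame.append, used above, was deprecated in pandas 1.4 and removed in 2.0. On current pandas, a sketch of the same loop that collects plain dicts and builds each frame in one step, under the same assumptions about the 'Local' header lines and '(Number of results' terminator lines:

import pandas as pd

frames = []
for i in range(len(s)):
    # header row for this table
    head = [x for x in datafile[s[i]].split(" ") if x.strip()]
    rows = []
    for l_ in range(s[i] + 1, e[i]):
        fields = [x for x in datafile[l_].split(" ") if x.strip()]
        if fields:
            rows.append(dict(zip(head, fields)))
    frames.append(pd.DataFrame(rows, columns=head))

maindf = pd.concat(frames, ignore_index=True)
maindf.to_excel("output.xlsx")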
I am trying to download a number of .csv files, which I convert to pandas dataframes and append to each other.
The csv files can be accessed via urls which are created each day; using datetime they can be easily generated and put in a list.
I am able to open these individually from the list.
When I try to open a number of them and append them together, I get an empty dataframe. The code looks like this:
#Imports
import datetime
import pandas as pd

#Testing can open .csv file
data = pd.read_csv('https://promo.betfair.com/betfairsp/prices/dwbfpricesukwin01022018.csv')
data.iloc[:5]

#Taking heading to use to create new dataframe
data_headings = list(data.columns.values)

#Setting up string for url
path_start = 'https://promo.betfair.com/betfairsp/prices/dwbfpricesukwin'
file = ".csv"

#Getting dates which are used in url
start = datetime.datetime.strptime("01-02-2018", "%d-%m-%Y")
end = datetime.datetime.strptime("04-02-2018", "%d-%m-%Y")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)]

#Creating new dataframe which is appended to
for heading in data_headings:
    data = {heading: []}
df = pd.DataFrame(data, columns=data_headings)

#Creating list of url
date_list = []
for date in date_generated:
    date_string = date.strftime("%d%m%Y")
    x = path_start + date_string + file
    date_list.append(x)

#Opening and appending csv files from list which contains url
for full_path in date_list:
    data_link = pd.read_csv(full_path)
    df.append(data_link)
print(df)
I have checked whether they are just empty csv files, but they are not. Any help would be appreciated.
Cheers,
Sandy
You are never storing the appended dataframe. The line:
df.append(data_link)
Should be
df = df.append(data_link)
However, this may be the wrong approach. You really want to use the array of URLs and concatenate them. Check out this similar question and see if it can improve your code!
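For reference, a minimal sketch of that concat approach, reusing the date_list of urls built in the question:

import pandas as pd

# read every daily file, then combine them in a single call
frames = [pd.read_csv(full_path) for full_path in date_list]
df = pd.concat(frames, ignore_index=True)
print(df)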
I really can't understand what you wanted to do here:
#Creating new dataframe which is appended to
for heading in data_headings:
    data = {heading: []}
df = pd.DataFrame(data, columns=data_headings)
By the way, try this:
for full_path in date_list:
    data_link = pd.read_csv(full_path)
    df = df.append(data_link.copy())
Hello, I am trying to read in multiple files, create a dataframe of the specific key information I need, and then append each file's dataframe to a main dataframe called topics. I have tried the following code.
import pandas as pd
import numpy as np
from lxml import etree
import os

topics = pd.DataFrame()
for filename in os.listdir('./topics'):
    if not filename.startswith('.'):
        #print(filename)
        tree = etree.parse('./topics/' + filename)
        root = tree.getroot()
        childA = []
        elementT = []
        ElementA = []
        for child in root:
            elementT.append(str(child.tag))
            ElementA.append(str(child.attrib))
            childA.append(str(child.attrib))
            for element in child:
                elementT.append(str(element.tag))
                #childA.append(child.attrib)
                ElementA.append(str(element.attrib))
                childA.append(str(child.attrib))
                for sub in element:
                    #print('***', child.attrib, ':', element.tag, ':', element.attrib, '***')
                    #childA.append(child.attrib)
                    elementT.append(str(sub.tag))
                    ElementA.append(str(sub.attrib))
                    childA.append(str(child.attrib))
        df = pd.DataFrame()
        df['c'] = np.array(childA)
        df['t'] = np.array(ElementA)
        df['a'] = np.array(elementT)
        file = df['t'].str.extract(r'([A-Z][A-Z].*[words.xml])#')
        start = df['t'].str.extract(r'words([0-9]+)')
        stop = df['t'].str.extract(r'.*words([0-9]+)')
        tags = df['a'].str.extract(r'.*([topic]|[pointer]|[child])')
        rootTopic = df['c'].str.extract(r'rdhillon.(\d+)')
        df['f'] = file
        df['start'] = start
        df['stop'] = stop
        df['tags'] = tags
        # c = topic
        # r = pointer
        # d = child
        df['topicID'] = rootTopic
        df = df.iloc[:, 3:]
        topics.append(df)
However, when I call topics I get the following output:
topics
Out[19]:
Empty DataFrame
Columns: []
Index: []
Can someone please let me know where I am going wrong? Also, any suggestions on improving my messy code would be appreciated.
Unlike lists, appending to a DataFrame returns a new object. So topics.append(df) returns an object that you never store anywhere, and topics remains the empty DataFrame you declared on the 6th line. You can fix this with:
topics = topics.append(df)
However, appending to a DataFrame within a loop is a very costly exercise. Instead you should append each DataFrame to a list within the loop and call pd.concat() on the list of DataFrames after the loop.
import pandas as pd

topics_list = []
for filename in os.listdir('./topics'):
    # All of your code
    topics_list.append(df)  # Lists are modified in place by append

# After the loop, one call to concat
topics = pd.concat(topics_list)