I am trying to write a Python script that reads a set of URLs from a CSV file and downloads the content at each URL. I am using pandas to read the CSV file, so the data is stored in a DataFrame. Now I want to pass the values in the DataFrame (the URLs) one by one as an argument to a function that uses the GET method to go to that URL and download the file. I am stuck on how to pass the values stored in a DataFrame to a function in a loop. Any help or alternative methods are appreciated. Thanks in advance.
Note: The data frame holds around 500 URLs.
Edit: I am using url = pd.read_csv(file_name, usecols=[26]) to read data.
My question is how to pass the values in url to a function in a loop.
Not sure I understand your question, but maybe this is an answer to it:
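import pandas as pd

d = {'URL': ['URL1', 'URL2', 'URL3', 'URL4', 'URL5']}
df = pd.DataFrame(data=d)
for k in range(len(df)):
    url = df.at[k, 'URL']  # value in row k of the 'URL' column
    out = do_something_with_url(url)  # placeholder for your download function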
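Iterating over the column directly avoids the index lookup. A minimal sketch, assuming the column is named 'URL'; download is a hypothetical stand-in for the function that would issue the GET request, stubbed here so the loop runs without network access:

```python
import pandas as pd

def download(url):
    # Hypothetical stand-in: a real version might call requests.get(url)
    # and write response.content to a file.
    return f"downloaded {url}"

df = pd.DataFrame({"URL": ["URL1", "URL2", "URL3"]})

# Iterating over a column (a Series) yields its values one by one.
results = [download(url) for url in df["URL"]]
print(results)
```

With pd.read_csv(file_name, usecols=[26]) the result is a one-column DataFrame, so something like url.iloc[:, 0] should give the Series to loop over.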
I’m new to Python and Jupyter. I have an API which I get my data from. Using a loop, I have located the child node containing the list of data I want. Now I want to put that data into a pandas DataFrame. Could someone please help me with this? You can see my code below:
import json
import requests

resp = requests.get('http://***',
                    auth=('***', '***'),
                    headers={'Accept': 'application/json'})
data = json.loads(resp.text)
for Observasjoner in data['Holdings']:
    display(Observasjoner)
Just extract the data from the JSON and append it to lists, then create a data frame from those lists.
import requests
data = requests.get("form_link")
print(data.text)  # prints the raw response text; use print(data.json()) for parsed JSON
Now search for the data you need, or use Beautiful Soup if the source is an HTML website.
If it is JSON, the parsed result behaves like a dictionary, so the same concepts apply. Say data now holds the parsed dictionary:
print(data["key"])  # prints the value stored under "key"; iterate over the whole dictionary (JSON file) the same way
Now use the dictionary concept and append the values of each key to lists.
The keys become the columns and the values become the rows; create a data frame from them.
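The steps above can be sketched roughly as follows. The sample dictionary and its "Name"/"Value" keys are assumptions made up for illustration; the real key names depend on what the API returns:

```python
import pandas as pd

# Assumed sample of a parsed response; the real key names depend on the API.
data = {
    "Holdings": [
        {"Name": "A", "Value": 10},
        {"Name": "B", "Value": 20},
    ]
}

# Append the values of each key to its own list.
names, values = [], []
for item in data["Holdings"]:
    names.append(item["Name"])
    values.append(item["Value"])

# Keys become the columns, list entries become the rows.
df = pd.DataFrame({"Name": names, "Value": values})
print(df)
```

If every item has the same keys, pd.DataFrame(data["Holdings"]) builds the same frame directly from the list of dictionaries.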
Thanks
I know similar questions have been posted before, but I haven't found anything that works for this case. I hope you can help.
Here is a summary of the issue:
I'm writing web scraping code using Selenium (for an assignment).
The code uses a for loop to go from one page to another.
The output for each page is a DataFrame (basically a table) that is exported to Excel.
The DataFrames from all the web pages should be captured in a single Excel sheet (not multiple sheets within the Excel file).
Each web page has the same data format (i.e. the number of columns and the column headers are the same, but the row values vary).
For info, I'm using pandas because it helps convert the output from the website to Excel.
The problem I'm facing is that when the DataFrame is exported to Excel, it overwrites the data from the previous iteration. Hence, when I run the code and scraping completes, I only get the data from the last iteration of the for loop.
Please advise which line(s) of code I need to add so that all iterations are captured in the Excel sheet. More specifically, each iteration should export its data to Excel starting from the first empty row.
Here is an extract from the code:
for i in range(50, 60):
    url = urlA + str(i)  # this is the url generator; urlA is the main link excluding pagination
    driver.get(url)
    time.sleep(random.randint(3, 7))
    text = driver.find_element_by_xpath('/html/body/pre').text
    data = pd.DataFrame(eval(text))
    export_excel = data.to_excel(xlpath)
Thanks Dijkgraaf. Your proposal worked.
Here is the full code for others (for future reference).
Apologies for the font; I couldn't set it properly. Anyway, I hope the code below is of some use to someone in the future.
import pandas as pd

xlpath = "c:/projects/excelfile.xlsx"
df = pd.DataFrame()  # create an empty data frame before the for loop starts
urlA = "www.your website.com"  # placeholder for the main link, excluding pagination
for i in range(1, 10):
    url = urlA + str(i)  # url generator for pagination (to loop through the pages)
    driver.get(url)  # driver is an already-created Selenium WebDriver
    text = driver.find_element_by_xpath('/html/body/pre').text  # gets text from the site
    data = pd.DataFrame(eval(text))  # evaluates the extracted text and converts it to a pandas DataFrame
    df = pd.concat([df, data])  # adds the new data to df (DataFrame.append was removed in pandas 2.0)
export_excel = df.to_excel(xlpath)  # exports the consolidated data frame (df) to Excel
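A variation that may be worth considering is to collect each page's DataFrame in a list and concatenate once after the loop. This is only a sketch: scrape_page is a hypothetical stand-in for the Selenium step, and its columns are invented so the example runs without a browser:

```python
import pandas as pd

def scrape_page(i):
    # Hypothetical stand-in for the Selenium step: returns one page's table.
    return pd.DataFrame({"page": [i, i], "value": [i * 10, i * 10 + 1]})

# One DataFrame per page, kept in a plain list.
frames = [scrape_page(i) for i in range(1, 4)]

# A single concat at the end; ignore_index renumbers the rows 0..n-1.
df = pd.concat(frames, ignore_index=True)
print(df)

# df.to_excel("excelfile.xlsx", index=False)  # single write after the loop
```

Writing to Excel once, after all pages are combined, is what keeps earlier pages from being overwritten.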
Hello, and thank you for taking the time to read this.
I am looking to extract company information from a particular stock exchange and then save this information to a pandas DataFrame.
Each firm has its own webpage, determined by the "KodeEmiten" ending. These codes are saved in a column of the first DataFrame:
df = pd.DataFrame.from_dict(data['data'])
Now my goal is to use these codes to call each company's website individually and create a JSON file for each:
for i in range(len(df)):
    requests.get(f'https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfilesDetail?emitenType=&kodeEmiten={df.loc[i, "KodeEmiten"]}').json()
While this works, I can't save the results to a new DataFrame because of "list index out of range" and incorrect keyword errors. There is significantly more information in the XHR response than I actually need, and I believe the differing structures cause the errors when I try to save them to a new DataFrame. I'm really only interested in the data under these XHR headers:
AnakPerusahaan:, Direktur:, Komisaris, PemegangSaham:
So my question is kind of two-in-one:
a) How can I extract just the information from those specific XHR headers (all of them are tables)?
b) How can I save those to a new DataFrame (or even a list, I don't really mind)?
import requests
import pandas as pd
import json
import time
# gets broad data of main page of the stock exchange
sxow = requests.get('https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfiles?draw=1&columns%5B0%5D%5Bdata%5D=KodeEmiten&columns%5B0%5D%5Bname%5D&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=KodeEmiten&columns%5B1%5D%5Bname%5D&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=NamaEmiten&columns%5B2%5D%5Bname%5D&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=false&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=TanggalPencatatan&columns%5B3%5D%5Bname%5D&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=false&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=700&search%5Bvalue%5D&search%5Bregex%5D=false&_=155082600847')
data = sxow.json()  # parse the JSON response into Python objects
df = pd.DataFrame.from_dict(data['data'])  # creates a DataFrame from the parsed JSON data
# add: compare file contents and overwrite original if same
cdate = time.strftime("%Y%m%d")  # string variable with the current date: year|month|day
df.to_excel(f"{cdate}StockExchange_Overview.xlsx")  # converts DataFrame to Excel file, can't overwrite existing file
for i in range(len(df)):
    requests.get(f'https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfilesDetail?emitenType=&kodeEmiten={df.loc[i, "KodeEmiten"]}').json()
    # This is where I'm completely stuck
You don't need to convert the result to a DataFrame. You can just loop through the JSON object and append each code to the URL to get the other companies' details.
Follow the code below:
import requests
import pandas as pd
import json
import time
# gets broad data of main page of the stock exchange
sxow = requests.get('https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfiles?draw=1&columns%5B0%5D%5Bdata%5D=KodeEmiten&columns%5B0%5D%5Bname%5D&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=KodeEmiten&columns%5B1%5D%5Bname%5D&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=NamaEmiten&columns%5B2%5D%5Bname%5D&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=false&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=TanggalPencatatan&columns%5B3%5D%5Bname%5D&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=false&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=700&search%5Bvalue%5D&search%5Bregex%5D=false&_=155082600847')
data = sxow.json()  # parse the JSON response into Python objects
list_of_json = []
for nested_json in data['data']:
    list_of_json.append(requests.get('https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfilesDetail?emitenType=&kodeEmiten='+nested_json['KodeEmiten']).json())
    time.sleep(1)
The list_of_json will contain all the JSON results you requested.
Here nested_json is the loop variable used to loop through the array of JSON objects, one per KodeEmiten.
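To keep only the four tables named in the question, something along these lines might work. It assumes each detail response is a dictionary with those names as top-level keys holding lists of records; the sample response and its "Nama" field are invented for illustration:

```python
# Keys named in the question; the sample response below is invented for illustration.
WANTED = ["AnakPerusahaan", "Direktur", "Komisaris", "PemegangSaham"]

def pick_tables(detail):
    # .get with a default avoids a KeyError when a company lacks one of the tables.
    return {key: detail.get(key, []) for key in WANTED}

sample_detail = {
    "Direktur": [{"Nama": "X"}],
    "Komisaris": [],
    "Other": "ignored",
}

picked = pick_tables(sample_detail)
print(picked)
```

Applied to the real data this would be [pick_tables(d) for d in list_of_json], and each table (a list of records) could then be passed to pd.DataFrame if a frame is preferred over a list.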
This is a slight improvement on @bigbounty's approach:
Since the aim is to save the information to a list and then use that list further in the script, a list comprehension is actually a tad faster.
i.e.
# url is the GetCompanyProfilesDetail address ending in "kodeEmiten="
list_of_json = [requests.get(url + nested_json["KodeEmiten"]).json() for nested_json in data["data"]]
I have an Excel file that has an 'ID' key and an API call URL.
The Excel file looks like this:
And the URL is this:
http://api.*******.com/2.0/location/{property_id}?key=123456ASDFG
The result of the API call is in JSON format.
I would like to substitute each property_id from the Excel file into the URL and save the result to a CSV, one row per ID.
What I did so far is:
import requests
import json
url = "http://api.*******.com/2.0/location/{property_id}?key=123456ASDFG"
response = requests.get(url)
data = response.text
print(data)
The result is basically the same as what I get when I put the URL in the Chrome browser.
I somehow have to read each row in Excel column A and substitute its value for {property_id} in the URL,
then append the result to the CSV as the row number increases.
I'm very new to APIs and have no idea where to start.
I tried to find similar questions on Stack Overflow and could not find any (maybe wrong keywords?).
Any help is appreciated.
Thanks
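One possible shape for the loop described above, as a rough sketch only. The URL template host, the fetch stub, the IDs, and the "status" field are all assumptions made up so the example runs offline; a real version would call requests.get(url).json() and read the IDs from the Excel column:

```python
import csv
import io

# Hypothetical template; {property_id} is filled in per row.
URL_TEMPLATE = "http://api.example.com/2.0/location/{property_id}?key=KEY"

def fetch(url):
    # Hypothetical stand-in for requests.get(url).json(), so the sketch runs offline.
    return {"url": url, "status": "ok"}

property_ids = [101, 102, 103]  # in practice, e.g. read from Excel column A

out = io.StringIO()  # stands in for open("results.csv", "w", newline="")
writer = csv.writer(out)
writer.writerow(["property_id", "status"])
for pid in property_ids:
    result = fetch(URL_TEMPLATE.format(property_id=pid))
    writer.writerow([pid, result["status"]])  # one CSV row per ID

print(out.getvalue())
```

str.format with a named field matches the {property_id} placeholder already present in the URL, so the original URL string can be used as the template unchanged.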
I am pulling in info from an API. The returned data is in JSON format. I have to iterate through and get the same data for multiple inputs. I want to save the JSON data for each input in a python dictionary for easy access. This is what I have so far:
import pandas
import requests
ddict = {}
read_input = pandas.read_csv('input.csv')
for d in read_input.values:
    print(d)
    url = "https://api.xyz.com/v11/api.json?KEY=123&LOOKUP={}".format(d)
    response = requests.get(url)
    data = response.json()
    ddict[d] = data
df = pandas.DataFrame.from_dict(ddict, orient='index')
with pandas.ExcelWriter('output.xlsx') as w:
    df.to_excel(w, 'output')
With the above code, I get the following output:
a.com
I also get an Excel output with the data only from this first line. My input CSV file has close to 400 rows, so I should be seeing more than one line in the printed output and in the output Excel file.
If you have a better way of doing this, that would be appreciated. In addition, the excel output I get is very hard to understand. I want to read the JSON data using dictionaries and subdictionaries but I don't completely understand the format of the underlying data - I think it looks closest to a JSON array.
I have looked at numerous other posts including Parsing values from a JSON file using Python? and How do I write JSON data to a file in Python? and Converting JSON String to Dictionary Not List and How do I save results of a "for" loop into a single variable? but none of the techniques have worked so far. I would prefer not to pickle, if possible.
I'm new to Python so any help is appreciated!
I'm not going to address your challenges with JSON here as I'll need more information on the issues you're facing. However, with respect to reading from CSV using Pandas, here's a great resource: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html.
Now, your output is being read the way it is because a.com is being considered the header (undesirable). Your read statement should be:
read_input = pandas.read_csv('input.csv', header=None)
Now, read_input is a DataFrame (documentation). So, what you're really looking for is the values in the first column. You can easily get an array of values by read_input.values. This gives you a separate array for each row. So your for loop would be:
for d in read_input.values:
    print(d[0])
    get_info(d[0])
For JSON, I'd need to see a sample structure and your desired way of storing it.
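The effect of header=None can be checked on a small in-memory file. This is just a sketch; the three-line sample input is an assumption standing in for input.csv:

```python
import io
import pandas as pd

csv_text = "a.com\nb.com\nc.com\n"  # assumed sample input with no header row

# Default: the first line is consumed as the column name.
with_header = pd.read_csv(io.StringIO(csv_text))
# header=None: every line is kept as data, and the column is labelled 0.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(list(with_header.columns), len(with_header))
print(list(no_header.columns), len(no_header))

# Each d is a one-element array, so d[0] is the value itself.
for d in no_header.values:
    print(d[0])
```

The first read treats a.com as the column name and leaves two rows, while the second keeps all three rows.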
I think there is an awkwardness in your program.
Try with this:
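import pandas
import requests

ddict = {}
read_input = pandas.read_csv('input.csv')
for d in read_input.values:
    lookup = d[0]  # each d is a one-element array; take the value itself
    url = "https://api.xyz.com/v11/api.json?KEY=123&LOOKUP={}".format(lookup)
    response = requests.get(url)
    data = response.json()
    ddict[lookup] = data  # an array cannot be a dictionary key, the value can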
Edit: iterate over read_input.values.