Python - Selecting Specific Results to Place in Excel

EDIT: I'm making a lot of headway; I am now trying to parse out single columns from my JSON file rather than the entire row. However, I get an error whenever I try to manipulate my DataFrame to get the results I want.
The error is:
line 52, in <module>
    df = pd.DataFrame.from_dict(mlbJson['stats_sortable_player']['queryResults']['name_display_first_last'])
KeyError: 'name_display_first_last'
It only happens when I try to add another parameter; for instance, I took out ['row'] and added ['name_display_first_last'] to get the first and last name of each player. If I leave in ['row'] it runs, but gives me all the data, and I only want certain snippets.
Any help would be greatly appreciated! Thanks.
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Scraping data from MLB.com
target = 'http://mlb.mlb.com/pubajax/wf/flow/stats.splayer?season=2016&sort_order=%27desc%27&sort_column=%27avg%27&stat_type=hitting&page_type=SortablePlayer&game_type=%27R%27&player_pool=ALL&season_type=ANY&sport_code=%27mlb%27&results=1000&recSP=1&recPP=50'  # the "MLB JSON" link from the original post
mlbResponse = requests.get(target)
mlbJson = mlbResponse.json()  # placing the response in a variable
# Collecting data and giving it names in pandas
# ('team' and 'line' are built by earlier scraping code omitted from the post)
data = {'Team': team, 'Line': line}
# places data table format, frames the data
table = pd.DataFrame(data)
# Creates excel file named Scrape
writer = pd.ExcelWriter('Scrape.xlsx')
# Moves table to excel taking in Parameters , 'Name of DOC and Sheet on that Doc'
table.to_excel(writer, 'Lines 1')
#stats = {'Name': name, 'Games': games, 'AtBats': ab, 'Runs': runs, 'Hits': hits, 'Doubles': doubles, 'Triples': triples, 'HR': hr, 'RBI': rbi, 'Walks': walks, 'SB': sb}
df = pd.DataFrame.from_dict(mlbJson['stats_sortable_player']['queryResults']['row'])
df.to_excel(writer, 'Batting 2')
# Saves File
writer.save()

It looks like the website loads the data asynchronously through another request to a different URL. The response you're getting has an empty <datagrid></datagrid> tag, so soup2.select_one("#datagrid").find_next("table") returns None.
You can use the developer tools in your browser, under the Network tab, to find the URL that actually loads the data. It looks like:
http://mlb.mlb.com/pubajax/wf/flow/stats.splayer?season=2016&sort_order=%27desc%27&sort_column=%27avg%27&stat_type=hitting&page_type=SortablePlayer&game_type=%27R%27&player_pool=ALL&season_type=ANY&sport_code=%27mlb%27&results=1000&recSP=1&recPP=50
You can modify your code to make a request to this URL, which returns JSON:
mlbResponse = requests.get(url)
mlbJson = mlbResponse.json() # python 3, otherwise use json.loads(mlbResponse.content)
df = pd.DataFrame(mlbJson['stats_sortable_player']['queryResults']['row'])
The DataFrame has 54 columns, so I can't display it here, but you should be able to pick and rename the columns you need.
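For example, a minimal sketch of picking a few columns: 'name_display_first_last' comes straight from your KeyError, while the other keys ('g', 'ab', 'hr', 'rbi') are assumptions about this feed's row keys, so verify them against df.columns first:
# assumed column keys; check print(df.columns) to confirm
cols = ['name_display_first_last', 'g', 'ab', 'hr', 'rbi']
batting = df[cols].rename(columns={'name_display_first_last': 'Name', 'g': 'Games',
                                   'ab': 'AtBats', 'hr': 'HR', 'rbi': 'RBI'})
batting.to_excel(writer, 'Batting 2')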

Related

Making API call with Requests in Python returns only one item instead of many

A problem with likely a very easy fix, yet unfortunately I'm new to this.
The problem is: my generated csv file includes data from only one URL, while I want it from all of them.
I've made a list of many contract numbers and I'm trying to access them all and return their data into one csv file (one long list). The API's URL consists of a base URL and a contract number plus some parameters, so my URLs look like this (showing 2 of 150):
https://api.nfz.gov.pl/app-umw-api/agreements/edc47a7d-a3b8-d354-79d5-a0518f8ba6d4?format=json&api-version=1.2&limit=25&page={}
https://api.nfz.gov.pl/app-umw-api/agreements/9a6d9313-c9cc-c0db-9c86-b7b4be0e11c1?format=json&api-version=1.2&limit=25&page={}
The publisher has imposed a limit of 25 records per page, so I've got some pagination going on here.
It seems the program is making calls to each URL in turn, given that it printed the number of pages from each call. But the csv has only 4 rows instead of hundreds, and I'm wondering where I'm going wrong. I tried deleting the indent on the last 3 lines (no change) and other trial and error.
Another small question: the 4 rows are actually 2 rows, duplicated. I think my code duplicates the first page of results somewhere, but I can't figure out where.
And another one: how can I make the first column of the csv file show the 'contract' (from my list 'contracts') that the output relates to? I need some way of identifying which rows in the csv came from which contract, while the API keeps that info in a separate branch of the data 'tree' that I don't really know how to return efficiently.
import requests
import pandas as pd
import math
from contracts_list1 import contracts

baseurl = 'https://api.nfz.gov.pl/app-umw-api/agreements/'

for contract in contracts:
    api_url = ''.join([baseurl, contract])

    def main_request(api_url):
        r = requests.get(api_url)
        return r.json()

    def get_pages(response):
        return math.ceil(response['meta']['count'] / 25)

    p_number = main_request(api_url)

    all_data = []
    for page in range(0, get_pages(p_number)+1): # <-- increase page numbers here
        data = requests.get(api_url.format(page)).json()
        for a in data["data"]["plans"]:
            all_data.append({**a["attributes"]})

    df = pd.DataFrame(all_data)
    df.to_csv('file1.csv', encoding='utf-8-sig', index=False)
    print(get_pages(p_number))
Your accumulator all_data is inside the contracts loop, so each iteration overwrites the previous iteration's result. That's why you're only seeing the result of the last iteration instead of all of them.
Try putting your accumulator all_data = [] outside of your outer for loop:
import requests
import pandas as pd
import math
from contracts_list1 import contracts

baseurl = 'https://api.nfz.gov.pl/app-umw-api/agreements/'
all_data = []

for contract in contracts:
    api_url = ''.join([baseurl, contract])

    def main_request(api_url):
        r = requests.get(api_url)
        return r.json()

    def get_pages(response):
        return math.ceil(response['meta']['count'] / 25)

    p_number = main_request(api_url)
    for page in range(0, get_pages(p_number)+1): # <-- increase page numbers here
        data = requests.get(api_url.format(page)).json()
        for a in data["data"]["plans"]:
            all_data.append({**a["attributes"]})

df = pd.DataFrame(all_data)
df.to_csv('file1.csv', encoding='utf-8-sig', index=False)
print(get_pages(p_number))
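Your two smaller questions can be handled in the same inner loop. As a hedged sketch: starting the page range at 1 should remove the duplicated first page (page=0 and page=1 most likely return the same data from this API), and merging the contract id into each row dict gives you a column identifying the source contract:
for page in range(1, get_pages(p_number) + 1):
    data = requests.get(api_url.format(page)).json()
    for a in data["data"]["plans"]:
        # prepend the contract id so each csv row records where it came from
        all_data.append({'contract': contract, **a["attributes"]})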

How to preserve complicated excel header formats when manipulating data using Pandas Python?

I am parsing a large Excel data file into another one, but the headers are very abnormal. I tried to use read_excel's skiprows and that did not work. I also tried to include the header in
df = pd.read_excel(user_input, header=[1:3], sheet_name='PN Projection'), but then I get the error "ValueError: cannot join with no overlapping index names." To get around this I tried to name the columns by location, and that did not work either.
When I run the code shown below everything works fine, but past column "U" the header titles come out as "unnamed1, 2, ...". I understand this is because pandas is treating the first row as the header (and those cells are empty), but how do I fix this? Is there a way to preserve the headers without manually typing in the format for each cell? Any and all help is appreciated, thank you!
[image: small section of the excel file header]
The code I am trying to run:
#!/usr/bin/env python
import sys
import os
import pandas as pd
#load source excel file
user_input = input("Enter the path of your source excel file (omit 'C:'): ")
#reads the source excel file
df = pd.read_excel(user_input, sheet_name = 'PN Projection')
#Filtering dataframe
#Filters out rows with 'EOL' in column 'item status' and 'xcvr' in 'description'
df = df[~(df['Item Status'] == 'EOL')]
df = df[~(df['Description'].str.contains("XCVR", na=False))]
#Filters in rows with "XC" or "spartan" in 'description' column
df = df[(df['Description'].str.contains("XC", na=False) | df['Description'].str.contains("Spartan", na=False))]
print(df)
#Saving to a new spreadsheet called Filtered Data
df.to_excel('filtered_data.xlsx', sheet_name='filtered_data')
If you do not need the top 2 rows, then:
df = pd.read_excel(user_input, sheet_name='PN Projection', skiprows=range(0, 2))
This has worked for me when handling several strangely formatted files. Let me know if this isn't what you're looking for, or if there are additional issues.
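If you do need those rows as column labels, a hedged alternative: passing a list of row indices (rather than a slice like [1:3]) to header builds a MultiIndex from several header rows, and to_excel writes that MultiIndex back out as multiple header rows. The indices below are an assumption about where your real headers sit:
# header rows assumed to be the first two rows of the sheet; adjust as needed
df = pd.read_excel(user_input, sheet_name='PN Projection', header=[0, 1])
df.to_excel('filtered_data.xlsx', sheet_name='filtered_data')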

Python convert dictionary to CSV

I am trying to convert a dictionary to CSV so that it is readable (each value under its respective key).
import csv
import json
from urllib.request import urlopen

x = 0
id_num = [848649491, 883560475, 431495539, 883481767, 851341658, 42842466, 173114302, 900616370, 1042383097, 859872672]
for bilangan in id_num:
    with urlopen("https://shopee.com.my/api/v2/item/get?itemid=" + str(bilangan) + "&shopid=1883827") as response:
        source = response.read()
    data = json.loads(source)
    #print(json.dumps(data, indent=2))
    data_list = {x: {'title': productName(), 'price': price(), 'description': description(), 'preorder': checkPreorder(),
                     'estimate delivery': estimateDelivery(), 'variation': variation(), 'category': categories(),
                     'brand': brand(), 'image': image_link()}}
    #print(data_list[x])
    x += 1
I store the index in x, so it loops from 0 to 1, 2, and so on. I have tried many things but still cannot find a way to make it look like this, or close to this:
https://i.stack.imgur.com/WoOpe.jpg
Using DictWriter from the csv module
Demo:
import csv

filename = 'output.csv'  # path of the csv file to create
data_list = {'x': {'title': 'productName()', 'price': 'price()', 'description': 'description()', 'preorder': 'checkPreorder()',
                   'estimate delivery': 'estimateDelivery()', 'variation': 'variation()', 'category': 'categories()',
                   'brand': 'brand()', 'image': 'image_link()'}}

with open(filename, "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=data_list["x"].keys())
    writer.writeheader()
    writer.writerow(data_list["x"])
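To get one csv row per product instead of the single demo row, accumulate the per-item dicts in a list and write them in one pass with writerows. A minimal sketch with hypothetical stand-in data in place of the question's helper functions:
import csv

# hypothetical stand-ins for the values returned by productName(), price(), ...
rows = [
    {'title': 'Item A', 'price': 10.0, 'description': 'first item'},
    {'title': 'Item B', 'price': 12.5, 'description': 'second item'},
]

with open('items.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)  # one csv row per item under a single header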
I think maybe you just want to merge some cells, as Excel does?
If so, that is not possible in csv, because the csv format does not carry cell-style information the way Excel does.
Some possible solutions:
use openpyxl to generate an Excel file instead of a csv; then you can merge cells with the worksheet.merge_cells() function (a sketch follows this list).
do not merge cells; just repeat the title, price, and other fields on each line, so the data looks like:
first line: {'title':'test_title', 'price': 22, 'image': 'image_link_1'}
second line: {'title':'test_title', 'price': 22, 'image': 'image_link_2'}
do not merge cells, but set the repeated title, price, and other fields to a blank string, so they will not show in your csv file.
use line breaks to control the format; that will merge multiple lines with the same title into a single line.
Hope that helps.
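For the openpyxl route, a minimal sketch (the file name and cell ranges are illustrative):
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(['test_title', 22, 'image_link_1'])
ws.append(['', '', 'image_link_2'])
# merge the title and price cells across the two rows of the same product
ws.merge_cells('A1:A2')
ws.merge_cells('B1:B2')
wb.save('merged_items.xlsx')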
If I were you, I would have done this a bit differently. I do not like that you are calling so many functions when this website offers a beautiful JSON response back :) Moreover, I will use the pandas library so that I have total control over my data; I am not a CSV lover. This is a silly prototype:
import requests
import pandas as pd

# Create our dictionary with one list per field
data_list = {'title': [], 'price': [], 'description': [], 'preorder': [],
             'estimate delivery': [], 'variation': [], 'categories': [],
             'brand': [], 'image': []}

# API url
url = 'https://shopee.com.my/api/v2/item/get'

id_nums = [848649491, 883560475, 431495539, 883481767, 851341658,
           42842466, 173114302, 900616370, 1042383097, 859872672]
shop_id = 1883827

# Loop through id_nums and fetch the goodies
for id_num in id_nums:
    params = {'itemid': id_num,  # take values from id_nums
              'shopid': shop_id}
    r = requests.get(url, params=params)
    # Check if we got something :)
    if r.ok:
        data_json = r.json()
        # This web site returns a beautiful JSON we can slice :)
        product = data_json['item']
        # Let's populate data_list with the items we got. We could simply
        # write one function to do this, but for now this will do
        data_list['title'].append(product['name'])
        data_list['price'].append(product['price'])
        data_list['description'].append(product['description'])
        data_list['preorder'].append(product['is_pre_order'])
        data_list['estimate delivery'].append(product['estimated_days'])
        data_list['variation'].append(product['tier_variations'])
        data_list['categories'].append([c['display_name'] for c in product['categories']])
        data_list['brand'].append(product['brand'])
        data_list['image'].append(product['image'])
    else:
        # Do something if we hit a connection error or similar,
        # e.g. retry or ignore
        pass

# Put the dictionary into a DataFrame and order the columns :)
df = pd.DataFrame(data_list)
df = df[['title', 'price', 'description', 'preorder', 'estimate delivery',
         'variation', 'categories', 'brand', 'image']]

# df.to ...? There are dozens of different ways to store your data
# that are far better than CSV, e.g. MongoDB, HDF5 or compressed pickle
df.to_csv('my_data.csv', sep=';', encoding='utf-8', index=False)

Python: for loop and saving to a new CSV file with pandas

I've been searching everywhere but cannot seem to solve this problem.
I have a csv file which contains two headings, "Name" and "URL". I've saved this in a variable called df1, as per below:
import pandas as pd
df1 = pd.read_csv('yahoo finance.csv')
print(df1)
Name URL
0 Gainers https://au.finance.yahoo.com/gainers?e=ax
1 Losers https://au.finance.yahoo.com/losers
2 Active https://au.finance.yahoo.com/most-active
What I'm trying to do is go into each of the above URLs, parse the table within each, and save the data in a new CSV file.
for u in df1.URL:
    u2 = pd.read_html(u)
    for num in u2:
        row2 = pd.DataFrame(num)
        row2.to_csv(name + '.csv', index=False)
I am missing a big step here that I can't resolve: I want to save the table from each URL into a new CSV, named after the "Name" column of the corresponding URL.
Can someone help me fix this simple part? Currently all this code does is save the last URL's data to a csv named "Active"; it's not saving the first two URLs at all.
Thank you in advance!
Thank you, this has helped solve the issue; the CSVs are now saving as they should be. The updated code is:
for row in df1.iterrows():
    name = row[1]['Name']
    url = row[1]['URL']
    url2 = str(url)
    url3 = pd.read_html(url2)
    for num in url3:
        row2 = pd.DataFrame(num)
        row2.to_csv(name + '.csv', index=False)
Do you mean that you need to iterate over the dataframe row by row, with the URL value used for getting the data and the Name used for saving it? If yes, you probably need this:
for row in df.iterrows():
    name = row[1]['Name']
    url = row[1]['URL']
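As a side note, a slightly cleaner sketch of the same loop: iterrows() yields (index, row) pairs, so unpacking the tuple avoids the row[1] indexing (the 'Name' and 'URL' columns match your csv):
# each iteration yields (index, row), where row is a pandas Series
for _, row in df1.iterrows():
    name = row['Name']
    url = row['URL']
    for table in pd.read_html(url):  # read_html returns a list of DataFrames
        table.to_csv(name + '.csv', index=False)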

Python: For Loop iterating through a csv list to scrape from a CSV website

I'm trying to create a loop that uses a column of names from a csv file in my path to pull data from a website that serves its data as CSV. I need to iterate through each name and pass it to the website, which returns a specific row with five columns related to that name. I feel as though I have searched every thread pertaining to scraping a website that holds its data in a CSV file by using a CSV list.
I have tested the code that is in the for loop, and it works independently to gather the specific row and columns when given a single value as the parameter:
i = 'City'
url = url_template.format(i) # get the url
url = pd.read_csv(url)
url_df = pd.DataFrame(url)
my_data = url_df.iloc[[13]]
# my_df = pd.DataFrame(my_data) ## greyed out for testing
print my_data
However, when I attempt to run a loop:
import pandas as pd

url_template = "http://foo.html?t={}&spam=green&eggs=yellow"

# CSV file the column list is retrieved from
need_conv = pd.read_csv('pract.csv')  # csv file
# column to use in the loop
need = need_conv['column']

# empty DataFrame
test_df = pd.DataFrame()

# loop to retrieve data and print
for i in need:  # for each name
    url = url_template.format(i)  # get the url
    url = pd.read_csv(url)  # convert the data
    url_df = pd.DataFrame(url)  # create DataFrame from parsed data
    my_data = url_df.iloc[[13]]  # select the specific row of data
    test_df = test_df.append(my_data)  # append data to the accumulating dataframe

print test_df.head()
I receive an error:
pandas.io.common.EmptyDataError: No columns to parse from file
Help and feedback are very much appreciated!
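That error means pd.read_csv received an empty file for at least one of the formatted URLs. As a hedged debugging sketch of the loop body: strip whitespace from each name before formatting (a stray space or blank cell in pract.csv is a likely culprit), and skip names whose responses come back empty, printing them so you can see which fail (the exception class matches the traceback above):
for i in need:  # for each name
    url = url_template.format(str(i).strip())  # stray whitespace makes a bad query
    try:
        url_df = pd.read_csv(url)
    except pd.io.common.EmptyDataError:
        # the site returned an empty file for this name; report it and move on
        print 'no data returned for:', i
        continue
    test_df = test_df.append(url_df.iloc[[13]])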
