Web scraping, reading data frames, and exporting to csv - python

The goal is to scrape pokemonDB, create a DataFrame of the Pokemon data (number, name, primary type, and secondary type), separate the two types into their own columns, and export it as a CSV file.
I'm stuck on accessing the 'dex' dataframe, specifically the contents of the columns. Am I using ".loc" correctly?
And then there is separating the two types into their own columns. I know I must use a space " " as a delimiter, but I'm not sure how. I'm new to Pandas.
This is what I have:
import pandas as pd
import requests
page = requests.get("https://pokemondb.net/pokedex/all")
dex = pd.read_html(page.text, attrs = {'id': 'pokedex'}, index_col = '#')
column_label_list = (list(dex[0].columns))
NationalNo = column_label_list[0];
Name = column_label_list[1];
Type = column_label_list[2];
numbers_list = dex.loc[ "#"]
names_list = dex.loc[ "Name"]
types1_list = dex.loc[ "Type"]
pokemon_list = pd.DataFrame(
    {
        NationalNo: numbers_list,
        Name: names_list,
        Type: types1_list,
        #'Type2': types2_list,
    })
print(pokemon_list)
#pokemon_list.to_csv('output.csv',encoding='utf-8-sig')
The result should look this like:
output.csv
# | Name | Type1 | Type2 |
__|_________|_______|_______|
0 |Bulbasaur|Grass |Poison |
__|_________|_______|_______|
.
.
.
etc...
I hope what I'm trying to accomplish makes sense.

dex is a list of all the tables found in that HTML. Since there is only one table, select the first one; you don't need to map it into a new DataFrame, because it already is one, so you can export it directly. Please consider using the code below:
import pandas as pd
import requests
page = requests.get("https://pokemondb.net/pokedex/all")
dex = pd.read_html(page.text, attrs = {'id': 'pokedex'}, index_col = '#')
dex[0].to_csv("output.csv", encoding='utf-8')
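To also get the Type1/Type2 columns shown in the expected output, here is a minimal sketch building on the same code; it assumes read_html returns both types as a single space-separated string (e.g. "Grass Poison") in the Type column:
import pandas as pd
import requests

page = requests.get("https://pokemondb.net/pokedex/all")
dex = pd.read_html(page.text, attrs={'id': 'pokedex'}, index_col='#')

pokemon_list = dex[0]
# split the Type cell on whitespace; single-typed Pokemon get None in Type2
pokemon_list[['Type1', 'Type2']] = pokemon_list['Type'].str.split(n=1, expand=True)
pokemon_list = pokemon_list[['Name', 'Type1', 'Type2']]
pokemon_list.to_csv('output.csv', encoding='utf-8-sig')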

Related

Issues with .JSON file conversion and CSV manipulation in Python

sorry for the long post! I'm a bit Python-illiterate, so please bear with me:
I am working on a project that uses extracted Fitbit resting heart-rate data to compare heart-rate values between a series of years.
The Fitbit data exports as a .json file that I am attempting to convert to .csv for further analysis.
I pulled a script from GitHub that converts .json files to .csv-formatted files; however, when inputting the resting heart rate data I am running into a few issues.
Sample lines from .json:
[{
  "dateTime" : "09/30/16 00:00:00",
  "value" : {
    "date" : "09/30/16",
    "value" : 76.83736383927637,
    "error" : 2.737363838373737
  }
}]
Section of GitHub code that transforms nested frame into columns:
# reading json into dataframes
resting_hr_df = get_json_to_df(file_list=resting_hr_file_list).reset_index()
# Heart rate contains a sub json that are explicitly converted into column
resting_hr_df['date'] = resting_hr_df['value'].transform(lambda x: make_new_df_value(x, 'date'))
resting_hr_df['value'] = resting_hr_df['value'].transform(lambda x: make_new_df_value(x, 'value'))
resting_hr_df['error'] = resting_hr_df['value'].transform(lambda x: make_new_df_value(x, 'error'))
resting_hr_df = resting_hr_df.drop(['value', 'index'], axis=1)
There are two fields named 'value', and I think this is causing the issue.
When using the pandas transform function to pull the nested keys into their own columns, the second 'value' ends up stored as 0 in the .csv file.
How should I store the values?
The problem is that this is a nested JSON file (note also that, in the code above, the 'value' column is overwritten before 'error' is extracted from it). The solution is to load the JSON file with json and then flatten it into pandas with json_normalize:
import json
import pandas as pd

with open('filename.json') as data_file:
    data = json.load(data_file)

resting_hr_df = pd.json_normalize(data)
resting_hr_df
Output resting_hr_df:
| | dateTime | value.date | value.value | value.error |
|---:|:------------------|:-------------|--------------:|--------------:|
| 0 | 09/30/16 00:00:00 | 09/30/16 | 76.8374 | 2.73736 |
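Since the goal is a .csv, a minimal follow-up sketch (the flattened column names and the output filename are assumptions, not part of the original answer):
# flatten the nested column names and write the result out as a csv
resting_hr_df = resting_hr_df.rename(
    columns={'value.date': 'date', 'value.value': 'value', 'value.error': 'error'})
resting_hr_df.to_csv('resting_heart_rate.csv', index=False)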
You can also solve this with the built-in pandas function read_json:
df = pd.read_json('abc.json', orient='index')
data = df.to_csv(index=False)
print(data)
This is an easy and helpful way to solve the problem by converting the JSON straight to a CSV file.

How to update a pandas dataframe, from multiple API calls

I need to write a Python script to:
Read a csv file with the columns (person_id, name, flag). The file has 3000 rows.
Based on the person_id from the csv file, call a URL, passing the person_id, to do a GET:
http://api.myendpoint.intranet/get-data/1234
The URL will return some information about the person_id, like the example below. I need to get all of the rents objects and save them to my csv; the final output needs to look like the sample further down. This is what I have so far:
import pandas as pd
import requests
ids = pd.read_csv(f"{path}/data.csv", delimiter=';')
person_rents = df = pd.DataFrame([], columns=list('person_id','carId','price','rentStatus'))
for id in ids:
    response = request.get(f'endpoint/{id["person_id"]}')
    json = response.json()
    person_rents.append([person_id, rent['carId'], rent['price'], rent['rentStatus']])
pd.read_csv(f"{path}/data.csv", delimiter=';' )
person_id;name;flag;cardId;price;rentStatus
1000;Joseph;1;6638;1000;active
1000;Joseph;1;5566;2000;active
Response example
{
  "active": false,
  "ctodx": false,
  "rents": [{
      "carId": 6638,
      "price": 1000,
      "rentStatus": "active"
    }, {
      "carId": 5566,
      "price": 2000,
      "rentStatus": "active"
    }
  ],
  "responseCode": "OK",
  "status": [{
      "request": 345,
      "requestStatus": "F"
    }, {
      "requestId": 678,
      "requestStatus": "P"
    }
  ],
  "transaction": false
}
After saving the additional data from the response to the csv, I need to get data from another endpoint, using the carId in the URL. The mileage result must be saved in the same csv.
http://api.myendpoint.intranet/get-mileage/6638
http://api.myendpoint.intranet/get-mileage/5566
The return for each call will look like this:
{"mileage":1000.0000}
{"mileage":550.0000}
The final output must be
person_id;name;flag;cardId;price;rentStatus;mileage
1000;Joseph;1;6638;1000;active;1000.0000
1000;Joseph;1;5566;2000;active;550.0000
Can someone help me with this script?
It could be with pandas or any Python 3 lib.
Code Explanation
- Create dataframe df with pd.read_csv.
- It is expected that all of the values in 'person_id' are unique.
- Use .apply on 'person_id' to call prepare_data.
- prepare_data expects 'person_id' to be a str or int, as indicated by the type annotation, Union[int, str].
- Inside prepare_data, call the API, which returns a dict.
- Convert the 'rents' key of the dict into a dataframe with pd.json_normalize.
- Use .apply on 'carId' to call the API and extract the 'mileage', which is added to dataframe data as a column.
- Add 'person_id' to data, which can be used to merge df with s.
- Convert the pd.Series of dataframes, s, into a single dataframe with pd.concat, and then merge df and s on person_id.
- Save to a csv with pd.to_csv in the desired form.
Potential Issues
- If there's an issue, it's most likely to occur in the call_api function.
- As long as call_api returns a dict, like the response shown in the question, the remainder of the code will work correctly to produce the desired output.
import pandas as pd
import requests
import json
from typing import Union

def call_api(url: str) -> dict:
    r = requests.get(url)
    return r.json()

def prepare_data(uid: Union[int, str]) -> pd.DataFrame:
    d_url = f'http://api.myendpoint.intranet/get-data/{uid}'
    m_url = 'http://api.myendpoint.intranet/get-mileage/'
    # get the rent data from the api call
    rents = call_api(d_url)['rents']
    # normalize rents into a dataframe
    data = pd.json_normalize(rents)
    # get the mileage data from the api call and add it to data as a column
    data['mileage'] = data.carId.apply(lambda cid: call_api(f'{m_url}{cid}')['mileage'])
    # add person_id as a column to data, which will be used to merge data to df
    data['person_id'] = uid
    return data

# read data from file
df = pd.read_csv('file.csv', sep=';')

# call prepare_data
s = df.person_id.apply(prepare_data)

# s is a Series of DataFrames, which can be combined with pd.concat
s = pd.concat([v for v in s])

# join df with s, on person_id
df = df.merge(s, on='person_id')

# save to csv
df.to_csv('output.csv', sep=';', index=False)
If there are any errors when running this code:
- Leave a comment to let me know.
- Edit your question and paste the entire traceback, as text, into a code block.
Example
# given the following start dataframe
person_id name flag
0 1000 Joseph 1
1 400 Sam 1
# resulting dataframe using the same data for both id 1000 and 400
person_id name flag carId price rentStatus mileage
0 1000 Joseph 1 6638 1000 active 1000.0
1 1000 Joseph 1 5566 2000 active 1000.0
2 400 Sam 1 6638 1000 active 1000.0
3 400 Sam 1 5566 2000 active 1000.0
There are many different ways to implement this. One of them would be, like you started in your comment:
- read the CSV file with pandas
- for each line, take the person_id and build an API call
- take the rents from the delivered JSON response
- extract the carId for each individual rental
- collect everything in a rows_list
- convert the rows_list back to csv via pandas
A very simple solution without any error handling could look something like this:
from types import SimpleNamespace
import pandas as pd
import requests
import json

path = '/some/path/'
df = pd.read_csv(f'{path}/data.csv', delimiter=';')

rows_list = []
for _, row in df.iterrows():
    rentCall = f'http://api.myendpoint.intranet/get-data/{row.person_id}'
    print(rentCall)
    response = requests.get(rentCall)
    r = json.loads(response.text, object_hook=lambda d: SimpleNamespace(**d))
    for rent in r.rents:
        mileageCall = f'http://api.myendpoint.intranet/get-mileage/{rent.carId}'
        print(mileageCall)
        response2 = requests.get(mileageCall)
        m = json.loads(response2.text, object_hook=lambda d: SimpleNamespace(**d))
        state = "active" if r.active else "inactive"
        rows_list.append((row['person_id'], row['name'], row['flag'], rent.carId, rent.price, state, m.mileage))

df = pd.DataFrame(rows_list, columns=('person_id', 'name', 'flag', 'carId', 'price', 'rentStatus', 'mileage'))
print(df.to_csv(index=False, sep=';'))
Speeding up with multiprocessing
You mention that you have 3000 rows, which means that you'll have to make a lot of API calls. Depending on the connection, every one of these calls might take a while. As a result, performing this in a sequential way might be too slow. The majority of the time, your program will just be waiting on a response from the server without doing anything else.
We can improve this performance by using multiprocessing.
I use all the code from Trenton's answer, but I replace the following sequential call:
# call prepare_data
s = df.person_id.apply(prepare_data)
With a parallel alternative:
from multiprocessing import Pool

n_processes = 20  # Experiment with this to see what works well
with Pool(n_processes) as p:
    s = p.map(prepare_data, df.person_id)
Alternatively, a threadpool might be faster, but you'll have to test that by replacing the import with
from multiprocessing.pool import ThreadPool as Pool.
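For reference, the thread-pool variant would then look like this (the same call as above, only the import changes):
from multiprocessing.pool import ThreadPool as Pool

n_processes = 20  # experiment with this to see what works well
with Pool(n_processes) as p:
    s = p.map(prepare_data, df.person_id)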

Load csv into pandas dataframe from Pydrill Query

I am able to load a csv into a pandas dataframe, but it is stuck in a list. How can I load directly into a pandas dataframe from Pydrill, or unlist the pandas dataframe columns and data? I've tried unlisting, and it puts everything into a list of a list.
I've used to_dataframe(), but can't seem to find documentation on whether I can use a delimiter. pd.DataFrame doesn't work because of the Pydrill query.
reviews = drill.query("SELECT * FROM hdfs.datasets.`titanic_ML/titanic.csv` LIMIT 1000", timeout=30)
print(reviews)
import pandas as pd
df2 = reviews.to_dataframe()
df2.rename(columns=df2.iloc[0])
headers = df2.iloc[0]
print(headers)
new_df = pd.DataFrame(df2.values[1:], columns=headers)
new_df.head()
The results cast everything into a list.
["pclass","sex","age","sibsp","parch","fare","embarked","survived"]
0 ["3","1","38.0","0","0","7.8958","1","0"]
1 ["1","1","42.0","0","0","26.55","1","0"]
2 ["3","0","9.0","4","2","31.275","1","0"]
3 ["3","1","27.0","0","0","7.25","1","0"]
4 ["1","1","41.0","0","0","26.55","1","0"]
I'd like to get everything into a normal pandas dataframe.
The solution I found was this: it doesn't unlist the dataframe, but it's an alternative solution to the problem.
import psycopg2
import pandas as pd

connect_str = "dbname='dbname' user='dsa_ro_user' host='host database'"
conn = psycopg2.connect(connect_str)

SQL = "SELECT * "
SQL += " FROM train"

df = pd.read_sql(SQL, conn)
df.head()
Try using Table Functions, as described in the O'Reilly text, Chapter 4: Querying Delimited Data. This will delimit the file and apply the first row to your columns. Note: because everything is being read as text, you may need to cast your values as floats if you want to do arithmetic in your SELECT or WHERE (see the cast sketch after the query below).
This should get you what you want:
sql="""
SELECT *
FROM table(hdfs.datasets.`/titanic_ML/titanic.csv`(
type => 'text',
extractHeader => true,
fieldDelimiter => ',')
) LIMIT 1000
"""
rows = drill.query(sql, timeout=30)
df = rows.to_dataframe()
df.head()
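If you do need numeric columns, here is a hedged sketch of casting inside the same query; the column names (age, fare, survived) are taken from the header row shown in the question, and the cast types are an assumption:
# cast the text columns you need for arithmetic to DOUBLE inside the query
sql = """
SELECT CAST(age AS DOUBLE) AS age,
       CAST(fare AS DOUBLE) AS fare,
       survived
FROM table(hdfs.datasets.`/titanic_ML/titanic.csv`(
    type => 'text',
    extractHeader => true,
    fieldDelimiter => ',')
) LIMIT 1000
"""
rows = drill.query(sql, timeout=30)
df = rows.to_dataframe()
df.head()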

Normalizing a Python list to get JSON data into a table

I am trying to use the following Python code to draw info from an API and ultimately create a players table. I have gotten the data mostly normalized to that level, but am struggling to work with the [rosters] list.
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize
r1 = requests.get('https://statsapi.web.nhl.com/api/v1/teams/16?hydrate=franchise(roster(season=20182019,person(name,stats(splits=[yearByYear]))))')
data = r1.json()
df1 = json_normalize(data, 'teams',['teams.franchise'],errors='ignore')['franchise']
df2 = json_normalize(df1)['roster.roster']
df3 = pd.DataFrame(data=df2.index, columns=['Id'])
df4 = pd.DataFrame(data=df2.values, columns=['Players'])
df4
returns:
0 [{'person': {'id': 8470645, 'fullName': 'Corey...
Any ideas on what I could do to extract each person from this API into a table? IE:
ID | fullName |
.. .....
.. .....
Thanks.
Looks like the main problem is that this is a deeply nested dictionary. Some inspection showed that this code will get you to each player:
all_players = []
for team in data['teams']:
    for player in team['franchise']['roster']['roster']:
        player = player['person']
        print(player.keys())
        print(player)
        print()
However, some of the keys in player correspond to more dictionaries. So you'll either have to decide which player fields are basic values like strings/ints/etc and keep those, or add more code to parse out the additional dictionaries.
But this code will get you to each player, then you can normalize how you want from there.
Let me know if you need help!
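For the specific ID | fullName table asked about, here is a minimal sketch building on the loop above; it assumes the 'id' and 'fullName' keys shown in the question's output:
import pandas as pd

players = []
for team in data['teams']:
    for entry in team['franchise']['roster']['roster']:
        person = entry['person']
        # keep only the flat fields needed for the table
        players.append({'ID': person['id'], 'fullName': person['fullName']})

players_df = pd.DataFrame(players, columns=['ID', 'fullName'])
print(players_df.head())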

Can data loaded in as newAPIHadoopRDD be converted into a DataFrame?

I'm using PySpark to load data from Google BigQuery.
I've loaded data by using:
dfRates = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
Where conf is defined as in https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example.
I need this data as a DataFrame, so I tried,
row = Row(['userId','accoId','rating']) # or row = Row(('userId','accoId','rating'))
dataRDD = dfRates.map(row).toDF()
and
dataRDD = sqlContext.createDataFrame(dfRates,['userId','accoId','rating'])
But it does not convert the data into a DataFrame. Is there a way to convert it into a DataFrame?
As long as the types can be represented using Spark SQL types, there is no reason it couldn't be. The only problem here seems to be your code.
newAPIHadoopRDD returns an RDD of pairs (tuples of length two). In this particular context it looks like you'll get (int, str) in Python, which clearly cannot be unpacked into ['userId', 'accoId', 'rating'].
According to the doc you've linked, com.google.gson.JsonObject is represented as a JSON string, which can either be parsed on the Python side using standard Python utils (the json module):
import json
from pyspark.sql import Row

def parse(v, fields=["userId", "accoId", "rating"]):
    row = Row(*fields)
    try:
        parsed = json.loads(v)
    except json.JSONDecodeError:
        parsed = {}
    return row(*[parsed.get(x) for x in fields])

dfRates.map(parse).toDF()
or on the Scala / DataFrame side using get_json_object:
from pyspark.sql.functions import col, get_json_object
dfRates.toDF(["id", "json_string"]).select(
# This assumes you expect userId field
get_json_object(col("json_string"), "$.userId"),
...
)
Please note the differences in the syntax I've used to define and create rows.
hbase table rows:
hbase(main):008:0> scan 'test_hbase_table'
ROW COLUMN+CELL
dalin column=cf:age, timestamp=1464101679601, value=40
tangtang column=cf:age, timestamp=1464101649704, value=9
tangtang column=cf:name, timestamp=1464108263191, value=zitang
2 row(s) in 0.0170 seconds
here we go
import json

host = '172.23.18.139'
table = 'test_hbase_table'
conf = {"hbase.zookeeper.quorum": host, "zookeeper.znode.parent": "/hbase-unsecure", "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)
hbase_rdd1 = hbase_rdd.flatMapValues(lambda v: v.split("\n"))
and here are the results:
tt=sqlContext.jsonRDD(hbase_rdd1.values())
In [113]: tt.show()
+------------+---------+--------+-------------+----+------+
|columnFamily|qualifier| row| timestamp|type| value|
+------------+---------+--------+-------------+----+------+
| cf| age| dalin|1464101679601| Put| 40|
| cf| age|tangtang|1464101649704| Put| 9|
| cf| name|tangtang|1464108263191| Put|zitang|
+------------+---------+--------+-------------+----+------+
