Unable to store output in a customized manner in dataframe - python

I've created a script in Python to parse some URLs and store them in a dataframe. My script can do it; however, it doesn't do it the way I expect.
I've tried with:
import requests
from bs4 import BeautifulSoup
import pandas as pd

base = 'http://opml.radiotime.com/Search.ashx?query=kroq'
linklist = []

r = requests.get(base)
soup = BeautifulSoup(r.text, "xml")
for item in soup.select("outline[type='audio'][URL]"):
    find_match = base.split("=")[-1].lower()
    if find_match in item['text'].lower():
        linklist.append(item['URL'])

df = pd.DataFrame(linklist, columns=[find_match])
print(df)
Current output:
0 http://opml.radiotime.com/Tune.ashx?id=s35105
1 http://opml.radiotime.com/Tune.ashx?id=s26581
2 http://opml.radiotime.com/Tune.ashx?id=t122458...
3 http://opml.radiotime.com/Tune.ashx?id=t132149...
4 http://opml.radiotime.com/Tune.ashx?id=t131867...
5 http://opml.radiotime.com/Tune.ashx?id=t120569...
6 http://opml.radiotime.com/Tune.ashx?id=t125126...
7 http://opml.radiotime.com/Tune.ashx?id=t131068...
8 http://cdn-cms.tunein.com/service/Audio/nostre...
9 http://cdn-cms.tunein.com/service/Audio/notcom...
Expected output (I wish to kick out the indices as well if possible):
0 http://opml.radiotime.com/Tune.ashx?id=s35105
1 http://opml.radiotime.com/Tune.ashx?id=s26581
2 http://opml.radiotime.com/Tune.ashx?id=t122458
3 http://opml.radiotime.com/Tune.ashx?id=t132149
4 http://opml.radiotime.com/Tune.ashx?id=t131867
5 http://opml.radiotime.com/Tune.ashx?id=t120569
6 http://opml.radiotime.com/Tune.ashx?id=t125126
7 http://opml.radiotime.com/Tune.ashx?id=t131068
8 http://cdn-cms.tunein.com/service/Audio/nostre
9 http://cdn-cms.tunein.com/service/Audio/notcom

You can align the text, and to get rid of the index, drop it when writing to CSV:
df.style.set_properties(**{'text-align': 'left'})
df.to_csv(r'Data.csv', sep=',', encoding='utf-8-sig', index=False)
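If the truncated "..." display is also a problem, pandas can be told not to shorten long cell values; a minimal sketch (the single URL below is just one value from the question, used for illustration):

```python
import pandas as pd

pd.set_option('display.max_colwidth', None)  # show full cell values, no "..." truncation

df = pd.DataFrame({'kroq': ['http://opml.radiotime.com/Tune.ashx?id=s35105']})
print(df.to_string(index=False))  # to_string(index=False) also drops the index when printing
```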

Related

Pull date and numbers from json and append them to pandas dataframe

I want to pull the date and the Powerball numbers and append them to a pandas dataframe. I have made the columns, but I can't seem to get the data into them. When I go to https://jsonparser.org/ and put in the url I can see the structure, but when I try to list the numbers (i.e. ['8'] or ['9']) it doesn't append the data. I've been working on this for about 3 days. Thanks in advance.
###########
# MODULES #
###########
import json
import requests
import urllib
import pandas as pd

###########
# HISTORY #
###########
# We need to pull the data from the website.
# Then we need to organize the numbers based off of position.
# Basically it will be several lists of numbers.
URL = "https://data.ny.gov/api/views/d6yy-54nr/rows"
r = requests.get(URL)
my_json = r.json()
#my_json_dumps = json.dumps(my_json, indent=2)
#print(my_json_dumps)

df = pd.DataFrame(columns=["Date","Powerball Numbers","1","2","3","4","5","6","7"]) # Create columns in df
for item in my_json['data']:
    df = pd.DataFrame(my_json['data'])
    l_date = df.iloc['8']   # Trying to pull columns from json
    p_num = (df.iloc['9'])  # Trying to pull columns from json
    df = df.append({"Date": l_date,
                    "Powerball Numbers": p_num,
                    }, ignore_index=True)
    #test = item['id']
    print(l_date)
EDIT: This is what I am trying to get.
Try this. Old code:
l_date = df.iloc['8']  # Trying to pull columns from json
p_num = (df.iloc['9'])  # Trying to pull columns from json
Change the lines that index with a quoted string, df.iloc['8'], to use integers:
l_date = df.iloc[8]  # without quotation marks
p_num = (df.iloc[9])  # without quotation marks
It works fine. The result is:
0 row-w3y7.4r9a-caat
1 00000000-0000-0000-3711-B495A26E6E8B
2 0
3 1619710755
4 None
5 1619710755
6 None
7 { }
8 2020-10-24T00:00:00
9 18 20 27 45 65 06
10 2
Name: 8, dtype: object
0 row-w3y7.4r9a-caat
1 00000000-0000-0000-3711-B495A26E6E8B
2 0
3 1619710755
4 None
5 1619710755
6 None
7 { }
8 2020-10-24T00:00:00
9 18 20 27 45 65 06
10 2
I hope you find it useful.
This is how I did it:
URL = "https://data.ny.gov/resource/d6yy-54nr.json"
r = requests.get(URL)
my_json = r.json()

for item in my_json:
    df = pd.DataFrame(my_json)
    l_date = item['draw_date']              # pull the date from the json
    p_num = item['winning_numbers']         # pull the winning numbers from the json
    n_list = list(map(int, p_num.split()))  # convert powerball numbers into a list
    n_1 = n_list[0]
    n_2 = n_list[1]
    n_3 = n_list[2]
    n_4 = n_list[3]
    n_5 = n_list[4]
    n_6 = n_list[5]
    df1 = pd.DataFrame(columns=["Date","Powerball Numbers","1","2","3","4","5","6"]) # Create columns in df
    df = df1.append({"Date": l_date,
                     "Powerball Numbers": p_num,
                     "1": n_1,
                     "2": n_2,
                     "3": n_3,
                     "4": n_4,
                     "5": n_5,
                     "6": n_6,
                     }, ignore_index=True)
print(df)
Which produced
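As a side note, DataFrame.append has since been deprecated in pandas. A sketch of the same idea that collects plain dicts first and builds the frame once; the two sample records below are made up to mirror the draw_date/winning_numbers fields from the API:

```python
import pandas as pd

# Made-up records shaped like the data.ny.gov JSON response
my_json = [
    {"draw_date": "2020-10-24T00:00:00", "winning_numbers": "18 20 27 45 65 06"},
    {"draw_date": "2020-10-28T00:00:00", "winning_numbers": "11 28 37 40 53 13"},
]

rows = []
for item in my_json:
    row = {"Date": item["draw_date"], "Powerball Numbers": item["winning_numbers"]}
    # split "18 20 27 ..." into columns "1".."6"
    row.update({str(i + 1): int(n) for i, n in enumerate(item["winning_numbers"].split())})
    rows.append(row)

df = pd.DataFrame(rows)  # build the frame once instead of appending in a loop
print(df)
```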

How to Convert a text data into DataFrame

How can I convert the text data below into a pandas DataFrame?
(-9.83334315,-5.92063135,-7.83228037,5.55314146), (-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976), (-22.25802006,-10.12843806,-2.9688831,-2.70574665), (-20.3418791,-9.4157625,-3.348587,-7.65474665)
I want to convert this to a DataFrame with 4 rows and 5 columns. For example, the first row contains the first element of each parenthesized tuple.
Thanks for your contribution.
Try this:
import pandas as pd

with open("file.txt") as f:
    file = f.read()

df = pd.DataFrame([
    {f"name{id}": val.replace("(", "").replace(")", "") for id, val in enumerate(row.split(",")) if val}
    for row in file.split()
])
import re
import pandas as pd

with open('file.txt') as f:
    data = [re.findall(r'([\-\d.]+)', line) for line in f.readlines()]

df = pd.DataFrame(data).T.astype(float)
Output:
0 1 2 3 4
0 -9.833343 -5.531373 -11.492390 -22.258020 -20.341879
1 -5.920631 -8.310108 -1.680536 -10.128438 -9.415762
2 -7.832280 -3.280625 -4.147730 -2.968883 -3.348587
3 5.553141 -6.860671 -3.541440 -2.705747 -7.654747
Your data is basically in tuple-of-tuples form, hence you can simply pass a list of tuples instead of a tuple of tuples and get a DataFrame out of it.
Your Sample Data:
text_data = ((-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665))
Result:
As you can see, the display defaults to 6 decimal places while your values have up to 8, hence you can use pd.options.display.float_format and set it accordingly.
pd.options.display.float_format = '{:,.8f}'.format
To get your desired data, simply transpose as well to get the desired result.
pd.DataFrame(list(text_data)).T
0 1 2 3 4
0 -9.83334315 -5.53137301 -11.49239039 -22.25802006 -20.34187910
1 -5.92063135 -8.31010785 -1.68053601 -10.12843806 -9.41576250
2 -7.83228037 -3.28062536 -4.14773043 -2.96888310 -3.34858700
3 5.55314146 -6.86067081 -3.54143976 -2.70574665 -7.65474665
OR
Simply, you can use as below as well, where you can create a DataFrame from a list of simple tuples.
data = (-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665)
# data = [(-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665)]
pd.DataFrame(data).T
0 1 2 3 4
0 -9.83334315 -5.53137301 -11.49239039 -22.25802006 -20.34187910
1 -5.92063135 -8.31010785 -1.68053601 -10.12843806 -9.41576250
2 -7.83228037 -3.28062536 -4.14773043 -2.96888310 -3.34858700
3 5.55314146 -6.86067081 -3.54143976 -2.70574665 -7.65474665
wrap the tuples as a list
data=[(-9.83334315,-5.92063135,-7.83228037,5.55314146),
(-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976),
(-22.25802006,-10.12843806,-2.9688831,-2.70574665),
(-20.3418791,-9.4157625,-3.348587,-7.65474665)]
df=pd.DataFrame(data, columns=['A','B','C','D'])
print(df)
output:
A B C D
0 -9.833343 -5.920631 -7.832280 5.553141
1 -5.531373 -8.310108 -3.280625 -6.860671
2 -11.492390 -1.680536 -4.147730 -3.541440
3 -22.258020 -10.128438 -2.968883 -2.705747
4 -20.341879 -9.415762 -3.348587 -7.654747
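If the text really is a file of comma-separated parenthesized tuples as shown, another option is to let ast.literal_eval parse it instead of stripping characters by hand; a sketch using a shortened copy of the question's data:

```python
import ast
import pandas as pd

# First two tuples from the question, kept as one raw string
text = "(-9.83334315,-5.92063135,-7.83228037,5.55314146), (-5.53137301,-8.31010785,-3.28062536,-6.86067081)"

data = ast.literal_eval(text)    # parses safely into a tuple of tuples
df = pd.DataFrame(list(data)).T  # transpose so row 0 holds the first element of each tuple
print(df)
```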

I can't correctly visualize a json dataframe from api

I am currently trying to read some data from a public API. It has different ways of reading (json, csv, txt, among others); you just change the label in the url (/json, /csv, /txt, ...). The urls are as follows:
https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/
https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/
...
My problem is that when trying to import the data into a pandas dataframe it doesn't read it correctly. I am trying the following alternatives:
import pandas as pd
import requests

url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/'
r = requests.get(url)
rjson = r.json()
df = pd.json_normalize(rjson)  # json_normalize needs the pd. prefix (or an explicit import)
df['periods']
I also tried to read the data in csv format:
import pandas as pd
import requests

url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/'
collisions = pd.read_csv(url, sep='<br>')
collisions.head()
But I don't get good results; the dataframe cannot be visualized correctly since the 'periods' column is grouped with all the values ...
The output is displayed as follows: all the data appears as columns.
Here is an example of how the data should be displayed correctly:
What alternative do you recommend trying?
Thank you in advance for your time and help !!
I will be attentive to your answers, regards!
For csv you can use StringIO from io package
In [20]: import requests
In [21]: res = requests.get("https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/")
In [22]: import pandas as pd
In [23]: import io
In [24]: df = pd.read_csv(io.StringIO(res.text.strip().replace("<br>","\n")), engine='python')
In [25]: df
Out[25]:
Mes/Año Tipo de cambio - promedio del periodo (S/ por US$) - Bancario - Promedio
0 Jul.2018 3.276595
1 Ago.2018 3.288071
2 Sep.2018 3.311325
3 Oct.2018 3.333909
4 Nov.2018 3.374675
5 Dic.2018 3.364026
6 Ene.2019 3.343864
7 Feb.2019 3.321475
8 Mar.2019 3.304690
9 Abr.2019 3.303825
10 May.2019 3.332364
11 Jun.2019 3.325650
12 Jul.2019 3.290214
13 Ago.2019 3.377560
14 Sep.2019 3.357357
15 Oct.2019 3.359762
16 Nov.2019 3.371700
17 Dic.2019 3.355190
18 Ene.2020 3.327364
19 Feb.2020 3.390350
20 Mar.2020 3.491364
21 Abr.2020 3.397500
22 May.2020 3.421150
23 Jun.2020 3.470167
Sorry, I couldn't find the link for reading a JSON file with multiple objects inside it. The thing is, we can't use json.load/loads for this kind of format, so we have to use raw_decode() instead.
This code should work:
import pandas as pd
import json
import urllib.request as ur
from pprint import pprint

d = json.JSONDecoder()
url = 'https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/json/'

# read and transform the json into a list of dictionaries
data = []
with ur.urlopen(url) as json_file:
    x = json_file.read().decode()  # decode to convert the bytes string into a normal string
    while True:
        try:
            j, n = d.raw_decode(x)
        except ValueError:
            break
        #print(j)
        data.append(j)
        x = x[n:]
#pprint(data)

# create a list of dictionaries to convert into a dataframe
clean_list = []
for i, d in enumerate(data[0]['periods']):
    dict_data = {
        "month_year": d['name'],
        "value": d['values'][0],
    }
    clean_list.append(dict_data)
#print(clean_list)

#pd.options.display.width = 0
df = pd.DataFrame(clean_list)
print(df)
result
month_year value
0 Jul.2018 3.27659523809524
1 Ago.2018 3.28807142857143
2 Sep.2018 3.311325
3 Oct.2018 3.33390909090909
4 Nov.2018 3.374675
5 Dic.2018 3.36402631578947
6 Ene.2019 3.34386363636364
7 Feb.2019 3.321475
8 Mar.2019 3.30469047619048
9 Abr.2019 3.303825
10 May.2019 3.33236363636364
11 Jun.2019 3.32565
12 Jul.2019 3.29021428571428
13 Ago.2019 3.37756
14 Sep.2019 3.35735714285714
15 Oct.2019 3.3597619047619
16 Nov.2019 3.3717
17 Dic.2019 3.35519047619048
18 Ene.2020 3.32736363636364
19 Feb.2020 3.39035
20 Mar.2020 3.49136363636364
21 Abr.2020 3.3975
22 May.2020 3.42115
23 Jun.2020 3.47016666666667
If I somehow find the link again, I'll edit/comment my answer.
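If the endpoint's JSON is a single object with a 'periods' list of {name, values} records (as the loop above suggests), pd.json_normalize can flatten that list directly; a sketch on a made-up two-record response shaped like the BCRP /json output:

```python
import pandas as pd

# Made-up response shaped like the BCRP /json endpoint
rjson = {
    "periods": [
        {"name": "Jul.2018", "values": ["3.27659523809524"]},
        {"name": "Ago.2018", "values": ["3.28807142857143"]},
    ]
}

df = pd.json_normalize(rjson["periods"])
df["value"] = df["values"].str[0].astype(float)  # each "values" entry is a one-element list
df = df.drop(columns="values").rename(columns={"name": "month_year"})
print(df)
```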

Adding headers to a table I have scraped

I have been following an online tutorial but rather than using the tutorial data which comes with headers I want to use the following code:
The problem I have is that my table has no headers so it is using the first row as the header. How can I set defined headers of "Ride" and "Queue Time"?
Thanks
import requests
import lxml.html as lh
import pandas as pd

url = 'http://www.ridetimes.co.uk/'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')

col = []
i = 0
# For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
print(col)
How about trying this:
>>> pd.DataFrame(col, columns=["Ride", "Queue Time"])
               Ride Queue Time
0  Spinball Whizzer         []
1            0 mins         []
If I am correct, then this is the answer.
Use pandas to get the table, then just assign the column names:
import pandas as pd
url='http://www.ridetimes.co.uk/'
df = pd.read_html(url)[0]
df.columns = ['Ride', 'Queue Time']
Output:
print (df)
Ride Queue Time
0 Spinball Whizzer 0 mins
1 Nemesis 5 mins
2 Oblivion 5 mins
3 Wicker Man 5 mins
4 The Smiler 10 mins
5 Rita 20 mins
6 TH13TEEN 25 mins
7 Galactica Currently Unavailable
8 Enterprise Currently Unavailable
Consider using the same source the page itself uses to update values, which returns JSON. Add a random number to the URL to prevent cached results from being served. This covers all group types, not just Thrill.
import requests
import random
import pandas as pd

i = random.randint(1, 1000000000000000000)
r = requests.get('http://ridetimes.co.uk/queue-times-new.php?r=' + str(i)).json()  # prevent cached results being served
df = pd.DataFrame([(item['ride'], item['time']) for item in r], columns=['Ride', 'Queue Time'])
print(df)
If you want only the Thrill group, then amend to this line:
df = pd.DataFrame([(item['ride'], item['time']) for item in r if item['group'] == 'Thrill'], columns=['Ride', 'Queue Time'])
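Since hitting the live endpoint isn't always possible, here is the same filtering idea run offline on a made-up sample shaped like that JSON feed:

```python
import pandas as pd

# Made-up sample mimicking the queue-times JSON feed
r = [
    {"ride": "Nemesis", "time": "5 mins", "group": "Thrill"},
    {"ride": "Spinball Whizzer", "time": "0 mins", "group": "Family"},
]

# keep only Thrill rides, then name the columns explicitly
df = pd.DataFrame([(item['ride'], item['time']) for item in r if item['group'] == 'Thrill'],
                  columns=['Ride', 'Queue Time'])
print(df)
```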

For loop issues with Quandl -- Python

I'm trying to create a for-loop that automatically runs through my parsed list of NASDAQ stocks and inserts their Quandl codes, to then be retrieved from Quandl's database, essentially creating a large data set of stocks to perform data analysis on. My code "appears" right, but when I print the query it only prints 'GOOG/NASDAQ_Ticker' and nothing else. Any help and/or suggestions will be most appreciated.
import quandl
import pandas as pd
import matplotlib.pyplot as plt
import numpy

def nasdaq():
    nasdaq_list = pd.read_csv(r'C:\Users\NAME\Documents\DATASETS\NASDAQ.csv')  # raw string so backslashes aren't escapes
    nasdaq_list = nasdaq_list[[0]]
    print(nasdaq_list)
    for abbv in nasdaq_list:
        query = 'GOOG/NASDAQ_' + str(abbv)
        print(query)
        df = quandl.get(query, authtoken="authoken")
        print(df.tail()[['Close', 'Volume']])
Iterating over a pd.DataFrame as you have done iterates by column label. For example,
>>> df = pd.DataFrame(np.arange(9).reshape((3, 3)))
>>> df
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
>>> for i in df[[0]]: print(i)
0
I would just get the first column as a Series with .iloc:
>>> for i in df.iloc[:, 0]: print(i)
0
3
6
Note that in general if you want to iterate by row over a DataFrame you're looking for iterrows().
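To see why iterating nasdaq_list[[0]] only yields the column label once, compare it with iterating the column's values; a small sketch with made-up tickers and a hypothetical "Symbol" column name:

```python
import pandas as pd

nasdaq_list = pd.DataFrame({"Symbol": ["AAPL", "GOOG", "MSFT"]})

# Iterating the DataFrame itself yields column labels, not values
labels = [abbv for abbv in nasdaq_list[["Symbol"]]]

# Iterating the first column as a Series yields the tickers themselves
queries = ['GOOG/NASDAQ_' + abbv for abbv in nasdaq_list.iloc[:, 0]]
print(queries)  # ['GOOG/NASDAQ_AAPL', 'GOOG/NASDAQ_GOOG', 'GOOG/NASDAQ_MSFT']
```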
