I have a CSV imported using pandas:
df = pd.read_csv('files_2.csv')
One of the columns in the data is Page URL. I would like to add a column to the DataFrame holding the results of functions from urllib,
something like this:
o = urlparse(df['Page URL'])
o.query  # function that pulls the query string
parse_qs(o.query)  # the logic for the new column
The new column should hold the results from the parse_qs(o.query) function. I am new to Python 3 and it would be great if you could point me in the right direction.
Thanks
Try doing the following:
page_url = df['Page URL'].apply(urlparse)
df['query'] = page_url.apply(lambda u: parse_qs(u.query))
Remember that parse_qs() returns a dictionary, so you will have dictionaries in your query cells.
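For illustration, here is a minimal standalone example of the shape parse_qs() produces (the URL is made up):
from urllib.parse import urlparse, parse_qs

# parse_qs maps each parameter name to a *list* of values
qs = parse_qs(urlparse("https://example.com/search?ei=0kkkk&q=pandas").query)
print(qs)           # {'ei': ['0kkkk'], 'q': ['pandas']}
print(qs["ei"][0])  # '0kkkk' -- a single value is the first list element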
Using apply
Ex:
import pandas as pd
from urllib.parse import urlparse, parse_qs
df = pd.DataFrame({"Page URL": ["https://www.google.com/search?ei=0kkkk"]})
df['query'] = df['Page URL'].apply(urlparse).apply(lambda x: parse_qs(x.query))
print(df)
Output:
Page URL query
0 https://www.google.com/search?ei=0kkkk {'ei': ['0kkkk']}
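If you later want a single parameter as a plain column rather than a dict, a small follow-up sketch (the key 'ei' is just the one from the example above):
# take the first value of the 'ei' parameter, or None when absent
df['ei'] = df['query'].apply(lambda d: d.get('ei', [None])[0])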
I want to filter a JSON file from pool.pm.
For example, from https://pool.pm/wallet/$1111 I want to get all the "policy" values and how many times each one repeats, and organize them in an Excel file using openpyxl, with the policies in the first column and the counts in the second column.
What I have:
import openpyxl
import json
import requests
from bs4 import BeautifulSoup
pooladd = input("Enter the address:\t")
api_pool = "https://pool.pm/wallet/{}"
api_pool2 = api_pool.format(pooladd)
data = requests.get(api_pool2).json()
Output:
Policy                                                      How Many
f0ff48bbb7bbe9d59a40f1ce90e9e9d0ff5002ec48f232b49ca0fb9a    10
8728079879ce4304fd305b12878e0fcea3b8a3a5435a21b5dec35a11    3
Another implementation; comments are added to the code.
Build the dictionary and then write the Excel file.
import requests

pooladd = input("Enter the address:\t")
api_pool = "https://pool.pm/wallet/{}"
api_pool2 = api_pool.format(pooladd)
data = requests.get(api_pool2).json()
policies = [e_tkn['policy'] for e_tkn in data['tokens']]
unique_policies = set(policies)
result = {}
for each_policy in unique_policies:
    result[each_policy] = policies.count(each_policy)
# sort in descending order (reverse=True), swapping to (count, policy) tuples
result = sorted(((value, key) for (key, value) in result.items()), reverse=True)
print(result)
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws['A1'] = 'Policy'    # header
ws['B1'] = 'How Many'  # header
row = 2
for count, policy in result:
    ws['A' + str(row)] = policy  # for each row, for example A2 = policy
    ws['B' + str(row)] = count   # for each row, for example B2 = count
    row = row + 1
wb.save('policies.xlsx')
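As a side note, the counting step can also be done with collections.Counter; a minimal sketch (not part of the original answer, and note the tuples come out as (policy, count) rather than (count, policy)):
from collections import Counter

# most_common() returns (policy, count) pairs sorted by count, descending
counts = Counter(policies).most_common()
print(counts)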
I don't see a pandas tag, but in case you're interested, here is an approach with value_counts:
#pip install pandas
import pandas as pd
out = (
pd.Series([t["policy"] for t in data["tokens"]])
.value_counts()
.to_frame("How Many")
.reset_index(names="Policy")
)
If you want to make an Excel spreadsheet, use pandas.DataFrame.to_excel:
out.to_excel("output_file.xlsx", sheet_name="test_name", index=False)
# Output :
print(out)
Policy How Many
0 f0ff48bbb7bbe9d59a40f1ce90e9e9d0ff5002ec48f232b49ca0fb9a 10
1 8728079879ce4304fd305b12878e0fcea3b8a3a5435a21b5dec35a11 3
I have data like this. What I am trying to do is create a rule based on domain names for my project. I want to add a new column named new_url based on the domains: if a domain contains .cdn., the new value is the string before .cdn.; otherwise, the url_parser library is called to parse the URL another way. The problem is that the CSV file I create (cleanurl.csv) has no new_url column. When I print the parsed URLs in the code, I can see them, and the if and else branches are working. Could you help me, please?
import pandas as pd
import url_parser
from url_parser import parse_url,get_url,get_base_url
import numpy as np
df = pd.read_csv("C:\\Users\\myuser\\Desktop\\raw_data.csv", sep=';')
i = -1
for x in df['domain']:
    i = i + 1
    print("*", x, "*")
    if '.cdn.' in x:
        parsed_url = x.split('.cdn')[0]
        print(parsed_url)
        df.iloc[i]['new_url'] = parsed_url
    else:
        parsed_url = get_url(x).domain + '.' + get_url(x).top_domain
        print(parsed_url)
        df.iloc[i]['new_url'] = parsed_url
df.to_csv("C:\\Users\\myuser\\Desktop\\cleanurl.csv", sep=';')
Use .loc[row, 'column'] to create the new column. df.iloc[i]['new_url'] = ... is chained indexing: df.iloc[i] returns a copy of the row, so the assignment lands on that copy and never reaches the original DataFrame.
for idx, x in df['domain'].items():
    if '.cdn.' in x:
        df.loc[idx, 'new_url'] = x.split('.cdn')[0]
    else:
        df.loc[idx, 'new_url'] = get_url(x).domain + '.' + get_url(x).top_domain
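Equivalently, the whole rule can be applied in one pass with Series.apply; a small sketch reusing the question's own logic (the helper name clean_domain is just for illustration):
def clean_domain(x):
    # string before '.cdn' when present, otherwise rebuild from url_parser
    if '.cdn.' in x:
        return x.split('.cdn')[0]
    return get_url(x).domain + '.' + get_url(x).top_domain

df['new_url'] = df['domain'].apply(clean_domain)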
I have a CSV file, which I am using to search uniprot.org for multiple variants of a protein. An example of such a search is the following website:
https://www.uniprot.org/uniprot/?query=KU168294+env&sort=score
import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv
df = pd.read_csv('Env_seq_list.csv')
second_column_df = df['Accession']
for row in second_column_df:
    theurl = 'https://www.uniprot.org/uniprot/?query=' + row + '+env&sort=score'
    page = requests.get(theurl).content
    df_list = pd.read_html(page)
    df = df_list[-1]
    print(df.loc[df['Gene names'] == 'env'])
If I run the print call, it works fine and I get back the tables I'm after. I'm stuck at this point because if I instead use the pandas df.to_csv function, I cannot get it to work alongside the df.loc function. Additionally, simply using df.to_csv only writes the last search result to the .csv, which I'm pretty sure is because that call sits inside the for loop, but I am unsure how to fix this. Any help would be greatly appreciated :-)
I would suggest that you take the df you find each time through the loop, and append it to a 'final' df. Then outside the loop, you can run to_csv on that 'final' df. Code below:
final_df = pd.DataFrame()
for row in second_column_df:
    theurl = 'https://www.uniprot.org/uniprot/?query=' + row + '+env&sort=score'
    page = requests.get(theurl).content
    df_list = pd.read_html(page)
    df = df_list[-1]
    # print(df.loc[df['Gene names'] == 'env'])
    final_df = pd.concat([final_df, df.loc[df['Gene names'] == 'env']], axis=0)
final_df.to_csv('/path/to/save/csv')
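A common variant of the same idea, sketched below, collects the pieces in a list and concatenates once at the end, which avoids re-copying final_df on every iteration:
pieces = []
for row in second_column_df:
    theurl = 'https://www.uniprot.org/uniprot/?query=' + row + '+env&sort=score'
    page = requests.get(theurl).content
    df = pd.read_html(page)[-1]
    pieces.append(df.loc[df['Gene names'] == 'env'])
# one concat at the end instead of one per iteration
final_df = pd.concat(pieces, axis=0)
final_df.to_csv('/path/to/save/csv')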
I'm trying to make an API call and save the response as a DataFrame.
The problem is that I need the data from the 'result' field.
I haven't managed to do that.
I'm basically just trying to save the API response as a CSV file in order to work with it.
P.S. When I do this with a "JSON to CSV converter" from the web, it works as I wish (example: https://konklone.io/json/).
import requests
import pandas as pd
import json
res = requests.get("http://api.etherscan.io/api?module=account&action=txlist&address=0xddbd2b932c763ba5b1b7ae3b362eac3e8d40121a&startblock=0&endblock=99999999&sort=asc&apikey=YourApiKeyToken")
j = res.json()
j
df = pd.DataFrame(j)
df.head()
Try this:
import requests
import pandas as pd

res = requests.get("http://api.etherscan.io/api?module=account&action=txlist&address=0xddbd2b932c763ba5b1b7ae3b362eac3e8d40121a&startblock=0&endblock=99999999&sort=asc&apikey=YourApiKeyToken")
j = res.json()
# print(j)
filename = "temp.csv"
df = pd.DataFrame(j['result'])  # the transaction records live under 'result'
print(df.head())
df.to_csv(filename)
Looks like you need:
df = pd.DataFrame(j["result"])
I'm trying to get data from this site
and then use some of it. Sorry for not copy-pasting it, but it's a long XML document. So far I have tried to get the data this way:
from urllib.request import urlopen
url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
s = urlopen(url)
content = s.read()
As print(content) looks good, I would now like to extract data from entries like this:
<tabela_rozklad data-aktualizacji="1480583567">
<DZIEN>2</DZIEN>
<GODZ>3</GODZ>
<ILOSC>2</ILOSC>
<TYG>0</TYG>
<ID_NAUCZ>66</ID_NAUCZ>
<ID_SALA>79</ID_SALA>
<ID_PRZ>104</ID_PRZ>
<RODZ>W</RODZ>
<GRUPA>1</GRUPA>
<ID_ST>13</ID_ST>
<SEM>1</SEM>
<ID_SPEC>0</ID_SPEC>
</tabela_rozklad>
How can I handle this data so it's easy to use?
You can use BeautifulSoup and capture the tags you want. The code below should get you started!
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
# fetch the url content
response = requests.get(url).content
# html parsers lowercase tag names, hence the lowercase tag list below
soup = BeautifulSoup(response, 'html.parser')
# find each tabela_rozklad
tables = soup.find_all('tabela_rozklad')
# each tabela_rozklad appears to contain 12 nested tags
tags = ['dzien', 'godz', 'ilosc', 'tyg', 'id_naucz', 'id_sala',
'id_prz', 'rodz', 'grupa', 'id_st', 'sem', 'id_spec']
# collect one row per tabela_rozklad, pulling the text of each tag
rows = [[table.find(tag).text for tag in tags] for table in tables]
# build the dataframe with the tags as columns
df = pd.DataFrame(rows, columns=tags)
# display first 5 rows of table
df.head()
# and the shape of the data
df.shape # 665 rows, 12 columns
# and now you can get to the information using traditional pandas functionality
# for instance, count observations by rodz
df.groupby('rodz').count()
# or subset only observations where rodz = J
J = df[df.rodz == 'J']
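As an aside, recent pandas (>= 1.3, with lxml installed) can often read flat, repeating XML like this directly; a hedged sketch assuming the elements repeat exactly as shown above:
import pandas as pd

# each <tabela_rozklad> element becomes one row of the dataframe
df = pd.read_xml("http://degra.wi.pb.edu.pl/rozklady/webservices.php?", xpath="//tabela_rozklad")
print(df.head())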