I have webpages from which I would like to extract values and store them in separate columns. Furthermore, I want to extract each movie title and insert it as a new column, repeated over all the rows that were collected from that title's page.
For example (expected output):
                                   Location Name   Latitude   Longitude  \
0               1117 Broadway (Gil's Music Shop)  47.252495 -122.439644
1  2715 North Junett St (Kat and Bianca's House)  47.272591 -122.474480
2                                  Aurora Bridge  47.646713 -122.347435
3                       Buckaroo Tavern (closed)  47.657841 -122.350327

                                      movie
0  10-things-i-hate-about-you-locations-250
1  10-things-i-hate-about-you-locations-250
2  10-things-i-hate-about-you-locations-250
3  10-things-i-hate-about-you-locations-250
...
What I have tried:
test = ['https://www.latlong.net/location/10-cloverfield-lane-locations-553',
        'https://www.latlong.net/location/10-things-i-hate-about-you-locations-250',
        'https://www.latlong.net/location/12-angry-men-locations-818']
url_test = []
for i in range(len(test)):
    df = pd.read_html(test[i])[0]
    df['movie'] = test[i].split('/')[-1]
However, this gives only the following output:
                Location Name   Latitude  Longitude  \
0               New York City  40.742298 -73.982559
1  New York County Courthouse  40.714310 -74.001930

                        movie
0  12-angry-men-locations-818
1  12-angry-men-locations-818
This is missing the rest of the results. I get the feeling it's because the data ends up split across separate pandas DataFrames, so I tried merging before appending the columns, using:
url_test = []
for i in range(len(test)):
    df = pd.read_html(test[i])[0]
    df = pd.merge(df, how='inner')
    df['movie'] = test[i].split('/')[-1]
But I get the following error:
TypeError: merge() missing 1 required positional argument: 'right'
Try:
test = ['https://www.latlong.net/location/10-cloverfield-lane-locations-553',
        'https://www.latlong.net/location/10-things-i-hate-about-you-locations-250',
        'https://www.latlong.net/location/12-angry-men-locations-818']
url_test = []
for i in range(len(test)):
    df = pd.read_html(test[i])[0]
    df['movie'] = test[i].split('/')[-1]
    url_test.append(df)
final_df = pd.concat(url_test, ignore_index=True)
print(final_df)
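If you prefer, the same thing can be collapsed into a single concat over a generator (a sketch using the same test list as above):

final_df = pd.concat(
    (pd.read_html(u)[0].assign(movie=u.split('/')[-1]) for u in test),
    ignore_index=True)

Either way, the key point is to collect every page's DataFrame and concatenate once at the end, rather than overwriting df on each iteration.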
Related
My processing and data scraping of the URLs (looping over all the given links) looks like:
for url in urls:
    page = requests.get(url)
    # fetch and process the page here, acquiring each car's info (one car per page)
    print(car.name)
    print(car_table)
and the output:
BMW
['color','red','weight','50kg','height','120cm','width','200cm','serial','','owner','']
FORD
['color','','weight','','height','','width','','serial','','owner','']
HONDA
['color','blue','weight','60kg','height','','width','160cm','serial','OMEGA','owner','']
In the end, how can I get a dataframe like the one below, considering that I don't know the number of car fields (columns) or the number of cars (index) in advance, but need a df defined with them as columns and index?
print(car_df)
-----|color|weight|height|width|serial|owner
BMW  |red  |50    |120   |200  |      |
FORD |     |      |      |     |      |
HONDA|blue |60    |      |160  |OMEGA |
any help appreciated :)
This approach creates a list of dicts as we iterate through the URLs, and then after the loop converts the list to a DataFrame. I'm assuming that car_table is always a column name followed by its value, over and over again.
import pandas as pd
import numpy as np

# Creating lists from your output instead of requesting from the url since you didn't share that
car_names = ['BMW', 'FORD', 'HONDA']
car_tables = [
    ['color','red','weight','50kg','height','120cm','width','200cm','serial','','owner',''],
    ['color','','weight','','height','','width','','serial','','owner',''],
    ['color','blue','weight','60kg','height','','width','160cm','serial','OMEGA','owner',''],
]
urls = range(len(car_names))

all_car_data = []
for url in urls:
    car_name = car_names[url]    # using car_name instead of car.name for this example
    car_table = car_tables[url]  # again, you get this value some other way

    car_data = {'name': car_name}
    columns = car_table[::2]   # starting from 0, skip every other entry to just get the columns
    values = car_table[1::2]   # starting from 1, skip every other entry to just get the values

    # Zip the columns together with the values, then iterate and update the dict
    for col, val in zip(columns, values):
        car_data[col] = val

    # Add the dict to a list to keep track of all the cars
    all_car_data.append(car_data)

# Convert to a dataframe
df = pd.DataFrame(all_car_data)
#df = df.replace({'':np.NaN})  # you can use this if you want to replace the '' with NaNs
df
Output:
    name color weight height  width serial owner
0    BMW   red   50kg  120cm  200cm
1   FORD
2  HONDA  blue   60kg         160cm  OMEGA
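As an aside, since car_table alternates column names and values, each row dict can also be built in one step with dict(zip(...)); a compact sketch under the same assumptions:

all_car_data = [{'name': name, **dict(zip(table[::2], table[1::2]))}
                for name, table in zip(car_names, car_tables)]
df = pd.DataFrame(all_car_data).set_index('name')

The set_index('name') puts the car names on the index, matching the desired car_df layout.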
I am trying to create a DataFrame with more than 500 rows, derived from an API query. When I check the length of my arrays, as so:
print(len(cities), len(country), len(max_temp), len(latit), len(longit), len(humid_), len(cloud_), len(wind))
I get the following output:
577 526 526 526 526 526 526 526
Now, I read the answers about casting these to Series, which adds NaN to the empty cells. The problem is that this mismatches the column values, i.e., all the numeric values are listed first, then all the NaN's at the end. This will cause the country, max_temp, etc., to line up with the wrong city. What I want to do is have the NaN appear in the correct row of each city with missing data. I could simply dropna if I had a DataFrame; but with the different array lengths, I cannot get a DataFrame.
Okay, editing in light of the comments: I began with a randomly generated list of coordinates, then:
for lat_lng in lat_lngs:
    city = citipy.nearest_city(lat_lng[0], lat_lng[1]).city_name
    # If the city is unique, then add it to our cities list
    if city not in cities:
        cities.append(city)
This generated a list of cities. Then I did:
country = []
latit = []
longit = []
max_temp = []
humid_ = []
cloud_ = []
wind = []
for city in cities:
    try:
        query_url = base_url + "q=" + city + "&appid=" + weather_api_key
        response = requests.get(query_url).json()
        country.append(response['sys']['country'])
        latit.append(response['coord']['lat'])
        longit.append(response['coord']['lon'])
        max_temp.append(response['main']['temp_max'])
        humid_.append(response['main']['humidity'])
        cloud_.append(response['clouds']['all'])
        wind.append(response['wind']['speed'])
    except:
        print('Data not found.')
What I believe is occurring is that I am getting an array something like this:
City         Country  Max Temp  (etc.)
Boston       US       30        (etc.)
Honolulu
Rome         IT       27
Vladivostok  RU       20
In this example, "Honolulu" had no data, so generated a row with only the city Column filled. I can't be sure, since I can't view it as a DataFrame. What I want to do is either put NaN in the same row as Honolulu, or drop the row with Honolulu.
So after consulting with an expert offline, I got this solution:
In my lists of variables, add a new one:
new_city = []
country = []
latit = []
longit = []
max_temp = []
humid_ = []
cloud_ = []
wind = []
Then, in my try/except loop:
for city in cities:
    try:
        query_url = base_url + "q=" + city + "&appid=" + weather_api_key
        response = requests.get(query_url).json()
        new_city.append(response['name'])
        country.append(response['sys']['country'])
        latit.append(response['coord']['lat'])
        longit.append(response['coord']['lon'])
        max_temp.append(response['main']['temp_max'])
        humid_.append(response['main']['humidity'])
        cloud_.append(response['clouds']['all'])
        wind.append(response['wind']['speed'])
    except:
        print('Data not found.')
And finally, build my dataframe with that new_city variable instead of the original city variable. This gives me all lists of the same length.
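For reference, the final assembly might look something like this (a sketch; the exact column names are my assumption, not from the original post):

weather_df = pd.DataFrame({
    'City': new_city,
    'Country': country,
    'Lat': latit,
    'Lng': longit,
    'Max Temp': max_temp,
    'Humidity': humid_,
    'Cloudiness': cloud_,
    'Wind Speed': wind,
})

Because every append happens inside the same try block, a failed lookup skips the city entirely and all eight lists stay the same length.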
The dataframe ip_df, shown below, has to be converted into the format of op_df.
ip_df = pd.DataFrame({'class': ['I','II','III'],
                      'details': [[{'sec':'A','assigned_to':'tom'}, {'sec':'B','assigned_to':'sam'}],
                                  [{'sec':'B','assigned_to':'joe'}],
                                  []]})
ip_df:
class details
0 I [{'sec':'A','assigned_to':'tom'},{'sec':'B','assigned_to':'sam'}]
1 II [{'sec':'B','assigned_to':'joe'}]
2 III []
The required output dataframe is supposed to be:
op_df:
class sec assigned_to
0 I A tom
1 I B sam
2 II B joe
3 III NaN NaN
How can I turn each dictionary in the "details" column into a new row, with the dictionary keys as column names and the dictionary values as the respective column values?
I have tried:
ip_df.join(ip_df['details'].apply(pd.Series))
but I am unable to get the "op_df" format from it.
I am sure there are better ways to do it, but I had to deconstruct your details list and create your dataframe as follows:
dict_values = {'class': ['I','II','III'],
               'details': [[{'sec':'A','assigned_to':'tom'}, {'sec':'B','assigned_to':'sam'}],
                           [{'sec':'B','assigned_to':'joe'}],
                           []]}
all_values = []
for cl, detail in zip(dict_values['class'], dict_values['details']):
    if len(detail) > 0:
        for innerdict in detail:
            row = {'class': cl}
            for innerkey in innerdict.keys():
                row[innerkey] = innerdict[innerkey]
            all_values.append(row)
    else:
        row = {'class': cl}
        all_values.append(row)
op_df = pd.DataFrame(all_values)
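One of those better ways, for what it's worth: pandas 0.25+ has DataFrame.explode, and pandas 1.0+ exposes pd.json_normalize at the top level, which pair well here. A sketch (the isinstance guard handles the empty-list rows, which explode turns into NaN):

tmp = ip_df.explode('details').reset_index(drop=True)
details = pd.json_normalize([d if isinstance(d, dict) else {} for d in tmp['details']])
op_df = pd.concat([tmp['class'], details], axis=1)

This yields the same op_df, with NaN in sec and assigned_to for class III.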
I have a dataframe where I am creating a new column and populating its values. Based on a condition, the new column needs to have values appended to it whenever that row is encountered again.
So for example for a given dataframe:
df
id  Stores               is_open
1   'Walmart', 'Target'  true
2   'Best Buy'           false
3   'Target'             true
4   'Home Depot'         true
Now I want to add a new Ticker column that holds either a comma-separated string of tickers or a list (whichever is preferable and easier; no preference on my end) for the given comma-separated stores.
For example, the ticker of Walmart is wmt and Target's is tgt. I am getting the wmt and tgt data from another dataframe by matching on a key, so I tried the following, but not all rows get assigned even though they have values, and only a single value followed by a comma lands in the Tickers column instead of multiple:
df['Tickers'] = ''
for _, row in df.iterrows():
    stores = row['Stores']
    list_stores = stores.split(',')
    if len(list_stores) > 1:
        for store in list_stores:
            tmp_df = second_df[second_df['store_id'] == store]
            ticker = tmp_df['Ticker'].values[0] if len(tmp_df['Ticker'].values) > 0 else None
            if ticker:
                df.loc[df['Stores'].astype(str).str.contains(store), 'Ticker'] += '{},'.format(ticker)
Expected output:
id  Stores               is_open  Ticker
1   'Walmart', 'Target'  true     wmt, tgt
2   'Best Buy'           false    bby
3   'Target'             true     tgt
4   'Home Depot'         true     nan
I would really appreciate if someone could help me out here.
You can use the apply method with axis=1 to pass the row and perform your calculations. See the code below:
import pandas as pd
mydict = {'id':[1,2],'Store':["'Walmart','Target'","'Best Buy'"], 'is_open':['true', 'false']}
df = pd.DataFrame(mydict, index=[0,1])
df.set_index('id',drop=True, inplace=True)
The df so far:
Store is_open
id
1 'Walmart','Target' true
2 'Best Buy' false
The lookup dataframe:
df2 = pd.DataFrame({'Store':['Walmart', 'Target','Best Buy'], 'Ticker':['wmt','tgt','bby']})
Store Ticker
0 Walmart wmt
1 Target tgt
2 Best Buy bby
Here is the code for adding the column:
def add_column(row):
    items = row['Store'].split(',')
    tkr_list = []
    for string in items:
        mystr = string.replace("'", "")
        tkr = df2.loc[df2['Store'] == mystr, 'Ticker'].values[0]
        tkr_list.append(tkr)
    return tkr_list

df['Ticker'] = df.apply(add_column, axis=1)
and this is the result for df:
Store is_open Ticker
id
1 'Walmart','Target' true [wmt, tgt]
2 'Best Buy' false [bby]
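If you want the comma-separated string from the expected output instead of a list, plus a NaN for rows whose stores have no ticker (like 'Home Depot'), here is a hedged variant of add_column (the guard on the empty lookup is my addition):

import numpy as np

def add_column(row):
    tkr_list = []
    for string in row['Store'].split(','):
        mystr = string.replace("'", "").strip()
        match = df2.loc[df2['Store'] == mystr, 'Ticker']
        if not match.empty:  # skip stores that have no row in the lookup df2
            tkr_list.append(match.values[0])
    return ', '.join(tkr_list) if tkr_list else np.nan

df['Ticker'] = df.apply(add_column, axis=1)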
I am looping through a list of 103 FourSquare URLs to find "Coffee Shops."
I can create a DataFrame for each URL and print each DataFrame as I loop through the list (sample output at bottom).
I cannot figure out how to append the DataFrame for each URL into a single DataFrame as I loop through the list. My goal is to compile a single DataFrame from the DataFrames I am printing.
x = 0
while x < 103:
    results = requests.get(URLs[x]).json()

    def get_category_type(row):
        try:
            categories_list = row['categories']
        except:
            categories_list = row['venue.categories']
        if len(categories_list) == 0:
            return None
        else:
            return categories_list[0]['name']

    venues = results['response']['groups'][0]['items']
    nearby_venues = json_normalize(venues)  # flatten JSON

    # filter columns
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
    nearby_venues = nearby_venues.loc[:, filtered_columns]

    # filter the category for each row
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

    # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

    dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
    print(x, '!!!', dfven, '\n')
    x = x + 1
Here is some output (I do get complete results):
0 !!! name categories lat lng
5 Tim Hortons Coffee Shop 43.80200 -79.198169
8 Tim Hortons / Esso Coffee Shop 43.80166 -79.199133
1 !!! Empty DataFrame
Columns: [name, categories, lat, lng]
Index: []
2 !!! name categories lat lng
5 Starbucks Coffee Shop 43.770367 -79.186313
18 Tim Hortons Coffee Shop 43.769591 -79.187081
3 !!! name categories lat lng
0 Starbucks Coffee Shop 43.770037 -79.221156
4 Country Style Coffee Shop 43.773716 -79.207027
I apologize if this is bad form or a breach of etiquette, but I solved my problem and figured I should post. Perhaps making an effort to state the problem for StackOverflow helped me solve it?
First I learned how to ignore empty DataFrames:
dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
if not dfven.empty:
Once I added this check, my printed output was a clean series of identically formatted DataFrames, so appending them into one DataFrame was easy. I created a DataFrame at the beginning of my code (merge = pd.DataFrame()) and then added this line where I was printing:
merge = merge.append(dfven)
Now my output is perfect.
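One caveat worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same idea is usually written by collecting the frames in a list and concatenating once after the loop; a minimal sketch:

frames = []
# inside the loop, instead of merge = merge.append(dfven):
if not dfven.empty:
    frames.append(dfven)
# after the loop:
merge = pd.concat(frames, ignore_index=True)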