Overwriting values in column created with Python for loop

I'm building an automated MLB schedule from a base URL and a loop over a list of team names as they appear in the URL. Using pd.read_html I get each team's schedule. The only thing I'm missing, for each team's page, is the team name itself, which I'd like as a new column 'team_name'. A small sample of my goal is at the end of this post.
Below is what I have so far; if you run this, the printout does exactly what I need for just one team.
import pandas as pd

url_base = "https://www.teamrankings.com/mlb/team/"
team_list = ['seattle-mariners']
df = pd.DataFrame()
for team in team_list:
    new_url = url_base + team
    df = df.append(pd.read_html(new_url)[1])
    df['team_name'] = team
print(df[['team_name', 'Opponent']])
The trouble is, when I have all 30 teams in team_list, the value of team_name keeps getting overwritten, so that all 4000+ records list the same team name (the last one in team_list). I've tried dynamically assigning only certain rows the team value by using
df['team_name'][a:b] = team
where a and b are the starting and ending rows in the dataframe for that team, but this gives KeyError: 'team_name'. I've also tried using placeholder series and dataframes for team_name and then merging with df later, but I get duplication errors. On a larger scale, what I'm looking for is this:
team_name opponent
0 seattle-mariners new-york-yankees
1 seattle-mariners new-york-yankees
2 seattle-mariners boston-red-sox
3 seattle-mariners boston-red-sox
4 seattle-mariners san-diego-padres
5 seattle-mariners san-diego-padres
6 cincinatti-reds new-york-yankees
7 cincinatti-reds new-york-yankees
8 cincinatti-reds boston-red-sox
9 cincinatti-reds boston-red-sox
10 cincinatti-reds san-diego-padres
11 cincinatti-reds san-diego-padres

The original code df['team_name'] = team rewrites team_name for the entire df on every iteration. The code below instead builds a per-team placeholder, df_team, sets team_name on it, collects each piece, and concatenates them at the end.
import pandas as pd

url_base = "https://www.teamrankings.com/mlb/team/"
team_list = ['seattle-mariners', 'houston-astros']
Option A: for loop
df_list = list()
for team in team_list:
    new_url = url_base + team
    df_team = pd.read_html(new_url)[1]
    df_team['team_name'] = team
    df_list.append(df_team)
df = pd.concat(df_list)
Option B: list comprehension:
df_list = [pd.read_html(url_base + team)[1].assign(team_name=team) for team in team_list]
df = pd.concat(df_list)
df.head()
df.tail()
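For reference, here is a minimal, self-contained sketch of the same pattern with made-up schedule data, so it can be run without hitting the website; the dummy frames below simply stand in for whatever pd.read_html returns:

import pandas as pd

# Hypothetical stand-ins for pd.read_html(url_base + team)[1]; dummy data for illustration only.
schedules = {
    'seattle-mariners': pd.DataFrame({'Opponent': ['new-york-yankees', 'boston-red-sox']}),
    'houston-astros': pd.DataFrame({'Opponent': ['texas-rangers', 'oakland-athletics']}),
}

# Tag each team's frame with its own name, then concatenate once at the end.
df_list = [schedule.assign(team_name=team) for team, schedule in schedules.items()]
df = pd.concat(df_list, ignore_index=True)
print(df[['team_name', 'Opponent']])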

Related

Pandas Python: Delete row values by name

I have a csv list of keywords in this format:
75410,Sportart
75419,Ballsport
75428,Basketball
76207,Atomenergie
76212,Atomkraftwerk
76223,Wiederaufarbeitung
76225,Atomlager
67869,Werbewirtschaft
I read the values using pandas and create a table in this format:
DF: name
id
75410 Sportart
75419 Ballsport
75428 Basketball
76207 Atomenergie
76212 Atomkraftwerk
... ...
251450 Tag und Nacht
241473 Kollektivverhalten
270930 Indigene Völker
261949 Wirtschaft und Politik
282512 Impfen
Using the name, I want to delete the whole row, e.g. 'Sportart' deletes the first row.
I want to check this against the values in my wordList array; I store them as strings in a list.
What did I miss? With the code below I receive a '(value) not in axis' error.
df = pd.read_csv("labels.csv", header=None, index_col=0)
df.index.name = "id"
df.columns = ["name"]
print('DF: ',df)
df.drop(labels=wordList, axis=0, inplace=True)
pd_frame = pd.DataFrame(df)
cleaned_pd_frame = pd_frame.query('name != {}'.format(wordList))
I succeeded in hiding them with query(), but I want to remove them entirely.
The error occurs because drop(labels=...) looks for those labels in the index (the numeric ids), not in the name column. Instead, you can use a helper function, index_to_drop below, that takes a name and returns the matching index labels to drop:
index_to_drop = lambda name: df.index[df['name']==name]
Then you can drop "Sportart" like:
df.drop(index_to_drop('Sportart'), inplace=True)
print(df)
Output:
                          name
id
75419                Ballsport
75428               Basketball
76207              Atomenergie
76212            Atomkraftwerk
251450           Tag und Nacht
241473      Kollektivverhalten
270930         Indigene Völker
261949  Wirtschaft und Politik
282512                  Impfen
That being said, this is just a convoluted way to drop a row. The same outcome can be obtained much more simply with a boolean mask:
df = df[df['name']!='Sportart']
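Since the question actually works with a whole wordList rather than a single name, the same mask combined with isin drops every matching row in one pass (a sketch, assuming wordList is a plain list of name strings):

# Assumes wordList is a list of strings, e.g. ['Sportart', 'Ballsport'].
df = df[~df['name'].isin(wordList)]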

Convert scraped list to pandas DataFrame using columns and index

The scraping process for each URL (looping over all given links) looks like this:
for url in urls:
    page = requests.get(url)
    # fetch and process the page here and acquire the car's info, one car per page
    print(car.name)
    print(car_table)
and the output:
BMW
['color','red','weight','50kg','height','120cm','width','200cm','serial','','owner','']
FORD
['color','','weight','','height','','width','','serial','','owner','']
HONDA
['color','blue','weight','60kg','height','','width','160cm','serial','OMEGA','owner','']
In the end, how can I get a DataFrame like the one below, given that I don't know the number of car fields (columns) or the number of cars (index) in advance, but still want them to become the columns and index of the df?
print(car_df)
-----|color|weight|height|width|serial|owner
BMW |red 50 120 200
FORD |
HONDA|blue 60 160 OMEGA
any help appreciated :)
The approach here is to create a list of dicts as we iterate through the urls, and then after the loop convert that list to a DataFrame. I'm assuming that car_table is always a column name followed by its value, over and over again.
import pandas as pd
import numpy as np

# Creating lists from your output instead of requesting from the url since you didn't share that
car_names = ['BMW', 'FORD', 'HONDA']
car_tables = [
    ['color','red','weight','50kg','height','120cm','width','200cm','serial','','owner',''],
    ['color','','weight','','height','','width','','serial','','owner',''],
    ['color','blue','weight','60kg','height','','width','160cm','serial','OMEGA','owner',''],
]
urls = range(len(car_names))

all_car_data = []
for url in urls:
    car_name = car_names[url]    # using car_name instead of car.name for this example
    car_table = car_tables[url]  # again, you get this value some other way
    car_data = {'name': car_name}
    columns = car_table[::2]   # starting from 0, skip every other entry to just get the columns
    values = car_table[1::2]   # starting from 1, skip every other entry to just get the values
    # Zip the columns together with the values, then iterate and update the dict
    for col, val in zip(columns, values):
        car_data[col] = val
    # Add the dict to a list to keep track of all the cars
    all_car_data.append(car_data)

# Convert to a dataframe
df = pd.DataFrame(all_car_data)
# df = df.replace({'': np.nan})  # you can use this if you want to replace the '' with NaNs
df
Output:
name color weight height width serial owner
0 BMW red 50kg 120cm 200cm
1 FORD
2 HONDA blue 60kg 160cm OMEGA
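As a more compact variant of the same idea (a sketch using the sample lists above, not part of the original answer), the column/value pairing can be built with dict(zip(...)) inside a comprehension, with the car name used as the index:

# Pair every other entry (the columns) with the entries that follow them (the values).
records = [
    {'name': name, **dict(zip(table[::2], table[1::2]))}
    for name, table in zip(car_names, car_tables)
]
car_df = pd.DataFrame(records).set_index('name')
print(car_df)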

Output from webscraped page not appended to output from previous page

Similar to my last question, I'm having an iteration issue. I'm using the code
df1 = pd.DataFrame({'Username': [name.text for name in (soup.findAll('p',{'class':'profile-name'}))]})
to get the list of names from one web page. However, when I try this for all pages, it creates new tables for each page instead of appending the output from each page together.
So for page 1 I'd get
Username
0 Alice
1 Bob
2 Carl
Page 2 :
Username
0 Sandra
1 Paula
2 Tim
etc. But I want my output to be:
Username
0 Alice
1 Bob
2 Carl
3 Sandra
4 Paula
5 Tim
Below is my full code (with the url omitted) for iterating through all the pages
for pageno in range(0, 99):
    page = requests.get('full url' + str(pageno))
    soup = BeautifulSoup(page.text, 'html.parser')
    df1 = pd.DataFrame({'Username': [name.text for name in soup.findAll('p', {'class': 'profile-name'})]})
How can I fix this?
Thank you.
Well, your issue is that you are creating a new df on each iteration, so the previous pages' records are not kept.
You might want to append your usernames to a single list and then load that list into a dataframe:
username_list = []
for pageno in range(0, 99):
    page = requests.get('full url' + str(pageno))
    soup = BeautifulSoup(page.text, 'html.parser')
    username_list += [name.text for name in soup.findAll('p', {'class': 'profile-name'})]
df1 = pd.DataFrame({'Username': username_list})
The question is pretty unclear, but I guess this is what you wanted?
output_df = pd.DataFrame()
for pageno in range(0, 99):
    page = requests.get('full url' + str(pageno))
    soup = BeautifulSoup(page.text, 'html.parser')
    df1 = pd.DataFrame({'Username': [name.text for name in soup.findAll('p', {'class': 'profile-name'})]})
    output_df = pd.concat([output_df, df1])
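Note that concatenating inside the loop re-copies output_df on every iteration. A slightly more efficient variant (a sketch, not part of the original answers) collects the per-page frames in a list and concatenates once after the loop, keeping the same 'full url' placeholder as above:

import requests
import pandas as pd
from bs4 import BeautifulSoup

frames = []
for pageno in range(0, 99):
    page = requests.get('full url' + str(pageno))
    soup = BeautifulSoup(page.text, 'html.parser')
    # Collect each page's usernames as its own small DataFrame.
    frames.append(pd.DataFrame({'Username': [name.text for name in soup.findAll('p', {'class': 'profile-name'})]}))
output_df = pd.concat(frames, ignore_index=True)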

Compile one DataFrame from a loop sequence of smaller DataFrames

I am looping through a list of 103 FourSquare URLs to find "Coffee Shops."
I can create a DataFrame for each URL and print each DataFrame as I loop through the list (sample output at bottom).
I cannot figure out how to append the DataFrame for each URL into a single DataFrame as I loop through the list. My goal is to compile a single DataFrame from the DataFrames I am printing.
x = 0
while x < 103:
    results = requests.get(URLs[x]).json()

    def get_category_type(row):
        try:
            categories_list = row['categories']
        except:
            categories_list = row['venue.categories']
        if len(categories_list) == 0:
            return None
        else:
            return categories_list[0]['name']

    venues = results['response']['groups'][0]['items']
    nearby_venues = json_normalize(venues)  # flatten JSON
    # filter columns
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
    nearby_venues = nearby_venues.loc[:, filtered_columns]
    # filter the category for each row
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
    # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
    dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
    print(x, '!!!', dfven, '\n')
    x = x + 1
Here is some output (I do get complete results):
0 !!! name categories lat lng
5 Tim Hortons Coffee Shop 43.80200 -79.198169
8 Tim Hortons / Esso Coffee Shop 43.80166 -79.199133
1 !!! Empty DataFrame
Columns: [name, categories, lat, lng]
Index: []
2 !!! name categories lat lng
5 Starbucks Coffee Shop 43.770367 -79.186313
18 Tim Hortons Coffee Shop 43.769591 -79.187081
3 !!! name categories lat lng
0 Starbucks Coffee Shop 43.770037 -79.221156
4 Country Style Coffee Shop 43.773716 -79.207027
I apologize if this is bad form or a breach of etiquette but I solved my problem and figured I should post. Perhaps making an effort to state the problem for StackOverflow helped me solve it?
First I learned how to ignore empty DataFrames:
dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
if dfven.empty == False:
Once I added this check, my printed output was a clean series of identically formatted data frames, so appending them into one data frame was easy. I created a data frame at the beginning of my code (merge = pd.DataFrame()) and then added this line where I was printing:
merge = merge.append(dfven)
Now my output is perfect.
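Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same accumulation works by collecting the non-empty frames in a list and concatenating once after the loop. A sketch of the substitution, reusing the names from the code above:

frames = []                               # instead of merge = pd.DataFrame() before the loop
# inside the loop, instead of merge = merge.append(dfven):
frames.append(dfven)
# after the loop:
merge = pd.concat(frames, ignore_index=True)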

Extracting many URLs in a python dataframe

I have a dataframe which contains text including one or more URL(s):
user_id text
1 blabla... http://amazon.com ...blabla
1 blabla... http://nasa.com ...blabla
2 blabla... https://google.com ...blabla ...https://yahoo.com ...blabla
2 blabla... https://fnac.com ...blabla ...
3 blabla....
I want to transform this dataframe into the count of URL(s) per user_id:
user_id count_URL
1 2
2 3
3 0
Is there a simple way to perform this task in Python?
My code starts as follows:
URL = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
for i in range(data.shape[0]):
    for j in range(0, 8):
        URL.iloc[i, j] = re.findall("(?P<url>https?://[^\s]+)", str(data.iloc[i]))
Thank you,
Lionel
In general, the definition of a URL is much more complex than what you have in your example. Unless you are sure you have very simple URLs, you should look up a good pattern.
import re
URLPATTERN = r'(https?://\S+)' # Lousy, but...
First, extract the URLs from each string and count them:
df['urlcount'] = df.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()
Next, group the counts by user id:
df.groupby('user_id').sum()['urlcount']
#user_id
#1 2
#2 3
#3 0
Below is another way to do it:
#read data
import pandas as pd
data = pd.read_csv("data.csv")

#Divide data into URL and user_id and cast it to pandas DataFrame
URL = pd.DataFrame(data.loc[:, "text"].values)
user_id = pd.DataFrame(data.loc[:, "user_id"].values)

#count the number of appearance of the "http" in each row of data
sub = "http"
count_URL = []
for val in URL.iterrows():
    counter = val[1][0].count(sub)
    count_URL.append(counter)

#list to DataFrame
count_URL = pd.DataFrame(count_URL)

#Concatenate the two data frames and apply the code of #DyZ to group by and count the number of url
finalDF = pd.concat([user_id, count_URL], axis=1)
finalDF.columns = ["user_id", "urlcount"]
data = finalDF.groupby('user_id').sum()['urlcount']
print(data.head())
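A more compact variant of this second approach (a sketch, assuming the same data.csv with user_id and text columns) uses the vectorized Series.str.count on the text column directly and then groups by user_id:

import pandas as pd

data = pd.read_csv("data.csv")
# Count URL prefixes per row, then sum per user; the pattern is deliberately simple.
data['urlcount'] = data['text'].str.count(r'https?://')
print(data.groupby('user_id')['urlcount'].sum())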
