I am using investpy to get historical stock data for two stocks (TRP_pb and TRP_pc).
import investpy
import pandas as pd
import numpy as np
TRP_pb = investpy.get_stock_historical_data(stock='TRP_pb',
                                            country='canada',
                                            from_date='01/01/2022',
                                            to_date='01/04/2022')
print(TRP_pb.head())
TRP_pc = investpy.get_stock_historical_data(stock='TRP_pc',
                                            country='canada',
                                            from_date='01/01/2022',
                                            to_date='01/04/2022')
print(TRP_pc.head())
I can append the two tables by using the append method
appendedtable = TRP_pb.append(TRP_pc, ignore_index=False)
What I am trying to do is use a loop to combine these two tables.
Here is what I have tried so far
preferredlist = ['TRP_pb','TRP_pc']
for i in preferredlist:
    new = investpy.get_stock_historical_data(stock=i,
                                             country='canada',
                                             from_date='01/01/2022',
                                             to_date='01/04/2022')
    new.append(new, ignore_index=True)
However, this doesn't work.
I would appreciate any help.
Since get_stock_historical_data returns a DataFrame, you can create an empty DataFrame before the for loop and concat inside the loop. (As an aside, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the safer choice going forward.)
preferredlist = ['TRP_pb','TRP_pc']
final_list = pd.DataFrame()
for i in preferredlist:
    new = investpy.get_stock_historical_data(stock=i,
                                             country='canada',
                                             from_date='01/01/2022',
                                             to_date='01/04/2022')
    final_list = pd.concat([final_list, new])
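Concatenating inside the loop re-copies the accumulated frame on every pass. A slightly leaner sketch of the same idea collects the frames in a list and concatenates once; the Ticker column is an optional extra (not in the original question) so you can still tell the rows apart afterwards:
frames = []
for ticker in preferredlist:
    new = investpy.get_stock_historical_data(stock=ticker,
                                             country='canada',
                                             from_date='01/01/2022',
                                             to_date='01/04/2022')
    new['Ticker'] = ticker  # optional: label each row with its stock
    frames.append(new)
final_list = pd.concat(frames)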
This table from Wikipedia shows the 10 biggest box office hits. I can't seem to get the total of the 'worldwide_gross' column. Can someone help? Thank you.
import pandas as pd
boxoffice_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_highest-grossing_films')
films = boxoffice_df[1]
films.rename(columns={'Worldwide gross(2020 $)': 'worldwide_gross'}, inplace=True)
films.worldwide_gross.sum(axis=0)
This is the output I get when I try calculating the total global earnings:
Total = films['worldwide_gross'].astype('Int64').sum()
Or convert the data types first:
films = films.convert_dtypes()
Total = films['worldwide_gross'].sum()
films = films.astype({"worldwide_gross": int})
Total = films['worldwide_gross'].sum()
Note that these casts only succeed once the column holds plain numbers; grosses above roughly 2.1 billion also overflow Int32, hence Int64. If the values still contain $ signs or commas, strip them first as shown in the next answer.
You will have to keep only the digits in the worldwide_gross column using a regex, and then convert the column to float with Series.astype('float').
Add:
films.worldwide_gross = films.worldwide_gross.str.replace(r'\D', '', regex=True).astype(float)
Complete Code:
import pandas as pd

boxoffice_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_highest-grossing_films')
films = boxoffice_df[1]
films.rename(columns={'Worldwide gross(2020 $)': 'worldwide_gross'}, inplace=True)
# r'\D' (raw string) avoids the invalid-escape warning and drops every non-digit
films.worldwide_gross = films.worldwide_gross.str.replace(r'\D', '', regex=True).astype(float)
films.worldwide_gross.sum(axis=0)
Here's one way you can do it.
This code converts the values in the worldwide_gross column to integers and then sums the column to get the total gross.
import pandas as pd

def get_gross(gross_text):
    # keep the text after the '$' and drop the thousands separators
    pos = gross_text.index('$')
    return int(gross_text[pos + 1:].replace(',', ''))

boxoffice_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_highest-grossing_films')
films = boxoffice_df[1]
films.rename(columns={'Worldwide gross(2020 $)': 'worldwide_gross'}, inplace=True)
films['gross_numeric'] = films['worldwide_gross'].apply(get_gross)
total_gross = films['gross_numeric'].sum()
print(f'Total gross: ${total_gross}')
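For what it's worth, the same cleanup can be done with pandas' vectorized string methods instead of a row-by-row apply; a sketch under the same column-name assumptions:
# take the text after the first '$', drop commas, then cast to integers
films['gross_numeric'] = (films['worldwide_gross']
                          .str.split('$', n=1).str[1]
                          .str.replace(',', '', regex=False)
                          .astype('int64'))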
For example, I have this code:
import pandas
df = pandas.read_csv('covid_19_data.csv')
This dataset has a column called countryterritoryCode, which holds each country's code.
It contains information about COVID-19 cases from all the countries in the world.
How do I create a new dataset where only the USA info appears
(where countryterritoryCode == USA)?
import pandas
df = pandas.read_csv('covid_19_data.csv')
new_df = df[df["country"] == "USA"]
or
new_df = df[df.countryterritoryCode == "USA"]
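If you later need several countries at once, the same boolean-mask pattern works with isin (the extra code here is just illustrative):
new_df = df[df["countryterritoryCode"].isin(["USA", "CAN"])]  # hypothetical list of codes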
Use df.groupby with get_group (grouping with axis=1 would group the columns, not filter the rows):
df = pandas.read_csv('covid_19_data.csv')
df_new = df.groupby('countryterritoryCode').get_group('USA')
I have a dataframe like this.
import pandas as pd

# create dataframe
df = pd.DataFrame({"Date": range(0, 22),
                   "Country": ["USA"] * 22,
                   "Number": [0, 0, 0, 0, 0, 1, 1, 3, 5, 6, 4, 6, 7, 8, 7, 10, 25, 50, 75, 60, 45, 100],
                   "Number is Corrected": [0, 0, 0, 0, 0, 1, 1, 3, 5, 6, 6, 6, 7, 7, 7, 10, 25, 50, 50, 60, 60, 100]})
But this dataframe has a problem: some numbers are wrong. A number should never be smaller than the one before it (see 6, 4, 6, 7, 8, 7 ... 50, 75, 60, 45, 100).
I don't use df.sort because it's not about sorting, it's about correction.
Edit: I added the corrected numbers in the "Number is Corrected" column.
Guessing from your "Number is Corrected" column, you could probably use this:
import pandas as pd

# create dataframe
df = pd.DataFrame({"Date": range(0, 22),
                   "Country": ["USA"] * 22,
                   "Number": [0, 0, 0, 0, 0, 1, 1, 3, 5, 6, 4, 6, 7, 8, 7, 10, 25, 50, 75, 60, 45, 100]})

def correction():
    df['Number is Corrected'] = df['Number']
    cache = 0
    for num in range(len(df)):
        if df['Number is Corrected'][num] < cache:
            # .loc avoids the chained-assignment warning the original triggered
            df.loc[num, 'Number is Corrected'] = cache
        else:
            cache = df['Number is Corrected'][num]
    print(df)

if __name__ == "__main__":
    correction()
But there is some inconsistency, as in your conversation with jezrael. You may need to update the logic once it is clearer what output you want. Good luck.
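If the intended rule really is "never let a value drop below the running maximum" (which is what the loop above computes), pandas has a built-in for it; a one-line sketch:
# vectorized equivalent of the loop: each value becomes the running maximum so far
df['Number is Corrected'] = df['Number'].cummax()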
I have an efficiency question for you. I wrote some code to analyze a report that holds over 70k records and more than 400 unique organizations, so that my supervisor can enter the year/month/date they are interested in and have the information pop out.
The beginning of my code is:
import pandas as pd
import numpy as np
import datetime
main_data = pd.read_excel("UpdatedData.xlsx", encoding= 'utf8')
#column names from DF
epi_expose = "EpitheliumExposureSeverity"
sloughing = "EpitheliumSloughingPercentageSurface"
organization = "OrgName"
region = "Region"
date = "DeathOn"
#list storage of definitions
sl_list = ["",'None','Mild','Mild to Moderate']
epi_list= ['Moderate','Moderate to Severe','Severe']
#Create DF with four columns
df = main_data[[region, organization, epi_expose, sloughing, date]]
#filter it down to months
starting_date = datetime.date(2017,2,1)
ending_date = datetime.date(2017,2,28)
df = df[(df[date] > starting_date) & (df[date] < ending_date)]
I am then performing conditional filtering below to get counts by region and organization. It works, but it is slow. Is there a more efficient way to query my DF and set up a DF that ONLY has the dates it is supposed to sit between? Or is this the most efficient way without altering how the database I am using is set up?
I can provide more of my code, but if I filter it by month before exporting to Excel, the code runs in a matter of seconds, so I am not concerned about its speed beyond getting the correct date fields.
Thank you!
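One thing worth checking, sketched under the assumption that the DeathOn column parses as datetimes (the names reuse the variables from the question): Series.between does the range filter in one vectorized pass and, unlike the strict < comparisons above, includes both endpoints, so February 1 and 28 are kept.
import pandas as pd

# ensure the date column is datetime64, then filter inclusively in one pass
main_data[date] = pd.to_datetime(main_data[date])
df = main_data[[region, organization, epi_expose, sloughing, date]]
df = df[df[date].between("2017-02-01", "2017-02-28")]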