I have a dataframe df:
Open Volume Adj Close Ticker
Date
2006-11-22 140.750000 45505300 114.480649 SPY
I want to change df into another dataframe of Open prices, like below:
SPY AGG
Date
2006-11-22 140.750000 NA
It only uses the Open data and two tickers, so how do I change one dataframe into the other?
I think you can use the DataFrame constructor with reindex by a list of tickers L:
L = ['SPY','AGG']
df1 = pd.DataFrame({'SPY': [df.Open.iloc[0]]},
                   index=[df.index[0]])
df1 = df1.reindex(columns=L)
print (df1)
SPY AGG
2006-11-22 140.75 NaN
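If the long frame has many rows and tickers, a more general sketch (assuming, as in the question, that Date is the index and Ticker is a column) is to pivot and then reindex; the sample values below are hypothetical:

```python
import pandas as pd

# hypothetical long frame shaped like the question's df
df = pd.DataFrame({'Open': [140.75],
                   'Volume': [45505300],
                   'Adj Close': [114.480649],
                   'Ticker': ['SPY']},
                  index=pd.to_datetime(['2006-11-22']))
df.index.name = 'Date'

L = ['SPY', 'AGG']
# move Ticker to columns, keep only Open, then add missing tickers as NaN
df1 = (df.reset_index()
         .pivot(index='Date', columns='Ticker', values='Open')
         .reindex(columns=L))
print(df1)
```

This scales to any number of dates and tickers without hard-coding positions like df.Open.iloc[0].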
You can use read_html to find the list of tickers:
df2 = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies', header=0)[0]
#print (df2)
#filter only Ticker symbols starting with SP
df2 = df2[df2['Ticker symbol'].str.startswith('SP')]
print (df2)
Ticker symbol Security SEC filings \
407 SPG Simon Property Group Inc reports
415 SPGI S&P Global, Inc. reports
418 SPLS Staples Inc. reports
GICS Sector GICS Sub Industry \
407 Real Estate REITs
415 Financials Diversified Financial Services
418 Consumer Discretionary Specialty Stores
Address of Headquarters Date first added CIK
407 Indianapolis, Indiana NaN 1063761
415 New York, New York NaN 64040
418 Framingham, Massachusetts NaN 791519
#convert column to list, add 'SPY' because it is missing
L = ['SPY'] + df2['Ticker symbol'].tolist()
print (L)
['SPY', 'SPG', 'SPGI', 'SPLS']
df1 = pd.DataFrame({'SPY': [df.Open.iloc[0]]},
                   index=[df.index[0]])
df1 = df1.reindex(columns=L)
print (df1)
SPY SPG SPGI SPLS
2006-11-22 140.75 NaN NaN NaN
Suppose you have a list of data frames df_list for different tickers, and every item of the list looks like the df in your example.
You can first concatenate them into one frame with
df1 = pd.concat(df_list)
Then with
df1[["Open", "Ticker"]].reset_index().set_index(["Date", "Ticker"]).unstack()
It should give you an output like
Open
Ticker AGG SPY
Date
2006-11-22 NaN 140.75
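A minimal runnable sketch of that approach, using two hypothetical single-ticker frames (the AGG price is made up for illustration):

```python
import pandas as pd

idx = pd.to_datetime(['2006-11-22'])
idx.name = 'Date'
df_spy = pd.DataFrame({'Open': [140.75], 'Ticker': ['SPY']}, index=idx)
df_agg = pd.DataFrame({'Open': [100.50], 'Ticker': ['AGG']}, index=idx)

# stack the per-ticker frames, then pivot Ticker into columns
df1 = pd.concat([df_spy, df_agg])
out = (df1[['Open', 'Ticker']]
       .reset_index()
       .set_index(['Date', 'Ticker'])
       .unstack())
print(out)
```

The result has a ('Open', ticker) MultiIndex in the columns; dates where a ticker has no row come out as NaN.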
I've got two data frames, one has historical prices of stocks in this format:
year  Company1  Company2
1980      4.66     12.32
1981      5.68     15.53
etc., with hundreds of columns. Then I have a dataframe specifying a company, its sector and its country:
company 1  industrials     Germany
company 2  consumer goods  US
company 3  industrials     France
I used the first dataframe to plot the prices of various companies over time. However, I'd now like to somehow join the data from the first table with the second one and create a separate dataframe holding each sector's total value over time, i.e.:
year  industrials  consumer goods  healthcare
1980        50.65           42.23       25.65
1981        55.65           43.23       26.15
Thank you
You can do the following, assuming df_1 is your DataFrame with price of stock per year and company, and df_2 your DataFrame with information on the companies:
# turn company columns into rows
df_1 = df_1.melt(id_vars='year', var_name='company')
df_1 = df_1.merge(df_2)
# groupby and move industry to columns
output = df_1.groupby(['year', 'industry'])['value'].sum().unstack('industry')
Output:
industry consumer goods industrials
year
1980 12.32 4.66
1981 15.53 5.68
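A self-contained sketch of the same pipeline, building small frames from the question's numbers (note the merge relies on both frames sharing a 'company' column, and the sector column is named 'industry' here to match the answer):

```python
import pandas as pd

df_1 = pd.DataFrame({'year': [1980, 1981],
                     'Company1': [4.66, 5.68],
                     'Company2': [12.32, 15.53]})
df_2 = pd.DataFrame({'company': ['Company1', 'Company2'],
                     'industry': ['industrials', 'consumer goods'],
                     'country': ['Germany', 'US']})

long = df_1.melt(id_vars='year', var_name='company')  # company columns -> rows
merged = long.merge(df_2)                             # joins on 'company'
output = merged.groupby(['year', 'industry'])['value'].sum().unstack('industry')
print(output)
```

With hundreds of company columns the same three lines apply unchanged; melt handles however many columns are present.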
Thanks for investing time to help me out :)
I have DataFrame (df_NSE_Price_) like below:
Company Name ID 2000-01-03 00:00:00 2000-01-04 00:00:00 ....
Reliance Industries Ltd. 100325 50.810 54.
Tata Consultancy Service 123455 123 125
..
I would want output like below :
Company Name ID March 00 April 00 .....
Reliance Industries Ltd 100325 52 55
Tata Consultancy Services 123455 124.3 124
..
The output has to contain the month-wise average of the data.
So far I have tried
df_NSE_Price_.resample('M',axis=1).mean()
But this gave me the error
Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
Something like this should work (after the transpose the dates sit on the row index, so resample no longer needs axis=1):
df.transpose().resample('M').mean().transpose()
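A hedged, self-contained sketch of that one-liner on data shaped like the question's (Company Name and ID must already be in the index, and the column labels must be real datetimes, for resample to work):

```python
import pandas as pd

df = pd.DataFrame(
    [[50.81, 54.0, 66.0], [123.0, 125.0, 130.0]],
    index=pd.MultiIndex.from_tuples(
        [('Reliance Industries Ltd.', 100325),
         ('Tata Consultancy Service', 123455)],
        names=['Company Name', 'ID']),
    columns=pd.to_datetime(['2000-01-03', '2000-01-04', '2000-02-04']))

# transpose so dates become the row index, resample monthly, transpose back
out = df.transpose().resample('M').mean().transpose()
print(out)
```

This yields one column per month-end with the monthly mean per company.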
First, I converted the data to a DataFrame (I added a column with February data, too).
import pandas as pd
columns = ('Company Name',
'ID',
'2000-01-03 00:00:00',
'2000-01-04 00:00:00',
'2000-02-04 00:00:00')
data = [('Reliance Industries Ltd.', 100325, 50.810, 54., 66.0),
('Tata Consultancy Service', 123455, 123, 125, 130.0),]
df = pd.DataFrame(data=data, columns=columns)
Second, I created a two-level index (MultiIndex), using Company and ID. Now, all column labels are dates. Then, I converted the column labels to datetime format (using pd.to_datetime()).
df = df.set_index(['Company Name', 'ID'])
df.columns = pd.to_datetime(df.columns)
Third, I re-sampled in monthly intervals, using 'axis=1' to aggregate by column. This creates one column per month; since the question asks for the monthly average, aggregate with mean(). Convert from month-end dates to periods with 'to_period()':
df = df.resample('M', axis=1).mean()
df.columns = df.columns.to_period('M')
                                 2000-01  2000-02
Company Name             ID
Reliance Industries Ltd. 100325   52.405     66.0
Tata Consultancy Service 123455  124.000    130.0
I've got a dataframe df in Pandas that looks like this:
stores product discount
Westminster 102141 T
Westminster 102142 F
City of London 102141 T
City of London 102142 F
City of London 102143 T
And I'd like to end up with a dataset that looks like this:
stores product_1 discount_1 product_2 discount_2 product_3 discount_3
Westminster 102141 T 102142 F
City of London 102141 T 102142 F 102143 T
How do I do this in pandas?
I think this is some kind of pivot on the stores column, but with multiple value columns. Or perhaps it's an "unmelt" rather than a "pivot"?
I tried:
df.pivot("stores", ["product", "discount"], ["product", "discount"])
But I get TypeError: MultiIndex.name must be a hashable type.
Use DataFrame.unstack to reshape; it is only necessary to first create a counter per store with GroupBy.cumcount. Last, change the ordering of the second level and flatten the MultiIndex in columns with map:
df = (df.set_index(['stores', df.groupby('stores').cumcount().add(1)])
.unstack()
.sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()
print (df)
stores discount_1 product_1 discount_2 product_2 discount_3 \
0 City of London T 102141.0 F 102142.0 T
1 Westminster T 102141.0 F 102142.0 NaN
product_3
0 102143.0
1 NaN
I have a Pandas DataFrame which looks like so:
user_id item_timestamp item_cashtags item_sectors item_industries
0 406225 1483229353 SPY Financial Exchange Traded Fund
1 406225 1483229353 ERO Financial Exchange Traded Fund
2 406225 1483229350 CAKE|IWM|SDS|SPY|X|SPLK|QQQ Services|Financial|Financial|Financial|Basic M... Restaurants|Exchange Traded Fund|Exchange Trad...
3 619769 1483229422 AAPL Technology Personal Computers
4 692735 1483229891 IVOG Financial Exchange Traded Fund
I'd like to split the cashtags, sectors and industries columns by |. Each cashtag corresponds to a sector which corresponds to an industry, so they are of equal amounts.
I'd like the output to be such that each cashtag, sector and industry have their own row, with the item_timestamp and user_id copying over, ie:
user_id item_timestamp item_cashtags item_sectors item_industries
2 406225 1483229350 CAKE|IWM|SDS Services|Financial|Financial Restaurants|Exchange Traded Fund|Exchange Traded Fund
would become:
user_id item_timestam item_cashtags item_sectors item_industries
406225 1483229350 CAKE Services Restaurants
406225 1483229350 IWM Financial Exchange Traded Fund
406225 1483229350 SDS Financial Exchange Traded Fund
My problem is that this is a conditional split which I'm not sure how to do in Pandas
If the frame is not too large, one easy option is to just loop through the rows. But I agree it is not the most pandas-idiomatic way to do it, and definitely not the most performant one.
from copy import copy

result = []
for idx, row in df.iterrows():
    d = dict(row)
    # iterate over the three parallel pipe-delimited columns together
    for tag, sector, industry in zip(d['item_cashtags'].split('|'),
                                     d['item_sectors'].split('|'),
                                     d['item_industries'].split('|')):
        # here you can add an if to filter on certain categories
        dd = copy(d)
        dd['item_cashtags'] = tag
        dd['item_sectors'] = sector
        dd['item_industries'] = industry
        result.append(dd)
pd.DataFrame(result)  # convert back
Okay, I don't know how performant this will be, but here's another approach
# test_data
df_dict = {
"user_id": [406225, 406225],
"item_timestamp": [1483229350, 1483229353],
"item_cashtags": ["CAKE|IWM|SDS", "SPY"],
"item_sectors": ["Services|Financial|Financial", "Financial"],
"item_industries": [
"Restaurants|Exchange Traded Fund|Exchange Traded Fund",
"Exchange Traded Fund"
]
}
df = pd.DataFrame(df_dict)
# which columns to split; all others should be "copied" over
split_cols = ["item_cashtags", "item_sectors", "item_industries"]
copy_cols = [col for col in df.columns if col not in split_cols]
# for each column, split on |. This gives a list, so values is an array of lists
# summing values concatenates these into one long list
new_df_dict = {col: df[col].str.split("|").values.sum() for col in split_cols}
# n_splits tells us how many times to replicate the values from the copied columns
# so that they'll match with the new number of rows from splitting the other columns
n_splits = df.item_cashtags.str.count(r"\|") + 1
# we turn each value into a list so that we can easily replicate them the proper
# number of times, then concatenate these lists like with the split columns
for col in copy_cols:
new_df_dict[col] = (df[col].map(lambda x: [x]) * n_splits).values.sum()
# now make a df back from the dict of columns
new_df = pd.DataFrame(new_df_dict)
# new_df
# item_cashtags item_sectors item_industries user_id item_timestamp
# 0 CAKE Services Restaurants 406225 1483229350
# 1 IWM Financial Exchange Traded Fund 406225 1483229350
# 2 SDS Financial Exchange Traded Fund 406225 1483229350
# 3 SPY Financial Exchange Traded Fund 406225 1483229353
I have the following data frame:
population GDP
country
United Kingdom 4.5m 10m
Spain 3m 8m
France 2m 6m
I also have the following information in a 2-column dataframe (happy for this to be made into another data structure if that would be more beneficial, as the plan is that it will be stored in a VARS file):
county code
Spain es
France fr
United Kingdom uk
The 'mapping' data structure will be stored in a random order, as countries will be added/removed at random times.
What is the best way to re-index the data frame to its country code from its country name?
Is there a smart solution that would also work on other columns? For example, if a dataframe was indexed on date but had a column df['country'], could you change df['country'] to its country code? Finally, is there a third option that would add an additional column holding the code, selected by the country name in another column?
I think you can use Series.map, but it works only with Series, so you need Index.to_series. Last, use rename_axis (new in pandas 0.18.0):
df1.index = df1.index.to_series().map(df2.set_index('county').code)
df1 = df1.rename_axis('county')
#pandas below 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
It is the same as mapping by dict:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.to_series().map(d)
df1 = df1.rename_axis('county')
#pandas below 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
EDIT:
Another solution with Index.map, so to_series is omitted:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.map(d.get)
df1 = df1.rename_axis('county')
#pandas below 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
Here are some brief ways to approach your 3 questions. More details below:
1) How to change index based on mapping in separate df
Use df_with_mapping.to_dict("split") to create a dictionary, then use a dict comprehension to change it into {"old1":"new1",...,"oldn":"newn"} form, then use df.index = df.base_column.map(dictionary) to get the changed index.
2) How to change index if the new column is in the same df:
df.index = df["column_you_want"]
3) Creating a new column by mapping on a old column:
df["new_column"] = df["old_column"].map({"old1":"new1",...,"oldn":"newn"})
1) Mapping for the current index exists in separate dataframe but you don't have the mapped column in the dataframe yet
This is essentially the same as question 2 with the additional step of creating a dictionary for the mapping you want.
#creating the mapping dictionary in the form of current index : future index
df2 = pd.DataFrame([["es"],["fr"]],index = ["spain","france"])
interm_dict = df2.to_dict("split") #Creates a dictionary split into column labels, data labels and data
mapping_dict = {country:data[0] for country,data in zip(interm_dict["index"],interm_dict['data'])}
#We only want the first column of the data and the index so we need to make a new dict with a list comprehension and zip
df["country"] = df.index #Create a new column if u want to save the index
df.index = pd.Series(df.index).map(mapping_dict) #change the index
df.index.name = "" #Blanks out index name
df = df.drop("county code",1) #Drops the county code column to avoid duplicate columns
Before:
county code language
spain es spanish
france fr french
After:
language country
es spanish spain
fr french france
2) Changing the current index to one of the columns already in the dataframe
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "france"])
df["country"] = df.index #if you want to save the original index
df.index = df["county code"] #The only step you actually need
df.index.name = "" #if you want a blank index name
df = df.drop("county code",1) #if you dont want the duplicate column
Before:
county code language
spain es spanish
france fr french
After:
language country
es spanish spain
fr french france
3) Creating an additional column based on another column
This is again essentially the same as step 2, except we create an additional column instead of assigning the mapped series to .index.
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "france"])
df["city"] = df["county code"].map({"es":"barcelona","fr":"paris"})
Before:
county code language
spain es spanish
france fr french
After:
county code language city
spain es spanish barcelona
france fr french paris