Adding column to pandas DataFrame with new keys

Adding column to pandas DataFrame with new keys - python

I'm new to python.
I have this code:
import pandas as pd
import quandl
quandlApiKey = "XXX" # Can't write the real one because it is forbidden.
d = {}
closing_data = pd.DataFrame()
indexes = {
'SCF/CME_SP1_FW' : 'Settle',
'CHRIS/LIFFE_Z1' : 'Settle',
'CHRIS/EUREX_FMEU1' : 'Settle'
}
for index in indexes.keys():
d[index] = quandl.get(index, start_date="2013-12-31", end_date="2014-12-31", api_key=quandlApiKey)
for index in indexes.keys():
closing_data[index] = d[index]['Settle']
In the first iteration (SCF/CME_SP1_FW) it saves the first column just fine and the keys of the rows are 31/12/2013, 02/01/2014, 03/01/2014 ...
The date 01/01/2014 is intentionally missing.
On the second iteration the for loop adds the second column, but although d[index]['Settle'] had the date 01/01/2014 a row wasn't added for it in closing_data.
Is there a way to join the rows of all the columns (when a row is missing in one of the columns I would like to have a NaN or something similar).
Thank you!

Related

python Dataframe using pandas data insert into excel file

I have received a data frame using pandas, data have one column and multiple rows in that column
and each row has multiple data like ({buy_quantity:0, symbol:nse123490,....})
I want to insert it into an excel sheet using pandas data frame with python xlwings lib. with some selected data please help me
wb = xw.Book('Easy_Algo.xlsx')
ts = wb.sheets['profile']
pdata=sas.get_profile()
df = pd.DataFrame(pdata)
ts.range('A1').value = df[['symbol','product','avg price','buy avg']]
output like this :
please help me... how to insert data into excel only selected.

Considering that the dataframe below is named df and the type of the column positions is dict, you can use the code below to transform the keys to columns and values to rows.
out = df.join(pd.DataFrame(df.pop('positions').values.tolist()))
out.to_excel('Easy_Algo.xlsx', sheet_name=['profile'], index=False) #to store the result in an Excel file/spreadsheet.
Note : Make sure to add these two lines below if the type of the column positions is not dict.
import ast
df['positions']=df['positions'].apply(ast.literal_eval)
#A sample dataframe for test :
import pandas as pd
import ast
string_dict = {'{"Symbol": "NIFTY2292218150CE NFO", "Produc": "NRML", "Avg. Price": 18.15, "Buy Avg": 0}',
'{"Symbol": "NIFTY22SEP18500CE NFO", "Produc": "NRML", "Avg. Price": 20.15, "Buy Avg": 20.15}',
'{"Symbol": "NIFTY22SEP16500PE NFO", "Produc": "NRML", "Avg. Price": 16.35, "Buy Avg": 16.35}'}
df = pd.DataFrame(string_dict, columns=['positions'])
df['positions']=df['positions'].apply(ast.literal_eval)
out = df.join(pd.DataFrame(df.pop('positions').values.tolist()))
>>> print(out)
Symbol Produc Avg. Price Buy Avg
0 NIFTY22SEP16500PE NFO NRML 16.35 16.35
1 NIFTY22SEP18500CE NFO NRML 20.15 20.15
2 NIFTY2292218150CE NFO NRML 18.15 0.00

If i understood correctly, you want only those columns written to an excel file
df = df[['symbol','product','avg price','buy avg']]
df.to_excel("final.xlsx")
df.to_excel("final.xlsx", index = False) # in case there was a default index generated by pandas and you wanna get rid of it.
i hope this helps.

Concatenate specific columns in pandas

Im trying to concatenate 4 different datasets onto pandas python. I can concatenated them but it results in several of the same column names. How do I only produce only one column of the same name, then multiples?
concatenated_dataframes = pd.concat(
[
dice.reset_index(drop=True),
json.reset_index(drop=True),
flexjobs.reset_index(drop=True),
indeed.reset_index(drop=True),
simply.reset_index(drop=True),
],
axis=1,
ignore_index=True,
)
concatenated_dataframes_columns = [
list(dice.columns),
list(json.columns),
list(flexjobs.columns),
list (indeed.columns),
list(simply.columns)
]
flatten = lambda nested_lists: [item for sublist in nested_lists for item in sublist]
concatenated_dataframes.columns = flatten(concatenated_dataframes_columns)
df= concatenated_dataframes
This results in
UNNAMED: 0 TITLE COMPANY DESCRIPTION LOCATION TITLE JOBLOCATION POSTEDDATE DETAILSPAGEURL COMPANYPAGEURL COMPANYLOGOURL SALARY CLIENTBRANDID COMPANYNAME EMPLOYMENTTYPE SUMMARY SCORE EASYAPPLY EMPLOYERTYPE WORKFROMHOMEAVAILABILITY ISREMOTE UNNAMED: 0 TITLE SALARY JOBTYPE LOCATION DESCRIPTION UNNAMED: 0 TITLE SALARY JOBTYPE DESCRIPTION LOCATION UNNAMED: 0 COMPANY DESCRIPTION LOCATION SALARY TITLE
Again, how do i combined all the 'titles' in one column, all the 'location' in one column, and so on? Instead of have multiple of them.

I think we can get away with making a blank dataframe that just has the columns we will want at the end and then concat() everything onto it.
import numpy as np
import pandas as pd
all_columns = list(dice.columns) + list(json.columns) + list(flexjobs.columns) + list(indeed.columns) + list(simply.columns)
all_unique_columns = np.unique(np.array(all_columns)) # this will, as the name suggests, give an end list of just the unique columns. You could run print(all_unique_columns) to make sure it has what you want
df = pd.DataFrame(columns=all_unique_columns)
df = pd.concat([dice, json, flexjobs, indeed, simply],axis=0)
It's a little tricky not having reproducible examples of the dataframes that you have. I tested this on a small mock-up example I put together, but let me know if it works for your more complex example.

Inaccesable first column in pandas dataframe?

I have a dataframe with multiple columns. When I execute the following code it assigns the header for the first column to the second column thereby making the first column inaccessible.
COLUMN_NAMES = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave_points_worst',
'symmetry_worst']
TUMOR_TYPE = ['M', 'B']
path_to_file = list(files.upload().keys())[0]
data = pd.read_csv(path_to_file, names=COLUMN_NAMES, header=0)
print(data)
id diagnosis ... concave_points_worst symmetry_worst
842302 M 17.99 ... 0.4601 0.11890
842517 M 20.57 ... 0.2750 0.08902
84300903 M 19.69 ... 0.3613 0.08758
The id tag is supposed to be associated with the first column but it's associated with the second one resulting in the last column header to get deleted.

pd.read_csv is going to make your first column the index rather than a column like the rest of what is on your list.
You could update it to be:
path_to_file = list(files.upload().keys())[0]
data = pd.read_csv(path_to_file, names=COLUMN_NAMES, header=0,index_col = False)
to make sure that first column isn't treated as the index.

How do I make this function iterable (getting indexerror)

I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (300~).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime
def get_data(issue_date, stock_ticker):
df = pd.read_csv (r'D:\Project\Data\Short_Interest\exampledata.csv')
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
d = df
df = pd.DataFrame(d)
short = df.loc[df.Symbol.eq(stock_ticker)]
# get the index of the row of interest
ix = short[short.Date.eq(issue_date)].index[0]
# get the item row for that row's index
iloc_ix = short.index.get_loc(ix)
# get the +/-1 iloc rows (+2 because that is how slices work), basically +1 and -1 trading days
short_data = short.iloc[iloc_ix-10: iloc_ix+11]
return [short_data]
I want to create a script that iterates a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as following:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?

To load the inputs and call the function for each row; iterate over the csv file and pass each row's values to the function and accumulate the resulting Seriesin a list.
I modified your function a bit: removed the DataFrame creation so it is only done once and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like they are day/month/year so I converted them for that format.
import pandas as pd
import datetime, csv
def get_data(df, issue_date, stock_ticker):
'''Return a Series for the ticker centered on the issue date.
'''
short = df.loc[df.Symbol.eq(stock_ticker)]
# get the index of the row of interest
try:
ix = short[short.Date.eq(issue_date)].index[0]
# get the item row for that row's index
iloc_ix = short.index.get_loc(ix)
# get the +/-1 iloc rows (+2 because that is how slices work), basically +1 and -1 trading days
short_data = short.iloc[iloc_ix-10: iloc_ix+11]
except IndexError:
msg = f'no data for {stock_ticker} on {issue_date}'
#log.info(msg)
print(msg)
short_data = None
return short_data
df = pd.read_csv (datafile)
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
results = []
with open('issues.csv') as issues:
for ticker,date in csv.reader(issues):
day,month,year = map(int,date.split('/'))
# dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
date = datetime.date(year,month,day)
s = get_data(df,date,ticker)
results.append(s)
# print(s)
Creating a single DataFrame or table for all that info may be problematic especially since the date ranges are all different. Probably should ask a separate question regarding that. Its mcve should probably just include a few minimal Pandas Series with a couple of different date ranges and tickers.

How to Merge a list of Multiple DataFrames and Tag each Column with a another list

I have a lisit of DataFrames that come from the census api, i had stored each year pull into a list.
So at the end of my for loop i have a list with dataframes per year and a list of years to go along side the for loop.
The problem i am having is merging all the DataFrames in the list while also taging them with a list of years.
So i have tried using the reduce function, but it looks like it only taking 2 of the 6 Dataframes i have.
concat just adds them to the dataframe with out tagging or changing anything
# Dependencies
import pandas as pd
import requests
import json
import pprint
import requests
from census import Census
from us import states
# Census
from config import (api_key, gkey)
year = 2012
c = Census(api_key, year)
for length in range(6):
c = Census(api_key, year)
data = c.acs5.get(('NAME', "B25077_001E","B25064_001E",
"B15003_022E","B19013_001E"),
{'for': 'zip code tabulation area:*'})
data_df = pd.DataFrame(data)
data_df = data_df.rename(columns={"NAME": "Name",
"zip code tabulation area": "Zipcode",
"B25077_001E":"Median Home Value",
"B25064_001E":"Median Rent",
"B15003_022E":"Bachelor Degrees",
"B19013_001E":"Median Income"})
data_df = data_df.astype({'Zipcode':'int64'})
filtervalue = data_df['Median Home Value']>0
filtervalue2 = data_df['Median Rent']>0
filtervalue3 = data_df['Median Income']>0
cleandata = data_df[filtervalue][filtervalue2][filtervalue3]
cleandata = cleandata.dropna()
yearlst.append(year)
datalst.append(cleandata)
year += 1
so this generates the two seperate list one with the year and other with dataframe.
So my output came out to either one Dataframe with missing Dataframe entries or it just concatinated all without changing columns.
what im looking for is how to merge all within a list, but datalst[0] to be tagged with yearlst[0] when merging if at all possible

No need for year list, simply assign year column to data frame. Plus avoid incrementing year and have it the iterator column. In fact, consider chaining your process:
for year in range(2012, 2019):
c = Census(api_key, year)
data = c.acs5.get(('NAME', "B25077_001E","B25064_001E", "B15003_022E","B19013_001E"),
{'for': 'zip code tabulation area:*'})
cleandata = (pd.DataFrame(data)
.rename(columns={"NAME": "Name",
"zip code tabulation area": "Zipcode",
"B25077_001E": "Median_Home_Value",
"B25064_001E": "Median_Rent",
"B15003_022E": "Bachelor_Degrees",
"B19013_001E": "Median_Income"})
.astype({'Zipcode':'int64'})
.query('(Median_Home_Value > 0) & (Median_Rent > 0) & (Median_Income > 0)')
.dropna()
.assign(year_column = year)
)
datalst.append(cleandata)
final_data = pd.concat(datalst, ignore_index = True)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Adding column to pandas DataFrame with new keys - python

Related

python Dataframe using pandas data insert into excel file

Concatenate specific columns in pandas

Inaccesable first column in pandas dataframe?

How do I make this function iterable (getting indexerror)

How to Merge a list of Multiple DataFrames and Tag each Column with a another list

Categories

Resources