Adding columns to dataframes based on variable name - python

I'm trying to add two columns to each dataframe, based on its name, before concatenating all of them. One column is the year and the other is the trimester, so t1_15 would be trimester 1, year 2015.
I tried building a function that did it in one go, but due to time constraints I just ended up doing it manually, like this. I'm now returning to this problem with more time and would really like to sort it out.
frames_15 = [t1_15, t2_15, t3_15, t4_15]
for i in frames_15:
    i['year'] = 2015

frames_16 = [t1_16, t2_16, t3_16, t4_16]
for i in frames_16:
    i['year'] = 2016

frames_17 = [t1_17, t2_17, t3_17]
for i in frames_17:
    i['year'] = 2017

frames_trim_1 = [t1_15, t1_16, t1_17]
for i in frames_trim_1:
    i['trimestre'] = 1

frames_trim_2 = [t2_15, t2_16, t2_17]
for i in frames_trim_2:
    i['trimestre'] = 2

frames_trim_3 = [t3_15, t3_16, t3_17]
for i in frames_trim_3:
    i['trimestre'] = 3

frames_trim_4 = [t4_15, t4_16]
for i in frames_trim_4:
    i['trimestre'] = 4
I'd like each df to have a year and a trimester column based on its name.
Thanks in advance

The best way would be to build a dictionary in which you register your dataframes. You already gave them names according to their trimester assignment.
If you already know this information at creation time, you could even register the dataframes in a dictionary whose keys are tuples of trimester and year.
If you have something like in your description, you could also use the globals dictionary, but this is not quite clean and should be regarded as a last resort when there is no cleaner way.
If you build up your dictionary with keys named like the variable names above, or if you want to use the globals dictionary directly, you could do it as follows:
import re
import pandas as pd

df_directory = dict(globals())
name_re = re.compile(r'^t([0-9])_([0-9]{2})$')
for name, df in df_directory.items():
    matcher = name_re.match(name)
    if matcher and isinstance(df, pd.DataFrame):
        trimester, year = matcher.groups()
        df['trimestre'] = int(trimester)
        df['year'] = int(year) + 2000
This processes every variable that is named by the schema tX_XX and is of type DataFrame, parsing the trimester and year out of the name and assigning each to a column.
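The tuple-keyed dictionary mentioned above can be sketched like this. Note that load_frame is a hypothetical stand-in for however the frames are actually read; only the registration and concatenation pattern is the point:

```python
import pandas as pd

# Hypothetical loader standing in for wherever t1_15, t2_15, ...
# originally come from.
def load_frame(trimester, year):
    return pd.DataFrame({'value': [1, 2, 3]})

# Register each frame under a (trimester, year) key at creation time,
# assigning the columns immediately.
frames = {}
for year in (2015, 2016, 2017):
    for trimester in (1, 2, 3, 4):
        if (trimester, year) == (4, 2017):
            continue  # t4_17 does not exist in the question
        df = load_frame(trimester, year)
        df['trimestre'] = trimester
        df['year'] = year
        frames[(trimester, year)] = df

combined = pd.concat(frames.values(), ignore_index=True)
```

With the frames registered this way, no name parsing is needed at all, and the final concatenation is a single call.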

Related

for loop with same dataframe on both sides of the operator

I have defined 10 different DataFrames, A06_df, A07_df, etc., which pick up six different data point inputs in a daily time series for a number of years. To be able to work with them I need to do some formatting operations such as
A07_df=A07_df.fillna(0)
A07_df[A07_df < 0] = 0
A07_df.columns = col # col is defined
A07_df['oil']=A07_df['oil']*24
A07_df['water']=A07_df['water']*24
A07_df['gas']=A07_df['gas']*24
A07_df['water_inj']=0
A07_df['gas_inj']=0
A07_df=A07_df[['oil', 'water', 'gas','gaslift', 'water_inj', 'gas_inj', 'bhp', 'whp']]
and so on, for a few more formatting operations.
Is there a nice way to use a for loop or something so I don't have to write each operation for each dataframe A06_df, A07_df, A08.... etc.?
As an example, I have tried
list = [A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
for i in list:
    i = i.fillna(0)
But this does not do the trick.
Any help is appreciated
As i.fillna() returns a new object (an updated copy of your original dataframe), i = i.fillna(0) rebinds the name i but does not change the list contents A06_df, A07_df, ....
I suggest you collect the updated content in a new list, like this:
list_raw = [A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
list_updated = []
for i in list_raw:
    i = i.fillna(0)
    # More code here
    list_updated.append(i)
To simplify your future processing, I would recommend using a dictionary of dataframes instead of a list of named variables.
dfs = {}
dfs['A0'] = ...
dfs['A1'] = ...

dfs_updated = {}
for k, i in dfs.items():
    i = i.fillna(0)
    # More code here
    dfs_updated[k] = i
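A runnable sketch of the dictionary approach, using toy two-column frames in place of the real A06_df, A07_df, ... inputs and just the first few formatting operations from the question:

```python
import pandas as pd
import numpy as np

# Toy frames standing in for A06_df, A07_df, ...; the real ones come
# from the daily time-series inputs described in the question.
dfs = {
    'A06': pd.DataFrame({'oil': [1.0, np.nan], 'water': [-2.0, 3.0]}),
    'A07': pd.DataFrame({'oil': [np.nan, 4.0], 'water': [5.0, -1.0]}),
}

dfs_updated = {}
for k, df in dfs.items():
    df = df.fillna(0)          # replace missing values
    df[df < 0] = 0             # clip negative readings to zero
    df['oil'] = df['oil'] * 24
    df['water'] = df['water'] * 24
    dfs_updated[k] = df
```

Every formatting step is written once, and each updated frame stays reachable by its key.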

How to build a dataframe from scratch while filling in missing data? (details included in question)

I have a dataframe which looks like the following (the first dataframe, shown in the image below, is named relevantdata in the code):
I want the dataframe to be transformed to the following format:
Essentially, I want to get the relevant confirmed number for each Key for all the dates available in the dataframe. If a particular date is not available for a Key, we make that value zero.
Currently my code is as follows (a try/except block is used because some Keys don't have the whole range of dates, so a KeyError occurs the first time you refer to such a date using countrydata.at[date,'Confirmed'] for the respective Key; the except block then enters a zero into the dictionary for that date):
relevantdata = pandas.read_csv('https://raw.githubusercontent.com/open-covid-19/data/master/output/data_minimal.csv')
dates = relevantdata['Date'].unique().tolist()
covidcountries = relevantdata['Key'].unique().tolist()
data = dict()
data['Country'] = covidcountries
confirmeddata = relevantdata[['Date','Key','Confirmed']]
for country in covidcountries:
    for date in dates:
        countrydata = confirmeddata.loc[lambda confirmeddata: confirmeddata['Key'] == country].set_index('Date')
        try:
            if (date in data.keys()) == False:
                data[date] = list()
                data[date].append(countrydata.at[date,'Confirmed'])
            else:
                data[date].append(countrydata.at[date,'Confirmed'])
        except:
            if (date in data.keys()) == False:
                data[date].append(0)
            else:
                data[date].append(0)
finaldf = pandas.DataFrame(data = data)
While the above code accomplishes what I want, getting the dataframe into the format I require, it is far too slow, having to loop through every Key and date. I want to know if there is a better and faster method of doing the same without a nested for loop. Thank you for all your help.
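One vectorized alternative is a single pivot, which replaces the nested loops entirely. A sketch with a small inline frame standing in for the CSV (same 'Date', 'Key', 'Confirmed' columns as the question):

```python
import pandas as pd

# Toy stand-in for the downloaded CSV: one row per (Key, Date) pair,
# with the (AE, 2020-01-02) combination deliberately missing.
relevantdata = pd.DataFrame({
    'Date': ['2020-01-01', '2020-01-02', '2020-01-01'],
    'Key': ['AD', 'AD', 'AE'],
    'Confirmed': [1, 2, 5],
})

# pivot_table spreads dates across columns; fill_value=0 supplies the
# zeros for missing (Key, Date) combinations.
finaldf = (relevantdata
           .pivot_table(index='Key', columns='Date',
                        values='Confirmed', fill_value=0)
           .reset_index()
           .rename(columns={'Key': 'Country'}))
```

This runs in a single pass over the data instead of one .at lookup per (Key, Date) pair.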

How to Merge a list of Multiple DataFrames and Tag each Column with a another list

I have a list of DataFrames that come from the census API; I stored each year's pull in a list.
So at the end of my for loop I have a list with one dataframe per year, and a list of years built alongside it in the for loop.
The problem I am having is merging all the DataFrames in the list while also tagging them with the list of years.
I have tried using the reduce function, but it looks like it only takes 2 of the 6 DataFrames I have.
concat just adds them to the dataframe without tagging or changing anything.
# Dependencies
import pandas as pd
import requests
import json
import pprint
import requests
from census import Census
from us import states
# Census
from config import (api_key, gkey)
year = 2012
c = Census(api_key, year)

for length in range(6):
    c = Census(api_key, year)
    data = c.acs5.get(('NAME', "B25077_001E", "B25064_001E",
                       "B15003_022E", "B19013_001E"),
                      {'for': 'zip code tabulation area:*'})
    data_df = pd.DataFrame(data)
    data_df = data_df.rename(columns={"NAME": "Name",
                                      "zip code tabulation area": "Zipcode",
                                      "B25077_001E": "Median Home Value",
                                      "B25064_001E": "Median Rent",
                                      "B15003_022E": "Bachelor Degrees",
                                      "B19013_001E": "Median Income"})
    data_df = data_df.astype({'Zipcode': 'int64'})
    filtervalue = data_df['Median Home Value'] > 0
    filtervalue2 = data_df['Median Rent'] > 0
    filtervalue3 = data_df['Median Income'] > 0
    cleandata = data_df[filtervalue][filtervalue2][filtervalue3]
    cleandata = cleandata.dropna()
    yearlst.append(year)
    datalst.append(cleandata)
    year += 1
This generates the two separate lists, one with the years and the other with the dataframes.
My output came out as either one dataframe with missing entries, or everything concatenated without the columns changed.
What I'm looking for is how to merge everything within the list, with datalst[0] tagged with yearlst[0] when merging, if at all possible.
There is no need for a year list; simply assign a year column to each data frame. Also, avoid incrementing year manually and make it the loop variable instead. In fact, consider chaining your process:
datalst = []

for year in range(2012, 2019):
    c = Census(api_key, year)
    data = c.acs5.get(('NAME', "B25077_001E", "B25064_001E",
                       "B15003_022E", "B19013_001E"),
                      {'for': 'zip code tabulation area:*'})
    cleandata = (pd.DataFrame(data)
                 .rename(columns={"NAME": "Name",
                                  "zip code tabulation area": "Zipcode",
                                  "B25077_001E": "Median_Home_Value",
                                  "B25064_001E": "Median_Rent",
                                  "B15003_022E": "Bachelor_Degrees",
                                  "B19013_001E": "Median_Income"})
                 .astype({'Zipcode': 'int64'})
                 .query('(Median_Home_Value > 0) & (Median_Rent > 0) & (Median_Income > 0)')
                 .dropna()
                 .assign(year_column=year)
                 )
    datalst.append(cleandata)

final_data = pd.concat(datalst, ignore_index=True)
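The assign-then-concat pattern can be tried without an API key. In this sketch a made-up 'value' column stands in for the census fields; only the tagging mechanics are the point:

```python
import pandas as pd

# Toy stand-in for the census pull: each loop iteration builds one
# frame and tags it with its year via assign().
datalst = []
for year in range(2012, 2015):
    df = (pd.DataFrame({'value': [year * 10, year * 10 + 1]})
          .assign(year_column=year))
    datalst.append(df)

final_data = pd.concat(datalst, ignore_index=True)
```

Every row of the combined frame carries the year it came from, so no parallel list of years is needed.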

Perform pandas concatenation of dataframe and read it from a file

I have a use case where I need to create a python dictionary keyed by year and month and then concatenate all the dataframes into a single dataframe. I have implemented it as below:
dict_year_month = {}
temp_dict_1 = {}
temp_dict_2 = {}
for ym in [201104, 201105 ... 201706]:
    key_name = 'df_' + str(ym) + 'A'
    temp_dict_1[key_name] = df[(df['col1'] <= ym) & (df['col2'] > ym)
                               & (df['col3'] == 1)]
    temp_dict_2[key_name] = df[(df['col1'] <= ym) & (df['col2'] == 0)
                               & (df['col3'] == 1)]
    if not temp_dict_1[key_name].empty:
        dict_year_month[key_name] = temp_dict_1[key_name]
        dict_year_month[key_name].loc[:, 'new_col'] = ym
    elif not temp_dict_2[key_name].empty:
        dict_year_month[key_name] = temp_dict_2[key_name]
        dict_year_month[key_name].loc[:, 'new_col'] = ym
    dict_year_month[key_name] = dict_year_month[key_name].sort_values('col4')
    dict_year_month[key_name] = dict_year_month[key_name].drop_duplicates('col5')
    # .. do some other processing
    # create individual dataframes as df_201104A .. and so on ..

dict_year_month

# concatenate all the above individual dataframes into a single dataframe:
df1 = pd.concat([
    dict_year_month['df_201104A'], dict_year_month['df_201105A'],
    ... so on till dict_year_month['df_201706A'])
Now the challenge is that I have to rerun this code every quarter, so each time I have to update the script with new year-month dict keys, and pd.concat needs to be updated with the new year-month details as well. I am looking for a solution where I can read the keys and build the list of dataframes to concatenate from a properties file or config file.
There are only a few things you need to do to get there. The first is to enumerate the months between your start and end month, which I do below using rrule, reading the start and end dates from a file; this gets you the keys for your dictionary. Then just use the .values() method on the dictionary to get all the dataframes.
from dateutil import rrule
from datetime import datetime
import pickle
import pandas as pd

# get these from wherever: config, etc.
params = {
    'start_year': 2011,
    'start_month': 4,
    'end_year': 2017,
    'end_month': 6,
}
pickle.dump(params, open("params.pkl", "wb"))
params = pickle.load(open("params.pkl", "rb"))

start = datetime(year=params['start_year'], month=params['start_month'], day=1)
end = datetime(year=params['end_year'], month=params['end_month'], day=1)
keys = [int(dt.strftime("%Y%m")) for dt in rrule.rrule(rrule.MONTHLY, dtstart=start, until=end)]
print(keys)

## Do some things and get a dict
dict_year_month = {'201104': pd.DataFrame([[1, 2, 3]]), '201105': pd.DataFrame([[4, 5, 6]])}  # ... etc
pd.concat(dict_year_month.values())
The pickle file shows one way of saving and loading parameters; it is a binary format, so manually editing the parameters wouldn't really work. You might want to investigate something like yaml to get more sophisticated; feel free to ask a new question if you need help with that.
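For a hand-editable config without extra dependencies, the same parameters can also go through the stdlib json module. A sketch, with the file name params.json chosen for illustration:

```python
import json
from datetime import datetime
from dateutil import rrule

# Write the parameters as plain text so a quarterly update is just a
# one-line edit of the file.
with open("params.json", "w") as fh:
    json.dump({'start_year': 2011, 'start_month': 4,
               'end_year': 2017, 'end_month': 6}, fh)

with open("params.json") as fh:
    params = json.load(fh)

start = datetime(params['start_year'], params['start_month'], 1)
end = datetime(params['end_year'], params['end_month'], 1)
keys = [int(dt.strftime("%Y%m"))
        for dt in rrule.rrule(rrule.MONTHLY, dtstart=start, until=end)]
```

Each quarter, only end_year/end_month in the file change; the script itself stays untouched.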

Iterating through a list of Pandas DF's to then iterate through each DF's row

This may be a slightly insane question...
I've got a single Pandas DF of articles which I have then split into multiple DFs, so each DF only contains the articles from a particular year. I have then put these variables into a list called box_of_years.
indexed_df = article_db.set_index('date')
indexed_df = indexed_df.sort_index()
year_2004 = indexed_df.truncate(before='2004-01-01', after='2004-12-31')
year_2005 = indexed_df.truncate(before='2005-01-01', after='2005-12-31')
year_2006 = indexed_df.truncate(before='2006-01-01', after='2006-12-31')
year_2007 = indexed_df.truncate(before='2007-01-01', after='2007-12-31')
year_2008 = indexed_df.truncate(before='2008-01-01', after='2008-12-31')
year_2009 = indexed_df.truncate(before='2009-01-01', after='2009-12-31')
year_2010 = indexed_df.truncate(before='2010-01-01', after='2010-12-31')
year_2011 = indexed_df.truncate(before='2011-01-01', after='2011-12-31')
year_2012 = indexed_df.truncate(before='2012-01-01', after='2012-12-31')
year_2013 = indexed_df.truncate(before='2013-01-01', after='2013-12-31')
year_2014 = indexed_df.truncate(before='2014-01-01', after='2014-12-31')
year_2015 = indexed_df.truncate(before='2015-01-01', after='2015-12-31')
year_2016 = indexed_df.truncate(before='2016-01-01', after='2016-12-31')
box_of_years = [year_2004, year_2005, year_2006, year_2007,
                year_2008, year_2009, year_2010, year_2011,
                year_2012, year_2013, year_2014, year_2015,
                year_2016]
I've written various functions to tokenize, clean up and convert the tokens into a FreqDist object and wrapped those up into a single function called year_prep(). This works fine when I do
year_2006 = year_prep(year_2006)
...but is there a way I can iterate across every year variable, apply the function and have it transform the same variable, short of just repeating the above for every year?
I know repeating myself would be the simplest way, but not necessarily the cleanest. Perhaps I have this backwards and should do the slicing later on, but at that point I feel the layers of lists will get out of hand, as I'd be going from a list of years to a list of years containing a list of articles containing a list of every word in each article.
I think you can use groupby by year with a custom function:
import pandas as pd

start = pd.to_datetime('2004-02-24')
rng = pd.date_range(start, periods=30, freq='50D')
df = pd.DataFrame({'Date': rng, 'a': range(30)})
#print (df)

def f(x):
    print(x)
    #return year_prep(x)
    #some custom output
    return x.a + x.Date.dt.month

print(df.groupby(df['Date'].dt.year).apply(f))
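The same groupby idea can also replace the thirteen truncate calls directly, producing a dict of per-year frames in one line. A sketch, where the 'text' column is a made-up stand-in for the article data:

```python
import pandas as pd

# Toy article table; the real one is article_db from the question.
article_db = pd.DataFrame({
    'date': pd.to_datetime(['2004-03-01', '2004-07-09', '2005-01-15']),
    'text': ['a', 'b', 'c'],
})

indexed_df = article_db.set_index('date').sort_index()

# One frame per calendar year, keyed by year, with no hand-written
# year_2004, year_2005, ... variables.
box_of_years = {year: frame
                for year, frame in indexed_df.groupby(indexed_df.index.year)}
```

Applying year_prep to every year is then just a dict comprehension over box_of_years.items().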
