for loop with same dataframe on both side of the operator - python

I have defined 10 different DataFrames (A06_df, A07_df, etc.), each of which holds six different data point inputs in a daily time series covering a number of years. To be able to work with them I need to do some formatting operations such as
A07_df=A07_df.fillna(0)
A07_df[A07_df < 0] = 0
A07_df.columns = col # col is defined
A07_df['oil']=A07_df['oil']*24
A07_df['water']=A07_df['water']*24
A07_df['gas']=A07_df['gas']*24
A07_df['water_inj']=0
A07_df['gas_inj']=0
A07_df=A07_df[['oil', 'water', 'gas','gaslift', 'water_inj', 'gas_inj', 'bhp', 'whp']]
etc for a few more formatting operations
Is there a nice way to have a for loop or something so I don’t have to write each operation for each dataframe A06_df, A07_df, A08.... etc?
As an example, I have tried
list=[A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
for i in list:
    i=i.fillna(0)
But this does not do the trick.
Any help is appreciated

Because i.fillna() returns a new object (an updated copy of your original dataframe), i=i.fillna(0) only rebinds the loop variable i to that copy. The list entries and the variables A06_df, A07_df, ... still point to the original, unmodified dataframes.
I suggest you store the updated dataframes in a new list like this:
list_raw = [A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
list_updated = []
for i in list_raw:
    i=i.fillna(0)
    # More code here
    list_updated.append(i)
To simplify your future processing, I would recommend using a dictionary of dataframes instead of a list of named variables.
dfs = {}
dfs['A0'] = ...
dfs['A1'] = ...
dfs_updated = {}
for k,i in dfs.items():
    i=i.fillna(0)
    # More code here
    dfs_updated[k] = i
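The dictionary approach can be combined with a single formatting function, so the operations are written once. The following is only a sketch: the function name clean, the sample column values and the keys "A06"/"A07" are made up for illustration, and clip(lower=0) is swapped in for the df[df < 0] = 0 step from the question (it gives the same result for numeric columns).

```python
import pandas as pd

def clean(df):
    """Return a formatted copy; the input dataframe is left untouched."""
    out = df.fillna(0)           # fillna returns a new dataframe
    out = out.clip(lower=0)      # set negative values to 0
    for name in ("oil", "water", "gas"):
        out[name] = out[name] * 24
    return out

# Hypothetical sample data standing in for the real A06_df, A07_df, ...
dfs = {
    "A06": pd.DataFrame({"oil": [1.0, None], "water": [-2.0, 3.0], "gas": [0.5, 1.0]}),
    "A07": pd.DataFrame({"oil": [2.0, 4.0], "water": [None, 1.0], "gas": [-1.0, 2.0]}),
}

# Apply the same formatting to every dataframe in one pass.
dfs_updated = {name: clean(df) for name, df in dfs.items()}
print(dfs_updated["A06"])
```

Any further formatting steps (renaming columns, adding the injection columns, reordering) would go inside clean, so each dataframe gets exactly the same treatment.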

Related

Why are multiple values incorrectly updated in my dynamically created nested dicts?

Dfs is a dict with dataframes and the keys are named like this: 'datav1_135_gl_b17'
We would like to calculate a matrix with constants. It should be possible to assign the values in the matrix according to the attributes from the df name. In this example '135' and 'b17'.
If you want code to create an example dfs, let me know, I've cut it out to more clearly state the problem.
We create a nested dict dynamically with the following function:
def ex_calc_time(dfs):
    formats = []
    grammaturs = []
    for i in dfs:
        # (...)
        # format
        split1 = i.split('_')
        format = split1[-1]
        format.replace(" ", "")
        formats.append(format)
        formats = list(set(formats))
        # grammatur
        # split1 = i.split('_')
        grammatur = split1[-3]
        grammatur.replace(" ", "")
        grammaturs.append(grammatur)
        grammaturs = list(set(grammaturs))
    # END FLOOP
    dict_mean_time = dict.fromkeys(formats, dict.fromkeys(grammaturs, ''))
    return dfs, dict_mean_time
Then we try to fill the nested dict and change the values like this (which should work according to similar nested dict questions, but it doesn't). 'nope' is written under both outer keys:
ex_dict_mean_time['b17']['170'] = 'nope'
ex_dict_mean_time
{'a18': {'135': '', '170': 'nope', '250': ''},
'b17': {'135': '', '170': 'nope', '250': ''}}
I also tried creating a dataframe from ex_dict_mean_time and filling it with .loc, but that didn't work either (df remains empty). Moreover I tried this method, but I always end up with the same problem and the values are overwritten. I appreciate any help. If you have any improvements for my code please let me know, I welcome any opportunity to improve.
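The behaviour described here matches how dict.fromkeys works: its second argument is evaluated once, so every outer key ends up pointing at the same inner dict object. A minimal sketch of the effect and one possible fix, using the key names from the output above as sample data:

```python
formats = ["a18", "b17"]
grammaturs = ["135", "170", "250"]

# dict.fromkeys evaluates the inner dict once and reuses it for every key.
shared = dict.fromkeys(formats, dict.fromkeys(grammaturs, ""))
shared["b17"]["170"] = "nope"
shares_inner = shared["a18"] is shared["b17"]  # same object under both keys

# A nested comprehension builds a fresh inner dict for each outer key.
independent = {f: {g: "" for g in grammaturs} for f in formats}
independent["b17"]["170"] = "nope"

print(shares_inner)
print(shared)
print(independent)
```

Separately, str.replace returns a new string, so a bare format.replace(" ", "") has no effect; the result needs to be assigned if the stripping matters.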

how can I rewrite this code to make it easier to read?

start=2014
df = pd.DataFrame({'age':past_cars_sold,}, index = [start, start+1,start+2,start+3,start+4,start+5,start+6])
Is there an easier way to write this code? Right now I have to list the index values one at a time, and I would like to know if there is a simpler way.
Karl's comment seems the most straightforward. No list needed -- just give pandas a range object:
start = 2014
df = pd.DataFrame({'age': past_cars_sold}, index=range(start, start+7))
First, write it like this, which is easier to read and to reorganize later:
start=2014
df = pd.DataFrame(
    {
        'age':past_cars_sold,
    },
    index = [
        start,
        start+1,
        start+2,
        start+3,
        start+4,
        start+5,
        start+6
    ]
)
Then, see if you can simplify it, for example:
past_cars_sold = [1,2,3,4,5,6] # dummy test values
start = 2014 # avoid hard-coding value
years = 6 # or group these values together
idxls = range(start, start + years, 1) # build the index values with range()
Then use it as the index:
df = pd.DataFrame(
    {
        'age':past_cars_sold,
    },
    index = idxls
)
It may also be a good idea to read Python's official style guide (PEP 8) on how to format your code.
Utilizing a for loop will allow you to automatically populate the list with the values you want using simple math & logic.
start = 2014
index = []
for i in range(7):
    index.append(start+i)
This makes the code more readable, and also nearly infinitely scalable. You will also not have to populate your list manually.
Edit:
As others have added, you can use Pythonic List Comprehension as well.
This means you can populate a list in one line like so:
start = 2014
index = [start+i for i in range(7)] # from i=0 to i=6 (7 total elements)
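As a quick sanity check, the range, for-loop and list-comprehension versions suggested in these answers should all produce the same index labels. The sketch below compares them; n_years = 7 is an assumption taken from the seven entries in the original index.

```python
start = 2014
n_years = 7  # assumed: one label per entry in past_cars_sold

# range object, materialised as a list only for comparison
from_range = list(range(start, start + n_years))

# explicit for loop
from_loop = []
for i in range(n_years):
    from_loop.append(start + i)

# list comprehension
from_comprehension = [start + i for i in range(n_years)]

print(from_range == from_loop == from_comprehension)
```

Since pandas accepts a range directly as the index, the range version avoids building the list at all.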

How to loop through few lines

I am unsure how to loop over a few lines:
get_sol is a function I created that takes two parameters: def get_sol(sub_dist_fil, fos_cnt)
banswara, palwal and hathin are some values from a column named "sub-district".
The second argument, 1, is fixed.
I am currently writing it as:
out_1 = get_sol( "banswara",1)
out_1 = get_sol("palwal",1)
out_1 = get_sol("hathin",1)
How can I apply a for loop to these lines in order to get all the results in one go?
Edit: a few comments helped me get my results (thanks a lot).
Now I have a follow-up question: how do I display/print the name of the respective district for which each result is produced?
In the general case you can do something like this:
data = ['banswara', 'palwal', 'hathin']
result = {}
for item in data:
    result[item] = get_sol(item, 1)
print(result)
This packs your results into a dictionary keyed by input, so you can see which result was generated for which input.
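To print the district name next to each result, the dictionary can be walked with .items(). This is a sketch only: the get_sol below is a hypothetical stand-in that returns a string, since the real function is not shown in the question.

```python
def get_sol(sub_dist_fil, fos_cnt):
    # Hypothetical stand-in for the real get_sol from the question.
    return f"{sub_dist_fil}:{fos_cnt}"

districts = ["banswara", "palwal", "hathin"]

# Key each result by the district it was computed for.
results = {district: get_sol(district, 1) for district in districts}

# Pair every district name with its result when printing.
lines = [f"{district} -> {out}" for district, out in results.items()]
for line in lines:
    print(line)
```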
here you go:
# save the values into a list
random_values = column["sub-district"]
# iterate through using for
for random_value in random_values:
    # get the result
    result = get_sol(random_value, 1)
    # print the result or do whatever
    # you want to the result
    print(result)
Similar to the other answers, but using a list comprehension to make it more Pythonic (and usually faster):
districts = ['banswara', 'palwal', 'hathin']
result = [get_sol(item, 1) for item in districts]
I think you are trying to get random values from the column 'subdistrict'.
For the purpose of illustration, let the dataframe be df (so the 'subdistrict' column is accessed as df['subdistrict']).
import numpy as np
# select 10 random values from the column and print the result for each
for x in np.random.choice(df['subdistrict'], 10):
    print(get_sol(x, 1))
See the official documentation for numpy.random.choice.

Python, loops with changeable parts of filenames

I have a bunch of very similar commands which all look like this (df means pandas dataframe):
df1_part1=...
df1_part2=...
...
df1_part5=...
df2_part1=...
I would like to make a loop for it, as follows:
for i in range(1,5):
    for j in range(1,5):
        df%i_part%j=...
Of course, it doesn't work with %. But there has to be some easy way to do it, I suppose.
Could You help me please?
You can try one of the following options:
Create a dictionary that maps names to your dataframes, and access each dataframe by its name:
mapping = {"df1_part1": df1_part1, "df1_part2": df1_part2}
for i in range(1,5):
    for j in range(1,5):
        mapping[f"df{i}_part{j}"] = ...
Use globals() to access your variables dynamically:
df1_part1=...
df1_part2=...
...
df1_part5=...
df2_part1=...
for i in range(1,5):
    for j in range(1,5):
        globals()[f"df{i}_part{j}"] = ...
One way would be to collect your pandas dataframes in a list of lists and iterate over that list, instead of trying to dynamically build variable names in your Python code.
df1_part1=...
df1_part2=...
...
df1_part5=...
df2_part1=...
dflist = [[df1_part1, df1_part2, df1_part3, df1_part4, df1_part5],
          [df2_part1, df2_part2, df2_part3, df2_part4, df2_part5]]
for df in dflist:
    for df_part in df:
        # do something with df_part
        pass
Assuming that this process is part of data preparation, I would like to mention that you should try to work with "data preparation pipelines" whenever it is possible. Otherwise, the code will be a huge mess to read after a couple of months.
There are several ways to deal with this problem.
A dictionary is the most straightforward way to deal with this.
df_parts = {
    'df1' : {'part1': df1_part1, 'part2': df1_part2,...,'partN': df1_partN},
    'df2' : {'part1': df2_part1, 'part2': df2_part2,...,'partN': df2_partN},
    '...' : {'part1': ..._part1, 'part2': ..._part2,...,'partN': ..._partN},
    'dfN' : {'part1': dfN_part1, 'part2': dfN_part2,...,'partN': dfN_partN},
}
# print parts from `dfN`
for val in df_parts['dfN'].values():
    print(val)
# print part1 for all dfs
for df in df_parts.values():
    print(df['part1'])
# print everything
for df in df_parts:
    for val in df_parts[df].values():
        print(val)
The good thing about this approach is that you can iterate through the whole dictionary without relying on range, which may be confusing later. Also, it is better to assign every df_part directly to the dict instead of creating N*N variables that may each be used only once or twice. You can then use a single working variable and re-assign it as you progress:
# code using df1_partN
df1 = df_parts['df1']['partN']
# stuff to do
# happy? checkpoint
df_parts['df1']['partN'] = df1
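A nested dictionary like this can also be built in a loop, so the individual dfX_partY variables never need to exist. The sketch below assumes two dataframes with five parts each; load_part is a hypothetical placeholder (returning a string here) for whatever expression produces each dataframe.

```python
def load_part(i, j):
    # Hypothetical placeholder for the code that creates each dataframe.
    return f"data for df{i}, part {j}"

# Outer key: dataframe name, inner key: part name.
df_parts = {
    f"df{i}": {f"part{j}": load_part(i, j) for j in range(1, 6)}
    for i in range(1, 3)
}

# Visit every part together with its names.
for df_name, parts in df_parts.items():
    for part_name, part in parts.items():
        print(df_name, part_name, part)
```

Note that range(1, 6) yields 1 through 5, because the upper bound of range is exclusive.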

Iterating through a list of Pandas DF's to then iterate through each DF's row

This may be a slightly insane question...
I've got a single Pandas DF of articles which I have then split into multiple DF's so each DF only contains the articles from a particular year. I have then put these variables into a list called box_of_years.
indexed_df = article_db.set_index('date')
indexed_df = indexed_df.sort_index()
year_2004 = indexed_df.truncate(before='2004-01-01', after='2004-12-31')
year_2005 = indexed_df.truncate(before='2005-01-01', after='2005-12-31')
year_2006 = indexed_df.truncate(before='2006-01-01', after='2006-12-31')
year_2007 = indexed_df.truncate(before='2007-01-01', after='2007-12-31')
year_2008 = indexed_df.truncate(before='2008-01-01', after='2008-12-31')
year_2009 = indexed_df.truncate(before='2009-01-01', after='2009-12-31')
year_2010 = indexed_df.truncate(before='2010-01-01', after='2010-12-31')
year_2011 = indexed_df.truncate(before='2011-01-01', after='2011-12-31')
year_2012 = indexed_df.truncate(before='2012-01-01', after='2012-12-31')
year_2013 = indexed_df.truncate(before='2013-01-01', after='2013-12-31')
year_2014 = indexed_df.truncate(before='2014-01-01', after='2014-12-31')
year_2015 = indexed_df.truncate(before='2015-01-01', after='2015-12-31')
year_2016 = indexed_df.truncate(before='2016-01-01', after='2016-12-31')
box_of_years = [year_2004, year_2005, year_2006, year_2007,
                year_2008, year_2009, year_2010, year_2011,
                year_2012, year_2013, year_2014, year_2015,
                year_2016]
I've written various functions to tokenize, clean up and convert the tokens into a FreqDist object and wrapped those up into a single function called year_prep(). This works fine when I do
year_2006 = year_prep(year_2006)
...but is there a way I can iterate across every year variable, apply the function and have it transform the same variable, short of just repeating the above for every year?
I know repeating myself would be the simplest way, but not necessarily the cleanest. I may perhaps have this backwards and do the slicing later on but at that point I feel like the layers of lists will be out of hand as I'm going from a list of years to a list of years, containing a list of articles, containing a list of every word in the article.
I think you can use groupby by year with custom function:
import pandas as pd
start = pd.to_datetime('2004-02-24')
rng = pd.date_range(start, periods=30, freq='50D')
df = pd.DataFrame({'Date': rng, 'a':range(30)})
#print (df)
def f(x):
    print (x)
    #return year_prep(x)
    #some custom output
    return x.a + x.Date.dt.month
print (df.groupby(df['Date'].dt.year).apply(f))
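If the per-year results are needed individually afterwards, one option is to collect them in a dictionary keyed by year instead of in separate year_XXXX variables. This is a sketch under assumptions: the sample rows are invented, and year_prep is a hypothetical stand-in that just counts articles, since the real function is not shown.

```python
import pandas as pd

def year_prep(frame):
    # Hypothetical stand-in for the real year_prep: count the articles.
    return len(frame)

# Invented sample data in place of the real article_db.
article_db = pd.DataFrame({
    "date": pd.to_datetime(["2004-03-01", "2004-07-15", "2005-01-10", "2006-12-31"]),
    "text": ["a", "b", "c", "d"],
})
indexed_df = article_db.set_index("date").sort_index()

# Group on the year of the DatetimeIndex and prepare each group once.
box_of_years = {
    int(year): year_prep(group)
    for year, group in indexed_df.groupby(indexed_df.index.year)
}
print(box_of_years)
```

A result for one year is then available as box_of_years[2006], and new years are picked up without adding any variables.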
