Python, loops with changeable parts of filenames

Python, loops with changeable parts of filenames - python

I have a bunch of very similar commands which all look like this (df means pandas dataframe):
df1_part1=...
df1_part2=...
...
df1_part5=...
df2_part1=...
I would like to make a loop for it, as follows:
for i in range(1,5):
for j in range(1,5):
df%i_part%j=...
Of course, it doesn't work with %. But is has to be some easy way to do it, I suppose.
Could You help me please?

You can try one of the following options:
Create a dictionary which maps the your df and access it by the name of the dataframe:
mapping = {"df1_part1": df1_part1, "df1_part2": df1_part2}
for i in range(1,5):
for j in range(1,5):
mapping[f"df{i}_part{j}"] = ...
Use globals to access dynamically your variables:
df1_part1=...
df1_part2=...
...
df1_part5=...
df2_part1=...
for i in range(1,5):
for j in range(1,5):
globals()[f"df{i}_part{j}"] = ...

One way would be to collect your pandas dataframes in a list of lists and iterate over that list instead of trying dynamically parse your python code.
df1_part1=...
df1_part2=...
...
df1_part5=...
df2_part1=...
dflist = [[df1_part1, df1_part2, df1_part3, df1_part4, df1_part5],
[df2_part1, df2_part2, df2_part3, df2_part4, df2_part5]]
for df in dflist:
for df_part in df:
# do something with df_part

Assuming that this process is part of data preparation, I would like to mention that you should try to work with "data preparation pipelines" whenever it is possible. Otherwise, the code will be a huge mess to read after a couple of months.
There are several ways to deal with this problem.
A dictionary is the most straightforward way to deal with this.
df_parts = {
'df1' : {'part1': df1_part1, 'part2': df1_part2,...,'partN': df1_partN},
'df2' : {'part1': df1_part1, 'part2': df1_part2,...,'partN': df2_partN},
'...' : {'part1': ..._part1, 'part2': ..._part2,...,'partN': ..._partN},
'dfN' : {'part1': dfN_part1, 'part2': dfN_part2,...,'partN': dfN_partN},
}
# print parts from `dfN`
for val in for df_parts['dfN'].values():
print(val)
# print part1 for all dfs
for df in df_parts.values():
print(df['part1'])
# print everything
for df in df_parts:
for val in df_parts[df].values():
print(val)
The good thing with this approach is that you can iterate through the whole dictionary, but you don't include range which may be confusing later. Also, it is better to assign every df_part directly to a dict instead of assigning N*N variables which may be used once or twice. In this case you can just use 1 variable and re-assign it as you progress:
# code using df1_partN
df1 = df_parts['df1']['partN']
# stuff to do
# happy? checkpoint
df_parts['df1']['partN'] = df1

Related

efficient way of computing a dataframe using concat and split

I am new to python/pandas/numpy and I need to create the following Dataframe:
DF = pd.concat([pd.Series(x[2]).apply(lambda r: pd.Series(re.split('\#|/',r))).assign(id=x[0]) for x in hDF])
where hDF is a dataframe that has been created by:
hDF=pd.DataFrame(h.DF)
and h.DF is a list whose elements looks like this:
['5203906',
['highway=primary',
'maxspeed=30',
'oneway=yes',
'ref=N 22',
'surface=asphalt'],
['3655224911#1.735928/42.543651',
'3655224917#1.735766/42.543561',
'3655224916#1.735694/42.543523',
'3655224915#1.735597/42.543474',
'4817024439#1.735581/42.543469']]
However, in some cases the list is very long (O(10^7)) and also the list in h.DF[*][2] is very long, so I run out of memory.
I can obtain the same result, avoiding the use of the lambda function, like so:
DF = pd.concat([pd.Series(x[2]).str.split('\#|/', expand=True).assign(id=x[0]) for x in hDF])
But I am still running out of memory in the cases where the lists are very long.
Can you think of a possible solution to obtain the same results without starving resources?

I managed to make it work using the following code:
bl = []
for x in h.DF:
data = np.loadtxt(
np.loadtxt(x[2], dtype=str, delimiter="#")[:, 1], dtype=float, delimiter="/"
).tolist()
[i.append(x[0]) for i in data]
bl.append(data)
bbl = list(itertools.chain.from_iterable(bl))
DF = pd.DataFrame(bbl).rename(columns={0: "lon", 1: "lat", 2: "wayid"})
Now it's super fast :)

Why are multiple values incorrectly updated in my dynamically created nested dicts?

Dfs is a dict with dataframes and the keys are named like this: 'datav1_135_gl_b17'
We would like to calculate a matrix with constants. It should be possible to assign the values in the matrix according to the attributes from the df name. In this example '135' and 'b17'.
If you want code to create an example dfs, let me know, I've cut it out to more clearly state the problem.
We create a nested dict dynamically with the following function:
def ex_calc_time(dfs):
formats = []
grammaturs = []
for i in dfs:
# (...)
# format
split1 = i.split('_')
format = split1[-1]
format.replace(" ", "")
formats.append(format)
formats = list(set(formats))
# grammatur
# split1 = i.split('_')
grammatur = split1[-3]
grammatur.replace(" ", "")
grammaturs.append(grammatur)
grammaturs = list(set(grammaturs))
# END FLOOP
dict_mean_time = dict.fromkeys(formats, dict.fromkeys(grammaturs, ''))
return dfs, dict_mean_time
Then we try to fill the nested dict and change the values like this (which should be working according to similiar nested dict questions, but it doesn't). 'Nope' is updated for both keys:
ex_dict_mean_time['b17']['170'] = 'nope'
ex_dict_mean_time
{'a18': {'135': '', '170': 'nope', '250': ''},
'b17': {'135': '', '170': 'nope', '250': ''}}
I also tried creating a dataframe from ex_dict_mean_time and filling it with .loc, but that didn't work either (df remains empty). Moreover I tried this method, but I always end up with the same problem and the values are overwritten. I appreciate any help. If you have any improvements for my code please let me know, I welcome any opportunity to improve.

how can I rewrite this code to make it easier to read?

start=2014
df = pd.DataFrame({'age':past_cars_sold,}, index = [start, start+1,start+2,start+3,start+4,start+5,start+6])
is there an easier way to rewrite this code. Right now i have do it one at a time and just want to know if there is an easier way to rewrite this.

Karl's comment seems the most straightforward. No list needed -- just give pandas a range object:
start = 2014
df = pd.DataFrame({'age': past_cars_sold}, index=range(start, start+7))

First, write it like this, easier to read and reorgranize later
start=2014
df = pd.DataFrame(
{
'age':past_cars_sold,
},
index = [
start,
start+1,
start+2,
start+3,
start+4,
start+5,
start+6
]
)
Then, see if you can simplify it, for example
past_cars_sold = [1,2,3,4,5,6] # dummy test values
start = 2014 # avoid hard-coding value
years = 6 # or group these values together
idxls = range(start, start + years, 1) # form a list with functions
replace it
df = pd.DataFrame(
{
'age':past_cars_sold,
},
index = idxls
)
maybe also a good idea to read the official "Pythonic" way to format your code.

Utilizing a for loop will allow you to automatically populate the list with the values you want using simple math & logic.
start = 2014
index = []
for i in range(7):
index.append(start+i)
This makes the code more readable, and also nearly infinitely scalable. You will also not have to populate your list manually.
Edit:
As others have added, you can use Pythonic List Comprehension as well.
This means you can populate a list in one line like so:
start = 2014
index = [start+i for i in range(7)] # from i=0 to i=6 (7 total elements)

for loop with same dataframe on both side of the operator

I have defined 10 different DataFrames A06_df, A07_df , etc, which picks up six different data point inputs in a daily time series for a number of years. To be able to work with them I need to do some formatting operations such as
A07_df=A07_df.fillna(0)
A07_df[A07_df < 0] = 0
A07_df.columns = col # col is defined
A07_df['oil']=A07_df['oil']*24
A07_df['water']=A07_df['water']*24
A07_df['gas']=A07_df['gas']*24
A07_df['water_inj']=0
A07_df['gas_inj']=0
A07_df=A07_df[['oil', 'water', 'gas','gaslift', 'water_inj', 'gas_inj', 'bhp', 'whp']]
etc for a few more formatting operations
Is there a nice way to have a for loop or something so I don’t have to write each operation for each dataframe A06_df, A07_df, A08.... etc?
As an example, I have tried
list=[A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
for i in list:
i=i.fillna(0)
But this does not do the trick.
Any help is appreciated

As i.fillna() returns a new object (an updated copy of your original dataframe), i=i.fillna(0) will update the content of ibut not of the list content A06_df, A07_df,....
I suggest you copy the updated content in a new list like this:
list_raw = [A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
list_updated = []
for i in list_raw:
i=i.fillna(0)
# More code here
list_updated.append(i)
To simplify your future processes I would recommend to use a dictionary of dataframes instead of a list of named variables.
dfs = {}
dfs['A0'] = ...
dfs['A1'] = ...
dfs_updated = {}
for k,i in dfs.items():
i=i.fillna(0)
# More code here
dfs_updated[k] = i

Formatting Multiple Columns in a Pandas Dataframe

I have a dataframe I'm working with that has a large number of columns, and I'm trying to format them as efficiently as possible. I have a bunch of columns that all end in .pct that need to be formatted as percentages, some that end in .cost that need to be formatted as currency, etc.
I know I can do something like this:
cost_calc.style.format({'c.somecolumn.cost' : "${:,.2f}",
'c.somecolumn.cost' : "${:,.2f}",
'e.somecolumn.cost' : "${:,.2f}",
'e.somecolumn.cost' : "${:,.2f}",...
and format each column individually, but I was hoping there was a way to do something similar to this:
cost_calc.style.format({'*.cost' : "${:,.2f}",
'*.pct' : "{:,.2%}",...
Any ideas? Thanks!

The first way doesn't seem bad if you can automatically build that dictionary... you can generate a list of all columns fitting the *.cost description with something like
costcols = [x for x in df.columns.values if x[-5:] == '.cost']
then build your dict like:
formatdict = {}
for costcol in costcols: formatdict[costcol] = "${:,.2f}"
then as you suggested:
cost_calc.style.format(formatdict)
You can easily add the .pct cases similarly. Hope this helps!

I would use regEx with dict generators:
import re
mylist = cost_calc.columns
r = re.compile(r'.*cost')
cost_cols = {key: "${:,.2f}" for key in mylist if r.match(key)}
r = re.compile(r'.*pct')
pct_cols = {key: "${:,.2f}" for key in mylist if r.match(key)}
cost_calc.style.format({**cost_cols, **pct_cols})
note: code for Python 2.7 and 3 onwards

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python, loops with changeable parts of filenames - python

Related

efficient way of computing a dataframe using concat and split

Why are multiple values incorrectly updated in my dynamically created nested dicts?

how can I rewrite this code to make it easier to read?

for loop with same dataframe on both side of the operator

Formatting Multiple Columns in a Pandas Dataframe

Categories

Resources