How to loop through a few lines - Python

I have a question about how to loop over a few lines.
get_sol is a function I created that takes two parameters: def get_sol(sub_dist_fil, fos_cnt)
banswara, palwal and hathin are some values of a column named "sub-district".
The second argument, 1, is fixed.
I am currently writing it as:
out_1 = get_sol( "banswara",1)
out_1 = get_sol("palwal",1)
out_1 = get_sol("hathin",1)
How can I apply a for loop to these lines in order to get the results in one go?
Help !!
EDIT: A few comments have helped me achieve my results (thanks a lot). The result is as follows.
Now I have a query: how do I display/print the name of the respective district for which the results are running?

Well, in the general case you can do something like this:
data = ['banswara', 'palwal', 'hathin']
result = {}
for item in data:
    result[item] = get_sol(item, 1)
print(result)
This will pack your results into a dictionary, giving you the opportunity to see which result was generated for which input.
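Since the edit asks how to show which district each result belongs to, here is a short usage sketch building on that dictionary (get_sol is assumed to be the function from the question):
data = ['banswara', 'palwal', 'hathin']
result = {item: get_sol(item, 1) for item in data}

# print each district name next to its result
for district, value in result.items():
    print(district, value)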

here you go:
# save the values into a list (here df is assumed to be your dataframe)
random_values = df["sub-district"]
# iterate through using for
for random_value in random_values:
    # get the result
    result = get_sol(random_value, 1)
    # print the result or do whatever
    # you want with the result
    print(result)

Similar to the other answers, but using a list comprehension to make it more pythonic (and usually faster):
districts = ['banswara', 'palwal', 'hathin']
result = [get_sol(item, 1) for item in districts]
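If you also need to know which district produced which result (as asked in the edit), zip keeps the names alongside the values; a small sketch, again assuming get_sol from the question:
districts = ['banswara', 'palwal', 'hathin']
results = [get_sol(item, 1) for item in districts]

# pair each district with its result
for district, res in zip(districts, results):
    print(district, res)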

I think you are trying to get random values from the column 'subdistrict'.
For the purpose of illustration, let the dataframe be df (so to access the 'subdistrict' column, use df['subdistrict']):
import numpy as np

# selecting 10 random values from that particular column
[print(get_sol(x, 1)) for x in np.random.choice(df['subdistrict'], 10)]
Here is the official documentation for numpy.random.choice.

Related

Why are multiple values incorrectly updated in my dynamically created nested dicts?

Dfs is a dict with dataframes and the keys are named like this: 'datav1_135_gl_b17'
We would like to calculate a matrix with constants. It should be possible to assign the values in the matrix according to the attributes from the df name. In this example '135' and 'b17'.
If you want code to create an example dfs, let me know, I've cut it out to more clearly state the problem.
We create a nested dict dynamically with the following function:
def ex_calc_time(dfs):
    formats = []
    grammaturs = []
    for i in dfs:
        # (...)
        # format
        split1 = i.split('_')
        format = split1[-1]
        format.replace(" ", "")
        formats.append(format)
        formats = list(set(formats))
        # grammatur
        # split1 = i.split('_')
        grammatur = split1[-3]
        grammatur.replace(" ", "")
        grammaturs.append(grammatur)
        grammaturs = list(set(grammaturs))
        # END FLOOP
    dict_mean_time = dict.fromkeys(formats, dict.fromkeys(grammaturs, ''))
    return dfs, dict_mean_time
Then we try to fill the nested dict and change the values like this (which should work according to similar nested-dict questions, but it doesn't); 'nope' is updated for both keys:
ex_dict_mean_time['b17']['170'] = 'nope'
ex_dict_mean_time
{'a18': {'135': '', '170': 'nope', '250': ''},
'b17': {'135': '', '170': 'nope', '250': ''}}
I also tried creating a dataframe from ex_dict_mean_time and filling it with .loc, but that didn't work either (df remains empty). Moreover I tried this method, but I always end up with the same problem and the values are overwritten. I appreciate any help. If you have any improvements for my code please let me know, I welcome any opportunity to improve.
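The behaviour described here is the classic dict.fromkeys pitfall: when the value argument is mutable (here the inner dict), it is evaluated once and every outer key ends up pointing to the same inner dict object, so updating one key updates them all. A minimal standalone sketch of the issue and one possible fix, using example values in place of the real format/grammatur codes:
formats = ['a18', 'b17']            # example stand-ins for the real format codes
grammaturs = ['135', '170', '250']  # example stand-ins for the real grammatur codes

# Pitfall: dict.fromkeys evaluates its value argument once, so both outer keys
# share the *same* inner dict object.
shared = dict.fromkeys(formats, dict.fromkeys(grammaturs, ''))
shared['b17']['170'] = 'nope'
print(shared)        # 'nope' shows up under 'a18' as well

# Fix: build a fresh inner dict per outer key, e.g. with a dict comprehension.
independent = {f: {g: '' for g in grammaturs} for f in formats}
independent['b17']['170'] = 'nope'
print(independent)   # only 'b17' is updated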

for loop with the same dataframe on both sides of the operator

I have defined 10 different DataFrames A06_df, A07_df, etc., which pick up six different data point inputs in a daily time series for a number of years. To be able to work with them I need to do some formatting operations such as
A07_df=A07_df.fillna(0)
A07_df[A07_df < 0] = 0
A07_df.columns = col # col is defined
A07_df['oil']=A07_df['oil']*24
A07_df['water']=A07_df['water']*24
A07_df['gas']=A07_df['gas']*24
A07_df['water_inj']=0
A07_df['gas_inj']=0
A07_df=A07_df[['oil', 'water', 'gas','gaslift', 'water_inj', 'gas_inj', 'bhp', 'whp']]
etc for a few more formatting operations
Is there a nice way to have a for loop or something so I don’t have to write each operation for each dataframe A06_df, A07_df, A08.... etc?
As an example, I have tried
list=[A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
for i in list:
    i = i.fillna(0)
But this does not do the trick.
Any help is appreciated
As i.fillna() returns a new object (an updated copy of your original dataframe), i = i.fillna(0) only rebinds the loop variable i; it does not change the original A06_df, A07_df, ... objects or the contents of the list.
I suggest you collect the updated dataframes in a new list like this:
list_raw = [A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
list_updated = []
for i in list_raw:
    i = i.fillna(0)
    # More code here
    list_updated.append(i)
To simplify your future processing I would recommend using a dictionary of dataframes instead of a list of named variables.
dfs = {}
dfs['A0'] = ...
dfs['A1'] = ...
dfs_updated = {}
for k, i in dfs.items():
    i = i.fillna(0)
    # More code here
    dfs_updated[k] = i
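Building on that, the formatting steps from the question could be wrapped in a small helper and applied to every dataframe in the dict. This is a minimal sketch, assuming col and the A06_df/A07_df dataframes are already defined as in the question:
def format_df(df, col):
    # same formatting steps as in the question, applied to one dataframe
    df = df.fillna(0)
    df[df < 0] = 0
    df.columns = col
    for c in ['oil', 'water', 'gas']:
        df[c] = df[c] * 24
    df['water_inj'] = 0
    df['gas_inj'] = 0
    return df[['oil', 'water', 'gas', 'gaslift', 'water_inj', 'gas_inj', 'bhp', 'whp']]

dfs = {'A06': A06_df, 'A07': A07_df}  # add the remaining dataframes here
dfs_updated = {k: format_df(v, col) for k, v in dfs.items()}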

Rename a data frame by adding the iteration value as a suffix in a for loop (Python)

I have run the following Python code :
array = ['AEM000', 'AID017']
USA_DATA_1D = USA_DATA10.loc[USA_DATA10['JOBSPECIALTYCODE'].isin(array)]
I run a regression model and extract the log-likelihood value for each item of this array with a for loop:
for item in array:
    USA_DATA_1D = USA_DATA10.loc[USA_DATA10['JOBSPECIALTYCODE'] == item]
    formula = "WEIGHTED_BASE_MEDIAN_FINAL_MEAN ~ YEAR"
    response, predictors = dmatrices(formula, USA_DATA_1D, return_type='dataframe')
    mod1 = sm.GLM(response, predictors, family=sm.genmod.families.family.Gaussian()).fit()
    LLF_NG = {'model': ['Standard Gaussian'],
              'llf_value': mod1.llf
              }
    df_llf = pd.DataFrame(LLF_NG, columns=['model', 'llf_value'])
Now I would like to rename the dataframe df_llf to df_llf_(name of the item), i.e. df_llf_AEM000 when running the loop on the first item and df_llf_AID017 when running it on the second one.
I need some help to know how to proceed.
If you want to rename the data frame, you need to use the copy method so that the original data frame does not get altered.
df_llf_AEM000 = df_llf.copy()
If you want to save iteratively several different versions of the original data frame, you can do something like this:
allDataframes = []
for i in range(10):
    df = df_original.copy()
    allDataframes.append(df)
print(allDataframes[0])
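If the goal is one result object per item, named after the item (df_llf_AEM000, df_llf_AID017, ...), a dictionary keyed by the item is usually preferred over dynamically created variable names. A minimal sketch, with a stand-in dataframe in place of the real df_llf built inside the loop:
import pandas as pd

array = ['AEM000', 'AID017']
llf_results = {}
for item in array:
    # build df_llf for this item as in the question; a stand-in llf value is
    # used here so the sketch runs on its own
    df_llf = pd.DataFrame({'model': ['Standard Gaussian'], 'llf_value': [0.0]})
    llf_results[f'df_llf_{item}'] = df_llf.copy()

# access the result for a particular job specialty code
print(llf_results['df_llf_AEM000'])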

Formatting Multiple Columns in a Pandas Dataframe

I have a dataframe I'm working with that has a large number of columns, and I'm trying to format them as efficiently as possible. I have a bunch of columns that all end in .pct that need to be formatted as percentages, some that end in .cost that need to be formatted as currency, etc.
I know I can do something like this:
cost_calc.style.format({'c.somecolumn.cost' : "${:,.2f}",
'c.somecolumn.cost' : "${:,.2f}",
'e.somecolumn.cost' : "${:,.2f}",
'e.somecolumn.cost' : "${:,.2f}",...
and format each column individually, but I was hoping there was a way to do something similar to this:
cost_calc.style.format({'*.cost' : "${:,.2f}",
'*.pct' : "{:,.2%}",...
Any ideas? Thanks!
The first way doesn't seem bad if you can automatically build that dictionary... you can generate a list of all columns fitting the *.cost description with something like
costcols = [x for x in df.columns.values if x[-5:] == '.cost']
then build your dict like:
formatdict = {}
for costcol in costcols: formatdict[costcol] = "${:,.2f}"
then as you suggested:
cost_calc.style.format(formatdict)
You can easily add the .pct cases similarly. Hope this helps!
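For instance, extending the same idea to the .pct columns and applying everything in one call (a sketch, using the cost_calc dataframe from the question):
costcols = [x for x in cost_calc.columns if x.endswith('.cost')]
pctcols = [x for x in cost_calc.columns if x.endswith('.pct')]

formatdict = {c: "${:,.2f}" for c in costcols}
formatdict.update({p: "{:,.2%}" for p in pctcols})

cost_calc.style.format(formatdict)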
I would use regex with dict comprehensions:
import re
mylist = cost_calc.columns
r = re.compile(r'.*\.cost$')
cost_cols = {key: "${:,.2f}" for key in mylist if r.match(key)}
r = re.compile(r'.*\.pct$')
pct_cols = {key: "{:,.2%}" for key in mylist if r.match(key)}
cost_calc.style.format({**cost_cols, **pct_cols})
note: the {**cost_cols, **pct_cols} merge syntax requires Python 3.5 or later

pandas - drop row with list of values, if contains from list

I have a huge set of data, something like 100k lines, and I am trying to drop a row from a dataframe if the row, which contains a list, contains a value from another dataframe. Here's a small example.
import pandas as pd

has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1, 2, 3, 5]
z = ['#d', '#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things I've tried:
# this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a)
#this works for my small scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
#the error being "unterminated character set at position 1343770"
#i went to check what was on that line and it returned this
basket.iloc[1343770]
user_id 17060480
tweet [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
Is ['#c, #d, #e, #f'] one string, or a list like ['#c', '#d', '#e', '#f']?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
A simple solution would be:
screen = set(df2.z.tolist())
to_delete = list()  # this will speed things up doing only 1 delete
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
Speed comparison (for 10,000 rows):
st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time() - st)
2.142000198364258
st = time.time()
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break
print(time.time() - st)
43.99799990653992
For me, your code works if I make several adjustments.
First, using range(df.tweet.size) can miss rows; it is more robust (especially if you don't have a simple increasing index) to iterate over df.tweet.index.
Second, you don't actually apply your drop; use inplace=True for that.
Third, you have #d inside a single string: '#c, #d, #e, #f' is not a list of tags, so you have to change it to a list for the membership check to work.
So if you change that, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break  # so if we already dropped it we no longer look whether we should drop this line
This will provide the desired result. Be aware that it is potentially not optimal because of the missing vectorization.
EDIT:
You can turn each string into a proper list of tags with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: [s.strip() for s in chain(*map(lambda lelem: lelem.split(","), l))])
This applies a function to each row (assuming each row contains a list with one or more elements): split each element (which should be a string) on commas, flatten all the resulting lists of that row into one list, and strip the surrounding whitespace so that '#d' still matches ' #d'.
EDIT2:
Yes, this is not really performant, but it basically does what was asked. Keep that in mind and, once it works, try to improve your code (fewer for iterations, tricks like collecting the indices first and then dropping them all at once).
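As one possible improvement along those lines, a boolean mask built with a single apply avoids dropping rows one by one; a minimal sketch using the example data from the question:
import pandas as pd

has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1, 2, 3, 5]
z = ['#d', '#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})

screen = set(df2.z)
# keep only the rows whose tweet list shares no tags with the screen set
mask = df.tweet.apply(lambda tags: not screen.intersection(tags))
result = df[mask].reset_index(drop=True)
print(result)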

Categories

Resources