select appropriate dataframe based on user input - python

I have the below data frames loaded:
df_1000-2000
df_3000-4000
df_5000-6000
df_7000-8000
Now I get a user input value such as 1000-2000. Based on the user input value, I need to work on the respective data frame.
In this case, I need to work on: df_1000-2000
How do I select the data frame dynamically based on user input and start working on it?

Use a dictionary
You should restructure how you store and access your dataframes. First define a dictionary that maps each range string to its dataframe (note that hyphens are not valid in Python identifiers, so the frames themselves will need names such as df_1000_2000):
dfs = {'1000-2000': df_1000_2000, '3000-4000': df_3000_4000, ...}
Then taking a user input and using it to query your dictionary is straightforward:
value = input('Input the range you require, e.g. 1000-2000:')
res = dfs[value]
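Putting it together, a minimal runnable sketch (the small frames below are hypothetical stand-ins for the ones loaded in the question):
import pandas as pd

# Hypothetical stand-ins for the frames from the question.
df_1000_2000 = pd.DataFrame({'value': [1000, 1500, 2000]})
df_3000_4000 = pd.DataFrame({'value': [3000, 3500, 4000]})

dfs = {'1000-2000': df_1000_2000, '3000-4000': df_3000_4000}

value = input('Input the range you require, e.g. 1000-2000: ')
res = dfs.get(value)           # .get() avoids a KeyError for an unknown range
if res is None:
    print('No dataframe loaded for range', value)
else:
    print(res.describe())      # work with the selected frame here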

Related

How to sort the columns by length of values in an Excel file in Python? (preferably using def)

I have a list of data as follows. I need to add the current data to a new worksheet, sorted by the length of the values in the third column (p_seq).
I was able to add the current data using openpyxl, but I'm struggling with sorting it. Ideally I would like to create a function. Thank you in advance!
strings_col1 = ["abcdefgh", "ijklmn", "opqr", "stuvwxyz123"]
sorted_list_col1 = sorted(strings_col1, key=len)
print(sorted_list_col1)
output:
['opqr', 'ijklmn', 'abcdefgh', 'stuvwxyz123']
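The same key=len idea extends to whole rows when writing the worksheet. A minimal sketch, assuming the data is already a list of row tuples with the p_seq string as the third element (the rows and filename below are hypothetical):
from openpyxl import Workbook

# Hypothetical rows; the third element is the p_seq value.
rows = [
    ("id1", "x", "abcdefgh"),
    ("id2", "y", "ijklmn"),
    ("id3", "z", "opqr"),
]

def write_sorted_by_p_seq(rows, filename="sorted.xlsx"):
    # Write rows to a new worksheet, sorted by the length of the third column.
    wb = Workbook()
    ws = wb.active
    for row in sorted(rows, key=lambda r: len(r[2])):
        ws.append(row)
    wb.save(filename)

write_sorted_by_p_seq(rows)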

creating many dataframes as subsets of one large one

I have this code which extracts subsets of a dataframe to individual dataframes which represent rainfall events:
j = list(range(len(eventdf)))
for k in range(len(eventdf)):
    dfname = 'event' + str(j[k])
    dfnatp = meandf2.iloc[eventdf.iloc[k, 0]: eventdf.iloc[k, 1] + 2]
    dfnatp.to_csv(dfname + '.csv', sep=',')
While I can very easily dump each dataframe to a .csv file, doing anything with it means that I then have to read it back in.
How do I create each dataframe with a name given by the value of 'dfname', in the same way that I can name each .csv file?
To elaborate on Muhammad's suggestion a little more, you can create an empty dictionary like this (before your for loop):
dfdict = {}
Then you can create new dictionary entries like this (inside your for loop):
dfdict[dfname] = dfnatp
These entries will have dfname as the key and dfnatp as the value, so you can access each dfnatp by using dfdict['eventXXX'], where eventXXX is your identifier.
Here is an introduction to Python's dictionary data structure for further reading.
As commented, consider a dictionary of data frames, which you can achieve with a dictionary comprehension. You lose no functionality of a data frame if it is saved in a dict or list. Since you also need to save to CSV, consider a defined method. Below uses f-strings for string interpolation.
def proc_data(k):
    dfnatp = meandf2.iloc[eventdf.iloc[k, 0]: eventdf.iloc[k, 1] + 2]
    dfnatp.to_csv(f"event_{k}.csv")
    return dfnatp

df_dict = {
    f"event_{k}": proc_data(k) for k in range(len(eventdf))
}
# ACCESS INDIVIDUAL DATA FRAMES
df_dict["event_0"]
df_dict["event_1"]
df_dict["event_2"]
...

How to loop a command in python with a list as variable input?

This is my first post to the coding community, so I hope I get the right level of detail in my request for help!
Background info:
I want to repeat (loop) a command on a df using a variable that contains a list of options. While the series 'amenity_options' contains a simple list of specific items (let's say only the four amenities in the example below), the df is a large data frame with many other items. My goal is to run the operation below for each item in 'amenity_options' until the end of the list.
amenity_options = ['cafe', 'bar', 'cinema', 'casino']  # this is a series type with multiple options
df = df[df['amenity'] == amenity_options]  # this is my attempt to select the first value in the series (e.g. cafe) out of a dataframe that contains such a column name
df.to_excel('{}_amenity.xlsx'.format('amenity'))  # wish to save the result (e.g. cafe_amenity) as a separate file
Desired result: I wish to loop steps one and two for each and every item available in the list (e.g. cafe, bar, cinema...), so that I end up with separate Excel files. Any thoughts?
What @Rakesh suggested is correct; you probably just need one more step.
df = df[df['amenity'].isin(amenity_options)]
for key, g in df.groupby('amenity'):
    g.to_excel('{}_amenity.xlsx'.format(key))
After you call groupby() on your df, you get 4 groups that you can loop over directly.
key is the group key (cafe, bar, etc.), and g is the sub-dataframe filtered to that key.
Seems like you just need a simple for loop:
for amenity in amenity_options:
    df[df['amenity'] == amenity].to_excel(f"{amenity}_amenity.xlsx")

Add dictionary to pandas data frame and ignore extra values

I'm reading a lot of log files, from which I generate a dictionary by parsing each log. I want to add these dictionaries to a dataframe, which I later use for analysis. But the information I need in the dataframe may differ every time based on user input, so I don't want all of the information in the dictionary added to the data frame; only the columns I define should be added.
As of now I'm adding all the dictionaries one by one to a list, then loading that list into a dataframe.
for log in log_lines:
    # here logic to parse the log and generate the dictionary d
    my_dict_list.append(d)
pd.DataFrame(my_dict_list)
In this way it adds all the keys and their values to the dataframe, but what I want is: I will define some columns, let's say the user asks for columns ['a','b','c'] for analysis, and I want the dataframe to load only these keys and their values; the rest should be ignored.
my_dict_list =[ {'a':'abc','b':'123','c':'hello', 'date':'20-5-2019'},
{'a':'dfc','b':'453','c':'user', 'date':'23-5-2019'},
{'a':'bla','b':'2313','c':'anything', 'date':'25-5-2019'} ]
Note: I don't want to ignore the keys at the time of extracting the logs, because I will be extracting a lot of logs and that would be time consuming.
Is there a way I can achieve this using pandas in a faster way?
In the tmp_Dict line you can filter and keep only the requested columns.
def log_dataframe(log_lines, requested_columns):
    my_dict_list = []
    for log in log_lines:
        # here logic to parse the log and generate the dictionary d
        tmp_dict = {requested_key: d[requested_key] for requested_key in requested_columns}
        my_dict_list.append(tmp_dict)
    return pd.DataFrame(my_dict_list)
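If filtering while parsing is not essential, pandas can also do the selection itself: pd.DataFrame accepts a columns argument that keeps only the listed keys from each dictionary. A minimal sketch using the sample dictionaries from the question:
import pandas as pd

my_dict_list = [{'a': 'abc', 'b': '123', 'c': 'hello', 'date': '20-5-2019'},
                {'a': 'dfc', 'b': '453', 'c': 'user', 'date': '23-5-2019'},
                {'a': 'bla', 'b': '2313', 'c': 'anything', 'date': '25-5-2019'}]

# Only the requested keys become columns; 'date' is dropped.
df = pd.DataFrame(my_dict_list, columns=['a', 'b', 'c'])
print(df)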
I am just providing some raw logic for your query; I may be wrong on some parts, but if you find it helpful, that's great. You can also mail me with future queries and I will be happy to help.
columns = []
x = int(input('Enter the number of columns you need: '))
for i in range(x):
    print("Please specify a column")
    col = input()
    columns.append(col)

my_dict_list = [{'a': 'abc', 'b': '123', 'c': 'hello', 'date': '20-5-2019'},
                {'a': 'dfc', 'b': '453', 'c': 'user', 'date': '23-5-2019'},
                {'a': 'bla', 'b': '2313', 'c': 'anything', 'date': '25-5-2019'}]

df = pd.DataFrame(my_dict_list)
for col in columns:
    print(df[[col]])

Iterate over multiple dataframes and perform maths functions save output

I have several dataframes on which I am performing the same functions - extracting the mean, geomean, median, etc. for a particular column (PurchasePrice), organised by groups within another column (GORegion). At the moment I am just doing this for each dataframe separately, as I cannot work out how to do it in a for loop and save a separate data series for each function performed on each dataframe.
i.e. I perform median like this:
regmedian15 = pd.Series(nw15.groupby(["GORegion"])['PurchasePrice'].median(), name = "regmedian_nw15")
I want to do this for a list of dataframes [nw15, nw16, nw17], extracting the same variable outputs for each of them.
I have tried things like :
listofnwdfs = [nw15, nw16, nw17]
for df in listofnwdfs:
    df+'regmedian' = pd.Series(df.groupby(["GORegion"])['PurchasePrice'].median(),
                               name=df+'regmedian')
but it says "can't assign to operator"
I think the main point is I can't work out how to create separate output variable names using the names of the dataframes I am inputting into the for loop. I just want a for loop function that produces my median output as a series for each dataframe in the list separately, and I can then do this for means and so on.
Many thanks for your help!
First, df+'regmedian' = ... is not valid Python syntax. You are trying to assign a value to an expression of the form A + B, which is why Python complains that you are trying to re-define the meaning of +.
Also, df+'regmedian' itself seems strange. You are trying to add a DataFrame and a string.
One way to keep track of different statistics for different dataframes is by using dicts. For example, you can replace
listofnwdfs = [nw15, nw16, nw17]
with
dict_of_nwd_frames = {15: nw15, 16: nw16, 17: nw17}
Say you want to store 'regmedian' data for each frame. You can do this with dicts as well.
data = dict()
for key, df in dict_of_nwd_frames.items():
    data[(key, 'regmedian')] = pd.Series(df.groupby(["GORegion"])['PurchasePrice'].median(),
                                         name=str(key) + 'regmedian')
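A minimal end-to-end sketch of the same idea, with small hypothetical frames standing in for nw15 and nw16, extended to store more than one statistic per frame:
import pandas as pd

# Hypothetical toy frames standing in for nw15, nw16, etc.
nw15 = pd.DataFrame({'GORegion': ['N', 'N', 'S'], 'PurchasePrice': [100, 120, 200]})
nw16 = pd.DataFrame({'GORegion': ['N', 'S', 'S'], 'PurchasePrice': [110, 210, 230]})

dict_of_nwd_frames = {15: nw15, 16: nw16}

data = {}
for key, df in dict_of_nwd_frames.items():
    grouped = df.groupby("GORegion")['PurchasePrice']
    data[(key, 'regmedian')] = grouped.median().rename(f"regmedian_nw{key}")
    data[(key, 'regmean')] = grouped.mean().rename(f"regmean_nw{key}")

print(data[(15, 'regmedian')])  # median PurchasePrice per GORegion for nw15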
