I have a dataframe that's created by a host of functions. From there, I need to create two more dataframes off that master frame. I have another function that takes the master frame and does a few more transformations on it, one of which is changing the column names; however, that in turn changes the column names on the master as well, and I can't figure out why.
def create_y_df(c_dataframe: pd.DataFrame):
    x_col_list = [str(i) for i in c_dataframe.columns]
    for i, j in enumerate(x_col_list):
        if 'Unnamed:' in j:
            x_col_list[i] = x_col_list[i-1]
            x_col_list[i-1] = 'drop'
    c_dataframe.columns = x_col_list
    c_dataframe = c_dataframe.drop(['drop'], axis=1)
    c_dataframe = c_dataframe.apply(lambda x: pd.Series(x.dropna().values))
    return c_dataframe
master_df = create_master(params)
y_df = create_y_df(master_df)
After running this, if I export master_df again, the columns now include 'drop'. What's interesting is that if I remove the column-renaming loop from create_y_df but leave the x.dropna(), that portion is not applied to master_df. I just have no idea why the c_dataframe.columns = x_col_list from create_y_df() is being applied to master_df.
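For reference, here is a minimal reproduction of the behaviour I'm seeing (the column names here are made up, not my real data):

import pandas as pd

master = pd.DataFrame({'a': [1, 2], 'Unnamed: 1': [3, 4]})

def rename_cols(frame):
    frame.columns = ['a', 'drop']  # assigns on the very object that was passed in
    return frame

renamed = rename_cols(master)
print(master.columns.tolist())  # ['a', 'drop'] -- master changed as well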
I have the following part of code:
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
where I am extracting some values based on some columns of the DataFrame batch. Since the initial DataFrame df can be quite large, I need to find an efficient way of doing the following:
Putting together the results of the for loop in a new DataFrame with columns unique_request, unique_ua, reply_length_avg and response4xx at each iteration.
Stacking these DataFrames one below the other at each iteration.
I tried to do the following:
df_final = pd.DataFrame()
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = [unique_request, unique_ua, reply_length_avg, response4xx]
    df_final = pd.concat([df_final, concat], axis=1, ignore_index=True)
return df_final
But I am getting the following error:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
Any idea of what I should try?
First of all, avoid using pd.concat to build the main dataframe inside a for loop, as it gets quadratically slower with each iteration. The problem you are facing is that pd.concat should receive a list of Series/DataFrames as input; however, you are passing [df_final, concat], which is, in essence, a list containing 2 elements: one DataFrame and one plain Python list. Ultimately, it seems you want to stack the results vertically, thus axis should be 0 and not 1.
Therefore, I suggest you do the following:
df_final = []
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = pd.concat([unique_request, unique_ua, reply_length_avg, response4xx], axis=1, ignore_index=True)
    df_final.append(concat)
df_final = pd.concat(df_final, axis=0, ignore_index=True)
return df_final
Note that pd.concat receives a flat list of Series/DataFrames, not a list with another list nested inside it! Also, this approach is much faster, since the objects concatenated inside the for loop no longer grow with every iteration :)
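If you also want the result to keep the column names from your question instead of integer labels, one option (just a sketch, using the same series as in the loop above) is to pass keys to the column-wise concat instead of ignore_index=True:

concat = pd.concat(
    [unique_request, unique_ua, reply_length_avg, response4xx],
    axis=1,
    keys=['unique_request', 'unique_ua', 'reply_length_avg', 'response4xx'],
)

The final vertical concat then carries those column names through automatically.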
I hope it helps!
I have this defined function which calculates all of the necessary statistics I need (e.g. two-way ANOVA and multicomparison).
def stats(F1_para1, F2_para2, para3, para4, value):
    # MEAN, SEM, COUNT
    msc = df.groupby([F1_para1, F2_para2, para3, para4])[value].agg(['mean', 'sem', 'count'])
    msc.reset_index(inplace=True)  # moves any index levels back into columns
    pd.DataFrame(msc)
    # TWO-WAY ANOVA AND MULTICOMP
    df['comb'] = df[F1_para1].map(str) + "+" + df[F2_para2].map(str)
    mod = ols(value + '~' + F1_para1 + '+' + F2_para2 + '+' + F1_para1 + '*' + F2_para2, data=df).fit()
    aov = anova_lm(mod, typ=2)  # the model passed here must be the fitted model above (i.e. mod1, mod2)
    comparison = MultiComparison(df[value], df['comb'])
    tukey_df = pd.read_html(comparison.tukeyhsd().summary().as_html())[0]
    r = tukey_df[tukey_df['reject'] == True]
    df2 = aov.append(r)  # combines the aov and r dataframes
So when I use the function as follows:
Water_intake = stats('Time','Drug','Diet','Pre_conditions',value='Water_intake')
food_intake = stats('Time','Drugs','Diet','Pre_conditions',value='Food_intake')
The output dataframes from the ANOVA and multicomparison analyses are combined into a new dataframe, 'df2'. 'value' is the column header of the dependent variable in the main dataframe (df in the code). So every time I use this function with a different dependent variable from the main dataframe (e.g. food intake, water intake, etc.), the statistics summary ends up in the df2 dataframe, which I want to save as a separate sheet in a "statistics" workbook.
I've looked at the solutions here: Save list of DataFrames to multisheet Excel spreadsheet
with ExcelWriter(r"path\statistics.xlsx") as writer:
    for n, df2 in enumerate(df2):
        df2.to_excel(writer, value)
    writer.save()
But I received this error:
AttributeError: 'str' object has no attribute 'to_excel'
Not sure if there is another way to achieve the same goal?
You are iterating over df2 itself with enumerate(df2), which yields its column names; those are strings, not dataframes, hence the error. You can check this by running:
for n, df2 in enumerate(df2):
    print(n)
    print(df2)
You're also not changing df2 or calling the function to get df2 in your for loop. I think the whole thing needs re-writing.
Firstly you need to add return df2 at the end of your function, so that you actually get your df2 when it's called.
def stats(F1_para1, F2_para2, para3, para4, value):
    # MEAN, SEM, COUNT
    msc = df.groupby([F1_para1, F2_para2, para3, para4])[value].agg(['mean', 'sem', 'count'])
    msc.reset_index(inplace=True)  # moves any index levels back into columns
    pd.DataFrame(msc)
    # TWO-WAY ANOVA AND MULTICOMP
    df['comb'] = df[F1_para1].map(str) + "+" + df[F2_para2].map(str)
    mod = ols(value + '~' + F1_para1 + '+' + F2_para2 + '+' + F1_para1 + '*' + F2_para2, data=df).fit()
    aov = anova_lm(mod, typ=2)  # the model passed here must be the fitted model above (i.e. mod1, mod2)
    comparison = MultiComparison(df[value], df['comb'])
    tukey_df = pd.read_html(comparison.tukeyhsd().summary().as_html())[0]
    r = tukey_df[tukey_df['reject'] == True]
    df2 = aov.append(r)  # combines the aov and r dataframes
    return df2
Then your 2 function calls in the question will actually return something:
Water_intake = stats('Time','Drug','Diet','Pre_conditions',value='Water_intake')
food_intake = stats('Time','Drugs','Diet','Pre_conditions',value='Food_intake')
To export these two to Excel on different sheets, you can do:
writer = pd.ExcelWriter(r"path\statistics.xlsx")
Water_intake.to_excel(writer, sheet_name='Water_intake')
food_intake.to_excel(writer, sheet_name='Food_intake')
writer.save()
This should give you a spreadsheet with 2 sheets containing the different df2 on each. I don't know how many of these you need, or how you call the function differently for each, but it may be necessary to create a for loop.
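For example, if you end up with several dependent variables, a loop over a dict could look roughly like this (just a sketch; the extra names are placeholders):

results = {
    'Water_intake': stats('Time', 'Drug', 'Diet', 'Pre_conditions', value='Water_intake'),
    'Food_intake': stats('Time', 'Drugs', 'Diet', 'Pre_conditions', value='Food_intake'),
}
with pd.ExcelWriter(r"path\statistics.xlsx") as writer:
    for sheet_name, frame in results.items():
        frame.to_excel(writer, sheet_name=sheet_name)
    # the with-block saves the workbook when it exits, so no writer.save() is needed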
I am looking into creating a big dataframe (pandas) from several individual frames. The data is organized in MF4 files, and the number of source files varies for each cycle. The goal is to have this process automated.
Creation of Dataframes:
df = (MDF('File1.mf4')).to_dataframe(channels)
df1 = (MDF('File2.mf4')).to_dataframe(channels)
df2 = (MDF('File3.mf4')).to_dataframe(channels)
These Dataframes are then merged:
df = pd.concat([df, df1, df2], axis=0)
How can I do this without dynamically creating variables for df, df1 etc.? Or is there no other way?
I have all file paths in an array of the form:
Filepath = ['File1.mf4', 'File2.mf4', 'File3.mf4']
Now I am thinking of looping through it and dynamically creating the data frames df, df1, ... df1000. Any advice here?
Edit: here is the full code:
df = (MDF('File1.mf4')).to_dataframe(channels)
df1 = (MDF('File2.mf4')).to_dataframe(channels)
df2 = (MDF('File3.mf4')).to_dataframe(channels)
#The Data has some offset:
x = df.index.max()
df1.index += x
x = df1.index.max()
df2.index += x
#With correct index now the data can be merged
df = pd.concat([df, df1, df2], axis=0)
The way I'm interpreting your question is that you have a predefined list of files you want to load. So just:
l = []
for f in [ list ... of ... files ]:
    df = load_file(f)  # however you load it
    l.append(df)
big_df = pd.concat(l)
del l, df, f  # if you want to clean it up
You therefore don't need to manually specify variable names for your data sub-sections. If you also want to do checks or column renaming between the various files, you can also just put that into the for-loop (or alternatively, if you want to simplify to a list comprehension, into the load_file function body).
Try this:
df_list = [(MDF(file)).to_dataframe(channels) for file in Filepath]
df = pd.concat(df_list)
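If you also need the index offset from the question's edit, a sketch of the same idea with an explicit loop (reusing Filepath and channels from the question) could be:

frames = []
offset = 0
for file in Filepath:
    frame = MDF(file).to_dataframe(channels)
    frame.index += offset        # shift past the previous file's last index
    offset = frame.index.max()
    frames.append(frame)
df = pd.concat(frames, axis=0)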
I am iterating over a series of csv files as dataframes, eventually writing them all out to a common excel workbook.
In one of the many files, there are decimal GPS values (latitude, longitude) split into two columns (df[4] and df[5]) that I'm converting to degrees-minutes-seconds. That method returns a tuple that I'm attempting to park in two new fields called dmslat and dmslon in the same row of the original dataframe:
def convert_dd_to_dms(lat, lon):
    # does the math here
    return dmslat, dmslon
csv_dir = askdirectory()  # tkinter directory picker
os.chdir(csv_dir)

for f in glob.iglob("*.csv"):
    (csv_path, csv_name) = os.path.split(f)
    (csv_prefix, csv_ext) = os.path.splitext(csv_name)
    if csv_prefix[-3:] == "loc":
        df = pd.read_csv(f)
        df['dmslat'] = None
        df['dmslon'] = None
        for i, row in df.iterrows():
            fixed_coords = convert_dd_to_dms(row[4], row[5])
            row['dmslat'] = fixed_coords[0]
            row['dmslon'] = fixed_coords[1]
        print(df)
    # process the other files
So when I use a print() statement I can see the coords are properly calculated but they are not being committed to the dmslat/dmslon fields.
I have also tried assigning the new fields within the row iterator, but since I am at the row scale, it ends up overwriting the entire column with the new calculated value every time.
How can I get the results to (succinctly) populate the columns?
It would appear that df.iterrows() is resulting in a "copy" of each row, thus when you add/update the columns "dmslat" and "dmslon", you are modifying the copy, not the original dataframe. This can be confirmed by printing "row" after your assignments. You will see the row item was successfully updated, but the changes are not reflected in the original dataframe.
To modify the original dataframe, you can modify your code as such:
for i, row in df.iterrows():
    fixed_coords = convert_dd_to_dms(row[4], row[5])
    df.loc[i, 'dmslat'] = fixed_coords[0]
    df.loc[i, 'dmslon'] = fixed_coords[1]
print(df)
Using df.loc guarantees the changes are made to the original dataframe.
I think you'd be better off using apply rather than iterrows.
Here's a solution that is based on apply. I replaced your location calculation with a function named 'foo' which does some arbitrary calculation from two fields 'a' and 'b' to new values for 'a' and 'b'.
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

def foo(row):
    return (row["a"] + row["b"], row["a"] * row["b"])

new_df = df.apply(foo, axis=1).apply(pd.Series)
In the above code block, applying 'foo' returns a tuple for every row. Using apply again with pd.Series turns it into a data frame.
df[["a", "b"]] = new_df
df.head(3)

    a   b
0  10   0
1  12  11
2  14  24
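Applied back to the coordinate columns from the question, the same pattern might look like this (a sketch, assuming convert_dd_to_dms returns a (lat, lon) tuple as described there):

new_cols = df.apply(lambda row: convert_dd_to_dms(row[4], row[5]), axis=1).apply(pd.Series)
df[['dmslat', 'dmslon']] = new_cols  # dmslat/dmslon already exist (created as None in the question)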
I have written a function that takes a list of file-paths and then concatenates them into one large dataframe. I would like to include an argument that takes a list of column names the user is interested in looking at.
The dataframe must always contain the 'category' column if the user decides to filter the columns, but I want the default to be that it returns all of the columns. I can't quite seem to figure out how to optionally select columns from a dataframe.
Here is my function, interspersed with some pseudocode to explain what I'm talking about.
def combine_all_data(data_files, columns_needed=ALL):
    dataframes = map(pd.read_csv, data_files)
    if columns_needed != ALL:
        columns_needed = ['category'] + columns_needed
    df = pd.concat(dataframes, sort=False)[columns_needed]
    return df
If it's the ALL default that you don't know how to implement, you can try this:
def combine_all_data(data_files, columns_needed=None):
    kwargs = dict()
    if columns_needed is not None:
        if 'category' not in columns_needed:
            columns_needed = ['category'] + columns_needed
        kwargs['usecols'] = columns_needed
    dataframes = [pd.read_csv(data_file, **kwargs) for data_file in data_files]
    return pd.concat(dataframes, sort=False)
The advantage of this is that you need less memory, because the columns you don't want to see are already skipped in the reading process.
Additionally, you return a full dataframe, not a slice of one, so you can work with it without restrictions.
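Called like this (the file paths and column names are hypothetical), it reads only the requested columns plus 'category':

files = ['data_part1.csv', 'data_part2.csv']              # hypothetical paths
df_all = combine_all_data(files)                          # no filter: all columns are read
df_slim = combine_all_data(files, ['amount', 'region'])   # hypothetical columns; 'category' is added automatically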
read_csv has a usecols argument:
def combine_all_data(data_files, columns_needed='ALL'):
    if columns_needed != 'ALL':
        if 'category' not in columns_needed:
            columns_needed.append('category')
        return pd.concat([pd.read_csv(x, usecols=columns_needed) for x in data_files], sort=False)
    else:
        return pd.concat([pd.read_csv(x) for x in data_files], sort=False)