I am iteratively processing a couple of "groups" and I would like to combine them into a single dataframe, with every group identified by a second-level index.
This:
print(pd.concat([df1, df2, df3], keys=["A", "B", "C"]))
was suggested to me - but it doesn't play well with iteration.
I am currently doing
data_all = pd.DataFrame([])
for a in a_list:
    group = some.function(a, etc)
    group = group.set_index(['CoI'], append=True, drop=True)
    group = group.reorder_levels(['CoI', 'oldindex'])
    data_all = pd.concat([data_all, group], ignore_index=False)
But the last line totally destroys my multi-index and I cannot reconstruct it.
Can you give me a hand?
You should be able to just make data_all a list and concatenate once at the end:
data_all = []
for a in a_list:
    group = some.function(a, etc)
    group = group.set_index(['CoI'], append=True, drop=True)
    group = group.reorder_levels(['CoI', 'oldindex'])
    data_all.append(group)
data_all = pd.concat(data_all, ignore_index=False)
Also keep in mind that pandas' concat works with iterators. Something like yield group may be more efficient than appending to a list each time. I haven't profiled it though!
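A minimal sketch of that generator variant, reusing the placeholder names from the question (some.function, a_list, etc, CoI, oldindex are all stand-ins):
def make_groups():
    # yield one re-indexed group per element of a_list
    for a in a_list:
        group = some.function(a, etc)
        group = group.set_index(['CoI'], append=True, drop=True)
        yield group.reorder_levels(['CoI', 'oldindex'])

data_all = pd.concat(make_groups())  # concat accepts any iterable of frames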
I'm starting to lose my mind a bit. I have:
df = pd.DataFrame(bunch_of_stuff)
df2 = df.loc[bunch_of_conditions].copy()
def transform_df2(df2):
    df2['new_col'] = [rand()]*len(df2)
    df2['existing_column_1'] = [list of new values]
    return df2
df2 = transform_df2(df2)
I now want to re-insert df2 into df, such that it overwrites all of its previous records.
What would the best way to do this be? df.loc[df2.index] = df2? This doesn't bring over any of the new columns in df2, though.
You have the right method with pd.concat. However, you can optimize a little bit by using a boolean mask to avoid recomputing the index difference:
m = bunch_of_conditions
df2 = df[m].copy()
df = pd.concat([df[~m], df2]).sort_index()
Why do you want to make a copy of your dataframe? Isn't it simpler to use the dataframe itself?
One way I did it was:
df = pd.concat([df.loc[~df.index.isin(df2.index)], df2])
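For what it's worth, a tiny self-contained illustration (with made-up data) of why this concat route keeps the new columns while df.loc[df2.index] = df2 does not:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4]})
m = df['x'] > 2                  # stand-in for bunch_of_conditions
df2 = df[m].copy()
df2['new_col'] = 99              # column added by the transform
df = pd.concat([df[~m], df2]).sort_index()
print(df)                        # new_col is present, NaN on the untouched rows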
I am able to print the small dataframe and see that it is being generated correctly; I've written it using the code below. My final result, however, contains just the output of the final merge, as opposed to passing over each one and merging them.
MIK_Quantiles is the first larger dataframe, df2_t is the smaller dataframe being generated in the while loop. The dataframes are both produced correctly and the merge works, but I'm left with just the result of the very last merge. I want it to merge the current df2_t with the already merged result (df_merged) of the previous loop. I hope this makes sense!
i = 0
while i < df_length - 1:
    cur_bound = MIK_Quantiles['bound'].iloc[i]
    cur_percentile = MIK_Quantiles['percentile'].iloc[i]
    cur_bin_low = MIK_Quantiles['auppm'].iloc[i]
    cur_bin_high = MIK_Quantiles['auppm'].iloc[i+1]
    ### Grades/Counts within bin, along with min and max
    df2 = df_orig['auppm'].loc[(df_orig['bound'] == cur_bound) & (df_orig['auppm'] >= cur_bin_low) & (df_orig['auppm'] < cur_bin_high)].describe()
    ### Add fields of interest to the output of describe for later merging together
    df2['bound'] = cur_bound
    df2['percentile'] = cur_percentile
    df2['bin_name'] = 'bin name'
    df2['bin_lower'] = cur_bin_low
    df2['bin_upper'] = cur_bin_high
    df2['temp_merger'] = str(int(df2['bound'])) + '_' + str(df2['percentile'])
    # Write results of describe to a CSV file and transpose columns to rows
    df2.to_csv('df2.csv')
    df2_t = pd.read_csv('df2.csv').T
    df2_t.columns = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max', 'bound', 'percentile', 'bin_name', 'bin_lower', 'bin_upper', 'temp_merger']
    # Merge the results of the describe on the selected data with the table of quantile values to produce a final output
    df_merged = MIK_Quantiles.merge(df2_t, how='inner', on=['temp_merger'])
    pd.merge(df_merged, df2_t)
    print(df_merged)
    i = i + 1
Your loop does not do anything meaningful, other than increment i.
You do a merge of 2 (static) dfs (MIK_Quantiles and df2_t), and you do that df_length number of times. Every time you do that (on the first, i-th, and last iteration of the loop), you overwrite the output variable df_merged.
To keep in the output whatever has been created in the previous loop iteration, you need to concat all the created df2_t:
Use df2 = pd.concat([df2, df2_t]) to 'append' the newly created data df2_t to an output dataframe df2 during each iteration of the loop, so that in the end all the data is contained in df2.
Then, after the loop, merge that one onto MIK_Quantiles with pd.merge(MIK_Quantiles, df2) (not df2_t!) to produce the final output:
df2 = pd.DataFrame([])  # initialize your output
for i in range(0, df_length):
    df2_t = ...  # read your .csv files
    df2 = pd.concat([df2, df2_t])
df2 = ...  # do vector operations on df2 (process all of the df2_t at once)
out = pd.merge(MIK_Quantiles, df2)
Quite new to python for data analysis, still a noob.
I have a list of pandas data frames (100+) whose variables are saved into a list.
I then have the variable names saved in another list as strings, to add into the DataFrames as an identifier when plotting.
I have defined a function to prepare the tables for later feature engineering.
I want to iterate through each data frame and add the corresponding strings into a column called "Strings"
df = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']
def mindex(df):
    # remove time index and insert Strings column
    df.reset_index(inplace=True)
    df.insert(1, "Strings", "")
    # iterate through each table adding the string values
    for item in enumerate(df):
        for item2 in strings:
            df['Strings'] = item2

# the loop to cycle through all the dataframes using the function above
for i in df:
    mindex(i)
Whenever I use the function above, it only fills the last value into all of the dataframes. I would like to note that all the dataframes are within the same date range; I have tried to use this as a way to stop the iteration, with no luck.
Can anyone point me in the right direction? Google has not been my friend so far.
df = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']
for s, d in zip(strings, df):
    d['Strings'] = s
In the line df['Strings'] = item2 you assign the variable item2 to the entire column df["Strings"].
So the first iteration assigns "df1", the second assigns "df2", and it ends with "df3", which is what you finally see.
If you want the Strings column populated entirely with "df1" for df1, "df2" for df2, etc., you have to:
def mindex(dfs: list, strings: list) -> list:
    final_dfs = []
    for single_df, df_name in zip(dfs, strings):
        single_df = single_df.copy()
        single_df.reset_index(inplace=True)
        single_df.insert(1, "Strings", "")
        single_df['Strings'] = df_name
        final_dfs.append(single_df)
    return final_dfs
dfs = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']
result = mindex(dfs, strings)
A few takeaways:
If you define a list of dfs, name it dfs (plural), not df.
dfs = [df1, df2, df3]
If you iterate through a pandas DataFrame, use df.iterrows(). It will generate indices and rows, so you don't need enumerate.
for idx, row in df.iterrows():
    ....
If a variable in a for loop is not going to be used, like item in your example, use an underscore instead. It is good practice for an unused variable:
for _ in enumerate(df):
    for item2 in strings:
        df['Strings'] = item2
I have multiple dataframes with the same columns but different values that look like this:
Product 1 Dataframe
Here's the code that generated them
import pandas as pd
d1 = {"Year":[2018,2019,2020],"Quantity": [10,20,30], "Price": [100,200,300]}
df_product1 = pd.DataFrame(data=d1)
d2 = {"Year":[2018,2019,2020],"Quantity": [20,20,50], "Price": [120,110,380]}
df_product2 = pd.DataFrame(data=d2)
d3 = {"Year":[2018,2019,2020],"Quantity": [40,20,70], "Price": [1000,140,380]}
df_product3 = pd.DataFrame(data=d3)
I merge two dataframes and identify suffixes like so
df_total = df_product1.merge(df_product2,on="Year", suffixes = ("_Product1","_Product2"))
And I get
First Merged Dataframe
However, when I merge another dataframe to the result above using:
df_total = df_total.merge(df_product3,on="Year", suffixes = ("_Product","_Product3"))
I get
Final Merged Dataframe
Where there is no suffix for the third product.
I would like the last two columns of the dataframe to be Quantity_Product3, Price_Product3 instead of just Quantity and Price.
Let me know if it is possible or if I need to approach the problem from a completely different angle.
Why you don't get the result you want
It's explained in the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
suffixes : list-like, default is ("_x", "_y")
A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.
Suffixes are added to overlapping column names.
See this example: suffixes are added to column b, because both dataframes have a column b, but not to columns a and c, as those are not shared between the two dataframes.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['a', 'b'], data=np.random.rand(10, 2))
df2 = pd.DataFrame(columns=['b', 'c'], data=np.random.rand(10, 2), index=np.arange(5, 15))
# equivalent to an inner join on the indices
out = pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
# out has columns: a, b_x, b_y, c
A crude solution
Why don't you just rename the columns manually? Not elegant but effective
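A sketch of that, assuming the column names shown in your example (adjust to your real ones):
# Hypothetical: rename product 3's unsuffixed columns after the merge
df_total = df_total.rename(columns={'Quantity': 'Quantity_Product3',
                                    'Price': 'Price_Product3'})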
A possible alternative
The table you are trying to build looks like a pivot. I would look into normalising all your dataframes, concatenating them, then running a pivot on the result.
Depending on your case, this may well be more convoluted and could well be overkill. I mention it because I want to bring your attention to the concepts of pivoting/unpivoting (stacking/unstacking/normalising) data.
The code below takes a df which looks similar to yours and normalises it. For simpler cases you can use pandas.melt(). I don't have the exact data of your example but this should be a good starting point.
import numpy as np
import pandas as pd

def random_dates(start, end, n, unit='D', seed=None):
    ndays = (end - start).days + 1
    return start + pd.to_timedelta(
        np.random.randint(0, ndays, n), unit=unit
    )

df = pd.DataFrame()
mysize = 20
df['key'] = np.arange(0, mysize)
df['A_value'] = np.random.randint(0, 10000, mysize)
df['A_date'] = random_dates(pd.to_datetime('2010-01-01'), pd.to_datetime('2019-01-01'), mysize)
df['B_value'] = np.random.randint(-5000, 5000, mysize)
df['B_date'] = random_dates(pd.to_datetime('2005-01-01'), pd.to_datetime('2015-01-01'), mysize)
df['C_value'] = np.random.randint(-10000, 10000, mysize)
df['C_date'] = random_dates(pd.to_datetime('2000-01-01'), pd.to_datetime('2019-01-01'), mysize)

df2 = df.set_index('key', drop=True, verify_integrity=True)
df2 = df2.stack().reset_index()
df2.columns = ['key', 'rawsource', 'rawvalue']
df2['source'] = df2['rawsource'].apply(lambda x: x[0:1])
df2['metric'] = df2['rawsource'].apply(lambda x: x[2:])
df2 = df2.drop(['rawsource'], axis=1)

df_piv = df2.pivot_table(index=['key', 'source'], columns='metric',
                         values='rawvalue', aggfunc='first').reset_index().rename_axis(None, axis=1)
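For reference, here is the pandas.melt() route mentioned above; this sketch should yield the same normalised frame for this df (rows in a different order):
df2 = df.melt(id_vars='key', var_name='rawsource', value_name='rawvalue')
df2['source'] = df2['rawsource'].str[0]   # 'A', 'B' or 'C'
df2['metric'] = df2['rawsource'].str[2:]  # 'value' or 'date'
df2 = df2.drop(columns='rawsource')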
When I try to append two or more dataframes and output the result to a CSV, the output comes out in a "waterfall" format.
dataset = pd.read_csv('testdata.csv')
for i in segment_dist:
    for j in step:
        print_msg = str(i) + ":" + str(j)
        print("\n", i, ":", j, "\n")
        temp = pd.DataFrame(estimateRsq(dataset, j, i), columns=[print_msg])
        csv = csv.append(temp)
csv.to_csv('output.csv', encoding='utf-8', index=False)
estimateRsq() returns an array. I think this code snippet should be enough to help me out.
The format I am getting in output.csv is:
Please help: how can I shift the contents up from index 1?
From df.append documentation:
Append rows of other to the end of this frame, returning a new
object. Columns not in this frame are added as new columns.
If you want to add columns to the right, use pd.concat with axis=1 (meaning horizontally):
list_of_dfs = [first_df, second_df, ...]
pd.concat(list_of_dfs, axis=1)
If the row indexes in the dataframes don't match, pd.concat will align on them and pad with NaN; you may want to reset them first with df.reset_index(drop=True) so the columns land side by side. (With axis=1, ignore_index=True would renumber the columns, not fix the row alignment.)
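A small illustration of that row-index pitfall (column names invented for the example):
import pandas as pd

a = pd.DataFrame({'r2_seg1': [0.9, 0.8]})               # index 0, 1
b = pd.DataFrame({'r2_seg2': [0.7, 0.6]}, index=[2, 3])
print(pd.concat([a, b], axis=1))     # staggered: rows don't align, NaNs appear
print(pd.concat([a.reset_index(drop=True),
                 b.reset_index(drop=True)], axis=1))    # columns side by side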
Build a list of dataframes, then concatenate
pd.DataFrame.append is expensive relative to list.append + a single call of pd.concat.
Therefore, you should aggregate to a list of dataframes and then use pd.concat on this list:
lst = []
for i in segment_dist:
    # do something
    temp = pd.DataFrame(...)
    lst.append(temp)
df = pd.concat(lst, ignore_index=True, axis=0)
df.to_csv(...)