Insert a dataframe between dataframes via the nested append method - python

The following code works fine to stack one dataframe underneath another using nested calls to the append method.
for sheet_name, df in Input_Data.items():
    df1 = df[126:236]
    df = df1.sort_index(ascending=False)
    Indexer = df.columns.tolist()
    df = [pd.concat([df[Indexer[0]], df[Indexer[num]]], axis=1) for num in [1, 2, 3, 4, 5, 6]]
    df = [df[num].astype(str).agg(','.join, axis=1) for num in [0, 1, 2, 3, 4, 5]]
    df = pd.DataFrame(df)
    df = df.loc[0].append(df.loc[1].append(df.loc[2].append(df.loc[3].append(df.loc[4].append(df.loc[5])))))
However, I need to add additional dataframes (one row, one column) in between the df.loc[i] pieces, and as a first step I tried to insert a dataframe at the top of df.loc[0] via
df=df_1st.append(df,ignore_index=True)
which yields the following error: "cannot reindex from a duplicate axis".
It seems my dataframe df has duplicate index values. I'm not sure how to proceed. Perhaps the nested method is not the best approach? (A possible alternative is sketched below.)
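One alternative, sketched here as an assumption about the data rather than a tested fix, is to drop the nested append calls and build the whole stack with a single pd.concat:
import pandas as pd

# Minimal sketch, assuming df is the 6-row frame built above.
# ignore_index=True gives the result a fresh 0..n-1 index, which is
# exactly what "cannot reindex from a duplicate axis" complains about.
stacked = pd.concat([df.loc[num] for num in range(6)], ignore_index=True)
Extra one-row frames such as df_1st can then be interleaved directly into the concat list instead of being appended afterwards.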

Related

Pandas: Sort Dataframe if Column Value Exists in another Dataframe

I have a dataframe with two columns of unique numbers. This is my reference dataframe (df_reference). From another dataframe (df_data) I want to get the rows whose column values exist in this reference dataframe. I tried stuff like:
df_new = df_data[df_data['ID'].isin(df_reference)]
However, like this I can't get any results. What am I doing wrong here?
From what I see, you are passing the whole dataframe to the .isin() method.
Try:
df_new = df_data[df_data['ID'].isin(df_reference['ID'])]
Alternatively, convert the ID column to the index of the df_data dataframe. Then you could do
df_data = df_data.set_index('ID')
matching_index = df_reference['ID']
df_new = df_data.loc[matching_index, :]
This should solve the issue.
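Note that on pandas 1.0 and later, .loc raises a KeyError if any ID from df_reference is missing from df_data's index, so the .isin() approach above is the safer choice when the reference may contain extra IDs.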

How can I create a new pandas DataFrame out of an existing one applying a function to every column without a for loop?

A simplified script I have now working is as follows:
columns = df.columns.tolist()
df1 = pd.DataFrame()
for i in columns:
    df1[i] = [random.uniform(-1*(df[i].std()*3), (df[i].std()*3)) + df[i].mean()]
How can I get the same result (a one row dataframe) with a simpler, more efficient code?
Try with apply:
df1 = df.apply(lambda x: random.uniform(-3*x.std(),3*x.std())+x.mean())
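Note that apply here returns a Series rather than the one-row dataframe the original loop produced. If the one-row shape matters, a small sketch of the conversion:
# apply collapses each column to a scalar, yielding a Series;
# to_frame().T turns that Series back into a one-row dataframe.
df1 = df.apply(lambda x: random.uniform(-3*x.std(), 3*x.std()) + x.mean()).to_frame().T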

Add a multiindex level Dataframe to another dataframe

I have the following sample dataframe:
df_temp = pd.DataFrame(np.arange(6).reshape(3,-1),
                       index=(0,1,2),
                       columns=pd.MultiIndex.from_tuples([('A','Salad'), ('B','Burger')]))
I would like to put the column ('A', 'Salad') into another dataframe (df_output), which might be empty or might already contain this column.
This is what I do in both cases (df_output empty, or the column already present):
df_output = pd.concat([df_output, df_temp], axis=1)
If the column already exists, this just replaces it. However, when df_output is empty, it collapses the MultiIndex columns to a single level, which is something I don't want.
I am trying to use concat but the multiindex level of the columns is disappearing.
I managed to fix it with the following solution, although I believe it can be done better:
if len(df_output.columns):
    df_output = pd.concat([df_output, df_temp], axis=1).sort_index(level=0, axis=1)
else:
    df_output = df_temp
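A hedged alternative that avoids the branch, assuming the only problem is the levels collapsing when df_output starts out empty, is to give the empty frame MultiIndex columns up front:
import pandas as pd

# An empty frame whose columns are already a two-level MultiIndex;
# concat then keeps the levels of df_temp's columns intact.
df_output = pd.DataFrame(columns=pd.MultiIndex.from_arrays([[], []]))
df_output = pd.concat([df_output, df_temp[[('A', 'Salad')]]], axis=1)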

How to add values from one dataframe into another ignoring the row indices

I have a pandas dataframe called trg_data to collect data that I am producing in batches. Each batch is produced by a sub-routine as a smaller dataframe df with the same number of columns but fewer rows, and I want to insert the values from df into trg_data at a new row position each time.
However, when I use the following statement, df is always inserted at the top (i.e. rows 0 to len(df)).
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df
I'm guessing but I think the reason may be that even though the slice indicates the desired rows, it is using the index in df to decide where to put the data.
As a test I found that I can insert an ndarray at the right position no problem:
trg_data.iloc[trg_pt:(trg_pt + len(df))] = np.ones(df.shape)
How do I get it to ignore the index in df and insert the data where I want it? Or is there an entirely different way of achieving this? At the end of the day I just want to build the dataframe trg_data and then save it to file. I went down this route because there didn't seem to be a way of easily appending to an existing dataframe.
I've been working at this for over an hour and I can't figure out what to google to find the right answer!
I think I may have the answer (I thought I had already tried this but apparently not):
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df.values
Still, I'm open to other suggestions. There's probably a better way to add data to a dataframe.
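As a side note, df.to_numpy() is the spelling the pandas docs now recommend over df.values, and it bypasses index alignment in the same way.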
The way I would do this is to save all the intermediate dataframes in a list and then concatenate them together:
import pandas as pd
dfs = []
# get all the intermediate dataframes somehow
# combine into one dataframe
trg_data = pd.concat(dfs)
Both
trg_data = pd.concat([df1, df2, ... dfn], ignore_index=True)
and
trg_data = pd.DataFrame()
for ...:  # loop that generates df
    trg_data = trg_data.append(df, ignore_index=True)  # you can reuse the name df
should work for you.
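Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions only the concat form still works.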

Put maximum of each column from a list of dataframe into new dataframe

I recently started working with pandas dataframes.
I have a list of dataframes called 'arr'.
Edit: All the dataframes in 'arr' have the same columns but different data.
Also, I have an empty dataframe 'ndf' which I need to fill in using the above list.
How do I iterate through 'arr' to fill the max values of each column of a dataframe in 'arr' into a row in 'ndf'?
So, we'll have
Number of rows in ndf = Number of elements in arr
I'm looking for something like this:
columns = ['time', 'Open', 'High', 'Low', 'Close']
ndf = pd.DataFrame(columns=columns)
ndf['High'] = arr[i].max(axis=0)
Based on your description, I assume a basic example of your data looks something like this:
import pandas as pd
data = [{'time':'2013-09-01','open':249,'high':254,'low':249,'close':250},
        {'time':'2013-09-02','open':249,'high':256,'low':248,'close':250}]
data2 = [{'time':'2013-09-01','open':251,'high':253,'low':248,'close':250},
         {'time':'2013-09-02','open':245,'high':251,'low':243,'close':247}]
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
arr = [df, df2]
If that's the case, then you can simply iterate over the list of dataframes (via enumerate()) and over the columns of each dataframe (via iteritems(), see http://pandas.pydata.org/pandas-docs/stable/basics.html#iteritems), populating each new row with a dictionary comprehension (see Create a dictionary with list comprehension in Python):
ndf = pd.DataFrame(columns=df.columns)
for i, df in enumerate(arr):
    ndf = ndf.append(pd.DataFrame(data={colName: max(colData) for colName, colData in df.iteritems()}, index=[i]))
If some of your dataframes have any additional columns, the resulting dataframe ndf will have NaN entries in the relevant places.
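On pandas 2.0 and later, where both append and iteritems have been removed, an equivalent sketch, assuming arr holds the same frames as above, would be:
# Each df.max() is a Series of per-column maxima; building a frame
# from the list of Series yields one row per dataframe in arr.
ndf = pd.DataFrame([df.max() for df in arr])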
