Creating dataframes from others - python

I need to filter multiple data frames and create new data frames from them.
The data frames are accessed as df[str(i)], i.e. df["0"], df["1"], and so on.
After filtering the rows, I need to create new dataframes. I am trying the following:
n = 5
for i in range(0, n):
    filtered = df[str(i)]
but at the end only the last dataframe created remains, i.e. the one for i = n - 1.
I have also tried filtered[str(i)], but it gives me an error.
What I would like to have is:
filtered["0"] for df["0"]
filtered["1"] for df["1"]
...
I would appreciate your help to figure it out. Thanks

You could append your filtered dataframes to a list, then concatenate them into a single new dataframe:
import pandas as pd

n = 5
dfs = []
for i in range(n):
    filtered = df[str(i)]  # apply your filtering condition here
    dfs.append(filtered)
df_filtered = pd.concat(dfs)
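If you want to keep the filtered frames addressable by the same keys (filtered["0"] for df["0"], as described in the question), a dict comprehension may be a better fit than one concatenated frame. A minimal sketch, assuming df is a dict of dataframes and using a placeholder condition (value > 0) as the filter:

```python
import pandas as pd

# sample dict of dataframes standing in for df["0"], df["1"], ...
df = {str(i): pd.DataFrame({"value": [i - 1, i, i + 1]}) for i in range(3)}

# filter each frame; the condition (value > 0) is just a placeholder
filtered = {key: frame[frame["value"] > 0] for key, frame in df.items()}

# filtered["0"] now holds the filtered version of df["0"], and so on
```

This keeps a one-to-one mapping between the originals and the filtered versions instead of collapsing them into one frame.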

Related

Adding whole lines of a dataframe via a for loop

I had code as follows to collect interesting rows into a new dataframe:
df = df1.iloc[[66,113,231,51,152,122,185,179,114,169,97][:]]
but I want to use a for loop to collect the data. I have read that I need to combine the data as a list and then create the dataframe, but all the examples I have seen are for numbers, and I can't do the same for whole rows of a dataframe. At the moment I have the following:
data = ['A','B','C','D','E']
for n in range(10):
    data.append(dict(zip(df1.iloc[n, 4])))
df = pd.Dataframe(data)
(P.S. I have 4 in the code because I want the data to be selected via column E and the dataframe is already sorted so I am just looking for the first 10 rows)
Thanks in advance for your help.
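The list-then-DataFrame idea from the question works with whole rows as well: collect each row (a Series) in a list and build the frame once at the end. A sketch, assuming df1 is already sorted by the column of interest:

```python
import pandas as pd

# small sample frame standing in for the already-sorted df1
df1 = pd.DataFrame({"A": range(20), "B": range(20, 40)})

rows = []
for n in range(10):
    rows.append(df1.iloc[n])  # each row is a Series; keep it whole

# building from a list of Series preserves the column names automatically
df = pd.DataFrame(rows)
```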

Pandas adding multiple null data frames

I want to create about 10 data frames with the same number of rows and columns, which I want to specify.
Currently I am creating a df with the specified rows and then using pd.concat to add columns to the data frame. I am having to write 10 lines of code separately, one for each data frame. Is there a way to do it in one go for all the data frames? Say, all the data frames have 15 rows and 50 columns.
Also, I don't want to use a loop. All values in the data frames are NaN, and I want to perform a different function on each data frame, so editing one data frame shouldn't change the values of the other data frames.
You can simply create a NumPy array of np.nan and then create a dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.full((15, 50), np.nan))
For creating 10 dataframes, you can just run this in a loop and add it to an array.
dfs = []
for i in range(10):
    dfs.append(pd.DataFrame(np.full((15, 50), np.nan)))
Then you can index into dfs and change any value accordingly. It won't impact any other dataframe.
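A minimal check of that independence claim: each frame owns its own NaN array, so mutating one leaves the others untouched.

```python
import numpy as np
import pandas as pd

# build 10 independent 15x50 NaN frames
dfs = [pd.DataFrame(np.full((15, 50), np.nan)) for _ in range(10)]

# editing one frame leaves the others untouched
dfs[0].iloc[0, 0] = 1.0
```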
You could do something like this:
index_list = range(10)
column_list = ['a','b','c','d']
for i in range(5):
    locals()["df_" + str(i)] = pd.DataFrame(index=index_list, columns=column_list)
This will create 5 different dataframes (df_0 to df_4), each with 10 rows and 4 columns named a, b, c, d, with all values NaN.
import pandas as pd

row_num = 15
col_num = 50
temp = []
for col_name in range(0, col_num):
    temp.append(col_name)

# Creation of the DataFrame
df = pd.DataFrame(index=range(0, row_num), columns=temp)
This code creates a single pandas data frame with the specified row and column numbers, but without a loop or some other form of iteration, multiple copies of the same code must be written.
Note: this is a pure pandas implementation.

How to subtract all the date columns from each other (in permutation) and store them in a new pandas DataFrame?

I was working on Jupyter and arrived at a situation where I had to take differences of each column from every other column taken in permutation and then store them in a separate DataFrame. I tried using nested loops but got stuck while assigning the values to the DataFrame.
n = 0
for i in range(len(list(df.columns)) - 1):
    for j in range(i + 1, len(list(df.columns)) - 1):
        df1[n] = pd.DataFrame(abs((df.iloc[:, i] - df.iloc[:, j]).dt.days))
        n = n + 1
df1
Also, I would like to have column headers in this format: D1-D2, D1-D3, etc. The difference in dates has to be a positive integer. I would really appreciate if anyone could help me with this code. Thanks!
import itertools
import pandas as pd
# create a sample dataframe
df = pd.DataFrame(data={"co1":[1,2,3,4], "co22":[4,3,2,1], "co3":[2,3,2,4]})
# iterate over all permutations of size 2 and write to dictionary
newcols = {}
for col1, col2 in itertools.permutations(df.columns, 2):
    newcols["-".join([col1, col2])] = df[col1] - df[col2]
# create dataframe from dict
newdf = pd.DataFrame(newcols)
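Since the question asks for positive differences of date columns, itertools.combinations (which yields each unordered pair once) combined with abs and .dt.days may be closer to the goal. A sketch on hypothetical date columns D1..D3:

```python
import itertools
import pandas as pd

# hypothetical date columns
df = pd.DataFrame({
    "D1": pd.to_datetime(["2021-01-01", "2021-02-01"]),
    "D2": pd.to_datetime(["2021-01-10", "2021-01-20"]),
    "D3": pd.to_datetime(["2021-03-01", "2021-02-15"]),
})

newcols = {}
for col1, col2 in itertools.combinations(df.columns, 2):
    # absolute difference in whole days, so values are positive integers
    newcols["-".join([col1, col2])] = (df[col1] - df[col2]).abs().dt.days

newdf = pd.DataFrame(newcols)
```

This also produces the D1-D2, D1-D3, ... header format mentioned in the question, without the duplicate D2-D1-style columns that permutations would generate.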

How to add values from one dataframe into another ignoring the row indices

I have a pandas dataframe called trg_data to collect data that I am producing in batches. Each batch is produced by a sub-routine as a smaller dataframe df with the same number of columns but less rows and I want to insert the values from df into trg_data at a new row position each time.
However, when I use the following statement, df is always inserted at the top (i.e. rows 0 to len(df)).
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df
I'm guessing but I think the reason may be that even though the slice indicates the desired rows, it is using the index in df to decide where to put the data.
As a test I found that I can insert an ndarray at the right position no problem:
trg_data.iloc[trg_pt:(trg_pt + len(df))] = np.ones(df.shape)
How do I get it to ignore the index in df and insert the data where I want it? Or is there an entirely different way of achieving this? At the end of the day I just want to create the dataframe trg_data and then save to file at the end. I went down this route because there didn't seem to be a way of easily appending to an existing dataframe.
I've been working at this for over an hour and I can't figure out what to google to find the right answer!
I think I may have the answer (I thought I had already tried this but apparently not):
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df.values
Still, I'm open to other suggestions. There's probably a better way to add data to a dataframe.
The way I would do this is to save all the intermediate dataframes in a list, then concatenate them together:
import pandas as pd

dfs = []
# get all the intermediate dataframes somehow
# combine into one dataframe
trg_data = pd.concat(dfs)
Both
trg_data = pd.concat([df1, df2, ... dfn], ignore_index=True)
and
trg_data = pd.DataFrame()
for ...:  # loop that generates df
    trg_data = trg_data.append(df, ignore_index=True)  # you can reuse the name df
should work for you.
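Note that DataFrame.append was removed in pandas 2.0, so on current versions the loop variant would instead collect the batches and call pd.concat once at the end. A sketch, with made-up two-row batches standing in for the sub-routine's output:

```python
import pandas as pd

# sample batches standing in for the dataframes produced in the loop
batches = [pd.DataFrame({"x": [i, i + 1]}) for i in range(3)]

dfs = []
for df in batches:  # in practice, each df comes from the batch sub-routine
    dfs.append(df)

# ignore_index=True renumbers the rows 0..N-1 instead of keeping batch indices
trg_data = pd.concat(dfs, ignore_index=True)
```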

How do I append an existing column to another column, aligning with the indices?

I have three dataframes that each have different columns, but they all have the same indices and the same number of rows (exact same index). How do I combine them into a single dataframe, keeping each column separate but joining on the indices?
Currently, when I attempt to append them together, I get NaNs and the same indices are duplicated. I created an empty dataframe so that I can put all three dataframes into it by appending. Maybe this is wrong?
What I am doing is as follows:
df = pd.DataFrame()
frames = a list of the three dataframes
for x in frames:
    df = df.append(x)
DataFrames have a join method which does exactly this. You'll just have to modify your code a bit so that you're calling the method from the real dataframes rather than the empty one.
df = pd.DataFrame()
frames = a list of the three dataframes
for x in frames:
    df = x.join(df)
More in the docs.
I was able to come up with a solution by grouping by the index:
df = df.groupby(df1.index)
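When the frames share the exact same index, another one-call option is pd.concat along the columns axis, which aligns on the index and keeps each column separate. A minimal sketch with made-up frames:

```python
import pandas as pd

# three frames with different columns but the same index
a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [3, 4]}, index=["r1", "r2"])
c = pd.DataFrame({"z": [5, 6]}, index=["r1", "r2"])

# axis=1 places the frames side by side, aligned on the shared index
df = pd.concat([a, b, c], axis=1)
```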
