I had code as follows to collect interesting rows into a new dataframe:
df = df1.iloc[[66,113,231,51,152,122,185,179,114,169,97][:]]
but I want to use a for loop to collect the data instead. I have read that I should collect the data in a list and then create the DataFrame from it, but all the examples I have seen use plain numbers, and I can't work out how to do the same for each row of a dataframe. At the moment I have the following:
data = ['A','B','C','D','E']
for n in range(10):
    data.append(dict(zip(df1.iloc[n, 4])))
df = pd.Dataframe(data)
(P.S. I have 4 in the code because I want the data to be selected via column E; the dataframe is already sorted, so I am just looking for the first 10 rows.)
Thanks in advance for your help.
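A loop-based version of what the question describes might look like the sketch below; the sample frame is made up, and `to_dict()` per row is one way to build the list (assuming `df1` is already sorted, as stated):

```python
import pandas as pd

# made-up stand-in for the already-sorted df1
df1 = pd.DataFrame({"A": range(20), "E": range(20, 0, -1)})

# collect each row as a dict, then build the new dataframe once at the end
rows = []
for n in range(10):
    rows.append(df1.iloc[n].to_dict())
df = pd.DataFrame(rows)
```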
Related
my sort_drop2 dataframe is shown in the picture below
https://imgur.com/a/mdZZa7n
new_dataframe = sort_drop2.filter(['City','Est','Nti'])
With sort_drop2.filter I am trying to copy specific details from the old dataset into a new dataframe.
I want to only take the top 5 values from the sort_drop2 dataframe
I have sorted sort_drop2 by Nti from largest to smallest with sort_values(by='Nti', ascending=False).
How do I copy only the top 5 values from the old dataframe to new?
You can get the top n rows of dataframe df with df.head(n). So in your case, take your sorted and filtered dataframe and do:
new_dataframe.head(5)
The default for n is 5, so you could also leave the parameter blank.
That will return the dataframe. If you want to save the result as a new dataframe, you would do:
df_top_5 = new_dataframe.head(5)
Use .head(5) on the old DataFrame sort_drop2 and assign the result to your new DataFrame like this:
new_dataframe = sort_drop2.filter(['City','Est','Nti']).sort_values(by='Nti', ascending=False).head(5)
Here's my answer expanded over multiple lines, which is closer to the code you described, so it may be easier to compare with your existing code:
new_dataframe = sort_drop2.filter(['City','Est','Nti'])
new_dataframe = new_dataframe.sort_values(by='Nti', ascending=False)
new_dataframe = new_dataframe.head(5)
I want to create about 10 data frames, each with the same number of rows and columns, which I want to specify.
Currently I am creating a df with the specified rows and then using pd.concat to add columns to it. I have to write 10 lines of code separately, one for each data frame. Is there a way to do it in one go for all the data frames? Say all the data frames have 15 rows and 50 columns.
Also, I don't want to use a loop. All values in the data frames are NaN, and I want to apply a different function to each data frame, so editing one data frame shouldn't change the values of the others.
You can simply create a numpy array of np.nan and then create a dataframe from it:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.full([15, 50], np.nan))
For creating 10 dataframes, you can just run this in a loop and append each one to a list.
dfs = []
for i in range(10):
    dfs.append(pd.DataFrame(np.full([15, 50], np.nan)))
Then you can index into dfs and change any value accordingly. It won't impact any other dataframe.
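As a quick check of the independence claim (same construction as above, using np.full as an equivalent way to get a NaN array):

```python
import numpy as np
import pandas as pd

dfs = [pd.DataFrame(np.full([15, 50], np.nan)) for _ in range(10)]

# editing one frame leaves the others untouched
dfs[0].iloc[0, 0] = 1.0
print(np.isnan(dfs[1].iloc[0, 0]))  # True
```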
You could do something like this:
index_list = range(10)
column_list = ['a','b','c','d']

for i in range(5):
    locals()["df_" + str(i)] = pd.DataFrame(index=index_list, columns=column_list)
This will create 5 different dataframes (df_0 to df_4), each with 10 rows and 4 columns named a, b, c, d, with all values NaN.
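Writing into locals() is fragile and the resulting variables are hard to iterate over later; a dictionary keyed by name is a common alternative. A sketch with the same sizes:

```python
import pandas as pd

index_list = range(10)
column_list = ['a', 'b', 'c', 'd']

# one NaN-filled frame per key; editing one does not affect the others
frames = {"df_" + str(i): pd.DataFrame(index=index_list, columns=column_list)
          for i in range(5)}
```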
import pandas as pd
row_num = 15
col_num = 50
temp = []
for col_name in range(0, col_num):
    temp.append(col_name)

# Creation of the DataFrame
df = pd.DataFrame(index=range(0, row_num), columns=temp)
This code creates a single data frame in pandas with the specified row and column numbers. But without a loop or some other form of iteration, multiple lines of the same code must be written.
Note: this is a pure pandas implementation. A GitHub gist can be found here.
I was working on Jupyter and arrived at a situation where I had to take differences of each column from every other column taken in permutation and then store them in a separate DataFrame. I tried using nested loops but got stuck while assigning the values to the DataFrame.
n = 0
for i in range(len(list(df.columns))-1):
    for j in range(i+1, len(list(df.columns))-1):
        df1[n] = pd.DataFrame(abs((df.iloc[:,i] - df.iloc[:,j]).dt.days))
        n = n + 1
df1
Also, I would like the column headers in this format: D1-D2, D1-D3, etc. The difference in dates has to be a positive integer. I would really appreciate it if anyone could help me with this code. Thanks!
A snippet of the DataFrame
import itertools
import pandas as pd
# create a sample dataframe
df = pd.DataFrame(data={"co1":[1,2,3,4], "co22":[4,3,2,1], "co3":[2,3,2,4]})
# iterate over all permutations of size 2 and write to dictionary
newcols = {}
for col1, col2 in itertools.permutations(df.columns, 2):
    newcols["-".join([col1, col2])] = df[col1] - df[col2]
# create dataframe from dict
newdf = pd.DataFrame(newcols)
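The question works with date columns and wants positive differences with headers like D1-D2; the same pattern adapts with itertools.combinations (each pair once) plus .dt.days and abs. The sample dates below are made up:

```python
import itertools
import pandas as pd

# made-up date columns standing in for the real data
df = pd.DataFrame({
    "D1": pd.to_datetime(["2021-01-10", "2021-02-01"]),
    "D2": pd.to_datetime(["2021-01-01", "2021-01-15"]),
    "D3": pd.to_datetime(["2021-01-05", "2021-03-01"]),
})

# combinations yields each column pair once; abs keeps the day counts positive
newcols = {}
for col1, col2 in itertools.combinations(df.columns, 2):
    newcols["-".join([col1, col2])] = (df[col1] - df[col2]).dt.days.abs()
df1 = pd.DataFrame(newcols)
```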
I need to filter multiple data frames and create new data frames based on them.
The multiple data frames are called as df[str(i)], i.e. df["0"], df["1"], and so on.
After filtering the rows, I need to create new dataframes. I am trying the following:
n = 5
for i in range(0, n):
    filtered = df[str(i)]
but at the end it returns only the dataframe from the last iteration.
I have also tried filtered[str(i)], but it gives me an error.
What I would like to have is:
filtered["0"] for df["0"]
filtered["1"] for df["1"]
...
I would appreciate your help to figure it out. Thanks
You could append your filtered dataframes to a list, then concatenate into a new dataframe.
import pandas as pd
n=5
dfs = []
for i in range(n):
    filtered = df[str(i)]  # apply your row filter here
    dfs.append(filtered)
df_filtered = pd.concat(dfs)
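If the goal is to keep one filtered frame per key (filtered["0"] matching df["0"], and so on) rather than one concatenated result, a dict works; the filter condition and sample frames here are made up:

```python
import pandas as pd

# made-up stand-in for df["0"], df["1"], ...
df = {str(i): pd.DataFrame({"x": [i, -i, i + 1]}) for i in range(5)}

# hypothetical condition: keep rows where x > 0, keyed like the originals
filtered = {key: frame[frame["x"] > 0] for key, frame in df.items()}
```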
I have a pandas dataframe called trg_data to collect data that I am producing in batches. Each batch is produced by a sub-routine as a smaller dataframe df with the same number of columns but fewer rows, and I want to insert the values from df into trg_data at a new row position each time.
However, when I use the following statement, df is always inserted at the top (i.e. rows 0 to len(df)).
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df
I'm guessing, but I think the reason may be that even though the slice indicates the desired rows, pandas uses the index of df to decide where to put the data.
As a test I found that I can insert an ndarray at the right position no problem:
trg_data.iloc[trg_pt:(trg_pt + len(df))] = np.ones(df.shape)
How do I get it to ignore the index in df and insert the data where I want it? Or is there an entirely different way of achieving this? At the end of the day I just want to create the dataframe trg_data and then save to file at the end. I went down this route because there didn't seem to be a way of easily appending to an existing dataframe.
I've been working at this for over an hour and I can't figure out what to google to find the right answer!
I think I may have the answer (I thought I had already tried this but apparently not):
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df.values
Still, I'm open to other suggestions. There's probably a better way to add data to a dataframe.
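That matches the index-alignment guess above: assigning a raw DataFrame aligns its index labels against the target labels instead of filling positionally, which is why .values (stripping the index) fixes it. A small made-up reproduction:

```python
import numpy as np
import pandas as pd

trg_data = pd.DataFrame(np.nan, index=range(6), columns=["a"])
df = pd.DataFrame({"a": [10.0, 20.0]})  # its index is 0, 1 -- not 2, 3

# .values strips the index, so the data lands exactly where the slice says
trg_pt = 2
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df.values
print(trg_data["a"].tolist())
```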
The way I would do this is to save all the intermediate dataframes in a list and then concatenate them together:
import pandas as pd
dfs = []
# get all the intermediate dataframes somehow
# combine into one dataframe
trg_data = pd.concat(dfs)
Both
trg_data = pd.concat([df1, df2, ... dfn], ignore_index=True)
and
trg_data = pd.DataFrame()
for ...:  # loop that generates df
    trg_data = trg_data.append(df, ignore_index=True)  # you can reuse the name df
should work for you. (Note that DataFrame.append was removed in pandas 2.0; on newer versions, collect the frames in a list and call pd.concat once instead.)
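On pandas 2.0+, where DataFrame.append no longer exists, the loop version can collect the batches in a list and call pd.concat once at the end; the batch frames below are made up:

```python
import pandas as pd

# made-up stand-in for the batches a sub-routine would generate
batches = [pd.DataFrame({"a": [i, i + 1]}) for i in range(3)]

parts = []
for df in batches:  # loop that generates df
    parts.append(df)
trg_data = pd.concat(parts, ignore_index=True)
```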