I think I might be overcomplicating this, but essentially what I am trying to do is take the dataframe below, group by the unique values in the "MATNR_BATCH" column, and create another dataframe with the columns "STORAGE_BIN", "FULL_IND", "PRCNT_UTIL", "MAX_NO_SU_IN_SB", and "NO_SU_IN_SB":
From something like this:
To something like this:
From here, what I would like to do is filter to only the "groups" (MATNR_BATCH values) that have a mix of "FULL" and "NF" values in the "FULL_IND" column. So basically, I would like to create a dataframe that only contains the unique MATNR_BATCH groups that have a combination of both "FULL" and "NF" in them.
Can anyone please help me out with this? I have been struggling to come up with a way to do this in Python. Is groupby the right function to use, or should I take a different approach?
As a first pass, do:
df1 = df[(df.FULL_IND == 'FULL') | (df.FULL_IND == 'NF')]
And then carry on. I can't quite figure out what you want to do with the other columns.
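To then keep only the groups that contain both values, one option is groupby combined with filter. A minimal sketch, assuming the column names from the question:
# Keep only the MATNR_BATCH groups whose FULL_IND values include both 'FULL' and 'NF'
mixed = df1.groupby('MATNR_BATCH').filter(
    lambda g: {'FULL', 'NF'} <= set(g['FULL_IND'])
)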
Below is the code where 5 dataframes are generated, and I want to combine them all into one. However, since they have different column headers, I think appending them to the list is not retaining the header names; instead it is producing numbers.
Is there another way to combine the dataframes while keeping the header names as they are?
Thanks in advance!!
frames = []  # collect the frames in a list (avoid shadowing the built-in name 'list')
i = 0
while i < 5:
    df = pytrend.interest_over_time()
    frames.append(df)
    i = i + 1
df_concat = pd.concat(frames, axis=1)
Do you have a common column in the dataframes that you can merge on? In that case, use the DataFrame merge function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
I had to do this recently with two dataframes, and I merged on the date column.
Are you trying to add additional columns, or stack the dataframes on top of each other?
https://www.datacamp.com/community/tutorials/joining-dataframes-pandas
This link will give you an overview of the different functions you might need to use.
You can also rename the columns if they contain the same sort of data. Without an example of the dataframes, it's tricky to know.
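If the frames do share a key column, a sketch of merging them all on it (the column name 'date' is just an assumption here; use reset_index() first if the date lives in the index, as it does for pytrends results):
from functools import reduce

# Fold the list of frames into one by repeatedly merging on the shared key column
merged = reduce(lambda left, right: left.merge(right, on='date'), frames)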
I have dataframes that have the same column names, as follows:
df1 = pd.DataFrame({'Group1': ['a', 'b', 'c', 'd', 'e'], 'Group2': ['f', 'g', 'h', 'i', 'j'], 'Group3': ['k', 'L', 'm', 'n', '0']})
df2 = pd.DataFrame({'Group1': [0, 0, 2, 1, 0], 'Group2': [1, 2, 0, 0, 0], 'Group3': [0, 0, 0, 1, 1]})
For some reason, I want to concatenate these dataframes as follows:
dfnew = pd.concat([df1[['Group1', 'Group2']], df2[['Group1', 'Group2']]], axis=1)
I want to rename the columns of this new dataframe, so I tried the following:
dfnew.columns = {"1", "2", "3", "4"}
I expected the order of the columns to be 1, 2, 3, 4, but the actual result was 4, 3, 1, 2 instead.
I do not know why this happens.
If someone could advise me, I would appreciate it very much.
In addition, I will need to concatenate many dataframes for future work (i.e. concatenate df1, df2, df3, ..., df1000).
Is there a good way to rename the columns as "1", "2", "3", ..., "1000"? Typing all these numbers by hand would be a lot of work.
Thank you.
To rename columns, you can use this syntax:
dfnew.columns = ["1", "2", "3", "4"]
In the future, if you want to rename 1000 columns as you asked, you can do something like this:
dfnew.columns = [str(i) for i in range(1, 1001)]
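A variant that adapts to however many columns the concatenated frame actually ends up with, so the count isn't hardcoded (a sketch):
# Number the columns 1..N based on the frame's actual width
dfnew.columns = [str(i) for i in range(1, dfnew.shape[1] + 1)]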
Use square brackets (a list) rather than curly braces to ensure that the column order is preserved; curly braces create a Python set, which has no defined order, so your labels were applied in arbitrary order:
dfnew.columns = ["1", "2", "3", "4"]
Not being an expert (yet) in code efficiency or best Pythonic code writing, I would like to ask the experts here whether the following is the best way to join dataframes that share a common Date index, or whether merge or concat would be better:
data = df1.join(df2).join(df3).join(df4).join(df5).dropna()
I used the .dropna() call at the end to drop any row in which a NaN occurs.
NB: the reason NaNs occur in this dataset is that I have created dataframes that are in fact shifted versions of other dataframes (using .shift(n)), which means that NaNs creep in at the head of the shifted dataframes.
I intend to use this code in many other applications, so I want to use the best possible methodology (i.e. not make unnecessary use of memory, not take too much time to process, and use the correct join/merge/concat constructs).
It should be more efficient to do:
data = df1.join([df2, df3, df4, df5], how='inner')
This will merge all the dataframes in one go. It will also exclude any row that does not have values across all dataframes (so no need for dropna()). The default for how is 'left', which produces a row for every row in the calling dataframe, filling in any missing values with NaN. However, if any of the dataframes had NaN values in them before the join then you will still need to use dropna().
You can also use on=... to join on a column of the calling dataframe instead of its index if you prefer; note, though, that when passing a list of dataframes, the others are always joined on their indexes.
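If the shared key lives in a column rather than the index, one sketch is to promote it to the index first (the column name 'Date' is assumed for illustration):
# Move the shared key into the index so all frames can be joined in one go
frames = [d.set_index('Date') for d in (df1, df2, df3, df4, df5)]
data = frames[0].join(frames[1:], how='inner')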
I generate a grouped dataframe with df = df.groupby(['X','Y']).max(), which I then want to write to CSV without indexes. So I need to convert 'X' and 'Y' back to regular columns; I tried using reset_index(), but the order of the columns came out wrong.
How can I restore columns 'X' and 'Y' to their exact original positions?
Is the solution:
df.reset_index(level=0, inplace=True)
and then find a way to change the order of the columns?
(I also found this approach, for multiindex)
This solution keeps the columns as-is and doesn't create an index after grouping, so we need neither reset_index() nor any column reordering at the end:
df.groupby(['X', 'Y'], as_index=False).max()
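Putting it together with the CSV write from the question (a sketch; the output file name is illustrative):
out = df.groupby(['X', 'Y'], as_index=False).max()
out.to_csv('output.csv', index=False)  # 'X' and 'Y' stay as regular columns, so no reset_index() is needed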
(After testing a lot of different methods, the simplest one was the best solution (as always) and the one that eluded me the longest. Thanks to @maxymoo for pointing it out.)