I have dataframes that have the same column names as follows
df1=pd.DataFrame({'Group1':['a','b','c','d','e'],'Group2':["f","g","h","i","j"],'Group3':['k','L','m','n',"0"]})
df2=pd.DataFrame({'Group1':[0,0,2,1,0],'Group2':[1,2,0,0,0],'Group3':[0,0,0,1,1]})
For some reason, I want to concatenate these dataframes as follows.
dfnew=pd.concat([df1[["Group1","Group2"]], df2[["Group1","Group2"]]], axis=1)
I want to rename the columns of this new dataframe, so I tried the following.
dfnew.columns={"1","2","3","4"}
I expected the order of the columns would be 1,2,3,4, but the actual result was 4,3,1,2 instead.
I do not know why this happens.
If someone could advise me, I would appreciate it very much.
In addition, I need to concatenate many dataframes for future work.
(i.e. concatenate df1,df2, df3...df1000).
Is there a good way to rename the columns as "1, 2, 3, 4, ..., 1000"? Typing all of these numbers by hand is a lot of work.
Thank you.
To rename columns you can use this syntax:
dfnew.columns=["1","2","3","4"]
In the future, if you want to rename 1000 columns as you asked, you can do something like this:
dfnew.columns=[str(i) for i in range(1,1001)]
Use square brackets (a list) to ensure that the column order is preserved. Curly braces create a set, which is unordered, so the names can come out in any order:
dfnew.columns=["1","2","3","4"]
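To tie the two answers together, here is a minimal sketch that rebuilds the frames from the question and generates the names from the actual number of columns instead of hard-coding 1000. The variable name new_names is only illustrative.

```python
import pandas as pd

# Frames from the question
df1 = pd.DataFrame({'Group1': ['a', 'b', 'c', 'd', 'e'],
                    'Group2': ['f', 'g', 'h', 'i', 'j'],
                    'Group3': ['k', 'L', 'm', 'n', '0']})
df2 = pd.DataFrame({'Group1': [0, 0, 2, 1, 0],
                    'Group2': [1, 2, 0, 0, 0],
                    'Group3': [0, 0, 0, 1, 1]})

dfnew = pd.concat([df1[['Group1', 'Group2']], df2[['Group1', 'Group2']]], axis=1)

# A list keeps its order; build "1", "2", ... from the real column count
new_names = [str(i) for i in range(1, dfnew.shape[1] + 1)]
dfnew.columns = new_names
```

Because the list is built from dfnew.shape[1], the same two lines should work unchanged whether the concatenated frame has 4 columns or 1000.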
Related
I have written the following code in three separate cells in my Jupyter notebook and have been able to generate the output I want. However, having this information in one dataframe would make it much easier to read.
How can I combine these separate dataframes into one so that the member_casual column is the index with max_ride_length, avg_ride_length and most_active_day_of_week columns next to it in the same dataframe?
Malo is correct. I will expand a little bit because you can also name the columns when they are aggregated:
df.groupby('member_casual').agg(max_ride_length=('ride_length','max'), avg_ride_length=('ride_length','mean'), most_active_day_of_the_week=('day_of_week',pd.Series.mode))
In the doc https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html
agg accepts a list of functions, as in the example:
df.groupby('A').agg(['min', 'max'])
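For illustration, here is a small sketch of the named aggregation on made-up data. The rows are invented, only the column names come from the question. Note one swap: instead of pd.Series.mode it uses a lambda that takes the first mode, so every group yields a single value even when there is a tie.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question
df = pd.DataFrame({
    'member_casual': ['member', 'member', 'casual', 'casual', 'casual'],
    'ride_length': [10, 20, 5, 30, 25],
    'day_of_week': [1, 1, 7, 7, 6],
})

summary = df.groupby('member_casual').agg(
    max_ride_length=('ride_length', 'max'),
    avg_ride_length=('ride_length', 'mean'),
    # first mode only, so ties do not produce an array in the cell
    most_active_day_of_week=('day_of_week', lambda s: s.mode().iloc[0]),
)
```

The result has member_casual as the index and the three named columns next to it, which is the single dataframe the question asks for.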
Below is the code where 5 dataframes are generated. I want to combine all of them into one, but they have different column headers, and I think that appending them to a list does not retain the header names; I get numbers instead.
Is there any other solution to combine the dataframes keeping the header names as it is?
Thanks in advance!!
frames = []  # renamed from "list" to avoid shadowing the built-in
i = 0
while i < 5:
    df = pytrend.interest_over_time()
    frames.append(df)
    i = i + 1
df_concat = pd.concat(frames, axis=1)
Do you have a common column in the dataframes that you can merge on? If so, use the DataFrame merge function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
I've had to do this recently with two dataframes I had, and I merged on the date column.
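As a rough sketch of merging on a date column: the two frames and their keyword columns below are invented for the example and are not real pytrends output.

```python
import pandas as pd

# Two hypothetical frames that share a "date" column
left = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-02']),
    'python': [50, 60],
})
right = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-02']),
    'pandas': [10, 20],
})

# Rows are matched on "date"; the other header names are kept as they are
merged = left.merge(right, on='date', how='inner')
```

If the date is the index rather than a column, as it may be in your case, you would probably need reset_index() first or a join on the index instead.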
Are you trying to add additional columns, or append each dataframe on top of each other?
https://www.datacamp.com/community/tutorials/joining-dataframes-pandas
This link will give you an overview of the different functions you might need to use.
You can also rename the columns, if they do contain the same sort of data. Without an example of the dataframe it's tricky to know.
I imported a csv as a dataframe from San Francisco Salaries database from Kaggle
df=pd.read_csv('Salaries.csv')
I created a dataframe as an aggregate function from 'df'
df2=df.groupby(['JobTitle','Year'])[['TotalPay']].median()
Problem 1: The first and second column appear as nameless and that shouldn't happen.
Even when I use code of
df2.columns
It only names TotalPay as a column
Problem 2: I try to rename, for instance, the first column as JobTitle and the code doesn't do anything
df3=df2.rename(columns = {0:'JobTitle'},inplace=True)
So the solution that was given here does not apparently work: Rename unnamed column pandas dataframe.
I would like one of two possible solutions:
1) That the aggregate function respects the column naming AND/OR
2) A way to rename the dataframe's unnamed columns
The problem isn't really that you need to rename the columns.
What do the first few rows of the .csv file that you're importing look like? It seems you're not importing it properly. Pandas isn't recognising that JobTitle and Year are meant to be column headers. Pandas read_csv() is very flexible with what it will let you do.
If you import the data properly, you won't need to reindex, or relabel.
Quoting answer by MaxU:
df3 = df2.reset_index()
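To show why reset_index() helps, here is a small sketch with made-up rows (only the column names come from the question). After the groupby, JobTitle and Year are index levels rather than columns, which is why df2.columns lists only TotalPay.

```python
import pandas as pd

# Hypothetical stand-in for Salaries.csv
df = pd.DataFrame({
    'JobTitle': ['Clerk', 'Clerk', 'Nurse', 'Nurse'],
    'Year': [2011, 2011, 2012, 2012],
    'TotalPay': [100.0, 200.0, 300.0, 500.0],
})

df2 = df.groupby(['JobTitle', 'Year'])[['TotalPay']].median()
index_names = list(df2.index.names)  # the group keys live in the index

# reset_index() moves the index levels back into ordinary columns
df3 = df2.reset_index()
```

As a side note, rename(..., inplace=True) returns None, so assigning its result to df3 as in the question would leave df3 empty regardless of the mapping.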
Thank you!
I am using the following code to join two data frames:
new_df = df_1.join(df_2, on=['field_A', 'field_B', 'field_C'], how='left_outer')
The above code works fine, but sometimes df_1 and df_2 have hundreds of columns. Is it possible to join using the schema instead of manually adding all the columns? Or is there a way that I can transform the schema into a list? Thanks a lot!
You can't join on schema, if what you meant was somehow having join incorporate the column dtypes. What you can do is extract the column names out first, then pass them through as the list argument for on=, like this:
join_cols = df_1.columns
df_1.join(df_2, on=join_cols, how='left_outer')
Now obviously you will have to edit the contents of join_cols to make sure it only has the names you actually want to join df_1 and df_2 on. But if there are hundreds of valid columns that is probably much faster than adding them one by one. You could also make join_cols an intersection of df_1 and df_2 columns, then edit from there if that's more suitable.
Edit: I should add that the Spark 2.0 release is due any day now, and I haven't familiarised myself with all the changes yet. So that might be worth looking into as well, or it may provide a future solution.
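A minimal sketch of the intersection idea mentioned above, written in plain Python so it runs without a Spark session. The helper common_columns and the two column lists are hypothetical; in practice you would pass df_1.columns and df_2.columns.

```python
def common_columns(left_cols, right_cols):
    """Return the names present in both lists, in the order of left_cols."""
    right = set(right_cols)
    return [c for c in left_cols if c in right]

# Stand-ins for df_1.columns and df_2.columns
df_1_columns = ['field_A', 'field_B', 'field_C', 'left_only']
df_2_columns = ['field_C', 'field_A', 'field_B', 'right_only']

join_cols = common_columns(df_1_columns, df_2_columns)

# With real Spark dataframes the join from the question would then be:
# new_df = df_1.join(df_2, on=join_cols, how='left_outer')
```

Using a list comprehension rather than a bare set intersection keeps the column order stable, which makes the list easier to review and edit by hand.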
I generate a grouped dataframe df = df.groupby(['X','Y']).max() which I then want to write (to csv, without indexes). So I need to convert 'X' and 'Y' back to regular columns; I tried using reset_index(), but the order of columns was wrong.
How to restore columns 'X' and 'Y' to their exact original column position?
Is the solution:
df.reset_index(level=0, inplace=True)
and then find a way to change the order of the columns?
(I also found this approach, for multiindex)
This solution keeps the grouping keys as regular columns and doesn't create an index after grouping, so we don't need reset_index() and column reordering at the end:
df.groupby(['X','Y'],as_index=False).max()
(After testing a lot of different methods, the simplest one was the best solution (as always) and the one which eluded me the longest. Thanks to #maxymoo for pointing it out.)
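Here is a short sketch of that approach on invented data. One caveat worth hedging: with as_index=False the grouping keys come out as the leading columns, so if 'X' and 'Y' were not first originally, selecting with the saved column list (original_order, an illustrative name) should put them back in place.

```python
import pandas as pd

# Hypothetical frame where the group keys are not the first columns
df = pd.DataFrame({
    'Z': [10, 30, 20],
    'X': ['a', 'a', 'b'],
    'Y': [1, 1, 2],
})
original_order = df.columns.tolist()

# Group keys stay regular columns, so no reset_index() is needed
grouped = df.groupby(['X', 'Y'], as_index=False).max()

# Restore the original column positions before writing
restored = grouped[original_order]
csv_text = restored.to_csv(index=False)
```

Passing index=False to to_csv then writes only the columns, which matches the "without indexes" requirement in the question.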