I have a pandas dataframe that has the same title in multiple columns, so when I print df.columns I get title.1, title.2, title.3, etc.
What I'm trying to do is take the highest-numbered title column, rename it to remove the number, and put it into another dataframe.
For example, I have a dataframe:
df = pd.read_excel(excel_file_path_from_db, engine='openpyxl', sheet_name='Sheet1', skiprows=1)
In my case the most recent values of title are in the last 12 columns of this dataframe, which I get with:
df2 = df.iloc[:, -12:]
My problem is that df2 will contain the highest-numbered title column, for example title.7. How do I remove the .X suffix from title in df2?
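One way to do that (a minimal sketch, assuming the suffixes follow the title.1, title.2, … pattern pandas uses to de-duplicate headers; the data here is made up) is to strip a trailing .number from every column label with a regex:

```python
import pandas as pd

# Toy frame with pandas-style mangled duplicate headers (hypothetical data).
df2 = pd.DataFrame([[1, 2, 3]], columns=['title.7', 'title.8', 'other'])

# Strip a trailing ".<number>" from each column label; labels without
# a numeric suffix are left untouched.
df2.columns = df2.columns.str.replace(r'\.\d+$', '', regex=True)
print(list(df2.columns))  # ['title', 'title', 'other']
```

Note that this leaves duplicate labels in the frame, which pandas allows but which can make later column selection ambiguous.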
This spreadsheet has 4 to 5 merged cells, plus blank cells. I need to get it into a format where a column is created for each merged cell.
Please find the link for the sample file and the required output below.
Link : https://docs.google.com/spreadsheets/d/1fkG9YT9YGg9eXYz6yTkl7AMeuXwmNYQ44oXgA8ByMPk/edit?usp=sharing
Code so far:
dfs = pd.read_excel('/Users/john/Desktop/Internal/Raw Files/Med/TVS/sample.xlsx',
                    sheet_name=None, header=[0,4], index_col=[0,1])
df_2 = (pd.concat([df.stack(0).assign(name=n) for n, df in dfs.items()])
          .rename_axis(index=['Date','WK','Brand','A_B','Foo_Foos','Cur','Units'],
                       columns=None)
          .reset_index())
How do I create multiple columns in the above code? Right now I'm getting the error below:
ValueError: Length of names must match number of levels in MultiIndex.
Try this:
df = pd.read_excel("Sample_File.xlsx", header=[0,1,2,3,4,5], index_col=[0,1])
df.stack(level=[0,1,2,3,4])
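To see why this works (a toy sketch; the labels below are invented, not taken from the sample file): reading a multi-row header with header=[...] produces a column MultiIndex, and stacking all but one of its levels moves those header rows into the row index, leaving one real column per remaining label.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the spreadsheet: a 3-level column MultiIndex
# plays the role of the stacked header rows (all names are made up).
cols = pd.MultiIndex.from_product([['A', 'B'], ['x', 'y'], ['u', 'v']])
df = pd.DataFrame(np.arange(16).reshape(2, 8), columns=cols)

# Stacking the two outer levels moves those header rows into the row
# index, leaving one column per value of the innermost level.
stacked = df.stack(level=[0, 1])
print(stacked.shape)  # (8, 2)
```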
I am trying to merge two dataframes based on a date column with this code:
data_df = (pd.merge(data, one_min_df, on='date', how='outer'))
The first dataframe has 3784 rows and the second has 3764. Every date in the second dataframe also appears in the first. I would like the dataframes to merge on the date column, with any dates that only the longer dataframe has left blank or as NaN.
The code I have here gives the 3764 values followed by 20 empty rows, rather than matching them correctly.
Try this:
data['date'] = pd.to_datetime(data['date'])
one_min_df['date'] = pd.to_datetime(one_min_df['date'])
data_df = pd.merge(data, one_min_df, on='date', how='left')
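A minimal sketch of that behavior with made-up dates: after converting both columns to datetime, how='left' keeps every row of the left frame, and dates missing from the right frame get NaN in its columns.

```python
import pandas as pd

# Toy frames: the left one has two extra dates (hypothetical data).
data = pd.DataFrame({'date': ['2021-01-01', '2021-01-02', '2021-01-03'],
                     'a': [1, 2, 3]})
one_min_df = pd.DataFrame({'date': ['2021-01-02'], 'b': [10.0]})

# Without this step, string vs. datetime dtypes can silently fail to match.
data['date'] = pd.to_datetime(data['date'])
one_min_df['date'] = pd.to_datetime(one_min_df['date'])

data_df = pd.merge(data, one_min_df, on='date', how='left')
print(data_df['b'].isna().sum())  # 2 unmatched dates -> NaN in column 'b'
```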
I am reading from an Excel file (".xlsx"); it consists of 3 columns, but when I read it I get a DataFrame full of NaNs. I checked the table in Excel: it consists of normal cells, no formulas, no hyperlinks.
My code:
data = pd.read_excel("Data.xlsx")
df = pd.DataFrame(data, columns=["subreddit_group", "links/caption", "subreddits/flair"])
print(df)
The columns parameter of pd.DataFrame() doesn't set column names on the resulting dataframe; it selects columns from the original data.
See the pandas documentation:
Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead.
So you shouldn't pass the columns parameter; instead, after the file is read, rename the columns of the dataframe:
df = pd.DataFrame(data)
df.columns = ['subreddit_group', 'links/caption', 'subreddits/flair']
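A small sketch of the difference (with hypothetical labels, not the actual file's): columns= performs selection against labels the data already has, and a label that does not exist comes back as a column of NaN, which is how a frame full of NaNs can appear.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by read_excel:
# it already carries its own column labels.
data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# columns= *selects* existing labels; a label not present in `data`
# comes back as a column of NaN.
selected = pd.DataFrame(data, columns=['a', 'c'])
print(selected['c'].isna().all())  # True

# Renaming after construction is what actually changes the labels.
renamed = pd.DataFrame(data)
renamed.columns = ['x', 'y']
print(list(renamed.columns))  # ['x', 'y']
```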
Data is a unique value; the id is repeated multiple times in an Excel file. Data is column 1 and the ids are column 2. I would like to group the unique data values by id without losing any, then set the column header to the id and place the associated data values below it. Then do the same with the second id, placing its values in the next column over. Could anyone help me sort it out into such a layout?
You can't have varying length columns in a dataframe. So NaNs are unavoidable.
import pandas as pd
df = pd.DataFrame({'col1':[2,3,3,4,2,1,3,4], 'col2':[1,1,1,1,2,2,2,3]})
# First problem
df2 = df.pivot(columns='col2')["col1"]
df2 = df2.apply(lambda x: pd.Series(x.dropna().values))
print(df2)
# Second problem
def concat(s):
    return s.tolist()
df3 = df.groupby('col2').agg(concat)["col1"].apply(pd.Series)
print(df3)
I'm trying to read an Excel or CSV file into a pandas dataframe. Only the first two columns should be read, with the top row of those columns providing the column names. The problem arises when the first cell of the top row is empty in the Excel file:
           IDs
2/26/2010  2
3/31/2010  4
4/31/2010  2
5/31/2010  2
Then, the last line of the following code fails:
uploaded_file = request.FILES['file-name']
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0,1])
else:
    df = pd.read_excel(uploaded_file, usecols=[0,1])
ref_date = 'ref_date'
regime_tag = 'regime_tag'
df.columns = [ref_date, regime_tag]
Apparently, it only reads one column (i.e. the IDs). With read_csv, however, it reads both columns, with the first column being unnamed. I want it to behave that way and read both columns regardless of whether the top cells are empty or filled. How do I go about doing that?
What's happening is the first "column" in the Excel file is being read in as an index, while in the CSV file it's being treated as a column / series.
I recommend you work the other way and amend pd.read_csv to read the first column as an index. Then use reset_index to elevate the index to a series:
if uploaded_file.name.endswith('.csv'):
    df = pd.read_csv(uploaded_file, usecols=[0,1], index_col=0)
else:
    df = pd.read_excel(uploaded_file, usecols=[0,1])
df = df.reset_index()  # this will elevate the index to a column called 'index'
This will give consistent output, i.e. first series will have label 'index' and the index of the dataframe will be the regular pd.RangeIndex.
You could potentially use a dispatcher to get rid of the unwieldy if / else construct:
file_flag = {True: pd.read_csv, False: pd.read_excel}
read_func = file_flag[uploaded_file.name.endswith('.csv')]
df = read_func(uploaded_file, usecols=[0,1], index_col=0).reset_index()