I have seen this question asked several times, but the solutions from the other questions did not work for me.
I have a data frame like this:
import pandas as pd

df = pd.DataFrame({
    "date": ["20180920"] * 3 + ["20180921"] * 3,
    "id": ["A12", "A123", "A1234", "A12345", "A123456", "A0"],
    "mean": [1, 2, 3, 4, 5, 6],
    "std": [7, 8, 9, 10, 11, 12],
    "test": ["a", "b", "c", "d", "e", "f"],
    "result": [70, 90, 110, "(-)", "(+)", 0.3],
})
Using pivot_table:
df_sum_table = pd.pivot_table(df, index=['id'], columns=['date'], values=['mean', 'std'])
I got
df_sum_table.columns
MultiIndex([('mean', '20180920'),
('mean', '20180921'),
( 'std', '20180920'),
( 'std', '20180921')],
names=[None, 'date'])
I want to move the date labels down one row and remove the separate id row, while keeping the name id in the first column.
I tried following these past solutions:
ValueError when trying to have multi-index in DataFrame.pivot
Removing index name from df created with pivot_table()
Resetting index to flat after pivot_table in pandas
pandas pivot_table keep index
df_sum_table = (pd.pivot_table(df,index=['id'], columns = ['date'], values = ['mean','std'])).reset_index().rename_axis(None, axis=1)
but I get this error:
TypeError: Must pass list-like as names.
How can I remove date but keep id in the first column?
The desired output
#jezrael
Try with rename_axis:
df = df.pivot_table(index=['id'], columns = ['date'], values = ['mean', 'std']).rename_axis(columns={'date': None}).fillna('').reset_index().T.reset_index(level=1).T.reset_index(drop=True).reset_index(drop=True)
df.index = df.pop('id').replace('', 'id').tolist()
print(df)
Output:
             mean      mean       std       std
id       20180920  20180921  20180920  20180921
A0                        6                  12
A12             1                   7
A123            2                   8
A1234           3                   9
A12345                    4                  10
A123456                   5                  11
You could use rename_axis and rename the specific column axis name with dictionary mapping. I specify the columns argument for column axis name mapping.
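The TypeError in the question appears to come from passing a single None to a column index that has two levels. As a minimal sketch, assuming the goal is only to clear the date label while keeping id as the first column, one name per level can be passed instead. The variable names table and flat are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20180920"] * 3 + ["20180921"] * 3,
    "id": ["A12", "A123", "A1234", "A12345", "A123456", "A0"],
    "mean": [1, 2, 3, 4, 5, 6],
    "std": [7, 8, 9, 10, 11, 12],
})

table = pd.pivot_table(df, index=["id"], columns=["date"], values=["mean", "std"])

# The columns have two levels, so rename_axis is given one name per level
# (a list) rather than a single scalar None.
flat = table.rename_axis(columns=[None, None]).reset_index()

print(flat.columns.names)
print(flat)
```

This keeps the two-level header; it does not reproduce the exact layout of the longer chain above, which moves the dates into the first data row.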
I have a large dataframe where I need to add an empty row after any instance where colA contains a colon.
To be honest I have absolutely no clue how to do this, my guess is that a function/ for loop needs to be written but I have had no luck...
I think this is what you are looking for.
Suppose you have a dataframe like this:
import pandas as pd

df = pd.DataFrame({"cola": ["a", "b", ":", "c", "d", ":", "e"]})
# Positions just after every row whose cola value contains a colon
idx = [0] + (df[df.cola.str.contains(':', regex=False)].index + 1).tolist()
df1 = pd.DataFrame()
for i in range(len(idx) - 1):
    # Copy the chunk that ends with a colon row, then append one empty row
    df1 = pd.concat([df1, df.iloc[idx[i]: idx[i+1]]], ignore_index=True)
    df1.loc[len(df1)] = ""
# Append whatever is left after the last colon (nothing if the colon is in the last row)
df1 = pd.concat([df1, df.iloc[idx[-1]:]], ignore_index=True)
print(df1)
# df1 is the result dataframe; a colon in the last row of the dataframe is handled too
Resultant dataframe
cola
0 a
1 b
2 :
3
4 c
5 d
6 :
7
8 e
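The same result can also be built in a single pass over the rows, which avoids calling pd.concat repeatedly on a large dataframe. This is only a sketch under the assumption that the blank row should hold empty strings in every column; the function name insert_blank_rows is hypothetical.

```python
import pandas as pd

def insert_blank_rows(df, col):
    """Return a copy of df with an empty-string row after each row whose `col` contains ':'."""
    blank = {name: "" for name in df.columns}
    records = []
    for record in df.to_dict("records"):
        records.append(record)
        if ":" in str(record[col]):
            records.append(blank)
    return pd.DataFrame(records, columns=df.columns)

df = pd.DataFrame({"cola": ["a", "b", ":", "c", "d", ":", "e"]})
out = insert_blank_rows(df, "cola")
print(out)
```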
For every two rows in my df, I would like to concatenate them into one.
Starting with this:
and ending with this:
I've been able to apply this to one column, but have not been able to apply it across all of them. I would also like to loop this for every two rows for the entire df.
This is my actual df:
Team Spread
0 Wagner Seahawks (-11.5, -118)
1 Fairleigh Dickinson Knights (11.5, -110)
I know this isn't the best way to format a table, but for my needs it is the best option. Thank you
If I were to do this in excel - I would use this:
=TEXTJOIN(CHAR(10),TRUE,A1:A2)
Does this work for you?
>>> df = pd.DataFrame({
"Col1": ["A", "B", "C", "D"],
"Col2": [(-11.5, -118), (11.5, -110), (-11.5, -118), (11.5, -110)],
})
>>> df
Col1 Col2
0 A (-11.5, -118)
1 B (11.5, -110)
2 C (-11.5, -118)
3 D (11.5, -110)
If you have non-string columns, you'll need to transform them to string first:
>>> df["Col2"] = df["Col2"].astype(str)
Now group the rows with floor division of the index (df.index // 2 gives 0, 0, 1, 1, ...), and aggregate each pair of rows using "\n".join.
>>> df = df.groupby(df.index // 2).agg("\n".join)
>>> df
Col1 Col2
0 A\nB (-11.5, -118)\n(11.5, -110)
1 C\nD (-11.5, -118)\n(11.5, -110)
Note that to get this layout in Excel you would need to write the Excel file yourself from the dataframe, in the format that you want (as described in this SO answer).
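Grouping on df.index // 2 relies on the default 0, 1, 2, ... index. As a hedged sketch for a dataframe whose index has been filtered or relabelled, the row positions can be used instead; the index values 10 to 13 below are invented purely to show this.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Col1": ["A", "B", "C", "D"],
        "Col2": [(-11.5, -118), (11.5, -110), (-11.5, -118), (11.5, -110)],
    },
    index=[10, 11, 12, 13],  # made-up non-default index
)

# Pair rows by position rather than by index label: 0, 0, 1, 1, ...
pair_id = np.arange(len(df)) // 2

# Convert every column to str first so that "\n".join accepts the values
joined = df.astype(str).groupby(pair_id).agg("\n".join)
print(joined)
```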
I am trying to declare an empty dataframe and then assign some values to it:
df = pd.DataFrame()
df["a"] = 1234
df["b"] = b # Already defined earlier
df["c"] = c # Already defined earlier
df["t"] = df["b"]/df["c"]
I am getting the below output:
Empty DataFrame
Columns: [a, b, c, t]
Index: []
Can anyone explain why I am getting this empty dataframe even though I am assigning the values? Sorry if my question is kind of basic.
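I think you have to initialize the DataFrame like this:
df = pd.DataFrame(data=[[1234, b, c, b/c]], columns=list("abct"))
When you create a DataFrame with no initial data, it has no columns and, more importantly, no rows (its index is empty).
Assigning a scalar to a new column fills that value into the existing rows, and since there are none, the column is created but stays empty.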
Simply add those values as a list, e.g.:
df["a"] = [123]
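A small sketch of the difference, assuming b and c are plain numbers (the values 10.0 and 4.0 are made up here, since the question does not show them):

```python
import pandas as pd

b, c = 10.0, 4.0  # hypothetical stand-ins for the question's b and c

# Scalar assignment on a frame with no rows: the column appears, but with zero rows
empty = pd.DataFrame()
empty["a"] = 1234
print(len(empty))

# Wrapping each value in a list gives pandas one row to create
df = pd.DataFrame({"a": [1234], "b": [b], "c": [c]})
df["t"] = df["b"] / df["c"]
print(df)
```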
You have started by initialising an empty DataFrame:
# Initialising an empty dataframe
df = pd.DataFrame()
# Print the DataFrame
print(df)
Result
Empty DataFrame
Columns: []
Index: []
Next, you created a column inside the empty DataFrame:
df["a"] = 1234
print(df)
Result
Empty DataFrame
Columns: [a]
Index: []
But you never added any values to the column "a", for example by using a dictionary (key "a" and value list [1, 2, 3, 4]):
df = pd.DataFrame({"a":[1, 2, 3, 4]})
print(df)
Result:
   a
0  1
1  2
2  3
3  4
When a list of values is added, each value gets its own index entry.
The problem is that a cell in a table needs both a row index value and a column index value to insert the cell value. So you need to decide if "a", "b", "c" and "t" are columns or row indexes.
If they are column indexes, then you'd need a row index (0 in the example below) along with what you have written above:
df = pd.DataFrame()
df.loc[0, "a"] = 1234
df.loc[0, "b"] = 2
df.loc[0, "c"] = 3
Result:
In : df
Out:
a b c
0 1234.0 2.0 3.0
Now that you have data in the dataframe you can perform column operations (i.e., create a new column "t" and for each row assign the value of the corresponding item under "b" divided by the corresponding items under "c"):
df["t"] = df["b"]/df["c"]
Of course, you can also use different indexes for each item as follows:
df = pd.DataFrame()
df.loc[0, "a"] = 1234
df.loc[1, "b"] = 2
df.loc[2, "c"] = 3
Result:
In : df
Out:
a b c
0 1234.0 NaN NaN
1 NaN 2.0 NaN
2 NaN NaN 3.0
But as you can see, the cells for which you have not specified a (row, column, value) tuple are now NaN. This means that if you try df["b"]/df["c"] you will get only NaN values, because every row has a NaN in "b" or "c" and any arithmetic involving NaN yields NaN.
In : df["b"]/df["c"]
Out:
0 NaN
1 NaN
2 NaN
dtype: float64
The converse is if you wanted to insert the items under one column. You'd now need a column header for this (0 in the below):
df = pd.DataFrame()
df.loc["a", 0] = 1234
df.loc["b", 0] = 2
df.loc["c", 0] = 3
Result:
In : df
Out:
0
a 1234.0
b 2.0
c 3.0
Now, to insert the value for "t", you need to specify exactly which cells you are referring to (note that pandas does not vectorise operations across rows in the same way that it does across columns).
df.loc["t", 0] = df.loc["b", 0]/df.loc["c", 0]
I have these two dataframes, df1 and df2.
df1 :
Data1 Created
1 22-01-01
4 22-01-01
3 22-01-01
df2 :
Data1 Created
1 22-01-01
6 23-01-01
Both have the same column names.
I would like to use the shared column "Created", which is a date, to count occurrences per day and plot both counts on the same graph.
I've tried this:
ax = df1.plot()
df2.plot(ax=ax,x_compat=True,figsize=(20,10))
but I get this:
Edit:
df2.resample('D').sum() gives me:
Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
I've also tried this:
ax = df1.set_index('Created').resample('1D', how='count').plot()
df2.set_index('Created').resample('1D', how='count').plot(ax=ax,x_compat=True,figsize=(20,10))
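import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'Data1': np.random.randint(0,30,size=10),'Created': pd.date_range("20180101", periods=10)})
df2 = pd.DataFrame({'Data1': np.random.randint(0,30,size=10),'Created': pd.date_range("20180103", periods=10)})
# The outer merge keeps every date from both frames; Data1 becomes Data1_x (df1) and Data1_y (df2)
df = df1.merge(df2, on='Created', how='outer').fillna(0)
df['sum'] = df['Data1_x'] + df['Data1_y']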
df now has all the data.
To plot the sum over time:
plt.plot(df['Created'], df['sum'])
Or to plot the two series separately on the same axes:
plt.plot(df['Created'], df['Data1_x'])
plt.plot(df['Created'], df['Data1_y'])
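The question asks for the number of occurrences per day rather than the sum of Data1, so here is a sketch of that variant. It assumes the "Created" strings are in year-month-day order ("22-01-01" read as 1 January 2022), which the question does not confirm; the helper name daily_counts is hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"Data1": [1, 4, 3], "Created": ["22-01-01", "22-01-01", "22-01-01"]})
df2 = pd.DataFrame({"Data1": [1, 6], "Created": ["22-01-01", "23-01-01"]})

def daily_counts(df, name):
    """Count the rows per calendar day of the 'Created' column."""
    # Assumption: two-digit year first, i.e. yy-mm-dd
    created = pd.to_datetime(df["Created"], format="%y-%m-%d")
    return created.dt.floor("D").value_counts().sort_index().rename(name)

# Align the two count series on the date; days missing from one frame count as 0
counts = pd.concat(
    [daily_counts(df1, "df1"), daily_counts(df2, "df2")], axis=1
).fillna(0).astype(int)
print(counts)
```

With matplotlib installed, counts.plot() should then draw both columns on one set of axes, since the combined frame has a DatetimeIndex.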
I have a pandas dataframe with two columns of data. Now I want to add a label above both columns, like in the picture below:
The two columns do not hold the same values, so I can't use groupby. I only want to add the label AAA like that. How can I do it? Thank you
Reassign the columns attribute with a newly constructed pd.MultiIndex:
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
Consider the dataframe df
df = pd.DataFrame(1, ['hostname', 'tmserver'], ['value', 'time'])
print(df)
value time
hostname 1 1
tmserver 1 1
Then
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
print(df)
           AAA
         value time
hostname     1    1
tmserver     1    1
If you need to create a MultiIndex in the columns, the simplest way is:
df.columns = [['AAA'] * len(df.columns), df.columns]
This is similar to MultiIndex.from_arrays, which also lets you add the names parameter:
n = ['a','b']
df.columns = pd.MultiIndex.from_arrays([['AAA'] * len(df.columns), df.columns], names=n)
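Another option, shown here only as a sketch, is to let pd.concat build the top level from a dictionary key. Unlike assigning to df.columns, it returns a new dataframe and leaves the original untouched; the name labelled is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(1, index=["hostname", "tmserver"], columns=["value", "time"])

# The dictionary key becomes the outer column level
labelled = pd.concat({"AAA": df}, axis=1)
print(labelled)

# Selecting the outer label gives back the original two columns
print(labelled["AAA"])
```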