For each row of a dataframe, I want to repeat the row n times inside an iterrows loop and collect the result in a new dataframe. Basically I'm doing this:
df = pd.DataFrame(
[
("abcd", "abcd", "abcd") # create your data here, be consistent in the types.
],
["A", "B", "C"] # add your column names here
)
n_times = 2
for index, row in df.iterrows():
    new_df = row.loc[row.index.repeat(n_times)]
new_df
and I get the following output:
0 abcd
0 abcd
1 abcd
1 abcd
2 abcd
2 abcd
Name: C, dtype: object
while it should be:
A B C
0 abcd abcd abcd
1 abcd abcd abcd
How should I proceed to get the desired output?
The df.T attribute in Pandas is used to transpose a DataFrame. Transposing a DataFrame means to flip its rows and columns, so that the rows become columns and the columns become rows.
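I don't think you defined your df the right way: the second positional argument of pd.DataFrame is the index, so the column names have to be passed as columns=. To repeat the rows n_times, concatenate n_times copies of the frame:
import pandas as pd

df = pd.DataFrame(data=[["abcd", "abcd", "abcd"]],
                  columns=["A", "B", "C"])
n_times = 2
# Stack n_times copies of df and renumber the rows from 0
new_df = pd.concat([df] * n_times, axis=0, ignore_index=True)
Is that what it should look like?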
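As a possible alternative, the rows can be repeated without a loop by repeating the index labels and selecting them with .loc. This is only a sketch: it assumes the single-row frame from the answer above, and new_df is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([["abcd", "abcd", "abcd"]], columns=["A", "B", "C"])
n_times = 2

# Repeat every index label n_times, select those labels, then renumber the rows
new_df = df.loc[df.index.repeat(n_times)].reset_index(drop=True)
print(new_df)
```

Unlike concatenating copies, this keeps the repeats of each row next to each other when the frame has more than one row.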
Given a DataFrame, how can I add a new level to the columns based on an iterable given by the user? In other words, how do I append a new level?
The question How to simply add a column level to a pandas dataframe shows how to add a new level given a single value, so it doesn't cover this case.
Here is the expected behaviour:
>>> df = pd.DataFrame(0, columns=["A", "B"], index=range(2))
>>> df
A B
0 0 0
1 0 0
>>> append_level(df, ["C", "D"])
A B
C D
0 0 0
1 0 0
The solution should also work with MultiIndex columns, so
>>> append_level(append_level(df, ["C", "D"]), ["E", "F"])
A B
C D
E F
0 0 0
1 0 0
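If the columns are not a MultiIndex, you can just do:
df.columns = pd.MultiIndex.from_arrays([df.columns.tolist(), ['C', 'D']])
If they are a MultiIndex:
if isinstance(df.columns, pd.MultiIndex):
    # one array per existing level, aligned with the columns, plus the new level
    arrays = [df.columns.get_level_values(i) for i in range(df.columns.nlevels)]
    df.columns = pd.MultiIndex.from_arrays([*arrays, ['E', 'F']])
Note that pd.MultiIndex.levels is a FrozenList holding only the unique values of each level, so unpacking it lines up with the columns only by coincidence. get_level_values returns one label per column, which is what from_arrays expects.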
def append_level(df, new_level):
    new_df = df.copy()
    new_df.columns = pd.MultiIndex.from_tuples(zip(*zip(*df.columns), new_level))
    return new_df
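One caveat with unpacking df.columns via zip is that plain string column names longer than one character get split into characters. A minimal sketch of a variant that avoids this, assuming the new level should have exactly one entry per column (the length check is my addition, not part of the question):

```python
import pandas as pd

def append_level(df, new_level):
    """Return a copy of df whose columns gain new_level as the innermost level."""
    new_level = list(new_level)
    if len(new_level) != len(df.columns):
        raise ValueError("new_level must have one entry per column")
    cols = df.columns
    # One array per existing level; this also works for a plain Index (nlevels == 1)
    arrays = [cols.get_level_values(i) for i in range(cols.nlevels)]
    new_df = df.copy()
    new_df.columns = pd.MultiIndex.from_arrays(arrays + [new_level])
    return new_df

df = pd.DataFrame(0, columns=["A", "B"], index=range(2))
two = append_level(df, ["C", "D"])
three = append_level(two, ["E", "F"])
print(three.columns.tolist())
```

The original frame is left untouched because the function works on a copy.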
I have a large dataframe where I need to add an empty row after any instance where colA contains a colon.
To be honest I have absolutely no clue how to do this, my guess is that a function/ for loop needs to be written but I have had no luck...
I think this is what you are looking for.
Say you have a dataframe like this:
import pandas as pd

df = pd.DataFrame({"cola": ["a", "b", ":", "c", "d", ":", "e"]})
# Positions just after each row whose cola contains a colon (assumes the default 0..n-1 index)
idx = [0] + (df[df.cola.str.contains(':')].index + 1).tolist()
df1 = pd.DataFrame()
for i in range(len(idx) - 1):
    # Copy the chunk up to and including the colon row, then add an empty row
    df1 = pd.concat([df1, df.iloc[idx[i]: idx[i+1]]], ignore_index=True)
    df1.loc[len(df1)] = ""
# Append whatever follows the last colon (nothing if the colon is in the last row)
df1 = pd.concat([df1, df.iloc[idx[-1]:]], ignore_index=True)
print(df1)
The resulting dataframe:
cola
0 a
1 b
2 :
3
4 c
5 d
6 :
7
8 e
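For a large dataframe, growing the result inside a loop can be slow. A loop-free sketch of the same idea follows; it assumes the frame has the default integer index 0..n-1, and the names mask, blanks and result are illustrative only. Each blank row is given an index halfway between the colon row and the next row, so sorting the index puts it in place.

```python
import pandas as pd

df = pd.DataFrame({"cola": ["a", "b", ":", "c", "d", ":", "e"]})

# Rows whose cola contains a literal colon
mask = df["cola"].str.contains(":", regex=False).to_numpy()

# One empty row per colon row, labelled i + 0.5 so it sorts right after row i
blanks = pd.DataFrame("", index=df.index[mask] + 0.5, columns=df.columns)

result = pd.concat([df, blanks]).sort_index().reset_index(drop=True)
print(result)
```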
How can I convert the following list to a pandas dataframe?
my_list = [["A","B","C"],["A","B","D"]]
And as an output I would like to have a dataframe like:
Index  A  B  C  D
1      1  1  1  0
2      1  1  0  1
You can craft Series and concatenate them:
import pandas as pd

my_list = [["A","B","C"],["A","B","D"]]
df = (pd.concat([pd.Series(1, index=l, name=i+1)
                 for i, l in enumerate(my_list)], axis=1)
        .T
        .fillna(0)      # labels missing from a list become 0
        .astype(int)    # optional: back to integers
     )
or with get_dummies:
df = pd.get_dummies(pd.DataFrame(my_list))
# strip the "<column>_" prefix and merge the dummies that share a label
df = df.T.groupby(df.columns.str.split('_', n=1).str[-1]).max().T
output:
A B C D
1 1 1 1 0
2 1 1 0 1
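Another possible route, sketched here under the assumption that no label contains the "|" character, is to join each inner list into one string and let Series.str.get_dummies split it again:

```python
import pandas as pd

my_list = [["A", "B", "C"], ["A", "B", "D"]]

# Join each inner list into "A|B|C", then expand into 0/1 indicator columns
df = pd.Series(my_list).str.join("|").str.get_dummies(sep="|")
df.index = df.index + 1  # number the rows from 1, as in the question
print(df)
```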
I'm unsure how those two structures relate. The my_list is a list of two lists containing ["A","B","C"] and ["A", "B","D"].
If you want a data frame like the table you have, I would suggest making a dictionary of the values first, then converting it into a pandas dataframe.
my_dict = {"A":[1,1], "B":[1,1], "C": [1,0], "D":[0,1]}
my_df = pd.DataFrame(my_dict)
print(my_df)
Output:
   A  B  C  D
0  1  1  1  0
1  1  1  0  1
I am trying to declare an empty dataframe and then assign some values to it:
df = pd.DataFrame()
df["a"] = 1234
df["b"] = b # Already defined earlier
df["c"] = c # Already defined earlier
df["t"] = df["b"]/df["c"]
I am getting the below output:
Empty DataFrame
Columns: [a, b, c, t]
Index: []
Can anyone explain why I am getting this empty dataframe even though I am assigning the values? Sorry if my question is kind of basic.
I think you have to initialize the DataFrame like this:
df = pd.DataFrame(data=[[1234, b, c, b/c]], columns=list("abct"))
When you create a DataFrame with no initial data, it has no columns and no rows.
Assigning a scalar such as 1234 to a new column broadcasts it over the existing index, and because that index is empty the column is created with zero rows.
Simply add those values as a list, e.g.:
df["a"] = [123]
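To make the difference concrete, here is a small sketch contrasting the two assignments. The values of b and c are hypothetical stand-ins for the variables "defined earlier" in the question, and the name empty is illustrative.

```python
import pandas as pd

b, c = 2, 3  # hypothetical values; the question does not show them

empty = pd.DataFrame()
empty["a"] = 1234          # scalar is broadcast over an empty index
print(len(empty))          # the column exists, but there are no rows

# Passing lists gives every column one row to hold its value
df = pd.DataFrame({"a": [1234], "b": [b], "c": [c]})
df["t"] = df["b"] / df["c"]
print(df)
```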
You have started by initialising an empty DataFrame:
# Initialising an empty dataframe
df = pd.DataFrame()
# Print the DataFrame
print(df)
Result
Empty DataFrame
Columns: []
Index: []
Next, you created a column inside the empty DataFrame:
df["a"] = 1234
print(df)
Result
Empty DataFrame
Columns: [a]
Index: []
But the column "a" never received any values, because there were no rows for the scalar to fill. You can supply values, for example, with a dictionary (key "a", value list [1, 2, 3, 4]):
df = pd.DataFrame({"a":[1, 2, 3, 4]})
print(df)
Result:
   a
0  1
1  2
2  3
3  4
When a list of values is added, each value gets its own index entry.
The problem is that a cell in a table needs both a row index value and a column index value to insert the cell value. So you need to decide if "a", "b", "c" and "t" are columns or row indexes.
If they are column indexes, then you'd need a row index (0 in the example below) along with what you have written above:
df = pd.DataFrame()
df.loc[0, "a"] = 1234
df.loc[0, "b"] = 2
df.loc[0, "c"] = 3
Result:
In : df
Out:
a b c
0 1234.0 2.0 3.0
Now that you have data in the dataframe you can perform column operations (i.e., create a new column "t" and for each row assign the value of the corresponding item under "b" divided by the corresponding items under "c"):
df["t"] = df["b"]/df["c"]
Of course, you can also use different indexes for each item as follows:
df = pd.DataFrame()
df.loc[0, "a"] = 1234
df.loc[1, "b"] = 2
df.loc[2, "c"] = 3
Result:
In : df
Out:
a b c
0 1234.0 NaN NaN
1 NaN 2.0 NaN
2 NaN NaN 3.0
But as you can see, the cells for which you have not specified a (row, column, value) tuple are now NaN. This means that df["b"]/df["c"] returns only NaN values, because in every row at least one of the operands is NaN.
In : df["b"]/df["c"]
Out:
0 NaN
1 NaN
2 NaN
dtype: float64
Conversely, you may want to insert all the items under one column. You'd then need a column header (0 in the example below):
df = pd.DataFrame()
df.loc["a", 0] = 1234
df.loc["b", 0] = 2
df.loc["c", 0] = 3
Result:
In : df
Out:
0
a 1234.0
b 2.0
c 3.0
Now, when inserting the value for "t", you need to specify exactly which cells you are referring to (note that pandas won't perform vectorised row operations in the same way that it performs vectorised column operations).
df.loc["t", 0] = df.loc["b", 0]/df.loc["c", 0]
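If the labels "a", "b", "c" and "t" are meant to be row labels of a single column, a Series may be a simpler container than a one-column DataFrame. A minimal sketch, reusing the example values 1234, 2 and 3 from above (the name s is illustrative):

```python
import pandas as pd

# One labelled value per entry; floats to match the outputs shown above
s = pd.Series({"a": 1234, "b": 2, "c": 3}, dtype="float64")

# Assigning to a new label appends it to the Series
s["t"] = s["b"] / s["c"]
print(s)
```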
Assuming that I have a dataframe with the following values:
df:
col1 col2 value
1 2 3
1 2 1
2 3 1
I want to first group my dataframe by the first two columns (col1 and col2) and then average the values of the third column (value). So the desired output would look like this:
col1 col2 avg-value
1 2 2
2 3 1
I am using the following code:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby('col1','col2').mean())
which gets the following error:
ValueError: No axis named col2 for object type <class 'pandas.core.frame.DataFrame'>
Any help would be much appreciated.
You need to pass a list of columns to groupby. What you passed was interpreted as the axis parameter, which is why it raised an error:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby(['col1','col2']).mean())
           avg
col1 col2
1    2       3
     3       3
If you want to group by multiple columns, you should put them in a list:
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).mean())
Or, slightly more verbosely, to get the name 'avg' into your aggregated dataframe, use named aggregation (the nested-dict form of agg is no longer supported in current pandas):
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).agg(avg=('value', 'mean')))
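For reference, here is a sketch that uses the three rows from the question itself and reproduces the requested layout. Building the frame from a dictionary and passing as_index=False are my choices for illustration; the column name avg-value is taken from the desired output.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2], "col2": [2, 2, 3], "value": [3, 1, 1]})

# as_index=False keeps col1 and col2 as ordinary columns;
# the ** form is needed because "avg-value" is not a valid Python identifier
result = (
    df.groupby(["col1", "col2"], as_index=False)
      .agg(**{"avg-value": ("value", "mean")})
)
print(result)
```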