I have a large dataframe where I need to add an empty row after any instance where colA contains a colon.
To be honest I have absolutely no clue how to do this, my guess is that a function/ for loop needs to be written but I have had no luck...
I think you are looking for this
You have dataframe like this
df = pd.DataFrame({"cola": ["a", "b", ":", "c", "d", ":", "e"]})
# wherever you find : in column a you want to append new empty row
idx = [0] + (df[df.cola.str.match(':')].index +1).tolist()
df1 = pd.DataFrame()
for i in range(len(idx)-1):
df1 = pd.concat([df1, df.iloc[idx[i]: idx[i+1]]],ignore_index=True)
df1.loc[len(df1)] = ""
df1 = pd.concat([df1, df.iloc[idx[-1]: ]], ignore_index=True)
print(df1)
# df1 is your result dataframe also it handles the case where colon is present at the last row of dataframe
Resultant dataframe
cola
0 a
1 b
2 :
3
4 c
5 d
6 :
7
8 e
Related
For each row of a dataframe I want to repeat the row n times inside a iterrows in a new dataframe. Basically I'm doing this:
df = pd.DataFrame(
[
("abcd", "abcd", "abcd") # create your data here, be consistent in the types.
],
["A", "B", "C"] # add your column names here
)
n_times = 2
for index, row in df.iterrows():
new_df = row.loc[row.index.repeat(n_times)]
new_df
and I get the following output:
0 abcd
0 abcd
1 abcd
1 abcd
2 abcd
2 abcd
Name: C, dtype: object
while it should be:
A B C
0 abcd abcd abcd
1 abcd abcd abcd
How should I proceed to get the desired output?
The df.T attribute in Pandas is used to transpose a DataFrame. Transposing a DataFrame means to flip its rows and columns, so that the rows become columns and the columns become rows.
I don't think you defined your df the right way.
df = pd.DataFrame(data = [["abcd", "abcd", "abcd"]],
columns = ["A", "B", "C"])
n_times = 2
for _ in range(n_times):
new_df = pd.concat([df, df], axis=0)
Is that how it should look like?
Given a DataFrame, how can I add a new level to the columns based on an iterable given by the user? In other words, how do I append a new level?
The question How to simply add a column level to a pandas dataframe shows how to add a new level given a single value, so it doesn't cover this case.
Here is the expected behaviour:
>>> df = pd.DataFrame(0, columns=["A", "B"], index=range(2))
>>> df
A B
0 0 0
1 0 0
>>> append_level(df, ["C", "D"])
A B
C D
0 0 0
1 0 0
The solution should also work with MultiIndex columns, so
>>> append_level(append_level(df, ["C", "D"]), ["E", "F"])
A B
C D
E F
0 0 0
1 0 0
If the columns is not multiindex, you can just do:
df.columns = pd.MultiIndex.from_arrays([df.columns.tolist(), ['C','D']])
If its multiindex:
if isinstance(df.columns, pd.MultiIndex):
df.columns = pd.MultiIndex.from_arrays([*df.columns.levels, ['E', 'F']])
The pd.MultiIndex.levels gives a Frozenlist of level values and you need to unpack to form the list of lists as input to from_arrays
def append_level(df, new_level):
new_df = df.copy()
new_df.columns = pd.MultiIndex.from_tuples(zip(*zip(*df.columns), new_level))
return new_df
So apparently I am trying to declare an empty dataframe, then assign some values in it
df = pd.DataFrame()
df["a"] = 1234
df["b"] = b # Already defined earlier
df["c"] = c # Already defined earlier
df["t"] = df["b"]/df["c"]
I am getting the below output:
Empty DataFrame
Columns: [a, b, c, t]
Index: []
Can anyone explain why I am getting this empty dataframe even when I am assigning the values. Sorry if my question is kind of basic
I think, you have to initialize DataFrame like this.
df = pd.DataFrame(data=[[1234, b, c, b/c]], columns=list("abct"))
When you make DataFrame with no initial data, the DataFrame has no data and no columns.
So you can't append any data I think.
Simply add those values as a list, e.g.:
df["a"] = [123]
You have started by initialising an empty DataFrame:
# Initialising an empty dataframe
df = pd.DataFrame()
# Print the DataFrame
print(df)
Result
Empty DataFrame
Columns: []
Index: []
As next you've created a column inside the empty DataFrame:
df["a"] = 1234
print(df)
Result
Empty DataFrame
Columns: [a]
Index: []
But you never added values to the existing column "a" - f.e. by using a dictionary (key: "a" and value list [1, 2, 3, 4]:
df = pd.DataFrame({"a":[1, 2, 3, 4]})
print(df)
Result:
In case a list of values is added each value will get an index entry.
The problem is that a cell in a table needs both a row index value and a column index value to insert the cell value. So you need to decide if "a", "b", "c" and "t" are columns or row indexes.
If they are column indexes, then you'd need a row index (0 in the example below) along with what you have written above:
df = pd.DataFrame()
df.loc[0, "a"] = 1234
df.loc[0, "b"] = 2
df.loc[0, "c"] = 3
Result:
In : df
Out:
a b c
0 1234.0 2.0 3.0
Now that you have data in the dataframe you can perform column operations (i.e., create a new column "t" and for each row assign the value of the corresponding item under "b" divided by the corresponding items under "c"):
df["t"] = df["b"]/df["c"]
Of course, you can also use different indexes for each item as follows:
df = pd.DataFrame()
df.loc[0, "a"] = 1234
df.loc[1, "b"] = 2
df.loc[2, "c"] = 3
Result:
In : df
Out:
a b c
0 1234.0 NaN NaN
1 NaN 2.0 NaN
2 NaN NaN 3.0
But as you can see the cells where you have not specified the (row, column, value) tuple now are NaN. This means if you try df["b"]/df["c"] you will get NaN values out as you are trying a linear operation with a NaN value.
In : df["b"]/df["c"]
Out:
0 NaN
1 NaN
2 NaN
dtype: float64
The converse is if you wanted to insert the items under one column. You'd now need a column header for this (0 in the below):
df = pd.DataFrame()
df.loc["a", 0] = 1234
df.loc["b", 0] = 2
df.loc["c", 0] = 3
Result:
In : df
Out:
0
a 1234.0
b 2.0
c 3.0
Now in inserting the value for "t" you'd need to specify exactly which cells you are referring to (note that pandas won't perform vectorised row operations in the same way that it performs vectorised columns operations).
df.loc["t", 0] = df.loc["b", 0]/df.loc["c", 0]
I have a dataframe where each second column name is skipped:
eg
Step_1.
The idea is to fill unnamed columns with previous name to get:
Step_2.
To sum up "in" and "out" in each class, to get final result like this
The intermediary Step_1 is important and cannot be skipped to get the final result.
I appreciate any help and apologize for not being clear enough when asking question at the first attempt.
Thank you
Idea is convert columns to Series, so possible replace missing values instead values starting by Unnamed with forward filling:
df.columns = df.columns.to_series().mask(lambda x: x.str.startswith('Unnamed')).ffill()
print (df)
Column_1 Column_1 Column_2 Column_2
0 a d f g
EDIT:
If missing values in index:
df.columns = df.columns.to_series().ffill()
MultiIndex solution is necessary, if second row is header too - first use header=[0,1] for MultiIndex:
import pandas as pd
temp=u"""Column_1;Unnamed_column;Column_2;Unnamed_column
a;d;f;g
1;5;5;6
7;8;9;4"""
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), sep=";", header=[0,1])
print (df)
Column_1 Unnamed_column Column_2 Unnamed_column
a d f g
0 1 5 5 6
1 7 8 9 4
a = df.columns.get_level_values(0)
b = df.columns.get_level_values(1)
df.columns = [a.to_series().mask(lambda x: x.str.startswith('Unnamed')).ffill(), b]
print (df)
Column_1 Column_2
a d f g
0 1 5 5 6
1 7 8 9 4
I tried this,
t = pd.DataFrame(df.columns)
t.loc[t[0].str.startswith('Unnamed: '),0] = np.NaN
t[0].bfill(inplace=True)
df.columns = t[0].values
Create temp dataframe with column of original dataframe. apply ffill or bfill as per your wish. assign back the values again to original dataframe.
You can rewrite the df.index with a list comprehension.
from itertools import chain
df = pd.DataFrame(
{"Column_1": [1], "Unnamed_column1": [2], "Column_2": [3], "Unnamed_column2": [4]})
cols = [[c, c] for c in df.columns[::2]]
df.columns = [_ for _ in chain(*cols)]
Having said that it might be better to assign unique names to columns as they will be used keys/indices, i.e .
cols = [[c, c+"_new"] for c in df.columns[::2]]
I have a panda DataFrame that I want to add rows to. The Dataframe looks like this:
col1 col2
a 1 5
b 2 6
c 3 7
I want to add rows to the dataframe, but only if they are unique. The problem is that some new rows might have the same index, but different values in the columns. If this is the case, I somehow need to know.
Some example rows to be added and the desired result:
row 1:
col1 col2
a 1 5
desired row 1 result: Not added - it is already in the dataframe
row 2:
col1 col2
a 9 9
desired row 2 result: something like,
print('non-unique entries for index a')
row 3:
col1 col2
d 4 4
desired row 3 result: just add the row to the dataframe.
try this:
# existing dataframe == df
# new rows == df_newrows
# dividing newrows dataframe into two, one for repeated indexes, one without.
df_newrows_usable = df_newrows.loc[df_newrows.index.isin(list(df.index.get_values()))==False]
df_newrows_discarded = df_newrows.loc[df_newrows.index.isin(list(df.index.get_values()))]
print ('repeated indexes:', df_newrows_discarded)
# concat df and newrows without repeated indexes
new_df = pd.concat([df,df_newrows],0)
print ('new dataframe:', new_df)
the easy option would be to merge all rows and then keep the unique ones via the dataframe method drop_duplicates
However, this option doesn't report a warning / error when a duplicate row is appended.
drop_duplicates doesn't consider indexes, so the dataframe must be reset before dropping the duplicates, and set back after:
import pandas as pd
# set up data frame
df = pd.DataFrame({'col1': [1, 2, 3], 'col2':[5, 6, 7]}, index=['a', 'b', 'c'])
# set up row to be appended
row = pd.DataFrame({'col1':[3], 'col2': [7]}, index=['c'])
# append row (don't care if it's duplicate)
df = df.append([row])
# drop duplicatesdf2 = df2.reset_index()
df2 = df2.drop_duplicates()
df2 = df2.set_index('index')
if the warning message is an absolute requirement, we can write a function to that effect that checks if a row is duplicate via a merge operation and appends the row only if it is unique.
def append_unique(df, row):
d = df.reset_index()
r = row.reset_index()
if d.merge(r, on=list(d.columns), how='inner').empty:
d2 = d.append(r)
d2 = d2.set_index('index')
return d2
print('non-unique entries for index a')
return df
df2 = append_unique(df2, row)