I am trying to declare an empty DataFrame and then assign some values to it:
df = pd.DataFrame()
df["a"] = 1234
df["b"] = b # Already defined earlier
df["c"] = c # Already defined earlier
df["t"] = df["b"]/df["c"]
I am getting the below output:
Empty DataFrame
Columns: [a, b, c, t]
Index: []
Can anyone explain why I get this empty DataFrame even though I am assigning values? Sorry if my question is basic.
I think you have to initialize the DataFrame like this:
df = pd.DataFrame(data=[[1234, b, c, b/c]], columns=list("abct"))
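When you create a DataFrame with no initial data, it has no columns and no rows (its index is empty).
Assigning a scalar to a new column broadcasts that scalar over the existing index, and an empty index leaves nothing to fill, so the column is created but holds no values.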
Simply add those values as a list, e.g.:
df["a"] = [123]
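To make the broadcasting point concrete, here is a minimal sketch. The values of b and c are hypothetical stand-ins for whatever was defined earlier in the question, and starting from a one-row index is just one possible fix, not the only one.

```python
import pandas as pd

b, c = 1.0, 2.0  # hypothetical stand-ins for the values defined earlier

# Scalar assigned to a frame with no rows: the column is created,
# but there is no row for the value to go into
empty = pd.DataFrame()
empty["a"] = 1234
print(len(empty))

# Give the frame one row label first, then each scalar has a row to fill
df = pd.DataFrame(index=[0])
df["a"] = 1234
df["b"] = b
df["c"] = c
df["t"] = df["b"] / df["c"]
print(df)
```

Wrapping each scalar in a list, as suggested above, has the same effect because the list supplies the row.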
You have started by initialising an empty DataFrame:
# Initialising an empty dataframe
df = pd.DataFrame()
# Print the DataFrame
print(df)
Result
Empty DataFrame
Columns: []
Index: []
Next, you created a column inside the empty DataFrame:
df["a"] = 1234
print(df)
Result
Empty DataFrame
Columns: [a]
Index: []
But you never added any values to column "a". One way to do that is with a dictionary (key "a", value the list [1, 2, 3, 4]):
df = pd.DataFrame({"a":[1, 2, 3, 4]})
print(df)
Result:
   a
0  1
1  2
2  3
3  4
When a list of values is added, each value gets its own index entry.
The problem is that a cell in a table needs both a row index value and a column index value to insert the cell value. So you need to decide if "a", "b", "c" and "t" are columns or row indexes.
If they are column indexes, then you'd need a row index (0 in the example below) along with what you have written above:
df = pd.DataFrame()
df.loc[0, "a"] = 1234
df.loc[0, "b"] = 2
df.loc[0, "c"] = 3
Result:
In : df
Out:
        a    b    c
0  1234.0  2.0  3.0
Now that you have data in the dataframe you can perform column operations (i.e., create a new column "t" and for each row assign the value of the corresponding item under "b" divided by the corresponding items under "c"):
df["t"] = df["b"]/df["c"]
Of course, you can also use different indexes for each item as follows:
df = pd.DataFrame()
df.loc[0, "a"] = 1234
df.loc[1, "b"] = 2
df.loc[2, "c"] = 3
Result:
In : df
Out:
        a    b    c
0  1234.0  NaN  NaN
1     NaN  2.0  NaN
2     NaN  NaN  3.0
But as you can see, the cells for which you have not specified a (row, column, value) tuple are now NaN. This means that if you try df["b"]/df["c"] you will get NaN values, because every row has NaN in at least one of the two operands.
In : df["b"]/df["c"]
Out:
0 NaN
1 NaN
2 NaN
dtype: float64
The converse case is inserting all the items under one column. You now need a column header for this (0 in the example below):
df = pd.DataFrame()
df.loc["a", 0] = 1234
df.loc["b", 0] = 2
df.loc["c", 0] = 3
Result:
In : df
Out:
        0
a  1234.0
b     2.0
c     3.0
Now, when inserting the value for "t", you need to specify exactly which cells you are referring to (note that df["b"] selects a column, not a row, so the column-style expression df["b"]/df["c"] does not apply here; rows are addressed through .loc).
df.loc["t", 0] = df.loc["b", 0]/df.loc["c", 0]
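As a hedged aside, a row label can also be selected as a whole Series with .loc, so the same division can be written once for every column. A minimal sketch, assuming a small frame with hypothetical column labels 0 and 1 (the second column is not from the question):

```python
import pandas as pd

# Hypothetical frame: row labels "a", "b", "c"; column labels 0 and 1
df = pd.DataFrame({0: [1234.0, 2.0, 3.0], 1: [10.0, 8.0, 4.0]},
                  index=["a", "b", "c"])

# df.loc["b"] and df.loc["c"] are Series indexed by the column labels,
# so the division is applied to every column at once
df.loc["t"] = df.loc["b"] / df.loc["c"]
print(df)
```

With a single column this gives the same cell as the explicit df.loc["t", 0] assignment.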
I have two DataFrames with meaningless indexes but a carefully curated row order, and I want to merge them while preserving that order. For example:
>>> df1
First
a 1
b 3
and
>>> df2
Second
c 2
d 4
After merging, what I want to obtain is this:
>>> Desired_output
First Second
AnythingAtAll 1 2 # <--- Row Names are meaningless.
SeriouslyIDontCare 3 4 # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row indices "a/b" and "c/d" is irrelevant; what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset the indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat( [df1, df2] , axis = 1, ignore_index= True )
     0    1
a  1.0  NaN
b  3.0  NaN
c  NaN  2.0
d  NaN  4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up, but thanks to the suggestions from jsmart and topsail, I found that you can bypass the index alignment by accessing the values in the column directly:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
This should also work, I think:
df1["second"] = df2["Second"].values
It keeps the index from the first DataFrame, but since your desired output has labels such as "AnythingAtAll" and "SeriouslyIDontCare", I guess any index values whatsoever are acceptable.
Basically, we are just adding the values from your Series as a new column in the first DataFrame.
Here's a test example similar to your described problem:
# -----------
# sample data
# -----------
df1 = pd.DataFrame(
{
'x': ['a','b'],
'First': [1,3],
})
df1.set_index("x", drop=True, inplace=True)
df2 = pd.DataFrame(
{
'x': ['c','d'],
'Second': [2, 4],
})
df2.set_index("x", drop=True, inplace=True)
# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
   First  Second
a      1       2
b      3       4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data = {'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data = {'Second': [2, 4]}, index = ['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
First Second
a 1 2
b 3 4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
1st 2nd
0 1 2
1 3 4
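ignore_index controls whether the labels along the concatenation axis are kept. With axis=1 that axis is the columns, so ignore_index=True discards the column names and numbers them 0 to n-1, which is why your result has the column headers 0 and 1. The rows are still aligned on their index labels, so a/b and c/d do not line up.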
You can try
out = pd.concat( [df1.reset_index(drop=True), df2.reset_index(drop=True)] , axis = 1)
print(out)
First Second
0 1 2
1 3 4
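If this comes up often, the reset-and-concat approach can be wrapped in a small helper. This is only a sketch: combine_by_position is a hypothetical name, and the length check is an added assumption (both frames must have the same number of rows for a positional merge to make sense).

```python
import pandas as pd

def combine_by_position(left, right):
    """Place two frames side by side, matching rows by position only."""
    if len(left) != len(right):
        raise ValueError("frames must have the same number of rows")
    # Dropping both indexes leaves two 0..n-1 indexes, so the rows
    # line up in exactly the order they were supplied
    return pd.concat(
        [left.reset_index(drop=True), right.reset_index(drop=True)],
        axis=1,
    )

df1 = pd.DataFrame({"First": [1, 3]}, index=["a", "b"])
df2 = pd.DataFrame({"Second": [2, 4]}, index=["c", "d"])
out = combine_by_position(df1, df2)
print(out)
```

The original row labels are discarded, which matches the requirement that only the order matters.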
I have a large dataframe where I need to add an empty row after any instance where colA contains a colon.
To be honest I have absolutely no clue how to do this. My guess is that a function or a for loop needs to be written, but I have had no luck...
I think you are looking for this.
Suppose you have a DataFrame like this:
df = pd.DataFrame({"cola": ["a", "b", ":", "c", "d", ":", "e"]})
# wherever cola contains ":", append a new empty row after it
idx = [0] + (df[df.cola.str.contains(':', regex=False)].index + 1).tolist()
df1 = pd.DataFrame()
for i in range(len(idx) - 1):
    df1 = pd.concat([df1, df.iloc[idx[i]: idx[i+1]]], ignore_index=True)
    df1.loc[len(df1)] = ""
df1 = pd.concat([df1, df.iloc[idx[-1]:]], ignore_index=True)
print(df1)
# df1 is your result DataFrame; this also handles the case where a colon is in the last row
Resultant dataframe
cola
0 a
1 b
2 :
3
4 c
5 d
6 :
7
8 e
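For a large DataFrame, a loop-free variant may be worth trying. The sketch below is an alternative technique, not the approach above: it gives each blank row a fractional index label (position + 0.5) so that sorting the index drops it right after its colon row. It assumes the DataFrame has a default 0..n-1 integer index.

```python
import pandas as pd

df = pd.DataFrame({"cola": ["a", "b", ":", "c", "d", ":", "e"]})

# Boolean mask of the rows whose value contains a colon
mask = df["cola"].str.contains(":", regex=False).to_numpy()

# One blank row per match, labelled halfway between the match and the next row
blanks = pd.DataFrame({"cola": ""}, index=df.index[mask] + 0.5)

# Sorting by index slots each blank row in after its colon row
out = pd.concat([df, blanks]).sort_index().reset_index(drop=True)
print(out)
```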
I have a pandas DataFrame where the columns are named like:
0,1,2,3,4,.....,n
I would like to drop every 3rd column so that I get a new DataFrame with columns like:
0,1,3,4,6,7,9,.....,n
I have tried this:
shape = df.shape[1]
for i in range(2,shape,3):
    df = df.drop(df.columns[i], axis=1)
But I get an error saying the index is out of bounds, and I assume this happens because the shape of the DataFrame changes when I drop the columns. If I just don't store the output inside the "for" loop, the code works but I don't get my new DataFrame.
How do I solve this?
Thanks
The issue with the code is that each time you drop a column in your loop, you end up with a different set of columns, because you overwrite df after each iteration. When you then try to drop the next 3rd column of THAT new set of columns, you not only drop the wrong one, you eventually run out of columns. That's why you get the error.
iter1 -> 0,1,3,4,5,6,7,8,9,10 ... n #first you drop 2 which is 3rd col
iter2 -> 0,1,3,4,5,7,8,9,10 ... n #next you drop 6 which is 6th col (should be 5)
iter3 -> 0,1,3,4,5,7,8,9, ... n #next you drop 10 which is 9th col (should be 8)
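The trace above can be reproduced without pandas. This is a minimal sketch that stands in a plain list for the column labels, assuming 10 columns named 0 to 9:

```python
# A plain list stands in for df.columns; 10 columns named 0..9 is an assumption
cols = list(range(10))
dropped = []
failed_at = None

try:
    # range() is evaluated once, so the positions stay 2, 5, 8
    # even though the list shrinks on every pass
    for i in range(2, len(cols), 3):
        dropped.append(cols.pop(i))
except IndexError:
    failed_at = i  # the position that no longer exists

print("dropped:", dropped)
print("failed at position:", failed_at)
```

The second pass removes label 6 instead of 5, and the third pass asks for position 8 in a list that has only 8 items left.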
What you want to do is calculate the indexes beforehand and then remove them in one go.
You can get the positions of the columns you want to remove with range and then drop them all at once.
drop_idx = list(range(2,df.shape[1],3)) #Indexes to drop
df2 = df.drop(drop_idx, axis=1) #Drop them at once over axis=1
print('old columns->', list(df.columns))
print('idx to drop->', drop_idx)
print('new columns->',list(df2.columns))
old columns-> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
idx to drop-> [2, 5, 8]
new columns-> [0, 1, 3, 4, 6, 7, 9]
Note: This works only because your column names are the same as their positions. If your column names are different, you need an extra step that looks up the column names for the positions you want to drop.
drop_idx = list(range(2,df.shape[1],3))
drop_cols = [j for i,j in enumerate(df.columns) if i in drop_idx] #<--
df2 = df.drop(drop_cols, axis=1)
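The lookup step can also be written as a slice of df.columns. A short sketch, assuming unique column names (the letters used here are hypothetical sample names):

```python
import pandas as pd

# Hypothetical frame with ten string-named columns and one row
df = pd.DataFrame([list(range(10))], columns=list("abcdefghij"))

# df.columns[2::3] picks the labels at positions 2, 5, 8, ...
trimmed = df.drop(columns=df.columns[2::3])
print(list(trimmed.columns))
```

Because the labels are computed before anything is dropped, the positions never shift.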
Here is a solution with inverted logic: select all the columns except every 3rd one.
Build a helper array of column positions, add 1, keep the positions where the value modulo 3 is not 0, and pass the resulting boolean mask to DataFrame.loc:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
df = df.loc[:, (np.arange(len(df.columns)) + 1) % 3 != 0]
print (df)
A B D E
0 a 4 1 5
1 b 5 3 3
2 c 4 5 6
3 d 5 7 9
4 e 5 1 2
5 f 4 0 4
You can use list comprehension to filter columns:
df = df[[k for k in df.columns if (k + 1) % 3 != 0]]
If the names are different (e.g. strings) and you want to discard every 3rd column regardless of its name, then:
df = df[[k for i, k in enumerate(df.columns, 1) if i % 3 != 0]]
I have a pandas DataFrame that I want to add rows to. The DataFrame looks like this:
col1 col2
a 1 5
b 2 6
c 3 7
I want to add rows to the dataframe, but only if they are unique. The problem is that some new rows might have the same index, but different values in the columns. If this is the case, I somehow need to know.
Some example rows to be added and the desired result:
row 1:
col1 col2
a 1 5
desired row 1 result: Not added - it is already in the dataframe
row 2:
col1 col2
a 9 9
desired row 2 result: something like,
print('non-unique entries for index a')
row 3:
col1 col2
d 4 4
desired row 3 result: just add the row to the dataframe.
Try this:
# existing dataframe == df
# new rows == df_newrows
# split the new rows in two: those whose index already exists in df, and those whose index is new
repeated = df_newrows.index.isin(df.index)
df_newrows_usable = df_newrows.loc[~repeated]
df_newrows_discarded = df_newrows.loc[repeated]
print('repeated indexes:', df_newrows_discarded)
# concat df and the new rows without repeated indexes
new_df = pd.concat([df, df_newrows_usable], axis=0)
print('new dataframe:', new_df)
The easy option would be to merge all the rows and then keep the unique ones via the DataFrame method drop_duplicates.
However, this option doesn't report a warning or error when a duplicate row is appended.
drop_duplicates doesn't consider the index, so the index must be reset before dropping the duplicates and set back afterwards:
import pandas as pd
# set up data frame
df = pd.DataFrame({'col1': [1, 2, 3], 'col2':[5, 6, 7]}, index=['a', 'b', 'c'])
# set up row to be appended
row = pd.DataFrame({'col1':[3], 'col2': [7]}, index=['c'])
# append row (don't care if it's a duplicate); DataFrame.append was removed in pandas 2.0
df = pd.concat([df, row])
# drop duplicates
df2 = df.reset_index()
df2 = df2.drop_duplicates()
df2 = df2.set_index('index')
If the warning message is an absolute requirement, we can write a function that checks whether a row is a duplicate via a merge operation and appends the row only if it is unique.
def append_unique(df, row):
    d = df.reset_index()
    r = row.reset_index()
    # an empty inner merge on all columns means no identical row exists yet
    if d.merge(r, on=list(d.columns), how='inner').empty:
        d2 = pd.concat([d, r])
        d2 = d2.set_index('index')
        return d2
    print('non-unique entries for index', row.index[0])
    return df
df2 = append_unique(df2, row)
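Neither answer quite separates the three cases in the question (exact duplicate, same index with different values, new index). Here is a rough sketch of one way to do that. add_row and its status strings are hypothetical names, and it assumes the existing index is unique.

```python
import pandas as pd

def add_row(df, label, values):
    """Return (frame, status); status is 'duplicate', 'conflict' or 'added'."""
    if label in df.index:
        # Same label already present: compare the stored values
        if df.loc[label].to_dict() == values:
            return df, "duplicate"
        print("non-unique entries for index", label)
        return df, "conflict"
    # New label: append it as a one-row frame
    new = pd.DataFrame([values], index=[label])
    return pd.concat([df, new]), "added"

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [5, 6, 7]}, index=["a", "b", "c"])
df, s1 = add_row(df, "a", {"col1": 1, "col2": 5})
df, s2 = add_row(df, "a", {"col1": 9, "col2": 9})
df, s3 = add_row(df, "d", {"col1": 4, "col2": 4})
print(s1, s2, s3)
print(df)
```

Returning a status alongside the frame lets the caller decide what to do about conflicts instead of only printing a message.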
I'm iterating over a DataFrame, evaluating each row and then sticking it into another DataFrame by using the concat() method. However, the receiving DataFrame is still empty.
import pandas as pd
empty = pd.DataFrame(columns=('col1', 'col2'))
d = {'col1': pd.Series([1, 2, 3]),
     'col2': pd.Series([3, 4, 5])}
some_data = pd.DataFrame(d)
print(empty)
print(some_data)
print('concat should happen below')
for index, row in some_data.iterrows():
    pd.concat([empty, pd.DataFrame(row)])
print(empty)  # should contain 3 rows of data
OUTPUT:
Empty DataFrame
Columns: [col1, col2]
Index: []
col1 col2
0 1 3
1 2 4
2 3 5
concat should happen below
Empty DataFrame
Columns: [col1, col2]
Index: []
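You need to rebind empty if you want it to store the values, because pd.concat returns a new DataFrame and leaves its inputs unchanged: empty = pd.concat([empty, row.to_frame().T]) (row.to_frame().T turns the row Series back into a one-row DataFrame; DataFrame(row) would produce a single column instead).
You can also concatenate the whole DataFrames at once; try print(pd.concat([empty, some_data])).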
If you want to filter the rows you can try this:
def f(r):
    # check the row here
    return True  # return True to include the row, False to exclude it
print(some_data.groupby(some_data.index).filter(f))
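If the loop is needed anyway, a common pattern is to collect the rows you want in a list and build the DataFrame once at the end, rather than concatenating inside the loop. A minimal sketch; the condition on col1 is a hypothetical filter, not something from the question:

```python
import pandas as pd

some_data = pd.DataFrame({"col1": [1, 2, 3], "col2": [3, 4, 5]})

kept = []
for index, row in some_data.iterrows():
    if row["col1"] >= 2:  # hypothetical filter condition
        kept.append(row)

# Each collected Series becomes one row; its name becomes the row label
result = pd.DataFrame(kept, columns=some_data.columns)
print(result)
```

Building the frame once avoids copying the accumulated data on every iteration.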