I have a dataframe (df) which looks like:
0 1 2 3
0 BBG.apples.S BBG.XNGS.bananas.S 0
1 BBG.apples.S BBG.XNGS.oranges.S 0
2 BBG.apples.S BBG.XNGS.pairs.S 0
3 BBG.apples.S BBG.XNGS.mango.S 0
4 BBG.apples.S BBG.XNYS.mango.S 0
5 BBG.XNGS.bananas.S BBG.XNGS.oranges.S 0
6 BBG.XNGS.bananas.S BBG.XNGS.pairs.S 0
7 BBG.XNGS.bananas.S BBG.XNGS.kiwi.S 0
8 BBG.XNGS.oranges.S BBG.XNGS.pairs.S 0
9 BBG.XNGS.oranges.S BBG.XNGS.kiwi.S 0
10 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
11 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
12 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
13 BBG.XNGS.peaches.S BBG.XNGS.kiwi.S 0
I am trying to update a value (first row, third column) in the dataframe using:
for index, row in df.iterrows():
status = row[3]
if int(status) == 0:
df[index]['3'] = 1
but when I print the dataframe out it remains unchanged.
What am I doing wrong?
Replace your last line by:
df.at[index,'3'] = 1
Obviously as mentioned by others you're better off using a vectorized expression instead of iterating, especially for large dataframes.
You can't modify a data frame by iterating like that. See here.
If you only want to modify the element at [1, 3], you can access it directly:
df[1, 3] = 1
If you're trying to turn every 0 in column 3 to a 1, try this:
df[df['3'] == 0] = 1
EDIT: In addition, the docs for iterrows say that you'll often get a copy back, which is why the operation fails.
If you are trying to update the third column for all rows based on the row having a certain value, as shown in your example code, then it would be much easier use the where method on the dataframe:
df.loc[:,'3'] = df['3'].where(df['3']!=0, 1)
Try to update the row using .loc or .iloc (depend on your needs).
For example, in this case:
if int(status) == 0:
df.iloc[index]['3']='1'
Related
I have a pandas dataframe as shown below:
Pandas Dataframe
I want to drop the rows that has only one non zero value. What's the most efficient way to do this?
Try boolean indexing
# sample data
df = pd.DataFrame(np.zeros((10, 10)), columns=list('abcdefghij'))
df.iloc[2:5, 3] = 1
df.iloc[4:5, 4] = 1
# boolean indexing based on condition
df[df.ne(0).sum(axis=1).ne(1)]
Only rows 2 and 3 are removed because row 4 has two non-zero values and every other row has zero non-zero values. So we drop rows 2 and 3.
df.ne(0).sum(axis=1)
0 0
1 0
2 1
3 1
4 2
5 0
6 0
7 0
8 0
9 0
Not sure if this is the most efficient but I'll try:
df[[col for col in df.columns if (df[col] != 0).sum() == 1]]
2 loops per column here: 1 for checking if != 0 and one more to sum the boolean values up (could break earlier if the second value is found).
Otherwise, you can define a custom function to check without looping twice per column:
def check(column):
already_has_one = False
for value in column:
if value != 0:
if already_has_one:
return False
already_has_one = True
return already_has_one
then:
df[[col for col in df.columns if check(df[col])]]
Which is much faster than the first.
Or like this:
df[(df.applymap(lambda x: bool(x)).sum(1) > 1).values]
Let's say that I have a simple Dataframe.
data1 = [12,34,465,678,896]
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 465
3 678
4 896
I want to delete all the data except the last value of the column that I want to save in the first row. It can be an column with thousands of rows. So I would like the result :
Data
0 896
1
2
3
4
What are the simplest functions to do that efficiently ?
Thank you
You an use iloc where 0 is the first row of the data column, -1 is the last row and 1: is every row except the first row:
df1['Data'].iloc[0] = df1['Data'].iloc[-1]
df1['Data'].iloc[1:] = ''
df1
Out[1]:
Data
0 896
1
2
3
4
Use the loc accessor. Utilise the python x,y=a,b to assign the values
df1.loc[0,'Data'],df1.loc[1::,'Data']=df1['Data'].values[-1],''
Data
0 896
1
2
3
4
You can use .reverse() method of python lists, something like this:
my_data = df1['Data'].to_list() # Get list from Serie
my_data.reverse() # Reverse order.
my_data[1:] = [""]*len(my_data[1:]) # Fill with spaces from the second item.
df1['Data'] = my_data
Output:
Data
0 896
1
2
3
4
I would like to transform a table which looks similiar to this below:
X|Y|Z|
1|2|3|
3|5|2|
4|2|1|
The result, I want to achive, should look like that:
col|1|2|3|4|5|
X |1|0|1|0|0|
Y |0|2|0|0|1|
Z |1|1|1|0|0|
So, after transformation the new columns should be unique values from previous table, the new values should be populated with count/appearance, and in the index should be the old column names.
I got stuck and i do not know hot to handle with cause I am a newbe in python, so thanks in advance for support.
Regards,
guddy_7
Use apply with value_counts, replace missing values to 0 and transpose by T:
df = df.apply(pd.value_counts).fillna(0).astype(int).T
print (df)
1 2 3 4 5
X 1 0 1 1 0
Y 0 2 0 0 1
Z 1 1 1 0 0
I have a dataset showing below.
What I would like to do is three things.
Step 1: AA to CC is an index, however, happy to keep in the dataset for the future purpose.
Step 2: Count 0 value to each row.
Step 3: If 0 is more than 20% in the row, which means more than 2 in this case because DD to MM is 10 columns, remove the row.
So I did a stupid way to achieve above three steps.
df = pd.read_csv("dataset.csv", header=None)
df_bool = (df == "0")
print(df_bool.sum(axis=1))
then I got an expected result showing below.
0 0
1 0
2 1
3 0
4 1
5 8
6 1
7 0
So removed the row #5 as I indicated below.
df2 = df.drop([5], axis=0)
print(df2)
This works well even this is not an elegant, kind of a stupid way to go though.
However, if I import my dataset as header=0, then this approach did not work at all.
df = pd.read_csv("dataset.csv", header=0)
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
How come this happens?
Also, if I would like to write a code with loop, count and drop functions, what does the code look like?
You can just continue using boolean_indexing:
First we calculate number of columns and number of zeroes per row:
n_columns = len(df.columns) # or df.shape[1]
zeroes = (df == "0").sum(axis=1)
We then select only rows that have less than 20 % zeroes.
proportion_zeroes = zeroes / n_columns
max_20 = proportion_zeroes < 0.20
df[max_20] # This will contain only rows that have less than 20 % zeroes
One liner:
df[((df == "0").sum(axis=1) / len(df.columns)) < 0.2]
It would have been great if you could have posted how the dataframe looks in pandas rather than a picture of an excel file. However, constructing a dummy df
df = pd.DataFrame({'index1':['a','b','c'],'index2':['b','g','f'],'index3':['w','q','z']
,'Col1':[0,1,0],'Col2':[1,1,0],'Col3':[1,1,1],'Col4':[2,2,0]})
Step1, assigning the index can be done using the .set_index() method as per below
df.set_index(['index1','index2','index3'],inplace=True)
instead of doing everything manually when it comes fo filtering out, you can use the return you got from df_bool.sum(axis=1) in the filtering of the dataframe as per below
df.loc[(df==0).sum(axis=1) / (df.shape[1])>0.6]
index1 index2 index3 Col1 Col2 Col3 Col4
c f z 0 0 1 0
and using that you can drop those rows, assuming 20% then you would use
df = df.loc[(df==0).sum(axis=1) / (df.shape[1])<0.2]
Ween it comes to the header issue it's a bit difficult to answer without seeing the what the file or dataframe looks like
I'm working on some code that generates features from a dataframe and adds these features as columns to the dataframe.
The trouble is I'm working with a time series so that for any given tuple, I need (let's say) 5 of the previous tuples to generate the corresponding feature for that tuple.
lookback_period = 5
df['feature1'] = np.zeros(len(df)) # preallocate
for index, row in df.iterrows():
if index < lookback_period:
continue
slice = df[index - lookback_period:index]
some_int = SomeFxn(slice)
row['feature1'] = some_int
Is there a way to execute this code without explicitly looping through every row and then slicing?
One way is to create several lagged columns using df['column_name'].shift() such that all the necessary information is contained in each row, but this quickly gets intractable for my computer's memory since the dataset is large (millions of rows).
I don't have enough reputation to comment so will just post it here.
Can't you use apply for your dataframe e.g.
df['feature1'] = df.apply(someRowFunction, axis=1)
where someRowFunction will accept the full row and you can perform whatever row based slice and logic you want to do.
--- updated ---
As we do not have much information about the dataframe and the required/expected output, I just based the answer on the information from the comments
Let's define a function that will take a DataFrame slice (based on current row index and lookback) and the row and will return sum of the first column of the slice and value of the current row.
def someRowFunction (slice, row):
if slice.shape[0] == 0:
return 0
return slice[slice.columns[0]].sum() + row.b
d={'a':[1,2,3,4,5,6,7,8,9,0],'b':[0,9,8,7,6,5,4,3,2,1]}
df=pd.DataFrame(data=d)
lookback = 5
df['c'] = df.apply(lambda current_row: someRowFunction(df[current_row.name -lookback:current_row.name],current_row),axis=1)
we can get row index from apply using its name attribute and as such we can retrieve the required slice. Above will result to the following
print(df)
a b c
0 1 0 0
1 2 9 0
2 3 8 0
3 4 7 0
4 5 6 0
5 6 5 20
6 7 4 24
7 8 3 28
8 9 2 32
9 0 1 36