Modifying data while using iterrows() does not work - python

I'm using iterrows() to work my way through a DataFrame. Using a for loop and nested if statements, I'm able to identify the cells I want to change.
I used a print statement to verify I'm able to change the data, but when I print out the DataFrame the information is unchanged. I was able to do this on a smaller DataFrame. Any ideas?
Originally, this was my code that worked:
data.loc[(data.ID.isin([10,45])) & (data.source.notnull()), 'ID'] = 50
But I need to add this:
data.loc[(data.ID.isin([23,45])) & (data.source.notnull()), 'ID'] = 60
This worked for me as a test
The DataFrame did change with this logic:
import pandas as pd

data = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                     'num_wings': [10, 23, 32, 45],
                     'num_specimen_seen': [10, 2, 1, 8]},
                    index=['falcon', 'dog', 'spider', 'fish'])
for x, y in data.iterrows():
    if y['num_wings'] in [10, 45]:
        y['num_wings'] = 50
        print(x, y)
This is basically what I'm trying to do:
I can change the data using this logic, but it doesn't seem to change the actual DataFrame:
import pandas as pd
...
...
for x, y in data.iterrows():
    if y['ID'] in [10, 45]:
        if y['source'] == 0:
            if y['username'] == 'bill':
                y['ID'] = 50
                print(x, y)  # printed the results to confirm it worked, and it did;
                             # however, the DataFrame is unchanged
I feel confident that I can make the changes I want, but I need to apply them to the DataFrame.

To clarify, you're trying to conditionally update the value of the num_wings column? If so, here you go. You need to use the .loc method to update values in a dataframe.
import pandas as pd

data = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                     'num_wings': [10, 23, 32, 45],
                     'num_specimen_seen': [10, 2, 1, 8]},
                    index=['falcon', 'dog', 'spider', 'fish'])
data.loc[data['num_wings'].isin([10, 45]), 'num_wings'] = 50
data
        num_legs  num_specimen_seen  num_wings
falcon         2                 10         50
dog            4                  2         23
spider         8                  1         32
fish           0                  8         50

The code doesn't work because (source):
Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
To write to the DataFrame itself, you can try .at, i.e.,
for x, y in data.iterrows():
    if y['num_wings'] in [10, 45]:
        data.at[x, 'num_wings'] = 50
In general, modifying something while you're iterating over it is not recommended, but I think it should be OK in your case.
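For the nested conditions in the question, a vectorized mask avoids iterating at all. Here is a minimal sketch assuming the ID, source, and username columns from the question (the sample values are made up):
import pandas as pd

# Hypothetical data matching the columns named in the question
data = pd.DataFrame({'ID': [10, 45, 23, 10],
                     'source': [0, 0, 1, 0],
                     'username': ['bill', 'anne', 'bill', 'bill']})

# All three if-conditions combined into one boolean mask
mask = (data['ID'].isin([10, 45])
        & (data['source'] == 0)
        & (data['username'] == 'bill'))
data.loc[mask, 'ID'] = 50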

Related

Using pandas in Python, how do I make a groupby of a subset of a groupby not return filtered-off values?

So I'm using Plotly to plot some grouped data. The problem is that after I updated pandas, my code stopped working. I managed to isolate what was going wrong. It turns out that inside px.bar, there is a get_group that returns groups which I filtered out. Why is that? How could I resolve this?
# Code outside px.bar
old_df2 = pd.DataFrame({'name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
                        'id1': [18, 22, 19, 14, 14, 11, 20, 28],
                        'id2': [5, 7, 7, 9, 12, 9, 9, 4],
                        'id3': [11, 8, 10, 6, 6, 7, 9, 12]})
new_df = old_df2.groupby([pd.Categorical(old_df2.name), 'id2'])['id3'].count().fillna(0)
# Transform the count from a Series to a DataFrame
new_df = new_df.to_frame()
# Move the group labels from the index into columns
new_df.reset_index(inplace=True)
new_df = new_df[new_df["level_0"].isin(["A", "B"])]
# Take this bit as an example of what happens inside px.bar
new_df.groupby("level_0").count()
# Result
#          id2  id3
# level_0
# A          5    5
# B          5    5
# C          0    0
# Desired result
#          id2  id3
# level_0
# A          5    5
# B          5    5
Me again. I solved the problem by unlinking the objects (I don't know if this is a thing, but apparently it is). For this, I created a dict from the DataFrame and a DataFrame from the dict. It's not pretty, but it works. I transformed new_df into a dict with
new_df_list = new_df.to_dict("records")
unlinked_df = pd.DataFrame(new_df_list)
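For what it's worth, the round-trip most likely works because to_dict discards the categorical dtype that the pd.Categorical grouper left on level_0, so the filtered-out 'C' no longer exists as a category. A lighter sketch, assuming level_0 is indeed still categorical after reset_index:
# Drop the unused 'C' category instead of round-tripping through a dict
new_df["level_0"] = new_df["level_0"].cat.remove_unused_categories()
new_df.groupby("level_0").count()
Alternatively, new_df.groupby("level_0", observed=True).count() only returns the categories that actually appear in the data.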

KeyError: when making new column in pandas

I am working on a dataset, and when I try to create a new column after finding the difference I get KeyError: 'filtered'.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'col2': [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]}
df = pd.DataFrame(data=d)
fig, ax = plt.subplots(2, figsize=(8,8))
df['col2'].diff().plot(ax=ax[0])
cutoff = 3
df['filtered'] = df.loc[df['col2'].diff().abs() > cutoff]
df.plot(ax=ax[1])
I used to create new columns like this (df['filtered'] = some operation), but it gives KeyError: 'filtered' in this situation. Thank you for the help.
You need to replace the second-to-last line with:
df['filtered'] = df.loc[df['col2'].diff().abs() > cutoff, 'col2']
assuming that you want to get a filtered version of 'col2'. As @RafaelC mentioned, the .loc[] operation you currently have returns all the columns (2 in this case) for which the row filter applies, hence the error.
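For reference, a minimal runnable sketch of the fix in the context of the question's code (plotting omitted):
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'col2': [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]}
df = pd.DataFrame(data=d)
cutoff = 3

# Selecting only 'col2' on the right-hand side assigns a Series, not a
# two-column DataFrame; rows that fail the filter become NaN
df['filtered'] = df.loc[df['col2'].diff().abs() > cutoff, 'col2']
print(df)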

How to keep the index when using pd.melt and merge to create a DataFrame for Seaborn and matplotlib

I am trying to draw subplots using two identical DataFrames (predicted and observed) with the exact same structure ... the first column is the index.
The code below creates a new index when they are concatenated using pd.melt and merge.
As you can see in the figure, the index of the orange line is changed from 1-5 to 6-10.
I was wondering if someone could fix the code below to keep the same index for the orange line:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

actual = pd.DataFrame({'a': [5, 8, 9, 6, 7, 2],
                       'b': [89, 22, 44, 6, 44, 1]})
predicted = pd.DataFrame({'a': [7, 2, 13, 18, 20, 2],
                          'b': [9, 20, 4, 16, 40, 11]})
# Creating a tidy dataframe to input to seaborn
merged = pd.concat([pd.melt(actual), pd.melt(predicted)]).reset_index()
merged['category'] = ''
merged.loc[:len(actual)*2, 'category'] = 'actual'
merged.loc[len(actual)*2:, 'category'] = 'predicted'
g = sns.FacetGrid(merged, col="category", hue="variable")
g.map(plt.plot, "index", "value", alpha=.7)
g.add_legend();
The orange line ('variable' == 'b') doesn't have an index of 0-5 because of how you used melt. If you look at pd.melt(actual), the index doesn't match what you are expecting, IIUC.
Here is how I would rearrange the dataframe:
merged = pd.concat([actual, predicted], keys=['actual', 'predicted'])
merged.index.names = ['category', 'index']
merged = merged.reset_index()
merged = pd.melt(merged, id_vars=['category', 'index'], value_vars=['a', 'b'])
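With this tidy shape, the question's original plotting code should work unchanged, since each category now carries its own 0-5 index:
g = sns.FacetGrid(merged, col="category", hue="variable")
g.map(plt.plot, "index", "value", alpha=.7)
g.add_legend();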
Set the ignore_index parameter to False to preserve the index, e.g.
df = df.melt(var_name='species', value_name='height', ignore_index=False)
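As a quick sketch applied to the question's actual frame (ignore_index was added in pandas 1.1):
import pandas as pd

actual = pd.DataFrame({'a': [5, 8, 9, 6, 7, 2],
                       'b': [89, 22, 44, 6, 44, 1]})
# Each melted row keeps its original 0-5 row label instead of a fresh RangeIndex
melted = actual.melt(ignore_index=False).reset_index()
print(melted.head())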

Check which rows of one pandas DataFrame exist in another

I have two pandas DataFrames of different sizes (at least 500,000 rows in both of them). For simplicity, you can call them df1 and df2. I'm interested in finding the rows of df1 which are not present in df2. It is not necessary that either data frame be a subset of the other. Also, the order of the rows does not matter.
For example, the ith observation in df1 may be the jth observation in df2, and I need to consider it as being present (order won't matter). Another important thing is that both data frames may contain null values (so the operation has to work for those as well).
A simple example of both data frame would be
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, NaN, 50]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, NaN, 13, 14, 50]})
in this case the solution would be
df3 = pandas.DataFrame(data = {'col1' : [1, 2 ], 'col2' : [10, 11]})
Please note that in reality, both data frames have 15 columns (exactly the same column names, exactly the same data types). Also, I'm using Python 2.7 in a Jupyter Notebook on Windows 7. I have used the pandas built-in function df1.isin(df2), but it does not provide the accurate results that I want.
Moreover, I have also seen this question, but it assumes that one data frame is a subset of the other, which is not necessarily true in my case.
Here's one way:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(data={'col1': [1, 2, 3, 100], 'col2': [10, 11, np.nan, 50]})
df2 = pd.DataFrame(data={'col1': [1, 2, 3, 4, 5, 100], 'col2': [20, 21, np.nan, 13, 14, 50]})

# Replace NaN with a sentinel so rows compare equal, then take the set difference
x = set(map(tuple, df1.fillna(-1).values)) - set(map(tuple, df2.fillna(-1).values))
# {(1.0, 10.0), (2.0, 11.0)}
pd.DataFrame(list(x), columns=['col1', 'col2'])
If you have np.nan data in your result, it will come through as -1, but you can easily convert it back. This assumes you won't have negative numbers in your underlying data (if you do, replace -1 with some value that cannot occur).
The reason for the complication is that np.nan == np.nan evaluates to False.
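A sketch of an alternative that avoids the sentinel, using a left merge with indicator=True; pandas merge treats NaN join keys as equal. This assumes neither frame contains duplicate rows, since duplicates would multiply matches:
# Merge on all common columns; '_merge' marks rows found only in df1
df3 = (df1.merge(df2, how='left', indicator=True)
          .query("_merge == 'left_only'")
          .drop(columns='_merge'))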
Here is one solution: concatenate df1 with the matching rows of df2 under separate keys, drop every row that appears in both frames, then keep only what came from df1:
pd.concat([df1, df2.loc[df2.col1.isin(df1.col1)]], keys=[1, 2]).drop_duplicates(keep=False).loc[1]
Out[892]:
   col1  col2
0     1  10.0
1     2  11.0

Drop all data in a pandas dataframe

I would like to drop all data in a pandas dataframe, but am getting TypeError: drop() takes at least 2 arguments (3 given). I essentially want a blank dataframe with just my column headers.
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df.drop(axis=0, inplace=True)
print df
You need to pass the labels to be dropped.
df.drop(df.index, inplace=True)
By default, it operates on axis=0.
You can achieve the same with
df.iloc[0:0]
which is much more efficient.
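A minimal sketch of the .iloc approach; it keeps the column names and their dtypes intact:
import pandas as pd

df = pd.DataFrame({'Day': [1, 2, 3], 'Visitors': [43, 43, 34]})
df = df.iloc[0:0]           # all rows removed, columns and dtypes intact
print(df.columns.tolist())  # ['Day', 'Visitors']
print(len(df))              # 0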
My favorite:
df = df.iloc[0:0]
But be aware that df.index.max() will then be nan. To add items I use:
import math

df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = data
My favorite way is:
df = df[0:0]
Overwrite the dataframe with something like this:
import pandas as pd
df = pd.DataFrame(None)
or if you want to keep columns in place
df = pd.DataFrame(columns=df.columns)
If your goal is to drop the whole dataframe, you need to pass all the columns. For me, the cleanest way is to pass a list comprehension to the columns kwarg; this works regardless of which columns a df has.
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df = df.drop(columns=[i for i in df.columns])
This code makes a clean DataFrame:
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# clean
df = pd.DataFrame()
