I'm using Python 3.6 and Pandas 0.20.3.
I'm sure this must be addressed somewhere, but I can't seem to find it. I alter a dataframe inside a function by adding columns; then I restore the dataframe to the original columns. I don't return the dataframe. The added columns stay.
I could understand it if neither change persisted - if columns added inside the function were not permanent AND reassigning the dataframe did not work. I'd also understand it if both persisted - if adding columns altered the dataframe AND the reassignment also stuck.
Here is the code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 5))
df
which gives
0 1 2 3 4
0 0.406779 -0.481733 -1.187696 -0.210456 -0.608194
1 0.732978 -0.079787 -0.051720 1.097441 0.089850
2 1.859737 -1.422845 -1.148805 0.254504 1.207134
3 0.074400 -1.352875 -1.341630 -1.371050 0.005505
4 -0.102024 -0.905506 -0.165681 2.424180 0.761963
5 0.400507 -0.069214 0.228971 -0.079805 -1.059972
6 1.284812 0.843705 -0.885566 1.087703 -1.006714
7 0.135243 0.055807 -1.217794 0.018104 -1.571214
8 -0.524320 -0.201561 1.535369 -0.840925 0.215584
9 -0.495721 0.284237 0.235668 -1.412262 -0.002418
Now, I create a function:
def mess_around(df):
    cols = df.columns
    df['extra'] = 'hi'
    df = df[cols]
then run it and display the dataframe:
mess_around(df)
df
which gives:
0 1 2 3 4 extra
0 0.406779 -0.481733 -1.187696 -0.210456 -0.608194 hi
1 0.732978 -0.079787 -0.051720 1.097441 0.089850 hi
2 1.859737 -1.422845 -1.148805 0.254504 1.207134 hi
3 0.074400 -1.352875 -1.341630 -1.371050 0.005505 hi
4 -0.102024 -0.905506 -0.165681 2.424180 0.761963 hi
5 0.400507 -0.069214 0.228971 -0.079805 -1.059972 hi
6 1.284812 0.843705 -0.885566 1.087703 -1.006714 hi
7 0.135243 0.055807 -1.217794 0.018104 -1.571214 hi
8 -0.524320 -0.201561 1.535369 -0.840925 0.215584 hi
9 -0.495721 0.284237 0.235668 -1.412262 -0.002418 hi
I know I can solve the problem by returning the dataframe from the function, so I can fix it. I want to understand where I am going wrong. I suspect that the variable df inside the function is local in scope; it is given a reference, but rebinding it does not affect the caller because of that scope. Yet the column assignment uses the reference that was passed in and therefore impacts the dataframe "directly". Is that correct?
EDIT:
For those that might want to address the dataframe in place, I've added:
for c in df.columns:
    if c not in cols:
        del df[c]
I'm guessing that if I return a new dataframe instead, there will be a potentially large temporary dataframe that has to be dealt with by garbage collection.
To understand what happens, you should know the difference between passing arguments to functions by value versus passing them by reference:
How do I pass a variable by reference?
You pass the variable df to your function mess_around. The function modifies the original dataframe in place by adding a column.
This subsequent line of code seems to be the cause for confusion here:
df = df[cols]
What happens here is that the variable df originally held a reference to your dataframe. But, the reassignment causes the variable to point to a different object - your original dataframe is not changed.
Here's a simpler example:
def foo(l):
    l.insert(0, np.nan)  # original modified
    l = [4, 5, 6]        # reassignment - no change to the original,
                         # but the variable l points to something different
lst = [1, 2, 3]
foo(lst)
print(lst)
[nan, 1, 2, 3] # notice here that the insert modifies the original,
# but not the reassignment
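Coming back to the dataframe case, here is a minimal sketch of a version that mutates the passed-in object instead of rebinding the local name (the function name is made up, and drop with inplace=True is just one of several ways to do this):

```python
import numpy as np
import pandas as pd

def mess_around_inplace(df):
    cols = list(df.columns)          # snapshot of the original columns
    df['extra'] = 'hi'
    extra = [c for c in df.columns if c not in cols]
    # mutate the object itself; rebinding with df = df[cols] would not
    # be visible to the caller
    df.drop(extra, axis=1, inplace=True)

df = pd.DataFrame(np.random.randn(10, 5))
mess_around_inplace(df)
print(df.columns.tolist())   # [0, 1, 2, 3, 4] - 'extra' is gone
```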
Related
Let's say I have a dataframe A with a column called 'score'.
I can modify the 'score' value of the second row by doing:
tmp = A.loc[2]
tmp.score = some_new_value
A.loc[2] = tmp
But I can't do it like this:
A.loc[2].score = some_new_value
Why?
Your case will be hard to reproduce reliably, because when chained indexing is used, Pandas does not guarantee whether the operation returns a view or a copy of the dataframe.
When you access a "cell" of the dataframe by
A.loc[2].score
you are actually performing two steps: first .loc and then .score (which is essentially chained indexing). The Pandas documentation has a nice post about it here.
The simplest way to prevent this is to consistently use .loc or .iloc to access the rows/columns you need and assign the value in a single step. Therefore, I would recommend always using either
A.loc[2, "score"] = some_new_value
or
A.at[2, "score"] = some_new_value
This kind of indexing + setting will be translated "under the hood" to:
A.loc.__setitem__((2, 'score'), some_new_value) # modifies A directly
instead of an unreliable chain of __getitem__ and __setitem__.
Let's show an example:
import pandas as pd
dict_ = {'score': [1,2,3,4,5,6], 'other':'a'}
A = pd.DataFrame(dict_)
A
Dataframe:
score other
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 6 a
Now you can do the following, and the values are actually saved:
A.loc[2,'score'] = 'Heyyyy'
A
Dataframe:
score other
0 1 a
1 2 a
2 Heyyyy a
3 4 a
4 5 a
5 6 a
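To make the contrast concrete, here is a small sketch (same frame as above, fresh values) showing that the one-step forms modify A while a chained-style access lands on a temporary copy; treat it as illustrative rather than a guarantee for every dtype layout:

```python
import pandas as pd

A = pd.DataFrame({'score': [1, 2, 3, 4, 5, 6], 'other': 'a'})

# one-step indexing: a single __setitem__ on the original frame
A.loc[2, 'score'] = 99
A.at[3, 'score'] = 100        # .at is the faster scalar-only variant

# chained-style access: .loc[4] builds an intermediate Series first
tmp = A.loc[4]                # a new object, not a view into A
tmp['score'] = -1             # modifies only the temporary copy
print(A.loc[4, 'score'])      # still 5 - A was not modified
```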
I have a very simple for loop problem and I haven't found a solution in any of the similar questions on Stack. I want to use a for loop to create values in a pandas dataframe. I want the values to be strings that contain a numerical index. I can make the correct value print, but I can't make this value get saved in the dataframe. I'm new to python.
# reproducible example
import pandas as pd
df1 = pd.DataFrame({'x':range(5)})
# for loop to add a row with an index
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    df1['file'] = "data_{i}.txt".format(i=i)
This loop prints the exact value that I want to put into the 'file' column of df1, but when I look at df1, it only uses the last value for the index.
x file
0 0 data_4.txt
1 1 data_4.txt
2 2 data_4.txt
3 3 data_4.txt
4 4 data_4.txt
I have tried using enumerate, but can't find a solution with this. I assume everyone will yell at me for posting a duplicate question, but I have not found anything that works and if someone points me to a solution that solves this problem, I'll happily remove this question.
There are better ways to create a DataFrame, but to answer your question:
Replace the last line in your code:
df1['file'] = "data_{i}.txt".format(i=i)
with:
df1.loc[i, 'file'] = "data_{0}.txt".format(i)
For more information, read about the .loc here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
On the same page, you can read about accessors like .at and .iloc as well.
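Putting the fix into the full loop, a sketch of what the corrected code might look like:

```python
import pandas as pd

df1 = pd.DataFrame({'x': range(5)})

for i in range(5):
    # write each formatted string into row i of the 'file' column
    df1.loc[i, 'file'] = "data_{0}.txt".format(i)

print(df1['file'].tolist())
# ['data_0.txt', 'data_1.txt', 'data_2.txt', 'data_3.txt', 'data_4.txt']
```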
You can use a list comprehension:
df1['file'] = ["data_{i}.txt".format(i=i) for i in range(5)]
print(df1)
Prints:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
OR at the creation of the DataFrame:
df1 = pd.DataFrame({'x':range(5), 'file': ["data_{i}.txt".format(i=i) for i in range(5)]})
print(df1)
OR:
df1 = pd.DataFrame([{'x':i, 'file': "data_{i}.txt".format(i=i)} for i in range(5)])
print(df1)
I've found success with the .at method
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    df1.at[i, 'file'] = "data_{i}.txt".format(i=i)
Returns:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
When you assign a value to a dataframe column the way you do -
using df['colname'] = 'val' - it assigns that value across all rows.
That is why you are seeing only the last value.
Change your code to:
import pandas as pd
df1 = pd.DataFrame({'x': range(5)})
# build the list of values first
to_assign = []
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    to_assign.append("data_{i}.txt".format(i=i))
# outside of the loop - only once - assign to all dataframe rows
df1['file'] = to_assign
As a final thought, pandas has a great API for performing these types of actions without for loops.
You should start practicing with it.
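For instance, this particular column can be built without any loop at all; a sketch assuming the index in the filename should follow the 'x' column:

```python
import pandas as pd

df1 = pd.DataFrame({'x': range(5)})
# vectorized string concatenation over the whole column at once
df1['file'] = 'data_' + df1['x'].astype(str) + '.txt'
print(df1['file'].tolist())
# ['data_0.txt', 'data_1.txt', 'data_2.txt', 'data_3.txt', 'data_4.txt']
```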
I am trying to obtain a view of a pandas dataframe using the loc method but it is not working as expected when I am modifying the original DataFrame.
I want to extract a row/slice of a DataFrame using the loc method so that when a modification is done to the DataFrame, the slice reflects the change.
Let's have a look at this example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':np.arange(0,5,2), 'a':np.arange(3), 'b':np.arange(3)}).set_index('ID')
df
a b
ID
0 0 0
2 1 1
4 2 2
Now I create a slice using loc:
slice1 = df.loc[[2], :]
slice1
a b
ID
2 1 1
Then I modify the original DataFrame:
df.loc[2, 'b'] = 9
df
a b
ID
0 0 0
2 1 9
4 2 2
But unfortunately our slice does not reflect this modification, as I would expect for a view:
slice1
a b
ID
2 1 1
My expectation:
a b
ID
2 1 9
I found an ugly fix using a mix of iloc and loc but I hope there is a nicer way to obtain the result I am expecting.
Thank you for your help.
Disclaimer: This is not an answer.
I tried testing how over-writing values via chained assignment vs .loc behaves, referring to the pandas documentation link that was shared by @Quang Hoang above.
This is what I tried:
dfmi = pd.DataFrame([list('abcd'),
list('efgh'),
list('ijkl'),
list('mnop')],
columns=pd.MultiIndex.from_product([['one', 'two'],
['first', 'second']]))
df1 = dfmi['one']['second']
df2 = dfmi.loc[:, ('one', 'second')]
Output of both df1 and df2:
0 b
1 f
2 j
3 n
Iteration 1:
value = ['z', 'x', 'c', 'v']
dfmi['one']['second'] = value
Output df1:
0 z
1 x
2 c
3 v
Iteration 2:
value = ['z', 'x', 'c', 'v']
dfmi.loc[:, ('one', 'second')] = value
Output df2:
0 z
1 x
2 c
3 v
Assigning the new values changes the data in both cases.
The documentation says:
Quote 1: 'method 2 (.loc) is much preferred over method 1 (chained [])'
Quote 2:
'Outside of simple cases, it's very hard to predict whether "getitem" will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether "setitem" will modify dfmi or a temporary object that gets thrown out immediately afterward.'
I am not able to understand the explanation above. If the value in dfmi can change (in my case) but may not change (as in Benoit's case), then which way should I use to obtain the result? Not sure if I am missing a point here.
Looking for help
The reason the slice didn't reflect the changes you made in the original dataframe is because you created the slice first.
When you create a slice, you create a copy of a slice of the data. You're not directly linking the two.
The short answer is that you have two options:
1) change the original df first, then create the slice
2) don't slice; just do your operations on the original df using .loc or .iloc
The memory addresses of your dataframe and of the slice are different, so changes in the dataframe won't be reflected in the slice.
The answer is to change the value in the dataframe first and then slice it.
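One way to live with the copy semantics is to re-take the slice whenever you need it, rather than holding on to a stale copy; a sketch (row_view is a made-up helper name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.arange(0, 5, 2),
                   'a': np.arange(3),
                   'b': np.arange(3)}).set_index('ID')

def row_view(frame, label):
    # re-slice on demand instead of keeping a stale copy around
    return frame.loc[[label], :]

df.loc[2, 'b'] = 9
print(row_view(df, 2))   # reflects the change because it is re-taken
```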
I would like to check the value of the row above and see if it is the same as the current row. I found a great answer here: df['match'] = df.col1.eq(df.col1.shift()), where col1 is the column you are comparing.
However, when I tried it, I received a SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. warning. My col1 is a string. I know you can suppress warnings but how would I check the same row above and make sure that I am not creating a copy of the dataframe? Even with the warning I do get my desired output, but was curious if there exists a better way.
import pandas as pd
data = {'col1': ['a','a','a','b','b','c','c','c','d','d'],
        'week': [1,1,1,1,1,2,2,2,2,2]}
df = pd.DataFrame(data, columns=['col1','week'])
df['check_condition'] = 1
while sum(df.check_condition) != 0:
    for week in df.week:
        wk = df.loc[df.week == week]
        wk['match'] = wk.col1.eq(wk.col1.shift())  # <-- where the warning occurs
        # fix the repetitive value...which I have not done yet
        # for now just exit out of the while loop
        df.loc[df.week == week, 'check_condition'] = 0
You can't ignore a pandas SettingWithCopyWarning!
It's 100% telling you that your code is not going to work as intended, if at all. Stop, investigate and fix it. (It's not an ignorable thing you can filter out, like a pandas FutureWarning nagging about deprecation.)
Multiple issues with your code:
You're trying to iterate over a dataframe (but not with groupby()), take slices of it (in the subdataframe wk, which yes is a copy of a slice)...
then assign to the (nonexistent) new column wk['match']. This is bad, you shouldn't do this. (You could initialize df['match'] = np.nan, but it'd still be wrong to try to assign to the copy in wk)...
SettingWithCopyWarning is triggered when you try to assign to wk['match']. It's telling you that wk is a copy of a slice from dataframe df, not df itself - hence the message: A value is trying to be set on a copy of a slice from a DataFrame. That assignment would get thrown away every time wk is overwritten by your loop, so even if you could force it to work on wk it would still be wrong. That's why SettingWithCopyWarning is a code smell: you shouldn't be making a copy of a slice of df in the first place.
Later on, you also try to assign to column df['check_condition'] while iterating over the df, that's also bad.
Solution:
df['check_condition'] = df['col1'].eq(df['col1'].shift()).astype(int)
df
col1 week check_condition
0 a 1 0
1 a 1 1
2 a 1 1
3 b 1 0
4 b 1 1
5 c 2 0
6 c 2 1
7 c 2 1
8 d 2 0
9 d 2 1
More generally, for more complicated code where you want to operate on each group of the dataframe according to some grouping criterion, you'd use groupby() and split-apply-combine instead:
group by week,
within each group, compare col1 to col1.shift() to find rows where the value repeats the preceding row,
and set check_condition to 1 on those rows and 0 on rows where the value did change.
But in this simpler case you can skip groupby() and do a direct assignment.
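In case the comparison should restart at every week boundary (an assumption - the direct solution above compares across the whole frame), the split-apply-combine version could look like this:

```python
import pandas as pd

data = {'col1': ['a','a','a','b','b','c','c','c','d','d'],
        'week': [1,1,1,1,1,2,2,2,2,2]}
df = pd.DataFrame(data)

# within each week, flag rows whose col1 repeats the preceding row
df['check_condition'] = (df.groupby('week')['col1']
                           .transform(lambda s: s.eq(s.shift()))
                           .astype(int))
print(df['check_condition'].tolist())
# [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
```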
I have a DataFrame, say one column is:
{'university': ['A','B','A','C']}
I want to change the column into:
{'university': [1,2,1,3]}
According to an imaginary dict:
{'A':1,'B':2,'C':3}
How can I get this done?
ps: I solved the original problem; it was something about my own computer settings.
I have changed the question accordingly to be more helpful.
I think you need map by dict - d:
df.university = df.university.map(d)
If you need to encode the values as an enumerated type or categorical variable, use factorize:
df.university = pd.factorize(df.university)[0] + 1
Sample:
d = {'A':1,'B':2,'C':3}
df = pd.DataFrame({'university':['A','B','A','C']})
df['a'] = df.university.map(d)
df['b'] = pd.factorize(df.university)[0] + 1
print (df)
university a b
0 A 1 1
1 B 2 2
2 A 1 1
3 C 3 3
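One practical difference worth knowing: map leaves NaN for values missing from the dict, while factorize assigns a code to every distinct value. A small sketch ('D' is a deliberately unmapped value):

```python
import pandas as pd

d = {'A': 1, 'B': 2, 'C': 3}
s = pd.Series(['A', 'B', 'A', 'D'])   # 'D' has no entry in d

mapped = s.map(d)                 # unmapped values become NaN
codes = pd.factorize(s)[0] + 1    # every distinct value gets a code

print(mapped.tolist())   # [1.0, 2.0, 1.0, nan]
print(codes.tolist())    # [1, 2, 1, 3]
```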
I tried rewriting your function:
def given_value(column):
    columnlist = column.drop_duplicates()
    # reset to a default monotonic increasing index (0, 1, 2, ...)
    columnlist = columnlist.reset_index(drop=True)
    # print (columnlist)
    # swap index and values into a new Series columnlist_rev
    columnlist_rev = pd.Series(columnlist.index, index=columnlist.values)
    # map by columnlist_rev
    column = column.map(columnlist_rev)
    return column
print (given_value(df.university))
0 0
1 1
2 0
3 2
Name: university, dtype: int64
AttributeError: 'DataFrame' object has no attribute 'column'
Your answer is written in the exception message! A DataFrame object doesn't have an attribute called column, which means you can't call DataFrame.column at any point in your code. I believe your problem exists outside of what you have posted here, likely somewhere near the part where you imported the data as a DataFrame for the first time. My guess is that when you were naming the columns, you did something like df.column = ['university'] instead of df.columns = ['university']. The s matters. If you read the traceback closely, you'll be able to figure out precisely which line is throwing the error.
Also, in your posted function, you do not need the parameter df, as it is not used at any point in the process.
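For reference, a minimal sketch of setting column names correctly, assuming that is what the original code attempted:

```python
import pandas as pd

df = pd.DataFrame([['A'], ['B'], ['A'], ['C']])
df.columns = ['university']   # note the plural 'columns' - df.column does not exist
print(df.columns.tolist())    # ['university']
```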