Let's say I have a Series of flags in a DataFrame:
a=pd.DataFrame({'flag':[0,1,0,0,1]})
and I want to change the values of the flags which are in a specific indexes:
lind=[0,1,3]
This is a simple solution:
def chnflg(series,ind):
if series.ix[ind]==0:
series.ix[ind]=1
else:
series.ix[ind]=0
map(partial(chnflg,a),lind)
It works fine but there are two issues: the first is that it makes the changes in-place, while I would like a new series in DataFrame. This is not a big deal after all.
The second point is that it does not seems pythonic enough. Is it possible to do better?
An easier way to describe your function is as x -> 1 - x, this will be more efficient that apply/map.
In [11]: 1 - a.iloc[lind]
Out[11]:
flag
0 1
1 0
3 1
Note: I like to use iloc here as it's less ambiguous.
If you wanted to assign these inplace then do the explicit assignment:
In [12]: a.iloc[lind] = 1 - a.iloc[lind]
In [13]: a
Out[13]:
flag
0 1
1 0
2 0
3 1
4 1
You could create a dict that flips the values and call map, this would return a series and you can create a new dataframe and leave the original intact:
In [6]:
temp={0:1,1:0}
pd.DataFrame(a.ix[lind]['flag'].map(temp))
Out[6]:
flag
0 1
1 0
3 1
Related
If I have a dataframe with elements of list type in it. Now I want to use pandas or numpy to replace
each element(each list) of this dataframe to second element of this element(list). How can I do that?
I wrote the below code to make a trial dataframe.
import pandas as pd
df = pd.DataFrame({'a':[[1,7],[0,5]],'b':[[3,1],[4,0]],'c':[[1,4],[2,0]]})
My df looks like:
a b c
0 [1,7] [3,1] [1,4]
1 [0,5] [4,0] [2,0]
Now I want df to be changed like shown below:
a b c
0 7 1 4
1 5 0 0
I tried using replace function, lambda function etc but nothing worked.
I don't want to use loops or anything that takes time to run.
here is one way to do it, using applymap
df.applymap(lambda x: x[1])
a b c
0 7 1 4
1 5 0 0
I have a dataframe (df) which looks like:
0 1 2 3
0 BBG.apples.S BBG.XNGS.bananas.S 0
1 BBG.apples.S BBG.XNGS.oranges.S 0
2 BBG.apples.S BBG.XNGS.pairs.S 0
3 BBG.apples.S BBG.XNGS.mango.S 0
4 BBG.apples.S BBG.XNYS.mango.S 0
5 BBG.XNGS.bananas.S BBG.XNGS.oranges.S 0
6 BBG.XNGS.bananas.S BBG.XNGS.pairs.S 0
7 BBG.XNGS.bananas.S BBG.XNGS.kiwi.S 0
8 BBG.XNGS.oranges.S BBG.XNGS.pairs.S 0
9 BBG.XNGS.oranges.S BBG.XNGS.kiwi.S 0
10 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
11 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
12 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
13 BBG.XNGS.peaches.S BBG.XNGS.kiwi.S 0
I am trying to update a value (first row, third column) in the dataframe using:
for index, row in df.iterrows():
status = row[3]
if int(status) == 0:
df[index]['3'] = 1
but when I print the dataframe out it remains unchanged.
What am I doing wrong?
Replace your last line by:
df.at[index,'3'] = 1
Obviously as mentioned by others you're better off using a vectorized expression instead of iterating, especially for large dataframes.
You can't modify a data frame by iterating like that. See here.
If you only want to modify the element at [1, 3], you can access it directly:
df[1, 3] = 1
If you're trying to turn every 0 in column 3 to a 1, try this:
df[df['3'] == 0] = 1
EDIT: In addition, the docs for iterrows say that you'll often get a copy back, which is why the operation fails.
If you are trying to update the third column for all rows based on the row having a certain value, as shown in your example code, then it would be much easier use the where method on the dataframe:
df.loc[:,'3'] = df['3'].where(df['3']!=0, 1)
Try to update the row using .loc or .iloc (depend on your needs).
For example, in this case:
if int(status) == 0:
df.iloc[index]['3']='1'
I want to delete the row in my data frame if it meets any value in a set. I have tried many different iterations of the following code, but they don't work:
if subid in intersection == df_1["SubId"][x]:
for x in range(len(df_1)):
del df_1.iloc[x]
I am getting Key Error:0. Any ideas??
Thanks in advance.
Edit: I have defined intersection as the following:
intersection = set(ABC).intersection(XYZ)
If you just want to remove those then use isin:
df_1[~df_1["SubId"].isin(intersection)]
This will produce a boolean mask of the rows where they match one of the values in intersection and we invert the mask using ~
What you're doing will be slow plus won't your indexing potentially run off the end of the df if you keep removing rows?
Example:
In [2]:
df = pd.DataFrame({'a':[0,1,2,3,4], 'b':np.random.randn(5)})
df
Out[2]:
a b
0 0 0.987283
1 1 0.683479
2 2 1.640774
3 3 1.262665
4 4 -1.462040
In [3]:
df[~df.a.isin([0,3])]
Out[3]:
a b
1 1 0.683479
2 2 1.640774
4 4 -1.462040
When running my code I get the following message:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df['detect'][df.index[0]] = df['event'][df.index[0]]
What is the correct way of setting the first value of a column equal to the first value of another column?
Use ix to perform the index label selection:
In [102]:
df = pd.DataFrame({'detect':np.random.randn(5), 'event':np.arange(5)})
df
Out[102]:
detect event
0 -0.815105 0
1 -0.656923 1
2 -1.417722 2
3 0.210070 3
4 0.211728 4
In [103]:
df.ix[0,'detect'] = df.ix[0,'event']
df
Out[103]:
detect event
0 0.000000 0
1 -0.656923 1
2 -1.417722 2
3 0.210070 3
4 0.211728 4
What you are doing is called chained indexing and may or may not work hence the warning
I have a large DataFrame (million +) records I'm using to store core of my data (like a database) and I then have a smaller DataFrame (1 to 2000) records that I'm combining a few of the columns for each time step in my program which can be several thousand time steps . Both DataFrames are indexed the same way by a id column.
the code I'm using is:
df_large.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols]
Where core_cols is a list of about 10 fields that I'm coping over and new_ids are the ids from the small DataFrame. This code works fine but it is the slowest part of my code my a magnitude of three. I just wanted to know if they was a faster way to merge the data of the two DataFrame together.
I tried merging the data each time with the merge function but process took way to long that is way I have gone to creating a larger DataFrame that I update to improve the speed.
There is nothing inherently slow about using .loc to set with an alignable frame, though it does go through a bit of code to cover lot of cases, so probably it's not ideal to have in a tight loop. FYI, this example is slightly different that the 2nd example.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from pandas import DataFrame
In [4]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])
In [5]: df
Out[5]:
0 1 2
a 1 1 1
b 1 1 1
c 1 1 1
d 1 1 1
e 1 1 1
f 1 1 1
g 1 1 1
h 1 1 1
i 1 1 1
j 1 1 1
[10 rows x 3 columns]
In [6]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])
In [7]: df2
Out[7]:
1 2
a 0 0
f 0 0
g 0 0
[3 rows x 2 columns]
In [8]: df.loc[df2.index,df2.columns] = df2
In [9]: df
Out[9]:
0 1 2
a 1 0 0
b 1 1 1
c 1 1 1
d 1 1 1
e 1 1 1
f 1 0 0
g 1 0 0
h 1 1 1
i 1 1 1
j 1 1 1
[10 rows x 3 columns]
Here's an alternative. It may or may not fit your data pattern. If the updates (your small frame) are pretty much independent this would work (IOW you are not updating the big frame, then picking out a new sub-frame, then updating, etc. - if this is your pattern, then using .loc is about right).
Instead of updating the big frame, update the small frame with the columns from the big frame, e.g.:
In [10]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])
In [11]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])
In [12]: needed_columns = df.columns-df2.columns
In [13]: df2[needed_columns] = df.reindex(index=df2.index,columns=needed_columns)
In [14]: df2
Out[14]:
1 2 0
a 0 0 1
f 0 0 1
g 0 0 1
[3 rows x 3 columns]
In [15]: df3 = DataFrame(0,index=list('cji'),columns=[1,2])
In [16]: needed_columns = df.columns-df3.columns
In [17]: df3[needed_columns] = df.reindex(index=df3.index,columns=needed_columns)
In [18]: df3
Out[18]:
1 2 0
c 0 0 1
j 0 0 1
i 0 0 1
[3 rows x 3 columns]
And concat everything together when you want (they are kept in a list in the mean time, or see my comments below, these sub-frames could be moved to external storage when created, then read back before this concatenating step).
In [19]: pd.concat([ df.reindex(index=df.index-df2.index-df3.index), df2, df3]).reindex_like(df)
Out[19]:
0 1 2
a 1 0 0
b 1 1 1
c 1 0 0
d 1 1 1
e 1 1 1
f 1 0 0
g 1 0 0
h 1 1 1
i 1 0 0
j 1 0 0
[10 rows x 3 columns]
The beauty of this pattern is that it is easily extended to using an actual db (or much better an HDFStore), to actually store the 'database', then creating/updating sub-frames as needed, then writing out to a new store when finished.
I use this pattern all of the time, though with Panels actually.
perform a computation on a sub-set of the data and write each to a separate file
then at the end read them all in and concat (in memory), and write out a gigantic new file. The concat step could be done all at once in memory, or if truly a large task, then can be done iteratively.
I am able to use multi-processes to perform my computations AND write each individual Panel to a file separate as they are all completely independent. The only dependent part is the concat.
This is essentially a map-reduce pattern.
Quickly: Copy columns a and b from the old df into a new df.
df1 = df[['a', 'b']]
I've had to copy between large dataframes a fair bit. I'm using dataframes with realtime market data, which may not be what pandas is designed for, but this is my experience..
On my pc, copying a single datapoint with .at takes 15µs with the df size making negligible difference. .loc takes a minimum of 550µs and increases as the df gets larger: 3100µs to copy a single point from one 100000x2 df to another. .ix seems to be just barely faster than .loc.
For a single datapoint .at is very fast and is not impacted by the size of the dataframe, but it cannot handle ranges so loops are required, and as such the time scaling is linear. .loc and .ix on the other hand are (relatively) very slow for single datapoints, but they can handle ranges and scale up better than linearly. However, unlike .at they slow down significantly wrt dataframe size.
Therefore when I'm frequently copying small ranges between large dataframes, I tend to use .at with a for loop, and otherwise I use .ix with a range.
for new_id in new_ids:
for core_col in core_cols:
df_large.at[new_id, core_col] = df_small.at[new_id, core_col]
Of course, to do it properly I'd go with Jeff's solution above, but it's nice to have options.
Caveats of .at: it doesn't work with ranges, and it doesn't work if the dtype is datetime (and maybe others).