I'm working with the following DataFrame:
age height weight shoe_size
0 8.0 6.0 2.0 1.0
1 8.0 NaN 2.0 1.0
2 6.0 1.0 4.0 NaN
3 5.0 1.0 NaN 0.0
4 5.0 NaN 1.0 NaN
5 3.0 0.0 1.0 0.0
I added another header level to the df in this way:
zipped = list(zip(df.columns, ["RHS", "height", "weight", "shoe_size"]))
df.columns = pd.MultiIndex.from_tuples(zipped)
So this is the new DataFrame:
age height weight shoe_size
RHS height weight shoe_size
0 8.0 6.0 2.0 1.0
1 8.0 NaN 2.0 1.0
2 6.0 1.0 4.0 NaN
3 5.0 1.0 NaN 0.0
4 5.0 NaN 1.0 NaN
5 3.0 0.0 1.0 0.0
Now I know how to select the first column, by using the corresponding tuple ("age", "RHS"):
df[("age", "RHS")]
but I was wondering how to do this using only the second-level label "RHS".
Ideally something like:
df[(any, "RHS")]
You could use get_level_values
In [700]: df.loc[:, df.columns.get_level_values(1) == 'RHS']
Out[700]:
age
RHS
0 8.0
1 8.0
2 6.0
3 5.0
4 5.0
5 3.0
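For reference, a self-contained sketch of this selection, reconstructing the question's frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [8.0, 8.0, 6.0, 5.0, 5.0, 3.0],
                   'height': [6.0, np.nan, 1.0, 1.0, np.nan, 0.0],
                   'weight': [2.0, 2.0, 4.0, np.nan, 1.0, 1.0],
                   'shoe_size': [1.0, 1.0, np.nan, 0.0, np.nan, 0.0]})
df.columns = pd.MultiIndex.from_tuples(
    zip(df.columns, ['RHS', 'height', 'weight', 'shoe_size']))

# Compare the second level of the column MultiIndex against 'RHS'
# and use the resulting boolean mask as the column indexer.
print(df.loc[:, df.columns.get_level_values(1) == 'RHS'])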
You can pass slice(None) as the first element of the column indexer to .loc, provided you sort your columns first using df.sort_index:
In [325]: df.sort_index(axis=1).loc[:, (slice(None), 'RHS')]
Out[325]:
age
RHS
0 8.0
1 8.0
2 6.0
3 5.0
4 5.0
5 3.0
You can also use pd.IndexSlice with df.loc:
In [332]: idx = pd.IndexSlice
In [333]: df.sort_index(axis=1).loc[:, idx[:, 'RHS']]
Out[333]:
age
RHS
0 8.0
1 8.0
2 6.0
3 5.0
4 5.0
5 3.0
With the slicer, you don't need to explicitly pass slice(None) because IndexSlice does that for you.
If you don't sort your columns, you get:
UnsortedIndexError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (0)'
If you have multiple RHS columns in the second level, all those columns are returned.
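A further option (a sketch, not from the answers above): DataFrame.xs can select on a given level directly; drop_level=False keeps both header rows instead of collapsing to the remaining level.
# Select every column whose second-level label is 'RHS'.
df.xs('RHS', axis=1, level=1, drop_level=False)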
I have two dataframes shown below:
df_1 =
Lon Lat N
0 2 1 1
1 2 2 3
2 2 3 1
3 3 2 2
and
df_2 =
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 NaN
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 NaN
6 3.0 2.0 NaN
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 NaN
10 3.0 3.0 NaN
11 4.0 3.0 NaN
What I want to do is compare these two dfs and merge them according to Lon and Lat. That is to say, the NaN values in df_2 should be filled with the values from df_1 wherever the corresponding Lon and Lat are identical. The ideal output should be:
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 1
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 3
6 3.0 2.0 2
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 1
10 3.0 3.0 NaN
11 4.0 3.0 NaN
The reason I want to do this is that df_1's Lon and Lat coordinates form a non-rectangular, unstructured grid, and I need to fill in some NaN values so as to get a rectangular meshgrid and make contourf applicable. It would be highly appreciated if you could suggest better ways to make the contour plot.
I have tried df_2.combine_first(df_1), but it doesn't work (combine_first aligns on the row index rather than on the Lon and Lat columns).
Thanks!
Drop the empty N column from df_2 and left-merge df_1 back in on the coordinate pair:
df_2.drop(columns='N').merge(df_1, on=['Lon', 'Lat'], how='left')
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 1.0
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 3.0
6 3.0 2.0 2.0
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 1.0
10 3.0 3.0 NaN
11 4.0 3.0 NaN
If you first create df_2 with all the needed coordinate pairs, you can update it with the second DataFrame by using pandas.DataFrame.update.
For this you need to first set the correct index by using pandas.DataFrame.set_index.
Have a look at this Post for more information.
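A minimal sketch of that approach, reconstructing the question's frames (coordinates are floats in both frames so the MultiIndexes align):
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'Lon': [2.0, 2.0, 2.0, 3.0],
                     'Lat': [1.0, 2.0, 3.0, 2.0],
                     'N': [1, 3, 1, 2]})
df_2 = pd.DataFrame({'Lon': np.tile([1.0, 2.0, 3.0, 4.0], 3),
                     'Lat': np.repeat([1.0, 2.0, 3.0], 4),
                     'N': np.nan})

# Index both frames by the coordinate pair, fill df_2's NaNs from
# df_1 in place, then restore the flat layout.
df_2 = df_2.set_index(['Lon', 'Lat'])
df_2.update(df_1.set_index(['Lon', 'Lat']))
df_2 = df_2.reset_index()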
I have two .csv files. One contains what could be described as a header and a body: the header contains data like the total number of rows, the datetime, what application generated the data, and what line the body starts on. The second file contains a single row.
>>> import pandas as pd
>>> df = pd.read_csv("data.csv", names=list('abcdef'))
>>> df
a b c d e f
0 data start row 5 NaN NaN NaN NaN
1 row count 7 NaN NaN NaN NaN
2 made by foo.exe NaN NaN NaN NaN
3 date 01-01-2000 NaN NaN NaN NaN
4 a b c d e f
5 0.0 1.0 2.0 3.0 4.0 5.0
6 0.0 1.0 2.0 3.0 4.0 5.0
7 0.0 1.0 2.0 3.0 4.0 5.0
8 0.0 1.0 2.0 3.0 4.0 5.0
9 0.0 1.0 2.0 3.0 4.0 5.0
10 0.0 1.0 2.0 3.0 4.0 5.0
11 0.0 1.0 2.0 3.0 4.0 5.0
>>> df2 = pd.read_csv("extra_data.csv")
>>> df2
a b c
0 6.0 5.0 4.0
>>> row = df2.loc[0]
>>>
I am having trouble modifying the 'a', 'b' and 'c' columns and then saving the DataFrame to a new .csv file.
I have tried adding the row by way of slicing and the addition operator but this did not work:
>>> df[5:,'a':'c'] += row
TypeError: '(slice(5, None, None), slice('a', 'c', None))' is an invalid key
>>>
I also tried the answer I found here, but this gave a similar error:
>>> df[5:,row.index] += row
TypeError: '(slice(5, None, None), Index(['a', 'b', 'c'], dtype='object'))' is an invalid key
>>>
I suspect the problem is coming from object dtypes so I tried converting a subframe to the float type:
>>> sub_section = df.loc[5:,['a','b','c']].astype(float)
>>> sub_section
a b c
5 0.0 1.0 2.0
6 0.0 1.0 2.0
7 0.0 1.0 2.0
8 0.0 1.0 2.0
9 0.0 1.0 2.0
10 0.0 1.0 2.0
11 0.0 1.0 2.0
>>> sub_section += row
>>> sub_section
a b c
5 6.0 6.0 6.0
6 6.0 6.0 6.0
7 6.0 6.0 6.0
8 6.0 6.0 6.0
9 6.0 6.0 6.0
10 6.0 6.0 6.0
11 6.0 6.0 6.0
>>> df
a b c d e f
0 data start row 5 NaN NaN NaN NaN
1 row count 7 NaN NaN NaN NaN
2 made by foo.exe NaN NaN NaN NaN
3 date 01-01-2000 NaN NaN NaN NaN
4 a b c d e f
5 0.0 1.0 2.0 3.0 4.0 5.0
6 0.0 1.0 2.0 3.0 4.0 5.0
7 0.0 1.0 2.0 3.0 4.0 5.0
8 0.0 1.0 2.0 3.0 4.0 5.0
9 0.0 1.0 2.0 3.0 4.0 5.0
10 0.0 1.0 2.0 3.0 4.0 5.0
11 0.0 1.0 2.0 3.0 4.0 5.0
>>>
Obviously, in this case df.loc[] is returning a copy, so modifying the copy does nothing to df.
How do I modify parts of a DataFrame (dtype=object) and then save the changes?
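One way this could be done (a sketch, not an answer from the thread; it continues the session above and uses a hypothetical output file name): compute on a float copy of the body slice, then assign the result back through .loc so the change lands in df itself rather than in a copy.
cols = ['a', 'b', 'c']

# Convert the body slice to float, add the row (aligned on the
# 'a', 'b', 'c' labels), and write the raw values back through .loc;
# assigning to df.loc[...] modifies df in place.
df.loc[5:, cols] = df.loc[5:, cols].astype(float).add(row).values

# 'new_data.csv' is a hypothetical name; header=False because the
# real column names already live in data row 4.
df.to_csv("new_data.csv", index=False, header=False)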
Taking a pandas dataframe df, I would like to negate the value in particular columns for all rows/entries and also add another value. The value to be added is a fixed constant for each of the columns.
I believe I could copy df, say dfcopy = df.copy(), set all cell values in dfcopy to the particular numbers and then subtract df from dfcopy, but I am hoping for a simpler way.
I am thinking that I need to somehow modify
df.iloc[:, [0,3,4]]
So for example of how this should look:
A B C D E
0 1.0 3.0 1.0 2.0 7.0
1 2.0 1.0 8.0 5.0 3.0
2 1.0 1.0 1.0 1.0 6.0
Then negating only those values in columns (0,3,4) and then adding 10 (for example) we would have:
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Thanks.
You can first multiply by -1 with mul and then add 10 with add, applied to the columns selected with iloc:
df.iloc[:, [0,3,4]] = df.iloc[:, [0,3,4]].mul(-1).add(10)
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Or as anky_91 suggested in the comments:
df.iloc[:, [0,3,4]] = 10-df.iloc[:,[0,3,4]]
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
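For reference, a runnable sketch of the one-step version (10 - df.iloc[...]) on the question's example frame:
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 1.0], 'B': [3.0, 1.0, 1.0],
                   'C': [1.0, 8.0, 1.0], 'D': [2.0, 5.0, 1.0],
                   'E': [7.0, 3.0, 6.0]})

# Negate columns 0, 3, 4 and add 10 in a single expression.
df.iloc[:, [0, 3, 4]] = 10 - df.iloc[:, [0, 3, 4]]
print(df)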
pandas is very intuitive in letting you perform these operations.
negate:
df.iloc[:, [0,2,7,10,11]] = -df.iloc[:, [0,2,7,10,11]]
add a constant c:
df.iloc[:, [0,2,7,10,11]] = df.iloc[:, [0,2,7,10,11]] + c
or change to a constant value c:
df.iloc[:, [0,2,7,10,11]] = c
and any other arithmetic you can think of.
Suppose I have a dataframe that looks like:
df =
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 NaN
2 6.0 NaN NaN
Then it is possible to use df.fillna(method='ffill', axis=1) to obtain:
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 5.0
2 6.0 6.0 6.0
i.e. I forward fill the rows.
However, now I have a dataframe with -1 instead of np.nan. Pandas has the replace function, which can also use method='ffill', but replace() does not take an axis argument, so to obtain the same result as above I would need to call df.T.replace(-1, method='ffill').T. Since transposing is quite expensive (especially considering I'm working on a dataframe of multiple gigabytes), this is not an option. How could I achieve the desired result?
Use mask and ffill
df.mask(df.eq(-1)).ffill(axis=1)
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 5.0
2 6.0 6.0 6.0
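A self-contained version of the one-liner on the question's sample data (a sketch):
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, -1.0], [6.0, -1.0, -1.0]])

# mask() turns every cell equal to -1 into NaN, then ffill carries
# the last valid value forward along each row -- no transpose needed.
print(df.mask(df.eq(-1)).ffill(axis=1))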
You can convert your -1 values to NaN before using pd.DataFrame.ffill:
print(df)
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 -1.0
2 6.0 -1.0 -1.0
res = df.replace(-1, np.nan)\
.ffill(axis=1)
print(res)
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 5.0
2 6.0 6.0 6.0
IIUC, use mask and ffill with axis=1, where df1 = df.fillna(-1.0) recreates the -1 data from the question's earlier sample:
df1.mask(df1 == -1).ffill(axis=1)
Output:
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 5.0
2 6.0 6.0 6.0
I have a big dataframe with many columns (around 1000). I have a list of columns (generated by a script, ~10 of them). And I would like to select all the rows in the original dataframe where at least one of the columns in my list is not null.
So if I knew the number of columns in my list in advance, I could do something like this:
list_of_cols = ['col1', ...]
df[
df[list_of_cols[0]].notnull() |
df[list_of_cols[1]].notnull() |
...
df[list_of_cols[6]].notnull()
]
I can also iterate over the list of columns and build up a mask to apply to df, but this looks too tedious. Knowing how powerful pandas is with respect to dealing with NaN, I would expect there to be a much easier way to achieve what I want.
Use the thresh parameter in the dropna() method. By setting thresh=1, you specify that a row is kept if it contains at least 1 non-null item.
df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))
df[list_of_cols].dropna(thresh=1).head()
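Note that df[list_of_cols].dropna(thresh=1) returns only the listed columns. If the goal is the full rows of the original frame, the subset parameter restricts the null check to those columns without dropping the others (a variant sketch):
# Keep every column; drop only the rows in which all of the listed
# columns are null (i.e. require at least one non-null among them).
df.dropna(subset=list_of_cols, how='all').head()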
Starting with this:
data = {'a' : [np.nan,0,0,0,0,0,np.nan,0,0, 0,0,0, 9,9,],
'b' : [np.nan,np.nan,1,1,1,1,1,1,1, 2,2,2, 1,7],
'c' : [np.nan,np.nan,1,1,2,2,3,3,3, 1,1,1, 1,1],
'd' : [np.nan,np.nan,7,9,6,9,7,np.nan,6, 6,7,6, 9,6]}
df = pd.DataFrame(data, columns=['a','b','c','d'])
df
a b c d
0 NaN NaN NaN NaN
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
Rows where not all values are null (this removes row index 0):
df[~df.isnull().all(axis=1)]
a b c d
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
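To restrict the check to the question's generated list of columns rather than every column, the same idea works on a column subset (list_of_cols below is a placeholder):
list_of_cols = ['a', 'b']  # hypothetical; use whatever the script produced

# Keep the rows where at least one of the listed columns is non-null.
df[df[list_of_cols].notnull().any(axis=1)]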
One can use boolean indexing
df[~pd.isnull(df[list_of_cols]).all(axis=1)]
Explanation:
The expression pd.isnull(df[list_of_cols]).all(axis=1) returns a boolean array that, once negated, is applied as a filter to the dataframe:
isnull() applied to df[list_of_cols] creates a boolean mask for the dataframe df[list_of_cols] with True values for the null elements in df[list_of_cols], False otherwise
all() returns True if all of the elements are True (row-wise axis=1)
So, by negation ~ (not all null = at least one is non-null) one gets a mask for all rows that have at least one non-null element in the given list of columns.
An example:
Dataframe:
>>> df=pd.DataFrame({'A':[11,22,33,np.NaN],
'B':['x',np.NaN,np.NaN,'w'],
'C':['2016-03-13',np.NaN,'2016-03-14','2016-03-15']})
>>> df
A B C
0 11 x 2016-03-13
1 22 NaN NaN
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
negated isnull mask (True where non-null), with list_of_cols = ['B', 'C']:
>>> ~pd.isnull(df[list_of_cols])
B C
0 True True
1 False False
2 False True
3 True True
apply all(axis=1) row-wise, then negate:
>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0 True
1 False
2 True
3 True
dtype: bool
Boolean selection from dataframe:
>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
A B C
0 11 x 2016-03-13
2 33 NaN 2016-03-14
3 NaN w 2016-03-15