Python/Pandas - Query a MultiIndex Column [duplicate]

This question already has answers here:
Select columns using pandas dataframe.query()
(5 answers)
Closed 4 years ago.
I'm trying to use query on a MultiIndex column. It works on a MultiIndex row, but not the column. Is there a reason for this? The documentation shows examples like the first one below, but it doesn't indicate that it won't work for a MultiIndex column.
I know there are other ways to do this, but I'm specifically trying to do it with the query function.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((4,4)))
df.index = pd.MultiIndex.from_product([[1,2],['A','B']])
df.index.names = ['RowInd1', 'RowInd2']
# This works
print(df.query('RowInd2 in ["A"]'))
df = pd.DataFrame(np.random.random((4,4)))
df.columns = pd.MultiIndex.from_product([[1,2],['A','B']])
df.columns.names = ['ColInd1', 'ColInd2']
# query on index works, but not on the multiindexed column
print(df.query('index < 2'))
print(df.query('ColInd2 in ["A"]'))

To answer my own question, it looks like query shouldn't be used at all (regardless of using MultiIndex columns) for selecting certain columns, based on the answer(s) here:
Select columns using pandas dataframe.query()

You can use query with ilevel_0 for the row index, and pd.IndexSlice for the columns:
df.query('ilevel_0>2')
Out[327]:
ColInd1 1 2
ColInd2 A B A B
3 0.652576 0.639522 0.52087 0.446931
df.loc[:,pd.IndexSlice[:,'A']]
Out[328]:
ColInd1 1 2
ColInd2 A A
0 0.092394 0.427668
1 0.326748 0.383632
2 0.717328 0.354294
3 0.652576 0.520870
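For completeness, column selection can also be done with a cross-section via df.xs; this is a minimal sketch, assuming the same randomly generated frame as above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 4)))
df.columns = pd.MultiIndex.from_product([[1, 2], ['A', 'B']])
df.columns.names = ['ColInd1', 'ColInd2']

# Select every column whose 'ColInd2' level equals 'A';
# drop_level=False keeps the full MultiIndex in the result.
print(df.xs('A', axis=1, level='ColInd2', drop_level=False))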


Add column Python [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 1 year ago.
Hi, I am trying to add a new column ("A") to an existing data frame, in which the values will be 1 or 3 based on the information in one of the columns ("B"):
df["A"] = np.where(df["B"] == "reported-public", 1,3)
When doing so I am getting the warning message:
<ipython-input-239-767754e40f8a>:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Any idea why?
Thanks
Any idea why?
A very simple explanation is that you are slicing the data and trying to assign a value to the slice. Is the slice the same object as your original dataframe? We don't know exactly what pandas is doing underneath; in some situations the value will be assigned into your original dataframe, in others it will only reach a temporary copy. If it works, it was probably assigned correctly. That's why it's a warning rather than an error.
Here are some links with a more detailed explanation:
How to deal with SettingWithCopyWarning in Pandas
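As a minimal sketch of the usual trigger (the frame and column names here are invented for illustration):
import pandas as pd

df = pd.DataFrame({"B": ["reported-public", "reported-private"]})

# A filtered slice may be a copy; assigning into it raises the warning.
subset = df[df["B"] == "reported-public"]
subset["A"] = 1  # SettingWithCopyWarning

# Assigning through .loc on the original frame avoids the ambiguity.
df.loc[df["B"] == "reported-public", "A"] = 1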
I have made dummy data as follows, to the best of my ability based on your limited sample:
import numpy as np
import pandas as pd

data = []
data.append([1, "reported-private"])
data.append([2, "reported-private"])
data.append([3, "reported-public"])
df = pd.DataFrame(data, columns=['Number', 'B'])
Using the command you provided, with numpy 1.19.5 and pandas 1.2.4:
df["A"] = np.where(df["B"] == "reported-public", 1,3)
I get the following output, probably the one you're expecting:
   Number                 B  A
0       1  reported-private  3
1       2  reported-private  3
2       3   reported-public  1
Now, the warning is hinting that you might want to use .loc from pandas itself, and maybe .apply for extra functionality. An example:
df['A'] = df.apply(lambda row: 1 if row.B == 'reported-public' else 3, axis = 1)
The output for this approach is the same as the previous one:
   Number                 B  A
0       1  reported-private  3
1       2  reported-private  3
2       3   reported-public  1
So to sum up: it might be a version problem. If it is, try changing the version, or try the second approach. Cheers.
You can always disable this warning, as shown below (taken from this post):
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'

How to cast pandas series into dataframe [duplicate]

This question already has answers here:
Python dataframe replace last n rows with a list of n elements
(2 answers)
df.append() is not appending to the DataFrame
(2 answers)
Closed 1 year ago.
I'm trying to place a series of 20 values into the last 20 rows of a dataframe that has more than 20 rows.
The original values are coming from a numpy array 'Y_pred':
[[3495.47227957]
[3493.27865109]
[3491.08502262]
[3488.89139414]
[3486.69776567]
[3484.50413719]
[3482.31050871]
[3480.11688024]
[3477.92325176]
[3475.72962329]
[3473.53599481]
[3471.34236633]
[3469.14873786]
[3466.95510938]
[3464.7614809 ]
[3462.56785243]
[3460.37422395]
[3458.18059548]
[3455.986967 ]
[3453.79333852]]
Creating the column Y_pred and trying to assign the converted series:
df['Y_pred'] = np.nan
df.Y_pred.iloc[-len(Y_pred):].append(pd.Series({'Y_pred': Y_pred}), ignore_index=True)
The result is that all rows are NaN.
I tried this as well:
series = pd.Series(Y_pred[:, 0])
df.Y_pred.iloc[-20:].append(series, ignore_index=True)
and
df['Y_pred'].append(Y_pred)
Nothing works. How do I do it properly?
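For reference, a minimal sketch of one way this can be done, assuming Y_pred is the (20, 1) array shown above: flatten it and assign positionally into the last rows with iloc. The dummy frame here is invented for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(30.0)})  # stand-in frame with more than 20 rows
Y_pred = np.linspace(3495.47, 3453.79, 20).reshape(-1, 1)  # stand-in for the real predictions

df['Y_pred'] = np.nan
# ravel() turns the (20, 1) array into a flat vector of 20 values;
# iloc then assigns them into the last 20 rows of the new column.
df.iloc[-len(Y_pred):, df.columns.get_loc('Y_pred')] = Y_pred.ravel()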

Using condition of a dataframe in pandas.where of another dataframe [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes: df1 has data and df2 is kind of like a map for the data. (They are both the same size and are 2D).
I would like to use pandas.where (or any method that isn't too convoluted) to replace the values of df1 based on the condition of the same cell in df2.
For instance, if df2 is equal to 0, I want to set the same cell in df1 also to 0. How do I do this?
When I try the following I get an error:
df3 = df1.where(df2 == 0, other = 0)
import numpy as np
import pandas as pd

df = pd.DataFrame()
df_1 = pd.DataFrame()
df['a'] = [1, 2, 3, 4, 5]
df_1['b'] = [5, 6, 7, 8, 0]
This gives a sample df.
Now implement a loop, using range or len(df.index):
for i in range(0, 5):
    df['a'][i] = np.where(df_1['b'][i] == 0, 0, df['a'][i])
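For the same result without a loop, a minimal sketch using where on the whole column; the condition keeps df['a'] wherever df_1['b'] is non-zero and substitutes 0 elsewhere:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
df_1 = pd.DataFrame({'b': [5, 6, 7, 8, 0]})

# Keep df['a'] where df_1['b'] is non-zero; substitute 0 elsewhere.
df['a'] = df['a'].where(df_1['b'] != 0, other=0)
print(df)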
Generally you shouldn't need to handle multiple dataframes separately like this; if df1 and df2 have the same shape and either the same index or some common column they can be joined/merged on (say it's named 'id'), then merge them:
df = pd.merge(df1, df2, on='id')
See Pandas Merging 101
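A minimal sketch of that merge-based pattern, with an invented 'id' column:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
df2 = pd.DataFrame({'id': [1, 2, 3], 'flag': [0, 1, 1]})

merged = pd.merge(df1, df2, on='id')
# Zero out 'value' wherever the merged 'flag' is 0.
merged['value'] = merged['value'].where(merged['flag'] != 0, 0)
print(merged)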

How can I use multiple .contains() inside a .when() in pySpark? [duplicate]

This question already has answers here:
PySpark: multiple conditions in when clause
(4 answers)
Closed 3 years ago.
I am trying to create classes in a new column, based on existing words in another column. For that, I need to include multiple .contains() conditions. But none of the ones I tried work.
def classes_creation(data):
    df = data.withColumn("classes", when(data.where(F.col("MISP_RFW_Title").like('galleys') | F.col("MISP_RFW_Title").like('coffee')), "galleys")).otherwise(lit(na))
    return df
# RETURNS ERROR

def classes_creation(data):
    df = data.withColumn("classes", when(col("MISP_RFW_Title").contains("galleys").contains("word"), 'galleys').otherwise(lit(na)))
    return df
# RETURNS COLUMN OF NA ONLY

def classes_creation(data):
    df = data.withColumn("classes", when(col("MISP_RFW_Title").contains("galleys" | "word"), 'galleys').otherwise(lit(na)))
    return df
# RETURNS COLUMN OF NA ONLY
If I understood your requirements correctly, you can use regex matching with rlike:
data.withColumn("classes", when(col("MISP_RFW_Title").rlike("galleys|word"), 'galleys').otherwise('a'))
or maybe if you have different columns, you can use something like this
data.withColumn("classes", when((col("MISP_RFW_Title").contains("galleys")|col("MISP_RFW_Title").contains("word")), 'galleys').otherwise('a'))
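A self-contained sketch of the rlike approach (the sample rows and the 'coffee' pattern are invented; the column name comes from the question):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [("galleys equipment",), ("coffee maker",), ("seat covers",)],
    ["MISP_RFW_Title"],
)

# rlike takes a regular expression, so 'galleys|coffee' matches either word.
classified = data.withColumn(
    "classes",
    F.when(F.col("MISP_RFW_Title").rlike("galleys|coffee"), "galleys").otherwise(F.lit(None)),
)
classified.show()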

DataFrame Column Manipulation [duplicate]

This question already has answers here:
DataFrame String Manipulation
(3 answers)
Closed 8 years ago.
I have a dataframe which I load from an excel file like this:
df = pd.read_excel(filename, 0, index_col=0, skiprows=0, parse_cols=[0, 8, 9],
                   tz='UTC', parse_dates=True)
I do some simple changing of the column names just for my own readability:
df.columns = ['Ticker', 'Price']
The data in the ticker column looks like:
AAV.
AAV.
AAV.UN
AAV.UN
I am trying to remove the period from the end of the ticker when there are no other letters following it.
I know I could use something like:
df['Ticker'].str.rstrip('.')
But that does not work; is there some other way to do what I need? I think my issue is that the method is for a Series and not a column of values. I tried apply and could not get that to work either.
Any suggestions?
You can use map() and a lambda like this:
df['Ticker'] = df['Ticker'].map(lambda x: x[:-1] if x.endswith('.') else x)
Ticker
0 AAV
1 AAV
2 AAV.UN
3 AAV.UN
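Note that str.rstrip('.') from the question does work on the column; it just returns a new Series rather than modifying the frame, so the result has to be assigned back. A minimal sketch:
import pandas as pd

df = pd.DataFrame({'Ticker': ['AAV.', 'AAV.', 'AAV.UN', 'AAV.UN']})

# str.rstrip returns a new Series; assign it back to keep the change.
# Only trailing periods are stripped, so 'AAV.UN' is left intact.
df['Ticker'] = df['Ticker'].str.rstrip('.')
print(df)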
