Pandas - iterating DataFrame rows efficiently and getting values by column name - python

Iterating in Pandas is notoriously inefficient, and best avoided whenever possible (using apply for data manipulation, etc.). In my case, unfortunately, it is unavoidable.
While it is widely known that the most efficient way to do this is itertuples, accessing the column data on the resulting tuple by its str name throws the following error:
TypeError: tuple indices must be integers or slices, not str
Some suggest that the solution to this problem is to just switch to iterrows, but as mentioned previously, this is not efficient.
How do I utilize itertuples, while still using the str name of the column to get its row value?

Essentially, one just needs to use the positional index of the required column instead. Since the first value in each tuple is the Index from the originating DataFrame, one can take the column's position in the original DataFrame and add one to account for that Index.
df = pd.DataFrame(some_data)
col_idx = df.columns.get_loc('col name') + 1  # +1 to account for the tuple's Index
for row in df.itertuples():
    val = row[col_idx]
    print(val)
This solution may not be the most elegant option, but it works :)
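As a side note not taken from the answer above: when a column name happens to be a valid Python identifier, the namedtuples returned by itertuples() also support attribute access, which avoids the index arithmetic entirely. A minimal sketch with a made-up column name:
import pandas as pd

df = pd.DataFrame({'col_name': [1, 2, 3]})
for row in df.itertuples():
    # attribute access works because 'col_name' is a valid identifier
    print(row.col_name)  # equivalently: getattr(row, 'col_name')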

Related

Getting "cannot set using a multi-index selection indexer with a different length than the value" error while using np.where function

I am trying to append two data frames row-wise, iteratively. After that I am trying to fill 0 values in one column with the values from the other column, and vice versa. I am using the np.where function to fill the 0 values. When I do it separately it gives the correct result, but when I use it in a loop it throws a "cannot set using a multi-index selection indexer with a different length than the value" error. My code looks like below.
def myfunc(dd1, dd2, dfc):
    n = dd1.shape[0]
    for i in range(n):
        dfc2 = dd1.iloc[i:i+1].append(dd2.iloc[i:i+1])
        dfc = dfc.append(dfc2)
    m = dfc.shape[0]
    for j in range(m):
        dfc.iloc[j:j+1, 2:3] = np.where(dfc.iloc[j:j+1, 2:3] == 0, dfc.iloc[j+1:j+2, 3:4], dfc.iloc[j:j+1, 2:3])
        dfc.iloc[j+1:j+2, 3:4] = np.where(dfc.iloc[j+1:j+2, 3:4] == 0, dfc.iloc[j:j+1, 2:3], dfc.iloc[j+1:j+2, 3:4])
    return dfc
Here dd1 and dd2 are my dataframes; I am appending their rows iteratively to an empty dataframe dfc, and using row and column indices to fill the values. Any help on this will be appreciated.
This is not how np.where works. The input of np.where is an array-like object. Instead of looping over every value in the dataframe and feeding it into np.where one at a time, you should pass the entire column to np.where.
dfc.iloc[:,2:3] = np.where(dfc.iloc[:,2:3]==0,dfc.iloc[:,3:4].shift(-1),dfc.iloc[:,2:3])
dfc.iloc[:,3:4] = np.where(dfc.iloc[:,3:4]==0,dfc.iloc[:,2:3],dfc.iloc[:,3:4].shift(-1))
This should work now. Be careful with pd.DataFrame.iloc and avoid it when assigning new values; I would recommend using loc instead. My script may have a potential bug depending on your pandas version.
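A minimal, self-contained sketch of the vectorized idea above, using made-up data (the column names and the shift direction are assumptions, not taken from the question):
import numpy as np
import pandas as pd

# toy frame: fill zeros in column 'a' from the next row's value of column 'b'
dfc = pd.DataFrame({'a': [0, 5, 0], 'b': [7, 8, 9]})
dfc['a'] = np.where(dfc['a'] == 0, dfc['b'].shift(-1), dfc['a'])
print(dfc)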

How to limit Pandas loc selection

I search a Pandas DataFrame by loc, for example like this:
x = df.loc[df.index.isin(['one','two'])]
But I need only the first row of the result. If I use
x = df.loc[df.index.isin(['one','two'])].iloc[0]
I get an error in the case that no row is found. Of course, I can select all the rows (the first example) and then check whether the result is empty or not. But I am looking for a more efficient way (the dataframe can be long). Is there one?
pandas.Index.duplicated
The pandas.Index object has a duplicated method that identifies all repeated values after the first occurrence.
x[~x.index.duplicated()]
If you wanted to combine that with the isin selection from the question:
df[df.index.isin(['one', 'two']) & ~df.index.duplicated()]
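A different angle on the original question, offered only as a hedged sketch (it is not part of the answer above): head(1) never raises when nothing matches, so grabbing at most the first matching row and testing for emptiness could look like this:
import pandas as pd

df = pd.DataFrame({'val': [10, 20, 30]}, index=['one', 'two', 'three'])

first = df.loc[df.index.isin(['one', 'two'])].head(1)
if first.empty:
    print('no matching row')
else:
    print(first.iloc[0])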

Pandas .isin() for list of values in each row of a column

I have a small problem: I have a column in my DataFrame which has multiple rows, and each row holds 1 or more values starting with the letter 'M' followed by 3 digits. If there is more than 1 value, they are separated by a comma.
I would like to print out a view of the DataFrame featuring only rows where that column holds values I specify (e.g. I want them to hold any item from the list ['M111', 'M222']).
I have started to build my boolean mask in the following way:
df[df['Column'].apply(lambda x: x.split(', ').isin(['M111', 'M222']))]
In my mind, .apply() with the .split() method first converts the 'Column' values in each row to a list of 1 or more items, and then the .isin() method confirms whether or not any of the items in each row's list are in the list of specified values ['M111', 'M222'].
In practice however, instead of getting the desired view of the DataFrame, I get the error
TypeError: unhashable type: 'list'
What am I doing wrong?
Kind regards,
Greem
I think you need:
df2 = df[df['Column'].str.contains('|'.join(['M111', 'M222']))]
You can only access the isin() method on a Pandas object, but split() returns a plain list. Wrapping split() in a Series will work:
# sample data
data = {'Column':['M111, M000','M333, M444']}
df = pd.DataFrame(data)
print(df)
Column
0 M111, M000
1 M333, M444
Now wrap split() in a Series.
Note that isin() will return a Series of boolean values, one for each element coming out of split(). You want to know "whether or not any of the items in each row's list are in the list of specified values", so add any() to your apply function.
df[df['Column'].apply(lambda x: pd.Series(x.split(', ')).isin(['M111', 'M222']).any())]
Output:
Column
0 M111, M000
As others have pointed out, there are simpler ways to go about achieving your end goal. But this is how to resolve the specific issue you're encountering with isin().
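One such simpler route, offered here only as a hedged sketch (it is not from the answers above): splitting each cell into a set and intersecting it with the wanted values avoids both the unhashable-list error and the accidental substring matches that a regex-based str.contains could produce:
import pandas as pd

df = pd.DataFrame({'Column': ['M111, M000', 'M333, M444']})
wanted = {'M111', 'M222'}

# True if the row shares at least one value with the wanted set
mask = df['Column'].apply(lambda x: bool(set(x.split(', ')) & wanted))
print(df[mask])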

Subsetting Pandas dataframe via column number

When I want to retrieve the (j+1)th value from a column of a pandas DataFrame, I can write: df["column_name"].ix[j]
When I check the type of the above code, I get:
type(df["column_name"].ix[j]) # str
I want to write less lengthy code though subsetting by the index. So I write:
df[[i]].ix[j]
However, when I check the type, I get: pandas.core.series.Series
How do I rewrite this so that the index-based subsetting produces a str?
The double subscripting does something other than what you seem to imply: it returns a DataFrame of the corresponding columns.
As far as I know, the shortest way to do what you're asking using column-row ordering is
df.iloc[:, j].ix[i]
(There's the shorter
df.icol(j).ix[i]
but it's deprecated.)
One way to do this is like so:
df.ix[i][j]
This is kind of funky though, because the first index is the row and the second is the column, which is rather un-pandas-like; it feels more like matrix indexing than pandas indexing.
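Worth noting as an aside (not part of the answers above): .ix and .icol have since been removed from pandas, so on current versions the same positional lookup would use iloc or iat. A minimal sketch, with i as the column position and j as the row position, as in the question:
import pandas as pd

df = pd.DataFrame({'column_name': ['a', 'b', 'c']})
i, j = 0, 1  # column position 0, row position 1

print(df.iloc[j, i])  # 'b' -- a scalar str, not a Series
print(df.iat[j, i])   # same value, optimized for single scalar access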

Type of item in pandas DataFrame bug or feature?

If I have a pandas DataFrame
df = pd.read_csv("infile.csv")
where infile looks something like
i1,i2,f1,f2
3,1,0.1,2.0
2,1,0.3,0.5
i.e. two columns of integers and two of floats.
If I query this DataFrame with:
print type(df["i1"].ix[0])
the type is (as I would expect it to be!) np.int64
Whereas if I use:
print type(df.ix[0]["i1"])
The type is np.float64
Is this correct behaviour or a bug?
I guess that this is because:
df.ix[0]
creates a Series object from which ["i1"] then selects? But this is still annoying.
As you note yourself, this is indeed expected behaviour: in df.ix[0]["i1"] you first create a Series for the first row (so all items are upcast to float to get a single dtype), and only then do you take the item with label "i1".
The solution is easy: don't use this chained indexing, but combine both look-ups (for row and column) in one indexing call:
df.ix[0, "i1"]
There are also other good reasons to avoid this chained indexing (getting problems with view/copy): http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
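Since .ix has been removed from recent pandas versions, the same single-call lookup would nowadays be spelled with loc or at; a minimal sketch that reuses the example data from the question:
import pandas as pd
from io import StringIO

csv = StringIO("i1,i2,f1,f2\n3,1,0.1,2.0\n2,1,0.3,0.5")
df = pd.read_csv(csv)

print(type(df.loc[0, "i1"]))  # numpy.int64 -- the dtype is preserved
print(type(df.at[0, "i1"]))   # numpy.int64, faster for single scalar access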
