Pandas, accessing every nth element in nested array - python

I have a dataframe of many rows and 4 columns. Each column contains an array of 100 values.
My intuitive way of doing this is the same way I would do it with multi-dimensional numpy arrays.
For example, I want the first element of every array in column1. So I say
df["column1"][:][0]
To me this makes sense: first select the column, then take every array, then take the first element of every array.
However, it just doesn't work at all. Instead, it simply spits out the entire array from column1, row 1.
But - and this is the most frustrating thing - if I say:
df["column1"][1][0]
It gives me EXACTLY what I expect based on my expected logic, as in, I get the first element in the array in the second row in column1.
How can I get every nth element in every array in column1?

The reason that df["column1"][:][0] isn't doing what you expect is that df["column1"][:] returns a Series. With a Series, bracket indexing returns the item of the Series at that index, so [0] gives you the whole array stored in the first row.
If you want a Series where each item is the element at that index in the corresponding array, the correct solution - whether it seems intuitive or not - is to use .str[...] on the Series.
Instead of
df["column1"][:][0]
use this:
df["column1"].str[0]
It might seem like .str should only be used for actual str values, but a neat trick is that it works for lists too.
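As a rough illustration of the difference between the two expressions, here is a small sketch. The toy frame and its values are made up for the example, and it assumes every cell of column1 holds a plain Python list and that the frame has the default integer index.

```python
import pandas as pd

# Hypothetical toy data: each cell of column1 holds a list of equal length.
df = pd.DataFrame({"column1": [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]})

# Chained indexing: [:] keeps the whole Series, [0] then picks the row labelled 0.
first_row = df["column1"][:][0]

# .str[0] indexes into every list; .str[::2] slices every list (every 2nd element).
first_elements = df["column1"].str[0]
every_second = df["column1"].str[::2]

print(first_row)
print(first_elements.tolist())
print(every_second.tolist())
```

A slice such as .str[::n] is one way to read "every nth element in every array"; replace 2 with whatever step you need.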

Here are some ways to do this:
[item[0] for item in df['column1']] # will result in a list
or
df['column1'].apply(lambda item: item[0]) # will result in a series
Not sure if you're looking for a way that's similar to slicing, but AFAIU pandas sees the lists in your table as just arbitrary objects, not something pandas provides sugar for.
Of course, you can do other fancy things by creating a data frame out of your column:
pd.DataFrame(df['column1'].tolist())
And then do whatever you want with it.
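If NumPy-style slicing is what you are after, one possible sketch is to turn the column into a real 2-D array first. This assumes all arrays in the column have the same length; the toy data below is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each cell of column1 holds a list of equal length.
df = pd.DataFrame({"column1": [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]})

# Build a 2-D array of shape (number of rows, values per array).
stacked = np.array(df["column1"].tolist())

first_elements = stacked[:, 0]    # first element of every array
every_second = stacked[:, ::2]    # every 2nd element of every array

# The same data as a DataFrame with one column per array position.
expanded = pd.DataFrame(df["column1"].tolist(), index=df.index)

print(first_elements)
print(every_second)
print(expanded)
```

Once the data is in this form, the indexing from the question (all rows, then a position) works the way it does for multi-dimensional NumPy arrays.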

Related

iterate through a dataframe and return index label and column label in a tuple

I have generated a correlation matrix based on Pearson's correlation. Based on this information I want to do some regression analysis. This is what the dataframe looks like:
Currently I'm manually going through the correlation matrix and selecting specific index labels and column labels to generate a regplot. But this seems inefficient, since I could iterate through the correlation matrix, select specific columns and index labels, and use a for loop to generate multiple regplots at the same time. So I was thinking I could use a list comprehension to collect every index label and column label in a list. However, I have no idea how to iterate through each value in the dataframe. I was thinking I could use dataframe.iterrows(), but this doesn't really make sense since it returns an index and the entire row. I would love to hear some suggestions!
I've figured out a way to do it!
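from itertools import combinations
corr = df.corr()  # compute the correlation matrix once, not once per pair
all_combinations = combinations(df.columns, 2)
all_comparisons = [(index, column) for index, column in all_combinations if corr.loc[index, column] > 0.5]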
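For reference, a self-contained sketch of the same idea on made-up data. The column names a, b, c and the 0.5 threshold are only illustrative; the plotting call is left as a comment because it depends on your setup.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data: b is perfectly correlated with a, c is uncorrelated.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [1, -1, 1, -1, 1],
})

corr = df.corr()  # Pearson correlation by default
threshold = 0.5

# Each unordered pair of columns appears once, so (a, b) but not (b, a).
pairs = [
    (x, y)
    for x, y in combinations(corr.columns, 2)
    if corr.loc[x, y] > threshold
]

for x, y in pairs:
    # A plot per pair could go here, e.g. seaborn.regplot(data=df, x=x, y=y)
    print(x, y, round(corr.loc[x, y], 3))
```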

How to select a column from a pandas dataframe to be plotted without addressing its name

I need to plot data from a column, and I want to do it without using its name.
The problem is, I want to have user input to make the analysis customised, and it means I'll always get a different name for the column, thus having to change the name manually for the plot. Any possible solutions to make it automatic?
I tried using .iloc:
stocks_ret.iloc[0,1].plot(figsize=(16,8), grid=True)
but got
AttributeError: 'numpy.float64' object has no attribute 'plot'
Try changing
stocks_ret.iloc[0,1].plot(figsize=(16,8), grid=True)
to
stocks_ret.iloc[:,1].plot(figsize=(16,8), grid=True)
Explanation
iloc works by selecting the row and column positions to extract from your dataframe, using the syntax my_dataframe.iloc[row_range, column_range]. For example, by writing my_dataframe.iloc[0:2, 0:4], you're asking to extract the values from the first two rows and the first four columns (remember, indices start at 0 in Python and the end of a slice is excluded).
Similarly, by writing my_dataframe.iloc[2, 3], you're asking what's the specific value inside my_dataframe at the third row, fourth column. This is what you've done in your code. Since it returns a single value, and not a pandas series/dataframe, it doesn't have a plot attribute, resulting in the error you see.
In order to select the whole column, you need to pass a range equivalent to the whole column's length, instead of a single index. The : notation can be used as a shorthand to do exactly that, so that my_dataframe.iloc[:, 3] returns the series of all values in the fourth column.
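The three cases described above can be checked on a small frame. The values and column labels here are invented; only the name stocks_ret is taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for stocks_ret: 4 rows, 5 columns.
stocks_ret = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=list("ABCDE"))

block = stocks_ret.iloc[0:2, 0:4]   # rows 0-1 and columns 0-3 (slice ends are excluded)
single = stocks_ret.iloc[0, 1]      # one scalar value, which has no .plot
column = stocks_ret.iloc[:, 1]      # the whole second column as a Series

print(block.shape)
print(type(single).__name__)
print(type(column).__name__)
```

Because column is a Series, calling column.plot(figsize=(16,8), grid=True) is possible regardless of what the column happens to be named.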

Python Dataframe for loop syntax

I'm having trouble in correctly executing a for loop through my dataframe in python.
Basically, for every row in the dataframe (df_weather), the code should select one value each from the column no. 13 and 14 and execute a function which is defined earlier in the code. Eventually, I require the calculated value in each row to be summed to give me one final answer.
The error being returned is as follows: "string indices must be integers"
Could anyone help me with this step? The code is provided below.
Thanks!
stress_rate = 0
for i in df_weather:
    b = GetStressDampHeatParameterized(i[:,13], i[:,14])
    stress_rate = b + stress_rate
print(stress_rate)
This can be solved in a single line:
print(sum(df.apply(lambda row: func(row.iloc[13], row.iloc[14]), axis=1)))
Where func is your desired function and axis=1 ensures that the function is applied on each row as opposed to each column (which is the default).
My solution first creates a temporary series (picture: an unattached column) that is constructed by applying a function to each row in turn. The function that is actually being applied is an anonymous function indicated by the keyword lambda, which takes a single input row and which is fed a single row at a time from the apply method. That anonymous function simply calls your function func and passes the two column values in the row.
A Series can be summed using the sum function.
Note the indexing of the columns starts at 0.
Also note, saying for x in df: will iterate over the column labels, not the rows.
Your number one problem is the following line:
for i in df_weather: This line actually yields the column titles and not the rows themselves. What you're looking for is the following:
for i in df_weather.values: The values attribute returns a NumPy array that you can iterate over row by row. Note, though, that the variable i will now be a single row of the matrix, so index it as i[13] and i[14] rather than i[:,13] and i[:,14].
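To make the two suggestions concrete, here is a minimal sketch on invented data. The function func is a hypothetical stand-in for GetStressDampHeatParameterized, and positions 0 and 1 play the role of columns 13 and 14 from the question.

```python
import pandas as pd

def func(a, b):
    # Hypothetical stand-in for the real per-row calculation.
    return a * b

# Hypothetical toy data with only two columns.
df_weather = pd.DataFrame({"temp": [1.0, 2.0, 3.0], "hum": [10.0, 20.0, 30.0]})

# Explicit loop over rows: each row is a 1-D NumPy array.
total_loop = 0.0
for row in df_weather.values:
    total_loop += func(row[0], row[1])

# Same calculation with apply, one row (a Series) at a time.
total_apply = df_weather.apply(
    lambda row: func(row.iloc[0], row.iloc[1]), axis=1
).sum()

# If the function only uses arithmetic that works on whole columns,
# it can be called once on the two columns instead.
total_vectorised = func(df_weather.iloc[:, 0], df_weather.iloc[:, 1]).sum()

print(total_loop, total_apply, total_vectorised)
```

Whether the last form is usable depends on what the real function does internally; the first two work for any function of two scalars.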

Extract value from single row of pandas DataFrame

I have a dataset in a relational database format (linked by ID's over various .csv files).
I know that each data frame contains only one value of an ID, and I'd like to know the simplest way to extract values from that row.
What I'm doing now:
# the group has only one element
purchase_group = purchase_groups.get_group(user_id)
price = list(purchase_group['Column_name'])[0]
The third line is bothering me as it seems ugly, but I'm not sure what the workaround is. The grouping (I guess) assumes that there might be multiple values and returns a <class 'pandas.core.frame.DataFrame'> object, while I'd like just a row returned.
If you want just the value and not a df/series then call values and index the first element [0] so just:
price = purchase_group['Column_name'].values[0]
will work.
If purchase_group has a single row, then purchase_group = purchase_group.squeeze() turns it into a Series, so you can simply use purchase_group['Column_name'] to get the value.
Late to the party here, but purchase_group['Column_name'].item() is now available and is cleaner than some other solutions.
This method is intuitive; for example, to get the first row of values from the dataframe (the first list in a list of lists):
np.array(df)[0]
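The options mentioned in these answers can be compared side by side. The frame below is made up; Column_name is the placeholder from the question and qty is an extra hypothetical column.

```python
import pandas as pd

# Hypothetical toy data: exactly one row per user_id.
purchases = pd.DataFrame({
    "user_id": [1, 2, 3],
    "Column_name": [9.99, 4.5, 12.0],
    "qty": [1, 3, 2],
})
purchase_group = purchases.groupby("user_id").get_group(2)

via_values = purchase_group["Column_name"].values[0]    # NumPy array, first element
via_iloc = purchase_group["Column_name"].iloc[0]        # positional access on the Series
via_item = purchase_group["Column_name"].item()         # requires exactly one element
via_squeeze = purchase_group.squeeze()["Column_name"]   # single row becomes a Series

print(via_values, via_iloc, via_item, via_squeeze)
```

Of these, .item() is the only one that raises an error if the group unexpectedly contains more than one row, which can be a useful safety check.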

Python Pandas: .apply taking forever?

I have a DataFrame 'clicks' created by parsing CSV of size 1.4G. I'm trying to create a new column 'bought' using apply function.
clicks['bought'] = clicks['session'].apply(getBoughtItemIDs)
In getBoughtItemIDs, I'm checking if 'buys' dataframe has values I want, and if so, return a string concatenating them. The first line in getBoughtItemIDs is taking forever. What are the ways to make it faster?
def getBoughtItemIDs(val):
    boughtSessions = buys[buys['session'] == val].values
    output = ''
    for row in boughtSessions:
        output += str(row[1]) + ","
    return output
There are a couple of things that make this code run slowly.
apply is essentially just syntactic sugar for a for loop over the rows of a column. There's also an explicit for loop over a NumPy array in your function (the for row in boughtSessions part). Looping in this (non-vectorised) way is best avoided whenever possible as it impacts performance heavily.
buys[buys['session'] == val].values is looking up val across an entire column for each row of clicks, then returning a sub-DataFrame and then creating a new NumPy array. Repeatedly looking for values in this way is expensive (O(n) complexity each lookup). Creating new arrays is going to be expensive since memory has to be allocated and the data copied across each time.
If I understand what you're trying to do, you could try the following approach to get your new column.
First use groupby to group the rows of buys by the values in 'session'. apply is used to join up the strings for each value:
boughtSessions = buys.groupby('session')[col_to_join].apply(lambda x: ','.join(x))
where col_to_join is the column from buys which contains the values you want to join together into a string.
groupby means that only one pass through the DataFrame is needed and is pretty well-optimised in Pandas. The use of apply to join the strings is unavoidable here, but only one pass through the grouped values is needed.
boughtSessions is now a Series of strings indexed by the unique values in the 'session' column. This is useful because lookups to Pandas indexes are O(1) in complexity.
To match each string in boughtSessions to the appropriate value in clicks['session'] you can use map. Unlike apply with a Python function, map with a Series looks each value up through the index and should be very fast:
clicks['bought'] = clicks['session'].map(boughtSessions)
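Put together, the groupby and map steps might look like the sketch below. The data is invented, item stands in for col_to_join, and the values are converted to strings before joining because join only accepts strings.

```python
import pandas as pd

# Hypothetical toy data.
buys = pd.DataFrame({"session": [1, 1, 2], "item": [101, 102, 201]})
clicks = pd.DataFrame({"session": [1, 2, 3, 1]})

# One pass over buys: a comma-separated string of items per session.
bought_sessions = buys.groupby("session")["item"].apply(
    lambda x: ",".join(x.astype(str))
)

# Look up each click's session in the index of bought_sessions.
clicks["bought"] = clicks["session"].map(bought_sessions)

print(clicks)
```

Sessions with no purchases (session 3 here) come out as NaN rather than an empty string, and the joined strings have no trailing comma, so add .fillna('') or adjust the join if you need the original output format.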
