I have a counter column which contains an integer. Based on that integer I would like to pick a value from one of several consecutive columns in my dataframe.
I tried using .apply(lambda x: ..., axis=1), but my solution there requires an extra if for each column I want to pick from.
df2 = pd.DataFrame(np.array([[1, 2, 3, 0], [4, 5, 6, 2], [7, 8, 9, 1]]), columns=['a', 'b', 'c', 'd'])
df2['e'] = df2.iloc[:, df2['d']]
This code doesn't work because iloc only wants a single item in that position, not three (df2['d'] = [0, 2, 1]).
What I would like it to do is give me the 0th item in the first row, the 2nd item in the second row, and the 1st item in the third row, so
df2['e'] = [1,6,8]
You are asking for something similar to fancy indexing in numpy. In pandas, it is lookup. Try this:
df2.lookup(df2.index, df2.columns[df2['d']])
Out[86]: array([1, 6, 8])
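Note: lookup was deprecated and later removed from pandas, so as an alternative, here is a sketch of the same idea using numpy fancy indexing on the underlying array (this assumes, as in the example, that column 'd' holds valid integer column positions):
import numpy as np
# Pick, for each row, the value at the column position given in 'd'.
df2['e'] = df2.to_numpy()[np.arange(len(df2)), df2['d']]
# df2['e'] -> [1, 6, 8]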
Assuming I have a data frame similar to the one below (the actual data frame has a million observations), how would I get the correlation between the signal column and the list of return columns, then group by the Signal_Up column?
I tried the pandas corrwith function, but it does not give me the correlation grouped by the Signal_Up column:
df[['Net_return_at_t_plus1', 'Net_return_at_t_plus5',
    'Net_return_at_t_plus10']].corrwith(df['Signal_Up'])
I am trying to find the correlation between the signal column and the other net return columns, grouped by the various values of the Signal_Up column.
The data and the desired result are provided as images in the original post.
Using the simple dataframe below:
df = pd.DataFrame({'v1': [1, 3, 2, 1, 6, 7],
                   'v2': [2, 2, 4, 2, 4, 4],
                   'v3': [3, 3, 2, 9, 2, 5],
                   'v4': [4, 5, 1, 4, 2, 5]})
(1st interpretation) One way to get the correlations of one variable with the other columns is:
correlations = df.corr().unstack().sort_values(ascending=False) # Build correlation matrix
correlations = pd.DataFrame(correlations).reset_index() # Convert to dataframe
correlations.columns = ['col1', 'col2', 'correlation'] # Label it
correlations.query("col1 == 'v2' & col2 != 'v2'") # Filter by variable
# This gives the correlation of column v2 with all the other columns
(2nd interpretation) One way to get the correlations of column v1 with columns v3 and v4 after grouping by column v2 is this one-liner:
df.groupby('v2')[['v1', 'v3', 'v4']].corr().unstack()['v1']
In your case, v2 corresponds to 'Signal_Up', v1 to 'signal', and v3, v4 stand in for the 'Net_return_at_t_plusX' columns.
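For example, a sketch of that one-liner translated to the question's column names (assuming the columns exist exactly as spelled):
ret_cols = ['Net_return_at_t_plus1', 'Net_return_at_t_plus5', 'Net_return_at_t_plus10']
# One row per Signal_Up value, one column per variable correlated with 'signal'.
df.groupby('Signal_Up')[['signal'] + ret_cols].corr().unstack()['signal']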
I am able to get the correlations for each category of the Signal_Up column by using the groupby function. However, I am not able to apply the corr function to more than two columns at a time, so I had to use the concat function to combine all of them.
a = df.groupby('Signal_Up')[['signal','Net_return_at_t_plus1']].corr().unstack().iloc[:,1]
b = df.groupby('Signal_Up')[['signal','Net_return_at_t_plus5']].corr().unstack().iloc[:,1]
c = df.groupby('Signal_Up')[['signal','Net_return_at_t_plus10']].corr().unstack().iloc[:,1]
dfCorr = pd.concat([a, b, c], axis=1)
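A possibly shorter alternative (just a sketch, with the column names assumed from the question) is to apply corrwith inside each group, which produces all three correlations in one pass:
ret_cols = ['Net_return_at_t_plus1', 'Net_return_at_t_plus5', 'Net_return_at_t_plus10']
# One row per Signal_Up value, one column per return horizon.
dfCorr = df.groupby('Signal_Up').apply(lambda g: g[ret_cols].corrwith(g['signal']))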
This is a follow-up question to:
Pandas Similarity Matching
The ultimate goal of the first question was to find a way to similarity-match each row with the other rows that have the same CountryId.
Here is the sample dataframe:
df = pd.DataFrame([[1, 5, 'AADDEEEEIILMNORRTU'], [2, 5, 'AACEEEEGMMNNTT'], [3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'], [4, 5, 'DEEEGINOOPRRSTY'], [5, 5, 'AACCDEEHHIIKMNNNNTTW'], [6, 5, 'ACEEHHIKMMNSSTUV'], [7, 5, 'ACELMNOOPPRRTU'], [8, 5, 'BIT'], [9, 5, 'APR'], [10, 5, 'CDEEEGHILLLNOOST'], [11, 5, 'ACCMNO'], [12, 5, 'AIK'], [13, 5, 'CCHHLLOORSSSTTUZ'], [14, 5, 'ANNOSXY'], [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']],columns=['PartnerId','CountryId','Name'])
The answer in the other thread was good for the question, but I ended up with computational problems. My real source contains more than 19,000 rows and will be even bigger in the future.
The answer suggested merging the dataframe with itself to compare each row with every other row that has the same CountryId:
df = df.merge(df, on='CountryId', how='outer')
Even for the small example of 15 rows provided above, we end up with 225 merged rows. For the whole dataset I ended up with 131,044,638 rows, which made my RAM refuse to work. Therefore I need to think of a better way to merge the two dataframes.
As I'm doing a similarity check, I was wondering if there is a possibility to:
Sort the dataframe based on the CountryId and the Name
Only merge each row with the +/- 3 neighboring rows. E.g. after sorting, row 1 will only be merged with rows (2, 3 & 4) as it is the first row, row 2 will only be merged with rows (1, 3, 4, 5), and so on.
This way similar names will end up almost next to each other, and names "further away" will not be similar anyway, so there is no need to check their similarity.
I found a workaround for my problem that takes the 3 rows before (if they exist) and after each row:
sorted_df = df.sort_values(by=['CountryId', 'Name']).reset_index(drop=True)

# Build, for every row, a '-'-separated string of the PartnerIds of the
# 3 preceding and 3 following rows ('A' is used as the fill value by shift).
new_sorted = pd.Series(dtype=str)
min = -3
max = 3
for s in list(range(min, max + 1, 1)):
    if s == min:
        new_sorted = sorted_df['PartnerId'].astype(str).shift(s, fill_value='A').rename('MatchingID')
    elif s != 0:
        new_sorted = new_sorted + '-' + sorted_df['PartnerId'].astype(str).shift(s, fill_value='A').rename('MatchingID')

match = sorted_df.merge(new_sorted, left_index=True, right_index=True)

# Explode the 'MatchingID' string into one row per candidate pair,
# dropping the 'A' placeholders.
matching_df = []
for index, row in match.iterrows():
    row_values = row.tolist()
    matching_df += [row_values[0:-1] + [int(w)] for w in row_values[-1].split('-') if w != 'A']
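As another possible direction (just a sketch against sorted_df from above, keeping only the two IDs; the remaining columns could be merged back in afterwards), the candidate pairs could also be built directly with shift and concat instead of packing the IDs into a '-'-separated string. Note that each unordered pair appears once here rather than twice as in the string version:
import pandas as pd
# Pair every row with each of the 3 rows that follow it in the sorted order;
# since pairs are symmetric, this also covers the 3 preceding rows.
pairs = pd.concat(
    [sorted_df[['PartnerId']].assign(MatchingID=sorted_df['PartnerId'].shift(-k))
     for k in range(1, 4)]
).dropna()
pairs['MatchingID'] = pairs['MatchingID'].astype(int)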
If anyone can come up with a better idea I would be glad to hear it!
I have a pandas variable X which has a shape of (14931, 381).
That's 14,931 examples, with each example having 381 features. I want to add 483 features (each with a value of zero) to each example, but I want them to come before the 381 existing ones.
How can this be done?
Create a DataFrame of zeros and call pd.concat.
v = pd.DataFrame(0, index=df.index, columns=range(483))
df = pd.concat([v, df], axis=1)
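For the shapes in the question, a quick sanity check might look like this (the name X and the p-prefixed padding column names are just assumptions for illustration):
pad = pd.DataFrame(0, index=X.index, columns=['p' + str(i) for i in range(483)])
X_padded = pd.concat([pad, X], axis=1)
print(X_padded.shape)   # expected: (14931, 864), i.e. 483 new + 381 existing columns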
For demonstration purposes, let's set up a smaller DataFrame
(7 rows and 2 columns, with feature (column) names f1, f2, ...):
df = pd.DataFrame(data={'f1': [1, 4, 6, 5, 7, 2, 3],
                        'f2': [4, 6, 5, 0, 2, 3, 2]})
Then, let's create a DataFrame filled with zeroes, to be
prepended to df (3 columns instead of your 483):
zz = pd.DataFrame(data=np.zeros((df.shape[0], 3), dtype=int),
                  columns=['p' + str(n + 1) for n in range(3)], index=df.index)
As you can see:
I named the "new" columns p1, p2 and so on,
the index is a copy of the index in df (it will be important at the next stage).
And the last step is to join these two DataFrames and assign the result back to df:
df = zz.join(df)
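For the demo data above, the joined frame then looks like this:
print(df.head(3))
#    p1  p2  p3  f1  f2
# 0   0   0   0   1   4
# 1   0   0   0   4   6
# 2   0   0   0   6   5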
The only change you need to make is to set the number of added columns to the proper value (483 in your case).
I'm trying to sum a portion of the sessions in my dictionary so I can get totals for the current and previous week.
I've converted the JSON into a pandas dataframe in one test. I'm summing the total of the sessions using the .sum() function in pandas. However, I also need to know the total sessions from this week and the week prior. I've tried a few methods to sum values (-1:-7) and (-8:-15), but I'm pretty sure I need to use .iloc.
IN:
response = requests.get("url")
data = response.json()
df=pd.DataFrame(data['DailyUsage'])
total_sessions = df['Sessions'].sum()
current_week= df['Sessions'].iloc[-1:-7]
print(current_week)
total_sessions =['current_week'].sum
OUT:
Series([], Name: Sessions, dtype: int64)
AttributeError 'list' object has no attribute 'sum'
Note: I've tried this with and without pd.to_numeric and also with variations on the syntax of the slice and sum methods. Pandas doesn't feel very Pythonic and I'm out of ideas as to what to try next.
Assuming that df['Sessions'] holds one value per day and you are comparing the current and previous week only, you can use reshape to create the weekly sums from the last 14 values.
weekly_matrix = df['Sessions'][:-15:-1].values.reshape((2, 7))
Then you can sum each row to get the weekly totals; the most recent week will be the first element.
import numpy as np
weekly_sum = np.sum(weekly_matrix, axis=1)
current_week = weekly_sum[0]
previous_week = weekly_sum[1]
EDIT: how the code works
Let's take the 1D-array that is accessed through the values attribute of the pandas Series. It contains the last 14 days, ordered from most recent to oldest. I will call it x.
x = array([14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
The array's reshape function is then called on x to split this data into a 2D-array (matrix) with 2 rows and 7 columns.
The default behavior of the reshape function is to first fill all columns in a row before moving to the next row. Therefore, x[0] will be the element (1,1) in the reshaped array, x[1] will be the element (1,2), and so on. After the element (1,7) is filled with x[6] (ending the current week), the next element x[7] will then be placed in (2,1). This continues until finishing the reshape operation, with the placement of x[13] in (2,7).
This results in placing the first 7 elements of x (current week) in the first row, and the last 7 elements of x (previous week) in the second row. This was called weekly_matrix.
weekly_matrix = x.reshape((2, 7))
# weekly_matrix = array([[14, 13, 12, 11, 10, 9, 8],
# [ 7, 6, 5, 4, 3, 2, 1]])
Since we now have the values of each week organized in a matrix, we can use the numpy.sum function to finish our operation. numpy.sum takes an axis argument, which controls how the values are combined:
if axis=None, all elements are added into a grand total.
if axis=0, all rows in each column are added. In the case of weekly_matrix, this results in a 7-element 1D-array ([21, 19, 17, 15, 13, 11, 9]), which is not the result we want, as we would be adding equivalent days of each week.
if axis=1 (as in the solution), all columns in each row are added, producing a 2-element 1D-array in the case of weekly_matrix. The order of this result array follows the order of the rows in the matrix (i.e., element 0 is the total of the first row, and element 1 is the total of the second row). Since we know that the first row is the current week and the second row is the previous week, we can extract the information using these indexes:
# weekly_sum = array([77, 28])
current_week = weekly_sum[0] # sum of [14, 13, 12, 11, 10, 9, 8] = 77
previous_week = weekly_sum[1] # sum of [ 7, 6, 5, 4, 3, 2, 1] = 28
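Putting it all together, a self-contained sketch with dummy data (the column name 'Sessions' and the oldest-first ordering of the rows are assumptions about the real feed):
import numpy as np
import pandas as pd

# 21 days of dummy session counts, oldest day first.
df = pd.DataFrame({'Sessions': np.arange(1, 22)})

# Last 14 values, most recent first, reshaped into 2 rows of 7 days each.
weekly_matrix = df['Sessions'][:-15:-1].values.reshape((2, 7))
weekly_sum = np.sum(weekly_matrix, axis=1)
current_week, previous_week = weekly_sum[0], weekly_sum[1]
# current_week == 126 (days 15-21), previous_week == 77 (days 8-14)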
To group and sum by a fixed number of values, for instance with daily data and weekly aggregation, consider groupby. You can do this forwards or backwards by slicing your series as appropriate:
np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0, 10, 21)})
print(df['col'].values)
# array([5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6, 7, 7, 8, 1, 5])
# forwards groupby
res = df['col'].groupby(df.index // 7).sum()
# 0 30
# 1 40
# 2 35
# Name: col, dtype: int32
# backwards groupby
df['col'].iloc[::-1].reset_index(drop=True).groupby(df.index // 7).sum()
# 0 35
# 1 40
# 2 30
# Name: col, dtype: int32
Is it possible to get the row number (i.e. "the ordinal position of the index value") of a DataFrame row without adding an extra column that contains the row number (the index can be arbitrary, i.e. even a MultiIndex)?
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
>>> result = df[df.a > 3]
>>> result.iloc[0]
a 4
Name: 2, dtype: int64
# but how can I get the original row index of iloc[0] in df?
I could have done df['row_index'] = range(len(df)) which would maintain the original row number, but I am wondering if Pandas has a built-in way of doing this.
Access the .name attribute and use get_loc:
In [10]:
df.index.get_loc(result.iloc[0].name)
Out[10]:
2
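If you need the positions of several rows at once, Index.get_indexer does the same thing in a vectorized way; a small sketch using the example above:
df.index.get_indexer(result.index)
# array([2, 4, 5])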
Looking at this from a different angle:
for r in df.itertuples():
    getattr(r, 'Index')
where df is the data frame. Maybe you want to add a conditional to get the index only when a condition is met.
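For instance, a small sketch of that conditional version, using the example df from the question:
# Collect the index labels of the rows where the condition holds.
matching = [r.Index for r in df.itertuples() if r.a > 3]
# matching == [2, 4, 5]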