Mean of every 15 rows of a dataframe in Python

I have a dataframe of shape (1500, 11). I need to take every 15 rows and compute the mean of each of the 11 columns separately, so my final dataframe should have dimensions 100x11. How can I do this in Python?

The following should work:
import pandas as pd

dfnew = df[:0]                       # empty frame with the same columns
for i in range(100):
    df2 = df.iloc[i*15:i*15+15, :]   # the current block of 15 rows
    x = pd.Series(dict(df2.mean()))  # column means for this block
    dfnew = dfnew.append(x, ignore_index=True)
print(dfnew)
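Note that DataFrame.append was removed in pandas 2.0, so on recent versions the same loop needs pd.concat or a plain list of Series instead. A minimal sketch of that variant (my addition, assuming the same df):
import pandas as pd

means = []
for i in range(100):
    df2 = df.iloc[i*15:i*15+15, :]
    means.append(df2.mean())              # one Series of column means per block of 15 rows
dfnew = pd.DataFrame(means).reset_index(drop=True)
print(dfnew)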

I don't know much about pandas, so I've coded my solution in pure NumPy, without any Python loops, which makes it very efficient. The result is then converted back to a pandas DataFrame:
import pandas as pd, numpy as np

df = pd.DataFrame([[i + j for j in range(11)] for i in range(1500)])  # example 1500x11 data
a = df.values
a = a.reshape((a.shape[0] // 15, 15, a.shape[1]))  # shape (100, 15, 11): blocks of 15 rows
a = np.mean(a, axis=1)                             # mean over each block -> shape (100, 11)
df = pd.DataFrame(a)
print(df)

You can use pandas.DataFrame.
Use a for loop to compute the means, with a counter that is reset every 15 entries.
columns = [col1, col2, ..., col11]
for column, values in df.items():
    # compute the mean
    # save it every 15 entries
Also, you can create the new dataframe with pd.DataFrame().
I'd recommend reading the documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
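A more concrete sketch of that idea (my own addition, not part of the original answer) is to group rows by their integer position, which avoids the manual counter:
import pandas as pd
import numpy as np

# assuming df is the 1500x11 dataframe from the question
dfnew = df.groupby(np.arange(len(df)) // 15).mean()  # one group per block of 15 rows
print(dfnew.shape)  # (100, 11)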

Related

Is there a faster way to search every column of a dataframe for a String than with .apply and str.contains?

So basically I have a bunch of dataframes with about 100 columns and 500-3000 rows filled with different string values. Now I want to search the entire dataframe for, let's say, the string "Airbag" and delete every row which doesn't contain this string. I was able to do this with the following code:
df = df[df.apply(lambda row: row.astype(str).str.contains('Airbag', regex=False).any(), axis=1)]
This works exactly like I want it to, but it is way too slow. So I tried finding a way to do it with vectorization or a list comprehension, but I wasn't able to, and couldn't find any example code on the internet. So my question is whether it is possible to speed this process up or not.
Example Dataframe:
df = pd.DataFrame({'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'], 'col2': ['String1', 'String2', 'String3'], 'col3': ['Tires', 'Wheel_Airbag', 'Antenna']})
Let's start from this dataframe with random strings and numbers in COLUMN:
import numpy as np
import pandas as pd

np.random.seed(0)
# 100 random 5-letter strings built from the letters A-D
strings = np.apply_along_axis(''.join, 1, np.random.choice(list('ABCD'), size=(100, 5)))
junk = list(range(10))          # a few non-string values
col = list(strings) + junk
np.random.shuffle(col)
df = pd.DataFrame({'COLUMN': col})
>>> df.head()
COLUMN
0 BBCAA
1 6
2 ADDDA
3 DCABB
4 ADABC
You can simply apply pandas.Series.str.contains. You need to use fillna to account for the non-string elements:
>>> df[df['COLUMN'].str.contains('ABC').fillna(False)]
COLUMN
4 ADABC
31 BDABC
40 BABCB
88 AABCA
101 ABCBB
testing all columns:
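The original code for this step isn't reproduced above; a minimal sketch of the same idea, applied to the 3-column Airbag example dataframe from the question, might look like this:
mask = df.astype(str).apply(lambda col: col.str.contains('Airbag', regex=False))
df[mask.any(axis=1)]    # keep rows with a match in any column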
Here is an alternative using a good old custom function. One might think it should be slower than apply/transform, but it is actually faster when you have a lot of columns and a decent frequency of the searched term (tested on the example dataframe, a 3x3 with no match, and 3x3000 dataframes with matches and no matches):
def has_match(series):
    for s in series:
        if 'Airbag' in s:
            return True
    return False

df[df.apply(has_match, axis=1)]
Update (exact match)
Since it looks like you actually want an exact match, test with eq() instead of str.contains(). Then use boolean indexing with loc:
df.loc[df.eq('Airbag').any(axis=1)]
Original (substring)
Test for the string with applymap() and turn it into a row mask using any(axis=1):
df[df.applymap(lambda x: 'Airbag' in x).any(axis=1)]
# col1 col2 col3
# 0 Airbag_101 String1 Tires
# 1 Distance_xy String2 Wheel_Airbag
As mozway said, "optimal" depends on the data. These are some timing plots for reference.
[Timing plots not reproduced here: timings vs number of rows (fixed at 3 columns), and timings vs number of columns (fixed at 3,000 rows).]
Ok I was able to speed it up with the help of numpy arrays, but thanks for the help :D
master_index = []
for column in df.columns:
    np_array = df[column].values
    index = np.where(np_array == 'Airbag')
    master_index.append(index)
print(df.iloc[master_index[1][0]])
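If an exact cell match is really what's wanted, the per-column loop can also be avoided; a sketch of that (my addition, assuming modern pandas):
mask = (df.to_numpy() == 'Airbag').any(axis=1)   # rows where any cell equals 'Airbag' exactly
df[mask]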

Get all previous values for every row

I'm about to write a backtesting tool, and for every row I'd like to have access to all of the dataframe up to that row. In the following example I'm doing it from a fixed index using a loop. I'm wondering if there is any better solution.
import numpy as np
import pandas as pd

N = 10  # example length (the original value was lost; any N works)
df = pd.DataFrame({"a": np.arange(N)})
for i in range(3, N):
    print(df["a"][:i].values)
UPDATE (toy example)
I need to apply a custom function to all the previous values. Here as a toy example I will use the sum of the square of all previous values.
def toyFun(v):
    return np.sum(v**2)

res = np.empty(N)
res[:] = np.nan
for i in range(3, N):
    res[i] = toyFun(df["a"][:i].values)
df["res"] = res
If you are indexing rows of a particular column, say 'a', you can use the .iloc indexer (i stands for integer, loc means location) to index positionally:
df = pd.DataFrame({'a': [1,2,3,4]})
print(df.a.iloc[:2]) # get first two values
So, you can do:
for i in range(3, 10):
    print(df.a.iloc[:i])
The best way is to use a temporary column with the direct results; that way you are not re-calculating everything.
df["a"].apply(lambda x: x**2).cumsum()
Then re-index as you wish:
res[3:] = df["a"].apply(lambda x: x**2).cumsum()[2:N-1].values
or assign it directly to the dataframe.
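An equivalent without the explicit loop (my own sketch, not from the original answers) is a pandas expanding window shifted by one row, so each row only sees the values before it:
import numpy as np

# expanding() at row i covers rows 0..i; shift(1) moves that result down one row,
# so row i gets the function applied to all previous values (NaN before row 3)
df["res"] = df["a"].expanding(min_periods=3).apply(lambda v: np.sum(v**2)).shift(1)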

How to index a pandas data frame starting at n?

Is it possible to start the index from n in a pandas dataframe?
I have some datasets saved as CSV files, and would like to add an index column with the row number starting from where the last row number ended in the previous file.
For example, for the first file I'm using the following code which works fine, so I got an output csv file with rows starting at 1 to 1048574, as expected:
yellow_jan['index'] = range(1, len(yellow_jan) + 1)
I would like to do the same for the yellow_feb file, but starting the row index at 1048575, and so on.
Appreciate any help!
df["new_index"] = range(10, 20)
df = df.set_index("new_index")
df
If your plan is to concat the dataframes, you can just use
import pandas as pd
import numpy as np
df1 = pd.DataFrame({"a": np.arange(10)})
df2 = pd.DataFrame({"a": np.arange(10,20)})
df = pd.concat([df1, df2],ignore_index=True)
otherwise
df2.index += len(df1)
You may just reset the index at the end, or define a local variable and use it in the arange function, updating the variable with the number of rows in each file you read.
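A sketch of that running-offset idea across several files (the file names and the offset variable are made up for illustration):
import pandas as pd

offset = 0
frames = []
for path in ["yellow_jan.csv", "yellow_feb.csv"]:               # hypothetical file names
    part = pd.read_csv(path)
    part["index"] = range(offset + 1, offset + len(part) + 1)   # continue numbering across files
    offset += len(part)
    frames.append(part)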

What is the correct way of passing parameters to stats.friedmanchisquare based on a DataFrame?

I am trying to pass values to stats.friedmanchisquare from a dataframe df that has shape (11, 17).
This is what works for me (only for three rows in this example):
df = df.as_matrix()
print stats.friedmanchisquare(df[1, :], df[2, :], df[3, :])
which yields
(16.714285714285694, 0.00023471398805908193)
However, the line of code is too long when I want to use all 11 rows of df.
First, I tried to pass the values in the following manner:
df = df.as_matrix()
print stats.friedmanchisquare([df[x, :] for x in np.arange(df.shape[0])])
but I get:
ValueError:
Less than 3 levels. Friedman test not appropriate.
Second, I also tried not converting it to matrix form, leaving it as a DataFrame (which would be ideal for me), but I guess this is not supported yet, or I am doing it wrong:
print stats.friedmanchisquare([row for index, row in df.iterrows()])
which also gives me the error:
ValueError:
Less than 3 levels. Friedman test not appropriate.
So, my question is: what is the correct way of passing parameters to stats.friedmanchisquare based on df? (or even using its df.as_matrix() representation)
You can download my dataframe in csv format here and read it using:
df = pd.read_csv('df.csv', header=0, index_col=0)
Thank you for your help :)
Solution:
Based on @Ami Tavory's and @vicg's answers (please vote on them), the solution to my problem, based on the matrix representation of the data, is to add the *-operator (defined here, but better explained here), as follows:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
And the same is true if you want to work with the original dataframe, which is what I ideally wanted:
print stats.friedmanchisquare(*[row for index, row in df.iterrows()])
In this manner you iterate over the dataframe in its native format.
Note that I went ahead and ran some timeit tests to see which way is faster, and as it turns out, converting to a NumPy array beforehand is twice as fast as using df in its original dataframe format.
This was my experimental setup:
import timeit
setup = '''
import pandas as pd
import scipy.stats as stats
import numpy as np
df = pd.read_csv('df.csv', header=0, index_col=0)
'''
theCommand = '''
df = np.array(df)
stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
theCommand = '''
stats.friedmanchisquare(*[row for index, row in df.iterrows()])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
which yields the following results:
4.97029900551
8.7627799511
The problem I see with your first attempt is that you end up passing one list with multiple arrays inside of it.
stats.friedmanchisquare needs multiple array_like arguments, not one list.
Try using the * (star/unpack) operator to unpack the list, like this:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
You could pass it using the "star operator", similarly to this:
import numpy as np
from scipy.stats import friedmanchisquare
a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
friedmanchisquare(*(a[i, :] for i in range(a.shape[0])))
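As an aside (not part of the original answers), unpacking the 2-D array directly gives the same call without the generator, since iterating a 2-D NumPy array yields its rows:
friedmanchisquare(*a)                      # each row of a becomes one argument
stats.friedmanchisquare(*df.to_numpy())    # or straight from the dataframe on modern pandas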

Reading values from Pandas dataframe rows into equations and entering result back into dataframe

I have a dataframe. For each row of the dataframe I need to read values from two column indexes, pass these values to a set of equations, enter the result of each equation into its own column index in the same row, then go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd

DF = pd.read_csv("...")

Equation_1 = f(x, y)
Equation_2 = g(x, y)

for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF, reading and entering new values for each row, this code iterates over DF and enters the same values for each row. I am not sure what I am doing wrong here.
Also, from what I have read it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. Not sure how I would go about this.
Thanks.
Turns out that this is extremely easy. All that must be done is to define two variables and assign the desired columns to them, then set the column to be filled equal to the equation containing the variables.
Pandas already knows that it must apply the equation to every row and return each value to its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g.,
import pandas as pd

df = pd.read_csv("...")   # df is a large 2D table
A = df[0]                 # first input column
B = df[1]                 # second input column
# f(A, B) = ...           # your equation, written with vectorized column operations
df[3] = f(A, B)           # the whole result column is computed at once
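To make that concrete, here is a small self-contained sketch (the column names and the equation are made up for illustration):
import pandas as pd

df = pd.DataFrame({"m": [1.0, 2.0, 3.0], "n": [10.0, 20.0, 30.0]})

def equation_1(a, b):
    return a * 2 + b                       # hypothetical equation

df["p"] = equation_1(df["m"], df["n"])     # computed for every row at once by broadcasting
print(df)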
# If your equations are simple enough, do operations column-wise in Pandas:
import pandas as pd

test = pd.DataFrame([[1,2],[3,4],[5,6]])
test                        # default column names are 0, 1
test[0]                     # this is column 0
test.iloc[:, 0]             # column 0 selected positionally, returned as a Series
test.columns = ['S', 'Q']   # column names are easier to use
test                        # column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test                        # results stored in the DataFrame

# For more complicated stuff, try apply, as in "Python pandas apply on more columns":
def toyfun(df):
    return df.iloc[0] - df.iloc[1]**2   # positional access works whatever the column names are

test['out2'] = test[['S','Q']].apply(toyfun, axis=1)

# You can also define the column names when you generate the DataFrame:
test2 = pd.DataFrame([[1,2],[3,4],[5,6]], columns=list('AB'))
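Regarding the question's last point about treating the dataframe as a NumPy array, here is a minimal sketch, assuming the test frame and the toyfun equation from the previous answer:
arr = test[['S', 'Q']].to_numpy()          # shape (n_rows, 2)
test['out3'] = arr[:, 0] - arr[:, 1]**2    # the same toyfun computed over the whole array at once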
