I know it is a simple answer but i could'nt find anywhere, I need to show the multiplication of all values of a single column in python.
Here's the dataframe:
VALUE
0 2
1 3
2 1
3 3
4 1
The output should give me 23131 = 18
Try prod
df.VALUE.prod()
Out[345]: 18
To add to the previous answer you can use df.product(axis=0) as well.
Related
Given a table,
Id
Value
1
1
2
2
2
3
3
4
4
5
4
6
2
8
2
3
1
1
Instead of a simple groupby('Id').agg({'Value':'sum'}) which would perform aggregation over all the instances and yield a table with only four rows, I wish the result to aggregate only over the nearby instances and hence maintaining the order the table was created.
The expected output is following,
Id
Value
1
1
2
5
3
4
4
11
2
11
1
1
If not possible with pandas groupby, any other kind of trick would also be greatly appreciated.
Note: If the above example is not helpful, basically what I want is to somehow compress the table with aggregating over 'Value'. The aggregation should be done only over the duplicate 'Id's which occur one exactly after the other.
Unfortunately, the answers from eshirvana and wwnde doesn't seem to work for a long dataset. Inspired from answer of wwnde, I found a workaround,
# create a series referring to group of identicals
new=[]
i=-1
for item in df.Id:
if item !=seen:
i+=1
seen=items
new.append(i)
df['temp']=new
Now, we groupby over 'temp' column.
df.groupby('temp').agg({'Id':max, 'Value':sum}).reset_index(drop=True)
I have the following dataframe:
A,B,C,D
10,1,2,3
1,4,7,3
10,5,2,3
40,7,9,3
9,9,9,9
I would like to create another dataframe starting from the previous one which have only two row. The selection of these two rows is based on the minimum and maximum value in the column "A". I would like to get:
A,B,C,D
1,4,7,3
40,7,9,3
Do you think I should work with a sort of index.min e index.max and then select only the two rows and append then in a new dataframe? Do you have same other suggestions?
Thanks for any kind of help,
Best
IIUC you can simply subset the dataframe with an OR condition on df.A.min() and df.A.max():
df = df[(df.A==df.A.min())|(df.A==df.A.max())]
df
A B C D
1 1 4 7 3
3 40 7 9 3
Yes, you can use idxmin/idxmax and then use loc:
df.loc[df['A'].agg(['idxmin','idxmax']) ]
Output:
A B C D
1 1 4 7 3
3 40 7 9 3
Note that this only gives one row for min and one for max. If you want all values, you should use #CHRD's solution.
I have a certain condition (Incident = yes) and I want to know the values in two columns fulfilling this condition. I have a very big data frame (many rows and many columns) and I am looking for a "screening" function.
To illustrate the following example with the df (which has many more columns than shown):
Repetition Step Incident Test1 Test2
1 1 no 10 20
1 1 no 9 11
1 2 yes 9 19
1 2 yes 11 20
1 2 yes 12 22
1 3 yes 9 18
1 3 yes 8 18
What I would like to get as an answer is
Repetition Step
1 2
1 3
If I only wanted to know the Step, I would use the following command:
df[df.Incident == 'yes'].Step.unique()
Is there a similar command to get the values of two columns for a specific condition?
Thanks for the help! :-)
You could use the query option for the condition, select the interested columns, and finally remove duplicate values
df.query('Incident=="yes"').filter(['Repetition','Step']).drop_duplicates()
OR
you could use the Pandas' loc method, set the rows as the condition, set the columns part with the columns you are interested in, then drop the duplicates.
df.loc[df.Incident=="yes",['Repetition','Step']].drop_duplicates()
I am new to Pandas and I have a question which I could not fix it by myself.
This is my column
0 12.000000
1 21.540659
2 19.122413
3 16.568042
4 17.082154
5 15.932148
6 15.226856
7 14.400521
8 17.900962
9 17.169741
10 NaN
and I want to shift it by one row.The expected result should be looks like this:
0 NaN
1 12.000000
2 21.540659
3 19.122413
4 16.568042
5 17.082154
6 15.932148
7 15.226856
8 14.400521
9 17.900962
10 17.169741
Here is my code:
data['A']=pd.Series(a).shift(periods=1)
a is a list and I convert it to the pandas series to add a new column as an "A" in my dataframe. Howerevr I need to shift my rows without losing the last data.
Wait now? Doesn't this work?
data["A"] = data["A"].shift(1)
IIUC,
You could pad your Series creation with an extra row then shift:
data['A'] = pd.Series(a+[pd.np.nan]).shift(1)
You can use np.roll for that.
Answer found here, check the link for details.
I was struggling this afternoon to find a way of selecting few columns of my Pandas DataFrame, by checking the occurrence of a certain pattern in their name (label?).
I had been looking for something like contains or isin for nd.arrays / pd.series, but got no luck.
This frustrated me quite a bit, as I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:
hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
df_cln= df[hp]
However, no matter how I banged my head, I could not apply .str.contains() to the object returned bydf.columns - which is an Index - nor the one returned by df.columns.values - which is an ndarray. This works fine for what is returned by the "slicing" operation df[column_name], i.e. a Series, though.
My first solution involved a for loop and the creation of a help list:
ll = []
for a in df.columns:
if a.startswith('start_exp1') | a.startswith('start_exp2'):
ll.append(a)
df[ll]
(one could apply any of the str functions, of course)
Then, I found the map function and got it to work with the following code:
import re
sel = df.columns.map(lambda x: bool(re.search('your_regex',x))
df[df.columns[sel]]
Of course in the first solution I could have performed the same kind of regex checking, because I can apply it to the str data type returned by the iteration.
I am very new to Python and never really programmed anything so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method - using a map - could potentially be faster, besides looking more elegant to my untrained eye.
I am curious to know what you think of it, and what possible alternatives would be. Given my level of noobness, I would really appreciate if you could correct any mistakes I could have made in the code and point me in the right direction.
Thanks,
Michele
EDIT : I just found the Index method Index.to_series(), which returns - ehm - a Series to which I could apply .str.contains('whatever').
However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function..
Select column by partial string, can simply be done, via:
df.filter(like='hello') # select columns which contain the word hello
And to select rows by partial string match, you can pass axis=0 to filter:
df.filter(like='hello', axis=0)
Your solution using map is very good. If you really want to use str.contains, it is possible to convert Index objects to Series (which have the str.contains method):
In [1]: df
Out[1]:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
In [2]: df.columns.to_series().str.contains('x')
Out[2]:
x True
y False
z False
dtype: bool
In [3]: df[df.columns[df.columns.to_series().str.contains('x')]]
Out[3]:
x
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
UPDATE I just read your last paragraph. From the documentation, str.contains allows you to pass a regex by default (str.contains('^myregex'))
I think df.keys().tolist() is the thing you're searching for.
A tiny example:
from pandas import DataFrame as df
d = df({'somename': [1,2,3], 'othername': [4,5,6]})
names = d.keys().tolist()
for n in names:
print n
print type(n)
Output:
othername
type 'str'
somename
type 'str'
Then with the strings you got, you can do any string operation you want.