I have a pandas dataframe with the following structure:
import numpy as np
import pandas as pd
myData = pd.DataFrame({'x': [1.2,2.4,5.3,2.3,4.1], 'y': [6.7,7.5,8.1,5.3,8.3], 'condition':[1,1,np.nan,np.nan,1],'calculation': [np.nan]*5})
print(myData)
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 NaN NaN 5.3 8.1
3 NaN NaN 2.3 5.3
4 NaN 1 4.1 8.3
I want to enter a value in the 'calculation' column based on the values in 'x' and 'y' (e.g. x/y), but only in those cells where the 'condition' column contains NaN (np.isnan(myData['condition'])). The final dataframe should look like this:
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 0.654 NaN 5.3 8.1
3 0.434 NaN 2.3 5.3
4 NaN 1 4.1 8.3
I'm happy with the idea of stepping through each row in turn with a 'for' loop and using 'if' statements to make the calculations, but the actual dataframe I have is very large and I wanted to do the calculations in an array-based way. Is this possible? I guess I could calculate the value for all rows and then delete the ones I don't want, but this seems like a lot of wasted effort (the NaNs are quite rare in the dataframe), and in some cases where 'condition' equals 1 the calculation cannot be made due to division by zero.
Thanks in advance.
Use where and pass your condition to it; this will keep the computed value only in the rows that meet the condition and leave NaN elsewhere:
In [117]:
myData['calculation'] = (myData['x']/myData['y']).where(myData['condition'].isnull())
myData
Out[117]:
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 0.654321 NaN 5.3 8.1
3 0.433962 NaN 2.3 5.3
4 NaN 1 4.1 8.3
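If you want to avoid evaluating x/y at all on the rows where it isn't needed (for instance because of the division-by-zero cases the question mentions), boolean indexing with .loc is an alternative; this is a minimal sketch assuming the same column names as above:
mask = myData['condition'].isnull()
myData.loc[mask, 'calculation'] = myData.loc[mask, 'x'] / myData.loc[mask, 'y']
Here x/y is only computed for the rows selected by the mask.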
EdChum's answer worked well for me! Still, I wanted to extend this thread as I think it will be useful for other people.
Let's assume your dataframe is
c x y
0 1 1.2 6.7
1 1 2.4 7.5
2 0 5.3 8.1
3 0 2.3 5.3
4 1 4.1 8.3
and you would like to update 0s in column c with associated x/y.
c x y
0 1 1.2 6.7
1 1 2.4 7.5
2 0.65 5.3 8.1
3 0.43 2.3 5.3
4 1 4.1 8.3
You can do
myData['c'] = (myData['x']/myData['y']).where(cond=myData['c']==0, other=myData['c'])
or
myData['c'].where(cond=myData['c'] != 0, other=myData['x']/myData['y'], inplace=True)
In both cases, 'other' is used wherever 'cond' is not satisfied. In the second snippet the inplace flag also works nicely (as it would in the first).
I found these solutions in the official pandas documentation for "where" and "indexing".
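For completeness, Series.mask is the complement of where: it replaces values where the condition is True. A minimal sketch of the same update, assuming the column names above:
myData['c'] = myData['c'].mask(myData['c'] == 0, myData['x'] / myData['y'])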
These kinds of operations are exactly what I need most of the time. I am new to Pandas and it took me a while to find this useful thread. Could anyone recommend some comprehensive tutorials to practice these types of arithmetic operations? I need to "filter/groupby/slice a dataframe, then apply different functions/operations to each group/slice separately or all at once, and keep it all inplace." Cheers!
I have a data frame like so:
canopy speed
0 1 3.3
1 2 3.3
2 2 3.1
3 2 3.1
4 2 3.1
5 2 3.0
6 2 3.0
7 2 3.5
8 2 3.5
I want to count the number of rows (observations) for each combination of canopy and speed and plot it. I expect to see something like this in the plot:
canopy = 2:
3.3 1
3.1 3
3.0 2
3.5 2
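For reference, the snippets below assume the sample above has been loaded into a DataFrame called df; a minimal way to build it from the values shown is:
import pandas as pd
df = pd.DataFrame({'canopy': [1, 2, 2, 2, 2, 2, 2, 2, 2],
                   'speed': [3.3, 3.3, 3.1, 3.1, 3.1, 3.0, 3.0, 3.5, 3.5]})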
You could do:
df.groupby('canopy')['speed'].value_counts().unstack('canopy').plot.bar()
This gives you some options, for example normalizing within each group (to get frequency instead of count):
(df
.groupby('canopy')['speed']
.value_counts(normalize=True)
.unstack('canopy').plot.bar()
)
And, of course, you could control the rounding of the speed values (as @QuangHoang rightly mentioned, it is not a good idea to group on floats, to which I would add: at least not without some rounding):
(df
.assign(speed=df['speed'].round(0))
.groupby('canopy')['speed']
.value_counts()
.unstack('canopy').plot.bar()
)
Try:
df.canopy.eq(2).groupby(df['speed']).sum()
Output:
speed
3.0 2
3.1 3
3.3 1
3.5 2
Name: canopy, dtype: int64
Note it's really not a good idea to group on floats.
So I have two dataframes, dfA and dfB. I want to select several columns of dfA based on the rows in dfB. This is what my dfA looks like:
index abandoned dismiss yes train tram go
0 0.5 9.1 1.4 2.5 2.5 5.6
1 2.4 3.2 1.8 4.9 9.3 3.2
2 1.5 5.7 3.9 2.1 1.1 0.9
and this is what dfB looks like:
index keywords
0 abandoned
1 wanted
2 goes
3 train
4 bold
5 go
6 images
7 links
So I want my dfC to look like this:
index abandoned train go
0 0.5 2.5 5.6
1 2.4 4.9 3.2
2 1.5 2.1 0.9
This was my attempt, but it gave me an empty dataframe:
dfC= dfB[~dfB["keywords"].isin(dfA)]
Can anyone help me? Thank you.
Use DataFrame.loc and filter the column names by Index.isin:
dfC = dfA.loc[:, dfA.columns.isin(dfB['keywords'])]
Or filtering by Index.intersection:
dfC = dfA[dfA.columns.intersection(dfB['keywords'])]
print (dfC)
abandoned train go
index
0 0.5 2.5 5.6
1 2.4 4.9 3.2
2 1.5 2.1 0.9
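To see either approach end to end, here is a minimal runnable sketch that rebuilds the two frames from the values shown in the question and applies the first answer:
import pandas as pd
dfA = pd.DataFrame({'abandoned': [0.5, 2.4, 1.5], 'dismiss': [9.1, 3.2, 5.7],
                    'yes': [1.4, 1.8, 3.9], 'train': [2.5, 4.9, 2.1],
                    'tram': [2.5, 9.3, 1.1], 'go': [5.6, 3.2, 0.9]})
dfB = pd.DataFrame({'keywords': ['abandoned', 'wanted', 'goes', 'train',
                                 'bold', 'go', 'images', 'links']})
dfC = dfA.loc[:, dfA.columns.isin(dfB['keywords'])]
print(dfC)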
I've got a dataframe with 3 columns and I want to add them together and test different weights.
I've written this code so far but I feel this might not be the best way:
weights = [0.5,0.6,0.7,0.8,0.9,1.0]
for i in weights:
    for j in weights:
        for k in weights:
            outname = 'outname' + str(i) + 'TV' + str(j) + 'BB' + str(k) + 'TP'
            df_media[outname] = (df_media['TP'].multiply(i)
                                 + df_media['TV'].multiply(j)
                                 + df_media['BB'].multiply(k))
Below is the input dataframe and the first output iteration of the loops. So all of the columns have been multiplied by 0.5.
df_media:
TV BB TP
1 2 6
11 4 5
4 4 3
Output DataFrame:
'Outname0.5TV0.5BB0.5TP'
4.5
10
5.5
Dictionary
If you need a dataframe for each loop, you can use a dictionary. With this solution you also don't need to store your factor in your column name, since the weight can be your key. Here's one way via a dictionary comprehension:
weights = [0.5,0.6,0.7,0.8,0.9,1.0]
col_name = '-'.join(df_media.columns)
dfs = {w: (df_media.sum(1) * w).to_frame(col_name) for w in weights}
print(dfs[0.5])
TV-BB-TP
0 4.5
1 10.0
2 5.5
Single dataframe
Much more efficient is to store your result in a single dataframe. This removes the need for a Python-level loop.
res = pd.DataFrame(df_media.sum(1).values[:, None] * np.array(weights),
                   columns=weights)
print(res)
0.5 0.6 0.7 0.8 0.9 1.0
0 4.5 5.4 6.3 7.2 8.1 9.0
1 10.0 12.0 14.0 16.0 18.0 20.0
2 5.5 6.6 7.7 8.8 9.9 11.0
Then, for example, access the first weight as a series via res[0.5].
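If you really do need a different weight per column, as in the original triple loop, you can still build every combination in one pass with itertools.product instead of three nested loops. This is only a sketch, assuming df_media has the TV, BB and TP columns shown in the question; the column-name pattern is just illustrative:
import itertools
weights = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
res = pd.DataFrame(
    {f'{i}TV_{j}BB_{k}TP': df_media['TV'] * i + df_media['BB'] * j + df_media['TP'] * k
     for i, j, k in itertools.product(weights, repeat=3)}
)
This produces one column per (i, j, k) combination (216 columns for six weights).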
Initial Problem
When I run the following in IPython
import numpy as np
import pandas as pd
df = pd.DataFrame(np.round(9*np.random.rand(4,4), decimals=1))
df.index.name = 'x'
df.columns.name = 'y'
df.to_csv('output.csv')
df
it outputs the following result:
y 0 1 2 3
x
0 7.6 7.4 0.3 7.5
1 5.6 0.0 1.5 5.9
2 7.1 2.1 0.0 0.9
3 3.7 6.6 3.3 8.4
However, when I open output.csv, the "y" is removed:
x 0 1 2 3
0 7.6 7.4 0.3 7.5
1 5.6 0 1.5 5.9
2 7.1 2.1 0 0.9
3 3.7 6.6 3.3 8.4
How do I make it so that the df.columns.name is retained when I output the dataframe to csv?
Crude workaround
Current crude work-around is me doing the following:
df.to_csv('output.csv', index_label = 'x|y')
Which results in output.csv reading:
x|y 0 1 2 3
0 7.6 7.4 0.3 7.5
1 5.6 0 1.5 5.9
2 7.1 2.1 0 0.9
3 3.7 6.6 3.3 8.4
Something better would be great! Thanks for your help (in advance).
Context
This is what I am working on: https://github.com/SimonBiggs/Electron-Cutout-Factors
This is an example table: https://github.com/SimonBiggs/Electron-Cutout-Factors/blob/master/output/20140807_173714/06app06eng/interpolation-table.csv
You can pass a list to name the columns, then you can specify the index name when you are writing to csv:
df.columns = ['column_name1', 'column_name2', 'column_name3']
df.to_csv('/path/to/file.csv', index_label='Index_name')
How about this? It's slightly different but hopefully usable, since it fits the CSV paradigm:
>>> df.columns = ['y{}'.format(name) for name in df.columns]
>>> df.to_csv('output.csv')
>>> print(open('output.csv').read())
x,y0,y1,y2,y3
0,3.5,1.5,1.6,0.3
1,7.0,4.7,6.5,5.2
2,6.6,7.6,3.2,5.5
3,4.0,2.8,7.1,7.8
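If you later read the prefixed file back, you can strip the prefix off again and restore both axis names. A sketch, assuming the output.csv written above:
df2 = pd.read_csv('output.csv', index_col=0)    # the index name 'x' comes back from the header
df2.columns = [c[1:] for c in df2.columns]      # drop the leading 'y' again
df2.columns.name = 'y'                          # restore the columns name
Note the restored column labels come back as strings rather than integers.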
So Let's say I have a csv file with data like so:
'time' 'speed'
0 2.3
0 3.4
0 4.1
0 2.1
1 1.3
1 3.5
1 5.1
1 1.1
2 2.3
2 2.4
2 4.4
2 3.9
I want to be able to process this file so that, for each increasing number under the header 'time', I find the maximum number in the 'speed' column and return it next to the corresponding time value in an array. The actual csv file I'm using is a lot larger, so I'd want to iterate over a big mass of data and not just run it where 'time' is 0, 1, or 2.
So basically I want this to return:
array([[0, 4.1], [1, 5.1], [2, 4.4]])
Using numpy specifically.
This is a bit tricky to get done in a fully vectorised way in NumPy. Here's one option:
a = numpy.genfromtxt("a.csv", names=["time", "speed"], skip_header=1)
a.sort()
unique_times = numpy.unique(a["time"])
indices = a["time"].searchsorted(unique_times, side="right") - 1
result = a[indices]
This will load the data into a one-dimensional array with two fields and sort it first. The result is an array whose data is grouped by time, with the biggest speed value always being the last in each group. We then determine the unique time values that occur and find the rightmost entry in the array for each time value.
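If you want the plain two-column array from the question rather than a structured array, you can stack the two fields afterwards; a small sketch built on the result above:
out = numpy.column_stack([result["time"], result["speed"]])
# roughly: array([[0. , 4.1], [1. , 5.1], [2. , 4.4]])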
pandas fits nicely for this kind of stuff:
>>> from io import StringIO
>>> import pandas as pd
>>> df = pd.read_table(StringIO("""\
... time speed
... 0 2.3
... 0 3.4
... 0 4.1
... 0 2.1
... 1 1.3
... 1 3.5
... 1 5.1
... 1 1.1
... 2 2.3
... 2 2.4
... 2 4.4
... 2 3.9
... """), delim_whitespace=True)
>>> df
time speed
0 0 2.3
1 0 3.4
2 0 4.1
3 0 2.1
4 1 1.3
5 1 3.5
6 1 5.1
7 1 1.1
8 2 2.3
9 2 2.4
10 2 4.4
11 2 3.9
[12 rows x 2 columns]
Once you have the dataframe, all you need is to group by time and aggregate by the maximum of speed:
>>> df.groupby('time')['speed'].aggregate(max)
time
0 4.1
1 5.1
2 4.4
Name: speed, dtype: float64