So let's say I have a CSV file with data like so:
'time' 'speed'
0 2.3
0 3.4
0 4.1
0 2.1
1 1.3
1 3.5
1 5.1
1 1.1
2 2.3
2 2.4
2 4.4
2 3.9
I want to process this file so that, for each distinct value under the header 'time', I find the maximum value in the 'speed' column and return it next to the corresponding time value in an array. The actual CSV file I'm using is a lot larger, so I want to iterate over a big mass of data rather than handling only the cases where 'time' is 0, 1, or 2.
So basically I want this to return:
array([[0, 4.1], [1, 5.1], [2, 4.4]])
Using numpy specifically.
This is a bit tricky to get done in a fully vectorised way in NumPy. Here's one option:
import numpy

a = numpy.genfromtxt("a.csv", names=["time", "speed"], skip_header=1)
a.sort()  # sorts on the first field ("time"), then on "speed"
unique_times = numpy.unique(a["time"])
indices = a["time"].searchsorted(unique_times, side="right") - 1
result = a[indices]
This loads the data into a one-dimensional array with two fields and sorts it first. The result is an array whose entries are grouped by time, with the largest speed value always last within each group. We then determine the unique time values that occur and, for each of them, pick the rightmost entry in the array.
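If you want the plain two-column array from the question rather than a structured array, you can stack the two fields afterwards (a small sketch, assuming the result array from above):
plain = numpy.column_stack((result["time"], result["speed"]))
# plain -> array([[0. , 4.1],
#                 [1. , 5.1],
#                 [2. , 4.4]])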
pandas fits nicely for this kind of stuff:
>>> from io import StringIO
>>> import pandas as pd
>>> df = pd.read_table(StringIO("""\
... time speed
... 0 2.3
... 0 3.4
... 0 4.1
... 0 2.1
... 1 1.3
... 1 3.5
... 1 5.1
... 1 1.1
... 2 2.3
... 2 2.4
... 2 4.4
... 2 3.9
... """), delim_whitespace=True)
>>> df
time speed
0 0 2.3
1 0 3.4
2 0 4.1
3 0 2.1
4 1 1.3
5 1 3.5
6 1 5.1
7 1 1.1
8 2 2.3
9 2 2.4
10 2 4.4
11 2 3.9
[12 rows x 2 columns]
Once you have the DataFrame, all you need is to group by time and aggregate by the maximum of speed:
>>> df.groupby('time')['speed'].aggregate(max)
time
0 4.1
1 5.1
2 4.4
Name: speed, dtype: float64
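If you still need the NumPy array from the question rather than a Series, you could convert the grouped result (a minimal sketch, assuming the df built above and a reasonably recent pandas):
>>> df.groupby('time', as_index=False)['speed'].max().to_numpy()
array([[0. , 4.1],
       [1. , 5.1],
       [2. , 4.4]])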
I have a data frame like so:
canopy speed
0 1 3.3
1 2 3.3
2 2 3.1
3 2 3.1
4 2 3.1
5 2 3.0
6 2 3.0
7 2 3.5
8 2 3.5
I want to count the number of rows (observations) for each combination of canopy and speed and plot it. I expect to see something like this in the plot:
canopy = 2:
3.3 1
3.1 3
3.0 2
3.5 2
You could do:
df.groupby('canopy')['speed'].value_counts().unstack('canopy').plot.bar()
This gives you some options, for example normalizing within each group (to get frequency instead of count):
(df
.groupby('canopy')['speed']
.value_counts(normalize=True)
.unstack('canopy').plot.bar()
)
And, of course, you could control the rounding of the speed values (as @QuangHoang rightly mentioned, it's not a good idea to group on floats -- to which I would add: at least not without some rounding):
(df
.assign(speed=df['speed'].round(0))
.groupby('canopy')['speed']
.value_counts()
.unstack('canopy').plot.bar()
)
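As a side note, pandas.crosstab builds essentially the same counts table in a single call, and also accepts a normalize argument if you want frequencies (a sketch, not part of the original answer):
counts = pd.crosstab(df['speed'], df['canopy'])
counts.plot.bar()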
Try:
df.canopy.eq(2).groupby(df['speed']).sum()
Output:
speed
3.0 2
3.1 3
3.3 1
3.5 2
Name: canopy, dtype: int64
Note it's really not a good idea to group on floats.
I have an ASCII file containing 2 columns as follows:
id value
1 15.1
1 12.1
1 13.5
2 12.4
2 12.5
3 10.1
3 10.2
3 10.5
4 15.1
4 11.2
4 11.5
4 11.7
5 12.5
5 12.2
I want to estimate the average value of the column "value" for each id (i.e. group by id).
Is it possible to do that in Python using NumPy or pandas?
If you don't know how to read the file, there are several methods you could use, e.g. pd.read_csv().
Once you have read the file, you could try this, using pandas functions such as pd.DataFrame.groupby() and pd.Series.mean():
df.groupby('id').mean()
#if df['id'] is the index, try this:
#df.reset_index().groupby('id').mean()
Output:
value
id
1 13.566667
2 12.450000
3 10.266667
4 12.375000
5 12.350000
import pandas as pd
filename = "data.txt"
df = pd.read_fwf(filename)
df.groupby(['id']).mean()
Output
value
id
1 13.566667
2 12.450000
3 10.266667
4 12.375000
5 12.350000
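Since the question also mentions NumPy, here is a minimal sketch without pandas, assuming the file is whitespace-separated, has one header line, and is named data.txt:
import numpy as np

ids, values = np.loadtxt("data.txt", skiprows=1, unpack=True)
unique_ids, inverse = np.unique(ids, return_inverse=True)
# sum the values per id, then divide by the number of rows per id
means = np.bincount(inverse, weights=values) / np.bincount(inverse)
print(dict(zip(unique_ids, means)))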
I have a sensor. For some reason, the sensor likes to record data like this:
>df
obs count
-0.3 3
0.9 2
1.4 5
i.e. it first records observations and makes a count table out of them. What I would like to do is convert this df into a series with the raw observations. For example, I would like to end up with: [-0.3,-0.3,-0.3,0.9,0.9,1.4,1.4 ....]
A similar question has been asked for Excel.
If your dataframe structure is like this one (or similar):
obs count
0 -0.3 3
1 0.9 2
2 1.4 5
This is an option, using numpy.repeat:
import numpy as np
import pandas as pd

# repeat each observation according to its count
df2 = pd.DataFrame({'obs': np.repeat(df['obs'].to_numpy(), df['count'])})
print(df2)
obs
0 -0.3
1 -0.3
2 -0.3
3 0.9
4 0.9
5 1.4
6 1.4
7 1.4
8 1.4
9 1.4
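A pandas-only alternative for the same expansion is to repeat the index and select (a sketch, assuming the same df):
series = df.loc[df.index.repeat(df['count']), 'obs'].reset_index(drop=True)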
Initial Problem
When I run the following in ipython
import numpy as np
import pandas as pd
df = pd.DataFrame(np.round(9*np.random.rand(4,4), decimals=1))
df.index.name = 'x'
df.columns.name = 'y'
df.to_csv('output.csv')
df
it outputs the following result:
y 0 1 2 3
x
0 7.6 7.4 0.3 7.5
1 5.6 0.0 1.5 5.9
2 7.1 2.1 0.0 0.9
3 3.7 6.6 3.3 8.4
However, when I open output.csv, the "y" is removed:
x 0 1 2 3
0 7.6 7.4 0.3 7.5
1 5.6 0 1.5 5.9
2 7.1 2.1 0 0.9
3 3.7 6.6 3.3 8.4
How do I make it so that the df.columns.name is retained when I output the dataframe to csv?
Crude workaround
My current crude workaround is to do the following:
df.to_csv('output.csv', index_label = 'x|y')
Which results in output.csv reading:
x|y 0 1 2 3
0 7.6 7.4 0.3 7.5
1 5.6 0 1.5 5.9
2 7.1 2.1 0 0.9
3 3.7 6.6 3.3 8.4
Something better would be great! Thanks for your help (in advance).
Context
This is what I am working on: https://github.com/SimonBiggs/Electron-Cutout-Factors
This is an example table: https://github.com/SimonBiggs/Electron-Cutout-Factors/blob/master/output/20140807_173714/06app06eng/interpolation-table.csv
You can pass a list to name the columns, and then specify the index name when writing to CSV:
df.columns = ['column_name1', 'column_name2', 'column_name3']
df.to_csv('/path/to/file.csv', index_label='Index_name')
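If you later read the file back and want both names restored from a combined label like the question's crude workaround, a minimal sketch (assuming the file was written with index_label='x|y'):
df2 = pd.read_csv('output.csv', index_col=0)
index_name, columns_name = df2.index.name.split('|')
df2.index.name = index_name
df2.columns.name = columns_name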
How about this? It's slightly different but hopefully usable, since it fits the CSV paradigm:
>>> df.columns = ['y{}'.format(name) for name in df.columns]
>>> df.to_csv('output.csv')
>>> print(open('output.csv').read())
x,y0,y1,y2,y3
0,3.5,1.5,1.6,0.3
1,7.0,4.7,6.5,5.2
2,6.6,7.6,3.2,5.5
3,4.0,2.8,7.1,7.8
I have a pandas dataframe with the following structure:
import numpy as np
import pandas as pd
myData = pd.DataFrame({'x': [1.2,2.4,5.3,2.3,4.1], 'y': [6.7,7.5,8.1,5.3,8.3], 'condition':[1,1,np.nan,np.nan,1],'calculation': [np.nan]*5})
print(myData)
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 NaN NaN 5.3 8.1
3 NaN NaN 2.3 5.3
4 NaN 1 4.1 8.3
I want to enter a value in the 'calculation' column based on the values in 'x' and 'y' (e.g. x/y), but only in those cells where the 'condition' column contains NaN (np.isnan(myData['condition'])). The final dataframe should look like this:
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 0.654 NaN 5.3 8.1
3 0.434 NaN 2.3 5.3
4 NaN 1 4.1 8.3
I'm happy with the idea of stepping through each row in turn using a 'for' loop and then using 'if' statements to make the calculations, but the actual dataframe I have is very large and I want to do the calculations in an array-based way. Is this possible? I guess I could calculate the value for all rows and then delete the ones I don't want, but this seems like a lot of wasted effort (the NaNs are quite rare in the dataframe) and, in some cases where 'condition' equals 1, the calculation cannot be made due to division by zero.
Thanks in advance.
Use where and pass your condition to it; this will keep the calculated value only in the rows that meet the condition:
In [117]:
myData['calculation'] = (myData['x']/myData['y']).where(myData['condition'].isnull())
myData
Out[117]:
calculation condition x y
0 NaN 1 1.2 6.7
1 NaN 1 2.4 7.5
2 0.654321 NaN 5.3 8.1
3 0.433962 NaN 2.3 5.3
4 NaN 1 4.1 8.3
EdChum's answer worked well for me! Still, I wanted to extend this thread, as I think it will be useful for other people.
Let's assume your dataframe is
c x y
0 1 1.2 6.7
1 1 2.4 7.5
2 0 5.3 8.1
3 0 2.3 5.3
4 1 4.1 8.3
and you would like to update the 0s in column c with the associated x/y:
c x y
0 1 1.2 6.7
1 1 2.4 7.5
2 0.65 5.3 8.1
3 0.43 2.3 5.3
4 1 4.1 8.3
You can do
myData['c'] = (myData['x']/myData['y']).where(cond=myData['c']==0, other=myData['c'])
or
myData['c'].where(cond=myData['c'] != 0, other=myData['x']/myData['y'], inplace=True)
In both cases, 'other' is used wherever 'cond' is not satisfied. In the second snippet, the inplace flag also works nicely (as it would in the first snippet, too).
I found these solutions in the pandas official documentation for "where" and for "indexing".
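Another common idiom for the same task is boolean indexing with .loc, which only evaluates the division on the rows where the condition holds, so it also sidesteps the division-by-zero concern from the question (a sketch, using the original myData):
mask = myData['condition'].isnull()
myData.loc[mask, 'calculation'] = myData.loc[mask, 'x'] / myData.loc[mask, 'y']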
This kind of operation is exactly what I need most of the time. I am new to pandas and it took me a while to find this useful thread. Could anyone recommend some comprehensive tutorials for practicing these types of operations? I need to "filter/group by/slice a dataframe, then apply different functions/operations to each group/slice separately or all at once, and keep it all in place." Cheers!