I have an ASCII file containing 2 columns, as follows:
id value
1 15.1
1 12.1
1 13.5
2 12.4
2 12.5
3 10.1
3 10.2
3 10.5
4 15.1
4 11.2
4 11.5
4 11.7
5 12.5
5 12.2
I want to compute the average of the "value" column for each id (i.e. group by id).
Is it possible to do that in Python using NumPy or pandas?
If you don't know how to read the file, there are several methods you could use, e.g. pd.read_csv().
Once you have read the file, you can compute the group means with the pandas functions pd.DataFrame.groupby() and pd.Series.mean():
df.groupby('id').mean()
#if df['id'] is the index, try this:
#df.reset_index().groupby('id').mean()
Output:
value
id
1 13.566667
2 12.450000
3 10.266667
4 12.375000
5 12.350000
import pandas as pd

filename = "data.txt"
df = pd.read_fwf(filename)  # read_fwf handles the whitespace-aligned columns
df.groupby(['id']).mean()
Output:
value
id
1 13.566667
2 12.450000
3 10.266667
4 12.375000
5 12.350000
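If you would rather stay in plain NumPy instead of pandas, a minimal sketch (assuming the whitespace-separated layout shown above, in a file named data.txt) could look like this:
import numpy as np

data = np.loadtxt("data.txt", skiprows=1)  # skip the "id value" header row
ids, values = data[:, 0], data[:, 1]
unique_ids = np.unique(ids)
means = np.array([values[ids == i].mean() for i in unique_ids])
print(dict(zip(unique_ids, means)))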
I am working with one dataset that contains values with different decimal places. The data and code are shown below:
import pandas as pd

data = {
    'value': [9.1, 10.5, 11.8,
              20.1, 21.2, 22.8,
              9.5, 10.3, 11.9,
              ]
}
df = pd.DataFrame(data, columns=['value'])
Which gives the following dataframe:
value
0 9.1
1 10.5
2 11.8
3 20.1
4 21.2
5 22.8
6 9.5
7 10.3
8 11.9
Now I want to add a new column with the title adjusted. I want to calculate this column with the numpy.isclose function with a tolerance of 2 (plus or minus 1). In the end I expect the results shown in the next table:
value adjusted
0 9.1 10
1 10.5 10
2 11.8 10
3 20.1 21
4 21.2 21
5 22.8 21
6 9.5 10
7 10.3 10
8 11.9 10
I tried this line, but I only get True/False results, and it only works for one value (10), not for all values.
np.isclose(df['value'], 10, atol=2)
So can anybody help me solve this problem and calculate the tolerance for the values 10 and 21 in one line?
The exact logic and how this would generalize are not fully clear. Below are two options.
Assuming you want to test your values against a list of defined references, you can use the underlying numpy array and broadcasting:
vals = np.array([10, 21])                 # reference values to test against
a = df['value'].to_numpy()
m = np.isclose(a[:, None], vals, atol=2)  # (n, 2) boolean matrix: is value i close to reference j?
df['adjusted'] = np.where(m.any(1), vals[m.argmax(1)], np.nan)  # first matching reference, else NaN
Assuming you want to group successive values, you can take the diff and start a new group whenever the difference is above the threshold. Then round and get the median per group with groupby.transform:
group = df['value'].diff().abs().gt(2).cumsum()  # new group id whenever the jump exceeds 2
df['adjusted'] = df['value'].round().groupby(group).transform('median')
Output:
value adjusted
0 9.1 10.0
1 10.5 10.0
2 11.8 10.0
3 20.1 21.0
4 21.2 21.0
5 22.8 21.0
6 9.5 10.0
7 10.3 10.0
8 11.9 10.0
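If you need to reuse the first option with other reference lists, a small helper can make the generalization explicit. This is just a sketch; the function name snap_to_references and its defaults are made up here, not part of the original answer:
import numpy as np

def snap_to_references(values, refs, atol=2):
    # Map each value to the first reference within atol; NaN where nothing matches.
    a = np.asarray(values, dtype=float)
    refs = np.asarray(refs, dtype=float)
    m = np.isclose(a[:, None], refs, atol=atol)
    return np.where(m.any(axis=1), refs[m.argmax(axis=1)], np.nan)

df['adjusted'] = snap_to_references(df['value'], [10, 21])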
Hello, I have a dataset containing 4 columns:
x y z s
1 42.8 157.5 1
1 43.8 13.5 1
1 44.8 152 2
.
.
.
4 7528 157.5 2
4 45.8 13.5 3
8 72.8 152 3
I want to split my dataframe into separate CSV files by the "s" column, but I couldn't figure out a proper way of doing it.
The "s" column has an arbitrary number of labels. We don't know how many 1's or 2's the dataset has; the values go up to 30, but not every number appears in this dataset.
My desired output is:
df1
x y z s
1 42.8 157.5 1
.
1 43.8 13.5 1
df2
1 44.8 152 2
.
4 7528 157.5 2
df3
4 45.8 13.5 3
.
8 72.8 152 3
After I get this split, I can easily write the parts to separate CSV files.
The problem I am having is that I don't know how many different "s" values there are, or how many rows each of them contains.
Thank you
Just groupby on 's' before writing to CSV to do this dynamically:
for i, x in df.groupby('s'):
    x.to_csv(f'df{i}.csv', index=False)
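If you also want to keep the pieces in memory (like df1, df2, df3 in the question) before writing them out, one option (a sketch, assuming the data is already loaded as df) is to collect the groups into a dict keyed by the value of 's':
# one sub-DataFrame per distinct value of 's'; e.g. dfs[1] holds all rows where s == 1
dfs = {s_value: group for s_value, group in df.groupby('s')}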
I have a df that looks like this, and I want to add an adj_mean column that selects the max if one of the two columns (Avg or rolling_mean) is 0, and otherwise takes the average of the two columns.
ID Avg rolling_mean adj_mean (goal to have this column)
0 5 0 5
1 6 6.3 6.15
2 5 8 6.5
3 4 0 4
I was able to get the max value of the columns using this code:
df["adj_mean"]=df[["Avg", "rolling_mean"]].max(axis=1)
but I am not sure how to take the average when both values are greater than zero.
Many thanks!
One approach is to treat 0 as NaN and then simply calculate the mean:
import numpy as np
df['adj_mean'] = df.replace({0: np.nan})[["Avg", "rolling_mean"]].mean(axis=1)
Out[1]:
rolling_mean Avg adj_mean
0 0.0 5 5.00
1 6.3 6 6.15
2 8.0 5 6.50
3 0.0 4 4.00
By default, df.mean() skips null values. Per the docs:
skipna : bool, default True
Exclude NA/null values when computing the result.
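An equivalent, more explicit formulation spells the rule out with numpy.where. This is only a sketch and assumes the values are non-negative, as in the sample data:
import numpy as np

either_zero = (df['Avg'] == 0) | (df['rolling_mean'] == 0)
df['adj_mean'] = np.where(
    either_zero,
    df[['Avg', 'rolling_mean']].max(axis=1),   # one value is 0: take the other (the max)
    df[['Avg', 'rolling_mean']].mean(axis=1),  # both non-zero: take the average
)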
The DataFrame consists of a table whose format is shown in the attached image. I apologize for not being able to type the format here; while trying to type it, it kept getting messed up because of the long decimal values, so I attached a snapshot instead.
Country names are the index of the DataFrame, and the cells contain the corresponding GDP values. The intent is to calculate, for each country, the average across all of its columns (i.e. a row-wise average). When np.average was applied:
# name of the DataFrame: GDP
def function_average():
    GDP['Average'] = np.average(GDP.iloc[:, 0:])
    return GDP

function_average()
The new column got created, but all of its values were NaN. I assumed this was probably due to inappropriately formatted cell values, so I tried truncating them with the following code:
GDP = np.round(GDP, decimals=2)
And yet there was no change in the values, although the code ran successfully and there was no error.
Please advise how to proceed in this case: should I try to make the change in the spreadsheet itself, or attempt to format the cell values in the DataFrame?
I apologize for not being able to provide any other information at this point; please let me know if any other detail is required.
The problem is that you need axis=1 to compute the mean per row, and you should change the function to numpy.nanmean or DataFrame.mean:
Sample:
import numpy as np
import pandas as pd

np.random.seed(100)
GDP = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
GDP.loc[0, 'A'] = np.nan
GDP['Average1'] = np.average(GDP.iloc[:,0:], axis=1)
GDP['Average2'] = np.nanmean(GDP.iloc[:,0:], axis=1)
GDP['Average3'] = GDP.iloc[:,0:].mean(axis=1)
print (GDP)
A B C D E Average1 Average2 Average3
0 NaN 8 3 7 7 NaN 6.25 6.25
1 0.0 4 2 5 2 2.6 2.60 2.60
2 2.0 2 1 0 8 2.6 2.60 2.60
3 4.0 0 9 6 2 4.2 4.20 4.20
4 4.0 1 5 3 4 3.4 3.40 3.40
You get NaN because there is at least one NaN value in the data:
print (np.average(GDP.iloc[:,0:]))
nan
GDP['Average'] = np.average(GDP.iloc[:,0:])
print (GDP)
A B C D E Average
0 NaN 8 3 7 7 NaN
1 0.0 4 2 5 2 NaN
2 2.0 2 1 0 8 NaN
3 4.0 0 9 6 2 NaN
4 4.0 1 5 3 4 NaN
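Applied to the function from the question, a minimal fix (just a sketch that keeps the original structure) could therefore be:
def function_average():
    # row-wise mean; DataFrame.mean skips NaN values by default
    GDP['Average'] = GDP.iloc[:, 0:].mean(axis=1)
    return GDP

function_average()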
So let's say I have a CSV file with data like so:
'time' 'speed'
0 2.3
0 3.4
0 4.1
0 2.1
1 1.3
1 3.5
1 5.1
1 1.1
2 2.3
2 2.4
2 4.4
2 3.9
I want to process this file so that for each increasing number under the header 'time', I find the max number in the speed column and return it next to the corresponding time value in an array. The actual CSV file I'm using is a lot larger, so I want to iterate over a big mass of data and not just run it where 'time' is 0, 1, or 2.
So basically I want this to return:
array([[0, 4.1], [1, 5.1], [2, 4.4]])
Using numpy specifically.
This is a bit tricky to get done in a fully vectorised way in NumPy. Here's one option:
import numpy

a = numpy.genfromtxt("a.csv", names=["time", "speed"], skip_header=1)
a.sort()  # sorts by the first field (time), then by speed within each time
unique_times = numpy.unique(a["time"])
indices = a["time"].searchsorted(unique_times, side="right") - 1  # last (largest-speed) row of each time group
result = a[indices]
This will load the data into a one-dimensional array with two fields and sort it first. The result is an array whose data are grouped by time, with the biggest speed value always last within each group. We then determine the unique time values that occur and find the rightmost entry in the array for each of them.
pandas fits nicely for this kind of stuff:
>>> from io import StringIO
>>> import pandas as pd
>>> df = pd.read_table(StringIO("""\
... time speed
... 0 2.3
... 0 3.4
... 0 4.1
... 0 2.1
... 1 1.3
... 1 3.5
... 1 5.1
... 1 1.1
... 2 2.3
... 2 2.4
... 2 4.4
... 2 3.9
... """), delim_whitespace=True)
>>> df
time speed
0 0 2.3
1 0 3.4
2 0 4.1
3 0 2.1
4 1 1.3
5 1 3.5
6 1 5.1
7 1 1.1
8 2 2.3
9 2 2.4
10 2 4.4
11 2 3.9
[12 rows x 2 columns]
Once you have the DataFrame, all you need is to group by time and aggregate by the maximum of speed:
>>> df.groupby('time')['speed'].aggregate(max)
time
0 4.1
1 5.1
2 4.4
Name: speed, dtype: float64
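If you need the result as the two-column array shown in the question rather than a Series, one way (just a sketch) is to reset the index and convert to a NumPy array:
>>> df.groupby('time')['speed'].max().reset_index().to_numpy()
array([[0. , 4.1],
       [1. , 5.1],
       [2. , 4.4]])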