I'm finding that when I add a series, based on the same time period, to an existing dataframe, it gets imported as NaNs. The dataframe has a Field column in its index, but I don't understand why that should change anything. To see the steps of my code, you can review the attached image. I hope that someone can help!
Illustration showing the dataframe that the series is inserted into and how the series gets read as NaN
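A minimal sketch that appears to reproduce the behavior (the column names 'Field', 'Date' and 'Val' are assumptions taken from the answer below): the dataframe carries a ('Field', 'Date') MultiIndex while the series is indexed by date only, so the labels do not align and the assigned column comes out as NaN.
import pandas as pd

# Assumed setup: a ('Field', 'Date') MultiIndex on the dataframe,
# and a series indexed by the dates alone
days = pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31'])
test = pd.DataFrame({'Field': ['Actual'] * 3, 'Date': days, 'Val': [1, 2, 3]}).set_index(['Field', 'Date'])
m1 = pd.Series([0, 2, 4], index=days)
test.assign(m1=m1)  # m1 is all NaN: the series' DatetimeIndex does not match the MultiIndex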
Assuming that the value in the Field Index column is "actual" for every row, a solution could be the following:
test.reset_index().set_index('Date').assign(m1=m1)
That solution works, but it can be done more concisely:
import pandas as pd

days = pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31'])
df = pd.DataFrame({'Field': ['Actual'] * 3, 'Date': days, 'Val': [1, 2, 3]}).set_index(['Field', 'Date'])
m1 = pd.Series([0, 2, 4], index=days)
df.reset_index(level='Field').assign(m1=m1)
Field Val m1
Date
2018-01-31 Actual 1 0
2018-02-28 Actual 2 2
2018-03-31 Actual 3 4
By the way, that would be a nice MCVE (minimal, complete, verifiable example).
Related
I have a dataframe with about 11 million rows. There are multiple columns, but I am interested in only 2 of them: TagName and Samples_Value. One tag can repeat itself multiple times among rows. I want to calculate the average value for each tag and create a new dataframe with those per-tag averages. I don't really know how to walk through rows or how to calculate the average. Any help will be highly appreciated. Thank you!
Name DataType TimeStamp Value Quality
Food Float 2019-01-01 13:00:00 105.75 122
Food Float 2019-01-01 17:30:00 11.8110352 122
Food Float 2019-01-01 17:45:00 12.7932892 122
Water Float 2019-01-01 14:01:00 16446.875 122
Water Float 2019-01-01 14:00:00 146.875 122
RangeIndex: 11140487 entries, 0 to 11140486
Data columns (total 6 columns):
Name object
Value object
This is what I have, and I know it is really noobish, but I am having a difficult time walking through rows.
for i in range(0, len(df)):
    if df.iloc[i]['DataType'] != 'Undefined':
        print(df.loc[df['Name'] == df.iloc[i]['Name'], df.iloc[i]['Value']].mean())
It sounds like the groupby() functionality is what you want. You define the column where your groups are and then you can take the mean() of each group. An example from the documentation:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5],
                   'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])
df.groupby('A').mean()
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
In your case it would be something like this:
df.groupby('TagName')['Samples_value'].mean()
Edit: So, I applied the code to your provided input dataframe, and the following is the output:
TagName
Steam 1.081447e+06
Utilities 3.536931e+05
Name: Sample_value, dtype: float64
Is this what you are looking for?
You don't need to walk through the rows; you can just take all of the fields that match your criteria:
import pandas as pd

d = {'col1': [1, 2, 1, 2, 1, 2], 'col2': [3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data=d)

# iterate over all unique entries in col1
for entry in df["col1"].unique():
    # get all the col2 values where col1 equals the current entry, and average them
    mean_of_current_entry = df[df["col1"] == entry]["col2"].mean()
    print(mean_of_current_entry)
This is not a full solution, but I think it helps to understand the necessary logic. You still need to wrap it up into your own dataframe (see the sketch below), but it hopefully shows how to use the indexing.
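To give an idea of that wrapping-up step, here is one way the per-group means could be collected into a new dataframe (just a sketch, reusing the col1/col2 names from the snippet above):
import pandas as pd

d = {'col1': [1, 2, 1, 2, 1, 2], 'col2': [3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data=d)

# collect one row per unique col1 entry, together with its mean of col2
rows = []
for entry in df["col1"].unique():
    rows.append({"col1": entry, "mean_col2": df[df["col1"] == entry]["col2"].mean()})
result = pd.DataFrame(rows)
print(result)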
You should avoid iterating over the rows of a dataframe as much as possible, because it is very inefficient...
groupby is the way to go when you want to apply the same processing to various groups of rows identified by their values in one or more columns. Here, what you want is the following (each step is detailed below):
df.groupby('TagName')['Sample_value'].mean().reset_index()
it gives as expected:
TagName Sample_value
0 Steam 1.081447e+06
1 Utilities 3.536931e+05
Details on the magic words (a small worked example follows the list):
groupby: identifies the column(s) used to group the rows (rows with the same values form one group)
['Sample_value']: restricts the groupby object to the column of interest
mean(): computes the mean per group
reset_index(): by default the grouping columns go into the index, which is fine for the mean operation; reset_index() turns them back into regular columns
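To make those steps concrete, here is what each piece produces on a tiny made-up dataframe (the values are invented; only the TagName/Sample_value names come from the answer above):
import pandas as pd

df = pd.DataFrame({'TagName': ['Steam', 'Steam', 'Utilities'],
                   'Sample_value': [10.0, 20.0, 30.0]})

grouped = df.groupby('TagName')             # GroupBy object, one group per tag
means = grouped['Sample_value'].mean()      # Series indexed by TagName: Steam 15.0, Utilities 30.0
flat = means.reset_index()                  # TagName becomes a regular column again
print(flat)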
I have a dataframe where:
one of the columns is a Date column.
another column is X, and that column has missing values.
I want to fill column X for a specific range of dates.
So far I have got to this code:
df[df['Date'] < datetime.date(2017,1,1)]['X'].fillna(1,inplace=True)
But it does not work; I am not getting an error, but the data isn't filled.
Another point: it looks messy; maybe there is a better way.
Thanks for the help.
First, you need to create your data frame:
import pandas as pd
df = pd.DataFrame({'Date': ['2016-01-01', '2018-01-01']})
df['Date'] = pd.to_datetime(df['Date'])
Next, you can conditionally set the column value:
df.loc[df['Date'] < '2017-01-01','X'] = 1
The result would be like this:
Date X
0 2016-01-01 1.0
1 2018-01-01 NaN
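If the intent is to fill only the missing values of X in that date range (rather than overwriting existing ones), a variant along these lines might be closer to the original fillna idea (a sketch, assuming the same column names):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2016-01-01', '2016-06-01', '2018-01-01']),
                   'X': [np.nan, 5.0, np.nan]})

# fill X with 1 only where the date is before 2017 and X is missing
mask = (df['Date'] < '2017-01-01') & df['X'].isna()
df.loc[mask, 'X'] = 1
print(df)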
How do you get the last (or "nth") column in a DataFrame?
I tried several different articles such as 1 and 2.
df = pd.read_csv(csv_file)
col=df.iloc[:,0] #returns Index([], dtype='object')
col2=df.iloc[:,-1] #returns the whole dataframe
col3=df.columns[df.columns.str.startswith('c')] #returns Index([], dtype='object')
The commented-out parts after the code are what I am getting after a print. Most of the time I am getting things like Index([], dtype='object').
Here is what df prints:
date open high low close
0 0 2019-07-09 09:20:10 296.235 296.245 296...
1 1 2019-07-09 09:20:15 296.245 296.245 296...
2 2 2019-07-09 09:20:20 296.235 296.245 296...
3 3 2019-07-09 09:20:25 296.235 296.275 296...
df.iloc is able to refer to both rows and columns. If you only input one integer, it will automatically refer to a row. You can mix the indexer types for the index and columns. Use : to select the entire axis.
df.iloc[:,-1:] will print out all of the final column.
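For reference, a small sketch of how iloc picks out columns (the dataframe here is made up):
import pandas as pd

df = pd.DataFrame({'date': ['2019-07-09', '2019-07-10'],
                   'open': [296.235, 296.245],
                   'close': [296.245, 296.255]})

last_as_series = df.iloc[:, -1]    # last column as a Series
last_as_frame = df.iloc[:, -1:]    # last column kept as a one-column DataFrame
second_col = df.iloc[:, 1]         # "nth" column (here n = 1, the second column)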
I have a csv that has incorrect datetime formatting. I've worked out how to convert those values into the format I need, but now I need to reassign all the values in a column to the new, converted values.
For example, I'm hoping that there's something I can put into the following FOR loop that will insert the values back into the dataframe at the correct location:
for i in df[df.columns[1]]:
    t = pd.Timestamp(i)
    short_date = t.date().strftime('%m/%d/%Y').lstrip('0')
    # Insert back into dataframe?
As always, your help is very much appreciated!
Part of the column in question:
Created Date
2019-02-27 22:55:16
2019-01-29 22:57:12
2018-11-29 00:13:31
2019-01-30 21:35:15
2018-12-20 21:14:45
2018-11-01 16:20:15
2019-04-11 16:38:07
2019-01-24 00:23:17
2018-12-21 19:30:10
2018-12-19 22:33:04
2018-11-07 19:54:19
2019-05-10 21:15:00
In the simplest, but most instructive, possible terms:
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df
# x y
# 0 1 4
# 1 2 5
# 2 3 6
df[:] = df[:].astype(float)
df
# x y
# 0 1.0 4.0
# 1 2.0 5.0
# 2 3.0 6.0
Let pandas do the work for you.
Or, for only one column:
df.x = df.x.astype(float)
df
# x y
# 0 1.0 4
# 1 2.0 5
# 2 3.0 6
You'll, of course, replace astype(float) with .date().strftime('%m/%d/%Y').lstrip('0').
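Since that chain works on an individual Timestamp rather than on a whole Series, one way to apply it column-wide might be apply (a sketch, not necessarily the only option; the column name is taken from the question):
import pandas as pd

df = pd.DataFrame({'Created Date': pd.to_datetime(['2019-02-27 22:55:16',
                                                   '2018-11-29 00:13:31'])})

# apply the per-Timestamp conversion to every element of the column
df['Created Date'] = df['Created Date'].apply(
    lambda t: t.date().strftime('%m/%d/%Y').lstrip('0'))
print(df)  # 2/27/2019, 11/29/2018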
To reassign a column, no need for a loop. Something like this should work:
df["column"] = new_column
new_column is either a Series of matching length, or something that can be broadcast¹ to that length. You can find more details in the docs.
That said, if pd.Timestamp can already parse your data, there is no need for "formatting". The formatting is not associated with a timestamp instance. You can choose a particular formatting when you are converting to string with something like df["timestamp"].dt.strftime("%m/%d/%Y").
On the other hand, if you want to change the precision of your timestamp, you can do something like this:
df["timestamp"] = df["timestamp"].astype("datetime64[D]")
Here, all time information will be rounded to a resolution of days. The letter between the [ and ] is the resolution. Again, all this and more is discussed in the docs.
¹ Broadcasting is a concept from numpy where you can operate on arrays with different but compatible shapes. Again, everything is covered in the docs.
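For instance, in the column-assignment case above, a scalar broadcasts to the full length of the dataframe (a minimal sketch):
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
df['flag'] = 0                        # a scalar is broadcast to every row
df['y'] = pd.Series([10, 20, 30])     # a Series of matching length works as well
print(df)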
Thank you all for your help. All of the answers were helpful, but the answer I ended up using was as follows:
import pandas as pd
df[df.columns[0]] = pd.to_datetime(df[df.columns[0]]).dt.strftime('%m/%d/%Y')
My pandas dataframe has a column "timeStamp", the elements of which are of type datetime.datetime. I'm trying to obtain the difference between two consecutive rows of this column, in seconds, to get the time spent. I use the following piece of code for it:
df["Time"] = df["timeStamp"].diff(0).dt.total_seconds()
Generally it's working fine; however, I keep getting 0.0 as the result of this operation in quite a few instances, even when the two timestamps actually differ.
Example values that result in 0.0:
import pandas as pd
import datetime
import numpy as np
df = pd.DataFrame({'S.No.': [1, 2, 3, 4],
                   'ABC': [datetime.datetime(2019, 2, 25, 11, 49, 50),
                           datetime.datetime(2019, 2, 25, 11, 50, 0),
                           datetime.datetime(2019, 2, 25, 11, 50, 7),
                           datetime.datetime(2019, 2, 25, 11, 50, 12)]})
df["Time"] = df["ABC"].diff(0).dt.seconds
print df
Note: using python2.7
Try this:
print(df["timestamp"].diff().fillna(0).dt.seconds)
0 0
1 10
2 7
3 5
df['difference']=df["timestamp"].diff().fillna(0).dt.seconds
print(df)
timestamp difference
0 2019-02-25 11:49:50 0
1 2019-02-25 11:50:00 10
2 2019-02-25 11:50:07 7
3 2019-02-25 11:50:12 5
Use
df["Time"] = df["timeStamp"].diff().dt.total_seconds()
instead.
The argument to diff specifies the number of rows above the current row with which you want to calculate the difference. You're passing 0, so you're subtracting each value from itself, which will always give 0. By leaving it empty, it uses the default value of 1, i.e. the difference with the row 1 above.
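A quick sketch on the example data from the question shows the difference between diff(0) and diff():
import datetime
import pandas as pd

df = pd.DataFrame({'ABC': [datetime.datetime(2019, 2, 25, 11, 49, 50),
                           datetime.datetime(2019, 2, 25, 11, 50, 0),
                           datetime.datetime(2019, 2, 25, 11, 50, 7)]})

print(df['ABC'].diff(0).dt.total_seconds())  # 0.0, 0.0, 0.0 -- each row minus itself
print(df['ABC'].diff().dt.total_seconds())   # NaN, 10.0, 7.0 -- each row minus the previous row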