I am running a for loop over each of the 12 months. For each month I get a bunch of dates in random order, spanning various years of history, along with the corresponding temperature data for those dates. For example, when the loop is on January, all the dates and temperatures I get from history are for January only.
I want to start with an empty pandas DataFrame with two columns, 'Dates' and 'Temperature'. As the loop progresses I want to append each month's dates to the 'Dates' column and the corresponding data to the 'Temperature' column.
Once the DataFrame is ready, I want to use the 'Dates' column as the index to sort the available 'Temperature' history, so that I end up with correctly sorted historical dates and their temperatures.
I have thought about using numpy, storing the dates and data in two separate arrays, sorting the dates, and then sorting the temperatures using some kind of index. However, I believe this will be better implemented in pandas, perhaps with its pivot-table feature.
@Zanam, please refer to this syntax. I think your question is similar to this answer:
from random import randint

from pandas import DataFrame

df = DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    df.loc[i] = [randint(-1, 1) for n in range(3)]
print(df)
   lib  qty1  qty2
0    0     0    -1
1   -1    -1     1
2    1    -1     1
3    0     0     0
4    1    -1    -1

[5 rows x 3 columns]
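Applied to the month-by-month loop described in the question, a cleaner pattern is to collect each month's data in a list, build the frame once, and then sort by the date index. A minimal sketch, where `get_history` is a hypothetical stand-in for the per-month history lookup:

```python
import pandas as pd

# Hypothetical stand-in for the per-month history lookup described above.
def get_history(month):
    dates = {1: ['2003-01-15', '2001-01-03'], 2: ['2002-02-10']}
    temps = {1: [30.1, 28.5], 2: [33.0]}
    return dates[month], temps[month]

frames = []
for month in (1, 2):
    dates, temps = get_history(month)
    frames.append(pd.DataFrame({'Dates': pd.to_datetime(dates),
                                'Temperature': temps}))

# Build the frame once, then use 'Dates' as the index and sort it.
df = pd.concat(frames, ignore_index=True)
df = df.set_index('Dates').sort_index()
```

Appending one row at a time with `df.loc[i]` works, but collecting and concatenating once is much faster for large histories.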
I have a data frame with measurements for several groups of participants, and I am doing some calculations for each group. I want to add a column to a big data frame (all participants) from secondary data frames (each with a partial list of participants).
When I merge a new data frame into the existing one a couple of times, it creates duplicates of the column instead of a single column.
Since the dataframes differ in size, I cannot compare them directly.
I tried
# df1 - main, bigger dataframe; df2 - smaller dataframe containing a group of df1
for i in range(len(df1)):
    # checking indices to place the data with the correct participant:
    if df1.index[i] not in df2['index']:
        pass
    else:
        df1['rate'][i] = list(df2['rate'][df2['index'] == i])
It does not work properly, though. Can you please help with the correct way of assembling the column?
Update: where the index of the main dataframe matches the "index" column of the calculated dataframe, copy the rate value from the calculation into the main df.
main dataframe df1
index  rate
1      0
2      0
3      0
4      0
5      0
6      0
dataframe with calculated values
index  rate
1      value
4      value
6      value
output df
index  rate
1      value
2      0
3      0
4      value
5      0
6      value
Try this – using .join() to merge dataframes on their indices and combining two columns using .combine_first():
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
EDIT:
This assumes both dataframes use a matching index. If that is not the case for df2, run this first:
df2 = df2.set_index('index')
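A minimal end-to-end sketch of this approach, with made-up numeric rates standing in for the computed "value" entries:

```python
import pandas as pd

# Main frame: all participants, rate initialised to 0.
df1 = pd.DataFrame({'rate': [0, 0, 0, 0, 0, 0]}, index=[1, 2, 3, 4, 5, 6])
# Calculated frame: only some participants (made-up rates 10, 40, 60).
df2 = pd.DataFrame({'rate': [10, 40, 60]}, index=[1, 4, 6])

# Join on the index; the overlapping 'rate' columns get suffixes.
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
# Prefer the calculated rate, falling back to the original where it is NaN.
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
```

The intermediate `rate_df1` / `rate_df2` columns can then be dropped with `df.drop(columns=["rate_df1", "rate_df2"])`.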
I have a dataframe consisting of a date column and other columns.
As a sample, see below:
a = pd.DataFrame({'Date':['2021/2/21', '2021/2/20','2021/3/5','2021/5/30'],
'Number':[2,4,6,9]})
a
Date Number
0 2021/2/21 2
1 2021/2/20 4
2 2021/3/5 6
3 2021/5/30 9
a['Date'].dtypes
Object
Neither of the following got me the subset:
a = a[a['Date'] > '20/02/2021']
[x for x in a['Date'] if x > '20/02/2021' ]
How can I get the subset?
Use pd.to_datetime to standardize the date column:
a['Date'] = pd.to_datetime(a.Date)
Then compare using .ge(), i.e. greater than or equal:
a['Date'].ge('2021/02/21')
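Putting the two steps together on the sample frame, using the boolean mask to select the rows:

```python
import pandas as pd

a = pd.DataFrame({'Date': ['2021/2/21', '2021/2/20', '2021/3/5', '2021/5/30'],
                  'Number': [2, 4, 6, 9]})

# Convert the string column to real datetimes, then filter with a boolean mask.
a['Date'] = pd.to_datetime(a.Date)
subset = a[a['Date'].ge('2021/02/21')]
```

Note that the original string comparison failed because '20/02/2021' is day-first text, and lexicographic comparison of mixed-format date strings is meaningless; comparing real datetimes avoids that.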
I have a large dataset (df) with lots of columns, and I am trying to get the total number of rows for each day.
|datetime|id|col3|col4|col...
1 |11-11-2020|7|col3|col4|col...
2 |10-11-2020|5|col3|col4|col...
3 |09-11-2020|5|col3|col4|col...
4 |10-11-2020|4|col3|col4|col...
5 |10-11-2020|4|col3|col4|col...
6 |07-11-2020|4|col3|col4|col...
I want my result to be something like this
  |datetime|id|col3|col4|col...|Count
6 |07-11-2020|4|col3|col4|col...| 1
3 |09-11-2020|5|col3|col4|col...| 1
2 |10-11-2020|5|col3|col4|col...| 1
4 |10-11-2020|4|col3|col4|col...| 2
1 |11-11-2020|7|col3|col4|col...| 1
I tried to use a daily grouper like this: df = df.groupby(['id','col3', pd.Grouper(key='datetime', freq='D')]).sum().reset_index(), and below is my result. I am still new to programming and pandas; I have read the pandas docs but am still unable to do it.
|datetime|id|col3|col4|col...
6 |07-11-2020|4|col3|1|0.0
3 |09-11-2020|5|col3|1|0.0
2 |10-11-2020|5|col3|1|0.0
4 |10-11-2020|4|col3|2|0.0
1 |11-11-2020|7|col3|1|0.0
Try this:
df = df.groupby(['datetime','id','col3']).count()
If you want the count values for all columns based only on the date, then:
df.groupby('datetime').count()
And you'll get a DataFrame that has the datetime as the index, with each cell representing the number of entries for that given index.
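A minimal sketch of both groupings on data shaped like the sample (the col3/col4 values are made up placeholders):

```python
import pandas as pd

df = pd.DataFrame({'datetime': ['11-11-2020', '10-11-2020', '09-11-2020',
                                '10-11-2020', '10-11-2020', '07-11-2020'],
                   'id': [7, 5, 5, 4, 4, 4],
                   'col3': ['x', 'x', 'x', 'x', 'x', 'x'],
                   'col4': [1, 1, 1, 1, 1, 1]})

# Rows per (datetime, id, col3) group; count() tallies the non-null
# entries of the remaining columns (here col4).
counts = df.groupby(['datetime', 'id', 'col3']).count()

# Rows per day only.
per_day = df.groupby('datetime').count()
```

`.size()` is an alternative to `.count()` when you want one plain row count per group rather than per-column non-null counts.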
Date Sub Value
10/24/2020 A 1
9/18/2020 A 2
9/21/2020 A 3
9/13/2020 A 4
9/20/2020 A 5
I want to extract the row with the latest date from the dataframe.
I was using the following, but the output is different:
df = df.Date.max()
Output: 2020-10-24 00:00:00.
The output which I am looking for is:
Date Sub Value
10/24/2020 A 1
To get multiple rows matching the same max value, you can do this:
In [2679]: df[df.Date == df.Date.max()]
Out[2679]:
Date Sub Value
0 2020-10-24 A 1
Use Series.idxmax with DataFrame.loc and [[]] to output a one-row DataFrame — this gets only one row, for the first maximal datetime:
df1 = df.loc[[df.Date.idxmax()]]
Or use boolean indexing, comparing against the max — this returns multiple rows if more than one row has the max value:
df1 = df[df.Date.eq(df.Date.max())]
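Both approaches, demonstrated on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['10/24/2020', '9/18/2020', '9/21/2020',
                                           '9/13/2020', '9/20/2020']),
                   'Sub': ['A'] * 5,
                   'Value': [1, 2, 3, 4, 5]})

latest_one = df.loc[[df.Date.idxmax()]]     # one-row DataFrame (first max only)
latest_all = df[df.Date.eq(df.Date.max())]  # all rows that tie for the max
```

Here the two results are identical because only one row carries the latest date; they diverge only when the max date repeats.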
How do you get the last (or "nth") column of a DataFrame?
I tried several different articles such as 1 and 2.
df = pd.read_csv(csv_file)
col=df.iloc[:,0] #returns Index([], dtype='object')
col2=df.iloc[:,-1] #returns the whole dataframe
col3=df.columns[df.columns.str.startswith('c')] #returns Index([], dtype='object')
The comments after each line show what I get when I print the result. Most of the time I get things like Index([], dtype='object').
Here is what df prints:
date open high low close
0 0 2019-07-09 09:20:10 296.235 296.245 296...
1 1 2019-07-09 09:20:15 296.245 296.245 296...
2 2 2019-07-09 09:20:20 296.235 296.245 296...
3 3 2019-07-09 09:20:25 296.235 296.275 296...
df.iloc can refer to both rows and columns. If you pass only one indexer, it refers to rows. You can mix indexer types for the rows and columns; use : to select an entire axis.
df.iloc[:, -1:] will return all of the final column.
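For comparison, `df.iloc[:, -1]` returns the final column as a Series, while `df.iloc[:, -1:]` keeps it as a one-column DataFrame. A sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2019-07-09', '2019-07-10'],
                   'open': [296.235, 296.245],
                   'close': [296.4, 296.5]})

last_series = df.iloc[:, -1]    # final column as a Series
last_frame = df.iloc[:, -1:]    # final column as a one-column DataFrame
```

If `df.iloc[:, 0]` genuinely prints an empty Index in your case, check that `pd.read_csv` actually parsed rows (e.g. inspect `df.shape`); an empty frame makes every column selection look empty.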