I'm sorry for not posting the data, but it wouldn't really help. The thing is, I need to make a graph and I have a CSV file full of information organised by date. It has 'Cases', 'Deaths', 'Recoveries', 'Critical', 'Hospitalized', and 'States' as categories. It goes in order by date and has the number of cases, deaths, and recoveries per day for each state. How do I sum these categories to make a graph that shows how the total is increasing? I really have no idea how to start. Below are some numbers that try to explain what I have.
0 2020-02-20 1 Andalucía NaN NaN NaN
1 2020-02-20 2 Aragón NaN NaN NaN
2 2020-02-20 3 Asturias NaN NaN NaN
3 2020-02-20 4 Baleares 1.0 NaN NaN
4 2020-02-20 5 Canarias 1.0 NaN NaN
.. ... ... ... ... ... ...
888 2020-04-06 19 Melilla 92.0 40.0 3.0
889 2020-04-06 14 Murcia 1283.0 500.0 84.0
890 2020-04-06 15 Navarra 3355.0 1488.0 124.0
891 2020-04-06 16 País Vasco 9021.0 4856.0 417.0
892 2020-04-06 17 La Rioja 2846.0 918.0 66.0
It's unclear exactly what you mean by "sum these categories". I'm assuming you mean that for each date, you want to sum the values across all the different regions to come up with the total values for Spain?
In that case, you will want to groupby date and then .sum() the columns (you can drop the States category).
grouped_df = df.groupby("date")[["Cases", "Deaths", "Recoveries", "Critical", "Hospitalized"]].sum()
grouped_df.plot()  # "date" is already the index after the groupby
This snippet will probably not work directly; you may need to reformat the dates, etc., but it should be enough to get you started.
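As a rough, hedged sketch of the full pipeline (the file name covid_spain.csv is a placeholder, the column names are the ones listed in the question, and whether you need the cumulative sum depends on whether your per-state numbers are daily counts or already cumulative):
import pandas as pd
import matplotlib.pyplot as plt

# "covid_spain.csv" is a placeholder file name; use your own path.
df = pd.read_csv("covid_spain.csv", parse_dates=["date"])

cols = ["Cases", "Deaths", "Recoveries", "Critical", "Hospitalized"]

# Sum across all states for each date.
totals = df.groupby("date")[cols].sum()

# If the per-state numbers are daily new counts, take a running total so the
# curve shows how the totals grow; drop .cumsum() if they are already cumulative.
totals.cumsum().plot()
plt.show()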
I think you are looking for a groupby on the date followed by a sum over the other columns (add .cumsum() afterwards if you want a running total).
columns_to_group = ['Cases', 'Deaths', 'Recoveries',
                    'Critical', 'Hospitalized', 'date']
new_columns = ['Cases_sum', 'Deaths_sum', 'Recoveries_sum',
               'Critical_sum', 'Hospitalized_sum']

df_grouped = df[columns_to_group].groupby('date').sum().reset_index()
df_grouped.columns = ['date'] + new_columns  # rename the summed columns
For plotting, seaborn provides an easy function:
import seaborn as sns
df_melted = df_grouped.melt(id_vars=["date"])
sns.lineplot(data=df_melted, x='date', y='value', hue='variable')
I have a dataframe like this (the real DF has 94 columns and 40 rows):
NAME       TIAS          EFGA          SOE           KERA          CODE  SURVIVAL
SOAP corp  1.391164e+10  1.265005e+10  0.000000e+00  186522000.0   366   21
NiANO inc  42673.0       0.0           0.0           42673.0       366   3
FFS jv     9.523450e+05  NaN           NaN           8.754379e+09  737   4
KELL Corp  1.045967e+07  9.935970e+05  0.000000e+00  NaN           737   4
Os inc     7.732654e+10  4.046270e+07  1.391164e+10  8.754379e+09  737   4
I need to compute correlations for each group in the frame, grouped by CODE. The target value is the SURVIVAL column.
I tried this:
df = df.groupby('CODE').corr()[['SURVIVAL']]
but it returns something like this:
               SURVIVAL
CODE
366  TIAS           NaN
     EFGA           NaN
     SOE            NaN
     KERA           NaN
     SURVIVAL       NaN
737  TIAS           NaN
     EFGA           NaN
     SOE            NaN
     KERA           NaN
     SURVIVAL       NaN
Why is it NaN in all columns?
I tried to fill the NaNs in the DataFrame with mean values before computing the correlations:
df = df.fillna(df.mean())
or to drop them, but it does not help.
However, when I compute the correlation on the whole DataFrame without any modifications, like this:
df.corr()[['SURVIVAL']]
everything works fine and I get correlations, not NaNs.
All types are float64 and int64.
Is there a way to get correlations by group without NaNs? I have no idea why it works on the whole DataFrame but not within the groups.
Thank you in advance for help!
You can do it this way
df = df.groupby('CODE')[['SURVIVAL']].corr()
Try this:
survival_corr = lambda x: x.corrwith(x['SURVIVAL'])
by_code = df.groupby('CODE')
by_code.apply(survival_corr)
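One likely reason for the NaNs: within a group, Pearson correlation is undefined whenever a column has zero variance (is constant) or too few paired non-NaN values, and pandas returns NaN in that case. In the small sample shown, SURVIVAL is constant (4) for every row with CODE 737, which alone would make all correlations with it NaN for that group; the same thing can happen in the full 40-row frame for any column that is constant inside a group. A minimal sketch with hypothetical toy data illustrating the corrwith-per-group approach and the zero-variance case:
import pandas as pd

# Hypothetical toy data: SURVIVAL is constant within CODE 737,
# so every correlation with it comes out as NaN (zero variance).
df = pd.DataFrame({
    "CODE":     [366, 366, 737, 737, 737],
    "TIAS":     [1.0, 2.0, 3.0, 5.0, 8.0],
    "SURVIVAL": [21,  3,   4,   4,   4],
})

survival_corr = lambda x: x.corrwith(x["SURVIVAL"])
print(df.groupby("CODE").apply(survival_corr))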
I have seen many methods like concat, join, and merge, but I am missing the technique for my simple dataset.
I have two datasets that look like this:
dates.csv
2020-07-06
2020-07-07
2020-07-08
2020-07-09
2020-07-10
.....
...
...
mydata.csv
Expected,Predicted
12990,12797.578628473471
12990,12860.382061836583
12990,12994.159035827917
12890,13019.073929662367
12890,12940.34108357684
.............
.......
.....
I want to combine these two datasets, which have the same number of rows in both CSV files. I tried the concat method, but I see NaNs:
delete = pd.read_csv("dates.csv", header=None)  # dates.csv has no header row
data1 = pd.read_csv("mydata.csv")
result = pd.concat([delete, data1], axis=0, ignore_index=True)
print(result)
Output:
0 Expected Predicted
0 2020-07-06 NaN NaN
1 2020-07-07 NaN NaN
2 2020-07-08 NaN NaN
3 2020-07-09 NaN NaN
4 2020-07-10 NaN NaN
.. ... ... ...
307 NaN 10999.0 10526.433098
308 NaN 10999.0 10911.247147
309 NaN 10490.0 11038.685328
310 NaN 10490.0 10628.204624
311 NaN 10490.0 10632.495169
[312 rows x 3 columns]
I don't want all these NaNs.
Thanks for your help!
You could use the .join() method from pandas:
delete = pd.read_csv("dates.csv", header=None)
data1 = pd.read_csv("mydata.csv")
result = delete.join(data1)
If your two dataframes follow the same row order, you can use the join method mentioned by Nik; by default it joins on the index.
Otherwise, if you have a key that you can join your dataframes on, you can specify it like this:
joined_data = first_df.join(second_df, on=key)
Your first_df should then contain that key column, and second_df should be indexed by the matching key (use pd.merge instead if both dataframes share a column with the same name).
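For the concrete files in the question (dates.csv with no header row and mydata.csv), a minimal sketch of combining them side by side, assuming the rows are already aligned by position:
import pandas as pd

# dates.csv has no header row, so name its single column explicitly.
dates = pd.read_csv("dates.csv", header=None, names=["Date"])
data = pd.read_csv("mydata.csv")

# Join on the identical default RangeIndex...
combined = dates.join(data)
# ...or, equivalently, concatenate column-wise instead of row-wise.
combined = pd.concat([dates, data], axis=1)

print(combined.head())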
I am quite new to pandas and have been stuck on this issue for weeks, so as a last resort I have come to this forum.
Below is my dataframe:
S2Rate S2BillDate Sale Average Total Sale
0 20.00 2019-05-18 20.000000 20.00
1 15.00 2019-05-18 26.250000 420.00
2 15.00 2019-05-19 36.000000 180.00
3 7.50 2019-05-19 34.500000 172.50
4 7.50 2019-05-21 32.894737 625.00
I am trying to plot a graph where the primary y-axis will have the S2Rate and the secondary y-axis will have the Sale Average, but I would like my x-axis to have the date. For that I will need my df to look like this (below):
S2Rate S2BillDate Sale Average Total Sale
0 20.00 2019-05-18 20.000000 20.00
1 15.00 2019-05-18 to 2019-05-19 31.1250000 600.00
2 7.50 2019-05-19 to 2019-05-21 33.690000 797.50
That is, for S2Rate 15 the min date is 2019-05-18 and the max date is 2019-05-19, so it needs to pick the min and max date for each S2Rate being grouped, because there can be situations where the same S2Rate spans many days.
Can anyone guide me towards this? Please don't mistake this for directly asking for code; even pointing me to the right concepts will do. I have no clue how to proceed further.
Any help is much appreciated. TIA!
Since S2Rate values can recur, consecutive runs of the same S2Rate must be identified first. This can be done with a diff-cumsum trick. Skip this step if you'd rather group by all rows with the same S2Rate, whether consecutive or not.
# identify consecutive groups of S2Rate
df["S2RateGroup"] = (df["S2Rate"].diff() != 0).cumsum()
df
Out[268]:
S2Rate S2BillDate Sale Average Total Sale S2RateGroup
0 20.0 2019-05-18 20.000000 20.0 1
1 15.0 2019-05-18 26.250000 420.0 2
2 15.0 2019-05-19 36.000000 180.0 2
3 7.5 2019-05-19 34.500000 172.5 3
4 7.5 2019-05-21 32.894737 625.0 3
Next, write a custom function that produces the date-range label and pass it to .agg() using named aggregation:
def date_agg(col):
    dmin = col.min()
    dmax = col.max()
    return f"{dmin} to {dmax}" if dmax > dmin else f"{dmin}"
df.groupby("S2RateGroup").agg( # or .groupby("S2Rate")
s2rate=pd.NamedAgg("S2Rate", np.min),
date=pd.NamedAgg("S2BillDate", date_agg),
sale_avg=pd.NamedAgg("Sale Average", np.mean),
total_sale=pd.NamedAgg("Total Sale", np.sum)
)
# result
Out[270]:
s2rate date sale_avg total_sale
S2RateGroup
1 20.0 2019-05-18 20.000000 20.0
2 15.0 2019-05-18 to 2019-05-19 31.125000 600.0
3 7.5 2019-05-19 to 2019-05-21 33.697368 797.5
Since you are new to pandas, it would also be helpful to go through the official how-to.
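To get from that result to the plot described in the question (S2Rate on the primary y-axis, Sale Average on the secondary one), a rough sketch, assuming the aggregated frame above has been stored in a variable named grouped (a hypothetical name):
import matplotlib.pyplot as plt

# "grouped" is the aggregated frame from above (hypothetical variable name).
ax = grouped.plot(x="date", y="s2rate", marker="o")
grouped.plot(x="date", y="sale_avg", marker="o", secondary_y=True, ax=ax)
ax.set_ylabel("s2rate")
plt.show()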
I am new to time-series programming with Pandas.
Here is the sample data:
date 2 3 4 ShiftedPrice
0 2017-11-05 09:20:01.134 2123.0 12.23 34.12 300.0
1 2017-11-05 09:20:01.789 2133.0 32.43 45.62 330.0
2 2017-11-05 09:20:02.238 2423.0 35.43 55.62 NaN
3 2017-11-05 09:20:02.567 3423.0 65.43 56.62 NaN
4 2017-11-05 09:20:02.948 2463.0 45.43 58.62 NaN
I would like to plot date vs. ShiftedPrice for all rows where the ShiftedPrice column has no missing value. You can assume that there are absolutely no missing values in the date column. Please help me with this.
You can first remove the rows with NaNs using dropna:
df = df.dropna(subset=['ShiftedPrice'])
And then plot:
df.plot(x='date', y='ShiftedPrice')
All together:
df.dropna(subset=['ShiftedPrice']).plot(x='date', y='ShiftedPrice')
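If the date column is still stored as strings, it may also help to convert it to real datetimes first so the x-axis is a proper time axis; a small sketch, assuming the frame is called df as above:
# Convert the string timestamps to datetimes so the x-axis is time-ordered.
df["date"] = pd.to_datetime(df["date"])
df.dropna(subset=["ShiftedPrice"]).plot(x="date", y="ShiftedPrice")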
I've got a `DataFrame` which has occasional missing values, and looks something like this:
Monday Tuesday Wednesday
================================================
Mike 42 NaN 12
Jenna NaN NaN 15
Jon 21 4 1
I'd like to add a new column to my data frame where I'd calculate the average across all columns for every row.
Meaning, for Mike, I'd need
(df['Monday'] + df['Wednesday'])/2, but for Jenna, I'd simply use df['Wednesday']/1.
Does anyone know the best way to account for this variation that results from missing values and calculate the average?
You can simply:
df['avg'] = df.mean(axis=1)
Monday Tuesday Wednesday avg
Mike 42 NaN 12 27.000000
Jenna NaN NaN 15 15.000000
Jon 21 4 1 8.666667
because .mean() ignores missing values by default: see docs.
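To see that this is the skipna behaviour at work, a quick comparison on the same frame (skipna is a documented parameter of DataFrame.mean):
print(df.mean(axis=1))                # default skipna=True: NaNs are ignored
print(df.mean(axis=1, skipna=False))  # any row containing a NaN becomes NaN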
To select a subset, you can:
df['avg'] = df[['Monday', 'Tuesday']].mean(axis=1)
Monday Tuesday Wednesday avg
Mike 42 NaN 12 42.0
Jenna NaN NaN 15 NaN
Jon 21 4 1 12.5
Alternative - using iloc (can also use loc here):
df['avg'] = df.iloc[:,0:2].mean(axis=1)
Resurrecting this question because all of the previous answers currently print a warning.
In most cases, use assign():
df = df.assign(avg=df.mean(axis=1))
For specific columns, one can input them by name:
df = df.assign(avg=df.loc[:, ["Monday", "Tuesday", "Wednesday"]].mean(axis=1))
Or by index, using one more than the last desired index as it is not inclusive:
df = df.assign(avg=df.iloc[:, 0:3].mean(axis=1))
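For reference, a small end-to-end sketch rebuilding the example frame from the question and applying the assign approach:
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame(
    {"Monday": [42, np.nan, 21],
     "Tuesday": [np.nan, np.nan, 4],
     "Wednesday": [12, 15, 1]},
    index=["Mike", "Jenna", "Jon"],
)

# Row-wise mean; NaNs are skipped by default.
df = df.assign(avg=df.mean(axis=1))
print(df)
# Mike: (42 + 12) / 2 = 27.0, Jenna: 15.0, Jon: (21 + 4 + 1) / 3 ≈ 8.67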