Joining columns based on date within a single dataframe - python

I tried searching around for an answer, but most of what I found was based on merging two dataframes, whereas mine exists within a single dataframe:
D1date D1price D2date D2price
1/2/2017 11.4 1/3/2017 11.3
1/3/2017 12.4 1/4/2017 12.3
1/4/2017 14.4 1/5/2017 12.4
1/5/2017 15.5 1/6/2017 12.5
The result I am looking for:
D1date D1price D2price
1/2/2017 11.4 nan
1/3/2017 12.4 11.3
1/4/2017 14.4 12.3
1/5/2017 15.5 12.4
Can any kind souls advise me please?

Use filter + join:
df = df.filter(like='D1').join(df.filter(like='D2').set_index('D2date'), on='D1date')
print (df)
D1date D1price D2price
0 1/2/2017 11.4 NaN
1 1/3/2017 12.4 11.3
2 1/4/2017 14.4 12.3
3 1/5/2017 15.5 12.4
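As a self-contained illustration, the filter + join approach can be reproduced on a toy frame rebuilt from the sample in the question (column names taken from the question):

```python
import pandas as pd

# Toy frame rebuilt from the sample in the question
df = pd.DataFrame({
    "D1date": ["1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017"],
    "D1price": [11.4, 12.4, 14.4, 15.5],
    "D2date": ["1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "D2price": [11.3, 12.3, 12.4, 12.5],
})

# filter(like='D1') selects the D1* columns; join then looks up each
# D1date value in the D2date index of the D2* half
out = df.filter(like="D1").join(
    df.filter(like="D2").set_index("D2date"), on="D1date"
)
```

Dates with no D2 match (here 1/2/2017) come back as NaN, exactly as in the desired output.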

Have you tried it like this:
df[['D1date', 'D1price']].merge(df[['D2date', 'D2price']], how='left', left_on='D1date', right_on='D2date')
You can add:
.drop('D2date', axis=1)
to remove the D2date column.
Complete code:
df = df[['D1date', 'D1price']].merge(df[['D2date', 'D2price']], how='left', left_on='D1date', right_on='D2date').drop('D2date', axis=1)
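The same toy frame shows the merge variant end to end (a sketch, with the sample data from the question hard-coded):

```python
import pandas as pd

# Toy frame rebuilt from the sample in the question
df = pd.DataFrame({
    "D1date": ["1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017"],
    "D1price": [11.4, 12.4, 14.4, 15.5],
    "D2date": ["1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "D2price": [11.3, 12.3, 12.4, 12.5],
})

# Left-merge the D2 half onto the D1 half on the date columns,
# then drop the now-redundant D2date key column
out = (df[["D1date", "D1price"]]
       .merge(df[["D2date", "D2price"]], how="left",
              left_on="D1date", right_on="D2date")
       .drop("D2date", axis=1))
```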

Related

How to append/concat second row with the first row within the similar dataframe in pandas

I have a dataframe
0 1 2 3 ............ 1041 1042 1043
0 32.5 19.4 66.6 91.4 55.5 10.4 77.2
1 13.3 85.3 22.4 65.8 23.4 90.2 14.5
2 22.4 91.7 57.1 23.5 58.2 81.5 46.7
3 75.7 47.1 91.4 45.2 89.7 38.7 78.3
.
.
.
.
18 32.5 23.4 90.2 14.5 91.7 57.1 23.5
19 56.7 58.2 81.5 46.7 65.5 43.4 76.2
20 76.8 19.4 66.6 91.4 54.9 60.4 96.4
The dataframe has 20 rows and 1044 columns. I want to append/concat the 2nd row to the first row, the 4th row to the 3rd row, the 6th row to the 5th row, and so on. This way the dataframe will become 10 rows and 2088 columns.
After this, columns 0 to 1043 should be renamed x0, x1, x2, ..., x1043, and columns 1044 to 2087 should be renamed y0, y1, y2, ..., y2087. The updated dataframe then looks as follows:
X0 X1 X2 .... X1042 X1043 Y0 Y1 Y2 .... Y2086 Y2087
0 32.5 19.4 66.6 ... 10.4 77.2 13.3 85.3 22.4 ... 90.2 14.5
1 22.4 91.7 57.1 ... 81.5 46.7 75.7 47.1 91.4 ... 38.7 78.3
.
.
.
.
10 56.7 58.2 81.5 ... 43.4 76.2 76.8 19.4 66.6 ... 60.4 96.4
How do I achieve these two tasks? I have tried a lot (using the append/concat/join functions) but have so far been unsuccessful.
Some manipulation of the column names and a trick with the indexes is needed, but it can certainly be done. Pay attention to the behavior when the number of columns is odd.
import numpy as np
import pandas as pd
# Create dummy data
df = pd.DataFrame(np.random.randint(0, 100, size=(1044, 20)))
def pairwise(it):
    a = iter(it)
    return zip(a, a)
l = [l for l, r in pairwise(df.columns)]
r = [r for l, r in pairwise(df.columns)]
# Rename columns of one of the DFs otherwise the concat will overlap them
dfr = df[r].rename(columns={r: l for l, r in pairwise(df.columns)})
# Concat vertically, ignore index to have them stacked sequentially
df = pd.concat([df[l], dfr], axis=0, ignore_index=True)
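The question itself asks for the transposed operation: pairing consecutive rows of a 20 x 1044 frame into 10 rows of 2088 values. A minimal sketch of that row-wise pairing using a plain numpy reshape (a 20 x 6 dummy frame stands in for the real one; the x/y label scheme is the one from the question):

```python
import numpy as np
import pandas as pd

# Dummy data: a 20 x 6 frame stands in for the real 20 x 1044 one
df = pd.DataFrame(np.arange(120).reshape(20, 6))

n_rows, n_cols = df.shape
# Reshape so each pair of consecutive rows becomes one row of 2*n_cols values
paired = pd.DataFrame(df.to_numpy().reshape(n_rows // 2, 2 * n_cols))
# Relabel: first half x0..x(n_cols-1), second half y0..y(n_cols-1)
paired.columns = ([f"x{i}" for i in range(n_cols)]
                  + [f"y{i}" for i in range(n_cols)])
```

This assumes an even row count; with an odd one, drop or pad the last row first.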

How to plot a (grouped) bar chart from a dataframe using pandas

I have a dataframe with the following data
T1 SO DR AX NO Overig
SK1 20.2 21.7 27 22.4 22.6 25
PA 20.2 21.7 21.6 20.4 17.7 25.0
T4 30.8 30.0 24.3 28.6 32.3 0.0
XXS 7.7 10.0 10.8 8.2 9.7 25.0
MvM 20.2 16.7 13.5 18.4 14.5 25.0
ACH 1.0 0.0 2.7 2.0 3.2 0.0
With an specified index and columns.
I need a bar chart for just the columns T1, SO and DR, with the index labels on the x-axis and their values for the three columns on the y-axis. In this case the total number of bars will be 6*3 = 18.
I have tried the following:
df.T.plot(kind='bar')
tevr_asp[['T1','SO','DR']].T.plot.bar()
You can use the dataframe's plot function to render specific columns on the y-axis, and with use_index you can render the index on the x-axis.
df.plot(y=["T1", "SO", "DR"],use_index=True, kind="bar")
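A self-contained sketch of that call, with the frame rebuilt from the question's sample (a headless matplotlib backend is used so nothing pops up):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import pandas as pd

# Frame rebuilt from the question's sample (only the three needed columns)
df = pd.DataFrame(
    {"T1": [20.2, 20.2, 30.8, 7.7, 20.2, 1.0],
     "SO": [21.7, 21.7, 30.0, 10.0, 16.7, 0.0],
     "DR": [27.0, 21.6, 24.3, 10.8, 13.5, 2.7]},
    index=["SK1", "PA", "T4", "XXS", "MvM", "ACH"],
)

# One group of bars per index label, one bar per column: 6*3 = 18 bars
ax = df.plot(y=["T1", "SO", "DR"], use_index=True, kind="bar")
```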

Correlation matrix between cities

I want to find the correlation between cities and Rainfall. Note that 'city' is categorical, not numerical.
I want to compare their rainfall.
How do I go about it? I haven't seen anything on here that talks about how to deal with duplicate cities with different data,
like
Date Location MinTemp MaxTemp Rainfall
12/1/2008 Albury 13.4 22.9 0.6
12/2/2008 Albury 7.4 25.1 0
12/3/2008 Albury 12.9 25.7 0
12/5/2008 Brisbane 20.5 29 9.6
12/6/2008 Brisbane 22.1 33.4 7.8
12/7/2008 Brisbane 22.6 33.4 12.4
12/8/2008 Brisbane 21.9 26.7 0
12/9/2008 Brisbane 19.5 27.6 0.2
12/10/2008 Brisbane 22.1 30.3 0.6
3/30/2011 Tuggeranong 9.8 25.2 0.4
3/31/2011 Tuggeranong 10.3 18.5 2.8
5/1/2011 Tuggeranong 5.5 20.8 0
5/2/2011 Tuggeranong 11 16.1 0
5/3/2011 Tuggeranong 7.3 17.5 0.6
8/29/2016 Woomera 15 22.9 0
8/30/2016 Woomera 12.5 22.1 12.8
8/31/2016 Woomera 8 20 0
9/1/2016 Woomera 11.6 21.4 0
9/2/2016 Woomera 11.2 19.6 0.3
9/3/2016 Woomera 7.1 20.4 0
9/4/2016 Woomera 6.5 18.6 0
9/5/2016 Woomera 7.3 21.5 0
One possible solution, if I understood you correctly (based on the title of OP), is:
Step 1
Preparing a dataset with Locations as columns and Rainfall as rows (note that you will lose information here: every city is truncated down to the shortest rainfall series)
df2=df.groupby("Location")[["Location", "Rainfall"]].head(3) # head(3) is first 3 observations
df2.loc[:,"col"] = 4*["x1","x2","x3"] # 4 is number of unique cities
df3 = df2.pivot_table(index="col",columns="Location",values="Rainfall")
df3
Location Albury Brisbane Tuggeranong Woomera
col
x1 0.6 9.6 0.4 0.0
x2 0.0 7.8 2.8 12.8
x3 0.0 12.4 0.0 0.0
Step 2
Doing correlation matrix on the obtained dataset
df3.corr()
Location Albury Brisbane Tuggeranong Woomera
Location
Albury 1.000000 -0.124534 -0.381246 -0.500000
Brisbane -0.124534 1.000000 -0.869799 -0.797017
Tuggeranong -0.381246 -0.869799 1.000000 0.991241
Woomera -0.500000 -0.797017 0.991241 1.000000
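The steps above can be run end to end on a toy frame built from the question's sample (the first three Rainfall readings per city are hard-coded):

```python
import pandas as pd

# Toy data: first three Rainfall readings per city from the question
df = pd.DataFrame({
    "Location": (["Albury"] * 3 + ["Brisbane"] * 3
                 + ["Tuggeranong"] * 3 + ["Woomera"] * 3),
    "Rainfall": [0.6, 0.0, 0.0, 9.6, 7.8, 12.4,
                 0.4, 2.8, 0.0, 0.0, 12.8, 0.0],
})

# Step 1: first 3 observations per city, labelled x1..x3, pivoted wide
df2 = df.groupby("Location").head(3).copy()
df2["col"] = ["x1", "x2", "x3"] * df["Location"].nunique()
df3 = df2.pivot_table(index="col", columns="Location", values="Rainfall")

# Step 2: correlation matrix across the city columns
corr = df3.corr()
```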
An alternative, slightly more involved solution would be to keep the longest series and impute the missing values with means or medians.
But even though you would feed more data into your algorithm, it would not cure the main problem: your data seem to be misaligned. To do correlation analysis properly, you should make sure that you compare comparable values, e.g. rainfall for summer in one city with rainfall for summer in another. For that, you should make sure you have an equal number of comparable rainfall observations for each city: e.g. winter, spring, summer, autumn; or January, February, ..., December.

How can I fill my dataframe

Can someone please tell me how I can fill in the missing values of my dataframe? The missing values don't come up as NaN or anything common; instead they show as two dots, like '..'. How would I go about filling them in with the mean of the row they are in?
1971 1990 1999 2000 2001 2002
Estonia .. 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia .. 12.4 13.3 13.6 14.5 14.6
My headers are the years and my index are the countries.
It seems you can use mask: compare against the numpy array of values, replace the matches with the row means, and finally cast all columns to float:
print (df.mean(axis=1))
Estonia 10.26
Spain 210.82
SlovakRepublic 29.70
Slovenia 13.68
df = df.mask(df.values == '..', df.mean(axis=1), axis=0).astype(float)
print (df)
1971 1990 1999 2000 2001 2002
Estonia 10.26 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia 13.68 12.4 13.3 13.6 14.5 14.6
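The same idea can be written with replace + fillna, which avoids calling mean on the object-dtype columns (a sketch on a cut-down two-row version of the frame; on recent pandas versions DataFrame.mean raises on string entries like '..'):

```python
import numpy as np
import pandas as pd

# Cut-down version of the frame; missing values are the string '..'
df = pd.DataFrame(
    {"1971": ["..", 61.6], "1990": [17.4, 151.2], "1999": [8.3, 205.9]},
    index=["Estonia", "Spain"],
)

# Turn '..' into NaN, cast to float, then fill each gap with its row mean
df = df.replace("..", np.nan).astype(float)
df = df.apply(lambda row: row.fillna(row.mean()), axis=1)
```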
You should be able to use .set_value:
try df_name.set_value('index', 'column', value)
something like
df_name.set_value('Estonia', '1971', 50)
(Note that .set_value has since been deprecated and removed; in current pandas use df_name.at['Estonia', '1971'] = 50 instead.)

Calculating daily, weekly and monthly mean in python (Pandas)

I have a file which has different readings in columns. I am able to find the daily mean using the following code, which works when the month, day and time are separated by spaces. How can I do the same if the date (day, month, year) is in the first column and the time in the second? How can I calculate daily, weekly and monthly averages? Please note, the data is not equally sampled.
import pandas as pd
import numpy as np
df = pd.read_csv("Data_set.csv", sep=r'\s+', names=["month", "day", "time", "Temperature"])
group=df.groupby(["month","day"])
daily=group.aggregate({"Temperature":np.mean})
daily.to_csv('daily.csv')
Date Time T1 T2 T3
17/12/2013 00:28:38 19 23.1 7.3
17/12/2013 00:58:38 19 22.9 7.3
17/12/2013 01:28:38 18.9 22.8 6.3
17/12/2013 01:58:38 18.9 23.1 6.3
17/12/2013 02:28:38 18.8 23 6.3
.....
.....
24/12/2013 19:58:21 14.7 15.5 7
24/12/2013 20:28:21 14.7 15.5 7
24/12/2013 20:58:21 14.7 15.5 7
24/12/2013 21:28:21 14.7 15.6 6
24/12/2013 21:58:21 14.7 15.5 6
24/12/2013 22:28:21 14.7 15.5 5
24/12/2013 22:58:21 14.7 15.5 4
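One way to handle the Date/Time layout shown above is to combine the two columns into a DatetimeIndex and use resample, which copes with unevenly sampled data (a sketch; the file-reading step is replaced with a small hard-coded sample, and only column T1 is aggregated):

```python
import pandas as pd

# Small hard-coded sample in the question's Date/Time layout
df = pd.DataFrame({
    "Date": ["17/12/2013", "17/12/2013", "24/12/2013"],
    "Time": ["00:28:38", "00:58:38", "19:58:21"],
    "T1": [19.0, 19.0, 14.7],
})

# Combine the two columns into a single DatetimeIndex (day-first dates)
df.index = pd.to_datetime(df["Date"] + " " + df["Time"], dayfirst=True)

# resample groups by calendar period regardless of the sampling rate
daily = df["T1"].resample("D").mean()
weekly = df["T1"].resample("W").mean()
monthly = df["T1"].resample("MS").mean()
```

The same pattern extends to the other temperature columns by resampling the whole frame instead of a single column.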
