I would like to have a dataframe created by combining only the total rows of two pivot tables, keeping the same column names, including the All column.
testA:
sum
ALL_APPS
MONTH 2019/08 2019/09 2019/10 All
DESCRIPTION
A1 111 112 113 336
A2 121 122 123 366
A3 131 132 133 396
All 363 366 369 1098
testB:
sum
ALL_APPS
MONTH 2019/08 2019/09 2019/10 All
DESCRIPTION
A1 211 212 213 636
A2 221 222 223 666
A3 231 232 233 696
All 663 666 669 1998
As a result, I would like to have a dataframe that looks like:
2019/08 2019/09 2019/10 All
363 366 369 1098
663 666 669 1998
I tried:
A = testA.iloc[3]
B = testB.iloc[3]
my_series = pd.concat([A, B], axis=1)
But it does not do what I expected :(
All All
MONTH
sum ALL_APPS 2019/08 363.0 NaN
2019/09 366.0 NaN
2019/10 369.0 NaN
All 1098.0 NaN
CUR_VER 2019/08 NaN 663.0
2019/09 NaN 666.0
2019/10 NaN 669.0
All NaN 1998.0
Try:
my_series = pd.concat([testA.iloc[-1], testB.iloc[-1]], axis=1, ignore_index=True).T
my_series.columns = [x[-1] for x in testA.columns]  # keep only the MONTH level of the column MultiIndex
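An alternative sketch selects the total row by label rather than by position (this assumes the total row is labeled 'All' in both pivots and that the columns form a MultiIndex whose last level holds the month):
# stack the two 'All' rows and flatten the column MultiIndex
totals = pd.concat([testA.loc[['All']], testB.loc[['All']]], ignore_index=True)
totals.columns = [c[-1] for c in totals.columns]  # drop the upper column levels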
Assume the following simplified framework:
I have a 3D Pandas dataframe of parameters composed of 100 rows, 4 classes and 4 features for each instance:
import numpy as np
import pandas as pd

iterables = [list(range(100)), [0, 1, 2, 3]]
index = pd.MultiIndex.from_product(iterables, names=['instances', 'classes'])
columns = ['a', 'b', 'c', 'd']
np.random.seed(42)
parameters = pd.DataFrame(np.random.randint(1, 2000, size=(len(index), len(columns))),
                          index=index, columns=columns)
parameters
instances classes a b c d
0 0 1127 1460 861 1295
1 1131 1096 1725 1045
2 1639 122 467 1239
3 331 1483 88 1397
1 0 1124 872 1688 131
... ... ... ... ...
98 3 1321 1750 779 1431
99 0 1793 814 1637 1429
1 1370 1646 420 1206
2 983 825 1025 1855
3 1974 567 371 936
Let df be a dataframe that, for each instance and each feature (column), reports the observed class.
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 3, size=(100, len(columns))),
                  index=list(range(100)), columns=columns)
a b c d
0 2 0 2 2
1 0 0 2 1
2 2 2 2 2
3 0 2 1 0
4 1 1 1 1
.. .. .. .. ..
95 1 2 0 1
96 2 1 2 1
97 0 0 1 2
98 0 0 0 1
99 1 2 2 2
I would like to create a third dataframe (let's call it new_df) of shape (100, 4) containing the values from the dataframe parameters, selected according to the observed classes in the dataframe df.
For example, in the first row of df for the first column (a) I observe class 2, so the value I am interested in is the one for class 2 in the first instance of the parameters dataframe, namely 1639, which will populate the first row and column of new_df. Following this method, the first observation for column "b" is class 0, so in the first row, column b of new_df I would like to see 1460, and so on.
With a for loop I can obtain the desired result:
new_df = pd.DataFrame(0, index=list(range(100)), columns=columns)  # initialize the df
for i in range(len(df)):
    for c in df.columns:
        new_df.iloc[i][c] = parameters.loc[i][c][df.iloc[i][c]]
new_df
a b c d
0 1639 1460 467 1239
1 1124 872 806 344
2 1083 511 1706 1500
3 958 1155 1268 563
4 14 242 777 1370
.. ... ... ... ...
95 1435 1316 1709 755
96 346 712 363 815
97 1234 985 683 1348
98 127 1130 1009 1014
99 1370 825 1025 1855
However, the original dataset contains millions of rows and hundreds of columns, so proceeding with for loops is infeasible.
Is there a way to vectorize this problem to avoid the for loops (at least over one dimension)?
Reshape both DataFrames into a long format with stack, perform the merge, then reshape back to the wide format with unstack. There's a bunch of renaming just so we can reference and align the columns in the merge.
(df.rename_axis(index='instances', columns='cols').stack().to_frame('classes')
   .merge(parameters.rename_axis(columns='cols').stack().rename('vals'),
          on=['instances', 'classes', 'cols'])
   .unstack(-1)['vals']
   .rename_axis(index=None, columns=None)
)
a b c d
0 1639 1460 467 1239
1 1124 872 806 344
2 1083 511 1706 1500
3 958 1155 1268 563
4 14 242 777 1370
.. ... ... ... ...
95 1435 1316 1709 755
96 346 712 363 815
97 1234 985 683 1348
98 127 1130 1009 1014
99 1370 825 1025 1855
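For completeness, here is a NumPy-based sketch of the same lookup. It assumes, as in the setup above, that the classes level is the dense, sorted range 0-3, so the parameter values can be reshaped into an (instances, classes, features) cube:
import numpy as np
# reshape to (100 instances, 4 classes, 4 features); row order follows the sorted MultiIndex
vals = parameters.to_numpy().reshape(len(df), 4, len(columns))
# for every (instance, feature) pair, pick the entry of the observed class
idx = df.to_numpy()[:, None, :]  # shape (100, 1, 4)
new_df = pd.DataFrame(np.take_along_axis(vals, idx, axis=1).squeeze(1),
                      index=df.index, columns=df.columns)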
I have two data frames df1 and df2 as shown below:
df1
Date ID Amount BillNo1
10/08/2020 ABBCSQ1ZA 878 2020/156
10/08/2020 ABBCSQ1ZA 878 2020/157
10/12/2020 AC928Q1ZS 3998 343SONY
10/14/2020 AC9268RE3 198 432
10/16/2020 AA171E1Z0 5490 AIPO325
10/19/2020 BU073C1ZW 3432 IDBI436-Total
10/19/2020 BU073C1ZW 3432 IDBI437-Total
df2
Date ID Amount BillNo2
10/08/2020 ABBCSQ1ZA 878 156
10/11/2020 ATRC95REW 115 265
10/14/2020 AC9268RE3 198 A/432
10/16/2020 AA171E1Z0 5490 325
10/19/2020 BU073C1ZW 3432 436
10/19/2020 BU073C1ZW 3432 437
My final answer should be:
Matched
Date ID Amount BillNo1 BillNo2
10/08/2020 ABBCSQ1ZA 878 2020/156 156 # 156 matches
10/14/2020 AC9268RE3 198 432 A/432 # 432 matches
10/16/2020 AA171E1Z0 5490 AIPO325 325 # 325 matches
10/19/2020 BU073C1ZW 3432 IDBI436-Total 436 # 436 matches
10/19/2020 BU073C1ZW 3432 IDBI437-Total 437 # 437 matches
Non Matched
Date ID Amount BillNo1 BillNo2
10/08/2020 ABBCSQ1ZA 878 2020/157 NaN
10/12/2020 AC928Q1ZS 3998 343SONY NaN
10/11/2020 ATRC95REW 115 NaN 265
How do I merge the two dataframes based on a partial string match between the columns BillNo1 and BillNo2?
You can define your own thresholds, but one proposal is below:
import difflib
from functools import partial

import pandas as pd

# the function below is inspired by https://stackoverflow.com/a/56521804/9840637
def get_closest_match(x, y):
    """x = possibilities, y = input"""
    f = partial(difflib.get_close_matches, possibilities=x.unique(), n=1, cutoff=0.5)
    matches = y.astype(str).drop_duplicates().map(f).fillna('').str[0]
    return pd.DataFrame([y, matches.rename('BillNo2')]).T

temp = get_closest_match(df2['BillNo2'], df1['BillNo1'])
temp['BillNo2'] = (temp['BillNo2']
                   .fillna(df1['BillNo1']
                           .str.extract('(' + '|'.join(df2['BillNo2']) + ')', expand=False)))
merged = (df1.assign(BillNo2=df1['BillNo1'].map(dict(temp.values)))
          .merge(df2.drop_duplicates(), on=['Date', 'ID', 'Amount', 'BillNo2'],
                 how='outer', indicator=True))
print(merged)
Date ID Amount BillNo1 BillNo2 _merge
0 10/08/2020 ABBCSQ1ZA 878 2020/156 156 both
1 10/08/2020 ABBCSQ1ZA 878 2020/157 NaN left_only
2 10/12/2020 AC928Q1ZS 3998 343SONY NaN left_only
3 10/14/2020 AC9268RE3 198 432 A/432 both
4 10/16/2020 AA171E1Z0 5490 AIPO325 325 both
5 10/19/2020 BU073C1ZW 3432 IDBI436-Total 436 both
6 10/19/2020 BU073C1ZW 3432 IDBI437-Total 437 both
7 10/11/2020 ATRC95REW 115 NaN 265 right_only
Once you have the merged df above, you can do:
matched = merged.query("_merge=='both'")
unmatched = merged.query("_merge!='both'")
print("Matched Df \n ", matched,'\n\n',"Unmatched Df \n " , unmatched)
Matched Df
Date ID Amount BillNo1 BillNo2 _merge
0 10/08/2020 ABBCSQ1ZA 878 2020/156 156 both
3 10/14/2020 AC9268RE3 198 432 A/432 both
4 10/16/2020 AA171E1Z0 5490 AIPO325 325 both
5 10/19/2020 BU073C1ZW 3432 IDBI436-Total 436 both
6 10/19/2020 BU073C1ZW 3432 IDBI437-Total 437 both
Unmatched Df
Date ID Amount BillNo1 BillNo2 _merge
1 10/08/2020 ABBCSQ1ZA 878 2020/157 NaN left_only
2 10/12/2020 AC928Q1ZS 3998 343SONY NaN left_only
7 10/11/2020 ATRC95REW 115 NaN 265 right_only
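If, as in the sample data, the overlap is always the digit run inside each bill number, a simpler regex-based sketch (my own assumption, not part of the answer above) can replace the fuzzy matching:
# extract the trailing digits as a join key; the patterns are guesses fit to the sample data
key1 = df1['BillNo1'].str.extract(r'(\d+)(?:-Total)?$', expand=False)
key2 = df2['BillNo2'].str.extract(r'(\d+)$', expand=False)
merged = (df1.assign(key=key1)
          .merge(df2.assign(key=key2).drop_duplicates(),
                 on=['Date', 'ID', 'Amount', 'key'], how='outer', indicator=True)
          .drop(columns='key'))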
I have a dataframe df1 with a column dates which contains dates. I want to plot the dataframe for just a certain month. The dataframe looks like:
Unnamed: 0 Unnamed: 0.1 dates DPD weekday
0 0 1612 2007-06-01 23575.0 4
1 3 1615 2007-06-04 28484.0 0
2 4 1616 2007-06-05 29544.0 1
3 5 1617 2007-06-06 29129.0 2
4 6 1618 2007-06-07 27836.0 3
5 7 1619 2007-06-08 23434.0 4
6 10 1622 2007-06-11 28893.0 0
7 11 1623 2007-06-12 28698.0 1
8 12 1624 2007-06-13 27959.0 2
9 13 1625 2007-06-14 28534.0 3
10 14 1626 2007-06-15 23974.0 4
.. ... ... ... ... ...
513 721 2351 2009-06-09 54658.0 1
514 722 2352 2009-06-10 51406.0 2
515 723 2353 2009-06-11 48255.0 3
516 724 2354 2009-06-12 40874.0 4
517 727 2357 2009-06-15 77085.0 0
518 728 2358 2009-06-16 77989.0 1
519 729 2359 2009-06-17 75209.0 2
520 730 2360 2009-06-18 72298.0 3
521 731 2361 2009-06-19 60037.0 4
522 734 2364 2009-06-22 69348.0 0
523 735 2365 2009-06-23 74086.0 1
524 736 2366 2009-06-24 69187.0 2
525 737 2367 2009-06-25 68912.0 3
526 738 2368 2009-06-26 57848.0 4
527 741 2371 2009-06-29 72718.0 0
528 742 2372 2009-06-30 72306.0 1
And I just want to have June 2007 for example.
df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['month'] = pd.PeriodIndex(df1.dates, freq='M')
nov_mask=df1['month'] == 2007-06
plot_data= df1[nov_mask].pivot(index='dates', values='DPD')
plot_data.plot()
plt.show()
I don't know what's wrong with my code. The error shows that something is wrong with 2007-06 in the definition of nov_mask; I think the data type is wrong, but I have tried a lot and nothing works.
You don't need PeriodIndex if you just want to get June 2007 data. I have no access to IPython right now but this should point you in the right direction.
df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['year'] = df1['dates'].dt.year
df1['month'] = df1['dates'].dt.month
june_mask = (df1['year'] == 2007) & (df1['month'] == 6)
filtered = df1[june_mask]
# ... Do something with filtered.
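If you would rather keep the PeriodIndex approach from the question, it works once the period is written as a quoted string; the unquoted 2007-06 is read as arithmetic, and the leading-zero literal 06 is a syntax error in Python 3. A sketch, assuming the dates column parses cleanly:
df1['month'] = pd.PeriodIndex(df1['dates'], freq='M')
june_mask = df1['month'] == '2007-06'  # compare against a string, not a bare expression
plot_data = df1.loc[june_mask].set_index('dates')['DPD']
plot_data.plot()
plt.show()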
I have a dataframe that has rows with repeated values in sequences.
For example:
df_raw
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14....
220 450 451 456 470 224 220 223 221 340 224 220 223 221 340.....
234 333 453 460 551 226 212 115 117 315 226 212 115 117 315.....
As you can see, columns 0-5 are unique in this example, and then we have the repeated sequence [220 223 221 340 224] in row 1 at columns 6-10 and again from column 11 onward.
This pattern is the same for row 2.
I'd like to remove the repeated sequences from each row of my dataframe (there may be more than two repetitions), for an output like this:
df_clean
0 1 2 3 4 5 6 7 8 9.....
220 450 451 456 470 224 220 223 221 340.....
234 333 453 460 551 226 212 115 117 315.....
I trail with ...... because the rows are long and have multiple repetitions each. I also cannot assume that each row has the exact same number of repeated sequences, nor that each sequence starts or ends at the same index.
Is there an easy way to do this with pandas or even a numpy array?
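One possible sketch (my own, and not fully vectorized): since the repeats run to the end of each row, repeatedly strip a trailing block that duplicates the block immediately before it, trying the longest block first.
import pandas as pd

def drop_trailing_repeats(values):
    """Strip trailing blocks that duplicate the block immediately before them."""
    a = list(values)
    changed = True
    while changed:
        changed = False
        # try the longest candidate block first so whole sequences are removed
        for L in range(len(a) // 2, 0, -1):
            if a[-L:] == a[-2 * L:-L]:
                del a[-L:]
                changed = True
                break
    return a

# apply per row; rows that end up shorter are right-padded with NaN
df_clean = df_raw.apply(lambda row: pd.Series(drop_trailing_repeats(row.tolist())), axis=1)
Note this also collapses any block that happens to repeat back-to-back, including single repeated values, so it only fits if such accidental repeats cannot occur in the real data.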
I have an excel file with 2 sheets.
One sheet contains the data:
DATE TMAX TMIN
20110706 317 211
20110707 322 211
20110708 317 211
20110709 322 211
20110710 328 222
20110711 333 244
20110712 356 250
20110713 356 222
and the other sheet includes:
Start Date End Date Rep Month Cost kWh kW
7/6/2011 8/3/2011 July 5,065.17 76,640 205
8/3/2011 9/7/2011 August 5,572.38 86,640 195
My goal is to write another column (kWh) on sheet one, taken from sheet two, depending on whether the date on sheet one falls within the date range for a certain kWh value.
For example:
DATE TMAX TMIN kWh
20110706 317 211 76640
20110707 322 211 76640
20110708 317 211 76640
20110709 322 211 76640
20110710 328 222 76640
20110711 333 244 76640
20110712 356 250 76640
20110713 356 222 76640
20110801 344 228 76640
20110802 356 200 76640
20110803 367 200 86640
20110804 361 228 86640
I am having trouble figuring out how to parse this algorithmically in order to implement what I am trying to do.
I am already familiar with how to write to a file and read a file/cells with pandas.
Here is my code:
import pandas as pd
from pandas import ExcelWriter
df = pd.read_excel("thecddhddtest.xlsx",'Sheet1')
df2 = pd.read_excel("thecddhddtest.xlsx",'Sheet2')
df.head()
df["DATE"] = pd.to_datetime(df["DATE"], format="%Y%m%d")
pd.to_datetime(df2["Start Date"], format="%m/%d/%Y")
df3 = df2.set_index("Start Date")
df3["kWh"].reindex(df["DATE"], method="ffill")
df["kWh"] = df3["kWh"].reindex(df["DATE"], method="ffill")
print(df["kWh"])
writer = ExcelWriter('thecddhddtestkWh.xlsx')
df.to_excel(writer,'Sheet1',index=False)
df2.to_excel(writer,'Sheet2',index=False)
writer.save()
which results in:
DATE TMAX TMIN kWh
20110706 317 211
20110707 322 211
20110708 317 211
20110709 322 211
20110710 328 222
20110711 333 244
20110712 356 250
20110713 356 222
The kWh column is empty for some reason.
It's critical to parse the date columns as pandas Timestamps / numpy datetime64; the best way is to use to_datetime with a format. (Two things go wrong in your code: the result of pd.to_datetime(df2["Start Date"], ...) is never assigned back, so df3 is still indexed by strings, and assigning the reindexed Series to df["kWh"] aligns it on df's integer index rather than positionally, which is why the column comes out empty.)
In [11]: df
Out[11]:
DATE TMAX TMIN
0 20110706 317 211
1 20110707 322 211
2 20110708 317 211
3 20110709 322 211
4 20110710 328 222
5 20110711 333 244
6 20110712 356 250
7 20110713 356 222
8 20110801 344 228
9 20110802 356 200
10 20110803 367 200
11 20110804 361 228
In [12]: df["DATE"] = pd.to_datetime(df["DATE"], format="%Y%m%d")
In [13]: df
Out[13]:
DATE TMAX TMIN
0 2011-07-06 317 211
1 2011-07-07 322 211
2 2011-07-08 317 211
3 2011-07-09 322 211
4 2011-07-10 328 222
5 2011-07-11 333 244
6 2011-07-12 356 250
7 2011-07-13 356 222
8 2011-08-01 344 228
9 2011-08-02 356 200
10 2011-08-03 367 200
11 2011-08-04 361 228
Similarly (with a different format):
In [14]: pd.to_datetime(df2["Start Date"], format="%m/%d/%Y")
Out[14]:
0 2011-07-06
1 2011-08-03
Name: Start Date, dtype: datetime64[ns]
Now, the first observation is that this wouldn't make sense if the periods were not mutually exclusive. This means we need only consider the start date*.
This means you can reindex the second sheet, forward-fill, and you're done:
In [21]: df3 = df2.set_index("Start Date")
In [22]: df3
Out[22]:
End Date Rep Month Cost kWh kW
Start Date
2011-07-06 8/3/2011 July 5,065.17 76,640 205
2011-08-03 9/7/2011 August 5,572.38 86,640 195
This allows you to reindex by the dates from your DataFrame:
In [23]: df3["kWh"].reindex(df["DATE"], method="ffill")
Out[23]:
DATE
2011-07-06 76,640
2011-07-07 76,640
2011-07-08 76,640
2011-07-09 76,640
2011-07-10 76,640
2011-07-11 76,640
2011-07-12 76,640
2011-07-13 76,640
2011-08-01 76,640
2011-08-02 76,640
2011-08-03 86,640
2011-08-04 86,640
Name: kWh, dtype: object
and set this as the column in df, grabbing the underlying values so the assignment is positional rather than aligned on the mismatched DATE index:
In [24]: df["kWh"] = df3["kWh"].reindex(df["DATE"], method="ffill").values
*If there are some "empty" periods we could add in some NaN rows, with the corresponding "empty" start-date.
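For what it's worth, a more recent sketch of the same lookup uses merge_asof, which matches each DATE to the most recent Start Date in one step. This assumes df2["Start Date"] has been converted with to_datetime as above; the final conversion is only needed because the kWh column carries thousands separators and arrives as strings:
out = pd.merge_asof(
    df.sort_values("DATE"),
    df2[["Start Date", "kWh"]].sort_values("Start Date"),
    left_on="DATE",
    right_on="Start Date",  # default direction='backward' mirrors the forward fill
).drop(columns="Start Date")
out["kWh"] = out["kWh"].str.replace(",", "", regex=False).astype(int)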