I have this df:
df1
t0
ACER 50
BBV 20
ACC 75
IRAL 25
DECO 58
and on the other side:
df2
t1
ACER 50
BBV 20
CEL 0
DIA 25
I am looking to add both dfs to obtain the following output:
df3
t2
ACER 100
BBV 40
ACC 75
IRAL 25
DECO 58
DIA 25
CEL 0
Basically, it is the addition of the values at the common index labels in df1 and df2, also including the labels that didn't appear in df1.
I tried
df1.add(df2)
but NaN values arose. I also thought of merging the dfs, substituting the NaNs with zeros and adding the columns, but I think that may be a lot of code for this.
Use concat and sum. Since sum skips NaN by default, tickers that appear in only one frame keep their single value:
df3 = pd.concat([df1, df2], axis=1).sum(axis=1)
ACC 75.0
ACER 100.0
BBV 40.0
CEL 0.0
DECO 58.0
DIA 25.0
IRAL 25.0
You were actually on the right track, you just needed fill_value=0:
df1.t0.add(df2.t1, fill_value=0).to_frame('t2')
t2
ACC 75.0
ACER 100.0
BBV 40.0
CEL 0.0
DECO 58.0
DIA 25.0
IRAL 25.0
Note here that you'd have to add the Series objects together to prevent misalignment issues.
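To see the misalignment issue, here is a minimal sketch that reconstructs the sample frames above. DataFrame.add aligns on both axes, so the differing column labels t0 and t1 never line up, while Series.add aligns on the index alone:
import pandas as pd

df1 = pd.DataFrame({'t0': [50, 20, 75, 25, 58]},
                   index=['ACER', 'BBV', 'ACC', 'IRAL', 'DECO'])
df2 = pd.DataFrame({'t1': [50, 20, 0, 25]},
                   index=['ACER', 'BBV', 'CEL', 'DIA'])

# DataFrame.add also aligns the column labels: t0 and t1 never match,
# so the result keeps two separate columns instead of one summed column.
df1.add(df2, fill_value=0)

# Series.add aligns on the index only, giving the desired t2 column.
df1['t0'].add(df2['t1'], fill_value=0).to_frame('t2')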
I am new to Python and pandas, so my doubt may be silly.
Problem:
So I have two data frames let's say df1 and df2 where
df1 is like
treatment1 treatment2 value comparision test adjustment statsig p_value
0 Treatment Control 0.795953 Treatment:Control t-test Benjamini-Hochberg False 0.795953
1 Treatment2 Control 0.795953 Treatment2:Control t-test Benjamini-Hochberg False 0.795953
2 Treatment2 Treatment 0.795953 Treatment2:Treatment t-test Benjamini-Hochberg False 0.795953
and df2 is like
group_type metric
0 Treatment 31.0
1 Treatment2 83.0
2 Treatment 51.0
3 Treatment 20.0
4 Control 41.0
.. ... ...
336 Treatment3 35.0
337 Treatment3 9.0
338 Treatment3 35.0
339 Treatment3 9.0
340 Treatment3 35.0
I want to add a column mean_percentage_lift in df1 where
mean_percentage_lift = (mean(treatment1) / mean(treatment2) - 1) * 100
where `treatment1` and `treatment2` can be anything in `[Treatment, Control, Treatment2]`
My Approach:
I am using the assign function of the data frame.
df1.assign(mean_percentage_lift = lambda dataframe: lift_mean_percentage(df2, dataframe['treatment1'], dataframe['treatment2']))
where
def lift_mean_percentage(df, treatment1, treatment2):
    treatment1_data = df[df[group_type_col] == treatment1]
    treatment2_data = df[df[group_type_col] == treatment2]
    mean1 = treatment1_data['metric'].mean()
    mean2 = treatment2_data['metric'].mean()
    return (mean1/mean2 - 1) * 100
But I am getting the error Can only compare identically-labeled Series objects for the line
treatment1_data = df[df[group_type_col] == treatment1]. Is there something I am doing wrong, or is there an alternative to this?
For dataframe df2:
group_type metric
0 Treatment 31.0
1 Treatment2 83.0
2 Treatment 51.0
3 Treatment 20.0
4 Control 41.0
5 Treatment3 35.0
6 Treatment3 9.0
7 Treatment 35.0
8 Treatment3 9.0
9 Control 5.0
You can try:
def lift_mean_percentage(df, T1, T2):
    treatment1 = df['metric'][df['group_type'] == T1].mean()
    treatment2 = df['metric'][df['group_type'] == T2].mean()
    return (treatment1/treatment2 - 1) * 100
Running:
lift_mean_percentage(df2,'Treatment2','Control')
the result:
260.8695652173913
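The error in the original attempt comes from assign handing the whole treatment1 and treatment2 columns (Series indexed like df1) to the comparison against df2's group_type column. Applying the function row-wise passes plain scalars instead; a minimal sketch, assuming the frames and column names from the question:
# Each row hands scalar treatment labels to the function one at a time.
df1['mean_percentage_lift'] = df1.apply(
    lambda row: lift_mean_percentage(df2, row['treatment1'], row['treatment2']),
    axis=1,
)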
I have a large dataset (around 8 million rows x 25 columns) in pandas and I am struggling to do one operation in a performant manner.
Here is what my dataset looks like:
temp size
location_id hours
135 78 12.0 100.0
79 NaN NaN
80 NaN NaN
81 15.0 112.0
82 NaN NaN
83 NaN NaN
84 14.0 22.0
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float). I have only included 2 columns here, normally there are around 20 columns.
What I want to do is fill those NaN values using the values around them. Basically, the value of hour 79 will be derived from the values of hours 78 and 81. For this example, the temp value of hour 79 will be 13.0 (basic linear interpolation).
I always know that only the hours that are multiples of 3 (78, 81, 84, ...) will be filled and the rest will have NaN. That will always be the case, for hours between 78 and 120.
With these in mind, I have implemented the following algorithm in Pandas:
df_relevant_data = df.loc[(df.index.get_level_values(1) >= 78) & (df.index.get_level_values(1) <= 120), :]

for location_id, data_of_location_id in df_relevant_data.groupby("location_id"):
    for hour in range(81, 123, 3):
        top_hour_data = data_of_location_id.loc[(location_id, hour), ['temp', 'size']]  # e.g. 81
        bottom_hour_data = data_of_location_id.loc[(location_id, (hour - 3)), ['temp', 'size']]  # e.g. 78
        difference = top_hour_data.values - bottom_hour_data.values
        bottom_bump = difference * (1/3)  # amount to add to calculate the 79th hour
        top_bump = difference * (2/3)  # amount to add to calculate the 80th hour
        df.loc[(location_id, (hour - 2)), ['temp', 'size']] = bottom_hour_data.values + bottom_bump
        df.loc[(location_id, (hour - 1)), ['temp', 'size']] = bottom_hour_data.values + top_bump
This works correctly, however the performance is horrible. It is taking at least 10 minutes on my dataset and that is currently not acceptable.
Is there a better/faster way to implement this? I am actually working only on a slice of the whole data (only hours between 78 and 120), so I would really expect it to work much faster.
I believe you are looking for interpolate:
print (df.interpolate())
temp size
location_id hours
135 78 12.000000 100.0
79 13.000000 104.0
80 14.000000 108.0
81 15.000000 112.0
82 14.666667 82.0
83 14.333333 52.0
84 14.000000 22.0
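One caveat worth noting: interpolate() runs down each column over the whole frame, so values can bleed across location_id boundaries (the last known hour of one location feeding the first gap of the next). A hedged sketch that keeps each location self-contained, assuming the index levels are named as shown in the question:
# Interpolate within each location separately; group_keys=False keeps the
# original multi-index instead of prepending the group label a second time.
df = df.groupby(level='location_id', group_keys=False).apply(lambda g: g.interpolate())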
First, skip a row of data if it has more than 2 empty columns. After this step, the rows with more than 2 missing values will be filtered out.
Then, since some rows still have 1 or 2 empty columns, I will fill each empty column with the mean value of that row.
I can run the second step with my code below; however, I am not sure how to filter out the rows with more than 2 missing values.
I have tried using dropna but it deleted all the columns of the table.
My code:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as pp
%matplotlib inline
# high technology exports as a percentage of manufactured exports
hightech_export = pd.read_csv('hightech_export_1.csv')
# skip a row of data if it has more than 2 empty columns
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
# Fill in missing data with the mean value of the row.
m = hightech_export.mean(axis=1)
for i, col in enumerate(hightech_export):
    hightech_export.iloc[:, i] = hightech_export.iloc[:, i].fillna(m)
My dataset:
Country Name   2001  2002  2003  2004
Philippines      71
Malta            62    58    60    58
Singapore        60                56
Malaysia         58    57          55
Ireland          47    41    34    34
Georgia          38    41    24    38
Costa Rica
You can make use of the .isnull() method for your first task.
Replace this:
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
with:
hightech_export= hightech_export.loc[hightech_export.isnull().sum(axis=1)<=2]
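Putting both steps together, a sketch assuming the hightech_export frame from the question (row means are computed over the numeric year columns only):
# Step 1: keep only the rows with at most 2 missing values.
hightech_export = hightech_export.loc[hightech_export.isnull().sum(axis=1) <= 2]

# Step 2: fill the remaining gaps with each row's mean; fillna aligns the
# row-mean Series against each column's index (the row labels).
row_means = hightech_export.mean(axis=1, numeric_only=True)
hightech_export = hightech_export.apply(lambda col: col.fillna(row_means))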
Ok try this ...
import pandas as pd
import numpy as np
data1={'Name':['Tom',np.NaN,'Mary','Jane'],'Age':[20,np.NaN,40,30],'Pay':[np.NaN,np.NaN,20,25]}
data2={'Name':['Tom','Bob','Mary'],'Age':[40,30,20]}
df1=pd.DataFrame.from_records(data1)
Check the df
df1
Age Name Pay
0 20.0 Tom NaN
1 NaN NaN NaN
2 40.0 Mary 20.0
3 30.0 Jane 25.0
The record with index 1 has 3 missing values...
Replace the NaNs so that missing values become None:
df1 = df1.replace({np.nan: None})
Now write a function to count the missing values per row, and build a list:
def count_na(lst):
    # missing entries were replaced with None above, so count those
    missing = [n for n in lst if n is None]
    return len(missing)

missing_data = []
for index, n in df1.iterrows():
    missing_data.append(count_na(list(n)))
Use this list as a new column in the dataframe:
df1['missing']=missing_data
df1 should look like this
Age Name Pay missing
0 20 Tom None 1
1 None None None 3
2 40 Mary 20 0
3 30 Jane 25 0
So filtering becomes easy....
# Now only take records with <2 missing
df1[df1.missing<2]
Hope that helps...
A simple way is to compare, row by row, the count of values against the number of columns of the dataframe. You can then just replace NaN with the column averages of the dataframe.
Code could be:
result = df.loc[df.apply(lambda x: x.count(), axis=1) >= (len(df.columns) - 2)].replace(
    np.nan, df.agg('mean'))
With your example data, it gives as expected:
Country Name 2001 2002 2003 2004
1 Malta 62.0 58.00 60.000000 58.0
2 Singapore 60.0 49.25 39.333333 56.0
3 Malaysia 58.0 57.00 39.333333 55.0
4 Ireland 47.0 41.00 34.000000 34.0
5 Georgia 38.0 41.00 24.000000 38.0
Try this
hightech_export.dropna(thresh=len(hightech_export.columns) - 2, inplace=True)
in place of the line of code
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
Note that thresh sets how many non-NaN values a row needs in order to be kept, so allowing at most 2 missing values means requiring at least len(columns) - 2 values.
I have a 4x4 dataframe (df). I created two child dataframes (4x1), (4x2). And updated both. In first case, the parent is updated, in second, it is not. How to ensure that the parent dataframe is updated when child dataframe is updated?
I have a 4x4 dataframe (df). From this parent, I created two child dataframes: dfA with a single column (4x1) and dfB with two columns (4x2). I have NaN values in both subsets. Now, when I use fillna on dfA and dfB respectively, I can see the NaN values updated with the given value. Fine up to now. However, when I check the parent dataframe, in the first case (4x1) the updated value is reflected, whereas in the second case (4x2) it is not. Why is this so, and what should I do so that changes in a child dataframe are reflected in the parent dataframe?
import numpy as np
import pandas as pd

studentnames = ['Maths','English','Soc.Sci', 'Hindi', 'Science']
semisteronemarks = [15, 50, np.NaN, 50, np.NaN]
semistertwomarks = [25, 53, 45, 45, 54]
semisterthreemarks = [20, 50, 45, 15, 38]
semisterfourmarks = [26, 33, np.NaN, 35, 34]
semisters = ['Rakesh','Rohit', 'Sam', 'Sunil']
df1 = pd.DataFrame([semisteronemarks,semistertwomarks,semisterthreemarks,semisterfourmarks],semisters, studentnames)
# case 1
dfA = df['Soc.Sci']
dfA.fillna(value = 98, inplace = True)
print(dfA)
print(df)
# case 2
dfB = df[['Soc.Sci', 'Science']]
dfB.fillna(value = 99, inplace = True)
print(dfB)
print(df)
'''
## contents of parent df ->>
## Actual Output -
# case 1
Maths English Soc.Sci Hindi Science
Rakesh 15 50 98.0 50 NaN
Rohit 25 53 45.0 45 54.0
Sam 20 50 45.0 15 38.0
Sunil 26 33 98.0 35 34.0
# case 2
Maths English Soc.Sci Hindi Science
Rakesh 15 50 NaN 50 NaN
Rohit 25 53 45.0 45 54.0
Sam 20 50 45.0 15 38.0
Sunil 26 33 NaN 35 34.0
## Expected Output -
# case 1
Maths English Soc.Sci Hindi Science
Rakesh 15 50 98.0 50 NaN
Rohit 25 53 45.0 45 54.0
Sam 20 50 45.0 15 38.0
Sunil 26 33 98.0 35 34.0
# case 2
Maths English Soc.Sci Hindi Science
Rakesh 15 50 99.0 50 NaN
Rohit 25 53 45.0 45 54.0
Sam 20 50 45.0 15 38.0
Sunil 26 33 99.0 35 34.0
# note the difference in output for column Soc.Sci in case 2.
In your code, df1 is defined; df is not.
With the approach being used:
# case 1
dfA = df1['Soc.Sci']  # changed df to df1
dfA.fillna(value = 98, inplace = True)
df1['Soc.Sci'] = dfA  # because dfA is not a dataframe but a series

# if you want to do
df1['Soc.Sci'] = dfA['Soc.Sci']
# you will need to change dfA so that it is a dataframe:
dfA = df1[['Soc.Sci']]  # double brackets make it a dataframe

# case 2
dfB = df1[['Soc.Sci', 'Science']]  # changed df to df1
dfB.fillna(value = 99, inplace = True)
df1[['Soc.Sci','Science']] = dfB[['Soc.Sci','Science']]
print(df1)
I would suggest just using the fillna in the parent df.
df1['Soc.Sci'].fillna(value=99,inplace=True)
You should have seen a warning:
Warning (from warnings module):
...
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
It means that dfB may be a copy instead of a view, and according to the results, it is. There is little that can be done here; in particular, you cannot force pandas to generate a view. The choice depends on parameters known only to pandas and its developers.
But it is always possible to assign to the columns of the parent DataFrame:
# case 2
df = pd.DataFrame([semisteronemarks,semistertwomarks,semisterthreemarks,semisterfourmarks],semisters, studentnames)
df[['Soc.Sci', 'Science']] = df[['Soc.Sci', 'Science']].fillna(value = 99)
print(df)
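For intuition, a minimal sketch of why case 1 happened to propagate while case 2 did not; none of this behavior is guaranteed, which is exactly what the warning is about:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})

s = df['a']         # single-column selection: often a view on df's data
d = df[['a', 'b']]  # list-of-columns selection: a fresh copy

s.fillna(0, inplace=True)  # may write through to df (what happened in case 1)
d.fillna(0, inplace=True)  # modifies only the copy (what happened in case 2)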
I have 2 pandas DataFrames:
The first DataFrame contains all of the initial review data, with columns "prodcode", "sentiment", "summaryText", "reviewText", etc.:
DFF = DFF[['prodcode', 'summaryText', 'reviewText', 'overall', 'reviewerID', 'reviewerName', 'helpful','reviewTime', 'unixReviewTime', 'sentiment','textLength']]
which produces:
prodcode summaryText reviewText overall reviewerID ... helpful reviewTime unixReviewTime sentiment textLength
0 B00002243X Work Well - Should Have Bought Longer Ones I needed a set of jumper cables for my new car... 5.0 A3F73SC1LY51OO ... [4, 4] 08 17, 2011 1313539200 2 516
1 B00002243X Okay long cables These long cables work fine for my truck, but ... 4.0 A20S66SKYXULG2 ... [1, 1] 09 4, 2011 1315094400 2 265
2 B00002243X Looks and feels heavy Duty Can't comment much on these since they have no... 5.0 A2I8LFSN2IS5EO ... [0, 0] 07 25, 2013 1374710400 2 1142
3 B00002243X Excellent choice for Jumper Cables!!! I absolutley love Amazon!!! For the price of ... 5.0 A3GT2EWQSO45ZG ... [19, 19] 12 21, 2010 1292889600 2 4739
4 B00002243X Excellent, High Quality Starter Cables I purchased the 12' feet long cable set and th... 5.0 A3ESWJPAVRPWB4 ... [0, 0] 07 4, 2012 1341360000 2 415
The second DataFrame groups by prodcode and gives, for each sentiment score, the ratio of reviews with that score to all reviews made for that product.
df1 = (DFF.groupby(["prodcode", "sentiment"]).count()
          .join(DFF.groupby("prodcode").count(), "prodcode", rsuffix="_r"))[['reviewText', 'reviewText_r']]
df1['result'] = df1['reviewText'] / df1['reviewText_r']
df1 = df1.reset_index()
df1 = df1.pivot("prodcode", 'sentiment', 'result').fillna(0)
df1 = round(df1 * 100)
df1.astype('int')
sorted_df2 = df1.sort_values(['0', '1', '2'], ascending=False)
which produces the following DF:
sentiment 0 1 2
prodcode
B0024E6QOO 80.0 0.0 20.0
B000GPV2QA 67.0 17.0 17.0
B0067DNSUI 67.0 0.0 33.0
B00192JH4S 62.0 12.0 25.0
B0087FSA0C 60.0 20.0 20.0
B0002KM5L0 60.0 0.0 40.0
B000DZBP60 60.0 0.0 40.0
B000PJCBOE 60.0 0.0 40.0
B0033A5PPO 57.0 29.0 14.0
B003POL69C 57.0 14.0 29.0
B0002Z9L8K 56.0 31.0 12.0
What I am now trying to do is filter my first dataframe in two ways. First, by the results of the second dataframe: I want the first dataframe to be filtered down to the prodcodes from the second dataframe where df1.sentiment['0'] > 40. From that list, I want to filter the first dataframe to those rows where 'sentiment' in the first dataframe = 0.
At a high level, I am trying to obtain the prodcode, summaryText and reviewText in the first dataframe for products that had high ratios of low sentiment scores, and whose sentiment is 0.
Something like this, assuming all the data you need is in DFF and df1 and no merges are needed:
m = list(df1.loc[df1['0'] > 40].index)  # prodcodes matching your ratio criterion
DFF.loc[(DFF['sentiment'] == 0) & (DFF['prodcode'].isin(m))]  # filter according to your conditions
I figured it out:
DF3 = pd.merge(DFF, df1, left_on='prodcode', right_on='prodcode')
print(DF3.loc[(DF3['0'] > 50.0) & (DF3['2'] < 50.0) & (DF3['sentiment'].isin(['0']))].sort_values('0', ascending=False))