I have looked through a lot of posts, but none of the solutions work in my code:
x4 = x4.set_index('grupa').T.rename_axis('DANE').reset_index().rename_axis(None,1).round()
After which I get the results DataFrame:
DANE BAKALIE NASIONA OWOCE WARZYWA
0 ilosc 5.0 94.0 61.0 623.0
1 marza_netto 7.0 120.0 69.0 668.0
2 marza_procent2 32.0 34.0 29.0 27.0
But I would like to receive:
DANE BAKALIE NASIONA OWOCE WARZYWA
0 ilosc 5 94 61 623
1 marza_netto 7 120 69 668
2 marza_procent2 32 34 29 27
I tried replace('.0', ''), int(round()), and astype(int), but either I don't get good results or I get errors about the method not being compatible with a DataFrame.
If the only non-numeric column is DANE, cast to int before converting the index back to a column:
x4 = (x4.set_index('grupa')
        .T
        .rename_axis('DANE')
        .astype(int)
        .reset_index()
        .rename_axis(None, axis=1))
A more general solution is to select all float columns and cast them:
cols = df.select_dtypes(include=['float']).columns
df[cols] = df[cols].astype(int)
print (df)
DANE BAKALIE NASIONA OWOCE WARZYWA
0 ilosc 5 94 61 623
1 marza_netto 7 120 69 668
2 marza_procent2 32 34 29 27
If there are NaN values, converting to int is not possible.
Two options are available, as sketched below:
1. Drop all rows with NaNs:
df = df.dropna()
2. Replace NaNs with some integer such as 0:
df = df.fillna(0)
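For example, a minimal sketch of both options, using a small hypothetical frame with the same column names as above:
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for the transposed x4
df = pd.DataFrame({'DANE': ['ilosc', 'marza_netto'],
                   'BAKALIE': [5.0, np.nan],
                   'NASIONA': [94.0, 120.0]})

cols = df.select_dtypes(include=['float']).columns

# option 1: drop rows that contain NaN, then cast the float columns
opt1 = df.dropna(subset=cols).astype({c: int for c in cols})

# option 2: replace NaN with 0, then cast the float columns
opt2 = df.fillna({c: 0 for c in cols}).astype({c: int for c in cols})
print(opt1)
print(opt2)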
Not 100% sure I got your question, but you can use an astype(int) conversion.
df = df.set_index('DANE').astype(int).reset_index()
df
DANE BAKALIE NASIONA OWOCE WARZYWA
0 ilosc 5 94 61 623
1 marza_netto 7 120 69 668
2 marza_procent2 32 34 29 27
If you're dealing with rows that have NaNs, either drop those rows and convert, or convert to astype(object). The latter is not recommended because you lose performance.
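Another option worth mentioning (a sketch, not part of the answer above): recent pandas versions have a nullable integer dtype, 'Int64', which can hold missing values, so rows with NaN can be kept while still displaying whole numbers:
# sketch: the nullable Int64 dtype keeps <NA> alongside integers (recent pandas)
df = df.set_index('DANE').astype('Int64').reset_index()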
As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
I can't seem to use this as a regular dataframe. The datetime is messed up: it doesn't show the month of the mean but gives the last day of the month back. Also, the station name appears only once as an index rather than in every row, and the mean value doesn't have a "column name" at all. This isn't a dataframe but a pandas.core.series.Series. Converting it with the .to_frame() method does report a DataFrame again, but the result still isn't right, and I don't get this part.
I found that in order to return a normal dataframe, I should use
as_index=False
in the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index=False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to combine this with grouping on the other columns; it just returns the mean value of everything.
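(Side note, not from the original post: resample can be made to work per station by grouping first, roughly like this, assuming Date is a proper datetime column:)
# sketch: per-station monthly means via groupby + resample
per_station = (df.set_index('Date')
                 .groupby('Station_Name')['Value']
                 .resample('M')
                 .mean()
                 .reset_index())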
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
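If you prefer to keep pd.Grouper, roughly the same result should be reachable by pointing it at the Date column and converting to periods afterwards (a sketch, assuming Date is a datetime column):
# sketch: group on the Date column via pd.Grouper, then flatten and convert to periods
out = (df.groupby(['Station_Name', pd.Grouper(key='Date', freq='M')])['Value']
         .mean()
         .reset_index())
out['Date'] = out['Date'].dt.to_period('M')   # month-end timestamps -> "2006-01"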
I have two dataframes df1 and df2.
d = {'ID': [31,42,63,44,45,26],
'lat': [64,64,64,64,64,64],
'lon': [152,152,152,152,152,152],
'other1': [12,13,14,15,16,17],
'other2': [21,22,23,24,25,26]}
df1 = pd.DataFrame(data=d)
d2 ={'ID': [27,48,31,45,49,10],
'LAT': [63,63,63,63,63,63],
'LON': [153,153,153,153,153,153]}
df2 = pd.DataFrame(data=d2)
df1 has incorrect values for columns lat and lon, but has correct data in the other columns that I need to keep track of. df2 has correct LAT and LON values but only has a few common IDs with df1. There are two things I would like to accomplish. First, I want to split df1 into two dataframes: df3 which has IDs that are present in df2; and df4 which has everything else. I can get df3 with:
from functools import reduce

df3 = pd.DataFrame()
for i in reduce(np.intersect1d, [df1.ID, df2.ID]):
    df3 = df3.append(df1.loc[df1.ID == i])
but how do I get df4 to be the remaining data?
Second, I want to replace the lat and lon values in df3 with the correct data from df2.
I figure there is a slick python way to do something like:
for j in range(len(df3)):
    for k in range(len(df2)):
        if df3.ID[j] == df2.ID[k]:
            df3.lat[j] = df2.LAT[k]
            df3.lon[j] = df2.LON[k]
But I can't even get the above nested loop working correctly. I don't want to spend a lot of time getting it working if there is a better way to accomplish this in python.
For question 1, you can use boolean indexing:
m = df1.ID.isin(df2.ID)
df3 = df1[m]
df4 = df1[~m]
print(df3)
print(df4)
Prints:
ID lat lon other1 other2
0 31 64 152 12 21
4 45 64 152 16 25
ID lat lon other1 other2
1 42 64 152 13 22
2 63 64 152 14 23
3 44 64 152 15 24
5 26 64 152 17 26
For question 2:
x = df3.merge(df2, on="ID")[["ID", "other1", "other2", "LAT", "LON"]]
print(x)
Prints:
ID other1 other2 LAT LON
0 31 12 21 63 153
1 45 16 25 63 153
EDIT: For question 2 you can do:
x = df3.merge(df2, on="ID").drop(columns=["lat", "lon"])
print(x)
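If the original lowercase column names should survive, a small follow-up (just a sketch) is to rename the merged columns back:
# sketch: keep the original column names by renaming LAT/LON after the merge
x = (df3.merge(df2, on="ID")
        .drop(columns=["lat", "lon"])
        .rename(columns={"LAT": "lat", "LON": "lon"}))
print(x)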
You can merge with an indicator, keep LAT and LON where they exist and fill the rest from lat and lon, then use the indicator with a groupby to create a dictionary of the two frames and grab the key you need:
u = df1.merge(df2,on='ID',how='left',indicator='I')
u[['LAT','LON']] = np.where(u[['LAT','LON']].isna(),u[['lat','lon']],u[['LAT','LON']])
u = u.drop(['lat','lon'], axis=1)
u['I'] = np.where(u['I'].eq("left_only"),"left_df","others")
d = dict(iter(u.groupby("I")))
print(d['left_df'],'\n--------\n',d['others'])
ID other1 other2 LAT LON I
1 42 13 22 64.0 152.0 left_df
2 63 14 23 64.0 152.0 left_df
3 44 15 24 64.0 152.0 left_df
5 26 17 26 64.0 152.0 left_df
--------
ID other1 other2 LAT LON I
0 31 12 21 63.0 153.0 others
4 45 16 25 63.0 153.0 others
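A simpler alternative for the replacement step (not part of the answer above, just a sketch reusing df3 from the boolean-indexing answer) is to map the corrected coordinates by ID and fall back to the existing values where there is no match:
# sketch: look up corrected coordinates from df2 by ID, keep old values where no match exists
lat_map = df2.set_index('ID')['LAT']
lon_map = df2.set_index('ID')['LON']
df3 = df3.assign(lat=df3['ID'].map(lat_map).fillna(df3['lat']),
                 lon=df3['ID'].map(lon_map).fillna(df3['lon']))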
First, skip a row of data if it has more than 2 empty columns. After this step, the rows with more than 2 missing values will be filtered out.
Then, since some rows still have 1 or 2 empty columns, I will fill each empty cell with the mean value of that row.
I can run the second step with my code below; however, I am not sure how to filter out the rows with more than 2 missing values.
I have tried using dropna but it deleted all the columns of the table.
My code:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as pp
%matplotlib inline
# high-technology exports as a percentage of manufactured exports
hightech_export = pd.read_csv('hightech_export_1.csv')
#skip the row of data if the columns have more than 2 columns are empty
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
# Fill in data with mean value.
m = hightech_export.mean(axis=1)
for i, col in enumerate(hightech_export):
    hightech_export.iloc[:, i] = hightech_export.iloc[:, i].fillna(m)
My dataset:
Country Name 2001 2002 2003 2004
Philippines 71
Malta 62 58 60 58
Singapore 60 56
Malaysia 58 57 55
Ireland 47 41 34 34
Georgia 38 41 24 38
Costa Rica
You can make use of .isnull() method for doing your first task.
Replace this:
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
with:
hightech_export = hightech_export.loc[hightech_export.isnull().sum(axis=1) <= 2]
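Both steps together might look roughly like this (a sketch, assuming the CSV has the columns shown in the question, including 'Country Name'):
import pandas as pd

hightech_export = pd.read_csv('hightech_export_1.csv')

# step 1: keep only rows with at most 2 missing values
hightech_export = hightech_export.loc[hightech_export.isnull().sum(axis=1) <= 2]

# step 2: fill the remaining gaps with the mean of that row (numeric year columns only)
year_cols = hightech_export.columns.drop('Country Name')
row_means = hightech_export[year_cols].mean(axis=1)
hightech_export[year_cols] = hightech_export[year_cols].T.fillna(row_means).T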
Ok try this ...
import pandas as pd
import numpy as np
data1={'Name':['Tom',np.NaN,'Mary','Jane'],'Age':[20,np.NaN,40,30],'Pay':[np.NaN,np.NaN,20,25]}
data2={'Name':['Tom','Bob','Mary'],'Age':[40,30,20]}
df1=pd.DataFrame.from_records(data1)
Check the df
df1
Age Name Pay
0 20.0 Tom NaN
1 NaN NaN NaN
2 40.0 Mary 20.0
3 30.0 Jane 25.0
The record with index 1 has 3 missing values...
Replace the missing values, making them None:
df1 = df1.replace({np.nan: None})
Now write a function to count missing values per row and use it to build a list:
def count_na(lst):
    # count entries that are missing (None after the replace above)
    missing = [n for n in lst if n is None]
    return len(missing)

missing_data = []
for index, n in df1.iterrows():
    missing_data.append(count_na(list(n)))
Use this list as a new Column in the Dataframe
df1['missing']=missing_data
df1 should look like this
Age Name Pay missing
0 20 Tom None 1
1 None None None 3
2 40 Mary 20 0
3 30 Jane 25 0
So filtering becomes easy....
# Now only take records with <2 missing
df1[df1.missing<2]
Hope that helps...
A simple way is to compare, on a row basis, the count of non-null values with the number of columns of the dataframe. You can then replace the remaining NaN values with the column means of the dataframe.
Code could be:
result = (df.loc[df.apply(lambda x: x.count(), axis=1) >= (len(df.columns) - 2)]
            .replace(np.nan, df.agg('mean')))
With your example data, it gives as expected:
Country Name 2001 2002 2003 2004
1 Malta 62.0 58.00 60.000000 58.0
2 Singapore 60.0 49.25 39.333333 56.0
3 Malaysia 58.0 57.00 39.333333 55.0
4 Ireland 47.0 41.00 34.000000 34.0
5 Georgia 38.0 41.00 24.000000 38.0
Try this
hightech_export.dropna(thresh=hightech_export.shape[1] - 2, inplace=True)  # keep rows with at most 2 missing values
in place of the line of code
hightech_export.dropna(axis=1, how='any', thresh=2, subset=None, inplace=False)
I am trying to cap the number of actions kept for a user once the number of actions in a session reaches a threshold.
Here is the data set: (Only Few records)
user_id,session_id,item_id,rating,length,time
123,36,28,3.5,6243.0,2015-03-07 22:44:40
123,36,29,2.5,4884.0,2015-03-07 22:44:14
123,36,30,3.5,6846.0,2015-03-07 22:44:28
123,36,54,6.5,10281.0,2015-03-07 22:43:56
123,36,61,3.5,7639.0,2015-03-07 22:43:44
123,36,62,7.5,18640.0,2015-03-07 22:43:34
123,36,63,8.5,7189.0,2015-03-07 22:44:06
123,36,97,2.5,7627.0,2015-03-07 22:42:53
123,36,98,4.5,9000.0,2015-03-07 22:43:04
123,36,99,7.5,7514.0,2015-03-07 22:43:13
223,63,30,8.0,5412.0,2015-03-22 01:42:10
123,36,30,5.5,8046.0,2015-03-07 22:42:05
223,63,32,8.5,4872.0,2015-03-22 01:42:03
123,36,32,7.5,11914.0,2015-03-07 22:41:54
225,63,35,7.5,6491.0,2015-03-22 01:42:19
123,36,35,5.5,7202.0,2015-03-07 22:42:15
123,36,36,6.5,6806.0,2015-03-07 22:42:43
123,36,37,2.5,6810.0,2015-03-07 22:42:34
225,63,41,5.0,15026.0,2015-03-22 01:42:37
225,63,45,6.5,8532.0,2015-03-07 22:42:25
I can groupby the data using user_id and session_id and get a count of items a user has rated in a session:
df.groupby(['user_id', 'session_id']).agg({'item_id':'count'}).rename(columns={'item_id': 'count'})
The list of items that a user has rated in a session can be obtained with:
df.groupby(['user_id','session_id'])['item_id'].apply(list)
The goal is the following: if a user has rated more than 3 items in a session, I want to pick only the first three items (keep only the first three per user per session) from the original data frame. Maybe use the time to sort the items?
I first tried to work out which sessions contain more than 3 items, but I am struggling to get beyond that:
df.groupby(['user_id', 'session_id'])['item_id'].apply(
    lambda x: (x > 3).count())
Example: from original df, user 123 should have first three records belong to session 36
It seems like you want to use groupby with head:
In [8]: df.groupby([df.user_id, df.session_id]).head(3)
Out[8]:
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
10 223 63 30 8.0 5412.0 2015-03-22 01:42:10
12 223 63 32 8.5 4872.0 2015-03-22 01:42:03
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
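If "first three" should mean the earliest three by time rather than the first three in file order, sorting before the groupby is a small extension (a sketch, assuming the time column can be parsed as datetimes):
# sketch: keep the three earliest-rated items per user/session
df['time'] = pd.to_datetime(df['time'])
first3 = (df.sort_values('time')
            .groupby(['user_id', 'session_id'])
            .head(3))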
One way is to use sort_values followed by groupby.cumcount. A method I find useful is to extract any Series or MultiIndex data before applying any filtering.
The example below filters for user_id / session_id combinations with at least 3 items and only takes the first 3 in each group.
sizes = df.groupby(['user_id', 'session_id']).size()
counter = df.groupby(['user_id', 'session_id']).cumcount() + 1 # counting begins at 0
indices = df.set_index(['user_id', 'session_id']).index
df = df.sort_values('time')
res = df[(indices.map(sizes.get) >= 3) & (counter <=3)]
print(res)
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
I have the following data frame:
SID AID START END
71 1 1 -11136 -11122
74 1 1 -11121 -11109
78 1 1 -11034 -11014
79 1 2 -11137 -11152
83 1 2 -11114 -11127
86 1 2 -11032 -11038
88 1 2 -11121 -11002
I want to do a subtraction of the START elements with AID==1 and AID==2, in order, such that the expected result would be:
-11136 - (-11137) = 1
-11121 - (-11114) =-7
-11034 - (-11032) =-2
Nan - (-11002) = NaN
So I extracted two groups:
values1 = group.loc[group['AID'] == 1]["START"]
values2 = group.loc[group['AID'] == 2]["START"]
with the following result:
71 -11136
74 -11121
78 -11034
Name: START, dtype: int64
79 -11137
83 -11114
86 -11032
88 -11002
Name: START, dtype: int64
and did a simple subtraction:
values1-values2
But I got all NaNs:
71 NaN
74 NaN
78 NaN
79 NaN
83 NaN
86 NaN
I noticed that if I use data from the same AID group (e.g. START-END), I get the right answer. I only get NaN when I "mix" AID groups. I'm just getting started with Pandas, but I'm obviously missing something here. Any suggestions?
Let's try this:
df.set_index([df.groupby(['SID','AID']).cumcount(),'AID'])['START'].unstack().add_prefix('col_').eval('col_1 - col_2')
Output:
0 1.0
1 -7.0
2 -2.0
3 NaN
dtype: float64
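Unrolled, the same chain looks roughly like this (a sketch of what each call contributes):
# position of each row within its SID/AID group
pos = df.groupby(['SID', 'AID']).cumcount()

wide = (df.set_index([pos, 'AID'])['START']   # index: (position, AID)
          .unstack()                          # columns: the AID values 1 and 2
          .add_prefix('col_'))                # columns: col_1, col_2
result = wide.eval('col_1 - col_2')           # row-wise difference, NaN where unmatched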
pandas does those operations based on labels. Since your labels ((71, 74, 78) and (79, 83, 86, 88)) don't match, it cannot find any values to subtract. One way to deal with this is to use a numpy array instead of a Series so there are no labels to align:
values1 - values2.values
Out:
71 1
74 -7
78 -2
Name: START, dtype: int64
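When the two groups have different lengths (here AID 2 has one extra row), another option (a sketch) is to reset the index on both sides so they align by position, which also reproduces the NaN in the expected output:
# sketch: drop the original labels so subtraction aligns by position
values1.reset_index(drop=True) - values2.reset_index(drop=True)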
Bizarre way to go about it
-np.diff([g.reset_index(drop=True) for n, g in df.groupby('AID').START])[0]
0 1.0
1 -7.0
2 -2.0
3 NaN
Name: START, dtype: float64