I want to shift the column values one position to the left. I don't want to keep the original values of the column 'average_rating'.
I used the shift command:
data3 = data3.shift(-1, axis=1)
But the output I get has missing values for two columns, 'num_pages' and 'text_reviews_count'.
This happens because the data types of the source and target columns do not match. Try converting each affected column to the target data type after shift(), for example with .fillna(0).astype(int).
Alternatively, you can convert all the data in the DataFrame to strings and then perform the shift. You might want to convert the columns back to their original data types afterwards.
import pandas as pd
from io import StringIO

df = df.astype(str)                                      # convert all data to str
df_shifted = df.shift(-1, axis=1)                        # perform the shift
df_string = df_shifted.to_csv()                          # dump the shifted frame to a CSV string
new_df = pd.read_csv(StringIO(df_string), index_col=0)   # read it back so the dtypes are re-inferred
Output:
average_rating isbn isbn13 language_code num_pages ratings_count text_reviews_count extra
0 3.57 0674842111 978067 en-US 236 55 6.0 NaN
1 3.60 1593600119 978067 eng 400 25 4.0 NaN
2 3.63 156384155X 978067 eng 342 38 4.0 NaN
3 3.98 1857237250 978067 eng 383 2197 17.0 NaN
4 0.00 0851742718 978067 eng 49 0 0.0 NaN
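If you would rather skip the CSV round trip, you can convert the shifted string columns back yourself. A minimal sketch, assuming the column names from the sample output above (adjust the list to your data):

# convert selected columns of the all-string shifted frame back to numbers
# the column list here is an assumption based on the sample output
for col in ['average_rating', 'num_pages', 'ratings_count', 'text_reviews_count']:
    df_shifted[col] = pd.to_numeric(df_shifted[col], errors='coerce')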
Let's say I have a dataset like below:
I want to replace the null values with the median of each column. But when I try to do that, all the NAs are replaced with the median of the first column only.
Rough_df = pd.read_excel(r'Cleandata_withOutliers.xlsx', sheet_name='Sheet2')
Rough_df.fillna(Rough_df.select_dtypes(include='number').median().iloc[0], inplace=True)
My output looks like this:
But ideally the NA values in the 2nd column should be replaced with 10170.5 and not with 77.5. Where am I going wrong?
You can just pass the column-wise medians to fillna; they are aligned by column name:
out = df.fillna(df.median())
Out[68]:
X Y
0 60.0 9550.0
1 85.0 10170.5
2 77.5 10791.0
3 101.0 14215.0
4 47.0 16321.0
5 108.0 10170.5
6 77.5 8658.0
7 70.0 7945.0
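The reason the original attempt filled everything with 77.5 is the .iloc[0], which reduces the Series of per-column medians to a single scalar (the median of the first column). If the frame also has non-numeric columns, a minimal sketch that reuses the select_dtypes call from the question, just without the .iloc[0]:

# fill each numeric column with its own median; non-numeric columns are left untouched
out = df.fillna(df.select_dtypes(include='number').median())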
I have a big dataframe. Some of the values in a column are NaN. I want to fill them with some value based on the other column value.
Data:
df =
A B
2019-10-01 09:19:40 667.029710 10
2019-10-01 09:20:15 673.518030 20
2019-10-01 09:21:29 533.137144 30
2020-07-25 15:51:15 NaN 40
2020-07-25 17:20:20 NaN 50
2020-07-25 17:21:23 NaN 60
I want to fill NaN in A column based on the B column value.
My code:
sdf = df[df['A'].isnull()] # slice NaN and create a new dataframe
sdf['A'] = sdf['B']*sdf['B']
df = pd.concat([df,sdf])
Everything works fine, but I feel my code is lengthy. Is there a one-line way to do this?
We can do this in one line with fillna:
df.A.fillna(df.B**2, inplace=True)
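Note that inplace=True on a single column selection may not update the original frame in recent pandas versions (copy-on-write); an assignment-based sketch of the same idea:

# fill NaN in A with the square of B, writing the result back explicitly
df['A'] = df['A'].fillna(df['B'] ** 2)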
I have a csv file which is something like below
date,mean,min,max,std
2018-03-15,3.9999999999999964,inf,0.0,100.0
2018-03-16,0.46403712296984756,90.0,0.0,inf
2018-03-17,2.32452732452731,,0.0,143.2191767899579
2018-03-18,2.8571428571428523,inf,0.0,100.0
2018-03-20,0.6928406466512793,100.0,0.0,inf
2018-03-22,2.8675703858185635,,0.0,119.05383697172658
I want to select those column values which are > 20 and < 500 (i.e. in the range 20 to 500) and put those values, along with the date, into another dataframe. The other dataframe looks something like this:
Date percentage_change location
2018-02-14 23.44 BOM
So I want to get the date and value from the csv and add them to the new dataframe in the appropriate columns. Something like:
Date percentage_change location
2018-02-14 23.44 BOM
2018-03-15 100.0 NaN
2018-03-16 90.0 NaN
2018-03-17 143.2191767899579 NaN
.... .... ....
Now I am aware of functions like df.max(axis=1) and df.min(axis=1), which give you the min and max, but I am not sure how to find values based on a range. So how can this be achieved?
Given dataframes df1 and df2, you can achieve this by aligning column names, cleaning the numeric data, and then using pd.DataFrame.append.
import numpy as np

# align column names and clean inf / NaN values
df_app = df1.loc[:, ['date', 'mean', 'min', 'std']]\
            .rename(columns={'date': 'Date'})\
            .replace(np.inf, 0)\
            .fillna(0)
print(df_app)

# take the larger of the 'min' and 'std' values as the candidate percentage change
df_app['percentage_change'] = np.maximum(df_app['min'], df_app['std'])
print(df_app)

# keep only values in the 20-500 range and append to df2
df_app = df_app[df_app['percentage_change'].between(20, 500)]
res = df2.append(df_app.loc[:, ['Date', 'percentage_change']])
print(res)
# Date location percentage_change
# 0 2018-02-14 BOM 23.440000
# 0 2018-03-15 NaN 100.000000
# 1 2018-03-16 NaN 90.000000
# 2 2018-03-17 NaN 143.219177
# 3 2018-03-18 NaN 100.000000
# 4 2018-03-20 NaN 100.000000
# 5 2018-03-22 NaN 119.053837
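DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on newer versions the last step can be written with pd.concat instead (same df2 and df_app as above):

# equivalent of df2.append(...) on current pandas
res = pd.concat([df2, df_app.loc[:, ['Date', 'percentage_change']]])   # add ignore_index=True for a fresh 0..n index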
My table looks like this:
In [82]:df.head()
Out[82]:
MatDoc MatYr MvT Material Plnt SLoc Batch Customer AmountLC Amount ... PO MatYr.1 MatDoc.1 Order ProfitCtr SLED/BBD PstngDate EntryDate Time Username
0 4912693062 2015 551 100062 HDC2 0001 5G30MC1A11 NaN 9.03 9.06 ... NaN NaN NaN NaN IN1165B085 26.01.2016 01.08.2015 01.08.2015 01:13:16 O33462
1 4912693063 2015 501 166 HDC2 0004 NaN NaN 0.00 0.00 ... NaN NaN NaN NaN IN1165B085 NaN 01.08.2015 01.08.2015 01:13:17 O33462
2 4912693320 2015 551 101343 HDC2 0001 5G28MC1A11 NaN 53.73 53.72 ... NaN NaN NaN NaN IN1165B085 25.01.2016 01.08.2015 01.08.2015 01:16:30 O33462
Here, I need to group the data on the Order column and sum only the AmountLC column. Then I need to check for Order values that are present in both MvT101group and MvT102group, and if an Order matches in both sets of data, subtract MvT102group from MvT101group and display:
Order|Plnt|Material|Batch|Sum101=SumofMvt101ofAmountLC|Sum102=SumofMvt102ofAmountLC|(Sum101-Sum102)/100
What I have done first is make new dataframes containing only 101 and 102: MvT101 and MvT102.
MvT101 = df.loc[df['MvT'] == 101]
MvT102 = df.loc[df['MvT'] == 102]
Then I grouped each by Order and got the sum of the column:
MvT101group = MvT101.groupby('Order', sort=True)
In [76]:
MvT101group[['AmountLC']].sum()
Out[76]:
Order AmountLC
1127828 16348566.88
1127829 22237710.38
1127830 29803745.65
1127831 30621381.06
1127832 33926352.51
MvT102group = MvT102.groupby('Order', sort=True)
In [77]:
MvT102group[['AmountLC']].sum()
Out[77]:
Order AmountLC
1127830 53221.70
1127831 651475.13
1127834 67442.16
1127835 2477494.17
1128622 218743.14
After this I am not able to understand how I should write my query.
Please ask me for any further details if you need them. Here is the CSV file I am working from: Link
Hope I understood the question correctly. After grouping both groups as you did:
MvT101group = MvT101.groupby('Order',sort=True).sum()
MvT102group = MvT102.groupby('Order',sort=True).sum()
You can update the columns' names for both groups:
MvT101group.columns = MvT101group.columns.map(lambda x: str(x) + '_101')
MvT102group.columns = MvT102group.columns.map(lambda x: str(x) + '_102')
Then merge all 3 tables so that you will have all 3 columns in the main table:
df = df.merge(MvT101group, left_on=['Order'], right_index=True, how='left')
df = df.merge(MvT102group, left_on=['Order'], right_index=True, how='left')
And then you can add the calculated column:
df['calc'] = (df['AmountLC_101'] - df['AmountLC_102']) / 100
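If you then want one row per Order with just the columns listed in the question, a minimal sketch (column names assumed from the sample data; drop_duplicates keeps the first Plnt/Material/Batch seen for each Order):

# keep only Orders present in both groups, then reduce to one row per Order
summary = (df.dropna(subset=['AmountLC_101', 'AmountLC_102'])
             .drop_duplicates(subset='Order')
             [['Order', 'Plnt', 'Material', 'Batch', 'AmountLC_101', 'AmountLC_102', 'calc']])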
I am trying to do some manipulations that involve rows and columns at the same time, including date-time series, in Pandas. For plain data without series, Python dictionaries work great, but with Pandas this is new to me.
Input files: N of them.
File1.csv, File2.csv, File3.csv, ..., Filen.csv

File1.csv          File2.csv          File3.csv
Ids,Date-time-1    Ids,Date-time-2    Ids,Date-time-1
56,4568            645,5545           25,54165
45,464             458,546
I am trying to merge the Date-time columns of all the files into one big data file, keyed on Ids:
Ids,Date-time-ref,Date-time-1,date-time-2
56,100,4468,NAN
45,150,314,NAN
645,50,NAN,5495
458,200,NAN,346
25,250,53915,NAN
Check for the date-time column; if it is not already there, create it, and then fill the values for each Id by subtracting the Date-time-ref value of that respective Id from the current date-time value.
Fill the empty places with NaN, and if a later file has a value for that Id, replace the NaN with that value.
If it were a straight column subtraction it would be pretty easy, but doing it in sync with the date-time series and with respect to Ids seems a bit confusing.
I would appreciate some suggestions to begin with. Thanks in advance.
Here is one way to do it.
import pandas as pd
import numpy as np
from io import StringIO
# your csv file contents
csv_file1 = 'Ids,Date-time-1\n56,4568\n45,464\n'
csv_file2 = 'Ids,Date-time-2\n645,5545\n458,546\n'
# add a duplicated Ids record for testing purpose
csv_file3 = 'Ids,Date-time-1\n25,54165\n645, 4354\n'
csv_file_all = [csv_file1, csv_file2, csv_file3]
# read csv into df using list comprehension
# I use string buffers here; replace the StringIO objects with your file paths
df_all = [pd.read_csv(StringIO(csv_file)) for csv_file in csv_file_all]
# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
Out[206]:
Date-time-1 Date-time-2
Ids
56 4568 NaN
45 464 NaN
645 NaN 5545
458 NaN 546
25 54165 NaN
645 4354 NaN
# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
return group.fillna(method='ffill').iloc[-1]
# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)
Out[207]:
Date-time-1 Date-time-2
Ids
25 54165 NaN
45 464 NaN
56 4568 NaN
458 NaN 546
645 4354 5545
# do the subtraction
master_csv_file = 'Ids,Date-time-ref\n56,100\n45,150\n645,50\n458,200\n25,250\n'
df_master = pd.read_csv(StringIO(master_csv_file), index_col=['Ids']).sort_index()
# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)
# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)
Out[208]:
Date-time-ref Date-time-1 Date-time-2
Ids
25 250 53915 NaN
45 150 314 NaN
56 100 4468 NaN
458 200 NaN 346
645 50 4304 5495
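To read the real File1.csv ... Filen.csv from disk instead of the StringIO buffers used above for demonstration, a minimal sketch assuming the naming pattern from the question:

import glob
import pandas as pd

# read every FileN.csv in the working directory into the df_all list
df_all = [pd.read_csv(path) for path in sorted(glob.glob('File*.csv'))]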