I'm trying to add a new row to the DataFrame with a specific index name 'e'.
number variable values
a NaN bank true
b 3.0 shop false
c 0.5 market true
d NaN government true
I have tried the following but it's creating a new column instead of a new row.
new_row = [1.0, 'hotel', 'true']
df = df.append(new_row)
I still don't understand how to insert the row with a specific index. I'd be grateful for any suggestions.
You can use df.loc[_not_yet_existing_index_label_] = new_row.
Demo:
In [3]: df.loc['e'] = [1.0, 'hotel', 'true']
In [4]: df
Out[4]:
number variable values
a NaN bank True
b 3.0 shop False
c 0.5 market True
d NaN government True
e 1.0 hotel true
PS: note that with this method you can't add a row with an already existing (duplicate) index value (label); the row with that index label will be updated instead.
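For instance, a quick sketch continuing the demo above: assigning to the existing label 'e' overwrites that row rather than appending a duplicate.
In [5]: df.loc['e'] = [2.0, 'motel', 'false']  # 'e' already exists, so the row is updated in place
In [6]: df.loc['e']
Out[6]:
number        2.0
variable    motel
values      false
Name: e, dtype: object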
UPDATE:
This might not work in recent Pandas/Python 3 if the index is a DatetimeIndex and the new row's index label doesn't exist. It will work if we specify correct index value(s).
Demo (using pandas: 0.23.4):
In [17]: ix = pd.date_range('2018-11-10 00:00:00', periods=4, freq='30min')
In [18]: df = pd.DataFrame(np.random.randint(100, size=(4,3)), columns=list('abc'), index=ix)
In [19]: df
Out[19]:
a b c
2018-11-10 00:00:00 77 64 90
2018-11-10 00:30:00 9 39 26
2018-11-10 01:00:00 63 93 72
2018-11-10 01:30:00 59 75 37
In [20]: df.loc[pd.to_datetime('2018-11-10 02:00:00')] = [100,100,100]
In [21]: df
Out[21]:
a b c
2018-11-10 00:00:00 77 64 90
2018-11-10 00:30:00 9 39 26
2018-11-10 01:00:00 63 93 72
2018-11-10 01:30:00 59 75 37
2018-11-10 02:00:00 100 100 100
In [22]: df.index
Out[22]: DatetimeIndex(['2018-11-10 00:00:00', '2018-11-10 00:30:00', '2018-11-10 01:00:00', '2018-11-10 01:30:00', '2018-11-10 02:00:00'], dtype='datetime64[ns]', freq=None)
Use append, converting the list to a DataFrame, in case you want to add multiple rows at once, i.e.
df = df.append(pd.DataFrame([new_row],index=['e'],columns=df.columns))
Or, for a single row (thanks @Zero):
df = df.append(pd.Series(new_row, index=df.columns, name='e'))
Output:
number variable values
a NaN bank True
b 3.0 shop False
c 0.5 market True
d NaN government True
e 1.0 hotel true
If it's the first row you need:
df = pd.DataFrame(columns=['number', 'variable', 'values'])
df.loc['e', ['number', 'variable', 'values']] = [1.0, 'hotel', 'true']
or simply:
df.loc['e', :] = [1.0, 'hotel', 'true']
The latter should be the correct implementation in case of conflicting index and column names.
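As a quick check, a minimal sketch of what that produces (column names taken from the question):
import pandas as pd

df = pd.DataFrame(columns=['number', 'variable', 'values'])
df.loc['e', :] = [1.0, 'hotel', 'true']
print(df)
#   number variable values
# e    1.0    hotel   true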
In recent versions of Pandas, DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False) is deprecated (and it has been removed entirely in pandas 2.0).
Source: Pandas Documentation
The documentation recommends using pd.concat().
It would look like this (if you wanted an empty row with only the added index label):
df = pd.concat([df, pd.DataFrame(index=['New index label'])])
If you want to add data as well, use this:
df = pd.concat([df, pd.DataFrame([new_row], columns=df.columns, index=['New index label'])])
Hope that helps!
Related
My data frame has 6 columns of dates which I want combined into 1 column.
[data frame image here]
My code to make another column is below:
df['Mega'] = df['Mega'].append(df['RsWeeks','RsMonths','RsDays','PsWeeks','PsMonths','PsDays'])
I am new to Python and pandas and would like to learn more, so please point me to sources too; I am really bad at debugging since I have no programming background.
Pandas documentation is a great source for good examples. Click here to visit a page with a lot of examples and visuals.
For your particular case:
We construct a sample DataFrame:
import pandas as pd
df = pd.DataFrame([
{"RsWeeks": "2015-11-10", "RsMonths": "2016-08-01"},
{"RsWeeks": "2015-11-11", "RsMonths": "2015-12-30"}
])
print("DataFrame preview:")
print(df)
Output:
DataFrame preview:
RsWeeks RsMonths
0 2015-11-10 2016-08-01
1 2015-11-11 2015-12-30
We concatenate the columns RsWeeks and RsMonths to create a Series:
my_series = pd.concat([df["RsWeeks"], df["RsMonths"]], ignore_index=True)
print("\nSeries preview:")
print(my_series)
Output:
Series preview:
0 2015-11-10
1 2015-11-11
2 2016-08-01
3 2015-12-30
Edit
If you really need to add the new Series as a column to your DataFrame, you can do the following:
df2 = pd.DataFrame({"Mega": my_series})
df = pd.concat([df, df2], axis=1)
print("\nDataFrame preview:")
print(df)
Output:
DataFrame preview:
RsWeeks RsMonths Mega
0 2015-11-10 2016-08-01 2015-11-10
1 2015-11-11 2015-12-30 2015-11-11
2 NaN NaN 2016-08-01
3 NaN NaN 2015-12-30
Data:
df = pd.DataFrame({"name" : 'Dav Las Oms'.split(),
'age' : [25, 50, 70]})
df['Name'] = list(['a', 'M', 'm'])
df:
name age Name
0 Dav 25 a
1 Las 50 M
2 Oms 70 m
df = pd.DataFrame(df.astype(str).apply('|'.join, axis=1))
df:
0
0 Dav|25|a
1 Las|50|M
2 Oms|70|m
You can use pd.melt(), which reshapes your dataframe from wide to long (id_vars should list your identifier columns; the names here are placeholders):
df_reshaped = pd.melt(df, id_vars = ['id_1','id_2','id_3'], var_name = 'new_name', value_name = 'Mega')
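As a concrete, minimal sketch with the asker's six date columns (sample values invented for illustration; no id columns assumed):
import pandas as pd

# hypothetical one-row sample with the six date columns from the question
df = pd.DataFrame({
    "RsWeeks": ["2015-11-10"], "RsMonths": ["2016-08-01"], "RsDays": ["2015-11-12"],
    "PsWeeks": ["2015-12-01"], "PsMonths": ["2016-01-01"], "PsDays": ["2015-12-02"],
})
date_cols = ["RsWeeks", "RsMonths", "RsDays", "PsWeeks", "PsMonths", "PsDays"]
# with no id_vars, every listed column is melted into a single 'Mega' column
df_long = df.melt(value_vars=date_cols, var_name="source", value_name="Mega")
print(df_long)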
There is a huge dataframe containing multiple data types in different columns. I want to find rows that contain date values in different columns.
Here is a test dataframe:
import numpy as np
import pandas as pd
from datetime import datetime

dt = pd.Series(['abc', datetime.now(), 12, '', None, np.nan, '2020-05-05'])
dt1 = pd.Series([3, datetime.now(), 'sam', '', np.nan, 'abc-123', '2020-05-25'])
dt3 = pd.Series([1,2,3,4,5,6,7])
df = pd.DataFrame({"A":dt.values, "B":dt1.values, "C":dt3.values})
Now, I want to create a new dataframe that contains only rows with dates in both columns A and B; here, the 2nd and last rows.
Expected output:
A B C
1 2020-06-01 16:58:17.274311 2020-06-01 17:13:20.391394 2
6 2020-05-05 2020-05-25 7
What is the best way to do that? Thanks.
P.S. Dates can be in any standard format.
Use:
m = df[['A', 'B']].transform(pd.to_datetime, errors='coerce').isna().any(axis=1)
df = df[~m]
Result:
# print(df)
A B C
1 2020-06-01 17:54:16.377722 2020-06-01 17:54:16.378432 2
6 2020-05-05 2020-05-25 7
A solution that tests only the A and B columns is boolean indexing with DataFrame.notna and DataFrame.all, so that rows containing any non-datetime value do not match:
df = df[df[['A','B']].apply(pd.to_datetime, errors='coerce').notna().all(axis=1)]
print (df)
A B C
1 2020-06-01 16:14:35.020855 2020-06-01 16:14:35.021855 2
6 2020-05-05 2020-05-25 7
import numpy as np
import pandas as pd
from datetime import datetime
dt = pd.Series(['abc', datetime.now(), 12, '', None, np.nan, '2020-05-05'])
dt1 = pd.Series([3, datetime.now(), 'sam', '', np.nan, 'abc-123', '2020-05-25'])
dt3 = pd.Series([1,2,3,4,5,6,7])
df = pd.DataFrame({"A":dt.values, "B":dt1.values, "C":dt3.values})
m = pd.concat([pd.to_datetime(df['A'], errors='coerce'),
pd.to_datetime(df['B'], errors='coerce')], axis=1).isna().all(axis=1)
print(df[~m])
Prints:
A B C
1 2020-06-01 12:17:51.320286 2020-06-01 12:17:51.320826 2
6 2020-05-05 2020-05-25 7
I have a question about flattening or collapsing a dataframe from several column groups in one row, keyed by a code, into several rows that each repeat the key column with one group's data. Suppose a dataframe is something like this:
df = pd.DataFrame({'CODE': ['AA', 'BB', 'CC'],
'START_1': ['1990-01-01', '2000-01-01', '2005-01-01'],
'END_1': ['1990-02-14', '2000-03-01', '2005-12-31'],
'MEANING_1': ['SOMETHING', 'OR', 'OTHER'],
'START_2': ['1990-02-15', None, '2006-01-01'],
'END_2': ['1990-06-14', None, '2006-12-31'],
'MEANING_2': ['ELSE', None, 'ANOTHER']})
CODE START_1 END_1 MEANING_1 START_2 END_2 MEANING_2
0 AA 1990-01-01 1990-02-14 SOMETHING 1990-02-15 1990-06-14 ELSE
1 BB 2000-01-01 2000-03-01 OR None None None
2 CC 2005-01-01 2005-12-31 OTHER 2006-01-01 2006-12-31 ANOTHER
and I need to get it into a form somewhat like this:
CODE START END MEANING
0 AA 1990-01-01 1990-02-14 SOMETHING
1 AA 1990-02-15 1990-06-14 ELSE
2 BB 2000-01-01 2000-03-01 OR
3 CC 2005-01-01 2005-12-31 OTHER
4 CC 2006-01-01 2006-12-31 ANOTHER
I have a solution as follows:
df_a = df[['CODE', 'START_1', 'END_1', 'MEANING_1']]
df_b = df[['CODE', 'START_2', 'END_2', 'MEANING_2']]
df_a = df_a.rename(index=str, columns={'CODE': 'CODE',
'START_1': 'START',
'END_1': 'END',
'MEANING_1': 'MEANING'})
df_b = df_b.rename(index=str, columns={'CODE': 'CODE',
'START_2': 'START',
'END_2': 'END',
'MEANING_2': 'MEANING'})
df = pd.concat([df_a, df_b], ignore_index=True)
df = df.dropna(axis=0, how='any')
Which yields the desired result. Of course this doesn't seem very pythonic and is clearly not ideal if you have more than 2 column groups which need to be collapsed (I actually have 6 in my real code). I've examined the groupby(), melt() and stack() methods but haven't really found them to be very useful yet. Any suggestions would be appreciated.
Use pd.wide_to_long:
pd.wide_to_long(df, stubnames=['END', 'MEANING', 'START'],
                i='CODE', j='Number', sep='_', suffix=r'\d+')
Output:
END MEANING START
CODE Number
AA 1 1990-02-14 SOMETHING 1990-01-01
BB 1 2000-03-01 OR 2000-01-01
CC 1 2005-12-31 OTHER 2005-01-01
AA 2 1990-06-14 ELSE 1990-02-15
BB 2 None None None
CC 2 2006-12-31 ANOTHER 2006-01-01
Then we can drop the Number column/index and dropna if you wish, e.g. df.reset_index().drop(columns='Number').dropna().
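Putting it together, a minimal sketch of the full chain for the desired flat output:
result = (pd.wide_to_long(df, stubnames=['START', 'END', 'MEANING'],
                          i='CODE', j='Number', sep='_', suffix=r'\d+')
            .reset_index()
            .drop(columns='Number')
            .dropna()
            .sort_values(['CODE', 'START'])
            .reset_index(drop=True))
print(result)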
This is what melt can achieve:
df1=df.melt('CODE')
df1[['New','New2']]=df1.variable.str.split('_',expand=True)
df1.set_index(['CODE','New2','New']).value.unstack()
Out[492]:
New END MEANING START
CODE New2
AA 1 1990-02-14 SOMETHING 1990-01-01
2 1990-06-14 ELSE 1990-02-15
BB 1 2000-03-01 OR 2000-01-01
2 None None None
CC 1 2005-12-31 OTHER 2005-01-01
2 2006-12-31 ANOTHER 2006-01-01
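If you want the flat shape from the question, a small follow-up sketch (continuing from df1 above):
out = (df1.set_index(['CODE', 'New2', 'New']).value.unstack()
          .reset_index()          # CODE and the group number back as columns
          .drop(columns='New2')   # the group number is no longer needed
          .dropna())              # drop the all-None BB group-2 row
print(out)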
Here is one way. This is similar to your logic; I have just optimised it a little and cleaned up the code so that you only have to maintain common_cols, var_cols and data_count.
common_cols = ['CODE']
var_cols = ['START', 'END', 'MEANING']
data_count = 2
dfs = {i: df[common_cols + [k+'_'+str(int(i)) for k in var_cols]].\
rename(columns=lambda x: x.split('_')[0]) for i in range(1, data_count+1)}
pd.concat(list(dfs.values()), ignore_index=True)
# CODE START END MEANING
# 0 AA 1990-01-01 1990-02-14 SOMETHING
# 1 BB 2000-01-01 2000-03-01 OR
# 2 CC 2005-01-01 2005-12-31 OTHER
# 3 AA 1990-02-15 1990-06-14 ELSE
# 4 BB None None None
# 5 CC 2006-01-01 2006-12-31 ANOTHER
Here is another way:
# collapse the _1/_2 suffixes so duplicated column names mark the groups
df.columns = [i[0] for i in df.columns.str.split('_')]
df = df.T
cond = df.index.duplicated()
# put the first and second occurrence of each label side by side, then transpose back
concat_df = pd.concat([df[~cond], df[cond]], axis=1).T
# sorting by START pushes the all-None row to the end, where iloc[:-1] drops it
sort_df = concat_df.sort_values('START').iloc[:-1]
sort_df['CODE'] = sort_df['CODE'].ffill()
This should also work.
df = df.set_index("CODE")
# the following line gets rid of the _x suffix so both column groups share the same names
df.columns = [str(x)[:-2] for x in df.columns]
# each group is 3 consecutive columns (START, END, MEANING); stack the groups
pd.concat([df.iloc[:, i*3:(i+1)*3] for i in range(2)])
The method to remove suffix is taken from Remove last two characters from column names of all the columns in Dataframe - Pandas
It should be easy to extend the method to more than 2 column groups; say 6, as the OP had:
pd.concat([df.iloc[:, i*3:(i+1)*3] for i in range(6)])
I have a dataframe with let's say 2 columns: dates and doubles
2017-05-01 2.5
2017-05-02 3.5
... ...
2017-05-17 0.2
2017-05-18 2.5
Now I would like to do a groupby and sum with x rows. So i.e. with 6 rows it would return:
2017-05-06 15.6
2017-05-12 13.4
2017-05-18 18.0
Is there a clean solution to do this without running it through a for-loop with something like this:
temp = pd.DataFrame()
for i in range(0, len(df.index), 6):
    temp[df.iloc[i]['date']] = df.iloc[i:i+6]['value'].sum()
I guess you are looking for resample. Consider this dataframe:
import numpy as np
import pandas as pd

rng = pd.date_range('2017-05-01', periods=18, freq='D')
num = np.random.randint(5, size=18)
df = pd.DataFrame({'date': rng, 'val': num})
df.resample('6D', on = 'date').sum().reset_index()
will return
date val
0 2017-05-01 14
1 2017-05-07 11
2 2017-05-13 16
This is an alternative solution: group by an integer range over the length of the dataframe.
Two columns, using apply (in recent pandas, agg with a dict-returning lambda no longer works):
df.groupby(np.arange(len(df)) // 6).apply(
    lambda x: pd.Series({'date': x.date.iloc[0], 'value': x.value.sum()}))
For multiple columns you can use first (or last) for the date and sum for the other columns.
group = df.groupby(np.arange(len(df))//6)
pd.concat((group['date'].first(),
group[[c for c in df.columns if c != 'date']].sum()), axis=1)
I have a time-series that is not recognized as a DatetimeIndex despite being indexed by standard YYYY-MM-DD strings with valid dates. Coercing them to a valid DatetimeIndex seems to be inelegant enough to make me think I'm doing something wrong.
I read in (someone else's lazily formatted) data that contains invalid datetime values and remove these invalid observations.
In [1]: df = pd.read_csv('data.csv',index_col=0)
In [2]: print df['2008-02-27':'2008-03-02']
Out[2]:
count
2008-02-27 20
2008-02-28 0
2008-02-29 27
2008-02-30 0
2008-02-31 0
2008-03-01 0
2008-03-02 17
In [3]: def clean_timestamps(df):
# remove invalid dates like '2008-02-30' and '2009-04-31'
to_drop = list()
for d in df.index:
try:
datetime.date(int(d[0:4]),int(d[5:7]),int(d[8:10]))
except ValueError:
to_drop.append(d)
df2 = df.drop(to_drop,axis=0)
return df2
In [4]: df2 = clean_timestamps(df)
In [5]: print df2['2008-02-27':'2008-03-02']
Out[5]:
count
2008-02-27 20
2008-02-28 0
2008-02-29 27
2008-03-01 0
2008-03-02 17
This new index is still only recognized as a 'object' dtype rather than a DatetimeIndex.
In [6]: df2.index
Out[6]: Index([2008-01-01, 2008-01-02, 2008-01-03, ..., 2012-11-27, 2012-11-28,
2012-11-29], dtype=object)
Reindexing produces NaNs because they're different dtypes.
In [7]: i = pd.date_range(start=min(df2.index),end=max(df2.index))
In [8]: df3 = df2.reindex(index=i,columns=['count'])
In [9]: df3['2008-02-27':'2008-03-02']
Out[9]:
count
2008-02-27 NaN
2008-02-28 NaN
2008-02-29 NaN
2008-03-01 NaN
2008-03-02 NaN
I create a fresh dataframe with the appropriate index, drop the data to a dictionary, then populate the new dataframe based on the dictionary values (skipping missing values).
In [10]: df3 = pd.DataFrame(columns=['count'],index=i)
In [11]: values = dict(df2['count'])
In [12]: for d in i:
try:
df3.set_value(index=d,col='count',value=values[d.isoformat()[0:10]])
except KeyError:
pass
In [13]: print df3['2008-02-27':'2008-03-02']
Out[13]:
count
2008-02-27 20
2008-02-28 0
2008-02-29 27
2008-03-01 0
2008-03-02 17
In [14]: df3.index
Out[14]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2008-01-01 00:00:00, ..., 2012-11-29 00:00:00]
Length: 1795, Freq: D, Timezone: None
This last part of setting values based on lookups to a dictionary keyed by strings seems especially hacky and makes me think I've missed something important.
You could use pd.to_datetime:
In [1]: import pandas as pd
In [2]: pd.to_datetime('2008-02-27')
Out[2]: datetime.datetime(2008, 2, 27, 0, 0)
This allows you to "clean" the index (or similarly a column) by applying it to the Series:
df.index = pd.to_datetime(df.index)
or
df['date_col'] = df['date_col'].apply(pd.to_datetime)
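In current pandas, the asker's entire cleanup can be collapsed into a few lines with errors='coerce'; a minimal sketch (file name from the question) that replaces both the manual validation loop and the dictionary workaround:
import pandas as pd

df = pd.read_csv('data.csv', index_col=0)
df.index = pd.to_datetime(df.index, errors='coerce')   # invalid dates like 2008-02-30 become NaT
df = df[df.index.notna()]                              # drop the rows whose dates failed to parse
full_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
df = df.reindex(full_range)                            # index dtypes now match, so values are kept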