The picture shows that I have a DataFrame with the mentioned columns and data (screenshot omitted). How can I assign this data to the columns?
I tried different assignment operations, but they raise errors like "shape of passed values". I am expecting the data (array) values to be assigned to the columns.
I assume that the problem you face is that there are many columns in the DataFrame and the array contains values for some but not all of these columns; hence the shape error when you try to combine the two. What you need to do is define the column names for the values in the data array before combining. See the example code below, which builds a second DataFrame with the correct column names and then joins everything together.
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, 2.0],
                    'b': [3.0, 5.0],
                    'c': [4.0, 7.0]})

data = [1.1, 2.1]
names = ['a', 'b']

# build a one-row DataFrame with explicit column names, then append it
df2 = pd.DataFrame({key: val for key, val in zip(names, data)}, index=[0])
df3 = pd.concat([df1, df2]).reset_index(drop=True)
print(df3)
this produces
a b c
0 1.0 3.0 4.0
1 2.0 5.0 7.0
2 1.1 2.1 NaN
with NaN for the columns that were missing in the data to be added. You can change the NaN to any value you want using fillna.
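For instance, a minimal sketch of that step, assuming 0.0 is an acceptable placeholder:

df3['c'] = df3['c'].fillna(0.0)  # replace the NaN in column 'c' with 0.0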
Please refrain from adding code in the form of an image; it's hard to access. Here's an article explaining why.
Create a new dataframe (although you can choose to modify the existing one) with the column names and the data. I have inferred that your array is stored in a variable named data.
df_updated = pd.DataFrame(data, columns=df.columns)
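For illustration, a minimal sketch, assuming data is a 2D array holding one row with one value per column (the column names here are hypothetical):

import pandas as pd

data = [[1.1, 2.1, 3.1]]                    # assumed: one row, one value per column
df = pd.DataFrame(columns=['a', 'b', 'c'])  # hypothetical existing column layout
df_updated = pd.DataFrame(data, columns=df.columns)
print(df_updated)
#      a    b    c
# 0  1.1  2.1  3.1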
Note: Thanks to Timus for suggesting the removal of redundant code.
I did check for possible solutions, but the most common solutions didn't work.
df_woningen.groupby(['postcode'], dropna=True)['energy_ranking'].agg(pd.Series.mode)
Gives me multiple arrays in this format:
2611BA []
2611BB 4.0
2611BC [3.0, 6.0]
QUESTION: How to select the last item to use as value for a new column?
Background: one column has rankings. Per group I want to take the mode() and use it as the imputed value for the NaNs in that group.
In case of multiple modes I want to take the highest. Sometimes a group has only NaNs; in that case it can stay like that. If a group has 8 NaNs and 1 ranking '8', then the mode should be 8, disregarding the NaNs.
I am trying to create a new column by using code like this:
df_woningen.groupby(['postcode'], dropna=True)['energy_ranking'].agg(
    lambda x: pd.Series.mode(x)[0])
Or
df_woningen.groupby(['postcode'], dropna=True)['energy_ranking'].agg(lambda x:x.value_counts(dropna=True).index[0])
But I get errors, and I believe it's because of the different lengths of the arrays:
TypeError: 'function' object is not subscriptable
index 0 is out of bounds for axis 0 with size 0
Anyone an idea how to solve this?
Assuming this example:
df = pd.DataFrame({'group': list('AAABBC'), 'value': [1,1,2,1,2,float('nan')]})
s = df.groupby('group')['value'].agg(pd.Series.mode)
Input:
group
A 1.0
B [1.0, 2.0]
C []
Name: value, dtype: object
You can use the str accessor and fillna:
s.str[-1].fillna(s.mask(s.str.len().eq(0)))
# or for numbers
# s.str[-1].fillna(pd.to_numeric(s, errors='coerce'))
Output:
group
A 1.0
B 2.0
C NaN
Name: value, dtype: float64
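Building on this, a hedged sketch of the imputation step from the question, reusing s and df from this example (last_mode is a name introduced here):

# highest mode per group (NaN where a group had no mode at all)
last_mode = s.str[-1].fillna(pd.to_numeric(s, errors='coerce'))
# fill each row's missing value with its group's highest mode
df['value'] = df['value'].fillna(df['group'].map(last_mode))

Groups whose mode list was empty (all NaN) map to NaN, so they stay untouched, as the question requires.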
IIUC you can use a lambda function in conjunction with index -1 on each list to display the data you are looking for:
data = {
    'Column1': ['2611BA', '2611BB', '2611BC'],
    'Column2': [[], [4.0], [3.0, 6.0]]
}
df = pd.DataFrame(data)
df['Column3'] = df['Column2'].apply(lambda x: x[-1] if len(x) > 0 else '')
df
I have an empty pandas DataFrame. I want to append a value to one column at a time. I am trying to iterate through the columns using a for loop and append a value (5, for example). I wrote the code below, but it does not work. Any idea?
example:
df: ['a', 'b', 'c']
for column in df:
    df.append({column: 5}, ignore_index=True)
I want to implement this by iterating through the columns. The result should be:
df: ['a', 'b', 'c']
5 5 5
This sounds like a horrible idea, as it would become extremely inefficient as your df grows in size, and I'm almost certain there is a much better way to do this if you gave more context. But for the sake of answering the question, you could use the shape of the df to figure out the row, use the column name as the column, and use .at to manually assign the value.
Here we assign 3 values to the df, one column at a time.
import pandas as pd
df = pd.DataFrame({'a':[],'b':[],'c':[]})
values_to_add = [3,4,5]
for v in values_to_add:
    row = df.shape[0]
    for column in df.columns:
        df.at[row, column] = v
Output
a b c
0 3.0 3.0 3.0
1 4.0 4.0 4.0
2 5.0 5.0 5.0
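For the sake of contrast, a sketch of the kind of vectorized alternative hinted at above: collect the rows first and build the frame in one go (same variable names as in this example):

import pandas as pd

values_to_add = [3, 4, 5]
columns = ['a', 'b', 'c']
# one dict per row, the value repeated across all columns
rows = [{c: v for c in columns} for v in values_to_add]
df = pd.DataFrame(rows, columns=columns)

This avoids growing the frame cell by cell, which is what makes the .at loop slow at scale.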
Greetings everyone. I have an Excel file that I need to clean, filling NaN values according to each column's data type: if a column's data type is object I need to fill "NULL" into that column, and if the data type is integer or float, 0 needs to be filled into those columns.
So far I have tried two methods to do the job, but no luck. Here is the first:
df = pd.read_excel("myExcel_files.xlsx")
Using the built-in method for selecting columns by data type:
df.select_dtypes(include='int64').fillna(0, inplace=True)
df.select_dtypes(include='float64').fillna(0.0, inplace=True)
df.select_dtypes(include='object').fillna("NULL", inplace=True)
The output that I get is not an error but a warning, and there is no change in the data frame:
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py:4259: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
**kwargs
As the first one gave a slice warning, I thought of doing it one column at a time. Here is the code:
df = pd.read_excel("myExcel_files.xlsx")
#get the list of all integer columns
int_cols = list(df.select_dtypes('int64').columns)
#get the list of all float columns
float_cols = list(df.select_dtypes('float64').columns)
#get the list of all object columns
object_cols = list(df.select_dtypes('object').columns)
#looping through each column to fillna
for i in int_cols:
    df[i].fillna(0, inplace=True)
for f in float_cols:
    df[f].fillna(0, inplace=True)
for o in object_cols:
    df[o].fillna("NULL", inplace=True)
Neither of my methods works.
Many thanks for any help or suggestions.
Regards -Manish
I think that instead of using select_dtypes and iterating over columns, you can take the .dtypes of your DF and replace the float64s with 0.0 and the objects with "NULL". You don't need to worry about int64s, as they generally won't have missing values to fill (unless you're using pd.NA or a nullable int type), so you might be able to do it in a single operation:
df.fillna(df.dtypes.replace({'float64': 0.0, 'O': 'NULL'}), inplace=True)
You can also add downcast='infer', so that if you have what can be int64s in a float64 column, you end up with int64s. E.g., given:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2, np.nan, 4],
    'b': [np.nan, 'hello', np.nan, 'blah'],
    'c': [1.1, 1.2, 1.3, np.nan]
})
Then:
df.fillna(df.dtypes.replace({'float64': 0.0, 'O': 'NULL'}), downcast='infer', inplace=True)
Will give you (note column a was downcast to int but c remains float):
a b c
0 1 NULL 1.1
1 2 hello 1.2
2 0 NULL 1.3
3 4 blah 0.0
I have been searching the web for a simple method in python/pandas to get a dataframe consisting of only the unique rows and their basic stats (occurrences, mean, and so on) from an original dataframe.
So far my efforts came only halfway:
I found how to get all the unique rows using
data.drop_duplicates()
But then I'm not quite sure how I should easily retrieve all the stats I desire. I could do a for loop over a groupby object, but that would be rather slow.
Another approach that I thought of was using groupby and then describe, e.g.,
data.groupby(allColumns)[columnImInterestedInForStats].describe()
But it turns out that this, for 19 columns in allColumns, only returns me one row with no stats at all. Surprisingly, if I choose only a small subset for allColumns, I actually do get each unique combination of the subset and all their stats. My expectation was that if I fill in all 19 columns in groupby() I would get all unique groups?
Data example:
df = pd.DataFrame([[1.1, 1.1, 1.1, 2.6, 2.5, 3.4, 2.6, 2.6, 3.4, 3.4, 2.6, 1.1, 1.1, 3.3],
                   list('AAABBBBABCBDDD'),
                   ['1', '3', '3', '2', '4', '2', '5', '3', '6', '3', '5', '1', '1', '1']]).T
df.columns = ['col1','col2','col3']
Desired result:
col2 col3 mean count and so on
A 1 1.1 1
3 4.8 3
B 2 6.0 2
4 2.5 1
5 5.2 2
6 3.4 1
C 3 3.4 1
D 1 5.5 3
into a dataframe.
I'm sure it must be something very trivial that I'm missing, but I can't find the proper answer. Thanks in advance.
You can achieve the desired effect using agg():
import pandas as pd

df = pd.DataFrame([[1.1, 1.1, 1.1, 2.6, 2.5, 3.4, 2.6, 2.6, 3.4, 3.4, 2.6, 1.1, 1.1, 3.3],
                   list('AAABBBBABCBDDD'),
                   ['1', '3', '3', '2', '4', '2', '5', '3', '6', '3', '5', '1', '1', '1']]).T
df.columns = ['col1', 'col2', 'col3']
df['col1'] = df['col1'].astype(float)
df.groupby(['col2', 'col3'])['col1'].agg(['mean', 'count', 'max', 'min', 'median'])
In place of 'col1' in df.groupby you can place a list of the columns you are interested in.
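If you also want friendlier column names in the result, a small sketch using named aggregation (the output labels avg, n, and highest are arbitrary):

df.groupby(['col2', 'col3'])['col1'].agg(avg='mean', n='count', highest='max')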
What's the easiest way to add an empty column to a pandas DataFrame object? The best I've stumbled upon is something like
df['foo'] = df.apply(lambda _: '', axis=1)
Is there a less perverse method?
If I understand correctly, assignment should fill:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
A B
0 1 2
1 2 3
2 3 4
>>> df["C"] = ""
>>> df["D"] = np.nan
>>> df
A B C D
0 1 2 NaN
1 2 3 NaN
2 3 4 NaN
To add to DSM's answer and building on this associated question, I'd split the approach into two cases:
Adding a single column: Just assign empty values to the new columns, e.g. df['C'] = np.nan
Adding multiple columns: I'd suggest using the .reindex(columns=[...]) method of pandas to add the new columns to the dataframe's column index. This also works for adding multiple new rows with .reindex(index=[...]). Note that newer versions of pandas (v>0.20) allow you to specify an axis keyword rather than explicitly assigning to columns or rows.
Here is an example adding multiple columns:
mydf = mydf.reindex(columns = mydf.columns.tolist() + ['newcol1','newcol2'])
or
mydf = mydf.reindex(mydf.columns.tolist() + ['newcol1','newcol2'], axis=1) # version > 0.20.0
You can also always concatenate a new (empty) dataframe to the existing dataframe, but that doesn't feel as pythonic to me :)
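For completeness, a minimal sketch of that concat variant (the new column names are arbitrary):

df = pd.concat([df, pd.DataFrame(columns=['newcol1', 'newcol2'])], axis=1)  # new columns filled with NaN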
I like:
df['new'] = pd.Series(dtype='int')
# or use other dtypes like 'float', 'object', ...
If you have an empty dataframe, this solution makes sure that no new row containing only NaN is added.
Specifying dtype is not strictly necessary, however newer Pandas versions produce a DeprecationWarning if not specified.
An even simpler solution is:
df = df.reindex(columns = header_list)
where header_list is a list of the headers you want to appear.
Any header included in the list that is not already found in the dataframe will be added with blank cells below.
So if
header_list = ['a', 'b', 'c', 'd']
then c and d will be added as columns with blank cells.
Starting with v0.16.0, DF.assign() could be used to assign new columns (single/multiple) to a DF. These columns get inserted in alphabetical order at the end of the DF.
This becomes advantageous compared to simple assignment in cases wherein you want to perform a series of chained operations directly on the returned dataframe.
Consider the same DF sample demonstrated by #DSM:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
df
Out[18]:
A B
0 1 2
1 2 3
2 3 4
df.assign(C="",D=np.nan)
Out[21]:
A B C D
0 1 2 NaN
1 2 3 NaN
2 3 4 NaN
Note that this returns a copy with all the previous columns along with the newly created ones. In order for the original DF to be modified accordingly, use it like df = df.assign(...), as it does not currently support inplace operation.
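As a sketch of the chaining benefit mentioned above (the derived column and the filter are arbitrary examples):

(df.assign(C=df['A'] + df['B'])  # derive a new column...
   .query('C > 4'))              # ...and keep chaining on the returned frame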
If you want to add column names from a list:
df = pd.DataFrame()
a = ['col1', 'col2', 'col3', 'col4']
for i in a:
    df[i] = np.nan
df["C"] = ""
df["D"] = np.nan
Assignment will give you this SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
so it's better to use insert:
df.insert(loc, column, value)
#emunsing's answer is really cool for adding multiple columns, but I couldn't get it to work for me in Python 2.7. Instead, I found this works:
mydf = mydf.reindex(columns=np.append(mydf.columns.values, ['newcol1', 'newcol2']))
One can use df.insert(index_to_insert_at, column_header, init_value) to insert new column at a specific index.
cost_tbl.insert(1, "col_name", "")
The above statement would insert an empty column after the first column.
This will also work for multiple columns:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> df
A B
0 1 2
1 2 3
2 3 4
df1 = pd.DataFrame(columns=['C','D','E'])
df = df.join(df1, how="outer")
>>>df
A B C D E
0 1 2 NaN NaN NaN
1 2 3 NaN NaN NaN
2 3 4 NaN NaN NaN
Then do whatever you want to do with the columns:
pd.Series.fillna(), pd.Series.map(),
etc.
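For example, a tiny sketch of filling one of the new columns (the fill value 0 is an arbitrary choice):

df['C'] = df['C'].fillna(0)  # replace the NaNs in the joined-in column with 0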
The code below addresses the question "How do I add n number of empty columns to my existing dataframe?". In the interest of keeping solutions to similar problems in one place, I am adding it here.
Approach 1 (to create 64 additional columns with column names from 1-64)
m = list(range(1, 65))
dd = pd.DataFrame(columns=m)
df.join(dd).replace(np.nan, '')  # df is the dataframe that already exists
Approach 2 (to create 64 additional columns with column names from 1-64)
df.reindex(df.columns.tolist() + list(range(1, 65)), axis=1).replace(np.nan, '')
You can do
df['column'] = None #This works. This will create a new column with None type
df.column = None #This will work only when the column is already present in the dataframe
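One caveat worth sketching, since the two placeholders behave differently (the column names here are arbitrary):

df['col_none'] = None    # creates an object-dtype column holding None
df['col_nan'] = np.nan   # creates a float64 column holding NaN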
If you have a list of columns that you want to be empty, you can use assign with a dict comprehension and dict unpacking.
>>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
>>> nan_cols_name = ["C","D","whatever"]
>>> df.assign(**{col:np.nan for col in nan_cols_name})
A B C D whatever
0 1 2 NaN NaN NaN
1 2 3 NaN NaN NaN
2 3 4 NaN NaN NaN
You can also unpack multiple dicts into the dict that you unpack, if you want different values for different columns.
df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
nan_cols_name = ["C", "D", "whatever"]
empty_string_cols_name = ["E", "F", "bad column with space"]
df.assign(**{
    **{col: np.nan for col in nan_cols_name},
    **{col: "" for col in empty_string_cols_name}
})
Sorry, I did not explain my answer very well at the beginning. There is another way to add a new column to an existing dataframe:
1st step: make a new empty data frame (with all the columns in your data frame, plus the new column or columns you want to add), called df_temp.
2nd step: combine df_temp and your data frame.
df_temp = pd.DataFrame(columns=(df.columns.tolist() + ['empty']))
df = pd.concat([df_temp, df])
It might not be the best solution, but it is another way to think about this question.
The reason I am using this method is that I kept getting this warning:
: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["empty1"], df["empty2"] = [np.nan, ""]
Great, I found a way to disable the warning:
pd.options.mode.chained_assignment = None
The reason I was looking for such a solution was simply to add spaces between multiple DFs which had been joined column-wise using the pd.concat function and then written to Excel using xlsxwriter.
df[' '] = df.apply(lambda _: '', axis=1)
df_2 = pd.concat([df, df1], axis=1)  # worked, but only once
# Note: df & df1 have the same rows, which is my index
df_2[' '] = df_2.apply(lambda _: '', axis=1)  # didn't work this time!?
df_4 = pd.concat([df_2, df_3], axis=1)
I then replaced the second lambda call with
df_2[''] = ''  # which appears to add a blank column
df_4 = pd.concat([df_2,df_3],axis=1)
The output I tested it on was written to Excel using xlsxwriter.
The blank columns look the same in Jupyter as in Excel, although Jupyter doesn't have the xlsx formatting.
Not sure why the second lambda call didn't work.