Pandas: get string from specific column header - python

Context:
i am new in Pandas and I need a function that creates new columns based on existing columns. The new columns name will have the name from the original column plus new characters (example: create a "As NEW" column from "As" column).
Can i access the old column header string to make the name of the new column?
Problem:
i have df['columnA'] and need to get "columnA" string

If I understand you correctly, this may be what you're looking for.
You can use str.contains() for the columns, then use string formatting to create the new column name.
df = pd.DataFrame({'col1':['A', 'A', 'B','B'], 'As': ['B','B','C','C'], 'col2': ['C','C','A','A'], 'col3': [30,10,14,91]})
col = df.columns[df.columns.str.contains('As')]
df['%s New' % col[0]] = 'foo'
print (df)
As col1 col2 col3 As New
0 B A C 30 foo
1 B A C 10 foo
2 C B A 14 foo
3 C B A 91 foo

Assuming that you have an empty DataFrame df with columns, you could access the columns of df as a list with:
>>> df.columns
Index(['columnA', 'columnB'], dtype='object')
.columns will allow you to overwrite the columns of df, but you don't need to pass in another Index. You can pass it a regular list, like so:
>>> df.columns = ['columna', 'columnb']
>>> df
Empty DataFrame
Columns: [columna, columnb]
Index: []

This can be done through the columns attribute.
cols = df.columns
# Do whatever operation you want on the list of strings in cols
df.columns = cols

Related

Error when creating a data frame in Python: ValueError [duplicate]

This may be a simple question, but I can not figure out how to do this. Lets say that I have two variables as follows.
a = 2
b = 3
I want to construct a DataFrame from this:
df2 = pd.DataFrame({'A':a,'B':b})
This generates an error:
ValueError: If using all scalar values, you must pass an index
I tried this also:
df2 = (pd.DataFrame({'a':a,'b':b})).reset_index()
This gives the same error message.
The error message says that if you're passing scalar values, you have to pass an index. So you can either not use scalar values for the columns -- e.g. use a list:
>>> df = pd.DataFrame({'A': [a], 'B': [b]})
>>> df
A B
0 2 3
or use scalar values and pass an index:
>>> df = pd.DataFrame({'A': a, 'B': b}, index=[0])
>>> df
A B
0 2 3
You may try wrapping your dictionary into a list:
my_dict = {'A':1,'B':2}
pd.DataFrame([my_dict])
A B
0 1 2
You can also use pd.DataFrame.from_records which is more convenient when you already have the dictionary in hand:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }])
You can also set index, if you want, by:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }], index='A')
You need to create a pandas series first. The second step is to convert the pandas series to pandas dataframe.
import pandas as pd
data = {'a': 1, 'b': 2}
pd.Series(data).to_frame()
You can even provide a column name.
pd.Series(data).to_frame('ColumnName')
Maybe Series would provide all the functions you need:
pd.Series({'A':a,'B':b})
DataFrame can be thought of as a collection of Series hence you can :
Concatenate multiple Series into one data frame (as described here )
Add a Series variable into existing data frame ( example here )
Pandas magic at work. All logic is out.
The error message "ValueError: If using all scalar values, you must pass an index" Says you must pass an index.
This does not necessarily mean passing an index makes pandas do what you want it to do
When you pass an index, pandas will treat your dictionary keys as column names and the values as what the column should contain for each of the values in the index.
a = 2
b = 3
df2 = pd.DataFrame({'A':a,'B':b}, index=[1])
A B
1 2 3
Passing a larger index:
df2 = pd.DataFrame({'A':a,'B':b}, index=[1, 2, 3, 4])
A B
1 2 3
2 2 3
3 2 3
4 2 3
An index is usually automatically generated by a dataframe when none is given. However, pandas does not know how many rows of 2 and 3 you want. You can however be more explicit about it
df2 = pd.DataFrame({'A':[a]*4,'B':[b]*4})
df2
A B
0 2 3
1 2 3
2 2 3
3 2 3
The default index is 0 based though.
I would recommend always passing a dictionary of lists to the dataframe constructor when creating dataframes. It's easier to read for other developers. Pandas has a lot of caveats, don't make other developers have to experts in all of them in order to read your code.
You could try:
df2 = pd.DataFrame.from_dict({'a':a,'b':b}, orient = 'index')
From the documentation on the 'orient' argument: If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
I usually use the following to to quickly create a small table from dicts.
Let's say you have a dict where the keys are filenames and the values their corresponding filesizes, you could use the following code to put it into a DataFrame (notice the .items() call on the dict):
files = {'A.txt':12, 'B.txt':34, 'C.txt':56, 'D.txt':78}
filesFrame = pd.DataFrame(files.items(), columns=['filename','size'])
print(filesFrame)
filename size
0 A.txt 12
1 B.txt 34
2 C.txt 56
3 D.txt 78
You need to provide iterables as the values for the Pandas DataFrame columns:
df2 = pd.DataFrame({'A':[a],'B':[b]})
I had the same problem with numpy arrays and the solution is to flatten them:
data = {
'b': array1.flatten(),
'a': array2.flatten(),
}
df = pd.DataFrame(data)
import pandas as pd
a=2
b=3
dict = {'A': a, 'B': b}
pd.DataFrame(pd.Series(dict)).T
# *T :transforms the dataframe*
Result:
A B
0 2 3
To figure out the "ValueError" understand DataFrame and "scalar values" is needed.
To create a Dataframe from dict, at least one Array is needed.
IMO, array itself is indexed.
Therefore, if there is an array-like value there is no need to specify index.
e.g. The index of each element in ['a', 's', 'd', 'f'] are 0,1,2,3 separately.
df_array_like = pd.DataFrame({
'col' : 10086,
'col_2' : True,
'col_3' : "'at least one array'",
'col_4' : ['one array is arbitrary length', 'multi arrays should be the same length']})
print("df_array_like: \n", df_array_like)
Output:
df_array_like:
col col_2 col_3 col_4
0 10086 True 'at least one array' one array is arbitrary length
1 10086 True 'at least one array' multi arrays should be the same length
As shows in the output, the index of the DataFrame is 0 and 1.
Coincidently same with the index of the array ['one array is arbitrary length', 'multi arrays should be the same length']
If comment out the 'col_4', it will raise
ValueError("If using all scalar values, you must pass an index")
Cause scalar value (integer, bool, and string) does not have index
Note that Index(...) must be called with a collection of some kind
Since index used to locate all the rows of DataFrame
index should be an array. e.g.
df_scalar_value = pd.DataFrame({
'col' : 10086,
'col_2' : True,
'col_3' : "'at least one array'"
}, index = ['fst_row','snd_row','third_row'])
print("df_scalar_value: \n", df_scalar_value)
Output:
df_scalar_value:
col col_2 col_3
fst_row 10086 True 'at least one array'
snd_row 10086 True 'at least one array'
third_row 10086 True 'at least one array'
I'm a beginner, I'm learning python and English. 👀
I tried transpose() and it worked.
Downside: You create a new object.
testdict1 = {'key1':'val1','key2':'val2','key3':'val3','key4':'val4'}
df = pd.DataFrame.from_dict(data=testdict1,orient='index')
print(df)
print(f'ID for DataFrame before Transpose: {id(df)}\n')
df = df.transpose()
print(df)
print(f'ID for DataFrame after Transpose: {id(df)}')
Output
0
key1 val1
key2 val2
key3 val3
key4 val4
ID for DataFrame before Transpose: 1932797100424
key1 key2 key3 key4
0 val1 val2 val3 val4
ID for DataFrame after Transpose: 1932797125448
​```
the input does not have to be a list of records - it can be a single dictionary as well:
pd.DataFrame.from_records({'a':1,'b':2}, index=[0])
a b
0 1 2
Which seems to be equivalent to:
pd.DataFrame({'a':1,'b':2}, index=[0])
a b
0 1 2
This is because a DataFrame has two intuitive dimensions - the columns and the rows.
You are only specifying the columns using the dictionary keys.
If you only want to specify one dimensional data, use a Series!
If you intend to convert a dictionary of scalars, you have to include an index:
import pandas as pd
alphabets = {'A': 'a', 'B': 'b'}
index = [0]
alphabets_df = pd.DataFrame(alphabets, index=index)
print(alphabets_df)
Although index is not required for a dictionary of lists, the same idea can be expanded to a dictionary of lists:
planets = {'planet': ['earth', 'mars', 'jupiter'], 'length_of_day': ['1', '1.03', '0.414']}
index = [0, 1, 2]
planets_df = pd.DataFrame(planets, index=index)
print(planets_df)
Of course, for the dictionary of lists, you can build the dataframe without an index:
planets_df = pd.DataFrame(planets)
print(planets_df)
Change your 'a' and 'b' values to a list, as follows:
a = [2]
b = [3]
then execute the same code as follows:
df2 = pd.DataFrame({'A':a,'B':b})
df2
and you'll get:
A B
0 2 3
simplest options ls :
dict = {'A':a,'B':b}
df = pd.DataFrame(dict, index = np.arange(1) )
Another option is to convert the scalars into list on the fly using Dictionary Comprehension:
df = pd.DataFrame(data={k: [v] for k, v in mydict.items()})
The expression {...} creates a new dict whose values is a list of 1 element. such as :
In [20]: mydict
Out[20]: {'a': 1, 'b': 2}
In [21]: mydict2 = { k: [v] for k, v in mydict.items()}
In [22]: mydict2
Out[22]: {'a': [1], 'b': [2]}
Convert Dictionary to Data Frame
col_dict_df = pd.Series(col_dict).to_frame('new_col').reset_index()
Give new name to Column
col_dict_df.columns = ['col1', 'col2']
You could try this:
df2 = pd.DataFrame.from_dict({'a':a,'b':b}, orient = 'index')
If you have a dictionary you can turn it into a pandas data frame with the following line of code:
pd.DataFrame({"key": d.keys(), "value": d.values()})
Just pass the dict on a list:
a = 2
b = 3
df2 = pd.DataFrame([{'A':a,'B':b}])

Fill in empty header with previous column name - pandas

I have a dataframe where each second column name is skipped:
eg
Step_1.
The idea is to fill unnamed columns with previous name to get:
Step_2.
To sum up "in" and "out" in each class, to get final result like this
The intermediary Step_1 is important and cannot be skipped to get the final result.
I appreciate any help and apologize for not being clear enough when asking question at the first attempt.
Thank you
Idea is convert columns to Series, so possible replace missing values instead values starting by Unnamed with forward filling:
df.columns = df.columns.to_series().mask(lambda x: x.str.startswith('Unnamed')).ffill()
print (df)
Column_1 Column_1 Column_2 Column_2
0 a d f g
EDIT:
If missing values in index:
df.columns = df.columns.to_series().ffill()
MultiIndex solution is necessary, if second row is header too - first use header=[0,1] for MultiIndex:
import pandas as pd
temp=u"""Column_1;Unnamed_column;Column_2;Unnamed_column
a;d;f;g
1;5;5;6
7;8;9;4"""
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), sep=";", header=[0,1])
print (df)
Column_1 Unnamed_column Column_2 Unnamed_column
a d f g
0 1 5 5 6
1 7 8 9 4
a = df.columns.get_level_values(0)
b = df.columns.get_level_values(1)
df.columns = [a.to_series().mask(lambda x: x.str.startswith('Unnamed')).ffill(), b]
print (df)
Column_1 Column_2
a d f g
0 1 5 5 6
1 7 8 9 4
I tried this,
t = pd.DataFrame(df.columns)
t.loc[t[0].str.startswith('Unnamed: '),0] = np.NaN
t[0].bfill(inplace=True)
df.columns = t[0].values
Create temp dataframe with column of original dataframe. apply ffill or bfill as per your wish. assign back the values again to original dataframe.
You can rewrite the df.index with a list comprehension.
from itertools import chain
df = pd.DataFrame(
{"Column_1": [1], "Unnamed_column1": [2], "Column_2": [3], "Unnamed_column2": [4]})
cols = [[c, c] for c in df.columns[::2]]
df.columns = [_ for _ in chain(*cols)]
Having said that it might be better to assign unique names to columns as they will be used keys/indices, i.e .
cols = [[c, c+"_new"] for c in df.columns[::2]]

How to perform multiple pandas data type changes on different columns with one function?

I have 41 columns in a dataframe, out of those 22 I want to change the data type to 'str' except 1 column I want to change to 'float'.
Currently, I am doing this line of code to change individual columns to the datatype str or float, now doing this to 20 other columns:
df.active = df.active.astype(str)
df.total_spent = df.total_spent.astype(float)
How do I write a function that takes the columns I want to make into string and the one column above that I want as a float?
Let me know if you would like the list of columns, I thought it would be too much for now.
Thank you in advance.
I believe need seelct columns by list of names or by positions and converting:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
If want select columns by names:
c = ['B', 'C']
df[c] = df[c].astype(str)
If want select columns by positions:
p = [1,2]
df.iloc[:, p] = df.iloc[:, p].astype(str)
print (df.dtypes)
A object
B object
C object
D int64
E int64
F object
dtype: object

Check if entry to panda dataframe is unique when index might be the same

I have a panda DataFrame that I want to add rows to. The Dataframe looks like this:
col1 col2
a 1 5
b 2 6
c 3 7
I want to add rows to the dataframe, but only if they are unique. The problem is that some new rows might have the same index, but different values in the columns. If this is the case, I somehow need to know.
Some example rows to be added and the desired result:
row 1:
col1 col2
a 1 5
desired row 1 result: Not added - it is already in the dataframe
row 2:
col1 col2
a 9 9
desired row 2 result: something like,
print('non-unique entries for index a')
row 3:
col1 col2
d 4 4
desired row 3 result: just add the row to the dataframe.
try this:
# existing dataframe == df
# new rows == df_newrows
# dividing newrows dataframe into two, one for repeated indexes, one without.
df_newrows_usable = df_newrows.loc[df_newrows.index.isin(list(df.index.get_values()))==False]
df_newrows_discarded = df_newrows.loc[df_newrows.index.isin(list(df.index.get_values()))]
print ('repeated indexes:', df_newrows_discarded)
# concat df and newrows without repeated indexes
new_df = pd.concat([df,df_newrows],0)
print ('new dataframe:', new_df)
the easy option would be to merge all rows and then keep the unique ones via the dataframe method drop_duplicates
However, this option doesn't report a warning / error when a duplicate row is appended.
drop_duplicates doesn't consider indexes, so the dataframe must be reset before dropping the duplicates, and set back after:
import pandas as pd
# set up data frame
df = pd.DataFrame({'col1': [1, 2, 3], 'col2':[5, 6, 7]}, index=['a', 'b', 'c'])
# set up row to be appended
row = pd.DataFrame({'col1':[3], 'col2': [7]}, index=['c'])
# append row (don't care if it's duplicate)
df = df.append([row])
# drop duplicatesdf2 = df2.reset_index()
df2 = df2.drop_duplicates()
df2 = df2.set_index('index')
if the warning message is an absolute requirement, we can write a function to that effect that checks if a row is duplicate via a merge operation and appends the row only if it is unique.
def append_unique(df, row):
d = df.reset_index()
r = row.reset_index()
if d.merge(r, on=list(d.columns), how='inner').empty:
d2 = d.append(r)
d2 = d2.set_index('index')
return d2
print('non-unique entries for index a')
return df
df2 = append_unique(df2, row)

Create label for two column in pandas

I have a pandas dataframe with two column of data. Now i want to make a label for two column, like the picture bellow:
Because two column donot have the same value so cant use groupby. I just only want add the label AAA like that. So, how to do it? Thank you
reassign to the columns attribute with an newly constructed pd.MultiIndex
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
Consider the dataframe df
df = pd.DataFrame(1, ['hostname', 'tmserver'], ['value', 'time'])
print(df)
value time
hostname 1 1
tmserver 1 1
Then
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
print(df)
AAA
value time
hostname 1 1
tmserver 1 1
If need create MultiIndex in columns, simpliest is:
df.columns = [['AAA'] * len(df.columns), df.columns]
It is similar as MultiIndex.from_arrays, also is possible add names parameter:
n = ['a','b']
df.columns = pd.MultiIndex.from_arrays([['AAA'] * len(df.columns), df.columns], names=n)

Categories

Resources