is there a good clean way to hard code a pandas dataframe into python code (e.g. a .py file)?
I don't want to store it in a separate CSV (I want the script file to be able to run on its own), and the dataframe is not very big. I also want it to be clear in the code what the data is, and easy to modify.
For example:
import pandas as pd

cols = ['val1', 'val2', 'val3']
rows = ['red', 'blue', 'green', 'orange', 'pink']
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0],
        [10.0, 11.0, 12.0], [13.0, 14.0, 15.0]]
pd.DataFrame(data, index=rows, columns=cols)
This works OK, but if you want to modify, say, green's val2, it's not easy to immediately find the right value. Slightly better (in some ways):
cols = ['val1', 'val2', 'val3']
rows = ['red', 'blue', 'green', 'orange', 'pink']
data = [
    # val1   val2   val3
    [  1.0,   2.0,   3.0],  # red
    [  4.0,   5.0,   6.0],  # blue
    [  7.0,   8.0,   9.0],  # green
    [ 10.0,  11.0,  12.0],  # orange
    [ 13.0,  14.0,  15.0],  # pink
]
pd.DataFrame(data, index=rows, columns=cols)
but this requires a lot of manual formatting, or writing a separate dataframe printer, and is ugly and hackish.
Use pd.read_csv based on a string literal:
try:
    from io import StringIO  # Python 3
except ImportError:
    from StringIO import StringIO  # Python 2
import pandas as pd
TESTDATA = u"""\
val1, val2, val3, color
1.0, 2.0, 3.0, red
4.0, 5.0, 6.0, blue
7.0, 8.0, 9.0, green
10.0, 11.0, 12.0, orange
13.0, 14.0, 15.0, pink
"""
df = pd.read_csv(StringIO(TESTDATA), index_col=-1, sep=r",\s*", engine='python')
print(df)
# prints:
# val1 val2 val3
# color
# red 1.0 2.0 3.0
# blue 4.0 5.0 6.0
# green 7.0 8.0 9.0
# orange 10.0 11.0 12.0
# pink 13.0 14.0 15.0
The inclusion of \s* in sep means you then have the option to pretty-format your data with whitespace. Since you say the dataframe is not very big, why not do that, for the sake of readability? But if you're averse to manually aligning things even for a small dataframe, you can remove the spaces and paste raw CSV content into TESTDATA. Then you can drop the \s* from sep and remove engine='python' (the latter is only there to suppress the warning triggered by using a regular expression as sep), as in the sketch below.
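A minimal sketch of that compact variant (here I use index_col='color' by name, an assumption on my part, rather than the positional index_col=-1 above):

from io import StringIO
import pandas as pd

# Raw CSV with no padding: a plain "," separator works and the
# default C engine is used, so engine='python' is unnecessary.
TESTDATA = u"""\
val1,val2,val3,color
1.0,2.0,3.0,red
4.0,5.0,6.0,blue
7.0,8.0,9.0,green
10.0,11.0,12.0,orange
13.0,14.0,15.0,pink
"""
df = pd.read_csv(StringIO(TESTDATA), index_col='color')
print(df)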
An even better version, which allows you to use the print(df) output itself as the input, without manual editing, would be:
try:
    from io import StringIO  # Python 3
except ImportError:
    from StringIO import StringIO  # Python 2
import pandas as pd
TESTDATA = u"""\
val1 val2 val3
color
red 1.0 2.0 3.0
blue 4.0 5.0 6.0
green 7.0 8.0 9.0
orange 10.0 11.0 12.0
pink 13.0 14.0 15.0
"""
df = pd.read_csv(StringIO(TESTDATA), index_col=0, sep=r"\s+", engine='python')
print(df)
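If you already have the frame in memory, the text to paste between the triple quotes can be generated directly (a small sketch; df.to_string() is the explicit equivalent of print(df)):

# Produce the block to paste into TESTDATA from an existing frame.
print(df.to_string())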
To provide a complete answer based on our comments:
from io import StringIO
import pandas as pd

data = """
col1,col2,col3
a,b,c
d,e,f
"""
s = StringIO(data)
df = pd.read_csv(s)
result:
col1 col2 col3
0 a b c
1 d e f
I am comparing dataframes with pandas. I want to distinguish the compared dataframe columns by naming them, so I'm using the result_names parameter from the pandas documentation, but it raises: TypeError: DataFrame.compare() got an unexpected keyword argument 'result_names'.
Here is the code, which is simply the one suggested in the documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    columns=["col1", "col2", "col3"],
)
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
df.compare(df2, result_names=("left", "right"))
Any ideas why?
You need pandas ≥ 1.5; the result_names parameter was added in pandas 1.5.0.
For earlier versions, you can instead rename the level:
df.compare(df2).rename({'self': 'left', 'other': 'right'}, axis=1, level=1)
output:
col1 col3
left right left right
0 a c NaN NaN
2 NaN NaN 3.0 4.0
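If the same script has to run on both older and newer pandas, a small helper can branch on the installed version. This is just a sketch (compare_named is a hypothetical name, not a pandas API), using the packaging library for the version comparison:

import pandas as pd
from packaging.version import Version

def compare_named(df, other, names=("left", "right")):
    # result_names exists only on pandas >= 1.5.
    if Version(pd.__version__) >= Version("1.5"):
        return df.compare(other, result_names=names)
    # On older versions, rename the second column level instead.
    renamed = dict(zip(("self", "other"), names))
    return df.compare(other).rename(columns=renamed, level=1)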
I have a Pandas series sf:
email
email1#email.com [1.0, 0.0, 0.0]
email2#email.com [2.0, 0.0, 0.0]
email3#email.com [1.0, 0.0, 0.0]
email4#email.com [4.0, 0.0, 0.0]
email5#email.com [1.0, 0.0, 3.0]
email6#email.com [1.0, 5.0, 0.0]
And I would like to transform it to the following DataFrame:
index | email | list
_____________________________________________
0 | email1#email.com | [1.0, 0.0, 0.0]
1 | email2#email.com | [2.0, 0.0, 0.0]
2 | email3#email.com | [1.0, 0.0, 0.0]
3 | email4#email.com | [4.0, 0.0, 0.0]
4 | email5#email.com | [1.0, 0.0, 3.0]
5 | email6#email.com | [1.0, 5.0, 0.0]
I found a way to do it, but I doubt it's the most efficient one:
df1 = pd.DataFrame(data=sf.index, columns=['email'])
df2 = pd.DataFrame(data=sf.values, columns=['list'])
df = pd.merge(df1, df2, left_index=True, right_index=True)
Rather than creating two temporary dataframes, you can just pass the index and values in a dict to the DataFrame constructor:
pd.DataFrame({'email':sf.index, 'list':sf.values})
There are lots of ways to construct a DataFrame; see the docs. An equivalent one-liner built on reset_index is sketched below.
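A sketch of that reset_index route, reconstructing a small Series like the question's (index of emails, values of lists):

import pandas as pd

sf = pd.Series([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
               index=['email1#email.com', 'email2#email.com'])

# Name the values and the index, then promote the index to a column.
df = sf.rename('list').rename_axis('email').reset_index()
print(df)
# df now has two columns: 'email' and 'list'.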
to_frame():
Starting with the following Series, df:
email
email1#email.com    A
email2#email.com    B
email3#email.com    C
email4#email.com    D
dtype: object
I use to_frame to convert the Series to a DataFrame:
df = df.to_frame().reset_index()
email 0
0 email1#email.com A
1 email2#email.com B
2 email3#email.com C
3 email4#email.com D
Now all you need to do is rename the column and give the index a name:
df = df.rename(columns= {0: 'list'})
df.index.name = 'index'
Your DataFrame is ready for further analysis.
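The same three steps can also be chained in one expression (a sketch, reconstructing a Series like the one above):

import pandas as pd

s = pd.Series(['A', 'B', 'C', 'D'],
              index=['email1#email.com', 'email2#email.com',
                     'email3#email.com', 'email4#email.com']).rename_axis('email')

# Name the values, promote the index to a column, then name the row index.
df = s.rename('list').reset_index().rename_axis('index')
print(df)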
A one-line answer would be:
myseries.to_frame(name='my_column_name')
Or
myseries.reset_index(drop=True, inplace=True) # As needed
Series.reset_index with name argument
Often the use case comes up where a Series needs to be promoted to a DataFrame. But if the Series has no name, reset_index will result in something like this:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c']).rename_axis('A')
s
A
a 1
b 2
c 3
dtype: int64
s.reset_index()
A 0
0 a 1
1 b 2
2 c 3
Here the column name is "0". We can fix this by specifying a name parameter:
s.reset_index(name='B')
A B
0 a 1
1 b 2
2 c 3
Or, to match this question's desired column name:
s.reset_index(name='list')
A list
0 a 1
1 b 2
2 c 3
Series.to_frame
If you want to create a DataFrame without promoting the index to a column, use Series.to_frame, as suggested in this answer. This also supports a name parameter.
s.to_frame(name='B')
B
A
a 1
b 2
c 3
pd.DataFrame Constructor
You can also do the same thing as Series.to_frame by specifying a columns param:
pd.DataFrame(s, columns=['B'])
B
A
a 1
b 2
c 3
A super simple way is also:
df = pd.DataFrame(series)
It will return a DataFrame with one column (the series values), keeping the series' original index (it is not reset to 0...n).
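A quick sketch to illustrate that the index is preserved:

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='vals')
df = pd.DataFrame(s)
print(df)
#    vals
# a    10
# b    20
# c    30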
Series.to_frame can be used to convert a Series to DataFrame.
# The provided name ('columnName') replaces the series' own name
df = series.to_frame('columnName')
For example,
s = pd.Series(["a", "b", "c"], name="vals")
df = s.to_frame('newCol')
print(df)
newCol
0 a
1 b
2 c
This would probably be graded as a non-Pythonic way to do it, but it gives the result you want in one line (using the question's series sf, and naming the columns explicitly, since zip alone would produce columns 0 and 1):
new_df = pd.DataFrame(zip(sf.index, sf.values), columns=['email', 'list'])
Result:
              email             list
0  email1#email.com  [1.0, 0.0, 0.0]
1  email2#email.com  [2.0, 0.0, 0.0]
2  email3#email.com  [1.0, 0.0, 0.0]
3  email4#email.com  [4.0, 0.0, 0.0]
4  email5#email.com  [1.0, 0.0, 3.0]
5  email6#email.com  [1.0, 5.0, 0.0]
This is a small example of a bigger dataset. Imagine I have two dataframes like these:
import pandas as pd
import numpy as np
np.random.seed(42)
df1 = pd.DataFrame({'Depth': np.arange(0.5, 4.5, 0.5),
                    'Feat1': np.random.randint(20, 70, 8)})
df2 = pd.DataFrame({'Depth': [0.4, 1.1, 1.5, 2.2, 2.8],
                    'Rock': ['Sand', 'Sand', 'Clay', 'Clay', 'Marl']})
They have different sizes, and I would like to put the information from the 'Rock' column of df2 into df1 as a new column. This combination should be done based on the 'Depth' columns of the two dataframes, but they have different sampling rates: df1 follows a constant step of 0.5, while the spacing in df2 varies.
So I would like to merge this information based on approximate values of 'Depth'. For example: if a sample of df2 has a 'Depth' of 2.2, look for the nearest 'Depth' value in df1, which should be 2.0, and add the 'Rock' information ('Clay') to that sample. It is important to note that 'Rock' values can be repeated in the new column, to avoid missing data within a segment. Could anyone help me?
I already tried some pandas methods, such as merge and combine_first, but I couldn't get the result I wanted. It should look like the output shown below.
Use merge_asof:
df3 = pd.merge_asof(df1, df2, on='Depth', tolerance=0.5, direction='nearest')
df3:
Depth Feat1 Rock
0 0.5 58 Sand
1 1.0 48 Sand
2 1.5 34 Clay
3 2.0 62 Clay
4 2.5 27 Clay
5 3.0 40 Marl
6 3.5 58 NaN
7 4.0 38 NaN
Complete Working Example:
import numpy as np
import pandas as pd
np.random.seed(42)
df1 = pd.DataFrame({
'Depth': np.arange(0.5, 4.5, 0.5),
'Feat1': np.random.randint(20, 70, 8)
})
df2 = pd.DataFrame({
'Depth': [0.4, 1.1, 1.5, 2.2, 2.8],
'Rock': ['Sand', 'Sand', 'Clay', 'Clay', 'Marl']
})
df3 = pd.merge_asof(df1, df2, on='Depth', tolerance=0.5, direction='nearest')
print(df3)
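For comparison, a sketch continuing from the example above: merge_asof's default direction='backward' matches the closest df2 Depth that is less than or equal to each df1 Depth (within the tolerance), rather than the nearest value overall, which is why direction='nearest' is used here.

# Continuing from the example above: the default 'backward' matching
# would, e.g., pair Depth 1.0 with nothing (0.4 is more than 0.5 away),
# whereas 'nearest' pairs it with the 1.1 'Sand' row.
df4 = pd.merge_asof(df1, df2, on='Depth', tolerance=0.5)
print(df4)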
I have the following sample DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame({'Tom': [2, np.nan, np.nan],
                   'Ron': [np.nan, 5, np.nan],
                   'Jim': [np.nan, np.nan, 6],
                   'Mat': [7, np.nan, np.nan]},
                  index=['Min', 'Max', 'Avg'])
that looks like this, where each column has only one non-null value:
Tom Ron Jim Mat
Min 2.0 NaN NaN 7.0
Max NaN 5.0 NaN NaN
Avg NaN NaN 6.0 NaN
Desired Outcome
For each column, I want to have the non-null value and then append the index of the corresponding non-null value to the name of the column. So the final result should look like this
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
My attempt
Using list comprehensions: find the non-null value, append the corresponding index to the column name, and then create a new DataFrame:
values = [df[col][~pd.isna(df[col])].values[0] for col in df.columns]
# [2.0, 5.0, 6.0, 7.0]
new_cols = [col + '_{}'.format(df[col][~pd.isna(df[col])].index[0]) for col in df.columns]
# ['Tom_Min', 'Ron_Max', 'Jim_Avg', 'Mat_Min']
df_new = pd.DataFrame([values], columns=new_cols)
My question
Is there some in-built functionality in pandas which can do this without using for loops and list comprehensions?
If there is only one non-missing value per column, you can use DataFrame.stack, convert the resulting Series to a one-row DataFrame, and then flatten the MultiIndex columns; for the correct column order, use DataFrame.swaplevel with DataFrame.reindex:
df = df.stack().to_frame().T.swaplevel(1,0, axis=1).reindex(df.columns, level=0, axis=1)
df.columns = df.columns.map('_'.join)
print (df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
Use:
s = df.T.stack()
s.index = s.index.map('_'.join)
df = s.to_frame().T
Result:
# print(df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
I have the following DataFrame, with columns Field, Count, and Seq (reproduced in the answer below).
I need to add one more column, 'Key', to the existing DataFrame.
Is there a way to create the 'Key' column based on the columns Field and Seq?
Here is one solution.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Field': ['Indicator', 'A', 'B', 'Code', '1', '2', '3', 'Name', 'Address'],
                   'Count': [26785, 785, 26000, 12345, 45, 300, 12000, 12312, 1212],
                   'Seq': [1.0, 1.1, 1.1, 2.0, 2.1, 2.1, 2.1, 3.0, 4.0]})

sep = df.loc[df['Seq'].apply(lambda x: x == int(x)), 'Field'].tolist()
df['key'] = pd.Series(np.where(~df['Field'].isin(sep), None, df['Field'])).ffill()
df.loc[df['Field'] != df['key'], 'key'] += '+' + df['Field']
#        Field  Count  Seq          key
# 0  Indicator  26785  1.0    Indicator
# 1          A    785  1.1  Indicator+A
# 2          B  26000  1.1  Indicator+B
# 3       Code  12345  2.0         Code
# 4          1     45  2.1       Code+1
# 5          2    300  2.1       Code+2
# 6          3  12000  2.1       Code+3
# 7       Name  12312  3.0         Name
# 8    Address   1212  4.0      Address
Explanation
Add a 'key' column and replace values not in sep with None, then use ffill() to fill the None values.
Update 'key' column only where 'Field' and 'key' are misaligned.
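For reference, here is an equivalent sketch of the same logic using Series.where instead of np.where; 'parent' rows are those whose Seq has no fractional part:

import pandas as pd

df = pd.DataFrame({'Field': ['Indicator', 'A', 'B', 'Code', '1', '2', '3', 'Name', 'Address'],
                   'Count': [26785, 785, 26000, 12345, 45, 300, 12000, 12312, 1212],
                   'Seq': [1.0, 1.1, 1.1, 2.0, 2.1, 2.1, 2.1, 3.0, 4.0]})

# Keep Field only on parent rows (whole-number Seq), forward-fill the gaps.
is_parent = df['Seq'] == df['Seq'].astype(int)
parent = df['Field'].where(is_parent).ffill()

# Parents keep their own name; children get "parent+child".
df['key'] = parent.where(is_parent, parent + '+' + df['Field'])
print(df)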