I have a pandas DataFrame df, and let's say I wanted to share df with you guys here so that you could easily recreate it in your own notebook.
Is there a command or function that will generate the DataFrame's create statement? I realize that for a lot of data the statement would be quite large, since it must include the actual data, so just the first few rows (the head) would be ideal.
Essentially, a command that I can run on df and get something like this:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
or
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
I'm not sure how to even phrase this question. Like taking a dataframe and deconstructing the top 5 rows or something?
We usually use read_clipboard:
pd.read_clipboard()
Out[328]:
   col1  col2
0     1     3
1     2     4
Or, if you have the df, save it into a dict so that we can easily convert it back to the sample we need:
df.head(5).to_dict()
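To illustrate, a minimal sketch of the round trip (assuming a small sample frame): the default orient of to_dict() keeps the index as the inner keys, so the frame can be rebuilt exactly.

```python
import pandas as pd

# A small sample frame standing in for the real df
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# to_dict() produces a plain dict you can paste into a question
d = df.head(5).to_dict()
# {'col1': {0: 1, 1: 2}, 'col2': {0: 3, 1: 4}}

# Anyone can rebuild the frame from that dict
df2 = pd.DataFrame(d)
```

The inner dicts map index label to value, so even a non-default index survives the round trip.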
Related
I have three dataframes that have the same format, and I want to simply add the three respective values on top of each other, so that df_new = df1 + df2 + df3. The new df would have the same number of rows and columns as each old df.
But doing so only appends the columns. I have searched through the docs and there is a lot on merging etc but nothing on adding values. I suppose there must be a one liner for such a basic operation?
A possible solution is the following:
# pip install pandas
import pandas as pd

# set up test dataframes with the same structure but different values
df1 = pd.DataFrame({"col1": [1, 1, 1], "col2": [1, 1, 1]})
df2 = pd.DataFrame({"col1": [2, 2, 2], "col2": [2, 2, 2]})
df3 = pd.DataFrame({"col1": [3, 3, 3], "col2": [3, 3, 3]})

# element-wise addition: values are added position by position,
# with indexes and columns aligned automatically
df_new = df1 + df2 + df3
df_new
Returns
   col1  col2
0     6     6
1     6     6
2     6     6
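One caveat worth knowing: element-wise + aligns on the index, so any label missing from one frame produces NaN. A sketch using add() with fill_value, assuming missing cells should count as 0:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 1, 1]})             # index 0..2
df2 = pd.DataFrame({"col1": [2, 2]}, index=[0, 1])  # shorter index

# Plain + leaves NaN where a label is missing from one frame
plus = df1 + df2                       # row 2 becomes NaN

# add() with fill_value treats the missing cell as 0 instead
summed = df1.add(df2, fill_value=0)    # row 2 keeps the value from df1
```

If the three frames are guaranteed to share the same index and columns, plain + is all you need.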
I recently heard about the pandas read_clipboard() function, and it has been super useful for quickly using DataFrames from SO questions. The problem is that the DataFrame must be the last thing copied. Is there a way to, for example, print a DataFrame in a form that can be used to hardcode a new DataFrame? I'll try to make this a bit clearer:
Say I find this DataFrame somewhere:
a b
0 1 4
1 2 5
2 3 6
So I can copy this DataFrame and then import it in my code like this
df = pd.read_clipboard()
But when I run this script later, I have to make sure the DataFrame is still the last thing I copied. What I'm looking for is a function (print_to_reuse()) that does something like this:
df.print_to_reuse()
out: {'a': [1, 2, 3], 'b': [4, 5, 6]}
Now I could copy this output and hardcode the definition of df as
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
In this way, it doesn't matter when I rerun my code or what the last thing I copied was.
I can think of a method that does the same thing, but it seems like there should be an easier approach. I could export the copied DataFrame as a CSV and then later import that CSV like this:
df = pd.read_clipboard()
df.to_csv("path")
And then
df = pd.read_csv("path")
use_df()
So basically, is there a way to do this that doesn't require making a new csv?
Thank you.
I think you are looking for:
df.to_dict("list")
Which will give:
{'a': [1, 2, 3], 'b': [4, 5, 6]}
which you can use to recreate the DataFrame the next time you run the script, instead of calling read_clipboard().
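A quick sketch of that workflow: the "list" orient gives one list per column, which is compact and easy to paste straight back into a DataFrame constructor.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# orient="list" gives one list per column
d = df.to_dict("list")
# {'a': [1, 2, 3], 'b': [4, 5, 6]}

# Paste the printed dict into your script to hardcode the frame
df2 = pd.DataFrame(d)
```

Note that this orient discards the index, so it is best suited to frames with a default RangeIndex.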
Is there a way to sort a dataframe by a combination of different columns? That is, if specific columns match among rows, those rows will be clustered together. An example is below. Any help is greatly appreciated!
Original DataFrame
Transformed DataFrame
One way to sort a pandas dataframe is to use .sort_values().
The code below replicates your sample dataframe:
df = pd.DataFrame({'v1': [1, 3, 2, 1, 4, 3],
                   'v2': [2, 2, 4, 2, 3, 2],
                   'v3': [3, 3, 2, 3, 2, 3],
                   'v4': [4, 5, 1, 4, 2, 5]})
Using the code below, you can sort the dataframe by both columns v1 and v2. In this case, v2 is only used to break ties.
df.sort_values(by=['v1', 'v2'], ascending=True)
"by" parameter here is not limited to any number of variables, so could extend the list to include more variables in desired order.
The following best matches the sort pattern shown in your image.
import pandas as pd

df = pd.DataFrame(dict(
    v1=[1, 3, 2, 1, 4, 3],
    v2=[2, 2, 4, 2, 3, 2],
    v3=[3, 3, 2, 3, 2, 3],
    v4=[4, 5, 1, 4, 2, 5],
))

# Make a temp column to sort the df by (this concatenates the string form of
# each row's values -- fine here because every value is a single digit)
df['sort'] = df.astype(str).values.sum(axis=1)

# Sort the df by that column, drop it and reset the index
df = df.sort_values(by='sort').drop(columns='sort').reset_index(drop=True)
print(df)
Link you can refer to: code in Python Tutor
Edit: Zolzaya Luvsandorj's recommendation is better:
import pandas as pd

df = pd.DataFrame(dict(
    v1=[1, 3, 2, 1, 4, 3],
    v2=[2, 2, 4, 2, 3, 2],
    v3=[3, 3, 2, 3, 2, 3],
    v4=[4, 5, 1, 4, 2, 5],
))

df = df.sort_values(by=list(df.columns)).reset_index(drop=True)
print(df)
Link you can refer to: better code in Python Tutor
I am modifying a dataframe within a function but I do not want it to change the global variable.
I use two different ways to change my dataframe, and they affect my global variable differently. The first method, adding a new column by assigning to a non-existent column, modifies the global dataframe. With concatenation of a new column, the global dataframe remains unchanged.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

def mutation(data):
    data['d'] = [1, 2, 3]

mutation(df)
print(df)
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

def mutation(data):
    data = pd.concat([data, pd.DataFrame([1, 2, 3], columns=['d'])], axis=1)

mutation(df)
print(df)
I expect that when I print df after calling the function, I see columns a, b and c. But the first method also shows column d.
When you pass the data object to the function, you are actually passing its reference to the function. So when you do in-place mutations on the object it points to, you can see these mutations outside of the function as well.
If you want to keep your original data un-mutated, pass a clone of the original data frame as follows:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

def mutation(data):
    data['d'] = [1, 2, 3]

mutation(df.copy())
print(df)
Output:
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
The function operated on the clone, so the original data frame is unmodified.
The second example you've done is not an in-place operation on the original data frame: it instead creates a new data frame and rebinds the local name data to it, leaving the object outside the function untouched. So in the second example, your original DF is not modified.
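A common defensive pattern, sketched below, is to copy inside the function and return the result, so callers never need to remember to call .copy() themselves:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

def mutation(data):
    data = data.copy()       # work on a private copy
    data['d'] = [1, 2, 3]
    return data              # hand the modified frame back

df_new = mutation(df)
# df keeps columns a, b, c; df_new additionally has d
```

The copy costs memory proportional to the frame, so for very large data you may prefer documenting that the function mutates its argument instead.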
Ideally, I want to be able to do something like:
cols = ['A', 'B', 'C']
df = pandas.DataFrame(index=range(5), columns=cols)
df.get_column(cols[0]) = [1, 2, 3, 4, 5]
What is the pythonic/pandonic way to do this?
Edit: I know that I can access the column 'A' by df.A, but in general I do not know what the column names are.
You do not need to store what columns a DataFrame has separately.
You can find out what columns exist in a pandas DataFrame by accessing the DataFrame.columns attribute.
To access the Series attached to a particular column, you can use the DataFrame's [] indexing (its __getitem__ method).
Tiny example:
col = df.columns[0]
df[col] = [1, 2, 3, 4, 5]
Okay, this is particularly straightforward.
df[cols[0]] = [1, 2, 3, 4, 5]
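To round this out, a short sketch covering both ways of addressing a column when you only know its position, not its name: look the name up via .columns, or skip names entirely with .iloc.

```python
import pandas as pd

cols = ['A', 'B', 'C']
df = pd.DataFrame(index=range(5), columns=cols)

# Assign by looked-up name
df[df.columns[0]] = [1, 2, 3, 4, 5]

# Or assign purely by position with iloc
df.iloc[:, 1] = [10, 20, 30, 40, 50]
```

df[name] = ... also creates the column if it doesn't exist yet, whereas .iloc can only assign into a column that is already there.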