Pandas convert columns to rows - python

I am a beginner in Data Science and I am trying to pivot this data frame using Pandas:
So it becomes something like this (the labels should become the columns and the file paths the rows):
I tried this code, which gave me an error:
EDIT:
I have tried Marcel's suggestion, and the output it gave is this:
The "label" column is a group or class of file paths. I want to convert it in such a way that it fits tf.keras.preprocessing.image.flow_from_dataframe with class_mode "categorical".
Thanks in advance to all for helping me out.

I did not understand your question very well, but if you just want to convert columns to rows, then you can do
train_df.T
which means transpose.
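For illustration, a minimal sketch of what .T does; train_df here is a made-up stand-in for the asker's frame:
import pandas as pd

# Hypothetical stand-in for train_df: two columns, three rows.
train_df = pd.DataFrame({'labels': ['a', 'b', 'c'], 'pathes': ['p1', 'p2', 'p3']})

# .T swaps rows and columns: the former column names become the row index.
print(train_df.T)
#          0   1   2
# labels   a   b   c
# pathes  p1  p2  p3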

I think you are looking for something like this:
import pandas as pd

# Toy frame: each label groups several paths.
df = pd.DataFrame({
    'labels': ['a', 'a', 'a', 'b', 'b'],
    'pathes': [1, 2, 3, 4, 5]
})

labels = df['labels'].unique()
new_cols = []
for label in labels:
    # Keep only the paths belonging to this label, drop the rest,
    # and renumber the rows so the pieces line up side by side.
    new_cols.append(df['pathes'].where(df['labels'] == label).dropna().reset_index(drop=True))
df_final = pd.concat(new_cols, axis=1)
print(df_final)
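A more concise variant of the same idea, sketched under the assumption of the column names above (this is not part of the original answer): group by the label and concatenate, letting the dict keys name each resulting column after its label.
import pandas as pd

df = pd.DataFrame({'labels': ['a', 'a', 'a', 'b', 'b'], 'pathes': [1, 2, 3, 4, 5]})

# One column per label, named after the label itself;
# shorter groups are padded with NaN.
df_final = pd.concat(
    {label: group['pathes'].reset_index(drop=True)
     for label, group in df.groupby('labels')},
    axis=1)
print(df_final)
#    a    b
# 0  1  4.0
# 1  2  5.0
# 2  3  NaN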

I've found what was wrong: I misunderstood y_col and x_col in tf.keras.preprocessing.image.ImageDataGenerator.flow_from_dataframe. Thanks to all of you for your contributions. Your answers are all correct in different ways. Thanks again Marcel h and user16714199!
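For future readers, a minimal sketch of the call the resolution refers to, assuming a frame with 'pathes' and 'labels' columns as in the answers above; the file paths, rescale factor, and target size are all placeholder choices:
import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Hypothetical frame: one row per image, with its path and class label.
train_df = pd.DataFrame({'pathes': ['img/a1.png', 'img/b1.png'],
                         'labels': ['a', 'b']})

datagen = ImageDataGenerator(rescale=1./255)
generator = datagen.flow_from_dataframe(
    dataframe=train_df,
    x_col='pathes',            # column holding the file paths
    y_col='labels',            # column holding the class labels
    class_mode='categorical',  # one-hot encoded targets
    target_size=(224, 224))    # placeholder image size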

Related

Connection between two dataframes created by dividing one dataframe in two

I'm sorry if my title is confusing, but I wasn't sure how to describe the situation I'm currently trying to understand. Basically, I stumbled upon this question while working with the train_test_split procedure from the sklearn module.
So let me show you an example of what has been confusing me for a couple of hours already.
Let's create a simple dataframe with 3 columns:
'Letter' - a letter from the alphabet;
'Number' - the serial number of the letter;
'Type' - the type of the number (odd or even).
import pandas as pd
data = [['A', 1, 'Odd'], ['B', 2, 'Even'], ['C', 3, 'Odd'],
        ['D', 4, 'Even'], ['E', 5, 'Odd'], ['F', 6, 'Even'], ['G', 7, 'Odd']]
df = pd.DataFrame(data, columns=['Letter', 'Number', 'Type'])
We can create 4 samples to work with using train_test_split:
from sklearn.model_selection import train_test_split
target = df['Type']
features = df.drop('Type', axis=1)
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.4, random_state=12)
And now if we want to see the rows of features_train with the odd numbers we can write the following code:
features_odds = features_train[target_train == 'Odd']
features_odds
And we get this:
Output
And there we have it: the new dataframe contains exactly the rows with the odd numbers.
How does that work, when features_train can get the info from target_train even though they are two separate objects?
I think there should be an easy answer, but for some reason I'm not able to understand the mechanics of this right now.
I have also tried a different approach (not using train_test_split), and it works just as well:
target_dummy = df['Type']
features_dummy = df.drop('Type', axis=1)
features_dumb_odds = features_dummy[target_dummy == 'Odd']
features_dumb_odds
Would appreciate any help in understanding this a lot!
target_train == 'Odd' is a Series of boolean values. As a Series, it also has an index. That index is used to align with the features_train you index into, and it's compatible.
As a first step of exploration, start with print(target_train == 'Odd').
It's good to think about how the pieces fit together. In this case, the boolean Series and the object you index into need to have matching indexes, or pandas raises an exception.
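Here is a minimal sketch of that alignment on a small frame of my own (not the asker's data): the boolean mask keeps its original row labels, and pandas matches those labels against the frame being indexed.
import pandas as pd

df = pd.DataFrame({'Letter': ['A', 'B', 'C'], 'Number': [1, 2, 3]})
target = pd.Series(['Odd', 'Even', 'Odd'], index=df.index)

# The comparison produces a boolean Series with the same index labels.
mask = target == 'Odd'
print(mask)
# 0     True
# 1    False
# 2     True
# dtype: bool

# Indexing aligns on those labels, so df and mask can be separate objects.
print(df[mask])
#   Letter  Number
# 0      A       1
# 2      C       3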

Changing column values for a value in an adjacent column in the same dataframe using Python

I am quite new to Python programming.
I am working with the following dataframe:
Before
Note that in column "FBgn" there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with the FBgn values provided in the adjacent column called "## FlyBase_FBgn", while keeping the existing FBgn values in column "FBgn". Keep in mind that I am showing only a portion of the dataframe (in reality it has 1432 rows). How would I do that? I tried the replace() method from Pandas, but it did not work.
This is actually what I would like to have:
After
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
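A quick sketch of that one-liner on a made-up frame (the column names follow the question; the values are invented):
import pandas as pd

df = pd.DataFrame({
    'FBgn': ['FBtr001', 'FBgn002', 'FBtr003'],
    '## FlyBase_FBgn': ['FBgn101', 'FBgn102', 'FBgn103']})

# Wherever "FBgn" holds an FBtr value, overwrite it with the value
# from "## FlyBase_FBgn"; the assignment aligns rows on the index.
df.loc[df['FBgn'].str.contains('FBtr'), 'FBgn'] = df['## FlyBase_FBgn']
print(df)
#       FBgn ## FlyBase_FBgn
# 0  FBgn101         FBgn101
# 1  FBgn002         FBgn102
# 2  FBgn103         FBgn103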
Welcome to Stack Overflow. Next time, please provide more info, including your code; it is always helpful.
Please see the code below; I think you need something similar.
import pandas as pd

# Ignore dict1, I just wanted to recreate your df.
dict1 = {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'],
         "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1)  # recreating your dataframe
print(df)

# Function to replace the values.
def replace_values(df):
    for i in range(len(df)):
        if 'tr' in df.loc[i, 'FBgn']:
            # .loc writes directly into the frame and avoids
            # chained-assignment warnings.
            df.loc[i, 'FBgn'] = df.loc[i, 'FBtr']
    return df

df = replace_values(df)
print(df)

Rename column values using pandas DataFrame

In one of the columns in my dataframe I have five values:
1, G, 2, 3, 4
How can I change all the "G" values to 1?
I tried:
df = df['col_name'].replace({'G': 1})
I also tried:
df = df['col_name'].replace('G',1)
"G" is in fact 1 (I do not know why there is a mixed naming)
Edit:
works correctly with:
df['col_name'] = df['col_name'].replace({'G': 1})
If I am understanding your question correctly, you are trying to change the values in a column, not the column name itself.
Given that you have mixed data types there, I assume that column is of type object and thus the numbers are read as strings.
df['col_name'] = df['col_name'].str.replace('G', '1')
You could try the following line
df.replace('G', 1, inplace=True)
Use numpy:
import numpy as np
df['a'] = np.where(df.a == 'G', 1, df.a)
You can try this. Let's say your data is like:
ab = pd.DataFrame({'a': [1, 2, 3, 'G', 5]})
Then you can replace it as:
ab1 = ab.replace('G', 4)

Is there a way to allow NaN values to be written to CSV from pandas?

I have the following error: builtins.AssertionError: 12 columns passed, passed data had 6 columns. The last 6 columns will vary data-wise, so I'm happy to have None in the areas where the data is missing. However, I can't seem to find a simple way to do this; I'm pretty sure there must be an option for it, but I can't see it in the docs or in any Google searches.
Any help would be appreciated. I would like to reiterate that I know what is causing the problem and I know data is missing from the columns. I would like to ignore the missing data and am happy to have None or NaN in the output CSV.
I imagine you have fixed headers, so my solution would be to extend each row respectively:
import pandas as pd
import numpy as np

columns = ('Person', 'Title', 'AnotherPerson', 'AnotherPerson2', 'AnotherPerson3', 'AnotherPerson4', 'Date', 'Group')
mandatory = len(columns)

# Rows of varying length.
data = [[1, 2, 3], [1, 2], [1, 2, 3, 4]]
# Turn each row into a {position: value} mapping ...
data = list(map(lambda x: dict(enumerate(x)), data))
# ... then pad every row to the full width, filling gaps with NaN.
data = [[item.get(i, np.nan) for i in range(mandatory)] for item in data]
df = pd.DataFrame(data=data, columns=columns)
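From there, writing the frame out is straightforward; to_csv leaves NaN cells empty by default, and its na_rep parameter substitutes a placeholder string instead (the filename here is made up):
# NaN cells become empty fields by default; na_rep writes 'NaN' instead.
df.to_csv('output.csv', index=False, na_rep='NaN')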

Pandas column access w/column names containing spaces

If I import or create a pandas column that contains no spaces, I can access it as such:
from pandas import DataFrame
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df1.data1
which would return that series for me. If, however, that column has a space in its name, it isn't accessible via that method:
from pandas import DataFrame
df2 = DataFrame({'key': ['a', 'b', 'd'],
                 'data 2': range(3)})
df2.data 2 # <--- not the droid I'm looking for.
I know I can access it using .xs():
df2.xs('data 2', axis=1)
There's got to be another way. I've googled it like mad and can't think of any other way to google it. I've read all 96 entries here on SO that contain "column", "string", and "pandas", and could find no previous answer. Is this the only way, or is there something better?
Old post, but it may be interesting: one idea (which is destructive, but does the job if you want it quick and dirty) is to rename the columns using underscores:
df1.columns = [c.replace(' ', '_') for c in df1.columns]
I think the default way is to use the bracket method instead of the dot notation.
import pandas as pd
df1 = pd.DataFrame({
    'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
    'dat a1': range(7)
})
df1['dat a1']
The other methods, like exposing the column as an attribute, are more for convenience.
If you would like to supply a spaced column name to a pandas method such as assign, you can wrap your inputs in a dictionary and unpack it:
df.assign(**{'space column': (lambda x: x['space column2'])})
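Spelled out on a small frame of my own (the column 'space column2' is assumed from the snippet above):
import pandas as pd

df = pd.DataFrame({'space column2': [1, 2, 3]})
# **{...} passes the spaced name as a keyword argument that assign accepts.
df = df.assign(**{'space column': lambda x: x['space column2'] * 2})
print(df)
#    space column2  space column
# 0              1             2
# 1              2             4
# 2              3             6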
While the accepted answer works for column specification when using dictionaries or []-selection, it does not generalise to other situations where one needs to refer to columns, such as the assign method:
> df.assign("data 2" = lambda x: x.sum(axis=1))
SyntaxError: keyword can't be an expression
You can do it with df['Column Name']
If you want to apply filtering, that's also possible with column names containing spaces, e.g. filtering for NULL values or empty strings:
df_package[(df_package['Country_Region Code'].notnull()) |
           (df_package['Country_Region Code'] != u'')]
as I figured out thanks to Rutger Kassies' answer.
