I have a data frame df like:
age=14 gender=male loc=NY key=0012328434 Unnamed: 4
age=45 gender=female loc=CS key=834734hh43 pre="axe"
age=23 gender=female loc=CA key=545df35fdf NaN
..
..
age=65 gender=male loc=LA key=dfdf545dfg pre="cold"
And I need this df to have a header and remove the redundant data, like desired_df:
age gender loc key pre
14 male NY 0012328434 NaN
45 female CS 834734hh43 axe
23 female CA 545df35fdf NaN
..
..
65 male LA dfdf545dfg cold
what I tried to do:
df1 = df.str.split()
df_out = pd.DataFrame(df1.str[1::2].tolist(), columns=df1[0][0::2])
but this fails, clearly because I do not have column names to begin with. Any help would be really appreciated.
# df = pd.read_csv(r'xyz.csv', header=None)
# build a dict of {column: value} per row from the "key=value" pairs,
# then drop the filler 'NaN' column created by the fillna placeholder
df1 = (pd.DataFrame(df.fillna('NaN=NaN')
                      .apply(lambda x: dict(list(x.str.replace('"', '')
                                                  .str.split('='))), axis=1)
                      .to_list())
         .drop('NaN', axis=1))
age gender loc key pre
0 14 male NY 0012328434
1 45 female CS 834734hh43 axe
2 23 female CA 545df35fdf NaN
3 65 male LA dfdf545dfg cold
(Untested!)
headers = ['age', 'gender', 'loc', 'key', 'pre']
df.columns = headers
for name in df.columns:
    df[name] = df[name].str.removeprefix(f'{name}=')
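Note that Series.str.removeprefix requires pandas 1.4 or newer. A minimal sketch of the same idea for older versions, assuming the headers have already been assigned as above, using str.replace with an anchored regex:
for name in df.columns:
    # strip the leading 'name=' prefix (e.g. 'age=') with a pattern anchored at the start
    df[name] = df[name].str.replace(f'^{name}=', '', regex=True)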
I have a Pandas dataframe similar to this one:
age name sex
0 30 jon male
1 blue php null
2 18 jane female
3 orange c++ null
and I am trying to concatenate every second row to the previous one adding extra columns:
age name sex colour language other
0 30 jon male blue php null
1 18 jane female orange c++ null
I tried shift() but it was duplicating every row.
How can this be done?
You can create a new dataframe by slicing the dataframe using iloc with a step of 2:
cols = ['age', 'name', 'sex']
new_cols = ['colour', 'language', 'other']
d = dict()
for col, ncol in zip(cols, new_cols):
    d[col] = df[col].iloc[::2].values
    d[ncol] = df[col].iloc[1::2].values
pd.DataFrame(d)
Result:
age colour name language sex other
0 30 blue jon PHP male NaN
1 18 orange jane c++ female NaN
TRY:
df = pd.concat([df.iloc[::2].reset_index(drop=True),
                pd.DataFrame(df.iloc[1::2].values,
                             columns=['colour', 'language', 'other'])], axis=1)
OUTPUT:
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
Reshape the values and create a new dataframe
pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2),
             columns=['age', 'name', 'sex', 'colour', 'language', 'other'])
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
Let's say that I have this dataframe with four columns: "Name", "Value", "Ccy" and "Group":
import pandas as pd
Name = ['ID', 'Country', 'IBAN','Dan_Age', 'Dan_city', 'Dan_country', 'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country' ]
Value = ['TAMARA_CO', 'GERMANY','FR56','18', 'Berlin', 'GER', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP']
Ccy = ['','','','EUR','EUR','USD','USD','','CHF', '','DKN','']
Group = ['0','0','0','1','1','1','1','2','2','2','3','3']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_Age 18 EUR 1
4 Dan_city Berlin EUR 1
5 Dan_country GER USD 1
6 Dan_sex M USD 1
7 Dan_Age 22 2
8 Dan_country FRA CHF 2
9 Dan_sex M 2
10 Dan_city Madrid DKN 3
11 Dan_country ESP 3
I want to represent this data differently before saving it in a csv. I would like to group the duplicates in the column "Name" with the associated values in "Value" and "Ccy". I want the data in the columns "Value" and "Ccy" to be stored in the row (index) defined by the column "Group", so that I do not mix the data.
Then, if the name is in "Group" 0, it means that it is general data, so I would like all the rows for this "Name" to be filled with the same value.
So I would like to get this result:
ID_Value Country_Value IBAN_Value Dan_age Dan_age_Ccy Dan_city_Value Dan_city_Ccy Dan_sex_Value
1 TAMARA GER FR56 18 EUR Berlin EUR M
2 TAMARA GER FR56 22 M
3 TAMARA GER FR56 Madrid DKN
I cannot find how to do the first part. With the code below, I do not get what I want, even if I remove the empty columns:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
Can anyone help me?
Thank you
You can use the following. See comments in code for each step:
import numpy as np  # needed for np.nan below

s = df.loc[df['Group'] == '0', 'Name'].tolist() # this variable will be used later according to Condition 2
df['Name'] = pd.Categorical(df['Name'], categories=df['Name'].unique(), ordered=True) #this preserves order before pivoting
df = df.pivot(index='Group', columns='Name') #transforms long-to-wide per expected output
for col in df.columns:
    if col[1] in s: df[col] = df[col].shift().ffill() #Condition 2
df = df.iloc[1:].replace('',np.nan).dropna(axis=1, how='all').fillna('') #dataframe cleanup
df.columns = ['_'.join(col) for col in df.columns.swaplevel()] #column name cleanup
df
Out[1]:
ID_Value Country_Value IBAN_Value Dan_Age_Value Dan_city_Value \
Group
1 TAMARA_CO GERMANY FR56 18 Berlin
2 TAMARA_CO GERMANY FR56 22
3 TAMARA_CO GERMANY FR56 Madrid
Dan_country_Value Dan_sex_Value Dan_Age_Ccy Dan_city_Ccy \
Group
1 GER M EUR EUR
2 FRA M
3 ESP DKN
Dan_country_Ccy Dan_sex_Ccy
Group
1 USD USD
2 CHF
3
From there, you can drop columns you don't want, change strings from "TAMARA_CO" to "TAMARA", "GERMANY" to "GER", use reset_index(drop=True), etc.
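For illustration, a minimal sketch of that cleanup; which columns to drop is an assumption read off the desired output, not something stated in the answer above:
# drop the columns that the desired output does not include (assumed from the question)
df = df.drop(columns=['Dan_country_Value', 'Dan_country_Ccy', 'Dan_sex_Ccy'])
# shorten the general values as described
df = df.replace({'TAMARA_CO': 'TAMARA', 'GERMANY': 'GER'})
# replace the 'Group' index with a default RangeIndex
df = df.reset_index(drop=True)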
You can do this quite easily with only 3 steps:
1. Split your data frame into 2 parts: the "general data" (which we want as a series) and the more specific data. Each data frame now contains the same kinds of information.
2. The key part of your problem: reorganizing the data. All you need is the pandas pivot function. It does exactly what you need!
3. Add the general information and the pivoted data back together.
# Split Data
general = df[df.Group == "0"].set_index("Name")["Value"].copy()
main_df = df[df.Group != "0"]
# Pivot Data
result = main_df.pivot(index="Group", columns=["Name"],
values=["Value", "Ccy"]).fillna("")
result.columns = [f"{c[1]}_{c[0]}" for c in result.columns]
# Create a data frame that has an identical row for each group
general_df = pd.DataFrame([general]*3, index=result.index)
general_df.columns = [c + "_Value" for c in general_df.columns]
# Merge the data back together
result = general_df.merge(result, on="Group")
The result given above does not give the exact column order you want, so you'd have to specify that manually with
final_cols = ["ID_Value", "Country_Value", "IBAN_Value",
              "Dan_Age_Value", "Dan_Age_Ccy", "Dan_city_Value",
              "Dan_city_Ccy", "Dan_sex_Value"]
result = result[final_cols]
I have the following pandas df:
Name
Jack
Alex
Jackie
Susan
I also have the following dict:
d = {'Jack':['Male','22'],'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
I would like to add two columns, Gender and Age, so that my df returns:
Name Gender Age
Jack Male 22
Alex Male 26
Jackie Female 28
Susan Female 30
I have tried:
df['Gender'] = df.Name.map(d[0])
df['Age'] = df.Name.map(d[1])
but no such luck. Any ideas or help would be much appreciated! Thanks!
df['Gender'] = df.Name.map(lambda x: d[x][0])
df['Age'] = df.Name.map(lambda x: d[x][1])
Take all the values of the dictionary
d = {'Jack':['Male','22'],'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
value_list = list(d.values())
# build a new frame from the dict values, in insertion order (which matches the Name order here)
df = pd.DataFrame(value_list, columns=['Gender', 'Age'])
print(df)
Use the pd.DataFrame constructor with Series.map, then pd.concat to concatenate the result with df:
In [2695]: df = pd.concat([df, pd.DataFrame(df.Name.map(d).tolist(), columns=['Gender', 'Age'])], axis=1)

In [2696]: df
Out[2696]:
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30
The following solutions also work well if there is no match in the dictionary, e.g.:
d = {'Alex':['Male','26'],'Jackie':['Female','28'],'Susan':['Female','30']}
print (df)
Name Gender Age
0 Alex Male 26
1 Jack NaN NaN
2 Jackie Female 28
3 Susan Female 30
Use DataFrame.from_dict on your dictionary and join it to the column Name with DataFrame.join; the advantage is that if there are more columns in the input data, they all work the same way:
df = df.join(pd.DataFrame.from_dict(d, orient='index', columns=['Gender','Age']), on='Name')
print (df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30
Your solution would also work if you created two dictionaries:
d1 = {k:v[0] for k,v in d.items()}
d2 = {k:v[1] for k,v in d.items()}
df['Gender'] = df.Name.map(d1)
df['Age'] = df.Name.map(d2)
print (df)
Name Gender Age
0 Jack Male 22
1 Alex Male 26
2 Jackie Female 28
3 Susan Female 30
I am a newbie to data science and I want to count how many females/males there are in each Title.
I tried the following piece of code:
newdf = pd.DataFrame()
newdf['Title'] = full['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
newdf['Age'] = full['Age']
newdf['Sex'] = full['Sex']
newdf.dropna(axis = 0,inplace=True)
print(newdf.head())
What I get is:
Title Age Sex
0 Mr 22.0 male
1 Mrs 38.0 female
2 Miss 26.0 female
3 Mrs 35.0 female
4 Mr 35.0 male
Then I try this to add the #male and #female columns:
df = pd.DataFrame()
df = newdf[['Age','Title']].groupby('Title').mean().sort_values(by='Age',ascending=False)
df['#People'] = newdf['Title'].value_counts()
df['Male'] = newdf['Title'].sum(newdf['Sex']=='male')
df['Female'] = newdf['Title'].sum(newdf['Sex']=='female')
The error message that I get:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
What I expect is to have the columns Title, Age (average), #People, #male and #female, so that I know how many of those #People are male and female.
P.S. Without these lines:
df['Male'] = newdf['Title'].sum(newdf['Sex']=='male')
df['Female'] = newdf['Title'].sum(newdf['Sex']=='female')
everything works fine, and I get:
Age #People
Title
Capt 70.000000 1
Col 54.000000 4
Sir 49.000000 1
Major 48.500000 2
Lady 48.000000 1
Dr 43.571429 7
....
But without the #male and #female columns.
Use GroupBy.agg to aggregate the mean together with the group size, and add the per-sex counts as new columns with pd.crosstab and DataFrame.join:
df1 = (df.groupby('Title')['Age']
         .agg([('Age', 'mean'), ('#People', 'size')])
         .sort_values(by='Age', ascending=False))
df2 = pd.crosstab(df['Title'], df['Sex']).add_suffix('_avg')
df = df1.join(df2)
print (df)
Age #People female_avg male_avg
Title
Mrs 36.5 2 2 0
Mr 28.5 2 0 2
Miss 26.0 1 1 0
Below is my dataframe
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
                   'age': [30,25,18,26],
                   'sex': ['male','male','female','male']})
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
I want to insert a new row at the first position
name: dean, age: 45, sex: male
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
What is the best way to do this in pandas?
Probably this is not the most efficient way, but:
df.loc[-1] = ['45', 'Dean', 'male'] # adding a row
df.index = df.index + 1 # shifting index
df.sort_index(inplace=True)
Output:
age name sex
0 45 Dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
If it's going to be a frequent operation, then it makes sense (in terms of performance) to gather the data into a list first and then use pd.concat([], ignore_index=True) (similar to #Serenity's solution):
Demo:
data = []
# always inserting new rows at the first position - the last inserted row will end up on top
data.insert(0, {'name': 'dean', 'age': 45, 'sex': 'male'})
data.insert(0, {'name': 'joe', 'age': 33, 'sex': 'male'})
#...
pd.concat([pd.DataFrame(data), df], ignore_index=True)
In [56]: pd.concat([pd.DataFrame(data), df], ignore_index=True)
Out[56]:
age name sex
0 33 joe male
1 45 dean male
2 30 jon male
3 25 sam male
4 18 jane female
5 26 bob male
P.S. I wouldn't call .append(), pd.concat(), or .sort_index() too frequently (i.e., for each single row) as they're pretty expensive. So the idea is to do it in chunks...
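A minimal sketch of that chunked pattern, with illustrative names (CHUNK_SIZE, pending, flush) that are not from the answer above:
import pandas as pd

CHUNK_SIZE = 2   # illustrative batch size (would be much larger in practice)
pending = []     # rows buffered until the next flush

def flush(df, pending):
    # concatenate all buffered rows at once, newest batch on top, instead of once per row
    if pending:
        df = pd.concat([pd.DataFrame(pending), df], ignore_index=True)
        pending.clear()
    return df

df = pd.DataFrame({'name': ['jon'], 'age': [30], 'sex': ['male']})
new_rows = [{'name': 'dean', 'age': 45, 'sex': 'male'},
            {'name': 'joe', 'age': 33, 'sex': 'male'},
            {'name': 'ann', 'age': 28, 'sex': 'female'}]
for row in new_rows:
    pending.append(row)
    if len(pending) >= CHUNK_SIZE:
        df = flush(df, pending)
df = flush(df, pending)  # flush whatever is left over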
#edyvedy13's solution worked great for me. However it needs to be updated for the deprecation of pandas' sort method - now replaced with sort_index.
df.loc[-1] = ['45', 'Dean', 'male'] # adding a row
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
Use pandas.concat and reindex the new dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
                   'age': [30,25,18,26],
                   'sex': ['male','male','female','male']})
# new line
line = pd.DataFrame({'name': 'dean', 'age': 45, 'sex': 'male'}, index=[0])
# concatenate the two dataframes
df2 = pd.concat([line, df]).reset_index(drop=True)
print (df2)
Output:
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
                   'age': [30,25,18,26],
                   'sex': ['male','male','female','male']})
df1 = pd.DataFrame({'name': ['dean'], 'age': [45], 'sex':['male']})
df1 = pd.concat([df1, df])  # put the new row first, then the original frame
df1 = df1.reset_index(drop=True)
That works
This works for me:
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
...                    'age': [30,25,18,26],
...                    'sex': ['male','male','female','male']})
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
>>> df.loc['a']=[45,'dean','male']
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
a 45 dean male
>>> newIndex=['a']+[ind for ind in df.index if ind!='a']
>>> df=df.reindex(index=newIndex)
>>> df
age name sex
a 45 dean male
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male