I have a list called 'gender', and I counted the occurrences of each value with Counter:
gender = ['2',
'Female,',
'All Female Group,',
'All Male Group,',
'Female,',
'Couple,',
'Mixed Group,'....]
gender_count = Counter(gender)
gender_count
Counter({'2': 1,
'All Female Group,': 222,
'All Male Group,': 119,
'Couple,': 256,
'Female,': 1738,
'Male,': 2077,
'Mixed Group,': 212,
'NA': 16})
I want to put this dict into a pandas DataFrame. Following "Convert Python dict into a dataframe", I used pd.Series:
s = pd.Series(gender_count, name='gender count')
s.index.name = 'gender'
s.reset_index()
Which gives me the dataframe I want, but I don't know how to save these steps into a pandas DataFrame.
I also tried using DataFrame.from_dict()
s2 = pd.DataFrame.from_dict(gender_count, orient='index')
But this creates a dataframe with the categories of gender as the index.
I eventually want to use gender categories and the count for a piechart.
Skip the intermediate step
gender = ['2',
'Female',
'All Female Group',
'All Male Group',
'Female',
'Couple',
'Mixed Group']
pd.Series(gender).value_counts()
Female 2
2 1
Couple 1
Mixed Group 1
All Female Group 1
All Male Group 1
dtype: int64
In [21]: df = pd.Series(gender_count).rename_axis('gender').reset_index(name='count')
In [22]: df
Out[22]:
gender count
0 2 1
1 All Female Group, 222
2 All Male Group, 119
3 Couple, 256
4 Female, 1738
5 Male, 2077
6 Mixed Group, 212
7 NA 16
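For completeness, here is a self-contained sketch of the same idea that starts from a small made-up list rather than the full data. The sample values and the column names 'gender' and 'count' are illustrative assumptions, not the asker's real data.

```python
from collections import Counter

import pandas as pd

# Small made-up sample; the real list is much longer.
gender = ['Female', 'Male', 'Female', 'Couple', 'Male', 'Male']
gender_count = Counter(gender)

# most_common() yields (value, count) pairs sorted by descending count,
# so each pair becomes one row of the DataFrame.
df = pd.DataFrame(gender_count.most_common(), columns=['gender', 'count'])
print(df)
```

From here, something like df.set_index('gender')['count'].plot.pie() should draw the pie chart, assuming matplotlib is installed.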
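What about just building the frame from the (value, count) pairs? Passing the Counter directly raises a ValueError because all its values are scalars.
s = pd.DataFrame(list(gender_count.items()), columns=['gender', 'count'])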
I have a Pandas dataframe similar to this one:
age name sex
0 30 jon male
1 blue php null
2 18 jane female
3 orange c++ null
and I am trying to concatenate every second row to the previous one adding extra columns:
age name sex colour language other
0 30 jon male blue php null
1 18 jane female orange c++ null
I tried shift(), but it duplicated every row.
How can this be done?
You can create a new dataframe by slicing the dataframe using iloc with a step of 2:
cols = ['age', 'name', 'sex']
new_cols = ['colour', 'language', 'other']
d = dict()
for col, ncol in zip(cols, new_cols):
    d[col] = df[col].iloc[::2].values
    d[ncol] = df[col].iloc[1::2].values
pd.DataFrame(d)
Result:
age colour name language sex other
0 30 blue jon php male NaN
1 18 orange jane c++ female NaN
TRY:
df = pd.concat([df.iloc[::2].reset_index(drop=True), pd.DataFrame(
    df.iloc[1::2].values, columns=['colour', 'language', 'other'])], axis=1)
OUTPUT:
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
Reshape the values and create a new dataframe
pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2),
columns=['age', 'name', 'sex', 'colour', 'language', 'other'])
age name sex colour language other
0 30 jon male blue php NaN
1 18 jane female orange c++ NaN
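The reshape approach assumes the frame has an even number of rows and that each record occupies exactly two consecutive rows. A minimal, self-contained sketch of that check plus the reshape, using the sample data from the question (the None values stand in for the question's null):

```python
import pandas as pd

df = pd.DataFrame({'age': ['30', 'blue', '18', 'orange'],
                   'name': ['jon', 'php', 'jane', 'c++'],
                   'sex': ['male', None, 'female', None]})

# Each record must span exactly two consecutive rows.
if len(df) % 2 != 0:
    raise ValueError('expected an even number of rows')

# Row-major reshape: rows 0 and 1 become one row of six values, and so on.
wide = pd.DataFrame(df.to_numpy().reshape(-1, df.shape[1] * 2),
                    columns=['age', 'name', 'sex',
                             'colour', 'language', 'other'])
print(wide)
```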
I have two dataframes in Pandas. What I want to achieve is to take every 'Name' from DF1 and get the corresponding 'City' and 'State' from DF2.
For example, 'Dwight' from DF1 should return corresponding values 'Miami' and 'Florida' from DF2.
DF1
Name Age Student
0 Dwight 20 Yes
1 Michael 30 No
2 Pam 55 No
. . . .
70000 Jim 27 Yes
DF1 has approx 70,000 rows with 3 columns
Second Dataframe, DF2 has approx 320,000 rows.
Name City State
0 Dwight Miami Florida
1 Michael Scranton Pennsylvania
2 Pam Austin Texas
. . . . .
325082 Jim Scranton Pennsylvania
Currently I have two functions, which return the values of 'City' and 'State' using a filter.
def read_city(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = (df2[filt]['City'].values[0])
    else:
        field = ""
    return field
def read_state(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = (df2[filt]['State'].values[0])
    else:
        field = ""
    return field
I am using the apply function to process all the values.
df['city_list'] = df['Name'].apply(read_city)
df['State_list'] = df['Name'].apply(read_state)
Computing the result this way takes a long time: roughly 18 minutes to get df['city_list'] and df['State_list'].
Is there a faster way to compute this? Since I am completely new to pandas, I would like to know if there is a more efficient approach.
I believe you can do a map:
s = df2.groupby('Name')[['City','State']].agg(list)
df['city_list'] = df['Name'].map(s['City'])
df['State_list'] = df['Name'].map(s['State'])
Or a left merge after you got s:
df = df.merge(s.add_suffix('_list'), left_on='Name', right_index=True, how='left')
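If only the first match per name is needed, which is what read_city and read_state return, a lookup table built once can replace the per-row filtering. A minimal sketch with dummy data, assuming duplicates in df2 can be dropped by keeping the first occurrence; the "" fill value mirrors the original functions:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Dwight', 'Michael', 'Creed'],
                   'Age': [20, 30, 60]})
df2 = pd.DataFrame({'Name': ['Dwight', 'Michael', 'Dwight'],
                    'City': ['Miami', 'Scranton', 'Tampa'],
                    'State': ['Florida', 'Pennsylvania', 'Florida']})

# Keep the first row per name, then index by name for fast lookups.
lookup = df2.drop_duplicates('Name').set_index('Name')

# map() looks each name up in the index; names missing from df2 become "".
df['city_list'] = df['Name'].map(lookup['City']).fillna('')
df['State_list'] = df['Name'].map(lookup['State']).fillna('')
print(df)
```

The lookup is built once, so the cost no longer grows with a full scan of df2 for every row of df.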
I think you can do something like this:
# Dataframe DF1 (dummy data)
DF1 = pd.DataFrame(columns=['Name', 'Age', 'Student'], data=[['Dwight', 20, 'Yes'], ['Michael', 30, 'No'], ['Pam', 55, 'No'], ['Jim', 27, 'Yes']])
print("DataFrame DF1")
print(DF1)
# Dataframe DF2 (dummy data)
DF2 = pd.DataFrame(columns=['Name', 'City', 'State'], data=[['Dwight', 'Miami', 'Florida'], ['Michael', 'Scranton', 'Pennsylvania'], ['Pam', 'Austin', 'Texas'], ['Jim', 'Scranton', 'Pennsylvania']])
print("DataFrame DF2")
print(DF2)
# You do a merge on 'Name' column and then, you change the name of columns 'City' and 'State'
df = pd.merge(DF1, DF2, on=['Name']).rename(columns={'City': 'city_list', 'State': 'State_list'})
print("DataFrame final")
print(df)
Output:
DataFrame DF1
Name Age Student
0 Dwight 20 Yes
1 Michael 30 No
2 Pam 55 No
3 Jim 27 Yes
DataFrame DF2
Name City State
0 Dwight Miami Florida
1 Michael Scranton Pennsylvania
2 Pam Austin Texas
3 Jim Scranton Pennsylvania
DataFrame final
Name Age Student city_list State_list
0 Dwight 20 Yes Miami Florida
1 Michael 30 No Scranton Pennsylvania
2 Pam 55 No Austin Texas
3 Jim 27 Yes Scranton Pennsylvania
I have a dataset with several columns.
What I want is to calculate a score based on a particular column ("name"), grouped by the "id" column.
_id fName lName age
0 ABCD Andrew Schulz
1 ABCD Andreww 23
2 DEFG John boy
3 DEFG Johnn boy 14
4 CDGH Bob TANNA 13
5 ABCD. Peter Parker 45
6 DEFGH Clark Kent 25
What I am looking for is whether, for the same id, there are similar entries, so that I can remove them based on a threshold score. For example, if I run it on the column "fName", I should be able to reduce this dataframe to the following, based on a score threshold:
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
5 ABCD Peter Parker 45
6 DEFG Clark Kent 25
I intend to use pyjarowinkler.
If I had two independent columns (without all the group by stuff) to check, this is how I use it.
df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'],df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
UPDATE
So, I have tried using the recordlinkage library for this, and I have ended up with a dataframe called 'matches' containing pairs of indexes that are similar. Now I just want to combine the data.
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is how matches looks:
index1 index2 fName
0 1 1.0
2 3 1.0
I need a way to combine the similar rows so that the result takes data from each of them.
I just wanted to clarify a few things about your question; I couldn't ask in the comments due to low reputation.
Like here if i run it for col "fName". I should be able to reduce this
dataframe to based on a score threshold:
So basically your function would return the DataFrame containing the first row in each group (by ID)? This will result in the above listed resultant DataFrame.
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
I hope this code answers your question:
r0 =['ABCD','Andrew','Schulz', '' ]
r1 =['ABCD','Andrew', '' , '23' ]
r2 =['DEFG','John' ,'boy' , '' ]
r3 =['DEFG','John' ,'boy' , '14' ]
r4 =['CDGH','Bob' ,'TANNA' , '13' ]
Rx =[r0,r1,r2,r3,r4]
print(Rx)
print()
Dict = dict()
for i in Rx:
    if i[0] in Dict:
        # id already seen: fill in any non-empty fields from this row
        if i[2] != '':
            Dict[i[0]][2] = i[2]
        if i[3] != '':
            Dict[i[0]][3] = i[3]
    else:
        Dict[i[0]] = i
Rx[:] = Dict.values()
print(Rx)
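The same fill-then-drop idea can be applied directly to the DataFrame once the similar index pairs are known. The sketch below is only an illustration: the pairs list is a hypothetical stand-in for the (index1, index2) entries of matches, and it assumes no row appears in more than one pair.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'_id': ['ABCD', 'ABCD', 'DEFG', 'DEFG', 'CDGH'],
                   'fName': ['Andrew', 'Andreww', 'John', 'Johnn', 'Bob'],
                   'lName': ['Schulz', np.nan, 'boy', 'boy', 'TANNA'],
                   'age': [np.nan, 23, np.nan, 14, 13]})

# Hypothetical stand-in for matches.index: (index1, index2) pairs judged similar.
pairs = [(0, 1), (2, 3)]

for keep, drop in pairs:
    # Fill gaps in the kept row with values from its match, then drop the match.
    df.loc[keep] = df.loc[keep].combine_first(df.loc[drop])
    df = df.drop(index=drop)
print(df)
```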
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.bfill().drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then back fill the data, and drop duplicates keeping the first occurrence of Id. The back fill takes the next value found in the column, which may belong to another Id, but since the duplicated rows are discarded, I believe drop_duplicates keeping the first occurrence will do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
data = [
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', 'Schulz'],
['AABBCC', 'Andrew', '', 23],
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', '',],
['DDEEFF', 'Karl', 'boy'],
['DDEEFF', 'Karl', ''],
['DDEEFF', 'Karl', '', 14],
['GGHHHH', 'John', 'TANNA', 13],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', 'Blob'],
['HLHLHL', 'Bob', 'Blob', 15],
['HLHLHL', 'Bob','', 15],
['JLJLJL', 'Nick', 'Best', 20],
['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.bfill().drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
Hope this helps and apologies if I misunderstood the question.
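Because a plain back fill can pull values across Id boundaries when a whole group lacks a value, a variant that stays within each group may be safer. A minimal sketch, assuming the goal is the first non-null value per column for each Id (groupby(...).first() skips nulls by default); the sample rows are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': ['AABBCC', 'AABBCC', 'DDEEFF', 'DDEEFF'],
                   'fName': ['Andrew', 'Andrew', 'Karl', 'Karl'],
                   'lName': ['', 'Schulz', 'boy', ''],
                   'Age': [np.nan, 23, np.nan, np.nan]})

# Treat empty strings as missing so they can be skipped.
df = df.replace('', np.nan)

# first() returns the first non-null entry of every column per group,
# so values are never borrowed from a different Id.
merged = df.groupby('Id', as_index=False, sort=False).first()
print(merged)
```

Here DDEEFF has no age at all, so its Age stays missing instead of inheriting a value from another Id.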
I have a dataset containing the following columns:
['sex', 'age', 'relationship_status']
There are some NaN values in 'relationship_status' column and I want to replace them with the most common value in each group based on age and gender.
I know how to groupby and count the values:
df2.groupby(['age','sex'])['relationship_status'].value_counts()
and it returns:
age sex relationship_status
17.0 female Married with kids 1
18.0 female In relationship 5
Married 4
Single 4
Married with kids 2
male In relationship 9
Single 5
Married 4
Married with kids 4
Divorced 3
.
.
.
86.0 female In relationship 1
92.0 male Married 1
97.0 male In relationship 1
So again, what I need to achieve is that whenever "relationship_status" is empty I need the program to replace it with the most frequent value based on persons age and gender.
Can anyone suggest how can I do it?
Kind regards.
Something like this:
mode = df2.groupby(['age','sex'])['relationship_status'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
df2['relationship_status'] = df2['relationship_status'].fillna(mode)
Check this; it returns 'ALL_NAN' when an (age, sex) subgroup contains only NaNs:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'age': [25, 25, 25, 25, 25, 25,],
'sex': ['F', 'F', 'F', 'M', 'M', 'M', ],
'status': ['married', np.nan, 'married', np.nan, np.nan, 'single']
})
df.loc[df['status'].isna(), 'status'] = df.groupby(['age','sex'])['status'].transform(lambda x: x.mode()[0] if not x.mode().empty else 'ALL_NAN')
Output:
age sex status
0 25 F married
1 25 F married
2 25 F married
3 25 M single
4 25 M single
5 25 M single
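A variant of the same idea, sketched under the assumption that falling back to the overall most frequent value is acceptable when a subgroup has no non-null entries (instead of writing 'ALL_NAN'). The helper name group_mode and the sample data are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, 25, 25, 25, 25, 30, 30],
    'sex': ['F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'status': ['married', np.nan, 'married', 'single', np.nan, np.nan, np.nan],
})

def group_mode(s):
    """Return the most frequent non-null value of s, or NaN if there is none."""
    m = s.mode()  # mode() ignores NaN by default
    return m.iloc[0] if not m.empty else np.nan

# Broadcast each group's mode back to the original row positions.
per_group = df.groupby(['age', 'sex'])['status'].transform(group_mode)

# Fill from the group mode first, then from the overall mode (an assumption).
overall = df['status'].mode().iloc[0]
df['status'] = df['status'].fillna(per_group).fillna(overall)
print(df)
```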
Below is my dataframe
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
I want to insert a new row at the first position
name: dean, age: 45, sex: male
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
What is the best way to do this in pandas?
Probably this is not the most efficient way but:
df.loc[-1] = [45, 'Dean', 'male'] # adding a row (values in column order: age, name, sex)
df.index = df.index + 1 # shifting index
df.sort_index(inplace=True)
Output:
age name sex
0 45 Dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
If it's going to be a frequent operation, then it makes sense (in terms of performance) to gather the data into a list first and then use pd.concat([], ignore_index=True) (similar to @Serenity's solution):
Demo:
data = []
# always inserting new rows at the first position - last row will be always on top
data.insert(0, {'name': 'dean', 'age': 45, 'sex': 'male'})
data.insert(0, {'name': 'joe', 'age': 33, 'sex': 'male'})
#...
pd.concat([pd.DataFrame(data), df], ignore_index=True)
In [56]: pd.concat([pd.DataFrame(data), df], ignore_index=True)
Out[56]:
age name sex
0 33 joe male
1 45 dean male
2 30 jon male
3 25 sam male
4 18 jane female
5 26 bob male
PS I wouldn't call .append(), pd.concat(), .sort_index() too frequently (for each single row) as it's pretty expensive. So the idea is to do it in chunks...
@edyvedy13's solution worked great for me. However, it needs to be updated for the deprecation of pandas' sort method, now replaced with sort_index.
df.loc[-1] = [45, 'Dean', 'male'] # adding a row (values in column order: age, name, sex)
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
Use pandas.concat and reindex new dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
# new line
line = pd.DataFrame({'name': 'dean', 'age': 45, 'sex': 'male'}, index=[0])
# concatenate two dataframe
df2 = pd.concat([line, df]).reset_index(drop=True)
print (df2)
Output:
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
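The concat approach can be wrapped in a small helper so the new row is given as a dict and matched to columns by name rather than by position. This is only a sketch; prepend_row is a hypothetical name, not a pandas function.

```python
import pandas as pd

def prepend_row(df, row):
    """Return a new DataFrame with `row` (a dict) placed before df's rows."""
    # Building the one-row frame with df's columns aligns the dict keys by name.
    first = pd.DataFrame([row], columns=df.columns)
    return pd.concat([first, df], ignore_index=True)

df = pd.DataFrame({'name': ['jon', 'sam', 'jane', 'bob'],
                   'age': [30, 25, 18, 26],
                   'sex': ['male', 'male', 'female', 'male']})

df2 = prepend_row(df, {'name': 'dean', 'age': 45, 'sex': 'male'})
print(df2)
```

Because ignore_index=True renumbers the rows, no separate index shifting or sorting is needed.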
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex': ['male','male','female','male']})
df1 = pd.DataFrame({'name': ['dean'], 'age': [45], 'sex':['male']})
df1 = pd.concat([df1, df])  # DataFrame.append was removed in pandas 2.0
df1 = df1.reset_index(drop=True)
That works
This works for me:
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
... 'age': [30,25,18,26],
... 'sex':['male','male','female','male']})
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
>>> df.loc['a']=[45,'dean','male']
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
a 45 dean male
>>> newIndex=['a']+[ind for ind in df.index if ind!='a']
>>> df=df.reindex(index=newIndex)
>>> df
age name sex
a 45 dean male
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male