I'm trying to deal with a file that has an .xls extension but is actually written in HTML (or XML, I'm not sure).
I tried this:
df = pandas.read_html(r"filename.xls", skiprows=0)
and the result was not a DataFrame but a list, so I did this:
df = df[0]
and after this, I could do:
print(df)
The result is as below:
0 1 2
0 name age gender
1 john 18 male
2 ryan 20 male
Before this, I did a similar task with other xlsx files and it worked fine, just not with this one.
For instance:
for index, row in df.iterrows():
    target = str(row['gender'])
    if target == 'male':
        df.loc[index, 'gender'] = 'Y'
    else:
        df.loc[index, 'gender'] = 'N'
In reality, the code is 400 lines long....
I want my DataFrame to look like the one below so that I can reuse the code I already wrote:
name age gender
0 john 18 male
1 ryan 20 male
As mentioned in the comments, I'm adding this result too.
I tried to skip the row:
df = pandas.read_html(r"filename.xls", skiprows=1)
The result is as below:
0 1 2
0 john 18 male
1 ryan 20 male
How would I do it?
Use the parameter header=0:
df = pandas.read_html(r"filename.xls", header=0)[0]
Then, instead of a loop, it is possible to use np.where.
Change:
for index, row in df.iterrows():
    target = str(row['gender'])
    if target == 'male':
        df.loc[index, 'gender'] = 'Y'
    else:
        df.loc[index, 'gender'] = 'N'
to:
df['gender'] = np.where(df['gender'] == 'male', 'Y', 'N')
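Putting it together, a minimal end-to-end sketch (the file name and the 'male'/'Y'/'N' values are taken from the question; adjust as needed):

import numpy as np
import pandas as pd

# read_html returns a list of DataFrames; take the first table and
# let its first row become the header
df = pd.read_html(r"filename.xls", header=0)[0]

# vectorized replacement instead of iterating with iterrows
df['gender'] = np.where(df['gender'] == 'male', 'Y', 'N')
print(df)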
I have a column that looks like this:
Age
[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)
and I want to remove the "[", "-" and ")". Instead of showing a range such as 0-10, I would like to show its middle value for every row in the column.
Yet another solution:
The dataframe:
df = pd.DataFrame({'Age':['[0-10)','[10-20)','[20-30)','[30-40)','[40-50)','[50-60)','[60-70)','[70-80)']})
df
Age
0 [0-10)
1 [10-20)
2 [20-30)
3 [30-40)
4 [40-50)
5 [50-60)
6 [60-70)
7 [70-80)
The code:
df['Age'] = df.Age.str.extract(r'(\d+)-(\d+)').astype(int).mean(axis=1).astype(int)
The result:
df
Age
0 5
1 15
2 25
3 35
4 45
5 55
6 65
7 75
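If the one-liner feels dense, the same chain can be unpacked into commented steps (equivalent to the code above):

# 1. pull both boundaries out of strings like '[0-10)'
bounds = df.Age.str.extract(r'(\d+)-(\d+)').astype(int)
# 2. average the two boundaries row-wise (gives 5.0, 15.0, ...)
middle = bounds.mean(axis=1)
# 3. cast back to int for a clean integer column
df['Age'] = middle.astype(int)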
If you want to explode a row into multiple rows where each row carries a value from the range, you can do this:
data = '''[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)'''
df = pd.DataFrame({'Age': data.splitlines()})
df['Age'] = df['Age'].str.extract(r'\[(\d+)-(\d+)\)').astype(int).apply(lambda r: list(range(r[0], r[1])), axis=1)
df.explode('Age')
Note that I assume your Age column is string typed, so I used extract to get the boundaries of each range and converted them to a real list of integers. Finally, exploding the dataframe on the modified Age column gets you a new row for each integer in the list. Values in the other columns are copied accordingly.
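A quick check of the result (the original index repeats for every value in its range):

print(df.explode('Age').head())
#   Age
# 0   0
# 0   1
# 0   2
# 0   3
# 0   4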
I tried this:
import pandas as pd
import re
data = {
    'age_range': [
        '[0-10)',
        '[10-20)',
        '[20-30)',
        '[30-40)',
        '[40-50)',
        '[50-60)',
        '[60-70)',
        '[70-80)',
    ]
}
df = pd.DataFrame(data=data)
def get_middle_age(age_range):
    pattern = r'(\d+)'
    ages = re.findall(pattern, age_range)
    return int((int(ages[0]) + int(ages[1])) / 2)
df['age'] = df.apply(lambda row: get_middle_age(row['age_range']), axis=1)
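Since the function only needs the one column, Series.apply does the same with less overhead than row-wise DataFrame.apply:

# equivalent, but iterates over a single column instead of whole rows
df['age'] = df['age_range'].apply(get_middle_age)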
I want to replace values in a CSV file with a randomly generated value for each occurrence.
For instance, changing 'Not Available' to either 'Male' or 'Female'.
Sample:
Number Sex
0 Female
1 Male
2 Not Available
3 Male
4 Not Available
after random change:
Number Sex
0 Female
1 Male
2 Female
3 Male
4 Male
import pandas as pd
import random
def RandomSex():
    return random.choice(['Male', 'Female'])

df = pd.read_csv(r'data.csv')
df2 = df.loc[:, 'Sex']
print(df2)
df.loc[df.Sex == 'Not Available', 'Sex'] = RandomSex()
print(df2)
But this is changing all of the 'Not Available' entries to either all 'Male' or all 'Female'.
You can first count the "Not Available" entries and then draw that many values with random.choices from your list instead of choosing only one (which is what random.choice does):
not_availables = df.Sex.eq("Not Available")
num_not_availables = not_availables.sum()
choice_list = ["Male", "Female"]
new_values = random.choices(choice_list, k=num_not_availables)
df.loc[not_availables, "Sex"] = new_values
to get (for example)
Number Sex
0 Female
1 Male
2 Male
3 Male
4 Female
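A numpy variant of the same idea, if you prefer to avoid the random module (a sketch; the seed only makes the example reproducible and can be dropped):

import numpy as np

rng = np.random.default_rng(seed=42)  # seed is optional
df.loc[not_availables, "Sex"] = rng.choice(["Male", "Female"], size=num_not_availables)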
I want to cut a certain part of a string (applied to multiple columns, and it differs for each column) when one column contains a particular substring.
Example:
Assume the following dataframe
import pandas as pd
df = pd.DataFrame({'name':['Allan2','Mike39','Brenda4','Holy5'], 'Age': [30,20,25,18],'Zodiac':['Aries','Leo','Virgo','Libra'],'Grade':['A','AB','B','AA'],'City':['Aura','Somerville','Hendersonville','Gannon'], 'pahun':['a_b_c','c_d_e','f_g','h_i_j']})
print(df)
Out:
name Age Zodiac Grade City pahun
0 Allan2 30 Aries A Aura a_b_c
1 Mike39 20 Leo AB Somerville c_d_e
2 Brenda4 25 Virgo B Hendersonville f_g
3 Holy5 18 Libra AA Gannon h_i_j
For example, if an entry of column 'City' ends with 'e', cut the last three letters of column 'City' and the last two letters of column 'name'.
What I tried so far is something like this:
df['City'] = df['City'].apply(lambda x: df['City'].str[:-3] if df.City.str.endswith('e'))
That doesn't work, and I also don't really know how to cut letters in the other columns while using the same if clause.
I'm happy for any help I get.
Thank you
You can record the rows whose City ends with 'e' in a boolean mask, then update them with loc:
mask = df['City'].str[-1] == 'e'
df.loc[mask, 'City'] = df.loc[mask, 'City'].str[:-3]
df.loc[mask, 'name'] = df.loc[mask, 'name'].str[:-2]
Output:
name Age Zodiac Grade City pahun
0 Allan2 30 Aries A Aura a_b_c
1 Mike 20 Leo AB Somervi c_d_e
2 Brend 25 Virgo B Hendersonvi f_g
3 Holy5 18 Libra AA Gannon h_i_j
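If more columns need trimming by different amounts under the same condition, a small dict of column/length pairs keeps it tidy (a sketch; the pairs here are just the ones from the question):

# how many trailing characters to drop from each column
trim = {'City': 3, 'name': 2}

mask = df['City'].str.endswith('e')
for col, n in trim.items():
    df.loc[mask, col] = df.loc[mask, col].str[:-n]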
import pandas as pd
df = pd.DataFrame({'name':['Allan2','Mike39','Brenda4','Holy5'], 'Age': [30,20,25,18],'Zodiac':['Aries','Leo','Virgo','Libra'],'Grade':['A','AB','B','AA'],'City':['Aura','Somerville','Hendersonville','Gannon'], 'pahun':['a_b_c','c_d_e','f_g','h_i_j']})
def func(row):
    index = row.name
    if row['City'][-1] == 'e':  # check the last letter of City; implement your condition here
        df.at[index, 'City'] = df['City'][index][:-3]
        df.at[index, 'name'] = df['name'][index][:-2]

df.apply(func, axis=1)
print (df)
I have input data as shown below. Here 'gender' and 'ethderived' are two columns. I would like to replace their numeric values like 1, 2, 3 with categorical values, e.g. 1 with Male, 2 with Female.
The mapping file looks as shown below (a sample of 2 columns).
The input data looks as shown below.
I expect my output dataframe to look like this.
I have tried to do this using the code below. Though the code runs fine, I don't see any replacement happening. Can you please help me with this?
mapp = pd.read_csv('file2.csv')
data = pd.read_csv('file1.csv')
for col in mapp:
    if col in data.columns:
        print(col)
        s = list(mapp.loc[mapp[col].str.contains(r'^\d') == True].index)
        print("s is", s)
        for i in s:
            print("i is", i)
            try:
                value = mapp[col][i].split('. ')
                print("value 0 is", value[0])
                print("value 1 is", value[1])
                if value[0] in data[col].values:
                    data.replace({col: {value[0]: value[1]}})
            except:
                print("column not present")
    else:
        print("No")
Please note that I have shown only two columns, but in reality there may be more than 600 columns. Any elegant approach/suggestion to keep it simple is helpful. As I have two separate CSV files, suggestions on merge/join etc. are also welcome, but please note that my mapping file contains values such as "1. Male", "2. Female"; hence I used a regex.
Also note that several other columns can have mapping values that start with 1, e.g. 1. Single, 2. Married, 3. Divorced.
Looking forward to your help
Use DataFrame.replace with nested dictionaries: the outer keys define the column names to replace in, and the inner dictionaries hold the replacement values, created here with Series.str.extract:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Gender':['1.Male','2.Female', np.nan],
                   'Ethnicity':['1.Chinese','2.Indian','3.Malay']})
print (df)
Gender Ethnicity
0 1.Male 1.Chinese
1 2.Female 2.Indian
2 NaN 3.Malay
d={x:df[x].str.extract(r'(\d+)\.(.+)').dropna().set_index(0)[1].to_dict() for x in df.columns}
print (d)
{'Gender': {'1': 'Male', '2': 'Female'},
'Ethnicity': {'1': 'Chinese', '2': 'Indian', '3': 'Malay'}}
df1 = pd.DataFrame({'Gender':[2,1,2,1],
'Ethnicity':[1,2,3,1]})
print (df1)
Gender Ethnicity
0 2 1
1 1 2
2 2 3
3 1 1
#convert to strings before replace
df2 = df1.astype(str).replace(d)
print (df2)
Gender Ethnicity
0 Female Chinese
1 Male Indian
2 Female Malay
3 Male Chinese
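An equivalent per-column map using the same dictionary d, often faster than replace on wide frames; note that map turns unmapped codes into NaN, whereas replace leaves them untouched:

# s.name is the column label, so d[s.name] selects the right sub-dictionary
df2 = df1.astype(str).apply(lambda s: s.map(d[s.name]))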
If the entries are always in order (1.XXX, 2.XXX, ...), use:
m=df1.apply(lambda x: x.str[2:])
n=df2.sub(1).replace(m)
print(n)
gender ethderived
0 Female Chinese
1 Male Indian
2 Male Malay
3 Female Chinese
4 Male Chinese
5 Female Indian
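A more explicit spelling of the same positional idea (a sketch with illustrative names: labels holds the frame of stripped label strings, codes the frame of 1-based integer codes):

# decode each 1-based code by its position in the matching label column
decoded = codes.apply(lambda s: s.map(lambda c: labels[s.name].iloc[c - 1]))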
I have two pandas DataFrames of different sizes, each with over 1 million records.
I am looking to compare these two DataFrames and identify the differences.
DataFrameA
ID Name Age Sex
1A1 Cling 21 M
1B2 Roger 22 M
1C3 Stew 23 M
DataFrameB
ID FullName Gender Age
1B2 Roger M 21
1C3 Rick M 23
1D4 Ash F 21
DataFrameB will always have more records than DataFrameA, but records found in DataFrameA may not necessarily be in DataFrameB.
The column names in the DataFrameA and DataFrameB are different. I have the mapping stored in a different dataframe.
MappingDataFrame
DataFrameACol DataFrameBCol
ID ID
Name FullName
Age Age
Sex Gender
I am looking to compare these two and add a result column next to each compared pair.
Column name suffix for DataFrameA = "_A_Txt"
Column name suffix for DataFrameB = "_B_Txt"
ExpectedOutput
ID Name_A_Txt FullName_B_Text Result_Name Age_A_Txt Age_B_Txt Result_Age
1B2 Roger Roger Match ... ...
1C3 Stew Rick No Match ... ...
The column names should have this text appended, as in the expected output above.
I am using a for loop at the moment to build this logic, but 1 million records take ages to complete. I left the program running for more than 50 minutes and it hadn't finished; in the real data, I am building this for more than 100 columns.
I will open a bounty for this question and award it even if the question is answered before the bounty opens, as a reward, since I have really been struggling with the performance of the for-loop iteration.
To create DataFrameA and DataFrameB, use the code below:
import pandas as pd
d = {
    'ID': ['1A1', '1B2', '1C3'],
    'Name': ['Cling', 'Roger', 'Stew'],
    'Age': [21, 22, 23],
    'Sex': ['M', 'M', 'M']
}
DataFrameA = pd.DataFrame(d)

d = {
    'ID': ['1B2', '1C3', '1D4'],
    'FullName': ['Roger', 'Rick', 'Ash'],
    'Gender': ['M', 'M', 'F'],
    'Age': [21, 23, 21]
}
DataFrameB = pd.DataFrame(d)
I believe this question is a bit different from the suggested duplicate (on joins) provided by Coldspeed, as this also involves looking up different column names and adding a new result column alongside. Also, the column names need to be transformed on the result side.
The output DataFrame looks as below.
For the readers' better understanding, I am listing the column names in order:
Col 1 - ID (Coming from DataFrameA)
Col 2 - Name_X (Coming from DataFrameA)
Col 3 - FullName_Y (Coming from DataFrameB)
Col 4 - Result_Name (Name is what is there in DataFrameA and this is a comparison between Name_X and FullName_Y)
Col 5 - Age_X (Coming from DataFrameA)
Col 6 - Age_Y (Coming From DataFrameB)
Col 7 - Result_Age (Age is what is there in DataFrameA and this is a result between Age_X and Age_Y)
Col 8 - Sex_X (Coming from DataFrameA)
Col 9 - Gender_Y (Coming from DataFrameB)
Col 10 - Result_Sex (Sex is what is there in DataFrameA and this is a result between Sex_X and Gender_Y)
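One more piece of setup: the answer below refers to a mapping_df; a minimal construction consistent with the mapping table in the question:

mapping_df = pd.DataFrame({
    'DataFrameACol': ['ID', 'Name', 'Age', 'Sex'],
    'DataFrameBCol': ['ID', 'FullName', 'Age', 'Gender'],
})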
First, build the list of column pairs to compare from the mapping frame:

m = list(mapping_df.set_index('DataFrameACol')['DataFrameBCol']
                   .drop('ID')
                   .items())  # .iteritems() on older pandas
# the shared 'Age' name will get _x/_y suffixes from the merge below
m[m.index(('Age', 'Age'))] = ('Age_x', 'Age_y')
m
# [('Name', 'FullName'), ('Age_x', 'Age_y'), ('Sex', 'Gender')]
Start with an inner merge (df1 and df2 here are DataFrameA and DataFrameB from the question):
df3 = (df1.merge(df2, how='inner', on=['ID'])
          .reindex(columns=['ID', *(v for V in m for v in V)]))
df3
ID Name FullName Age_x Age_y Sex Gender
0 1B2 Roger Roger 22 21 M M
1 1C3 Stew Rick 23 23 M M
Now, compare the paired columns element-wise with eq and map the boolean result to 'Match'/'No Match':
l, r = map(list, zip(*m))
matches = (df3[l].eq(df3[r].rename(dict(zip(r, l)), axis=1))
.add_prefix('Result_')
.replace({True: 'Match', False: 'No Match'}))
for k, v in m:
name = f'Result_{k}'
df3.insert(df3.columns.get_loc(v)+1, name, matches[name])
df3.columns
# Index(['ID', 'Name', 'FullName', 'Result_Name', 'Age_x', 'Age_y',
# 'Result_Age_x', 'Sex', 'Gender', 'Result_Sex'],
# dtype='object')
df3.filter(like='Result_')
Result_Name Result_Age_x Result_Sex
0 Match No Match Match
1 No Match Match Match
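Finally, to get the exact '_A_Txt'/'_B_Txt' labels the question asks for, one last rename over the column lists l and r computed above will do (a sketch; str.removesuffix needs Python 3.9+):

# strip the merge suffixes, then append the question's own suffixes
rename_map = {c: f"{c.removesuffix('_x')}_A_Txt" for c in l}
rename_map.update({c: f"{c.removesuffix('_y')}_B_Txt" for c in r})
df3 = df3.rename(columns=rename_map)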