Replace DataFrame value in a row with the value of the row below - Python

I have the following frame
+-------+--------+-----+--+
| 1 | 2 | 3 | |
+-------+--------+-----+--+
| hi | banana | 123 | |
| | apple | | |
| hello | pear | 456 | |
| | orange | | |
+-------+--------+-----+--+
What is the most Pythonic way of replacing the value in column 2 of each odd row with the value from the row below, i.e. ending up with a df like
+-------+--------+-----+
| 1 | 2 | 3 |
+-------+--------+-----+
| hi | apple | 123 |
| hello | orange | 456 |
+-------+--------+-----+
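A minimal sketch of one way to do this, assuming the blanks are empty strings and the frame looks like the one above: shift column 2 up one row, then keep every other row.

```python
import pandas as pd

# Toy reconstruction of the frame above (columns named 1, 2, 3).
df = pd.DataFrame({
    1: ["hi", "", "hello", ""],
    2: ["banana", "apple", "pear", "orange"],
    3: ["123", "", "456", ""],
})

# Pull each row's column-2 value up from the row below it...
df[2] = df[2].shift(-1)
# ...then keep only the even-positioned rows and renumber them.
out = df.iloc[::2].reset_index(drop=True)
```

shift(-1) moves the whole column up by one position, so each even row ends up holding the value that was below it before the shift.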

How to join tables in Python without overwriting existing column data

I need to join multiple tables but I can't get the join in Python to behave as expected. I need to left join table 2 to table 1, without overwriting the existing data in the "geometry" column of table 1. What I'm trying to achieve is sort of like a VLOOKUP in Excel. I want to pull matching values from my other tables (~10) into table 1 without overwriting what is already there. Is there a better way? Below is what I tried:
TABLE 1
| ID | BLOCKCODE | GEOMETRY |
| -- | --------- | -------- |
| 1 | 123 | ABC |
| 2 | 456 | DEF |
| 3 | 789 | |
TABLE 2
| ID | GEOID | GEOMETRY |
| -- | ----- | -------- |
| 1 | 123 | |
| 2 | 456 | |
| 3 | 789 | GHI |
TABLE 3 (What I want)
| ID | BLOCKCODE | GEOID | GEOMETRY |
| -- | --------- |----- | -------- |
| 1 | 123 | 123 | ABC |
| 2 | 456 | 456 | DEF |
| 3 | | 789 | GHI |
What I'm getting
| ID | GEOID | GEOMETRY_X | GEOMETRY_Y |
| -- | ----- | -------- | --------- |
| 1 | 123 | ABC | |
| 2 | 456 | DEF | |
| 3 | 789 | | GHI |
join = pd.merge(table1, table2, how="left", left_on="BLOCKCODE", right_on="GEOID")
When I try this:
join = pd.merge(table1, table2, how="left", left_on=["BLOCKCODE", "GEOMETRY"], right_on=["GEOID", "GEOMETRY"])
I get this:
TABLE 1
| ID | BLOCKCODE | GEOMETRY |
| -- | --------- | -------- |
| 1 | 123 | ABC |
| 2 | 456 | DEF |
| 3 | 789 | |
You could try:
# Rename the BLOCKCODE column in table1 to match table2's GEOID column.
# This is necessary because DataFrame.update aligns on index and column names.
table1 = table1.rename(columns={"BLOCKCODE": "GEOID"})
# Fill table1 from table2; overwrite=False only fills NaN values, so the
# existing data in table1 is never overwritten.
table1.update(table2, overwrite=False)
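Alternatively, a merge followed by combine_first gives the VLOOKUP-style fill without leaving suffix columns behind. A minimal sketch with toy versions of the tables above (values taken from the ASCII tables):

```python
import pandas as pd

# Toy versions of the tables above.
table1 = pd.DataFrame({"ID": [1, 2, 3],
                       "BLOCKCODE": [123, 456, 789],
                       "GEOMETRY": ["ABC", "DEF", None]})
table2 = pd.DataFrame({"ID": [1, 2, 3],
                       "GEOID": [123, 456, 789],
                       "GEOMETRY": [None, None, "GHI"]})

# Left-join table2 onto table1; overlapping columns from table2 get "_t2".
merged = table1.merge(table2, left_on="BLOCKCODE", right_on="GEOID",
                      how="left", suffixes=("", "_t2"))
# Keep table1's GEOMETRY where it exists, otherwise take table2's.
merged["GEOMETRY"] = merged["GEOMETRY"].combine_first(merged["GEOMETRY_t2"])
merged = merged.drop(columns=["ID_t2", "GEOMETRY_t2"])
```

The same merge/combine_first/drop step can be repeated per lookup table, which scales to the ~10 tables mentioned in the question.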

Creating dummy variables for dataset with continuous and categorical variables in Python using Pandas

I am working on a logistic regression for employee turnover.
The variables I have are
q3attendance object
q4attendance object
average_attendance object
training_days int64
esat float64
lastcompany object
client_category object
qual_category object
location object
rating object
role object
band object
resourcegroup object
skill object
status int64
I have marked the categorical variables using
cat = ['bandlevel', 'resourcegroup', 'skill', ...]
I define x and y using x=df.iloc[:,:-1] and y=df.iloc[:,-1].
Next I need to create dummy variables. So, I use the command
xd = pd.get_dummies(x, drop_first=True)
After this, I expect the continuous variables to remain as they are and dummies to be created for all the categorical variables. However, on executing the command, I find that the code treats the continuous variables as categorical too and ends up creating dummies for them as well. So if the tenure is 3 years 2 months, 4 years 3 months, etc., 3.2 and 4.3 are both taken as categories. I end up with more than 1500 dummies and it is a challenge to run the regression after that.
What am I missing? Should I specifically mark out the categorical variables when using get_dummies?
pd.get_dummies has an optional parameter columns, which accepts the list of columns for which you need to create the encoding.
For example:
df.head()
+----+------+--------------+-------------+---------------------------+----------+-----------------+
| | id | first_name | last_name | email | gender | ip_address |
|----+------+--------------+-------------+---------------------------+----------+-----------------|
| 0 | 1 | Lucine | Krout | lkrout0#sourceforge.net | Female | 199.158.46.27 |
| 1 | 2 | Sherm | Jullian | sjullian1#mapy.cz | Male | 8.97.22.209 |
| 2 | 3 | Derk | Mulloch | dmulloch2#china.com.cn | Male | 132.108.184.131 |
| 3 | 4 | Elly | Sulley | esulley3#com.com | Female | 63.177.149.251 |
| 4 | 5 | Brocky | Jell | bjell4#huffingtonpost.com | Male | 152.32.40.4 |
| 5 | 6 | Harv | Allot | hallot5#blogtalkradio.com | Male | 71.135.240.164 |
| 6 | 7 | Wolfie | Stable | wstable6#utexas.edu | Male | 211.31.189.141 |
| 7 | 8 | Harcourt | Dunguy | hdunguy7#whitehouse.gov | Male | 224.214.43.40 |
| 8 | 9 | Devina | Salerg | dsalerg8#furl.net | Female | 49.169.34.38 |
| 9 | 10 | Missie | Korpal | mkorpal9#wunderground.com | Female | 119.115.90.232 |
+----+------+--------------+-------------+---------------------------+----------+-----------------+
then
columns = ["gender"]
pd.get_dummies(df, columns=columns)
+----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------+
| | id | first_name | last_name | email | ip_address | gender_Female | gender_Male |
|----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------|
| 0 | 1 | Lucine | Krout | lkrout0#sourceforge.net | 199.158.46.27 | 1 | 0 |
| 1 | 2 | Sherm | Jullian | sjullian1#mapy.cz | 8.97.22.209 | 0 | 1 |
| 2 | 3 | Derk | Mulloch | dmulloch2#china.com.cn | 132.108.184.131 | 0 | 1 |
| 3 | 4 | Elly | Sulley | esulley3#com.com | 63.177.149.251 | 1 | 0 |
| 4 | 5 | Brocky | Jell | bjell4#huffingtonpost.com | 152.32.40.4 | 0 | 1 |
| 5 | 6 | Harv | Allot | hallot5#blogtalkradio.com | 71.135.240.164 | 0 | 1 |
| 6 | 7 | Wolfie | Stable | wstable6#utexas.edu | 211.31.189.141 | 0 | 1 |
| 7 | 8 | Harcourt | Dunguy | hdunguy7#whitehouse.gov | 224.214.43.40 | 0 | 1 |
| 8 | 9 | Devina | Salerg | dsalerg8#furl.net | 49.169.34.38 | 1 | 0 |
| 9 | 10 | Missie | Korpal | mkorpal9#wunderground.com | 119.115.90.232 | 1 | 0 |
+----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------+
This will encode only the gender column.
All data is auto-generated and doesn't represent real-world records.
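A likely root cause of the behaviour in the question: continuous columns such as tenure are stored as object dtype, so get_dummies treats every distinct string as a category. A minimal sketch (with hypothetical tenure/skill/esat columns) that coerces pseudo-numeric columns back to numbers first, then encodes only the columns that are still object-typed:

```python
import pandas as pd

# Hypothetical frame: tenure is numeric but stored as strings (object dtype),
# skill is genuinely categorical, esat is already numeric.
x = pd.DataFrame({
    "tenure": ["3.2", "4.3", "3.2"],
    "skill": ["python", "sql", "python"],
    "esat": [4.5, 3.8, 4.1],
})

# Coerce pseudo-numeric object columns back to numbers first...
x["tenure"] = pd.to_numeric(x["tenure"])

# ...then dummy-encode only the columns that are still object-typed.
cat_cols = x.select_dtypes(include="object").columns
xd = pd.get_dummies(x, columns=cat_cols, drop_first=True)
```

After the coercion, tenure and esat pass through untouched and only skill is expanded into dummies, which keeps the column count manageable.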

I have obtained another two dataframes from my original dataframe; how can I merge the columns I need into a final one?

I have a table with 4 columns. From this data I obtained another 2 tables with some rolling averages from the original table. Now I want to combine these 3 into a final table, but the indexes are no longer in order and I can't do it. I just started to learn Python, I have zero experience, and I would really need all the help I can get.
DF
+----+------------+-----------+------+------+
| | A | B | C | D |
+----+------------+-----------+------+------+
| 1 | Home Team | Away Team | Htgs | Atgs |
| 2 | dalboset | sopot | 1 | 2 |
| 3 | calnic | resita | 1 | 3 |
| 4 | sopot | dalboset | 2 | 2 |
| 5 | resita | sopot | 4 | 1 |
| 6 | sopot | dalboset | 2 | 1 |
| 7 | caransebes | dalboset | 1 | 2 |
| 8 | calnic | resita | 1 | 3 |
| 9 | dalboset | sopot | 2 | 2 |
| 10 | calnic | resita | 4 | 1 |
| 11 | sopot | dalboset | 2 | 1 |
| 12 | resita | sopot | 1 | 2 |
| 13 | sopot | dalboset | 1 | 3 |
| 14 | caransebes | dalboset | 2 | 2 |
| 15 | calnic | resita | 4 | 1 |
| 16 | dalboset | sopot | 2 | 1 |
| 17 | calnic | resita | 1 | 2 |
| 18 | sopot | dalboset | 4 | 1 |
| 19 | resita | sopot | 2 | 1 |
| 20 | sopot | dalboset | 1 | 2 |
| 21 | caransebes | dalboset | 1 | 3 |
| 22 | calnic | resita | 2 | 2 |
+----+------------+-----------+------+------+
CODE
df1 = df.groupby('Home Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df1 = df1.rename(columns={'Htgs': 'Htgs/3', 'Atgs': 'Htgc/3'})
df1
df2 = df.groupby('Away Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df2 = df2.rename(columns={'Htgs': 'Atgc/3', 'Atgs': 'Atgs/3'})
df2
Now I need a solution to see the columns with the rolling averages next to the Home Team, Away Team, Htgs, Atgs columns from the original table.
Done!
I created the new column directly in the data frame like this:
df = pd.read_csv('Fd.csv')
df['Htgs/3'] = df.groupby('Home Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
Htgs/3 will be the new column with the rolling average of Htgs grouped by Home Team, and for the rest I will do the same as in this part.
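That pattern can be sketched end-to-end on a small made-up slice of the table (team names from the question, scores invented):

```python
import pandas as pd

# Invented slice of the results table above.
df = pd.DataFrame({
    "Home Team": ["dalboset", "sopot", "dalboset", "sopot", "dalboset"],
    "Away Team": ["sopot", "dalboset", "sopot", "dalboset", "sopot"],
    "Htgs": [1, 2, 2, 1, 2],
    "Atgs": [2, 2, 1, 3, 2],
})

# One rolling column per group; reset_index(level=0, drop=True) drops the
# group key so the result realigns with the original row index.
df["Htgs/3"] = (df.groupby("Home Team")["Htgs"]
                  .rolling(window=4, min_periods=3).mean()
                  .reset_index(level=0, drop=True))
df["Atgs/3"] = (df.groupby("Away Team")["Atgs"]
                  .rolling(window=4, min_periods=3).mean()
                  .reset_index(level=0, drop=True))
```

Because the result is assigned back onto the original index, the rolling columns sit right next to Home Team, Away Team, Htgs, and Atgs, with NaN wherever a team has fewer than min_periods matches so far.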

Multi-Index Lookup Mapping

I'm trying to create a new column which has a value based on 2 indices of that row. I have 2 dataframes with equivalent multi-index on the levels I'm querying (but not of equal size). For each row in the 1st dataframe, I want the value of the 2nd df that matches the row's indices.
I originally thought perhaps I could use a .loc[] and filter off the index values, but I cannot seem to get this to change the output row-by-row. If I wasn't using a dataframe object, I'd loop over the whole thing to do it.
I have tried to use the .apply() method, but I can't figure out what function to pass to it.
Creating some toy data with the same structure:
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame({'Aircraft':np.ones(15),
'DC':np.append(np.repeat(['A','B'], 7), 'C'),
'Test':np.array([10,10,10,10,10,10,20,10,10,10,10,10,10,20,10]),
'Record':np.array([1,2,3,4,5,6,1,1,2,3,4,5,6,1,1]),
# There are multiple "value" columns in my data, but I have simplified here
'Value':np.random.random(15)
}
)
df.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
df.sort_index(inplace=True)
v = pd.DataFrame({'Aircraft':np.ones(7),
'DC':np.repeat('v',7),
'Test':np.array([10,10,10,10,10,10,20]),
'Record':np.array([1,2,3,4,5,6,1]),
'Value':np.random.random(7)
}
)
v.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
v.sort_index(inplace=True)
df['v'] = df.apply(lambda x: v.loc[df.iloc[x]])
This returns an error about indexing on a multi-index.
To set all values to a single "v" value:
df['v'] = float(v.loc[(slice(None), 'v', 10, 1), 'Value'])
So inputs look like this:
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | A | 10 | 1 | 0.847576 |
| | | | 2 | 0.860720 |
| | | | 3 | 0.017704 |
| | | | 4 | 0.082040 |
| | | | 5 | 0.583630 |
| | | | 6 | 0.506363 |
| | | 20 | 1 | 0.844716 |
| | B | 10 | 1 | 0.698131 |
| | | | 2 | 0.112444 |
| | | | 3 | 0.718316 |
| | | | 4 | 0.797613 |
| | | | 5 | 0.129207 |
| | | | 6 | 0.861329 |
| | | 20 | 1 | 0.535628 |
| | C | 10 | 1 | 0.121704 |
--------------------------------------------
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | v | 10 | 1 | 0.961791 |
| | | | 2 | 0.046681 |
| | | | 3 | 0.913453 |
| | | | 4 | 0.495924 |
| | | | 5 | 0.149950 |
| | | | 6 | 0.708635 |
| | | 20 | 1 | 0.874841 |
--------------------------------------------
And after the operation, I want this:
| Aircraft | DC | Test | Record | Value | v |
|----------|----|------|--------|----------|----------|
| 1.0 | A | 10 | 1 | 0.847576 | 0.961791 |
| | | | 2 | 0.860720 | 0.046681 |
| | | | 3 | 0.017704 | 0.913453 |
| | | | 4 | 0.082040 | 0.495924 |
| | | | 5 | 0.583630 | 0.149950 |
| | | | 6 | 0.506363 | 0.708635 |
| | | 20 | 1 | 0.844716 | 0.874841 |
| | B | 10 | 1 | 0.698131 | 0.961791 |
| | | | 2 | 0.112444 | 0.046681 |
| | | | 3 | 0.718316 | 0.913453 |
| | | | 4 | 0.797613 | 0.495924 |
| | | | 5 | 0.129207 | 0.149950 |
| | | | 6 | 0.861329 | 0.708635 |
| | | 20 | 1 | 0.535628 | 0.874841 |
| | C | 10 | 1 | 0.121704 | 0.961791 |
Edit:
As you are on pandas 0.23.4, just change droplevel to reset_index with the option drop=True:
df_result = (df.reset_index('DC').assign(v=v.reset_index('DC', drop=True))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Original:
One way is to move the index level DC of df into the columns, use assign to create the new column there, then set_index it back and reorder_levels:
df_result = (df.reset_index('DC').assign(v=v.droplevel('DC'))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Out[1588]:
Value v
Aircraft DC Test Record
1.0 A 10 1 0.847576 0.961791
2 0.860720 0.046681
3 0.017704 0.913453
4 0.082040 0.495924
5 0.583630 0.149950
6 0.506363 0.708635
20 1 0.844716 0.874841
B 10 1 0.698131 0.961791
2 0.112444 0.046681
3 0.718316 0.913453
4 0.797613 0.495924
5 0.129207 0.149950
6 0.861329 0.708635
20 1 0.535628 0.874841
C 10 1 0.121704 0.961791
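For reference, a runnable miniature of the reset_index-based variant (toy frames with the same index levels; the values are made up):

```python
import pandas as pd

# Miniature versions of df and v with the same four index levels.
df = pd.DataFrame({"Aircraft": [1.0] * 4, "DC": ["A", "A", "B", "B"],
                   "Test": [10, 10, 10, 10], "Record": [1, 2, 1, 2],
                   "Value": [0.1, 0.2, 0.3, 0.4]})
df = df.set_index(["Aircraft", "DC", "Test", "Record"]).sort_index()

v = pd.DataFrame({"Aircraft": [1.0, 1.0], "DC": ["v", "v"],
                  "Test": [10, 10], "Record": [1, 2],
                  "Value": [0.9, 0.8]})
v = v.set_index(["Aircraft", "DC", "Test", "Record"]).sort_index()

# Drop the DC level on both sides so the remaining levels align row-by-row,
# then restore DC and the original level order.
result = (df.reset_index("DC")
            .assign(v=v.reset_index("DC", drop=True)["Value"])
            .set_index("DC", append=True)
            .reorder_levels(v.index.names)
            .sort_index())
```

The alignment step works because, once DC is removed, v's index is unique on (Aircraft, Test, Record), so its values broadcast to every DC group in df that shares those keys.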

Compare elements of two pandas data frame columns and create a new column based on a third column

I have two dataframes:
df1:
| ID | PersonID | Sex |
|:--:|:--------:|:---:|
| 1 | 123 | M |
| 2 | 124 | F |
| 3 | 125 | F |
| 4 | 126 | F |
| 5 | 127 | M |
| 6 | 128 | M |
| 7 | 129 | F |
df2:
| ID | PersonID | Infected |
|:--:|:--------:|:--------:|
| 1 | 125 | True |
| 2 | 124 | False |
| 3 | 126 | False |
| 4 | 128 | True |
I'd like to compare the PersonIDs in both these dataframes and insert the corresponding Infected value into df1, with False where the PersonID is not matched. The output would ideally look like this:
df1:
| ID | PersonID | Sex | Infected |
|:--:|:--------:|:---:|:--------:|
| 1 | 123 | M | False |
| 2 | 124 | F | False |
| 3 | 125 | F | True |
| 4 | 126 | F | False |
| 5 | 127 | M | False |
| 6 | 128 | M | True |
| 7 | 129 | F | False |
I have a for loop coded and it takes too long and is not very readable. Is there an efficient way to do this? Thanks!
One approach is to provide df1['PersonID'].map() with a Series whose index is PersonID and values are Infected:
df1['Infected'] = df1['PersonID'].map(df2.set_index('PersonID')['Infected']).fillna(False)
Another approach is to use pd.merge
df1 = pd.merge(df1, df2[['PersonID', 'Infected']], on=['PersonID'], how='left').fillna(False)
Or
df1 = df1.merge(df2[['PersonID', 'Infected']], on=['PersonID'], how='left').fillna(False)
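A runnable sketch of the map approach on the exact data above:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6, 7],
                    "PersonID": [123, 124, 125, 126, 127, 128, 129],
                    "Sex": ["M", "F", "F", "F", "M", "M", "F"]})
df2 = pd.DataFrame({"ID": [1, 2, 3, 4],
                    "PersonID": [125, 124, 126, 128],
                    "Infected": [True, False, False, True]})

# Map each PersonID to its Infected flag; unmatched IDs become NaN, which
# fillna turns into False, and astype(bool) gives a clean boolean column.
df1["Infected"] = (df1["PersonID"]
                   .map(df2.set_index("PersonID")["Infected"])
                   .fillna(False)
                   .astype(bool))
```

Note that set_index('PersonID') requires the PersonIDs in df2 to be unique, which holds for the data shown.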
