I have two dataframes:
Source dataframe
index  A  x     y
1      1  100   100
2      1  100   400
3      1  100   700
4      1  300   200
5      2  50    200
6      2  100   200
7      2  800   400
8      2  1200  800
Destination dataframe
index  A  x     y
1      1  105   100
2      1  110   410
3      1  110   780
4      2  1000  90
For each row in the source dataframe I need to find the nearest values in the destination dataframe, grouped by the 'A' column. The resultant dataframe should be as below (just a sample, taking only one row from the source (index 1) and the corresponding nearest ones from the destination in that group (A == 1)):
A  x_1  y_1  x_2  y_2  nearness (approx.)
1  100  100  105  100  95
1  100  100  110  410  50
1  100  100  110  780  20
NOTE: The nearness column is a mere placeholder and will become a calculation based on x and y in the future. What I need is row-wise merging between the two dataframes.
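For reference, a plain merge on 'A' produces exactly this within-group, row-wise pairing; below is a minimal sketch using Euclidean distance as a stand-in for the eventual nearness function (the suffixes and the distance metric are my assumptions, not part of the original):

import numpy as np
import pandas as pd

source_df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                          'x': [100, 100, 100, 300, 50, 100, 800, 1200],
                          'y': [100, 400, 700, 200, 200, 200, 400, 800]})
dest_df = pd.DataFrame({'A': [1, 1, 1, 2],
                        'x': [105, 110, 110, 1000],
                        'y': [100, 410, 780, 90]})

# Merging on 'A' yields the cartesian product of rows within each group
merged = source_df.merge(dest_df, on='A', suffixes=('_1', '_2'))

# Placeholder nearness metric (Euclidean distance); swap in the real function later
merged['nearness'] = np.hypot(merged['x_1'] - merged['x_2'],
                              merged['y_1'] - merged['y_2'])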
This might be a tangent, but can someone explain how merge works?
pd.merge(source_df, dest_df, on='A')
Basically, it will go through every item of the left dataframe, look for its key in the right dataframe, and create an entry in the merged dataframe (it creates one entry for each time the key is found in the right dataframe; you can check this behaviour with the validate keyword).
See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html for more info.
source_df.merge(dest_df, on='A')
What it does: it first matches source_df's column 'A' with dest_df's column 'A' (when on is specified), much like an SQL join; otherwise it tries to join using the index, and if that fails, using common column names. You can also join on differently named columns using the left_on and right_on arguments.
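As a small illustration (hypothetical frames with made-up column names), here is a join on differently named key columns, with validate asserting the expected relationship:

import pandas as pd

left = pd.DataFrame({'key_l': [1, 1, 2], 'val': [10, 20, 30]})
right = pd.DataFrame({'key_r': [1, 2], 'info': ['a', 'b']})

# left_on/right_on join on differently named columns;
# validate='m:1' raises MergeError if a key in `right` is duplicated
merged = left.merge(right, left_on='key_l', right_on='key_r', validate='m:1')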
How to parse data on all rows, and use this row to populate other dataframes with data from multiple rows?
I am trying to parse a csv file containing several data entries for training purposes, as I am quite new to this technology.
My data consists of 10 columns and hundreds of rows.
The first column is filled with a code that is either 10, 50, or 90.
Example:
Dataframe 1:
 0   1
10   Power-220
90   End
10   Power-290
90   End
10   Power-445
90   End
10   Power-390
50   Clotho
50   Kronus
90   End
10   Power-550
50   Ares
50   Athena
50   Artemis
50   Demeter
90   End
And the list goes on..
On one hand, I want to be able to read the first cell and populate another dataframe directly when it holds a code 10.
On the other hand, I'd like to populate another dataframe with all the code 50 rows, together with a new column carrying the data from the preceding code 10 row, since that row holds the type of Power being used.
The new dataframes are supposed to look like this:
Dataframe 2:
 0   1
10   Power-220
10   Power-290
10   Power-445
10   Power-390
10   Power-550
Dataframe 3:
 0   1        2
50   Clotho   Power-390
50   Kronus   Power-390
50   Ares     Power-550
50   Athena   Power-550
50   Artemis  Power-550
50   Demeter  Power-550
So far I have been using iterrows, and I've read everywhere that it is a bad idea, but I'm struggling to implement another method.
In my code I just create two other dataframes, but I don't yet know a way to retrieve data from a previous cell. I would usually use a classic procedural approach, but I suspect it's rather archaic.
for index, row in df.iterrows():
    if df.iat[index, 0] == '10':
        df2 = df2.append(df.loc[index], ignore_index=True)
    if df.iat[index, 0] == '50':
        df3 = df3.append(df.loc[index], ignore_index=True)
Any ideas?
(Update)
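For context, the snippets below assume the raw data has already been read into df with its two columns named 'Code' and 'Power/Character' (my naming assumption); a sketch of that assumed input:

import pandas as pd

df = pd.DataFrame({
    'Code': [10, 90, 10, 90, 10, 90, 10, 50, 50, 90, 10, 50, 50, 50, 50, 90],
    'Power/Character': ['Power-220', 'End', 'Power-290', 'End', 'Power-445', 'End',
                        'Power-390', 'Clotho', 'Kronus', 'End',
                        'Power-550', 'Ares', 'Athena', 'Artemis', 'Demeter', 'End'],
})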
For df2, it's pretty simple:
df2 = df.rename(columns={'Power/Character': 'Power'}) \
.loc[df['Code'] == 10, :]
For df3, it's a bit more complex:
# Extract power and fill forward values
power = df.loc[df['Code'] == 10, 'Power/Character'].reindex(df.index).ffill()
df3 = df.rename(columns={'Power/Character': 'Character'}) \
.assign(Power=power).loc[lambda x: x['Code'] == 50]
Output:
>>> df2
Code Power
0 10 Power-220
2 10 Power-290
4 10 Power-445
6 10 Power-390
10 10 Power-550
>>> df3
Code Character Power
7 50 Clotho Power-390
8 50 Kronus Power-390
11 50 Ares Power-550
12 50 Athena Power-550
13 50 Artemis Power-550
14 50 Demeter Power-550
You could simply copy the required rows to another dataframe:
df2 = df[df.col_1 == '10'].copy()
This makes a new dataframe, df2, containing only the rows where column col_1 meets the criterion. The copy() call guarantees that the two dataframes are independent, so changes in one do not affect the other.
If df2 already exists, you can concatenate them:
df2 = pd.concat([df2, df[df.col_1 == '10'].copy()])
I have a dataframe consisting of the following, and want to add a new column based on:
high - open < x number
and High.rowNum >= Open.rowNum
Basically I just want to get the first row number that matches the criteria above and store it in a different column.
S/N  High  Low  Open  Close  Date   [New Column] e.g. High - Open >= 85 [Value of S/N]
1    100   20   22    90     1 Jan  1
2    200   40   72    50     2 Jan  3
3    390   20   55    90     2 Jan
As I understand your question and comment, you need the 'S/N' value in the new column for rows that satisfy the criterion, so you can simply use apply on the dataframe and store the result as a new column:
import numpy as np

df['New'] = df.apply(lambda x: x['S/N'] if x['High'] - x['Open'] >= 85 else np.nan, axis=1)
Here the new column gets the 'S/N' value where the condition holds; otherwise it is filled with NaN.
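For larger frames, the same result can be obtained without a row-wise apply; a vectorized sketch using numpy.where, assuming the same column names:

import numpy as np

df['New'] = np.where(df['High'] - df['Open'] >= 85, df['S/N'], np.nan)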
I have a DataFrame that looks like this:
Id  Price
1   300
1   300
1   300
2   400
2   400
3   100
My goal is to divide the price for each observation by the number of rows with the same Id number. The expected output would be:
Id  Price
1   100
1   100
1   100
2   200
2   200
3   100
However, I am having some issues finding an optimized way to perform this operation. I did manage it with the code below, but it takes more than 5 minutes to run (I have roughly 200k observations):
# For each row in the dataset, get the number of rows with the same Id and store it in a list
sum_of_each_id = []
for i in df['Id'].to_numpy():
    sum_of_each_id.append(len(df[df['Id'] == i]))

# Create an auxiliary column holding the number of rows associated with each Id
df['auxiliar'] = sum_of_each_id

# Divide the price by the number of rows with the same Id
df['Price'] = df['Price'] / df['auxiliar']
Could you please let me know what would be the best way to do this?
Try groupby with transform:
Make groups on the basis of 'Id' using groupby('Id').
Get the count of values in each group for every row using transform('count').
Divide df['Price'] by that series of counts.
df = pd.DataFrame({"Id":[1,1,1,2,2,3],"Price":[300,300,300,400,400,100]})
df["new_Price"] = (df["Price"]/df.groupby("Id")["Price"].transform("count")).astype('int')
print(df)
Id Price new_Price
0 1 300 100
1 1 300 100
2 1 300 100
3 2 400 200
4 2 400 200
5 3 100 100
import pandas as pd
df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3], "price": [300, 300, 300, 400, 400, 100]})
df.set_index("id") / df.groupby("id").count()
Explanation:
df.groupby("id").count() calculates the number of rows sharing each id; the resulting DataFrame has id as its index.
df.set_index("id") sets the id column as the index.
Then we simply divide the two frames, and pandas matches the values by index.
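The division returns a frame indexed by id; if you want the flat shape back, a reset_index restores it (a small sketch continuing the example above):

result = (df.set_index("id") / df.groupby("id").count()).reset_index()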
I use the pandas read_excel function to work with data. I have two excel files with 70k rows and 3 columns (the first column is a date), and it takes only 4-5 seconds to combine them, align the data, delete any rows with incomplete data, and return a new dataframe (df) with 50k rows and 4 columns, where date is the index.
Then I use the code below to perform some calculations and add another 2 columns to my df:
for i, row in df.iterrows():
    df["new_column1"] = df["column1"] - 2 * df["column4"]
    df["new_column2"] = df["column1"] - 2.5 * df["column4"]
It takes approximately 30 seconds for the above code to execute, even though the calculations are simple. Is this normal, or is there a way to speed up the execution? (I am on Windows 10, 16 GB RAM and an i7-8565U processor.)
I am not particularly interested in adding the columns to my dataframe; getting the two new columns in a list would suffice.
Thanks.
Note that the code inside your loop uses neither row nor i.
So drop the for ... row line and execute just:
df["new_column1"] = df["column1"] - 2 * df["column4"]
df["new_column2"]= df["column1"] - 2.5 * df["column4"]
It is enough to execute the above code once, not in a loop.
Your code unnecessarily performs these operations multiple times
(actually as many times as your DataFrame has rows), which is
why it takes so long.
Edit following question as of 18:59Z
To perform vectorized operations like "check one column and do something
to another column", use the following pattern, based on boolean indexing.
Assume that the source df contains:
column1 column4
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
Then if you want to:
select rows with even value in column1,
and add some value (e.g. 200) to column4,
run:
df.loc[df.column1 % 2 == 0, 'column4'] += 200
In this example:
df.column1 % 2 == 0 - provides boolean indexing over rows,
column4 - selects particular column,
+= 200 - performs the actual operation.
The result is:
column1 column4
0 1 11
1 2 212
2 3 13
3 4 214
4 5 15
5 6 216
6 7 17
7 8 218
But there are more complex cases, where the condition involves calling
some custom code or you want to update several columns.
In such cases you can resort to iterrows or apply, but these
operations execute much more slowly.
I'm trying to grab values from an existing df using iloc coordinates stored in another df, then store each value in the second df.
df_source (source):
Category1 Category2 Category3
Bucket1 100 200 300
Bucket2 400 500 600
Bucket3 700 800 900
df_coord (coordinates):
Index_X Index_Y
0 0
1 1
2 2
Want:
df_coord
Index_X Index_Y Added
0 0 100
1 1 500
2 2 900
I'm more familiar with analytical language like SAS, where data is processed one line at a time, so the natural approach for me was this:
df_coord['Added'] = df_source.iloc[df_coord[Index_X][df_coord[Index_Y]]
When I tried this I got an error, which I understand to mean that df_coord[Index_X] does not refer to the data on the same row. I have seen a few posts where the axis=1 option worked for their respective cases, but I can't figure out how to apply it to this one. Thank you.
You could index the underlying ndarray, obtained through the values attribute, using the columns in df_coord as the first and second axes:
df_coord['Added'] = df_source.values[df_coord.Index_X, df_coord.Index_Y]
Index_X Index_Y Added
0 0 0 100
1 1 1 500
2 2 2 900
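On recent pandas versions, to_numpy() is the documented way to obtain the underlying array, so an equivalent form (a minor variation, not from the original answer) would be:

df_coord['Added'] = df_source.to_numpy()[df_coord['Index_X'], df_coord['Index_Y']]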