Modifying one dataframe appears to change another [duplicate] - python

This question already has answers here:
Why can pandas DataFrames change each other?
(3 answers)
How do I clone a list so that it doesn't change unexpectedly after assignment?
(24 answers)
Closed 1 year ago.
I am new to loops in Python and just came across a weird issue. I was doing some calculations on multiple dataframes, and to simplify the question, here is an illustration.
Suppose I have 3 dataframes filled with NaN:
import numpy as np
import pandas as pd

# generate NaN entries
data = np.empty((15, 10))
data[:] = np.nan
# create dataframe
dfnan = pd.DataFrame(data)
df1 = dfnan
df2 = dfnan
df3 = dfnan
After this step, all three dataframes give me NaN, as expected.
But then, if I add two for loops in one block like below:
for i in range(0, 15, 1):
    df1.iloc[i] = 0
for j in range(0, 15, 1):
    df2.iloc[j] = df1.iloc[j].transform(lambda x: x + 1)
Then all of df1, df2, and df3 end up filled with 1. But shouldn't it be that:
df1 is filled with 0, df2 with 1, and df3 with NaN (since I didn't make any change to it)?
Why is that, and how can I change it to get the wanted result?

Assignment never copies in Python. df1, df2, df3 and dfnan are all references to the same object (the one created by pd.DataFrame(data)). This means that changes to one are reflected in the others, as they all point to the same object.
This is a great read: https://nedbatchelder.com/text/names.html.
To create independent copies, use the copy method:
dfnan = pd.DataFrame(data)
df1 = dfnan.copy()
df2 = dfnan.copy()
df3 = dfnan.copy()
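As a quick check (a minimal sketch, not part of the original answer), the is operator shows whether two names point at the same object:
import numpy as np
import pandas as pd

data = np.empty((15, 10))
data[:] = np.nan
dfnan = pd.DataFrame(data)

df_alias = dfnan          # plain assignment: same object
df_copy = dfnan.copy()    # copy: independent object

print(df_alias is dfnan)  # True  - modifying df_alias also modifies dfnan
print(df_copy is dfnan)   # False - modifying df_copy leaves dfnan untouched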

Related

Reindex dataframe inside loop [duplicate]

This question already has answers here:
How to change variables fed into a for loop in list form
(4 answers)
Closed 5 months ago.
I'm trying to reindex the columns in a set of dataframes inside a loop. This only seems to work outside the loop. See the sample code below:
import pandas as pd
data1 = [[1,2,3],[4,5,6],[7,8,9]]
data2 = [[10,11,12],[13,14,15],[16,17,18]]
data3 = [[19,20,21],[22,23,24],[25,26,27]]
index = ['a','b','c']
columns = ['d','e','f']
df1 = pd.DataFrame(data=data1,index=index,columns=columns)
df2 = pd.DataFrame(data=data2,index=index,columns=columns)
df3 = pd.DataFrame(data=data3,index=index,columns=columns)
columns2 = ['f','e','d']
for i in [df1, df2, df3]:
    i = i.reindex(columns=columns2)
print(df1)
df2 = df2.reindex(columns=columns2)
print(df2)
df1 is not reindexed as desired; however, if I reindex df2 outside of the loop, it works. Why is that?
Thanks
Andrew
That happens for the same reason this happens:
a = 5
b = 6
for i in [a, b]:
    i = 4
>>> a
5
Why? See this accepted answer.
Concerning your problem, one way to go about it is to create a list of reindexed dataframes like so:
reindexed_dfs = [df.reindex(columns=columns2) for df in [df1, df2, df3]]
and then reassign df1, df2 and df3. But it's better to just keep using your newly created list anyway.
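If you do want the original names to point at the reindexed frames, you can unpack the list back into them (a minimal sketch continuing the snippet above):
df1, df2, df3 = reindexed_dfs  # rebind the original names to the reindexed frames
print(df1)                     # columns now appear in the order ['f', 'e', 'd']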

Python Pandas showing change in position between two dataframes

I am reading two dataframes, looking at one column, and then showing the difference in position between the two dataframes with a -1 or +1, etc.
I have tried the following code, but it only shows 0 in Position Change when there should be a difference between British Airways and Ryanair:
import numpy as np
import pandas as pd

first = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base.csv", encoding='unicode_escape')
df1 = pd.DataFrame(first, columns=['airlines', 'Position'])
second = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base2.csv", encoding='unicode_escape')
df2 = pd.DataFrame(second, columns=['airlines', 'Position'])
df1['Position Change'] = np.where(df1['airlines'] == df2['airlines'], 0, df1['Position'] - df2['Position'])
I have also tried to do it with the following code, but I just keep getting a ValueError: cannot reindex from a duplicate axis:
df1.set_index('airlines', drop=False) # Set index to cross reference by (icao)
df2.set_index('airlines', drop=False)
df2['Position Change'] = df1[['Position']].sub(df2['Position'], axis=0)
df2 = df2.reset_index(drop=True)
pd.set_option('display.precision', 0)
Base csv looks like this -
and Base2 csv looks like this -
As you can see, British Airways is in position 3 in the Base csv and position 4 in the Base2 csv, but when running the code it just shows 0 and does not do the math between the two dataframes.
I have been stuck on this for days now and would be so grateful for any help.
I would like to offer an easier way based on columns, values and an if-statement.
It is probably a bit slow if you have a big dataframe, but it gives you the information you expect.
first = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base.csv", encoding='unicode_escape')
df1 = pd.DataFrame(first, columns=['airlines', 'Position'])
second = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base2.csv", encoding='unicode_escape')
df2 = pd.DataFrame(second, columns=['airlines', 'Position'])
I agree that my earlier answer did not match your question. Now, if I understand correctly, you want to create a new column in the DataFrame that gives you -1 if the corresponding values in the two DataFrames differ and 1 if they match.
This should help:
key = "Name_Of_Column"
new = []
for i in range(0, len(df1)):
if df1[key][i] != df2[key][i]:
new.append(-1)
else:
new.append(1)
df3 = pd.DataFrame({"Diff":new}) # I create new DataFrame as Dictionary.
df1 = df1.append(df3, ignore_index = True)
print(df1)
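As a side note, the same comparison can be written without the explicit loop (a minimal sketch, assuming numpy is imported and df1, df2 and key are as above):
import numpy as np

df1["Diff"] = np.where(df1[key] != df2[key], -1, 1)  # -1 where the values differ, 1 where they match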
I am giving you an alternative; I am not sure whether it is appreciated or not, but it is just an idea.
After reading the two csv's and getting the columns you require, why don't you try joining the two dataframes on the column 'airlines'? It will merge the two dataframes with 'airlines' as the key.
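For example (a minimal sketch, assuming df1 and df2 are the frames read above, each with 'airlines' and 'Position' columns; the suffix names are just illustrative):
merged = pd.merge(df1, df2, on='airlines', suffixes=('_base', '_base2'))
# difference in position between the two files (swap the operands if you want the opposite sign)
merged['Position Change'] = merged['Position_base'] - merged['Position_base2']
print(merged[['airlines', 'Position Change']])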

Using condition of a dataframe in pandas.where of another dataframe [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes: df1 has data and df2 is kind of like a map for the data. (They are both the same size and are 2D).
I would like to use pandas.where (or any method that isn't too convoluted) to replace the values of df1 based on the condition of the same cell in df2.
For instance, if df2 is equal to 0, I want to set the same cell in df1 also to 0. How do I do this?
When I try the following I get an error:
df3 = df1.where(df2 == 0, other = 0)
import numpy as np
import pandas as pd

df = pd.DataFrame()
df_1 = pd.DataFrame()
df['a'] = [1, 2, 3, 4, 5]
df_1['b'] = [5, 6, 7, 8, 0]
This will give a sample df:
Now implement a loop, using range or len(df.index)
for i in range(0, 5):
    df.loc[i, 'a'] = np.where(df_1['b'][i] == 0, 0, df['a'][i])  # .loc avoids chained-assignment issues
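The loop can also be replaced by a single vectorized call (a minimal sketch, assuming both frames share the same index as in the snippet above):
df['a'] = np.where(df_1['b'] == 0, 0, df['a'])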
Generally you shouldn't need to handle multiple dataframes separately like this; if df1 and df2 have the same shape and either share an index or have some common column they can be joined/merged on (say it's named 'id'), then merge them:
df = pd.merge(df1, df2, on='id')
See Pandas Merging 101
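Once merged, the element-wise replacement from the question becomes a single expression (a minimal sketch with hypothetical column names 'a' and 'b'):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'a': [10, 20, 30]})
df2 = pd.DataFrame({'id': [1, 2, 3], 'b': [0, 5, 0]})

df = pd.merge(df1, df2, on='id')
df['a'] = np.where(df['b'] == 0, 0, df['a'])  # zero out 'a' wherever 'b' is 0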

How can I concat multiple dataframes in Python? [duplicate]

This question already has answers here:
Append multiple pandas data frames at once
(5 answers)
How do I create variable variables?
(17 answers)
Closed 4 years ago.
I have multiple (more than 100) dataframes. How can I concat all of them?
The problem is that I have too many dataframes to write them manually in a list, like this:
>>> cluster_1 = pd.DataFrame([['a', 1], ['b', 2]],
... columns=['letter', 'number'])
>>> cluster_1
letter number
0 a 1
1 b 2
>>> cluster_2 = pd.DataFrame([['c', 3], ['d', 4]],
... columns=['letter', 'number'])
>>> cluster_2
letter number
0 c 3
1 d 4
>>> pd.concat([cluster_1, cluster_2])
letter number
0 a 1
1 b 2
0 c 3
1 d 4
The names of my N dataframes are cluster_1, cluster_2, cluster_3,..., cluster_N. The number N can be very high.
How can I concat N dataframes?
I think you can just put them into a list and then concat the list. This is essentially what you already do when reading a large file in chunks with pandas (e.g. read_csv with chunksize); I personally use the same pattern there.
pdList = [df1, df2, ...] # List of your dataframes
new_df = pd.concat(pdList)
To create the pdList automatically, assuming your dataframes' names always start with "cluster_":
pdList = []
pdList.extend(value for name, value in locals().items() if name.startswith('cluster_'))
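That said, numbered variable names are usually a sign the frames should have been collected in a list or dict from the start. A minimal sketch, with hypothetical raw data standing in for however the clusters are generated:
import pandas as pd

# hypothetical raw data; in practice this comes from wherever the clusters are produced
list_of_row_blocks = [[['a', 1], ['b', 2]], [['c', 3], ['d', 4]]]

clusters = {f'cluster_{i}': pd.DataFrame(block, columns=['letter', 'number'])
            for i, block in enumerate(list_of_row_blocks, start=1)}

combined = pd.concat(clusters.values(), ignore_index=True)
print(combined)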
Generally it goes like:
frames = [df1, df2, df3]
result = pd.concat(frames)
Note: by default concat keeps the original indexes (as in the output above); pass ignore_index=True if you want them reset.
Read more details on different types of merging here.
For a large number of data frames:
If you have hundreds of data frames, depending on whether they are on disk or in memory, you can still create the list ("frames" in the code snippet) with a for loop. If they are on disk, it can easily be done by saving all the df's in a single folder and then reading all the files from that folder.
If you are generating the df's in memory, maybe try saving them as .pkl files first.
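For instance (a minimal sketch, assuming the frames were saved as CSV files in a hypothetical clusters/ folder):
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in sorted(glob.glob('clusters/*.csv'))]
result = pd.concat(frames, ignore_index=True)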
Use:
pd.concat(your_list_of_dataframes)
And if you want a regular 0..n index:
pd.concat(your_list_of_dataframes, ignore_index=True)

How to return same data frame in 'for loop' after passing some function on it, without appending etc.? [duplicate]

This question already has answers here:
How do I create variable variables?
(17 answers)
Closed 4 years ago.
I have three dataframes that I would like to crop, so I have defined a function:
def croping(data, start_date='2017-04-10 00:00:00', end_date='2018-05-31 21:55:00'):
    return data.loc[start_date:end_date]
I know this is a bit extra, but I am trying to learn how to use user-defined functions.
I then want to use this function on the list of dataframes:
df_list = [df1, df2, df3]
where
df1=
Timestamp A B C D E
2017-04-01 00:00:00 106.46451 98.94002 118.59085 100.83779 108.89098
2017-04-01 00:05:00 105.74346 98.93000 113.47805 86.77218 105.37943
2017-04-01 00:10:00 105.99000 99.15727 115.48461 96.76406 106.55555
2017-04-01 00:15:00 105.04311 98.93000 112.15814 88.38959 104.71931
... ... ... ... ...
etc.
I am then trying to run a for loop to crop each dataframe
for name in df_list:
    holding = croping(name)
If I do it this way, I then need to append the holding dataframes together. Is there a way that I can give the cropped dataframe a different name in each iteration? Something like this:
for name in df_list:
    name_cropped = croping(name)
where name changes in each iteration, so I am left with df1_cropped, df2_cropped, etc.
Maybe the best way to do this is not with a for loop; I am still very much learning.
You have a couple of options. You can index your dataframes by location in a list. In this case, you can use a list comprehension. Using pd.DataFrame.pipe would be the Pandorable method as it facilitates method chaining:
df_list = [df1, df2, df3]
df_croped = [df.pipe(croping) for df in df_list]
However, you may find it cleaner to use dictionaries instead:
df_dict = dict(enumerate((df1, df2, df3), 1))
df_croped = {k: v.pipe(croping) for k, v in df_dict.items()}
You can then access the first dataframe, original or cropped, via df_dict[1] or df_croped[1] respectively (the enumeration starts at 1).
One way is to use 2 lists in place of one, i.e.
df_list = [df1, df2, df3]
df_list_updated = []
for name in df_list:
    df_list_updated.append(croping(name))
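If you specifically want names like df1_cropped, a dictionary keyed by name gets you most of the way there without creating variables dynamically (a minimal sketch reusing the croping function from the question):
df_dict = {'df1': df1, 'df2': df2, 'df3': df3}
cropped = {f'{name}_cropped': croping(df) for name, df in df_dict.items()}
print(cropped['df1_cropped'])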
