I formulated this question about adding rows WITH index, but it is not yet clear to me how/why this happens when there are no indexes:
columnsList=['A','B','C','D']
df8=pd.DataFrame(columns=columnsList)
L=['value aa','value bb','value cc','value dd']
s = pd.Series(dict(zip(df8.columns, L)))
df8.append(s,ignore_index=True)
df8.append(s,ignore_index=True)
I expect a 2x4 DataFrame here.
Nevertheless, no values were added, and no error occurred.
print(df8.shape)
#>>> (0,4)
Why is the series not being added, and why is no error raised?
If I try to add a row with loc, an index label is added:
df8.loc[df8.index.max() + 1, :] = [4, 5, 6,7]
print(df8)
result:
A B C D
NaN 4 5 6 7
I guess neither loc nor iloc can be used to append rows without an index label (i.e. loc adds the index label NaN, and iloc cannot be used when the position is beyond the last row of the DataFrame).
DataFrame.append is not an in-place operation. From the docs,
DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)
Append rows of other to the end of this frame, returning a new object.
Columns not in this frame are added as new columns.
You need to assign the result back.
df8 = df8.append([s] * 2, ignore_index=True)
df8
A B C D
0 value aa value bb value cc value dd
1 value aa value bb value cc value dd
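Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the line above only runs on older versions. The following is a minimal sketch of the same idea with pd.concat, which also returns a new object that must be assigned back. The variable names row and df_loc are mine, not from the question. The last part shows one common way to add a row with loc without getting a NaN label: on an empty frame df8.index.max() is NaN, whereas len(df) gives the next integer position.

```python
import pandas as pd

columns_list = ['A', 'B', 'C', 'D']
df8 = pd.DataFrame(columns=columns_list)
values = ['value aa', 'value bb', 'value cc', 'value dd']
s = pd.Series(dict(zip(df8.columns, values)))

# Turn the Series into a one-row frame (hypothetical helper name: row).
row = s.to_frame().T

# concat is not in-place either: the result has to be assigned back.
df8 = pd.concat([df8, row, row], ignore_index=True)
print(df8.shape)

# Adding a row by label: len(df) is 0 for an empty frame, so the new row
# gets the label 0 instead of NaN.
df_loc = pd.DataFrame(columns=columns_list)
df_loc.loc[len(df_loc)] = [4, 5, 6, 7]
print(df_loc)
```

Whether concat or loc is preferable depends on how many rows are added; for many rows, collecting them in a list and building the frame once is usually cheaper than growing it row by row.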
The statement data.append(sub_data) does not work on its own.
But the statement data=data.append(sub_data) will work
Assigning it back solved the issue for me. Good tip not available elsewhere.
Related
I'm sure this question must have already been answered somewhere but I couldn't find an answer that suits my case.
I have 2 pandas DataFrames
a = pd.DataFrame({'A1':[1,2,3], 'A2':[2,4,6]}, index=['a','b','c'])
b = pd.DataFrame({'A1':[3,5,6], 'A2':[3,6,9]}, index=['a','c','d'])
I want to merge them in order to obtain something like
result = pd.DataFrame({
'A1' : [3,2,5,6],
'A2' : [3,4,6,9]
}, index=['a','b','c','d'])
Basically, I want a new df with the union of both indexes. Where indexes match, the value in each column should be updated with the one from the second df (in this case b). Where there is no match the value is taken from the starting df (in this case a).
I tried with merge(), join() and concat() but I could not manage to obtain this result.
If the comments are correct and there's indeed a typo in your result, you could use pd.concat to create one dataframe (with b first, since b's values take priority over a's), and then drop the rows with a duplicated index:
Using your sample data:
c = pd.concat([b,a])
c[~c.index.duplicated()].sort_index()
prints:
A1 A2
a 3 3
b 2 4
c 5 6
d 6 9
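For reference, here is a self-contained sketch of the concat approach using the sample frames from the question. The combine_first line is an alternative I am adding, not part of the answer above; it gives the same values here, but it treats NaN in b as missing and fills it from a, and it may return floats instead of integers.

```python
import pandas as pd

a = pd.DataFrame({'A1': [1, 2, 3], 'A2': [2, 4, 6]}, index=['a', 'b', 'c'])
b = pd.DataFrame({'A1': [3, 5, 6], 'A2': [3, 6, 9]}, index=['a', 'c', 'd'])

# b goes first so that its rows win when an index label appears twice.
c = pd.concat([b, a])

# duplicated() marks every repeat after the first occurrence of a label.
result = c[~c.index.duplicated()].sort_index()
print(result)

# Alternative (my addition): values of b, with gaps filled from a.
alt = b.combine_first(a)
print(alt)
```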
I'm normally OK on the joining and appending front, but this one has got me stumped.
I've got one dataframe with only one row in it. I have another with multiple rows. I want to append the value from one of the columns of my first dataframe to every row of my second.
df1:
id  Value
1   word
df2:
id  data
1   a
2   b
3   c
Output I'm seeking:
df2
id  data  Value
1   a     word
2   b     word
3   c     word
I figured that this was along the right lines, but it listed out NaN for all rows:
df2 = df2.append(df1[df1['Value'] == 1])
I guess I could just join on the id value and then copy the value to all rows, but I assumed there was a cleaner way to do this.
Thanks in advance for any help you can provide!
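Just take the first element of the Value column of df1 and assign it to a new Value column in df2 (the column name is case-sensitive, so it must match the capital V used in the question). The scalar is broadcast to every row:
df2['Value'] = df1['Value'].iloc[0]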
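A minimal runnable sketch of that assignment, assuming id is an ordinary column in both frames (the question does not say whether it is the index). It uses iloc[0] so that it works regardless of df1's index labels.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1], 'Value': ['word']})
df2 = pd.DataFrame({'id': [1, 2, 3], 'data': ['a', 'b', 'c']})

# Take the single value by position and let pandas broadcast it to all rows.
df2['Value'] = df1['Value'].iloc[0]
print(df2)
```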
I am new to Python. Could you help with the following?
I have a dataframe as follows.
a,d,f & g are column names. dataframe can be named as df1
a d f g
20 30 20 20
0 1 NaN NaN
I need to put second row of the df1 into a list without NaN's.
Ideally as follows.
x=[0,1]
Select the second row with df.iloc[1], remove the NaN values with .dropna(), and finally convert the resulting Series into a Python list with .tolist(). The astype(int) step is needed because a row that contained NaN holds floats.
Use:
x = df.iloc[1].dropna().astype(int).tolist()
# x = [0, 1]
Check itertuples().
You would have something like this:
for row in df1.itertuples():
    row[0]  # the index of the row; you can do whatever you want with it, and with the rest of the row, which is now a tuple
You can also use iloc and dropna() like this:
row_2 = df1.iloc[1].dropna().to_list()
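Putting the pieces together, here is a small sketch that rebuilds the frame from the question and extracts the second row. The frame construction is my reconstruction of the table shown in the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'a': [20, 0],
    'd': [30, 1],
    'f': [20, np.nan],
    'g': [20, np.nan],
})

# iloc[1] returns the second row as a Series; because the row mixes
# integers and NaN, its values are floats.
row = df1.iloc[1]

# Drop the NaN entries, convert back to integers, then to a plain list.
x = row.dropna().astype(int).tolist()
print(x)
```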
I have a python data frame as below:
A B C
2 [4,3,9] 1
6 [4,8] 2
3 [3,9,4] 3
My goal is to loop through the data frame and compare the lists in column B. If two rows contain the same elements in column B (in any order), then column C should be updated to the same number, as below:
A B C
2 [4,3,9] 1
6 [4,8] 2
3 [3,9,4] 1
I tried with the code below:
for i, j in df.iterrows():
    if len(df['B'][i] ==len(df['B'][j] & collections.Counter(df['B'][i]==collections.Counter(df['B'][j])
        df['C'][j]==df['C'][i]
    else:
        df['C'][j]==df['C'][j]
I got the error message unhashable type: 'list'.
Does anyone know what causes this error, and is there a better way to do this? Thank you for your help!
Lists are not hashable, so convert each list to a sorted tuple, group by those tuples, and take the first value of C in each group using GroupBy.transform with 'first':
df['C'] = df.groupby(df.B.apply(lambda x: tuple(sorted(x)))).C.transform('first')
print (df)
A B C
0 2 [4, 3, 9] 1
1 6 [4, 8] 2
2 3 [3, 9, 4] 1
Detail:
print (df.B.apply(lambda x: tuple(sorted(x))))
0 (3, 4, 9)
1 (4, 8)
2 (3, 4, 9)
Name: B, dtype: object
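The same approach as a self-contained sketch, with the sample frame rebuilt from the question. The intermediate name key is mine. Sorting before converting to a tuple makes the comparison order-insensitive while still respecting repeated elements, which a set would not.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [2, 6, 3],
    'B': [[4, 3, 9], [4, 8], [3, 9, 4]],
    'C': [1, 2, 3],
})

# Lists are unhashable, so build a hashable, order-insensitive key per row.
key = df['B'].apply(lambda x: tuple(sorted(x)))

# Within each group of equal keys, copy the first C value to every row.
df['C'] = df.groupby(key)['C'].transform('first')
print(df)
```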
Not quite sure about the efficiency of the code, but it gets the job done:
uniqueRows = {}
for index, row in df.iterrows():
    duplicateFound = False
    for c_value, uniqueRow in uniqueRows.items():
        if duplicateFound:
            continue
        if len(row['B']) == len(uniqueRow):
            if len(list(set(row['B']) - set(uniqueRow))) == 0:
                print(c_value)
                df.at[index, 'C'] = c_value
                duplicateFound = True
    if not duplicateFound:
        uniqueRows[row['C']] = row['B']
print(df)
print(uniqueRows)
This code first loops over your dataframe. It keeps a duplicateFound boolean for each row that is used later.
For each row it loops over the uniqueRows dict and first checks whether a duplicate has already been found. If so, it skips the remaining comparisons with continue, because they are no longer needed.
Next it compares the lengths of the two lists to skip some comparisons, and if they are equal it computes list(set(row['B']) - set(uniqueRow)). This returns a list of the elements of row['B'] that are missing from uniqueRow, which is empty when there are none.
If that list is empty, it sets the value of the C column at this position using the DataFrame at accessor (this has to be used when modifying a dataframe while iterating over it). It then sets duplicateFound to True to prevent further comparisons. If no duplicate was found, duplicateFound is still False, which triggers the addition to the uniqueRows dict at the end of the for loop before going to the next row.
In case you have any comments or improvements to my code feel free to discuss and hope this code helps you with your project!
Create a temporary column by applying sorted to each entry in the B column and converting the result to a tuple (tuples are hashable, and ''.join would fail on a list of integers); group by the temporary column to get your matches, then get rid of the temporary column.
df1['B_temp'] = df1.B.apply(lambda x: tuple(sorted(x)))
df1['C'] = df1.groupby('B_temp').C.transform('min')
df1 = df1.drop('B_temp', axis = 1)
df1
A B C
0 2 [4, 3, 9] 1
1 6 [4, 8] 2
2 3 [3, 9, 4] 1
I'm trying to re-insert back into a pandas dataframe a column that I extracted and of which I changed the order by sorting it.
Very simply, I have extracted a column from a pandas df:
col1 = df.col1
This column contains integers and I used the .sort() method to order it from smallest to largest. And did some operation on the data.
col1.sort()
#do stuff that changes the values of col1.
Now the indexes of col1 are the same as the indexes of the overall df, but in a different order.
I was wondering how I can insert the column back into the original dataframe (replacing the col1 that is there at the moment)
I have tried both of the following methods:
1)
df.col1 = col1
2)
df.insert(column_index_of_col1, "col1", col1)
but both methods give me the following error:
ValueError: cannot reindex from a duplicate axis
Any help will be greatly appreciated.
Thank you.
Consider this DataFrame:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]}, index=[0, 0, 1])
df
Out:
A B
0 1 6
0 2 5
1 3 4
Assign the second column to b and sort it and take the square, for example:
b = df['B']
b = b.sort_values()
b = b**2
Now b is:
b
Out:
1 16
0 25
0 36
Name: B, dtype: int64
Without knowing the exact operation you've done on the column, there is no way to know whether 25 corresponds to the first row in the original DataFrame or the second one. You can take the inverse of the operation (take the square root and match, for example) but that would be unnecessary I think. If you start with an index that has unique elements (df = df.reset_index()) it would be much easier. In that case,
df['B'] = b
should work just fine.
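A minimal sketch of that suggestion on the example frame. I am assuming the old index values are not needed, so the sketch passes drop=True (my addition) to discard them instead of keeping them as a column.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]}, index=[0, 0, 1])

# Replace the duplicated index with a unique 0..n-1 index.
# drop=True discards the old index rather than storing it as a column.
df = df.reset_index(drop=True)

# Sort and transform; each value keeps its (now unique) index label.
b = df['B'].sort_values()
b = b ** 2

# Assignment aligns on index labels, so every value lands in its original row.
df['B'] = b
print(df)
```

Because the labels are unique after the reset, the sorted order of b does not matter when it is assigned back.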