Here is my Pandas DataFrame:
import pandas as pd
dfa = df = pd.read_csv("twitDB3__org.csv")
dfa.drop([7-100], axis=0, inplace=True)
Output
ValueError: labels [-93] not contained in axis
I am new to Canopy and want to delete a range of rows, but it seems to require each row individually. I would appreciate any help.
a) I think you want dfa.drop(range(7, 101), ...), as in the sketch below. (What you did was just subtract 100 from 7 and pass the result, -93, as the label to drop.)
b) Note that this will also change df, because as you've written it, df and dfa are just two names for the same mutable object. If you want to end up with two different dataframes, then either make an explicit copy, or don't use inplace, and save the result: df2 = df.drop(...
c) This is a pandas question, not a canopy question. Canopy provides 500+ Python packages, and while it's true that pandas is one of the more popular of these, there is a whole pandas community out there.
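Putting a) and b) together, a minimal sketch (range's end is exclusive, hence 101; this assumes labels 7 through 100 actually exist in the index):
import pandas as pd

df = pd.read_csv("twitDB3__org.csv")
# Save the result under a new name rather than using inplace,
# so the original df stays untouched.
df2 = df.drop(range(7, 101))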
I am reading data from EXCEL to a pandas DataFrame:
df = pd.read_excel(file, sheet_name='FactoidList', ignore_index=False, sort=False)
Applying sort=False preserves the original order of my columns. But when I apply a numpy condition list, which generates a numpy array, the order of the columns changes.
NumPy orders the columns alphabetically from A to Z, and I do not know how to prevent it. Is there an equivalent to sort=False?
I searched online but could not find a solution. The problem is that I want to convert the numpy array back to a dataframe in the original format, re-applying the original column names.
ADDITION: code for condition list used in script:
import numpy as np

# f is the combined DataFrame; qn, qi, qt, qf, qr are query lists defined elsewhere.
condlist = [f['pers_name'].str.contains('|'.join(qn)) ^ f['pers_name'].isin(qn),
            f['inst_name'].isin(qi),
            f['pers_title'].isin(qt),
            f['pers_function'].isin(qf),
            f['rel_pers'].str.contains('|'.join(qr)) ^ f['rel_pers'].isin(qr)]
choicelist = [f['pers_name'],
              f['inst_name'],
              f['pers_title'],
              f['pers_function'],
              f['rel_pers']]
output = np.select(condlist, choicelist)
print(output)  # this print already shows the column inversion

rows = np.where(output)
new_array = f.to_numpy()
result_array = new_array[rows]
Reviewing my script, I figured out that the problem isn't numpy but pandas.
Before applying my condition list, I append the DataFrame df (read with the explicit sort=False) to another DataFrame f with the exact same structure, but I made the wrong assumption that the new combined DataFrame would inherit sort=False.
Instead, I had to make it explicit:
f = f.append(df, ignore_index=False, sort=False)
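For what it's worth, DataFrame.append takes no axis argument (it always appends rows), and append itself is deprecated as of pandas 1.4 and removed in 2.0. Here is a sketch of the equivalent pd.concat call, assuming f and df share the same columns:
import pandas as pd

# Row-wise concatenation; sort=False keeps the existing column order
# instead of alphabetizing it.
f = pd.concat([f, df], ignore_index=False, sort=False)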
I have a data set that contains 70 columns and I want to extract all rows, the first 4 columns and the last 54 columns. I've tried the following:
df_3 = df_3.iloc[:, [0:3, 16:70]]
but it keeps saying it's the wrong syntax...
Then I tried using np.r_ (although I'm not sure if I understand what it really does, I prefer a solution with iloc)
df_3 = df_3.iloc[:, np.r_[0:3, 16:69]]
but this returns the first 4 columns twice, plus the columns in the middle (4:15), which are the ones I want to get rid of...
Then I tried this code:
df_3 = df_3.iloc[:, [0:3, -54:]]
but it returns the same output as above with np.r_,
and my latest try
df_3 = df_3.iloc[:, [+4:, -54:]]
returns a syntax error...
My python version is 3.7.4 and pandas version is 0.25.1
Any help with this is much appreciated. Thank you all in advance
You could achieve that using df.columns and df.loc.
Take the first 4 columns and the last 54 columns, turn them into lists, and add them together. Then access those columns on the DataFrame with .loc:
df_3 = df_3.loc[:, df_3.columns[0:4].to_list() + df_3.columns[-54:].to_list()]
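If you'd still prefer iloc, as in your attempts, np.r_ also works once the slice bounds match what you want: with 70 columns, the first 4 are 0:4 and the last 54 run from index 16 to 69. A sketch, assuming df_3 really has 70 columns:
import numpy as np

# Columns 0-3 (the first 4) plus columns 16-69 (the last 54).
df_3 = df_3.iloc[:, np.r_[0:4, 16:70]]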
I am appending different dataframes to make one set. Occasionally, some values have the same index, so it stores the value as a series. Is there a quick way within Pandas to just overwrite the value instead of storing all the values as a series?
Your question isn't very clear. If you want to resolve the duplicated-index problem, the pd.DataFrame.reset_index() method will probably be enough. But if you end up with duplicate rows when you concat the DataFrames, just use the pd.DataFrame.drop_duplicates() method. Otherwise, share a bit of your code or be clearer.
I'm not sure the code below is what you're searching for.
Say we have two DataFrames with one column each, the same index, and different values, and you want to overwrite the values in one DataFrame with those from the other. You can do it with a simple loop using the .loc indexer:
import pandas as pd

df_1 = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd']})
df_2 = pd.DataFrame({'col_1': ['q', 'w', 'e', 'r']})

rows = df_1.shape[0]
for idx in range(rows):
    # Assign through a single .loc call to avoid chained-assignment issues.
    df_1.loc[idx, 'col_1'] = df_2['col_1'].iloc[idx]
Then check df_1; you should get this:
df_1
col_1
0 q
1 w
2 e
3 r
Whether or not this is what you want, let me know so I can help you.
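If the real goal is just to keep a single value per index label after appending, here is a vectorized sketch that keeps the last occurrence of each label (assuming the later value is the one that should win):
import pandas as pd

df_1 = pd.DataFrame({'col_1': ['a', 'b']}, index=[0, 1])
df_2 = pd.DataFrame({'col_1': ['q', 'w']}, index=[1, 2])

combined = pd.concat([df_1, df_2])  # index label 1 is now duplicated
# Keep only the last row for each index label.
combined = combined[~combined.index.duplicated(keep='last')]
print(combined)  # label 1 now holds 'q', the value appended later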
Today I've been working with five DataFrames that are almost the same, but for different courses. They are named df2b2015, df4b2015, df6b2015, df8b2015, and df2m2015.
Every one of those DataFrames has a column named prom_lect2b_rbd for df2b2015, prom_lect4b_rbd for df4b2015, and so on.
I want to append those DataFrames, but because every column has a different name, they don't go together. I'm trying to turn every one of those columns into a prom_lect_rbd column, so I can then append them without problem.
Is there a way I can do that with a for loop and regex?
Or, failing that, is there a way to do it by other means?
Thanks!
PS: I know some things, like I can turn the columns into what I want using:
re.sub(r'\d(b|m)', '', a)
where a is the column name. But I can't find a way to combine that with loops and column renaming.
Edit:
DataFrame(s) look like this:
df2b2015:
rbd prom_lect2b_rbd
1 5
2 6
df4b2015:
rbd prom_lect4b_rbd
1 8
2 9
etc.
Managed to do it. Probably not the most Pythonic way, but it does what I wanted:
import re

dfs = [df2b2015, df4b2015, df6b2015, df8b2015, df2m2015]
cols_lect = ['prom_lect2b_rbd', 'prom_lect4b_rbd', 'prom_lect6b_rbd',
             'prom_lect8b_rbd', 'prom_lect2m_rbd']
for j, k in zip(dfs, cols_lect):
    j.rename(columns={k: re.sub(r'\d(b|m)', '', k)}, inplace=True)
Something like this, with .filter(regex=)? It does assume there is only one matching column per dataframe, but your example permits that.
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.rand(10, 3), columns=['prom_lect2b_rbd', 'foo', 'bar'])
df2 = pd.DataFrame(np.random.rand(10, 3), columns=['prom_lect4b_rbd', 'foo', 'bar'])

for df in [df1, df2]:
    colname = df.filter(regex='prom_lect').columns.tolist()
    # rename returns a copy by default, so use inplace=True to modify df itself.
    df.rename(columns={colname[0]: 'prom_lect_rbd'}, inplace=True)

print(df1)
print(df2)
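Either way, once every frame carries the common prom_lect_rbd column, the appending itself is a one-liner; a sketch assuming the renamed frames from above:
import pandas as pd

# Stack the renamed frames; ignore_index rebuilds a clean 0..n-1 index.
combined = pd.concat([df1, df2], ignore_index=True)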
Usually I create a new column with df['c'] = df['a'] * df['b'] to compute the product of column a and column b, where df is a pandas DataFrame and the dtypes are float64. Pandas officially recommends this method over the .mul() method. But when I run the following code, I hit a bug.
import pandas as pd

def func(sym):
    location = "D:\\data\\"
    df = pd.read_csv(location + sym + ".csv")
    df['c'] = df['A'] * df['B']  # buggy method (1)
    # df['c'] = df['A'].mul(df['B'], axis=0)  # replacement method (2)
    .....

for sym in symbollist:
    func(sym)
I use the code above to clean stock data; obviously df may be huge, but len(symbollist) is only 50. After the code has been run many times, method (1) sometimes leaves a random symbol's column c assigned to zero, while method (2) performs well from beginning to end.
I use Eclipse and the newest version of Anaconda; Python is 2.7, pandas is 0.17.1, and NumPy is 1.10.1.