So, just messing around with pandas for the first time. Curious, specifically about the variables in my code: does it make sense to keep incrementing "df#" names, or should I just keep reassigning "df"? Or is there a more elegant way that I'm missing?
def func(csvfile):
    df = pd.read_csv(csvfile)
    df.columns = df.columns.str.replace(" ", "_")
    df2 = df.assign(column3=df.column3.str.split(",")).explode("column3")
    df3 = df2.assign(column2=df.column2.str.split("; ")).explode("column2")
    df3["column2"] = df3["column2"].str.replace(r"\(\d+\)", "", regex=True)
    df4 = df3[df3["column2"].str.contains("value2") == False]
    print(df4)
Taking a complete shot in the dark since you're unable to provide anything to work with, but I'd bet that this does the same:
def func(csvfile):
    df = pd.read_csv(csvfile)
    df.columns = df.columns.str.replace(" ", "_")
    df.column2 = df.column2.str.split("; ")
    df.column3 = df.column3.str.split(",")
    df = df.explode(['column2', 'column3'])  # Or maybe explode them one at a time? I have no idea what you're doing.
    df.column2 = df.column2.str.replace(r"\(\d+\)", "", regex=True)
    df = df[~df.column2.str.contains("value2")]
    return df

df = func(csvfile)
print(df)
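For what it's worth, if the intermediate names are what bother you, the whole pipeline also chains into a single expression. A sketch, under the same guesses about your columns as above (exploding both list columns at once needs pandas >= 1.3 and rows of matching lengths; otherwise explode them one at a time):

def func(csvfile):
    # Each step feeds the next; no intermediate names needed.
    return (
        pd.read_csv(csvfile)
        .rename(columns=lambda c: c.replace(" ", "_"))
        .assign(
            column2=lambda d: d.column2.str.split("; "),
            column3=lambda d: d.column3.str.split(","),
        )
        .explode(["column2", "column3"])
        .assign(column2=lambda d: d.column2.str.replace(r"\(\d+\)", "", regex=True))
        .loc[lambda d: ~d.column2.str.contains("value2")]
    )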
Related
I want to add a new column called "RING" to each dataframe in a list of dataframes; it should contain the word "RING" plus the value of another column called "No".
Here is my solution so far:
df_all = [df1, df2, df3]
for df in df_all:
    df["RING "] = "RING" + str(df['No'])
df_all
Is there a way that doesn't require a for loop?
You are almost there:
df_all = [df1, df2, df3]
for df in df_all:
    df["RING"] = "RING" + df["No"]
    # If df["No"] is not of string type, cast it to string:
    # df["RING"] = "RING" + df["No"].astype("str")
df_all
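A quick self-contained check of why the astype cast matters, with hypothetical toy frames (your real df1, df2, df3 will differ):

import pandas as pd

# Toy frames with a numeric "No" column (made-up data).
df1 = pd.DataFrame({"No": [1, 2]})
df2 = pd.DataFrame({"No": [3, 4]})

for df in [df1, df2]:
    # astype(str) makes "No" string-typed, so "+" concatenates element-wise.
    df["RING"] = "RING" + df["No"].astype(str)

print(df1["RING"].tolist())  # ['RING1', 'RING2']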
You can concat all the dataframes in the list to get one df (and then work with it):
df_all = [df1,df2,df3]
df = pd.concat(df_all, axis=0, ignore_index=True)
df["RING "] = "RING" + df['No'].astype(str)
If you want to come back and get separate dataframes, you can do this:
df_all = [df1,df2,df3]
df1['df_id'] = 1
df2['df_id'] = 2
df3['df_id'] = 3
df = pd.concat(df_all, axis=0, ignore_index=True)
df["RING "] = "RING" + df['No'].astype(str)
#-->
df1 = df.loc[df['df_id'].eq(1)]
df2 = df.loc[df['df_id'].eq(2)]
df3 = df.loc[df['df_id'].eq(3)]
If you don't want to use concat, you can try a list comprehension, which is usually faster than a for loop:
df_all = [df1, df2, df3]

def process_df(df):
    df["RING "] = "RING" + df['No'].astype(str)
    return df

processed_df_all = [process_df(df) for df in df_all]
# df1 = processed_df_all[0]
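One caveat: process_df modifies each frame in place, so processed_df_all holds the same objects as df_all. If you need the originals untouched, copy inside the function first (a small sketch):

def process_df(df):
    df = df.copy()  # work on a copy so the caller's frame stays unchanged
    df["RING "] = "RING" + df['No'].astype(str)
    return df

processed_df_all = [process_df(df) for df in df_all]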
I am trying to create a pandas dataframe dynamically, based on the number of records read, where each record becomes a column.
My approach has been a loop along the lines of "for i = 1 to N", where N is the number of records read (in string format), creating one column per record. It isn't quite working for me: I have tried some alternatives, but without good results, and I only end up with the last record that was read.
Here is my proposal:
def funct_example(client):
    documents = [v_document]
    poller = client.begin_analyze_entities(documents)
    result = poller.result()
    docs = [doc for doc in result if not doc.is_error]
    i = 1
    df_final = pd.DataFrame()
    for idx, doc in enumerate(docs):
        for entity in doc.entities:
            for i in doc.entities:
                d = {'col' + i: [format(entity.text)]}
                df = pd.DataFrame(data=d)
                df_final = pd.concat([df_final, df], axis=1)
                display(df_final)
                i = i + 1

funct_example(client)
What alternative do you recommend?
SOLUTION:
i = 1
df_final = pd.DataFrame()
for idx, doc in enumerate(docs):
    for entity in doc.entities:
        name = 'col' + str(i)
        d = {name: [format(entity.text)]}
        df = pd.DataFrame(data=d)
        df_final = pd.concat([df_final, df], axis=1)
        i = i + 1

display(df_final)
Thank you!
This is because df is getting reassigned on each iteration. Here is one way to accomplish it.
Declare an empty DataFrame before the start of the for loop:
df_final = pd.DataFrame()
Then, right after you have created the df with df = pd.DataFrame(data=d), add:
df_final = pd.concat([df_final, df], axis=1)
This appends each df to your df_final.
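As a side note, pd.concat inside the loop re-copies df_final on every pass. If there are many entities, it is usually cheaper to collect the one-column frames in a list and concatenate once at the end. A sketch, assuming the same docs/entities objects from your client:

pieces = []
i = 1
for doc in docs:
    for entity in doc.entities:
        # One single-cell DataFrame per entity, named col1, col2, ...
        pieces.append(pd.DataFrame({'col' + str(i): [format(entity.text)]}))
        i = i + 1

df_final = pd.concat(pieces, axis=1)
display(df_final)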
I have the following df:
df = pd.DataFrame(columns=['Place', 'PLZ','shortName','Parzellen'])
new_row1 = {'Place':'Winterthur', 'PLZ':[8400, 8401, 8402, 8404, 8405, 8406, 8407, 8408, 8409, 8410, 8411], 'shortName':'WIN', 'Parzellen':[]}
new_row2 = {'Place':'Opfikon', 'PLZ':[8152], 'shortName':'OPF', 'Parzellen':[]}
new_row3 = {'Place':'Stadel', 'PLZ':[8174], 'shortName':'STA', 'Parzellen':[]}
new_row4 = {'Place':'Kloten', 'PLZ':[8302], 'shortName':'KLO', 'Parzellen':[]}
new_row5 = {'Place':'Niederhasli', 'PLZ':[8155,8156], 'shortName':'NIH', 'Parzellen':[]}
new_row6 = {'Place':'Bassersdorf', 'PLZ':[8303], 'shortName':'BAS', 'Parzellen':[]}
new_row7 = {'Place':'Oberglatt', 'PLZ':[8154], 'shortName':'OBE', 'Parzellen':[]}
new_row8 = {'Place':'Bülach', 'PLZ':[8180], 'shortName':'BUE', 'Parzellen':[]}
df = df.append(new_row1, ignore_index=True)
df = df.append(new_row2, ignore_index=True)
df = df.append(new_row3, ignore_index=True)
df = df.append(new_row4, ignore_index=True)
df = df.append(new_row5, ignore_index=True)
df = df.append(new_row6, ignore_index=True)
df = df.append(new_row7, ignore_index=True)
df = df.append(new_row8, ignore_index=True)
print (df)
Now, given a number like 8405, I want to find the Place (or the whole row) whose df['PLZ'] list contains this number.
I also tried classes, but it was hard to collect all the numbers across all objects, because I want to be able to get all PLZ in one list and also, given any number, check which Place it belongs to. Maybe there is an obvious better way and I just don't know it.
Try boolean masking with the map() method:
df[df['PLZ'].map(lambda x: 8405 in x)]
Or via boolean masking and the agg() method:
df[df['PLZ'].agg(lambda x: 8405 in x)]
# you can also use apply() in place of agg
Output of the above code:
Place PLZ shortName Parzellen
0 Winterthur [8400, 8401, 8402, 8404, 8405, 8406, 8407, 840... WIN []
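Another option, since you also want all PLZ in one list: explode the PLZ column once and build a plain lookup Series mapping PLZ to Place (a sketch based on the df built above):

# One row per (Place, PLZ) pair, indexed by PLZ.
lookup = df.explode('PLZ').set_index('PLZ')['Place']

lookup.loc[8405]       # -> 'Winterthur'
sorted(lookup.index)   # all PLZ numbers as one flat list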
I am trying to delete a column called Rank but nothing happens. The remaining code all executes without any issue but the column itself remains in the output file. I've highlighted the part of the code that is not working.
def read_csv():
    file = "\mona" + yday + ".csv"
    # df=[]
    df = pd.read_csv(save_path + file, skiprows=3, encoding="ISO-8859-1", error_bad_lines=False)
    return df

# replace . with / in column EPIC
def tickerchange():
    df = read_csv()
    df['EPIC'] = df['EPIC'].str.replace('.', '/')
    return df

def consolidate_AB_listings():
    df = tickerchange()
    Aline = df.loc[(df['EPIC'] == 'RDSA'), 'Mkt Cap (àm)']
    Bline = df.loc[(df['EPIC'] == 'RDSB'), 'Mkt Cap (àm)']
    df.loc[(df['EPIC'] == 'RDSA'), 'Mkt Cap (àm)'] = float(Aline) + float(Bline)
    df = df.loc[(df.Ind != 'I/E')]
    df = df.loc[(df.Ind != 'FL')]
    df = df.loc[(df.Ind != 'M')]
    df = df.loc[(df.EPIC != 'RDSB')]
    return df

def ranking_mktcap():
    df = consolidate_AB_listings()
    df['Rank'] = df['Mkt Cap (àm)'].rank(ascending=False)
    df = df.loc[(df.Rank != 1)]
    df['Rank1'] = df['Mkt Cap (Em)'].rank(ascending=False)
    ## This doesn't seem to work
    df = df.drop(df['Security'], 1)
    return df

def save_outputfile():
    # df = drop()
    df = ranking_mktcap()
    df.to_csv(r'S:\Index_Analytics\UK\Index Methodology\FTSE\Py_file_download\MonitoredList.csv', index=False)
    print("finished")

if __name__ == "__main__":
    main()
    read_csv()
    tickerchange()
    consolidate_AB_listings()
    ranking_mktcap()
    save_outputfile()
DataFrame.drop() takes the following: DataFrame.drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise').
When you call df = df.drop(df['Security'], 1), it uses df['Security'] as the labels to drop, and the 1 is passed through the axis parameter.
If you want to drop the column 'Security' then you'd want to do:
df = df.drop('Security', axis=1)
# this is the same as
df = df.drop(labels='Security', axis=1)
# you can also specify the column name directly, like this
df = df.drop(columns='Security')
Note: the columns= parameter can take a single label (str) like above, or a list of column names.
Try replacing
df = df.drop(df['Security'], 1)
with
df.drop(['Security'], axis=1, inplace=True)
I had the same issue and all I did was add inplace=True. Note that with inplace=True the drop happens in place and returns None, so don't reassign the result:
df.drop('Security', axis=1, inplace=True)
Here is the problem. I use a function to return randomized data:
import numpy as np
import pandas as pd

data1 = [3, 5, 7, 3, 2, 6, 1, 6, 7, 8]
data2 = [1, 5, 2, 1, 6, 4, 3, 2, 7, 8]
df = pd.DataFrame(data1, columns=['c1'])
df['c2'] = data2

def randomize_data(df):
    df['c1_ran'] = df['c1'].apply(lambda x: (x + np.random.uniform(0, 1)))
    df['c1'] = df['c1_ran']
    # df.drop(['c1_ran'], 1, inplace=True)
    return df

temp_df = randomize_data(df)
display(df)
display(temp_df)
However, the df (source data) and temp_df (randomized data) end up the same.
How can I make temp_df and df different from each other?
I find I can get rid of the problem by adding df.copy() at the beginning of the function
def randomize_data(df):
    df = df.copy()
But I'm not sure if this is the right way to deal with it?
Use DataFrame.assign():
def randomize_data(df):
    return df.assign(c1=df.c1 + np.random.uniform(0, 1, df.shape[0]))
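With assign the original frame is left untouched, so the two objects really do diverge. A quick check against the df built above:

temp_df = randomize_data(df)
# df still holds the original integers; temp_df has the randomized c1.
print(df['c1'].equals(temp_df['c1']))  # False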
I think you are right. Also, DataFrame.copy() has an optional argument deep; you can find the details at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html
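For reference: deep=True (the default) copies the underlying data, while deep=False only copies the container, so the copy still shares data with the original. A tiny illustration (note that under Copy-on-Write, the default in pandas 3.0, even shallow copies behave like deep ones on write):

s = pd.Series([1, 2])
shallow = s.copy(deep=False)  # shares the underlying data with s
deep = s.copy()               # deep=True is the default

shallow.iloc[0] = 99          # classically also visible in s
deep.iloc[1] = 99             # never visible in s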