I am trying to create a dynamic pandas dataframe based on the number of records read, where each record would be a column.
My approach has been to loop "for i = 1 to N", where N is the number of records read (as strings), creating one column per record. This is not working for me; I have tried some alternatives but without good results. I only ever get the last record that was read.
I leave a proposal:
def funct_example(client):
    documents = [ v_document ]
    poller = client.begin_analyze_entities(documents)
    result = poller.result()
    docs = [doc for doc in result if not doc.is_error]
    i = 1
    df_final = pd.DataFrame()
    for idx, doc in enumerate(docs):
        for entity in doc.entities:
            for i in doc.entities:
                d = {'col' + i : [format(entity.text)]}
                df = pd.DataFrame(data=d)
                df_final = pd.concat([df_final, df], axis=1)
                display(df_final)
                i = i + 1
funct_example(client)
What alternative do you recommend?
SOLUTION:
for idx, doc in enumerate(docs):
    for entity in doc.entities:
        name = 'col' + str(i)
        d = {name : [format(entity.text)]}
        df = pd.DataFrame(data=d)
        df_final = pd.concat([df_final, df], axis=1)
        i = i + 1
display(df_final)
Thank you!
This is because df gets reassigned on every iteration, so only the last record survives.
Here is one way to accomplish it.
Declare an empty DataFrame before the start of the for loop:
df_final = pd.DataFrame()
Then, right after you have created the per-record frame with df = pd.DataFrame(data=d), add:
df_final = pd.concat([df_final, df], axis=1)
This appends each new column to df_final.
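As an alternative, here is a rough sketch (it assumes the same client, v_document, docs and entity.text objects as in the question) that collects all entity texts in a dict first and builds the DataFrame in a single call, avoiding pd.concat inside the loop:
def funct_example(client):
    documents = [ v_document ]
    poller = client.begin_analyze_entities(documents)
    result = poller.result()
    docs = [doc for doc in result if not doc.is_error]
    # collect one column per entity, then build the frame once
    columns = {}
    i = 1
    for doc in docs:
        for entity in doc.entities:
            columns['col' + str(i)] = [format(entity.text)]
            i = i + 1
    df_final = pd.DataFrame(columns)
    display(df_final)
    return df_final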
Related
I want to add a new column called "RING" to each dataframe in a list of dataframes; it should contain the word "RING" concatenated with another column called "No".
here is my solution so far
df_all = [df1,df2,df3]
for df in df_all:
    df["RING "] = "RING" + str(df['No'])
df_all
Is there a way that doesn't require a for loop?
You are almost there:
df_all = [df1,df2,df3]
for df in df_all:
    df["RING"] = "RING" + df["No"]
    # If df["No"] is not of type string, cast it to string:
    # df["RING"] = "RING" + df["No"].astype("str")
df_all
you can concat all dataframes in the list to get one df (then work with it):
df_all = [df1,df2,df3]
df = pd.concat(df_all, axis=0, ignore_index=True)
df["RING "] = "RING" + df['No'].astype(str)
If you want to come back and get separate dataframes, you can do this:
df_all = [df1,df2,df3]
df1['df_id'] = 1
df2['df_id'] = 2
df3['df_id'] = 3
df = pd.concat(df_all, axis=0, ignore_index=True)
df["RING "] = "RING" + df['No'].astype(str)
#-->
df1 = df.loc[df['df_id'].eq(1)]
df2 = df.loc[df['df_id'].eq(2)]
df3 = df.loc[df['df_id'].eq(3)]
If you don't want to use concat, you can try a list comprehension, which is usually faster than a for loop:
df_all = [df1,df2,df3]
def process_df(df):
    df["RING "] = "RING" + df['No'].astype(str)
    return df
processed_df_all = [process_df(df) for df in df_all]
#df1 = processed_df_all[0]
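Note that both the loop and the list comprehension above modify the original dataframes in place. If you want to leave df1, df2 and df3 untouched, a small sketch (assuming the same 'No' column as above) is to use assign inside the comprehension, which returns new frames:
df_all = [df1, df2, df3]
# assign returns a copy with the new column, so the originals stay unchanged
processed_df_all = [df.assign(RING="RING" + df["No"].astype(str)) for df in df_all]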
So, just messing around with Pandas for the first time. I'm curious, specifically about the variables in my code: does it make sense to keep numbering them as "df#", or should I just keep overwriting "df"? Or is there a more elegant way that I'm missing?
def func(csvfile):
    df = pd.read_csv(csvfile)
    df.columns = df.columns.str.replace(" ", "_")
    df2 = df.assign(column3=df.column3.str.split(",")).explode(
        "column3"
    )
    df3 = df2.assign(column2=df.column2.str.split("; ")).explode("column2")
    df3["column2"] = df3["column2"].str.replace(r"\(\d+\)", "", regex=True)
    df4 = df3[df3["column2"].str.contains("value2") == False]
    print(df4)
Taking a complete shot in the dark since you're unable to provide anything to work with, but I'd bet that this does the same:
def func(csvfile):
    df = pd.read_csv(csvfile)
    df.columns = df.columns.str.replace(" ", "_")
    df.column2 = df.column2.str.split("; ")
    df.column3 = df.column3.str.split(",")
    df = df.explode(['column2', 'column3'])  # Or maybe explode them one at a time? I have no idea what you're doing.
    df.column2 = df.column2.str.replace(r"\(\d+\)", "", regex=True)
    df = df[~df.column2.str.contains("value2")]
    return df
df = func(csvfile)
print(df)
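If you prefer to avoid intermediate names altogether, here is a sketch of the same steps as one method chain (column2, column3 and "value2" are just the placeholder names from the question), so there is only ever one df:
def func(csvfile):
    return (
        pd.read_csv(csvfile)
        .rename(columns=lambda c: c.replace(" ", "_"))
        .assign(
            column2=lambda d: d.column2.str.split("; "),
            column3=lambda d: d.column3.str.split(","),
        )
        .explode(["column2", "column3"])
        .assign(column2=lambda d: d.column2.str.replace(r"\(\d+\)", "", regex=True))
        .pipe(lambda d: d[~d.column2.str.contains("value2")])
    )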
I have a list of filepaths in the first column of a dataframe. My goal is to create a second column that represents file categories, with categories reflecting the words in the filepath.
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']}
df = pd.DataFrame(data)
df["Animal"] =(df['filepath'].str.contains("dog|cat",case=False,regex=True))
df["Fish"] =(df['filepath'].str.contains("barracuda",case=False))
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
This code works. The problem arises when I have 200 statements beginning with df['columnName'] =. Because I have so many, I get the error:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
To fix this I have tried:
dfAnimal = df.copy
dfAnimal['Animal'] = dfAnimal['filepath'].str.contains("dog|cat",case=False,regex=True)
dfFish = df.copy
dfFish["Fish"] =dfFish['filepath'].str.contains("barracuda",case=False)
df = pd.concat(dfAnimal,dfFish)
The above gives me errors such as method object is not iterable and method object is not subscriptable. I then tried df = df.loc[df['filepath'].isin(['cat','dog'])] but this only works when 'cat' or 'dog' is the only word in the column. How do I avoid the performance error?
Try creating all your new columns in a dict, and then convert that dict into a dataframe, and then use pd.concat to add the resulting dataframe (containing the new columns) to the original dataframe:
new_columns = {
    'Animal': df['filepath'].str.contains("dog|cat", case=False, regex=True),
    'Fish': df['filepath'].str.contains("barracuda", case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
Added to your original code, it would be something like this:
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']}
df = pd.DataFrame(data)
##### These are the new lines #####
new_columns = {
    'Animal': df['filepath'].str.contains("dog|cat", case=False, regex=True),
    'Fish': df['filepath'].str.contains("barracuda", case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
##### End of new lines #####
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
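Since the real code has around 200 such statements, it may also help to keep the column-name/pattern pairs in a dict and build all the new columns with a comprehension. The patterns mapping below is hypothetical, containing only the two categories from the example, and would be extended to all the real categories:
# hypothetical mapping of new column name -> regex pattern; extend to all ~200 categories
patterns = {
    'Animal': "dog|cat",
    'Fish': "barracuda",
}
new_columns = {
    name: df['filepath'].str.contains(pat, case=False, regex=True)
    for name, pat in patterns.items()
}
df = pd.concat([df, pd.DataFrame(new_columns)], axis=1)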
I have the following df:
df = pd.DataFrame(columns=['Place', 'PLZ','shortName','Parzellen'])
new_row1 = {'Place':'Winterthur', 'PLZ':[8400, 8401, 8402, 8404, 8405, 8406, 8407, 8408, 8409, 8410, 8411], 'shortName':'WIN', 'Parzellen':[]}
new_row2 = {'Place':'Opfikon', 'PLZ':[8152], 'shortName':'OPF', 'Parzellen':[]}
new_row3 = {'Place':'Stadel', 'PLZ':[8174], 'shortName':'STA', 'Parzellen':[]}
new_row4 = {'Place':'Kloten', 'PLZ':[8302], 'shortName':'KLO', 'Parzellen':[]}
new_row5 = {'Place':'Niederhasli', 'PLZ':[8155,8156], 'shortName':'NIH', 'Parzellen':[]}
new_row6 = {'Place':'Bassersdorf', 'PLZ':[8303], 'shortName':'BAS', 'Parzellen':[]}
new_row7 = {'Place':'Oberglatt', 'PLZ':[8154], 'shortName':'OBE', 'Parzellen':[]}
new_row8 = {'Place':'Bülach', 'PLZ':[8180], 'shortName':'BUE', 'Parzellen':[]}
df = df.append(new_row1, ignore_index=True)
df = df.append(new_row2, ignore_index=True)
df = df.append(new_row3, ignore_index=True)
df = df.append(new_row4, ignore_index=True)
df = df.append(new_row5, ignore_index=True)
df = df.append(new_row6, ignore_index=True)
df = df.append(new_row7, ignore_index=True)
df = df.append(new_row8, ignore_index=True)
print (df)
Now I have a number like 8405 and I want to find the Place, or the whole row, whose df['PLZ'] list contains this number.
I also tried using classes, but it was hard to collect all the numbers from all the objects, because I want to be able to get all PLZ values as one list and also check, for any given number, which Place it belongs to. Maybe there is an obvious better way and I just don't know it.
Try boolean masking with the map() method:
df[df['PLZ'].map(lambda x:8405 in x)]
OR
via boolean masking and agg() method:
df[df['PLZ'].agg(lambda x:8405 in x)]
#you can also use apply() in place of agg
output of above code:
Place PLZ shortName Parzellen
0 Winterthur [8400, 8401, 8402, 8404, 8405, 8406, 8407, 840... WIN []
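If you need to do many such lookups, one sketch (using the same df as above) is to explode the PLZ lists into a long lookup table once; after that, any number can be found with a plain comparison:
# one row per (Place, PLZ) pair; build this once and reuse it for lookups
plz_lookup = df.explode('PLZ')

# all PLZ values as a flat list
all_plz = plz_lookup['PLZ'].tolist()

# which row(s) contain 8405
print(plz_lookup[plz_lookup['PLZ'] == 8405])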
Here is the problem. I use a function to return randomized data:
data1 = [3,5,7,3,2,6,1,6,7,8]
data2 = [1,5,2,1,6,4,3,2,7,8]
df = pd.DataFrame(data1, columns = ['c1'])
df['c2'] = data2
def randomize_data(df):
    df['c1_ran'] = df['c1'].apply(lambda x: (x + np.random.uniform(0,1)))
    df['c1'] = df['c1_ran']
    # df.drop(['c1_ran'], 1, inplace=True)
    return df
temp_df = randomize_data(df)
display(df)
display(temp_df)
However, df (the source data) and temp_df (the randomized data) end up the same.
How can I make the temp_df and df different from each other?
I found I can get rid of the problem by adding df = df.copy() at the beginning of the function:
def randomize_data(df):
    df = df.copy()
But I'm not sure if this is the right way to deal with it?
Use DataFrame.assign():
def randomize_data(df):
    return df.assign(c1=df.c1 + np.random.uniform(0, 1, df.shape[0]))
I think you are right, and DataFrame.copy() has an optional argument 'deep'. You can find details at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html
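For completeness, here is a sketch of the copy-based version (using the same df defined above); the caller's frame stays unchanged because the function only ever modifies its own copy:
def randomize_data(df):
    df = df.copy()  # deep=True is the default, so the caller's data is not modified
    df['c1'] = df['c1'] + np.random.uniform(0, 1, len(df))
    return df

temp_df = randomize_data(df)
display(df)       # original values
display(temp_df)  # randomized values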