For example, given this DataFrame:
import pandas as pd
df = pd.DataFrame.from_dict({
'art1':['n1','n2'],
'sizes':['35 36 37', '36 38']
})
print (df)
# desired result:
df_result = pd.DataFrame.from_dict({
'art1':['n1','n1','n1','n2','n2'],
'sizes':[35,36,37,36,38]
})
print (df_result)
Below is a correct but not efficient solution:
lst_art = []
lst_sizes = [x.split() for x in df['sizes']]
for i in range(len(lst_sizes)):
    for j in range(len(lst_sizes[i])):
        lst_art.append(df['art1'][i])
lst_sizes = sum(lst_sizes, [])
df = pd.DataFrame({'art1':lst_art, 'sizes':lst_sizes})
print (df)
Is there an efficient pandas way to get df_result from df?
You can first split the string column into a list and then explode each item of the list into a new row:
df = pd.DataFrame.from_dict({
'art1':['n1','n2'],
'sizes':['35 36 37', '36 38']
})
# convert str to list
df['sizes'] = df['sizes'].str.split()
# create one new row per item in list of `sizes`
df_result = df.explode('sizes')
Or you can do it as a one-liner:
df.assign(sizes=df['sizes'].str.split()).explode('sizes')
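Note that after explode the sizes values are still strings and the exploded rows keep their original index labels. A minimal sketch of the extra cleanup, assuming you want the integer dtype and fresh index shown in df_result:
df_result = (df.assign(sizes=df['sizes'].str.split())
               .explode('sizes')
               .astype({'sizes': int})      # '35' -> 35
               .reset_index(drop=True))     # re-number the rows 0..n-1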
I have two datasets, one with over 100,000 rows and 300 columns and the other with 200 rows and 6 columns.
I'm comparing these two datasets and updating df1 from df2 using a for loop.
Here are the sample datasets:
df1:
KEY MAIN_METHOD DRUG_ETCDTL
0 100944 1 unknown
1 67488 20 unknown
2 101476 20 unknown
3 102549 1 sleepingpill_plunitrazeparm
4 103227 1 some drug
df2:
5. 방법/수단 Unnamed: 4
0 100944 sleepingpill_unknown
1 100984 others_green material
2 101476 others_anorexia
3 102549 sleepingpill_plunitrazeparm
4 103227 sleepingpill_pentobarbytal
and here is the code that I tried:
for i in range(0,4):
    index_key = df2['5. 방법/수단'][i]
    index_rawdata = df1.loc[df1['KEY']==index_key,'DRUG_ETCDTL'].index[0]
    method1 = df1['DRUG_ETCDTL'][index_rawdata]
    method2 = df1['METHOD_ETCDTL'][index_rawdata]
    # split df2
    mainmethod = df2['Unnamed: 4'].str.split('_', expand=False)
    mainmethod[i][0] = mainmethod[i][0].replace('sleepingpill','1').replace('others','20')
    # change the type so we can compare it with df1
    mainmethod[i][0] = int(mainmethod[i][0])
    if (mainmethod[i][1] == 1) & (df1['MAIN_METHOD'][index_rawdata] == 1):
        method1 = mainmethod[i][1]
    elif (mainmethod[i][1] == 20) & (df1['MAIN_METHOD'][index_rawdata] == 20):
        method2 = mainmethod[i][1]
df1 should be changed, but when I print df1 it is not changed.
The desired output is:
KEY MAIN_METHOD DRUG_ETCDTL
0 100944 1 unknown
1 67488 20 unknown
2 101476 20 anorexia
3 102549 1 plunitrazeparm
4 103227 1 pentobarbytal
NOTE: I took this for-loop approach since I didn't want to manipulate df2.
To address the issue of the two data frames having different numbers of rows, this solution manipulates the indexes of the two data frames before performing an update of df1 using the pandas.DataFrame.update() method. The update method aligns the data frames on their index values and updates the values in columns with matching names.
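As a quick illustration of that alignment behaviour (a minimal sketch, separate from the solution below): update() modifies the caller in place and only touches cells whose index and column labels match.
import pandas as pd
a = pd.DataFrame({'val': ['x', 'y', 'z']}, index=[1, 2, 3])
b = pd.DataFrame({'val': ['Y']}, index=[2])
a.update(b)   # only the row labelled 2 changes, in the matching 'val' column
print(a)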
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'KEY': [100944, 67488, 101476, 102549, 103227, 123456],
'MAIN_METHOD': [1, 20, 20, 1, 1, 20],
'DRUG_ETCDTL': ['unknown', 'unknown', 'unknown', 'sleepingpill_plunitrazeparm', 'some drug', 'something extra']
}, index=np.arange(111,117))
df2 = pd.DataFrame({
'5. 방법/수단': [100944, 100984, 101476, 102549, 103227],
'Unnamed: 4': ['sleepingpill_unknown', 'others_green material', 'others_anorexia', 'sleepingpill_plunitrazeparm', 'sleepingpill_pentobarbytal']
})
# make a temporary copy of 'df2'
tmp_df = df2[['5. 방법/수단', 'Unnamed: 4']].copy()
# rename columns
tmp_df.columns = ['KEY', 'METHOD_DRUG']
# split the string to get 'METHOD' and 'DRUG_ETCDTL' information
tmp_df[['METHOD','DRUG_ETCDTL']] = tmp_df['METHOD_DRUG'].str.split('_', expand=True)
# use a map to create 'MAIN_METHOD' column
method_map = { 'sleepingpill': 1, 'others': 20 }
tmp_df['MAIN_METHOD'] = tmp_df['METHOD'].map(method_map)
# drop all unwanted DataFrame columns
tmp_df.drop(['METHOD_DRUG', 'METHOD'], inplace=True, axis=1)
# make a copy of the index of df1
index_copy = df1.index.copy(dtype=type(df1.index[0]))
# make 'KEY' and 'MAIN_METHOD' columns the new index
df1.set_index(['KEY', 'MAIN_METHOD'], inplace=True, append=False, drop=True)
# create the same index for tmp_df
tmp_df.set_index(['KEY', 'MAIN_METHOD'], inplace=True, append=False, drop=True)
# update df1 with the values in df2
df1.update(tmp_df)
# restore the 'KEY' and 'MAIN_METHOD' columns in df1
df1.reset_index(inplace=True)
# restore the original index
df1.set_index(index_copy, inplace=True, append=False, drop=True)
# delete the temporary data frame
del tmp_df
# delete the copy of the df1 index
del index_copy
ORIGINAL SOLUTION: This works when the two data frames have the same number of rows.
This solution avoids for loops and instead uses a temporary data frame to perform the task. The strings in the Unnamed: 4 column are split using the str.split() function provided by pandas. The MAIN_METHOD information is transformed using a mapping. The df1 data frame is conditionally updated using numpy.where() before the temporary data frame is deleted.
EDIT: The code has been modified to convert the temporary data frame column series to a numpy array using .values to avoid the error:
ValueError: Can only compare identically-labeled Series objects
Modified np.where() conditions:
df1['DRUG_ETCDTL'] = np.where(((df1['KEY']==tmp_df['KEY'].values) &
(df1['MAIN_METHOD']==tmp_df['MAIN_METHOD'].values)),
tmp_df['DRUG_ETCDTL'],
df1['DRUG_ETCDTL'])
An alternative way to avoid the error would be to use .equals() instead of == when performing the comparison, though note that .equals() returns a single boolean for the whole Series rather than an element-wise result.
df1['DRUG_ETCDTL'] = np.where(((df1['KEY'].equals(tmp_df['KEY'])) &
(df1['MAIN_METHOD'].equals(tmp_df['MAIN_METHOD']))),
tmp_df['DRUG_ETCDTL'],
df1['DRUG_ETCDTL'])
Original code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'KEY': [100944, 67488, 101476, 102549, 103227],
'MAIN_METHOD': [1, 20, 20, 1, 1],
'DRUG_ETCDTL': ['unknown', 'unknown', 'unknown', 'sleepingpill_plunitrazeparm', 'some drug']
}, index=np.arange(11,16))
df2 = pd.DataFrame({
'5. 방법/수단': [100944, 100984, 101476, 102549, 103227],
'Unnamed: 4': ['sleepingpill_unknown', 'others_green material', 'others_anorexia', 'sleepingpill_plunitrazeparm', 'sleepingpill_pentobarbytal']
})
# make a temporary copy of 'df2'
tmp_df = df2[['5. 방법/수단', 'Unnamed: 4']].copy()
# rename columns
tmp_df.columns = ['KEY', 'METHOD_DRUG']
# split the string to get 'METHOD' and 'DRUG_ETCDTL' information
tmp_df[['METHOD', 'DRUG_ETCDTL']] = tmp_df['METHOD_DRUG'].str.split('_', expand=True)
# use a mapping to create 'MAIN_METHOD' column
method_map = { 'sleepingpill': 1, 'others': 20 }
tmp_df['MAIN_METHOD'] = tmp_df['METHOD'].map(method_map)
# drop unwanted columns (This step is optional)
tmp_df.drop(['METHOD_DRUG', 'METHOD'], inplace=True, axis=1)
# update 'df1'
df1['DRUG_ETCDTL'] = np.where(((df1['KEY']==tmp_df['KEY'].values) &
(df1['MAIN_METHOD']==tmp_df['MAIN_METHOD'].values)),
tmp_df['DRUG_ETCDTL'],
df1['DRUG_ETCDTL'])
# delete temporary copy of 'df2'
del tmp_df
I have a file with 136 columns. I am trying to find the unique values of each column and, from there, the number of unique values in each column.
I tried using df and dict for the unique values. However, when I export it back to a csv file, the unique values are exported as a list in one cell for each column.
Is there anything I can do to simplify the counting of the unique values in each column?
df = pd.read_excel(filename)
column_headers = list(df.columns.values)
df_unique = {}
df_count = {}
def approach_1(data):
    count = 0
    for entry in data:
        if not entry == 'nan' or not entry == 'NaN':
            count += 1
    return count
for unique in column_headers:
    new = df.drop_duplicates(subset=unique, keep='first')
    df_unique[unique] = new[unique].tolist()
csv_unique = pd.DataFrame(df_unique.items(), columns=['Data Source Field', 'First Row'])
csv_unique.to_csv('Unique.csv', index=False)
for count in df_unique:
    not_nan = approach_1(df_unique[count])
    df_count[count] = not_nan
csv_count = pd.DataFrame(df_count.items(), columns=['Data Source Field', 'Count'])
.unique() is simpler: len(df[col].unique()) is the count.
import pandas as pd
records = [
    {"col1":"0","col2":"a"},
    {"col1":"1","col2":"a"},
    {"col1":"2","col2":"a"},
    {"col1":"3","col2":"a"},
    {"col1":"4","col2":"a"},
    {"col2":"a"}
]
df = pd.DataFrame.from_dict(records)
result_dict = {}
for col in df.columns:
    result_dict[col] = len(df[col].dropna().unique())
print(result_dict)
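For what it's worth, pandas also has a built-in for this: DataFrame.nunique() counts the distinct non-NaN values per column in one call, so the loop above could be replaced with (a minimal sketch):
print(df.nunique().to_dict())  # e.g. {'col1': 5, 'col2': 1}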
I have 5 different data frames that are the output of different conditions or tables.
I want an output saying whether each of these data frames is empty or not. Basically I will check each data frame with len(df) and produce a string if it has anything in it.
def(df1, df2, df3, df4, df5)
    if len(df1) > 0:
        "df1 not empty"
    else:
        ""
    if len(df2) > 0:
        "df2 not empty"
    else:
        ""
Then I want to append these strings to each other so I end up with a string like:
**df1 not empty, df3 not empty**
Try this:
import pandas as pd
dfs = {'milk': pd.DataFrame(['a']), 'bread': pd.DataFrame(['b']), 'potato': pd.DataFrame()}
print(''.join(
[f'{name} not empty. ' for name, df in dfs.items() if (not df.empty)])
)
output:
milk not empty. bread not empty.
data = [1,2,3]
df = pd.DataFrame(data, columns=['col1'])  # create a non-empty df
data1 = []
df1 = pd.DataFrame(data1)  # create an empty df
dfs = [df, df1]  # list them
# the "for loop" is replaced here by a list comprehension
# enumerate() gives each df in the list an index; printing df or df1 directly
# would output the entire dataframe rather than its name
print(' '.join([f'df{i} is not empty.' for i, df in enumerate(dfs) if not df.empty]))
Result:
df0 is not empty.
With a one-liner:
dfs = [df1,df2,df3,df4,df5]
output = ["your string here" for df in dfs if not df.empty]
You can then concatenate strings together, if you want:
final_string = "; ".join(output)
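To get output in the exact format asked for (e.g. df1 not empty, df3 not empty), one option is to pair each frame with its name; a minimal sketch, assuming the five frames from the question exist:
dfs = {'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4, 'df5': df5}
final_string = ", ".join(f"{name} not empty" for name, df in dfs.items() if not df.empty)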
I have a list of filepaths in the first column of a dataframe. My goal is to create a second column that represents file categories, with categories reflecting the words in the filepath.
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
}
df = pd.DataFrame(data)
df["Animal"] =(df['filepath'].str.contains("dog|cat",case=False,regex=True))
df["Fish"] =(df['filepath'].str.contains("barracuda",case=False))
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
This code works. The problem arises when I have 200 statements beginning with df['columnName'] =. Because I have so many, I get the error:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
To fix this I have tried:
dfAnimal = df.copy
dfAnimal['Animal'] = dfAnimal['filepath'].str.contains("dog|cat",case=False,regex=True)
dfFish = df.copy
dfFish["Fish"] =dfFish['filepath'].str.contains("barracuda",case=False)
df = pd.concat(dfAnimal,dfFish)
The above gives me errors such as "method object is not iterable" and "method object is not subscriptable". I then tried df = df.loc[df['filepath'].isin(['cat','dog'])] but this only works when 'cat' or 'dog' is the only word in the column. How do I avoid the performance warning?
Try creating all your new columns in a dict, and then convert that dict into a dataframe, and then use pd.concat to add the resulting dataframe (containing the new columns) to the original dataframe:
new_columns = {
'Animal': df['filepath'].str.contains("dog|cat",case=False,regex=True),
'Fish': df['filepath'].str.contains("barracuda",case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
Added to your original code, it would be something like this:
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
}
df = pd.DataFrame(data)
##### These are the new lines #####
new_columns = {
'Animal': df['filepath'].str.contains("dog|cat",case=False,regex=True),
'Fish': df['filepath'].str.contains("barracuda",case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
##### End of new lines #####
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
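Since the question mentions a couple of hundred such statements, the same idea scales by keeping the patterns in a dict and building all the new columns in one comprehension. A minimal sketch, assuming a hypothetical patterns mapping of category names to regexes:
patterns = {
    'Animal': 'dog|cat',
    'Fish': 'barracuda',
    # ... remaining categories
}
new_columns = {name: df['filepath'].str.contains(pat, case=False, regex=True)
               for name, pat in patterns.items()}
df = pd.concat([df, pd.DataFrame(new_columns)], axis=1)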
I'm writing a program where I want to extract data from multiple docx files and fill it into a pandas dataframe. I'm currently achieving this in a for loop like so:
cols = ["path","col1", "col2", "col3", "col4"]
def add_to_df(path):
    col1_val = extract_col1(path)
    col2_val = extract_col2(path)
    col3_val = extract_col3(path)
    col4_val = extract_col4(path)
    temp_df = pd.DataFrame(
        [[path, col1_val, col2_val, col3_val, col4_val]],
        columns=cols)
    return temp_df
df = pd.DataFrame()
for path in paths:
    df = df.append(add_to_df(path), ignore_index=True)
Is this the best way to do this? Or is there a nicer, more accepted way? (This is just a simplified example of what I'm trying to do, the actual code looks a lot messier...)
I think it is better to create a list of lists instead of many DataFrames and then pass it to the DataFrame constructor:
def add_to_df(path):
    col1_val = extract_col1(path)
    col2_val = extract_col2(path)
    col3_val = extract_col3(path)
    col4_val = extract_col4(path)
    temp_L = [path, col1_val, col2_val, col3_val, col4_val]
    return temp_L
List comprehension solution:
L = [add_to_df(path) for path in paths]
If you want to use a for loop:
L = []
for path in paths:
    L.append(add_to_df(path))
df = pd.DataFrame(L, columns=cols)
I prefer extracting the data to a dictionary first and then creating a dataframe from that dictionary. Example:
data = {'doc1': {'subject': 'x', 'n_words': 100},
'doc2': {'subject': 'y', 'n_words': 200},
'doc3': {'subject': 'z', 'n_words': 300}}
df = pd.DataFrame.from_dict(data, orient='index')
print(df)
Result:
subject n_words
doc1 x 100
doc2 y 200
doc3 z 300
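Applied to the question, the dictionary could be built from the paths first and converted in one step, with the paths becoming the index. A minimal sketch, assuming the extract_col* helpers from the question:
data = {path: {'col1': extract_col1(path),
               'col2': extract_col2(path),
               'col3': extract_col3(path),
               'col4': extract_col4(path)}
        for path in paths}
df = pd.DataFrame.from_dict(data, orient='index')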