I have a dataset with two columns: the first holds the full directory path of a file, and the second holds the date the file was last modified. I am trying to figure out the number of files in each top-level folder ("dog", "feline", "mouse", "anteater") that were last modified later than 2004-06-23. Ultimately, I'd like something like this:

dog         2
anteater    1
mouse       1
Here's my dataset:
import pandas as pd
data = {'FullName': ["dog\\cat\\cow\\rover.doc", "feline\\cat\\cow\\digger.doc", "dog\\cat\\cow\\whatamess.doc",
                     "mouse\\cat\\mouse\\jude.doc", "anteater\\cat\\mouse\\sam.doc", "dog\\cat\\owl\\audrey.doc"],
        # note the escaped backslashes: an unescaped "\r" in "\rover.doc" would be a carriage return
        'LastWriteTime': ['2003-01-02', '2004-01-02', '2005-01-02', '2006-01-02', '2007-01-02', '2008-01-02']}
df1 = pd.DataFrame(data)
I can count the number of times each top-level folder occurs in the dataset:
df2 = df1['FullName'].apply(lambda x: x.split('\\')[0]).value_counts()
I can also count the number of files with a date later than '2004-06-23':
df3 = df1['LastWriteTime'].apply(lambda x: pd.to_datetime(x, yearfirst=True) > pd.to_datetime('2004-06-23', yearfirst=True)).value_counts()
I tried to combine them as follows:
df2 = (df1['FullName'].apply(lambda x: x.split('\\')[0]) & pd.to_datetime((x),yearfirst=True) > pd.to_datetime('2004-06-23',yearfirst=True)).value_counts()
but I get the error: NameError: name 'x' is not defined
Does anyone know how to combine them?
A lambda body can only be a single expression; see the Python docs. Wrapping each part in parentheses makes the interpreter happy about the syntax:
lambda x: (x.split('\\')[0]) & (pd.to_datetime(x, yearfirst=True) > pd.to_datetime('2004-06-23', yearfirst=True))
(Note this expression would still fail at runtime, since it tries to & a string with a boolean.) Since you are really applying different logic to the 'FullName' and 'LastWriteTime' columns, separate functions suit better: create one evaluation function per column, store them in a dict keyed by the column names, and pass that dict to aggregate().
def check_date(x):
    return pd.to_datetime(x) > pd.to_datetime('2004-06-23', yearfirst=True)

def files_count(x):
    return x.split('\\')[0]

df1 = pd.DataFrame(data)
func_dict = {'FullName': files_count, 'LastWriteTime': check_date}
df4 = df1.aggregate(func_dict)  # applying multiple functions to multiple columns
print(df4, end='\n'*2)

# boolean indexing: df[mask] keeps the rows where mask is True
df3_filtered = df4[df4['LastWriteTime']]  # keep files modified after the cutoff
print(df3_filtered, end='\n'*2)

df2 = df3_filtered['FullName'].value_counts()  # counting.
print(df2)
Result:
   FullName  LastWriteTime
0       dog          False
1    feline          False
2       dog           True
3     mouse           True
4  anteater           True
5       dog           True

   FullName  LastWriteTime
2       dog           True
3     mouse           True
4  anteater           True
5       dog           True

dog         2
anteater    1
mouse       1
Name: FullName, dtype: int64
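For reference, the same counts can be had without aggregate(), by building a boolean mask and using the vectorized .str accessor (a sketch equivalent to the steps above):

# files modified after the cutoff date
mask = pd.to_datetime(df1['LastWriteTime'], yearfirst=True) > pd.to_datetime('2004-06-23')
# top-level folder of each matching file, then count occurrences
print(df1.loc[mask, 'FullName'].str.split('\\').str[0].value_counts())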
I am a beginner getting familiar with pandas. It throws an error when I try to create a new column this way:
drinks['total_servings'] = drinks.loc[: ,'beer_servings':'wine_servings'].apply(calculate,axis=1)
Below is my code, and I get the following error for line 9 (the total_servings assignment):
"Cannot set a DataFrame with multiple columns to the single column total_servings"
Any help or suggestion would be appreciated :)
import pandas as pd

drinks = pd.read_csv('drinks.csv')

def calculate(drink):
    return drinks['beer_servings']+drinks['spirit_servings']+drinks['wine_servings']

print(drinks)

drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
drinks
In your code, when the function calculate is called with axis=1, each row of the DataFrame is passed to it as an argument. However, the body never uses that row: it references the global drinks DataFrame, so calculate returns a whole Series for every row, and apply therefore produces a DataFrame with multiple columns, which cannot be assigned to the single column total_servings. You can update your code to this:
def calculate(each_row):
    return each_row['beer_servings'] + each_row['spirit_servings'] + each_row['wine_servings']

drinks['total_servings'] = drinks.apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
print(drinks)
I suppose the reason is the wrong argument name inside the calculate method: the given argument is drink, but drinks is used to calculate the sum of columns. drink is a Series object that represents a row, and the sum of its elements is a scalar; drinks, meanwhile, is a DataFrame, and the sum of its columns is a Series object. The sample code below shows that the method works when the parameter is actually used:
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 1, 1],
    "B": [2, 2, 2, 2, 2],
    "C": [3, 3, 3, 3, 3]
})

def calculate(to_calc_df):
    return to_calc_df["A"] + to_calc_df["B"] + to_calc_df["C"]

df["total"] = df.loc[:, "A":"C"].apply(calculate, axis=1)
print(df)
Result
   A  B  C  total
0  1  2  3      6
1  1  2  3      6
2  1  2  3      6
3  1  2  3      6
4  1  2  3      6
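As an aside, the row-wise apply can be avoided entirely here. A sketch using the same column names, with the vectorized sum(axis=1) and plain column arithmetic, which is typically much faster on large frames:

# vectorized row sum over the three servings columns; no apply needed
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].sum(axis=1)
drinks['beer_sales'] = drinks['beer_servings'] * 2
drinks['spirit_sales'] = drinks['spirit_servings'] * 4
drinks['wine_sales'] = drinks['wine_servings'] * 6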
I am processing a pandas dataframe and want to remove rows whose "Full Path" is already contained in another "Full Path" of the dataframe.
In the example below I want to remove rows 1, 2, 3 and 4 because c:/dir/ "contains" them (we are talking about file system paths here):
          Full Path Value
0           c:/dir/     x
1      c:/dir/sub1/     x
2      c:/dir/sub2/     x
3     c:/dir/sub2/a     x
4     c:/dir/sub2/b     x
5    c:/anotherdir/     x
6  c:/anotherdir_A/     x
7  c:/anotherdir_C/     x
Rows 6 & 7 are kept because their paths are not contained in row 5 (a in b in my code below).
The code I came up with is the following (res is the initial dataframe):
to_drop = []
for index, row in res.iterrows():
    a = row['Full Path']
    for idx, row2 in res.iterrows():
        b = row2['Full Path']
        if a != b and a in b:
            to_drop.append(idx)
res2 = res.loc[~res.index.isin(to_drop)]
It works but the code does not feel 100% pythonic to me. I am quite sure there is a more elegant/clever way to do this. Any idea?
pd.concat([df, df['Full Path'].str.extract('(.*:\/.*?\/)')], axis=1)\
    .drop_duplicates([0])\
    .drop(columns=0)
You can use .str.extract with a regex to pull out the base directory, concat the extraction back onto the original df, drop the duplicates of the base directory, and finally drop the extracted column.
Edit: an alternative if the paths are not in order:
df[df['Full Path'] == df['Full Path'].str.extract('(.*:\/.*?\/)', expand = False)]
The time complexity of this is in the tank (no matter how you turn it, you have to check every path against every other path), but here is a single-line solution using str.startswith:
df = pd.DataFrame({'Full Path': ['c:/dir/', 'c:/dir/sub/', 'c:/anotherdir/dir',
                                 'c:/anotherdir/'],
                   'Value': ['A', 'B', 'C', 'D']})

# keep a row only if its path does not extend any other path
print(df[[not any(b.startswith(a) and a != b for a in df['Full Path'])
          for b in df['Full Path']]])
output
        Full Path Value
0         c:/dir/     A
3  c:/anotherdir/     D
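If the quadratic scan ever becomes a bottleneck, a sort-first pass gets this down to O(n log n). A sketch (assuming, as in the examples, that every directory path ends with a separator): after a lexicographic sort, any contained path sorts after its container, so comparing each path against the last kept root is enough:

# sort so that a container sorts directly before the paths it contains
paths = sorted(df['Full Path'])
roots = []
for p in paths:
    # p is contained in an earlier path iff it extends the last kept root
    if not roots or not p.startswith(roots[-1]):
        roots.append(p)
print(df[df['Full Path'].isin(roots)])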
I used Python to read a file which contains babies' names, genders and birth years. Now I want to find the names that are used both by boys and girls. I used value_counts() to get the number of appearances of each name, but now I don't know how to extract those shared names from all the names.
Here is my code:
def names_both(year):
    names = []
    path = 'babynames/yob%d.txt' % year
    columns = ['name', 'sex', 'birth']
    frame = pd.read_csv(path, names=columns)
    frame = frame['name'].value_counts()
    print(frame)
    """if len(names) != 0:
        print(names)
    else:
        print('None')"""
The frame now is like this:
Lou 2
Willie 2
Erie 2
Cora 2
..
Perry 1
Coy 1
Adolphus 1
Ula 1
Emily 1
Name: name, Length: 1889, dtype: int64
Here is the csv:
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288
Annie,F,1258
Clara,F,1226
Ella,F,1156
Florence,F,1063
...
Thanks for helping!
Here is how to count the names given both to girls and boys:
common_girl_and_boys_names = (
    # work name by name
    frame.groupby('name')
    # count the distinct sexes recorded for the name; True when the name is given to both sexes (this boolean lands in a column called 0)
    .apply(lambda x: len(x['sex'].unique()) == 2)
    # the names are now in the index; reset it in order to get at them
    .reset_index()
    # keep only the names whose column 0 holds True
    .loc[lambda x: x[0], 'name']
)

final_df = (
    # keep only the names common to boys and girls (the series built before)
    frame.loc[frame['name'].isin(common_girl_and_boys_names), :]
    # sex is now useless
    .drop(['sex'], axis='columns')
    # work name by name and sum the number of births
    .groupby('name')
    .sum()
)
You can put those lines after the read_csv call. I hope it is what you want.
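For what it's worth, the same result can be reached a bit more directly with nunique() (a sketch of the same idea, applied to the raw frame returned by read_csv):

# names that appear with both sexes
sex_counts = frame.groupby('name')['sex'].nunique()
common_names = sex_counts[sex_counts == 2].index
# total births per shared name
final_df = frame[frame['name'].isin(common_names)].groupby('name')['birth'].sum()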
I have a pandas dataframe with approximately 3 million rows.
I want to partially aggregate the last column, in separate groups, based on another variable.
My solution was to separate the dataframe rows into a list of new dataframes based on that variable, aggregate the dataframes, and then join them again into a single dataframe. The problem is that after a few tens of thousands of rows I get a memory error. What methods can I use to improve the efficiency of my function and prevent these memory errors?
An example of my code is below
test = pd.DataFrame({"unneeded_var": [6,6,6,4,2,6,9,2,3,3,1,4,1,5,9],
                     "year": [0,0,0,0,1,1,1,2,2,2,2,3,3,3,3],
                     "month": [0,0,0,0,1,1,1,2,2,2,3,3,3,4,4],
                     "day": [0,0,0,1,1,1,2,2,2,2,3,3,4,4,5],
                     "day_count": [7,4,3,2,1,5,4,2,3,2,5,3,2,1,3]})
test = test[["year", "month", "day", "day_count"]]
def agg_multiple(df, labels, aggvar, repl=None):
    if repl is None: repl = aggvar
    conds = df.duplicated(labels).tolist()  # boolean list: False for a unique (year, month), then True until the next unique pair
    groups = []
    start = 0
    for i in range(len(conds)):  # when False, split the previous rows into a new df and aggregate the count
        bul = conds[i]
        if i == len(conds) - 1: i += 1  # no False marks the end of the last group, special case
        if not bul and i > 0 or bul and i == len(conds):
            sample = df.iloc[start:i, :]
            start = i
            sample = sample.groupby(labels, as_index=False).agg({aggvar: sum}).rename(columns={aggvar: repl})
            groups.append(sample)
    df = pd.concat(groups).reset_index(drop=True)  # combine aggregated dfs into a new df
    return df

test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")
I suppose that I could apply the function to small samples of the dataframe to prevent a memory error and then combine those, but I'd rather improve the computation time of my function.
This function does the same, and is 10 times faster.
test.groupby(["year", "month"], as_index=False).agg({"day_count":sum}).rename(columns={"day_count":"month_count"})
There are almost always optimized pandas methods that will vastly outperform manual iteration through the dataframe. If I understand correctly, in your case, the following will return the same exact output as your function:
test2 = (test.groupby(['year', 'month'])
             .day_count.sum()
             .to_frame('month_count')
             .reset_index())
>>> test2
   year  month  month_count
0     0      0           16
1     1      1           10
2     2      2            7
3     2      3            5
4     3      3            5
5     3      4            4
To check that it's the same:
# Your original function:
test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")
>>> test == test2
   year  month  month_count
0  True   True         True
1  True   True         True
2  True   True         True
3  True   True         True
4  True   True         True
5  True   True         True
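And if the goal were instead to keep the day-level rows and attach the per-(year, month) total to each of them, groupby(...).transform does that in one vectorized step (a sketch on the same test frame):

# month total aligned back onto every original row
test['month_count'] = test.groupby(['year', 'month'])['day_count'].transform('sum')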
I have run the following script, which uses fuzzy matching (via difflib) to replace some common words from a default list. Dataframe df1 contains my default list of possible values. Dataframe df2 is the main dataframe, where transformations/changes are undertaken after referring to df1. The code is as follows:
import difflib
import numpy as np
import pandas as pd

df1 = pd.DataFrame(['one','two','three','four','five','tsst'])
df2 = pd.DataFrame({'not_shifted':[np.nan,'one','too','three','fours','five','six',np.nan,'test']})

# Replace NaN values with empty strings (get_close_matches needs a string)
df2 = pd.DataFrame(df2['not_shifted'].fillna(value=''))
df2['not_shifted'] = df2['not_shifted'].map(lambda x: difflib.get_close_matches(x, df1[0]))
The problem is that the output is a dataframe of lists, shown with square brackets. To make matters worse, none of the texts within df2['not_shifted'] are directly viewable/recallable:
Out[421]:
not_shifted
0 []
1 [one]
2 [two]
3 [three]
4 [four]
5 [five]
6 []
7 []
8 [tsst]
Please help.
Use df2.not_shifted.apply(lambda x: x[0] if len(x) != 0 else "") or simply df2.not_shifted.str[0], as solved by @Psidom.
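A minimal sketch of both variants on the example frame above (note that .str[0] leaves NaN where the match list was empty, hence the fillna('')):

# take the first close match if any, otherwise an empty string
df2['not_shifted'] = df2['not_shifted'].apply(lambda x: x[0] if len(x) != 0 else "")
# or equivalently with the .str accessor:
# df2['not_shifted'] = df2['not_shifted'].str[0].fillna('')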
def replace_all(eg):
    # characters/tokens stripped from the string representation;
    # "u" and "frozenset" are leftovers of Python 2 repr output
    rep = {"[": "",
           "]": "",
           "u": "",
           "}": "",
           "'": "",
           '"': "",
           "frozenset": ""}
    for i, j in rep.items():
        eg = eg.replace(i, j)
    return eg

for each in df.columns:
    df[each] = df[each].apply(lambda x: replace_all(str(x)))
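One caveat with the replacement table above: the "u" rule also eats the letter u inside legitimate words, so the apply/.str[0] approach is generally safer. A quick check:

# the "u" rule mangles real text, not just the Python 2 u'' prefix
print(replace_all("[four]"))  # -> for, not four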