Create a dataframe from unequal length strings - python

I have a dataframe of file names and their paths, where each path is a single continuous string, e.g.:
files:
   name       path
0  file1.txt  \\drive\folder1\folder2\folder3\...\file1.txt
1  file2.pdf  \\drive\folder1\file2.pdf
2  file3.xls  \\drive\folder1\folder2\folder3\...\folder21\file3.xls
n  ...        ...
The size of the frame is about 1.02E+06 entries; the depth of the drive is at most 21 folders, but varies greatly.
The goal is to have a dataframe in the format:
   name       level1   level2   level3   level4   ...  level21
0  file.txt   folder1  folder2  folder3  0        ...  0
1  file.pdf   folder1  0        0        0        ...  0
2  file3.xls  folder1  folder2  folder3  folder4  ...  folder21
...
I split the path string and created an array per file, which can later be padded with zeros if the path is shorter than the maximum depth:
def path_split(name):
    # drop the common prefix (first 7 components) and keep the remaining folders
    return np.array(os.path.normpath(name).split(os.sep)[7:])

files = files.assign(plist=files['path'].apply(path_split))
Then I add a column with the number of folders in each file's path:
files = files.assign(len_plist=files.plist.map(len))
The problem here is that the split path creates nested arrays within the dataframe.
Then I create an empty DataFrame with one column per folder level (21 here) and one row per file (1.02E+06 here):
max_folder = files['len_plist'].max()  # maximum number of folders in any path
levelcos = ['flevel_{}'.format(i) for i in np.arange(max_folder)]
levels = pd.DataFrame(np.zeros((files.shape[0], max_folder)),
                      columns=levelcos, index=files.index)
and now I fill the empty frame with the entries of the path arrays:
def fill_rows(df, array):
    # write each path array into its row, leaving the remaining cells at 0
    for i, row in enumerate(array):
        df.iloc[i, :row.shape[0] - 1] = row[:-1]
    return df

levels = fill_rows(levels, files.plist.values)
This takes a lot of time, since the varying length of the path arrays does not allow a vectorized solution right away. Looping over all 1.02E+06 rows of the dataframe would take at least 34h, maybe up to 200h.
First and foremost, I want to optimize the filling of the dataframe; in a second step I would split the dataframe, parallelize the operations and reassemble the frame afterwards.
edit: added the clarification that a shorter path can be padded up to the maximum length with zeros.

Maybe I'm missing something but why doesn't this work for you?
expanded = files['path'].str.split(os.path.sep, expand=True).fillna(0)
expanded = expanded.rename(columns=lambda x: 'level_' + str(x))
df = pd.concat([files['name'], expanded], axis=1)
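For instance, on a tiny frame built from made-up paths in the question's layout (the names and folder depths are placeholders, not the real data), this yields one level_N column per path component, padded with 0 where a path is shorter:
import os
import pandas as pd

files = pd.DataFrame({
    'name': ['file1.txt', 'file2.pdf'],
    'path': [r'\\drive\folder1\folder2\file1.txt', r'\\drive\folder1\file2.pdf'],
})

# split on a hard-coded Windows separator so the sketch runs on any OS;
# the leading '\\' of a UNC path yields two empty leading columns, which can be
# dropped just like the question's [7:] slice does
expanded = files['path'].str.split('\\', expand=True).fillna(0)
expanded = expanded.rename(columns=lambda x: 'level_' + str(x))
df = pd.concat([files['name'], expanded], axis=1)
print(df)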

Related

python pandas: fulfill condition and assign a value to it

I am really hoping you can help me here... I need to assign a label (df_label) to the exact file within a dataframe (df_data) that it falls into, and save all labels that appear in each file in a separate txt file (that's the easy bit).
df_data:
file_name file_start file_end
0 20190201_000004.wav 0.000 1196.000
1 20190201_002003.wav 1196.000 2392.992
2 20190201_004004.wav 2392.992 3588.992
3 20190201_010003.wav 3588.992 4785.984
4 20190201_012003.wav 4785.984 5982.976
df_label:
Begin Time (s)
0 27467.100000
1 43830.400000
2 43830.800000
3 46378.200000
I have tried switching to np.array and using a for loop with np.where, but without any success...
If each time value in df_label falls within exactly one entry of df_data, you can use the following:
def get_file_name(begin_time):
    # select the row whose [file_start, file_end] interval contains begin_time
    file_names = df_data[
        (df_data["file_start"] <= begin_time)
        & (df_data["file_end"] >= begin_time)
    ]["file_name"].values
    return file_names[0] if file_names.size > 0 else None

df_label["file_name"] = df_label["Begin Time (s)"].apply(get_file_name)
This will add another column, file_name, to df_label.
If the labels from df_label match the order of files in df_data, you can simply:
add the labels as a new column of df_data (df_data["label"] = df_label["Begin Time (s)"]),
or
use the DataFrame.merge() function (df_data = df_data.merge(df_label, left_index=True, right_index=True)).
You can find more about merging/joining, with examples, here:
https://thispointer.com/pandas-how-to-merge-dataframes-by-index-using-dataframe-merge-part-3/
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
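For completeness, a minimal sketch of the index-based option using small excerpts of the question's frames (assuming the row order really does match):
import pandas as pd

# hypothetical excerpts of the question's frames
df_data = pd.DataFrame({'file_name': ['20190201_000004.wav', '20190201_002003.wav'],
                        'file_start': [0.000, 1196.000],
                        'file_end': [1196.000, 2392.992]})
df_label = pd.DataFrame({'Begin Time (s)': [27467.1, 43830.4]})

# both lines below attach the label column purely by row position
df_data['label'] = df_label['Begin Time (s)']
merged = df_data.merge(df_label, left_index=True, right_index=True)
print(merged)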

splitting of urls from a list in dataframe where column name is company_urls

I have a dataframe (df) like this:
company_urls
0 [https://www.linkedin.com/company/gulf-capital...
1 [https://www.linkedin.com/company/gulf-capital...
2 [https://www.linkedin.com/company/fajr-capital...
3 [https://www.linkedin.com/company/goldman-sach...
And df.company_urls[0] is
['https://www.linkedin.com/company/gulf-capital/about/',
'https://www.linkedin.com/company/the-abraaj-group/about/',
'https://www.linkedin.com/company/abu-dhabi-investment-company/about/',
'https://www.linkedin.com/company/national-bank-of-dubai/about/',
'https://www.linkedin.com/company/efg-hermes/about/']
So I have to create new columns like this:
company_urls company_url1 company_url2 company_url3 ...
0 [https://www.linkedin.com/company/gulf-capital... https://www.linkedin.com/company/gulf-capital/about/ https://www.linkedin.com/company/the-abraaj-group/about/...
1 [https://www.linkedin.com/company/gulf-capital... https://www.linkedin.com/company/gulf-capital/about/ https://www.linkedin.com/company/gulf-related/about/...
2 [https://www.linkedin.com/company/fajr-capital... https://www.linkedin.com/company/fajr-capital/about/...
3 [https://www.linkedin.com/company/goldman-sach... https://www.linkedin.com/company/goldman-sachs/about/...
How do I do that?
I have created this function for my personal use, and I think it will work for your needs:
a) Specify the df name
b) Specify the column you want to split
c) Specify the delimiter
def composition_split(dat, col, divider=','):  # set your delimiter here
    """
    Splits the column of interest depending on how many delimiters we have
    and creates all the columns needed to make the split.
    """
    x1 = dat[col].astype(str).apply(lambda x: x.count(divider)).max()
    x2 = ["company_url_" + str(i) for i in np.arange(0, x1 + 1, 1)]
    dat[x2] = dat[col].str.split(divider, expand=True)
    return dat
Basically this will create as many columns as needed, depending on how you specify the delimiter. For example, if a value contains the delimiter 3 times, it will create 4 new columns (one per part).
your_new_df = composition_split(df,'col_to_split',',') # for example
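Note that in this question company_urls appears to hold actual Python lists rather than one delimited string per row; in that case a list-based expansion is an alternative. This is only a minimal sketch with made-up data, not part of the answer above:
import pandas as pd

# hypothetical frame whose cells are real Python lists of URLs
df = pd.DataFrame({'company_urls': [
    ['https://www.linkedin.com/company/gulf-capital/about/',
     'https://www.linkedin.com/company/the-abraaj-group/about/'],
    ['https://www.linkedin.com/company/fajr-capital/about/'],
]})

# expand each list into its own set of columns, padded with NaN for shorter lists
urls = pd.DataFrame(df['company_urls'].tolist(), index=df.index)
urls.columns = ['company_url{}'.format(i + 1) for i in urls.columns]
out = pd.concat([df, urls], axis=1)
print(out)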

More pythonic way to remove rows where one value begins by another row's value in a pandas dataframe

I am processing a pandas dataframe and want to remove rows whose "Full Path" already contains the "Full Path" of another row of the dataframe.
In the example below I want to remove rows 1, 2, 3 and 4 because c:/dir/ "contains" them (we are talking about file system paths here):
Full Path Value
0 c:/dir/ x
1 c:/dir/sub1/ x
2 c:/dir/sub2/ x
3 c:/dir/sub2/a x
4 c:/dir/sub2/b x
5 c:/anotherdir/ x
6 c:/anotherdir_A/ x
7 c:/anotherdir_C/ x
Rows 6 & 7 are kept because row 5's path is not contained in them (the a in b test in my code below).
The code I came up with is the following (res is the initial dataframe):
to_drop = []
for index, row in res.iterrows():
    a = row['Full Path']
    for idx, row2 in res.iterrows():
        b = row2['Full Path']
        if a != b and a in b:
            to_drop.append(idx)
res2 = res.loc[~res.index.isin(to_drop)]
It works but the code does not feel 100% pythonic to me. I am quite sure there is a more elegant/clever way to do this. Any idea?
pd.concat([df, df['Full Path'].str.extract(r'(.*:\/.*?\/)')], axis=1)\
    .drop_duplicates([0])\
    .drop(columns=0)
You can use .str.extract with a regex to pull out the base directory, concat the extract back onto the original df, drop the duplicates of the base directory, and finally drop the extracted column.
Edit: alternative if Path is not in order:
df[df['Full Path'] == df['Full Path'].str.extract(r'(.*:\/.*?\/)', expand=False)]
The time complexity of this is in the tank (no matter how you turn it, you have to check every path against every other path), but here is a single-line solution using str.startswith:
df = pd.DataFrame({'Full Path': ['c:/dir/', 'c:/dir/sub/', 'c:/anotherdir/dir',
                                 'c:/anotherdir/'],
                   'Value': ['A', 'B', 'C', 'D']})

print(df[[any(a.startswith(b) if a != b else False for a in df['Full Path'])
          for b in df['Full Path']]])
output
Full Path Value
0 c:/dir/ A
3 c:/anotherdir/ D

Merge multiple CSV files that share 2 columns into one unique data frame

I have multiple CSV files (about 200) in a folder that I want to merge into one unique dataframe. For example, each file has 3 columns, of which 2 are common to all the files (Country and Year); the third column is different in each file.
For example, one file has the following columns:
Country Year X
----------------------
Mexico 2015 10
Spain 2014 6
And other file can be like this:
Country Year A
--------------------
Mexico 2015 90
Spain 2014 67
USA 2020 8
I can read these files and merge them with the following code:
x = pd.read_csv("x.csv")
a = pd.read_csv("a.csv")
df = pd.merge(a, x, how="left", left_on=["country", "year"],
              right_on=["country", "year"], indicator=False)
And this results in the output that I want, like this:
Country Year A X
-------------------------
Mexico 2015 90 10
Spain 2014 67 6
USA 2020 8
However, my problem is doing the previous process for each file; there are more than 200. I want to know if I can use a loop (or another method) in order to read the files and merge them into a unique dataframe.
Thank you very much, I hope I was clear enough.
Use glob like this:
import glob
print(glob.glob("/home/folder/*.csv"))
This gives all your files in a list: ['/home/folder/file1.csv', '/home/folder/file2.csv', .... ]
Now, you can just iterate over this list from 1->end, keeping 0 as your base, and do pd.read_csv() and pd.merge() on each - it should be sorted!
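A minimal sketch of that loop (the folder path and the merge keys Country/Year are assumptions based on the question's examples):
import glob
import pandas as pd

csv_files = sorted(glob.glob("/home/folder/*.csv"))  # assumed folder

# use the first file as the base, then left-merge every remaining file onto it
df = pd.read_csv(csv_files[0])
for path in csv_files[1:]:
    df = pd.merge(df, pd.read_csv(path), how="left", on=["Country", "Year"])

print(df.head())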
Try this:
import os
import pandas as pd

# update this to the path that contains your .csv's
path = '.'

# get files that end with csv in path
dir_list = [file for file in os.listdir(path) if file.endswith('.csv')]

# initiate empty list
df_list = []

# simple for loop with try/except that skips files that throw errors in read_csv
for file in dir_list:
    try:
        # set the index to the shared keys so the frames align for pd.concat
        df_list.append(pd.read_csv(os.path.join(path, file)).set_index(['Country', 'Year']))
    except:  # change this depending on whatever errors pd.read_csv() throws
        pass

# axis=1 lines the per-file value columns up side by side on the shared index
concatted = pd.concat(df_list, axis=1)

python csv subcombinations columns

I have a csv file like this:
F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,label
a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,L1
b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,L2
I want to write combinations of its columns to separate files, as follows:
For 1-column combinations with the label, write the expected results to:
file1:
F1,label
a1,L1
b1,L2
file2:
F2,label
a2,L1
b2,L2
until
file10:
F10,label
a10,L1
b10,L2
For 2-column combinations with the label, write the expected results to:
2C_file1:
F1,F2,label
a1,a2,L1
b1,b2,L2
2C_file2:
F1,F3,label
a1,a3,L1
b1,b3,L2
until
2C_file45:
F9,F10,label
a9,a10,L1
b9,b10,L2
For 3-column combinations with the label, write to 120 files:
.....until.....
For 9-column combinations with the label, write to 10 files:
For 10-column combinations with the label, write to 1 file:
I have searched and found Python code for string combinations with itertools.
How could I achieve the above tasks with Python code?
import itertools as iters

text = 'ABCDEFGHIJ'
C1 = iters.combinations(text, 1)
print(list(C1))
C2 = iters.combinations(text, 2)
print(list(C2))
.....
C9 = iters.combinations(text, 9)
print(list(C9))
C10 = iters.combinations(text, 10)
print(list(C10))
This loop structure should be able to create the structure you would like to have; it still has to be changed to write to files. Here the sequence length and the start position of the slice determine which part of each row is printed, which is the same structure you would like to write into a file:
#!/usr/bin/env python
row1 = ["F{i}".format(i=i) for i in range(1, 11)]
row1.append("label")
row2 = ["a{i}".format(i=i) for i in range(1, 11)]
row2.append("L1")
row3 = ["b{i}".format(i=i) for i in range(1, 11)]
row3.append("L2")

for SequenceLength in range(1, len(row1)):
    for SequencePositionStart in range(len(row1)):
        if row1[SequencePositionStart:SequenceLength] == []:
            continue
        print(','.join(row1[SequencePositionStart:SequenceLength]), row1[-1], sep=",")
        print(','.join(row2[SequencePositionStart:SequenceLength]), row2[-1], sep=",")
        print(','.join(row3[SequencePositionStart:SequenceLength]), row3[-1], sep=",")
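Since the question explicitly asks about itertools combinations, a minimal sketch that reads the CSV with pandas and writes one file per column combination could look like this (the input and output file names are assumptions):
import itertools
import pandas as pd

# hypothetical input file with columns F1..F10 and label, as in the question
df = pd.read_csv('input.csv')
feature_cols = [c for c in df.columns if c != 'label']

# one output file per k-column combination, always keeping the label column
for k in range(1, len(feature_cols) + 1):
    for n, combo in enumerate(itertools.combinations(feature_cols, k), start=1):
        out_name = '{}C_file{}.csv'.format(k, n)
        df[list(combo) + ['label']].to_csv(out_name, index=False)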
