I am working on creating a dataframe for classification tasks.
Since my data is coming from all kinds of different sources, I am wondering what the best way to collect the data step by step would be.
I am starting off with a folder of files, and I want to store each file's path and filename and then add new data, such as its label, which I get from a text file saved somewhere else.
But what is the best way to do that?
I was thinking about a list of dictionaries, like
data = [{"path": path_to_file_1, "filename" : filename_1, "label" : label_1},
{"path": path_to_file_2, "filename" : filename_2, "label" : label_2},
{"path": path_to_file_3, "filename" : filename_3, "label" : label_3}]
and so on.
My idea was to iterate through my folder, collect the information via different functions that I wrote and create a dictionary for each of my files like so:
data = []  # start with an empty list
for filename in folder:
    dict_filename = {}
    label = get_label(filename)
    path = get_path(filename)
    dict_filename["label"] = label
    dict_filename["path"] = path
    dict_filename["filename"] = filename
    data.append(dict_filename)
with dict_filename being a dictionary that only contains the information of the file that I am looking at at the moment.
So at the end I would get a list containing all the dictionaries that I created for all of my files.
My questions are:
Is this a way that makes sense or is there a different way that works better/easier/smoother?
If this works, what do I do to create a new dictionary in every loop? (I suppose I need a different name for each dictionary so that I don't overwrite the first one on every iteration.)
This might be something pretty basic as I am new to Python, but I am grateful for everyone that can help me out!
Thanks in advance!
The dictionary is the way to go on this one. However, there is a lot of redundancy that could be eliminated depending on the structure of your data.
For example, you can use one dictionary to store all the dataframes in this manner:
dfs[filename] = pd.DataFrame(path).rename(label)
This makes accessing the information much easier later on. In addition, you can use:
df = pd.concat(dfs, axis=1)
To combine all your dataframes in the end.
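One note on the question's second point: re-binding dict_filename = {} at the top of each iteration already creates a brand-new dictionary object, so no unique names are needed. And since pandas builds a dataframe directly from a list of dictionaries, the collection step can feed straight into it. A minimal sketch, assuming the question's get_label and get_path helpers and a placeholder folder name:

import os
import pandas as pd

# one dictionary per file; the keys become the dataframe's column names
data = [{"path": get_path(f), "filename": f, "label": get_label(f)}
        for f in os.listdir("my_folder")]  # "my_folder" is a placeholder

df = pd.DataFrame(data)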
So I have a DataFrame with several columns, some contain objects (string) and some are numerical.
I'd like to create new dataframes which are "filtered" down to each combination of the object values available.
To be clear, those are my object type columns:
Index(['OS', 'Device', 'Design', 'Language'], dtype='object')
["Design"] and ["Language"] have 3 options each.
I filtered ["OS"] and ["Device"] manually as I needed to match them.
However, now I want to create multiple variables, each containing a "filtered" dataframe.
For example:
I have
"android_fltr1_d1" to represent the next filter:
["OS"]=android, ["Device"]=1,["Design"]=1
and "android_fltr3_d2" to represent:
["OS"]=android, ["Device"]=3,["Design"]=2
I tried the following code (which works perfectly fine):
android_fltr1_d1 = android_fltr1[android_fltr1["Design"]==1].drop(["Design"],axis=1)
android_fltr1_d2 = android_fltr1[android_fltr1["Design"]==2].drop(["Design"],axis=1)
android_fltr1_d3 = android_fltr1[android_fltr1["Design"]==3].drop(["Design"],axis=1)
android_fltr3_d1 = android_fltr3[android_fltr3["Design"]==1].drop(["Design"],axis=1)
android_fltr3_d2 = android_fltr3[android_fltr3["Design"]==2].drop(["Design"],axis=1)
android_fltr3_d3 = android_fltr3[android_fltr3["Design"]==3].drop(["Design"],axis=1)
android_fltr5_d1 = android_fltr5[android_fltr5["Design"]==1].drop(["Design"],axis=1)
android_fltr5_d2 = android_fltr5[android_fltr5["Design"]==2].drop(["Design"],axis=1)
android_fltr5_d3 = android_fltr5[android_fltr5["Design"]==3].drop(["Design"],axis=1)
As you can guess, I don't find this efficient and would like to use a for loop to generate those variables (I'd need to match each ["Language"] option to each filter I created, for a total of ~60 variables).
I thought about using something similar to .format() in the loop as a kind of placeholder, but I couldn't find a way to do it.
It would probably be best to use a nested loop to create all the variables, though I'd be content even with a single loop per column.
I find it difficult to build the for loop and would be grateful for any help or directions.
Thanks!
As suggested, I tried to find my answer in: How do I create variable variables?
Yet I failed to understand how to use the globals() function in my case. I also found that using '%' no longer works.
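For what it's worth, the usual alternative to variable variables (and to globals()) is a dictionary keyed by the filter parameters. A minimal sketch, assuming a single source DataFrame df with the columns shown above and the Device values 1, 3, 5 from the question:

filtered = {}
for device in [1, 3, 5]:
    for design in [1, 2, 3]:
        subset = df[(df["OS"] == "android")
                    & (df["Device"] == device)
                    & (df["Design"] == design)]
        filtered[f"android_fltr{device}_d{design}"] = subset.drop(columns=["Design"])

# lookups replace the hand-written variable names:
# filtered["android_fltr1_d1"]

The same pattern extends to the ["Language"] options with one more nested loop, so all ~60 frames live in one dictionary.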
I want to manage many files in such a way that each file stays on disk and my app works with only part of the data.
I have to manage two types of files: text files (book-like) and CSV files (time series).
For every file I may generate multiple dimensionally reduced copies, which I want to keep and cache so I don't have to regenerate them.
I can see two ways of doing this:
1. create my own lib that uses memory-mapping
2. use a tool such as Dask
Dask seems like a good choice, but I cannot find a way for the Bag object to iterate in a loop and/or support range access, i.e.
for i in bag_obj[2:10]: ...
bag_obj[5:10]
I can only do .take().
Second, is there a way to map a list to a file and do normal list operations on it as if it were in memory?
I came up with this; is it the best way?
def slice(self, pfrom, pto):
    assert self.bag is not None
    return self.bag.take(pto)[pfrom:]
but it does not work because .take() returns a computed value ;(
This may be a solution:
from dask.bag.core import Bag

def slice(self, pfrom, pto):
    return self.take(pto)[pfrom:]

Bag.slice = slice
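For what it's worth, here is a small usage sketch of that monkey-patched slice. Note it is still eager: .take() computes and returns a tuple of the first pto elements, so nothing stays lazy past that point:

import dask.bag as db
from dask.bag.core import Bag

def bag_slice(self, pfrom, pto):
    # .take() returns an in-memory tuple of the first `pto` elements,
    # which ordinary Python slicing then trims from the left
    return self.take(pto)[pfrom:]

Bag.slice = bag_slice

b = db.from_sequence(range(100), npartitions=4)
print(b.slice(5, 10))  # (5, 6, 7, 8, 9)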
Hey, so currently I've been trying to convert the text-containing column of a CSV file into a dictionary. From there I would then like to create word embeddings (and potentially embeddings for subparts of words, i.e. dictionary => dict - tion - nary). What would be the best way to go about this, and what frameworks would work best? I have attached my current code and an example of one row of the database.
# First we must input the data, we can use pandas to do this.
import pandas as pd
# Our data does not have headers so we will fabricate them ourselves during the import
data = pd.read_csv('agr_en_train.csv', header=None, names =['Unique_ID', 'Text', 'Aggression-level'])
# We can now check the data has loaded properly.
data
0  facebook_corpus_msr_1723796  Well said sonu..you have courage to stand agai...  OAG
Please let me know if you require any other info to answer this question more adeptly. Additionally, are there any recommended pre-created dictionaries, and how would I utilise them? An answer or a direction to any helpful sources would be greatly appreciated!
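As a hedged sketch of the first step, here is one way to build a word -> index dictionary from the Text column with plain Python, reusing the agr_en_train.csv loading from the question. For the embedding step itself, gensim's Word2Vec is a common choice, and fastText learns subword (character n-gram) embeddings of the kind described above:

import pandas as pd
from collections import Counter

data = pd.read_csv('agr_en_train.csv', header=None,
                   names=['Unique_ID', 'Text', 'Aggression-level'])

# crude whitespace tokenisation; a real pipeline would also normalise punctuation
counts = Counter(word
                 for text in data['Text'].astype(str)
                 for word in text.lower().split())

# map each word to an integer id, most frequent words first
vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common())}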
I am writing a script to read in a lot of data contained in multiple CSV files. When I read the data from each CSV I put it in a list which is stored in a dictionary so the eventual data structure is:
data_set = {user1: {filenames: [file1, file2, ...],
                    labels: [file1label_1, file1label_2, file1label_3, file2label_1, ...],
                    features: [file1feat_1, file1feat_2, file1feat_3, file2feat_1, ...],
                    file_timepoints: [file1time_1, file1time_2, file1time_3, file2time_1, ...]
                    },
            user2: {filenames: [file1, file2, ...],
                    labels: [file1label_1, file1label_2, file1label_3, file2label_1, ...],
                    features: [file1feat_1, file1feat_2, file1feat_3, file2feat_1, ...],
                    file_timepoints: [file1time_1, file1time_2, file1time_3, file2time_1, ...]
                    }
            }
Now the filenames variable is a list of 200 files so the length is 200, but all other variables in the dictionary are lists of length 7000 because they contain data from each timestep of each file.
I am wondering, what would be an efficient way to delete data corresponding to a specific file from all of the lists within a dictionary? So for example if I wanted to delete file1 data for user1 the resulting dictionary would look like:
data_set = {user1: {filenames: [file2, ...],
                    labels: [file2label_1, ...],
                    features: [file2feat_1, ...],
                    file_timepoints: [file2time_1, ...]
                    },
            user2: {filenames: [file1, file2, ...],
                    labels: [file1label_1, file1label_2, file1label_3, file2label_1, ...],
                    features: [file1feat_1, file1feat_2, file1feat_3, file2feat_1, ...],
                    file_timepoints: [file1time_1, file1time_2, file1time_3, file2time_1, ...]
                    }
            }
So far I have tried using nested for loops but it gets extremely messy and is highly inefficient. Any suggestions would be greatly appreciated!
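One hedged sketch: if each file contributes a fixed number of timesteps (the 200-file / 7000-entry figures suggest 35 per file), the file's position in filenames gives the slice to delete from every other list. If the counts vary per file, keep a parallel list of per-file lengths and compute the offsets from it instead:

def remove_file(data_set, user, filename, steps_per_file):
    entry = data_set[user]
    i = entry["filenames"].index(filename)  # position of the file
    start, stop = i * steps_per_file, (i + 1) * steps_per_file
    # delete the aligned slice from every per-timestep list
    for key in ("labels", "features", "file_timepoints"):
        del entry[key][start:stop]
    del entry["filenames"][i]

# e.g. remove_file(data_set, "user1", "file1", steps_per_file=35)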
EDIT:
This is an example of what the data looks like. The labels come from a CSV that is a single 1 x Ntimesteps row of data, the feature data come from a CSV that is Nfeatures x Ntimesteps, and the timepoints come from a CSV that is a single 1 x Ntimesteps row.
After doing some research, I think the best way to approach this issue would be to use object-oriented concepts, namely a factory design pattern with classes and an inheritance hierarchy for the files and variables.
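A hedged sketch of that direction: grouping each file's data into one record object makes the deletion from the question a single statement, because nothing has to stay aligned across parallel lists (FileRecord is a hypothetical name):

from dataclasses import dataclass, field
from typing import List

@dataclass
class FileRecord:
    labels: List = field(default_factory=list)
    features: List = field(default_factory=list)
    timepoints: List = field(default_factory=list)

# per user, map filename -> its record
data_set = {"user1": {"file1": FileRecord(), "file2": FileRecord()}}

# deleting all of file1's data for user1 is now one statement
del data_set["user1"]["file1"]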
I have been working with Python for about 1.5 years and am looking for some direction. This is the first time I can't find what I need after doing a lot of searching, and I must be missing something; most likely I am searching the wrong terms.
Problem: I am working on an app that has many processes (could be hundreds or even thousands). Each process may have a unique input and output data format: multiline strings, comma-separated strings, Excel or CSV with or without varying headers, and many others. I need something that will format the input correctly and handle the output based upon the process, and new processes also need to be easy to add/define. I am open to whatever is the best approach, but my thought is to use a database that stores the template/data definition and use it to know the format for a given process. However, I'm struggling to work out exactly how to do this, and whether it is really the best approach; it needs to be a solution that scales. Any direction would be appreciated. Thank you.
A couple of simple examples of the data:
Process 1 example data (multi line string with Header)
Input of
[ABC123, XYZ453, CDE987]
and the resulting data input below would be created:
Barcode
ABC123
XYZ453
CDE987
The code below works, but is not reusable for example 2.
barcodes = ["ABC123", "XYZ453", "CDE987"]
text = "Barcode\r\n"  # header row
for b in barcodes:
    text = text + b + "\r\n"
Process 2 example input template (comma separated with Header):
Barcode,Location,Param1,Param2
Item1,L1,11,A
Item1,L1,22,B
Item2,L1,33,C
Item2,L2,44,F
Item3,L2,55,B
Item3,L2,66,P
Process 2 example resulting input data (comma separated with Header):
Input of
{'Barcode':['ABC123', 'XYZ453', 'CDE987', 'FGH487', 'YTR123'], 'Location':['Shelf1', 'Shelf2']}
and using the template to create the input data below:
Barcode,Location,Param1,Param2
ABC123,Shelf1,11,A
ABC123,Shelf1,22,B
XYZ453,Shelf1,33,C
XYZ453,Shelf2,44,F
CDE987,Shelf2,55,B
CDE987,Shelf2,66,P
FGH487,Shelf1,11,A
FGH487,Shelf1,22,B
YTR123,Shelf1,33,C
YTR123,Shelf2,44,F
I know how to handle each process with hardcoded loops, dataframe merges, etc. I've done some abstraction in other cases with dicts. However, how to define/store formats that vary so much, and how to create reusable, abstracted code, is where I am stuck.
Maybe you can have each function return its output as a dict with the keys "datatype" and "output" for the actual output.
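A hedged sketch of that idea combined with a registry of per-process formatters (all names here are hypothetical): each formatter returns the tagged result, and adding a new process means adding one entry rather than touching existing code:

from typing import Callable, Dict, List

def format_barcode_list(barcodes: List[str]) -> dict:
    # Process 1 style: "Barcode" header plus one barcode per line
    lines = ["Barcode"] + list(barcodes)
    return {"datatype": "multiline", "output": "\r\n".join(lines)}

# registry: process name -> formatter; new processes just register here
FORMATTERS: Dict[str, Callable[..., dict]] = {
    "process1": format_barcode_list,
}

def run_process(name: str, data) -> dict:
    return FORMATTERS[name](data)

result = run_process("process1", ["ABC123", "XYZ453", "CDE987"])
# result["output"] == "Barcode\r\nABC123\r\nXYZ453\r\nCDE987"

The template definitions themselves (like the Process 2 CSV) could then live in a database keyed by process name, as suggested in the question.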