I would like to create a data frame from a dictionary by looping over a list of string column names, rather than slicing the dataframe directly. For instance
df = pd.DataFrame(np.random.randn(100,7), columns=list('ABCDEFG'))
list_of_cols = ['A','B','C']
dictslice = {'%s': df['%s'] % (elt for elt in list_of_cols), 'Z': np.ones(len(df))}
But I cannot have a format string outside of a string, so I am not sure how to proceed. I do not want a solution like
df[[list_of_cols]]
since I want to add more vectors to dictslice that may not necessarily be in df.
Can anyone help?
EDIT
I am a fool, it works with this:
dictslice = {'%s' % elt : df[elt] for elt in list_of_cols}
but this does not work (a dict comprehension cannot be mixed with a literal 'Z': ... entry inside the same braces, so it raises a SyntaxError):
dictslice = {'%s' % elt : df[elt] for elt in list_of_cols, 'Z': np.ones(len(df))}
This seems like something that can be done with simple variable access.
What's wrong with this:
df = pd.DataFrame(np.random.randn(100,7), columns=list('ABCDEFG'))
list_of_cols = ['A','B','C']
dictslice = dict([(elt, df[elt]) for elt in list_of_cols] + [('Z', np.ones(len(df)))])
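Alternatively (not in the original answer, and assuming Python 3.5+), PEP 448 dict unpacking lets you merge the comprehension with the extra key in one literal:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 7), columns=list('ABCDEFG'))
list_of_cols = ['A', 'B', 'C']

# merge the comprehension with the extra 'Z' key via ** unpacking
dictslice = {**{elt: df[elt] for elt in list_of_cols}, 'Z': np.ones(len(df))}
```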
Related
Can you help me with my algorithm in Python to parse a list, please?
List = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA', 'PPPP_TOTO_EHEH_IIII_SSSS_RRRR']
In this list, I have to get the last two words (PARENT_CHILD). For example, for PPPP_TOTO_TATA_TITI_TUTU, I only get TITI_TUTU.
In the case of duplicates, that is, when my list contains both PPPP_TOTO_TATA_TITI_TUTU and PPPP_TOTO_EHEH_TITI_TUTU (which would both give TITI_TUTU), I want to recover the GRANDPARENT for each of them instead: TATA_TITI_TUTU and EHEH_TITI_TUTU.
As long as the names are still duplicated, we take one more level up.
But in that case, once I have added the GRANDPARENT for EHEH_TITI_TUTU, I also want it added for all names that contain EHEH, so instead of OOOO_AAAAA I would like to have EHEH_OOOO_AAAAA and EHEH_IIII_SSSS_RRRR.
My final list =
['ZZZZ_XXXX', 'TATA_TITI_TUTU', 'MMMM_TITI_TUTU', 'EHEH_TITI_TUTU', 'EHEH_OOOO_AAAAA', 'EHEH_IIII_SSSS_RRRR']
Thank you in advance.
Here is the code I started to write:
json_paths = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU',
              'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA']
cols_name = []
for path in json_paths:
    acc = 2
    col_name = '_'.join(path.split('_')[-acc:])
    tmp = cols_name
    while col_name in tmp:
        acc += 1
        idx = tmp.index(col_name)
        cols_name[idx] = '_'.join(json_paths[idx].split('_')[-acc:])
        col_name = '_'.join(path.split('_')[-acc:])
        tmp = ['_'.join(item.split('_')[-acc:]) for item in json_paths].pop()
    cols_name.append(col_name)
    print(cols_name.index(col_name), col_name)
cols_name
help ... with ... algorithm
Use a dictionary as the initial container while iterating:
keys will be PARENT_CHILDs and values will be lists of grandparents.
>>> s = 'PPPP_TOTO_TATA_TITI_TUTU'
>>> d = collections.defaultdict(list)
>>> *_,grandparent,parent,child = s.rsplit('_',maxsplit=3)
>>> d['_'.join([parent,child])].append(grandparent)
>>> d
defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA']})
>>> s = 'PPPP_TOTO_EHEH_TITI_TUTU'
>>> *_,grandparent,parent,child = s.rsplit('_',maxsplit=3)
>>> d['_'.join([parent,child])].append(grandparent)
>>> d
defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA', 'EHEH']})
>>>
After iteration, determine whether a value holds multiple grandparents.
If it does, prepend each grandparent to the parent_child.
Additionally, find all the parent_childs with these grandparents and prepend their grandparents. To facilitate this, build a second dictionary during iteration: {grandparent: [list_of_children], ...}.
If the parent_child only has one grandparent, use it as-is.
Instead of splitting each string, the info could be extracted with a regular expression.
pattern = r'^.*?_([^_]*)_([^_]*_[^_]*)$'
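For instance, a quick sketch of the regex alternative using the pattern above (the end anchor pins the last two segments; the segment before them is the grandparent):

```python
import re

pattern = r'^.*?_([^_]*)_([^_]*_[^_]*)$'

# group 1 captures the grandparent, group 2 the PARENT_CHILD pair
m = re.match(pattern, 'PPPP_TOTO_TATA_TITI_TUTU')
grandparent, parent_child = m.group(1), m.group(2)
print(grandparent, parent_child)  # TATA TITI_TUTU
```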
I have two lists. The first is a list of strings called names, generated from the names of the corresponding csv files.
names = ['ID1','ID2','ID3']
I have loaded the csv files into individual pandas dataframes and then done some preprocessing which leaves me with a list of lists, where each element is the data of each dataframe:
dfs = [['car','fast','blue'],[],['red','bike','slow']]
As you can see it can happen that after preprocessing a dataframe could be empty, which leads to an empty list in dfs.
I would like to remove the element from this list and return its index. So far I have tried this, but I get no index when printing k:
k = [i for i,x in enumerate(dfs) if not x]
The reason I need this index is, so I can then look at removing the corresponding index element in list names.
The end results would look a bit like this:
names = ['ID1','ID3']
dfs = [['car','fast','blue'],['red','bike','slow']]
This way I can then save each individual dataframe as a csv file:
for df, name in zip(dfs, names):
    df.to_csv(name + '_.csv', index=False)
EDIT: I MADE A MISTAKE: The list of lists called dfs needs changing from [''] to []
You can use the built-in any() function:
k = [i for i, x in enumerate(dfs) if not any(x)]
The reason your
k = [i for i, x in enumerate(dfs) if not x]
doesn't work is that, regardless of what is in a list, as long as the list is not empty its truth value is True, and [''] is not empty.
The any() function takes an iterable and returns whether any of its elements is truthy. If there are no such elements, it returns False. The truth value of an empty string, '', is False.
EDIT: The question got edited, here is my updated answer:
You can try creating new lists:
names = ['ID1','ID2','ID3']
dfs = [['car','fast','blue'],[],['red','bike','slow']]
new_names = list()
new_dfs = list()
for i, x in enumerate(dfs):
    if x:
        new_names.append(names[i])
        new_dfs.append(x)
print(new_names)
print(new_dfs)
Output:
['ID1', 'ID3']
[['car', 'fast', 'blue'], ['red', 'bike', 'slow']]
If it doesn't work, try adding a print(x) to the loop to see what is going on:
names = ['ID1','ID2','ID3']
dfs = [['car','fast','blue'],[],['red','bike','slow']]
new_names = list()
new_dfs = list()
for i, x in enumerate(dfs):
    print(x)
    if x:
        new_names.append(names[i])
        new_dfs.append(x)
Since you are already using enumerate, you do not have to loop again.
Hope this solves your problem:
names = ['ID1', 'ID2', 'ID3']
dfs = [['car', 'fast', 'blue'], [''], ['red', 'bike', 'slow']]
for index, i in enumerate(dfs):
    if len(i) == 1 and '' in i:
        del dfs[index]
        del names[index]
print(names)
print(dfs)
# Output
# ['ID1', 'ID3']
# [['car', 'fast', 'blue'], ['red', 'bike', 'slow']]
I think the issue is because of ['']:
l = ['']
len(l)
gives 1 as output. Hence,
not l
gives False.
If you are sure it will be [''] only, then try
dfs = [['car','fast','blue'],[''],['red','bike','slow']]
k = [i for i,x in enumerate(dfs) if len(x)==1 and x[0]=='']
This gives [1] as output.
Or you can try it with any(x).
Looking at the data presented, I would do the following:
Step 1: Check whether the list has any values. If it does, if df evaluates to True.
Step 2: Once you have the list, create a dataframe and write to csv.
The code is as shown below:
import pandas as pd

names = ['ID1','ID2','ID3']
dfs = [['car','fast','blue'],[],['red','bike','slow']]
dfx = {names[i]: df for i, df in enumerate(dfs) if df}
for name, val in dfx.items():
    df = pd.DataFrame({name: val})
    df.to_csv(name + '_.csv', index=False)
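As an aside, a sketch of an alternative not shown in the answers above: zip keeps each name paired with its data, so both lists can be filtered in a single pass:

```python
names = ['ID1', 'ID2', 'ID3']
dfs = [['car', 'fast', 'blue'], [], ['red', 'bike', 'slow']]

# keep only the (name, data) pairs whose data list is non-empty
pairs = [(n, d) for n, d in zip(names, dfs) if d]
names, dfs = (list(t) for t in zip(*pairs))
print(names)  # ['ID1', 'ID3']
print(dfs)    # [['car', 'fast', 'blue'], ['red', 'bike', 'slow']]
```

Note this assumes at least one pair survives the filter; otherwise zip(*pairs) has nothing to unpack.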
I'm working on cleaning up some of my code to make it a bit more pythonic, but I'm wondering if the below could be written in a nicer way with something like an itertools or pandas method. The below code works, however I'm hoping to remove the double for-loop and consolidate a bit of the code for performance reasons.
Ultimately, I'm working with a list of indices that call a Pandas column.
def foo(dataset):
    api_reshaped = pd.DataFrame(columns=['foo', 'bar'])
    k = 0
    for index, _ in dataset.iterrows():
        for key in dataset.iloc[index][0][0]:
            api_reshaped.loc[k, 'foo'] = key
            api_reshaped.loc[k, 'bar'] = dataset.iloc[index][0][0][key]
            k += 1
    return api_reshaped
Below is the expected input/output from this function:
foo_input = pd.DataFrame({
    'batch_data': [{'foo_query': [{'bar_query': 'data'}]}],
    'query_spell': ['foo']
})
print(foo(foo_input))
# expected_output = pd.DataFrame({
#     'foo': 'foo_query',
#     'bar': [{'bar_query': 'data'}]
# })
Many thanks!
You can use a list comprehension with a transpose:
# your input data
foo_input = pd.DataFrame({
'batch_data': [{'foo_query': [{'bar_query': 'data'}]}],
'query_spell': ['foo']
})
# use list comprehension with transpose
df = pd.DataFrame([item for item in foo_input['batch_data']]).T.reset_index()
# rename your columns
df.columns = ['Foo', 'Bar']
Foo Bar
0 foo_query [{'bar_query': 'data'}]
You can use applymap with a lambda function if you want to remove the list and just have a dict.
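A minimal sketch of that last step, assuming each Bar cell holds a one-element list (the frame is built as in the answer above):

```python
import pandas as pd

foo_input = pd.DataFrame({
    'batch_data': [{'foo_query': [{'bar_query': 'data'}]}],
    'query_spell': ['foo']
})
df = pd.DataFrame(list(foo_input['batch_data'])).T.reset_index()
df.columns = ['Foo', 'Bar']

# unwrap the one-element list in each Bar cell to expose the dict itself
df['Bar'] = df['Bar'].apply(lambda x: x[0] if isinstance(x, list) and len(x) == 1 else x)
print(df.loc[0, 'Bar'])  # {'bar_query': 'data'}
```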
I'd like to convert lists of formatted strings into a dictionary.
The strings are formatted like this:
str = 'abcd="efgh"'
And I'd like to get this into a dict like this:
d = {'abcd': 'efgh'}
Example:
l = ['abc="efg"', 'hij="klm"', 'nop="qrs"']
into:
d = {'abc': 'efg', 'hij': 'klm', 'nop' :'qrs'}
I tried the following:
d = dict(element.split('=') for element in l)
-> but this doesn't work
Thanks.
You can parse the list, break each element with the split method, and then add it to the dict. Sample code:
d = {}
for element in l:
    string_elements = element.split("=")
    d[string_elements[0]] = string_elements[1].replace('"', '')
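A more compact variant (a sketch, not from the answer above) uses a dict comprehension with str.strip to drop the quotes:

```python
l = ['abc="efg"', 'hij="klm"', 'nop="qrs"']

# split each entry once on '=' and strip the surrounding quotes from the value
d = {k: v.strip('"') for k, v in (element.split('=', 1) for element in l)}
print(d)  # {'abc': 'efg', 'hij': 'klm', 'nop': 'qrs'}
```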
I am trying to make 100 lists with names such as: list1, list2, list3, etc. Essentially what I would like to do is below (although I know it doesn't work, I am just not sure why).
num_lists = 100
while i < num_lists:
    intial_pressure_{}.format(i) = []
    centerline_temperature_{}.format(i) = []
And then I want to loop through each list, inserting data from a file, but I am unsure how to have the name of the list change in that loop, since I know this won't work:
while i < num_lists:
    initial_pressure_i[0] = value
I'm sure what I'm trying to do is really easy, but my experience with python is only a couple of days. Any help is appreciated.
Thanks
Instead of creating 100 list variables, you can create 100 lists inside of a list. Just do:
list_of_lists = [[] for _ in range(100)]
Then, you can access the lists inside it by index:
list_of_lists[0].append(some_value)        # first list
list_of_lists[1].append(some_other_value)  # second list
# ... and so on
Welcome to Python!
Reading your comments on what you are trying to do, I suggest ditching your current approach. Select an easier data structure to work with.
Suppose you have a list of files:
files = ['data1.txt', 'data2.txt',...,'dataN.txt']
Now you can loop over those files in turn:
data = {}
for file in files:
    data[file] = {}
    with open(file, 'r') as f:
        lines = [int(line.strip()) for line in f]
    data[file]['temps'] = lines[::2]       # even lines just read
    data[file]['pressures'] = lines[1::2]  # odd lines
Then you will have a dict of dicts of lists like so:
{'data1.txt': {'temps': [1, 2, 3,...], 'pressures': [1,2,3,...]},
'data2.txt': {'temps': [x,y,z,...], 'pressures': [...]},
...}
Then you can get your maxes like so:
max(data['data1.txt']['temps'])
Just so you can see what the data will look like, run this:
data = {}
for i in range(100):
    item = 'file' + str(i)
    data[item] = {}
    kind_like_file_of_nums = [float(x) for x in range(10)]
    data[item]['temps'] = kind_like_file_of_nums[0::2]
    data[item]['pres'] = kind_like_file_of_nums[1::2]
print(data)
You could just make a dictionary of lists. Here's an example found in a similar thread:
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> a = ['1', '2']
>>> for i in a:
... for j in range(int(i), int(i) + 2):
... d[j].append(i)
...
>>> d
defaultdict(<type 'list'>, {1: ['1'], 2: ['1', '2'], 3: ['2']})
>>> d.items()
[(1, ['1']), (2, ['1', '2']), (3, ['2'])]
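A minimal sketch of that idea applied to the original question, with a hypothetical rows list standing in for the (list_index, value) pairs read from a file:

```python
from collections import defaultdict

# keys replace names like initial_pressure_0, initial_pressure_1, ...
initial_pressure = defaultdict(list)

# hypothetical data: (list_index, value) pairs as they might come from a file
rows = [(0, 101.3), (0, 99.8), (1, 100.1)]
for i, value in rows:
    initial_pressure[i].append(value)
print(initial_pressure[0])  # [101.3, 99.8]
```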