I would like to loop through a bunch of .txt files, processing each one (removing columns, renaming things, handling NaNs, etc.) to get a final dataframe output, df1, which has certain date, lat, lon, and variable columns. Over the loop, I would like to build df_all, holding the information from all the files (most likely in date order).
However, each of my dataframes is a different length, and some of them may share the same date + lat/lon values.
I have written code to read in and process the files individually, but I'm stuck on how to turn this into a larger loop (via concat/append...?).
I am trying to end up with one large dataframe (df_all) that contains all the 'scattered' information from the different files (the df1 outputs). In addition, where rows conflict on date and lat/lon, I would like to take their mean. Is this possible to do in Python/pandas?
Any help on any of these issues would be greatly appreciated, as would ideas on how to go about this!
Here are fake tables that are read in by a for loop and concatenated into one big table. After all rows are added to the big table, you can group together rows that share the same value in the A column and take the mean of the B and C columns, as an example. You should be able to run this chunk of code yourself, and I hope it gives you keywords to search for other questions similar to yours!
import pandas as pd

# Making fake table read-ins; you'd be using pd.read_csv or similar
def fake_read_table(name):
    small_df1 = pd.DataFrame({'A': {0: 5, 1: 1, 2: 3, 3: 1}, 'B': {0: 4, 1: 4, 2: 4, 3: 4}, 'C': {0: 2, 1: 1, 2: 4, 3: 1}})
    small_df2 = pd.DataFrame({'A': {0: 4, 1: 5, 2: 1, 3: 4, 4: 3, 5: 2, 6: 5, 7: 1}, 'B': {0: 3, 1: 1, 2: 1, 3: 1, 4: 5, 5: 1, 6: 4, 7: 2}, 'C': {0: 4, 1: 1, 2: 5, 3: 2, 4: 4, 5: 4, 6: 5, 7: 2}})
    small_df3 = pd.DataFrame({'A': {0: 2, 1: 2, 2: 4, 3: 3, 4: 1, 5: 4, 6: 5}, 'B': {0: 1, 1: 2, 2: 3, 3: 1, 4: 3, 5: 5, 6: 4}, 'C': {0: 5, 1: 2, 2: 3, 3: 3, 4: 5, 5: 4, 6: 5}})
    if name == '1.txt':
        return small_df1
    if name == '2.txt':
        return small_df2
    if name == '3.txt':
        return small_df3

# Start here
txt_paths = ['1.txt', '2.txt', '3.txt']
big_df = pd.DataFrame()
for txt_path in txt_paths:
    small_df = fake_read_table(txt_path)
    # .. do whatever processing you need somewhere in here ..
    big_df = pd.concat((big_df, small_df))

# Taking the average B and C values for rows that have the same A value
agg_df = big_df.groupby('A').agg(
    mean_B=('B', 'mean'),
    mean_C=('C', 'mean'),
).reset_index()
print(agg_df)
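To map this onto your actual case, here is a minimal sketch, assuming your processed frames end up with columns literally named date, lat, and lon plus numeric variable columns (those names, the tab separator, and the process_file stub are all assumptions; swap in your own single-file code). Collecting the frames in a list and calling pd.concat once is also a bit cheaper than concatenating inside the loop:
import glob
import pandas as pd

def process_file(path):
    # Hypothetical per-file processing: drop columns, rename, handle NaNs, ...
    # Replace this stub with your existing code that returns df1.
    return pd.read_csv(path, sep='\t')

frames = [process_file(p) for p in sorted(glob.glob('*.txt'))]
df_all = pd.concat(frames, ignore_index=True)

# Where date + lat/lon coincide across files, average the variable columns
df_all = (df_all.groupby(['date', 'lat', 'lon'], as_index=False)
                .mean(numeric_only=True)
                .sort_values('date'))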
Is it possible to apply json.loads to multiple columns? If I do something like:
df['col1'] = df['col1'].apply(json.loads)
I can apply it to each entry in col1 and everything is fine. But if I do something like,
df[['col1', 'col2', 'col3']] = df[['col1', 'col2', 'col3']].apply(json.loads)
I get the error:
TypeError: the JSON object must be str, bytes or bytearray, not Series.
Why doesn't this way work? Is it possible to apply it all at once or should I just do each column individually?
Per https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html, apply applies a function along an axis of the DataFrame, so what gets passed to your function is a whole Series (a column by default, or a row with axis=1), not an individual cell. That is why json.loads receives a Series and raises the TypeError.
One way is to write a function that operates on the row Series. The example below doesn't apply json.loads, because the sample input would need to be JSON, but you can change the apply_transform function to meet your needs.
(Sorry for the to_dict() output, but I have trouble getting dataframe output into text editors.)
import pandas as pd
import numpy as np
columns = ["col1", "col2", "col3", "col4", "col5"]
df = pd.DataFrame(np.random.randint(0,5,size=(5, 5)), columns=columns)
df.to_dict()
# {'col1': {0: 4, 1: 3, 2: 3, 3: 1, 4: 2},
# 'col2': {0: 1, 1: 1, 2: 3, 3: 3, 4: 1},
# 'col3': {0: 1, 1: 2, 2: 3, 3: 4, 4: 3},
# 'col4': {0: 0, 1: 1, 2: 1, 3: 3, 4: 2},
# 'col5': {0: 3, 1: 2, 2: 0, 3: 2, 4: 0}}
You will notice that after the transformation, the values in the first three columns have been doubled while col4 and col5 are untouched. You would replace the multiplication with your own transform.
def apply_transform(row):
    new_row = row.copy()
    for col in ['col1', 'col2', 'col3']:
        new_row[col] = new_row[col] * 2  # apply your own transform here
    return new_row
df_new = df.apply(apply_transform, axis=1)
df_new.to_dict()
# {'col1': {0: 8, 1: 6, 2: 6, 3: 2, 4: 4},
# 'col2': {0: 2, 1: 2, 2: 6, 3: 6, 4: 2},
# 'col3': {0: 2, 1: 4, 2: 6, 3: 8, 4: 6},
# 'col4': {0: 0, 1: 1, 2: 1, 3: 3, 4: 2},
# 'col5': {0: 3, 1: 2, 2: 0, 3: 2, 4: 0}}
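Coming back to the original json.loads question: since apply hands your function a whole row or column, an element-wise method is the more direct tool here. Below is a minimal sketch using DataFrame.applymap (assuming the columns hold JSON strings; in pandas >= 2.1 the same element-wise method is also available as DataFrame.map):
import json
import pandas as pd

df = pd.DataFrame({
    'col1': ['{"a": 1}', '{"a": 2}'],
    'col2': ['[1, 2]', '[3, 4]'],
})

cols = ['col1', 'col2']
# applymap is element-wise, so json.loads receives a str, not a Series
df[cols] = df[cols].applymap(json.loads)
print(df.loc[0, 'col1'])  # {'a': 1}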
I've reviewed many questions on Stack Overflow that may be related to my question: Q1, Q2, Q3, Q4. None of them are related to it, and apart from these I examined almost 20 questions here.
I created a sample code block to explain my problem simply. My aim is to add data to a dictionary in a for loop.
When I run the code block below, the output is as follows.
from collections import defaultdict

dictionary = defaultdict(int)
for uid in range(10):
    for i in range(5):
        distance = 2 * i
        dictionary[uid] = distance
Output:
defaultdict(<class 'int'>, {0: 8, 1: 8, 2: 8, 3: 8, 4: 8, 5: 8, 6: 8, 7: 8, 8: 8, 9: 8})
My aim is to keep each key across the loop and accumulate the values under it.
Expected Output:
{0: {0, 2, 4, 6, 8}, 1: {0, 2, 4, 6, 8}, 2: {0, 2, 4, 6, 8}, ...}
My Solution
from collections import defaultdict

dictionary = defaultdict(int)
for uid in range(10):
    for i in range(5):
        distance = 2 * i
        dictionary[uid].append(distance)
My solution approach doesn't work either: defaultdict(int) produces integers, and an int has no .append() method, so this raises an AttributeError.
Try this:
from collections import defaultdict

dictionary = defaultdict(set)
for uid in range(10):
    for i in range(5):
        distance = 2 * i
        dictionary[uid].add(distance)
Output (dictionary):
defaultdict(set,
{0: {0, 2, 4, 6, 8},
1: {0, 2, 4, 6, 8},
2: {0, 2, 4, 6, 8},
3: {0, 2, 4, 6, 8},
4: {0, 2, 4, 6, 8},
5: {0, 2, 4, 6, 8},
6: {0, 2, 4, 6, 8},
7: {0, 2, 4, 6, 8},
8: {0, 2, 4, 6, 8},
9: {0, 2, 4, 6, 8}})
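As a side note, if you wanted to preserve insertion order or keep duplicate distances, the same pattern works with defaultdict(list) and .append(), which seems to be what the original attempt was reaching for:
from collections import defaultdict

dictionary = defaultdict(list)
for uid in range(10):
    for i in range(5):
        dictionary[uid].append(2 * i)

print(dictionary[0])  # [0, 2, 4, 6, 8]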
I have a function that returns an array of values and another array of dictionaries. All the dictionaries should be different, but it returns the same one repeated.
When I print from inside the function, I get the correct values, for example:
{0: 0}
{0: 1, 1: 0}
{0: 2, 1: 1, 2: 0}
{0: 3, 1: 5, 2: 0}
{0: 4, 1: 1, 2: 0}
{0: 5, 1: 0, 2: 0}
{0: 6, 1: 5, 2: 0}
{0: 7, 1: 2, 2: 1, 3: 0}
But when I return the array I get this (the wrong answer):
([0, 2, 4, 4, 6, 1, 6, 5], # This array is correct
[{0: 7, 1: 2, 2: 1, 3: 0}, # From here is incorrect
{0: 7, 1: 2, 2: 1, 3: 0},
{0: 7, 1: 2, 2: 1, 3: 0},
{0: 7, 1: 2, 2: 1, 3: 0},
{0: 7, 1: 2, 2: 1, 3: 0},
{0: 7, 1: 2, 2: 1, 3: 0},
{0: 7, 1: 2, 2: 1, 3: 0},
{0: 7, 1: 2, 2: 1, 3: 0}])
This is the fragment of code with the problem:
...
for i in range(n):
    j = i
    count = 0
    while parent[j] != -1:
        s[i][count] = j
        count = count + 1
        j = parent[j]
    s[i][count] = start
    ###########
    print(s[i])
    ###########
return dist, s
I think this is what you're looking for. The symptom (every returned dictionary being identical) usually means every s[i] refers to the same underlying object, so each pass overwrites what the previous ones stored; building a fresh container on every iteration avoids that:
for i in range(n):
    j = i
    s = []
    while parent[j] != -1:
        s.append(j)
        j = parent[j]
    s.append(start)
    path[i] = s[::-1]
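For reference, here is a minimal, hypothetical reproduction of that shared-reference pitfall, which is a common way to end up with n copies of the same dict:
# n references to the SAME dict: mutating one mutates them all
s = [{}] * 3
s[0][0] = 'a'
print(s)  # [{0: 'a'}, {0: 'a'}, {0: 'a'}]

# n independent dicts: mutations stay local
s = [{} for _ in range(3)]
s[0][0] = 'a'
print(s)  # [{0: 'a'}, {}, {}]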
I am trying to track seen elements, from a big array, using a dict.
Is there a way to force a dictionary object to be integer type and set to zero by default upon initialization?
I have done this with some very clunky code and two loops.
Here is what I do now:
fl = [0, 1, 1, 2, 1, 3, 4]
seenit = {}
for val in fl:
    seenit[val] = 0

for val in fl:
    seenit[val] = seenit[val] + 1
Of course, just use collections.defaultdict([default_factory[, ...]]):
from collections import defaultdict
fl = [0, 1, 1, 2, 1, 3, 4]
seenit = defaultdict(int)
for val in fl:
    seenit[val] += 1

print(seenit)
# Output
defaultdict(<class 'int'>, {0: 1, 1: 3, 2: 1, 3: 1, 4: 1})
print(dict(seenit))
# Output
{0: 1, 1: 3, 2: 1, 3: 1, 4: 1}
Alternatively, if you don't want to import collections, you can use dict.get(key[, default]):
fl = [0, 1, 1, 2, 1, 3, 4]
seenit = {}
for val in fl:
    seenit[val] = seenit.get(val, 0) + 1
print(seenit)
# Output
{0: 1, 1: 3, 2: 1, 3: 1, 4: 1}
Also, if you just want the problem solved and don't strictly need a plain dictionary, you may use collections.Counter([iterable-or-mapping]):
from collections import Counter
fl = [0, 1, 1, 2, 1, 3, 4]
seenit = Counter(fl)
print(seenit)
# Output
Counter({1: 3, 0: 1, 2: 1, 3: 1, 4: 1})
print(dict(seenit))
# Output
{0: 1, 1: 3, 2: 1, 3: 1, 4: 1}
Both collections.defaultdict and collections.Counter can be indexed like a dictionary[key] and support .keys(), .values(), .items(), etc. Basically, they are subclasses of the plain dictionary.
If you want to talk about performance, I checked with timeit.timeit() the creation of the dictionary plus the counting loop, over a million executions:
collections.defaultdict: 2.160868141 seconds
dict.get: 1.3540439499999999 seconds
collections.Counter: 4.700308418999999 seconds
collections.Counter may be easier, but it is much slower.
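The exact benchmark isn't shown here, but a minimal sketch of how such a measurement might look:
from collections import Counter, defaultdict
from timeit import timeit

fl = [0, 1, 1, 2, 1, 3, 4]

def with_defaultdict():
    seenit = defaultdict(int)
    for val in fl:
        seenit[val] += 1

def with_get():
    seenit = {}
    for val in fl:
        seenit[val] = seenit.get(val, 0) + 1

def with_counter():
    seenit = Counter(fl)

for fn in (with_defaultdict, with_get, with_counter):
    print(fn.__name__, timeit(fn, number=1_000_000))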
You can use collections.Counter:
from collections import Counter
Counter([0, 1, 1, 2, 1, 3, 4])
Output:
Counter({1: 3, 0: 1, 2: 1, 3: 1, 4: 1})
You can then address it like a dictionary:
>>> Counter({1: 3, 0: 1, 2: 1, 3: 1, 4: 1})[1]
3
>>> Counter({1: 3, 0: 1, 2: 1, 3: 1, 4: 1})[0]
1
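A Counter also gives you most_common() for free, which a plain dict does not:
>>> from collections import Counter
>>> Counter([0, 1, 1, 2, 1, 3, 4]).most_common(2)
[(1, 3), (0, 1)]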
Using val in seenit is a bit faster than .get():
seenit = dict()
for val in fl:
    if val in seenit:
        seenit[val] += 1
    else:
        seenit[val] = 1
For larger lists, Counter will eventually outperform all the other approaches, and defaultdict is going to be faster than using .get() or val in seenit.
I have a dictionary like this:
dict1 = {0: set([1, 4, 5]), 1: set([2, 6]), 2: set([3]), 3: set([0]), 4: set([1]), 5: set([2]), 6: set([])}
and from this dictionary I want to build another dictionary that counts, for each key of dict1, how many of the other values (sets) contain it; that is, the result should be:
result_dict = {0: 1, 1: 2, 2: 2, 3: 1, 4: 1, 5: 1, 6: 1}
My code was this :
dict1 = {0: set([1, 4, 5]), 1: set([2, 6]), 2: set([3]), 3: set([0]), 4: set([1]), 5:set([2]), 6: set([])}
result_dict = {}
for pair in dict1.keys():
    temp_dict = list(dict1.keys())
    del temp_dict[pair]
    count = 0
    for other_pairs in temp_dict:
        if pair in dict1[other_pairs]:
            count = count + 1
    result_dict[pair] = count
The problem with this code is that it is very slow on a large set of data.
Another attempt was a single line, like this:
result_dict = dict((key ,dict1.values().count(key)) for key in dict1.keys())
but it gives me wrong results, since values of dict1 are sets:
{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0}
Thanks a lot in advance.
I suppose, for a first stab, I would figure out which values are there:
all_values = set().union(*dict1.values())
Then I'd try to count how many times each value occurred:
result_dict = {}
for v in all_values:
    result_dict[v] = sum(v in dict1[key] for key in dict1)
Another approach would be to use a collections.Counter:
result_dict = Counter(v for set_ in dict1.values() for v in set_)
This is probably "cleaner" than my first solution, but it does involve a nested comprehension, which can be a little difficult to grok. It does work, however:
>>> from collections import Counter
>>> dict1
{0: set([1, 4, 5]), 1: set([2, 6]), 2: set([3]), 3: set([0]), 4: set([1]), 5: set([2]), 6: set([])}
>>> result_dict = Counter(v for set_ in dict1.values() for v in set_)
Just create a second dictionary using the keys from dict1, with values initialized at 0. Then iterate through the values in the sets of dict1, incrementing the values of result_dict as you go. The runtime is O(n), where n is the total number of values across the sets of dict1.
dict1 = {0: set([1, 4, 5]), 1: set([2, 6]), 2: set([3]), 3: set([0]), 4: set([1]), 5:set([2]), 6: set([])}
result_dict = dict.fromkeys(dict1.keys(), 0)
# {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0}
for i in dict1.keys():
    for j in dict1[i]:
        result_dict[j] += 1

print(result_dict)
# {0: 1, 1: 2, 2: 2, 3: 1, 4: 1, 5: 1, 6: 1}