df apply function in loop overrides prior values - python

so my df looks like this:
x y group
0 53 10 csv1
1 53 10 csv1
2 48 9 csv0
3 48 9 csv0
4 48 9 csv0
... ... ... ...
I have some files whose names depend on the group name, and I want to use them in a function alongside the x and y values.
what I am doing so far is the following:
dfGrouped = df.groupby('group')  # group the dataframe
df['newcol'] = np.nan  # create a new empty column

# use a for loop to load the file for each group; the files are very large,
# which is why each one should only be loaded once per group
for name, group in dfGrouped:
    file = open(name + '.txt')  # open the file
    df['newcol'] = df[df['group'] == name].apply(lambda row: newValueFromFile(row.x, row.y, file), axis=1)
It seemed to work at first; unfortunately, newcol only holds the values from the last loop iteration and seems to override the values created earlier with NaN. Does anybody have an idea?

Instead of file = open, use
with open('filename.txt', 'a') as file:
and then use file.write... in the lambda expression.
The 'a' in the open call means the data should be appended to the existing file; I guess you are currently overwriting the content of the file.
with open() also takes care of automatically closing the file once you're done with it.
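For the column-overwriting problem itself: df['newcol'] = df[df['group'] == name].apply(...) replaces the entire column on every pass, so rows outside the current group become NaN again. A minimal sketch of a mask-based assignment that only writes the current group's rows (newValueFromFile is the asker's own helper and is assumed to exist):

import numpy as np
import pandas as pd

df['newcol'] = np.nan
for name, group in df.groupby('group'):
    with open(name + '.txt') as file:
        mask = df['group'] == name
        # .loc writes only the rows of the current group,
        # so earlier groups keep their computed values
        df.loc[mask, 'newcol'] = df[mask].apply(lambda row: newValueFromFile(row.x, row.y, file), axis=1)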

Related

python pandas: fulfill condition and assign a value to it

I am really hoping you can help me here... I need to assign a label (df_label) to an exact file within a dataframe (df_data) and save all labels that appear in each file in a separate txt file (that's the easy bit).
df_data:
file_name file_start file_end
0 20190201_000004.wav 0.000 1196.000
1 20190201_002003.wav 1196.000 2392.992
2 20190201_004004.wav 2392.992 3588.992
3 20190201_010003.wav 3588.992 4785.984
4 20190201_012003.wav 4785.984 5982.976
df_label:
Begin Time (s)
0 27467.100000
1 43830.400000
2 43830.800000
3 46378.200000
I have tried switching to np.array and using a for loop with np.where, but without any success...
If the time values in df_label fall under exactly one entry in df_data, you can use the following
def get_file_name(begin_time):
    file_names = df_data[
        (df_data["file_start"] <= begin_time)
        & (df_data["file_end"] >= begin_time)
    ]["file_name"].values
    return file_names[0] if file_names.size > 0 else None

df_label["file_name"] = df_label["Begin Time (s)"].apply(get_file_name)
This will add another column, file_name, to df_label.
If the labels in df_label match the order of the files in df_data, you can simply:
add the labels as new column of df_data (df_data["label"] = df_label["Begin Time (s)"]).
or
use the DataFrame.merge() function (df_data = df_data.merge(df_label, left_index=True, right_index=True)).
More about merging/joining, with examples, can be found here:
https://thispointer.com/pandas-how-to-merge-dataframes-by-index-using-dataframe-merge-part-3/
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
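A minimal sketch of the index-based merge mentioned above, using toy frames that stand in for df_data and df_label (assumed to be aligned row for row):

import pandas as pd

df_data = pd.DataFrame({'file_name': ['20190201_000004.wav', '20190201_002003.wav']})
df_label = pd.DataFrame({'Begin Time (s)': [27467.1, 43830.4]})

# index-aligned merge: row i of df_label lands next to row i of df_data
df_data = df_data.merge(df_label, left_index=True, right_index=True)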

How to read a file containing more than one record type within?

I have a .csv file that contains 3 types of records, each with a different number of columns.
I know the structure of each record type, and the rows always come as type1 first, then type2, and type3 at the end, but I don't know how many rows of each record type there are.
The first 4 characters of each row define the record type of that row.
CSV Example:
typ1,John,Smith,40,M,Single
typ1,Harry,Potter,22,M,Married
typ1,Eva,Adams,35,F,Single
typ2,2020,08,16,A
typ2,2020,09,02,A
typ3,Chevrolet,FC101TT,2017
typ3,Toyota,CE972SY,2004
How can I read it with Pandas? It doesn't matter if I have to read one record type at a time.
Thanks!!
Here is a pandas solution.
First we must read the csv file in a way that makes pandas keep each entire line in a single cell. We do that by simply using a "wrong" separator, such as '#'. It can be whatever we want, as long as we can guarantee it never appears in our data file.
import pandas as pd

wrong_sep = '#'
right_sep = ','
df = pd.read_csv('my_file.csv', sep=wrong_sep, header=None).iloc[:, 0]
The header=None is needed so the first line is kept as data instead of being consumed as column names; the .iloc[:, 0] is a quick way to convert the one-column DataFrame into a Series.
Then we use a loop to select the rows that belong to each data structure based on their starting characters. Now we use the "right separator" (probably a comma ',') to split the desired data into real DataFrames.
starters = ['typ1', 'typ2', 'typ3']

detected_dfs = dict()
for start in starters:
    _df = df[df.str.startswith(start)].str.split(right_sep, expand=True)
    detected_dfs[start] = _df
And here you go. If we print the resulting DataFrames, we get:
      0      1       2   3  4        5
0  typ1   John   Smith  40  M   Single
1  typ1  Harry  Potter  22  M  Married
2  typ1    Eva   Adams  35  F   Single

      0     1   2   3  4
3  typ2  2020  08  16  A
4  typ2  2020  09  02  A

      0          1        2     3
5  typ3  Chevrolet  FC101TT  2017
6  typ3     Toyota  CE972SY  2004
Let me know if it helped you!
Not Pandas:
from collections import defaultdict

filename2 = 'Types.txt'
with open(filename2) as dataLines:
    nL = dataLines.read().splitlines()

defDList = defaultdict(list)
subs = ['typ1', 'typ2', 'typ3']
# collect each line under the record type it starts with
for i in subs:
    for j in nL:
        if j.startswith(i):
            defDList[i].append(j)

print(defDList)
Output:
defaultdict(<class 'list'>, {'typ1': ['typ1,John,Smith,40,M,Single', 'typ1,Harry,Potter,22,M,Married', 'typ1,Eva,Adams,35,F,Single'], 'typ2': ['typ2,2020,08,16,A', 'typ2,2020,09,02,A'], 'typ3': ['typ3,Chevrolet,FC101TT,2017', 'typ3,Toyota,CE972SY,2004']})
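If you do want DataFrames after all, a minimal follow-up sketch (assuming pandas is acceptable): each value in defDList is a list of raw csv lines, so they can be joined and re-parsed.

import pandas as pd
from io import StringIO

# re-parse each group's raw lines into its own DataFrame
dfs = {key: pd.read_csv(StringIO('\n'.join(lines)), header=None)
       for key, lines in defDList.items()}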
You can make use of the skiprows parameter of pandas' read_csv method to skip the rows you are not interested in for a particular record type. The following gives you a dictionary dfs with a DataFrame for each type. An advantage is that records of the same type don't have to be adjacent to each other in the csv file.
For larger files you might want to adjust the code so that the file is only read once instead of twice.
import pandas as pd
from collections import defaultdict

indices = defaultdict(list)
types = ['typ1', 'typ2', 'typ3']
filename = 'test.csv'

# first pass: record which line numbers belong to which record type
with open(filename) as csv:
    for idx, line in enumerate(csv.readlines()):
        for typ in types:
            if line.startswith(typ):
                indices[typ].append(idx)

# second pass: read the file once per type, skipping all other rows
dfs = {typ: pd.read_csv(filename, header=None,
                        skiprows=lambda x: x not in indices[typ])
       for typ in types}
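A quick usage sketch of the resulting dict (note that each frame is parsed separately, so column counts differ per type, and numeric-looking fields such as '08' come out as integers):

print(dfs['typ2'])
#       0     1  2   3  4
# 0  typ2  2020  8  16  A
# 1  typ2  2020  9   2  A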
Read the file as a CSV file using the CSV reader. The reader fortunately does not care about line formats:
import csv

with open("yourfile.csv") as infile:
    data = list(csv.reader(infile))
Collect the rows with the same first element and build a dataframe of them:
import pandas as pd
from itertools import groupby

dfs = [pd.DataFrame(v) for _, v in groupby(data, lambda x: x[0])]
You've got a list of three dataframes (or as many as necessary).
dfs[1]
#       0     1   2   3  4
# 0  typ2  2020  08  16  A
# 1  typ2  2020  09  02  A

Adding values from a CSV file

I am beginning to learn Python and am struggling with syntax.
I have a simple CSV file that looks like this
0.01,10,20,0.35,40,50,60,70,80,90,100
2,22,32,42,52,62,72,82,92,102,112
3,33,43,53,63,5647,83,93,103,113,123
I want to find the highest and lowest value in all the data in the csv file, ignoring the first value of each row.
So effectively the answer here would be
highestValue=5647
lowestValue=0.35
because the data that is looked at is as follows (the first value of each row is ignored):
10,20,0.35,40,50,60,70,80,90,100
22,32,42,52,62,72,82,92,102,112
33,43,53,63,5647,83,93,103,113,123
I would like my code to work for ANY row length.
I really have to admit I'm struggling, but here's what I've tried. I usually program in PHP, so this is all new to me. I have been working on this simple task for a day and can't fathom it out. I think I'm getting confused by terminology, 'lists' for example.
import numpy

test_data_file = open("Anaconda3JamesData/james_test_3.csv", "r")
test_data_list = test_data_file.readlines()
test_data_file.close()

for record in test_data_list:
    all_values = record.split(',')
    maxvalue = numpy.max(numpy.asfarray(all_values[1:]))
    print(maxvalue)
With the test data (the CSV file shown at the very top of this question) I would expect the answer to be
highestValue=5647
lowestValue=0.35
If you're using numpy, you can read your csv file as a numpy.ndarray using numpy.genfromtxt() and then use the array's .max() and .min() methods
import numpy
array = numpy.genfromtxt('Anaconda3JamesData/james_test_3.csv', delimiter=',')
array[:, 1:].max()
array[:, 1:].min()
The [:, 1:] part is using numpy's array indexing. It's saying take all the rows (the first [:, part), and for each row take all but the first column (the 1:] part) . This doesn't work with Python's built in lists.
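A tiny sketch of that difference:

import numpy

rows = [[1, 2, 3], [4, 5, 6]]
numpy.array(rows)[:, 1:]  # array([[2, 3], [5, 6]])
# rows[:, 1:]             # TypeError: list indices must be integers or slices, not tuple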
You're overwriting maxvalue each time through the loop, so you're just getting the max value from the last line, not the whole file. You need to compare with the previous maximum:
maxvalue = None
for record in test_data_list:
    all_values = record.split(',')
    if maxvalue is None:
        maxvalue = numpy.max(numpy.asfarray(all_values[1:]))
    else:
        maxvalue = max(maxvalue, numpy.max(numpy.asfarray(all_values[1:])))
You do not need the power of numpy for this problem. A simple CSV reader is good enough:
import csv

with open("Anaconda3JamesData/james_test_3.csv") as infile:
    r = csv.reader(infile)
    rows = [list(map(float, line))[1:] for line in r]

max(map(max, rows))
# 5647.0
min(map(min, rows))
# 0.35
I think using numpy is unneeded for this task. First of all, this:
test_data_file = open("Anaconda3JamesData/james_test_3.csv", "r")
test_data_list = test_data_file.readlines()
test_data_file.close()

for record in test_data_list:
can be simplified into this:
with open("Anaconda3JamesData/james_test_3.csv", "r") as test_data_file:
    for record in test_data_file:
We can use a list comprehension to read in all of the values:
with open("Anaconda3JamesData/james_test_3.csv", "r") as test_data_file:
    values = [float(val) for line in test_data_file for val in line.split(",")[1:]]
values now contains all relevant numbers, so we can just do:
highest_value = max(values)
lowest_value = min(values)
Here's a pandas solution that can give the desired results:
import pandas as pd

df = pd.read_csv('test1.csv', header=None)
# df:
#       0   1   2      3   4     5   6   7    8    9   10
# 0  0.01  10  20   0.35  40    50  60  70   80   90  100
# 1  2.00  22  32  42.00  52    62  72  82   92  102  112
# 2  3.00  33  43  53.00  63  5647  83  93  103  113  123

df = df.iloc[:, 1:]
print("Highest value: {}".format(df.values.max()))
print("Lowest value: {}".format(df.values.min()))

# Output:
# Highest value: 5647.0
# Lowest value: 0.35

add computed column to a csv file

I hope this isn't a classic beginner question; I have read and spent days trying to save my csv data without success.
I have a function that takes an input parameter that I supply manually. The function generates 3 columns that I save to a CSV file. When I call the function again with other inputs and want to save the new data to the right of the previously computed columns, pandas instead stacks the 3 new columns below the old ones, repeating the headings.
I'm using the next code to save my data:
data.to_csv('/Users/Computer/Desktop/Examples anaconda/data_new.csv', sep=',',mode='a')
and the result is:
dot lake mock
1 42 11.914558
2 41 42.446977
3 40 89.188668
dot lake mock
1 42 226.266513
2 41 317.768887
dot lake mock
3 42 560.171830
4 41 555.005333
What I want is:
dot lake mock mock mock
0 42 11.914558 226.266513 560.171830
1 41 42.446977 317.768887 555.005333
2 40 89.188668
UPDATE:
My DataFrame was generated using a function like this:
First I opened a csv file:
df1=pd.read_csv('current_state.csv')
def my_function(df1, photos, coords=['X', 'Y']):
    Hzs = t.copy()
    shifts = np.floor(Hzs / t_step).astype(np.int)
    ms = np.zeros(shifts.size)
    delta_inv = np.arange(N + 1)
    dot = delta_inv[N:0:-1]
    lake = np.arange(1, N + 1)
    for i, shift in enumerate(shifts):
        diffs = df1[coords] - df1[coords].shift(-shift)
        sqdist = np.square(diffs).sum(axis=1)
        ms[i] = sqdist.sum()
    mock = np.divide(ms, dot)
    msds = pd.DataFrame({'dot': dot, 'lake': lake, 'mock': mock})
    return msds
data = my_function(df1, photos, coords=['X', 'Y'])
print(data)
data.to_csv('/Users/Computer/Desktop/Examples anaconda/data_new.csv', sep=',', mode='a')
I looked for several days for a way to write several computed columns into a csv file, each one to the right of the previous. I finally found how to do this. If someone needs something similar:
First I save my data using to_csv:
data.to_csv('/Users/Computer/Desktop/Examples/data_new.csv', sep=',',mode='a', index=False)
Once the file has been generated with the headers, I drop the index that I don't need, and from then on I only run the following after each call of the function:
b = data
a = pd.read_csv('data_new.csv')
c = pd.concat ([a,b],axis=1, ignore_index=True)
c.to_csv('/Users/Computer/Desktop/Examples/data_new.csv', sep=',', index=False)
As a result I get the desired CSV file, and the function can be called as many times as you want!
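The same read-concat-write cycle can be wrapped in a small helper so repeated runs stay tidy. A hedged sketch (append_columns is a hypothetical name, and data_new.csv is assumed to already exist with the first batch of columns):

import pandas as pd

def append_columns(data, path='data_new.csv'):
    # read what is already on disk, put the new columns on the right, write back
    existing = pd.read_csv(path)
    combined = pd.concat([existing, data], axis=1, ignore_index=True)
    combined.to_csv(path, sep=',', index=False)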

Creating multiple spreadsheets in one excel file with Python Pandas

As the title says, I need to create multiple spreadsheets in one Excel file with Pandas. While this thread and this one
both provide solutions, I figured my situation is a bit different. Both of the cases use something similar to:
writer = pd.ExcelWriter('output.xlsx')
DF1.to_excel(writer,'Sheet1')
DF2.to_excel(writer,'Sheet2')
writer.save()
However, the problem is that I cannot afford to keep multiple dataframes in memory at the same time, since each of them is just too big. My data can be a complicated version of this:
df = pd.DataFrame(dict(A=list('aabb'), B=range(4), C=range(6,10)))
Out:
   A  B  C
0  a  0  6
1  a  1  7
2  b  2  8
3  b  3  9
I intend to use the items ['a', 'b', 'c'] in grplist to perform some calculation and eventually generate a separate spreadsheet for each case where data['A'] equals 'a' through 'c':
data = pd.read_csv(fileloc)
grplist = [['a', 'b', 'c'], ['d', 'e', 'f']]

for groups, numbers in zip(grplist, range(1, 5)):
    for category in groups:
        clean = data[(data['A'] == category) & (data['B'] == numbers)]['C']
        # --------My calculation to generate a dataframe--------
        my_result_df = pd.DataFrame(my_result)
        writer = pd.ExcelWriter('my_path_of_excel')
        my_result_df.to_excel(writer, 'Group%s_%s' % (numbers, category[:4]))
        writer.save()
        gc.collect()
Sadly, my code does not create multiple spreadsheets as groups and numbers are looped through. I only get the last result, in a single spreadsheet in my Excel file. What can I do?
This is my very first post here. I hope I'm following all the rules so this thread can end well. If anything needs to be modified or improved, please kindly let me know. Thanks for your help :)
consider the df
df = pd.DataFrame(dict(A=list('aabb'), B=range(4)))
loop through groups and print
for name, group in df.groupby('A'):
    print('{}\n\n{}\n\n'.format(name, group))

a

   A  B
0  a  0
1  a  1

b

   A  B
2  b  2
3  b  3
to_excel
writer = pd.ExcelWriter('output.xlsx')
for name, group in df.groupby('A'):
    group.to_excel(writer, name)
writer.save()
writer.close()
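As a side note, newer pandas versions also let ExcelWriter act as a context manager, which saves and closes the file automatically. A minimal sketch of the same idea:

import pandas as pd

df = pd.DataFrame(dict(A=list('aabb'), B=range(4)))

# the writer is created once, outside the loop; one sheet per group
with pd.ExcelWriter('output.xlsx') as writer:
    for name, group in df.groupby('A'):
        group.to_excel(writer, sheet_name=name)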
