I'm a beginner in Python and pandas, trying to figure out how to read from a CSV file in a particular way.
My data file:
01 AAA1234 AAA32452 AAA123123 0 -9 C C A A T G A G .......
01 AAA1334 AAA12452 AAA125123 1 -9 C A T G T G T G .......
...
...
...
I have 100,000 columns in this file, and I want to merge every two columns into one, but the merging needs to start after the first 6 columns. I would prefer to do this while reading the file, if possible, instead of manipulating this huge data file.
Desired outcome
01 AAA1234 AAA32452 AAA123123 0 -9 CC AA TG AG .......
01 AAA1334 AAA12452 AAA125123 1 -9 CA TG TG TG .......
...
...
...
That would result in a dataframe with half the columns. My data file has no column names; the names reside in a different CSV, but that is another subject.
I'd appreciate a solution, thanks in advance!
Separate the dataframe initially (I created one for experimental purposes). Then I defined a function and passed the dataframe that needed manipulation as an argument:
import pandas as pd

def columns_joiner(data):
    new_data = pd.DataFrame()
    for i in range(0, 11, 2):  # You can change the range to your wish
        # Here, I had only 12 columns (indices 0-11) to concatenate pairwise,
        # therefore the range ends at 11.
        ser = data[i] + data[i + 1]  # string concatenation of two adjacent columns
        new_data = pd.concat([new_data, ser], axis=1)
    return new_data
I don't think this is an efficient solution, but it worked for me.
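For a file this wide, a vectorized approach should be much faster than a per-column Python loop. A minimal sketch, assuming a whitespace-separated file named datafile.txt, everything read as strings, and an even number of columns after the first 6:

import pandas as pd

# Read everything as strings so '+' concatenates rather than adds.
df = pd.read_csv('datafile.txt', sep=r'\s+', header=None, dtype=str)

fixed = df.iloc[:, :6]   # the first 6 columns stay as they are
pairs = df.iloc[:, 6:]   # the columns to merge pairwise

# Elementwise string concatenation of every even-positioned column
# with its right-hand neighbour.
merged = pd.DataFrame(pairs.iloc[:, ::2].values + pairs.iloc[:, 1::2].values,
                      index=df.index)

result = pd.concat([fixed, merged], axis=1)

This still reads the whole file first, but it replaces the column loop with two array operations.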
So my df looks like this:
x y group
0 53 10 csv1
1 53 10 csv1
2 48 9 csv0
3 48 9 csv0
4 48 9 csv0
... ... ... ...
I have some files whose names depend on the group name, and I want to use them in a function alongside the x and y values.
What I am doing so far is the following:
dfGrouped = df.groupby('group')  # group the dataframe
df['newcol'] = np.nan  # create a new empty column

# Use a for loop to load the file depending on the group; note the file is
# very large, that's why I want to load it only once per group.
for name, group in dfGrouped:
    file = open(name + '.txt')  # open the file
    df['newcol'] = df[df['group'] == name].apply(lambda row: newValueFromFile(row.x, row.y, file), axis=1)
It seemed to work at first; unfortunately, newcol only holds the values from the last loop iteration and seems to overwrite the values created earlier with NaN. Does anybody have any idea?
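For context, the assignment df['newcol'] = ... replaces the whole column on every pass, and the right-hand side is NaN for every row outside the current group, which wipes out the earlier groups. A minimal sketch of one fix, assuming newValueFromFile as used in the question, is to assign only to the matching rows with .loc:

for name, group in df.groupby('group'):
    with open(name + '.txt') as file:  # opened once per group, closed automatically
        # Write only into the rows of this group; other rows keep their values.
        df.loc[df['group'] == name, 'newcol'] = group.apply(
            lambda row: newValueFromFile(row.x, row.y, file), axis=1)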
Instead of file = open(...), use
with open('filename.txt', 'a') as file:
and then file.write(...) for the lambda expression.
The 'a' in the open call tells it to append to the existing data; I guess currently you are overwriting the content of the file.
with open() also takes care of automatically closing the file once you're done with it.
I am beginning to learn Python and am struggling with the syntax.
I have a simple CSV file that looks like this:
0.01,10,20,0.35,40,50,60,70,80,90,100
2,22,32,42,52,62,72,82,92,102,112
3,33,43,53,63,5647,83,93,103,113,123
I want to find the highest and lowest values in all the data in the CSV file, except the first value of each row.
So effectively the answer here would be:
highestValue=5647
lowestValue=0.35
because the data that is looked at is as follows (the first value of each row is ignored):
10,20,0.35,40,50,60,70,80,90,100
22,32,42,52,62,72,82,92,102,112
33,43,53,63,5647,83,93,103,113,123
I would like my code to work for ANY row length.
I really have to admit I'm struggling, but here's what I've tried. I usually program in PHP, so this is all new to me. I have been working on this simple task for a day and can't figure it out. I think I'm getting confused by the terminology, 'lists' for example.
import numpy

test_data_file = open("Anaconda3JamesData/james_test_3.csv", "r")
test_data_list = test_data_file.readlines()
test_data_file.close()

for record in test_data_list:
    all_values = record.split(',')
    maxvalue = numpy.max(numpy.asfarray(all_values[1:]))
    print(maxvalue)
With the test data (the CSV file shown at the very top of this question) I would expect the answer to be
highestValue=5647
lowestValue=0.35
If you're using numpy, you can read your CSV file as a numpy.ndarray using numpy.genfromtxt() and then use the array's .max() and .min() methods:
import numpy
array = numpy.genfromtxt('Anaconda3JamesData/james_test_3.csv', delimiter=',')
array[:, 1:].max()
array[:, 1:].min()
The [:, 1:] part uses numpy's array indexing: take all the rows (the [:, part), and for each row take all but the first column (the 1:] part). This doesn't work with Python's built-in lists.
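To make the slicing concrete, here is a quick illustration with a small, made-up array:

import numpy

a = numpy.array([[0.01, 10, 20],
                 [2.00, 22, 32]])
print(a[:, 1:])        # [[10. 20.]
                       #  [22. 32.]]
print(a[:, 1:].max())  # 32.0
print(a[:, 1:].min())  # 10.0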
You're overwriting maxvalue each time through the loop, so you're just getting the max value from the last line, not the whole file. You need to compare with the previous maximum.
maxvalue = None
for record in test_data_list:
    all_values = record.split(',')
    if maxvalue is None:
        maxvalue = numpy.max(numpy.asfarray(all_values[1:]))
    else:
        maxvalue = max(maxvalue, numpy.max(numpy.asfarray(all_values[1:])))
You do not need the power of numpy for this problem. A simple CSV reader is good enough:
with open("Anaconda3JamesData/james_test_3.csv") as infile:
r = csv.reader(infile)
rows = [list(map(float, line))[1:] for line in r]
max(map(max, rows))
# 5647.0
min(map(min, rows))
# 0.35
I think numpy is unneeded for this task. First of all, this:
test_data_file = open("Anaconda3JamesData/james_test_3.csv", "r")
test_data_list = test_data_file.readlines()
test_data_file.close()
for record in test_data_list:
can be simplified into this:
with open("Anaconda3JamesData/james_test_3.csv","r") as test_data_file:
for record in test_data_file:
We can use a list comprehension to read in all of the values:
with open("Anaconda3JamesData/james_test_3.csv","r") as test_data_file:
values = [float(val) for line in test_data_file for val in line.split(",")[1:]]
values now contains all relevant numbers, so we can just do:
highest_value = max(values)
lowest_value = min(values)
Here's a pandas solution that can give the desired results:
import pandas as pd
df = pd.read_csv('test1.csv', header=None)
# df:
# 0 1 2 3 4 5 6 7 8 9 10
# 0 0.01 10 20 0.35 40 50 60 70 80 90 100
# 1 2.00 22 32 42.00 52 62 72 82 92 102 112
# 2 3.00 33 43 53.00 63 5647 83 93 103 113 123
df = df.iloc[:, 1:]
print("Highest value: {}".format(df.values.max()))
print("Lowest value: {}".format(df.values.min()))
# Output:
Highest value: 5647.0
Lowest value: 0.35
I hope this isn't a classic beginner question; however, I have read and spent days trying to save my CSV data without success.
I have a function that takes an input parameter I give manually. The function generates 3 columns that I save to a CSV file. When I use the function with other inputs and want to save the new data to the right of the previously computed columns, pandas instead stacks the new 3 columns below the earlier ones, repeating the headers.
I'm using the following code to save my data:
data.to_csv('/Users/Computer/Desktop/Examples anaconda/data_new.csv', sep=',',mode='a')
and the result is:
dot lake mock
1 42 11.914558
2 41 42.446977
3 40 89.188668
dot lake mock
1 42 226.266513
2 41 317.768887
dot lake mock
3 42 560.171830
4 41 555.005333
What I want is:
dot lake mock mock mock
0 42 11.914558 226.266513 560.171830
1 41 42.446977 317.768887 555.005333
2 40 89.188668
UPDATE:
My DataFrame was generated using a function like this:
First I opened a CSV file:
df1=pd.read_csv('current_state.csv')
def my_function(df1, photos, coords=['X', 'Y']):
    # t, t_step and N are defined earlier in my script
    Hzs = t.copy()
    shifts = np.floor(Hzs / t_step).astype(int)
    ms = np.zeros(shifts.size)
    delta_inv = np.arange(N + 1)
    dot = delta_inv[N:0:-1]
    lake = np.arange(1, N + 1)
    for i, shift in enumerate(shifts):
        diffs = df1[coords] - df1[coords].shift(-shift)
        sqdist = np.square(diffs).sum(axis=1)
        ms[i] = sqdist.sum()
    mock = np.divide(ms, dot)
    msds = pd.DataFrame({'dot': dot, 'lake': lake, 'mock': mock})
    return msds
data = my_function(df1, photos, coords=['X', 'Y'])
print(data)
data.to_csv('/Users/Computer/Desktop/Examples anaconda/data_new.csv', sep=',', mode='a')
I looked for several days for a way to write several computed columns into a CSV file right next to the existing ones, despite the unpleasant comments of some guys! I finally found out how to do it. If someone needs something similar:
First I save my data using to_csv:
data.to_csv('/Users/Computer/Desktop/Examples/data_new.csv', sep=',',mode='a', index=False)
After the file has been generated with the headers, I remove the index that I don't need, and from then on I just append each new result by running this at the end:
b = data
a = pd.read_csv('data_new.csv')
c = pd.concat([a, b], axis=1, ignore_index=True)
c.to_csv('/Users/Computer/Desktop/Examples/data_new.csv', sep=',', index=False)
As a result I got the desired CSV file, and it is possible to call the function as many times as you want!
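A variation on the same idea, as a minimal sketch: if all the inputs are known up front, you can collect each run's output in a list and write the file once, avoiding the read-concat-rewrite cycle on every call. The photo_sets list here is a hypothetical stand-in for the inputs given manually:

import pandas as pd

results = []
for photos in photo_sets:  # photo_sets: hypothetical list of the manual inputs
    results.append(my_function(df1, photos, coords=['X', 'Y']))

# Concatenate the runs side by side and write the file a single time.
combined = pd.concat(results, axis=1)
combined.to_csv('data_new.csv', sep=',', index=False)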
I need to save a column vector obtained in each iteration to a text file using Python.
This is what I have been using until now:
savetxt('displacement{0}.out'.format(globdat.cycle), a, delimiter=',', fmt='%10.4e')
globdat.cycle is used as a count so that in each iteration a separate file is made.
Requirement: I do not want separate files, but a single file which contains all the vectors, one column per iteration.
E.g., iteration 1 values = [ 1 2 3 4 5 6 ]' and iteration 2 values = [ a c v b f h ]'.
My text file should look something like this:
1,a
2,c
3,v
4,b
5,f
6,h
I would much appreciate some help.
Thanks
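One way to get that layout, as a minimal sketch: collect the vector from each iteration in a list, then stack them as columns and write a single file at the end with numpy.column_stack and numpy.savetxt. The vectors here are hypothetical stand-ins mirroring the example above:

import numpy

# Hypothetical per-iteration column vectors.
vectors = [['1', '2', '3', '4', '5', '6'],
           ['a', 'c', 'v', 'b', 'f', 'h']]

# Stack the vectors as columns and write one comma-separated file.
stacked = numpy.column_stack(vectors)
numpy.savetxt('displacement_all.out', stacked, delimiter=',', fmt='%s')

This produces exactly the 1,a / 2,c / ... layout shown above.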