Print Pandas without dtype - python

I've read a few other posts about this but the other solutions haven't worked for me. I'm trying to look at 2 different CSV files and compare data from 1 column from each file. Here's what I have so far:
import pandas as pd
import numpy as np
dataBI = pd.read_csv("U:/eu_inventory/EO BI Orders.csv")
dataOrderTrimmed = dataBI.iloc[:,1:2].values
dataVA05 = pd.read_csv("U:/eu_inventory/VA05_Export.csv")
dataVAOrder = dataVA05.iloc[:,1:2].values
dataVAList = []
ordersInBoth = []
ordersInBI = []
ordersInVA = []
for order in np.nditer(dataOrderTrimmed):
    if order in dataVAOrder:
        ordersInBoth.append(order)
    else:
        ordersInBI.append(order)
So if the order number from dataOrderTrimmed is also in dataVAOrder I want to add it to ordersInBoth, otherwise I want to add it to ordersInBI. I think it splits the information correctly but if I try to print ordersInBoth each item prints as array(5555555, dtype=int64) I want to have a list of the order numbers not as an array and not including the dtype information. Let me know if you need more information or if the way I've typed it out is confusing. Thanks!

The way you're using .iloc is giving you a DataFrame, which becomes a 2D array when you access .values. If you just want the values in the column at index 1, then you should just say:
dataOrderTrimmed = dataBI.iloc[:, 1].values
Then you can iterate over dataOrderTrimmed directly (i.e. you don't need nditer), and you will get regular scalar values.
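Put together, a minimal sketch of the fixed comparison might look like this (the file paths and column position are taken from the question; converting the VA05 column to a set is my own addition, assumed here to speed up the membership tests):
import pandas as pd

# Read both files; column index 1 is assumed to hold the order numbers.
dataBI = pd.read_csv("U:/eu_inventory/EO BI Orders.csv")
dataVA05 = pd.read_csv("U:/eu_inventory/VA05_Export.csv")

dataOrderTrimmed = dataBI.iloc[:, 1].values    # 1D array of scalar order numbers
dataVAOrder = set(dataVA05.iloc[:, 1].values)  # set for fast membership tests

ordersInBoth = []
ordersInBI = []
for order in dataOrderTrimmed:
    if order in dataVAOrder:
        ordersInBoth.append(order)
    else:
        ordersInBI.append(order)

print(ordersInBoth)  # plain numbers, no dtype information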


how to create multiple variables with similar names in a for loop?

I had a problem with for loops earlier, and it was solved thanks to @mak4515; however, there is something else I want to accomplish:
# Use pandas to read in csv file
data_df_0 = pd.read_csv('puget_sound_ctd.csv')
#create data subsets based on specific buoy coordinates
data_df_1 = pd.read_csv('puget_sound_ctd.csv', skiprows=range(9,114))
data_df_2 = pd.read_csv('puget_sound_ctd.csv', skiprows=([i for i in range(1, 8)] + [j for j in range(21, 114)]))
for x in range(0, 2):
    for df in [data_df_0, data_df_2]:
        lon_(x) = df['longitude']
        lat_(x) = df['latitude']
This is my current code. I want it to read the different data sets and create different variables based on the data set it is reading. However, when I run the code this way I get this error:
File "<ipython-input-66-446aebc48604>", line 21
lon_(x) = df['longitude']
^
SyntaxError: can't assign to function call
What does "can't assign to function call" mean, and how do I fix this?
I think the comment by @Chris is probably a good way to go. I wanted to point out that since you're already using pandas DataFrames, an easier way might be to add a column identifying the original dataframe and then concatenate them:
import pandas as pd
data_df_0 = pd.DataFrame({'longitude': range(-125, -120, 1), 'latitude': range(45, 50, 1)})
data_df_0['dfi'] = 0
data_df_2 = pd.DataFrame({'longitude': range(-120, -125, -1), 'latitude': range(50, 45, -1)})
data_df_2['dfi'] = 2
df = pd.concat([data_df_0, data_df_2])
Then you could access the rows that came from a given original frame like this:
df[df['dfi'] == 2]
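For completeness, the dictionary approach that the comment by @Chris likely refers to (a sketch of the general idea, not the comment's actual code) avoids the SyntaxError by keying on a number instead of trying to build variable names at runtime:
# Store the columns in dicts keyed by dataframe number instead of
# creating variables like lon_0 and lon_2 dynamically.
lon = {}
lat = {}
for i, df in [(0, data_df_0), (2, data_df_2)]:
    lon[i] = df['longitude']
    lat[i] = df['latitude']

print(lon[2])  # the longitude column of data_df_2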

Remove specific elements from a numpy array

I have an np.array from which I would like to remove specific elements based on the element "name" and not the index. Is this sort of thing possible with np.delete()?
Namely my original ndarray is
textcodes= data['CODES'].unique()
which captures unique text codes given the quarter.
Specifically, I want to remove certain codes that I need to run through a separate process and put into a separate ndarray:
sep_list = np.array(['SPCFC_CODE_1', 'SPCFC_CODE_2', 'SPCFC_CODE_3', 'SPCFC_CODE_4'])
I'm having trouble removing the codes in sep_list from textcodes because their index positions differ each quarter, so I would like to automate the removal based on the code names, which are always the same.
Any help is greatly appreciated. Thank you.
You should be able to do something like:
import numpy as np
data = [3,2,1,0,10,5]
bad_list = [1, 2]
data = np.asarray(data)
new_list = np.asarray([x for x in data if x not in bad_list])
print("BAD")
print(data)
print("GOOD")
print(new_list)
Yields:
BAD
[ 3 2 1 0 10 5]
GOOD
[ 3 0 10 5]
It is impossible to tell for sure since you did not provide sample data, but the following implementation using your variables should work:
import numpy as np
textcodes= data['CODES'].unique()
sep_list = np.array(['SPCFC_CODE_1','SPCFC_CODE_2','SPCFC_CODE_3','SPCFC_CODE_4'])
final_list = [x for x in textcodes if x not in sep_list]
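If you would rather stay in numpy than use a list comprehension, np.isin performs the same membership test in a vectorized way. A sketch with made-up codes (only sep_list comes from the question):
import numpy as np

textcodes = np.array(['SPCFC_CODE_1', 'OTHER_CODE_A', 'SPCFC_CODE_3', 'OTHER_CODE_B'])  # sample data
sep_list = np.array(['SPCFC_CODE_1', 'SPCFC_CODE_2', 'SPCFC_CODE_3', 'SPCFC_CODE_4'])

# Boolean mask: True where a code is NOT in sep_list.
keep = ~np.isin(textcodes, sep_list)
final_list = textcodes[keep]    # codes to keep
separated = textcodes[~keep]    # codes to route through the separate process
print(final_list)               # ['OTHER_CODE_A' 'OTHER_CODE_B']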

How to find the sum of a column from a csv file using vanilla python (without using numpy or pandas)?

I have tried a lot of different things to do this without using numpy or pandas. I have looked at similar posts, but I just can't get anything to work. How can I solve this?
The reason I want to do this is that I have read that I should avoid using packages whilst learning vanilla python. (https://chrisconlan.com/learning-python-without-library-overload/)
import csv
import numpy as np
import os
with open('ams_data.csv') as ams_data:
    read_csv = csv.reader(ams_data, delimiter=';')
    data = list(read_csv)
x_dagar, y = (len(data) - 1) // 24, np.genfromtxt(open('ams_data.csv', 'rb'), delimiter=';', skip_header=1)
A = np.delete(y, [0, 1], 1)
print(sum(A))
I get the output I want, but I do not want to use import numpy as np or any other package I have to download. How can I change my code so it does the same thing it does now, i.e. sums the float value of the last element of every row except the first?
Trying to solve this myself I have gotten;
[[1.152]
[0.91 ]
[0.773]
[0.766]
[0.898]
[0.628]
[1.76 ]
[2.58 ]
[2.026]
[2.774]
[1.746]
[1.089]
[0.884]
[0.816]
[0.847]]
but I need it without the brackets surrounding all the numbers ([1.152] + ... + [0.847]), which I believe is what I have to do to get the sum. Any help would be much appreciated! :D
As far as I understand, you want to read the CSV file without Numpy or Pandas, and then compute the sum of all columns but the first two (it seems), starting from the second row. You could do this, using list comprehensions:
with open('ams_data.csv') as ams_data:
    lines = ams_data.readlines()
data = [[float(elt) for elt in line.split(";")] for line in lines]
result = [sum(row[2:]) for row in data[1:]]
The conversion to float assumes that all elements in your CSV file are floats. The first row is excluded from the sum with data[1:], and the first two columns are excluded with row[2:]. I guess you can adapt from here.
Try doing this, reusing the data list your code already builds from the csv reader (the reader itself is exhausted after list() consumes it) -
my_list = [row[-1] for row in data]
print(sum([float(i) for i in my_list[1:]]))
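A complete, package-free version might look like this (a sketch; the file name and the ';' delimiter are taken from the question, and the last column of each row is assumed to hold the values to sum):
import csv

with open('ams_data.csv') as ams_data:
    read_csv = csv.reader(ams_data, delimiter=';')
    rows = list(read_csv)

# Skip the header row, convert the last column to float, and sum.
total = sum(float(row[-1]) for row in rows[1:])
print(total)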

Find big differences in numpy array

I have a csv file that contains data from two LED measurements. There are some mistakes in the file that produce huge sparks in the graph. I want to locate the places where this happens.
I have this code that makes two arrays that I plot.
x625 = np.array(df['LED Group 625'].dropna(axis=0, how='all'))
x940 = np.array(df['LED Group 940'].dropna(axis=0, how='all'))
I will provide an answer with some artificial data since you have not posted any data yet.
So after you convert the pandas columns into a numpy array, you can do something like this:
import numpy as np
# some random data. 100 lines and 1 column
x625 = np.random.rand(100,1)
# Assume that the maximum value in `x625` is a spark.
spark = x625.max()
# Find where these spark are in the `x625`
np.where(x625==spark)
#(array([64]), array([0]))
The above means that a value equal to spark is located at row index 64 of column 0.
Similarly, you can use np.where(x625 > any_number_here)
If instead of the location you need to create a boolean mask use this:
boolean_mask = (x625==spark)
# verify
np.where(boolean_mask)
# (array([64]), array([0]))
EDIT 1
You can use numpy.diff() to get the element-wise differences between consecutive elements of the array:
diffs = np.diff(x625.ravel())
Index 0 of the result holds element1 - element0, and so on.
If the values in diffs are large at a specific index position, then a spark occurred at that position.
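Putting the two ideas together, a sketch of locating sparks with a threshold (the threshold value here is an assumption; you would tune it for your data):
import numpy as np

x625 = np.random.rand(100, 1)  # artificial data, as above

# Differences between consecutive samples.
diffs = np.diff(x625.ravel())

# Hypothetical threshold: flag jumps larger than 0.9 in either direction.
threshold = 0.9
spark_positions = np.where(np.abs(diffs) > threshold)[0]
print(spark_positions)  # indices i where x625[i+1] - x625[i] is a big jump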

how to read a file, sort based on specified column

I am trying to convert all my code, written in MATLAB, to Python. I have a problem I couldn't find a way to solve. Maybe someone has an idea.
I have a file with m rows and two columns. I want to read the file and then sort it based on the second column. Then I must take the first column of the sorted data (from the first row to the 1000th row), find the values larger than a threshold (here, for example, 0.2), and sum them.
Hope someone has an idea.
Thanks
If, for example, the file has fields separated by tabs and rows separated by newlines, the problem is quite simple:
with open("filename.csv") as f:
    data = [[float(v) for v in line.split("\t")] for line in f]
data.sort(key=lambda x: x[1])
result = sum(x[0] for x in data[:1000] if x[0] > 0.2)
Consider using Numpy arrays and its accompanying functions. They are (usually) quite similar to those in MATLAB, which might make your conversion from the latter easier.
import numpy as np
data = np.genfromtxt("filename.csv", delimiter="\t", dtype=float)
idx = np.argsort(data[:, 1])
data1000 = data[idx[:1000]] # First 1000 of sorted data
result = np.sum(data1000[data1000[:, 0] > 0.2, 0])
