How to create multiple variables with similar names in a for loop? - Python

I had a problem with for loops earlier, and it was solved thanks to @mak4515. However, there is something else I want to accomplish:
# Use pandas to read in csv file
data_df_0 = pd.read_csv('puget_sound_ctd.csv')
# create data subsets based on specific buoy coordinates
data_df_1 = pd.read_csv('puget_sound_ctd.csv', skiprows=range(9, 114))
data_df_2 = pd.read_csv('puget_sound_ctd.csv', skiprows=([i for i in range(1, 8)] + [j for j in range(21, 114)]))

for x in range(0, 2):
    for df in [data_df_0, data_df_2]:
        lon_(x) = df['longitude']
        lat_(x) = df['latitude']
This is my current code. I want it to read the different data sets and create different variables based on the data set it is reading. However, when I run the code this way I get this error:
File "<ipython-input-66-446aebc48604>", line 21
lon_(x) = df['longitude']
^
SyntaxError: can't assign to function call
What does "can't assign to function call" mean, and how do I fix this?

First, the error itself: lon_(x) is parsed as a call to a function named lon_, and Python cannot assign to the result of a function call, which is what the SyntaxError is telling you. Dynamically named variables are usually replaced with a dict or a list (see the sketch after the example below), so I think the comment by @Chris is probably a good way to go. I wanted to point out that since you're already using pandas dataframes, an easier way might be to make a column recording which original dataframe each row came from and then concatenate them.
import pandas as pd

data_df_0 = pd.DataFrame({'longitude': range(-125, -120, 1), 'latitude': range(45, 50, 1)})
data_df_0['dfi'] = 0
data_df_2 = pd.DataFrame({'longitude': range(-120, -125, -1), 'latitude': range(50, 45, -1)})
data_df_2['dfi'] = 2
df = pd.concat([data_df_0, data_df_2])
Then you could access the rows that came from a given original frame like this:
df[df['dfi'] == 2]
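And if you do want separate lon/lat values per dataframe, the usual replacement for dynamically named variables is a dict keyed by the dataframe's number; a minimal sketch using the toy frames above:
lon = {}
lat = {}
for i, df in zip([0, 2], [data_df_0, data_df_2]):
    lon[i] = df['longitude']
    lat[i] = df['latitude']
# lon[0] and lon[2] now play the roles the question's lon_0 and lon_2 would have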

Print Pandas without dtype

I've read a few other posts about this but the other solutions haven't worked for me. I'm trying to look at 2 different CSV files and compare data from 1 column from each file. Here's what I have so far:
import pandas as pd
import numpy as np
dataBI = pd.read_csv("U:/eu_inventory/EO BI Orders.csv")
dataOrderTrimmed = dataBI.iloc[:,1:2].values
dataVA05 = pd.read_csv("U:\eu_inventory\VA05_Export.csv")
dataVAOrder = dataVA05.iloc[:,1:2].values
dataVAList = []
ordersInBoth = []
ordersInBI = []
ordersInVA = []
for order in np.nditer(dataOrderTrimmed):
    if order in dataVAOrder:
        ordersInBoth.append(order)
    else:
        ordersInBI.append(order)
So if the order number from dataOrderTrimmed is also in dataVAOrder, I want to add it to ordersInBoth; otherwise I want to add it to ordersInBI. I think it splits the information correctly, but if I try to print ordersInBoth, each item prints as array(5555555, dtype=int64). I want a list of the order numbers, not arrays, and without the dtype information. Let me know if you need more information or if the way I've typed it out is confusing. Thanks!
The way you're using .iloc is giving you a DataFrame, which becomes a 2D array when you access values. If you just want the values in the column at index 1, then you should just say:
dataOrderTrimmed = dataBI.iloc[:, 1].values
Then you can iterate over dataOrderTrimmed directly (i.e. you don't need nditer), and you will get regular scalar values.
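A minimal sketch of the corrected comparison (paths and column positions taken from the question; .tolist() is one way to drop the numpy dtype wrapper):
import pandas as pd

dataBI = pd.read_csv("U:/eu_inventory/EO BI Orders.csv")
dataVA05 = pd.read_csv("U:/eu_inventory/VA05_Export.csv")

orders_bi = dataBI.iloc[:, 1].tolist()          # plain Python scalars
orders_va = set(dataVA05.iloc[:, 1].tolist())   # a set makes membership tests O(1)

ordersInBoth = [o for o in orders_bi if o in orders_va]
ordersInBI = [o for o in orders_bi if o not in orders_va]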

Organize flow data from multiple Excel sheets into one Excel file by iterating through each column

I have the paths to each Excel file in 'files' (using this thread). Then I was trying to use a for loop to iterate through each file, gather the flow data, and combine it into a new matrix 'val' by adding it to a new column each time. 'Flow' is also the column name in the Excel file, so I use that to select the column I want.
For example,

Excel 1:
Flow data
1
2

Excel 2:
Flow data
3
4

val should end up as:
Excel 1  Excel 2
1        3
2        4
I keep getting this error, however:
could not broadcast input array from shape (105408,1) into shape (105408,)
It seems like a common error, but I haven't been able to solve it from similar questions on here.
val = np.zeros((105408, 50), int)

for x in range(len(files)+1):
    dt = pd.read_csv(files[x])
    flow_data = dt[['Flow']]
    val[:, x] = flow_data
    # print(val)
I think you are running into an issue due to the extra pair of brackets surrounding 'Flow'; removing them should make the loop work as you intended: dt[['Flow']] --> dt['Flow']. The double brackets return a one-column DataFrame of shape (n, 1), while single brackets return a Series of shape (n,), which is what val[:, x] expects, hence the broadcast error.
Using a dataframe to aggregate results might be a better approach, though: the preallocated numpy.ndarray will throw an error if len(files) turns out to be larger than the preset array width (50 in this case), while a dataframe is flexible about varying file counts and row counts. That seems to matter here, given that you are using len(files) rather than a specific file count. Note also that range(len(files)+1) overruns the list; range(len(files)) already visits every file.
Working example (using pd.DataFrame):
aggregate_df = pd.DataFrame()

for x in range(len(files)):  # not len(files)+1, which would overrun the list
    dt = pd.read_csv(files[x])
    flow_data = dt['Flow']
    aggregate_df.loc[:, x] = flow_data  # using a df to aggregate results

# print(aggregate_df)
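If every file is guaranteed to have a 'Flow' column, a more compact variant of the same idea (a sketch, not tested against your files) builds the frame in one pass with pd.concat:
import pandas as pd

aggregate_df = pd.concat(
    [pd.read_csv(f)['Flow'].rename(f) for f in files],  # one column per file, named after it
    axis=1,
)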

Adding a new dask column based on a vectorized function

I have what I thought would be a straightforward thing to do in Python using dask. I have a dataframe with some records in it, and I want to add a new column based on calling a function with values from two other columns as parameters.
Here is what I mean (pretend ge exists and takes two parameters):
def gc(x, y):
    return ge(x, y)

def gdf(df):
    func1 = np.vectorize(gc)
    gh = da.from_array(func1(df.x, df.y))
    df['gh'] = gh
However, I seem to get one issue or another no matter what I try to do. Currently, in the above state, I get
Number of partitions do not match (2 != 33)
It feels like I'm either going about this all wrong (like maybe I need map_blocks or map_partitions or even gufunc), or I'm missing something easy where I can set the number of partitions on my array to match that of my dataframe.
Any help would be appreciated.
It should be possible to do this with assign or map_partitions:
func1 = np.vectorize(gc)
df = df.assign(gh=lambda df: func1(df.x, df.y))

# or try this
def myfunc(df):
    df['gh'] = func1(df.x, df.y)
    return df

df = df.map_partitions(myfunc)
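If dask then complains while inferring the output schema, map_partitions also accepts a meta argument that declares it up front; a sketch, where the column names and dtypes are assumptions rather than something taken from your data:
df = df.map_partitions(myfunc, meta={'x': 'f8', 'y': 'f8', 'gh': 'f8'})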

Delete series value from row of a pandas data frame based on another data frame value

My question is a little bit different from the question posted here, so I thought to open a new thread. I have a pandas data frame with 5 attributes. One of these attributes is created using a pandas Series. Here is the sample code for creating the data frame:
import numpy as np
import pandas as pd

mydf1 = pd.DataFrame(columns=['group', 'id', 'name', 'mail', 'gender'])
data = np.array([2540948, 2540955, 2540956, 2540956, 7138932])
x = pd.Series(data)
mydf1.loc[0] = [1, x, 'abc', 'abc@xyz.com', 'male']
I have another data frame; the code for creating it is given below:
mydf2 = pd.DataFrame(columns=['group', 'id'])
data1 = np.array([2540948, 2540955, 2540956])
y = pd.Series(data1)
mydf2.loc[0] = [1, y]
These are sample data; the actual data will have a large number of rows, and the series are long too. I want to match mydf1 against mydf2 (sometimes there will be no matching element in mydf2) and delete from mydf1's id the values that also appear in mydf2. For example, after the run, the id for group 1 should be 2540956, 7138932. I also tried the code mentioned in the above link, but for the first line
counts = mydf1.groupby('id').cumcount()
I got this error message:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
in my Python 3.x. Can you please suggest how to solve this?
This should work. We use Counter to find the difference between two lists of ids. (P.S. this assumes the difference does not need to preserve order.)
Setup
import numpy as np
import pandas as pd
from collections import Counter

mydf1 = pd.DataFrame(columns=['group', 'id', 'name', 'mail', 'gender'])
x = [2540948, 2540955, 2540956, 2540956, 7138932]
y = [2540948, 2540955, 2540956, 2540956, 7138932]
mydf1.loc[0] = [1, x, 'abc', 'abc@xyz.com', 'male']
mydf1.loc[1] = [2, y, 'def', 'def@xyz.com', 'female']

mydf2 = pd.DataFrame(columns=['group', 'id'])
x2 = np.array([2540948, 2540955, 2540956])
y2 = np.array([2540955, 2540956])
mydf2.loc[0] = [1, x2]
mydf2.loc[1] = [2, y2]
Code
mydf3 = mydf1[["group", "id"]]
mydf3 = mydf3.merge(mydf2, how="inner", on="group")
new_id_finder = lambda x: list((Counter(x.id_x) - Counter(x.id_y)).elements())
mydf3["new_id"] = mydf3.apply(new_id_finder, axis=1)
mydf3[["group", "new_id"]]
   group                        new_id
0      1            [2540956, 7138932]
1      2   [2540948, 2540956, 7138932]
One Counter object can subtract another to get the difference in occurrences of elements. Then you can use the elements() function to retrieve all the values left.
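A tiny worked example of that Counter arithmetic:
from collections import Counter

diff = Counter([1, 1, 2]) - Counter([1])  # Counter({1: 1, 2: 1})
list(diff.elements())                     # [1, 2]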

How to transform a slice of a dataframe into a new data frame

I'm new to Python and I sometimes get confused by certain operations.
I have a dataframe called ro, and I filtered it on a specific column, PN 3D, for a specific value, 921, assigning the result to a new data frame called headlamp with the following code:
headlamp = ro[ro['PN 3D']=="921"]
Is my headlamp also a dataframe, or is it just a slice?
The reason I'm asking is that I'm getting some strange warnings and results later in my script.
For example, I create a new column called word and assign it to headlamp:
headlamp['word'] = ""
I got the following warning:
A value is trying to be set on a copy of a slice from a DataFrame
After that, I used the following script to assign the results to headlamp['word']:
i = 0
for row in headlamp['Comment'].astype(list):
    headlamp['word'][i] = Counter(str(row).split())
    i += 1
print headlamp['word']
The same warning appeared, and it has impacted my results: when I use headlamp.tail(), the last rows of headlamp['word'] are empty.
Does anyone have an idea what the problem is and how to fix it?
Any help will be highly appreciated
Use .loc
headlamp = ro.loc[ro['PN 3D']=="921"]
As for the rest and your comments... I'm very confused. But this is my best guess
setup
import numpy as np
import pandas as pd
from string import ascii_lowercase

chars = ascii_lowercase + ' '
probs = [0.03] * 26 + [.22]
# 10 random 100-character strings of lowercase letters and spaces
headlamp = pd.DataFrame(np.random.choice(list(chars), (10, 100), p=probs)).sum(1).to_frame('comment')
headlamp

headlamp['word'] = headlamp.comment.str.split().apply(lambda x: pd.value_counts(x).to_dict())
headlamp
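Applied back to your original data, a warning-free version of the loop might look like this (a sketch assuming ro really has the 'PN 3D' and 'Comment' columns from the question):
from collections import Counter

headlamp = ro.loc[ro['PN 3D'] == "921"].copy()  # .copy() detaches the slice from ro
headlamp['word'] = headlamp['Comment'].apply(lambda s: Counter(str(s).split()))
print(headlamp['word'].tail())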
