how to replace elements of pandas series? [duplicate] - python

This question already has an answer here:
How to split elements of a pandas series and put them into JSON format?
(1 answer)
Closed 6 years ago.
I have a pandas series object S, some elements are name-value pairs, like
a-12
b-23
c-42
d-25
...
some are just
a
b
c
d
....
and so on. What I need to do is get this into JSON format like:
{Name:a,Value:12}
{Name:b,Value:23}
{Name:c,Value:42}
{Name:d,Value:25}
...
If an element is only a, b, c, or d (not a pair), Value should be NaN.
I used the str.split("-") function to separate the pairs; for non-pairs this produces NaN for the value part.
I wonder if I can put them together like
result=[{"Name": S.str.split("-").str.get(0),"Value": S.str.split("-").str.get(1)}]
?

I'm not sure you want to start from a series object at all? How did you get there? It might be easier to think of a series as an indexed list, or as a dictionary, in which case you can see that it gets confusing if the items are of different types.
FWIW, you can convert a series directly to json or dict with myseries.to_json() or myseries.to_dict()
Did you try that?
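For what the question actually asks, here is a minimal sketch of the split-then-convert approach (the series S below is made up for illustration; values come out as strings, with None where no value exists):
import pandas as pd

# Hypothetical series mixing "name-value" pairs and bare names
S = pd.Series(["a-12", "b-23", "c-42", "d"])

split = S.str.split("-", expand=True)      # column 0 = name, column 1 = value (None if missing)
split.columns = ["Name", "Value"]
records = split.to_dict(orient="records")  # [{'Name': 'a', 'Value': '12'}, ..., {'Name': 'd', 'Value': None}]
print(records)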

Related

List comprehension to get rows that contain matching values from 2 seperate dataframes

So I'm trying to match 2 very different dataframes on a single column each, both containing numbers in string format. I need a concise, very fast solution, so I tried using a list comprehension and succeeded a few days back, but then lost my work; I'm trying to recreate it.
df1=pd.DataFrame({'col':['hey','hi','how ya durn']})
df2=pd.DataFrame({'col':['hey','hi','hello','what']})
df3=df2[[x for x in df2.col for y in df1.col if x in y]]
df3.head()
So I made this work the other day with 2 dataframes, both 20-30 columns, ~100k rows, with different column data except for 1 column each, which I'm trying to match on.
I either get ValueError: Item wrong length # instead of #. or it takes an insane amount of time because the system I'm using is slow.
I know I need to use a list comprehension or something faster, and I know .apply() takes too long. Both of my matching columns contain 10-15 character numbers in string format. When I got it to work a few days back using a similar list-comp one-liner, it took seconds to complete and was perfect, and now I've lost it and can't recreate it. Any help is greatly appreciated.
(P.S. I may have used an any() statement in the list comp, and I'm 95% sure I used if x in y.)
You can find strings in both columns with
df2[df2.col.isin(df1.col)]
Out:
col
0 hey
1 hi
A solution with comprehension would be
df2[df2.col.isin([x for x in df2.col for y in df1.col if x in y])]
But this gets slow for larger columns
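For exact matches at scale, an inner merge on the key column is another common option (this is not from the answer above; the toy frames are reused for illustration):
import pandas as pd

df1 = pd.DataFrame({'col': ['hey', 'hi', 'how ya durn']})
df2 = pd.DataFrame({'col': ['hey', 'hi', 'hello', 'what']})

# Keep the rows of df2 whose 'col' value also appears in df1 (same result as isin here)
matched = df2.merge(df1[['col']].drop_duplicates(), on='col', how='inner')
print(matched)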

create a new column which is a value_counts of another column in python [duplicate]

This question already has answers here:
pandas add column to groupby dataframe
(3 answers)
Closed 2 years ago.
I have a pandas dataframe df that contains a column, say x, and I would like to create another column out of x which is a value count of each item in x.
Here is my approach
x_counts = []
for item in df['x']:
    item_count = len(df[df['x'] == item])
    x_counts.append(item_count)
df['x_count'] = x_counts
This works, but it is far from efficient. I am looking for a more efficient way to handle this. Your approach and recommendations are highly appreciated.
It sounds like you are looking for the groupby function, since you are trying to get the count of items in x.
There are many other function-driven methods, but they may differ between versions.
I suppose that you are looking to join the same elements and find their sum:
df.loc[:, 'x_count'] = 1  # add a new x_count column with value 1 in every row
aggregate_functions = {"x_count": "sum"}
df = df.groupby(["x"], as_index=False, sort=False).aggregate(aggregate_functions)  # as_index and sort let you keep x as a regular column; otherwise x would become the index
Hope it helps.
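As a vectorized alternative (not part of the answer above), groupby with transform keeps one count per original row; a minimal sketch with a made-up column x:
import pandas as pd

df = pd.DataFrame({"x": ["a", "b", "a", "c", "a", "b"]})

# Count the occurrences of each value of x and broadcast the count back to every row
df["x_count"] = df.groupby("x")["x"].transform("count")
# Equivalent: df["x_count"] = df["x"].map(df["x"].value_counts())
print(df)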

Fast conversion of multicolumn dataframe into dictionary

I have the following problem. I have a pandas dataframe with columns A to D, with columns A and B being a kind of identifier. My ultimate goal is to create a dictionary where the tuple (A, B) denotes the keys and the values of C and D are stored under each key as a numpy array. I can write this in one line if I only want to store C or D, but I struggle to get both in. That's what I have:
output_dict = df.groupby(['A','B'])['C'].apply(np.array).to_dict()
works as expected, i.e. the data under each key is an array of dim (N, 1). But if I try the following:
output_dict = df.groupby(['A','B'])['C','D'].apply(np.array).to_dict()
I receive the error that
TypeError: Series.name must be a hashable type
How can I include the 2nd column such that the data in the dict per key is an array of dim (N, 2)?
Thanks!
You can create a new column (e.g. C_D) containing lists of the corresponding values in the columns C and D. Select columns C and D from the dataframe and use the tolist() method:
df['C_D'] = df[['C','D']].values.tolist()
Then run your code line on that new column:
output_dict = df.groupby(['A','B'])['C_D'].apply(np.array).to_dict()
I played around a bit more and, in addition to Gerd's already helpful answer, found the following, which matches my needs, using a lambda.
output_dict = df.groupby(['A','B']).apply(lambda df: np.array( [ df['C'],df['D'] ] ).T).to_dict()
Time comparison with Gerd's solution in my particular case:
Gerd's: roughly 0.055s
This one: roughly 0.035s
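For reference, a self-contained sketch of the lambda-based approach on a toy frame (columns A to D and their values are invented):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2],
    "B": ["x", "x", "y"],
    "C": [0.1, 0.2, 0.3],
    "D": [1.0, 2.0, 3.0],
})

# One (N, 2) array of [C, D] rows per (A, B) group, keyed by the (A, B) tuple
output_dict = df.groupby(["A", "B"]).apply(lambda g: np.array([g["C"], g["D"]]).T).to_dict()
print(output_dict[(1, "x")])  # 2x2 array of [C, D] rows for group (1, 'x')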

How do I convert 'the array saved as string to csv' back to float array?

I had to merge a lot of files (containing word embeddings and other real-valued vectors) based on some common attributes, so I used a pandas DataFrame and saved the intermediate files as CSV.
Currently I have a dataframe whose columns look something like this:
I want to merge all last 4 columns (t-1embedding1a,t-1embedding7b,t-2embedding1a,t-2embedding7b) into a single vector to pass to neural network.
I planned to iterate over the current dataframe, take 4 temporary tensors with the value of each column, concatenate them, and write to a new dataframe.
However, torch.tensor doesn't work, as it says:
torch_tensor = torch.tensor(final['t-1embedding1a'].astype(float).values)
could not convert string to float: '[-6.12873614e-01 -5.58319509e-01 -9.73452032e-01 3.66993636e-01\n
I also tried np.fromstring() but the original values are lost in this case.
Sorry if the question is unnecessarily complicated; I am a newbie to pytorch. Any help is appreciated!
First of all, the data type of the "t-1embeddingXX" columns is string; the values look like "[-6.12873614e-01 -5.58319509e-01 -9.73452032e-01 3.66993636e-01]". You have to convert them to lists of floats, e.g. for each of the four columns:
final["t-1embeddingXX"] = final["t-1embeddingXX"].apply(lambda s: [float(v) for v in s.replace("[", "").replace("]", "").split()])
Then, you have to check that every list in final.loc[i, "t-1embeddingXX"] has the same length.
If I'm not mistaken, you want to merge the 4 columns into one vector.
all_values = list(final["t-1embedding1a"]) + list(final["t-1embedding7b"]) + list(final["t-2embedding1a"]) + list(final["t-2embedding7b"])
# there is surely a better way
Then pass to tensor:
torch_tensor = torch.tensor(all_values)
------------
Finally, I advise you to take a look at the function of torch.cat. You can convert each column to a vector and then use this function to concatenate them together.
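A sketch of the whole round trip on made-up data (column names and values are invented; torch.cat concatenates the per-column tensors along the feature dimension, as suggested above):
import pandas as pd
import torch

# After reading back from CSV, each embedding cell is a string like "[0.1 0.2 0.3]"
final = pd.DataFrame({
    "t-1embedding1a": ["[0.1 0.2 0.3]", "[0.4 0.5 0.6]"],
    "t-1embedding7b": ["[1.0 1.1 1.2]", "[1.3 1.4 1.5]"],
})

def parse_vector(s):
    # Strip the brackets, split on whitespace, convert each piece to float
    return [float(v) for v in s.strip("[]").split()]

cols = ["t-1embedding1a", "t-1embedding7b"]
tensors = [torch.tensor(final[c].apply(parse_vector).tolist()) for c in cols]

# Concatenate the per-row vectors along the feature dimension
merged = torch.cat(tensors, dim=1)
print(merged.shape)  # torch.Size([2, 6])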

Efficient way of converting a numpy array of 2 dimensions into a list with no duplicates

I want to extract the values from two different columns of a pandas dataframe, put them in a list with no duplicate values.
I have tried the following:
arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])
This works but it is taking too long. The dataframe contains around 30 million rows.
Example:
column1 column2
1 adr1 adr2
2 adr1 adr2
3 adr3 adr4
4 adr4 adr5
Should generate the list with the values:
[adr1, adr2, adr3, adr4, adr5]
Can you please help me find a more efficient way of doing this, considering that the dataframe contains 30 million rows.
@ALollz gave the right answer. I'll extend from there. To convert into a list as expected, just use list(np.unique(df.values))
You can use just np.unique(df) (maybe this is the shortest version).
Formally, the first parameter of np.unique should be an array_like object,
but as I checked, you can also pass just a DataFrame.
Of course, if you want a plain list rather than an ndarray, write
np.unique(df).tolist().
Edit following your comment
If you want the list unique but in the order of appearance, write:
pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()
Operation order:
reshape changes the source array into a single column.
Then a DataFrame is created, with default column name = 0.
Then [0] takes just this (the only) column.
drop_duplicates does exactly what the name says.
And the last step: tolist converts to a plain list.
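A self-contained sketch of both variants on the example data from the question (column names as in the question; values invented):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column1": ["adr1", "adr1", "adr3", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
})

# Sorted unique values across both columns
unique_sorted = np.unique(df[["column1", "column2"]].values).tolist()

# Unique values in order of first appearance
unique_in_order = pd.Series(df[["column1", "column2"]].values.ravel()).drop_duplicates().tolist()

print(unique_sorted)    # ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']
print(unique_in_order)  # ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']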
