Pyspark: function joining changeable number of columns - python

I wonder if there is a way to automate this.
I want to make a function where I can specify how many columns to join. If I have a DataFrame with 3 columns and pass "number_of_columns=3", it should join columns 0, 1, 2. If I have a DataFrame with 7 columns and pass "number_of_columns=7", it should join columns 0, 1, 2, 3, 4, 5, 6.
The column names are always the same: from "0" to "number_of_columns-1".
Is there any way to do that, or do I need a separate function for each number of columns to merge?
from pyspark.sql.functions import concat_ws, col

def my_function(spark_column, name_of_column):
    # hard-codes seven columns, which is what I want to generalise
    new_spark_column = spark_column.withColumn(name_of_column, concat_ws("",
        col("0").cast("Integer"),
        col("1").cast("Integer"),
        col("2").cast("Integer"),
        col("3").cast("Integer"),
        col("4").cast("Integer"),
        col("5").cast("Integer"),
        col("6").cast("Integer")))
    return new_spark_column

You can use a list comprehension to do this:
from pyspark.sql.functions import concat_ws, col
def my_function(spark_column, n_cols, name_of_column):
    new_spark_column = spark_column.withColumn(
        name_of_column,
        concat_ws("", *[col(c).cast("Integer") for c in spark_column.columns[:n_cols]])
    )
    return new_spark_column
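For example, a quick usage sketch (assuming an active SparkSession named spark; the sample data is made up for illustration):
df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ["0", "1", "2"])
result = my_function(df, n_cols=3, name_of_column="joined")
result.show()  # the 'joined' column should hold "123" and "456"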

Related

Take rows of ith duplicated index and put in i number of dataframes

This is a bit tricky to put into words, but I'll give it a try. I have a dataframe with duplicated indices as provided below.
a = [0.00000, 0.071928, 1.294, 2.592563, 0.000318, 2.575291, 0.439986, 2.232147, 6.091523, 2.075441, 0.96152]
b = [0.00000, 0.399791, 1.302446, 1.388957, 1.276451, 1.527568, 1.614107, 2.686325, 4.167600, 6.135689, 5.945807]
df = pd.DataFrame({'a' : a, 'b' : b})
df.index = [1,1,1,1,1,2,2,3,3,3,4]
I want the row of the first duplicate of each index value to be appended to df1, the row of the second duplicate to be appended to df2, and so on: the first time indices 1, 2, 3, 4, ..., n have a duplicate, those rows go to dataframe 1; the second time indices 1, 2, 3, 4, ..., n have a duplicate, those rows go to dataframe 2, and so on. Ideally, it would look something like this if concatenated for the first three duplicates under the 'index' column.
Any idea how to go about this? I've tried running df[df.duplicated(subset = ['index'])] in a for loop to whittle the df down to just the first duplicates, but it doesn't seem to work the way I think it will.
Slicing out the duplicate indices via cumcount and using concat to stitch together the resulting sub-dataframes will do the job.
cols = df.columns
df['id'] = df.index
occurrences = df.groupby('id').cumcount()
# the range must run to max + 1, otherwise the last occurrence is dropped
pd.concat([df[occurrences == i][cols] for i in range(occurrences.max() + 1)], axis=1)
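If you actually want a list of separate dataframes, as described in the question, the same cumcount idea can be used directly (a sketch, not the answerer's original code):
occurrence = df.groupby(df.index).cumcount()
# dfs[0] holds the first occurrence of every index value, dfs[1] the second, and so on
dfs = [df[occurrence == i] for i in range(occurrence.max() + 1)]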

Python: Extract unique index values and use them in a loop

I would like to apply the loop below, which for each index value returns the unique values of a column called SERIAL_NUMBER. Essentially, I want to confirm that each index has a unique serial number.
index_values = df.index.levels
for i in index_values:
    x = df.loc[[i]]
    x["SERIAL_NUMBER"].unique()
The problem, however, is that my dataset has a multi-index, and as you can see below it is stored in a FrozenList. I am just interested in the index values that contain a long number; the word "vehicle", which is also an index level, can be dropped since it is repeated all over the dataset.
How can I extract these values into a list so I can use them in the loop?
index_values
>>
FrozenList([['0557bf98-c3e0-4955-a23f-2394635ab531', '074705a3-a96a-418c-9bfe-14c37f5c4e6f', '0f47e260-0fa2-40ba-a417-7c00ea74248c', '17342ca2-6246-4150-8080-96d6125cf2b5', '26c6c0d1-0134-4b3a-a149-61dd93afab3b', '7600be43-5d0a-49b3-a1ee-fd107db5822f', 'a07f2b0c-447c-4143-a361-d7ddbffdcc77', 'b929801c-2f32-4a95-bfc4-48a05b48ee01', 'cc912023-0113-42cd-8fe7-4df4005127c2', 'e424bd02-e188-462e-a1a6-2f4ed8fe0a2d'], ['vehicle']])
Without an example it's hard to judge, but I think you need:
df.index.get_level_values(0).unique() # add .tolist() if you want a list
import pandas as pd
df = pd.DataFrame({'A': [5]*5, 'B': [6]*5})
df = df.set_index('A', append=True)

df.index.get_level_values(0).unique()
# Int64Index([0, 1, 2, 3, 4], dtype='int64')

df.index.get_level_values(1).unique()
# Int64Index([5], dtype='int64', name='A')
To drop duplicates from an index level, use the .duplicated() method:
df[~df.index.get_level_values(1).duplicated(keep='first')]
     B
  A
0 5  6
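Putting it together with the loop from the question (a minimal sketch, assuming the long IDs sit in the first index level and the frame has a SERIAL_NUMBER column):
ids = df.index.get_level_values(0).unique().tolist()
for i in ids:
    x = df.loc[[i]]                        # all rows for this id
    print(i, x["SERIAL_NUMBER"].unique())  # ideally exactly one serial number per id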

Python: for loop over a range of files

I am trying to create a list that takes values from different files.
I have three dataframes called, for example, "df1", "df2", "df3".
Each file contains two columns with data, so for example "df1" looks like this:
0, 1
1, 4
7, 7
I want to create a list that takes the value from the first row of the second column in each file, so it should look like this:
F=[1,value from df2,value from df3]
My try:
import pandas as pd
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)
F = []
for i in range(3):
    F.append(df{"i"}[1][0])
That is probably not how to iterate over them, but I cannot figure out the correct way.
You can use iloc and a list comprehension:
vals = [df.iloc[0, 1] for df in [df1,df2,df3]]
iloc will get the value from the first row (index 0) and the second column (index 1). If you wanted, say, the value from the third row and fourth column, you'd use .iloc[2, 3], and so forth.
As suggested by @jpp, you may use iat instead:
vals = [df.iat[0, 1] for df in [df1,df2,df3]]
For the difference between them, check this and this question.
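As a quick check (a sketch; df1 matches the sample data from the question, while df2 and df3 are made-up stand-ins):
import pandas as pd

df1 = pd.DataFrame([[0, 1], [1, 4], [7, 7]])   # the sample data from the question
df2 = pd.DataFrame([[2, 9], [3, 8]])           # made-up stand-in
df3 = pd.DataFrame([[5, 6]])                   # made-up stand-in

vals = [df.iloc[0, 1] for df in [df1, df2, df3]]
print(vals)  # [1, 9, 6]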

Apply function to 2nd column in pandas dataframe groupby

In a pandas dataframe, a function can be used to group its index. I'm looking to define a function that instead is applied to a column.
I'm looking to group by two columns, except I need the second column to be grouped by an arbitrary function, foo:
group_sum = df.groupby(['name', foo])['tickets'].sum()
How would foo be defined to group the second column into two groups, demarcated by whether values are > 0, for example? Or, is an entirely different approach or syntax used?
Groupby can accept any combination of labels and series/arrays (as long as the array has the same length as your dataframe), so you can map the function onto your column and pass the result into the groupby, like so:
df.groupby(['name', df[1].map(foo)])
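For example, foo could simply test the sign of each value (a sketch; the column label 'value' is an assumption, since the question never names the second column):
def foo(v):
    return v > 0  # the True/False result becomes the second grouping key

group_sum = df.groupby(['name', df['value'].map(foo)])['tickets'].sum()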
Alternatively, you might want to add the condition as a new column to your dataframe before you perform the groupby; this has the advantage of giving it a name in the index:
df['>0'] = df[1] > 0
group_sum = df.groupby(['name', '>0'])['tickets'].sum()
Something like this will work:
x.groupby(['name', x['value']>0])['tickets'].sum()
As mentioned above, groupby can accept labels and series. This should give you the answer you are looking for. Here is an example:
data = np.array([[1, -1, 20], [1, 1, 50], [1, 1, 50], [2, 0, 100]])
x = pd.DataFrame(data, columns = ['name', 'value', 'value2'])
x.groupby(['name', x['value']>0])['value2'].sum()
name  value
1     False     20
      True     100
2     False    100
Name: value2, dtype: int64

Split a dataframe column's list into two dataframe columns

I'm effectively trying to do a text-to-columns (from MS Excel) action, but in Pandas.
I have a dataframe that contains values like 1_1, 2_1, 3_1, and I only want to take the values to the right of the underscore. I figured out how to split the string, which gives me a list of the parts, but I don't know how to break that out into separate dataframe columns.
Here is my code:
import pandas as pd
test = pd.DataFrame(['1_1','2_1','3_1'])
test.columns = ['values']
test = test['values'].str.split('_')
I get something like: [1, 1], [2, 1], [3, 1].
What I'm trying to get is two separate columns:
col1: 1, 2, 3
col2: 1, 1 ,1
Thoughts? Thanks in advance for your help
Use expand=True when doing the split to get multiple columns:
test['values'].str.split('_', expand=True)
If there's only one underscore, and you only care about the value to the right, you could use:
test['values'].str.split('_').str[1]
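To attach the result back to the original frame as two named columns (a sketch; it assumes test is still the original DataFrame with the 'values' column, and the names col1/col2 are just illustrative):
test[['col1', 'col2']] = test['values'].str.split('_', expand=True)
print(test)
#   values col1 col2
# 0    1_1    1    1
# 1    2_1    2    1
# 2    3_1    3    1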
You are close. Instead of just splitting, try this:
test2 = pd.DataFrame(test['values'].str.split('_').tolist(), columns = ['c1','c2'])
