I want to translate this Scala code into PySpark code.
Scala code:
// inside a map over an RDD of fixed-width strings
rdd.map { x =>
  val columnArray = new Array[String](95)
  columnArray(0) = x.substring(0, 10)
  columnArray(1) = x.substring(11, 14)
  columnArray(2) = x.substring(15, 17)
  Row.fromSeq(columnArray)
}
How can I express the same logic in PySpark?
Assuming you are trying to convert an array of strings to a DataFrame with the substrings as the corresponding columns, this will do the trick in PySpark.
Change column_array to hold your array of strings and column_names to hold the name of each column:
column_array = ["abcdefghijklmnopqrst", "abcdefghijklmnopqrst"]
column_names = ["col1", "col2", "col3", "col4"]
This maps the array to an RDD of tuples containing each string and its substrings; the RDD is then converted to a DataFrame with the given column names.
sc.parallelize(column_array).map(lambda x: (x, x[0:10], x[11:14],
                                            x[15:17])).toDF(column_names).show()
This will generate the following data frame:
+--------------------+----------+----+----+
| col1| col2|col3|col4|
+--------------------+----------+----+----+
|abcdefghijklmnopqrst|abcdefghij| lmn| pq|
|abcdefghijklmnopqrst|abcdefghij| lmn| pq|
+--------------------+----------+----+----+
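If you want a closer, line-for-line translation of the Scala snippet, a sketch like the following should work (here rdd is an assumed RDD of fixed-width strings, and the offsets are carried over from the question):

from pyspark.sql import Row

def to_row(x):
    # mirrors `new Array[String](95)` in the Scala version
    column_array = [None] * 95
    column_array[0] = x[0:10]
    column_array[1] = x[11:14]
    column_array[2] = x[15:17]
    return Row(*column_array)  # per-line equivalent of Row.fromSeq(columnArray)

rows = rdd.map(to_row)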
Related
I am new to pandas, and I would appreciate any help. I have a pandas DataFrame that comes from a CSV file. The data contains two columns: dates and cashflows. Is it possible to convert these columns into a list of tuples? Here is how my dataset looks:
2021/07/15 4862.306832
2021/08/15 3474.465543
2021/09/15 7121.260118
The desired output is :
[(2021/07/15, 4862.306832),
(2021/08/15, 3474.465543),
(2021/09/15, 7121.260118)]
Use apply with a lambda function:
import pandas as pd

data = {
    "date": ["2021/07/15", "2021/08/15", "2021/09/15"],
    "value": ["4862.306832", "3474.465543", "7121.260118"]
}
df = pd.DataFrame(data)
listt = df.apply(lambda x: (x["date"], x["value"]), axis=1).tolist()
Output:
[('2021/07/15', '4862.306832'),
('2021/08/15', '3474.465543'),
('2021/09/15', '7121.260118')]
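The same list can also be built without apply; itertuples is usually faster on larger frames (a sketch over the same df):

listt = list(df.itertuples(index=False, name=None))  # plain tuples, no apply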
I have a pandas data frame with only two column names (a single row, which can also be considered the header). I want to make a dictionary out of this, with the first column being the key and the second column being the value. I already tried the to_dict() method, but it's not working as it's an empty dataframe.
Example
df = |Land|Norway|  ->  {'Land': 'Norway'}
I can change the pandas data frame to some other type and find my way around it, but this question is mostly to learn the best/different/efficient approach for this problem.
For now I have this as the solution :
dict(zip(a.iloc[0:0,0:1],a.iloc[0:0,1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list, then the list to a dictionary.
def list_to_dict(a):
    it = iter(a)
    ret_dict = dict(zip(it, it))
    return ret_dict

df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
A very manual solution:
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: You will get an error if your DataFrame does not have at least two columns.
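For example, with four hypothetical column headers the pairing looks like this:

df = pd.DataFrame(columns=['Land', 'Norway', 'City', 'Oslo'])  # hypothetical headers
pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
#      Land  City
# 0  Norway  Oslo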
I have some data in a text file that I am reading into pandas. A simplified version of the text read in is:
idx_level1|idx_level2|idx_level3|idx_level4|START_NODE|END_NODE|OtherData...
353386066294006|1142|2018-09-20T07:57:26Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:26Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:26Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:31Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:31Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:31Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:36Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:36Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:36Z|3|18260005359901|18260004567689|...
353386066736543|22|2018-04-17T07:08:23Z||||...
353386066736543|22|2018-04-17T07:08:24Z||||...
353386066736543|22|2018-04-17T07:08:25Z||||...
353386066736543|22|2018-04-17T07:08:26Z||||...
353386066736543|403|2018-07-02T16:55:07Z|1|18260004580350|18260005235340|...
353386066736543|403|2018-07-02T16:55:07Z|2|18260005235340|18260005141535|...
353386066736543|403|2018-07-02T16:55:07Z|3|18260005235340|18260005945439|...
353386066736543|403|2018-07-02T16:55:07Z|4|18260006215338|18260005235340|...
353386066736543|403|2018-07-02T16:55:07Z|5|18260004483352|18260005945439|...
353386066736543|403|2018-07-02T16:55:07Z|6|18260004283163|18260006215338|...
353386066736543|403|2018-07-02T16:55:01Z|1|18260004580350|18260005235340|...
353386066736543|403|2018-07-02T16:55:01Z|2|18260005235340|18260005141535|...
353386066736543|403|2018-07-02T16:55:01Z|3|18260005235340|18260005945439|...
353386066736543|403|2018-07-02T16:55:01Z|4|18260006215338|18260005235340|...
353386066736543|403|2018-07-02T16:55:01Z|5|18260004483352|18260005945439|...
353386066736543|403|2018-07-02T16:55:01Z|6|18260004283163|18260006215338|...
And the code I use to read in is as follows:
mydata = pd.read_csv('/myloc/my_simple_data.txt', sep='|',
                     dtype={'idx_level1': 'int',
                            'idx_level2': 'int',
                            'idx_level3': 'str',
                            'idx_level4': 'float',
                            'START_NODE': 'str',
                            'END_NODE': 'str',
                            'OtherData...': 'str'},
                     parse_dates=['idx_level3'],
                     index_col=['idx_level1', 'idx_level2', 'idx_level3', 'idx_level4'])
What I really want is a separate pandas DataFrame for each unique idx_level1 & idx_level2 combination. So in the above example there would be three DataFrames, pertaining to the idx_level1|idx_level2 values 353386066294006|1142, 353386066736543|22 & 353386066736543|403 respectively.
Is it possible to read in a text file like this and output each change in idx_level2 to a new Pandas DataFrame, maybe as part of some kind of loop? Alternatively, what would be the most efficient way to split mydata into DataFrame subsets, given that everything I have read suggests that it is inefficient to iterate through a DataFrame.
Read your dataframe as you are currently doing, then groupby and use a list comprehension:
group = mydata.groupby(level=[0,1])
dfs = [group.get_group(x) for x in group.groups]
You can then access the individual dataframes as dfs[0] and so on.
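If you already know which pair you want, you can also fetch it directly by its key (values taken from the sample data above):

sub_df = group.get_group((353386066294006, 1142))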
To specifically address your last paragraph, you could create a dict of DataFrames based on the unique values in a column, using something like:
import copy

dfs = {}
col_values = df[column].unique()
for value in col_values:
    key = 'df' + str(value)
    dfs[key] = copy.deepcopy(df)
    dfs[key] = dfs[key][df[column] == value]
    dfs[key].reset_index(inplace=True, drop=True)
where column = 'idx_level2'.
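Since boolean filtering already returns a new DataFrame, the deepcopy step can be dropped; a leaner equivalent sketch:

dfs = {'df' + str(value): df[df[column] == value].reset_index(drop=True)
       for value in df[column].unique()}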
Read the table as-is and use groupby, for instance:
data = pd.read_table('/myloc/my_simple_data.txt', sep='|')
groups = dict()
for group, subdf in data.groupby(data.columns[:2].tolist()):
    groups[group] = subdf
Now you have all the sub-DataFrames in a dictionary whose keys are tuples of the two indexers (e.g. (353386066294006, 1142)).
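A single group can then be pulled out by its key:

sub = groups[(353386066294006, 1142)]  # key taken from the sample data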
I am a newbie to Python. I am trying to iterate over the rows of individual columns of a DataFrame. I am trying to create an adjacency list using the first two columns of a DataFrame taken from CSV data (which has 3 columns).
The following is the code to iterate over the dataframe and create a dictionary for adjacency list:
df1 = pd.read_csv('person_knows_person_0_0_sample.csv', sep=',', index_col=False, skiprows=1)
src_list = list(df1.iloc[:, 0:1])
tgt_list = list(df1.iloc[:, 1:2])
adj_list = {}
for src in src_list:
    for tgt in tgt_list:
        adj_list[src] = tgt
print(src_list)
print(tgt_list)
print(adj_list)
and the following is the output I am getting:
['933']
['4139']
{'933': '4139'}
I see that I am not getting the entire list when I use the list() constructor.
Hence I am not able to loop over the entire data.
Could anyone tell me where I am going wrong?
To summarize, here is the input data:
A,B,C
933,4139,20100313073721718
933,6597069777240,20100920094243187
933,10995116284808,20110102064341955
933,32985348833579,20120907011130195
933,32985348838375,20120717080449463
1129,1242,20100202163844119
1129,2199023262543,20100331220757321
1129,6597069771886,20100724111548162
1129,6597069776731,20100804033836982
The output that I am expecting:
933: [4139,6597069777240, 10995116284808, 32985348833579, 32985348838375]
1129: [1242, 2199023262543, 6597069771886, 6597069776731]
Use groupby to create a Series of lists, and then to_dict:
# selecting columns by name
d = df1.groupby('A')['B'].apply(list).to_dict()

# selecting columns by position
d = df1.iloc[:, 1].groupby(df1.iloc[:, 0]).apply(list).to_dict()

print(d)
{933: [4139, 6597069777240, 10995116284808, 32985348833579, 32985348838375],
1129: [1242, 2199023262543, 6597069771886, 6597069776731]}
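If you prefer an explicit loop, a defaultdict builds the same adjacency list (a sketch over the same two columns):

from collections import defaultdict

adj_list = defaultdict(list)
for src, tgt in zip(df1.iloc[:, 0], df1.iloc[:, 1]):
    adj_list[src].append(tgt)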
I want to split the data in two columns of a data frame and construct new columns from it.
My data frame is,
dfc = pd.DataFrame({
    "A": ["GT:DP:RO:QR:AO:QA:GL", "GT:DP:RO:QR:AO:QA:GL", "GT:DP:RO:QR:AO:QA:GL",
          "GT:DP:GL", "GT:DP:GL"],
    "B": ["0/1:71:43:1363:28:806:-71.1191,0,-121.278",
          "0/1:71:43:1363:28:806:-71.1191,0,-121.278",
          "0/1:71:43:1363:28:806:-71.1191,0,-121.278",
          "1/1:49:-103.754,0,-3.51307",
          "1/1:49:-103.754,0,-3.51307"]
})
I want individual columns named GT, DP, RO, QR, AO, QA and GL, with the corresponding values taken from column B.
We can split the two columns using a = dfc.A.str.split(":", expand=True) and b = dfc.B.str.split(":", expand=True) to get two individual data frames. These can be merged with c = pd.merge(a, b, left_index=True, right_index=True) to get all the desired data, but not in the expected format.
Any suggestions? I think a better way might be to split both columns A and B and then create a dict column, with the values from A as keys and B as values; that column could then be converted to a data frame.
Thanks
Use an OrderedDict to preserve order: for each row, build a dict mapping the ':'-split pieces of column A to those of column B, and collect the dicts into a list.
Feed this list to the DataFrame constructor.
from collections import OrderedDict
L = dfc.apply(
    lambda x: OrderedDict(zip(x['A'].split(':'), x['B'].split(':'))),
    axis=1).tolist()
pd.DataFrame(L)
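On Python 3.7+ a plain dict preserves insertion order too, so the OrderedDict can be swapped out:

L = dfc.apply(lambda x: dict(zip(x['A'].split(':'), x['B'].split(':'))), axis=1).tolist()
pd.DataFrame(L)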
I'm going to split everything by ':'. But I have two columns, so if I stack first, I get a Series on which I can more easily use str.split.
I now have a split Series that I can group by level=0, which is the original index.
I zip and dict to get Series-like structures with the original column A entries as the keys and the B entries as the values.
unstack, and I'm done.
gb = dfc.stack().str.split(':').groupby(level=0)
gb.apply(lambda x: dict(zip(*x))).unstack()