How to replicate dataframe rows based on condition - python

I have input dataframe df1 like this:
url_asia_us_emea
https://asia/image
https://asia/location
https://asia/video
I want to replicate the rows in df1, changing the region based on the column name. Let's say for this example I need the output below, since all three regions appear in the column name:
url_asia_us_emea
https://asia/image
https://asia/location
https://asia/video
https://us/image
https://us/location
https://us/video
https://emea/image
https://emea/location
https://emea/video

You could do something like this:
list_strings = [ 'us', 'ema', 'fr']
df_orig = pd.read_csv("ack.csv", sep=";")
which is
url
0 https://asia/image
1 https://asia/location
2 https://asia/video
and then
d = []
for element in list_strings:
    df = pd.read_csv("ack.csv", sep=";")
    df['url'] = df['url'].replace({'asia': str(element)}, regex=True)
    d.append(df)
df = pd.concat(d)
DF = pd.concat([df, df_orig], ignore_index=True)  # reset the index so it runs 0..11 as shown below
which results in
index url
0 https://us/image
1 https://us/location
2 https://us/video
3 https://ema/image
4 https://ema/location
5 https://ema/video
6 https://fr/image
7 https://fr/location
8 https://fr/video
9 https://asia/image
10 https://asia/location
11 https://asia/video
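If you would rather not re-read the CSV once per region, a variation is to derive the regions from the column name itself and build the copies in one pass. This is only a sketch, assuming the column is literally named url_asia_us_emea and that 'asia' is the base region, as in the question:
import pandas as pd

df1 = pd.DataFrame({'url_asia_us_emea': ['https://asia/image',
                                         'https://asia/location',
                                         'https://asia/video']})

col = 'url_asia_us_emea'
regions = col.split('_')[1:]          # ['asia', 'us', 'emea']
out = pd.concat(
    [df1[col].str.replace('asia', region, regex=False).to_frame(col)
     for region in regions],
    ignore_index=True)
print(out)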

Related

create dataframe with outliers and then replace with nan

I am trying to write a function that spots the columns with "100" in the header and replaces the values in those columns with NaN depending on multiple criteria.
I also want the function to return the value of "first_column" for each row that contains an outlier.
For instance, let's say I have a df where I want to replace all numbers that are above 100 or below 0 with NaN values:
I start with this dataframe:
import pandas as pd
data = {'first_column': ['product_name', 'product_name2', 'product_name3'],
'second_column': ['first_value', 'second_value', 'third_value'],
'third_100':['89', '9', '589'],
'fourth_100': ['25', '1568200', '5'],
}
df = pd.DataFrame(data)
print (df)
expected output:
IIUC, you can use filter and boolean indexing:
# get "100" columns and convert to integer
df2 = df.filter(like='100').astype(int)
# identify values <0 or >100
mask = (df2.lt(0)|df2.gt(100))
# mask them
out1 = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
# get rows with at least one match
out2 = df.loc[mask.any(axis=1), ['first_column'] + list(df.filter(like='100'))]
output 1:
first_column second_column third_100 fourth_100
0 product_name first_value 89 25
1 product_name2 second_value 9 NaN
2 product_name3 third_value NaN 5
output 2:
first_column third_100 fourth_100
1 product_name2 9 1568200
2 product_name3 589 5
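Since the question asks for a function, here is one way to package the two steps above. This is only a minimal sketch, assuming the "100" columns always hold numeric strings; the function and argument names are just illustrative:
import pandas as pd

def flag_out_of_range(df, low=0, high=100):
    # select the "100" columns and compare as integers
    vals = df.filter(like='100').astype(int)
    mask = vals.lt(low) | vals.gt(high)
    # out-of-range values replaced with NaN
    masked = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
    # rows with at least one outlier, together with their first_column value
    outliers = df.loc[mask.any(axis=1), ['first_column'] + list(vals.columns)]
    return masked, outliers

out1, out2 = flag_out_of_range(df)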

How to read data from url to pandas dataframe

I'm trying to read data from https://download.bls.gov/pub/time.series/ee/ee.industry using pandas, like this:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
df = pd.read_csv(url, sep='\t')
Also tried getting the separator:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
reader = pd.read_csv(url, sep = None, iterator = True)
inferred_sep = reader._engine.data.dialect.delimiter
df = pd.read_csv(url, sep=inferred_sep)
However the data is not very well formatted; the columns of the dataframe are right:
>>> df.columns
Index(['industry_code', 'SIC_code', 'publishing_status', 'industry_name'], dtype='object')
But the data does not correspond to the columns; it seems all the data is merged into the first two columns and the last two do not have any data.
Any suggestion/idea on a better approach to fetch this data?
EDIT
The expected result should be something like:
industry_code  SIC_code  publishing_status  industry_name
000000         N/A       B                  Total nonfarm 1 T 1
The reader works well but you don’t have the right number of columns in your header. You can get the other columns back using .reset_index() and then rename the columns:
>>> df = pd.read_csv(url, sep='\t')
>>> n_missing_headers = df.index.nlevels
>>> cols = df.columns.to_list() + [f'col{n}' for n in range(n_missing_headers)]
>>> df.reset_index(inplace=True)
>>> df.columns = cols
>>> df.head()
industry_code SIC_code publishing_status industry_name col0 col1 col2
0 0 NaN B Total nonfarm 1 T 1
1 5000 NaN A Total private 1 T 2
2 5100 NaN A Goods-producing 1 T 3
3 100000 10-14 A Mining 2 T 4
4 101000 10 A Metal mining 3 T 5
You can then keep the first 4 columns if you want:
>>> df.iloc[:, :-n_missing_headers].head()
industry_code SIC_code publishing_status industry_name
0 0 NaN B Total nonfarm
1 5000 NaN A Total private
2 5100 NaN A Goods-producing
3 100000 10-14 A Mining
4 101000 10 A Metal mining
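Alternatively, you could skip the short header line and supply the column names yourself. This is only a sketch: it assumes the file has exactly seven tab-separated fields per row, and the names for the three extra columns below are placeholders, not taken from the BLS documentation:
import pandas as pd

url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
# first four names come from the file's header row; the last three are placeholders
names = ['industry_code', 'SIC_code', 'publishing_status', 'industry_name',
         'col0', 'col1', 'col2']
df = pd.read_csv(url, sep='\t', skiprows=1, header=None, names=names)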

Pandas Dataframe convert column of lists to multiple columns

I am trying to convert a dataframe that has lists of various sizes, for example something like this:
d={'A':[1,2,3],'B':[[1,2,3],[3,5],[4]]}
df = pd.DataFrame(data=d)
df
to something like this:
d1={'A':[1,2,3],'B-1':[1,0,0],'B-2':[1,0,0],'B-3':[1,1,0],'B-4':[0,0,1],'B-5':[0,1,0]}
df1 = pd.DataFrame(data=d1)
df1
Thank you for the help
explode the lists, then get_dummies and sum over the original index (use max [credit to @JonClements] if you want true dummies and not counts, in case there can be duplicates). Then join the result back:
dfB = pd.get_dummies(df['B'].explode()).sum(level=0).add_prefix('B-')
#dfB = pd.get_dummies(df['B'].explode()).max(level=0).add_prefix('B-')
df = pd.concat([df['A'], dfB], axis=1)
# A B-1 B-2 B-3 B-4 B-5
#0 1 1 1 1 0 0
#1 2 0 0 1 0 1
#2 3 0 0 0 1 0
You can use pop to remove the column you explode so you don't need to specify df[list_of_all_columns_except_B] in the concat:
df = pd.concat([df, pd.get_dummies(df.pop('B').explode()).sum(level=0).add_prefix('B-')],
               axis=1)
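Note that recent pandas versions no longer accept the level= argument in sum/max; if that is the case for you, grouping on the index level should (as far as I know) give the same result:
dfB = (pd.get_dummies(df['B'].explode())
         .groupby(level=0).sum()      # or .max() for true dummies
         .add_prefix('B-'))
df = pd.concat([df['A'], dfB], axis=1)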

reshape a pandas dataframe

suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe which looks like:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
what does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
of course I could loop over the data and make a new list of lists, but there must be a better way. Any ideas?
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables don't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that it must have an identification variable, i, unlike melt. reset_index is used to create this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape, and numpy.repeat for the id column:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len(df.index)) + 1
print (df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, but it is possible it might be removed (along with pd.wide_to_long).
A possible solution is merging all 3 functions into one - maybe melt - but that is not implemented yet. Maybe in some new version of pandas; then my answer will be updated.
I solved this in 3 steps:
Make a new dataframe df2 holding only the data you want to be added to the initial dataframe df.
Delete the data from df that will be added below (and that was used to make df2).
Append df2 to df.
Like so:
# step 1: create new dataframe
df2 = df[['A1', 'B1']]
df2.columns = ['A', 'B']
# step 2: delete that data from original
df = df.drop(["A1", "B1"], axis=1)
# step 3: append
df = df.append(df2, ignore_index=True)
Note how when you do df.append() you need to specify ignore_index=True so the new rows get appended to the index rather than keeping their old index.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])
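A variation on the same idea is to let pd.concat generate the id column for you via its keys argument; a sketch, with set_axis used only to line the column names up before concatenating:
out = (pd.concat([df[['A', 'B']],
                  df[['A1', 'B1']].set_axis(['A', 'B'], axis=1)],
                 keys=[1, 2], names=['id'])
         .reset_index(level='id')
         .reset_index(drop=True))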

How to iterate over DataFrame and generate a new DataFrame

I have a data frame that looks like this:
P Q L
1 2 3
2 3
4 5 6,7
The objective is to check if there is any value in L; if so, extract the values from the L and P columns:
P L
1 3
4 6
4 7
Note there might be more than one value in L; in the case of more than one value, I would need two rows.
Below is my current script; it cannot generate the expected result.
df2 = []
ego = None
other = None
newrow = []
for item in data_DF.iterrows():
    if item[1]["L"] is not None:
        ego = item[1]['P']
        other = item[1]['L']
        newrow = ego + other + "\n"
        df2.append(newrow)
data_DF2 = pd.DataFrame(df2)
First, you can extract all rows of the L and P columns where L is not missing like so:
df2 = df[~pd.isnull(df.L)].loc[:, ['P', 'L']].set_index('P')
Next, you can deal with the multiple values in some of the remaining L rows as follows:
df2 = df2.L.str.split(',', expand=True).stack()
df2 = df2.reset_index().drop('level_1', axis=1).rename(columns={0: 'L'}).dropna()
df2.L = df2.L.str.strip()
To explain: with P as index, the code splits the string content of the L column on ',' and distributes the individual elements across various columns. It then stacks the various new columns into a single new column, and cleans up the result.
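On pandas 0.25 or later, the same split-and-stack can also be written with explode, which may read a little more directly (a sketch using the question's column names):
out = (df.dropna(subset=['L'])
         .assign(L=lambda d: d['L'].str.split(','))
         .explode('L')[['P', 'L']]
         .reset_index(drop=True))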
First I extract the multiple values in column L to a new dataframe s, with indices duplicated from the original index. Then I remove the unnecessary columns L and Q, join the output to the original df, and drop rows with NaN values.
print(df)
P Q L
0 1 2 3
1 2 3 NaN
2 4 5 6,7
s = df['L'].str.split(',').apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'L'
print(s)
0 3
2 6
2 7
Name: L, dtype: object
df = df.drop( ['L', 'Q'], axis=1)
df = df.join(s)
print(df)
P L
0 1 3
1 2 NaN
2 4 6
2 4 7
df = df.dropna().reset_index(drop=True)
print(df)
P L
0 1 3
1 4 6
2 4 7
I was solving a similar issue when I needed to create a new dataframe as a subset of a larger dataframe. Here's how I went about generating the second dataframe:
import pandas as pd
df2 = pd.DataFrame(columns=['column1','column2'])
for i, row in df1.iterrows():
    if row['company_id'] == 12345 or row['company_id'] == 56789:
        df2 = df2.append(row, ignore_index=True)
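For a simple membership test like this, you can usually avoid the row-by-row append altogether with boolean indexing, for example:
df2 = df1[df1['company_id'].isin([12345, 56789])].copy()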
