I have some columns in my dataframe that look like:

 total
  NaN
26-27
52-53
88-89
  165
  280
  399
  611
  962
 1407
 1937

I would like to transform them into numerical values using a round-up:

 total
  NaN
   27
   53
   89
  165
  280
  399
  611
  962
 1407
 1937

clearly, pd.to_numeric() does not work as 26-27 is an object. I can do it one by one, but is there an elegant and fast way to do the transformation?

score:3

Accepted answer

In case you truly need a round up between two numbers, that could be out of order, you can do this:


df.total.str.split('-').apply(pd.Series).astype(float).max(axis=1).astype('Int64')                                                                                                    

0      NaN
1       27
2       53
3       89
4      165
5      280
6      399
7      611
8      962
9     1407
10    1937
dtype: Int64

Similar question

score:0

Yet another solution with regex:

df.total.str.replace(r"\d+\-","").astype(float)

score:2

You can separate it according to the - sign and take the last element with split. This allows you to convert the data to float and then to integer if you like.

>>> df.total.str.split('-').str[-1].astype(float)
0        NaN
1       27.0
2       53.0
3       89.0
4      165.0
5      280.0
6      399.0
7      611.0
8      962.0
9     1407.0
10    1937.0
Name: total, dtype: float64

Or if you want to cast to integer,

>>> df.total.str.split('-').str[-1].astype(float).astype('Int64')
0      NaN
1       27
2       53
3       89
4      165
5      280
6      399
7      611
8      962
9     1407
10    1937
Name: total, dtype: Int64

score:6

IIUC, we can use a little bit of regex to extract all numbers grabbing the last element before a line terminator

Before \n using $

\d+ matches a digit (equal to [0-9])

+ Quantifier — Matches between one and unlimited times, as many times as

df['total'].str.extract(r'(\d+)$').astype(float)
out:
0        NaN
1       27.0
2       53.0
3       89.0
4      165.0
5      280.0
6      399.0
7      611.0
8      962.0
9     1407.0
10    1937.0
Name: total, dtype: float64